
Working with big data

What happens when the dataset in question is so vast that it cannot fit into the memory of a single computer and must be distributed across a number of nodes in a large computing cluster? Can't we just rewrite some R code, for example, and extend it to run on more than a single node? If only things were that simple! There are many reasons why scaling algorithms to more machines is difficult. Imagine a simple example of a file containing a list of names:

B
D
X
A
D
A

We would like to compute the number of occurrences of each name in the file. If the file fits into the memory of a single machine, you can easily compute these counts using a combination of the Unix tools sort and uniq:

bash> sort file | uniq -c

The output is as follows:

2 A
1 B
2 D
1 X

However, if the file is huge and distributed over multiple machines, it is necessary to adopt a slightly different computation strategy: for example, compute the number of occurrences of each name for every part of the file that fits into memory and then merge the partial results together. Hence, even simple tasks, such as counting the occurrences of names, become more complicated in a distributed environment.
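To make this concrete, the following is a minimal single-machine sketch of the split-count-merge strategy using the same Unix tools (the two-line chunk size and the part_ file prefix are arbitrary choices for illustration). Each chunk is counted independently, just as it would be on a separate node, and awk then merges the partial counts:

bash> split -l 2 file part_
bash> for p in part_*; do sort "$p" | uniq -c; done | awk '{c[$2] += $1} END {for (n in c) print c[n], n}' | sort -k2

The first command splits the file into two-line chunks, the loop produces per-chunk counts, and the awk script sums the counts by name, mimicking the merge step that a distributed framework would perform across the partial results coming from many nodes.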
