
Working with big data

What happens when the dataset in question is so vast that it cannot fit into the memory of a single computer and must be distributed across a number of nodes in a large computing cluster? Can't we just rewrite some R code, for example, and extend it to run on more than a single node? If only things were that simple! There are many reasons why scaling algorithms to more machines is difficult. Imagine a simple example of a file containing a list of names:

B
D
X
A
D
A

We would like to compute the number of occurrences of each individual name in the file. If the file fits into a single machine, you can easily compute the counts by using a combination of the Unix tools sort and uniq:

bash> sort file | uniq -c

The output is as follows:

2 A
1 B
2 D
1 X

However, if the file is huge and distributed over multiple machines, it is necessary to adopt a slightly different computation strategy: for example, compute the number of occurrences of individual names for every part of the file that fits into memory, and then merge the partial results together, as shown in the sketch below. Hence, even simple tasks, such as counting the occurrences of names, can become more complicated in a distributed environment.
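
The following is a minimal, single-machine sketch of that split-and-merge strategy, written in Python; the chunks list, the choice of Python, and the Counter-based merge are illustrative assumptions rather than part of the original example, and each chunk simply stands in for a part of the file processed on a separate node:

from collections import Counter

# Hypothetical chunks: each list stands in for a part of the file that
# fits into the memory of a single node (illustrative data only).
chunks = [
    ["B", "D", "X"],   # part processed on node 1
    ["A", "D", "A"],   # part processed on node 2
]

# Local phase: every node counts the names in its own part of the file.
partial_counts = [Counter(chunk) for chunk in chunks]

# Merge phase: the partial counts are combined into one global result.
total = Counter()
for partial in partial_counts:
    total.update(partial)

# Print the counts in sorted order: 2 A, 1 B, 2 D, 1 X
for name, count in sorted(total.items()):
    print(count, name)

The two phases mirror what a distributed framework handles for you: the local counts never need the whole file in memory, and the merge step only touches the much smaller per-part summaries.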
