- Mastering Machine Learning with Spark 2.x
- Alex Tellez, Max Pumperla, Michal Malohlava
Working with big data
What happens when the dataset in question is so vast that it cannot fit into the memory of a single computer and must be distributed across a number of nodes in a large computing cluster? Can't we just take some R code, for example, and extend it to run on more than a single node? If only things were that simple! There are many reasons why scaling algorithms across multiple machines is difficult. Imagine a simple example of a file containing a list of names:
B
D
X
A
D
A
We would like to compute the number of occurrences of individual names in the file. If the file fits into a single machine, you can easily compute the number of occurrences by using a combination of the Unix tools sort and uniq:
bash> sort file | uniq -c
The output is as follows:
2 A
1 B
2 D
1 X
However, if the file is huge and distributed over multiple machines, it is necessary to adopt a slightly different computation strategy: for example, compute the number of occurrences of individual names for every part of the file that fits into memory, and then merge the partial results together. Hence, even a simple task, such as counting the occurrences of names, can become more complicated in a distributed environment.
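In Spark, this count-per-partition-and-merge strategy maps naturally onto the RDD API. The following is a minimal sketch, assuming a file called names.txt with one name per line and a standard SparkSession setup; the file path and application name are illustrative, not taken from the text:

// A minimal sketch of the distributed counting strategy described above.
// Assumptions: Spark 2.x on the classpath, a names.txt file with one name per line.
import org.apache.spark.sql.SparkSession

object NameCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NameCount")
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("names.txt")          // distributed read: one record per line
      .map(name => (name.trim, 1))    // emit a (name, 1) pair for each line
      .reduceByKey(_ + _)             // merge partial counts per name across partitions

    counts.collect().foreach { case (name, n) => println(s"$n $name") }

    spark.stop()
  }
}

Here, reduceByKey performs the merge step: each partition first combines its local (name, 1) pairs, and only the partial sums are shuffled across the network, which mirrors the count-then-merge strategy described above.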