
Introduction to Large-Scale Machine Learning and Spark

"Information is the oil of the 21st century, and analytics is the combustion engine."


    --Peter Sondergaard, Gartner Research

By 2018, it is estimated that companies will spend $114 billion on big data-related projects, an increase of roughly 300% compared to 2013 (https://www.capgemini-consulting.com/resource-file-access/resource/pdf/big_data_pov_03-02-15.pdf). Much of this increase in expenditure is due to how much data is being created and how much better we are able to store such data by leveraging distributed filesystems such as the Hadoop Distributed File System (HDFS).

However, collecting the data is only half the battle; the other half involves extracting the data, transforming it, and loading it into a computation system, which leverages the power of modern computers to apply various mathematical methods in order to learn more about the data and its patterns, and to extract useful information for making relevant decisions. The entire data workflow has been boosted in the last few years, not only by increasing computation power and by easily accessible and scalable cloud services (for example, Amazon AWS, Microsoft Azure, and Heroku), but also by a number of tools and libraries that help to manage, control, and scale infrastructure and to build applications. This growth in computation power also makes it possible to process larger amounts of data and to apply algorithms that were previously impractical. Finally, various computationally expensive statistical and machine learning algorithms have started to help extract nuggets of information from data.

One of the first widely adopted big data technologies was Hadoop, which supports MapReduce computation by saving intermediate results on disk. However, it still lacks proper big data tools for information extraction. Nevertheless, Hadoop was just the beginning. With the growing size of machine memory, new in-memory computation frameworks appeared, and they also started to provide basic support for data analysis and modeling; examples include SystemML or Spark ML for Spark and FlinkML for Flink. These frameworks represent only the tip of the iceberg. There is a lot more in the big data ecosystem, and it is constantly evolving, since the ever-growing volume of data demands new big data algorithms and processing methods. For example, the Internet of Things (IoT) represents a new domain that produces huge amounts of streaming data from various sources (for example, home security systems, Amazon Echo, or vital-sign sensors) and brings not only unlimited potential to mine useful information from data, but also a demand for new kinds of data processing and modeling methods.

Nevertheless, in this chapter, we will start from the beginning and explain the following topics:

  • Basic working tasks of data scientists
  • Aspects of big data computation in a distributed environment
  • The big data ecosystem
  • Spark and its machine learning support