
Introduction to Spark

"We call this the problem of big data."

Arguably, the first time big data was discussed in a context we would recognize today was in July 1997, when Michael Cox and David Ellsworth, researchers at NASA, described the problem they faced when processing enormous amounts of data with the traditional computers of that time. In the early 2000s, LexisNexis designed a proprietary system, which later went on to become the High-Performance Computing Cluster (HPCC), to address the growing need to process data on a cluster. HPCC was open sourced in 2011.

It was the era of dot-coms, and Google was pushing the limits of the internet by crawling and indexing it in its entirety. With the rate at which the internet was expanding, Google knew it would be difficult, if not impossible, to scale vertically to process data of that size. Distributed computing, though still in its infancy, caught Google's attention. The company developed not only a distributed fault-tolerant filesystem, the Google File System (GFS), but also a distributed processing engine called MapReduce. In 2003, Google released a white paper titled The Google File System by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, and shortly thereafter, in 2004, another white paper titled MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat.

Around the same time, Doug Cutting, an open source contributor, was looking for ways to build an open source search engine and, like Google, was struggling to process data at internet scale. By 1999, he had developed Lucene, a Java library with text/web search capabilities, among other things. Nutch, an open source web crawler and data indexer that Doug Cutting built along with Mike Cafarella, was not scaling well. As luck would have it, Google's white papers caught Doug Cutting's attention. He began working on similar concepts, calling them the Nutch Distributed File System (NDFS) and Nutch MapReduce. By 2005, the distributed platform had enabled Nutch to scale from indexing 100 million pages to multiple billions of pages.

However, it wasn't just Doug Cutting; Yahoo! too became interested in developing the MapReduce computing framework to serve its own processing needs. It was here that Doug Cutting refactored the distributed computing parts of Nutch into a new project and named it after his kid's toy elephant: Hadoop. By 2008, Yahoo! was using Hadoop in its production cluster to build its search index and metadata, called the web map. Despite being a direct competitor to Google, Yahoo! made one distinct strategic choice while co-developing Hadoop: it open sourced the project. And the rest, as we know, is history!

In this chapter, we will cover the following topics:

  • What is big data?
  • Why Apache Spark?
  • RDD, the first citizen of Spark
  • Spark ecosystem -- Spark SQL, Spark Streaming, MLlib, GraphX
  • What's new in Spark 2.X?