官术网_书友最值得收藏!

Data technologies

When Hadoop first started, the word Hadoop referred to the combination of HDFS and the MapReduce processing paradigm, as that was the outline of the original paper http://research.google.com/archive/mapreduce.html. Since that time, a plethora of technologies have emerged to complement Hadoop, and with the development of Apache YARN we now see other processing paradigms emerge such as Spark.

Hadoop is now often used as a colloquialism for the entire big data software stack and so it would be prudent at this point to define the scope of that stack for this book. The typical data architecture with a selection of technologies we will visit throughout the book is detailed as follows:

The relationship between these technologies is a dense topic as there are complex interdependencies, for example, Spark depends on GeoMesa, which depends on Accumulo, which depends on Zookeeper and HDFS! Therefore, in order to manage these relationships, there are platforms available, such as Cloudera or Hortonworks HDP http://hortonworks.com/products/sandbox/. These provide consolidated user interfaces and centralized configuration. The choice of platform is that of the reader, however, it is not recommended to install a few of the technologies initially and then move to a managed platform as the version problems encountered will be very complex. Therefore, it is usually easier to start with a clean machine and make a decision upfront as to which direction to take.

All of the software we use in this book is platform-agnostic and therefore fits into the general architecture described earlier. It can be installed independently and it is relatively straightforward to use with single or multiple server environment without the use of a managed product.

The role of Apache Spark

In many ways, Apache Spark is the glue that holds these components together. It increasingly represents the hub of the software stack. It integrates with a wide variety of components but none of them are hard-wired. Indeed, even the underlying storage mechanism can be swapped out. Combining this feature with the ability to leverage different processing frameworks means the original Hadoop technologies effectively become components, rather than an imposing framework. The logical diagram of our architecture appears as follows:

As Spark has gained momentum and wide-scale industry acceptance, many of the original Hadoop implementations for various components have been refactored for Spark. Thus, to add further complexity to the picture, there are often several possible ways to programmatically leverage any particular component; not least the imperative and declarative versions depending upon whether an API has been ported from the original Hadoop Java implementation. We have attempted to remain as true as possible to the Spark ethos throughout the remaining chapters.

主站蜘蛛池模板: 东安县| 吴堡县| 调兵山市| 前郭尔| 剑河县| 沅陵县| 石狮市| 康马县| 定安县| 冀州市| 海阳市| 大兴区| 盐池县| 北川| 临桂县| 南充市| 平远县| 六安市| 玛多县| 保山市| 旬阳县| 平江县| 吉木乃县| 光泽县| 新化县| 扬州市| 石渠县| 丰宁| 东阳市| 保山市| 双城市| 麻江县| 依安县| 深圳市| 扶余县| 吴江市| 淮阳县| 赤水市| 保亭| 沾益县| 那坡县|