
  • Apache Oozie Essentials
  • Jagat Jasjit Singh

Book case study

Throughout this book, we will work through a case study that revolves around various Oozie concepts.

One of the main use cases of Hadoop is ETL data processing.

Suppose we work for a large consulting company and have won a project to set up a Big Data cluster inside the customer data center. On a high level, the requirements are to set up an environment that will satisfy the following flow:

  1. Get data from various sources into Hadoop (file-based loads and Sqoop-based loads).
  2. Preprocess them with various scripts (Pig, Hive, and MapReduce).
  3. Insert that data into Hive tables for use by analysts and data scientists.
  4. Data scientists then write machine learning models (Spark).

We will use Oozie as our scheduling system for all the preceding tasks. Since writing production Hive, Sqoop, MapReduce, Pig, and Spark code is beyond the scope of this book, I will not dive into the business logic of those scripts; I have kept them very simple.

In our architecture, one landing server sits outside the cluster as its front door. All source systems send files to us via scp, and we regularly (nightly, to keep it simple) push them to HDFS using the hadoop fs -copyFromLocal command. This script is cron-driven, with very simple business logic: run every night at 8:00 P.M. and move all the files it finds on the landing server into HDFS.
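The cron-driven push described above can be sketched as a crontab entry plus a small shell script. All paths, user names, and file names here are illustrative assumptions, not the book's actual setup:

```shell
# /etc/cron.d/landing-push (illustrative): run the push script every night at 8:00 P.M.
0 20 * * * hadoop /opt/scripts/push_to_hdfs.sh

# /opt/scripts/push_to_hdfs.sh (illustrative): move landed files into HDFS
#!/bin/bash
LANDING_DIR=/data/landing          # where source systems scp their files (assumed path)
HDFS_TARGET=/user/hadoop/incoming  # HDFS landing zone (assumed path)

for f in "$LANDING_DIR"/*; do
  [ -f "$f" ] || continue                                   # nothing arrived tonight
  hadoop fs -copyFromLocal "$f" "$HDFS_TARGET" && rm "$f"   # remove only after a successful copy
done
```

This is a configuration/CLI fragment that needs a running Hadoop cluster; the delete-after-copy guard keeps a failed copy from silently losing a file.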

The work of Oozie starts from this point:

  1. Oozie picks up the file and cleans it using a Pig script that replaces all comma (,) delimiters with pipes (|). We will write the same logic in both Pig and MapReduce.
  2. Then, push those processed files into a Hive table.
  3. For the source systems that are databases (MySQL tables), we run a nightly Sqoop import when the load on the database is light, extracting all the records generated on the previous business day.
  4. We insert the output of that too into Hive tables.
  5. Analysts and data scientists write their magical Hive scripts and Spark machine learning models against those Hive tables.
  6. We will use Oozie to schedule all of these regular tasks.
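The delimiter cleanup in step 1 can be sketched as a minimal Pig script; the input and output paths are assumptions for illustration, not the book's actual code:

```pig
-- clean_delimiters.pig (illustrative paths): comma-delimited in, pipe-delimited out
raw = LOAD '/user/hadoop/incoming' USING PigStorage(',');    -- split each record on commas
STORE raw INTO '/user/hadoop/cleaned' USING PigStorage('|'); -- re-serialize with pipe delimiters
```

Because PigStorage handles both parsing and serialization, swapping delimiters needs no transformation step at all: loading with one delimiter and storing with another rewrites every record.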
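The nightly database extract in step 3 would typically be a single Sqoop import command along these lines. The connection string, credentials, table, and column names are assumptions for illustration:

```shell
# Nightly Sqoop import of yesterday's MySQL records into HDFS (illustrative values)
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user \
  --password-file /user/hadoop/.db_password \
  --table orders \
  --where "created_date = DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)" \
  --target-dir /user/hadoop/sqoop/orders/$(date -d yesterday +%Y-%m-%d)
```

This is a CLI fragment that needs a cluster and a reachable database; the --where clause restricts the pull to the previous day's records, matching the "previous business day" requirement, and dating the target directory keeps each night's load separate.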