
The scope of design patterns in Pig

This book deals with patterns that were encountered while solving real-world, recurrent Big Data problems in an enterprise setting. The need for these patterns takes root in the evolution of Pig to solve the emerging problems of large volumes and a variety of data, and the perceived need for a pattern catalog to document their solutions.

The emerging problems of handling large volumes of data typically center on determining whether the data can be used to generate analytical insights and, if so, how to generate those insights efficiently. Imagine yourself in the shoes of a data scientist who has been given a massive volume of data that has no proper schema, is messy, and has not been documented for ages. You have been asked to integrate this with other enterprise data sources and generate spectacular analytical insights. How do you start? Would you start integrating data, fire up your favorite analytics sandbox, and begin generating results? Wouldn't it be handy to know beforehand of design patterns that can be applied systematically and sequentially in this kind of scenario to reduce errors and increase the efficiency of Big Data analytics? The design patterns discussed in this book will definitely appeal to you in this case.

Design patterns in Pig are geared to enhance your ability to take a Big Data problem and quickly apply the patterns to solve it. Successful development of Big Data solutions using Pig requires considering issues early in the development lifecycle, and these patterns help to uncover those issues. Reusing Pig design patterns helps identify and address such issues and prevents them from growing into major problems. A by-product of applying the patterns is more readable and maintainable code. These patterns also give developers a valuable communication tool: a common vocabulary for discussing problems in terms of what a pattern could solve, rather than explaining the internals of a problem in a verbose way. Design patterns for Pig are not a cookbook for success; they are rules of thumb. Reading specific cases in this book about Pig design patterns may help you recognize problems early, saving you from the exponential cost of rework later on.

The popularity of design patterns depends very much on the domain. For example, the state patterns, proxies, and facades of the Gang of Four book are very common in applications that communicate a lot with other systems. In the same way, enterprises that consume Big Data to derive analytical insights use patterns that solve data pipeline problems, since this is a very common use case. These patterns specifically elaborate the usage of Pig in data ingest, profiling, cleansing, transformation, reduction, analytics, and egress.

A few patterns discussed in Chapter 5, Data Transformation Patterns, and Chapter 6, Understanding Data Reduction Patterns, adapt existing patterns to new situations and, in the process, modify the existing pattern itself. These patterns deal with the usage of Pig in incremental data integration and the creation of quick prototypes.

These design patterns also go deeper and enable you to decide the applicability of specific language constructs of Pig for a given problem. The following questions illustrate this point better:

  • What is the recommended usage of projections to solve specific patterns?
  • In which pattern is the usage of scalar projections ideal to access aggregates?
  • For which patterns is it not recommended to use COUNT, SUM, and COUNT_STAR?
  • How can sorting be used effectively in patterns where key distributions are skewed?
  • Which patterns are related to the correct usage of spillable data types?
  • When should you avoid using multiple FLATTEN operators, which can result in a CROSS on bags?
  • What patterns depict the ideal usage of the nested FOREACH construct?
  • Which patterns should you choose for a JOIN operation when one dataset can fit into memory?
  • Which patterns should you choose for a JOIN operation when one of the joined relations has a key that dominates?
  • Which patterns should you choose for a JOIN operation when two datasets are already ordered?