Chapter 2. Understanding Hadoop Internals and Architecture

Hadoop is currently the most widely adopted Big Data platform, with a diverse ecosystem of applications and data sources that can yield forensic evidence. An Apache Software Foundation framework, Hadoop has been developed and proven in enterprise systems as a Big Data solution. Hadoop is virtually synonymous with Big Data and has become the de facto standard in the industry.

As a relatively new Big Data solution, Hadoop has seen a high adoption rate among many types of organizations and users. Developed at Yahoo! in the mid-2000s and released to the Apache Software Foundation as one of the first major open source Big Data frameworks, Hadoop is designed to enable the distributed processing of large, complex data sets across clusters of computers. Hadoop's distributed architecture and open source ecosystem of software packages make it well suited for speed, scalability, and flexibility. Hadoop's adoption by large-scale technology companies is well publicized, and many other types of organizations and users have adopted it as well, including scientific researchers, healthcare corporations, and data-driven marketing firms. Understanding how Hadoop works and how to perform forensics on it enables investigators to apply that same understanding to other Big Data solutions, such as PyTables.

Performing Big Data forensic investigations requires knowledge of Hadoop's internals and architecture. Just as knowing how the NTFS filesystem works is important for performing forensics on Windows, knowing the layers within a Hadoop solution is vital for properly identifying, collecting, and analyzing evidence in Hadoop. Moreover, Hadoop is changing rapidly: new software packages are added and Hadoop itself is updated on a regular basis. A foundational knowledge of Hadoop's architecture and how it functions will enable an investigator to perform forensics on Hadoop as it continues to expand and evolve.

With its own filesystem, databases, and application layers, Hadoop can store data (that is, evidence) in various forms and in different locations. Hadoop's multilayer architecture runs on top of the host operating system, which means evidence may need to be collected from the host operating system or from within the Hadoop ecosystem. Because evidence can reside in each of the layers, collection and analysis may need to be performed in a manner specific to each layer, as the sketch following this paragraph illustrates.
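For example, a single HDFS file is a logical object tracked by the NameNode, while its contents physically reside as block files on the local filesystems of one or more DataNodes. The following minimal Java sketch uses the standard Hadoop FileSystem API to print the DataNode hosts that store each block of a given file. The class name and command-line usage are illustrative, and it assumes a client machine with the Hadoop libraries and the cluster configuration files (core-site.xml and hdfs-site.xml) on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: maps a logical HDFS file to the DataNode hosts
// that physically store its blocks.
public class BlockLocator {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml/hdfs-site.xml from the classpath to locate the cluster
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // HDFS path to examine, passed as the first argument
        // (for example, /user/hadoop/evidence.csv - a hypothetical file)
        Path path = new Path(args[0]);
        FileStatus status = fs.getFileStatus(path);

        // Each BlockLocation covers a byte range of the file and names
        // the DataNodes holding a replica of that block
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + ", length " + block.getLength());
            for (String host : block.getHosts()) {
                System.out.println("  replica on DataNode: " + host);
            }
        }
        fs.close();
    }
}

Each DataNode listed stores its replica as an ordinary file (named blk_<id>, with a companion .meta file) under its configured data directory on the host operating system, which is why a forensic collection may need to reach both the HDFS layer and the underlying host filesystem.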

This chapter explores how Hadoop works, covering Hadoop's architecture, files, and data input/output (I/O) in detail to provide an understanding of the technical underpinnings of Hadoop. The key components of the Hadoop forensic evidence ecosystem are mapped out, along with how to locate evidence within a Hadoop solution. Finally, this chapter concludes with instructions on how to set up and run LightHadoop and Amazon Web Services, the Hadoop instances that serve as the basis for the examples used in this book. If you are interested in performing the forensic investigations yourself, follow the instructions at the end of this chapter to install LightHadoop and set up an Amazon Web Services instance; these systems are necessary to follow the examples presented throughout this book.