
Chapter 2. Understanding Hadoop Internals and Architecture

Hadoop is currently the most widely adopted Big Data platform, with a diverse ecosystem of applications and data sources that can serve as forensic evidence. A framework maintained by the Apache Software Foundation, Hadoop has been developed and proven in enterprise systems as a Big Data solution. It is virtually synonymous with Big Data and has become the de facto standard in the industry.

As a relatively new Big Data solution, Hadoop has seen a high adoption rate across many types of organizations and users. Developed at Yahoo! in the mid-2000s and released through the Apache Software Foundation as one of the first major open source Big Data frameworks, Hadoop is designed to enable the distributed processing of large, complex data sets across clusters of computers. Hadoop's distributed architecture and open source ecosystem of software packages give it speed, scalability, and flexibility. Hadoop's adoption by large-scale technology companies is well publicized, and many other kinds of organizations and users have adopted it as well, including scientific researchers, healthcare corporations, and data-driven marketing firms. Understanding how Hadoop works and how to perform forensics on it enables investigators to apply that same understanding to other Big Data solutions, such as PyTables.

Performing Big Data forensic investigations requires knowledge of Hadoop's internals and architecture. Just as knowing how the NTFS filesystem works is important for performing forensics in Windows, knowing the layers within a Hadoop solution is vital for properly identifying, collecting, and analyzing evidence in Hadoop. Moreover, Hadoop is changing rapidly: new software packages are added and updates are applied on a regular basis. A foundational knowledge of Hadoop's architecture and how it functions enables an investigator to perform forensics on Hadoop as it continues to expand and evolve.

With its own filesystem, databases, and application layers, Hadoop can store data (that is, evidence) in various forms and in different locations. Hadoop's multilayer architecture runs on top of the host operating system, which means evidence may need to be collected from the host operating system or from within the Hadoop ecosystem. Because evidence can reside in each of the layers, forensic collection and analysis may need to be performed in a manner specific to each layer.
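For example, the same evidence can appear in two different forms depending on the layer from which it is collected. The following commands are a minimal sketch contrasting the logical view of a file through the HDFS layer with the physical block files that HDFS writes to the host operating system's filesystem. The file name and the data directory path are illustrative assumptions; the actual path is set by the dfs.datanode.data.dir property and varies by distribution:

    # Logical view: the file as the HDFS layer presents it
    hdfs dfs -ls /user/forensics/sample.txt

    # Physical view: the same data stored as block files on the host OS
    # (/hadoop/hdfs/data is an assumed value of dfs.datanode.data.dir)
    ls /hadoop/hdfs/data/current/BP-*/current/finalized/subdir0/subdir0/

Collecting only at the host operating system level captures block files without the HDFS metadata needed to reassemble them into logical files, while collecting through HDFS captures the logical files but can miss residual data on the host; an investigator may therefore need evidence from both layers.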

This chapter explores how Hadoop works. The following topics are covered in detail: Hadoop's architecture, files, and data input/output (I/O). Together, these provide an understanding of the technical underpinnings of Hadoop. The key components of the Hadoop forensic evidence ecosystem are mapped out, along with how to locate evidence within a Hadoop solution. The chapter concludes with instructions on how to set up and run LightHadoop and Amazon Web Services, the Hadoop instances that serve as the basis for the examples used in this book. If you are interested in performing forensic investigations, you should follow the installation instructions at the end of this chapter; these systems are necessary to follow the examples presented throughout this book.
