Hadoop MapReduce v2 Cookbook (Second Edition)
If you are a Big Data enthusiast and wish to use Hadoop v2 to solve your problems, then this book is for you. This book is for Java programmers with little to moderate knowledge of Hadoop MapReduce. This is also a one-stop reference for developers and system admins who want to quickly get up to speed with using Hadoop v2. It would be helpful to have a basic knowledge of software development using Java and a basic working knowledge of Linux.
Latest chapters
- Index
- Document classification using Mahout Naive Bayes Classifier
- Topic discovery using Latent Dirichlet Allocation (LDA)
- Clustering text data using Apache Mahout
- Creating TF and TF-IDF vectors for the text data
- Loading large datasets to an Apache HBase data store – importtsv and bulkload
Brand: 中圖公司
Listed: 2021-07-23 19:09:52
Publisher: Packt Publishing
The digital rights to this book are provided by 中圖公司, which has authorized 上海閱文信息技術(shù)有限公司 to produce and distribute this edition.

Contents
- coverpage
- Hadoop MapReduce v2 Cookbook Second Edition
- Credits
- About the Author
- Acknowledgments
- About the Author
- About the Reviewers
- www.PacktPub.com
- Support files, eBooks, discount offers, and more
- Preface
- What this book covers
- What you need for this book
- Who this book is for
- Conventions
- Reader feedback
- Customer support
- Chapter 1. Getting Started with Hadoop v2
- Introduction
- Setting up Hadoop v2 on your local machine
- Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode
- Adding a combiner step to the WordCount MapReduce program
- Setting up HDFS
- Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2
- Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
- HDFS command-line file operations
- Running the WordCount program in a distributed cluster environment
- Benchmarking HDFS using DFSIO
- Benchmarking Hadoop MapReduce using TeraSort
- Chapter 2. Cloud Deployments – Using Hadoop YARN on Cloud Environments
- Introduction
- Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce
- Saving money using Amazon EC2 Spot Instances to execute EMR job flows
- Executing a Pig script using EMR
- Executing a Hive script using EMR
- Creating an Amazon EMR job flow using the AWS Command Line Interface
- Deploying an Apache HBase cluster on Amazon EC2 using EMR
- Using EMR bootstrap actions to configure VMs for the Amazon EMR jobs
- Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
- Chapter 3. Hadoop Essentials – Configurations, Unit Tests, and Other APIs
- Introduction
- Optimizing Hadoop YARN and MapReduce configurations for cluster deployments
- Shared user Hadoop clusters – using Fair and Capacity schedulers
- Setting classpath precedence to user-provided JARs
- Speculative execution of straggling tasks
- Unit testing Hadoop MapReduce applications using MRUnit
- Integration testing Hadoop MapReduce applications using MiniYarnCluster
- Adding a new DataNode
- Decommissioning DataNodes
- Using multiple disks/volumes and limiting HDFS disk usage
- Setting the HDFS block size
- Setting the file replication factor
- Using the HDFS Java API
- Chapter 4. Developing Complex Hadoop MapReduce Applications
- Introduction
- Choosing appropriate Hadoop data types
- Implementing a custom Hadoop Writable data type
- Implementing a custom Hadoop key type
- Emitting data of different value types from a Mapper
- Choosing a suitable Hadoop InputFormat for your input data format
- Adding support for new input data formats – implementing a custom InputFormat
- Formatting the results of MapReduce computations – using Hadoop OutputFormats
- Writing multiple outputs from a MapReduce computation
- Hadoop intermediate data partitioning
- Secondary sorting – sorting Reduce input values
- Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
- Using Hadoop with legacy applications – Hadoop streaming
- Adding dependencies between MapReduce jobs
- Hadoop counters to report custom metrics
- Chapter 5. Analytics
- Introduction
- Simple analytics using MapReduce
- Performing GROUP BY using MapReduce
- Calculating frequency distributions and sorting using MapReduce
- Plotting the Hadoop MapReduce results using gnuplot
- Calculating histograms using MapReduce
- Calculating scatter plots using MapReduce
- Parsing a complex dataset with Hadoop
- Joining two datasets using MapReduce
- Chapter 6. Hadoop Ecosystem – Apache Hive
- Introduction
- Getting started with Apache Hive
- Creating databases and tables using Hive CLI
- Simple SQL-style data querying using Apache Hive
- Creating and populating Hive tables and views using Hive query results
- Utilizing different storage formats in Hive – storing table data using ORC files
- Using Hive built-in functions
- Hive batch mode – using a query file
- Performing a join with Hive
- Creating partitioned Hive tables
- Writing Hive User-defined Functions (UDF)
- HCatalog – performing Java MapReduce computations on data mapped to Hive tables
- HCatalog – writing data to Hive tables from Java MapReduce computations
- Chapter 7. Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop
- Introduction
- Getting started with Apache Pig
- Joining two datasets using Pig
- Accessing Hive table data in Pig using HCatalog
- Getting started with Apache HBase
- Data random access using Java client APIs
- Running MapReduce jobs on HBase
- Using Hive to insert data into HBase tables
- Getting started with Apache Mahout
- Running K-means with Mahout
- Importing data to HDFS from a relational database using Apache Sqoop
- Exporting data from HDFS to a relational database using Apache Sqoop
- Chapter 8. Searching and Indexing
- Introduction
- Generating an inverted index using Hadoop MapReduce
- Intradomain web crawling using Apache Nutch
- Indexing and searching web documents using Apache Solr
- Configuring Apache HBase as the backend data store for Apache Nutch
- Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
- Elasticsearch for indexing and searching
- Generating the in-links graph for crawled web pages
- Chapter 9. Classifications, Recommendations, and Finding Relationships
- Introduction
- Performing content-based recommendations
- Classification using the na?ve Bayes classifier
- Assigning advertisements to keywords using the Adwords balance algorithm
- Chapter 10. Mass Text Data Processing
- Introduction
- Data preprocessing using Hadoop streaming and Python
- De-duplicating data using Hadoop streaming
- Loading large datasets to an Apache HBase data store – importtsv and bulkload
- Creating TF and TF-IDF vectors for the text data
- Clustering text data using Apache Mahout
- Topic discovery using Latent Dirichlet Allocation (LDA)
- Document classification using Mahout Naive Bayes Classifier
- Index

Updated: 2021-07-23 20:33:18