- Hands-On Big Data Analytics with PySpark
- Rudy Lai, Bartłomiej Potaczek
What this book covers
Chapter 1, Installing PySpark and Setting Up Your Development Environment, covers the installation of PySpark and introduces core Spark concepts, including resilient distributed datasets (RDDs), the SparkContext, and Spark tools such as SparkConf and the Spark shell.
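As a taste of what Chapter 1 sets up, here is a minimal sketch of creating a SparkContext from a SparkConf and building a first RDD; the application name and the local master setting are illustrative choices, not the book's exact configuration.

```python
from pyspark import SparkConf, SparkContext

# Illustrative configuration: "local[*]" runs Spark on all local cores.
conf = SparkConf().setAppName("hello-pyspark").setMaster("local[*]")
sc = SparkContext(conf=conf)

# A first resilient distributed dataset (RDD) from a Python list.
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.count())  # 5

sc.stop()
```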
Chapter 2, Getting Your Big Data into the Spark Environment Using RDDs, explains how to get your big data into the Spark environment as RDDs, using a wide array of tools to interact with and modify this data so that useful insights can be extracted.
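For illustration, a minimal sketch of the two most common ways to get data into Spark as RDDs, parallelizing an in-memory collection and reading a text file; the file path here is hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "load-data")

# From an existing in-memory collection...
numbers = sc.parallelize(range(100))

# ...or from external storage (hypothetical path).
lines = sc.textFile("data/sales.csv")

# Transformations such as map let us reshape the raw records.
fields = lines.map(lambda line: line.split(","))
print(numbers.count(), fields.take(3))
```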
Chapter 3, Big Data Cleaning and Wrangling with Spark Notebooks, covers how to use Spark in notebook applications, thereby facilitating the effective use of RDDs.
Chapter 4, Aggregating and Summarizing Data into Useful Reports, describes how to calculate averages with the map and reduce functions, perform faster average computation, and use a pivot table with key/value pair data points.
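As a sketch of the map/reduce averaging idea, mapping each value to a (sum, count) pair and reducing pairwise computes the average in a single distributed pass, without collecting the data to the driver; the numbers are invented.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "averages")
prices = sc.parallelize([10.0, 20.0, 30.0, 40.0])  # invented values

# Map each value to a (sum, count) pair, then reduce pairwise.
total, count = prices.map(lambda x: (x, 1.0)).reduce(
    lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(total / count)  # 25.0
```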
Chapter 5, Powerful Exploratory Data Analysis with MLlib, examines Spark's ability to perform regression tasks with models including linear regression and SVMs.
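A minimal linear regression sketch using the DataFrame-based MLlib API (pyspark.ml); the three toy data points are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.master("local[*]").appName("mllib-demo").getOrCreate()

# Tiny invented dataset where label is roughly 2 * feature.
train = spark.createDataFrame(
    [(2.0, Vectors.dense(1.0)),
     (4.0, Vectors.dense(2.0)),
     (6.0, Vectors.dense(3.0))],
    ["label", "features"])

model = LinearRegression(maxIter=10).fit(train)
print(model.coefficients, model.intercept)
```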
Chapter 6, Putting Structure on Your Big Data with SparkSQL, explains how to manipulate DataFrames with Spark SQL schemas, and use the Spark DSL to build queries for structured data operations.
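A short sketch of attaching an explicit schema to rows and querying with the DataFrame DSL rather than raw SQL strings; the column names and data are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

# An explicit schema puts structure on otherwise untyped rows.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)])
people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], schema)

# The DataFrame DSL expresses the query without raw SQL strings.
people.filter(col("age") > 40).select("name").show()
```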
Chapter 7, Transformations and Actions, looks at how Spark transformations defer computation, and then considers which transformations should be avoided. We will also use the reduce and reduceByKey methods to carry out calculations from a dataset.
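A minimal sketch of deferred execution: the reduceByKey transformation only builds a lineage graph, and nothing runs until an action such as collect is called.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-demo")
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# A transformation: builds the lineage graph, executes nothing yet.
summed = pairs.reduceByKey(lambda x, y: x + y)

# An action: this is where the computation actually runs.
print(summed.collect())  # [('a', 4), ('b', 2)], partition order may vary
```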
Chapter 8, Immutable Design, explains how to use DataFrame operations for transformations and discusses why immutability matters in a highly concurrent environment.
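To illustrate the immutability point: DataFrame operations never mutate their input, they return a new DataFrame, which is what makes them safe to share across concurrent readers. A minimal sketch with invented column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").appName("immutable-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "tag"])

# withColumn returns a brand-new DataFrame; df itself is untouched.
df2 = df.withColumn("id_doubled", col("id") * 2)
print(df.columns)   # ['id', 'tag']
print(df2.columns)  # ['id', 'tag', 'id_doubled']
```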
Chapter 9, Avoid Shuffle and Reduce Operational Expenses, covers shuffling and which Spark API operations should be used. We will then test operations that cause a shuffle in Apache Spark to learn which of them should be avoided.
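A sketch of the classic comparison behind this chapter: groupByKey ships every record across the network, whereas reduceByKey pre-aggregates on each partition before shuffling, moving far less data for the same result.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "shuffle-demo")
pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)] * 1000)

# groupByKey ships every (key, value) record across the network.
by_group = pairs.groupByKey().mapValues(sum)

# reduceByKey combines values per partition first, shuffling only partial sums.
by_reduce = pairs.reduceByKey(lambda x, y: x + y)

assert sorted(by_group.collect()) == sorted(by_reduce.collect())
```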
Chapter 10, Saving Data in the Correct Format, explains how to save data in the correct format, including saving data as plain text using Spark's standard API.
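A minimal sketch of both save paths, a columnar format through the DataFrame writer and plain text through the RDD API; the output directories are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("save-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "tag"])

# Columnar, schema-preserving format (hypothetical output directory).
df.write.mode("overwrite").parquet("out/events.parquet")

# Plain text via the underlying RDD, one comma-joined string per row.
df.rdd.map(lambda row: ",".join(str(v) for v in row)).saveAsTextFile("out/events_txt")
```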
Chapter 11, Working with the Spark Key/Value API, discusses the transformations available on key/value pairs, the actions that can be applied to them, and the partitioners available for key/value data.
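A short sketch of explicit partitioning on a key/value RDD: partitionBy fixes where each key lives, and partitioner-preserving operations such as mapValues then avoid a fresh shuffle. The partition count of 4 is arbitrary.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "partition-demo")
kv = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])

# Hash-partition into 4 partitions (an arbitrary count for illustration).
partitioned = kv.partitionBy(4).cache()

# mapValues preserves the partitioner, so no re-shuffle is needed downstream.
doubled = partitioned.mapValues(lambda v: v * 2)
print(doubled.glom().collect())  # rows grouped by partition
```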
Chapter 12, Testing Apache Spark Jobs, goes into further detail about testing Apache Spark jobs in different versions of Spark.
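A minimal sketch of unit-testing a Spark job with a local SparkContext and plain unittest; the word_count function under test is hypothetical.

```python
import unittest
from pyspark import SparkContext

def word_count(rdd):
    # Hypothetical job under test: counts words in an RDD of lines.
    return (rdd.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda x, y: x + y))

class WordCountTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.sc = SparkContext("local[2]", "test")

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

    def test_counts(self):
        result = dict(word_count(self.sc.parallelize(["a b a"])).collect())
        self.assertEqual(result, {"a": 2, "b": 1})

if __name__ == "__main__":
    unittest.main()
```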
Chapter 13, Leveraging the Spark GraphX API, covers how to leverage the Spark GraphX API. We will carry out experiments with the Edge API and the Vertex API.
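One caveat worth flagging: GraphX's RDD-based Edge and Vertex APIs live on the JVM and have no direct PySpark binding, so from Python, comparable graph work is commonly done through the separate GraphFrames package. A minimal sketch assuming graphframes is installed:

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # separate package, not bundled with Spark

spark = SparkSession.builder.master("local[*]").appName("graph-demo").getOrCreate()

# Vertices need an "id" column; edges need "src" and "dst" columns.
vertices = spark.createDataFrame([("1", "Alice"), ("2", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("1", "2", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()
```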