- Hands-On Big Data Analytics with PySpark
- Rudy Lai, Bartłomiej Potaczek
What this book covers
Chapter 1, Installing PySpark and Setting Up Your Development Environment, covers the installation of PySpark and introduces core Spark concepts, including resilient distributed datasets (RDDs), the SparkContext, and Spark tools such as SparkConf and the Spark shell.
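As a taste of what Chapter 1 sets up, here is a minimal sketch of creating a SparkContext from a SparkConf and building a first RDD; the application name and the local master setting are illustrative choices, not the book's exact configuration.

```python
from pyspark import SparkConf, SparkContext

# Illustrative configuration: "local[*]" runs Spark on all local cores.
conf = SparkConf().setAppName("hello-pyspark").setMaster("local[*]")
sc = SparkContext(conf=conf)

# A first resilient distributed dataset (RDD) from a Python list.
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.count())  # 5

sc.stop()
```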
Chapter 2, Getting Your Big Data into the Spark Environment Using RDDs, explains how to get your big data into the Spark environment as RDDs, using a wide array of tools to interact with and modify this data so that useful insights can be extracted.
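For illustration, a minimal sketch of the two most common ways to get data into Spark as RDDs, parallelizing an in-memory collection and reading a text file; the file path here is hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "load-data")

# From an existing in-memory collection...
numbers = sc.parallelize(range(100))

# ...or from external storage (hypothetical path).
lines = sc.textFile("data/sales.csv")

# Transformations such as map let us reshape the raw records.
fields = lines.map(lambda line: line.split(","))
print(numbers.count(), fields.take(3))
```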
Chapter 3, Big Data Cleaning and Wrangling with Spark Notebooks, covers how to use Spark in notebook applications, thereby facilitating the effective use of RDDs.
Chapter 4, Aggregating and Summarizing Data into Useful Reports, describes how to calculate averages with the map and reduce functions, perform faster average computation, and use a pivot table with key/value pair data points.
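As a sketch of the map/reduce averaging idea, mapping each value to a (sum, count) pair and reducing pairwise computes the average in a single distributed pass, without collecting the data to the driver; the numbers are invented.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "averages")
prices = sc.parallelize([10.0, 20.0, 30.0, 40.0])  # invented values

# Map each value to a (sum, count) pair, then reduce pairwise.
total, count = prices.map(lambda x: (x, 1.0)).reduce(
    lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(total / count)  # 25.0
```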
Chapter 5, Powerful Exploratory Data Analysis with MLlib, examines Spark's ability to perform regression tasks with models including linear regression and SVMs.
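A minimal linear regression sketch using the DataFrame-based MLlib API (pyspark.ml); the three toy data points are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.master("local[*]").appName("mllib-demo").getOrCreate()

# Tiny invented dataset where label is roughly 2 * feature.
train = spark.createDataFrame(
    [(2.0, Vectors.dense(1.0)),
     (4.0, Vectors.dense(2.0)),
     (6.0, Vectors.dense(3.0))],
    ["label", "features"])

model = LinearRegression(maxIter=10).fit(train)
print(model.coefficients, model.intercept)
```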
Chapter 6, Putting Structure on Your Big Data with SparkSQL, explains how to manipulate DataFrames with Spark SQL schemas, and use the Spark DSL to build queries for structured data operations.
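A short sketch of attaching an explicit schema to rows and querying with the DataFrame DSL rather than raw SQL strings; the column names and data are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

# An explicit schema puts structure on otherwise untyped rows.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)])
people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], schema)

# The DataFrame DSL expresses the query without raw SQL strings.
people.filter(col("age") > 40).select("name").show()
```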
Chapter 7, Transformations and Actions, looks at how Spark transformations defer computation, and then considers which transformations should be avoided. We will also use the reduce and reduceByKey methods to carry out calculations from a dataset.
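A minimal sketch of deferred execution: the reduceByKey transformation only builds a lineage graph, and nothing runs until an action such as collect is called.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-demo")
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# A transformation: builds the lineage graph, executes nothing yet.
summed = pairs.reduceByKey(lambda x, y: x + y)

# An action: this is where the computation actually runs.
print(summed.collect())  # [('a', 4), ('b', 2)], partition order may vary
```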
Chapter 8, Immutable Design, explains how to use DataFrame operations for transformations and discusses why immutability matters in a highly concurrent environment.
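To illustrate the immutability point: DataFrame operations never mutate their input, they return a new DataFrame, which is what makes them safe to share across concurrent readers. A minimal sketch with invented column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").appName("immutable-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "tag"])

# withColumn returns a brand-new DataFrame; df itself is untouched.
df2 = df.withColumn("id_doubled", col("id") * 2)
print(df.columns)   # ['id', 'tag']
print(df2.columns)  # ['id', 'tag', 'id_doubled']
```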
Chapter 9, Avoid Shuffle and Reduce Operational Expenses, covers shuffling and which Spark API operations should be used. We will then test operations that cause a shuffle in Apache Spark to learn which of them should be avoided.
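A sketch of the classic comparison behind this chapter: groupByKey ships every record across the network, whereas reduceByKey pre-aggregates on each partition before shuffling, moving far less data for the same result.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "shuffle-demo")
pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)] * 1000)

# groupByKey ships every (key, value) record across the network.
by_group = pairs.groupByKey().mapValues(sum)

# reduceByKey combines values per partition first, shuffling only partial sums.
by_reduce = pairs.reduceByKey(lambda x, y: x + y)

assert sorted(by_group.collect()) == sorted(by_reduce.collect())
```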
Chapter 10, Saving Data in the Correct Format, explains how to save data in the correct format, including saving data as plain text using Spark's standard API.
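A minimal sketch of both save paths, a columnar format through the DataFrame writer and plain text through the RDD API; the output directories are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("save-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "tag"])

# Columnar, schema-preserving format (hypothetical output directory).
df.write.mode("overwrite").parquet("out/events.parquet")

# Plain text via the underlying RDD, one comma-joined string per row.
df.rdd.map(lambda row: ",".join(str(v) for v in row)).saveAsTextFile("out/events_txt")
```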
Chapter 11, Working with the Spark Key/Value API, discusses the transformations available on key/value pairs, the actions that can be applied to them, and the partitioners available for key/value data.
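A short sketch of explicit partitioning on a key/value RDD: partitionBy fixes where each key lives, and partitioner-preserving operations such as mapValues then avoid a fresh shuffle. The partition count of 4 is arbitrary.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "partition-demo")
kv = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])

# Hash-partition into 4 partitions (an arbitrary count for illustration).
partitioned = kv.partitionBy(4).cache()

# mapValues preserves the partitioner, so no re-shuffle is needed downstream.
doubled = partitioned.mapValues(lambda v: v * 2)
print(doubled.glom().collect())  # rows grouped by partition
```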
Chapter 12, Testing Apache Spark Jobs, goes into further detail about testing Apache Spark jobs in different versions of Spark.
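A minimal sketch of unit-testing a Spark job with a local SparkContext and plain unittest; the word_count function under test is hypothetical.

```python
import unittest
from pyspark import SparkContext

def word_count(rdd):
    # Hypothetical job under test: counts words in an RDD of lines.
    return (rdd.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda x, y: x + y))

class WordCountTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.sc = SparkContext("local[2]", "test")

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

    def test_counts(self):
        result = dict(word_count(self.sc.parallelize(["a b a"])).collect())
        self.assertEqual(result, {"a": 2, "b": 1})

if __name__ == "__main__":
    unittest.main()
```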
Chapter 13, Leveraging the Spark GraphX API, covers how to leverage the Spark GraphX API. We will carry out experiments with the Edge API and the Vertex API.
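One caveat worth flagging: GraphX's RDD-based Edge and Vertex APIs live on the JVM and have no direct PySpark binding, so from Python, comparable graph work is commonly done through the separate GraphFrames package. A minimal sketch assuming graphframes is installed:

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # separate package, not bundled with Spark

spark = SparkSession.builder.master("local[*]").appName("graph-demo").getOrCreate()

# Vertices need an "id" column; edges need "src" and "dst" columns.
vertices = spark.createDataFrame([("1", "Alice"), ("2", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("1", "2", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()
```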