Hands-On Big Data Analytics with PySpark
Apache Spark is an open source parallel-processing framework that has been around for quite some time now. One of the many uses of Apache Spark is for data analytics applications across clustered computers. In this book, you will not only learn how to use Spark and the Python API to create high-performance analytics with big data, but also discover techniques for testing, immunizing, and parallelizing Spark jobs. You will learn how to source data from all popular data hosting platforms, including HDFS, Hive, JSON, and S3, and deal with large datasets with PySpark to gain practical big data experience. This book will help you work on prototypes on local machines and subsequently go on to handle messy data in production and at scale. This book covers installing and setting up PySpark, RDD operations, big data cleaning and wrangling, and aggregating and summarizing data into useful reports. You will also learn how to implement some practical and proven techniques to improve certain aspects of programming and administration in Apache Spark. By the end of the book, you will be able to build big data analytical solutions using the various PySpark offerings and also optimize them effectively.
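For a taste of the workflow the early chapters build up, here is a minimal, hypothetical sketch (the app name and dataset are invented for illustration): creating a local SparkSession, parallelizing data into an RDD, and computing an average with map and reduce.

```python
# A minimal local-mode sketch of the kind of PySpark workflow the book
# teaches. The values below are made up for illustration.
from pyspark.sql import SparkSession

# Create a SparkSession running locally on all available cores.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("pyspark-sketch")  # hypothetical app name
         .getOrCreate())
sc = spark.sparkContext

# Parallelize a toy dataset into an RDD.
ages = sc.parallelize([23, 45, 31, 27, 39])

# Compute the average with map and reduce: pair each value with a count
# of 1, then sum values and counts together.
total, count = (ages.map(lambda x: (x, 1))
                    .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1])))
print(total / count)  # 33.0

spark.stop()
```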
Table of Contents (122 chapters)
- Cover Page
- Title Page
- Copyright and Credits
- Hands-On Big Data Analytics with PySpark
- About Packt
- Why subscribe?
- Packt.com
- Contributors
- About the authors
- Packt is searching for authors like you
- Preface
- Who this book is for
- What this book covers
- To get the most out of this book
- Download the example code files
- Download the color images
- Conventions used
- Get in touch
- Reviews
- Installing PySpark and Setting Up Your Development Environment
- An overview of PySpark
- Spark SQL
- Setting up Spark on Windows and PySpark
- Core concepts in Spark and PySpark
- SparkContext
- Spark shell
- SparkConf
- Summary
- Getting Your Big Data into the Spark Environment Using RDDs
- Loading data on to Spark RDDs
- The UCI machine learning repository
- Getting the data from the repository to Spark
- Getting data into Spark
- Parallelization with Spark RDDs
- What is parallelization?
- Basics of RDD operation
- Summary
- Big Data Cleaning and Wrangling with Spark Notebooks
- Using Spark Notebooks for quick iteration of ideas
- Sampling/filtering RDDs to pick out relevant data points
- Splitting datasets and creating some new combinations
- Summary
- Aggregating and Summarizing Data into Useful Reports
- Calculating averages with map and reduce
- Faster average computations with aggregate
- Pivot tabling with key-value paired data points
- Summary
- Powerful Exploratory Data Analysis with MLlib
- Computing summary statistics with MLlib
- Using Pearson and Spearman correlations to discover correlations
- The Pearson correlation
- The Spearman correlation
- Computing Pearson and Spearman correlations
- Testing our hypotheses on large datasets
- Summary
- Putting Structure on Your Big Data with SparkSQL
- Manipulating DataFrames with Spark SQL schemas
- Using Spark DSL to build queries
- Summary
- Transformations and Actions
- Using Spark transformations to defer computations to a later time
- Avoiding transformations
- Using the reduce and reduceByKey methods to calculate the results
- Performing actions that trigger computations
- Reusing the same RDD for different actions
- Summary
- Immutable Design
- Delving into the Spark RDD's parent/child chain
- Extending an RDD
- Chaining a new RDD with the parent
- Testing our custom RDD
- Using RDD in an immutable way
- Using DataFrame operations to transform
- Immutability in the highly concurrent environment
- Using the Dataset API in an immutable way
- Summary
- Avoiding Shuffle and Reducing Operational Expenses
- Detecting a shuffle in a process
- Testing operations that cause a shuffle in Apache Spark
- Changing the design of jobs with wide dependencies
- Using keyBy() operations to reduce shuffle
- Using a custom partitioner to reduce shuffle
- Summary
- Saving Data in the Correct Format
- Saving data in plain text format
- Leveraging JSON as a data format
- Tabular formats – CSV
- Using Avro with Spark
- Columnar formats – Parquet
- Summary
- Working with the Spark Key/Value API
- Available actions on key/value pairs
- Using aggregateByKey instead of groupBy()
- Actions on key/value pairs
- Available partitioners on key/value data
- Implementing a custom partitioner
- Summary
- Testing Apache Spark Jobs
- Separating logic from the Spark engine – unit testing
- Integration testing using SparkSession
- Mocking data sources using partial functions
- Using ScalaCheck for property-based testing
- Testing in different versions of Spark
- Summary
- Leveraging the Spark GraphX API
- Creating a graph from a data source
- Creating the loader component
- Revisiting the graph format
- Loading Spark from file
- Using the Vertex API
- Constructing a graph using the vertex
- Creating couple relationships
- Using the Edge API
- Constructing the graph using edge
- Calculating the degree of the vertex
- The in-degree
- The out-degree
- Calculating PageRank
- Loading and reloading data about users and followers
- Summary
- Other Books You May Enjoy
- Leave a review - let other readers know what you think