Table of Contents (235 chapters)
- coverpage
- Title Page
- Credits
- About the Author
- About the Reviewer
- www.PacktPub.com
- Why subscribe?
- Customer Feedback
- Preface
- What this book covers
- What you need for this book
- Who this book is for
- Conventions
- Reader feedback
- Customer support
- Downloading the example code
- Downloading the color images of this book
- Errata
- Piracy
- Questions
- A First Taste and What’s New in Apache Spark V2
- Spark machine learning
- Spark Streaming
- Spark SQL
- Spark graph processing
- Extended ecosystem
- What's new in Apache Spark V2?
- Cluster design
- Cluster management
- Local
- Standalone
- Apache YARN
- Apache Mesos
- Cloud-based deployments
- Performance
- The cluster structure
- Hadoop Distributed File System
- Data locality
- Memory
- Coding
- Cloud
- Summary
- Apache Spark SQL
- The SparkSession--your gateway to structured data processing
- Importing and saving data
- Processing the text files
- Processing JSON files
- Processing the Parquet files
- Understanding the DataSource API
- Implicit schema discovery
- Predicate push-down on smart data sources
- DataFrames
- Using SQL
- Defining schemas manually
- Using SQL subqueries
- Applying SQL table joins
- Using Datasets
- The Dataset API in action
- User-defined functions
- RDDs versus DataFrames versus Datasets
- Summary
- The Catalyst Optimizer
- Understanding the workings of the Catalyst Optimizer
- Managing temporary views with the catalog API
- The SQL abstract syntax tree
- How to go from Unresolved Logical Execution Plan to Resolved Logical Execution Plan
- Internal class and object representations of LEPs
- How to optimize the Resolved Logical Execution Plan
- Physical Execution Plan generation and selection
- Code generation
- Practical examples
- Using the explain method to obtain the PEP
- How smart data sources work internally
- Summary
- Project Tungsten
- Memory management beyond the Java Virtual Machine Garbage Collector
- Understanding the UnsafeRow object
- The null bit set region
- The fixed length values region
- The variable length values region
- Understanding the BytesToBytesMap
- A practical example on memory usage and performance
- Cache-friendly layout of data in memory
- Cache eviction strategies and pre-fetching
- Code generation
- Understanding columnar storage
- Understanding whole stage code generation
- A practical example on whole stage code generation performance
- Operator fusing versus the volcano iterator model
- Summary
- Apache Spark Streaming
- Overview
- Errors and recovery
- Checkpointing
- Streaming sources
- TCP stream
- File streams
- Flume
- Kafka
- Summary
- Structured Streaming
- The concept of continuous applications
- True unification - same code same engine
- Windowing
- How streaming engines use windowing
- How Apache Spark improves windowing
- Increased performance with good old friends
- How transparent fault tolerance and exactly-once delivery guarantee is achieved
- Replayable sources can replay streams from a given offset
- Idempotent sinks prevent data duplication
- State versioning guarantees consistent results after reruns
- Example - connection to a MQTT message broker
- Controlling continuous applications
- More on stream life cycle management
- Summary
- Apache Spark MLlib
- Architecture
- The development environment
- Classification with Naive Bayes
- Theory on Classification
- Naive Bayes in practice
- Clustering with K-Means
- Theory on Clustering
- K-Means in practice
- Artificial neural networks
- ANN in practice
- Summary
- Apache SparkML
- What does the new API look like?
- The concept of pipelines
- Transformers
- String indexer
- OneHotEncoder
- VectorAssembler
- Pipelines
- Estimators
- RandomForestClassifier
- Model evaluation
- CrossValidation and hyperparameter tuning
- CrossValidation
- Hyperparameter tuning
- Winning a Kaggle competition with Apache SparkML
- Data preparation
- Feature engineering
- Testing the feature engineering pipeline
- Training the machine learning model
- Model evaluation
- CrossValidation and hyperparameter tuning
- Using the evaluator to assess the quality of the cross-validated and tuned model
- Summary
- Apache SystemML
- Why do we need just another library?
- Why on Apache Spark?
- The history of Apache SystemML
- A cost-based optimizer for machine learning algorithms
- An example - alternating least squares
- Apache SystemML architecture
- Language parsing
- High-level operators are generated
- How low-level operators are optimized on
- Performance measurements
- Apache SystemML in action
- Summary
- Deep Learning on Apache Spark with DeepLearning4j and H2O
- H2O
- Overview
- The build environment
- Architecture
- Sourcing the data
- Data quality
- Performance tuning
- Deep Learning
- Example code – income
- The example code – MNIST
- H2O Flow
- Deeplearning4j
- ND4J - high performance linear algebra for the JVM
- Deeplearning4j
- Example: an IoT real-time anomaly detector
- Mastering chaos: the Lorenz attractor model
- Deploying the test data generator
- Deploy the Node-RED IoT Starter Boilerplate to the IBM Cloud
- Deploying the test data generator flow
- Testing the test data generator
- Install the Deeplearning4j example within Eclipse
- Running the examples in Eclipse
- Run the examples in Apache Spark
- Summary
- Apache Spark GraphX
- Overview
- Graph analytics/processing with GraphX
- The raw data
- Creating a graph
- Example 1 – counting
- Example 2 – filtering
- Example 3 – PageRank
- Example 4 – triangle counting
- Example 5 – connected components
- Summary
- Apache Spark GraphFrames
- Architecture
- Graph-relational translation
- Materialized views
- Join elimination
- Join reordering
- Examples
- Example 1 – counting
- Example 2 – filtering
- Example 3 – PageRank
- Example 4 – triangle counting
- Example 5 – connected components
- Summary
- Apache Spark with Jupyter Notebooks on IBM DataScience Experience
- Why notebooks are the new standard
- Learning by example
- The IEEE PHM 2012 data challenge bearing dataset
- ETL with Scala
- Interactive exploratory analysis using Python and Pixiedust
- Real data science work with SparkR
- Summary
- Apache Spark on Kubernetes
- Bare metal virtual machines and containers
- Containerization
- Namespaces
- Control groups
- Linux containers
- Understanding the core concepts of Docker
- Understanding Kubernetes
- Using Kubernetes for provisioning containerized Spark applications
- Example--Apache Spark on Kubernetes
- Prerequisites
- Deploying the Apache Spark master
- Deploying the Apache Spark workers
- Deploying the Zeppelin notebooks
- Summary