Table of Contents (235 chapters)
- coverpage
- Title Page
- Credits
- About the Author
- About the Reviewer
- www.PacktPub.com
- Why subscribe?
- Customer Feedback
- Preface
- What this book covers
- What you need for this book
- Who this book is for
- Conventions
- Reader feedback
- Customer support
- Downloading the example code
- Downloading the color images of this book
- Errata
- Piracy
- Questions
- A First Taste and What’s New in Apache Spark V2
- Spark machine learning
- Spark Streaming
- Spark SQL
- Spark graph processing
- Extended ecosystem
- What's new in Apache Spark V2?
- Cluster design
- Cluster management
- Local
- Standalone
- Apache YARN
- Apache Mesos
- Cloud-based deployments
- Performance
- The cluster structure
- Hadoop Distributed File System
- Data locality
- Memory
- Coding
- Cloud
- Summary
- Apache Spark SQL
- The SparkSession--your gateway to structured data processing
- Importing and saving data
- Processing the text files
- Processing JSON files
- Processing the Parquet files
- Understanding the DataSource API
- Implicit schema discovery
- Predicate push-down on smart data sources
- DataFrames
- Using SQL
- Defining schemas manually
- Using SQL subqueries
- Applying SQL table joins
- Using Datasets
- The Dataset API in action
- User-defined functions
- RDDs versus DataFrames versus Datasets
- Summary
- The Catalyst Optimizer
- Understanding the workings of the Catalyst Optimizer
- Managing temporary views with the catalog API
- The SQL abstract syntax tree
- How to go from Unresolved Logical Execution Plan to Resolved Logical Execution Plan
- Internal class and object representations of LEPs
- How to optimize the Resolved Logical Execution Plan
- Physical Execution Plan generation and selection
- Code generation
- Practical examples
- Using the explain method to obtain the PEP
- How smart data sources work internally
- Summary
- Project Tungsten
- Memory management beyond the Java Virtual Machine Garbage Collector
- Understanding the UnsafeRow object
- The null bit set region
- The fixed length values region
- The variable length values region
- Understanding the BytesToBytesMap
- A practical example on memory usage and performance
- Cache-friendly layout of data in memory
- Cache eviction strategies and pre-fetching
- Code generation
- Understanding columnar storage
- Understanding whole stage code generation
- A practical example on whole stage code generation performance
- Operator fusing versus the volcano iterator model
- Summary
- Apache Spark Streaming
- Overview
- Errors and recovery
- Checkpointing
- Streaming sources
- TCP stream
- File streams
- Flume
- Kafka
- Summary
- Structured Streaming
- The concept of continuous applications
- True unification - same code same engine
- Windowing
- How streaming engines use windowing
- How Apache Spark improves windowing
- Increased performance with good old friends
- How transparent fault tolerance and exactly-once delivery guarantee is achieved
- Replayable sources can replay streams from a given offset
- Idempotent sinks prevent data duplication
- State versioning guarantees consistent results after reruns
- Example - connection to a MQTT message broker
- Controlling continuous applications
- More on stream life cycle management
- Summary
- Apache Spark MLlib
- Architecture
- The development environment
- Classification with Naive Bayes
- Theory on Classification
- Naive Bayes in practice
- Clustering with K-Means
- Theory on Clustering
- K-Means in practice
- Artificial neural networks
- ANN in practice
- Summary
- Apache SparkML
- What does the new API look like?
- The concept of pipelines
- Transformers
- String indexer
- OneHotEncoder
- VectorAssembler
- Pipelines
- Estimators
- RandomForestClassifier
- Model evaluation
- CrossValidation and hyperparameter tuning
- CrossValidation
- Hyperparameter tuning
- Winning a Kaggle competition with Apache SparkML
- Data preparation
- Feature engineering
- Testing the feature engineering pipeline
- Training the machine learning model
- Model evaluation
- CrossValidation and hyperparameter tuning
- Using the evaluator to assess the quality of the cross-validated and tuned model
- Summary
- Apache SystemML
- Why do we need just another library?
- Why on Apache Spark?
- The history of Apache SystemML
- A cost-based optimizer for machine learning algorithms
- An example - alternating least squares
- Apache SystemML architecture
- Language parsing
- High-level operators are generated
- How low-level operators are optimized on
- Performance measurements
- Apache SystemML in action
- Summary
- Deep Learning on Apache Spark with DeepLearning4j and H2O
- H2O
- Overview
- The build environment
- Architecture
- Sourcing the data
- Data quality
- Performance tuning
- Deep Learning
- Example code – income
- The example code – MNIST
- H2O Flow
- Deeplearning4j
- ND4J - high performance linear algebra for the JVM
- Deeplearning4j
- Example: an IoT real-time anomaly detector
- Mastering chaos: the Lorenz attractor model
- Deploying the test data generator
- Deploy the Node-RED IoT Starter Boilerplate to the IBM Cloud
- Deploying the test data generator flow
- Testing the test data generator
- Install the Deeplearning4j example within Eclipse
- Running the examples in Eclipse
- Run the examples in Apache Spark
- Summary
- Apache Spark GraphX
- Overview
- Graph analytics/processing with GraphX
- The raw data
- Creating a graph
- Example 1 – counting
- Example 2 – filtering
- Example 3 – PageRank
- Example 4 – triangle counting
- Example 5 – connected components
- Summary
- Apache Spark GraphFrames
- Architecture
- Graph-relational translation
- Materialized views
- Join elimination
- Join reordering
- Examples
- Example 1 – counting
- Example 2 – filtering
- Example 3 – PageRank
- Example 4 – triangle counting
- Example 5 – connected components
- Summary
- Apache Spark with Jupyter Notebooks on IBM DataScience Experience
- Why notebooks are the new standard
- Learning by example
- The IEEE PHM 2012 data challenge bearing dataset
- ETL with Scala
- Interactive exploratory analysis using Python and Pixiedust
- Real data science work with SparkR
- Summary
- Apache Spark on Kubernetes
- Bare metal virtual machines and containers
- Containerization
- Namespaces
- Control groups
- Linux containers
- Understanding the core concepts of Docker
- Understanding Kubernetes
- Using Kubernetes for provisioning containerized Spark applications
- Example--Apache Spark on Kubernetes
- Prerequisites
- Deploying the Apache Spark master
- Deploying the Apache Spark workers
- Deploying the Zeppelin notebooks
- Summary