Brand: 中圖公司
Listed: 2021-07-02 18:17:30
Publisher: Packt Publishing
The digital rights to this book are provided by 中圖公司, which has licensed 上海閱文信息技術有限公司 to produce and distribute this edition.
- coverpage
- Title Page
- Credits
- About the Author
- About the Reviewer
- www.PacktPub.com
- Why subscribe?
- Customer Feedback
- Preface
- What this book covers
- What you need for this book
- Who this book is for
- Conventions
- Reader feedback
- Customer support
- Downloading the example code
- Downloading the color images of this book
- Errata
- Piracy
- Questions
- A First Taste and What’s New in Apache Spark V2
- Spark machine learning
- Spark Streaming
- Spark SQL
- Spark graph processing
- Extended ecosystem
- What's new in Apache Spark V2?
- Cluster design
- Cluster management
- Local
- Standalone
- Apache YARN
- Apache Mesos
- Cloud-based deployments
- Performance
- The cluster structure
- Hadoop Distributed File System
- Data locality
- Memory
- Coding
- Cloud
- Summary
- Apache Spark SQL
- The SparkSession – your gateway to structured data processing
- Importing and saving data
- Processing the text files
- Processing JSON files
- Processing the Parquet files
- Understanding the DataSource API
- Implicit schema discovery
- Predicate push-down on smart data sources
- DataFrames
- Using SQL
- Defining schemas manually
- Using SQL subqueries
- Applying SQL table joins
- Using Datasets
- The Dataset API in action
- User-defined functions
- RDDs versus DataFrames versus Datasets
- Summary
- The Catalyst Optimizer
- Understanding the workings of the Catalyst Optimizer
- Managing temporary views with the catalog API
- The SQL abstract syntax tree
- How to go from Unresolved Logical Execution Plan to Resolved Logical Execution Plan
- Internal class and object representations of LEPs
- How to optimize the Resolved Logical Execution Plan
- Physical Execution Plan generation and selection
- Code generation
- Practical examples
- Using the explain method to obtain the PEP
- How smart data sources work internally
- Summary
- Project Tungsten
- Memory management beyond the Java Virtual Machine Garbage Collector
- Understanding the UnsafeRow object
- The null bit set region
- The fixed length values region
- The variable length values region
- Understanding the BytesToBytesMap
- A practical example on memory usage and performance
- Cache-friendly layout of data in memory
- Cache eviction strategies and pre-fetching
- Code generation
- Understanding columnar storage
- Understanding whole stage code generation
- A practical example on whole stage code generation performance
- Operator fusing versus the volcano iterator model
- Summary
- Apache Spark Streaming
- Overview
- Errors and recovery
- Checkpointing
- Streaming sources
- TCP stream
- File streams
- Flume
- Kafka
- Summary
- Structured Streaming
- The concept of continuous applications
- True unification - same code, same engine
- Windowing
- How streaming engines use windowing
- How Apache Spark improves windowing
- Increased performance with good old friends
- How transparent fault tolerance and the exactly-once delivery guarantee are achieved
- Replayable sources can replay streams from a given offset
- Idempotent sinks prevent data duplication
- State versioning guarantees consistent results after reruns
- Example - connection to an MQTT message broker
- Controlling continuous applications
- More on stream life cycle management
- Summary
- Apache Spark MLlib
- Architecture
- The development environment
- Classification with Naive Bayes
- Theory on Classification
- Naive Bayes in practice
- Clustering with K-Means
- Theory on Clustering
- K-Means in practice
- Artificial neural networks
- ANN in practice
- Summary
- Apache SparkML
- What does the new API look like?
- The concept of pipelines
- Transformers
- StringIndexer
- OneHotEncoder
- VectorAssembler
- Pipelines
- Estimators
- RandomForestClassifier
- Model evaluation
- CrossValidation and hyperparameter tuning
- CrossValidation
- Hyperparameter tuning
- Winning a Kaggle competition with Apache SparkML
- Data preparation
- Feature engineering
- Testing the feature engineering pipeline
- Training the machine learning model
- Model evaluation
- CrossValidation and hyperparameter tuning
- Using the evaluator to assess the quality of the cross-validated and tuned model
- Summary
- Apache SystemML
- Why do we need just another library?
- Why on Apache Spark?
- The history of Apache SystemML
- A cost-based optimizer for machine learning algorithms
- An example - alternating least squares
- Apache SystemML architecture
- Language parsing
- How high-level operators are generated
- How low-level operators are optimized
- Performance measurements
- Apache SystemML in action
- Summary
- Deep Learning on Apache Spark with DeepLearning4j and H2O
- H2O
- Overview
- The build environment
- Architecture
- Sourcing the data
- Data quality
- Performance tuning
- Deep Learning
- Example code – income
- The example code – MNIST
- H2O Flow
- Deeplearning4j
- ND4J - high-performance linear algebra for the JVM
- Deeplearning4j
- Example: an IoT real-time anomaly detector
- Mastering chaos: the Lorenz attractor model
- Deploying the test data generator
- Deploying the Node-RED IoT Starter Boilerplate to the IBM Cloud
- Deploying the test data generator flow
- Testing the test data generator
- Installing the Deeplearning4j example within Eclipse
- Running the examples in Eclipse
- Running the examples in Apache Spark
- Summary
- Apache Spark GraphX
- Overview
- Graph analytics/processing with GraphX
- The raw data
- Creating a graph
- Example 1 – counting
- Example 2 – filtering
- Example 3 – PageRank
- Example 4 – triangle counting
- Example 5 – connected components
- Summary
- Apache Spark GraphFrames
- Architecture
- Graph-relational translation
- Materialized views
- Join elimination
- Join reordering
- Examples
- Example 1 – counting
- Example 2 – filtering
- Example 3 – PageRank
- Example 4 – triangle counting
- Example 5 – connected components
- Summary
- Apache Spark with Jupyter Notebooks on IBM DataScience Experience
- Why notebooks are the new standard
- Learning by example
- The IEEE PHM 2012 data challenge bearing dataset
- ETL with Scala
- Interactive exploratory analysis using Python and Pixiedust
- Real data science work with SparkR
- Summary
- Apache Spark on Kubernetes
- Bare metal virtual machines and containers
- Containerization
- Namespaces
- Control groups
- Linux containers
- Understanding the core concepts of Docker
- Understanding Kubernetes
- Using Kubernetes for provisioning containerized Spark applications
- Example – Apache Spark on Kubernetes
- Prerequisites
- Deploying the Apache Spark master
- Deploying the Apache Spark workers
- Deploying the Zeppelin notebooks
- Summary

Updated: 2021-07-02 18:56:09