Brand: 中圖公司
Listed: 2021-07-09 18:34:54
Publisher: Packt Publishing
The digital rights to this book are provided by 中圖公司 and licensed to Shanghai Yuewen Information Technology Co., Ltd. (上海閱文信息技術有限公司) for production and distribution.
Last updated: 2021-07-09 21:08:48
- Cover Page
- Title Page
- Copyright
- Credits
- About the Authors
- About the Reviewer
- www.PacktPub.com
- Customer Feedback
- Preface
- What this book covers
- What you need for this book
- Who this book is for
- Conventions
- Reader feedback
- Customer support
- Downloading the example code
- Downloading the color images of this book
- Errata
- Piracy
- Questions
- Getting Up and Running with Spark
- Installing and setting up Spark locally
- Spark clusters
- The Spark programming model
- SparkContext and SparkConf
- SparkSession
- The Spark shell
- Resilient Distributed Datasets
- Creating RDDs
- Spark operations
- Caching RDDs
- Broadcast variables and accumulators
- SchemaRDD
- Spark DataFrame
- The first step to a Spark program in Scala
- The first step to a Spark program in Java
- The first step to a Spark program in Python
- The first step to a Spark program in R
- SparkR DataFrames
- Getting Spark running on Amazon EC2
- Launching an EC2 Spark cluster
- Configuring and running Spark on Amazon Elastic Map Reduce
- UI in Spark
- Supported machine learning algorithms by Spark
- Benefits of using Spark ML as compared to existing libraries
- Spark Cluster on Google Compute Engine - DataProc
- Hadoop and Spark Versions
- Creating a Cluster
- Submitting a Job
- Summary
- Math for Machine Learning
- Linear algebra
- Setting up the Scala environment in IntelliJ
- Setting up the Scala environment on the Command Line
- Fields
- Real numbers
- Complex numbers
- Vectors
- Vector spaces
- Vector types
- Vectors in Breeze
- Vectors in Spark
- Vector operations
- Hyperplanes
- Vectors in machine learning
- Matrix
- Types of matrices
- Matrix in Spark
- Distributed matrix in Spark
- Matrix operations
- Determinant
- Eigenvalues and eigenvectors
- Singular value decomposition
- Matrices in machine learning
- Functions
- Function types
- Functional composition
- Hypothesis
- Gradient descent
- Prior, likelihood, and posterior
- Calculus
- Differential calculus
- Integral calculus
- Lagrange multipliers
- Plotting
- Summary
- Designing a Machine Learning System
- What is Machine Learning?
- Introducing MovieStream
- Business use cases for a machine learning system
- Personalization
- Targeted marketing and customer segmentation
- Predictive modeling and analytics
- Types of machine learning models
- The components of a data-driven machine learning system
- Data ingestion and storage
- Data cleansing and transformation
- Model training and testing loop
- Model deployment and integration
- Model monitoring and feedback
- Batch versus real time
- Data Pipeline in Apache Spark
- An architecture for a machine learning system
- Spark MLlib
- Performance improvements in Spark ML over Spark MLlib
- Comparing algorithms supported by MLlib
- Classification
- Clustering
- Regression
- MLlib supported methods and developer APIs
- Spark Integration
- MLlib vision
- MLlib versions compared
- Spark 1.6 to 2.0
- Summary
- Obtaining, Processing, and Preparing Data with Spark
- Accessing publicly available datasets
- The MovieLens 100k dataset
- Exploring and visualizing your data
- Exploring the user dataset
- Count by occupation
- Movie dataset
- Exploring the rating dataset
- Rating count bar chart
- Distribution of the number of ratings
- Processing and transforming your data
- Filling in bad or missing data
- Extracting useful features from your data
- Numerical features
- Categorical features
- Derived features
- Transforming timestamps into categorical features
- Extracting the time of day
- Text features
- Simple text feature extraction
- Sparse Vectors from Titles
- Normalizing features
- Using ML for feature normalization
- Using packages for feature extraction
- TF-IDF
- IDF
- Word2Vec
- Skip-gram model
- Standard scaler
- Summary
- Building a Recommendation Engine with Spark
- Types of recommendation models
- Content-based filtering
- Collaborative filtering
- Matrix factorization
- Explicit matrix factorization
- Implicit matrix factorization
- Basic model for matrix factorization
- Alternating least squares
- Extracting the right features from your data
- Extracting features from the MovieLens 100k dataset
- Training the recommendation model
- Training a model on the MovieLens 100k dataset
- Training a model using Implicit feedback data
- Using the recommendation model
- ALS Model recommendations
- User recommendations
- Generating movie recommendations from the MovieLens 100k dataset
- Inspecting the recommendations
- Item recommendations
- Generating similar movies for the MovieLens 100k dataset
- Inspecting the similar items
- Evaluating the performance of recommendation models
- ALS Model Evaluation
- Mean Squared Error
- Mean Average Precision at K
- Using MLlib's built-in evaluation functions
- RMSE and MSE
- MAP
- FP-Growth algorithm
- FP-Growth Basic Sample
- FP-Growth Applied to MovieLens Data
- Summary
- Building a Classification Model with Spark
- Types of classification models
- Linear models
- Logistic regression
- Multinomial logistic regression
- Visualizing the StumbleUpon dataset
- Extracting features from the Kaggle/StumbleUpon evergreen classification dataset
- StumbleUponExecutor
- Linear support vector machines
- The naive Bayes model
- Decision trees
- Ensembles of trees
- Random Forests
- Gradient-Boosted Trees
- Multilayer perceptron classifier
- Extracting the right features from your data
- Training classification models
- Training a classification model on the Kaggle/StumbleUpon evergreen classification dataset
- Using classification models
- Generating predictions for the Kaggle/StumbleUpon evergreen classification dataset
- Evaluating the performance of classification models
- Accuracy and prediction error
- Precision and recall
- ROC curve and AUC
- Improving model performance and tuning parameters
- Feature standardization
- Additional features
- Using the correct form of data
- Tuning model parameters
- Linear models
- Iterations
- Step size
- Regularization
- Decision trees
- Tuning tree depth and impurity
- The naive Bayes model
- Cross-validation
- Summary
- Building a Regression Model with Spark
- Types of regression models
- Least squares regression
- Decision trees for regression
- Evaluating the performance of regression models
- Mean Squared Error and Root Mean Squared Error
- Mean Absolute Error
- Root Mean Squared Log Error
- The R-squared coefficient
- Extracting the right features from your data
- Extracting features from the bike sharing dataset
- Training and using regression models
- BikeSharingExecutor
- Training a regression model on the bike sharing dataset
- Generalized linear regression
- Decision tree regression
- Ensembles of trees
- Random forest regression
- Gradient boosted tree regression
- Improving model performance and tuning parameters
- Transforming the target variable
- Impact of training on log-transformed targets
- Tuning model parameters
- Creating training and testing sets to evaluate parameters
- Splitting data for the decision tree
- The impact of parameter settings for linear models
- Iterations
- Step size
- L2 regularization
- L1 regularization
- Intercept
- The impact of parameter settings for the decision tree
- Tree depth
- Maximum bins
- The impact of parameter settings for the Gradient Boosted Trees
- Iterations
- MaxBins
- Summary
- Building a Clustering Model with Spark
- Types of clustering models
- k-means clustering
- Initialization methods
- Mixture models
- Hierarchical clustering
- Extracting the right features from your data
- Extracting features from the MovieLens dataset
- K-means - training a clustering model
- Training a clustering model on the MovieLens dataset
- K-means - interpreting cluster predictions on the MovieLens dataset
- Interpreting the movie clusters
- K-means - evaluating the performance of clustering models
- Internal evaluation metrics
- External evaluation metrics
- Computing performance metrics on the MovieLens dataset
- Effect of iterations on WSSSE
- Bisecting K-means
- Bisecting K-means - training a clustering model
- WSSSE and iterations
- Gaussian Mixture Model
- Clustering using GMM
- Plotting the user and item data with GMM clustering
- GMM - effect of iterations on cluster boundaries
- Summary
- Dimensionality Reduction with Spark
- Types of dimensionality reduction
- Principal components analysis
- Singular value decomposition
- Relationship with matrix factorization
- Clustering as dimensionality reduction
- Extracting the right features from your data
- Extracting features from the LFW dataset
- Exploring the face data
- Visualizing the face data
- Extracting facial images as vectors
- Loading images
- Converting to grayscale and resizing the images
- Extracting feature vectors
- Normalization
- Training a dimensionality reduction model
- Running PCA on the LFW dataset
- Visualizing the Eigenfaces
- Interpreting the Eigenfaces
- Using a dimensionality reduction model
- Projecting data using PCA on the LFW dataset
- The relationship between PCA and SVD
- Evaluating dimensionality reduction models
- Evaluating k for SVD on the LFW dataset
- Singular values
- Summary
- Advanced Text Processing with Spark
- What's so special about text data?
- Extracting the right features from your data
- Term weighting schemes
- Feature hashing
- Extracting the tf-idf features from the 20 Newsgroups dataset
- Exploring the 20 Newsgroups data
- Applying basic tokenization
- Improving our tokenization
- Removing stop words
- Excluding terms based on frequency
- A note about stemming
- Feature Hashing
- Building a tf-idf model
- Analyzing the tf-idf weightings
- Using a tf-idf model
- Document similarity with the 20 Newsgroups dataset and tf-idf features
- Training a text classifier on the 20 Newsgroups dataset using tf-idf
- Evaluating the impact of text processing
- Comparing raw features with processed tf-idf features on the 20 Newsgroups dataset
- Text classification with Spark 2.0
- Word2Vec models
- Word2Vec with Spark MLlib on the 20 Newsgroups dataset
- Word2Vec with Spark ML on the 20 Newsgroups dataset
- Summary
- Real-Time Machine Learning with Spark Streaming
- Online learning
- Stream processing
- An introduction to Spark Streaming
- Input sources
- Transformations
- Keeping track of state
- General transformations
- Actions
- Window operators
- Caching and fault tolerance with Spark Streaming
- Creating a basic streaming application
- The producer application
- Creating a basic streaming application
- Streaming analytics
- Stateful streaming
- Online learning with Spark Streaming
- Streaming regression
- A simple streaming regression program
- Creating a streaming data producer
- Creating a streaming regression model
- Streaming K-means
- Online model evaluation
- Comparing model performance with Spark Streaming
- Structured Streaming
- Summary
- Pipeline APIs for Spark ML
- Introduction to pipelines
- DataFrames
- Pipeline components
- Transformers
- Estimators
- How pipelines work
- Machine learning pipeline with an example
- StumbleUponExecutor
- Summary