Mastering Machine Learning with Spark 2.x
Are you a developer with a background in machine learning and statistics who is feeling limited by the current slow and “small data” machine learning tools? Then this is the book for you! In this book, you will create scalable machine learning applications to power a modern data-driven business using Spark. We assume that you already know the machine learning concepts and algorithms, have Spark up and running (whether on a cluster or locally), and have a basic knowledge of the various libraries contained in Spark.
Brand: 中圖公司
Listed: 2021-07-02 18:15:52
Publisher: Packt Publishing
Digital rights to this book are provided by 中圖公司, which has authorized 上海閱文信息技術有限公司 to produce and distribute it.
- cover
- Title Page
- Copyright
- Mastering Machine Learning with Spark 2.x
- Credits
- About the Authors
- About the Reviewer
- www.PacktPub.com
- Why subscribe?
- Customer Feedback
- Preface
- What this book covers
- What you need for this book
- Who this book is for
- Conventions
- Reader feedback
- Customer support
- Downloading the example code
- Downloading the color images of this book
- Errata
- Piracy
- Questions
- Introduction to Large-Scale Machine Learning and Spark
- Data science
- The sexiest role of the 21st century – data scientist?
- A day in the life of a data scientist
- Working with big data
- The machine learning algorithm using a distributed environment
- Splitting of data into multiple machines
- From Hadoop MapReduce to Spark
- What is Databricks?
- Inside the box
- Introducing H2O.ai
- Design of Sparkling Water
- What's the difference between H2O and Spark's MLlib?
- Data munging
- Data science - an iterative process
- Summary
- Detecting Dark Matter - The Higgs-Boson Particle
- Type I versus type II error
- Finding the Higgs-Boson particle
- The LHC and data creation
- The theory behind the Higgs-Boson
- Measuring for the Higgs-Boson
- The dataset
- Spark start and data load
- Labeled point vector
- Data caching
- Creating a training and testing set
- What about cross-validation?
- Our first model – decision tree
- Gini versus Entropy
- Next model – tree ensembles
- Random forest model
- Grid search
- Gradient boosting machine
- Last model - H2O deep learning
- Build a 3-layer DNN
- Adding more layers
- Building models and inspecting results
- Summary
- Ensemble Methods for Multi-Class Classification
- Data
- Modeling goal
- Challenges
- Machine learning workflow
- Starting Spark shell
- Exploring data
- Missing data
- Summary of missing value analysis
- Data unification
- Missing values
- Categorical values
- Final transformation
- Modelling data with Random Forest
- Building a classification model using Spark RandomForest
- Classification model evaluation
- Spark model metrics
- Building a classification model using H2O RandomForest
- Summary
- Predicting Movie Reviews Using NLP and Spark Streaming
- NLP - a brief primer
- The dataset
- Dataset preparation
- Feature extraction
- Feature extraction method – bag-of-words model
- Text tokenization
- Declaring our stopwords list
- Stemming and lemmatization
- Featurization - feature hashing
- Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme
- Let's do some (model) training!
- Spark decision tree model
- Spark Naive Bayes model
- Spark random forest model
- Spark GBM model
- Super-learner model
- Super learner
- Composing all transformations together
- Using the super-learner model
- Summary
- Word2vec for Prediction and Clustering
- Motivation of word vectors
- Word2vec explained
- What is a word vector?
- The CBOW model
- The skip-gram model
- Fun with word vectors
- Cosine similarity
- Doc2vec explained
- The distributed-memory model
- The distributed bag-of-words model
- Applying word2vec and exploring our data with vectors
- Creating document vectors
- Supervised learning task
- Summary
- Extracting Patterns from Clickstream Data
- Frequent pattern mining
- Pattern mining terminology
- Frequent pattern mining problem
- The association rule mining problem
- The sequential pattern mining problem
- Pattern mining with Spark MLlib
- Frequent pattern mining with FP-growth
- Association rule mining
- Sequential pattern mining with prefix span
- Pattern mining on MSNBC clickstream data
- Deploying a pattern mining application
- The Spark Streaming module
- Summary
- Graph Analytics with GraphX
- Basic graph theory
- Graphs
- Directed and undirected graphs
- Order and degree
- Directed acyclic graphs
- Connected components
- Trees
- Multigraphs
- Property graphs
- GraphX distributed graph processing engine
- Graph representation in GraphX
- Graph properties and operations
- Building and loading graphs
- Visualizing graphs with Gephi
- Gephi
- Creating GEXF files from GraphX graphs
- Advanced graph processing
- Aggregating messages
- Pregel
- GraphFrames
- Graph algorithms and applications
- Clustering
- Vertex importance
- GraphX in context
- Summary
- Lending Club Loan Prediction
- Motivation
- Goal
- Data
- Data dictionary
- Preparation of the environment
- Data load
- Exploration – data analysis
- Basic clean up
- Useless columns
- String columns
- Loan progress columns
- Categorical columns
- Text columns
- Missing data
- Prediction targets
- Loan status model
- Base model
- The emp_title column transformation
- The desc column transformation
- Interest Rate Model
- Using models for scoring
- Model deployment
- Stream creation
- Stream transformation
- Stream output
- Summary (updated: 2021-07-02 18:46:37)