- Cover page
- Title Page
- Credits
- Foreword
- About the Authors
- About the Reviewer
- www.PacktPub.com
- Why subscribe?
- Customer Feedback
- Preface
- What this book covers
- What you need for this book
- Who this book is for
- Conventions
- Reader feedback
- Customer support
- Downloading the example code
- Errata
- Piracy
- Questions
- Introduction to Spark
- Dimensions of big data
- What makes Hadoop so revolutionary?
- Defining HDFS
- NameNode
- HDFS I/O
- YARN
- Processing the flow of application submission in YARN
- Overview of MapReduce
- Why Apache Spark?
- RDD - the first citizen of Spark
- Operations on RDD
- Lazy evaluation
- Benefits of RDD
- Exploring the Spark ecosystem
- What's new in Spark 2.X?
- References
- Summary
- Revisiting Java
- Why use Java for Spark?
- Generics
- Creating your own generic type
- Interfaces
- Static method in an interface
- Default method in interface
- What if a class implements two interfaces that have default methods with the same name and signature?
- Anonymous inner classes
- Lambda expressions
- Functional interface
- Syntax of Lambda expressions
- Lexical scoping
- Method reference
- Understanding closures
- Streams
- Generating streams
- Intermediate operations
- Working with intermediate operations
- Terminal operations
- Working with terminal operations
- String collectors
- Collection collectors
- Map collectors
- Groupings
- Partitioning
- Matching
- Finding elements
- Summary
- Let Us Spark
- Getting started with Spark
- Spark REPL, also known as CLI
- Some basic exercises using Spark shell
- Checking Spark version
- Creating and filtering RDD
- Word count on RDD
- Finding the sum of all even numbers in an RDD of integers
- Counting the number of words in a file
- Spark components
- Spark Driver Web UI
- Jobs
- Stages
- Storage
- Environment
- Executors
- SQL
- Streaming
- Spark job configuration and submission
- Spark REST APIs
- Summary
- Understanding the Spark Programming Model
- Hello Spark
- Prerequisites
- Common RDD transformations
- Map
- Filter
- flatMap
- mapToPair
- flatMapToPair
- union
- Intersection
- Distinct
- Cartesian
- groupByKey
- reduceByKey
- sortByKey
- Join
- CoGroup
- Common RDD actions
- isEmpty
- collect
- collectAsMap
- count
- countByKey
- countByValue
- Max
- Min
- First
- Take
- takeOrdered
- takeSample
- top
- reduce
- Fold
- aggregate
- forEach
- saveAsTextFile
- saveAsObjectFile
- RDD persistence and cache
- Summary
- Working with Data and Storage
- Interaction with external storage systems
- Interaction with local filesystem
- Interaction with Amazon S3
- Interaction with HDFS
- Interaction with Cassandra
- Working with different data formats
- Plain and specially formatted text
- Working with CSV data
- Working with JSON data
- Working with XML data
- References
- Summary
- Spark on Cluster
- Spark application in distributed-mode
- Driver program
- Executor program
- Cluster managers
- Spark standalone
- Installation of Spark standalone cluster
- Start master
- Start slave
- Stop master and slaves
- Deploying applications on Spark standalone cluster
- Client mode
- Cluster mode
- Useful job configurations
- Useful cluster level configurations (Spark standalone)
- Yet Another Resource Negotiator (YARN)
- YARN client
- YARN cluster
- Useful job configurations
- Summary
- Spark Programming Model - Advanced
- RDD partitioning
- Repartitioning
- How Spark calculates the partition count for transformations with shuffling (wide transformations)
- Partitioner
- Hash Partitioner
- Range Partitioner
- Custom Partitioner
- Advanced transformations
- mapPartitions
- mapPartitionsWithIndex
- mapPartitionsToPair
- mapValues
- flatMapValues
- repartitionAndSortWithinPartitions
- coalesce
- foldByKey
- aggregateByKey
- combineByKey
- Advanced actions
- Approximate actions
- Asynchronous actions
- Miscellaneous actions
- Shared variable
- Broadcast variable
- Properties of the broadcast variable
- Lifecycle of a broadcast variable
- Map-side join using broadcast variable
- Accumulators
- Driver program
- Summary
- Working with Spark SQL
- SQLContext and HiveContext
- Initializing SparkSession
- Reading CSV using SparkSession
- Dataframe and dataset
- SchemaRDD
- Dataframe
- Dataset
- Creating a dataset using encoders
- Creating a dataset using StructType
- Unified dataframe and dataset API
- Data persistence
- Spark SQL operations
- Untyped dataset operation
- Temporary view
- Global temporary view
- Spark UDF
- Spark UDAF
- Untyped UDAF
- Type-safe UDAF
- Hive integration
- Table persistence
- Summary
- Near Real-Time Processing with Spark Streaming
- Introducing Spark Streaming
- Understanding micro batching
- Getting started with Spark Streaming jobs
- Streaming sources
- fileStream
- Kafka
- Streaming transformations
- Stateless transformation
- Stateful transformation
- Checkpointing
- Windowing
- Transform operation
- Fault tolerance and reliability
- Data receiver stage
- File streams
- Advanced streaming sources
- Transformation stage
- Output stage
- Structured Streaming
- Recap of the use case
- Structured streaming - programming model
- Built-in input sources and sinks
- Input sources
- Built-in sinks
- Summary
- Machine Learning Analytics with Spark MLlib
- Introduction to machine learning
- Concepts of machine learning
- Datatypes
- Machine learning workflow
- Pipelines
- Operations on feature vectors
- Feature extractors
- Feature transformers
- Feature selectors
- Summary
- Learning Spark GraphX
- Introduction to GraphX
- Introduction to Property Graph
- Getting started with the GraphX API
- Using vertex and edge RDDs
- From edges
- EdgeTriplet
- Graph operations
- mapVertices
- mapEdges
- mapTriplets
- reverse
- subgraph
- aggregateMessages
- outerJoinVertices
- Graph algorithms
- PageRank
- Static PageRank
- Dynamic PageRank
- Triangle counting
- Connected components
- Summary