Apache Spark 2.x for Java Developers

Updated: 2021-07-02 19:02:35

If you are a Java developer interested in learning to use the popular Apache Spark framework, this book is the resource you need to get started. Apache Spark developers who are looking to build enterprise-grade applications in Java will also find this book very useful.

Table of Contents (274 chapters)

Cover Page
Title Page
Credits
Foreword
About the Authors
About the Reviewer
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
Introduction to Spark
Dimensions of big data
What makes Hadoop so revolutionary?
Defining HDFS
NameNode
HDFS I/O
YARN
Processing the flow of application submission in YARN
Overview of MapReduce
Why Apache Spark?
RDD - the first citizen of Spark
Operations on RDD
Lazy evaluation
Benefits of RDD
Exploring the Spark ecosystem
What's new in Spark 2.X?
References
Summary
Revisiting Java
Why use Java for Spark?
Generics
Creating your own generic type
Interfaces
Static method in an interface
Default method in interface
What if a class implements two interfaces which have default methods with same name and signature?
Anonymous inner classes
Lambda expressions
Functional interface
Syntax of Lambda expressions
Lexical scoping
Method reference
Understanding closures
Streams
Generating streams
Intermediate operations
Working with intermediate operations
Terminal operations
Working with terminal operations
String collectors
Collection collectors
Map collectors
Groupings
Partitioning
Matching
Finding elements
Summary
Let Us Spark
Getting started with Spark
Spark REPL also known as CLI
Some basic exercises using Spark shell
Checking Spark version
Creating and filtering RDD
Word count on RDD
Finding the sum of all even numbers in an RDD of integers
Counting the number of words in a file
Spark components
Spark Driver Web UI
Jobs
Stages
Storage
Environment
Executors
SQL
Streaming
Spark job configuration and submission
Spark REST APIs
Summary
Understanding the Spark Programming Model
Hello Spark
Prerequisites
Common RDD transformations
Map
Filter
flatMap
mapToPair
flatMapToPair
union
Intersection
Distinct
Cartesian
groupByKey
reduceByKey
sortByKey
Join
CoGroup
Common RDD actions
isEmpty
collect
collectAsMap
count
countByKey
countByValue
Max
Min
First
Take
takeOrdered
takeSample
top
reduce
Fold
aggregate
forEach
saveAsTextFile
saveAsObjectFile
RDD persistence and cache
Summary
Working with Data and Storage
Interaction with external storage systems
Interaction with local filesystem
Interaction with Amazon S3
Interaction with HDFS
Interaction with Cassandra
Working with different data formats
Plain and specially formatted text
Working with CSV data
Working with JSON data
Working with XML Data
References
Summary
Spark on Cluster
Spark application in distributed-mode
Driver program
Executor program
Cluster managers
Spark standalone
Installation of Spark standalone cluster
Start master
Start slave
Stop master and slaves
Deploying applications on Spark standalone cluster
Client mode
Cluster mode
Useful job configurations
Useful cluster level configurations (Spark standalone)
Yet Another Resource Negotiator (YARN)
YARN client
YARN cluster
Useful job configuration
Summary
Spark Programming Model - Advanced
RDD partitioning
Repartitioning
How Spark calculates the partition count for transformations with shuffling (wide transformations)
Partitioner
Hash Partitioner
Range Partitioner
Custom Partitioner
Advanced transformations
mapPartitions
mapPartitionsWithIndex
mapPartitionsToPair
mapValues
flatMapValues
repartitionAndSortWithinPartitions
coalesce
foldByKey
aggregateByKey
combineByKey
Advanced actions
Approximate actions
Asynchronous actions
Miscellaneous actions
Shared variable
Broadcast variable
Properties of the broadcast variable
Lifecycle of a broadcast variable
Map-side join using broadcast variable
Accumulators
Driver program
Summary
Working with Spark SQL
SQLContext and HiveContext
Initializing SparkSession
Reading CSV using SparkSession
Dataframe and dataset
SchemaRDD
Dataframe
Dataset
Creating a dataset using encoders
Creating a dataset using StructType
Unified dataframe and dataset API
Data persistence
Spark SQL operations
Untyped dataset operation
Temporary view
Global temporary view
Spark UDF
Spark UDAF
Untyped UDAF
Type-safe UDAF
Hive integration
Table Persistence
Summary
Near Real-Time Processing with Spark Streaming
Introducing Spark Streaming
Understanding micro batching
Getting started with Spark Streaming jobs
Streaming sources
fileStream
Kafka
Streaming transformations
Stateless transformation
Stateful transformation
Checkpointing
Windowing
Transform operation
Fault tolerance and reliability
Data receiver stage
File streams
Advanced streaming sources
Transformation stage
Output stage
Structured Streaming
Recap of the use case
Structured streaming - programming model
Built-in input sources and sinks
Input sources
Built-in Sinks
Summary
Machine Learning Analytics with Spark MLlib
Introduction to machine learning
Concepts of machine learning
Datatypes
Machine learning work flow
Pipelines
Operations on feature vectors
Feature extractors
Feature transformers
Feature selectors
Summary
Learning Spark GraphX
Introduction to GraphX
Introduction to Property Graph
Getting started with the GraphX API
Using vertex and edge RDDs
From edges
EdgeTriplet
Graph operations
mapVertices
mapEdges
mapTriplets
reverse
subgraph
aggregateMessages
outerJoinVertices
Graph algorithms
PageRank
Static PageRank
Dynamic PageRank
Triangle counting
Connected components
Summary
