Python Web Scraping Cookbook
By Michael Heydt
Updated: 2021-06-30 18:45:06
This book is ideal for Python programmers, web administrators, security professionals, and anyone who wants to perform web analytics. Familiarity with Python and a basic understanding of web scraping will help you take full advantage of this book.
Brand: 中圖公司
Listed: 2021-06-30 18:26:05
Publisher: Packt Publishing
The digital rights to this book are provided by 中圖公司, which has authorized 上海閱文信息技術有限公司 to produce and distribute it.
- Cover page
- Title Page
- Contributors
- About the author
- About the reviewers
- Packt is searching for authors like you
- Packt Upsell
- Why subscribe?
- PacktPub.com
- Preface
- Who this book is for
- What this book covers
- To get the most out of this book
- Download the example code files
- Conventions used
- Get in touch
- Reviews
- Getting Started with Scraping
- Introduction
- Setting up a Python development environment
- Getting ready
- How to do it...
- Scraping Python.org with Requests and Beautiful Soup
- Getting ready...
- How to do it...
- How it works...
- Scraping Python.org in urllib3 and Beautiful Soup
- Getting ready...
- How to do it...
- How it works
- There's more...
- Scraping Python.org with Scrapy
- Getting ready...
- How to do it...
- How it works
- Scraping Python.org with Selenium and PhantomJS
- Getting ready
- How to do it...
- How it works
- There's more...
- Data Acquisition and Extraction
- Introduction
- How to parse websites and navigate the DOM using BeautifulSoup
- Getting ready
- How to do it...
- How it works
- There's more...
- Searching the DOM with Beautiful Soup's find methods
- Getting ready
- How to do it...
- Querying the DOM with XPath and lxml
- Getting ready
- How to do it...
- How it works
- There's more...
- Querying data with XPath and CSS selectors
- Getting ready
- How to do it...
- How it works
- There's more...
- Using Scrapy selectors
- Getting ready
- How to do it...
- How it works
- There's more...
- Loading data in unicode / UTF-8
- Getting ready
- How to do it...
- How it works
- There's more...
- Processing Data
- Introduction
- Working with CSV and JSON data
- Getting ready
- How to do it
- How it works
- There's more...
- Storing data using AWS S3
- Getting ready
- How to do it
- How it works
- There's more...
- Storing data using MySQL
- Getting ready
- How to do it
- How it works
- There's more...
- Storing data using PostgreSQL
- Getting ready
- How to do it
- How it works
- There's more...
- Storing data in Elasticsearch
- Getting ready
- How to do it
- How it works
- There's more...
- How to build robust ETL pipelines with AWS SQS
- Getting ready
- How to do it - posting messages to an AWS queue
- How it works
- How to do it - reading and processing messages
- How it works
- There's more...
- Working with Images, Audio, and other Assets
- Introduction
- Downloading media content from the web
- Getting ready
- How to do it
- How it works
- There's more...
- Parsing a URL with urllib to get the filename
- Getting ready
- How to do it
- How it works
- There's more...
- Determining the type of content for a URL
- Getting ready
- How to do it
- How it works
- There's more...
- Determining the file extension from a content type
- Getting ready
- How to do it
- How it works
- There's more...
- Downloading and saving images to the local file system
- How to do it
- How it works
- There's more...
- Downloading and saving images to S3
- Getting ready
- How to do it
- How it works
- There's more...
- Generating thumbnails for images
- Getting ready
- How to do it
- How it works
- Taking a screenshot of a website
- Getting ready
- How to do it
- How it works
- Taking a screenshot of a website with an external service
- Getting ready
- How to do it
- How it works
- There's more...
- Performing OCR on an image with pytesseract
- Getting ready
- How to do it
- How it works
- There's more...
- Creating a Video Thumbnail
- Getting ready
- How to do it
- How it works
- There's more...
- Ripping an MP4 video to an MP3
- Getting ready
- How to do it
- There's more...
- Scraping - Code of Conduct
- Introduction
- Scraping legality and scraping politely
- Getting ready
- How to do it
- Respecting robots.txt
- Getting ready
- How to do it
- How it works
- There's more...
- Crawling using the sitemap
- Getting ready
- How to do it
- How it works
- There's more...
- Crawling with delays
- Getting ready
- How to do it
- How it works
- There's more...
- Using identifiable user agents
- How to do it
- How it works
- There's more...
- Setting the number of concurrent requests per domain
- How it works
- Using auto throttling
- How to do it
- How it works
- There's more...
- Using an HTTP cache for development
- How to do it
- How it works
- There's more...
- Scraping Challenges and Solutions
- Introduction
- Retrying failed page downloads
- How to do it
- How it works
- Supporting page redirects
- How to do it
- How it works
- Waiting for content to be available in Selenium
- How to do it
- How it works
- Limiting crawling to a single domain
- How to do it
- How it works
- Processing infinitely scrolling pages
- Getting ready
- How to do it
- How it works
- There's more...
- Controlling the depth of a crawl
- How to do it
- How it works
- Controlling the length of a crawl
- How to do it
- How it works
- Handling paginated websites
- Getting ready
- How to do it
- How it works
- There's more...
- Handling forms and forms-based authorization
- Getting ready
- How to do it
- How it works
- There's more...
- Handling basic authorization
- How to do it
- How it works
- There's more...
- Preventing bans by scraping via proxies
- Getting ready
- How to do it
- How it works
- Randomizing user agents
- How to do it
- Caching responses
- How to do it
- There's more...
- Text Wrangling and Analysis
- Introduction
- Installing NLTK
- How to do it
- Performing sentence splitting
- How to do it
- There's more...
- Performing tokenization
- How to do it
- Performing stemming
- How to do it
- Performing lemmatization
- How to do it
- Determining and removing stop words
- How to do it
- There's more...
- Calculating the frequency distributions of words
- How to do it
- There's more...
- Identifying and removing rare words
- How to do it
- Identifying and removing rare words
- How to do it
- Removing punctuation marks
- How to do it
- There's more...
- Piecing together n-grams
- How to do it
- There's more...
- Scraping a job listing from StackOverflow
- Getting ready
- How to do it
- There's more...
- Reading and cleaning the description in the job listing
- Getting ready
- How to do it...
- Searching, Mining and Visualizing Data
- Introduction
- Geocoding an IP address
- Getting ready
- How to do it
- How to collect IP addresses of Wikipedia edits
- Getting ready
- How to do it
- How it works
- There's more...
- Visualizing contributor location frequency on Wikipedia
- How to do it
- Creating a word cloud from a StackOverflow job listing
- Getting ready
- How to do it
- Crawling links on Wikipedia
- Getting ready
- How to do it
- How it works
- There's more...
- Visualizing page relationships on Wikipedia
- Getting ready
- How to do it
- How it works
- There's more...
- Calculating degrees of separation
- How to do it
- How it works
- There's more...
- Creating a Simple Data API
- Introduction
- Creating a REST API with Flask-RESTful
- Getting ready
- How to do it
- How it works
- There's more...
- Integrating the REST API with scraping code
- Getting ready
- How to do it
- Adding an API to find the skills for a job listing
- Getting ready
- How to do it
- Storing data in Elasticsearch as the result of a scraping request
- Getting ready
- How to do it
- How it works
- There's more...
- Checking Elasticsearch for a listing before scraping
- How to do it
- There's more...
- Creating Scraper Microservices with Docker
- Introduction
- Installing Docker
- Getting ready
- How to do it
- Installing a RabbitMQ container from Docker Hub
- Getting ready
- How to do it
- Running a Docker container (RabbitMQ)
- Getting ready
- How to do it
- There's more...
- Creating and running an Elasticsearch container
- How to do it
- Stopping/restarting a container and removing the image
- How to do it
- There's more...
- Creating a generic microservice with Nameko
- Getting ready
- How to do it
- How it works
- There's more...
- Creating a scraping microservice
- How to do it
- There's more...
- Creating a scraper container
- Getting ready
- How to do it
- How it works
- Creating an API container
- Getting ready
- How to do it
- There's more...
- Composing and running the scraper locally with docker-compose
- Getting ready
- How to do it
- There's more...
- Making the Scraper as a Service Real
- Introduction
- Creating and configuring an Elastic Cloud trial account
- How to do it
- Accessing the Elastic Cloud cluster with curl
- How to do it
- Connecting to the Elastic Cloud cluster with Python
- Getting ready
- How to do it
- There's more...
- Performing an Elasticsearch query with the Python API
- Getting ready
- How to do it
- There's more...
- Using Elasticsearch to query for jobs with specific skills
- Getting ready
- How to do it
- Modifying the API to search for jobs by skill
- How to do it
- How it works
- There's more...
- Storing configuration in the environment
- How to do it
- Creating an AWS IAM user and a key pair for ECS
- Getting ready
- How to do it
- Configuring Docker to authenticate with ECR
- Getting ready
- How to do it
- Pushing containers into ECR
- Getting ready
- How to do it
- Creating an ECS cluster
- How to do it
- Creating a task to run our containers
- Getting ready
- How to do it
- How it works
- Starting and accessing the containers in AWS
- Getting ready
- How to do it
- There's more...
- Other Books You May Enjoy
- Leave a review - let other readers know what you think