
How to build robust ETL pipelines with AWS SQS

Scraping a large number of sites and a large amount of data can be a complicated and slow process. But it is one that can take great advantage of parallel processing, either locally with multiple processor threads, or by distributing scraping requests to multiple scrapers using a message queue system. There may also be a need for multiple steps in the process, similar to an Extract, Transform, and Load (ETL) pipeline. These pipelines can also be built easily using a message queuing architecture in conjunction with the scraping.

Using a message queuing architecture gives our pipeline two advantages:

  • Robustness
  • Scalability

Processing becomes robust because, if processing of an individual message fails, the message can be re-queued and processed again. So if a scraper fails, we can restart it without losing the request to scrape the page, or the message queue system can deliver the request to another scraper.
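Since this recipe uses AWS SQS (as named in the heading), the following is a minimal sketch of a consumer that relies on this re-delivery behavior, written with boto3. The queue URL and the scrape_page() function are assumptions made for illustration; the key point is that a message received but never deleted becomes visible again after its visibility timeout, so a crashed scraper does not lose the request.

import boto3

# Assumed values for this sketch - substitute your own queue URL
# and real scraping logic.
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/scrape-requests'

sqs = boto3.client('sqs')

def scrape_page(url):
    # placeholder for the actual scraping of the page
    print('scraping', url)

while True:
    # long-poll for up to one message at a time
    response = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
    for message in response.get('Messages', []):
        scrape_page(message['Body'])
        # delete only after successful processing; if the scraper
        # crashes before this call, SQS re-delivers the message
        # once its visibility timeout expires
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=message['ReceiptHandle'])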

It provides scalability because multiple scrapers, on the same or on different systems, can listen on the queue. Multiple messages can then be processed at the same time on different cores or, more importantly, on different systems. In a cloud-based scraper, you can scale up the number of scraper instances on demand to handle a greater load.
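To show how the queue supports scaling out, here is a minimal producer sketch, again using boto3 and assumed names: it simply enqueues one message per URL to scrape. Any number of copies of the consumer loop shown above, running on the same machine or on separate instances, can then pull from the same queue in parallel.

import boto3

# Assumed queue URL and example list of pages to scrape.
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/scrape-requests'
urls = [
    'http://example.com/page1',
    'http://example.com/page2',
    'http://example.com/page3',
]

sqs = boto3.client('sqs')

# one message per scrape request; SQS distributes them across
# however many consumers are currently polling the queue
for url in urls:
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=url)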

Common message queuing systems include Kafka, RabbitMQ, and Amazon SQS. Our example will use Amazon SQS, although Kafka and RabbitMQ are both excellent choices (we will see RabbitMQ in use later in the book). We use SQS to stay with the model of using AWS cloud-based services, as we did earlier in the chapter with S3.
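As with S3, we talk to SQS through boto3. The following is a small sketch, assuming AWS credentials are already configured, that creates (or looks up) a queue; the queue name scrape-requests is just an example chosen for illustration.

import boto3

sqs = boto3.client('sqs')

# create_queue returns the URL of an existing queue with the same
# name and attributes, so it can be called safely at startup
queue = sqs.create_queue(QueueName='scrape-requests')
print(queue['QueueUrl'])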
