
Store

In this section, we will discuss storing data that has been collected from various sources. Consider an example of crawling reviews of organizations for sentiment analysis, where the reviews are gathered from different sites, each of which presents its data in its own format.

Traditionally, data was processed using the ETL (Extract, Transform, and Load) procedure, which gathers data from various sources, modifies it according to the requirements, and uploads it to the store for further processing or display. Tools commonly used for such scenarios included spreadsheets, relational databases, and business intelligence tools, and manual effort was sometimes part of the process as well.
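The three ETL steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sample sources (`SITE_A`, `SITE_B`), their field names, and the `reviews` table are all hypothetical, and an in-memory SQLite database stands in for whatever store is actually used.

```python
import sqlite3

# Hypothetical raw reviews; each site lays out its data differently.
SITE_A = [{"org": "Acme", "stars": 4, "text": "Good service"}]
SITE_B = [{"company": "Acme", "rating_out_of_10": 6, "body": "Average"}]


def extract():
    """Extract: yield (source_name, raw_record) pairs from every source."""
    for rec in SITE_A:
        yield "site_a", rec
    for rec in SITE_B:
        yield "site_b", rec


def transform(source, rec):
    """Transform: map each site's layout onto one schema (org, score, text).

    Scores are normalized to the range 0..1 so they are comparable.
    """
    if source == "site_a":
        return (rec["org"], rec["stars"] / 5.0, rec["text"])
    return (rec["company"], rec["rating_out_of_10"] / 10.0, rec["body"])


def load(conn, rows):
    """Load: write the normalized rows into the store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews (org TEXT, score REAL, text TEXT)"
    )
    conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    load(conn, [transform(source, rec) for source, rec in extract()])


conn = sqlite3.connect(":memory:")  # stand-in for the real store
run_etl(conn)
```

Keeping the per-site differences inside `transform` means that adding another site only requires one more mapping; the load step and everything downstream keep seeing the same schema.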

The most common storage layer in Big Data platforms is HDFS. Apache Hive, which runs on top of HDFS, provides HQL (Hive Query Language), which helps us do many analytical tasks that are traditionally done in business intelligence tools. A few other options that can be considered are Redis and MongoDB, along with Apache Spark, which is strictly a processing engine that reads from and writes to such stores rather than a store itself. Each option has its own way of working in the backend; however, many of them expose SQL or SQL-like APIs that can be used for further data analysis.
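As a rough illustration of analysis through a SQL interface, the sketch below computes an average review score per organization. It assumes a hypothetical `reviews` table with made-up rows, and it uses Python's built-in SQLite module purely as a stand-in; on a Hive-backed store, a similar `GROUP BY` query would be written in HQL instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a SQL-capable store
conn.execute("CREATE TABLE reviews (org TEXT, score REAL)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [("Acme", 0.5), ("Acme", 1.0), ("Globex", 0.25)],  # made-up sample data
)


def average_scores(conn):
    """Return (org, average score, review count) per organization."""
    query = (
        "SELECT org, AVG(score), COUNT(*) "
        "FROM reviews GROUP BY org ORDER BY org"
    )
    return conn.execute(query).fetchall()


for org, avg_score, n in average_scores(conn):
    print(f"{org}: {avg_score:.2f} over {n} review(s)")
```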

There might be cases where we need to gather real-time data and present it in real time. Such data does not need to be stored for future use; instead, real-time analytics can run on it as it arrives and produce results in response to requests.
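One way to picture this is a running aggregate that is updated as each event arrives and keeps only small counters, never the raw events. The following is a minimal sketch, assuming review events arrive as (organization, score) pairs; the class name `RunningSentiment` and the sample events are hypothetical, and a production system would typically rely on a stream-processing framework instead.

```python
from collections import defaultdict


class RunningSentiment:
    """Keeps only per-organization counters; raw events are discarded."""

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def update(self, org, score):
        """Fold one incoming event into the running totals."""
        self._count[org] += 1
        self._total[org] += score

    def average(self, org):
        """Answer a request from the current totals; None if nothing seen."""
        n = self._count.get(org, 0)
        return self._total[org] / n if n else None


stats = RunningSentiment()
for org, score in [("Acme", 1.0), ("Acme", 0.5), ("Globex", 0.25)]:
    stats.update(org, score)  # the event itself is not kept after this call
```

Memory use here grows with the number of organizations rather than the number of events, which is what makes skipping the storage step practical.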
