Store

In this section, we discuss how to store data that has been collected from various sources. Consider, for example, crawling reviews of organizations for sentiment analysis, where the reviews are gathered from different sites and each site displays its data in its own way.

Traditionally, data was processed using the ETL (Extract, Transform, and Load) procedure, which gathers data from various sources, modifies it according to the requirements, and uploads it to the store for further processing or display. The tools commonly used in such scenarios were spreadsheets, relational databases, business intelligence tools, and so on; sometimes manual effort was also part of the process.
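The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two review sites, their field names (`org`, `stars`, `company`, `rating_10`), and the common schema are all hypothetical, and a CSV buffer stands in for the spreadsheet or store.

```python
import csv
import io

# Hypothetical raw records from two review sites with different layouts.
SITE_A = [{"org": "Acme", "stars": 4, "text": " Great service "}]
SITE_B = [{"company": "Acme", "rating_10": 6, "review": "Slow delivery"}]

FIELDS = ["organization", "rating", "text"]


def extract():
    """Extract: yield (source, raw_record) pairs from every source."""
    for rec in SITE_A:
        yield "site_a", rec
    for rec in SITE_B:
        yield "site_b", rec


def transform(source, rec):
    """Transform: map a site-specific record onto one common schema.

    Ratings are normalized to a 0-5 scale (an assumption of this sketch).
    """
    if source == "site_a":
        return {
            "organization": rec["org"],
            "rating": float(rec["stars"]),
            "text": rec["text"].strip(),
        }
    return {
        "organization": rec["company"],
        "rating": rec["rating_10"] / 2,
        "text": rec["review"].strip(),
    }


def load(rows, out):
    """Load: write the normalized rows as CSV to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def run_etl():
    rows = [transform(source, rec) for source, rec in extract()]
    buffer = io.StringIO()
    load(rows, buffer)
    return rows, buffer.getvalue()
```

Calling `run_etl()` returns the normalized rows together with the CSV text, so the uniquely formatted records from each site end up in one shape before they reach the store.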

The most common storage used in Big Data platforms is HDFS. Apache Hive, which runs on top of HDFS, provides HQL (Hive Query Language), which helps us do many analytical tasks that are traditionally done in business intelligence tools. A few other options that can be considered are Apache Spark (strictly a processing engine that reads from and writes to such stores), Redis, and MongoDB. Each option has its own way of working in the backend; however, many of them expose SQL or SQL-like APIs, which can be used for further data analysis.
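To give a feel for the kind of analysis an SQL API allows, here is a minimal sketch that computes the average rating per organization. It uses Python's built-in `sqlite3` module purely as a stand-in, since it runs without a cluster; the `reviews` table and its columns are assumptions made for illustration, and the exact dialect will differ between HQL and other engines.

```python
import sqlite3


def average_rating_by_org(rows):
    """Return (organization, average rating, review count) per organization.

    `rows` is an iterable of (organization, rating) tuples. The table
    layout is hypothetical and exists only for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE reviews (organization TEXT, rating REAL)")
        conn.executemany("INSERT INTO reviews VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT organization, AVG(rating), COUNT(*) "
            "FROM reviews "
            "GROUP BY organization "
            "ORDER BY organization"
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

A `GROUP BY` aggregate like this is the sort of query that was traditionally built in a business intelligence tool.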

There may be cases where we need to gather real-time data and present it in real time. In such cases the data does not need to be stored for future use; real-time analytics can be run on it to produce results as requests arrive.
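One way to picture this is a running aggregate that is updated as each review arrives and keeps only the totals, never the reviews themselves. The sketch below assumes a sentiment score has already been computed for each review; the class name `RunningSentiment` and its methods are hypothetical.

```python
from collections import defaultdict


class RunningSentiment:
    """Keep a per-organization running average without storing events."""

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def update(self, organization, score):
        """Fold one incoming sentiment score into the running totals."""
        self._count[organization] += 1
        self._total[organization] += score

    def average(self, organization):
        """Return the current average, or None if nothing has been seen."""
        count = self._count.get(organization, 0)
        if count == 0:
            return None
        return self._total[organization] / count
```

Each request for a result reads the current totals, so memory use grows with the number of organizations rather than with the number of reviews.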