
Coding

Try to tune your code to improve the Spark application's performance. For instance, filter your application's data as early as possible in the ETL cycle: when working with raw HTML files, detag them and crop away unneeded parts at an early stage, as in the sketch below. Tune your degree of parallelism, identify the resource-expensive parts of your code, and look for alternatives.
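A minimal Scala sketch of such early filtering, assuming HTML pages staged in HDFS (the paths are placeholders); the regex-based detagging is a crude stand-in for a real HTML parser:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("early-filter-sketch").getOrCreate()

// Read whole HTML pages as (path, content) pairs; the paths are placeholders.
val rawPages = spark.sparkContext.wholeTextFiles("hdfs:///staging/html/*.html")

// Detag and crop as early as possible so that only the text we actually
// need flows through the rest of the pipeline. The regex detagging below
// is a crude illustration, not production-grade HTML parsing.
val cleaned = rawPages
  .mapValues(_.replaceAll("(?s)<script.*?</script>", "")) // drop script blocks
  .mapValues(_.replaceAll("<[^>]+>", " "))                 // strip remaining tags
  .filter { case (_, text) => text.trim.nonEmpty }         // discard empty pages

cleaned.values.saveAsTextFile("hdfs:///staging/text")
```

Dropping the markup before any shuffles or joins means less data is serialized, shipped, and cached downstream, which is where most of the savings come from.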

ETL is one of the first things you do in an analytics project. You grab data from third-party systems, either by directly accessing relational or NoSQL databases, or by reading exports in various file formats such as CSV, TSV, and JSON (or even more exotic ones) from local or remote filesystems or from a staging area in HDFS. After some inspections and sanity checks on the files, an ETL process in Apache Spark basically reads in the files and creates RDDs or DataFrames/Datasets out of them.
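A sketch of that read step, assuming CSV and JSON exports staged in HDFS; the file names and options are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("etl-read-sketch").getOrCreate()

// Read a CSV export from the staging area. Header handling and schema
// inference are convenient for exploration; an explicit schema is cheaper
// and safer in production because it avoids an extra pass over the data.
val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///staging/exports/customers.csv")

// JSON exports are read the same way; Spark infers the schema from the data.
val jsonDf = spark.read.json("hdfs:///staging/exports/events.json")

// Quick sanity checks before any further processing.
csvDf.printSchema()
jsonDf.show(5)
```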

They are transformed so that they fit the downstream analytics applications running on top of Apache Spark (or other applications), and are then stored back into filesystems as JSON, CSV, or Parquet files, or even written back to relational or NoSQL databases.
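A sketch of the transform-and-store step under the same assumptions; the column names, JDBC URL, and table name below are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("etl-write-sketch").getOrCreate()

// Hypothetical input: a DataFrame produced by an earlier read step.
val df = spark.read.parquet("hdfs:///staging/cleaned")

// Shape the data for the downstream application (assumed column names).
val result = df
  .filter(col("status") === "active")
  .withColumn("ingested_at", current_timestamp())

// Parquet is the usual choice when the consumer is another Spark job.
result.write.mode("overwrite").parquet("hdfs:///warehouse/customers")

// Writing back to a relational database via JDBC works too
// (URL, table, and credentials are placeholders).
result.write
  .mode("append")
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/analytics")
  .option("dbtable", "public.customers")
  .option("user", "etl")
  .option("password", "secret")
  .save()
```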

Finally, I can recommend the following resource for any performance-related problems with Apache Spark: https://spark.apache.org/docs/latest/tuning.html.