
Implicit schema discovery

One important aspect of the DataSource API is implicit schema discovery, which is available for a subset of data sources. When data is loaded from such a source, the column names and types are discovered along with the column values and made available in the DataFrame or Dataset.

Take a JSON file, for example. Column names are explicitly present in the file. Because JSON objects have a dynamic schema, the complete file is read by default to discover all possible column names. The column types are inferred during the same parsing pass.
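To make the idea concrete, here is a minimal sketch in plain Python of what inferring a schema from JSON records involves. It is not Spark's implementation: the helper names (`infer_type`, `merge_types`, `infer_schema`), the type labels, and the merging rules are assumptions chosen for illustration, and nested objects and arrays are simply treated as strings.

```python
import json

def infer_type(value):
    """Map a single JSON value to a simple, hypothetical type label."""
    if value is None:
        return "null"
    if isinstance(value, bool):   # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"               # simplification: nested values fall back to string

def merge_types(a, b):
    """Combine two observed types for the same column into one."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"long", "double"}:
        return "double"           # widen integers to floating point
    return "string"               # conflicting types fall back to string

def infer_schema(lines):
    """Scan every record, collecting all column names and a merged type for each."""
    schema = {}
    for line in lines:
        for name, value in json.loads(line).items():
            schema[name] = merge_types(schema.get(name, "null"), infer_type(value))
    return schema

# One JSON object per line; the set of keys differs from record to record.
lines = [
    '{"id": 1, "name": "Ada"}',
    '{"id": 2, "score": 9.5}',
    '{"id": 3.5, "name": null, "active": true}',
]
schema = infer_schema(lines)
print(schema)
```

Because any record can introduce a new column or change the type seen so far, the sketch has to visit every record before the schema is final, which is why a full read is the default behavior described above.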

If the JSON file is very large and you want to preserve the lazy loading that Apache Spark data objects usually support, you can specify a fraction of the data to be sampled when inferring column names and types.
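The trade-off of sampling can be sketched in plain Python. The example below is an illustration under stated assumptions, not Spark code: `infer_columns` is a hypothetical helper, and it takes every `step`-th record as a deterministic stand-in for sampling a fraction of the data. In Spark itself, the JSON reader exposes a `samplingRatio` option for this purpose.

```python
import json

def infer_columns(lines, step=1):
    """Collect column names from every `step`-th record (step=1 reads all records)."""
    columns = []
    for line in lines[::step]:
        for name in json.loads(line):
            if name not in columns:
                columns.append(name)
    return columns

# Ten records; only the record at index 7 carries an extra "email" column.
lines = ['{"id": %d}' % i for i in range(10)]
lines[7] = '{"id": 7, "email": "x@example.com"}'

full = infer_columns(lines)             # reads all ten records
sampled = infer_columns(lines, step=5)  # reads only records 0 and 5

print(full)
print(sampled)
```

The sampled pass touches a fifth of the records but never sees the record that carries `email`, so that column is missing from the inferred schema. Sampling therefore trades completeness of the discovered schema for less work up front.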

Another example is the Java Database Connectivity (JDBC) data source, where the schema doesn't need to be inferred at all but is read directly from the source database.
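The difference from inference can be shown with a small analogy. The sketch below uses Python's built-in `sqlite3` module rather than Spark's JDBC data source, and the `people` table is a made-up example. The point is only that a relational database already stores column names and types as metadata, so a reader can ask for them without scanning any rows.

```python
import sqlite3

# In-memory database with a hypothetical table; no rows are inserted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, score REAL)")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
schema = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(people)")]
conn.close()

print(schema)
```

The column names and declared types are available even though the table is empty, which is what makes this kind of source cheaper to load than one whose schema has to be inferred from the data.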
