Implicit schema discovery

One important aspect of the DataSource API is implicit schema discovery, which is possible for a subset of data sources. This means that loading the data not only makes the individual columns available in a DataFrame or Dataset, but also discovers their names and types.

Take a JSON file, for example. Column names are already explicitly present in the file. Because JSON objects have a dynamic schema, the complete JSON file is read by default to discover all the possible column names. The column types are inferred during the same parsing pass.
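The idea of reading every record to collect column names and reconcile types can be sketched in plain Python. This is a minimal illustration, not Apache Spark's implementation: the helper names (`infer_type`, `merge_types`, `infer_schema`), the type labels, and the merge rules are assumptions chosen for the example.

```python
import json

def infer_type(value):
    """Map a parsed JSON value to an illustrative type label."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "struct"

def merge_types(a, b):
    """Reconcile two observed types for the same column (assumed rules)."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"long", "double"}:
        return "double"
    return "string"  # fall back to the most general type

def infer_schema(lines):
    """Scan every record, since any one of them may introduce a new column."""
    schema = {}
    for line in lines:
        for name, value in json.loads(line).items():
            observed = infer_type(value)
            schema[name] = merge_types(schema[name], observed) if name in schema else observed
    return schema

records = [
    '{"id": 1, "name": "alice"}',
    '{"id": 2, "score": 9.5}',
    '{"id": 3.5, "name": null, "active": true}',
]
schema = infer_schema(records)
print(schema)
```

Note that the columns `score` and `active` only appear in later records, which is why a full pass over the data is needed to see them.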

If the JSON file gets very large and you want to keep the lazy loading behavior that Apache Spark data objects usually support, you can specify a fraction of the data to be sampled, so that column names and types are inferred from that sample rather than from the whole file.
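In Spark's JSON reader this fraction is set through the `samplingRatio` option. The trade-off can be illustrated with a small stand-alone Python sketch; the function `infer_schema`, its parameters, and the generated data are hypothetical and only mimic the idea of inferring from a random sample.

```python
import json
import random

def infer_schema(lines, sampling_ratio=1.0, seed=42):
    """Infer {column: Python type name} from a random sample of JSON lines."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    schema = {}
    for line in lines:
        # Skip each record with probability (1 - sampling_ratio).
        if rng.random() >= sampling_ratio:
            continue
        for name, value in json.loads(line).items():
            schema.setdefault(name, type(value).__name__)
    return schema

lines = [json.dumps({"id": i, "name": f"user{i}"}) for i in range(1000)]
lines.append(json.dumps({"id": 1000, "rare_field": "x"}))  # occurs only once

full = infer_schema(lines, sampling_ratio=1.0)
sampled = infer_schema(lines, sampling_ratio=0.1)
print(sorted(full))
print(sorted(sampled))
```

Sampling reads far fewer records, but a column that occurs only rarely, such as `rare_field` here, may be missing from the inferred schema.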

Another example is the Java Database Connectivity (JDBC) data source, where the schema doesn't even need to be inferred but is read directly from the source database.
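The difference from the JSON case is that a relational database already stores column names and types as metadata. As an analogy only (this uses Python's built-in `sqlite3` module rather than JDBC, and the `users` table is made up for the example), the schema can be obtained without scanning any rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, score REAL)")

# PRAGMA table_info yields one row per column:
# (cid, name, type, notnull, dflt_value, pk)
schema = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(users)")]
conn.close()

print(schema)
```

The table is empty, yet the column names and declared types are fully known, which is why no inference pass over the data is required.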
