
Implicit schema discovery

One important aspect of the DataSource API is implicit schema discovery, which is available for a subset of data sources. This means that when the data is loaded, not only are the individual columns discovered and made available in a DataFrame or Dataset, but so are their names and types.

Take a JSON file, for example. Column names are already explicitly present in the file. Because JSON objects have a dynamic schema, the complete JSON file is read by default to discover all possible column names. The column types are inferred during the same parsing pass.
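The idea of scanning every record to collect column names and widen column types can be illustrated with a small, plain-Python sketch. This is not Spark's implementation; the sample records, the type labels, and the helper names `type_name`, `merge_types`, and `infer_schema` are all assumptions made for illustration.

```python
import json

# Hypothetical sample data: one JSON object per line, with differing keys.
JSON_LINES = [
    '{"id": 1, "name": "alice"}',
    '{"id": 2, "name": "bob", "age": 31}',
    '{"id": 3, "score": 4.5}',
]


def type_name(value):
    """Map a parsed JSON value to a simple type label (labels are illustrative)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    return "complex"  # nested objects and arrays are not handled in this sketch


def merge_types(a, b):
    """Widen two type labels to one label that can hold both."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"long", "double"}:
        return "double"
    return "string"  # fall back to string when the types conflict


def infer_schema(lines):
    """Read every record, collecting each key and the widened type of its values."""
    schema = {}
    for line in lines:
        for key, value in json.loads(line).items():
            observed = type_name(value)
            schema[key] = merge_types(schema[key], observed) if key in schema else observed
    return schema


schema = infer_schema(JSON_LINES)
print(schema)
```

Because a key such as `score` appears only in the last record, a complete pass over the data is the only way to be sure that every column has been found.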

If the JSON file is very large and you want to keep the lazy loading behavior that Apache Spark data objects usually support, you can specify a fraction of the data to be sampled when inferring column names and types.
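The trade-off behind sampling can be sketched in plain Python as well. In Spark itself, the JSON reader exposes this as the `samplingRatio` option; the code below does not call Spark and only mimics the idea. The data, the `discount` field, and the helpers `infer_schema` and `infer_schema_sampled` are hypothetical.

```python
import json
import random


def infer_schema(lines):
    """Collect every key seen, with the Python type name of its first value."""
    schema = {}
    for line in lines:
        for key, value in json.loads(line).items():
            schema.setdefault(key, type(value).__name__)
    return schema


def infer_schema_sampled(lines, fraction, seed=42):
    """Infer the schema from a random subset holding roughly `fraction` of the records."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = [line for line in lines if rng.random() < fraction]
    return infer_schema(sample)


# Hypothetical data: 1,000 records, of which only every 100th has a "discount" field.
lines = [
    json.dumps({"id": i, "price": i * 1.5, **({"discount": 0.1} if i % 100 == 0 else {})})
    for i in range(1000)
]

full = infer_schema(lines)
sampled = infer_schema_sampled(lines, fraction=0.05)

print("full scan:", sorted(full))
print("5% sample:", sorted(sampled))
```

A sampled schema is always a subset of the schema found by a full scan. Columns that occur only in a few records, such as `discount` here, may be missed, so sampling trades completeness for a cheaper inference pass.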

Another example is the Java Database Connectivity (JDBC) data source, where the schema doesn't need to be inferred at all but is read directly from the source database.
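The difference from inference can be shown with a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a JDBC source. This is an assumption made to keep the example self-contained, and the `employees` table is hypothetical. The point is that column names and declared types come from the database's own metadata, without looking at a single data row.

```python
import sqlite3

# An in-memory SQLite database stands in for a JDBC source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT NOT NULL, salary REAL)"
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = conn.execute("PRAGMA table_info(employees)").fetchall()

# Keep only the column name and declared type; the table holds no rows at all.
schema = [(name, col_type) for _cid, name, col_type, _notnull, _default, _pk in columns]
print(schema)

conn.close()
```

The table is empty, yet the schema is fully known, which is why no sampling or full scan is needed for this kind of source.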
