Title: Practical Big Data Analytics
Author: Nataraj Dasgupta
Chapter word count: 258
Last updated: 2021-07-02 19:26:19

Semi-structured

Semi-structured data combines elements of an organizational schema with aspects that are arbitrary. A personal phone diary (increasingly rare these days!) with columns for name, address, phone number, and notes could be considered a semi-structured dataset. The user might not know the addresses of all individuals, so some entries may have only a phone number, and others only an address.

Similarly, the column for notes may contain additional descriptive information (such as a facsimile number, the name of a relative associated with the individual, and so on). It is an arbitrary field that allows the user to add complementary information. The columns for name, address, and phone number can thus be considered structured in the sense that they can be presented in a tabular format, whereas the notes section is unstructured in the sense that it may contain an arbitrary set of descriptive information that cannot be represented in the other columns of the diary.
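The phone diary can be sketched in a few lines of Python. This is only an illustration: the entries, the field names, and the to_row helper are hypothetical and do not come from the book. It shows how the fixed columns project cleanly onto table rows (with gaps where a value is unknown), while the notes stay free-form text.

```python
# Hypothetical diary entries: name, address, and phone are the fixed columns;
# any of them may be missing, and "notes" holds arbitrary free-form text.
entries = [
    {"name": "Alice Example", "address": "12 Sample Street", "phone": "555-0100",
     "notes": "Fax: 555-0199; sister is Carol"},
    {"name": "Bob Example", "phone": "555-0101"},
    {"name": "Dana Example", "address": "7 Placeholder Road",
     "notes": "Prefers to be called after 6 pm"},
]

STRUCTURED_FIELDS = ("name", "address", "phone")


def to_row(entry):
    """Project an entry onto the fixed columns, using None for unknown values."""
    return tuple(entry.get(field) for field in STRUCTURED_FIELDS)


# The structured part fits a table; the notes do not follow any schema.
rows = [to_row(entry) for entry in entries]
notes = [entry.get("notes", "") for entry in entries]

for row, note in zip(rows, notes):
    print(row, "| notes:", note or "(none)")
```

Using entry.get rather than direct indexing is the design choice that matters here: a missing address or phone number becomes an empty cell instead of an error.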

In computing, semi-structured data is usually represented by formats such as JSON, which can encapsulate both structured and schemaless or arbitrary associations, generally using key-value pairs. A more familiar example is an email message, which has a structured part that is common to all email messages (such as the name of the sender and the time when the message was received) and an unstructured part, represented by the body or content of the email.
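The email example can be made concrete with Python's standard email and json modules. This is a minimal sketch under stated assumptions: the message text, the addresses, and the keys of the record dictionary are made up for illustration. The headers map onto fixed keys, while the body is carried along as an arbitrary string.

```python
import json
from email import message_from_string

# A made-up message: the headers are the structured part, the text after the
# blank line is the unstructured body.
raw_message = """\
From: alice@example.com
To: bob@example.com
Subject: Lunch on Friday
Date: Fri, 02 Jul 2021 12:00:00 +0000

Hi Bob,
Are you free for lunch on Friday? Let me know.
"""

msg = message_from_string(raw_message)

# Key names below are illustrative; any consistent naming would do.
record = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "subject": msg["Subject"],
    "date": msg["Date"],
    "body": msg.get_payload(),  # free text with no schema of its own
}

# Serialize the record as a JSON document of key-value pairs.
document = json.dumps(record, indent=2)
print(document)
```

A document of this shape is the kind of record that a document-oriented store accepts as-is, without a table definition being declared up front.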

Platforms such as MongoDB and CouchDB are generally used to store and query semi-structured datasets.
