Types of Data

To work with data effectively, we need to understand the forms in which it exists. Data can be categorized in two main ways, by structure and by content, as explained in the following sections.

Categorizing Data Based on Structure

Data can be divided on the basis of structure into three categories: structured, semi-structured, and unstructured data, as shown in the following diagram:

Figure 2.1: Categorization based on structure

These three categories are as follows:

  • Structured data: This is the most organized form of data. It is represented in tabular formats such as Excel files and comma-separated values (CSV) files. The following image shows what structured data usually looks like:

Figure 2.2: Structured data

The preceding table contains information about five people, with each row representing a person and each column representing one of their attributes.

  • Semi-structured data: This type of data is not presented in a tabular structure, but it can be transformed into a table. Here, information is usually stored between tags that follow a definite pattern. XML and HTML files are examples of semi-structured data. The following screenshot shows how semi-structured data can appear:

Figure 2.3: Semi-structured data

The format shown in the preceding screenshot is a markup language. Here, the data is stored hierarchically between tags. Markup formats are widely supported, and many parsers are available that can convert this data into structured data.

  • Unstructured data: This type of data is the most difficult to deal with. Machine learning algorithms find it hard to process unstructured data without some loss of information. Text corpora and images are examples of unstructured data. The following image shows what unstructured data looks like:

Figure 2.4: Unstructured data

This is called unstructured data because a program cannot extract employee details from the preceding text snippet by simple parsing. The algorithm has to understand the semantics of the language before it can extract information from such text.
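The difference between the three categories can be sketched in a few lines of Python. This is a minimal illustration, not the book's own code: the records, field names, and sentences below are invented, and only standard library modules (csv, xml.etree.ElementTree, and re) are used. The same two records are stored as CSV, as XML, and as free text. The first two convert cleanly into rows, while a fixed pattern recovers only one of the two records from the free text.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical records, invented for illustration only.
CSV_TEXT = """name,age,city
Asha,34,Pune
Ben,41,Leeds
"""

XML_TEXT = """<people>
  <person><name>Asha</name><age>34</age><city>Pune</city></person>
  <person><name>Ben</name><age>41</age><city>Leeds</city></person>
</people>"""

FREE_TEXT = (
    "Asha is 34 and lives in Pune. "
    "Leeds has been home to Ben, who turned 41 last year."
)

# Structured: each row is a record and each column is an attribute.
rows_from_csv = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Semi-structured: walk the tag hierarchy and flatten it into the same rows.
root = ET.fromstring(XML_TEXT)
rows_from_xml = [
    {field.tag: field.text for field in person}
    for person in root.findall("person")
]

# Unstructured: a fixed pattern matches only one way of phrasing the facts.
pattern = re.compile(r"(\w+) is (\d+) and lives in (\w+)")
rows_from_text = [
    {"name": name, "age": age, "city": city}
    for name, age, city in pattern.findall(FREE_TEXT)
]

print(rows_from_csv == rows_from_xml)  # True: both formats yield the same table
print(len(rows_from_text))             # 1: the second sentence is missed
```

The second sentence states the same kind of facts about Ben, but in a different word order, so the pattern misses it. Covering every possible phrasing with hand-written patterns does not scale, which is why extracting information from unstructured text requires some understanding of the language.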

Categorizing Data Based on Content

Data can be divided into four categories based on content, as shown in the following diagram:

Figure 2.5: Categorizing data based on content

Let's look at each category here:

  • Text data: This refers to text corpora consisting of written sentences. This type of data can only be read. An example would be the text corpus of a book.
  • Image data: This refers to pictures that are used to communicate messages. This type of data can only be seen.
  • Audio data: This refers to voice recordings, music, and so on. This type of data can only be heard.
  • Video data: A continuous series of images coupled with audio forms a video. This type of data can be seen as well as heard.
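As a rough sketch, these four categories line up with the top-level MIME types text, image, audio, and video, so a file's likely category can be guessed from its name with Python's standard mimetypes module. The helper name content_category and the sample file names are hypothetical, and the guess relies on the file extension only, not on the file's actual contents.

```python
import mimetypes

# The four content categories discussed above.
CONTENT_CATEGORIES = {"text", "image", "audio", "video"}


def content_category(filename):
    """Guess 'text', 'image', 'audio', or 'video' from a file name.

    Returns None when the extension is unknown or maps to another type.
    """
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return None
    # A MIME type looks like "image/png"; keep the part before the slash.
    top_level = mime_type.split("/", 1)[0]
    return top_level if top_level in CONTENT_CATEGORIES else None


# Hypothetical file names, used only to exercise the helper.
for name in ["book.txt", "photo.png", "song.mp3", "clip.mp4"]:
    print(name, "->", content_category(name))
```

Because the mapping depends on the extension alone, a misnamed file will be misclassified; inspecting the file's bytes would be needed for a more reliable check.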

With that, we have learned about the different types of data and their categorization on the basis of structure and content. When dealing with unstructured data, it is necessary to clean it first. In the next section, we will look into some of the preprocessing steps for cleaning data.
