Types of Data
To work with data effectively, we need to understand the forms in which it exists. There are two main ways to categorize data, by structure and by content, as explained in the following sections.
Categorizing Data Based on Structure
Data can be divided on the basis of structure into three categories: structured, semi-structured, and unstructured data, as shown in the following diagram:

Figure 2.1: Categorization based on structure
These three categories are as follows:
- Structured data: This is the most organized form of data. It is represented in tabular formats such as Excel files and comma-separated values (CSV) files. The following image shows what structured data usually looks like:
Figure 2.2: Structured data
The preceding table contains information about five people, with each row representing a person and each column representing one of their attributes.
- Semi-structured data: This type of data is not presented in a tabular structure, but it can be transformed into a table. Here, information is usually stored between tags that follow a definite pattern. XML and HTML files are examples of semi-structured data. The following screenshot shows how semi-structured data can appear:
Figure 2.3: Semi-structured data
The format shown in the preceding screenshot is called markup language format. Here, the data is stored between tags, hierarchically. Markup formats are widely supported, and many parsers are available that can convert this data into structured data.
- Unstructured data: This type of data is the most difficult to deal with. It is hard for machine learning algorithms to make sense of unstructured data without losing some information. Text corpora and images are examples of unstructured data. The following image shows what unstructured data looks like:
Figure 2.4: Unstructured data
This is called unstructured data because a program cannot extract employee details from the preceding text snippet by simple parsing. The algorithm has to understand the semantics of the language before it can extract information from the text.
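The difference between the three categories can be illustrated with a minimal Python sketch. The sample records below are hypothetical and are not the contents of the preceding figures; the field names (name, age, city) are assumptions made for illustration. The sketch reads the same two records from CSV text and from XML text using only the standard library, and then shows how a naive pattern applied to free text returns a wrong result.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical sample data (not taken from the figures in the text).
csv_text = "name,age,city\nAsha,31,Pune\nBen,45,Leeds\n"
xml_text = """
<employees>
  <employee><name>Asha</name><age>31</age><city>Pune</city></employee>
  <employee><name>Ben</name><age>45</age><city>Leeds</city></employee>
</employees>
"""
free_text = "Asha, who is 31, moved to Pune last year. Ben is 45 and lives in Leeds."

# Structured: every CSV line is already a row with named columns.
csv_rows = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]

# Semi-structured: walk the tag hierarchy and flatten each <employee> into a row.
root = ET.fromstring(xml_text.strip())
xml_rows = [{field.tag: field.text for field in employee}
            for employee in root.findall("employee")]

# Unstructured: a naive "<word> is <number>" pattern depends on the phrasing.
# For the first sentence it captures "who" instead of the person's name.
ages = re.findall(r"(\w+) is (\d+)", free_text)

print(csv_rows == xml_rows)  # the XML parses into the same table as the CSV
print(ages)
```

In this sketch, the CSV and XML inputs yield identical rows, whereas the regular expression only works when a sentence happens to match the expected phrasing. This is why unstructured text needs techniques that take the semantics of the language into account.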
Categorizing Data Based on Content
Data can be divided into four categories based on content, as shown in the following diagram:

Figure 2.5: Categorizing data based on content
Let's look at each category here:
- Text data: This refers to text corpora consisting of written sentences. This type of data can only be read. An example would be the text corpus of a book.
- Image data: This refers to pictures that are used to communicate messages. This type of data can only be seen.
- Audio data: This refers to voice recordings, music, and so on. This type of data can only be heard.
- Video data: A continuous series of images coupled with audio forms a video. This type of data can be seen as well as heard.
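As a rough illustration of these four categories, the following sketch guesses the content category of a file from its name using Python's standard mimetypes module. The helper content_category and the file names are hypothetical, and guessing from the file extension is only a heuristic, since it does not inspect what the file actually contains.

```python
import mimetypes

CATEGORIES = {"text", "image", "audio", "video"}

def content_category(filename):
    """Return 'text', 'image', 'audio', 'video', or 'other' for a file name."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return "other"
    # The part before the slash of a MIME type, for example "image" in "image/png".
    top_level = mime_type.split("/")[0]
    return top_level if top_level in CATEGORIES else "other"

# Hypothetical file names, one per category.
for name in ["chapter1.txt", "photo.png", "interview.mp3", "lecture.mp4"]:
    print(name, "->", content_category(name))
```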
With that, we have learned about the different types of data and their categorization on the basis of structure and content. When dealing with unstructured data, it is necessary to clean it first. In the next section, we will look into some of the preprocessing steps for cleaning data.