
  • Hands-On Big Data Modeling
  • James Lee, Tao Wei, Suresh Kumar Mukhiya

Characteristics of big data

We explored the popularity of big data in the preceding section. But it is equally important to know what kinds of data can be categorized or labeled as big data. In this section, we are going to explore the various characteristics of big data. Most of the books available on the market describe six of them, discussed as follows:

  • Volume: Big data implies massive amounts of data. The size of the data plays a central role in determining its value, and it is also a key factor in judging whether a given collection of data can be considered big at all. Hence, volume is one of the defining attributes of big data.
Every minute, 204,000,000 emails are sent, 200,000 photos are uploaded, and 1,800,000 likes are generated on Facebook; on YouTube, 1,300,000 videos are viewed and 72 hours of video are uploaded.
 

The point of aggregating such massive volumes of data is that businesses and organizations are collecting and leveraging them to improve their products and services, whether in safety, reliability, healthcare, or governance. In brief, the idea is to turn this abundant, voluminous data into some form of business advantage.
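To get a sense of what the per-minute figures quoted above add up to, here is a minimal Python sketch (the constants simply restate those figures; the projection assumes a constant rate, which is of course a simplification) that scales them to daily and yearly totals:

```python
# Back-of-the-envelope projection of the per-minute statistics quoted
# above to daily and yearly totals (assumes a constant rate).
MINUTES_PER_DAY = 60 * 24
DAYS_PER_YEAR = 365

per_minute = {
    "emails sent": 204_000_000,
    "photos uploaded": 200_000,
    "Facebook likes": 1_800_000,
    "YouTube videos viewed": 1_300_000,
}

for event, count in per_minute.items():
    daily = count * MINUTES_PER_DAY
    print(f"{event}: {daily:,} per day, {daily * DAYS_PER_YEAR:,} per year")
```

Even at these rough rates, every category lands in the billions per day, which is exactly the scale the term volume is meant to capture.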

  • Velocity: This relates to the increasing speed at which big data is created, and the increasing speed at which it is stored and analyzed. Processing data in real time, keeping pace with the rate at which it is generated, is a remarkable goal of big data analytics. The term velocity generally applies to how fast data is produced and processed to satisfy demand, and it is this speed that unlocks the real potential in the data. The flow of data is massive and continuous. Data can be stored and processed in different ways, including batch processing, near-real-time processing, real-time processing, and streaming (a minimal sketch contrasting batch and streaming follows this list):

    • Real-time processing refers to the ability to capture, store, and process data in real time and to trigger immediate action, which in critical applications can save lives.
    • Batch processing refers to feeding a large amount of data into large machines and processing it for days at a time. It is still very common today.
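To make the contrast concrete, here is a minimal Python sketch (the sensor feed, readings, and alert threshold are all hypothetical) that analyzes the same simulated data twice: once after the fact as a batch, and once event by event as a stream:

```python
import random
import statistics

# Hypothetical sensor feed: in a real system this would arrive over a
# message queue or socket; here we simulate it with random readings.
random.seed(42)
readings = [random.gauss(20.0, 5.0) for _ in range(1000)]

# Batch processing: accumulate everything first, analyze afterwards
# (in practice, hours or days after the data was produced).
print("batch mean:", round(statistics.mean(readings), 2))

# Streaming/real-time processing: act on each reading as it arrives,
# keeping only a running aggregate and triggering immediate action.
count, total = 0, 0.0
for reading in readings:
    count += 1
    total += reading
    if reading > 35.0:  # hypothetical alert threshold
        print(f"ALERT: reading {reading:.1f} exceeds threshold")
print("streaming mean:", round(total / count, 2))
```

Both paths compute the same mean; the difference is that the streaming path can raise the alert the moment an anomalous reading appears, rather than days later.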
  • Variety: This refers to the many sources and types of data, whether structured, semi-structured, or unstructured. We will discuss these types of big data further in Chapter 5, Structures of Data Models. When we think of data variety, we think of the additional complexity that results from the many kinds of data we need to store, process, and combine. Data is more heterogeneous these days: BLOB image data, enterprise data, network data, video data, text data, geographic maps, computer-generated or simulated data, and social media data. We can categorize the variety of data along several dimensions, some of which are explained as follows (a small sketch of handling mixed varieties appears after the list):

    • Structural variety: This refers to the representation of the data; for example, a satellite image of wildfires from NASA is completely different from tweets sent out by people who are seeing the fire spread.
    • Media variety: Data is delivered in various media, such as text, audio, or video; this is referred to as media variety.
    • Semantic variety: Semantic variety comes from different assumptions about or interpretations of the same data. For example, we can express a person's age qualitatively (infant, juvenile, or adult) or quantitatively (in years).
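The practical consequence of variety is that a single pipeline often has to absorb all three structural forms at once. Here is a minimal Python sketch (the sample records and field names are hypothetical) that normalizes structured CSV, semi-structured JSON, and unstructured text into a common list of records:

```python
import csv
import io
import json

# Hypothetical samples of the three structural varieties.
structured = "id,city,temp_c\n1,Bergen,7.5"                      # CSV
semi_structured = '{"id": 2, "city": "Oslo", "tags": ["wind"]}'  # JSON
unstructured = "Eyewitness report: the fire is spreading fast near the ridge"

records = []

# Structured data has a fixed schema we can rely on.
for row in csv.DictReader(io.StringIO(structured)):
    records.append({"source": "csv", **row})

# Semi-structured data carries its own, possibly varying, structure.
records.append({"source": "json", **json.loads(semi_structured)})

# Unstructured data has no schema; store it raw and extract meaning later.
records.append({"source": "text", "body": unstructured})

for record in records:
    print(record)
```

Each variety demands a different amount of up-front interpretation, which is the extra complexity the preceding paragraph refers to.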
  • Veracity: This refers to the quality of the data, and is sometimes also discussed in terms of validity or volatility. Big data can be noisy and uncertain, full of biases and abnormalities, and it can be imprecise. Data is of no value if it is not accurate: the results of big data analysis are only as good as the data being analyzed. This creates challenges in keeping track of data quality: what has been captured, where the data came from, and how it was analyzed prior to its use. A minimal validation sketch follows.
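As an illustration, here is a small Python sketch (the records, field names, and validity rules are hypothetical) that screens incoming records for the kinds of veracity problems described above: missing values, implausible values, and duplicates:

```python
# Hypothetical incoming records with deliberate quality problems.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": -5},    # implausible value
    {"id": 1, "age": 34},    # duplicate id
]

seen_ids = set()
clean, rejected = [], []
for record in records:
    if record["age"] is None:
        rejected.append((record, "missing age"))
    elif not 0 <= record["age"] <= 120:  # hypothetical plausibility rule
        rejected.append((record, "age out of range"))
    elif record["id"] in seen_ids:
        rejected.append((record, "duplicate id"))
    else:
        seen_ids.add(record["id"])
        clean.append(record)

print(f"kept {len(clean)}, rejected {len(rejected)}")
for record, reason in rejected:
    print("rejected:", record, "->", reason)
```

Tracking which records were rejected, and why, is exactly the kind of provenance bookkeeping that veracity demands.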

  • Valence: This refers to connectedness. The more connected the data is, the higher its valence. A high-valence dataset is denser, which makes many standard analytical techniques very inefficient. One common way to quantify valence is shown below.
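Valence is often quantified as the density of the connection graph: the number of links that actually exist divided by the number that could exist. Here is a minimal Python sketch (the users and friendship links are hypothetical):

```python
from itertools import combinations

# Hypothetical social graph: four users and their friendship links.
users = ["ana", "bo", "cy", "di"]
links = {("ana", "bo"), ("ana", "cy"), ("bo", "cy"), ("cy", "di")}

# For n nodes there are n*(n-1)/2 possible undirected links.
possible = len(list(combinations(users, 2)))

valence = len(links) / possible
print(f"valence (graph density): {valence:.2f}")  # 4/6 -> 0.67
```

As valence approaches 1, every item relates to nearly every other item, and pairwise analyses whose cost scales with the number of links become correspondingly expensive.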

  • Value: The term, in general, refers to the valuable insights gained from the ability to investigate and identify new patterns and trends in high-volume, cross-platform data. The whole point of processing big data in the first place is to bring value to the query at hand; the final output of all of these tasks is value.

To sum up the preceding content: big data is commonly characterized by the six Vs of volume, velocity, variety, veracity, valence, and value.
