
Big data

The term Big data was coined by Roger Mougalas in 2005, a year after the term Web 2.0. It was used to describe a data era in which traditional business intelligence tools were ineffective due to the size of the data they had to deal with. The same year, Yahoo developed Hadoop, built on Google's MapReduce, with the ambition of indexing the World Wide Web. Hadoop is an open source framework that can handle both structured and unstructured data.

Structured data is identified by well-defined data types, data rules, and controls that it adheres to. Structured data typically sits in databases where the exact parameters of the data are predefined. Oracle, Microsoft SQL Server, and several other database management systems are focused primarily on dealing with structured data.

Unstructured data does not have the same level of structural discipline, primarily because of the way it is generated. It comes in all shapes and forms and makes up most of the data that exists in the world today: data generated from social media, emails, chats, voice recordings, and videos. Social media necessitated the efficient management of unstructured data, and several technologies emerged to address this opportunity.

Another classification of databases is relational and non-relational. Relational databases like MySQL, Microsoft SQL Server, and Oracle store data in a structured format within tables. These tables can be linked to each other through relationships, which help ensure that the integrity of the data remains intact.

However, the downside of this model is that it takes a lot of time to transform data into a relational schema. Therefore, it may not be the best option when data volumes are huge and processing is expected within a fraction of a second. Extracting data from a relational database is typically done using Structured Query Language (SQL).
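To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module; the tables, columns, and values are purely illustrative. Two tables are linked through a foreign key relationship, and SQL is used to extract the joined data.

```python
import sqlite3

# Illustrative schema: customers and orders linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")

# Extraction is done with SQL: join the tables through their relationship.
for row in conn.execute("""
    SELECT c.name, o.order_id, o.amount
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
"""):
    print(row)
```

The relationship enforced by the foreign key is exactly what gives relational systems their integrity guarantees, and also what makes the upfront schema work time-consuming.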

Non-relational databases like MongoDB, Neo4J, and Cassandra store data in formats such as JSON or XML. They come in handy when availability and query response times matter more than strict data consistency. These databases also scale horizontally more seamlessly, which is important when large data volumes are involved.
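As a rough illustration of the document model, the following Python sketch stores JSON-like documents in MongoDB via pymongo; it assumes a MongoDB instance is reachable locally, and the database, collection, and field names are all hypothetical. Note that the two documents have different shapes, which a relational table would not allow without schema changes.

```python
from pymongo import MongoClient

# Assumes a MongoDB instance is running locally; names are illustrative.
client = MongoClient("mongodb://localhost:27017")
reviews = client["shop"]["reviews"]

# Documents are stored as JSON-like structures; each one can have a
# different shape, so no upfront schema is required.
reviews.insert_one({
    "user": "alice",
    "rating": 5,
    "tags": ["fast delivery", "good value"],
})
reviews.insert_one({
    "user": "bob",
    "text": "Arrived late",  # different fields from the first document
    "device": {"type": "mobile", "os": "android"},
})

# Query by a field that only some documents have.
for doc in reviews.find({"rating": {"$gte": 4}}):
    print(doc)
```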

Before getting into the depths of how big data management happens, it would first be useful to understand how structured data is sourced, managed, and analyzed.

Structured data processing

In a traditional environment where data sticks to well-defined data types, the process of sourcing, preparing, managing, and delivering it in a format suitable for reporting and analytics is called ETL (Extract, Transform, and Load). The system where all of these processes happen in an organization is called a data warehouse. We'll briefly discuss each of these processes, as follows:

Extract

Data is sourced from across the organization in various forms and stored in tables in a database called the staging database. The sources could be flat files, a messaging bus, or a transaction database that is highly normalized for quick transaction writes. Source-to-target mappings are predefined to ensure that the source data is delivered to the staging area in a compatible structure (data type). The tables in the staging database act as the landing area for this data.

Transform

Data in staging tables goes through transformations that are predefined. These transformations are identified well ahead of time and coded into the system. Where data is identified as incompatible with these transformations and the rules set within the system (data types, logical criteria), the data is logged into an error-handling queue.

Load

Transformed data is then loaded into the data warehouse, and by this time it is generally of high quality. This final database could also be a data mart, which is often a miniature data warehouse serving a specific purpose or part of an organization. In any case, there are several hops that data needs to take before it is in a shape ready for analysis and reporting.
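As a rough end-to-end illustration of the three steps, here is a minimal ETL sketch in Python; the file name, table, and validation rules are hypothetical stand-ins for the predefined mappings and transformations described above.

```python
import csv
import sqlite3

warehouse = sqlite3.connect("warehouse.db")  # hypothetical warehouse database
warehouse.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
error_queue = []  # stand-in for an error-handling queue

# Extract: read raw rows from the staging area (here, a hypothetical CSV file).
with open("staging_sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: apply predefined rules; incompatible rows go to the error queue.
clean_rows = []
for row in rows:
    try:
        amount = float(row["amount"])      # data-type rule
        if amount < 0:                     # logical rule
            raise ValueError("negative amount")
        clean_rows.append((row["region"].strip().upper(), amount))
    except (KeyError, ValueError) as exc:
        error_queue.append((row, str(exc)))

# Load: write the transformed, validated rows into the warehouse table.
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", clean_rows)
warehouse.commit()
print(f"loaded {len(clean_rows)} rows, {len(error_queue)} sent to error queue")
```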

This process works in a conventional setup. However, it is not practically possible to find a place to store the 2.5 quintillion bytes of data created each day, much of which does not stick to the semantic limitations of a structured database. Hence the need for a shift in approach using big data platforms. Let us now look at how unstructured data management has addressed some of the challenges posed by the data era.

Unstructured data processing

Conventional database management systems are not designed to deal with the volume of data and lack of structure often associated with the internet. The key components of a big data system include:

Data sources

Data sources in a big data system can be text files, messages from social media, web pages, emails, audio files, and video files. With the rise of the Internet of Things (IoT), data generated by the interactions of machines would also be a source that big data systems need to deal with.

Data storage/Data lake

Data from these sources is stored in a distributed file store such as the Hadoop Distributed File System (HDFS). The distributed nature of the store allows it to handle high volumes and large file sizes. Data lakes can also hold structured data, but they do not require data to conform to a structure.

Firms that successfully implemented a data lake have outperformed the competition by 9% in organic revenue growth (as per research by Aberdeen).

Source: https://s3-ap-southeast-1.amazonaws.com/mktg-apac/Big+Data+Refresh+Q4+Campaign/Aberdeen+Research+-+Angling+for+Insights+in+Today's+Data+Lake.pdf

Unlike a traditional data warehouse, a data lake applies a schema at read time.
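The schema-on-read idea can be sketched with PySpark; the path and field names are illustrative. Raw JSON files sit in the lake untouched, and a schema is only applied when the data is read for analysis.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The schema is defined by the reader, not by the storage layer.
schema = StructType([
    StructField("user", StringType()),
    StructField("event", StringType()),
    StructField("value", DoubleType()),
])

# Raw JSON files stay in the lake as-is; the schema is applied at read time.
# The path is illustrative; it could be HDFS, S3, or a local directory.
events = spark.read.schema(schema).json("hdfs:///datalake/raw/events/")
events.groupBy("event").count().show()
```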

Data processing

Data processing in a big data infrastructure could happen in different ways depending on the nature of data fed into the system:

  • Batch processing is typically used to process large files. These batch jobs process incoming files and store the processed data in another file. Tools like Hive, Pig, or MapReduce jobs can address this type of processing.
  • Real-time data processing is used where data from social media or IoT devices arrives as a continuous flow. This flow is captured in real time, which could also involve a message buffer to deal with the real-time volumes (a minimal sketch follows this list).
  • This data can then be transformed using conventional techniques and moved into an analytics database/data warehouse.
  • Alternatively, where the conventional process is not preferred, a low-latency NoSQL layer can be built on top of the data files for analytics and reporting purposes.
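The following is a rough sketch of the real-time path using Spark Structured Streaming; it assumes a Kafka broker acts as the message buffer and that the Spark-Kafka connector is available, and the topic name and storage paths are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("stream-ingest").getOrCreate()

# Read a continuous flow of events from a message buffer (Kafka here);
# the broker address and topic name are illustrative.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "iot-events")
    .load()
)

# Kafka delivers raw bytes; cast the payload to a string for downstream use.
events = stream.select(col("value").cast("string").alias("payload"))

# Continuously append the processed records to files in the data lake.
query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///datalake/processed/iot-events/")
    .option("checkpointLocation", "hdfs:///datalake/checkpoints/iot-events/")
    .start()
)
query.awaitTermination()
```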

Let us now look at different architectures that have been explored to manage big data.

Big data architecture

There are big data architectures that address both the handling of high-volume data and the need for accurate analytics. For instance, the Lambda architecture has a hot path and a cold path. The hot path handles high volumes of data coming in from sources like social media; for read operations, it provides quick access but with lower data accuracy. On the other hand, the cold path involves a batch process that is time-intensive, but processes data to provide highly accurate analytics capabilities.

The hot path typically holds data only for a short period of time, after which better-quality data processed from the cold path replaces it. The Kappa architecture took inspiration from the Lambda architecture and simplified it by using a stream processing mechanism with just one path instead of the Lambda architecture's two. This takes away the complexity of duplicating logic and ensuring the convergence of data. Frameworks like Apache Spark Streaming, Flink, and Beam are able to provide both real-time and batch processing abilities.
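To illustrate the single-path idea behind the Kappa architecture, here is a rough PySpark sketch in which the same transformation function serves both batch reprocessing and the live stream; the paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("single-path").getOrCreate()

def enrich(events: DataFrame) -> DataFrame:
    # One transformation, written once, used for both batch and streaming.
    return events.filter(col("value") > 0).withColumn("value_x2", col("value") * 2)

# Batch: reprocess historical files when needed.
historical = enrich(spark.read.json("hdfs:///datalake/raw/events/"))

# Streaming: apply the identical logic to live data arriving in the same location.
live = enrich(
    spark.readStream.schema(historical.schema).json("hdfs:///datalake/raw/events/")
)
```

Because both paths share one function, there is no duplicated logic to keep in sync, which is the convergence problem the Kappa architecture avoids.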

The third architecture used by big data systems is the Zeta architecture. It uses seven pluggable components to increase resource utilization and efficiency. The components are as follows:

  • Distributed file system
  • Real-time data storage
  • Pluggable compute model / Execution engine
  • Deployment / Container management system
  • Solution architecture
  • Enterprise applications
  • Dynamic and global resource management

The benefits of this architecture include:

  • Reducing complexity
  • Avoiding data duplication
  • Reducing deployment and maintenance costs
  • Improving resource utilization

Breaking down the solution into reusable components adds efficiencies across several aspects of developing and managing a big data platform.

While the architectures are interesting for understanding the maturity of the technology, the outcomes are perhaps more important. For instance, big data systems have allowed for better use of data captured in the form of social media interactions. The maturity of infrastructure to handle big volumes of data has enabled clever, customer-specific services across several industries. Some of the common use cases we have seen for social media analytics are:

  • Sentiment analysis for brands
    • Brands can use social media analytics to understand sentiments about their brands or recent launches and tweak their offerings accordingly.
  • Customer segmentation and targeted advertisements
    • Several social media platforms provide details on exactly where organizations are getting the biggest bang for their buck on marketing. Firms can fine-tune their marketing strategies based on this information and reduce the cost of acquiring customers.
  • Proactive customer services
    • Gone are the days when customers had to go through a cumbersome complaints process. There are several instances where customers have logged their complaints about a particular experience on Twitter or Facebook, and the brands have acted immediately.
  • Political campaigns
    • Even political campaigns before elections are managed proactively using social media insights. The West is perhaps more used to such activities, but in India for example, Prime Minister Narendra Modi has managed to capture the attention of his followers using clever social media tactics.
    • Several Asian political organizations have been accused of releasing fake news during a political campaign to mislead voters. For instance, WhatsApp was used as a platform to spread fake news about the India-Pakistan air battles just before the 2019 Indian elections. The Brexit referendum in 2016 is another example where parties were accused of voter manipulation. Source: https://www.bbc.com/news/world-asia-india-47797151

There are several other ways in which organizations use social media data for continuous consumer engagement. For instance, understanding sentiments of users, proactively managing complaints, and creating campaigns to increase brand awareness can all be done on social media.

As an investor, when I assess firms, one of the key dimensions I take into consideration is their awareness and ability to drive brand awareness, customer acquisition, and continuous engagement through social media channels. Understanding the advantages of using social media effectively has become a basic attribute to running a business. It is no longer just an option. The rise of social media saw firms move from on-premise servers to cloud-based infrastructure. There may not be a causation, but there definitely is a correlation between social media and the cloud.

The cloud

Big data frameworks that architecturally catalyzed the big data revolution were also supported by the evolution of cloud computing in parallel. Without these technology paradigms going mainstream, it would not have been possible to capture, store, and manage large volumes of data. It all started in the early 2000s, when Amazon's online retail business was growing rapidly. They had to procure massive servers to manage the Christmas season peak in traffic. At other times, the utilization of their servers was about 10%, which was commonplace in those days.

The team at Amazon identified the underutilization patterns of their servers and felt that they could create a model to improve utilization during non-peak times. Sharing their server infrastructure with others who needed server resources could add efficiencies for everyone. The concept of cloud infrastructure was born.

Jeff Bezos and his team of executives eventually decided to make the most of the unused server capacity during non-peak times. Within a year, the team at Amazon had put together a service that offered computer storage, processing power, and a database. This business model transformed the innovation landscape as server infrastructure became more affordable for startups.

Amazon Web Services (AWS) went live in 2006 and by 2018 it was a $26 billion revenue-generating machine. Google, Microsoft, IBM, and others followed suit; however, Amazon have clearly got their nose ahead. By 2018, 80% of enterprises were either running apps on or experimenting with AWS as their preferred cloud platform (as per Statista). The cost of starting a business has plummeted since the mainstream adoption of cloud services.

Procuring infrastructure on a need basis has also made it cost-efficient to run and scale businesses.


Figure 3: Planned and current use of public cloud platform services worldwide, 2018. Source: https://www.statista.com/statistics/511467/worldwide-survey-public-coud-services-running-application/

As cloud services matured and scaled, several new models emerged, namely, Software as a service (SaaS), Platform as a service (PaaS), and Infrastructure as a service (IaaS).

SaaS is a model in which a software application is virtually managed on a server by a vendor and accessed by users over the internet. Google Docs was one of the early examples of this model. Today, we use cloud-hosted SaaS for several day-to-day applications and the simplest of tasks, from document management to conducting teleconferences. Thanks to this model, our laptops do not cry out for software updates on applications every other minute. However, we have also become increasingly reliant on the internet and feel dysfunctional without it.

PaaS is a model where instead of providing an application over the internet, the vendor provides a platform for developers to create an application. For instance, many vendors offer Blockchain in a PaaS model, where developers can use cloud-managed software development services to create Blockchain applications. IBM offers a similar service for quantum computing too, however, that can also be bucketed into the IaaS model.

IaaS is a model where computer resources are offered as a service. This includes server storage, computing and networking capacity, disaster recovery, and more. It has helped large organizations reduce their infrastructure footprint by moving to the cloud. Data centers were migrated to the cloud, achieving efficiencies across compute resources while also reducing carbon footprints.

With these advances in architectural, software, and infrastructure technology paradigms, the data age had well and truly taken off. We had figured out ways of creating and managing data at scale. However, what we weren't very good at was exploiting data volumes to develop intelligence at scale – intelligence that could challenge humans. Enter AI.
