- Practical Big Data Analytics
- Nataraj Dasgupta
- 2021-07-02 19:26:21
Selection of the hardware stack
The choice of hardware depends on the type of solution chosen and on where the hardware will be located. It is driven by several key metrics: the type of data (structured, unstructured, or semi-structured), the size of the data (gigabytes versus terabytes versus petabytes), and, to an extent, the frequency with which the data will be updated. The optimal choice requires a formal assessment of these variables, which is discussed later in the book. At a high level, there are three broad models of hardware architecture:
- Multinode architecture: This typically entails multiple interconnected nodes (or servers) that work on the principle of distributed computing. A classic example is Hadoop, where multiple servers maintain bi-directional communication to coordinate a job. Other technologies, such as the NoSQL database Cassandra and the search and analytics platform Elasticsearch, also run on multinode architectures. Most of them leverage commodity servers, that is, machines that are relatively low-end by enterprise standards and that work in tandem to provide large-scale data mining and analytics capabilities. Multinode architectures are suitable for hosting data in the range of terabytes and above.
- Single-node architecture: Single-node refers to computation done on a single server. This has become relatively uncommon with the advent of multinode computing tools, but it still retains a considerable advantage over distributed architectures: a job on one server does not have to coordinate over a network. The Fallacies of Distributed Computing list assumptions that are often wrongly made when implementing distributed systems, for example that the network is reliable, that latency is zero, and that bandwidth is infinite. A single server does not depend on any of these assumptions holding.
If the dataset is structured, contains primarily textual data, and is on the order of 1-5 TB, it is entirely possible in today's computing environment to host it on a single-node machine using specific technologies, as is demonstrated in later chapters.
- Cloud-based architecture: Over the past few years, numerous cloud-based solutions have appeared in the industry. These solutions have greatly lowered the barrier to entry in big data analytics by making it easy to provision hardware resources on demand, based on the needs of the task at hand. This materially reduces the significant overhead of procuring, managing, and maintaining physical hardware and hosting it in in-house data center facilities.
Cloud platforms such as Amazon Web Services, Microsoft Azure, and Google Compute Engine permit enterprises to provision tens to thousands of nodes at costs starting as low as 1 cent per hour per instance.
In the wake of the growing dominance of cloud vendors over traditional brick-and-mortar hosting facilities, several complementary services to manage client cloud environments have come into existence.
Examples include cloud management companies such as Altiscale, which provides big data as a service, and IBM Cloud Brokerage, which facilitates the selection and management of multiple cloud-based solutions.
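To make the per-hour pricing concrete, the following is a minimal sketch of the arithmetic behind an on-demand cluster bill. The function name `cluster_cost_usd` and the example figures (100 nodes, 24 hours) are hypothetical and chosen only for illustration; the 1 cent per hour rate is the lower bound quoted above, and actual rates vary by vendor and instance type.

```python
def cluster_cost_usd(nodes: int, hours: float, rate_per_node_hour: float) -> float:
    """Estimate the on-demand cost of running `nodes` instances for `hours`.

    Assumes a flat hourly rate per instance and ignores storage,
    network transfer, and any discounts, which real bills include.
    """
    if nodes < 0 or hours < 0 or rate_per_node_hour < 0:
        raise ValueError("nodes, hours, and rate must be non-negative")
    return nodes * hours * rate_per_node_hour


# Hypothetical job: 100 nodes for one day at 1 cent per node-hour.
one_day = cluster_cost_usd(nodes=100, hours=24, rate_per_node_hour=0.01)
print(f"Estimated cost: ${one_day:.2f}")
```

Under these assumptions the cost scales linearly with both node count and running time, which is why provisioning a large cluster for a short job can cost about the same as a small cluster for a long one.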
The exponential decrease in the cost of hardware: The cost of hardware has fallen exponentially over the past few decades. As a case in point, per Statistic Brain's research, the cost of hard drive storage in 2013 was approximately 4 cents per GB. Compare that with $7 per GB as recently as 2000 and over $100,000 per GB in the early 1980s. Given the high cost of licensing commercial software, which can often exceed the cost of the hardware, it makes sense to allocate enough budget toward procuring capable hardware. Software needs appropriate hardware to perform optimally, so selecting the hardware deserves as much attention as selecting the software.
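The two price points quoted above ($7 per GB in 2000 and about $0.04 per GB in 2013) are enough to back out an average rate of decline. The sketch below assumes a constant year-over-year decline between the two dates, which is a simplification; the function names are hypothetical.

```python
import math


def annual_decline_rate(start_price: float, end_price: float, years: float) -> float:
    """Average fractional price drop per year, assuming a constant rate."""
    if start_price <= 0 or end_price <= 0 or years <= 0:
        raise ValueError("prices and years must be positive")
    return 1.0 - (end_price / start_price) ** (1.0 / years)


def halving_time_years(start_price: float, end_price: float, years: float) -> float:
    """Years needed for the price to halve, under the same constant-rate assumption."""
    if not (0 < end_price < start_price) or years <= 0:
        raise ValueError("expected 0 < end_price < start_price and years > 0")
    return years * math.log(2) / math.log(start_price / end_price)


# Figures from the text: $7/GB in 2000, about $0.04/GB in 2013.
rate = annual_decline_rate(7.0, 0.04, 2013 - 2000)
halving = halving_time_years(7.0, 0.04, 2013 - 2000)
print(f"Average decline: {rate:.1%} per year")
print(f"Price halves about every {halving:.2f} years")
```

With these inputs the price falls by roughly a third each year. Put another way, storing 1 TB (1,000 GB) would have cost about $7,000 at the 2000 price and about $40 at the 2013 price.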