- Mastering Java for Data Science
- Alexey Grigorev
- 475 words
- 2021-07-02 23:44:32
Data processing libraries
The standard Java library is very rich and offers a lot of tools for data processing, such as collections, I/O tools, data streams, and means of parallel task execution.
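As a small illustration of what the standard library alone can do, the following sketch counts word frequencies with collections and a parallel stream. The class name `WordCountSketch` and the sample input are made up for this example; only JDK classes are used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class WordCountSketch {

    // Splits each line on whitespace and counts how often every token occurs.
    // The parallel stream lets the JDK spread the work over several threads.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.groupingBy(token -> token, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Java for data science", "data processing in Java");
        // Prints {data=2, for=1, in=1, java=2, processing=1, science=1}
        System.out.println(countWords(lines));
    }
}
```

A `TreeMap` is used here only so that the printed result is sorted by word; a plain `groupingBy` without a map factory would work as well.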
There are very powerful extensions to the standard library such as:
- Google Guava (https://github.com/google/guava) and Apache Commons Collections (https://commons.apache.org/collections/) for richer collections
- Apache Commons IO (https://commons.apache.org/io/) for simplified I/O
- AOL Cyclops-React (https://github.com/aol/cyclops-react) for richer, functional-style parallel streaming
We will cover both the standard API for data processing and its extensions in Chapter 2, Data Processing Toolbox. In this book, we will use Maven to include external libraries such as Google Guava or Apache Commons IO. Maven is a dependency management tool that lets us specify external dependencies with a few lines of XML. For example, to add Google Guava, it is enough to declare the following dependency in pom.xml:
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>19.0</version>
</dependency>
Once we do this, Maven goes to the Maven Central repository and downloads the specified version of the dependency. The best way to find dependency snippets for pom.xml (such as the previous one) is to use the search at https://mvnrepository.com or your favorite search engine.
Java gives an easy way to access databases through Java Database Connectivity (JDBC), a unified database access API. JDBC makes it possible to connect to virtually any relational database that supports SQL, such as MySQL, MS SQL, Oracle, PostgreSQL, and many others. This allows moving the data manipulation from Java to the database side.
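A minimal sketch of this idea is shown next: the aggregation is written in SQL and executed by the database, while Java only reads the result. The table `sales(region, amount)`, the class name `JdbcSketch`, and the JDBC URL passed in are assumptions made for illustration; running it against a real database also requires the matching JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

class JdbcSketch {

    // Hypothetical table sales(region, amount); filtering and grouping happen in the database.
    static final String QUERY =
            "SELECT region, AVG(amount) AS avg_amount FROM sales WHERE amount >= ? GROUP BY region";

    // Opens a connection, runs the query, and copies the result set into a map.
    static Map<String, Double> averageAmountByRegion(String jdbcUrl, double minAmount)
            throws SQLException {
        Map<String, Double> result = new LinkedHashMap<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement stmt = conn.prepareStatement(QUERY)) {
            stmt.setDouble(1, minAmount); // bind the "?" placeholder
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    result.put(rs.getString("region"), rs.getDouble("avg_amount"));
                }
            }
        }
        return result;
    }
}
```

The try-with-resources blocks close the result set, statement, and connection even if the query fails. If no driver accepts the given URL, `DriverManager.getConnection` throws an `SQLException`.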
When it is not possible to use a database for handling tabular data, we can use DataFrame libraries to do it directly in Java. The DataFrame is a data structure popularized by R, and it makes it easy to manipulate tabular data in the program without resorting to an external database.
For example, with DataFrames it is possible to filter rows based on some condition, apply the same operation to each element of a column, group by some condition, or join with another DataFrame. Additionally, some data frame libraries make it easy to convert tabular data to a matrix form so that the data can be used by machine learning algorithms.
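To make these operations concrete, here is a rough sketch that imitates them with plain Java streams over a list of row objects. It does not use any data frame library; the `Row` class, its columns (city, rooms, price), and the method names are hypothetical and exist only for this example. A join is left out to keep the sketch short.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class TabularOpsSketch {

    // One row of a hypothetical table with columns (city, rooms, price).
    static final class Row {
        final String city;
        final int rooms;
        final double price;

        Row(String city, int rooms, double price) {
            this.city = city;
            this.rooms = rooms;
            this.price = price;
        }
    }

    // Filter rows based on a condition: keep rows with at least minRooms rooms.
    static List<Row> filterByMinRooms(List<Row> rows, int minRooms) {
        return rows.stream().filter(r -> r.rooms >= minRooms).collect(Collectors.toList());
    }

    // Apply the same operation to each element of a column: price in thousands.
    static List<Double> pricesInThousands(List<Row> rows) {
        return rows.stream().map(r -> r.price / 1000.0).collect(Collectors.toList());
    }

    // Group by a key and aggregate: mean price per city, sorted by city name.
    static Map<String, Double> meanPriceByCity(List<Row> rows) {
        return rows.stream().collect(
                Collectors.groupingBy(r -> r.city, TreeMap::new,
                        Collectors.averagingDouble(r -> r.price)));
    }

    // Convert the numeric columns (rooms, price) to a matrix, one array per row.
    static double[][] toMatrix(List<Row> rows) {
        return rows.stream().map(r -> new double[] {r.rooms, r.price}).toArray(double[][]::new);
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("Berlin", 2, 100000.0),
                new Row("Berlin", 4, 300000.0),
                new Row("Munich", 3, 200000.0));
        System.out.println(filterByMinRooms(rows, 3).size()); // prints 2
        System.out.println(pricesInThousands(rows));           // prints [100.0, 300.0, 200.0]
        System.out.println(meanPriceByCity(rows));             // prints {Berlin=200000.0, Munich=200000.0}
        System.out.println(toMatrix(rows).length);             // prints 3
    }
}
```

A data frame library typically offers the same operations on whole columns with far less code, which is the main reason to use one.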
There are a few data frame libraries available in Java. Some of them are as follows:
- Joinery (https://cardillo.github.io/joinery/)
- Tablesaw (https://github.com/lwhite1/tablesaw)
- Saddle (https://saddle.github.io/), a data frame library for Scala
- Apache Spark DataFrames (http://spark.apache.org/)
We will also cover databases and data frames in Chapter 2, Data Processing Toolbox, and we will use DataFrames throughout the book.
There are more complex data processing libraries such as Spring Batch (http://projects.spring.io/spring-batch/). They make it possible to build complex data pipelines (called ETLs, from Extract-Transform-Load) and to manage their execution.
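The three stages of an ETL pipeline can be sketched in plain Java as follows. This is not Spring Batch code and says nothing about its API; the class `EtlSketch`, the "key,amount" record format, and the in-memory map standing in for a target database are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class EtlSketch {

    // Extract: pull raw "key,amount" records from a source (here an in-memory list).
    static List<String> extract(List<String> source) {
        return new ArrayList<>(source);
    }

    // Transform: parse each record, skip malformed ones, and sum the amounts per key.
    static Map<String, Double> transform(List<String> rawLines) {
        Map<String, Double> totals = new TreeMap<>();
        for (String line : rawLines) {
            String[] parts = line.split(",");
            if (parts.length != 2) {
                continue; // wrong number of fields
            }
            try {
                double amount = Double.parseDouble(parts[1].trim());
                totals.merge(parts[0].trim().toLowerCase(), amount, Double::sum);
            } catch (NumberFormatException e) {
                // amount is not a number: skip the record
            }
        }
        return totals;
    }

    // Load: write the result to the target store (a map stands in for a database table).
    static void load(Map<String, Double> totals, Map<String, Double> target) {
        target.putAll(totals);
    }

    // Runs the three stages in order and returns the filled target.
    static Map<String, Double> run(List<String> source) {
        Map<String, Double> target = new TreeMap<>();
        load(transform(extract(source)), target);
        return target;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList(
                "Books, 10.5", "books,4.5", "broken line", "toys,abc", "Toys,3");
        System.out.println(run(source)); // prints {books=15.0, toys=3.0}
    }
}
```

Dedicated libraries add what this sketch lacks, such as restarting failed jobs, processing in chunks, and scheduling.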
Additionally, there are libraries for distributed data processing such as:
- Apache Hadoop (http://hadoop.apache.org/)
- Apache Spark (http://spark.apache.org/)
- Apache Flink (https://flink.apache.org/)
We will talk about distributed data processing in Chapter 9, Scaling Data Science.