Data processing libraries

The Java standard library is rich and offers many tools for data processing, such as collections, I/O tools, data streams, and facilities for parallel task execution.
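
As a rough illustration of how these standard tools combine, the following sketch counts word frequencies using collections and a parallel stream. The class name WordCountSketch and the sample sentences are made up for this example; it is one possible way to do it, not a recommended design.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical example class, not part of any library.
class WordCountSketch {

    // Counts how often each word occurs, ignoring case.
    // parallelStream() lets the runtime split the work across threads.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(
                        Function.identity(),     // group equal words together
                        TreeMap::new,            // keep the result sorted by word
                        Collectors.counting())); // count the members of each group
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "that is the question");
        System.out.println(countWords(lines));
    }
}
```

For inputs this small the parallel stream brings no benefit; it is used here only to show that switching between sequential and parallel execution is a one-word change.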

There are also very powerful extensions to the standard library, such as Google Guava and Apache Commons IO.

We will cover both the standard API for data processing and its extensions in Chapter 2, Data Processing Toolbox. In this book, we will use Maven to include external libraries such as Google Guava or Apache Commons IO. Maven is a dependency management tool that lets us specify external dependencies with a few lines of XML. For example, to add Google Guava, it is enough to declare the following dependency in pom.xml:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>19.0</version>
</dependency>

Once the dependency is declared, Maven goes to the Maven Central repository and downloads the specified version. The best way to find dependency snippets for pom.xml (such as the previous one) is to search at https://mvnrepository.com or with your favorite search engine.

Java provides an easy way to access databases through Java Database Connectivity (JDBC), a unified database access API. JDBC makes it possible to connect to virtually any relational database that supports SQL, such as MySQL, MS SQL, Oracle, PostgreSQL, and many others. This lets us move data manipulation from Java code to the database side.
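
A minimal sketch of what this looks like with the java.sql API follows. The products table, its price column, and the class name JdbcSketch are assumptions made for illustration; a real program also needs the JDBC driver of the chosen database on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical example class; table and column names are assumptions.
class JdbcSketch {

    // Counts the rows of a "products" table whose price exceeds minPrice.
    // The filtering and counting run inside the database, not in Java.
    static long countExpensiveProducts(String jdbcUrl, double minPrice) throws SQLException {
        String sql = "SELECT COUNT(*) FROM products WHERE price > ?";
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setDouble(1, minPrice); // bind the value instead of concatenating SQL
            try (ResultSet rs = stmt.executeQuery()) {
                rs.next(); // COUNT(*) yields exactly one row
                return rs.getLong(1);
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length != 1) {
            System.err.println("usage: JdbcSketch <jdbc-url>");
            return;
        }
        System.out.println(countExpensiveProducts(args[0], 100.0));
    }
}
```

The same Java code works against any of the databases listed above; only the JDBC URL and the driver change.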

When it is not possible to use a database for handling tabular data, we can use DataFrame libraries to do it directly in Java. The DataFrame is a data structure that originally comes from R, and it makes it easy to manipulate tabular data in the program without resorting to an external database.

For example, with DataFrames it is possible to filter rows based on some condition, apply the same operation to each element of a column, group by some condition or join with another DataFrame. Additionally, some data frame libraries make it easy to convert tabular data to a matrix form so that the data can be used by machine learning algorithms. 
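
To make these operations concrete, here is a sketch that imitates them with plain Java streams rather than an actual data frame library. The Row class, its columns (city, price, quantity), and all method names are hypothetical; a real data frame library would offer its own, more compact API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical example class; this is not the API of any data frame library.
class TableOpsSketch {

    // One row of an assumed table with columns (city, price, quantity).
    static final class Row {
        final String city;
        final double price;
        final int quantity;

        Row(String city, double price, int quantity) {
            this.city = city;
            this.price = price;
            this.quantity = quantity;
        }
    }

    // Filter: keep only the rows whose price exceeds the threshold.
    static List<Row> filterByMinPrice(List<Row> rows, double minPrice) {
        return rows.stream().filter(r -> r.price > minPrice).collect(Collectors.toList());
    }

    // Column-wise operation: multiply every price by the same factor.
    static List<Row> scalePrices(List<Row> rows, double factor) {
        return rows.stream()
                .map(r -> new Row(r.city, r.price * factor, r.quantity))
                .collect(Collectors.toList());
    }

    // Group by: average price per city, sorted by city name.
    static Map<String, Double> averagePriceByCity(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                (Row r) -> r.city,
                TreeMap::new,
                Collectors.averagingDouble((Row r) -> r.price)));
    }

    // Matrix form: the numeric columns (price, quantity) as a double[][].
    static double[][] toMatrix(List<Row> rows) {
        return rows.stream()
                .map(r -> new double[] {r.price, r.quantity})
                .toArray(double[][]::new);
    }
}
```

A join with another table could be sketched the same way by first indexing one side in a Map keyed on the join column.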

There are a few data frame libraries available in Java.

We will also cover databases and data frames in Chapter 2, Data Processing Toolbox, and we will use DataFrames throughout the book.

There are also more complex data processing libraries such as Spring Batch (http://projects.spring.io/spring-batch/). They make it possible to create complex data pipelines (called ETL, for Extract-Transform-Load) and to manage their execution.
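
The Extract-Transform-Load idea itself can be sketched in a few lines of plain Java. The following is not the Spring Batch API; the class EtlSketch and its run method are hypothetical and only show the three stages wired together, without the scheduling, restart, and monitoring features a real pipeline library adds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical example class; not related to the Spring Batch API.
class EtlSketch {

    // Runs one extract-transform-load pass and returns the number of records loaded.
    static <I, O> int run(Supplier<List<I>> extract, Function<I, O> transform, Consumer<O> load) {
        int loaded = 0;
        for (I record : extract.get()) {
            load.accept(transform.apply(record));
            loaded++;
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<Integer> sink = new ArrayList<>();
        int n = EtlSketch.<String, Integer>run(
                () -> Arrays.asList(" 1", "2 ", " 3 "), // extract: raw text records
                raw -> Integer.parseInt(raw.trim()),    // transform: clean and parse
                sink::add);                             // load: write to the target
        System.out.println(n + " records loaded: " + sink);
    }
}
```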

Additionally, there are libraries for distributed data processing. We will talk about distributed data processing in Chapter 9, Scaling Data Science.
