官术网_书友最值得收藏!

Data processing libraries

The standard Java library is very rich and offers a lot of tools for data processing, such as collections, I/O tools, data streams, and means of parallel task execution. 

There are very powerful extensions to the standard library such as:

We will cover both the standard API for data processing and its extensions in Chapter 2Data Processing Toolbox. In this book, we will use Maven for including external libraries such as Google Guava or Apache Commons IO. It is a dependency management tool and allows to specify the external dependencies with a few lines of XML code. For example, to add Google Guava, it is enough to declare the following dependency in pom.xml:

<dependency> 
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>19.0</version>
</dependency>

When we do it, Maven will go to the Maven Central repository and download the dependency of the specified version. The best way to find the dependency snippets for pom.xml (such as the previous one) is to use the search at https://mvnrepository.com or your favorite search engine.

Java gives an easy way to access databases through Java Database Connectivity (JDBC)--a unified database access protocol. JDBC makes it possible to connect virtually any relational database that supports SQL, such as MySQL, MS SQL, Oracle, PostgreSQL, and many others. This allows moving the data manipulation from Java to the database side.

When it is not possible to use a database for handling tabular data, then we can use DataFrame libraries for doing it directly in Java. The DataFrame is a data structure that originally comes from R and it allows to easily manipulate textual data in the program, without resorting to external database.

For example, with DataFrames it is possible to filter rows based on some condition, apply the same operation to each element of a column, group by some condition or join with another DataFrame. Additionally, some data frame libraries make it easy to convert tabular data to a matrix form so that the data can be used by machine learning algorithms. 

There are a few data frame libraries available in Java. Some of them are as follows:

We will also cover databases and data frames in Chapter 2, Data Processing Toolbox and we will use DataFrames throughout the book. 

There are more complex data processing libraries such as Spring Batch (http://projects.spring.io/spring-batch/). They allow creating complex data pipelines (called ETLs from Extract-Transform-Load) and manage their execution.

Additionally, there are libraries for distributed data processing such as:

We will talk about distributed data processing in Chapter 9Scaling Data Science.

主站蜘蛛池模板: 滨州市| 黑龙江省| 大港区| 义马市| 盐城市| 金昌市| 弋阳县| 鄂温| 永城市| 开远市| 禄劝| 仙居县| 泸州市| 承德县| 德庆县| 波密县| 乌鲁木齐市| 永安市| 莱西市| 江门市| 昌邑市| 夏津县| 三门县| 天祝| 南溪县| 唐河县| 石城县| 五常市| 易门县| 富源县| 巴东县| 文成县| 衡阳县| 柞水县| 云梦县| 绥棱县| 库尔勒市| 雷波县| 太和县| 山阳县| 故城县|