PG凤凰传奇

書名： Scala for Data Science
作者名： Pascal Bugnion
本章字數： 1575字
更新時間： 2021-07-23 14:33:09

First steps with JDBC

Let's start by connecting to JDBC from the command line. To follow with the examples, you will need access to a running MySQL server. If you added the MySQL connector to the list of dependencies, open a Scala console by typing the following command:

$ sbt console

Let's import JDBC:

scala> import java.sql._
import java.sql._

We then need to tell JDBC to use a specific connector. This is normally done using reflection, loading the driver at runtime:

scala> Class.forName("com.mysql.jdbc.Driver")
Class[_] = class com.mysql.jdbc.Driver

This loads the appropriate driver into the namespace at runtime. If this seems somewhat magical to you, it's probably not worth worrying about exactly how this works. This is the only example of reflection that we will consider in this book, and it is not particularly idiomatic Scala.

Connecting to a database server

Having specified the SQL connector, we can now connect to a database. Let's assume that we have a database called test on host 127.0.0.1, listening on port 3306. We create a connection as follows:

scala> val connection = DriverManager.getConnection(
 "jdbc:mysql://127.0.0.1:3306/test",
 "root", // username when connecting
 "" // password
)
java.sql.Connection = com.mysql.jdbc.JDBC4Connection@12e78a69

The first argument to getConnection is a URL-like string with jdbc:mysql://host[:port]/database. The second and third arguments are the username and password. Pass in an empty string if you can connect without a password.

Creating tables

Now that we have a database connection, let's interact with the server. For these examples, you will find it useful to have a MySQL shell open (or a MySQL GUI such as MySQLWorkbench) as well as the Scala console. You can open a MySQL shell by typing the following command in a terminal:

$ mysql

As an example, we will create a small table to keep track of famous physicists. In a mysql shell, we would run the following command:

mysql> USE test;
mysql> CREATE TABLE physicists (
 id INT(11) AUTO_INCREMENT PRIMARY KEY,
 name VARCHAR(32) NOT NULL
);

To achieve the same with Scala, we send a JDBC statement to the connection:

scala> val statementString = """
CREATE TABLE physicists (
 id INT(11) AUTO_INCREMENT PRIMARY KEY,
 name VARCHAR(32) NOT NULL
)
"""

scala> val statement = connection.prepareStatement(statementString)
PreparedStatement = JDBC4PreparedStatement@c983201: CREATE TABLE ...

scala> statement.executeUpdate()
results: Int = 0

Let's ignore the return value of executeUpdate for now.

Inserting data

Now that we have created a table, let's insert some data into it. We can do this with a SQL INSERT statement:

scala> val statement = connection.prepareStatement("""
 INSERT INTO physicists (name) VALUES ('Isaac Newton')
""")

scala> statement.executeUpdate()
Int = 1

In this case, executeUpdate returns 1. When inserting rows, it returns the number of rows that were inserted. Similarly, if we had used a SQL UPDATE statement, this would return the number of rows that were updated. For statements that do not manipulate rows directly (such as the CREATE TABLE statement in the previous section), executeUpdate just returns 0.

Let's just jump into a mysql shell to verify the insertion performed correctly:

mysql> select * from physicists ;
+----+--------------+
| id | name |
+----+--------------+
| 1 | Isaac Newton |
+----+--------------+
1 row in set (0.00 sec)

Let's quickly summarize what we have seen so far: to execute SQL statements that do not return results, use the following:

val statement = connection.prepareStatement("SQL statement string")
statement.executeUpdate()

In the context of data science, we frequently need to insert or update many rows at a time. For instance, we might have a list of physicists:

scala> val physicistNames = List("Marie Curie", "Albert Einstein", "Paul Dirac")

We want to insert all of these into the database. While we could create a statement for each physicist and send it to the database, this is quite inefficient. A better solution is to create a batch of statements and send them to the database together. We start by creating a statement template:

scala> val statement = connection.prepareStatement("""
 INSERT INTO physicists (name) VALUES (?)
""")
PreparedStatement = JDBC4PreparedStatement@621a8225: INSERT INTO physicists (name) VALUES (** NOT SPECIFIED **)

This is identical to the previous prepareStatement calls, except that we replaced the physicist's name with a ? placeholder. We can set the placeholder value with the statement.setString method:

scala> statement.setString(1, "Richard Feynman")

This replaces the first placeholder in the statement with the string Richard Feynman:

scala> statement
com.mysql.jdbc.JDBC4PreparedStatement@5fdd16c3:
INSERT INTO physicists (name) VALUES ('Richard Feynman')

Note that JDBC, somewhat counter-intuitively, counts the placeholder positions from 1 rather than 0.

We have now created the first statement in the batch of updates. Run the following command:

scala> statement.addBatch()

By running the preceding command, we initiate a batch insert: the statement is added to a temporary buffer that will be executed when we run the executeBatch method. Let's add all the physicists in our list:

scala> physicistNames.foreach { name => 
 statement.setString(1, name)
 statement.addBatch()
}

We can now execute all the statements in the batch:

scala> statement.executeBatch
Array[Int] = Array(1, 1, 1, 1)

The return value of executeBatch is an array of the number of rows altered or inserted by each item in the batch.

Note that we used statement.setString to fill in the template with a particular name. The PreparedStatement object has setXXX methods for all basic types. To get a complete list, read the PreparedStatement API documentation (http://docs.oracle.com/javase/7/docs/api/java/sql/PreparedStatement.html).

Reading data

Now that we know how to insert data into a database, let's look at the converse: reading data. We use SQL SELECT statements to query the database. Let's do this in the MySQL shell first:

mysql> SELECT * FROM physicists;
+----+-----------------+
| id | name |
+----+-----------------+
| 1 | Isaac Newton |
| 2 | Richard Feynman |
| 3 | Marie Curie |
| 4 | Albert Einstein |
| 5 | Paul Dirac |
+----+-----------------+
5 rows in set (0.01 sec)

To extract this information in Scala, we define a PreparedStatement:

scala> val statement = connection.prepareStatement("""
 SELECT name FROM physicists
""")
PreparedStatement = JDBC4PreparedStatement@3c577c9d:
SELECT name FROM physicists

We execute this statement by running the following command:

scala> val results = statement.executeQuery()
results: java.sql.ResultSet = com.mysql.jdbc.JDBC4ResultSet@74a2e158

This returns a JDBC ResultSet instance. The ResultSet is an abstraction representing a set of rows from the database. Note that we used statement.executeQuery rather than statement.executeUpdate. In general, one should execute statements that return data (in the form of ResultSet) with executeQuery. Statements that modify the database without returning data (insert, create, alter, or update statements, among others) are executed with executeUpdate.

The ResultSet object behaves somewhat like an iterator. It exposes a next method that advances itself to the next record, returning true if there are records left in ResultSet:

scala> results.next // Advance to the first record
Boolean = true

When the ResultSet instance points to a record, we can extract fields in this record by passing in the field name:

scala> results.getString("name")
String = Isaac Newton

We can also extract fields using positional arguments. The fields are indexed from one:

scala> results.getString(1) // first positional argument
String = Isaac Newton

When we are done with a particular record, we call the next method to advance the ResultSet to the next record:

scala> results.next // advances the ResultSet by one record
Boolean = true

scala> results.getString("name")
String = Richard Feynman

A ResultSet object supports the getXXX(fieldName) methods to access the fields of a record and a next method to advance to the next record in the result set.

One can iterate over a result set using a while loop:

scala> while(results.next) { println(results.getString("name")) }
Marie Curie
Albert Einstein
Paul Dirac

Tip

A word of warning applies to reading fields that are nullable. While one might expect JDBC to return null when faced with a null SQL field, the return type depends on the getXXX command used. For instance, getInt and getLong will return 0 for any field that is null. Similarly, getDouble and getFloat return 0.0. This can lead to some subtle bugs in code. In general, one should be careful with getters that return Java value types (int, long) rather than objects. To find out if a value is null in the database, query it first with getInt (or getLong or getDouble, as appropriate), then use the wasNull method that returns a Boolean if the last read value was null:

scala> rs.getInt("field")
0
scala> rs.wasNull // was the last item read null?
true

This (surprising) behavior makes reading from ResultSet instances error-prone. One of the goals of the second part of this chapter is to give you the tools to build an abstraction layer on top of the ResultSet interface to avoid having to call methods such as getInt directly.

Reading values directly from ResultSet objects feels quite unnatural in Scala. We will look, further on in this chapter, at constructing a layer through which you can access the result set using type classes.

We now know how to read and write to a database. Having finished with the database for now, we close the result sets, prepared statements, and connections:

scala> results.close

scala> statement.close

scala> connection.close

While closing statements and connections is not important in the Scala shell (they will get closed when you exit), it is important when you run programs; otherwise, the objects will persist, leading to "out of memory exceptions". In the next sections, we will look at establishing connections and statements with the loan pattern, a design pattern that closes a resource automatically when we finish using it.

官术网_书友最值得收藏!

Scala for Data Science

First steps with JDBC

Connecting to a database server

Creating tables

Inserting data

Reading data

Tip