- Java 9 Programming Blueprints
- Jason Lee
Modern database access with JPA
Let's stop here for a moment. Remember our discussion of the Java Persistence API (JPA) and databases? This is where we see it come into play. With JPA, interactions with the database are done via the EntityManager interface, which we retrieve from the cleverly named EntityManagerFactory. It is important to note that EntityManager instances are not thread-safe, so they should not be shared between threads. That's why we didn't create one in the constructor and pass it around. Ours is, of course, a local variable, so we need not worry too much unless and until we decide to pass it as a parameter to another method, which we are doing here. As we will see in a moment, everything happens in the same thread, so we will not have to worry about thread-safety issues as the code stands now.
With our EntityManager, we call the getDuplicates() method and pass the manager and field name, fileName. This is what that method looks like:
private List<FileInfo> getDuplicates(EntityManager em, String fieldName) {
    List<FileInfo> files = em.createQuery(
        DUPLICATE_SQL.replace("%FIELD%", fieldName),
        FileInfo.class).getResultList();
    return files;
}
This is a fairly straightforward use of the Java Persistence API: we're creating a query, telling it the type we want returned, and getting back a List of FileInfo references. The createQuery() method creates a TypedQuery object, on which we call getResultList() to retrieve the results, which gives us our List<FileInfo>.
Before we go any further, we need to have a short primer on the Java Persistence API. JPA is what is known as an object-relational mapping (ORM) tool. It provides an object-oriented, type-safe, and database-independent way of storing data in, typically, a relational database. The specification/library allows application authors to define their data models using concrete Java classes, then persist and/or read them with little thought about the mechanics specific to the database currently being used. (The developer isn't completely shielded from database concerns--and it's arguable as to whether or not he or she should be--but those concerns are greatly lessened as they are abstracted away behind the JPA interfaces). The process of acquiring a connection, creating the SQL, issuing it to the server, processing results, and more are all handled by the library, allowing a greater focus on the business of the application rather than the plumbing. It also allows a high degree of portability between databases, so applications (or libraries) can be easily moved from one system to another with minimal change (usually restricted to configuration changes).
At the heart of JPA is Entity, the business object (or domain model, if you prefer) that models the data for the application. This is expressed in the Java code as a plain old Java object (POJO), which is marked up with a variety of annotations. A complete discussion of all of those annotations (or the API as a whole) is outside the scope of this book, but we'll use enough of them to get you started.
With that basic explanation given, let's take a look at our one and only entity--the FileInfo class:
@Entity
public class FileInfo implements Serializable {
    @GeneratedValue
    @Id
    private int id;
    private String fileName;
    private String path;
    private long size;
    private String hash;
}
This class has five properties. The only one that needs special attention is id. This property holds the primary key value for each row, so we annotate it with @Id. We also annotate this field with @GeneratedValue to indicate that we have a simple primary key for which we'd like the system to generate a value. This annotation has two properties: strategy and generator. The default value for strategy is GenerationType.AUTO, which we happily accept here. Other options include IDENTITY, SEQUENCE, and TABLE. In more complex uses, you may want to specify a strategy explicitly, which allows you to fine-tune how the key is generated (for example, the starting number, the allocation size, the name of the sequence or table, and so on). By choosing AUTO, we're telling JPA to choose the appropriate generation strategy for our target database. If you specify a strategy other than AUTO, you will also need to specify the details for the generator, using @SequenceGenerator for SEQUENCE and @TableGenerator for TABLE. You will also need to give the ID of the generator to the @GeneratedValue annotation using the generator attribute. We're using the default, so we need not specify a value for this attribute.
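To make the non-default case concrete, here is a hypothetical variation of the entity that opts for an explicit SEQUENCE strategy; the generator name, sequence name, and allocation size shown are illustrative assumptions, not part of the book's code:

```java
@Entity
public class FileInfo implements Serializable {
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE,
                    generator = "file_info_seq")
    @SequenceGenerator(name = "file_info_seq",
                       sequenceName = "FILE_INFO_SEQ",
                       allocationSize = 50)
    private int id;
    // ... remaining fields as before
}
```

Note how the generator attribute of @GeneratedValue refers to the name given in @SequenceGenerator; that link is what our AUTO-based code gets to skip.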
The next four fields are the pieces of data we have identified that we need to capture. Note that if we do not need to specify anything special about the mapping of these fields to the database columns, no annotations are necessary. However, if we would like to change the defaults, we can apply the @Column annotation and set the appropriate attribute, which can be one or more of columnDefinition (used to help generate the DDL for the column), insertable, length, name, nullable, precision, scale, table, unique, and updatable. Again, we're happy with the defaults.
JPA also requires each property to have a getter and a setter; the specification seems to be worded oddly, which has led to some ambiguity as to whether or not this is a hard requirement, and different JPA implementations handle this differently, but it's certainly safer to provide both as a matter of practice. If you need a read-only property, you can experiment with either no setter, or simply a no-op method. We haven't shown the getters and setters here, as there is nothing interesting about them. We have also omitted the IDE-generated equals() and hashCode() methods.
To help demonstrate the module system, we've put our entity in a com.steeplesoft.dupefind.lib.model subpackage. We'll tip our hand a bit and go ahead and announce that this class will be used by both our CLI and GUI modules, so we'll need to update our module definition as follows:
module dupefind.lib {
    exports com.steeplesoft.dupefind.lib;
    exports com.steeplesoft.dupefind.lib.model;
    requires java.logging;
    requires javax.persistence;
}
That's all there is to our entity, so let's turn our attention back to our application logic. The createQuery() call deserves a bit of discussion. Typically, when using JPA, queries are written in what is called JPQL (the Java Persistence Query Language). It looks very much like SQL, but has a more object-oriented feel to it. For example, if we wanted to query for every FileInfo record in the database, we would do so with this query:
SELECT f FROM FileInfo f
I have put the keywords in all caps, the variable names in lowercase, and the entity name in camel case. This is mostly a matter of style, but while the query keywords are case-insensitive, JPA does require that the case of the entity name match that of the Java class it represents. You must also specify an alias, or identification variable, for the entity, which we simply call f.
To get a specific FileInfo record, you can specify a WHERE clause as follows:
SELECT f FROM FileInfo f WHERE f.fileName = :name
With this query, we can filter the results just as we would in SQL, and, just as in SQL, we specify a parameter. The parameter can either be named, as we've done here, or positional, written with a question mark. If you use a name, you set the parameter value on the query using that name. If you use the question mark, you must set the parameter using its index in the query. For small queries, this is usually fine, but for larger, more complex queries, I would suggest using names so that you don't have to manage index values, as that's almost guaranteed to cause a bug at some point. Setting a named parameter can look something like this:
Query query = em.createQuery(
    "SELECT f FROM FileInfo f WHERE f.fileName = :name");
query.setParameter("name", "test3.txt");
query.getResultList().stream() //...
With that said, let's take a look at our query:
SELECT f
FROM FileInfo f,
    (SELECT s.%FIELD%
        FROM FileInfo s
        GROUP BY s.%FIELD%
        HAVING (COUNT(s.%FIELD%) > 1)) g
WHERE f.%FIELD% = g.%FIELD%
    AND f.%FIELD% IS NOT NULL
ORDER BY f.fileName, f.path
This query is moderately complicated, so let's break it down and see what's going on. First, in our SELECT clause, we specify only f, which is the identification variable of the entity we are querying for. Next, we select from both the regular table and a temporary table, which is defined by the subselect in the FROM clause. Why are we doing it this way? We need to identify all of the rows that have a duplicate value (fileName, size, or hash). To do that, we use a HAVING clause with the COUNT aggregate function, HAVING COUNT(s.%FIELD%) > 1, which says, in effect, give me all of the rows where this field occurs more than once. The HAVING clause requires a GROUP BY clause, and once that's applied, all of the rows with duplicate values are collapsed down to a single row per value. Once we have that list of rows, we join the real (or physical) table to those results to filter our physical table. Finally, we filter out the null fields in the WHERE clause, then order by fileName and path so that we don't have to do it in our Java code, which is likely to be less efficient than the database--a system designed for exactly such operations.
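The GROUP BY/HAVING logic above can be illustrated, database-free, with a plain Java stream: group the records by a field, keep only the groups with more than one entry, then flatten the survivors back into a row list. This is a hypothetical illustration of the idea, not the book's code:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class DuplicateFilter {
    // Group records by a key, then keep only the records whose key
    // occurs more than once -- the same effect as the GROUP BY/HAVING
    // subselect joined back to the physical table.
    static <T, K> List<T> duplicatesBy(List<T> records, Function<T, K> key) {
        Map<K, List<T>> grouped = records.stream()
                .collect(Collectors.groupingBy(key));   // GROUP BY
        return grouped.values().stream()
                .filter(group -> group.size() > 1)      // HAVING COUNT(...) > 1
                .flatMap(List::stream)                  // join back to the rows
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = List.of("a.txt", "b.txt", "a.txt", "c.txt");
        System.out.println(duplicatesBy(names, n -> n)); // [a.txt, a.txt]
    }
}
```

In the real query the database does this work for us, which is exactly why we push the grouping and ordering to the server instead of doing it in Java.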
You should also note the %FIELD% attribute in the SQL. We'll run the same query for multiple fields, so we've written the query once, and placed a marker in the text that we will replace with the desired field, which is sort of a poor man's template. There are, of course, a variety of ways to do this (and you may have one you find superior), but this is simple and easy to use, so it's perfectly acceptable in this environment.
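The substitution itself is just String.replace() over the constant. A minimal sketch, using an abbreviated stand-in for the DUPLICATE_SQL constant (the real query is the one shown above):

```java
class QueryTemplate {
    // Abbreviated, hypothetical stand-in for the DUPLICATE_SQL constant.
    static final String DUPLICATE_SQL =
            "SELECT f FROM FileInfo f WHERE f.%FIELD% IS NOT NULL";

    // String.replace() substitutes every occurrence of the marker,
    // so one template serves all three fields.
    static String forField(String fieldName) {
        return DUPLICATE_SQL.replace("%FIELD%", fieldName);
    }

    public static void main(String[] args) {
        System.out.println(forField("fileName"));
        // SELECT f FROM FileInfo f WHERE f.fileName IS NOT NULL
    }
}
```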
We should also note that it is, generally speaking, a very bad idea to either concatenate SQL with values or do string replacements like we're doing, but our scenario is a bit different. If we were accepting user input and inserting that into the SQL this way, then we would certainly have a target for an SQL injection attack. In our use here, though, we aren't taking input from users, so this approach should be perfectly safe. In terms of database performance, this shouldn't have any adverse effects either. While we will require three different hard parses (one for each field by which we will filter), this is no different than if we were hardcoding the queries in our source file. Both of those issues, as well as many more, are always good to consider as you write your queries (and why I said the developer is mostly shielded from database concerns).
All of that gets us through the first step, which is identifying all of the files that have the same name. We now need to identify the files that have the same size, which can be done using the following piece of code:
List<FileInfo> files = getDuplicates(em, "fileName");
files.addAll(getDuplicates(em, "size"));
In our call to find duplicate filenames, we declared a local variable, files, to store those results. In finding files with duplicate sizes, we call the same getDuplicates() method, but with the correct field name, and simply add that to files via the List.addAll() method.
We now have a complete list of all of the possible duplicates, so we need to generate the hashes for each of these to see if they are truly duplicates. We will do that with this loop:
em.getTransaction().begin();
files.forEach(f -> calculateHash(f));
em.getTransaction().commit();
In a nutshell, we start a transaction (since we'll be inserting data into the database), then loop over each possible duplicate via List.forEach() and a lambda that calls calculateHash(), passing in the FileInfo instance. Once the loop terminates, we commit the transaction to save our changes.
What does calculateHash() do? Let's take a look:
private void calculateHash(FileInfo file) {
    try {
        MessageDigest messageDigest =
            MessageDigest.getInstance("SHA3-256");
        messageDigest.update(Files.readAllBytes(
            Paths.get(file.getPath())));
        ByteArrayInputStream inputStream =
            new ByteArrayInputStream(messageDigest.digest());
        String hash = IntStream.generate(inputStream::read)
            .limit(inputStream.available())
            .mapToObj(i -> Integer.toHexString(i))
            .map(s -> ("00" + s).substring(s.length()))
            .collect(Collectors.joining());
        file.setHash(hash);
    } catch (NoSuchAlgorithmException | IOException ex) {
        throw new RuntimeException(ex);
    }
}
This simple method encapsulates the work required to read the contents of a file and generate a hash. It requests an instance of MessageDigest using the SHA3-256 hash, which is one of the four new hashes supported by Java 9 (the other three being SHA3-224, SHA3-384, and SHA3-512). Many developers' first thought is to reach for MD5 or SHA-1, but those are no longer considered reliable. Using the new SHA-3 should guarantee we avoid any false positives.
The rest of the method is pretty interesting in terms of how it does its work. First, it reads all of the bytes of the specified file and passes them to MessageDigest.update(), which updates the internal state of the MessageDigest object to give us the hash we want. Next, we create a ByteArrayInputStream that wraps the results of messageDigest.digest().
With our hash ready, we generate a string based on those bytes. We do that by generating a stream via the IntStream.generate() method, using the InputStream we just created as a source. We limit the stream to the bytes available in the inputStream. We convert each byte to a string via Integer.toHexString(), then left-pad it with zeros to two characters, which prevents, for example, the two single-digit hex values E and F from being misread as the single byte EF; then we collect them all into one string using Collectors.joining(). Finally, we take that string value and update the FileInfo object.
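The same bytes-to-hex pipeline can be exercised on its own, feeding the digest from an in-memory byte array instead of a file on disk. This is a self-contained sketch (the class and method names are mine, not the book's), and it requires Java 9 or later for SHA3-256:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class HashDemo {
    // Same digest-then-hex pipeline as calculateHash(), minus the file I/O.
    static String hexHash(byte[] input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA3-256");
            md.update(input);
            ByteArrayInputStream in = new ByteArrayInputStream(md.digest());
            return IntStream.generate(in::read)
                    .limit(in.available())                       // exactly 32 bytes
                    .mapToObj(Integer::toHexString)
                    .map(s -> ("00" + s).substring(s.length()))  // left-pad to 2 chars
                    .collect(Collectors.joining());
        } catch (NoSuchAlgorithmException ex) {
            throw new RuntimeException(ex);
        }
    }

    public static void main(String[] args) {
        String hash = hexHash("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(hash.length()); // 64 hex characters = 32 bytes
    }
}
```

Because every byte is padded to exactly two hex characters, the result is always 64 characters long, regardless of the input.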
The eagle-eyed might notice something interesting: we call FileInfo.setHash() to change the value of the object, but we never tell the system to persist those changes. This is because our FileInfo instance is a managed instance, meaning that we got it from JPA, which is keeping an eye on it, so to speak. Since we retrieved it via JPA, when we make any changes to its state, JPA knows it needs to persist those changes. When we call em.getTransaction().commit() in the calling method, JPA automatically saves those changes to the database.
Once we've calculated the hashes for the potential duplicates (at this point, given that they have duplicate SHA-3 hashes, they are almost certainly actual duplicates), we're ready to gather and report them:
getDuplicates(em, "hash").forEach(f -> coalesceDuplicates(f));
em.close();
We call the same getDuplicates() method to find duplicate hashes, and pass each record to the coalesceDuplicates() method, which will group these in a manner appropriate to report upstream to our CLI or GUI layers, or, perhaps, to any other program consuming this functionality:
private void coalesceDuplicates(FileInfo f) {
    String name = f.getFileName();
    List<FileInfo> dupes = duplicates.get(name);
    if (dupes == null) {
        dupes = new ArrayList<>();
        duplicates.put(name, dupes);
    }
    dupes.add(f);
}
This simple method follows what is likely a very familiar pattern:
- Get a List from a Map based on the key, the filename.
- If the list doesn't exist, create it and add it to the map.
- Add the FileInfo object to the list.
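Since Java 8, Map.computeIfAbsent() collapses that get/null-check/put dance into a single call. A hypothetical standalone sketch (using strings in place of FileInfo to keep it self-contained):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Coalesce {
    final Map<String, List<String>> duplicates = new HashMap<>();

    // Equivalent to coalesceDuplicates(): computeIfAbsent creates and
    // stores the list on first access, then returns it either way.
    void coalesce(String fileName, String path) {
        duplicates.computeIfAbsent(fileName, k -> new ArrayList<>())
                  .add(path);
    }

    public static void main(String[] args) {
        Coalesce c = new Coalesce();
        c.coalesce("a.txt", "/tmp/a.txt");
        c.coalesce("a.txt", "/home/a.txt");
        System.out.println(c.duplicates.get("a.txt").size()); // 2
    }
}
```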
This completes the duplicate file detection. Back in find(), we will call factory.close() to be a good JPA citizen, then return to the calling code. With that, we're ready to build our CLI.