
Concurrent Java with a Future interface

One of the more common and popular APIs is the Future<V> interface. Future is a means to encapsulate an asynchronous calculation. Typically, the Future instance is returned by ExecutorService, which we'll discuss later. The calling code, once it has the reference to Future, can continue to work on other tasks while Future runs in the background in another thread. When the caller is ready for the results of Future, it calls Future.get(). If Future has finished its work, the call returns immediately with the results. If, however, Future is still working, calls to get() will block until Future completes.
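The blocking behavior of get() can be seen in a minimal, self-contained sketch (the class name and the toy calculation here are ours, not part of the project):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureDemo {
    // Submits a slow calculation, then blocks on get() for the result
    public static int compute() throws Exception {
        ExecutorService es = Executors.newSingleThreadExecutor();
        Future<Integer> future = es.submit(() -> {
            Thread.sleep(100); // simulate a long-running calculation
            return 6 * 7;
        });
        // The caller could continue with other work here while the task runs
        int result = future.get(); // blocks until the task completes
        es.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(compute()); // prints 42
    }
}
```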

For our uses, though, Future isn't the most appropriate choice. Looking over the non-functional requirements, we see that avoiding crashing the system by exhausting the available memory is explicitly listed. As we'll see later, we will satisfy that requirement by storing the data in a lightweight on-disk database, saving the file information as it is retrieved rather than gathering all of the data and then saving it in a post-processing step. Given that, our task won't be returning anything. While there is a way to make that work (declaring the type as Future<?> and returning null), it's not the most natural approach.

Perhaps the most appropriate approach is ExecutorService, which is an Executor that provides additional functionality, such as the ability to create a Future, as discussed earlier, and to manage the termination of the queue. What, then, is an Executor? Executor is a mechanism for executing a Runnable that is more robust than simply calling new Thread(runnable).start(). The interface itself is very basic, consisting only of the execute(Runnable) method, so its value is not immediately apparent just from looking at the Javadoc. If, however, you look at ExecutorService, the interface that all of the Executor implementations provided by the JDK implement, as well as the various implementations themselves, their value becomes more apparent. Let's take a quick survey now.

Looking at the Executors class, we can see five different types of Executor implementations: a cached thread pool, a fixed-size thread pool, a scheduled thread pool, a single thread executor, and a work-stealing thread pool. With the exception of the single thread executor, each of these can be instantiated directly (ThreadPoolExecutor, ScheduledThreadPoolExecutor, and ForkJoinPool), but the JDK authors urge users to use the convenience methods on the Executors class instead. That said, what is each of these options, and why might you choose one?

  • Executors.newCachedThreadPool(): This returns an Executor that provides a pool of cached threads. As tasks come in, the Executor will attempt to find an unused thread to execute the task with. If one cannot be found, a new Thread is created and the work begins. When a task is complete, the Thread is returned to the pool to await reuse. After approximately 60 seconds, unused threads are destroyed and removed from the pool, which prevents resources from being allocated and never released. Care must be taken with this Executor, though, as the thread pool is unbounded, which means that under heavy use, the system could be overwhelmed by active threads.
  • Executors.newFixedThreadPool(int nThreads): This method returns an Executor similar to the one previously mentioned, with the exception that the thread pool is bounded to at most nThreads.
  • Executors.newScheduledThreadPool(int corePoolSize): This Executor is able to schedule tasks to run after an optional initial delay and then periodically, based on the delay and TimeUnit value. See, for example, the schedule(Runnable command, long delay, TimeUnit unit) method.
  • Executors.newSingleThreadExecutor(): This method will return an Executor that will use a single thread to execute the tasks submitted to it. Tasks are guaranteed to be executed in the order in which they were submitted.
  • Executors.newWorkStealingPool(): This method will return a so-called work-stealing Executor, which is of type ForkJoinPool. The tasks submitted to this Executor are written in such a way as to be able to divide up the work among additional worker threads until the size of the work is under a user-defined threshold.
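To make the survey a bit more concrete, here is a small sketch (names and the toy task are ours) demonstrating the ordering guarantee of the single thread executor:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SingleThreadDemo {
    // Tasks submitted to a single thread executor run one at a time,
    // in the order they were submitted
    public static List<Integer> runInOrder(int count) throws InterruptedException {
        ExecutorService single = Executors.newSingleThreadExecutor();
        List<Integer> results = Collections.synchronizedList(new ArrayList<>());
        for (int i = 0; i < count; i++) {
            final int n = i;
            single.execute(() -> results.add(n));
        }
        single.shutdown();
        single.awaitTermination(10, TimeUnit.SECONDS);
        return results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runInOrder(5)); // [0, 1, 2, 3, 4]
    }
}
```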

Given our non-functional requirements, the fixed-size ThreadPoolExecutor seems to be the most appropriate. One configuration option we'll need to support, though, is the option to force the generation of hashes for every file found. Based on the preceding algorithm, only files that have duplicate names or sizes will be hashed. However, users may want a more thorough analysis of their file specification and would like to force a hash on every file. We'll implement this using the work-stealing (or fork/join) pool.
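We'll see the actual hashing task later; as a purely illustrative sketch of how fork/join tasks divide their work until it falls under a threshold (this summing example is ours, not the project's code):

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative only: a RecursiveTask that splits an array sum in half
// until each piece is under a threshold, the shape of work a
// work-stealing pool is designed for
public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;
    private final long[] data;
    private final int start, end;

    public SumTask(long[] data, int start, int end) {
        this.data = data;
        this.start = start;
        this.end = end;
    }

    @Override
    protected Long compute() {
        if (end - start <= THRESHOLD) {
            long sum = 0;
            for (int i = start; i < end; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (start + end) / 2;
        SumTask left = new SumTask(data, start, mid);
        SumTask right = new SumTask(data, mid, end);
        left.fork(); // hand half the work back to the pool
        return right.compute() + left.join();
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        Arrays.fill(data, 1L);
        long total = ForkJoinPool.commonPool()
            .invoke(new SumTask(data, 0, data.length));
        System.out.println(total); // 10000
    }
}
```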

With our threading approach selected, let's take a look at the entry point for the library, a class we'll call FileFinder. Since this is our entry point, it will need to know where we want to search and what we want to search for. That will give us the instance variables, sourcePaths and patterns:

    private final Set<Path> sourcePaths = new HashSet<>(); 
    private final Set<String> patterns = new HashSet<>(); 

We're declaring the variables as private, as that is a good object-oriented practice. We're also declaring them final, to help avoid subtle bugs where these variables are assigned new values, resulting in the unexpected loss of data. Generally speaking, I find it to be a good practice to mark variables as final by default to prevent such subtle bugs. In the case of instance variables in a class like this, a variable can only be declared final if it is either immediately assigned a value, as we are doing here, or if it is given a value in the class' constructors.
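A tiny illustration (the class and field names here are ours) of the two ways a final instance field can legally receive its value:

```java
import java.util.HashSet;
import java.util.Set;

public class FinalFieldsDemo {
    // Assigned immediately at declaration
    private final Set<String> patterns = new HashSet<>();
    // Assigned exactly once, in the constructor
    private final String name;

    public FinalFieldsDemo(String name) {
        this.name = name;
        // this.name = "other"; // would not compile: name is already assigned
    }

    public String getName() {
        return name;
    }

    public Set<String> getPatterns() {
        return patterns;
    }

    public static void main(String[] args) {
        FinalFieldsDemo demo = new FinalFieldsDemo("dupefind");
        System.out.println(demo.getName()); // dupefind
    }
}
```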

We also want to define our ExecutorService now:

    private final ExecutorService es = 
      Executors.newFixedThreadPool(5); 

We have somewhat arbitrarily chosen to limit our thread pool to five threads, as this seems to be a fair balance between providing a sufficient number of worker threads for heavy requests and not allocating a large number of threads that may go unused in most cases. In our case, it is probably a minor concern, but it's certainly something to keep in mind.

Next, we need to provide a means to store any duplicates found. Consider the following lines of code as an example:

    private final Map<String, List<FileInfo>> duplicates =  
      new HashMap<>(); 

We'll see more details later, but, for now, all that we need to note is that this is a Map of List<FileInfo> objects, keyed by the file hash.
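To illustrate the shape of that structure, here is a small stand-alone sketch (using plain path strings in place of the FileInfo objects we haven't defined yet):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateMapDemo {
    // Groups file paths under their hash; two paths sharing a hash end up
    // in the same list, marking them as duplicates of each other
    public static Map<String, List<String>> groupByHash(
            Map<String, String> pathToHash) {
        Map<String, List<String>> duplicates = new HashMap<>();
        // computeIfAbsent creates the list the first time a hash is seen
        pathToHash.forEach((path, hash) ->
            duplicates.computeIfAbsent(hash, h -> new ArrayList<>()).add(path));
        return duplicates;
    }

    public static void main(String[] args) {
        Map<String, String> hashes = new HashMap<>();
        hashes.put("/tmp/a.jpg", "abc123");
        hashes.put("/tmp/b.jpg", "abc123");
        hashes.put("/tmp/c.jpg", "ff99ee");
        System.out.println(groupByHash(hashes).get("abc123").size()); // 2
    }
}
```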

The final variable to make note of is something that might be a bit unexpected--an EntityManagerFactory. You might be asking yourself, what is that? The EntityManagerFactory is an interface for interacting with a persistence unit, as defined by the Java Persistence API (JPA), which is part of the Java Enterprise Edition specification. Fortunately, though, the specification was written in such a way as to mandate that it be usable in a Standard Edition (SE) context like ours.

So, what are we doing with such an API? If you'll look back at the non-functional requirements, we've specified that we want to make sure that the search for duplicate files doesn't exhaust the available memory on the system. For very large searches, it is quite possible that the list of files and their hashes can grow to a problematic size. Couple that with the memory it will take to generate the hashes, which we'll discuss later, and we can very likely run into out-of-memory situations. We will, therefore, be using JPA to save our search information in a simple, light database (SQLite) that will allow us to save our data to the disk. It will also allow us to query and filter the results more efficiently than iterating over in-memory structures repeatedly.

Before we can make use of those APIs, we need to update our module descriptor to let the system know that we now require the persistence modules. Consider the following code snippet as an example:

    module dupefind.lib { 
      exports com.steeplesoft.dupefind.lib; 
      requires java.logging; 
      requires javax.persistence; 
    } 

We've declared to the system that we require both javax.persistence and java.logging, which we'll be using later. As we discussed in Chapter 2, Managing Processes in Java, if any one of these modules is not present, the JVM instance will fail to start.

Perhaps the more important part of the module definition is the exports clause. With this line (there can be zero or more of them), we're telling the system that we are exporting all of the types in the specified package. This line will allow our CLI module, which we'll get into later, to use the classes (as well as interfaces, enums, and so on, if we were to add any) in that module. If a type's package is not exported, consuming modules will be unable to see the type, which we'll also demonstrate later.
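For example, a consuming module would declare its dependency like this (a sketch; the module name dupefind.cli is our assumption for the CLI module we'll build later):

    module dupefind.cli { 
      // Reading dupefind.lib grants access to its exported package, 
      // com.steeplesoft.dupefind.lib; non-exported packages stay invisible 
      requires dupefind.lib; 
    } 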

With that understanding, let's take a look at our constructor:

    public FileFinder() { 
      Map<String, String> props = new HashMap<>(); 
      props.put("javax.persistence.jdbc.url",  
       "jdbc:sqlite:" +  
       System.getProperty("user.home") +  
       File.separator +  
       ".dupfinder.db"); 
      factory = Persistence.createEntityManagerFactory 
       ("dupefinder", props); 
      purgeExistingFileInfo(); 
    } 

To configure the persistence unit, JPA typically uses a persistence.xml file. In our case, though, we'd like a bit more control over where the database file is stored. As you can see in the preceding code, we are constructing the JDBC URL using the user.home system property. We then store that in a Map using the JPA-defined key to specify the URL. This Map is then passed to the createEntityManagerFactory method, and its values override anything set in persistence.xml. This allows us to put the database in the home directory appropriate for the user's operating system.

With our class constructed and configured, it's time to take a look at how we'll find duplicate files:

    public void find() { 
      List<PathMatcher> matchers = patterns.stream() 
       .map(s -> !s.startsWith("**") ? "**/" + s : s) 
       .map(p -> FileSystems.getDefault() 
       .getPathMatcher("glob:" + p)) 
       .collect(Collectors.toList()); 

Our first step is to create a list of PathMatcher instances based on the patterns specified by the user. PathMatcher is a functional interface implemented by objects that attempt to match files and paths. Our instances are retrieved from the FileSystems class.

When requesting a PathMatcher, we have to specify the globbing pattern. As can be seen in the first call to map(), we have to make an adjustment to what the user specified. Typically, a pattern mask is specified simply as something like *.jpg. However, a pattern mask like this won't work in the way that the user expects, in that it will only look in the current directory and not walk down into any subdirectories. To do that, the pattern must be prefixed with **/, which we do in the call to map(). With our adjusted pattern, we request the PathMatcher instance from the system's default FileSystem. Note that we specify the matcher pattern as "glob:" + p because we need to indicate that we are, indeed, specifying a glob pattern rather than a regular expression.
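The same adjustment can be seen in isolation in this small sketch (the class and method names are ours):

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
    // Applies the "**/" adjustment described above before matching,
    // so bare patterns like *.jpg also match in subdirectories
    public static boolean matches(String pattern, String path) {
        String adjusted = !pattern.startsWith("**") ? "**/" + pattern : pattern;
        PathMatcher m = FileSystems.getDefault()
            .getPathMatcher("glob:" + adjusted);
        return m.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        System.out.println(matches("*.jpg", "photos/2017/cat.jpg")); // true
        System.out.println(matches("*.jpg", "photos/notes.txt"));    // false
    }
}
```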

With our matchers prepared, we're ready to start the search. We do that with this code:

    sourcePaths.stream() 
     .map(p -> new FindFileTask(p)) 
     .forEach(fft -> es.execute(fft)); 

Using the Stream API, we map each source path to a lambda that creates an instance of FindFileTask, providing it the source path it will search. Each of these FindFileTask instances will then be passed to our ExecutorService via the execute() method.

The FindFileTask class is the workhorse for this part of the process. It is a Runnable, as we'll be submitting it to the ExecutorService, but it is also a FileVisitor<Path>, as it will be used in walking the file tree, which we do from the run() method:

    @Override 
    public void run() { 
      final EntityTransaction transaction = em.getTransaction(); 
      try { 
        transaction.begin(); 
        Files.walkFileTree(startDir, this); 
        transaction.commit(); 
      } catch (IOException ex) { 
        transaction.rollback(); 
      } 
    } 

Since we will be inserting data into the database via JPA, we'll need to start a transaction as our first step. Since this is an application-managed EntityManager, we have to manage the transaction manually. We acquire a reference to the EntityTransaction instance outside the try/catch block to simplify referencing it. Inside the try block, we start the transaction, start the file walking via Files.walkFileTree(), then commit the transaction if the process succeeds. If it fails--if an Exception was thrown--we roll back the transaction.

The FileVisitor API requires a number of methods, most of which are not too terribly interesting, but we'll show them for clarity's sake:

    @Override 
    public FileVisitResult preVisitDirectory(final Path dir,  
    final BasicFileAttributes attrs) throws IOException { 
      return Files.isReadable(dir) ?  
       FileVisitResult.CONTINUE : FileVisitResult.SKIP_SUBTREE; 
    } 

Here, we tell the system that if the directory is readable, then we continue with walking down that directory. Otherwise, we skip it:

    @Override 
    public FileVisitResult visitFileFailed(final Path file,  
     final IOException exc) throws IOException { 
       return FileVisitResult.SKIP_SUBTREE; 
    } 

The API requires this method to be implemented, but we're not very interested in file read failures, so we simply return a skip result:

    @Override 
    public FileVisitResult postVisitDirectory(final Path dir,  
     final IOException exc) throws IOException { 
       return FileVisitResult.CONTINUE; 
    } 

Much like the preceding method, this method is required, but we're not interested in this particular event, so we signal the system to continue:

    @Override 
    public FileVisitResult visitFile(final Path file, final
BasicFileAttributes attrs) throws IOException { if (Files.isReadable(file) && isMatch(file)) { addFile(file); } return FileVisitResult.CONTINUE; }

Now we've come to a method we're interested in. We will check to make sure that the file is readable, then check to see if it's a match. If it is, we add the file. Regardless, we continue walking the tree. How do we test if the file's a match? Consider the following code snippet as an example:

    private boolean isMatch(final Path file) { 
      return matchers.isEmpty() ? true :  
       matchers.stream().anyMatch((m) -> m.matches(file)); 
    } 

We iterate over the list of PathMatcher instances we passed in to the class earlier. If the List is empty, which means the user didn't specify any patterns, the method's result will always be true. However, if there are items in the List, we use the anyMatch() method on its stream, passing a lambda that checks the Path against each PathMatcher instance.
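The empty-list-matches-everything behavior can be sketched in isolation (class and method names here are ours):

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class MatchDemo {
    // Mirrors isMatch(): an empty matcher list accepts every file;
    // otherwise at least one matcher must match
    public static boolean isMatch(List<PathMatcher> matchers, Path file) {
        return matchers.isEmpty()
            || matchers.stream().anyMatch(m -> m.matches(file));
    }

    public static void main(String[] args) {
        List<PathMatcher> matchers = new ArrayList<>();
        Path file = Paths.get("photos/cat.jpg");
        System.out.println(isMatch(matchers, file)); // true: no patterns given
        matchers.add(FileSystems.getDefault().getPathMatcher("glob:**/*.txt"));
        System.out.println(isMatch(matchers, file)); // false: no matcher fits
    }
}
```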

Adding the file is very straightforward:

    private void addFile(Path file) throws IOException { 
      FileInfo info = new FileInfo(); 
      info.setFileName(file.getFileName().toString()); 
      info.setPath(file.toRealPath().toString()); 
      info.setSize(file.toFile().length()); 
      em.persist(info); 
    } 

We create a FileInfo instance, set the properties, then persist it to the database via em.persist().

With our tasks defined and submitted to ExecutorService, we need to sit back and wait. We do that with the following two method calls:

    es.shutdown(); 
    es.awaitTermination(Integer.MAX_VALUE, TimeUnit.SECONDS); 

The first step is to ask ExecutorService to shut down. The shutdown() method will return immediately, but it will instruct ExecutorService to refuse any new tasks, as well as shut down its threads as soon as they are idle. Without this step, the threads will continue to run indefinitely. Next, we will wait for the service to shut down. We specify the maximum wait time to make sure we give our tasks time to complete. Once this method returns, we're ready to process the results, which is done in the following postProcessFiles() method:

    private void postProcessFiles() { 
      EntityManager em = factory.createEntityManager(); 
      List<FileInfo> files = getDuplicates(em, "fileName"); 