Building the library

The foundational piece of this project is the library that both the CLI and the GUI will consume, so it makes sense to start here. When designing the library (its inputs, outputs, and general behavior), it helps to understand exactly what we want this system to do, so let's take some time to discuss the functional requirements.

As stated in the introduction, we'd like to be able to search for duplicate files in an arbitrary number of directories. We'd also like to be able to restrict the search and comparison to only certain files. If we don't specify a pattern to match, then we want to check every file.
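
To make these first two requirements concrete, here is a minimal sketch of how the file set might be gathered. The class name FileScanSketch, the method collect, and the choice of a glob pattern are assumptions made for illustration only; they are not the library's actual API. A null pattern means that every regular file is checked.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of the library being built in this chapter.
final class FileScanSketch {

    // Walks every root directory and returns the regular files whose names
    // match the glob pattern. A null pattern accepts every regular file.
    static List<Path> collect(List<Path> roots, String globOrNull) {
        PathMatcher matcher = globOrNull == null
                ? null
                : FileSystems.getDefault().getPathMatcher("glob:" + globOrNull);
        List<Path> found = new ArrayList<>();
        for (Path root : roots) {
            // Files.walk holds directory handles open, so close the stream.
            try (Stream<Path> walk = Files.walk(root)) {
                walk.filter(Files::isRegularFile)
                    .filter(p -> matcher == null || matcher.matches(p.getFileName()))
                    .forEach(found::add);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return found;
    }
}
```

The pattern is matched against the filename only, so a pattern such as *.jpg applies at any depth below each root.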

The most important part is how to identify a match. There are, of course, a myriad of ways in which this can be done, but the approach we will use is as follows:

  • Identify files that have the same filename. Suppose you downloaded images from your camera to your computer for safekeeping and later, having forgotten that you had already done so, copied them again somewhere else. You only want one copy, but is the file IMG_9615.JPG in the temp directory the same as the one in your picture backup directory? By identifying files with matching names, we can test them to be sure.
  • Identify files that have the same size. The likelihood of a match here is smaller, but there is still a chance. For example, some photo management software, when importing images from a device, if it finds a file with the same name, will modify the filename of the second file and store both, rather than stopping the import and requiring immediate user intervention. This can result in a large number of files such as IMG_9615.JPG and IMG_9615-1.JPG. This check will help identify these situations.
  • For each candidate found above, we'll generate a hash based on the file contents to determine whether the files are actually a match. If more than one file generates the same hash, the likelihood of those files being identical is extremely high. We will flag these files as potential duplicates.
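
The three steps above can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the implementation we will build: the class name DuplicateSketch, its method names, and the choice of SHA-256 are all hypothetical. The file is hashed through a small fixed-size buffer so that its contents are never held in memory all at once.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the matching steps; names are assumptions.
final class DuplicateSketch {

    // Streams the file through an 8 KB buffer and returns its SHA-256 as hex.
    static String hash(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    static long size(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns groups of two or more files whose contents hash identically.
    static List<List<Path>> findDuplicates(List<Path> files) {
        // Steps 1 and 2: group by filename and by size.
        Map<String, List<Path>> byName = new HashMap<>();
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path file : files) {
            byName.computeIfAbsent(file.getFileName().toString(),
                    k -> new ArrayList<>()).add(file);
            bySize.computeIfAbsent(size(file), k -> new ArrayList<>()).add(file);
        }
        // Only files that share a name or a size with another file are candidates.
        Set<Path> candidates = new LinkedHashSet<>();
        byName.values().stream().filter(g -> g.size() > 1).forEach(candidates::addAll);
        bySize.values().stream().filter(g -> g.size() > 1).forEach(candidates::addAll);

        // Step 3: hash the candidates and keep hashes shared by several files.
        Map<String, List<Path>> byHash = new HashMap<>();
        for (Path candidate : candidates) {
            byHash.computeIfAbsent(hash(candidate), k -> new ArrayList<>()).add(candidate);
        }
        List<List<Path>> duplicates = new ArrayList<>();
        for (List<Path> group : byHash.values()) {
            if (group.size() > 1) {
                duplicates.add(group);
            }
        }
        return duplicates;
    }
}
```

Because only candidates are hashed, a file with a unique name and a unique size is never read at all, which keeps the expensive step to a minimum.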

It's a pretty simple algorithm and should be pretty effective, but we do have a problem, albeit one that's likely not immediately apparent. If you have a large number of files, especially a set with many potential duplicates, processing all of them could take a very long time. We would like to mitigate that as much as possible, which leads us to some non-functional requirements:

  • The program should process files in a concurrent manner so as to minimize, as much as possible, the amount of time it takes to process a large file set
  • This concurrency should be bounded so that the system is not overwhelmed by processing the request
  • Given the potential for a large amount of data, the system must be designed in such a way so as to avoid using up all available RAM and causing system instability
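
One way to satisfy the bounding requirements, shown here only as a sketch, is a thread pool with a fixed number of workers and a bounded work queue. The class name BoundedPoolSketch and its factory method are hypothetical; ThreadPoolExecutor, ArrayBlockingQueue, and CallerRunsPolicy are standard classes in java.util.concurrent. When the queue is full, CallerRunsPolicy makes the submitting thread run the task itself, which slows submission down rather than letting queued work grow without limit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical factory; the sizes are chosen by the caller.
final class BoundedPoolSketch {

    // Creates a pool with exactly 'threads' workers and at most
    // 'queueCapacity' waiting tasks. Tasks rejected because the queue is
    // full are executed by the thread that submitted them.
    static ThreadPoolExecutor newBoundedPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads,                       // core pool size
                threads,                       // maximum pool size
                0L, TimeUnit.MILLISECONDS,     // no keep-alive for extra threads
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```

A reasonable starting point for the thread count might be Runtime.getRuntime().availableProcessors(), though the right value depends on whether the work is bound by disk or by CPU.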

With that fairly modest list of functional and non-functional requirements, we should be ready to begin. Like the last application, let's start by defining our module. In src/main/java, we will create this module-info.java:

    module com.steeplesoft.dupefind.lib { 
      exports com.steeplesoft.dupefind.lib; 
    } 

Initially, the compiler (and the IDE) will complain that the com.steeplesoft.dupefind.lib package does not exist, and the project won't compile. That's fine for the moment, as we'll be creating that package next.

The use of the word concurrency in the non-functional requirements most likely brings to mind the idea of threads. We introduced threads in Chapter 2, Managing Java Processes, so if you are not familiar with them, review that section in the previous chapter.

Our use of threading in this project is different from that in the last, in that we will have a body of work that needs to be done, and, once it's finished, we want the threads to exit. We also need to wait for these threads to finish their work so that we can analyze it. In the java.util.concurrent package, the JDK provides several options to accomplish this.
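
As one illustration of this pattern, and not necessarily the option we will settle on, ExecutorService.invokeAll() submits a batch of tasks and blocks until all of them have completed. The class name RunAllSketch and the method runAll are hypothetical names used only for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper showing "do the work, wait, then let the threads exit".
final class RunAllSketch {

    // Runs every task on a fixed-size pool, waits for all of them to finish,
    // and returns the results in the same order as the tasks were given.
    static <T> List<T> runAll(List<Callable<T>> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll blocks until every task has completed.
            for (Future<T> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            // Lets the worker threads exit once the work is done.
            pool.shutdown();
        }
    }
}
```

Other tools in the same package, such as CountDownLatch or calling shutdown() followed by awaitTermination(), can achieve a similar effect when the tasks do not need to return a value.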
