Time for action – counting frequent words by filtering

On this occasion, you have some plain text files, and you want to know what they say without reading them. You decide to count how many times each word appears in the text, and look at the most frequent words to get an idea of what the files are about. The first of our two tutorials on filtering is about counting the words in a file.

Note

Before starting, you'll need at least one text file to play with. The text file used in this tutorial is named smcng10.txt, and is available for you to download from Packt Publishing's website, www.packtpub.com.

Let's work.

Tip

This section and the following sections have many steps, so feel free to preview the data from time to time. That way, you can check that you are on the right track, and understand what filtering is about as you progress in the design of your transformation.

  1. Create a new transformation.
  2. By using a Text file input step, read your file. The trick here is to set as the Separator a character you do not expect to find in the file, such as |. By doing so, each whole line will be recognized as a single field. Configure the Fields tab by defining a single String field named line.
  3. This particular file has a big header describing its content and origin. We are not interested in those lines, so in the Content tab, for Header, type 378, which is the number of lines that precede the content we're interested in.
  4. From the Transform category of steps, drag to the canvas a Split field to rows step, and create a hop from the Text file input step to this one.
  5. Configure the step as follows:
  6. With this last step selected, do a preview. Your preview window should look as follows:
  7. Close the preview window.
  8. Add a Select values step to remove the line field.
    Note

    It's not mandatory to remove this field, but as it will not be used any longer, removing it will make future previews clearer.

  9. Expand the Flow category of steps, and drag a Filter rows step to the work area.
  10. Create a hop from the last step to the Filter rows step.
  11. Edit the Filter rows step by double-clicking on it.
  12. Click on the <field> textbox to the left of the = sign. The list of fields appears. Select word.
  13. Click on the = sign. A list of operations appears. Select IS NOT NULL.
  14. The window looks like the following screenshot:
  15. Click on OK.
  16. From the Transform category of steps, drag a Sort rows step to the canvas.
  17. Create a hop from the Filter rows step, to the Sort rows step. When asked for the kind of hop, select Main output of step, as shown in the following screenshot:
  18. Use the last step to sort the rows by word (ascending).
  19. From the Statistics category, drag-and-drop a Group by step on the canvas, and add it to the stream, after the Sort rows step.
  20. Configure the grids in the Group by configuration window, as shown in the following screenshot:
  21. With the Group by step selected, do a preview. You will see this:

What just happened?

You read a regular plain file, and counted the words appearing in it.
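
For readers who like to see the same idea outside the graphical designer, the reading stage can be sketched in Python. This is only a rough analogy of the Text file input step, not how the step is implemented: the function name `read_lines`, the `encoding` default, and the dictionary-per-row representation are assumptions made for illustration. The 378 default mirrors the header length used in this tutorial.

```python
from pathlib import Path


def read_lines(path, header_lines=378, encoding="utf-8"):
    """Return each line after the header as a row with a single 'line' field.

    Hypothetical helper: it imitates reading the file with a separator that
    never occurs, so every whole line ends up in one String field.
    """
    text = Path(path).read_text(encoding=encoding)
    # Drop the first `header_lines` lines, keep everything else untouched.
    return [{"line": line} for line in text.splitlines()[header_lines:]]
```

Calling `read_lines("smcng10.txt")` would return one row per line of the book's body, including empty rows for blank lines.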

The first thing you did was read the plain file, and split the lines so that every word became a new row in the dataset. For example, as a consequence of splitting the line:

subsidence; comparison with the Portillo chain.

The following rows were generated:

Thus, a new field named word became the basis for your transformation, and therefore you removed the line field.
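
The splitting stage can be illustrated with a small Python sketch. It assumes the delimiter is a single space, which this text does not state explicitly, and it represents a null value as `None`; the function name `split_field_to_rows` is hypothetical. Under that assumption, punctuation stays attached to the words.

```python
def split_field_to_rows(rows, field="line", new_field="word", delimiter=" "):
    """Emit one output row per piece of `field`, stored under `new_field`."""
    out = []
    for row in rows:
        for piece in row[field].split(delimiter):
            # Assumption: an empty piece is treated as a null word (None).
            out.append({**row, new_field: piece if piece else None})
    return out


rows = split_field_to_rows(
    [{"line": "subsidence; comparison with the Portillo chain."}]
)
# Keep only the new field, as the Select values step does in the tutorial.
words = [row["word"] for row in rows]
print(words)
# ['subsidence;', 'comparison', 'with', 'the', 'Portillo', 'chain.']
```

Note that an empty line produces a single row whose word is null, which is why a filter is useful afterwards.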

Next, you discarded the rows with null words by using a filter with the condition word IS NOT NULL.
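
As a minimal sketch, the same condition can be written in Python as a list comprehension. The name `filter_rows` and the use of `None` to stand for a null value are assumptions for illustration only.

```python
def filter_rows(rows, field="word"):
    """Keep only the rows whose `field` is not null (None in this sketch)."""
    return [row for row in rows if row[field] is not None]


rows = [{"word": "the"}, {"word": None}, {"word": "chain."}]
kept = filter_rows(rows)
print(kept)  # [{'word': 'the'}, {'word': 'chain.'}]
```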

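Then, after sorting the rows by word (the Group by step expects its input to be sorted on the grouping field), you counted the words by using the Group by step you learned in the previous tutorial. Doing it this way, you got a preliminary list of the words in the file, and the number of occurrences of each word.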
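
The sort-then-group pattern has a close analogue in Python's `itertools.groupby`, which also only merges adjacent equal items and therefore needs sorted input. The sketch below uses a hypothetical function name, `count_words`, and a made-up list of words rather than the contents of the tutorial's file.

```python
from itertools import groupby


def count_words(words):
    """Return (word, occurrences) pairs, ordered by word."""
    ordered = sorted(words)  # plays the role of the Sort rows step
    # groupby merges only consecutive equal items, hence the sort above.
    return [(word, sum(1 for _ in group)) for word, group in groupby(ordered)]


counts = count_words(["the", "chain", "the", "of", "the", "chain"])
print(counts)  # [('chain', 2), ('of', 1), ('the', 3)]
```

Skipping the `sorted` call would split the same word into several groups whenever its occurrences are not adjacent, which is the reason the rows are sorted before they are grouped.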
