南宁市财神网络科技有限公司

書名： Pentaho Data Integration Beginner's Guide（Second Edition）
作者名： María Carina Roldán
本章字數： 2122字
更新時間： 2021-07-23 15:46:56

Time for action – getting data from an XML file with information about countries

In this section, you will build an Excel file with basic information about countries. The source will be an XML file that you can download from the book's website.

Open the kettle.properties file located in a folder named .kettle inside your home directory. If you work under Windows, that folder could be C:\Documents and Settings\<your_name>\ or C:\Users\<your_name>\ depending on the Windows version.
If you work under Linux (or similar) or Mac OS, the folder will most probably be /home/<your_name>/.

Note

Note that the .kettle folder is a system folder, and as such, may not display using the GUI file explorer on any OS. You can change the UI settings to display the folder, or use a terminal window.
Add the following line (for Windows systems):
```
LABSOUTPUT=c:/pdi_files/output
```
Or this line (for Linux or similar systems):
```
LABSOUTPUT=/home/your_name/pdi_files/output
```
Make sure that the directory named output exists.
Save the file, restart Spoon and create a new transformation.
Give the transformation a name and save it in the same directory where you have all the other transformations.
From the book's website, download the resources folder containing a file named countries.xml. Save the folder in your working directory. For example, if your transformations are in pdi_labs, the file will be in pdi_labs/resources/.

Note

The previous two steps are important. Don't skip them! If you do, some of the following steps will fail.

Take a look at the file. You can edit it with any text editor, or you can double-click on it to see it within a web explorer. In any case, you will see information about countries. This is just the extract for a single country:

<?xml version="1.0" encoding="UTF-8"?>
<world>
...
  <country>
    <name>Argentina</name>
    <capital>Buenos Aires</capital>
    <language isofficial="T">
      <name>Spanish</name>
      <percentage>96.8</percentage>
    </language>
    <language isofficial="F">
      <name>Italian</name>
      <percentage>1.7</percentage>
    </language>
    <language isofficial="F">
      <name>Indian Languages</name>
      <percentage>0.3</percentage>
    </language>
  </country>
...
</world>

From the Input steps, drag to the canvas a Get data from XML step.
Open the configuration window for this step by double-clicking on it.
In the File or directory textbox, press Ctrl + Space or Shift + command + Space in Mac. A drop-down list appears containing a list of defined variables:
Select Internal.Transformation.Filename.Directory. The textbox is filled with this text.
Complete the text so you can read this: ${Internal.Transformation.Filename.Directory}/resources/countries.xml.
Click on the Add button. The full path is moved to the grid.
Select the Content tab and click on Get XPath nodes.
In the list that appears, select /world/country/language.
Select the Fields tab and fill in the grid as shown in the following screenshot:
Click on Preview rows, and you should see something like the following screenshot:
Click on OK.
From the Output steps, drag to the canvas a Microsoft Excel Output step.
Create a hop from the Get data from XML step to the Microsoft Excel Output step.
Open the configuration window for this step by double-clicking on it.
In the Filename textbox press Ctrl + Space.
From the drop-down list, select ${LABSOUTPUT}.

Note

If you don't see this variable, please verify that you spelled the name correctly in the kettle.properties file, saved the file, and restarted Spoon.
Beside that text, type /countries_info. The complete text should be: ${LABSOUTPUT}/countries_info.
Select the Fields tab and then click on the Get fields button to fill in the grid.
Click on OK. This is your final transformation:
Save and run the transformation.
Check that the file countries_info.xls has been created in the output folder, and contains the information you previewed in the input step.

What just happened?

You got information about countries from an XML file and saved it in a more readable format for the common people: an Excel sheet.

To get the information you used a Get data from XML step. As the source file was taken from a folder relative to the folder where you stored the transformation, you set the directory to ${Internal.Transformation.Filename.Directory}. When the transformation runs, Kettle replaces ${Internal.Transformation.Filename.Directory} with the real path of the transformation, for example, c:/pdi_labs/.

In the same way, you didn't put a fixed value for the path of the final Excel file. As the folder you used ${LABSOUTPUT}. When the transformation ran, Kettle replaced ${LABSOUTPUT} with the value you wrote in the kettle.properties file. Then, the output file was saved in that folder, for example, c:/pdi_files/output.

What is XML?

XML stands for EXtensible Markup Language. It is basically a language designed to describe data. XML files or documents contain information wrapped in tags. Look at this piece of XML taken from the countries file:

<?xml version="1.0" encoding="UTF-8"?>
<world>
...
  <country>
    <name>Argentina</name>
    <capital>Buenos Aires</capital>
    <language isofficial="T">
      <name>Spanish</name>
      <percentage>96.8</percentage>
    </language>
    <language isofficial="F">
      <name>Italian</name>
      <percentage>1.7</percentage>
    </language>
    <language isofficial="F">
      <name>Indian Languages</name>
      <percentage>0.3</percentage>
    </language>
  </country>
...
</world>

The first line in the document is the XML declaration. It defines the XML version of the document, and should always be present.

Below the declaration is the body of the document. The body is a set of nested elements. An element is a logical piece enclosed by a start tag and a matching end tag, for example, <country> </country>.

Within the start tag of an element, you may have attributes. An attribute is a markup construct consisting of a name/value pair, for example, isofficial="F".

This is the most basic terminology related to XML files. If you want to know more about XML, you can visit http://www.w3schools.com/xml/.

PDI transformation files

Despite the KTR extension, PDI transformations are just XML files. As such, you are able to explore them inside and recognize different XML elements. Look the following sample text:

<?xml version="1.0" encoding="UTF-8"?>
<transformation>
  <info>
    <name>hello_world</name>
    <description>My first transformation</description>
    <extended_description>PDI Beginner&apos;s Guide (2nd edition)Chapter 1</extended_description>
...
</transformation>

This is an extract from the hello_world.ktr file. Here you can see the root element named transformation and some inner elements, for example, info and name.

Note that if you copy a step by selecting it in the Spoon work area and press Ctrl + C, and then paste it to a text editor, you can see its XML definition. If you copy it back to the canvas, a new identical step will be added to your transformation.

Getting data from XML files

In order to get data from an XML file, you have to use the Get data from XML input step. To tell PDI which information to get from the file. it is required that you use a particular notation named XPath.

XPath

XPath is a set of rules used for getting information from an XML document. In XPath, XML documents are treated as trees of nodes. There are several kinds of nodes: elements, attributes, and texts are some of them. As an example, world, country, and isofficial are some of the nodes in the sample file.

Among the nodes, there are relationships. A node has a parent, zero or more children, siblings, ancestors, and descendants depending on where the other nodes are in the hierarchy.

In the sample countries file, country is the parent of the elements name, capital and language. These three elements are children of country.

To select a node in an XML document you have to use a path expression relative to a current node.

The following table has some examples of path expressions you may use to specify fields. The examples assume that the current node is language:

Note

Note that the expressions name and ../name are not the same. The first expression selects the name of the language, while the second selects the name of the country.

For more information on XPath, visit the link http://www.w3schools.com/XPath/.

Configuring the Get data from the XML step

In order to specify the name and location of an XML file you have to fill in the File tab just as you do in any file input step. What is different here is how you get the data.

The first thing you have to do is select the path that will identify the current node. This is optimally the repeating node in the file. You select the path by filling in the Loop XPath textbox in the Content tab. You can type it by hand, or you can select it from the list of available paths by clicking on the Get XPath nodes button.

Once you select a path, PDI will generate one row of data for every found path.

In the Time for action – getting data from an XML file with information about countries section, you selected /world/country/language. Then PDI generates one row for each /world/country/language element in the file.

After selecting the loop XPath, you have to specify the fields to get. In order to do that, you have to fill in the grid in the Fields tab by using XPath notation, as explained previously.

Note that if you press the Get fields button, PDI will fill the grid with the child nodes of the current node. If you want to get some other node, you have to type its XPath by hand.

Also note the notation for the attributes. To get an attribute you can use the @ notation as explained, or you can simply type the name of the attribute without @ and select Attribute under the Element column, as you did in this section.

Kettle variables

In the previous section, you used the string ${Internal.Transformation.Filename.Directory} to identify the folder where the current transformations were saved. You also used the string ${LABSOUTPUT} to define the destination folder of the output file.

Both strings, ${Internal.Transformation.Filename.Directory} and ${LABSOUTPUT} are Kettle variables, that is, keywords linked to a value. You use the name of a variable, and when the transformation runs, the name of the variable is replaced by its value.

The first of these two variables is an environment variable, and it is not the only one available. Other known environment variables are: ${user.home}, ${java.io.tmpdir} and ${java.home}. All these variables, whose values are auto-populated by Kettle by interrogating the system environment, are ready to use any time you need.

The second variable is a variable you defined in the kettle.properties file. In this file, you may define as many variables as you want. The only thing you have to keep in mind is that those variables will be available inside Spoon only after you restart it.

Tip

You also have the possibility of editing the kettle.properties file from Spoon. The option is available in the main menu Edit | Edit the kettle.properties file. If you use this option to modify a variable, the value will be available immediately.

If you defined several variables in the kettle.properties file and care about the order in which you did it, or the comments you may have put in the file, it's not a good idea to edit it from Spoon.

Note

You have to know that when you edit the kettle.properties file from Spoon, Kettle will not respect the order of the lines you had in the file, and it will also add to the file a lot of pre-defined variables. So if you want to take control over the look and feel of your file you shouldn't use this option.

These two kinds of variables, environment variables and variables defined in kettle.properties are the most primitive kind of variables found in PDI. All of these variables are string variables and their scope is the Java Virtual Machine. This mainly means that they will always be ready for being used in any job or transformation.

How and when you can use variables

Any time you see a red dollar sign by the side of a textbox, you may use a variable. Inside the textbox you can mix variable names with static text, as you did in Time for action – getting data from an XML file with information about countries section when you put the name of the destination as ${LABSOUTPUT}/countries_info.

To see all the available variables, you have to position the cursor in the textbox, press Ctrl + Space , and a full list is displayed so you can select the variable of your choice. If you place the mouse cursor over any of the variables for a second, the actual value of the variable will be shown.

If you know the name of the variable, you don't need to select it from the list. You may type its name, by using any of these notations: ${<name>} or %%<name>%%.

Have a go hero – exploring XML files

Now you can explore by yourself. On the book's website, there are some sample XML files. Download them and try this:

Read the customer.xml file and create a list of customers
Read the tomcat-users.xml file and get the users and their passwords
Read the areachart.xml and get the color palette, that is, the list of colors used

Note

The customer file is included in the Pentaho Report Designer software package. The others come with the Pentaho BI package. This software has many XML files for you to use. If you are interested, you can download the software from http://sourceforge.net/projects/pentaho/files/.

官术网_书友最值得收藏!

Pentaho Data Integration Beginner's Guide（Second Edition）

Time for action – getting data from an XML file with information about countries

Note

Note

Note

What just happened?

What is XML?

PDI transformation files

Getting data from XML files

XPath

Note

Configuring the Get data from the XML step

Kettle variables

Tip

Note

How and when you can use variables

Have a go hero – exploring XML files

Note