官术网_书友最值得收藏!

  • Using OpenRefine
  • Ruben Verborgh Max De Wilde
  • 493字
  • 2021-08-06 16:57:12

Recipe 3 – exploring your data

In this recipe, you will get to know your data by scanning the different zones giving access to the total number of rows/records, the various display options, the column headers and menus, and the actual cell contents.

Once your dataset has been loaded, you will access the main interface of OpenRefine as shown in the following screenshot:

Four zones are seen on this screen; let's go through them from top to bottom, numbered as 1 to 4 in the preceding screenshot:

  1. Total number of rows: If you did not forget to specify that quotation marks are to be ignored (see Recipe 2 – creating a new project), you should see a total of 75814 rows from the Powerhouse file. When data are filtered on a given criterion, this bar will display something like 123 matching rows (75814 total).
  2. Display options: Try to alternate between rows and records by clicking on either word. True, not much will change, except that you may now read 75814 records in zone 1. The number of rows is always equal to the number of records in a new project, but they will evolve independently from now on. This zone will also let you choose whether to display 5, 10, 25, or 50 rows/records on a page, and it also provides the right way to navigate from page to page.
  3. Column headers and menus: You will find here the first row that was parsed as column headers when the project was created. In the Powerhouse dataset, the columns read Record ID, Object Title, Registration Number, and so on (if you deselected the Parse next 1 line as column headers option box, you will see Column 1, Column 2, and so on instead). The leftmost column is always called All and is divided in three subcolumns containing stars (to mark good records, for instance), flags (to mark bad records, for instance), and IDs. Starred and flagged rows can easily be faceted, as we will see in Chapter 2, Analyzing and Fixing Data. Every column also has a menu (see the following screenshot) that can be accessed by clicking on the small dropdown to the left of the column header.
  4. Cell contents: This option shows the main area displaying the actual values of the cells.

Before starting to profile and clean your data, it is important to get to know them well and to be at ease with OpenRefine: have a look at each column (using the horizontal scrollbar) to verify that the column headers have been parsed correctly, that the cell types were rightly guessed, and so on. Change the number of rows displayed per page to 50 and go through a few pages to check that the values are consistent (ideally, you should already have done so during preview before creating your project). When you feel that you are sufficiently familiar with the interface, you can consider moving along to the next recipe.

主站蜘蛛池模板: 社旗县| 中牟县| 磐石市| 广宁县| 舒兰市| 运城市| 左云县| 肥东县| 山阳县| 秦安县| 苍山县| 富平县| 瑞安市| 红河县| 上虞市| 乃东县| 本溪市| 启东市| 罗定市| 射洪县| 克拉玛依市| 积石山| 岱山县| 台江县| 开江县| 惠来县| 岑溪市| 桐柏县| 孟津县| 溧阳市| 张家口市| 林西县| 台中县| 宣武区| 冕宁县| 隆回县| 鹤壁市| 龙山县| 卓尼县| 彝良县| 怀远县|