官术网_书友最值得收藏!

Recipe 2 – creating a new project

In this recipe, you will learn how to get data into OpenRefine, whether by creating a new project and loading a dataset, opening an existing project from a previous session, or importing someone else's project.

If you successfully installed OpenRefine and launched it as explained in Recipe 1 – installing OpenRefine, you will notice that OpenRefine opens in your default browser. However, it is important to realize that the application is run locally: you do not need an Internet connection to use OpenRefine, except if you want to reconcile your data with external sources through the use of extensions (see Appendix, Regular Expressions and GREL for such advanced uses). Be also reassured that your sensitive data will not be stored online or shared with anyone. In practice, OpenRefine uses the port 3333 of your local machine, which means that it will be available through the URL http://localhost:3333/ or http://127.0.0.1:3333/.

Here is the start screen you will be looking at when you first open OpenRefine:

On the left, three tabs are available:

  • Create Project: This option loads a dataset into OpenRefine. This is what you will want when you use OpenRefine for the first time. There are various supported formats, as shown in the preceding screenshot. You can import data in different ways:
    • This Computer: Select a file stored on your local machine
    • Web Addresses (URLs): Import data directly from an online source*
    • Clipboard: Copy-paste your data into a text field
    • Google Data: Enable access to a Google Spreadsheet or Fusion Table*

    *Internet connection required

  • Open Project: This option helps you go back to an existing project created during a former session. The next time you start OpenRefine, it will show a list of existing projects and propose you to continue working on a dataset that you have been using previously.
  • Import Project: With this option, we can directly import an existing OpenRefine project archive. This allows you to open a project that someone else has exported, including the history of all transformations already performed on the data since the project was created.

File formats supported by OpenRefine

Here are some of the file formats supported by OpenRefine:

  • Comma-Separated Values (CSV), Tab-Separated Values (TSV), and other *SV
  • MS Excel documents (both .XLS and .XLSX) and Open Document Format (ODF) spreadsheets (.ODS), although the latter is not explicitly mentioned
  • JavaScript Object Notation (JSON)
  • XML and Resource Description Framework (RDF) as XML
  • Line-based formats (logs)

If you need other formats, you can add them by way of OpenRefine extensions.

Project creation with OpenRefine is straightforward and consists of three simple steps: selecting your file, previewing the import, and validating to let OpenRefine create your project. Let's create a new project by clicking on the Choose Files button from the This Computer tab, selecting your dataset (refer to the following information box), then clicking on Next.

Note

Although we encourage you to experiment with OpenRefine on your own dataset, it may be useful for you to be able to reproduce the examples used throughout this book. In order to facilitate this, all recipes are performed on the dataset from the Powerhouse Museum in Sydney, freely available from your account at http://www.packtpub.com (use the file chapter1.tsv). Feel free to download this file and load it into OpenRefine in order to follow the recipes more easily. Files are also present for the remaining chapters in a similar format for download. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

On the next screen, you get an overview of your dataset as it will appear in OpenRefine. In the bottom-right corner, you can see the following parsing options as shown in the following screenshot:

By default, the first line will be parsed as column headers, which is a common practice and relevant in the case of the Powerhouse dataset. OpenRefine will also attempt a guess for each cell type in order to differentiate text strings from integers, dates, and URLs among others. This will prove useful later when sorting your data (if you choose to keep the cells in plain text format, 10 will come before 2, for instance).

Another option demanding attention is the Quotation marks are used to enclose cells containing column separators checkbox. If you leave it selected, be sure to verify that the cell values are indeed enclosed in quotes in the original file. Otherwise, deselect this box to ensure that the quotation marks are not misinterpreted by OpenRefine. In the case of the Powerhouse collection, quotes are used inside cells to indicate object titles and inscriptions, for instance, so they have no syntactic meaning: we need to deselect the checkbox before going further. The other options may come in handy in some cases; try to select and deselect them in order to see how they affect your data. Also, be sure to select the right encoding to avoid special characters to being mixed up. When everything seems right, click on Create Project to load your data into OpenRefine.

主站蜘蛛池模板: 喀喇沁旗| 尉氏县| 山西省| 芮城县| 淮滨县| 永宁县| 汝城县| 道真| 霍邱县| 岑溪市| 鄂温| 湖口县| 景洪市| 无极县| 泰兴市| 洛阳市| 依安县| 韶山市| 遂溪县| 出国| 城步| 秦安县| 玛纳斯县| 乌拉特中旗| 湟源县| 铁岭县| 邮箱| 邵东县| 福州市| 呼图壁县| 广宁县| 法库县| 苗栗县| 扶沟县| 普宁市| 永安市| 宁强县| 寿光市| 景谷| 邵武市| 安岳县|