  • Using OpenRefine
  • Ruben Verborgh, Max De Wilde
  • 2021-08-06 16:57:12

Introducing OpenRefine

Let's face a hard fact: your data are messy. All data are messy. Errors will always creep into large datasets no matter how much care you put into creating them, especially when several people were involved or the work was spread over a long timespan. Whether your data are born-digital or have been digitized, whether they are stored in a spreadsheet or in a database, something will always go awry somewhere in your dataset.

Acknowledging this messiness is the first essential step towards a sensible approach to data quality, which mainly involves data profiling and cleaning.

Data profiling is defined by Olson (Data Quality: The Accuracy Dimension, Jack E. Olson, Morgan Kaufmann, 2003) as "the use of analytical techniques to discover the true structure, content, and quality of data". In other words, it is a way to assess the current state of your data and to learn about the errors they contain.
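To make the idea of profiling concrete, here is a minimal sketch in plain Python, outside OpenRefine. It counts rows, blanks, and distinct values in a single column. The function name `profile_column` and the sample `cities` column are made up for illustration; they are not part of OpenRefine or of Olson's book.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column: row count, blanks, distinct values, most common values."""
    # Treat None and whitespace-only strings as blanks; trim everything else.
    non_blank = [v.strip() for v in values if v is not None and v.strip()]
    counts = Counter(non_blank)
    return {
        "rows": len(values),
        "blank": len(values) - len(non_blank),
        "distinct": len(counts),
        "top": counts.most_common(3),
    }

# Hypothetical sample column with typical problems: blanks, stray spaces, case variants.
cities = ["Paris", "paris", "", "London", "Paris ", None, "London"]
summary = profile_column(cities)
print(summary)
```

Even this small summary exposes likely errors: two blank cells, and "Paris" and "paris" counted as separate values.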

Data cleaning is the process that tries to correct those errors in a semi-automated way by removing blanks and duplicates, filtering and faceting rows, clustering and transforming values, splitting multi-valued cells, and so on.
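Several of these operations can be sketched in a few lines of plain Python. The example below removes blanks and duplicates, splits multi-valued cells, and groups near-identical values using a simple normalized key. It illustrates the kind of work involved and is not a description of how OpenRefine implements it. The names `clean_rows`, `fingerprint`, and `cluster`, the `|` separator, and the sample data are all assumptions made for the example.

```python
import string
from collections import defaultdict

def clean_rows(rows, separator="|"):
    """Split multi-valued cells, trim whitespace, and drop blanks and exact duplicates."""
    seen = set()
    cleaned = []
    for cell in rows:
        if cell is None:
            continue
        for value in cell.split(separator):
            value = " ".join(value.split())  # trim and collapse inner whitespace
            if not value or value in seen:
                continue
            seen.add(value)
            cleaned.append(value)
    return cleaned

def fingerprint(value):
    """Normalized key: lowercase, no punctuation, unique tokens sorted."""
    table = str.maketrans("", "", string.punctuation)
    tokens = value.lower().translate(table).split()
    return " ".join(sorted(set(tokens)))

def cluster(values):
    """Group values that share a key; keep only groups with more than one member."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical raw column.
raw = ["  Paris ", "paris", "", "London|Berlin", "London", None, "Paris."]
cleaned = clean_rows(raw)
clusters = cluster(cleaned)
print(cleaned)   # ['Paris', 'paris', 'London', 'Berlin', 'Paris.']
print(clusters)  # [['Paris', 'paris', 'Paris.']]
```

The clustering step only proposes groups of candidate variants. Deciding which spelling to keep remains a human judgment, which is why the process is called semi-automated.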

Whereas custom scripts were formerly needed to perform data profiling and cleaning tasks, often separately, the advent of Interactive Data Transformation tools (IDTs) now allows for quick and inexpensive operations on large amounts of data inside a single integrated interface, even by domain professionals lacking in-depth technical skills.

OpenRefine is such an IDT: a tool for visualizing and manipulating data. It looks like traditional, Excel-like spreadsheet software, but it works more like a database, that is, with columns and fields rather than individual cells. This means that OpenRefine is not well suited for encoding new rows of data, but is extremely powerful when it comes to exploring, cleaning, and linking data.

The recipes gathered in this first chapter will help you get acquainted with OpenRefine by reviewing its main features, from import/export to data exploration and from history usage to memory management.
