Introducing OpenRefine

Let's face a hard fact: your data are messy. All data are messy. Errors will always creep into large datasets no matter how much care you have put into creating them, especially when their creation has involved several people or has been spread over a long time span. Whether your data are born-digital or have been digitized, whether they are stored in a spreadsheet or in a database, something will always go awry somewhere in your dataset.

Acknowledging this messiness is the first essential step towards a sensible approach to data quality, which mainly involves data profiling and cleaning.

Data profiling is defined by Olson (Data Quality: The Accuracy Dimension, Jack E. Olson, Morgan Kaufmann, 2003) as "the use of analytical techniques to discover the true structure, content, and quality of data". In other words, it is a way to assess the current state of your data and to learn about the errors they contain.

Data cleaning is the process that tries to correct those errors in a semi-automated way by removing blanks and duplicates, filtering and faceting rows, clustering and transforming values, splitting multi-valued cells, and so on.
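To make these operations concrete, here is a minimal sketch in plain Python of three of them: removing blank rows, removing duplicates, and splitting multi-valued cells. It is not how OpenRefine works internally; the function names `normalize` and `clean_rows`, the sample column names, and the `|` separator are all assumptions made for illustration.

```python
def normalize(value):
    """Trim a cell and collapse runs of internal whitespace."""
    return " ".join(value.split())


def clean_rows(rows, multi_valued_key, separator="|"):
    """Drop blank and duplicate rows, then split one multi-valued column.

    `rows` is a list of dicts mapping column names to strings
    (a hypothetical in-memory stand-in for a spreadsheet).
    """
    seen = set()
    cleaned = []
    for row in rows:
        # Normalize whitespace so that "Vase " and "Vase" compare equal.
        row = {key: normalize(value) for key, value in row.items()}

        # Skip rows in which every cell is blank.
        if not any(row.values()):
            continue

        # Skip rows already seen (exact duplicates after normalization).
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)

        # Split the multi-valued cell into a list of non-empty values.
        parts = [part.strip() for part in row[multi_valued_key].split(separator)]
        row[multi_valued_key] = [part for part in parts if part]
        cleaned.append(row)
    return cleaned


# Hypothetical sample data: one near-duplicate and one blank row.
sample = [
    {"title": "Vase ", "material": "glass|metal"},
    {"title": "Vase", "material": "glass|metal"},
    {"title": "", "material": ""},
    {"title": "Chair", "material": "wood"},
]
cleaned_sample = clean_rows(sample, "material")
```

Even this small script has to hard-code which column is multi-valued and what counts as a duplicate, and every new rule means more code to write and maintain.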

Whereas custom scripts were formerly needed to perform data profiling and cleaning tasks, often separately, the advent of Interactive Data Transformation tools (IDTs) now allows for quick and inexpensive operations on large amounts of data inside a single integrated interface, even by domain professionals lacking in-depth technical skills.

OpenRefine is such an IDT: a tool for visualizing and manipulating data. It looks like traditional, Excel-like spreadsheet software, but it works more like a database, that is, with columns and fields rather than individual cells. This means that OpenRefine is not well suited for encoding new rows of data, but is extremely powerful when it comes to exploring, cleaning, and linking data.

The recipes gathered in this first chapter will help you to get acquainted with OpenRefine by reviewing its main functionalities, from import/export to data exploration and from history usage to memory management.