- Using OpenRefine
- Ruben Verborgh, Max De Wilde
- 2021-08-06 16:57:12
Introducing OpenRefine
Let's face a hard fact: your data are messy. All data are messy. Errors will always creep into large datasets no matter how much care you put into creating them, especially when several people were involved or the work was spread over a long timespan. Whether your data are born-digital or have been digitized, whether they are stored in a spreadsheet or in a database, something will always go awry somewhere in your dataset.
Acknowledging this messiness is the first essential step towards a sensible approach to data quality, which mainly involves data profiling and cleaning.
Data profiling is defined by Olson (Data Quality: The Accuracy Dimension, Jack E. Olson, Morgan Kaufmann, 2003) as "the use of analytical techniques to discover the true structure, content, and quality of data". In other words, it is a way to assess the current state of your data and to learn about the errors they contain.
Data cleaning is the process that tries to correct those errors in a semi-automated way by removing blanks and duplicates, filtering and faceting rows, clustering and transforming values, splitting multi-valued cells, and so on.
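To make the two notions concrete, here is a minimal Python sketch of profiling and cleaning a single column of values. It is only an illustration of the ideas above, not how OpenRefine works internally; the function names (`profile`, `clean`, `split_multi_valued`), the sample data, and the `|` separator are all assumptions made for this example.

```python
from collections import Counter


def profile(values):
    """Profile a column: count all values, blanks, and values that repeat."""
    trimmed = [v.strip() for v in values]
    non_blank = [v for v in trimmed if v]
    counts = Counter(non_blank)
    return {
        "total": len(values),
        "blank": len(values) - len(non_blank),
        "duplicates": sorted(v for v, n in counts.items() if n > 1),
    }


def clean(values):
    """Trim whitespace, drop blanks, and remove exact duplicates (first occurrence wins)."""
    seen = set()
    cleaned = []
    for v in values:
        v = v.strip()
        if v and v not in seen:
            seen.add(v)
            cleaned.append(v)
    return cleaned


def split_multi_valued(cell, separator="|"):
    """Split a multi-valued cell into its trimmed, non-empty parts."""
    return [part.strip() for part in cell.split(separator) if part.strip()]


# Hypothetical sample column with stray spaces, blanks, and duplicates.
column = ["Paris", " Paris", "", "London", "Berlin ", "London", "  "]

report = profile(column)    # profiling: what is wrong with the data?
cleaned = clean(column)     # cleaning: correct what profiling revealed
parts = split_multi_valued("red | green|  | blue")
```

Profiling comes first because it tells you which cleaning steps are worth applying; here the report would point to blank cells and repeated city names before any value is changed.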
Whereas custom scripts were formerly needed to perform data profiling and cleaning tasks, often separately, the advent of Interactive Data Transformation tools (IDTs) now allows for quick and inexpensive operations on large amounts of data inside a single integrated interface, even by domain professionals lacking in-depth technical skills.
OpenRefine is such an IDT: a tool for visualizing and manipulating data. It looks like traditional, Excel-like spreadsheet software, but it works more like a database, that is, with columns and fields rather than individual cells. This means that OpenRefine is not well suited for encoding new rows of data, but is extremely powerful when it comes to exploring, cleaning, and linking data.
The recipes gathered in this first chapter will help you to get acquainted with OpenRefine by reviewing its main functionalities, from import/export to data exploration and from history usage to memory management.