- Haskell Data Analysis Cookbook
- Nishant Shukla
- 2021-12-08 12:43:31
Harnessing data from various sources
Information can be described as structured, unstructured, or sometimes a mix of the two—semi-structured.
In a very general sense, structured data is anything that can be parsed by an algorithm. Common examples include JSON, CSV, and XML. If given structured data, we can design a piece of code to dissect the underlying format and easily produce useful results. As mining structured data is a deterministic process, it allows us to automate the parsing. This in effect lets us gather more input to feed our data analysis algorithms.
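The recipe itself involves no code, but a small illustration may help show why structured data lends itself to automated parsing. The following is a minimal sketch that assumes a simplified CSV dialect with no quoted fields or escaped commas. The helper names `splitOn`, `parseCsv`, and `column` are hypothetical, written only for this example, and depend on nothing beyond the Haskell Prelude.

```haskell
-- A naive CSV reader for illustration only.
-- Assumption: fields never contain quotes, commas, or embedded newlines.

-- | Split a string on a delimiter character.
splitOn :: Char -> String -> [String]
splitOn delim s = case break (== delim) s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitOn delim rest

-- | Parse CSV text into rows of fields, one row per input line.
parseCsv :: String -> [[String]]
parseCsv = map (splitOn ',') . lines

-- | Extract one column by zero-based index, skipping rows that are too short.
column :: Int -> [[String]] -> [String]
column i rows = [ row !! i | row <- rows, i >= 0, length row > i ]
```

Because the format is regular, the same few lines work on any input that follows it. For example, `column 0 (parseCsv "name,year\nHaskell,1990")` picks out the first field of every row. Real-world CSV files usually call for a dedicated parsing library rather than a hand-rolled splitter like this one.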
Unstructured data is everything else. It is data not defined in a specified manner. Written languages such as English are often regarded as unstructured because of the difficulty in parsing a data model out of a natural sentence.
In our search for good data, we will often find a mix of structured and unstructured text. This is called semi-structured text.
This recipe will primarily focus on obtaining structured and semi-structured data from the following sources.
Tip
Unlike most recipes in this book, this recipe does not contain any code. The best way to read this book is by skipping around to the recipes that interest you.
How to do it...
We will browse through the links provided in the following sections to build up a list of sources to harness interesting data in usable formats. However, this list is not at all exhaustive.
Some of these sources have an Application Programming Interface (API) that allows more sophisticated access to interesting data. An API specifies the interactions and defines how data is communicated.
News
The New York Times has some of the most polished API documentation, providing access to anything from real-estate data to article search results. This documentation can be found at http://developer.nytimes.com.
The Guardian also supports a massive datastore with over a million articles at http://www.theguardian.com/data.
USA TODAY provides some interesting resources on books, movies, and music reviews. The technical documentation can be found at http://developer.usatoday.com.
The BBC features some interesting API endpoints, including information on BBC programs and music, located at http://www.bbc.co.uk/developer/technology/apis.html.
Private
Facebook, Twitter, Instagram, Foursquare, Tumblr, SoundCloud, Meetup, and many other social networking sites support APIs to access some degree of social information.
For specific APIs such as weather or sports, Mashape is a centralized search engine that helps narrow down the search to some lesser-known sources. Mashape is located at https://www.mashape.com/.
Many public data sources can be explored and visualized using the Google Public Data search located at http://www.google.com/publicdata.
For a list of all countries with names in various data formats, refer to the repository located at https://github.com/umpirsky/country-list.
Academic
Some data sources are hosted openly by universities around the world for research purposes.
For health care data, the University of Washington hosts the Institute for Health Metrics and Evaluation (IHME), which collects rigorous and comparable measurements of the world's most important health problems. Navigate to http://www.healthdata.org for more information.
The MNIST database of handwritten digits from NYU, Google Labs, and Microsoft Research is a training set of normalized and centered samples for handwritten digits. Download the data from http://yann.lecun.com/exdb/mnist.
Nonprofits
Human Development Reports publishes annual updates on topics ranging from international adult literacy to the number of people owning personal computers. It describes its data as drawn from a variety of public international sources and as representing the most current statistics available for those indicators. More information is available at http://hdr.undp.org/en/statistics.
The World Bank is the source for poverty and world development data. It regards itself as a free source that enables open access to data about development in countries around the globe. Find more information at http://data.worldbank.org/.
The World Health Organization provides data and analyses for monitoring the global health situation. See more information at http://www.who.int/research/en.
UNICEF also releases interesting statistics, as the description on their website suggests:
The UNICEF database contains statistical tables for child mortality, diseases, water sanitation, and other vital statistics. UNICEF claims to play a central role in monitoring the situation of children and women by assisting countries in collecting and analyzing data, helping them develop methodologies and indicators, maintaining global databases, and disseminating and publishing data. Find the resources at
The United Nations hosts interesting publicly available political statistics at http://www.un.org/en/databases.
The United States government
If we feel the urge to discover patterns in the United States (U.S.) government like Nicolas Cage did in the feature film National Treasure (2004), then http://www.data.gov/ is our go-to source. It's the U.S. government's active effort to provide useful data. It is described as a place to increase "public access to high-value, machine-readable datasets generated by the executive branch of the Federal Government". Find more information at http://www.data.gov.
The United States Census Bureau releases population counts, housing statistics, area measurements, and more. These can be found at http://www.census.gov.