Dataset examples
Many people and organizations publish datasets online in a wide variety of formats (csv, sequence, xls, etc.). This page explores examples of webpages that describe and link to datasets.
The Problem
Discovering these datasets is difficult because there is no simple way of marking up the pages that describe them. Today, links to datasets are scattered throughout the web or entered into various central repositories. Being able to publish a dataset in a way that an automated search engine or software tool could discover it would go a long way towards easing the discovery process.
Use Cases
As the originator of the data, you publish a webpage with a link to that data for discovery purposes.
Alternatively, a third party may publish links to your data (or webpage describing the data) and include extra metadata about it that the originator may not have included.
Real-World Examples
Links to public web pages, either popular or insightful
Individual/Organizational Publishers
- FreeBase - https://developers.google.com/freebase/data
- 1000 Genomes http://www.1000genomes.org/data
- Common Crawl - https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set
- Data.gov - https://explore.data.gov/Geography-and-Environment/Worldwide-M1-Earthquakes-Past-7-Days/7tag-iwnu
- Data.gov.uk - http://data.gov.uk/dataset/average_earnings_index
Centralized Repositories and/or Directories
- DataBib - http://databib.org/repository/380
- Amazon Public Datasets - http://aws.amazon.com/datasets/Economics/2285
- DataHub - http://datahub.io/dataset/diavgeia
Common Practices
Datasets are typically described using several common fields.
- fn - name of the dataset
- records - number of records
- size - byte size of dataset
- schema - link to something describing the schema or a description of the schema itself
- url - url to dataset
- type - format the data is in
- sample - sample of data or link to the sample
- summary - summary of the dataset
- description - description of the dataset
- terms - terms of use for the dataset, likely a url
- dtpublished - date dataset was published
- dtupdated - date dataset was updated
- contributor - people/organizations contributing to the dataset
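The fields above could be collected into a single record, sketched here in Python. This is only an illustration, not a proposed format: the class name DatasetDescription and the helper present_fields are hypothetical, treating fn and url as the only required fields is an assumption, and the example values are placeholders rather than a real dataset.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class DatasetDescription:
    """Hypothetical record holding the common fields listed above."""
    fn: str                             # name of the dataset
    url: str                            # url to dataset
    type: Optional[str] = None          # format the data is in
    records: Optional[int] = None       # number of records
    size: Optional[int] = None          # byte size of dataset
    schema: Optional[str] = None        # schema description or link to one
    sample: Optional[str] = None        # sample of data or link to the sample
    summary: Optional[str] = None       # summary of the dataset
    description: Optional[str] = None   # description of the dataset
    terms: Optional[str] = None         # terms of use, likely a url
    dtpublished: Optional[str] = None   # date published (ISO 8601 string assumed)
    dtupdated: Optional[str] = None     # date updated (ISO 8601 string assumed)
    contributor: List[str] = field(default_factory=list)  # people/organizations


def present_fields(dataset: DatasetDescription) -> dict:
    """Return only the fields that were actually filled in."""
    return {k: v for k, v in asdict(dataset).items() if v not in (None, [])}


# Placeholder values for illustration only.
example = DatasetDescription(
    fn="Example dataset",
    url="http://example.org/data.csv",
    type="csv",
    records=1000,
    contributor=["Example Org"],
)
print(present_fields(example))
```

Leaving most fields optional reflects the real-world examples, where publishers rarely supply every field.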