Data RSS - Technical Overview
Pito Salas -
[email protected] - April 9, 2009
Introduction and Background
This is the third in a series of short papers in which I am trying to create the framework and justification for a new format, which for now I am calling Data RSS. In this paper I give a technical overview of how it might work, without delving into why I think it is a good idea. For that, please see the other two papers:
Data RSS: A Modest Proposal (http://www.pdfcoke.com/doc/12866121/Data-Rss)
Data Rss: A Case Study (http://www.pdfcoke.com/doc/13583957/DataRSS-Case-Study)
Roles
Data RSS is used between two parties: the Publisher, who ‘owns’ some data, and the Accessor, who wants to use that data. Publisher and Accessor are organizations with people in them. The Publisher wants to offer a technical means to allow an application program simple and standardized access to their data. The Accessor wants to write an application program that accesses and does something useful with data coming from any Publisher. Accessor and Publisher don’t know each other. Accessor’s Application A can as easily get data from Publisher P as from Publisher Q. Publisher P’s data can be accessed as easily by Accessor A as by Accessor B.
Protocol and Format
Data RSS is a simple protocol and a simple data format. It can be implemented in any programming language and, more importantly, the Publisher and Accessor software need not know (and cannot know) what language the counterparty’s software is written in. All Data RSS requests return a response in one of several formats. For now those are XML, JSON and HTML. Why HTML? So that a request from a normal browser returns some useful human-readable information.
Data RSS Endpoint
In essence Data RSS is embodied by a URL which we call the Data RSS endpoint. A publisher makes their data available to others by the simple and single act of implementing responses to this URL. For example, hypothetically [1], the Sunlight Foundation could let the world know that their Data RSS endpoint could be found at http://services.sunlightfoundation.com/datarss. At minimum this would mean that clicking on that link would return a response that looks something like this [2]:

    ---
    datarss:
      version: 0.1
      source:
        name: Sunlight Labs
        version: 1
    ---
[1] All examples in this paper are hypothetical.
[2] All responses will be written out in a more compact, readable form. In reality the responses will be selectable as XML, JSON, YAML, or HTML.
In what follows I will document key examples of the format as it is evolving. This is organized along the lines of each of the top-level URL components that are used to control it.
REST
The overall scheme of things is that I am trying to describe a unified set of REST URL patterns. Some of the routes return information about the data sets (i.e. discovery) and some of them return actual data.
N.B. There are many ways to skin this cat, as is evidenced by the fact that each Publisher who designed a REST API for their data approached it in a slightly different way. In a way that is the problem that I am trying to address.
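To make the split between discovery routes and data routes concrete, here is a minimal sketch in Python. It assumes the URL patterns described in the rest of this paper (the bare endpoint, info, datasets, fields and queries are discovery; everything else returns data). The function name route_kind is hypothetical and not part of Data RSS.

```python
import re

# Assumption: the discovery routes are ".", "./info", "./datasets",
# "./dataset/<name>/fields" and "./dataset/<name>/queries".
# Anything else is treated as a route that returns actual data.
_DISCOVERY = re.compile(
    r"^\.(?:/info|/datasets|/dataset/[^/]+/(?:fields|queries))?$"
)


def route_kind(path: str) -> str:
    """Classify a '.'-relative Data RSS path as 'discovery' or 'data'."""
    return "discovery" if _DISCOVERY.match(path) else "data"


for example in (".", "./datasets", "./dataset/candidates/fields",
                "./dataset/candidates/query/businesses/9120"):
    print(example, "->", route_kind(example))
```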
Data RSS patterns
In what follows, I will use “.” (a single period) to denote the Data RSS endpoint. So when you see “.”, substitute, for example, http://www.followthemoney.org/datarss (another fictional endpoint).
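The substitution of “.” can be sketched in a few lines of Python. This is only an illustration of the notation used in this paper; the helper name expand is hypothetical.

```python
def expand(endpoint: str, pattern: str) -> str:
    """Replace the leading '.' of a Data RSS pattern with a concrete endpoint URL."""
    if not pattern.startswith("."):
        raise ValueError("a Data RSS pattern must start with '.'")
    # Drop a trailing slash on the endpoint so the result never contains '//'.
    return endpoint.rstrip("/") + pattern[1:]


FTM = "http://www.followthemoney.org/datarss"  # fictional endpoint from the text
print(expand(FTM, "."))
print(expand(FTM, "./info"))
```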
Request url: .
The base Data RSS endpoint returns a basic “hello world” response to prove that there is, in fact, a Data RSS endpoint here. It indicates the version of Data RSS and the name of the publisher, as well as whatever version number they might set for their implementation.
Example:

    ---
    datarss:
      version: 0.1
      source:
        name: Sunlight Labs
        version: 1
    ---
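An Accessor might check the hello response before doing anything else. The sketch below assumes the response has already been parsed into a Python dict and that source is nested under datarss; the nesting is an assumption, since the format is still evolving. The function name read_hello is hypothetical.

```python
def read_hello(response: dict) -> tuple:
    """Return (Data RSS version, publisher name, publisher version).

    Assumes the nesting datarss -> source -> name/version, which is a guess
    at the intended layout rather than a settled part of the format.
    """
    try:
        datarss = response["datarss"]
        source = datarss["source"]
        return str(datarss["version"]), source["name"], source["version"]
    except (KeyError, TypeError) as exc:
        raise ValueError("not a Data RSS hello response") from exc


hello = {"datarss": {"version": 0.1,
                     "source": {"name": "Sunlight Labs", "version": 1}}}
print(read_hello(hello))
```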
Request url: ./info
Return performance and feature information about this particular endpoint. An Accessor might call this at the very start to learn something about the particular implementation.
Example:

    REQUEST: ./info
    RESPONSE:
    features:
      api-key-required: Yes
      formats: [JSON, XML]
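As an illustration of how an Accessor could use this response, the following sketch picks a response format and checks whether an API key is needed. It assumes the info response has been parsed into a dict with a features entry as in the example; the function names are hypothetical, and how a format is actually requested is not covered here.

```python
def pick_format(info: dict, preferred: list) -> str:
    """Return the first preferred format that the endpoint advertises."""
    offered = [f.upper() for f in info.get("features", {}).get("formats", [])]
    for fmt in preferred:
        if fmt.upper() in offered:
            return fmt.upper()
    raise LookupError(f"endpoint offers none of {preferred}")


def needs_api_key(info: dict) -> bool:
    """Interpret api-key-required, which may arrive as a bool or as 'Yes'/'no'."""
    value = info.get("features", {}).get("api-key-required", False)
    if isinstance(value, str):
        return value.strip().lower() in ("yes", "true")
    return bool(value)


info = {"features": {"api-key-required": "Yes", "formats": ["JSON", "XML"]}}
print(pick_format(info, ["yaml", "xml"]), needs_api_key(info))
```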
Request url: ./datasets
Return a list of all the distinct data sets that this endpoint publishes. Each dataset corresponds more or less to a table or database or list of information. Datasets also may present various canned queries and default behaviors.
Example:

    REQUEST: ./datasets
    RESPONSE:
    ---
    name: newswire
    fullname: New York Times Newswire API
    ---
    name: campaigns
    fullname: New York Times Campaign Finance API
    ---
Notes: • The name of a dataset is used in subsequent requests as an identifier.
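Because the name is the identifier used in later requests, an Accessor will probably want to index the list by name. A minimal sketch, assuming the response has been parsed into a list of dicts and that names are unique within an endpoint (the paper does not state uniqueness explicitly); index_by_name is a hypothetical helper.

```python
def index_by_name(records: list) -> dict:
    """Index a list of Data RSS records (datasets, fields, queries) by 'name'."""
    index = {}
    for record in records:
        name = record["name"]
        if name in index:
            # Assumption: a name identifies exactly one record.
            raise ValueError(f"duplicate name: {name}")
        index[name] = record
    return index


datasets = [
    {"name": "newswire", "fullname": "New York Times Newswire API"},
    {"name": "campaigns", "fullname": "New York Times Campaign Finance API"},
]
print(index_by_name(datasets)["campaigns"]["fullname"])
```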
Request url: ./dataset//fields
Return the list of all the distinct fields of information that may appear in responses from this dataset (the dataset name goes between ./dataset/ and /fields, as in the example).
Example:

    REQUEST: ./dataset/candidates/fields
    RESPONSE:
    ---
    name: imsp_candidate_id
    fullname: the id number of the candidate
    url-index: yes
    ---
    name: candidate_name
    fullname: the name of the candidate
    url-index: no
    ---
Notes:
• The name of a field is used in subsequent requests as an identifier.
• url-index: yes means that this field can be used as an actual part of the URL, in exactly this way: ./dataset/candidates/imsp_candidate_id/9120
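The url-index rule can be sketched as a small URL builder that refuses fields not marked url-index: yes. This assumes the fields response has been parsed into a list of dicts; the helper names is_yes and indexed_url are hypothetical.

```python
from urllib.parse import quote


def is_yes(value) -> bool:
    """Treat 'yes'/'Yes'/True as true; parsers differ on how they read YAML 'yes'."""
    if isinstance(value, str):
        return value.strip().lower() in ("yes", "true")
    return bool(value)


def indexed_url(endpoint, dataset, fields, field_name, value) -> str:
    """Build ./dataset/<dataset>/<field>/<value> for a field with url-index: yes."""
    by_name = {field["name"]: field for field in fields}
    if field_name not in by_name:
        raise KeyError(f"unknown field: {field_name}")
    if not is_yes(by_name[field_name].get("url-index", "no")):
        raise ValueError(f"field {field_name} is not url-indexed")
    parts = [dataset, field_name, str(value)]
    # Percent-encode each path segment so odd values cannot break the URL.
    return endpoint.rstrip("/") + "/dataset/" + "/".join(
        quote(part, safe="") for part in parts)


fields = [
    {"name": "imsp_candidate_id", "url-index": "yes"},
    {"name": "candidate_name", "url-index": "no"},
]
print(indexed_url("http://www.followthemoney.org/datarss",
                  "candidates", fields, "imsp_candidate_id", 9120))
```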
Request url: ./dataset//queries
Return the list of all the standing queries that this dataset defines (the dataset name goes between ./dataset/ and /queries, as in the example). A standing query is a kind of canned query that is meaningful within a particular domain.
Example:

    REQUEST: ./dataset/candidates/queries
    RESPONSE:
    ---
    name: businesses
    type: url-parameter
    parameter: imsp_candidate_id
    fullname: This query will summarize contributions at the business level for a specific candidate.
    ---
Notes:
• The name of the query is used in subsequent requests as an identifier.
There are these query types, so far:
• type: named-query
A simple name that denotes a request for a specific result set. For example, ./dataset/newswire/query/last24hours would return records corresponding to the named query last24hours.
• type: url-parameter
A query that includes a parameter right in the URL. For example, ./dataset/candidates/query/businesses/9120 would return records for a query called businesses and the argument 9120.
• type: question-mark
The most powerful query type, which allows a more open-ended set of question-mark URL parameters. For example, ./dataset/districts/query/zips?state=MA&districtnumber=29 would return records for a query called “zips” with parameters state and districtnumber.
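The three query types map onto three URL shapes. The sketch below builds a request URL from a query description as returned by the queries route, assuming it has been parsed into a dict. The function name query_url and the rejection of undeclared parameters are illustrative choices, not part of the format.

```python
from urllib.parse import quote, urlencode


def query_url(endpoint, dataset, query, argument=None, params=None) -> str:
    """Build the request URL for a standing query, according to its type."""
    base = f"{endpoint.rstrip('/')}/dataset/{dataset}/query/{query['name']}"
    qtype = query["type"]
    if qtype == "named-query":
        return base
    if qtype == "url-parameter":
        if argument is None:
            raise ValueError("url-parameter queries need an argument")
        return f"{base}/{quote(str(argument), safe='')}"
    if qtype == "question-mark":
        params = params or {}
        # Assumption: only parameters the query declares are accepted.
        unknown = set(params) - set(query.get("parameters", []))
        if unknown:
            raise ValueError(f"undeclared parameters: {sorted(unknown)}")
        return f"{base}?{urlencode(params)}" if params else base
    raise ValueError(f"unknown query type: {qtype}")


zips = {"name": "zips", "type": "question-mark",
        "parameters": ["state", "districtnumber"]}
print(query_url("http://services.sunlightlabs.com/datarss", "districts", zips,
                params={"state": "MA", "districtnumber": 29}))
```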
Conclusion
Please note: this is not a specification, nor is it meant as one. It is a working document which will change with feedback and further design. In the annotated examples below you can see the examples that I have worked through and that have driven the design. The next step is to continue applying this model to other existing data APIs and find the holes. So far none have been especially hard to overcome.
Annotated Examples
EXAMPLE 1: New York Times Newswire API
Hypothetical New York Times Data RSS endpoint: . = http://api.nytimes.com/datarss

    REQUEST: .
    RESPONSE:
    datarss:
      version: 0.1
      source:
        name: New York Times
        version: 2

    REQUEST: ./info
    RESPONSE:
    ---
    features:
      formats: [JSON, XML]
      api-key-required: yes
      paginated: no
    ---

    REQUEST: ./datasets
    RESPONSE:
    ---
    name: newswire
    fullname: New York Times Newswire API
    ---
    name: campaigns
    fullname: New York Times Campaign Finance API
    ---

    REQUEST: ./dataset/newswire/fields
    RESPONSE:
    ---
    name: url
    url-index: no
    ---
    name: section
    url-index: no
    ---
    name: summary
    url-index: no
    ---
    name: type
    url-index: no
    ---
    name: people
    url-index: no
    ---
    name: created
    url-index: no
    ---
    name: pubdate
    url-index: no
    ---
    ... and so on
    REQUEST: ./dataset/newswire/queries
    RESPONSE:
    ---
    name: recent
    type: named-query
    fullname: all available recent items
    ---
    name: last24hours
    type: named-query
    fullname: items published in last 24 hours
    ---

    REQUEST: ./dataset/newswire/query/last24hours
    RESPONSE:
    ---
    url: xxx
    section: yyy
    summary: zzz
    type: aaa
    people: xxx
    ---
    and so on.
EXAMPLE 2: FOLLOWTHEMONEY
Hypothetical Follow The Money Data RSS endpoint: . = http://www.followthemoney.org/datarss

    REQUEST: ./info
    RESPONSE:
    ---
    datarss:
      version: 0.1
      source:
        name: Follow The Money
        version: 1
    features:
      api-key-required: Yes
      formats: [JSON, XML]
    ---

    REQUEST: ./datasets
    RESPONSE:
    ---
    name: candidates
    fullname: Follow the Money information about candidates
    paginated: yes
    sorts: [sector_name, industry_name, ...]
    ---
    name: party_pacs
    fullname: Follow the Money information about Pacs
    paginated: yes
    ---

    REQUEST: ./dataset/candidates/fields
    RESPONSE:
    ---
    name: imsp_candidate_id
    fullname: the id number of the candidate
    url-index: yes
    ---
    name: candidate_name
    fullname: the name of the candidate
    url-index: no
    ---
    name: state
    url-index: no
    fullname: the state this candidate is in
    ---

    EXAMPLE REQUEST: ./dataset/candidates/imsp_candidate_id/9120
    RESPONSE: information about specified candidate
    NOTE: This illustrates the url-index: yes option

    REQUEST: ./dataset/candidates/queries
    RESPONSE:
    ---
    name: businesses
    type: url-parameter
    parameter: imsp_candidate_id
    fullname: This query will summarize contributions at the business level for a specific candidate.
    ---

    EXAMPLE REQUEST: ./dataset/candidates/query/businesses/9120
    RESPONSE: information about the businesses of the specified candidate
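On the Publisher side, implementing the endpoint amounts to answering a handful of URL patterns. The following is a minimal in-memory sketch of the Follow The Money example, with no web framework involved. The function respond, the ROUTES table and the placeholder data responses are all hypothetical; the paper does not say what the data responses actually contain.

```python
import re

FIELDS = [
    {"name": "imsp_candidate_id", "url-index": "yes"},
    {"name": "candidate_name", "url-index": "no"},
    {"name": "state", "url-index": "no"},
]
QUERIES = [{"name": "businesses", "type": "url-parameter",
            "parameter": "imsp_candidate_id"}]

# Each route pairs a URL pattern with a handler. The two data handlers
# return placeholders, since real records are outside the scope of the example.
ROUTES = [
    (r"\./datasets", lambda m: [{"name": "candidates"}, {"name": "party_pacs"}]),
    (r"\./dataset/candidates/fields", lambda m: FIELDS),
    (r"\./dataset/candidates/queries", lambda m: QUERIES),
    (r"\./dataset/candidates/imsp_candidate_id/(\d+)",
     lambda m: {"route": "candidate", "imsp_candidate_id": m.group(1)}),
    (r"\./dataset/candidates/query/businesses/(\d+)",
     lambda m: {"route": "businesses", "imsp_candidate_id": m.group(1)}),
]


def respond(path: str):
    """Return the canned response for a '.'-relative Data RSS path."""
    for pattern, handler in ROUTES:
        match = re.fullmatch(pattern, path)
        if match:
            return handler(match)
    raise LookupError(f"no Data RSS route matches {path!r}")


print(respond("./dataset/candidates/query/businesses/9120"))
```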
EXAMPLE 3: Sunlight Labs API
Hypothetical Sunlight Data RSS endpoint: . = http://services.sunlightlabs.com/datarss
    REQUEST: ./info
    RESPONSE:
    ---
    datarss:
      version: 0.1
      source:
        name: Sunlight Labs
        version: 1
    features:
      api-key-required: Yes
      formats: [JSON, XML]
    ---

    REQUEST: ./datasets
    RESPONSE:
    ---
    name: legislators
    fullname: US Representatives and Senators, providing basic contact information as well as all the various IDs we track for legislators.
    paginated: no
    ---
    name: districts
    fullname: Congressional districts, providing lookups to obtain district information from a zipcode or latitude and longitude.
    paginated: no
    ---

    REQUEST: ./dataset/districts/fields
    RESPONSE:
    ---
    name: state
    fullname: the state of a district
    url-index: no
    ---
    name: districtnumber
    fullname: the number of a district within a state
    url-index: yes
    ---
    name: zip
    fullname: the zipcode of a district within a state
    url-index: yes
    ---

    REQUEST: ./dataset/districts/zip/02474
    RESPONSE: list of all districts in that zip. This example illustrates url-index: yes

    REQUEST: ./dataset/districts/queries
    RESPONSE:
    ---
    name: zips
    type: question-mark
    parameters: [state, districtnumber]
    ---

    REQUEST: ./dataset/districts/query/zips?state=MA&districtnumber=29
    RESPONSE: list info about all the zipcodes in the specified district. This example illustrates query type: question-mark
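Taken together, the discovery routes let an Accessor learn what any Publisher offers without publisher-specific code. A minimal sketch of such a walk follows, assuming a caller-supplied fetch function that takes a “.”-relative path and returns the parsed response. Both discover and the canned stub standing in for a real endpoint are hypothetical.

```python
def discover(fetch) -> dict:
    """Walk the discovery routes and return field and query names per dataset.

    fetch(path) is assumed to return the parsed response for a '.'-relative
    path; how it talks to the endpoint is up to the caller.
    """
    catalog = {}
    for dataset in fetch("./datasets"):
        name = dataset["name"]
        catalog[name] = {
            "fields": [f["name"] for f in fetch(f"./dataset/{name}/fields")],
            "queries": [q["name"] for q in fetch(f"./dataset/{name}/queries")],
        }
    return catalog


# A canned stand-in for a real endpoint, based on the Sunlight example.
CANNED = {
    "./datasets": [{"name": "districts"}],
    "./dataset/districts/fields": [
        {"name": "state"}, {"name": "districtnumber"}, {"name": "zip"}],
    "./dataset/districts/queries": [
        {"name": "zips", "type": "question-mark",
         "parameters": ["state", "districtnumber"]}],
}
print(discover(CANNED.__getitem__))
```

Because discover only depends on the fetch callable, the same code would work unchanged against any Publisher that answers these routes.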