Thoughts on Open Data Part 1

After being an open data consumer/visualizer/analyzer/advocate/enthusiast for the past few years, and spending the last two years actually working in the industry (is that a thing? oh my), I’ve had some time to develop my own open data philosophy. Here are a few tenets with commentary:

1) Play it Where it Lies

Before I get to my point, let me cite a few Open Data sets here in NYC:

If you seek geospatial data for the city’s subway lines and stops, NYC Spatial Data Veteran Steve Romalewski has posted them on his blog along with some examples of their use, and some analysis.

Why doesn’t the MTA share this data themselves? Well, they do, but it’s locked up inside their GTFS feed, a very specialized data format used to publish complex transit schedules. Steve knew that there were plenty of people out there who would want to map this data in GIS software, so he did the heavy lifting of converting it into shapefiles. Now, we can all google “NYC Subway Shapefile” and be two clicks away from the raw data instead of trying to learn what GTFS is and how to extract route shapes from it.
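To give a sense of the heavy lifting involved, here is a minimal sketch of pulling route shapes out of a GTFS feed. It is not Steve's actual process. It assumes the feed includes the optional shapes.txt file with the standard GTFS columns (shape_id, shape_pt_lat, shape_pt_lon, shape_pt_sequence), and it writes GeoJSON rather than shapefiles so that it needs nothing beyond the Python standard library. The sample rows and the function name are made up for illustration.

```python
import csv
import io
import json
from collections import defaultdict


def shapes_to_geojson(shapes_txt):
    """Turn the text of a GTFS shapes.txt file into a GeoJSON FeatureCollection."""
    points_by_shape = defaultdict(list)
    for row in csv.DictReader(io.StringIO(shapes_txt)):
        points_by_shape[row["shape_id"]].append(
            (
                int(row["shape_pt_sequence"]),
                float(row["shape_pt_lon"]),
                float(row["shape_pt_lat"]),
            )
        )

    features = []
    for shape_id, points in sorted(points_by_shape.items()):
        points.sort()  # rows are not guaranteed to arrive in sequence order
        features.append(
            {
                "type": "Feature",
                "properties": {"shape_id": shape_id},
                "geometry": {
                    "type": "LineString",
                    # GeoJSON wants [longitude, latitude], the reverse of GTFS column order
                    "coordinates": [[lon, lat] for _, lon, lat in points],
                },
            }
        )
    return {"type": "FeatureCollection", "features": features}


# Made-up sample rows, deliberately out of order; not real MTA data.
SAMPLE_SHAPES_TXT = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence
demo_line,40.7010,-74.0130,2
demo_line,40.7000,-74.0140,1
demo_line,40.7020,-74.0120,3
"""

geojson = shapes_to_geojson(SAMPLE_SHAPES_TXT)
print(json.dumps(geojson, indent=2))
```

Even this toy version has to know which file to open, how points are ordered, and that GeoJSON flips the coordinate order. That is exactly the kind of knowledge most people who just want a map should not need.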

If you want raw data on the city’s tax lots, including owner names, zoning, lot size, and more, the Department of City Planning has made it and many other planning-related datasets available on its Bytes of the Big Apple site.

Why does DCP have their own Open Data page when there is a Citywide Open Data Portal? Because this page has been there for years, and thousands of people know it as a trusted source for this data. That, and DCP deserves credit for the time and effort it takes to keep these data sources and their metadata up to date. DCP has GIS expertise in-house, and they are publishing the same data they use in their own operations.

(Edit 28 April): NYC GIS Chief Colin Reilly has reminded me that MapPLUTO is a compilation of many sources of public data with the city’s Digital Tax Map as its base geospatial data. Open Data begets Open Data. DCP is adding value for their own purposes but publishing the output for the rest of us to use. (end edit)

If you seek trip and fare data for New York City’s Taxis, some guy named Chris Whong has shared all 173 million trips from 2013 and written about it on his blog.

Why doesn’t the Taxi Commission just publish these files themselves? The first answer I got was that the data were too large to work on the city’s open data portal, which starts getting sluggish when dealing with datasets in the tens of millions of rows. Whatever the reason for not sharing it, it’s public data that I FOILed so you don’t have to.

“Play it where it lies” means that open data should be published wherever the hell it is most useful to those who will use it, and in whatever format will be most useful. That might be a CSV on a public-facing website, GeoJSON on GitHub, shapefiles on an FTP server, a big dataset in Google BigQuery, a full-on home-grown RESTful API, or maybe, just maybe, a proprietary open data portal’s database (doubtful). My point is that no single format or place is suitable for all data, and forcing everything into the same place is almost surely a bad thing. This is exactly what most commercial open data portals do, and I’ve seen firsthand how a dataset becomes less useful or completely useless after being forced into the walled garden of an open data portal.

I am convinced that the role of open data portals should be as collections of pointers to resources that are grouped together under a common theme. The data store and the data catalog should not intermingle, and the catalogs should be modular and standards-based, so the pointers can be chopped up and re-aggregated under other contexts. I trust Google to point me to high quality data more than I do the search box on an open data portal, if only because it will include all the “unofficial” data sources like those I cited above.
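As a rough illustration of a catalog that only holds pointers, here is a small sketch. Everything in it is hypothetical: the field names are not taken from any particular catalog standard, the publisher labels are generic, and the example.org URLs are placeholders for wherever the data actually lives. The point is only that once entries are plain records pointing elsewhere, re-aggregating them under a different context is a few lines of code.

```python
from collections import defaultdict

# Hypothetical catalog entries: each one describes a dataset and points to it,
# but the catalog itself stores no data.
CATALOG = [
    {
        "title": "Subway lines and stops",
        "publisher": "independent blog",
        "format": "shapefile",
        "themes": ["transit", "geospatial"],
        "url": "https://example.org/subway-shapefiles",  # placeholder URL
    },
    {
        "title": "Tax lots",
        "publisher": "city planning department",
        "format": "shapefile",
        "themes": ["land use", "geospatial"],
        "url": "https://example.org/tax-lots",  # placeholder URL
    },
    {
        "title": "Taxi trips",
        "publisher": "independent blog",
        "format": "csv",
        "themes": ["transit"],
        "url": "https://example.org/taxi-trips",  # placeholder URL
    },
]


def regroup(catalog, key):
    """Re-aggregate catalog pointers under any field, e.g. theme, format, or publisher."""
    groups = defaultdict(list)
    for entry in catalog:
        values = entry[key]
        if isinstance(values, str):
            values = [values]  # treat single-valued fields like one-item lists
        for value in values:
            groups[value].append(entry["url"])
    return dict(groups)


by_theme = regroup(CATALOG, "themes")
by_format = regroup(CATALOG, "format")
print(by_theme["transit"])
print(by_format["csv"])
```

The same three pointers can show up in a transit catalog, a geospatial catalog, or a list of "everything published as CSV", and none of the underlying files had to move.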

1.5) Open Data should live as “close to home” as possible

This is an extension of #1, but basically: The more hands data passes through on its way to being “published”, the higher the probability that we will lose something along the way. Some governments call this “cleaning” the data, but it might really be dirtying it up. If the data was exported from a 30-year-old mainframe as a tab-separated file full of lookup codes, give us that raw dump along with the cleaned data so we can check your work.
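Here is a minimal sketch of what “checking your work” can look like when the raw dump is available. The column names, the lookup table, and the sample rows are all invented for illustration. The decoded label is added next to the original code instead of replacing it, and a code missing from the lookup table is left blank rather than guessed at.

```python
import csv
import io

# Hypothetical lookup table that would ship alongside the raw dump.
STATUS_LOOKUP = {"A": "Active", "C": "Closed"}

# Hypothetical tab-separated raw dump; "X" is a code the lookup table does not cover.
RAW_DUMP = "record_id\tstatus_cd\n1001\tA\n1002\tC\n1003\tX\n"


def decode(raw_text, lookup, code_field, label_field):
    """Add a human-readable label column while keeping the raw code column intact."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_text), delimiter="\t"):
        code = row[code_field]
        # Unknown codes get an empty label, so gaps in the lookup stay visible.
        row[label_field] = lookup.get(code, "")
        rows.append(row)
    return rows


decoded = decode(RAW_DUMP, STATUS_LOOKUP, "status_cd", "status_label")
unmatched = [row["record_id"] for row in decoded if not row["status_label"]]
print(decoded)
print("records with codes missing from the lookup:", unmatched)
```

If only the cleaned file had been published, record 1003 might have been silently dropped or mislabeled, and nobody downstream would have any way to notice.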

2) Open Data is not for “the public”

Hear me out… raw data isn’t useful to most people. That’s why it’s called raw data. It needs something more before it’s ready for consumption. It needs a data journalist, data scientist, civic hacker, coder, mapper, or some other middleman (who I am now going to refer to as an open data broker) to turn it into a finished product.

Open Data portals are often sold and promoted as citizen engagement tools, where you can glean immediate insight about your city without being a “Technical Person”. To accomplish this, the open data portals are filled with charting and mapping tools, flashy images, “citizen-friendly” category structures, giant jumbotron scrollers, and a hundred other things that are not raw data and do not help you get raw data. Dare I say that all of these “bells and whistles” actually make the data harder to access, which is antithetical to the whole concept of an open data portal to begin with.

I am convinced that most data consumers just want a download button when they arrive at an open data portal. Few will care that the data lives in a database, or that there is an API, and even fewer (read: none) will bother to use the built-in generic pie chart maker.

By turning an open data portal into a flashy frontend experience, we do a disservice to the very people who will eventually unleash the power of the data and bridge the gap between raw data and finished data products. Raw data is for people who know how to use raw data, period.

To be continued…

Here’s part 2