Update 6/16/2014: Many people have asked for this data since I published this post, and like a non-forward-thinking government, I’ve come up with a lot of excuses for not sharing it. Here’s a couple of torrents. Happy hacking!
Update 6/18/2014: Andrés Monroy graciously offered to host these files for download, and has set up a simple download page with smaller chunks of the data. Social media connections FTW!
FOIL (the Freedom of Information Law) is like the computer-illiterate grandmother of Open Data. The first time I came face-to-face with FOILed information, it was in the form of a phonebook-sized set of printed charts indicating the flow rates of Combined Sewer Overflows (check out Leif Percifeld’s dontflush.me project). Whoever he FOILed wasted a ton of paper on it, and scraping data from the charts was an actual proposed activity at an eco-themed hackathon. I had heard plenty about people’s awful experiences with FOIL: unresponsive or uncooperative government agencies, months-long backlogs, and everyone’s favorite, lots and lots of PDFs (PDFs are where data goes to die)! A relative of mine (we like to hold our governments accountable in my family) even told me a State government responded to her FOIL request saying it would cost them $20,000 to fulfill it, and if she cut them a check they’d happily oblige. I had never really been through the process first-hand, but last week, NYC’s Taxi and Limousine Commission tweeted a data-driven chart that caught my eye:
#metricmonday, they call it: a Twitter campaign that features a visualization made with taxi data. This time around it was a chart showing hourly taxi volume in a given week, highlighting the twice-daily dip in available cabs during shift change. Like any good chart, it sourced its data: “NYC TLC 2013 taxi tripsheet data”
My immediate response from the BetaNYC Twitter account was “Is the data available?”… I knew it wasn’t, but wanted to see what they’d say. I had seen this trip data manifested in several data animations, and had even seen a presentation on it by NYU Center for Urban Science and Progress researchers (ironically, in a class about the possibilities of Open Data). Search high and low, you won’t find this data available for download anywhere.
The TLC’s Twitter account responded quickly, stating that the data was easily FOILable and providing a link, which I followed. Several civic hackers responded, expressing concern over the unavailability of the data.
To my surprise, they accepted requests via email. All you had to do was fill out a PDF form. Well, it seemed like a good day to make my first FOIL request, so I used Apple Preview to paperlessly fill out and sign the request form, and fired it off via email.
Also to my surprise, I received a response only a few hours later:
You’re not only required to provide a large enough hard drive, but it must be “brand new, still in the box and unopened”, presumably for security reasons. This requirement is a bit silly in my opinion, and probably prevents a lot of would-be FOILers from getting this data, but I had made it this far, so I figured I’d keep going.
I grabbed a 500GB drive from Radio Shack and made plans to drop it off at the TLC’s offices in lower Manhattan. On the 22nd floor of 33 Beaver Street, I was courteously greeted by the same person who emailed me. I handed off the drive, and made arrangements to pick it up early the following morning.
So that’s it… I retrieved the drive and saw that all of the data they loaded only took up about 50GB. Overall, I have to say I was impressed with the TLC’s responsiveness, professionalism, and the fact that they allow email correspondence for this sort of thing in the first place. Despite the annoyance of two in-person trips and the expense of a brand new hard drive, I was able to go from viewing the FOIL information page to possessing the data in just 2 business days. (Not bad for what it is, but it doesn’t change the fact that this data should be open, API accessible, downloadable, and free for all to use. Size and complexity of the data are not an acceptable excuse in 2014.)
Enough with the FOILing, let’s check out the data! Here are a few screengrabs so you can get a feel for what’s included. I still haven’t figured out what exactly I’m going to do with it, but you can bet it will be animated and beautiful.
There are two folders of data, Faredata_2013 and Tripdata_2013. Each folder contains chunks of data in CSV format, ranging from ~1.5 to ~2.5 GB in size.
Fare data looks like this, showing medallion, hack_license, vendor_id, pickup date/time, payment type, fare, tip amount (look at all those zeros!), tolls, and total.
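For anyone curious about all those zeros, here is a rough Python sketch of how you might measure the share of zero-tip rows per payment type while streaming a fare file row by row (the chunks are too big to load casually). The column names `payment_type` and `tip_amount` are my assumptions based on the fields listed above, not confirmed header names, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def zero_tip_share(lines, payment_col="payment_type", tip_col="tip_amount"):
    """Return {payment_type: fraction of rows with a zero tip}.

    `lines` is any iterable of CSV text lines (an open file works), so a
    multi-gigabyte chunk is streamed row by row instead of loaded whole.
    The column names are assumptions, not confirmed against the real files.
    """
    reader = csv.DictReader(lines, skipinitialspace=True)
    totals = defaultdict(int)
    zeros = defaultdict(int)
    for row in reader:
        # Strip header names defensively in case they carry stray spaces.
        row = {key.strip(): value for key, value in row.items() if key}
        payment = row[payment_col].strip()
        totals[payment] += 1
        if float(row[tip_col]) == 0.0:
            zeros[payment] += 1
    return {payment: zeros[payment] / totals[payment] for payment in totals}


# Tiny made-up sample standing in for one fare file.
sample = io.StringIO(
    "medallion,hack_license,vendor_id,payment_type,tip_amount\n"
    "A1,H1,CMT,CSH,0.00\n"
    "A2,H2,CMT,CSH,0.00\n"
    "A3,H3,VTS,CRD,2.50\n"
    "A4,H4,VTS,CRD,0.00\n"
)
shares = zero_tip_share(sample)
```

To run it against a real chunk, you would pass an open file handle instead of the sample, for example `zero_tip_share(open(path, newline=""))` with `path` pointing at one of the fare CSVs.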
Trip data (the good stuff!) looks like this. Each file has about 14 million rows, and each row contains medallion, hack license, vendor id, rate code, store and forward flag, pickup date/time, dropoff date/time, passenger count, trip time in seconds, trip distance, and latitude/longitude coordinates for the pickup and dropoff locations. The possibilities are endless! I smell a tip analysis coming on!
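As a taste of what a single trip row makes possible, here is a small Python sketch that derives an average speed and a straight-line (great-circle) distance from one row. Everything named here is an assumption for illustration: the dictionary keys such as `trip_time_in_secs` and `pickup_latitude` are guesses at the header names, I am assuming trip distance is in miles, and the sample row is invented.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def summarize_trip(row):
    """Derive average speed and straight-line distance from one trip row.

    `row` maps column names to strings, as csv.DictReader would yield.
    The key names are hypothetical; adjust them to the real headers.
    """
    seconds = float(row["trip_time_in_secs"])
    miles = float(row["trip_distance"])  # assumed to be in miles
    straight = haversine_miles(
        float(row["pickup_latitude"]), float(row["pickup_longitude"]),
        float(row["dropoff_latitude"]), float(row["dropoff_longitude"]),
    )
    # Guard against zero-length trips so we never divide by zero.
    avg_mph = miles / (seconds / 3600) if seconds > 0 else None
    return {"avg_mph": avg_mph, "straight_line_miles": straight}


# Invented example row: a 10-minute, 2-mile ride in Midtown.
trip = {
    "trip_time_in_secs": "600",
    "trip_distance": "2.0",
    "pickup_latitude": "40.75",
    "pickup_longitude": "-73.99",
    "dropoff_latitude": "40.76",
    "dropoff_longitude": "-73.98",
}
summary = summarize_trip(trip)
```

Comparing the metered distance with the straight-line distance is one cheap way to spot rows with bad coordinates, since a trip can never be shorter than the line between its endpoints.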
Thanks for reading! I hope you enjoyed this, and that NYC’s Taxi trip data will be Open Data before too long.
Cover Photo by flickr user moonman82