I set out several months ago to visualize the MTA’s turnstile dataset. It’s updated weekly and resides here. The animation you see here was made in Processing, but numerous steps were required to get the data into a format that could be pulled in. I’ve met lots of people at various civic tech events over the past few months who have lamented how hard this dataset is to consume, and I’m pleased that a little bit of scripting and elbow grease (finger grease, really, as in mouse clicking) has resulted in usable data. Observe:
Above are the first two of about 23,000 lines that constitute one week of turnstile data. In case you were wondering, both of these lines contain data for the same turnstile, but the first runs from midnight on 4/13 to 4:00am on 4/14. The second line goes from 8:00am on 4/14 to 12:00pm on 4/15.
So, each line consists of three columns of identifying data, followed by a group of four columns (timestamp, type of report, entry count, and exit count) that repeats eight times! Best of all, it gives us running totals for each turnstile instead of just a number of entries or exits, so to get anything useful out of it, you need to do some subtraction in Excel. Simply subtract your entry tally for one timestamp from the previous reading, which might be five columns to the left, or possibly on the previous line somewhere near the end. Easy peasy. To make things more complicated, not every turnstile has readings at four-hour intervals, and some that do stick to four hours are slightly offset, running from 11:00 pm to 3:00 am, for example.
The first step was to write a Ruby script to split these verbose lines into individually manageable parts. The script is available on GitHub, and the results look like this:
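The full script lives in the repo, but the core move can be sketched in a few lines of Ruby. The field layout here (three ID columns, then four-column groups of timestamp, report type, entries, exits) is taken from the description above, and the function name is my own:

```ruby
# Split one wide turnstile line (3 identifying columns plus repeating
# groups of timestamp, report type, entry count, exit count) into one
# row per reading. Field layout is assumed from the post's description.
def split_readings(line)
  fields = line.strip.split(",")
  id_cols = fields[0, 3]                 # booth, control unit, turnstile
  fields[3..-1].each_slice(4).map do |timestamp, desc, entries, exits|
    id_cols + [timestamp, desc, entries.to_i, exits.to_i]
  end
end
```

Each resulting row carries its own copy of the identifying columns, which is what makes the later sort-and-subtract step possible.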
Sure, there are now 291,000 lines, but with each individual reading on its own line, we can sort by unique ID and do some math with the previous line to get a solid number.
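That sort-and-subtract step can be sketched like so; the row layout matches the split sketch above (an assumption, not the script's exact output), and it relies on zero-padded timestamps sorting correctly as strings:

```ruby
# Turn running totals into per-interval counts: sort rows by turnstile
# and timestamp, then subtract each reading's entry/exit totals from
# the previous reading's. Rows are
# [booth, unit, scp, timestamp, desc, entries, exits].
def interval_counts(rows)
  rows.sort_by { |r| [r[0], r[1], r[2], r[3]] }
      .each_cons(2)
      .select { |prev, curr| prev[0, 3] == curr[0, 3] }  # same turnstile only
      .map do |prev, curr|
        curr[0, 4] + [curr[5] - prev[5], curr[6] - prev[6]]
      end
end
```

Note the guard against pairing readings from two different turnstiles: the first reading after a change of ID has no valid "previous" value to subtract.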
So, we’ve conquered the format challenges, but now we have a geocoding problem. But wait, doesn’t the GTFS data contain stops.txt, which contains a station identifier and a latitude and longitude? All we need to do is a join or a vlookup to assign lats and longs to this dataset, right? Unfortunately, it’s not that simple. The turnstile dataset’s unique ID for a station is called the Control Unit (column 2, if you’re interested), and it has nothing to do with the station_id field in the GTFS data.
Yesterday, a friend who is just as passionately nerdy about subways as I am assisted with the very manual process of grabbing latitudes and longitudes for the 700+ lines in the MTA’s key for the Control Units. This task was made slightly more difficult by my not being very familiar with the system outside of Manhattan, and by the fact that many stations share the exact same name while sitting miles apart on different lines. I digress. We got through it, I performed my vlookup, and moved the data into Processing.
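The vlookup has a direct code equivalent: build a hash from Control Unit to coordinates using the hand-geocoded key, then tag each row. The key file's column layout below is hypothetical, and the test fixture data is made up:

```ruby
require "csv"

# Spreadsheet vlookup, in code: read the hand-built geocoding key
# (assumed columns: control unit, station name, lat, lon) into a hash,
# then append [lat, lon] to each turnstile row. The Control Unit is
# column 2 of the turnstile data, i.e. index 1.
def geocode(rows, key_csv)
  coords = CSV.read(key_csv).each_with_object({}) do |(unit, _name, lat, lon), h|
    h[unit] = [lat.to_f, lon.to_f]
  end
  rows.map { |r| r + (coords[r[1]] || [nil, nil]) }
end
```

Rows whose Control Unit is missing from the key get nil coordinates rather than being silently dropped, which makes gaps in the manual geocoding easy to spot.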
My vision for the video was to simulate the actual movement of people by animating dots moving in and out of stations. The sketch grabs each line and displays it on its own, so there is no aggregating of data by station or trying to make sense of the nonstandard intervals. Since most of the data exists on four-hour intervals, there were visible waves of activity. I got around this by offsetting the start and end times for each trip slightly, so that they did not take up the full time of their interval. For example, if a turnstile logged 200 entries for a four-hour period, 4 dots would be drawn, but the start and end times of their movement would be staggered to blend the activity into the next time period.
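The staggering idea can be sketched as follows; the function name, the even spacing, and the per-dot travel time are all my assumptions about how one might implement it, not the sketch's actual code (which is in Processing):

```ruby
# Spread the dots for one reading evenly across its interval so activity
# blends into the next period instead of arriving in a single wave.
# Times are in seconds; returns one [start, end] pair per dot.
def stagger(interval_start, interval_length, dot_count, travel_time)
  dot_count.times.map do |i|
    offset = interval_length * i / dot_count.to_f  # even spacing across interval
    start = interval_start + offset
    [start, start + travel_time]
  end
end
```

With a four-hour (14,400-second) interval and four dots, departures land an hour apart rather than all at the interval boundary.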
Mission accomplished. Even at HD resolutions, it is difficult to capture the entire NYC region while still seeing the detail I’d like to. Several people have asked for zoomed-in versions, and I will work on them in the coming weeks.