Visualizing the MTA’s Turnstile Data

Here’s the Vimeo version.

I set out several months ago to visualize the MTA’s turnstile dataset. It’s updated weekly and resides here. The animation you see here was made in processing, but there were numerous steps required to prep the data into a format that could be pulled in. I’ve met with lots of people at various civic tech events over the past few months who have lamented about how hard to consume this dataset is, and I’m pleased that a little bit of scripting and elbow grease (finger grease, really, as in mouse clicking) has resulted in usable data. Observe:

A002,R051,02-00-00,04-13-13,00:00:00,REGULAR,004074010,001402167,04-13-13,04:00:00,REGULAR,004074045,001402171,04-13-13,08:00:00,REGULAR,004074062,001402206,04-13-13,12:00:00,REGULAR,004074197,001402309,04-13-13,16:00:00,REGULAR,004074441,001402368,04-13-13,20:00:00,REGULAR,004074829,001402442,04-14-13,00:00:00,REGULAR,004075007,001402466,04-14-13,04:00:00,REGULAR,004075041,001402475
A002,R051,02-00-00,04-14-13,08:00:00,REGULAR,004075059,001402492,04-14-13,12:00:00,REGULAR,004075141,001402556,04-14-13,16:00:00,REGULAR,004075345,001402600,04-14-13,20:00:00,REGULAR,004075638,001402634,04-15-13,00:00:00,REGULAR,004075737,001402664,04-15-13,04:00:00,REGULAR,004075744,001402665,04-15-13,08:00:00,REGULAR,004075791,001402774,04-15-13,12:00:00,REGULAR,004075982,001403053

Above are the first two of about 23,000 lines that constitute one week of turnstile data. In case you were wondering, both of these lines contain data for the same turnstile, but the first runs from midnight on 4/13 to 4:00am on 4/14. The second line goes from 8:00am on 4/14 to 12:00pm on 4/15.

So, each line consists fo three columns of identifying data, and then a sequence of columns with a timestamp,type of report, entry count, and exit count, which repeats 8 times! Best of all, it gives us running totals for each turnstile instead of just a number of entries or exits, so to get anything useful out of it, you need to do some subtraction in excel. Simply subtract your entry tally for one timestamp from the previous reading, which might be 5 columns to the left, or possibly on the previous line somewhere near the end. Easy peasy. To make things more complicated, not every turnstile has readings at 4 hours intervals, and some that do stick to 4 hours are slightly offset, going from 11:00 pm to 3:00 am, for example.

The first step was to write a Ruby script that would split these verbose lines into individually manageable parts. The script is available on github, and the results look like this:

A002,R051,02-00-00,04-13-13,00:00:00,REGULAR,004074010,001402167
A002,R051,02-00-00,04-13-13,04:00:00,REGULAR,004074045,001402171

Sure, there are now 291,000 lines, but now that we have each individual reading its own line, we can sort by unique ID, do some math with the previous line to get a solid number.

So, we’ve conquored the format challenges, but now we have a geocoding problem. But wait, doesn’t the GTFS data contain stops.txt, which contains a station identifier and a latitude and longitude? All we need to do is a join or a vlookup to assignn lats and longs to this dataset, right? Unfortunately, it’s not that simple. The turnstile dataset’s unique id for a station is called the Control Unit (Column 2 if you’re interested), and has nothing to do with the station_id field in the GTFS data.

Yesterday, a friend who is just as passionately nerdy about subways as me assisted with the very manual process of grabbing latitudes and longitudes for the 700+ lines in the MTA’s key for the Control Units. This task was made slightly more difficult by me not being very familiar with the system outside of Manhattan, and the fact that many stations can have the exact same name and be located miles apart on different lines. I digress. We go through it, I performed my vlookup, and moved the data into processing.

My vision for the video was to simulate the actual movements of people by animating the dots moving in and out of stations. The sketch grabs each line and displays it on its own, so there is no aggregating of data by station and trying to make sense of the nonstandard intervals. Since most of the data exists on 4 hours intervals, there were visible waves of activity. I got around this by offsetting the start and end times for each trip slightly, so that they did not take up the full time for their interval. For example, if a turnstile logged 200 entries for a 4 hour period, 4 dots would be drawn, but the start and end times of their movement would be staggered to blend the activity into the next time period.

Mission accomplished. Even at HD resolutions, it is still difficult to capture the entire NYC region and still see the detail I’d like to. Several people have asked for zoomed-in versions, and I will work on them in the coming weeks.

Leave a Reply

Your email address will not be published. Required fields are marked *