Taxi TechBlog 1: Data Prep and Backend

A week ago, I published NYC Taxis: A Day in the Life and it went a little bit viral. This seems to be the perfect mix of some brand new (for me) techniques that were a huge hit, along with subject matter that seemingly everyone can relate to. This is part 1 of 2 of a techblog about how I built this visualization, and will cover data munging and building the backend. The next post will talk about the frontend, including animations and charts.

My code is open source and available on github.

Viral Notes

But first, a few words about the launch and reception. The visualization was published in the early hours of Monday, July 14th, after I spent just about the whole day Sunday putting finishing touches on it and getting it to run on Heroku. I’d been working on it for over a month, sneaking in a few hours here and there on nights and weekends, and just wanted to get a minimum viable product launched so I could stop thinking about it (that backfired :p ). By Monday evening it had seen over 80,000 unique visitors, had around 900 to 1000 concurrent users, and had racked up a whopping 600,000 map views on Mapbox. It was picked up by FiveThirtyEight, The New York Times, Technically Brooklyn, Gizmodo, The New York Observer, Gothamist, TechPresident, Fast Company, Bloomberg, The Washington Post, Huffington Post, AM New York, New York Magazine, Quartz, Bustle, mashable, and, most unexpectedly (but so awesome), BuzzFeed. It even got coverage in the UK, The Netherlands and France.

On Monday afternoon, someone from Heroku (where the app is hosted) noticed the spike in traffic and the subsequent failed page loads, and contacted me to recommend turning on more dynos (I am not very experienced with heroku and had no idea what that meant, but it literally came down to dragging a slider, and the traffic became much more manageable.) Heroku also generously added me to their beta program to help cover the costs of the additional dynos.

Concerned about almost $1000 in overages on my Mapbox account in the first 12 hours, I took to the twittersphere asking the civic hacker community for some advice on how to deal with the situation. Monetize with ads? Add a paypal donate button? Switch to Open Street Maps’ free tile solutions? The chatter caught the attention of Mapbox, who got in touch with me within an hour and offered to sponsor the app! (I had already switched out the mapbox maps and put up a OSM tileset, but not having a dark, subdued basemap was suboptimal for the visualization. In any case, OSM rules and deserves a shoutout for making hosted, leaflet-compatible tiles available for the world to use)

Read the very long string of responses here. Mapbox stepped in offered to help!

At the time of this posting, the visualization has been shared over 4700 times on Facebook, 3700 times on Twitter, had 300,000 pageviews from 216,000 visitors.

Querying the Data

I’d FOILed the 2013 taxi trip data a few months ago. The blog post I wrote about it had gotten some attention, and resulted in dozens of requests to share the data. Many people came to the BetaNYC hacknights, hard disk in hand, to transfer the data on the spot. About a month ago, some of the requesters urged me enough to create a couple of torrents. At the same time, Andrés Moroy (someone I don’t know except via twitter) offered to host it so I finally got all 50GB zipped and uploaded.

Over the next couple of weeks, data scientists all over the world started publishing lots of interesting analyses of the data, and after someone loaded the data into Google BigQuery, a rather interesting thread erupted where people were showing off all sorts of ways to slice up the 173 Million Rows quickly. This would prove to remove a huge barrier to entry for me.

I’m not sure when the idea of “a day in the life” of a taxi came to mind, but it came directly from the simple question: “How much does a single taxi/driver earn in a single day/shift?” I’d heard that many drivers rent their shifts and need to earn back enough to break even before they take anything home. I’ve done similar static “Day in the life” map projects using the twitter API as well. Because the time and start/end location of each taxi trip was available, it just made sense to follow along to see when/how/where they earned their money over the course of he day.

I set out trying to figure out how to use BigQuery to get the data I wanted: A full day’s trips for a single medallion. With a hard-coded medallion and date, I was able to slap together a query that worked, but I had no idea how to pull all trips from a random cab on a random date, let alone man random cab/days. Reddit to the rescue! u/fhoffa, who started the Taxi Data BigQuery thread, helped me out in an hour with the exact query I needed:

The combination of having the data in BigQuery and a rather awesome community on Reddit already working on the data probably saved dozens of hours if I had tried to do all of this on my own. I can’t stress enough how important the community is to civic hacking projects like this.

So, data in hand, I see that there are around 30-60 trips a day for most of these vehicles. We have a time succession of geographic points, which is enough to string together into a line. This can be animated, as I’ve done in this animation of twitter user’s movements, simply moving a dot “as the crow flies” between the start and end points, and keeping accurate time with an accelerated clock. However, simply moving a dot between start and end points wasn’t going to cut it for this visualization. I wanted to show something that resembles a car actually driving, so I needed a way to trace out a reasonable driving path for each trip. Enter the Google Directions API.

BetaNYC recently had a series of events around the newly released Citibike Trip Data, which also includes only the start and end point/time for each trip. I’d heard several civic hackers profess the awesomeness of the API for inferring a possible route from a trip’s available data. Googling will get you to their documentation page, which has pretty much everything you need.

I built a few API calls between random spots in New York to see what kind of results I would be dealing with:

https://maps.googleapis.com/maps/api/directions/json?origin=40.631538,-73.965327&destination=40.691099,-73.991785&key={myKey}

gets us JSON:

{
   "routes" : [
      {
         "bounds" : {
            "northeast" : {
               "lat" : 40.692154,
               "lng" : -73.96506359999999
            },
            "southwest" : {
               "lat" : 40.6295283,
               "lng" : -74.0034151
            }
         },
         "copyrights" : "Map data ©2014 Google",
         "legs" : [
            {
               "distance" : {
                  "text" : "5.8 mi",
                  "value" : 9352
               },
               "duration" : {
                  "text" : "14 mins",
                  "value" : 850
               },
               "end_address" : "90 Court Street, Brooklyn, NY 11201, USA",
               "end_location" : {
                  "lat" : 40.6910763,
                  "lng" : -73.9917049
               },
               "start_address" : "716-722 Westminster Road, Brooklyn, NY 11230, USA",
               "start_location" : {
                  "lat" : 40.6315269,
                  "lng" : -73.9654273
               },
               "steps" : [
                  {
                     "distance" : {
                        "text" : "0.1 mi",
                        "value" : 213
                     },
                     "duration" : {
                        "text" : "1 min",
                        "value" : 26
                     },
                     "end_location" : {
                        "lat" : 40.6296317,
                        "lng" : -73.96506359999999
                     },
                     "html_instructions" : "Head \u003cb\u003esouth\u003c/b\u003e on \u003cb\u003eWestminster Rd\u003c/b\u003e toward \u003cb\u003eAvenue H\u003c/b\u003e",
                     "polyline" : {
                        "points" : "az~vF|jmbMzJiA"
                     },
                     "start_location" : {
                        "lat" : 40.6315269,
                        "lng" : -73.9654273
                     },
                     "travel_mode" : "DRIVING"
                  },

The route is our overall trip, which is divided into legs. This API call included only an origin and destination, so there’s only 1 leg. If we had added waypoints to our API call, we could have up to 9 legs returned.

Each leg has steps, which are basically chunks of the trip that don’t require some instruction to the driver, such as turning or merging onto a highway. Each step has its own html_instructions that are the same thing Google would show you for turn instructions if you got directions via google maps. Neato.

But wait, how does this help us map the step/leg/route? Here’s where it gets interesting. Each step includes a polyline, which looks like this:

"polyline" : {
                        "points" : "az~vF|jmbMzJiA"
                     }

What the heck does az~vF|jmbMzJiA mean? It turns out that this is google’s super-compressed format for encoding polylines. This string can be decoded into a series of latitude/longitude coordinates representing the path to be followed for this step! There’s a handy widget for decoding/encoding these and mapping them on the fly here. Give it a try. Copy az~vF|jmbMzJiA and paste it in, see what you get.

Further down in the JSON response from our API call, there’s an overview_polyline that shows the entire route strung together:

"overview_polyline" : {
            "points" : "az~vF|jmbMzJiAR~DgBPuJ~AEN[FyBb@_I|A}AV}Dz@yDr@_JfBuMfC}Bb@F|@^bHDHTjEXdG}MxAcJbAqEj@aFd@iBR}BXmBXmA\\kA`@aA^{KfFyD`BuAr@}AfAmAjAoAzAm@x@gFvJk@dAaAtByBdE_BzCyBhEoCnFy@lAiArAkAhAiBlAoEtCoDbCsApAkA|AcBvCiJ~QgAjBMj@}@hBo@|@s@v@sAdA_@PmBr@eBt@i@VwAnAsDtBaE|B_FjCyBlA_DzAi@XoBdAiGhDK@YJ_A\\iBx@yBrAuBvAkBvAgBzAUN[LYF]B[CWEqBq@kFuBgHoCkf@{Qk@Om@IeACY@OB_@G_@A[EQCc@[KWKc@He@F[@g@rDgScG{B{B{@jBoKhBr@"
         }

Go ahead, try that one in the decoder widget too. You’ll see that the results don’t seem to make sense, as the polyline starts on the roads and then wanders off in seemingly wild directions. This is because there are escape characters in the encoded string. Look above, every place you see “\\” is really meant to represent a single backslash. Try the encoder tool again, you should see a more sensible path.

With escape characters:

Without escape characters:

Now I needed a way to decode and encode these programatically. Google has a Geometry Library that you can use to decode these client-side, but I had plans to decode these server-side and convert them to geoJSON. Some googling led me to MapBox, who has released a kick-ass open source Node Package that encodes and decodes these google polylines! How cool is that?

The Plan

Now, all of the pieces are in order…

Use BigQuery to get a bunch of random trip/days Download them as a CSV
Write a node script to build out API calls for each series of 4 trips (Each API call can handle and origin/destination and up to 8 waypoints, meaning 9 total legs per call. Each taxi trip consists of two legs, the trip itself and the “downtime” between this trip and the next trip. So, we can efficiently handle 4 trips/8 legs per API call)
Append the polylines to the raw data, also append the start time of the next trip as a value for the current trip (so we know how long the “downtime” lasts)
Move everything into a sqlite database
Build a node server with a single endpoint, /trip, that will query the sqlite db for all trips for a random medallion, convert the results to geoJSON, and send the response to the browser.

getDirections.js opens our bigQuery results CSV, builds out API calls, and appends the appropriate polylines to the raw data. You can check out the code in my github repo. I won’t go into too much detail here, but I basically slice the number of trips for each taxi into groups of 4 or less, use the start point of the first trip as my origin, and the start point of the next group of 4 or the last stretch of downtime as my destination. The rest of the start and end points are waypoints. Once everything is stored in the appropriate strings/arrays, I build out my API call like this:

 var fullApiCall = apiBase + origin + destination + waypointStr + apiKeyStr;

I store all of the fullApiCall strings in an Array that I use later to actually make the calls and process the results.

There’s also a step to combine all of the polylines for the steps into a large polyline for each trip or downtime. This is done using the Mapbox package I talked about before, getting the encoded latitude and longitude coordinates, linking all of the steps together for a single trip, and then re-encoding them.

function getPolyline(leg){ //gets steps for leg, combines into one polyline
var legPoints = [];
  for(s in leg.steps){
    var points = leg.steps[s].polyline.points;
    points = polyline.decode(points);
    for(p in points){
     legPoints.push(points[p]);
    }
  }
  legPolyline = polyline.encode(legPoints);
  return legPolyline;
}

The raw data:

medallion,pickuptime,dropofftime,passengers,pickupx,pickupy,dropoffx,dropoffy,fare,paymenttype,surcharge,mtatax,tip,tolls,total
8139A6C9596767B37F84DACB7E200BDD,2/11/13 0:02,2/11/13 0:27,6,-73.863464,40.769997,-73.981827,40.772884,34,CRD,0.5,0.5,6.9,4.8,46.7

becomes:

medallion,pickuptime,dropofftime,passengers,pickupx,pickupy,dropoffx,dropoffy,fare,paymenttype,surcharge,mtatax,tip,tolls,total,nextpickuptime,key,trippolyline,nextpolyline
8139A6C9596767B37F84DACB7E200BDD,2/11/13 0:02,2/11/13 0:27,6,-73.863464,40.769997,-73.981827,40.772884,34,CRD,0.5,0.5,6.9,4.8,46.7,2/11/13 0:30,0,azywFtpyaMGFGHsGhK??AD?HAF?H?F?H?F?F@H@F@DBBBDJLBDPR`@d@??RNFDDDDBF@D@D@D?DADADABCDEDEFMd@m@j@iAjA_CHQ??REB?B?B?@?B@FDBB@@@D@D@D?F?D?DANCHCNEJEJILIJIPILEFCDCFEV??o@dAaAbBi@dAu@rA]v@[p@c@hASj@_@lA[jAQx@Y|AKp@Ip@I|@I|@Ev@Gr@Cp@KhACRE|@Ev@Av@??HT@F@LBN@NBPBNDPLn@H`@DVFb@?DBD@BBDDHJbADXBXPlAHf@Hf@\|ARt@Nh@Nj@^jAZz@`AdCtBnEvBzE`A~BNZf@tAt@vBpAdEb@~Ap@vCf@~Bb@`CTrAF\?@?@FV?B?@VbBFd@TdBRnBPlBDp@Dj@HzAB\@V?XBb@@f@?X?~@Ap@Al@ElAGfAI~@?@QjBEf@c@rFAF[jD[fE]|E}@rJYxDE\?@?B?@}@vJK~@?@MfA?BAB?FA?APAD?@Ef@_@rB_@vBi@vC??Gn@KnAOfBEhAEvASzA[vBO|AGf@Cj@Gx@AXIjCE|@E`@Gv@ANGz@Ch@WhCW|CAHY~CUjCEVG^Mn@Kb@O^Yl@Yf@CDYb@m@x@qAlBSZw@pAiAbBMRMREHA@QVk@z@KPW^?@A?aBdCCBs@hA_@h@ILUV_AhASPCDw@t@}AhAa@XoAz@}BzAkAz@oHdFGDYRGDCBYRiNrJYRKHA@ID}HzFw@j@w@h@u@b@YLe@Pi@LWF_@DYBa@Bq@?E?e@Cc@EmAUm@Ug@SCAe@YYSOM_Aw@y@w@uAiAq@k@[WCCKIII}@w@mA{@aBwASOSQkB_Ba@c@c@_@wAqAaCsBaGaF???[?AAAUU]_@KKY]a@c@i@o@IKOQACoAwAm@q@oAuA{BeC}@aAACw@aAs@{@GGGGIGICCCECMEKAOEKAM?K?M@MBKBKDKHKJILILENELALANAP?P?P?RAP?NCRATCREXCTEVEVI\GTOn@WzA??e@hAu@nBKVKXQb@GNGPUp@Qd@Qh@Ql@Y~@GTABCHo@|BADKZIVIVa@rAaFxO??IDA@A@OVQTSTQTIJKPA@ABELM`@Qd@ENCFITIZITGRCFITK\{A|E??A@ABADITGLGJGFEDEBC@E@C@G@K?YAKCOG??M`@??PHt@d@`@VZRVPjBpALHHDfBnAz@h@dAp@zB|Az@h@~@l@lAv@n@b@lAv@l@b@dAr@x@h@x@h@jAt@fAt@bAl@rBvAjAv@t@f@lAt@l@`@~B|AjAx@p@`@|B~AdAp@z@h@tDdChAr@??}D~LeC`IyAnEa@tAWr@oA|Dm@jBGLCHa@nA{AvE??jCbB^VXPBB~@l@lAx@l@`@rBrAHF|BxAfC~AxBvAxB|AnAx@j@`@??gAdD}@nCg@zAGPABIX_AxCUt@Ml@Kj@Gb@Ej@Gd@Mr@Md@Qh@Sb@Ub@U^_@l@]r@Yn@Uj@Wr@W|@IZENGTETEFABCHG\EXCRENCLGNEHIJQTIJKX[|@??mFxP??dAp@`Ap@tA|@p@b@zBzA~BzAzBzA`CvAzBxA|BzA|BxAzBdBl@b@vA~@fC`B|@j@`Al@|B~Aj@^VPx@h@d@XZNdAl@xBvAfCbBfCbBr@f@dAr@bCzA~B|A|BzA|BzAhCdBfC`BzB~AzBtA~BzAxB|A|BxAfAr@v@f@??Nk@,qmzwFjqpbMzEiO??|BxAnBrAJFz@j@FBz@f@??{BbHUl@aBhF??~BtAf@^

Importing a csv to sqlite can be done at the command line, and I won’t go into it too much here.

Next we build a node server. When someone hits nyctaxi.herokuapp.com/trip, pick a medallion at random:

 var getMedallion = "select medallion from `trips` order by random() limit 1";
db.each(getMedallion, function(err, result)

Then, get all of the rows that match that randomly chosen medallion:

function getRows(medallion,callback) {
 var statement = "select * from 'trips' where medallion = '" + medallion + "'";

 summary = [];

 db.serialize(function() {
 db.each(statement, function(err, result) {
 if (err) { console.log(err); }

 summary.push(result);
 },function(){
 console.log(summary.length + " rows found for this medallion");
 callback(summary);
 });
 });


}

Next a function called createGeojson() decodes the polyline for each trip and downtime, and creates a properly formatted geoJSON LineString. These are pushed to a featureCollection, and that featureCollection is sent back to the browser as a response.

So, hitting /trip gets us a geoJSON FeatureCollection for an entire day’s trips for a single medallion, ready for immediate visualization using D3’s d3.json() function!

It weighs in at 131kb for this trip. It would have been more efficient to decode the polylines client-side, but I really wanted to just have geoJSON ready to go for d3. Give it a try, copy the geoJSON result and paste it into GeoJSONLint and you’ll see a taxi’s trips for one day:

So there you have it, I combined raw taxi trip data with google directions API results, put them in a database and built a simple API endpoint to grab one random, and serve it up as proper geoJSON. Done-zo.

In the next post, I’ll detail how I built the frontend, designed the visualization and made the timing, animations, and charts work. Stay tuned, and thanks for reading!

24 Responses to "Taxi TechBlog 1: Data Prep and Backend"

Fred Wetzel says:

July 22, 2014 at 9:34 am

Fascinating to watch. Any way to include cumulative mileage? Thanks
Chris says:

July 22, 2014 at 9:55 am

Yes, but it will have to wait until I find more time to work on this… that is, unless some enterprising developer wants to add it in and do a pull request!
Allan Walker says:

July 31, 2014 at 11:06 am

HI Chris, firstly many thanks for your initiative on FOILing this dataset.

I think I’ve stumbled on something: trip distance in the data dump just looks – odd. However, it points to some insight: here’s my dataviz and post about my process.

http://allanwalkerit.tumblr.com/post/93410053442/nyc-taxis-did-you-go-for-a-ride-or-were-you-taken
Breaking Down July’s Best Data Visualization | Placemeter says:

August 1, 2014 at 2:54 pm

[…] biggest city. NYC Taxis: A Day in the Life captured imaginations worldwide. In the words of its creator, Chris […]
Taxi Techblog 2: Leaflet, D3, and other Frontend Stuff says:

August 3, 2014 at 8:36 pm

[…] is part 2 of my techblog about NYC Taxis: A Day in the Life. In part 1, I showed how I queried the necessary data, manipulated it a bit, and built a simple node server to […]
Taxi Techblog 2: Leaflet, D3, and other Frontend Fun says:

August 3, 2014 at 8:36 pm

[…] is part 2 of my techblog about NYC Taxis: A Day in the Life. In part 1, I showed how I queried the necessary data, manipulated it a bit, and built a simple node server to […]
New York-based Placemeter is turning disused smartphones into big data | SeekerBlog says:

August 5, 2014 at 7:23 pm

[…] The visualization NYC Taxis: A Day in the Life has gone server-crashing-viral. It was constructed by Placemeter developer Chris Whong (details here). […]
GEDDEM – Premium Tech and Lifestyle Digest. » The Complex Journey to A Day in the Life of a NYC Cab says:

August 12, 2014 at 9:53 am

[…] man behind the project is Chris Whong, and according to his post detailing his process, the genesis of the idea came from one simple question: “How much […]
Tom says:

August 20, 2014 at 11:06 am

Very cool. One small issue: the date/year shown is 1913.
Chris says:

August 24, 2014 at 5:26 pm

Thanks for typing up these blog posts, they’ve been very useful. I liked your Google Directions API hack so I published a wrapper for doing that style of bulk query: https://github.com/christophercliff/google-direction-getter
Chris says:

August 25, 2014 at 11:40 am

Awesome! Thanks so much for sharing!
Les pépites de la semaine (n°2) | Particité says:

October 7, 2014 at 4:47 am

[…] Un énorme travail de chiffres, de calculs et de cartes. Les taxis de New-York sont légendaires pour leurs couleurs […]
weekly 224 – 28.10.–03.11.2014 | weekly – semanario – s?pt?mânal – haftal?k – ?? says:

November 6, 2014 at 4:56 pm

[…] Whong blogs about NYC Taxis, a Taxi website that visualizes the data on a […]
Thiago Bueno says:

November 12, 2014 at 1:12 pm

Thank you very much for the explanation. Really entertaining and insightful 😉
Internship Blog Post 02 | longpigblog says:

April 15, 2015 at 4:21 am

[…] the ability to toggle the speed that the marker travelled at. Mr Wong very kindly created two blog posts where he details how he created the visualisation which would prove very useful especially when […]
Using API’s t o get route files from Google and Open Street Map | mylifemapping says:

May 18, 2015 at 1:31 pm

[…] for a look at the geoJSON text file that we get in return. Thanks to Chris Whong’s NYC Taxi blog for explaining this in […]
Map based web data visualization | River on Bridge... says:

June 24, 2015 at 2:30 am

[…] NYC Amazing data visualization of a daily life of taxi driver in NYC. The creator described the whole tech in the blog. Some one introduces a brief skill on this. 6. Digital Attack Map Showing live DDOS attack across […]
Martin Lavoie says:

October 10, 2015 at 12:33 pm

Hello!

amazing visualization and thanks for sharing the code. I have a question. I dowloaded your code and tried to run it locally (on a MAMP server) on my computer and it did not work. What did I miss? Sorry for the silly question but I am quite new to this kind of stuff…I have a lot to learn 🙂
Martin
Michael says:

October 15, 2015 at 3:43 am

Hi Chris,

I have wanted to thank you for sharing! It can be very useful so thanks lot!
Final Project: NYC Taxi Data – Drew Levitt says:

December 15, 2015 at 9:30 pm

[…] waiting for me, and I have some good resources to refer to: very impressive technical blog posts by Chris Whong and Todd Schneider point the way toward a formal database ecosystem much better suited to truly big […]
Rocky Ahmed says:

January 17, 2016 at 12:03 am

Chris has a great process on compiling data visualization, I am not a techie guy but I enjoy following every post you have and I learned from it specially the open source code. It’s good to learn all of this, I can understand my developer is a simple way what he is talking about.

http://www.irvingcab.com/limo-service-in-irving-tx/
Google BigQuery – ButMan World says:

May 11, 2016 at 8:21 am

[…] Taxi TechBlog 1: Data Prep and Backend […]
Multidimensional Maps That Present New Views on New York Metropolis - Atlas Obscura - NYC Breaking News says:

November 7, 2017 at 11:04 am

[…] in city planning at New York College’s Wagner Graduate Faculty of Public Service, achieved some viral fame in 2014 for his digital map NYC Taxis: A Day in the Life. His hypnotic visualization, which may be […]
manusha says:

March 8, 2019 at 6:26 am

Is it possible to do parallel visualization for all vehicles in a map?

Leave a Reply