MTA turnstile data and recommendations to WomenTechWomenYes

Our first project at the Metis Data Science bootcamp was to offer recommendations to the WomenTechWomenYes (WTWY) organization for the placement of volunteers to collect email addresses.

The email addresses will be used to distribute invitations to their annual gala. The purpose of the annual gala is to raise awareness about the organization and fundraising.

The MTA turnstile data was offered as a starting point. This data set gives an indication of subway-rider traffic in NYC. The data is available at this link.

I downloaded one month of data, corresponding to June, 2018 in four separate CSV files. I concatenated these files into a dataframe and conducted the exploratory data analysis with Python.

Preliminary exploration showed that there are no nulls in the data collected, yay!

However, there were a couple of duplicated data rows and some entry and exit numbers that did not make sense.

The duplicates were dropped for the analysis.

I was interested in looking at total traffic per station: for the overall period, per week, per day of the week, and per hour.

Of course, the conversion of the ‘date’ column from a string to a datetime object was necessary.

The entry and exit numbers per turnstile are cumulative. Therefore, the entries and exits for a time period are the delta of the entry and exit numbers between the beginning and the end of the observation period.

I used ‘groupby’ liberally to look at the individual turnstile data. First, I grouped the data by turnstile and used the ‘diff()’ function to create new columns for the time intervals between readings, deltas for entries and exits.

Then, negative and large deltas were spotted… and eliminated. These data points were about 14% of the data and I did not have a good assumption to replace these values. Some of the negative delta values can be explained if the turnstile counter was reset during the observation period.

After these values were out of the picture, I defined ‘total_traffic’ per turnstile as the sum of the delta entries and delta exits between readings

Again, ‘groupby’ came to the rescue. I grouped by ‘station’ and ‘linename’ to make sure that I was not aggregating all the stations that were on the same street. There are several stations with the same name, but they are at different locations and serve different lines.

The daily, weekly, and day of the week slices of the data were subsets of the ‘station’ and ‘linename’ groupby dataset. I added the corresponding additional index to see the desired segment.

The datetime module was very useful to identify weeks and day of the week.

The process chosen by the WTWY required several self-selecting steps:

  1. The person of interest needed to talk with the volunteer
  2. Provide an email address to the volunteer
  3. Read the invite email
  4. Show up at the gala.

I started with top 10 stations by traffic, and hoped that after the several self-selecting filtering steps, we would have a decent amount of attendees at the gala.

I decided to look at other potential options that may result in a greater success rate. After some googling I found an article from Tech Crunch examining the footprint of tech large companies in NYC. I figured that tech workers would likely be more interested in the gala.

This article was instrumental to see the location of the NYC offices for companies with the most tech employees.

The MTA turnstile data does not include geographical information. However, I found another dataset online that included geographical information for the MTA stations.

There was no common column for the turnstile data and the location data.

I had very limited time to finish this. Thus, instead of messing with the module fuzzywuzzy, I just created a ‘station_id’ column in the turnstile dataset, and manually entered the corresponding ‘station_id’ number as per the station name and description on the geographical dataset for the top ten stations by traffic.

Using Python, I merged the turnstile and geographical datasets using a left join on ‘station ID’ and exported the new dataframe to a CSV. Then, I plotted the top ten station on the New York City map with Tableau.

Alt

The employers with most tech employees in NYC are near the following stations:

  • Astor Pl
  • 8th at NYU
  • Herald SQ
  • Times SQ

The last two stations are also in the top ten by traffic!

Therefore, the final recommendation for the positioning of volunteers was based on a hybrid between tech company office locations and high traffic:

* Astor Pl  (Near IBM and Facebook)
* 8th at NYU  (Near IBM, Facebook and NYU)
* Herald SQ (Near Amazon and LinkedIn plus high traffic) 
* Times SQ. (Near IBM and Facebook)
* Grand Central Station (High traffic)
* Penn Station (High traffic)

IBM is by far the largest employer for tech workers in NYC. It may be beneficial to place a volunteer in front of the IBM building to collect email addresses for WTWY.

The days with highest subway station traffic are Tuesday through Thursday and the busiest time of day is from 4pm to 8pm.

Best wishes to WTWY with the gala!

Written on July 8, 2018