RAMP Challenge: Predicting Air Passengers

This is the final project of MAP536: Python for Data Science taught at École Polytechnique. We were presented with the challenge to help an unnamed airline company (operating 100~200 seat airplanes) to predict how many passengers will travel on a given date in order to best allocate the planes among possible routes.

Available historical data included:

  • date
  • number of passengers
  • departure and arrival airports
  • how early the tickets were booked

We were permitted to use any relevant external data we could find. The goal was to build a model with the strongest predictive power.

The project was done in groups of two and I was very lucky to have collaborated with Jiayu Gan.

We obtained a final model implemented in XGBoost with 19 features (out of a pool of more than 200). Ranked by importance in descending order, these features are:

For a brief report without codes, please check the link: https://github.com/FerdinandYu/Predicting-Air-Passengers/blob/master/Final%20Report.pdf

I. Preliminaries

I.1. Given Data Exploration

In [30]:
# import required packages
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from datetime import date
import matplotlib.pyplot as plt
import imp
import numpy as np
pd.set_option('display.max_columns', None)
import seaborn as sns; sns.set()
from scipy import stats
import holidays
import geopy.distance
In [238]:
init_data = pd.read_csv("train.csv")
init_data['DateOfDeparture'] = pd.to_datetime(init_data['DateOfDeparture'], infer_datetime_format=True)
init_data = init_data.rename(columns={"DateOfDeparture": "Date"})
init_data.head(5)
Out[238]:
Date Departure Arrival WeeksToDeparture log_PAX std_wtd
0 2012-06-19 ORD DFW 12.875000 12.331296 9.812647
1 2012-09-10 LAS DEN 14.285714 10.775182 9.466734
2 2012-10-05 DEN LAX 10.863636 11.083177 9.035883
3 2011-10-09 ATL ORD 11.480000 11.169268 7.990202
4 2012-02-21 DEN SFO 11.450000 11.269364 9.517159
In [239]:
plt.figure(1,figsize = (10,5))

plt.subplot(121)
plt.hist(init_data.WeeksToDeparture)

plt.subplot(122)
plt.hist(init_data.log_PAX)

plt.show()

The covered dates in the original data were from 2011-09-01 to 2013-03-05. 20 airports (all in the US) were involved, see the list below:

PHX - Phoenix Sky Harbor International Airport; IAH - George Bush Intercontinental Airport; ATL - Hartsfield–Jackson Atlanta International Airport; DEN - Denver International Airport; SEA - Seattle-Tacoma International Airport; DTW - Detroit Metropolitan Wayne County Airport; EWR - Newark Liberty International Airport; BOS - Boston Logan International Airport; JFK - John F. Kennedy International Airport; MCO - Orlando International Airport; PHL - Philadelphia International Airport; MIA - Miami International Airport; MSP - Minneapolis−Saint Paul International Airport; LAX - Los Angeles International Airport; ORD - O'Hare International Airport; SFO - San Francisco International Airport; LAS - McCarran International Airport; CLT - Charlotte Douglas International Airport; DFW - Dallas/Fort Worth International Airport; LGA - LaGuardia Airport

From the histograms, we observe that WeeksToDeparture and log_PAX follow approximately a normal distribution.

I.2. External Data Collection

Predicting air passengers can be a challenging task when the original data contained only limited information about the departure and destination airports and ticket booking time. Many factors might have accounted for the actual number of passengers present on a specific airplane on a specific date, like the economic vibrancy of the two cities, the weather, the airport operating conditions, historical on-time rates, number of flights connecting the two cities, competition among airlines, a vacation or a special event, etc. We were given the weather data to begin with:

I.2.1 Weather Data

In [240]:
weather_data = pd.read_csv("weather_data.csv")
weather_data.head(1)
Out[240]:
Date AirPort Max TemperatureC Mean TemperatureC Min TemperatureC Dew PointC MeanDew PointC Min DewpointC Max Humidity Mean Humidity Min Humidity Max Sea Level PressurehPa Mean Sea Level PressurehPa Min Sea Level PressurehPa Max VisibilityKm Mean VisibilityKm Min VisibilitykM Max Wind SpeedKm/h Mean Wind SpeedKm/h Max Gust SpeedKm/h Precipitationmm CloudCover Events WindDirDegrees
0 2011-09-01 ATL 35 29 24 21 18 14 79 56 32 1022 1019 1017 16 16 11 19 6 26.0 0.00 3 NaN 129

⚠️We considered indicators of extreme weather which might lead to flight delays or cancellations like minimum temperature, maximum wind, visibility, etc. and explicitly included dummies for thunderstorms and snowstorms. However, after experimenting with several models using these weather indicators, we found them to be poor predictors of air passengers, so we decided to drop them. We then collected other datasets relevant to air traffic from public sources. We have been able to find detailed information about airport operations (departures/arrivals, passenger volume, etc.) and flights (on-time/delay, frequency, etc.) of each day on the Federal Aviation Administration (FAA)’s website, and general info about airports (locations) and routes (competing airlines, distances) on OpenFlights. Besides, we have included data about the socioeconomic conditions (GDP, population, etc.) of the cities obtained from Wikipedia and the Federal Reserve Bank (FRED). Considering the seasonal travel pattern, we have also included the major public and school holidays in the US. For a comprehensive list of the external data, see the table below:

Dataset Source
On-time & Delay statistics by airport FAA: https://aspm.faa.gov/asqp/sys/Airport.asp
Airport Geographic Coordinates OpenFlights: https://openflights.org/data.html#airport
Details about Flights between the Two Cities by Day FAA: https://aspm.faa.gov/apm/sys/AnalysisCP.asp
Route Distances Derived from airports’ geographic coordinates (using geopy.distance.distance() function based on Haversine formula)
Departure & Arrival Seats by Airport by Day FAA: https://aspm.faa.gov/tfms/sys/Airport.asp
Socioeconomic Conditions of Cities Served Wikipedia (keyword: city names); FRED: https://fred.stlouisfed.org/series
School Holidays https://www.feiertagskalender.ch/ferien.php?geo=3537&jahr=2012&klasse=0&hl=en
Public Holidays Holiday Package PyPI

I.2.2 Traffic Data:

In [241]:
data_passenger = pd.read_csv('TFMSC Report.csv',thousands=',')
#remove business aviation
data_passenger = data_passenger[data_passenger['Business_Aviation'] == 'No']
data_passenger['Date'] = pd.to_datetime(data_passenger['Date'], infer_datetime_format=True)
#drop useless cols
data_passenger = data_passenger[['Date', 'Airport', 'Departures', 'Arrivals', 'Departure_Seats', 'Arrival_Seats']]
#modify AIRPORT Column
data_passenger['Airport'] = data_passenger['Airport'].apply(lambda s: s[:3])

#groupby AIRPORT-DAY
data_passenger = data_passenger.groupby(['Date','Airport']).sum()
data_passenger.head(1)
Out[241]:
Departures Arrivals Departure_Seats Arrival_Seats
Date Airport
2011-09-01 ATL 1171 1171 148950 148849
In [242]:
data_passenger = data_passenger.rename(
    columns={'Departures': 'd_Departures',
            'Arrivals': 'd_Arrivals',
            'Departure_Seats': 'd_Departure_Seats',
            'Arrival_Seats': 'd_Arrival_Seats'})

ext_data = pd.merge(
            init_data, data_passenger, how='left',
            left_on=['Date', 'Departure'],
            right_on=['Date', 'Airport'],
            sort=False)

data_passenger = data_passenger.rename(
    columns={'d_Departures': 'a_Departures',
            'd_Arrivals': 'a_Arrivals',
            'd_Departure_Seats': 'a_Departure_Seats',
            'd_Arrival_Seats': 'a_Arrival_Seats'})

ext_data = pd.merge(
            ext_data, data_passenger, how='left',
            left_on=['Date', 'Arrival'],
            right_on=['Date', 'Airport'],
            sort=False)
In [243]:
ext_data.head(1)
Out[243]:
Date Departure Arrival WeeksToDeparture log_PAX std_wtd d_Departures d_Arrivals d_Departure_Seats d_Arrival_Seats a_Departures a_Arrivals a_Departure_Seats a_Arrival_Seats
0 2012-06-19 ORD DFW 12.875 12.331296 9.812647 1041.0 1042.0 100665.0 100511.0 822.0 824.0 94185.0 94813.0

I.2.3 Geo Data:

In [244]:
data_geo = pd.read_csv('airports_details.csv')
AIRLIST = ['ATL', 'BOS', 'CLT', 'DEN', 'DFW', 'DTW', 'EWR', 'IAH', 'JFK', 'LAS', 'LAX', 'LGA', 'MCO',
          'MIA', 'MSP', 'ORD', 'PHL', 'PHX', 'SEA', 'SFO']
data_geo['VALID'] = data_geo['IATA'].apply(lambda r: (r in AIRLIST))
data_geo = data_geo[data_geo['VALID']==True]
data_geo = data_geo.drop('VALID', axis=1)
data_geo.head(1)
Out[244]:
IATA Lat Lon Alt
3249 BOS 42.3643 -71.005203 20
In [245]:
data_geo = data_geo.rename(
    columns={'Lat': 'd_Lat',
            'Lon': 'd_Lon',
            'Alt': 'd_Alt'})

ext_data = pd.merge(
            ext_data, data_geo, how='left',
            left_on=['Departure'],
            right_on=['IATA'],
            sort=False)

data_geo = data_geo.rename(
    columns={'d_Lat': 'a_Lat',
            'd_Lon': 'a_Lon',
            'd_Alt': 'a_Alt'})

ext_data = pd.merge(
            ext_data, data_geo, how='left',
            left_on=['Arrival'],
            right_on=['IATA'],
            sort=False)

ext_data['Lat_Delta'] = ext_data['a_Lat'] - ext_data['d_Lat']
ext_data['Lon_Delta'] = ext_data['a_Lon'] - ext_data['d_Lon']

ext_data=ext_data.drop('IATA_x', axis = 1)
ext_data=ext_data.drop('IATA_y', axis = 1)
In [246]:
ext_data['distance_KM'] = ext_data.apply(lambda row: geopy.distance.vincenty((row['d_Lat'], row['d_Lon']),
                                                                              (row['a_Lat'], row['a_Lon'])).km, axis=1)
In [247]:
ext_data.head(1)
Out[247]:
Date Departure Arrival WeeksToDeparture log_PAX std_wtd d_Departures d_Arrivals d_Departure_Seats d_Arrival_Seats a_Departures a_Arrivals a_Departure_Seats a_Arrival_Seats d_Lat d_Lon d_Alt a_Lat a_Lon a_Alt Lat_Delta Lon_Delta distance_KM
0 2012-06-19 ORD DFW 12.875 12.331296 9.812647 1041.0 1042.0 100665.0 100511.0 822.0 824.0 94185.0 94813.0 41.9786 -87.9048 672 32.896801 -97.038002 607 -9.081799 -9.133202 1290.346856

I.2.4 Holiday Data:

In [248]:
def daysDerinere(ls):
    R = 100
    for day in ls:
        if((day >= 0) & (day < R)):
            R = day
    return R

def daysProchaine(ls):
    R = -100
    for day in ls:
        if((day <= 0) & (day > R)):
            R = day
    return R        
In [249]:
# Get list of holidays between 20110701 and 20130605
a = holidays.US()[date(2011,7, 1):date(2013,6, 5)]
b = holidays.US()[date(2012,1, 1):date(2012,12, 31)]
c = a+b

# Calculate NEXT/LAST distance matrix
R = pd.DataFrame(pd.date_range(start='9/1/2011', end='3/5/2013'), columns=['Date'])
R['isHoliday'] = R['Date'].apply(lambda x: int(x in c))

for hd in c:
    R[str(hd)] = R['Date'].apply(lambda x: (pd.to_datetime(x)-pd.to_datetime(hd)).days)

R['DaysToDerinere'] = R.apply(lambda x: daysDerinere(x[2:23]),axis = 1)
R['DaysToProchaine'] = R.apply(lambda x: daysProchaine(x[2:23]),axis = 1)

R = R[['Date','isHoliday', 'DaysToDerinere', 'DaysToProchaine']]
R['Date'] = pd.to_datetime(R['Date'], infer_datetime_format=True)
In [250]:
ext_data = pd.merge(
            ext_data, R, how='left',
            left_on=['Date'],
            right_on=['Date'],
            sort=False)
ext_data.head(1)
Out[250]:
Date Departure Arrival WeeksToDeparture log_PAX std_wtd d_Departures d_Arrivals d_Departure_Seats d_Arrival_Seats a_Departures a_Arrivals a_Departure_Seats a_Arrival_Seats d_Lat d_Lon d_Alt a_Lat a_Lon a_Alt Lat_Delta Lon_Delta distance_KM isHoliday DaysToDerinere DaysToProchaine
0 2012-06-19 ORD DFW 12.875 12.331296 9.812647 1041.0 1042.0 100665.0 100511.0 822.0 824.0 94185.0 94813.0 41.9786 -87.9048 672 32.896801 -97.038002 607 -9.081799 -9.133202 1290.346856 0 22 -15

I.2.5 Date & Time Data

In [251]:
ext_data['year'] = ext_data['Date'].dt.year
ext_data['month'] = ext_data['Date'].dt.month
ext_data['weekday'] = ext_data['Date'].dt.weekday
ext_data['week'] = ext_data['Date'].dt.week
ext_data['n_days'] = ext_data['Date'].apply(lambda date: (date - pd.to_datetime("2011-01-01")).days)
In [252]:
ext_data['wd_5'] = ext_data['weekday'].apply(lambda x: 1 if x == 5 else 0)
In [253]:
ext_data.head(1)
Out[253]:
Date Departure Arrival WeeksToDeparture log_PAX std_wtd d_Departures d_Arrivals d_Departure_Seats d_Arrival_Seats a_Departures a_Arrivals a_Departure_Seats a_Arrival_Seats d_Lat d_Lon d_Alt a_Lat a_Lon a_Alt Lat_Delta Lon_Delta distance_KM isHoliday DaysToDerinere DaysToProchaine year month weekday week n_days wd_5
0 2012-06-19 ORD DFW 12.875 12.331296 9.812647 1041.0 1042.0 100665.0 100511.0 822.0 824.0 94185.0 94813.0 41.9786 -87.9048 672 32.896801 -97.038002 607 -9.081799 -9.133202 1290.346856 0 22 -15 2012 6 1 25 535 0

I.2.6 Socioeconomic Data:

In [254]:
data_gdp = pd.read_csv('Pop_GDP_by_city.csv', thousands=',')
In [255]:
data_gdp = data_gdp.rename(
    columns={'Pop2018': 'd_POP',
            'GDP_B': 'd_GDP'})

ext_data = pd.merge(
            ext_data, data_gdp, how='left',
            left_on=['Departure'],
            right_on=['Airport'],
            sort=False)

data_gdp = data_gdp.rename(
    columns={'d_POP': 'a_POP',
            'd_GDP': 'a_GDP'})

ext_data = pd.merge(
            ext_data, data_gdp, how='left',
            left_on=['Arrival'],
            right_on=['Airport'],
            sort=False)

ext_data=ext_data.drop(['Airport_x'], axis = 1)
ext_data=ext_data.drop(['Airport_y'], axis = 1)

ext_data['d_POP_log']=np.log(ext_data['d_POP'])
ext_data['a_POP_log']=np.log(ext_data['a_POP'])
ext_data['d_GDP_log']=np.log(ext_data['d_GDP'])
ext_data['a_GDP_log']=np.log(ext_data['a_GDP'])
In [256]:
ext_data.head(1)
Out[256]:
Date Departure Arrival WeeksToDeparture log_PAX std_wtd d_Departures d_Arrivals d_Departure_Seats d_Arrival_Seats a_Departures a_Arrivals a_Departure_Seats a_Arrival_Seats d_Lat d_Lon d_Alt a_Lat a_Lon a_Alt Lat_Delta Lon_Delta distance_KM isHoliday DaysToDerinere DaysToProchaine year month weekday week n_days wd_5 d_POP d_GDP a_POP a_GDP d_POP_log a_POP_log d_GDP_log a_GDP_log
0 2012-06-19 ORD DFW 12.875 12.331296 9.812647 1041.0 1042.0 100665.0 100511.0 822.0 824.0 94185.0 94813.0 41.9786 -87.9048 672 32.896801 -97.038002 607 -9.081799 -9.133202 1290.346856 0 22 -15 2012 6 1 25 535 0 2705994 680 1345047 535 14.81098 14.11194 6.522093 6.282267

I.2.7 On-time Data:

In [257]:
data_on_time = pd.read_csv("On_time_NAS_V2.csv",thousands=',')
data_on_time['Date'] = pd.to_datetime(data_on_time['Date'], infer_datetime_format=True)
data_on_time.head(1)
Out[257]:
Facility Date Total_Flights All_Causes_Perc Extreme_Weather_Only_Perc Carrier_Cause_Only_Perc NAS_Cause_Only_Perc Security_Cause_Only_Perc NAS_n_Prorated_Perc Carrier_n_Prorated_Perc
0 ATL 2011-09-01 1130.0 90.27 100.0 97.7 96.73 100.0 95.22 96.64
In [258]:
ext_data = pd.merge(
            ext_data, data_on_time, how='left',
            left_on=['Departure','Date'],
            right_on=['Facility', 'Date'],
            sort=False)
ext_data = pd.merge(
            ext_data, data_on_time, how='left',
            left_on=['Arrival', 'Date'],
            right_on=['Facility', 'Date'],
            sort=False)
ext_data = ext_data.drop(['Facility_x', 'Facility_y'], axis=1)
ext_data.head(1)
Out[258]:
Date Departure Arrival WeeksToDeparture log_PAX std_wtd d_Departures d_Arrivals d_Departure_Seats d_Arrival_Seats a_Departures a_Arrivals a_Departure_Seats a_Arrival_Seats d_Lat d_Lon d_Alt a_Lat a_Lon a_Alt Lat_Delta Lon_Delta distance_KM isHoliday DaysToDerinere DaysToProchaine year month weekday week n_days wd_5 d_POP d_GDP a_POP a_GDP d_POP_log a_POP_log d_GDP_log a_GDP_log Total_Flights_x All_Causes_Perc_x Extreme_Weather_Only_Perc_x Carrier_Cause_Only_Perc_x NAS_Cause_Only_Perc_x Security_Cause_Only_Perc_x NAS_n_Prorated_Perc_x Carrier_n_Prorated_Perc_x Total_Flights_y All_Causes_Perc_y Extreme_Weather_Only_Perc_y Carrier_Cause_Only_Perc_y NAS_Cause_Only_Perc_y Security_Cause_Only_Perc_y NAS_n_Prorated_Perc_y Carrier_n_Prorated_Perc_y
0 2012-06-19 ORD DFW 12.875 12.331296 9.812647 1041.0 1042.0 100665.0 100511.0 822.0 824.0 94185.0 94813.0 41.9786 -87.9048 672 32.896801 -97.038002 607 -9.081799 -9.133202 1290.346856 0 22 -15 2012 6 1 25 535 0 2705994 680 1345047 535 14.81098 14.11194 6.522093 6.282267 931.0 88.18 99.89 96.67 97.85 100.0 95.6 93.02 786.0 93.0 99.75 97.58 98.6 99.87 97.71 96.06

I.2.8 Reliability Data:

In [259]:
data_reliability= pd.read_csv("Airport View - Schedule Reliability Report.csv",thousands=',')
data_reliability['Date'] = pd.to_datetime(data_reliability['Date'], infer_datetime_format=True)
data_reliability.head(1)
Out[259]:
Facility Date Actual_Departures Actual_Arrivals Departure_Cancellations Arrival_Cancellations Departure_Diversions Arrival_Diversions On_Time_Arrivals %_On_Time_Gate_Departures %_On_Time_Gate_Arrivals Average_Gate_Departure_Delay Average_Gate_Arrival_Delay Average_Block_Delay Average_Taxi_Out_Time Average_Taxi_in_Time Delayed_Arrivals Average_Delay_Per_Delayed_Arrival Percent_Schedule_Arrival_Reliability Percent_Schedule_Departure_Reliability
0 ATL 2011-09-01 1129.0 1130.0 4.0 4.0 3.0 0.0 1013.0 90.26 89.65 6.08 5.82 2.56 20.22 12.86 117.0 44.24 89.29 89.64
In [260]:
ext_data = pd.merge(
            ext_data, data_reliability, how='left',
            left_on=['Departure','Date'],
            right_on=['Facility', 'Date'],
            sort=False)
ext_data = pd.merge(
            ext_data, data_reliability, how='left',
            left_on=['Arrival', 'Date'],
            right_on=['Facility', 'Date'],
            sort=False)
ext_data = ext_data.drop(['Facility_x', 'Facility_y'], axis=1)
ext_data.head(1)
Out[260]:
Date Departure Arrival WeeksToDeparture log_PAX std_wtd d_Departures d_Arrivals d_Departure_Seats d_Arrival_Seats a_Departures a_Arrivals a_Departure_Seats a_Arrival_Seats d_Lat d_Lon d_Alt a_Lat a_Lon a_Alt Lat_Delta Lon_Delta distance_KM isHoliday DaysToDerinere DaysToProchaine year month weekday week n_days wd_5 d_POP d_GDP a_POP a_GDP d_POP_log a_POP_log d_GDP_log a_GDP_log Total_Flights_x All_Causes_Perc_x Extreme_Weather_Only_Perc_x Carrier_Cause_Only_Perc_x NAS_Cause_Only_Perc_x Security_Cause_Only_Perc_x NAS_n_Prorated_Perc_x Carrier_n_Prorated_Perc_x Total_Flights_y All_Causes_Perc_y Extreme_Weather_Only_Perc_y Carrier_Cause_Only_Perc_y NAS_Cause_Only_Perc_y Security_Cause_Only_Perc_y NAS_n_Prorated_Perc_y Carrier_n_Prorated_Perc_y Actual_Departures_x Actual_Arrivals_x Departure_Cancellations_x Arrival_Cancellations_x Departure_Diversions_x Arrival_Diversions_x On_Time_Arrivals_x %_On_Time_Gate_Departures_x %_On_Time_Gate_Arrivals_x Average_Gate_Departure_Delay_x Average_Gate_Arrival_Delay_x Average_Block_Delay_x Average_Taxi_Out_Time_x Average_Taxi_in_Time_x Delayed_Arrivals_x Average_Delay_Per_Delayed_Arrival_x Percent_Schedule_Arrival_Reliability_x Percent_Schedule_Departure_Reliability_x Actual_Departures_y Actual_Arrivals_y Departure_Cancellations_y Arrival_Cancellations_y Departure_Diversions_y Arrival_Diversions_y On_Time_Arrivals_y %_On_Time_Gate_Departures_y %_On_Time_Gate_Arrivals_y Average_Gate_Departure_Delay_y Average_Gate_Arrival_Delay_y Average_Block_Delay_y Average_Taxi_Out_Time_y Average_Taxi_in_Time_y Delayed_Arrivals_y Average_Delay_Per_Delayed_Arrival_y Percent_Schedule_Arrival_Reliability_y Percent_Schedule_Departure_Reliability_y
0 2012-06-19 ORD DFW 12.875 12.331296 9.812647 1041.0 1042.0 100665.0 100511.0 822.0 824.0 94185.0 94813.0 41.9786 -87.9048 672 32.896801 -97.038002 607 -9.081799 -9.133202 1290.346856 0 22 -15 2012 6 1 25 535 0 2705994 680 1345047 535 14.81098 14.11194 6.522093 6.282267 931.0 88.18 99.89 96.67 97.85 100.0 95.6 93.02 786.0 93.0 99.75 97.58 98.6 99.87 97.71 96.06 933.0 931.0 9.0 13.0 0.0 0.0 788.0 75.99 84.64 13.87 11.75 1.12 15.12 8.44 143.0 71.16 83.24 75.03 789.0 786.0 4.0 6.0 0.0 1.0 702.0 84.66 89.31 8.25 8.85 1.54 14.29 8.61 84.0 76.02 88.42 84.16

I.2.9 Inter-airport Traffic Data:

In [261]:
data_citypairs = pd.read_csv("citypairs.csv",thousands=',')
data_citypairs['Date'] = pd.to_datetime(data_citypairs['Date'], infer_datetime_format=True)
data_citypairs.head(1)
Out[261]:
Departure Arrival Date Flight_Count On_Time_Gate_Departures On_Time_Airport_Departures On_Time_Gate_Arrivals Arrivals_With_EDCT Average_EDCT Gate_Departure_Delay Taxi_Out_Delay Average_Taxi_Out_Time Airport_Departure_Delay Airborne_Delay Taxi_In_Delay Block_Delay Gate_Arrival_Delay
0 ATL BOS 2011-09-01 18 100.0 88.89 94.44 0 0.0 1.44 10.51 22.89 7.89 0.5 4.07 2.06 1.61
In [262]:
ext_data = pd.merge(
            ext_data, data_citypairs, how='left',
            left_on=['Departure', 'Arrival', 'Date'],
            right_on=['Departure', 'Arrival', 'Date'],
            sort=False)
ext_data.head(1)
Out[262]:
Date Departure Arrival WeeksToDeparture log_PAX std_wtd d_Departures d_Arrivals d_Departure_Seats d_Arrival_Seats a_Departures a_Arrivals a_Departure_Seats a_Arrival_Seats d_Lat d_Lon d_Alt a_Lat a_Lon a_Alt Lat_Delta Lon_Delta distance_KM isHoliday DaysToDerinere DaysToProchaine year month weekday week n_days wd_5 d_POP d_GDP a_POP a_GDP d_POP_log a_POP_log d_GDP_log a_GDP_log Total_Flights_x All_Causes_Perc_x Extreme_Weather_Only_Perc_x Carrier_Cause_Only_Perc_x NAS_Cause_Only_Perc_x Security_Cause_Only_Perc_x NAS_n_Prorated_Perc_x Carrier_n_Prorated_Perc_x Total_Flights_y All_Causes_Perc_y Extreme_Weather_Only_Perc_y Carrier_Cause_Only_Perc_y NAS_Cause_Only_Perc_y Security_Cause_Only_Perc_y NAS_n_Prorated_Perc_y Carrier_n_Prorated_Perc_y Actual_Departures_x Actual_Arrivals_x Departure_Cancellations_x Arrival_Cancellations_x Departure_Diversions_x Arrival_Diversions_x On_Time_Arrivals_x %_On_Time_Gate_Departures_x %_On_Time_Gate_Arrivals_x Average_Gate_Departure_Delay_x Average_Gate_Arrival_Delay_x Average_Block_Delay_x Average_Taxi_Out_Time_x Average_Taxi_in_Time_x Delayed_Arrivals_x Average_Delay_Per_Delayed_Arrival_x Percent_Schedule_Arrival_Reliability_x Percent_Schedule_Departure_Reliability_x Actual_Departures_y Actual_Arrivals_y Departure_Cancellations_y Arrival_Cancellations_y Departure_Diversions_y Arrival_Diversions_y On_Time_Arrivals_y %_On_Time_Gate_Departures_y %_On_Time_Gate_Arrivals_y Average_Gate_Departure_Delay_y Average_Gate_Arrival_Delay_y Average_Block_Delay_y Average_Taxi_Out_Time_y Average_Taxi_in_Time_y Delayed_Arrivals_y Average_Delay_Per_Delayed_Arrival_y Percent_Schedule_Arrival_Reliability_y Percent_Schedule_Departure_Reliability_y Flight_Count On_Time_Gate_Departures On_Time_Airport_Departures On_Time_Gate_Arrivals Arrivals_With_EDCT Average_EDCT Gate_Departure_Delay Taxi_Out_Delay Average_Taxi_Out_Time Airport_Departure_Delay Airborne_Delay Taxi_In_Delay Block_Delay Gate_Arrival_Delay
0 2012-06-19 ORD DFW 12.875 12.331296 9.812647 1041.0 1042.0 100665.0 100511.0 822.0 824.0 94185.0 94813.0 41.9786 -87.9048 672 32.896801 -97.038002 607 -9.081799 -9.133202 1290.346856 0 22 -15 2012 6 1 25 535 0 2705994 680 1345047 535 14.81098 14.11194 6.522093 6.282267 931.0 88.18 99.89 96.67 97.85 100.0 95.6 93.02 786.0 93.0 99.75 97.58 98.6 99.87 97.71 96.06 933.0 931.0 9.0 13.0 0.0 0.0 788.0 75.99 84.64 13.87 11.75 1.12 15.12 8.44 143.0 71.16 83.24 75.03 789.0 786.0 4.0 6.0 0.0 1.0 702.0 84.66 89.31 8.25 8.85 1.54 14.29 8.61 84.0 76.02 88.42 84.16 26.0 80.77 80.77 92.31 0.0 0.0 13.69 2.45 13.35 15.62 2.46 2.43 0.54 11.19

I.2.10 Airlines Data:

In [263]:
data_routes = pd.read_csv("Routes between Airports.csv")
In [264]:
airports = list(set(ext_data.Departure))
data_routes['dep_valid'] = data_routes['Departure'].apply(lambda x: x in airports)
data_routes['arr_valid'] = data_routes['Arrival'].apply(lambda x: x in airports)
data_routes = data_routes[(data_routes['dep_valid']==True) & (data_routes['arr_valid']==True)][['Airline','Departure','Arrival','Codeshare','Stops']]
data_routes.head(1)
Out[264]:
Airline Departure Arrival Codeshare Stops
996 4B LAS LAX NaN 0
In [265]:
airlines = data_routes.groupby(by = ['Departure', 'Arrival']).count().reset_index()[['Departure', 'Arrival', 'Airline']]
airlines.head(1)
Out[265]:
Departure Arrival Airline
0 ATL BOS 8
In [266]:
ext_data = pd.merge(
            ext_data, airlines, how='left',
            left_on=['Departure', 'Arrival'],
            right_on=['Departure', 'Arrival'],
            sort=False)
ext_data.head(1)
Out[266]:
Date Departure Arrival WeeksToDeparture log_PAX std_wtd d_Departures d_Arrivals d_Departure_Seats d_Arrival_Seats a_Departures a_Arrivals a_Departure_Seats a_Arrival_Seats d_Lat d_Lon d_Alt a_Lat a_Lon a_Alt Lat_Delta Lon_Delta distance_KM isHoliday DaysToDerinere DaysToProchaine year month weekday week n_days wd_5 d_POP d_GDP a_POP a_GDP d_POP_log a_POP_log d_GDP_log a_GDP_log Total_Flights_x All_Causes_Perc_x Extreme_Weather_Only_Perc_x Carrier_Cause_Only_Perc_x NAS_Cause_Only_Perc_x Security_Cause_Only_Perc_x NAS_n_Prorated_Perc_x Carrier_n_Prorated_Perc_x Total_Flights_y All_Causes_Perc_y Extreme_Weather_Only_Perc_y Carrier_Cause_Only_Perc_y NAS_Cause_Only_Perc_y Security_Cause_Only_Perc_y NAS_n_Prorated_Perc_y Carrier_n_Prorated_Perc_y Actual_Departures_x Actual_Arrivals_x Departure_Cancellations_x Arrival_Cancellations_x Departure_Diversions_x Arrival_Diversions_x On_Time_Arrivals_x %_On_Time_Gate_Departures_x %_On_Time_Gate_Arrivals_x Average_Gate_Departure_Delay_x Average_Gate_Arrival_Delay_x Average_Block_Delay_x Average_Taxi_Out_Time_x Average_Taxi_in_Time_x Delayed_Arrivals_x Average_Delay_Per_Delayed_Arrival_x Percent_Schedule_Arrival_Reliability_x Percent_Schedule_Departure_Reliability_x Actual_Departures_y Actual_Arrivals_y Departure_Cancellations_y Arrival_Cancellations_y Departure_Diversions_y Arrival_Diversions_y On_Time_Arrivals_y %_On_Time_Gate_Departures_y %_On_Time_Gate_Arrivals_y Average_Gate_Departure_Delay_y Average_Gate_Arrival_Delay_y Average_Block_Delay_y Average_Taxi_Out_Time_y Average_Taxi_in_Time_y Delayed_Arrivals_y Average_Delay_Per_Delayed_Arrival_y Percent_Schedule_Arrival_Reliability_y Percent_Schedule_Departure_Reliability_y Flight_Count On_Time_Gate_Departures On_Time_Airport_Departures On_Time_Gate_Arrivals Arrivals_With_EDCT Average_EDCT Gate_Departure_Delay Taxi_Out_Delay Average_Taxi_Out_Time Airport_Departure_Delay Airborne_Delay Taxi_In_Delay Block_Delay Gate_Arrival_Delay Airline
0 2012-06-19 ORD DFW 12.875 12.331296 9.812647 1041.0 1042.0 100665.0 100511.0 822.0 824.0 94185.0 94813.0 41.9786 -87.9048 672 32.896801 -97.038002 607 -9.081799 -9.133202 1290.346856 0 22 -15 2012 6 1 25 535 0 2705994 680 1345047 535 14.81098 14.11194 6.522093 6.282267 931.0 88.18 99.89 96.67 97.85 100.0 95.6 93.02 786.0 93.0 99.75 97.58 98.6 99.87 97.71 96.06 933.0 931.0 9.0 13.0 0.0 0.0 788.0 75.99 84.64 13.87 11.75 1.12 15.12 8.44 143.0 71.16 83.24 75.03 789.0 786.0 4.0 6.0 0.0 1.0 702.0 84.66 89.31 8.25 8.85 1.54 14.29 8.61 84.0 76.02 88.42 84.16 26.0 80.77 80.77 92.31 0.0 0.0 13.69 2.45 13.35 15.62 2.46 2.43 0.54 11.19 4

I.2.11 Other Data:

⚠️We've collected and derived other features to complete the pool. Due to their weak relevance to the volume of air passengers, however, we didn't attach them to the full dataset, but only performed rough modeling on the individual subsets of features (similar to what we've done with the weather data) instead. These included average on-board passengers of 100~199 seat airplanes departured/arrived from/at an airport on a given date, average departure/arrival seats, etc.

In [268]:
ext_data = ext_data.drop([ 'd_POP','d_GDP','a_POP', 'a_GDP'], axis=1)
In [281]:
full = ext_data
print(full.shape)
(8902, 103)

I.3. Preliminary Exploration

I.3.1 Missing values

In [282]:
full.isnull().sum().sum()
Out[282]:
294
In [283]:
for feature in list(full.columns):
    if full[feature].isnull().sum() > 0:
        print(feature+": "+str(full[feature].isnull().sum()))
d_Departures: 2
d_Arrivals: 2
d_Departure_Seats: 2
d_Arrival_Seats: 2
a_Departures: 5
a_Arrivals: 5
a_Departure_Seats: 5
a_Arrival_Seats: 5
Flight_Count: 19
On_Time_Gate_Departures: 19
On_Time_Airport_Departures: 19
On_Time_Gate_Arrivals: 19
Arrivals_With_EDCT: 19
Average_EDCT: 19
Gate_Departure_Delay: 19
Taxi_Out_Delay: 19
Average_Taxi_Out_Time: 19
Airport_Departure_Delay: 19
Airborne_Delay: 19
Taxi_In_Delay: 19
Block_Delay: 19
Gate_Arrival_Delay: 19

As we can see from above, there are a total of 294 missing values in the full dataset and they mainly occur in on-time/reliability features. There are no features with more than 1% missing values so we should be fine. Here we chose to impute them with the mean values of each column:

In [284]:
full = full.fillna(full.mean())

I.3.2 PCA of On-time/Reliability Features

In [285]:
PCA_features = ['All_Causes_Perc_x', 'Extreme_Weather_Only_Perc_x','Carrier_Cause_Only_Perc_x', 'NAS_Cause_Only_Perc_x',
                'Security_Cause_Only_Perc_x', 'NAS_n_Prorated_Perc_x', 'Carrier_n_Prorated_Perc_x',  'All_Causes_Perc_y',
                'Extreme_Weather_Only_Perc_y', 'Carrier_Cause_Only_Perc_y', 'NAS_Cause_Only_Perc_y', 'Security_Cause_Only_Perc_y',
                'NAS_n_Prorated_Perc_y', 'Carrier_n_Prorated_Perc_y', 'Actual_Departures_x', 'Actual_Arrivals_x', 'Departure_Cancellations_x',
                'Arrival_Cancellations_x', 'Departure_Diversions_x', 'Arrival_Diversions_x', 'On_Time_Arrivals_x', '%_On_Time_Gate_Departures_x',
                '%_On_Time_Gate_Arrivals_x', 'Average_Gate_Departure_Delay_x', 'Average_Gate_Arrival_Delay_x', 'Average_Block_Delay_x',
                'Average_Taxi_Out_Time_x', 'Average_Taxi_in_Time_x', 'Delayed_Arrivals_x', 'Average_Delay_Per_Delayed_Arrival_x',
                'Percent_Schedule_Arrival_Reliability_x', 'Percent_Schedule_Departure_Reliability_x', 'Actual_Departures_y', 'Actual_Arrivals_y',
                'Departure_Cancellations_y', 'Arrival_Cancellations_y', 'Departure_Diversions_y', 'Arrival_Diversions_y', 'On_Time_Arrivals_y',
                '%_On_Time_Gate_Departures_y', '%_On_Time_Gate_Arrivals_y', 'Average_Gate_Departure_Delay_y', 'Average_Gate_Arrival_Delay_y',
                'Average_Block_Delay_y', 'Average_Taxi_Out_Time_y', 'Average_Taxi_in_Time_y', 'Delayed_Arrivals_y', 'Average_Delay_Per_Delayed_Arrival_y',
                'Percent_Schedule_Arrival_Reliability_y', 'Percent_Schedule_Departure_Reliability_y', 'On_Time_Gate_Departures',
                'On_Time_Airport_Departures', 'On_Time_Gate_Arrivals', 'Arrivals_With_EDCT', 'Average_EDCT', 'Gate_Departure_Delay',
                'Taxi_Out_Delay', 'Average_Taxi_Out_Time', 'Airport_Departure_Delay', 'Airborne_Delay', 'Taxi_In_Delay', 'Block_Delay', 'Gate_Arrival_Delay']
In [286]:
PCA_data = full[PCA_features]
In [287]:
from sklearn.decomposition import PCA
pca = PCA(n_components=5)
pca5_feature = pca.fit_transform(PCA_data)
print(pca.explained_variance_ratio_)
[0.44802478 0.3708787  0.08536748 0.05915139 0.01882783]
In [288]:
full['OnTimePCA1'] = pca5_feature[:,0]
full['OnTimePCA2'] = pca5_feature[:,1]
full['OnTimePCA3'] = pca5_feature[:,2]
full['OnTimePCA4'] = pca5_feature[:,3]
In [289]:
full = full.drop(PCA_features, axis=1)
In [290]:
full.columns
Out[290]:
Index(['Date', 'Departure', 'Arrival', 'WeeksToDeparture', 'log_PAX',
       'std_wtd', 'd_Departures', 'd_Arrivals', 'd_Departure_Seats',
       'd_Arrival_Seats', 'a_Departures', 'a_Arrivals', 'a_Departure_Seats',
       'a_Arrival_Seats', 'd_Lat', 'd_Lon', 'd_Alt', 'a_Lat', 'a_Lon', 'a_Alt',
       'Lat_Delta', 'Lon_Delta', 'distance_KM', 'isHoliday', 'DaysToDerinere',
       'DaysToProchaine', 'year', 'month', 'weekday', 'week', 'n_days', 'wd_5',
       'd_POP_log', 'a_POP_log', 'd_GDP_log', 'a_GDP_log', 'Total_Flights_x',
       'Total_Flights_y', 'Flight_Count', 'Airline', 'OnTimePCA1',
       'OnTimePCA2', 'OnTimePCA3', 'OnTimePCA4'],
      dtype='object')
In [ ]:
def normalize(df):
    ignoreList = ['Lat_x', 'Lat_y', 'Lon_x', 'Lon_y', 'Lat_Delta', 'Lon_Delta','wd_5','wd_6',
                 'Rain_x', 'Rain_y', 'BadWeather_arrival', 'BadWeather_departure',
                 'ToSouth','ToNorth','ToWest','ToEast','Winter','Summer','WinterToSouth',
                  'WinterToNorth','WinterToWest','WinterToEast','SummerToSouth','SummerToNorth',
                  'SummerToWest','SummerToEast']
    result = df.copy()
    for feature_name in df.columns:
        if(not(feature_name in ignoreList)):
            print(feature_name)
            try:
                mean = df[feature_name].mean()
                std = df[feature_name].std()
                result[feature_name] = (df[feature_name] - mean) / std
            except:
                pass
    return result

data_weather_std = normalize(data_weather)