This is the final project of MAP536: Python for Data Science, taught at École Polytechnique. We were presented with the challenge of helping an unnamed airline company (operating 100–200 seat airplanes) predict how many passengers will travel on a given date, in order to best allocate its planes among possible routes.
Available historical data included:
We were permitted to use any relevant external data we could find. The goal was to build a model with the strongest possible predictive power.
The project was done in groups of two and I was very lucky to have collaborated with Jiayu Gan.
We obtained a final model implemented in XGBoost with 19 features (out of a pool of more than 200). Ranked by importance in descending order, these features are:
For a brief report without code, see: https://github.com/FerdinandYu/Predicting-Air-Passengers/blob/master/Final%20Report.pdf
# import required packages
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from datetime import date
import matplotlib.pyplot as plt
import numpy as np
pd.set_option('display.max_columns', None)
import seaborn as sns; sns.set()
from scipy import stats
import holidays
import geopy.distance
init_data = pd.read_csv("train.csv")
init_data['DateOfDeparture'] = pd.to_datetime(init_data['DateOfDeparture'], infer_datetime_format=True)
init_data = init_data.rename(columns={"DateOfDeparture": "Date"})
init_data.head(5)
plt.figure(1, figsize=(10, 5))
plt.subplot(121)
plt.hist(init_data.WeeksToDeparture)
plt.title('WeeksToDeparture')
plt.subplot(122)
plt.hist(init_data.log_PAX)
plt.title('log_PAX')
plt.show()
The original data covered dates from 2011-09-01 to 2013-03-05. 20 airports (all in the US) were involved; see the list below:
PHX - Phoenix Sky Harbor International Airport; IAH - George Bush Intercontinental Airport; ATL - Hartsfield–Jackson Atlanta International Airport; DEN - Denver International Airport; SEA - Seattle-Tacoma International Airport; DTW - Detroit Metropolitan Wayne County Airport; EWR - Newark Liberty International Airport; BOS - Boston Logan International Airport; JFK - John F. Kennedy International Airport; MCO - Orlando International Airport; PHL - Philadelphia International Airport; MIA - Miami International Airport; MSP - Minneapolis−Saint Paul International Airport; LAX - Los Angeles International Airport; ORD - O'Hare International Airport; SFO - San Francisco International Airport; LAS - McCarran International Airport; CLT - Charlotte Douglas International Airport; DFW - Dallas/Fort Worth International Airport; LGA - LaGuardia Airport
From the histograms, we observe that WeeksToDeparture and log_PAX each follow an approximately normal distribution.
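A quick way to back this visual impression with numbers is a formal normality test. Below is a minimal sketch using scipy.stats (imported above); this check is an illustration we add here, not part of the original analysis:

# D'Agostino-Pearson test: very small p-values would argue against normality
for col in ['WeeksToDeparture', 'log_PAX']:
    stat, p = stats.normaltest(init_data[col].dropna())
    print(f"{col}: statistic={stat:.2f}, p={p:.4f}")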
Predicting air passenger volume is challenging when the original data contain only limited information about the departure and destination airports and the ticket booking time. Many factors may account for the actual number of passengers on a specific airplane on a specific date: the economic vibrancy of the two cities, the weather, airport operating conditions, historical on-time rates, the number of flights connecting the two cities, competition among airlines, a vacation or a special event, and so on. We were given the weather data to begin with:
I.2.1 Weather Data
weather_data = pd.read_csv("weather_data.csv")
weather_data.head(1)
⚠️ We considered indicators of extreme weather that might lead to flight delays or cancellations, such as minimum temperature, maximum wind speed, and visibility, and explicitly included dummies for thunderstorms and snowstorms (see the sketch after the table below). However, after experimenting with several models built on these weather indicators, we found them to be poor predictors of air passenger volume, so we dropped them. We then collected other datasets relevant to air traffic from public sources. We found detailed daily information about airport operations (departures/arrivals, passenger volume, etc.) and flights (on-time/delay, frequency, etc.) on the Federal Aviation Administration (FAA)'s website, and general information about airports (locations) and routes (competing airlines, distances) on OpenFlights. In addition, we included data on the socioeconomic conditions (GDP, population, etc.) of the cities served, obtained from Wikipedia and the Federal Reserve Bank (FRED). Considering seasonal travel patterns, we also included the major public and school holidays in the US. For a comprehensive list of the external data, see the table below:
Dataset | Source
---|---
On-time & delay statistics by airport | FAA: https://aspm.faa.gov/asqp/sys/Airport.asp
Airport geographic coordinates | OpenFlights: https://openflights.org/data.html#airport
Details about flights between the two cities by day | FAA: https://aspm.faa.gov/apm/sys/AnalysisCP.asp
Route distances | Derived from airports' geographic coordinates using geopy.distance (geodesic distance)
Departure & arrival seats by airport by day | FAA: https://aspm.faa.gov/tfms/sys/Airport.asp
Socioeconomic conditions of cities served | Wikipedia (keyword: city names); FRED: https://fred.stlouisfed.org/series
School holidays | https://www.feiertagskalender.ch/ferien.php?geo=3537&jahr=2012&klasse=0&hl=en
Public holidays | holidays package (PyPI)
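As an illustration of the storm dummies mentioned above, here is a minimal sketch; the Events column and its string values are assumptions about the weather file, not taken from the original code:

# Hypothetical sketch: derive storm dummies from a free-text 'Events' column
events = weather_data['Events'].fillna('')
weather_data['Thunderstorm'] = events.str.contains('Thunderstorm', case=False).astype(int)
weather_data['Snowstorm'] = events.str.contains('Snow', case=False).astype(int)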
data_passenger = pd.read_csv('TFMSC Report.csv', thousands=',')
# remove business aviation flights
data_passenger = data_passenger[data_passenger['Business_Aviation'] == 'No']
data_passenger['Date'] = pd.to_datetime(data_passenger['Date'], infer_datetime_format=True)
# keep only the columns we need
data_passenger = data_passenger[['Date', 'Airport', 'Departures', 'Arrivals', 'Departure_Seats', 'Arrival_Seats']]
# keep only the 3-letter IATA code in the Airport column
data_passenger['Airport'] = data_passenger['Airport'].apply(lambda s: s[:3])
# aggregate to one row per airport per day
data_passenger = data_passenger.groupby(['Date', 'Airport']).sum()
data_passenger.head(1)
data_passenger = data_passenger.rename(
columns={'Departures': 'd_Departures',
'Arrivals': 'd_Arrivals',
'Departure_Seats': 'd_Departure_Seats',
'Arrival_Seats': 'd_Arrival_Seats'})
ext_data = pd.merge(
init_data, data_passenger, how='left',
left_on=['Date', 'Departure'],
right_on=['Date', 'Airport'],
sort=False)
data_passenger = data_passenger.rename(
columns={'d_Departures': 'a_Departures',
'd_Arrivals': 'a_Arrivals',
'd_Departure_Seats': 'a_Departure_Seats',
'd_Arrival_Seats': 'a_Arrival_Seats'})
ext_data = pd.merge(
ext_data, data_passenger, how='left',
left_on=['Date', 'Arrival'],
right_on=['Date', 'Airport'],
sort=False)
ext_data.head(1)
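The rename-then-merge pattern above (once for the departure airport, once for the arrival airport) recurs for every per-airport dataset below. A hypothetical helper, not part of the original code, could factor it out:

# Sketch: attach per-airport columns for both the departure ('d_') and arrival ('a_') sides
def merge_both_sides(base, per_airport, value_cols):
    for prefix, key in [('d_', 'Departure'), ('a_', 'Arrival')]:
        renamed = per_airport.rename(columns={c: prefix + c for c in value_cols})
        base = pd.merge(base, renamed, how='left',
                        left_on=['Date', key], right_on=['Date', 'Airport'],
                        sort=False)
    return base

With this helper, the two merges above would collapse to merge_both_sides(init_data, data_passenger, ['Departures', 'Arrivals', 'Departure_Seats', 'Arrival_Seats']).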
data_geo = pd.read_csv('airports_details.csv')
AIRLIST = ['ATL', 'BOS', 'CLT', 'DEN', 'DFW', 'DTW', 'EWR', 'IAH', 'JFK', 'LAS', 'LAX', 'LGA', 'MCO',
'MIA', 'MSP', 'ORD', 'PHL', 'PHX', 'SEA', 'SFO']
# keep only the 20 airports present in our data
data_geo = data_geo[data_geo['IATA'].isin(AIRLIST)]
data_geo.head(1)
data_geo = data_geo.rename(
columns={'Lat': 'd_Lat',
'Lon': 'd_Lon',
'Alt': 'd_Alt'})
ext_data = pd.merge(
ext_data, data_geo, how='left',
left_on=['Departure'],
right_on=['IATA'],
sort=False)
data_geo = data_geo.rename(
columns={'d_Lat': 'a_Lat',
'd_Lon': 'a_Lon',
'd_Alt': 'a_Alt'})
ext_data = pd.merge(
ext_data, data_geo, how='left',
left_on=['Arrival'],
right_on=['IATA'],
sort=False)
ext_data['Lat_Delta'] = ext_data['a_Lat'] - ext_data['d_Lat']
ext_data['Lon_Delta'] = ext_data['a_Lon'] - ext_data['d_Lon']
ext_data = ext_data.drop(['IATA_x', 'IATA_y'], axis=1)
# geopy.distance.vincenty was removed in geopy 2.0; geodesic is the drop-in replacement
ext_data['distance_KM'] = ext_data.apply(lambda row: geopy.distance.geodesic((row['d_Lat'], row['d_Lon']),
                                                                             (row['a_Lat'], row['a_Lon'])).km, axis=1)
ext_data.head(1)
def daysDerinere(ls):
    # days since the most recent holiday: the smallest non-negative delta
    # ("dernière" = last; e.g. for deltas [-3, 2, 10] this returns 2)
    R = 100
    for day in ls:
        if (day >= 0) and (day < R):
            R = day
    return R
def daysProchaine(ls):
    # days until the next holiday: the largest non-positive delta
    # ("prochaine" = next; e.g. for deltas [-3, 2, 10] this returns -3)
    R = -100
    for day in ls:
        if (day <= 0) and (day > R):
            R = day
    return R
# Get the list of US public holidays between 2011-07-01 and 2013-06-05
a = holidays.US()[date(2011, 7, 1):date(2013, 6, 5)]
b = holidays.US()[date(2012, 1, 1):date(2012, 12, 31)]
c = sorted(set(a + b))  # b is contained in a; deduplicate so each holiday appears once
# Build one day-delta column per holiday, then reduce to distances to the last/next holiday
R = pd.DataFrame(pd.date_range(start='9/1/2011', end='3/5/2013'), columns=['Date'])
R['isHoliday'] = R['Date'].apply(lambda x: int(x in c))
for hd in c:
    R[str(hd)] = R['Date'].apply(lambda x: (pd.to_datetime(x) - pd.to_datetime(hd)).days)
# columns 2:23 are the per-holiday delta columns created above
R['DaysToDerinere'] = R.apply(lambda x: daysDerinere(x[2:23]), axis=1)
R['DaysToProchaine'] = R.apply(lambda x: daysProchaine(x[2:23]), axis=1)
R = R[['Date', 'isHoliday', 'DaysToDerinere', 'DaysToProchaine']]
R['Date'] = pd.to_datetime(R['Date'], infer_datetime_format=True)
ext_data = pd.merge(
ext_data, R, how='left',
left_on=['Date'],
right_on=['Date'],
sort=False)
ext_data.head(1)
ext_data['year'] = ext_data['Date'].dt.year
ext_data['month'] = ext_data['Date'].dt.month
ext_data['weekday'] = ext_data['Date'].dt.weekday
ext_data['week'] = ext_data['Date'].dt.isocalendar().week  # Series.dt.week is deprecated in recent pandas
ext_data['n_days'] = ext_data['Date'].apply(lambda date: (date - pd.to_datetime("2011-01-01")).days)
ext_data['wd_5'] = (ext_data['weekday'] == 5).astype(int)  # Saturday dummy
ext_data.head(1)
data_gdp = pd.read_csv('Pop_GDP_by_city.csv', thousands=',')
data_gdp = data_gdp.rename(
columns={'Pop2018': 'd_POP',
'GDP_B': 'd_GDP'})
ext_data = pd.merge(
ext_data, data_gdp, how='left',
left_on=['Departure'],
right_on=['Airport'],
sort=False)
data_gdp = data_gdp.rename(
columns={'d_POP': 'a_POP',
'd_GDP': 'a_GDP'})
ext_data = pd.merge(
ext_data, data_gdp, how='left',
left_on=['Arrival'],
right_on=['Airport'],
sort=False)
ext_data = ext_data.drop(['Airport_x', 'Airport_y'], axis=1)
ext_data['d_POP_log']=np.log(ext_data['d_POP'])
ext_data['a_POP_log']=np.log(ext_data['a_POP'])
ext_data['d_GDP_log']=np.log(ext_data['d_GDP'])
ext_data['a_GDP_log']=np.log(ext_data['a_GDP'])
ext_data.head(1)
data_on_time = pd.read_csv("On_time_NAS_V2.csv",thousands=',')
data_on_time['Date'] = pd.to_datetime(data_on_time['Date'], infer_datetime_format=True)
data_on_time.head(1)
ext_data = pd.merge(
ext_data, data_on_time, how='left',
left_on=['Departure','Date'],
right_on=['Facility', 'Date'],
sort=False)
ext_data = pd.merge(
ext_data, data_on_time, how='left',
left_on=['Arrival', 'Date'],
right_on=['Facility', 'Date'],
sort=False)
ext_data = ext_data.drop(['Facility_x', 'Facility_y'], axis=1)
ext_data.head(1)
data_reliability= pd.read_csv("Airport View - Schedule Reliability Report.csv",thousands=',')
data_reliability['Date'] = pd.to_datetime(data_reliability['Date'], infer_datetime_format=True)
data_reliability.head(1)
ext_data = pd.merge(
ext_data, data_reliability, how='left',
left_on=['Departure','Date'],
right_on=['Facility', 'Date'],
sort=False)
ext_data = pd.merge(
ext_data, data_reliability, how='left',
left_on=['Arrival', 'Date'],
right_on=['Facility', 'Date'],
sort=False)
ext_data = ext_data.drop(['Facility_x', 'Facility_y'], axis=1)
ext_data.head(1)
data_citypairs = pd.read_csv("citypairs.csv",thousands=',')
data_citypairs['Date'] = pd.to_datetime(data_citypairs['Date'], infer_datetime_format=True)
data_citypairs.head(1)
ext_data = pd.merge(
ext_data, data_citypairs, how='left',
left_on=['Departure', 'Arrival', 'Date'],
right_on=['Departure', 'Arrival', 'Date'],
sort=False)
ext_data.head(1)
data_routes = pd.read_csv("Routes between Airports.csv")
airports = list(set(ext_data.Departure))
# keep only routes between the 20 airports present in our data
data_routes = data_routes[data_routes['Departure'].isin(airports) & data_routes['Arrival'].isin(airports)]
data_routes = data_routes[['Airline', 'Departure', 'Arrival', 'Codeshare', 'Stops']]
data_routes.head(1)
# number of airlines operating each route (a proxy for competition on that route)
airlines = data_routes.groupby(by=['Departure', 'Arrival']).count().reset_index()[['Departure', 'Arrival', 'Airline']]
airlines.head(1)
ext_data = pd.merge(
ext_data, airlines, how='left',
left_on=['Departure', 'Arrival'],
right_on=['Departure', 'Arrival'],
sort=False)
ext_data.head(1)
⚠️ We collected and derived other features to complete the pool. However, due to their weak relevance to air passenger volume, we did not attach them to the full dataset; instead, we only performed rough modeling on the individual subsets of features (similar to what we did with the weather data). These included the average number of on-board passengers on 100–199 seat airplanes departing from/arriving at an airport on a given date, average departure/arrival seats, and so on.
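The sketch below illustrates the kind of rough screening described above; the model and scoring metric here are stand-ins, not necessarily the ones actually used:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Quick 5-fold CV score for one candidate feature subset (hypothetical helper)
def screen_features(X_subset, y):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    return cross_val_score(model, X_subset, y, cv=5,
                           scoring='neg_root_mean_squared_error').mean()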
ext_data = ext_data.drop([ 'd_POP','d_GDP','a_POP', 'a_GDP'], axis=1)
full = ext_data
print(full.shape)
full.isnull().sum().sum()
for feature in list(full.columns):
if full[feature].isnull().sum() > 0:
print(feature+": "+str(full[feature].isnull().sum()))
As we can see above, there are 294 missing values in total in the full dataset, and they occur mainly in the on-time/reliability features. No feature has more than 1% of its values missing, so simple imputation should be safe. Here we chose to impute them with the mean of each column:
full = full.fillna(full.mean(numeric_only=True))
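A quick sanity check (an addition, not in the original notebook) that nothing is left unimputed:

assert full.isnull().sum().sum() == 0  # all missing values have been filled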
PCA_features = ['All_Causes_Perc_x', 'Extreme_Weather_Only_Perc_x','Carrier_Cause_Only_Perc_x', 'NAS_Cause_Only_Perc_x',
'Security_Cause_Only_Perc_x', 'NAS_n_Prorated_Perc_x', 'Carrier_n_Prorated_Perc_x', 'All_Causes_Perc_y',
'Extreme_Weather_Only_Perc_y', 'Carrier_Cause_Only_Perc_y', 'NAS_Cause_Only_Perc_y', 'Security_Cause_Only_Perc_y',
'NAS_n_Prorated_Perc_y', 'Carrier_n_Prorated_Perc_y', 'Actual_Departures_x', 'Actual_Arrivals_x', 'Departure_Cancellations_x',
'Arrival_Cancellations_x', 'Departure_Diversions_x', 'Arrival_Diversions_x', 'On_Time_Arrivals_x', '%_On_Time_Gate_Departures_x',
'%_On_Time_Gate_Arrivals_x', 'Average_Gate_Departure_Delay_x', 'Average_Gate_Arrival_Delay_x', 'Average_Block_Delay_x',
'Average_Taxi_Out_Time_x', 'Average_Taxi_in_Time_x', 'Delayed_Arrivals_x', 'Average_Delay_Per_Delayed_Arrival_x',
'Percent_Schedule_Arrival_Reliability_x', 'Percent_Schedule_Departure_Reliability_x', 'Actual_Departures_y', 'Actual_Arrivals_y',
'Departure_Cancellations_y', 'Arrival_Cancellations_y', 'Departure_Diversions_y', 'Arrival_Diversions_y', 'On_Time_Arrivals_y',
'%_On_Time_Gate_Departures_y', '%_On_Time_Gate_Arrivals_y', 'Average_Gate_Departure_Delay_y', 'Average_Gate_Arrival_Delay_y',
'Average_Block_Delay_y', 'Average_Taxi_Out_Time_y', 'Average_Taxi_in_Time_y', 'Delayed_Arrivals_y', 'Average_Delay_Per_Delayed_Arrival_y',
'Percent_Schedule_Arrival_Reliability_y', 'Percent_Schedule_Departure_Reliability_y', 'On_Time_Gate_Departures',
'On_Time_Airport_Departures', 'On_Time_Gate_Arrivals', 'Arrivals_With_EDCT', 'Average_EDCT', 'Gate_Departure_Delay',
'Taxi_Out_Delay', 'Average_Taxi_Out_Time', 'Airport_Departure_Delay', 'Airborne_Delay', 'Taxi_In_Delay', 'Block_Delay', 'Gate_Arrival_Delay']
PCA_data = full[PCA_features]
from sklearn.decomposition import PCA

# Compress the correlated on-time/delay columns into a few principal components
pca = PCA(n_components=5)
pca5_feature = pca.fit_transform(PCA_data)
print(pca.explained_variance_ratio_)
# only the first four components are retained as features
full['OnTimePCA1'] = pca5_feature[:, 0]
full['OnTimePCA2'] = pca5_feature[:, 1]
full['OnTimePCA3'] = pca5_feature[:, 2]
full['OnTimePCA4'] = pca5_feature[:, 3]
full = full.drop(PCA_features, axis=1)
full.columns
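PCA is variance-driven, so columns with large raw scales can dominate the leading components. Below is a minimal sketch of a scaled variant, assuming (since the notebook does not show it) that scaling was not applied before the PCA above:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Standardize each column to unit variance before PCA so no single scale dominates
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=5))
pca5_scaled = scaled_pca.fit_transform(PCA_data)
print(scaled_pca.named_steps['pca'].explained_variance_ratio_)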
def normalize(df):
    # z-score standardization, leaving binary/indicator columns untouched
    ignoreList = ['Lat_x', 'Lat_y', 'Lon_x', 'Lon_y', 'Lat_Delta', 'Lon_Delta', 'wd_5', 'wd_6',
                  'Rain_x', 'Rain_y', 'BadWeather_arrival', 'BadWeather_departure',
                  'ToSouth', 'ToNorth', 'ToWest', 'ToEast', 'Winter', 'Summer', 'WinterToSouth',
                  'WinterToNorth', 'WinterToWest', 'WinterToEast', 'SummerToSouth', 'SummerToNorth',
                  'SummerToWest', 'SummerToEast']
    result = df.copy()
    for feature_name in df.columns:
        if feature_name not in ignoreList:
            try:
                mean = df[feature_name].mean()
                std = df[feature_name].std()
                result[feature_name] = (df[feature_name] - mean) / std
            except TypeError:
                pass  # non-numeric column: leave unchanged
    return result
data_weather_std = normalize(weather_data)  # standardize the weather data loaded earlier