This is the final project of MAP536: Python for Data Science, taught at École Polytechnique. We were presented with the challenge of helping an unnamed airline company (operating 100–200 seat airplanes) predict how many passengers will travel on a given date, in order to best allocate its planes among possible routes.
Available historical data included:
We were permitted to use any relevant external data we could find. The goal was to build a model with the strongest possible predictive power.
The project was done in groups of two and I was very lucky to have collaborated with Jiayu Gan.
We obtained a final model implemented in XGBoost with 19 features (out of a pool of more than 200). Ranked by importance in descending order, these features are:
For a brief report without code, see: https://github.com/FerdinandYu/Predicting-Air-Passengers/blob/master/Final%20Report.pdf
# import required packages
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from datetime import date
import matplotlib.pyplot as plt
import numpy as np
pd.set_option('display.max_columns', None)
import seaborn as sns; sns.set()
from scipy import stats
import holidays
import geopy.distance
init_data = pd.read_csv("train.csv")
init_data['DateOfDeparture'] = pd.to_datetime(init_data['DateOfDeparture'], infer_datetime_format=True)
init_data = init_data.rename(columns={"DateOfDeparture": "Date"})
init_data.head(5)
plt.figure(1, figsize=(10, 5))
plt.subplot(121)
plt.hist(init_data.WeeksToDeparture)
plt.title('WeeksToDeparture')
plt.subplot(122)
plt.hist(init_data.log_PAX)
plt.title('log_PAX')
plt.show()
The original data covered dates from 2011-09-01 to 2013-03-05. 20 airports (all in the US) were involved; see the list below:
PHX - Phoenix Sky Harbor International Airport; IAH - George Bush Intercontinental Airport; ATL - Hartsfield–Jackson Atlanta International Airport; DEN - Denver International Airport; SEA - Seattle-Tacoma International Airport; DTW - Detroit Metropolitan Wayne County Airport; EWR - Newark Liberty International Airport; BOS - Boston Logan International Airport; JFK - John F. Kennedy International Airport; MCO - Orlando International Airport; PHL - Philadelphia International Airport; MIA - Miami International Airport; MSP - Minneapolis−Saint Paul International Airport; LAX - Los Angeles International Airport; ORD - O'Hare International Airport; SFO - San Francisco International Airport; LAS - McCarran International Airport; CLT - Charlotte Douglas International Airport; DFW - Dallas/Fort Worth International Airport; LGA - LaGuardia Airport
From the histograms, we observe that WeeksToDeparture and log_PAX each follow an approximately normal distribution.
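A quick way to back this visual impression with numbers is a formal normality test. Below is a minimal sketch using scipy.stats (imported above); this check is an illustration we add here, not part of the original analysis:

# D'Agostino-Pearson test: very small p-values would argue against normality
for col in ['WeeksToDeparture', 'log_PAX']:
    stat, p = stats.normaltest(init_data[col].dropna())
    print(f"{col}: statistic={stat:.2f}, p={p:.4f}")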
Predicting air passenger volume is challenging when the original data contain only limited information about the departure and destination airports and the ticket booking time. Many factors may account for the actual number of passengers on a specific airplane on a specific date: the economic vibrancy of the two cities, the weather, airport operating conditions, historical on-time rates, the number of flights connecting the two cities, competition among airlines, a vacation or a special event, and so on. We were given the weather data to begin with:
I.2.1 Weather Data
weather_data = pd.read_csv("weather_data.csv")
weather_data.head(1)
⚠️ We considered indicators of extreme weather that might lead to flight delays or cancellations, such as minimum temperature, maximum wind speed, and visibility, and explicitly included dummies for thunderstorms and snowstorms (see the sketch after the table below). However, after experimenting with several models built on these weather indicators, we found them to be poor predictors of air passenger volume, so we dropped them. We then collected other datasets relevant to air traffic from public sources. We found detailed daily information about airport operations (departures/arrivals, passenger volume, etc.) and flights (on-time/delay, frequency, etc.) on the Federal Aviation Administration (FAA)'s website, and general information about airports (locations) and routes (competing airlines, distances) on OpenFlights. In addition, we included data on the socioeconomic conditions (GDP, population, etc.) of the cities served, obtained from Wikipedia and the Federal Reserve Bank (FRED). Considering seasonal travel patterns, we also included the major public and school holidays in the US. For a comprehensive list of the external data, see the table below:
Dataset | Source
---|---
On-time & delay statistics by airport | FAA: https://aspm.faa.gov/asqp/sys/Airport.asp
Airport geographic coordinates | OpenFlights: https://openflights.org/data.html#airport
Details about flights between the two cities by day | FAA: https://aspm.faa.gov/apm/sys/AnalysisCP.asp
Route distances | Derived from airports' geographic coordinates using geopy.distance (geodesic distance)
Departure & arrival seats by airport by day | FAA: https://aspm.faa.gov/tfms/sys/Airport.asp
Socioeconomic conditions of cities served | Wikipedia (keyword: city names); FRED: https://fred.stlouisfed.org/series
School holidays | https://www.feiertagskalender.ch/ferien.php?geo=3537&jahr=2012&klasse=0&hl=en
Public holidays | holidays package (PyPI)
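As an illustration of the storm dummies mentioned above, here is a minimal sketch; the Events column and its string values are assumptions about the weather file, not taken from the original code:

# Hypothetical sketch: derive storm dummies from a free-text 'Events' column
events = weather_data['Events'].fillna('')
weather_data['Thunderstorm'] = events.str.contains('Thunderstorm', case=False).astype(int)
weather_data['Snowstorm'] = events.str.contains('Snow', case=False).astype(int)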
data_passenger = pd.read_csv('TFMSC Report.csv', thousands=',')
# remove business aviation flights
data_passenger = data_passenger[data_passenger['Business_Aviation'] == 'No']
data_passenger['Date'] = pd.to_datetime(data_passenger['Date'], infer_datetime_format=True)
# keep only the columns we need
data_passenger = data_passenger[['Date', 'Airport', 'Departures', 'Arrivals', 'Departure_Seats', 'Arrival_Seats']]
# keep only the 3-letter IATA code in the Airport column
data_passenger['Airport'] = data_passenger['Airport'].apply(lambda s: s[:3])
# aggregate to one row per airport per day
data_passenger = data_passenger.groupby(['Date', 'Airport']).sum()
data_passenger.head(1)
data_passenger = data_passenger.rename(
columns={'Departures': 'd_Departures',
'Arrivals': 'd_Arrivals',
'Departure_Seats': 'd_Departure_Seats',
'Arrival_Seats': 'd_Arrival_Seats'})
ext_data = pd.merge(
init_data, data_passenger, how='left',
left_on=['Date', 'Departure'],
right_on=['Date', 'Airport'],
sort=False)
data_passenger = data_passenger.rename(
columns={'d_Departures': 'a_Departures',
'd_Arrivals': 'a_Arrivals',
'd_Departure_Seats': 'a_Departure_Seats',
'd_Arrival_Seats': 'a_Arrival_Seats'})
ext_data = pd.merge(
ext_data, data_passenger, how='left',
left_on=['Date', 'Arrival'],
right_on=['Date', 'Airport'],
sort=False)
ext_data.head(1)
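The rename-then-merge pattern above (once for the departure airport, once for the arrival airport) recurs for every per-airport dataset below. A hypothetical helper, not part of the original code, could factor it out:

# Sketch: attach per-airport columns for both the departure ('d_') and arrival ('a_') sides
def merge_both_sides(base, per_airport, value_cols):
    for prefix, key in [('d_', 'Departure'), ('a_', 'Arrival')]:
        renamed = per_airport.rename(columns={c: prefix + c for c in value_cols})
        base = pd.merge(base, renamed, how='left',
                        left_on=['Date', key], right_on=['Date', 'Airport'],
                        sort=False)
    return base

With this helper, the two merges above would collapse to merge_both_sides(init_data, data_passenger, ['Departures', 'Arrivals', 'Departure_Seats', 'Arrival_Seats']).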
data_geo = pd.read_csv('airports_details.csv')
AIRLIST = ['ATL', 'BOS', 'CLT', 'DEN', 'DFW', 'DTW', 'EWR', 'IAH', 'JFK', 'LAS', 'LAX', 'LGA', 'MCO',
'MIA', 'MSP', 'ORD', 'PHL', 'PHX', 'SEA', 'SFO']
# keep only the 20 airports present in our data
data_geo = data_geo[data_geo['IATA'].isin(AIRLIST)]
data_geo.head(1)
data_geo = data_geo.rename(
columns={'Lat': 'd_Lat',
'Lon': 'd_Lon',
'Alt': 'd_Alt'})
ext_data = pd.merge(
ext_data, data_geo, how='left',
left_on=['Departure'],
right_on=['IATA'],
sort=False)
data_geo = data_geo.rename(
columns={'d_Lat': 'a_Lat',
'd_Lon': 'a_Lon',
'd_Alt': 'a_Alt'})
ext_data = pd.merge(
ext_data, data_geo, how='left',
left_on=['Arrival'],
right_on=['IATA'],
sort=False)
ext_data['Lat_Delta'] = ext_data['a_Lat'] - ext_data['d_Lat']
ext_data['Lon_Delta'] = ext_data['a_Lon'] - ext_data['d_Lon']
ext_data = ext_data.drop(['IATA_x', 'IATA_y'], axis=1)
# geopy.distance.vincenty was removed in geopy 2.0; geodesic is the drop-in replacement
ext_data['distance_KM'] = ext_data.apply(lambda row: geopy.distance.geodesic((row['d_Lat'], row['d_Lon']),
                                                                             (row['a_Lat'], row['a_Lon'])).km, axis=1)
ext_data.head(1)
def daysDerinere(ls):
    # days since the most recent holiday: the smallest non-negative delta
    # ("dernière" = last; e.g. for deltas [-3, 2, 10] this returns 2)
    R = 100
    for day in ls:
        if (day >= 0) and (day < R):
            R = day
    return R
def daysProchaine(ls):
    # days until the next holiday: the largest non-positive delta
    # ("prochaine" = next; e.g. for deltas [-3, 2, 10] this returns -3)
    R = -100
    for day in ls:
        if (day <= 0) and (day > R):
            R = day
    return R
# Get the list of US public holidays between 2011-07-01 and 2013-06-05
a = holidays.US()[date(2011, 7, 1):date(2013, 6, 5)]
b = holidays.US()[date(2012, 1, 1):date(2012, 12, 31)]
c = sorted(set(a + b))  # b is contained in a; deduplicate so each holiday appears once
# Build one day-delta column per holiday, then reduce to distances to the last/next holiday
R = pd.DataFrame(pd.date_range(start='9/1/2011', end='3/5/2013'), columns=['Date'])
R['isHoliday'] = R['Date'].apply(lambda x: int(x in c))
for hd in c:
    R[str(hd)] = R['Date'].apply(lambda x: (pd.to_datetime(x) - pd.to_datetime(hd)).days)
# columns 2:23 are the per-holiday delta columns created above
R['DaysToDerinere'] = R.apply(lambda x: daysDerinere(x[2:23]), axis=1)
R['DaysToProchaine'] = R.apply(lambda x: daysProchaine(x[2:23]), axis=1)
R = R[['Date', 'isHoliday', 'DaysToDerinere', 'DaysToProchaine']]
R['Date'] = pd.to_datetime(R['Date'], infer_datetime_format=True)
ext_data = pd.merge(
ext_data, R, how='left',
left_on=['Date'],
right_on=['Date'],
sort=False)
ext_data.head(1)
ext_data['year'] = ext_data['Date'].dt.year
ext_data['month'] = ext_data['Date'].dt.month
ext_data['weekday'] = ext_data['Date'].dt.weekday
ext_data['week'] = ext_data['Date'].dt.isocalendar().week  # Series.dt.week is deprecated in recent pandas
ext_data['n_days'] = ext_data['Date'].apply(lambda date: (date - pd.to_datetime("2011-01-01")).days)
ext_data['wd_5'] = (ext_data['weekday'] == 5).astype(int)  # Saturday dummy
ext_data.head(1)
data_gdp = pd.read_csv('Pop_GDP_by_city.csv', thousands=',')
data_gdp = data_gdp.rename(
columns={'Pop2018': 'd_POP',
'GDP_B': 'd_GDP'})
ext_data = pd.merge(
ext_data, data_gdp, how='left',
left_on=['Departure'],
right_on=['Airport'],
sort=False)
data_gdp = data_gdp.rename(
columns={'d_POP': 'a_POP',
'd_GDP': 'a_GDP'})
ext_data = pd.merge(
ext_data, data_gdp, how='left',
left_on=['Arrival'],
right_on=['Airport'],
sort=False)
ext_data = ext_data.drop(['Airport_x', 'Airport_y'], axis=1)
ext_data['d_POP_log']=np.log(ext_data['d_POP'])
ext_data['a_POP_log']=np.log(ext_data['a_POP'])
ext_data['d_GDP_log']=np.log(ext_data['d_GDP'])
ext_data['a_GDP_log']=np.log(ext_data['a_GDP'])
ext_data.head(1)
data_on_time = pd.read_csv("On_time_NAS_V2.csv",thousands=',')
data_on_time['Date'] = pd.to_datetime(data_on_time['Date'], infer_datetime_format=True)
data_on_time.head(1)
ext_data = pd.merge(
ext_data, data_on_time, how='left',
left_on=['Departure','Date'],
right_on=['Facility', 'Date'],
sort=False)
ext_data = pd.merge(
ext_data, data_on_time, how='left',
left_on=['Arrival', 'Date'],
right_on=['Facility', 'Date'],
sort=False)
ext_data = ext_data.drop(['Facility_x', 'Facility_y'], axis=1)
ext_data.head(1)
data_reliability= pd.read_csv("Airport View - Schedule Reliability Report.csv",thousands=',')
data_reliability['Date'] = pd.to_datetime(data_reliability['Date'], infer_datetime_format=True)
data_reliability.head(1)
ext_data = pd.merge(
ext_data, data_reliability, how='left',
left_on=['Departure','Date'],
right_on=['Facility', 'Date'],
sort=False)
ext_data = pd.merge(
ext_data, data_reliability, how='left',
left_on=['Arrival', 'Date'],
right_on=['Facility', 'Date'],
sort=False)
ext_data = ext_data.drop(['Facility_x', 'Facility_y'], axis=1)
ext_data.head(1)
data_citypairs = pd.read_csv("citypairs.csv",thousands=',')
data_citypairs['Date'] = pd.to_datetime(data_citypairs['Date'], infer_datetime_format=True)
data_citypairs.head(1)
ext_data = pd.merge(
ext_data, data_citypairs, how='left',
left_on=['Departure', 'Arrival', 'Date'],
right_on=['Departure', 'Arrival', 'Date'],
sort=False)
ext_data.head(1)
data_routes = pd.read_csv("Routes between Airports.csv")
airports = list(set(ext_data.Departure))
# keep only routes between the 20 airports present in our data
data_routes = data_routes[data_routes['Departure'].isin(airports) & data_routes['Arrival'].isin(airports)]
data_routes = data_routes[['Airline', 'Departure', 'Arrival', 'Codeshare', 'Stops']]
data_routes.head(1)
# number of airlines operating each route (a proxy for competition on that route)
airlines = data_routes.groupby(by=['Departure', 'Arrival']).count().reset_index()[['Departure', 'Arrival', 'Airline']]
airlines.head(1)
ext_data = pd.merge(
ext_data, airlines, how='left',
left_on=['Departure', 'Arrival'],
right_on=['Departure', 'Arrival'],
sort=False)
ext_data.head(1)
⚠️ We collected and derived other features to complete the pool. However, due to their weak relevance to air passenger volume, we did not attach them to the full dataset; instead, we only performed rough modeling on the individual subsets of features (similar to what we did with the weather data). These included the average number of on-board passengers on 100–199 seat airplanes departing from/arriving at an airport on a given date, average departure/arrival seats, and so on.
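The sketch below illustrates the kind of rough screening described above; the model and scoring metric here are stand-ins, not necessarily the ones actually used:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Quick 5-fold CV score for one candidate feature subset (hypothetical helper)
def screen_features(X_subset, y):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    return cross_val_score(model, X_subset, y, cv=5,
                           scoring='neg_root_mean_squared_error').mean()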
ext_data = ext_data.drop([ 'd_POP','d_GDP','a_POP', 'a_GDP'], axis=1)
full = ext_data
print(full.shape)
full.isnull().sum().sum()
for feature in list(full.columns):
if full[feature].isnull().sum() > 0:
print(feature+": "+str(full[feature].isnull().sum()))
As we can see above, there are 294 missing values in total in the full dataset, and they occur mainly in the on-time/reliability features. No feature has more than 1% of its values missing, so simple imputation should be safe. Here we chose to impute them with the mean of each column:
full = full.fillna(full.mean(numeric_only=True))
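A quick sanity check (an addition, not in the original notebook) that nothing is left unimputed:

assert full.isnull().sum().sum() == 0  # all missing values have been filled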
PCA_features = ['All_Causes_Perc_x', 'Extreme_Weather_Only_Perc_x','Carrier_Cause_Only_Perc_x', 'NAS_Cause_Only_Perc_x',
'Security_Cause_Only_Perc_x', 'NAS_n_Prorated_Perc_x', 'Carrier_n_Prorated_Perc_x', 'All_Causes_Perc_y',
'Extreme_Weather_Only_Perc_y', 'Carrier_Cause_Only_Perc_y', 'NAS_Cause_Only_Perc_y', 'Security_Cause_Only_Perc_y',
'NAS_n_Prorated_Perc_y', 'Carrier_n_Prorated_Perc_y', 'Actual_Departures_x', 'Actual_Arrivals_x', 'Departure_Cancellations_x',
'Arrival_Cancellations_x', 'Departure_Diversions_x', 'Arrival_Diversions_x', 'On_Time_Arrivals_x', '%_On_Time_Gate_Departures_x',
'%_On_Time_Gate_Arrivals_x', 'Average_Gate_Departure_Delay_x', 'Average_Gate_Arrival_Delay_x', 'Average_Block_Delay_x',
'Average_Taxi_Out_Time_x', 'Average_Taxi_in_Time_x', 'Delayed_Arrivals_x', 'Average_Delay_Per_Delayed_Arrival_x',
'Percent_Schedule_Arrival_Reliability_x', 'Percent_Schedule_Departure_Reliability_x', 'Actual_Departures_y', 'Actual_Arrivals_y',
'Departure_Cancellations_y', 'Arrival_Cancellations_y', 'Departure_Diversions_y', 'Arrival_Diversions_y', 'On_Time_Arrivals_y',
'%_On_Time_Gate_Departures_y', '%_On_Time_Gate_Arrivals_y', 'Average_Gate_Departure_Delay_y', 'Average_Gate_Arrival_Delay_y',
'Average_Block_Delay_y', 'Average_Taxi_Out_Time_y', 'Average_Taxi_in_Time_y', 'Delayed_Arrivals_y', 'Average_Delay_Per_Delayed_Arrival_y',
'Percent_Schedule_Arrival_Reliability_y', 'Percent_Schedule_Departure_Reliability_y', 'On_Time_Gate_Departures',
'On_Time_Airport_Departures', 'On_Time_Gate_Arrivals', 'Arrivals_With_EDCT', 'Average_EDCT', 'Gate_Departure_Delay',
'Taxi_Out_Delay', 'Average_Taxi_Out_Time', 'Airport_Departure_Delay', 'Airborne_Delay', 'Taxi_In_Delay', 'Block_Delay', 'Gate_Arrival_Delay']
PCA_data = full[PCA_features]
from sklearn.decomposition import PCA

# Compress the correlated on-time/delay columns into a few principal components
pca = PCA(n_components=5)
pca5_feature = pca.fit_transform(PCA_data)
print(pca.explained_variance_ratio_)
# only the first four components are retained as features
full['OnTimePCA1'] = pca5_feature[:, 0]
full['OnTimePCA2'] = pca5_feature[:, 1]
full['OnTimePCA3'] = pca5_feature[:, 2]
full['OnTimePCA4'] = pca5_feature[:, 3]
full = full.drop(PCA_features, axis=1)
full.columns
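PCA is variance-driven, so columns with large raw scales can dominate the leading components. Below is a minimal sketch of a scaled variant, assuming (since the notebook does not show it) that scaling was not applied before the PCA above:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Standardize each column to unit variance before PCA so no single scale dominates
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=5))
pca5_scaled = scaled_pca.fit_transform(PCA_data)
print(scaled_pca.named_steps['pca'].explained_variance_ratio_)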
def normalize(df):
    # z-score standardization, leaving binary/indicator columns untouched
    ignoreList = ['Lat_x', 'Lat_y', 'Lon_x', 'Lon_y', 'Lat_Delta', 'Lon_Delta', 'wd_5', 'wd_6',
                  'Rain_x', 'Rain_y', 'BadWeather_arrival', 'BadWeather_departure',
                  'ToSouth', 'ToNorth', 'ToWest', 'ToEast', 'Winter', 'Summer', 'WinterToSouth',
                  'WinterToNorth', 'WinterToWest', 'WinterToEast', 'SummerToSouth', 'SummerToNorth',
                  'SummerToWest', 'SummerToEast']
    result = df.copy()
    for feature_name in df.columns:
        if feature_name not in ignoreList:
            try:
                mean = df[feature_name].mean()
                std = df[feature_name].std()
                result[feature_name] = (df[feature_name] - mean) / std
            except TypeError:
                pass  # non-numeric column: leave unchanged
    return result
data_weather_std = normalize(weather_data)  # standardize the weather data loaded earlier