MSBA OPIM 602 Project 1 - Capital Bikeshare Multiple Linear Regression
Created: November 9, 2021
This webpage is for Project 1 of the OPIM 602 course, Machine Learning I, in the MSBA program at the Georgetown University McDonough School of Business.
# Introduction
Capital Bikeshare (the “Company”) is a micro-transit, bicycle-sharing system that serves the Washington, DC, Virginia, and Maryland areas. The Company offers single-trip, day-trip, and annual memberships. This paper analyzes the effects of predictors collected in 2011 and 2012 to determine the most impactful drivers of bike rental demand. Through the analysis detailed below, it is evident that bike rental demand is primarily driven by temperature, humidity, and certain hours of the day.
# Pre-Modeling Approach
Process and Assumptions
To begin, we’ll load in the data used in our analysis.
For the purposes of the analysis, a multiple linear regression model was used, with an alpha value of 0.05 as the benchmark for statistical significance. Though some predictors (e.g., hour of the day, season, etc.) were provided as numbers, I assumed they are intended to be treated as factor levels, as they are limited in range (e.g., 0 to 23, 1 to 4, etc.), categorical in nature, and will not take decimal values during predictions. The categorical predictors were converted to dummy variables (i.e., a column for each category, with a 1 or 0 specifying whether the observation matches that category) so that specific levels within each category could be identified. The model predictions are used to determine aggregate demand; in other words, demand for both casual and registered users. The data was separated into a training set, consisting of approximately 70% of the data, and a testing set, consisting of the remaining data points. The model was constructed on the training set, whereas the reported model metrics are based on the testing set.
My exploration of the data types is outlined in the code block below.
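Since the original code did not survive, here is a minimal sketch of the dtype check on a small synthetic frame; the frame itself is an assumption, with column names borrowed from the dataset's schema.

```python
import pandas as pd

# Small synthetic stand-in for the hourly Capital Bikeshare data
# (only the column names follow the real dataset).
bikes = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "hr": [0, 8, 17, 23],
    "temp": [0.24, 0.52, 0.80, 0.30],
    "cnt": [16, 120, 350, 40],
})

# Integer-coded columns such as "season" and "hr" are inferred as
# numeric dtypes even though they are categorical in nature.
print(bikes.dtypes)
```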
I then split the data into training and testing sets using simple random sampling with a 70% / 30% split.
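A sketch of the split, assuming simple random sampling via pandas (the seed and stand-in frame are illustrative, not from the original analysis):

```python
import pandas as pd

# Synthetic stand-in for the prepared hourly data.
bikes = pd.DataFrame({"temp": range(100), "cnt": range(100)})

# Simple random sampling: 70% of rows for training, the rest for testing.
train = bikes.sample(frac=0.7, random_state=42)
test = bikes.drop(train.index)
```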
I then checked for missing values; none were found, so imputation was not required prior to model fitting.
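The missing-value check can be sketched as follows (synthetic data; the real frame had no missing entries either):

```python
import pandas as pd

# Synthetic stand-in with no missing entries.
bikes = pd.DataFrame({"temp": [0.24, 0.52, 0.80], "cnt": [16, 120, 350]})

# Per-column count of missing values; all zeros means no imputation is needed.
missing = bikes.isna().sum()
print(missing)
```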
Given my exploration of the data types, I converted some variables to factor variables, as they were categorical in nature, and removed the unnecessary columns.
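A minimal sketch of that step, assuming a pandas workflow; only "instant" is dropped here for brevity, whereas the full analysis also removed the other excluded columns:

```python
import pandas as pd

# Synthetic stand-in for the raw data.
bikes = pd.DataFrame({
    "instant": [1, 2, 3],
    "season": [1, 2, 3],
    "hr": [0, 8, 17],
    "cnt": [16, 120, 350],
})

# Treat the integer-coded predictors as categorical factors.
for col in ["season", "hr"]:
    bikes[col] = bikes[col].astype("category")

# Drop columns excluded from modeling (only "instant" shown here).
bikes = bikes.drop(columns=["instant"])
```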
I then converted factor variables to dummies to assist in the modeling process.
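The dummy-variable conversion can be sketched with `pd.get_dummies` (the tiny input frame is illustrative):

```python
import pandas as pd

# Synthetic stand-in with one categorical predictor.
bikes = pd.DataFrame({"hr": pd.Categorical([0, 8, 17]), "cnt": [16, 120, 350]})

# One indicator (1/0) column per category level, e.g. "hr_17" for 5:00 PM.
dummies = pd.get_dummies(bikes, columns=["hr"], prefix="hr", dtype=int)
```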
Excluded Predictors
Certain predictors were intentionally excluded from the analysis: “instant” (an identifier for each measurement), “dteday” (specific to the day, month, and year, which would differ in future predictions), “yr” (a binary flag signifying either 2011 or 2012, which would not apply to future predictions), and “casual” / “registered” (as we are predicting aggregate demand). Throughout the rest of the analysis, any reference to the predictors means the original set of predictors less those identified above, along with any incremental exclusions identified below.
Collinearity
To check for collinearity among the remaining predictors, I produced a correlation matrix of the numeric variables along with the dependent variable.
As shown in the chart above, there is a strong, positive correlation between the “temp” and “atemp” predictors, both of which also have a moderate, positive relationship with the dependent variable, “cnt” (with “temp” having the higher correlation). There are also moderate, negative correlations between “windspeed” and “hum”, and between “hum” and “cnt”. The diagonal of the plot displays the distribution density of each variable. The distribution of “cnt”, the dependent variable, is positively skewed, which motivates potential feature engineering. Additionally, the “workingday” predictor is a higher-level grouping of the “weekday” predictor. As a result, “atemp” and “workingday” were removed from the data set for model training.
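The correlation check and the resulting column drop can be sketched as follows; the synthetic data below merely mimics the near-collinearity of “temp” and “atemp” seen in the chart.

```python
import numpy as np
import pandas as pd

# Synthetic data in which "atemp" is nearly collinear with "temp".
rng = np.random.default_rng(0)
temp = rng.uniform(0, 1, 200)
bikes = pd.DataFrame({
    "temp": temp,
    "atemp": 0.9 * temp + rng.normal(0, 0.02, 200),
    "hum": rng.uniform(0, 1, 200),
    "cnt": 300 * temp + rng.normal(0, 30, 200),
})

# Pairwise Pearson correlations; "temp" vs "atemp" is near 1.
corr = bikes.corr()

# Remove the redundant predictor before model training.
bikes = bikes.drop(columns=["atemp"])
```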
# Initial Model
Predictors
An initial model was constructed using all predictors before any feature engineering. A backward stepwise approach was used to derive the top three predictors of bike rentals. This resulted in a model with demand predicted by “temp” (normalized recorded temperature), “hr_17” (5:00 PM), and “hr_18” (6:00 PM).
Focusing on the top three factors, this yielded a three-variable model: temperature, hour 17, and hour 18.
Linear Regression Assumptions and Influential Data Points – Initial Model
There are four linear regression assumptions I tested: linearity, normality, homoscedasticity, and multicollinearity. Linearity was assessed during the collinearity matrix analysis, which suggested moderate linear relationships between the predictors and the dependent variable. For the initial model, the normality and homoscedasticity assumptions were not satisfied: the Anderson-Darling test on the residuals rejected normality, and the Breusch-Pagan test on the residuals rejected constant error variance. There were no issues with multicollinearity when assessing the VIF scores for the initial model.
In addition to testing the regression assumptions, I inspected the model for influential data points. Since the provided data is assumed to be correct, I opted not to remove any influential observations.
# Final Model
Feature Engineering – Dependent Variable
As before, the starting data set for the final model also excluded the “atemp” and “workingday” variables, as they were deemed to have high levels of collinearity with other predictors.
The chart below displays the bike rentals as reported (top-left), natural log transformation (top-right), square root transformation (bottom-left), and cube root transformation (bottom-right).
As reported, the bike rentals had a skewness of 1.28, showing the dependent variable is positively skewed. The log transformation shifted the skewness from 1.28 to -0.93, or negatively skewed. The square root and cube root transformations shifted the skewness from 1.28 to 0.29 and -0.08, respectively. Given the near-zero skewness of the cube root transformation, I transformed the dependent variable by taking its cube root.
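The effect of the cube root transformation on a right-skewed variable can be sketched as below; the synthetic counts are illustrative only and do not reproduce the reported 1.28 skewness.

```python
import numpy as np
import pandas as pd

# Right-skewed synthetic rental counts (the real "cnt" had skewness 1.28).
rng = np.random.default_rng(3)
cnt = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=2000))

# The cube root transformation pulls the skewness toward zero.
cnt_cbrt = np.cbrt(cnt)
```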
Feature Engineering – “Hr” Predictor
I then visualized the bike rental demand by hour of the day. Based on the visual, there is an uptick in demand during certain hours (mainly commute times) and lower demand in others (such as early morning / late night). To simplify the data, I binned the “hr” predictor into three distinct bins of nearly equal rental demand: working hours, other hours, and commuting hours. The plot below displays the rental demand by hour bin, along with labels noting which hours each bin covers.
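A sketch of the binning logic follows. The exact bin boundaries were not preserved: the off-peak “other” window follows the 10:00 PM to 6:59 AM range cited in the recommendation, while the commute/working split below is an assumption.

```python
import pandas as pd

# Hypothetical bin boundaries -- only the "other_hours" window
# (10:00 PM to 6:59 AM) is grounded in the text; the rest is assumed.
def hr_bin(h):
    if h in (7, 8, 16, 17, 18):        # assumed commute peaks
        return "commuting_hours"
    if 9 <= h <= 15 or 19 <= h <= 21:  # assumed daytime / evening hours
        return "working_hours"
    return "other_hours"               # 10:00 PM through 6:59 AM

hr_category = pd.Series(range(24), name="hr").map(hr_bin)
```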
Final Model
Based on my modeling procedures, the top three predictors of aggregate demand are hourly normalized temperature (“temp”), hourly normalized humidity (“hum”), and the binned other hour category. The regression equation can be seen below.
$$ \sqrt[3]{\operatorname{cnt}} = \alpha + \beta_{1}(\operatorname{temp}) + \beta_{2}(\operatorname{hum}) + \beta_{3}(\operatorname{hr\_category\_other\_hours}) + \epsilon $$
The model has an adjusted R² of 0.541, meaning that 54.1% of the variation in (cube-root-transformed) aggregate demand is explained by the three variables in my final model. Additionally, the RMSE of the model is 1.381 on the cube-root scale, which is low relative to the transformed rental counts. Given the cube root transformation of the dependent variable, predictions should be raised to the third power to reverse the transformation when interpreting them.
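Undoing the cube root transformation is a one-liner; the fitted values below are made up for illustration.

```python
import numpy as np

# The model predicts the cube root of demand; cube the fitted values
# to interpret predictions on the original rental-count scale.
pred_cbrt = np.array([5.0, 6.2, 7.1])
pred_cnt = pred_cbrt ** 3
```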
Linear Regression Assumptions – Final Model
As part of the auditing process of creating my multiple linear regression model, I tested my model for the four regression assumptions, which are detailed below.
- Linearity: The standardized residuals closely follow the reference line in the Q-Q plot, suggesting a linear relationship.
- Normality: The residuals appear to be randomly scattered around the red zero reference line, initially suggesting normality. However, the Anderson-Darling test for normality of the residuals returns a p-value near zero, which indicates the normality assumption is not satisfied.
- Homoscedasticity: There does not appear to be a funnel-shaped pattern in the residuals, initially suggesting homoscedasticity and constant error variance. However, the Breusch-Pagan test for constant error variance returns a p-value near zero, which indicates that the error terms have non-constant variance and are heteroscedastic.
- Multicollinearity: I calculated the variance inflation factor (VIF) for my final model to inspect for multicollinearity among the predictors. Given the VIF values are low, in roughly the 1.0 to 1.2 range, there is no evidence of multicollinearity.
Predictions on Test Data
The performance of the final model on the training set and testing set is displayed in the table below.
| Metric | Training Set | Testing Set |
|---|---|---|
| MSE | 1.897385 | 1.906189 |
| RMSE | 1.377456 | 1.380648 |
| MAE | 1.142663 | 1.143901 |
| Correlation | 0.734301 | 0.735362 |
| Adj. R^2 | 0.539197 | 0.540758 |
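The metrics in the table can be computed as sketched below; the arrays are illustrative and do not reproduce the reported values.

```python
import numpy as np

# The evaluation metrics reported above, on illustrative arrays.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_true - y_pred))
corr = np.corrcoef(y_true, y_pred)[0, 1]
```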
# Recommendation
Based on my analysis, the top three predictors of bike rental demand are the “other” hours of the day, temperature, and humidity. While my model is statistically significant overall, and each of the three predictors is statistically significant, the normality and homoscedasticity regression assumptions were not satisfied. As such, I recommend that Capital Bikeshare management explore other machine learning models that may better fit the data, as a multiple linear regression model may not be appropriate. In the meantime, Capital Bikeshare can further explore the relationship between temperature, humidity, and time of day, as these predictors are statistically significant and, intuitively, practically significant. Specifically, management should target marketing campaigns on warmer days, days with low humidity, and hours of the day outside the off-peak window (i.e., 10:00 PM to 6:59 AM). This, in turn, will help increase ridership and drive top-line performance.