A Quick Overview

The main goal of this project is to figure out which habits or life conditions (such as smoking, obesity or advanced age) have the most influence on how much a person spends in hospital, and then to build regression models that can accurately predict a person’s hospital charges from their characteristics.

The dataset was obtained from Kaggle and needed very little cleaning and preparation, so the focus of this project is on data visualization, the insights obtained from it, and the regression models.

The Exploratory Data Analysis (EDA) shows us that smoking is the greatest villain for people seeking hospitals, and when combined with advanced age or high body mass it becomes even worse. So let’s dive into the analysis and keep ourselves healthy!

About the Dataset

Context

Machine Learning with R by Brett Lantz is a book that provides an introduction to machine learning using R. As far as I can tell, Packt Publishing does not make its datasets available online unless you buy the book and create a user account, which can be a problem if you are checking the book out from the library or borrowing it from a friend. All of these datasets are in the public domain but simply needed some cleaning up and recoding to match the format in the book.

Columns

  • age: age of the primary beneficiary
  • sex: insurance contractor gender (female, male)
  • bmi: body mass index, an objective measure of body weight (kg/m²) based on the ratio of weight to height, indicating weights that are relatively high or low relative to height; ideally 18.5 to 24.9
  • children: number of children covered by health insurance / number of dependents
  • smoker: whether the beneficiary smokes (yes/no)
  • region: the beneficiary’s residential area in the US (northeast, southeast, southwest, northwest)
  • charges: individual medical costs billed by health insurance

Acknowledgement

For more information about this dataset you can access this link.

Libraries

All the libraries used in this notebook are listed in the code chunk below.

library(tidyverse)
library(caret)
library(corrplot)
library(scales)
library(ggthemes)
library(knitr)
library(glmnet)
library(caTools)
library(Metrics)
library(randomForest)
library(kableExtra)
library(fastDummies)

Loading and pre-processing data

raw <- read_csv('insurance.csv')

head(raw) %>%
  kable() %>%
  kable_styling(full_width = F)
age sex bmi children smoker region charges
19 female 27.900 0 yes southwest 16884.924
18 male 33.770 1 no southeast 1725.552
28 male 33.000 3 no southeast 4449.462
33 male 22.705 0 no northwest 21984.471
32 male 28.880 0 no northwest 3866.855
31 female 25.740 0 no southeast 3756.622

Now let’s check if there are any “NA” values that can cause problems in our data analysis.

any(is.na(raw))
## [1] FALSE

It is good to know there are no “NA” values in our data; this is not common, but it helps us a lot and saves us some effort. Now we can look at our data to draw useful insights and make the necessary changes.

Now, let’s have a quick look at the data description.

raw <- raw %>%
  mutate_at(c('sex', 'smoker', 'region'), as.factor)

summary(raw) %>%
  kable() %>%
  kable_styling(full_width = F)
age:      Min. 18.00 | 1st Qu. 27.00 | Median 39.00 | Mean 39.21 | 3rd Qu. 51.00 | Max. 64.00
sex:      female 662 | male 676
bmi:      Min. 15.96 | 1st Qu. 26.30 | Median 30.40 | Mean 30.66 | 3rd Qu. 34.69 | Max. 53.13
children: Min. 0.000 | 1st Qu. 0.000 | Median 1.000 | Mean 1.095 | 3rd Qu. 2.000 | Max. 5.000
smoker:   no 1064 | yes 274
region:   northeast 324 | northwest 325 | southeast 364 | southwest 325
charges:  Min. 1122 | 1st Qu. 4740 | Median 9382 | Mean 13270 | 3rd Qu. 16640 | Max. 63770

The youngest person in this dataset is 18 years old and the oldest is 64. Apparently everything is fine with the data, so we can make the final adjustments and study the correlation between the attributes in more detail.

Converting columns to numeric

The columns sex, smoker and region were initially defined as character, but to take part in the regression models they need to be converted to numeric.

Since there is no inherent ordering for any of these variables, they will be converted to dummy variables (which take only the values 0 and 1).

# create 0/1 dummy columns for the categorical variables, dropping the original
# columns and the first dummy of each (to avoid redundant, perfectly correlated columns)
data_num <- dummy_cols(raw, 
                       select_columns = c("sex", "smoker", "region"), 
                       remove_selected_columns = T,
                       remove_first_dummy = T)

colnames(data_num)[which(names(data_num) == "sex_male")] <- "sex" 
colnames(data_num)[which(names(data_num) == "smoker_yes")] <- "smoker" 


head(data_num) %>%
  kable() %>%
  kable_styling(full_width = F)
age bmi children charges sex smoker region_northwest region_southeast region_southwest
19 27.900 0 16884.924 0 1 0 0 1
18 33.770 1 1725.552 1 0 0 1 0
28 33.000 3 4449.462 1 0 0 1 0
33 22.705 0 21984.471 1 0 1 0 0
32 28.880 0 3866.855 1 0 1 0 0
31 25.740 0 3756.622 0 0 0 1 0

Data Analysis

corrplot(cor(data_num), method = "square")

cor(data_num, data_num$charges)
##                         [,1]
## age               0.29900819
## bmi               0.19834097
## children          0.06799823
## charges           1.00000000
## sex               0.05729206
## smoker            0.78725143
## region_northwest -0.03990486
## region_southeast  0.07398155
## region_southwest -0.04321003

The charges have a strong correlation with whether the patient is a smoker, which is just as expected. There is also a weaker correlation with the patient’s age and an even weaker one with the body mass index (bmi), which is the opposite of what was expected, since it is common sense that people with a higher bmi also develop more health problems.
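A quick hedged check of this point (not part of the original analysis): if the bmi effect were concentrated among smokers, the overall correlation would be diluted. Comparing the charges–bmi correlation within each smoking group hints at this.

cor(data_num$charges[data_num$smoker == 1], data_num$bmi[data_num$smoker == 1])  # smokers only
cor(data_num$charges[data_num$smoker == 0], data_num$bmi[data_num$smoker == 0])  # non-smokers only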

But let’s investigate smoking in more detail and see how much patients spend on treatments on average.

Smoking Influence

  ggplot(raw, aes(x = smoker, fill = sex)) + 
  geom_bar(position = "dodge" ) +
  ggtitle('Quantity of smokers and non smokers per sex') + 
  scale_y_continuous(breaks = seq(0,600,100)) + 
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid = element_line(colour = "gray70",
                                  size = 0.5,
                                  linetype = "dashed"))

  ggplot(raw, aes(x = smoker, y = charges)) + 
  geom_boxplot(aes(fill = sex)) +
  ggtitle('Amount of charges for smokers and non smokers') + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = 'K'))

As we can see, being a smoker or not plays a major role in how much a patient spends in the hospital. But the plots above still hide some information from us, so let’s use a violin plot to better understand the data distribution.

sample_size <- raw %>% group_by(smoker) %>% summarise(count = n())

raw %>%
  left_join(sample_size) %>%
  mutate(myaxis = paste( smoker, "\n\n", "n =", count)) %>%
  ggplot(aes(x = myaxis, y = charges)) + 
  geom_violin(aes(fill = sex)) +
  ggtitle('Amount of charges for smokers and non smokers') + 
  xlab("smoker") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid = element_line(colour = "gray70",
                                  size = 0.5,
                                  linetype = "dashed")) + 
  scale_y_continuous(breaks = seq(0,60000,10000), labels = label_number(scale = 1e-3, suffix = 'K'))

The distributions for women and men are quite similar, which means sex does not have a big influence on whether a patient spends more on treatment. Now let’s analyze the age distribution to see if age is a determining factor.

ggplot(data_num, aes(x = age, y = ..density..)) +
  geom_histogram(binwidth = 2, color = "white", fill = "steelblue4") +
  geom_density(lwd = 1.2, color ='thistle4' ) +
  ggtitle('Distribution of age') + 
  scale_x_continuous(breaks = seq(10,75,5), limits = c(10,75)) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid = element_line(colour = "gray70",
                                  size = 0.5,
                                  linetype = "dashed"),
        panel.grid.minor = element_blank())

As we saw earlier, the ages range from 18 to 64 years, but now we have a better notion of how they are distributed. We can observe many young patients under 20 years old, while the other ages are nearly uniformly distributed.

Let’s check whether being a smoker is worse for the younger patients.

raw %>%
  filter(age<=20) %>%
  group_by(smoker) %>%
ggplot(aes(y = smoker, x = charges)) +
  geom_boxplot(aes(fill = smoker)) +
  ggtitle('Charges for patients under 20 years old') + 
  scale_fill_brewer(palette = "Dark2") +
  scale_x_continuous(breaks = seq(0,50000,5000),
                     labels = label_number(scale = 1e-3, suffix = "K", accuracy = 5)) +
  theme(plot.title = element_text(hjust = 0.5))

raw %>%
  filter(age<=30 & age>20) %>%
  group_by(smoker) %>%
ggplot(aes(y = smoker, x = charges)) +
  geom_boxplot(aes(fill = smoker)) +
  ggtitle('Charges for patients from 20 to 30 years old') + 
  scale_fill_brewer(palette = "Dark2") +
  scale_x_continuous(breaks = seq(0,55000,5000),
                     labels = label_number(scale = 1e-3, suffix = "K", accuracy = 5)) +
  theme(plot.title = element_text(hjust = 0.5))

raw %>%
  filter(age<=40 & age>30) %>%
  group_by(smoker) %>%
ggplot(aes(y = smoker, x = charges)) +
  geom_boxplot(aes(fill = smoker)) +
  ggtitle('Charges for patients from 30 to 40 years old') + 
  scale_fill_brewer(palette = "Dark2") +
  scale_x_continuous(breaks = seq(0,60000,5000),
                     labels = label_number(scale = 1e-3, suffix = "K", accuracy = 5)) +
  theme(plot.title = element_text(hjust = 0.5))

raw %>%
  filter(age<=50 & age>40) %>%
  group_by(smoker) %>%
ggplot(aes(y = smoker, x = charges)) +
  geom_boxplot(aes(fill = smoker)) +
  ggtitle('Charges for patients from 40 to 50 years old') + 
  scale_fill_brewer(palette = "Dark2") +
  scale_x_continuous(breaks = seq(0,60000,5000),
                     labels = label_number(scale = 1e-3, suffix = "K", accuracy = 5)) +
  theme(plot.title = element_text(hjust = 0.5))

raw %>%
  filter(age>50) %>%
  group_by(smoker) %>%
ggplot(aes(y = smoker, x = charges)) +
  geom_boxplot(aes(fill = smoker)) +
  ggtitle('Charges for patients over 50 years old') + 
  scale_fill_brewer(palette = "Dark2") +
  scale_x_continuous(breaks = seq(0,70000,5000),
                     labels = label_number(scale = 1e-3, suffix = "K", accuracy = 5)) +
  theme(plot.title = element_text(hjust = 0.5))

The graphs leave us no doubt: the charges for smoking patients are far higher than those for non-smoking patients. This is especially true for patients under 20 years old, where the difference is enormous.

The graphs also show some outliers in the charges for non-smoking patients at every age, probably due to some serious disease or accident. The charges for smoking patients, on the other hand, have no visible outliers, which means all smokers in the collected data spend a lot on treatments.

raw %>%
  filter(smoker == 'yes') %>%
ggplot(aes(y = charges, x = age)) +
  geom_point(pch=16, color = "darkorange3") +
  ggtitle('Charges by smoking patients age') + 
  scale_fill_brewer(palette = "Dark2") +
  scale_y_continuous(breaks = seq(0,70000,5000),
                     labels = label_number(scale = 1e-3, suffix = "K", accuracy = 5)) +
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid = element_line(color = "#cacaca",
                                  size = 0.5,
                                  linetype = "dashed"),
        panel.grid.minor.y = element_blank())

raw %>%
  filter(smoker == 'no') %>%
ggplot(aes(y = charges, x = age)) +
  geom_point(pch=16, color = "forestgreen") +
  ggtitle('Charges by non-smoking patients age') + 
  scale_fill_brewer(palette = "Dark2") +
  scale_y_continuous(breaks = seq(0,70000,5000),
                     labels = label_number(scale = 1e-3, suffix = "K", accuracy = 5)) +
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid = element_line(color = "#cacaca",
                                  size = 0.5,
                                  linetype = "dashed"),
        panel.grid.minor.y = element_blank())

raw %>%
ggplot(aes(y = charges, x = age, color = smoker)) +
  geom_point(pch=16) +
  ggtitle('Charges per age for smokers and non-smokers') + 
  scale_color_manual(values = c("forestgreen", "darkorange3")) +
  scale_y_continuous(breaks = seq(0,70000,5000),
                     labels = label_number(scale = 1e-3, suffix = "K", accuracy = 5)) +
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid = element_line(color = "#cacaca",
                                  size = 0.5,
                                  linetype = "dashed"),
        panel.grid.minor.y = element_blank()) +
  geom_smooth(method = "lm", formula = y ~ x)

The charges for non-smokers follow the linear model quite closely, increasing with the patient’s age; there are some higher values, probably due to diseases and accidents. As for smokers, the values do not fit the linear model well, even though they also increase somewhat with age. High charges are the “expected” outcome when the patient is a smoker, and in the best scenarios smokers spend as much as the non-smoking patients who spend the most on treatments.
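To put a rough number on this visual impression (a small check that is not part of the original modelling), we can fit a simple charges ~ age line separately within each group and compare the variance explained; a much higher value for non-smokers would support the observation above.

non_smokers <- filter(raw, smoker == "no")
smokers <- filter(raw, smoker == "yes")
summary(lm(charges ~ age, data = non_smokers))$r.squared  # R-squared of an age-only fit for non-smokers
summary(lm(charges ~ age, data = smokers))$r.squared      # the same fit restricted to smokers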

I think that is enough reason for you to stop smoking if you haven’t already! Join the healthy side of the force!

Since we have already checked how much a patient spends on treatment according to their age and smoking habit, let’s now check the influence of the other relevant attribute, the body mass index (bmi).

Body Mass Index (BMI) Influence

Body Mass Index is a simple calculation using a person’s height and weight. The formula is \[BMI = \frac{kg}{m^2}\] where kg is the person’s weight in kilograms and m² is the square of their height in meters.

According to the World Health Organization (WHO), for adults up to 65 years old, BMI falls into one of the following categories.

BMI Nutritional Status
Below 18.5 Underweight
18.5 – 24.9 Normal weight
25.0 – 29.9 Pre-obesity
30.0 – 34.9 Obesity class I
35.0 – 39.9 Obesity class II
Above 40 Obesity class III
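
As a small illustration (an assumed encoding, not used by the models later on), the WHO bands above can be mapped onto our data with cut():

raw %>%
  mutate(bmi_category = cut(bmi,
                            breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
                            labels = c("Underweight", "Normal weight", "Pre-obesity",
                                       "Obesity class I", "Obesity class II", "Obesity class III"))) %>%
  count(bmi_category)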

Now let’s check whether BMI has something to do with how much a patient spends on treatment. Just looking at the categories above, it is expected that once BMI goes above 30 patients start to develop more diseases and the charges increase.

data_num %>%
ggplot(aes(x = bmi, y = ..density..)) +
  geom_histogram(binwidth = 2, fill = "darkgoldenrod3", color = "gray30") +
  geom_density(lwd=1.1, color = "coral4") +
  ggtitle('Distribution of BMI') + 
  scale_x_continuous(breaks = seq(0,50, by = 5), limits = c(10,55)) +
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid = element_line(color = "#cacaca",
                                  size = 0.5,
                                  linetype = "dashed"),
        panel.grid.minor.y = element_blank())

summary(data_num$bmi)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.96   26.30   30.40   30.66   34.69   53.13
raw %>%
ggplot(aes(y = charges, x = bmi, color = smoker)) +
  geom_point(pch=16) +
  ggtitle('Charges by BMI for smokers and non-smokers') + 
  scale_y_continuous(breaks = seq(0,70000,10000),
                     labels = label_number(scale = 1e-3, suffix = "K", accuracy = 5)) +
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid = element_line(color = "#cacaca",
                                  size = 0.5,
                                  linetype = "dashed"),
        panel.grid.minor.y = element_blank()) +
  geom_smooth(method = "lm", formula = y ~ x)

As we can see above, BMI alone is not enough to make a person develop diseases and spend a lot on treatments, since the charges for non-smokers are nearly constant regardless of the person’s BMI.

As for smokers, BMI can clearly worsen the health situation. The graph shows a considerable rise in charges for patients who smoke and have a BMI over 30, the point where obesity starts.
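
To quantify that rise (a small addition that is not in the original analysis), we can compare the average charges of smokers below and above the obesity threshold:

raw %>%
  filter(smoker == "yes") %>%
  group_by(obese = bmi >= 30) %>%
  summarise(mean_charges = mean(charges), patients = n())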

Regression Models

Now let’s build some regression models and see which one performs best after some hyperparameter adjustment.

Linear Regression

Let’s start with the classic linear regression.

set.seed(77)

sample <- sample.split(data_num$age, SplitRatio = 0.8)
train <- subset(data_num, sample == T)
test <- subset(data_num, sample == F)

linear_model <- lm(charges ~ ., data = train)

After testing some combinations of predictors for the charges, the linear model using all the other columns performed best, although all of them had very similar metrics.
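
As an illustration of such a comparison (the exact alternatives tried are not recorded here, so this reduced model is only a hypothetical example):

reduced_model <- lm(charges ~ age + bmi + smoker, data = train)
summary(reduced_model)$adj.r.squared  # compare against the adjusted R-squared of the full model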

summary(linear_model)
## 
## Call:
## lm(formula = charges ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10970.1  -2812.4   -852.1   1424.1  29787.8 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -12131.4     1078.7 -11.247   <2e-16 ***
## age                 263.3       13.0  20.248   <2e-16 ***
## bmi                 332.6       31.1  10.695   <2e-16 ***
## children            379.9      150.6   2.522   0.0118 *  
## sex                -161.3      363.3  -0.444   0.6571    
## smoker            23766.4      450.6  52.749   <2e-16 ***
## region_northwest   -150.1      520.9  -0.288   0.7733    
## region_southeast   -614.2      523.5  -1.173   0.2410    
## region_southwest  -1106.9      523.8  -2.113   0.0348 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5908 on 1058 degrees of freedom
## Multiple R-squared:  0.7579, Adjusted R-squared:  0.756 
## F-statistic:   414 on 8 and 1058 DF,  p-value: < 2.2e-16

As shown above, the R² for the first model was around 0.75, not bad for a first regression model, especially without any normalization or hyperparameter adjustment.

We can also calculate the R² for the testing data, which happens to be almost the same.

data.frame("R2_train" = R2(train$charges, predict(linear_model, train)),
           "R2_test" = R2(test$charges, predict(linear_model, test))) %>%
kable(caption = "Metrics", digits = 3) %>%
kable_styling(full_width = F)
Metrics
R2_train R2_test
0.758 0.727

Polynomial Regression

What if our data is actually more complex than a simple straight line?

Surprisingly, we can actually use a linear model to fit nonlinear data. For that, we add powers and products of the features as new features and train a linear model on this extended set of features. This technique is called Polynomial Regression.

poly_model <- lm(charges ~ .^2 , data = train)
summary(poly_model)
## 
## Call:
## lm(formula = charges ~ .^2, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10565.8  -1924.6  -1111.1   -158.8  29323.3 
## 
## Coefficients: (3 not defined because of singularities)
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       -3.887e+03  2.808e+03  -1.384 0.166514    
## age                                2.167e+02  5.732e+01   3.780 0.000166 ***
## bmi                                1.103e+02  9.107e+01   1.211 0.226036    
## children                           1.278e+03  7.518e+02   1.700 0.089376 .  
## sex                               -4.293e+02  1.728e+03  -0.248 0.803818    
## smoker                            -1.930e+04  2.074e+03  -9.306  < 2e-16 ***
## region_northwest                   5.527e+02  2.504e+03   0.221 0.825366    
## region_southeast                   2.987e+03  2.397e+03   1.246 0.212972    
## region_southwest                   2.674e+02  2.429e+03   0.110 0.912363    
## age:bmi                            7.820e-01  1.776e+00   0.440 0.659720    
## age:children                      -6.507e+00  9.321e+00  -0.698 0.485257    
## age:sex                            2.408e+01  2.083e+01   1.156 0.247834    
## age:smoker                         1.778e+01  2.666e+01   0.667 0.505133    
## age:region_northwest              -7.675e-01  2.987e+01  -0.026 0.979507    
## age:region_southeast               3.205e+01  3.038e+01   1.055 0.291725    
## age:region_southwest               2.279e+01  3.083e+01   0.739 0.459929    
## bmi:children                      -1.311e+01  2.125e+01  -0.617 0.537419    
## bmi:sex                           -3.942e+01  5.040e+01  -0.782 0.434278    
## bmi:smoker                         1.422e+03  6.123e+01  23.220  < 2e-16 ***
## bmi:region_northwest              -6.459e+01  7.611e+01  -0.849 0.396296    
## bmi:region_southeast              -1.822e+02  6.677e+01  -2.729 0.006467 ** 
## bmi:region_southwest              -9.204e+01  7.309e+01  -1.259 0.208190    
## children:sex                      -3.233e+02  2.432e+02  -1.330 0.183946    
## children:smoker                   -6.113e+02  3.087e+02  -1.980 0.047931 *  
## children:region_northwest          4.767e+02  3.540e+02   1.347 0.178373    
## children:region_southeast          1.366e+01  3.523e+02   0.039 0.969089    
## children:region_southwest         -1.906e+02  3.411e+02  -0.559 0.576423    
## sex:smoker                        -4.683e+02  7.423e+02  -0.631 0.528255    
## sex:region_northwest               7.646e+02  8.363e+02   0.914 0.360784    
## sex:region_southeast               1.205e+03  8.386e+02   1.436 0.151163    
## sex:region_southwest               6.772e+02  8.382e+02   0.808 0.419340    
## smoker:region_northwest           -5.716e+02  1.083e+03  -0.528 0.597614    
## smoker:region_southeast           -7.711e+02  1.022e+03  -0.754 0.450840    
## smoker:region_southwest            8.333e+02  1.078e+03   0.773 0.439760    
## region_northwest:region_southeast         NA         NA      NA       NA    
## region_northwest:region_southwest         NA         NA      NA       NA    
## region_southeast:region_southwest         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4689 on 1033 degrees of freedom
## Multiple R-squared:  0.8511, Adjusted R-squared:  0.8463 
## F-statistic: 178.9 on 33 and 1033 DF,  p-value: < 2.2e-16
data.frame("R2_train" = R2(train$charges, predict(poly_model, train)),
           "R2_test" = R2(test$charges, predict(poly_model, test))) %>%
kable(caption = "Metrics", digits = 3) %>%
kable_styling(full_width = F)
Metrics
R2_train R2_test
0.851 0.817

In the regression above, we used degree = 2 to predict the costs (in R’s formula syntax, charges ~ .^2 expands to all main effects plus all two-way interaction terms). Since increasing the degree beyond 2 did not give any better results, we stopped there to keep the processing time down.

The more we increase the degree of a polynomial regression, the more terms are generated as combinations of the features, resulting in more processing time and not necessarily better results. For example, in a polynomial regression containing two features a and b with degree = 3, the features a³, b³, a², b², ab, a²b and ab² would be added.
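For completeness, here is a minimal sketch of how explicit squared terms could be added in R for the numeric features; note that poly_model above uses .^2, which adds interaction terms rather than pure powers, and this alternative model is not used in the rest of the analysis.

poly_terms_model <- lm(charges ~ . + I(age^2) + I(bmi^2), data = train)  # I() protects the squared terms in the formula
summary(poly_terms_model)$adj.r.squared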

Ridge, Lasso and Elastic Net

Ridge Regression is a regularized version of Linear Regression. It adds a regularization term (the l2 penalty) to the cost function, forcing the learning algorithm to not only fit the data but also keep the model weights as small as possible.

Lasso Regression is another regularized version of Linear Regression: just like Ridge Regression, it adds a regularization term to the cost function, but it uses the l1 penalty instead of l2. Lasso also tends to completely eliminate the weights of the least important features, shrinking them all the way to zero.

Let’s not define the l1 and l2 penalties here, so as not to get into advanced math explanations.

Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is simply a mix of both Ridge’s and Lasso’s regularization terms, and you can control the mix ratio \(\alpha\) in the glmnet library. When \(\alpha = 0\), Elastic Net is equivalent to Ridge Regression, and when \(\alpha = 1\), it is equivalent to Lasso Regression.
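As a minimal sketch of how this mix ratio is passed to glmnet (using a toy matrix here, since the real model matrices are only built in the next chunk):

x_toy <- matrix(rnorm(100 * 3), ncol = 3)       # toy predictors, for illustration only
y_toy <- rnorm(100)
ridge_fit <- glmnet(x_toy, y_toy, alpha = 0)    # pure Ridge (l2 penalty)
lasso_fit <- glmnet(x_toy, y_toy, alpha = 1)    # pure Lasso (l1 penalty)
mixed_fit <- glmnet(x_toy, y_toy, alpha = 0.5)  # 50/50 Elastic Net mix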

Since we don’t know which type of regularization will perform better, we can run an elastic net regression, increasing \(\alpha\) in small steps and collecting the resulting metrics in a table to see which \(\alpha\) produces the best model.

train_matrix <- model.matrix(charges ~ ., data = train)
test_matrix <- model.matrix(charges ~ ., data = test)
train_matrix_sqr <- model.matrix(charges ~ .^2, data = train)
test_matrix_sqr <- model.matrix(charges ~ .^2, data = test)

y_train <- train$charges
y_test <- test$charges

lambda_grid <- 10^seq(-2,10, length.out = 100)
list_of_fits <- list()

# fit a cross-validated elastic net for each alpha from 0 to 1, in steps of 0.1
for (i in 0:10) {
  alpha_value <- paste("alpha", i/10)
  
  list_of_fits[[alpha_value]] <- 
    cv.glmnet(train_matrix, y_train, alpha = i/10, type.measure = "mse", lambda = lambda_grid)
}

results <- data.frame()

for (i in 0:10) {
  alpha_value <- paste("alpha", i/10)
  
  # predict with lambda.1se: the largest lambda within one standard error of the minimum CV error
  predicted <-
    predict(list_of_fits[[alpha_value]], 
    s = list_of_fits[[alpha_value]]$lambda.1se,
    newx = test_matrix)
  
  predicted_train <-
    predict(list_of_fits[[alpha_value]], 
    s = list_of_fits[[alpha_value]]$lambda.1se,
    newx = train_matrix)
  
  rmse <- rmse(y_test, predicted)
  R2_test <- R2(y_test, predicted)
  R2_train <- R2(y_train, predicted_train)
  
  temp <- data.frame(alpha = i/10, rmse = rmse,
                     R2_test = R2_test, R2_train = R2_train)
  results <- rbind(results, temp)
}

results %>%
  kable(caption = "Elastic Net Regression for Linear Model", 
        digits = 4,
        row.names = F) %>%
  kable_styling(full_width = F)
Elastic Net Regression for Linear Model
alpha rmse R2_test R2_train
0.0 6749.269 0.7258 0.7577
0.1 6826.248 0.7253 0.7572
0.2 6853.075 0.7249 0.7559
0.3 6895.251 0.7230 0.7539
0.4 6835.486 0.7227 0.7538
0.5 6788.117 0.7230 0.7541
0.6 6806.253 0.7223 0.7532
0.7 6821.810 0.7218 0.7524
0.8 6778.683 0.7222 0.7532
0.9 6858.838 0.7206 0.7504
1.0 6800.595 0.7215 0.7519

We can also add regularization after the model has been expanded to polynomial form, so let’s do the same thing for the polynomial model built earlier.

list_of_fits_poly <- list()

for (i in 0:10) {
  alpha_value <- paste("alpha", i/10)
  
  list_of_fits_poly[[alpha_value]] <- 
    cv.glmnet(train_matrix_sqr, y_train, alpha = i/10, type.measure = "mse", lambda = lambda_grid)
}

results_poly <- data.frame()

for (i in 0:10) {
  alpha_value <- paste("alpha", i/10)
  
  predicted_sqr <-
    predict(list_of_fits_poly[[alpha_value]], 
    s = list_of_fits_poly[[alpha_value]]$lambda.1se,
    newx = test_matrix_sqr)
  
  predicted_train_sqr <-
    predict(list_of_fits_poly[[alpha_value]], 
    s = list_of_fits_poly[[alpha_value]]$lambda.1se,
    newx = train_matrix_sqr)
  
  rmse_sqr <- rmse(y_test, predicted_sqr)
  R2_test_sqr <- R2(y_test, predicted_sqr)
  R2_train_sqr <- R2(y_train, predicted_train_sqr)
  
  temp <- data.frame(alpha = i/10, rmse = rmse_sqr,
                     R2_test = R2_test_sqr, R2_train = R2_train_sqr)
  results_poly <- rbind(results_poly, temp)
}

results_poly %>%
  kable(caption = "Elastic Net Regression for Quadratic Model", 
        digits = 4,
        row.names = F) %>%
  kable_styling(full_width = F)
Elastic Net Regression for Quadratic Model
alpha rmse R2_test R2_train
0.0 5813.158 0.7927 0.8357
0.1 5746.907 0.7976 0.8379
0.2 5813.908 0.7931 0.8328
0.3 5847.760 0.7915 0.8303
0.4 5947.150 0.7862 0.8264
0.5 5728.810 0.7987 0.8366
0.6 5809.106 0.7931 0.8312
0.7 5769.418 0.7957 0.8336
0.8 5693.757 0.8011 0.8383
0.9 5832.027 0.7921 0.8284
1.0 5821.074 0.7924 0.8289

For this data, we can see that neither Ridge nor Lasso regression improved the performance of the prediction model. This is probably because the training sample was randomly selected in a representative way and our initial models were already well fitted to the data without overfitting.
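One way to double-check this (a small addition, reusing the list_of_fits objects from above) is to look at which coefficients the pure Lasso fit kept at lambda.1se; if few of them were shrunk to zero, all features were carrying useful signal and there was little for the regularization to prune.

coef(list_of_fits[["alpha 1"]], s = "lambda.1se")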

Random Forest

Finally, let’s try to predict the costs using Random Forest Regression.

rf <- randomForest(charges ~.,
                   data = train,
                   ntree = 200,   # number of trees in the forest
                   mtry = 3       # number of variables tried at each split
                   )

rf
## 
## Call:
##  randomForest(formula = charges ~ ., data = train, ntree = 200,      mtry = 3) 
##                Type of random forest: regression
##                      Number of trees: 200
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 20940092
##                     % Var explained: 85.35
data.frame("R2_train" = R2(train$charges, predict(rf, train)),
           "R2_test" = R2(test$charges, predict(rf, test))) %>%
  kable(caption = "Random Forest Model", 
        digits = 4,
        row.names = F) %>%
  kable_styling(full_width = F)
Random Forest Model
R2_train R2_test
0.948 0.828

Comparison

The linear model had the worst performance, while the random forest seems to have the best one. This is probably due to the non-linearity of the data: when we transformed the linear model into a quadratic one, the predictions already improved considerably.

It is clear that the Random Forest regression adapted better to the training data than to the test data (a bit of overfitting), which is quite common with decision tree algorithms. Perhaps it could perform better after tuning the hyperparameters with a grid search, but the dataset is probably not big enough for much improvement; a possible sketch of such tuning is shown below.
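
As a sketch of what such tuning could look like with caret (an illustration only; the grid values are assumptions and this search was not run as part of the analysis above):

set.seed(77)
rf_tuned <- train(charges ~ ., data = train,
                  method = "rf",
                  ntree = 200,
                  trControl = trainControl(method = "cv", number = 5),  # 5-fold cross-validation
                  tuneGrid = expand.grid(mtry = 2:6))                   # candidate values for mtry
rf_tuned$bestTune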