Li Wang & Bryan Bittner 2022-07-10
rmarkdown::render("Project2.Rmd",
output_format = "github_document",
output_file = "README.md",
output_options = list(html_preview= FALSE,toc=TRUE,toc_depth=2,toc_float=TRUE)
)
We will use the following packages:
library(rmarkdown)
library(httr)
library(jsonlite)
library(readr)
library(tidyverse)
library(lubridate)
library(knitr)
library(caret)
library(randomForest)
library(corrplot)
library(gbm)
This online News Popularity Data Set summarizes a heterogeneous set of features about articles published by Mashable in a period of two years.
Our target variable is the shares variable(Number of shares ), and predict variables are the following:
The purpose of our analysis is to predict the number of shares in social networks (popularity). In this project, we produce some basic summary statistics and plots about the training data, and fit a linear regression model and an ensemble tree-based model.
Use a relative path to import the data.
newsData<-read_csv(file="../Datasets/OnlineNewsPopularity.csv",show_col_types = FALSE)
head(newsData)
## # A tibble: 6 × 61
## url timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_uniq… num_hrefs
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 http://mashable.c… 731 12 219 0.664 1.00 0.815 4
## 2 http://mashable.c… 731 9 255 0.605 1.00 0.792 3
## 3 http://mashable.c… 731 9 211 0.575 1.00 0.664 3
## 4 http://mashable.c… 731 9 531 0.504 1.00 0.666 9
## 5 http://mashable.c… 731 13 1072 0.416 1.00 0.541 19
## 6 http://mashable.c… 731 10 370 0.560 1.00 0.698 2
## # … with 53 more variables: num_self_hrefs <dbl>, num_imgs <dbl>, num_videos <dbl>, average_token_length <dbl>,
## # num_keywords <dbl>, data_channel_is_lifestyle <dbl>, data_channel_is_entertainment <dbl>, data_channel_is_bus <dbl>,
## # data_channel_is_socmed <dbl>, data_channel_is_tech <dbl>, data_channel_is_world <dbl>, kw_min_min <dbl>,
## # kw_max_min <dbl>, kw_avg_min <dbl>, kw_min_max <dbl>, kw_max_max <dbl>, kw_avg_max <dbl>, kw_min_avg <dbl>,
## # kw_max_avg <dbl>, kw_avg_avg <dbl>, self_reference_min_shares <dbl>, self_reference_max_shares <dbl>,
## # self_reference_avg_sharess <dbl>, weekday_is_monday <dbl>, weekday_is_tuesday <dbl>, weekday_is_wednesday <dbl>,
## # weekday_is_thursday <dbl>, weekday_is_friday <dbl>, weekday_is_saturday <dbl>, weekday_is_sunday <dbl>, …
Subset the data. If running the reports by an automated parameter driven process, the report will automatically use the parameter passed into this report. If running the report manually without a parameter, the data will subset to the ‘lifestyle’ news channel.
#Read the parameter being passed in to the automated report
if (params$columnNames != "") {
paramColumnNameType<-params$columnNames
}else{
paramColumnNameType<-"lifestyle"
}
columnName<-paste("data_channel_is_",paramColumnNameType,sep="")
#According to dplyr help, to refer to column names stored as string, use the '.data' pronoun.
#https://dplyr.tidyverse.org/reference/filter.html
newsDataSubset <- filter(newsData,.data[[columnName]] == 1)
Merging the weekdays columns channels as one single column named publishing_day.
# Merging the weekdays columns channels as one single column named publishing_day
newsDataSubset <- newsDataSubset %>%
select(url, starts_with("weekday_is")) %>%
pivot_longer(-url) %>%
dplyr::filter(value > 0) %>%
mutate(publishing_day = gsub("weekday_is_", "", name)) %>%
left_join(newsDataSubset, by = "url") %>%
select(-name, -starts_with("weekday_is_"))
# set the publishing_day as factor variable
newsDataSubset$publishing_day<- as.factor(newsDataSubset$publishing_day)
head(newsDataSubset)
## # A tibble: 6 × 56
## url value publishing_day timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_uniq…
## <chr> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 http:/… 1 monday 731 8 960 0.418 1.00 0.550
## 2 http:/… 1 monday 731 10 187 0.667 1.00 0.800
## 3 http:/… 1 monday 731 11 103 0.689 1.00 0.806
## 4 http:/… 1 monday 731 10 243 0.619 1.00 0.824
## 5 http:/… 1 monday 731 8 204 0.586 1.00 0.698
## 6 http:/… 1 monday 731 11 315 0.551 1.00 0.702
## # … with 47 more variables: num_hrefs <dbl>, num_self_hrefs <dbl>, num_imgs <dbl>, num_videos <dbl>,
## # average_token_length <dbl>, num_keywords <dbl>, data_channel_is_lifestyle <dbl>, data_channel_is_entertainment <dbl>,
## # data_channel_is_bus <dbl>, data_channel_is_socmed <dbl>, data_channel_is_tech <dbl>, data_channel_is_world <dbl>,
## # kw_min_min <dbl>, kw_max_min <dbl>, kw_avg_min <dbl>, kw_min_max <dbl>, kw_max_max <dbl>, kw_avg_max <dbl>,
## # kw_min_avg <dbl>, kw_max_avg <dbl>, kw_avg_avg <dbl>, self_reference_min_shares <dbl>,
## # self_reference_max_shares <dbl>, self_reference_avg_sharess <dbl>, is_weekend <dbl>, LDA_00 <dbl>, LDA_01 <dbl>,
## # LDA_02 <dbl>, LDA_03 <dbl>, LDA_04 <dbl>, global_subjectivity <dbl>, global_sentiment_polarity <dbl>, …
Here we drop some non-preditive variables: url,value,timedelta,data_channel_is_lifestyle, data_channel_is_entertainment,data_channel_is_bus, data_channel_is_socmed ,data_channel_is_tech,data_channel_is_world columns,is_weekend. They won’t contribute anything.
newsDataSubset<-newsDataSubset%>%select(-c(1,2,4,16:21,34))
newsDataSubset
## # A tibble: 2,099 × 46
## publishing_day n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 monday 8 960 0.418 1.00 0.550 21
## 2 monday 10 187 0.667 1.00 0.800 7
## 3 monday 11 103 0.689 1.00 0.806 3
## 4 monday 10 243 0.619 1.00 0.824 1
## 5 monday 8 204 0.586 1.00 0.698 7
## 6 monday 11 315 0.551 1.00 0.702 4
## 7 monday 10 1190 0.409 1.00 0.561 25
## 8 monday 6 374 0.641 1.00 0.828 7
## 9 tuesday 12 499 0.513 1.00 0.662 14
## 10 wednesday 11 223 0.662 1.00 0.826 5
## # … with 2,089 more rows, and 39 more variables: num_self_hrefs <dbl>, num_imgs <dbl>, num_videos <dbl>,
## # average_token_length <dbl>, num_keywords <dbl>, kw_min_min <dbl>, kw_max_min <dbl>, kw_avg_min <dbl>,
## # kw_min_max <dbl>, kw_max_max <dbl>, kw_avg_max <dbl>, kw_min_avg <dbl>, kw_max_avg <dbl>, kw_avg_avg <dbl>,
## # self_reference_min_shares <dbl>, self_reference_max_shares <dbl>, self_reference_avg_sharess <dbl>, LDA_00 <dbl>,
## # LDA_01 <dbl>, LDA_02 <dbl>, LDA_03 <dbl>, LDA_04 <dbl>, global_subjectivity <dbl>, global_sentiment_polarity <dbl>,
## # global_rate_positive_words <dbl>, global_rate_negative_words <dbl>, rate_positive_words <dbl>,
## # rate_negative_words <dbl>, avg_positive_polarity <dbl>, min_positive_polarity <dbl>, max_positive_polarity <dbl>, …
Lets set up our Train/Test split. This will allow us to determine the model fit using a subset of data called Training, while saving the remainder of the data called Test to test our model predictions with.
set.seed(111)
train <- sample(1:nrow(newsDataSubset),size=nrow(newsDataSubset)*0.7)
test <- dplyr::setdiff(1:nrow(newsDataSubset),train)
newsDataSubsetTrain <- newsDataSubset[train,]
newsDataSubsetTest <- newsDataSubset[test,]
Start with the data structure and basic summary statistics for the ‘shares’ field.
# data structure
str(newsDataSubsetTrain)
## tibble [1,469 × 46] (S3: tbl_df/tbl/data.frame)
## $ publishing_day : Factor w/ 7 levels "friday","monday",..: 1 5 1 6 5 4 6 5 5 7 ...
## $ n_tokens_title : num [1:1469] 13 8 14 8 11 11 13 9 10 12 ...
## $ n_tokens_content : num [1:1469] 776 1331 879 762 743 ...
## $ n_unique_tokens : num [1:1469] 0.499 0.419 0.419 0.375 0.532 ...
## $ n_non_stop_words : num [1:1469] 1 1 1 1 1 ...
## $ n_non_stop_unique_tokens : num [1:1469] 0.699 0.612 0.584 0.497 0.713 ...
## $ num_hrefs : num [1:1469] 4 3 19 11 9 22 10 11 10 12 ...
## $ num_self_hrefs : num [1:1469] 2 0 4 2 0 2 5 3 3 1 ...
## $ num_imgs : num [1:1469] 1 1 12 1 1 0 0 4 0 2 ...
## $ num_videos : num [1:1469] 0 0 0 0 0 0 0 1 0 0 ...
## $ average_token_length : num [1:1469] 4.63 4.72 4.54 4.9 4.75 ...
## $ num_keywords : num [1:1469] 5 7 9 8 5 10 10 9 7 8 ...
## $ kw_min_min : num [1:1469] -1 4 217 4 -1 -1 217 -1 4 -1 ...
## $ kw_max_min : num [1:1469] 1600 1400 456 309 566 728 1400 690 2100 1000 ...
## $ kw_avg_min : num [1:1469] 319 346 344 132 113 ...
## $ kw_min_max : num [1:1469] 8100 24500 0 0 11600 39700 0 0 0 0 ...
## $ kw_max_max : num [1:1469] 843300 843300 69100 843300 843300 ...
## $ kw_avg_max : num [1:1469] 407740 209814 25289 153025 341800 ...
## $ kw_min_avg : num [1:1469] 2488 2959 0 0 3397 ...
## $ kw_max_avg : num [1:1469] 4640 10290 3073 4638 5972 ...
## $ kw_avg_avg : num [1:1469] 3714 4967 2379 2946 4232 ...
## $ self_reference_min_shares : num [1:1469] 9800 0 11000 5200 0 2000 2300 1400 2900 2200 ...
## $ self_reference_max_shares : num [1:1469] 9800 0 11000 5200 0 2000 3400 24000 2900 2200 ...
## $ self_reference_avg_sharess : num [1:1469] 9800 0 11000 5200 0 ...
## $ LDA_00 : num [1:1469] 0.0408 0.0289 0.0226 0.0255 0.3398 ...
## $ LDA_01 : num [1:1469] 0.04 0.0287 0.0223 0.025 0.04 ...
## $ LDA_02 : num [1:1469] 0.2609 0.0286 0.0223 0.1508 0.0401 ...
## $ LDA_03 : num [1:1469] 0.2628 0.0286 0.2491 0.0251 0.0401 ...
## $ LDA_04 : num [1:1469] 0.396 0.885 0.684 0.774 0.54 ...
## $ global_subjectivity : num [1:1469] 0.424 0.386 0.42 0.378 0.388 ...
## $ global_sentiment_polarity : num [1:1469] 0.1147 0.1208 0.1898 0.0835 0.0744 ...
## $ global_rate_positive_words : num [1:1469] 0.0464 0.0331 0.0512 0.0315 0.0458 ...
## $ global_rate_negative_words : num [1:1469] 0.0193 0.0143 0.0171 0.021 0.0162 ...
## $ rate_positive_words : num [1:1469] 0.706 0.698 0.75 0.6 0.739 ...
## $ rate_negative_words : num [1:1469] 0.294 0.302 0.25 0.4 0.261 ...
## $ avg_positive_polarity : num [1:1469] 0.301 0.42 0.357 0.281 0.294 ...
## $ min_positive_polarity : num [1:1469] 0.0333 0.0333 0.1 0.0625 0.0625 ...
## $ max_positive_polarity : num [1:1469] 0.6 1 0.8 1 0.6 0.5 0.6 0.8 0.8 0.8 ...
## $ avg_negative_polarity : num [1:1469] -0.259 -0.25 -0.161 -0.144 -0.267 ...
## $ min_negative_polarity : num [1:1469] -1 -0.5 -0.8 -0.4 -0.8 -0.6 -0.8 -0.5 0 -0.8 ...
## $ max_negative_polarity : num [1:1469] -0.1 -0.125 -0.05 -0.05 -0.025 -0.3 -0.1 -0.05 0 -0.05 ...
## $ title_subjectivity : num [1:1469] 0 1 0.447 0 0.527 ...
## $ title_sentiment_polarity : num [1:1469] 0 0 0.253 0 -0.132 ...
## $ abs_title_subjectivity : num [1:1469] 0.5 0.5 0.0533 0.5 0.0273 ...
## $ abs_title_sentiment_polarity: num [1:1469] 0 0 0.253 0 0.132 ...
## $ shares : num [1:1469] 750 5100 3600 987 2600 1800 3700 2800 3400 1200 ...
# data summary
summary(newsDataSubsetTrain$shares)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28 1100 1600 3677 3200 208300
Now lets show the Mean, Median, Variance, and Standard Deviation. Notice the Variance and Standard Deviation are both extremely high. This might be something we will have to investigate further.
newsDataSubsetTrain %>% summarise(avg = mean(shares), med = median(shares), var = var(shares), sd = sd(shares))
## # A tibble: 1 × 4
## avg med var sd
## <dbl> <dbl> <dbl> <dbl>
## 1 3677. 1600 87380046. 9348.
Looking at the different columns in the dataset, there are two that stand out. Generally speaking, people probably aren’t going to look at articles that don’t have images or videos. Here are the summary stats for the articles grouped on the number of images in the article.
newsDataSubsetTrain %>% group_by(num_imgs) %>%
summarise(avg = mean(shares), med = median(shares), var = var(shares), sd = sd(shares))
## # A tibble: 39 × 5
## num_imgs avg med var sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 3251. 1700 23231753. 4820.
## 2 1 3469. 1500 89161464. 9443.
## 3 2 3653. 1500 99053188. 9953.
## 4 3 3086. 2050 9118448. 3020.
## 5 4 1867. 1200 2769069. 1664.
## 6 5 4640. 1800 84775414. 9207.
## 7 6 2960. 1500 26086991. 5108.
## 8 7 3293. 2300 5881784. 2425.
## 9 8 3106. 1850 15019514. 3876.
## 10 9 4285. 2250 42337039. 6507.
## # … with 29 more rows
As we can see from the above table, the number of shares tend to increase as the number of images increases. Therefore, the number of images variable affects shares, and we will keep this variable.
Here are the summary stats for articles with videos.
newsDataSubsetTrain %>% group_by(num_videos) %>%
summarise(avg = mean(shares), med = median(shares), var = var(shares), sd = sd(shares))
## # A tibble: 16 × 5
## num_videos avg med var sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 3534. 1600 71966935. 8483.
## 2 1 3671. 1700 30614286. 5533.
## 3 2 2361. 1700 4504196. 2122.
## 4 3 3589. 1600 38755051. 6225.
## 5 4 6667. 2000 66743333. 8170.
## 6 5 8200. 1500 183997950. 13565.
## 7 6 6950 6950 245000 495.
## 8 7 9033 7050 99351823. 9968.
## 9 8 833 833 NA NA
## 10 9 1100 1100 NA NA
## 11 10 1699. 1300 1109348. 1053.
## 12 11 2833. 3100 303333. 551.
## 13 12 4400 4400 8000000 2828.
## 14 15 196700 196700 NA NA
## 15 28 660 660 NA NA
## 16 50 932 932 NA NA
As we can see from the above table, number of shares tend to increase as the number of videos increases. Therefore, the number of videos variable affects shares, and we will keep this variable.
A plot with the number of shares on the y-axis and number of words in the title (n_tokens_title) on the x-axis is created:
g <- ggplot(newsDataSubsetTrain, aes(x = n_tokens_title, y = shares))
g + geom_point()+labs(title = "Plot of shares VS n_tokens_title")
The number of shares will vary depending on on the channel type. But there is clearly a relationship between the number of words in the title and the number of shares. Therefore, the number of words in the title affects shares, and we will keep n_tokens_title variable.
A plot with the number of shares on the y-axis and publishing day (publishing_day) on the x-axis is created:
g <- ggplot(newsDataSubsetTrain, aes(x = publishing_day, y = shares))
g + geom_point()+labs(title = "Plot of shares VS publishing_day")
Looking at the plot, some of the days will have a higher number of shares and some of the days will have a lower number of shares. The days with the higher shares will vary depending on the channel. For example, it makes sense that some of the business related channels have a higher share rate during the work week than the weekend. Therefore, the publishing_day affects shares, and we will keep publishing_day.
A plot with the number of shares on the y-axis and rate of positive words (rate_positive_words) on the x-axis is created:
g <- ggplot(newsDataSubsetTrain, aes(x = rate_positive_words, y = shares))
g + geom_point()+labs(title = "Plot of shares VS rate_positive_words")
Looking across the plots for each of the channels, there is a correlation between using positive words and a higher share number. Therefore, the variable rate_positive_words effects shares, and we will keep this variable.
A plot with the number of shares on the y-axis and number of words in the content (n_tokens_content) on the x-axis is created:
g <- ggplot(newsDataSubsetTrain, aes(x = n_tokens_content, y = shares))
g + geom_point()+labs(title = "Plot of shares VS n_tokens_content")
For each of the channel types, it is easy to see that the number of shares will decrease as the number of words in the article increases. For most channel types, the highest shares are with the articles that have less than 2000 words. Therefore, the variable n_tokens_content effects shares, and we will keep this variable.
A plot with the number of shares on the y-axis and average word length (average_token_length) on the x-axis is created:
g <- ggplot(newsDataSubsetTrain, aes(x = average_token_length, y = shares))
g + geom_point()+labs(title = "Plot of shares VS average_token_length")
From the above plot, we can see that the most shares contain 4-6 length words. Therefore, the variable average_token_length effects shares, and we will keep this variable.
Correlation matrix plot is generated:
newsDataSubsetTrain1<-select(newsDataSubsetTrain,-publishing_day)
corr=cor(newsDataSubsetTrain1, method = c("spearman"))
corrplot(corr,tl.cex=0.5)
By the above correlation matrix plot, we can see these variables are strongly correlated:
These are strongly correlated which makes us to assume that these features are so linearly dependent that any one of the strong correlated feature can be used and excluding the other features with high correlation.
Let’s do feature selection:
newsDataSubsetTrain2<-select(newsDataSubsetTrain,-abs_title_sentiment_polarity, -abs_title_subjectivity, -title_sentiment_polarity,-min_negative_polarity,-max_positive_polarity,-rate_negative_words,-global_rate_negative_words,-global_sentiment_polarity,-LDA_03,-LDA_00,-self_reference_max_shares,-self_reference_min_shares,-kw_max_avg,-kw_min_avg,-kw_min_max,-kw_avg_max,-kw_max_max,-kw_max_min,-kw_min_min,-LDA_01,-LDA_02,-num_hrefs,-n_non_stop_unique_tokens,-n_unique_tokens,-n_non_stop_words)
newsDataSubsetTest2<-select(newsDataSubsetTrain,-abs_title_sentiment_polarity, -abs_title_subjectivity, -title_sentiment_polarity,-min_negative_polarity,-max_positive_polarity,-rate_negative_words,-global_rate_negative_words,-global_sentiment_polarity,-LDA_03,-LDA_00,-self_reference_max_shares,-self_reference_min_shares,-kw_max_avg,-kw_min_avg,-kw_min_max,-kw_avg_max,-kw_max_max,-kw_max_min,-kw_min_min,-LDA_01,-LDA_02,-num_hrefs,-n_non_stop_unique_tokens,-n_unique_tokens,-n_non_stop_words)
A Linear Regression Model is the first model type we will look at. These models are an intuitive way to investigate the linear relation between multiple variables. These models make the estimation procedure simple and easy to understand. Linear Regression models can come in all different shapes and sizes and can be used to model more than just a straight linear relationship. Regression models can be modified with interactive and or higher order terms that will conform to a more complex relationship.
For the first linear model example, we can try a model using just the “num_imgs” and “num_videos” as our predictors.
set.seed(111)
#Fit a multiple linear regression model using just the "num_imgs" and "num_videos" as our predictors.
mlrFit <- train(shares ~ num_imgs + num_videos,
data = newsDataSubsetTrain2,
method="lm",
trControl=trainControl(method="cv",number=5))
mlrFit
## Linear Regression
##
## 1469 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1176, 1175, 1175, 1175, 1175
## Resampling results:
##
## RMSE Rsquared MAE
## 8654.134 0.006868955 3338.509
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Next we can try a linear model using all of the fields as a predictor variables.
#Fit a multiple linear regression model using all of the fields as a predictor variables.
set.seed(111)
mlrAllFit <- train(shares ~ .,
data = newsDataSubsetTrain2,
method="lm",
trControl=trainControl(method="cv",number=5))
mlrAllFit
## Linear Regression
##
## 1469 samples
## 20 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1176, 1175, 1175, 1175, 1175
## Resampling results:
##
## RMSE Rsquared MAE
## 8766.142 0.008764755 3435.973
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Try a model using just the num_imgs + num_videos + kw_avg_avg + num_imgs*kw_avg_avg as our predictors.
set.seed(111)
#Fit a multiple linear regression model with num_imgs + num_videos + kw_avg_avg + num_imgs*kw_avg_avg.
mlrInteractionFit <- train(shares ~ num_imgs + num_videos + kw_avg_avg + num_imgs*kw_avg_avg,
data = newsDataSubsetTrain2,
method="lm",
trControl=trainControl(method="cv",number=5))
mlrInteractionFit
## Linear Regression
##
## 1469 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1176, 1175, 1175, 1175, 1175
## Resampling results:
##
## RMSE Rsquared MAE
## 8629.451 0.03062777 3291.096
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The Random Forest Model is an example of an ensemble based model. Instead of traditional decision trees, ensemble methods average across the tree. This will greatly increase our prediction power, but it will come at the expense of the easy interpretation from traditional decision trees. The Random Forest based model will not use all available predictors. Instead it will take a random subset of the predictors for each tree fit and calculate the model fit for that subset. It will repeat the process a pre-determined number of times and automatically pick the best predictors for the model. This will end up creating a reduction in the overall model variance.
set.seed(111)
randomForestFit <- train(shares ~ .,
data = newsDataSubsetTrain2,
method="rf",
preProcess=c("center","scale"),
trControl=trainControl(method="cv",number=5),
tuneGrid=data.frame(mtry=ncol(newsDataSubsetTrain)/3))
randomForestFit
## Random Forest
##
## 1469 samples
## 20 predictor
##
## Pre-processing: centered (25), scaled (25)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1176, 1175, 1175, 1175, 1175
## Resampling results:
##
## RMSE Rsquared MAE
## 9524.759 0.00306765 3817.129
##
## Tuning parameter 'mtry' was held constant at a value of 15.33333
Boosted Regression Tree (BRT) models are a combination of two techniques: decision tree algorithms and boosting methods. It repeatedly fits many decision trees to improve the accuracy of the model.
Boosted Regression Tree uses the boosting method in which the input data are weighted in subsequent trees. The weights are applied in such a way that data that was poorly modelled by previous trees has a higher probability of being selected in the new tree. This means that after the first tree is fitted the model will take into account the error in the prediction of that tree to fit the next tree, and so on. By taking into account the fit of previous trees that are built, the model continuously tries to improve its accuracy. This sequential approach is unique to boosting.
set.seed(111)
BoostedTreeFit <- train(shares ~ .,
data = newsDataSubsetTrain2,
distribution = "gaussian",
method="gbm",
trControl=trainControl(method="cv",number=5),
verbose = FALSE)
BoostedTreeFit
## Stochastic Gradient Boosting
##
## 1469 samples
## 20 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1176, 1175, 1175, 1175, 1175
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 8795.865 0.010744808 3366.154
## 1 100 8812.067 0.009141891 3385.138
## 1 150 8905.616 0.009836261 3437.492
## 2 50 8830.359 0.009322874 3416.717
## 2 100 8962.216 0.006257919 3463.981
## 2 150 9052.275 0.002712680 3507.443
## 3 50 8864.338 0.011828246 3413.643
## 3 100 9026.561 0.007730813 3543.420
## 3 150 9114.531 0.008093483 3588.768
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## Tuning parameter 'n.minobsinnode' was held constant at
## a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 50, interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.
All the models are compared by RMSE on the test set
set.seed(111)
#compute RMSE of MlrFit
mlrFitPred <- predict(mlrFit, newdata = newsDataSubsetTest2)
MlrFit<-postResample(mlrFitPred, newsDataSubsetTest2$shares)
MlrFit.RMSE<-MlrFit[1]
#compute RMSE of MlrAllFit
MlrAllFitPred <- predict(mlrAllFit, newdata = newsDataSubsetTest2)
MlrAllFit<-postResample(MlrAllFitPred, newsDataSubsetTest2$shares)
MlrAllFit.RMSE<-MlrAllFit[1]
#compute RMSE of MlrInterFit
mlrInteractionFitPred <- predict(mlrInteractionFit, newdata = newsDataSubsetTest2)
MlrInterFit<-postResample(mlrInteractionFitPred, newsDataSubsetTest2$shares)
MlrInterFit.RMSE<-MlrInterFit[1]
#compute RMSE of RandomForest
ForestPred <- predict(randomForestFit, newdata = newsDataSubsetTest2)
RandomForest<-postResample(ForestPred, newsDataSubsetTest2$shares)
RandomForest.RMSE<-RandomForest[1]
#compute RMSE of BoostedTree
BoostPred <- predict(BoostedTreeFit, newdata = newsDataSubsetTest2)
BoostedTree<-postResample(BoostPred, newsDataSubsetTest2$shares)
BoostedTree.RMSE<-BoostedTree[1]
#Compare Root MSE values
c(MlrFit=MlrFit.RMSE,MlrAllFit=MlrAllFit.RMSE,MlrInterFit=MlrInterFit.RMSE,RandomForest=RandomForest.RMSE,BoostedTree=BoostedTree.RMSE)
## MlrFit.RMSE MlrAllFit.RMSE MlrInterFit.RMSE RandomForest.RMSE BoostedTree.RMSE
## 9285.358 9187.197 9214.330 5403.120 9059.082
From the above compare, we can see the smallest RMSE is 5403.120 which belong to RandomForest. Therefore, we will choose the Random Forest Model.
Below is a chuck of code that can be used to automate the reports. In order to automate this project, the first thing we do is build a set of parameters. These parameters match up with the column names from the full news dataset. The program with read the parameter and subset the data down to only values with the specified news channel name that is in the parameter.
To automate the project for all of the different news channels, simply execute the code chunk below directly to the console. Separate .md files will then be created for each news channel type.
#Add column names
columnNames <- data.frame("lifestyle","entertainment","bus","socmed","tech","world")
#Create filenames
output_file<-paste0(columnNames,".md")
#create a list for each column name
params = lapply(columnNames, FUN = function(x){list(columnNames = x)})
#put into a data frame
reports<-tibble(output_file,params)
#Render Code
apply(reports, MARGIN=1,FUN=function(x)
{
rmarkdown::render(input="Project2.Rmd",
output_format="github_document",
output_file=x[[1]],
params=x[[2]],
output_options = list(html_preview= FALSE,toc=TRUE,toc_depth=2,toc_float=TRUE)
)
}
)