This is a group project for the Data Analysis for Business class of LUISS University’s Management and Computer Science degree, a.y. 2024/2025 (Spring).
This project analyzes the Hitters dataset to predict
salary levels by converting continuous salaries into categorical
groups and modeling them with multinomial logistic
regression. The analysis includes data cleaning, exploratory
data analysis, feature evaluation, and model assessment to identify the
most relevant predictors of salary.
The dataset contains performance and salary information for Major League Baseball (MLB) players from the 1986 and 1987 seasons. Each row represents a player, with variables capturing both seasonal and career statistics.
Multinomial logistic regression is a generalization of binary logistic regression to multi-class problems. While binary logistic regression models the log-odds of one binary outcome (e.g., success/failure), multinomial regression models the log-odds of each outcome category relative to a baseline class. This allows us to predict categorical outcomes with more than two levels, such as salary levels (e.g., Low, Medium, High).
To do this, we select one class (say, class \(K\)) to serve as the baseline, and model the probabilities of the other \(K - 1\) classes as follows:
\[ \Pr(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p}}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p}} \quad \text{for } k = 1, \dots, K-1 \]
And for the baseline class \(K\):
\[ \Pr(Y = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p}} \]
We can also express the log-odds of any class \(k\) relative to the baseline class as:
\[ \log \left( \frac{\Pr(Y = k \mid X = x)}{\Pr(Y = K \mid X = x)} \right) = \beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p \]
This shows that the log-odds of any class compared to the baseline are linear functions of the predictors. The choice of baseline affects the coefficients but not the predicted class probabilities. Therefore, interpretation of coefficients must be done carefully and in the context of the chosen reference category.
This formulation is the foundation of the multinomial logistic regression model and allows us to interpret the effect of each feature on the likelihood of class membership.
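To make these formulas concrete, the sketch below uses invented coefficients for K = 3 classes and two predictors (not estimates from our data) and converts the linear predictors of the non-baseline classes into class probabilities, exactly as in the equations above:

```r
#toy coefficients for K = 3 classes; class 3 is the baseline (linear predictor 0)
x <- c(1.2, -0.5)                      #two predictor values
beta <- rbind(c(0.3, 0.8, -0.4),       #class 1: intercept, beta1, beta2
              c(-0.2, 0.1, 0.6))       #class 2: intercept, beta1, beta2

eta <- as.vector(beta %*% c(1, x))     #linear predictors for classes 1..K-1
probs <- c(exp(eta), 1) / (1 + sum(exp(eta)))  #the three formulas above
names(probs) <- c("class1", "class2", "baseline")
probs
sum(probs)                             #always exactly 1
```

Note how the baseline class needs no coefficients of its own: its numerator is fixed at 1, and all other classes are expressed relative to it.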
In this section we proceed with our analysis by importing the
"Hitters.csv" dataset and performing any necessary
preprocessing.
#imported the data
hitters <- read.csv("/Users/yaseminates/Desktop/Hitters.csv")
#convert the dataset to a standard data frame to ensure compatibility with functions
# (although R typically reads a CSV as a data frame automatically)
hitters <- as.data.frame(hitters)
#viewed the structure of the dataframe
str(hitters)
## 'data.frame': 317 obs. of 20 variables:
## $ AtBat : int 293 315 479 496 321 594 185 298 323 574 ...
## $ Hits : int 66 81 130 141 87 169 37 73 81 159 ...
## $ HmRun : int 1 7 18 20 10 4 1 0 6 21 ...
## $ Runs : int 30 24 66 65 39 74 23 24 26 107 ...
## $ RBI : int 29 38 72 78 42 51 8 24 32 75 ...
## $ Walks : int 14 39 76 37 30 35 21 7 8 59 ...
## $ Years : int 1 14 3 11 2 11 2 3 2 10 ...
## $ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 4631 ...
## $ CHits : int 66 835 457 1575 101 1133 42 108 86 1300 ...
## $ CHmRun : int 1 69 63 225 12 19 1 0 6 90 ...
## $ CRuns : int 30 321 224 828 48 501 30 41 32 702 ...
## $ CRBI : int 29 414 266 838 46 336 9 37 34 504 ...
## $ CWalks : int 14 375 263 354 33 194 24 12 8 488 ...
## $ League : chr "A" "N" "A" "N" ...
## $ Division : chr "E" "W" "W" "E" ...
## $ PutOuts : int 446 632 880 200 805 282 76 121 143 238 ...
## $ Assists : int 33 43 82 11 40 421 127 283 290 445 ...
## $ Errors : int 20 10 14 3 4 25 7 9 19 22 ...
## $ Salary : num NA 475 480 500 91.5 ...
## $ NewLeague: chr "A" "N" "A" "N" ...
#most column classes have been assigned correctly, except "League", "Division" and "NewLeague", which should be factors but were read as characters
#manually reassigned above classes
hitters$League <- as.factor(hitters$League)
hitters$Division <- as.factor(hitters$Division)
hitters$NewLeague <- as.factor(hitters$NewLeague)
With the code below we check the dimensions of the data before we alter any values. Originally the dataset has 317 rows x 20 columns.
#checked dimensions: rows x columns
dim(hitters)
## [1] 317 20
We view the first rows of the dataset to better grasp the structure and familiarize ourselves with the values.
#viewed first few rows
head(hitters)
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks
## 1 293 66 1 30 29 14 1 293 66 1 30 29 14
## 2 315 81 7 24 38 39 14 3449 835 69 321 414 375
## 3 479 130 18 66 72 76 3 1624 457 63 224 266 263
## 4 496 141 20 65 78 37 11 5628 1575 225 828 838 354
## 5 321 87 10 39 42 30 2 396 101 12 48 46 33
## 6 594 169 4 74 51 35 11 4408 1133 19 501 336 194
## League Division PutOuts Assists Errors Salary NewLeague
## 1 A E 446 33 20 NA A
## 2 N W 632 43 10 475.0 N
## 3 A W 880 82 14 480.0 A
## 4 N E 200 11 3 500.0 N
## 5 N E 805 40 4 91.5 N
## 6 A W 282 421 25 750.0 A
Checking for duplicates, we see that there are none.
anyDuplicated(hitters) #we can see there are no duplicates
## [1] 0
#since the output = 0
Here we check how many NA values are in each column. We can see that only our target has missing values, thus we drop those rows instead of imputing with the mean, in order not to risk introducing bias.
colSums(is.na(hitters))
## AtBat Hits HmRun Runs RBI Walks Years CAtBat
## 0 0 0 0 0 0 0 0
## CHits CHmRun CRuns CRBI CWalks League Division PutOuts
## 0 0 0 0 0 0 0 0
## Assists Errors Salary NewLeague
## 0 0 58 0
Now we omit the missing values in the response variable, and check that they have been successfully deleted.
hitters <- na.omit(hitters)
sum(is.na(hitters)) #if the output =0 there are no missing values
## [1] 0
Let’s confirm the new dimensions. We now have 259 rows after deleting the 58 rows where the response variable is missing (the columns remain 20 thus far).
#confirmed new dimensions
dim(hitters)
## [1] 259 20
Checking a few key statistics of the dataframe we find that:
Career stats such as CHits, CRuns, etc. have extremely large ranges (compare Years, 1–24, with CAtBat, 19–14,053), which could mean the dataset includes both rookie players and “seasoned” players.
The Years feature (1–24 years) confirms that player experience varies widely.
Errors has a much lower maximum than Hits (32 and 238 respectively). This indicates the players are highly skilled.
The salary distribution is heavily right-skewed, meaning the majority of players earn a “lower-end” salary, with the difference caused by rare instances of high salaries; the maximum salary is $2.46M.
summary(hitters)
## AtBat Hits HmRun Runs
## Min. : 19.0 Min. : 1.0 Min. : 0.00 Min. : 0.00
## 1st Qu.:282.5 1st Qu.: 71.5 1st Qu.: 5.00 1st Qu.: 33.50
## Median :413.0 Median :103.0 Median : 9.00 Median : 52.00
## Mean :403.8 Mean :107.9 Mean :11.58 Mean : 54.72
## 3rd Qu.:527.0 3rd Qu.:141.5 3rd Qu.:18.00 3rd Qu.: 73.00
## Max. :687.0 Max. :238.0 Max. :40.00 Max. :130.00
## RBI Walks Years CAtBat
## Min. : 0.00 Min. : 0.00 Min. : 1.000 Min. : 19.0
## 1st Qu.: 30.00 1st Qu.: 23.00 1st Qu.: 4.000 1st Qu.: 842.5
## Median : 47.00 Median : 37.00 Median : 6.000 Median : 1928.0
## Mean : 51.31 Mean : 40.77 Mean : 7.266 Mean : 2645.4
## 3rd Qu.: 71.00 3rd Qu.: 56.50 3rd Qu.:10.000 3rd Qu.: 3849.5
## Max. :121.00 Max. :105.00 Max. :24.000 Max. :14053.0
## CHits CHmRun CRuns CRBI
## Min. : 4.0 Min. : 0.0 Min. : 2.0 Min. : 3.0
## 1st Qu.: 212.0 1st Qu.: 14.5 1st Qu.: 105.5 1st Qu.: 95.0
## Median : 510.0 Median : 39.0 Median : 249.0 Median : 227.0
## Mean : 719.4 Mean : 68.3 Mean : 359.2 Mean : 327.6
## 3rd Qu.:1023.0 3rd Qu.: 92.0 3rd Qu.: 493.0 3rd Qu.: 420.5
## Max. :4256.0 Max. :548.0 Max. :2165.0 Max. :1659.0
## CWalks League Division PutOuts Assists
## Min. : 1.0 A:137 E:128 Min. : 0.0 Min. : 0.0
## 1st Qu.: 71.0 N:122 W:131 1st Qu.: 113.5 1st Qu.: 8.0
## Median : 174.0 Median : 224.0 Median : 46.0
## Mean : 256.6 Mean : 292.2 Mean :120.5
## 3rd Qu.: 323.0 3rd Qu.: 325.0 3rd Qu.:201.5
## Max. :1566.0 Max. :1377.0 Max. :492.0
## Errors Salary NewLeague
## Min. : 0.000 Min. : 67.5 A:139
## 1st Qu.: 3.000 1st Qu.: 190.0 N:120
## Median : 7.000 Median : 420.0
## Mean : 8.672 Mean : 532.8
## 3rd Qu.:13.000 3rd Qu.: 750.0
## Max. :32.000 Max. :2460.0
To avoid placing undue emphasis and priority on larger-scaled features over the others, this difference in scale needs to be addressed.
We do not normalize, in order not to compress and lose the relative distances within the features.
Thus, we standardize.
\[ Z = \frac{X - \mu}{\sigma} \]
Where: - \(X\) is the original value, - \(\mu\) is the mean, and - \(\sigma\) is the standard deviation of the feature.
We can now easily interpret the coefficients, since each one shows how much “Salary” changes with a one-standard-deviation change in the corresponding feature. Otherwise, the coefficients would appear disproportionately “important” for features with larger ranges.
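As a quick sanity check, the z-score formula can be computed by hand and compared against R’s built-in scale(); the toy vector below is invented for illustration:

```r
#toy values on a CAtBat-like scale
x <- c(19, 293, 1624, 5628, 14053)
z_manual <- (x - mean(x)) / sd(x)  #the formula above
z_scale <- as.numeric(scale(x))    #R's built-in standardization
all.equal(z_manual, z_scale)       #TRUE: identical results
mean(z_manual); sd(z_manual)       #mean ~0, sd exactly 1
```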
After isolating the numerical features and standardizing, we plot the
distributions before and after standardization for two example features
(Years and CAtBat).
We see that the original distributions and relative shapes of the data have been preserved by standardization.
#excluding categorical features, of course
hitters_std <- hitters
numerical_features <- sapply(hitters, is.numeric)
#standardized...
hitters_std[numerical_features] <- scale(hitters[numerical_features])
#we used ChatGPT to help create and put the graphs in a grid
plot_years_original <- ggplot(hitters, aes(x = Years)) +
geom_histogram(bins = 20, fill = "skyblue", color = "black", alpha = 0.7) +
ggtitle("Original Distribution of Years") +
theme_minimal()
plot_years_standardized <- ggplot(hitters_std, aes(x = Years)) +
geom_histogram(bins = 20, fill = "lightgreen", color = "black", alpha = 0.7) +
ggtitle("Standardized Distribution of Years") +
theme_minimal()
plot_catbat_original <- ggplot(hitters, aes(x = CAtBat)) +
geom_histogram(bins = 20, fill = "skyblue", color = "black", alpha = 0.7) +
ggtitle("Original Distribution of CAtBat") +
theme_minimal()
plot_catbat_standardized <- ggplot(hitters_std, aes(x = CAtBat)) +
geom_histogram(bins = 20, fill = "lightgreen", color = "black", alpha = 0.7) +
ggtitle("Standardized Distribution of CAtBat") +
theme_minimal()
grid.arrange(plot_years_original, plot_years_standardized,
plot_catbat_original, plot_catbat_standardized,
ncol = 2)
We check the distribution of our target variable Salary
to determine how to categorize it. We include the mean and median
statistics in the plot as well.
It can be seen that the distribution is heavily right-skewed, meaning lower incomes are much more common than higher ones. However, do not be fooled by the word “Low” here. To comment further, we must remember that the “Hitters.csv” dataset is based on the “salaries of Major league baseball players for the years 1986–87” (Kaggle).
Per-capita income in the USA in 1987 was about $12.28 thousand, with median family income around $30.85 thousand (US Census). Compared to the minimum of $67.5 thousand for an MLB player, even the so-called “Low” salaries categorized below are only relatively low; they are remarkably high compared to the overall national average.
#check distribution of salary to make informed decisions concerning amount of categories and cut-offs
#mean and median
mean_salary <- mean(hitters$Salary)
median_salary <- median(hitters$Salary)
#we used ChatGPT to ensure clean and correct graphing of the data
#dataframe to add legend entries
line_df <- data.frame(
label = c("Mean", "Median"),
value = c(mean_salary, median_salary)
)
#visualize Salary Distribution with legend lines
salary_plot <- ggplot(hitters, aes(x = Salary)) +
geom_histogram(bins = 60, fill = "skyblue", color = "black", alpha = 0.7) +
geom_vline(data = line_df, aes(xintercept = value, linetype = label, color = label), linewidth = 1) +
scale_color_manual(values = c("Mean" = "red", "Median" = "blue")) +
scale_linetype_manual(values = c("Mean" = "dashed", "Median" = "dotted")) +
labs(
title = "Salary Distribution with Mean and Median",
x = "Salary (in $1000s)",
y = "Count",
color = "Statistic",
linetype = "Statistic"
) +
theme_minimal()
print(salary_plot)
summary(hitters$Salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 67.5 190.0 420.0 532.8 750.0 2460.0
#we see that it's heavily right-skewed: lower incomes are much more common than higher ones
To apply a multinomial logistic regression model, we need a
categorical response variable. However, the original
Salary variable is continuous. Instead of manually creating
arbitrary bins (e.g., by tertiles), we used a data-driven
approach based on clustering.
Our initial approach was K-Means clustering, as it
was already familiar to us. Although the clusters seemed
visually appealing and intuitively correct, the cluster assigned the
“Medium” label was too often wrongly flagged
as “Low” by the multinomial model. This led to an overall
decrease in model performance. This likely happened because K-Means
assumes all clusters are roughly spherical and evenly sized, which does not suit
our heavily right-skewed Salary data. We therefore decided to
explore different clustering approaches. For methods beyond our own
experience, we used ChatGPT to brainstorm.
This led to us applying a Gaussian Mixture Model
(GMM) to the standardized salary data using the
mclust package. GMM assumes the data is generated from a
mixture of several Gaussian distributions and assigns each observation
to the most likely cluster.
This approach handles clusters of different shapes and spreads, which better suits the target variable’s distribution. We selected 4 clusters, labeled as:
Low_Salary, Medium_Salary, High_Salary, Very_High_Salary.
These categories were then assigned to both the original and
standardized datasets as a new factor variable:
SalaryLevel_GMM.
We also visualized the salary distribution with the GMM clusters overlaid, along with vertical reference lines for the mean and median salary. This visualization helps confirm that the GMM clusters align well with meaningful variations in salary, and supports their use as a categorical response variable in our classification model.
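The assignment rule GMM uses can be sketched in base R: given fitted component weights, means and standard deviations, each observation goes to the component with the highest weighted density. The parameters below are invented for illustration, not taken from our fitted model:

```r
#two hypothetical Gaussian components: weights, means, standard deviations
w <- c(0.7, 0.3); mu <- c(-0.5, 1.5); sdev <- c(0.4, 0.8)
x <- c(-0.6, 0.2, 1.8)  #three example standardized salaries
#weighted density of each point under each component
dens <- sapply(1:2, function(k) w[k] * dnorm(x, mu[k], sdev[k]))
cluster <- max.col(dens) #most likely component per point
cluster                  #the third point falls in the second component
```

Unlike K-Means, each component can have its own spread, which is what lets GMM follow a skewed distribution more faithfully.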
#fitting a Gaussian Mixture Model (GMM)
set.seed(18062003)
#seed selected from the closest birthday to the specified one
#Guia Ludovica Basso - 18-06-2003
gmm_model <- Mclust(hitters_std$Salary)
#printed the GMM summary
summary(gmm_model)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust V (univariate, unequal variance) model with 4 components:
##
## log-likelihood n df BIC ICL
## -268.9236 259 11 -598.9723 -679.2617
##
## Clustering table:
## 1 2 3 4
## 37 71 126 25
#assigned GMM cluster labels to original (hitters) and standardized data (hitters_std)
hitters$SalaryLevel_GMM <- factor(
gmm_model$classification,
levels = 1:4,
labels = c("Low_Salary", "Medium_Salary", "High_Salary", "Very_High_Salary")
)
hitters_std$SalaryLevel_GMM <- factor(
gmm_model$classification,
levels = 1:4,
labels = c("Low_Salary", "Medium_Salary", "High_Salary", "Very_High_Salary")
)
#visualized salary distribution colored by GMM clusters
#we used clear and blunt cutoffs to demonstrate the clusters
#through trial and error we chose 53 as the number of bins with the best aesthetics
ggplot(hitters, aes(x = Salary, fill = SalaryLevel_GMM)) +
geom_histogram(bins = 53, color = "black", alpha = 0.7) +
geom_vline(data = line_df, aes(xintercept = value, linetype = label, color = label), linewidth = 1) +
scale_fill_manual(values = c("skyblue", "forestgreen", "khaki", "red")) +
scale_color_manual(values = c("Mean" = "red", "Median" = "blue")) +
scale_linetype_manual(values = c("Mean" = "dashed", "Median" = "dotted")) +
labs(
title = "Salary Distribution by GMM Clusters with Mean and Median",
x = "Salary (in $1000s)",
y = "Count",
fill = "Cluster",
color = "Statistic",
linetype = "Statistic"
) +
theme_minimal()
We performed the chi-square test to determine
whether there is a statistically significant association between the
newly created SalaryLevel_GMM and the other categorical
variables.
We notice that Division and SalaryLevel_GMM
are significantly associated (p = 0.0086 < 0.05), meaning that
the distribution of salary categories differs across
divisions.
That is not the case for League and NewLeague, whose p-values are high; for these two we fail to reject the null hypothesis, so salary level appears to be independent of league affiliation in this dataset.
categorical_vars <- c("League", "Division", "NewLeague")
chi_results <- lapply(categorical_vars, function(var) {
test <- chisq.test(table(hitters_std[[var]], hitters_std$SalaryLevel_GMM))
data.frame(
Variable = var,
P_Value = test$p.value,
Significant = ifelse(test$p.value < 0.05, "Yes", "No")
)
})
chi_df <- do.call(rbind, chi_results)
print(chi_df)
## Variable P_Value Significant
## 1 League 0.779242173 No
## 2 Division 0.008571562 Yes
## 3 NewLeague 0.904194694 No
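For intuition, the chi-square statistic compares observed cell counts with the counts expected under independence; a hand-rolled version on a toy 2x2 table (invented counts, not the Hitters data) matches chisq.test():

```r
#toy contingency table of Division vs a binary salary level (invented counts)
tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(Division = c("E", "W"),
                              Salary = c("Low", "High")))
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab) #counts under independence
chi_sq <- sum((tab - expected)^2 / expected)
chi_sq
chisq.test(tab, correct = FALSE)$statistic #same value (Yates correction off)
```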
We visualize the relationship between SalaryLevel_GMM
and Division with a proportional stacked bar plot.
In the graph below it is extremely noticeable that the salary distribution is clearly not the same between East and West divisions, thus reinforcing the results from the Chi-Square test.
While Medium and High salaries have similar proportions across the two divisions (although High salaries occur about 5 percentage points more often in the West), the East division appears to have relatively fewer Low salaries and more Very High salaries. It is hard to say from the graphs alone, but the pattern may be due to the East division having more valuable players, while the West division might have younger or less experienced players, less investment, or lower performance metrics.
#proportions for each "SalaryLevel_GMM" within each "Division"
div_data <- hitters_std %>%
group_by(Division, SalaryLevel_GMM) %>%
summarise(count = n(), .groups = "drop") %>% #suppresses the warning
group_by(Division) %>%
mutate(
prop = count / sum(count),
label = scales::percent(prop, accuracy = 1)
)
#We used ChatGPT to create a cleaner graph
ggplot(div_data, aes(x = Division, y = prop, fill = SalaryLevel_GMM)) +
geom_bar(stat = "identity") +
geom_text(aes(label = label), position = position_stack(vjust = 0.5), size = 4, color = "black") +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Proportion of Salary Levels by Division (with Percentages)",
x = "Division",
y = "Proportion",
fill = "Salary Level"
) +
theme_minimal()
We used ANOVA to identify numeric features that
differ significantly across salary levels because it supports
comparisons across multiple groups, unlike a t-test
which is limited to two. A stricter significance level of
0.01 was chosen to reduce false positives and focus only on
the most confidently predictive variables.
We can see that, despite the stricter significance level, 14 out of 16 numeric features show statistically significant differences across salary levels.
numeric_features <- names(hitters_std)[sapply(hitters_std, is.numeric)]
numeric_features <- setdiff(numeric_features, c("Salary"))
anova_results <- data.frame(Variable = character(), P_Value = numeric())
#after researching Stack Overflow and ChatGPT, we decided that ANOVA
#is more appropriate than a t-test for comparing numeric variables
#across multiple salary levels, since t-tests are limited to two-groups
#ANOVA allowed us to test for differences in means across all four categories
#ANOVA to collect p-values
for (var in numeric_features) {
formula <- as.formula(paste(var, "~ SalaryLevel_GMM"))
model <- aov(formula, data = hitters_std)
p_val <- summary(model)[[1]][["Pr(>F)"]][1]
anova_results <- rbind(anova_results, data.frame(Variable = var, P_Value = p_val))
}
#our threshold
significance_level <- 0.01
significant_vars <- anova_results %>%
filter(P_Value < significance_level) %>%
arrange(P_Value)
num_significant <- nrow(significant_vars)
num_total <- length(numeric_features)
cat("Number of significant numeric features (p <", significance_level, "):", num_significant, "out of", num_total, "\n\n")
## Number of significant numeric features (p < 0.01 ): 14 out of 16
cat("Significant features (ordered by significance):\n")
## Significant features (ordered by significance):
for (i in 1:nrow(significant_vars)) {
cat(paste0(i, ". ", significant_vars$Variable[i], " (p = ", signif(significant_vars$P_Value[i], 4), ")\n"))
}
## 1. CHits (p = 1.216e-21)
## 2. CRuns (p = 1.676e-21)
## 3. CAtBat (p = 3.799e-21)
## 4. CRBI (p = 2.119e-19)
## 5. Years (p = 3.565e-18)
## 6. CWalks (p = 5.9e-16)
## 7. CHmRun (p = 3.584e-14)
## 8. RBI (p = 6.48e-12)
## 9. Runs (p = 1.449e-11)
## 10. Hits (p = 1.476e-11)
## 11. Walks (p = 1.972e-11)
## 12. AtBat (p = 2.46e-09)
## 13. HmRun (p = 5.337e-08)
## 14. PutOuts (p = 4.827e-06)
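The loop above boils down to one one-way ANOVA per feature; on a small synthetic dataset (invented group means, not the Hitters data), the same fit-and-extract pattern looks like this:

```r
set.seed(7)
#synthetic data: three groups whose means clearly differ
toy <- data.frame(
  group = factor(rep(c("Low", "Medium", "High"), each = 30)),
  value = c(rnorm(30, 0), rnorm(30, 1), rnorm(30, 2))
)
fit <- aov(value ~ group, data = toy)
p_val <- summary(fit)[[1]][["Pr(>F)"]][1] #same extraction as in our loop
p_val < 0.01 #TRUE: the group means differ significantly
```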
This heatmap displays the mean values of numeric features grouped by
salary level (SalaryLevel_GMM). Each tile’s color intensity
reflects how high the average is — darker blue = higher mean value.
Features that darken progressively from low to high salary levels
(e.g., CHits, CRuns, CAtBat,
CRBI) are likely the most informative, as they show clear,
consistent increases across salary groups. This visual pattern aligns
closely with the ANOVA results, reinforcing the predictive strength of
these features in distinguishing salary levels.
group_means <- hitters_std %>%
dplyr::select(-Salary) %>% #removed "Salary" to not over-saturate the comparison
group_by(SalaryLevel_GMM) %>%
summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
group_means_melted <- melt(group_means, id.vars = "SalaryLevel_GMM")
ggplot(group_means_melted, aes(x = variable, y = SalaryLevel_GMM, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Mean of Numeric Features by Salary Level",
x = "Feature", y = "Salary Level") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Boxplots of the top five most significant numeric features
(CHits, CRuns, CAtBat,
CRBI and Years) show clear and consistent
upward trends across salary levels. Players in higher salary groups tend
to have more career hits, runs, at-bats, RBIs, and years of
experience.
This supports the ANOVA results and highlights these features as strong predictors of salary classification.
#top-5 significant features
top5_features <- c("CHits", "CRuns", "CAtBat", "CRBI", "Years")
#boxplots
for (var in top5_features) {
p <- ggplot(hitters_std, aes(x = SalaryLevel_GMM, y = .data[[var]], fill = SalaryLevel_GMM)) +
geom_boxplot(alpha = 0.7) +
labs(
title = paste("Distribution of", var, "by Salary Level"),
x = "Salary Level",
y = var
) +
theme_minimal()
print(p)
}
We decided to also compute Mutual Information Scores to measure the dependency between each numeric feature and salary level. Unlike ANOVA, MI captures both linear and non-linear relationships, helping us supplement traditional metrics.
The plot shows that variables such as CAtBat,
CHits, CRBI, CRuns, and
CWalks provide the most information about salary
classification. Of course, it is important to note that these strong
associations are also caused by the nature of the data collected.
Intuitively, CHits, the number of hits in a
player’s career, will increase in direct relation to
CAtBat, the number of times the player has been at bat. The
same goes for CRuns, CRBI and
CHits, since they are the result of the previously mentioned
feature. This can also be seen in the Correlation
Heatmap.
In short, it should come as no surprise that these features are strongly correlated as they are “reliant” on one another. Later, we will use these strong correlations to create a custom multinomial model where only said predictors take part.
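The mechanical dependence described above is easy to simulate: if hits are generated as a binomial outcome of at-bats, the two columns correlate strongly by construction (purely synthetic numbers, with an assumed 27% hit probability):

```r
set.seed(42)
at_bat <- round(runif(100, 50, 600))     #simulated times at bat
hits <- rbinom(100, at_bat, prob = 0.27) #hits as a binomial outcome of at-bats
cor(at_bat, hits)                        #close to 1 by construction
```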
#target variable
mi_data <- hitters_std[, c("SalaryLevel_GMM", numeric_features)]
#computed mutual information
mi_scores <- information_gain(SalaryLevel_GMM ~ ., data = mi_data)
#descending order in graph
mi_df <- mi_scores %>%
arrange(desc(importance))
ggplot(mi_df, aes(x = reorder(attributes, importance), y = importance)) +
geom_bar(stat = "identity", fill = "darkorange") +
coord_flip() +
labs(
title = "Mutual Information Scores ",
x = "Variable",
y = "Mutual Information"
) +
theme_minimal()
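For intuition, mutual information can be computed directly from a joint probability table as the expected log-ratio of joint to marginal probabilities; the 2x2 table below uses invented frequencies, not values from our data:

```r
#toy joint distribution of a binary feature and a binary salary level
joint <- matrix(c(0.4, 0.1,
                  0.1, 0.4), nrow = 2, byrow = TRUE)
px <- rowSums(joint); py <- colSums(joint)    #marginal distributions
mi <- sum(joint * log(joint / outer(px, py))) #mutual information in nats
mi #positive: the two variables share information
```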
numeric_vars <- hitters_std %>%
dplyr::select(where(is.numeric)) %>%
dplyr::select(-Salary)
cor_matrix <- cor(hitters %>% dplyr::select(where(is.numeric)))
melted <- melt(cor_matrix)
ggplot(melted, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "forestgreen", high = "orange", mid = "lightyellow", midpoint = 0) +
labs(title = "Correlation Heatmap of Numeric Variables") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
As mentioned above, some features are related to each other because one is the “result” or “alternative” of another, like how a player at bat can hit or score a home run.
We use this information to create a “score” for the players and see if their performance score is also directly correlated with their salary level. From previous inference, it most likely will be. As plotted below, it indeed is.
To create the score, we combined these “positive outcomes”:
Hits (which already includes HmRun)
together with offensive contributions while at bat
such as Walks, Runs and RBI. For simplicity we
weigh them equally and do not assign importance. We
normalize the scores by times at bat. As mentioned above, and as expected
from the MI Scores, the scores increase directly with
SalaryLevel_GMM.
#dummy dataset to not alter the original
hitters_vis <- hitters %>%
dplyr::select(-Salary) %>% #no Salary data used, only SalaryLevel
mutate(
BatAvg = Hits / AtBat,
CareerBatAvg = CHits / CAtBat,
#originally we used this but decided to create a custom score
#we do not use "HmRun" in the numerator since it is included in Hits
#this is apparently how a Hit is defined in baseball
SuccessScore = (Hits + Walks + RBI + Runs) / AtBat,
SuccessScore_C = (CHits + CWalks + CRBI + CRuns) / CAtBat,
)
hitters_vis <- hitters_vis %>% filter(Years > 0) #precautionary filter; Years is at least 1 in this data
#plot grids were created with the help of ChatGPT
ggplot(hitters_vis, aes(x = SalaryLevel_GMM, y = SuccessScore)) +
geom_boxplot(fill = "lightblue") +
labs(title = "1987 Success Score", x = "Salary Category", y = "Success Score")
ggplot(hitters_vis, aes(x = SalaryLevel_GMM, y = SuccessScore_C)) +
geom_boxplot(fill = "orange") +
labs(title = "Over-Career Success Score", x = "Salary Category", y = "Success Score (Career)")
Our model saw a notable increase in prediction accuracy when switching from a 70/30 split to an 80/20 split. Since we have very little data, we decided 80/20 is the better choice.
We split the data by first creating a partition and then removing the
target variable Salary from the train (and test) set. This
is to make sure there are no data leaks which might
lead to an inflated accuracy rate.
Another point to note is that we use
createDataPartition() to create a
stratified sample, instead of sample(),
which produces a completely random, unstratified
split. We tried both and saw better performance with the former, as
the latter does not account for the uneven distribution of the target
variable. This is why the section below differs from the lab
exercises, which use the latter.
#making sure SalaryLevel_GMM is a factor
hitters_std$SalaryLevel_GMM <- factor(hitters_std$SalaryLevel_GMM)
set.seed(18062003) #same as before
train_index <- createDataPartition(hitters_std$SalaryLevel_GMM, p = 0.8, list = FALSE) #80/20
train_split <- hitters_std[train_index, ]
test_split <- hitters_std[-train_index, ]
#remove "Salary" from predictors, so we DO NOT train on Salary
train_data <- dplyr::select(train_split, -Salary)
test_data <- dplyr::select(test_split, -Salary)
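What stratification buys us can be shown with a base-R sketch on a toy imbalanced factor: sampling within each class preserves the class proportions in both splits (synthetic labels, not the Hitters data):

```r
set.seed(1)
#toy imbalanced target: 70% "Low", 30% "High"
y <- factor(rep(c("Low", "High"), times = c(70, 30)))
#stratified 80/20 split done by hand: sample indices within each class
train_idx <- unlist(lapply(split(seq_along(y), y),
                           function(idx) sample(idx, floor(0.8 * length(idx)))))
prop.table(table(y[train_idx]))  #same 70/30 mix as the full data
prop.table(table(y[-train_idx])) #and so is the held-out 20%
```

createDataPartition() does essentially this per-class sampling for us, which is why the class mix of train and test stays close to that of the full dataset.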
We first look at the simple multinomial model from the
nnet package as seen in the lab sessions. We apply the
model as is and do not indulge in any feature engineering. This is to
compare its performance with a proposed “better” model later on.
#nnet multinomial logistic regression (from lab sessions)
multinom_fit <- multinom(
formula = SalaryLevel_GMM ~ .,
data = train_data,
trace = FALSE #suppress model output
)
We continue with Cross-Validating the model.
In order to determine our assigned CV model we use the calculation from the project description:
Birthdays of Team Members:
set.seed(18062003) #Guia's birthday
cv_methods = c("1. Vanilla validation set", "2. LOO-CV", "3. K-fold CV (with K = 5)", "4. K-fold CV (with K = 10)")
sample(cv_methods, 1)
## [1] "4. K-fold CV (with K = 10)"
Our assigned model is the K-fold CV (with K = 10). Since
our group consists of 3 students, we add the
Leave One Out Cross Validation (LOO-CV) to this.
We inspected the code provided in the lab sessions
for this; the comments in quotes (“) are taken directly from those sessions. We
later encountered the train() function from the
caret package and proceeded with it rather than
cv.glm() from the lab sessions (boot package).
Since we are using a multinomial model, the former allowed more
flexibility and efficiency, as it automatically calculates the
performance metrics; the latter is a better fit for simpler models.
For the 10-Fold-CV method, our “simple” multinomial model gets a 61% accuracy estimate.
set.seed(18062003) #Guia's birthday
#=============================================
# "10-Fold Cross-Validation"
#=============================================
cv_kfold <- train(
SalaryLevel_GMM ~ .,
data = train_data,
method = "nnet",
trControl = trainControl(method = "cv", number = 10), #10-fold CV
trace = FALSE #suppress output
)
# "Display the results"
cat("===== 10-FOLD CROSS-VALIDATION =====\n")
## ===== 10-FOLD CROSS-VALIDATION =====
cat("Accuracy:", round(cv_kfold$results$Accuracy[1], 3), "\n")
## Accuracy: 0.611
cat("Kappa: ", round(cv_kfold$results$Kappa[1], 3), "\n\n")
## Kappa: 0.355
Again, we follow the lab sessions and proceed with the
caret package’s train() as before for the
Leave-One-Out calculations. We get a 62.5% accuracy
estimate for the “simple” multinomial model.
set.seed(18062003) #Guia's birthday
#=============================================
# "Leave-One-Out Cross-Validation (LOOCV)"
#=============================================
loocv_model <- train(
SalaryLevel_GMM ~ .,
data = train_data,
method = "nnet",
trControl = trainControl(
method = "LOOCV",
verboseIter = FALSE #suppresses output
),
tuneLength = 1,
trace = FALSE #suppresses output
)
# "Display the results"
cat("===== LOOCV Results =====\n")
## ===== LOOCV Results =====
cat("Accuracy: ", round(loocv_model$results$Accuracy, 3), "\n")
## Accuracy: 0.625
cat("Error: ", round(1 - loocv_model$results$Accuracy, 3), "\n")
## Error: 0.375
We move on to the predictions on the test_data to
determine the test error. The code below is the continuation of the
multinom_fit model above, from the lab sessions. We output
a confusion matrix to identify potential unusual
patterns. We did encounter some after using K-Means; however, for this
model (GMM) there seems to be no reason to deem the predictions
“suspiciously distributed”.
Our “basic” multinomial model achieves 66.7% accuracy on the test
data. It performs best on the Medium_Salary
and High_Salary classes, with balanced accuracies of roughly 78% and 75%
respectively.
set.seed(18062003) #Guia's birthday
#predictions of salary levels
predictions <- predict(multinom_fit, newdata = test_data)
confusionMatrix(predictions, test_data$SalaryLevel_GMM)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low_Salary Medium_Salary High_Salary Very_High_Salary
## Low_Salary 3 1 4 0
## Medium_Salary 3 9 0 0
## High_Salary 1 4 20 3
## Very_High_Salary 0 0 1 2
##
## Overall Statistics
##
## Accuracy : 0.6667
## 95% CI : (0.5208, 0.7924)
## No Information Rate : 0.4902
## P-Value [Acc > NIR] : 0.008304
##
## Kappa : 0.4783
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Low_Salary Class: Medium_Salary Class: High_Salary
## Sensitivity 0.42857 0.6429 0.8000
## Specificity 0.88636 0.9189 0.6923
## Pos Pred Value 0.37500 0.7500 0.7143
## Neg Pred Value 0.90698 0.8718 0.7826
## Prevalence 0.13725 0.2745 0.4902
## Detection Rate 0.05882 0.1765 0.3922
## Detection Prevalence 0.15686 0.2353 0.5490
## Balanced Accuracy 0.65747 0.7809 0.7462
## Class: Very_High_Salary
## Sensitivity 0.40000
## Specificity 0.97826
## Pos Pred Value 0.66667
## Neg Pred Value 0.93750
## Prevalence 0.09804
## Detection Rate 0.03922
## Detection Prevalence 0.05882
## Balanced Accuracy 0.68913
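As a side note, the “Balanced Accuracy” rows above are simply the mean of each class’s sensitivity and specificity. A quick base-R check, with the High_Salary values copied from the output above:

```r
# Balanced accuracy is the mean of sensitivity and specificity.
# Reproducing the High_Salary figure from the confusionMatrix output above:
sens <- 0.8000   # High_Salary sensitivity
spec <- 0.6923   # High_Salary specificity
balanced_acc <- (sens + spec) / 2
balanced_acc  # ~0.7462, matching the table
```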
As the Project Guidelines suggest: “Your goal now is to identify a better model (with lowest test error) for predicting the individual salary level”.
After inspecting the Mutual Information Scores and the Correlation Heatmap above, we fit a model with only the three strongest predictors (CAtBat, CHits, and CRBI). This improved accuracy to 74.5%, from the 66.7% above.
set.seed(18062003) #Guia's birthday
multinom_custom <- multinom(
formula = SalaryLevel_GMM ~ CAtBat + CHits + CRBI,
data = train_data,
trace = FALSE # suppress model output
)
predictions_custom <- predict(
multinom_custom,
newdata = test_data
)
accuracy_custom <- mean(
predictions_custom == test_data$SalaryLevel_GMM
)
print(paste("Accuracy: ", accuracy_custom))
## [1] "Accuracy: 0.745098039215686"
conf_matrix_custom <- table(
Predicted = predictions_custom,
Actual = test_data$SalaryLevel_GMM
)
print(conf_matrix_custom)
## Actual
## Predicted Low_Salary Medium_Salary High_Salary Very_High_Salary
## Low_Salary 5 1 0 0
## Medium_Salary 1 10 3 0
## High_Salary 1 3 22 4
## Very_High_Salary 0 0 0 1
A notable approach in our search for a better model was to fine-tune
the multinomial model via weight decay. This regularization approach is
specific to the nnet library’s multinomial model. We tried
a range of decay values, which penalize large weights, to see whether
this affects the test error.
Below is the confusion matrix of the so-called “Best Model”, chosen
after comparing the accuracies across all decay values. The winner is
decay = 0.1 with 76.5% accuracy, roughly a
ten-percentage-point improvement over the basic model’s 66.7%.
For each of the classes, the accuracies went up from:
Notes from the team:
set.seed(18062003) #Guia's birthday
#we found the preferred decay range to be 0-1
decay_values <- c(0.001, 0.01, 0.1, 0.2, 0.5, 1)
#placeholder variables
best_accuracy <- 0
best_decay <- NULL
best_model <- NULL
#try all decay values with the multinomial model (we had ChatGPT's help structuring this loop)
for (decay in decay_values) {
model_tuned <- multinom(SalaryLevel_GMM ~ ., data = train_data, decay = decay, maxit = 200, trace = FALSE)
#predict
predictions_tuned <- predict(model_tuned, test_data)
#calculate the accuracy
accuracy_tuned <- mean(predictions_tuned == test_data$SalaryLevel_GMM)
#update the best model if need be
if (accuracy_tuned > best_accuracy) {
best_accuracy <- accuracy_tuned
best_decay <- decay
best_model <- model_tuned
}
}
cat("Best decay: ", best_decay, "\n")
## Best decay: 0.1
cat("Best model accuracy: ", best_accuracy, "\n")
## Best model accuracy: 0.7647059
#confusion matrix for best model
best_predictions <- predict(best_model, test_data)
confusionMatrix(best_predictions, test_data$SalaryLevel_GMM)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low_Salary Medium_Salary High_Salary Very_High_Salary
## Low_Salary 3 0 0 0
## Medium_Salary 3 10 1 0
## High_Salary 1 4 24 3
## Very_High_Salary 0 0 0 2
##
## Overall Statistics
##
## Accuracy : 0.7647
## 95% CI : (0.6251, 0.8721)
## No Information Rate : 0.4902
## P-Value [Acc > NIR] : 5.69e-05
##
## Kappa : 0.6112
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Low_Salary Class: Medium_Salary Class: High_Salary
## Sensitivity 0.42857 0.7143 0.9600
## Specificity 1.00000 0.8919 0.6923
## Pos Pred Value 1.00000 0.7143 0.7500
## Neg Pred Value 0.91667 0.8919 0.9474
## Prevalence 0.13725 0.2745 0.4902
## Detection Rate 0.05882 0.1961 0.4706
## Detection Prevalence 0.05882 0.2745 0.6275
## Balanced Accuracy 0.71429 0.8031 0.8262
## Class: Very_High_Salary
## Sensitivity 0.40000
## Specificity 1.00000
## Pos Pred Value 1.00000
## Neg Pred Value 0.93878
## Prevalence 0.09804
## Detection Rate 0.03922
## Detection Prevalence 0.03922
## Balanced Accuracy 0.70000
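The shrinkage effect of decay can also be seen in isolation. The standalone sketch below (using nnet on the built-in iris data rather than our Hitters splits, purely for illustration) compares total coefficient magnitude with and without the penalty:

```r
library(nnet)  # recommended package, ships with R

set.seed(1)
# Fit the same multinomial model with and without weight decay (iris stand-in)
fit_plain <- multinom(Species ~ ., data = iris, decay = 0,   trace = FALSE)
fit_decay <- multinom(Species ~ ., data = iris, decay = 0.5, trace = FALSE)

# Decay penalizes large weights, so the penalized fit's coefficients shrink
sum(abs(coef(fit_plain)))  # large: the iris classes are nearly separable
sum(abs(coef(fit_decay)))  # much smaller under the penalty
```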
For model selection, the lab session material was not the best fit
for our case, since its procedure was manual and we have too many
predictors to fit one by one. We therefore explored the
step() function.
The accuracies observed were:
Note that we start from model_tuned, whose original
accuracy is 76%. Stepwise selection underperformed both the
tuned-decay model and the “full” basic multinomial model, because it
removes variables based on their individual contribution, without
considering their combined predictive value.
Decay, by contrast, retains all predictors and merely shrinks the
less important coefficients.
Among the stepwise methods, forward selection outperformed the backward and bidirectional approaches. This is most likely due to its nature: it starts with no predictors and adds variables one at a time, choosing whichever improves the model most at each step.
set.seed(18062003) #Guia's birthday
# Perform stepwise selection based on AIC
stepwise_model <- stepAIC(model_tuned, direction = "forward", trace = FALSE)
# Summary of the final stepwise model
summary(stepwise_model)
## Call:
## multinom(formula = SalaryLevel_GMM ~ AtBat + Hits + HmRun + Runs +
## RBI + Walks + Years + CAtBat + CHits + CHmRun + CRuns + CRBI +
## CWalks + League + Division + PutOuts + Assists + Errors +
## NewLeague, data = train_data, decay = decay, maxit = 200,
## trace = FALSE)
##
## Coefficients:
## (Intercept) AtBat Hits HmRun Runs
## Medium_Salary 0.7271399 0.1298582 -0.43196885 0.17278719 0.5697687
## High_Salary 1.5558027 0.4051974 0.01087635 -0.06832509 0.3761175
## Very_High_Salary -0.2588933 -0.5122017 0.44932472 0.44931744 0.4883106
## RBI Walks Years CAtBat CHits
## Medium_Salary 0.1424564 -0.32862555 1.2187542 -0.1373118 -0.4653378
## High_Salary 0.2000430 -0.09179562 1.1993321 0.8412425 0.3411292
## Very_High_Salary 0.1242066 0.43612228 -0.2254801 0.2184726 0.8962122
## CHmRun CRuns CRBI CWalks LeagueN
## Medium_Salary 0.3436008 -0.1511561 -0.08397638 0.3522352 0.5154460
## High_Salary -0.1140415 0.2062812 -0.09960108 0.3608455 0.6098983
## Very_High_Salary 0.1754726 0.7304229 0.82062099 -0.1375994 0.4202939
## DivisionW PutOuts Assists Errors NewLeagueN
## Medium_Salary 0.1874965 -0.073312348 -0.04043314 0.2801322 0.20419324
## High_Salary 0.4650167 0.007137003 0.05531573 -0.1652843 0.06970985
## Very_High_Salary -0.6996798 0.388381946 0.17127118 -0.1492845 0.21842993
##
## Std. Errors:
## (Intercept) AtBat Hits HmRun Runs RBI
## Medium_Salary 0.6680615 1.438293 1.844064 0.8752971 1.184131 1.057769
## High_Salary 0.6475175 1.483129 1.865591 0.8910819 1.222999 1.070216
## Very_High_Salary 0.8349693 2.019978 2.343022 1.0947852 1.581464 1.374834
## Walks Years CAtBat CHits CHmRun CRuns
## Medium_Salary 0.6272190 0.9838722 9.003397 10.83287 3.713881 5.942580
## High_Salary 0.6454262 1.0117035 8.691902 10.40037 3.598999 5.794863
## Very_High_Salary 0.8201610 1.5966868 9.846786 12.03086 3.919427 6.560208
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## Medium_Salary 6.320134 2.753149 0.8716349 0.4945288 0.3092338 0.4300894
## High_Salary 6.066963 2.666466 0.9393456 0.5348724 0.3209301 0.4465304
## Very_High_Salary 6.729349 2.851632 1.4818657 0.7987672 0.3652819 0.6060987
## Errors NewLeagueN
## Medium_Salary 0.3582669 0.8755576
## High_Salary 0.4055551 0.9488428
## Very_High_Salary 0.5903444 1.4793709
##
## Residual Deviance: 334.6608
## AIC: 454.6608
# After stepwise feature selection, use the model to predict on test data
predictions <- predict(stepwise_model, newdata = test_data)
#accuracy
confusion_matrix <- table(Predicted = predictions, Actual = test_data$SalaryLevel_GMM)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Model Accuracy: ", round(accuracy * 100, 2), "%"))
## [1] "Model Accuracy: 74.51 %"
We used the glmnet library from the lab sessions and
textbook: “An Introduction to Statistical Learning with Applications
in R, 2nd Ed.”, to apply L1 and L2 regularization. Note that this
slightly differs from the nnet approach, resulting in
different accuracies.
We observe 78.4% and 74.5% accuracy for
the L2 (Ridge) and L1 (Lasso) models respectively. The Ridge model
gives the highest accuracy yet, roughly a two-percentage-point
improvement over the nnet package’s
76.5%, as follows:
More notes from the team:
This approach seems to have the most impact on
Low_Salary predictions. The improvement is likely caused
by differences in the packages’ L2 (Ridge) implementations, but could
also have been affected by our use of only numerical predictors with
the glmnet package below.
set.seed(18062003) #Guia's birthday
train_data_new <- dplyr::select(train_split, -Salary, -SalaryLevel_GMM)
test_data_new <- dplyr::select(test_split, -Salary, -SalaryLevel_GMM)
train_data_numeric <- train_data_new %>% select_if(is.numeric)
test_data_numeric <- test_data_new %>% select_if(is.numeric)
#we only use the numeric data for simplicity
#they are the majority in number and "importance"
levels_train <- levels(train_split$SalaryLevel_GMM)
train_split$SalaryLevel_GMM <- factor(train_split$SalaryLevel_GMM, levels = levels_train)
#we prepare the train and test splits for the functions
test_split$SalaryLevel_GMM <- factor(test_split$SalaryLevel_GMM, levels = levels_train)
X_train <- train_data_numeric
y_train <- train_split$SalaryLevel_GMM # Target variable for training
X_test <- test_data_numeric
y_test <- test_split$SalaryLevel_GMM # Target variable for testing
X_train_matrix <- as.matrix(X_train)
X_test_matrix <- as.matrix(X_test)
#Ridge L2
ridge_model <- glmnet(X_train_matrix, y_train, family = "multinomial", alpha = 0)
ridge_cv <- cv.glmnet(X_train_matrix, y_train, family = "multinomial", alpha = 0)
#Lasso L1
lasso_model <- glmnet(X_train_matrix, y_train, family = "multinomial", alpha = 1)
lasso_cv <- cv.glmnet(X_train_matrix, y_train, family = "multinomial", alpha = 1)
#predictions as factors
ridge_pred <- predict(ridge_cv, newx = X_test_matrix, s = "lambda.min", type = "class")
ridge_pred <- factor(ridge_pred, levels = levels_train)
lasso_pred <- predict(lasso_cv, newx = X_test_matrix, s = "lambda.min", type = "class")
lasso_pred <- factor(lasso_pred, levels = levels_train)
confusion_ridge <- confusionMatrix(ridge_pred, factor(y_test))
confusion_lasso <- confusionMatrix(lasso_pred, factor(y_test))
print(confusion_ridge)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low_Salary Medium_Salary High_Salary Very_High_Salary
## Low_Salary 5 0 0 0
## Medium_Salary 1 10 2 0
## High_Salary 1 4 23 3
## Very_High_Salary 0 0 0 2
##
## Overall Statistics
##
## Accuracy : 0.7843
## 95% CI : (0.6468, 0.8871)
## No Information Rate : 0.4902
## P-Value [Acc > NIR] : 1.579e-05
##
## Kappa : 0.6492
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Low_Salary Class: Medium_Salary Class: High_Salary
## Sensitivity 0.71429 0.7143 0.9200
## Specificity 1.00000 0.9189 0.6923
## Pos Pred Value 1.00000 0.7692 0.7419
## Neg Pred Value 0.95652 0.8947 0.9000
## Prevalence 0.13725 0.2745 0.4902
## Detection Rate 0.09804 0.1961 0.4510
## Detection Prevalence 0.09804 0.2549 0.6078
## Balanced Accuracy 0.85714 0.8166 0.8062
## Class: Very_High_Salary
## Sensitivity 0.40000
## Specificity 1.00000
## Pos Pred Value 1.00000
## Neg Pred Value 0.93878
## Prevalence 0.09804
## Detection Rate 0.03922
## Detection Prevalence 0.03922
## Balanced Accuracy 0.70000
print(confusion_lasso)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low_Salary Medium_Salary High_Salary Very_High_Salary
## Low_Salary 4 1 0 0
## Medium_Salary 2 9 2 0
## High_Salary 1 4 23 3
## Very_High_Salary 0 0 0 2
##
## Overall Statistics
##
## Accuracy : 0.7451
## 95% CI : (0.6037, 0.8567)
## No Information Rate : 0.4902
## P-Value [Acc > NIR] : 0.0001852
##
## Kappa : 0.5854
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Low_Salary Class: Medium_Salary Class: High_Salary
## Sensitivity 0.57143 0.6429 0.9200
## Specificity 0.97727 0.8919 0.6923
## Pos Pred Value 0.80000 0.6923 0.7419
## Neg Pred Value 0.93478 0.8684 0.9000
## Prevalence 0.13725 0.2745 0.4902
## Detection Rate 0.07843 0.1765 0.4510
## Detection Prevalence 0.09804 0.2549 0.6078
## Balanced Accuracy 0.77435 0.7674 0.8062
## Class: Very_High_Salary
## Sensitivity 0.40000
## Specificity 1.00000
## Pos Pred Value 1.00000
## Neg Pred Value 0.93878
## Prevalence 0.09804
## Detection Rate 0.03922
## Detection Prevalence 0.03922
## Balanced Accuracy 0.70000
For simplicity we do not include this part of the analysis in the .html file. The code is available in the .rmd file.
We decided to also consult our textbook further and found section 6.5 titled: Lab: Linear Models and Regularization Methods. From the section:
“The regsubsets() function (part of
the leaps library) performs best subset
selection by identifying the best model that contains a given number of
predictors, where best is quantified using RSS.”
We ran the regsubsets() function over all 19 predictors
using this approach and found that:
We then ran forward and backward selection with
regsubsets() in search of a better result; however, we got
the same subset sizes as above, with differing predictors.
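For reference, here is a minimal sketch of the regsubsets() workflow (shown on the built-in swiss data rather than our Hitters splits; in the .rmd we ran it over all 19 predictors):

```r
library(leaps)

# Best subset selection: for each subset size, regsubsets() finds the
# predictor combination with the lowest RSS (built-in swiss data here)
fit <- regsubsets(Fertility ~ ., data = swiss, nvmax = 5)
summ <- summary(fit)

best_size <- which.max(summ$adjr2)  # pick a size by adjusted R^2
coef(fit, best_size)                # coefficients of that best subset
# forward/backward selection is requested via method = "forward" / "backward"
```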
When we applied these subsets to our “basic” multinomial model, the accuracy did not improve notably. So we decided to take note of these findings while not including them in the final report.
The aim of this analysis was to predict baseball players’ salaries as “Low”, “Medium”, “High” or “Very High”, as determined by clustering the collected salary values.
We can conclude that the more seasoned and
experienced a player is (reflected by the Years
feature), and the better they perform (reflected by
performance metrics like Hits, HmRun,
etc.), the higher their salary will be. We further
analyzed this by creating a success score and saw the positive
relationship confirmed once again. The player’s league or division
seems to have little effect on salary, especially compared to their
performance.
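A hypothetical sketch of how such a success score can behave (the exact definition we used is in the .rmd; the toy data and weighting below are purely illustrative):

```r
set.seed(18062003)
# Toy stand-in for player data: salary driven by performance plus noise
toy <- data.frame(Hits = rnorm(100, 100, 20), HmRun = rnorm(100, 10, 5))
toy$Salary <- 2 * toy$Hits + 30 * toy$HmRun + rnorm(100, sd = 60)

# Success score: sum of standardized performance metrics
toy$success <- rowSums(scale(toy[, c("Hits", "HmRun")]))
cor(toy$success, toy$Salary)  # positive, mirroring what we saw on Hitters
```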
For future data gathering or analysis, more entries collected over a longer period of time would help refine these models.