Supervised learning and unsupervised learning are two major branches of machine learning that differ in how they use data.
In supervised learning, we work with data that includes both predictors (inputs) and a known outcome (target). The goal is to learn a relationship between them so we can predict the outcome for new data. For example, predicting house prices based on square footage, number of bedrooms, and location, or classifying emails as “spam” or “not spam.” The model is “supervised” because it learns from examples where the correct answers are already known.
In contrast, unsupervised learning deals with data that has no predefined outcome. The goal is to uncover hidden patterns or structures within the data. Examples include grouping customers into segments based on purchasing behavior (clustering) or reducing many correlated variables into fewer informative components (principal component analysis, or PCA).
In this workshop, we will focus only on supervised learning, specifically regression (predicting continuous outcomes), as these methods form the foundation for most modern predictive modeling tasks.
The dataset used in this workshop summarizes long-term hydrologic, climatic, and environmental characteristics of Arctic and sub-Arctic watersheds from 1984–2018. Each record represents a stream or river reach with associated climate, permafrost, and landscape information.
The primary variable of interest is the mean groundwater contribution fraction (bfi_mean). It measures how much of the stream’s flow comes from groundwater rather than from rain or snowmelt running over the land surface (surface flow). Higher values mean the stream is fed mostly by groundwater, while lower values mean it’s mainly fed by surface flow.
A second response variable, bfi_perchangeperyr, quantifies the annual trend in groundwater contribution over the same period — showing whether baseflow is increasing or decreasing through time.
Predictor variables (31 in total) include a diverse set of watershed attributes, spanning climate (precipitation, snow, temperature, vapor pressure deficit, net radiation), permafrost and ground thermal conditions, topography and stream network geometry, vegetation, and soil properties.
The overall workflow we will follow in this workshop is summarized below:
| Step | Purpose | Example |
|---|---|---|
| 1. Define the question | What do we want to predict? | Groundwater contribution (bfi_mean) |
| 2. Prepare the data | Clean, transform, and split | Training vs. testing data |
| 3. Build models | Try different algorithms | Linear, LASSO, ENet, Trees |
| 4. Evaluate performance | Measure accuracy | RMSE, R² |
| 5. Interpret & apply | Understand key drivers | Which predictors matter? |
library(glmnet)        # for LASSO / Elastic Net
library(rpart)         # for decision trees
library(rpart.plot)    # tree plotting
library(ipred)         # for bagging
library(randomForest)  # for random forest
library(xgboost)       # for boosting
library(caret)         # for data splitting & RMSE calculation
library(dplyr)         # for the %>% pipe operator

Data cleaning is a mandatory and often time-consuming step in any data-driven analysis. It involves identifying and correcting errors, handling inconsistent formats, detecting and addressing outliers, and managing missing or implausible values through methods such as imputation or removal. A properly cleaned dataset ensures that models are built on reliable and interpretable information rather than noise.
For the purpose of this workshop, we will skip the data cleaning stage so that we can focus on modeling concepts and practical applications. You are provided with a pre-cleaned dataset, which has been prepared in advance to remove missing values, correct encoding issues, and standardize variable names. This allows us to devote our session entirely to exploring and comparing supervised learning methods in R.
There are many ways to bring data into R, depending on the file type
and source. You can import spreadsheets from Excel, text files such as
.csv or .txt, connect directly to databases,
access APIs, or even pull data from online repositories.
In this workshop, we will keep things simple by using a
pre-cleaned .rds file that you can download
beforehand. The .rds format is an R-specific file type that
preserves the full structure of your dataset — including variable types
and labels — making it easy to load and use with a single command. This
ensures everyone works with the same, clean, and consistent
dataset without worrying about formatting or import issues.
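Loading the file then takes a single command. The file name below is a placeholder; use whatever name your downloaded .rds file has:

# Read the pre-cleaned workshop dataset (replace the file name with your local copy)
data <- readRDS("arctic_watersheds_clean.rds")

# Inspect the structure of the imported tibble
str(data)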
## tibble [9,972 × 35] (S3: tbl_df/tbl/data.frame)
## $ bfi_mean : num [1:9972] 0.485 0.535 0.286 0.49 0.491 0.451 0.408 0.27 0.473 0.438 ...
## $ site_id : num [1:9972] 2.5e+07 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ...
## $ bfi_perchangeperyr : num [1:9972] 1.064 0.282 0 0.816 -0.51 ...
## $ bfi_pval : num [1:9972] 0.058 0.164 0.339 0.019 0.15 0.159 0.142 0.106 0.313 0.523 ...
## $ mean_annual_precip_mm : num [1:9972] 18.1 17.6 20.4 13.8 15.7 ...
## $ precip_pct_change_annual : num [1:9972] 0.78 0.64 0.35 0.64 0.76 0.63 0.93 0.78 1.26 0.89 ...
## $ mean_annual_snow_cm : num [1:9972] 347 1000 1000 1000 276 ...
## $ snow_pct_change_annual : num [1:9972] 0.01 0 0 0 0 0 0 0 0 0 ...
## $ mean_annual_temp_deg_c : num [1:9972] -8.86 -10.43 -7.7 -10.32 -8.15 ...
## $ temp_pct_change_annual : num [1:9972] -1.63 -1.32 -1.44 -1.43 -1.95 -1.31 -1.19 -1.32 -1.27 -1.41 ...
## $ mean_annual_vpd_k_pa : num [1:9972] 0.31 0.29 0.29 0.28 0.31 0.29 0.26 NA 0.27 0.28 ...
## $ vpd_pct_change_annual : num [1:9972] 0.73 0.75 0.6 0.76 0.76 0.65 0.7 NA 0.75 0.72 ...
## $ mean_annual_r_net_w_m2 : num [1:9972] -0.38 -0.33 -0.12 -0.28 -0.32 0.25 -0.13 NA -0.32 -0.25 ...
## $ r_net_pct_change_annual : num [1:9972] -0.37 -0.36 -0.67 -0.38 -0.31 -0.04 -0.91 NA -0.47 -0.57 ...
## $ magt_deg_c : num [1:9972] NA NA NA NA NA ...
## $ mean_annual_lai : num [1:9972] 0 0 0 0 0 ...
## $ centroid_lat_dec_degree : num [1:9972] 80.1 80.1 79.4 79.7 79.8 ...
## $ elev_m : num [1:9972] 261 399 644 509 247 ...
## $ aspect_degree : num [1:9972] 53.1 252.5 25.9 326.2 260.1 ...
## $ reach_slope_m_m : num [1:9972] 0.0233 0.0069 0.0189 0.0051 0.0149 0.0095 0.0222 0.0446 0.0035 0.0181 ...
## $ upstrm_drainage_area_km2 : num [1:9972] 203 81.8 78.3 66.8 73.3 ...
## $ reach_length_km : num [1:9972] 12.85 6.92 28.96 21.79 22.76 ...
## $ stream_order : Ord.factor w/ 8 levels "1"<"2"<"3"<"4"<..: 2 1 1 1 1 2 1 1 2 1 ...
## $ permafrost_type : Factor w/ 5 levels "continuous","discontinuous",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ glacier_presence : Factor w/ 2 levels "no","glacier": 2 2 2 2 2 1 2 1 2 2 ...
## $ alt_mean_m : num [1:9972] NA NA NA NA NA NA NA NA NA NA ...
## $ alt_pct_change_annual : num [1:9972] NA NA NA NA NA NA NA NA NA NA ...
## $ permeability_log10m2 : num [1:9972] -20 -20 -14.1 -20 -20 -15.2 -14.1 -14.1 -20 -20 ...
## $ bulk_density_cg_cm_3 : num [1:9972] NA NA NA NA NA ...
## $ coarse_fragment_cm3_dm3 : num [1:9972] NA NA 304 NA NA ...
## $ clay_content_g_kg : num [1:9972] NA NA 210 NA NA ...
## $ soil_organic_carbon_100cm_kg_m2: num [1:9972] 3.8 3.8 3.8 3.8 3.8 3.8 3.8 3.8 3.8 3.8 ...
## $ centroid_lon : num [1:9972] 26.3 24.1 12.4 22.7 19.6 ...
## $ aspect_sin : num [1:9972] 0.799 -0.954 0.436 -0.556 -0.985 ...
## $ aspect_cos : num [1:9972] 0.601 -0.301 0.9 0.831 -0.172 ...
## ---- (2) Define response and predictors ----
# We'll predict mean groundwater contribution (BFI_mean)
# Remove ID or non-predictor columns if still present
# Remove bfi_perchangeperyr, the other response variable
# Remove aspect_degree since it was converted from degrees to sin/cos components (N = 0/360, E = 90)
data <- subset(data, select = -c(bfi_perchangeperyr, aspect_degree, site_id, bfi_pval, centroid_lon))
# OR
#data <- data %>%
#  select(
#    -bfi_perchangeperyr,
#    -aspect_degree,
#    -site_id,
#    -bfi_pval,
#    -centroid_lon
#  )
# Drop remaining rows with any missing predictor
data_clean <- na.omit(data)
str(data_clean)

## tibble [3,083 × 30] (S3: tbl_df/tbl/data.frame)
## $ bfi_mean : num [1:3083] 0.687 0.624 0.614 0.689 0.291 0.707 0.695 0.592 0.579 0.615 ...
## $ mean_annual_precip_mm : num [1:3083] 28.4 21.8 25.6 25.1 20.3 ...
## $ precip_pct_change_annual : num [1:3083] 0.09 0.13 0.18 0.18 0.33 0.08 0.25 0.42 0.25 0.37 ...
## $ mean_annual_snow_cm : num [1:3083] 10.11 5.81 9.11 8.54 5.4 ...
## $ snow_pct_change_annual : num [1:3083] -1.73 -2.63 -1.77 -1.82 -2.7 -2.09 -1.96 -0.46 -0.66 0.54 ...
## $ mean_annual_temp_deg_c : num [1:3083] -0.13 0.07 -0.63 -0.52 -0.73 0.16 -0.92 -0.29 -0.41 0.36 ...
## $ temp_pct_change_annual : num [1:3083] -54.33 84.04 -10.25 -12.45 -7.63 ...
## $ mean_annual_vpd_k_pa : num [1:3083] 0.54 0.53 0.52 0.52 0.53 0.52 0.53 0.53 0.51 0.55 ...
## $ vpd_pct_change_annual : num [1:3083] 0.38 0.39 0.39 0.37 0.38 0.38 0.4 0.4 0.39 0.39 ...
## $ mean_annual_r_net_w_m2 : num [1:3083] 0.53 0.44 0.55 0.51 0.53 0.54 0.53 0.54 0.59 0.62 ...
## $ r_net_pct_change_annual : num [1:3083] 0.42 0.45 0.34 0.42 0.45 0.28 0.54 0.67 0.57 0.5 ...
## $ magt_deg_c : num [1:3083] 1.772 1.569 1.458 0.456 0.891 ...
## $ mean_annual_lai : num [1:3083] 2.61 2.99 1.46 2.68 2.33 ...
## $ centroid_lat_dec_degree : num [1:3083] 70.5 70.3 70.3 70.3 70.2 ...
## $ elev_m : num [1:3083] 56.3 29.8 349.5 349.7 183.1 ...
## $ reach_slope_m_m : num [1:3083] 0.0072 0.0024 0.0066 0.0031 0.0079 0.0377 0.0011 0.0005 0.0026 0.0036 ...
## $ upstrm_drainage_area_km2 : num [1:3083] 131.1 889.5 509.1 91.2 60.8 ...
## $ reach_length_km : num [1:3083] 4.88 6.37 6.86 7.38 12.67 ...
## $ stream_order : Ord.factor w/ 8 levels "1"<"2"<"3"<"4"<..: 2 2 2 2 1 1 5 2 2 1 ...
## $ permafrost_type : Factor w/ 5 levels "continuous","discontinuous",..: 4 4 3 3 3 4 4 4 3 3 ...
## $ glacier_presence : Factor w/ 2 levels "no","glacier": 1 1 1 1 1 1 1 1 1 1 ...
## $ alt_mean_m : num [1:3083] 0.575 0.565 0.551 0.567 0.581 ...
## $ alt_pct_change_annual : num [1:3083] 3.48 3.54 3.11 3.36 3.36 ...
## $ permeability_log10m2 : num [1:3083] -12.5 -11.8 -15.2 -15.2 -15.2 -15.2 -14.1 -14.1 -14.1 -14.1 ...
## $ bulk_density_cg_cm_3 : num [1:3083] 96.1 104.3 110.4 119.3 116.2 ...
## $ coarse_fragment_cm3_dm3 : num [1:3083] 225 170 234 161 220 ...
## $ clay_content_g_kg : num [1:3083] 108.9 134 153.9 138.1 80.7 ...
## $ soil_organic_carbon_100cm_kg_m2: num [1:3083] 31.5 7.2 14 14 4.6 14 31.5 31.5 31.5 31.5 ...
## $ aspect_sin : num [1:3083] -1 -0.712 -0.6 -0.393 0.974 ...
## $ aspect_cos : num [1:3083] -0.00105 0.7019 0.79968 0.91948 -0.22478 ...
## - attr(*, "na.action")= 'omit' Named int [1:6889] 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "names")= chr [1:6889] "1" "2" "3" "4" ...
It’s quite common (and good practice) to start with exploratory data analysis (EDA) — including scatterplots, histograms, and correlation matrices — before building models. EDA helps you to:

- Understand relationships: pairwise plots (pairs() or GGally::ggpairs()) show which predictors are related to the response; for example, mean_annual_temp_deg_c and vpd_pct_change_annual have strong positive relationships with bfi_mean.
- Detect multicollinearity among correlated predictors.
- Spot potential issues such as outliers or skewed distributions.
- Build intuition about the data before fitting any model.
# Quick look at pairwise relationships
library(GGally)

# Select a few key variables for clarity
subset_vars <- data_clean %>%
  select(bfi_mean, mean_annual_temp_deg_c, mean_annual_vpd_k_pa,
         reach_length_km, clay_content_g_kg)

ggpairs(subset_vars)

“We can already see that bfi_mean tends to increase with temperature and decrease with clay content — patterns we’ll test more formally using our regression models.”
When we build a predictive model, our goal is not just to explain the data we already have, but to make accurate predictions on new, unseen data. To evaluate how well a model generalizes, we divide the dataset into two parts: a training set and a test set.
The training set is used to fit the model—to learn patterns, relationships, or parameters. The test set, which the model has never seen during training, is then used to assess how well the model performs on new data. This separation helps us estimate the model’s true predictive ability.
Without this split, we risk overfitting, which happens when a model learns the training data too well—including its noise or random fluctuations—rather than the underlying structure. An overfitted model will show excellent performance on the training data but perform poorly on new data. Splitting data into training and test sets provides an honest assessment of a model’s generalization and helps us select models that are both accurate and robust.
## ---- (3) Split into training and test sets (80/20) ----
set.seed(123)
train_idx <- createDataPartition(data_clean$bfi_mean, p = 0.8, list = FALSE)
train <- data_clean[train_idx, ]
test <- data_clean[-train_idx, ]
nrow(train); nrow(test)

## [1] 2468
## [1] 615
In this section, we will build a series of regression models to
predict the mean groundwater contribution fraction
(bfi_mean) using several environmental and hydrologic
predictors. Each model represents a different approach to learning the
relationship between the predictors and the outcome. We begin with
Linear Regression, a simple and interpretable model that assumes a
straight-line relationship between the predictors and the response.
Next, we extend this idea using LASSO-type regression models, which introduce regularization to automatically select the most important variables and prevent overfitting.
We then explore Decision Trees, which capture nonlinear patterns by splitting the data into smaller, more homogeneous groups based on predictor values. To improve upon a single tree’s instability, we introduce ensemble methods — specifically Bagging, Random Forest, and Boosting — that combine the results of many trees to produce more robust and accurate predictions.
By comparing these models, we will see how different techniques balance simplicity, flexibility, and predictive performance, helping us understand when each approach might be most effective in practical regression problems.
When evaluating predictive models, there are many possible performance measures depending on the type of problem and the modeling goal. For regression models, some commonly used metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²). In classification problems, we often use measures such as accuracy, precision, recall, F1-score, and AUC (Area Under the Curve).
In this workshop, we will focus on two key metrics that are widely used for regression models — RMSE and R-squared (R²) — as they provide complementary perspectives on model quality.
The Root Mean Squared Error (RMSE) measures the average magnitude of prediction errors, showing how far the predicted values are from the observed values on average. It is calculated as the square root of the mean squared difference between actual and predicted values. A smaller RMSE indicates higher predictive accuracy.
The R-squared (R²) value represents the proportion of variability in the outcome variable that can be explained by the model. It ranges from 0 to 1, where values closer to 1 indicate a better fit.
In summary:
RMSE quantifies prediction accuracy — lower values are better.
R² quantifies explanatory power — higher values are better.
Together, these two measures give us a clear and interpretable summary of how well a model performs on unseen data.
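As a concrete illustration, here is a minimal sketch of how these two metrics are computed in R, using made-up observed and predicted values; the same formulas are applied to every model later in the workshop:

# Toy observed and predicted values (illustrative only)
obs  <- c(0.45, 0.60, 0.30, 0.55, 0.70)
pred <- c(0.50, 0.58, 0.35, 0.48, 0.66)

# RMSE: square root of the mean squared prediction error
rmse <- sqrt(mean((obs - pred)^2))

# R-squared: squared correlation between observed and predicted values
rsq <- cor(obs, pred)^2

rmse; rsq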
Linear regression models the relationship between a response variable ( \(y\) ) and one or more predictors ( \(x_1, x_2, \dots, x_p\) ) by fitting a straight line (or hyperplane) that best describes the data.
\[{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}+\epsilon_i\]
where \(y_i\) is the response for observation \(i\), \(x_{i1}, \dots, x_{ip}\) are the predictor values, \(\beta_0, \beta_1, \dots, \beta_p\) are the coefficients to be estimated, and \(\epsilon_i\) is the random error term.
Linear regression estimates the coefficients ( \(\beta_0, \beta_1, \dots, \beta_p\) ) by minimizing the Residual Sum of Squares (RSS) — the total squared difference between observed and predicted values:
\[\min_{\beta_0, \beta_1, \dots, \beta_p} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2\]
This means the model finds the line (or plane) that minimizes the overall squared prediction error.
Linear regression finds the single best-fitting straight line that minimizes the total squared differences between the actual and predicted values.
# visual (ignore the code)
set.seed(1)
x <- 1:20
y <- 2 + 0.5*x + rnorm(20, 0, 2)
fit <- lm(y ~ x)
plot(x, y, pch = 19, col = "steelblue",
main = "Linear Regression: Minimizing Squared Errors",
xlab = "Predictor (x)", ylab = "Response (y)")
abline(fit, col = "darkred", lwd = 2)
segments(x, y, x, fitted(fit), col = "gray50", lty = 2)
legend("topleft", legend = c("Data", "Best-fit line", "Residuals"),
col = c("steelblue", "darkred", "gray50"), pch = c(19, NA, NA),
       lty = c(NA, 1, 2), bty = "n")

We fit the baseline model using R’s built-in lm() function.

############################################################
## (1) LINEAR REGRESSION - Baseline model
############################################################
fit_lm <- lm(bfi_mean ~ ., data = train)
summary(fit_lm)

##
## Call:
## lm(formula = bfi_mean ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42464 -0.06336 0.00714 0.07053 0.40838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.436e-01 2.219e-01 2.450 0.014363 *
## mean_annual_precip_mm 1.550e-03 1.014e-03 1.530 0.126235
## precip_pct_change_annual 2.727e-02 1.154e-02 2.364 0.018148 *
## mean_annual_snow_cm -1.003e-05 6.311e-05 -0.159 0.873721
## snow_pct_change_annual -2.047e-03 3.492e-03 -0.586 0.557791
## mean_annual_temp_deg_c 9.089e-03 3.490e-03 2.605 0.009257 **
## temp_pct_change_annual 1.257e-05 3.021e-04 0.042 0.966809
## mean_annual_vpd_k_pa 3.431e-01 1.610e-01 2.132 0.033119 *
## vpd_pct_change_annual 1.424e-01 3.931e-02 3.624 0.000296 ***
## mean_annual_r_net_w_m2 1.669e-02 1.890e-02 0.883 0.377444
## r_net_pct_change_annual -1.014e-02 5.841e-03 -1.736 0.082694 .
## magt_deg_c -2.230e-04 2.573e-03 -0.087 0.930934
## mean_annual_lai 1.823e-03 2.554e-03 0.714 0.475577
## centroid_lat_dec_degree -1.669e-03 2.265e-03 -0.737 0.461293
## elev_m 1.236e-05 1.984e-05 0.623 0.533431
## reach_slope_m_m 2.821e-01 4.322e-01 0.653 0.514058
## upstrm_drainage_area_km2 7.426e-08 5.789e-07 0.128 0.897943
## reach_length_km -1.967e-03 2.747e-04 -7.161 1.06e-12 ***
## stream_order.L 1.922e-02 1.493e-01 0.129 0.897598
## stream_order.Q 8.535e-02 1.255e-01 0.680 0.496456
## stream_order.C 4.273e-02 8.203e-02 0.521 0.602444
## stream_order^4 3.346e-02 4.591e-02 0.729 0.466190
## stream_order^5 -1.491e-03 2.385e-02 -0.062 0.950175
## stream_order^6 -7.716e-04 1.280e-02 -0.060 0.951932
## permafrost_typediscontinuous -2.927e-02 9.283e-03 -3.153 0.001638 **
## permafrost_typesporadic -2.756e-02 1.194e-02 -2.307 0.021142 *
## permafrost_typeisolated -6.491e-02 1.402e-02 -4.631 3.83e-06 ***
## permafrost_typenone -6.695e-02 1.700e-02 -3.938 8.45e-05 ***
## glacier_presenceglacier -6.546e-02 6.923e-02 -0.946 0.344462
## alt_mean_m -1.248e-02 9.533e-03 -1.309 0.190735
## alt_pct_change_annual -4.693e-03 2.059e-03 -2.279 0.022726 *
## permeability_log10m2 2.222e-03 1.261e-03 1.762 0.078134 .
## bulk_density_cg_cm_3 5.315e-04 2.957e-04 1.798 0.072370 .
## coarse_fragment_cm3_dm3 8.957e-05 6.405e-05 1.398 0.162161
## clay_content_g_kg -3.165e-04 5.783e-05 -5.473 4.88e-08 ***
## soil_organic_carbon_100cm_kg_m2 -2.471e-04 1.698e-04 -1.456 0.145549
## aspect_sin -3.226e-04 3.250e-03 -0.099 0.920935
## aspect_cos 3.872e-03 3.236e-03 1.196 0.231685
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1118 on 2430 degrees of freedom
## Multiple R-squared: 0.2185, Adjusted R-squared: 0.2066
## F-statistic: 18.36 on 37 and 2430 DF, p-value: < 2.2e-16
What is the stream_order variable?

The stream_order variable represents the position of a
stream within the river network — where 1st-order
streams are the smallest headwater channels and larger orders
indicate progressively larger, downstream reaches. Because it encodes a
natural hierarchy, stream_order is treated as an
ordered factor in the model.
When an ordered factor is used in R, it is represented by a series of
orthogonal polynomial contrasts, shown in the
regression output as .L, .Q, .C,
^4, and so on. These correspond to linear,
quadratic, cubic, and higher-order
trends across the ordered levels. This does not make the model
nonlinear in the regression sense — the model is still linear in its
coefficients — but it allows the effect of stream order on groundwater
contribution to follow a smooth or curved pattern
rather than being forced to increase or decrease strictly linearly.
In simple terms, these terms let the model detect whether
bfi_mean changes steadily with stream size or whether it
peaks or dips at intermediate stream orders. This is helpful because
hydrologically, mid-order streams may behave differently from very small
headwaters or large river channels, and the polynomial contrasts provide
a flexible way to capture that pattern while keeping the model
interpretable and statistically valid.
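To see the contrast coding R generates for an ordered factor, a quick check (assuming data_clean from above) is:

# Polynomial contrasts used for the ordered stream_order factor
contrasts(data_clean$stream_order)

# The same pattern for any 8-level ordered factor: columns are the
# linear (.L), quadratic (.Q), cubic (.C), and higher-order trends
contr.poly(8)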
Model fit and meaning

The overall \(R^2 = 0.22\) indicates that about 22% of the variation in groundwater contribution is explained by the predictors. Although this may appear modest, it is quite strong for environmental and hydrologic data, where natural variability is high. The model is statistically significant (\(p < 2.2 \times 10^{-16}\)), showing that it captures meaningful, non-random structure.
Key predictors and physical interpretation
permafrost_type
In R, when you include a categorical variable (like
permafrost_type) in a linear model, one category is
automatically chosen as the reference (baseline). The
coefficients for the other categories represent how much higher or lower
their predicted bfi_mean is compared to that
baseline.
If you didn’t explicitly change the reference, R uses the first level of the factor as the baseline (for factors created from character data, this defaults to alphabetical order). Here the levels are ordered continuous, discontinuous, sporadic, isolated, none, so the baseline is “continuous”, which is why it does not appear in the coefficient table.
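A quick way to confirm the baseline, and to change it if a different reference category is more meaningful (relevel() is base R; the choice of "none" below is just an example):

# The first level listed is the regression baseline
levels(train$permafrost_type)

# Optional: re-fit with "none" (no permafrost) as the reference category
train_releveled <- train
train_releveled$permafrost_type <- relevel(train_releveled$permafrost_type, ref = "none")
fit_lm_releveled <- lm(bfi_mean ~ ., data = train_releveled)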
Interpret the coefficients relative to the baseline
Here are the coefficients you have:
| Category | Coefficient | Interpretation (relative to continuous) |
|---|---|---|
| discontinuous | −0.029 | Streams in discontinuous permafrost areas have, on average, a 0.03 lower BFI than those in continuous permafrost. |
| sporadic | −0.028 | Streams in sporadic permafrost have a 0.03 lower BFI than continuous permafrost areas. |
| isolated | −0.065 | Streams in isolated permafrost areas have a 0.07 lower BFI than continuous permafrost. |
| none | −0.067 | Streams with no permafrost have a 0.07 lower BFI than continuous permafrost. |
What this means
Because all coefficients are negative, and the baseline (continuous permafrost) is not shown but implicit, this tells us that — in this dataset — streams with less permafrost actually have lower groundwater contributions than those in continuous permafrost regions.
That’s counterintuitive from a purely physical standpoint (since frozen ground usually limits infiltration), but statistically, it means that after controlling for temperature, soil, and drainage variables, the model finds lower baseflow fractions in permafrost-free areas.
Unexpectedly weak predictors

Some variables that might seem hydrologically important—such as precipitation, snow depth, or elevation—are not significant once correlated climate and soil effects are accounted for. This highlights how regression isolates each variable’s unique influence after adjusting for others.
Model behavior and insight

The residual standard error (≈ 0.11) indicates predictions are typically within about 0.1 units of observed values on a 0–1 scale. The results reinforce physically consistent relationships: permafrost continuity, soil texture, and channel geometry emerge as dominant controls on groundwater contributions, while climate trends modulate these effects. This demonstrates how even a relatively simple linear model can uncover meaningful, process-based patterns across a complex landscape.
The significant F-statistic (p < 2.2e−16) confirms the model, as a whole, fits substantially better than a null (mean-only) model — so you’ve captured meaningful signal, not random noise.
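Before moving on to prediction, it can also help to glance at the standard residual diagnostics; a minimal sketch:

# Standard lm diagnostics: residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(fit_lm)
par(mfrow = c(1, 1))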
# Predict on the test set and evaluate
pred_lm <- predict(fit_lm, newdata = test)

rmse_lm <- sqrt(mean((test$bfi_mean - pred_lm)^2))
rsq_lm  <- cor(test$bfi_mean, pred_lm)^2
cat("\nLinear Regression RMSE:", round(rmse_lm,3),
    " | R-squared:", round(rsq_lm,3), "\n")

##
## Linear Regression RMSE: 0.116 | R-squared: 0.152
The linear regression model achieved an RMSE of 0.12
and an R² of 0.152, meaning that, on average, the
model’s predictions differ from the observed bfi_mean
values by about 0.12 units on a 0–1 scale, while
explaining roughly 15% of the variation in groundwater
contribution across sites. Although this indicates a moderate level of
predictive accuracy, it still captures meaningful large-scale patterns
in the data — a solid baseline performance for comparison with more
advanced models like LASSO, Elastic Net, and AHRLR that we will explore
next.
As datasets grow larger and more complex, traditional linear regression often struggles to balance accuracy and interpretability. When we have many predictors—especially if some are highly correlated or only weakly related to the outcome—the model can become unstable and overfit the training data. Overfitting means the model captures random noise rather than the true underlying relationships, which reduces its predictive power on new data.
To address these challenges, we use shrinkage and variable selection methods, such as Ridge Regression, LASSO (Least Absolute Shrinkage and Selection Operator), and Elastic Net. These techniques introduce a penalty on the size of the regression coefficients, effectively “shrinking” less important ones toward zero.
Shrinkage helps control model complexity and improves prediction accuracy.
Variable selection (as in LASSO) automatically identifies the most relevant predictors by setting some coefficients exactly to zero, simplifying the model.
In short, shrinkage and selection methods provide a principled way to build models that are more stable, interpretable, and generalizable, especially when dealing with high-dimensional or correlated data.
LASSO (Least Absolute Shrinkage and Selection Operator) is an extension of linear regression that improves model performance when there are many correlated or less useful predictors. It adds a penalty that forces some coefficients to shrink toward zero, which helps simplify the model and prevent overfitting.
\[{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}+\epsilon_i\]
This part is identical to ordinary linear regression — but how we estimate the slopes (\(\beta\)) changes.
LASSO finds the coefficients that minimize both the squared prediction error and the total absolute size of the coefficients:
\[\min_{\beta_0, \beta_1, \dots, \beta_p} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|\]
The tuning parameter ( \(\lambda\) ) controls how strong the penalty is: a small \(\lambda\) keeps the fit close to ordinary least squares, while a larger \(\lambda\) shrinks the coefficients more strongly, eventually setting some of them exactly to zero.
In a plot of λ (penalty) on the x-axis and coefficient values on the y-axis, you’ll see each coefficient’s line shrink toward zero as λ increases. Once λ is large enough, only a few predictors remain nonzero — the “selected” variables.
LASSO works like linear regression with a self-control mechanism: it still fits a line through the data, but it automatically shrinks or removes predictors that contribute little to the model. The result is a simpler, more stable, and interpretable model.
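glmnet works on numeric matrices rather than data frames, so before the cross-validation step below we need design matrices for the training and test sets. This is a sketch of the setup assumed by the calls that follow (the same model.matrix() construction reappears in the boosting section), plus an optional look at the coefficient paths described above:

# Build numeric design matrices for glmnet (drop the intercept column)
x_train <- model.matrix(bfi_mean ~ ., data = train)[, -1]
x_test  <- model.matrix(bfi_mean ~ ., data = test)[, -1]
y_train <- train$bfi_mean
y_test  <- test$bfi_mean

# Optional: coefficient paths as the penalty grows
fit_path <- glmnet(x_train, y_train, alpha = 1)
plot(fit_path, xvar = "lambda", label = TRUE)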
# Cross-validated LASSO to pick the best lambda
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 5)
plot(cv_lasso)   # cross-validated error vs log(lambda)
best_lambda <- cv_lasso$lambda.min
best_lambda
## [1] 0.00997893
The plot from cv_lasso shows the cross-validated error (mean squared error for a gaussian model such as this one) for a range of penalty values ( \(\lambda\) ) used in the LASSO model. The curve usually dips to a minimum and then rises again as the penalty becomes too strong.
The value best_lambda ≈ 0.01 represents the point where the model achieves the lowest cross-validation error, balancing bias and variance. In practical terms, this means the LASSO model with a penalty of about 0.01 provides the best trade-off between model simplicity (fewer variables) and predictive accuracy on unseen data.
# Fit final model at optimal lambda
fit_lasso <- glmnet(x_train, y_train, alpha = 1, lambda = best_lambda)
coef(fit_lasso)

## 39 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 6.001176e-01
## mean_annual_precip_mm .
## precip_pct_change_annual .
## mean_annual_snow_cm .
## snow_pct_change_annual .
## mean_annual_temp_deg_c 7.252624e-03
## temp_pct_change_annual .
## mean_annual_vpd_k_pa 5.347300e-02
## vpd_pct_change_annual .
## mean_annual_r_net_w_m2 .
## r_net_pct_change_annual .
## magt_deg_c .
## mean_annual_lai .
## centroid_lat_dec_degree .
## elev_m .
## reach_slope_m_m .
## upstrm_drainage_area_km2 .
## reach_length_km -9.407050e-04
## stream_order.L .
## stream_order.Q 3.141864e-02
## stream_order.C .
## stream_order^4 .
## stream_order^5 .
## stream_order^6 .
## stream_order^7 .
## permafrost_typediscontinuous .
## permafrost_typesporadic .
## permafrost_typeisolated .
## permafrost_typenone .
## glacier_presenceglacier .
## alt_mean_m .
## alt_pct_change_annual .
## permeability_log10m2 2.494306e-04
## bulk_density_cg_cm_3 .
## coarse_fragment_cm3_dm3 .
## clay_content_g_kg -8.999784e-05
## soil_organic_carbon_100cm_kg_m2 .
## aspect_sin .
## aspect_cos .
LASSO dramatically simplified the model

Out of nearly 40 predictors, the LASSO retained only six variables:

- mean_annual_temp_deg_c
- mean_annual_vpd_k_pa
- reach_length_km
- stream_order.Q
- permeability_log10m2
- clay_content_g_kg

All other coefficients were shrunk exactly to zero. This illustrates LASSO’s strength — it performs automatic variable selection, removing predictors that add little to the model’s predictive power and leaving a concise, interpretable subset.

The retained variables make strong physical sense:

- Temperature and VPD describe climatic influence on evapotranspiration and recharge.
- Reach length and stream order (quadratic) describe channel network geometry — the “plumbing” of the landscape.
- Permeability and clay content capture subsurface flow capacity and infiltration limitations.
LASSO thus bridges the major hydrologic dimensions — climate, landscape, and subsurface properties — all without manual variable selection.
The signs of the coefficients align with hydrologic theory
WOW takeaway

The LASSO reduced the model from 39 predictors to just six while preserving physically meaningful relationships. It automatically identified the dominant drivers — the same ones an expert might choose — through a purely data-driven learning process. This demonstrates how regularization can reveal the essential structure in complex environmental data while improving model stability and interpretability.
# Predict on the test set and evaluate
pred_lasso <- as.numeric(predict(fit_lasso, newx = x_test))
rmse_lasso <- sqrt(mean((y_test - pred_lasso)^2))
rsq_lasso  <- cor(y_test, pred_lasso)^2
cat("\nLASSO RMSE:", round(rmse_lasso,3),
    " | R-squared:", round(rsq_lasso,3), "\n")

##
## LASSO RMSE: 0.117 | R-squared: 0.14
The LASSO model achieved an RMSE of 0.117 and an R² of 0.14, which is very close to the linear regression model’s performance but with only a fraction of the predictors. This shows that LASSO delivers nearly the same predictive accuracy while producing a much simpler and more interpretable model. In practice, this balance between simplicity and accuracy is exactly what makes regularization methods like LASSO so valuable — they generalize better, reduce overfitting, and highlight the most important variables driving groundwater contributions.
Elastic Net combines the strengths of Ridge and LASSO regression. Like LASSO, it can shrink some coefficients to zero (performing variable selection), and like Ridge, it handles correlated predictors more smoothly by sharing their effects instead of forcing one to drop out.
This makes Elastic Net a great middle ground between the two methods — it balances model simplicity and stability.
\[{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}+\epsilon_i\]
\[\min_{\beta_0, \beta_1, \dots, \beta_p} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \left[ (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 + \alpha \sum_{j=1}^{p} |\beta_j|\right]\]
Here:
\(\lambda\) controls the overall penalty strength (like before).
\(\alpha\) controls the mix between Ridge (\(L_2\)) and LASSO (\(L_1\)):
\(\alpha = 0\) → Ridge Regression
\(\alpha = 1\) → LASSO Regression
\(0 < \alpha < 1\) → Elastic Net (a blend of both)
Elastic Net is a hybrid shrinkage method that combines the variable selection of LASSO with the stability of Ridge. It’s especially effective when predictors are highly correlated or when there are many more predictors than observations (p > n).
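The code below fixes \(\alpha = 0.5\) for simplicity, but \(\alpha\) can itself be tuned. A minimal sketch (assuming the x_train and y_train matrices built earlier) that scans a small grid of \(\alpha\) values and keeps the one with the lowest cross-validated error:

# Simple alpha grid search with cv.glmnet (illustrative, not exhaustive)
set.seed(123)
alphas  <- seq(0, 1, by = 0.25)
cv_fits <- lapply(alphas, function(a) cv.glmnet(x_train, y_train, alpha = a, nfolds = 5))
cv_errs <- sapply(cv_fits, function(cv) min(cv$cvm))

best_alpha <- alphas[which.min(cv_errs)]
best_alpha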
# Cross-validated Elastic Net to pick the best lambda (alpha = 0.5)
cv_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, nfolds = 5)
plot(cv_enet)   # cross-validated error vs log(lambda)
best_lambda_enet <- cv_enet$lambda.min
best_lambda_enet

## [1] 0.02190373
# Fit final model at optimal lambda
fit_enet <- glmnet(x_train, y_train, alpha = 0.5, lambda = best_lambda_enet)
coef(fit_enet)

## 39 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 5.473688e-01
## mean_annual_precip_mm .
## precip_pct_change_annual .
## mean_annual_snow_cm .
## snow_pct_change_annual .
## mean_annual_temp_deg_c 4.687992e-03
## temp_pct_change_annual .
## mean_annual_vpd_k_pa 1.386127e-01
## vpd_pct_change_annual .
## mean_annual_r_net_w_m2 .
## r_net_pct_change_annual .
## magt_deg_c 9.541105e-05
## mean_annual_lai .
## centroid_lat_dec_degree .
## elev_m .
## reach_slope_m_m .
## upstrm_drainage_area_km2 .
## reach_length_km -7.722631e-04
## stream_order.L .
## stream_order.Q 2.479099e-02
## stream_order.C -2.757078e-03
## stream_order^4 .
## stream_order^5 .
## stream_order^6 .
## stream_order^7 .
## permafrost_typediscontinuous .
## permafrost_typesporadic .
## permafrost_typeisolated .
## permafrost_typenone .
## glacier_presenceglacier .
## alt_mean_m .
## alt_pct_change_annual .
## permeability_log10m2 5.976816e-04
## bulk_density_cg_cm_3 .
## coarse_fragment_cm3_dm3 .
## clay_content_g_kg -7.396067e-05
## soil_organic_carbon_100cm_kg_m2 .
## aspect_sin .
## aspect_cos .
Elastic Net kept a broader but stable set of predictors

The ENet model retained eight nonzero coefficients, blending climate, landscape, and soil factors:

- mean_annual_temp_deg_c
- mean_annual_vpd_k_pa
- magt_deg_c
- reach_length_km
- stream_order.Q
- stream_order.C
- permeability_log10m2
- clay_content_g_kg

This shows that Elastic Net preserved more information than LASSO, especially among correlated climate and stream geometry variables, while still shrinking weaker predictors toward zero.
- The ground temperature variable (magt_deg_c), though small in magnitude, likely captures local permafrost thermal effects.
- The stream order contrasts (quadratic and cubic) capture curved or non-linear patterns in how groundwater contribution changes across the river network. Instead of assuming that larger streams always have more or less groundwater input, these terms allow the model to represent a hump-shaped relationship — where mid-sized streams (not the smallest headwaters or largest rivers) often show the strongest interaction between surface and groundwater. This reflects the idea that intermediate-order streams tend to be the most hydrologically connected zones in a watershed.
# Predict on the test set and evaluate
pred_enet <- as.numeric(predict(fit_enet, newx = x_test))
rmse_enet <- sqrt(mean((y_test - pred_enet)^2))
rsq_enet  <- cor(y_test, pred_enet)^2
cat("\nENET RMSE:", round(rmse_enet,3),
    " | R-squared:", round(rsq_enet,3), "\n")

##
## ENET RMSE: 0.117 | R-squared: 0.14
The Elastic Net model achieved predictive accuracy comparable to the LASSO model (RMSE ≈ 0.117, R² ≈ 0.14), while maintaining a more stable solution in the presence of correlated predictors. This highlights Elastic Net’s advantage — it balances model simplicity and robustness, capturing key relationships without overfitting or losing important correlated variables.
AHRLR builds on the strengths of LASSO, Adaptive LASSO, and Relaxed LASSO to achieve better variable selection and lower bias. It is a two-stage procedure designed to first select the relevant predictors and then re-estimate the retained coefficients with much lighter shrinkage.
This makes AHRLR both sparse (many coefficients are zero) and less biased for the variables that remain.
\[{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}+\epsilon_i\]
AHRLR solves the following adaptive and relaxed penalized least squares problem:
\[\min_{\beta_0, \beta_1, \dots, \beta_p} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2+ \lambda_1 \sum_{j=1}^{p} w_j |\beta_j|+ \lambda_2 (1 - \phi) \sum_{j=1}^{p} \beta_j^2\]
where:

- \(w_j\) are adaptive weights for each coefficient (typically derived from an initial fit, so weaker predictors are penalized more heavily),
- \(\lambda_1\) and \(\lambda_2\) control the strength of the adaptive \(L_1\) and \(L_2\) penalties, and
- \(\phi\) is the relaxation parameter that eases the shrinkage on the selected variables.
This two-stage process allows flexible control over selection and shrinkage strength.
You can visualize AHRLR as a two-step process: (1) an adaptive LASSO-style fit selects which predictors to keep, then (2) a relaxed re-fit applies much lighter shrinkage to the retained coefficients, reducing the bias that plain LASSO introduces.
This results in a model that is both parsimonious and accurate.
AHRLR combines adaptive weighting and relaxation to reduce LASSO’s bias while keeping its ability to select variables. It bridges the gap between model simplicity and unbiased estimation, making it particularly effective for data with correlated predictors or mixed signal strengths.
# Quick Start ----------------------------
# 1. Load AHRLR functions
source("https://hasthika.github.io/Rpacks/ahrlr_fit.txt")
# 2. Fit model
fit <- fit_AHRLR(x_train, y_train, lambda2 = 1)
# 3. Inspect selected variables
fit$selected_vars

## [1] "precip_pct_change_annual" "mean_annual_temp_deg_c"
## [3] "mean_annual_vpd_k_pa" "vpd_pct_change_annual"
## [5] "reach_length_km" "stream_order.Q"
## [7] "stream_order^5" "permafrost_typeisolated"
## [9] "clay_content_g_kg"
# 4. Predict on new data
pred_ahrlr <- predict_AHRLR(fit, newx = x_test)
# 5. Optional summary
summarize_AHRLR(fit, X_test = x_test, y_test = y_test)

## AHRLR model summary
## -------------------
## lambda.min: 0.1316359
## Selected variables (9):
## precip_pct_change_annual, mean_annual_temp_deg_c, mean_annual_vpd_k_pa, vpd_pct_change_annual, reach_length_km, stream_order.Q, stream_order^5, permafrost_typeisolated, clay_content_g_kg
## Test RMSE: 0.1163
AHRLR identified a comprehensive yet stable set of predictors

The Adaptive Hybrid Relaxed Lasso Regression (AHRLR) retained nine meaningful variables, expanding slightly beyond LASSO and ENet:

- precip_pct_change_annual
- mean_annual_temp_deg_c
- mean_annual_vpd_k_pa
- vpd_pct_change_annual
- reach_length_km
- stream_order.Q
- stream_order^5
- permafrost_typeisolated
- clay_content_g_kg

Compared with LASSO and ENet, AHRLR recovers a slightly broader, more balanced set of predictors — capturing both climatic and physical watershed controls while avoiding over-shrinkage.
- Climate effects: Changes in precipitation, temperature, and vapor pressure deficit (VPD) highlight the influence of long-term climate trends on groundwater contribution.
- Landscape effects: Variables like reach_length_km and stream_order describe basin size and connectivity, controlling how surface and subsurface flows interact.
- Subsurface effects: permafrost_typeisolated and clay_content_g_kg reflect groundwater movement capacity — regions with more clay or isolated permafrost show reduced groundwater inputs.
Together, these findings suggest that climate, topography, and soil composition jointly determine how much groundwater sustains streamflow.
rmse_AHRLR <- sqrt(mean((y_test - pred_ahrlr)^2))
rsq_AHRLR <- cor(y_test, pred_ahrlr)^2
cat("\nAHRLR RMSE:", round(rmse_AHRLR, 3),
" | R-squared:", round(rsq_AHRLR, 3), "\n")##
## AHRLR RMSE: 0.116 | R-squared: 0.147
The AHRLR model achieved an RMSE of 0.116 and an R² of 0.147, performing about the same as the linear regression and LASSO models in terms of predictive accuracy. What makes AHRLR stand out is that it keeps slightly more meaningful variables while still maintaining a low error rate. This means it captures a broader picture of the system — combining climate, landscape, and soil effects — without losing predictive strength.
While linear, Ridge, LASSO, and Elastic Net regressions are powerful and interpretable, they all assume a linear relationship between predictors and the response. In many real-world situations, that assumption can be too limiting — relationships may be nonlinear, involve interactions among variables, or change across different ranges of the predictors.
To capture these complex patterns, we turn to tree-based models, which divide the data into smaller, more homogeneous groups based on predictor values. Each split represents a simple decision rule, and together these rules form a tree-like structure that is easy to interpret.
There are two main types of tree-based models:
Regression Trees, used when the outcome variable is continuous (e.g., predicting groundwater contribution).
Classification Trees, used when the outcome variable is categorical (e.g., predicting whether a stream is glacier-fed or not).
In this workshop, we will focus on regression trees to stay within the continuous-response setting, then extend the idea to more advanced ensemble methods like Bagging, Random Forest, and Boosting to improve predictive performance and stability.
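The chunk that fit and plotted the tree is not shown above; a minimal sketch of how such a tree can be produced with rpart and rpart.plot (default tuning assumed, so the exact splits may differ slightly from those described below) is:

# Fit a regression tree for bfi_mean on the training data (defaults assumed)
fit_tree <- rpart(bfi_mean ~ ., data = train, method = "anova")

# Plot the fitted tree
rpart.plot(fit_tree, main = "Regression Tree for bfi_mean")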
This regression tree predicts the mean groundwater
contribution (bfi_mean) using climate, location,
and watershed variables.
Each split represents a decision rule that divides the
data into groups with more similar groundwater contributions.
The predicted value at each terminal node (yval) is the
average bfi_mean for sites that meet all
conditions along that branch.
The first and most important division occurs at
mean_annual_temp_deg_c < -9.545, separating very
cold (Arctic) regions from warmer (sub-Arctic to
temperate) ones.
This confirms that temperature is the dominant control
on groundwater contribution across the study area.
Warmer basins tend to have greater groundwater influence, consistent with permafrost thaw and deeper infiltration.
Within the cold group, variation depends on net radiation trend and drainage area.
High net radiation trend (r_net_pct_change_annual ≥
0.385) and very low ground temperatures (magt_deg_c
< −6.3)
→ extremely low groundwater contributions (≈ 0.35–0.38).
These are the coldest, most frozen landscapes with minimal subsurface
flow.
Cold areas with weaker radiation increase and larger
drainage areas (≥ 37 km²)
→ moderately higher groundwater values (≈ 0.52).
Small, cold basins (< 37 km²)
→ the highest BFI in the cold branch (≈ 0.59), suggesting that
small, well-connected headwaters can still maintain
notable groundwater input when not fully frozen.
For warmer watersheds, drainage area again drives the next major split, and basins with more snow (mean_annual_snow_cm ≥ 4 cm) show distinct contrasts among their subgroups.
The tree highlights a hierarchical climate–basin size control:
| Dominant Condition | Typical BFI | Interpretation |
|---|---|---|
| Very cold (≤ −9.5 °C), frozen soils | 0.35–0.45 | Minimal groundwater contribution |
| Cold, larger basins | 0.50–0.53 | Slightly increased baseflow buffering |
| Moderate temperature, large basins | 0.55–0.60 | Sustained groundwater flow |
| Warm, small basins | 0.65–0.70 | Strongest groundwater–surface exchange |
In short:
The regression tree reveals that temperature and basin size are the two most powerful predictors of groundwater contribution, with additional refinements from net radiation, snow, and latitude.
Groundwater influence increases systematically with warming and decreasing permafrost control, particularly in smaller, low-latitude catchments.
This tree provides an interpretable map of how climate and watershed geometry together shape baseflow across the Arctic landscape.
# Predict and evaluate
pred_tree <- predict(fit_tree, newdata = test)
rmse_tree <- sqrt(mean((test$bfi_mean - pred_tree)^2))
rsq_tree <- cor(test$bfi_mean, pred_tree)^2
cat("\nDecision Tree RMSE:", round(rmse_tree,3),
" | R-squared:", round(rsq_tree,3), "\n")##
## Decision Tree RMSE: 0.104 | R-squared: 0.32
Decision trees are powerful because they are easy to interpret and can capture nonlinear relationships automatically. However, a single tree can be unstable — small changes in the data can lead to a very different tree structure — and its predictions may not be as accurate as more robust methods. This happens because trees have high variance: they tend to fit the training data closely but may not generalize well to new observations.
To overcome these limitations, we use ensemble methods, which combine the predictions of many trees to create a more stable and accurate model.
Bagging (Bootstrap Aggregating) reduces variance by building many trees on random bootstrap samples of the data and averaging their predictions.
Random Forest extends bagging by adding another layer of randomness — at each tree split, only a random subset of predictors is considered — which further decorrelates the trees and improves performance.
Boosting takes a different approach by building trees sequentially, where each new tree focuses on correcting the errors made by the previous ones.
Together, these ensemble techniques make tree-based models more accurate, stable, and resistant to overfitting, while still retaining the ability to model complex, nonlinear relationships.
Bagging (Bootstrap Aggregating) is a simple but powerful way to improve the stability and accuracy of decision trees. The idea is to create many bootstrap samples — random samples of the training data taken with replacement — and fit a separate tree to each one. Each tree may look a little different because it sees a slightly different subset of the data.
When making predictions, we average the results from all these trees (or take a majority vote for classification). Averaging helps cancel out random fluctuations and reduces the variance of the model, leading to more reliable predictions.
However, the main disadvantage of bagging is that it sacrifices interpretability — since predictions come from the average of many trees, it’s no longer clear how individual predictors influence the outcome. In other words, bagging gives up some transparency in exchange for greater accuracy and stability.
In short:
Bagging = Build many trees on random samples + average their predictions. It smooths out noisy individual trees to produce a more stable and accurate ensemble model, though at the cost of losing a simple, single-tree interpretation.
############################################################
## (6) BAGGING - Averaging many bootstrap trees
############################################################
fit_bag <- bagging(bfi_mean ~ ., data = train, nbagg = 50) # 50 bootstraps
pred_bag <- predict(fit_bag, newdata = test)
rmse_bag <- sqrt(mean((test$bfi_mean - pred_bag)^2))
rsq_bag <- cor(test$bfi_mean, pred_bag)^2
cat("\nBagging RMSE:", round(rmse_bag,3),
" | R-squared:", round(rsq_bag,3), "\n")##
## Bagging RMSE: 0.097 | R-squared: 0.405
The Bagging model achieved an RMSE of 0.097 and an R² of 0.405, which is a major improvement over the linear and LASSO models (where R² was around 0.15).
This means Bagging explains roughly 40% of the variation in groundwater contribution — about 2.5 times more than the linear model — and predicts much more accurately (lower RMSE).
The improvement happens because Bagging combines results from many decision trees built on different bootstrap samples of the data. By averaging across many trees, the model smooths out the noise of any single tree, reduces variance, and captures nonlinear relationships that a straight-line model misses.
“Bagging outperforms linear methods because it can capture complex, nonlinear patterns in how climate, soils, and landscape interact — not just straight-line effects. But it sacrifices interpretability.”
Random Forest builds on the idea of bagging but adds another clever layer of randomness to further improve performance. Like bagging, it creates many bootstrap samples and grows a separate decision tree for each. However, when a tree is being built, the algorithm does not consider all predictors at each split — instead, it selects a random subset of features and chooses the best split only among them.
This extra randomness helps make each tree more unique and less correlated with the others. When the predictions from all trees are averaged (for regression) or voted on (for classification), the result is a model that is typically more accurate and more robust than either a single tree or even plain bagging.
The trade-off is similar to bagging: interpretability is reduced because we now have hundreds of trees instead of one, but we gain excellent predictive power and strong resistance to overfitting.
In short:
Random Forest = Bagging + Random Feature Selection at Each Split This approach decorrelates trees, reduces variance, and usually delivers one of the best out-of-the-box predictive models.
############################################################
## (7) RANDOM FOREST - Bagging + random feature selection
############################################################
fit_rf <- randomForest(bfi_mean ~ ., data = train,
ntree = 300, mtry = floor(sqrt(ncol(train)-1)),
importance = TRUE)
print(fit_rf)

##
## Call:
## randomForest(formula = bfi_mean ~ ., data = train, ntree = 300, mtry = floor(sqrt(ncol(train) - 1)), importance = TRUE)
## Type of random forest: regression
## Number of trees: 300
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 0.007708973
## % Var explained: 51.05
# Simplify the Random Forest importance plot
varImpPlot(fit_rf,
n.var = 10, # show only top 10 variables
main = "Top 10 Important Variables (Random Forest)",
           cex = 0.8)           # shrink label text for readability

pred_rf <- predict(fit_rf, newdata = test)
rmse_rf <- sqrt(mean((test$bfi_mean - pred_rf)^2))
rsq_rf <- cor(test$bfi_mean, pred_rf)^2
cat("\nRandom Forest RMSE:", round(rmse_rf,3),
" | R-squared:", round(rsq_rf,3), "\n")##
## Random Forest RMSE: 0.089 | R-squared: 0.512
This plot shows the top 10 most important predictors of groundwater contribution (bfi_mean) identified by the Random Forest model. Each variable’s importance is measured in two ways: %IncMSE (how much the model’s prediction error increases when that variable’s values are randomly permuted) and IncNodePurity (the total reduction in node impurity from splits on that variable).
From both measures, we can see that mean_annual_temp_deg_c, r_net_pct_change_annual, and mean_annual_r_net_w_m2 highlight the strong role of climate and surface energy balance in shaping groundwater contributions.
In short:
The Random Forest confirms that larger drainage areas and warmer, more temperate regions tend to have higher groundwater contributions — consistent with what we saw in the tree model and regression results.
Gradient Boosting is an advanced ensemble method that builds trees sequentially, where each new tree learns from the mistakes of the previous ones. Unlike bagging or random forests, which grow many independent trees and average their results, boosting takes an iterative approach: it starts with a simple model and gradually improves it by focusing on the observations that were hardest to predict.
At each step, a new tree is fit to the residual errors (the difference between the actual and predicted values), and its predictions are combined with the previous model to reduce overall error. This process continues for many rounds, producing a strong predictive model from many weak learners.
The key idea is that each tree “boosts” the performance of the ensemble by correcting earlier errors. The method is highly flexible and powerful, often achieving excellent accuracy. However, it comes with trade-offs — gradient boosting can be computationally intensive and prone to overfitting if too many trees are grown or the learning rate is too high.
In short:
Gradient Boosting = Sequential tree learning where each tree corrects the errors of the previous ones.
############################################################
## (8) GRADIENT BOOSTING - Sequential tree learning
############################################################
# Prepare data matrices for xgboost
x_train <- model.matrix(bfi_mean ~ ., data = train)[, -1]
x_test <- model.matrix(bfi_mean ~ ., data = test)[, -1]
y_train <- train$bfi_mean
y_test <- test$bfi_mean
# xgboost requires numeric matrices only
dtrain <- xgb.DMatrix(data = x_train, label = y_train)
dtest <- xgb.DMatrix(data = x_test, label = y_test)
# Train model (tweak parameters as needed)
fit_xgb <- xgboost(
data = dtrain,
objective = "reg:squarederror",
nrounds = 300, # number of boosting iterations
eta = 0.05, # learning rate
max_depth = 5, # depth of trees
subsample = 0.8, # fraction of data for each tree
colsample_bytree = 0.8,
verbose = 0
)
# Predict and evaluate
pred_xgb <- predict(fit_xgb, newdata = dtest)
rmse_xgb <- sqrt(mean((y_test - pred_xgb)^2))
rsq_xgb <- cor(y_test, pred_xgb)^2
cat("\nBoosting (XGBoost) RMSE:", round(rmse_xgb,3),
" | R-squared:", round(rsq_xgb,3), "\n")##
## Boosting (XGBoost) RMSE: 0.091 | R-squared: 0.478
############################################################
## COMPARISON SUMMARY
############################################################
results <- data.frame(
Model = c("Linear", "LASSO", "Enet", "AHRLR", "Tree", "Bagging", "RF", "Boosting"),
RMSE = c(rmse_lm, rmse_lasso, rmse_enet, rmse_AHRLR, rmse_tree, rmse_bag, rmse_rf, rmse_xgb),
R2 = c(rsq_lm, rsq_lasso, rsq_enet, rsq_AHRLR, rsq_tree, rsq_bag, rsq_rf, rsq_xgb)
)
print(results[order(results$RMSE), ], row.names = FALSE)

## Model RMSE R2
## RF 0.08868245 0.5121020
## Boosting 0.09061455 0.4778010
## Bagging 0.09741966 0.4047560
## Tree 0.10396971 0.3199305
## Linear 0.11615648 0.1524623
## AHRLR 0.11633932 0.1472868
## LASSO 0.11656061 0.1397317
## Enet 0.11681202 0.1402390
# Optional: bar plot of model RMSEs
barplot(results$RMSE, names.arg = results$Model,
col = "steelblue", main = "Model Comparison (RMSE ↓ Better)",
ylab = "Root Mean Squared Error")We compared several supervised learning models for predicting
mean groundwater contribution (bfi_mean).
The table below summarizes their predictive accuracy using Root
Mean Squared Error (RMSE) and R-squared (R²)
on the test data.
| Model | RMSE | R² |
|---|---|---|
| Random Forest | 0.089 | 0.512 |
| Boosting (XGBoost) | 0.091 | 0.478 |
| Bagging | 0.097 | 0.405 |
| Decision Tree | 0.104 | 0.320 |
| Linear Regression | 0.116 | 0.152 |
| AHRLR | 0.116 | 0.147 |
| LASSO | 0.117 | 0.140 |
| Elastic Net | 0.117 | 0.140 |
Model performance improves as flexibility increases. Simpler models like Linear Regression and LASSO explain only about 15% of the variation in groundwater contribution, while ensemble tree methods capture over 50%.
Random Forest delivers the best overall accuracy. With an R² of ~0.51 and the lowest RMSE, Random Forest balances predictive power and robustness by combining many decision trees with random feature selection.
Boosting performs similarly, but slightly less stably. Its sequential learning can fine-tune predictions but may slightly overfit depending on parameter settings.
Bagging and single Trees show the importance of variance reduction. Each step from Tree → Bagging → Random Forest adds stability and predictive strength.
Shrinkage models (LASSO, ENet, AHRLR) still add value. While not as accurate as ensemble methods, they offer interpretability and variable selection, helping identify key climate and permafrost drivers of groundwater flow.
The results highlight a clear trade-off in supervised learning: linear and shrinkage models are simpler and more interpretable, while tree-based ensembles are more powerful and flexible.
Together, they demonstrate how modern machine-learning tools can uncover complex, nonlinear relationships between climate, landscape, and groundwater systems — giving us both scientific insight and stronger predictive capability.