What is supervised learning?

Supervised learning and unsupervised learning are two major branches of machine learning that differ in how they use data.

In supervised learning, we work with data that includes both predictors (inputs) and a known outcome (target). The goal is to learn a relationship between them so we can predict the outcome for new data. For example, predicting house prices based on square footage, number of bedrooms, and location, or classifying emails as “spam” or “not spam.” The model is “supervised” because it learns from examples where the correct answers are already known.

In contrast, unsupervised learning deals with data that has no predefined outcome. The goal is to uncover hidden patterns or structures within the data. Examples include grouping customers into segments based on purchasing behavior (clustering) or reducing many correlated variables into fewer informative components (principal component analysis, or PCA).

In this workshop, we will focus only on supervised learning, specifically regression (predicting continuous outcomes), as these methods form the foundation for most modern predictive modeling tasks.

Dataset Description

The dataset used in this workshop summarizes long-term hydrologic, climatic, and environmental characteristics of Arctic and sub-Arctic watersheds from 1984–2018. Each record represents a stream or river reach with associated climate, permafrost, and landscape information.

The primary variable of interest is the mean groundwater contribution fraction (bfi_mean). It measures how much of the stream’s flow comes from groundwater rather than from rain or snowmelt running over the land surface (surface flow). Higher values mean the stream is fed mostly by groundwater, while lower values mean it is fed mainly by surface flow.

A second response variable, bfi_perchangeperyr, quantifies the annual trend in groundwater contribution over the same period — showing whether baseflow is increasing or decreasing through time.

Predictor variables (31) include a diverse set of watershed attributes:

  • Climatic factors: mean annual temperature, precipitation, vapor pressure deficit, and radiation (each with long-term trends).
  • Landscape and topography: elevation, slope, stream order, reach length, and drainage area.
  • Permafrost and soil properties: permafrost type, active layer thickness, soil permeability, clay and carbon content.
  • Geographic variables: latitude, longitude, and aspect.

Road Map

Step                       Purpose                        Example
1. Define the question     What do we want to predict?    Groundwater contribution (bfi_mean)
2. Prepare the data        Clean, transform, and split    Training vs. testing data
3. Build models            Try different algorithms       Linear, LASSO, ENet, Trees
4. Evaluate performance    Measure accuracy               RMSE, R²
5. Interpret & apply       Understand key drivers         Which predictors matter?

Loading libraries

library(glmnet)        # for LASSO / Elastic Net
library(rpart)         # for decision trees
library(rpart.plot)    # tree plotting
library(ipred)         # for bagging
library(randomForest)  # for random forest
library(xgboost)       # for boosting
library(caret)         # for data splitting & RMSE calculation
library(dplyr)         # for the %>% pipe operator and data manipulation

Data Cleaning and Loading

Data cleaning is a mandatory and often time-consuming step in any data-driven analysis. It involves identifying and correcting errors, handling inconsistent formats, detecting and addressing outliers, and managing missing or implausible values through methods such as imputation or removal. A properly cleaned dataset ensures that models are built on reliable and interpretable information rather than noise.

For the purpose of this workshop, we will skip the data cleaning stage so that we can focus on modeling concepts and practical applications. You are provided with a pre-cleaned dataset, which has been prepared in advance to remove missing values, correct encoding issues, and standardize variable names. This allows us to devote our session entirely to exploring and comparing supervised learning methods in R.

There are many ways to bring data into R, depending on the file type and source. You can import spreadsheets from Excel, text files such as .csv or .txt, connect directly to databases, access APIs, or even pull data from online repositories.

In this workshop, we will keep things simple by using a pre-cleaned .rds file that you can download beforehand. The .rds format is an R-specific file type that preserves the full structure of your dataset — including variable types and labels — making it easy to load and use with a single command. This ensures everyone works with the same, clean, and consistent dataset without worrying about formatting or import issues.


Loading a downloaded data set

## ---- (1) Load cleaned dataset ----
# If you saved as .rds:
data <- readRDS("Evans_clean.rds")

# or, if you saved as .csv:
# data <- read.csv("Evans_clean.csv")

Look at the structure of the data set

str(data)              # check structure (types, dimensions)
## tibble [9,972 × 35] (S3: tbl_df/tbl/data.frame)
##  $ bfi_mean                       : num [1:9972] 0.485 0.535 0.286 0.49 0.491 0.451 0.408 0.27 0.473 0.438 ...
##  $ site_id                        : num [1:9972] 2.5e+07 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ...
##  $ bfi_perchangeperyr             : num [1:9972] 1.064 0.282 0 0.816 -0.51 ...
##  $ bfi_pval                       : num [1:9972] 0.058 0.164 0.339 0.019 0.15 0.159 0.142 0.106 0.313 0.523 ...
##  $ mean_annual_precip_mm          : num [1:9972] 18.1 17.6 20.4 13.8 15.7 ...
##  $ precip_pct_change_annual       : num [1:9972] 0.78 0.64 0.35 0.64 0.76 0.63 0.93 0.78 1.26 0.89 ...
##  $ mean_annual_snow_cm            : num [1:9972] 347 1000 1000 1000 276 ...
##  $ snow_pct_change_annual         : num [1:9972] 0.01 0 0 0 0 0 0 0 0 0 ...
##  $ mean_annual_temp_deg_c         : num [1:9972] -8.86 -10.43 -7.7 -10.32 -8.15 ...
##  $ temp_pct_change_annual         : num [1:9972] -1.63 -1.32 -1.44 -1.43 -1.95 -1.31 -1.19 -1.32 -1.27 -1.41 ...
##  $ mean_annual_vpd_k_pa           : num [1:9972] 0.31 0.29 0.29 0.28 0.31 0.29 0.26 NA 0.27 0.28 ...
##  $ vpd_pct_change_annual          : num [1:9972] 0.73 0.75 0.6 0.76 0.76 0.65 0.7 NA 0.75 0.72 ...
##  $ mean_annual_r_net_w_m2         : num [1:9972] -0.38 -0.33 -0.12 -0.28 -0.32 0.25 -0.13 NA -0.32 -0.25 ...
##  $ r_net_pct_change_annual        : num [1:9972] -0.37 -0.36 -0.67 -0.38 -0.31 -0.04 -0.91 NA -0.47 -0.57 ...
##  $ magt_deg_c                     : num [1:9972] NA NA NA NA NA ...
##  $ mean_annual_lai                : num [1:9972] 0 0 0 0 0 ...
##  $ centroid_lat_dec_degree        : num [1:9972] 80.1 80.1 79.4 79.7 79.8 ...
##  $ elev_m                         : num [1:9972] 261 399 644 509 247 ...
##  $ aspect_degree                  : num [1:9972] 53.1 252.5 25.9 326.2 260.1 ...
##  $ reach_slope_m_m                : num [1:9972] 0.0233 0.0069 0.0189 0.0051 0.0149 0.0095 0.0222 0.0446 0.0035 0.0181 ...
##  $ upstrm_drainage_area_km2       : num [1:9972] 203 81.8 78.3 66.8 73.3 ...
##  $ reach_length_km                : num [1:9972] 12.85 6.92 28.96 21.79 22.76 ...
##  $ stream_order                   : Ord.factor w/ 8 levels "1"<"2"<"3"<"4"<..: 2 1 1 1 1 2 1 1 2 1 ...
##  $ permafrost_type                : Factor w/ 5 levels "continuous","discontinuous",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ glacier_presence               : Factor w/ 2 levels "no","glacier": 2 2 2 2 2 1 2 1 2 2 ...
##  $ alt_mean_m                     : num [1:9972] NA NA NA NA NA NA NA NA NA NA ...
##  $ alt_pct_change_annual          : num [1:9972] NA NA NA NA NA NA NA NA NA NA ...
##  $ permeability_log10m2           : num [1:9972] -20 -20 -14.1 -20 -20 -15.2 -14.1 -14.1 -20 -20 ...
##  $ bulk_density_cg_cm_3           : num [1:9972] NA NA NA NA NA ...
##  $ coarse_fragment_cm3_dm3        : num [1:9972] NA NA 304 NA NA ...
##  $ clay_content_g_kg              : num [1:9972] NA NA 210 NA NA ...
##  $ soil_organic_carbon_100cm_kg_m2: num [1:9972] 3.8 3.8 3.8 3.8 3.8 3.8 3.8 3.8 3.8 3.8 ...
##  $ centroid_lon                   : num [1:9972] 26.3 24.1 12.4 22.7 19.6 ...
##  $ aspect_sin                     : num [1:9972] 0.799 -0.954 0.436 -0.556 -0.985 ...
##  $ aspect_cos                     : num [1:9972] 0.601 -0.301 0.9 0.831 -0.172 ...

One way to handle missing values

## ---- (2) Define response and predictors ----
# We'll predict mean groundwater contribution (BFI_mean)
# Remove ID or non-predictor columns if still present
# Remove bfi_perchangeperyr, the other response variable
# Remove aspect_degree since it was converted from degrees to sin/cos components (N = 0/360, E = 90)
data <- subset(data, select = -c(bfi_perchangeperyr, aspect_degree, site_id, bfi_pval, centroid_lon))
# OR, equivalently with dplyr:
#data <- data %>%
#  select(
#    -bfi_perchangeperyr,
#    -aspect_degree,
#    -site_id,
#    -bfi_pval,
#    -centroid_lon
#  )

# Drop remaining rows with any missing predictor
data_clean <- na.omit(data)

str(data_clean)  
## tibble [3,083 × 30] (S3: tbl_df/tbl/data.frame)
##  $ bfi_mean                       : num [1:3083] 0.687 0.624 0.614 0.689 0.291 0.707 0.695 0.592 0.579 0.615 ...
##  $ mean_annual_precip_mm          : num [1:3083] 28.4 21.8 25.6 25.1 20.3 ...
##  $ precip_pct_change_annual       : num [1:3083] 0.09 0.13 0.18 0.18 0.33 0.08 0.25 0.42 0.25 0.37 ...
##  $ mean_annual_snow_cm            : num [1:3083] 10.11 5.81 9.11 8.54 5.4 ...
##  $ snow_pct_change_annual         : num [1:3083] -1.73 -2.63 -1.77 -1.82 -2.7 -2.09 -1.96 -0.46 -0.66 0.54 ...
##  $ mean_annual_temp_deg_c         : num [1:3083] -0.13 0.07 -0.63 -0.52 -0.73 0.16 -0.92 -0.29 -0.41 0.36 ...
##  $ temp_pct_change_annual         : num [1:3083] -54.33 84.04 -10.25 -12.45 -7.63 ...
##  $ mean_annual_vpd_k_pa           : num [1:3083] 0.54 0.53 0.52 0.52 0.53 0.52 0.53 0.53 0.51 0.55 ...
##  $ vpd_pct_change_annual          : num [1:3083] 0.38 0.39 0.39 0.37 0.38 0.38 0.4 0.4 0.39 0.39 ...
##  $ mean_annual_r_net_w_m2         : num [1:3083] 0.53 0.44 0.55 0.51 0.53 0.54 0.53 0.54 0.59 0.62 ...
##  $ r_net_pct_change_annual        : num [1:3083] 0.42 0.45 0.34 0.42 0.45 0.28 0.54 0.67 0.57 0.5 ...
##  $ magt_deg_c                     : num [1:3083] 1.772 1.569 1.458 0.456 0.891 ...
##  $ mean_annual_lai                : num [1:3083] 2.61 2.99 1.46 2.68 2.33 ...
##  $ centroid_lat_dec_degree        : num [1:3083] 70.5 70.3 70.3 70.3 70.2 ...
##  $ elev_m                         : num [1:3083] 56.3 29.8 349.5 349.7 183.1 ...
##  $ reach_slope_m_m                : num [1:3083] 0.0072 0.0024 0.0066 0.0031 0.0079 0.0377 0.0011 0.0005 0.0026 0.0036 ...
##  $ upstrm_drainage_area_km2       : num [1:3083] 131.1 889.5 509.1 91.2 60.8 ...
##  $ reach_length_km                : num [1:3083] 4.88 6.37 6.86 7.38 12.67 ...
##  $ stream_order                   : Ord.factor w/ 8 levels "1"<"2"<"3"<"4"<..: 2 2 2 2 1 1 5 2 2 1 ...
##  $ permafrost_type                : Factor w/ 5 levels "continuous","discontinuous",..: 4 4 3 3 3 4 4 4 3 3 ...
##  $ glacier_presence               : Factor w/ 2 levels "no","glacier": 1 1 1 1 1 1 1 1 1 1 ...
##  $ alt_mean_m                     : num [1:3083] 0.575 0.565 0.551 0.567 0.581 ...
##  $ alt_pct_change_annual          : num [1:3083] 3.48 3.54 3.11 3.36 3.36 ...
##  $ permeability_log10m2           : num [1:3083] -12.5 -11.8 -15.2 -15.2 -15.2 -15.2 -14.1 -14.1 -14.1 -14.1 ...
##  $ bulk_density_cg_cm_3           : num [1:3083] 96.1 104.3 110.4 119.3 116.2 ...
##  $ coarse_fragment_cm3_dm3        : num [1:3083] 225 170 234 161 220 ...
##  $ clay_content_g_kg              : num [1:3083] 108.9 134 153.9 138.1 80.7 ...
##  $ soil_organic_carbon_100cm_kg_m2: num [1:3083] 31.5 7.2 14 14 4.6 14 31.5 31.5 31.5 31.5 ...
##  $ aspect_sin                     : num [1:3083] -1 -0.712 -0.6 -0.393 0.974 ...
##  $ aspect_cos                     : num [1:3083] -0.00105 0.7019 0.79968 0.91948 -0.22478 ...
##  - attr(*, "na.action")= 'omit' Named int [1:6889] 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "names")= chr [1:6889] "1" "2" "3" "4" ...
#head(data_clean) # Shows the first 6 rows of the dataset
library(DT)
datatable(head(data_clean, 10), 
          caption = "Preview of the Clean Dataset",
          options = list(pageLength = 5, scrollX = TRUE))
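
Note that dropping every row with a missing value reduces the dataset from 9,972 to 3,083 reaches. A gentler alternative, not used in this workshop, is to impute missing numeric values, for example with each column's median. A minimal sketch (the data_imputed object is hypothetical and not used below):

# Hypothetical alternative to na.omit(): median-impute the numeric columns instead
data_imputed <- data
num_cols <- sapply(data_imputed, is.numeric)
data_imputed[num_cols] <- lapply(data_imputed[num_cols], function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)   # replace NAs with the column median
  col
})
colSums(is.na(data_imputed))   # numeric columns should now report zero NAs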

EDA (not covered in this workshop)

It’s quite common (and good practice) to start with exploratory data analysis (EDA) — including scatterplots, histograms, and correlation matrices — before building models.

Why we explore the data first

  1. Understand relationships

    • Scatterplot matrices (like pairs() or GGally::ggpairs()) help you see which predictors are related to the response.
    • For example, you might notice that mean_annual_temp_deg_c and vpd_pct_change_annual have strong positive relationships with bfi_mean.
  2. Detect multicollinearity

    • Some predictors may be highly correlated (e.g., temperature and vapor pressure deficit).
    • Knowing this helps explain why shrinkage methods like LASSO or ENet behave the way they do.
  3. Spot potential issues

    • Outliers, skewed distributions, or missing values are easier to catch visually.
  4. Build intuition

    • Looking at the raw relationships first builds expectations that make the later model results easier to interpret and sanity-check.

# Quick look at pairwise relationships
library(GGally)

# Select a few key variables for clarity
subset_vars <- data_clean %>%
  select(bfi_mean, mean_annual_temp_deg_c, mean_annual_vpd_k_pa,
         reach_length_km, clay_content_g_kg)

ggpairs(subset_vars)

We can already see that bfi_mean tends to increase with temperature and decrease with clay content — patterns we’ll test more formally using our regression models.


Split data into training and test sets

When we build a predictive model, our goal is not just to explain the data we already have, but to make accurate predictions on new, unseen data. To evaluate how well a model generalizes, we divide the dataset into two parts: a training set and a test set.

The training set is used to fit the model—to learn patterns, relationships, or parameters. The test set, which the model has never seen during training, is then used to assess how well the model performs on new data. This separation helps us estimate the model’s true predictive ability.

Without this split, we risk overfitting, which happens when a model learns the training data too well—including its noise or random fluctuations—rather than the underlying structure. An overfitted model will show excellent performance on the training data but perform poorly on new data. Splitting data into training and test sets provides an honest assessment of a model’s generalization and helps us select models that are both accurate and robust.

## ---- (3) Split into training and test sets (80/20) ----
set.seed(123)
train_idx <- createDataPartition(data_clean$bfi_mean, p = 0.8, list = FALSE)
train <- data_clean[train_idx, ]
test  <- data_clean[-train_idx, ]

nrow(train); nrow(test)
## [1] 2468
## [1] 615


Building models

In this section, we will build a series of regression models to predict the mean groundwater contribution fraction (bfi_mean) using several environmental and hydrologic predictors. Each model represents a different approach to learning the relationship between the predictors and the outcome. We begin with Linear Regression, a simple and interpretable model that assumes a straight-line relationship between the predictors and the response. Next, we extend this idea using LASSO-type regression models, which introduce regularization to automatically select the most important variables and prevent overfitting.

We then explore Decision Trees, which capture nonlinear patterns by splitting the data into smaller, more homogeneous groups based on predictor values. To improve upon a single tree’s instability, we introduce ensemble methods — specifically Bagging, Random Forest, and Boosting — that combine the results of many trees to produce more robust and accurate predictions.

By comparing these models, we will see how different techniques balance simplicity, flexibility, and predictive performance, helping us understand when each approach might be most effective in practical regression problems.


Evaluating Predictive Models

When evaluating predictive models, there are many possible performance measures depending on the type of problem and the modeling goal. For regression models, some commonly used metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²). In classification problems, we often use measures such as accuracy, precision, recall, F1-score, and AUC (Area Under the Curve).

In this workshop, we will focus on two key metrics that are widely used for regression models — RMSE and R-squared (R²) — as they provide complementary perspectives on model quality.

The Root Mean Squared Error (RMSE) measures the average magnitude of prediction errors, showing how far the predicted values are from the observed values on average. It is calculated as the square root of the mean squared difference between actual and predicted values. A smaller RMSE indicates higher predictive accuracy.

The R-squared (R²) value represents the proportion of variability in the outcome variable that can be explained by the model. It ranges from 0 to 1, where values closer to 1 indicate a better fit.

In summary:

  • RMSE quantifies prediction accuracy — lower values are better.

  • R² quantifies explanatory power — higher values are better.

Together, these two measures give us a clear and interpretable summary of how well a model performs on unseen data.
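
As a concrete reference, here is a minimal sketch of how these two metrics are computed from vectors of observed and predicted values; the evaluation code later in the workshop applies exactly these formulas inline (the helper names rmse() and rsq() are illustrative and not used elsewhere):

# Illustrative helpers for the two metrics used throughout
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))   # typical size of the prediction errors
}
rsq <- function(observed, predicted) {
  cor(observed, predicted)^2             # squared correlation of observed vs. predicted
}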


(1) LINEAR REGRESSION - Baseline model

Linear Regression

Linear regression models the relationship between a response variable ( \(y\) ) and one or more predictors ( \(x_1, x_2, \dots, x_p\) ) by fitting a straight line (or hyperplane) that best describes the data.

\[{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}+\epsilon_i\]

where

  • \({y}_i\) = the response for observation ( \(i\) )
  • \(\beta_0\) = intercept
  • \(\beta_j\) = slope for predictor ( \(x_j\) )
  • \(\epsilon_i\) = random error

Optimization Objective

Linear regression estimates the coefficients ( \(\beta_0, \beta_1, \dots, \beta_p\) ) by minimizing the Residual Sum of Squares (RSS) — the total squared difference between observed and predicted values:

\[\min_{\beta_0, \beta_1, \dots, \beta_p} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2\]

This means the model finds the line (or plane) that minimizes the overall squared prediction error.


Key Idea

Linear regression finds the single best-fitting straight line that minimizes the total squared differences between the actual and predicted values.


# visual (ignore the code)
set.seed(1)
x <- 1:20
y <- 2 + 0.5*x + rnorm(20, 0, 2)
fit <- lm(y ~ x)

plot(x, y, pch = 19, col = "steelblue",
     main = "Linear Regression: Minimizing Squared Errors",
     xlab = "Predictor (x)", ylab = "Response (y)")
abline(fit, col = "darkred", lwd = 2)
segments(x, y, x, fitted(fit), col = "gray50", lty = 2)
legend("topleft", legend = c("Data", "Best-fit line", "Residuals"),
       col = c("steelblue", "darkred", "gray50"), pch = c(19, NA, NA),
       lty = c(NA, 1, 2), bty = "n")

Fit the linear regression model using the lm() function

############################################################
## (1) LINEAR REGRESSION - Baseline model
############################################################

fit_lm <- lm(bfi_mean ~ ., data = train)
summary(fit_lm)
## 
## Call:
## lm(formula = bfi_mean ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42464 -0.06336  0.00714  0.07053  0.40838 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.436e-01  2.219e-01   2.450 0.014363 *  
## mean_annual_precip_mm            1.550e-03  1.014e-03   1.530 0.126235    
## precip_pct_change_annual         2.727e-02  1.154e-02   2.364 0.018148 *  
## mean_annual_snow_cm             -1.003e-05  6.311e-05  -0.159 0.873721    
## snow_pct_change_annual          -2.047e-03  3.492e-03  -0.586 0.557791    
## mean_annual_temp_deg_c           9.089e-03  3.490e-03   2.605 0.009257 ** 
## temp_pct_change_annual           1.257e-05  3.021e-04   0.042 0.966809    
## mean_annual_vpd_k_pa             3.431e-01  1.610e-01   2.132 0.033119 *  
## vpd_pct_change_annual            1.424e-01  3.931e-02   3.624 0.000296 ***
## mean_annual_r_net_w_m2           1.669e-02  1.890e-02   0.883 0.377444    
## r_net_pct_change_annual         -1.014e-02  5.841e-03  -1.736 0.082694 .  
## magt_deg_c                      -2.230e-04  2.573e-03  -0.087 0.930934    
## mean_annual_lai                  1.823e-03  2.554e-03   0.714 0.475577    
## centroid_lat_dec_degree         -1.669e-03  2.265e-03  -0.737 0.461293    
## elev_m                           1.236e-05  1.984e-05   0.623 0.533431    
## reach_slope_m_m                  2.821e-01  4.322e-01   0.653 0.514058    
## upstrm_drainage_area_km2         7.426e-08  5.789e-07   0.128 0.897943    
## reach_length_km                 -1.967e-03  2.747e-04  -7.161 1.06e-12 ***
## stream_order.L                   1.922e-02  1.493e-01   0.129 0.897598    
## stream_order.Q                   8.535e-02  1.255e-01   0.680 0.496456    
## stream_order.C                   4.273e-02  8.203e-02   0.521 0.602444    
## stream_order^4                   3.346e-02  4.591e-02   0.729 0.466190    
## stream_order^5                  -1.491e-03  2.385e-02  -0.062 0.950175    
## stream_order^6                  -7.716e-04  1.280e-02  -0.060 0.951932    
## permafrost_typediscontinuous    -2.927e-02  9.283e-03  -3.153 0.001638 ** 
## permafrost_typesporadic         -2.756e-02  1.194e-02  -2.307 0.021142 *  
## permafrost_typeisolated         -6.491e-02  1.402e-02  -4.631 3.83e-06 ***
## permafrost_typenone             -6.695e-02  1.700e-02  -3.938 8.45e-05 ***
## glacier_presenceglacier         -6.546e-02  6.923e-02  -0.946 0.344462    
## alt_mean_m                      -1.248e-02  9.533e-03  -1.309 0.190735    
## alt_pct_change_annual           -4.693e-03  2.059e-03  -2.279 0.022726 *  
## permeability_log10m2             2.222e-03  1.261e-03   1.762 0.078134 .  
## bulk_density_cg_cm_3             5.315e-04  2.957e-04   1.798 0.072370 .  
## coarse_fragment_cm3_dm3          8.957e-05  6.405e-05   1.398 0.162161    
## clay_content_g_kg               -3.165e-04  5.783e-05  -5.473 4.88e-08 ***
## soil_organic_carbon_100cm_kg_m2 -2.471e-04  1.698e-04  -1.456 0.145549    
## aspect_sin                      -3.226e-04  3.250e-03  -0.099 0.920935    
## aspect_cos                       3.872e-03  3.236e-03   1.196 0.231685    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1118 on 2430 degrees of freedom
## Multiple R-squared:  0.2185, Adjusted R-squared:  0.2066 
## F-statistic: 18.36 on 37 and 2430 DF,  p-value: < 2.2e-16

Interpreting the Linear Regression Model

  1. What’s going on with the stream_order variable?

The stream_order variable represents the position of a stream within the river network — where 1st-order streams are the smallest headwater channels and larger orders indicate progressively larger, downstream reaches. Because it encodes a natural hierarchy, stream_order is treated as an ordered factor in the model.

When an ordered factor is used in R, it is represented by a series of orthogonal polynomial contrasts, shown in the regression output as .L, .Q, .C, ^4, and so on. These correspond to linear, quadratic, cubic, and higher-order trends across the ordered levels. This does not make the model nonlinear in the regression sense — the model is still linear in its coefficients — but it allows the effect of stream order on groundwater contribution to follow a smooth or curved pattern rather than being forced to increase or decrease strictly linearly.

In simple terms, these terms let the model detect whether bfi_mean changes steadily with stream size or whether it peaks or dips at intermediate stream orders. This is helpful because hydrologically, mid-order streams may behave differently from very small headwaters or large river channels, and the polynomial contrasts provide a flexible way to capture that pattern while keeping the model interpretable and statistically valid.
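
If you want to see these contrasts directly, you can inspect how R encodes the ordered factor (a quick check, using the training data created earlier):

# Inspect the orthogonal polynomial contrasts R builds for the ordered factor
contrasts(train$stream_order)                       # columns .L, .Q, .C, ^4, ... used in the model
head(model.matrix(~ stream_order, data = train))    # how those columns enter the design matrix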

  2. Model fit and meaning: The overall \(R^2 = 0.22\) indicates that about 22% of the variation in groundwater contribution is explained by the predictors. Although this may appear modest, it is quite strong for environmental and hydrologic data, where natural variability is high. The model is statistically significant (\(p < 2.2 \times 10^{-16}\)), showing that it captures meaningful, non-random structure.

  3. Key predictors and physical interpretation

    • Reach length has a highly significant negative coefficient (\(p < 10^{-12}\)), meaning that longer stream reaches are linked to lower baseflow index values.
    • Clay content is also significantly negative (\(p < 10^{-7}\)), suggesting that clay-rich soils restrict infiltration and reduce groundwater contribution.
    • Mean annual temperature and vapor-pressure-deficit change are positive and significant, implying that warming and drying trends enhance subsurface contributions.
  4. permafrost_type

  • In R, when you include a categorical variable (like permafrost_type) in a linear model, one category is automatically chosen as the reference (baseline). The coefficients for the other categories represent how much higher or lower their predicted bfi_mean is compared to that baseline.

  • If you did not explicitly change the reference, R uses the first level of the factor. Here the levels are ordered (continuous, discontinuous, sporadic, isolated, none), so the baseline is “continuous” (a quick check is shown after this list).

  • Interpret the coefficients relative to the baseline

Here are the coefficients you have:

Category         Coefficient   Interpretation (relative to continuous)
discontinuous    −0.029        About 0.03 lower BFI than streams in continuous permafrost
sporadic         −0.028        About 0.03 lower BFI than continuous permafrost areas
isolated         −0.065        About 0.07 lower BFI than continuous permafrost
none             −0.067        About 0.07 lower BFI than continuous permafrost
  • What this means

    • Because all coefficients are negative, and the baseline (continuous permafrost) is not shown but implicit, this tells us that — in this dataset — streams with less permafrost actually have lower groundwater contributions than those in continuous permafrost regions.

    • That’s counterintuitive from a purely physical standpoint (since frozen ground usually limits infiltration), but statistically, it means that after controlling for temperature, soil, and drainage variables, the model finds lower baseflow fractions in permafrost-free areas.

  5. Unexpectedly weak predictors: Some variables that might seem hydrologically important—such as precipitation, snow depth, or elevation—are not significant once correlated climate and soil effects are accounted for. This highlights how regression isolates each variable’s unique influence after adjusting for the others.

  6. Model behavior and insight: The residual standard error (≈ 0.11) indicates predictions are typically within about 0.1 units of observed values on a 0–1 scale. The results reinforce physically consistent relationships: permafrost continuity, soil texture, and channel geometry emerge as dominant controls on groundwater contributions, while climate trends modulate these effects. This demonstrates how even a relatively simple linear model can uncover meaningful, process-based patterns across a complex landscape.

    The significant F-statistic (p < 2.2e−16) confirms the model, as a whole, fits substantially better than a null (mean-only) model — so you’ve captured meaningful signal, not random noise.
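
As noted in the permafrost_type discussion above, you can confirm (or change) the reference level directly; a quick check:

# The first level listed is the baseline used by lm()
levels(train$permafrost_type)   # "continuous" comes first, so it is the reference category

# To use a different baseline (e.g., "none"), relevel before refitting:
# train$permafrost_type <- relevel(train$permafrost_type, ref = "none")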


Predictions on test data

# Predictions on test data
pred_lm <- predict(fit_lm, newdata = test)

Compute RMSE and \(R^2\) using Test data (Evaluation)

# Evaluation
rmse_lm <- sqrt(mean((test$bfi_mean - pred_lm)^2))
rsq_lm  <- cor(test$bfi_mean, pred_lm)^2

cat("\nLinear Regression RMSE:", round(rmse_lm,3),
    " | R-squared:", round(rsq_lm,3), "\n")
## 
## Linear Regression RMSE: 0.116  | R-squared: 0.152

The linear regression model achieved an RMSE of 0.12 and an R² of 0.152, meaning that, on average, the model’s predictions differ from the observed bfi_mean values by about 0.12 units on a 0–1 scale, while explaining roughly 15% of the variation in groundwater contribution across sites. Although this indicates a moderate level of predictive accuracy, it still captures meaningful large-scale patterns in the data — a solid baseline performance for comparison with more advanced models like LASSO, Elastic Net, and AHRLR that we will explore next.

Shrinkage & Variable Selection Models

As datasets grow larger and more complex, traditional linear regression often struggles to balance accuracy and interpretability. When we have many predictors—especially if some are highly correlated or only weakly related to the outcome—the model can become unstable and overfit the training data. Overfitting means the model captures random noise rather than the true underlying relationships, which reduces its predictive power on new data.

To address these challenges, we use shrinkage and variable selection methods, such as Ridge Regression, LASSO (Least Absolute Shrinkage and Selection Operator), and Elastic Net. These techniques introduce a penalty on the size of the regression coefficients, effectively “shrinking” less important ones toward zero.

Shrinkage helps control model complexity and improves prediction accuracy.

Variable selection (as in LASSO) automatically identifies the most relevant predictors by setting some coefficients exactly to zero, simplifying the model.

In short, shrinkage and selection methods provide a principled way to build models that are more stable, interpretable, and generalizable, especially when dealing with high-dimensional or correlated data.


(2) LASSO REGRESSION

Concept

LASSO (Least Absolute Shrinkage and Selection Operator) is an extension of linear regression that improves model performance when there are many correlated or less useful predictors. It adds a penalty that forces some coefficients to shrink toward zero, which helps simplify the model and prevent overfitting.


Model Equation

\[{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}+\epsilon_i\]

This part is identical to ordinary linear regression — but how we estimate the slopes (\(\beta_j\)) changes.


Optimization Objective

LASSO finds the coefficients that minimize both the squared prediction error and the total absolute size of the coefficients:

\[\min_{\beta_0, \beta_1, \dots, \beta_p} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|\]

The tuning parameter ( \(\lambda\) ) controls how strong the penalty is:

  • A small ( \(\lambda\) ) → coefficients stay close to the linear regression values.
  • A large ( \(\lambda\) ) → coefficients are pulled toward zero, and some become exactly zero.

Visual Aid: Shrinking Coefficients

In a plot of λ (penalty) on the x-axis and coefficient values on the y-axis, you’ll see each coefficient’s line shrink toward zero as λ increases. Once λ is large enough, only a few predictors remain nonzero — the “selected” variables.
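
Once x_train and y_train have been created in the next section, you can draw this coefficient-path plot yourself with glmnet’s built-in plotting (a short optional sketch):

# Coefficient paths: each curve is one predictor's coefficient as the penalty grows
lasso_path <- glmnet(x_train, y_train, alpha = 1)
plot(lasso_path, xvar = "lambda", label = TRUE)   # coefficients shrink toward zero as log(lambda) increases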


Key Idea (in plain English)

LASSO works like linear regression with a self-control mechanism: it still fits a line through the data, but it automatically shrinks or removes predictors that contribute little to the model. The result is a simpler, more stable, and interpretable model.


Create model matrix (glmnet requires matrix inputs)

# Create model matrix (glmnet requires matrix inputs)
x_train <- model.matrix(bfi_mean ~ ., data = train)[, -1]
y_train <- train$bfi_mean
x_test  <- model.matrix(bfi_mean ~ ., data = test)[, -1]
y_test  <- test$bfi_mean

Cross-validated LASSO to pick best lambda

# Cross-validated LASSO to pick best lambda
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 5)
plot(cv_lasso)    # shows cross-validated MSE vs log(lambda)

best_lambda <- cv_lasso$lambda.min
best_lambda
## [1] 0.00997893

The plot from cv_lasso shows the cross-validated error (mean squared error, by default, for a continuous response) for a range of penalty values (\(\lambda\)) used in the LASSO model. The curve usually dips to a minimum and then rises again as the penalty becomes too strong.

The value best_lambda ≈ 0.01 represents the point where the model achieves the lowest cross-validation error, balancing bias and variance. In practical terms, this means the LASSO model with a penalty of about 0.01 provides the best trade-off between model simplicity (fewer variables) and predictive accuracy on unseen data.
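
cv.glmnet also reports lambda.1se, the largest penalty whose cross-validated error is within one standard error of the minimum. It is a common, more conservative choice when you prefer an even sparser model (shown here for reference only; the workshop continues with lambda.min):

# A more conservative alternative penalty (not used in the rest of the workshop)
cv_lasso$lambda.1se
# fit_lasso_1se <- glmnet(x_train, y_train, alpha = 1, lambda = cv_lasso$lambda.1se)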

Fit final model at best_lambda

# Fit final model at optimal lambda
fit_lasso <- glmnet(x_train, y_train, alpha = 1, lambda = best_lambda)
coef(fit_lasso)
## 39 x 1 sparse Matrix of class "dgCMatrix"
##                                            s0
## (Intercept)                      6.001176e-01
## mean_annual_precip_mm            .           
## precip_pct_change_annual         .           
## mean_annual_snow_cm              .           
## snow_pct_change_annual           .           
## mean_annual_temp_deg_c           7.252624e-03
## temp_pct_change_annual           .           
## mean_annual_vpd_k_pa             5.347300e-02
## vpd_pct_change_annual            .           
## mean_annual_r_net_w_m2           .           
## r_net_pct_change_annual          .           
## magt_deg_c                       .           
## mean_annual_lai                  .           
## centroid_lat_dec_degree          .           
## elev_m                           .           
## reach_slope_m_m                  .           
## upstrm_drainage_area_km2         .           
## reach_length_km                 -9.407050e-04
## stream_order.L                   .           
## stream_order.Q                   3.141864e-02
## stream_order.C                   .           
## stream_order^4                   .           
## stream_order^5                   .           
## stream_order^6                   .           
## stream_order^7                   .           
## permafrost_typediscontinuous     .           
## permafrost_typesporadic          .           
## permafrost_typeisolated          .           
## permafrost_typenone              .           
## glacier_presenceglacier          .           
## alt_mean_m                       .           
## alt_pct_change_annual            .           
## permeability_log10m2             2.494306e-04
## bulk_density_cg_cm_3             .           
## coarse_fragment_cm3_dm3          .           
## clay_content_g_kg               -8.999784e-05
## soil_organic_carbon_100cm_kg_m2  .           
## aspect_sin                       .           
## aspect_cos                       .

Interpreting the LASSO Model

  1. LASSO dramatically simplified the model. Out of nearly 40 candidate predictor terms, the LASSO retained only six variables:

    • mean_annual_temp_deg_c
    • mean_annual_vpd_k_pa
    • reach_length_km
    • stream_order.Q
    • permeability_log10m2
    • clay_content_g_kg

    All other coefficients were shrunk exactly to zero. This illustrates LASSO’s strength — it performs automatic variable selection, removing predictors that add little to the model’s predictive power and leaving a concise, interpretable subset.
  2. The retained variables make strong physical sense

    • Temperature and VPD describe climatic influence on evapotranspiration and recharge.

    • Reach length and stream order (quadratic) describe channel network geometry — the “plumbing” of the landscape.

    • Permeability and clay content capture subsurface flow capacity and infiltration limitations.

      LASSO thus bridges the major hydrologic dimensions — climate, landscape, and subsurface properties — all without manual variable selection.

  3. The signs of the coefficients align with hydrologic theory

    • Clay content (−) → higher clay = lower infiltration and reduced groundwater contribution.
    • Reach length (−) → longer channels show less groundwater influence.
    • Temperature and VPD (+) → warmer and drier regions exhibit higher groundwater fraction, possibly due to thawing or altered recharge pathways.
  4. WOW takeaway: The LASSO reduced the model from 38 candidate predictor terms to just six while preserving physically meaningful relationships. It automatically identified the dominant drivers — the same ones an expert might choose — through a purely data-driven learning process. This demonstrates how regularization can reveal the essential structure in complex environmental data while improving model stability and interpretability.
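
You can also extract the selected variables programmatically instead of reading them off the coefficient printout (a small sketch using the sparse coefficient matrix above):

# List the predictors with nonzero LASSO coefficients (excluding the intercept)
lasso_coefs <- as.matrix(coef(fit_lasso))
selected    <- rownames(lasso_coefs)[lasso_coefs[, 1] != 0]
setdiff(selected, "(Intercept)")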


Predict using test data and evaluate

# Predict and evaluate
pred_lasso <- predict(fit_lasso, newx = x_test)

Compute RMSE and \(R^2\) using Test data (Evaluation)

rmse_lasso <- sqrt(mean((y_test - pred_lasso)^2))
rsq_lasso  <- cor(y_test, pred_lasso)^2

cat("\nLASSO RMSE:", round(rmse_lasso,3),
    " | R-squared:", round(rsq_lasso,3), "\n")
## 
## LASSO RMSE: 0.117  | R-squared: 0.14

The LASSO model achieved an RMSE of 0.117 and an R² of 0.14, which is very close to the linear regression model’s performance but with only a fraction of the predictors. This shows that LASSO delivers nearly the same predictive accuracy while producing a much simpler and more interpretable model. In practice, this balance between simplicity and accuracy is exactly what makes regularization methods like LASSO so valuable — they generalize better, reduce overfitting, and highlight the most important variables driving groundwater contributions.


(3) ELASTIC NET REGRESSION

The idea

Elastic Net combines the strengths of Ridge and LASSO regression. Like LASSO, it can shrink some coefficients to zero (performing variable selection), and like Ridge, it handles correlated predictors more smoothly by sharing their effects instead of forcing one to drop out.

This makes Elastic Net a great middle ground between the two methods — it balances model simplicity and stability.


Model Equation

\[{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}+\epsilon_i\]


Optimization Objective

\[\min_{\beta_0, \beta_1, \dots, \beta_p} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \left[ (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 + \alpha \sum_{j=1}^{p} |\beta_j|\right]\]

Here:

  • \(\lambda\) controls the overall penalty strength (like before).

  • \(\alpha\) controls the mix between Ridge (\(L_2\)) and LASSO (\(L_1\)):

  • \(\alpha = 0\) → Ridge Regression

  • \(\alpha = 1\) → LASSO Regression

  • \(0 < \alpha < 1\) → Elastic Net (a blend of both)


Key idea

Elastic Net is a hybrid shrinkage method that combines the variable selection of LASSO with the stability of Ridge. It’s especially effective when predictors are highly correlated or when there are many more predictors than observations (p > n).

Create model matrix (glmnet requires matrix inputs)

# Already done

Cross-validated Enet to pick best lambda

# Cross-validated ENET to pick best lambda
cv_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, nfolds = 5)
plot(cv_enet)    # shows cross-validated MSE vs log(lambda)

best_lambda_enet <- cv_enet$lambda.min
best_lambda_enet
## [1] 0.02190373
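
Here alpha is fixed at 0.5, an equal blend of the Ridge and LASSO penalties. In practice you might also tune alpha. Below is a minimal, optional sketch (not used in the rest of the workshop) that compares a few alpha values with the same cross-validation folds and keeps the one with the lowest cross-validated error:

# Optional: tune the Ridge/LASSO mixing parameter alpha in addition to lambda
set.seed(123)
foldid  <- sample(rep(1:5, length.out = nrow(x_train)))   # same folds for every alpha
alphas  <- c(0.1, 0.3, 0.5, 0.7, 0.9)
cv_fits <- lapply(alphas, function(a) cv.glmnet(x_train, y_train, alpha = a, foldid = foldid))
cv_errs <- sapply(cv_fits, function(cv) min(cv$cvm))       # best CV error for each alpha
alphas[which.min(cv_errs)]                                 # alpha with the lowest CV error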

Fit final model at best_lambda_enet

# Fit final model at optimal lambda
fit_enet <- glmnet(x_train, y_train, alpha = 0.5, lambda = best_lambda_enet)
coef(fit_enet)
## 39 x 1 sparse Matrix of class "dgCMatrix"
##                                            s0
## (Intercept)                      5.473688e-01
## mean_annual_precip_mm            .           
## precip_pct_change_annual         .           
## mean_annual_snow_cm              .           
## snow_pct_change_annual           .           
## mean_annual_temp_deg_c           4.687992e-03
## temp_pct_change_annual           .           
## mean_annual_vpd_k_pa             1.386127e-01
## vpd_pct_change_annual            .           
## mean_annual_r_net_w_m2           .           
## r_net_pct_change_annual          .           
## magt_deg_c                       9.541105e-05
## mean_annual_lai                  .           
## centroid_lat_dec_degree          .           
## elev_m                           .           
## reach_slope_m_m                  .           
## upstrm_drainage_area_km2         .           
## reach_length_km                 -7.722631e-04
## stream_order.L                   .           
## stream_order.Q                   2.479099e-02
## stream_order.C                  -2.757078e-03
## stream_order^4                   .           
## stream_order^5                   .           
## stream_order^6                   .           
## stream_order^7                   .           
## permafrost_typediscontinuous     .           
## permafrost_typesporadic          .           
## permafrost_typeisolated          .           
## permafrost_typenone              .           
## glacier_presenceglacier          .           
## alt_mean_m                       .           
## alt_pct_change_annual            .           
## permeability_log10m2             5.976816e-04
## bulk_density_cg_cm_3             .           
## coarse_fragment_cm3_dm3          .           
## clay_content_g_kg               -7.396067e-05
## soil_organic_carbon_100cm_kg_m2  .           
## aspect_sin                       .           
## aspect_cos                       .

Interpreting the Elastic Net (ENet) Model

  1. Elastic Net kept a broader but stable set of predictors
    The ENet model retained eight nonzero coefficients, blending climate, landscape, and soil factors:

    • mean_annual_temp_deg_c
    • mean_annual_vpd_k_pa
    • magt_deg_c
    • reach_length_km
    • stream_order.Q
    • stream_order.C
    • permeability_log10m2
    • clay_content_g_kg

    This shows that Elastic Net preserved more information than LASSO, especially among correlated climate and stream geometry variables, while still shrinking weaker predictors toward zero.


  2. The retained predictors make physical sense
    • Temperature and vapor pressure deficit (VPD) — both positive — indicate that warmer and drier catchments tend to have greater groundwater influence, possibly due to increased thaw and recharge.
    • Mean annual ground temperature (magt_deg_c), though small in magnitude, likely captures local permafrost thermal effects.
    • Reach length (−) and clay content (−) reduce groundwater contributions, consistent with longer flow paths and lower infiltration capacity.
    • Permeability (positive) reinforces that higher-conductivity soils enhance baseflow contribution.
    • Stream order (quadratic and cubic contrasts) capture curved or non-linear patterns in how groundwater contribution changes across the river network. Instead of assuming that larger streams always have more or less groundwater input, these terms allow the model to represent a hump-shaped relationship — where mid-sized streams (not the smallest headwaters or largest rivers) often show the strongest interaction between surface and groundwater. This reflects the idea that intermediate-order streams tend to be the most hydrologically connected zones in a watershed.

  3. Why this model is impressive
    ENet’s hybrid penalty allowed correlated variables (like temperature, VPD, and permafrost proxies) to share explanatory power rather than forcing one to dominate.
    It effectively balances LASSO’s sparsity and Ridge’s stability, producing a physically interpretable model that captures climate–hydrology–soil interactions without overfitting.

  4. WOW takeaway
    Elastic Net automatically identified a multi-process balance between climate, soil, and channel geometry — all consistent with hydrologic theory.
    It’s more comprehensive than LASSO yet more stable than ordinary least squares, showing how moderate regularization can reveal nuanced relationships in complex environmental systems.

Predict using test data and evaluate

# Predict and evaluate
pred_enet <- predict(fit_enet, newx = x_test)

Compute RMSE and \(R^2\) using Test data (Evaluation)

rmse_enet <- sqrt(mean((y_test - pred_enet)^2))
rsq_enet <- cor(y_test, pred_enet)^2

cat("\nENET RMSE:", round(rmse_enet,3),
    " | R-squared:", round(rsq_enet,3), "\n")
## 
## ENET RMSE: 0.117  | R-squared: 0.14

The Elastic Net model achieved predictive accuracy comparable to the LASSO model (RMSE ≈ 0.117, R² ≈ 0.14), while maintaining a more stable solution in the presence of correlated predictors. This highlights Elastic Net’s advantage — it balances model simplicity and robustness, capturing key relationships without overfitting or losing important correlated variables.


(4) Adaptive Hybrid Relaxed LASSO Regression (AHRLR)

The idea

AHRLR builds on the strengths of LASSO, Adaptive LASSO, and Relaxed LASSO to achieve better variable selection and lower bias. It is a two-stage procedure designed to:

  1. Select important predictors (like LASSO), and
  2. Refit them with adaptive weights and reduced shrinkage (relaxation), improving estimation accuracy and model stability.

This makes AHRLR both sparse (many coefficients are zero) and less biased for the variables that remain.


Model Equation

\[{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}+\epsilon_i\]


Optimization Objective

AHRLR solves the following adaptive and relaxed penalized least squares problem:

\[\min_{\beta_0, \beta_1, \dots, \beta_p} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2+ \lambda_1 \sum_{j=1}^{p} w_j |\beta_j|+ \lambda_2 (1 - \phi) \sum_{j=1}^{p} \beta_j^2\]

where:

  • \(w_j = 1 / |\hat{\beta}_j^{(init)}|^{\gamma}\) are adaptive weights based on an initial estimate (e.g., OLS or LASSO).
  • \(\lambda_1\) controls sparsity (LASSO-type penalty).
  • \(\lambda_2\) and \(\phi\) control the relaxation and bias adjustment.

This two-stage process allows flexible control over selection and shrinkage strength.


Visual Aid: Two-stage learning

You can visualize AHRLR as a two-step process:

  1. Stage 1: Apply LASSO to identify important predictors.
  2. Stage 2: Refit those predictors with adaptive weights and relaxed penalties, giving them room to “breathe” — stronger effects shrink less, weaker ones may vanish.

This results in a model that is both parsimonious and accurate.
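
The fit_AHRLR() helper used below implements this procedure for the workshop. Purely as a conceptual illustration of the Stage 2 adaptive-weighting idea (and not the fit_AHRLR implementation itself, nor the relaxation step), a sketch using glmnet’s penalty.factor argument, which assigns each coefficient its own penalty weight, might look like this:

# Conceptual two-stage sketch (illustration only; fit_AHRLR() below is the method used here)
# Stage 1: ordinary LASSO to obtain initial coefficient estimates
init_fit   <- cv.glmnet(x_train, y_train, alpha = 1)
init_betas <- as.numeric(coef(init_fit, s = "lambda.min"))[-1]   # drop the intercept

# Stage 2: refit with adaptive weights w_j = 1 / |beta_j_init| (gamma = 1);
# strong initial effects get smaller penalties, weak ones get larger penalties
w <- 1 / pmax(abs(init_betas), 1e-4)          # floor avoids division by zero
adaptive_fit <- cv.glmnet(x_train, y_train, alpha = 1, penalty.factor = w)
coef(adaptive_fit, s = "lambda.min")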


Key idea

AHRLR combines adaptive weighting and relaxation to reduce LASSO’s bias while keeping its ability to select variables. It bridges the gap between model simplicity and unbiased estimation, making it particularly effective for data with correlated predictors or mixed signal strengths.


# Quick Start ----------------------------

# 1. Load AHRLR functions
source("https://hasthika.github.io/Rpacks/ahrlr_fit.txt")

# 2. Fit model
fit <- fit_AHRLR(x_train, y_train, lambda2 = 1)

# 3. Inspect selected variables
fit$selected_vars
## [1] "precip_pct_change_annual" "mean_annual_temp_deg_c"  
## [3] "mean_annual_vpd_k_pa"     "vpd_pct_change_annual"   
## [5] "reach_length_km"          "stream_order.Q"          
## [7] "stream_order^5"           "permafrost_typeisolated" 
## [9] "clay_content_g_kg"
# 4. Predict on new data
pred_ahrlr <- predict_AHRLR(fit, newx = x_test)

# 5. Optional summary
summarize_AHRLR(fit, X_test = x_test, y_test = y_test)
## AHRLR model summary
## -------------------
## lambda.min: 0.1316359 
## Selected variables (9):
##   precip_pct_change_annual, mean_annual_temp_deg_c, mean_annual_vpd_k_pa, vpd_pct_change_annual, reach_length_km, stream_order.Q, stream_order^5, permafrost_typeisolated, clay_content_g_kg 
## Test RMSE: 0.1163

Interpreting the AHRLR Model

  1. AHRLR identified a comprehensive yet stable set of predictors
    The Adaptive Hybrid Relaxed Lasso Regression (AHRLR) retained nine meaningful variables, expanding slightly beyond LASSO and ENet:

    • precip_pct_change_annual
    • mean_annual_temp_deg_c
    • mean_annual_vpd_k_pa
    • vpd_pct_change_annual
    • reach_length_km
    • stream_order.Q
    • stream_order^5
    • permafrost_typeisolated
    • clay_content_g_kg

    Compared with LASSO and ENet, AHRLR recovers a slightly broader, more balanced set of predictors — capturing both climatic and physical watershed controls while avoiding over-shrinkage.


  2. Why this is powerful
    AHRLR’s two-stage design first performs variable selection (like LASSO) and then relaxes the shrinkage with adaptive weights.
    This means strong signals are shrunk less, and weaker but still relevant predictors can re-enter the model.
    As a result, AHRLR often recovers variables that LASSO or ENet penalized too heavily, providing a richer yet still stable model.

  3. Hydrologic interpretation
  • Climate effects:
    Changes in precipitation, temperature, and vapor pressure deficit (VPD) highlight the influence of long-term climate trends on groundwater contribution.

  • Landscape effects:
    Variables like reach_length_km and stream_order describe basin size and connectivity, controlling how surface and subsurface flows interact.

  • Subsurface effects:
    permafrost_typeisolated and clay_content_g_kg reflect groundwater movement capacity — regions with more clay or isolated permafrost show reduced groundwater inputs.

Together, these findings suggest that climate, topography, and soil composition jointly determine how much groundwater sustains streamflow.


  4. WOW takeaway
    AHRLR bridges the gap between the overly sparse LASSO/ENet models and the fully complex regression.
    It preserves interpretability while improving bias and stability — yielding a physically coherent, multi-process model that connects climate variability, landscape structure, and soil permeability.
    This balance of parsimony and completeness is what makes AHRLR especially powerful for environmental modeling.

rmse_AHRLR <- sqrt(mean((y_test - pred_ahrlr)^2))
rsq_AHRLR <- cor(y_test, pred_ahrlr)^2

cat("\nAHRLR RMSE:", round(rmse_AHRLR, 3),
    " | R-squared:", round(rsq_AHRLR, 3), "\n")
## 
## AHRLR RMSE: 0.116  | R-squared: 0.147

The AHRLR model achieved an RMSE of 0.116 and an R² of 0.147, performing about the same as the linear regression and LASSO models in terms of predictive accuracy. What makes AHRLR stand out is that it keeps slightly more meaningful variables while still maintaining a low error rate. This means it captures a broader picture of the system — combining climate, landscape, and soil effects — without losing predictive strength.
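
Before moving to tree-based models, it can be helpful to gather the test-set metrics computed so far into one table for a side-by-side view (a small sketch using the objects already created above):

# Collect the test-set metrics computed so far into a single comparison table
results_so_far <- data.frame(
  Model = c("Linear", "LASSO", "Elastic Net", "AHRLR"),
  RMSE  = as.numeric(c(rmse_lm, rmse_lasso, rmse_enet, rmse_AHRLR)),
  R2    = as.numeric(c(rsq_lm, rsq_lasso, rsq_enet, rsq_AHRLR))
)
results_so_far[order(results_so_far$RMSE), ]   # lowest test RMSE first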


Transition to Tree-Based Models

While linear, Ridge, LASSO, and Elastic Net regressions are powerful and interpretable, they all assume a linear relationship between predictors and the response. In many real-world situations, that assumption can be too limiting — relationships may be nonlinear, involve interactions among variables, or change across different ranges of the predictors.

To capture these complex patterns, we turn to tree-based models, which divide the data into smaller, more homogeneous groups based on predictor values. Each split represents a simple decision rule, and together these rules form a tree-like structure that is easy to interpret.

There are two main types of tree-based models:

  • Regression Trees, used when the outcome variable is continuous (e.g., predicting groundwater contribution).

  • Classification Trees, used when the outcome variable is categorical (e.g., predicting whether a stream is glacier-fed or not).

In this workshop, we will focus on regression trees to stay within the continuous-response setting, then extend the idea to more advanced ensemble methods like Bagging, Random Forest, and Boosting to improve predictive performance and stability.

(5) DECISION TREE (A Regression Tree In This Case) - Interpretable nonlinear model

fit_tree <- rpart(bfi_mean ~ ., data = train,
                  control = rpart.control(cp = 0.01, maxdepth = 6))
rpart.plot(fit_tree, type = 2, extra = 101, under = TRUE, faclen = 0)


Interpreting the Regression Tree

This regression tree predicts the mean groundwater contribution (bfi_mean) using climate, location, and watershed variables.
Each split represents a decision rule that divides the data into groups with more similar groundwater contributions.
The predicted value at each terminal node (yval) is the average bfi_mean for sites that meet all conditions along that branch.


1. Top split — Mean annual temperature

The first and most important division occurs at
mean_annual_temp_deg_c < -9.545, separating very cold (Arctic) regions from warmer (sub-Arctic to temperate) ones.
This confirms that temperature is the dominant control on groundwater contribution across the study area.

  • Cold regions (≤ −9.5 °C): average groundwater contribution ≈ 0.50
  • Warmer regions (> −9.5 °C): average groundwater contribution ≈ 0.58

Warmer basins tend to have greater groundwater influence, consistent with permafrost thaw and deeper infiltration.


2. Cold branch (≤ −9.5 °C)

Within the cold group, variation depends on net radiation trend and drainage area.

  • High net radiation trend (r_net_pct_change_annual ≥ 0.385) and very low ground temperatures (magt_deg_c < −6.3)
    → extremely low groundwater contributions (≈ 0.35–0.38).
    These are the coldest, most frozen landscapes with minimal subsurface flow.

  • Cold areas with weaker radiation increase and larger drainage areas (≥ 37 km²)
    → moderately higher groundwater values (≈ 0.52).

  • Small, cold basins (< 37 km²)
    → the highest BFI in the cold branch (≈ 0.59), suggesting that small, well-connected headwaters can still maintain notable groundwater input when not fully frozen.


3. Warmer branch (> −9.5 °C)

For warmer watersheds, drainage area again drives the next major split:

  • Larger basins (≥ 42 km²) dominate the warm branch.
    Within them, temperature and latitude refine predictions:
    • Moderately cold sub-basins (−9.5 °C < T < −0.1 °C) have BFI ≈ 0.55.
    • Lower-latitude, snow-rich sites (e.g., mean_annual_snow_cm ≥ 4 cm) show distinct contrasts:
      • When VPD < 0.53 kPa, groundwater fraction ≈ 0.37 (drier air, less infiltration).
      • When VPD ≥ 0.53 kPa, groundwater fraction rises to ≈ 0.52.
    • High-latitude, slightly warmer zones (T > −3.9 °C) reach the highest values (~0.60), reflecting enhanced thaw-related recharge.
  • Smaller warm basins (< 42 km²)
    split on temperature again:
    • Cooler small basins (T < −4.8 °C) have moderate BFI (~0.60).
    • Warmer small basins (T ≥ −4.8 °C) reach the highest BFI (~0.69), showing that small, warm catchments allow strong groundwater exchange.

4. Overall pattern

The tree highlights a hierarchical climate–basin size control:

Dominant Condition                      Typical BFI   Interpretation
Very cold (≤ −9.5 °C), frozen soils     0.35–0.45     Minimal groundwater contribution
Cold, larger basins                     0.50–0.53     Slightly increased baseflow buffering
Moderate temperature, large basins      0.55–0.60     Sustained groundwater flow
Warm, small basins                      0.65–0.70     Strongest groundwater–surface exchange

In short:

The regression tree reveals that temperature and basin size are the two most powerful predictors of groundwater contribution, with additional refinements from net radiation, snow, and latitude.
Groundwater influence increases systematically with warming and decreasing permafrost control, particularly in smaller, low-latitude catchments.
This tree provides an interpretable map of how climate and watershed geometry together shape baseflow across the Arctic landscape.
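
To confirm which predictors the tree relied on most heavily, rpart stores an overall variable-importance score for each predictor (a quick check on the fitted tree):

# Overall importance of each predictor, summed across the tree's splits
round(fit_tree$variable.importance, 3)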

# Predict and evaluate
pred_tree <- predict(fit_tree, newdata = test)
rmse_tree <- sqrt(mean((test$bfi_mean - pred_tree)^2))
rsq_tree  <- cor(test$bfi_mean, pred_tree)^2

cat("\nDecision Tree RMSE:", round(rmse_tree,3),
    " | R-squared:", round(rsq_tree,3), "\n")
## 
## Decision Tree RMSE: 0.104  | R-squared: 0.32

Why Go Beyond a Single Tree?

Decision trees are powerful because they are easy to interpret and can capture nonlinear relationships automatically. However, a single tree can be unstable — small changes in the data can lead to a very different tree structure — and its predictions may not be as accurate as more robust methods. This happens because trees have high variance: they tend to fit the training data closely but may not generalize well to new observations.

To overcome these limitations, we use ensemble methods, which combine the predictions of many trees to create a more stable and accurate model.

  • Bagging (Bootstrap Aggregating) reduces variance by building many trees on random bootstrap samples of the data and averaging their predictions.

  • Random Forest extends bagging by adding another layer of randomness — at each tree split, only a random subset of predictors is considered — which further decorrelates the trees and improves performance.

  • Boosting takes a different approach by building trees sequentially, where each new tree focuses on correcting the errors made by the previous ones.

Together, these ensemble techniques make tree-based models more accurate, stable, and resistant to overfitting, while still retaining the ability to model complex, nonlinear relationships.
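
The instability of a single tree is easy to demonstrate before moving on. The short sketch below (illustrative only; it assumes the train data frame from earlier) refits the same tree specification on two bootstrap resamples and compares the variable chosen for the very first split, which often changes from one resample to the next.

# Fit the same tree on two bootstrap resamples of the training data
set.seed(1)
boot1 <- train[sample(nrow(train), replace = TRUE), ]
boot2 <- train[sample(nrow(train), replace = TRUE), ]

tree1 <- rpart(bfi_mean ~ ., data = boot1, method = "anova")
tree2 <- rpart(bfi_mean ~ ., data = boot2, method = "anova")

# Variable used at the root split of each tree - high variance shows up
# as different root variables (or thresholds) across resamples
tree1$frame$var[1]
tree2$frame$var[1]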

(6) BAGGING - Averaging many bootstrap trees

Bagging (Bootstrap Aggregating) is a simple but powerful way to improve the stability and accuracy of decision trees. The idea is to create many bootstrap samples — random samples of the training data taken with replacement — and fit a separate tree to each one. Each tree may look a little different because it sees a slightly different subset of the data.

When making predictions, we average the results from all these trees (or take a majority vote for classification). Averaging helps cancel out random fluctuations and reduces the variance of the model, leading to more reliable predictions.

However, the main disadvantage of bagging is that it sacrifices interpretability — since predictions come from the average of many trees, it’s no longer clear how individual predictors influence the outcome. In other words, bagging gives up some transparency in exchange for greater accuracy and stability.

In short:

Bagging = Build many trees on random samples + average their predictions. It smooths out noisy individual trees to produce a more stable and accurate ensemble model, though at the cost of losing a simple, single-tree interpretation.
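
Under the hood, bagging is nothing more than a loop over bootstrap samples. The ipred::bagging() call in the next chunk does this for us, but a minimal hand-rolled version (a sketch only, using the train and test objects from earlier and an arbitrary 25 trees) makes the averaging step explicit.

# Hand-rolled bagging: one flexible tree per bootstrap sample,
# predictions averaged across trees on the test set
set.seed(123)
n_trees <- 25
preds   <- matrix(NA, nrow = nrow(test), ncol = n_trees)

for (b in 1:n_trees) {
  idx  <- sample(nrow(train), replace = TRUE)        # bootstrap sample
  tr_b <- rpart(bfi_mean ~ ., data = train[idx, ],
                method = "anova",
                control = rpart.control(cp = 0.001)) # grow a deep tree
  preds[, b] <- predict(tr_b, newdata = test)
}

pred_bag_manual <- rowMeans(preds)                   # ensemble average
sqrt(mean((test$bfi_mean - pred_bag_manual)^2))      # RMSE of the hand-rolled ensemble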

############################################################
## (6) BAGGING - Averaging many bootstrap trees
############################################################

fit_bag <- bagging(bfi_mean ~ ., data = train, nbagg = 50)  # 50 bootstraps
pred_bag <- predict(fit_bag, newdata = test)

rmse_bag <- sqrt(mean((test$bfi_mean - pred_bag)^2))
rsq_bag  <- cor(test$bfi_mean, pred_bag)^2

cat("\nBagging RMSE:", round(rmse_bag,3),
    " | R-squared:", round(rsq_bag,3), "\n")
## 
## Bagging RMSE: 0.097  | R-squared: 0.405

Interpreting the Bagging Model

The Bagging model achieved an RMSE of 0.097 and an R² of 0.405, a clear improvement over the linear and LASSO models (where R² was only about 0.14–0.15).

This means Bagging explains roughly 40% of the variation in groundwater contribution (more than 2.5 times what the linear model explains) and predicts more accurately (lower RMSE).

The improvement happens because Bagging combines results from many decision trees built on different bootstrap samples of the data. By averaging across many trees, the model:

  • Reduces variance (each tree may overfit slightly, but their average is stable),
  • Captures nonlinear relationships that linear models miss.

Bagging outperforms linear methods because it can capture complex, nonlinear patterns in how climate, soils, and landscape interact, not just straight-line effects. The price, once again, is interpretability.

(7) RANDOM FOREST - Bagging + random feature selection

Random Forest builds on the idea of bagging but adds another clever layer of randomness to further improve performance. Like bagging, it creates many bootstrap samples and grows a separate decision tree for each. However, when a tree is being built, the algorithm does not consider all predictors at each split — instead, it selects a random subset of features and chooses the best split only among them.

This extra randomness helps make each tree more unique and less correlated with the others. When the predictions from all trees are averaged (for regression) or voted on (for classification), the result is a model that is typically more accurate and more robust than either a single tree or even plain bagging.

The trade-off is similar to bagging: interpretability is reduced because we now have hundreds of trees instead of one, but we gain excellent predictive power and strong resistance to overfitting.

In short:

Random Forest = Bagging + random feature selection at each split. This approach decorrelates the trees, reduces variance, and usually delivers one of the best out-of-the-box predictive models.
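
The amount of extra randomness is controlled by mtry, the number of predictors tried at each split. The short loop below (illustrative only; the mtry values 2, 5, and 10 are arbitrary choices) uses the forest's built-in out-of-bag (OOB) error to show how sensitive performance is to this setting before we fit the model used in the rest of the workshop.

# Compare out-of-bag error for a few mtry values; the OOB error is an
# internal cross-validation estimate computed from trees that did not
# include each observation in their bootstrap sample
set.seed(42)
for (m in c(2, 5, 10)) {
  rf_m <- randomForest(bfi_mean ~ ., data = train, ntree = 300, mtry = m)
  cat("mtry =", m, " OOB MSE =", round(tail(rf_m$mse, 1), 5), "\n")
}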

############################################################
## (7) RANDOM FOREST - Bagging + random feature selection
############################################################

fit_rf <- randomForest(bfi_mean ~ ., data = train,
                       ntree = 300, mtry = floor(sqrt(ncol(train)-1)),
                       importance = TRUE)

print(fit_rf)
## 
## Call:
##  randomForest(formula = bfi_mean ~ ., data = train, ntree = 300,      mtry = floor(sqrt(ncol(train) - 1)), importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 300
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 0.007708973
##                     % Var explained: 51.05
varImpPlot(fit_rf, main = "Variable Importance (Random Forest)")

# Simplify the Random Forest importance plot
varImpPlot(fit_rf,
           n.var = 10,                # show only top 10 variables
           main = "Top 10 Important Variables (Random Forest)",
           cex = 0.8)                 # shrink label text for readability

pred_rf <- predict(fit_rf, newdata = test)
rmse_rf <- sqrt(mean((test$bfi_mean - pred_rf)^2))
rsq_rf  <- cor(test$bfi_mean, pred_rf)^2

cat("\nRandom Forest RMSE:", round(rmse_rf,3),
    " | R-squared:", round(rsq_rf,3), "\n")
## 
## Random Forest RMSE: 0.089  | R-squared: 0.512

Interpreting Variable Importance (Random Forest)

This plot shows the top 10 most important predictors of groundwater contribution (bfi_mean) identified by the Random Forest model. Each variable’s importance is measured in two ways:

  • %IncMSE (left): how much the model’s prediction error increases when that variable’s values are randomly shuffled. → A bigger value means the variable is more important for accurate predictions.
  • IncNodePurity (right): how much each variable helps reduce variation (impurity) when the trees split on it.

From both measures, we can see that:

  • Upstream drainage area stands out as a leading predictor, consistent with the basin-size splits in the regression tree.
  • Temperature and energy variables (mean_annual_temp_deg_c, r_net_pct_change_annual, and mean_annual_r_net_w_m2) highlight the strong role of climate and surface energy balance in shaping groundwater contributions.

In short:

The Random Forest confirms that larger drainage areas and warmer, more temperate regions tend to have higher groundwater contributions — consistent with what we saw in the tree model and regression results.
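
The numbers behind the importance plot can also be pulled out as a table, which is convenient for reporting. This is a quick sketch using randomForest's importance() accessor on the fit_rf object from above.

# Extract the importance matrix and list the top 10 predictors by %IncMSE
imp <- importance(fit_rf)                    # columns: %IncMSE, IncNodePurity
head(imp[order(imp[, "%IncMSE"], decreasing = TRUE), ], 10)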

(8) GRADIENT BOOSTING - Sequential tree learning

Gradient Boosting is an advanced ensemble method that builds trees sequentially, where each new tree learns from the mistakes of the previous ones. Unlike bagging or random forests, which grow many independent trees and average their results, boosting takes an iterative approach: it starts with a simple model and gradually improves it by focusing on the observations that were hardest to predict.

At each step, a new tree is fit to the residual errors (the difference between the actual and predicted values), and its predictions are combined with the previous model to reduce overall error. This process continues for many rounds, producing a strong predictive model from many weak learners.

The key idea is that each tree “boosts” the performance of the ensemble by correcting earlier errors. The method is highly flexible and powerful, often achieving excellent accuracy. However, it comes with trade-offs — gradient boosting can be computationally intensive and prone to overfitting if too many trees are grown or the learning rate is too high.

In short:

Gradient Boosting = Sequential tree learning where each tree corrects the errors of the previous ones.
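
To make the "fit to the residuals" idea concrete, the sketch below builds a tiny booster by hand from shallow rpart trees (conceptual only; the learning rate, tree depth, and number of rounds are arbitrary choices, and the xgboost model in the next chunk is the practical implementation).

# Minimal gradient boosting for squared error: each round fits a shallow
# tree to the current residuals and adds a damped (eta-scaled) correction
set.seed(7)
eta      <- 0.1                                     # learning rate
n_rounds <- 100
f_train  <- rep(mean(train$bfi_mean), nrow(train))  # start from the mean
f_test   <- rep(mean(train$bfi_mean), nrow(test))

for (m in 1:n_rounds) {
  dat     <- train
  dat$res <- train$bfi_mean - f_train               # current residuals
  stump   <- rpart(res ~ . - bfi_mean, data = dat, method = "anova",
                   control = rpart.control(maxdepth = 2, cp = 0))
  f_train <- f_train + eta * predict(stump, newdata = dat)
  f_test  <- f_test  + eta * predict(stump, newdata = test)
}

sqrt(mean((test$bfi_mean - f_test)^2))              # RMSE of the hand-rolled booster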

############################################################
## (8) GRADIENT BOOSTING - Sequential tree learning
############################################################

# Prepare data matrices for xgboost
x_train <- model.matrix(bfi_mean ~ ., data = train)[, -1]
x_test  <- model.matrix(bfi_mean ~ ., data = test)[, -1]
y_train <- train$bfi_mean
y_test  <- test$bfi_mean

# xgboost requires numeric matrices only
dtrain <- xgb.DMatrix(data = x_train, label = y_train)
dtest  <- xgb.DMatrix(data = x_test,  label = y_test)

# Train model (tweak parameters as needed)
fit_xgb <- xgboost(
  data = dtrain,
  objective = "reg:squarederror",
  nrounds = 300,       # number of boosting iterations
  eta = 0.05,          # learning rate
  max_depth = 5,       # depth of trees
  subsample = 0.8,     # fraction of data for each tree
  colsample_bytree = 0.8,
  verbose = 0
)

# Predict and evaluate
pred_xgb <- predict(fit_xgb, newdata = dtest)
rmse_xgb <- sqrt(mean((y_test - pred_xgb)^2))
rsq_xgb  <- cor(y_test, pred_xgb)^2

cat("\nBoosting (XGBoost) RMSE:", round(rmse_xgb,3),
    " | R-squared:", round(rsq_xgb,3), "\n")
## 
## Boosting (XGBoost) RMSE: 0.091  | R-squared: 0.478
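
As with the random forest, we can ask the boosted model which features mattered most. This is a quick sketch using xgboost's built-in xgb.importance() on the fit_xgb model and the x_train matrix from above.

# Gain-based feature importance from the fitted xgboost model
imp_xgb <- xgb.importance(feature_names = colnames(x_train), model = fit_xgb)
head(imp_xgb, 10)                 # top 10 features by Gain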

COMPARISON SUMMARY

############################################################
## COMPARISON SUMMARY
############################################################

results <- data.frame(
  Model = c("Linear", "LASSO", "Enet", "AHRLR", "Tree", "Bagging", "RF", "Boosting"),
  RMSE  = c(rmse_lm, rmse_lasso, rmse_enet, rmse_AHRLR, rmse_tree, rmse_bag, rmse_rf, rmse_xgb),
  R2    = c(rsq_lm, rsq_lasso, rsq_enet, rsq_AHRLR, rsq_tree, rsq_bag, rsq_rf, rsq_xgb)
)

print(results[order(results$RMSE), ], row.names = FALSE)
##     Model       RMSE        R2
##        RF 0.08868245 0.5121020
##  Boosting 0.09061455 0.4778010
##   Bagging 0.09741966 0.4047560
##      Tree 0.10396971 0.3199305
##    Linear 0.11615648 0.1524623
##     AHRLR 0.11633932 0.1472868
##     LASSO 0.11656061 0.1397317
##      Enet 0.11681202 0.1402390
# Optional: bar plot of model RMSEs
barplot(results$RMSE, names.arg = results$Model,
        col = "steelblue", main = "Model Comparison (RMSE ↓ Better)",
        ylab = "Root Mean Squared Error")


Final Summary

We compared several supervised learning models for predicting mean groundwater contribution (bfi_mean). The table below summarizes their predictive accuracy using Root Mean Squared Error (RMSE) and R-squared (R²) on the test data.

Model                 RMSE    R²
Random Forest         0.089   0.512
Boosting (XGBoost)    0.091   0.478
Bagging               0.097   0.405
Decision Tree         0.104   0.320
Linear Regression     0.116   0.152
AHRLR                 0.116   0.147
LASSO                 0.117   0.140
Elastic Net           0.117   0.140

Key Takeaways

  1. Model performance improves as flexibility increases. Simpler models like Linear Regression and LASSO explain only about 14–15% of the variation in groundwater contribution, while the tree-based ensembles capture roughly 40–51%, with Random Forest the highest.

  2. Random Forest delivers the best overall accuracy. With an R² of ~0.51 and the lowest RMSE, Random Forest balances predictive power and robustness by combining many decision trees with random feature selection.

  3. Boosting performs similarly, but slightly less stably. Its sequential learning can fine-tune predictions but may slightly overfit depending on parameter settings.

  4. Bagging and single Trees show the importance of variance reduction. Each step from Tree → Bagging → Random Forest adds stability and predictive strength.

  5. Shrinkage models (LASSO, ENet, AHRLR) still add value. While not as accurate as ensemble methods, they offer interpretability and variable selection, helping identify key climate and permafrost drivers of groundwater flow.


In summary

The results highlight a clear trade-off in supervised learning: linear and shrinkage models are simpler and more interpretable, while tree-based ensembles are more powerful and flexible.

Together, they demonstrate how modern machine-learning tools can uncover complex, nonlinear relationships between climate, landscape, and groundwater systems — giving us both scientific insight and stronger predictive capability.