Introduction

This project presents a time series analysis focused on monthly electricity data from Spain. The dataset originates from Eurostat and is titled “Supply, transformation and consumption of electricity - monthly data” (DOI: 10.2908/NRG_CB_EM).

The main objectives of this project are:

  • to prepare and explore the monthly electricity series;
  • to assess the need for transformations and differencing, and to test for stationarity;
  • to identify, fit, and diagnose candidate models (SARIMA, ETS, and STL-based approaches), including a check for remaining ARCH effects;
  • to compare forecast accuracy on a held-out test set and select a final model.

Libraries used:

library(readxl)
library(dplyr)
library(forecast)
library(urca)
library(lmtest)
library(tseries)
library(astsa)
library(FinTS)
library(Metrics)
library(ggplot2)

Data Loading and Preparation

Read the data from the Excel file.

data <- suppressMessages(read_excel("data.xlsx", col_names = FALSE))

# transpose so observations are rows, then name the columns
df <- as.data.frame(t(data))
colnames(df) <- c("Date", "Spain")

# drop the header row, remove empty rows, parse dates, and coerce values to numeric
df <- df %>%
  filter(!is.na(Date)) %>%
  slice(-1) %>%
  mutate(Date = as.Date(paste(Date, "01", sep = "-"), format = "%Y-%m-%d"),
         Spain = round(as.numeric(Spain), 3))

rownames(df) <- df$Date
df$Date <- NULL

head(df, 3)
##            Spain
## 2008-01-01 24597
## 2008-02-01 23096
## 2008-03-01 21690
tail(df, 3)
##               Spain
## 2024-12-01 19590.32
## 2025-01-01 20745.94
## 2025-02-01 18208.06

The dataset should be checked for any missing values:

any(is.na(df))
## [1] FALSE
# expected number of months from 2008-01 to 2025-02:
# 17 full years (2008-2024) plus Jan-Feb 2025 = 12 * 17 + 2 = 206
(12*(24-8+1) + 2) - nrow(df)
## [1] 0

Since there is no missing data, the variable can be converted to a time series object to enable proper temporal analysis and model fitting.

ts_data <- ts(df$Spain, start = c(2008, 1), frequency = 12)

Exploratory Data Analysis

Number of observations:

length(ts_data)
## [1] 206

Time series plot:

plot(ts_data,
     ylab = "Electricity Consumption (GWh)",
     xlab = "Time (Monthly)",
     main = "Monthly Electricity Consumption in Spain")

Main observations from the time series plot:

  • A strong, regular annual seasonal pattern.
  • A mild downward drift in the overall level of the series.
  • Seasonal fluctuations of roughly constant magnitude, with no obvious change in variance.

For the remaining exploratory analysis, the dataset is split into training and test sets so that model diagnostics are performed only on the training portion. Roughly 20% of the observations are held out for testing:

\[20\% \text{ of } 206 \approx 40\]

n <- length(ts_data)

# train: end before last 40 obs
ts_train <- window(ts_data, end = time(ts_data)[n - 40])

# test: start at the last 40 obs
ts_test <- window(ts_data, start = time(ts_data)[n - 39])

cat("Train data has", length(ts_train), "obervations.\n")
## Train data has 166 obervations.
cat("Test data has", length(ts_test), "obervations.\n")
## Test data has 40 obervations.
cat("Split proportion:", round(100 * length(ts_train) / n, 2),
    "/", round(100 * length(ts_test) / n, 2), "\n")
## Split proportion: 80.58 / 19.42
cat("Train starts at", start(ts_train), "and ends at", end(ts_train), "\n")
## Train starts at 2008 1 and ends at 2021 10
cat("Test starts at", start(ts_test), "and ends at", end(ts_test), "\n")
## Test starts at 2021 11 and ends at 2025 2

A key consideration before modeling is whether the series should be transformed. Common transformations include logarithmic scaling or differencing to stabilize variance or achieve stationarity. This decision will be guided by visual inspection and formal statistical tests.
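For illustration, the candidate transformations would be applied as follows (a sketch only; output not shown, and no transformation is adopted until the tests below):

log_train   <- log(ts_train)             # stabilize variance that grows with the level
diff_train  <- diff(ts_train)            # first difference to remove trend
sdiff_train <- diff(ts_train, lag = 12)  # seasonal difference to remove the annual pattern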

Heteroscedasticity

Using the Breusch-Pagan test:

\[ \begin{aligned} \text{H}_0\!:& \quad \text{Constant variance (homoscedasticity)} \\ \text{H}_1\!:& \quad \text{Non-constant variance (heteroscedasticity)} \end{aligned} \]

# prepare the data
df <- data.frame(
  time = time(ts_train),
  value = as.numeric(ts_train)
)

# regress the series on time and test the residuals for non-constant variance
model <- lm(value ~ time, data = df)

bptest(model)
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 0.35006, df = 1, p-value = 0.5541

The Breusch-Pagan test yields a p-value of 0.5541. Since this is well above conventional significance levels (e.g., 0.05), the null hypothesis cannot be rejected: the residuals show no significant heteroscedasticity, and the assumption of constant variance holds.

Nonetheless, a Box-Cox transformation can still be applied to check whether any noticeable variance stabilization results.

par(mfrow = c(2, 1))

plot(df$value, 
     main = "Original Data", 
     ylab = "Original",
     xlab = "Time", 
     type = "l")

lambda <- BoxCox.lambda(df$value)
plot(BoxCox(df$value, lambda), 
     main = paste("Box-Cox Transformation (λ =", round(lambda, 2), ")"),
     ylab = "Transformed", 
     xlab = "Time", 
     type = "l")

Since the Box-Cox transformation does not produce a noticeable difference in the variance structure and the Breusch-Pagan test indicated homoscedasticity, no transformation will be applied to the original series.

Stationarity

As a first step, the ACF and PACF plots of the training series will be examined to assess temporal dependence and identify potential signs of non-stationarity, such as slow decay in autocorrelations. These plots help reveal if differencing may be required before model estimation.

tsdisplay(ts_train, main = "Time series diagnosis")

The time series plot and the ACF reveal strong autocorrelations that decay slowly, indicating the presence of persistent structure over time. The PACF also shows significant early lags. These patterns are characteristic of a non-stationary series, suggesting that differencing is likely required before fitting a model.

tsdisplay(diff(ts_train, lag = 12), main = "Diagnosis after seasonal differencing (lag = 12)")

tsdisplay(diff(diff(ts_train), lag = 12), main = "Seasonal differencing + first differencing")

After applying both seasonal differencing (lag = 12) and first differencing, the transformed series appears substantially more stable. The ACF shows a rapid drop-off, and most lags lie within the confidence bounds. The PACF displays limited significant spikes, further supporting stationarity. These patterns suggest that the combined differencing successfully removed trend and seasonal components, resulting in a stationary series.
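As a complementary automated check (a sketch; the formal tests below remain the primary evidence), the forecast package's unit-root heuristics can be asked how many differences they suggest:

# suggested number of seasonal differences for the training series
nsdiffs(ts_train)

# suggested number of first differences after seasonal differencing
ndiffs(diff(ts_train, lag = 12))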

To draw a more robust inference regarding stationarity, formal statistical tests should be applied in addition to visual inspection. Specifically, the Augmented Dickey-Fuller (ADF) and KPSS tests will be used to assess the presence or absence of unit roots.

Augmented Dickey-Fuller (ADF) test:

\[ \begin{aligned} \text{H}_0\!:& \quad \text{The series has a unit root (non-stationary)} \\ \text{H}_1\!:& \quad \text{The series is stationary} \end{aligned} \]

# test the original series, probably non-stationary
adf.test(ts_train)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_train
## Dickey-Fuller = -3.0043, Lag order = 5, p-value = 0.1572
## alternative hypothesis: stationary
# test after first differencing and seasonal differencing
adf.test(diff(diff(ts_train, lag = 12), differences = 1))
## Warning in adf.test(diff(diff(ts_train, lag = 12), differences = 1)): p-value
## smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(diff(ts_train, lag = 12), differences = 1)
## Dickey-Fuller = -6.9048, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

The ADF test on the original series returns a p-value of 0.1572. This is above the conventional threshold (e.g., 0.05), so there is insufficient evidence to reject the null hypothesis. The original series is likely non-stationary.

After applying seasonal and first differencing, the ADF test yields a p-value less than 0.01. This provides strong evidence against the null hypothesis, indicating that the differenced series is stationary.

KPSS test:

\[ \begin{aligned} \text{H}_0\!:& \quad \text{The series is stationary} \\ \text{H}_1\!:& \quad \text{The series has a unit root (non-stationary)} \end{aligned} \]

kpss_test <- ur.kpss(diff(diff(ts_train, lag = 12)))
summary(kpss_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.0199 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The KPSS test was performed on the differenced series (type = “mu”, with 4 lags). The test statistic is 0.0199, which is well below all critical values at common significance levels (e.g., 0.463 at 5%). Since the test statistic does not exceed the critical values, the null hypothesis of stationarity cannot be rejected. This test supports the conclusion that the differenced series is stationary, reinforcing the result from the ADF test.

In summary, the original series is non-stationary, but stationarity is achieved after applying one seasonal and one regular difference.

Model Proposals

SARIMA

From the exploratory analysis section, it is already established that the series exhibits strong seasonality and required both seasonal and non-seasonal differencing to achieve stationarity. This justifies the use of a SARIMA model, which explicitly accounts for both seasonal and non-seasonal components.

Model identification will proceed by analyzing the ACF and PACF of the differenced series and by evaluating alternative specifications using information criteria (e.g., AICc) and residual diagnostics.

acf2(diff(diff(ts_train, 12)))

##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7] [,8] [,9] [,10] [,11] [,12]
## ACF  -0.26 -0.07 -0.09 -0.05 -0.04  0.06 -0.06 0.09 0.17 -0.13  0.04 -0.41
## PACF -0.26 -0.14 -0.16 -0.15 -0.15 -0.04 -0.12 0.00 0.19 -0.01  0.08 -0.42
##      [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## ACF   0.04  0.20  0.10 -0.07  0.04  0.01 -0.02 -0.05 -0.11  0.06  0.16  0.01
## PACF -0.28 -0.01  0.07 -0.03 -0.07  0.04  0.03  0.00  0.01 -0.08  0.06 -0.16
##      [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36]
## ACF  -0.09 -0.09 -0.06  0.00  0.10 -0.08  0.01  0.11  0.03 -0.11  0.03 -0.08
## PACF -0.20 -0.10 -0.05 -0.15 -0.03 -0.04 -0.09 -0.04  0.03 -0.10  0.10 -0.13
##      [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48]
## ACF   0.12  0.12 -0.03 -0.05  0.03 -0.06 -0.01 -0.03 -0.04  0.10 -0.01 -0.03
## PACF -0.13  0.04  0.10 -0.03  0.08  0.02 -0.05 -0.02  0.04 -0.09 -0.03 -0.10

From the ACF and PACF plots of the seasonally and first-differenced series (diff(diff(ts_train, 12))), the following model components can be inferred:

Seasonal components (period = 12)

  • ACF shows a clear spike at lag 12, suggesting a seasonal MA(1) component.
  • PACF also shows a spike at lag 12, supporting the inclusion of a seasonal AR(1) term.
  • Seasonal differencing was already applied, so D = 1.

Non-seasonal components

  • ACF shows a rapid decay after lag 1, with only the first autocorrelation clearly significant, consistent with a non-seasonal MA(1) structure.
  • PACF shows a significant spike at lag 1 followed by a quick cutoff, suggesting a non-seasonal AR(1) is also plausible.
  • Regular differencing was applied, so d = 1.

SARIMA(1,1,1)(1,1,1)[12] is a reasonable model candidate based on these diagnostics.
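As a quick sanity check (a minimal sketch; the full evaluation of this candidate appears below), the model can be fit directly and its AICc noted for comparison with the automated searches:

fit_manual <- Arima(ts_train,
                    order = c(1,1,1),
                    seasonal = list(order = c(1,1,1), period = 12))
fit_manual$aicc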

fit <- auto.arima(ts_train,
                  ic = "aicc",
                  seasonal = TRUE,
                  stepwise = TRUE, 
                  approximation = TRUE)
summary(fit)
## Series: ts_train 
## ARIMA(2,0,2)(2,1,0)[12] with drift 
## 
## Coefficients:
##          ar1     ar2     ma1      ma2     sar1     sar2     drift
##       0.4258  0.3448  0.0208  -0.1976  -0.6338  -0.2803  -19.5295
## s.e.  1.0934  0.8809  1.0919   0.3903   0.0846   0.0837    9.3086
## 
## sigma^2 = 549015:  log likelihood = -1235.41
## AIC=2486.82   AICc=2487.82   BIC=2511.12
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -3.105525 697.2632 501.8618 -0.1185414 2.429774 0.6088613
##                     ACF1
## Training set 0.005366936

SARIMA(2,0,2)(2,1,0)[12] with drift is also a reasonable model candidate based on auto.arima.

fit <- auto.arima(ts_train,
                  ic = "aicc",
                  max.p = 5,
                  d = 1,
                  max.q = 5,
                  max.P = 3,
                  D = 1,
                  max.Q = 3,
                  seasonal = TRUE,
                  stepwise = FALSE, 
                  approximation = FALSE)
summary(fit)
## Series: ts_train 
## ARIMA(0,1,2)(3,1,0)[12] 
## 
## Coefficients:
##           ma1      ma2     sar1     sar2     sar3
##       -0.4815  -0.1470  -0.7172  -0.4395  -0.2032
## s.e.   0.0842   0.0906   0.0854   0.0981   0.0894
## 
## sigma^2 = 547459:  log likelihood = -1229.25
## AIC=2470.5   AICc=2471.07   BIC=2488.68
## 
## Training set error measures:
##                    ME     RMSE      MAE         MPE    MAPE      MASE
## Training set 10.77535 698.6388 508.4503 -0.03317999 2.45886 0.6168546
##                    ACF1
## Training set 0.01111596

SARIMA(0,1,2)(3,1,0)[12] is also a reasonable candidate, obtained from a more constrained and exhaustive auto.arima search (d = 1 and D = 1 fixed, wider order limits, no stepwise shortcuts or approximation).

To identify the optimal SARIMA model, a grid search can also be performed (in effect, a more exhaustive auto.arima) over a range of seasonal and non-seasonal parameters. Each model is evaluated using AICc, which balances fit and complexity, allowing a systematic comparison of candidates guided by penalized likelihood. While the large number of combinations increases the risk of overfitting, safeguards such as restricting the total number of parameters are applied to limit model complexity.

# full parameter grid
grid <- expand.grid(p = 0:5,
                    d = 0:2,
                    q = 0:5,
                    P = 0:5,
                    D = 1:3,
                    Q = 0:5)

# restrict total model complexity (at most 7 AR/MA parameters overall)
grid <- subset(grid, (p + q + P + Q) <= 7)

# model fitting and evaluation
results <- lapply(seq_len(nrow(grid)), function(i) {
  row <- grid[i, ]
  model_str <- paste0("(", row$p, ",", row$d, ",", row$q, ")(",
                      row$P, ",", row$D, ",", row$Q, ")[12]")
  if (i %% 100 == 0 || i == 1 || i == nrow(grid)) {
    cat(sprintf("[STATUS] At model %d of %d: %s\n", i, nrow(grid), model_str))
  }  
  tryCatch({
    fit <- Arima(ts_train,
             order = c(row$p, row$d, row$q),
             seasonal = list(order = c(row$P, row$D, row$Q), period = 12),
             include.drift = FALSE)
    data.frame(aicc = fit$aicc, model = model_str)
  }, error = function(e) {
      cat(sprintf("[ERROR] In model %d of %d: %s -> %s\n", i, nrow(grid), model_str, e$message))
    NULL
  })
})

# remove failed models
results <- Filter(Negate(is.null), results)

# save the sorted results
results_df <- do.call(rbind, results)
results_df <- results_df[order(results_df$aicc), ]
write.csv(results_df, "results.csv", row.names = FALSE)
results_df <- read.csv("results.csv")
head(results_df, 3)
##       aicc              model
## 1 2224.473 (1,2,2)(0,3,3)[12]
## 2 2225.650 (1,2,3)(0,3,3)[12]
## 3 2225.682 (2,2,2)(0,3,3)[12]

Current candidates for the optimal SARIMA model:

  • SARIMA(1,1,1)(1,1,1)[12]

  • SARIMA(2,0,2)(2,1,0)[12]

  • SARIMA(0,1,2)(3,1,0)[12]

  • SARIMA(1,2,2)(0,3,3)[12]

models <- list(
  list(order = c(1,1,1), seasonal = c(1,1,1)),
  list(order = c(2,0,2), seasonal = c(2,1,0)),
  list(order = c(0,1,2), seasonal = c(3,1,0)),
  list(order = c(1,2,2), seasonal = c(0,3,3))
)

labels <- c("SARIMA(1,1,1)(1,1,1)[12]",
            "SARIMA(2,0,2)(2,1,0)[12]",
            "SARIMA(0,1,2)(3,1,0)[12]",
            "SARIMA(1,2,2)(0,3,3)[12]")

plot(ts_train, type = "l", col = "black", lwd = 1.5,
     main = "Fitted Values from SARIMA Models vs Training Data",
     ylab = "Electricity Consumption", xlab = "Time")

colors <- c("blue", "red", "darkgreen", "purple")

for (i in seq_along(models)) {
  fit <- Arima(ts_train,
               order = models[[i]]$order,
               seasonal = list(order = models[[i]]$seasonal, period = 12))
  lines(fitted(fit), col = colors[i], lwd = 1.5)
}

legend("bottomleft",
       legend = c("Observed", labels),
       col = c("black", colors),
       lty = 1,
       lwd = 1.5)

Based on the visual alignment of fitted values with the observed training data and considering model complexity, the most favorable choices are:

  1. SARIMA(1,1,1)(1,1,1)[12]

  2. SARIMA(2,0,2)(2,1,0)[12]

  3. SARIMA(0,1,2)(3,1,0)[12]

Evaluation of the First Choice: SARIMA(1,1,1)(1,1,1)[12]

sarima1 <- forecast::Arima(ts_train,
                           order = c(1,1,1),
                           seasonal = c(1,1,1))
summary(sarima1)
## Series: ts_train 
## ARIMA(1,1,1)(1,1,1)[12] 
## 
## Coefficients:
##          ar1      ma1    sar1     sma1
##       0.4499  -0.8689  0.0579  -0.9983
## s.e.  0.1252   0.0830  0.0894   0.2415
## 
## sigma^2 = 424234:  log likelihood = -1221.8
## AIC=2453.6   AICc=2454.01   BIC=2468.75
## 
## Training set error measures:
##                    ME     RMSE      MAE         MPE     MAPE      MASE
## Training set 10.45888 617.0807 448.1913 -0.02916893 2.165709 0.5437481
##                     ACF1
## Training set -0.03951362
# a coefficient is significant if |estimate| / s.e. >= 1.96; check sar1, the smallest ratio
0.0579 / 0.0894 >= 1.96
## [1] FALSE

sar1 is not significant!

# correlation between parameter estimates (absolute values should stay below 0.7)
vcov_mat <- vcov(sarima1)
stddevs <- sqrt(diag(vcov_mat))

cov2cor(vcov_mat)
##               ar1         ma1        sar1         sma1
## ar1   1.000000000 -0.81456826  0.14406044 -0.005351725
## ma1  -0.814568259  1.00000000 -0.14824210  0.004377140
## sar1  0.144060443 -0.14824210  1.00000000 -0.022012213
## sma1 -0.005351725  0.00437714 -0.02201221  1.000000000
checkresiduals(sarima1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(1,1,1)[12]
## Q* = 31.304, df = 20, p-value = 0.0513
## 
## Model df: 4.   Total lags used: 24
cat("The residuals have mean", mean(residuals(sarima1)), "and variance", var(residuals(sarima1)))
## The residuals have mean 10.45888 and variance 382986.3
qqnorm(residuals(sarima1)); qqline(residuals(sarima1))

Evaluation of the Second Choice: SARIMA(2,0,2)(2,1,0)[12]

sarima2 <- forecast::Arima(ts_train,
                           order = c(2,0,2),
                           seasonal = c(2,1,0))
summary(sarima2)
## Series: ts_train 
## ARIMA(2,0,2)(2,1,0)[12] 
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1     ar2      ma1      ma2     sar1     sar2
##       0.8327  0.0659  -0.3693  -0.1036  -0.6332  -0.2798
## s.e.  0.0727  0.0940   0.0510      NaN      NaN      NaN
## 
## sigma^2 = 556579:  log likelihood = -1236.98
## AIC=2487.95   AICc=2488.72   BIC=2509.21
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -83.00719 704.4336 516.3341 -0.4976523 2.503219 0.6264192
##                      ACF1
## Training set -0.006377222
0.0659 / 0.0940 >= 1.96
## [1] FALSE

ar2 is not significant!

vcov_mat <- vcov(sarima2)
stddevs <- sqrt(diag(vcov_mat))
## Warning in sqrt(diag(vcov_mat)): NaNs produced
cov2cor(vcov_mat)
## Warning in cov2cor(vcov_mat): diag(V) had non-positive or NA entries; the
## non-finite result may be dubious
##             ar1        ar2        ma1 ma2 sar1 sar2
## ar1   1.0000000 -0.9574975  0.3497048 NaN  NaN  NaN
## ar2  -0.9574975  1.0000000 -0.8478147 NaN  NaN  NaN
## ma1   0.3497048 -0.8478147  1.0000000 NaN  NaN  NaN
## ma2         NaN        NaN        NaN   1  NaN  NaN
## sar1        NaN        NaN        NaN NaN    1  NaN
## sar2        NaN        NaN        NaN NaN  NaN    1
checkresiduals(sarima2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(2,1,0)[12]
## Q* = 48.363, df = 18, p-value = 0.0001329
## 
## Model df: 6.   Total lags used: 24
cat("The residuals have mean", mean(residuals(sarima2)), "and variance", var(residuals(sarima2)))
## The residuals have mean -83.00719 and variance 492302.2
qqnorm(residuals(sarima2)); qqline(residuals(sarima2))

Evaluation of the Third Choice: SARIMA(0,1,2)(3,1,0)[12]

sarima3 <- forecast::Arima(ts_train,
                           order = c(0,1,2),
                           seasonal = c(3,1,0))
summary(sarima3)
## Series: ts_train 
## ARIMA(0,1,2)(3,1,0)[12] 
## 
## Coefficients:
##           ma1      ma2     sar1     sar2     sar3
##       -0.4815  -0.1470  -0.7172  -0.4395  -0.2032
## s.e.   0.0842   0.0906   0.0854   0.0981   0.0894
## 
## sigma^2 = 547459:  log likelihood = -1229.25
## AIC=2470.5   AICc=2471.07   BIC=2488.68
## 
## Training set error measures:
##                    ME     RMSE      MAE         MPE    MAPE      MASE
## Training set 10.77535 698.6388 508.4503 -0.03317999 2.45886 0.6168546
##                    ACF1
## Training set 0.01111596

Most coefficients are clearly significant; only ma2 falls slightly short of the 1.96 threshold (|estimate| / s.e. ≈ 1.62).

vcov_mat <- vcov(sarima3)
stddevs <- sqrt(diag(vcov_mat))

cov2cor(vcov_mat)
##             ma1         ma2       sar1       sar2        sar3
## ma1   1.0000000 -0.31133605  0.1535299 0.08440160 -0.21314672
## ma2  -0.3113361  1.00000000 -0.1135561 0.06847138  0.03202671
## sar1  0.1535299 -0.11355608  1.0000000 0.57568805  0.29470067
## sar2  0.0844016  0.06847138  0.5756881 1.00000000  0.52754763
## sar3 -0.2131467  0.03202671  0.2947007 0.52754763  1.00000000

No pairwise correlation between parameter estimates exceeds the 0.7 threshold.

checkresiduals(sarima3)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2)(3,1,0)[12]
## Q* = 44.752, df = 19, p-value = 0.0007424
## 
## Model df: 5.   Total lags used: 24
cat("The residuals have mean", mean(residuals(sarima3)), "and variance", var(residuals(sarima3)))
## The residuals have mean 10.77535 and variance 490937.5
qqnorm(residuals(sarima3)); qqline(residuals(sarima3))

Since no model fully satisfied the diagnostic criteria for a well-specified model (namely, uncorrelated, homoscedastic, and normally distributed residuals), further tuning is necessary.

After extensive testing of alternative SARIMA specifications and parameter adjustments, this was the best-performing configuration identified based on diagnostic criteria and overall model adequacy.

fit_arima <- forecast::Arima(ts_train,
                           order = c(1,1,1),
                           seasonal = c(1,1,1))

summary(fit_arima)
## Series: ts_train 
## ARIMA(1,1,1)(1,1,1)[12] 
## 
## Coefficients:
##          ar1      ma1    sar1     sma1
##       0.4499  -0.8689  0.0579  -0.9983
## s.e.  0.1252   0.0830  0.0894   0.2415
## 
## sigma^2 = 424234:  log likelihood = -1221.8
## AIC=2453.6   AICc=2454.01   BIC=2468.75
## 
## Training set error measures:
##                    ME     RMSE      MAE         MPE     MAPE      MASE
## Training set 10.45888 617.0807 448.1913 -0.02916893 2.165709 0.5437481
##                     ACF1
## Training set -0.03951362
vcov_mat <- vcov(fit_arima)
stddevs <- sqrt(diag(vcov_mat))

cov2cor(vcov_mat)
##               ar1         ma1        sar1         sma1
## ar1   1.000000000 -0.81456826  0.14406044 -0.005351725
## ma1  -0.814568259  1.00000000 -0.14824210  0.004377140
## sar1  0.144060443 -0.14824210  1.00000000 -0.022012213
## sma1 -0.005351725  0.00437714 -0.02201221  1.000000000
checkresiduals(fit_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(1,1,1)[12]
## Q* = 31.304, df = 20, p-value = 0.0513
## 
## Model df: 4.   Total lags used: 24
cat("The residuals have mean", mean(residuals(fit_arima)), "and variance", var(residuals(fit_arima)))
## The residuals have mean 10.45888 and variance 382986.3
qqnorm(residuals(fit_arima)); qqline(residuals(fit_arima))

# Shapiro-Wilk test
shapiro.test(residuals(fit_arima))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(fit_arima)
## W = 0.97568, p-value = 0.005097
# Kolmogorov–Smirnov test
ks.test(residuals(fit_arima), "pnorm", mean=mean(residuals(fit_arima)), sd=sd(residuals(fit_arima)))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  residuals(fit_arima)
## D = 0.083543, p-value = 0.1969
## alternative hypothesis: two-sided

Even though the sar1 coefficient is individually insignificant and ar1 shows correlation with ma1, both contribute meaningfully to whitening the residuals. Their inclusion reduces autocorrelation not accounted for by other components. In time series modeling, residual diagnostics take precedence over individual parameter significance, and retaining such terms is justified when they improve the overall model adequacy.

The residual diagnostics for the model show reasonably good behavior:

  • The residuals appear centered with no visible trend or structural shift.

  • The ACF plot shows no significant autocorrelation at most lags.

  • The histogram indicates approximate symmetry and bell-shaped distribution.

  • The Ljung-Box test result is:

\[ \begin{aligned} \text{H}_0\!:& \quad \text{No autocorrelation in residuals} \\ \text{H}_1\!:& \quad \text{Residuals are autocorrelated} \end{aligned} \]

p-value is 0.0513, slightly above the 5% significance level, indicating marginal evidence against the null hypothesis. While not entirely conclusive, this suggests that the residuals are likely uncorrelated, or at least that any remaining autocorrelation is weak.
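For reference, the same Ljung-Box statistic can be reproduced directly with Box.test(), using 24 lags and fitdf equal to the 4 estimated parameters (a sketch; checkresiduals() already reports this result):

Box.test(residuals(fit_arima), lag = 24, type = "Ljung-Box", fitdf = 4)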

Normality of the residuals was evaluated using both the Shapiro-Wilk and Kolmogorov-Smirnov tests, along with visual inspection via the Q-Q plot.

  • Shapiro-Wilk test

\[ \begin{aligned} \text{H}_0\!:& \quad \text{Residuals are normally distributed} \\ \text{H}_1\!:& \quad \text{Residuals deviate from normality} \end{aligned} \]

p-value = 0.005097: reject H0; there is evidence of non-normality.

  • Kolmogorov-Smirnov test

\[ \begin{aligned} \text{H}_0\!:& \quad \text{Residuals follow the normal distribution} \\ \text{H}_1\!:& \quad \text{Residuals do not follow the normal distribution} \end{aligned} \]

p-value = 0.1969: Fail to reject H0: no significant evidence of non-normality.

This difference arises because the Shapiro-Wilk test has greater power to detect departures from normality, especially in small samples, whereas the Kolmogorov-Smirnov test is more conservative and may fail to reject the null hypothesis even for moderately non-normal data.

  • Q-Q plot

Shows tail deviations but an overall acceptable alignment with the theoretical normal distribution.

While the Shapiro-Wilk test suggests a deviation from normality, the K-S test and the graphical analysis support the assumption of approximate normality.

The SARIMA(1,1,1)(1,1,1)[12] model produces residuals that are close enough to white noise, with weak autocorrelation and only mild deviations from normality. Although not perfect, the model satisfies most diagnostic criteria to a reasonable extent. In the absence of clearly superior alternatives, this model may be considered adequate for forecasting purposes.

GARCH

Having identified the best-fitting ARIMA model, the next step is to test whether a GARCH model is appropriate for capturing any remaining conditional heteroscedasticity in the residuals.

\[ \begin{aligned} \text{H}_0\!:& \quad \text{No ARCH effects (homoscedasticity)} \\ \text{H}_1\!:& \quad \text{Presence of ARCH effects (heteroscedasticity)} \end{aligned} \]

ArchTest(residuals(fit_arima), lags = 12)
## 
##  ARCH LM-test; Null hypothesis: no ARCH effects
## 
## data:  residuals(fit_arima)
## Chi-squared = 9.7974, df = 12, p-value = 0.6337

The test returned a p-value of 0.6337. As the p-value is well above common significance levels, we fail to reject the null hypothesis. There is no evidence of ARCH effects in the residuals; a GARCH extension is not required.
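As a complementary visual check (a sketch; the LM test above is the formal evidence), the ACF of the squared residuals can be inspected; the absence of significant spikes is consistent with no ARCH effects:

# squared residuals should show no autocorrelation if there are no ARCH effects
acf(residuals(fit_arima)^2, main = "ACF of Squared Residuals")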

ETS

An ETS model will now be applied. ETS models are well-suited for time series with strong trend and seasonal components, and unlike ARIMA, they do not require stationarity. This alternative approach will allow for comparison in terms of both in-sample fit and forecasting accuracy. Model components (Error, Trend, Seasonal) are selected automatically based on information criteria.

fit_ets <- ets(ts_train, ic="aicc")
summary(fit_ets)
## ETS(M,N,A) 
## 
## Call:
## ets(y = ts_train, ic = "aicc")
## 
##   Smoothing parameters:
##     alpha = 0.5563 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 21799.6093 
##     s = 973.9892 -241.2593 -693.6657 -510.1486 606.8107 1472.184
##            -635.261 -1221.312 -1882.068 433.5543 -120.8753 1818.051
## 
##   sigma:  0.0318
## 
##      AIC     AICc      BIC 
## 3021.333 3024.533 3068.013 
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -28.62181 636.3201 487.9749 -0.2087385 2.342725 0.5920137
##                    ACF1
## Training set 0.03417389

The ETS model selected based on AICc is: ETS(M,N,A)

This configuration indicates:

  • Error: multiplicative; the impact of random shocks scales with the level of the series

  • Trend: none; no long-term directional movement is modeled

  • Seasonality: additive; seasonal fluctuations are constant in magnitude across time

This model structure reflects a stable seasonal pattern without an explicit trend, consistent with the behavior observed after differencing the original series.
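For reference, the ETS(M,N,A) model in the standard Hyndman state-space notation, with seasonal period m = 12, is:

\[ \begin{aligned} y_t &= (\ell_{t-1} + s_{t-m})(1 + \varepsilon_t) \\ \ell_t &= \ell_{t-1} + \alpha\,(\ell_{t-1} + s_{t-m})\,\varepsilon_t \\ s_t &= s_{t-m} + \gamma\,(\ell_{t-1} + s_{t-m})\,\varepsilon_t \end{aligned} \]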

checkresiduals(fit_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,A)
## Q* = 47.728, df = 24, p-value = 0.002727
## 
## Model df: 0.   Total lags used: 24

By applying the checkresiduals function to the model, it becomes evident that the residuals display significant autocorrelation, as confirmed by the Ljung-Box test (p-value = 0.0027). The residual diagnostics indicate that the ETS model does not adequately capture the underlying dynamics of the series.

The ETS framework already encompasses the Holt-Winters family of models as special cases, so they will not be tested separately.
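For instance (a sketch, not run as part of the analysis), the additive Holt-Winters method corresponds to ETS(A,A,A) and could be forced explicitly if a direct comparison were desired:

# additive Holt-Winters as a constrained ETS specification
hw_equiv <- ets(ts_train, model = "AAA", damped = FALSE)
summary(hw_equiv)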

STLM

The next approach involves applying a Seasonal and Trend decomposition using Loess (STL), followed by modeling the seasonally adjusted series. This method, implemented via the stlm() function, decomposes the original series into trend, seasonal, and remainder components in a flexible, nonparametric way. After decomposition, the remainder is modeled using ARIMA or ETS, allowing for a hybrid structure that combines non-linear seasonal smoothing with parametric short-term dynamics.
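Conceptually, stlm() with method = "arima" is roughly equivalent to the following steps (a sketch of the idea, not its exact internals):

decomp   <- stl(ts_train, s.window = "periodic")  # STL decomposition
seas_adj <- seasadj(decomp)                       # remove the seasonal component
fit_sa   <- auto.arima(seas_adj)                  # ARIMA on the seasonally adjusted series
# forecasts from fit_sa are then re-seasonalized using the last seasonal cycle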

fit_stlm <- stlm(ts_train,
                 s.window = "periodic",
                 method = "arima")

summary(fit_stlm)
##               Length Class          Mode     
## stl           664    mstl           numeric  
## model          19    forecast_ARIMA list     
## modelfunction   1    -none-         function 
## lambda          0    -none-         NULL     
## x             166    ts             numeric  
## series          1    -none-         character
## m               1    -none-         numeric  
## fitted        166    ts             numeric  
## residuals     166    ts             numeric
# STL components (seasonal, trend, remainder)
plot(fit_stlm$stl)

# ARIMA model fitted to the seasonally adjusted series
summary(fit_stlm$model)
## Series: x 
## ARIMA(4,1,1) with drift 
## 
## Coefficients:
##          ar1     ar2     ar3      ar4      ma1     drift
##       0.4478  0.1155  0.0700  -0.0905  -0.9144  -19.3451
## s.e.  0.1080  0.0930  0.0922   0.0905   0.0814    9.5074
## 
## sigma^2 = 385842:  log likelihood = -1292.68
## AIC=2599.36   AICc=2600.08   BIC=2621.11
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -8.743192 607.9242 464.0256 -0.1211733 2.227514 0.5629583
##                       ACF1
## Training set -0.0006021931
checkresiduals(fit_stlm)

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 29.546, df = 24, p-value = 0.2003
## 
## Model df: 0.   Total lags used: 24

The residual diagnostics from the model show a relatively well-behaved structure:

  • The residual time plot displays no obvious patterns or structural deviations.

  • The ACF of the residuals remains mostly within the confidence bounds, with no strong autocorrelation.

  • The histogram suggests approximate symmetry, though some deviation from normality is visible.

  • The Ljung-Box test confirms the absence of significant autocorrelation (p-value = 0.2003).

cat("The residuals have mean", mean(residuals(fit_stlm)), "and variance", var(residuals(fit_stlm)))
## The residuals have mean -8.743192 and variance 371734.7
qqnorm(residuals(fit_stlm)); qqline(residuals(fit_stlm))

# Shapiro-Wilk test
shapiro.test(residuals(fit_stlm))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(fit_stlm)
## W = 0.9794, p-value = 0.01422
# Kolmogorov–Smirnov test
ks.test(residuals(fit_stlm), "pnorm", mean=mean(residuals(fit_stlm)), sd=sd(residuals(fit_stlm)))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  residuals(fit_stlm)
## D = 0.073808, p-value = 0.3263
## alternative hypothesis: two-sided

Normality of the residuals was evaluated using both the Shapiro-Wilk and Kolmogorov-Smirnov tests, along with visual inspection via the Q-Q plot.

  • Shapiro-Wilk test

\[ \begin{aligned} \text{H}_0\!:& \quad \text{Residuals are normally distributed} \\ \text{H}_1\!:& \quad \text{Residuals deviate from normality} \end{aligned} \]

p-value = 0.01422: reject H0 at the 5% level (but not at 1%); some evidence of non-normality.

  • Kolmogorov-Smirnov test

\[ \begin{aligned} \text{H}_0\!:& \quad \text{Residuals follow the normal distribution} \\ \text{H}_1\!:& \quad \text{Residuals do not follow the normal distribution} \end{aligned} \]

p-value = 0.3263: Fail to reject H0: no significant evidence of non-normality.

This difference arises because the Shapiro-Wilk test has greater power to detect departures from normality, especially in small samples, whereas the Kolmogorov-Smirnov test is more conservative and may fail to reject the null hypothesis even for moderately non-normal data.

  • Q-Q plot

Shows minor tail deviations but an overall acceptable alignment with the theoretical normal distribution.

While the Shapiro-Wilk test suggests a slight deviation from normality, the K-S test and graphical analysis support the assumption of approximate normality. This level of deviation is not sufficient to disqualify the model for practical use.

Future Observations Forecast

To generate forecasts for future electricity consumption, the two best-performing models from the analysis will be used:

fit_arima   # SARIMA(1,1,1)(1,1,1)[12]
fit_stlm    # STLM(ARIMA)

Forecast accuracy and behavior will be compared visually and numerically using the held-out test set.

SARIMA(1,1,1)(1,1,1)[12]

# forecast
arima_forecast <- forecast(fit_arima, h = length(ts_test))

# evaluation metrics
forecast::accuracy(arima_forecast, ts_test)
##                     ME     RMSE      MAE         MPE     MAPE      MASE
## Training set  10.45888 617.0807 448.1913 -0.02916893 2.165709 0.5437481
## Test set     -52.21608 562.8728 465.7346 -0.35018427 2.483365 0.5650317
##                     ACF1 Theil's U
## Training set -0.03951362        NA
## Test set      0.58261040 0.4207397
# plot the forecast (train+test vs forecast)
autoplot(ts_data, series = "Actual") +
  autolayer(arima_forecast, series = "Forecast", PI = FALSE) +
  labs(title = "Forecast vs Full Series - SARIMA(1,1,1)(1,1,1)[12]",
       x = "Time (Monthly)",
       y = "Electricity Consumption (GWh)") +
  scale_colour_manual(values = c("Actual" = "black", "Forecast" = "blue"))

# plot the forecast (test vs forecast)
autoplot(ts_test, series = "Test") +
  autolayer(arima_forecast, series = "Forecast", PI = FALSE) +
  labs(title = "Forecast vs Test Set - SARIMA(1,1,1)(1,1,1)[12]",
       x = "Time (Monthly)",
       y = "Electricity Consumption (GWh)") +
  scale_colour_manual(values = c("Test" = "black", "Forecast" = "blue"))

STLM(ARIMA)

# forecast
stlm_forecast <- forecast(fit_stlm, h = length(ts_test))

# evaluation metrics
forecast::accuracy(stlm_forecast, ts_test)
##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set  -8.743192 607.9242 464.0256 -0.1231423 2.230672 0.5629583
## Test set     -26.131182 563.2452 460.1395 -0.2107488 2.451736 0.5582437
##                       ACF1 Theil's U
## Training set -0.0006021931        NA
## Test set      0.5836708163 0.4218294
# plot the forecast (train+test vs forecast)
autoplot(ts_data, series = "Actual") +
  autolayer(stlm_forecast, series = "Forecast", PI = FALSE) +
  labs(title = "Forecast vs Full Series - STLM(ARIMA)",
       x = "Time (Monthly)",
       y = "Electricity Consumption (GWh)") +
  scale_colour_manual(values = c("Actual" = "black", "Forecast" = "blue"))

# plot the forecast (test vs forecast)
autoplot(ts_test, series = "Test") +
  autolayer(stlm_forecast, series = "Forecast", PI = FALSE) +
  labs(title = "Forecast vs Test Set - STLM(ARIMA)",
       x = "Time (Monthly)",
       y = "Electricity Consumption (GWh)") +
  scale_colour_manual(values = c("Test" = "black", "Forecast" = "blue"))

Forecast accuracy was assessed using forecast::accuracy(), which compares fitted values on the training set and forecast values against the test set. In both cases, the training and test error metrics are very close, indicating no signs of overfitting.

Visually, the predicted values follow the behavior of the actual test data closely, with aligned seasonal patterns and trend shifts.

Both models generalize well to unseen data and can be considered valid for forecasting purposes.

Results

Forecast Accuracy Comparison on Test Set

The forecast accuracy of the SARIMA and STLM models is evaluated using standard error metrics:

  • ME (Mean Error): Measures bias. A value close to 0 is desirable, but this metric can be misleading due to cancellation of positive and negative errors.

  • RMSE (Root Mean Squared Error): Sensitive to large deviations. Useful when large errors are particularly undesirable. Penalizes outliers more heavily.

  • MAE (Mean Absolute Error): Interpretable and robust. Less sensitive to outliers than RMSE. A solid general-purpose metric.

  • MPE (Mean Percentage Error): Can produce misleading values when actuals approach zero. May result in large positive or negative values.

  • MAPE (Mean Absolute Percentage Error): Intuitive and scale-independent, but unreliable when actual values are near zero.

  • MASE (Mean Absolute Scaled Error): Scale-free and comparable across datasets. Values < 1 indicate better performance than the naive forecast. Reliable and interpretable (see the formula after this list).

  • ACF1 (Lag-1 Autocorrelation of Residuals): Measures autocorrelation in residuals. Ideal values are near 0, indicating residuals are uncorrelated.

  • Theil’s U: Compares performance to a naive benchmark. Values < 1 indicate better-than-naive forecasts.
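For a seasonal series, forecast::accuracy() scales the MASE by the in-sample MAE of the seasonal naive forecast; with period m = 12, training length T, and h test-set forecast errors e_t:

\[ \text{MASE} = \frac{\frac{1}{h}\sum_{t=1}^{h} |e_t|}{\frac{1}{T-m}\sum_{t=m+1}^{T} |y_t - y_{t-m}|} \]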

acc_arima <- forecast::accuracy(arima_forecast, ts_test)
acc_stlm  <- forecast::accuracy(stlm_forecast, ts_test)

comparison <- rbind(
  SARIMA = acc_arima["Test set", ],
  STLM   = acc_stlm["Test set",  ]
)

round(comparison, 3)
##             ME    RMSE     MAE    MPE  MAPE  MASE  ACF1 Theil's U
## SARIMA -52.216 562.873 465.735 -0.350 2.483 0.565 0.583     0.421
## STLM   -26.131 563.245 460.140 -0.211 2.452 0.558 0.584     0.422

The forecast accuracy metrics for SARIMA and STLM are very similar across all indicators, reflecting comparable overall performance.

Analysis of 95% Prediction Intervals

The 95% prediction intervals (PI) represent the range within which future observations are expected to fall with 95% probability.

# extract forecast time indices
time_vals <- time(arima_forecast$mean)
df_arima <- data.frame(
  time = time_vals,
  lo95 = arima_forecast$lower[,2],
  hi95 = arima_forecast$upper[,2]
)

df_stlm <- data.frame(
  time = time(stlm_forecast$mean),
  lo95 = stlm_forecast$lower[,2],
  hi95 = stlm_forecast$upper[,2]
)

# plot the models 95% PI bands
ggplot() +
  geom_ribbon(data = df_arima, aes(x = time, ymin = lo95, ymax = hi95, fill = "SARIMA 95% PI"), alpha = 0.5) +
  geom_ribbon(data = df_stlm, aes(x = time, ymin = lo95, ymax = hi95, fill = "STLM 95% PI"), alpha = 0.5) +
  autolayer(ts_test, series = "Actual") +
  labs(title = "Forecast Comparison: SARIMA vs STLM",
       x = "Time (Monthly)",
       y = "Electricity Consumption (GWh)") +
  scale_fill_manual(values = c("SARIMA 95% PI" = "#0072B2", "STLM 95% PI" = "#D55E00")) +
  scale_color_manual(values = c("Actual" = "black")) +
  theme_minimal()

Visually, both the SARIMA and STLM forecasts maintain intervals that consistently cover the actual values in the test set, indicating calibrated uncertainty estimates. The width of the intervals remains stable over time.

# accuracy of 95% PI
inside_95_arima <- ts_test >= arima_forecast$lower[,2] & ts_test <= arima_forecast$upper[,2]
inside_95_stlm <- ts_test >= stlm_forecast$lower[,2] & ts_test <= stlm_forecast$upper[,2]

cat("The SARIMA has", mean(inside_95_arima)*100,"% of the real values inside its 95% PI.","\n",
    "The STLM has", mean(inside_95_stlm)*100,"% of the real values inside its 95% PI")
## The SARIMA has 100 % of the real values inside its 95% PI. 
##  The STLM has 100 % of the real values inside its 95% PI

The test observations fall within the 95% intervals at every time point. Observed coverage of 100% is consistent with, and slightly above, the nominal 95% level, suggesting that both models produce reliable, if mildly conservative, interval forecasts.

# avg width of 95% PI
width_95_arima <- arima_forecast$upper[,2] - arima_forecast$lower[,2]
width_95_stlm <- stlm_forecast$upper[,2] - stlm_forecast$lower[,2]

cat("The SARIMA 95% PI has an avg width of", mean(width_95_arima),"\n",
    "The STLM 95% PI has an avg width of", mean(width_95_stlm))
## The SARIMA 95% PI has an avg width of 4378.183 
##  The STLM 95% PI has an avg width of 3639.196

The STLM model produces narrower 95% PI on average compared to SARIMA. This means that the STLM model yields forecasts with lower uncertainty and higher precision. Narrower PIs indicate that the model is more confident in its predictions, which is desirable when coverage is not sacrificed.

Model Choice

The chosen model is the STLM with ARIMA applied to the seasonally adjusted component, justified by the following points:

  • Its residual diagnostics are cleaner: the Ljung-Box test shows no significant autocorrelation (p-value = 0.2003, versus 0.0513 for the SARIMA model).
  • Its test-set accuracy is marginally better on most metrics (ME, MAE, MPE, MAPE, MASE), with essentially equal RMSE.
  • Its 95% prediction intervals are narrower on average (3639 vs 4378) while still covering 100% of the test observations.

These factors support the selection of STLM(ARIMA) as the final model for forecasting monthly electricity consumption.