Page 1 of 4

BENCHMARKING PROBABILISTIC MACHINE LEARNING MODELS FOR ARCTIC SEA ICE FORECASTING

Sahara Ali, Seraj Al Mahmud Mostafa, Xingyan Li, Sara Khanjani, Jianwu Wang, James Foulds, Vandana Janeja

University of Maryland Baltimore County, MD 21250, USA

ABSTRACT

The Arctic is a region with unique climate features, motivating new AI methodologies to study it. Unfortunately, Arctic sea ice has seen a continuous decline since 1979. This not only poses a significant threat to Arctic wildlife and surrounding coastal communities but also adversely affects global climate patterns. To study the potential of AI in tackling climate change, we analyze the performance of four probabilistic machine learning methods in forecasting sea-ice extent for lead times of up to 6 months, further comparing them with traditional machine learning methods. Our comparative analysis shows that Gaussian Process Regression is a good fit for predicting sea-ice extent at longer lead times, with the lowest RMSE.

Index Terms— Arctic sea ice, climate change, probabilistic machine learning, Gaussian Process Regression

1. INTRODUCTION

The enormous areas of Arctic ice and snow are responsible for reflecting sunlight back to space, which keeps the planet cool and regulates global and regional weather patterns [11]. However, Arctic sea ice has seen a continuous decline since 1979 and is now half of what it was in 1970. It is important to predict fluctuations in Arctic sea ice by modeling weather patterns, as this can improve our understanding of the potential changes facing the global climate. Since climate data exhibits strong spatiotemporal correlations, machine learning models have shown promising results in spatiotemporal data mining, leading to short- and long-term weather forecasting [10]. To convince physicists of the potential of these data-driven approaches, however, quantifying model uncertainty remains a major challenge. Probabilistic machine learning models, particularly Bayesian models, provide a principled approach to quantifying uncertainty. Though recent data-driven approaches have shown promising results in sea-ice forecasting, they still struggle with sub-seasonal forecasting at longer lead times [2, 3]. We therefore present a rigorous study of four probabilistic models and two baseline models for forecasting sea-ice extent at multiple lead times, further proposing directions to quantify epistemic uncertainty

This work is partially supported by grants OAC-1942714 and OAC-2118285 from the National Science Foundation.

in model predictions. Our results illustrate the value of the probabilistic modeling approach in this context. We have open-sourced our work on GitHub.1 The contributions of our work include: (i) customizing probabilistic machine learning models to forecast sea-ice extent (SIE) at lead times of 1 to 6 months; (ii) performing a rigorous study of the accuracy versus lead-time tradeoff for the implemented models; (iii) benchmarking the performance of probabilistic and standard models to assist researchers in forecasting sea-ice variations.

2. METHODOLOGY

In this section, we will first explain the problem definition

and then provide details on the dataset used and the methods

implemented. The end-to-end pipeline of the benchmarking

process is illustrated in Figure 1. A high-level summary of

our comparative analysis is given in Table 1.

Fig. 1. End-to-end pipeline of our benchmarking experiments.

2.1. Problem Definition

Owing to the fluctuations in sea ice over the summer and winter seasons, accurate sub-seasonal forecasting is a great challenge. To tackle this challenge, we compare the predictions of all six models at lead times of 1 to 6 months. We specifically address the following problem: given N months of historic meteorological and sea-ice data X, learn a probabilistic function to forecast sea-ice extent Y for the next M months in the future.

Y_{t+m} = f(X_{t-n}, X_{t-n+1}, . . . , X_t)    (1)
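As a concrete illustration of this formulation, the sketch below (in Python with NumPy; the function name and default window length are our own, not from the paper) builds lagged training pairs: each input stacks the past n months of features, and the target is the sea-ice extent a given lead time ahead.

```python
import numpy as np

def make_supervised(X, y, n_lags=12, lead=1):
    """Build (windowed inputs, lead-time targets) pairs from monthly data.

    X: (T, d) array of monthly predictor values; y: (T,) array of SIE.
    Each sample stacks months t-n_lags+1 .. t and is paired with y[t+lead].
    """
    samples, targets = [], []
    for t in range(n_lags - 1, len(y) - lead):
        samples.append(X[t - n_lags + 1 : t + 1].ravel())
        targets.append(y[t + lead])
    return np.asarray(samples), np.asarray(targets)
```

Training one model per lead time then amounts to calling this once for each value of `lead` from 1 to 6 on the same series.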

1https://github.com/big-data-lab-umbc/sea-ice-prediction


Table 1. Comparison of predictive models and their important characteristics

| Method | Purpose | Variables | Type | Probabilistic | Uncertainty Quantification |
| MLR | Regression technique; uses more than one independent variable to predict a response [1]. | Multivariate | ML | No | Indirectly via model ensembling |
| LSTM | Sequence prediction or time-series forecasting; predicts future values over large time periods from a sequence of data [5]. | Multivariate | DL | No | Indirectly via dropout layers or model ensembling |
| GRNN | Regression technique; non-parametric method that provides estimates of continuous variables and function approximations [9]. | Multivariate | DL | Yes | Yes |
| GPR | Regression technique; computes predictive distribution using a Bayesian approach [8]. | Multivariate | ML | Yes | Yes |
| HMM | Sequence prediction; capable of observing unknown facts using a Markov process that contains hidden parameters [4]. | Univariate | ML | Yes | Indirectly via model ensembling |
| BLR | Regression technique; outcome is drawn from a probability distribution instead of point estimates [7]. | Multivariate | ML | Yes | Yes |

2.2. Dataset

This study utilizes 512 temporal records of observational data spanning 42 years (1979-2021) over the Arctic region. These records include monthly mean values of sea-ice extent (SIE) from Nimbus-7 SMMR and DMSP SSM/I-SSMIS passive microwave data version 2,2 and 9 meteorological data variables obtained from the ERA-5 global reanalysis product.3 The choice and details of these variables are presented in our previous causal discovery study on Arctic sea ice [6]. To conduct our experiments, we first combined all the raw variable datasets into a single temporal and spatial resolution, i.e., monthly means and 1-degree (180 lat. × 360 lon.) spatial resolution. Next, we replaced missing sea-ice values with interpolated values and replaced other missing values with 0. Finally, we normalized the data using the MinMax normalization technique.
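The preprocessing steps above can be sketched in a few lines of NumPy; this is our illustrative reconstruction (function name and array layout are assumptions), showing linear interpolation for missing SIE, zero-filling for the other variables, and per-column MinMax scaling to [0, 1].

```python
import numpy as np

def preprocess(sie, others):
    """Fill gaps and MinMax-normalize monthly series to [0, 1].

    sie: (T,) sea-ice extent with NaNs -> linear interpolation.
    others: (T, d) meteorological variables with NaNs -> 0.
    """
    sie = sie.copy()
    nan = np.isnan(sie)
    # Interpolate missing SIE values from the surrounding observed months.
    sie[nan] = np.interp(np.flatnonzero(nan), np.flatnonzero(~nan), sie[~nan])
    # Other missing meteorological values are replaced with 0.
    others = np.nan_to_num(others, nan=0.0)
    data = np.column_stack([sie, others])
    # MinMax normalization per variable (guarding against constant columns).
    lo, hi = data.min(axis=0), data.max(axis=0)
    return (data - lo) / np.where(hi > lo, hi - lo, 1.0)
```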

2.3. Baseline Models

We implemented two widely used predictive models as baseline methods to compare with our probabilistic models: Multiple Linear Regression (MLR) and the Long Short-Term Memory (LSTM) model.

2.3.1. Multiple Linear Regression (MLR)

We trained a Multiple Linear Regression model to predict sea-ice extent values with a sequential split on the data, using all 10 variables (including sea-ice extent) as input features. The predicted sea-ice extent values are for a lead time of M months, where M ranges from 1 to 6. Six MLR models are trained, one for each lead time.

2NSIDC (nsidc.org/data/NSIDC-0051)

3ECMWF (cds.climate.copernicus.eu/cdsapp!/home)
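The one-model-per-lead-time setup can be sketched with scikit-learn; this is a minimal reconstruction under our own function name and data layout, not the paper's released code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def train_mlr_per_lead(X, y, leads=range(1, 7)):
    """Fit one MLR per lead time on chronologically aligned data.

    X: (T, 10) monthly features incl. SIE; y: (T,) SIE. Returns {lead: model}.
    """
    models = {}
    for m in leads:
        # Pair each feature row at month t with the target m months ahead.
        models[m] = LinearRegression().fit(X[:-m], y[m:])
    return models
```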

2.3.2. Long Short Term Memory Model (LSTM)

LSTM is a variant of the Recurrent Neural Network (RNN) used for time-series data analysis and forecasting. Our LSTM-based network comprises two many-to-many LSTM layers, one many-to-one LSTM layer, one dropout layer, and three fully-connected layers. Like MLR, we trained six LSTM models, one for each of the 6 lead times, and optimized them using the Adam optimizer.

2.4. Probabilistic Models

We implemented the following four probabilistic models for forecasting sea-ice extent at lead times of 1 to 6 months.

2.4.1. General Regression Neural Network (GRNN)

GRNN is a modified form of a radial basis function (RBF) network that estimates values of continuous variables using nonparametric estimators of probability density functions (PDFs) [9]. GRNN uses an extra summation layer and weight connections between the hidden and output layers. The summation layer takes the input from the previous layers and sums those values together to form the PDF using a Parzen window. GRNN uses Equation 2, where y is the estimator output, x is the estimator input vector, E[y|x] is the expected value of the output, and f(x, y) is the joint PDF of x and y:


E[y|x] = ∫_{-∞}^{∞} y f(x, y) dy / ∫_{-∞}^{∞} f(x, y) dy.    (2)
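With a Gaussian Parzen kernel, the ratio of integrals in Equation 2 reduces to a kernel-weighted average of the training targets. The sketch below is our minimal NumPy rendering of that estimator (function name and bandwidth value are assumptions, not the paper's implementation).

```python
import numpy as np

def grnn_predict(X_train, y_train, X_test, sigma=0.5):
    """GRNN estimate of E[y|x] (Eq. 2) with Gaussian Parzen kernels.

    Each prediction is a kernel-weighted average of the training targets,
    i.e. the ratio of the two integrals in Eq. 2 under a Parzen estimate.
    """
    # Squared distances between every test and training point: (n_test, n_train).
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma**2))  # Parzen window weights
    return (w @ y_train) / w.sum(axis=1)
```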

2.4.2. Hidden Markov Model (HMM)

HMM is a statistical model able to observe hidden events using a Markov process. It consists of (i) a sequence of observable variables, (ii) a sequence of hidden states, (iii) a transition matrix, and (iv) an emission matrix explaining the distribution of observed variables generated from the hidden states. The transition matrix is composed of the probability of each hidden state transitioning to another, while the emission matrix comprises the probabilities of each observed variable being generated by the corresponding hidden states. We implemented the GaussianHMM variant of the model, as it handles continuous distributions over the observation sequence. The number of hidden states is kept at six, as the sea-ice extent of six lead times needs to be estimated, and the fractional changes are taken as features.

2.4.3. Gaussian Process Regression (GPR)

We find GPR a good fit for our problem since it is a popular non-parametric probabilistic model for regression with a Bayesian approach. Instead of calculating the probability distribution of the parameters of a specific function, GPR calculates the probability distribution over all admissible functions that fit the data. Therefore, when we fit GPR on our training data, it estimates the posterior distribution. Given features X and outcome y, GPR calculates the posterior distribution using Bayes' rule as given in Equation 3, where w represents the learnt model parameters. While fitting GPR, the log-marginal-likelihood (LML) is optimized, which helps learn the parameters of the kernel. We implemented GPR by setting the prior mean of the GP to the training data's mean, and repeated the training procedure 10 times to prevent the LML optimization from being trapped in local optima.

p(w|y, X) = p(y|X, w) p(w) / p(y|X)    (3)
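The procedure above maps closely onto scikit-learn's GaussianProcessRegressor; the sketch below is our illustrative reconstruction, where the kernel choice and helper names are assumptions rather than details from the paper. Subtracting the training mean emulates setting the GP prior mean to the training data's mean, since scikit-learn assumes a zero-mean prior; the 10 optimizer restarts mirror the repeated training described above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_gpr(X, y):
    """Fit a GP with the LML optimization restarted 10 times."""
    y_mean = y.mean()
    gpr = GaussianProcessRegressor(
        kernel=RBF() + WhiteKernel(),       # assumed kernel, not from the paper
        n_restarts_optimizer=10,            # re-run LML optimization 10 times
    ).fit(X, y - y_mean)
    return gpr, y_mean

# Predictive mean and per-point standard deviation (uncertainty):
#   mu, sd = gpr.predict(X_test, return_std=True); mu += y_mean
```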

2.4.4. Bayesian Linear Regression (BLR)

Lastly, we implemented BLR to test its usability in the problem domain. In contrast to MLR, here both the output y and the model parameters are drawn from a normal (Gaussian) distribution characterized by a mean and variance. First, we specified normal priors for the model parameters; then we created a model mapping the training inputs to the training outputs; finally, we used a Markov Chain Monte Carlo (MCMC) algorithm to draw 500 samples from the posterior distribution of the model parameters. The end result is an estimate of the posterior distribution of the parameters, from which we can compute predicted outcomes for a given test dataset.
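The paper samples this posterior with MCMC; as an illustrative shortcut, with a Gaussian prior and a known noise variance the same posterior is available in closed form, and drawing 500 weight samples from it yields the posterior predictive mean and spread. The function names and hyperparameter values below are our own assumptions.

```python
import numpy as np

def blr_posterior(X, y, prior_var=10.0, noise_var=1.0):
    """Closed-form Gaussian posterior over weights for Bayesian linear regression.

    Sigma = (X'X/noise_var + I/prior_var)^-1,  mu = Sigma X'y / noise_var.
    """
    d = X.shape[1]
    precision = X.T @ X / noise_var + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_var
    return mean, cov

def blr_predict(X_test, mean, cov, n_samples=500, seed=0):
    """Posterior predictive mean/std via 500 weight samples, as in the paper."""
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(mean, cov, size=n_samples)  # (n_samples, d)
    preds = X_test @ W.T
    return preds.mean(axis=1), preds.std(axis=1)
```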

2.5. Uncertainty Quantification

A valuable feature of probabilistic models is their ability to predict probability distributions instead of point values. We can estimate model uncertainty by computing the variance and standard deviation of the posterior predictive distributions given by the models. In the case of non-probabilistic models, the epistemic uncertainty can be quantified by ensembling predictions from multiple copies of the same model.
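The ensembling route for non-probabilistic models amounts to a mean and a per-point spread over the copies' predictions; a minimal sketch (function name ours):

```python
import numpy as np

def ensemble_uncertainty(predictions):
    """Epistemic uncertainty from an ensemble of model predictions.

    predictions: (n_models, n_test) array, one row per trained copy.
    Returns the ensemble mean and the per-point standard deviation,
    which serves as the uncertainty estimate for non-probabilistic models.
    """
    predictions = np.asarray(predictions)
    return predictions.mean(axis=0), predictions.std(axis=0)
```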

3. RESULTS AND DISCUSSION

We trained all models on the first 40 years of data and tested them on the last 30 months, that is, from February 2019 to August 2021. We evaluated their performance by calculating the Root Mean Square Error (RMSE) in million square kilometers. To account for the scale of these values, we normalized the RMSE scores by dividing them by the mean of the observed sea-ice extent, ȳ. The lower the RMSE and NRMSE values, the better the predictive performance. We further investigated the variation in predicted SIE values explained by the input features using the coefficient of determination, also known as the R-Squared or R2 score. The higher the R2 score, the better the performance.
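The three metrics can be written down directly from their definitions; a minimal NumPy sketch (function names ours):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def nrmse(y_true, y_pred):
    """RMSE divided by the mean observed sea-ice extent (y-bar)."""
    return rmse(y_true, y_pred) / float(np.mean(y_true))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```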

Looking at the results in Table 2, it is evident that the prediction error increases with increasing lead time. The increasing RMSE implies that the models' performance is greatly affected by seasonal patterns; however, we noticed different patterns in the performance of different models.

Evaluating the performance of the deep learning models, we see that GRNN outperformed LSTM, one of the widely used methods for Arctic sea-ice prediction [2]. This implies that GRNN can be a good alternative for learning non-linearities in the data along with uncertainty prediction over longer lead times. In the case of HMM, as it follows the forward-backward algorithm, the steady climb in the RMSE scores makes sense because the prediction of each observation strongly depends on the previous states; this is also evident from the negative R2 scores. Further, we observed that GPR improves from one lead time to the next, which makes it the best fit for ice-extent prediction at different lead times. This commendable performance could be due to GPR's underlying Bayesian approach, which infers a probability distribution over all possible values irrespective of the sequential patterns. However, BLR performed against our expectations and had the highest error for all lead times, making it unfit for the problem domain. We did not investigate BLR's failure to reach convergence, but it can be inspected further in future studies. Qualitatively, we observed in Figure 2 that both GPR and GRNN better captured the peak values during the summer and winter seasons; however, GPR outperformed GRNN at the lead time of 6 months.

4. CONCLUSION AND FUTURE WORK

In this paper, we compared six of the renowned probabilistic

and standard machine learning models suitable for forecast-


Table 2. RMSE, NRMSE and R-Squared scores for all models for lead times of 1 to 6 months.

RMSE (million km²)

| Lead time (months) | 1 | 2 | 3 | 4 | 5 | 6 |
| MLR | 0.4331 | 0.8928 | 1.4292 | 1.7148 | 2.0042 | 2.1663 |
| HMM | 1.6535 | 3.3911 | 4.7954 | 6.3511 | 7.0038 | 7.2644 |
| GPR | 0.4523 | 0.4514 | 0.4055 | 0.3985 | 0.4007 | 0.3936 |
| LSTM | 0.6274 | 1.1022 | 1.4313 | 1.5610 | 1.7222 | 1.7636 |
| GRNN | 0.5784 | 0.8048 | 1.0311 | 1.2550 | 1.5549 | 1.8162 |
| BLR | 8.0842 | 6.5638 | 8.4614 | 7.2429 | 8.3365 | 6.6301 |

NRMSE

| Lead time (months) | 1 | 2 | 3 | 4 | 5 | 6 |
| MLR | 0.0412 | 0.0850 | 0.1361 | 0.1633 | 0.1908 | 0.2062 |
| HMM | 0.1558 | 0.3201 | 0.4540 | 0.6038 | 0.6715 | 0.7018 |
| GPR | 0.0430 | 0.0720 | 0.0978 | 0.1113 | 0.1197 | 0.1256 |
| LSTM | 0.0697 | 0.1060 | 0.1338 | 0.1561 | 0.1517 | 0.2025 |
| GRNN | 0.0550 | 0.0766 | 0.0981 | 0.1195 | 0.1480 | 0.1729 |
| BLR | 0.7618 | 0.5307 | 0.7973 | 0.6825 | 0.7856 | 0.6248 |

R-Squared

| Lead time (months) | 1 | 2 | 3 | 4 | 5 | 6 |
| MLR | 0.985 | 0.937 | 0.839 | 0.769 | 0.684 | 0.631 |
| HMM | 0.783 | 0.081 | -0.860 | -2.286 | -2.931 | -3.066 |
| GPR | 0.983 | 0.955 | 0.917 | 0.892 | 0.875 | 0.863 |
| LSTM | 0.958 | 0.903 | 0.845 | 0.789 | 0.801 | 0.645 |
| GRNN | 0.973 | 0.949 | 0.916 | 0.876 | 0.810 | 0.740 |
| BLR | -4.815 | -2.834 | -5.371 | -3.668 | -5.184 | -2.911 |

Fig. 2. Time-series predictions from five models versus the observations for lead times of 1 to 6 months.

ing sea-ice extent at greater lead times. We conclude that: (i) probabilistic models can be a good alternative to deep learning models for small datasets; (ii) all models' performance decreases as lead time increases, however, GPR outperformed MLR and LSTM with a 12% increase in R2 score, making it the best fit for ice-extent prediction at greater lead times; (iii) HMM and BLR are not suitable for seasonal forecasting at greater lead times. In the future, we plan to extend the benchmarking to spatiotemporal data, incorporating higher dimensions in the performance evaluation.

5. REFERENCES

[1] L. S. Aiken, S. G. West, and S. C. Pitts. Multiple linear

regression. Handbook of Psychology, pages 481–507,

2003.

[2] S. Ali, Y. Huang, X. Huang, and J. Wang. Sea Ice Forecasting using Attention-based Ensemble LSTM. ArXiv e-prints, page arXiv:2108.00853, July 2021.

[3] T. R. Andersson, J. S. Hosking, M. Pérez-Ortiz, B. Paige, A. Elliott, C. Russell, S. Law, D. C. Jones, J. Wilkinson, T. Phillips, et al. Seasonal Arctic sea ice forecasting with probabilistic deep learning. Nature Communications, 12(1):1–12, 2021.

[4] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.

[5] S. Hochreiter and J. Schmidhuber. Long short-term

memory. Neural Computation, 9(8):1735–1780, 1997.

[6] Y. Huang, M. Kleindessner, A. Munishkin, D. Varshney,

P. Guo, and J. Wang. Benchmarking of Data-Driven

Causality Discovery Approaches in the Interactions of

Arctic Sea Ice and Atmosphere. Frontiers in Big Data,

4, 2021.

[7] T. Minka. Bayesian linear regression. Technical report,

Citeseer, 2000.

[8] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959, 2005.

[9] D. F. Specht et al. A general regression neural network.

IEEE Transactions on Neural Networks, 2(6):568–576,

1991.

[10] S. Wang, J. Cao, and P. Yu. Deep Learning for Spatio-Temporal Data Mining: A Survey. IEEE Transactions on Knowledge and Data Engineering, pages 1–1, 2020.

[11] M. Wendisch, A. Macke, A. Ehrlich, C. Lüpkes, M. Mech, D. Chechin, K. Dethloff, C. B. Velasco, H. Bozem, M. Brückner, et al. The Arctic cloud puzzle. Bulletin of the American Meteorological Society, 100(5):841–871, 2019.