Time Series Forecasting means analyzing and modeling time-series data to make future decisions. Some of the applications of Time Series Forecasting are weather forecasting, sales forecasting, business forecasting, stock price forecasting, etc. The ARIMA model is a popular statistical technique used for Time Series Forecasting. If you want to learn Time Series Forecasting with ARIMA, this article is for you. In this article, I will take you through the task of Time Series Forecasting with ARIMA using the Python programming language.
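ARIMA stands for Autoregressive Integrated Moving Average. It is an algorithm used for forecasting time series data. An ARIMA model has three parameters, written as ARIMA(p, d, q). Here p is the number of lagged values used in the autoregressive (AR) part, d is the number of times the series is differenced to make it stationary (the Integrated part), and q is the number of lagged forecast errors used in the moving average (MA) part.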
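As a rough illustration of the d parameter, here is a minimal pure-Python sketch of differencing. The function name `difference` and the sample prices are made up for this example; they are not part of any library.

```python
def difference(series, d=1):
    """Apply first-order differencing d times."""
    values = list(series)
    for _ in range(d):
        # Replace each value with its change from the previous value
        values = [curr - prev for prev, curr in zip(values, values[1:])]
    return values


# Hypothetical prices with a steady upward trend
prices = [100, 102, 104, 106, 108]
print(difference(prices, d=1))  # [2, 2, 2, 2]
print(difference(prices, d=2))  # [0, 0, 0]
```

One round of differencing turns the trending series into a constant one, which is the effect d = 1 is meant to have on a series with a trend.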
I hope you have now understood the ARIMA model. In the section below, I will take you through the task of Time Series Forecasting of stock prices with ARIMA using the Python programming language.
Now let’s start with the task of Time Series Forecasting with ARIMA. I will first collect Google stock price data from Yahoo Finance using the yfinance library. Here’s how to collect data about Google’s stock price:
import pandas as pd
import yfinance as yf
from datetime import date, timedelta

# Download window: the last 365 days up to today
today = date.today()
end_date = today.strftime("%Y-%m-%d")
start_date = (today - timedelta(days=365)).strftime("%Y-%m-%d")

data = yf.download('GOOG',
                   start=start_date,
                   end=end_date,
                   auto_adjust=False,  # keep the "Adj Close" column selected below
                   progress=False)
# Move the date index into a regular column
data["Date"] = data.index
data = data[["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]]
data.reset_index(drop=True, inplace=True)
print(data.tail())
          Date         Open         High          Low        Close    Adj Close   Volume
247 2022-06-13  2148.919922  2184.370117  2131.760986  2137.530029  2137.530029  1837800
248 2022-06-14  2137.800049  2169.149902  2127.040039  2143.879883  2143.879883  1274000
249 2022-06-15  2177.989990  2241.260010  2162.375000  2207.810059  2207.810059  1659600
250 2022-06-16  2162.989990  2185.810059  2115.850098  2132.719971  2132.719971  1765700
251 2022-06-17  2130.699951  2184.989990  2112.571045  2157.310059  2157.310059  2163500
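The download window used above is plain date arithmetic. Here is a minimal sketch of the same calculation with a fixed date instead of `date.today()`, so the result is reproducible. The helper name `one_year_window` is my own and is not part of yfinance.

```python
from datetime import date, timedelta


def one_year_window(today):
    """Return (start, end) as YYYY-MM-DD strings covering the 365 days up to `today`."""
    start = today - timedelta(days=365)
    return start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d")


# A fixed date, chosen only for illustration
start_date, end_date = one_year_window(date(2022, 6, 21))
print(start_date, end_date)  # 2021-06-21 2022-06-21
```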
We only need the date and close price columns for the rest of the task, so let’s select both columns and move on:
data = data[["Date", "Close"]]
print(data.head())
        Date        Close
0 2021-06-21  2529.100098
1 2021-06-22  2539.989990
2 2021-06-23  2529.229980
3 2021-06-24  2545.639893
4 2021-06-25  2539.899902
Now let’s visualize the close prices of Google before moving forward:
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.figure(figsize=(15, 10))
plt.plot(data["Date"], data["Close"])
Before using the ARIMA model, we have to figure out whether our data is stationary or seasonal. The plot of the closing prices above suggests that our dataset is not stationary. To check this properly, we can use seasonal decomposition, which splits the time series into trend, seasonal, and residual components:
from statsmodels.tsa.seasonal import seasonal_decompose
# `period` replaces the older `freq` argument in recent statsmodels releases
result = seasonal_decompose(data["Close"],
                            model='multiplicative', period=30)
fig = result.plot()
fig.set_size_inches(15, 10)
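To see roughly what a multiplicative decomposition does, here is a simplified pure-Python sketch. It is not the statsmodels implementation: it assumes an odd period so the moving average can be centered, and the function name and sample numbers are made up for illustration.

```python
def decompose_multiplicative(series, period):
    """Split a series into trend, seasonal, and residual factors (series = T * S * R).

    Simplified sketch: assumes an odd `period` and at least two full cycles of data.
    """
    n = len(series)
    half = period // 2
    # Trend: centered moving average, undefined (None) at both edges
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        trend[t] = sum(window) / period
    # Detrended ratios, grouped by position within the seasonal cycle
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] / trend[t])
    seasonal_index = [sum(b) / len(b) for b in buckets]
    # Normalize so the seasonal factors average to 1
    mean_index = sum(seasonal_index) / period
    seasonal_index = [s / mean_index for s in seasonal_index]
    seasonal = [seasonal_index[t % period] for t in range(n)]
    # Residual: whatever trend and seasonality do not explain
    residual = [
        None if trend[t] is None else series[t] / (trend[t] * seasonal[t])
        for t in range(n)
    ]
    return trend, seasonal, residual


# Hypothetical series: a flat level of 100 with a repeating 3-step pattern
series = [90, 100, 110] * 4
trend, seasonal, residual = decompose_multiplicative(series, period=3)
print([round(s, 2) for s in seasonal[:3]])  # [0.9, 1.0, 1.1]
```

In this toy series the trend is flat and the residuals are all close to 1, so the seasonal factors carry the whole pattern.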
So our data is not stationary; it is seasonal. We need to use the Seasonal ARIMA (SARIMA) model for Time Series Forecasting on this data. But before using the SARIMA model, we will use the ARIMA model, so that you learn to use both. To use ARIMA or SARIMA, we need to find the p, d, and q values. In this walkthrough, we read the value of p from the autocorrelation plot of the Close column and the value of q from the partial autocorrelation plot. Note that the more common convention is the reverse (the partial autocorrelation plot suggests p and the autocorrelation plot suggests q), so treat the values chosen here as a starting point. The value of d is the number of times the series is differenced, usually 0 or 1. If the data is already stationary, we should use 0, and if it shows a trend or seasonality, as ours does, we should use 1. Now here’s how to find the value of p:
pd.plotting.autocorrelation_plot(data["Close"])
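For intuition about what the autocorrelation plot draws at each lag, here is a minimal pure-Python sketch of the sample autocorrelation. The function name and the toy series are assumptions for illustration; pandas does the equivalent calculation for every lag and plots the results.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    # Variance-like term at lag 0
    c0 = sum((x - mean) ** 2 for x in series) / n
    # Covariance between the series and itself shifted by `lag`
    ch = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    ) / n
    return ch / c0


# Hypothetical trending series: neighbouring values are strongly related
trending = list(range(1, 11))
print(round(autocorrelation(trending, 1), 4))  # 0.7
```

A slowly decaying autocorrelation like this is typical of a trending, non-stationary series.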
In the above autocorrelation plot, the curve is moving down after the 5th line of the first boundary. That is how to decide the value of p. Hence the value of p is 5. Now let’s find the value of q (moving average):
from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(data["Close"], lags=100)
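To make the idea of partial autocorrelation more concrete, here is a minimal sketch that derives partial autocorrelations from ordinary autocorrelations using the Durbin-Levinson recursion. This is one standard way to compute them and not necessarily the method `plot_pacf` uses by default. The function name and the AR(1) example values are assumptions for illustration.

```python
def pacf_from_acf(rho, max_lag):
    """Partial autocorrelations for lags 1..max_lag via the Durbin-Levinson recursion.

    rho[k] is the autocorrelation at lag k, with rho[0] == 1.
    """
    pacf = []
    prev = []  # coefficients of the order-(k-1) autoregression
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den  # partial autocorrelation at lag k
        # Update the lower-order coefficients, then append the new one
        current = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        current.append(phi_kk)
        pacf.append(phi_kk)
        prev = current
    return pacf


# Theoretical autocorrelations of an AR(1) process with coefficient 0.5
rho = [0.5 ** k for k in range(6)]
print(pacf_from_acf(rho, 3))  # [0.5, 0.0, 0.0]
```

For this AR(1) example, the partial autocorrelation is non-zero only at lag 1, even though the ordinary autocorrelation decays slowly over many lags.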
In the above partial autocorrelation plot, we can see that only two points are far away from all the others. That is how to decide the q value. Hence the value of q is 2. Now let’s build an ARIMA model:
p, d, q = 5, 1, 2
# Legacy ARIMA class: it matches the output below but was removed in statsmodels 0.13.
# On newer releases, import ARIMA from statsmodels.tsa.arima.model and call fit() without disp.
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(data["Close"], order=(p, d, q))
fitted = model.fit(disp=-1)
print(fitted.summary())
                             ARIMA Model Results
==============================================================================
Dep. Variable:                D.Close   No. Observations:                  251
Model:                 ARIMA(5, 1, 2)   Log Likelihood               -1328.041
Method:                       css-mle   S.D. of innovations             48.034
Date:                Tue, 21 Jun 2022   AIC                           2674.083
Time:                        06:12:58   BIC                           2705.812
Sample:                             1   HQIC                          2686.851
================================================================================
                    coef    std err          z      P>|z|     [0.025      0.975]
--------------------------------------------------------------------------------
const            -1.5031      2.251     -0.668      0.505     -5.914       2.908
ar.L1.D.Close     0.0443      0.243      0.182      0.856     -0.432       0.520
ar.L2.D.Close     0.7582      0.204      3.712      0.000      0.358       1.158
ar.L3.D.Close    -0.0690      0.079     -0.870      0.385     -0.224       0.086
ar.L4.D.Close    -0.0623      0.069     -0.901      0.369     -0.198       0.073
ar.L5.D.Close     0.0992      0.075      1.327      0.186     -0.047       0.246
ma.L1.D.Close    -0.0923      0.234     -0.394      0.694     -0.552       0.367
ma.L2.D.Close    -0.7388      0.191     -3.877      0.000     -1.112      -0.365
                                    Roots
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.1301           -0.0000j            1.1301           -0.0000
AR.2           -1.4091           -0.2578j            1.4325           -0.4712
AR.3           -1.4091           +0.2578j            1.4325            0.4712
AR.4            1.1583           -1.7339j            2.0852           -0.1563
AR.5            1.1583           +1.7339j            2.0852            0.1563
MA.1            1.1026           +0.0000j            1.1026            0.0000
MA.2           -1.2276           +0.0000j            1.2276            0.5000
-----------------------------------------------------------------------------
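The AIC and BIC values in the summary above can be checked by hand from the log likelihood. Here is a minimal sketch using the standard formulas. The parameter count of 9 (constant, 5 AR terms, 2 MA terms, and the innovation variance) is my assumption, chosen because it is consistent with the reported figures.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Log likelihood and observation count copied from the summary above
print(f"{aic(-1328.041, 9):.2f}")       # 2674.08
print(f"{bic(-1328.041, 9, 251):.2f}")  # 2705.81
```

Both results agree with the summary up to rounding of the log likelihood. Lower values indicate a better trade-off between fit and model size, which is how these criteria are used to compare models fitted to the same data.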
Here’s how to predict the values using the ARIMA model:
predictions = fitted.predict()
print(predictions)
2     -2.108482
3     -0.789990
4     -3.688940
5     -0.777623
6     -2.472432
         ...
247    2.866723
248    2.486679
249    7.659670
250    5.277199
251    8.960482
Length: 250, dtype: float64
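One way to read the small numbers printed above: with d = 1, the legacy `predict()` method works on the differenced series by default, so each value is a predicted day-to-day change and not a price. Here is a minimal sketch of turning one-step-ahead predicted changes back into price levels. The function name and all the numbers are hypothetical.

```python
def to_levels(actual, predicted_changes):
    """One-step-ahead level predictions: previous actual value + predicted change.

    predicted_changes[i] is the predicted change from actual[i] to actual[i + 1].
    """
    return [prev + change for prev, change in zip(actual, predicted_changes)]


# Hypothetical closing prices and predicted day-to-day changes
closes = [100.0, 103.0, 101.0]
changes = [2.0, -1.5]
print(to_levels(closes, changes))  # [102.0, 101.5]
```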
The predicted values do not look like prices because, with d = 1, the legacy predict() method returns predicted day-to-day changes by default rather than price levels (pass typ='levels' to get prices). A plain ARIMA model also has no seasonal terms, so it tends to perform poorly on seasonal time series data. So, here’s how to build a SARIMA model:
import statsmodels.api as sm
model = sm.tsa.statespace.SARIMAX(data['Close'],
                                  order=(p, d, q),
                                  seasonal_order=(p, d, q, 12))
model = model.fit()
print(model.summary())
                                 Statespace Model Results
======================================================================================
Dep. Variable:                           Close   No. Observations:                  252
Model:          SARIMAX(5, 1, 2)x(5, 1, 2, 12)   Log Likelihood               -1280.516
Date:                         Tue, 21 Jun 2022   AIC                           2591.032
Time:                                 06:15:00   BIC                           2643.179
Sample:                                      0   HQIC                          2612.046
                                         - 252
Covariance Type:                           opg
==============================================================================
                 coef    std err          z      P>|z|     [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.0803      3.857     -0.021      0.983     -7.639       7.479
ar.L2          0.9622      3.583      0.269      0.788     -6.060       7.984
ar.L3         -0.0029      0.182     -0.016      0.987     -0.360       0.354
ar.L4          0.0123      0.193      0.064      0.949     -0.365       0.390
ar.L5          0.0586      0.249      0.236      0.814     -0.429       0.546
ma.L1          0.0256      3.032      0.008      0.993     -5.918       5.969
ma.L2         -0.9726      2.979     -0.327      0.744     -6.811       4.866
ar.S.L12       0.2082      0.783      0.266      0.790     -1.327       1.743
ar.S.L24       0.1491      0.086      1.738      0.082     -0.019       0.317
ar.S.L36      -0.0226      0.182     -0.124      0.901     -0.379       0.334
ar.S.L48      -0.1415      0.089     -1.595      0.111     -0.315       0.032
ar.S.L60      -0.0981      0.132     -0.744      0.457     -0.356       0.160
ma.S.L12      -1.2637      0.717     -1.762      0.078     -2.669       0.142
ma.S.L24       0.2782      0.759      0.367      0.714     -1.210       1.766
sigma2      2203.0788   1934.635      1.139      0.255  -1588.737    5994.894
===================================================================================
Ljung-Box (Q):                       29.16   Jarque-Bera (JB):                21.53
Prob(Q):                              0.90   Prob(JB):                         0.00
Heteroskedasticity (H):               2.69   Skew:                             0.15
Prob(H) (two-sided):                  0.00   Kurtosis:                         4.44
===================================================================================
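The seasonal part of the model above uses a seasonal differencing order of 1 with a season length of 12, which means each value is compared with the value 12 steps earlier. Here is a minimal pure-Python sketch of that idea. The function names and the toy series are assumptions for illustration and use a season length of 4 to keep the example short.

```python
def difference(series):
    """Ordinary first difference: x[t] - x[t-1]."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


def seasonal_difference(series, season_length):
    """Seasonal difference: x[t] - x[t - season_length]."""
    return [
        series[t] - series[t - season_length]
        for t in range(season_length, len(series))
    ]


# Hypothetical series: a 4-step pattern that rises by 1 each cycle
series = [10, 20, 30, 40, 11, 21, 31, 41, 12, 22, 32, 42]
print(seasonal_difference(series, 4))  # [1, 1, 1, 1, 1, 1, 1, 1]

# One ordinary difference plus one lag-12 seasonal difference on 252 values
print(len(seasonal_difference(difference(list(range(252))), 12)))  # 239
```

The repeating pattern disappears after seasonal differencing, and each round of differencing shortens the series, so 252 observations shrink to 239 once both differences are applied.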
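Now let’s predict the future stock prices using the SARIMA model for roughly the next 10 days. Note that the end index is inclusive, so the call below returns 11 values (steps 252 to 262); use len(data)+9 as the end index if you want exactly 10:
predictions = model.predict(len(data), len(data)+10)
print(predictions)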
252    2155.450727
253    2174.383879
254    2138.454522
255    2118.298381
256    2117.235728
257    2112.857380
258    2099.387811
259    2085.703155
260    2117.912628
261    2133.935300
262    2168.589946
dtype: float64
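The forecast above is indexed by step number, not by date. Here is a minimal sketch of mapping forecast steps to calendar dates by skipping weekends. The helper name is my own, and it ignores market holidays, so some of the dates it returns may not be real trading days.

```python
from datetime import date, timedelta


def next_business_days(last_date, steps):
    """Return the next `steps` weekdays after `last_date` (market holidays are ignored)."""
    days = []
    current = last_date
    while len(days) < steps:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days.append(current)
    return days


# The last row of the data above is Friday, 2022-06-17
for day in next_business_days(date(2022, 6, 17), 3):
    print(day.isoformat())
# 2022-06-20
# 2022-06-21
# 2022-06-22
```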
Here’s how you can plot the predictions:
data["Close"].plot(legend=True, label="Training Data", figsize=(15, 10))
predictions.plot(legend=True, label="Predictions")
So this is how you can use ARIMA or SARIMA models for Time Series Forecasting using Python.
In short, if the data is stationary (or becomes stationary after differencing), we can use ARIMA, and if the data is seasonal, we need to use Seasonal ARIMA (SARIMA). I hope you liked this article about Time Series Forecasting with ARIMA using Python. Feel free to ask questions in the comments section below.