Time-series outlier detection using Prophet on weather data
The Prophet outlier detector uses the Prophet time series forecasting package. The underlying Prophet model is a decomposable univariate time series model combining trend, seasonality and holiday effects. The model forecast also includes an uncertainty interval around the estimated trend component using the MAP estimate of the extrapolated model. Alternatively, full Bayesian inference can be done at the expense of increased compute. The upper and lower values of the uncertainty interval can then be used as outlier thresholds for each point in time. First, the distance from the observed value to the nearest uncertainty boundary (upper or lower) is computed. If the observation is within the boundaries, the outlier score equals the negative distance; as a result, the outlier score is lowest when the observation equals the model prediction. If the observation is outside of the boundaries, the score equals the distance and the observation is flagged as an outlier. One of the main drawbacks of the method, however, is that the model needs to be refitted as new data comes in, which is undesirable for applications with high throughput and real-time detection.
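In other words, writing the uncertainty interval as $[\hat{y}_{\text{lower}}, \hat{y}_{\text{upper}}]$, the score of an observation $y$ can be summarised compactly as (a restatement of the rule above, not a formula taken verbatim from the library):

$$
s(y) = \max\big(y - \hat{y}_{\text{upper}},\; \hat{y}_{\text{lower}} - y\big),
$$

which is negative inside the interval and positive, triggering an outlier flag, outside it.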
Note
To use this detector, first install Prophet by running:
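Assuming the detector is the `OutlierProphet` detector shipped with the alibi-detect library, the optional extra below is the likely install command:

```bash
pip install alibi-detect[prophet]
```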
This will install Prophet and its major dependency PyStan. PyStan is currently only partly supported on Windows. If this detector is to be used on a Windows system, it is recommended to manually install (and test) PyStan before running the command above.
The example uses a weather time series dataset recorded by the Max Planck Institute for Biogeochemistry. The dataset contains 14 different features such as air temperature, atmospheric pressure, and humidity. These were collected every 10 minutes, beginning in 2003. Like the TensorFlow time-series tutorial, we only use data collected between 2009 and 2016.
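A loading sketch, assuming the data are the Jena climate recordings mirrored as `jena_climate_2009_2016.csv.zip` on the TensorFlow datasets bucket; the URL, file names and column names are assumptions:

```python
import zipfile
from urllib.request import urlretrieve

import pandas as pd

# download and extract the Jena climate data (2009-2016, 10-minute resolution)
url = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip'
zip_path, _ = urlretrieve(url, 'jena_climate_2009_2016.csv.zip')
with zipfile.ZipFile(zip_path) as z:
    z.extractall('.')

df = pd.read_csv('jena_climate_2009_2016.csv')
df['Date Time'] = pd.to_datetime(df['Date Time'], format='%d.%m.%Y %H:%M:%S')
print(df.shape)
df.head()
```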
Select subset to test Prophet model on:
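For example, keep an initial slice of the 10-minute observations so that fitting Prophet stays fast; the cut-offs below are arbitrary choices:

```python
n_prophet = 10000  # observations used to fit the detector (arbitrary cut-off)
n_test = 1000      # subsequent observations held out for outlier prediction

df_sub = df[:n_prophet + n_test].reset_index(drop=True)
```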
The Prophet model expects a DataFrame with two columns: one named `ds` with the timestamps and one named `y` with the time series to be evaluated. We will just look at the temperature data:
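A sketch of that format, assuming the raw timestamp and temperature columns are called `Date Time` and `T (degC)`:

```python
import pandas as pd

# Prophet requires the columns to be called 'ds' (timestamps) and 'y' (values)
d = {'ds': df_sub['Date Time'], 'y': df_sub['T (degC)']}
df_T = pd.DataFrame(data=d)
df_T.head()
```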
We train an outlier detector from scratch:
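A minimal training sketch, assuming the `OutlierProphet` detector from `alibi_detect.od`; `threshold` is taken to be the width of the uncertainty interval and the value below is purely illustrative:

```python
from alibi_detect.od import OutlierProphet

df_train = df_T[:n_prophet]

# threshold is assumed to set the width of the uncertainty interval
# (e.g. 0.8 for an 80% interval) used as the outlier boundary
od = OutlierProphet(threshold=.8)
od.fit(df_train)
```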
Define the test data. It is important that the timestamps of the test data follow the training data. We check this below by comparing the first few rows of the test DataFrame with the last few of the training DataFrame:
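One way to define the hold-out window and compare the boundary timestamps:

```python
df_test = df_T[n_prophet:]

# the last training timestamps should immediately precede the first test timestamps
print(df_train.tail())
print(df_test.head())
```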
Predict outliers on test data:
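A sketch of the prediction call; the `return_instance_score` and `return_forecast` flags are assumed to return the raw outlier scores and the underlying Prophet forecast alongside the outlier labels:

```python
od_preds = od.predict(
    df_test,
    return_instance_score=True,  # assumed flag: also return the raw outlier scores
    return_forecast=True         # assumed flag: also return the underlying Prophet forecast
)
```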
We can first visualize our predictions with Prophet's built-in plotting functionality. This also allows us to include historical predictions:
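A sketch using Prophet's own plotting utilities, assuming the fitted Prophet model is exposed as `od.model`:

```python
# forecast over both the training history and the test window
future = od.model.make_future_dataframe(periods=len(df_test), freq='10min', include_history=True)
forecast_full = od.model.predict(future)
fig = od.model.plot(forecast_full)
```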
It is clear that the further we predict into the future, the wider the uncertainty intervals that determine the outlier threshold become.
Let's overlay the actual data with the predicted upper and lower outlier thresholds and check where outliers were predicted:
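An overlay sketch with matplotlib; for illustration, the outlier flag is derived directly from Prophet's `yhat_lower`/`yhat_upper` columns in the returned forecast rather than read from the detector's output structure:

```python
import matplotlib.pyplot as plt

forecast_test = od_preds['data']['forecast']  # assumed: Prophet forecast over the test window
y = df_test['y'].values

plt.figure(figsize=(15, 5))
plt.plot(df_test['ds'], y, label='T (degC)')
plt.plot(df_test['ds'], forecast_test['yhat_upper'].values, label='upper threshold')
plt.plot(df_test['ds'], forecast_test['yhat_lower'].values, label='lower threshold')

# observations outside the uncertainty interval are flagged as outliers
outside = (y > forecast_test['yhat_upper'].values) | (y < forecast_test['yhat_lower'].values)
plt.scatter(df_test['ds'][outside], y[outside], color='red', label='outlier')
plt.xlabel('Time')
plt.ylabel('T (degC)')
plt.legend()
plt.show()
```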
Outlier scores and predictions:
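A sketch of the score plot; for illustration, the score is recomputed from the forecast following the definition given above (distance to the nearest uncertainty boundary, negative inside the interval) rather than taken from the detector's output format:

```python
import numpy as np
import matplotlib.pyplot as plt

# negative inside the interval, positive (= outlier) outside it
score = np.maximum(
    y - forecast_test['yhat_upper'].values,
    forecast_test['yhat_lower'].values - y
)

plt.figure(figsize=(15, 5))
plt.plot(df_test['ds'], score, label='outlier score')
plt.axhline(0., color='red', linestyle='--', label='outlier threshold (score > 0)')
plt.xlabel('Time')
plt.ylabel('Outlier score')
plt.legend()
plt.show()
```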
The outlier scores naturally trend down as uncertainty increases when we predict further into the future.
Let's look at some individual outliers:
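Building on the recomputed score above, the flagged observations can be collected and inspected individually:

```python
import pandas as pd

# collect test observations, predictions and scores in one frame for inspection
df_outlier = pd.DataFrame({
    'ds': df_test['ds'].values,
    'y': y,
    'yhat': forecast_test['yhat'].values,
    'yhat_lower': forecast_test['yhat_lower'].values,
    'yhat_upper': forecast_test['yhat_upper'].values,
    'score': score
})
df_outlier[df_outlier['score'] > 0].head()
```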
Please check out the detector's API documentation as well as the original Prophet documentation on how to customize the Prophet-based outlier detector: add seasonalities and holidays, opt for a saturating logistic growth model, or apply parameter regularization.
We can also plot the breakdown of the different components in the forecast. Since we did not do full Bayesian inference with `mcmc_samples`, the uncertainty intervals of the forecast are determined by the MAP estimate of the extrapolated trend.
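A sketch of the component plot, reusing the full-history forecast computed for the built-in plot above:

```python
# breakdown into trend and seasonality components of the forecast
fig = od.model.plot_components(forecast_full)
```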