A short description of various outlier detection methods for time-series data and their implementations.
Z-Score Outlier Detection
One of the simplest ways to automate outlier identification for machine learning is to implement a z-score confidence band approach.
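Assuming you start with a series of annual prices $p_1, \dots, p_n$, take the differences between consecutive prices to get the annual returns $r_1, \dots, r_{n-1}$, where $r_i = p_{i+1} - p_i$. Then compute the mean and the standard deviation of the returns, both taken over the $n-1$ returns:
$\mu = \frac{1}{n-1} \sum_{i=1}^{n-1} r_i$ and $\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n-1} (r_i - \mu)^2}$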
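The two statistics above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `returns_mean_std` and the price values are made up for the example, and returns are taken as simple differences as described in the text (percentage or log returns could be substituted).

```python
import math

def returns_mean_std(prices):
    """Return (returns, mu, sigma) for a price series, using first differences."""
    # r_i = p_{i+1} - p_i, so a series of n prices yields n - 1 returns
    returns = [b - a for a, b in zip(prices, prices[1:])]
    m = len(returns)
    mu = sum(returns) / m
    # Population standard deviation: divide by the number of returns
    sigma = math.sqrt(sum((r - mu) ** 2 for r in returns) / m)
    return returns, mu, sigma

prices = [100.0, 102.0, 101.0, 105.0, 104.0]  # hypothetical annual prices
returns, mu, sigma = returns_mean_std(prices)
print(returns, mu, sigma)
```

Dividing by the number of returns gives the population standard deviation. For short series, the sample version (dividing by one less) may be preferable.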
To identify outliers, build a confidence band around the mean return whose width you can adjust. Choose a threshold value that controls the width of the band, and call it z for z-score. A return is then flagged as an outlier whenever it falls outside the upper or lower bound of the band, i.e.,
$r_i > \mu + z\sigma$ or $r_i < \mu - z\sigma$
Adjusting the z-score widens or narrows the band, so you control how extreme a return must be, relative to the rest of the distribution, before it is flagged. Plotting the returns together with the band lets you see which points were identified.
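As a rough sketch of the banding rule, the function below flags every return that falls outside the band for a chosen z. The name `zscore_outliers` and the sample returns are hypothetical, chosen only to illustrate the idea.

```python
import math

def zscore_outliers(returns, z=3.0):
    """Return the indices of returns lying outside mu +/- z * sigma."""
    m = len(returns)
    mu = sum(returns) / m
    sigma = math.sqrt(sum((r - mu) ** 2 for r in returns) / m)  # population sigma
    upper, lower = mu + z * sigma, mu - z * sigma
    return [i for i, r in enumerate(returns) if r > upper or r < lower]

# Hypothetical returns with one obvious spike at the end
sample = [1.0, -1.0, 0.5, -0.5, 1.0, -1.0, 0.5, -0.5, 0.0, 10.0]
flagged = zscore_outliers(sample, z=2.0)
print(flagged)                          # indices flagged with z = 2
print(zscore_outliers(sample, z=3.0))   # a wider band flags fewer points
```

In this sample the spike is flagged at z = 2 but not at z = 3, because the spike itself inflates both the mean and the standard deviation. With short series, a large z may therefore never trigger.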
Choose the z-score value based on the probability of the events you want to capture. The standard correspondence between z-scores and probabilities assumes a normal distribution, but stock return distributions are usually leptokurtic, meaning they have excess kurtosis or "fat tails", so using higher z-score values is recommended.
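Under the normal assumption, the share of observations expected inside the band for a given z can be computed from the error function. The sketch below uses only Python's standard `math.erf`; the helper name `normal_coverage` and the list of z values are illustrative choices.

```python
import math

def normal_coverage(z):
    """Probability that a normal variable lies within z standard deviations of its mean."""
    return math.erf(z / math.sqrt(2))

for z in (1.0, 1.96, 2.0, 2.58, 3.0):
    inside = normal_coverage(z)
    print(f"z = {z:<4}  inside band: {inside:.4%}  outside band: {1 - inside:.4%}")
```

These figures hold only for normally distributed returns. With fat-tailed data, more observations fall outside the band than the normal figures suggest.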