Outliers occur often in data sets. For example, consider cosmic rays in astronomical imaging: the image of your target galaxy can be affected by a cosmic ray in one of the five exposures you took in one night. As a result, when you compare the measured magnitude of your target galaxy in all the exposures, you will get measurements like this (all in magnitudes): 19.8, 20.1, 20.5, 17.0, 19.9. All of them fluctuate around magnitude 20, except the much brighter 17th magnitude measurement.
Normally, you would simply take the mean of these measurements to estimate the magnitude of your target with more precision. However, the 17th magnitude measurement above is clearly wrong and will significantly affect the mean: without it, the mean magnitude is 20.075, but with it, the mean is 19.46:
$ echo " 19.8 20.1 20.5 17 19.9" \
       | tr ' ' '\n' \
       | aststatistics --mean
1.94600000000000e+01

$ echo " 19.8 20.1 20.5 19.9" \
       | tr ' ' '\n' \
       | aststatistics --mean
2.00750000000000e+01
This difference of 0.61 magnitudes (or roughly 1.75 times in flux) is significant (for the definition of magnitudes in astronomy, see Brightness, Flux, Magnitude and Surface brightness). In the simple example above, you can visually identify the “outlier” and manually remove it. But in most common situations you will not be so lucky! For example, suppose you want to stack the five images of the five exposures above, and each image has \(4000\times4000\) (or 16 million!) pixels. Inspecting every pixel by hand is not possible in a reasonable time (it would take an average human’s lifetime!).
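To see where the factor quoted for the magnitude difference comes from, the two means can be converted into a flux ratio with the standard relation between magnitudes and flux, ratio = 10^(0.4 × Δm). The following is only a minimal sketch using plain awk instead of a Gnuastro program; the variable names (`mean_clean`, `mean_all`, `ratio`) are hypothetical and chosen for illustration.

```shell
# Convert the difference between the two mean magnitudes into a flux
# ratio, assuming the standard relation: ratio = 10^(0.4 * delta_mag).
mean_clean=20.075    # mean without the outlier (from the commands above)
mean_all=19.46       # mean with the outlier (from the commands above)

# awk is used here only for the floating point arithmetic.
ratio=$(awk -v a="$mean_clean" -v b="$mean_all" \
            'BEGIN { printf "%.2f", 10 ^ (0.4 * (a - b)) }')

echo "flux ratio: $ratio"
```

This gives a ratio of about 1.76: the single outlier makes the mean flux appear almost twice as bright as the four good measurements suggest.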
This tutorial reviews the effect of outliers and different available ways to remove them. In particular, we will be looking at stacking of multiple datasets and collapsing one dataset along one of its dimensions. But the concepts and methods are applicable to any analysis that is affected by outliers.
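As a first taste of why robust estimators matter, the sketch below compares the median of the five measurements with and without the outlier. It is only an illustration under stated assumptions, not necessarily the method this tutorial recommends: the `median` shell function is a hypothetical helper written with standard `sort` and `awk`, so that it runs without Gnuastro installed.

```shell
# Hypothetical helper: median of a space-separated list of numbers.
median() {
    echo "$1" | tr -s ' ' '\n' | grep -v '^$' | sort -n \
        | awk '{ v[NR] = $1 }
               END { if (NR % 2) print v[(NR + 1) / 2];
                     else print (v[NR / 2] + v[NR / 2 + 1]) / 2 }'
}

# Same measurements as in the example above.
with_outlier=$(median " 19.8 20.1 20.5 17 19.9")
without_outlier=$(median " 19.8 20.1 20.5 19.9")

echo "median with outlier:    $with_outlier"
echo "median without outlier: $without_outlier"
```

The two medians (19.9 and 20) differ by only 0.1 magnitudes, compared with the 0.61 magnitude shift in the mean, because the median depends on the order of the values and not on how extreme the outlier is.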
GNU Astronomy Utilities 0.23 manual, July 2024.