In order to define detection thresholds on the image, or calibrate it for measurements (subtract the signal of the background sky and define errors), we need some basic measurements. For example, the quantile threshold in NoiseChisel (--qthresh option), or the mean of the undetected regions (Sky) and the Sky standard deviation (Sky STD) which are the output of NoiseChisel and Statistics. But astronomical images will contain a lot of stars and galaxies that will bias those measurements if not properly accounted for. Quantifying where signal is present is thus a very important step in the usage of a dataset; for example, if the Sky level is over-estimated, your target object’s magnitude will be under-estimated.
Let’s start by clarifying some definitions: Signal is defined as the non-random source of flux in each pixel (you can think of this as the mean in a Gaussian or Poisson distribution). In astronomical images, signal is mostly photons coming of a star or galaxy, and counted in each pixel. Noise is defined as the random source of flux in each pixel (or the standard deviation of a Gaussian or Poisson distribution). Noise is mainly due to counting errors in the detector electronics upon data collection. Data is defined as the combination of signal and noise (so a noisy image of a galaxy is one dataset).
When a dataset does not have any signal (for example, you take an image with a closed shutter, producing an image that only contains noise), the mean, median and mode of the distribution are equal within statistical errors. Signal from emitting objects, like astronomical targets, always has a positive value and will never become negative, see Figure 1 in Akhlaghi and Ichikawa 2015. Therefore, when signal is added to the data (you take an image with an open shutter pointing to a galaxy for example), the mean, median and mode of the dataset shift to the positive, creating a positively skewed distribution. The shift of the mean is the largest. The median shifts less, since it is defined after ordering all the elements/pixels (the median is the value at a quantile of 0.5), thus it is not affected by outliers. Finally, the mode’s shift to the positive is the least.
Inverting the argument above gives us a robust method to quantify the significance of signal in a dataset: when the mean and median of a distribution are approximately equal we can argue that there is no significant signal. In other words: when the quantile of the mean (\(q_{mean}\)) is around 0.5. This definition of skewness through the quantile of the mean is further introduced with a real image the tutorials, see Skewness caused by signal and its measurement.
However, in an astronomical image, some of the pixels will contain more signal than the rest, so we cannot simply check \(q_{mean}\) on the whole dataset. For example, if we only look at the patch of pixels that are placed under the central parts of the brightest stars in the field of view, \(q_{mean}\) will be very high. The signal in other parts of the image will be weaker, and in some parts it will be much smaller than the noise (for example, 1/100-th of the noise level). When the signal-to-noise ratio is very small, we can generally assume no signal (because its effectively impossible to measure it) and \(q_{mean}\) will be approximately 0.5.
To address this problem, we break the image into a grid of tiles199 (see Tessellation). For example, a tile can be a square box of size \(30\times30\) pixels. By measuring \(q_{mean}\) on each tile, we can find which tiles that contain significant signal and ignore them. Technically, if a tile’s \(|q_{mean}-0.5|\) is larger than the value given to the --meanmedqdiff option, that tile will be ignored for the next steps. You can read this option as “mean-median-quantile-difference”.
The raw dataset’s pixel distribution (in each tile) is noisy, to decrease the noise/error in estimating \(q_{mean}\), we convolve the image before tessellation (see Convolution process. Convolution decreases the range of the dataset and enhances its skewness, See Section 3.1.1 and Figure 4 in Akhlaghi and Ichikawa 2015. This enhanced skewness can be interpreted as an increase in the Signal to noise ratio of the objects buried in the noise. Therefore, to obtain an even better measure of the presence of signal in a tile, the mean and median discussed above are measured on the convolved image.
There is one final hurdle: raw astronomical datasets are commonly peppered with Cosmic rays. Images of Cosmic rays are not smoothed by the atmosphere or telescope aperture, so they have sharp boundaries. Also, since they do not occupy too many pixels, they do not affect the mode and median calculation. But their very high values can greatly bias the calculation of the mean (recall how the mean shifts the fastest in the presence of outliers), for example, see Figure 15 in Akhlaghi and Ichikawa 2015. The effect of outliers like cosmic rays on the mean and standard deviation can be removed through \(\sigma\)-clipping, see Sigma clipping for a complete explanation.
Therefore, after asserting that the mean and median are approximately equal in a tile (see Tessellation), the Sky and its STD are measured on each tile after \(\sigma\)-clipping with the --sigmaclip option (see Sigma clipping). In the end, some of the tiles will pass the test and will be given a value. Others (that had signal in them) will just be assigned a NaN (not-a-number) value. But we need a measurement over each tile (and thus pixel). We will therefore use interpolation to assign a value to the NaN tiles.
However, prior to interpolating over the failed tiles, another point should be considered: large and extended galaxies, or bright stars, have wings which sink into the noise very gradually. In some cases, the gradient over these wings can be on scales that is larger than the tiles (for example, the pixel value changes by \(0.1\sigma\) after 100 pixels, but the tile has a width of 30 pixels).
In such cases, the \(q_{mean}\) test will be successful, even though there is signal. Recall that \(q_{mean}\) is a measure of skewness. If we do not identify (and thus set to NaN) such outlier tiles before the interpolation, the photons of the outskirts of the objects will leak into the detection thresholds or Sky and Sky STD measurements and bias our result, see Detecting large extended targets. Therefore, the final step of “quantifying signal in a tile” is to look at this distribution of successful tiles and remove the outliers. \(\sigma\)-clipping is a good solution for removing a few outliers, but the problem with outliers of this kind is that there may be many such tiles (depending on the large/bright stars/galaxies in the image). We therefore apply the following local outlier rejection strategy.
For each tile, we find the nearest \(N_{ngb}\) tiles that had a usable value (\(N_{ngb}\) is the value given to --outliernumngb). We then sort them and find the difference between the largest and second-to-smallest elements (The minimum is not used because the scatter can be large). Let’s call this the tile’s slope (measured from its neighbors). All the tiles that are on a region of flat noise will have similar slope values, but if a few tiles fall on the wings of a bright star or large galaxy, their slope will be significantly larger than the tiles with no signal. We just have to find the smallest tile slope value that is an outlier compared to the rest, and reject all tiles with a slope larger than that.
To identify the smallest outlier, we will use the distribution of distances between sorted elements. Let’s assume the total number of tiles with a good mean-median quantile difference is \(N\). They are first sorted and searching for the outlier starts on element \(N/3\) (integer division). Let’s take \(v_i\) to be the \(i\)-th element of the sorted input (with no blank values) and \(m\) and \(\sigma\) as the \(\sigma\)-clipped median and standard deviation from the distances of the previous \(N/3-1\) elements (not including \(v_i\)). If the value given to --outliersigma is displayed with \(s\), the \(i\)-th element is considered as an outlier when the condition below is true.
$${(v_i-v_{i-1})-m\over \sigma}>s$$
Since \(i\) begins from the \(N/3\)-th element in the sorted array (a quantile of \(1/3=0.33\)), the outlier has to be larger than the \(0.33\) quantile value of the dataset (this is usually the case; otherwise, it is hard to define it as an “outlier”!).
Once the outlying tiles have been successfully identified and set to NaN, we use nearest-neighbor interpolation to give a value to all tiles in the image. We do not use parametric interpolation methods (like bicubic), because they will effectively extrapolate on the edges, creating strong artifacts. Nearest-neighbor interpolation is very simple: for each tile, we find the \(N_{ngb}\) nearest tiles that had a good value, the tile’s value is found by estimating the median. You can set \(N_{ngb}\) through the --interpnumngb option. Once all the tiles are given a value, a smoothing step is implemented to remove the sharp value contrast that can happen on the edges of tiles. The size of the smoothing box is set with the --smoothwidth option.
As mentioned above, the process above is used for any of the basic measurements (for example, identifying the quantile-based thresholds in NoiseChisel, or the Sky value in Statistics). You can use the check-image feature of NoiseChisel or Statistics to inspect the steps and visually see each step (all the options that start with --check). For example, as mentioned in the NoiseChisel optimization tutorial, when given a dataset from a new instrument (with differing noise properties), we highly recommend to use --checkqthresh in your first call and visually inspect how the parameters above affect the final quantile threshold (e.g., have the wings of bright sources leaked into the threshold?). The same goes for the --checksky option of Statistics or NoiseChisel.
JavaScript license information
GNU Astronomy Utilities 0.23 manual, July 2024.