--- title: "Local Outlier Detection with ssMRCD" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Local Outlier Detection with ssMRCD} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, warning = FALSE, fig.dim = c(7, 4.5), comment = "#>" ) ``` We use this vignette to reproduce the real world example for local outlier detection analysed and described in Puchhammer and Filzmoser (2023). All functions are included in the package `ssMRCD`. ```{r} library(ssMRCD) library(ggplot2) library(dplyr) library(rnaturalearth) library(rnaturalearthdata) ``` ## Data Preparation The original data from GeoSphere Austria (2022) is pre-cleaned and saved in the data frame object `weatherAUT2021`. Additional information can be found on the helping page. ```{r, eval = FALSE} ? weatherAUT2021 ``` Load the data from the package. ```{r setup} data("weatherAUT2021") head(weatherAUT2021) rownames(weatherAUT2021) = weatherAUT2021$name ``` For nice plotting use the package `rnaturalearth`. ```{r} # Load Austria as sf object austria <- ne_countries(scale = "medium", country = "Austria", returnclass = "sf") g_boundary = ggplot() + geom_sf(data = austria, fill = "transparent", color = "black") + theme_classic() ``` To apply the ssMRCD and finally the local outlier detection function `local_outliers_ssMRCD`, we need to specify groups based on spatial proximity and the relative influence/weights between the groups/neighborhoods. Since a prominent part of Austria is Alpine landscape, we choose many small groups in based on a grid. We construct the grid based on the longitude and latitude values and group the observations into neighborhoods. We summarize neighborhoods for a very small number of observations. ```{r} # group by spatial grid cut_lon = c(9:16, 18) cut_lat = c(46, 47, 47.5, 48, 49) groups = groups_gridbased(x = weatherAUT2021$lon, y = weatherAUT2021$lat, cutx = cut_lon, cuty = cut_lat) table(groups) # join particularly small groups together groups[groups == 2] = 1 groups[groups == 3] = 4 groups[groups == 5] = 4 groups[groups == 6] = 7 groups[groups == 11] = 15 groups = as.numeric(as.factor(groups)) table(groups) ``` The final neighborhood structure can be seen in the following plot, where the neighborhoods are differently colorized. ```{r} g_groups = g_boundary + geom_text(data = weatherAUT2021, aes(x = lon, y = lat, col = as.factor(groups), label = groups)) + geom_hline(aes(yintercept = cut_lat), lty = "dashed", col = "gray") + geom_vline(aes(xintercept = cut_lon), lty = "dashed", col = "gray") + labs(x = "Longitude", y = "Latitude", title = "Austrian Weather Stations: Neighborhood Structure") + theme_classic() + theme(legend.position = "none") g_groups ``` A natural choice for the `weights` matrix when using a neighborhood structure based on spatial coordinates is a geographical inverse-distance weighting matrix. It is the default value for the `local_outliers_ssMRCD` function but can explicitly be calculated using the function `geo_weights`. This function returns also the center of each neighborhood. E.g. for neighborhood 4 we have the following weights, corresponding to the transparancy of the arrows. ```{r} GW = geo_weights(coordinates = weatherAUT2021[, c("lon", "lat")], groups = groups) g_weights = g_groups + labs(title = "Austrian Weather Stations: Weighting Matrix W") + geom_segment(aes(x = GW$centersN[4, 1], y = GW$centersN[4, 2], xend = GW$centersN[-4, 1]-0.1, yend = GW$centersN[-4, 2]-0.05, alpha = GW$W[4, -4]), arrow = arrow(length = unit(0.25, "cm")), linewidth = 2, col = "blue")+ geom_text(aes(x = GW$centersN[, 1], y = GW$centersN[, 2], label = 1:length(GW$centersN[, 2]))) g_weights ``` ## Basic Estimation of ssMRCD with Default Values The default amount of smoothing is `lambda = 0.5` suggested by the parameter sensitivity studies. The default setting in the function `ssMRCD` applies the default setting and returns the model. ```{r, eval = FALSE} ? ssMRCD ``` ```{r, eval = TRUE, message = FALSE} set.seed(123) out = ssMRCD(X = weatherAUT2021[, 1:6], groups = groups, weights = GW$W, lambda = 0.5, tuning = NULL) class(out) ``` There are also some build in plotting functions and methods. ```{r, eval = TRUE, message = FALSE} # plot the tolerance ellipses in the geographical space plots = plot(x = out, type = c("convergence","ellipses", "ellipses_geo"), geo_centers = GW$centersN, variables = c("s", "t"), manual_rescale = 0.001) plots$plot_geoellipses + geom_sf(data = austria, fill = "transparent", color = "black") plots$plot_ellipses plots$plot_convergence ``` ### Parameter Tuning The parameter tuning is focused on the parameter `lambda`, other hyper-parameters like `k` or the neighborhood structure can similarly be tuned in a similar fashion. There are two approaches for tuning `lambda` along the grid given to the `lambda` argument. The first approach is based on detecting artificially introduced outliers. Thus, the geographic coordinates and parameters for the local outlier detection method need to be provided in the list given to `tuning`. For using this tuning method, you need to set `tuning$method = "local contamination"`. ```{r, eval = TRUE, message = FALSE} set.seed(123) out = ssMRCD(X = weatherAUT2021[, 1:6], groups = groups, weights = GW$W, tuning = list(method = "local contamination", plot = TRUE, k = 10, coords = weatherAUT2021[, c("lon", "lat")], cont = 0.05, repetitions = 3), lambda = c(0.25, 0.5, 0.75)) ``` Some comments on using the "local contamination" tuning approach. Observations are switched and thus, this should introduce outliers that we know of. Nevertheless, you should keep in mind that the observations are switched entirely at randomly. If unlucky, we might switch similar observations and, thus, the measured performance of the detection method might suffer. However, since we want to compare the different parameter settings on the same contaminated data sets this should not affect the fine tuning itself. Moreover, we implicitly assume that the original data set has no outliers. This is in general not the case. Thus, the false-negative rate (FNR) is not the true FNR regarding all outliers but possibly biased. Nevertheless, the parameter tuning is a way to see the effects of the parameter setting on the FNR and a proxy of the false-positive rate (FPR), which is given by the total number of found outliers. Plots returned by the function `ssMRCD` if `tuning$plots = TRUE` show both criteria. The second approach is based on the minimizing residuals described in Puchhammer, Wilms and Filzmoser (2024). This is generally faster and does not depend on spatial coordinates. It is especially useful for multi-group data that is not spatial. A plot is provided if `tuning$plot = TRUE`. ```{r, eval = TRUE, message = FALSE} set.seed(123) out = ssMRCD(X = weatherAUT2021[, 1:6], groups = groups, weights = GW$W, tuning = list(method = "residuals", plot = TRUE), lambda = seq(0, 1, 0.1)) ``` ## Local Outlier Detection Either by specifying a value for `k` like `k = 10` which gives the number of observations to compare with or the value for `dist` (e.g. `dist = 1.5`) as the radius of a circle around the observation where all observations inside are compared to the initial observation. A default value for `k` that is used among various local outlier detection methods is 10. However, depending on the spatial structure of the data it makes sense to use other values as well. If we are only interested in the covariance estimation we can use the function `ssMRCD`. A more convenient way if we are interested in local outlier detection is to use the function `local_outliers_ssMRCD`, which already embeds the covariance estimation. (Fine tuning of `lambda` for covariance estimation can only be done using the function `ssMRCD`.) ```{r, eval = FALSE} ? locOuts ``` ```{r, eval = FALSE, message = FALSE} set.seed(123) res = locOuts(data = weatherAUT2021[, 1:6], coords = weatherAUT2021[, c("lon", "lat")], groups = groups, lambda = 0.5, k = 10) summary(res) ``` The found outliers can be accessed by the list element `outliers`. ```{r, eval = FALSE, message = FALSE} cat(weatherAUT2021$name[res$outliers], sep = ",\n") ``` ### Diagnostics for Local Outlier Detection For the local outlier detection method there are several plots available regarding diagnostics (see `?plot.locOuts`). ```{r, eval = FALSE} ? plot.locOuts ``` #### 1) Histogramm of Next Distances The histogram shows all next distances together with the cut-off value with a specified number of bins. ```{r, eval = FALSE,fig.dim = c(5,3.5)} plot(res, type = "hist")$p_hist ``` The spatial plot shows the outliers on the map, the 3D-plot as well, but with an additional axis for the next-distance/next-distance divided by cut-off value. The line plot shows the scaled values and the specified area in the map. Orange-coloured lines are other outliers on the map, the darkred colored line is the selected observation (`focus`). ```{r, eval = FALSE} plot(res, type = "spatial")$p_spatial + geom_sf(data = austria, fill = "transparent", color = "black") ``` ```{r, eval = FALSE, fig.dim = c(7,3)} # SCHOECKL plot(res, type = "pcp", observation = "SCHOECKL", scale = "zscore")$p_pcp ``` ## References GeoSphere Austria (2022): . Puchhammer P. and Filzmoser P. (2023): Spatially smoothed robust covariance estimation for local outlier detection.