Multivariate Outlier Detection

library(MOutliers)

Introduction

Outliers are observations that differ markedly from the rest of the data. Detecting them matters because they can distort data analyses, model fits, and the conclusions drawn from them.

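The MOutliers package provides tools to detect and visualize multivariate outliers using three statistical methods: Mahalanobis distance, the Minimum Covariance Determinant (MCD), and Principal Component Analysis (PCA).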

Function Documentation

Function: detect_multivariate_outliers()

Parameters

1. data (Required)

A numeric data frame containing the variables of interest. Each row corresponds to one observation and each column to one variable.

2. method (Optional)

A character value specifying the detection method. Options are "mahalanobis", "mcd", and "pca".

Default is "mahalanobis".

3. alpha (Optional)

A numeric value giving the quantile of the chi-squared distribution used as the cutoff for flagging outliers. Default is 0.975.
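To make the roles of method and alpha concrete, here is a minimal base R sketch of the classical Mahalanobis approach. It is not the package's code: the helper name sketch_mahalanobis_flags() is hypothetical, and it assumes that squared distances are compared directly against qchisq(alpha, df = number of variables), which is the usual convention and appears consistent with the example output in this document.

```r
# Hypothetical helper, NOT part of MOutliers: flags rows whose squared
# Mahalanobis distance exceeds a chi-squared quantile (assumed convention).
sketch_mahalanobis_flags <- function(data, alpha = 0.975) {
  x <- as.matrix(data)
  # Squared distance of each row from the sample mean, scaled by the sample covariance
  d2 <- stats::mahalanobis(x, center = colMeans(x), cov = stats::cov(x))
  # Cutoff: the alpha quantile of a chi-squared distribution with p degrees of freedom
  cutoff <- stats::qchisq(alpha, df = ncol(x))
  cbind(data, Distance = d2, Outlier = d2 > cutoff)
}

# Toy data: 50 standard-normal points plus one planted point at (6, 6)
set.seed(1)
toy <- data.frame(x = c(rnorm(50), 6), y = c(rnorm(50), 6))
res <- sketch_mahalanobis_flags(toy, alpha = 0.975)
res[res$Outlier, ]

# A larger alpha raises the cutoff, so fewer points are flagged
qchisq(c(0.95, 0.975, 0.99), df = 2)
```

Raising alpha makes the rule more conservative; the degrees of freedom grow with the number of columns, so the same alpha gives a larger cutoff for wider data.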

Returns

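The function returns a data frame that combines the original input dataset with two additional columns: Distance, the distance computed for each observation by the chosen method, and Outlier, a logical flag (TRUE or FALSE) indicating whether the observation was classified as an outlier.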

Example Usage

Example 1: Simulated Data

This example demonstrates detecting multivariate outliers using simulated data.

set.seed(123)
df <- data.frame(
  x = c(rnorm(50), 6),
  y = c(rnorm(50), 6)
)
head(df)
#>             x           y
#> 1 -0.56047565  0.25331851
#> 2 -0.23017749 -0.02854676
#> 3  1.55870831 -0.04287046
#> 4  0.07050839  1.36860228
#> 5  0.12928774 -0.22577099
#> 6  1.71506499  1.51647060
# Mahalanobis Distance
result_mahal <- detect_multivariate_outliers(df, method = "mahalanobis", alpha = 0.975)
head(result_mahal)
#>             x           y  Distance Outlier
#> 1 -0.56047565  0.25331851 0.4024629   FALSE
#> 2 -0.23017749 -0.02854676 0.1081832   FALSE
#> 3  1.55870831 -0.04287046 1.9705648   FALSE
#> 4  0.07050839  1.36860228 1.0943377   FALSE
#> 5  0.12928774 -0.22577099 0.1909745   FALSE
#> 6  1.71506499  1.51647060 1.8800060   FALSE
# Minimum Covariance Determinant (MCD)
result_mcd <- detect_multivariate_outliers(df, method = "mcd", alpha = 0.975)
head(result_mcd)
#>             x           y  Distance Outlier
#> 1 -0.56047565  0.25331851 0.4591213   FALSE
#> 2 -0.23017749 -0.02854676 0.1299266   FALSE
#> 3  1.55870831 -0.04287046 2.5319996   FALSE
#> 4  0.07050839  1.36860228 2.7497316   FALSE
#> 5  0.12928774 -0.22577099 0.2077008   FALSE
#> 6  1.71506499  1.51647060 6.5143416   FALSE
# Principal Component Analysis (PCA)
result_pca <- detect_multivariate_outliers(df, method = "pca", alpha = 0.975)
head(result_pca)
#>             x           y  Distance Outlier
#> 1 -0.56047565  0.25331851 0.3295383   FALSE
#> 2 -0.23017749 -0.02854676 0.1515636   FALSE
#> 3  1.55870831 -0.04287046 1.3505140   FALSE
#> 4  0.07050839  1.36860228 0.8355279   FALSE
#> 5  0.12928774 -0.22577099 0.1610487   FALSE
#> 6  1.71506499  1.51647060 2.6579984   FALSE
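Because head() prints only the first six rows, the planted point (6, 6) in row 51 does not appear in the output above. The sketch below rebuilds the same simulated data and lists the flagged rows, using MASS::cov.mcd() as a stand-in for method = "mcd". That substitution is an assumption: the estimator the package uses internally is not stated here, so the distances need not match the package's values.

```r
# Stand-in for method = "mcd" using MASS::cov.mcd(); an assumption,
# not necessarily what MOutliers does internally.
library(MASS)

set.seed(123)
df <- data.frame(
  x = c(rnorm(50), 6),
  y = c(rnorm(50), 6)
)

# Robust location and scatter estimates (the search uses random subsamples,
# so the seed above also makes this step reproducible)
fit <- cov.mcd(df)

# Squared robust distances and the chi-squared cutoff for two variables
d2_robust <- mahalanobis(as.matrix(df), center = fit$center, cov = fit$cov)
cutoff <- qchisq(0.975, df = ncol(df))

sketch_mcd <- cbind(df, Distance = d2_robust, Outlier = d2_robust > cutoff)

# Show only the flagged rows instead of the first six
sketch_mcd[sketch_mcd$Outlier, ]
table(sketch_mcd$Outlier)
```

The same subsetting idiom, result[result$Outlier, ], should work on the data frames returned by detect_multivariate_outliers(), given the Outlier column shown in the output.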

Example 2: Existing Dataset (mtcars)

This example demonstrates detecting multivariate outliers using a real dataset (mtcars) with three variables: mpg, hp, and wt.

df_mtcars <- mtcars[, c("mpg", "hp", "wt")]
head(df_mtcars)
#>                    mpg  hp    wt
#> Mazda RX4         21.0 110 2.620
#> Mazda RX4 Wag     21.0 110 2.875
#> Datsun 710        22.8  93 2.320
#> Hornet 4 Drive    21.4 110 3.215
#> Hornet Sportabout 18.7 175 3.440
#> Valiant           18.1 105 3.460
# Mahalanobis Distance
result_mahal <- detect_multivariate_outliers(df_mtcars, method = "mahalanobis", alpha = 0.975)
head(result_mahal)
#>                    mpg  hp    wt  Distance Outlier
#> Mazda RX4         21.0 110 2.620 1.4554908   FALSE
#> Mazda RX4 Wag     21.0 110 2.875 0.6848547   FALSE
#> Datsun 710        22.8  93 2.320 1.8717032   FALSE
#> Hornet 4 Drive    21.4 110 3.215 0.5058688   FALSE
#> Hornet Sportabout 18.7 175 3.440 0.1960802   FALSE
#> Valiant           18.1 105 3.460 2.0085341   FALSE
# Minimum Covariance Determinant (MCD)
result_mcd <- detect_multivariate_outliers(df_mtcars, method = "mcd", alpha = 0.975)
head(result_mcd)
#>                    mpg  hp    wt  Distance Outlier
#> Mazda RX4         21.0 110 2.620 1.4032515   FALSE
#> Mazda RX4 Wag     21.0 110 2.875 0.4356093   FALSE
#> Datsun 710        22.8  93 2.320 1.7928535   FALSE
#> Hornet 4 Drive    21.4 110 3.215 0.7528113   FALSE
#> Hornet Sportabout 18.7 175 3.440 1.8629727   FALSE
#> Valiant           18.1 105 3.460 3.1254814   FALSE
# Principal Component Analysis (PCA)
result_pca <- detect_multivariate_outliers(df_mtcars, method = "pca", alpha = 0.975)
head(result_pca)
#>                    mpg  hp    wt  Distance Outlier
#> Mazda RX4         21.0 110 2.620 0.5460497   FALSE
#> Mazda RX4 Wag     21.0 110 2.875 0.3829775   FALSE
#> Datsun 710        22.8  93 2.320 1.5163542   FALSE
#> Hornet 4 Drive    21.4 110 3.215 0.3326773   FALSE
#> Hornet Sportabout 18.7 175 3.440 0.2723783   FALSE
#> Valiant           18.1 105 3.460 0.4647775   FALSE
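The PCA distances above differ from the Mahalanobis ones, but this document does not say how the "pca" method computes them. One common construction, shown here purely as an illustration, is a squared score distance on the first k principal components. The helper sketch_pca_distance() and the choice k = 2 are hypothetical, so its values are not expected to reproduce the package output.

```r
# Hypothetical helper, NOT part of MOutliers: squared score distance
# on the first k principal components of the standardized data.
sketch_pca_distance <- function(data, k = ncol(data)) {
  pc <- stats::prcomp(data, center = TRUE, scale. = TRUE)
  scores <- pc$x[, seq_len(k), drop = FALSE]
  # Divide each squared score by its component's variance, then sum per row
  rowSums(sweep(scores^2, 2, pc$sdev[seq_len(k)]^2, "/"))
}

x <- mtcars[, c("mpg", "hp", "wt")]

# Keep two components and compare against a chi-squared cutoff with k = 2 df
d2_pca <- sketch_pca_distance(x, k = 2)
flag_pca <- d2_pca > qchisq(0.975, df = 2)
head(data.frame(x, Distance = d2_pca, Outlier = flag_pca))

# Sanity check: with all components kept, the score distance coincides
# with the classical squared Mahalanobis distance
d2_full <- sketch_pca_distance(x, k = 3)
d2_mahal <- mahalanobis(x, colMeans(x), cov(x))
all.equal(unname(d2_full), unname(d2_mahal))
```

Dropping trailing components ignores variation in low-variance directions, which is one reason a PCA-based distance can rank observations differently from the full Mahalanobis distance.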

Function: plot_outliers()

Parameters

1. data (Required)

A numeric data frame with at least two continuous variables.

2. method (Optional)

A character value specifying the outlier detection approach. Options are "mahalanobis" and "mcd".

Default is "mahalanobis".

3. alpha (Optional)

A numeric value specifying the cutoff quantile for identifying outliers from the chi-squared distribution. Default is 0.975.

Returns

A set of 2D scatterplots, one for each pair of variables in the dataset, arranged in a single frame. Outliers are highlighted in red, while inliers are shown in black. Plotting is supported only for the Mahalanobis and MCD methods.
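For readers who want to see the idea without the package, a rough base graphics equivalent can be put together with pairs(). This is only a sketch of the described behavior (pairwise panels in one frame, outliers in red), assuming classical squared Mahalanobis distances and a chi-squared cutoff; it does not reproduce the layout or styling of plot_outliers().

```r
# Rough base-graphics stand-in for the described plot; NOT plot_outliers() itself.
x <- mtcars[, c("mpg", "hp", "wt")]

# Flag rows by squared Mahalanobis distance (assumed convention)
d2 <- mahalanobis(x, colMeans(x), cov(x))
is_outlier <- d2 > qchisq(0.975, df = ncol(x))

# Red for flagged observations, black for the rest
point_col <- ifelse(is_outlier, "red", "black")

# pairs() draws every pairwise scatterplot in a single frame
pairs(x, col = point_col, pch = 19,
      main = "Pairwise scatterplots with flagged points in red")
```

Note that an observation is flagged using all variables jointly, so a red point can look unremarkable in any single panel.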

Example Usage

Example 1: Simulated Data

This example demonstrates visualizing 2D scatterplots for each pair of variables in the simulated dataset.

# Mahalanobis Distance
plot_outliers(df, method = "mahalanobis", alpha = 0.975)

# Minimum Covariance Determinant (MCD)
plot_outliers(df, method = "mcd", alpha = 0.975)

Example 2: Existing Dataset (mtcars)

This example demonstrates visualizing 2D scatterplots for each pair of variables in a real dataset (mtcars) with three variables: mpg, hp, and wt.

# Mahalanobis Distance
plot_outliers(df_mtcars, method = "mahalanobis", alpha = 0.975)

# Minimum Covariance Determinant (MCD)
plot_outliers(df_mtcars, method = "mcd", alpha = 0.975)