| Type: | Package |
| Title: | Analysis of Interval DAta |
| Version: | 0.2.0 |
| Description: | Tools for the analysis of interval-valued data, including construction, visualization, and statistical modeling. The package provides the 'intData' class for representing interval-valued data, along with functions to aggregate microdata and to estimate parameters of latent distributions. Barycenter and covariance matrix estimation is implemented based on the Mallows distance (Oliveira et al. (2025) <doi:10.48550/arXiv.2407.05105>). Robust estimation of the symbolic covariance matrix is implemented via the Interval Minimum Covariance Determinant (IMCD) estimator, enabling outlier detection based on the robust squared Interval-Mahalanobis distance, as proposed by Loureiro et al. (2026b) <doi:10.48550/arXiv.2604.26769>. Explainable outlier detection is supported through Shapley value based decomposition of the squared robust Interval-Mahalanobis distance, allowing assessment of variable contributions to outlyingness (Loureiro et al. (2026a) <doi:10.48550/arXiv.2606.26307>). Shapley interaction indices are also implemented, along with visualization tools to support interpretation of the results. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| URL: | https://github.com/catarinaploureiro/AIDA, https://catarinaploureiro.github.io/AIDA/ |
| BugReports: | https://github.com/catarinaploureiro/AIDA/issues |
| LazyData: | true |
| LazyDataCompression: | xz |
| VignetteBuilder: | knitr |
| Language: | en-US |
| Imports: | cellWise, cowplot, fmsb, ggbeeswarm, ggplot2, kde1d, MASS, methods |
| Depends: | R (≥ 3.6) |
| Suggests: | CerioliOutlierDetection, corrplot, ggrepel, knitr, plotly, RColorBrewer, rmarkdown, robustbase, scales, testthat (≥ 3.0.0) |
| Config/roxygen2/version: | 8.0.0 |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | no |
| Packaged: | 2026-06-30 10:42:32 UTC; catar |
| Author: | Catarina P. Loureiro
|
| Maintainer: | Catarina P. Loureiro <catarinapadrela@tecnico.ulisboa.pt> |
| Repository: | CRAN |
| Date/Publication: | 2026-06-30 11:42:13 UTC |
Equality Comparison for intData Objects
Description
Compare two intData objects element-wise for equality (==) or inequality (!=).
Usage
## S4 method for signature 'intData,intData'
e1 == e2
## S4 method for signature 'intData,intData'
e1 != e2
Arguments
e1 |
An intData object. |
e2 |
An intData object. |
Value
A logical matrix indicating which elements are equal between the two intData objects.
A logical matrix indicating element-wise inequality of the two intData objects.
Computes [\boldsymbol{\mathfrak{E}}_{UU}]_{ij}=\mathcal{E}(U_i,U_j) for the latent variables inherent to the macrodata, where they follow a Beta distribution.
Description
Computes [\boldsymbol{\mathfrak{E}}_{UU}]_{ij}=\mathcal{E}(U_i,U_j) for the latent variables inherent to the macrodata, where they follow a Beta distribution.
Usage
CalE.beta.beta(a1, b1, a2, b2)
Arguments
a1 |
Parameter alpha of the first Beta distribution. |
b1 |
Parameter beta of the first Beta distribution. |
a2 |
Parameter alpha of the second Beta distribution. |
b2 |
Parameter beta of the second Beta distribution. |
Value
The computed value of \mathcal{E}(U_i,U_j).
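As a rough illustration, \mathcal{E}(U_i,U_j)=\int_0^1 F_{U_i}^{-1}(t) F_{U_j}^{-1}(t) \, dt can be approximated by numerically integrating the product of the two quantile functions. The sketch below uses base R only and is not the package's implementation: the helper name cal_e_beta_sketch is hypothetical, and rescaling each Beta variable from [0,1] to [-1,1] via U = 2B - 1 is an assumption made for illustration.

```r
# Hypothetical helper: approximates E(U_1, U_2) for two Beta latent variables.
# Assumption: each latent variable is a Beta variable rescaled to [-1, 1].
cal_e_beta_sketch <- function(a1, b1, a2, b2) {
  integrand <- function(t) {
    q1 <- 2 * qbeta(t, a1, b1) - 1  # quantile function of U_1 on [-1, 1]
    q2 <- 2 * qbeta(t, a2, b2) - 1  # quantile function of U_2 on [-1, 1]
    q1 * q2
  }
  integrate(integrand, lower = 0, upper = 1)$value
}

# Beta(1, 1) is the uniform distribution, so both latent variables are
# uniform on [-1, 1] and the result should be close to E(U^2) = 1/3.
e_unif <- cal_e_beta_sketch(1, 1, 1, 1)
```

With all four parameters equal to 1, the integrand reduces to (2t - 1)^2, which gives a quick sanity check on the integration.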
Computes [\boldsymbol{\mathfrak{E}}_{UU}]_{ij}=\mathcal{E}(U_i,U_j) for the latent variables inherent to the macrodata, where U_1 follows a Beta(a_1,b_1) and the PDF of U_2 is estimated by a KDE.
Description
Computes [\boldsymbol{\mathfrak{E}}_{UU}]_{ij}=\mathcal{E}(U_i,U_j) for the latent variables inherent to the macrodata, where U_1 follows a Beta(a_1,b_1) and the PDF of U_2 is estimated by a KDE.
Usage
CalE.beta.kde(micro, a1, b1)
Arguments
micro |
Latent microdata observations. |
a1 |
Parameter alpha of the Beta distribution. |
b1 |
Parameter beta of the Beta distribution. |
Value
The computed value of \mathcal{E}(U_i,U_j).
Computes [\boldsymbol{\mathfrak{E}}_{UU}]_{ij}=\mathcal{E}(U_i,U_j) for the latent variables inherent to the macrodata, where the PDF is estimated by a KDE.
Description
Computes [\boldsymbol{\mathfrak{E}}_{UU}]_{ij}=\mathcal{E}(U_i,U_j) for the latent variables inherent to the macrodata, where the PDF is estimated by a KDE.
Usage
CalE.kde.kde(micro1, micro2)
Arguments
micro1 |
Latent microdata observations of the first latent variable. |
micro2 |
Latent microdata observations of the second latent variable. |
Value
The computed value of \mathcal{E}(U_i,U_j).
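When only latent microdata are available, the integral \int_0^1 F_{U_1}^{-1}(t) F_{U_2}^{-1}(t) \, dt can be approximated from sample quantiles. The following sketch deliberately swaps the kernel density estimate for plain empirical quantiles (stats::quantile) and a midpoint rule, so it is a simplified stand-in rather than the method used by CalE.kde.kde; the function name and the n_grid argument are hypothetical.

```r
# Hypothetical helper: empirical-quantile approximation of E(U_1, U_2).
# Assumption: empirical quantiles replace the KDE-based quantile functions.
cal_e_empirical_sketch <- function(micro1, micro2, n_grid = 1000) {
  t <- (seq_len(n_grid) - 0.5) / n_grid        # midpoints of a grid on (0, 1)
  q1 <- quantile(micro1, probs = t, names = FALSE)
  q2 <- quantile(micro2, probs = t, names = FALSE)
  mean(q1 * q2)                                # midpoint rule for the integral
}

# Evenly spaced points on [-1, 1] mimic a uniform latent variable,
# for which E(U, U) = E(U^2) = 1/3.
u <- seq(-1, 1, length.out = 2001)
e_emp <- cal_e_empirical_sketch(u, u)
```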
Computes [\boldsymbol{\mathfrak{E}}_{UU}]_{ij}=\mathcal{E}(U_i,U_j) for the latent variables inherent to the macrodata, where they follow a Triangular distribution.
Description
Computes [\boldsymbol{\mathfrak{E}}_{UU}]_{ij}=\mathcal{E}(U_i,U_j) for the latent variables inherent to the macrodata, where they follow a Triangular distribution.
Usage
CalE.triang.triang(mo1 = 0, mo2 = 0)
Arguments
mo1 |
Mode of the triangular distribution of the first latent variable. |
mo2 |
Mode of the triangular distribution of the second latent variable. |
Value
The computed value of \mathcal{E}(U_i,U_j).
Centers Method for intData
Description
Centers Method for intData
Usage
Centers(Sdt)
## S4 method for signature 'intData'
Centers(Sdt)
Arguments
Sdt |
An object of class intData. |
Value
A data.frame containing the centers of the intervals.
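For orientation, the relation between interval bounds, centers, and ranges can be written in a few lines of base R. This sketch does not call the package; the toy values are invented, and only the variable names Food and Gas are borrowed from the creditcard dataset.

```r
# Toy lower and upper bounds for two interval-valued variables (made-up values)
lower <- data.frame(Food = c(20, 15), Gas = c(5, 8))
upper <- data.frame(Food = c(30, 25), Gas = c(9, 8))

centers <- (lower + upper) / 2   # midpoint of each interval
ranges  <- upper - lower         # width of each interval

# The bounds are recovered as center -/+ half the range
lower_again <- centers - ranges / 2
upper_again <- centers + ranges / 2
```

A degenerate interval, such as the second Gas observation, has range 0 and a center equal to both bounds.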
Interval-valued data Minimum Covariance Determinant (IMCD) estimation
Description
Applies an adaptation of the FAST-MCD algorithm to estimate location and scatter for interval-valued data.
Usage
IMCD(
data,
m = 0,
cutoff = c("farness", "adjbox", "chi-squared", "F-dist", "raw"),
cutoff_lvl = NULL
)
Arguments
data |
An intData object. |
m |
An integer specifying the subset size to use for the estimation. Defaults to 0. |
cutoff |
Indicates which cutoff should be considered for reweighting the estimates, one of "farness", "adjbox", "chi-squared", "F-dist", or "raw".
Defaults to "farness". |
cutoff_lvl |
A numeric value specifying the level of the cutoff to be used.
If no value is provided, the function uses the default values associated with each cutoff method. |
Value
A list containing the robustly estimated parameters:
mean_IMCD_c |
Estimated mean of the centers of the interval data. |
mean_IMCD_r |
Estimated mean of the ranges of the interval data. |
cov_IMCD |
Estimated covariance (scatter) matrix ( |
final_z |
Binary vector indicating the inclusion of each observation in the reweighted subset. |
cutoff |
The cutoff method used for reweighting. |
cutoff_value |
Cutoff value used for reweighting. |
robust_dist |
Robust distances ( |
farness_probs |
Farness probabilities (if cutoff = "farness"). |
References
Loureiro, C. P., Oliveira, M. R., Brito, P., & Oliveira, L. (2026). Minimum Covariance Determinant Estimator and Outlier Detection for Interval-valued Data. arXiv preprint arXiv:2604.26769. https://arxiv.org/abs/2604.26769
Adapted from https://github.com/frankp-0/fastMCD.
The case cutoff=="F-dist" is adapted from package CerioliOutlierDetection (https://cran.r-project.org/package=CerioliOutlierDetection).
Examples
# Example using creditcard dataset
data(creditcard)
credit_card_int <- creditcard$intData
# Obtain reweighted IMCD estimates using farness cutoff
credit_card_IMCD <- IMCD(credit_card_int,
m = floor(nrow(credit_card_int)*0.75),
cutoff = "farness",
cutoff_lvl = 0.9)
Interval-Mahalanobis Distance
Description
Calculate the squared Interval-Mahalanobis distance between each row of the data and the barycenter.
Usage
IMah_dist(data, z = NULL, mean_c = NULL, mean_r = NULL, cov = NULL)
Arguments
data |
An intData object. |
z |
(Optional) A vector of 0 and 1, indicating which observations should be considered for the calculation.
If |
mean_c |
(Optional) A vector specifying the mean of centers. Defaults to NULL. |
mean_r |
(Optional) A vector specifying the mean of ranges. Defaults to NULL. |
cov |
(Optional) A covariance matrix. Defaults to NULL. |
Details
The squared Interval-Mahalanobis distance between \boldsymbol{x}=(\boldsymbol{c}^\top,\boldsymbol{r}^\top)^\top and the barycenter \boldsymbol{\mu}_B=(\boldsymbol{\mu}_C^\top,\boldsymbol{\mu}_R^\top)^\top of a population with symbolic covariance matrix \boldsymbol{\Sigma}_B (see int_cov) is defined according to the LatentCase:
- "U_id_symmetric": The latent variables are identically distributed and symmetric:
  d_\mathrm{IMah}(\boldsymbol{x})^2=(\boldsymbol{c}-\boldsymbol{\mu}_C)^{\top}\boldsymbol{\Sigma}_{B}^{-1}(\boldsymbol{c}-\boldsymbol{\mu}_C)+\delta(\boldsymbol{r}-\boldsymbol{\mu}_R)^{\top}\boldsymbol{\Sigma}_{B}^{-1}(\boldsymbol{r}-\boldsymbol{\mu}_R),
  where \delta=\mathbb{E}(U^2)/4 is the parameter of the latent variables.
- "U_id": The latent variables are identically distributed:
  \begin{aligned} d_\mathrm{IMah}(\boldsymbol{x})^2&=(\boldsymbol{c}-\boldsymbol{\mu}_C)^{\top}\boldsymbol{\Sigma}_{B}^{-1}(\boldsymbol{c}-\boldsymbol{\mu}_C)+\delta(\boldsymbol{r}-\boldsymbol{\mu}_R)^{\top}\boldsymbol{\Sigma}_{B}^{-1}(\boldsymbol{r}-\boldsymbol{\mu}_R)\\ &\quad+\mathbb{E}(U)(\boldsymbol{c}-\boldsymbol{\mu}_C)^\top\boldsymbol{\Sigma}_{B}^{-1}(\boldsymbol{r}-\boldsymbol{\mu}_R), \end{aligned}
  where \delta=\mathbb{E}(U^2)/4 and \mathbb{E}(U) are the parameters of the latent variables.
- "General": No assumptions are made about the latent variables:
  \begin{aligned} d_\mathrm{IMah}(\boldsymbol{x})^2&=(\boldsymbol{c}-\boldsymbol{\mu}_C)^{\top}\boldsymbol{\Sigma}_{B}^{-1}(\boldsymbol{c}-\boldsymbol{\mu}_C)+\dfrac{1}{4}(\boldsymbol{r}-\boldsymbol{\mu}_R)^{\top}\left(\boldsymbol{\mathfrak{E}}_{UU}\bullet\boldsymbol{\Sigma}_{B}^{-1}\right)(\boldsymbol{r}-\boldsymbol{\mu}_R)\\ &\quad+(\boldsymbol{c}-\boldsymbol{\mu}_C)^{\top}\boldsymbol{\Sigma}_{B}^{-1}\boldsymbol{\Psi}(\boldsymbol{r}-\boldsymbol{\mu}_R), \end{aligned}
  where:
  - \boldsymbol{\Psi}=\text{diag}(\mathbb{E}(U_1),\dots,\mathbb{E}(U_p)),
  - [\boldsymbol{\mathfrak{E}}_{UU}]_{j\ell}=\mathcal{E}(U_j,U_\ell), j\neq \ell, with \mathcal{E}(U_j,U_\ell)=\int_0^1 F_{U_j}^{-1}(t) F_{U_\ell}^{-1}(t) \, dt,
  - [\boldsymbol{\mathfrak{E}}_{UU}]_{jj}=\mathbb{E}(U_j^2), j,\ell=1,\dots,p,
  - \bullet denotes the Schur (or entrywise) product of matrices.
Value
A vector with the squared Interval-Mahalanobis distance of each observation.
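The "U_id_symmetric" formula from the Details can be sketched with stats::mahalanobis, which already returns squared distances. This is an illustrative base R version only, not the package's code: the function name imah_sym_sketch and the toy matrices are assumptions, and delta = 1/12 corresponds to latent variables uniform on [-1,1], where \mathbb{E}(U^2)=1/3.

```r
# Hypothetical helper for the "U_id_symmetric" case:
# d^2 = (c - mu_C)' S^{-1} (c - mu_C) + delta * (r - mu_R)' S^{-1} (r - mu_R)
imah_sym_sketch <- function(C, R, mu_c, mu_r, Sigma, delta) {
  mahalanobis(C, center = mu_c, cov = Sigma) +
    delta * mahalanobis(R, center = mu_r, cov = Sigma)
}

# Toy centers (C) and ranges (R): two observations, two variables
C <- matrix(c(1, 0,
              0, 2), nrow = 2, byrow = TRUE)
R <- matrix(c(2, 2,
              1, 3), nrow = 2, byrow = TRUE)

d2 <- imah_sym_sketch(C, R,
                      mu_c = c(0, 0), mu_r = c(1, 1),
                      Sigma = diag(2), delta = 1 / 12)
```

With the identity matrix as covariance, the first observation gives 1 + 2/12 and the second gives 4 + 4/12, which can be checked by hand against the formula.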
References
Loureiro, C. P., Oliveira, M. R., Brito, P., & Oliveira, L. (2026). Minimum Covariance Determinant Estimator and Outlier Detection for Interval-valued Data. arXiv preprint arXiv:2604.26769. https://arxiv.org/abs/2604.26769
Examples
data(creditcard)
credit_card_int <- creditcard$intData
# Compute squared Interval-Mahalanobis distance using IMCD estimates of mean and covariance
credit_card_dist <- IMah_dist(credit_card_int)
Interval-Mahalanobis distance for all pairs
Description
Calculate the squared Interval-Mahalanobis distance between all pairs of observations in the data.
Usage
IMah_dist_pairs(data, cov = NULL)
Arguments
data |
An intData object. |
cov |
(Optional) A covariance matrix. Defaults to NULL. |
Details
The squared Interval-Mahalanobis distance between \boldsymbol{x}_1=(\boldsymbol{c}_1^\top,\boldsymbol{r}_1^\top)^\top and \boldsymbol{x}_2=(\boldsymbol{c}_2^\top,\boldsymbol{r}_2^\top)^\top of a population with symbolic covariance matrix \boldsymbol{\Sigma}_B (see int_cov) is defined according to the LatentCase:
- "U_id_symmetric": The latent variables are identically distributed and symmetric:
  d_\mathrm{IMah}(\boldsymbol{x}_1,\boldsymbol{x}_2)^2=(\boldsymbol{c}_1-\boldsymbol{c}_2)^{\top}\boldsymbol{\Sigma}_{B}^{-1}(\boldsymbol{c}_1-\boldsymbol{c}_2)+\delta(\boldsymbol{r}_1-\boldsymbol{r}_2)^{\top}\boldsymbol{\Sigma}_{B}^{-1}(\boldsymbol{r}_1-\boldsymbol{r}_2),
  where \delta=\mathbb{E}(U^2)/4 is the parameter of the latent variables.
- "U_id": The latent variables are identically distributed:
  \begin{aligned} d_\mathrm{IMah}(\boldsymbol{x}_1,\boldsymbol{x}_2)^2&=(\boldsymbol{c}_1-\boldsymbol{c}_2)^{\top}\boldsymbol{\Sigma}_{B}^{-1}(\boldsymbol{c}_1-\boldsymbol{c}_2)+\delta(\boldsymbol{r}_1-\boldsymbol{r}_2)^{\top}\boldsymbol{\Sigma}_{B}^{-1}(\boldsymbol{r}_1-\boldsymbol{r}_2)\\ &\quad+\mathbb{E}(U)(\boldsymbol{c}_1-\boldsymbol{c}_2)^\top\boldsymbol{\Sigma}_{B}^{-1}(\boldsymbol{r}_1-\boldsymbol{r}_2), \end{aligned}
  where \delta=\mathbb{E}(U^2)/4 and \mathbb{E}(U) are the parameters of the latent variables.
- "General": No assumptions are made about the latent variables:
  \begin{aligned} d_\mathrm{IMah}(\boldsymbol{x}_1,\boldsymbol{x}_2)^2&=(\boldsymbol{c}_1-\boldsymbol{c}_2)^{\top}\boldsymbol{\Sigma}_{B}^{-1}(\boldsymbol{c}_1-\boldsymbol{c}_2)+\dfrac{1}{4}(\boldsymbol{r}_1-\boldsymbol{r}_2)^{\top}\left(\boldsymbol{\mathfrak{E}}_{UU}\bullet\boldsymbol{\Sigma}_{B}^{-1}\right)(\boldsymbol{r}_1-\boldsymbol{r}_2)\\ &\quad+(\boldsymbol{c}_1-\boldsymbol{c}_2)^{\top}\boldsymbol{\Sigma}_{B}^{-1}\boldsymbol{\Psi}(\boldsymbol{r}_1-\boldsymbol{r}_2), \end{aligned}
  where:
  - \boldsymbol{\Psi}=\text{diag}(\mathbb{E}(U_1),\dots,\mathbb{E}(U_p)),
  - [\boldsymbol{\mathfrak{E}}_{UU}]_{j\ell}=\mathcal{E}(U_j,U_\ell), j\neq \ell, with \mathcal{E}(U_j,U_\ell)=\int_0^1 F_{U_j}^{-1}(t) F_{U_\ell}^{-1}(t) \, dt,
  - [\boldsymbol{\mathfrak{E}}_{UU}]_{jj}=\mathbb{E}(U_j^2), j,\ell=1,\dots,p,
  - \bullet denotes the Schur (or entrywise) product of matrices.
If cov is not provided, it will be computed using the IMCD function.
Additionally, if cov is set as the identity matrix, the computed distance is the Mallows distance between pairs of observations.
Value
A matrix with the squared Interval-Mahalanobis distance of each pair of observations.
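When the covariance is the identity matrix, the "U_id_symmetric" pairwise formula reduces to a sum of squared Euclidean distances on centers and ranges, which is the squared Mallows distance between pairs. A minimal base R sketch of that special case follows; the helper name pairwise_sym_sketch and the toy data are hypothetical, and delta = 1/12 assumes latent variables uniform on [-1,1].

```r
# Hypothetical helper: pairwise squared distances with identity covariance,
# d^2(x1, x2) = ||c1 - c2||^2 + delta * ||r1 - r2||^2
pairwise_sym_sketch <- function(C, R, delta) {
  as.matrix(dist(C))^2 + delta * as.matrix(dist(R))^2
}

# Toy centers (C) and ranges (R): two observations, two variables
C <- matrix(c(1, 0,
              0, 2), nrow = 2, byrow = TRUE)
R <- matrix(c(2, 2,
              1, 3), nrow = 2, byrow = TRUE)

D2 <- pairwise_sym_sketch(C, R, delta = 1 / 12)
```

The result is a symmetric matrix with a zero diagonal; here the single off-diagonal value is 5 + 2/12.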
References
Loureiro, C. P., Oliveira, M. R., Brito, P., & Oliveira, L. (2026). Minimum Covariance Determinant Estimator and Outlier Detection for Interval-valued Data. arXiv preprint arXiv:2604.26769. https://arxiv.org/abs/2604.26769
Examples
data(creditcard)
credit_card_int <- creditcard$intData
credit_card_dist <- IMah_dist_pairs(credit_card_int)
Latent Case Method for intData
Description
Latent Case Method for intData
Usage
LatentCase(Sdt)
## S4 method for signature 'intData'
LatentCase(Sdt)
Arguments
Sdt |
An object of class intData. |
Value
A character with the latent case.
Latent Distribution Method for intData
Description
Latent Distribution Method for intData
Usage
LatentDist(Sdt)
## S4 method for signature 'intData'
LatentDist(Sdt)
Arguments
Sdt |
An object of class intData. |
Value
A character with the latent distribution(s).
Latent Parameters Method for intData
Description
Latent Parameters Method for intData
Usage
LatentParam(Sdt)
## S4 method for signature 'intData'
LatentParam(Sdt)
Arguments
Sdt |
An object of class intData. |
Value
A list with the latent parameters.
LogRanges Method for intData
Description
LogRanges Method for intData
Usage
LogRanges(Sdt)
## S4 method for signature 'intData'
LogRanges(Sdt)
Arguments
Sdt |
An object of class intData. |
Value
A data.frame containing the logarithms of the ranges.
Lower Bounds Method for intData
Description
Lower Bounds Method for intData
Usage
LowerBounds(Sdt)
## S4 method for signature 'intData'
LowerBounds(Sdt)
Arguments
Sdt |
An object of class intData. |
Value
A data.frame containing the lower bounds of the intervals.
Mallows Distance
Description
Calculate the squared Mallows distance between all rows in data and the barycenter.
Usage
Mallows_dist(data, mean_c = NULL, mean_r = NULL)
Arguments
data |
An intData object. |
mean_c |
(Optional) A vector specifying the mean of centers. Defaults to NULL. |
mean_r |
(Optional) A vector specifying the mean of ranges. Defaults to NULL. |
Details
The squared Mallows distance is defined according to the LatentCase:
- "U_id_symmetric": The latent variables are identically distributed and symmetric:
  d_\mathrm{M}(\boldsymbol{x})^2=(\boldsymbol{c}-\boldsymbol{\mu}_C)^{\top}(\boldsymbol{c}-\boldsymbol{\mu}_C)+\delta(\boldsymbol{r}-\boldsymbol{\mu}_R)^{\top}(\boldsymbol{r}-\boldsymbol{\mu}_R),
  where \delta=\mathbb{E}(U^2)/4 is the parameter of the latent variables.
- "U_id": The latent variables are identically distributed:
  d_\mathrm{M}(\boldsymbol{x})^2=(\boldsymbol{c}-\boldsymbol{\mu}_C)^{\top}(\boldsymbol{c}-\boldsymbol{\mu}_C)+\delta(\boldsymbol{r}-\boldsymbol{\mu}_R)^{\top}(\boldsymbol{r}-\boldsymbol{\mu}_R) +\mathbb{E}(U)(\boldsymbol{c}-\boldsymbol{\mu}_C)^\top(\boldsymbol{r}-\boldsymbol{\mu}_R),
  where \delta=\mathbb{E}(U^2)/4 and \mathbb{E}(U) are the parameters of the latent variables.
- "General": No assumptions are made about the latent variables:
  d_\mathrm{M}(\boldsymbol{x})^2=(\boldsymbol{c}-\boldsymbol{\mu}_C)^{\top}(\boldsymbol{c}-\boldsymbol{\mu}_C)+(\boldsymbol{r}-\boldsymbol{\mu}_R)^{\top}\boldsymbol{\Delta}(\boldsymbol{r}-\boldsymbol{\mu}_R) +(\boldsymbol{c}-\boldsymbol{\mu}_C)^{\top}\boldsymbol{\Psi}(\boldsymbol{r}-\boldsymbol{\mu}_R),
  where:
  - \boldsymbol{\Psi}=\text{diag}(\mathbb{E}(U_1),\dots,\mathbb{E}(U_p)),
  - \boldsymbol{\Delta}=\text{diag}(\mathbb{E}(U^2_1),\dots,\mathbb{E}(U^2_p))/4.
Value
A vector with the squared Mallows distance of each observation.
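To see where the "U_id_symmetric" expression comes from, the squared Mallows distance for a single variable can be checked numerically as the integral of the squared difference between two quantile functions. The sketch below is a hedged illustration in base R, assuming latent variables uniform on [-1,1], so that an interval with center c and range r has quantile function c + (r/2)(2t - 1) and \delta = 1/12. Both function names are hypothetical, and the second interval plays the role of the barycenter (\mu_C, \mu_R).

```r
# Numerical version: integrate the squared difference of quantile functions
mallows_sq_numeric <- function(c1, r1, c2, r2) {
  q1 <- function(t) c1 + (r1 / 2) * (2 * t - 1)  # quantiles of interval 1
  q2 <- function(t) c2 + (r2 / 2) * (2 * t - 1)  # quantiles of interval 2
  integrate(function(t) (q1(t) - q2(t))^2, lower = 0, upper = 1)$value
}

# Closed form for one variable: (c1 - c2)^2 + delta * (r1 - r2)^2
mallows_sq_closed <- function(c1, r1, c2, r2, delta = 1 / 12) {
  (c1 - c2)^2 + delta * (r1 - r2)^2
}

num    <- mallows_sq_numeric(3, 4, 1, 1)
closed <- mallows_sq_closed(3, 4, 1, 1)   # 4 + 9/12 = 4.75
```

The two values agree up to integration error, which reflects that the cross term vanishes when \mathbb{E}(U)=0.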
References
Oliveira, M. R., Pinheiro, D., & Oliveira, L. (2025). Location and association measures for interval-valued data based on Mallows' distance. arXiv preprint arXiv:2407.05105. https://arxiv.org/abs/2407.05105
Examples
data(creditcard)
credit_card_int <- creditcard$intData
credit_card_dist <- Mallows_dist(credit_card_int)
Number of Micro Units Method for intData
Description
Number of Micro Units Method for intData
Usage
NbMicroUnits(x)
## S4 method for signature 'intData'
NbMicroUnits(x)
Arguments
x |
An object of class intData. |
Value
An integer specifying the number of micro units.
Ranges Method for intData
Description
Ranges Method for intData
Usage
Ranges(Sdt)
## S4 method for signature 'intData'
Ranges(Sdt)
Arguments
Sdt |
An object of class intData. |
Value
A data.frame containing the ranges of the intervals.
Upper Bounds Method for intData
Description
Upper Bounds Method for intData
Usage
UpperBounds(Sdt)
## S4 method for signature 'intData'
UpperBounds(Sdt)
Arguments
Sdt |
An object of class intData. |
Value
A data.frame containing the upper bounds of the intervals.
Subset an intData Object
Description
Extract a subset of rows and columns from an intData object.
Usage
## S4 method for signature 'intData'
x[i, j, ..., drop = TRUE]
Arguments
x |
An intData object. |
i |
Row indices or names to subset. Defaults to all rows. |
j |
Column indices or names to subset. Defaults to all columns. |
... |
Additional arguments (not used). |
drop |
Logical, passed to the underlying |
Value
An intData object containing the specified subset of rows and columns.
Obtain unweighted estimates for data with > 600 observations
Description
Obtain unweighted estimates for data with > 600 observations
Usage
bigIMCD(m, p, n, data)
Arguments
m |
An integer specifying the number of observations to use. |
p |
An integer specifying the number of columns in the data. |
n |
An integer specifying the total number of observations. |
data |
An intData object. |
Value
A list of estimated location and scatter
Perform single iteration of C-step
Description
Perform single iteration of C-step
Usage
c_step(z, m, data)
Arguments
z |
A vector of 0 and 1, indicating which observations should be considered for the calculation |
m |
An integer specifying the number of observations to use. |
data |
An intData object. |
Value
A list of z, covariance, barycenter and robust distances
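For readers unfamiliar with the C-step, the classical (non-interval) version used by FAST-MCD can be sketched in base R. This is only an analogue under stated assumptions: it works on an ordinary numeric matrix with stats::mahalanobis instead of an intData object with the Interval-Mahalanobis distance, and the name c_step_sketch is hypothetical.

```r
# Classical C-step on a numeric matrix X (analogue, not the interval version).
# z is a 0/1 vector marking the current subset; m is the subset size.
c_step_sketch <- function(z, m, X) {
  subset  <- X[z == 1, , drop = FALSE]
  center  <- colMeans(subset)               # location of the current subset
  scatter <- cov(subset)                    # scatter of the current subset
  d2 <- mahalanobis(X, center, scatter)     # squared distances of all rows
  new_z <- integer(nrow(X))
  new_z[order(d2)[seq_len(m)]] <- 1L        # keep the m closest observations
  list(z = new_z, cov = scatter, center = center, dist = d2)
}

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
X[1:5, ] <- X[1:5, ] + 10                   # plant a few outlying rows
z0 <- integer(100)
z0[sample(100, 75)] <- 1L                   # random initial subset
step1 <- c_step_sketch(z0, 75, X)
step2 <- c_step_sketch(step1$z, 75, X)
```

Iterating the step never increases the determinant of the subset covariance, which is why repeated C-steps converge; step2$cov is the covariance of the subset selected by the first step.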
Compute Cal.E Latent Variables
Description
Computes \boldsymbol{\mathfrak{E}}_{UU} for the latent variables inherent to the macrodata.
Usage
cal.E.UU(
LatentDist = c("Unif", "Triang", "TNorm", "InvTri", "Beta", "KDE", "Degenerated"),
TriangParam = 0,
BetaParam.a = 1,
BetaParam.b = 1,
Umicro = NULL,
p = NULL
)
Arguments
LatentDist |
A string or vector of strings specifying the distribution(s) of the latent variables. If the variables are identically distributed it can be one of ("Unif", "Triang", "TNorm", "InvTri", "Beta", "KDE", "Degenerated"). |
TriangParam |
Mode of the triangular distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 0. |
BetaParam.a |
Parameter alpha of the Beta distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 1. |
BetaParam.b |
Parameter beta of the Beta distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 1. |
Umicro |
Latent microdata observations. Needed if |
p |
Number of variables. |
Details
The matrix \boldsymbol{\mathfrak{E}}_{UU} is defined as follows:
- [\boldsymbol{\mathfrak{E}}_{UU}]_{j\ell}=\mathcal{E}(U_j,U_\ell), j\neq \ell, with \mathcal{E}(U_j,U_\ell)=\int_0^1 F_{U_j}^{-1}(t) F_{U_\ell}^{-1}(t) \, dt,
- [\boldsymbol{\mathfrak{E}}_{UU}]_{jj}=\mathbb{E}(U_j^2), j,\ell=1,\dots,p.
Value
A p\times p matrix.
Column Names Method for intData
Description
Column Names Method for intData
Usage
## S4 method for signature 'intData'
colnames(x)
Arguments
x |
An object of class intData. |
Value
A character vector of column names.
Credit Card Dataset
Description
This dataset contains interval data of credit card expenses, including min-max values, centers and ranges, microdata, and an intData object.
It is composed of 5 variables: Food, Social, Travel, Gas, and Clothes. It was aggregated by person-month.
Usage
data(creditcard)
Format
A list with the following components:
- microdata: A data frame with 1000 rows and 7 columns. It contains the microdata, with individual measurements of each variable for all observations.
- min_max: A data frame with 36 rows and 10 columns. Each row corresponds to a different observation, and each column gives the minimum and maximum values for each variable.
- centers_ranges: A data frame with 36 rows and 10 columns. Each row corresponds to the centers and ranges of the interval data.
- intData: An intData object with 36 interval-valued observations and 5 variables, constructed assuming the microdata follow symmetric triangular distributions.
References
This data was retrieved from Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. John Wiley & Sons. doi:10.1002/9780470090183.
Examples
data(creditcard)
head(creditcard$min_max)
head(creditcard$microdata)
head(creditcard$intData)
Dimensions Method for intData
Description
Dimensions Method for intData
Usage
## S4 method for signature 'intData'
dim(x)
Arguments
x |
An object of class intData. |
Value
A vector of the number of rows and columns.
Randomly draw a subset of observations
Description
Randomly draw a subset of observations
Usage
draw_z(m, data)
Arguments
m |
An integer specifying the number of observations to use |
data |
An intData object. |
Value
A vector representing a subset of m observations drawn from the data.
Entrecampos Air Quality Dataset
Description
This dataset contains interval data of air pollutants' concentrations, including min-max values and microdata.
This air quality dataset was obtained from a monitoring station in Entrecampos, Lisbon.
It is composed of 9 pollutants' concentration measures in µg/m3 during the years 2019, 2020, and 2021: sulphur dioxide (SO2), particles < 10µm, ozone (O3), nitrogen dioxide (NO2), carbon monoxide (CO), benzene (C6H6), particles < 2.5µm, nitrogen oxides (NOx), and nitrogen monoxide (NO).
For the microdata_transformed, min_max, and intData, the pollutant "benzene" was removed due to a high number of missing values.
The aggregation of the microdata was done by day.
Usage
data(entrecampos_air_quality)
Format
A list with the following components:
- microdata_raw: A data frame with 26304 rows and 11 columns. It contains the raw microdata, with individual measurements of each variable for all observations.
- microdata_transformed: A data frame with 26304 rows and 10 columns. It contains the microdata, with individual measurements of each variable for all observations. Logarithmic transformations were applied to all variables, and missing values were handled by interpolation.
- min_max: A data frame with 1096 rows and 17 columns. Each row corresponds to a different observation, and each column gives the minimum and maximum values for each variable. The first column corresponds to the day, the next 8 to the minimum and the last 8 to the maximum.
- intData: An intData object, constructed using KDE for estimating the parameters of the latent distributions.
References
This data was retrieved from the Portuguese Environment Agency database available at https://qualar.apambiente.pt/.
Examples
data(entrecampos_air_quality)
head(entrecampos_air_quality$microdata_raw)
head(entrecampos_air_quality$microdata_transformed)
head(entrecampos_air_quality$min_max)
head(entrecampos_air_quality$intData)
Farness Estimation
Description
Estimate farness from a distance vector in order to identify outlier observations.
Usage
farness(dist, cutoff_value = NULL)
Arguments
dist |
Vector of distances of each observation. |
cutoff_value |
Optional cutoff value between 0 and 1 to flag outliers. If provided, the function returns both the farness probabilities and the cutoff distance value in the original distance scale. |
Value
Farness of each observation. Values between 0 and 1. If cutoff_value is provided, a list with the farness probabilities and the cutoff distance value in the original distance scale is returned.
References
J. Raymaekers and P.J. Rousseeuw (2021). Transforming variables to central normality. Machine Learning. doi:10.1007/s10994-021-05960-5
Based on the cellWise package: Raymaekers J, Rousseeuw P (2023). cellWise: Analyzing Data with Cellwise Outliers. R package version 2.5.3, https://CRAN.R-project.org/package=cellWise.
Examples
data(creditcard)
credit_card_int <- creditcard$intData
# Compute squared Interval-Mahalanobis distance
credit_card_dist <- IMah_dist(credit_card_int)
credit_card_farness <- farness(credit_card_dist, cutoff_value = 0.9)
Compute Latent Variables Parameters
Description
Obtain the parameters of the latent variables inherent to the macrodata.
Usage
get_latent_param(
LatentCase = c("U_id_symmetric", "U_id", "General"),
LatentDist = c("Unif", "Triang", "TNorm", "InvTri", "Beta", "KDE", "Degenerated"),
TriangParam = 0,
BetaParam.a = 1,
BetaParam.b = 1,
Umicro = NULL,
p = NULL,
estimate.DistParam = FALSE
)
Arguments
LatentCase |
A string specifying which of the three scenarios applies to the latent variables: "U_id_symmetric", "U_id", or "General".
Defaults to "U_id_symmetric". |
LatentDist |
A string or vector of strings specifying the distribution(s) of the latent variables. If the variables are identically distributed it can be one of ("Unif", "Triang", "TNorm", "InvTri", "Beta", "KDE", "Degenerated"). |
TriangParam |
Mode of the triangular distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 0. |
BetaParam.a |
Parameter alpha of the Beta distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 1. |
BetaParam.b |
Parameter beta of the Beta distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 1. |
Umicro |
Latent microdata observations.
Needed if |
p |
Number of variables. |
estimate.DistParam |
Logical parameter indicating if estimation of the parameters of the latent distributions should be performed. Can only be set to TRUE if |
Details
The parameters of the latent variables inherent to the macrodata are defined according to the LatentCase:
- "U_id_symmetric": The latent variables are identically distributed and symmetric, so the only parameter is:
  - \delta=\mathbb{E}(U^2)/4
- "U_id": The latent variables are identically distributed, so the parameters are:
  - \delta=\mathbb{E}(U^2)/4
  - \mathbb{E}(U)
- "General": No assumptions are made about the latent variables, so the parameters are:
  - [\boldsymbol{\mathfrak{E}}_{UU}]_{j\ell}=\mathcal{E}(U_j,U_\ell), j\neq \ell, with \mathcal{E}(U_j,U_\ell)=\int_0^1 F_{U_j}^{-1}(t) F_{U_\ell}^{-1}(t) \, dt, and [\boldsymbol{\mathfrak{E}}_{UU}]_{jj}=\mathbb{E}(U_j^2), j,\ell=1,\dots,p
  - \boldsymbol{\Psi}=\text{diag}(\mathbb{E}(U_1),\dots,\mathbb{E}(U_p))
Value
A list with the parameters of the latent variables.
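The moment-type parameters listed in the Details can be illustrated with plain sample moments computed from latent microdata. The sketch below is base R only and uses a tiny made-up matrix U with one column per variable; the package may estimate these quantities differently (for instance from a fitted distribution or a KDE), so the code should be read as an assumption-laden illustration.

```r
# Made-up latent microdata on [-1, 1]: rows are micro units, columns variables
U <- cbind(u1 = c(-1, 0, 1),
           u2 = c(0, 0.5, 1))

# "U_id_symmetric" / "U_id": a single latent variable (here u1)
delta <- mean(U[, "u1"]^2) / 4     # delta = E(U^2) / 4
e_u   <- mean(U[, "u1"])           # E(U), needed in the "U_id" case

# "General": one first and second moment per variable
Psi      <- diag(colMeans(U), nrow = ncol(U))  # diag(E(U_1), ..., E(U_p))
e_uu_dia <- colMeans(U^2)                      # diagonal of E_UU: E(U_j^2)
```

The off-diagonal entries of \boldsymbol{\mathfrak{E}}_{UU} are not covered here, since they require the quantile functions of both variables.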
References
Oliveira, M. R., Pinheiro, D., & Oliveira, L. (2025). Location and association measures for interval-valued data based on Mallows' distance. arXiv preprint arXiv:2407.05105. https://arxiv.org/abs/2407.05105
Examples
data(creditcard)
CreditCard_min_max <- creditcard$min_max
CreditCard_microdata <- creditcard$microdata
# Define grouping variable for microdata aggregation
credit_agrby <- paste(CreditCard_microdata$Name, CreditCard_microdata$Month, sep = "_")
# Obtain latent variables inherent to the macrodata (standardized to [-1,1])
credit_card_U <- get_latent_var(microdata = CreditCard_microdata[,3:7],
macrodata = CreditCard_min_max,
agrby = credit_agrby,
agrlevels = row.names(CreditCard_min_max),
Seq = "LbUb_VarbyVar")
# Obtain parameters of the latent variables
credit_card_param <- get_latent_param(LatentCase = "General",
LatentDist = "KDE",
Umicro = credit_card_U)
Compute Latent Variables
Description
Obtain the latent variables inherent to the macrodata.
Usage
get_latent_var(
microdata,
macrodata,
agrby,
agrlevels,
Seq = c("AllLb_AllUb", "AllCen_AllRng", "LbUb_VarbyVar", "CenRng_VarbyVar")
)
Arguments
microdata |
A matrix containing the microdata. |
macrodata |
A data frame, matrix or intData object containing the macrodata. |
agrby |
A factor used to specify the grouping of the microdata. |
agrlevels |
The categories/levels on which the microdata was aggregated. |
Seq |
Format of macrodata if it is a data frame or matrix. Available options are:
"AllLb_AllUb", "AllCen_AllRng", "LbUb_VarbyVar", and "CenRng_VarbyVar". |
Details
The latent variables, U_{j}, are defined according to the following model:
Let X_j=(C_j,R_j)^\top=\left[C_j-\dfrac{R_j}{2}, C_j+\dfrac{R_j}{2}\right] represent the macrodata and
V_{j}=C_j+U_{j}\dfrac{R_j}{2},\quad j=1,\dots,p,
the microdata with U_{j} being random variables with support on [-1,1], uncorrelated with (C_j,R_j).
Value
A matrix with the same size as the microdata.
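Inverting the model V_j = C_j + U_j R_j / 2 gives U_j = (V_j - C_j) / (R_j / 2), which maps each micro observation to [-1,1] within its group. A minimal base R sketch for a single variable follows. It assumes the interval of each group is taken as the minimum and maximum of its microdata; the helper name latent_from_micro and the toy values are hypothetical, and this is not the package's get_latent_var.

```r
# Hypothetical helper: latent values for one variable, grouped by 'group'.
# Assumption: each group's interval is [min, max] of its micro observations.
latent_from_micro <- function(v, group) {
  lb <- ave(v, group, FUN = min)      # lower bound of the group's interval
  ub <- ave(v, group, FUN = max)      # upper bound of the group's interval
  center     <- (lb + ub) / 2
  half_range <- (ub - lb) / 2
  # Note: a degenerate interval (half_range == 0) would produce NaN here
  (v - center) / half_range
}

v <- c(10, 12, 14, 3, 9)              # toy micro observations
g <- c("a", "a", "a", "b", "b")       # aggregation groups
u <- latent_from_micro(v, g)          # -1 0 1 -1 1
```

Observations at the lower bound map to -1, those at the upper bound to 1, and the midpoint to 0.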
References
Oliveira, M. R., Azeitona, M., Pacheco, A., & Valadas, R. (2022). Association measures for interval variables. Advances in Data Analysis and Classification, 16, 491–520. doi:10.1007/s11634-021-00445-8
Examples
data(creditcard)
CreditCard_min_max <- creditcard$min_max
CreditCard_microdata <- creditcard$microdata
# Define grouping variable for microdata aggregation
credit_agrby <- paste(CreditCard_microdata$Name, CreditCard_microdata$Month, sep = "_")
# Obtain latent variables inherent to the macrodata (standardized to [-1,1])
credit_card_U <- get_latent_var(microdata = CreditCard_microdata[,3:7],
macrodata = CreditCard_min_max,
agrby = credit_agrby,
agrlevels = row.names(CreditCard_min_max),
Seq = "LbUb_VarbyVar")
Head Method for intData
Description
Returns the first n rows of an intData object.
Usage
## S4 method for signature 'intData'
head(x, n = min(nrow(x), 6L))
Arguments
x |
An intData object. |
n |
The number of rows to return. |
Value
A subset of the intData object.
Cars Dataset
Description
This dataset contains interval data of car specifications, including min-max values. It is composed of 5 variables: Engine Capacity, Top Speed, Acceleration, Price and Class. The aggregation of the microdata was done by car model.
Usage
data(intCars)
Format
A list with the following components:
- microdata: A data frame with 27 rows and 9 columns. It contains the lower and upper bounds for each variable.
- intData: An intData object with 27 interval-valued observations and 4 variables. The variable "Price" was log-transformed into "lnPrice". The microdata are not available, thus the default parameters of the latent distributions were used assuming a uniform distribution.
References
This data was retrieved from the MAINT.Data package, available at https://cran.r-project.org/package=MAINT.Data.
Examples
data(intCars)
head(intCars$microdata)
head(intCars$intData)
Interval Data Constructor
Description
Constructs an interval data object.
Usage
intData(
macrodata,
Seq = c("AllLb_AllUb", "AllCen_AllRng", "LbUb_VarbyVar", "CenRng_VarbyVar"),
LatentParam = NULL,
LatentCase = c("U_id_symmetric", "U_id", "General"),
LatentDist = c("Unif", "Triang", "TNorm", "InvTri", "Beta", "KDE", "Degenerated"),
TriangParam = 0,
BetaParam.a = 1,
BetaParam.b = 1,
Umicro = NULL,
estimate.DistParam = FALSE,
VarNames = NULL,
ObsNames = row.names(macrodata),
NbMicroUnits = integer(0)
)
Arguments
macrodata |
A data frame or matrix containing the macrodata. |
Seq |
Format of macrodata if it is a data frame or matrix. Available options are:
"AllLb_AllUb", "AllCen_AllRng", "LbUb_VarbyVar", and "CenRng_VarbyVar". |
LatentParam |
A list with the parameters of the latent variables.
Expects a list with a single number if |
LatentCase |
A string specifying which of the three scenarios applies to the latent variables: "U_id_symmetric", "U_id", or "General".
Defaults to "U_id_symmetric". |
LatentDist |
A string or vector of strings specifying the distribution(s) of the latent variables. If the variables are identically distributed it can be one of ("Unif", "Triang", "TNorm", "InvTri", "Beta", "KDE", "Degenerated"). |
TriangParam |
Mode of the triangular distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 0. |
BetaParam.a |
Parameter alpha of the Beta distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 1. |
BetaParam.b |
Parameter beta of the Beta distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 1. |
Umicro |
Latent microdata observations.
Needed if |
estimate.DistParam |
Logical parameter indicating if estimation of the parameters of the latent distributions should be performed. Can only be set to TRUE if |
VarNames |
A character vector of variable names. |
ObsNames |
A character vector of observation names. |
NbMicroUnits |
An integer vector indicating the number of individual observations (microdata) aggregated by interval (macrodata). |
Value
An object of class intData.
References
Oliveira, M. R., Pinheiro, D., & Oliveira, L. (2025). Location and association measures for interval-valued data based on Mallows' distance. arXiv preprint arXiv:2407.05105. https://arxiv.org/abs/2407.05105
Adapted from package MAINT.Data (https://cran.r-project.org/package=MAINT.Data).
Examples
# Load microdata and macrodata
data(creditcard)
CreditCard_microdata <- creditcard$microdata
CreditCard_min_max <- creditcard$min_max
# Create an intData object using the min_max component of the dataset
# Assume a continuous uniform distribution for the latent variables
# This corresponds to LatentCase="U_id_symmetric"
# This is the default setting for the intData class
credit_card_int_unif <- intData(CreditCard_min_max,
Seq = "LbUb_VarbyVar",
VarNames = colnames(CreditCard_microdata)[3:7])
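For intuition, the conversion from interval bounds to the centers and ranges stored in an interval data object can be sketched in base R. This is a minimal sketch and not the package's implementation: it assumes that "LbUb_VarbyVar" means the columns are ordered as lower bound, upper bound for the first variable, then for the second, and so on, and that a range is the interval length. The toy matrix macro and the names lb, ub, centers and ranges are illustrative only.

```r
# Toy macrodata: 3 observations, 2 interval variables.
# Assumed column layout: (lower, upper) pairs, variable by variable.
macro <- matrix(c(1, 3, 10, 14,
                  2, 6, 11, 12,
                  0, 1,  9, 15),
                ncol = 4, byrow = TRUE,
                dimnames = list(paste0("obs", 1:3),
                                c("V1.lb", "V1.ub", "V2.lb", "V2.ub")))

lb <- macro[, seq(1, ncol(macro), by = 2), drop = FALSE]  # lower bounds
ub <- macro[, seq(2, ncol(macro), by = 2), drop = FALSE]  # upper bounds

centers <- (lb + ub) / 2   # interval midpoints
ranges  <- ub - lb         # interval lengths (assumed definition of a range)
colnames(centers) <- colnames(ranges) <- c("V1", "V2")
```

Every entry of ranges is non-negative as long as each upper bound is at least its lower bound, which is a useful sanity check before building the object.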
Interval Data Class
Description
A class to represent interval data.
Slots
Centers |
A data frame of centers of the intervals. |
Ranges |
A data frame of ranges of the intervals. |
LatentParam |
A list with the parameters of the latent variables. |
LatentCase |
A string specifying which of the three scenarios applies to the latent variables:
- "U_id_symmetric": The latent variables are identically distributed and symmetric.
- "U_id": The latent variables are identically distributed.
- "General": The latent variables do not have any nice properties.
Defaults to "U_id_symmetric". |
LatentDist |
A string or vector of strings specifying the distribution(s) of the latent variables. If the variables are identically distributed it can be one of ("Unif", "Triang", "TNorm", "InvTri", "Beta", "KDE", "Degenerated"); if not, it is a vector with the distribution for each variable. |
ObsNames |
A character vector of observation names. |
VarNames |
A character vector of variable names. |
NObs |
A numeric value indicating the number of observations. |
NIVar |
A numeric value indicating the number of interval variables. |
NbMicroUnits |
An integer vector indicating the number of individual observations (microdata) aggregated by interval (macrodata). |
References
Oliveira, M. R., Pinheiro, D., & Oliveira, L. (2025). Location and association measures for interval-valued data based on Mallows' distance. arXiv preprint arXiv:2407.05105. https://arxiv.org/abs/2407.05105
Adapted from package MAINT.Data (https://cran.r-project.org/package=MAINT.Data).
Compute Shapley Values for Interval-valued Data
Description
Outlier explanation based on Shapley values for interval-valued data. Decomposes the squared interval-valued Mahalanobis distance into additive outlyingness contributions of the variables.
Usage
int_Shapley(data, mean_c = NULL, mean_r = NULL, cov = NULL)
Arguments
data |
An intData object. |
mean_c |
(Optional) A vector specifying the mean of centers. Defaults to NULL. |
mean_r |
(Optional) A vector specifying the mean of ranges. Defaults to NULL. |
cov |
(Optional) A covariance matrix. Defaults to NULL. |
Details
The Shapley value decomposes the squared Interval-Mahalanobis distance (see IMah_dist) into additive outlyingness contributions of the variables.
Let \boldsymbol{\mu}_B=(\boldsymbol{\mu}_C^\top,\boldsymbol{\mu}_R^\top)^\top be the barycenter and \boldsymbol{\Sigma}_B the symbolic covariance matrix (see int_cov).
The Shapley value of an interval-valued observation \boldsymbol{x}=(\boldsymbol{c}^\top,\boldsymbol{r}^\top)^\top, for the Interval-Mahalanobis distance, is defined according to the LatentCase:
- "U_id_symmetric": The latent variables are identically distributed and symmetric:
\boldsymbol{\phi}(\boldsymbol{x})=(\boldsymbol{c}-\boldsymbol{\mu}_C)\bullet\left[\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{c}-\boldsymbol{\mu}_C)\right]+\delta(\boldsymbol{r}-\boldsymbol{\mu}_R)\bullet\left[\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{r}-\boldsymbol{\mu}_R)\right],
where \delta=\mathbb{E}(U^2)/4 is the parameter of the latent variables.
- "U_id": The latent variables are identically distributed:
\begin{aligned} \boldsymbol{\phi}(\boldsymbol{x})&=(\boldsymbol{c}-\boldsymbol{\mu}_C)\bullet\left[\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{c}-\boldsymbol{\mu}_C)\right]+\delta(\boldsymbol{r}-\boldsymbol{\mu}_R)\bullet\left[\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{r}-\boldsymbol{\mu}_R)\right]\\ &\quad+\dfrac{\mathbb{E}(U)}{2}(\boldsymbol{c}-\boldsymbol{\mu}_C)\bullet\left[\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{r}-\boldsymbol{\mu}_R)\right]+\dfrac{\mathbb{E}(U)}{2}(\boldsymbol{r}-\boldsymbol{\mu}_R)\bullet\left[\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{c}-\boldsymbol{\mu}_C)\right], \end{aligned}
where \delta=\mathbb{E}(U^2)/4 and \mathbb{E}(U) are the parameters of the latent variables.
- "General": The latent variables do not have any nice properties:
\begin{aligned} \boldsymbol{\phi}(\boldsymbol{x})&=(\boldsymbol{c}-\boldsymbol{\mu}_C)\bullet\left[\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{c}-\boldsymbol{\mu}_C)\right] +\dfrac{1}{4}(\boldsymbol{r}-\boldsymbol{\mu}_R)\bullet\left[\left(\boldsymbol{\mathfrak{E}}_{UU}\bullet\boldsymbol{\Sigma}_B^{-1}\right)(\boldsymbol{r}-\boldsymbol{\mu}_R)\right]\\ &\quad+\dfrac{1}{2}(\boldsymbol{c}-\boldsymbol{\mu}_C)\bullet\left[\boldsymbol{\Sigma}_B^{-1}\boldsymbol{\Psi}(\boldsymbol{r}-\boldsymbol{\mu}_R)\right] +\dfrac{1}{2}(\boldsymbol{r}-\boldsymbol{\mu}_R)\bullet\left[\boldsymbol{\Psi}\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{c}-\boldsymbol{\mu}_C)\right], \end{aligned}
where:
  - \boldsymbol{\Psi}=\text{diag}(\mathbb{E}(U_1),\dots,\mathbb{E}(U_p)),
  - [\boldsymbol{\mathfrak{E}}_{UU}]_{j\ell}=\mathcal{E}(U_j,U_\ell), j\neq \ell, with \mathcal{E}(U_j,U_\ell)=\int_0^1 F_{U_j}^{-1}(t) F_{U_\ell}^{-1}(t) \, dt,
  - [\boldsymbol{\mathfrak{E}}_{UU}]_{jj}=\mathbb{E}(U_j^2), j,\ell=1,\dots,p,
  - \bullet denotes the Schur (or entrywise) product of matrices.
Value
A matrix of Shapley values with row and column names corresponding to the rows and columns of the input data.
References
Loureiro, C. P., Oliveira, M. R., Brito, P., & Oliveira, L. (2026). Explainable Outlier Detection for Interval-valued Data. arXiv preprint arXiv:2606.26307. https://arxiv.org/abs/2606.26307
Examples
# Create intData object
data(creditcard)
credit_card_int <- creditcard$intData
# Compute Shapley values based on IMCD estimates of mean and covariance
credit_card_shapley <- int_Shapley(credit_card_int)
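The "U_id_symmetric" formula in the Details can also be evaluated by hand in base R, which makes the additivity property easy to check. The following is a minimal sketch, not the package's implementation: all inputs (c_x, r_x, mu_c, mu_r, Sigma_B) are made-up values, and delta = 1/12 assumes latent variables uniform on [-1, 1], so that E(U^2) = 1/3.

```r
set.seed(1)
p <- 3
# Hypothetical observation (centers c_x, ranges r_x) and barycenter (mu_c, mu_r)
c_x  <- c(1.0, 2.0, 0.5); mu_c <- c(0.8, 1.5, 0.7)
r_x  <- c(0.4, 1.0, 0.2); mu_r <- c(0.5, 0.6, 0.3)
# Hypothetical positive definite symbolic covariance matrix
A <- matrix(rnorm(p * p), p)
Sigma_B <- crossprod(A) + diag(p)
delta <- (1 / 3) / 4   # E(U^2)/4, assuming U ~ Uniform(-1, 1)

Omega <- solve(Sigma_B)
dc <- c_x - mu_c       # deviations of the centers
dr <- r_x - mu_r       # deviations of the ranges

# Shapley value: entrywise product of each deviation vector with Omega %*% deviation
phi <- dc * drop(Omega %*% dc) + delta * dr * drop(Omega %*% dr)

# Additivity: the p contributions sum to the quadratic form they decompose
total <- drop(t(dc) %*% Omega %*% dc + delta * t(dr) %*% Omega %*% dr)
```

Here sum(phi) and total agree up to floating-point error, which is the sense in which the Shapley value splits the squared distance across variables.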
Compute Shapley Decomposition into contributions of (Centers, Ranges, and CrossCentersRanges) for Interval-valued Data
Description
Decomposes the squared interval-valued Mahalanobis distance of each observation into outlyingness contributions of (Centers, Ranges, and CrossCentersRanges) per variable for interval-valued data.
Usage
int_Shapley_decomp(data, mean_c = NULL, mean_r = NULL, cov = NULL)
Arguments
data |
An intData object. |
mean_c |
(Optional) A vector specifying the mean of centers. Defaults to NULL. |
mean_r |
(Optional) A vector specifying the mean of ranges. Defaults to NULL. |
cov |
(Optional) A covariance matrix. Defaults to NULL. |
Details
Let \boldsymbol{\mu}_B=(\boldsymbol{\mu}_C^\top,\boldsymbol{\mu}_R^\top)^\top be the barycenter and \boldsymbol{\Sigma}_B the symbolic covariance matrix (see int_cov).
Based on the Shapley value (see int_Shapley), we can further decompose the Interval-Mahalanobis distance of an interval-valued observation \boldsymbol{x}=(\boldsymbol{c}^\top,\boldsymbol{r}^\top)^\top into contributions of the centers, ranges and cross-centers-ranges of the variables. The decomposition is defined according to the LatentCase:
- "U_id_symmetric": The latent variables are identically distributed and symmetric:
  Centers contribution: (\boldsymbol{c}-\boldsymbol{\mu}_C)\bullet\left[\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{c}-\boldsymbol{\mu}_C)\right],
  Ranges contribution: \delta(\boldsymbol{r}-\boldsymbol{\mu}_R)\bullet\left[\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{r}-\boldsymbol{\mu}_R)\right],
  where \delta=\mathbb{E}(U^2)/4 is the parameter of the latent variables.
- "U_id": The latent variables are identically distributed:
  Centers contribution: (\boldsymbol{c}-\boldsymbol{\mu}_C)\bullet\left[\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{c}-\boldsymbol{\mu}_C)\right],
  Ranges contribution: \delta(\boldsymbol{r}-\boldsymbol{\mu}_R)\bullet\left[\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{r}-\boldsymbol{\mu}_R)\right],
  CrossCentersRanges contribution: \dfrac{\mathbb{E}(U)}{2}(\boldsymbol{c}-\boldsymbol{\mu}_C)\bullet\left[\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{r}-\boldsymbol{\mu}_R)\right]+\dfrac{\mathbb{E}(U)}{2}(\boldsymbol{r}-\boldsymbol{\mu}_R)\bullet\left[\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{c}-\boldsymbol{\mu}_C)\right],
  where \delta=\mathbb{E}(U^2)/4 and \mathbb{E}(U) are the parameters of the latent variables.
- "General": The latent variables do not have any nice properties:
  Centers contribution: (\boldsymbol{c}-\boldsymbol{\mu}_C)\bullet\left[\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{c}-\boldsymbol{\mu}_C)\right],
  Ranges contribution: \dfrac{1}{4}(\boldsymbol{r}-\boldsymbol{\mu}_R)\bullet\left[\left(\boldsymbol{\mathfrak{E}}_{UU}\bullet\boldsymbol{\Sigma}_B^{-1}\right)(\boldsymbol{r}-\boldsymbol{\mu}_R)\right],
  CrossCentersRanges contribution: \dfrac{1}{2}(\boldsymbol{c}-\boldsymbol{\mu}_C)\bullet\left[\boldsymbol{\Sigma}_B^{-1}\boldsymbol{\Psi}(\boldsymbol{r}-\boldsymbol{\mu}_R)\right]+\dfrac{1}{2}(\boldsymbol{r}-\boldsymbol{\mu}_R)\bullet\left[\boldsymbol{\Psi}\boldsymbol{\Sigma}_B^{-1}(\boldsymbol{c}-\boldsymbol{\mu}_C)\right],
  where:
  - \boldsymbol{\Psi}=\text{diag}(\mathbb{E}(U_1),\dots,\mathbb{E}(U_p)),
  - [\boldsymbol{\mathfrak{E}}_{UU}]_{j\ell}=\mathcal{E}(U_j,U_\ell), j\neq \ell, with \mathcal{E}(U_j,U_\ell)=\int_0^1 F_{U_j}^{-1}(t) F_{U_\ell}^{-1}(t) \, dt,
  - [\boldsymbol{\mathfrak{E}}_{UU}]_{jj}=\mathbb{E}(U_j^2), j,\ell=1,\dots,p,
  - \bullet denotes the Schur (or entrywise) product of matrices.
Value
A list containing the matrix of Shapley value decomposition into contributions of (Centers, Ranges, and CrossCentersRanges) per variable for each observation.
References
Loureiro, C. P., Oliveira, M. R., Brito, P., & Oliveira, L. (2026). Explainable Outlier Detection for Interval-valued Data. arXiv preprint arXiv:2606.26307. https://arxiv.org/abs/2606.26307
Examples
# Create intData object
data(creditcard)
credit_card_int <- creditcard$intData
# Compute Shapley decomposition into contributions of (Centers, Ranges, and CrossCentersRanges)
# based on IMCD estimates of mean and covariance
credit_card_shap_decomp <- int_Shapley_decomp(credit_card_int)
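To see how the three contributions in the "U_id" case fit together, the formulas in the Details can be evaluated directly in base R. This is a minimal sketch under stated assumptions, not the package's implementation: the deviations dc and dr, the matrix Sigma_B and the latent moments EU and EU2 are hypothetical values chosen only for illustration.

```r
p <- 3
dc <- c(0.2, 0.5, -0.2)    # c - mu_C: hypothetical deviations of the centers
dr <- c(-0.1, 0.4, -0.1)   # r - mu_R: hypothetical deviations of the ranges
Sigma_B <- matrix(c(2.0, 0.3, 0.1,
                    0.3, 1.5, 0.2,
                    0.1, 0.2, 1.0), p)  # hypothetical symbolic covariance
Omega <- solve(Sigma_B)
EU    <- 0.2               # E(U), hypothetical
EU2   <- 0.4               # E(U^2), hypothetical
delta <- EU2 / 4

# One row per variable, one column per type of contribution
decomp <- cbind(
  Centers            = dc * drop(Omega %*% dc),
  Ranges             = delta * dr * drop(Omega %*% dr),
  CrossCentersRanges = EU / 2 * (dc * drop(Omega %*% dr) + dr * drop(Omega %*% dc))
)

phi <- rowSums(decomp)     # per-variable total of the three contributions
```

Summing each row of decomp gives the per-variable Shapley value of the "U_id" case, and summing all entries gives the quadratic form that the decomposition splits.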
Compute Shapley interaction indices for Interval-valued Data
Description
Obtains a p \times p matrix containing pairwise outlyingness scores based on Shapley interaction indices for each observation.
Decomposes the squared interval-valued Mahalanobis distance of each observation into outlyingness contributions of pairs of variables.
Usage
int_Shapley_interaction(data, mean_c = NULL, mean_r = NULL, cov = NULL)
Arguments
data |
An intData object. |
mean_c |
(Optional) A vector specifying the mean of centers. Defaults to NULL. |
mean_r |
(Optional) A vector specifying the mean of ranges. Defaults to NULL. |
cov |
(Optional) A covariance matrix. Defaults to NULL. |
Details
Let \boldsymbol{\mu}_B=(\boldsymbol{\mu}_C^\top,\boldsymbol{\mu}_R^\top)^\top be the barycenter and \boldsymbol{\Sigma}_B the symbolic covariance matrix (see int_cov).
Let also \boldsymbol{\phi}(\boldsymbol{x}) be the Shapley value of \boldsymbol{x} (see int_Shapley) and \mathrm{diag}(\boldsymbol{v}) be the diagonal matrix whose main diagonal is the vector \boldsymbol{v}.
The Shapley interaction index of an interval-valued observation \boldsymbol{x}=(\boldsymbol{c}^\top,\boldsymbol{r}^\top)^\top, for the Interval-Mahalanobis distance, is defined according to the LatentCase:
- "U_id_symmetric": The latent variables are identically distributed and symmetric:
\boldsymbol{\Phi}(\boldsymbol{x})=2(\boldsymbol{c}-\boldsymbol{\mu}_C)(\boldsymbol{c}-\boldsymbol{\mu}_C)^\top\bullet\boldsymbol{\Sigma}_B^{-1} + 2\delta(\boldsymbol{r}-\boldsymbol{\mu}_R)(\boldsymbol{r}-\boldsymbol{\mu}_R)^\top\bullet\boldsymbol{\Sigma}_B^{-1}-\mathrm{diag}\left(\boldsymbol{\phi}(\boldsymbol{x})\right),
where \delta=\mathbb{E}(U^2)/4 is the parameter of the latent variables.
- "U_id": The latent variables are identically distributed:
\begin{aligned} \boldsymbol{\Phi}(\boldsymbol{x})&=2(\boldsymbol{c}-\boldsymbol{\mu}_C)(\boldsymbol{c}-\boldsymbol{\mu}_C)^\top\bullet\boldsymbol{\Sigma}_B^{-1} + 2\delta(\boldsymbol{r}-\boldsymbol{\mu}_R)(\boldsymbol{r}-\boldsymbol{\mu}_R)^\top\bullet\boldsymbol{\Sigma}_B^{-1}\\ &\quad+\mathbb{E}(U)(\boldsymbol{c}-\boldsymbol{\mu}_C)(\boldsymbol{r}-\boldsymbol{\mu}_R)^\top\bullet\boldsymbol{\Sigma}_B^{-1} + \mathbb{E}(U)(\boldsymbol{r}-\boldsymbol{\mu}_R)(\boldsymbol{c}-\boldsymbol{\mu}_C)^\top\bullet\boldsymbol{\Sigma}_B^{-1}-\mathrm{diag}\left(\boldsymbol{\phi}(\boldsymbol{x})\right), \end{aligned}
where \delta=\mathbb{E}(U^2)/4 and \mathbb{E}(U) are the parameters of the latent variables.
- "General": The latent variables do not have any nice properties:
\begin{aligned} \boldsymbol{\Phi}(\boldsymbol{x})&=2(\boldsymbol{c}-\boldsymbol{\mu}_C)(\boldsymbol{c}-\boldsymbol{\mu}_C)^\top\bullet\boldsymbol{\Sigma}_B^{-1} + \dfrac{1}{2}(\boldsymbol{r}-\boldsymbol{\mu}_R)(\boldsymbol{r}-\boldsymbol{\mu}_R)^\top\bullet\boldsymbol{\mathfrak{E}}_{UU}\bullet\boldsymbol{\Sigma}_B^{-1}\\ &\quad+(\boldsymbol{c}-\boldsymbol{\mu}_C)(\boldsymbol{r}-\boldsymbol{\mu}_R)^\top\bullet\boldsymbol{\Sigma}_B^{-1}\boldsymbol{\Psi} + (\boldsymbol{r}-\boldsymbol{\mu}_R)(\boldsymbol{c}-\boldsymbol{\mu}_C)^\top\bullet\boldsymbol{\Psi}\boldsymbol{\Sigma}_B^{-1}-\mathrm{diag}\left(\boldsymbol{\phi}(\boldsymbol{x})\right), \end{aligned}
where:
  - \boldsymbol{\Psi}=\text{diag}(\mathbb{E}(U_1),\dots,\mathbb{E}(U_p)),
  - [\boldsymbol{\mathfrak{E}}_{UU}]_{j\ell}=\mathcal{E}(U_j,U_\ell), j\neq \ell, with \mathcal{E}(U_j,U_\ell)=\int_0^1 F_{U_j}^{-1}(t) F_{U_\ell}^{-1}(t) \, dt,
  - [\boldsymbol{\mathfrak{E}}_{UU}]_{jj}=\mathbb{E}(U_j^2), j,\ell=1,\dots,p,
  - \bullet denotes the Schur (or entrywise) product of matrices.
Value
A list containing the matrix of Shapley interaction indices for each observation.
References
Loureiro, C. P., Oliveira, M. R., Brito, P., & Oliveira, L. (2026). Explainable Outlier Detection for Interval-valued Data. arXiv preprint arXiv:2606.26307. https://arxiv.org/abs/2606.26307
Examples
# Create intData object
data(creditcard)
credit_card_int <- creditcard$intData
# Compute Shapley interaction indices based on the mean and covariance matrix estimated by IMCD
credit_card_shap_inter <- int_Shapley_interaction(credit_card_int)
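A useful property of the interaction matrix, which follows from the "U_id_symmetric" formula in the Details, is that its rows sum to the Shapley values. The base R sketch below checks this numerically. It is not the package's implementation: dc, dr and Sigma_B are hypothetical, and delta = 1/12 assumes latent variables uniform on [-1, 1].

```r
p <- 3
dc <- c(0.2, 0.5, -0.2)    # c - mu_C: hypothetical deviations of the centers
dr <- c(-0.1, 0.4, -0.1)   # r - mu_R: hypothetical deviations of the ranges
Sigma_B <- matrix(c(2.0, 0.3, 0.1,
                    0.3, 1.5, 0.2,
                    0.1, 0.2, 1.0), p)  # hypothetical symbolic covariance
Omega <- solve(Sigma_B)
delta <- 1 / 12            # E(U^2)/4, assuming U ~ Uniform(-1, 1)

# Shapley value of the observation (symmetric case)
phi <- dc * drop(Omega %*% dc) + delta * dr * drop(Omega %*% dr)

# Interaction matrix: `*` between matrices is the entrywise (Schur) product
Phi <- 2 * outer(dc, dc) * Omega +
  2 * delta * outer(dr, dr) * Omega -
  diag(phi)
```

In this sketch rowSums(Phi) reproduces phi, so the off-diagonal entries describe how each variable's contribution is shared with the others.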
Interval-valued Covariance
Description
Calculates the interval-valued (symbolic) covariance matrix, either from the covariance matrices of the centers and ranges or directly from the data.
Usage
int_cov(
data = NULL,
sigma_cc = NULL,
sigma_rr = NULL,
sigma_cr = NULL,
LatentParam = NULL,
LatentCase = c("U_id_symmetric", "U_id", "General")
)
Arguments
data |
An intData object. |
sigma_cc |
Covariance matrix of the centers. |
sigma_rr |
Covariance matrix of the ranges. |
sigma_cr |
Covariance matrix between the centers and ranges. |
LatentParam |
A list with the parameters of the latent variables.
Expects a list with a single number if |
LatentCase |
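A string specifying which of the three scenarios applies to the latent variables:
- "U_id_symmetric": The latent variables are identically distributed and symmetric.
- "U_id": The latent variables are identically distributed.
- "General": The latent variables do not have any nice properties.
Defaults to "U_id_symmetric". |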
Details
This function calculates the interval-valued covariance matrix, \boldsymbol{\Sigma}_B, based on the covariance matrices of the centers, \boldsymbol{\Sigma}_{CC}, ranges, \boldsymbol{\Sigma}_{RR}, and the covariance matrix between the centers and ranges, \boldsymbol{\Sigma}_{CR}=\boldsymbol{\Sigma}_{RC}^\top.
The covariance matrix is defined according to the LatentCase:
- "U_id_symmetric": The latent variables are identically distributed and symmetric:
\boldsymbol{\Sigma}_B=\boldsymbol{\Sigma}_{CC}+\delta\boldsymbol{\Sigma}_{RR},
where \delta=\mathbb{E}(U^2)/4 is the parameter of the latent variables.
- "U_id": The latent variables are identically distributed:
\boldsymbol{\Sigma}_B=\boldsymbol{\Sigma}_{CC}+\delta\boldsymbol{\Sigma}_{RR}+\dfrac{\mathbb{E}(U)}{2}\left(\boldsymbol{\Sigma}_{CR}+\boldsymbol{\Sigma}_{RC}\right),
where \delta=\mathbb{E}(U^2)/4 and \mathbb{E}(U) are the parameters of the latent variables.
- "General": The latent variables do not have any nice properties:
\boldsymbol{\Sigma}_B=\boldsymbol{\Sigma}_{CC}+\dfrac{1}{4}\boldsymbol{\mathfrak{E}}_{UU}\bullet\boldsymbol{\Sigma}_{RR}+\dfrac{1}{2}\boldsymbol{\Sigma}_{CR}\boldsymbol{\Psi}+\dfrac{1}{2}\boldsymbol{\Psi}\boldsymbol{\Sigma}_{RC},
where:
  - \boldsymbol{\Psi}=\text{diag}(\mathbb{E}(U_1),\dots,\mathbb{E}(U_p)),
  - [\boldsymbol{\mathfrak{E}}_{UU}]_{j\ell}=\mathcal{E}(U_j,U_\ell), j\neq \ell, with \mathcal{E}(U_j,U_\ell)=\int_0^1 F_{U_j}^{-1}(t) F_{U_\ell}^{-1}(t) \, dt,
  - [\boldsymbol{\mathfrak{E}}_{UU}]_{jj}=\mathbb{E}(U_j^2), j,\ell=1,\dots,p,
  - \bullet denotes the Schur (or entrywise) product of matrices.
The covariance matrix can be calculated either based on the covariance matrices of the centers and ranges or based on the data. If the data is provided, the covariance matrices are calculated using the sample covariance of the centers and ranges and the sample covariance between centers and ranges.
For the robust estimation of the covariance matrix, see IMCD.
Value
The symbolic covariance matrix.
References
Oliveira, M. R., Pinheiro, D., & Oliveira, L. (2025). Location and association measures for interval-valued data based on Mallows' distance. arXiv preprint arXiv:2407.05105. https://arxiv.org/abs/2407.05105
Examples
data(creditcard)
credit_card_int <- creditcard$intData
credit_card_cov <- int_cov(credit_card_int)
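The first two formulas in the Details can be reproduced with base R on plain matrices of centers and ranges. This is a minimal sketch, not the package's implementation: the simulated matrices centers and ranges are made up, cov() uses the n - 1 divisor (whether the package does the same is not stated here), delta = 1/12 assumes latent variables uniform on [-1, 1], and EU = 0.1 is a hypothetical value of E(U).

```r
set.seed(123)
n <- 50
p <- 3
centers <- matrix(rnorm(n * p), ncol = p)   # hypothetical interval centers
ranges  <- matrix(rexp(n * p), ncol = p)    # hypothetical interval ranges

Sigma_CC <- cov(centers)            # covariance of the centers
Sigma_RR <- cov(ranges)             # covariance of the ranges
Sigma_CR <- cov(centers, ranges)    # cross-covariance centers/ranges
Sigma_RC <- t(Sigma_CR)

# "U_id_symmetric": Sigma_B = Sigma_CC + delta * Sigma_RR
delta   <- (1 / 3) / 4              # E(U^2)/4, assuming U ~ Uniform(-1, 1)
Sigma_B <- Sigma_CC + delta * Sigma_RR

# "U_id": adds the cross term weighted by E(U)/2
EU <- 0.1                           # hypothetical E(U)
Sigma_B_id <- Sigma_CC + delta * Sigma_RR + EU / 2 * (Sigma_CR + Sigma_RC)
```

Setting EU to zero makes the second expression collapse to the first, which is why the symmetric case needs only one latent parameter.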
Sample Interval-valued Covariance
Description
Calculates the sample interval-valued covariance matrix as a function of z.
Usage
int_cov_z(z, data)
Arguments
z |
A vector of 0s and 1s indicating which observations are included in the calculation. |
data |
An intData object. |
Details
Let \boldsymbol{z}\in\{0,1\}^n be a vector indicating which m observations are "active". This function calculates the sample interval-valued covariance matrix as a function of \boldsymbol{z}: \boldsymbol{S}_B(\boldsymbol{z}).
Let \boldsymbol{C}, \boldsymbol{R} be the matrices of centers and ranges, respectively. Additionally, set:
\overline{\boldsymbol{c}}_B(\boldsymbol{z})=\dfrac{1}{m}\boldsymbol{C}^{\top}\boldsymbol{z}, \qquad \overline{\boldsymbol{r}}_B(\boldsymbol{z})=\dfrac{1}{m}\boldsymbol{R}^{\top}\boldsymbol{z}.
The sample interval-valued covariance matrix is obtained according to the LatentCase:
- "U_id_symmetric": The latent variables are identically distributed and symmetric:
\boldsymbol{S}_B(\boldsymbol{z})=\left(\dfrac{1}{m}\sum\limits_{h=1}^{n}z_{h}\boldsymbol{c}_{h}\boldsymbol{c}_{h}^{\top}\right)-\overline{\boldsymbol{c}}_B(\boldsymbol{z})\overline{\boldsymbol{c}}_B(\boldsymbol{z})^\top+\left(\dfrac{\delta}{m}\sum\limits_{h=1}^{n}z_{h}\boldsymbol{r}_{h}\boldsymbol{r}_{h}^{\top}\right)-\delta\overline{\boldsymbol{r}}_B(\boldsymbol{z})\overline{\boldsymbol{r}}_B(\boldsymbol{z})^\top,
where \delta=\mathbb{E}(U^2)/4 is the parameter of the latent variables.
- "U_id": The latent variables are identically distributed:
\begin{aligned} \boldsymbol{S}_B(\boldsymbol{z})&=\left(\dfrac{1}{m}\sum\limits_{h=1}^{n}z_{h}\boldsymbol{c}_{h}\boldsymbol{c}_{h}^{\top}\right)-\overline{\boldsymbol{c}}_B(\boldsymbol{z})\overline{\boldsymbol{c}}_B(\boldsymbol{z})^\top+\left(\dfrac{\delta}{m}\sum\limits_{h=1}^{n}z_{h}\boldsymbol{r}_{h}\boldsymbol{r}_{h}^{\top}\right)-\delta\overline{\boldsymbol{r}}_B(\boldsymbol{z})\overline{\boldsymbol{r}}_B(\boldsymbol{z})^\top\\ &\quad+\left(\dfrac{\mathbb{E}(U)}{2m}\sum\limits_{h=1}^{n}z_{h}\boldsymbol{c}_{h}\boldsymbol{r}_{h}^{\top}\right)-\dfrac{\mathbb{E}(U)}{2}\overline{\boldsymbol{c}}_B(\boldsymbol{z})\overline{\boldsymbol{r}}_B(\boldsymbol{z})^\top\\ &\quad+\left(\dfrac{\mathbb{E}(U)}{2m}\sum\limits_{h=1}^{n}z_{h}\boldsymbol{r}_{h}\boldsymbol{c}_{h}^{\top}\right)-\dfrac{\mathbb{E}(U)}{2}\overline{\boldsymbol{r}}_B(\boldsymbol{z})\overline{\boldsymbol{c}}_B(\boldsymbol{z})^\top, \end{aligned}
where \delta=\mathbb{E}(U^2)/4 and \mathbb{E}(U) are the parameters of the latent variables.
- "General": The latent variables do not have any nice properties:
\begin{aligned} \boldsymbol{S}_B(\boldsymbol{z})&=\left(\dfrac{1}{m}\sum\limits_{h=1}^{n}z_{h}\boldsymbol{c}_{h}\boldsymbol{c}_{h}^{\top}\right)-\overline{\boldsymbol{c}}_B(\boldsymbol{z})\overline{\boldsymbol{c}}_B(\boldsymbol{z})^\top\\ &\quad+\left(\dfrac{1}{4m}\boldsymbol{\mathfrak{E}}_{UU}\bullet\sum\limits_{h=1}^{n}z_{h}\boldsymbol{r}_{h}\boldsymbol{r}_{h}^{\top}\right)-\dfrac{1}{4}\boldsymbol{\mathfrak{E}}_{UU}\bullet\left[\overline{\boldsymbol{r}}_B(\boldsymbol{z})\overline{\boldsymbol{r}}_B(\boldsymbol{z})^\top\right]\\ &\quad+\left(\dfrac{1}{2m}\sum\limits_{h=1}^{n}z_{h}\boldsymbol{c}_{h}\boldsymbol{r}_{h}^{\top}\right)\boldsymbol{\Psi}-\dfrac{1}{2}\overline{\boldsymbol{c}}_B(\boldsymbol{z})\overline{\boldsymbol{r}}_B(\boldsymbol{z})^\top\boldsymbol{\Psi}\\ &\quad+\boldsymbol{\Psi}\left(\dfrac{1}{2m}\sum\limits_{h=1}^{n}z_{h}\boldsymbol{r}_{h}\boldsymbol{c}_{h}^{\top}\right)-\dfrac{1}{2}\boldsymbol{\Psi}\overline{\boldsymbol{r}}_B(\boldsymbol{z})\overline{\boldsymbol{c}}_B(\boldsymbol{z})^\top, \end{aligned}
where:
  - \boldsymbol{\Psi}=\text{diag}(\mathbb{E}(U_1),\dots,\mathbb{E}(U_p)),
  - [\boldsymbol{\mathfrak{E}}_{UU}]_{j\ell}=\mathcal{E}(U_j,U_\ell), j\neq \ell, with \mathcal{E}(U_j,U_\ell)=\int_0^1 F_{U_j}^{-1}(t) F_{U_\ell}^{-1}(t) \, dt,
  - [\boldsymbol{\mathfrak{E}}_{UU}]_{jj}=\mathbb{E}(U_j^2), j,\ell=1,\dots,p,
  - \bullet denotes the Schur (or entrywise) product of matrices.
Value
The symbolic covariance matrix.
References
Oliveira, M. R., Pinheiro, D., & Oliveira, L. (2025). Location and association measures for interval-valued data based on Mallows' distance. arXiv preprint arXiv:2407.05105. https://arxiv.org/abs/2407.05105
Loureiro, C. P., Oliveira, M. R., Brito, P., & Oliveira, L. (2026). Minimum Covariance Determinant Estimator and Outlier Detection for Interval-valued Data. arXiv preprint arXiv:2604.26769. https://arxiv.org/abs/2604.26769
Examples
data(creditcard)
credit_card_int <- creditcard$intData
# Compute the sample interval-valued covariance matrix using all the observations
z <- rep(1, nrow(credit_card_int))
credit_card_cov <- int_cov_z(z, credit_card_int)
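The "U_id_symmetric" expression for S_B(z) in the Details can be written in a few lines of base R acting on plain matrices of centers and ranges. This is a minimal sketch, not the package's implementation: the matrices C and R are simulated, the choice of active observations is arbitrary, and delta = 1/12 assumes latent variables uniform on [-1, 1].

```r
set.seed(7)
n <- 20
p <- 2
C <- matrix(rnorm(n * p), ncol = p)   # hypothetical centers
R <- matrix(rexp(n * p), ncol = p)    # hypothetical ranges
delta <- 1 / 12                       # E(U^2)/4, assuming U ~ Uniform(-1, 1)

z <- rep(c(1, 0), length.out = n)     # every other observation is active
m <- sum(z)

c_bar <- drop(crossprod(C, z)) / m    # (1/m) C' z
r_bar <- drop(crossprod(R, z)) / m    # (1/m) R' z

# C * z multiplies row h of C by z[h], so crossprod(C * z, C) = sum_h z_h c_h c_h'
S_B <- crossprod(C * z, C) / m - outer(c_bar, c_bar) +
  delta * (crossprod(R * z, R) / m - outer(r_bar, r_bar))
```

Because the formula divides by m rather than m - 1, S_B equals (m - 1)/m times the result obtained with cov() on the active rows only.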
Sample Mean
Description
Calculates the mean of X as a function of z.
Usage
int_mean_z(z, X)
Arguments
z |
A vector of 0s and 1s indicating which observations are included in the calculation. |
X |
A matrix whose rows correspond to observations and whose columns correspond to variables. |
Details
This function calculates the mean of \boldsymbol{X} as a function of \boldsymbol{z}. If \boldsymbol{z} is a vector of 0s and 1s, the mean is calculated over the m observations whose entry in \boldsymbol{z} equals 1:
\bar{\boldsymbol{x}}(\boldsymbol{z}) = \dfrac{1}{m} \boldsymbol{X}^\top \boldsymbol{z}.
Value
A vector whose elements are the means of the variables.
Examples
n <- 100
p <- 4
X <- matrix(rnorm(n * p), ncol = p)
# If all the observations are considered, the result is the same as colMeans()
z <- rep(1, n)
int_mean_z(z, X)
colMeans(X)
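The formula in the Details is most informative when only some observations are active. The base R sketch below evaluates (1/m) X' z for a proper subset and compares it with colMeans() on the selected rows. It does not call the package, and the indices of the active observations are arbitrary.

```r
set.seed(1)
X <- matrix(rnorm(40), ncol = 4)      # 10 observations, 4 variables

z <- rep(0, nrow(X))
z[c(2, 5, 9)] <- 1                    # mark three observations as active
m <- sum(z)

x_bar <- drop(crossprod(X, z)) / m    # (1/m) X' z

# Reference: column means of the active rows only
x_ref <- colMeans(X[z == 1, , drop = FALSE])
```

The two vectors x_bar and x_ref agree up to floating-point error, since multiplying by z simply zeroes out the inactive rows before summing.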
Outlier Detection for Interval-Valued Data Based on Robust Distances
Description
Identifies potential outliers in interval-valued data using robust distance-based methods with customizable cutoff criteria.
Usage
int_outliers(
robust_dist,
cutoff = c("farness", "adjbox", "chi-squared", "F-dist"),
cutoff_lvl = NULL,
p = NULL,
z = NULL
)
Arguments
robust_dist |
A numeric vector containing the robust distances for each observation. |
cutoff |
A character string specifying the method for setting the outlier cutoff threshold. Options include "farness", "adjbox", "chi-squared" and "F-dist".
Default is |
cutoff_lvl |
A numeric value specifying the level of the cutoff to be used.
If no value is provided, the function uses the default values associated with each cutoff method. |
p |
The number of variables in the data. Required for |
z |
A binary vector indicating the subset of observations used for initial robust estimation. Required for the |
Details
This function classifies observations as outliers based on robust distances and user-defined cutoff methods. It supports various approaches, including Chi-Squared quantiles, adjusted boxplots, F distribution quantiles, and farness probabilities.
Value
A list with the following components:
outliers_names |
Character vector of names for observations classified as outliers. |
is_outlier |
Logical vector indicating whether each observation is an outlier (TRUE) or not (FALSE). |
cutoff |
The cutoff method used for detecting outliers. |
cutoff_value |
Cutoff value used for detecting outliers. |
farness_probs |
Numeric vector of farness probabilities for each observation (only if |
References
Loureiro, C. P., Oliveira, M. R., Brito, P., & Oliveira, L. (2026). Minimum Covariance Determinant Estimator and Outlier Detection for Interval-valued Data. arXiv preprint arXiv:2604.26769. https://arxiv.org/abs/2604.26769
Case cutoff=="F-dist" is adapted from package CerioliOutlierDetection (https://cran.r-project.org/package=CerioliOutlierDetection).
Examples
# Example of detecting outliers using robust distances
set.seed(42)
robust_dist <- abs(rnorm(100))
result <- int_outliers(robust_dist, cutoff = "chi-squared", p = 5)
# Example using creditcard dataset
data(creditcard)
credit_card_int <- creditcard$intData
# Compute robust distances using IMCD estimates of mean and covariance
credit_card_dist <- IMah_dist(credit_card_int)
# Detect outliers using farness cutoff
credit_card_outliers <- int_outliers(credit_card_dist,
cutoff = "farness",
cutoff_lvl = 0.9)
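As a generic illustration of a quantile-based cutoff, the base R sketch below flags observations whose squared distance exceeds a chi-squared quantile. It is only a sketch of the general technique and makes several assumptions that are not confirmed by this page: that the distances are on the squared scale, that the degrees of freedom equal p, and that the level is 0.975. The function described here may use different conventions and defaults.

```r
set.seed(42)
p <- 5
# Stand-in for squared robust distances (simulated, not computed by the package)
d2 <- rchisq(100, df = p)
d2[1:3] <- c(40, 55, 70)                    # three planted extreme values
names(d2) <- paste0("obs", seq_along(d2))

level <- 0.975                              # assumed cutoff level
cutoff_value <- qchisq(level, df = p)       # assumed degrees of freedom: p

is_outlier    <- d2 > cutoff_value          # logical flag per observation
outlier_names <- names(d2)[is_outlier]      # names of the flagged observations
```

The objects is_outlier, cutoff_value and outlier_names mirror the kind of components listed in the Value section, but their names and construction here are illustrative.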
Compute Mean Latent Variables
Description
Obtain the mean of the latent variables inherent to the macrodata.
Usage
meanU(
LatentDist = c("Unif", "Triang", "TNorm", "InvTri", "Beta", "KDE", "Degenerated"),
TriangParam = 0,
BetaParam.a = 1,
BetaParam.b = 1,
Umicro = NULL,
p = NULL
)
Arguments
LatentDist |
A string or vector of strings specifying the distribution(s) of the latent variables. If the variables are identically distributed it can be one of ("Unif", "Triang", "TNorm", "InvTri", "Beta", "KDE", "Degenerated"); if not, it is a vector with the distribution for each variable. |
TriangParam |
Mode of the triangular distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 0. |
BetaParam.a |
Parameter alpha of the Beta distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 1. |
BetaParam.b |
Parameter beta of the Beta distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 1. |
Umicro |
Latent microdata observations. Needed if |
p |
Number of variables. |
Value
Either a diagonal matrix with the mean of each variable or a single value if the variables are identically distributed.
Compute Mean Square Latent Variables
Description
Obtain the mean of the square of the latent variables inherent to the macrodata.
Usage
meanU2(
LatentDist = c("Unif", "Triang", "TNorm", "InvTri", "Beta", "KDE", "Degenerated"),
TriangParam = 0,
BetaParam.a = 1,
BetaParam.b = 1,
Umicro = NULL,
p = NULL
)
Arguments
LatentDist |
A string or vector of strings specifying the distribution(s) of the latent variables. If the variables are identically distributed it can be one of ("Unif", "Triang", "TNorm", "InvTri", "Beta", "KDE", "Degenerated"); if not, it is a vector with the distribution for each variable. |
TriangParam |
Mode of the triangular distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 0. |
BetaParam.a |
Parameter alpha of the Beta distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 1. |
BetaParam.b |
Parameter beta of the Beta distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 1. |
Umicro |
Latent microdata observations. Needed if |
p |
Number of variables. |
Value
Either a diagonal matrix with the mean of the square of each variable or a single value if the variables are identically distributed.
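The first and second moments of a latent variable can be checked numerically with base R. The sketch below is not the package's implementation and rests on one assumption not stated on this page: that the latent variables are supported on [-1, 1]. The densities d_unif and d_triang and the helper moment are hypothetical names.

```r
# Densities on [-1, 1] (the support is an assumption of this sketch)
d_unif   <- function(u) rep(0.5, length(u))   # uniform
d_triang <- function(u) 1 - abs(u)            # symmetric triangular, mode 0

# k-th moment by numerical integration
moment <- function(dens, k) {
  integrate(function(u) u^k * dens(u), lower = -1, upper = 1)$value
}

EU_unif    <- moment(d_unif, 1)     # mean of U under the uniform
EU2_unif   <- moment(d_unif, 2)     # mean of U^2 under the uniform
EU2_triang <- moment(d_triang, 2)   # mean of U^2 under the triangular

delta_unif <- EU2_unif / 4          # the parameter delta = E(U^2)/4
```

Under this assumption the uniform case gives E(U) = 0 and E(U^2) = 1/3, hence delta = 1/12, while the symmetric triangular case gives E(U^2) = 1/6.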
Aggregate Microdata into Interval-Valued Data
Description
Aggregates microdata from a data frame into interval-valued data using various criteria and latent distribution settings.
Usage
micro2intData(
microdata,
agrby,
agrcrt = "minmax",
LatentParam = NULL,
LatentCase = c("U_id_symmetric", "U_id", "General"),
LatentDist = c("Unif", "Triang", "TNorm", "InvTri", "Beta", "KDE", "Degenerated"),
TriangParam = 0,
BetaParam.a = 1,
BetaParam.b = 1,
estimate.DistParam = FALSE
)
Arguments
microdata |
A data frame containing the microdata. All columns should be numeric. |
agrby |
A factor used to specify the grouping of the microdata for aggregation. |
agrcrt |
A string or numeric vector of length 2 specifying the aggregation criterion. The default is "minmax". |
LatentParam |
(Optional) A list with the parameters of the latent variables.
Expects a list with a single number if |
LatentCase |
A string specifying which of the three scenarios applies to the latent variables:
- "U_id_symmetric": The latent variables are identically distributed and symmetric.
- "U_id": The latent variables are identically distributed.
- "General": The latent variables do not have any nice properties.
Defaults to "U_id_symmetric". |
LatentDist |
A string or vector of strings specifying the distribution(s) of the latent variables. If the variables are identically distributed it can be one of ("Unif", "Triang", "TNorm", "InvTri", "Beta", "KDE", "Degenerated"); if not, it is a vector with the distribution for each variable. |
TriangParam |
Mode of the triangular distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 0. |
BetaParam.a |
Parameter alpha of the Beta distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 1. |
BetaParam.b |
Parameter beta of the Beta distribution. If the latent variables are identically distributed, a single number suffices; otherwise a vector is needed.
The default is 1. |
estimate.DistParam |
Logical parameter indicating if estimation of the parameters of the latent distributions should be performed. Can only be set to TRUE if |
Details
This function processes a data frame of microdata and aggregates it into interval-valued data according to the specified grouping factor and aggregation criteria. It can handle different latent distribution cases and parameter settings.
If some rows contain invalid (non-finite or missing) values, those rows are removed before aggregation. If all rows in the resulting interval-valued data are degenerate (i.e., the lower bound equals the upper bound), the function will return NULL.
Value
An intData object containing the aggregated interval-valued data, or NULL if all units lead to degenerate intervals.
References
Adapted from package MAINT.Data (https://cran.r-project.org/package=MAINT.Data).
Examples
data(creditcard)
CreditCard_microdata <- creditcard$microdata
# Define grouping variable for microdata aggregation
credit_agrby <- factor(paste(CreditCard_microdata$Name, CreditCard_microdata$Month, sep = "_"))
# Create intData object by aggregating microdata using the default minmax criterion
# and using KDE for estimation of the latent distribution in the general case
credit_agr <- micro2intData(CreditCard_microdata[,3:7],
agrby = credit_agrby,
LatentCase = "General")
Variable Names Method for intData
Description
Variable Names Method for intData
Usage
## S4 method for signature 'intData'
names(x)
Arguments
x |
An object of class |
Value
A character vector of variable names.
Number of Columns Method for intData
Description
Number of Columns Method for intData
Usage
## S4 method for signature 'intData'
ncol(x)
Arguments
x |
An object of class |
Value
The number of columns.
Number of Rows Method for intData
Description
Number of Rows Method for intData
Usage
## S4 method for signature 'intData'
nrow(x)
Arguments
x |
An object of class |
Value
The number of rows.
Choose the 10 best estimates after iterating twice through initial sets
Description
Choose the 10 best estimates after iterating twice through initial sets
Usage
pick10(z_all, m, data)
Arguments
z_all |
A 2D matrix where each row specifies a subset of observations |
m |
An integer specifying number of observations to use |
data |
An |
Value
A list of z, covariance, barycenter and robust distances
Plot Method for Two intData Objects
Description
For two intData objects, plots one against the other, with options to visualize the intervals as crosses or rectangles.
For a single intData object, plots its intervals in either a vertical or a horizontal layout.
Usage
## S4 method for signature 'intData,intData'
plot(
x,
y,
type = c("crosses", "rectangles", "crosses2"),
append = FALSE,
palette = rainbow(x@NObs),
...
)
## S4 method for signature 'intData,missing'
plot(
x,
casen = NULL,
layout = c("vertical", "horizontal"),
append = FALSE,
...
)
Arguments
x |
An |
y |
An |
type |
The type of plot to generate: "crosses" or "rectangles" or "crosses2". Default is "crosses". |
append |
Logical, if |
palette |
A vector with colors for each observation. |
... |
Additional graphical parameters. |
casen |
A vector specifying the case numbers to plot. Default is NULL. |
layout |
The layout of the plot: "vertical" or "horizontal". |
Value
For two intData objects, a plot showing the relationship between them.
For a single intData object, a plot showing its intervals.
Barplot of Shapley values for Interval-valued Data
Description
Barplot of Shapley values for Interval-valued Data
Usage
plot_bar_int_Shapley(
x,
cutoff_value = NULL,
cutoff_label = NULL,
palette = NULL,
abbrev.var = 20,
abbrev.obs = 20,
sort.obs = TRUE,
plot_IMah = TRUE,
IMah_label = expression(Robust ~ d[IMah]^2 * (bold(x))),
rotate_x = TRUE
)
Arguments
x |
A |
cutoff_value |
Numeric. The cutoff value used for detecting outliers. If |
cutoff_label |
Character. Label for the cutoff value line in the plot. |
palette |
A vector with colors for each variable. If |
abbrev.var |
Integer. If |
abbrev.obs |
Integer. If |
sort.obs |
Logical. If |
plot_IMah |
Logical. If |
IMah_label |
Character. Label for the Interval-Mahalanobis distance in the plot legend. Default is "Robust |
rotate_x |
Logical. If |
Value
Returns a barplot that displays the Shapley values (int_Shapley) for each observation and optionally (plot_IMah = TRUE)
includes the squared (robust) Interval-Mahalanobis distance (IMah_dist) (black bar) and the corresponding outlier detection cut-off value (dotted line).
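Showing the per-variable contributions next to the total distance makes sense because Shapley contributions add up to the squared distance. The sketch below illustrates this for ordinary point-valued data, using the classic Mahalanobis decomposition known from the ShapleyOutlier literature. It is an analogue for intuition only, not the interval-valued computation performed by int_Shapley; all object names are hypothetical.

```r
# Classic (point-valued) analogue, not the interval-valued estimator
set.seed(42)
X         <- matrix(rnorm(200), ncol = 4)   # 50 observations, 4 variables
mu        <- colMeans(X)
Sigma     <- cov(X)
Sigma_inv <- solve(Sigma)

x0  <- X[1, ]
dev <- x0 - mu

# One contribution per variable: elementwise product of the deviation
# with the precision-weighted deviation
phi <- dev * as.vector(Sigma_inv %*% dev)

# Efficiency property: the contributions sum to the squared distance
d2 <- mahalanobis(x0, center = mu, cov = Sigma)
all.equal(sum(phi), d2)
```

Individual contributions can be negative, so a stacked bar may extend below zero even though the total distance is non-negative.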
References
Adapted from package ShapleyOutlier (https://CRAN.R-project.org/package=ShapleyOutlier).
Examples
# Create intData object
data(creditcard)
credit_card_int <- creditcard$intData
# Estimate the mean and covariance matrix
credit_card_IMCD <- IMCD(credit_card_int,
m = floor(nrow(credit_card_int)*0.75),
cutoff = "farness",
cutoff_lvl = 0.9)
# Detect outliers using farness cutoff
credit_card_outliers <- int_outliers(credit_card_IMCD$robust_dist,
cutoff = "farness",
cutoff_lvl = 0.9)
# Compute Shapley values
credit_card_shapley <- int_Shapley(credit_card_int,
mean_c = credit_card_IMCD$mean_IMCD_c,
mean_r = credit_card_IMCD$mean_IMCD_r,
cov = credit_card_IMCD$cov_IMCD)
# Plot Shapley values with cutoff line and Interval-Mahalanobis distance
plot_bar_int_Shapley(credit_card_shapley,
cutoff_value = credit_card_outliers$cutoff_value,
cutoff_label = "Farness 0.9",
palette = rainbow(credit_card_int@NIVar))
Barplot of Shapley value decomposition into contributions of (Centers, Ranges, and CrossCentersRanges) for interval-valued data.
Description
Barplot of Shapley value decomposition into contributions of (Centers, Ranges, and CrossCentersRanges) for interval-valued data.
Usage
plot_bar_int_Shapley_decomp(
shapley_decomp,
palette = NULL,
rotate_x = TRUE,
abbrev.obs = 20,
sort.obs = TRUE,
plot_IMah = FALSE
)
Arguments
shapley_decomp |
A list of matrices containing the Shapley value decomposition into contributions of (Centers, Ranges, and CrossCentersRanges) for each observation. |
palette |
A vector with colors for each feature. If |
rotate_x |
Logical. If |
abbrev.obs |
Integer. If |
sort.obs |
Logical. If |
plot_IMah |
Logical. If |
Value
Returns a barplot that displays the Shapley value decomposition into contributions of (Centers, Ranges, and CrossCentersRanges) for each observation.
Examples
# Create intData object
data(creditcard)
credit_card_int <- creditcard$intData
# Compute Shapley decomposition into contributions of Centers, Ranges, and CrossCentersRanges
# based on IMCD estimates of mean and covariance matrix
credit_card_shap_decomp <- int_Shapley_decomp(credit_card_int)
# Plot Shapley decomposition with contributions of Centers, Ranges, and CrossCentersRanges
plot_bar_int_Shapley_decomp(credit_card_shap_decomp, palette = rainbow(credit_card_int@NIVar))
Beeswarm plot of Shapley values for interval-valued data.
Description
Beeswarm plot of Shapley values for interval-valued data.
Usage
plot_beeswarm_int_Shapley(
shapley,
color_class,
color_label = NULL,
palette = NULL,
rotate_x = TRUE,
shape_class = NULL,
shape_label = NULL,
ggplotly = FALSE,
label_obs = NULL
)
Arguments
shapley |
A |
color_class |
A vector indicating the color class of each observation. If NULL (default), all points have the same color. |
color_label |
Character. Label for the color class. If NULL (default), no legend for the color class is shown. |
palette |
A vector with colors for each color class. Default is NULL. |
rotate_x |
Logical. If |
shape_class |
A vector indicating the shape class of each observation. If NULL (default), all points have the same shape. |
shape_label |
Character. Label for the shape class. If NULL (default), no legend for the shape class is shown. |
ggplotly |
Logical. If |
label_obs |
A vector with the names of the observations to be labeled in the plot when |
Value
Returns a beeswarm plot that displays the Shapley values (int_Shapley) for each observation and feature.
Examples
# Create intData object
data(creditcard)
credit_card_int <- creditcard$intData
# Estimate the mean and covariance matrix
credit_card_IMCD <- IMCD(credit_card_int,
m = floor(nrow(credit_card_int)*0.75),
cutoff = "farness",
cutoff_lvl = 0.9)
# Detect outliers using farness cutoff
credit_card_outliers <- int_outliers(credit_card_IMCD$robust_dist,
cutoff = "farness",
cutoff_lvl = 0.9)
# Compute Shapley values
credit_card_shapley <- int_Shapley(credit_card_int,
mean_c = credit_card_IMCD$mean_IMCD_c,
mean_r = credit_card_IMCD$mean_IMCD_r,
cov = credit_card_IMCD$cov_IMCD)
# Beeswarm plot of Shapley values colored by outlier status
plot_beeswarm_int_Shapley(credit_card_shapley,
color_class = credit_card_outliers$is_outlier,
palette = c("gray50", "darkred"),
color_label = "Outlier Status")
Distance-Distance plot for interval-valued data.
Description
Distance-Distance plot for interval-valued data.
Usage
plot_dist_dist(
class_dist,
class_cutoff = NULL,
class_cutoff_label = NULL,
rob_dist,
rob_cutoff = NULL,
rob_cutoff_label = NULL,
obs_names = NULL,
ggplotly = FALSE,
color_class = NULL,
color_label = NULL,
palette = NULL,
shape_class = NULL,
shape_label = NULL,
label_obs = NULL
)
Arguments
class_dist |
A numeric vector containing the classical distances for each observation. |
class_cutoff |
Numeric. The cutoff value for the classical distances. |
class_cutoff_label |
Character. Label for the classical cutoff. If NULL (default), no legend for the classical cutoff is shown. |
rob_dist |
A numeric vector containing the robust distances for each observation. |
rob_cutoff |
Numeric. The cutoff value for the robust distances. |
rob_cutoff_label |
Character. Label for the robust cutoff. If NULL (default), no legend for the robust cutoff is shown. |
obs_names |
A character vector containing the names of the observations. If NULL (default), the names are taken from the names of class_dist. |
ggplotly |
Logical. If |
color_class |
A vector indicating the color class of each observation. If NULL (default), all points have the same color. |
color_label |
Character. Label for the color class. If NULL (default), no legend for the color class is shown. |
palette |
A vector with colors for each color class. If NULL (default), default ggplot2 colors are used. |
shape_class |
A vector indicating the shape class of each observation. If NULL (default), all points have the same shape. |
shape_label |
Character. Label for the shape class. If NULL (default), no legend for the shape class is shown. |
label_obs |
A vector with the names of the observations to be labeled in the plot when |
Value
Returns a Distance-Distance plot that displays the classical distances against the robust distances for each observation, highlighting outliers.
Examples
# Create intData object
data(creditcard)
credit_card_int <- creditcard$intData
# Compute robust distances using IMCD estimates of mean and covariance
credit_card_dist <- IMah_dist(credit_card_int)
# Detect outliers using farness cutoff
credit_card_outliers <- int_outliers(credit_card_dist,
cutoff = "farness",
cutoff_lvl = 0.9)
# Compute classical distances and outliers
class_dist <- IMah_dist(credit_card_int, z = rep(1,credit_card_int@NObs))
class_outliers <- int_outliers(class_dist,
cutoff = "chi-squared",
p = credit_card_int@NIVar)
# Create a vector indicating if the observations are outliers or inliers
# based on the robust distance outlier detection
credit_card_is_outliers <- as.character(credit_card_outliers$is_outlier)
credit_card_is_outliers[credit_card_outliers$is_outlier] <- "Outlier"
credit_card_is_outliers[!credit_card_outliers$is_outlier] <- "Inlier"
# Plot Distance-Distance plot
plot_dist_dist(class_dist,
class_cutoff = class_outliers$cutoff_value,
class_cutoff_label = "0.975 chi-squared",
rob_dist = credit_card_dist,
rob_cutoff = credit_card_outliers$cutoff_value,
rob_cutoff_label = "0.9 farness",
color_class = credit_card_is_outliers,
palette = c("grey50", "red"))
Plot Shapley interaction indices
Description
Plot Shapley interaction indices
Usage
plot_int_Shapley_inter(
x,
abbrev = 10,
title = NULL,
legend = TRUE,
text_size = 22
)
Arguments
x |
A |
abbrev |
Integer. If |
title |
Character. Title of the plot. |
legend |
Logical. If TRUE (default), a legend is plotted. |
text_size |
Integer. Size of the text in the plot |
Value
Returns a figure consisting of two panels. The right panel shows the Shapley values, and the left panel the Shapley interaction indices.
References
Adapted from package ShapleyOutlier (https://CRAN.R-project.org/package=ShapleyOutlier).
Examples
# Create intData object
data(creditcard)
credit_card_int <- creditcard$intData
# Estimate the mean and covariance matrix
credit_card_IMCD <- IMCD(credit_card_int,
m = floor(nrow(credit_card_int)*0.75),
cutoff = "farness",
cutoff_lvl = 0.9)
# Compute Shapley interaction indices
credit_card_shap_inter <- int_Shapley_interaction(credit_card_int,
mean_c = credit_card_IMCD$mean_IMCD_c,
mean_r = credit_card_IMCD$mean_IMCD_r,
cov = credit_card_IMCD$cov_IMCD)
# Plot Shapley interaction for 1st observation
plot_int_Shapley_inter(credit_card_shap_inter[[1]])
Interval-Mahalanobis distance plot for interval-valued data.
Description
Interval-Mahalanobis distance plot for interval-valued data.
Usage
plot_interval_dist(
dist,
cutoff = NULL,
cutoff_label = NULL,
obs_names = NULL,
sort.obs = TRUE,
color_class = NULL,
color_label = NULL,
palette = NULL,
shape_class = NULL,
shape_label = NULL,
label_obs = NULL
)
Arguments
dist |
A numeric vector containing the Interval-Mahalanobis distances for each observation. |
cutoff |
A numeric vector containing cutoff values to be displayed as horizontal lines. |
cutoff_label |
A character vector containing labels for each cutoff. If NULL (default), default labels are generated. |
obs_names |
A character vector containing the names of the observations. If NULL (default), the names are taken from the names of dist. |
sort.obs |
Logical. If |
color_class |
A vector indicating the color class of each observation. If NULL (default), all points have the same color. |
color_label |
Character. Label for the color class. If NULL (default), no legend for the color class is shown. |
palette |
A vector with colors for each color class. If NULL (default), default ggplot2 colors are used. |
shape_class |
A vector indicating the shape class of each observation. If NULL (default), all points have the same shape. |
shape_label |
Character. Label for the shape class. If NULL (default), no legend for the shape class is shown. |
label_obs |
A vector with the names of the observations to be labeled in the plot. If NULL (default), no labels are shown and x-axis labels are displayed. |
Value
Returns a plot that displays the Interval-Mahalanobis distances for each observation, highlighting outliers based on specified cutoffs.
Examples
# Create intData object
data(creditcard)
credit_card_int <- creditcard$intData
# Compute robust distances using IMCD estimates of mean and covariance
credit_card_dist <- IMah_dist(credit_card_int)
# Detect outliers using farness cutoff
credit_card_outliers <- int_outliers(credit_card_dist,
cutoff = "farness",
cutoff_lvl = 0.9)
# Create a vector indicating if the observations are outliers or inliers
# based on the robust distance outlier detection
credit_card_is_outliers <- as.character(credit_card_outliers$is_outlier)
credit_card_is_outliers[credit_card_outliers$is_outlier] <- "Outlier"
credit_card_is_outliers[!credit_card_outliers$is_outlier] <- "Inlier"
# Plot Interval-Mahalanobis distance plot
plot_interval_dist(credit_card_dist,
cutoff = credit_card_outliers$cutoff_value,
cutoff_label = c("0.9 farness"),
obs_names = rownames(credit_card_int),
sort.obs = FALSE,
color_class = credit_card_is_outliers,
palette = c("grey50", "red"))
Pairs-plot for Interval-valued Symbolic data.
Description
Adapted from pairs.panels (R package "psych"). Shows a scatter plot matrix, with bivariate symbolic scatter plots below the diagonal, the variables' names on the diagonal, and the symbolic correlations above the diagonal. Useful for descriptive statistics of symbolic objects described by interval variables.
Usage
plot_pairs_int(
data,
type = c("rectangles", "crosses", "crosses2"),
cex.cor = 2,
corr = NULL,
palette = rainbow(nrow(data)),
fill_col = "gray50",
is_outlier = NULL,
...
)
Arguments
data |
An |
type |
The type of plot to generate: "rectangles" or "crosses" or "crosses2". Default is "rectangles". |
cex.cor |
Character expansion factor |
corr |
A matrix with the symbolic correlations; if not provided the upper panel is omitted |
palette |
A vector with colors for each observation. |
fill_col |
If |
is_outlier |
A vector with logical values indicating if the observation is an outlier or not. It makes the line width of the outlying observations thicker. Default is NULL. |
... |
Additional graphical parameters. |
Value
A scatter plot matrix is drawn in the graphic window. The panels below the diagonal show scatter plots, the diagonal shows the variables' names, and the panels above the diagonal report the symbolic correlations.
Examples
data(creditcard)
credit_card_int <- creditcard$intData
# Compute covariance and correlation matrices
credit_card_cov <- int_cov(credit_card_int)
credit_card_cor <- cov2cor(credit_card_cov)
plot_pairs_int(credit_card_int,
corr = credit_card_cor,
labels = colnames(credit_card_int))
# Alternatively, highlight outliers in the scatter plot and use the robust correlation matrix
# Obtain reweighted IMCD estimates using farness cutoff
credit_card_IMCD <- IMCD(credit_card_int,
m = floor(nrow(credit_card_int)*0.75),
cutoff = "farness",
cutoff_lvl = 0.9)
# Detect outliers using farness cutoff
credit_card_outliers <- int_outliers(credit_card_IMCD$robust_dist,
cutoff = "farness",
cutoff_lvl = 0.9)
outliers_colors <- rep('gray50', credit_card_int@NObs)
names(outliers_colors) <- rownames(credit_card_int)
outliers_colors[credit_card_outliers$outliers_names] <- 'red'
plot_pairs_int(credit_card_int,
corr = cov2cor(credit_card_IMCD$cov_IMCD),
palette = outliers_colors,
labels = colnames(credit_card_int),
type = "rectangles",
is_outlier = credit_card_outliers$is_outlier)
Radar plot of Shapley values for interval-valued data.
Description
Radar plot of Shapley values for interval-valued data.
Usage
plot_radar_int_Shapley(shapley, palette = NULL, sort.obs = FALSE)
Arguments
shapley |
A |
palette |
A vector of colors, one for each observation. Default is black. |
sort.obs |
Logical. If |
Value
Returns a radar plot that displays the Shapley values (int_Shapley) for each observation.
Examples
# Create intData object
data(creditcard)
credit_card_int <- creditcard$intData
# Estimate the mean and covariance matrix
credit_card_IMCD <- IMCD(credit_card_int,
m = floor(nrow(credit_card_int)*0.75),
cutoff = "farness",
cutoff_lvl = 0.9)
# Detect outliers using farness cutoff
credit_card_outliers <- int_outliers(credit_card_IMCD$robust_dist,
cutoff = "farness",
cutoff_lvl = 0.9)
# Compute Shapley values
credit_card_shapley <- int_Shapley(credit_card_int,
mean_c = credit_card_IMCD$mean_IMCD_c,
mean_r = credit_card_IMCD$mean_IMCD_r,
cov = credit_card_IMCD$cov_IMCD)
# Color outliers differently from the remaining observations
outliers_colors <- rep('black', credit_card_int@NObs)
names(outliers_colors) <- rownames(credit_card_int)
outliers_colors[credit_card_outliers$outliers_names] <- '#009de0'
plot_radar_int_Shapley(credit_card_shapley, palette = outliers_colors)
Scatter Plot for Interval-valued Data
Description
Create a scatter plot for interval-valued symbolic data, visualizing the symbolic data as rectangles or crosses, with the first two variables on the x and y axes. The function allows customization of colors, fill colors, and outlier representation.
Usage
plot_scatter_int(
data,
type = c("rectangles", "crosses", "crosses2"),
palette = rainbow(nrow(data)),
fill_col = "gray50",
is_outlier = NULL,
...
)
Arguments
data |
An |
type |
The type of plot to generate: "rectangles", "crosses" or "crosses2". Default is "rectangles". |
palette |
A vector with colors for each observation. Default is rainbow(nrow(data)). |
fill_col |
If |
is_outlier |
A vector with logical values indicating if the observation is an outlier or not. It makes the line width of the outlying observations thicker. Default is NULL. |
... |
Additional graphical parameters. |
Value
A scatter plot is drawn in the graphic window. The scatter plot shows the symbolic data as rectangles or crosses, with the first two variables on the x and y axes.
Examples
data(creditcard)
credit_card_int <- creditcard$intData
plot_scatter_int(credit_card_int[, c(3, 5)])
# Alternatively, highlight outliers in the scatter plot
# Compute robust distances using IMCD estimates of mean and covariance
credit_card_dist <- IMah_dist(credit_card_int)
# Detect outliers using farness cutoff
credit_card_outliers <- int_outliers(credit_card_dist, "farness", 0.9)
outliers_colors <- rep('gray50', credit_card_int@NObs)
names(outliers_colors) <- rownames(credit_card_int)
outliers_colors[credit_card_outliers$outliers_names] <- 'red'
plot_scatter_int(credit_card_int[, c(3, 5)],
palette = outliers_colors,
is_outlier = credit_card_outliers$is_outlier)
Tileplot of Shapley values for interval-valued data.
Description
Tileplot of Shapley values for interval-valued data.
Usage
plot_tile_int_Shapley(
shapley,
outliers = NULL,
rotate_x = TRUE,
abbrev.var = FALSE,
abbrev.obs = FALSE,
sort.var = FALSE,
sort.obs = FALSE,
show_values = FALSE
)
Arguments
shapley |
A |
outliers |
A list containing the outliers' names as returned by |
rotate_x |
Logical. If |
abbrev.var |
Integer. If |
abbrev.obs |
Integer. If |
sort.var |
Logical. If |
sort.obs |
Logical. If |
show_values |
Logical. If |
Value
Returns a tileplot that displays the Shapley values (int_Shapley) for each observation and variable. Optionally, only the outliers are highlighted in the plot.
References
Adapted from package ShapleyOutlier (https://CRAN.R-project.org/package=ShapleyOutlier).
Examples
# Create intData object
data(creditcard)
credit_card_int <- creditcard$intData
# Estimate the mean and covariance matrix
credit_card_IMCD <- IMCD(credit_card_int,
m = floor(nrow(credit_card_int)*0.75),
cutoff = "farness",
cutoff_lvl = 0.9)
# Detect outliers using farness cutoff
credit_card_outliers <- int_outliers(credit_card_IMCD$robust_dist,
cutoff = "farness",
cutoff_lvl = 0.9)
# Compute Shapley values
credit_card_shapley <- int_Shapley(credit_card_int,
mean_c = credit_card_IMCD$mean_IMCD_c,
mean_r = credit_card_IMCD$mean_IMCD_r,
cov = credit_card_IMCD$cov_IMCD)
plot_tile_int_Shapley(credit_card_shapley,
outliers = credit_card_outliers,
sort.var = TRUE,
sort.obs = TRUE)
Print Method for Summary intData
Description
Print Method for Summary intData
Usage
## S4 method for signature 'summaryintData'
print(x, ...)
Arguments
x |
An object of class |
... |
Additional arguments passed to print. |
Value
The object itself, returned invisibly. Called for its side effects (printing).
Row Bind for intData
Description
Combine multiple intData objects by rows.
Usage
rbind(..., deparse.level = 1)
## S4 method for signature 'intData'
rbind(..., deparse.level = 1)
Arguments
... |
|
deparse.level |
An integer controlling the construction of labels in the result (default is 1). |
Value
An intData object with rows combined from the input intData objects.
Row.Names Method for intData
Description
Row.Names Method for intData
Usage
## S4 method for signature 'intData'
row.names(x)
Arguments
x |
An object of class |
Value
A character vector of row names.
Row Names Method for intData
Description
Row Names Method for intData
Usage
## S4 method for signature 'intData'
rownames(x)
Arguments
x |
An object of class |
Value
A character vector of row names.
Safely invert a covariance matrix with Moore-Penrose generalized inverse fallback
Description
Computes a numerically stable inverse of a covariance matrix. The function:
Attempts standard inversion via solve().
If the matrix is ill-conditioned, falls back to a Moore-Penrose generalized inverse.
Usage
safe_solve_cov(cov, verbose = TRUE)
Arguments
cov |
A numeric covariance matrix. |
verbose |
Logical; if |
Details
When the covariance matrix is singular or nearly singular, direct inversion
may fail or produce unstable results. In that case, the function falls back
to the Moore-Penrose generalized inverse (via MASS::ginv()).
The pseudo-inverse effectively ignores directions with negligible variance, which may slightly affect interpretations (e.g., Mahalanobis distances or Shapley values).
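The fallback logic can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's code: the helper name `safe_inverse_sketch`, the use of rcond() as the conditioning check, and the tolerance `tol` are all hypothetical choices.

```r
# Hypothetical helper, not the package's implementation
safe_inverse_sketch <- function(S, tol = 1e-10) {
  # A reciprocal condition number near zero signals an ill-conditioned matrix
  if (rcond(S) > tol) {
    inv <- tryCatch(solve(S), error = function(e) NULL)
    if (!is.null(inv)) return(inv)
  }
  MASS::ginv(S)  # Moore-Penrose generalized inverse
}

set.seed(1)
# 4 observations on 5 variables: the sample covariance is rank-deficient
S_sing <- cov(matrix(rnorm(20), ncol = 5))
G <- safe_inverse_sketch(S_sing)

# Defining property of a generalized inverse: S G S = S
all.equal(S_sing %*% G %*% S_sing, S_sing)
```

Checking the identity S G S = S is a quick way to confirm that the result behaves as a generalized inverse even when no ordinary inverse exists.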
Value
A matrix representing:
The inverse of cov, if it is well-conditioned.
A Moore-Penrose generalized inverse, if inversion fails.
Examples
set.seed(1)
# Example where inversion fails
X <- matrix(rnorm(20), ncol = 5)
cov_X <- cov(X)
#solve(cov_X) # Standard inversion fails
safe_solve_cov(cov_X) # Returns a generalized inverse
# Example where inversion does not fail
Y <- cbind(rnorm(20), rnorm(20, mean=1, sd=2))
cov_Y <- cov(Y)
solve(cov_Y) # Standard inversion succeeds
safe_solve_cov(cov_Y) # Returns same result
Show Method for intData
Description
Show Method for intData
Show Method for Summary intData
Usage
## S4 method for signature 'intData'
show(object)
## S4 method for signature 'summaryintData'
show(object)
Arguments
object |
An object of class |
Value
The object itself, returned invisibly. Called for its side effects (printing).
Obtain unweighted estimates for data with <= 600 observations
Description
Obtain unweighted estimates for data with <= 600 observations
Usage
smallIMCD(m, data)
Arguments
m |
An integer specifying the number of observations to use |
data |
An |
Value
A list of estimated barycenter and symbolic covariance matrix
Spotify Tracks Dataset
Description
This dataset contains interval data of Spotify tracks' audio features, including min-max values and trimmed intervals, as well as the microdata. It is composed of 11 audio features: duration, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and popularity. The aggregation of the microdata was done by track genre.
Usage
data(spotify_tracks)
Format
A list with the following components:
microdata
A data frame with 81033 rows and 20 columns. It contains the microdata, with individual measurements of each variable for all observations.
microdata_transformed
A data frame with 81033 rows and 20 columns. It contains the transformed microdata, with individual measurements of each variable for all observations. Logarithmic transformations were applied to "loudness" and "tempo". "duration_ms" in milliseconds was converted to "duration" in minutes. "popularity" was scaled to the range [0,1].
intData_minmax
An intData object with 111 interval-valued observations and 11 variables, constructed using min-max aggregation based on the transformed microdata.
intData_trimmed
An intData object with 111 interval-valued observations and 11 variables, constructed using trimmed aggregation (1% trimming) based on the transformed microdata.
References
This data was retrieved from Kaggle (DOI:10.34740/KAGGLE/DSV/4372070; Spotify Tracks Dataset by Maharshi Pandya).
Examples
data(spotify_tracks)
head(spotify_tracks$intData_minmax)
head(spotify_tracks$intData_trimmed)
head(spotify_tracks$microdata)
head(spotify_tracks$microdata_transformed)
Iterate through C-step
Description
Iterate through C-step
Usage
step_it(z, m, data, it = 0)
Arguments
z |
A vector of 0 and 1, indicating which observations should be considered for the calculation |
m |
An integer specifying number of observations to use |
data |
An |
it |
An optional integer specifying the number of C-steps to perform.
With |
Value
A list of z, covariance, barycenter and robust distances
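For intuition, a concentration step (C-step) in the classic MCD setting for ordinary numeric data can be sketched as below. This is only an analogue: step_it() operates on interval-valued data with the package's own barycenter, covariance and distance, whereas the hypothetical helper `c_step_sketch` uses the sample mean, sample covariance and the usual Mahalanobis distance.

```r
# Hypothetical helper for ordinary numeric data, not the package's step_it()
c_step_sketch <- function(z, m, X) {
  idx <- which(z == 1)
  mu  <- colMeans(X[idx, , drop = FALSE])
  S   <- cov(X[idx, , drop = FALSE])
  d2  <- mahalanobis(X, center = mu, cov = S)
  z_new <- integer(nrow(X))
  z_new[order(d2)[seq_len(m)]] <- 1L   # keep the m closest observations
  list(z = z_new, center = mu, cov = S, dist = d2)
}

set.seed(7)
X <- rbind(matrix(rnorm(80), ncol = 2),            # 40 regular points
           matrix(rnorm(20, mean = 8), ncol = 2))  # 10 shifted points
m <- 38
z <- integer(nrow(X))
z[sample(nrow(X), m)] <- 1L                        # random initial subset

det_before <- det(cov(X[z == 1, ]))
for (i in 1:10) z <- c_step_sketch(z, m, X)$z      # iterate the C-step
det_after <- det(cov(X[z == 1, ]))

# In the classic setting the determinant does not increase across C-steps
det_after <= det_before
```

Each step re-estimates location and scatter from the current subset and then keeps the m observations closest to that estimate, which is why the indicator vector z is both an input and an output.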
Summary Method for intData
Description
Summary Method for intData
Usage
## S4 method for signature 'intData'
summary(object)
Arguments
object |
An object of class |
Value
An object of class summaryintData.
Summary Interval Data Class
Description
A class to represent the summary of interval data.
Slots
Centersumar
A table summarizing the centers.
Rngsumar
A table summarizing the ranges.
Tail Method for intData
Description
Returns the last n rows of an intData object.
Usage
## S4 method for signature 'intData'
tail(x, n = min(nrow(x), 6L))
Arguments
x |
An |
n |
The number of rows to return. |
Value
A subset of the intData object.