---
title: "Sample Selection SFA Metafrontier (groupType = \"sfaselectioncross\")"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Sample Selection SFA Metafrontier (groupType = "sfaselectioncross")}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  eval      = TRUE
)
```

## Overview

**Sample selection bias** arises when the observed sample is not a random draw from the
population. For example:

- Only firms above a revenue threshold are surveyed.
- Only farms that adopted a technology are observed using it.
- Participation in a programme is voluntary, so only volunteers are observed.

If selection into the sample is correlated with firm efficiency, ignoring it leads to
biased frontier estimates. `sfaselectioncross` implements the **two-step ML estimator of
Greene (2010)** through the sample selection model in `sfaR` (Dakpo et al. 2022). The
estimator corrects the bias with a probit selection equation, in the spirit of the
Heckman (1979) correction.

The selection model requires:

- A **binary selection indicator** `d` (1 = selected/observed, 0 = not selected).
- A **selection equation formula** `selectionF` specifying which variables drive selection.
  At least one variable must appear in `selectionF` but not in the main frontier formula.

Only selected observations (`d == 1`) participate in the frontier and receive efficiency
estimates. Efficiency for non-selected observations is `NA`.
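
To see why selection matters, the following minimal sketch (base R only, with hypothetical
names such as `e_sel` and `e_out` that are not part of `smfa`) draws two correlated error
terms and keeps only the "selected" observations. It assumes a correlation of 0.7 between
the selection error and the outcome error, chosen purely for illustration.

```{r selection-bias-sketch}
set.seed(2024)
n_sim <- 10000
z_sim <- rnorm(n_sim)                                   # observed driver of selection
e_sel <- rnorm(n_sim)                                   # selection-equation error
e_out <- 0.7 * e_sel + sqrt(1 - 0.7^2) * rnorm(n_sim)   # outcome error, correlated with e_sel
sel   <- (z_sim + e_sel) > 0                            # TRUE = observation is selected

mean(e_out)        # close to zero in the full population
mean(e_out[sel])   # clearly positive among selected observations
```

In the selected subsample the outcome error no longer has mean zero, which is the source
of the bias that the selection correction targets.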

## Data Preparation (Simulated Example)

We simulate data following the approach in the `sfaR` documentation:

```{r data}
library(smfa)

N <- 500; set.seed(12345)
z1 <- rnorm(N); z2 <- rnorm(N)
v1 <- rnorm(N); v2 <- rnorm(N)
g  <- rnorm(N)
e1 <- v1
e2 <- 0.7071 * (v1 + v2)
ds <- z1 + z2 + e1
d  <- ifelse(ds > 0, 1, 0)        # 1 = selected into the sample
group <- ifelse(g > 0, 1, 0)      # two technology groups
u  <- abs(rnorm(N))
x1 <- abs(rnorm(N)); x2 <- abs(rnorm(N))
y  <- abs(x1 + x2 + e2 - u)
dat <- data.frame(y = y, x1 = x1, x2 = x2,
                  z1 = z1, z2 = z2, d = d, group = group)

# Roughly half of the observations are selected (d == 1)
table(dat$d)
```
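
The construction `e1 <- v1` and `e2 <- 0.7071 * (v1 + v2)` builds in a correlation of
about 1/sqrt(2) ≈ 0.707 between the selection error and the frontier noise. The sketch
below checks this numerically on a separate, larger draw; the `_chk` names are
illustrative and do not touch `dat`.

```{r rho-design-check}
set.seed(1)
v1_chk <- rnorm(1e5)
v2_chk <- rnorm(1e5)
e1_chk <- v1_chk                        # selection-equation error
e2_chk <- 0.7071 * (v1_chk + v2_chk)    # frontier noise, unit variance by construction

cor(e1_chk, e2_chk)   # close to 1 / sqrt(2)
var(e2_chk)           # close to 1
```

This is the correlation that a selection-corrected frontier is expected to pick up in
the simulated data.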

## Method 1: sfaselectioncross + LP Metafrontier

```{r lp}
meta_sel_lp <- smfa(
  formula    = log(y) ~ log(x1) + log(x2),
  selectionF = d ~ z1 + z2,      # selection equation: d is the binary indicator
  data       = dat,
  group      = "group",
  S          = 1L,
  udist      = "hnormal",
  groupType  = "sfaselectioncross",
  modelType  = "greene10",        # Greene (2010) two-step ML correction
  lType      = "kronrod",         # integration method for the selection likelihood
  Nsub       = 20,               # number of sub-intervals for numerical integration
  uBound     = Inf,
  method     = "bfgs",
  itermax    = 2000,
  metaMethod = "lp"
)
summary(meta_sel_lp)
```

> **Note:** The `selectionF` argument is compulsory for `groupType = "sfaselectioncross"`.
> The left-hand side must be the binary selection variable (`d`). At least one regressor in
> the selection equation should *not* appear in the main frontier formula (exclusion
> restriction for identification).
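
It can help to look at the selection equation on its own before fitting the full model.
The sketch below fits a plain probit with base R's `glm()` on a small simulated data
frame (`toy`, a hypothetical stand-in with the same variable names as `dat`). The inverse
Mills ratio computed at the end is the classical Heckman (1979) correction term, shown
only for intuition; it is not claimed to be what `smfa` or `sfaR` compute internally.

```{r probit-sketch}
set.seed(42)
n_toy <- 1000
toy   <- data.frame(z1 = rnorm(n_toy), z2 = rnorm(n_toy))
toy$d <- as.integer(toy$z1 + toy$z2 + rnorm(n_toy) > 0)   # 1 = selected

# First-stage probit: which variables drive selection?
probit_fit <- glm(d ~ z1 + z2, family = binomial(link = "probit"), data = toy)
coef(summary(probit_fit))

# Linear index and inverse Mills ratio for the selected observations
xb  <- predict(probit_fit, type = "link")
imr <- dnorm(xb) / pnorm(xb)
summary(imr[toy$d == 1])
```

Coefficients on `z1` and `z2` that are clearly different from zero suggest that the
selection equation has explanatory power.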

## Method 2: sfaselectioncross + QP Metafrontier

```{r qp}
meta_sel_qp <- smfa(
  formula    = log(y) ~ log(x1) + log(x2),
  selectionF = d ~ z1 + z2,
  data       = dat,
  group      = "group",
  S          = 1L,
  udist      = "hnormal",
  groupType  = "sfaselectioncross",
  modelType  = "greene10",
  lType      = "kronrod",
  Nsub       = 20,
  uBound     = Inf,
  method     = "bfgs",
  itermax    = 2000,
  metaMethod = "qp"
)
summary(meta_sel_qp)
```

## Method 3: sfaselectioncross + SFA (Huang)

```{r huang}
meta_sel_huang <- smfa(
  formula     = log(y) ~ log(x1) + log(x2),
  selectionF  = d ~ z1 + z2,
  data        = dat,
  group       = "group",
  S           = 1L,
  udist       = "hnormal",
  groupType   = "sfaselectioncross",
  modelType   = "greene10",
  lType       = "kronrod",
  Nsub        = 100,
  uBound      = Inf,
  method      = "bfgs",
  itermax     = 2000,
  metaMethod  = "sfa",
  sfaApproach = "huang"
)
summary(meta_sel_huang)
```

## Method 4: sfaselectioncross + SFA (O'Donnell)

```{r odonnell}
meta_sel_odonnell <- smfa(
  formula     = log(y) ~ log(x1) + log(x2),
  selectionF  = d ~ z1 + z2,
  data        = dat,
  group       = "group",
  S           = 1L,
  udist       = "hnormal",
  groupType   = "sfaselectioncross",
  modelType   = "greene10",
  lType       = "kronrod",
  Nsub        = 100,
  uBound      = Inf,
  method      = "bfgs",
  itermax     = 2000,
  metaMethod  = "sfa",
  sfaApproach = "ordonnell"
)
summary(meta_sel_odonnell)
```

## Interpreting the Selection Correction

The first-stage probit model estimates the selection probability. The key additional
parameter in the frontier model is `rho`, the correlation between the error of the
selection equation and the noise term of the frontier equation.

```{r rho}
# The rho parameter appears in the summary output:
# ----------------------------------------------------------------
#              Selection bias parameter
# ----------------------------------------------------------------
#           Coefficient Std. Error z value  Pr(>|z|)
# rho          0.89550    0.28696  3.1207  0.001804 **

# A significant rho indicates selection bias IS present and the
# correction is important.
```

| `rho` value | Interpretation |
|-------------|----------------|
| ≈ 0, p > 0.05 | No significant selection bias; standard SFA may be sufficient |
| > 0, p < 0.05 | Positive selection: unobservables that raise the probability of selection also raise observed output |
| < 0, p < 0.05 | Negative selection: unobservables that raise the probability of selection lower observed output |
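
The z value and p-value in the example output above follow from a standard Wald test of
`rho = 0`. As a quick check, the sketch below recomputes them from the reported
coefficient and standard error; the variable names are illustrative.

```{r rho-wald-check}
rho_hat <- 0.89550   # coefficient from the example output above
rho_se  <- 0.28696   # its standard error

z_stat  <- rho_hat / rho_se            # Wald z statistic
p_value <- 2 * pnorm(-abs(z_stat))     # two-sided p-value under the normal approximation

round(c(z = z_stat, p = p_value), 4)
```

Small differences in the last digit can arise because the inputs are themselves rounded.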

## Extracting Efficiencies

Only selected observations (those with `d == 1`) receive efficiency estimates:

```{r eff}
eff_sel <- efficiencies(meta_sel_lp)

# Non-selected observations have NA efficiencies
sum(is.na(eff_sel$TE_group_BC))   # count of non-selected obs

# Subset for selected observations in group 1
sel_grp1 <- eff_sel[eff_sel$group == 1 & !is.na(eff_sel$TE_group_BC), ]
summary(sel_grp1[, c("TE_group_BC", "TE_meta_BC", "MTR_BC")])
```
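
A common follow-up is to average the efficiency measures by group. The sketch below uses
a small hand-made data frame (`toy_eff`, hypothetical values) that mimics the column
names shown above. It assumes that metafrontier efficiency equals group efficiency times
the metatechnology ratio, which is the standard decomposition but is not checked against
the package output here.

```{r eff-by-group-sketch}
toy_eff <- data.frame(
  group       = c(0, 0, 0, 1, 1, 1),
  TE_group_BC = c(0.80, NA, 0.90, 0.70, 0.60, NA),   # NA = non-selected observation
  MTR_BC      = c(0.95, NA, 0.85, 1.00, 0.90, NA)
)
# Assumed decomposition: TE relative to the metafrontier = group TE x MTR
toy_eff$TE_meta_BC <- toy_eff$TE_group_BC * toy_eff$MTR_BC

# The formula interface drops rows with NA (non-selected observations) by default
by_group <- aggregate(cbind(TE_group_BC, TE_meta_BC, MTR_BC) ~ group,
                      data = toy_eff, FUN = mean)
by_group
```

The same `aggregate()` call should work on `eff_sel` if it carries these columns.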