Methodology

Overview

Traditional residual diagnostics for generalized linear models (GLMs) break down when the outcome is discrete. The unifres package implements the unified functional residual framework of Liu, Lin, & Zhang (2025), which addresses these limitations by replacing each point residual with a function: the distribution of the residual randomness for that observation.

The problem with traditional residuals

Limitations for discrete data

Traditional residuals (Pearson, deviance, etc.) are point statistics that:

  1. Cannot capture full residual randomness for discrete outcomes
  2. Lack interpretability for binary and count data
  3. Show patterns even for correct models due to discrete data structure
  4. Depend on specific link functions, making comparisons difficult

Example: binary outcome

For a logistic regression with outcome \(Y \in \{0, 1\}\) and fitted probability \(\hat{p}\):

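  • For a given \(\hat{p}\), a traditional residual can take only two values, one for \(Y = 0\) and one for \(Y = 1\)
  • Plotted against fitted values or covariates, these values fall on two curves, an artificial pattern that appears even when the model is correct
  • That pattern makes it difficult to assess model adequacy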
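
A minimal sketch of the two-value behaviour, using the Pearson residual for a Bernoulli outcome. The function name pearson_residual is illustrative and not part of unifres.

```python
import math

def pearson_residual(y, p_hat):
    """Pearson residual (y - p) / sqrt(p * (1 - p)) for a Bernoulli outcome."""
    return (y - p_hat) / math.sqrt(p_hat * (1.0 - p_hat))

# For one fitted probability, the residual has only two possible values.
p_hat = 0.3
values = {y: pearson_residual(y, p_hat) for y in (0, 1)}
for y, r in values.items():
    print(f"y={y}: residual={r:.3f}")
```

Whatever the sample size, every observation with the same fitted probability lands on one of these two values, which is what produces the banding in diagnostic plots.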

Functional residuals: the solution

Core concept

Instead of point residuals, functional residuals represent the entire distribution of residual randomness:

\[F_i(t) = P(U_i \leq t \mid Y_i)\]

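where \(U_i\) is a latent variable representing the residual for observation \(i\). Conditional on \(Y_i\), \(U_i\) is uniformly distributed on an interval determined by the fitted model (see the constructions below).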
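
As a sketch, assuming that \(U_i\) given \(Y_i\) is uniform on an interval [lower, upper] (the endpoints are given in the construction sections below), \(F_i(t)\) is simply the CDF of that uniform distribution. The function name is hypothetical and is not the package's API.

```python
def functional_residual_cdf(t, lower, upper):
    """F_i(t) = P(U_i <= t | Y_i), assuming U_i | Y_i ~ Uniform(lower, upper)."""
    if not 0.0 <= lower < upper <= 1.0:
        raise ValueError("need 0 <= lower < upper <= 1")
    # Linear between the endpoints, clipped to [0, 1] outside them.
    return min(1.0, max(0.0, (t - lower) / (upper - lower)))

# Example interval [0.25, 0.75]: 0 below it, linear inside, 1 above it.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, functional_residual_cdf(t, 0.25, 0.75))
```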

Key properties

  1. Uniform Distribution: Under a correctly specified model, \(U_i\) is marginally \(\text{Uniform}(0,1)\)
  2. Full Information: Captures all residual randomness, not just a point estimate
  3. Model-Free: Works across all GLM families
  4. Interpretable: Direct probabilistic interpretation

Construction for different models

Binary outcomes (logistic regression)

For \(Y_i \in \{0, 1\}\) with fitted probability \(\hat{p}_i\):

Endpoints:

  • If \(Y_i = 0\): \(U_i \in [0, 1-\hat{p}_i]\)
  • If \(Y_i = 1\): \(U_i \in [1-\hat{p}_i, 1]\)
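
The two cases can be written as a small helper. This is a sketch with a hypothetical function name, not the unifres implementation.

```python
def binary_endpoints(y, p_hat):
    """Interval for U_i given a binary outcome y and fitted probability p_hat."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    if not 0.0 < p_hat < 1.0:
        raise ValueError("p_hat must lie strictly between 0 and 1")
    # Y = 0 occupies the lower part of [0, 1], Y = 1 the upper part.
    return (0.0, 1.0 - p_hat) if y == 0 else (1.0 - p_hat, 1.0)

print(binary_endpoints(0, 0.25))
print(binary_endpoints(1, 0.25))
```

The two intervals share the endpoint \(1-\hat{p}_i\) and together cover [0, 1]; their lengths are the fitted probabilities of the two outcomes.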

Count outcomes (Poisson regression)

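For \(Y_i \in \{0, 1, 2, \ldots\}\) with fitted rate \(\hat{\lambda}_i\):

Endpoints: \(U_i \in [P(Y \leq y_i - 1), P(Y \leq y_i)]\), where \(y_i\) is the observed count, both probabilities are computed under the fitted Poisson(\(\hat{\lambda}_i\)) distribution, and \(P(Y \leq -1) = 0\).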
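
A sketch of the Poisson endpoints using only the standard library. The function names are hypothetical, and the CDF is computed by direct summation, which is adequate for small to moderate counts.

```python
import math

def poisson_cdf(k, lam):
    """P(Y <= k) for Y ~ Poisson(lam); returns 0 for k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)  # P(Y = 0)
    total = term
    for j in range(1, k + 1):
        term *= lam / j    # P(Y = j) from P(Y = j - 1)
        total += term
    return min(total, 1.0)

def poisson_endpoints(y, lam_hat):
    """Interval [P(Y <= y - 1), P(Y <= y)] under the fitted rate lam_hat."""
    return poisson_cdf(y - 1, lam_hat), poisson_cdf(y, lam_hat)

for y in (0, 1, 2):
    print(y, poisson_endpoints(y, 2.0))
```

For \(y_i = 0\) the lower endpoint is 0, so the interval is \([0, e^{-\hat{\lambda}_i}]\).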


Diagnostic tools

1. Function-Function (Fn-Fn) Plots

The ffplot() function plots the average functional residual \(\bar{F}(t)\) against \(t\), the CDF of the Uniform(0,1) reference distribution:

\[\bar{F}(t) = \frac{1}{n}\sum_{i=1}^n F_i(t) \quad \text{vs} \quad t\]

Interpretation:

  • Good fit: \(\bar{F}(t) \approx t\) (points follow the diagonal)
  • Poor fit: systematic deviations from the diagonal
  • Analogous to Q-Q plots, but for functional residuals

Mathematical Basis: Under correct model specification, \(\mathbb{E}[F_i(t)] = t\) for all \(t \in [0,1]\).
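
The following sketch checks this property numerically on simulated binary data. It is not how ffplot() is implemented: the helper names are hypothetical, and the true success probability is used in place of a fitted value so that the model is correct by construction.

```python
import random

def uniform_cdf(t, lower, upper):
    """CDF at t of Uniform(lower, upper), clipped to [0, 1]."""
    return min(1.0, max(0.0, (t - lower) / (upper - lower)))

def average_functional_residual(intervals, grid):
    """F_bar(t) = (1/n) * sum_i F_i(t), evaluated at each t in grid."""
    n = len(intervals)
    return [sum(uniform_cdf(t, a, b) for a, b in intervals) / n for t in grid]

rng = random.Random(0)
intervals = []
for _ in range(5000):
    p = rng.uniform(0.1, 0.9)         # true P(Y = 1), standing in for p_hat
    y = 1 if rng.random() < p else 0  # outcome drawn from the same model
    intervals.append((0.0, 1.0 - p) if y == 0 else (1.0 - p, 1.0))

grid = [k / 10 for k in range(11)]
f_bar = average_functional_residual(intervals, grid)
max_gap = max(abs(f - t) for f, t in zip(f_bar, grid))
print(f"largest |F_bar(t) - t| on the grid: {max_gap:.4f}")
```

With a correctly specified model the largest gap should be small and shrink as \(n\) grows; a misspecified model would leave a systematic gap.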

2. Functional Residual Density (FRED) Plots

The fredplot() function visualizes the density of functional residuals against covariates:

Construction:

  1. Expand each \(F_i\) into a dense grid of points
  2. Create a 2D density plot: covariate vs residual value
  3. Add a LOESS smoother to detect patterns

Interpretation:

  • Good fit: uniform horizontal band
  • Poor fit: patterns indicate:
      • Curvature → Missing polynomial terms
      • Funneling → Heteroscedasticity
      • Gaps → Zero-inflation or structural issues
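
The expansion step can be sketched as follows. This is one plausible reading of step 1, placing evenly spaced points across each observation's interval; the function name is hypothetical and fredplot() may expand the residuals differently. The density estimate and smoother are omitted.

```python
def expand_functional_residuals(x, intervals, resolution=101):
    """Return (covariate, residual value) pairs, `resolution` per observation."""
    if resolution < 2:
        raise ValueError("resolution must be at least 2")
    points = []
    for x_i, (lower, upper) in zip(x, intervals):
        step = (upper - lower) / (resolution - 1)
        # Evenly spaced values from lower to upper, all at covariate x_i.
        points.extend((x_i, lower + k * step) for k in range(resolution))
    return points

x = [0.5, 1.5]
intervals = [(0.0, 0.6), (0.6, 1.0)]
points = expand_functional_residuals(x, intervals, resolution=5)
print(len(points))
print(points[:5])
```

The resulting point cloud is what a 2D density estimate would then be computed on.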

3. Derived residuals

From functional residuals, we can derive point-based residuals:

Surrogate residuals

A single random draw from the functional residual distribution:

\[r_i^{(s)} \sim F_i\]

Use: Traditional residual plots, quick diagnostics
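
A sketch of the draw, assuming each \(F_i\) is uniform on an interval as in the constructions above. The function name and the example intervals are illustrative only.

```python
import random

def surrogate_residual(lower, upper, rng):
    """Draw one surrogate residual, assuming F_i is Uniform(lower, upper)."""
    return rng.uniform(lower, upper)

rng = random.Random(42)  # fixed seed so the draws are reproducible
intervals = [(0.0, 0.7), (0.7, 1.0), (0.2, 0.5)]
draws = [surrogate_residual(a, b, rng) for a, b in intervals]
print(draws)
```

Because the draw is random, two runs with different seeds give different residuals for the same fitted model.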

Probability-scale residuals

A rescaling of the conditional expected value of \(U_i\):

\[r_i^{(p)} = 2\mathbb{E}[U_i \mid Y_i] - 1\]

which centers the residuals at 0 with range [-1, 1].

Use: A single deterministic value per observation, on a scale similar to traditional residuals
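
A sketch of the calculation, assuming \(U_i\) is uniform on an interval [lower, upper] given \(Y_i\), so that its conditional mean is the interval midpoint. The function name is hypothetical.

```python
def probability_scale_residual(lower, upper):
    """2 * E[U_i | Y_i] - 1, where E[U_i | Y_i] = (lower + upper) / 2."""
    return lower + upper - 1.0

# Binary example with fitted probability 0.25, using the binary endpoints.
p_hat = 0.25
r_one = probability_scale_residual(1.0 - p_hat, 1.0)   # Y = 1
r_zero = probability_scale_residual(0.0, 1.0 - p_hat)  # Y = 0
print(r_one, r_zero)
```

For a binary outcome this reduces to \(Y_i - \hat{p}_i\), the raw residual.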


Advantages over traditional methods

1. Unified framework

A single approach works for:

  • Binary outcomes (logistic regression)
  • Count outcomes (Poisson, negative binomial)
  • Ordinal outcomes (proportional odds models)
  • Zero-inflated models
  • Continuous outcomes (as a special case)

2. Meaningful interpretation

  • Residuals have a probabilistic interpretation
  • Uniform(0,1) under the null hypothesis that the model is correctly specified
  • Easy to communicate to non-statisticians

3. Better power

Functional residuals can detect departures that traditional methods miss:

  • Missing interaction terms
  • Incorrect link functions
  • Unmodeled heterogeneity

4. Visual clarity

FRED plots provide clearer visual diagnostics than traditional residual plots for discrete data.


Theoretical foundation

Probability integral transform

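For a continuous random variable \(X\) with CDF \(F\): \[F(X) \sim \text{Uniform}(0,1)\]

For a discrete outcome, \(F(Y)\) is not uniform because the CDF jumps at each support point. Functional residuals extend the transform by spreading each observation uniformly over the interval \([F(y_i - 1), F(y_i)]\) spanned by the corresponding jump.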
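
A short worked argument shows why the interval construction restores uniformity. It assumes, as in the count-outcome construction, that \(U\) given \(Y = y\) is uniform on \([F(y-1), F(y)]\). Write \(p_y = F(y) - F(y-1) = P(Y = y)\) and sum over the values \(y\) with \(p_y > 0\). For any \(t \in [0,1]\):

\[P(U \leq t) = \sum_{y} p_y \cdot \min\left\{1, \max\left\{0, \frac{t - F(y-1)}{p_y}\right\}\right\} = \sum_{y} \left|[F(y-1), F(y)] \cap [0, t]\right| = t\]

Each summand is the length of the part of \([F(y-1), F(y)]\) that lies below \(t\). The intervals tile \([0,1]\) without overlap, so these lengths add up to \(t\).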

Asymptotic properties

Under regularity conditions:

  1. Consistency: \(\bar{F}(t) \xrightarrow{p} t\) for all \(t\)
  2. Asymptotic Normality: \(\sqrt{n}(\bar{F}(t) - t)\) is asymptotically normal
  3. Weak Convergence: The process \(\sqrt{n}(\bar{F}(t) - t)\), viewed as a function of \(t\), converges weakly, which enables formal hypothesis tests for model adequacy

Comparison to existing methods

Method               | Data Type  | Interpretability | Power            | Visualization
Pearson              | All        | Moderate         | Low for discrete | Poor for discrete
Deviance             | All        | Low              | Moderate         | Poor for discrete
Quantile             | Continuous | High             | High             | Good
Randomized Quantile  | Discrete   | Moderate         | Moderate         | Moderate
Functional Residuals | All        | High             | High             | Excellent

Implementation details

Resolution parameter

Both R and Python implementations use a resolution parameter (default 101):

  • Controls the grid density for expanding functional residuals
  • Higher resolution → smoother plots but slower computation
  • Recommendation: 51-201 depending on dataset size

Computational complexity

  • Functional residuals: \(O(n)\) computation
  • Fn-Fn plot: \(O(n \times r)\) where \(r\) is resolution
  • FRED plot: \(O(n \times r)\) plus density estimation

For large datasets (\(n > 10{,}000\)), consider subsampling for visualization.
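
One simple way to subsample is to draw row indices at random and plot only those observations. This is a sketch using the standard library; the function name and the max_points cap are hypothetical and are not unifres parameters.

```python
import random

def subsample_indices(n, max_points=10_000, seed=0):
    """Return sorted row indices, at most max_points of them, drawn without replacement."""
    if n <= max_points:
        return list(range(n))
    return sorted(random.Random(seed).sample(range(n), max_points))

idx = subsample_indices(50_000)
print(len(idx), idx[:5])
```

A fixed seed keeps the plotted subset the same between runs, which makes diagnostic plots easier to compare.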


Extended applications

Beyond GLMs

The framework extends to:

  1. Generalized Additive Models (GAMs) - R package supports mgcv::gam()
  2. Zero-Inflated Models - R package supports pscl::zeroinfl()
  3. Ordinal Regression - R package supports VGAM::vglm()

Future extensions

Potential applications include:

  • Survival models (censored data)
  • Mixed effects models
  • Time series models
  • Spatial models

References

Primary Reference:

Liu, D., Lin, Z., & Zhang, H. (2025). A unified framework for residual diagnostics in generalized linear models and beyond. Journal of the American Statistical Association, 1–29. https://doi.org/10.1080/01621459.2025.2504037

Related Work:

  • Dunn, P. K., & Smyth, G. K. (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics, 5(3), 236-244.
  • Feng, C., Li, L., & Sadeghpour, A. (2020). A comparison of residual diagnosis tools for diagnosing regression models for count data. BMC Medical Research Methodology, 20, 1-21.

Mathematical notation

Symbol                | Meaning
\(Y_i\)               | Observed outcome for observation \(i\)
\(\hat{p}_i\)         | Fitted probability (binary models)
\(\hat{\lambda}_i\)   | Fitted rate (count models)
\(F_i(t)\)            | Functional residual CDF for observation \(i\)
\(U_i\)               | Uniform random variable representing residual
\(r_i^{(s)}\)         | Surrogate residual
\(r_i^{(p)}\)         | Probability-scale residual

Next steps