



Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Community
Ask the community for help and clear up your study doubts
Discover the best universities in your country according to Docsity users
Free resources
Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors
The importance of estimability in linear models when dealing with rank-deficient models, which can produce questionable predictions. The author proposes tools to help package developers ensure that only estimable predictions are returned, flagging non-estimable ones. The document also includes an example of how to implement these tools in R using the estimability package.
Typology: Summaries
1 / 5
This page cannot be seen from the preview
Don't miss anything!
Abstract When a linear model is rank-deficient, then predictions based on that model become questionable because not all predictions are uniquely estimable. However, some of them are, and the estimability package provides tools that package developers can use to tell which is which. With the use of these tools, a model object’s predict method could return estimable predictions as-is while flagging non-estimable ones in some way, so that the user can know which predictions to believe. The estimability package also provides, as a demonstration, an estimability-enhanced epredict method to use in place of predict for models fitted using the stats package.
Introduction
Consider a linear regression or mixed model having fixed component of the matrix form X βββ. If X is not of full column rank, then there is not a unique estimate b of βββ. However, consider using λλλ ′ b to estimate the value of some linear function λλλ ′ βββ = ∑j λ j β j. (We use x ′^ to denote the transpose of a vector x .) For some λλλ s, the prediction depends on the solution b ; but for others—the estimable ones—it does not.
An example will help illustrate the issues. In the following commands, we create four predictors x1–x and a response variable y:
x1 <- -4 : 4 x2 <- c(-2, 1, -1, 2, 0, 2, -1, 1,-2) x3 <- 3 * x1 - 2 * x x4 <- x2 - x1 + 4 y <- 1 + x1 + x2 + x3 + x4 + c(-.5, .5, .5, -.5, 0, .5, -.5, -.5, .5)
Clearly, x3 and x4 depend linearly on x1 and x2 and the intercept. Let us fit two versions of the same model to these data, entering the predictors in different orders, and compare the regression coefficients:
mod1234 <- lm(y ~ x1 + x2 + x3 + x4) mod4321 <- lm(y ~ x4 + x3 + x2 + x1) zapsmall(rbind(b1234 = coef(mod1234), b4321 = coef(mod4321)[c(1, 5:2)]))
(Intercept) x1 x2 x3 x b1234 5 3 0 NA NA b4321 -19 NA NA 3 6
Note that in each model, two regression coefficients are NA. This indicates that the associated predictors were excluded due to their linear dependence on the others.
The problem that concerns us is making predictions on new values of the predictors. Here are some predictor combinations to try:
testset <- data.frame(
And here is what happens when we make the predictions:
cbind(testset,
x1 x2 x3 x4 pred1234 pred 1 3 1 7 2 14 14 2 6 2 14 4 23 47
Warning message: In predict.lm(mod1234, new = testset) : prediction from a rank-deficient fit may be misleading
Note that the two models produce identical predictions in some cases (rows 1, 3, and 4), and different ones in the others. There is also a warning message (we suppressed this the second time) telling us that this could happen.
It happens that cases 1, 3, and 4 are estimable, and the others are not. It would be helpful to know which predictions to trust, rather than a vague warning. The epredict function in estimability (Lenth,
require("estimability") rbind(epred1234 = epredict(mod1234, newdata = testset),
1 2 3 4 5 6 epred1234 14 NA 23 5 NA NA epred4321 14 NA 23 5 NA NA
The results for non-estimable cases are indicated by NAs. Note that in both models, the same new-data cases are identified as estimable, and they are the ones that yield the same predictions. Note also that estimability was determined separately from each model. We do not need to compare two different fits to determine estimability.
Adding estimability checking to a modeling package
It is a simple matter to add estimability checking to the predict method(s) in a new or existing package.
This is implemented in code in a manner something like this:
fitmodel <- function(...) { ... code to create: 'X' (the fixed-effects model matrix), 'QR' (QR decomp of 'X'), 'object' (the fitted model object) ... object$nonest <- estimability::nonest.basis(QR) object }
predict.mymodel <- function(object, newdata, tol = 1e-8, ...) { ... if(!missing(newdata)) { ... code to create: 'newX' (the model matrix for 'newdata') ... estble <- estimability::is.estble(newX, object$nonest, tol)
... code to flag non-estimable predictions ...
ith element of Xb is x ′ i b where x ′ i is the ith row of X. Thus, x ′ i βββ is estimable for each i, and so are any linear combinations of these. Indeed, λλλ ′ βββ is estimable if and only if λλλ ′^ is in the row space of X —or equivalently, λλλ is in the column space of X ′^ (Monahan, 2008, Result 3.1).
From the above result, we can establish estimability of λλλ ′ βββ either by showing that λλλ is in the column space of X ′, or by showing that it is orthogonal to the null space of X (Monahan, 2008, Methods 3.2 and 3.3, respectively). One way to implement the second idea is to note that if ( X ′ X )−^ is any generalized inverse of X ′ X , then P = ( X ′ X )( X ′ X )−^ is a projection onto the column space of X ′, and so I − P is a projection onto the null space of X. The columns of I − P thus comprise a basis for this null space. Accordingly, estimability of λλλ ′ βββ can be determined by showing that λλλ ′( I − P ) = 0 ′.
SAS uses I − P , as described above, to test estimability. Of course, in computation we need to set a tolerance for being close enough to 0. SAS deems λλλ ′ βββ non-estimable if λλλ 6 = 0 and max | λλλ ′( I − P )| > ψ · max | λλλ |, where ψ is a tolerance factor with a default value of 10−^4. For details, see SAS Institute Inc. (2013), the chapter on “Shared Concepts,” documentation for the ESTIMATE statement and its SINGULAR option.
In the estimability package, we obtain the null basis in a different way than shown above, in part because the QR decomposition of X is already (usually) available in an ‘lm’ object. Suppose that X is n × p, then we can write X = QR , where Q (n × p) has orthonormal columns, and R (p × p) is upper triangular with nonzero diagonal elements. If X is of rank r < p, then columns are pivoted so that the linearly dependent ones come last, and we only need r columns of Q and the top r rows of R , along with the pivoting information. Call these pivoted, down-sized matrices Q ˜ and R ˜ respectively; and let ˜ X (still n × p) denote X with its columns permuted according to the pivoting. We now have that X^ ˜ = Q ˜ ˜ R , and R ˜ comprises the first r rows of an upper triangular matrix. Observe that each row of ˜ X is thus a linear combination of the rows of ˜ R —that is, the row space of ˜ X is the same as the row space of R^ ˜. So the much smaller matrix R ˜ has everything we need to know to determine estimability.
Now, let d = p − r denote the rank deficiency, and define
R ˜′ p×r^0 r×d I d×d
Then S is an upper triangular matrix with nonzero diagonal—hence it is of full rank p = r + d. Let S = TU be the QR decomposition of S. We then have that T has orthonormal columns, and in fact it is a Gram-Schmidt orthonormalization of S. Accordingly, the first r columns of T comprise an orthonormalization of the columns of R ˜′, and thus they form a basis for the row space of R ˜ and hence of X ˜. Letting N ˜ denote the last d columns of T , we have that N ˜ has orthonormal columns, all of which are orthogonal to the first r columns of T and hence to the row space of R ˜ and ˜ X. It follows that N ˜ is an orthonormal basis for the null space of X ˜. After permuting the rows of N ˜ according to the original pivoting in the QR decomposition of X , we obtain a basis N for the null space of X.
To test estimability of λλλ ′ βββ , first compute z ′^ = λλλ ′ N , which is a d-vector. Theoretically, λλλ ′ βββ is estimable if and only if z ′ z = 0 ; but for computational purposes, we need to set a tolerance for its being suitably small. Again, we opt to deviate from SAS’s approach, which is based on max |zi |. Instead, we deem λλλ ′ βββ non-estimable when λλλ 6 = 0 and z ′ z ≥ τ · λλλ ′ λλλ , where τ is a tolerance value. Our default value is τ = ( 10 −^4 )^2 = 10 −^8. The rationale for this criterion is that it is rotation-invariant: We could replace N ← VN where V is any p × p orthogonal matrix, and z ′ z will be unchanged. Such rotation invariance seems a desirable property because the assessment of estimability does not depend on which null basis is used.
Bells and whistles
The estimability package’s epredict function serves as a demonstration of adding estimability check- ing to the predict methods in the stats package. It works for lm, aov, glm, and mlm objects. It also provides additional options for the type argument of predict. When newdata is provided, type = "estimability" returns a logical vector showing which rows of newdata are estimable; and type = "matrix" returns the model matrix for newdata.
An accompanying enhancement is eupdate, which runs update on a model object and also adds a nonest member. Subsequent epredict calls on this object will use that as the null basis instead of having to reconstruct it. For example,
mod123 <- eupdate(mod1234,. ~. - x4)
mod123$nonest
[,1] [1,] 0. [2,] -0. [3,] 0. [4,] 0.
Now let’s find out which predictions in testset are estimable with this model:
epredict(mod123, newdata = testset, type = "estimability")
1 2 3 4 5 6 TRUE TRUE TRUE TRUE TRUE FALSE
Thus, more of these test cases are estimable when we drop x 4 from the model. Looking at testset, this makes sense because two of the previously non-estimable rows differ from estimable ones only in their values of x 4.
Conclusion
In rank-deficient linear models used for predictions on new predictor sets, it is important for users to know which predictions are to be trusted, and which of them would change if the same model had been specified in a different way. The estimability package provides an easy way to add this capability to new and existing model-fitting packages.
Bibliography
R. V. Lenth. estimability: Estimability Tools for Linear Models, 2015. URL http://CRAN.R-project.org/ package=estimability. R package version 1.1. [p196]
J. F. Monahan. A Primer on Linear Models. Chapman & Hall/CRC, 2008. [p197, 198]
SAS Institute Inc. SAS/STAT Software, Version 13.2. SAS Institute Inc., Cary, NC, 2013. URL http: //www.sas.com/. [p198]
S. R. Searle. Linear Models. Wiley Classics. Wiley-Interscience, 1997. [p197]
G. A. F. Seber and A. J. Lee. Linear Regression Analysis. John Wiley & Sons, Inc., second edition, 2003. [p197]
Russell V. Lenth The University of Iowa 241 Schaeffer Hall Iowa City, IA 52242 USA russell-lenth@uiowa.edu