11. Linear Models

The function lm() returns an Splus object called a fitted linear least-squares model object, or lm object for short:

lm(formula, data)

where formula is the structural formula that specifies the model and data is the data frame containing the data (omitted if the data is not stored in a data frame). The following example uses the dataframe car.test.frame described in the section on data frames.

>  fuel.fit_lm(100/Mileage ~ Weight + Disp., car.test.frame)

> fuel.fit
lm(formula=100/Mileage~Weight + Disp.,data=car.test.frame)

 (Intercept)      Weight        Disp.
   0.4789733 0.001241421 0.0008543589

Degrees of freedom: 60 total; 57 residual
Residual standard error: 0.3900812
A list of the components of an lm object and of the arguments which can be passed through the lm() function can be found at the end of this section.

The lm object fuel.fit can be used as an argument to other functions to obtain summaries of the data or to extract specific information about the model. The function summary() gives a more detailed description of the model than the one obtained by printing out the lm object. The components of an lm.summary object are listed at the end of this section.

> fuel.summary_summary(fuel.fit)
> fuel.summary

lm(formula=100/Mileage~Weight + Disp.,data = car.test.frame)
     Min      1Q  Median     3Q    Max
 -0.8109 -0.2559 0.01971 0.2673 0.9812

             Value Std. Error t value Pr(>|t|)
(Intercept) 0.4790 0.3418     1.4014  0.1665
     Weight 0.0012 0.0002     7.2195  0.0000
      Disp. 0.0009 0.0016     0.5427  0.5895

Residual standard error: 0.3901 on 57 degrees of freedom
Multiple R-Squared: 0.7438
F-statistic: 82.75 on 2 and 57 degrees of freedom, 
the p-value is 0

Correlation of Coefficients:
       (Intercept)  Weight
Weight -0.8968
 Disp.  0.4720     -0.8033
Specific information about the model can be printed using the functions coef(), resid() and fitted(). The same information can be obtained using fuel.fit$coef, fuel.fit$resid, and fuel.fit$fitted.

> coef(fuel.fit) 
 (Intercept)      Weight        Disp.
   0.4789733 0.001241421 0.0008543589

> formula(fuel.fit2)
100/Mileage ~ Weight + Disp. + Type

             * the formula() function extracts the model
               formula from the lm object fuel.fit2

The update() function is used to modify the model formula used in fuel.fit:

> fuel.fit2_lm(update(fuel.fit, . ~ . + Type))

> fuel.fit2
lm(formula = update(fuel.fit, . ~ . + Type))

 (Intercept)      Weight       Disp.      Type1      Type2
    2.960617 4.16396e-05 0.007594073 -0.1452804 0.09808411
      Type3       Type4     Type5
 -0.1240942 -0.04358047 0.1915062

Degrees of freedom: 60 total; 52 residual
Residual standard error: 0.3139246
By default, contrasts used for factors are the Helmert parametrization. Here, the coefficient for Type 1 corresponds to the contrast between the first two factors, the coefficient for Type 2 corresponds to the contrast between the third factor and the mean of the first two, etc. To view the contrast matrix, type fuel.fits2$contrasts. The type of contrast used can be changed by specifying a different method as an argument to the options() function, or bychoosing a contrast in the model formula using the C() function. The four choices are: helmert, poly, sum, and treatment.

> fuel.fit2_lm(update(fuel.fit, . ~ . + C(Type, sum)))

              * the first argument to the C() function 
                specifies the factor and the second 
                argument specifies the contrasts.

> X11();par(mfrow=c(2,1));plot(fuel.fit)

              * the plot() function has a default method
                for lm objects
              * two plots are produced:  one of the response
                against the fitted values, with the y=x line
                superimposed, the other of the absolute value
                of the residuals against their fitted values

View plots
Suppose we have another data frame called new.cars for which we want to predict fuel consumption using the model from fuel.fit. This is provided by the function predict().

> predict(fuel.fit, new.cars)

              * all the variables used in the model fuel.fit
                must also be in the data frame new.cars

Assessing the fit of a model

Quantities for assessing the fit of a least squares regression model can be obtained by extracting the necessary vectors from an lm object and computing the regression diagnostics, or by using the functions lsfit() and ls.diag(). The function lm.influence takes an lm object as its argument and returns an object whose components contain summaries of the n models obtained by omitting one observation. The first component, coefficients, is a matrix whose rows are the coefficients for the model with the corresponding observation omitted. The second component, sigma, is a vector of estimates for the residual standard error in the corresponding model. The final component, hat, is the hat matrix. A wide variety of diagnostics can be computed using these components and those of the lm object:

Quantity         Expression            Meaning

Standardized    e/(s*(1-h)^.5)        Residuals with equal variance

Studentized     e/si*(1-h)^.5)        Use si as standard error

DFBETAS         bi/(si %o% xxi^.5)    The change in the coefficients,
                                      scaled by the standard error
                                      for the coefficients

DFFIT           h*e/(1-h)             The change in the fitted value
                                      when that observation is dropped

DFFITS          h^.5*e/(si*(1-h))     Change in fitted values,
                                      standardized to variance 1
where the objects are extracted from an lm object (lmf) as described below:

> lms_summary(lmf)
> lmi_lm.influence(lmf)

> e_residuals(lmf)
> s_lms$sigma
> si_lmi$sigma
> xxi_diag(lms$cov.unscaled)
> h_lmi$hat
> bi_coef(lmf) - coef(lmi)
Some of these quantities can be obtained from the function ls.diag(). The function ls.diag() takes as its argument a list like the output of lsfit(). The function lsfit() is similar to the lm() function except that the formula is not specified using a formula object, but rather the model and response matrices are passed on as arguments to the function. These matrices can be created from the variables or extracted from an lm object.

> fuel.model_lm(fuel.fit,x=T,y=T)
                        * an lm object is constructed which
                          saves the model matrix and the 
> fuel.ls_lsfit(fuel.model$x, fuel.model$y, intercept=F)
                        * the model matrix and the response
                          are then passed on to the lsfit()
                        * the argument intercept=F is
                          specified because the intercept is
                          already included in the model matrix
                        * weights can also be specified using
                          the argument wt=

> ls.print(fuel.ls)
                        * ls.print() is similar to summary() 

Residual Standard Error = 0.3901,  Multiple R-Square = 0.7438
N = 60,  F-statistic = 82.7518 on 2 and 57 df, p-value = 0

            coef std.err t.stat p.value
Intercept 0.4790  0.3418 1.4014  0.1665
   Weight 0.0012  0.0002 7.2195  0.0000
    Disp. 0.0009  0.0016 0.5427  0.5895

> fuel.diag_ls.diag(fuel.ls)

              * ls.diag() computes regression diagnostics
              * the function returns a list with components:

std.dev        residual standard deviation
hat            vector containing the diagonal of the hat matrix
std.res        standardized residuals
stud.res       studentized residuals
cooks          Cook's distance for each observation
dfits          change in fitted value when each observation is deleted
correlation    correlation matrix for the parameter estimates
std.err        standard errors of the parameter estimates
cov.unscaled   unscaled covariance matrix for the parameter estimates
An lm.object is a list with components (eg.: fuel.fit$coefficients):

effects            orthogonal, single-degree-of-freedom effects
R                  triangular facor of the decomposition
rank               the computed rank
assign             list of assignments of coefficients (and effects) to the
                   terms of the model
terms              an object summarizing the formula
call               call that produced the object
contrasts          a list with matrices or vectors of contrasts used to code
df.residual        number of degrees of freedom for residuals
R.assign           same as assign, for an over-specified model (in terms of
                   full-rank part of model)
assign.residuals   vector identifying effects assigned to residuals
qr                 qr decomposition object (optional)
model              model frame (optional)
x                  model matrix (optional)
y                  response (optional)
The following is a list of other possible arguments to the function lm(), with the default action when not specified, if applicable:

weights=             *  vector of observation weights

subset=              *  expression specifying rows to be used
                     *  may be logical, numeric (row numbers),
                        or character (row names)

na.action=na.fail    *  a function to filter missing values
                     *  default creates an error message,
                        a possible alternative is na.omit
                        which deletes observations with
                        missing values

method="qr"          *  least squares fitting method to be used
                     *  method="model.frame" returns the model.frame

model=F              *  if T returns the model frame component model

x=F                  *  if T the model matrix is returned in component x

y=F                  *  if T the response is returned in component y

contrasts=NULL       *  a list giving contrasts for all or some of the
                        factors appearing in the model formula
                     *  the elements of the list should be a contrast
                        matrix, or a function to compute such a matrix
The function summary computes a list with the following components:

correlation     correlation coefficient matrix for the coefficients
cov.unscaled    unscaled covariance matrix
df              degrees of freedom for the model and for the residuals
coefficients    a matrix with three columns: coefficients, standard errors,
r.squared       the multiple R-squared statistic
fstatistic      numeric vector of length three giving the F test
                   - the first element is the statistic, the last two are the df
residuals       the model residuals (weighted if applicable)
sigma           residual standard error estimate
terms           the terms object used in fitting this model

Further Reading

John M. Chambers, Trevor J.Hastie, Statistical Models in S, Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, California, 1992, Ch. 3

Where to now?

Table of Contents

Model Building