7. Data Frames

Data frames are a class of objects used to represent data, most often data to which models will be fitted. They are similar to matrices in that the variables can be treated as columns and the observations as rows. They are more general than matrices in that every element of a matrix in Splus must be of the same mode (all numeric, all logical, all character strings, etc.), whereas the variables in a data frame may differ in mode from one column to the next.
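
The difference from a matrix can be seen in a small sketch. It uses two rows of values from the cu.summary table shown below, is written with `<-` for assignment so that it should also run in R, and the object names m and d are chosen only for this illustration.

```r
# A matrix holds a single mode: mixing numbers and strings
# coerces every element to character.
m <- cbind(c(11950, 6995), c("Japan", "USA"))
mode(m)

# A data frame keeps a separate type for each column.
d <- data.frame(Price = c(11950, 6995),
                Country = factor(c("Japan", "USA")))
sapply(d, data.class)
```

Country is wrapped in factor() explicitly so that the result does not depend on how a particular version treats character vectors.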

Splus has a library of datasets available with which to experiment. One of these, cu.summary, contains data from Consumer Reports' yearly automobile issue. The following is a subset of this dataset:

> cu.summary[1:10,]

                Price   Country Reliability Mileage  Type
Acura Integra 4 11950     Japan Much better      NA Small
   Dodge Colt 4  6851     Japan          NA      NA Small
   Dodge Omni 4  6995       USA  Much worse      NA Small
 Eagle Summit 4  8895       USA      better      33 Small
Ford Escort   4  7402       USA       worse      33 Small
 Ford Festiva 4  6319     Korea      better      37 Small
   GEO Metro  3  6695     Japan          NA      NA Small
   GEO Prizm  4 10125 Japan/USA Much better      NA Small
  Honda Civic 4  6635 Japan/USA Much better      32 Small
Hyundai Excel 4  5899     Korea       worse      NA Small

Data frames can be created with the read.table(), data.frame(), expand.grid(), or as.data.frame() functions.

Suppose the table above is recorded in a text file called cars. By default, read.table() treats white space as the field separator, so an unquoted character string cannot contain white space. The row names and the Reliability values in this table do contain white space, so the file has to be prepared in one of three ways before it is read into Splus with read.table():

     1)  quote the strings which contain white spaces:

          "Acura Integra 4"  11950   Japan  "Much better"   NA   Small

          > cars_read.table("cars")

          * read.table() reads in a text file and saves the data in a
            data frame

     2)  use an explicit field separator character:

          Acura Integra 4:11950:Japan:Much better:NA:Small
          > cars_read.table("cars",sep=":")

     3) organize the data into fixed format fields:

     Col. 1               17        27    33               50 53
          Acura Integra 4 11950     Japan Much better      NA Small

If each field starts in the same column on each line, then the data do not need to be edited and could be read in with the following commands:

     > columns_c(1,17,27,33,50,53)
     > cars_read.table("cars", sep=columns, header = T)

When explicit columns are used as separators, Splus cannot automatically tell if the first line contains headers, so the argument header=T is specified.
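
The first two approaches can be tried without editing a real file. The following is a minimal sketch, assuming an R-compatible dialect of S (hence `<-` for assignment): it writes two rows of the table to temporary files and reads them back. The names quoted, colons, cars.q and cars.c are made up for the illustration, and row.names=1 is given explicitly so that the result does not depend on automatic detection of the row names. The fixed-format form with sep=columns is left out because it is specific to Splus.

```r
# Two rows from the table above, once with quoted strings ...
quoted <- tempfile()
cat('"Acura Integra 4" 11950 Japan "Much better" NA Small\n',
    '"Eagle Summit 4" 8895 USA better 33 Small\n',
    file = quoted, sep = "")

# ... and once with ":" as an explicit field separator.
colons <- tempfile()
cat("Acura Integra 4:11950:Japan:Much better:NA:Small\n",
    "Eagle Summit 4:8895:USA:better:33:Small\n",
    file = colons, sep = "")

# row.names = 1 takes the row names from the first field.
cars.q <- read.table(quoted, row.names = 1)
cars.c <- read.table(colons, sep = ":", row.names = 1)

dim(cars.q)
dimnames(cars.c)[[1]]

# Remove the temporary files.
unlink(c(quoted, colons))
```

Both calls should give a data frame with two rows and five variables, the car names serving as row names.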

Splus assumes that the first row in the text file contains the variable names, and that the first column with non-numeric, non-duplicated entries contains the row names. If the first row does not contain valid variable names, Splus uses "V1", "V2", etc. for the columns; if no column contains valid row names (i.e., non-numeric with no duplicates), Splus uses the row numbers for the rows. Row names and column names may also be specified as arguments to the read.table() function. Suppose the text file cars2 does not contain any row or column names:

> car.names_c("Acura Integra 4", "Dodge Colt 4", ...
> car.vars_c("Price", "Country", "Reliability", "Mileage", "Type")
> cars_read.table("cars2",row.names=car.names,col.names=car.vars)

If the text file does contain column labels, specifying header=T tells Splus that the first line is a header line and should not be read as data. Row names can also be taken from the file itself: row.names=<field number> specifies the field from which the row names are to come, and row.names=<variable name> specifies the name of the variable to be used as row names.
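
A minimal sketch of both cases, again in R-compatible syntax and using temporary files in place of cars2. Only two rows of the table are used, so car.names here is a shortened, hypothetical version of the vector above, and "Model" is a made-up label for the field holding the car names.

```r
car.names <- c("Acura Integra 4", "Eagle Summit 4")
car.vars <- c("Price", "Country", "Reliability", "Mileage", "Type")

# A file with no row or column names: supply both as arguments.
bare <- tempfile()
cat('11950 Japan "Much better" NA Small\n',
    '8895 USA better 33 Small\n',
    file = bare, sep = "")
cars2 <- read.table(bare, row.names = car.names, col.names = car.vars)

# A file with a header line: name the variable that holds the row names.
labelled <- tempfile()
cat('Model Price Country Reliability Mileage Type\n',
    '"Acura Integra 4" 11950 Japan "Much better" NA Small\n',
    '"Eagle Summit 4" 8895 USA better 33 Small\n',
    file = labelled, sep = "")
cars3 <- read.table(labelled, header = T, row.names = "Model")

names(cars2)
dimnames(cars3)[[1]]

# Remove the temporary files.
unlink(c(bare, labelled))
```

Either route should end with the same five variables and the same two row names.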

Variables in data frames can be anything that is indexed by the set of rows. The following can be used for statistical models:

  1. numeric vectors
  2. factors and ordered factors
  3. numeric matrices

The function data.class() returns the class of an object:

> sapply(cars,data.class)
     Price  Country Reliability   Mileage     Type
 "numeric" "factor"    "factor" "numeric" "factor"

                            * the function sapply() is similar to
                              the function apply() but is used
                              specifically with lists or data frames

A numeric vector can be coerced to a factor using the function as.factor(). Similarly, a matrix or a list can be coerced into a data frame using the function as.data.frame().

The function data.frame() is used to combine Splus objects into a data frame. Each kind of argument contributes variables as follows: numeric vectors, factors, and ordered factors each contribute a single variable; character and logical vectors are converted into factors; matrices contribute one variable per column; lists contribute one variable for each component, applied recursively; and the variables of a data frame become variables in the new data frame. If any argument to data.frame() is of the form I(x), then x retains its class in the new data frame (e.g., a character vector will not be converted into a factor).
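
These coercions can be sketched as follows, in R-compatible syntax. The object names x, m, md and d are made up for the illustration, and the values are taken from the first three rows of the table above. Note that recent versions of R (4.0 and later) no longer convert character vectors to factors by default, so the example builds the factor explicitly and uses I() only to show that the character vector is kept as it is.

```r
# as.factor(): numeric vector to factor
x <- as.factor(c(4, 4, 3, 4))
levels(x)

# as.data.frame(): one variable per column of the matrix
m <- matrix(1:6, nrow = 3)
md <- as.data.frame(m)

# data.frame(): combine several objects; I() protects the character vector
d <- data.frame(Price = c(11950, 6851, 6995),
                Country = factor(c("Japan", "Japan", "USA")),
                Model = I(c("Acura Integra 4", "Dodge Colt 4", "Dodge Omni 4")),
                m)
dim(d)
```

The matrix m adds two variables to d, one for each of its columns, giving five variables in all.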

Suppose we have a model which predicts cholesterol values by systolic blood pressure and age. We might want to use the model to predict cholesterol values over a regular grid of blood pressures and ages.

> systol_seq(110,130,by=10)
> age_seq(20,50,by=10)

There are 3 values of systol, 4 values of age, and 12 pairs of values. Since predict() and similar computations require the new data to be a data frame, the function expand.grid() is used to create a data frame with every combination of the two variables:

> chol.grid_expand.grid(systol=systol,age=age)
> chol.grid
   systol age
 1    110  20
 2    120  20
 3    130  20
 4    110  30
 5    120  30
 6    130  30
 7    110  40
 8    120  40
 9    130  40
10    110  50
11    120  50
12    130  50
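
To show how such a grid is passed to predict(), here is a minimal sketch in R-compatible syntax. The training data are invented purely for illustration: chol is generated from an exact linear rule (100 + 0.5*systol + age) so that the predictions are easy to check by hand. The names train and fit are likewise made up, and a linear model fitted with lm() stands in for whatever model is actually being used.

```r
systol <- seq(110, 130, by = 10)
age <- seq(20, 50, by = 10)
chol.grid <- expand.grid(systol = systol, age = age)

# Invented training data, for illustration only.
train <- data.frame(systol = c(110, 120, 130, 115, 125, 135),
                    age = c(25, 35, 45, 50, 30, 40))
train$chol <- 100 + 0.5 * train$systol + train$age

fit <- lm(chol ~ systol + age, data = train)

# One prediction for each row of the grid.
chol.grid$chol <- predict(fit, newdata = chol.grid)
chol.grid[1:3, ]
```

Because the grid has the same variable names as the data used to fit the model, predict() can match them up without further arguments.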

Further Reading

John M. Chambers and Trevor J. Hastie, Statistical Models in S, Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, California, 1992, pp. 45-94.
