Splus has a library of datasets available with which to experiment. One of these, cu.summary, contains data from Consumer Reports' yearly automobile issue. The following is a subset of this dataset:
> cu.summary[1:10,]
                Price   Country Reliability Mileage  Type
Acura Integra 4 11950     Japan Much better      NA Small
Dodge Colt 4     6851     Japan          NA      NA Small
Dodge Omni 4     6995       USA  Much worse      NA Small
Eagle Summit 4   8895       USA      better      33 Small
Ford Escort 4    7402       USA       worse      33 Small
Ford Festiva 4   6319     Korea      better      37 Small
GEO Metro 3      6695     Japan          NA      NA Small
GEO Prizm 4     10125 Japan/USA Much better      NA Small
Honda Civic 4    6635 Japan/USA Much better      32 Small
Hyundai Excel 4  5899     Korea       worse      NA Small

Data frames can be created by using the read.table(), data.frame(), expand.grid(), or as.data.frame() functions.
Suppose the table above is recorded in a text file called cars. By default, character strings cannot contain white space, so the data would have to be edited before being read into Splus with the read.table() function. There are three ways in which this could be done:
1) quote the strings which contain white spaces:

   "Acura Integra 4" 11950 Japan "Much better" NA Small

   > cars_read.table("cars")

   * read.table() reads in a text file and saves the data in a data frame

2) use an explicit field separator character:

   Acura Integra 4:11950:Japan:Much better:NA:Small

   > cars_read.table("cars",sep=":")

3) organize the data into fixed format fields:

   Col. 1               17        27    33               50 53
        Acura Integra 4 11950     Japan Much better      NA Small

If each field starts in the same column on each line, then the data do not need to be edited and could be read in with the following commands:
> columns_c(1,17,27,33,50,53)
> cars_read.table("cars", sep=columns, header = T)

When explicit columns are used as separators, Splus cannot automatically tell if the first line contains headers, so the argument header=T is specified.
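Options 1 and 2 can be tried without editing a real file. The following is a minimal sketch, not part of the original notes: it writes two of the rows above to temporary files (the file and object names are made up for illustration) and reads each layout back. It is written with `<-` rather than `_`, and header and row.names are given explicitly because the defaults may differ between S implementations.

```r
# Sketch only: temporary files and object names are hypothetical.
quoted <- tempfile()
cat('"Acura Integra 4" 11950 Japan "Much better" NA Small\n',
    '"Eagle Summit 4" 8895 USA better 33 Small\n',
    file = quoted, sep = "")

colon <- tempfile()
cat("Acura Integra 4:11950:Japan:Much better:NA:Small\n",
    "Eagle Summit 4:8895:USA:better:33:Small\n",
    file = colon, sep = "")

# header = F: the file has no line of variable names
# row.names = 1: the first field supplies the row names
cars.q <- read.table(quoted, header = F, row.names = 1)
cars.c <- read.table(colon, sep = ":", header = F, row.names = 1)

dim(cars.q)         # number of rows and of variables
row.names(cars.c)   # the car names, spaces intact

unlink(c(quoted, colon))   # remove the temporary files
```

Both calls should give the same two rows and five variables; only the way the fields are delimited differs.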
Splus assumes that the first row in the text file contains the variable names, and that the first column with non-numeric, non-duplicated names contains the row names. If the first row and/or the first column are not valid names (i.e., non-numeric with no duplicates), Splus uses "V1", "V2", etc. for the columns and/or the row numbers for the rows. Row names and column names may also be specified as arguments to the read.table() function. Suppose the text file cars2 does not contain any row or column names:
> car.names_c("Acura Integra 4", "Dodge Colt 4", ...
> car.vars_c("Price", "Country", "Reliabilty", "Mileage", "Type")
> cars_read.table("cars2",row.names=car.names,col.names=car.vars)

If the text file does contain column labels, specifying header=T tells Splus that the first line is a header line holding the variable names and should not be read as data. It is also possible to use row.names=<field number> (to specify the field from which the row names are to come) or row.names=<variable name> (to specify the variable to be used as row names).
Variables in data frames can be anything that is indexed by the set of rows. The following can be used for statistical models:
> sapply(cars,data.class)
    Price  Country Reliabilty   Mileage     Type
"numeric" "factor"   "factor" "numeric" "factor"

* the function sapply() is similar to the function apply() but is used specifically with lists or data frames

A numeric vector can be coerced into a factor using the function as.factor(). Similarly, a matrix or a list can be coerced into a data frame using the function as.data.frame(). The function data.frame() is used to combine Splus objects into a data frame. Character or logical vectors are converted into factors; matrices contribute one variable per column; lists contribute one variable for each component, applied recursively; and the variables in data frames become variables in the new data frame. Numeric vectors, factors, and ordered factors each contribute a single variable. If any argument to data.frame() is of the form I(x), then x will retain its class in the new data frame (i.e., a character vector will not be converted into a factor).
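The coercion and combination rules above can be checked on a few small objects. This is a minimal sketch with hypothetical objects (id, group, m, label, d) that do not appear elsewhere in these notes, written with `<-` rather than `_`.

```r
# Hypothetical objects, for illustration only.
id    <- c(1, 2, 2, 3)
group <- as.factor(id)        # numeric vector coerced into a factor
levels(group)                 # one level per distinct value

m     <- matrix(1:6, nrow = 3)   # 3 rows, 2 columns
label <- c("a", "b", "c")

# The matrix contributes one variable per column;
# I() keeps label as a character vector instead of a factor.
d <- data.frame(m, label = I(label))
dim(d)                        # 3 rows, 3 variables
is.factor(d$label)            # not a factor, because of I()

m.df <- as.data.frame(m)      # a matrix coerced directly into a data frame
dim(m.df)
```

The names given to the variables that come from the matrix are chosen by the system and may differ between S implementations, so the sketch only inspects dimensions and classes.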
Suppose we have a model which predicts cholesterol values by systolic blood pressure and age. We might want to use the model to predict cholesterol values over a regular grid of blood pressures and ages.
> systol_seq(110,130,by=10)
> age_seq(20,50,by=10)

There are 3 values of systol, 4 values of age, and 12 pairs of values. Since predict() and similar computations require the new data to be a data frame, the function expand.grid() is used to create a data frame with every combination of the two variables:
> chol.grid_expand.grid(systol=systol,age=age)
> chol.grid
   systol age
 1    110  20
 2    120  20
 3    130  20
 4    110  30
 5    120  30
 6    130  30
 7    110  40
 8    120  40
 9    130  40
10    110  50
11    120  50
12    130  50
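To show how such a grid might be passed to predict(), here is a minimal sketch. The notes do not say how the cholesterol model was fitted, so the sketch assumes a linear model from lm(), and the training values are synthetic: they are generated from a made-up linear rule, not taken from real measurements. The names train, fit, and pred are hypothetical.

```r
# Synthetic training data from a made-up rule (not real measurements).
s <- c(105, 115, 125, 135, 110, 130)
a <- c(25, 35, 45, 55, 50, 20)
train <- data.frame(systol = s, age = a,
                    cholesterol = 100 + 0.5 * s + 1.0 * a)

# Assumption: the model is an ordinary linear regression.
fit <- lm(cholesterol ~ systol + age, data = train)

# Regular grid of new values, as in the text.
systol <- seq(110, 130, by = 10)
age    <- seq(20, 50, by = 10)
chol.grid <- expand.grid(systol = systol, age = age)
nrow(chol.grid)                    # 12 rows: 3 values of systol times 4 of age

# The grid is a data frame, so it can serve as the new data for predict().
pred <- predict(fit, chol.grid)
cbind(chol.grid, pred)             # one predicted value per grid row
```

Because the first argument to expand.grid() varies fastest, the predictions come back in the same order as the rows printed above.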
Graphical Methods I