4. Simple Statistics

Summary Statistics

max
min
range
mean
median
var
cor
quantile
summary
> x <- c(1,2,3,3,3,4,7,8,9,NA)
                              * when there are missing values in the
                                data, the functions max(), min(),
                                range(), mean(), and median() return NA,
                                and the functions var(), cor(), and
                                quantile() return an error message

> max(x, na.rm=T)
[1] 9                         * specifying na.rm=T in the function
                                max() forces Splus to remove any
                                missing values from the vector x and
                                to return the maximum value in x

> min(x, na.rm=T)
[1] 1

> range(x, na.rm=T)
[1] 1 9

> mean(x, na.rm=T)
[1]  4.444444

> mean(x, trim=0.2, na.rm=T)
[1]  4.285714                 * the argument trim is the fraction of
                                observations, between 0 and 0.5
                                inclusive, to be trimmed from each end
                                of the ordered data
                              * if trim=0.5, the result is the median
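
The trimmed mean above can be reproduced by hand. The sketch below is written for R (an open-source implementation of S) and assumes the usual rule of dropping floor(n * trim) observations from each end of the sorted data; the names xs, k, by.hand and built.in are illustrative only.

```r
# Trimmed mean by hand, assuming floor(n * trim) values are dropped per end.
x <- c(1, 2, 3, 3, 3, 4, 7, 8, 9, NA)
xs <- sort(x[!is.na(x)])            # remove the NA, then order the data
n <- length(xs)                     # 9 non-missing values
k <- floor(n * 0.2)                 # 1 value dropped from each end
by.hand <- mean(xs[(k + 1):(n - k)])            # mean of 2,3,3,3,4,7,8
built.in <- mean(x, trim = 0.2, na.rm = TRUE)   # the call shown above
```

Both by.hand and built.in should equal 30/7, the 4.285714 printed above.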

> median(x, na.rm=T)
[1] 3

> quantile(x, probs=c(0,0.1,0.9), na.rm=T)
  0% 10% 90%                  * the function quantile() returns the
   1 1.8 8.2                    quantiles of x specified in the
                                argument probs
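
The value 1.8 for the 10% quantile is not one of the data points. A minimal sketch of where it comes from, assuming linear interpolation between order statistics at position 1 + p*(n-1) (this reproduces the output shown; other quantile definitions exist). It is written for R, and the names h, lo, hi and by.hand are illustrative only.

```r
# 10% quantile by hand, assuming linear interpolation between order statistics.
x <- c(1, 2, 3, 3, 3, 4, 7, 8, 9, NA)
xs <- sort(x[!is.na(x)])            # ordered, non-missing data
n <- length(xs)
p <- 0.1
h <- 1 + p * (n - 1)                # fractional position in the ordered data
lo <- floor(h)                      # order statistic just below the position
hi <- ceiling(h)                    # order statistic just above the position
by.hand <- xs[lo] + (h - lo) * (xs[hi] - xs[lo])
built.in <- as.numeric(quantile(x, probs = p, na.rm = TRUE))
```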

If there are no missing values in the vector x, it is not necessary to specify na.rm=T; simply use min(x), max(x), etc.

These functions may also be used on matrices; they are not applied to the rows or columns individually but instead return the max, min, etc. of the whole matrix.
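
When a row-by-row or column-by-column result is wanted, one option is the apply() function, which is not covered in this section. A small sketch, written for R, with an illustrative matrix m:

```r
# Whole-matrix maximum versus per-row and per-column maxima via apply().
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)   # filled column by column
whole <- max(m)                  # single value for the whole matrix
row.max <- apply(m, 1, max)      # second argument 1 = apply over rows
col.max <- apply(m, 2, max)      # second argument 2 = apply over columns
```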

> var(x[!is.na(x)])
[1] 8.027778                  * missing values are removed from the vector
                                x using the subscript !is.na(x)
                              * specifying two arguments to the var()
                                function, var(x,y) returns the covariance
                                between the two arguments
                              * arguments may be vectors or matrices

> y <- c(1,2,3,4,5,6,7,8,9,10)
> cor(x[!is.na(x)],y[!is.na(x)])
[1] 0.9504597                 * because the cor() function requires x
                                and y to be of the same length, it is
                                necessary to remove the value of y
                                corresponding to the missing value in x;
                                this is done using y[!is.na(x)]
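
The covariance and correlation are linked: the correlation is the covariance divided by the product of the two standard deviations. The sketch below, written for R, checks this on the same data. The index keep is an illustrative name; it also guards against missing values in y, which the example above does not need.

```r
# Covariance via var(x, y), and correlation rebuilt from it.
x <- c(1, 2, 3, 3, 3, 4, 7, 8, 9, NA)
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
keep <- !is.na(x) & !is.na(y)        # positions where both values are present
xk <- x[keep]
yk <- y[keep]
cov.xy <- var(xk, yk)                # two arguments: covariance
r <- cov.xy / sqrt(var(xk) * var(yk))   # should agree with cor(xk, yk)
```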

> summary(x)
 Min. 1st Qu. Median  Mean 3rd Qu. Max. NA's
    1       3      3 4.444       7    9    1


> z <- c(5,4,3,2,1,9,8,7,6,5)
> pmax(x,y,z)
 [1]  5  4  3  4  5  9  8  8  9 NA
> pmin(x,y,z)
 [1]  1  2  3  2  1  4  7  7  6 NA
                              * pmax() returns the maximum value for each
                                position in a number of vectors
                              * likewise, pmin() returns the minimum value
                                for each position
                              * na.rm=T may also be specified to remove
                                missing values
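
A short sketch of the na.rm case, written for R (TRUE is spelled out instead of T). With the missing value ignored, the tenth position is decided by y and z alone.

```r
# Parallel maxima and minima with missing values ignored.
x <- c(1, 2, 3, 3, 3, 4, 7, 8, 9, NA)
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
z <- c(5, 4, 3, 2, 1, 9, 8, 7, 6, 5)
hi <- pmax(x, y, z, na.rm = TRUE)    # position 10 compares only 10 and 5
lo <- pmin(x, y, z, na.rm = TRUE)
```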

Statistical Distributions

    <dist>        Parameters            Defaults            Distribution

    beta          shape1, shape2        -, -                Beta
    binom         size,prob             -, -                Binomial
    cauchy        location, scale       0, 1                Cauchy
    chisq         df                    -                   Chisquare
    exp           rate (1/mean)         1                   Exponential
    f             df1, df2              -, -                F
    gamma         shape                 -                   GAMMA
    geom          prob                  -                   Geometric
    hyper         m, n, k               -, -, -             Hypergeometric
    lnorm         mean, sd (of log)     0, 1                Lognormal
    logis         location, scale       0, 1                Logistic
    norm          mean, sd              0, 1                Normal
    nrange        size, nevals          -, 200              Normal Range
                                        -, - for rnrange
    pois          lambda                -                   Poisson
    t             df                    -                   Student's t
    unif          min, max              0, 1                Uniform
    weibull       shape                 -                   Weibull
    wilcox        m, n                  -, -                Wilcoxon

For each distribution, d<dist>() gives the density, p<dist>() the cumulative probability, q<dist>() the quantile, and r<dist>() random values. For help on these functions, use help() with the name of the distribution as it appears in the Distribution column (e.g., help(GAMMA)), with the following exceptions: for logis, type help(dlogis); for nrange, type help(dnrange); for the F distribution and Student's t distribution, type help.start(gui='motif'), click on Probability Distributions and Random Numbers under the Categories column, then click on F or T in the left-hand column.

> dnorm(0)
[1] 0.3989423                     * returns the density at 0 for the
                                    normal distribution

> X11()                                    
> plot(seq(-3,3,0.1), dnorm(seq(-3,3,0.1)), type="l")
                              
                                  * the d<dist>() functions can be
                                    used to plot the density function
                                    for each of the above distributions

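
The same naming pattern applies to distributions that have no defaults, where the parameters from the table must be supplied. A sketch for the binomial, written for R; the values size=10 and prob=0.3 are arbitrary choices for illustration.

```r
# Density and cumulative probability for a binomial with required parameters.
size <- 10
prob <- 0.3
dens <- dbinom(0:size, size, prob)   # P(X = 0), ..., P(X = 10)
total <- sum(dens)                   # probabilities over all outcomes
cum3 <- pbinom(3, size, prob)        # P(X <= 3)
```

Here total should be 1 and cum3 should match sum(dens[1:4]), up to rounding.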
> pnorm(1.96)
[1] 0.9750021                     * returns the cumulative probability
                                    at 1.96 for the normal distribution

> qnorm(0.9750021)
[1] 1.96                          * returns the 97.5th percentile for
                                    the normal distribution
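
pnorm() and qnorm() are inverses of each other, which is why the two results above mirror one another. A brief sketch, written for R, that also computes a two-sided critical value; alpha is an illustrative name.

```r
# Round trip between cumulative probability and quantile.
p <- pnorm(1.96)                     # cumulative probability at 1.96
back <- qnorm(p)                     # recovers 1.96
alpha <- 0.05
crit <- qnorm(1 - alpha / 2)         # two-sided 5% critical value, about 1.96
inside <- pnorm(crit) - pnorm(-crit) # probability between -crit and crit
```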

> rnorm(5)
[1] -0.7160094  0.3953744  1.2587492  0.3022640 -0.4109508
                                  * generates 5 random standard normal
                                    variables

> rexp(5,1/3)           
[1] 0.1204068 0.1937435 9.3637550 0.8051347 1.0450249
                                  * this could also have been written as
                                    > rexp(5, rate=1/3)
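
Because rate is 1/mean, rexp(n, 1/3) draws from an exponential distribution with mean 3. The sketch below, written for R, checks this roughly with a larger sample. The seed value and sample size are arbitrary, and set.seed() is used only so that the draws can be repeated.

```r
# Sample mean of exponential draws with rate 1/3 should be close to 3.
set.seed(1)                          # arbitrary seed for repeatability
draws <- rexp(10000, rate = 1/3)
sample.mean <- mean(draws)           # close to 3, not exactly 3
```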

Further Reading

Richard A. Becker, John M. Chambers, Allan R. Wilks, The New S Language: A Programming Environment for Data Analysis and Graphics, Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, California, 1988, pp. 45, 48-50, 539
