EXAMPLE 1 - CANCER MORTALITY RATES FOR 9 OREGON COUNTIES

x - index of exposure (C1)
y - cancer mortality rate per 100,000 people (C2)

a. SCATTERPLOT

MTB > plot c2 c1

[Character plot of rate (vertical axis, about 105 to 210) against index (horizontal axis, about 2.0 to 12.0); the points rise from lower left to upper right.]

There is a fairly strong positive linear association between the two variables. There are no apparent outliers.

b. MTB > regress c2 on 1 in c1;
   SUBC> predict 4;
   SUBC> predict 15.

The regression equation is
rate = 115 + 9.23 index

Predictor       Coef     Stdev   t-ratio        p
Constant     114.716     8.046     14.26    0.000
index          9.231     1.419      6.51    0.000

s = 14.01     R-sq = 85.8%     R-sq(adj) = 83.8%

Analysis of Variance

SOURCE        DF        SS        MS        F        p
Regression     1    8309.6    8309.6    42.34    0.000
Error          7    1373.9     196.3
Total          8    9683.5

     Fit   Stdev.Fit          95% C.I.            95% P.I.
  151.64        4.75    (140.40, 162.88)    (116.65, 186.63)
  253.19       15.45    (216.64, 289.74)    (203.85, 302.52)  XX

X denotes a row with x values away from the center
XX denotes a row with very extreme x values

Note that the slope is positive, as expected from the scatterplot.

c. The slope = 9.23. This means that, according to the regression equation, for a one-unit increase in the index of exposure, cancer mortality increases on average by 9.23 deaths per 100,000.

d. According to the regression equation, when the index of exposure equals 4, the cancer mortality rate equals 115 + 9.23(4) = 151.92 (using the Minitab output, fit = 151.64).

e. According to the regression equation, when the index of exposure equals 15, the cancer mortality rate equals 115 + 9.23(15) = 253.45 (using the Minitab output, fit = 253.19).

Note the XX beside the output row for "predict 15". This indicates that we are extrapolating: the x value we are predicting at is much larger than the largest value of x in the data set. The relationship between x and y described by the regression line may not hold for values of x as high as 15.

f. MTB > regress c2 on 1 in c1 c3 c4;
   SUBC> resids c5.

The regression equation is
rate = 115 + 9.23 index

Predictor       Coef     Stdev   t-ratio        p
Constant     114.716     8.046     14.26    0.000
index          9.231     1.419      6.51    0.000

s = 14.01     R-sq = 85.8%     R-sq(adj) = 83.8%

Analysis of Variance

SOURCE        DF        SS        MS        F        p
Regression     1    8309.6    8309.6    42.34    0.000
Error          7    1373.9     196.3
Total          8    9683.5

c3 is a column of standardized residuals
c4 is a column of fitted values
c5 is a column of raw residuals (ei)

To confirm that c5 is a column of the raw residuals (ei):

raw residual = y - fitted value

c2 contains the y values and c4 the corresponding fitted values, so raw residuals = c2 - c4.

MTB > let c6 = c2-c4
MTB > print c1 c2 c4 c6 c5

ROW   index    rate        C4         C6         C5
  1    2.49   147.1   137.702     9.3980     9.3980
  2    2.57   130.1   138.440    -8.3405    -8.3405
  3    3.41   129.9   146.195   -16.2949   -16.2949
  4    1.25   113.5   126.255   -12.7550   -12.7550
  5    1.62   137.5   129.671     7.8294     7.8294
  6    3.83   162.3   150.072    12.2279    12.2279
  7   11.64   207.5   222.170   -14.6698   -14.6698
  8    6.41   177.9   173.889     4.0107     4.0107
  9    8.34   210.3   191.706    18.5940    18.5940

Note that c5 and c6 are the same (as they should be).

Remember that for the least squares line the sum of the residuals equals zero. To confirm this:

MTB > sum c5
SUM = -0.000022888

(Because of rounding errors the calculated sum is not exactly zero.)

TO CALCULATE THE SUM OF SQUARES DUE TO ERROR

MTB > let c7 = c5*c5
MTB > sum c7
SUM = 1373.9

(This corresponds to the value for SS Error in the ANOVA table.)
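The same numbers can be reproduced outside Minitab. Below is a minimal Python sketch (not part of the original session; it assumes NumPy is available) that refits the line to the nine (index, rate) pairs from the data listing above and repeats the residual checks:

import numpy as np

# The nine data pairs from the listing printed by "print c1 c2 c4 c6 c5"
index = np.array([2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34])
rate  = np.array([147.1, 130.1, 129.9, 113.5, 137.5, 162.3, 207.5, 177.9, 210.3])

# Least squares fit: rate = b0 + b1 * index (polyfit returns the slope first)
b1, b0 = np.polyfit(index, rate, 1)
print(b0, b1)                      # about 114.72 and 9.23, as in the Minitab output

# Fitted values and raw residuals (the analogues of c4 and c5)
fitted = b0 + b1 * index
resid = rate - fitted
print(resid.sum())                 # essentially zero, apart from rounding error
print((resid ** 2).sum())          # about 1373.9 = SS Error on the ANOVA table

# Predictions at index = 4 and index = 15 (parts d and e);
# index = 15 lies well above the observed range, i.e. extrapolation
print(b0 + b1 * 4, b0 + b1 * 15)   # about 151.6 and 253.2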
g. R-squared is the coefficient of determination; it tells us what proportion of the variability in y is explained by x. In this case R-squared is fairly large: 85.8% of the variability in y is explained by x.

h. MTB > CORR C1 C2

Correlation of index and rate = 0.926

(CORR C2 C1 gives the same output.)
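As a final cross-check (again a Python sketch outside the original Minitab session, using the same nine data pairs), squaring the correlation recovers R-sq: in simple linear regression r squared equals the coefficient of determination, and r carries the sign of the slope.

import numpy as np

index = np.array([2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34])
rate  = np.array([147.1, 130.1, 129.9, 113.5, 137.5, 162.3, 207.5, 177.9, 210.3])

r = np.corrcoef(index, rate)[0, 1]
print(r)          # about 0.926, matching CORR C1 C2
print(r ** 2)     # about 0.858 = R-sq from the regression output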