Parametric Statistics
32 Research Design Applications with PROC GLM
Learner Outcomes
After reading this chapter you should be able to:
- Compute the significance of the difference between three or more sample means using PROC GLM for the one-way analysis of variance test
- Compute the significance of the association between an outcome and one or several predictors using PROC GLM as a linear regression model
- Compute the post hoc comparison between sample means when the F statistic is significant using post hoc analysis procedures (in either ANOVA applications or linear regression applications)
INTRODUCTION TO GENERAL LINEAR MODELS IN SAS
A univariate general linear model is a statistical model in which a single dependent variable is modeled in relation to a set of predictor variables. The predictors can be categorical independent variables with multiple levels, continuous variables, or a combination of the two. In research designs where the dependent variable is a continuous scaled score and the independent variables are categorical, the researcher can use either the analysis of variance or a general linear model.
In SAS, the F statistic can be computed with either the PROC ANOVA procedure described previously or the PROC GLM procedure. Both offer similar post-analytic processes that establish not only the significance of the main effects but also characteristics of the distribution, such as measures of normality and equality of variance. However, PROC ANOVA has limitations that make PROC GLM the more appropriate choice in several situations. For example, PROC GLM is preferable to PROC ANOVA when the comparison groups are unbalanced, when categorical and continuous predictors are combined as in an analysis of covariance, and when the dependent measure is evaluated using complex interactions as in nested designs.
In this chapter, we will explore the SAS application of the PROC GLM procedures to evaluate the F statistic represented by the statement: F = variance between samples divided by the variance within samples. Next, we will explore the relationship between the outcome and predictor variables based on the concept that the dependent variable = independent variable ± error, which we can represent algebraically as: [latex]Y_{ij} = \beta_{0} \pm \beta_{i}X_{i} + \epsilon[/latex]
Extending this General Linear Model (GLM) approach, we will introduce the General Linear Mixed Model, which we will analyze with PROC MIXED. The mixed model adds the parameter [latex]U_{i}[/latex], which represents the random effect, to the General Linear Model equation: [latex]Y_{ij} = \beta_{0} \pm \beta_{i}X_{i} \pm U_{i} + \epsilon[/latex]
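For reference, the verbal definition of the F statistic given above can be written out in the usual one-way ANOVA notation. The symbols here are assumptions introduced for illustration and do not appear elsewhere in the chapter: k is the number of groups, N is the total number of observations, n_j is the size of group j, and the bars denote the group mean and the grand mean.

[latex]F = \frac{MS_{between}}{MS_{within}} = \frac{SS_{between}/(k-1)}{SS_{within}/(N-k)}[/latex]

[latex]SS_{between} = \sum_{j=1}^{k} n_{j}(\bar{Y}_{j} - \bar{Y})^{2}, \qquad SS_{within} = \sum_{j=1}^{k}\sum_{i=1}^{n_{j}} (Y_{ij} - \bar{Y}_{j})^{2}[/latex]

In the PROC GLM output, the numerator corresponds to the Model mean square and the denominator to the Error mean square.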
Applying PROC GLM to evaluate a one-way ANOVA design.
The following describes a 12-week experiment in which researchers were interested in the effects of coffee consumption on resting systolic blood pressure in a sample of healthy male participants. The study participants were randomly selected from the total sample of volunteers and randomly allocated to three groups of 20 individuals each. Each morning of the 12-week program, between the hours of 6 and 8 am, Group 1 was asked to consume a total of 2000 ml of coffee, Group 2 a total of 2000 ml of de-caffeinated coffee, and Group 3 a total of 2000 ml of hot water with no additive. Resting systolic blood pressure was measured on day 84 and recorded in the following table; this day-84 measure is the dependent variable. The raw data and SAS code are shown below:
Group 1 – caffeinated coffee, Systolic Blood Pressure (mmHg) | Group 2 – de-caffeinated coffee, Systolic Blood Pressure (mmHg) | Group 3 – Placebo, Systolic Blood Pressure (mmHg) |
134 | 115 | 125 |
152 | 114 | 126 |
161 | 119 | 128 |
139 | 115 | 122 |
149 | 114 | 126 |
158 | 113 | 117 |
167 | 115 | 113 |
151 | 111 | 116 |
148 | 123 | 114 |
144 | 110 | 115 |
124 | 115 | 129 |
122 | 116 | 116 |
121 | 113 | 118 |
129 | 119 | 112 |
129 | 111 | 116 |
128 | 112 | 127 |
127 | 110 | 123 |
131 | 115 | 126 |
128 | 111 | 124 |
124 | 114 | 125 |
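data glm1;
title 'GLM analysis of Systolic Blood Pressure Data';
/* Each data line holds one reading per group, in the order */
/* caffeinated (grp 1), de-caffeinated (grp 2), placebo (grp 3). */
input sbp1 sbp2 sbp3;
array sbp{3} sbp1-sbp3;
/* Reshape to one row per participant with the variables id, grp, sysbp */
do grp = 1 to 3;
id + 1;
sysbp = sbp{grp};
output;
end;
keep id grp sysbp;
datalines;
134 115 125
152 114 126
161 119 128
139 115 122
149 114 126
158 113 117
167 115 113
151 111 116
148 123 114
144 110 115
124 115 129
122 116 116
121 113 118
129 119 112
129 111 116
128 112 127
127 110 123
131 115 126
128 111 124
124 114 125
;
run;
proc sort data=glm1; by id; run;
proc glm data=glm1;
class grp;
model sysbp = grp;
run;
quit;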
The output from this SAS Program is explained below.
GLM analysis of Systolic Blood Pressure Data using Systolic Blood Pressure (SYSBP) as the Dependent Variable
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
Model | 2 | 6169.23333 | 3084.61667 | 37.57 | <.0001 |
Error | 57 | 4679.75000 | 82.10088 | | |
Corrected Total | 59 | 10848.98333 | | | |
R-Square | Coeff Var | Root MSE | sysbp Mean |
0.568646 | 7.278849 | 9.060953 | 124.4833 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
grp | 2 | 6169.233333 | 3084.616667 | 37.57 | <.0001 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
grp | 2 | 6169.233333 | 3084.616667 | 37.57 | <.0001 |
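As a check on how the summary statistics relate to one another, the F value, R-Square, Root MSE, and coefficient of variation can be recovered from the sums of squares and mean squares reported in the tables above. This is a worked illustration using only the printed output values:

[latex]F = \frac{MS_{Model}}{MS_{Error}} = \frac{3084.61667}{82.10088} \approx 37.57[/latex]

[latex]R^{2} = \frac{SS_{Model}}{SS_{Corrected\ Total}} = \frac{6169.23333}{10848.98333} \approx 0.5686[/latex]

[latex]Root\ MSE = \sqrt{82.10088} \approx 9.061, \qquad Coeff\ Var = 100 \times \frac{9.060953}{124.4833} \approx 7.279[/latex]

Because the model has a single factor and the groups are balanced, the Type I and Type III sums of squares for grp are identical to the Model sum of squares.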
The group means were compared by adding the statement lsmeans grp / adjust=scheffe; to the PROC GLM step, with the results shown here.
GLM analysis of Systolic Blood Pressure Data
The GLM Procedure using Least Squares Means Adjustment for Multiple Comparisons: Scheffe
grp | sysbp LSMEAN | LSMEAN Number |
1 | 138.300000 | 1 |
2 | 114.250000 | 2 |
3 | 120.900000 | 3 |
Least Squares Means for effect grp
Pr > |t| for H0: LSMean(i)=LSMean(j)
Dependent Variable: sysbp
i/j | 1 | 2 | 3 |
1 | | <.0001 | <.0001 |
2 | <.0001 | | 0.0763 |
3 | <.0001 | 0.0763 | |
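For context, the LSMEANS statement belongs inside the PROC GLM step, after the MODEL statement. The following is a minimal sketch, assuming the glm1 dataset holds one row per participant with the variables grp and sysbp; PDIFF is written explicitly here, although it is implied when ADJUST= is specified.

```sas
proc glm data=glm1;
   class grp;
   model sysbp = grp;
   /* Scheffe-adjusted p-values for all pairwise differences in LS-means */
   lsmeans grp / pdiff adjust=scheffe;
run;
quit;
```

With balanced groups and a single factor, the least squares means equal the ordinary group means, which is why the LSMEAN column matches the means reported in the post hoc tables.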
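The homogeneity of variance test, Welch's ANOVA, and the post hoc tests shown next were requested by adding the following MEANS statement to the PROC GLM step. The hovtest option defaults to Levene's test, and the t option requests the LSD t tests.
means grp / hovtest welch t tukey scheffe;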
GLM analysis of Systolic Blood Pressure Data- Main Effects Analysis
Levene’s Test for Homogeneity of sysbp Variance
ANOVA of Squared Deviations from Group Means
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
grp | 2 | 406309 | 203155 | 15.59 | <.0001 |
Error | 57 | 742833 | 13032.2 |
Welch’s ANOVA for sysbp
Source | DF | F Value | Pr > F |
grp | 2.0000 | 33.35 | <.0001 |
Error | 32.1316 |
GLM analysis of Systolic Blood Pressure Data with the Post Hoc t Tests (LSD) for sysbp
Note: This test controls the Type I comparison-wise error rate, not the experiment-wise error rate.
Alpha | 0.05 |
Error Degrees of Freedom | 57 |
Error Mean Square | 82.10088 |
Critical Value of t | 2.00247 |
Least Significant Difference | 5.7377 |
Means with the same letter are not significantly different.
t Grouping | Mean | N | grp |
A | 138.300 | 20 | 1 |
B | 120.900 | 20 | 3 |
C | 114.250 | 20 | 2 |
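The Least Significant Difference reported above can be reproduced from the critical t value, the error mean square, and the common group size of n = 20. This worked line uses the standard LSD formula for equal group sizes and only the values printed in the output:

[latex]LSD = t_{crit}\sqrt{\frac{2\,MS_{Error}}{n}} = 2.00247 \times \sqrt{\frac{2 \times 82.10088}{20}} \approx 5.7377[/latex]

The differences between adjacent means are 138.300 - 120.900 = 17.400 and 120.900 - 114.250 = 6.650. Both exceed 5.7377, which is why each group receives its own letter.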
GLM analysis of Systolic Blood Pressure Data with the Tukey’s Studentized Range (HSD) Test for sysbp
Note: This test controls the Type I experiment-wise error rate, but it generally has a higher Type II error rate than REGWQ.
Alpha | 0.05 |
Error Degrees of Freedom | 57 |
Error Mean Square | 82.10088 |
Critical Value of Studentized Range | 3.40311 |
Minimum Significant Difference | 6.895 |
Means with the same letter are not significantly different.
Tukey Grouping | Mean | N | grp |
A | 138.300 | 20 | 1 |
B | 120.900 | 20 | 3 |
B | 114.250 | 20 | 2 |
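The Minimum Significant Difference for Tukey's test follows from the critical value of the studentized range, again assuming the standard equal-n formula with n = 20 per group:

[latex]MSD_{Tukey} = q_{crit}\sqrt{\frac{MS_{Error}}{n}} = 3.40311 \times \sqrt{\frac{82.10088}{20}} \approx 6.895[/latex]

The difference between groups 3 and 2 (6.650) falls just short of 6.895, so those two groups share the letter B, whereas the unadjusted LSD test separated them.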
GLM analysis of Systolic Blood Pressure Data with the Scheffe’s Test for sysbp
Note: This test controls the Type I experiment-wise error rate.
Alpha | 0.05 |
Error Degrees of Freedom | 57 |
Error Mean Square | 82.10088 |
Critical Value of F | 3.15884 |
Minimum Significant Difference | 7.202 |
Means with the same letter are not significantly different.
Scheffe Grouping | Mean | N | grp |
A | 138.300 | 20 | 1 |
B | 120.900 | 20 | 3 |
B | 114.250 | 20 | 2 |
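Scheffe's Minimum Significant Difference can be reproduced in the same way from the critical F value, with k = 3 groups and n = 20 per group, assuming the standard equal-n form of the Scheffe criterion:

[latex]MSD_{Scheffe} = \sqrt{(k-1)F_{crit}}\,\sqrt{\frac{2\,MS_{Error}}{n}} = \sqrt{2 \times 3.15884} \times \sqrt{\frac{2 \times 82.10088}{20}} \approx 7.202[/latex]

Scheffe's criterion is the most conservative of the three (7.202 versus 6.895 for Tukey and 5.7377 for LSD), and it reaches the same grouping as Tukey's test for these data.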
If we rerun the analysis with the CLASS statement removed, SAS treats grp as a numeric predictor and reports the regression coefficients for the intercept and the slope.
proc glm data=glm1;
model sysbp = grp;
run;
quit;
Parameter | Estimate | Standard Error | t Value | Pr > |t| |
Intercept | 141.8833333 | 3.96644269 | 35.77 | <.0001 |
grp | -8.7000000 | 1.83610618 | -4.74 | <.0001 |
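The slope above describes a straight-line trend across the group codes 1, 2, and 3, which are labels rather than quantities. If coefficients for grp as a categorical predictor are wanted, one option is to keep the CLASS statement and add the SOLUTION option to the MODEL statement. This is a sketch that assumes the same glm1 dataset with the variables grp and sysbp.

```sas
proc glm data=glm1;
   class grp;
   /* SOLUTION prints the parameter estimates for each level of grp */
   model sysbp = grp / solution;
run;
quit;
```

Under PROC GLM's default parameterization the last level of grp serves as the reference, so the intercept should correspond to the group 3 mean (120.9) and the estimates for groups 1 and 2 to their differences from it (138.3 - 120.9 = 17.4 and 114.25 - 120.9 = -6.65).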
Adding A Second Grouping Factor To a GLM Model
Consider the analysis from the PROC ANOVA computations in Chapter 9, where we were interested in evaluating the effects of introducing a one-hour activity break into the workday, believing that such an opportunity could reduce the resting heart rates of the participants and thereby lead to a healthier workforce.
You will recall that the research design began with 66 participants who were randomly selected from a sample of employees within the company and randomly allocated to one of three treatment groups. In the following analysis, we use PROC GLM and the post hoc LSMEANS statement to evaluate the interaction cell by cell, comparing the individual cell means across the treatment levels (walking versus dancing versus book reading) for each level of sex (males versus females).
proc glm data=anova2x3;
title 'Using PROC GLM to determine interaction effect';
class sex group;
model hrchange = sex group sex*group;
/* pdiff requests p-values for all pairwise differences between cell means */
lsmeans sex*group / pdiff;
run;
quit;
The results from the LSMEANS analysis are shown here.
Using PROC GLM to determine interaction effect
The GLM Procedure: Least Squares Means
sex | group | hrchange LSMEAN | LSMEAN Number |
F | 1 | -4.5454545 | 1 |
F | 2 | -10.3181818 | 2 |
F | 3 | 5.8181818 | 3 |
M | 1 | -4.2727273 | 4 |
M | 2 | -2.0000000 | 5 |
M | 3 | 6.5454545 | 6 |
Least Squares Means for effect sex*group
Pr > |t| for H0: LSMean(i)=LSMean(j)
Dependent Variable: hrchange
i/j | 1 | 2 | 3 | 4 | 5 | 6 |
1 | | <.0001 | <.0001 | 0.8183 | 0.0336 | <.0001 |
2 | <.0001 | | <.0001 | <.0001 | <.0001 | <.0001 |
3 | <.0001 | <.0001 | | <.0001 | <.0001 | 0.5404 |
4 | 0.8183 | <.0001 | <.0001 | | 0.0573 | <.0001 |
5 | 0.0336 | <.0001 | <.0001 | 0.0573 | | <.0001 |
6 | <.0001 | <.0001 | 0.5404 | <.0001 | <.0001 | |
Notice that the matrix gives the probability level for each pairwise comparison between cell means. Most comparisons were significantly different; the only comparisons with p > 0.05 were cell 1 versus cell 4 (p = 0.8183), cell 3 versus cell 6 (p = 0.5404), and cell 4 versus cell 5 (p = 0.0573). These results support the notion that being physically active, whether it be dancing or walking as planned exercise, has a positive effect on reducing resting heart rates, and more so for females than males.
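A possible follow-up, offered as a sketch rather than as part of the original analysis, is to test the effect of group separately within each sex and to adjust the pairwise p-values for multiple comparisons, since the matrix above contains 15 unadjusted comparisons. The example assumes the anova2x3 dataset with the variables sex, group, and hrchange used above; SLICE= and ADJUST= are options of the LSMEANS statement in PROC GLM.

```sas
proc glm data=anova2x3;
   class sex group;
   model hrchange = sex group sex*group;
   /* slice=sex tests the group effect within each level of sex;      */
   /* adjust=tukey applies a Tukey adjustment to the pairwise p-values */
   lsmeans sex*group / slice=sex pdiff adjust=tukey;
run;
quit;
```

The sliced tests indicate whether the treatment groups differ among females and among males separately, which bears directly on the claim that the effect is larger for females.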