Parametric Statistics

# 31 Research Design Applications with PROC GLM

Learner Outcomes

After reading this chapter you should be able to:

• Compute the significance of the difference between three or more sample means using PROC GLM for the one-way analysis of variance test
• Compute the significance of the association between an outcome and one or several predictors using PROC GLM as a linear regression model
• Compute post hoc comparisons between sample means when the F statistic is significant, using post hoc analysis procedures (in either ANOVA applications or linear regression applications)

INTRODUCTION TO GENERAL LINEAR MODELS IN SAS

A univariate general linear model is defined as a statistical model in which a dependent variable is modeled in relation to a set of predictor variables. The predictor variables can be categorical independent variables with multiple levels, or they can be a continuous variable, or the predictor variables can be a combination of categorical and continuous independent variables. In the application of statistical processing for research designs, where the dependent variable is a continuous scaled score, and the independent variables are categorically scored, the researcher can use either the analysis of variance or a general linear model.

In SAS, the F statistic can be computed with either the PROC ANOVA procedure described previously or the PROC GLM procedure. Both offer similar post-analytic processes to establish not only the significance of the main effects but also characteristics of the distribution, such as measures of normality and equality of variance. However, PROC ANOVA has limitations that make PROC GLM the more appropriate choice in several situations. For example, PROC GLM is preferable to PROC ANOVA when the comparison groups are unbalanced, when categorical and continuous predictors are combined as in an analysis of covariance, and when the dependent measure is evaluated using complex interactions as in nested designs.

In this chapter, we will explore the SAS application of the PROC GLM procedure to evaluate the F statistic, represented by the statement: F = variance between samples divided by the variance within samples. Next, we will explore the relationship between the outcome and predictor variables based on the concept that the dependent variable = an intercept + the effect of the independent variable + error, which we can represent algebraically as: $Y_{ij} = \beta_{0} + \beta_{i}X_{i} + \epsilon$. The direction of each effect is carried by the sign of its coefficient, so the terms are simply added.

Extending from this General Linear Model (GLM) approach, we will introduce the General Linear Mixed Model, which we will analyze with PROC MIXED. The mixed model adds the parameter $U_{i}$ to the General Linear Model equation; this parameter represents the random effect in the model: $Y_{ij} = \beta_{0} + \beta_{i}X_{i} + U_{i} + \epsilon$
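Returning to the F statistic described in words earlier in this section, the ratio can be written out for the one-way design as follows. The symbols $k$ (number of groups), $n_{j}$ (size of group $j$), $N$ (total sample size), $\bar{Y}_{j}$ (group mean), and $\bar{Y}$ (grand mean) are notation assumed here for illustration and do not appear in the original text.

```latex
F = \frac{MS_{between}}{MS_{within}}
  = \frac{\sum_{j=1}^{k} n_{j}\,(\bar{Y}_{j} - \bar{Y})^{2} \,/\, (k-1)}
         {\sum_{j=1}^{k}\sum_{i=1}^{n_{j}} (Y_{ij} - \bar{Y}_{j})^{2} \,/\, (N-k)}
```

In the PROC GLM output, the numerator corresponds to the Mean Square on the Model line and the denominator to the Mean Square on the Error line.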

Applying PROC GLM to evaluate a one-way ANOVA design.

The following describes a 12-week experiment in which researchers were interested in the effects of coffee consumption on resting systolic blood pressure in a sample of healthy male participants. The study participants were randomly selected from the total sample of volunteers and randomly allocated to three groups. Each group comprised 20 individuals who were asked to consume a total of 2000 ml of an assigned beverage each morning of the 12-week program between the hours of 6 and 8 am: caffeinated coffee for Group 1, de-caffeinated coffee for Group 2, and hot water with no additive for Group 3. Resting systolic blood pressure was measured on day 84 and recorded in the following table; this day-84 measure is the dependent variable. The raw data and SAS code are shown below:

| Group 1 – caffeinated coffee: Systolic Blood Pressure (mmHg) | Group 2 – de-caffeinated coffee: Systolic Blood Pressure (mmHg) | Group 3 – Placebo: Systolic Blood Pressure (mmHg) |
|---|---|---|
| 134 | 115 | 125 |
| 152 | 114 | 126 |
| 161 | 119 | 128 |
| 139 | 115 | 122 |
| 149 | 114 | 126 |
| 158 | 113 | 117 |
| 167 | 115 | 113 |
| 151 | 111 | 116 |
| 148 | 123 | 114 |
| 144 | 110 | 115 |
| 124 | 115 | 129 |
| 122 | 116 | 116 |
| 121 | 113 | 118 |
| 129 | 119 | 112 |
| 129 | 111 | 116 |
| 128 | 112 | 127 |
| 127 | 110 | 123 |
| 131 | 115 | 126 |
| 128 | 111 | 124 |
| 124 | 114 | 125 |

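options pagesize=55 linesize=120 center date;
data glm1;
title 'GLM analysis of Systolic Blood Pressure Data';
/* Each data line holds one reading per group (groups 1, 2, 3 from left to right). */
/* The loop writes one observation per reading so grp and sysbp are separate variables. */
do grp = 1 to 3;
  id + 1;           /* running participant number, 1 to 60 */
  input sysbp @;    /* trailing @ keeps the pointer on the current data line */
  output;
end;
datalines;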
134 115 125
152 114 126
161 119 128
139 115 122
149 114 126
158 113 117
167 115 113
151 111 116
148 123 114
144 110 115
124 115 129
122 116 116
121 113 118
129 119 112
129 111 116
128 112 127
127 110 123
131 115 126
128 111 124
124 114 125
;
proc sort data=glm1; by id; run;
proc glm data=glm1;
class grp;
model sysbp = grp;
run;
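
Before interpreting the model, it can help to confirm that the data step produced the intended layout. The following is a minimal sketch, assuming glm1 holds one observation per participant with a numeric grp (1 to 3) and sysbp; the PROC MEANS step is an addition for checking purposes and is not part of the original program.

```sas
/* Descriptive check: one row of output per group */
proc means data=glm1 n mean std min max maxdec=2;
  class grp;      /* summarize separately for groups 1, 2, and 3 */
  var sysbp;      /* resting systolic blood pressure on day 84 */
run;
```

If the data were read correctly, each group should show N = 20, and the group means should agree with the least squares means reported later in this section (138.30, 114.25, and 120.90).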

The output from this SAS Program is explained below.

GLM analysis of Systolic Blood Pressure Data using Systolic Blood Pressure (SYSBP) as the Dependent Variable

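| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 2 | 6169.23333 | 3084.61667 | 37.57 | <.0001 |
| Error | 57 | 4679.75000 | 82.10088 | | |
| Corrected Total | 59 | 10848.98333 | | | |

| R-Square | Coeff Var | Root MSE | sysbp Mean |
|---|---|---|---|
| 0.568646 | 7.278849 | 9.060953 | 124.4833 |

| Source | DF | Type I SS | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| grp | 2 | 6169.233333 | 3084.616667 | 37.57 | <.0001 |

| Source | DF | Type III SS | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| grp | 2 | 6169.233333 | 3084.616667 | 37.57 | <.0001 |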
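
The summary statistics in this output can be reproduced by hand from the sums of squares, which is a useful way to see how the pieces fit together. The arithmetic below uses only values printed in the output above; the rounding is approximate.

```latex
\begin{aligned}
F &= \frac{MS_{Model}}{MS_{Error}} = \frac{3084.61667}{82.10088} \approx 37.57 \\
R^{2} &= \frac{SS_{Model}}{SS_{Corrected\ Total}} = \frac{6169.23333}{10848.98333} \approx 0.5686 \\
\text{Root MSE} &= \sqrt{MS_{Error}} = \sqrt{82.10088} \approx 9.061 \\
\text{Coeff Var} &= 100 \times \frac{\text{Root MSE}}{\text{sysbp Mean}} = 100 \times \frac{9.060953}{124.4833} \approx 7.279
\end{aligned}
```

Because grp is the only term in the model, the Type I and Type III sums of squares for grp equal the Model sum of squares.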

The group means were then compared by adding the statement lsmeans grp / adjust=scheffe; to the PROC GLM step. The output is shown here.
GLM analysis of Systolic Blood Pressure Data
The GLM Procedure using Least Squares Means Adjustment for Multiple Comparisons: Scheffe

| grp | sysbp LSMEAN | LSMEAN Number |
|---|---|---|
| 1 | 138.300000 | 1 |
| 2 | 114.250000 | 2 |
| 3 | 120.900000 | 3 |

Least Squares Means for effect grp. Pr > |t| for H0: LSMean(i)=LSMean(j). Dependent Variable: sysbp

| i/j | 1 | 2 | 3 |
|---|---|---|---|
| 1 | | <.0001 | <.0001 |
| 2 | <.0001 | | 0.0763 |
| 3 | <.0001 | 0.0763 | |

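The homogeneity of variance tests and the post hoc comparisons that follow were requested by adding a MEANS statement to the same PROC GLM step (the t option requests the LSD comparisons shown below):

means grp / hovtest welch t tukey scheffe;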
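
For reference, the LSMEANS and MEANS statements discussed in this section can be placed in a single PROC GLM step. This is a sketch assembled from the statements in the text, assuming the glm1 data set created earlier; the t option is included on the assumption that the LSD output in this section was requested that way.

```sas
proc glm data=glm1;
  class grp;                        /* treat grp as a categorical factor */
  model sysbp = grp;                /* one-way ANOVA model */
  lsmeans grp / adjust=scheffe;     /* pairwise p-values with Scheffe adjustment */
  means grp / hovtest welch         /* Levene test and Welch ANOVA */
              t tukey scheffe;      /* LSD, Tukey HSD, and Scheffe groupings */
run;
quit;
```

The quit statement ends the procedure, since PROC GLM otherwise remains active after run and waits for further statements.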

GLM analysis of Systolic Blood Pressure Data- Main Effects Analysis

Levene’s Test for Homogeneity of sysbp Variance: ANOVA of Squared Deviations from Group Means

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| grp | 2 | 406309 | 203155 | 15.59 | <.0001 |
| Error | 57 | 742833 | 13032.2 | | |

Welch’s ANOVA for sysbp

| Source | DF | F Value | Pr > F |
|---|---|---|---|
| grp | 2.0000 | 33.35 | <.0001 |
| Error | 32.1316 | | |

GLM analysis of Systolic Blood Pressure Data with the Post Hoc t Tests (LSD) for sysbp

Note: This test controls the Type I comparison-wise error rate, not the experiment-wise error rate.

| Statistic | Value |
|---|---|
| Alpha | 0.05 |
| Error Degrees of Freedom | 57 |
| Error Mean Square | 82.1009 |
| Critical Value of t | 2.00247 |
| Least Significant Difference | 5.7377 |

Means with the same letter are not significantly different.

| t Grouping | Mean | N | grp |
|---|---|---|---|
| A | 138.300 | 20 | 1 |
| B | 120.900 | 20 | 3 |
| C | 114.250 | 20 | 2 |

GLM analysis of Systolic Blood Pressure Data with the Tukey’s Studentized Range (HSD) Test for sysbp

Note: This test controls the Type I experiment-wise error rate, but it generally has a higher Type II error rate than REGWQ.

| Statistic | Value |
|---|---|
| Alpha | 0.05 |
| Error Degrees of Freedom | 57 |
| Error Mean Square | 82.1009 |
| Critical Value of Studentized Range | 3.40311 |
| Minimum Significant Difference | 6.895 |

Means with the same letter are not significantly different.

| Tukey Grouping | Mean | N | grp |
|---|---|---|---|
| A | 138.300 | 20 | 1 |
| B | 120.900 | 20 | 3 |
| B | 114.250 | 20 | 2 |

GLM analysis of Systolic Blood Pressure Data with the Scheffe’s Test for sysbp

Note: This test controls the Type I experiment-wise error rate.

| Statistic | Value |
|---|---|
| Alpha | 0.05 |
| Error Degrees of Freedom | 57 |
| Error Mean Square | 82.1009 |
| Critical Value of F | 3.15884 |
| Minimum Significant Difference | 7.202 |

Means with the same letter are not significantly different.

| Scheffe Grouping | Mean | N | grp |
|---|---|---|---|
| A | 138.300 | 20 | 1 |
| B | 120.900 | 20 | 3 |
| B | 114.250 | 20 | 2 |
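
The three post hoc tests above reach different conclusions about groups 2 and 3 because each uses a different threshold for the smallest difference it calls significant. The following worked arithmetic uses the critical values and error mean square printed in the output, with $n = 20$ per group and $k = 3$ groups; the formulas are the standard equal-sample-size forms and are stated here as an aid to reading the output rather than as part of the original text.

```latex
\begin{aligned}
\text{LSD} &= t_{crit}\sqrt{\frac{2\,MSE}{n}} = 2.00247\sqrt{\frac{2(82.1009)}{20}} \approx 5.738 \\
\text{Tukey HSD} &= q_{crit}\sqrt{\frac{MSE}{n}} = 3.40311\sqrt{\frac{82.1009}{20}} \approx 6.895 \\
\text{Scheffe} &= \sqrt{(k-1)\,F_{crit}}\,\sqrt{\frac{2\,MSE}{n}} = \sqrt{2(3.15884)}\sqrt{\frac{2(82.1009)}{20}} \approx 7.202
\end{aligned}
```

The difference between the group 3 and group 2 means is 120.900 − 114.250 = 6.650. This exceeds the LSD threshold of 5.738, so the LSD test places the two groups in separate letter groupings, but it falls short of the Tukey (6.895) and Scheffe (7.202) thresholds, so those tests assign both groups the letter B.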

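If we rerun the analysis with the class statement removed, SAS treats grp as a numeric predictor (coded 1, 2, 3) rather than as a categorical factor, and we can generate the regression coefficients for the independent variable.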

proc glm data=glm1;
model sysbp = grp;
run;

| Parameter | Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|
| Intercept | 141.8833333 | 3.96644269 | 35.77 | <.0001 |
| grp | -8.7000000 | 1.83610618 | -4.74 | <.0001 |
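
Note that the single slope above assumes blood pressure changes by the same amount from group 1 to 2 as from group 2 to 3. The fitted line gives 141.88 − 8.70 × grp, or about 133.18, 124.48, and 115.78 for groups 1 to 3, which does not match the observed group means of 138.30, 114.25, and 120.90. An alternative, sketched below, keeps the class statement and asks for the coefficients with the SOLUTION option on the MODEL statement; this step is not part of the original analysis and assumes the glm1 data set created earlier.

```sas
proc glm data=glm1;
  class grp;                        /* grp stays categorical */
  model sysbp = grp / solution;     /* print the parameter estimates */
run;
quit;
```

With this parameterization PROC GLM uses the last level of grp as the reference, so the intercept should equal the group 3 mean (120.90) and the grp 1 and grp 2 estimates should be the differences from it (17.40 and −6.65), with the grp 3 estimate set to zero.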

Adding A Second Grouping Factor To a GLM Model

Consider the analysis from the PROC ANOVA computations in Chapter 9, where we were interested in evaluating the effects of introducing a one-hour activity break into the workday, believing that such an opportunity could reduce the resting heart rates of the participants and thereby lead to a healthier workforce.

You will recall that the research design began with 66 participants who were randomly selected from a sample of employees within the company and randomly allocated to one of three treatment groups. In the following analysis, we used PROC GLM and the post hoc procedure LSMEANS to examine the sex by group interaction cell by cell, comparing the individual cell means across the treatment levels (walking versus dancing versus book reading) for each level of sex (males versus females).

proc glm data=anova2x3;
title 'Using PROC GLM to determine interaction effect';
class sex group;
model hrchange = sex group sex*group;
lsmeans sex*group / pdiff;
run;
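
If the goal is specifically to test the treatment effect within each sex, one option is the SLICE= option of the LSMEANS statement. The following is a hedged sketch rather than the analysis reported in this chapter: it assumes the same anova2x3 data set and variable names, and the choice of adjust=tukey is an assumption made here for illustration.

```sas
proc glm data=anova2x3;
  class sex group;
  model hrchange = sex group sex*group;
  /* slice=sex tests the group effect separately for females and for males; */
  /* pdiff with adjust=tukey gives adjusted p-values for the cell comparisons */
  lsmeans sex*group / slice=sex pdiff adjust=tukey;
run;
quit;
```

The sliced tests answer whether the three treatment levels differ among females and, separately, among males, which complements the full matrix of pairwise cell comparisons.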

The results from the LSMEANS analysis are shown here.

Using PROC GLM to determine interaction effect

The GLM Procedure: Least Squares Means

| sex | group | hrchange LSMEAN | LSMEAN Number |
|---|---|---|---|
| F | 1 | -4.5454545 | 1 |
| F | 2 | -10.3181818 | 2 |
| F | 3 | 5.8181818 | 3 |
| M | 1 | -4.2727273 | 4 |
| M | 2 | -2.0000000 | 5 |
| M | 3 | 6.5454545 | 6 |

Least Squares Means for effect sex*group. Pr > |t| for H0: LSMean(i)=LSMean(j). Dependent Variable: hrchange

| i/j | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | | <.0001 | <.0001 | 0.8183 | 0.0336 | <.0001 |
| 2 | <.0001 | | <.0001 | <.0001 | <.0001 | <.0001 |
| 3 | <.0001 | <.0001 | | <.0001 | <.0001 | 0.5404 |
| 4 | 0.8183 | <.0001 | <.0001 | | 0.0573 | <.0001 |
| 5 | 0.0336 | <.0001 | <.0001 | 0.0573 | | <.0001 |
| 6 | <.0001 | <.0001 | 0.5404 | <.0001 | <.0001 | |
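
When reading this matrix, keep in mind that the p-values are unadjusted for multiple comparisons. As a rough illustration, with six cells the number of pairwise comparisons and the chance of at least one false positive at $\alpha = 0.05$ can be sketched as follows; the second line assumes the comparisons are independent, which is a simplification introduced here and not a claim from the original analysis.

```latex
\begin{aligned}
\text{number of comparisons} &= \binom{6}{2} = \frac{6 \times 5}{2} = 15 \\
P(\text{at least one Type I error}) &\approx 1 - (1 - 0.05)^{15} \approx 0.54
\end{aligned}
```

This is why borderline results in the matrix, such as p = 0.0336, deserve more caution than results at p < .0001.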

Notice that the matrix gives the p-value for each pairwise comparison between cell means, with the cells identified by their LSMEAN Number. Since most comparisons were significantly different, it is simpler to note the three that were not (p > 0.05): cells 1 and 4 (p = 0.8183), cells 3 and 6 (p = 0.5404), and cells 4 and 5 (p = 0.0573). These results support the notion that being physically active, whether it be dancing or walking as planned exercise, has a positive effect on reducing resting heart rates, and more so for females than males.