Parametric Statistics

# PART 1: Measures of Central Tendency

The most common measure of central tendency is the mean or average score. The mean is a calculated score that is intended to represent all of the scores in the distribution (set of scores).

The formula for the mean of a sample is shown here:

${\overline{x}} = {\Sigma{x_i}\over{n}}$

Where:

• ${\overline{x}}$ refers to the sample mean
• $\Sigma{x_i}$ refers to the sum of all the scores
• i refers to the “ith” case within the distribution
• n refers to the number of cases within the distribution.

To calculate the mean for a continuous variable, add up all of the values and divide the sum by the number of values.

Below is a set of blood glucose measures for 5 patients. These data are reported in millimoles per litre (mmol/L). Pn represents the nominal value label for each patient, so that P1 is patient 1.

| Patient | Blood glucose (mmol/L) |
|---|---|
| P1 | 4.2 |
| P2 | 5.6 |
| P3 | 7.9 |
| P4 | 10.2 |
| P5 | 7.5 |

Follow these steps to calculate the mean:

• First, add the values together: 4.2 + 5.6 + 7.9 + 10.2 + 7.5 = 35.4.
• Next, divide by the number of values (to produce the average): 35.4/5 = 7.08 mmol/L.

We can also use SAS to compute the mean for a set of scores. Two SAS procedures that produce measures of central tendency are PROC MEANS and PROC UNIVARIATE. Each of these procedures was designed to produce descriptive statistics for a sample of scores.

Below are the SAS commands to compute the mean for a set of 10 resting heart rate scores. In this first program we use the SAS procedural command PROC MEANS to compute the basic estimates for the sample dataset of 10 numbers: the number of cases, the mean, the standard deviation, and the minimum and maximum scores.

SAS PROC MEANS to Produce Descriptive Statistics for a Sample of 10 Numbers

DATA MN_HR;
INPUT ID SCORE @@;
DATALINES;
01 48 02 54 03 66 04 72 05 56
06 68 07 48 08 67 09 55 10 84
;
PROC MEANS DATA=MN_HR;
VAR SCORE;
RUN;

Notice that in the code written above, the semi-colon (;) is placed on a separate line below the set of scores.

While PROC MEANS in its simplest form (without options) provides only these basic estimates for a distribution, the SAS procedural command PROC UNIVARIATE not only computes the mean but also creates the Basic Statistical Measures table, which provides a fuller summary of descriptive statistics.
The output generated by the SAS program above, using the PROC MEANS statement without options, is a table of summary estimates that includes the mean and standard deviation as well as the minimum and maximum values for the dataset.

SAS Output from the MEANS Procedure: Variable of interest was Heart Rate

| N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|
| 10 | 61.80 | 11.55 | 48.00 | 84.00 |

When we call the UNIVARIATE procedure of SAS, the output is a more complete set of summary tables that includes not only estimates of centrality but also the moments, measures of variance, and tests of the location of the mean, as shown below.

SAS PROC UNIVARIATE to Produce Descriptive Statistics for a Sample of 10 Numbers

PROC UNIVARIATE DATA=MN_HR;
VAR SCORE;
RUN;

The UNIVARIATE Procedure, Variable: SCORE

Moments

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| N | 10 | Sum Weights | 10 |
| Mean | 61.8 | Sum Observations | 618 |
| Std Deviation | 11.5547008 | Variance | 133.511111 |
| Skewness | 0.55954538 | Kurtosis | -0.2284272 |
| Uncorrected SS | 39394 | Corrected SS | 1201.6 |
| Coeff Variation | 18.6969269 | Std Error Mean | 3.65391723 |

Tests for Location: Mu0=0

| Test | Statistic | Estimate | p Value |
|---|---|---|---|
| Student's t | t | 16.91336 | Pr > \|t\| <.0001 |
| Sign | M | 5 | Pr >= \|M\| 0.0020 |
| Signed Rank | S | 27.5 | Pr >= \|S\| 0.0020 |

## Comparing the Mean for a Sample to the Expected Mean for a Population

In the output from PROC UNIVARIATE, SAS includes a table in which the mean for the variable SCORE is compared to a hypothesized population mean, Mu0. By default Mu0 is 0, which is the mean of the Standard Normal Distribution (SND); the SND has a mean of 0 and a standard deviation of 1. In the SAS table shown above, entitled Tests for Location: Mu0=0, the comparison of the sample mean (${\overline{x}}$) to the hypothesized population mean (${\mu}$) is evaluated with the Student’s t-Test.

The results presented in the table above show that the Student’s t-statistic is 16.91 and that the probability associated with this estimate is < 0.0001. Together these values indicate that the observed sample mean is significantly different from the hypothesized mean (set at Mu0=0) for the population from which the sample was drawn.
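
The t value can be reproduced by hand. As a sketch, assuming the usual one-sample formula (which the SAS output does not display), the sample mean minus the hypothesized mean is divided by the standard error of the mean, where $s$ is the Std Deviation and $s/\sqrt{n}$ is the Std Error Mean from the Moments table:

$t = {{\overline{x}} - {\mu_0}\over{s/\sqrt{n}}} = {61.8 - 0\over{11.5547/\sqrt{10}}} = {61.8\over{3.6539}} \approx 16.91$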

However, what if we wanted to establish a suggested value for the population mean that is not 0, but that is based on a value reported in the literature? In this case, we could assign a suggested value to the population mean and then compare the observed mean for the sample to this expected value for the population. In the following code, we test this notion with a suggested population mean of 54.

Assign a suggested value to the population mean

PROC TTEST DATA=MN_HR H0=54
PLOTS(SHOWH0)
ALPHA=0.05;
VAR SCORE;
RUN;

The SAS output is given below. The results indicate that the average score for the sample (${\overline{x}}$ = 61.80) is not significantly different from the expected score (${\mu}$ = 54) at the 0.05 probability level (p = 0.0615). Notice that, in addition to the table of output, SAS also produces a graph illustrating the shape of the distribution and the comparison of the sample estimate to the expected population estimate of centrality.

The TTEST Procedure

| DF | t Value | Pr > \|t\| |
|---|---|---|
| 9 | 2.13 | 0.0615 |

Parameter estimates

| Mean | 95% CL Mean: Lower Limit | 95% CL Mean: Upper Limit |
|---|---|---|
| 61.8000 | 53.5343 | 70.0657 |
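
These values can be checked by hand. The sketch below assumes the standard one-sample formulas and takes the standard error of the mean (3.6539) from the earlier PROC UNIVARIATE output. The critical value $t_{0.975,\,9} \approx 2.262$ is the two-sided 5% point of the t distribution with 9 degrees of freedom; it is an added assumption here and does not appear in the SAS output.

$t = {61.8 - 54\over{3.6539}} \approx 2.13$

$61.8 \pm 2.262 \times 3.6539 \approx 61.8 \pm 8.27$, which gives limits of approximately 53.53 and 70.07.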

Because the 95% confidence interval for the mean (53.53 to 70.07) includes the value of 54 that we set a priori as the population mean, no significant difference is observed between that which was expected and that which was observed. This comparison is illustrated in the graph that PROC TTEST produces with the PLOTS(SHOWH0) option.
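
The Tests for Location table from PROC UNIVARIATE can be pointed at the same suggested value. The following is a minimal sketch rather than part of the original program: it re-creates the heart rate data under the hypothetical dataset name MN_HR2 and uses the MU0= option to replace the default of 0 with 54. The Student's t row should then agree with the PROC TTEST result (t = 2.13, p = 0.0615).

```sas
/* Re-create the 10 resting heart rate scores (MN_HR2 is a hypothetical name) */
DATA MN_HR2;
   INPUT ID SCORE @@;
   DATALINES;
01 48 02 54 03 66 04 72 05 56
06 68 07 48 08 67 09 55 10 84
;
/* MU0=54 tests the mean of SCORE against 54 instead of the default 0 */
PROC UNIVARIATE DATA=MN_HR2 MU0=54;
   VAR SCORE;
RUN;
```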

## Calculate the Mean for A Frequency Distribution

In the following example, we compute the mean for a frequency distribution. The formula to compute the mean of a frequency distribution is shown here:

${\overline{x}} = {\Sigma{fx_i}\over{n}}$

Where:

• f refers to the frequency in each interval
• $x_i$ refers to the mid-point of the ith interval
• i refers to the “ith” interval within the distribution
• n refers to the total number of cases within the distribution.

Below is the frequency distribution table for the heights of 200 individuals. The data represent heights recorded in centimetres and organized into seven categories. The SAS code to compute the mean for this set of data is shown below the table. Notice that in the SAS code the table is reduced to two variables: the mid-point of each category, represented by the variable GRPMDPT, and the number of individuals whose height scores fall within that category, represented by the variable COUNTS.

| Column 1: cell boundaries (cm) | Column 2: frequency (f) | Column 3: cell mid-point | Column 4: f × cell mid-point | Column 5: column 4 ÷ n |
|---|---|---|---|---|
| 158.5 – 161.5 | 4 | 160 | 4 × 160 = 640 | 640/200 = 3.20 |
| 161.5 – 164.5 | 12 | 163 | 12 × 163 = 1956 | 1956/200 = 9.78 |
| 164.5 – 167.5 | 44 | 166 | 44 × 166 = 7304 | 7304/200 = 36.52 |
| 167.5 – 170.5 | 64 | 169 | 64 × 169 = 10816 | 10816/200 = 54.08 |
| 170.5 – 173.5 | 56 | 172 | 56 × 172 = 9632 | 9632/200 = 48.16 |
| 173.5 – 176.5 | 16 | 175 | 16 × 175 = 2800 | 2800/200 = 14.00 |
| 176.5 – 179.5 | 4 | 178 | 4 × 178 = 712 | 712/200 = 3.56 |

${\overline{x}} = {\Sigma{fx_i}\over{n}}$

${\overline{x}} = {33860\over 200}$ = 169.3

The mean (${\overline{x}}$) is also the sum of column 5.

The SAS code to compute the mean for data in the table above

DATA FREQMN;
INPUT GRPMDPT COUNTS @@;
CRSPRDCT= GRPMDPT*COUNTS;
/* COMPUTE RATIO FOR THE CROSS PRODUCT USING GROUP MIDPOINT X CELL FREQUENCY */
XP_RATIO=CRSPRDCT/200;
LABEL GRPMDPT = 'GROUP MIDPOINT'
COUNTS = 'NUMBER OF CASES PER CELL'
CRSPRDCT = 'CROSS PRODUCT PER CELL'
XP_RATIO = 'CROSS PRODUCT RATIO';
DATALINES;
160 4 163 12 166 44 169 64 172 56 175 16 178 4
;
PROC PRINT;
VAR GRPMDPT COUNTS CRSPRDCT XP_RATIO;
SUM CRSPRDCT XP_RATIO;
FOOTNOTE1 "* THE MEAN IS PRODUCED AS THE SUM OF THE VARIABLE XP_RATIO";
FOOTNOTE2 "** THE MEAN CAN ALSO BE CALCULATED FROM THE SUM OF THE VARIABLE CRSPRDCT ÷ 200";
RUN;

The output generated by the SAS program above is the table of raw data presented in column form and includes the sums of the columns used to compute the mean for the frequency distribution.

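| Obs | grpmdpt | counts | crsprdct | xp_ratio |
|---|---|---|---|---|
| 1 | 160 | 4 | 640 | 3.20 |
| 2 | 163 | 12 | 1956 | 9.78 |
| 3 | 166 | 44 | 7304 | 36.52 |
| 4 | 169 | 64 | 10816 | 54.08 |
| 5 | 172 | 56 | 9632 | 48.16 |
| 6 | 175 | 16 | 2800 | 14.00 |
| 7 | 178 | 4 | 712 | 3.56 |
| | | | 33860 | 169.30 |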

* The mean is produced as the sum of the variable XP_RATIO

** The mean can also be calculated from the sum of the variable crsprdct ÷ 200
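
As a cross-check on the hand calculation, PROC MEANS can compute the same mean directly from the mid-points and cell counts. This is a minimal sketch, not the method used in the program above: the FREQ statement tells SAS to treat each row as COUNTS identical observations, and the dataset name FREQMN2 is hypothetical. The output should report N = 200 and a mean of 169.3.

```sas
/* Mid-points and cell frequencies from the height table (FREQMN2 is a hypothetical name) */
DATA FREQMN2;
   INPUT GRPMDPT COUNTS @@;
   DATALINES;
160 4 163 12 166 44 169 64 172 56 175 16 178 4
;
/* FREQ treats each row as COUNTS repeated observations of GRPMDPT */
PROC MEANS DATA=FREQMN2 N MEAN;
   VAR GRPMDPT;
   FREQ COUNTS;
RUN;
```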

## The Weighted Mean Score

In some situations, we may wish to combine means from several samples. Under such circumstances, we need to consider the sample size (or weight) of the distribution from which each mean was drawn. By adjusting each independent sample mean by the number of subjects in its respective sample, we allow each mean to make a different relative contribution to the total mean of all samples combined. The formula for a weighted mean from two samples is shown here:

${\overline{x}}={n_1{\overline{x}_1}+n_2{\overline{x}_2}\over{n_1 + n_2}}$
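
The weighted mean can be sketched in SAS with the WEIGHT statement of PROC MEANS. In the example below, sample 1 is the heart rate sample used earlier (n = 10, mean = 61.8). Sample 2 (n = 30, mean = 70.0) is hypothetical and is included only for illustration, as are the names WTMEAN, NSIZE and XBAR. By the formula above, the weighted mean is (10 × 61.8 + 30 × 70.0)/40 = 67.95, whereas the simple average of the two means is 65.9.

```sas
/* Sample 1: heart rate sample from this chapter (n = 10, mean = 61.8)  */
/* Sample 2: HYPOTHETICAL values for illustration (n = 30, mean = 70.0) */
DATA WTMEAN;
   INPUT SAMPLE NSIZE XBAR;
   DATALINES;
1 10 61.8
2 30 70.0
;
/* WEIGHT makes each sample mean count in proportion to its sample size */
PROC MEANS DATA=WTMEAN MEAN;
   VAR XBAR;
   WEIGHT NSIZE;
RUN;
```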

## The Median Score

The median score is also a measure of central tendency, and it is defined as the middle score in a set of ordered scores. In the examples below, we begin with a set of scores (an array) and sort the scores from lowest to highest. Then we identify the number in the middle of the ordered set, where half the scores are above the identified middle score and half are below it.

Example: Median

The median is the middle score. Consider the following 10 resting heart rate values, which we have put in order of magnitude so that we can identify which value is in the middle:

• 57
• 59
• 59
• 75
• 78
• 78
• 85
• 88
• 88
• 88

In this case, we have an even number of values (n = 10), so we calculate the average of the two values in the middle (the 5th and 6th values). It just so happens that they are the same value in this example (78), so the median is 78.

• initial array of scores: {12, 72, 56, 34, 35, 13, 36, 16, 67}
• sorted array of scores: {12, 13, 16, 34, 35, 36, 56, 67, 72}
• median of the sorted array: 35 (the 5th of the 9 sorted scores)

Notice in the example above, regardless of the actual scores, the middle score in the ordered set of scores is the median, which in this set is 35.

When we have an even number of scores in our array, no single score sits in the middle of the distribution (set of scores). Instead, two scores are identified as the middle scores, and we compute the average of these two scores and use that number as the median score. That is, we add the two middle scores together and divide by 2.

• initial array of scores: {22, 32, 86, 44, 25, 13, 16, 18, 47, 11}
• sorted array of scores: {11, 13, 16, 18, 22, 25, 32, 44, 47, 86}
• computed median for the array: (22 + 25)/2 = 23.5
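
Both median examples can be checked with the MEDIAN keyword of PROC MEANS. The sketch below is an illustration only: the dataset name MEDIAN_EX and the group labels A (the 9-score array) and B (the 10-score array) are hypothetical. It relies on the default SAS percentile definition (QNTLDEF=5), which averages the two middle scores when n is even. The output should report a median of 35 for group A and 23.5 for group B.

```sas
/* A = the 9-score array, B = the 10-score array (labels are hypothetical) */
DATA MEDIAN_EX;
   INPUT GRP $ SCORE @@;
   DATALINES;
A 12 A 72 A 56 A 34 A 35 A 13 A 36 A 16 A 67
B 22 B 32 B 86 B 44 B 25 B 13 B 16 B 18 B 47 B 11
;
/* SAS orders the scores internally; no separate sort step is needed */
PROC MEANS DATA=MEDIAN_EX N MEDIAN;
   CLASS GRP;
   VAR SCORE;
RUN;
```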

## The Mode Score

The mode score is the third measure of central tendency, and it is defined as the most frequently occurring score in a set of scores. In the examples below, we simply count how many times each score occurs within a set of scores (an array or a distribution).

Below are 10 resting heart rate values:

78, 88, 57, 59, 75, 85, 88, 78, 59, 88

The mode is 88 because it appears most often (three times, compared with twice each for 78 and 59).

In the following example of 16 scores, the number 2 occurs 3 times, but the number 27 occurs 4 times; therefore, we would identify 27 as the mode score.

2, 2, 2, 5, 6, 14, 15, 23, 26, 27, 27, 27, 27, 28, 37, 41
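
The counting step can be sketched in SAS with PROC FREQ. This is one possible approach rather than the only one, and the dataset name MODE_EX is hypothetical. The ORDER=FREQ option lists the scores from most to least frequent, so the mode (27, with a frequency of 4) should appear in the first row, followed by 2 (with a frequency of 3).

```sas
/* The 16 scores from the example above (MODE_EX is a hypothetical name) */
DATA MODE_EX;
   INPUT SCORE @@;
   DATALINES;
2 2 2 5 6 14 15 23 26 27 27 27 27 28 37 41
;
/* ORDER=FREQ sorts the table by descending frequency, so the mode is listed first */
PROC FREQ DATA=MODE_EX ORDER=FREQ;
   TABLES SCORE;
RUN;
```

The Basic Statistical Measures table produced by PROC UNIVARIATE also reports the mode alongside the mean and median.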