Goodness of Fit and Related Chi-Square Tests
14 Percentiles
What is a percentile?
The term “per cent” refers to “per 100”, and thus a percentile is a score representing a value relative to a base 100 scale.
The computation of percentiles is a useful way to evaluate scores within a frequency distribution, i.e., the set of frequency scores.
A percentile marks the value below which a given percentage of the scores fall.
In other words, if we consider the 60th percentile, then we are suggesting that 60% of the scores in a distribution or set of scores will fall below that particular value.
Percentiles always refer to a specific position within a frequency distribution.
Formulas to compute percentiles for grouped data:
i) [latex]{k} = (\frac{frequency}{N} \times{100})[/latex]
ii) [latex]{\beta} = (\frac{\textit{Cumulative Frequency for all scores below the Category of Interest}}{N} \times{100})[/latex]
iii) [latex]\textit{Percentile}={\beta} + (0.5 \times{k})[/latex]
The 0.5 adds half of the percent of scores within the category in which the score of interest resides, which treats that score as lying at the middle of its category.
Consider computing the percentile for the score 71 in the frequency distribution shown in Table 14.1.
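Substituting equations i) and ii) into equation iii) gives a single expression. The symbols f (the frequency of the category of interest) and cf below (the cumulative frequency for all scores below that category) are shorthand introduced here for convenience and are not part of the original notation:
[latex]\textit{Percentile} = (\frac{cf_{below} + (0.5 \times{f})}{N} \times{100})[/latex]
Written this way, the percentile counts every score below the category of interest plus half of the scores inside it, expressed as a percent of N.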
Table 14.1 Frequency Distribution Output
Cell Boundaries | Freq (f) | [latex]{k} = (\frac{frequency}{N} \times{100})[/latex] | Cum. Freq. | Cum. Percent (β for the next category) |
58.5-61.5 | 4 | 4/200 * 100 = 0.02 * 100 = 2 | 4 | 4/200 * 100 = 2 |
61.5-64.5 | 12 | 12/200 * 100 = 0.06 * 100 = 6 | 16 | 16/200 * 100 = 8 |
64.5-67.5 | 44 | 44/200 * 100 = 0.22 * 100 = 22 | 60 | 60/200 * 100 = 30 |
67.5-70.5 | 64 | 64/200 * 100 = 0.32 * 100 = 32 | 124 | 124/200 * 100 = 62 |
70.5-73.5 | 56 | 56/200 * 100 = 0.28 * 100 = 28 | 180 | 180/200 * 100 = 90 |
73.5-76.5 | 16 | 16/200 * 100 = 0.08 * 100 = 8 | 196 | 196/200 * 100 = 98 |
76.5-79.5 | 4 | 4/200 * 100 = 0.02 * 100 = 2 | 200 | 200/200 * 100 = 100 |
The total number of scores is N = 200. We are interested in the specific score with a value of 71. The score 71 resides within the category that has cell boundaries 70.5 to 73.5. This category has a corresponding frequency of 56, which indicates that there are 56 scores between the lower and upper boundaries of the category from 70.5 to 73.5. We can then enter 56 as the frequency value and 200 as the value of N in the following equation to determine the value of k in our series of percentile equations.
i) [latex]{k} = (\frac{frequency}{N} \times{100})[/latex]
[latex]{k} = (\frac{56}{200} \times{100}) = 28[/latex]
Here k = 28, where k represents the percent of scores in the category of interest: 56 of 200 scores is 28% of all scores in our distribution.
Next we determine the value of [latex]{\beta}[/latex] from the equation [latex]{\beta} = (\frac{\textit{Cumulative frequency for all scores below the category of interest}}{N} \times{100})[/latex]. The value of [latex]{\beta}[/latex] is the cumulative percent of scores in the data set that fall below the category in which our score of interest resides. In this example, the cumulative frequency for all scores below the category of interest is the cumulative frequency of the category that precedes the category in which our score (71) resides, which is 124. Entering 124 into the equation gives [latex]{\beta}[/latex] = 62.
ii) [latex]{\beta} = (\frac{124}{200} \times{100}) = 62[/latex]
After we have determined k and [latex]{\beta}[/latex], we can then work through the steps in equation iii) to determine the percent of scores falling below our score of interest.
iii) [latex]\textit{Percentile}={62} + (0.5 \times{28})[/latex]
[latex]\textit{Percentile}={62} + (14)[/latex]
[latex]\textit{Percentile}=76^{th} \textit{percentile}[/latex]
The outcome indicates that 76 percent of the scores within this set (distribution) of scores fall below the score of 71.
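The hand computation above can be cross-checked with a short script. The following is a minimal sketch in Python rather than SAS. The function name grouped_percentile and the list layout are assumptions made for illustration. It applies equations i) to iii) to the frequencies in Table 14.1.

```python
# Cell boundaries and frequencies from Table 14.1, in table order.
LOWER_BOUNDS = [58.5, 61.5, 64.5, 67.5, 70.5, 73.5, 76.5]
UPPER_BOUNDS = [61.5, 64.5, 67.5, 70.5, 73.5, 76.5, 79.5]
FREQUENCIES = [4, 12, 44, 64, 56, 16, 4]


def grouped_percentile(score, lower, upper, freqs):
    """Return (k, beta, percentile) for a score in grouped data.

    Hypothetical helper: k is the percent of scores in the category that
    holds the score, beta is the percent of scores below that category,
    and the percentile is beta + 0.5 * k.
    """
    n = sum(freqs)
    below = 0  # cumulative frequency of the categories below the score
    for lo, hi, f in zip(lower, upper, freqs):
        if lo <= score < hi:
            k = f * 100 / n          # equation i)
            beta = below * 100 / n   # equation ii)
            return k, beta, beta + 0.5 * k  # equation iii)
        below += f
    raise ValueError("score is outside the cell boundaries")


k, beta, percentile = grouped_percentile(
    71, LOWER_BOUNDS, UPPER_BOUNDS, FREQUENCIES
)
print(k, beta, percentile)  # 28.0 62.0 76.0
```

The same call with a different score can be used to check other rows of the table, provided the score lies inside the listed cell boundaries.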
Working through the computation of percentiles from a set of scores
Use the frequency distribution for heights of Grade 5 elementary school children in Table 14.2 to compute the percentiles for the following values: 123, 136, 138, 149, and 152. For each value, indicate k and the percentile score. First fill in the missing data in the table to obtain a complete data set.
Table 14.2 Frequency Distribution For Heights Of Grade 5 Elementary School Children.
Category | Frequency | Cumulative Frequency |
120-122 | 1 | 1 |
123-125 | 3 | 4 |
126-128 | 3 | 7 |
129-131 | 3 | |
132-134 | 1 | 11 |
135-137 | 2 | 13 |
138-140 | 1 | 14 |
141-143 | 2 | |
144-146 | 2 | 18 |
147-149 | 2 | |
150-152 | 3 | |
sum of freq= |
In August 2016 Brazil hosted the Olympic Summer Games. However, several athletes decided to boycott the games because of the risk of exposure to the ZIKA virus. ZIKA is a virus that can be transmitted through the bite of an infected Aedes mosquito. The ZIKA virus is extremely dangerous for young women as it can reside in the blood for up to 3 months and if the woman becomes pregnant, the virus can have negative consequences for the developing fetus. In particular, the ZIKA virus has been implicated in the development of microcephaly in newborn children.
In this example, we will use a series of random number generating commands to create a data set with four variables and 1000 cases. The variables use the following format: sex (1=m, 2=f), sport (1=golf, 2=equestrian, 3=swimming, 4=gymnastics, 5=track & field), case (1=yes, 2=no), and days, a continuous variable representing the number of days since exposure to ZIKA virus-carrying mosquitoes.
PROC FORMAT;
VALUE SEXFMT 1 ='MALE' 2 ='FEMALE';
VALUE SPRTFMT 1 ='GOLF' 2 ='EQUESTRIAN' 3 ='SWIMMING'
4 ='GYMNASTICS' 5 ='TRACK & FIELD';
VALUE CASEFMT 1='PRESENT' 2='ABSENT';
RUN;
DATA SASRNG;
/* Create 3 new variables labelled SCORE1 SCORE2 SCORE3 */
ARRAY SCORES SCORE1-SCORE3;
/* Set 1000 cases per variable */
DO K=1 TO 1000;
/* DAYS: random value between 0 and 100, rounded to the nearest 0.02 */
DAYS=RANUNI(13)*100;
DAYS=ROUND(DAYS, 0.02);
/* Loop through each variable to establish 1000 randomly generated scores from 1 to 105 */
DO I=1 TO 3;
SCORES(I)=RANUNI(I)*1000;
SCORES(I)=ROUND(SCORES(I));
SCORES(I)=1+(MOD(SCORES(I),105));
/* The variable sex will relate to score1, create a filter to establish the binary score for sex based on the randomly generated output */
/* A score of 1 or 2 matches neither condition, so SEX is left unchanged from the previous case */
IF SCORE1 > 55 THEN SEX = 2;
IF SCORE1 >2 AND SCORE1<56 THEN SEX = 1;
/* Sport Type: a score of 1 to 5 matches no condition, so SPORT is left unchanged from the previous case */
IF SCORE2 >90 THEN SPORT = 5;
IF SCORE2 >80 AND SCORE2<91 THEN SPORT = 4;
IF SCORE2 >60 AND SCORE2<81 THEN SPORT = 3;
IF SCORE2 >30 AND SCORE2<61 THEN SPORT = 2;
IF SCORE2 >5 AND SCORE2<31 THEN SPORT=1;
/* Case */
IF SCORE3 > 48 THEN CASE = 1; ELSE CASE = 2;
END;
/* Write one observation per case */
OUTPUT;
END;
RUN;
/* Sort by SEX so that the BY statement below can be used */
PROC SORT DATA=SASRNG; BY SEX;
RUN;
PROC FREQ DATA=SASRNG; TABLES SEX SPORT CASE SEX*CASE;
FORMAT SEX SEXFMT. SPORT SPRTFMT. CASE CASEFMT. ;
RUN;
PROC FREQ DATA=SASRNG; TABLES SPORT*CASE; BY SEX;
FORMAT SEX SEXFMT. SPORT SPRTFMT. CASE CASEFMT. ;
RUN;
/* Percentiles for DAYS; the 30th and 60th are saved to the data set PCTLS */
PROC UNIVARIATE DATA=SASRNG; VAR DAYS;
OUTPUT OUT=PCTLS PCTLPTS = 30 60
PCTLPRE = DAYS_
PCTLNAME = PCT30 PCT60;
RUN;
PROC PRINT DATA=PCTLS;
RUN;
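To see what the IF filters in the DATA step imply, the following sketch (in Python rather than SAS) tallies how many of the 105 possible score values map to each category. The function names are hypothetical, and the tally does not reproduce the SAS random number stream. The counts only indicate the shares to expect under the assumption that each score value from 1 to 105 is about equally likely.

```python
from collections import Counter


def assign_sex(score1):
    """Mirror the two IF statements for SEX; None means neither matched."""
    if score1 > 55:
        return 2
    if 2 < score1 < 56:
        return 1
    return None


def assign_sport(score2):
    """Mirror the five IF statements for SPORT; None means none matched."""
    if score2 > 90:
        return 5
    if 80 < score2 < 91:
        return 4
    if 60 < score2 < 81:
        return 3
    if 30 < score2 < 61:
        return 2
    if 5 < score2 < 31:
        return 1
    return None


def assign_case(score3):
    """Mirror the IF/ELSE statement for CASE."""
    return 1 if score3 > 48 else 2


scores = range(1, 106)  # every value that 1 + MOD(x, 105) can produce
sex_counts = Counter(assign_sex(s) for s in scores)
sport_counts = Counter(assign_sport(s) for s in scores)
case_counts = Counter(assign_case(s) for s in scores)

print("sex:", dict(sex_counts))
print("sport:", dict(sport_counts))
print("case:", dict(case_counts))
```

The None entries show the score values that match no filter (1 and 2 for sex, 1 to 5 for sport). In the DATA step those cases keep whatever value the variable held from the previous pass through the loop.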
In SAS we can compute percentiles for a continuous variable with PROC UNIVARIATE. The statements PROC UNIVARIATE; VAR DAYS; produce the following table of percentiles (quantiles) for the variable DAYS.
Table 14.3 Frequency Distribution Output Showing Percentiles
Level | Quantile |
100% Max | 99.94 |
99% | 98.66 |
95% | 94.34 |
90% | 89.61 |
75% Q3 | 73.13 |
50% Median | 46.83 |
25% Q1 | 24.75 |
10% | 10.23 |
5% | 4.86 |
1% | 1.27 |
0% Min | 0.02 |
However, we can also compute specific percentile values for a continuous variable using the PCTLPTS=, PCTLPRE=, and PCTLNAME= options of the OUTPUT statement.
Together these three options identify and label specific percentiles within a data set. PCTLPTS= selects the percentiles to compute; for example, PCTLPTS= 30 requests the 30th percentile. PCTLPRE= supplies the prefix for the name of each percentile variable; here we use the prefix days_. PCTLNAME= supplies the suffix that completes the name. For example, the sequence PCTLPTS= 30, PCTLPRE= DAYS_, and PCTLNAME= pct30 computes the 30th percentile and stores it in the variable days_pct30. In the following code we compute the 30th and 60th percentiles for the continuous variable DAYS.
SAS CODE to produce specific percentiles
output out=Pctls pctlpts = 30 60
pctlpre = days_
pctlname = pct30 pct60;
OUTPUT from the code above:
Obs | days_pct30 | days_pct60 |
1 | 28.64 | 57.08 |
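For intuition about how a percentile is located in raw (ungrouped) data, the sketch below computes percentiles in Python. It follows the rule that, to my understanding, PROC UNIVARIATE uses by default (PCTLDEF=5, the empirical distribution function with averaging). The function name and the ten sample values are hypothetical, and this is an illustration rather than a reimplementation of SAS.

```python
def percentile_def5(values, pct):
    """Percentile by the empirical distribution function with averaging.

    pct is a whole-number percent from 0 to 100. With n sorted values,
    write n * pct / 100 = j + g, where j is the integer part:
      g > 0  -> the (j + 1)-th smallest value
      g == 0 -> the mean of the j-th and (j + 1)-th smallest values
    """
    xs = sorted(values)
    n = len(xs)
    # Integer arithmetic avoids floating-point error when testing g == 0.
    j, remainder = divmod(n * pct, 100)
    if remainder > 0:
        return xs[j]        # (j + 1)-th value in 1-based counting
    if j == 0:
        return xs[0]        # 0th percentile: the minimum
    if j == n:
        return xs[-1]       # 100th percentile: the maximum
    return (xs[j - 1] + xs[j]) / 2


sample_days = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]  # hypothetical data
p30 = percentile_def5(sample_days, 30)
p60 = percentile_def5(sample_days, 60)
print(p30, p60)  # 7.0 13.0
```

Under this rule a percentile can fall midway between two observed values. That would be consistent with quantiles such as 89.61 appearing in Table 14.3 even though DAYS was rounded to the nearest 0.02.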
The PROC FREQ procedure in SAS enables us to create descriptive tables for the frequency distribution of the categorical variables. For example, we can count the number of females and males in our sample, the number of individuals in each sport, and the number of cases of ZIKA in our randomly generated data set of 1000 participants.
TABLE 14.5 ZIKA Random Number Generated data for SEX
sex | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
male | 533 | 53.30 | 533 | 53.30 |
female | 467 | 46.70 | 1000 | 100.00 |
TABLE 14.6 ZIKA Random Number Generated data for Sports
sport | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
golf | 266 | 26.60 | 266 | 26.60 |
equestrian | 286 | 28.60 | 552 | 55.20 |
swimming | 192 | 19.20 | 744 | 74.40 |
gymnastics | 96 | 9.60 | 840 | 84.00 |
track & field | 160 | 16.00 | 1000 | 100.00 |
TABLE 14.7 ZIKA Random Number Generated data for Disease Present/Absent
case | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
present | 505 | 50.50 | 505 | 50.50 |
absent | 495 | 49.50 | 1000 | 100.00 |
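The Percent, Cumulative Frequency, and Cumulative Percent columns in these tables follow directly from the frequencies. As a minimal sketch in Python rather than SAS, with the variable names chosen here for illustration, the sport table can be rebuilt from its Frequency column:

```python
from itertools import accumulate

# Frequency column of Table 14.6, in table order.
sport_counts = {
    "golf": 266,
    "equestrian": 286,
    "swimming": 192,
    "gymnastics": 96,
    "track & field": 160,
}

total = sum(sport_counts.values())
cum_freqs = list(accumulate(sport_counts.values()))  # running totals

rows = []
for (sport, freq), cum in zip(sport_counts.items(), cum_freqs):
    percent = round(freq * 100 / total, 2)
    cum_percent = round(cum * 100 / total, 2)
    rows.append((sport, freq, percent, cum, cum_percent))

for row in rows:
    print(row)
```

Each printed row holds the category, its frequency, percent, cumulative frequency, and cumulative percent, in the same order as the table columns.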
This procedure also enables us to create cross-tabular tables for comparisons of variables.
TABLE 14.8 ZIKA Random Number Generated Cross Tabulations
Table of Frequencies for case by sex
SEX | Cases Present | Cases Absent | Total |
Male | 275 | 258 | 533 |
Female | 230 | 237 | 467 |
COLUMN TOTALS | 505 | 495 | 1000 |
As in most SAS procedures, sorting the data with PROC SORT and then adding a BY statement lets us produce separate output for each level of the categorical variable(s). In this example we computed the cross-tabulation of the frequency distribution for the variables SPORT and CASE, BY SEX, to separate the output for Males and Females.
The table format provides the following data within each cell: frequency, followed by cell percent, followed by row percent, followed by column percent, as shown in this example for the sport: golf. Because the table is produced separately for each sex, the percentages use the within-sex totals; for example, the cell percent for male golfers with ZIKA present is 73/533 × 100 = 13.70.
TABLE 14.9 ZIKA Random Number Generated Cross Tabulations
Table of Frequencies for case by sports
SPORT | Cases Present | Cases Absent | Total |
MALE GOLF | Cell Freq = 73, Cell Pct = 13.70, Row Pct = 53.28, Col Pct = 26.55 | Cell Freq = 64, Cell Pct = 12.01, Row Pct = 46.72, Col Pct = 24.81 | Row Total = 137, Row Pct = 25.70 |
FEMALE GOLF | Cell Freq = 56, Cell Pct = 11.99, Row Pct = 43.41, Col Pct = 24.35 | Cell Freq = 73, Cell Pct = 15.63, Row Pct = 56.59, Col Pct = 30.80 | Row Total = 129, Row Pct = 27.62 |
COLUMN TOTALS | 505 | 495 | 1000 |
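The cell, row, and column percents in Table 14.9 can be reproduced from the counts. The sketch below is in Python rather than SAS, and the function name cell_percents is hypothetical. It recomputes the male golf cells using the male totals from Table 14.8 (275 present, 258 absent, 533 males), because the table was produced separately for each sex.

```python
def cell_percents(cell, row_total, col_total, table_total):
    """Return (cell %, row %, column %) for one cross-tabulation cell."""
    return (
        round(cell * 100 / table_total, 2),
        round(cell * 100 / row_total, 2),
        round(cell * 100 / col_total, 2),
    )


# Male golfers: 73 present and 64 absent, 137 in total.
male_golf_present = cell_percents(73, 137, 275, 533)
male_golf_absent = cell_percents(64, 137, 258, 533)

print(male_golf_present)  # (13.7, 53.28, 26.55)
print(male_golf_absent)   # (12.01, 46.72, 24.81)
```

Substituting the female counts (56 and 73 of 129 golfers, with 230 present, 237 absent, and 467 females) checks the second row in the same way.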