Advanced Concepts for Applied Statistics in Healthcare
Research using simulated data is often done to predict future events based on real-world data. For example, health researchers can use demographic information about a cohort within the population or about the health care workforce to predict how many new health care professionals we will need in the years to come. This information can then be used to make decisions about the number of students that universities and colleges should accept into their programs in order to meet the predicted needs. Likewise, using computer simulation tools, administrators can create financial forecasting models based on selected expenditure statements and health sector administrative data to estimate future costs and establish budgetary guidelines appropriately.
Creating and using a simulated dataset is also an excellent way to practice the application of statistical methods without having to collect real-world data. That is, we can create a simulated dataset by first establishing the set of independent and dependent variables that are of interest to us in our research project. Next we provide a range of possible responses for each variable. Then we use a random number generator to create a dataset that is estimated from the set of values we provide to the computer.
This is an amazing learning opportunity as it enables you to create, albeit artificially, a complete dataset with the variables in which you are interested. The experience is invaluable as it provides you with the opportunity to critically evaluate both the strategies for input as well as the interpretation for output. Although not required, working through a computer-simulated dataset during the development of your research proposal can help you develop your data analysis plan, and enable you to become familiar with the ranges and nuances of the important variables.
In the following program we will generate a single value based on a SAS random number generator. Here we will control the function of the random number generator by controlling the parameters of the processor to ensure that our output falls within a specific range.
Annotated Example Using SAS to produce data for a random variable.
The following SAS program creates a continuous-discrete variable that we will call AGE. We begin by initializing the variable with the SAS RAND function and we use *1000000 to create the range of the distribution (0 to 1 million) from which the SAS RAND function will draw a number
Next we set the range for the number so that it is in a logical age range for our use. The statements below ensure that the SAS RAND function will produce a value that is above 40 but less than 72
IF AGE>72 THEN AGE=25+ABS((MOD(AGE,50)));
IF AGE<40 THEN AGE=AGE+ABS((MOD(AGE,50)));
We then print the value that we produced with the random number generator.
PROC PRINT; VAR AGE;
The program above produced one age value within a predesignated age range.
Using a random number generator to produce ICD-9 codes
In this next example, we will produce a randomly generated dataset consisting of ICD-9 codes. In this SAS program we first create the data, then we organize the output into categories, and finally we produce a horizontal bar chart of the relative percentile values for each category of ICD-9 codes.
Consider the following program to evaluate the primary diagnosis for a group of patients visiting a healthcare clinic. The data are generated using a customized random number generator that generates data in the form of ICD-9 codes. Since the codes are based on a continuous number line several unique values can be generated to represent the various sub-conditions of that which a patient may present to a healthcare provider. Here we simplify the organization of the codes by creating categories and using the SAS PROC FORMAT command to assign the categories to the output.
SAS Code To Organize Categories Of ICD-9 Codes
VALUE CATFMT 1=’Infectious/parasitic’
4=’ Blood/blood-forming organs’
5=’ Mental disorders’
6=’ Nervous system’
7=’ Sense organs’
8=’ Circulatory system’
9=’ Respiratory system’
10=’ Digestive system’
11=’ Genitourinary system’
13=’ Skin & subcutaneous tissue’
14=’ MSK & connective tissue’
15=’ Congenital anomalies’
16=’ Perinatal period Conditions’
17=’ Injury and poisoning’
20=’ Diagnosis not reported’
18=’ Supplementary classification’;
In this example, the random number generator produces ICD-9 scores as the dependent variable which we assign with the label (PRDIAG). The SAS code uses a DO loop to create a set of 500 scores, representing ICD-9 score for each patient. The data are drawn from a normal distribution at random using the command: PRDIAG=RAND(“NORMAL”)*10000; We seed the random number generator for k=500 times with the CALL STREAMINIT(K); command. We also set a maximum absolute value for the dependent variable using the modulus math function MOD().
DO K=1 TO 500;
CALL STREAMINIT(K); /* SEED THE RNG ON EACH LOOP FOR K TIMES */
/* SET MAX RANDOM NUMBER TO 1500 */
/* ROUND THE RANDOM NUMBERS TO 2 DECIMAL PLACES */
Next we use if-then logic statements to organize the randomly generated numbers into specific categories based on specific cutpoints. Notice these commands are included within the DO loop. The loop is closed with the commands OUTPUT; followed by END;
IF PRDIAG = 95 OR PRDIAG = 99 THEN CATEGORY=20;
IF PRDIAG >=001 AND PRDIAG<94 THEN CATEGORY=1;
IF PRDIAG >=96 AND PRDIAG<99 THEN CATEGORY=1;
IF PRDIAG >99 AND PRDIAG<140 THEN CATEGORY=1;
IF PRDIAG >=140 AND PRDIAG<240 THEN CATEGORY=2;
IF PRDIAG >=240 AND PRDIAG<280 THEN CATEGORY=3;
IF PRDIAG >=280 AND PRDIAG<290 THEN CATEGORY=4;
IF PRDIAG >=290 AND PRDIAG<320 THEN CATEGORY=5;
IF PRDIAG >=320 AND PRDIAG<390 THEN CATEGORY=6;
IF PRDIAG >=390 AND PRDIAG<460 THEN CATEGORY=7;
IF PRDIAG >=460 AND PRDIAG<520 THEN CATEGORY=8;
IF PRDIAG >=520 AND PRDIAG<580 THEN CATEGORY=9;
IF PRDIAG >=580 AND PRDIAG<630 THEN CATEGORY=10;
IF PRDIAG >=630 AND PRDIAG<677 THEN CATEGORY=11;
IF PRDIAG >=680 AND PRDIAG<710 THEN CATEGORY=12;
IF PRDIAG >=710 AND PRDIAG<740 THEN CATEGORY=13;
IF PRDIAG >=740 AND PRDIAG<760 THEN CATEGORY=14;
IF PRDIAG >=760 AND PRDIAG<780 THEN CATEGORY=15;
IF PRDIAG >=780 AND PRDIAG<800 THEN CATEGORY=16;
IF PRDIAG >=800 AND PRDIAG<1000 THEN CATEGORY=17;
IF PRDIAG >=1000 THEN CATEGORY=18;
The SAS commands to create a frequency distribution table are shown below. By using a frequency distribution table the author can provide a standard presentation of important summry statistics within the data set. For example, here we show the organization of the randomly generated numbers within each of the designated categories while also presenting the relative percentages that the categories represent within this data set (see Cumulative Percent column). The frequency distribution table is followed by the horizontal bar chart of the percentage of diagnoses within each category. In this figure we included the data values at the end of each horizontal bar.
PROC FREQ; TABLES CATEGORY;
TITLE1 ‘FREQUENCY DISTRIBUTION FOR RNG ICD-9 CODES’;
PROC SGPLOT DATA=PRDIAG; HBAR CATEGORY/ GROUPDISPLAY = CLUSTER
STAT=PERCENT DATALABELFITPOLICY=NONE DATALABEL;
XAXIS LABEL=”PERCENT OF CASES”;
YAXIS LABEL=”DISEASE/DIAGNOSIS CATEGORIES”;
FORMAT CATEGORY CATFMT. ;
TITLE1 ‘PERCENT OF REPORTED DIAGNOSIS CATEGORY’; RUN;
Frequency distribution for RNG ICD-9 codes
The FREQ Procedure
 ICD-9 codes refer to the International Classification of Disease Codes – version 9.
Consider an example using the Lotto 649
combination of six numbers from 1 to 49 is extremely low:
Which expands to 1 chance in 13,983,816 combinations.
Considering the low probability of winning the grand prize (i.e. all 6 numbers the player chooses will be selected), it is expected that the Lotto 649 lottery should be strategy free. If however, the selection process is not random, but rather follows a specific pattern, then the chance of winning will not remain constant and a strategy to predict outcome could be developed.
Here we will generate data for one draw. That is, using SAS code we will create a random number generator to produce a unique set of 6 numbers that simulates the data that could be generated by the Lotto 649.
The program to generate 6 numbers at random from a set of 49 numbers is shown here. In this instance we have a few constraints. First, we need to be sure that once the first number is drawn, it is not placed back into the set of 49 to be redrawn on a subsequent step. This is because the lotto uses a strategy of sampling without replacement and therefore each draw selects only 6 unique numbers. Likewise, in presenting the output from the random number gnerators we need to be sure that the data are reported as discrete scores and not as decimal based continuous scores; and finally, in filtering the numbers produced we need to be sure that the numbers range from 1 to 49 inclusive.
Copy the following program to your SAS workspace and run the program to see which lucky lottery numbers you can produce. This program has several important features that are noted by the comments /* comment */ within the code.
/* NOTE THE CALL STREAMINIT(13); Command
To create reproducible random numbers then seed the system with the streaminit command. If RAND() is used without an initial streaminit the program will use the value of the system clock and the random numbers will change each time the program is run.
* CALL STREAMINIT(13); /* CREATES REPRODUCIBLE NUMBERS */
DO UNTIL (CHOICE1 NE 0);
CHOICE1 = RAND(“NORMAL”)*1000000000000;
CHOICE1 = ROUND(CHOICE1);
CHOICE1 = 1+(MOD(CHOICE1,49));
CHOICE1 = ABS(CHOICE1);
* CALL STREAMINIT(999);
DO UNTIL (CHOICE2 NE CHOICE1 AND CHOICE2 NE 0);
CHOICE2 = RAND(“NORMAL”)*1000000000000;
CHOICE2 = ROUND(CHOICE2);
CHOICE2 = 1+(MOD(CHOICE2,49));
CHOICE2 = ABS(CHOICE2);
* CALL STREAMINIT(28);
DO UNTIL (CHOICE3 NE CHOICE2 AND CHOICE3 NE CHOICE1 AND CHOICE3 NE 0);
CHOICE3 = RAND(“NORMAL”)*1000000000000;
CHOICE3 = ROUND(CHOICE3);
CHOICE3 = 1+(MOD(CHOICE3,49));
CHOICE3 = ABS(CHOICE3);
* CALL STREAMINIT(218);
DO UNTIL (CHOICE4 NE CHOICE3 AND CHOICE4 NE CHOICE2 AND CHOICE4 NE CHOICE1 AND CHOICE4 NE 0);
CHOICE4 = RAND(“NORMAL”)*1000000000000;
CHOICE4 = ROUND(CHOICE4);
CHOICE4 = 1+(MOD(CHOICE4,49));
CHOICE4 = ABS(CHOICE4);
* CALL STREAMINIT(28);
DO UNTIL (CHOICE5 NE CHOICE4 AND CHOICE5 NE CHOICE3 AND CHOICE5 NE CHOICE2 AND CHOICE5 NE CHOICE1 AND CHOICE5 NE 0);
CHOICE5 = RAND(“NORMAL”)*1000000000000;
CHOICE5 = ROUND(CHOICE5);
CHOICE5 = 1+(MOD(CHOICE5,49));
CHOICE5 = ABS(CHOICE5);
* CALL STREAMINIT(68);
DO UNTIL (CHOICE6 NE CHOICE5 AND CHOICE6 NE CHOICE4 AND CHOICE6 NE CHOICE3 AND CHOICE6 NE CHOICE2 AND CHOICE6 NE CHOICE1 AND CHOICE6 NE 0);
CHOICE6 = RAND(“NORMAL”)*1000000000000;
CHOICE6 = ROUND(CHOICE6);
CHOICE6 = 1+(MOD(CHOICE6,49));
CHOICE6 = ABS(CHOICE6);
PROC PRINT; VAR CHOICE1 CHOICE2 CHOICE3 CHOICE4 CHOICE5 CHOICE6;
So then how many combinations of six numbers are we really talking about?
To compute the number of possible combinations of 6 numbers from the 49 numbers, we need to use the following combinatorial (or factorial) formula. We have 49 numbers choose 6. The number 49 represents the population from which the sample of 6 numbers will be chosen. We write the formula for determining the combinations using the following combinatorial equation:
or we may wish to write the formula using a factorial format as:
Therefore the number of all possible combinations of 6 numbers from a set of 49 consecutive numbers is:
Yet you won’t be happy unless all of your numbers were chosen, but REALLY what is the chance that all six of your numbers will be selected by the lottery machine. Well since you only bought one ticket, then your chance of winning the lottery is 1 in 13,983,816 chances, or
The value 0.000000071 represents the probability associated with your set of scores.
While this example is fairly straight-forward it is somewhat abstract and is not guaranteed to make you a winner. It does however present the basic concepts in presenting a value for a variable that is generated randomly from the set of all possible outcomes. Let’s now turn our attention to an applied health example and see how we can use the utility of the random number generators and computer simulation to create a dataset that exemplifies a real world example.
An Applied Health Example using Simulated Data
Consider for example that you are asked to assess the benefits of a 12-week pulmonary rehabilitation program, consisting of exercise and education, for a cohort of individuals with varying classifications of chronic obstructive pulmonary disease (COPD). The intake data include demographic variables such as the individual’s age, sex, height, and weight; and performance data such as the distance walked in 6 minutes, a physician based rating of COPD, the program participant’s self reported smoking status, years smoked; and physiological measures such as forced expiratory volume in 1 second, and resting heart rate.
In the following example we will generate data artificially using random number generators written with SAS code. In this way we can produce a simulated dataset that we can then use to observe what might happen if we were to actually conduct a research study with the same parameters and considerations.
Using random number generators we create the data set to produce a set of values representing 20 individuals (a random selection of males and females). The variables used in the table along with the variable types and the possible minimum and maximum range for each variable are presented in Table 6.1 below.
Table 6.1 Variables Used To Produce A Sample Of Raw Data For The COPD Clinic
|Variable Name & Variable label||Variable Type||Range of Values|
|Patient identification — Px id||discrete||1 to 20|
|Age in years. — age||discrete||45 to 75|
|Sex — sex||discrete||m: male; f: female;|
|Height — ht||continuous||1.5 m to 2.0 m|
|Weight — wt||continuous||50 kg to 150 kg|
|– Distance walked in 6 minutes — walkdist||continuous||54 metres to 150 meters|
|Rating of COPD severity — severity||discrete||MI: mild; MO: moderate;
|Smoking status — smoke||discrete||S: smoker; EX: ex-smoker;
NON: never smoked
|Years as a smoker — yrsmoke||continuous||<1 to max years smoked|
|Forced expiratory vol in 1 sec — FEV1||continuous||1.5 – 4.0|
|Resting heart rate — rhr||continuous||50 to 100|
Here we will use SAS code to produce the table of random numbers for each of the variables listed above. Recent developments in high speed computing and the creation of the Mersenne-Twister Random Number Generator which is now used by SAS, have led to the creation of the RAND() function. As stated in the SAS Knowledge Base (SAS(R) 9.3 Functions and CALL Routines), the RAND function can generate random numbers for a distribution specified by the user.
In the following example the random number generator was seeded with the statement: call streaminit(n);
/* where n refers to any number you wish to use */
Here we specify that the data we generate will be drawn from the normal distribution.
/* here we use n=13 to seed the RNG */
SAS User Notes provide an explanation of the RAND() function as follows:
where is an observation from the normal distribution with a mean of θ and a standard deviation of λ that has the following probability density function:
θ: is the mean parameter Default:0
: is the standard deviation parameter Default:1
Range: > 0.
Once we have established the parameters for random number selection we begin writing the SAS program to create random number generators as we would for any SAS program. Start by stating the options that you would like included in the output and then name the workspace using normal SAS code.
OPTIONS PAGESIZE=63 LINESIZE=90 DATE;
Our next statement is to create an array. An array is a set of variables that generally have some commonality and that you wish to process together. In our example, we will start by creating an array that we name SCORES, and which has three elements or variables.
ARRAY SCORES SEX SEVERITY SMOKING;
By naming the array, as we have here (SCORES) we can refer to the array SCORES later to reference the specific elements that are contained within. For example, since the array has three elements, then SCORES(1) refers to the first element—the participant’s SEX, while SCORES(2) refers to the second element—the SEVERITY of the COPD condition, and SCORES(3) refers to the third element—the patient’s SMOKING status.
Once we create the workspace in SAS, we next use the do-loop statements to generate a data set consisting of 20 cases. The first do-loop (DO K=1 TO 20) tells SAS to execute the statements within the loop 20 times.
The second do-loop (DO K=1 TO 3) is contained within the first loop and is designed to provide data specifically for the variables SEX, SEVERITY, and SMOKING
DO K=1 TO 20;
DO I=1 TO 3;
Figure 6.1 Functions of The Do-Loop To Generate Random Numbers For The Array SCORES
Finally, we end the do loops with the following statement sequence.
The first END; statement closes the inside loop that begins with DO I=1 TO 3; likewise, the OUTPUT; statement is needed to assign the RNG values to each variable for each participant, the outside loop (DO K=1 TO 20😉 is closed with the second END; statement.
Figure 6.2 Closing the do-loops and producing output
The actual statement sequence to generate a random number for each of the variables in the array is shown here as a three step process beginning by seeding the Random Number Generator (RNG) with CALL STREAMINIT(N); where the (N) can be any number you wish to use. In this first example here we used the number 13 (only because 13 is MY lucky number!).
The call statement initiates or seeds the random number generator.
OPTIONS PAGESIZE=63 LINESIZE=90 DATE;
ARRAY SCORES SEX SEVERITY SMOKING;
DO K=1 TO 20;
DO I=1 TO 3;
This sequence of statemente invokes the random number generator and places a value in each element of the array (i.e., the list of variables).
After generating the random numbers for each variable in the array SCORE (SEX, SEVERITY, SMOKING) we then process the number with a logic filter so that it makes sense in relation to the range of scores that we would expect to see for each given variable.
For example, if the RNG produces a value of 75 for the variable sex, then what does that mean?
Well actually it is meaningless until you assign the meaning.
We assign meaning to the values within a variable using logic statements. For each of the elements (variables) within the array we process the RNG value with logic statements that will make the data relevant to our variables.
For example, the logic statements to convert the RNG values for sex are shown here. In this situation we convert the numeric variable for sex to a text variable that we call sex. Since we have text labels that extend beyond 8 characters we use the length statement with the $ to ensure that the full length of the text label is used.
/* LOGIC STATEMENTS FOR THE VARIABLE: SEX */
LENGTH SEX $12;
IF SEX > 175 THEN SEX = ‘NOT STATED’;
IF SEX >54 AND SEX<175 THEN SEX = ‘FEMALE’;
IF SEX >0 AND SEX<55 THEN SEX = ‘MALE’;
In SAS, the logic statements use the if-then conventional approach. That is, for every IF statement we use a corresponding THEN statement. In this way we process the RNG values to be within the range of logical outcomes for the variable that we are creating.
/* LOGIC STATEMENTS TO CREATE CATEGORIES FOR THE VARIABLE: COPD SEVERITY TYPE */
IF SEVERITY >55 THEN SEVERITY = 3;
IF SEVERITY >27 AND SEVERITY<56 THEN SEVERITY = 2;
IF SEVERITY >4 AND SEVERITY<28 THEN SEVERITY = 1;
/* LOGIC STATEMENTS TO CREATE CATEGORIES FOR THE VARIABLE: SMOKING STATUS */
IF SMOKING >55 THEN SMOKING = 3;
IF SMOKING >27 AND SMOKING<56 THEN SMOKING = 2;
IF SMOKING >4 AND SMOKING<28 THEN SMOKING = 1;
Next we create RNGs for the remaining seven variables that we plan to include in the analysis. We do not need to include these in the array and can simply generate the values when SAS walks through the outer do loop. The independent execution of the rand(“normal”) function can run with a new seed and a new maximum score. Notice that these follow the array processing statements.
The continuous discrete variables were age, years smoked, resting heart rate, weight, and distance walked in 6 minutes (measured in metres).
/* CONTINUOUS DISCRETE VARIABLE AGE */
CALL STREAMINIT(13); AGE=RAND(“NORMAL”)*1000000000000;
IF AGE>72 THEN AGE=35+ABS((MOD(AGE,50)));
/* CONTINUOUS DISCRETE VARIABLE YRSMOKE */
CALL STREAMINIT(13); YRSMOKE=RAND(“NORMAL”)*1000000000000;
/* CONTINUOUS DISCRETE VARIABLE RHR */
CALL STREAMINIT(69); RHR=RAND(“NORMAL”)*1000000000000;
/* CONTINUOUS DISCRETE VARIABLE WT */
CALL STREAMINIT(45); WT=RAND(“NORMAL”)*1000000000000;
IF WT>85 THEN WT=55+ABS((MOD(WT,12)));
/* CONTINUOUS DISCRETE VARIABLE WALKDIST */
Next we created the continuous decimal variables. Again these statements are placed within the do loops to produce a full set of 20 outputs.
/* CONTINUOUS DECIMAL VARIABLE FEV1 */
CALL STREAMINIT(99); FEV1=RAND(“NORMAL”)*1000000000000;
/* CONTINUOUS DECIMAL VARIABLE HT */
CALL STREAMINIT(21); HT=RAND(“NORMAL”)*1000000000000;
IF HT<1.5 THEN HT=1.2+ABS((MOD(HT,1.1)));