Basic Principles

1 Introduction

Basic Principles for Applied Statistics in Healthcare

The primary goal of this textbook is to provide the reader with the opportunity to learn fundamental statistical concepts while introducing the reader to data analysis skills that will enable them to become critical consumers of research and enhance their confidence in asking and answering questions about health-related issues.

The main premise of this book is to present the basic concepts of statistical methods while using SAS coding methods to evaluate data and to develop a conceptual understanding of what the results are telling us about the data. While each chapter introduces the essential theoretical foundation of statistical concepts, each concept is presented through the unpacking of relevant examples using SAS programming code.

From a pedagogical perspective, this textbook will introduce the essential elements of applied health using statistical applications at an intermediate level while providing examples for the reader to relate applications of these methods to health data. The methods include but are not limited to examples from health, with a view to understanding and implementing research design and statistical applications that researchers may use as a basis for the development of research hypotheses and a theoretical foundation for program planning, policy changes, and program modifications.

We will begin this textbook with a few “trade secrets” of researchers that sit at the point where research methods meet applied statistical analysis.

You got your numbers where?

It is critical to know your sources of data intimately. Often in healthcare, we can access secondary sources of data from large databases. Other times we are fortunate to have had a summer student or a recent team evaluation effort that used surveys to document the patient experience. The very first step in any statistical analysis is to understand the numbers you are about to analyze. To decide which analysis to use, we need to understand the methodological and measurement properties of our variables. Broadly speaking, we have parametric options for analyzing continuous data and non-parametric options for categorical or discrete data. The understanding you need of the numbers you are about to analyze goes deeper than this, however. You need to understand exactly how the data were collected and entered to determine whether they are ready for analysis. The key questions you need to be able to answer before trusting that the data are worth your valuable analysis time focus on issues of measurement, risks of bias, and accounting for missing data.


The things we study in applied social and health sciences are very rarely directly observable. Things that are directly observable include the number of days someone stayed in a hospital or waited for a referral to see a specialist. We can directly observe prescriptions filled, but we cannot directly observe the number of times someone took the drug as prescribed. We can ask people to self-report their compliance and hope that they are being fully honest and have perfectly accurate memories. If we are asking a question about the effect of a drug on symptom reduction or recovery, these self-report data become extremely valuable. What is important for the person analyzing the data is to recognize that this is not a perfect picture of exactly what happened for every person in your sample. There is a chance that some people misremembered and that others are trying to please you by reporting better behavior than what is happening at home. These are, hopefully, small deviations from the true patterns of drug taking, which are the events you want to associate statistically with decreased reported symptoms and increased recovery.

There are two ways we handle the fact that our data on difficult-to-observe variables are never perfect. First, we try to be strategic about how the data are collected. If you have a patient report monthly about how they are taking their medication, then there is a greater risk of generalizing behaviors and misremembering exactly how many times medication was taken late or not at all. We can design data collection strategies that are closer to the moment of taking the medication: asking the participants to journal their behaviors on a chart beside their meds, or to use an electronic application that signals when it is time to take the medication and lets them click a checkmark when they have, in fact, taken it. These data can be fed directly into a database to remove the data entry errors that are possible with chart reviews. Taking on such a high-tech strategy requires the resources, technological and financial, to create the application. It also requires the participants to have the technology, and the understanding of the technology, to use the application. This in turn can create a selection bias that systematically excludes people who cannot afford a mobile device or who are not tech-savvy enough to use the application.

The point of this hypothetical scenario is not to discourage you from collecting data! The lesson to be learned is that researchers and healthcare providers make many nuanced decisions that go into exactly how data are collected. If you are going to analyze data, you need to have the full story (the methods of data collection) behind exactly where your numbers came from. With this information, you will better understand and more accurately interpret the results you see when you analyze the data. We cannot expect to capture perfectly things that are not directly observable. We can, however, use principles of good measurement and research methods to come as close to valid (i.e., accurate) and reliable (i.e., consistent) data as possible.

As someone who might be about to analyze data and form conclusions that affect patient care or health system decision making, it is imperative that you take a critical perspective when assessing the quality of the data in front of you. When you report your results, it is important that you make the limitations of your measures transparent so that others can draw their own conclusions as relevant to their contexts. It is also important that you decide, before you start to analyze data, that the data are worth analyzing. When one can simply open a data file and start asking and answering questions, it is tempting to take at face value that the numbers accurately and consistently represent the variables we are interested in. If we do that, we run the danger of perpetuating misinformation, which can have negative consequences for subsequent healthcare decisions.

Surveys and Data Collection Forms

Sometimes there are questionnaires, surveys, or data collection sheets available to us in our places of work: forms that people have used before and that we would be expected to use moving forward. Sometimes we do not have a tool for data collection waiting for us and we must do a literature search to look for a survey that we can give to patients to collect data. This is not an easy task, certainly not for a beginning researcher, or even for someone who has been doing research for a long time but is switching topics and entering a new field (e.g., a cardiologist and a surgical nurse who have done extensive epidemiological work using secondary data from large databases decide to take on a study about patient experience and quality of life requiring primary data collection tools and sampling strategies). It is possible that you are joining a team that has already selected the data collection tools. At whatever stage of the process you find yourself, never trust a snappy survey title! If you are looking for measures of, for example, shared decision making, there are a few options out there. Each one was constructed in a different way and was initially used with a different patient population and research question in mind. There are many steps that go into creating consistent and accurate measures. We will not go into the details of these steps here. We will, however, try to convince you to do three things: read the entire measurement tool before distributing it, think about how similar or different the context you plan to use it in is to the context in which it was developed, and do not change the survey in any way.

Look at the actual measurement tool – read each item. If you understand what you mean by your construct, for example, ‘shared decision making’ in the study you are about to conduct, read the items of the survey you have found and ask yourself if what is being asked represents what you want to know. This is the first step, and one that anyone can do, with or without measurement expertise.

It will also benefit you to assess, on your own or in consultation with a member of your team, the extent to which the survey you have selected has demonstrated validity and reliability in contexts like the one where you plan to collect data. For example, a survey with demonstrated reliability and validity that was created for adults in rural South American villages may not be generalizable to your intended sample of adolescents in a European urban setting.

The third and final word of advice on standardized survey tools is to NOT change the wording of any of the items. Standardized surveys are useful when they have demonstrated validity and reliability in contexts like our own, because we can trust that the measure will capture the thing we want to know about. There is a science to this process of standardization. The order of items, the wording of items, and the associated response scale are all things that have been carefully thought about. If you change them, you can no longer report that the tool you are using has previously been demonstrated to be valid and reliable.

Risks of Bias

Risk of bias is a technical term for the ways in which we can blur the picture of the variables we intend to represent numerically and, in turn, sway the results of our statistical analyses. Data are simply numbers that represent variables or concepts that are important to us in the real world. It is easy to see that a number cannot ever perfectly represent “happiness”, “shared decision-making” or “depression”. When we have good measurement tools, though, we can generate a score that is a good representation of these and so many other complicated constructs, or at least good enough to reveal patterns and relationships to other constructs that can contribute to our knowledge and inform the decisions we make. The goal is to paint the most realistic picture possible of the construct we aim to capture. Knowing the painting will never literally be the same as the thing being painted, we can accept that there will be some degree of error in each person’s or each observation’s score. What we do not want is something so abstract that we do not know if we are looking at a duck or a truck. There are things that we do as researchers, and things that happen during study conduct, that create risks of bias. Biases can be small, and they can be forgivable: though the painting is blurry, we can still see that it is of a duck. We can also bias our data in extreme ways, so extreme that for either our entire sample or for subsets of it, we can no longer decipher what we are measuring at all. We encourage you to learn more about the risks of bias relevant to your area of research and to consider carefully how they might affect the conclusions you draw from your research findings. Two examples of the many sources of bias a study can suffer from are performance bias and attrition bias.

Performance biases are systematic differences in what participants experience during the study that are not part of what the study is meant to test. For example, if you are doing an experiment to assess the impact of online patient education on healthcare decision-making, you want to measure the effect of the intervention (a new online source of useful information) and nothing else. If the participants who are randomly assigned to receive the intervention also get 45 minutes of healthcare provider time to talk about the site and have a personal conversation with a healthcare provider that otherwise (and for the control condition) would not have been experienced, then you are at great risk of performance bias. You no longer know if any observed differences between your control and intervention groups are because of the online learning opportunity, the interaction with the healthcare provider, or, most likely, the combination of the two. The question you are asking yourself here, in the case of an experimental design, is “Did people in both groups have exactly the same experiences, aside from the experimental intervention?” The same goes for descriptive-comparative studies. It is important that the study experience is the same for all of the groups you are collecting data from. For example, if you are comparing the prescription compliance behaviors of 30 – 40-year-olds and 70 – 80-year-olds, then you want all participants to have the same study experience so that you do not add a performance bias by, perhaps, spending more time with the older group and giving them more attention and support to complete daily journals than the younger group.

Attrition bias is a perfect prelude to the next section on the need to account for missing data. Attrition bias is a risk to your sampling strategy. If you have been able to generate a careful random sample (or stratified random sample) from a population of interest, you will be highly motivated to maintain that random sample for the duration of your study. People might cease participation in your study, and their reasons may be systematic. For example, people with more stress in their lives or fewer resources might not be as able to keep coming to a laboratory to participate, or might need to move, and you might lose track of them. If those who stop participating are in any way systematically different from those who stay in the study, you have suffered from attrition bias. The representative sample you started with is no longer representative of the population. It is now representative of those in the population with the resources to participate in the study. In healthcare, particularly if there is an intervention being tested, we need to keep track of when and why people stop participating. In a drug study, people who have adverse events from the drug might be the ones to stop participating, and those who benefit from the drug are more likely to be the people who remain in the study over time. This is a fatal flaw for a study assessing the risks and benefits of a drug. The same goes for a patient-education study. If those who are just not that interested in learning more about their health stop participating and those who have a high degree of interest stay in, then the learning gains you might see through statistical analysis might have more to do with interest than with your educational material. It is important to track the reasons people stop participating and to be transparent about all of this information in your reports. A question you might ask yourself to assess attrition bias in a study where you are comparing groups (randomly assigned or naturally occurring) is: Did the same proportion of people stop participating across the groups, and for the same reasons?
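The closing question above can be turned into a quick tabulation. The following is a minimal sketch in Python (used here for illustration rather than the SAS used in this book); the participant records, group names, and the `attrition_summary` function are hypothetical and not taken from any real study.

```python
from collections import Counter

# Hypothetical records: (group, reason the participant left, or None if they completed)
participants = [
    ("intervention", None), ("intervention", "adverse event"),
    ("intervention", "adverse event"), ("intervention", None),
    ("control", None), ("control", "moved away"),
    ("control", None), ("control", None),
]

def attrition_summary(records):
    """Return {group: (dropout proportion, Counter of dropout reasons)}."""
    enrolled = Counter(group for group, _ in records)
    reasons = {group: Counter() for group in enrolled}
    for group, reason in records:
        if reason is not None:
            reasons[group][reason] += 1
    return {
        group: (sum(reasons[group].values()) / enrolled[group], reasons[group])
        for group in enrolled
    }

summary = attrition_summary(participants)
for group, (proportion, reason_counts) in summary.items():
    print(f"{group}: {proportion:.0%} dropped out, reasons: {dict(reason_counts)}")
```

In this made-up example, both the dropout proportions and the reasons differ between the groups, which is the pattern that should prompt a closer look at attrition bias before any outcome analysis is trusted.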

Accounting for Missing Data

The final methodological bridge we need to cross to prepare you for understanding the nature of the data you are about to analyze is the need to understand exactly why there are blanks, or empty cells, in your dataset. In an ideal world, if you have five variables in your study and 300 participants, you have data on all five variables for all 300 people. This is a complete dataset, and a complete dataset is a rare occurrence. There are two categories of reasons for missing data that are important for you to keep track of: systematic and random. Systematically missing data means there is a reason the data are missing. When this happens, you need to be able to report why the data are missing. An important step when creating a database is to have pre-set missing data codes so that you do not have any empty cells.

What is a missing data code, you might ask? Great question! Before you analyze your data, there should not be any empty cells in your database. All data that are missing should be coded so that you know why the data are missing and can report those reasons in your study reports. When reporting study findings, you must always pair your statistics (e.g., t-statistics and p-values) with the sample size. Often we see that the sample size varies from one analysis to another within a study. This is because we rarely have a perfectly complete dataset. Explaining why the data are missing helps us understand the potential risks of bias and the statistical power of each analysis reported. When deciding what your codes will be, choose numbers that are impossible for your dataset. For example, if -1 is an impossible number for all the data you will collect, then it is a good code. Disciplines, or even local research groups, create codes that they use consistently to represent missing data. For example, one of the authors of this textbook tends to collect survey and demographic data that typically range in score from 0 to 500. If this is the full range of possible values in the dataset, then the research group can consistently use the code 999 for randomly missing data, 888 for attrition due to moving away, and 777 for attrition due to a participant choosing to stop for lack of interest. Death would be an extremely rare occurrence for this research group, so they do not have a pre-set code for this reason for missing data. If it were to happen in a study, a code would be created.
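To see how such codes might be used in practice, here is a minimal sketch in Python (chosen for illustration rather than the SAS used in this book). The codes 999, 888, and 777 and the 0 to 500 score range come from the example above; the scores themselves and the `split_valid_and_missing` helper are hypothetical.

```python
from collections import Counter
from statistics import mean

# Codes follow the example in the text; valid scores range from 0 to 500.
MISSING_CODES = {
    999: "randomly missing",
    888: "attrition: moved away",
    777: "attrition: chose to stop",
}

# Hypothetical survey scores for ten participants
scores = [412, 999, 350, 888, 275, 777, 999, 430, 120, 305]

def split_valid_and_missing(values, codes):
    """Separate analyzable values from missing-data codes and tally the codes."""
    valid = [v for v in values if v not in codes]
    tally = Counter(codes[v] for v in values if v in codes)
    return valid, tally

valid_scores, missing_tally = split_valid_and_missing(scores, MISSING_CODES)

# Report the statistic together with the sample size it is based on
print(f"n = {len(valid_scores)} of {len(scores)}, mean = {mean(valid_scores):.1f}")
for reason, count in missing_tally.items():
    print(f"{reason}: {count}")
```

Note that the codes must be removed before any statistic is computed. If the codes in this made-up column were averaged along with the real scores, the mean would come out above 500, which is impossible for the scale. That is exactly why impossible values make good codes: a forgotten code tends to show up as an obviously wrong result.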

We will provide two examples of systematic missing data. One reason for systematic missing data is that the variable was not applicable for some of the people in your study. For example, if you are looking at the length of hospital stay as a variable, that will only be relevant for patients who needed to be admitted to the hospital. If your dataset includes patients who may or may not have been admitted to hospital, then the sample for any analysis answering questions about the length of hospital stay is restricted, appropriately so, to those who stayed in the hospital. Another common healthcare example of not-applicable systematic missing data is adverse events. If adverse events are part of your dataset, then of course you will only have data on adverse events for the few patients who experienced one. You need to provide a code for those who did not experience an adverse event to indicate that the data are not available because they do not exist. The second kind of systematic missing data, which is quite different in its implications for your study, is attrition. When a person starts a study and does not complete it, you might choose (with their informed consent) to keep the data that have been collected to date. When you do this, you need to explain why you no longer have data for them. For each reason a person stopped participating, you need a code to indicate why they stopped. Three common reasons in health research are death, moving away, and simply choosing not to continue participation (possibly because the study was logistically inconvenient, took more time than they wanted to spend on it, or required inconvenient travel to the lab).

The other kind of missing data will become apparent once you have coded all the missing data that you have an explanation for. The missing data that remain are random. Randomly missing data have no explanation for why they are missing, and there do not appear to be systematic differences for specific groups in your sample. If they make up only a very small percentage of your data, they might be statistically recoverable. For small amounts of missing data, we can impute scores that are best guesses as to what those scores might have been, based on the mean of nearby points, or use multiple imputation based on the variances and covariances of other variables. Randomly missing data also need their own unique code so that you can tabulate how much you have and, perhaps, tell the program how to impute scores to replace these codes.
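As a rough illustration of the simplest of these options, the Python sketch below (Python is used for illustration rather than the SAS used in this book) replaces a randomly-missing code with the mean of the observed scores. It is plain mean imputation, not multiple imputation. The code 999 follows the earlier example in the text; the data, the `impute_with_mean` function, and the 10% cut-off are assumptions made for this sketch, and the function assumes that systematically missing cases have already been set aside.

```python
from statistics import mean

RANDOM_MISSING = 999  # code for randomly missing data, as in the example in the text

def impute_with_mean(values, missing_code=RANDOM_MISSING, max_share=0.10):
    """Replace randomly-missing codes with the mean of the observed values.

    max_share is a hypothetical cut-off: above it the function refuses to
    impute, because imputation is suggested only for small amounts of
    missing data. Systematic missing codes must be removed beforehand.
    """
    observed = [v for v in values if v != missing_code]
    share_missing = (len(values) - len(observed)) / len(values)
    if share_missing > max_share:
        raise ValueError(f"{share_missing:.0%} missing is too much to impute")
    fill = mean(observed)
    return [fill if v == missing_code else v for v in values]

# Hypothetical scores: one of ten values is randomly missing
scores = [410, 395, 999, 402, 388, 415, 397, 405, 391, 400]
imputed = impute_with_mean(scores)
print(imputed)
```

Mean imputation keeps the sample size intact but understates the variability in the data, which is one reason multiple imputation is often preferred when the software supports it.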

We hope that this section has highlighted for you some measurement and methodological issues that must be addressed before you can trust and use the data you analyze. If you are still not quite convinced, let us link these issues directly to what you are about to learn by introducing the concept of the error term. Imprecise measurement adds noise to every score, and that noise ends up in the error term. In statistical analyses, the bigger the error term, the less likely it is that an effect will be seen, even though an effect might actually exist.
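A small numerical sketch can make this concrete. The Python code below (used for illustration rather than the SAS used in this book) computes an independent-samples t statistic with a pooled variance for two hypothetical comparisons. Both have the same 5-point difference in means, but one is measured precisely and the other is noisy. All of the data and the `pooled_t` function are invented for this example.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(group_a, group_b):
    """Independent-samples t statistic using a pooled variance (the error term)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error

# Hypothetical symptom scores; both comparisons have a 5-point mean difference
precise_control = [48, 49, 50, 51, 52]
precise_treated = [43, 44, 45, 46, 47]
noisy_control = [30, 40, 50, 60, 70]
noisy_treated = [25, 35, 45, 55, 65]

t_precise = pooled_t(precise_control, precise_treated)
t_noisy = pooled_t(noisy_control, noisy_treated)
print(f"precise measurement: t = {t_precise:.2f}")
print(f"noisy measurement:   t = {t_noisy:.2f}")
```

With the precise scores the t statistic is 5.0, and with the noisy scores it is 0.5, even though the difference between the group means is identical. With 8 degrees of freedom, the two-tailed critical value at the .05 level is about 2.31, so only the precisely measured comparison would be declared statistically significant.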

A Guiding Principle

An essential guiding principle is captured by the acronym GIGO: garbage in, garbage out. This is especially important when you are selecting measurement tools, inheriting a dataset with variable names you want to trust at face value, or wishing you could ignore how many unexplained empty cells you see in a dataset. If you do not have quality data, there is no way to end up with quality answers to your important research questions.



Applied Statistics in Healthcare Research Copyright © 2020 by William J. Montelpare, Ph.D., Emily Read, Ph.D., Teri McComber, Alyson Mahar, Ph.D., and Krista Ritchie, Ph.D. is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, except where otherwise noted.
