Data Screening and Cleaning

William J. Montelpare; Emily Read; Teri McComber; Alyson Mahar; Krista Ritchie

SAS Programming

This textbook was developed to demonstrate biostatistical research applications that use SAS coding and the Webulators to resolve questions that arise in healthcare research. In the following sections, we will work through the concepts of biostatistical applications using specific examples that are demonstrated with SAS coding or the application of the Webulators. It may be tempting to dive straight into your main analysis once you have your data, but in most situations you first need to do some work to prepare it. Before you start testing your main research hypotheses, it is vital to get to know your data, screen it for errors, and make informed decisions about how you will deal with missing values, outliers, and violations of the assumptions underlying the statistical applications you plan to use.

It may sound like a lot of work, but in the long run, taking the time to get to know your data and address these issues will let you be confident in your results and spare you from redoing your analysis because you later find a mistake or erroneous scores in the dataset. A dataset is rarely 100% perfect when you begin your analysis, so as a researcher you need to make informed decisions about how to deal with the limitations of real-world data.

The process of screening and cleaning quantitative data generally involves the following components:

Checking data accuracy (Is the data entered correctly?)
Checking data completeness (How much data is missing? Are there patterns of missingness in the responses or recorded values in the dataset?)
Assessing the distribution of the data (How are values spread out in your sample?)
Assessing the validity & reliability of measures (Are you measuring what you want to measure? Are your results repeatable?)
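
As a rough illustration of the first three components, the following minimal SAS sketch screens a small dataset for invalid codes, missing values, out-of-range scores, and distribution shape. The dataset name screening_demo, the variables id, age, sbp, and sex, and all of the values are hypothetical and were invented for this example. It shows one possible starting point under those assumptions, not a complete screening procedure, and it does not address the validity and reliability of measures.

```sas
/* Hypothetical example data: names and values are invented for illustration. */
/* A period (.) marks a missing numeric value.                                 */
data screening_demo;
    input id age sbp sex $;
    datalines;
1 34 118 F
2 51 132 M
3 . 125 F
4 47 . M
5 430 121 F
6 29 115 X
;
run;

/* Accuracy: a frequency table of a categorical variable exposes */
/* unexpected codes (here, 'X' for sex).                          */
proc freq data=screening_demo;
    tables sex / missing;
run;

/* Accuracy and completeness: N and NMISS count non-missing and missing */
/* values; MIN and MAX reveal out-of-range scores (here, age = 430).     */
proc means data=screening_demo n nmiss min max mean std;
    var age sbp;
run;

/* Distribution: the NORMAL option requests tests of normality, and the */
/* HISTOGRAM statement plots each variable with a fitted normal curve.  */
proc univariate data=screening_demo normal;
    var age sbp;
    histogram age sbp / normal;
run;
```

In this made-up dataset, the output would be expected to flag one missing value each for age and sbp, an implausible maximum age, and an unexpected category for sex. Each of these would then require a decision about correction, recoding, or exclusion.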

License

Applied Statistics in Healthcare Research Copyright © 2020 by William J. Montelpare, Ph.D., Emily Read, Ph.D., Teri McComber, Alyson Mahar, Ph.D., and Krista Ritchie, Ph.D. is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, except where otherwise noted.