{"id":283,"date":"2020-04-01T10:57:57","date_gmt":"2020-04-01T14:57:57","guid":{"rendered":"http:\/\/pressbooks.library.upei.ca\/montelpare\/?post_type=chapter&#038;p=283"},"modified":"2020-08-24T14:16:14","modified_gmt":"2020-08-24T18:16:14","slug":"working-with-missing-data","status":"publish","type":"chapter","link":"https:\/\/pressbooks.library.upei.ca\/montelpare\/chapter\/working-with-missing-data\/","title":{"raw":"Working with Missing Data","rendered":"Working with Missing Data"},"content":{"raw":"In this section, we will work through the concepts of dealing with missing data using specific examples that are demonstrated with SAS coding, and which are based on the SAS Studio Education Analytic Suite.\r\n\r\n<hr \/>\r\n\r\n<h2>Missing Values<\/h2>\r\nMissing data are observations that we intended to record but did not. Values can be missing for different reasons and most of the time we don\u2019t know the exact reason why people didn\u2019t answer certain questions. However, we can look at how much data is missing as well as the patterns of missing values and determine whether missingness is related to the variable itself, other variables in the dataset, or has no apparent pattern. In the following sections, we will go through three categories of missing data that are commonly used in research to explain why data is missing.\r\n<h2>How much data is missing?<\/h2>\r\nThe overall percentage of data that is missing is important. Generally, if less than 5% of values are missing then it is acceptable to ignore them (REF). However, the overall percentage missing alone is not enough; you also need to pay attention to <em>which<\/em> data is missing. Often you may need to consider deleting cases (participants) or individual variables that are missing a ton of values. 
This step alone can drastically improve the integrity of your data and reduce the overall percentage of missing values in your dataset.\r\n\r\n<hr \/>\r\n\r\n<h1>Types of Missing Data<\/h1>\r\nThere are several types of missing data, as we will discuss here. Some types are easy to consider and account for, while others are confusing and may be less obvious to the novice researcher.\r\n<h3>Missing at Random (MAR)<\/h3>\r\nIn this situation, data is not actually missing at random, which makes the name of the category very confusing! MAR data happen when missing values are related to another variable in the data set. That is, whether a value of y is missing depends on x, but not on the value of y itself. Here are some examples:\r\n\r\n<strong>Example 1: <\/strong>\r\n\r\nIn a survey of health care professionals, <strong>nurses<\/strong> do not report their<strong> age<\/strong>. In this case, being a nurse (x) predicts the missing data for age (y).\r\n\r\n<strong>Example 2: <\/strong>\r\n\r\nIn a family survey, <strong>single parents <\/strong>do not report their income. In this case, being a single parent (x) predicts the missing data for income (y).\r\n\r\n<strong>Example 3:<\/strong>\r\n\r\nEmployees who fear their manager do not report their job satisfaction. In this case, fear of the manager (x) predicts the missing data for job satisfaction (y), perhaps because these employees are afraid of reprisal.\r\n\r\nNote that in real life you might find MAR patterns in your data but the rationale behind them is still speculative. Unless you go back and check with the participants it is impossible to prove.\r\n<h3>Missing Completely at Random (MCAR)<\/h3>\r\nMissing data that doesn\u2019t have a pattern of missingness is referred to as data missing completely at random (MCAR). This is the ideal situation when you have missing data because missing values are random so any influence they have on your analysis is also random. 
Here are some examples of MCAR situations:\r\n\r\n<strong>Example 1:<\/strong>\r\n\r\nYou conduct a study on heart transplant patients and discover that 10 patients did not answer 2-3 questions on your survey. The questions that are missing are different for each person and there is no pattern.\r\n\r\n<strong>Example 2:<\/strong>\r\n\r\nYou conduct an RCT comparing the effects of fish oil supplementation versus placebo on anxiety levels in nursing students. One participant in the control group forgot to take their supplement on Sept 10th because they were busy. Two participants in the experimental group missed their dose on October 3rd and Nov 19th, respectively. One woke up late. The other one burnt their breakfast and got distracted.\r\n\r\nThere is NO pattern causing the missing data.\r\n<h3>Not Missing at Random (NMAR)<\/h3>\r\nThe last category is data not missing at random. In this situation, missingness depends on the value of the variable itself. In other words, there is a reason why people don\u2019t want to answer that particular question. Usually this happens with sensitive questions.\r\n\r\n<strong>Example 1:<\/strong>\r\n\r\nPeople who are overweight do not report their weight. In this case, being overweight (x) predicts the missing data for weight (x).\r\n\r\n<strong>Example 2:<\/strong>\r\n\r\nSingle parents do not report their marital status. In this case, being a single parent (x) predicts the missing data for marital status (x).\r\n\r\n<hr \/>\r\n\r\n<h3>Analyzing Missing Values<\/h3>\r\nBy default, most SAS procedures exclude observations with missing values from the analysis. The effect this has on your results depends on how much of your data is missing. 
SAS offers a number of robust options for dealing with missing data but the focus of this section is on being able to see how much of your data is missing and examine patterns.\r\n\r\nOne of the easiest ways to examine missing data patterns is to use PROC MI, the multiple imputation (MI) procedure in SAS.\r\n\r\n<code><\/code>\r\n<div class=\"textbox textbox--examples\"><header class=\"textbox__header\">\r\n<p class=\"textbox__title\">The following code uses the NIMPUTE=0 option, which skips the imputation step itself, to create the \"Missing Data Patterns\" table for the specified variables.<\/p>\r\n\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n<div>\r\n<pre>ODS SELECT MISSPATTERN;\r\nPROC MI DATA = NAMEOFDATASET NIMPUTE=0;\r\nVAR VAR_1 VAR_2 VAR_3;\r\nRUN;<\/pre>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<code><\/code>\r\n\r\n<strong>Let\u2019s do an example together:<\/strong>\r\n\r\nIn this example you are interested in knowing more about stress levels of caregivers of older adults with dementia. You send out a pilot survey and get an initial sample of 16 people to answer it. The questionnaire includes demographic variables and five items measuring stress, each rated on a 5-point Likert scale from 1 = strongly disagree to 5 = strongly agree. 
Higher scores indicate higher levels of stress.\r\n<div class=\"textbox textbox--examples\"><header class=\"textbox__header\">\r\n<p class=\"ABodyCopy\">The first step, of course, is to set up your data file in SAS:<\/p>\r\n\r\n<\/header>\r\n<div class=\"textbox__content\">\r\n<pre>OPTIONS PAGESIZE=60 LINESIZE=80 CENTER DATE;\r\nDATA CAREGIVER;\r\nLABEL\r\nID = 'PARTICIPANT ID'\r\nSTRESS1 = 'CAREGIVER STRESS QUESTIONNAIRE ITEM 1'\r\nSTRESS2 = 'CAREGIVER STRESS QUESTIONNAIRE ITEM 2'\r\nSTRESS3 = 'CAREGIVER STRESS QUESTIONNAIRE ITEM 3'\r\nSTRESS4 = 'CAREGIVER STRESS QUESTIONNAIRE ITEM 4'\r\nSTRESS5 = 'CAREGIVER STRESS QUESTIONNAIRE ITEM 5'\r\nSEX = 'SEX';\r\nINPUT ID 1-2 SEX 4 STRESS1 6 STRESS2 8 STRESS3 10 STRESS4 12 STRESS5 14;\r\nDATALINES;\r\n01 0 4 3 4 5 4\r\n02 1 3 2 3 4 4\r\n03 1   3 3   3\r\n04 0 1 2 1 2 3\r\n05 1 4 4   5 3\r\n06 1 2 3 3 4 4\r\n07 0 3 3 5   5\r\n08 1 3 5 4 5 3\r\n09 1 4 4 5 4 4\r\n10 1   2 4 4 4\r\n11 0 4 3 4 5 5\r\n12 1 2 1 2 3 4\r\n13 0 1 2 4 4 2\r\n14 1 3 4 4 5 4\r\n15 1 3 4 5 4 3\r\n16 1   4 3 2 1\r\n;\r\nRUN;<\/pre>\r\n<\/div>\r\n<\/div>\r\n<p class=\"ABodyCopy\" style=\"margin-left: 0cm\">Next we use the MI code template but we replace NAMEOFDATASET with the actual name of our dataset (CAREGIVER) and replace the placeholder variable names (VAR_1, VAR_2, VAR_3) with the actual names of the variables in our dataset:<\/p>\r\n\r\n<div class=\"textbox textbox--examples\">\r\n<div class=\"textbox__content\">\r\n<div>\r\n<pre>ODS SELECT MISSPATTERN;\r\nPROC MI DATA = CAREGIVER NIMPUTE=0;\r\nVAR ID SEX STRESS1 STRESS2 STRESS3 STRESS4 STRESS5;\r\nRUN;<\/pre>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\nAs you can see in the figure below, SAS uses this code to produce a table showing the number of cases with each pattern of missing data. First we look at the left-hand side of the table to examine the patterns of missingness in our dataset. 
In this example, you can see that Group 1 has no missing data because there is an \u201cX\u201d in each of the variables in the dataset. This means that there is data for each variable for participants in this group. The frequency of participants in this group is in the column labelled \u201cfreq\u201d and you can see that there are 11 people with no missing values. By looking at the next column over, which is labelled \u201cpercent\u201d, we can see that this represents 68.75% of the sample.\r\n\r\nThe next pattern of missing data is Group 2. Looking across the columns, we can see that there is no \u201cX\u201d for <strong>stress4<\/strong>. This means that participants in Group 2 answered all the questions except that one. There is only 1 person in this group and they represent 6.25% of the data. We can continue doing the same interpretation for Groups 3-5 in the table.\r\n\r\nThis table also provides the means for each of the variables. Keep in mind that \u201cmeans\u201d for some variables are not meaningful. For example, the mean values provided for Participant ID and sex should be ignored here. What is valuable though is that for continuous variables you can compare their means for participants with different patterns of missing data. For example, for the variable Stress2, you can see that the mean is the same for Groups 1, 2, 4, and 5 but it is higher for Group 3. 
Although this is a small sample for illustrative purposes, you can hopefully see how the information in this table can help you understand the patterns of missing values in your data better.\r\n<h6 style=\"text-align: center\">Output table showing patterns of missing values<\/h6>\r\n<img src=\"http:\/\/pressbooks.library.upei.ca\/montelpare\/wp-content\/uploads\/sites\/49\/2020\/04\/missingData.png\" alt=\"\" class=\"aligncenter wp-image-291 size-full\" width=\"864\" height=\"202\" \/>\r\n\r\n<hr \/>\r\n\r\nFinally, consider that in research we can always expect to have missing data for a variety of reasons. The SAS program provides a powerful platform for calculating data while recognizing strategies to handle missing data.","rendered":"<p>In this section, we will work through the concepts of dealing with missing data using specific examples that are demonstrated with SAS coding, and which are based on the SAS Studio Education Analytic Suite.<\/p>\n<hr \/>\n<h2>Missing Values<\/h2>\n<p>Missing data are observations that we intended to record but did not. Values can be missing for different reasons and most of the time we don\u2019t know the exact reason why people didn\u2019t answer certain questions. However, we can look at how much data is missing as well as the patterns of missing values and determine whether missingness is related to the variable itself, other variables in the dataset, or has no apparent pattern. In the following sections, we will go through three categories of missing data that are commonly used in research to explain why data is missing.<\/p>\n<h2>How much data is missing?<\/h2>\n<p>The overall percentage of data that is missing is important. Generally, if less than 5% of values are missing then it is acceptable to ignore them (REF). However, the overall percentage missing alone is not enough; you also need to pay attention to <em>which<\/em> data is missing. 
Often you may need to consider deleting cases (participants) or individual variables that are missing a ton of values. This step alone can drastically improve the integrity of your data and reduce the overall percentage of missing values in your dataset.<\/p>\n<hr \/>\n<h1>Types of Missing Data<\/h1>\n<p>There are several types of missing data, as we will discuss here. Some types are easy to consider and account for, while others are confusing and may be less obvious to the novice researcher.<\/p>\n<h3>Missing at Random (MAR)<\/h3>\n<p>In this situation, data is not actually missing at random, which makes the name of the category very confusing! MAR data happen when missing values are related to another variable in the data set. That is, whether a value of y is missing depends on x, but not on the value of y itself. Here are some examples:<\/p>\n<p><strong>Example 1: <\/strong><\/p>\n<p>In a survey of health care professionals, <strong>nurses<\/strong> do not report their<strong> age<\/strong>. In this case, being a nurse (x) predicts the missing data for age (y).<\/p>\n<p><strong>Example 2: <\/strong><\/p>\n<p>In a family survey, <strong>single parents <\/strong>do not report their income. In this case, being a single parent (x) predicts the missing data for income (y).<\/p>\n<p><strong>Example 3:<\/strong><\/p>\n<p>Employees who fear their manager do not report their job satisfaction. In this case, fear of the manager (x) predicts the missing data for job satisfaction (y), perhaps because these employees are afraid of reprisal.<\/p>\n<p>Note that in real life you might find MAR patterns in your data but the rationale behind them is still speculative. Unless you go back and check with the participants it is impossible to prove.<\/p>\n<h3>Missing Completely at Random (MCAR)<\/h3>\n<p>Missing data that doesn\u2019t have a pattern of missingness is referred to as data missing completely at random (MCAR). 
This is the ideal situation when you have missing data because missing values are random so any influence they have on your analysis is also random. Here are some examples of MCAR situations:<\/p>\n<p><strong>Example 1:<\/strong><\/p>\n<p>You conduct a study on heart transplant patients and discover that 10 patients did not answer 2-3 questions on your survey. The questions that are missing are different for each person and there is no pattern.<\/p>\n<p><strong>Example 2:<\/strong><\/p>\n<p>You conduct an RCT comparing the effects of fish oil supplementation versus placebo on anxiety levels in nursing students. One participant in the control group forgot to take their supplement on Sept 10th because they were busy. Two participants in the experimental group missed their dose on October 3rd and Nov 19th, respectively. One woke up late. The other one burnt their breakfast and got distracted.<\/p>\n<p>There is NO pattern causing the missing data.<\/p>\n<h3>Not Missing at Random (NMAR)<\/h3>\n<p>The last category is data not missing at random. In this situation, missingness depends on the value of the variable itself. In other words, there is a reason why people don\u2019t want to answer that particular question. Usually this happens with sensitive questions.<\/p>\n<p><strong>Example 1:<\/strong><\/p>\n<p>People who are overweight do not report their weight. In this case, being overweight (x) predicts the missing data for weight (x).<\/p>\n<p><strong>Example 2:<\/strong><\/p>\n<p>Single parents do not report their marital status. In this case, being a single parent (x) predicts the missing data for marital status (x).<\/p>\n<hr \/>\n<h3>Analyzing Missing Values<\/h3>\n<p>By default, most SAS procedures exclude observations with missing values from the analysis. The effect this has on your results depends on how much of your data is missing. 
SAS offers a number of robust options for dealing with missing data but the focus of this section is on being able to see how much of your data is missing and examine patterns.<\/p>\n<p>One of the easiest ways to examine missing data patterns is to use PROC MI, the multiple imputation (MI) procedure in SAS.<\/p>\n<p><code><\/code><\/p>\n<div class=\"textbox textbox--examples\">\n<header class=\"textbox__header\">\n<p class=\"textbox__title\">The following code uses the NIMPUTE=0 option, which skips the imputation step itself, to create the &#8220;Missing Data Patterns&#8221; table for the specified variables.<\/p>\n<\/header>\n<div class=\"textbox__content\">\n<div>\n<pre>ODS SELECT MISSPATTERN;\r\nPROC MI DATA = NAMEOFDATASET NIMPUTE=0;\r\nVAR VAR_1 VAR_2 VAR_3;\r\nRUN;<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<p><code><\/code><\/p>\n<p><strong>Let\u2019s do an example together:<\/strong><\/p>\n<p>In this example you are interested in knowing more about stress levels of caregivers of older adults with dementia. You send out a pilot survey and get an initial sample of 16 people to answer it. The questionnaire includes demographic variables and five items measuring stress, each rated on a 5-point Likert scale from 1 = strongly disagree to 5 = strongly agree. 
Higher scores indicate higher levels of stress.<\/p>\n<div class=\"textbox textbox--examples\">\n<header class=\"textbox__header\">\n<p class=\"ABodyCopy\">The first step, of course, is to set up your data file in SAS:<\/p>\n<\/header>\n<div class=\"textbox__content\">\n<pre>OPTIONS PAGESIZE=60 LINESIZE=80 CENTER DATE;\r\nDATA CAREGIVER;\r\nLABEL\r\nID = 'PARTICIPANT ID'\r\nSTRESS1 = 'CAREGIVER STRESS QUESTIONNAIRE ITEM 1'\r\nSTRESS2 = 'CAREGIVER STRESS QUESTIONNAIRE ITEM 2'\r\nSTRESS3 = 'CAREGIVER STRESS QUESTIONNAIRE ITEM 3'\r\nSTRESS4 = 'CAREGIVER STRESS QUESTIONNAIRE ITEM 4'\r\nSTRESS5 = 'CAREGIVER STRESS QUESTIONNAIRE ITEM 5'\r\nSEX = 'SEX';\r\nINPUT ID 1-2 SEX 4 STRESS1 6 STRESS2 8 STRESS3 10 STRESS4 12 STRESS5 14;\r\nDATALINES;\r\n01 0 4 3 4 5 4\r\n02 1 3 2 3 4 4\r\n03 1   3 3   3\r\n04 0 1 2 1 2 3\r\n05 1 4 4   5 3\r\n06 1 2 3 3 4 4\r\n07 0 3 3 5   5\r\n08 1 3 5 4 5 3\r\n09 1 4 4 5 4 4\r\n10 1   2 4 4 4\r\n11 0 4 3 4 5 5\r\n12 1 2 1 2 3 4\r\n13 0 1 2 4 4 2\r\n14 1 3 4 4 5 4\r\n15 1 3 4 5 4 3\r\n16 1   4 3 2 1\r\n;\r\nRUN;<\/pre>\n<\/div>\n<\/div>\n<p class=\"ABodyCopy\" style=\"margin-left: 0cm\">Next we use the MI code template but we replace NAMEOFDATASET with the actual name of our dataset (CAREGIVER) and replace the placeholder variable names (VAR_1, VAR_2, VAR_3) with the actual names of the variables in our dataset:<\/p>\n<div class=\"textbox textbox--examples\">\n<div class=\"textbox__content\">\n<div>\n<pre>ODS SELECT MISSPATTERN;\r\nPROC MI DATA = CAREGIVER NIMPUTE=0;\r\nVAR ID SEX STRESS1 STRESS2 STRESS3 STRESS4 STRESS5;\r\nRUN;<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<p>As you can see in the figure below, SAS uses this code to produce a table showing the number of cases with each pattern of missing data. First we look at the left-hand side of the table to examine the patterns of missingness in our dataset. 
In this example, you can see that Group 1 has no missing data because there is an \u201cX\u201d in each of the variables in the dataset. This means that there is data for each variable for participants in this group. The frequency of participants in this group is in the column labelled \u201cfreq\u201d and you can see that there are 11 people with no missing values. By looking at the next column over, which is labelled \u201cpercent\u201d, we can see that this represents 68.75% of the sample.<\/p>\n<p>The next pattern of missing data is Group 2. Looking across the columns, we can see that there is no \u201cX\u201d for <strong>stress4<\/strong>. This means that participants in Group 2 answered all the questions except that one. There is only 1 person in this group and they represent 6.25% of the data. We can continue doing the same interpretation for Groups 3-5 in the table.<\/p>\n<p>This table also provides the means for each of the variables. Keep in mind that \u201cmeans\u201d for some variables are not meaningful. For example, the mean values provided for Participant ID and sex should be ignored here. What is valuable though is that for continuous variables you can compare their means for participants with different patterns of missing data. For example, for the variable Stress2, you can see that the mean is the same for Groups 1, 2, 4, and 5 but it is higher for Group 3. 
Although this is a small sample for illustrative purposes, you can hopefully see how the information in this table can help you understand the patterns of missing values in your data better.<\/p>\n<h6 style=\"text-align: center\">Output table showing patterns of missing values<\/h6>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/pressbooks.library.upei.ca\/montelpare\/wp-content\/uploads\/sites\/49\/2020\/04\/missingData.png\" alt=\"\" class=\"aligncenter wp-image-291 size-full\" width=\"864\" height=\"202\" srcset=\"https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-content\/uploads\/sites\/49\/2020\/04\/missingData.png 864w, https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-content\/uploads\/sites\/49\/2020\/04\/missingData-300x70.png 300w, https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-content\/uploads\/sites\/49\/2020\/04\/missingData-768x180.png 768w, https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-content\/uploads\/sites\/49\/2020\/04\/missingData-65x15.png 65w, https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-content\/uploads\/sites\/49\/2020\/04\/missingData-225x53.png 225w, https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-content\/uploads\/sites\/49\/2020\/04\/missingData-350x82.png 350w\" sizes=\"auto, (max-width: 864px) 100vw, 864px\" \/><\/p>\n<hr \/>\n<p>Finally, consider that in research we can always expect to have missing data for a variety of reasons. 
The SAS program provides a powerful platform for calculating data while recognizing strategies to handle missing data.<\/p>\n","protected":false},"author":56,"menu_order":4,"template":"","meta":{"pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[47],"contributor":[],"license":[],"class_list":["post-283","chapter","type-chapter","status-publish","hentry","chapter-type-standard"],"part":180,"_links":{"self":[{"href":"https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-json\/pressbooks\/v2\/chapters\/283","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-json\/wp\/v2\/users\/56"}],"version-history":[{"count":9,"href":"https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-json\/pressbooks\/v2\/chapters\/283\/revisions"}],"predecessor-version":[{"id":285,"href":"https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-json\/pressbooks\/v2\/chapters\/283\/revisions\/285"}],"part":[{"href":"https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-json\/pressbooks\/v2\/parts\/180"}],"metadata":[{"href":"https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-json\/pressbooks\/v2\/chapters\/283\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-json\/wp\/v2\/media?parent=283"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-json\/pressbooks\/v2\/chapter-type?post=283"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-json\/wp\/v2\/contributor?post=283"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/pressbooks.library.upei.ca\/montelpare\/wp-json\/wp\/v2\/license?p
ost=283"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}