Data Cleaning (Methods Discussion Group Feb 24, 2017)

Michelle Dimitris talked about classic data cleaning problems and solutions at today’s Methods Group meeting. From dates to missing data to free text fields, the discussion continued with students and professors highlighting their own experiences and offering solutions to others’ problems. Below is a summary.

Dates are often problematic (hence the proliferation of dating sites!…jk). Month-date-year, date-month-year, we’ve all faced this confusion. The solution? Avoid the confusion in the first place. In other words, ensure that data collection forms always specify the desired date format. Short of altering the survey, Michelle suggests keeping a running list of unclear dates for fieldworkers to clarify if possible. This involves checking date ranges to identify implausible values and ambiguous values whose day and month could be swapped: 04-01-2015, for example, could be April 1 or January 4. Once we have dates in a known format, we often wish to change that format or ‘split’ the date into its year, month, and day components. Stata’s ‘split’ command, or date functions in R or other software, is one way to do this. Parsing lets you specify the character on which to split; in ‘04-01-2015’, for example, the delimiter is ‘-’. The date is then split into separate variables.
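
As a rough sketch of those steps in Stata (the dataset and variable names here, such as visit_date and id, are invented for illustration):

    * split a string date such as "04-01-2015" on the "-" delimiter
    split visit_date, parse("-") generate(visit_part)

    * or convert to a Stata date directly, assuming month-day-year ordering
    generate visit_dt = date(visit_date, "MDY")
    format visit_dt %td

    * list dates outside a plausible study window (window dates are made up)
    list id visit_date if visit_dt < date("01jan2014", "DMY") | visit_dt > date("31dec2016", "DMY")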

While cognitive dissonance is fascinating to psychologists, contradictory responses are another pain for data cleaning. For example, income = $40K for one answer, but income < $20K when the same question is asked another way. Which do we believe? As with dates, it is best to prevent the problem by ensuring there are no duplicate questions on the survey without very good reason. The summarize, tab, and if commands in Stata let us gauge the extent of illogical answers by, for example, counting the cases that reported income > $20K on one question and income < $20K on the other. Hopefully the problem is minimal, but if the fieldworker cannot clarify, we need a priori rules on which answer to consider valid. This solution will inevitably involve some subjective decision making.
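
A sketch of that kind of cross-check in Stata, assuming a continuous variable income_exact and a categorical income_cat where 1 means ‘under $20K’ (both names are hypothetical):

    * how many respondents reported an exact income above $20K
    * yet chose the "under $20K" category on the other question?
    count if income_exact > 20000 & income_exact < . & income_cat == 1

    * inspect the offending records before applying an a priori rule
    list id income_exact income_cat if income_exact > 20000 & income_exact < . & income_cat == 1

    * cross-check the overall pattern: mean exact income within each category
    tab income_cat, summarize(income_exact)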

Similarly, missing data and free text responses are two other classes of classic data cleaning problems. Missing data is problematic for data cleaning when it is ambiguous. Is the data missing because a particular response option was unavailable, or because the person refused to respond? Such ambiguity can be prevented through proper questionnaire design, but, tempting as it is, avoid falling back on a free text ‘Other’ option without first providing an exhaustive list of specific options. Giving respondents free rein to write in answers = coding nightmare. While commands such as ‘split’ (with its parse option) and regular expression searching can help classify them, free text fields are generally only useful if qualitative analysis is planned. Without a plan to analyse the ‘other’ box, writing out as many relevant options as possible and lumping the rest under a simple ‘Other’ check box leads to much cleaner data. Several people highlighted this problem with extracted medical charts in particular, and solutions varied from manual coding to query-based classification schemes using the commands mentioned above.
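
For instance, one way in Stata to keep the reasons for missingness distinct and to roughly classify a free-text field (the variable names and search terms below are invented) is:

    * extended missing values preserve the reason a value is missing
    replace income_exact = .r if refused == 1      // .r = refused
    replace income_exact = .n if not_asked == 1    // .n = question not asked

    * crude regular-expression classification of a free-text "other" field
    generate other_cat = .
    replace other_cat = 1 if regexm(lower(other_text), "diabet")
    replace other_cat = 2 if regexm(lower(other_text), "hypertens|high blood pressure")
    tab other_cat, missing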

Finally, some rules cut across the whole discussion. First, never alter your actual dataset; rather, make all ‘adjustments’ through code. Second, most of these issues can be solved at the data-collection phase, so try to be involved in the instrument design. Finally, remember that your data management software does not have to be the same as your analysis software. Some tools (potentially SAS) are better for data management than others, and it may be worth learning some basic commands in another package to make your life easier.
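
As an illustration of the first rule, a minimal Stata do-file might read the raw file, make every fix in code, and save a separate cleaned copy (the file and variable names below are placeholders):

    * the raw file is never edited by hand; every fix lives in this do-file
    use "raw/survey_raw.dta", clear

    * example adjustment, documented in code: standardise date delimiters
    replace visit_date = subinstr(visit_date, "/", "-", .)

    * write the cleaned version to a separate file, leaving the raw data intact
    save "clean/survey_clean.dta", replace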

For Michelle’s slides, see: https://drive.google.com/file/d/0BwmCu4M3g_jveUFUSlhIaFJpQzQ/view