October 2005 — Features
Scrubbing Data for D3M
A Closer Look at Dirty Data
Various filing methods. Think your
school’s data can’t be all that dirty? Think
again. In a common database repository, a
“routine” method for data input is rare, unless specified guidelines and structures
are in place to allow consistency.
Information can be filed in a myriad of
ways (e.g., names, initials, seating charts, or
ID numbers). Sometimes data are stored
on electronic media with first and last
names or numeric or alphabetical identification. Numeric identification can consist
of multiple digits that are left- or right-
justified. For example, a student with an
ID number of 45632 is probably the same
student as 00045632, but is unlikely to be
the same student as 45632000.
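The ID example can be sketched in a few lines of Python. This is a minimal illustration, not a procedure from the article: the function name `normalize_student_id` and the eight-digit width are assumptions, and it presumes IDs are purely numeric with leading zeros used only as padding.

```python
def normalize_student_id(raw: str, width: int = 8) -> str:
    """Return one canonical, zero-padded form of a numeric student ID.

    Assumption: leading zeros are padding only, so 45632 and 00045632
    denote the same student. Trailing zeros are significant and are kept.
    """
    digits = raw.strip()
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a numeric ID: {raw!r}")
    # Remove left padding, then re-pad every ID to the same width.
    core = digits.lstrip("0") or "0"
    if len(core) > width:
        raise ValueError(f"ID longer than {width} digits: {raw!r}")
    return core.zfill(width)


same = normalize_student_id("45632") == normalize_student_id("00045632")
different = normalize_student_id("45632") != normalize_student_id("45632000")
print(same, different)
```

Comparing the normalized strings rather than the raw entries treats the left-padded and unpadded forms as one student while keeping 45632000 distinct.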
What’s that address again? Another example is the use of student addresses as identification. Some computer-database users might enter only the letter “W” to indicate “West” in an address, while others might type the full word in the street name. As simple as this may seem, an incorrect entry contaminates the database. For instance, entering the street information as “W. Parker Boulevard” is quite different from entering “West Parker Boulevard.” The first refers to the street’s direction (i.e., the street name is “Parker,” and the street runs east to west); the latter format refers to the street name “West Parker.” Even the “Boulevard” designation presents various formats (e.g., Bl., Blv., Bd., or Blvd.).
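A short Python sketch shows how the “Boulevard” variants might be collapsed to one form. The lookup table, the function name `normalize_street`, and the choice of “Boulevard” as the canonical spelling are all assumptions made for illustration. Because a script cannot tell from the text alone whether “W.” is a direction or part of the street name, the sketch flags such entries for human review instead of rewriting them.

```python
from typing import Tuple

# Hypothetical lookup table: the variants named in the article, one canonical form.
SUFFIX_VARIANTS = {
    "bl": "Boulevard",
    "blv": "Boulevard",
    "bd": "Boulevard",
    "blvd": "Boulevard",
    "boulevard": "Boulevard",
}
DIRECTION_WORDS = {"n", "north", "s", "south", "e", "east", "w", "west"}


def normalize_street(raw: str) -> Tuple[str, bool]:
    """Return (street with canonical suffix, needs_review flag)."""
    tokens = raw.split()
    if not tokens:
        return "", False
    # Canonicalize the last token when it is a known "Boulevard" variant.
    key = tokens[-1].rstrip(".").lower()
    if key in SUFFIX_VARIANTS:
        tokens[-1] = SUFFIX_VARIANTS[key]
    # A leading direction word is ambiguous (direction vs. part of the
    # name), so it is left untouched and the entry is flagged.
    first = tokens[0].rstrip(".").lower()
    needs_review = len(tokens) > 1 and first in DIRECTION_WORDS
    return " ".join(tokens), needs_review


print(normalize_street("W. Parker Blv."))
print(normalize_street("Parker Bd."))
```

Leaving the ambiguous direction word alone is deliberate: silently converting “W.” to “West” could merge two different streets into one record.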
No shortcuts. Again, consistent and uniform formatting upon data entry is critical to data quality. When simple guideline procedures are neglected and shortcuts, abbreviations, careless entries, or use of common nomenclatures occur, the resulting data are impure and jeopardize proper analysis. With such patterns of anomalies in formatting information, data corruption is likely; therefore, conversion or translation scripting is used to clean the data entry. And bad data are like a stealth virus: They run in the background, systematically “gnawing away” at the database residing on the hard drive, before becoming public. They eventually strike in an invasive manner, and, before you know it, all of the data within the computer system are corrupted. Dirty data are much like calcium buildup in a water pipe; sooner or later, the consequence will become evident. On the other hand, clean data uphold the integrity of analysis and effective D3M.
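Before writing a conversion script, it can help to see which entry formats actually occur in a field. The following Python sketch is one possible first step, not a method described in the article: the helper names `shape` and `profile_formats` are hypothetical, and the sample values are illustrative only. It reduces each value to its “shape” (digits become 9, letters become A) and counts how often each shape appears, so rare formats stand out.

```python
from collections import Counter
from typing import Iterable


def shape(value: str) -> str:
    """Reduce a value to its format: digits become 9, letters become A."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # keep punctuation and spaces as they are
    return "".join(out)


def profile_formats(values: Iterable[str]) -> Counter:
    """Count how many entries share each format shape."""
    return Counter(shape(v) for v in values)


# Illustrative sample values only, not real student data.
sample_ids = ["45632", "00045632", "45632000", "A1234"]
for pattern, count in profile_formats(sample_ids).most_common():
    print(pattern, count)
```

A shape that appears only a handful of times in a large field is a candidate for the kind of shortcut or careless entry described above.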
Process Change for Clean Data
Design a floor plan. Dirty data are often associated with data-entry shortcuts, misinterpretations or misunderstandings, and carelessness due to the personal preferences of the individual keying or otherwise inputting the data (and let’s face it: even today, data entry remains a largely manual process). Preventing them requires a “floor plan,” or procedural layout. Data cleansing must become the responsibility of each individual playing a role in data creation and/or processing at any level, and that begins with the instrument used to collect the data. Therefore, schools and districts should have a specific and consistent procedure in place for requesting data, which includes the following: