October 2005 — Features
Scrubbing Data for D3M
‘Mopping’ and ‘Scrubbing’ Bad Data
To ensure data quality, two processes must
be in place: mopping and scrubbing.
Mopping. “Mopping” (or mapping)
data is the process of locating or identifying
where and how data are stored throughout
the respective organization. This procedure provides database personnel with the
information required to develop a data
plan surrounding organizational needs,
wants, costs, and desired outcomes.
Data collection and input are important
steps in the data-analysis process
toward successful D3M. Data collection
begins with data mapping, which equates
to surveying needs as well as the areas
where data are located. We all know far too
well that data can be found in unexpected
places such as boxes in a storage area, desk
drawers or cupboards within a classroom,
a file cabinet belonging to a principal, or
the files of a secretary. These data can be
stored on floppy disks, computer hard
drives, CD-ROMs, recording tapes, notebooks or sticky pads, grade books, and
“notes” areas within teacher handbooks.
Because of the various potential locations
and conditions of data (once resurrected
from their hidden places), a concoction of
data-quality issues often needs to be
resolved before there is even a hope that
analysis can begin. The best way to resolve
these anomalies: scrub the data.
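
Before any scrubbing starts, the data map described above can be kept as a simple inventory of what was found, where, and on what medium. The following is a minimal sketch, not a prescribed method: the `DataSource` fields, the helper names, and the sample entries are all hypothetical, chosen only to mirror the kinds of locations and media mentioned in this article.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """One entry in a hypothetical data map."""
    description: str  # what the data are
    location: str     # where they were found
    medium: str       # how they are stored
    digital: bool     # False means manual entry is needed before analysis


# Illustrative entries only; a real inventory comes from surveying the organization.
inventory = [
    DataSource("attendance sheets", "storage-area boxes", "paper", False),
    DataSource("grade records", "teacher grade books", "paper", False),
    DataSource("test scores", "principal's file cabinet", "CD-ROM", True),
    DataSource("enrollment forms", "secretary's files", "floppy disk", True),
]


def group_by_medium(sources):
    """Group source descriptions by storage medium."""
    grouped = defaultdict(list)
    for source in sources:
        grouped[source.medium].append(source.description)
    return dict(grouped)


def needs_data_entry(sources):
    """List the sources that must be keyed in before they can be analyzed."""
    return [source.description for source in sources if not source.digital]


by_medium = group_by_medium(inventory)
to_enter = needs_data_entry(inventory)
```

Grouping by medium gives database personnel a rough picture of the conversion work ahead, which feeds the cost side of the data plan.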
Scrubbing. The common adage “garbage in, garbage out” applies directly to data collection and input; nevertheless, data “scrubbing” is often overlooked during data collection. Data scrubbing means removing erroneous pieces of information from a data set. These bits and pieces are debris: “dirty” or “contaminated” data that can drastically affect the outcome of data analysis. Data impurities range from inaccurate input, incomplete information, and improperly formatted structures to the most common source of impurity: duplication of information. The truth is, analyzing impure or dirty data will result in a flawed analysis, potentially leading to an inaccurate diagnosis or prognosis and, in turn, to interventions that are doomed to fail.
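
The four sources of impurity named above (inaccurate input, incomplete information, improper formatting, and duplication) can each be checked with a short script. What follows is a minimal sketch under stated assumptions: the record layout (`student_id`, `name`, `birth_date`, `grade`), the date format, the list of valid grades, and the sample rows are all hypothetical and would need to be replaced with a district's own rules.

```python
import re

# Hypothetical record layout and rules; adjust to the local database.
REQUIRED_FIELDS = ("student_id", "name", "birth_date", "grade")
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # assumes YYYY-MM-DD
VALID_GRADES = {"K"} | {str(n) for n in range(1, 13)}


def scrub(records):
    """Split records into clean rows and (row, reason) rejects."""
    clean, rejected, seen_ids = [], [], set()
    for row in records:
        # Trim stray whitespace left over from manual entry.
        row = {k: (v.strip() if isinstance(v, str) else v) for k, v in row.items()}
        missing = [f for f in REQUIRED_FIELDS if not row.get(f)]
        if missing:
            rejected.append((row, "incomplete: " + ", ".join(missing)))
        elif not DATE_PATTERN.match(row["birth_date"]):
            rejected.append((row, "improperly formatted birth_date"))
        elif row["grade"] not in VALID_GRADES:
            rejected.append((row, "inaccurate grade"))
        elif row["student_id"] in seen_ids:
            rejected.append((row, "duplicate student_id"))
        else:
            seen_ids.add(row["student_id"])
            clean.append(row)
    return clean, rejected


# Made-up sample rows, one per kind of impurity.
records = [
    {"student_id": "1001", "name": "Ana Ruiz ", "birth_date": "1995-04-12", "grade": "5"},
    {"student_id": "1001", "name": "Ana Ruiz", "birth_date": "1995-04-12", "grade": "5"},
    {"student_id": "1002", "name": "", "birth_date": "1994-11-02", "grade": "6"},
    {"student_id": "1003", "name": "Ben Cho", "birth_date": "4/7/96", "grade": "4"},
    {"student_id": "1004", "name": "Cy Lee", "birth_date": "1993-01-30", "grade": "14"},
]

clean, rejected = scrub(records)
reasons = [reason for _, reason in rejected]
```

Rejected rows are returned with a reason rather than silently dropped, so that a person can decide whether to correct or discard each one.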
Data scrubbing, or cleansing, is crucial; the process results in high-quality data that are appropriate for effective data analysis. It removes erroneous marks or debris, datum by datum, either manually or through a series of scripts. Fortunately, data scrubbing requires no special art or science. But beware: Although a plethora of vendors market their data-cleansing software as the best available, many of these products would not serve the educational environment effectively. For example, the most popular scrubber in today’s marketplace is a zip-code scrubber, yet of all demographic data, zip-code errors have the least impact on outcomes within a K-12 school environment. In most cases, students attend schools within their zip-code region, so there is no need to cleanse zip codes with expensive software. However, other demographic information entered into educational databases without consistent rules or guidelines may well require scrubbing later.
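
As an illustration of scrubbing a field that was entered without consistent rules, the sketch below normalizes free-form grade-level entries to a single convention. Grade level is used here only as an example of such a field; the function name `normalize_grade`, the accepted spellings, and the target codes (“K” and “1” through “12”) are assumptions, not rules taken from any particular student-information system.

```python
import re

# Hypothetical spellings of kindergarten seen in hand-entered data.
KINDERGARTEN_ALIASES = {"k", "kg", "kinder", "kindergarten"}

# Accepts forms such as "5", "05", "5th", "grade 5", and "5th grade".
GRADE_PATTERN = re.compile(r"(?:grade\s*)?0*(\d{1,2})(?:st|nd|rd|th)?(?:\s*grade)?")


def normalize_grade(raw):
    """Map a free-form grade entry to 'K' or '1'..'12'; None if unrecognized."""
    if raw is None:
        return None
    text = raw.strip().lower()
    if text in KINDERGARTEN_ALIASES:
        return "K"
    match = GRADE_PATTERN.fullmatch(text)
    if match and 1 <= int(match.group(1)) <= 12:
        return str(int(match.group(1)))
    return None  # leave for manual review rather than guessing


entries = ["5th", "Grade 05", " K ", "kindergarten", "10", "5th grade", "13", "fifth"]
normalized = [normalize_grade(entry) for entry in entries]
```

Entries the function cannot map come back as `None`, so they can be set aside for a person to review instead of being forced into a wrong value.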