October 2005 — Features

Scrubbing Data for D3M

‘Mopping’ and ‘Scrubbing’ Bad Data
To ensure data quality, two processes must be in place: mopping and scrubbing.

Mopping. “Mopping” (or mapping) data is the process of identifying where and how data are stored throughout an organization. This procedure gives database personnel the information they need to develop a data plan built around organizational needs, wants, costs, and desired outcomes. Data collection and input are important steps in the data-analysis process toward successful D3M. Data collection begins with data mapping, which amounts to surveying both the organization’s needs and the places where data are located.

We all know far too well that data can turn up in unexpected places: boxes in a storage area, desk drawers or cupboards in a classroom, a principal’s file cabinet, or a secretary’s files. These data can be stored on floppy disks, computer hard drives, CD-ROMs, recording tapes, notebooks or sticky pads, grade books, and the “notes” areas of teacher handbooks. Because data can surface from so many locations and in so many conditions, an assortment of data-quality issues often must be resolved before analysis can even begin. The best way to resolve these anomalies is to scrub the data.

Scrubbing. The common adage “garbage in, garbage out” applies squarely to data collection and input, yet data “scrubbing” is often overlooked during data collection. Data scrubbing means removing erroneous pieces of information from a data set. These bits and pieces are “dirty” or “contaminated” debris that can drastically affect the outcome of data analysis. Sources of impurity range from inaccurate input, incomplete information, and improperly formatted structures to the most common source of all: duplicated information. The truth is that analyzing impure or dirty data produces a flawed analysis, which can lead to an inaccurate diagnosis or prognosis and, in turn, to interventions that are doomed to fail.
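To make the four impurity sources concrete, here is a minimal Python sketch of a scrubbing pass over student records. It is an illustration under stated assumptions, not a prescribed method: the function name scrub, the field names student_id, last_name, and grade, and the rule that grades run from 0 (kindergarten) to 12 are all hypothetical choices made for the example.

```python
def scrub(records, required=("student_id", "last_name", "grade")):
    """Split a list of dict records into (clean, rejected).

    Field names and the 0-12 grade range are assumptions for illustration.
    """
    clean, rejected, seen_ids = [], [], set()
    for rec in records:
        # Improper formatting: strip stray whitespace from every text field.
        row = {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}

        # Incomplete information: a required field is missing or blank.
        if any(row.get(field) in (None, "") for field in required):
            rejected.append((row, "incomplete"))
            continue

        # Inaccurate input: grade must be a whole number from 0 to 12.
        grade = str(row["grade"])
        if not grade.isdecimal() or not 0 <= int(grade) <= 12:
            rejected.append((row, "invalid grade"))
            continue

        # Duplication: keep only the first record seen for each student ID.
        if row["student_id"] in seen_ids:
            rejected.append((row, "duplicate"))
            continue

        seen_ids.add(row["student_id"])
        clean.append(row)
    return clean, rejected
```

Rejected records are returned with a reason rather than silently discarded, so that a person can review them and correct the source where possible.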

Data scrubbing, or cleansing, is crucial: the process yields high-quality data that are appropriate for effective data analysis. It removes erroneous marks or debris, datum by datum, either manually or through a series of scripts. Fortunately, data scrubbing requires no special art or science. But beware: although a plethora of vendors promote their data-cleansing software as the best on the market, many of these products would not serve the educational environment effectively. For example, the most popular scrubber in today’s marketplace is a zip-code scrubber, yet of all the demographic data that affect outcomes in a K-12 school environment, zip-code errors have the least impact. In most cases, students attend schools within their zip-code region, so there is no need to cleanse zip codes with expensive software. Other demographic information entered into educational databases without consistent rules or guidelines, however, may well require scrubbing later.
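As an example of the kind of short script that can stand in for expensive software, the Python sketch below applies one consistent rule to a field that is often typed inconsistently. The choice of grade level as the field, the function name normalize_grade, the accepted spellings, and the convention of coding kindergarten as 0 are assumptions made for illustration; a district would substitute its own fields and rules.

```python
import re

# Hypothetical list of spellings treated as kindergarten.
KINDERGARTEN = {"k", "kg", "kdg", "kinder", "kindergarten"}

# Accepts entries such as "3", "03", "3rd", "Grade 3", or "3rd grade".
GRADE_PATTERN = re.compile(
    r"(?:grade\s*)?0*(\d{1,2})(?:st|nd|rd|th)?(?:\s*grade)?"
)

def normalize_grade(raw):
    """Map a free-text grade entry to an integer 0-12, or None if unrecognized."""
    text = raw.strip().lower()
    if text in KINDERGARTEN:
        return 0
    match = GRADE_PATTERN.fullmatch(text)
    if match and 1 <= int(match.group(1)) <= 12:
        return int(match.group(1))
    # Unrecognized entries are flagged for manual review, not guessed at.
    return None
```

Applying the same rule at the point of data entry, rather than only afterward, would reduce the amount of scrubbing needed later.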