Scrubbing Data for D3M

by Neil Mercurius

Is your school or district planning to develop or purchase tools and technologies to better enable data-driven decision-making (D3M)? First, you’ll need to make sure that your data is top-quality.

Data-driven decision-making (D3M) appears to be the new buzz phrase for this century, the information age. On the education front, teachers and administrators are engaging in data-centered dialog in grade-level meetings, lounges, hallways, and classrooms as they brainstorm toward closing the gap in student achievement. Clearly, such discussion among professional educators has dramatically increased since the enactment of the No Child Left Behind Act of 2001. And as a result of NCLB, the teaching community is establishing data repositories to analyze information to improve teaching and learning within the school environment. In addition, business organizations specializing in data collection, analysis, and technology reporting are rapidly emerging in large numbers to assist schools as they struggle to meet state, federal, and local requirements. Many software vendors, in fact, now serve the education market through their development of disaggregation products—those specifically designed to help schools meet their goals and close the gap in student achievement via D3M.

Using a Data Repository
Of course, data collection within schools and toward the improvement of teaching and learning is not a new practice. Schools have been collecting, analyzing, and reporting data manually to local, state, and federal agencies for many years. Over time, and within the personnel pool of most schools, technically skilled employees have played a customary role in developing electronic spreadsheets and databases to meet the growing demand for data. Yet, as this demand continues to grow, increasingly greater sophistication in data analysis—necessitating highly qualified database personnel and systems—is also expected on an ongoing basis. No longer is large-scale data analysis conducted with Excel, FileMaker Pro, or similar spreadsheet and flat-file programs. Instead, robust “relational” database systems, such as Microsoft SQL Server or Oracle engines, are now used to manipulate such data. After the immeasurable hours once invested in the manual “crunching” of numbers, new technologies have minimized the burden of educational reporting. Technology has assumed the role of managing data, freeing teachers and administrators to focus on making decisions drawn from the data. The influx of relational database management systems (RDBMSs), data warehousing (DW) strategies, and procedures based upon the Schools Interoperability Framework (SIF) repository model is generating a knowledge base, or pool of data-analysis experts, that is significantly influencing the decision-making of school administrators as they review the multiple measures now available to them. But before any decision can be made with regard to developing or purchasing an RDBMS, incorporating a DW strategy, or even considering basing an important decision on a SIF repository model, schools and districts must have procedures in place for ensuring quality data.

‘Mopping’ and ‘Scrubbing’ Bad Data
To ensure data quality, two processes must be in place: mopping and scrubbing. Mopping. “Mopping” (or mapping) data is the process of locating or identifying where and how data are stored throughout the respective organization. This procedure provides database personnel with the information required to develop a data plan surrounding organizational needs, wants, costs, and desired outcomes. Data collection and input are important steps in the data-analysis process toward successful D3M. Data collection begins with data mapping, which equates to surveying needs as well as the areas where data are located. We all know far too well that data can be found in unexpected places such as boxes in a storage area, desk drawers or cupboards within a classroom, a file cabinet belonging to a principal, or the files of a secretary. This data can be stored on floppy disks, computer hard drives, CD-ROMs, recording tapes, notebooks or sticky pads, grade books, and “notes” areas within teacher handbooks. Because of the various potential locations and conditions of data (once resurrected from their hidden places), a concoction of data-quality issues often needs to be resolved before there is even a hope that analysis can begin. The best way to resolve these anomalies: scrub the data.

Scrubbing. The common adage “garbage in, garbage out” is appropriately analogous to data collection and input; however, data “scrubbing” is often overlooked within the process of data collection. Data scrubbing pertains to removing erroneous pieces of information within a data set. These bits and pieces of information are debris that is “dirty” or “contaminated” and can drastically affect the outcome of data analysis. The sources of data impurities range from inaccurate input, incomplete information, and improperly formatted structures to the most common source of all: duplication of information. The truth is, analysis built upon impure or dirty data will be flawed, potentially leading to an inaccurate prognosis or diagnosis, and ultimately to the implementation of misguided interventions.
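Duplication, named above as the most common impurity, can often be surfaced with a short script before any analysis begins. The sketch below groups records on a normalized name-and-birthdate key and reports groups with more than one member; the field names (last, first, dob) are illustrative, not drawn from any particular student-information system.

```python
from collections import defaultdict

def find_duplicates(records):
    """Group records on a normalized (last, first, dob) key and return
    the groups with more than one member -- likely duplicate entries."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["last"].strip().lower(),
               rec["first"].strip().lower(),
               rec["dob"])
        groups[key].append(rec)
    return [group for group in groups.values() if len(group) > 1]

roster = [
    {"last": "Smith", "first": "Ann",  "dob": "1994-03-02"},
    {"last": "SMITH", "first": "ann ", "dob": "1994-03-02"},  # same student, keyed twice
    {"last": "Jones", "first": "Raul", "dob": "1995-07-19"},
]
# find_duplicates(roster) returns one group containing both "Smith" rows.
```

Matching on exact normalized fields is deliberately conservative: it flags obvious double entries while leaving near-misses (misspelled names, transposed dates) for human review.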

Data scrubbing or cleansing is crucial; the process results in high-quality data that are appropriate for effective data analysis. It removes fallacious marks or debris—datum by datum—either manually or through a series of scripts. Yet, fortunately, data scrubbing requires no special art or science. But beware: Although a plethora of vendors marketing data-cleansing software project their individual products as the best on the market, many of these products would not serve the educational environment in an effective manner. For example, the most popular scrubber within today’s marketplace is a zip-code scrubber. Zip-code errors have the least impact of all demographic data affecting outcomes within a K-12 school environment. In most cases, students attend schools within their zip-code region, so there is no need for cleansing zip codes with expensive software. However, other demographic information entered into educational databases with no consistent rules or guidelines can indeed require future scrubbing.

A Closer Look at Dirty Data
Various filing methods. Think your school’s data can’t be all that dirty? Think again. In a common database repository, a “routine” method for data input is rare, unless specified guidelines and structures are in place to allow consistency. Information can be filed in a myriad of ways (e.g., names, initials, seating charts, or ID numbers). Sometimes data are stored on electronic media with first and last names or numeric or alphabetical identification. Numeric identification can consist of multiple digits that are left- or right-justified. For example, a student with an ID number of 45632 is probably the same student as 00045632, but is unlikely to be the same student as 45632000.
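One way to reconcile such entries is to strip whitespace and zero-pad every ID to a fixed width; the eight-digit width below is an assumed district convention, not a standard. Note that this resolves only leading-zero variants: a right-padded 45632000 still comes out as a distinct ID, which is exactly the behavior the example above calls for.

```python
def normalize_id(raw, width=8):
    """Strip whitespace and left-pad a student ID with zeros to a fixed
    width (the eight-digit width is an assumed district convention)."""
    return str(raw).strip().zfill(width)

normalize_id("45632")     # -> "00045632", matching the zero-padded form
normalize_id("45632000")  # -> "45632000", still a different student
```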

What’s that address again? Another example is the use of student addresses as identification. Some computer-database users might enter only the letter “W” to indicate “West” in an address, while others might type the full word in the street name. As simple as this may seem, an incorrect entry contaminates the database. For instance, entering the street information as “W. Parker Boulevard” is quite different from entering “West Parker Boulevard.” The first refers to the street’s direction (i.e., the street name is “Parker,” and the street runs east to west); the latter format refers to the street name “West Parker.” Even the “Boulevard” designation presents various formats (e.g., Bl., Blv., Bd., or Blvd.).
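A cleansing script can safely collapse the “Boulevard” variants to a single spelling, but, as the example above shows, it should not try to expand directional prefixes automatically: “W.” versus “West” can change the street name itself. The suffix table below is a hypothetical starting point; a district would maintain a fuller list.

```python
# Hypothetical suffix table; a district would maintain a fuller list.
SUFFIXES = {"bl.": "Blvd", "blv.": "Blvd", "bd.": "Blvd",
            "blvd.": "Blvd", "blvd": "Blvd", "boulevard": "Blvd"}

def normalize_suffix(address):
    """Collapse the 'Boulevard' variants to one spelling. Directional
    prefixes ('W.' vs 'West') are deliberately left untouched, since
    expanding them can change the street name and needs human review."""
    parts = address.split()
    last = parts[-1].lower()
    if last in SUFFIXES:
        parts[-1] = SUFFIXES[last]
    return " ".join(parts)

normalize_suffix("5567 W. Parker Bl.")  # -> "5567 W. Parker Blvd"
```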

No shortcuts. Again, consistent and uniform formatting upon data entry is critical to data quality. When simple guideline procedures are neglected and shortcuts, abbreviations, careless entries, or use of common nomenclatures occur, the resulting data are impure and jeopardize proper analysis. With such patterns of anomalies in formatting information, data corruption is likely; therefore, conversion or translation scripting is used to clean the data entry. And bad data are like a stealth virus: They run in the background, systematically “gnawing away” at the database residing on the hard drive, before becoming public. Eventually they strike in an invasive manner, and, before you know it, all of the data within the computer system are corrupted. Dirty data are much like calcium buildup in a water pipe; sooner or later, the consequence will become evident. On the other hand, clean data uphold the integrity of analysis and effective D3M.

Process Change for Clean Data
Design a floor plan. Because dirty data are often associated with data-entry shortcuts, misinterpretations/misunderstandings, and carelessness due to the personal preferences of the individual keying or otherwise inputting the data (and let’s face it—even today, data entry remains a largely manual process), data entry requires a “floor plan,” or procedural layout. Data cleansing must become the responsibility of each individual playing a role in data creation and/or processing at any level, and that begins with the instrument used to collect the data. Therefore, schools and districts should have a specific and consistent procedure in place for requesting data, which includes the following:
• Use a standard form to request data
• Be specific about the information sought
• State the purpose for the request
• Explain how the data are needed
• Inform as to how often the data will be requested

Preventing the Pitfalls
DB design attributes vs. field entry. As school personnel advance their skills in using RDBMS, DW, and SIF strategies, data quality becomes increasingly crucial. In fact, RDBMS, DW, and SIF strategies require data integrity and a schema of the data fields and layouts. Normally, a unique identification number with a required number of digits serves as the primary key. Unique numbers throughout a school or school district will promote optimal data quality, data integration, data migration, data profiling, data analysis, and data management. But these attributes are of greater importance in the design of a database than in data cleansing. Whether a particular field is labeled “M” for middle name or marital status, or identified as “Gender” or “Sex,” is not as critical as how the data response is identified inside the field. For example, is the gender or sex of the respondent “male,” “ml,” “m,” “1,” or “2”? Inconsistencies will present problems and ultimately require scrubbing.
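The gender-field inconsistencies just described lend themselves to a small mapping table. The codes below, including the assumption that “1” means male and “2” means female, are for illustration only; a real district would document its own conventions. Anything unrecognized is flagged for manual review rather than guessed at.

```python
# Assumed code conventions for illustration, including "1" = male, "2" = female.
GENDER_MAP = {"male": "M", "m": "M", "ml": "M", "1": "M",
              "female": "F", "f": "F", "fm": "F", "2": "F"}

def scrub_gender(value):
    """Map the many ad-hoc gender codes onto a single 'M'/'F' pair;
    flag anything unrecognized for manual review instead of guessing."""
    return GENDER_MAP.get(str(value).strip().lower(), "REVIEW")

scrub_gender("Male")  # -> "M"
scrub_gender("2")     # -> "F"
scrub_gender("x")     # -> "REVIEW"
```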

Multiple uses and other ills. Data impurities within school databases often derive from a single database being used for multiple purposes. When a variety of users are called upon to input important information into a single database, there must be clear guidelines surrounding the data entry. Another way a database can be saturated with impurities is through the use of hieroglyphics to denote students with special needs or students within special programs. For example, a database used to identify services to English-language learners may stipulate data entry to include “E,” “e,” or “El” behind student names or following their student identification numbers or ethnicity. A GATE-identified student may have a “+,” “++,” or “G” appended to the response within one of these demographic fields to represent participation in the program. Scrubbing the hieroglyphics may require database personnel or a programmer to write scripts, or to develop a manual describing the tedious process of removing the described anomalies. Such hieroglyphics serve no useful purpose in a database and should be avoided. Similarly, carelessness in data entry, whether through keying or scanning, requires the adoption of a proven quality-control process. Entering a street address as 5657 instead of 5567, or a telephone number as 909.555.5657 rather than 909.555.5567, are forms of “littering” the database. Other concerns involve the accuracy of raw scores, percent correct, normal curve equivalents, percentiles, grade equivalents, and standard scores. Statistics are also used in data analysis, and confusion in frequency counts, percentages, modes, means, and standard deviations can often be traced to dirty data.
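A scrubbing script for such hieroglyphics might split an ID into its numeric part and whatever markers were appended, so the markers can be moved into a proper program field instead of being discarded blindly. The marker pattern below (letters and “+” signs, covering “E,” “El,” “G,” “++,” and the like) follows the examples in the text; an entry that fits neither shape is flagged for review.

```python
import re

def strip_markers(raw_id):
    """Split an ID like '00045632El' or '00045632++' into the numeric ID
    and its appended program markers. Returns (id, markers); returns
    (raw_id, None) when the entry is malformed and needs human review."""
    match = re.fullmatch(r"(\d+)([A-Za-z+]*)", raw_id.strip())
    if match is None:
        return raw_id, None
    return match.group(1), match.group(2)

strip_markers("00045632El")  # -> ("00045632", "El")
strip_markers("00045632++")  # -> ("00045632", "++")
```

Keeping the markers as a second return value lets a district migrate them into a dedicated program-participation field rather than simply deleting information.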

Live by the Scrub, Die by the Scrub
In the end, data scrubbing or cleansing is the most critical step in the data-collection process toward effective analysis that ultimately results in top-flight data-driven decision-making. To get there, attention must be paid by all participants in the collection process to the procedures in place at each school site for the standardization of data entry. And while scrubbing data is the responsibility of all individuals involved in the process, ultimate accountability must always reach the individual who entered the data. To ensure entry of clean data, a process of validation must begin with that individual. Experts recommend a system that returns dirty data to their source for proper reentry. When all involved recognize that data quality is an individual responsibility— and the magnitude of the potential consequences from analysis and decisions based upon dirty data are realized—effective and consistent D3M will be the result.


Neil Mercurius is the director of information/education technology and assessment at Jurupa Unified School District (CA).

This article originally appeared in the 10/01/2005 issue of THE Journal.