October 2005 — Features


Scrubbing Data for D3M

Use a standard form to request data
Be specific about the information sought
State the purpose for the request
Explain how the data are needed
State how often the data will be requested

Preventing the Pitfalls
DB design attributes vs. field entry. As school personnel advance their skills in using RDBMS, DW, and SIF strategies, data quality becomes increasingly crucial. In fact, RDBMS, DW, and SIF strategies require data integrity and a schema of the data fields and layouts. Normally, a unique identification number with a required number of digits serves as the primary key. Unique numbers throughout a school or school district will promote optimal data quality, data integration, data migration, data profiling, data analysis, and data management. But these attributes are of greater importance in the design of a database than in data cleansing. Whether a particular field is labeled “M” for middle name or marital status, or identified as “Gender” or “Sex,” is not as critical as how the data response is identified inside the field. For example, is the gender or sex of the respondent “male,” “ml,” “m,” “1,” or “2”? Inconsistencies will present problems and ultimately require scrubbing.
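To make the point about inconsistent field responses concrete, here is a minimal Python sketch of the kind of check a database designer might run before data reach the cleansing stage. Everything named here is an assumption for illustration: the nine-digit ID length, the `GENDER_CODES` mapping, and the function names do not come from any particular student information system. Note that the numeric codes "1" and "2" are deliberately left unmapped, because their meaning cannot be known without the district's own code book.

```python
from collections import Counter

# Hypothetical mapping of observed responses to one canonical code.
# A real mapping must come from the district's data dictionary.
GENDER_CODES = {
    "male": "M", "ml": "M", "m": "M",
    "female": "F", "f": "F",
}

def normalize_gender(raw):
    """Return 'M' or 'F', or None when the response is not recognized."""
    return GENDER_CODES.get(raw.strip().lower())

def is_valid_id(raw, digits=9):
    """Check that an ID is all digits and has the required length.

    The default of nine digits is an assumption, not a standard.
    """
    return raw.isdigit() and len(raw) == digits

def duplicate_ids(ids):
    """Return the IDs that appear more than once, in sorted order."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)
```

Responses that come back as `None` are the ones that need a human decision; silently guessing at them would only move the impurity somewhere harder to find.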

Multiple uses and other ills. Data impurities within school databases often derive from a single database being used for multiple purposes. When a variety of users are called upon to input important information into a single database, there must be clear guidelines surrounding the data entry. Another way a database can be saturated with impurities is by using hieroglyphics to denote students with special needs or students within special programs. For example, a database used to identify services to English-language learners may stipulate data entry to include “E,” “e,” or “El” behind student names or following their student identification numbers or ethnicity. A GATE-identified student may have a “+,” “++,” or “G” symbolized behind the response within one of these demographic fields to represent participation in the program. Scrubbing the hieroglyphics may require database personnel or a programmer to write scripts or develop a manual describing the tedious process of removing the described anomalies. Such hieroglyphics serve no effectual purpose in a database and should be avoided. Similarly, carelessness in data entry, either through keying or scanning, requires the adoption of a proven quality-control process. Entering a student address as 5657 instead of 5567, or a telephone number as 909.555.5657 rather than 909.555.5567, are forms of “littering” the database. Other concerns involve the accuracy of raw scores, percent correct, normal curve equivalents, percentiles, grade equivalents, and standard scores. Statistics are also used in data analysis, and confusion in frequency counts, percentages, modes, means, and standard deviations can often be tracked to dirty data.
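As an illustration of the scripts mentioned above, the following Python sketch separates a program marker from a student identification number and records the program in a field of its own. It is only a sketch under stated assumptions: the marker set is taken from the examples in this article, the program labels `"ELL"` and `"GATE"` and the function name `scrub_student_id` are hypothetical, and a real database would need its own list of markers.

```python
import re

# An ID made of digits, optionally followed by one marker.
# Marker set assumed from the article's examples only.
ID_WITH_MARKER = re.compile(r"^(\d+)\s*(\+\+|\+|El|E|e|G)?$")

# Hypothetical labels for the program each marker stands for.
PROGRAM_BY_MARKER = {
    "E": "ELL", "e": "ELL", "El": "ELL",
    "+": "GATE", "++": "GATE", "G": "GATE",
}

def scrub_student_id(raw):
    """Split a raw ID entry into (clean_id, program).

    program is None when no marker is present. The function returns
    None when the entry does not look like an ID at all, so that the
    record can be reviewed by hand instead of being altered blindly.
    """
    match = ID_WITH_MARKER.match(raw.strip())
    if match is None:
        return None
    student_id, marker = match.groups()
    return student_id, PROGRAM_BY_MARKER.get(marker)
```

The sketch handles only the ID field on purpose. Markers placed behind a student name are harder to strip safely, since a trailing "e" or "G" may be part of the name itself. A script of this kind also cannot catch keying errors such as 5657 entered for 5567, because both values are well formed; those still call for a quality-control process at the point of entry.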
