October 2005 — Features
Scrubbing Data for D3M
• Use a standard form to request data
• Be specific about the information sought
• State the purpose for the request
• Explain how the data are needed
• Indicate how often the data will be requested
Preventing the Pitfalls
DB design attributes vs. field entry. As
school personnel advance their skills in
using RDBMS, DW, and SIF strategies,
data quality becomes increasingly crucial.
In fact, RDBMS, DW, and SIF strategies
require data integrity and a schema of the
data fields and layouts. Normally, a unique
identification number with a required
number of digits serves as the primary key.
Unique numbers throughout a school or
school district will promote optimal data
quality, data integration, data migration,
data profiling, data analysis, and data
management. But these attributes are of
greater importance in the design of a database than in data cleansing. Whether a
particular field is labeled “M” for middle
name or marital status, or identified as
“Gender” or “Sex,” is not as critical as how
the data response is identified inside the
field. For example, is the gender or sex of
the respondent “male,” “ml,” “m,” “1,” or
“2”? Inconsistencies will present problems
and ultimately require scrubbing.
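To illustrate the kind of scrubbing such inconsistencies call for, here is a minimal Python sketch that maps variant entries to one canonical code. The function name, the canonical codes "M" and "F," and the female variants are assumptions added for illustration; a real mapping would come from the district's own data dictionary. Numeric entries such as “1” or “2” are deliberately left unresolved unless the caller states what they mean, since the codes alone do not say.

```python
# Hypothetical variant spellings mapped to assumed canonical codes.
# Replace with the values defined in the district's data dictionary.
CANONICAL = {
    "male": "M", "ml": "M", "m": "M",
    "female": "F", "fem": "F", "f": "F",
}


def normalize_gender(raw, numeric_codes=None):
    """Return 'M', 'F', or None when the entry cannot be resolved.

    numeric_codes is an optional dict such as {"1": "M", "2": "F"};
    without it, numeric entries are treated as ambiguous.
    """
    value = str(raw).strip().lower()
    if value in CANONICAL:
        return CANONICAL[value]
    if numeric_codes and value in numeric_codes:
        return numeric_codes[value]
    return None  # flag for manual review rather than guess


if __name__ == "__main__":
    for entry in ["male", "ml", "M", "1", "2"]:
        print(repr(entry), "->", normalize_gender(entry))
```

Returning None instead of guessing keeps ambiguous entries visible, so they can be routed to someone who knows how the field was originally coded.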
Multiple uses and other ills. Data impurities within school databases often derive from a single database being used for multiple purposes. When a variety of users are called upon to input important information into a single database, there must be clear guidelines surrounding the data entry. Another way a database can be saturated with impurities is by using hieroglyphics to denote students with special needs or students within special programs. For example, a database used to identify services to English-language learners may call for “E,” “e,” or “El” to be entered behind student names or following their student identification numbers or ethnicity. A GATE-identified student may have a “+,” “++,” or “G” appended to the response within one of these demographic fields to represent participation in the program. Scrubbing the hieroglyphics may require database personnel or a programmer to write scripts or develop a manual describing the tedious process of removing the described anomalies. Such hieroglyphics serve no useful purpose in a database and should be avoided.

Similarly, carelessness in data entry, either through keying or scanning, requires the adoption of a proven quality-control process. Entering a student address as 5657 instead of 5567, or a telephone number as 909.555.5657 rather than 909.555.5567, is a form of “littering” the database. Other concerns involve the accuracy of raw scores, percent correct, normal curve equivalents, percentiles, grade equivalents, and standard scores. Statistics are also used in data analysis, and confusion in frequency counts, percentages, modes, means, and standard deviations can often be traced to dirty data.
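The scripts mentioned above for scrubbing program markers might look something like the following Python sketch. It assumes, purely for illustration, that the markers were appended to an all-digit student identification number; the function name, the program labels "ELL" and "GATE," and the field layout are hypothetical. Markers appended to names are harder to scrub automatically, since a trailing “E” or “G” could just as easily be a middle initial.

```python
import re

# Assumed layout: an all-digit student ID, optionally followed by one marker.
# Markers taken from the examples above: "E", "e", "El" and "+", "++", "G".
ID_WITH_MARKER = re.compile(r"^\s*(\d+)\s*(\+{1,2}|[Ee][Ll]?|[Gg])?\s*$")


def split_id_marker(raw):
    """Return (student_id, program), where program is 'ELL', 'GATE', or None.

    Raises ValueError when the field does not fit the assumed layout,
    so unexpected entries are reviewed by hand instead of silently altered.
    """
    match = ID_WITH_MARKER.match(raw)
    if match is None:
        raise ValueError(f"unrecognized ID field: {raw!r}")
    student_id, marker = match.groups()
    if marker is None:
        program = None
    elif marker[0] in "+Gg":
        program = "GATE"
    else:
        program = "ELL"
    return student_id, program


if __name__ == "__main__":
    for field in ["1234567E", "1234567 ++", "1234567"]:
        print(repr(field), "->", split_id_marker(field))
```

Moving the marker into its own program field, rather than simply deleting it, preserves the information the hieroglyphic was meant to carry while leaving the identification number clean enough to serve as a key.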