Evaluating Teachers | Viewpoint
What Value-Added Assessments Won't Tell You About Effective Teachers
- By Patricia Deubel
There's more to being an effective teacher than raising standardized test scores, yet test scores have gained widespread acceptance among the public as the key indicator of performance--a perception that's fueled, in part, by the Race to the Top program, which defines an effective teacher as one "whose students achieve acceptable rates (e.g., at least one grade level in an academic year) of student growth" (United States Department of Education, 2009, p. 12).
Using value-added models in teacher evaluations has become the hot issue, with several states passing laws making this sort of measure of student achievement a significant factor in teacher evaluations--at least 50 percent in some states (Institute for Competitive Workforce, 2011). What constitutes the other 50 percent or more? There's much yet to be said about teacher effectiveness that value-added assessments won't tell us--namely about the human side of teaching and learning, professional practice, leadership, and learners' test perspectives.
In part 1 of this two-part examination, I introduced some of the complexities of value-added models and elaborated on my concerns regarding the nature of teaching and why experience and advanced degrees do matter. Here I delve into the finer details of my concerns about such models.
The Human Side of Teaching and Learning
Is it important to consider the human side of teaching and learning in measuring teacher effectiveness? The answer is "yes," if we believe "Teaching consists of classroom interactions among teachers and students" and that teachers facilitate "students' achievement of learning goals" (Hiebert & Grouws, 2007, p. 372). Yet it is this human side of teaching and learning that seems to be forgotten by those who look only at scores from value-added assessments and have no appreciation for the flaws of such models. Consider the publicly available database (see: Grading the Teachers: Value-Added Analysis) of teachers' names and scores for grades 3 through 5 published by the Los Angeles Times in 2010, which may have led a teacher, who was rated ineffective, to commit suicide (Lovett, 2010).
What if there had been an error?
While I appreciate that no system is perfect and measurements involve errors, we need to decide what is acceptable.
According to Peter Schochet and Hanley Chiang (2010) of Mathematica Policy Research, "error rates for comparing a teacher's performance to the average are likely to be about 25 percent with three years of data and 35 percent with one year of data" in the upper elementary grades using student test score gain data and value-added models (p. i). Said another way, that's about a one-in-four or one-in-three chance of error, respectively.
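To make those error rates concrete, here is a toy Monte Carlo sketch of the underlying idea. The parameters below (the spread of true teacher effects, the size of the noise) are illustrative assumptions chosen for the demonstration, not Schochet and Chiang's actual model: a teacher's estimated score is her true effect plus noise, and averaging more years of data shrinks the noise but does not eliminate misclassification.

```python
import random

random.seed(42)

def misclassification_rate(n_teachers=10_000, years=1,
                           noise_sd=1.0, effect_sd=0.5):
    """Simulate how often a noisy value-added estimate puts a teacher
    on the wrong side of average. All parameters are illustrative."""
    errors = 0
    for _ in range(n_teachers):
        true_effect = random.gauss(0, effect_sd)
        # Averaging over more years shrinks the noise on the estimate.
        noise = random.gauss(0, noise_sd / years ** 0.5)
        estimate = true_effect + noise
        # Count cases where the estimate flips above/below-average status.
        if (true_effect > 0) != (estimate > 0):
            errors += 1
    return errors / n_teachers

print(f"1 year of data:  ~{misclassification_rate(years=1):.0%} misclassified")
print(f"3 years of data: ~{misclassification_rate(years=3):.0%} misclassified")
```

With these assumed parameters the simulation lands near the cited figures (roughly a third of teachers misclassified with one year of data, a quarter with three), which illustrates the point: more data helps, but a substantial error rate remains.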
The answer to the above question is also "yes" if we believe in the National Board for Professional Teaching Standards' (NBPTS, 2002) core propositions, one of which states that proficient teaching "ultimately requires judgment, improvisation, and conversation about means and ends" and that excellence in the craft involves "human qualities, expert knowledge and skill, and professional commitment" (p. 2). Those conversations with others attest to the collaborative nature of teaching, also embodied in the definition of effective teaching in Colorado's State Council for Educator Effectiveness Report and Recommendations to the State Board (State Council, 2011). As value-added scores are linked to individual teachers, it becomes apparent that separating out the contributions of others in the teaching-effectiveness equation is problematic. Students might be taught by many teachers each day, possibly within team-teaching environments or pull-out programs; some of those teachers might be intervention specialists in the content areas or tutors. There might be contributions to learning from parents and others outside of the school day, from instructors in after-school programs, and from summer school. Students on their own might gain additional learning by interacting with content and with others online. Additional disturbing news from Jim Hull (2011), senior policy analyst for the Center for Public Education, is that value-added models differ in how they account for the impact of past teachers on current student performance.
If we believe in the human side of teaching and learning, we must acknowledge that value-added measures will not reveal all that teachers do to get students from point A to point B in learning across all domains: cognitive, affective, and psychomotor. For example, how can test score data account for the degree to which a teacher might have influenced the social and emotional aspects of learning, or for an ability to foster interactions among learners? Indeed, Dan Goldhaber (2010), director of the Center for Education Data & Research, wrote, "It is unlikely, for example, that tests will cover socialization behavior that is taught in schools" (p. 13). Further, he acknowledged that the year-to-year correlations of value-added measures are only modest, with estimates ranging from 0.3 to 0.5 (perfect being 1.0). Those "estimates of effectiveness include measurement error, both because standardized tests are imprecise measures of what students know and because there are random elements such as classroom interaction that influence the performance of a group of students in a classroom" (pp. 2-3).
So it is understandable that many teachers fear value-added models will not accurately measure the impact they have on students. Their fear is a legitimate concern: If teachers don't trust the data or how the data will be used, they will likely not use it to evaluate or improve their performance (Hull, 2011). There's more to say about how teachers affect learners.
How shall we measure the value of teacher-learner relationships? Value-added assessments will not reflect how well teachers have built those relationships, which are sometimes what keep learners on track to stay in school. By this, I do not mean that the teacher becomes a pal. Effective teachers also promote and nurture career and college readiness. As Robert Marzano (2011) pointed out, effective teachers show interest in students' lives; they advocate for students through actions that show they want them to do well and by providing assistance to that end; they never give up on students; and they behave in a friendly way. "With good relationships in place, all other instructional strategies seem to work better" (p. 82).
Building those relationships also involves addressing the many ways that students differ, so that ultimately all have respect for each other and all can learn in consideration of those differences. For example, some cultural differences among learners might trigger inappropriate behaviors, detracting from the lessons of the day. What is acceptable in one culture might not be in another, such as ways of speaking to others (e.g., one at a time, or in a loud voice), the levels of physical activity and verbal discourse needed for thinking and learning, attitudes about sharing and respecting physical space, what constitutes an authority figure, and the manner in which deference is shown to authority figures (Voltz, Sims, & Nelson, 2010). Again, knowing about such differences and being able to address them in a positive manner adds to the complexity of teaching and of measuring effectiveness. With the rise of K-12 online learning and communications with others at a distance, the complexities are compounded: Differences that might affect learning are not always apparent, and additional skills are needed for building online relationships.
The Extent of Diversity in Classrooms
How shall we fairly account for the effect of diversity in classrooms? There's an extensive array of attributes beyond cultural differences. Diversity, as defined by New York's Queensborough Community College, involves "not only ways of being but ways of knowing" and "knowing how to relate to those qualities and conditions that are different from our own and outside the groups to which we belong, yet are present in other individuals and groups." In a single environment, learners and teachers themselves vary in beliefs, attitudes, perceptions, self-efficacy, motivation, learning styles, habits of mind, cultural influences, and demographics (e.g., male/female, sexual orientation, ethnicity, ability/disability, socio-economic status, religion/spirituality, etc.) ("Queensborough," n.d.). Add to this that English is a second language for many learners in the United States.
Value-added models attempt to isolate the impact a teacher has on students' achievement from other factors, such as student characteristics (Hull, 2011). Any factor considered in determining a value-added score would need a number associated with it for use in the statistical calculations. As one looks at the many levels of diversity just noted, several (e.g., attitudes, motivation, self-efficacy) would not be considered in those calculations, as data typically are not collected for them. Yet teachers face all of those. Further, value-added models cannot fully control for variables because neither teachers nor their students are randomly assigned to schools or classes, making it difficult to separate a teacher's impact on students from non-observable factors, such as a student's motivation or help at home (Hull, 2011).
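As a rough illustration of what "isolating the teacher's impact" means statistically, the sketch below computes a bare-bones value-added score as the average amount by which a teacher's students beat a prediction based on their prior-year scores. This is a deliberately minimal, hypothetical model with made-up data; real value-added models control for many more variables and use far more sophisticated statistics.

```python
from statistics import mean

# Hypothetical records: (teacher, prior-year score, current-year score).
records = [
    ("A", 50, 58), ("A", 62, 69), ("A", 45, 55),
    ("B", 55, 57), ("B", 70, 73), ("B", 48, 50),
]

# Step 1: fit a simple regression predicting current scores from prior
# scores across all students (ordinary least squares, computed by hand).
prior = [p for _, p, _ in records]
current = [c for _, _, c in records]
p_bar, c_bar = mean(prior), mean(current)
slope = (sum((p - p_bar) * (c - c_bar) for p, c in zip(prior, current))
         / sum((p - p_bar) ** 2 for p in prior))
intercept = c_bar - slope * p_bar

# Step 2: a teacher's "value added" is the mean residual -- how far that
# teacher's students land above or below the prediction, on average.
def value_added(teacher):
    residuals = [c - (slope * p + intercept)
                 for t, p, c in records if t == teacher]
    return mean(residuals)

for t in ("A", "B"):
    print(t, round(value_added(t), 2))
```

In this toy data, teacher A's students outperform the prior-score prediction and teacher B's underperform it, but nothing in the calculation can distinguish the teacher's contribution from unobserved factors (motivation, help at home) that happen to cluster in one classroom--which is exactly the limitation Hull describes.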
All that said, the most significant finding from a 2003 Rand Corporation investigation into value-added models is that because such models might not control for all variables of interest, "student achievement can never be shown conclusively to be due to individual teacher effectiveness" (McCaffrey, Lockwood, Koretz, & Hamilton, 2003, p. 3).
This last point is one to emphasize to the public.
Professional Practice and Leadership
How shall we account for implementations of best practices? Delving further into the professional and leadership side of teaching, value-added measures will not reveal an educator's ability to implement the many innovations and best practices that have significantly affected curriculum and instruction in recent times, nor how that educator might have assisted others in doing so, all of which add to teaching complexity and effectiveness. Of course, one must know the available strategic options--the what, where, when, how, and for whom questions surrounding a best practice--and what to do if an implementation does not go as well as expected. Among those options are differentiated instruction, Understanding by Design, state standards, development of curriculum frameworks, scope-and-sequence charts that inform teachers of what to teach and when to teach it, the expanded use of technology, active literacy, curriculum mapping, the proliferation of professional learning communities, and the rise in using formative assessments (Sullo, 2009). To this I would add effective uses of student data, the tiers of response to intervention, multiple intelligences, lesson study in which teachers collaborate in planning and refining their instructional plans, universal design for learning principles, and online learning.
Consider the "best practice" of learning with new technologies. Although he was addressing higher education in voicing the need for a theory of technology integration, Trent Batson (2011) made remarks applicable to K-12, stating:
We understand that, somehow, learning can be distributed, that active learning is now easier to manage, that social learning has become a fact of life, that authentic learning opportunities are all around us, that information is all around us, that archiving and examining student work produces a whole new harvest of learning--we understand all this and we understand "high-impact learning practices," but we have yet to put it all together. (online p. 3)
I fear that a path of using standardized test results as a major factor in measuring teaching effectiveness will diminish teachers' incentive and ability to "put it all together" for the kind of education we envision for the 21st century. One has to wonder how we plan to account for the added value of the teacher who can work with new technologies. Certainly that skill set adds another layer of teaching complexity and has value for the career and college readiness of our youth.
And where do any of these added skills related to best practices and their mastery come from? One might answer with experience or teachers using their personal time to experiment to learn on their own, peer-to-peer training and collaboration, formal professional development activities, or getting that next degree, which further supports commentary I made in part 1 on the role of experience and degrees in teaching.
Learner Test Perspectives
Will students take those tests seriously? In building a better evaluation system, Hull (2011) acknowledged that value-added models have flaws, but "they are much better than the system we have now. The fairest way to identify strong teaching is through a system that looks at student gains" (At a Glance section). On paper this sounds reasonable, particularly to the public. However, an important consideration relates to how students take tests. I have not forgotten a situation from years ago involving the California Test of Basic Skills. One learner turned in the test within a few minutes, having just filled in the bubble-form answer sheet, and I suspected he had not even read the questions. When I asked why he had finished so quickly, he said, "I don't care. It doesn't count."
With no intention of singling out any state that plans to increase testing in order to use value-added models in evaluating all teachers, will every one of those new tests have high stakes attached to them for learners? Learners will decide quickly which tests "count" for them personally (e.g., passing a course, grade retention, graduation, possibly earning college credit from passing AP exams). They will set their priorities and respond to test questions accordingly. This would be particularly true for subjects that have not traditionally been subjected to standardized testing. It's an immediate concern for the usefulness of the data: Regardless of the conditions under which a test is taken and how seriously learners approach it, the results are potentially high stakes for teachers. Unfortunately for them, "90 percent of the variation in student gain scores is due to the variation in student-level factors that are not under" their control (Schochet & Chiang, 2010, p. 35). As Ronald Berk (2005) pointed out, using such student testing outcomes as a measure of teaching effectiveness "is a sticky source because it is indirect. Teaching performance is being inferred from students' performance" (p. 55). Therein lies the crux of the problem for teacher buy-in to value-added models used in evaluations.
I appreciate the views of critics who find that current evaluation systems fail to identify the true variation in teacher effectiveness by rating all but a few teachers as satisfactory (Hull, 2011). However, there appears to be something insufficient in a value-added model in which teachers are "rated on a curve" relative to the effectiveness of their colleagues, as it undermines the collaborative nature of teaching. I question how this will be any better politically: In a normal distribution, half of the teachers will be rated below average in effectiveness and half above, even if there is only a slight difference between "worst" and "best," as Hull pointed out. On that point, I think of the uproar among parents, and the calls to the principal's office, if teachers graded their own students on a curve under similar circumstances. And I wonder how districts will answer charges from an unknowing public, who in general might not understand the statistical nature of curves and might turn a deaf ear to any attempt at explaining value-added scores. Some will point the finger, saying, "What's going on--half of your teachers are below average in effectiveness?" The press will have a field day, particularly if they publish the names and scores of those teachers, despite warnings against this from national testing, research, and policy experts (Buffenbarger, 2011).
In getting beyond those test scores, what's ultimately meaningful in determining a model for evaluating teachers relies on using multiple sources, as several researchers have suggested, that "build on the strengths of all sources, while compensating for the weaknesses in any" (Berk, 2005, p. 1; Goldhaber, 2010; Hull, 2011). The problem is identifying which combination of sources will work best. Certainly, if teacher compensation is tied to evaluations, then those who determine that compensation should consider the guiding principles from the American Association of School Administrators, the American Federation of Teachers, the National Education Association, and the National School Boards Association (2011), which also support a multifactor approach. The caveat in those guiding principles is that any plan, whether or not it includes a value-added model, should be developed collaboratively with relevant stakeholders. It should promote collaboration, not competition; it should be research-based; and it should improve student achievement.
We need to listen to teachers.
Am I buying into value-added models? I see their potential as one indicator of student growth to inform instruction for lower-stakes decisions. I'm still concerned about logistics, and about their use as a major factor in teacher evaluation and in decisions regarding employment and merit pay. While value-added models add a layer of objectivity to evaluations, I share teachers' concerns about their flaws. And scores won't tell us about a teacher's contributions to the human side of teaching and learning, the specifics of professional practice, or leadership. I'd like to see a comprehensive system that recognizes the multiple dimensions of teacher performance leading to the growth of learners and the quality of a school or district as a whole, and one that values and rewards experience and advanced degrees, as I discussed in part 1 of this series--a huge undertaking. Ultimately, I agree with McCaffrey and his colleagues (2003): In regard to high-stakes decisions for individual teachers, the state of the art of value-added models "has not advanced to the point where such evaluations are likely to be precise enough to meet reasonable requirements for equity and precision, except in cases where we are interested only in teachers at the extremes of the distribution of effectiveness" (p. 5). I think of funding and wonder: If extremes are what we really are interested in, have we done all we can to salvage existing systems?
References
American Association of School Administrators, American Federation of Teachers, National Education Association, & National School Boards Association. (2011, February). Guiding principles for teacher incentive compensation plans.
Batson, T. (2011, April 6). Faculty "Buy-In"--To What?
Berk, R. A. (2005). Survey of 12 strategies to measure teaching effectiveness. International Journal of Teaching and Learning in Higher Education,17(1), 48-62.
Buffenbarger, A. (2011, January 12). New York judge says "value-added" scores can be released. NEA Today.
Goldhaber, D. (2010, December). When stakes are high, can we rely on value added? Exploring the use of value-added models to inform teacher workforce decisions.
Hiebert, J., & Grouws, D. A. (2007). The effects of classroom mathematics teaching on students' learning. In F. K. Lester (Ed.), Second Handbook of Research on Mathematics Teaching and Learning (pp. 371-404).
Hull, J. (2011). Building a better evaluation system: At a glance. Alexandria, VA: Center for Public Education.
Institute for Competitive Workforce. (2011, January). In focus: A look into teacher effectiveness.
Lovett, I. (2010, November 9). Teacher suicide is flash point in debate over 'value-added analysis,' reforms. New York Times.
Marzano, R. (2011). Relating to students: It's what you do that counts. Educational Leadership, 68(6), 82-83.
McCaffrey, D. F., Lockwood, J. R., Koretz, D. M., & Hamilton, L. S. (2003). Evaluating value-added models for teacher accountability. Santa Monica, CA: Rand Corporation.
National Board for Professional Teaching Standards. (2002). What teachers should know and be able to do: The five core propositions of the national board.
Queensborough Community College (n.d.). Definition of diversity.
Ramirez, A. (2011). Merit pay misfires. Educational Leadership, 68(4), 55-58.
Schochet, P. Z., & Chiang, H. S. (2010). Error rates in measuring teacher and school performance based on student test score gains (NCEE 2010-4004). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.
State Council for Educator Effectiveness. (2011, April 13). Final Report and Recommendations to the Colorado State Board of Education.
Sullo, B. (2009). The motivated student: Unlocking the enthusiasm for learning. Alexandria, VA: ASCD.
United States Department of Education. (2009). Race to the Top Program: Executive Summary. Washington, DC: Author.
Voltz, D., Sims, M., & Nelson, B. (2010). Connecting teachers, students, and standards: Strategies for success in diverse and inclusive classrooms. Alexandria, VA: ASCD.