Abstract

This paper covers six interrelated issues in formative assessment (aka 'assessment for learning'). The issues concern the definition of formative assessment, the claims commonly made for its effectiveness, the limited attention given to domain considerations in its conceptualisation, the under-representation of measurement principles in that conceptualisation, the teacher-support demands formative assessment entails, and the impact of the larger educational system. The paper concludes that the term 'formative assessment' does not yet represent a well-defined set of artefacts or practices. Although research suggests that the general practices associated with formative assessment can facilitate learning, existing definitions admit such a wide variety of implementations that effects should be expected to vary widely from one implementation and student population to the next. In addition, the magnitude of commonly made quantitative claims for effectiveness is suspect, deriving from untraceable, flawed, dated, or unpublished sources. To realise maximum benefit from formative assessment, new development should focus on conceptualising well-specified approaches built around process and methodology rooted within specific content domains. Those conceptualisations should incorporate fundamental measurement principles that encourage teachers and students to recognise the inferential nature of assessment. The conceptualisations should also allow for the substantial time and professional support needed if the vast majority of teachers are to become proficient users of formative assessment. Finally, for greatest benefit, formative approaches should be conceptualised as part of a comprehensive system in which all components work together to facilitate learning.

Keywords: formative assessment; assessment for learning

Acknowledgements

I am grateful to Steve Chappuis, Joe Ciofalo, Terry Egan, Dan Eignor, Drew Gitomer, Steve Lazer, Christy Lyon, Yasuyo Sawaki, Cindy Tocci, Caroline Wylie, and two anonymous reviewers for their helpful comments on earlier drafts of this paper or the presentation upon which the paper was based; to Brent Bridgeman, Shelby Haberman, and Don Powers for their critique of selected effectiveness studies; to Dylan Wiliam, Jim Popham and Rick Stiggins for their willingness to consider differing points of view; and to Caroline Gipps for suggesting (however unintentionally) the need for a paper such as this one.

Notes

1. Influential members of the group have included Paul Black, Patricia Broadfoot, Caroline Gipps, Wynne Harlen, Gordon Stobart, and Dylan Wiliam. See http://www.assessment-reform-group.org/ for more information on the Assessment Reform Group.

2. How does formative assessment differ from diagnostic assessment? Wiliam and Thompson (2008, 62) consider an assessment to be diagnostic when it provides information about what is going amiss and formative when it provides guidance about what action to take. They note that not all diagnoses are instructionally actionable. Black (1998, 26) offers a somewhat different view, stating that '… diagnostic assessment is an expert and detailed enquiry into underlying difficulties, and can lead to a radical re-appraisal of a pupil's needs, whereas formative assessment is more superficial in assessing problems with particular classwork, and can lead to short-term and local changes in the learning work of a pupil'.

3. Expected growth was calculated from the norms of the Metropolitan Achievement Test Eighth Edition (Harcourt Educational Measurement 2002), the Iowa Tests of Basic Skills Complete Battery (Hoover, Dunbar, and Frisbie 2001), and the Stanford Achievement Test Series Tenth Edition (Pearson 2004).

4. Stiggins is reported to no longer stand by the claims quoted here (S. Chappuis, April 6, 2009, personal communication). I have included them because they are published and still frequently taken by others as fact. See Kahl (2007) for an example.

5. Cohen (1988, 25–7) considers effects of .2 to be small, .5 to be medium, and .8 to be large (see the illustrative sketch of the effect-size metric following these notes).

6. It is possible that these values represent Black and Wiliam's retrospective extraction from the 1998a review of the range of mean effects found across multiple meta-analytical studies done by other investigators on different topics (i.e., the mean effect found in a meta-analysis on one topic was .4 and the mean effect found in a meta-analysis on a second topic was .7). If so, the range of observed effects across individual studies would, in fact, be wider than the oft-quoted .4 to .7 range of effects, as each meta-analytic mean itself represents a distribution of study effects. But more fundamentally, the construction of any such range would seem specious according to Black and Wiliam's (1998c) very own critique – i.e., '… the underlying differences between the studies are such that any amalgamations of their results would have little meaning' (53).
7. A partial list of concerns includes confusing association with causation in the interpretation of results, ignoring in the interpretation the finding that results could be explained by (irrelevant) method factors, seemingly computing effect sizes before coding the same studies for the extent of use of formative assessment (introducing the possibility of bias in coding), giving no information on the reliability of the coding, and including many dated studies (57 of the 86 included articles were 30 or more years old) without considering publication date as a moderator variable.

8. The replicability of inferences and adjustments may be challenging to evaluate. It would be easiest to assess in team-teaching situations in which both teachers might be expected to have a shared understanding of their classroom context and students. Outside of team contexts, replicability might be evaluated through video recording of teachers' formative assessment practice; annotation of the recording by those teachers to indicate their inferences, adjustments, and associated rationales; and review of the recordings and annotations by expert teachers for reasonableness.

9. Kane (2006, 23) uses 'interpretive argument' to refer to claims and 'validity argument' to refer to the backing. For simplicity, I've used 'validity argument' to refer to both claims and backing.

10. One could certainly conceptualise the relationship between the validity and efficacy arguments the other way around; that is, with the efficacy argument being part of a broader validity argument, a formulation that would be consistent with Kane's (2006, 53–6) views. Regardless of which argument is considered to be overarching, there is no disagreement on the essential point: both arguments are needed.

11. As suggested, there are other possible underlying causes for student error, some of which may be cognitive and others of which may be affective (e.g., not trying one's hardest to respond). Black and Wiliam (2009, 17) suggest a variety of cognitive causes, including misinterpretation of language, question purpose or context, or the requirements of the task itself. Affective causes may be situational ones related, for instance, to the type of feedback associated with a particular task or teacher, or such causes may be more deeply rooted, as when a student's history of academic failure dampens motivation to respond even when he or she possesses the requisite knowledge. Boekaerts (as cited in Boekaerts and Corno 2005, 202–3) offers a model to explain how students attempt to balance achievement goals and emotional well-being in classroom situations.
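The following is an illustrative sketch for readers unfamiliar with the effect-size metric behind the benchmarks in note 5: one common formulation of the standardised mean difference (Cohen's d). It is offered only as a general aid to interpreting the values discussed in notes 5 and 6, not as a reconstruction of how any study cited in this paper computed its effects.

% Illustrative sketch only: one common formulation of the standardised mean
% difference (Cohen's d); not the computation used by any specific study cited here.
\[
  d = \frac{\bar{X}_{\text{treatment}} - \bar{X}_{\text{control}}}{s_{\text{pooled}}},
  \qquad
  s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}
\]

On this scale, Cohen's conventional benchmarks of .2, .5, and .8 correspond to group differences of roughly one-fifth, one-half, and four-fifths of a pooled standard deviation. As note 6 emphasises, a meta-analytic mean effect (e.g., .4 or .7) is itself a summary of a distribution of such study-level values, so the spread of individual study effects is typically wider than the range of the means themselves.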