A challenge of psychological research is that we are often trying to measure things that don’t have physical form, such as intelligence or anxiety. These abstract constructs are meaningful and useful variables, but you can’t measure out 10 grams of anxiety or describe intelligence in square meters. To measure variables like these, we come up with concrete things that can serve as reasonable proxies for the abstract constructs. For example, a score on an IQ test is a proxy for intelligence; we know it’s not exactly what intelligence is, but it’s related enough to the construct that the score is useful. We call these proxies operational definitions, and though it’s understood that they can never be a perfect representation of the abstract construct, it’s also clear that some operational definitions are better than others (e.g., a multidimensional IQ test administered by a trained clinician is probably a better proxy for the construct of intelligence than a ten-minute quiz found on Buzzfeed).

I bring all this up because when we talk about the validity of a grade, we are in fact talking about construct validity (i.e., the degree to which an operational definition is a good measure of the underlying abstract construct). A grade is typically intended to reflect a student’s achievement or mastery of specific learning goals/objectives, and to measure this mastery, we create assignments with grading criteria that we think will allow us some insight into that abstract construct. I’ve written in the past about how individual grades that are determined based on the review of a collaborative product may have weak construct validity and how intragroup peer assessment can be used to modify those individual grades, but whether this will increase the construct validity of the grade is a question that remains to be answered. Sure, it will change students’ grades, but does it change them in a way that makes the scores a better approximation of each student’s mastery of the learning objectives that the instructor intended to assess with the original group assignment?

Imagine a computer science class in which student groups are tasked with creating simple video games to help them learn how to code in a new computer language. The instructor might look at the game each group creates and assign grades based on the level of coding proficiency or sophistication that the game demonstrates. In this case the learning outcome that is being assessed has to do with coding ability. At the end of the project, group members are asked to submit peer evaluations of everyone in their group, and the instructor plans to use these to adjust individual students’ final project grades.

If the objective is to use peer assessment to make the final grade a more valid indicator of individual student learning, then research would need to show that the modified grade has a stronger relationship with an actual measure of individual student learning than the original grade does. For example, students in our imaginary computer science course could take a test that allows the instructor to assess their proficiency in the coding language used in the group project. Then, the instructor could analyze whether the modified grade on the group project was more strongly correlated with the grade on the test than the unmodified grade was. If it were, then, on average, the peer assessment would appear to increase the validity of the group project grade.
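To make this comparison concrete, here is a minimal sketch of the analysis just described. The numbers are entirely made up for illustration (three groups of three students, where everyone in a group shares the original grade), and the Pearson correlation is computed from scratch so the sketch has no dependencies:

```python
# Sketch of the validity comparison described above, using made-up
# illustrative numbers (not real data). Each list holds one value per
# student: the shared group-project grade, the peer-adjusted grade,
# and the score on an individual coding test.

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

original_grade  = [85, 85, 85, 90, 90, 90, 78, 78, 78]  # identical within each group
adjusted_grade  = [88, 85, 70, 94, 90, 81, 80, 78, 60]  # after peer assessment
individual_test = [86, 84, 65, 95, 88, 79, 82, 75, 58]  # individual measure of learning

r_original = pearson_r(original_grade, individual_test)
r_adjusted = pearson_r(adjusted_grade, individual_test)

# If r_adjusted > r_original, the peer-adjusted grade tracks individual
# learning more closely (for this toy data) than the shared group grade.
print(f"r(original, test) = {r_original:.2f}")
print(f"r(adjusted, test) = {r_adjusted:.2f}")
```

In this toy example the adjusted grade correlates more strongly with the test by construction; the empirical question is whether real peer-assessment data would show the same pattern.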

Unfortunately, there isn’t much research that I’ve been able to uncover that does a good job of demonstrating this kind of validity. It seems like much of the research on this type of peer assessment is focused on using simulated data to come up with the best mathematical formula or weighting scheme for incorporating the peer assessment scores into the final project grade (e.g., Bushell 2006; Guzmán 2018). One exception is a study by Diane Baker (2008), which did show that average intragroup peer assessment predicted subsequent individual performance in a study that involved multiple rounds of team-based testing. This is a decent start at providing empirical support for the use of peer assessment to increase the validity of individual grades on group assignments; however, team-based testing is a very specific type of collaboration in which a student’s contributions are likely to be very visible to other group members (i.e., they either speak during the discussion of the quiz questions or not) and objectively helpful or not helpful (i.e., the contribution helps the group get the question right or not). To help improve the generalizability of the finding, it would be useful to see research that examines peer assessment in other types of assignments.

A final reason I remain skeptical about the validity of summative peer assessment is because of what we usually ask students to assess when they engage in intragroup peer assessment. These sorts of evaluations typically don’t ask students to assess each other’s level of mastery of the learning objectives for the original assignment (e.g., in the imaginary computer science class mentioned above, the peer evaluation likely wouldn’t ask students to rate their peers’ coding ability). Instead, they are frequently set up to ask about things like how much each group member contributed to the project (Dijkstra et al. 2016) or whether they demonstrated good teamwork skills (e.g., dependability, meeting attendance, communication, etc.) (Baker 2008). This seems like a problem to me because if the peer assessment is used to modify the grade on the collaborative project, then that grade does not solely serve as an assessment of the learning objective (e.g., coding ability). Instead, it now serves as a combination of a still potentially weak indicator of the learning objective and a separate measure of teamwork, which at this point has ambiguous construct validity in itself.

I’m not sure that changing the peer assessment so that it does ask students to assess their group members based on their mastery of the learning objectives associated with the assignment is a good idea either, though. There is evidence that students’ peer assessments of each other’s work correlate with instructor assessments (Falchikov and Goldfinch 2000). However, this research focuses on students as individuals producing separate products to be evaluated (e.g., each student does an oral presentation and their peers evaluate them). It’s less clear whether a similar degree of relatedness would hold for students working together to produce a single group product.

In the end, I suspect, based solely on observations from my own classes, that final group grades that have been individualized through the incorporation of peer assessment probably will be more strongly related to individual measures of learning than unmodified group grades, but I think this would mostly be driven by changes in the scores of free riders. Students who contribute nothing to a group project will probably be evaluated poorly by teammates, and because they haven’t been engaged in the work, I suspect they would also perform poorly on any subsequent individual assessment of the same learning outcomes. However, it’s unclear to me whether incorporating peer assessment would similarly improve the validity of the grade for moderate- to high-achieving students. If catching free riders is all you want, though, then there’s no problem: you can probably create a system that catches and penalizes free riders in group work without a lot of difficulty. But I don’t think we can say with confidence that this solves the larger problem of deriving valid individual assessment from collaborative projects.


Baker, Diane F. 2008. “Peer Assessment in Small Groups: A Comparison of Methods.” Journal of Management Education, 32(2), 183–209. https://doi.org/10.1177/1052562907310489.

Bushell, Graeme. 2006. “Moderation of Peer Assessment in Group Projects.” Assessment & Evaluation in Higher Education, 31(1), 91–108. https://doi.org/10.1080/02602930500262395.

Dijkstra, Joost, Mieke Latijnhouwers, Adriaan Norbart, & Rene A. Tio. 2016. “Assessing the I in Group Work Assessment: State of the Art and Recommendations for Practice.” Medical Teacher, 38(7), 675–682. https://doi.org/10.3109/0142159X.2016.1170796.  

Falchikov, Nancy, & Judy Goldfinch. 2000. “Student Peer Assessment in Higher Education: A Meta-Analysis Comparing Peer and Teacher Marks.” Review of Educational Research, 70(3), 287–322. http://doi.org/10.3102/00346543070003287.

Guzmán, Sebastián. 2018. “Monte Carlo Evaluations of Methods of Grade Distribution in Group Projects: Simpler is Better.” Assessment & Evaluation in Higher Education, 43(6), 893–907. https://doi.org/10.1080/02602938.2017.1416457.

David Buck, associate professor of psychology, is the 2020-2022 Center for Engaged Learning Scholar. Dr. Buck’s CEL Scholar project focuses on collaborative projects and assignments as a high-impact practice.

How to Cite this Post

Buck, David. (2021, October 13). Is There Value in Summative Peer Assessment? [Blog Post]. Retrieved from https://www.centerforengagedlearning.org/typology-of-peer-assessment