This is a "stand-alone" version of an article on my main website, originally published September 9, 2012, edited July 21, 2018. Because of the move toward standards-based grading (yay!), the utility of the "averaging" methods listed here is now arguably nil, so though the time for this article's usefulness is past, some may find it interesting anyway.

Methods of Finding Central Tendencies for Report Card Grading

Before the current push toward standards-based grading, I had always had difficulty accepting the traditional 100 point grading scale. Poor performance resulting in "F" grades can severely affect averages, especially when the score is a "big F" (below 50%). I usually gave students at least 50% or 55% in a grading program so it would register as an "F" but would not kill their chances of ever getting out of the "F" range.

After discussing report card grading methods with colleagues at my school, I decided to investigate different methods of finding central tendencies of scores. I had no idea how many different ways there are to find central tendencies! I read, I computed, and now I comment.

To see how the methods of finding central tendencies stacked up, I imagined a student with various scores that included extreme outliers, as can happen in real life. Here are the scores I used to investigate the different methods.

I chose to test ten methods of computing a central tendency. These seemed the most appropriate for use in determining grades based on student assessments. The methods are listed by complexity of computation; the higher on the list and chart, the more complex they are to compute. Scroll down to read more about each method.

RESULTS: When I did my investigation, I plotted the scores and looked at their placement (represented by the gray circles in the chart above). My "gut instinct" for the original set of scores was that the student was performing at about 77%. The methods that produced values closest to my "instinct" were Distance-Weighted Estimate, Trimean, Truncated Mean (10%), and Median in this case.

After completing all the calculations, I have come to the opinion that Distance-Weighted Estimate is the fairest method when using traditional 100% grading scales (as opposed to modern standards-based performance scoring) without low-outlier adjustments. Unfortunately, my preferred grading program, Jupiter Ed, does not compute grades using this method*. In fact, NO online, desktop, or mobile apps use this method! It is not even easily done in a spreadsheet, which many teachers use for keeping grades. So unfortunately, this method, though (IMHO) the fairest and most accurate representation of student ability based on cumulative assessments, is unavailable unless computed by hand or by using the form above (for up to ten scores). (If I get enough encouragement, I may try to build a spreadsheet or dedicated web page to use in my grading. If that happens, I will post it on my website.)

Below are my observations and opinions about each method used in my investigation.

* - Jupiter Ed Gradebook does now have two options for computing grades: traditional "Average" (Arithmetic Mean); and "Summative", using the average of the latest 20% of grades entered.

Distance-Weighted Estimate
Scores that are closer to other scores get more weight in finding the central tendency. Outliers have much less influence. In my opinion, this is the fairest measure of producing an "average" that truly reflects the student's performance as it tends heavily toward clustered scores. In other words, if a student scores mostly in the 80s, even if she has scored occasionally outside that range, she is pretty much an "80s" student.
CON: It is so cumbersome to compute! All the other methods were easy to implement in a standard spreadsheet using built-in functions. Because each value has to be compared to every other value, this is not easily done in a spreadsheet, and I had to write a JavaScript program to compute the value.
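
The pairwise comparison described above can be sketched in a few lines of Python. This is only an illustration under an assumption: it uses the common definition of a distance-weighted estimator, in which each score's weight is the inverse of its mean absolute distance to the other scores. The article does not spell out the exact formula used in the original investigation, so the function name and the weighting below are assumptions, not a reconstruction of that program.

```python
def distance_weighted_estimate(scores):
    """Weighted mean in which each score is weighted by the inverse of its
    mean absolute distance to all other scores (assumed definition)."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one score")
    total_weight = 0.0
    weighted_sum = 0.0
    for i, x in enumerate(scores):
        # Sum of distances from this score to every other score.
        dist = sum(abs(x - y) for j, y in enumerate(scores) if j != i)
        if dist == 0:
            # Happens only when all scores are identical (or n == 1).
            return float(x)
        weight = (n - 1) / dist
        total_weight += weight
        weighted_sum += weight * x
    return weighted_sum / total_weight


# Hypothetical scores: three clustered in the low 80s plus one "big F".
example = [80, 81, 82, 20]
print(distance_weighted_estimate(example))  # lands well above the plain mean
print(sum(example) / len(example))
```

Because the low outlier is far from everything else, it receives a small weight, so the estimate stays near the cluster instead of being dragged toward the outlier.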

Geometric Mean
Good for comparing data of differing ranges (e.g. compare students using 100 point spelling test AND 4 point writing rubric). In that case, without allowing for the different strengths of the scores, the spelling test would weigh 25 times more than the writing assignment. A student with good spelling skills but poor writing skills would appear much stronger than a student with good writing skills but mediocre or poor spelling.
CON: Only good for comparing sets of data to other sets, for example comparing students to each other. Not useful for computing overall central tendencies for a single student. Ideally, scores should always be based on the same range (worth the same) for similar assessments. Usually this is accomplished by weighting. For example, in JupiterGrades you can set all math tests as "worth" 20 points even though one test has 20 questions and another has only 14.
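
To make the spelling-versus-writing comparison concrete, here is a small sketch using Python's standard `statistics` module. The two students and their scores are hypothetical, made up purely for illustration.

```python
from statistics import geometric_mean, mean

# Hypothetical raw scores: [spelling out of 100, writing out of 4].
strong_speller = [90, 2]
strong_writer = [60, 4]

# Plain averages of the raw scores are dominated by the 100 point test.
raw_means = (mean(strong_speller), mean(strong_writer))

# Geometric means compare students by ratios, so the 4 point rubric counts too.
geo_means = (geometric_mean(strong_speller), geometric_mean(strong_writer))

print(raw_means)
print(geo_means)
```

With the raw arithmetic mean, the strong speller looks far ahead. With the geometric mean, the strong writer comes out ahead, because doubling the rubric score matters as much as doubling the spelling score.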

Harmonic Mean
This method skews toward smaller values, which mitigates the "bullying" power of large value outliers.
CON: This is not a good method for data sets with "big Fs". For example, if the data were on a 5 point scale (0-4), the 0 ("F" grade) would weigh as much as the 4 ("A"), but on a traditional 100 point scale, a "big F" of 20 points weighs much more than an "A" of 95 points. This method might work better if the "F" were given a value of 55, which would effectively reduce the range of the scores to ≈50, with each grade range being 10 points. That would make it comparable to a 5 point scale, with each grade range, and consequently each score, having as much weight as every other.
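
The effect of a "big F" on the harmonic mean, and of flooring it at 55, can be checked with a short sketch. The scores are hypothetical and the 55 floor is the adjustment suggested above; `harmonic_mean` comes from Python's standard `statistics` module.

```python
from statistics import harmonic_mean, mean

# Hypothetical scores with one "big F".
scores = [95, 90, 85, 20]

# The same scores with every score below 55 raised to 55.
floored = [max(s, 55) for s in scores]

# How far the harmonic mean falls below the arithmetic mean in each case.
gap_raw = mean(scores) - harmonic_mean(scores)
gap_floored = mean(floored) - harmonic_mean(floored)

print(harmonic_mean(scores), gap_raw)
print(harmonic_mean(floored), gap_floored)
```

The gap shrinks considerably once the lowest score is floored, which is the sense in which the adjusted scale behaves more like a 5 point scale.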

Trimean
Finds the median then skews slightly higher or lower based on the data quartiles. In effect, it starts in the middle then moves up or down depending on the middle values of the upper and lower halves of values. For example, if the middle value of the lower half is farther away from the middle value of the upper half, then this skews down.
CON: Only good for data sets where the values are evenly distributed. Because this method uses medians to compute its value, it is susceptible to the same limitation as the simple median (see below).
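
As a rough sketch, the trimean is usually written as (Q1 + 2 × median + Q3) / 4. Quartiles can be computed by several conventions. The version below assumes the "inclusive" method of Python's `statistics.quantiles`, which may not match the convention used in the original investigation, and it borrows the nine-score example set from the Median discussion.

```python
from statistics import quantiles


def trimean(scores):
    """(Q1 + 2 * median + Q3) / 4, using inclusive quartiles (an assumption)."""
    q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4


scores = [20, 42, 45, 50, 51, 79, 80, 82, 100]
print(trimean(scores))  # quartiles are 45, 51, 80, so this prints 56.75
```

The median (51) counts twice, so the result starts there and is pulled upward by the distant upper quartile (80).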

Midhinge
This finds the mean of the first and third quartiles, within which lie the middlemost scores (the "normal" range on a bell curve). This would appear to be a good indicator since it seems logical that if most of the scores fall in this range, this should indicate the performance level.
CON: The logic fails in that it is "backward". Most of the scores fall within the central range; however, the central range is not centered on the majority of the scores! Since it uses medians in determining quartiles, it is susceptible to strong skewing in cases of distant outliers and/or small data sets.
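
The mean of the first and third quartiles (often called the midhinge) can be sketched as below. As with any quartile-based measure, the result depends on the quartile convention. This sketch assumes the "inclusive" method of Python's `statistics.quantiles`, and the function name is my own.

```python
from statistics import quantiles


def midhinge(scores):
    """Mean of the first and third quartiles (inclusive method assumed)."""
    q1, _median, q3 = quantiles(scores, n=4, method="inclusive")
    return (q1 + q3) / 2


scores = [20, 42, 45, 50, 51, 79, 80, 82, 100]
print(midhinge(scores))  # quartiles are 45 and 80, so this prints 62.5
```

Notice that 62.5 falls in the gap between the two clusters of scores, where the student never actually scored.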

Winsorized Mean
This method mitigates the effects of extreme outliers by replacing them with values closer to the median. Usually, 10%-25% of the values are replaced (I used 10% in my investigation).
CON: Variability of percentage of replacement values can lead to skewing and subjective mean computation. If a student has two really low outliers and only one really high outlier, do you replace one from each end of the data, leaving a really low outlier so the mean skews low, or replace two on each end, virtually eliminating the effect of the two low scores? One teacher might think that if a student scores that low multiple times, that should show up in the grades while another might wish to give him the "benefit of the doubt" that those two scores were both flukes.
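
The dilemma described above is easy to see in code. Here is a minimal sketch of a Winsorized mean in which `k` (my own parameter name) is the number of scores replaced at each end. The scores are hypothetical, with two very low outliers and one high one.

```python
def winsorized_mean(scores, k):
    """Mean after replacing the k lowest and k highest scores with the
    nearest remaining score on each end."""
    data = sorted(scores)
    n = len(data)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= k < n / 2")
    low, high = data[k], data[n - 1 - k]
    # Clamp every score into the range [low, high].
    clipped = [min(max(x, low), high) for x in data]
    return sum(clipped) / n


scores = [20, 25, 76, 78, 80, 82, 84, 100]
print(winsorized_mean(scores, 1))  # 66.75: one low outlier still drags it down
print(winsorized_mean(scores, 2))  # 79.0: both low outliers are neutralized
```

The same student lands more than twelve points apart depending on a choice the teacher makes, which is the subjectivity problem.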

Truncated Mean
Virtually the same benefits and limitations as Winsorized Mean. See above.
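
For comparison, a truncated mean simply drops the extreme scores instead of replacing them. A minimal sketch, where `k` (my own parameter name) is the number of scores dropped from each end and the scores are hypothetical:

```python
def truncated_mean(scores, k):
    """Mean of what remains after dropping the k lowest and k highest scores."""
    data = sorted(scores)
    if k < 0 or 2 * k >= len(data):
        raise ValueError("k must satisfy 0 <= k < n / 2")
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)


scores = [20, 25, 76, 78, 80, 82, 84, 100]
print(truncated_mean(scores, 1))  # one low outlier (25) is still counted
print(truncated_mean(scores, 2))  # 79.0: only the clustered scores remain
```

As with the Winsorized mean, the result depends heavily on how many scores are trimmed.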

Midrange
Finds the middle of the range (not the scale) of scores.
CON: It just finds the middle of the outliers. Outliers are, by definition, inconsistent. In looking at the data in my investigation, 67% of the scores were clustered in the range of 75 to 82 points, yet the midrange of the entire data set, computed using only two values, is 15 points below this range! In my opinion, the mean of the outliers does not truly represent a student's scores.
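
A two-line sketch shows why only the extremes matter here. Both sets of scores are hypothetical.

```python
def midrange(scores):
    """Halfway point between the lowest and highest score."""
    return (min(scores) + max(scores)) / 2


print(midrange([20, 76, 78, 80, 82, 100]))  # 60.0
print(midrange([20, 21, 22, 23, 24, 100]))  # also 60.0
```

The two students have very different typical scores, yet the midrange is identical because everything between the lowest and highest score is ignored.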

Arithmetic Mean
This is the one we teach our students in school. It is also helpfully built into all grading programs! Just find the average (mean) of all the values.
CON: Unfortunately it is very susceptible to skewing caused by distant outliers. This can be mitigated in most grading programs by manually or automatically dropping highest and/or lowest scores, or adjusting scores like "big Fs". However, this practice introduces subjectivity (see Winsorized Mean).

Median
Another one we teach our elementary students, so it is also super easy to determine. In my investigation, using the scores I entered, this value was closest to my "instinctual" central tendency value of 77.
CON: This has the potential to radically skew if there are large gaps separating clusters of scores. This would tend to appear only in data sets of odd-numbered lengths (e.g. 9 scores). For example, in the data set {20, 42, 45, 50, 51, 79, 80, 82, 100}, most other methods put the central tendency somewhere between the 42-51 cluster and the 79-82, but the median is 51. This value does not fairly give credit to any of the higher scores. Theoretically, a set of scores like this could arise in situations where the student is showing considerable growth in a subject and makes a sudden "growth spurt" in ability, e.g. from FBB to PRO. The median score pegs him as an FBB student. A simple average would put him in the BAS range as an overall representation of his performance level for the grading period. (And actually, using a standards-based reporting model, the student should be given a grade of PRO since he is now performing more or less consistently at that level. But that's a whole other discussion!)
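
The numbers in the example data set above are easy to verify with Python's standard `statistics` module:

```python
from statistics import mean, median

scores = [20, 42, 45, 50, 51, 79, 80, 82, 100]

print(median(scores))  # 51, the top of the lower cluster
print(mean(scores))    # 61, between the two clusters
```

The median sits at the very top of the low cluster and gives no credit to the four scores of 79 or better, while the simple average lands ten points higher.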