The KRACH rating system for college hockey

The KRACH rating system is an attempt to combine the performance of each team with the strength of the opposition against which that performance was achieved, and to summarize the result as one number, a "rating", for each team. The higher the rating, the better the team.

Interpreting the ratings

The ratings are given on an "odds scale": that is, if team A is rated at 400 and team B at 200, team A is reckoned to have odds of 2 to 1 of defeating team B when they meet (since 400 is twice 200). Equivalently, team A is reckoned to have probability 2/3 of defeating team B (since 400/(400+200) is 2/3).
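
In code, the conversion from a pair of ratings to a head-to-head probability is just a ratio. Here is a minimal sketch (the function name is mine, purely for illustration):

  def win_probability(rating_a, rating_b):
      # Probability that the team rated rating_a defeats the team rated rating_b,
      # reading the ratings on the odds scale described above.
      return rating_a / (rating_a + rating_b)

  print(win_probability(400, 200))   # 0.666..., i.e. odds of 2 to 1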

The ratings are calculated purely from the game results (win, loss or tie), and do not use the goals scored at all (in contrast to CHODR). Overtime wins count as wins, but shootout wins count as ties; ties, for KRACH's purposes, count as half a win and half a loss.
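
As a sketch of how a single result might be turned into the "wins" that KRACH counts (the function and the result labels are mine, just for illustration):

  def win_credit(result):
      # Regulation and overtime wins are full wins; shootout wins and ties
      # are worth half a win; losses are worth nothing.
      return {"win": 1.0, "ot_win": 1.0, "shootout_win": 0.5,
              "tie": 0.5, "loss": 0.0}[result]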

The published ratings are scaled so that the "average" team has a rating of 100. This is as much for administrative convenience as anything else; the odds depend only on the ratio of the ratings of the teams involved, and will be the same no matter what "average" is chosen.

Calculating the ratings

Unfortunately, the mathematical model that lies behind KRACH is a non-linear one, which means that there is no nice formula from which the ratings are calculated. The best that can be done is a kind of trial-and-error approach -- the following example is intended to convey the flavor:

Suppose you have four teams, A, B, C and D, whose ratings are known to be 150, 100, 75 and 50 respectively, and a team X which lost to A, tied against B, and defeated C and D. We want to find the rating for team X that summarizes this 2-1-1 record against this opposition. The calculation rests on the idea of "expected wins": given the current guess at team X's rating, this is the sum of the probabilities of X winning each of the four games. For ratings of 50, 100 and 150 for X, the probabilities and expected wins look like this:

X's rating -->        50     100     150
Prob. of defeating:
                 A:  0.25    0.40    0.50
                 B:  0.33    0.50    0.60
                 C:  0.40    0.57    0.67
                 D:  0.50    0.67    0.75
                 ------------------------
             Total:  1.48    2.14    2.52  <-- expected wins

As you would expect, a stronger team with a higher rating is expected to win more of the games against a particular set of opponents than would a weaker team with a lower rating.

What we want to do is to figure out team X's rating based on their 2-1-1 record against these teams. 2-1-1 is 2.5 wins (tie = 0.5 win), so the best choice of rating for team X is the one at which the expected number of wins is also 2.5: "matching observed and expected wins", if you will. The table above suggests that a rating just less than 150 will do the job.

(You may think this is a bit high -- after all, our team X lost to team A, whose rating is 150 -- but, on the other hand, it was by no means certain that team X would defeat both team C and team D, yet it did. The positive (defeating C and D) balances the negative (losing to A).)
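
To see "matching observed and expected wins" in action, here is a minimal sketch (my own illustration, with the opponents' ratings held fixed as in the example) that searches for the rating at which the expected wins come out to 2.5:

  def expected_wins(rating, opponent_ratings):
      # Sum of the probabilities of defeating each opponent at the given rating.
      return sum(rating / (rating + opp) for opp in opponent_ratings)

  def find_rating(observed_wins, opponent_ratings, lo=1e-6, hi=1e6):
      # Bisection: expected wins increase with the rating, so home in on the
      # rating at which the expected wins match the observed wins.
      for _ in range(100):
          mid = (lo + hi) / 2
          if expected_wins(mid, opponent_ratings) < observed_wins:
              lo = mid
          else:
              hi = mid
      return (lo + hi) / 2

  # Team X: 2-1-1 (2.5 wins) against teams rated 150, 100, 75 and 50.
  print(find_rating(2.5, [150, 100, 75, 50]))   # just under 150

The same function reproduces the figures used below: find_rating(1.5, [150, 100, 75, 50]) comes out just over 50 (team Y), and find_rating(2.5, [30, 20, 15, 10]) just under 30 (team Z).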

Why does it work?

There are two things we need to check, to make sure that the rating system is sensible:

  1. If you win more against the same opposition as another team, your rating will be higher.
  2. If you have the same record as another team, but against tougher opposition, your rating will be higher.

#1: Our team X came out with a rating of just under 150 by posting a 2-1-1 record against a certain set of teams. Now suppose a team Y played the same opposition, but was 1-2-1 (1.5 wins). From the table above, team Y's rating needs to be just over 50, since then the observed 1.5 wins will match the expected 1.5 wins. This rating of about 50 is considerably lower than team X's of about 150.

As for #2: suppose now that a team Z also played to a 2-1-1 record, but against opposition whose ratings were each a fifth those of team X's opposition, ie. 30, 20, 15, 10 -- a real bunch of patsies, in other words!

Z's rating -->         10      20      30
Prob. of defeating:
               "A":  0.25    0.40    0.50
               "B":  0.33    0.50    0.60
               "C":  0.40    0.57    0.67
               "D":  0.50    0.67    0.75
               --------------------------
             Total:  1.48    2.14    2.52  <-- expected wins

This time, looking for the rating corresponding to 2.5 expected wins suggests that Z should be rated just under 30. Team Z faced opposition "a fifth as tough" as team X did, and, as a result, found itself with a rating a fifth the size.

To summarize:

           Record    Opposition   Approx. rating
Team X:     2-1-1       tough         150
Team Y:     1-2-1       tough          50
Team Z:     2-1-1        easy          30

... which shows you how KRACH reconciles two teams, like Y and Z, that differ in both record and schedule strength. If the schedule strength differs enough, the team with the inferior record can certainly be ranked higher.

The above also suggests that, with few games in the database, fairly small changes in the data can have large effects on the rating. As a result, early in the season, the ratings are "jumpy", with teams moving apparently erratically up or down the ranking. But, as the season progresses, each weekend's games are a proportionally smaller addition to the database, so that the ratings do settle down.

Some more technical stuff

Unless you're keen, you won't lose much by quitting now :-)

I have it in mind, eventually, to write a proper mathematical description of what's going on. For now, I hope the following will shed some light.

Estimating all the ratings at once

For my examples above, I "conveniently" assumed that the ratings of teams A, B, C and D didn't change, so that I could focus on getting the rating of just one team at a time. In real life, after each weekend's results, I need to recalculate the ratings for all the teams at once -- even for teams that didn't play (their ratings can change indirectly, because their previous opponents did play on the weekend in question).

The same sort of trial-and-error does work, however, at least if you tidy it up a little. Starting with, say, last week's ratings, work out the observed and expected numbers of wins for each team. For those teams for which the "observed" number is greater, move their rating up, and for those whose "observed" number is less, move their rating down. How far up, or down, to move each team's rating is determined by the numerical algorithm you use. Newton's method converges in a small number of iterations, but each one involves (essentially) inverting a #teams-by-#teams matrix, which is slow with as many as 44 teams; the Gauss-Seidel procedure that I prefer needs more iterations, but each of them is much simpler.
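
To make this concrete, here is a minimal sketch of one simple scheme in this spirit (not necessarily the exact procedure behind the published ratings): sweep through the teams, updating each rating in turn using the freshest values for all the others, and repeat until nothing changes much. It assumes every team has both gained and dropped at least a point, which the fictitious games described below guarantee.

  def krach_ratings(games, iterations=500):
      # games: one (team, opponent, credit) triple per team per game, with
      # credit 1 for a win, 0.5 for a tie and 0 for a loss, so every real
      # game appears twice, once from each side's point of view.
      teams = sorted({t for (t, _, _) in games})
      # Opponents that never appear on the left-hand side (e.g. a fictitious
      # rating-100 opponent) keep a fixed rating of 100 and are never updated.
      ratings = {opp: 100.0 for (_, opp, _) in games}
      ratings.update({t: 100.0 for t in teams})
      for _ in range(iterations):
          for t in teams:
              observed = sum(c for (a, _, c) in games if a == t)
              denom = sum(1.0 / (ratings[t] + ratings[b])
                          for (a, b, _) in games if a == t)
              # Fixed-point update: repeating this pulls each team's expected
              # number of wins toward its observed number.
              ratings[t] = observed / denom
      # Rescale so the real teams' simple average is 100; the odds depend only
      # on ratios, so the choice of "average" is purely cosmetic.
      scale = 100.0 * len(teams) / sum(ratings[t] for t in teams)
      return {t: ratings[t] * scale for t in teams}

  # Example: A and B split two games, C beat each of them once, and everyone
  # has a fictitious tie against a rating-100 opponent called "FIC".
  games = [("A", "B", 1), ("B", "A", 0), ("A", "B", 0), ("B", "A", 1),
           ("C", "A", 1), ("A", "C", 0), ("C", "B", 1), ("B", "C", 0),
           ("A", "FIC", 0.5), ("B", "FIC", 0.5), ("C", "FIC", 0.5)]
  print(krach_ratings(games))   # C comes out on top, A and B equal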

No matter what the numerical method, you keep adjusting the ratings until the observed and expected wins for each team are close enough to being equal. If the problem were linear (as it is with CHODR -- and, indeed, the other rating systems), the iteration would be unnecessary, since by solving the appropriate equations, you'd go straight to the solution.

Dealing with perfect (and perfectly futile) teams

The above works perfectly well as long as every team has gained at least one point and dropped at least one: in that case, it is always possible to find a supposed "rating" for each team that is too high (expected > observed) or too low (expected < observed), and thus the correct rating for the team is somewhere between the two. For a team that has won all its games, the story is different, however. Suppose our team X is 3-0 -- 3 observed wins; the expected number of wins is the sum of 3 probabilities, each of which is less than 1, so that the expected number of wins is always less than 3. No matter how big a rating you propose for team X, you can never make the observed and expected wins equal. The same is true in reverse for a team that has lost all its games; only a rating of 0 will produce a zero expected number of wins.

The problem arises because the probability scale, running from 0 to 1, has been "stretched out" onto the odds scale, which runs from 0 to infinity -- if you have a probability that seems to be 1, the odds that go with it will shoot off to infinity. (CHODR doesn't have this problem because it works exclusively with numbers of goals, and no matter how mismatched the teams, the goals scored are still finite!)

To deal with this problem, and the corresponding problems of teams with nearly-perfect records having ratings that are unreasonably high (or low), I pretend that each team has played, and tied, a "fictitious" game against a fictitious opponent with rating 100 (ie. average). Since this is the same for every team, the overall ranking is not unduly affected ("not biasedly affected" is the most accurate way to say it), and it ensures that each team has both gained and dropped a point -- in the fictitious game, if nowhere else -- so that the problem of infinitely large and zero ratings is sidestepped. I always include the fictitious games in the database, even when no teams have perfect (or perfectly futile) records, so that comparisons from week to week can be made; as the season progresses, and more "real" games are played, the effect of the fictitious games diminishes.

As an example, let's pretend that teams A and B have fixed ratings of 75 and 50 (the ratings being fixed makes the example a bit clearer), and that team X has defeated them both. Without the fictitious game, this is the best that can be done:

Rating for X -->    very big
Probs:
              A:     1 - small
              B:     1 - small
              ----------------
          total:     2 - small

("small" is so small that 2 * small = small!)

whereas if you include the fictitious game, you can get an answer:

Rating for X -->    200   400
Probs:
              A:    0.73  0.84
              B:    0.80  0.89
     fictitious:    0.67  0.80
     -------------------------
          total:    2.20  2.53

Team X has 2.5 wins in total (2 real + 0.5 fictitious), and so the appropriate rating is a little under 400. When the fictitious game is included, teams with perfect records will not necessarily be ranked ahead of those with blemishes on their record, which is as it should be, since a team that has established a perfect record against a bunch of nobodies is not deserving of any great respect.
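
As a check on these numbers, here is the same bisection idea from the earlier sketch, with the fictitious game simply added to the list of opponents (again my own illustration, not the published code):

  def expected_wins(rating, opponent_ratings):
      return sum(rating / (rating + opp) for opp in opponent_ratings)

  def find_rating(observed_wins, opponent_ratings, lo=1e-6, hi=1e6):
      # Bisection on the rating, as in the earlier sketch.
      for _ in range(100):
          mid = (lo + hi) / 2
          if expected_wins(mid, opponent_ratings) < observed_wins:
              lo = mid
          else:
              hi = mid
      return (lo + hi) / 2

  # With the fictitious tie: 2.5 wins against opponents rated 75, 50 and 100.
  print(find_rating(2.5, [75, 50, 100]))   # about 370: finite, below 400 as the table suggests
  # Without it: 2 wins in 2 games has no finite solution, and the search just
  # runs off to the artificial upper limit ("very big", as in the first table).
  print(find_rating(2.0, [75, 50]))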