Ratings and Champion Skill

July 20, 2017

By far, the most common request we received after the launch of The Better Meta was to incorporate player ranking (a.k.a. Elo, MMR, etc.). “Putting up some averages is cool, but it’s not going to cut it,” we heard. “Of course Sha Lin does poorly on average, he’s hard to play! But good Sha Lins are amazing!” or “Who cares that Torvalds have the highest winrate—Torvalds are terrible! They’re only good at killing noobs.” We agreed, and after a bit of work, we’re here to give you the data! In this post, we’re going to walk you through how we rated players and how to use ratings to figure out which champions are advantaged.

tl;dr: the new graphs are here.

Preliminaries

The analysis done in this post uses data from five months of competitive matches ranging from OB42 to OB53, which amounts to over 2.55 million matches among about 600K different people. This means each competitive player played an average of 42 games (median 12). Breaking it down further, per champion it’s only 6 games (median 2)! The following chart shows the number of players Y who’ve played at least X matches:

[Interactive chart: number of players Y who’ve played at least X matches]

This chart shows that there are a lot of “noobs”, so we absolutely have to take skill into consideration when looking at champion ability. Let’s try to separate the wheat from the chaff.

Pro tip: all the graphs in this post are interactive! Hover your mouse or click to see detailed information.

Basic Ratings

Hi-Rez does not currently provide in-game ratings publicly, so we had to build our own ratings system. Let’s take a look at that.

The most basic way to judge “how good” someone is at Paladins is to look at their winrate. A perfect player would win every game and would have a winrate of 100%; the worst player would have a winrate of 0%; in general, most players are somewhere in the middle.

However, players are generally matched up against players of the same skill level (a post for another time!), which means winrate isn’t enough by itself. We can do much better by taking into account the difficulty of the match—if you beat a bunch of terrible players, it says less about your skill than if you beat fantastic players.

Luckily, other people have done a lot of the hard work for us. There are a number of rating algorithms like Elo, TrueSkill™, and Glicko. We will be using Glicko-2, a variation of the Glicko algorithm, with the champion and team extensions described below.

How Ratings Work Generally

Glicko-2 keeps track of two pieces of information about each player’s skill: an estimated rating (our best guess of a player’s real rating) and a measure of uncertainty (how confident we are about that guess). The basic idea is that everyone starts at the same rating estimate (medium) and the same uncertainty (high). For every game played, the estimated rating gets adjusted up (on a win) or down (on a loss), and the uncertainty goes down.[1] Winning an easy game increases your rating a little bit, while losing an easy game decreases it a lot; additionally, games against players whose ratings we’re more certain about will change your rating more dramatically. If you’re that kind of person, all the gory statistical details of this process can be found on Glickman’s website.
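To make the mechanics concrete, here is a deliberately simplified sketch in Python of an uncertainty-weighted rating update. It is a toy illustration of the behavior described above, not the actual Glicko-2 math (see Glickman’s site for that); the `Rating` fields, constants, and scaling are our own illustrative choices.

```python
from dataclasses import dataclass
import math

@dataclass
class Rating:
    value: float = 1500.0        # estimated rating: everyone starts "medium"
    uncertainty: float = 350.0   # starts high, shrinks with every game played

def expected_win(a: Rating, b: Rating) -> float:
    """Chance that `a` beats `b`, from a logistic curve on the rating gap."""
    return 1.0 / (1.0 + math.exp(-(a.value - b.value) / 400.0))

def update(player: Rating, opponent: Rating, won: bool) -> None:
    """Nudge `player` toward the observed result.

    The more uncertain we are about the player, the bigger the nudge; the
    more certain we are about the opponent, the more informative the game.
    Both properties mirror Glicko-2's qualitative behavior.
    """
    surprise = (1.0 if won else 0.0) - expected_win(player, opponent)
    step = 32.0 * (player.uncertainty / 350.0) * min(350.0 / opponent.uncertainty, 2.0)
    player.value += step * surprise  # a little for expected results, a lot for upsets
    player.uncertainty = max(50.0, player.uncertainty * 0.95)  # confidence grows
```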

People often wonder why ratings systems don’t take into account things like damage done or kill/death ratios. One major problem with this idea is that it encourages the wrong kind of behavior (see the cobra effect). If people get points for kills, they will kill even when they should be healing; if they get points for heals, they will heal even when they should be retreating. As long as ratings are tied directly to winning and losing matches, there’s no way to game the system.

Extensions to Glicko

Glicko is designed for games with two symmetric players (e.g., chess). Paladins has both champions and teams, so we have to do something special to handle them.

Dealing with Champions

We take a very restrictive and simple view of champions. Each person gets a completely separate rating for each champion they play. The matches a person plays with champion A do not affect, in any way, that person’s champion B rating.

The reason we do this is simple: people aren’t good at all champions. Just because you’re a top-tier Androxus doesn’t mean you’re fantastic at playing Torvald. In fact, because we keep ratings separate, we can look for correlations between champions (another upcoming post!).

In the following sections, whenever we refer to “a player’s rating” or even just “a rating”, we always mean “a player’s rating for a particular champion”. This is quite a mouthful, so we don’t usually say the whole phrase. Similarly, “a player” means “a person playing a particular champion”. When a person switches champions, they’re a different player.
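In code terms, a “player” is just a (person, champion) pair, and the rating table is keyed by that pair. A minimal sketch (the names here are purely illustrative):

```python
from collections import defaultdict

# Every (person, champion) pair gets its own independent rating -- that pair
# is what we call a "player". Games played as one champion never touch the
# same person's rating for any other champion.
ratings = defaultdict(lambda: {"value": 1500.0, "uncertainty": 350.0})

androxus = ratings[("SomePerson", "Androxus")]
torvald = ratings[("SomePerson", "Torvald")]
assert androxus is not torvald  # separate entries, updated separately
```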

Dealing with Teams

To deal with the 5v5 nature of Paladins, we use the composite opponent technique and extend it to take into consideration teammates. For each match, we pretend as if each particular player played a 1v1 game against an imaginary opponent with a rating that is the sum of the player’s opponents’ ratings minus the sum of the allies’ ratings.

For example, if you have a rating of 2000 on a team with {1700, 1600, 1400, 1400} rated allies, and you played against a team of {2100, 1600, 1500, 1500, 1300}, then we pretend as if you’ve played a game against a single opponent with a rating of (2100 + 1600 + 1500 + 1500 + 1300) - (1700 + 1600 + 1400 + 1400), or 8000 - 6100, or 1900. From your perspective, it’s as if you played against an opponent of a slightly lower rating than your own (1900 vs 2000). This is done for every player in the match, so ten composite matches are given to the Glicko algorithm.
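Here is that computation as a small sketch, using the numbers from the example:

```python
def composite_opponent(opponent_ratings, ally_ratings):
    """Rating of the imaginary 1v1 opponent: sum of enemies minus allies."""
    return sum(opponent_ratings) - sum(ally_ratings)

# The example from the text: you are rated 2000.
allies = [1700, 1600, 1400, 1400]           # your four teammates
opponents = [2100, 1600, 1500, 1500, 1300]  # the enemy team

print(composite_opponent(opponents, allies))  # 8000 - 6100 = 1900
```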

In our testing, our extension seems to have slightly more predictive power than the original technique (which only adds up the opponents), and indeed did the best of any technique we tried. Taking “carries” into consideration (e.g., skewed team rating distributions) is something we’ve started looking into, but it doesn’t seem to affect things very much.

Paladins Ratings

Now that we can rate players, we’re going to start looking at the data. We’ll step through a series of modifications to a graph and wind up at the new Champion Ratings graph.

Step 1: Overall Distribution of Ratings

Let’s start with a look at how ratings are distributed:

[Interactive chart: histogram of ratings across all players]

This histogram shows how common different ratings are for people who’ve played at least 10 games with the champion being rated. For any given rating, it shows how many players have nearly the same rating (e.g., 76,108 players are rated between 1675 and 1725).
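For the curious, the bucketing works the way you’d expect: each rating falls into a 50-point-wide bin centered on a multiple of 50 (so 1675–1725 maps to the 1700 bin), and we count how many land in each. A quick sketch with made-up ratings:

```python
from collections import Counter

def bin_of(rating, width=50):
    """Center of the 50-point-wide bin a rating falls into."""
    return round(rating / width) * width

sample_ratings = [1712, 1698, 1843, 1560, 2401]  # made-up example values
print(Counter(bin_of(r) for r in sample_ratings))
# Counter({1700: 2, 1850: 1, 1550: 1, 2400: 1})
```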

If you add up all the bins in this histogram, you get about 740K total players. We know from above that there are ~330K actual people who have played at least 10 games, so this means each person has a little more than two champions in this histogram on average. The rightmost (2400) bin has 600 ratings in it, which means it represents 600/740K, or 0.08%, or the top eight hundredths of one percent of all ratings. In contrast, the middle bin contains about 10% of the ratings.

Remember, these rating numbers are for our system only and are not the same as those in-game, nor are they the same as on any other Paladins site! The rating numbers themselves are arbitrary—all that matters is the shape of the distribution. The shape is a “bell curve”, and the data is roughly normally distributed (with a mean of 1709 and a standard deviation of 196). Most players fall somewhere in the middle, and there are relatively few who are terrible or amazing.
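As a quick sanity check on that fit (using only numbers quoted above), we can ask how many ratings the fitted bell curve would put in the middle bin:

```python
import math

def normal_cdf(x, mean=1709.0, sd=196.0):
    """CDF of the normal distribution fitted to the histogram above."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# Share of ratings the fitted curve places in the middle (1675-1725) bin:
share = normal_cdf(1725) - normal_cdf(1675)
print(f"{share:.1%}")          # 10.1%
print(round(share * 740_000))  # ~75,000 -- close to the 76,108 we observed
```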

As described above, every person gets a separate rating for each champion they play, so if you play multiple champions, you’d be represented in this chart more than once.

Step 2: Lines Instead of Bars

[Interactive chart: the same rating distribution, drawn as a line]

This is basically the same graph as above, but we’re drawing lines between the tops of where each bar was.[2] Just as before, the value Y at a rating X is how many players have about that rating (to within 50).

Step 3: Breaking Things Down by Champion

Remember, every person has a different rating for every champion they play. This means the Step 2 graph contains your great Ying rating somewhere on the right as well as your terrible Androxus rating somewhere on the left, etc. Let’s split each champion out separately and see what we get:

[Interactive chart: rating distributions broken down by champion]

This graph shows how many players of each champion have a particular rating. Each line in this graph represents the rating distribution of a different champion, and if you were to add them all together, you’d get back to the previous graph.

Probably the easiest thing to see in this graph is that some of the curves are “bigger” (they have more area beneath them) than others. The area underneath a champion’s curve is a direct measure of the champion’s popularity: the more players who’ve played a particular champion, the more space their curve takes up on the graph. New champions like Zhin and Ash simply don’t have as many players as older champions like Makoa, so their curves are much closer to the x-axis.

Step 4: Normalizing by Champion Popularity

We want to find the best champions even if they are new or they aren’t popular, so let’s go ahead and normalize everything by popularity:

[Interactive chart: per-champion rating distributions, normalized by popularity]

Now all the curves trace out the same amount of area, so they are directly comparable. We haven’t shifted anything right or left—we’ve only scaled each curve up or down.
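Mechanically, the normalization just divides each champion’s bin counts by that champion’s total, so every curve traces out the same area. A sketch with made-up counts:

```python
def normalize(bin_counts):
    """Scale a champion's histogram so its bins sum to 1 (equal area)."""
    total = sum(bin_counts.values())
    return {center: count / total for center, count in bin_counts.items()}

# Made-up counts: a popular champion and a new one with the same shape.
makoa = {1600: 40_000, 1700: 80_000, 1800: 40_000}
zhin = {1600: 1_000, 1700: 2_000, 1800: 1_000}

print(normalize(makoa))  # {1600: 0.25, 1700: 0.5, 1800: 0.25}
print(normalize(zhin))   # identical -- the curves are now directly comparable
```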

With that change, two specific curves are crying out for our attention: the purple one on the right and the orange one on the left. Why are they sticking out while the others are jumbled together in the middle? This time it’s because they represent “exceptional” champions—Torvald is the purple curve and Kinessa is orange. As you may know from our Champion Performance chart, Torvald has been doing particularly well (on average), and Kinessa particularly poorly.

Step 5: Focus on Kinessa

Let’s focus on Kinessa for a second, with Drogoz by way of comparison:

[Interactive charts: rating distributions for Kinessa and Drogoz]

These two graphs show how Kinessa’s and Drogoz’s ratings are distributed. They have a similar shape (indeed, in the previous chart we saw that all champion curves have similar shapes), but they are shifted horizontally and stretched a bit.

What can we learn from this picture? For starters, an average Kinessa has a rating around 1550, while an average Drogoz has a rating around 1750. This isn’t too surprising, since we already knew that Kinessa does worse on average than other champions. Considering that most ratings fall somewhere between 1100 and 2400, a difference of 200 doesn’t seem all that bad! However, nearly every rating below the 1700 average is dominated by Kinessa, while nearly every rating above 1700 is dominated by Drogoz.

Step 6: Stack to Compare

To investigate this relationship in more detail, we’d like to know which champion has more or less of each rating. By stacking the curves on top of one another (just by adding them), we make it easy to see which champion has the most players at any given rating. Here’s what that looks like:

[Interactive chart: Kinessa and Drogoz rating distributions, stacked]

In case you’ve never seen an area chart before, it may be helpful to look at the equivalent bar chart:

[Interactive chart: the same data as a stacked bar chart]

Now we can easily see what proportion of each rating belongs to each champion. This makes it super easy to compare champions as long as the ratings fall somewhere towards the middle, but it’s still really hard to see what’s going on at the extremes.

Step 7: Fill to 100%

To get the full picture we just need to make all the tiny bars bigger. Let’s scale up every bin to be the exact same height:

[Interactive chart: each rating bin scaled to 100%]

Now we can clearly see the makeup of the entire range of performance[3] for whatever champions we want. We see as before that Kinessas tend to represent more of the lower ratings while Drogozes represent more of the upper end, but we can also see that the trend continues even into the extreme ratings. Since Kinessa still maintains a significant presence at even the highest ratings, it’s possible to be a good Kinessa player.

The best of the best players are in the rightmost (2400) bin. Inside, we see that it’s made up of 25% Kinessas and 75% Drogozes. We already normalized by popularity, so this discrepancy reflects the fact that it simply requires more skill (practice, talent, etc.) to be as good with Kinessa as with Drogoz. On the other hand, Drogoz is almost entirely absent at the lowest levels of play—this means it’s very difficult to be as bad at Drogoz as you can be with a Kinessa. For every worst possible Drogoz, there are about 16 equally bad Kinessas.
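In code, the “fill to 100%” step is the same normalization applied the other way: within each rating bin, divide every champion’s (already popularity-normalized) count by the bin’s total. The counts below are hypothetical, chosen to reproduce the 25%/75% split just described:

```python
def bin_shares(counts_by_champion):
    """Within one rating bin, each champion's share of the bin total."""
    total = sum(counts_by_champion.values())
    return {champ: count / total for champ, count in counts_by_champion.items()}

# Hypothetical normalized counts in the rightmost (2400) bin:
rightmost_bin = {"Kinessa": 0.002, "Drogoz": 0.006}
print(bin_shares(rightmost_bin))  # {'Kinessa': 0.25, 'Drogoz': 0.75}
```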

Step 8: Get Rid of Ratings

So far, we’ve been looking entirely at our made-up rating numbers. Even though the rightmost bin represents the top 0.08% of players, just how much does skill matter? To put the finishing touch on our graph, let’s turn the ratings into something a little more universal—winrates.

We can figure out how much advantage a particular rating confers by looking at the relationship between them:

[Interactive chart: win chance versus team rating advantage]

This graph shows, in practice, how a team’s rating advantage affects its chance to win. For example, if your team has a 600-point rating advantage over another team, then you have a 65% chance to win. This relationship[4] means we can avoid looking at ratings directly, and instead look at how advantaged players are:

[Interactive chart: champion makeup across advantage levels]

On the x-axis, we’re now using “advantage” numbers. In this context, we say you have an advantage of X if, when you replace an average player in a balanced game, your team becomes X% more likely to win than before.
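For the curious, here is a sketch of that conversion. The logistic shape comes from footnote 4, but its scale constant is not given in the post; we back it out from the one data point quoted above (a 600-point advantage ≈ 65% win chance), and we assume “average” means the 1709 mean from Step 1, so treat the numbers as illustrative:

```python
import math

# Solving 1/(1 + exp(-600/s)) = 0.65 gives s = 600 / ln(0.65/0.35) ~= 969.
S = 600 / math.log(0.65 / 0.35)

def win_chance(advantage_points: float) -> float:
    """Chance to win, given a team rating advantage, via the logistic fit."""
    return 1.0 / (1.0 + math.exp(-advantage_points / S))

def advantage_pct(rating: float, average: float = 1709.0) -> float:
    """How much likelier a balanced game becomes a win when a player of
    this rating replaces an average one (the x-axis of the final charts)."""
    return (win_chance(rating - average) - 0.5) * 100

print(round(win_chance(600), 2))      # 0.65, matching the example above
print(round(advantage_pct(1905), 1))  # ~5.0: one standard deviation ~= +5%
```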

This is the graph we’ve been working towards—with this we can compare two or more champions across different levels of play. Let’s add in all the other champions and take one final look.

Analysis

Here are all the champion ratings available at the end of OB53 put into our fancy chart:

[Interactive chart: Champion Rating Overview, end of OB53]

Just as before, the height of each colored band represents how many players of a particular champion there are at any particular rating.

We’ve already examined Kinessa and Drogoz in detail, but they stand out on this graph as well. Looking at the extreme left, Kinessa is by far the easiest of all champions to be terrible with. On the other side of the graph, we have champions like Drogoz, Bomb King, Buck, and Torvald, who make up most of the highest ratings. The highest ratings in the game are consistently achieved using those champions; conversely, very few people have been able to reach high ratings with champions like Skye or Makoa.[5]

It’s really interesting to look at Androxus: his band opens up on both ends, which means that it’s very easy to be bad with him, but it’s also possible to be great. While there are a few other champions that show this to a lesser extent (Kinessa and Cassie are other examples), Androxus is relatively even on both ends. This is definitely behavior we can’t see by looking at averages, so it’s exciting that we could capture it.

Conclusion

If you’ve followed along through this post, you should understand how to read the new Champion Rating Overview graph. Hopefully this graph shows you what you’ve always wanted to know about champion ratings, but if you have any other ideas for ratings-related graphs you’d like to see, let us know and we’ll give it a shot! We’ve got a few ideas up our sleeves already, but we’d love to hear yours on Twitter.

  1. Ideally we’d also increase the uncertainty at the beginning of every patch and when players don’t play frequently, but we’re not doing this yet. [Edit 2017-08-16: We are now increasing uncertainty at the beginning of every patch.] 

  2. The y-axis could be made to show probability density by dividing through by the total number of players and the width of a bucket (50), but we think using player counts is more intuitive. 

  3. We have been surreptitiously squashing the most extreme data into the left- and rightmost two bins to ensure enough data for comparison. In this final chart, it means that the first and last bins are really cumulative sums of the more extreme ratings, while all other bins are not. Not a single person goes unaccounted for, even though there are so few people so far out. 

  4. The relationship is very closely fit by a logistic function. 

  5. All of this is related to “skill floors” and “skill ceilings”, but we’re avoiding these terms because the meanings aren’t universally agreed upon.