- Getting familiar with the dataset
- To scale or not to scale?
- Final thoughts
I’ve wanted to try Makeover Monday ever since I learned about the project. But busy being busy, you know.
Last week, however, two things happened:
- Statistics for Arsenal’s last season were posted as a weekly project (and I love football).
- I accidentally noticed that.
So it was decided - I’m creating a visualisation.
I need to create it using something, of course. What are my options? The data.world platform offers quite a few integrations. After playing with a few (Tableau, Google DataStudio, Excel), I realised that the simplest and most flexible way for me to crunch some numbers was Jupyter.
For me, Jupyter beats other tools for EDA.
- I can iterate quickly thanks to the environment itself.
- Slice and dice data the way I want.
- Plot it instantly.
- Play with sklearn or anything else from the python ecosystem for more clever algos.
To be fair, I can get all of the above in RStudio (and probably more precise stats and subjectively more beautiful visuals out of the box), but I’m more comfortable with the python ecosystem.
The notebook for the process below can be found here.
Getting familiar with the dataset
The data covers players’ KPIs throughout the season, for example Saves, Tackles, Passes, Minutes Played, etc. Position data wasn’t in the dataset.
Initially, I wanted to cluster players together to see if there were any patterns and groups. Then I realised it wouldn’t make a good visualisation: hey look, these players have Saves, they are probably goalkeepers. So I decided to enrich the data with information about positions.
With this data, let’s look at performance by position. Maybe a radar chart can work for this kind of comparison?
A chart can only have so many axes before readers stop making sense of it. How to choose which metrics to display? There are some obvious choices, of course: saves for goalkeepers, tackles for defenders, goals for forwards. But how about common attributes such as passes or touches?
To identify potentially interesting data points I started with a plot of correlations. I was looking for some interesting patterns, e.g. metrics that are not linearly correlated with the minutes played.
There are some:
- obviously bad choices (clearances off the line - only a couple of players have at least 1);
- not so obvious but not really interesting ones (the more you play, the higher your KPI);
- and what look like some potentially distinguishing KPIs.
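The correlation check can be sketched in a few lines. This is a minimal illustration in plain Python, not the code from the notebook: the player names, metric names and numbers are all made up, and `pearson` and `correlation_with_minutes` are hypothetical helpers. A KPI that correlates almost perfectly with minutes played mostly tells you who played a lot; a weakly correlated one is a candidate for the chart.

```python
import math

# Hypothetical season totals; names and numbers are made up for illustration.
players = {
    "Player A": {"minutes": 3000, "passes": 1500, "tackles": 30, "clearances_off_line": 1},
    "Player B": {"minutes": 2000, "passes": 1000, "tackles": 80, "clearances_off_line": 0},
    "Player C": {"minutes": 1000, "passes": 500, "tackles": 20, "clearances_off_line": 0},
    "Player D": {"minutes": 500, "passes": 250, "tackles": 40, "clearances_off_line": 0},
}


def pearson(xs, ys):
    """Pearson correlation coefficient; None if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


def correlation_with_minutes(players, minutes_key="minutes"):
    """Correlate every KPI with minutes played across all players."""
    rows = list(players.values())
    minutes = [row[minutes_key] for row in rows]
    kpis = [key for key in rows[0] if key != minutes_key]
    return {kpi: pearson(minutes, [row[kpi] for row in rows]) for kpi in kpis}


corr = correlation_with_minutes(players)
for kpi, value in corr.items():
    print(kpi, value)
```

In this toy data, passes grow in lockstep with minutes, while tackles barely follow them, which is the kind of pattern worth a second look.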
What would be the best way to put it on the graph?
To scale or not to scale?
If we want to create player profiles, we cannot just say Cech made X saves while Leno made Y. We need to normalise the numbers. The most precise way is to do it by minutes played: tackles per minute played, passes per minute played, and so on.
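As a rough sketch of that normalisation (plain Python, with a hypothetical `per_minute` helper and made-up totals rather than the real dataset):

```python
def per_minute(stats, minutes_key="minutes"):
    """Divide every KPI total by minutes played; the minutes column itself is dropped."""
    minutes = stats[minutes_key]
    if minutes <= 0:
        # A player who never played has no meaningful per-minute rate.
        raise ValueError("minutes played must be positive")
    return {kpi: total / minutes for kpi, total in stats.items() if kpi != minutes_key}


# Hypothetical totals for one player, chosen only to make the arithmetic easy to follow.
totals = {"minutes": 2000, "clearances": 120, "tackles": 50}
rates = per_minute(totals)
print(rates)
```

Here 120 clearances over 2000 minutes come out at 0.06 per minute, and 50 tackles at 0.025.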
Having these numbers we can already iterate to find the best representation.
The good part here - real values are displayed for the KPIs. We can see Mustafi completing around 0.06 clearances per minute played, more than anyone else. The problem - the metrics don’t share a scale, so a few large ones stretch the axis and the smaller values become hard to read.
Mustafi - first in memes but also completing around 0.06 clearances and 0.025 tackles per minute, more than any other defender in the team.
We can try to play with axis scales to get a more even distribution of numbers. As you can see, the graphs are now more evenly spread, but readability is a mess.
You need to be careful using log scales, especially in a radar chart.
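One way to see why, sketched in plain Python with made-up per-minute rates (the `log_scale` helper is hypothetical, and this is my illustration of one pitfall rather than the exact transform behind the chart): per-minute values are all below 1, so their logarithms are negative, and a metric with a value of 0 has no logarithm at all.

```python
import math


def log_scale(rates):
    """log10 of each positive rate; zero has no logarithm, so it maps to None."""
    return {kpi: (math.log10(v) if v > 0 else None) for kpi, v in rates.items()}


# Made-up per-minute rates for a single player.
rates = {"passes": 0.6, "clearances": 0.06, "tackles": 0.025, "clearances_off_line": 0.0}
logged = log_scale(rates)
for kpi, value in logged.items():
    print(kpi, value)
```

A tenfold gap between passes and clearances shrinks to a difference of 1 on the log axis, which evens things out but also makes the distances between players much harder to read off the chart.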
We can divide values by the maximum for the metric at that position, so every player gets a fraction of the position’s maximum within the team. For the example above, Mustafi gets 1 for clearances, Sokratis around 0.8, and so on.
Personally, I like the last approach most, even though it costs some interpretability: the real values are no longer visible, but the relative comparison is so much better.
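A minimal sketch of the divide-by-maximum scaling, again in plain Python. The `scale_by_position_max` helper, the player labels and most of the numbers are made up; only the 0.06 and "around 0.8 of that" clearance figures echo the example above.

```python
def scale_by_position_max(rates, positions):
    """Divide each metric by the highest value among players in the same position."""
    # First pass: find the maximum per (position, metric) pair.
    maxima = {}
    for player, metrics in rates.items():
        for metric, value in metrics.items():
            key = (positions[player], metric)
            maxima[key] = max(maxima.get(key, 0.0), value)

    # Second pass: express every value as a fraction of that maximum.
    scaled = {}
    for player, metrics in rates.items():
        scaled[player] = {}
        for metric, value in metrics.items():
            top = maxima[(positions[player], metric)]
            scaled[player][metric] = value / top if top > 0 else 0.0
    return scaled


# Hypothetical per-minute rates and positions.
positions = {"Defender A": "DF", "Defender B": "DF", "Keeper X": "GK"}
rates = {
    "Defender A": {"clearances": 0.06, "tackles": 0.025},
    "Defender B": {"clearances": 0.048, "tackles": 0.020},
    "Keeper X": {"saves": 0.03},
}
scaled = scale_by_position_max(rates, positions)
print(scaled)
```

Because the maximum is taken per position, a goalkeeper’s saves are never compared against a defender’s, and every axis on every chart tops out at 1.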
So let’s just polish the graph a bit.
Goalkeepers
Cech got only around 20% of Leno’s playing time.
Distinctive styles: Cech makes many more high claims, while Leno punches more often.
Cech has a better “saves per minute” ratio, but that might be the result of him playing much less.
4 most played Defenders
Mustafi leads on most defensive metrics. He also played the most minutes of all defenders (3rd overall, after Leno and Aubameyang).
Sokratis has a similar profile, with the exception that he leads by far in yellow cards.
Bellerin and Kolasinac have more offensive profiles, sharing third place in the team’s overall assists ranking (5 each, though Bellerin played less).
Midfielders with at least 20 appearances
There are two clear patterns on the chart: defensive and offensive. Xhaka and Torreira represent the former, Mkhitaryan and Özil the latter.
Ramsey is an interesting exception, a “complete” midfielder: he leads in assists per minute played but also makes a lot of tackles.
Top 3 Forwards
Aubameyang and Lacazette rank 2nd and 4th in minutes played during the season.
Lacazette fouls five times more than Aubameyang or Iwobi.
Aubameyang leads in goal effectiveness, while Iwobi and Lacazette are great at assists.
Final thoughts
I used Canva to assemble the graphs above into the final image and added some red and white colours. I didn’t come up with any breakthroughs, of course - everything on the graphs is already well known. However, I had a couple of interesting observations and discoveries during the process. And had some fun as well.
The greatest benefit was documenting the process itself. Jason Fried urges people to write more to really think an idea through, and just sitting down and writing out the analysis process was really rewarding.