Unaggregate your data, bust a quote and learn a thing about modern football

analytics
Author

Elias Nema

Published

May 20, 2020

Introduction

Somehow I’ve spent the last week looking at Man City stats.

It all started with this quote in a book about data and football that I’m reading now.

Spoiler alert: this proved to be wrong. But it triggered interest from my side. Man City is one of the most recognisable sides, but do they play that short?

A common pitfall

So I wanted to quickly double-check the fact, which turned out to be not easy at all. The available data is either incomplete or too aggregated, and reliable public sources are scarce. For example, the official Premier League site says that Man City scored 13 goals from outside the box during the 2017/18 season, but gives no breakdown over the season (the quote mentions only the first half of the season).

And here I fell into a very common trap of working with data. It’s always tempting to look at aggregated data to see patterns and trends. However, often you need the exact opposite: take one sample, study it, see if it makes sense, see if it contains the data that you expect.


In fact, to prove the quote wrong I needed just a single example of a goal that City scored from outside the box during that part of the season. So after searching YouTube for the best strikes of the season, I found that the quote didn’t hold up. Just take a look at this last-minute wonder-strike from Sterling.
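
The same "one counterexample is enough" check can be sketched in pandas. This is only an illustration: the rows are hypothetical, the 'Goal' outcome label is an assumption (the sample rows in the Code section only show Blocked, Off Target and Saved), and a distance above 18 yards is a rough proxy for "outside the box", not an exact test.

```python
import pandas as pd

# Hypothetical shot log; column names mirror the dataset used in this post.
shots = pd.DataFrame({
    "Matchday": [1, 3, 5, 8],
    "Player": ["Player A", "Player B", "Player C", "Player A"],
    "Outcome": ["Blocked", "Goal", "Goal", "Saved"],  # 'Goal' label is assumed
    "Distance": [11, 24, 7, 27],                      # yards from goal
})

BOX_DEPTH = 18  # penalty area depth in yards; a rough proxy only

# Look at individual rows instead of an aggregate: any goal from long range?
long_range_goals = shots[(shots["Outcome"] == "Goal") & (shots["Distance"] > BOX_DEPTH)]
print(long_range_goals)

# A single matching row is enough to refute a "never scored from distance" claim.
quote_refuted = not long_range_goals.empty
```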

So we’ve already proved the quote wrong simply by watching a couple of great football moments.

Going deeper

But now that we are on the topic, let’s dig deeper. The best website I was able to find for more granular data is fbref. There I took statistics on every shot Man City took during their last 3 Premier League seasons, including the distance of each shot.

Visualising

Here and below I’ll be taking into consideration only the first halves of the seasons. First, because the quote was about that. Second, this is enough to see the tactical changes over time. Third, data for the 2019/20 season is incomplete anyway.

Let’s start with trying to understand if there is any significant difference between seasons.

This visualisation is of little use here, but at least we can see some distinctions between seasons.

Let’s quickly compute t-test statistics to get a numeric representation of it:

season 17/18 vs 18/19: Ttest_indResult(statistic=1.1348979627228202, pvalue=0.2568168035045629)
season 18/19 vs 19/20: Ttest_indResult(statistic=1.8185721011018878, pvalue=0.06943167359879912)
season 17/18 vs 19/20: Ttest_indResult(statistic=3.1215926762079387, pvalue=0.0018782472078475076)


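Indeed, the difference between 2017/18 and 2018/19 is small and not statistically significant (p ≈ 0.26). The difference between 2018/19 and 2019/20 is larger, though still just short of the usual 0.05 threshold (p ≈ 0.07). Together they compound into a clear difference between the 2017/18 and 2019/20 seasons (p ≈ 0.002).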
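
The three statistics above can also be reproduced from summary statistics alone. A minimal sketch, assuming the per-season means, standard deviations and shot counts printed in the Code section, and assuming every shot has a recorded distance so that the sample sizes equal the row counts. The `welch` helper is a name introduced here for illustration.

```python
from scipy import stats

# (mean, sample std, n) per season, copied from the outputs in the Code section
summary = {
    "17/18": (16.701183431952664, 8.51091453670634, 338),
    "18/19": (15.95086705202312, 8.779779921952818, 346),
    "19/20": (14.872340425531915, 6.9627336279989995, 376),
}

def welch(a, b):
    """Welch's t-test (unequal variances) computed from summary statistics."""
    m1, s1, n1 = summary[a]
    m2, s2, n2 = summary[b]
    return stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=False)

pairs = [("17/18", "18/19"), ("18/19", "19/20"), ("17/18", "19/20")]
results = {pair: welch(*pair) for pair in pairs}

for (a, b), res in results.items():
    print(f"season {a} vs {b}: t={res.statistic:.3f}, p={res.pvalue:.4f}")
```

The statistic is just the difference of means divided by sqrt(s1²/n1 + s2²/n2), which is why the raw shots are not needed here.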

Let’s give this difference a better human interface.

Stacked graph

One way to look at distribution differences is a stacked bar graph. This kind of graph quickly falls apart when the number of dimensions grows, but it is perfect for our use case.

It’s not very detailed, but it is concise. And we clearly see that each season City shoots more, and from closer range.
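
The percentages behind such a chart are just per-season shares of each distance bucket. A small sketch, assuming the shot counts per bucket printed at the end of the Code section:

```python
import pandas as pd

bins = ["(0, 10]", "(10, 20]", "(20, 30]", "(30, 40]", "(40, 50]"]

# Shots per season and distance bucket (yards), copied from the Code section
counts = pd.DataFrame(
    [[95, 127, 91, 24, 1],
     [111, 133, 83, 17, 1],
     [118, 162, 95, 1, 0]],
    index=["2017/18", "2018/19", "2019/20"],
    columns=bins,
)

# Divide each row by its own total so every season sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(3))
```

Normalising per row matters here because the seasons have different shot totals, so raw counts alone would overstate the change in any single bucket.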

Density graphs

If we want to look even closer we can go with density graphs.

Wow, this tells a story now. A coincidence? I doubt that.

Conclusion & learnings

I believe football evolves by learning more about itself. It’s a known fact that teams have started optimising for shots from positions with a higher chance of scoring, which translates into fewer long-range shots.

The evolution of Man City from the 2017/18 to the 2019/20 season shows that they do follow this trend of shooting from “better” positions. But does it make their game better? That’s a question we can’t objectively answer with this data.

Some concise takeaways for the data folks
  • Don’t start with aggregated data. First, make sure that single data points look reasonable.
  • Compare distributions numerically using the t-test, Kolmogorov-Smirnov test, etc.
  • Add a visual story by bucketing:
    • Histograms
    • Pivot histograms into a stacked chart for a different perspective
    • Go for density estimations for more details
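
The post itself only runs the t-test. As an illustration of the Kolmogorov-Smirnov alternative mentioned above, here is a sketch on synthetic data (the two samples are randomly generated stand-ins, not real shot distances):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for two seasons of shot distances (not real data)
season_a = rng.normal(loc=17, scale=8.5, size=340)
season_b = rng.normal(loc=15, scale=7.0, size=375)

# The t-test compares means; the KS test compares the whole distributions
# via the largest gap between the two empirical CDFs.
ks = stats.ks_2samp(season_a, season_b)
print(f"KS statistic={ks.statistic:.3f}, p-value={ks.pvalue:.4f}")
```

One caveat when applying this to the real dataset: distances there are whole yards, so the samples contain many ties and the KS p-value should be read as approximate.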

Code

Reading data

Code
import numpy as np
import pandas as pd
from scipy import stats
city = pd.read_csv('man_city_17_18_19.csv')
Code
city.head()
   Season   Matchday  Opponent                Minute  Player           Outcome     Distance  BodyPart    Notes      SCA1         SCA1Event    SCA2         SCA2Event
0  2017/18  1         Brighton & Hove Albion  4       Gabriel Jesus    Blocked     11        Right Foot  NaN        Fernandinho  Pass (Live)  Kyle Walker  Pass (Live)
1  2017/18  1         Brighton & Hove Albion  9       Danilo           Off Target  21        Right Foot  NaN        David Silva  Pass (Live)  Danilo       Pass (Live)
2  2017/18  1         Brighton & Hove Albion  14      Fernandinho      Off Target  10        Head        NaN        David Silva  Pass (Dead)  NaN          NaN
3  2017/18  1         Brighton & Hove Albion  17      Kevin De Bruyne  Saved       27        Right Foot  Free kick  David Silva  Fouled       Fernandinho  Pass (Live)
4  2017/18  1         Brighton & Hove Albion  32      Kevin De Bruyne  Blocked     28        Right Foot  Free kick  David Silva  Fouled       Danilo       Pass (Live)
Code
print(city.Season.value_counts())
print(city.Matchday.nunique())
2019/20    376
2018/19    346
2017/18    338
Name: Season, dtype: int64
19

The data covers only the first halves (19 matchdays) of three seasons. Getting shot distances per season.

Code
y17 = city.query("Season == '2017/18'")['Distance']
y18 = city.query("Season == '2018/19'")['Distance']
y19 = city.query("Season == '2019/20'")['Distance']

Comparing two samples

Code
from matplotlib import pyplot as plt
plt.style.use('Solarize_Light2')
%config InlineBackend.figure_format = 'retina'
import seaborn as sns
import warnings
warnings.filterwarnings(action='ignore')
Code
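# sns.distplot is deprecated since seaborn 0.11; histplot is the suggested replacement
sns.histplot(y17, bins=20, kde=True, stat='density', label='17/18')
sns.histplot(y18, bins=20, kde=True, stat='density', label='18/19')
sns.histplot(y19, bins=20, kde=True, stat='density', label='19/20')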
plt.legend()
<matplotlib.legend.Legend at 0x1579b7f70>

Code
y17.mean(), y17.std()
(16.701183431952664, 8.51091453670634)
Code
y18.mean(), y18.std()
(15.95086705202312, 8.779779921952818)
Code
y19.mean(), y19.std()
(14.872340425531915, 6.9627336279989995)
Code
# equal_var=False performs Welch's t-test, since the samples differ in variance and size
print(f'season 17/18 vs 18/19: {stats.ttest_ind(y17,y18, equal_var=False)}')
print(f'season 18/19 vs 19/20: {stats.ttest_ind(y18,y19, equal_var=False)}')
print(f'season 17/18 vs 19/20: {stats.ttest_ind(y17,y19, equal_var=False)}')
season 17/18 vs 18/19: Ttest_indResult(statistic=1.1348979627228202, pvalue=0.2568168035045629)
season 18/19 vs 19/20: Ttest_indResult(statistic=1.8185721011018878, pvalue=0.06943167359879913)
season 17/18 vs 19/20: Ttest_indResult(statistic=3.1215926762079387, pvalue=0.001878247207847508)

Looks like there is a high chance of a real difference, at least between the 2017/18 and 2019/20 seasons.

Density distribution

Code
kde17 = stats.gaussian_kde(y17)
kde18 = stats.gaussian_kde(y18)
kde19 = stats.gaussian_kde(y19)
grid = np.linspace(0,50, 51)
Code
f, axs = plt.subplots(3,1, figsize=(8,12))

axs[0].plot(grid, kde17(grid), label="season 2017/18")
axs[0].plot(grid, kde18(grid), label="season 2018/19")
axs[0].plot(grid, kde18(grid)-kde17(grid), label="difference between 2018 and 2017")

axs[1].plot(grid, kde18(grid), label="season 2018/19")
axs[1].plot(grid, kde19(grid), label="season 2019/20")
axs[1].plot(grid, kde19(grid)-kde18(grid), label="difference between 2019 and 2018")

# axs[2].plot(grid, kde17(grid), label="season 2017/18")
# axs[2].plot(grid, kde19(grid), label="season 2019/20")
axs[2].plot(grid, kde19(grid)-kde17(grid), label="difference between 2019 and 2017")

for ax in axs:
    ax.set(ylabel='Density')
    ax.legend()

axs[2].set(xlabel='Shooting distance')
plt.show()

Code
plot = plt.figure(figsize=(8,3))
plt.plot(grid, kde18(grid), label="season 2018/19", figure=plot)
plt.plot(grid, kde19(grid), label="season 2019/20", figure=plot)
plt.plot(grid, kde19(grid)-kde18(grid), label="difference between 2019 and 2018")
plt.xlabel('Shooting distance')
plt.ylabel('Density')
plt.legend()
plt.show()

Code
plot = plt.figure(figsize=(8,3))
plt.plot(grid, kde17(grid), label="season 2017/18", figure=plot)
plt.plot(grid, kde19(grid), label="season 2019/20", figure=plot)
plt.plot(grid, kde19(grid)-kde17(grid), label="difference between 2019 and 2017")
plt.xlabel('Shooting distance')
plt.ylabel('Density')
plt.legend()
plt.show()

Code
plot = plt.figure(figsize=(8,3))
plt.plot(grid, kde18(grid)-kde17(grid), label="difference between 18 and 17", figure=plot)
plt.plot(grid, kde19(grid)-kde18(grid), label="difference between 19 and 18")
# plt.plot(grid, kde19(grid)-kde17(grid), label="difference between 19 and 17")
plt.xlabel('Shooting distance')
plt.ylabel('Density')
plt.legend()
plt.show()

You can see the difference after plotting estimated densities.

Manchester City has started shooting more from the 5-10 and 15-20 yard ranges, and less from beyond 30 yards.
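
Reading such ranges off a plot can be backed up numerically by looking at where the difference of the two density estimates peaks. A sketch on synthetic data (the two samples and the names `gained` and `lost` are illustrative, not taken from the real dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins for two seasons of shot distances (not real data)
far = rng.normal(20, 3, 500)    # "earlier" season, shots from further out
near = rng.normal(10, 3, 500)   # "later" season, shots from closer in

grid = np.linspace(0, 50, 51)
diff = stats.gaussian_kde(near)(grid) - stats.gaussian_kde(far)(grid)

gained = grid[np.argmax(diff)]  # distance where the later season gains most density
lost = grid[np.argmin(diff)]    # distance where the later season loses most density
print(f"largest gain at ~{gained:.0f} yards, largest loss at ~{lost:.0f} yards")
```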

Code
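# Per-bin difference in shot counts (5-yard bins): 2019/20 minus 2017/18
diff=plt.bar(range(5,51,5), 
             height=(np.histogram(y19,bins=10, range=(0,50))[0] - np.histogram(y17,bins=10, range=(0,50))[0])
             ) 
plt.title("2019/20 minus 2017/18")
Text(0.5, 1.0, '2019/20 minus 2017/18')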

Here is the difference between histograms.

Code
df=pd.DataFrame([
        np.histogram(y17,5,(0,50))[0] / y17.count() * 100,
        np.histogram(y18,5,(0,50))[0] / y18.count() * 100,
        np.histogram(y19,5,(0,50))[0] / y19.count() * 100
        ], index=['2017','2018','2019'], columns=['0-10','10-20','20-30','30-40','40-50'])
Code
city['Distance bins'] =  pd.cut(city['Distance'], bins=[0,10,20,30,40,50])
Code
df1 = city[['Season', 'Distance bins','Distance']].groupby(by=['Season','Distance bins'],).agg('count').unstack()
Code
df1.columns = ['(0, 10]','(10, 20]','(20, 30]','(30, 40]','(40, 50]']
Code
ax = df1.plot(kind='bar', stacked=True,width=.85, figsize=(4,4))
ax.set(ylabel='# of shots')
ax.legend(bbox_to_anchor=(1, 1.025))

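# pandas draws a stacked bar chart column by column, so ax.patches holds all
# seasons for the first distance bin, then all seasons for the second, etc.
n_seasons = len(df1)
for i, p in enumerate(ax.patches):
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    pct = height/df1.iloc[i % n_seasons].sum()  # share within this bar's own season
    if pct > .04:
        ax.annotate('{:.0%}'.format(pct), (x+.27, y + height/2 -5 ), color='white', fontsize=10,)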

Code
pd.DataFrame([(df1.loc['2017/18']/df1.loc['2017/18'].sum()).T,
(df1.loc['2018/19']/df1.loc['2018/19'].sum()).T,
(df1.loc['2019/20']/df1.loc['2019/20'].sum()).T])
          (0, 10]  (10, 20]  (20, 30]  (30, 40]  (40, 50]
2017/18  0.281065  0.375740  0.269231  0.071006  0.002959
2018/19  0.321739  0.385507  0.240580  0.049275  0.002899
2019/20  0.313830  0.430851  0.252660  0.002660  0.000000
Code
df1
         (0, 10]  (10, 20]  (20, 30]  (30, 40]  (40, 50]
Season
2017/18       95       127        91        24         1
2018/19      111       133        83        17         1
2019/20      118       162        95         1         0