Clustering players

In this tutorial we show how to cluster players to positions using their statistics. We will use a dataset provided by Ronan Manning with statistics for everty top 5 leagues player, which is available here. This tutorial is inspired by John Muller’s article on The Athletic Introducing The Athletics 18 player roles.

#importing necessary libraries
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
from mplsoccer import Pitch
import os
import pathlib
import json

pd.options.mode.chained_assignment = None
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
np.random.seed(4)

Preparing the dataset

We open the data and change minutes played to integers. Then, we keep ony players who played more than 500 minutes. As the next step, we convert all data to numeric format and replace NaN with 0. Last but not least, we keep only columns with player statistics.

#open data
data = pd.read_csv("data.csv")
#change minutes to numerics
data["Min"] = data["Min"].apply(lambda x: x.replace(",", "")).astype(int)

data = data.loc[data["Min"] > 500]
data = data.reset_index(drop = True)
data = data._convert(numeric=True)
data = data.fillna(0)
X = data.iloc[:, 11:]

Dimensionality reduction

Since we have over 100 features, it is important to reduce the dimensionality. First standard scaling is applied to data. Then, we use UMAP algorithm, as in John Muller’s article, to reduce the dimensions to 2. As the next step we plot the data with formation label available in the dataset.

from umap import UMAP
from sklearn.preprocessing import StandardScaler
#scale data
scaler = StandardScaler()
X = scaler.fit_transform(X)
#dim reduction
reducer = UMAP(random_state = 2213)
comps = reducer.fit_transform(X)

#plotting
fig, ax = plt.subplots(figsize = (16,12))
#map position to color
colors = {"GK": "brown", "DF": "blue", "DF,MF": "aqua", "MF": "green", "MF,FW": "yellow", "FW,MF": "orange", "FW": "red", 'DF,FW': "purple", 'FW,DF': "pink"}
color_list = data.Pos.map(colors).to_list()

#plot it
for i in range(X.shape[0]):
    ax.scatter(comps[i,0], comps[i,1], c = color_list[i], s = 10)
ax.set_xlabel('U1')
ax.set_ylabel('U2')
#make legend
markers = [plt.Line2D([0,0],[0,0],color=color, marker='o', linestyle='') for color in colors.values()]
ax.legend(markers, colors.keys(), numpoints=1)
plt.show()
plot ClusteringPlayers

Clustering

From the plot we claim that there are 6 clusters. We use hierarchical clustering (Ward method) to predict which points belong to which cluster. Then, we plot the result of the clustering.

from sklearn.cluster import AgglomerativeClustering
#declare object
scan = AgglomerativeClustering(n_clusters=6)
#make predictions
labels = scan.fit_predict(comps)
fig, ax = plt.subplots(figsize = (16,12))
ax.scatter(comps[:, 0], comps[:, 1], c=labels, s=10, cmap='gist_rainbow');
plt.show()
plot ClusteringPlayers

Labelling clusters

To label the clusters, we saved predcitins together with player names. Looking for many characteristic players, we label them in 6 formations - goalkeepers, centre-backs, full-backs, defensive midfielders, attacking midfielders and strikers.

#explore the dataframe
see = data.iloc[:, 0:5]
see["label"] = labels
print(see.sample(frac = 1).head(10))

#plot predictions with a legend
fig, ax = plt.subplots(figsize = (16,12))
scatter = ax.scatter(comps[:, 0], comps[:, 1], c=labels, s=10, cmap='gist_rainbow');
handles = scatter.legend_elements()[0]
#order the labels for hierarchy
myorder = [3,0,1,2,4,5]
handles = [handles[i] for i in myorder]
#add legend
ax.legend(handles=handles, labels = ["GK", "CB", "FB", "DM", "AM", "ST"])
plt.show()
plot ClusteringPlayers
                   Player  Nation    Pos       Squad           Comp  label
1215        Kevin Volland  de GER  FW,MF      Monaco     fr Ligue 1      5
1197  Stephan El Shaarawy  it ITA  FW,MF        Roma     it Serie A      1
719                 Chema  es ESP  DF,MF      Getafe     es La Liga      0
1437       Marc Cucurella  es ESP     MF      Getafe     es La Liga      1
404          Pablo Maffeo  es ESP     DF      Huesca     es La Liga      4
1710  Aurélien Tchouaméni  fr FRA     MF      Monaco     fr Ligue 1      2
1878          Rade Krunić  ba BIH  MF,FW       Milan     it Serie A      1
994          Manu Vallejo  es ESP     FW    Valencia     es La Liga      5
767       Ludovic Ajorque  re REU     FW  Strasbourg     fr Ligue 1      5
356          Lukas Kübler  de GER     DF    Freiburg  de Bundesliga      4

A different approach to the same problem - dimensionality reduction

We can also approach this problem using different techniques. First, we reduce the dimensions using a pipeline of PCA and tSNE. We also plot it the same way we did it for UMAP.

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.pipeline import Pipeline
#declare dim reduction objects
pca = PCA()
tsne = TSNE(random_state = 3454)
#declare pipeline
ts = Pipeline([    ('pca', pca),
    ('tsne', tsne)
])
#reduce dimensions
comps = ts.fit_transform(X)
#plot it
fig, ax = plt.subplots(figsize = (16,12))
for i in range(X.shape[0]):
    ax.scatter(comps[i,0], comps[i,1], c = color_list[i], s = 10)
ax.set_xlabel('t1')
ax.set_ylabel('t2')
markers = [plt.Line2D([0,0],[0,0],color=color, marker='o', linestyle='') for color in colors.values()]
ax.legend(markers, colors.keys(), numpoints=1)
plt.show()
plot ClusteringPlayers

A different approach to the same problem - clustering

This time we cluster the data using Gaussian Mixture Model clustering. It is the same model as in Muller’s article. We also plot the predictions on the 2 dimensional space.

from sklearn.mixture import GaussianMixture
#declare object
gmm = GaussianMixture(n_components=6, random_state=5).fit(comps)
#make predictions
labels = gmm.predict(comps)
fig, ax = plt.subplots(figsize = (16,12))
plt.scatter(comps[:, 0], comps[:, 1], c=labels, s=10, cmap='gist_rainbow');
plt.show()
plot ClusteringPlayers

A different approach to the same problem - labelling

Here, we repeat steps we did previously to label the clusters. First, we investigate the dataframe with predictions, then, we plot our clusters with labels.

see = data.iloc[:, 0:5]
see["label"] = labels
print(see.sample(frac = 1).head(10))

#explore the dataframe
fig, ax = plt.subplots(figsize = (16,12))
scatter = ax.scatter(comps[:, 0], comps[:, 1], c=labels, s=10, cmap='gist_rainbow');
handles = scatter.legend_elements()[0]
#order the labels for hierarchy
myorder = [4,2,5,1,0,3]
handles = [handles[i] for i in myorder]
ax.legend(handles=handles, labels = ["GK", "CB", "FB", "DM", "AM", "ST"])
plt.show()
plot ClusteringPlayers
                  Player   Nation  ...                Comp label
674   Kyle Walker-Peters  eng ENG  ...  eng Premier League     5
1522        Jens Jønsson   dk DEN  ...          es La Liga     1
1770    Riccardo Improta   it ITA  ...          it Serie A     0
253      Darnell Furlong  eng ENG  ...  eng Premier League     5
1627   Exequiel Palacios   ar ARG  ...       de Bundesliga     1
1618          Mark Noble  eng ENG  ...  eng Premier League     1
378      Clément Lenglet   fr FRA  ...          es La Liga     2
1241     Andrea Consigli   it ITA  ...          it Serie A     4
1394        Fran Beltrán   es ESP  ...          es La Liga     1
1447      Kerem Demirbay   de GER  ...       de Bundesliga     0

[10 rows x 6 columns]

Total running time of the script: ( 1 minutes 0.232 seconds)

Gallery generated by Sphinx-Gallery