Using Statsbomb
Getting familiar with Statsbomb data
#importing the Sbopen class from mplsoccer to open the data
from mplsoccer import Sbopen
# The first thing we have to do is open the data. We use the Sbopen parser available in mplsoccer.
parser = Sbopen()
Competition data
Using the competition method of the parser, we can explore the available competitions to find the one we are interested in. The most important columns for us are competition_id and season_id. The first is the key of a competition in the Statsbomb database; the second is the key of a season of that competition (for example, WC 2018 has a different season_id than WC 2014, but the same competition_id).
#opening data using competition method
df_competition = parser.competition()
#structure of data
df_competition.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 competition_id 71 non-null int64
1 season_id 71 non-null int64
2 country_name 71 non-null object
3 competition_name 71 non-null object
4 competition_gender 71 non-null object
5 competition_youth 71 non-null bool
6 competition_international 71 non-null bool
7 season_name 71 non-null object
8 match_updated 71 non-null object
9 match_updated_360 54 non-null object
10 match_available_360 8 non-null object
11 match_available 71 non-null object
dtypes: bool(2), int64(2), object(8)
memory usage: 5.8+ KB
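To look up the two keys described above, one option is to filter the competition dataframe by name. The following is a minimal sketch, not part of the original script: the helper name find_competitions is hypothetical, and the small demo frame uses made-up ids and names so that the example runs without downloading anything. Only the column names are taken from the df_competition.info() output above.

```python
import pandas as pd


def find_competitions(df_competition: pd.DataFrame, name_part: str) -> pd.DataFrame:
    """Return the key columns of competitions whose name contains name_part."""
    # Case-insensitive plain-text match on the competition name.
    mask = df_competition["competition_name"].str.contains(
        name_part, case=False, regex=False
    )
    cols = ["competition_id", "season_id", "competition_name", "season_name"]
    return df_competition.loc[mask, cols].reset_index(drop=True)


# Stand-in data with made-up ids (NOT real Statsbomb values).
# In the tutorial you would pass the real df_competition instead.
demo_competitions = pd.DataFrame({
    "competition_id": [1, 1, 2],
    "season_id": [10, 11, 20],
    "competition_name": ["Example Cup", "Example Cup", "Example League"],
    "season_name": ["2014", "2018", "2018/2019"],
})

cups = find_competitions(demo_competitions, "cup")
print(cups)
```

With the real data, a call such as find_competitions(df_competition, "World Cup") should list one row per season, each sharing a competition_id but carrying its own season_id.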
Match data
Using the match method of the parser, we can explore the matches of a competition to find the match we are interested in. To open it we need to know the competition_id and season_id. For the Women's World Cup, competition_id is 72 and season_id is 30. In this dataframe the most important information for us is in match_id, home_team_id and home_team_name, and correspondingly away_team_id and away_team_name.
#opening data using match method
df_match = parser.match(competition_id=72, season_id=30)
#structure of data
df_match.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 52 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 match_id 52 non-null int64
1 match_date 52 non-null datetime64[ns]
2 kick_off 52 non-null datetime64[ns]
3 home_score 52 non-null int64
4 away_score 52 non-null int64
5 match_status 52 non-null object
6 match_status_360 52 non-null object
7 last_updated 52 non-null datetime64[ns]
8 last_updated_360 52 non-null datetime64[ns]
9 match_week 52 non-null int64
10 competition_id 52 non-null int64
11 country_name 52 non-null object
12 competition_name 52 non-null object
13 season_id 52 non-null int64
14 season_name 52 non-null object
15 home_team_id 52 non-null int64
16 home_team_name 52 non-null object
17 home_team_gender 52 non-null object
18 home_team_group 48 non-null object
19 home_team_country_id 52 non-null int64
20 home_team_country_name 52 non-null object
21 home_team_managers_id 52 non-null int64
22 home_team_managers_name 52 non-null object
23 home_team_managers_nickname 52 non-null object
24 home_team_managers_dob 52 non-null datetime64[ns]
25 home_team_managers_country_id 52 non-null int64
26 home_team_managers_country_name 52 non-null object
27 away_team_id 52 non-null int64
28 away_team_name 52 non-null object
29 away_team_gender 52 non-null object
30 away_team_group 48 non-null object
31 away_team_country_id 52 non-null int64
32 away_team_country_name 52 non-null object
33 away_team_managers_id 52 non-null int64
34 away_team_managers_name 52 non-null object
35 away_team_managers_nickname 52 non-null object
36 away_team_managers_dob 52 non-null datetime64[ns]
37 away_team_managers_country_id 52 non-null int64
38 away_team_managers_country_name 52 non-null object
39 metadata_data_version 52 non-null object
40 metadata_shot_fidelity_version 52 non-null object
41 metadata_xy_fidelity_version 52 non-null object
42 competition_stage_id 52 non-null int64
43 competition_stage_name 52 non-null object
44 stadium_id 52 non-null int64
45 stadium_name 52 non-null object
46 stadium_country_id 52 non-null int64
47 stadium_country_name 52 non-null object
48 referee_id 36 non-null float64
49 referee_name 36 non-null object
50 referee_country_id 36 non-null float64
51 referee_country_name 36 non-null object
dtypes: datetime64[ns](6), float64(2), int64(17), object(27)
memory usage: 21.2+ KB
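Once the match dataframe is loaded, the match_id of a particular game can be found by filtering on the team name columns. The sketch below is an illustration only: find_match_ids is a hypothetical helper, and the demo frame contains invented team names and ids so that it runs offline. With the real df_match, the names passed in must match the values stored in home_team_name and away_team_name exactly.

```python
import pandas as pd


def find_match_ids(df_match: pd.DataFrame, team_a: str, team_b: str) -> list:
    """Return match_ids of games between two teams, whoever played at home."""
    home = df_match["home_team_name"]
    away = df_match["away_team_name"]
    mask = ((home == team_a) & (away == team_b)) | (
        (home == team_b) & (away == team_a)
    )
    return df_match.loc[mask, "match_id"].tolist()


# Stand-in data with made-up ids and team names (NOT real Statsbomb values).
demo_matches = pd.DataFrame({
    "match_id": [101, 102, 103],
    "home_team_name": ["Team A", "Team C", "Team B"],
    "away_team_name": ["Team B", "Team A", "Team A"],
})

match_ids = find_match_ids(demo_matches, "Team A", "Team B")
print(match_ids)
```

Checking both home and away orderings avoids missing a game just because the team of interest was listed as the away side.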
Lineup data
To check the lineups we use the lineup method. We do it for the England vs Sweden WWC 2019 game, whose match_id is 69301; you can check that in df_match. In this dataframe you will find all players who played in this game, their teams and jersey numbers. The code below is commented out because of a change in the data format.
#opening data using lineup method
#df_lineup = parser.lineup(69301)
#structure of data
#df_lineup.info()
Event data
The Statsbomb data that we will use the most during the course is event data. Knowing the match_id of a game, you can open all the events that occurred on the pitch. The event dataframe contains the events with additional information; this is the dataframe we will use most. The tactics dataframe provides information about player positions on the pitch. The ‘related’ dataframe provides information on events that were related to each other, for example a pass and the pressure applied to it. df_freeze consists of freeze frames with player positions at the moment of shots. We will learn more about tracking data later in the course. Below, an example of event data is presented.
#opening data
df_event, df_related, df_freeze, df_tactics = parser.event(69301)
#if you want only event data you can use
#df_event = parser.event(69301)[0]
#structure of data
df_event.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3289 entries, 0 to 3288
Data columns (total 73 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 3289 non-null object
1 index 3289 non-null int64
2 period 3289 non-null int64
3 timestamp 3289 non-null object
4 minute 3289 non-null int64
5 second 3289 non-null int64
6 possession 3289 non-null int64
7 duration 2457 non-null float64
8 match_id 3289 non-null int64
9 type_id 3289 non-null int64
10 type_name 3289 non-null object
11 possession_team_id 3289 non-null int64
12 possession_team_name 3289 non-null object
13 play_pattern_id 3289 non-null int64
14 play_pattern_name 3289 non-null object
15 team_id 3289 non-null int64
16 team_name 3289 non-null object
17 tactics_formation 4 non-null object
18 player_id 3277 non-null float64
19 player_name 3277 non-null object
20 position_id 3277 non-null float64
21 position_name 3277 non-null object
22 pass_recipient_id 834 non-null float64
23 pass_recipient_name 834 non-null object
24 pass_length 921 non-null float64
25 pass_angle 921 non-null float64
26 pass_height_id 921 non-null float64
27 pass_height_name 921 non-null object
28 end_x 1713 non-null float64
29 end_y 1713 non-null float64
30 body_part_id 939 non-null float64
31 body_part_name 939 non-null object
32 sub_type_id 318 non-null float64
33 sub_type_name 318 non-null object
34 x 3264 non-null float64
35 y 3264 non-null float64
36 under_pressure 640 non-null float64
37 outcome_id 503 non-null float64
38 outcome_name 503 non-null object
39 out 31 non-null float64
40 counterpress 86 non-null float64
41 pass_deflected 1 non-null object
42 pass_switch 23 non-null object
43 technique_id 37 non-null float64
44 technique_name 37 non-null object
45 pass_cross 33 non-null object
46 off_camera 25 non-null float64
47 shot_statsbomb_xg 19 non-null float64
48 end_z 15 non-null float64
49 shot_first_time 5 non-null object
50 goalkeeper_position_id 19 non-null float64
51 goalkeeper_position_name 19 non-null object
52 ball_recovery_recovery_failure 10 non-null object
53 pass_assisted_shot_id 10 non-null object
54 pass_shot_assist 8 non-null object
55 shot_key_pass_id 10 non-null object
56 foul_won_defensive 5 non-null object
57 aerial_won 30 non-null object
58 pass_goal_assist 2 non-null object
59 substitution_replacement_id 6 non-null float64
60 substitution_replacement_name 6 non-null object
61 foul_committed_offensive 2 non-null object
62 shot_one_on_one 1 non-null object
63 dribble_overrun 1 non-null object
64 block_deflection 2 non-null object
65 bad_behaviour_card_id 1 non-null float64
66 bad_behaviour_card_name 1 non-null object
67 pass_no_touch 1 non-null object
68 block_save_block 1 non-null object
69 foul_committed_advantage 1 non-null object
70 foul_won_advantage 1 non-null object
71 foul_committed_card_id 1 non-null float64
72 foul_committed_card_name 1 non-null object
dtypes: float64(25), int64(10), object(38)
memory usage: 1.8+ MB
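A common first step with the event dataframe is to count events by type and to pull out one kind of event for one team. The following is a minimal sketch under stated assumptions: passes_by_team is a hypothetical helper, and the demo frame is invented stand-in data that only reuses column names from the df_event.info() output above.

```python
import pandas as pd


def passes_by_team(df_event: pd.DataFrame, team_name: str) -> pd.DataFrame:
    """Return start and end locations of all passes made by one team."""
    mask = (df_event["type_name"] == "Pass") & (df_event["team_name"] == team_name)
    cols = ["player_name", "x", "y", "end_x", "end_y"]
    return df_event.loc[mask, cols].reset_index(drop=True)


# Stand-in data with made-up players and coordinates (NOT real Statsbomb values).
demo_events = pd.DataFrame({
    "type_name": ["Pass", "Pass", "Shot", "Pass", "Pressure"],
    "team_name": ["Team A", "Team B", "Team A", "Team A", "Team B"],
    "player_name": ["P1", "P7", "P2", "P3", "P8"],
    "x": [10.0, 60.0, 100.0, 40.0, 55.0],
    "y": [20.0, 30.0, 40.0, 50.0, 35.0],
    "end_x": [30.0, 70.0, 120.0, 45.0, None],
    "end_y": [25.0, 10.0, 40.0, 60.0, None],
})

# How many events of each type were recorded.
type_counts = demo_events["type_name"].value_counts()
print(type_counts)

# All passes made by one team.
team_a_passes = passes_by_team(demo_events, "Team A")
print(team_a_passes)
```

With the real data you would pass df_event and a team name as it appears in the team_name column.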
360 data
Statsbomb offers 360 data, which track not only the location of an event but also the players’ locations. To open them we need the id of a game. Later, we will also need the id of the event. In df_frame we find the players’ positions (we only learn whether each player is a teammate, not the full player information), and df_visible gives the part of the pitch that was tracked during an event.
df_frame, df_visible = parser.frame(3788741)
# exploring the data
df_frame.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45737 entries, 0 to 45736
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 teammate 45737 non-null bool
1 actor 45737 non-null bool
2 keeper 45737 non-null bool
3 match_id 45737 non-null int64
4 id 45737 non-null object
5 x 45737 non-null float64
6 y 45737 non-null float64
dtypes: bool(3), float64(2), int64(1), object(1)
memory usage: 1.5+ MB
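To inspect the players recorded around a single event, the frame rows can be filtered by event id and split on the teammate flag. This sketch assumes that the id column of df_frame holds the id of the event each row belongs to; players_around_event is a hypothetical helper, and the demo frame is invented stand-in data using the column names from the df_frame.info() output above.

```python
import pandas as pd


def players_around_event(df_frame: pd.DataFrame, event_id: str):
    """Split the frame rows of one event into teammates and opponents."""
    # Assumption: the "id" column identifies the event the row belongs to.
    frame = df_frame.loc[df_frame["id"] == event_id]
    teammates = frame.loc[frame["teammate"]]
    opponents = frame.loc[~frame["teammate"]]
    return teammates, opponents


# Stand-in data with made-up event ids and coordinates (NOT real Statsbomb values).
demo_frame = pd.DataFrame({
    "teammate": [True, False, True, False],
    "actor": [True, False, False, False],
    "keeper": [False, True, False, False],
    "match_id": [1, 1, 1, 1],
    "id": ["event-1", "event-1", "event-1", "event-2"],
    "x": [50.0, 110.0, 62.0, 30.0],
    "y": [40.0, 40.0, 22.0, 15.0],
})

teammates, opponents = players_around_event(demo_frame, "event-1")
print(len(teammates), "teammates and", len(opponents), "opponents recorded")
```

Under that assumption, the same id could also be used to look up the corresponding row in df_event.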
Before you start
Run these lines in Spyder or a Jupyter notebook and explore the dataframes to get more familiar with them before you start working on the course.