Using Statsbomb

Getting familiar with Statsbomb data

# Import the Sbopen class from mplsoccer to open the data
from mplsoccer import Sbopen
# The first thing we have to do is create the parser. Sbopen is the parser available in mplsoccer.
parser = Sbopen()

Competition data

The competition method of the parser lists the available competitions, so we can find the one we are interested in. The most important columns for us are competition_id and season_id. The first is the key of a competition in the Statsbomb database; the second is the key of one season of that competition. For example, WC 2018 and WC 2014 have different season_id values but the same competition_id.

#opening data using competition method
df_competition = parser.competition()
#structure of data
df_competition.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   competition_id             71 non-null     int64
 1   season_id                  71 non-null     int64
 2   country_name               71 non-null     object
 3   competition_name           71 non-null     object
 4   competition_gender         71 non-null     object
 5   competition_youth          71 non-null     bool
 6   competition_international  71 non-null     bool
 7   season_name                71 non-null     object
 8   match_updated              71 non-null     object
 9   match_updated_360          54 non-null     object
 10  match_available_360        8 non-null      object
 11  match_available            71 non-null     object
dtypes: bool(2), int64(2), object(8)
memory usage: 5.8+ KB
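
As a minimal sketch of how the two ids could be looked up, the snippet below filters a small stand-in dataframe by competition name. The helper find_competition is hypothetical, and the stand-in rows are placeholders, apart from the 72 / 30 pair for the Women's World Cup, which is the one used in this tutorial. With real data you would pass the dataframe returned by parser.competition() instead.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by parser.competition().
# Only the 72 / 30 pair for the Women's World Cup is taken from the tutorial;
# the other rows are placeholders.
df_competition = pd.DataFrame({
    "competition_id": [72, 900, 900],
    "season_id": [30, 1, 2],
    "competition_name": ["Women's World Cup", "Example League", "Example League"],
    "season_name": ["2019", "2020/2021", "2021/2022"],
})


def find_competition(df, name):
    """Return the id and name columns of every season whose competition name contains `name`."""
    mask = df["competition_name"].str.contains(name, case=False, regex=False)
    columns = ["competition_id", "season_id", "competition_name", "season_name"]
    return df.loc[mask, columns]


wwc = find_competition(df_competition, "women's world cup")
print(wwc)
```

The search is case-insensitive and matches substrings, so a short fragment of the competition name is enough.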

Match data

The match method of the parser lists the matches of a competition, so we can find the match we are interested in. To call it we need to know the competition_id and the season_id. For the Women's World Cup, competition_id is 72 and season_id is 30. The most important information in this dataframe is provided in match_id, home_team_id and home_team_name, and correspondingly away_team_id and away_team_name.

#opening data using match method
df_match = parser.match(competition_id=72, season_id=30)
#structure of data
df_match.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 52 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   match_id                         52 non-null     int64
 1   match_date                       52 non-null     datetime64[ns]
 2   kick_off                         52 non-null     datetime64[ns]
 3   home_score                       52 non-null     int64
 4   away_score                       52 non-null     int64
 5   match_status                     52 non-null     object
 6   match_status_360                 52 non-null     object
 7   last_updated                     52 non-null     datetime64[ns]
 8   last_updated_360                 52 non-null     datetime64[ns]
 9   match_week                       52 non-null     int64
 10  competition_id                   52 non-null     int64
 11  country_name                     52 non-null     object
 12  competition_name                 52 non-null     object
 13  season_id                        52 non-null     int64
 14  season_name                      52 non-null     object
 15  home_team_id                     52 non-null     int64
 16  home_team_name                   52 non-null     object
 17  home_team_gender                 52 non-null     object
 18  home_team_group                  48 non-null     object
 19  home_team_country_id             52 non-null     int64
 20  home_team_country_name           52 non-null     object
 21  home_team_managers_id            52 non-null     int64
 22  home_team_managers_name          52 non-null     object
 23  home_team_managers_nickname      52 non-null     object
 24  home_team_managers_dob           52 non-null     datetime64[ns]
 25  home_team_managers_country_id    52 non-null     int64
 26  home_team_managers_country_name  52 non-null     object
 27  away_team_id                     52 non-null     int64
 28  away_team_name                   52 non-null     object
 29  away_team_gender                 52 non-null     object
 30  away_team_group                  48 non-null     object
 31  away_team_country_id             52 non-null     int64
 32  away_team_country_name           52 non-null     object
 33  away_team_managers_id            52 non-null     int64
 34  away_team_managers_name          52 non-null     object
 35  away_team_managers_nickname      52 non-null     object
 36  away_team_managers_dob           52 non-null     datetime64[ns]
 37  away_team_managers_country_id    52 non-null     int64
 38  away_team_managers_country_name  52 non-null     object
 39  metadata_data_version            52 non-null     object
 40  metadata_shot_fidelity_version   52 non-null     object
 41  metadata_xy_fidelity_version     52 non-null     object
 42  competition_stage_id             52 non-null     int64
 43  competition_stage_name           52 non-null     object
 44  stadium_id                       52 non-null     int64
 45  stadium_name                     52 non-null     object
 46  stadium_country_id               52 non-null     int64
 47  stadium_country_name             52 non-null     object
 48  referee_id                       36 non-null     float64
 49  referee_name                     36 non-null     object
 50  referee_country_id               36 non-null     float64
 51  referee_country_name             36 non-null     object
dtypes: datetime64[ns](6), float64(2), int64(17), object(27)
memory usage: 21.2+ KB
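
One way to find a match_id without scrolling through the dataframe is to filter on the two team-name columns. The sketch below assumes a hypothetical helper, find_match_ids, and a stand-in dataframe. The id 69301 for England against Sweden is the one used later in this tutorial, while the second row, the exact spelling of the team names and the home/away assignment are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for parser.match(competition_id=72, season_id=30).
# Only the 69301 id for England vs Sweden is taken from the tutorial.
df_match = pd.DataFrame({
    "match_id": [69301, 1],
    "home_team_name": ["England Women's", "Team A"],
    "away_team_name": ["Sweden Women's", "Team B"],
})


def find_match_ids(df, team_1, team_2):
    """Return the match_ids where the two teams met, regardless of who played at home."""
    home = df["home_team_name"]
    away = df["away_team_name"]
    direct = home.str.contains(team_1, regex=False) & away.str.contains(team_2, regex=False)
    swapped = home.str.contains(team_2, regex=False) & away.str.contains(team_1, regex=False)
    return df.loc[direct | swapped, "match_id"].tolist()


match_ids = find_match_ids(df_match, "Sweden", "England")
print(match_ids)
```

Checking both orderings means we do not need to know in advance which team was listed as the home side.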

Lineup data

To check the lineups we use the lineup method. We do it for the England vs Sweden game at WWC 2019, whose match_id is 69301 (you can check that in df_match). In this dataframe you will find all players who played in this game, their teams and jersey numbers. The code below is commented out because of a change in the data format.

#opening data using lineup method
#df_lineup = parser.lineup(69301)
#structure of data
#df_lineup.info()

Event data

The Statsbomb data that we will use the most during the course is event data. Knowing the id of a game (match_id), you can open all the events that occurred on the pitch. The event dataframe contains the events with additional information, and it is the dataframe we will use most. The tactics dataframe provides information about player position on the pitch. The ‘related’ dataframe provides information on events that were related to each other, for example a pass and the pressure applied to it. df_freeze consists of freeze frames with player positions at the moment of each shot. We will learn more about tracking data later in the course. Below, an example of event data is presented.

#opening data
df_event, df_related, df_freeze, df_tactics = parser.event(69301)
#if you want only event data you can use
#df_event = parser.event(69301)[0]
#structure of data
df_event.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3289 entries, 0 to 3288
Data columns (total 73 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   id                              3289 non-null   object
 1   index                           3289 non-null   int64
 2   period                          3289 non-null   int64
 3   timestamp                       3289 non-null   object
 4   minute                          3289 non-null   int64
 5   second                          3289 non-null   int64
 6   possession                      3289 non-null   int64
 7   duration                        2457 non-null   float64
 8   match_id                        3289 non-null   int64
 9   type_id                         3289 non-null   int64
 10  type_name                       3289 non-null   object
 11  possession_team_id              3289 non-null   int64
 12  possession_team_name            3289 non-null   object
 13  play_pattern_id                 3289 non-null   int64
 14  play_pattern_name               3289 non-null   object
 15  team_id                         3289 non-null   int64
 16  team_name                       3289 non-null   object
 17  tactics_formation               4 non-null      object
 18  player_id                       3277 non-null   float64
 19  player_name                     3277 non-null   object
 20  position_id                     3277 non-null   float64
 21  position_name                   3277 non-null   object
 22  pass_recipient_id               834 non-null    float64
 23  pass_recipient_name             834 non-null    object
 24  pass_length                     921 non-null    float64
 25  pass_angle                      921 non-null    float64
 26  pass_height_id                  921 non-null    float64
 27  pass_height_name                921 non-null    object
 28  end_x                           1713 non-null   float64
 29  end_y                           1713 non-null   float64
 30  body_part_id                    939 non-null    float64
 31  body_part_name                  939 non-null    object
 32  sub_type_id                     318 non-null    float64
 33  sub_type_name                   318 non-null    object
 34  x                               3264 non-null   float64
 35  y                               3264 non-null   float64
 36  under_pressure                  640 non-null    float64
 37  outcome_id                      503 non-null    float64
 38  outcome_name                    503 non-null    object
 39  out                             31 non-null     float64
 40  counterpress                    86 non-null     float64
 41  pass_deflected                  1 non-null      object
 42  pass_switch                     23 non-null     object
 43  technique_id                    37 non-null     float64
 44  technique_name                  37 non-null     object
 45  pass_cross                      33 non-null     object
 46  off_camera                      25 non-null     float64
 47  shot_statsbomb_xg               19 non-null     float64
 48  end_z                           15 non-null     float64
 49  shot_first_time                 5 non-null      object
 50  goalkeeper_position_id          19 non-null     float64
 51  goalkeeper_position_name        19 non-null     object
 52  ball_recovery_recovery_failure  10 non-null     object
 53  pass_assisted_shot_id           10 non-null     object
 54  pass_shot_assist                8 non-null      object
 55  shot_key_pass_id                10 non-null     object
 56  foul_won_defensive              5 non-null      object
 57  aerial_won                      30 non-null     object
 58  pass_goal_assist                2 non-null      object
 59  substitution_replacement_id     6 non-null      float64
 60  substitution_replacement_name   6 non-null      object
 61  foul_committed_offensive        2 non-null      object
 62  shot_one_on_one                 1 non-null      object
 63  dribble_overrun                 1 non-null      object
 64  block_deflection                2 non-null      object
 65  bad_behaviour_card_id           1 non-null      float64
 66  bad_behaviour_card_name         1 non-null      object
 67  pass_no_touch                   1 non-null      object
 68  block_save_block                1 non-null      object
 69  foul_committed_advantage        1 non-null      object
 70  foul_won_advantage              1 non-null      object
 71  foul_committed_card_id          1 non-null      float64
 72  foul_committed_card_name        1 non-null      object
dtypes: float64(25), int64(10), object(38)
memory usage: 1.8+ MB
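
Most analyses start by selecting one type of event from df_event. The sketch below keeps only the passes and counts them per team. It runs on a small stand-in dataframe with placeholder values and only a few of the 73 columns, and it assumes that passes are labelled "Pass" in type_name.

```python
import pandas as pd

# Hypothetical stand-in for df_event with a handful of its columns.
# All values are placeholders, not real match data.
df_event = pd.DataFrame({
    "type_name": ["Pass", "Pass", "Shot", "Pass", "Pressure"],
    "team_name": ["Team A", "Team B", "Team A", "Team A", "Team B"],
    "x": [60.0, 35.5, 108.0, 72.3, 40.0],
    "y": [40.0, 20.0, 38.5, 55.1, 22.0],
})

# Keep only the passes and the columns we want to look at.
passes = df_event.loc[df_event["type_name"] == "Pass", ["team_name", "x", "y"]]

# Count the passes of each team.
passes_per_team = passes.groupby("team_name").size().to_dict()
print(passes_per_team)
```

The same pattern works for any other value of type_name, for example shots.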

360 data

Statsbomb offers 360 data, which track not only the location of an event but also the locations of the players. To open them we need the id of a game. Later, we will also need the id of the event. In df_frame we find the players’ positions, but not full information about each player: we only know whether a player is a teammate. df_visible describes which part of the pitch was tracked during an event.

df_frame, df_visible = parser.frame(3788741)

# exploring the data
df_frame.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45737 entries, 0 to 45736
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   teammate  45737 non-null  bool
 1   actor     45737 non-null  bool
 2   keeper    45737 non-null  bool
 3   match_id  45737 non-null  int64
 4   id        45737 non-null  object
 5   x         45737 non-null  float64
 6   y         45737 non-null  float64
dtypes: bool(3), float64(2), int64(1), object(1)
memory usage: 1.5+ MB
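
To get a feeling for the frame data, we can count how many teammates and opponents were tracked for each event. This is a minimal sketch on a stand-in dataframe with placeholder values. It assumes that the id column of df_frame holds the id of the event a row belongs to, and the opponent column is a helper added here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_frame: two events with a few tracked players each.
df_frame = pd.DataFrame({
    "id": ["event-1", "event-1", "event-1", "event-2", "event-2"],
    "teammate": [True, True, False, True, False],
    "actor": [True, False, False, True, False],
    "keeper": [False, False, True, False, False],
    "x": [60.0, 65.2, 110.0, 30.0, 33.1],
    "y": [40.0, 30.5, 41.0, 20.0, 25.4],
})

# Helper column (not part of the original data): True for players of the other team.
df_frame["opponent"] = ~df_frame["teammate"]

# Summing booleans counts the True values, so this gives tracked players per event.
# The actor is included in the teammate count because teammate is True on that row.
summary = df_frame.groupby("id")[["teammate", "opponent"]].sum().astype(int)
print(summary)
```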

Before you start

Run these lines in Spyder or a Jupyter notebook and explore the dataframes to get more familiar with the data before you start working on the course.
