.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "gallery/plot_t_test.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_gallery_plot_t_test.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_gallery_plot_t_test.py:


Using t-tests
=========================================
In this tutorial we demonstrate how to check if values are significantly different 
from each other using t-tests

.. GENERATED FROM PYTHON SOURCE LINES 7-20

.. code-block:: default


    import pandas as pd
    import numpy as np
    import json
    # plotting
    import matplotlib.pyplot as plt
    #opening data
    import os
    import pathlib
    import warnings  
    pd.options.mode.chained_assignment = None
    warnings.filterwarnings('ignore')


.. GENERATED FROM PYTHON SOURCE LINES 21-27

Opening the dataset
----------------------------

First we open the data. For this example we will use WyScout data from 2017/18 Premier League season.  To meet file size requirements of 
Github, we have to open it from different files,
but you can open the file locally from the directory you saved it in. Also, we open the file containing all teams in WyScout database.

.. GENERATED FROM PYTHON SOURCE LINES 27-45

.. code-block:: default


    #open events
    train = pd.DataFrame()
    for i in range(13):
        file_name = 'events_England_' + str(i+1) + '.json'
        path = os.path.join(str(pathlib.Path().resolve()), 'data', 'Wyscout', file_name)
        with open(path) as f:
            data = json.load(f)
        train = pd.concat([train, pd.DataFrame(data)])
    
    #open team data
    path = os.path.join(str(pathlib.Path().resolve()),"data", 'Wyscout', 'teams.json')
    with open(path) as f:
        teams = json.load(f)

    teams_df = pd.DataFrame(teams)
    teams_df = teams_df.rename(columns={"wyId": "teamId"})
    

.. GENERATED FROM PYTHON SOURCE LINES 46-51

Preparing the dataset
----------------------------

First, we take out corners. Then, we sum them by team. We also merge it together with team dataframe to keep their names.
Then we repeat the same, but calculate corners taken by each team per game. 

.. GENERATED FROM PYTHON SOURCE LINES 51-65

.. code-block:: default

    
    #get corners
    corners = train.loc[train["subEventName"] == "Corner"]
    #count corners by team
    corners_by_team = corners.groupby(['teamId']).size().reset_index(name='counts')
    #merge with team name
    summary = corners_by_team.merge(teams_df[["name", "teamId"]], how = "left", on = ["teamId"])
    #count corners by team by game
    corners_by_game = corners.groupby(['teamId', "matchId"]).size().reset_index(name='counts')
    #merge with team name
    summary2 = corners_by_game.merge(teams_df[["name", "teamId"]], how = "left", on = ["teamId"])


.. GENERATED FROM PYTHON SOURCE LINES 66-73

One-sample one-sided t-test
----------------------------

Imagine that it is established that teams tyoically get 6 corners in a match in
football. City are an attacking team and we might think that they get more corners than
this. Let's start by plotting a distribution of City's corners.


.. GENERATED FROM PYTHON SOURCE LINES 73-101

.. code-block:: default


    team_name= 'Manchester City'

    city_corners = summary2.loc[summary2["name"] == 'Manchester City']["counts"]

    def FormatFigure(ax):
        ax.legend(loc='upper left')
        ax.set_ylim(0,0.25) 
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.set_ylabel('')
        ax.set_xlabel('Corners')
        ax.set_ylabel('Proportion of games')
        ax.set_xticks(np.arange(0,21,step=1)) 

 
    fig,ax1=plt.subplots(1,1)
    ax1.hist(city_corners, np.arange(0.01,20.5,1), color='lightblue', edgecolor = 'white',linestyle='-',alpha=0.5, label=team_name, density=True,align='right')
    FormatFigure(ax1)


    mean = city_corners.mean()
    std = city_corners.std()

    print('City typically had %.2f  plus/minus %.2f corners per match in the 2017/18 season.'%(mean,std))


.. image-sg:: /gallery/images/sphx_glr_plot_t_test_001.png
   :alt: plot t test
   :srcset: /gallery/images/sphx_glr_plot_t_test_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    City typically had 7.50  plus/minus 3.28 corners per match in the 2017/18 season.


.. GENERATED FROM PYTHON SOURCE LINES 102-105

We use can use a one-sided t-test to check if Manchester City took
more corners than we might expect due to normal variation in the number
of corners we tend to see.  We set the significance level at 0.05.

.. GENERATED FROM PYTHON SOURCE LINES 106-117

.. code-block:: default


    from scipy.stats import ttest_1samp
    t, pvalue = ttest_1samp(city_corners,popmean=6)

    print("The t-staistic is %.2f and the P-value is %.2f."%(t,pvalue))
    if pvalue < 0.05:
        print("We reject null hypothesis - " + team_name + " typically take more than 6 corners per match.")
    else:
        print("We cannot reject null hypothesis - " + team_name + " do not typically take more than 6 corners per match.")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    The t-staistic is 2.82 and the P-value is 0.01.
    We reject null hypothesis - Manchester City typically take more than 6 corners per match.


.. GENERATED FROM PYTHON SOURCE LINES 118-120

At this significance level, there's a reason to reject the null hypothesis. 
It is reasonable to say that City take more corners than what is considered normal for a typical team (i.e. 6).

.. GENERATED FROM PYTHON SOURCE LINES 125-130

Two-sample two-sided t-test
----------------------------

Here we compare Liverpool and Everton in terms of corners per match. 


.. GENERATED FROM PYTHON SOURCE LINES 130-148

.. code-block:: default


    liverpool_corners = summary2.loc[summary2["name"] == 'Liverpool']["counts"]
    everton_corners = summary2.loc[summary2["name"] == 'Everton']["counts"]

    mean = liverpool_corners.mean()
    std = liverpool_corners.std()
    print('Liverpool typically had %.2f plus/minus %.2f corners per match in the 2017/18 season.'%(mean,std))
    std_error=std/np.sqrt(len(liverpool_corners))
    print('The standard error in the number of corners per match is %.4f'%std_error)

    mean = everton_corners.mean()
    std = everton_corners.std()
    print('Everton typically had %.2f plus/minus %.2f corners per match in the 2017/18 season.'%(mean,std))
    std_error=std/np.sqrt(len(everton_corners))
    print('The standard error in the number of corners per match is %.4f'%std_error)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Liverpool typically had 6.08 plus/minus 3.06 corners per match in the 2017/18 season.
    The standard error in the number of corners per match is 0.4966
    Everton typically had 4.17 plus/minus 2.66 corners per match in the 2017/18 season.
    The standard error in the number of corners per match is 0.4428


.. GENERATED FROM PYTHON SOURCE LINES 149-150

Now let's plot the corners as a histogram.

.. GENERATED FROM PYTHON SOURCE LINES 150-156

.. code-block:: default


    fig,ax=plt.subplots(1,1)
    ax.hist(liverpool_corners, np.arange(0.01,15.5,1), color='red', edgecolor = 'white',linestyle='-',alpha=1.0, label="Liverpool", density=True,align='right')
    ax.hist(everton_corners, np.arange(0.01,15.5,1), alpha=0.25, color='blue', edgecolor = 'black', label='Everton',  density=True,align='right')
    FormatFigure(ax)


.. image-sg:: /gallery/images/sphx_glr_plot_t_test_002.png
   :alt: plot t test
   :srcset: /gallery/images/sphx_glr_plot_t_test_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 157-171

.. code-block:: default


    # Here we test if Liverpool had a different average corners per game than Everton. 
    # We set the significance level at 0.05.


    from scipy.stats import ttest_ind
    t, pvalue  = ttest_ind(a=liverpool_corners, b=everton_corners, equal_var=True)

    print("The t-staistic is %.2f and the P-value is %.2f."%(t,pvalue))
    if pvalue < 0.05:
        print("We reject null hypothesis - Liverpool took different number of corners per game than Everton")
    else:
        print("We cannot reject the null hypothesis that Liverpool took the same number of corners per game as Everton")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    The t-staistic is 2.86 and the P-value is 0.01.
    We reject null hypothesis - Liverpool took different number of corners per game than Everton


.. GENERATED FROM PYTHON SOURCE LINES 172-174

The t-statistic (roughly) measures how many standard errors 
the two means are from each other. 


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  8.356 seconds)


.. _sphx_glr_download_gallery_plot_t_test.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_t_test.py <plot_t_test.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_t_test.ipynb <plot_t_test.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_