.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "gallery/lesson2/plot_LinearRegression.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_gallery_lesson2_plot_LinearRegression.py:


Linear regression
=================

We are going to look at the relationship between age and minutes played.
Start by watching the video:

.. youtube:: TnOqoeVPnXE
   :width: 640
   :height: 349

Work through the code either while watching or afterwards.

.. GENERATED FROM PYTHON SOURCE LINES 14-21

.. code-block:: default


    #importing necessary libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf


.. GENERATED FROM PYTHON SOURCE LINES 22-29

Opening data
----------------------------
In this example we use data downloaded from `FBref `_ on players in La Liga.
We use only the age and minutes played columns, and take only the first
20 observations to help visualise the process. Download
`playerstats.csv `_ to your working directory.

.. GENERATED FROM PYTHON SOURCE LINES 29-40

.. code-block:: default


    num_obs=20
    laliga_df=pd.read_csv("playerstats.csv",delimiter=',')
    minutes_model = pd.DataFrame()
    minutes_model = minutes_model.assign(minutes=laliga_df['Min'][0:num_obs])
    minutes_model = minutes_model.assign(age=laliga_df['Age'][0:num_obs])

    # Make an age squared column so we can fit a polynomial model.
    minutes_model = minutes_model.assign(age_squared=np.power(laliga_df['Age'][0:num_obs],2))


.. GENERATED FROM PYTHON SOURCE LINES 41-44

Plotting the data
----------------------------
Start by plotting the data.

.. GENERATED FROM PYTHON SOURCE LINES 44-57

.. code-block:: default


    fig,ax=plt.subplots(num=1)
    ax.plot(minutes_model['age'], minutes_model['minutes'], linestyle='none', marker= '.', markersize= 10, color='blue')
    ax.set_ylabel('Minutes played')
    ax.set_xlabel('Age')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    plt.xlim((15,40))
    plt.ylim((0,3000))
    plt.show()


.. image-sg:: /gallery/lesson2/images/sphx_glr_plot_LinearRegression_001.png
   :alt: plot LinearRegression
   :srcset: /gallery/lesson2/images/sphx_glr_plot_LinearRegression_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 58-66

Fitting the model
----------------------------
We begin by fitting a straight-line linear regression

 .. math::

    y = b_0 + b_1 x

This models a straight-line relationship between minutes played and age.

.. GENERATED FROM PYTHON SOURCE LINES 66-71

.. code-block:: default


    model_fit=smf.ols(formula='minutes ~ age ', data=minutes_model).fit()
    print(model_fit.summary())
    b=model_fit.params


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                minutes   R-squared:                       0.231
    Model:                            OLS   Adj. R-squared:                  0.189
    Method:                 Least Squares   F-statistic:                     5.415
    Date:                Thu, 20 Nov 2025   Prob (F-statistic):             0.0318
    Time:                        06:56:50   Log-Likelihood:                -163.24
    No. Observations:                  20   AIC:                             330.5
    Df Residuals:                      18   BIC:                             332.5
    Df Model:                           1
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept  -1293.0147   1152.158     -1.122      0.277   -3713.609    1127.580
    age          102.5404     44.065      2.327      0.032       9.963     195.118
    ==============================================================================
    Omnibus:                        0.142   Durbin-Watson:                   2.001
    Prob(Omnibus):                  0.931   Jarque-Bera (JB):                0.316
    Skew:                          -0.153   Prob(JB):                        0.854
    Kurtosis:                       2.466   Cond. No.                         151.
    ==============================================================================

    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


.. GENERATED FROM PYTHON SOURCE LINES 72-80

Comparing the fit
----------------------------
We now use the fit to plot a line through the data.

 .. math::

    y = b_0 + b_1 x

where the parameters are estimated from the model fit.

.. GENERATED FROM PYTHON SOURCE LINES 80-102

.. code-block:: default


    #First plot the data as previously
    fig,ax=plt.subplots(num=1)
    ax.plot(minutes_model['age'], minutes_model['minutes'], linestyle='none', marker= '.', markersize= 10, color='blue')
    ax.set_ylabel('Minutes played')
    ax.set_xlabel('Age')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    plt.xlim((15,40))
    plt.ylim((0,3000))

    #Now create the fitted line through the data, using the estimated parameters
    x=np.arange(40,step=1)
    y= b['Intercept'] + b['age']*x
    ax.plot(x, y, color='black')

    #Show distances to the fitted line for each point
    for i,a in enumerate(minutes_model['age']):
        ax.plot([a,a],[minutes_model['minutes'][i], b['Intercept'] + b['age']*a], color='red')
    plt.show()


.. image-sg:: /gallery/lesson2/images/sphx_glr_plot_LinearRegression_002.png
   :alt: plot LinearRegression
   :srcset: /gallery/lesson2/images/sphx_glr_plot_LinearRegression_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 103-111

A model including squared terms
-------------------------------
We now fit the quadratic model

 .. math::

    y = b_0 + b_1 x + b_2 x^2

estimating the parameters from the data.

.. GENERATED FROM PYTHON SOURCE LINES 111-134

.. code-block:: default


    # First fit the model
    model_fit=smf.ols(formula='minutes ~ age + age_squared ', data=minutes_model).fit()
    print(model_fit.summary())
    b=model_fit.params

    # Compare the fit
    fig,ax=plt.subplots(num=1)
    ax.plot(minutes_model['age'], minutes_model['minutes'], linestyle='none', marker= '.', markersize= 10, color='blue')
    ax.set_ylabel('Minutes played')
    ax.set_xlabel('Age')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    plt.xlim((15,40))
    plt.ylim((0,3000))
    x=np.arange(40,step=1)
    # Look up parameters by name rather than by position
    y= b['Intercept'] + b['age']*x + b['age_squared']*x*x
    ax.plot(x, y, color='black')

    for i,a in enumerate(minutes_model['age']):
        ax.plot([a,a],[minutes_model['minutes'][i], b['Intercept'] + b['age']*a + b['age_squared']*a*a], color='red')
    plt.show()


.. image-sg:: /gallery/lesson2/images/sphx_glr_plot_LinearRegression_003.png
   :alt: plot LinearRegression
   :srcset: /gallery/lesson2/images/sphx_glr_plot_LinearRegression_003.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                minutes   R-squared:                       0.295
    Model:                            OLS   Adj. R-squared:                  0.212
    Method:                 Least Squares   F-statistic:                     3.559
    Date:                Thu, 20 Nov 2025   Prob (F-statistic):             0.0512
    Time:                        06:56:50   Log-Likelihood:                -162.38
    No. Observations:                  20   AIC:                             330.8
    Df Residuals:                      17   BIC:                             333.7
    Df Model:                           2
    Covariance Type:            nonrobust
    ===============================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
    -------------------------------------------------------------------------------
    Intercept   -8063.5823   5573.188     -1.447      0.166   -1.98e+04    3694.817
    age           634.7722    431.113      1.472      0.159    -274.796    1544.340
    age_squared   -10.1432      8.174     -1.241      0.232     -27.389       7.103
    ==============================================================================
    Omnibus:                        1.800   Durbin-Watson:                   2.164
    Prob(Omnibus):                  0.407   Jarque-Bera (JB):                1.054
    Skew:                          -0.190   Prob(JB):                        0.590
    Kurtosis:                       1.942   Cond. No.                     2.06e+04
    ==============================================================================

    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    [2] The condition number is large, 2.06e+04. This might indicate that there are
    strong multicollinearity or other numerical problems.


.. GENERATED FROM PYTHON SOURCE LINES 135-142

Now try with all data points
----------------------------
1) Refit the model with all data points
2) Try adding a cubic term
3) Think about how well the model works. What are the limitations?


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.128 seconds)


.. _sphx_glr_download_gallery_lesson2_plot_LinearRegression.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_LinearRegression.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_LinearRegression.ipynb `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_
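
For the exercise under "Now try with all data points", one way to judge whether an extra polynomial term helps is to compare R-squared with adjusted R-squared as the degree grows. The sketch below is an illustration only and is not part of the original lesson: it uses synthetic data rather than ``playerstats.csv``, ``numpy.polyfit`` rather than ``statsmodels``, and a hypothetical helper named ``fit_polynomial``.

.. code-block:: python

    import numpy as np


    def fit_polynomial(x, y, degree):
        """Least-squares polynomial fit (hypothetical helper).

        Returns the coefficients (highest power first), R-squared and
        adjusted R-squared.
        """
        coeffs = np.polyfit(x, y, degree)
        fitted = np.polyval(coeffs, x)
        ss_res = np.sum((y - fitted) ** 2)        # residual sum of squares
        ss_tot = np.sum((y - np.mean(y)) ** 2)    # total sum of squares
        r2 = 1 - ss_res / ss_tot
        n = len(y)
        # Adjusted R-squared penalises each extra predictor (here: degree).
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - degree - 1)
        return coeffs, r2, adj_r2


    # Synthetic stand-in for the age and minutes columns (an assumption,
    # not real FBref data): minutes peak in the late twenties, plus noise.
    rng = np.random.default_rng(0)
    age = rng.uniform(18, 36, size=200)
    noise = rng.normal(0, 300, size=200)
    minutes = np.clip(-8000 + 630 * age - 11 * age**2 + noise, 0, None)

    # Fit straight-line, quadratic and cubic models to the same data.
    results = {d: fit_polynomial(age, minutes, d) for d in (1, 2, 3)}
    for d, (_, r2, adj_r2) in results.items():
        print(f"degree {d}: R-squared = {r2:.3f}, adjusted R-squared = {adj_r2:.3f}")

R-squared can never decrease when a term is added to a least-squares model, so on its own it cannot tell you whether the cubic term is worthwhile. Adjusted R-squared, which the statsmodels summary also reports, penalises each extra parameter and is a more useful guide here.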