March Madness 2022
Import Libraries and Data into the Notebook
import pandas as pd
import numpy as np
column_names = ["Team", "Conference", "Games Played","Games Won",
"Adjusted Offensive Efficiency", "Adjusted Defensive Efficiency",
"Power Ranking", "Effective Field Goal %","Effective Field Goal % (D)",
"Turnover %", "Turnover % (D)", "Offensive Rebounds", "Defensive Rebounds",
"Free Throw Rate", "Free Throw Rate (D)", "2-PT%", "2-PT% (D)",
"3-PT%", "3-PT (D)", "Adjusted Tempo", "Wins above Bubble", "Postseason","Seed in Tournament", "Year"]
data = pd.read_csv("testdata.csv", header = None, names = column_names, skiprows = 1)
data
| | Team | Conference | Games Played | Games Won | Adjusted Offensive Efficiency | Adjusted Defensive Efficiency | Power Ranking | Effective Field Goal % | Effective Field Goal % (D) | Turnover % | ... | Free Throw Rate (D) | 2-PT% | 2-PT% (D) | 3-PT% | 3-PT (D) | Adjusted Tempo | Wins above Bubble | Postseason | Seed in Tournament | Year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Indiana | B10 | 36 | 29 | 121.0 | 89.7 | 0.9692 | 54.7 | 44.0 | 19.3 | ... | 27.0 | 52.0 | 43.2 | 40.3 | 30.4 | 67.8 | 7.8 | S16 | 1 | NaN |
1 | Gonzaga | WCC | 34 | 31 | 118.9 | 90.2 | 0.9599 | 54.9 | 44.9 | 17.2 | ... | 29.9 | 55.0 | 42.1 | 36.5 | 32.9 | 65.1 | 7.6 | R32 | 1 | NaN |
2 | Kansas | B12 | 37 | 31 | 111.6 | 86.2 | 0.9514 | 53.3 | 41.5 | 20.3 | ... | 32.0 | 52.9 | 39.3 | 36.4 | 30.3 | 67.7 | 7.5 | S16 | 1 | NaN |
3 | Louisville | BE | 40 | 35 | 115.9 | 84.5 | 0.9743 | 50.6 | 44.8 | 18.3 | ... | 34.9 | 50.8 | 43.4 | 33.3 | 31.8 | 67.1 | 9.0 | Champions | 1 | NaN |
4 | Georgetown | BE | 32 | 25 | 107.6 | 85.0 | 0.9381 | 51.1 | 43.0 | 20.1 | ... | 35.3 | 50.2 | 41.4 | 35.3 | 30.7 | 62.5 | 6.6 | R64 | 2 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
403 | Creighton | BE | 28 | 20 | 114.4 | 94.3 | 0.9025 | 55.7 | 46.9 | 15.8 | ... | 25.7 | 56.3 | 46.1 | 36.7 | 32.1 | 69.1 | 3.5 | S16 | 5 | NaN |
404 | Oregon | P12 | 26 | 20 | 113.1 | 98.2 | 0.8350 | 54.4 | 50.1 | 16.8 | ... | 27.1 | 52.9 | 49.9 | 37.9 | 33.6 | 67.2 | 3.0 | S16 | 7 | NaN |
405 | Loyola Chicago | MVC | 26 | 24 | 108.5 | 88.3 | 0.9136 | 56.3 | 46.7 | 18.5 | ... | 22.4 | 58.0 | 45.5 | 35.7 | 32.5 | 63.9 | 1.7 | S16 | 8 | NaN |
406 | Syracuse | ACC | 25 | 16 | 112.8 | 97.6 | 0.8402 | 50.7 | 48.5 | 16.0 | ... | 25.1 | 50.9 | 49.3 | 33.7 | 31.6 | 69.4 | 0.5 | S16 | 11 | NaN |
407 | Oral Roberts | Sum | 23 | 16 | 107.0 | 107.1 | 0.4981 | 53.6 | 50.4 | 15.7 | ... | 33.1 | 49.7 | 49.0 | 38.8 | 35.6 | 71.4 | -5.1 | S16 | 15 | NaN |
408 rows × 24 columns
data = data.drop(columns =["Year"])
data = data[data["Postseason"] != 'R68']
data
| | Team | Conference | Games Played | Games Won | Adjusted Offensive Efficiency | Adjusted Defensive Efficiency | Power Ranking | Effective Field Goal % | Effective Field Goal % (D) | Turnover % | ... | Free Throw Rate | Free Throw Rate (D) | 2-PT% | 2-PT% (D) | 3-PT% | 3-PT (D) | Adjusted Tempo | Wins above Bubble | Postseason | Seed in Tournament |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Indiana | B10 | 36 | 29 | 121.0 | 89.7 | 0.9692 | 54.7 | 44.0 | 19.3 | ... | 45.8 | 27.0 | 52.0 | 43.2 | 40.3 | 30.4 | 67.8 | 7.8 | S16 | 1 |
1 | Gonzaga | WCC | 34 | 31 | 118.9 | 90.2 | 0.9599 | 54.9 | 44.9 | 17.2 | ... | 40.8 | 29.9 | 55.0 | 42.1 | 36.5 | 32.9 | 65.1 | 7.6 | R32 | 1 |
2 | Kansas | B12 | 37 | 31 | 111.6 | 86.2 | 0.9514 | 53.3 | 41.5 | 20.3 | ... | 39.5 | 32.0 | 52.9 | 39.3 | 36.4 | 30.3 | 67.7 | 7.5 | S16 | 1 |
3 | Louisville | BE | 40 | 35 | 115.9 | 84.5 | 0.9743 | 50.6 | 44.8 | 18.3 | ... | 40.0 | 34.9 | 50.8 | 43.4 | 33.3 | 31.8 | 67.1 | 9.0 | Champions | 1 |
4 | Georgetown | BE | 32 | 25 | 107.6 | 85.0 | 0.9381 | 51.1 | 43.0 | 20.1 | ... | 36.8 | 35.3 | 50.2 | 41.4 | 35.3 | 30.7 | 62.5 | 6.6 | R64 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
403 | Creighton | BE | 28 | 20 | 114.4 | 94.3 | 0.9025 | 55.7 | 46.9 | 15.8 | ... | 26.9 | 25.7 | 56.3 | 46.1 | 36.7 | 32.1 | 69.1 | 3.5 | S16 | 5 |
404 | Oregon | P12 | 26 | 20 | 113.1 | 98.2 | 0.8350 | 54.4 | 50.1 | 16.8 | ... | 26.8 | 27.1 | 52.9 | 49.9 | 37.9 | 33.6 | 67.2 | 3.0 | S16 | 7 |
405 | Loyola Chicago | MVC | 26 | 24 | 108.5 | 88.3 | 0.9136 | 56.3 | 46.7 | 18.5 | ... | 30.7 | 22.4 | 58.0 | 45.5 | 35.7 | 32.5 | 63.9 | 1.7 | S16 | 8 |
406 | Syracuse | ACC | 25 | 16 | 112.8 | 97.6 | 0.8402 | 50.7 | 48.5 | 16.0 | ... | 28.1 | 25.1 | 50.9 | 49.3 | 33.7 | 31.6 | 69.4 | 0.5 | S16 | 11 |
407 | Oral Roberts | Sum | 23 | 16 | 107.0 | 107.1 | 0.4981 | 53.6 | 50.4 | 15.7 | ... | 27.8 | 33.1 | 49.7 | 49.0 | 38.8 | 35.6 | 71.4 | -5.1 | S16 | 15 |
388 rows × 23 columns
data = data.replace({
'Champions': 6,
'2ND': 5,
'F4': 4,
'E8': 3,
'S16': 2,
'R32': 1,
'R64': 0
})
data
| | Team | Conference | Games Played | Games Won | Adjusted Offensive Efficiency | Adjusted Defensive Efficiency | Power Ranking | Effective Field Goal % | Effective Field Goal % (D) | Turnover % | ... | Free Throw Rate | Free Throw Rate (D) | 2-PT% | 2-PT% (D) | 3-PT% | 3-PT (D) | Adjusted Tempo | Wins above Bubble | Postseason | Seed in Tournament |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Indiana | B10 | 36 | 29 | 121.0 | 89.7 | 0.9692 | 54.7 | 44.0 | 19.3 | ... | 45.8 | 27.0 | 52.0 | 43.2 | 40.3 | 30.4 | 67.8 | 7.8 | 2 | 1 |
1 | Gonzaga | WCC | 34 | 31 | 118.9 | 90.2 | 0.9599 | 54.9 | 44.9 | 17.2 | ... | 40.8 | 29.9 | 55.0 | 42.1 | 36.5 | 32.9 | 65.1 | 7.6 | 1 | 1 |
2 | Kansas | B12 | 37 | 31 | 111.6 | 86.2 | 0.9514 | 53.3 | 41.5 | 20.3 | ... | 39.5 | 32.0 | 52.9 | 39.3 | 36.4 | 30.3 | 67.7 | 7.5 | 2 | 1 |
3 | Louisville | BE | 40 | 35 | 115.9 | 84.5 | 0.9743 | 50.6 | 44.8 | 18.3 | ... | 40.0 | 34.9 | 50.8 | 43.4 | 33.3 | 31.8 | 67.1 | 9.0 | 6 | 1 |
4 | Georgetown | BE | 32 | 25 | 107.6 | 85.0 | 0.9381 | 51.1 | 43.0 | 20.1 | ... | 36.8 | 35.3 | 50.2 | 41.4 | 35.3 | 30.7 | 62.5 | 6.6 | 0 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
403 | Creighton | BE | 28 | 20 | 114.4 | 94.3 | 0.9025 | 55.7 | 46.9 | 15.8 | ... | 26.9 | 25.7 | 56.3 | 46.1 | 36.7 | 32.1 | 69.1 | 3.5 | 2 | 5 |
404 | Oregon | P12 | 26 | 20 | 113.1 | 98.2 | 0.8350 | 54.4 | 50.1 | 16.8 | ... | 26.8 | 27.1 | 52.9 | 49.9 | 37.9 | 33.6 | 67.2 | 3.0 | 2 | 7 |
405 | Loyola Chicago | MVC | 26 | 24 | 108.5 | 88.3 | 0.9136 | 56.3 | 46.7 | 18.5 | ... | 30.7 | 22.4 | 58.0 | 45.5 | 35.7 | 32.5 | 63.9 | 1.7 | 2 | 8 |
406 | Syracuse | ACC | 25 | 16 | 112.8 | 97.6 | 0.8402 | 50.7 | 48.5 | 16.0 | ... | 28.1 | 25.1 | 50.9 | 49.3 | 33.7 | 31.6 | 69.4 | 0.5 | 2 | 11 |
407 | Oral Roberts | Sum | 23 | 16 | 107.0 | 107.1 | 0.4981 | 53.6 | 50.4 | 15.7 | ... | 27.8 | 33.1 | 49.7 | 49.0 | 38.8 | 35.6 | 71.4 | -5.1 | 2 | 15 |
388 rows × 23 columns
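The `replace` call above scans every column of the frame for the listed labels. As a minimal sketch of a column-scoped alternative, the same mapping can be applied to `Postseason` alone with `Series.map`. The three-row `toy` frame and the name `ROUND_TO_WINS` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Same mapping as above: tournament finish -> number of wins.
ROUND_TO_WINS = {
    "Champions": 6, "2ND": 5, "F4": 4, "E8": 3,
    "S16": 2, "R32": 1, "R64": 0,
}

# Hypothetical three-row frame standing in for the real data.
toy = pd.DataFrame({
    "Team": ["A", "B", "C"],
    "Postseason": ["Champions", "R64", "S16"],
})

# Series.map touches only the Postseason column; any label missing
# from the dictionary becomes NaN, which makes typos easy to spot.
toy["Postseason"] = toy["Postseason"].map(ROUND_TO_WINS)
print(toy)
```

Unlike `replace`, `map` does not leave unknown labels untouched, so an unexpected value in the column shows up as a missing entry instead of silently staying a string.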
y = data["Postseason"]
data = data.drop(columns = ["Postseason", "Team","Conference","Games Played","Games Won"]) # Drop columns we don't need
data.head()
| | Adjusted Offensive Efficiency | Adjusted Defensive Efficiency | Power Ranking | Effective Field Goal % | Effective Field Goal % (D) | Turnover % | Turnover % (D) | Offensive Rebounds | Defensive Rebounds | Free Throw Rate | Free Throw Rate (D) | 2-PT% | 2-PT% (D) | 3-PT% | 3-PT (D) | Adjusted Tempo | Wins above Bubble | Seed in Tournament |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 121.0 | 89.7 | 0.9692 | 54.7 | 44.0 | 19.3 | 20.9 | 39.0 | 31.4 | 45.8 | 27.0 | 52.0 | 43.2 | 40.3 | 30.4 | 67.8 | 7.8 | 1 |
1 | 118.9 | 90.2 | 0.9599 | 54.9 | 44.9 | 17.2 | 20.8 | 37.8 | 29.8 | 40.8 | 29.9 | 55.0 | 42.1 | 36.5 | 32.9 | 65.1 | 7.6 | 1 |
2 | 111.6 | 86.2 | 0.9514 | 53.3 | 41.5 | 20.3 | 18.4 | 33.8 | 29.3 | 39.5 | 32.0 | 52.9 | 39.3 | 36.4 | 30.3 | 67.7 | 7.5 | 1 |
3 | 115.9 | 84.5 | 0.9743 | 50.6 | 44.8 | 18.3 | 27.0 | 38.2 | 33.3 | 40.0 | 34.9 | 50.8 | 43.4 | 33.3 | 31.8 | 67.1 | 9.0 | 1 |
4 | 107.6 | 85.0 | 0.9381 | 51.1 | 43.0 | 20.1 | 22.4 | 30.4 | 31.0 | 36.8 | 35.3 | 50.2 | 41.4 | 35.3 | 30.7 | 62.5 | 6.6 | 2 |
Standardize
data = ( data - data.mean())/data.std()
X = data
X
| | Adjusted Offensive Efficiency | Adjusted Defensive Efficiency | Power Ranking | Effective Field Goal % | Effective Field Goal % (D) | Turnover % | Turnover % (D) | Offensive Rebounds | Defensive Rebounds | Free Throw Rate | Free Throw Rate (D) | 2-PT% | 2-PT% (D) | 3-PT% | 3-PT (D) | Adjusted Tempo | Wins above Bubble | Seed in Tournament |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.524498 | -1.276534 | 1.046617 | 0.872923 | -1.613747 | 0.928357 | 0.745842 | 1.804037 | 0.959349 | 1.725474 | -1.010380 | 0.172274 | -1.263146 | 1.774169 | -1.250704 | -0.010877 | 1.343025 | -1.632740 |
1 | 1.186605 | -1.176837 | 0.985872 | 0.946052 | -1.220337 | -0.181259 | 0.703565 | 1.521991 | 0.426549 | 0.856488 | -0.499170 | 1.170414 | -1.656514 | 0.265353 | -0.033504 | -0.880728 | 1.298588 | -1.632740 |
2 | 0.012026 | -1.974414 | 0.930352 | 0.361024 | -2.706553 | 1.456745 | -0.311085 | 0.581840 | 0.260049 | 0.630552 | -0.128984 | 0.471716 | -2.657813 | 0.225647 | -1.299392 | -0.043094 | 1.276369 | -1.632740 |
3 | 0.703901 | -2.313384 | 1.079929 | -0.626209 | -1.264049 | 0.399968 | 3.324746 | 1.616007 | 1.592048 | 0.717451 | 0.382226 | -0.226983 | -1.191624 | -1.005229 | -0.569072 | -0.236394 | 1.609649 | -1.632740 |
4 | -0.631579 | -2.213687 | 0.843480 | -0.443388 | -2.050869 | 1.351067 | 1.379999 | -0.217289 | 0.826149 | 0.161300 | 0.452738 | -0.426611 | -1.906839 | -0.211116 | -1.104640 | -1.718362 | 1.076401 | -1.416675 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
403 | 0.462550 | -0.359321 | 0.610950 | 1.238565 | -0.346093 | -0.921003 | -0.564748 | -1.533501 | -0.106251 | -1.559292 | -1.239544 | 1.602942 | -0.226085 | 0.344764 | -0.423008 | 0.407940 | 0.387624 | -0.768479 |
404 | 0.253378 | 0.418317 | 0.170058 | 0.763231 | 1.052699 | -0.392614 | 0.492180 | -0.710868 | -0.439251 | -1.576671 | -0.992753 | 0.471716 | 1.132822 | 0.821232 | 0.307312 | -0.204177 | 0.276530 | -0.336349 |
405 | -0.486768 | -1.555686 | 0.683452 | 1.457950 | -0.433517 | 0.505646 | 1.041782 | -1.251456 | -2.503850 | -0.898863 | -1.821265 | 2.168555 | -0.440649 | -0.052293 | -0.228256 | -1.267328 | -0.012312 | -0.120284 |
406 | 0.205108 | 0.298680 | 0.204023 | -0.589645 | 0.353303 | -0.815325 | 0.703565 | -0.287800 | 1.658648 | -1.350735 | -1.345311 | -0.193711 | 0.918258 | -0.846407 | -0.666448 | 0.504590 | -0.278936 | 0.527912 |
407 | -0.728120 | 2.192925 | -2.030485 | 0.470717 | 1.183835 | -0.973842 | -0.395640 | -1.909562 | 1.492148 | -1.402874 | 0.064923 | -0.592968 | 0.810976 | 1.178584 | 1.281071 | 1.148924 | -1.523179 | 1.392172 |
388 rows × 18 columns
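The standardization above is a z-score: each column has its mean subtracted and is divided by its standard deviation. The notebook standardizes every dataset with that dataset's own statistics. A common alternative, sketched here on made-up numbers, is to compute the mean and standard deviation once on the training features and reuse them for any later data, so all rows sit on the same scale. The frames `train` and `new` are hypothetical stand-ins and not the notebook's data.

```python
import pandas as pd

# Hypothetical toy frames standing in for training and held-out features.
train = pd.DataFrame({"Adjusted Tempo": [60.0, 65.0, 70.0]})
new = pd.DataFrame({"Adjusted Tempo": [65.0, 75.0]})

# Column statistics come from the training frame only.
# DataFrame.std() defaults to the sample standard deviation (ddof=1).
mu = train.mean()
sigma = train.std()

train_z = (train - mu) / sigma   # same formula as the cell above
new_z = (new - mu) / sigma       # new rows placed on the training scale

print(train_z)
print(new_z)
```

Whether to reuse training statistics or standardize each season separately is a modeling choice; per-season standardization, as used in this notebook, expresses every team relative to that year's field.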
Machine Learning Algorithm
from sklearn.ensemble import RandomForestRegressor
randTree = RandomForestRegressor(min_samples_split=20, random_state=5)
randTree.fit(X, y)
RandomForestRegressor(min_samples_split=20, random_state=5)
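One way to get a rough quality estimate from a random forest without a separate test set is the out-of-bag (OOB) score, which scikit-learn computes when `oob_score=True` is passed. The sketch below uses synthetic data from `make_regression` as a stand-in for `X` and `y`; the names `X_demo`, `y_demo`, and `forest` are illustrative, and this is an optional check that the notebook itself does not run.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for (X, y); the notebook's real data is not used here.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=0.1, random_state=0
)

# Same hyperparameters as above, plus out-of-bag scoring.
forest = RandomForestRegressor(
    min_samples_split=20, oob_score=True, random_state=5
)
forest.fit(X_demo, y_demo)

# R squared measured on the rows each tree did not see in its bootstrap sample.
print(forest.oob_score_)
```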
We can now test this trained model on data it has not seen.
- Let's test it on data from the 2019 tournament!
column_names = ["Team", "Conference", "Games Played","Games Won",
"Adjusted Offensive Efficiency", "Adjusted Defensive Efficiency",
"Power Ranking", "Effective Field Goal %","Effective Field Goal % (D)",
"Turnover %", "Turnover % (D)", "Offensive Rebounds", "Defensive Rebounds",
"Free Throw Rate", "Free Throw Rate (D)", "2-PT%", "2-PT% (D)",
"3-PT%", "3-PT (D)", "Adjusted Tempo", "Wins above Bubble","Postseason","Seed in Tournament"]
test_data_2019 = pd.read_csv("cbb19.csv", header = None, names = column_names, skiprows = 1)
test_data_2019
| | Team | Conference | Games Played | Games Won | Adjusted Offensive Efficiency | Adjusted Defensive Efficiency | Power Ranking | Effective Field Goal % | Effective Field Goal % (D) | Turnover % | ... | Free Throw Rate | Free Throw Rate (D) | 2-PT% | 2-PT% (D) | 3-PT% | 3-PT (D) | Adjusted Tempo | Wins above Bubble | Postseason | Seed in Tournament |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Gonzaga | WCC | 37 | 33 | 123.4 | 89.9 | 0.9744 | 59.0 | 44.2 | 14.9 | ... | 35.3 | 25.9 | 61.4 | 43.4 | 36.3 | 30.4 | 72.0 | 7.0 | E8 | 1.0 |
1 | Virginia | ACC | 38 | 35 | 123.0 | 89.9 | 0.9736 | 55.2 | 44.7 | 14.7 | ... | 29.1 | 26.3 | 52.5 | 45.7 | 39.5 | 28.9 | 60.7 | 11.1 | Champions | 1.0 |
2 | Duke | ACC | 38 | 32 | 118.9 | 89.2 | 0.9646 | 53.6 | 45.0 | 17.5 | ... | 33.2 | 24.0 | 58.0 | 45.0 | 30.8 | 29.9 | 73.6 | 11.2 | E8 | 1.0 |
3 | North Carolina | ACC | 36 | 29 | 120.1 | 91.4 | 0.9582 | 52.9 | 48.9 | 17.2 | ... | 30.2 | 28.4 | 52.1 | 47.9 | 36.2 | 33.5 | 76.0 | 10.0 | S16 | 1.0 |
4 | Michigan | B10 | 37 | 30 | 114.6 | 85.6 | 0.9665 | 51.6 | 44.1 | 13.9 | ... | 27.5 | 24.1 | 51.8 | 44.3 | 34.2 | 29.1 | 65.9 | 9.2 | S16 | 2.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
348 | Alcorn St. | SWAC | 27 | 10 | 89.0 | 112.6 | 0.0628 | 45.7 | 52.7 | 24.1 | ... | 30.5 | 36.5 | 45.0 | 55.3 | 31.3 | 32.1 | 67.1 | -16.7 | NaN | NaN |
349 | New Hampshire | AE | 27 | 5 | 83.7 | 106.1 | 0.0613 | 44.0 | 51.5 | 18.4 | ... | 21.9 | 38.0 | 39.4 | 52.1 | 32.6 | 33.6 | 67.1 | -20.2 | NaN | NaN |
350 | Chicago St. | WAC | 30 | 3 | 88.5 | 117.3 | 0.0380 | 44.2 | 57.8 | 22.5 | ... | 33.1 | 33.9 | 43.5 | 57.9 | 30.7 | 38.5 | 71.9 | -20.9 | NaN | NaN |
351 | Delaware St. | MEAC | 29 | 6 | 84.3 | 112.2 | 0.0358 | 40.0 | 52.4 | 19.0 | ... | 25.5 | 39.2 | 37.7 | 52.6 | 29.0 | 34.7 | 71.6 | -21.7 | NaN | NaN |
352 | Maryland Eastern Shore | MEAC | 30 | 7 | 85.7 | 114.4 | 0.0346 | 43.5 | 54.4 | 20.7 | ... | 28.3 | 36.6 | 44.5 | 53.2 | 27.9 | 37.3 | 64.5 | -19.9 | NaN | NaN |
353 rows × 23 columns
Adjusting our 2019 Data
# Drop the teams that didn't make the 2019 tournament, and any team that didn't reach the round of 64
test_data_2019 = test_data_2019.dropna()
test_data_2019 = test_data_2019[test_data_2019["Postseason"] != 'R68']
# Replace the Postseason labels with the number of wins
test_data_2019 = test_data_2019.replace({
'Champions': 6,
'2ND': 5,
'F4': 4,
'E8': 3,
'S16': 2,
'R32': 1,
'R64': 0
})
# Grab the names of the teams that made the tournament
team_names = test_data_2019.get("Team")
# y: the actual number of wins
actual_outcomes = test_data_2019.get("Postseason")
# Drop the same columns we dropped from the training data
test_data_2019 = test_data_2019.drop(columns = ["Postseason", "Team","Conference","Games Played","Games Won"])
# Finally, standardize the data
test_data_2019 = (test_data_2019 - test_data_2019.mean())/test_data_2019.std()
test_data_2019
| | Adjusted Offensive Efficiency | Adjusted Defensive Efficiency | Power Ranking | Effective Field Goal % | Effective Field Goal % (D) | Turnover % | Turnover % (D) | Offensive Rebounds | Defensive Rebounds | Free Throw Rate | Free Throw Rate (D) | 2-PT% | 2-PT% (D) | 3-PT% | 3-PT (D) | Adjusted Tempo | Wins above Bubble | Seed in Tournament |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.042572 | -1.132484 | 1.073843 | 2.460166 | -1.566239 | -1.786817 | 0.010948 | 0.338659 | -0.258338 | 0.390565 | -1.264927 | 3.016431 | -1.347685 | 0.262195 | -1.161525 | 1.304614 | 0.951017 | -1.614218 |
1 | 1.972384 | -1.132484 | 1.068618 | 0.941607 | -1.361502 | -1.929228 | -0.645904 | 0.058170 | -0.736583 | -1.103902 | -1.164796 | 0.040748 | -0.543228 | 1.656936 | -1.875394 | -2.729147 | 1.772165 | -1.614218 |
2 | 1.252960 | -1.260798 | 1.009833 | 0.302213 | -1.238660 | 0.064530 | 0.186108 | 1.384121 | 0.663992 | -0.115625 | -1.740545 | 1.879653 | -0.788062 | -2.135016 | -1.399481 | 1.875766 | 1.792193 | -1.614218 |
3 | 1.463523 | -0.857526 | 0.968031 | 0.022479 | 0.358290 | -0.149087 | -0.295583 | 1.307624 | -1.624753 | -0.838755 | -0.639113 | -0.092990 | 0.226254 | 0.218609 | 0.313805 | 2.732493 | 1.551857 | -1.614218 |
4 | 0.498442 | -1.920698 | 1.022243 | -0.497028 | -1.607186 | -2.498873 | -0.426954 | -1.395277 | -0.941545 | -1.489571 | -1.715513 | -0.193294 | -1.032897 | -0.653104 | -1.780211 | -0.872903 | 1.391633 | -1.398989 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
61 | -1.905487 | 0.847215 | -1.982937 | -0.377142 | 0.726817 | -0.077881 | 1.893923 | -0.808798 | 0.322389 | 0.077209 | 1.839111 | -0.962291 | 0.890806 | 0.872394 | 0.218622 | -0.444539 | -0.971671 | 1.398989 |
62 | -1.010593 | 1.928718 | -2.121406 | 0.781758 | 1.013449 | -0.220292 | -0.339373 | -1.981755 | 1.278879 | 1.113695 | -0.839373 | 0.408529 | 1.100664 | 0.872394 | 0.408988 | 0.126613 | -1.452343 | 1.614218 |
63 | -0.905312 | 2.368651 | -2.436881 | 0.262251 | 2.282819 | -1.217172 | -2.003398 | -2.415239 | -0.565781 | -0.356669 | -1.114731 | -0.092990 | 1.870145 | 0.436537 | 1.979500 | -0.658721 | -2.173351 | 1.614218 |
64 | -1.150969 | 2.331990 | -2.649158 | 0.062441 | 1.750502 | -0.077881 | -0.208003 | -1.293281 | 1.073917 | 0.752130 | -0.614080 | 0.274791 | 1.065688 | -0.260833 | 1.789134 | 1.268917 | -2.533854 | 1.614218 |
66 | -1.273797 | 2.606948 | -3.008395 | 0.541986 | 1.627660 | 1.631054 | 0.536429 | -0.171322 | 1.927926 | 0.486983 | -0.388787 | -0.460771 | 1.415452 | 1.918449 | 1.170448 | -0.123266 | -2.313546 | 1.614218 |
64 rows × 18 columns
Now we can run our model on it!
outcomes = randTree.predict(test_data_2019)
Let's look at the MSE and R squared values.
R squared: the proportion of the variance in the dependent variable that is predictable from the independent variable(s). The best possible value is 1 (a perfect fit). On held-out data it can fall below 0 when the model does worse than always predicting the mean.
MSE: mean squared error, the average of the squared differences between the predicted and actual values (the lower, the better).
The sklearn library has MSE and R squared metrics built in, so let's first import them!
from sklearn.metrics import mean_squared_error, r2_score
print('Mean squared error:', mean_squared_error(actual_outcomes, outcomes))
print('Coefficient of determination:', r2_score(actual_outcomes, outcomes))
Mean squared error: 0.3543465603139662
Coefficient of determination: 0.8027716386674812
- A coefficient of determination of 0.80 is solid!
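To make the two metrics concrete, here is a minimal sketch that computes both by hand with NumPy and compares them with scikit-learn's built-in functions. The arrays `y_true` and `y_pred` are made-up win counts for illustration only; they are not the 2019 results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical win counts and predictions, for illustration only.
y_true = np.array([0.0, 1.0, 2.0, 6.0])
y_pred = np.array([0.5, 1.0, 1.5, 4.0])

# MSE: mean of the squared residuals.
mse = np.mean((y_true - y_pred) ** 2)

# R squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# The hand-computed values should match sklearn's.
print(mse, mean_squared_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))
```

Because the total sum of squares is measured around the mean of the actual values, a model whose residuals are larger than those of a constant mean prediction gets a negative R squared.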
Running Our Model on 2022 Data
column_names = ["Team","Adjusted Offensive Efficiency", "Adjusted Defensive Efficiency",
"Power Ranking", "Record", "Effective Field Goal %","Effective Field Goal % (D)",
"Turnover %", "Turnover % (D)", "Offensive Rebounds", "Defensive Rebounds",
"Free Throw Rate", "Free Throw Rate (D)", "2-PT%", "2-PT% (D)",
"3-PT%", "3-PT (D)", "Adjusted Tempo", "Wins above Bubble","Seed in Tournament"]
data_2022 = pd.read_csv("2022_adj.csv", header = None, names = column_names, skiprows = 1)
data_2022.head()
| | Team | Adjusted Offensive Efficiency | Adjusted Defensive Efficiency | Power Ranking | Record | Effective Field Goal % | Effective Field Goal % (D) | Turnover % | Turnover % (D) | Offensive Rebounds | Defensive Rebounds | Free Throw Rate | Free Throw Rate (D) | 2-PT% | 2-PT% (D) | 3-PT% | 3-PT (D) | Adjusted Tempo | Wins above Bubble | Seed in Tournament |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Gonzaga | 121.405 | 88.3685 | 0.974731 | 26-3 | 59.4 | 43.2 | 15.9 | 17.0 | 29.0 | 23.0 | 29.7 | 22.2 | 60.9 | 41.6 | 37.9 | 30.7 | 72.7093 | 6.980363 | 1.0 |
1 | Kansas | 119.798 | 93.1202 | 0.947700 | 28-6 | 54.1 | 46.9 | 17.8 | 18.4 | 33.4 | 28.9 | 32.8 | 27.8 | 54.5 | 47.9 | 35.5 | 30.1 | 69.0450 | 10.116105 | 1.0 |
2 | Arizona | 118.578 | 93.2390 | 0.940735 | 31-3 | 55.9 | 44.4 | 18.0 | 17.7 | 34.5 | 28.3 | 35.1 | 22.8 | 57.5 | 41.9 | 35.4 | 32.7 | 72.2560 | 9.063691 | 1.0 |
3 | Baylor | 117.053 | 91.7082 | 0.943009 | 26-6 | 52.9 | 47.8 | 18.2 | 22.9 | 36.3 | 28.4 | 28.5 | 26.9 | 53.5 | 49.5 | 34.6 | 29.9 | 67.4767 | 8.628828 | 1.0 |
4 | Duke | 120.018 | 95.1092 | 0.935540 | 28-6 | 55.6 | 47.0 | 14.9 | 16.1 | 31.9 | 28.5 | 28.6 | 18.9 | 55.8 | 46.9 | 36.8 | 31.4 | 67.6327 | 6.480068 | 2.0 |
Cleanup Data
# Drop teams that didn't make the tournament
data_2022 = data_2022.dropna()
# Grab team names
team_names = data_2022.get("Team")
# Drop columns the model was not trained on
data_2022 = data_2022.drop(columns = ["Team", "Record"])
# Standardize the data
data_2022 = (data_2022 - data_2022.mean())/data_2022.std()
data_2022.head()
| | Adjusted Offensive Efficiency | Adjusted Defensive Efficiency | Power Ranking | Effective Field Goal % | Effective Field Goal % (D) | Turnover % | Turnover % (D) | Offensive Rebounds | Defensive Rebounds | Free Throw Rate | Free Throw Rate (D) | 2-PT% | 2-PT% (D) | 3-PT% | 3-PT (D) | Adjusted Tempo | Wins above Bubble | Seed in Tournament |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.904057 | -1.591607 | 1.222213 | 2.568833 | -2.065698 | -0.808917 | -0.614141 | -0.334286 | -1.307253 | -0.535455 | -1.252683 | 2.780897 | -2.070024 | 1.088533 | -0.715641 | 2.463788 | 1.185066 | -1.648667 |
1 | 1.631395 | -0.555475 | 1.031333 | 0.657779 | -0.309891 | 0.107417 | -0.071630 | 0.702858 | 0.817306 | 0.266551 | -0.074587 | 0.743341 | 0.241243 | 0.181995 | -1.020038 | 0.921680 | 1.933918 | -1.648667 |
2 | 1.424396 | -0.529570 | 0.982149 | 1.306816 | -1.496247 | 0.203873 | -0.342886 | 0.962144 | 0.601249 | 0.861588 | -1.126458 | 1.698445 | -1.959963 | 0.144222 | 0.299016 | 2.273018 | 1.682589 | -1.648667 |
3 | 1.165647 | -0.863368 | 0.998207 | 0.225087 | 0.117198 | 0.300329 | 1.672155 | 1.386430 | 0.637259 | -0.845909 | -0.263924 | 0.424973 | 0.828232 | -0.157958 | -1.121504 | 0.261667 | 1.578739 | -1.648667 |
4 | 1.668723 | -0.121763 | 0.945464 | 1.198643 | -0.262436 | -1.291197 | -0.962898 | 0.349286 | 0.673268 | -0.820037 | -1.946918 | 1.157220 | -0.125624 | 0.673036 | -0.360511 | 0.327319 | 1.065590 | -1.431478 |
Now we can use our trained model to make predictions
outcomes = randTree.predict(data_2022)
team_names = team_names.to_numpy()
Predicting the Number of Wins
for i in range(len(team_names)):
    print(team_names[i], outcomes[i])
Gonzaga 4.298450225081188
Kansas 3.3442196063367895
Arizona 2.8776303454815193
Baylor 3.3032326512844143
Duke 3.0235155961572433
Kentucky 3.046358917236067
Villanova 2.0461495232866493
Auburn 2.2014452057013236
Purdue 2.3204202532996803
Tennessee 2.710680094457952
Wisconsin 1.1603186666452192
Texas Tech 3.085000231714144
UCLA 2.860326343327394
Illinois 1.2330048020144462
Providence 0.8866655310039284
Arkansas 1.4588779583526954
Iowa 2.5911571179344737
Houston 4.090685790210517
Connecticut 0.8193585395350703
Saint Mary's 1.4544321557302027
Alabama 1.0773849084622729
Colorado St. 0.9204878962394024
Texas 1.3948944618441117
LSU 0.5968029485256523
Ohio St. 0.5588504011052923
Michigan St. 0.40531079854572005
Murray St. 0.9958768235705858
USC 0.4046188105875423
North Carolina 0.7319186823626208
Boise St. 0.513611232777687
Seton Hall 0.543717539557223
San Diego St. 0.38283954193826675
Memphis 2.019966984733968
Marquette 0.9110003967595597
TCU 0.5333828842756806
Creighton 0.5397218035252996
Davidson 0.6462122840870752
Miami FL 0.6425294259986538
San Francisco 1.277724887272622
Loyola Chicago 0.5786369140050699
Virginia Tech 1.2454582488421115
Michigan 1.1163138835408595
Iowa St. 1.1006914444528788
Notre Dame 0.9083503259174012
Rutgers 0.24443489708614496
UAB 0.3803614474751469
Richmond 0.39421107910334025
New Mexico St. 0.10003135313117716
Wyoming 0.15089250386005165
Indiana 1.0072659826219392
South Dakota St. 0.7654512567742268
Vermont 0.37248574466916684
Chattanooga 0.458839900837815
Akron 0.04589294258373206
Colgate 0.266381915407217
Longwood 0.22944397759103632
Yale 0.014089238638599254
Montana St. 0.07187918514234304
Delaware 0.027849486247588276
Jacksonville St. 0.07756203007518797
Cal St. Fullerton 0.15801090880796761
Saint Peter's 0.11311555360903186
Norfolk St. 0.008501003344481606
Georgia St. 0.31069314716980634
Bryant 0.13281253300475268
Texas Southern 0.026032138875617138
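The list above is printed in the input file's order. A minimal sketch of ranking teams by predicted wins is shown below; it uses three made-up teams and values in place of the notebook's `team_names` and `outcomes` arrays, and the name `ranked` is illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical names and predictions standing in for team_names and outcomes.
team_names = np.array(["Team A", "Team B", "Team C"])
outcomes = np.array([0.9, 4.3, 2.1])

# Pair each team with its prediction, then sort from most to fewest wins.
ranked = (
    pd.DataFrame({"Team": team_names, "Predicted Wins": outcomes})
    .sort_values("Predicted Wins", ascending=False)
    .reset_index(drop=True)
)
print(ranked.round(2))
```

Since the regressor returns fractional values, the output is better read as a relative ranking than as a literal count of games won.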
Finding Important Features in Our Model
# A random forest reports feature importances, not regression coefficients
importances = randTree.feature_importances_
print('Features | Importances:')
print('-------------------------------')
for i in range(len(importances)):
    print(data_2022.columns[i], ":", importances[i])
Features | Importances:
-------------------------------
Adjusted Offensive Efficiency : 0.020837837067937272
Adjusted Defensive Efficiency : 0.019878321896035844
Power Ranking : 0.6352742363889717
Effective Field Goal % : 0.02354902907386864
Effective Field Goal % (D) : 0.00834636668418663
Turnover % : 0.017603147031684505
Turnover % (D) : 0.024662559002028416
Offensive Rebounds : 0.029780766422328686
Defensive Rebounds : 0.030720724599930433
Free Throw Rate : 0.03103466259250237
Free Throw Rate (D) : 0.0112768168324684
2-PT% : 0.01959143305651005
2-PT% (D) : 0.02504113475899021
3-PT% : 0.011447156047409724
3-PT (D) : 0.015981116861988772
Adjusted Tempo : 0.01668413131261128
Wins above Bubble : 0.033070407747852616
Seed in Tournament : 0.025220152622694322
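The values above are easier to compare once sorted. The sketch below ranks features with a pandas `Series`, using three of the printed importances rounded to four decimals for brevity. In the notebook the full set would come from `randTree.feature_importances_` and the feature column names; the names `feature_names` and `ranked` are illustrative.

```python
import pandas as pd

# Three of the importances printed above, rounded to four decimals.
feature_names = [
    "Adjusted Offensive Efficiency",
    "Power Ranking",
    "Wins above Bubble",
]
importances = [0.0208, 0.6353, 0.0331]

# Index the values by feature name and sort from most to least important.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked)
```

scikit-learn normalizes these impurity-based importances so that they sum to 1 across all features, which means Power Ranking alone accounts for roughly 64% of the total here.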