Find our github page here!

Final Project Tutorial

Milestone 1: Data and Website

Team: Robin Chahal, Davor Gavranic

Sources of data

• Kaggle - UFC-Fight historical data from 1993 to 2019: Link

Milestone 2: Extraction, Transform, and Load (ETL) + Exploratory Data Analysis (EDA)

Dataset

We chose the dataset UFC-Fight historical data from 1993 to 2019 from the website Kaggle. It covers all UFC fights from 1993 to 2019 and all fighters who were members of the UFC during this period.

Extraction

The data came as two CSV files, which we downloaded from the website and imported into our notebook. Kaggle also offers two preprocessed datasets, which we did not use because they had already been modified. We preferred to work with the raw data and apply our own modifications.

Transformation

Since we used the raw data, we had to make several modifications before we could analyze it. First, we imported the two files into two dataframes. Then we tidied each dataframe before finally merging the two into one main dataframe.

First, we load all the necessary libraries for our project.

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import re
import datetime

Transforming and Tidying: Fighter Dataset

In this step, we load the first dataset, which contains the attributes of each fighter. We noticed that several columns held more than one value, so we had to split them (for example, the name column into First Name and Last Name). Because new columns are appended on the right side of the dataframe, we had to rearrange them afterwards. Furthermore, we had to change the data types, because almost the whole dataframe was stored as object type. While doing so, we ran into the same problem of multiple values in one column: the weight of a fighter, for example, was stored as "200 lb" instead of just the number. We removed the unit from the values and put it into the column names instead. To simplify the later analysis, we converted the height and the reach of each fighter from imperial units (feet and inches) to metric units (cm).

# load first CSV file (fighter attributes) into a dataframe
fighter_df = pd.read_csv("./data/raw_fighter_details.csv")


# start tidying fighter_df dataframe...

# separate name column into first and last name and drop old column
new = fighter_df["fighter_name"].str.split(" ", n = 1, expand = True) 
fighter_df["First Name"]= new[0] 
fighter_df["Last Name"]= new[1] 
fighter_df.drop(columns =["fighter_name"], inplace = True) 

# rearrange columns
fighter_df = fighter_df[['First Name', 'Last Name', 'Height', 'Weight', 'Reach', 'Stance', 'DOB']]

# change data types and rename columns
fighter_df["DOB"] = pd.to_datetime(fighter_df["DOB"])
# keep only the number in front of the unit (assigning the two-column result of expand=True to one column fails in recent pandas)
fighter_df["Weight"] = fighter_df["Weight"].str.split(" ", n=1).str[0]
fighter_df.rename(columns={'Weight':'Weight (lbs)'}, inplace=True)
fighter_df["Weight (lbs)"] = pd.to_numeric(fighter_df["Weight (lbs)"])

# convert height and reach from feet and inches to cm and rename columns
fighter_df['Height'] = fighter_df['Height'].str.replace('"','')
fighter_df['Height'] = fighter_df['Height'].str.replace('\'','')
newH = fighter_df["Height"].str.split(" ", n = 1, expand = True) 
newH[0] = pd.to_numeric(newH[0])
newH[1] = pd.to_numeric(newH[1])
newH["cm"] = newH[0] * 30.48 + newH[1] * 2.54
fighter_df["Height"] = newH["cm"]
fighter_df.rename(columns={'Height':'Height (cm)'}, inplace=True)
# convert reach from inches to cm
fighter_df['Reach'] = fighter_df['Reach'].str.replace('"','')
fighter_df['Reach'] = pd.to_numeric(fighter_df['Reach'])
fighter_df['Reach'] = fighter_df['Reach'] * 2.54
fighter_df.rename(columns={'Reach':'Reach (cm)'}, inplace=True)

# print tidy data -> we keep NaN/NaT values, because dropping them would remove too many fighters who already had a fight
fighter_df.head()
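
The feet-and-inches conversion can also be written as a small helper based on a regular expression. The following is a minimal sketch, not the code used in this project: the function name height_to_cm and the sample values are made up for illustration, and it assumes that heights are formatted like 5' 11".

```python
import pandas as pd


def height_to_cm(height: pd.Series) -> pd.Series:
    """Convert strings such as 5' 11" to centimetres (hypothetical helper)."""
    # capture the feet before the apostrophe and the inches after it
    parts = height.str.extract(r"(?P<feet>\d+)'\s*(?P<inches>\d+)").astype(float)
    # 1 foot = 30.48 cm, 1 inch = 2.54 cm
    return parts["feet"] * 30.48 + parts["inches"] * 2.54


# made-up sample values, including one missing height
sample = pd.Series(["5' 11\"", "6' 0\"", None])
height_cm = height_to_cm(sample)
print(height_cm)
```

Values that do not match the pattern, including missing heights, come out as NaN, which fits our decision to keep fighters with incomplete data.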

Transforming and Tidying: Fights Dataset

After loading the dataset, we had the same problem of multiple values in one column. Since every fight involves two fighters and a referee, we had to split the name columns several times. After rearranging the new name columns, we turned to the fight statistics. These were always stored as two values, e.g. the number of significant strikes landed "of" the number attempted. We could not interpret the number of attempts reliably, because the numbers differed between columns and the description on Kaggle did not explain why. We therefore dropped the attempts, since they are not needed for the further analysis, and kept the number of successful attacks (strikes, clinch etc.). Only the takedown attempts are important to us, because they have a big influence on the scoring and on the whole fight, so we split them into a new column. Finally, we changed the necessary data types to finish the tidying process.

# load second CSV file with all fights from 1993 - 2019 into a dataframe
totalfight_df = pd.read_csv("./data/raw_total_fight_data.csv", sep=";")


# start tidying totalfight_df dataframe...

# for the red fighter, blue fighter, winner and referee: separate the name column into first and last name
for col, prefix in [("R_fighter", "R"), ("B_fighter", "B"), ("Winner", "Winner"), ("Referee", "Referee")]:
    names = totalfight_df[col].str.split(" ", n=1, expand=True)
    totalfight_df[prefix + "_First Name"] = names[0]
    totalfight_df[prefix + "_Last Name"] = names[1]
    totalfight_df.drop(columns=[col], inplace=True)

# rearrange columns
cols_to_order = ['R_First Name', 'R_Last Name', 'B_First Name', 'B_Last Name', 'Winner_First Name','Winner_Last Name'] 
new_columns = cols_to_order + (totalfight_df.columns.drop(cols_to_order).tolist())
totalfight_df = totalfight_df[new_columns]

# the columns SIG_STR. (significant strikes) and TOTAL_STR. (total strikes) are stored as "landed of attempted"
# Problem: the attempts differ between the two columns and there is no description why
# Solution: we drop the attempts and keep only the landed strikes
for col in ["R_SIG_STR.", "B_SIG_STR.", "R_TOTAL_STR.", "B_TOTAL_STR."]:
    # .str[0] keeps the first part of the split; assigning the expand=True result to one column fails in recent pandas
    totalfight_df[col] = pd.to_numeric(totalfight_df[col].str.split(" ", n=1).str[0])

# change data type of SIG_STR_pct and rename it to SIG_STR (%)
for corner in ["R", "B"]:
    old_name = corner + "_SIG_STR_pct"
    # drop the trailing "%" and convert the remaining number
    totalfight_df[old_name] = pd.to_numeric(totalfight_df[old_name].str.split("%", n=1).str[0])
    totalfight_df.rename(columns={old_name: corner + "_SIG_STR (%)"}, inplace=True)

# split column TD ("takedowns of attempts"): keep the takedowns in TD and store the attempts in TD_pct
# we overwrite TD_pct with the attempts and rename it to TD_ATT, because we can recalculate the % at any time
newTD = totalfight_df["R_TD"].str.split("of", n = 1, expand = True) 
totalfight_df["R_TD"]= newTD[0] 
totalfight_df["R_TD_pct"]= newTD[1] 
totalfight_df.rename(columns={'R_TD_pct':'R_TD_ATT'}, inplace=True)
newTD = totalfight_df["B_TD"].str.split("of", n = 1, expand = True) 
totalfight_df["B_TD"]= newTD[0] 
totalfight_df["B_TD_pct"]= newTD[1] 
totalfight_df.rename(columns={'B_TD_pct':'B_TD_ATT'}, inplace=True)
# change data type to numeric
cols = ["R_TD", "B_TD", "R_TD_ATT", "B_TD_ATT"]
totalfight_df[cols] = totalfight_df[cols].apply(pd.to_numeric, errors='coerce')

# for the columns HEAD, BODY, LEG, DISTANCE, CLINCH, GROUND we drop the second value (attempts), because it is less relevant
# e.g. "head hits of attempted head hits" will only keep the hits
cols = ["R_HEAD", "B_HEAD", "R_BODY", "B_BODY", "R_LEG", "B_LEG", "R_DISTANCE", "B_DISTANCE", "R_CLINCH", "B_CLINCH", "R_GROUND", "B_GROUND"]
for col in cols:
    totalfight_df[col] = totalfight_df[col].str.split(" ", n=1).str[0]

# change all columns above to numeric
totalfight_df[cols] = totalfight_df[cols].apply(pd.to_numeric, errors='coerce')

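# change data type of date/time columns
# note: appending ':00' makes pandas read e.g. "4:51" as 4 h 51 min ("4:51:00"), not as 4 min 51 s;
# to store the real duration, '00:' would have to be prepended instead
totalfight_df["last_round_time"] = pd.to_timedelta(totalfight_df["last_round_time"]+':00')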
totalfight_df["date"] = pd.to_datetime(totalfight_df["date"])

# print tidy dataframe
totalfight_df.head()
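
The "landed of attempted" columns can also be split in one step with a regular expression. This is a minimal sketch on made-up sample values; the helper name split_landed_attempted is hypothetical and not part of the project code above.

```python
import pandas as pd


def split_landed_attempted(col: pd.Series) -> pd.DataFrame:
    """Split strings such as "90 of 171" into two numeric columns (hypothetical helper)."""
    parts = col.str.extract(r"(?P<landed>\d+)\s+of\s+(?P<attempted>\d+)")
    # values that do not match the pattern become NaN instead of raising an error
    return parts.apply(pd.to_numeric, errors="coerce")


# made-up sample values in the format used by the raw fight data
sample = pd.Series(["90 of 171", "8 of 11", "0 of 0"])
td = split_landed_attempted(sample)
print(td)
```

Keeping both numbers in named columns makes it easy to drop the attempts where they are not needed and to keep them for the takedowns.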

Transforming and Tidying: Merging process

In this step we merged the dataframes. First we merged fighter_df and totalfight_df into main_df and did some further transformations on the new dataframe. The next step was to create a score for every fighter. For this, we counted the fights and wins of each fighter in totalfight_df and calculated the losses from them. The last step was to put this information into fighter_df. To do so, we created helper dataframes with these counts, merged them with fighter_df, and added the columns Totalfights, wins and losses to fighter_df.

# merging both tables on name
# Problem: we have two fighters in totalfight_df which have to merge with the table fighter_df
# Solution: we do two merges: the first with the Red fighter (R_...), second with the Blue fighter (B_...) -> new df main_df

R_match = pd.merge(totalfight_df, fighter_df,  how='left', left_on=['R_First Name','R_Last Name'], right_on = ['First Name','Last Name'])
main_df = pd.merge(R_match, fighter_df,  how='left', left_on=['B_First Name', 'B_Last Name'], right_on = ['First Name','Last Name'])

# drop the duplicate name columns that were merged in from fighter_df
main_df.drop(columns =["First Name_x", "Last Name_x", "First Name_y", "Last Name_y"], inplace = True) 
main_df.dtypes

# rename the new columns like the others, with the prefix R_/B_, to be consistent
main_df.rename(columns={'Height (cm)_x':'R_Height (cm)', "Weight (lbs)_x": "R_Weight (lbs)", "Reach (cm)_x":"R_Reach (cm)", "Stance_x":"R_Stance", "DOB_x":"R_DOB"}, inplace=True)
main_df.rename(columns={'Height (cm)_y':'B_Height (cm)', "Weight (lbs)_y": "B_Weight (lbs)", "Reach (cm)_y":"B_Reach (cm)", "Stance_y":"B_Stance", "DOB_y":"B_DOB"}, inplace=True)
main_df.rename(columns={'Totalfights_x':'R_Totalfights', "wins_x": "R_wins", "losses_x":"R_losses"}, inplace=True)
main_df.rename(columns={'Totalfights_y':'B_Totalfights', "wins_y": "B_wins", "losses_y":"B_losses"}, inplace=True)

# the column Fight_type still holds several values, which we extract into separate columns before dropping the old column
main_df['Sex'] = np.where(main_df['Fight_type'].str.contains("Women"), 'Women', 'Man')
main_df['Title bout'] = np.where(main_df['Fight_type'].str.contains("Title"), 1, 0)
main_df['Title_bout'] = main_df['Title bout'].astype('bool')
main_df.drop(columns =["Title bout"], inplace = True)
main_df.drop(columns =["Fight_type"], inplace = True)

# creating score for every fighter

# count the fights of every fighter in the red and in the blue corner, and the wins
red = totalfight_df.groupby(['R_First Name','R_Last Name']).size()
blue = totalfight_df.groupby(['B_First Name','B_Last Name']).size()
wins = totalfight_df.groupby(['Winner_First Name','Winner_Last Name']).size()
# give all three results the same index names, so that they align on the fighter name
for counts in (red, blue, wins):
    counts.index.names = ['First Name', 'Last Name']

# total fights = fights in the red corner + fights in the blue corner (a fighter may appear in only one corner)
record = pd.DataFrame({'Totalfights': red.add(blue, fill_value=0), 'wins': wins}).fillna(0)
# losses = total fights - wins (note: draws and no contests are counted as losses here)
record['losses'] = record['Totalfights'] - record['wins']

# merging the counts with fighter_df, which adds 3 new columns (Totalfights, wins, losses)
fighter_df = fighter_df.merge(record.reset_index(), how='left', on=['First Name', 'Last Name'])

# get UFC weight classes from: https://www.ladbrokes.com.au/betting-info/ufc/weight-divisions/
# upper weight limits (lbs) of the classes; the class is derived from the weight of the blue fighter
bins = [0, 115, 125, 135, 145, 155, 170, 185, 205, np.inf]
labels = ['Strawweight', 'Flyweight', 'Bantamweight', 'Featherweight', 'Lightweight',
          'Welterweight', 'Middleweight', 'Light Heavyweight', 'Heavyweight']
# pd.cut uses right-closed intervals, e.g. (115, 125] -> Flyweight; missing weights stay NaN
# (this replaces the chained assignments, which no longer modify the dataframe in recent pandas)
main_df['Weight_class'] = pd.cut(main_df['B_Weight (lbs)'], bins=bins, labels=labels).astype(object)

# our dataframe is now tidy and ready for meaningful analysis
main_df.head()
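
The two merges (one per corner) follow the same pattern, so they can be expressed as one helper that prefixes the fighter attributes before joining. The sketch below is an illustration under simplifying assumptions: it uses a single name column instead of first and last name, made-up values, and a hypothetical helper called attach_corner. The argument validate="many_to_one" makes pandas raise an error if a fighter name appears more than once in the attribute table, which would otherwise silently duplicate fights.

```python
import pandas as pd

# made-up sample data for illustration
fighters = pd.DataFrame({
    "name": ["Fighter A", "Fighter B"],
    "Reach (cm)": [180.0, 175.0],
})
fights = pd.DataFrame({"R_fighter": ["Fighter A"], "B_fighter": ["Fighter B"]})


def attach_corner(fights: pd.DataFrame, fighters: pd.DataFrame, corner: str) -> pd.DataFrame:
    """Join the fighter attributes onto the fights for one corner ("R" or "B")."""
    # prefix every attribute column, e.g. "Reach (cm)" -> "R_Reach (cm)"
    attrs = fighters.add_prefix(corner + "_")
    # the prefixed name column becomes the join key, e.g. "R_name" -> "R_fighter"
    attrs = attrs.rename(columns={corner + "_name": corner + "_fighter"})
    return fights.merge(attrs, how="left", on=corner + "_fighter", validate="many_to_one")


merged = attach_corner(attach_corner(fights, fighters, "R"), fighters, "B")
print(merged.columns.tolist())
```

Because the attributes are prefixed before the merge, no _x/_y suffixes appear and no renaming is needed afterwards.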

Exploratory Data Analysis

Graph 1: Fights and finishes comparison

The UFC is gaining more and more popularity. In the early days the sport was considered too brutal, and only very few fighters dared to enter the Octagon. Meanwhile, MMA has become a very popular sport. One reason for this is certainly the UFC itself, which was criticized in its early days but has become the biggest and most successful organization in the sport. Every fighter would like to join this organization and become a champion in their weight class.

For this reason we want to create a graph that shows the number of fights per year in this organization, from the beginning in 1993 until 2018. At the same time we want to show the number of finishes. Finishes are fights that are decided by KO/TKO or submission and not by decision.

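# create new column finish ('True' for KO/TKO, doctor's stoppage or submission, otherwise 'False')
main_df.loc[main_df['win_by'].str.contains("KO/TKO"),'finish'] = 'True'
main_df.loc[main_df['win_by'].str.contains("TKO - Doctor's Stoppage"),'finish'] = 'True'
main_df.loc[main_df['win_by'].str.contains("Submission"),'finish'] = 'True'
# assign the result back instead of calling fillna(inplace=True) on the attribute, which may not update the dataframe
main_df['finish'] = main_df['finish'].fillna("False")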

#creating a new dataframe, to manipulate the values for the plot
new12 = main_df.copy(deep=True)
#setting the date as index, to group by years
new12.set_index("date", inplace=True)
#grouping by year and creating new dataframe
plot_df = new12.groupby(by=[new12.index.year])['finish'].describe()

# calculating the number of fights ending with a finish for each year
# describe() returns the most frequent value ('top') and its frequency ('freq') per year
plot_df.loc[plot_df['top'].str.contains("True"),'finish'] = plot_df.freq
plot_df['finish'] = plot_df['finish'].astype(float)
plot_df['count'] = plot_df['count'].astype(float)
plot_df.loc[plot_df['top'].str.contains("False"),'finish'] = plot_df['count'] - plot_df['freq'] 
plot_df["finishing_ratio"] = plot_df['finish'] / plot_df['count'] 
# finishes include KO/TKO as well as submissions, so the column is named accordingly
plot_df.rename(columns={'count':'Total Fights', 'finish':'Total Finishes'}, inplace=True)


# plotting all fights and all fights with finishes
plot1 = plot_df["Total Fights"].plot.line(figsize=(15, 10), legend=True)
plot1 = plot_df["Total Finishes"].plot.line(figsize=(15, 10), legend=True)
plot1.set_ylabel("Number of fights each year")
plot1.set_xlabel("")
plt.xticks(np.arange(1993, 2020, 1))

As you can see, the number of fights per year was very small at the beginning. In 1993 there were only 8 fights, and the number did not rise much in the following years. From 2004/2005 on, however, the number of fights per year increased very quickly until 2014, when there were 500 fights. It has stayed at roughly this level since then. Interestingly, as the number of fights grew, the ratio of finishes to total fights became smaller.
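
The yearly totals and the finishing ratio can also be computed directly from a boolean column. The following is a minimal sketch on made-up fights, not our dataset; it assumes a date column and a win_by column like the ones in main_df, and the names fights and per_year are placeholders.

```python
import pandas as pd

# made-up sample fights for illustration
fights = pd.DataFrame({
    "date": pd.to_datetime(["1993-11-12", "1993-11-12", "2018-01-14",
                            "2018-02-03", "2018-03-03", "2018-04-07"]),
    "win_by": ["Submission", "KO/TKO", "Decision - Unanimous",
               "TKO - Doctor's Stoppage", "Decision - Split", "Decision - Unanimous"],
})

# a fight counts as a finish if it ended by (T)KO or submission
fights["finish"] = fights["win_by"].str.contains("KO|Submission")

# per year: number of fights (size) and number of finishes (sum of True values)
per_year = fights.groupby(fights["date"].dt.year)["finish"].agg(
    total_fights="size", finishes="sum"
)
per_year["finishing_ratio"] = per_year["finishes"] / per_year["total_fights"]
print(per_year)
```

With a real boolean column, the ratio is simply the mean of finish per year, and no string comparison with 'True' or 'False' is needed.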

Graph 2: Finishing ratio

To check this, the next graph shows the ratio of finishes to total fights over the years.

# plotting finishing ratio
plot2 = plot_df["finishing_ratio"].plot.line(figsize=(15, 10))
plot2.set_ylabel("Finishes as a share of total fights")
plot2.set_xlabel("")
plt.xticks(np.arange(1993, 2019, 1))

As you can see, the ratio became smaller over the years. There can be several reasons for this. MMA is still a very young sport, but in recent years more and more good fighters have appeared. Technique has also improved, and today fighters know that to be successful they have to master all segments of the sport: stand-up, wrestling and ground fighting. But the most important reason is probably that the UFC only signs the elite of the sport. In the beginning, the UFC still had trouble finding fighters who were willing to fight in MMA. Due to the increasing popularity of the sport, more and more fighters have appeared, and the UFC no longer has trouble finding fighters but gets the best ones from other organizations worldwide.

Graph 3: How do fights get finished

Next, we want to know how fights get finished. There are usually 6 different ways in which a fight can end:

• KO
• TKO
• Submission
• Disqualification
• Decision
• Other

Unfortunately, the data does not distinguish between KO and TKO, so these two types were combined. The term "Other" is used if the fight had to be stopped for other reasons, or if the result was later declared invalid, e.g. because a fighter missed weight or was banned for doping.

#condense win_by column
new12.loc[new12['win_by'].str.contains("Decision"),'win_by'] = 'Decision'
new12.loc[new12['win_by'].str.contains("TKO"),'win_by'] = 'KO/TKO'
new12.loc[new12['win_by'].str.contains("Overturned"),'win_by'] = 'Other'
new12.loc[new12['win_by'].str.contains("Could"),'win_by'] = 'Other'

#plotting finishes
finishes_counts = new12.win_by.value_counts()
plot3 = finishes_counts.plot.pie(figsize=(15, 10), fontsize=11)
plot3.set_ylabel("")

As you can see, most fights end by decision. This is in line with the two graphs above. The second most common ending is KO/TKO, and submissions come only in third place. This suggests that fighters with better striking skills finish fights early more often than fighters with better ground fighting skills.

Graph 4: In which round are the most finishes

Next we want to find out in which round most finishes happen, as a share of all finishes.

# show only fights which ended with a finish (KO/TKO or submission)
aax5 = new12.loc[new12.finish == "True"]

# get ratio of fights finished in each round to all finished fights
lastRound_ratio = aax5.groupby("last_round")["R_First Name"].count() / aax5["R_First Name"].count()

# plot result
plot4 = lastRound_ratio.plot.bar(figsize=(15, 10), title="Share of finishes by last round")
plot4.set_xlabel("Round")
plot4.set_ylabel("Proportion of all finishes")

It can be seen that, of all finishes, most happen in the first round, and the chance of being finished decreases with every round. One likely reason is that fighters get tired after the first round and their striking power decreases. Since most fights have just 3 rounds and only main events and title fights have 5, the share of finishes in rounds 4 and 5 is very small.
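
The share of finishes per round can also be obtained in a single call with value_counts(normalize=True). This is a minimal sketch on made-up round numbers, not our data; the names finished and share are placeholders.

```python
import pandas as pd

# made-up last rounds of eight finished fights
finished = pd.DataFrame({"last_round": [1, 1, 2, 1, 3, 2, 1, 5]})

# normalize=True divides each count by the total number of rows,
# sort_index() orders the result by round instead of by frequency
share = finished["last_round"].value_counts(normalize=True).sort_index()
print(share)
```

Rounds in which no fight ended (round 4 in this sample) do not appear in the result at all, which should be kept in mind when reading the bar chart.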

Graph 5: KO's for each weight class

Many people say that the higher the weight class, the more KO's there are. But this rumor has never been proven with data. Therefore, we filtered our main dataframe to get all fights that ended by KO. The total number of KO's in each weight class is not meaningful on its own, because some weight classes had more fights than others. Since we created the weight classes, we can now calculate the ratio between KO's and total fights for every weight class.

# get all fights with KO's 
KO_counts = main_df.loc[main_df.win_by == 'KO/TKO']

# get ratio between KO's and total fights 
KO_perc = KO_counts.Weight_class.value_counts() / main_df.Weight_class.value_counts()

# sort result ascending
KO_perc.sort_values(inplace=True)

# plot the result as a bar chart
ax1 = KO_perc.plot.bar(figsize=(20,10), title="KO's as a share of total fights")

# print bar plot
ax1

As we can see in the bar plot above, there is a correlation between weight and KO's: the heavier the fighter, expressed through the weight classes, the higher the probability of a KO. The only exceptions are Featherweight and Bantamweight, which are switched, but the difference between them is very small. Almost 50% of all fights in the heavyweight division end with a KO/TKO. On this basis we can say the myth is true.
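
The KO rate per weight class can also be computed as the mean of a boolean column, which avoids dividing two separate value_counts results. The sketch below uses made-up fights, not our data; it assumes columns named like Weight_class and win_by in main_df, and ko_rate is a placeholder name.

```python
import pandas as pd

# made-up sample fights for illustration
fights = pd.DataFrame({
    "Weight_class": ["Heavyweight", "Heavyweight",
                     "Flyweight", "Flyweight", "Flyweight", "Flyweight",
                     "Strawweight", "Strawweight"],
    "win_by": ["KO/TKO", "Decision - Unanimous",
               "Submission", "KO/TKO", "Decision - Split", "Decision - Unanimous",
               "Decision - Unanimous", "Submission"],
})

# True for KO/TKO fights; the mean of True/False values is the share of KO/TKO fights
is_ko = fights["win_by"] == "KO/TKO"
ko_rate = is_ko.groupby(fights["Weight_class"]).mean().sort_values()
print(ko_rate)
```

A weight class without any KO gets a rate of 0 here, whereas dividing two value_counts results would give NaN for it.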

Graph 6: Average strikes per round compared with proportion of KO/TKO's

After showing that the heavyweight class has the most KO's, we looked for reasons. We focused on the strikes in each round. Since we only have the total strikes each fighter (Red and Blue) landed in a fight, we divided them by the number of rounds fought to get the average strikes per round. We plotted the result and labeled each point with the corresponding weight class.

# get all rounds fought in each weight class
tot_rounds = main_df.groupby("Weight_class")["last_round"].sum()
# get mean strikes per round in each weight class, save result in new df
tot_strikes = main_df.groupby("Weight_class")["R_TOTAL_STR."].sum() + main_df.groupby("Weight_class")["B_TOTAL_STR."].sum()
plt6_df = pd.DataFrame({"mean_strike_round": tot_strikes / tot_rounds})
# get proportion of KO/TKO of all fights and change data type
ko_fights = main_df.loc[main_df.win_by == 'KO/TKO'].groupby("Weight_class")["R_First Name"].count()
all_fights = main_df.groupby("Weight_class")["R_First Name"].count()
plt6_df["KO/TKO"] = (ko_fights / all_fights).astype(float)

# plot result
aax1 = plt6_df.plot.scatter(x ='mean_strike_round', y ='KO/TKO', alpha=0.5, figsize=(10, 8), s=300)
aax1.set_ylabel("Proportion of KO/TKO of all fights")
aax1.set_xlabel("Average strikes each round")
# labeling dots for weight classes  
for i, txt in enumerate(plt6_df.index):
    aax1.annotate(txt, (plt6_df["mean_strike_round"].iloc[i], plt6_df["KO/TKO"].iloc[i]))

As seen above, more strikes per round do not go along with a higher proportion of KO/TKO. Heavyweight is the weight class with the fewest strikes per round, followed by Light Heavyweight etc. So the high proportion of KO/TKO in heavyweight is not caused by the number of strikes in each round. Interestingly, the graph shows that the higher the weight class, the fewer strikes there are. Maybe we need to focus on other data.
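
A visual impression from a scatter plot can be backed up with a correlation coefficient. The sketch below uses made-up numbers, not our results, to show how Series.corr (Pearson by default) could be applied to a summary table like plt6_df; the table summary and the column name ko_rate are placeholders.

```python
import pandas as pd

# made-up summary values per weight class, for illustration only
summary = pd.DataFrame(
    {
        "mean_strike_round": [40.0, 35.0, 30.0, 25.0],
        "ko_rate": [0.20, 0.25, 0.35, 0.45],
    },
    index=["Class A", "Class B", "Class C", "Class D"],
)

# Pearson correlation: +1 = perfect positive, -1 = perfect negative linear relation
r = summary["mean_strike_round"].corr(summary["ko_rate"])
print(round(r, 2))
```

With only nine weight classes, such a coefficient is sensitive to single points, and a strong correlation alone does not show that one variable causes the other.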

Graph 7: Ratio of significant strikes per round compared with proportion of KO/TKO's

We have not yet found a reason for the high KO/TKO rate in the heavyweight class, but we found out that the higher the weight class, the fewer strikes there are. We now compare the proportion of significant strikes among total strikes with the proportion of KO/TKO. Changing the focus to this variable shows a completely different picture, as shown below.

# get all rounds fought in each weight class
tot_rounds = main_df.groupby("Weight_class")["last_round"].sum()
# get significant strikes per round in each weight class and their ratio to all strikes per round
sig_strikes = main_df.groupby("Weight_class")["R_SIG_STR."].sum() + main_df.groupby("Weight_class")["B_SIG_STR."].sum()
plt6_df["sig_strike_round"] = sig_strikes / tot_rounds
plt6_df["ratio_strike_round"] = plt6_df["sig_strike_round"] / plt6_df["mean_strike_round"]
# mean of the SIG_STR (%) columns of both fighters in each weight class (x-axis of the plot)
plt6_df["ratio"] = (main_df.groupby("Weight_class")["R_SIG_STR (%)"].mean() + main_df.groupby("Weight_class")["B_SIG_STR (%)"].mean())/2

# plot result
aax1 = plt6_df.plot.scatter(x ='ratio', y ='KO/TKO', alpha=0.5, figsize=(10, 8), s=300, c="red")
aax1.set_ylabel("Proportion of KO/TKO of all fights")
aax1.set_xlabel("ratio of significant strikes")
# labeling dots for weight classes  
for i, txt in enumerate(plt6_df.index):
    aax1.annotate(txt, (plt6_df["ratio"].iloc[i], plt6_df["KO/TKO"].iloc[i]))

Here we found a correlation that supports our hypothesis and offers an explanation: heavyweight has the highest proportion of significant strikes (in relation to total strikes), which goes along with the highest KO/TKO rate.

Output of totalfight_df.head() (tidy fights dataframe):

	R_First Name	R_Last Name	B_First Name	B_Last Name	Winner_First Name	Winner_Last Name	R_KD	B_KD	R_SIG_STR.	B_SIG_STR.	...	B_GROUND	win_by	last_round	last_round_time	Format	date	location	Fight_type	Referee_First Name	Referee_Last Name
0	Henry	Cejudo	Marlon	Moraes	Henry	Cejudo	0	0	90	57	...	1	KO/TKO	3	04:51:00	5 Rnd (5-5-5-5-5)	2019-06-08	Chicago, Illinois, USA	UFC Bantamweight Title Bout	Marc	Goddard
1	Valentina	Shevchenko	Jessica	Eye	Valentina	Shevchenko	1	0	8	2	...	0	KO/TKO	2	00:26:00	5 Rnd (5-5-5-5-5)	2019-06-08	Chicago, Illinois, USA	UFC Women's Flyweight Title Bout	Robert	Madrigal
2	Tony	Ferguson	Donald	Cerrone	Tony	Ferguson	0	0	104	68	...	0	TKO - Doctor's Stoppage	2	05:00:00	3 Rnd (5-5-5)	2019-06-08	Chicago, Illinois, USA	Lightweight Bout	Dan	Miragliotta
3	Jimmie	Rivera	Petr	Yan	Petr	Yan	0	2	73	56	...	4	Decision - Unanimous	3	05:00:00	3 Rnd (5-5-5)	2019-06-08	Chicago, Illinois, USA	Bantamweight Bout	Kevin	MacDonald
4	Tai	Tuivasa	Blagoy	Ivanov	Blagoy	Ivanov	0	1	64	73	...	6	Decision - Unanimous	3	05:00:00	3 Rnd (5-5-5)	2019-06-08	Chicago, Illinois, USA	Heavyweight Bout	Dan	Miragliotta

Output of main_df.head() (merged main dataframe):

	R_First Name	R_Last Name	B_First Name	B_Last Name	Winner_First Name	Winner_Last Name	R_KD	B_KD	R_SIG_STR.	B_SIG_STR.	...	R_Stance	R_DOB	B_Height (cm)	B_Weight (lbs)	B_Reach (cm)	B_Stance	B_DOB	Sex	Title_bout	Weight_class
0	Henry	Cejudo	Marlon	Moraes	Henry	Cejudo	0	0	90	57	...	Orthodox	1987-02-09	167.64	135.0	170.18	Orthodox	1988-04-26	Man	True	Bantamweight
1	Valentina	Shevchenko	Jessica	Eye	Valentina	Shevchenko	1	0	8	2	...	Southpaw	1988-03-07	167.64	125.0	167.64	Orthodox	1986-07-27	Women	True	Flyweight
2	Tony	Ferguson	Donald	Cerrone	Tony	Ferguson	0	0	104	68	...	Orthodox	1984-02-12	185.42	155.0	185.42	Orthodox	1983-03-29	Man	False	Lightweight
3	Jimmie	Rivera	Petr	Yan	Petr	Yan	0	2	73	56	...	Orthodox	1989-06-29	170.18	135.0	170.18	Switch	1993-02-11	Man	False	Bantamweight
4	Tai	Tuivasa	Blagoy	Ivanov	Blagoy	Ivanov	0	1	64	73	...	Southpaw	1993-03-16	180.34	250.0	185.42	Southpaw	1986-10-09	Man	False	Heavyweight

Output of fighter_df.head() (tidy fighter dataframe):

	First Name	Last Name	Height (cm)	Weight (lbs)	Reach (cm)	Stance	DOB
0	AJ	Fonseca	162.56	145.0	NaN	NaN	NaT
1	AJ	Matthews	180.34	185.0	NaN	NaN	NaT
2	AJ	McKee	177.80	145.0	NaN	NaN	NaT
3	AJ	Siscoe	170.18	135.0	NaN	NaN	NaT
4	Aalon	Cruz	182.88	145.0	NaN	NaN	NaT