Breast cancer is a cancer that starts in the breast. It occurs mostly in women, but men can get it too, and it is one of the leading causes of cancer death in women. Because healthcare now generates a great deal of data, we can use machine learning to predict whether a patient will survive a disease like breast cancer. In this article, I will take you through the task of breast cancer survival prediction with machine learning using Python.
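The dataset covers over 400 breast cancer patients who underwent surgery for the treatment of breast cancer. For each patient it records an ID, age, gender, four protein measurements (Protein1 to Protein4), tumour stage, histology, ER status, PR status, HER2 status, surgery type, date of surgery, date of last visit, and patient status (alive or dead).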
Using this dataset, our task is to predict whether a breast cancer patient will survive after surgery. The dataset was collected from Kaggle, where it can be downloaded. In the section below, I will walk you through the task of predicting breast cancer survival with machine learning using Python.
I will start the task of breast cancer survival prediction by importing the necessary Python libraries and the dataset:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
data = pd.read_csv("BRCA.csv")
print(data.head())
     Patient_ID   Age  Gender  Protein1  Protein2  Protein3  Protein4  \
0  TCGA-D8-A1XD  36.0  FEMALE  0.080353   0.42638   0.54715  0.273680
1  TCGA-EW-A1OX  43.0  FEMALE -0.420320   0.57807   0.61447 -0.031505
2  TCGA-A8-A079  69.0  FEMALE  0.213980   1.31140  -0.32747 -0.234260
3  TCGA-D8-A1XR  56.0  FEMALE  0.345090  -0.21147  -0.19304  0.124270
4  TCGA-BH-A0BF  56.0  FEMALE  0.221550   1.90680   0.52045 -0.311990

  Tumour_Stage                      Histology ER status PR status HER2 status  \
0          III  Infiltrating Ductal Carcinoma  Positive  Positive    Negative
1           II             Mucinous Carcinoma  Positive  Positive    Negative
2          III  Infiltrating Ductal Carcinoma  Positive  Positive    Negative
3           II  Infiltrating Ductal Carcinoma  Positive  Positive    Negative
4           II  Infiltrating Ductal Carcinoma  Positive  Positive    Negative

                  Surgery_type Date_of_Surgery Date_of_Last_Visit  \
0  Modified Radical Mastectomy       15-Jan-17          19-Jun-17
1                   Lumpectomy       26-Apr-17          09-Nov-18
2                        Other       08-Sep-17          09-Jun-18
3  Modified Radical Mastectomy       25-Jan-17          12-Jul-17
4                        Other       06-May-17          27-Jun-19

  Patient_Status
0          Alive
1           Dead
2          Alive
3          Alive
4           Dead
Let’s check whether the columns of this dataset contain any null values:
print(data.isnull().sum())
Patient_ID             7
Age                    7
Gender                 7
Protein1               7
Protein2               7
Protein3               7
Protein4               7
Tumour_Stage           7
Histology              7
ER status              7
PR status              7
HER2 status            7
Surgery_type           7
Date_of_Surgery        7
Date_of_Last_Visit    24
Patient_Status        20
dtype: int64
Every column has some null values, so I will drop the rows that contain them:
data = data.dropna()
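Since `dropna()` removes every row that has at least one null, it helps to check how many rows are lost. The sketch below illustrates this on a small made-up table (the values are hypothetical stand-ins, not rows from BRCA.csv):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "Age": [36.0, 43.0, np.nan, 56.0],
    "Gender": ["FEMALE", "FEMALE", None, "FEMALE"],
    "Patient_Status": ["Alive", "Dead", None, None],
})

missing_per_column = toy.isnull().sum()  # nulls counted per column
cleaned = toy.dropna()                   # keeps only fully populated rows
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```

The same two lines (`len(data)` before and after `dropna()`) can be applied to the real data to see how much of it survives the cleaning step.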
Now let’s look at a summary of the columns of this data:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 317 entries, 0 to 333
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Patient_ID          317 non-null    object
 1   Age                 317 non-null    float64
 2   Gender              317 non-null    object
 3   Protein1            317 non-null    float64
 4   Protein2            317 non-null    float64
 5   Protein3            317 non-null    float64
 6   Protein4            317 non-null    float64
 7   Tumour_Stage        317 non-null    object
 8   Histology           317 non-null    object
 9   ER status           317 non-null    object
 10  PR status           317 non-null    object
 11  HER2 status         317 non-null    object
 12  Surgery_type        317 non-null    object
 13  Date_of_Surgery     317 non-null    object
 14  Date_of_Last_Visit  317 non-null    object
 15  Patient_Status      317 non-null    object
dtypes: float64(5), object(11)
memory usage: 42.1+ KB
Breast cancer is mostly found in females, so let’s look at the Gender column to see how many females and males there are:
print(data.Gender.value_counts())
FEMALE    313
MALE        4
Name: Gender, dtype: int64
As expected, females far outnumber males in the Gender column. Now let’s look at the tumour stage of the patients:
# Tumour Stage
stage = data["Tumour_Stage"].value_counts()
labels = stage.index   # stage names
counts = stage.values  # number of patients per stage
figure = px.pie(values=counts, names=labels, hole=0.5, title="Tumour Stages of Patients")
figure.show()
So most of the patients are in the second stage. Now let’s look at the histology of the breast cancer patients. (Histology is a description of a tumour based on how abnormal the cancer cells and tissue look under a microscope and how quickly the cancer can grow and spread):
# Histology
histology = data["Histology"].value_counts()
labels = histology.index   # histology names
counts = histology.values  # number of patients per histology
figure = px.pie(values=counts, names=labels, hole=0.5, title="Histology of Patients")
figure.show()
Now let’s look at the values of ER status, PR status, and HER2 status of the patients:
# ER status
print(data["ER status"].value_counts())
# PR status
print(data["PR status"].value_counts())
# HER2 status
print(data["HER2 status"].value_counts())
Positive    317
Name: ER status, dtype: int64
Positive    317
Name: PR status, dtype: int64
Negative    288
Positive     29
Name: HER2 status, dtype: int64
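In the counts above, ER status and PR status each take a single value for every patient, so these two columns cannot help a model tell one patient from another. A minimal sketch of how such constant columns could be detected with `nunique()`, using a tiny made-up table (hypothetical values, not the real data):

```python
import pandas as pd

# Toy stand-in with two constant columns (hypothetical values)
toy = pd.DataFrame({
    "ER status": ["Positive", "Positive", "Positive"],
    "PR status": ["Positive", "Positive", "Positive"],
    "HER2 status": ["Negative", "Positive", "Negative"],
})

unique_counts = toy.nunique()  # distinct values per column
constant_columns = unique_counts[unique_counts <= 1].index.tolist()

print(constant_columns)  # ['ER status', 'PR status']
```

Whether to keep or drop such columns is a judgment call. They do no harm to the model used in this article, but they add nothing either.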
Now let’s look at the types of surgery the patients underwent:
# Surgery_type
surgery = data["Surgery_type"].value_counts()
labels = surgery.index   # surgery type names
counts = surgery.values  # number of patients per surgery type
figure = px.pie(values=counts, names=labels, hole=0.5, title="Type of Surgery of Patients")
figure.show()
Our exploration shows that the dataset has many categorical features. To train a machine learning model on this data, we need to convert the values of the categorical columns to numbers. Here is how we can transform them:
data["Tumour_Stage"] = data["Tumour_Stage"].map({"I": 1, "II": 2, "III": 3})
data["Histology"] = data["Histology"].map({"Infiltrating Ductal Carcinoma": 1, "Infiltrating Lobular Carcinoma": 2, "Mucinous Carcinoma": 3})
data["ER status"] = data["ER status"].map({"Positive": 1})
data["PR status"] = data["PR status"].map({"Positive": 1})
data["HER2 status"] = data["HER2 status"].map({"Positive": 1, "Negative": 2})
data["Gender"] = data["Gender"].map({"MALE": 0, "FEMALE": 1})
data["Surgery_type"] = data["Surgery_type"].map({"Other": 1, "Modified Radical Mastectomy": 2, "Lumpectomy": 3, "Simple Mastectomy": 4})
print(data.head())
     Patient_ID   Age  Gender  Protein1  Protein2  Protein3  Protein4  \
0  TCGA-D8-A1XD  36.0       1  0.080353   0.42638   0.54715  0.273680
1  TCGA-EW-A1OX  43.0       1 -0.420320   0.57807   0.61447 -0.031505
2  TCGA-A8-A079  69.0       1  0.213980   1.31140  -0.32747 -0.234260
3  TCGA-D8-A1XR  56.0       1  0.345090  -0.21147  -0.19304  0.124270
4  TCGA-BH-A0BF  56.0       1  0.221550   1.90680   0.52045 -0.311990

   Tumour_Stage  Histology  ER status  PR status  HER2 status  Surgery_type  \
0             3          1          1          1            2             2
1             2          3          1          1            2             3
2             3          1          1          1            2             1
3             2          1          1          1            2             2
4             2          1          1          1            2             1

  Date_of_Surgery Date_of_Last_Visit Patient_Status
0       15-Jan-17          19-Jun-17          Alive
1       26-Apr-17          09-Nov-18           Dead
2       08-Sep-17          09-Jun-18          Alive
3       25-Jan-17          12-Jul-17          Alive
4       06-May-17          27-Jun-19           Dead
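One thing to watch with `map()` is that any value missing from the dictionary silently becomes NaN, which would later break model training. The sketch below shows one way to catch this. The label "IV" is a hypothetical unmapped category added only for illustration:

```python
import pandas as pd

# "IV" is a hypothetical label that the mapping does not cover
toy = pd.DataFrame({"Tumour_Stage": ["I", "II", "III", "IV"]})
stage_codes = {"I": 1, "II": 2, "III": 3}

encoded = toy["Tumour_Stage"].map(stage_codes)  # unmapped values become NaN
unmapped = toy.loc[encoded.isna(), "Tumour_Stage"].unique().tolist()

print(unmapped)  # ['IV']
```

Running a check like `data.isnull().sum()` after the mapping step on the real data would confirm that every category was covered.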
We can now move on to training a machine learning model to predict the survival of a breast cancer patient. Before training the model, we need to split the data into training and test sets:
# Splitting data
x = np.array(data[['Age', 'Gender', 'Protein1', 'Protein2', 'Protein3','Protein4', 'Tumour_Stage', 'Histology', 'ER status', 'PR status', 'HER2 status', 'Surgery_type']])
# 1D label array (a 2D column vector makes scikit-learn emit a shape warning)
y = np.array(data['Patient_Status'])
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.10, random_state=42)
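With a test set this small and far fewer "Dead" than "Alive" labels, a plain random split can leave the test set with very few "Dead" cases. One option, not used in this article, is the `stratify` parameter of `train_test_split`, which keeps the class proportions the same in both sets. A sketch on synthetic labels (the 80/20 ratio and all names ending in `_demo` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (hypothetical 80/20 class ratio)
rng = np.random.default_rng(42)
x_demo = rng.normal(size=(100, 3))
y_demo = np.array(["Alive"] * 80 + ["Dead"] * 20)

# stratify=y_demo preserves the class ratio in both splits
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.10, random_state=42, stratify=y_demo
)

dead_in_test = int((y_te == "Dead").sum())
print(f"{dead_in_test} of {len(y_te)} test labels are 'Dead'")
```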
Now here’s how we can train a machine learning model:
model = SVC()
model.fit(xtrain, ytrain)
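The article does not measure how well the model performs, so here is a minimal sketch of how that could be done on a held-out test set. It compares the accuracy of an `SVC` with a baseline that always predicts the most common class. The data is synthetic and every name ending in `_demo` is a hypothetical stand-in, so the numbers it prints say nothing about the real dataset:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: the label depends loosely on the first feature
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 4))
noise = 0.5 * rng.normal(size=200)
y_demo = np.where(x_demo[:, 0] + noise > 0.8, "Dead", "Alive")

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

demo_model = SVC()
demo_model.fit(x_tr, y_tr)
accuracy = accuracy_score(y_te, demo_model.predict(x_te))

# Baseline: always predict the most frequent training label
labels, counts = np.unique(y_tr, return_counts=True)
majority_label = labels[np.argmax(counts)]
baseline = float(np.mean(y_te == majority_label))

print(f"SVC accuracy: {accuracy:.2f}, majority baseline: {baseline:.2f}")
```

On the real data the same idea would use `xtest` and `ytest`. A model is only useful if it beats the majority baseline, which is high when most patients are alive.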
Now let’s enter values for all the features used to train this machine learning model and predict whether a patient will survive breast cancer:
# Prediction
# features = [['Age', 'Gender', 'Protein1', 'Protein2', 'Protein3','Protein4', 'Tumour_Stage', 'Histology', 'ER status', 'PR status', 'HER2 status', 'Surgery_type']]
features = np.array([[36.0, 1, 0.080353, 0.42638, 0.54715, 0.273680, 3, 1, 1, 1, 2, 2]])
print(model.predict(features))
['Alive']
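Support vector machines are sensitive to feature scale, and here Age is in the tens while the protein values sit near zero. A possible refinement, not part of the original workflow, is to standardise the features before the `SVC` using a scikit-learn pipeline. The sketch below uses synthetic data with two made-up features, so treat it as an illustration of the pattern only:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in features on very different scales (hypothetical values)
rng = np.random.default_rng(1)
age = rng.uniform(30, 90, size=150)
protein = rng.normal(0, 0.5, size=150)
x_demo = np.column_stack([age, protein])
y_demo = np.where(protein > 0.3, "Dead", "Alive")

# The scaler is fitted on the training data and reapplied at prediction time
scaled_model = make_pipeline(StandardScaler(), SVC())
scaled_model.fit(x_demo, y_demo)

prediction = scaled_model.predict(np.array([[36.0, 0.08]]))
print(prediction)
```

Because the scaler lives inside the pipeline, new inputs can be passed in their original units, just like the `features` array above.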
So this is how we can use machine learning for the task of breast cancer survival prediction. I hope you liked this article on breast cancer survival prediction with machine learning using Python. Feel free to ask questions in the comments section below.