
- 27th Jun 2024
- 17:43
In this assignment, we revisit the Cervical Cancer dataset from the UCI Machine Learning Repository. We return to it because of its substantial class imbalance. You will train an XGBoost classifier to predict Biopsy, the binary target stored in the last column, and apply SMOTE to mitigate the class imbalance. You will then use SHAP to explain how the trained classifier works.
Your Python script must meet the following specifications:
- Load the dataset by assuming that the data file is in the same folder as your Python script, with the original file name.
- Exclude the following two columns since they have too many missing values:
  - STDs: Time since first diagnosis
  - STDs: Time since last diagnosis
- Exclude the following three columns since they are also target variables:
  - Hinselmann
  - Schiller
  - Cytology (spelled Citology in the data file)
- For all remaining features, impute missing values with the mean and mode for continuous and binary variables, respectively.
- Randomly partition the data into 70% training and 30% test sets.
- Within the training data, oversample the minority class so that its size is 70% of that of the majority class. Use imblearn.over_sampling.SMOTE and for all other parameters use the default settings.
- Train an XGBoost classifier on the oversampled training data using the xgboost package. You do not need to tune hyperparameters or do cross-validation. Use the default parameter settings in xgboost and train on the entire oversampled training data.
- Evaluate the trained model on the test data and print the following metrics to the screen:
  - Accuracy
  - F1 score
  - Precision
  - Recall
  - AUC
- Use shap.force_plot to produce the force plot for a randomly selected sample from the test data.
- Use shap.force_plot to produce the force plot for all test data.
- Use shap.summary_plot to produce the summary plot for all features based on the test data.
- Your code should be in a file named assignment10_studentid.py or assignment10_studentid.ipynb.
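To make the 70% oversampling target concrete, here is a small sketch of the arithmetic behind it. The helper name `synthetic_samples_needed` and the class counts are hypothetical (they are not taken from the dataset), and truncating the fractional part is an assumption about how the ratio gets rounded.

```python
def synthetic_samples_needed(n_majority, n_minority, ratio=0.7):
    """Synthetic minority samples needed so that
    n_minority_after / n_majority is approximately `ratio`."""
    # Assumption: the fractional part is truncated, so the final
    # ratio can land just under the requested target.
    needed = int(n_majority * ratio - n_minority)
    # If the minority class is already large enough, add nothing.
    return max(needed, 0)


# Hypothetical class counts, used only for illustration
n_majority, n_minority = 1000, 100
extra = synthetic_samples_needed(n_majority, n_minority)
print(extra, (n_minority + extra) / n_majority)
```

Only the minority class grows; the majority class and the test set are left untouched, which is why the oversampling is applied after the train/test split.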
UCI Machine Learning Repository - DATA 622 - Get Assignment Solution
Please note that this is a sample assignment solved by our Machine Learning Programmers. These solutions are intended to be used for research and reference purposes only. If you can learn any concepts by going through the reports and code, then our Python Tutors would be very happy.
Free Assignment Solution - UCI Machine Learning Repository - DATA 622
This is a partial solution. If you need access to the complete work, please contact us via email or live chat.
# -*- coding: utf-8 -*-
"""> Importing Python Libraries:"""
# Commented out IPython magic to ensure Python compatibility.
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(style="white")
import matplotlib.pyplot as plt
# %matplotlib inline
import warnings
warnings.simplefilter("ignore")
"""> Loading Dataset as Pandas dataframe:"""
df = pd.read_csv("risk_factors_cervical_cancer.csv")
print("Our original dataset has {} rows and {} columns. \n".format(df.shape[0], df.shape[1]))
df.head()
"""> Pre-Processing:"""
df = df.drop(["STDs: Time since first diagnosis", "STDs: Time since last diagnosis",
"Hinselmann", "Schiller", "Citology"], axis=1)
print("After dropping the excluded columns, our dataset has {} rows and {} columns. \n".format(df.shape[0], df.shape[1]))
df.head()
df.columns
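# The file marks missing values with "?"; convert them to NaN so every column can be cast to float
df = df.replace("?", np.nan)
df = df.astype(float)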
df.info()
# Fill NaNs with the column mean for continuous variables
continuous_cols = ["Age", "Number of sexual partners", "First sexual intercourse", "Num of pregnancies",
                   "Smokes (years)", "Smokes (packs/year)", "Hormonal Contraceptives (years)", "IUD (years)",
                   "STDs (number)", "STDs: Number of diagnosis"]
df[continuous_cols] = df[continuous_cols].fillna(df[continuous_cols].mean())
df.head()
# Fill NaNs with the column mode for the remaining (binary) variables
binary_cols = [col for col in df.columns if col not in continuous_cols]
df[binary_cols] = df[binary_cols].fillna(df[binary_cols].mode().iloc[0])
df.info()
"""> Train Test Split:"""
X = df.drop("Biopsy", axis=1)
y = df["Biopsy"]
# Splitting data into train and test sample using 70% data for training and 30% data for testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)
"""> SMOTE:"""
from imblearn.over_sampling import SMOTE
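# sampling_strategy=0.7 grows the minority class to 70% of the majority class size
sm = SMOTE(sampling_strategy=0.7, random_state=42)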
X_train, y_train = sm.fit_resample(X_train, y_train)
# Class counts in the (oversampled) training set and the untouched test set
plt.figure(1, figsize=(25, 5))
n = 0
for z, j in zip([y_train, y_test], ['train data', 'test data']):
    n += 1
    plt.subplot(1, 2, n)
    sns.countplot(x=z, palette="Set2")
    plt.title(j)
plt.show()
"""> XGBoost Classifier:"""
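# The assignment asks for the xgboost package with default parameters
from xgboost import XGBClassifier
# Labels are cast to int because XGBClassifier expects integer class labels 0 and 1
clf = XGBClassifier().fit(X_train, y_train.astype(int))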
"""> Evaluation:"""
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
predictions = clf.predict(X_test)
# AUC is computed from the predicted probability of the positive class
probabilities = clf.predict_proba(X_test)[:, 1]
# scikit-learn metrics take (y_true, y_pred) in that order; swapping them exchanges precision and recall
test_acc = accuracy_score(y_test, predictions) * 100
f1 = f1_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
auc = roc_auc_score(y_test, probabilities)
print("Accuracy on test set: {:.3f}%. \n".format(test_acc))
print("F1 score: {:.3f}. \n".format(f1))
print("Precision score: {:.3f}. \n".format(precision))
print("Recall score: {:.3f}. \n".format(recall))
print("AUC: {:.3f}. \n".format(auc))
"""> SHAP values for the test data:"""
# In a notebook, install SHAP first with: %pip install shap
import shap
# TreeExplainer works directly on tree ensembles and is far faster than
# running KernelExplainer on clf.predict with the whole test set as background
clf_explainer = shap.TreeExplainer(clf)
shap_values = clf_explainer.shap_values(X_test)
"""> Use shap.force_plot to produce the force plot for a randomly selected sample from the test data."""
from random import randrange
n = randrange(X_test.shape[0])
shap.initjs()
shap.force_plot(clf_explainer.expected_value, shap_values[n,:], X_test.iloc[n,:])
"""> Use shap.force_plot to produce the force plot for all test data."""
shap.force_plot(clf_explainer.expected_value, shap_values, X_test)
"""> Use shap.summary_plot to produce the summary plot for all features based on the test data."""
shap.summary_plot(shap_values, X_test)
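To see what a force plot is adding up, it can help to compute Shapley values by hand on a tiny example. The sketch below enumerates every feature coalition for a toy three-feature function. The names `shapley_values` and `toy_model` are hypothetical, the toy function is not the assignment's classifier, and a single baseline vector stands in for the background data that SHAP would average over.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating all coalitions.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)
    features = range(n)

    def value(coalition):
        masked = [x[i] if i in coalition else baseline[i] for i in features]
        return predict(masked)

    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical toy model with an interaction between features 1 and 2
def toy_model(v):
    return 2.0 * v[0] + v[1] * v[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
base_value = toy_model(baseline)
# The base value plus the contributions reproduces the model output for x
print(phi, base_value + sum(phi), toy_model(x))
```

The interaction term is split evenly between the two features involved, and the contributions sum to the gap between the baseline output and the prediction. A force plot draws that same decomposition for one sample, starting from the expected value and pushing toward the model output.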
About The Author - Dr. Alex Johnson
Dr. Alex Johnson, a data scientist and educator with a PhD in Computer Science, specializes in machine learning and data analysis. He authored the "UCI Machine Learning Repository - DATA 622 - Free Assignment Solution" guide, focusing on practical, hands-on learning for students tackling complex data science problems.