- 5th Jul 2024
- 17:33 pm
To accomplish your task, you need to perform the following operations:
1. Download the dataset and prepare a summary of the features available on the dataset including data type (numerical/categorical), amount of missing data in individual fields. This can be included as an appendix.
2. Undertake any cleansing or pre-processing you think is necessary on the dataset. In your report, explain clearly what you have done and why you have done it. Some cleaning could be to remove any feature/column with 60% missing values or holding NULL values, constants, NaN values, or to remove duplicate and highly correlated information. You can also perform outlier detection at this stage if this seems appropriate.
3. Split the data into a training set and a test set once cleansing is done. Use suitable toolkit and libraries (Python, Orange, Weka, or R whichever platform you are comfortable with) to train models (e.g. Decision Tree, Random Forest or SVM) from the training set to build the diabetes_mellitus status classifier. Note that you should deal with any class imbalance, do feature selection and other adjustments/tuning to improve the quality of the models obtained. You will need to test the performance of your model on your test set. As part of your final report, please describe and justify the decisions you have made, the results, how the models have been validated/evaluated and discuss the best model’s effectiveness in terms of precision and recall performances.
4. In the next stage, use an unsupervised clustering algorithm (K-means, or hierarchical) using the selected features from the previous stage. Use Scatter plots or t-SNE plots on the clusters to see if there are clusters formed for the various patient groups (without diabetes_mellitus-0, with diabetes_mellitus-1). The diabetes_mellitus field should be omitted during clustering.
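The feature summary asked for in step 1 could be produced along the following lines. This is only a minimal sketch: the helper name `summarize_features` and the tiny `toy` frame are made up for illustration and stand in for the real dataset.

```python
import numpy as np
import pandas as pd


def summarize_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, numerical/categorical kind, missing count and percentage."""
    # Classify each column by its pandas dtype (hypothetical helper, not from the notebook).
    is_numeric = [pd.api.types.is_numeric_dtype(t) for t in df.dtypes]
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "kind": np.where(is_numeric, "numerical", "categorical"),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
    })
    return summary.sort_values("pct_missing", ascending=False)


# Tiny synthetic frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "age": [50.0, np.nan, 63.0, 41.0],
    "gender": ["M", "F", None, "F"],
    "diabetes_mellitus": [0, 1, 0, 0],
})
summary = summarize_features(toy)
print(summary)
```

Note that a dtype-based split labels binary 0/1 flags such as `diabetes_mellitus` as numerical, so columns like that may need to be re-labelled by hand in the appendix.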
"__Importing Basic Libraries:__"
"import pandas as pd\n",
"import numpy as np\n",
"import seaborn as sns;sns.set(style=\"white\")\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', None)\n",
"import warnings\n",
"df = pd.read_csv(\"/content/DiabetesClassificationDataset2022.csv\")\n",
"print(\"Our orignal data-set have {} rows and {} columns. \\n\" .format(df.shape[0], df.shape[1]))\n",
"Our orignal data-set have 79159 rows and 88 columns. \n",
"# dropping \"encounter_id\",\t\"hospital_id\", \"ethnicity\"\n",
"df = df.drop([\"encounter_id\",\t\"hospital_id\", \"ethnicity\"], axis =1)"
"df_numerics_only = df.select_dtypes(include=np.number)\n",
"cols = df_numerics_only.columns\n",
"for col in cols: \n",
" fig, ax = plt.subplots()\n",
" fig.set_size_inches(15, 5)\n",
" sns.distplot(df[col], color=\"m\")\n",
" "
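The model-training cells are not part of this excerpt. As a rough sketch of what step 3 describes (a stratified train/test split, some handling of class imbalance, and precision/recall on the test set), something like the following could be used. The data here is synthetic, generated with `make_classification`, and the choice of a Random Forest with `class_weight="balanced"` is an assumption, not necessarily what the notebook did.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the cleaned features (about 20% positives).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=42)

# Stratify so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# class_weight="balanced" reweights classes inversely to their frequency.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

labels = clf.predict(X_test)               # hard 0/1 predictions
scores = clf.predict_proba(X_test)[:, 1]   # estimated probability of the positive class

precision = precision_score(y_test, labels)
recall = recall_score(y_test, labels)
auc = roc_auc_score(y_test, scores)
print(f"precision={precision:.3f} recall={recall:.3f} roc_auc={auc:.3f}")
```

Precision and recall are computed from hard labels, whereas ROC AUC is more informative when given continuous scores such as `predict_proba` output; passing hard labels to `roc_auc_score` collapses the curve to a single operating point.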
"# calculate scores\n",
"ns_probs = [0 for _ in range(len(y_test))]\n",
"ns_auc = roc_auc_score(y_test, ns_probs)\n",
"lr_auc = roc_auc_score(y_test, predictions)\n",
"# summarize scores\n",
"print('No Skill: ROC AUC=%.3f' % (ns_auc))\n",
"print('Logistic: ROC AUC=%.3f' % (lr_auc))\n",
"# calculate roc curves\n",
"ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)\n",
"lr_fpr, lr_tpr, _ = roc_curve(y_test, predictions)\n",
"# plot the roc curve for the model\n",
"plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')\n",
"plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic')\n",
"# axis labels\n",
"plt.xlabel('False Positive Rate')\n",
"plt.ylabel('True Positive Rate')\n",
"# show the legend\n",
"# show the plot\n",
"No Skill: ROC AUC=0.500\n",
"Logistic: ROC AUC=0.649\n"
