Data Mining - CMP-7023B - Assignment Solution

5th Jul 2024
17:33

To accomplish your task, you need to perform the following operations:

1. Download the dataset and prepare a summary of the features available on the dataset including data type (numerical/categorical), amount of missing data in individual fields. This can be included as an appendix.
2. Undertake any cleansing or pre-processing you think is necessary on the dataset. In your report, explain clearly what you have done and why you have done it. Some cleaning could be to remove any feature/column with 60% missing values or holding NULL values, constants, NaN values, or to remove duplicate and highly correlated information. You can also perform outlier detection at this stage if this seems appropriate.
3. Split the data into a training set and a test set once cleansing is done. Use suitable toolkit and libraries (Python, Orange, Weka, or R whichever platform you are comfortable with) to train models (e.g. Decision Tree, Random Forest or SVM) from the training set to build the diabetes_mellitus status classifier. Note that you should deal with any class imbalance, do feature selection and other adjustments/tuning to improve the quality of the models obtained. You will need to test the performance of your model on your test set. As part of your final report, please describe and justify the decisions you have made, the results, how the models have been validated/evaluated and discuss the best model’s effectiveness in terms of precision and recall performances.
4. In the next stage, use an unsupervised clustering algorithm (K-means, or hierarchical) using the selected features from the previous stage. Use Scatter plots or t-SNE plots on the clusters to see if there are clusters formed for the various patient groups (without diabetes_mellitus-0, with diabetes_mellitus-1). The diabetes_mellitus field should be omitted during clustering.
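Steps 1 and 2 of the brief can be sketched roughly as follows. This is a minimal illustration, not the solution's actual code: the helper names `summarize_features` and `basic_clean`, the toy columns, and the 0.95 correlation cut-off are all assumptions made for the example. Only the 60% missing-value threshold comes from the brief.

```python
import numpy as np
import pandas as pd


def summarize_features(df):
    """One row per column: numerical/categorical kind and percent missing."""
    kinds = ["numerical" if pd.api.types.is_numeric_dtype(df[c]) else "categorical"
             for c in df.columns]
    missing = (df.isna().mean() * 100).round(2).to_numpy()
    return pd.DataFrame({"kind": kinds, "missing_pct": missing}, index=df.columns)


def basic_clean(df, target, max_missing=0.60, max_corr=0.95):
    """Drop duplicate rows plus sparse, constant, and highly correlated columns."""
    out = df.drop_duplicates()
    # Columns with at least `max_missing` of their values missing (never the target).
    sparse = [c for c in out.columns
              if c != target and out[c].isna().mean() >= max_missing]
    out = out.drop(columns=sparse)
    # Columns with a single distinct value carry no information.
    constant = [c for c in out.columns
                if c != target and out[c].nunique(dropna=True) <= 1]
    out = out.drop(columns=constant)
    # For each highly correlated numeric pair, drop the later column.
    numeric = out.select_dtypes(include=np.number).drop(columns=[target], errors="ignore")
    corr = numeric.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > max_corr).any()]
    return out.drop(columns=redundant)


# Hypothetical toy data; the real dataset is not reproduced here.
toy = pd.DataFrame({
    "age": [50, 61, 47, 72, 50],
    "age_months": [600, 732, 564, 864, 600],          # duplicates the age signal
    "bmi": [27.1, np.nan, 31.4, 24.9, 27.1],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "unit": ["ICU", "ICU", "ICU", "ICU", "ICU"],      # constant
    "gender": ["M", "F", "F", "M", "M"],
    "diabetes_mellitus": [0, 1, 0, 1, 0],
})

summary = summarize_features(toy)
cleaned = basic_clean(toy, target="diabetes_mellitus")
print(summary)
print(cleaned.columns.tolist())
```

On the toy frame, the last row duplicates the first, `mostly_missing` exceeds the missing-value threshold, `unit` is constant, and `age_months` is perfectly correlated with `age`, so each cleaning rule fires once. The `summary` table is the kind of per-feature overview the brief suggests placing in an appendix.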

Data Mining - CMP-7023B - Get Assignment Solution

Please note that this is a sample solution created by our Python programmers for the Data Mining - CMP-7023B assignment. These solutions are for research and reference only.

Check out the partial solution for this assignment in the blog post below.

Free Assignment Solution - Data Mining - CMP-7023B

{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name":
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Lv4TQo3iMEFJ"
},
"source": [
"__Importing Basic Libraries:__"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Ep0cuHzcALfi"
},
"source": [
"import warnings\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"%matplotlib inline\n",
"sns.set(style=\"white\")\n",
"\n",
"# Show every column and row when displaying DataFrames.\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', None)\n",
"\n",
"# Silence all warnings for cleaner notebook output.\n",
"warnings.simplefilter(\"ignore\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "wrIL3PQLLVYw"
},
"source": [
"__Loading Dataset:__"
]
},
{
"cell_type": "code",
"metadata": {
"id": "T_j2ZYpKBoIw",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 357
},
"outputId": "86d97064-5b3a-40cd-f8d9-30987ac45629"
},
"source": [
"df = pd.read_csv(\"/content/DiabetesClassificationDataset2022.csv\")\n",
"\n",
"print(\"Our original dataset has {} rows and {} columns. \\n\".format(df.shape[0], df.shape[1]))\n",
"\n",
"df.head()"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Our original dataset has 79159 rows and 88 columns. \n",
"\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
]
},
"metadata": {},
"execution_count": 2
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uot7Z-iLU6fy"
},
"source": [
"__Descriptive Statistics:__"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 364
},
"id": "yMbv4ZkIU57O",
"outputId": "9d70d929-4b27-4fb7-cd67-d67d609ff774"
},
"source": [
"df.describe().T"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
]
},
"metadata": {},
"execution_count": 4
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Ph3yJAfoNVG2"
},
"source": [
"## Data Visualization:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "UQalDungPb5X"
},
"source": [
"# Drop \"encounter_id\", \"hospital_id\", and \"ethnicity\".\n",
"\n",
"df = df.drop(columns=[\"encounter_id\", \"hospital_id\", \"ethnicity\"])"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "J15ohzq2QV24"
},
"source": [
"* __Normal Distribution:__"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "6JDx0zHjQZAU",
"outputId": "4967b6df-f144-4dbf-f69d-0f2cf649d3b9"
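},
"source": [
"# Plot the distribution of every numeric column, one figure per column.\n",
"df_numerics_only = df.select_dtypes(include=np.number)\n",
"\n",
"for col in df_numerics_only.columns:\n",
"    fig, ax = plt.subplots(figsize=(15, 5))\n",
"    # histplot with kde=True replaces the deprecated sns.distplot.\n",
"    sns.histplot(df[col].dropna(), kde=True, stat=\"density\", color=\"m\", ax=ax)"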
],
"execution_count": null,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1080x360 with 1 Axes>"
]
},
"metadata": {},
"execution_count": 13
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 322
},
"id": "CaM8AGAZMsPn",
"outputId": "e1ce648d-3ae5-4582-c678-c71493f8f1a8"
},
"source": [
"df.head()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
},
"metadata": {},
"execution_count": 24
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
},
"id": "ySgVaBUYZLrg",
"outputId": "a98e1f56-1216-4712-aa51-813e928fbbbd"
},
"source": [
"from sklearn.metrics import roc_auc_score, roc_curve\n",
"\n",
"# y_test and predictions come from cells not shown in this excerpt.\n",
"# For a full ROC curve, predictions should hold scores or probabilities\n",
"# (e.g. predict_proba(X_test)[:, 1]) rather than hard 0/1 labels.\n",
"\n",
"# No-skill baseline: the same constant score for every test sample.\n",
"ns_probs = [0] * len(y_test)\n",
"\n",
"# Calculate and summarize the ROC AUC scores.\n",
"ns_auc = roc_auc_score(y_test, ns_probs)\n",
"lr_auc = roc_auc_score(y_test, predictions)\n",
"print('No Skill: ROC AUC=%.3f' % ns_auc)\n",
"print('Logistic: ROC AUC=%.3f' % lr_auc)\n",
"\n",
"# Calculate the ROC curves.\n",
"ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)\n",
"lr_fpr, lr_tpr, _ = roc_curve(y_test, predictions)\n",
"\n",
"# Plot both curves with axis labels and a legend.\n",
"plt.figure(figsize=(15, 6))\n",
"plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')\n",
"plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic')\n",
"plt.xlabel('False Positive Rate')\n",
"plt.ylabel('True Positive Rate')\n",
"plt.legend()\n",
"plt.show()"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"No Skill: ROC AUC=0.500\n",
"Logistic: ROC AUC=0.649\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1080x432 with 1 Axes>"
]
},
"metadata": {}
}
]
}
]
}

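The notebook excerpt stops at the ROC curve, while step 3 of the brief also asks for a train/test split, handling of class imbalance, and a discussion of precision and recall. The sketch below shows one way those pieces might fit together with scikit-learn. It runs on synthetic data from `make_classification` rather than the assignment dataset, and the choice of `RandomForestClassifier` with `class_weight="balanced"` is an assumption for illustration, not necessarily what the full solution uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the cleaned features (roughly 20% positives).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)

# Stratify so the training and test sets keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# class_weight="balanced" weights each class inversely to its frequency.
model = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                               random_state=0)
model.fit(X_train, y_train)

labels = model.predict(X_test)              # hard 0/1 predictions
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

precision = precision_score(y_test, labels)
recall = recall_score(y_test, labels)
auc_from_scores = roc_auc_score(y_test, scores)
auc_from_labels = roc_auc_score(y_test, labels)

print("precision=%.3f recall=%.3f" % (precision, recall))
print("ROC AUC from probabilities=%.3f, from hard labels=%.3f"
      % (auc_from_scores, auc_from_labels))
```

Passing probabilities to `roc_auc_score` and `roc_curve` evaluates every decision threshold, whereas hard labels collapse the curve to a single operating point. Precision and recall are computed from the hard labels because they describe one specific threshold.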
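Step 4 of the brief (clustering on the selected features with the diabetes_mellitus field left out, then inspecting the clusters with t-SNE) could look roughly like the sketch below. The data is synthetic (`make_blobs`), and the `label` array only plays the role of the held-out diabetes_mellitus column; the feature count, perplexity, and number of clusters are illustrative assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 rows, 6 features, two latent groups.
# `label` mimics diabetes_mellitus and is NOT passed to the clustering step.
X, label = make_blobs(n_samples=300, n_features=6, centers=2, random_state=0)

# K-means is distance based, so put all features on a comparable scale first.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X_scaled)

# 2-D t-SNE embedding, used for visualization only.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_scaled)

# Compare the discovered clusters with the held-out label after the fact.
table = pd.crosstab(pd.Series(clusters, name="cluster"),
                    pd.Series(label, name="diabetes_mellitus"))
print(table)
```

A scatter plot of `embedding[:, 0]` against `embedding[:, 1]`, colored once by `clusters` and once by the held-out label, then shows whether the clusters line up with the two patient groups. Cluster numbering is arbitrary, so cluster 0 does not necessarily correspond to diabetes_mellitus 0.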

About The Author - Dr. Jane Smith

Dr. Jane Smith is a data scientist with extensive experience in data preprocessing, model training, and clustering techniques. With a strong background in machine learning and statistics, she excels in transforming raw datasets into actionable insights. Dr. Smith is proficient in Python, R, and various data analysis tools, making her well-equipped to guide complex data science projects.
