- 1st Jul 2024
- 17:38 pm
In this assignment, you have to perform clustering tasks on the given dataset. In this project, our expert will help you categorize candidates using clustering techniques, perform exploratory data analysis, and evaluate the clustering results. The objective is to group the candidates into two clusters, in the hope that the clusters correspond to the two types of candidates. Because the data includes the ground truth, we can also try to evaluate whether the clustering actually separated them. Note, however, that in general you will not have ground truth against which to evaluate clusters like this.
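Cluster IDs are arbitrary, so cluster 0 does not have to correspond to target 0. As a minimal sketch of one way to score a two-cluster result against a binary ground truth, the snippet below counts agreement under both possible label assignments and keeps the better one. The function name `two_cluster_agreement` and the toy labels are illustrative assumptions, not part of the assignment, and the sketch assumes both label sequences contain only 0 and 1.

```python
def two_cluster_agreement(y_true, y_pred):
    """Return the fraction of samples that agree with the ground truth
    under the better of the two possible cluster-to-class mappings.

    Assumes both sequences contain only the labels 0 and 1.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    n = len(y_true)
    # Agreement if cluster 0 -> class 0 and cluster 1 -> class 1
    same = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    # Agreement if the two cluster IDs are swapped
    swapped = n - same
    return max(same, swapped) / n


# Toy labels (hypothetical): the grouping is perfect but the IDs are flipped
truth = [0, 0, 0, 1, 1, 1]
clusters = [1, 1, 1, 0, 0, 0]
print(two_cluster_agreement(truth, clusters))  # 1.0
```

A plain accuracy score would report 0.0 for these toy labels even though the grouping is perfect, which is why the mapping has to be considered before reading any misclassification count.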
Sections:
1. Load Data and perform basic EDA
- Import the necessary libraries
- import the data to a dataframe and show the count of rows and columns (1 pt)
- Show the top 5 and last 5 rows (1 pt)
- Are there any null values in any column?
- Are all the columns numeric such as float or int? If not, please convert them to int before going to the next step.
- plot the heatmap with correlations to get some more idea about the data.
2. Feature Selection and Pre-processing
- Put all the data from the dataframe into X, except the enrolle_id and the target columns
- Perform feature scaling on the data of X with StandardScaler and show some sample data from X after scaling (use the technique shown in the second answer to this post: https://stackoverflow.com/questions/44552031/sklearnstandardscaler-can-i-inverse-the-standardscaler-for-the-model-output)
3. KMeans Clustering
- Import the related library for KMeans and perform KMeans on X (note that it was scaled already). Make sure to put random_state = 47 (it can be any number, but use 47 so that you will produce almost the same result as us). Use k-means++ for the initial centroids. You should know from the problem description how many clusters we are interested in.
- Show the cluster centers as they are, then invert the scaling and show the centers again. Please explain in words what the centers mean in terms of the columns of the data set
- Show the distance matrix
- Show the labels
- Add a new column to your data frame called cluster_label and assign the cluster label for the instances based on the K-means cluster label
4. AgglomerativeClustering
- Plot a dendrogram (make the figure size relatively big, but still you will not be able to see it completely. However, at least this will give you an idea of how many clusters you would like to generate)
- Perform AgglomerativeClustering with 2 clusters first, and use euclidean distance for affinity and linkage = 'ward'
- After creating the clusters, plot training hours against experience as in 3.XIII and discuss anything interesting
- Then, increase the number of clusters to 4 or 5 and build the clusters again and plot them again to see any difference.
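The scaling and KMeans steps in sections 2 and 3 can be sketched roughly as follows. This is a hedged illustration on synthetic data, not the assignment's solution: the frame `df_demo`, its two columns and its random values are stand-ins for the real dataset. It shows the cluster centers in scaled units, the same centers mapped back with `inverse_transform`, and the sample-to-center distance matrix returned by `KMeans.transform`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (hypothetical values and column subset)
rng = np.random.default_rng(47)
df_demo = pd.DataFrame({
    "experience": rng.integers(0, 21, size=200),
    "training_hours": rng.integers(1, 300, size=200),
})

# Section 2: scale the features, keeping the column names
scaler = StandardScaler().fit(df_demo)
X_demo = pd.DataFrame(scaler.transform(df_demo), columns=df_demo.columns)

# Section 3: two clusters, k-means++ initialisation, fixed seed
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=47)
labels = kmeans.fit_predict(X_demo)

# Cluster centers in scaled units, then mapped back to the original units
centers_scaled = pd.DataFrame(kmeans.cluster_centers_, columns=df_demo.columns)
centers_original = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_), columns=df_demo.columns
)

# Euclidean distance from every sample to every center (n_samples x 2)
distances = kmeans.transform(X_demo)

print(centers_scaled)
print(centers_original)
print(distances[:5])
```

Reading the centers in original units is what makes them explainable in terms of the columns, for example as a typical number of training hours per cluster.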
Clustering - CAP 4611 - Get Assignment Solution
Please note that this is a sample assignment solved by our Python Programmers. These solutions are intended to be used for research and reference purposes only. If you can learn any concepts by going through the reports and code, then our Python Tutors would be very happy.
Free Assignment Solution - Clustering - CAP 4611
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "xpdy5--kaSFP"
},
"source": [
"## 1. Load Data and perform basic EDA"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "r0ff-78uaTg2"
},
"source": [
">I. Import the necessary libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "bFXReZn6ZV3q"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt \n",
"import seaborn as sns \n",
"from sklearn.preprocessing import StandardScaler\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "u84niISMacMp"
},
"source": [
">II. import the data to a dataframe and show the count of rows and\n",
"columns"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "TKAu5i-eZobG",
"outputId": "5af7cbaa-325d-449f-c295-267aff07d258"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Our dataset has 12977 rows and 7 columns.\n"
]
}
],
"source": [
"df=pd.read_csv(\"hrdata3.csv\")\n",
"df=df.drop(\"Unnamed: 0\", axis=1)\n",
"print(\"Our dataset has {} rows and {} columns.\".format(df.shape[0], df.shape[1]))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4nUtrg5vahIt"
},
"source": [
">III. Show the top 5 and last 5 rows"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "uv6dJ-f-Z5-h",
"outputId": "7a05019c-50e1-48b4-a734-10b6cd2f21e1"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
"
"
"cell_type": "markdown",
"metadata": {
"id": "pynbkVgJbTtf"
},
"source": [
"## 2. Feature Selection and Pre-processing"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vyjjw28Ybci5"
},
"source": [
">I. Put all the data from the dataframe into X, except the enrolle_id\n",
"and the target columns"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "rEei9O0jbQa5",
"outputId": "e16b55de-958f-41ec-bcd5-6598facc7e5e"
},
"outputs": [
{
"data": {
"text/html":
},
"source": [
">II. Perform feature scaling on the data of X with StandardScaler and\n",
"show some sample data from X after scaling"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "DmrA-LmDboQ3",
"outputId": "35755c5d-9ada-428f-e6d1-4c64655bd4a9"
},
"outputs": [
{
"data": {
"text/plain": [
" city_development_index experience company_size last_new_job \\\n",
"0 -0.503422 0.633957 -0.574723 1.690762 \n",
"1 -0.578413 1.546009 -0.574723 1.081137 \n",
"2 0.696434 -0.886130 -0.574723 -0.747739 \n",
"3 -0.620075 0.329940 -1.488268 1.690762 \n",
"4 0.696434 -0.582112 -0.574723 -0.747739 \n",
"\n",
" training_hours \n",
"0 -0.308396 \n",
"1 -0.951805 \n",
"2 -0.687842 \n",
"3 -0.786828 \n",
"4 -0.324894 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"scaler = StandardScaler().fit(X)\n",
"X = scaler.transform(X)\n",
"X = pd.DataFrame(X)\n",
"X.columns = X_col\n",
"X.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jxuofw7HcCIR"
},
"source": [
"## 3. KMeans Clustering"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TS37KM4GcFdL"
},
"source": [
">I. Import the related library for KMeans and perform KMeans on X\n",
"(note that it was scaled already). Make sure to put\n",
"random_state = 47 (it can be any number, but use 47 so that\n",
"you will produce almost the same result as us). Use\n",
"k-means++ for the initial centroids. You should know from the problem description how many clusters we are interested in."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "R2kKhYnEb7jy",
"outputId": "cac28107-de13-413b-ed16-627e8d47ca92"
},
"outputs": [
{
"data": {
"text/plain": [
"KMeans(algorithm='elkan', n_clusters=2, random_state=47)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.cluster import KMeans\n",
"\n",
"algorithm = KMeans(n_clusters = 2 ,init='k-means++', n_init = 10 ,max_iter=300, \n",
" tol=0.0001, random_state= 47 , algorithm='elkan') \n",
"\n",
"algorithm.fit(X)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nqqlujfGctb0"
},
"source": [
"> II. Show the cluster centers as they are, then invert the scaling\n",
"and show the centers again. Please explain in words what the\n",
"centers mean in terms of the columns of the data set"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "eWd0PlgNcXeP"
},
"outputs": [],
"source": [
"X_inversed = scaler.inverse_transform(X)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 373
},
"id": "pGPbr6hbdIKw",
"outputId": "4c60012e-9002-40e3-acad-0404a6a3f9ff"
},
"outputs": [
{
"data": {
"image/png":
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# fit_predict returns the same array that is stored in algorithm.labels_\n",
"labels = algorithm.fit_predict(X)\n",
"# The points below are plotted in original units, so map the centers back too\n",
"centroids = scaler.inverse_transform(algorithm.cluster_centers_)\n",
"centroids_x = centroids[:,0]\n",
"centroids_y = centroids[:,1]\n",
"\n",
"#Getting unique labels\n",
"u_labels = np.unique(labels)\n",
" \n",
"#plotting the results (first two feature columns):\n",
"\n",
"plt.figure(1 , figsize = (15 ,6))\n",
"for i in u_labels:\n",
"    plt.scatter(X_inversed[labels == i , 0] , X_inversed[labels == i , 1] , label = i)\n",
"\n",
"plt.scatter(centroids_x,centroids_y,marker = \"x\", s=150,linewidths = 5, zorder = 10, c='black')\n",
"\n",
"plt.xlabel(X_col[0])\n",
"plt.ylabel(X_col[1])\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "C6sFnKPeeYC5"
},
"source": [
">III. Show the distance matrix\n",
"\n",
"__The distance matrix for KMeans is a Euclidean distance matrix (the distance from each sample to each cluster center).__"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9dvxnW3NfWlc"
},
"source": [
"> IV. Show the labels"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "baGutOjKfmMA",
"outputId": "a685608b-b884-4ae3-96f0-293bfe99efe3"
},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 1, 0, ..., 0, 1, 0], dtype=int32)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"labels"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a40JcQDGfivB"
},
"source": [
">V. Add a new column to your data frame called cluster_label and\n",
"assign the cluster label for the instances based on the\n",
"K-means cluster label"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"id": "zKb0Z3b_dbOZ"
},
"outputs": [],
"source": [
"df[\"cluster_label\"] = labels"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4a203uigfyvC"
},
"source": [
"> VI. The target column of our data frame is floating-point numbers.\n",
"So, this number is not comparable with the cluster label. Add\n",
"a column target_int and write a function or use a strategy to\n",
"store the int version of the target column into the target_int\n",
"column (For example, 1.0 in the target will be 1 in the\n",
"target_int, 0.0 will be 0)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"id": "DjhIUHUIfdIB"
},
"outputs": [],
"source": [
"df[\"target_int\"] = df[\"target\"].astype(int)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y2zSBZE7f9m1"
},
"source": [
">VII. Show the top 5 rows of the dataframe now that shows\n",
"you have added those two columns and they have the correct\n",
"values"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "_pVaPkjsf5mZ",
"outputId": "bdb43287-5340-47fa-b8e4-3a003e4b32f5"
"cell_type": "markdown",
"metadata": {
"id": "H0Tk-v8mhhLU"
},
"source": [
">IX. Discuss the numbers from 3.VIII and any thoughts on them.\n",
"\n",
"__From the results of 3.VIII, we can conclude that the KMeans clusters did not separate the target classes well; target class 1 in particular was heavily misclassified. In total, 6607 samples were misclassified.__ "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OzpPuGQvhsin"
},
"source": [
">X. Show the inertia of the cluster"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "RXV4UST3huKn",
"outputId": "3f4800dd-2382-4376-bac5-1641d9813431"
},
"outputs": [
{
"data": {
"text/plain": [
"49643.86379769514"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"algorithm.inertia_"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a5y6hkePhqmx"
},
"source": [
">XI. What is the elbow method and what is its purpose in the\n",
"case of KMeans clustering?\n",
"\n",
"\n",
"__The elbow method runs k-means clustering on the dataset for a range of values of k and, for each k, computes the WSS (within-cluster sum of squared errors, which scikit-learn reports as inertia). We then choose the k at the \"elbow\" of the curve, where the decrease in WSS first starts to level off. Its purpose is to pick a reasonable number of clusters when k is not known in advance.__"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "748ketydiADe"
},
"source": [
">XII. Although we just wanted 2 clusters, we still would like to\n",
"see what will happen if you increase the number of clusters.\n",
"Plot the inertia for the different numbers of clusters from 2 to\n",
"20.\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 387
},
"id": "owOCfHpQhq-0",
"outputId": "f5c8a69f-3748-4cc1-f72c-6cb1c009c71d"
},
"outputs": [
{
"data": {
},
"output_type": "display_data"
}
],
"source": [
"inertia = []\n",
"for n in range(2 , 21):\n",
"    # use a separate name so the fitted 2-cluster model in `algorithm` is kept\n",
"    km = KMeans(n_clusters = n ,init='k-means++', n_init = 10 ,max_iter=300, \n",
"                tol=0.0001, random_state= 111 , algorithm='elkan')\n",
"    km.fit(X)\n",
"    inertia.append(km.inertia_)\n",
"\n",
"plt.figure(1 , figsize = (15 ,6))\n",
"plt.plot(np.arange(2 , 21) , inertia , 'o')\n",
"plt.plot(np.arange(2 , 21) , inertia , '-' , alpha = 0.5)\n",
"plt.xlabel('Number of Clusters')\n",
"plt.ylabel('Inertia')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5YpFJq2cltIu"
},
"source": [
">XIII. Show a scatter plot with training hours against\n",
"experience where the points should be colored based on the\n",
"two cluster labels. Write any thoughts on this plot."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 407
},
"id": "xEw7iK3GiEK-",
"outputId": "d6b510b2-14a7-417f-ef0f-d6db7b939efe"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png":
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(1 , figsize = (15 ,6))\n",
"sns.scatterplot(data=df, x=\"experience\", y=\"training_hours\", hue=\"cluster_label\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6cVo0Ow7mHbd"
},
"source": [
">XIV. Show a scatter plot of any other two attributes you are\n",
"interested in, as in 3.XIII, and add your thoughts on your plot as\n",
"well"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 405
},
"id": "SzZelo3hleEg",
"outputId": "fa27aad7-d141-4e6a-9e5c-6bf16c5dc18e"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png":
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(1 , figsize = (15 ,6))\n",
"sns.scatterplot(data=df, x=\"experience\", y=\"city_development_index\", hue=\"cluster_label\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VfnGHHI_nOvd"
},
"source": [
"## 4. AgglomerativeClustering"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5O-AXzOwnSas"
},
"source": [
">1. Plot a dendrogram (make the figure size relatively big, but still\n",
"you will not be able to see it completely. However, at least this\n",
"will give you an idea of how many clusters you would like to\n",
"generate)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 885
},
"id": "2gwnzLbFnC-Z",
"outputId": "63256130-fda2-40da-f2c1-1c641c64f256"
},
"outputs": [
{
"data": {
"image/png":
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"from scipy.cluster.hierarchy import dendrogram, linkage\n",
"\n",
"Z = linkage(X, method='average')\n",
"\n",
"plt.figure(figsize=(20, 15)) \n",
"plt.title(\"Dendrograms: method ='average'\") \n",
"dend = dendrogram(Z)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RQGXk2x-nWPw"
},
"source": [
">2. Perform AgglomerativeClustering with 2 clusters first, and use\n",
"euclidean distance for affinity and linkage = 'ward'"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"id": "qaqo9nalnWtK"
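},
"outputs": [],
"source": [
"from sklearn.cluster import AgglomerativeClustering\n",
"\n",
"# ward linkage only supports Euclidean distance, which is the default; the explicit\n",
"# affinity='euclidean' argument is rejected by recent scikit-learn releases\n",
"hierarchical_cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')\n",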
"hierarchical_cluster.fit_predict(X)\n",
"\n",
"algo_labels = hierarchical_cluster.labels_"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "e3dlYTV7nXMX"
},
"source": [
">3. After creating the clusters, plot training hours against\n",
"experience as in 3.XIII and discuss anything interesting"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "i2S_3Kt9nY45",
"outputId": "e1140afd-593c-46bd-d053-7e1117c87769"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Classification Report: \n",
" precision recall f1-score support\n",
"\n",
" 0 0.80 0.75 0.77 10695\n",
" 1 0.09 0.13 0.11 2282\n",
"\n",
" accuracy 0.64 12977\n",
" macro avg 0.45 0.44 0.44 12977\n",
"weighted avg 0.68 0.64 0.65 12977\n",
"\n",
"\n",
"Confusion Matrix\n",
" [[7970 2725]\n",
" [1996 286]]\n",
"\n",
"Total Misclassified variables are: 4721\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report, confusion_matrix\n",
"\n",
"print(\"Classification Report: \\n\",classification_report(df[\"target_int\"], algo_labels))\n",
"print() \n",
"cm=confusion_matrix(df[\"target_int\"], algo_labels)\n",
"print(\"Confusion Matrix\\n\",cm)\n",
"print()\n",
"print(\"Total Misclassified variables are:\", cm[0,1]+cm[1,0])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Mc46weYnnkvU"
},
"source": [
">4. Then, increase the number of clusters to 4 or 5 and build the\n",
"clusters again and plot them again to see any difference."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "tRUDrVqknlLR",
"outputId": "06dc958d-3248-4547-e347-51b837a76b09"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Classification Report: \n",
" precision recall f1-score support\n",
"\n",
" 0 0.58 0.16 0.26 10695\n",
" 1 0.09 0.13 0.11 2282\n",
" 2 0.00 0.00 0.00 0\n",
" 3 0.00 0.00 0.00 0\n",
" 4 0.00 0.00 0.00 0\n",
"\n",
" accuracy 0.16 12977\n",
" macro avg 0.13 0.06 0.07 12977\n",
"weighted avg 0.49 0.16 0.23 12977\n",
"\n",
"\n",
"Confusion Matrix\n",
" [[1758 2725 3209 1214 1789]\n",
" [1283 286 299 237 177]\n",
" [ 0 0 0 0 0]\n",
" [ 0 0 0 0 0]\n",
" [ 0 0 0 0 0]]\n",
"\n",
"Total Misclassified variables are: 4008\n"
]
}
],
"source": [
"# ward linkage only supports Euclidean distance, which is the default\n",
"hierarchical_cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')\n",
"hierarchical_cluster.fit_predict(X)\n",
"\n",
"algo_labels = hierarchical_cluster.labels_\n",
"\n",
"print(\"Classification Report: \\n\",classification_report(df[\"target_int\"], algo_labels))\n",
"print() \n",
"cm=confusion_matrix(df[\"target_int\"], algo_labels)\n",
"print(\"Confusion Matrix\\n\",cm)\n",
"print()\n",
"print(\"Total Misclassified variables are:\", cm[0,1]+cm[1,0])"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "clustering_solution.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
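With five cluster IDs and a binary target, a classification report is hard to read directly, because cluster IDs 2 to 4 have no matching class. One hedged alternative is to read the same counts as a contingency table and compute purity, that is, the share of samples that fall into the majority class of their cluster. The sketch below reuses the counts from the 5-cluster confusion matrix printed in the notebook above; the variable names are illustrative assumptions.

```python
import numpy as np

# Counts copied from the 5-cluster confusion matrix printed in the notebook:
# rows are the target classes (0, 1), columns are the cluster IDs (0..4)
counts = np.array([
    [1758, 2725, 3209, 1214, 1789],
    [1283,  286,  299,  237,  177],
])

# Share of each target class inside every cluster (each column sums to 1)
class_share = counts / counts.sum(axis=0)

# Purity: give each cluster its majority class and count the matches
majority_class = counts.argmax(axis=0)
purity = counts.max(axis=0).sum() / counts.sum()

print(np.round(class_share, 2))
print("majority class per cluster:", majority_class)
print("purity:", round(float(purity), 3))
```

Here every cluster has class 0 as its majority, so the purity equals the overall share of class 0 (10695 of 12977 samples). In other words, none of the five clusters isolates the class 1 candidates.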
About The Author - Dr. Samuel Lee
Dr. Samuel Lee, a seasoned Data Scientist with a deep understanding of machine learning and data analysis, will guide you through this clustering assignment. With extensive experience in Python programming and data science methodologies, Dr. Lee specializes in transforming raw data into actionable insights. His expertise ensures you gain practical skills in feature selection, KMeans clustering, and AgglomerativeClustering, empowering you to apply these techniques in real-world scenarios.