- 29th Jun 2024
- 21:03
In this assignment, we'll dive into using Python for regression and classification tasks. We'll calculate sums of squares, fit lines, and predict tumor malignancy using real-world data. This hands-on experience will help you grasp fundamental concepts in data analysis and machine learning.
- Take the data in Table 1 below. Using Python, compute the residual sum of squares for the two equations y = x + 0.5 and y = -0.03x^2 + 1.32x + 0.89. (Table 1: x = 1, 3, 5, 6, 10, 15, 20; y = 0, 5, 2, 8, 7, 16, 14.)
- Using the numpy polyfit() function, compute the best fit line (set deg=1) of the data in Table 1. How does the residual sum of squares of this best fit line compare to the residual sum of squares of the two equations from question 1?
- Using sklearn LinearRegression(), compute the best fit line of the data in Table 1. Then, using the predict() method, predict the value of y when x = 4, 7, and 18.
- Use the pandas read_csv() function to read in the CSV file “breast_cancer.csv”, a dataset that uses descriptions of tumors to predict whether a tumor is cancerous. Using matplotlib, plot “mean perimeter” vs “mean radius”. Plot the line y = 2π * “mean radius”, which is the circumference of a circle. What does the agreement (or disagreement) of this line with the data tell you about the shape of the tumors?
- Compute the best fit line that goes through “mean perimeter” and “mean radius”. Print out the slope and intercept.
- Compute the best fit curve that goes through “mean radius” and “mean area”. (Hint: the easiest way to do this is to use the numpy polyfit() function with deg=2). How does this best fit curve compare to the equation for the area of a circle (the area of a circle is πr^2)?
- Using “mean perimeter” and “mean area” as independent variables, predict “mean radius” and compute the mean squared error (see the equation after this list; note that the mean squared error is the residual sum of squares divided by the number of samples). Do you think this multivariate linear regression performs better than a univariate linear regression with either “mean perimeter” or “mean area” as the independent variable?
- Using “mean radius” as the independent variable, predict “is malignant” with sklearn LogisticRegression(). Produce a plot that shows “mean radius” along the x axis and “is malignant” along the y axis. Then, plot the logistic function you fitted versus “mean radius”. (Note: use the predict_proba() method to get the predicted values from the logistic function.)
- With the model you built in question 8, compute the confusion matrix. That is, compute the number of true positives, true negatives, false positives, and false negatives. Print out these four values. The million-dollar question: based on these numbers, would you trust your machine learning model to predict breast cancer?
- Build another logistic regression model that predicts “is malignant”, but choose several independent variables from the breast cancer dataset. Does this new model perform better or worse than the model built in question 8?
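For reference, the mean squared error equation mentioned in question 7, with n samples, observed values y_i, and predicted values ŷ_i:

MSE = RSS / n = (1/n) * Σ (y_i - ŷ_i)^2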
Exploring Regression and Classification in Python - Free Assignment Solution
Please note that this sample Regression and Classification in Python assignment was solved by our Python programmers for research and reference purposes only. If it helps you learn the concepts, our Python tutors will be happy.
- Option 1 - Download the complete solution with code, report, and screenshots from our Free Assignment Sample Solution - Regression and Classification in Python page.
- Option 2 - Contact our Python tutors for online tutoring related to this assignment.
- Option 3 - View the partial solution for this assignment in the blog below.
Free Assignment Solution - Exploring Regression and Classification in Python
# coding: utf-8
# In[79]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, mean_squared_error
# In[96]:
#number 1
x = np.array([1, 3, 5, 6, 10, 15, 20])
y = np.array([0, 5, 2, 8, 7, 16, 14])
eq_1 = x + 0.5                                # equation 1: y = x + 0.5
eq_2 = -(0.03 * x * x) + (1.32 * x) + 0.89    # equation 2: y = -0.03x^2 + 1.32x + 0.89
print('residual sum of squares of equation 1 is: ' + str(np.sum(np.square(eq_1 - y))))
print('residual sum of squares of equation 2 is: ' + str(np.sum(np.square(eq_2 - y))))
# In[19]:
#number 2
z = np.polyfit(x, y, 1)   # [slope, intercept] of the best fit line
print(z)
best_fit_rss = np.sum(np.square(np.polyval(z, x) - y))
print('residual sum of squares of the best fit line is: ' + str(best_fit_rss))
#the best fit line has a lower residual sum of squares than either equation from question 1
# In[20]:
#number 3
X = np.array([[1], [3], [5], [6], [10], [15], [20]])
y = np.array([0, 5, 2, 8, 7, 16, 14])
reg = LinearRegression().fit(X, y)
print(reg.score(X, y))     # R^2 of the fit
print(reg.coef_)           # slope
print(reg.intercept_)      # intercept
print(reg.predict(np.array([[4], [7], [18]])))   # predictions for x = 4, 7, 18
# In[94]:
#number 4
data = pd.read_csv("1649739255556_breast_cancer.csv")
print(data.columns)
print(data.info())
radius_4 = np.array(data['mean radius'])
perimeter_4 = np.array(data['mean perimeter'])
# scatter the data, then overlay the circumference line y = 2*pi*r on the same axes
plt.scatter(radius_4, perimeter_4, s=10, label='data')
plt.plot(radius_4, 2 * np.pi * radius_4, color='red', label='y = 2*pi*r')
plt.xlabel('mean radius')
plt.ylabel('mean perimeter')
plt.legend()
plt.show()
#the points track the circumference line closely, which suggests the tumor
#cross-sections are roughly circular in shape
# In[89]:
#number 5
radius_5 = np.array(data['mean radius'])
perimeter_5 = np.array(data['mean perimeter'])
slope, intercept = np.polyfit(radius_5, perimeter_5, 1)
print(slope, intercept)
#compare the slope to 2*pi (about 6.28) from the circumference relation in question 4
# In[90]:
#number 6
radius_6 = np.array(data['mean radius'])
area_6 = np.array(data['mean area'])
result_6 = np.polyfit(radius_6, area_6, 2)
print(result_6)
#compare the leading coefficient to pi (about 3.14) from the area formula pi*r^2
# In[87]:
#number 7
target_7 = data[['mean radius']]
for cols in (['mean perimeter', 'mean area'], ['mean perimeter'], ['mean area']):
    reg_7 = LinearRegression().fit(data[cols], target_7)
    pred_labels_7 = reg_7.predict(data[cols])
    print(cols, 'MSE:', mean_squared_error(target_7, pred_labels_7))
#the univariate models have a higher mean squared error, so using both
#"mean perimeter" and "mean area" as features performs better
# In[83]:
#number 8
X_train = data[['mean radius']]
y_label = data['is malignant']
clf = LogisticRegression().fit(X_train, y_label)
pred_vals = clf.predict_proba(X_train)   # column 1 holds P(is malignant)
pred_labels = clf.predict(X_train)
radius_vals = X_train['mean radius'].values
# raw labels against mean radius
plt.scatter(radius_vals, y_label, s=10, label='data')
# fitted logistic curve, sorted by radius so the line draws cleanly
order = np.argsort(radius_vals)
plt.plot(radius_vals[order], pred_vals[order, 1], color='red', label='logistic fit')
plt.xlabel('mean radius')
plt.ylabel('is malignant')
plt.legend()
plt.show()
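As a sanity check (not part of the original prompt), the fitted curve can also be written in closed form, p(r) = 1 / (1 + e^-(b0 + b1*r)). A minimal sketch, assuming the clf model fitted in the cell above:
# In[ ]:
# reconstruct the logistic curve from the fitted intercept and coefficient
b0 = clf.intercept_[0]
b1 = clf.coef_[0][0]
r = np.linspace(data['mean radius'].min(), data['mean radius'].max(), 200)
p = 1.0 / (1.0 + np.exp(-(b0 + b1 * r)))
plt.plot(r, p)
plt.xlabel('mean radius')
plt.ylabel('P(is malignant)')
plt.show()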
# In[56]:
#number 9
matrix = confusion_matrix(y_label, pred_labels, labels=[1, 0])
print('Confusion matrix : \n', matrix)
# with labels=[1, 0], the flattened order is tp, fn, fp, tn
tp, fn, fp, tn = confusion_matrix(y_label, pred_labels, labels=[1, 0]).reshape(-1)
print('Outcome values : \n', tp, fn, fp, tn)
# classification report for precision, recall, f1-score and accuracy
matrix = classification_report(y_label, pred_labels, labels=[1, 0])
print('Classification report : \n', matrix)
#given the number of false negatives (malignant tumors labeled benign), I would
#not trust this single-feature model to predict breast cancer on its own
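To make that judgment concrete, sensitivity and specificity follow directly from the four counts. A minimal sketch, assuming tp, fn, fp, tn from the cell above:
# In[ ]:
sensitivity = tp / (tp + fn)   # fraction of malignant tumors correctly flagged
specificity = tn / (tn + fp)   # fraction of benign tumors correctly cleared
print('sensitivity:', sensitivity)
print('specificity:', specificity)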
# In[69]:
#number 10
all_feature_cols = ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension']
X_train_improve = data[all_feature_cols]
# raise max_iter so the solver converges with ten features
clf_improve = LogisticRegression(max_iter=1000).fit(X_train_improve, y_label)
pred_labels_improve = clf_improve.predict(X_train_improve)
matrix = confusion_matrix(y_label, pred_labels_improve, labels=[1, 0])
print('Confusion matrix : \n', matrix)
# with labels=[1, 0], the flattened order is tp, fn, fp, tn
tp, fn, fp, tn = confusion_matrix(y_label, pred_labels_improve, labels=[1, 0]).reshape(-1)
print('Outcome values : \n', tp, fn, fp, tn)
# classification report for precision, recall, f1-score and accuracy
matrix = classification_report(y_label, pred_labels_improve, labels=[1, 0])
print('Classification report : \n', matrix)
#from the performance metrics, the ten-feature model performs far better than the
#single-feature model from question 8
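One caveat: both models are evaluated on the same data they were trained on, which can flatter the comparison. A minimal held-out check, assuming the data, all_feature_cols, and y_label defined above (train_test_split comes from sklearn.model_selection and is not imported in the original solution):
# In[ ]:
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(
    data[all_feature_cols], y_label, test_size=0.25, random_state=0)
clf_holdout = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print('held-out accuracy:', clf_holdout.score(X_te, y_te))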
Get the best Regression and Classification in Python assignment help and tutoring services from our experts now!
About The Author - Dr. Emily Carter
Dr. Emily Carter, a seasoned data scientist specializing in machine learning and biomedical applications, leads this exploration of regression and classification in Python. Drawing on extensive experience with medical datasets, she walks through polynomial fitting, linear regression, and logistic regression to uncover insights in breast cancer prediction.