- 26th Jul 2024
- 17:45 pm
In this assignment, develop a logistic regression model for classifying a customer as a purchaser or nonpurchaser. Partition the data randomly into a training set (60%) and a validation set (40%). Run logistic regression with an L2 penalty, using the method LogisticRegressionCV. Please submit Python code. Tell a high-level story of the steps taken to get to the end result. Start with the framework, i.e., objective, exploration, and variable selection (PCA, correlation, etc.). Then provide the final results and a comparison of the training vs. validation vs. test data. Present your findings in PowerPoint format (no more than 5 slides) in terms of steps taken and results.
The Logistic Regression:
- Run the logistic regression on all variables apart from Spending and sequence_number
- Partition the whole data set randomly into a training set (60%) and a validation set (40%)
- Run quick descriptive stats for the validation and training datasets
- Fit a logistic regression (set penalty="l2" and C=1e42 to avoid regularization), then predict on the validation dataset
- Confusion matrix for all sets
Show some use of statsmodels if possible
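The steps listed above can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not on the Tayko file; the variable names (`X_valid`, `cm_train`, `cm_valid`) and the settings `Cs=10` and `cv=5` are assumptions made for the example. It uses `LogisticRegressionCV`, whose default penalty is L2, as the assignment text requests.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors and the Purchase flag (assumption).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5, random_state=1)

# 60% training, 40% validation, keeping the class balance in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=1
)

# L2 is the default penalty; the strength C is picked by 5-fold cross-validation.
clf = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)
clf.fit(X_train, y_train)

# One confusion matrix per partition (rows: actual, columns: predicted).
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_valid = confusion_matrix(y_valid, clf.predict(X_valid))

print("Chosen C:", np.ravel(clf.C_)[0])
print("Training confusion matrix:\n", cm_train)
print("Validation confusion matrix:\n", cm_valid)
```

Comparing the two matrices side by side is a quick way to see whether the model behaves much worse on data it was not fitted on.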
- Develop a model for predicting spend among purchasers. Refer to problem #3 in case study. Create subsets of the training and validation sets for only purchasers’ records by filtering for Purchase = 1. Develop models for predicting spending with the filtered datasets, using: Multiple linear regression (use stepwise regression). Choose one model on the basis of its performance on the validation data. Please submit Python code.
- Tell a high-level story of the steps taken to get to the end result. Start with the framework, i.e., objective, exploration, and variable selection (PCA, correlation, exhaustive search, etc.). Then provide the final results and a comparison of the training vs. validation vs. test data.
- Present your findings in PowerPoint format (no more than 3 slides) in terms of steps taken and results
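The assignment asks for stepwise regression with the model chosen on validation performance. One possible reading of that, sketched on synthetic data, is forward selection that adds a predictor only while the validation RMSE keeps improving. The column names `x0` to `x5`, the helper `valid_rmse`, and the stopping rule are all assumptions for illustration; textbook stepwise procedures often use AIC on the training data instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: only x0 and x3 actually drive the target (assumption).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"x{i}" for i in range(6)])
y = 3.0 * X["x0"] - 2.0 * X["x3"] + rng.normal(scale=0.5, size=n)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.40, random_state=0)


def valid_rmse(cols):
    """Fit on the training rows with the given columns, score on validation."""
    model = LinearRegression().fit(X_train[cols], y_train)
    return float(np.sqrt(mean_squared_error(y_valid, model.predict(X_valid[cols]))))


selected, remaining = [], list(X.columns)
best_rmse = float("inf")
while remaining:
    # Try adding each remaining column and keep the one with the lowest RMSE.
    scores = {c: valid_rmse(selected + [c]) for c in remaining}
    candidate = min(scores, key=scores.get)
    if scores[candidate] >= best_rmse:
        break  # no candidate improves the validation RMSE any further
    selected.append(candidate)
    remaining.remove(candidate)
    best_rmse = scores[candidate]

print("Selected predictors:", selected)
print("Validation RMSE:", round(best_rmse, 3))
```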
21.3 Tayko Software Cataloger - BUS 4023
Free Assignment Solution - 21.3 Tayko Software Cataloger - BUS 4023
# The IPython magics (%%capture, %matplotlib inline) are omitted so the code runs as plain Python.
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

pd.set_option('display.max_columns', None)
sns.set(style="white")
warnings.simplefilter("ignore")
"""## Question 1"""
df = pd.read_csv("/content/1656903894426_Tayko(4).csv")
print("Our original dataset has {} rows and {} columns.\n".format(df.shape[0], df.shape[1]))
df.head()
df.tail()
# Data type check
df.info()
# Rename all columns: replace spaces with underscores
df.columns = [str(col).replace(" ", "_") for col in df.columns]
# descriptive statistics
df.describe()
# check for missing values
df.isnull().any()
# Remove irrelevant features
df = df.drop(["Spending", "sequence_number"], axis=1)
# Count the number of unique values in each variable
for col in df.columns:
    print(f"{col} has {df[col].nunique()} unique values.")
# Data Visualization
# Heatmap showing correlation between variables
fig, ax = plt.subplots(figsize=(23, 23))
plt.title("Correlation Plot")
# np.bool was removed in NumPy 1.24; the built-in bool is equivalent here.
# The all-False mask hides nothing, so the full matrix is drawn.
sns.heatmap(df.corr(), mask=np.zeros_like(df.corr(), dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax, annot=True, linewidths=5)
plt.show()
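# Count plot for each binary variable
for col in df.columns:
    if df[col].nunique() <= 2:
        fig, ax = plt.subplots()
        fig.set_size_inches(15, 5)
        sns.countplot(x=df[col], palette="Set3")
# Histogram for each variable with more than two distinct values
for col in df.columns:
    if df[col].nunique() > 2:
        fig, ax = plt.subplots()
        fig.set_size_inches(15, 5)
        # sns.distplot is deprecated; histplot with kde=True draws a comparable plot
        sns.histplot(x=df[col], kde=True, color="m")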
# PCA: reduce the predictors to 5 principal components
from sklearn.decomposition import PCA # to apply PCA
data = df.drop("Purchase", axis=1)
pca = PCA(n_components = 5)
pca.fit(data)
data_pca = pca.transform(data)
data_pca = pd.DataFrame(data_pca,columns=['PC1','PC2','PC3', 'PC4','PC5'])
data_pca.head()
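PCA is sensitive to the scale of its inputs, so it can help to check how much variance each component explains with and without standardization. The sketch below does this on synthetic columns with very different scales; the data and the names `raw_ratio` and `scaled_ratio` are assumptions for illustration, and whether scaling is appropriate for the Tayko predictors is a judgment call the code above does not address.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Four independent columns on very different scales (assumption).
rng = np.random.default_rng(42)
data = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])

# PCA on the raw values: the largest-scale column dominates the first component.
raw_pca = PCA(n_components=2).fit(data)
raw_ratio = raw_pca.explained_variance_ratio_

# PCA after standardizing each column to zero mean and unit variance.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(data)
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_

print("Explained variance ratio, raw data:   ", raw_ratio)
print("Explained variance ratio, standardized:", scaled_ratio)
```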
# Logistic Regression
X = data_pca
y = df["Purchase"]
# Split the data: 60% for training and 40% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, stratify=y)
clf1 = LogisticRegression(penalty="l2", C=1e42)
clf1.fit(X_train, y_train)
predictions= clf1.predict(X_test)
train_acc1 = clf1.score(X_train, y_train)*100
test_acc1 = clf1.score(X_test, y_test)*100
labels = ["no purchase", "purchase"]
print("Accuracy on training set: {:.3f}%. \n".format(train_acc1))
print("Accuracy on test set: {:.3f}%. \n".format(test_acc1))
print("Classification Report: \n",classification_report(y_test, predictions, target_names=labels))
print()
cm = confusion_matrix(y_test, predictions)
print("Confusion Matrix: " )
confusion = pd.DataFrame(cm, columns = labels, index = labels)
confusion
"""## Question 2:"""
df = pd.read_csv("/content/1656903894426_Tayko(4).csv")
df = df[df["Purchase"]==1].reset_index(drop=True)
print(f"We have {df.shape[0]} purchasers")
df.head()
df.tail()
df.info()
df.describe()
df = df.drop(["Purchase", "sequence_number"],axis=1)
# Data Visualization
# Heatmap showing correlation between variables
fig, ax = plt.subplots(figsize=(23, 23))
plt.title("Correlation Plot")
# np.bool was removed in NumPy 1.24; the built-in bool is equivalent here.
sns.heatmap(df.corr(), mask=np.zeros_like(df.corr(), dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax, annot=True, linewidths=5)
plt.show()
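# Count plot for each binary variable
for col in df.columns:
    if df[col].nunique() <= 2:
        fig, ax = plt.subplots()
        fig.set_size_inches(15, 5)
        sns.countplot(x=df[col], palette="Set3")
# Histogram for each variable with more than two distinct values
for col in df.columns:
    if df[col].nunique() > 2:
        fig, ax = plt.subplots()
        fig.set_size_inches(15, 5)
        # sns.distplot is deprecated; histplot with kde=True draws a comparable plot
        sns.histplot(x=df[col], kde=True, color="m")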
# PCA: reduce the predictors to 5 principal components
from sklearn.decomposition import PCA # to apply PCA
data = df.drop(["Spending"], axis=1)
pca = PCA(n_components = 5)
pca.fit(data)
data_pca = pca.transform(data)
data_pca = pd.DataFrame(data_pca,columns=['PC1','PC2','PC3', 'PC4','PC5'])
data_pca.head()
# Linear Regression
import statsmodels.api as sm
X = data_pca
y = df["Spending"]
x = sm.add_constant(X)
lm = sm.OLS(y,x).fit()
lm.summary()
# Remove non-significant features from the linear regression
X = data_pca[["PC1", "PC2", "PC3"]]
y = df["Spending"]
x = sm.add_constant(X)
lm = sm.OLS(y,x).fit()
lm.summary()
About The Author - Alex Gen
Alex Gen is a data analyst specializing in predictive modeling and data analysis. In this assignment, Alex focused on developing a logistic regression model to classify customers as purchasers or non-purchasers, utilizing Python’s LogisticRegressionCV with an L2 penalty. The process involved partitioning the data into training (60%) and validation (40%) sets, applying PCA and correlation for variable selection, and running descriptive statistics.