ISYE 6501 Course homework assignment one solution

Typology: Assignments · 2023/2024

Uploaded on 06/15/2025 by daniel-rong

HW1
2024-08-25
Question 2.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a classification
model would be appropriate. List some (up to 5) predictors that you might use.
Answer
Since my friends are fairly into reading, I sometimes have to pick a book as a gift for their birthdays. To narrow
down whether a friend will like a book, I check Goodreads for information such as the following, which would
make good predictors:
1. Whether the person has the book on their ‘To-read’ list (binary variable). If yes, this is likely
predictive of the book being a good choice.
2. The number of books the person has read in the genre of the book I have chosen (numerical
variable).
3. The number of books the person has read in genres similar to the book I have chosen (for example,
sci-fi books are likely similar to young-adult/action-adventure books).
4. Time since the person last finished a book, in months. The longer it has been, the more likely
the person will like the book.
5. The number of the person's friends who have read the book or have it on their ‘to-read’
list (numerical variable). The more friends who do, the more likely the person will like the
book.
Question 2.2
The files credit_card_data.txt (without headers) and credit_card_data-headers.txt (with headers) contain
a dataset with 654 data points, 6 continuous and 4 binary predictor variables. It has anonymized credit
card applications with a binary response variable (last column) indicating if the application was positive
or negative. The dataset is the “Credit Approval Data Set” from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Credit+Approval) without the categorical variables and without
data points that have missing values.
1. Using the support vector machine function ksvm contained in the R package kernlab, find a good
classifier for this data. Show the equation of your classifier, and how well it classifies the data points
in the full data set. (Don’t worry about test/validation data yet; we’ll cover that topic soon.)
df <- read.table("C:/Users/tungh/OneDrive/Georgia Tech/ISYE6501/Module 2/hw1/data 2.2/credit_card_data-headers.txt",
header = TRUE)
head(df)
## A1 A2 A3 A8 A9 A10 A11 A12 A14 A15 R1
## 1 1 30.83 0.000 1.25 1 0 1 1 202 0 1
## 2 0 58.67 4.460 3.04 1 0 6 1 43 560 1



dim(df)

## [1] 654 11

We first run the starter code from the homework; with the default parameter C = 100 we get close to 86.4% accuracy.

We will also print the coefficients a1 through am and the intercept a0.

# install.packages('kernlab')
data <- as.matrix(df)
library("kernlab")

Warning: package ’kernlab’ was built under R version 4.4.

# call ksvm. Vanilladot is a simple linear kernel.
model <- ksvm(as.matrix(data[, 1:10]), data[, 11],
              type = "C-svc", kernel = "vanilladot",
              C = 100, scaled = TRUE)

Setting default kernel parameters

# calculate a1...am
a <- colSums(model@xmatrix[[1]] * model@coef[[1]])
a

## A1 A2 A3 A8 A9

## A10 A11 A12 A14 A15

# calculate a0
a0 <- -model@b
a0

## [1] 0.

# see what the model predicts
pred <- predict(model, data[, 1:10])
pred

## [1] 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

## [38] 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
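For reference, the overall accuracy of this linear model can be computed directly from the predictions above. As a sanity check of the classifier equation (an assumption worth verifying: since ksvm was fitted with scaled = TRUE, the coefficients live in scaled space, so a and a0 are applied to the scaled predictors):

# accuracy of the vanilladot model on the full data set
sum(pred == data[, 11]) / nrow(data)

# sanity check of the classifier equation a.x + a0 > 0,
# applied to scaled predictors because the model was fitted with scaled = TRUE
manual <- as.numeric(scale(data[, 1:10]) %*% a + a0 > 0)
sum(manual == pred) / length(pred)

If the scaling assumption is right, the manual decision rule should agree with predict() on essentially every row.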

With a degree-2 polynomial kernel (polydot with degree = 2) we get higher accuracy. We can write a function to tune C as well.

library(kernlab)

# Define the function to evaluate models with varying C values
evaluate_svm <- function(data, C_values = 10^seq(-3, 3, by = 1)) {
  best_accuracy <- 0
  best_C <- NA

  for (C_value in C_values) {
    # Train the SVM model with the current C value
    model <- ksvm(as.matrix(data[, 1:10]), data[, 11],
                  type = "C-svc", kernel = "polydot",
                  kpar = list(degree = 2), C = C_value, scaled = TRUE)

    # Make predictions
    pred <- predict(model, data[, 1:10])

    # Calculate accuracy
    accuracy <- sum(pred == data[, 11]) / nrow(data)

    # Print the accuracy for the current C value
    cat("C:", C_value, "Accuracy:", sprintf("%.2f%%", accuracy * 100), "\n")

    # Check if this is the best accuracy so far
    if (accuracy > best_accuracy) {
      best_accuracy <- accuracy
      best_C <- C_value
    }
  }

  # Return the best C value and corresponding accuracy
  cat("\nThe best C value:", best_C, "\nBest Accuracy:",
      sprintf("%.2f%%", best_accuracy * 100), "\n")
  return(list(best_C = best_C, best_accuracy = best_accuracy))
}

# Example usage
result <- evaluate_svm(data)

C: 0.001 Accuracy: 86.39%

C: 0.01 Accuracy: 86.70%

C: 0.1 Accuracy: 87.46%

C: 1 Accuracy: 88.07%

C: 10 Accuracy: 88.84%

C: 100 Accuracy: 89.30%

C: 1000 Accuracy: 88.38%

The best C value: 100

Best Accuracy: 89.30%
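Since the coarse log-scale search points at C = 100, an optional refinement (a sketch, reusing the evaluate_svm function defined above; the grid values are arbitrary) is to check a finer grid around that value:

# finer grid around the best coarse value C = 100
result_fine <- evaluate_svm(data, C_values = c(25, 50, 75, 100, 150, 200, 400))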

Next we try the radial basis function kernel, which the class reading discusses (https://pyml.sourceforge.net/doc/howto.pdf).

model.4 <- ksvm(as.matrix(data[, 1:10]), data[, 11],
                type = "C-svc", kernel = "rbfdot",
                C = 100, scaled = TRUE)
pred <- predict(model.4, data[, 1:10])
sum(pred == data[, 11]) / nrow(data)

## [1] 0.

It performs exceptionally well; I wonder if it overfits when we have unbalanced data (many examples of one class but few of the other).

pred

## [1] 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

## [38] 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0

## [75] 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0

## [112] 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1

## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1

## [260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

## [297] 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

## [482] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

## [519] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

## [556] 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0

## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

## [630] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Looking at the results, the model does not seem to predict all 1's or all 0's, so there is no obvious evidence of overfitting on unbalanced data. Without test and validation sets, though, we can't determine this completely.
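The held-out check just described could be sketched as follows (an illustrative 80/20 split; the seed is arbitrary): train the RBF model on 80% of the rows and score the remaining 20%.

set.seed(42)  # arbitrary seed for reproducibility
idx   <- sample(nrow(data), size = floor(0.8 * nrow(data)))
train <- data[idx, ]
test  <- data[-idx, ]

m <- ksvm(as.matrix(train[, 1:10]), train[, 11], type = "C-svc",
          kernel = "rbfdot", C = 100, scaled = TRUE)
p <- predict(m, test[, 1:10])
sum(p == test[, 11]) / nrow(test)  # held-out accuracy

A large gap between this held-out accuracy and the full-data accuracy above would be the overfitting signal the text is worried about.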

Based on this, I conclude that the Gaussian kernel performs best, followed by the degree-2 polynomial with C = 100 as the second-best model.
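The imbalance worry raised earlier is easy to check directly: table() on the response column shows how many 0s and 1s the data set contains (the commented-out table(df$R1) at the end of the script does the same thing).

# class distribution of the response: counts of 0s and 1s
table(df$R1)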

Question 2.2.

  1. Using the k-nearest-neighbors classification function kknn contained in the R kknn package, suggest a good value of k, and show how well it classifies the data points in the full data set. Don’t forget to scale the data (scale=TRUE in kknn).

# install.packages('kknn')
library("kknn")

Warning: package ’kknn’ was built under R version 4.4.

(Output of the k-tuning loop, truncated in this copy: for each k from 1 through 30 the function prints k, the distance metric (2, i.e. Euclidean), and the test-set accuracy; the printed accuracies fall between roughly 88% and 89%.)

Using the Euclidean distance, the optimal value of k seems to be around 13, with validation-set accuracy of about 89%. Note that we split the data 80%/20% into training and validation sets. Below we print the predictions for the validation data; they look fairly close to the predictions made by the SVM model.
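The evaluate_knn function itself is not shown in this excerpt. A plausible reconstruction, matching the printed output (80/20 split, Euclidean distance, scaled predictors, rounded fitted values as class labels), might look like the sketch below; the seed and the exact split mechanics are assumptions.

evaluate_knn <- function(df, k, print_pred = FALSE) {
  set.seed(1)                       # assumed seed, not from the original
  idx   <- sample(nrow(df), size = floor(0.8 * nrow(df)))
  train <- df[idx, ]
  test  <- df[-idx, ]

  # kknn fits on the training rows and predicts the test rows directly
  fit  <- kknn(R1 ~ ., train, test, k = k, distance = 2, scale = TRUE)
  pred <- round(fitted(fit))        # fitted values are averaged neighbor labels
  if (print_pred) print(as.factor(pred))

  acc <- sum(pred == test$R1) / nrow(test)
  cat("k:", k, "\nMetric (distance): 2\n")
  cat("Accuracy for the test set is", sprintf("%.2f%%", acc * 100), "\n")
  invisible(acc)
}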

evaluate_knn(df, k = 13, print_pred = TRUE)

## [1] 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1

## [38] 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

## [112] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

Levels: 0 1

k: 13

Metric (distance): 2

Accuracy for the test set is 89.

# table(df$R1)