❓ Machine Learning with scikit-learn

# Run all cells in sequence using Shift + Enter; skipping cells or running them out of order can cause errors
from pykubegrader.initialize import initialize_assignment

responses = initialize_assignment("9_scikit_learn_q", "week_9", "readings", assignment_points = 43.0, assignment_tag = 'week9-readings')

# Initialize Otter
import otter
grader = otter.Notebook("9_scikit_learn_q.ipynb")

❓ Machine Learning with scikit-learn

Note: We expect you to use the scikit-learn documentation to answer these questions. Searching and reading documentation is a crucial skill for coding and machine learning. College is not about the material you learn but about developing your ability to learn how to learn.

# Run this block of code by pressing Shift + Enter to display the question
from questions._9_scikit_learn_q import Question1
Question1().show()

# Run this block of code by pressing Shift + Enter to display the question
from questions._9_scikit_learn_q import Question2
Question2().show()

# Run this block of code by pressing Shift + Enter to display the question
from questions._9_scikit_learn_q import Question3
Question3().show()

Hello World of Machine Learning

The MNIST dataset is a collection of handwritten digits. Each image is 28x28 pixels, and each pixel is a grayscale value between 0 and 255. The original dataset is split into a training set of 60,000 images and a test set of 10,000 images. The OpenML copy loaded below ('mnist_784') provides all 70,000 images together, so we create our own training and test split.
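
To make the pixel layout concrete, here is a minimal sketch of how a flat row of 784 values maps onto a 28x28 image, which is the same reshape the plotting code later in this notebook performs. The array `flat_image` is a hypothetical stand-in filled with random values, not real MNIST data.

```python
import numpy as np

# Hypothetical stand-in for one MNIST row: 784 grayscale values in [0, 255].
rng = np.random.default_rng(0)
flat_image = rng.integers(0, 256, size=784, dtype=np.uint8)

# Reshape the flat vector into a 28x28 grid (NumPy uses row-major order by default).
image = flat_image.reshape(28, 28)

# Pixel (row r, column c) of the grid is element r * 28 + c of the flat vector.
r, c = 5, 10
assert image[r, c] == flat_image[r * 28 + c]

print(image.shape, image.dtype)
```

Reshaping does not change any pixel values; it only changes how the same 784 numbers are indexed.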

The goal of this example is to build a classifier that can correctly identify the digit in an image. We will walk through the steps of loading the data, training a model, and evaluating the model.
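
Before working through the exercise, it may help to see the same load, split, scale, train, and evaluate sequence on a much smaller dataset. The sketch below is an illustration only: it assumes scikit-learn's built-in `load_digits` data (1,797 images of 8x8 pixels) instead of MNIST, so it runs in a few seconds and needs no download. The variable names are chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the small built-in digits dataset (8x8 images), used here in place of MNIST.
X, y = load_digits(return_X_y=True)

# Hold out 20% of the samples for testing; the seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, then apply it to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a logistic regression classifier on the scaled training data.
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_train_scaled, y_train)

# Evaluate on the held-out test set.
accuracy = clf.score(X_test_scaled, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Fitting the scaler on the training split alone keeps information about the test images out of the preprocessing step, so the test accuracy stays an honest estimate.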

# We need to import the necessary libraries
# numpy as np
...
# matplotlib.pyplot as plt
...
# sklearn.datasets import fetch_openml
...
# sklearn.model_selection import train_test_split
...
# sklearn.linear_model import LogisticRegression
...
# sklearn.preprocessing import StandardScaler
...
# Load the MNIST dataset
# Use sklearn.datasets.fetch_openml with the dataset name 'mnist_784'; version is an optional parameter, so set it explicitly as version=1
# Assign the result to the variable mnist
...
# mnist.data is the data, mnist.target is the labels, assign these to X and y respectively
...
# Split the data into training and test sets
# Use sklearn.model_selection.train_test_split and assign the results to X_train, X_test, y_train, and y_test
# Set test_size so that the test split contains 20% of the data
# random_state is the seed used by the random number generator; set it to 42 for reproducibility and testing - Note: 42 is the answer to the universe
...

# Standardize the data
# sklearn.preprocessing.StandardScaler is used to standardize the data
# Create the scaler and save it as an object named scaler
...
# Fit the scaler to the training data only, then transform both the training and test data; assign the transformed data to X_train_scaled and X_test_scaled respectively
# You can use the fit_transform method on the training data to fit and transform in one step, then use the transform method (not fit_transform) on the test data
...
# Train a Logistic Regression model
# Use the LogisticRegression class to create a model, and save the object to the variable clf, which stands for classifier
# Set max_iter to 1000, solver to 'lbfgs', and random_state to 42 for reproducibility and testing
...
# Fit the model to the scaled training data using the fit method - since clf is an object, we call fit with dot notation, and we do not need to assign the result to a variable
...
# Predict on the test set
# Use the predict method to predict the labels of the scaled test data (X_test_scaled), and assign the result to the variable y_pred
...
# Note: we are actually going to train a machine learning model. It might take a minute or two to train - on my computer it takes about 30 seconds.

# We have provided some code to plot and display a random image from the test set, and print the true and predicted labels.
# Did this do a good job?

# Choose a random test sample
random_index = np.random.randint(0, len(X_test))
random_image = X_test.iloc[random_index].values.reshape(28, 28)
true_label = y_test.iloc[random_index]
predicted_label = y_pred[random_index]

# Plot the image with prediction and true label
plt.figure(figsize=(4, 4))
plt.imshow(random_image, cmap="gray")
plt.title(f"True Label: {true_label}\nPredicted: {predicted_label}", fontsize=14)
plt.axis("off")
plt.show()
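
A single random image is only an anecdote, so a fuller answer to "Did this do a good job?" usually looks at every test prediction. The sketch below shows one way to do that with `accuracy_score` and `confusion_matrix` from `sklearn.metrics`. The arrays `y_true_demo` and `y_pred_demo` are hypothetical labels made up for illustration; in this notebook you would pass `y_test` and `y_pred` instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only; in the notebook, use y_test and y_pred.
y_true_demo = np.array(["3", "5", "3", "8", "5", "8", "3", "5"])
y_pred_demo = np.array(["3", "5", "8", "8", "5", "3", "3", "5"])

# Fraction of predictions that match the true labels.
accuracy = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true labels and columns are predicted labels, in the order given by labels.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["3", "5", "8"])

print(f"Accuracy: {accuracy:.2f}")
print(cm)
```

Off-diagonal entries of the confusion matrix show which digits are mistaken for which; in this toy data, a "3" is predicted as "8" once and an "8" as "3" once.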

grader.check("mnist-classifier")

Submitting Assignment

Please run the following block of code using Shift + Enter to submit your assignment. You should then see your score.

from pykubegrader.submit.submit_assignment import submit_assignment

submit_assignment("week9-readings", "9_scikit_learn_q")