Welcome to the World of Machine Learning with Scikit-learn!
Imagine if your computer could learn how to recognize cats in photos, predict tomorrow's weather, or even recommend your next favorite song, all on its own! Sounds magical, right? Well, that's what Machine Learning (ML) is all about!
What is Machine Learning?
Machine Learning is like teaching a computer how to learn from experience, just like how humans learn! Instead of giving the computer strict instructions, we show it lots of examples, and it figures out the patterns by itself.
In Simple Words:
Machine Learning = Learning from Experience (Data) + Making Predictions or Decisions
Think of it as training a puppy:
You give the puppy a treat every time it sits on command.
The puppy learns the pattern: "If I sit, I get a treat!"
Next time, it sits without hesitation, hoping for another snack!
Similarly, we train machines by feeding them lots of data, and they learn to recognize patterns to make decisions or predictions.
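The same idea can be sketched in a few lines of code. This is only a minimal illustration with made-up numbers (the hours and scores below are hypothetical, not taken from any dataset): the model is shown five examples, picks up the pattern, and is then asked about a case it has never seen.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical examples: hours studied -> exam score (made-up numbers
# that follow the rule score = 50 + 2 * hours)
hours = [[1], [2], [3], [4], [5]]
scores = [52, 54, 56, 58, 60]

toy_model = LinearRegression()
toy_model.fit(hours, scores)  # "experience": learn the pattern from examples

# Ask about a case that was not among the examples
prediction = toy_model.predict([[6]])[0]
print(f"Predicted score for 6 hours: {prediction:.1f}")
```

Because the made-up data follows a perfectly straight line, the prediction lands on that line. Real data is noisier, so real predictions are only approximate.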
Why is Machine Learning Important?
ML is everywhere! Here are some fun examples:
Music Recommendations: Spotify or Apple Music suggesting songs you'll love.
Image Recognition: Instagram recognizing your friends in photos.
Online Shopping: Amazon recommending products based on your previous purchases.
Self-driving Cars: Learning how to navigate roads safely.
ChatGPT: Learning how to generate text that sounds like it's written by a human (or code, which you have likely used in this course).
Meet Scikit-learn: Your ML Toolkit!
Scikit-learn (also written as sklearn) is like a magic toolbox that has all the tools you need to create machine learning models!
It's open-source (free for everyone!)
Built on Python (one of the most popular programming languages for ML)
Easy to use, even if you're just getting started!
What's Inside the Toolbox?
Supervised Learning: The machine learns from labeled data (like a student learning from a textbook).
Examples:
Predicting house prices
Classifying emails as spam or not spam
Unsupervised Learning: The machine explores the data and finds patterns on its own (like a detective solving a mystery).
Examples:
Grouping customers with similar buying habits
Organizing news articles by topic
Model Evaluation & Selection: Helps you pick the best model for your problem by testing and comparing different ones.
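To make the three categories above a little more concrete, here is a rough sketch on made-up data (two well-separated clouds of random points, generated only for illustration). The choice of LogisticRegression, KMeans, and DecisionTreeClassifier is an assumption for the demo; Scikit-learn offers many other models for each category.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Made-up 2-D points: 20 near (0, 0) and 20 near (5, 5)
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])
labels = np.array([0] * 20 + [1] * 20)

# Supervised learning: the model sees the points AND their labels
clf = LogisticRegression().fit(points, labels)
print("Predicted class for (4.8, 5.1):", clf.predict([[4.8, 5.1]])[0])

# Unsupervised learning: the model sees only the points and groups them itself
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("Cluster sizes:", np.bincount(kmeans.labels_))

# Model evaluation & selection: compare two candidates with 5-fold cross-validation
for name, candidate in [
    ("logistic regression", LogisticRegression()),
    ("decision tree", DecisionTreeClassifier(random_state=0)),
]:
    fold_scores = cross_val_score(candidate, points, labels, cv=5)
    print(f"{name}: mean accuracy = {fold_scores.mean():.2f}")
```

Note that the supervised and unsupervised models share the same fit call; the difference is whether labels are passed in.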
Example: Predicting Car Prices
We'll use the Automobile Dataset from the UCI Machine Learning Repository, which can be loaded directly from a URL. The dataset contains information on car features like horsepower, engine size, weight, and price.
Step 1: Load the Dataset
We'll load the dataset directly from a URL using Pandas.
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset from UCI repository
data_url = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
)
column_names = [
    "symboling",
    "normalized_losses",
    "make",
    "fuel_type",
    "aspiration",
    "num_doors",
    "body_style",
    "drive_wheels",
    "engine_location",
    "wheel_base",
    "length",
    "width",
    "height",
    "curb_weight",
    "engine_type",
    "num_cylinders",
    "engine_size",
    "fuel_system",
    "bore",
    "stroke",
    "compression_ratio",
    "horsepower",
    "peak_rpm",
    "city_mpg",
    "highway_mpg",
    "price",
]
# Load the dataset
df = pd.read_csv(data_url, names=column_names)
# Display the first few rows of the dataset
print(df.head())
   symboling  normalized_losses         make  fuel_type  aspiration  num_doors  \
0          3                  ?  alfa-romero        gas         std        two
1          3                  ?  alfa-romero        gas         std        two
2          1                  ?  alfa-romero        gas         std        two
3          2                164         audi        gas         std       four
4          2                164         audi        gas         std       four

    body_style  drive_wheels  engine_location  wheel_base  ...  engine_size  \
0  convertible           rwd            front        88.6  ...          130
1  convertible           rwd            front        88.6  ...          130
2    hatchback           rwd            front        94.5  ...          152
3        sedan           fwd            front        99.8  ...          109
4        sedan           4wd            front        99.4  ...          136

   fuel_system  bore  stroke  compression_ratio  horsepower  peak_rpm  city_mpg  \
0         mpfi  3.47    2.68                9.0         111      5000        21
1         mpfi  3.47    2.68                9.0         111      5000        21
2         mpfi  2.68    3.47                9.0         154      5000        19
3         mpfi  3.19    3.40               10.0         102      5500        24
4         mpfi  3.19    3.40                8.0         115      5500        18

   highway_mpg  price
0           27  13495
1           27  16500
2           26  16500
3           30  13950
4           22  17450

[5 rows x 26 columns]
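The ? entries in the normalized_losses column above are placeholders for missing values. Before cleaning, it can help to count how many there are in each column. The sketch below uses a tiny hypothetical sample (values chosen only for illustration) rather than the real file, so it runs without a network connection.

```python
import pandas as pd

# Tiny hypothetical sample that mimics the dataset's "?" placeholders
sample = pd.DataFrame({
    "normalized_losses": ["?", "?", "164"],
    "horsepower": ["111", "?", "102"],
    "price": ["13495", "16500", "13950"],
})

# Count the "?" placeholders in each column
placeholder_counts = (sample == "?").sum()
print(placeholder_counts)

# Columns that contain "?" are read as text rather than as numbers
print(sample.dtypes)
```

On the full dataset, the same check would be `(df == "?").sum()`.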
Step 3: Data Preprocessing
Convert missing values (?) to NaN and then drop them for simplicity. Then convert the relevant columns to numeric data types.
# Replace '?' with NaN and drop rows with missing values
df.replace("?", np.nan, inplace=True)
df.dropna(inplace=True)
# Convert relevant columns to numeric data types
df["price"] = pd.to_numeric(df["price"])
df["horsepower"] = pd.to_numeric(df["horsepower"])
df["engine_size"] = pd.to_numeric(df["engine_size"])
df["curb_weight"] = pd.to_numeric(df["curb_weight"])
df["highway_mpg"] = pd.to_numeric(df["highway_mpg"])
# Choose features and target variable
X = df[["horsepower", "engine_size", "curb_weight", "highway_mpg"]]
y = df["price"]
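As an alternative sketch (not what the code above does), pandas can treat ? as missing while reading the file, via the na_values parameter of read_csv. The three-row CSV below is hypothetical and stands in for the real file. Dropping rows only when a column we actually use is missing is also a design choice assumed here for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the real file
raw = io.StringIO("111,130,13495\n?,152,16500\n102,109,?\n")

# na_values="?" turns the placeholder into NaN during parsing,
# so the columns are read as numbers right away
cars = pd.read_csv(
    raw, names=["horsepower", "engine_size", "price"], na_values="?"
)
print(cars.dtypes)

# Drop a row only if one of the columns we rely on is missing
clean = cars.dropna(subset=["horsepower", "price"])
print(len(clean))
```

Keep in mind that dropping by subset keeps more rows than a plain dropna(), so using it in this tutorial would change the number of cars left and therefore the results printed in the following steps.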
Step 4: Split Data into Training and Testing Sets
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
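To see what the two parameters do, here is a small sketch on ten made-up samples (the X_demo and y_demo arrays are hypothetical). test_size=0.2 sets aside 20% of the rows for testing, and a fixed random_state makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # 8 rows for training, 2 rows for testing

# Splitting again with the same random_state gives the same test rows
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(y_te, y_te_again))
```

The model never sees the test rows during training, which is what makes the later evaluation a fair check.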
Step 5: Train the Model
We'll use Linear Regression from Scikit-learn to train the model.
# Choose a model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
print("Predicted Car Prices:", predictions)
Predicted Car Prices: [18899.66601909 17249.56177096 13809.79414382 10948.22842835
13873.56362342 10230.61132624 9175.39186654 6275.61360388
8338.45295561 8934.12708317 3624.59104788 6686.72349821
6041.30535528 6792.17064464 6433.49006725 11048.94263004
13728.52798586 8771.07420939 14489.6550427 6041.30535528
17198.0765038 16482.11715294 5206.48027809 5024.15027934
9682.3993741 9032.86604074 14201.52957358 11243.74173001
8780.87995517 16634.51252543 11301.69305671 20384.28190924]
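A trained linear regression is just a weighted sum: one coefficient per feature plus an intercept. The sketch below shows this on made-up data generated from a known rule (the X_toy and y_toy arrays are hypothetical), so you can check that the model recovers the rule.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from a known rule: y = 3*x1 + 2*x2 + 1
X_toy = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 5]], dtype=float)
y_toy = 3 * X_toy[:, 0] + 2 * X_toy[:, 1] + 1

toy_model = LinearRegression().fit(X_toy, y_toy)

# The learned weights should be close to the rule used to build the data
print("Coefficients:", toy_model.coef_)
print("Intercept:", toy_model.intercept_)

# A prediction is the weighted sum of the features plus the intercept
manual = X_toy[0] @ toy_model.coef_ + toy_model.intercept_
print(np.isclose(manual, toy_model.predict(X_toy[:1])[0]))
```

For the car price model above, `dict(zip(X.columns, model.coef_))` would pair each feature name with its learned coefficient.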
Step 6: Evaluate the Model
Let's evaluate the model using Mean Squared Error (MSE) and the R² score.
# Evaluate the model
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
Mean Squared Error: 5742577.12
R² Score: 0.68
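Both metrics are simple to compute by hand, which can make them less mysterious. The sketch below uses four made-up values (y_true and y_pred are hypothetical) and checks the manual results against Scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 12.0, 13.0, 18.0])

# MSE: the average of the squared errors
mse_manual = np.mean((y_true - y_pred) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE: the square root of MSE, in the same units as the target
rmse = np.sqrt(mse_manual)

print(np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
print(f"RMSE: {rmse:.2f}")
```

MSE is in squared units, so it is hard to read directly. Taking the square root of the 5742577.12 reported above gives an RMSE of about 2,396, which means a typical prediction is off by roughly that much in the units of price. An R² of 0.68 means the model explains about 68% of the variation in the test prices.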
Why Scikit-learn is Awesome!
User-friendly: Intuitive and easy-to-understand syntax
Comprehensive: It has all the basic algorithms and tools you'll need
Community Support: Tons of tutorials, forums, and community help available
Ready to Play?
With Scikit-learn, the possibilities are endless! You can:
Predict movie ratings
Classify images of cute puppies and kittens
Build your own digital assistant
So, what are you waiting for? Grab your laptop, open your Python editor, and let's start your ML adventure with Scikit-learn!
Where to Learn More?
Kaggle for hands-on ML competitions and datasets