# Run all cells in sequence using Shift + Enter; skipping cells may cause errors.
from pykubegrader.initialize import initialize_assignment
responses = initialize_assignment("5_seaborn_q", "week_8", "readings", assignment_points = 20.0, assignment_tag = 'week8-readings')
# Initialize Otter
import otter
grader = otter.Notebook("5_seaborn_q.ipynb")
Seaborn Graphing Made Simple
Question 1 (Points: 8.0): Understanding Diamond Prices: A Data-Driven Exploration
Diamonds are often considered a symbol of luxury, and their prices are influenced by multiple factors such as cut, clarity, carat, and color. But how do these factors correlate with each other, and what insights can we gain from visualizing their relationships?
In this analysis, we will explore the relationship between diamond cut quality and price using a violin plot, which helps us understand the distribution of diamond prices across different cut grades. Additionally, we will generate a correlation heatmap to identify which numerical features of diamonds (such as carat weight, depth, and table size) are most strongly related to price.
By analyzing these visualizations, we can answer key questions such as:
Does a higher-quality cut always result in a higher price?
Which diamond characteristics have the strongest correlation with price?
Are there any surprising patterns or trends in the data?
Let's dive into the data and uncover the hidden patterns behind diamond pricing!
We will provide you with inline instructions to help you complete the analysis.
# Import necessary libraries
# seaborn as sns
# matplotlib.pyplot as plt
# pandas as pd
...
# Load dataset
# Load the diamonds dataset from seaborn using the `sns.load_dataset` function.
# The diamonds dataset can be accessed by the name `diamonds`. See documentation [here](https://seaborn.pydata.org/generated/seaborn.load_dataset.html) for more information.
...
# Create figure with subplots
# Create a figure with two subplots, one for the violin plot and one for the correlation heatmap. The subplots should be in a 1x2 grid.
# Save the figure to the variable `fig`, and the axes to the variable `axes`.
# The figure should be of size 16x6.
...
# Violin plot with color palette
# Use the `sns.violinplot` function to create the violin plot.
# You will need the `cut` and `price` columns of the dataframe. Columns are accessed by indexing the pandas dataframe like a dictionary, e.g. `df["cut"]` and `df["price"]`.
# The x-axis should be the `cut` column of the dataframe, and the y-axis should be the `price` column.
# The color palette should be `pastel`.
# The plot should be saved to the first subplot (`axes[0]`).
...
# Set the title, x-axis label, and y-axis label of the first subplot.
# Note that `axes` is an array of two Axes objects: the first is `axes[0]` and the second is `axes[1]`.
# The title should be "Distribution of Diamond Prices by Cut Quality".
# The x-axis label should be "Cut Quality".
# The y-axis label should be "Price (USD)".
...
# Correlation heatmap
# Create a correlation matrix of the dataframe.
# The correlation matrix should only include the numerical columns of the dataframe. You need to use the `corr` method of the dataframe. Make sure to set `numeric_only=True`.
# You can access the numerical columns of the dataframe by calling `df.select_dtypes(include=["number"])`.
# Save the correlation matrix to the variable `corr_matrix`.
...
# Use the `sns.heatmap` function to create the heatmap.
# The heatmap should be saved to the second subplot (`axes[1]`).
# The color palette should be `coolwarm`.
# Each cell should be annotated with its correlation value.
# The annotations should be formatted to 2 decimal places.
# Look at the documentation for the `sns.heatmap` function for specific guidance.
...
# Set the title, x-axis label, and y-axis label of the second subplot.
# The title should be "Correlation Matrix of Diamond Features".
# The x-axis label should be "Features".
# The y-axis label should be "Features".
...
# Show plots
# Use the `plt.tight_layout` function to adjust the layout of the plots.
# Use the `plt.show` function to display the plots.
...
grader.check("Alternative Distribution Plot with Diamonds Dataset")
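The plotting steps described in the cell above can be sketched as follows. This is a minimal illustration, not the graded solution: it uses a small synthetic stand-in frame (the name `toy` and all of its values are made up for illustration) so that it runs without downloading the real diamonds data, and it selects a non-interactive matplotlib backend so no display is required.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the diamonds data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "cut": rng.choice(["Fair", "Good", "Ideal"], size=n),
    "carat": rng.uniform(0.2, 2.0, size=n),
    "depth": rng.normal(61.5, 1.5, size=n),
})
toy["price"] = 4000 * toy["carat"] + rng.normal(0, 500, size=n)

# One row, two columns: violin plot on the left, heatmap on the right
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Price distribution per cut grade (newer seaborn releases may warn about
# passing `palette` without `hue`)
sns.violinplot(data=toy, x="cut", y="price", palette="pastel", ax=axes[0])
axes[0].set_title("Distribution of Diamond Prices by Cut Quality")
axes[0].set_xlabel("Cut Quality")
axes[0].set_ylabel("Price (USD)")

# Correlation matrix over the numerical columns only
corr_matrix = toy.select_dtypes(include=["number"]).corr()

# Annotated heatmap with values shown to 2 decimal places
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", ax=axes[1])
axes[1].set_title("Correlation Matrix of Diamond Features")
axes[1].set_xlabel("Features")
axes[1].set_ylabel("Features")

plt.tight_layout()
plt.show()
```

With the real dataset, `toy` would be replaced by the frame returned from `sns.load_dataset("diamonds")`; the rest of the calls stay the same.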
Question 2 (Points: 12.0): Create and Analyze a Dendrogram for Wine Quality Data
The Wine dataset contains chemical properties of different wines from three cultivars (wine types). Your goal is to:
Load the UCI Wine dataset from sklearn.datasets.
Use only the numerical features (excluding the wine class label).
Perform hierarchical clustering and generate a dendrogram.
Customize the dendrogram for better readability.
Analyze the results:
How many clusters are formed at a cutoff height of 200?
Based on clustering, which two wine classes appear most similar?
Expected Output:
A dendrogram visualizing hierarchical clustering of the wine dataset.
Answers to the analysis questions.
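The clustering steps listed above could be sketched roughly as follows. The description does not name a linkage method or say whether the features are scaled, so Ward linkage on the unscaled features is an assumption here, as are the figure size and axis labels. The sketch computes the cluster count at the cutoff but deliberately does not state it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import load_wine

# Numerical features only; the class label lives separately in wine.target
wine = load_wine()
X = wine.data

# Agglomerative clustering (assumed choice: Ward linkage, unscaled features)
Z = linkage(X, method="ward")

# Draw the tree, hiding the 178 leaf labels for readability
fig, ax = plt.subplots(figsize=(12, 6))
dendrogram(Z, no_labels=True, color_threshold=200, ax=ax)
ax.axhline(200, color="gray", linestyle="--")  # cutoff height from the question
ax.set_title("Hierarchical Clustering Dendrogram (Wine)")
ax.set_xlabel("Samples")
ax.set_ylabel("Distance")

# Count the flat clusters obtained by cutting the tree at height 200
labels = fcluster(Z, t=200, criterion="distance")
n_clusters = len(set(labels))
print(n_clusters)
```

Comparing `labels` against `wine.target` (for example with `pandas.crosstab`) is one way to see which wine classes end up grouped together.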
# Import necessary libraries
# `load_wine` from sklearn.datasets
# `MinMaxScaler` from sklearn.preprocessing
# `parallel_coordinates` from pandas.plotting
...
# Load the wine dataset from sklearn using the `load_wine` function.
...
# Create a DataFrame from the wine dataset
# The DataFrame should contain the feature names as columns and the target as the class column.
...
# Convert the target class to string for visualization
...
# Normalize numerical data using MinMaxScaler
...
# Normalize the numerical columns of the dataframe using the MinMaxScaler.
# The numerical columns are all columns except the `class` column.
...
# Add the target class for coloring the parallel coordinates plot
...
# Plot parallel coordinates plot of the normalized data
...
# Use the `parallel_coordinates` function from pandas to create the parallel coordinates plot.
# The dataframe should be the normalized dataframe.
# The class column should be the target class.
# The colormap should be `coolwarm`.
# The transparency should be 0.7.
...
# Set the title, x-axis label, and y-axis label of the plot.
# The title should be "Parallel Coordinates Plot of Wine Dataset".
# The x-axis label should be "Features".
# The y-axis label should be "Normalized Values".
...
grader.check("Creating and Analyzing a Dendrogram with Wine Dataset")
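The normalization and parallel coordinates steps in the cell above can be sketched as follows. This is an illustration rather than the graded solution: the variable names `df_norm` and `feature_cols`, the figure size, and the tick rotation are assumptions, and a non-interactive matplotlib backend is used so no display is required.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_wine
from sklearn.preprocessing import MinMaxScaler

# Build a DataFrame with feature columns plus a string-valued class column
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df["class"] = wine.target.astype(str)

# Rescale every numerical column to the [0, 1] range
feature_cols = [c for c in df.columns if c != "class"]
scaled = MinMaxScaler().fit_transform(df[feature_cols])
df_norm = pd.DataFrame(scaled, columns=feature_cols)

# Re-attach the class column so each line can be colored by wine class
df_norm["class"] = df["class"]

# One line per wine sample, one vertical axis per feature
fig, ax = plt.subplots(figsize=(14, 6))
parallel_coordinates(df_norm, "class", colormap="coolwarm", alpha=0.7, ax=ax)
ax.set_title("Parallel Coordinates Plot of Wine Dataset")
ax.set_xlabel("Features")
ax.set_ylabel("Normalized Values")
ax.tick_params(axis="x", rotation=45)  # long feature names overlap otherwise

plt.tight_layout()
plt.show()
```

Scaling first matters because the raw features span very different ranges; without it, the largest-valued feature would dominate the shared vertical scale.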
Submitting Assignment
Please run the following block of code using Shift + Enter to submit your assignment. You should then see your score.
from pykubegrader.submit.submit_assignment import submit_assignment
submit_assignment("week8-readings", "5_seaborn_q")