Introduction
One of the most important ideas in machine learning is feature importance—understanding which input variables actually influence a model’s predictions. For students entering Data Science, this concept often feels straightforward at first, but quickly becomes more subtle when applied in real models.
In this walkthrough, we’ll use the Cleveland Heart Disease dataset from the UCI Machine Learning Repository to explore the full pipeline:
Acquiring real-world data (and dealing with installation challenges)
Cleaning and preparing messy medical data
Training a Random Forest model
Interpreting feature importance—and avoiding a very common misunderstanding
1. Acquiring the Data: Real-World Data Is Rarely Plug-and-Play
In many introductory tutorials, datasets are clean and immediately available. Real-world data is rarely that convenient.
For this example, we use the Cleveland Heart Disease Dataset, which is commonly accessed through Python libraries such as ucimlrepo.
Before you can even load the dataset, you may need to install an external package:
pip install ucimlrepo
Then, you can retrieve the dataset:
from ucimlrepo import fetch_ucirepo
heart_disease = fetch_ucirepo(id=45)
X = heart_disease.data.features
y = heart_disease.data.targets
Key Lesson
Even at the “data loading” stage, students encounter a real-world issue:
Data science workflows often depend on external packages that are not included in standard Python installations.
This is an important shift in thinking, from “writing code” to “managing environments and dependencies.”
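One small way to make that shift concrete is to check for a dependency before importing it. The sketch below uses only the standard library; the helper name `is_installed` is made up for illustration and is not part of ucimlrepo or any other package.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


# Report which of the packages used in this walkthrough are available.
for name in ["ucimlrepo", "pandas", "numpy", "sklearn"]:
    status = "found" if is_installed(name) else "missing (install it with pip)"
    print(f"{name}: {status}")
```

Keep in mind that the import name and the pip name can differ: `sklearn` is installed with `pip install scikit-learn`.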
2. Cleaning the Data: The Unseen Majority of Data Work
Once the dataset is loaded, it is rarely ready for modeling.
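In the raw Cleveland data files, missing values are written as "?", which is not automatically interpreted as a numeric null. Depending on how the data is loaded, these entries may already arrive as NaN, but handling the placeholder explicitly keeps the pipeline safe either way.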
This requires explicit cleaning:
import numpy as np
import pandas as pd
# Replace missing value placeholders
X_clean = X.replace("?", np.nan)
# Convert all columns to numeric
X_clean = X_clean.apply(pd.to_numeric)
# Fill missing values (median is a common choice)
X_clean = X_clean.fillna(X_clean.median())
Key Lesson
At this stage, students often discover an important reality:
Most of the effort in data science is not modeling—it is preparing the data so a model can even run correctly.
Without cleaning, even powerful algorithms like Random Forests will either fail or produce misleading results.
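To see why the placeholder matters, here is a minimal sketch on a toy table. The values are made up for illustration and are not taken from the real dataset; only the column names `age` and `ca` are borrowed from it.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; "ca" contains a "?" placeholder.
toy = pd.DataFrame({"age": [63, 67, 41], "ca": ["0", "?", "2"]})

# Because of the "?", pandas cannot treat "ca" as a numeric column.
ca_is_numeric_before = pd.api.types.is_numeric_dtype(toy["ca"])

# Same three steps as above: placeholder -> NaN, convert to numbers, fill with median.
cleaned = toy.replace("?", np.nan).apply(pd.to_numeric)
missing_per_column = cleaned.isna().sum()
filled = cleaned.fillna(cleaned.median())

print("numeric before cleaning:", ca_is_numeric_before)
print(missing_per_column)
print(filled)
```

Counting missing values per column before filling them is a cheap habit: it shows how much of each column is being imputed rather than observed.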
3. Modeling: Random Forests and the Feature Importance Trap
Now we train a Random Forest classifier:
from sklearn.ensemble import RandomForestClassifier
# The raw target ranges from 0 (no disease) to 4; collapse it to 0 vs. 1
y_binary = (y.iloc[:, 0] > 0).astype(int)
model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)
model.fit(X_clean, y_binary)
We then extract feature importance:
import pandas as pd
importance = pd.DataFrame({
    "Feature": X_clean.columns,
    "Importance": model.feature_importances_
}).sort_values(by="Importance", ascending=False)
print(importance)
A typical result might show a feature like thalach (maximum heart rate achieved) as the most important predictor, with a value such as 0.127.
The Common Misinterpretation
This is where many beginners make a mistake:
“0.127 looks small—so this feature must not be very important.”
This interpretation is incorrect.
Random Forest feature importance values:
Sum to 1.0, because scikit-learn normalizes them
Represent relative contribution across all features
Are not probabilities or absolute effect sizes
In a dataset with ~13 features, the “average” importance is:
1 / 13 ≈ 0.077
So a value like 0.127 is actually about 1.65 times the average share.
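The comparison against an even split can be written out as a few lines of arithmetic. The value 0.127 is the illustrative number quoted above, not a guaranteed result; the variable names are chosen for this sketch only.

```python
n_features = 13
observed = 0.127  # illustrative importance quoted in the text

# If every feature mattered equally, each would receive the same share of 1.0.
uniform_share = 1 / n_features

# How many times larger than the even share is the observed value?
ratio = observed / uniform_share

print(f"even share per feature: {uniform_share:.3f}")  # 0.077
print(f"observed / even share:  {ratio:.2f}")          # 1.65
```

Reading an importance as a multiple of the even share is often more informative than reading the raw number on its own.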
Why Importance Values Feel Smaller Than Expected
Feature importance is distributed across:
Correlated variables (e.g., heart rate, chest pain, exercise response)
Many small contributions from multiple splits
The averaging effect across all of the trees in the forest (100 in this example)
This leads to a natural spread where even strong predictors may not dominate visually.
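The effect of correlated variables can be illustrated with synthetic data. The sketch below is an assumption-laden toy, not the heart disease dataset: one column carries the signal, three columns are pure noise, and in the second run the signal column is simply duplicated to stand in for a perfectly correlated feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples = 500

# Synthetic data: one informative signal plus three pure-noise columns.
signal = rng.normal(size=n_samples)
noise = rng.normal(size=(n_samples, 3))
labels = (signal + 0.3 * rng.normal(size=n_samples) > 0).astype(int)


def fit_importances(features):
    """Fit a forest and return its built-in (impurity-based) importances."""
    forest = RandomForestClassifier(n_estimators=100, random_state=42)
    forest.fit(features, labels)
    return forest.feature_importances_


# Case 1: the signal appears once.
alone = fit_importances(np.column_stack([signal, noise]))

# Case 2: the same signal appears twice (a perfectly correlated copy).
with_copy = fit_importances(np.column_stack([signal, signal, noise]))

print("signal alone:      ", round(alone[0], 3))
print("signal with a copy:", round(with_copy[0], 3), "+", round(with_copy[1], 3))
```

The signal is no less predictive in the second run, yet its individual score drops because the forest can split on either copy. Real correlated features are rarely exact duplicates, so the sharing is usually less extreme than in this toy.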
Key Insight
Feature importance is not about how “big” a number looks. Rather, it is about how the model distributes decision-making across variables.
This is one of the most important conceptual shifts in early machine learning education.
Conclusion
Using the Cleveland Heart Disease dataset, we walked through a complete, beginner-friendly machine learning workflow:
Installing and loading real-world datasets
Cleaning imperfect medical data
Training a Random Forest classifier
Interpreting feature importance correctly
The most valuable takeaway is not the model itself, but the interpretation:
In data science, understanding what a model is doing is often more important than achieving high accuracy.
Feature importance is a powerful tool—but only when it is interpreted correctly.
Optional Next Step
To deepen understanding, try:
Removing the top 3 features and retraining the model
Using permutation importance instead of built-in importance
Comparing results across multiple models (Logistic Regression vs Random Forest)
These experiments often reveal that “importance” is not absolute. It depends on the model and on how importance is measured.
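As a starting point for the second and third experiments, here is one possible sketch using scikit-learn's `permutation_importance` on two models. It runs on synthetic stand-in data from `make_classification` rather than the heart disease dataset, and the split sizes and repeat counts are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the heart disease dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, estimator in models.items():
    estimator.fit(X_train, y_train)
    # Shuffle each column of the held-out data and record the drop in accuracy.
    perm = permutation_importance(
        estimator, X_test, y_test, n_repeats=10, random_state=0
    )
    results[name] = perm.importances_mean
    print(name, perm.importances_mean.round(3))
```

Unlike the built-in scores, permutation importances are not normalized: they do not sum to 1.0 and can even be slightly negative, so they should be compared within a model rather than read as shares.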
About the Author:
Dr. Dax Bradley is a professor of Computer Science and a lifelong connoisseur of all things nerdy. When he’s not teaching data structures or debugging Python code, he’s diving into Dungeons & Dragons campaigns, quoting obscure B-movies, or debating the finer points of Star Wars canon. He believes comic books are literature, bad movies deserve love, and if there’s a bigger nerd in the room, he’d really like to meet them.

