Tuesday, June 9, 2026

Understanding Feature Importance in Data Science: A Practical Walkthrough Using Heart Disease Data

 



Introduction

One of the most important ideas in machine learning is feature importance—understanding which input variables actually influence a model’s predictions. For students entering Data Science, this concept often feels straightforward at first, but quickly becomes more subtle when applied in real models.

In this walkthrough, we’ll use the Cleveland Heart Disease dataset from the UCI Machine Learning Repository to explore the full pipeline:

  1. Acquiring real-world data (and dealing with installation challenges)

  2. Cleaning and preparing messy medical data

  3. Training a Random Forest model

  4. Interpreting feature importance—and avoiding a very common misunderstanding


1. Acquiring the Data: Real-World Data is Rarely Plug-and-Play

In many introductory tutorials, datasets are clean and immediately available. Real-world data is rarely that convenient.

For this example, we use the Cleveland Heart Disease Dataset, which is commonly accessed through Python libraries such as ucimlrepo.

Before you can even load the dataset, you may need to install an external package:

pip install ucimlrepo

Then, you can retrieve the dataset:

from ucimlrepo import fetch_ucirepo

heart_disease = fetch_ucirepo(id=45)

X = heart_disease.data.features
y = heart_disease.data.targets

Key Lesson

Even at the “data loading” stage, students encounter a real-world issue:

Data science workflows often depend on external packages that are not included in standard Python installations.

This is an important shift in thinking, from “writing code” to “managing environments and dependencies.”
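
One way to make that dependency management visible is to check for the required packages before running anything else. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` and the contents of the `required` list are assumptions made for illustration, not part of any library.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


# Import names this walkthrough relies on (assumed list; adjust for your project).
# Note: scikit-learn is imported under the name "sklearn".
required = ["numpy", "pandas", "sklearn", "ucimlrepo"]
missing = [name for name in required if not is_installed(name)]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Running a check like this at the top of a notebook turns a confusing ImportError halfway through the analysis into a clear message at the start.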


2. Cleaning the Data: The Unseen Majority of Data Work

Once the dataset is loaded, it is rarely ready for modeling.

In the raw Cleveland file, missing values are recorded as "?", which pandas does not automatically interpret as numeric nulls. Depending on how the data is loaded, some of these may already arrive as NaN.

The explicit cleaning below handles both cases:

import numpy as np
import pandas as pd

# Replace missing value placeholders
X_clean = X.replace("?", np.nan)

# Convert all columns to numeric
X_clean = X_clean.apply(pd.to_numeric)

# Fill missing values (median is a common choice)
X_clean = X_clean.fillna(X_clean.median())
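
To see what these three steps do, it can help to run them on a tiny hand-made table first. The sketch below uses made-up values (not rows from the real dataset) and the hypothetical names `toy` and `toy_clean`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; the values are invented for illustration.
toy = pd.DataFrame({
    "age": [63, 67, 41, 56],
    "ca": ["0", "3", "?", "1"],    # "?" marks a missing value
    "thal": ["6", "?", "3", "7"],
})

# Same three steps as above: placeholder -> NaN, text -> numbers, NaN -> column median
toy_clean = toy.replace("?", np.nan)
toy_clean = toy_clean.apply(pd.to_numeric)
toy_clean = toy_clean.fillna(toy_clean.median())

print(toy_clean)
```

The missing entry in `ca` is filled with 1.0 (the median of 0, 3, and 1) and the missing entry in `thal` with 6.0 (the median of 6, 3, and 7).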

Key Lesson

At this stage, students often discover an important reality:

Most of the effort in data science is not modeling—it is preparing the data so a model can even run correctly.

Without cleaning, even powerful algorithms like Random Forests will either fail or produce misleading results.


3. Modeling: Random Forests and the Feature Importance Trap

Now we train a Random Forest classifier:

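from sklearn.ensemble import RandomForestClassifier

# The target records disease severity from 0 (absent) to 4;
# collapse it to a binary label: 0 = no disease, 1 = disease present
y_binary = (y.iloc[:, 0] > 0).astype(int)

# 100 trees; a fixed random_state makes the results reproducible
model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)

model.fit(X_clean, y_binary)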

We then extract feature importance:

import pandas as pd

importance = pd.DataFrame({
    "Feature": X_clean.columns,
    "Importance": model.feature_importances_
}).sort_values(by="Importance", ascending=False)

print(importance)

A typical result might show a feature like thalach (maximum heart rate achieved) as the most important predictor, with a value such as 0.127.




The Common Misinterpretation

This is where many beginners make a mistake:

“0.127 looks small—so this feature must not be very important.”

This interpretation is incorrect.

Random Forest feature importance values:

  • Sum to 1.0 (scikit-learn normalizes them)

  • Represent relative contribution across all features

  • Are not probabilities or absolute effect sizes
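
The first of these properties is easy to check directly. The sketch below fits a forest on synthetic data generated with scikit-learn's `make_classification` (a stand-in with 13 columns, not the heart disease data) and confirms that the importances are non-negative and add up to 1. The names `X_demo`, `y_demo`, and `demo_model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a 13-feature table; not the heart disease data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)

demo_model = RandomForestClassifier(n_estimators=100, random_state=0)
demo_model.fit(X_demo, y_demo)

importances = demo_model.feature_importances_
print("Number of features:", len(importances))
print("Smallest importance:", importances.min())
print("Sum of importances:", importances.sum())
```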

In a dataset with 13 features, as here, the “average” importance is:

1 / 13 ≈ 0.077

So a value like 0.127 is well above average: roughly 1.65 times what the feature would receive if importance were spread evenly.
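
The comparison against the even-split baseline takes only a few lines of arithmetic. This is a small worked example using the 0.127 figure quoted above; the variable names are chosen for illustration.

```python
n_features = 13
baseline = 1 / n_features   # importance each feature would get if all were equal
observed = 0.127            # example value quoted in the text

ratio = observed / baseline
print(f"Uniform baseline: {baseline:.3f}")     # 0.077
print(f"Observed / baseline: {ratio:.2f}")     # 1.65
```

Dividing by the baseline is a quick habit worth building: it makes importance values comparable across datasets with different numbers of features.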


Why Importance Values Feel Smaller Than Expected

Feature importance is distributed across:

  • Correlated variables (e.g., heart rate, chest pain, exercise response)

  • Many small contributions from multiple splits

  • The averaging effect across the many trees in the forest (100 in this example)

This leads to a natural spread where even strong predictors may not dominate visually.
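
The effect of correlated variables can be illustrated with an extreme case: an exact duplicate of the one useful column. The sketch below uses synthetic data and a hypothetical helper, `fit_importances`; it is an assumption-laden toy, not a result from the heart disease data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: column 0 drives the label, columns 1-4 are pure noise.
X_single = rng.normal(size=(400, 5))
y_toy = (X_single[:, 0] > 0).astype(int)

# Same data plus an exact copy of column 0 appended as column 5.
X_dup = np.column_stack([X_single, X_single[:, 0]])


def fit_importances(X, y):
    """Fit a forest and return its impurity-based feature importances."""
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X, y)
    return forest.feature_importances_


imp_single = fit_importances(X_single, y_toy)
imp_dup = fit_importances(X_dup, y_toy)

print("Signal feature alone:      ", round(imp_single[0], 3))
print("Signal feature with a copy:", round(imp_dup[0], 3))
print("The copy itself:           ", round(imp_dup[5], 3))
```

The signal column has not become any less useful in the second run, yet its score drops because the forest can split on either copy. Real correlated features behave like a milder version of this.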


Key Insight

Feature importance is not about how “big” a number looks. Rather, it is about how the model distributes decision-making across variables.

This is one of the most important conceptual shifts in early machine learning education.


Conclusion

Using the Cleveland heart disease dataset, we walked through a complete beginner-friendly machine learning workflow:

  • Installing and loading real-world datasets

  • Cleaning imperfect medical data

  • Training a Random Forest classifier

  • Interpreting feature importance correctly

The most valuable takeaway is not the model itself, but the interpretation:

In data science, understanding what a model is doing is often more important than achieving high accuracy.

Feature importance is a powerful tool—but only when it is interpreted correctly.


Optional Next Step

To deepen understanding, try:

  • Removing the top 3 features and retraining the model

  • Using permutation importance instead of built-in importance

  • Comparing results across multiple models (Logistic Regression vs Random Forest)

These experiments often reveal that “importance” is not absolute. It depends on the model itself.
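
As a starting point for the second experiment, here is a minimal sketch of permutation importance using scikit-learn's `permutation_importance` on a held-out split. It runs on synthetic data so it is self-contained; the names `X_syn`, `y_syn`, `forest`, and `perm` are hypothetical, and the same pattern should carry over to `X_clean` and `y_binary` from this walkthrough.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in data: only column 0 carries signal.
X_syn = rng.normal(size=(500, 4))
y_syn = (X_syn[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Shuffle each column of the held-out set in turn and record the drop in accuracy.
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)

for idx in np.argsort(perm.importances_mean)[::-1]:
    print(f"feature {idx}: {perm.importances_mean[idx]:.3f} "
          f"+/- {perm.importances_std[idx]:.3f}")
```

Unlike the built-in scores, these values do not sum to 1. Each one is the average drop in accuracy when that column is shuffled, so they are read on a different scale.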


About the Author:

Dr. Dax Bradley is a professor of Computer Science and a lifelong connoisseur of all things nerdy. When he’s not teaching data structures or debugging Python code, he’s diving into Dungeons & Dragons campaigns, quoting obscure B-movies, or debating the finer points of Star Wars canon. He believes comic books are literature, bad movies deserve love, and if there’s a bigger nerd in the room, he’d really like to meet them.
