How to Train Your Model When Data Lies

Feb 28, 2025 - 17:04

Imagine you're learning to tell dog breeds apart, but your teacher occasionally gives you the wrong information. They sometimes mistakenly call a Labrador a Golden Retriever. At other times, they call a Husky a Malamute. When this keeps happening, you'll start doubting yourself or, worse, learn the wrong things altogether.

This is exactly what happens when you train machine learning models on noisy labels, that is, labels in the data that are wrong. The model gets confused, learns incorrect patterns, and predicts poorly.

So, how do you make a model smart enough to handle these errors? That's what we will explore in this article.

You can find the code snippets I've used here in my Colab notebook: Colab Notebook

What Are Noisy Labels?

A label is the answer the model is supposed to learn for a data point. So, if you have a dataset of pictures of cats and dogs, each picture will have a label of "cat" or "dog."

But sometimes, labels are wrong. This can happen because:

Humans make errors: Someone manually labeled a picture of a Husky as a Wolf.
Data can be unclear: Some flowers are nearly identical to each other.
Automatic labeling goes wrong: A weak system can incorrectly classify objects.

Errors like these are called noisy labels. If you train a model on data with too many of them, it may end up memorizing the mistakes instead of learning the correct patterns.
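To make the idea concrete, here is a minimal sketch that measures how noisy a set of labels is when the true answers happen to be known. The two label arrays are made-up toy values for illustration, not data from this article.

```python
import numpy as np

# Hypothetical toy labels, invented for illustration only.
true_labels = np.array(["husky", "husky", "wolf", "husky", "wolf"])
observed_labels = np.array(["husky", "wolf", "wolf", "husky", "wolf"])

# A label is "noisy" wherever the observed label disagrees with the true one.
is_noisy = true_labels != observed_labels
noise_rate = is_noisy.mean()

print("Mislabeled positions:", np.flatnonzero(is_noisy))
print(f"Noise rate: {noise_rate:.0%}")
```

In real projects the true labels are usually unknown, which is why noise is hard to measure directly and why we simulate it below.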

Let’s Create a Noisy Dataset in Python

First, let’s generate a clean dataset, then introduce some noise.

Step 1: Generate a Clean Dataset

We’ll create a simple dataset with two classes (0 and 1) using sklearn.datasets.make_classification.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a classification dataset with 1000 samples (data points) and 2 features (columns)
# n_informative=2 means the two features are useful for the classification task
# n_redundant=0 means no extra, redundant features are added
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)

# Hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap="coolwarm", alpha=0.7)
plt.title("Clean Dataset")
plt.show()
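One detail worth knowing about the "clean" data: make_classification has a flip_y parameter that defaults to 0.01, so roughly 1% of labels are reassigned at random even before we add any noise ourselves. The sketch below compares class counts with the default and with flip_y=0. The `common` dictionary is just a helper name I'm introducing here.

```python
import numpy as np
from sklearn.datasets import make_classification

# Shared arguments for both calls (helper name introduced for this sketch).
common = dict(n_samples=1000, n_features=2, n_informative=2,
              n_redundant=0, n_clusters_per_class=1, random_state=42)

_, y_default = make_classification(**common)          # flip_y defaults to 0.01
_, y_exact = make_classification(flip_y=0, **common)  # no labels reassigned

counts_default = np.bincount(y_default, minlength=2)
counts_exact = np.bincount(y_exact, minlength=2)

print("Class counts with default flip_y:", counts_default)
print("Class counts with flip_y=0:     ", counts_exact)
```

For this tutorial the difference is small, but passing flip_y=0 is an option if you want the starting labels to be exactly as generated.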

Step 2: Add Noisy Labels

Now we introduce 20% label noise by flipping the labels of a randomly chosen 20% of the training points.

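def add_label_noise(y, noise_rate=0.2, seed=42):
    # Use a local random generator so the global NumPy seed is left untouched
    rng = np.random.default_rng(seed)
    num_noisy = int(len(y) * noise_rate)
    # Pick distinct positions so exactly num_noisy labels get flipped
    noisy_indices = rng.choice(len(y), num_noisy, replace=False)
    y_noisy = y.copy()

    y_noisy[noisy_indices] = 1 - y_noisy[noisy_indices] # flip the binary 0/1 labels :)
    return y_noisy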

# Introduce noise into labels
y_train_noisy = add_label_noise(y_train, noise_rate=0.2)

plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train_noisy, cmap="coolwarm", alpha=0.7)
plt.title("Dataset with Noisy Labels (20% incorrect)")
plt.show()
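The plot gives a visual impression, but it can also help to confirm the noise numerically. The following is a small self-contained sketch that rebuilds the training labels, flips 20% of them in the same spirit as Step 2, and counts how many actually changed. The variable names `flipped` and `flips_per_class` are my own additions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Rebuild the training labels so this block runs on its own.
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)
_, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=42)

# Flip 20% of the binary labels at distinct, randomly chosen positions.
rng = np.random.default_rng(42)
num_noisy = int(len(y_train) * 0.2)
noisy_indices = rng.choice(len(y_train), num_noisy, replace=False)
y_train_noisy = y_train.copy()
y_train_noisy[noisy_indices] = 1 - y_train_noisy[noisy_indices]

# Compare noisy labels against the originals.
flipped = y_train != y_train_noisy
flips_per_class = np.bincount(y_train[flipped], minlength=2)

print("Flipped labels:", flipped.sum(), "of", len(y_train))
print(f"Observed noise rate: {flipped.mean():.1%}")
print("Flips per original class (0, 1):", flips_per_class)
```

Because the positions are sampled without replacement, the number of flipped labels matches the requested rate exactly. The split between the two classes depends on which points were drawn.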