Teaching Series #3 - Cross-entropy

"...Choose to be curious around everything that goes around you, and adopt it to your own satisfaction..."

Imagine you have a bag of different colored candies: 5 red, 3 blue, and 2 green. If you reach in and pick one, what's the chance of picking a red candy? Or a blue one? A probability distribution is just a way to describe all the possible outcomes of an event and how likely each outcome is. It's like a complete list that shows you where all the "probability" is spread out.

For the candy example, the probability distribution might look like this:

  • Red: 5 out of 10 candies, so a probability of 0.5 (or 50%)

  • Blue: 3 out of 10 candies, so a probability of 0.3 (or 30%)

  • Green: 2 out of 10 candies, so a probability of 0.2 (or 20%)

If we were to draw this, it might look like a bar chart where each bar represents a colour, and its height shows how likely you are to pick that colour.
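
If you like seeing ideas in code, here is a minimal Python sketch (just an illustration of the candy example above) that turns the counts into a probability distribution:

```python
# A minimal sketch: turn the candy counts from the example above into a probability distribution.
counts = {"red": 5, "blue": 3, "green": 2}
total = sum(counts.values())  # 10 candies in the bag

distribution = {colour: n / total for colour, n in counts.items()}
print(distribution)  # {'red': 0.5, 'blue': 0.3, 'green': 0.2}

# Sanity check: the probabilities of all possible outcomes always add up to 1.
assert abs(sum(distribution.values()) - 1.0) < 1e-9
```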

Now that we have touched on probability distributions, let's briefly look at a couple of ideas from Information Theory that will make the rest much easier to follow. Think about "information." What does it mean in the context of probabilities? Imagine I tell you something highly unlikely – like "it's snowing in the desert." That statement carries a lot of "information" because it's so unexpected. On the other hand, if I tell you "the sun will rise tomorrow," that carries very little information because it's almost certain.

In information theory, entropy (not cross-entropy yet, stay tuned!) measures the uncertainty or randomness of a single probability distribution (one set of outcomes and their probabilities). If all outcomes are equally likely, the entropy is high (lots of uncertainty). If one outcome is almost certain, the entropy is low (very little uncertainty). For example, if you have a coin that's perfectly balanced (50% heads, 50% tails), the entropy is high – you're very uncertain about the outcome. But if you have a trick coin that always lands on heads, the entropy is low – you're very certain about the outcome.
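
To make the coin example concrete, here is a small sketch that computes the entropy of both coins using the standard formula H(P) = −∑ P(x) * log2(P(x)); using the base-2 logarithm (entropy measured in bits) is just one common convention:

```python
import math

def entropy(probabilities):
    # Entropy in bits: H(P) = -sum of p * log2(p), skipping outcomes with p == 0
    # (those outcomes contribute nothing to the uncertainty).
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

fair_coin = [0.5, 0.5]    # perfectly balanced: maximum uncertainty
trick_coin = [1.0, 0.0]   # always lands on heads: no uncertainty at all

print(entropy(fair_coin))   # 1.0 bit  -> high entropy
print(entropy(trick_coin))  # 0.0 bits -> low (zero) entropy
```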

Now, you are ready. What is Cross-Entropy?

Cross-entropy is a way to measure how different two probability distributions are. In practice, this usually means comparing a predicted probability distribution (a guess produced by a model – whether that model is your brain or a mathematical one you built) to the true probability distribution.

Imagine you are training a mathematical model to identify if an animal in a picture is a cat, a dog, or a bird.

  • True Distribution (P): This is the actual reality. If the picture is definitely a cat, the true probability for "cat" is 1, and for "dog" and "bird" it's 0.

    • Example: Cat: 1.0, Dog: 0.0, Bird: 0.0

  • Predicted Distribution (Q): This is what your model thinks the picture is. It might output probabilities like:

    • Example: Cat: 0.7, Dog: 0.2, Bird: 0.1

Cross-entropy then gives us a single number that tells us how "far off" the predicted distribution (Q) is from the true distribution (P). The lower the cross-entropy value, the more similar your predicted distribution is to the true one, meaning your model is making better predictions. This is why cross-entropy is often used as a loss function in prediction (notice that I’m not using complex terminology like machine learning or artificial intelligence, yet 🙂 ). While you build your mathematical model, its goal is to minimise this loss, which means making its predicted probabilities as close as possible to the true probabilities. In other words, we want to build a model that makes fewer mistakes.

The formula for cross-entropy is:

H(P, Q) = −∑ P(xi) * log(Q(xi))

(When there are only two possible outcomes, like cat or no cat, the same formula is usually called binary cross-entropy.)

Let's break this down:

  • H(P, Q): This is the cross-entropy value itself.

  • P(xi): This represents the true probability of outcome xi. In our cat example, for the true category (cat), this would be 1, and for all other categories (dog, bird), it would be 0.

  • Q(xi): This represents the predicted probability of outcome xi by your model.

  • log: This is the logarithm (often the natural logarithm, ln, or the base-2 logarithm, log2, depending on the context).

  • ∑: This means we sum up the values for all possible outcomes.

  • The minus sign at the beginning: This is there to make the result positive, as logarithms of probabilities (which are between 0 and 1) are negative.
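
Putting the pieces together, here is one way the formula might look in Python – a minimal sketch where the function mirrors the breakdown above, and the numbers are the cat/dog/bird example from earlier (natural logarithm used here):

```python
import math

def cross_entropy(P, Q):
    # H(P, Q) = -sum over all outcomes of P(x_i) * log(Q(x_i)).
    # The "if p > 0" guard skips terms where the true probability is 0,
    # since they contribute nothing to the sum.
    return -sum(p * math.log(q) for p, q in zip(P, Q) if p > 0)

P = [1.0, 0.0, 0.0]  # true distribution: the picture really is a cat
Q = [0.7, 0.2, 0.1]  # predicted distribution from the model

print(cross_entropy(P, Q))  # ~0.357 -- one number describing how far Q is from P
```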

Think of it like this: Imagine you're playing a game where you get points based on how good your predictions are.

  • If the true answer is "cat" (P(cat) = 1), and your model predicted "cat" with high probability (Q(cat) is close to 1), then P(cat) * log(Q(cat)) would be close to 1 * log(1) = 0. This means you get a small "penalty" (loss) because your prediction was good.

  • If the true answer is "cat" (P(cat) = 1), but your model predicted "cat" with a very low probability (Q(cat) is close to 0), then P(cat) * log(Q(cat)) would be close to 1 * log(0), which approaches negative infinity. This results in a very large "penalty" (loss), because your prediction was very wrong. So, the formula basically penalises your model more when it's confident about the wrong answer, and less when it's confident about the right answer. Leading a model to make fewer mistakes when a “similar” or the same example occurs in future

Now, let's explore its practical applications and why it matters so much in the real world, especially in machine learning (here it is).

Cross-entropy is often used when it comes to classification problems. These are tasks where you want to categorise something into one of several predefined classes. Think about:

  • Spam Detection: Is this email spam or not spam?

  • Sentiment Analysis: Is this customer review positive, negative, or neutral?

  • Medical Diagnosis: Does a patient have a specific disease or not, based on their symptoms?

In all these scenarios, a machine learning model outputs probabilities for each possible category. Cross-entropy then steps in as the loss function during the model's training phase.

Why is it a good choice for those tasks?

  1. It severely penalises confident wrong answers: If your model is 99% sure an image is a dog, but it's actually a cat, cross-entropy will give a very high loss value. This strong penalty encourages the model to learn from its mistakes and adjust its internal parameters to make better predictions next time. It essentially tells the model, "Being very wrong when you're very confident is a big no-no!"

  2. It's smooth and differentiable: This is a bit technical, but it means we can easily use calculus (specifically, gradient descent - will be covered later) to find the best adjustments for our model to minimise the cross-entropy. It's like having a smooth hill to walk down to reach the lowest point (since in the meadows bloom minimal errors), rather than sharp cliffs (remember this analogy - read it again!). There's a tiny sketch of this downhill walk just below.

So, every time you see a machine learning model successfully classify an image, translate text, or detect fraud, there's a good chance that cross-entropy played a crucial role in its training!
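
To give a tiny taste of that downhill walk, here is a toy sketch (gradient descent itself will be covered later). A "model" with a single adjustable knob z predicts p = sigmoid(z); for binary cross-entropy, the slope with respect to z works out to (p − y), and each step nudges z against that slope. The starting value and learning rate are made up for illustration:

```python
import math

def sigmoid(z):
    # Squashes any number into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

y = 1.0           # true label: "cat"
z = -2.0          # start with a bad guess: sigmoid(-2) is about 0.12
learning_rate = 1.0

for step in range(10):
    p = sigmoid(z)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))  # binary cross-entropy
    z -= learning_rate * (p - y)                            # one smooth step downhill
    print(f"step {step}: prediction {p:.3f}, loss {loss:.3f}")
```

Each printed line shows the prediction creeping towards 1.0 and the loss shrinking – a smooth hill, no cliffs.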

A real-world example - nothing without it!

Let's take email spam detection. Imagine you have a model whose job is to look at an incoming email and decide if it's "spam" or "not spam." Here's how cross-entropy comes into play:

  1. The Actual Truth (True Distribution P): For a particular email, it is either spam or it's not spam. Let's say:

    • If it's truly spam: P = [Spam: 1.0, Not Spam: 0.0]

    • If it's truly not spam: P = [Spam: 0.0, Not Spam: 1.0]

  2. Your Model's Prediction (Predicted Distribution Q): Your model analyses the email (looking at keywords, sender, etc.) and outputs a probability for each category. For example:

    • Email 1 (True: Spam): Model predicts Q = [Spam: 0.8, Not Spam: 0.2]

    • Email 2 (True: Not Spam): Model predicts Q = [Spam: 0.1, Not Spam: 0.9]

    • Email 3 (True: Spam - Model makes a mistake): Model predicts Q = [Spam: 0.3, Not Spam: 0.7]

  3. Calculating Cross-Entropy: For each email, we calculate the cross-entropy using the formula we discussed. Remember the formula, look up ☝🏻

    • For Email 1 (True: Spam, Predicted: [0.8, 0.2]):

      • The P for "Spam" is 1.0, and P for "Not Spam" is 0.0.

      • So, we only care about the term where P is non-zero:

      • H = −(1.0 * log(0.8) + 0.0 * log(0.2))

      • H = −(log(0.8)), which is a small positive number (around 0.22). This is a low loss, indicating a good prediction.

    • For Email 3 (True: Spam, Predicted: [0.3, 0.7]):

      • Again, P for "Spam" is 1.0, P for "Not Spam" is 0.0.

      • H = −(1.0 * log(0.3) + 0.0 * log(0.7))

      • H = −(log(0.3)), which is a larger positive number (around 1.20). This is a higher loss, indicating a worse prediction. (Both results are re-checked in the short sketch after this list.)

  4. Minimising Cross-Entropy during Training:

    • The model needs to process thousands, even millions, of emails. For each email, it calculates the cross-entropy.

    • During training, the learning algorithm (like gradient descent - will be covered later) adjusts the model's internal "knobs and dials" (its parameters, often called weights) in a way that reduces this average cross-entropy score across all emails.

    • By continuously trying to minimise cross-entropy, the model learns to output higher probabilities for the correct category and lower probabilities for the incorrect categories.

So, in essence, cross-entropy acts like a "teacher" for the spam filter. It tells the filter exactly how much it was wrong and in what direction, allowing it to learn and get better at distinguishing spam from legitimate emails over time.
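
If you want to re-check the email numbers from the walkthrough above, this short sketch reproduces them with the same formula (natural logarithm):

```python
import math

def cross_entropy(P, Q):
    return -sum(p * math.log(q) for p, q in zip(P, Q) if p > 0)

# Each email: (true distribution P, predicted distribution Q), ordered as [Spam, Not Spam].
emails = [
    ("Email 1 (truly spam, good prediction)",     [1.0, 0.0], [0.8, 0.2]),
    ("Email 2 (truly not spam, good prediction)", [0.0, 1.0], [0.1, 0.9]),
    ("Email 3 (truly spam, poor prediction)",     [1.0, 0.0], [0.3, 0.7]),
]

for name, P, Q in emails:
    print(name, "->", round(cross_entropy(P, Q), 2))
# Email 1 -> 0.22, Email 2 -> 0.11, Email 3 -> 1.2
# During training, the goal is to push the average of these numbers down across all emails.
```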

Finally, it is worth pointing out that many different kinds of loss functions exist. Each has its role to play (stay tuned).

Don’t go without a quick quiz!

What does a lower cross-entropy value typically indicate about a model's prediction?

a) The model is more confident in its incorrect predictions.

b) The model's predictions are less similar to the true outcomes.

c) The model's predictions are closer to the true outcomes.

d) The model is experiencing a higher degree of uncertainty.

Keep learning!

Sincerely, MO