Logistic Regression: From Zero to One
[ Last Updated: 2024/06/16 ]
Introduction: I realized that my previous understanding of Logistic Regression was quite superficial, especially regarding the correct form and derivation of the Sigmoid function. This is the classic pitfall of the "black-box user" (someone who only knows how to import other people's libraries without understanding how they work). Today, let's carefully review Logistic Regression from the ground up.
Why Logistic Regression if we already have Linear Regression?
Linear Regression is used to predict continuous values. However, many of us (myself included) often wonder: why can't we use it for 0/1 classification? A natural thought is to classify values > 0.5 as 1 and < 0.5 as 0.
However, this approach is flawed. The transition from 0 to 1 in classification is often not linear; there is a "qualitative change" at a certain threshold. Linear functions struggle to capture this sharp transition.
Example: Predicting Tumor Risk based on Size
In a Linear Regression model, the predicted value can be less than 0 or greater than 1. If we interpret the output as a probability, this is nonsensical.

Furthermore, consider an outlier: if the dataset includes a very large malignant tumor, a Linear Regression line will be pulled toward that extreme value. This shifts the decision boundary and leads to an underestimation of risk for other data points, potentially misclassifying malignant tumors as benign (the red circle in the image).

The Sigmoid Function and Odds
To fix the shortcomings of linear functions, we need a curve that changes rapidly near a threshold and flattens out at the extremes. This is the S-curve (Sigmoid), commonly used to describe population growth and, crucially, in Logistic Regression.
The formula is:

$\sigma(z) = \frac{1}{1 + e^{-z}}$
To understand where this formula comes from, we first need to understand the concept of Odds.
What are Odds?
After researching a bunch of definitions of odds online, I found that many of them are quite ambiguous, saying only that it is something related to probability. Indeed, it is really hard to explain the concept of Odds without math.
To clearly understand what "odds" are, looking at the formula is the fastest way:

$\text{Odds} = \frac{p}{1 - p}$

In this formula, $p$ is the probability in the general sense we are familiar with. For example, if $p$ is $0.8$ (an 80% chance of occurring), then the Odds are $0.8 / 0.2 = 4$. If $p$ is $0.9$, the Odds become $9$, and if $p$ is $0.99$, the Odds increase to $99$. Conversely, if $p$ falls below $0.5$, the Odds drop below $1$. From this perspective, the increase and decrease of Odds can be seen as exponential; no wonder this concept is said to have originated from horse racing, and it is essentially the same as the concept of "payout odds" in modern sports lotteries (even though higher odds represent a smaller chance of winning).
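A minimal sketch of this (my own addition, not part of the original derivation) is to print the odds for a few probabilities and watch how quickly they grow:

# Quick check: odds = p / (1 - p) for a few probabilities
for p in [0.5, 0.8, 0.9, 0.99]:
    print(f"p = {p:<4}  ->  odds = {p / (1 - p):.2f}")
# p = 0.5 -> 1.00, p = 0.8 -> 4.00, p = 0.9 -> 9.00, p = 0.99 -> 99.00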
Speaking of this, it is easy to think back to the tumor size example. The results of linear regression can be positive or negative and are not restricted between zero and one. While viewing those results as probabilities wouldn't make sense, viewing them as the logarithm of the odds makes it much easier to explain.
That is to say, we assume here:

$\ln\left(\frac{p}{1 - p}\right) = wx + b$

(where $\ln$ represents the natural logarithm with base $e$, the same as $\log_e$)

Thus:

$\frac{p}{1 - p} = e^{wx + b}$

Resulting in:

$p = \frac{e^{wx + b}}{1 + e^{wx + b}}$

With a bit of simplification, we obtain the familiar S-curve (Sigmoid) form:

$p = \frac{1}{1 + e^{-(wx + b)}}$
Note that the exponent here does not necessarily have to be the linear term $wx + b$; it could be a quadratic curve or something else, as long as you believe that function can be used to fit the log-odds.
At this point, we have the standard form of the logistic function:

$f(x) = \frac{1}{1 + e^{-(wx + b)}}$

where $f(x)$ always stays between 0 and 1, representing a probability. Generally, categories are classified based on a threshold of 0.5, but this can be adjusted according to actual circumstances. For instance, to mitigate risk, one might classify all emails with over a 30% probability of being a scam as "scam emails."
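As a quick sanity check (my own addition; the sigmoid helper below is just the formula above written in NumPy), we can confirm that the sigmoid really is the inverse of the log-odds, and that the decision threshold is simply a number we choose:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

p = 0.8
log_odds = np.log(p / (1 - p))      # what the "linear part" would output
print(sigmoid(log_odds))            # ~0.8: the sigmoid recovers the probability

# The 0.5 cutoff is only a default; a stricter "scam filter" might use 0.3
probs = np.array([0.1, 0.35, 0.6, 0.92])
print((probs >= 0.5).astype(int))   # [0 0 1 1]
print((probs >= 0.3).astype(int))   # [0 1 1 1]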
Cost Function and Gradient Descent
Once the regression function is determined, following the steps we took in linear regression, we still want to use the gradient descent method to find the best-fitting curve. To do this, we must first define the cost function.
So, what is the cost function for Logistic Regression?
Maximum Likelihood Estimation
In linear regression, we determine the loss for each point using the very intuitive squared distance. In logistic regression, however, we want to classify data into the correct categories as much as possible. Therefore, we look at this problem from the perspective of probability rather than distance:
We want to find the "best" curve—the one that maximizes the probability that every point is classified into the correct category. In other words, we want to maximize the following function:

$L(w, b) = \prod_{i=1}^{m} f(x_i)^{\,y_i} \left(1 - f(x_i)\right)^{1 - y_i}$
My understanding is that this function can be viewed as the overall accuracy of the prediction. For example, if a point is predicted to have a 20% probability of being positive (1), but its true value is actually 0, then the "accuracy" of that prediction is 80%. Conversely, if a point is predicted to be positive at 20% but its true value is 1, the accuracy is 20%. By multiplying the accuracy of all points, we get the overall accuracy of the model's predictions based on all samples. The curve we need to find is the one that maximizes this value.
Wait a second, we aren't done. This function is clearly what we need to maximize, but the corresponding loss function requires a bit of extra processing:
- First, take the logarithm of the product formula above (likely because it's painful and unnecessary for computers to perform massive multiplications). Note that we use the two coefficients $y_i$ and $1 - y_i$ to distinguish between the cases where $y_i$ equals 1 or 0 (one of them will always be zero):

  $\ln L(w, b) = \sum_{i=1}^{m} \left[ y_i \ln f(x_i) + (1 - y_i) \ln\left(1 - f(x_i)\right) \right]$

- Then, take the negative of this value and average it over the $m$ samples to get the familiar form of the cost function:

  $J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \ln f(x_i) + (1 - y_i) \ln\left(1 - f(x_i)\right) \right]$
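To make the two steps above concrete, here is a tiny numeric example (my own addition, with made-up probabilities and labels rather than the post's data):

import numpy as np

# Hypothetical predicted probabilities and true labels (not from the admission data)
f = np.array([0.2, 0.7, 0.9, 0.4])
y = np.array([0,   1,   1,   0  ])

# Likelihood: the product of each point's "accuracy" as described above
likelihood = np.prod(np.where(y == 1, f, 1 - f))

# Cost: the negative average log-likelihood (binary cross-entropy)
cost = -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))

print(likelihood)   # 0.8 * 0.7 * 0.9 * 0.6 = 0.3024
print(cost)         # -ln(0.3024) / 4 ≈ 0.299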
Gradient Descent
Now that we've defined the cost function, we move into the familiar moment of finding the gradient through partial derivatives.
Although the conclusion is very concise and even takes the same form as linear regression:

$\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( f(x_i) - y_i \right) x_i, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f(x_i) - y_i \right)$

Reaching this intuitive conclusion still requires some calculation. Let's take the partial derivative with respect to $w$ as an example.

First, let $z_i = w x_i + b$ and $f_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}$, so that:

$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \ln f_i + (1 - y_i) \ln(1 - f_i) \right]$

(Note that $\frac{\partial f_i}{\partial z_i} = f_i (1 - f_i)$ and $\frac{\partial z_i}{\partial w} = x_i$.)

Therefore:

$\frac{\partial J}{\partial w} = -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y_i}{f_i} - \frac{1 - y_i}{1 - f_i} \right] f_i (1 - f_i)\, x_i = \frac{1}{m} \sum_{i=1}^{m} \left( f_i - y_i \right) x_i$

Note: If there are multiple $w$ values (notated as a vector $\mathbf{w}$), then the $x_i$ in the equation above must be the feature $x_{i,j}$ corresponding to that specific $w_j$.
Thus, we have obtained the core formula for each update in gradient descent.
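To convince myself the algebra is right, a quick finite-difference check on a made-up toy dataset works well (this is my own sketch; the X, y, w, b values here are hypothetical and unrelated to the admission data used later):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical toy data, only for checking the formula numerically
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, 1.5]])
y = np.array([1, 0, 1])
w, b = np.array([0.3, -0.2]), 0.1

def cost(w, b):
    f = sigmoid(X @ w + b)
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))

# Analytical gradient from the derivation above: (1/m) * sum((f - y) * x)
err = sigmoid(X @ w + b) - y
dj_dw = X.T @ err / len(y)

# Numerical gradient via central finite differences
eps = 1e-6
numerical = np.array([
    (cost(w + eps * np.eye(2)[j], b) - cost(w - eps * np.eye(2)[j], b)) / (2 * eps)
    for j in range(2)
])

print(dj_dw, numerical)   # the two should agree to ~6 decimal places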
By now, we have basically established all the theoretical support needed to complete the Logistic Regression code. Remarkably, this gradient descent process is strikingly similar to linear regression (I still wonder if it's just a coincidence), so this part of the process doesn't even need to be changed.
Application: A Step-by-Step Implementation
The following is a hands-on exercise in Logistic Regression, covering everything from data exploration to manual gradient descent implementation.
We are using the Graduate Admission dataset from Kaggle: Binary Admission Data.
1. Data Import and Exploration
First, let's load our libraries and get a feel for the data.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset
raw_df = pd.read_csv("Graduate Admission.csv")
# Quick look at the first 10 rows and statistical summary
print(raw_df.head(10))
print(raw_df.describe())

To understand the relationships between variables, I used Seaborn to create boxplots comparing Admission Status against GPA, GRE, and Rank.
fig, axes = plt.subplots(1, 3, figsize=(12, 6))
colors = ["#ed45b2", "#87e8c1"]
for i in range(3):
    sns.boxplot(x='admit', y=raw_df.columns[i+1], data=raw_df, ax=axes[i], palette=colors)
    f_heading = raw_df.columns[i+1].upper()
    axes[i].set_title('Admission Status by ' + f_heading)
    axes[i].set_xlabel('Admission Status (0: No, 1: Yes)')
    axes[i].set_ylabel(raw_df.columns[i+1])
plt.tight_layout()
plt.show()
Observations: Both GRE and GPA show a positive correlation with admission. Interestingly, while the specific meaning of "Rank" (1-3) isn't detailed, those in Rank 1 or 2 have a significantly higher likelihood of being admitted. These three features will serve as our parameters.
2. Data Pre-Processing
We need to move the data from Pandas to NumPy and perform Standardization (Z-score normalization) to help our gradient descent converge faster.
X_input = raw_df[['gre', 'gpa', 'rank']].to_numpy()
Y_input = raw_df['admit'].to_numpy()
def z_score(data):
    # axis=0 calculates the mean/std for each column
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    return (data - mean) / std
X_train = z_score(X_input)
Y_train = Y_input
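A quick check I like to run (my own addition): after z-scoring, every feature column should have a mean of roughly 0 and a standard deviation of roughly 1.

# Sanity check on the standardized features
print(np.mean(X_train, axis=0))   # ~[0, 0, 0]
print(np.std(X_train, axis=0))    # ~[1, 1, 1]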
3. Implementing Logistic Regression via Gradient Descent
To build our model from scratch, we need three core functions.
A. The Sigmoid Function
The heart of Logistic Regression, used to map any real-valued number into a probability between 0 and 1.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
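A couple of quick checks (my own addition): sigmoid(0) should be exactly 0.5, and large inputs should saturate toward 0 or 1. For extremely negative z, np.exp(-z) can overflow and trigger a warning, but with standardized features that is not an issue here.

print(sigmoid(0))                             # 0.5
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # [~0.000045, 0.5, ~0.999955]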
B. The Cost Function
We use the Binary Cross-Entropy Loss to measure how well our current $w$ and $b$ are performing.
def compute_cost(X, y, w, b):
    m, n = X.shape
    cost = 0
    for i in range(m):
        z = np.dot(w, X[i]) + b
        f = sigmoid(z)
        loss = -y[i] * np.log(f) - (1 - y[i]) * np.log(1 - f)
        cost += loss
    return cost / m
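A handy sanity check (my own addition): with w = 0 and b = 0, every prediction is 0.5, so the cost should come out to exactly ln(2) ≈ 0.693 regardless of the labels.

# With zero weights the model predicts 0.5 everywhere
print(compute_cost(X_train, Y_train, np.zeros(3), 0))   # ~0.693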
C. The Gradient Function
This calculates the "direction" and "steepness" of the adjustment we need to make to our parameters.
def compute_gradient(X, y, w, b):
    m, n = X.shape
    dj_dw = np.zeros(w.shape)
    dj_db = 0
    for i in range(m):
        z_wb = np.dot(w, X[i]) + b
        f_wb = sigmoid(z_wb)
        err = f_wb - y[i]
        dj_db += err
        for j in range(n):
            dj_dw[j] += err * X[i][j]
    return dj_dw / m, dj_db / m
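As an aside (my own sketch, not required for the rest of the post), the same gradients can be computed without the explicit loops by using NumPy's matrix operations; it should return the same values as compute_gradient above.

def compute_gradient_vec(X, y, w, b):
    # Vectorized version of the loops above
    err = sigmoid(X @ w + b) - y        # shape (m,)
    dj_dw = X.T @ err / X.shape[0]      # shape (n,)
    dj_db = np.mean(err)
    return dj_dw, dj_db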
4. Execution and Results
With our functions ready, we initialize our parameters and run the loop.
import math
def gradient_descent(X, y, w_in, b_in, cost_func, grad_func, alpha, itrs):
    J_history = []
    for i in range(itrs):
        cost = cost_func(X, y, w_in, b_in)
        J_history.append(cost)
        if i % math.ceil(itrs / 10) == 0:
            print(f"Iteration {i:4}: Cost {cost:8.2f}")
        dj_dw, dj_db = grad_func(X, y, w_in, b_in)
        w_in -= alpha * dj_dw
        b_in -= alpha * dj_db
    return w_in, b_in, J_history
# Settings
np.random.seed(1)
initial_w = 0.01 * (np.random.rand(3) - 0.5)
initial_b = 0
iterations = 10000
alpha = 0.01
w, b, J_hist = gradient_descent(X_train, Y_train, initial_w, initial_b, compute_cost, compute_gradient, alpha, iterations)

Final Cost: Stabilized at approximately 0.57.
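Rather than trusting the printed numbers alone, plotting the recorded history is a quick way to confirm the loss really flattens out (my own addition):

# Plot the recorded cost history
plt.plot(J_hist)
plt.xlabel("Iteration")
plt.ylabel("Cost")
plt.title("Cost vs. Iteration")
plt.show()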
To evaluate the model, we calculate the Training Accuracy:
def predict(X, w, b):
    m, n = X.shape
    p = np.zeros(m)
    for i in range(m):
        f_wb = sigmoid(np.dot(w, X[i]) + b)
        p[i] = 1 if f_wb >= 0.5 else 0
    return p
predictions = predict(X_train, w, b)
print(f'Train Accuracy: {np.mean(predictions == Y_train) * 100:.2f}%')

Train Accuracy: 70.50%
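As a cross-check (my own addition, assuming scikit-learn is available), the library implementation should land on a very similar training accuracy on the same standardized features:

from sklearn.linear_model import LogisticRegression

# Fit the library model on the same standardized features
clf = LogisticRegression().fit(X_train, Y_train)
print(f"sklearn Train Accuracy: {clf.score(X_train, Y_train) * 100:.2f}%")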
Honks:
Writing this post was much slower than usual. I tried my best to straighten out the logic. I asked ChatGPT why we specifically use the Sigmoid curve; it told me it's due to its interpretability, differentiability, and historical preference in statistics. Let's just trust Professor GPT on this point for now then... ( •̀ ω •́ )✧
Also, this was a great exercise in manual implementation. While it’s tempting to remain a "black-box practitioner" using pre-built libraries, writing every line of numpy logic forces you to pay attention to the details—like vector shapes and normalization.
Another piece of good news is that I've also recently finished the basics of Neural Networks. It's wild how "brute force" they can be—if you have the computing power, you can do almost anything. Maybe I'll write about the theory behind NNs this weekend!
— Untitled Penguin
2024/06/16 21:21
Application part added on 2024-06-26