
Building a Deep Learning Model from Scratch in Python (Without any libraries)


This blog post explains how to build a multi-layer perceptron (MLP) from scratch in Python. The goal of the model is to classify handwritten digits from the famous MNIST dataset. This exercise guides you through building a neural network without machine learning libraries like TensorFlow or PyTorch, relying solely on basic libraries such as NumPy. This approach will give you a deeper understanding of how neural networks operate under the hood.

1. Understanding the Problem

We aim to build a model that can classify handwritten digits (0 to 9) into one of 10 categories. The dataset consists of images of size 28x28 pixels, which will be flattened into vectors of length 784. The neural network's task is to map these input vectors to the correct digit labels.

The dataset consists of:

  • Input features: 784 features (one for each pixel in the 28x28 image).
  • Output: 10 classes (digits 0 to 9).

2. Importing Required Libraries

Before we begin building the model, we'll need some libraries for data handling and computations:

  • NumPy: For matrix computations and operations.
  • Pandas: To load and handle the dataset.
  • Matplotlib: For visualizing data, especially displaying images.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import drive
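
Since the notebook reads the dataset from Google Drive, the drive has to be mounted first. A minimal sketch (the mount point matches the path used in the next section):

drive.mount('/content/gdrive')  # Mount Google Drive at /content/gdrive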
        

3. Loading and Preprocessing the Data

The dataset is stored in CSV format, where each row corresponds to an image, and the first column contains the label (the digit). We load the data, shuffle it, and split it into training and testing sets. Here is how it's done:


# Load the data from Google Drive
data = pd.read_csv('/content/gdrive/My Drive/DigitDataSet/train.csv')
data = np.array(data)

# Shuffle the data for randomness
np.random.shuffle(data)

# Get the number of examples (m) and features (n)
m, n = data.shape
        

We then split the data into training and test sets:


# Split the data into train and test sets
m_test = 1000
m_train = m - m_test

# Test data: transpose so each column is one example
data_test = data[0 : m_test].transpose()
Y_test = data_test[0]  # First row holds the labels
X_test = data_test / np.max(data_test)  # Normalize the pixel values to [0, 1]
X_test[0] = np.ones(m_test)  # Replace the label row with a bias unit of ones

# Train data: the remaining examples
data_train = data[m_test : m].transpose()
Y_train = data_train[0]  # First row holds the labels
X_train = data_train / np.max(data_train)  # Normalize the pixel values to [0, 1]
X_train[0] = np.ones(m_train)  # Replace the label row with a bias unit of ones
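
Matplotlib was imported for visualizing the images, so here is a quick, optional sketch that displays one training example (the index is arbitrary; with the layout above, row 0 of X_train is the bias and rows 1 to 784 hold the normalized pixels):

# Display one training digit (rows 1..784 are the pixels, row 0 is the bias unit)
idx = 0
plt.imshow(X_train[1:785, idx].reshape(28, 28), cmap='gray')
plt.title("Label: " + str(Y_train[idx]))
plt.show()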
        

4. Defining the Neural Network Architecture

We will define the architecture of our neural network. The network consists of:

  • Input layer: 784 neurons (one for each pixel in the image).
  • Hidden layer: 10 neurons (chosen arbitrarily for simplicity).
  • Output layer: 10 neurons (one for each digit class).

# Neural network architecture
n_hidden = 10
n_input = 784
n_output = 10
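
The training loop in section 8 calls init_param() to create the initial weights. Here is one minimal sketch, assuming small random values centered at zero, with shapes that match the bias-augmented layers (785 inputs including the bias, 11 hidden activations including the bias):

def init_param():
    # theta1: (10, 785) maps the bias-augmented input to the hidden layer
    # theta2: (10, 11)  maps the bias-augmented hidden layer to the output layer
    theta1 = np.random.rand(n_hidden, n_input + 1) - 0.5
    theta2 = np.random.rand(n_output, n_hidden + 1) - 0.5
    return theta1, theta2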
        

5. Defining Activation Functions

Activation functions are crucial for introducing non-linearity into the network. We'll use the following activation functions:

  • ReLU (Rectified Linear Unit): Applied to the hidden layer to introduce non-linearity. Defined as:

    \( \text{ReLU}(z) = \max(0, z) \)

    
    def ReLu(Z):
        return np.maximum(Z, 0)
                
  • Sigmoid: Applied to the output layer. It squashes each output to a value between 0 and 1, so each output neuron can be read as a score for its digit class. Defined as:

    \( \sigma(z) = \frac{1}{1 + e^{-z}} \)

    
    def sigmoid(Z):
        return 1 / (1 + np.exp(-Z))
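
Backward propagation (section 7) also uses the derivatives of these activations, deriv_ReLu and deriv_sigmoid. A minimal sketch:

def deriv_ReLu(Z):
    # ReLU'(z) is 1 for z > 0 and 0 otherwise
    return Z > 0

def deriv_sigmoid(Z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(Z)
    return s * (1 - s)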
                

6. Forward Propagation

In forward propagation, we compute the activations for each layer. The calculations for each layer \( l \) are as follows:

The linear transformation for a layer is:

\( Z^{(l)} = \theta^{(l)} \cdot A^{(l-1)} \)

\( A^{(l)} = g(Z^{(l)}) \)

Where:
  • \( \theta^{(l)} \) are the weights of layer \( l \),
  • \( A^{(l-1)} \) is the activation from the previous layer,
  • \( g(Z^{(l)}) \) is the activation function applied to the linear transformation.
Here's the Python code to implement forward propagation:

def forward_prop(X, theta1, theta2):
    Z2 = theta1.dot(X)   # Linear transformation for the hidden layer
    A21 = ReLu(Z2)       # Apply ReLU activation

    # Prepend a bias row of ones to the hidden activations
    # (size the bias row from X so this also works on the test set)
    A2 = np.ones((n_hidden + 1, X.shape[1]))
    A2[1:n_hidden+1, :] = A21

    Z3 = theta2.dot(A2)  # Linear transformation for the output layer
    A3 = sigmoid(Z3)     # Apply sigmoid activation

    return A2, Z2, A3, Z3
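
As a quick sanity check of the shapes, a forward pass can be run with the init_param sketch from section 4 (assuming X_train has shape (785, m_train)):

theta1, theta2 = init_param()
A2, Z2, A3, Z3 = forward_prop(X_train, theta1, theta2)
print(A3.shape)  # (10, m_train): one column of class scores per training example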
        

7. Backward Propagation

Backward propagation calculates the gradients of the cost function with respect to the weights. We use the chain rule to calculate the error at each layer and propagate it backward:

\( \delta^{(output)} = (A^{(output)} - Y) \cdot g'(Z^{(output)}) \)

For the hidden layers:

\( \delta^{(l)} = \left(\theta^{(l+1)}\right)^{T} \cdot \delta^{(l+1)} \cdot g'(Z^{(l)}) \)

The weight gradients are then obtained by multiplying each layer's error with the activations feeding into it, \( \frac{\partial J}{\partial \theta^{(l)}} = \delta^{(l)} \cdot \left(A^{(l-1)}\right)^{T} \). These gradients are used to update the weights during training. Here's the Python code for backward propagation:

def backward_prop(Z2, A2, Z3, A3, X, Y, theta2):
    Y_onehot = Y_convert(Y)  # One-hot encode the labels

    # Output layer error: (A3 - Y) multiplied elementwise by sigmoid'(Z3)
    s_del3 = A3 - Y_onehot
    s_del3 = np.multiply(s_del3, deriv_sigmoid(Z3))

    # Gradient for theta2
    b_del2 = s_del3.dot(A2.transpose())

    # Hidden layer error: propagate s_del3 back through theta2 (dropping the bias row)
    temp1 = (theta2.transpose()).dot(s_del3)
    temp2 = temp1[1:n_hidden+1, :]
    s_del2 = np.multiply(temp2, deriv_ReLu(Z2))

    # Gradient for theta1
    b_del1 = s_del2.dot(X.transpose())

    return b_del1, b_del2
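
The Y_convert helper used above one-hot encodes the label vector so it can be compared elementwise with A3. A minimal sketch:

def Y_convert(Y):
    # One-hot encode: column i gets a 1 in the row given by label Y[i]
    Y_onehot = np.zeros((n_output, Y.size))
    Y_onehot[Y, np.arange(Y.size)] = 1
    return Y_onehot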
        

8. Gradient Descent

Gradient descent is used to minimize the cost function by updating the weights. The weight update rule is:

\( \theta^{(l)} = \theta^{(l)} - \alpha \cdot \frac{\partial J}{\partial \theta^{(l)}} \)

Where \( \alpha \) is the learning rate, and \( \frac{\partial J}{\partial \theta^{(l)}} \) is the gradient for layer \( l \). Here's the Python code for gradient descent:

def update_theta(theta1, theta2, b_del1, b_del2, alpha):
    theta1 = theta1 - (1/m_train) * alpha * b_del1
    theta2 = theta2 - (1/m_train) * alpha * b_del2
    return theta1, theta2

def gradient_descent(X, Y, alpha, iterations):
    theta1, theta2 = init_param()

    for i in range(iterations):
        A2, Z2, A3, Z3 = forward_prop(X, theta1, theta2)
        b_del1, b_del2 = backward_prop(Z2, A2, Z3, A3, X, Y, theta2)
        theta1, theta2 = update_theta(theta1, theta2, b_del1, b_del2, alpha)

        if i % 10 == 0:
            print("Iteration: ", i)
            predictions = get_predictions(A3)
            print(get_accuracy(predictions, Y))

    return theta1, theta2
        

9. Model Training

We initialize the parameters and train the model using gradient descent. Note that gradient_descent calls get_predictions and get_accuracy (defined in section 10) to report training accuracy, so make sure those helpers are defined before running this cell:


# Optimizing the parameters
theta1, theta2 = gradient_descent(X_train, Y_train, 0.9, 500)
        

10. Evaluating the Model

Finally, we evaluate the trained model by calculating its accuracy on the test data. First, two helper functions: get_predictions picks the class with the highest output for each example, and get_accuracy compares predictions with the true labels:


def get_predictions(A3):
    return np.argmax(A3, 0)

def get_accuracy(predictions, Y):
    return np.sum(predictions == Y) / Y.size
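
With these helpers and the trained parameters, the test accuracy can be computed by running a forward pass over the test set (a minimal sketch, reusing forward_prop from section 6):

# Forward pass on the held-out test set, then compare predictions with the true labels
_, _, A3_test, _ = forward_prop(X_test, theta1, theta2)
print("Test accuracy:", get_accuracy(get_predictions(A3_test), Y_test))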
        

Thank you for reading! If you have any questions or suggestions, feel free to reach out.
