Zero to LLM — Linear Algebra for Machine Learning

You use linear algebra every time you open a photo on your phone.

That image is stored as a 3D array of numbers — height × width × colour channels. When a filter sharpens it, that's a matrix operation. When your face is detected, that's matrix multiplication, hundreds of times per second.

Neural networks are built from exactly the same machinery. Every layer, every weight, every forward pass — it's all linear algebra. Before you write a single line of neural network code, this is the language you need to speak.

By the end of this article, you'll understand: scalars, vectors, matrices, and tensors; the operations that connect them; and — most importantly — why any of this matters for ML. Real-world scenarios throughout. No hand-waving.

What You'll Need

Finish Article 01 first. You need NumPy fluency and the idea of array-thinking before this lands properly.

1. Scalars — One Number, One Meaning

A scalar is a single number. That's it.

Temperature: 38.5°C. Price: ₹499. Loss after training step 1: 2.31.

In code:

import numpy as np

loss = np.array(2.31)   # scalar, shape: ()
lr   = np.array(0.01)   # learning rate — also a scalar

scaled = loss * lr      # scalar * scalar = scalar (just the arithmetic, not a real update rule)

Why ML cares: Your training loop's goal is to drive a scalar — the loss — down toward zero. Everything else in deep learning is in service of that one number.

2. Vectors — A List of Numbers That Means Something

A vector is an ordered list of scalars. The order matters. The length is fixed.

Real-world scenario: A user profile

Imagine you're building a recommendation system. Each user is described by five numbers:

[age, hours_per_week, avg_session_length, num_purchases, days_since_last_visit]

That's a vector of length 5. Every user becomes a point in 5-dimensional space. Users who are similar to each other will sit close together in that space. That's the geometric intuition behind recommendation engines, search, and word embeddings.

user_A = np.array([25, 10, 30, 3, 2])
user_B = np.array([26,  9, 28, 4, 1])
user_C = np.array([60,  1,  5, 0, 90])

# How similar are A and B vs A and C?
# We'll answer this precisely when we get to dot products.

Accessing elements

x = np.arange(4)      # [0, 1, 2, 3]
x[0]                  # first element: 0
x[-1]                 # last element: 3
x[1:3]                # slice: [1, 2]

In math, vectors are written as bold lowercase: x. The ith element is written xᵢ (not bold, because it's a scalar).

Length vs. dimensionality

People use these words interchangeably and it causes confusion. Be precise:

Length of a vector = number of elements. len(x) gives you this.
Dimensionality of a tensor = number of axes. A vector has 1 axis.

x = np.array([10, 20, 30, 40])
len(x)      # 4 — the length
x.shape     # (4,) — one axis, length 4
x.ndim      # 1 — one axis (dimensionality of the tensor)

3. Matrices — A Table of Numbers

A matrix is a 2D grid of scalars: rows × columns.

Real-world scenario: A dataset

Every tabular dataset is a matrix. Each row is one data example. Each column is one feature.

age	income	credit_score	default?
25	45000	720	0
42	82000	650	1
33	60000	700	0

In code, this is a matrix with shape (3, 4) — 3 rows, 4 columns.

dataset = np.array([
    [25, 45000, 720, 0],
    [42, 82000, 650, 1],
    [33, 60000, 700, 0]
])

dataset.shape   # (3, 4)
dataset[1, 2]   # row 1, column 2: 650 (credit score of second person)
dataset[:, 0]   # all rows, column 0: ages → [25, 42, 33]

Notation

We write A ∈ ℝᵐˣⁿ for a matrix with m rows and n columns. The element at row i, column j is written aᵢⱼ.

A = np.arange(20).reshape(5, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]
#  [12 13 14 15]
#  [16 17 18 19]]

A.shape    # (5, 4)
A[2, 1]    # row 2, col 1: 9

The Transpose

Flipping a matrix across its diagonal — rows become columns, columns become rows.

If A has shape (5, 4), then Aᵀ has shape (4, 5).

A.T
# [[ 0  4  8 12 16]
#  [ 1  5  9 13 17]
#  [ 2  6 10 14 18]
#  [ 3  7 11 15 19]]

When you'll use it: Constantly. Matrix multiplication requires compatible shapes, and transposing is how you fix mismatches. You'll write .T hundreds of times before you finish Stage 1.
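As a small illustration of the transpose fixing a shape mismatch, here is a minimal sketch. The array names `features` and `weights` and their shapes are made up for this example. The `@` operator is matrix multiplication, which the article introduces properly in later sections.

```python
import numpy as np

features = np.ones((5, 4))   # 5 examples, 4 features each (hypothetical)
weights = np.ones((3, 4))    # 3 outputs, 4 inputs each (hypothetical layout)

# (5, 4) @ (3, 4) does not line up: the inner sizes are 4 and 3.
try:
    features @ weights
    mismatch = False
except ValueError:
    mismatch = True

# Transposing weights gives shape (4, 3), so the inner sizes now match.
result = features @ weights.T
print(mismatch, result.shape)   # True (5, 3)
```

The fix here is a single `.T`; the data itself is untouched, only the way its axes are read.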

Symmetric Matrices

A matrix where A = Aᵀ. Element at (i, j) equals element at (j, i). In plain terms: the matrix looks the same when you flip it across its diagonal.

B = np.array([[1, 2, 3],
              [2, 0, 4],
              [3, 4, 5]])

np.all(B == B.T)   # True — it's symmetric

Covariance matrices — which measure how much pairs of features in your data move together — are always symmetric. For example, if tall people also tend to be heavier, that relationship shows up as a non-zero value in the covariance matrix. You'll encounter these properly when we reach PCA and statistics. For now, just know they're always symmetric, and that's why symmetric matrices matter in ML.
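To make the height and weight example concrete, the sketch below builds a covariance matrix from a handful of invented measurements and checks that it is symmetric. The numbers are illustrative only, not real data.

```python
import numpy as np

# Made-up heights (cm) and weights (kg) for five people.
height = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight = np.array([50.0, 58.0, 69.0, 80.0, 92.0])

# np.cov treats each row as one variable, so stacking gives a 2x2 matrix.
cov = np.cov(np.stack([height, weight]))

print(cov.shape)                 # (2, 2)
print(np.allclose(cov, cov.T))   # True: cov[0, 1] equals cov[1, 0]
print(cov[0, 1] > 0)             # True: taller goes with heavier in this toy data
```

The off-diagonal entry is the "move together" value described above, and it appears twice, once on each side of the diagonal.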

4. Tensors — Arrays with Any Number of Axes

Scalars, vectors, and matrices are all special cases of a tensor. A tensor is simply an n-dimensional array.

Object	Axes	Example shape	Real example
Scalar	0	`()`	A single loss value
Vector	1	`(512,)`	A word embedding
Matrix	2	`(1000, 20)`	Dataset: 1000 users, 20 features
Tensor	3	`(32, 28, 28)`	Batch of 32 grayscale images
Tensor	4	`(32, 3, 224, 224)`	Batch of 32 RGB images

Real-world scenario: Images

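A single colour photo is height × width × 3 (red, green, blue). A batch of 32 such photos is 32 × height × width × 3. This is a 4D tensor. Some libraries put the channel axis first instead, giving 32 × 3 × height × width, which is the layout shown in the table above.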
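A quick sketch of the two common image layouts, assuming a hypothetical batch of 32 images of 224 × 224 pixels. The variable names are invented for this example.

```python
import numpy as np

# Hypothetical batch: 32 RGB images, channels-last layout (batch, height, width, channels).
batch_hwc = np.zeros((32, 224, 224, 3), dtype=np.uint8)
print(batch_hwc.ndim)     # 4

# Moving the channel axis to position 1 gives the channels-first layout
# (batch, channels, height, width) used in the table above.
batch_chw = batch_hwc.transpose(0, 3, 1, 2)
print(batch_chw.shape)    # (32, 3, 224, 224)
```

Both arrays hold the same pixels. Which layout you need depends on the library you are feeding them into.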

X = np.arange(24).reshape(2, 3, 4)
# Two "slabs", each 3 rows × 4 columns

X.shape   # (2, 3, 4)
X.ndim    # 3
len(X)    # 2 — length along the FIRST axis

# Accessing: X[slab, row, col]
X[0, 1, 2]   # 6

Important: len() on a tensor always gives the size of the first axis, not the total number of elements. For total elements, use .size.

X.size    # 24 — total elements

5. Elementwise Operations — Do the Same Thing to Every Cell

An elementwise operation applies the same operation to matching positions across two same-shaped tensors.

A = np.arange(20).reshape(5, 4)
B = A.copy()

A + B     # add each pair: [[0, 2, 4, 6], [8, 10, ...], ...]
A * B     # multiply each pair — this is the Hadamard product

The Hadamard Product (⊙)

Elementwise multiplication of two matrices. Not matrix multiplication. They look different and mean completely different things.

A ⊙ B:  a₁₁·b₁₁  a₁₂·b₁₂  ...
         a₂₁·b₂₁  a₂₂·b₂₂  ...

A * B
# [[ 0,   1,   4,   9],
#  [16,  25,  36,  49],
#  ...]

It has a formal name — the Hadamard product — because mathematicians like naming things. You'll see the symbol ⊙ in research papers. When you do, it just means "multiply the matching positions together." Nothing fancier than that.

When you'll see it: Gating mechanisms in LSTMs, attention masking, certain activation computations.
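A simplified stand-in for masking, assuming a hypothetical 0/1 mask. Real attention implementations often mask in a different way, so treat this only as a picture of what "multiply the matching positions" does.

```python
import numpy as np

scores = np.array([[0.2, 0.5, 0.3],
                   [0.6, 0.1, 0.3]])

# Hypothetical mask: 1 keeps a position, 0 blanks it out.
mask = np.array([[1, 1, 0],
                 [1, 0, 0]])

masked = scores * mask   # Hadamard product: matching positions are multiplied
print(masked)
# [[0.2 0.5 0. ]
#  [0.6 0.  0. ]]
```

Gates in an LSTM follow the same pattern, except the mask values are learned numbers between 0 and 1 rather than hard zeros and ones.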

Scalar Operations

Adding or multiplying every element by a single number:

a = 2
X = np.arange(24).reshape(2, 3, 4)

a + X     # adds 2 to every element
a * X     # multiplies every element by 2
(a * X).shape   # (2, 3, 4) — shape unchanged

6. Reduction — Collapsing Dimensions Down

Sometimes you want one number from many. Reduction is the operation of collapsing along one or more axes.

Sum

x = np.arange(4)    # [0, 1, 2, 3]
x.sum()             # 6 — scalar

For a matrix, you choose which axis to collapse:

A = np.arange(20).reshape(5, 4)
# 5 rows, 4 columns

A.sum(axis=0)   # collapse rows → shape (4,)  — sum down each column
A.sum(axis=1)   # collapse columns → shape (5,) — sum across each row
A.sum()         # collapse everything → scalar: 190

Real-world scenario: Class probabilities

After the final layer of a classifier, you have a (batch_size, num_classes) matrix — one row per example, one column per class. The raw numbers the network outputs before converting to probabilities are called logits. Think of them as unnormalised scores — higher means more confident, but they're not percentages yet. The network hasn't decided "I'm 80% sure" — it's just said "class 1 scores 3.4 and class 0 scores 0.5." We turn those into probabilities later using softmax. To find the predicted class for each example:

logits = np.array([[2.1, 0.3, -1.2],   # example 1
                   [0.5, 3.4,  0.1]])   # example 2

predicted_class = logits.argmax(axis=1)   # [0, 1]
# Example 1 → class 0 (highest logit: 2.1)
# Example 2 → class 1 (highest logit: 3.4)
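For the curious, here is a minimal sketch of the softmax step mentioned above, built only from reductions. Subtracting the row maximum is a common numerical-stability trick added here as an assumption, not something the article prescribes. The `keepdims=True` argument is explained in the Keeping Dimensions section below.

```python
import numpy as np

logits = np.array([[2.1, 0.3, -1.2],
                   [0.5, 3.4,  0.1]])

# Subtracting each row's maximum does not change the result,
# but it keeps np.exp from overflowing on large logits.
shifted = logits - logits.max(axis=1, keepdims=True)
exp = np.exp(shifted)
probs = exp / exp.sum(axis=1, keepdims=True)   # each row now sums to 1

print(probs.sum(axis=1))       # each entry is 1, up to floating-point rounding
print(probs.argmax(axis=1))    # [0 1], the same winners as the raw logits
```

Softmax changes the scale of the scores but not their order, which is why argmax gives the same answer before and after.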

Mean

A.mean()           # 9.5 — same as A.sum() / A.size
A.mean(axis=0)     # column-wise mean: [ 8.,  9., 10., 11.]

When you'll see it: Loss functions reduce a batch of per-example losses to a single scalar using .mean(). Every training step ends with this.
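A minimal sketch of that reduction, assuming squared error as the per-example loss and using made-up predictions and targets:

```python
import numpy as np

# Made-up predictions and targets for a batch of four examples.
predictions = np.array([2.5, 0.0, 2.0, 8.0])
targets = np.array([3.0, -0.5, 2.0, 7.0])

per_example_loss = (predictions - targets) ** 2   # shape (4,): one loss per example
batch_loss = per_example_loss.mean()              # one number for the whole batch

print(per_example_loss)   # [0.25 0.25 0.   1.  ]
print(batch_loss)         # 0.375
```

The vector of four losses collapses to the single scalar that the training loop tries to push down.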

Keeping Dimensions (keepdims)

Sometimes you want to reduce but keep the shape intact for the next operation.

sum_A = A.sum(axis=1, keepdims=True)
sum_A.shape   # (5, 1) — not (5,)

# Now you can divide each row by its own sum (normalise rows)
A / sum_A     # shape: (5, 4) ÷ (5, 1)

NumPy automatically stretches the smaller shape (5, 1) to match the larger one (5, 4) before doing the division — this is called broadcasting. In practice, the (5, 1) column gets duplicated across all 4 columns so the shapes align. Without keepdims=True, you'd get shape (5,) and the division would fail or give wrong results. We'll cover broadcasting fully in Article X, but for now: if shapes look mismatched but the operation works, broadcasting is why.
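The sketch below shows both outcomes side by side. It uses a matrix with values 1 to 20 (an assumption made here so that no row sums to zero).

```python
import numpy as np

A = np.arange(1, 21).reshape(5, 4)   # values 1..20

with_keepdims = A.sum(axis=1, keepdims=True)    # shape (5, 1)
without_keepdims = A.sum(axis=1)                # shape (5,)

rows_normalised = A / with_keepdims             # (5, 4) / (5, 1) broadcasts per row
print(rows_normalised.sum(axis=1))              # every row sums to 1, up to rounding

# (5, 4) / (5,) cannot broadcast: the trailing sizes 4 and 5 do not match.
try:
    A / without_keepdims
    failed = False
except ValueError:
    failed = True
print(failed)   # True
```

With a square matrix the second division would not fail. It would silently divide each column, not each row, by the sums, which is the "wrong results" case.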

Cumulative Sum

A.cumsum(axis=0)
# Each row is the running total of all rows above it plus itself.
# Useful in sequence models and certain sampling algorithms.
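One sampling use, sketched under assumptions: a made-up probability vector and a fixed number `u` standing in for a uniform random draw. The technique is usually called inverse-CDF sampling.

```python
import numpy as np

probs = np.array([0.1, 0.2, 0.3, 0.4])   # made-up distribution over 4 outcomes
cdf = probs.cumsum()                      # running total: roughly [0.1, 0.3, 0.6, 1.0]

# The first position whose running total reaches u is the sampled outcome.
u = 0.45                                  # stand-in for a uniform draw in [0, 1)
sample = int(np.searchsorted(cdf, u))
print(sample)   # 2, because 0.3 < 0.45 <= 0.6
```

Outcomes with larger probability cover a wider slice of the running total, so a random `u` lands on them more often.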

7. The Dot Product — What Every Neuron Computes

Given two vectors x and y of the same length, their dot product is:

xᵀy = x₁y₁ + x₂y₂ + ... + xₙyₙ

Multiply matching elements, sum everything up. One scalar out.

w = np.array([0.5, 0.3, -0.2])   # weights
x = np.array([1.0, 4.0,  2.0])   # input features

np.dot(w, x)   # 0.5*1.0 + 0.3*4.0 + (-0.2)*2.0 = 1.3

Real-world scenario: A single neuron

That's literally all a neuron does. It takes its weights and an input, computes a dot product, then passes the result through an activation function. Every neuron in every neural network.
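A minimal sketch of one neuron, reusing the weights and inputs from above. The helper name `neuron`, the bias argument `b`, and the choice of ReLU as the activation are assumptions made for this illustration.

```python
import numpy as np

def neuron(w, x, b=0.0):
    """One neuron: weighted sum of inputs plus bias, passed through ReLU."""
    z = np.dot(w, x) + b        # the dot product gives a single scalar
    return np.maximum(0.0, z)   # ReLU: negatives become 0, positives pass through

w = np.array([0.5, 0.3, -0.2])
x = np.array([1.0, 4.0, 2.0])

print(neuron(w, x))            # about 1.3
print(neuron(w, x, b=-2.0))    # 0.0, because 1.3 - 2.0 is negative
```

Swapping in a different activation changes only the last line of the function; the dot product stays the same.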

Dot product as similarity

Before comparing directions, we first normalise each vector — we scale it so its length becomes exactly 1. This is called a unit vector. Think of it like this: you don't care how loud someone's voice is, only which direction it's pointing. Normalising strips away the volume and keeps the direction. Dividing by np.linalg.norm(v) does this in one line.

After normalising two vectors to unit length, their dot product equals the cosine of the angle between them:

Dot product = 1: vectors point in the same direction → very similar
Dot product = 0: vectors are perpendicular → unrelated
Dot product = −1: vectors point opposite directions → opposite
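The three cases can be checked with small hand-picked vectors. The helper name `unit` is made up for this sketch.

```python
import numpy as np

def unit(v):
    """Scale v to length 1 (assumes v is not the zero vector)."""
    return v / np.linalg.norm(v)

a = unit(np.array([3.0, 4.0]))
same = unit(np.array([6.0, 8.0]))        # same direction, twice as long
perp = unit(np.array([-4.0, 3.0]))       # perpendicular to a
opposite = unit(np.array([-3.0, -4.0]))  # opposite direction

print(np.linalg.norm(a))      # 1.0, up to rounding
print(np.dot(a, same))        # about 1
print(np.dot(a, perp))        # about 0
print(np.dot(a, opposite))    # about -1
```

Note that `[6, 8]` is twice as long as `[3, 4]`, yet after normalising the two are indistinguishable: only direction survives.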

# Are user_A and user_B similar?
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

user_A = np.array([25, 10, 30, 3, 2], dtype=float)
user_B = np.array([26,  9, 28, 4, 1], dtype=float)
user_C = np.array([60,  1,  5, 0, 90], dtype=float)

cosine_similarity(user_A, user_B)   # ≈ 0.998 — very similar
cosine_similarity(user_A, user_C)   # ≈ 0.42  — quite different

When you'll see it: Word embeddings, attention scores, recommendation systems, nearest-neighbour search. All of them build on dot products, often in the form of cosine similarity.

8. Matrix-Vector Products — Transforming Space

A matrix A ∈ ℝᵐˣⁿ applied to a vector x ∈ ℝⁿ produces a new vector in ℝᵐ.

Think of it as a transformation: you're mapping x from n-dimensional space into m-dimensional space.

Ax = [a₁ᵀx, a₂ᵀx, ..., aₘᵀx]

Each element of the result is a dot product of one row of A with x.

A = np.arange(20).reshape(5, 4)   # 5×4 matrix
x = np.array([1., 2., 3., 4.])    # length-4 vector

np.dot(A, x)
# [ 20.  60. 100. 140. 180.]  ← length-5 vector
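To see the "one dot product per row" description in action, this sketch rebuilds the same result with an explicit loop over rows and compares it with the built-in product.

```python
import numpy as np

A = np.arange(20).reshape(5, 4)
x = np.array([1., 2., 3., 4.])

# Element i of the result is the dot product of row i of A with x.
row_by_row = np.array([np.dot(A[i], x) for i in range(A.shape[0])])

print(row_by_row)                         # [ 20.  60. 100. 140. 180.]
print(np.allclose(row_by_row, A @ x))     # True
```

The loop is only for understanding. In real code you would always use the single product.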

Real-world scenario: One neural network layer

A fully-connected layer is exactly this:

output = activation(W @ input + b)

Where W is the weight matrix, input is the activation from the previous layer, and b is the bias vector. The @ operator is matrix multiplication — it's equivalent to np.dot for 2D arrays, just cleaner to write. Every forward pass through a dense layer is a matrix-vector product.

W = np.random.randn(128, 64)   # 128 neurons, 64 inputs each
x = np.random.randn(64)         # one input vector

pre_activation = np.dot(W, x)  # shape: (128,)

# ReLU activation: if the number is negative, replace it with zero.
# If it's positive, keep it as-is. That's the entire operation.
# Neural networks use it because real-world relationships aren't linear
# — you need something that can "switch off."
output = np.maximum(0, pre_activation)  # shape: (128,)

Shape rule: For A (m × n) and x (n,) → result is (m,). The inner dimensions must match.

9. Matrix-Matrix Multiplication — The Core of Deep Learning

If you can do matrix-vector products, you already understand this. Matrix-matrix multiplication is just doing many matrix-vector products at once.

C = AB where A ∈ ℝⁿˣᵏ and B ∈ ℝᵏˣᵐ → C ∈ ℝⁿˣᵐ

Element cᵢⱼ is the dot product of row i of A with column j of B.

A = np.arange(20).reshape(5, 4)   # 5×4
B = np.ones((4, 3))               # 4×3

C = np.dot(A, B)    # 5×3
# [[ 6.  6.  6.]
#  [22. 22. 22.]
#  [38. 38. 38.]
#  [54. 54. 54.]
#  [70. 70. 70.]]
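The definition can be checked entry by entry. This sketch swaps in a non-constant B (an assumption made here so the columns differ) and confirms both views: one entry is a dot product, and one column is a matrix-vector product.

```python
import numpy as np

A = np.arange(20).reshape(5, 4)
B = np.arange(12).reshape(4, 3)   # non-constant B, chosen so the columns differ

C = A @ B

# One entry: row 2 of A dotted with column 1 of B.
i, j = 2, 1
print(C[i, j] == np.dot(A[i, :], B[:, j]))   # True

# One column: a matrix-vector product of A with column j of B.
print(np.array_equal(C[:, j], A @ B[:, j]))  # True
```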

Real-world scenario: Processing a whole batch at once

In practice, you rarely process one example at a time. You stack a whole batch of inputs into a matrix and multiply once.

W = np.random.randn(128, 64)    # weight matrix: 128 outputs, 64 inputs
X = np.random.randn(32, 64)     # batch of 32 examples, each length 64

# Process all 32 examples in one shot
# Note: A @ B and np.dot(A, B) are equivalent for 2D arrays
output = X @ W.T    # shape: (32, 128)

One matrix multiplication. 32 examples. This is why GPUs are fast at deep learning — they're designed for exactly this.
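A small check that the batched product really is 32 matrix-vector products done at once. The seeded random generator is an assumption added here so the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded so the run is repeatable
W = rng.standard_normal((128, 64))
X = rng.standard_normal((32, 64))

batched = X @ W.T                        # one multiplication, shape (32, 128)
looped = np.stack([W @ x for x in X])    # 32 separate matrix-vector products

print(batched.shape, looped.shape)       # (32, 128) (32, 128)
print(np.allclose(batched, looped))      # True
```

The numbers match; the difference is that the batched form hands the whole job to optimised matrix routines in a single call.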

Shape rule: For A (n × k) and B (k × m) → result is (n × m). Inner dimensions must match.

(n × k) @ (k × m) = (n × m)
     ↑     ↑
   must match

10. Norms — How Big Is a Vector?

A norm measures the magnitude (size) of a vector. Not its dimensionality — how large its values are.

Formally, a norm f must satisfy:

f(αx) = |α| f(x) — scaling a vector scales its norm
f(x + y) ≤ f(x) + f(y) — triangle inequality
f(x) ≥ 0 — always non-negative
f(x) = 0 only if x = 0
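These properties can be spot-checked numerically. The sketch below does so for the L2 norm (defined in the next subsection) on two random vectors. It is a sanity check on one example, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
y = rng.standard_normal(5)
alpha = -3.0

f = np.linalg.norm   # the L2 norm by default for vectors

print(np.isclose(f(alpha * x), abs(alpha) * f(x)))   # True: scaling
print(f(x + y) <= f(x) + f(y))                        # True: triangle inequality
print(f(x) >= 0)                                      # True: non-negativity
print(f(np.zeros(5)) == 0)                            # True: the zero vector has norm 0
```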

L2 Norm (Euclidean distance)

The straight-line distance from the origin to the point x:

‖x‖₂ = √(x₁² + x₂² + ... + xₙ²)

u = np.array([3., -4.])
np.linalg.norm(u)   # √(9 + 16) = √25 = 5.0

This is Pythagoras's theorem in any number of dimensions.

When you'll see it: L2 regularisation (weight decay) adds ‖w‖₂² to the loss, penalising large weights to prevent overfitting.
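A minimal sketch of that penalty. The weights, the data loss, and the regularisation strength `lam` are all made-up values for illustration.

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])   # made-up weights
data_loss = 0.8                  # made-up loss from the data term
lam = 0.01                       # hypothetical regularisation strength

l2_penalty = np.sum(w ** 2)      # squared L2 norm: 0.25 + 1.0 + 4.0 = 5.25
total_loss = data_loss + lam * l2_penalty

print(l2_penalty)    # 5.25
print(total_loss)    # about 0.8525
```

The weight of 2.0 contributes 4.0 of the 5.25 penalty, so the optimiser has the most to gain by shrinking the largest weights.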

L1 Norm (Manhattan distance)

Sum of absolute values:

‖x‖₁ = |x₁| + |x₂| + ... + |xₙ|

np.abs(u).sum()   # 3 + 4 = 7

When you'll see it: L1 regularisation (Lasso) encourages sparse weights — many weights pushed to exactly zero. Useful for feature selection.

The Difference That Matters

Both L1 and L2 regularise, but differently:

L2 penalises large values heavily (squares them). Spreads weight across all features.
L1 penalises each weight in proportion to its absolute value, so small weights are pushed just as hard as large ones. Drives some weights to zero. Creates sparsity.

# Visualising the difference
w = np.array([0.1, 0.1, 0.1, 5.0])

l2 = np.linalg.norm(w)            # dominated by the 5.0
l1 = np.abs(w).sum()              # 5.0 just contributes its value

print(f"L2: {l2:.2f}")   # 5.00 — almost entirely driven by the 5.0
print(f"L1: {l1:.2f}")   # 5.30 — 5.0 contributes proportionally
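One common way to see why the two penalties behave differently is to look at their gradients, which is the "push" each weight feels during training. The sketch below uses the squared L2 norm and the L1 norm on the same weights. It is an illustration of the idea, not a full treatment.

```python
import numpy as np

w = np.array([0.1, 0.1, 0.1, 5.0])

# Gradient of sum(w_i ** 2): proportional to each weight, so small weights get a tiny push.
grad_l2_squared = 2 * w

# (Sub)gradient of sum(|w_i|): the sign of each weight, so every nonzero weight gets the same push.
grad_l1 = np.sign(w)

print(grad_l2_squared)   # 0.2 for the small weights, 10.0 for the large one
print(grad_l1)           # 1.0 for every weight
```

Under L2 the pull on a weight fades as it nears zero, so it rarely gets all the way there. Under L1 the pull stays constant, which is what lets small weights reach exactly zero.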

Frobenius Norm — L2 for Matrices

The same idea as the L2 norm, but applied to every element of a matrix instead of a vector. You square every element, sum them all up, and take the square root:

‖X‖_F = √(Σᵢⱼ xᵢⱼ²)

np.linalg.norm(np.ones((4, 9)))   # √36 = 6.0

When you'll see it: Gradient clipping — during training, if the Frobenius norm of the gradient matrix exceeds a threshold, you scale it down proportionally. This prevents exploding gradients: a situation where gradient values grow uncontrollably large and destabilise training. Common in RNNs.
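A minimal sketch of clipping by norm, assuming a single gradient matrix and a hypothetical helper named `clip_by_norm`. Deep learning libraries ship their own versions of this; the code below only shows the arithmetic.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Scale grad down so its Frobenius norm is at most max_norm."""
    norm = np.linalg.norm(grad)   # Frobenius norm for a 2D array
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

grad = np.full((4, 9), 10.0)                 # made-up gradient with norm 60
clipped = clip_by_norm(grad, max_norm=5.0)

print(np.linalg.norm(grad))      # 60.0
print(np.linalg.norm(clipped))   # 5.0, up to rounding
```

Every element is scaled by the same factor, so the gradient keeps its direction and only its size changes.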

Putting It All Together: A Minimal Neural Layer from Scratch

Here are the core pieces of this article working together in about 20 lines of code:

import numpy as np

# A single fully-connected layer
# Input: batch of 4 examples, each with 3 features
# Output: each example mapped to 2 values

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [1.0, 0.0, 2.0]])   # shape: (4, 3)

W = np.random.randn(2, 3)          # weight matrix: 2 outputs, 3 inputs
b = np.zeros(2)                    # bias vector: shape (2,)

# Forward pass: matrix multiplication + bias
# (4, 3) @ (3, 2) = (4, 2), then bias (2,) is broadcast across each row
pre_activation = X @ W.T + b       # shape: (4, 2)

# ReLU: replace negatives with zero, keep positives as-is
output = np.maximum(0, pre_activation)   # shape: (4, 2)

# Compute mean L2 norm of outputs (as a rough "size" measure)
norms = np.linalg.norm(output, axis=1)   # one norm per example
print("Output norms:", norms)
print("Mean output norm:", norms.mean())

Walk through the shapes at each step. That habit will save you hours of debugging.

Summary

Concept	What it is	ML role
Scalar	Single number	Loss, learning rate
Vector	1D array	Embedding, one example, bias
Matrix	2D array	Dataset, weight matrix
Tensor	nD array	Batches of images, sequences
Transpose	Flip rows/cols	Fix shape mismatches
Elementwise ops	Same-shaped tensors	Hadamard product, activation functions
Hadamard product	Elementwise matrix multiply (⊙)	Gating, attention masking
Reduction	Collapse axes	Loss, batch norm, softmax
Dot product	Weighted sum → one scalar	What one neuron computes
Matrix-vector product	Transform a vector	One forward pass, one example
Matrix multiplication (`@`)	Transform a batch	One forward pass, whole batch
Broadcasting	Auto-stretch smaller shape to match larger	Division, bias addition
ReLU	Replace negatives with zero	Non-linearity in layers
Logits	Raw unnormalised scores from final layer	Input to softmax
Unit vector	Vector normalised to length 1	Cosine similarity
L2 norm	Euclidean length	Weight decay, gradient clipping
L1 norm	Sum of absolutes	Sparsity, Lasso
Frobenius norm	L2 for matrices	Gradient clipping

Exercises

Shapes and indexing

Create a (3, 4) matrix using np.arange(12).reshape(3, 4). Extract: the second row; the third column; the element at row 1, column 2.
Given A with shape (5, 4), what is the shape of A.sum(axis=0)? Of A.sum(axis=1)? Work it out by hand first, then verify.
Create a 3D tensor of shape (2, 3, 4). What does len(X) return? Why? What does X.size return?

Transpose and symmetry

Prove to yourself that (Aᵀ)ᵀ = A using NumPy. Create any (3, 5) matrix, transpose it twice, and confirm equality.
Create any (4, 4) matrix A. Is A + Aᵀ always symmetric? Test it, then explain why this must be true mathematically.

Dot products and norms

Create two vectors: a = np.array([1., 2., 3.]) and b = np.array([4., 5., 6.]). Compute their dot product two ways: with np.dot and with (a * b).sum(). Confirm they match.

Compute the cosine similarity between these three word vectors. Which two words are most semantically similar?

king   = np.array([0.3, 0.9, 0.1, 0.7])
queen  = np.array([0.3, 0.8, 0.2, 0.6])
apple  = np.array([0.9, 0.1, 0.8, 0.1])

Matrix multiplication

Given A (4 × 3) and B (3 × 5), what is the shape of AB? Given A (4 × 3) and B (4 × 3), can you multiply them? Why or why not?
Write a forward pass for a two-layer neural network with no activation function. Input: (32, 10). Layer 1: (10 → 64). Layer 2: (64 → 5). What is the shape of the output?

Reduction

Create a (5, 4) matrix. Normalise each row so its elements sum to 1 (each row becomes a probability distribution). Use keepdims=True in your sum.

Before You Move On

Make sure you can do these without looking anything up:

State the shape of a matrix-matrix product given the input shapes
Explain the difference between the Hadamard product (⊙) and matrix multiplication (@)
Write a matrix-vector product in NumPy and state the output shape
Explain what axis=0 vs axis=1 does in .sum()
Compute the L1 and L2 norm of a vector by hand and in NumPy
Describe what a neural network layer is doing in terms of linear algebra operations
Explain in one sentence what ReLU does and why networks need it
Explain the difference between logits and probabilities

Resources

3Blue1Brown — "Essence of Linear Algebra" on YouTube. Watch episodes 1–4 before or alongside this article. The visual intuition is invaluable.

D2L.ai — Chapter 2.3 covers this material with runnable MXNet code. Read it after this article for a second pass with different notation.

Gilbert Strang — Introduction to Linear Algebra — The standard university textbook. Chapters 1–3 if you want depth.

NumPy docs — np.dot, np.linalg.norm, broadcasting rules. Keep this tab open.

Next: Article 03 — Probability & Statistics for Deep Learning. Models are probability machines. You'll learn distributions, expectations, Bayes' theorem, and the statistical view of loss functions. By the end, you'll understand why cross-entropy loss is the natural choice for classification.

Series: Zero to LLM | Article 02 of the series
