Zero to LLM — Article 02: Linear Algebra for Machine Learning: Vectors, Matrices & Neural Networks Explained
Series: Zero to LLM | Stage: 1 — Foundations | Topic: Linear Algebra

You use linear algebra every time you open a photo on your phone.
That image is stored as a 3D array of numbers — height × width × colour channels. When a filter sharpens it, that's a matrix operation. When your face is detected, that's matrix multiplication, hundreds of times per second.
Neural networks are built from exactly the same machinery. Every layer, every weight, every forward pass — it's all linear algebra. Before you write a single line of neural network code, this is the language you need to speak.
By the end of this article, you'll understand: scalars, vectors, matrices, and tensors; the operations that connect them; and — most importantly — why any of this matters for ML. Real-world scenarios throughout. No hand-waving.
What You'll Need
Finish Article 01 first. You need NumPy fluency and the idea of array-thinking before this lands properly.
1. Scalars — One Number, One Meaning
A scalar is a single number. That's it.
Temperature: 38.5°C. Price: ₹499. Loss after training step 1: 2.31.
In code:
import numpy as np
loss = np.array(2.31) # scalar, shape: ()
lr = np.array(0.01) # learning rate — also a scalar
scaled_loss = loss * lr # scalar * scalar = scalar (this is not a training update, just arithmetic)
Why ML cares: Your training loop's goal is to drive a scalar — the loss — down toward zero. Everything else in deep learning is in service of that one number.
2. Vectors — A List of Numbers That Means Something
A vector is an ordered list of scalars. The order matters. The length is fixed.
Real-world scenario: A user profile
Imagine you're building a recommendation system. Each user is described by five numbers:
[age, hours_per_week, avg_session_length, num_purchases, days_since_last_visit]
That's a vector of length 5. Every user becomes a point in 5-dimensional space. Users who are similar to each other will sit close together in that space. That's the geometric intuition behind recommendation engines, search, and word embeddings.
user_A = np.array([25, 10, 30, 3, 2])
user_B = np.array([26, 9, 28, 4, 1])
user_C = np.array([60, 1, 5, 0, 90])
# How similar are A and B vs A and C?
# We'll answer this precisely when we get to dot products.
Accessing elements
x = np.arange(4) # [0, 1, 2, 3]
x[0] # first element: 0
x[-1] # last element: 3
x[1:3] # slice: [1, 2]
In math, vectors are written as bold lowercase: x. The ith element is written xᵢ (not bold, because it's a scalar).
Length vs. dimensionality
People use these words interchangeably and it causes confusion. Be precise:
Length of a vector = number of elements. len(x) gives you this.
Dimensionality of a tensor = number of axes. A vector has 1 axis.
x = np.array([10, 20, 30, 40])
len(x) # 4 — the length
x.shape # (4,) — one axis, length 4
x.ndim # 1 — one axis (dimensionality of the tensor)
3. Matrices — A Table of Numbers
A matrix is a 2D grid of scalars: rows × columns.
Real-world scenario: A dataset
Every tabular dataset is a matrix. Each row is one data example. Each column is one feature.
| age | income | credit_score | default? |
|---|---|---|---|
| 25 | 45000 | 720 | 0 |
| 42 | 82000 | 650 | 1 |
| 33 | 60000 | 700 | 0 |
In code, this is a matrix with shape (3, 4) — 3 rows, 4 columns.
dataset = np.array([
[25, 45000, 720, 0],
[42, 82000, 650, 1],
[33, 60000, 700, 0]
])
dataset.shape # (3, 4)
dataset[1, 2] # row 1, column 2: 650 (credit score of second person)
dataset[:, 0] # all rows, column 0: ages → [25, 42, 33]
Notation
We write A ∈ ℝᵐˣⁿ for a matrix with m rows and n columns. The element at row i, column j is written aᵢⱼ.
A = np.arange(20).reshape(5, 4)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]
# [12 13 14 15]
# [16 17 18 19]]
A.shape # (5, 4)
A[2, 1] # row 2, col 1: 9
The Transpose
Flipping a matrix across its diagonal — rows become columns, columns become rows.
If A has shape (5, 4), then Aᵀ has shape (4, 5).
A.T
# [[ 0 4 8 12 16]
# [ 1 5 9 13 17]
# [ 2 6 10 14 18]
# [ 3 7 11 15 19]]
When you'll use it: Constantly. Matrix multiplication requires compatible shapes, and transposing is how you fix mismatches. You'll write .T hundreds of times before you finish Stage 1.
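Here is a small sketch of that "fix the mismatch with .T" pattern. The names `features` and `weights` are made up for illustration, and the `@` operator used here is explained properly in sections 8 and 9.

```python
import numpy as np

features = np.arange(6.0).reshape(2, 3)   # 2 examples, 3 features each
weights = np.ones((4, 3))                 # hypothetical layer: 4 outputs, 3 inputs each

# features @ weights fails: the inner dimensions are 3 and 4.
try:
    features @ weights
    mismatch = False
except ValueError:
    mismatch = True

# Transposing weights gives shape (3, 4), so the inner dimensions match.
result = features @ weights.T             # (2, 3) @ (3, 4) -> (2, 4)
print(mismatch, result.shape)
```

Reading the shapes before multiplying, as in the comments, is usually faster than decoding the error message afterwards.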
Symmetric Matrices
A matrix where A = Aᵀ. Element at (i, j) equals element at (j, i). In plain terms: the matrix looks the same when you flip it across its diagonal.
B = np.array([[1, 2, 3],
[2, 0, 4],
[3, 4, 5]])
np.all(B == B.T) # True — it's symmetric
Covariance matrices — which measure how much pairs of features in your data move together — are always symmetric. For example, if tall people also tend to be heavier, that relationship shows up as a non-zero value in the covariance matrix. You'll encounter these properly when we reach PCA and statistics. For now, just know they're always symmetric, and that's why symmetric matrices matter in ML.
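To see the symmetry for yourself, here is a minimal sketch using np.cov. The height and weight numbers are invented for illustration, not real data.

```python
import numpy as np

# Hypothetical measurements: each row is one person,
# columns are height (cm) and weight (kg).
people = np.array([[160.0, 55.0],
                   [170.0, 68.0],
                   [180.0, 80.0],
                   [175.0, 72.0]])

# rowvar=False tells np.cov that the columns are the variables (features).
cov = np.cov(people, rowvar=False)        # shape: (2, 2)

print(cov.shape)
print(np.allclose(cov, cov.T))            # symmetric: cov[0, 1] equals cov[1, 0]
print(cov[0, 1] > 0)                      # height and weight rise together here
```

The off-diagonal entry is the height-weight relationship described above. It appears twice, once on each side of the diagonal, which is exactly what symmetry means.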
4. Tensors — Arrays with Any Number of Axes
Scalars, vectors, and matrices are all special cases of a tensor. A tensor is simply an n-dimensional array.
| Object | Axes | Example shape | Real example |
|---|---|---|---|
| Scalar | 0 | () | A single loss value |
| Vector | 1 | (512,) | A word embedding |
| Matrix | 2 | (1000, 20) | Dataset: 1000 users, 20 features |
| Tensor | 3 | (32, 28, 28) | Batch of 32 grayscale images |
| Tensor | 4 | (32, 3, 224, 224) | Batch of 32 RGB images |
Real-world scenario: Images
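A single colour photo is height × width × 3 (red, green, blue). A batch of 32 such photos is 32 × height × width × 3. This is a 4D tensor. Some frameworks, such as PyTorch, put the channel axis first instead, giving 32 × 3 × height × width. That is the layout used in the table above.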
X = np.arange(24).reshape(2, 3, 4)
# Two "slabs", each 3 rows × 4 columns
X.shape # (2, 3, 4)
X.ndim # 3
len(X) # 2 — length along the FIRST axis
# Accessing: X[slab, row, col]
X[0, 1, 2] # 6
Important: len() on a tensor always gives the size of the first axis, not the total number of elements. For total elements, use .size.
X.size # 24 — total elements
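The image example uses height × width × 3, while the table lists (32, 3, 224, 224). Both axis orders are common. Here is a minimal sketch of converting between them, assuming a made-up batch of 28 × 28 images:

```python
import numpy as np

# A made-up batch: 32 images, 28 pixels high, 28 wide, 3 colour channels.
batch_hwc = np.zeros((32, 28, 28, 3))          # channels-last: (N, H, W, C)

# Reorder the axes to channels-first: (N, C, H, W).
batch_chw = batch_hwc.transpose(0, 3, 1, 2)

print(batch_hwc.shape)   # (32, 28, 28, 3)
print(batch_chw.shape)   # (32, 3, 28, 28)
print(batch_chw.ndim, batch_chw.size == batch_hwc.size)
```

Only the order of the axes changes. The number of axes and the total number of elements stay the same.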
5. Elementwise Operations — Do the Same Thing to Every Cell
Any operation applied to matching positions across two same-shaped tensors.
A = np.arange(20).reshape(5, 4)
B = A.copy()
A + B # add each pair: [[0, 2, 4, 6], [8, 10, ...], ...]
A * B # multiply each pair — this is the Hadamard product
The Hadamard Product (⊙)
Elementwise multiplication of two matrices. Not matrix multiplication. They look different and mean completely different things.
A ⊙ B: a₁₁·b₁₁ a₁₂·b₁₂ ...
a₂₁·b₂₁ a₂₂·b₂₂ ...
A * B
# [[ 0, 1, 4, 9],
# [16, 25, 36, 49],
# ...]
It has a formal name — the Hadamard product — because mathematicians like naming things. You'll see the symbol ⊙ in research papers. When you do, it just means "multiply the matching positions together." Nothing fancier than that.
When you'll see it: Gating mechanisms in LSTMs, attention masking, certain activation computations.
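Here is a simplified sketch of the masking idea: multiply by a 0/1 array to switch positions off. The `scores` and `mask` values are invented. Real attention code usually applies its mask before a softmax step rather than by plain multiplication, so treat this as an illustration of the Hadamard product only.

```python
import numpy as np

scores = np.array([[0.2, 1.5, -0.3],
                   [0.9, 0.4,  2.0]])

# A 0/1 mask with the same shape: 1 keeps a value, 0 switches it off.
mask = np.array([[1, 1, 0],
                 [1, 0, 0]])

masked = scores * mask    # Hadamard product: position-by-position multiply
print(masked)
```

LSTM gates work the same way, except the "mask" holds values between 0 and 1 that the network learns.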
Scalar Operations
Adding or multiplying every element by a single number:
a = 2
X = np.arange(24).reshape(2, 3, 4)
a + X # adds 2 to every element
a * X # multiplies every element by 2
(a * X).shape # (2, 3, 4) — shape unchanged
6. Reduction — Collapsing Dimensions Down
Sometimes you want one number from many. Reduction is the operation of collapsing along one or more axes.
Sum
x = np.arange(4) # [0, 1, 2, 3]
x.sum() # 6 — scalar
For a matrix, you choose which axis to collapse:
A = np.arange(20).reshape(5, 4)
# 5 rows, 4 columns
A.sum(axis=0) # collapse rows → shape (4,) — sum down each column
A.sum(axis=1) # collapse columns → shape (5,) — sum across each row
A.sum() # collapse everything → scalar: 190
Real-world scenario: Class probabilities
After the final layer of a classifier, you have a (batch_size, num_classes) matrix — one row per example, one column per class. The raw numbers the network outputs before converting to probabilities are called logits. Think of them as unnormalised scores — higher means more confident, but they're not percentages yet. The network hasn't decided "I'm 80% sure" — it's just said "class 1 scores 3.4 and class 2 scores 0.5." We turn those into probabilities later using softmax. To find the predicted class for each example:
logits = np.array([[2.1, 0.3, -1.2], # example 1
[0.5, 3.4, 0.1]]) # example 2
predicted_class = logits.argmax(axis=1) # [0, 1]
# Example 1 → class 0 (highest logit: 2.1)
# Example 2 → class 1 (highest logit: 3.4)
Mean
A.mean() # 9.5 — same as A.sum() / A.size
A.mean(axis=0) # column-wise mean: [ 8., 9., 10., 11.]
When you'll see it: Loss functions reduce a batch of per-example losses to a single scalar using .mean(). Every training step ends with this.
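As a concrete illustration, here is a sketch of a batch loss using squared error. The predictions and targets are made up, and squared error is just one possible per-example loss.

```python
import numpy as np

# Made-up predictions and targets for a batch of 4 examples.
predictions = np.array([2.5, 0.0, 2.0, 8.0])
targets = np.array([3.0, -0.5, 2.0, 7.0])

per_example_loss = (predictions - targets) ** 2   # shape: (4,), values 0.25, 0.25, 0.0, 1.0
batch_loss = per_example_loss.mean()              # one scalar for the whole batch

print(per_example_loss.shape)
print(batch_loss)                                 # 0.375
```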
Keeping Dimensions (keepdims)
Sometimes you want to reduce but keep the shape intact for the next operation.
sum_A = A.sum(axis=1, keepdims=True)
sum_A.shape # (5, 1) — not (5,)
# Now you can divide each row by its own sum (normalise rows)
A / sum_A # shape: (5, 4) ÷ (5, 1)
NumPy automatically stretches the smaller shape (5, 1) to match the larger one (5, 4) before doing the division. This is called broadcasting. Conceptually, the (5, 1) column is repeated across all 4 columns so the shapes align. Without keepdims=True, the sums would have shape (5,): here the division would raise a shape error, and with a square matrix it would run but silently divide by the wrong sums. We'll cover broadcasting fully in Article X, but for now: if shapes look mismatched but the operation works, broadcasting is why.
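The same keepdims pattern turns the logits from the classifier example into probabilities. What follows is a minimal softmax sketch. Subtracting the row maximum is a common numerical-stability trick added here as an assumption; it does not change the result.

```python
import numpy as np

logits = np.array([[2.1, 0.3, -1.2],
                   [0.5, 3.4,  0.1]])                   # shape: (2, 3)

# Shift each row by its maximum to avoid overflow in np.exp for large logits.
shifted = logits - logits.max(axis=1, keepdims=True)    # (2, 3) - (2, 1)
exp = np.exp(shifted)
probs = exp / exp.sum(axis=1, keepdims=True)            # (2, 3) / (2, 1)

print(probs.sum(axis=1))        # each row sums to 1
print(probs.argmax(axis=1))     # same predicted classes as the raw logits

# Without keepdims the row sums have shape (2,), which cannot be
# broadcast against (2, 3), so NumPy raises a ValueError.
try:
    exp / exp.sum(axis=1)
    failed = False
except ValueError:
    failed = True
print(failed)
```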
Cumulative Sum
A.cumsum(axis=0)
# Each row is the running total of all rows above it plus itself.
# Useful in sequence models and certain sampling algorithms.
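One sampling use of cumsum, sketched under assumptions: to draw items from a discrete distribution, build the running total of the probabilities and look up where a uniform random number lands. The distribution `probs` and the seed are made up for illustration.

```python
import numpy as np

probs = np.array([0.1, 0.2, 0.3, 0.4])     # a made-up distribution over 4 items
cdf = probs.cumsum()                       # running total: 0.1, 0.3, 0.6, 1.0

rng = np.random.default_rng(0)             # seeded so the run is repeatable
u = rng.random(10_000)                     # uniform numbers in [0, 1)

# For each u, find the first position where the running total exceeds it.
samples = np.searchsorted(cdf, u, side="right")
samples = np.minimum(samples, len(probs) - 1)   # guard against float rounding in cdf[-1]

counts = np.bincount(samples, minlength=len(probs))
print(counts / len(u))                     # roughly matches probs
```

Item 3 covers the widest stretch of the running total (0.6 to 1.0), so it is picked most often.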
7. The Dot Product — What Every Neuron Computes
Given two vectors x and y of the same length, their dot product is:
xᵀy = x₁y₁ + x₂y₂ + ... + xₙyₙ
Multiply matching elements, sum everything up. One scalar out.
w = np.array([0.5, 0.3, -0.2]) # weights
x = np.array([1.0, 4.0, 2.0]) # input features
np.dot(w, x) # 0.5*1.0 + 0.3*4.0 + (-0.2)*2.0 = 1.3
Real-world scenario: A single neuron
That is almost all a neuron does. It takes its weights and an input, computes a dot product, adds a bias, then passes the result through an activation function. Every neuron in a standard neural network works this way.
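A minimal sketch of one neuron, reusing the weights and inputs from the dot product example. The function name `neuron` is hypothetical. The bias `b` and the ReLU activation are assumptions borrowed from the layer formula in section 8, where both are explained.

```python
import numpy as np

def neuron(w, x, b=0.0):
    """Weighted sum of the inputs plus a bias, passed through ReLU."""
    z = np.dot(w, x) + b          # dot product gives one scalar
    return np.maximum(0.0, z)     # ReLU: negatives become zero

w = np.array([0.5, 0.3, -0.2])    # weights
x = np.array([1.0, 4.0, 2.0])     # input features

print(neuron(w, x))               # dot product is 1.3, ReLU keeps it
print(neuron(w, x, b=-2.0))       # 1.3 - 2.0 is negative, so ReLU returns 0.0
```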
Dot product as similarity
Before comparing directions, we first normalise each vector: we scale it so its magnitude (its L2 norm, covered in section 10) becomes exactly 1. The result is called a unit vector. Think of it like this: you don't care how loud a voice is, only which direction it comes from. Normalising strips away the volume and keeps the direction. Dividing a vector v by np.linalg.norm(v) does this in one line.
After normalising two vectors to unit length, their dot product equals the cosine of the angle between them:
Dot product = 1: vectors point in the same direction → very similar
Dot product = 0: vectors are perpendicular → unrelated
Dot product = −1: vectors point opposite directions → opposite
# Are user_A and user_B similar?
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
user_A = np.array([25, 10, 30, 3, 2], dtype=float)
user_B = np.array([26, 9, 28, 4, 1], dtype=float)
user_C = np.array([60, 1, 5, 0, 90], dtype=float)
cosine_similarity(user_A, user_B) # ≈ 0.998, very similar
cosine_similarity(user_A, user_C) # ≈ 0.42, much less similar
When you'll see it: Word embeddings, attention scores, recommendation systems, nearest-neighbour search. They all score similarity with dot products, often in the normalised cosine form.
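To connect the unit-vector explanation with the formula, here is a short sketch that normalises first and then takes a plain dot product. The names `a_hat` and `b_hat` are illustrative.

```python
import numpy as np

a = np.array([25, 10, 30, 3, 2], dtype=float)   # user_A
b = np.array([26, 9, 28, 4, 1], dtype=float)    # user_B

# Divide each vector by its L2 norm to get a unit vector.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

print(np.linalg.norm(a_hat))        # 1.0 up to rounding
print(np.dot(a_hat, b_hat))         # same value as the cosine similarity formula

# Scaling a vector changes its size but not its direction,
# so the similarity between a and 10 * a is still 1.
scaled = 10 * a
print(np.dot(a_hat, scaled / np.linalg.norm(scaled)))
```

Because cosine similarity ignores magnitude, a user with ten times the activity but the same proportions counts as identical. Whether that is what you want depends on the application.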
8. Matrix-Vector Products — Transforming Space
A matrix A ∈ ℝᵐˣⁿ applied to a vector x ∈ ℝⁿ produces a new vector in ℝᵐ.
Think of it as a transformation: A maps x from n-dimensional space into m-dimensional space.
Ax = [a₁ᵀx, a₂ᵀx, ..., aₘᵀx]
Each element of the result is a dot product of one row of A with x.
A = np.arange(20).reshape(5, 4) # 5×4 matrix
x = np.array([1., 2., 3., 4.]) # length-4 vector
np.dot(A, x)
# [ 20. 60. 100. 140. 180.] ← length-5 vector
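To check the "one dot product per row" claim, this sketch computes the product the slow way and compares it with NumPy's result. The name `by_rows` is illustrative.

```python
import numpy as np

A = np.arange(20).reshape(5, 4)
x = np.array([1., 2., 3., 4.])

# One dot product per row of A, collected into a length-5 vector.
by_rows = np.array([np.dot(row, x) for row in A])

print(by_rows)                          # same values as np.dot(A, x)
print(np.allclose(by_rows, A @ x))      # True
```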
Real-world scenario: One neural network layer
A fully-connected layer is exactly this:
output = activation(W @ input + b)
Where W is the weight matrix, input is the activation from the previous layer, and b is the bias vector. The @ operator is matrix multiplication — it's equivalent to np.dot for 2D arrays, just cleaner to write. Every forward pass through a dense layer is a matrix-vector product.
W = np.random.randn(128, 64) # 128 neurons, 64 inputs each
x = np.random.randn(64) # one input vector
pre_activation = np.dot(W, x) # shape: (128,)
# ReLU activation: if the number is negative, replace it with zero.
# If it's positive, keep it as-is. That's the entire operation.
# Neural networks use it because real-world relationships aren't linear
# — you need something that can "switch off."
output = np.maximum(0, pre_activation) # shape: (128,)
Shape rule: For A (m × n) and x (n,) → result is (m,). The inner dimensions must match.
9. Matrix-Matrix Multiplication — The Core of Deep Learning
If you can do matrix-vector products, you already understand this. Matrix-matrix multiplication is just doing many matrix-vector products at once.
C = AB where A ∈ ℝⁿˣᵏ and B ∈ ℝᵏˣᵐ → C ∈ ℝⁿˣᵐ
Element cᵢⱼ is the dot product of row i of A with column j of B.
A = np.arange(20).reshape(5, 4) # 5×4
B = np.ones((4, 3)) # 4×3
C = np.dot(A, B) # 5×3
# [[ 6. 6. 6.]
# [22. 22. 22.]
# [38. 38. 38.]
# [54. 54. 54.]
# [70. 70. 70.]]
Real-world scenario: Processing a whole batch at once
In practice, you rarely process one example at a time. You stack a whole batch of inputs into a matrix and multiply once.
W = np.random.randn(128, 64) # weight matrix: 128 outputs, 64 inputs
X = np.random.randn(32, 64) # batch of 32 examples, each length 64
# Process all 32 examples in one shot
# Note: A @ B and np.dot(A, B) are equivalent for 2D arrays
output = X @ W.T # shape: (32, 128)
One matrix multiplication. 32 examples. This is why GPUs are fast at deep learning — they're designed for exactly this.
Shape rule: For A (n × k) and B (k × m) → result is (n × m). Inner dimensions must match.
(n × k) @ (k × m) = (n × m)
     ↑     ↑
     must match
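The claim that matrix-matrix multiplication is "many matrix-vector products at once" can be checked directly. This sketch uses the same shapes as the batch example, with a seeded random generator so the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))     # 128 outputs, 64 inputs
X = rng.standard_normal((32, 64))      # batch of 32 examples

batched = X @ W.T                      # one matrix multiplication: (32, 128)

# The same result, one example at a time: a matrix-vector product per row of X.
one_by_one = np.stack([W @ x for x in X])

print(batched.shape, one_by_one.shape)
print(np.allclose(batched, one_by_one))
```

Both routes give the same numbers. The batched version hands the whole job to optimised library code in a single call.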
10. Norms — How Big Is a Vector?
A norm measures the magnitude (size) of a vector. Not its dimensionality — how large its values are.
Formally, a norm f must satisfy:
f(αx) = |α| f(x) — scaling a vector scales its norm
f(x + y) ≤ f(x) + f(y) — triangle inequality
f(x) ≥ 0 — always non-negative
f(x) = 0 only if x = 0
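These four properties can be spot-checked numerically. The sketch below uses the Euclidean norm defined in the next subsection; the vectors and the scale factor `alpha` are arbitrary choices. A numeric check like this illustrates the rules but does not prove them.

```python
import numpy as np

x = np.array([3.0, -4.0])
y = np.array([1.0, 2.0])
alpha = -2.5

norm = np.linalg.norm                      # L2 norm by default for vectors

scaling_ok = np.isclose(norm(alpha * x), abs(alpha) * norm(x))
triangle_ok = norm(x + y) <= norm(x) + norm(y)
non_negative = norm(x) >= 0
zero_norm = norm(np.zeros(2)) == 0

print(scaling_ok, triangle_ok, non_negative, zero_norm)
```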
L2 Norm (Euclidean distance)
The straight-line distance from the origin to the point x:
‖x‖₂ = √(x₁² + x₂² + ... + xₙ²)
u = np.array([3., -4.])
np.linalg.norm(u) # √(9 + 16) = √25 = 5.0
This is Pythagoras's theorem in any number of dimensions.
When you'll see it: L2 regularisation (weight decay) adds a small multiple of ‖w‖₂² to the loss, penalising large weights to reduce overfitting.
L1 Norm (Manhattan distance)
Sum of absolute values:
‖x‖₁ = |x₁| + |x₂| + ... + |xₙ|
np.abs(u).sum() # 3 + 4 = 7
When you'll see it: L1 regularisation (Lasso) encourages sparse weights — many weights pushed to exactly zero. Useful for feature selection.
The Difference That Matters
Both L1 and L2 regularise, but differently:
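L2 penalises large values heavily (it squares them), so the pull toward zero fades as a weight gets small. Spreads weight across all features.
L1 penalises each weight in proportion to its absolute value, so small weights feel the same pull toward zero as large ones. Drives some weights to exactly zero. Creates sparsity.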
# Visualising the difference
w = np.array([0.1, 0.1, 0.1, 5.0])
l2 = np.linalg.norm(w) # dominated by the 5.0
l1 = np.abs(w).sum() # 5.0 just contributes its value
print(f"L2: {l2:.2f}") # 5.00 (≈ 5.003), almost entirely the 5.0
print(f"L1: {l1:.2f}") # 5.30 — 5.0 contributes proportionally
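Why does L1 produce exact zeros while L2 does not? One way to see it is to shrink the same weights under each penalty alone. This is a sketch under assumptions: the step size `lr` is arbitrary, the L2 update is plain gradient descent on the squared norm, and the L1 update is a soft-thresholding step, a standard way to handle the kink of |w| at zero.

```python
import numpy as np

w = np.array([0.1, 0.1, 0.1, 5.0])
lr = 0.05                                   # hypothetical step size

# Gradient of the squared L2 penalty sum(w**2) is 2*w: big weights get a big push.
l2_grad = 2 * w
# Gradient of the L1 penalty sum(|w|) is sign(w): every non-zero weight gets the same push.
l1_grad = np.sign(w)

w_l2, w_l1 = w.copy(), w.copy()
for _ in range(3):
    w_l2 = w_l2 - lr * 2 * w_l2
    # Soft-thresholding step for L1: move toward zero by lr, but stop at zero.
    w_l1 = np.sign(w_l1) * np.maximum(np.abs(w_l1) - lr, 0.0)

print(w_l2)   # every weight shrunk by the same factor, none exactly zero
print(w_l1)   # the three small weights are exactly zero, the large one barely moved
```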
Frobenius Norm — L2 for Matrices
The same idea as the L2 norm, but applied to every element of a matrix instead of a vector. You square every element, sum them all up, and take the square root:
‖X‖_F = √(Σᵢⱼ xᵢⱼ²)
np.linalg.norm(np.ones((4, 9))) # √36 = 6.0
When you'll see it: Gradient clipping — during training, if the Frobenius norm of the gradient matrix exceeds a threshold, you scale it down proportionally. This prevents exploding gradients: a situation where gradient values grow uncontrollably large and destabilise training. Common in RNNs.
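A minimal sketch of clipping by norm as described above. The helper name `clip_by_norm`, the threshold, and the gradient values are all made up for illustration. Deep learning libraries ship their own clipping utilities with their own signatures.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Scale grad down so its Frobenius norm is at most max_norm."""
    norm = np.linalg.norm(grad)            # Frobenius norm for a 2D array
    if norm > max_norm:
        grad = grad * (max_norm / norm)    # same direction, smaller size
    return grad

grad = np.ones((4, 9)) * 10.0              # made-up gradient with norm 60
clipped = clip_by_norm(grad, max_norm=5.0) # hypothetical threshold

print(np.linalg.norm(grad))                # 60.0
print(np.linalg.norm(clipped))             # 5.0 up to rounding
```

Every element is scaled by the same factor, so the gradient keeps its direction and only its size changes.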
Putting It All Together: A Minimal Neural Layer from Scratch
Here is most of this article working together in about 20 lines:
import numpy as np
# A single fully-connected layer
# Input: batch of 4 examples, each with 3 features
# Output: each example mapped to 2 values
X = np.array([[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
[7.0, 8.0, 9.0],
[1.0, 0.0, 2.0]]) # shape: (4, 3)
W = np.random.randn(2, 3) # weight matrix: 2 outputs, 3 inputs
b = np.zeros(2) # bias vector: shape (2,)
# Forward pass: matrix multiplication + bias
# (4, 3) @ (3, 2) = (4, 2), then bias (2,) is broadcast across each row
pre_activation = X @ W.T + b # shape: (4, 2)
# ReLU: replace negatives with zero, keep positives as-is
output = np.maximum(0, pre_activation) # shape: (4, 2)
# Compute mean L2 norm of outputs (as a rough "size" measure)
norms = np.linalg.norm(output, axis=1) # one norm per example
print("Output norms:", norms)
print("Mean output norm:", norms.mean())
Walk through the shapes at each step. That habit will save you hours of debugging.
Summary
| Concept | What it is | ML role |
|---|---|---|
| Scalar | Single number | Loss, learning rate |
| Vector | 1D array | Embedding, one example, bias |
| Matrix | 2D array | Dataset, weight matrix |
| Tensor | nD array | Batches of images, sequences |
| Transpose | Flip rows/cols | Fix shape mismatches |
| Elementwise ops | Same operation at matching positions of same-shaped tensors | Hadamard product, activation functions |
| Hadamard product | Elementwise matrix multiply (⊙) | Gating, attention masking |
| Reduction | Collapse axes | Loss, batch norm, softmax |
| Dot product | Weighted sum → one scalar | What one neuron computes |
| Matrix-vector product | Transform a vector | One forward pass, one example |
| Matrix multiplication (@) | Transform a batch | One forward pass, whole batch |
| Broadcasting | Auto-stretch smaller shape to match larger | Division, bias addition |
| ReLU | Replace negatives with zero | Non-linearity in layers |
| Logits | Raw unnormalised scores from final layer | Input to softmax |
| Unit vector | Vector normalised to length 1 | Cosine similarity |
| L2 norm | Euclidean length | Weight decay, gradient clipping |
| L1 norm | Sum of absolutes | Sparsity, Lasso |
| Frobenius norm | L2 for matrices | Gradient clipping |
Exercises
Shapes and indexing
Create a (3, 4) matrix using np.arange(12).reshape(3, 4). Extract: the second row; the third column; the element at row 1, column 2.
Given A with shape (5, 4), what is the shape of A.sum(axis=0)? Of A.sum(axis=1)? Work it out by hand first, then verify.
Create a 3D tensor of shape (2, 3, 4). What does len(X) return? Why? What does X.size return?
Transpose and symmetry
Prove to yourself that (Aᵀ)ᵀ = A using NumPy. Create any (3, 5) matrix, transpose it twice, and confirm equality.
Create any (4, 4) matrix A. Is A + Aᵀ always symmetric? Test it, then explain why this must be true mathematically.
Dot products and norms
Create two vectors: a = np.array([1., 2., 3.]) and b = np.array([4., 5., 6.]). Compute their dot product two ways: with np.dot and with (a * b).sum(). Confirm they match.
Compute the cosine similarity between these three word vectors. Which two words are most semantically similar?
king = np.array([0.3, 0.9, 0.1, 0.7])
queen = np.array([0.3, 0.8, 0.2, 0.6])
apple = np.array([0.9, 0.1, 0.8, 0.1])
Matrix multiplication
Given A (4 × 3) and B (3 × 5), what is the shape of AB? Given A (4 × 3) and B (4 × 3), can you multiply them? Why or why not?
Write a forward pass for a two-layer neural network with no activation function. Input: (32, 10). Layer 1: (10 → 64). Layer 2: (64 → 5). What is the shape of the output?
Reduction
Create a (5, 4) matrix. Normalise each row so its elements sum to 1 (each row becomes a probability distribution). Use keepdims=True in your sum.
Before You Move On
Make sure you can do these without looking anything up:
State the shape of a matrix-matrix product given the input shapes
Explain the difference between the Hadamard product (⊙) and matrix multiplication (@)
Write a matrix-vector product in NumPy and state the output shape
Explain what axis=0 vs axis=1 does in .sum()
Compute the L1 and L2 norm of a vector by hand and in NumPy
Describe what a neural network layer is doing in terms of linear algebra operations
Explain in one sentence what ReLU does and why networks need it
Explain the difference between logits and probabilities
Resources
3Blue1Brown — "Essence of Linear Algebra" on YouTube. Watch episodes 1–4 before or alongside this article. The visual intuition is invaluable.
D2L.ai — Chapter 2.3 covers this material with runnable MXNet code. Read it after this article for a second pass with different notation.
Gilbert Strang — Introduction to Linear Algebra — The standard university textbook. Chapters 1–3 if you want depth.
NumPy docs — np.dot, np.linalg.norm, broadcasting rules. Keep this tab open.
Next: Article 03 — Probability & Statistics for Deep Learning. Models are probability machines. You'll learn distributions, expectations, Bayes' theorem, and the statistical view of loss functions. By the end, you'll understand why cross-entropy loss is the natural choice for classification.


