Ask ten people what a neural network is and you will get ten metaphors about brains and neurons firing. The metaphor is charming, but it hides the machinery. A neural network is a function — a very large, adjustable function — that maps inputs to outputs, and "learning" is just the disciplined process of adjusting it until its outputs are useful.
In this explainer we build the idea from the bottom up: one neuron, then a layer, then a full network, and finally the learning loop that powers everything from spam filters to large language models. No prior maths beyond multiplication and a little intuition about slopes is assumed.
01 Inside a single neuron
The atom of every network is the artificial neuron. It does three things: it multiplies each input by a weight, adds everything up together with a bias, and passes the result through an activation function. That is the whole story of a neuron.
Formally, for inputs x₁…xₙ with weights w₁…wₙ and bias b, the neuron computes z = w₁x₁ + w₂x₂ + … + wₙxₙ + b, then outputs a = f(z). The weights decide how much each input matters; the bias shifts the threshold at which the neuron "fires".
Weighted sum, then squash. Each input is scaled by its weight, summed with a bias, and pushed through an activation that decides the neuron's output.
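The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names `neuron` and `sigmoid` and the example numbers are assumptions chosen so that the arithmetic is easy to check by hand.

```python
import math

def sigmoid(z):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Compute a = f(z), where z = w1*x1 + ... + wn*xn + b."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values only, picked so that z works out to exactly 0.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0
print(neuron(x, w, b))  # z = 0.5 - 0.5 + 0 = 0, and sigmoid(0) = 0.5
```

Changing a weight changes how strongly its input pulls on z; changing the bias shifts z for every input at once.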
02 Why activation functions matter
If a neuron only ever computed a weighted sum, stacking thousands of them would still produce nothing more than a single straight-line relationship — a thousand linear steps collapse into one. The activation function introduces a deliberate kink, a non-linearity, and that kink is what lets networks model curves, edges, language and everything else that is not a straight line.
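A small sketch makes the collapse concrete. It assumes two hypothetical one-input neurons with hand-picked weights and no activation, and checks that stacking them gives exactly one linear function; adding a ReLU between them breaks that equivalence.

```python
def relu(z):
    return max(0.0, z)

def first(x):
    """First linear step (hypothetical weight 2, bias 1)."""
    return 2.0 * x + 1.0

def second(h):
    """Second linear step (hypothetical weight 3, bias -4)."""
    return 3.0 * h - 4.0

def stacked(x):
    """Two linear steps in a row."""
    return second(first(x))

def collapsed(x):
    """The single linear step they reduce to: 3*(2x + 1) - 4 = 6x - 1."""
    return 6.0 * x - 1.0

def kinked(x):
    """The same two steps with a ReLU in between."""
    return second(relu(first(x)))

for x in [-2.0, 0.0, 1.5]:
    assert stacked(x) == collapsed(x)

# With the ReLU, the output is flat for small x and then rises:
print([kinked(x) for x in [-2.0, -0.5, 0.0, 1.0]])  # [-4.0, -4.0, -1.0, 5.0]
```

No single straight line passes through all four of those points, which is the kink doing its job.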
Three functions dominate practice. ReLU (max(0, z)) is the modern default: cheap, and it keeps gradients alive for positive inputs. Sigmoid squashes any number into the range 0–1, handy for probabilities. Tanh has the same S-shape but squashes into the range −1 to 1, so its output is centred on zero.
Each curve bends the straight weighted sum into something expressive. ReLU dominates deep networks because it is fast and resists vanishing gradients.
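For a feel of how the three functions differ, the sketch below evaluates each one at a few sample points. The function names are illustrative; only the standard `math` module is used.

```python
import math

def relu(z):
    """Pass positive values through, clamp negatives to zero."""
    return max(0.0, z)

def sigmoid(z):
    """S-shaped curve with outputs between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """S-shaped curve with outputs between -1 and 1."""
    return math.tanh(z)

for z in [-2.0, 0.0, 2.0]:
    print(f"z={z:+.1f}  relu={relu(z):.3f}  "
          f"sigmoid={sigmoid(z):.3f}  tanh={tanh(z):.3f}")
```

At z = 0 sigmoid returns 0.5 while tanh returns 0, which is what "centred on zero" means in practice.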
03 The forward pass
Neurons are organised into layers. Data enters at the input layer, flows through one or more hidden layers, and leaves at the output layer. In a fully connected layer, the kind described here, each neuron is connected to every neuron in the next layer, and each connection carries its own weight. Pushing an input all the way through to a prediction is called the forward pass.
Concretely, the whole layer's computation is one matrix multiplication: take the vector of inputs, multiply by the weight matrix, add the bias vector, apply the activation. Repeat for each layer. This is why GPUs — which are built for fast matrix maths — turned out to be the perfect hardware for deep learning.
Information moves left to right. Every layer is one matrix multiply plus an activation — small operations, repeated billions of times.
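The forward pass can be sketched in plain Python, with the matrix multiplication written out as nested sums so that nothing is hidden. The helper names `dense` and `forward`, the layer sizes, and the weights are all assumptions made for illustration; a real framework would hand this work to an optimised matrix library.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b).

    `weights` holds one row per neuron in this layer.
    """
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Push `inputs` through each (weights, biases, activation) layer in turn."""
    out = inputs
    for weights, biases, activation in layers:
        out = dense(out, weights, biases, activation)
    return out

# A hypothetical network: 2 inputs, 2 hidden neurons, 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),   # hidden layer
    ([[2.0, 1.0]], [0.5], identity),                   # output layer
]
print(forward([3.0, 1.0], layers))  # hidden = [2.0, 1.0], output = [5.5]
```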
04 Measuring how wrong we are
A fresh network is initialised with random weights, so its first predictions are nonsense. To improve, it needs a number that says how nonsense. That number is the loss (or cost). For regression we often use mean squared error; for classification, cross-entropy. The loss compares the prediction ŷ against the true answer y and returns a single value — high when wrong, near zero when right.
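Both losses fit in a few lines. The sketch below assumes the binary form of cross-entropy (labels of 0 or 1, predictions as probabilities); the function names and the small `eps` clamp that guards against log(0) are illustrative choices.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared gaps between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average negative log-probability assigned to the true label."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))  # (0 + 4) / 2 = 2.0
print(binary_cross_entropy([1, 0], [0.9, 0.1]))    # small: confident and right
print(binary_cross_entropy([1, 0], [0.1, 0.9]))    # large: confident and wrong
```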
Learning, then, has a precise goal: find the weights that make the loss as small as possible. Everything else is bookkeeping.
05 Backpropagation & gradient descent
Here is the genuinely clever part. The loss is a function of every weight in the network. Calculus gives us the gradient: for each weight, how steeply the loss rises as that weight increases. Moving each weight against its gradient is therefore the direction that reduces the loss fastest. Backpropagation is an efficient algorithm for computing that gradient for all weights at once, working backwards from the output using the chain rule.
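To see the chain rule at work on the smallest possible case, the sketch below takes a single sigmoid neuron with a squared-error loss (an assumption made for simplicity) and computes its gradient two ways: analytically, by multiplying local derivatives from the output backwards, and numerically, by nudging each parameter and watching the loss. The two should agree closely.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Squared error of one sigmoid neuron on one example."""
    a = sigmoid(w * x + b)
    return (a - y) ** 2

def gradients(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw, and likewise for b."""
    a = sigmoid(w * x + b)
    dL_da = 2.0 * (a - y)      # derivative of (a - y)^2
    da_dz = a * (1.0 - a)      # derivative of the sigmoid
    return dL_da * da_dz * x, dL_da * da_dz  # dz/dw = x, dz/db = 1

# Hypothetical starting point and training example.
w, b, x, y = 0.3, -0.1, 2.0, 1.0
dw, db = gradients(w, b, x, y)

# Numerical check with central differences.
h = 1e-6
dw_numeric = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
db_numeric = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)

print(dw, dw_numeric)
print(db, db_numeric)
```

In a full network the same multiplication of local derivatives is carried back layer by layer, reusing intermediate results, which is where the efficiency comes from.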
Once we know the gradient, gradient descent takes a small step downhill: w ← w − η · ∂L/∂w, where η is the learning rate. Repeat this loop — forward pass, compute loss, backpropagate, update — millions of times, and the random network gradually becomes a competent one.
Each update is a step downhill. The learning rate sets the step size — too big and you overshoot, too small and training crawls.
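The effect of the learning rate is easy to reproduce on a one-weight toy problem. The sketch below assumes the loss L(w) = (w − 3)², whose minimum sits at w = 3; the function name and the three learning rates are illustrative.

```python
def gradient_descent(lr, steps=50, w=0.0):
    """Minimise L(w) = (w - 3)^2, whose gradient is 2 * (w - 3)."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w = w - lr * grad  # step against the gradient
    return w

print(gradient_descent(lr=0.1))    # settles very close to 3
print(gradient_descent(lr=0.001))  # still far from 3 after 50 steps
print(gradient_descent(lr=1.1))    # overshoots further each step and diverges
```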
The training loop in four steps
- Forward pass — run an input through the network to get a prediction.
- Compute loss — measure the gap between prediction and truth.
- Backpropagate — use the chain rule to find each weight's gradient.
- Update — nudge every weight a little against its gradient. Repeat.
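The four steps can be sketched end to end on a deliberately tiny problem: a single linear neuron (no activation) learning a hypothetical rule y = 2x + 1 from four examples. The data, the learning rate, and the number of passes are assumptions chosen so the loop converges quickly; real networks differ in scale, not in the shape of the loop.

```python
# Toy data generated from the hypothetical rule y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [-1.0, 0.0, 1.0, 2.0]]

w, b = 0.0, 0.0        # arbitrary starting weights
learning_rate = 0.05

for epoch in range(500):
    for x, y in data:
        y_hat = w * x + b                # 1. forward pass
        loss = (y_hat - y) ** 2          # 2. compute loss
        dL_dw = 2.0 * (y_hat - y) * x    # 3. backpropagate (chain rule)
        dL_db = 2.0 * (y_hat - y)
        w -= learning_rate * dL_dw       # 4. update against the gradient
        b -= learning_rate * dL_db

# w and b should end up close to the true values 2 and 1.
print(round(w, 3), round(b, 3))
```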
06 Why "deep" works
Stacking many layers — going deep — lets a network build a hierarchy of features. In an image model, early layers learn edges and colours, middle layers assemble those into textures and shapes, and later layers recognise whole objects. Nobody programs this hierarchy; it emerges because each layer learns to transform the representation handed to it by the layer below.
The same principle scales to language. A large language model is, at heart, the same neuron stacked into very deep, very wide layers (interleaved with attention mechanisms), trained with the same forward-pass / backprop loop on enormous text corpora. The arithmetic is humble; the scale is what feels like magic.
Key takeaways
What to remember
- A neuron is just a weighted sum plus a non-linear activation — nothing more.
- Activation functions are what give networks the power to model non-linear reality.
- A forward pass is a chain of matrix multiplications, which is why GPUs excel.
- The loss turns "is it good?" into a single number to minimise.
- Backpropagation + gradient descent is the learning engine behind all of it.
- Depth lets simple features compose into complex understanding.
