To a computer, a photograph has no subject. It is a spreadsheet of brightness values. Computer vision is the discipline of teaching a machine to read that spreadsheet and recover the meaning a human sees instantly.
01 Pictures as numbers
Every digital image is a grid of pixels. A greyscale image is one number per pixel: in the common 8-bit encoding, 0 is black and 255 is white. A colour image stacks three such grids, one each for red, green and blue. A modest 1000×1000 colour photo is therefore three million numbers.
That framing matters: once an image is just a matrix, the same mathematics that powers any neural network can operate on it. The challenge is that raw pixels are a terrible description of content: shift a cat ten pixels left and nearly every number changes, yet it is still a cat.
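As a rough illustration of both points, the sketch below stores a toy greyscale image as nested lists and shifts it one pixel to the left. The 4×6 image and the helper names `shift_left` and `count_changed` are invented for this example. In this flat toy image only the pixels at the borders of the bright bar change; in a textured photograph almost all of them would.

```python
# A tiny 4x6 greyscale "image": 0 is black, 255 is white (8-bit values).
image = [
    [0, 0, 255, 255, 0, 0],
    [0, 0, 255, 255, 0, 0],
    [0, 0, 255, 255, 0, 0],
    [0, 0, 255, 255, 0, 0],
]


def shift_left(img, n, fill=0):
    """Shift every row n pixels to the left, padding the right edge with fill."""
    return [row[n:] + [fill] * n for row in img]


def count_changed(a, b):
    """Count pixel positions whose values differ between two same-sized images."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


shifted = shift_left(image, 1)
changed = count_changed(image, shifted)
total = len(image) * len(image[0])
print(f"{changed} of {total} values changed")  # 8 of 24 values changed

# Values in a 1000x1000 colour photo: one grid each for red, green and blue.
values_in_photo = 1000 * 1000 * 3
print(values_in_photo)  # 3000000
```

The bar is the same shape in both images, yet a third of the raw numbers differ, which is why a pixel-by-pixel comparison is a poor test of content.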
02 A grid of brightness
Early systems tried to hand-write rules: "an edge is where brightness changes sharply." This worked for simple cases and collapsed in the real world, where lighting, angle and clutter make hand-made rules brittle. The breakthrough was to stop writing rules and start learning them from examples.
A single colour pixel is three intensity values; a full image is that, repeated across a grid. Position and intensity together are all the model ever receives.
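The hand-written edge rule quoted in this section can be sketched in a few lines, which also shows how easily it breaks. The threshold of 50 and the sample brightness rows are assumptions chosen for illustration, not values from any real system.

```python
def edge_map(row, threshold):
    """Mark each gap between neighbours where brightness jumps by more than threshold."""
    return [abs(b - a) > threshold for a, b in zip(row, row[1:])]


# One row of pixels crossing from a dark region into a bright one.
bright_row = [10, 12, 11, 200, 205, 203]
# The same scene at a tenth of the brightness (integer division).
dim_row = [value // 10 for value in bright_row]

edges_bright = edge_map(bright_row, threshold=50)
edges_dim = edge_map(dim_row, threshold=50)

print(edges_bright)  # [False, False, True, False, False]
print(edges_dim)     # [False, False, False, False, False]
```

The edge is still there in the dim row, but the fixed threshold no longer fires. Tuning the threshold for one lighting condition tends to break it for another, which is the brittleness described above.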
03 Convolutions find features
The workhorse of vision is the convolution. A small window, say 3×3, slides across the image, and at each position multiplies the pixels under it by a set of learned weights called a filter, then sums the products into a single response. One filter might respond strongly to vertical edges, another to a patch of green, another to a curve.
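A minimal sketch of that sliding-window operation follows, using plain Python lists, no padding and a stride of one. The filter here is hand-picked to respond to dark-to-bright vertical edges; in a real network the weights would be learned. As in most deep-learning libraries, the filter is applied without flipping (strictly speaking, a cross-correlation).

```python
def convolve2d(image, kernel):
    """Slide kernel over image (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply each pixel under the window by its weight, then sum.
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        output.append(row)
    return output


# Hand-picked (not learned) filter for dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# 4x6 image: dark left half, bright right half.
image = [[0, 0, 0, 255, 255, 255] for _ in range(4)]

response = convolve2d(image, vertical_edge)
print(response)  # [[0, 765, 765, 0], [0, 765, 765, 0]]
```

The response is large only where the window straddles the boundary between the dark and bright halves, and zero over the flat regions on either side.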
Crucially, the same filter is reused at every position. This weight sharing means a feature learned in one corner is detected wherever it appears, which goes a long way towards solving the "shifted cat" problem, and it makes the model far smaller than a fully connected one.
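The size difference can be put in rough numbers. The sketch below counts weights (ignoring biases) for a single layer applied to the 1000×1000 colour photo mentioned earlier. The choice of 64 output units and 64 filters is an assumption made only so the two layers are comparable.

```python
def dense_weights(inputs, units):
    """Fully connected layer: every input connects to every unit."""
    return inputs * units


def conv_weights(kernel_h, kernel_w, in_channels, filters):
    """Convolutional layer: each filter has one weight per window position and channel."""
    return kernel_h * kernel_w * in_channels * filters


inputs = 1000 * 1000 * 3  # values in a 1000x1000 colour image

dense = dense_weights(inputs, units=64)
conv = conv_weights(3, 3, in_channels=3, filters=64)

print(dense)  # 192000000
print(conv)   # 1728
```

The convolutional count does not depend on the image size at all, because the same 3×3 filters are reused at every position.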
04 From edges to objects
Stack convolutional layers and something elegant happens. The first layer learns simple edges and colours. The next combines edges into textures and corners. Deeper layers assemble those into parts — an eye, a wheel — and the final layers into whole objects. Meaning is built bottom-up.
Each layer composes the one below it. No human labels "edge" or "wheel"; the network discovers this hierarchy during training because it is an effective way to reduce error.
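One way to see why depth lets a network move from edges to parts is to track how much of the input a single unit can "see" (commonly called its receptive field) as layers are stacked. The sketch below uses the standard recurrence for stacked convolutions. The layer configurations are assumptions for illustration, not a specific architecture.

```python
def receptive_field(layers):
    """Side length, in input pixels, of the region seen by one unit after the given layers.

    layers is a list of (kernel_size, stride) pairs, applied in order.
    """
    size, jump = 1, 1  # jump = distance in input pixels between neighbouring units
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
    return size


one_layer = receptive_field([(3, 1)])
three_layers = receptive_field([(3, 1)] * 3)
three_strided = receptive_field([(3, 2)] * 3)

print(one_layer)      # 3
print(three_layers)   # 7
print(three_strided)  # 15
```

A unit in the first layer sees only a 3×3 patch, enough for an edge. Units a few layers up see a much wider region, enough to combine several edges into a texture or a part.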
05 The three core tasks
Most applications reduce to three jobs. Classification answers "what is in this image?" with one label. Detection goes further — "what is here, and where?" — drawing a box around each object. Segmentation is the most precise, labelling every single pixel, which is how medical and self-driving systems trace exact outlines.
The same learned features can feed all three; the main difference is what the final layers are asked to output. Moving from classification to detection to segmentation, spatial precision rises, and so does the cost of labelling training data.
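The difference in outputs can be made concrete with toy data structures. The `Box` class, the 4×4 mask and the `mask_to_box` helper below are hypothetical names invented for this sketch; real detection and segmentation libraries use their own formats.

```python
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    x: int       # left column
    y: int       # top row
    width: int
    height: int


# Classification: one label for the whole image.
classification = "cat"

# Detection: a labelled box for each object.
detection = [Box("cat", x=1, y=1, width=2, height=2)]

# Segmentation: one label for every pixel.
bg = "background"
segmentation = [
    [bg, bg, bg, bg],
    [bg, "cat", "cat", bg],
    [bg, "cat", "cat", bg],
    [bg, bg, bg, bg],
]


def mask_to_box(mask, label):
    """Return the tightest Box around all pixels carrying label, or None if absent."""
    coords = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value == label
    ]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return Box(label, min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


print(mask_to_box(segmentation, "cat") == detection[0])  # True
```

A box can be derived from a mask, and a label from a box, but not the other way round. Each step up in precision therefore asks a human annotator for strictly more information per training image.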
06 Where it fails
Vision models are pattern matchers, not understanders. They can be fooled by tiny, deliberate pixel changes that are invisible to us (adversarial examples). They inherit the biases of their training images: a model trained mostly on daytime, sunny streets makes a poor night-time driver. And they cannot, on their own, explain why they decided, which matters enormously in medicine and policing.
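The adversarial effect can be reproduced on a deliberately simple stand-in: a linear "cat" scorer over 100 pixels. The weights, pixel values and the perturbation size of 2 are all made up for this sketch, and a real attack on a deep network would use the model's gradients rather than its raw weights, but the mechanism is similar: many tiny nudges, each in the worst direction, add up.

```python
def score(pixels, weights):
    """Linear score: positive means 'cat' in this toy model."""
    return sum(p * w for p, w in zip(pixels, weights))


def classify(pixels, weights):
    return "cat" if score(pixels, weights) > 0 else "not cat"


def sign(value):
    return (value > 0) - (value < 0)


def adversarial(pixels, weights, epsilon):
    """Nudge each pixel by at most epsilon in the direction that lowers the score."""
    return [
        min(255, max(0, p - epsilon * sign(w)))  # keep values in the 8-bit range
        for p, w in zip(pixels, weights)
    ]


weights = [1, -1] * 50   # hypothetical weights of the toy model
image = [130, 128] * 50  # hypothetical 100-pixel greyscale image

attacked = adversarial(image, weights, epsilon=2)
largest_change = max(abs(a - b) for a, b in zip(image, attacked))

print(classify(image, weights))     # cat
print(classify(attacked, weights))  # not cat
print(largest_change)               # 2
```

No pixel moves by more than 2 out of 255, a change too small to see, yet the decision flips because every nudge pushes the score the same way.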
The practical lesson for any organisation deploying vision is the same as for any AI: the model is only as good, and as fair, as the images it learned from.
What to remember
- An image is just a grid of numbers; vision turns those numbers into meaning.
- Convolutions slide small learned filters across the image to detect features.
- Weight sharing lets a feature learned once be recognised anywhere.
- Layers build a hierarchy: edges → textures → parts → objects.
- Classification, detection and segmentation differ mainly in the output asked for.
- Performance and fairness are bounded by the training images.
