How Computers Learn to See

To a computer, a photograph has no subject. It is a spreadsheet of brightness values. Computer vision is the discipline of teaching a machine to read that spreadsheet and recover the meaning a human sees instantly.

01Pictures as numbers

Every digital image is a grid of pixels. A greyscale image is one number per pixel — 0 for black, 255 for white. A colour image stacks three such grids, one each for red, green and blue. A modest 1000×1000 colour photo is therefore three million numbers.

That framing matters: once an image is just a matrix, the same mathematics that powers any neural network can operate on it. The challenge is that raw pixels are a terrible description of content — shift a cat ten pixels left and every number changes, yet it is still a cat.

02A grid of brightness

Early systems tried to hand-write rules: "an edge is where brightness changes sharply." This worked for simple cases and collapsed in the real world, where lighting, angle and clutter make hand-made rules brittle. The breakthrough was to stop writing rules and start learning them from examples.

Figure 1 — An image is a matrix of numbers

A single colour pixel is three intensity values; a full image is that, repeated across a grid. Position and intensity together are all the model ever receives.

03Convolutions find features

The workhorse of vision is the convolution. A small window — say 3×3 — slides across the image, and at each position multiplies the pixels under it by a set of learned weights called a filter. One filter might respond strongly to vertical edges, another to a patch of green, another to a curve.

Crucially the same filter is reused at every position. This weight sharing means a feature learned in one corner is recognised everywhere — solving the "shifted cat" problem and making the model far smaller than a fully-connected one.

04From edges to objects

Stack convolutional layers and something elegant happens. The first layer learns simple edges and colours. The next combines edges into textures and corners. Deeper layers assemble those into parts — an eye, a wheel — and the final layers into whole objects. Meaning is built bottom-up.

Figure 2 — Features build up layer by layer

Each layer composes the one below it. No human labels "edge" or "wheel" — the network discovers this hierarchy because it is the most efficient way to reduce error.

A vision model is not told what a cat looks like. It is shown many cats and left to invent, layer by layer, the features that distinguish one.

05The three core tasks

Most applications reduce to three jobs. Classification answers "what is in this image?" with one label. Detection goes further — "what is here, and where?" — drawing a box around each object. Segmentation is the most precise, labelling every single pixel, which is how medical and self-driving systems trace exact outlines.

Figure 3 — Classification to segmentation

The same learned features feed all three; the difference is only what the final layer is asked to output. Precision rises — and so does the cost of labelling training data.

06Where it fails

Vision models are pattern matchers, not understanders. They can be fooled by tiny, deliberate pixel changes that are invisible to us (adversarial examples). They inherit the biases of their training images — a dataset of mostly daytime, sunny streets makes a poor night-time driver. And they cannot explain why they decided, which matters enormously in medicine and policing.

The practical lesson for any organisation deploying vision is the same as for any AI: the model is only as good, and as fair, as the images it learned from.

What to remember

An image is just a grid of numbers; vision turns those numbers into meaning.
Convolutions slide small learned filters across the image to detect features.
Weight sharing lets a feature learned once be recognised anywhere.
Layers build a hierarchy: edges → textures → parts → objects.
Classification, detection and segmentation differ only in the output asked for.
Performance and fairness are bounded by the training images.

RRINOVA Research Team

We translate advanced technology and EU policy into practical training. This explainer is part of our open Insights series for educators, youth workers and SMEs.

01Pictures as numbers

02A grid of brightness

03Convolutions find features

04From edges to objects

05The three core tasks

06Where it fails

What to remember

Keep exploring

How neural networks actually work

Transformers, the engine of modern AI

AI ethics & bias in practice

Bring practical AI to your classroom