Reinforcement learning, or RL, is how a machine learned to beat a world champion at Go, how robots learn to walk, and a key ingredient in how chatbots are tuned to be helpful. Its logic is older than computers: behaviour that is rewarded is repeated.
01 Learning by doing
In supervised learning, every example comes with the correct answer. RL has no such answer key. Instead, an agent acts in an environment and occasionally receives a reward: a single number saying "that went well" or "that went badly". From this thin signal it must work out an entire strategy.
02 The agent–environment loop
Everything in RL is one repeating loop. The agent observes the current state and chooses an action; the environment responds with a new state and a reward. Over many cycles, often millions, the agent learns a policy: a mapping from "what I see" to "what I should do".
The loop never stops during training. Each cycle nudges the policy toward actions that earned reward and away from those that did not.
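The loop can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Corridor` environment, the `train` function and the parameters `alpha`, `gamma` and `epsilon` are all invented for the example, and the update rule is tabular Q-learning, one common choice among many that the text above does not name.

```python
import random


class Corridor:
    """Hypothetical toy environment: cells 0..4, reward only at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 moves left, action 1 moves right; walls clamp the position
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    env = Corridor()
    # q[state][action] estimates how much future reward an action leads to
    q = [[0.0, 0.0] for _ in range(env.length)]
    for _ in range(episodes):
        state = env.reset()                      # observe the starting state
        for _ in range(100):                     # cap the episode length
            # choose an action: usually the best known, sometimes a random one
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            # the environment returns a new state and a reward
            next_state, reward, done = env.step(action)
            # nudge the estimate toward what was just experienced
            future = 0.0 if done else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state
            if done:
                break
    # the learned policy: the preferred action in each non-terminal state
    policy = [0 if q[s][0] > q[s][1] else 1 for s in range(env.length - 1)]
    return q, policy


if __name__ == "__main__":
    _, learned_policy = train()
    print(learned_policy)
```

Each pass through the inner loop is one cycle of observe, act, receive reward, update. The policy here is just a small table; real systems usually replace it with a neural network, but the loop keeps the same shape.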
03 Reward is everything
The reward signal defines what the agent will become. Reward a game agent only for winning and it learns to win; reward it for collecting points along the way and it may learn to farm points forever, ignoring the goal. This is reward hacking: the agent maximises the reward as written rather than the outcome the designer intended. Designing rewards that capture what you actually want is the central craft, and the central danger, of RL.
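The point-farming failure above can be made concrete with a little arithmetic. The sketch below is purely illustrative: the two trajectories, the event names and both reward functions are made up for the example. It scores the same two behaviours under a reward that pays only for the outcome and under one that pays for every point.

```python
# Two hypothetical behaviours, written as the events each one produces.
FINISH_LEVEL = ["point", "point", "win"]   # picks up two points, then reaches the goal
FARM_POINTS = ["point"] * 10               # circles a respawning pickup, never finishes


def outcome_reward(event):
    """Pays only for the final outcome."""
    return 1.0 if event == "win" else 0.0


def points_reward(event):
    """Pays for every point collected, a proxy for playing well."""
    return 1.0 if event == "point" else 0.0


def total_reward(trajectory, reward_fn):
    return sum(reward_fn(event) for event in trajectory)


def preferred(reward_fn):
    """Return the behaviour that earns the most under the given reward."""
    candidates = {"finish": FINISH_LEVEL, "farm": FARM_POINTS}
    return max(candidates, key=lambda name: total_reward(candidates[name], reward_fn))


if __name__ == "__main__":
    print(preferred(outcome_reward))   # prints "finish"
    print(preferred(points_reward))    # prints "farm"
```

Nothing about the agent changed between the two calls. Only the reward did, and with it the behaviour that a reward-maximising learner would be pushed toward.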
04 Explore vs exploit
Every agent faces a dilemma. Should it exploit the best action it knows, or explore a new one that might be better? Too much exploiting and it gets stuck in a mediocre rut; too much exploring and it never settles. Balancing the two — often by acting greedily but occasionally trying something random — is essential to learning well.
A typical learning curve: early random exploration earns little, but the policy steadily improves as the agent discovers and then exploits what works.
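"Acting greedily but occasionally trying something random" is usually called epsilon-greedy. Here is a minimal sketch on a multi-armed bandit, a standard toy problem in which each action pays a noisy reward. The function name, the arm payoffs and the default settings are assumptions chosen for illustration.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon, steps=5000, seed=0):
    """Run epsilon-greedy on a bandit whose arms pay Gaussian rewards."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms   # running average reward per arm
    counts = [0] * n_arms        # how often each arm was tried
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # incremental mean: move the estimate toward the new sample
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total / steps, counts


if __name__ == "__main__":
    arms = [0.2, 0.5, 1.0]   # hypothetical average payoffs; the last arm is best
    for eps in (0.0, 0.1, 1.0):
        average, pulls = epsilon_greedy_bandit(arms, eps)
        print(eps, round(average, 3), pulls)
```

Setting `epsilon` to 1.0 explores all the time and earns roughly the average of all arms. Setting it to 0.0 never explores on purpose, so the agent can lock onto whichever arm looked good first. A small value in between is a common compromise.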
05 Where it shines
RL excels wherever decisions unfold over time and feedback is sparse or delayed: game-playing, robotics and control, logistics and routing, energy management in data centres, and aligning language models to human preferences. Its sweet spot is problems too complex to write rules for but cheap to simulate.
06 Why it is hard
RL is famously sample-hungry — it may need millions of trials, which is fine in simulation but costly or dangerous in the real world. Rewards are hard to specify safely, training can be unstable, and a policy tuned in simulation often stumbles in reality (the "sim-to-real" gap). Powerful, but not plug-and-play.
What to remember
- RL learns from rewards through trial and error, with no answer key.
- The core is a loop: observe state, act, receive reward, update policy.
- The reward signal defines behaviour: poorly designed rewards invite reward hacking.
- Agents must balance exploring new actions against exploiting known ones.
- It shines on sequential decisions: games, robotics, control, alignment.
- It is sample-hungry and suffers a gap between simulation and reality.
