Reinforcement learning is a machine learning technique that allows an agent to learn how to make a sequence of decisions by interacting with its environment through trial and error and receiving feedback in the form of rewards or penalties for its actions. The main goal of the agent is to maximize the total reward it collects over time.

An agent in reinforcement learning can be, for example, a bot player in a computer game or a robot moving inventory in a physical warehouse. The goal of reinforcement learning as a discipline is to learn an optimal strategy for the agent in its environment.

Environments are often complex and large in terms of potential states and transitions. In the game of Go, for example, the number of possible positions grows rapidly as the game progresses. In chess, there are 400 possible positions after each player has made one move, whereas in Go there are around 130,000. An environment can also be difficult because of imperfect information: in card games, for example, the opponents’ cards are not fully known to the player.
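
The two figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it counts move sequences after each player has made one move, assumes a standard 19×19 Go board, and ignores passes and board symmetries.

```python
# Chess: each side has 20 possible first moves
# (8 pawns x 2 advances each, plus 2 knights x 2 squares each).
chess_first_moves = 8 * 2 + 2 * 2
chess_positions = chess_first_moves * chess_first_moves

# Go: the first stone can go on any of the 361 points of a 19x19 board,
# and the second stone on any of the remaining 360 points.
go_points = 19 * 19
go_positions = go_points * (go_points - 1)

print(chess_positions)  # 400
print(go_positions)     # 129960
```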

Reinforcement learning problems are often formalized as a Markov decision process (MDP). An MDP consists of a set of environment states, the actions that an agent can perform in each state, a reward function, and a transition model that gives the probability of moving from one state to another when an action is taken.
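
As a rough illustration, the four components of an MDP can be stored in a small data structure. The class name `MDP`, the helper `expected_reward`, and the two-state "cold"/"warm" example are hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MDP:
    states: tuple       # all environment states
    actions: dict       # state -> tuple of actions available in that state
    transitions: dict   # (state, action) -> {next_state: probability}
    rewards: dict       # (state, action, next_state) -> reward


def expected_reward(mdp, state, action):
    """Average the reward over all possible next states."""
    return sum(
        prob * mdp.rewards.get((state, action, next_state), 0.0)
        for next_state, prob in mdp.transitions[(state, action)].items()
    )


# Hypothetical toy problem: the agent is rewarded for being in the "warm" state.
mdp = MDP(
    states=("cold", "warm"),
    actions={"cold": ("wait", "heat"), "warm": ("wait",)},
    transitions={
        ("cold", "wait"): {"cold": 1.0},
        ("cold", "heat"): {"warm": 0.9, "cold": 0.1},
        ("warm", "wait"): {"warm": 0.8, "cold": 0.2},
    },
    rewards={
        ("cold", "heat", "warm"): 1.0,
        ("warm", "wait", "warm"): 1.0,
    },
)

print(expected_reward(mdp, "cold", "heat"))  # 0.9
```

Missing entries in `rewards` are treated as a reward of zero, which keeps the example short.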

One important distinction between reinforcement learning and the more widely used supervised learning is that in the latter the model is immediately “informed” during training whether its prediction for a given input is correct. The former, by contrast, operates in an environment where the reward for the agent’s actions is often delayed and accumulates over a sequence of decisions.

Reinforcement learning algorithms have achieved great success in beating human experts at different games, such as Go, Dota 2, StarCraft II and Atari video games. In the second part of this article, we look more closely at AlphaGo, the program that achieved great success against human players in the game of Go.

Main concepts of reinforcement learning

To better understand reinforcement learning, we introduce its key concepts:

  • agent is the entity that takes actions in its environment, for example Pac-Man in the maze,
  • action is any valid interaction of the agent with its environment. Pac-Man can, for example, move left, right, up or down,
  • environment is where the agent resides and with which it interacts. After each action, the environment takes the agent’s current state and action and returns the reward and the agent’s next state,
  • state is the current situation that the agent finds itself in and is returned by the environment after each action. In the Pac-Man game, it comprises the position of Pac-Man in the maze, the positions of the four coloured ghosts and the accumulated rewards of the agent,
  • reward is the feedback received by the agent after taking a given action. Rewards can be immediate or delayed. In the Pac-Man game, for example, a reward of 100 points is received for each cherry,
  • discount factor controls the value of a reward with respect to time – future rewards are discounted by a factor and are worth less than immediate rewards. If the discount factor is $\gamma$ (a number between 0 and 1) and the agent receives 100 points after 5 time steps, then the present value of that reward is $\gamma^5 \cdot 100$,
  • policy is the rule that maps the agent’s states to actions. Under a given policy, each state has an expected value of future rewards that the agent receives if it starts in that state and chooses its actions according to the policy. One of the main goals of reinforcement learning is to learn the optimal policy, that is, the policy that offers the highest expected total reward. Policies can be deterministic or stochastic, in which case they include an element of chance,
  • value is the expected long-term reward from the current state when following the policy,
  • model simulates the functioning of the environment and returns the next state and reward for a given state and action.
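
The effect of the discount factor can be made concrete with a short calculation. The value 0.9 for the discount factor is an assumption made for this sketch, and `discounted_return` is an illustrative helper name.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward received after t steps by gamma**t."""
    return sum(gamma ** t * reward for t, reward in enumerate(rewards))


gamma = 0.9  # assumed discount factor, chosen only for illustration

# No reward for the first five steps, then 100 points after 5 time steps.
rewards = [0, 0, 0, 0, 0, 100]

value = discounted_return(rewards, gamma)
print(round(value, 2))  # 59.05
```

With this discount factor, 100 points received five steps in the future are worth about 59 points now. A discount factor closer to 1 makes the agent more far-sighted, while a factor closer to 0 makes it favour immediate rewards.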

A reinforcement learning problem can be solved either with a model (model-based approach) or without one (model-free approach). Model-based approaches try to build an accurate model of the environment and then learn the optimal policy based on this model. In the model-free approach, the agent learns the optimal policy directly through trial and error.

Supervised learning, unsupervised learning and reinforcement learning

In supervised learning, the model is trained on labelled data, with each data instance having a known outcome. Supervised learning algorithms are further divided into classification (prediction of classes, e.g. whether an email is spam or not) and regression (prediction of continuous outputs, e.g. sales in the next quarter).

Unsupervised learning describes algorithms for finding patterns in data without prior knowledge of outcomes for data instances. It includes clustering, that is, finding groups of data instances that are similar to each other but distinct from data instances in other groups.

Unsupervised learning also includes dimensionality reduction methods, which play an important role in feature selection and feature engineering.

Reinforcement learning, as we have seen, describes decision and reward systems, which learn in an environment where they are rewarded for “good” actions and penalized for “bad” actions they take. Like unsupervised learning, reinforcement learning does not require prior knowledge of outputs for given input data.



There are several important differences between reinforcement learning and supervised learning:

| Reinforcement learning | Supervised learning |
| --- | --- |
| Trains the agent by letting it interact with its environment | Trains the model by applying it to training data and letting it learn from deviations of its predictions from the labels |
| No labelled data set is available | The model learns from a labelled data set |
| Trains the agent to make a sequence of decisions rather than a single decision | The model gives a single decision or prediction for a given input data instance |
| Example: playing a Mario game | Example: translating a sentence |

Types of reinforcement learning

We distinguish between two types of reinforcement learning:

  • positive reinforcement – reinforcement is considered positive when an event has a positive effect on the behaviour, for example by increasing its frequency or strength. An example would be giving your pet its favourite food for favourable behaviour. Positive reinforcement helps to maximize the performance on a given task and can produce changes in behaviour that last for longer periods of time. It is the most common type of reinforcement used in reinforcement learning problems,
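  • negative reinforcement – reinforcement is considered negative when a behaviour is strengthened because an unpleasant condition is stopped or avoided. An event that decreases the frequency or strength of a behaviour is, strictly speaking, a punishment; in reinforcement learning it corresponds to a penalty, or negative reward.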

Reinforcement Learning Algorithms

As noted previously, reinforcement learning algorithms can be divided into two main groups:

  • model-based
  • model-free

Model-free methods include policy optimization and Q-learning.

Policy optimization involves directly learning the policy that maps states to actions. Policies are further subdivided into deterministic policies, where each state is always mapped to the same action, and stochastic policies, where the mapping from state to action involves an element of chance.
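
The difference between the two kinds of policy can be sketched as follows. This is not a policy optimization algorithm: it only contrasts a deterministic rule with a stochastic one (here epsilon-greedy, one common choice). The action names, the `q_values` dictionary of estimated action values, and the function names are assumptions made for the example.

```python
import random

ACTIONS = ("left", "right", "up", "down")


def deterministic_policy(state, q_values):
    """Always pick the action with the highest estimated value in this state."""
    return max(ACTIONS, key=lambda action: q_values[(state, action)])


def stochastic_policy(state, q_values, epsilon, rng):
    """Epsilon-greedy: a random action with probability epsilon, else the best one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return deterministic_policy(state, q_values)


# Hypothetical value estimates for a single state called "start".
q_values = {
    ("start", "left"): 0.1,
    ("start", "right"): 0.4,
    ("start", "up"): 0.7,
    ("start", "down"): 0.2,
}
rng = random.Random(0)

print(deterministic_policy("start", q_values))         # up
print(stochastic_policy("start", q_values, 0.3, rng))  # usually "up", sometimes random
```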

Another important model-free reinforcement learning algorithm is Q-learning. It aims to find the optimal action for a given current state. This is done with the help of a Q-table, which maps each state-action pair to a value: the expected total reward of taking that action in that state. The values in the Q-table are estimated during exploration, when the agent tries out actions (at first largely at random), receives rewards, and updates the Q-table accordingly.
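
A minimal sketch of tabular Q-learning is shown below. The environment is a hypothetical five-cell corridor in which the agent starts at cell 0 and receives a reward of 1 for reaching cell 4. The learning rate, discount factor, exploration rate and number of episodes are illustrative assumptions, not recommended settings.

```python
import random
from collections import defaultdict

N_STATES = 5        # corridor cells 0..4, the goal is cell 4
ACTIONS = (-1, 1)   # step left or step right


def step(state, action):
    """Toy environment: return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def choose_action(q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily (random tie-break)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # the Q-table: (state, action) -> estimated value
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Q-learning update: move the estimate towards reward + discounted best next value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)]
print(greedy_policy)  # the learned greedy action for cells 0..3
```

After training, the greedy action in every non-goal cell should be a step to the right, since that is the shortest path to the reward.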

For problems with a huge space of possible states, computing and storing the Q-table can become infeasible. To handle such problems, deep Q-learning was introduced, in which the Q-table is replaced by a deep neural network. The network receives the current state as input and outputs a value for each possible action.
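
The idea of a network that takes a state and returns one value per action can be sketched in plain Python. The block shows only the forward pass of a small, untrained network with random weights; the layer sizes and the example state are assumptions, and a practical implementation would use a deep learning framework and a training loop.

```python
import random


def init_layer(n_in, n_out, rng):
    """Create (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, relu):
    """Apply one dense layer, optionally followed by a ReLU activation."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def q_network(state, layers):
    """Forward pass: state vector in, one estimated Q-value per action out."""
    x = state
    for layer in layers[:-1]:
        x = dense(x, layer, relu=True)
    return dense(x, layers[-1], relu=False)


rng = random.Random(0)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 8, 2  # assumed sizes for illustration
layers = [init_layer(STATE_DIM, HIDDEN, rng), init_layer(HIDDEN, N_ACTIONS, rng)]

state = [0.1, -0.2, 0.05, 0.0]  # hypothetical state vector
q_values = q_network(state, layers)
greedy_action = max(range(N_ACTIONS), key=lambda a: q_values[a])
print(len(q_values), greedy_action)
```

Unlike a Q-table, the network does not need one entry per state, so it can produce value estimates for states it has never seen.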

Application of reinforcement learning – AlphaGo

One of the best-known applications of reinforcement learning is the computer program AlphaGo. It is important not only for the RL field but also for the perception of artificial intelligence by wider audiences, as AlphaGo was able to beat the best human players at Go, a game in which surpassing human-level ability was long considered too difficult for a computer.

The major event in this respect occurred in March 2016, when AlphaGo beat Lee Sedol in a five-game match. This was also the first time a computer program had beaten a 9-dan professional player without handicap. Today, AlphaGo is no longer the most powerful computer Go program, as DeepMind (the developer of AlphaGo and part of Google) has developed three more powerful successors: AlphaGo Master, AlphaGo Zero and AlphaZero.

Go is an ancient game, invented in China almost 2500 years ago. It is played by two players on a board with black and white stones. One of the reasons why Go was seen as too difficult for computers is the huge number of possible board positions, which is higher than the number of atoms in the observable universe.

Go is a game of perfect information, which means that each player knows all the previous moves and the full current state of the board. In this type of game, one can in principle determine the outcome from any current state by assuming that each player makes the optimal move on every turn. To find the optimal move, one needs to calculate the value of each candidate move with the help of simulations. This is done by traversing the search tree (game tree) of all possible moves.

Each node of the search tree represents a state of the game. When a player makes a move, a transition occurs from a node to one of its child nodes. The aim is to find the optimal path through the search tree. Due to the complexity of Go, calculating the optimal action for a given state with present-day computers would take many orders of magnitude more time than is practical.
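
Exhaustive game-tree search is feasible only for very small games. The sketch below applies it (as minimax in its compact negamax form) to a toy subtraction game rather than to Go. The rules are an assumption made for the example: players alternately remove 1 to 3 stones from a pile, and whoever takes the last stone wins.

```python
from functools import lru_cache

MAX_TAKE = 3  # assumed rule: a player removes 1 to 3 stones per turn


@lru_cache(maxsize=None)
def value(stones):
    """Return +1 if the player to move can force a win, otherwise -1."""
    if stones == 0:
        return -1  # the previous player took the last stone, so the mover has lost
    # A child position is valued from the opponent's point of view, hence the minus sign.
    return max(-value(stones - take) for take in range(1, min(MAX_TAKE, stones) + 1))


def best_move(stones):
    """Search every child node and return the move with the best value."""
    moves = range(1, min(MAX_TAKE, stones) + 1)
    return max(moves, key=lambda take: -value(stones - take))


print(value(4))      # -1: every move from 4 stones leaves the opponent a winning position
print(best_move(7))  # 3: leaving 4 stones puts the opponent in a lost position
```

Each call to `value` visits all children of a node, which is exactly what becomes impractical in a game as large as Go.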

AlphaGo’s aim is thus to reduce the search space to a size where the number of possible continuations (to the end of the game) is small enough to be evaluated in a matter of seconds. For this purpose it uses the Monte Carlo tree search (MCTS) algorithm, which evaluates potential moves by random sampling.
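
The random-sampling idea can be illustrated with a much simpler method than the one AlphaGo uses. The sketch below is flat Monte Carlo evaluation: every legal move is scored by the share of random playouts it wins. It omits the tree building and the selection rule of full MCTS, as well as the neural networks AlphaGo uses to guide the search. The game is an assumed toy: players alternately remove 1 to 3 stones from a pile, and whoever takes the last stone wins.

```python
import random

MAX_TAKE = 3  # assumed rule: a player removes 1 to 3 stones per turn


def legal_moves(stones):
    return list(range(1, min(MAX_TAKE, stones) + 1))


def random_playout(stones, rng):
    """Play random moves to the end; True if the player who moves first wins."""
    first_player_to_move = True
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return first_player_to_move
        first_player_to_move = not first_player_to_move


def monte_carlo_move(stones, playouts_per_move=2000, seed=0):
    """Estimate a win rate for each legal move and return the best move."""
    rng = random.Random(seed)
    win_rates = {}
    for take in legal_moves(stones):
        remaining = stones - take
        if remaining == 0:
            win_rates[take] = 1.0  # taking the last stone wins immediately
            continue
        # After our move the opponent moves first; we win whenever they lose.
        wins = sum(not random_playout(remaining, rng) for _ in range(playouts_per_move))
        win_rates[take] = wins / playouts_per_move
    return max(win_rates, key=win_rates.get), win_rates


move, win_rates = monte_carlo_move(7)
print(move, win_rates)
```

Because the playouts are random, the estimated win rates are approximate, and more playouts per move give more reliable estimates at the cost of more computation.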

MCTS is only one of the key components of AlphaGo. Another is a supervised learning (SL) policy network, which was trained on millions of positions from the KGS Go Server. While the SL policy helps to predict the most likely next moves, the reinforcement learning (RL) policy network is the component that predicts the moves most likely to win. According to the original article, when played head-to-head, the RL policy network won more than 80% of games against the SL policy network.

Other applications of Reinforcement Learning

Reinforcement learning is used in many industries for a wide range of purposes. Some of its applications and fields of use include:

  • personalized recommendations,
  • trading strategies in financial organizations,
  • manufacturing (robots learning to put devices in boxes),
  • management (strategic planning),
  • inventory management,
  • robotics,
  • industrial automation,
  • delivery management, e.g. computing optimal delivery routes,
  • autonomous vehicles,
  • advertising,
  • chemistry (optimizing chemical reactions),
  • personalization in games,
  • power systems,
  • real-time bidding,
  • recommendation of news.


In this article, we introduced reinforcement learning, an important field of machine learning that is increasingly used in many different areas, from advertising and finance to autonomous vehicles and industrial automation.

Reinforcement learning is an approach that allows an agent to learn how to make a sequence of decisions. The agent achieves this by interacting with its environment and receiving feedback on its actions in the form of rewards and penalties. The main goal of the agent is to maximize the total reward of its actions.

Reinforcement learning became more widely known as part of the framework that enabled a program called AlphaGo to beat the best human players at the ancient game of Go. This occurred at a time when many believed that such a feat was still at least a decade away.

Reinforcement learning has found considerable success in many other fields, such as autonomous driving, and we expect that it will play an important role in the future of AI.