Came alive on March 31, 2020
Most courses on reinforcement learning include some mention of "N-Armed Bandit problems" since they are a relatively easy exercise to start off with. As simple as they may be, they do help in developing "mental constructs" that help you reason about the bigger, more complete reinforcement learning problem down the line. They even have cool applications like choosing what advertisement to display to a user, so that they might click on it.
A slot machine in a casino could be described as a "one armed bandit", since it has one arm ("the lever"), and usually ends up taking your money. Now, an "N-armed bandit" could be thought of as N different slot machines each with a different "average payout" value. "Machine i" might give you 10 dollars on average, and "Machine j" might give you 3 dollars on average in the long run.
Simply put, your task is to find the slot machine that'll give you the most payout / "reward" in the long run. During any "game" you get to choose a machine whose lever you'd like to pull. The problem at hand is "non-stationary", which is just a fancy way of saying each slot machine keeps giving the same "average reward" that it was programmed to. The reward distribution across the N slot machines remains the same, irrespective of the number of games you play. There is no notion of a state, just actions, since the agent is perpetually in the same state. Our "agent" who plays this game does not know the average payouts for the machines, so it must find the optimal machine through its actions across all the games.
So let's make our own N armed bandit, each of which has an "average reward" ranging from 0 to 10. Here I've decided to go with 10 machines. Simply click on the green "Generate Bandit" button to create your very own 10 armed bandit! The bar highlighted in green, is the slot machine with the highest payout. This is the ideal slot machine that we would like our agent to keep using every game.
In the following interactive tool, you'll get to use 2 different exploration strategies for your agent. One way would be to choose the lever with the maximum reward you have seen so far. However, this almost always ends up in a suboptimal lever, since the agent is biased towards the first lever that it pulls in the very first game. It doesn't care to explore other options, because it sticks to the best lever it has seen so far, that is, the first lever it ever chose. That is why the agent's reward estimate for the other "arms" / "actions" do not develop over time.
If however, we take a risk and randomly explore other levers once in a while (say 20% of the games), then mathematically speaking, our agent's reward estimate will get really close to the actual reward distribution if it plays enough games. Using such a strategy, you might discover that another action that was previously undiscovered actually gives you a better reward. Or it could be the case that your previous estimate of the value of that action was low, because it gave you a low payout, because the agent hasn't sampled that action enough times or simply by random chance (remember that a $5 slot machine gives you an "average" of 5 dollars, it probably wouldn't give you exactly $5 every time you choose that machine).
This latter strategy is known as "epsilon-greedy exploration", since you explore with a probability of "epsilon" once in a while.
If you chose the greedy strategy, then each time you click on "Train Agent", you see that the agent locks into one option (in the second plot above). The very first action that the agent chooses would be the action that the agent locks into. It is therefore, biased by its initial choice.
If however, you choose the epsilon greedy strategy, the agent consistently converges to the optimal action for reasons we've talked about before. Moreover, it succeeds at consistently finding the lever with the maximum payout across different training sessions.
Hopefully, you've now gained a bit more intuition about the simple, yet powerful concept of bandits in RL!