In this post: high-level description of bandit algorithms and how they could play a role in recent news regarding an emergency response experiment being held in Philadelphia.

On the way home from work today, I heard on the radio about a controversial trial being performed in Philadelphia. They are studying the appropriate response for a gunshot or stab victim: whether the victim receives advanced care at the scene or if they receive only basic care before being transported to a nearby hospital.

Most of the controversy from this study, at least from what the article describes, is due to how the study is conducted. Depending on the dispatch number of the city paramedics…

  • odd numbers will carry out advanced care
  • even numbers will provide basic care

Just focusing on appearances, the words “advanced” and “basic” imply that half of the future stab and gunshot victims in Philadelphia will just randomly get sub-bar care (because “advanced” is better than “basic”, right?). It’s reasonable to see how the population may be worried.

I won’t be exploring the medical aspects of this story. I am no where near qualified to discuss those details. But I want to explore how this research is conducted and how a family of algorithms might be better suited for a study like this. These algorithms are known as Bandit Algorithms.

The Hypothesis

Dr. Zoe Maher, trauma surgeon, mother, and a researcher conducting this study, asserts that it is better for victims to receive basic care until they reach the hospital. Her colleagues and she believe that advanced care, such as IV fluids and breathing tubes, will disrupt the body’s natural measures of preventing blood like the constriction of blood vessels and the forming of clots.

The 2 decisions, advanced care or basic care at the scene, lie on the spectrum of an immediate response in a less than ideal environment (the scene) or a delayed/partial response followed by a much more ideal setting (the hospital).

I want to do a basic intro to bandit algorithms and why they’re better for this type of experiment. Then I want to motivate contextual bandits as the further improved approach. Finally, I want to discuss another way to approach this study that involves already-collected data while still using bandit algorithms.

Bandit algorithms

Bandit algorithms, also known by the longer name “Multi-armed bandit” algorithms, represent a family of reinforcement learning algorithms. Reinforcement learning poses machine learning problems as an agent with a set of actions. Actions interact with an environment and the environment will respond with a reward. A good policy describes what actions an agent should take when faced with a decision from the environment such that the agent maximizes its rewards. The phrase “multi-armed bandit” are named after the scenario in which these algorithms were first studied. Imagine a person at a casino with an entire row of slot machines to themselves.

  • The Gambler (the agent)
  • The row of slot machines (the environment)
  • The gambler chooses which slot machine they will play on (the actions, each machine is one action)
  • The payout of the played machine (the reward from the environment)
  • which machine to play (the policy)

The gambler, I hope, wants to maximize their payouts. Suppose these slot machines have differing but fixed ratios by which they will payout i.e. slot machine 1 will payout 10% of the time (I don’t gamble much but I know slot machines are much more complicated than this. We’re simplifying it for this discussion). To find the machine with the highest payout, the gambler can explore and try each machine numerous times to confidently identify the payout likelihood. And once identified, they will exploit the best machine. But the gambler has a finite purse and can’t spend it all just to determine the payouts. This is where bandit algorithms come into play.

Avoiding a very mathematical treatment of the algorithm, my best summary for it is this:

a simple, multi-armed bandit algorithm will create a policy such that the gambler will choose to play each slot machines in proportion to the rewards she or he has received from that machine.

And it sounds so simple. Because it is. The gambler should play the machines that pay them more. It’s so obvious. But the elegance of bandit algorithms is how it dictates exploration vs. exploitation in order to maximize rewards and actually, minimize the difference between potential rewards. We can’t explore how these different variations of bandit algorithms handle the reinforcement learning problem (I haven’t even gotten to its application to this story yet!) because they are many. But I do recommend @johnmyleswhite’s book Bandit Algorithms for Website Optimization. (I’m not affiliated with O’Reilly, I just figured because it was an animal book, O’Reilly is the primary source. You can also get it in other places.). It’s hands-down the best book for applying bandit algorithms.

How does this apply to health-care?!

Consider a different scenario AND consider a different approach.

You’re a doctor and you would like to identify, between the choice of 3 treatments, which treatment is best suited to cure patients suffering from Nhatinitus (trust me, you don’t want it). Little is known about the disease and all the treatments are new. So you decide to divide a population of 90 patients diagnosed with the disease into 3 groups, each group receiving one of the treatments. Results are collected when the trial is over and the group with the best recovery ratio determines what treatment will be delivered to the masses. Well, wouldn’t it suck to have been in the group that doesn’t receive the working treatment. It sucks even more when you learn that Nhatinitus gives you a nauseating reaction to desserts and inclinations to bad sweaters.

Look at that damn sweater poor guy

What if instead, as the experiment progresses, you notice that one group is performing better than the others? And what if by noticing, you progress to moving individuals from other groups to the better one, thereby improving your final success rate? That’s what bandit algorithms do for you. It allows subsequent decision to be changed dynamically based on prior rewards.

Consider quickly another scenario. People love saying they optimize their user interaction/ click-through rate/[some metric] on their website because they do A/B testing. (Side note: A/B testing. Think of the above example except with only 2 groups). Bandit algorithms would be better because they will progress towards the better solution as we learn!

SO FINALLY how does it apply to our story? It seems the Philadelphia story is testing similar to how A/B testing is done: half of the population will receive “basic” treatment and the other half will receive “advanced” treatment. Why not use bandit algorithms to dynamically decide which treatment to administer as we learn the more successful of the two?

Contextual bandits

Up until now, bandits only developed their policy based solely on the rewards they receive from the environment. But a bandit should be taking advantage of additional information (the context) and include them in their decision-making. Enter the contextual bandit.

The words “contextual bandit” should be synonymous with “John Langford” who’s one of, if not, the most prominent researchers in this area. (Side note 2: what’s up with the dominance of John’s within bandit algorithms?). Refer back to the agent and their environment. This time, the agent doesn’t choose decisions based on historical rewards. They decide given a context and the historical reward for that context. There is no one, true decision that is best. So in the example above, perhaps their isn’t one treatment that is best (the action) but the best treatment depends on the patient (the context).

Of course, this increases the complexity of the algorithm (this paper is amazing but makes my head numb). Most often, we want an algorithm that can handle an infinite space of context whereas the set of decisions must be constrained.

How does this apply to the Philadelphia trials? Well maybe there isn’t one solution that is better. Perhaps stabbing victims (a context) show more success when given “advanced” treatment whereas gunshot victims (another context) respond better to “basic” treatment due to the differing average size of the trauma between the two types of incidents. And we don’t have to limit our context to that. We can include time to respond, age/health of victim, and other factors in our context to create richer information for the bandit.

Offline Analysis

The last thing I want to touch upon is how this study can be conducted on existing data. We should look at two things: 1) the sensitivity of an experiment like this and 2) a pure and true reinforcement learning setting. These two things will seem to conflict but I’ll propose a way to compromise.

Reinforcement learning is powerful because it allows one to build a model from initially knowing nothing about the environment. The gambler has no idea which slot machine will be most fruitful and a doctor doesn’t know which new drug will work best. So a typical scenario for a scientist working with reinforcement learning is to just set the agent loose and let it train anonymously.

An experiment such as the one being conducted in Philadelphia is considered very sensitive because it is dealing with people’s lives. The article contain quotes regarding citizens’ concerns in regards to “gambling” with people’s lives. What maybe one decision to explore an action is actually a risk on somebody’s life. We would like run an experiment that reduces as much risk as possible while still gathering enough evidence to identify the best policy. Historical data may be the best way to do this. You can prevent further risk by essentially not running this experiment at all! But how do we do run a bandit on historical data?

In this tutorial by John Langford at NIPS, John runs his bandit framework, against an existing dataset and in particular, a dataset that looks like a supervised learning dataset. So there are features (the context) and each label can be interpreted as an action. As John explains, if you have a confident guess at what the bandit’s policy is for this dataset, you can set those parameters a.k.a. taking decision A was always 25% of the time. If not, you can just assume it was even amongst the possible actions (5 actions? Assume a policy that would choose a certain arm 20% of the time.) Then you can train a bandit on the dataset! The bandit algorithm will learn from that and be able to decide what to do on future events. It can also decide whether it wants to explore the actions more if the dataset didn’t provide enough training.


The experimental trials in Philadelphia immediately jumped to me as a prime example of where bandit algorithms are appropriate. With bandit algorithms, one can design a robust policy for decision-making and reduce the expense of the experiment.

For those of you wanting or hoping for something more detailed or technical, I sincerely apologize. I hope to revisit this again with some examples and actual code.

Thanks for reading!