Blog

A Markov Decision Process (MDP) is a powerful framework for allowing an AI to make decisions in uncertain environments. In this project, I used an MDP to derive the optimal strategy for Blackjack. I then compared the strategy to a widely available “Blackjack Cheat Sheet”, and found the two to perform comparably, each expecting to lose only about 5¢ on a $10 bet. This matches the “house advantage” of roughly 0.5% held by the casino, indicating that the derived policy is indeed close to optimal.

MDPs and Blackjack Rules

Traditional MDPs are most applicable to situations where the decision space is finite and discrete, the outcomes are uncertain, and the terminal state and its rewards are well defined. Furthermore, the situation must be made to abide by the Markov assumption, that the next state is a stochastic function of only the current state and the action applied.

Blackjack is played against a dealer. The player is dealt two cards and the dealer exposes one card. The player aims to get the total face value of his or her cards as close to 21 as possible without going over. Using the knowledge of their own cards and the single exposed dealer card, they must choose to either Hit (they are dealt an additional card), Stay (they attempt to beat the dealer with the cards they currently hold), Double Down (they hit and double their current bet), or Split (if they hold a pair, they split it into two hands, place the same bet on the additional hand, and hit on both). If the player hits in any form and busts (goes over 21), they lose. Otherwise, if they stay, the dealer exposes another card, and must hit as long as their total value is less than 17, staying otherwise. If the dealer busts, or stays with a value lower than the player’s, the player wins. If the values are equal, they push (no reward); otherwise, the player loses. The last relevant rule is that if the player or dealer holds an A, it acts as a ‘soft’ 11, meaning it holds a value of 11 so long as doing so doesn’t make the hand’s total exceed 21. Otherwise, the A takes a value of 1.

Blackjack meets the criteria for applying an MDP because the player’s state can be defined by the total value of the hand they hold, whether or not it is a pair, and whether or not the hand is ‘soft’ (with an A acting as an 11). The dealer’s state is defined entirely by the value of the single exposed card, and thus the entire game’s finite state space is defined by the player’s state and the dealer’s state. Furthermore, the next state of the game is stochastically determined entirely by the current state and the player’s action (Hit (H), Stay (S), Double Down (DD), or Split (SP)).

Solving an MDP gives us an optimal (reward-maximizing) action (H, S, DD, SP) for every possible state. For an MDP with n possible states and m possible actions, the problem is defined by a State Transition Matrix (T) of size n × n × m, which specifies the probability of transitioning from state s to s′ via action a. The problem is fully defined by also providing a Reward Matrix (R) of the same size n × n × m, which specifies the reward gained by transitioning from state s to s′ via action a. Once the matrices T and R are provided, an MDP solver can be used to produce a decision policy of size n, where each entry π(s) is a member of the action space and specifies the action to be taken in state s.

Framing Blackjack as an MDP

Thus, in order to get the optimal strategy for Blackjack, we must first enumerate and encode the possible states of the game, and then compute T and R. A few assumptions must be made about the game to form an efficient encoding:

Infinite decks: if we assume that cards are drawn from an infinite set of decks (casinos typically use at least four), then no advantage is gained from tracking which specific cards have been dealt, only their face values, dramatically reducing the number of required states.

Constant Marginal Utility of Money: if we assume that the player is indifferent between doubling from $10 to $20 and doubling from $20 to $40, then we only need to keep track of whether the player has ever doubled down, not how many times they have. With this assumption, we only care about whether we should double down from a state in which we haven’t doubled down yet; to decide whether to do so again from a new state, we can reuse the decision policy of the equivalent state in which we hadn’t yet doubled down. Thus we can adopt the simplifying rule that a player may double down only once.

Using the first two assumptions, we can also see that if we split, the strategy used given either hand is not affected by the state of the other, and so we only need to keep track of one of the hands in our state.

Using these assumptions, we observe that the available states for the player’s hand are Hard (no A acting as an 11) 5-21, Soft 13-20 (Soft 21 is ignored because the strategy is the same as for Hard 21), and a Pair of 2-10 or A, giving us 35 possibilities. Note that all face cards are treated as 10. The dealer’s exposed card can be a 2-10 or an A, a total of 10 possibilities. Additionally, we keep track of whether or not we have doubled down as a binary state. The total number of combined states is 35 × 10 × 2, or 700 unique states. Lastly, we append four terminal states (Bust, Lose, Push, and Win), giving a total of 704 states.

With the states enumerated, the next step is to compute the transition probabilities from one state to the others given an action. Let us examine the Hit action for a Hard 5. This state can transition to a Hard 7-14 if any of the cards 2-9 are dealt, and thus transitions to these with a 1/13 probability. It can additionally transition to a Hard 15 with any of 10-K, so has a 4/13 probability. Finally, if an A is dealt with a 1/13 probability, it will transition to a Soft 16. If the action were a double down, the transition probabilities would be the same, but the transitions would be to the set of states assigned for having doubled down. As another example, if we hit on a pair of A’s, and a 7 is dealt with 1/13 probability, we transition to a Soft 18.
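As an illustrative sketch (in Python rather than the MATLAB used for the project, with hypothetical state labels), the Hit transitions from a low hard total like the Hard 5 above could be tabulated as:

```python
from fractions import Fraction

def hit_transitions(hard_total):
    """Transition distribution for Hit from a hard (no soft Ace) total.
    Valid for low totals (5-10), where no draw can bust and an Ace can
    safely count as 11. Ranks 2-9 each have probability 1/13; 10, J, Q,
    and K all count as 10 (4/13 combined); an Ace makes the hand soft."""
    probs = {}
    for rank in range(2, 10):                            # ranks 2-9
        probs[("hard", hard_total + rank)] = Fraction(1, 13)
    probs[("hard", hard_total + 10)] = Fraction(4, 13)   # 10/J/Q/K
    probs[("soft", hard_total + 11)] = Fraction(1, 13)   # Ace as 11
    return probs

# Hard 5: Hard 7-14 with 1/13 each, Hard 15 with 4/13, Soft 16 with 1/13.
dist = hit_transitions(5)
assert sum(dist.values()) == 1      # each row of T must sum to 1
```

The exact fractions make the row-sum bug-check trivial; the full matrix would extend this with bust transitions for higher totals and the soft-hand and pair cases.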

Carefully considering all possibilities, we can fill out the transition probabilities for the Hit, Double Down, and Split actions. A quick bug-check is to ensure that each row of T sums to 1, confirming that every state always transitions somewhere with total probability 1.

Lastly, we must fill out T for the Stay action. If we stay with a value of 17 and the dealer has an exposed 9, we know there is a 4/13 probability they will draw a 10-K and have to stay at 19. However, if they draw any card from 2-7 (each with probability 1/13), they must hit again, leading to many individual cases to consider. A simpler way to find the transition probability from, say, my holding an 18 against a dealer showing a 4 is to approximate it with a Monte Carlo simulation. Simulating the game an excessive 10,000 times for each of the 350 unique game states (double-down states behave identically), we estimate the transition probabilities from each state to the possible terminal states under the Stay action. We also handle illegal actions (like splitting without a pair) by sending them straight to Bust.
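A minimal sketch of such a Monte Carlo estimate, in Python rather than the project’s MATLAB, assuming the infinite-deck model (so the dealer’s hole card is just another draw) and a dealer who stands on all 17s:

```python
import random

def dealer_outcome(upcard, player_total, rng=random.Random(0)):
    """Play out the dealer's draw-to-17 hand once; return the outcome
    ('win'/'push'/'lose') from the perspective of a player who Stayed.
    An Ace counts as a soft 11 while that doesn't bust the hand."""
    total, soft_aces = (11, 1) if upcard == 1 else (upcard, 0)
    while total < 17:
        card = min(rng.randint(1, 13), 10)   # 1 = Ace; 10-13 all count 10
        if card == 1 and total + 11 <= 21:
            total += 11
            soft_aces += 1
        else:
            total += card
        if total > 21 and soft_aces:         # demote a soft Ace to 1
            total -= 10
            soft_aces -= 1
    if total > 21 or total < player_total:
        return "win"
    if total == player_total:
        return "push"
    return "lose"

# Estimate the Stay transition probabilities for an 18 vs. a dealer 4,
# the example from the text, using 10,000 trials.
trials = 10_000
counts = {"win": 0, "push": 0, "lose": 0}
for _ in range(trials):
    counts[dealer_outcome(4, 18)] += 1
```

Dividing each count by `trials` gives the estimated probabilities of the Win, Push, and Lose terminal states for that row of T.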

The next step is to fill out the reward matrix R. As a reminder, we need to state the explicit reward for transitioning from a given state s to another state s′ under some action a (even if the transition would be impossible). There are only a few cases to handle:

If we bust by hitting from any normal state, we lose a nominal amount like $10.

If we bust by hitting after having doubled down, or by the double down itself, we lose double the nominal, so $20.

If we lose by staying on a normal or doubled down state, we lose $10 or $20 respectively.

If we win by staying on a normal or doubled down state, we win $10 or $20 respectively.

If we make any other legal move, we get no explicit reward.
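The cases above can be collected into a small helper. This is an illustrative Python sketch with hypothetical state labels; the project stored these values in the full n × n × m matrix R:

```python
BET = 10  # nominal bet in dollars

def reward(next_state, doubled):
    """Explicit reward for transitioning into next_state.
    'doubled' marks states reached after a Double Down; every
    non-terminal transition, and a Push, pays nothing."""
    stake = 2 * BET if doubled else BET
    if next_state in ("Bust", "Lose"):
        return -stake
    if next_state == "Win":
        return stake
    return 0

assert reward("Bust", doubled=True) == -2 * BET
assert reward("Win", doubled=False) == BET
```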

Having specified T and R, we can now simply use an MDP solver to get the decision policy. For this project, the open-source MDP Toolbox for MATLAB was used. The mdp_LP function takes T, R, and a discount rate (set to 0.999, since we don’t mind making extra moves to win), and returns the optimal policy: a vector with one action per state.
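mdp_LP solves the MDP via linear programming; the same optimal policy can also be recovered by value iteration, sketched here in Python with NumPy. Storing T and R as action-indexed arrays of shape (m, n, n) is an assumption about layout for this sketch, not the toolbox’s actual format:

```python
import numpy as np

def solve_mdp(T, R, gamma=0.999, tol=1e-9):
    """Value iteration. T[a, s, s'] is the transition probability and
    R[a, s, s'] the reward for s -> s' under action a. Returns the
    optimal policy (one action index per state) and the value function."""
    m, n, _ = T.shape
    V = np.zeros(n)
    while True:
        # Q[a, s] = sum_s' T[a, s, s'] * (R[a, s, s'] + gamma * V[s'])
        Q = np.einsum("asn,asn->as", T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=0), V_new
        V = V_new
```

For the Blackjack problem, the policy entry for each non-terminal state then reads off one of H, S, DD, or SP.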

Evaluating the Policy

The optimal policy returned by the MDP solver is shown below, compared against a “Blackjack Cheat Sheet”.

We can see a surprising amount of similarity for the Hard player hands, but some discrepancy for the Pairs and Soft hands. Despite this discrepancy, the results still line up reasonably with intuition. In fact, the cheat sheet seems to have some inconsistencies between its Pair and Hard hand strategies: it recommends hitting regardless of the dealer’s card on a Pair of 6s, but recommends staying against moderate dealer hands on a Hard 12, which is an identical hand under a Hit. Such inconsistencies appear less frequently in the MDP output.

To test the performance of these decision policies more rigorously, I performed a Monte Carlo simulation of 1,000,000 games for each policy, using this Blackjack simulator on MATLAB. The policies were also compared to a very naive policy of Staying no matter the state. Using a $10 nominal bet per game, the MDP policy and the cheat sheet were found to have expected losses of 6¢ and 4¢ respectively, with standard deviations in these estimates (of expected loss) of 1.2¢. Using MATLAB’s ttest2 (a two-sample t-test) to test for a statistically significant difference in performance, we can assert that the cheat sheet is better with about 90% confidence.

On the other hand, the naive strategy has a much larger expected loss of $1.60, with a standard deviation of 1¢ on that estimate. Using ttest2, we can assert that the MDP strategy is better with near-perfect confidence.

While the difference between the MDP’s performance and the cheat sheet’s is marginal, the likeliest explanation for the MDP performing slightly worse is a small error on my part in painstakingly filling out the almost two million transition probabilities in T. What is clearly evident, however, is that the exercise does generate a near-optimal strategy for playing Blackjack, one that performs far better than a naive decision policy.

Conclusion

The purpose of this project was to get familiar with using Markov Decision Processes to provide optimal strategies in discrete, finite-state stochastic environments. Blackjack seemed to be a perfect candidate to try the algorithm on. To formulate the game in a manner acceptable to an MDP solver, I first specified the probabilities for transitioning from each game state to every other under all actions, using either explicit analysis or Monte Carlo simulation. I then specified the rewards for transitioning from game states to terminal states under the different actions. With these matrices specified, I used the MDP solver to provide the optimal strategy. I compared this strategy to a widely available “Blackjack Cheat Sheet”, which claims to be the optimal blackjack strategy. Repeatedly simulating the strategies in the game of Blackjack, I found that both the MDP’s strategy and that of the cheat sheet lose only about 5¢ per $10 bet on average, while a more naive strategy loses $1.60 on the same bet. According to wizardofodds.com, the “house advantage” for a game of Blackjack with the assumed rules is around 0.5%. The house advantage indicates the dealer’s expected value assuming the player plays optimally, and thus provides evidence that the strategy given by the MDP is indeed close to optimal.

I did this project with the help of Dilip Ravindran, a close friend and graduate student at Columbia Economics.

As part of the Harvey Mudd College Clinic Program, I worked on a project for the Kinohi Institute. Over the course of a year, we designed and built the first spatial heterodyne Raman spectrometer capable of operating with near-infrared excitation light. The team also tested and validated the performance of this device.

Raman spectroscopy is a common technique used to determine the material composition of a substance. Currently, the most common Raman spectrometers are slit spectrometers (which have low light throughput), and Fourier Transform spectrometers (which require sensitive moving parts). Spatial Heterodyne Spectrometry (SHS) has no moving parts and high light throughput, which means it is robust and non-destructive to the sample. The Spatial Heterodyne Infrared Spectroscopy (SHIRS) technique developed in this project can be employed to detect cancer, observe protein folding in real time, and could even be used for the non-destructive detection of extraterrestrial life.

In this post, I will describe how the device excites Raman signals, isolates the signal from noise, and identifies the Raman signal through spectroscopy. I will then discuss the results of the effort.

Figure 1: The complete SHIRS system. The Raman signals are excited in section A, filtered from noise in section B, and the resulting signal is transformed into a spectrogram by the Spatial Heterodyne Spectrometer in section C.

Exciting Raman Signals (A)

Raman signals are a form of light that scatters off molecules with a change in energy that acts as a signature for the molecule. Hence, if we shine light of a single, pure wavelength onto any molecule and identify the spectrum of the resulting Raman signals, we can correctly identify the molecule. However, this isn’t quite so easy. For one, Raman scattering is a very low-probability event, which makes its signals extremely weak. In addition, the unchanged reflection of the input laser (Rayleigh scattering) and fluorescent reflections are sources of noise that flood out the Raman signals. The method for dealing with this noise is discussed in the next section.

Isolating Raman Signals (B)

Aside from the advantage of causing little damage to organic samples, a near-infrared excitation laser causes much less fluorescence than the visible or UV excitation lasers more conventionally used in Raman or SHS systems. Additionally, since Raman shifts are known to be at longer wavelengths than the excitation laser, a Long-Pass Filter (LP in Figure 1) and a Long-Pass Dichroic Mirror (D) are used to ensure that all light at wavelengths shorter than the excitation wavelength is heavily attenuated. A Notch Filter (N) heavily attenuates the Rayleigh scattering from the excitation laser. The remaining light is composed mostly of the Raman signal, at a sufficiently high signal-to-noise ratio. The next section will discuss how a Spatial Heterodyne Spectrometer recovers the spectrum of this light.

Identifying Raman Signals (C)

With reference to Figure 1, the SHS works by first using a 50:50 Beam Splitter (BS) to direct the light toward two Diffraction Gratings (DG). The light then reflects off these gratings at a wavelength-dependent angle. Since the angle of reflection depends on wavelength, there is a wavelength-dependent tilt between the two reflected wavefronts that recombine on the Camera’s Sensor (C). From the theory of interference, we know that the frequency of oscillations in an interference pattern depends on the angle separating the wavefronts. Thus, the wavelength of light is encoded in the frequency of the oscillations produced on the camera’s sensor. Performing a Fourier Transform on the resulting image recovers the spectrum of the input light.
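The wavelength-to-fringe-frequency mapping can be illustrated with a toy one-dimensional simulation in Python. The constants and the linear frequency model are purely illustrative, not the device’s calibration:

```python
import numpy as np

# Toy 1-D SHS model: a monochromatic source at wavenumber sigma produces
# fringes at a spatial frequency proportional to (sigma - sigma_L), where
# sigma_L is the gratings' Littrow wavenumber (all units arbitrary).
n_pixels = 1024
x = np.arange(n_pixels) / n_pixels - 0.5   # normalized sensor coordinate
sigma_L = 100.0                            # illustrative Littrow setting

def fringe_pattern(sigma, amplitude=1.0):
    """Interference fringes from a single source at wavenumber sigma."""
    return amplitude * (1 + np.cos(2 * np.pi * 4 * (sigma - sigma_L) * x))

# Two spectral lines superpose two fringe frequencies on the sensor;
# a Fourier transform of the image recovers the line positions.
image = fringe_pattern(110.0) + 0.5 * fringe_pattern(130.0)
spectrum = np.abs(np.fft.rfft(image - image.mean()))
peaks = np.argsort(spectrum)[-2:]          # the two strongest FFT bins
```

With these constants, the lines at wavenumbers 110 and 130 appear as fringe frequencies of 40 and 120 cycles across the sensor, and the FFT peaks at exactly those bins.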

The figure above shows examples of fringe patterns created by single-frequency sources (top), and the spectra that are generated from the fringe patterns (bottom). The generated spectra are compared to spectra gathered from an off-the-shelf spectrometer. We can see that for two test cases, the device functions as an accurate and precise spectrometer compared to an off-the-shelf alternative.

Results

The complete SHIRS system was validated as a standalone spectrometer on various test sources such as Hydrogen, Neon, and Sodium Vapor Lamps. The spectrometer demonstrated a bandwidth of 131 nm and a resolution of 0.1-0.6 nm, which satisfy the constraints for the desired applications of the Kinohi Institute.

The figure above shows the spectrum of the sodium vapor lamp captured by the team’s SHS and the ScienceSurplus slit-based spectrometer. Notice that only the team’s SHS is able to resolve sodium’s characteristic double peak.

The complete system was validated by testing elemental Sulfur, which is known to have a strong Raman signal and is a typical first-test case.

This figure shows the Raman spectrum gathered from a sulfur sample using the team’s prototype. The peak seen matches the peak expected from published Raman spectra to within 1 cm⁻¹. This spectrum was reproducible across multiple trials.

I completed this project on a team with Alex Alves, Achintaya Bansal, Erica Martelly, and Apoorva Sharma. The team was advised by Professors Philip Cha and Gregory Lyzenga of Harvey Mudd College. If you would like to apply this technology to any of your work please contact me or the liaison of this project from the Kinohi Institute, Dr. Michael Storrie-Lombardi (mike@kinohi.org).

In this project, a small disc magnet is levitated by a solenoid that is PID-controlled by an FPGA at a 156 kHz sample rate. A slower open loop driven by a Raspberry Pi stably moves the magnet between set points: the Pi communicates the set point, steady-state voltage, and gains, making large transitions by commanding a series of smaller jumps, each with an accurate linearization. A video of the system in action can be found here.

Figure 2: Block diagram of system components

Figure 2 shows a block diagram of the system components. As shown in Figure 1, the physical system is composed of a solenoid above a disc magnet. A PID controller is implemented on an FPGA to control the position of the disc magnet. The 10-bit position of the magnet is measured by the Hall-effect sensor, which is located about an inch below the solenoid so it avoids saturation from the solenoid’s magnetic field but still detects the disc-magnet’s field. The voltage from the sensor is passed through an ADC and then through a pre-calibrated lookup-table to estimate the position of the magnet. The position is then compared to the desired setpoint and used to compute a voltage across the solenoid. The voltage is computed by a PID controller that implements discrete-time derivative filtering for improved performance. This voltage is then passed through a PWM to generate a binary control signal. This control signal leaves the FPGA and is applied to a transistor which controls the current flow to the solenoid.
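A discrete-time PID with a filtered derivative, the controller structure described above, might be sketched as follows in Python; the gains, filter coefficient, and the omission of fixed-point and PWM details are placeholders, not the FPGA implementation:

```python
class FilteredPID:
    """Discrete PID where the derivative term passes through a first-order
    low-pass filter so sensor noise isn't amplified (placeholder gains)."""

    def __init__(self, kp, ki, kd, n=20.0, dt=1 / 156_000):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.n = n                # derivative filter coefficient
        self.dt = dt              # 156 kHz sample period
        self.integral = 0.0
        self.d_state = 0.0        # filtered-derivative state
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        """One control step: returns the commanded solenoid voltage."""
        error = setpoint - measurement
        self.integral += error * self.dt
        raw_d = (error - self.prev_error) / self.dt
        # First-order low-pass on the raw difference-quotient derivative.
        alpha = self.n * self.dt / (1 + self.n * self.dt)
        self.d_state += alpha * (raw_d - self.d_state)
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * self.d_state)
```

On the real system this output would be converted to a duty cycle by the PWM block rather than applied directly.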

The linearized gains and feed-forward voltages most applicable to a given setpoint were found in calibration and stored on a Raspberry Pi. The Raspberry Pi takes a desired setpoint and passes the associated gains and feed-forward voltage to the FPGA’s PID module over SPI. In order to create the up-and-down motion seen in the video, the Pi takes small, discrete steps to march the magnet to the endpoints, holding each command for a time long enough for the magnet to reach the desired intermediate location and dissipate its transient response.

The project was done as a joint final project for the Harvey Mudd College courses Microprocessor Design (E155) and Systems Simulation (E205). The project was done in collaboration with Vaibhav Viswanathan.