<p><em>#include &lt;adlib.h&gt;: Max Shinn’s blog</em></p>

<h1>Does “Flight of the Bumblebee” resemble bumblebee flight?</h1>
<p><a href="https://www.youtube.com/watch?v=X14kC-sEH0I">Flight of the Bumblebee</a> is one of
the rare pieces of classical music which, through its association with bees, has
cemented its place in pop culture. However, it is unclear whether its composer,
Nikolai Rimsky-Korsakov, actually took inspiration from bumblebee flight
patterns. I address this question using new tools from ethology, mathematics,
and music theory. Surprisingly, the melody line of “Flight of the Bumblebee”
mimics a distinctive property of bumblebee flight, a property which was not
formally discovered until decades after Rimsky-Korsakov’s death. Therefore,
<em>yes, it is very likely</em><sup id="fnref:a" role="doc-noteref"><a href="#fn:a" class="footnote" rel="footnote">1</a></sup> that Rimsky-Korsakov observed and incorporated
actual bumblebee flight patterns into his music.</p>
<p>In what follows, I assume the reader has knowledge of high school level
mathematics and basic music theory (chord changes, intervals, scales, etc.).
<a href="https://osf.io/preprints/socarxiv/4v6nu/">This post is based on my recent preprint, available on
SocArXiv.</a></p>
<h2 id="historical-background">Historical background</h2>
<p>The piece we now call “Flight of the Bumblebee” actually comes from one
of Rimsky-Korsakov’s operas, “The Tale of Tsar Saltan”. The opera is based on a
story by Alexander Pushkin, which is in turn based on several folk tales. The
hero, Prince Gvidon, is cast on a remote island by his jealous aunts. While
there, he unknowingly saves a Swan Princess from death, so in return, she looks
after his well-being. To help him to see his father again, she temporarily
transforms him into a bee<sup id="fnref:b" role="doc-noteref"><a href="#fn:b" class="footnote" rel="footnote">2</a></sup> so he may fly back to the kingdom (and sting his
aunts). Eventually, when he is a human again, Prince Gvidon is reunited with
his father and marries the swan princess.</p>
<p>Flight of the Bumblebee begins towards <a href="https://youtu.be/iKWGvke7bq8?t=3192">the end of the first scene of Act
III</a> after Prince Gvidon is transformed
into a bee. The first half of the piece serves as background music for the Swan
Princess’ singing as Prince Gvidon flies away. The second half serves as a
transition from the first to second scene of Act III. In most renditions of
this piece today, the Swan Princess’ vocal line is removed. The piece
introduces the “bumblebee” theme, which continues to be a central theme of Act
III as Prince Gvidon flies around the court causing mischief.</p>
<p>Despite the fact that Flight of the Bumblebee is by far the most recognisable
piece of music Rimsky-Korsakov wrote in his lifetime, he didn’t consider it to
be one of his major works. In fact, Rimsky-Korsakov didn’t even include it in
his own <a href="https://imslp.org/wiki/The_Tale_of_Tsar_Saltan_(suite),_Op.57_(Rimsky-Korsakov,_Nikolay)">suite of highlights from the
opera</a><sup id="fnref:c" role="doc-noteref"><a href="#fn:c" class="footnote" rel="footnote">3</a></sup>.
This piece was relatively unknown until 1936, when it first<sup id="fnref:d" role="doc-noteref"><a href="#fn:d" class="footnote" rel="footnote">4</a></sup> entered pop
culture as the theme song for the radio show “The Green Hornet”. It only took a
few years to become widely recognisable, in large part due to a <a href="https://www.youtube.com/watch?v=jxS7llr8x_4">1941 big band
jazz cover by Harry James</a>.</p>
<h2 id="bumblebee-flight">Bumblebee flight</h2>
<p>In accordance with the cliché “busy as a bee”, bumblebees spend most of their
day looking for food. So, we can understand bumblebee flight patterns by
understanding how they forage for food. New miniature radar technology allows
us to <a href="https://www.youtube.com/watch?v=P575vyxOc2Q">track the flight of foraging
bumblebees</a>. One
characteristic feature we have discovered about <a href="https://doi.org/10.1016/0003-3472(95)80047-6">insect foraging flight
patterns</a>, and <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0078681">bumblebee flight
in
particular</a>,
is the <a href="https://journals.biologists.com/jeb/article/210/21/3763/17205/Honeybees-perform-optimal-scale-free-searching">presence of lots of short movements and a few very large
movements</a>.
At any given point in time, a bee may choose to move in any direction it wants.
Most of the time, it will move a small distance. However, sometimes, it will
make very large movements. For example, it may <a href="https://besjournals.onlinelibrary.wiley.com/doi/full/10.1046/j.1365-2664.1999.00428.x">travel long distances and then
carefully explore a small
patch</a>.
Relatively speaking, bees will make fewer moderate-sized movements: most
movements will be very small, but those which aren’t small will be very large.
<a href="https://www.nature.com/articles/44831">This has been shown to be optimal
behaviour</a> for bumblebees in many
situations. In fact, algorithms inspired by this bumblebee behaviour have been
used to tackle <a href="https://www.tandfonline.com/doi/abs/10.1080/00207721.2015.1010748">complex engineering
challenges</a>.
This type of movement is in contrast to, e.g., the movement of spores released
from trees, which tend to drift gently with the wind and not make large, sudden
movements across a pasture. We will refer to these large movements as “jumps”.</p>
<p>If you think about it, it makes sense that bees might sometimes prefer to make
very large movements when they are searching for food. If they always take
small steps, they may never get to the large food source that is on the other
side of the meadow. Indeed, this strategy is used by a <a href="https://www.cambridge.org/core/books/physics-of-foraging/B009DE42189D3A39718C2E37EBE256B0">wide range of
animals</a>
beyond bumblebees<sup id="fnref:e" role="doc-noteref"><a href="#fn:e" class="footnote" rel="footnote">5</a></sup>.</p>
<h2 id="mathematical-analysis-of-bumblebee-flight">Mathematical analysis of bumblebee flight</h2>
<p>To represent these flight patterns, we need to build an extremely simple model
which is easy to work with and can be applied to a melody line. First, we need
a way to relate the notes in a melody line to the location of a bee. One
natural way is to assume that the ups and downs of the melody line correspond to
movements of the bee. We can do this by assigning a numerical value to each
note in the melody line, where 0 is the lowest note on the piano keyboard and
each half step interval is 1 higher than the previous. Then, tracking the note
in the melody line is the same as tracking the location of the bee. With this
representation, a small movement for a bee is equivalent to a small interval in
the melody line. Likewise, a “jump” for the bee is analogous to a “jump” in the
melody line, or a large interval.</p>
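<p>As a concrete sketch (my own illustration, not code from the preprint), the mapping described above might look like this, with repeated notes collapsed so that only genuine steps remain:</p>

```python
# Map note names to semitone counts above A0 (the lowest piano key), so that
# tracking the melody is the same as tracking the 1-D position of the "bee".
CHROMA = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
          "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

def pitch_number(name, octave):
    """Semitones above A0, using piano octave numbering (A0 = 0, C4 = 39)."""
    return 12 * octave + CHROMA[name] - 9

def step_sizes(melody):
    """Absolute intervals between neighbouring notes; repeats (size 0) dropped."""
    pitches = [pitch_number(name, octave) for name, octave in melody]
    return [abs(b - a) for a, b in zip(pitches, pitches[1:]) if b != a]

# A small made-up fragment: a chromatic step, a repeat, then an octave jump.
print(step_sizes([("A", 4), ("G#", 4), ("G#", 4), ("G#", 5)]))  # → [1, 12]
```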
<p>Now we can think about how a simple bumblebee flight model might operate. The
“small steps and large jumps” behaviour we described in the previous section
depends only on the sizes of the steps the bee takes at subsequent points in
time, not on the direction. So without making assumptions about the direction
of each step, we can specify the probability of having steps of different sizes.
Then, we can assume that the bee goes in a random direction at each step. This
model, known as a random walk, is used to model a <a href="https://en.wikipedia.org/wiki/Random_walk#Applications">huge number of
phenomena</a> in the
natural world.</p>
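<p>A random walk of this kind takes only a few lines of Python to sketch. This is my own toy illustration, not the simulation code behind the figures; the two step-size distributions are the ones described in the following paragraphs:</p>

```python
import random

def geometric_step(p=0.5):
    """Step size k with probability p * (1 - p)**(k - 1), for k = 1, 2, 3, ..."""
    k = 1
    while random.random() > p:
        k += 1
    return k

def powerlaw_step(s=1.74, kmax=100):
    """Step size k with probability proportional to k**(-s), truncated at kmax."""
    sizes = range(1, kmax + 1)
    weights = [k ** -s for k in sizes]
    return random.choices(sizes, weights=weights)[0]

def random_walk(step_fn, n_steps=200):
    """1-D walk: at each tick, pick a random direction and a random step size."""
    position, path = 0, [0]
    for _ in range(n_steps):
        position += random.choice([-1, 1]) * step_fn()
        path.append(position)
    return path

random.seed(1)
geometric_path = random_walk(geometric_step)
powerlaw_path = random_walk(powerlaw_step)
```

Plotting the two paths side by side reproduces the qualitative difference shown in the figures below: the powerlaw walk occasionally leaps, the geometric walk rarely does.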
<p>We will compare two models for choosing our step sizes in the random walk<sup id="fnref:f" role="doc-noteref"><a href="#fn:f" class="footnote" rel="footnote">6</a></sup>.
The first, the “geometric” model, is a good representation for most types of
data. It posits that steps become proportionally less frequent as they get
larger. So, if 50% of steps are of size 1, 25% will be of size 2, 12.5% will be
of size 3, and so on. This means that large step sizes are extremely
infrequent: if we continue the pattern forward, a step size of one octave (12)
will only occur once every 4000 steps, and a step of two octaves (24) will occur
once every 16 million steps!</p>
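<p>These frequencies follow directly from the halving pattern; here is a two-line check (my own, for illustration):</p>

```python
# Under this geometric model each extra semitone halves the probability,
# so P(step = k) = 0.5 ** k and the expected gap between size-k steps is 2 ** k.
print(round(1 / 0.5 ** 12))   # one octave: 4096, i.e. "once every ~4000 steps"
print(round(1 / 0.5 ** 24))   # two octaves: 16777216, i.e. ~16 million steps
```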
<p>By contrast, we can also use a “powerlaw” model for step sizes. Here, the
probability of having a step of a given size is proportional to a power of the
size of the step. This is harder to do in our heads, so I worked out these
numbers for us: if 50% of steps are of size 1, then only 15% of steps are of
size 2, and 7% are of size 3. But, if we go out to larger steps, a one-octave
jump will occur once every 146 steps, and a two octave jump will occur once
every 485 steps! So, compared to the geometric model, the powerlaw model allows
large jumps to happen much more frequently, with relatively fewer medium-sized
jumps. Both of these models have only one parameter, describing the scale on
which they operate.</p>
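<p>The powerlaw figures can be roughly reproduced by pinning the Zipf exponent with the 50%/15% figures above. This is a sketch of my own; the exact once-every-146 and once-every-485 values in the text presumably come from a slightly different exponent, so this rough check lands close but not identical:</p>

```python
import math

# Choose the Zipf exponent s so that P(2) / P(1) = 2 ** (-s) = 0.15 / 0.50.
s = -math.log(0.15 / 0.50) / math.log(2)   # ≈ 1.74

for k in (2, 3, 12, 24):
    p_k = 0.5 * k ** -s                    # anchored at P(1) = 50%
    print(f"step of {k}: about 1 in {1 / p_k:.0f}")
```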
<p>Here is a simulation of what a bumblebee’s flight path might be under each
model.</p>
<div>
<figure>
<center><img src="/res/bumblebee/random-processes-2d.png" /></center>
<figcaption class="imagecaption"><p>Example 2-dimensional flight trajectories from a geometric or powerlaw random walk model.</p>
</figcaption>
</figure>
</div>
<p>As you can see, the powerlaw model involves several large jumps, whereas the
geometric model doesn’t. You can compare this to some <a href="https://doi.org/10.1371/journal.pone.0078681.s003">example bumblebee
flight trajectories measured by Juliet Osborne and
colleagues</a>, which show
patterns which more closely resemble the powerlaw model than the geometric
model.</p>
<p>While bumblebees fly in three dimensions, our melody line is only measured in
one dimension. So, we need to convert these models into one dimension in order
to compare them to Flight of the Bumblebee. First, let’s simulate these models
and compare them to the actual melody line. These simulations try to match the
qualitative character of the jumps of the melody line, rather than the exact
“flight path” of the melody line. Here are example flight paths from the
one-dimensional versions of these models, as well as the one derived from the
melody of Flight of the Bumblebee:</p>
<div>
<figure>
<center><img src="/res/bumblebee/random-processes-1d.png" /></center>
<figcaption class="imagecaption"><p>Example
1-dimensional “flight” trajectories from a geometric or powerlaw random walk
model, compared to the melody line from Flight of the Bumblebee.</p>
</figcaption>
</figure>
</div>
<p>The melody line appears to have a more similar “jumpiness” to the powerlaw model
than the geometric model. We can formalise this by fitting the models directly
to the jump sizes in the melody line (through maximum likelihood, see the appendix
on methods for details), and then evaluating the fit to see which model is more
likely given the data. When we do so, we find that the powerlaw model is \(1.5
\times 10^{72}\) times more likely than the geometric model!</p>
<div>
<figure>
<center><img src="/res/bumblebee/model-comparison.png" /></center>
<figcaption class="imagecaption"><p>Comparison of geometric and powerlaw model.</p>
</figcaption>
</figure>
</div>
<p>This gives us very, very strong evidence that the Flight of the Bumblebee
follows a pattern involving small steps and large jumps, over a pattern with a
more balanced distribution of small and medium-sized jumps.</p>
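<p>The comparison itself is conceptually simple. Below is a self-contained sketch on toy data, using a coarse grid search of my own construction rather than the author's script (which fits the real melody with a numerical minimiser):</p>

```python
import math

def geometric_loglik(steps, p):
    """Log-likelihood of step sizes under P(k) = p * (1 - p)**(k - 1)."""
    return sum(math.log(p) + (k - 1) * math.log(1 - p) for k in steps)

def powerlaw_loglik(steps, s, kmax=1000):
    """Log-likelihood under a truncated Zipf distribution, P(k) ∝ k**(-s)."""
    log_z = math.log(sum(k ** -s for k in range(1, kmax + 1)))
    return sum(-s * math.log(k) - log_z for k in steps)

def best_fit(loglik, steps, grid):
    """Maximum likelihood over a coarse parameter grid."""
    return max(loglik(steps, v) for v in grid)

# Toy "melody": mostly chromatic steps, plus a few one- and two-octave jumps.
steps = [1] * 80 + [2] * 10 + [12] * 5 + [24] * 2
ll_geo = best_fit(geometric_loglik, steps, [i / 100 for i in range(1, 100)])
ll_pow = best_fit(powerlaw_loglik, steps, [1 + i / 50 for i in range(1, 100)])
bayes_factor = math.exp(ll_pow - ll_geo)   # > 1 favours the powerlaw model
print(f"powerlaw is {bayes_factor:.3g}x more likely on this toy data")
```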
<h3 id="control-analysis">Control analysis</h3>
<p>Flight of the Bumblebee is based on the chromatic scale, and the chromatic scale
contains lots of small intervals. Is it possible that this correspondence to
the powerlaw model is just due to the extensive use of the chromatic scale? To
test this, we can perform the same analysis on the other<sup id="fnref:g" role="doc-noteref"><a href="#fn:g" class="footnote" rel="footnote">7</a></sup> highly-chromatic
piece of classical music widely known in pop culture: Entry of the Gladiators by
Fučík (i.e., <a href="https://www.youtube.com/watch?v=9ZM-HZDZTc0">the circus song</a>).
Entry of the Gladiators also contains lots of chromatic passages and several
large jumps. However, in stark contrast to Flight of the Bumblebee, given the
melody line in Entry of the Gladiators, the geometric model was 1.8 times more
likely than the powerlaw model.</p>
<div>
<figure>
<center><img src="/res/bumblebee/model-comparison-gladiators.png" /></center>
<figcaption class="imagecaption"><p>Comparison of geometric and powerlaw model on Entry of the Gladiators.</p>
</figcaption>
</figure>
</div>
<p>This means that not all music based on the chromatic scale follows powerlaw step
sizes.</p>
<p>At first, it may come as a surprise that Entry of the Gladiators isn’t better
fit by the powerlaw model. Like Flight of the Bumblebee, it also has lots of small
intervals and lots of large jumps. In fact, it has far more large jumps than
Flight of the Bumblebee. The reason it is not a better fit is that Entry of
the Gladiators also contains several intermediate-sized jumps. These
intermediate-sized jumps aren’t predicted by the powerlaw model or by models of
bumblebee flight. Flight of the Bumblebee contains almost exclusively chromatic
steps and large jumps, which makes the powerlaw model a better fit. This means
that Flight of the Bumblebee bears more of a mathematical resemblance to the
behavioural patterns of bumblebee flight than this other highly-chromatic piece.</p>
<h2 id="what-did-rimsky-korsakov-intend-to-write">What did Rimsky-Korsakov intend to write?</h2>
<p>These analyses raise an important question: is there a music theory explanation
for including large jumps beyond the imagery of bumblebee flight? Likewise,
which aspects of the piece are intended to invoke the imagery and which are
incorporated for other musical or artistic purposes?</p>
<p>We know that Rimsky-Korsakov didn’t explicitly implement these mathematical
models in his music. Not only was there scarce knowledge about insect behaviour
when the Tale of Tsar Saltan premiered in November 1900, but the mathematics on
which the models are based hadn’t even been invented yet! In order to make a
judgement about step sizes in the melody line, we first must understand what
musical and artistic aspects influenced the melody line. Let’s explore two
here: the hero’s main theme, and the use of the whole-tone scale.</p>
<h3 id="heros-main-theme">Hero’s main theme</h3>
<p>One important component of the melody line is the hero Prince Gvidon’s theme<sup id="fnref:h" role="doc-noteref"><a href="#fn:h" class="footnote" rel="footnote">8</a></sup>
(<a href="https://en.wikipedia.org/wiki/Leitmotif">leitmotif</a>) within the opera. This
theme comes from the folksong “Заинька Попляши” (“Zainka Poplyashi”, which
roughly translates to “Dance, bunny, dance!”), a song with which Rimsky-Korsakov was
<a href="https://imslp.org/wiki/Collection_of_100_Russian_Folksongs%2C_Op.24_(Rimsky-Korsakov%2C_Nikolay)">intimately
familiar</a>.
The theme is:</p>
<div>
<figure>
<center><img src="/res/bumblebee/gvidon.png" /></center>
<figcaption class="imagecaption"><p>Prince Gvidon’s leitmotif.</p>
</figcaption>
</figure>
</div>
<p>As you can see below, the character’s theme is clearly represented in the main
bumblebee melody line, as indicated by red note heads:</p>
<div>
<figure>
<center><img src="/res/bumblebee/leitmotif_melody.png" /></center>
<figcaption class="imagecaption"><p>Melody line of Flight of the Bumblebee with highlighted leitmotif.</p>
</figcaption>
</figure>
</div>
<p>What this means is that one of the most common jumps, the perfect 4th,
can be “explained” by this theme. Since the frequency and
size of jumps is critical in our mathematical analysis, we can repeat the
analysis while ignoring these specific jumps in the melody line. As we see, doing so
doesn’t ruin the mathematical effect described in the previous section: the
powerlaw model is \(1.8 \times 10^{65}\) times more likely, given the melody
line.</p>
<div>
<figure>
<center><img src="/res/bumblebee/model-comparison-nofourth.png" /></center>
<figcaption class="imagecaption"><p>Comparison of geometric and powerlaw model.</p>
</figcaption>
</figure>
</div>
<p>So, the use of perfect 4th jumps to capture Gvidon’s leitmotif isn’t a major
factor in what makes this melody resemble bumblebee flight.</p>
<h3 id="whole-tone-scale">Whole-tone scale</h3>
<p>The piece also has a close connection with the whole-tone scale. While the
whole-tone scale is most closely associated today with the French impressionists
like Debussy and Ravel, it was actually used much earlier by Rimsky-Korsakov’s
contemporaries as a building block of Russian nationalistic music. In this
school of music, the whole-tone scale is used to represent the magical, the
regal, the ominous, and the surreal.</p>
<p>Given the magical nature of a human transforming into a bumblebee, it may come
as no surprise that the whole-tone scale plays a prominent role in Flight of the
Bumblebee. Recall that there are only two whole-tone scales, which contain no
notes in common.</p>
<div>
<figure>
<center><img src="/res/bumblebee/whole_tone.png" /></center>
<figcaption class="imagecaption"><p>The two whole-tone scales.</p>
</figcaption>
</figure>
</div>
<p>Since the scales contain no notes in common, we can classify any given note as
belonging to one of the two whole-tone scales.</p>
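<p>Conveniently, the two whole-tone scales partition the twelve pitch classes by parity: pitches an even number of semitones from C belong to the C♮ scale, and those an odd number away belong to the D♭ scale. A one-line classifier (my own sketch):</p>

```python
def whole_tone_scale(midi_pitch):
    """Classify a pitch into the C-natural or D-flat whole-tone scale."""
    return "C" if midi_pitch % 2 == 0 else "Db"

# MIDI 60 = C4, 61 = Db4, 69 = A4
print([whole_tone_scale(p) for p in (60, 61, 69)])  # → ['C', 'Db', 'Db']
```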
<p>In Flight of the Bumblebee, almost all of the pitches on the eighth note beats
fall into the C♮ whole-tone scale, and all of the notes on the off-beats fall
into the D♭ whole-tone scale. This trend is only violated five times. In three
cases, the piece modulates from A minor to D minor (the subdominant), when the
whole-tone scales switch. One violation is to make the melody line align
properly at the repeat<sup id="fnref:i" role="doc-noteref"><a href="#fn:i" class="footnote" rel="footnote">9</a></sup>, and the final violation is to make sure the final
note of the piece falls on A, the tonic.</p>
<p>We can visualise this by plotting each on-beat in the melody line of the piece.
The x axis indicates the time at which each note is played, and the y axis
indicates which whole-tone scale the note comes from. If we show the entire
piece in one plot, the on-beats look like this:</p>
<div>
<figure>
<center><img src="/res/bumblebee/whole-tone-melody.png" /></center>
<figcaption class="imagecaption"><p>The whole-tone scale associated with each on-beat in Flight of the Bumblebee. Individual points appear as lines because they are very close together in time.</p>
</figcaption>
</figure>
</div>
<p>And the off-beats look like this:</p>
<div>
<figure>
<center><img src="/res/bumblebee/whole-tone-melody-off.png" /></center>
<figcaption class="imagecaption"><p>The whole-tone scale associated with each off-beat in Flight of the Bumblebee. Individual points appear as lines because they are very close together in time.</p>
</figcaption>
</figure>
</div>
<p>There aren’t very many switches between whole-tone scales, and those that do
occur have a clear musical purpose. This appears to be a deliberate use of the
whole-tone scale in the piece.</p>
<p>An objection one might make to this is that the whole-tone scale is by necessity
connected with the chromatic scale. Since Flight of the Bumblebee uses the
chromatic scale, this property of the whole-tone scale might arise naturally.</p>
<p>To show this objection doesn’t apply, we can perform the same analysis on Entry
of the Gladiators. Here, unlike in Flight of the Bumblebee, we see no
deliberate use of whole-tone scale for the on-beats:</p>
<div>
<figure>
<center><img src="/res/bumblebee/whole-tone-melody-gladiators.png" /></center>
<figcaption class="imagecaption"><p>The whole-tone scale associated with each on-beat in Entry of the Gladiators.</p>
</figcaption>
</figure>
</div>
<p>Or for off-beats:</p>
<div>
<figure>
<center><img src="/res/bumblebee/whole-tone-melody-gladiators-offbeats.png" /></center>
<figcaption class="imagecaption"><p>The whole-tone scale associated with each off-beat in Entry of the Gladiators.</p>
</figcaption>
</figure>
</div>
<p>This means that the whole-tone scale seems to have been deliberately used by
Rimsky-Korsakov, but not by Fučík, in constructing the melody line.</p>
<p>Interestingly, this pattern makes it more “difficult” to write a melody line
which includes large jumps. This is because large jumps can only be included if
they sound nice. It is musically “easy” to maintain the whole-tone scale
pattern while making medium-sized jumps, because most medium-sized jumps sound
pleasant in many different contexts. All intervals from minor 2nds to major
6ths are extremely common across a wide range of musical genres. By contrast,
it is more difficult to make large jumps sound nice. In most pieces, large
jumps often occur at octave, 9th, 10th, or flat 7th intervals, all of which
would violate the alternating whole-tone scale pattern. This means that the use
of the whole-tone scale in the melody line actually makes it <em>more difficult</em> to
have large jumps. So, because of the whole-tone scale pattern, the presence of
large jumps in Flight of the Bumblebee is even more surprising than the
mathematical analysis suggests.</p>
<p>Rimsky-Korsakov also left out some medium-sized jumps which fit the whole-tone
scale pattern. Another “valid” jump under the whole-tone pattern, and one which
is extremely common in other pieces, is the perfect 5th. However, the perfect
5th, a medium-sized jump, only occurs once in the melody line of Flight of the
Bumblebee. Two similarly “easy-to-use” intervals which satisfy the whole-tone
pattern, the minor 3rd and the major 6th, don’t occur at all. The only
medium-sized interval that Rimsky-Korsakov uses is the perfect 4th, which we
already showed was a result of Prince Gvidon’s theme. So, the use of large
jumps and not medium-sized jumps appears to be a deliberate choice by the
composer.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Flight of the Bumblebee is a much more musically complex piece than it initially
seems. Rimsky-Korsakov appears to have deliberately mimicked an important
property of bumblebee flight within his music. Mathematical models to describe
bumblebee flight, invented long after Rimsky-Korsakov’s time, end up providing
an excellent fit to his melody line. On top of this, he incorporated
interesting features from a music theory perspective, including the main
character’s theme and a whole-tone scale pattern. These musical features don’t
explain the melody line’s resemblance to bumblebee flight. While we will never
know if Rimsky-Korsakov actually observed bumblebee flight while composing
Flight of the Bumblebee, it sure is fun to speculate<sup id="fnref:j" role="doc-noteref"><a href="#fn:j" class="footnote" rel="footnote">10</a></sup>, isn’t it?</p>
<h3 id="appendix-methods">Appendix: Methods</h3>
<p>To find the melodic pattern, I downloaded <a href="https://musescore.com/nicolas/scores/437">Nicolas Froment’s engraving of the
Rachmaninoff piano arrangement</a> and
cut out everything except the melody line. Then, I cross-referenced this to the
<a href="https://imslp.org/wiki/File:PMLP3170-Rimsky_Saltan_Score.pdf">score from the original
opera</a>, page
262-267, to tweak the melody line to ensure it is faithful to that of the opera
rather than the piano arrangement, correcting for octaves, breaks, repeats, etc.
I exported this as MIDI (attached below) and analysed it using a Python script
(attached below). Likewise, I used <a href="https://musescore.com/james_brigham/scores/1243801">James Brigham’s engraving of the piano
reduction of Entry of the
Gladiators</a> and exported to
MIDI (attached).</p>
<p>I only used sections of Flight of the Bumblebee which were part of the rapid,
recognisable melody, concatenating across breaks. Likewise, I excluded the trio
section of Entry of the Gladiators since it isn’t based on chromatics. Repeated
notes were excluded since a jump of zero in a discrete random walk doesn’t
correspond to any kind of step in a continuous Wiener or Lévy process. I
adjusted the octave down in two cases (penalising the powerlaw model): once for
a two-octave jump in the final run of Flight of the Bumblebee, since it was an
artistic flourish for the finale; and once for a two-octave jump when the melody
line switches from treble to bass in Entry of the Gladiators.</p>
<p>Model fitting was performed by finding the pairwise distances between
neighbouring points in each timeseries, and then fitting the counts of each to a
geometric or Zipf distribution through maximum likelihood. A numerical
minimisation routine was used to find parameters which maximised the likelihood
function.</p>
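<p>For the geometric distribution the maximum-likelihood parameter actually has a closed form, so only the Zipf exponent needs a numerical search. Here is a sketch of that fitting step, my own reconstruction using a simple ternary search rather than whatever minimiser the attached script uses:</p>

```python
import math

def fit_geometric(steps):
    """Closed-form MLE for the geometric model: p_hat = n / sum(steps)."""
    return len(steps) / sum(steps)

def fit_zipf_exponent(steps, lo=1.01, hi=10.0, iters=60, kmax=2000):
    """Ternary search on the (convex) negative log-likelihood of a Zipf model."""
    log_sum = sum(math.log(k) for k in steps)
    def negloglik(s):
        z = sum(k ** -s for k in range(1, kmax + 1))   # truncated normaliser
        return s * log_sum + len(steps) * math.log(z)
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if negloglik(m1) < negloglik(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2
```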
<p>The music theory analysis was partially my own (whole-tone scale analysis) and
partially based on work from <a href="https://www.gutenberg.org/files/46587/46587-h/46587-h.htm">Rosa
Newmarch</a>, <a href="https://blog.mymusictheory.com/2009/flight-of-the-bumble-bee/">Victoria
Williams</a> and
<a href="https://helda.helsinki.fi/handle/10138/41117">John Nelson</a> (historical and
motivic analysis). Thank you to Sophie Westacott for helpful comments, and to
the giant bumblebee who regularly hovers outside my window and then darts away
for inspiration.</p>
<h3>Code/Data:</h3>
<ul>
<li><a href="/res/bumblebee/fit_models.py">Analysis script</a></li>
<li><a href="/res/bumblebee/make_plots.py">Script to generate plots</a></li>
<li><a href="/res/bumblebee/flight_melody.mid">Flight of the Bumblebee MIDI melody</a></li>
<li><a href="/res/bumblebee/gladiators_melody.mid">Entry of the Gladiators MIDI melody</a></li>
</ul>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:a" role="doc-endnote">
<p>contrary to Betteridge’s law <a href="#fnref:a" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:b" role="doc-endnote">
<p>In the Pushkin story, he was also transformed into a mosquito and a fly to
see his father and sting his aunts, for a total of three trips back. <a href="#fnref:b" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:c" role="doc-endnote">
<p>Due to the piece’s broad recognisability today, many orchestras insert
Flight of the Bumblebee into the suite anyway. <a href="#fnref:c" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:d" role="doc-endnote">
<p>Contrary to claims from some sources, it didn’t appear in Charlie Chaplin’s
1925 film “The Gold Rush”; instead, it appeared in the 1942 version, which used
different music. <a href="#fnref:d" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:e" role="doc-endnote">
<p>Similarly, when humans seek food, it is occasionally wise to take a large
step and go to the supermarket instead of always looking inside the
refrigerator. <a href="#fnref:e" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:f" role="doc-endnote">
<p>As a technical note: ideally, we would want to consider the difference
between a Wiener process (Brownian motion) and a Lévy process, which is
similar to Brownian motion except the steps are powerlaw-distributed. Since
our melody line is discrete in pitch (i.e. there are only 12 notes per
octave), we must use a model with discrete steps in order to get a
likelihood which makes sense. So, we use the geometric distribution and the
discrete powerlaw distribution (zeta or Zipf distribution). We use the
geometric distribution as a stand-in for the normal distribution, since
there is no well-known discrete half-normal–like distribution with support
on all the natural numbers. <a href="#fnref:f" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:g" role="doc-endnote">
<p>The use of “the” was deliberate. I don’t think there are any other pieces
of classical music which are widely known in pop culture and rely so heavily
on chromatics. The internet doesn’t seem to think so either, but please
correct me if I am wrong! (Habanera from Carmen doesn’t count, since it
only embeds a descending chromatic scale within a surrounding
not-so-chromatic melody.) <a href="#fnref:g" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:h" role="doc-endnote">
<p>Prince Gvidon has two main leitmotifs in the opera, but only the one shown
here participates in the melody line. The other one also appears in Flight
of the Bumblebee but not within the melody line: it is the descending and
ascending staccato line which appears several times in the accompaniment. <a href="#fnref:h" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:i" role="doc-endnote">
<p>The repeat is only found in the original opera score, not in the popular
piano reduction. <a href="#fnref:i" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:j" role="doc-endnote">
<p>Speaking of speculation, here is a pretty big leap: Rimsky-Korsakov is known
to have had synesthesia, associating colours with different keys. While
there is no documented evidence of Rimsky-Korsakov’s associations with the
minor keys, there are descriptions of his associations with the major keys.
While I was unable to find the primary source describing Rimsky-Korsakov’s
colour-key synesthesia, I do believe there to be a primary source in
Russian, because several of the secondary sources use different translations
of the colour names.</p>
<p>The parallel and relative major keys actually line up well with bumblebees.
Flight of the Bumblebee is written in A minor, and modulates to D and G minor.
In Rimsky-Korsakov’s classification, the parallel major keys, A, D, and G major,
correspond to rose, yellow and gold, respectively. These colours evoke the
imagery of a yellow-gold bumblebee flying to bright flowers. Additionally, the
relative major keys of these minor keys (C major, F major, and B♭
major) correspond to green and white, with no documented correspondence for B♭
major. These associations, while interesting, are probably a coincidence,
especially without evidence about his colour associations with the minor keys. <a href="#fnref:j" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p><em>Posted 4 June 2022. Tags: math, opera, stochastic-processes, modeling, stats, music.</em></p>

<h1>Are buffets efficient?</h1>
<p>I recently attended an academic conference and was struck by the
inefficiency of the buffet-style dinner. The conference had
approximately 500 attendees, and dinner was scheduled for 6:30 PM
following a 15 minute break in the conference program. A 15 minute
break is not enough time to go back to the hotel room and take a nap,
and barely enough time to find a quiet corner and get some work done,
but it is a perfect amount of time to crowd around the buffet waiting
for dinner to open.</p>
<p>When dinner finally opened, all of the food was arranged on a single
table, inviting the attendees to form a single line to serve
themselves. As you can imagine, this line was quite long. I was
lucky enough to be one of the first people through the line, but by
the time I finished eating a half hour later, the line was still very
long. I had been waiting for someone towards the middle to end of
the line, and that person still had not made it through.</p>
<p>I found it odd that feeding people should be so slow. From my
experiences in undergraduate dining halls, it is possible to feed more
people in a shorter amount of time. A key difference between these
situations is that in undergraduate dining halls, food is often served
at individual stations, meaning you only need to wait in line for food
you want to eat. By contrast, in the catering style, it is often
served all at one table, with diners waiting in a single line and
accessing the dishes one by one. I wanted to examine the efficiency of
these two systems. This is important not only for minimizing the mean
wait time so that everyone gets their food faster, but also for
minimizing inequality between people at the front and end of the line.
This ensures that everyone has the opportunity to dine together.</p>
<h2 id="model">Model</h2>
<p>I modeled this situation as a single line in a buffet versus
individual lines for each different dish. I made the following
assumptions:</p>
<ul>
<li>Some people may be faster or slower at serving themselves.</li>
<li>Some foods can be served faster than other foods.</li>
<li>People may not want all of the food which is being served, and each
person wants a different random selection of foods.</li>
<li>Only one person can serve themself a particular dish at one time.</li>
<li>When dishes have their own individual lines, people look at the
lines for the foods they want to eat and stand in the shortest line
next.</li>
<li>People already know what the options are and where they are located.</li>
</ul>
<p>I calculated the wait time for each person in the simulation as the
total amount of time it took a person to pass through the buffet. I
then looked at the mean wait time for the group as well as the
inequality in wait time for people in the group, defined as the third
quartile minus the first quartile. Each simulation was run many times
to ensure accurate statistics.</p>
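<p>To make the model concrete, here is a minimal Python sketch of a simulation along these lines. The function names, parameter values, and the fully serial treatment of the single line are my own simplifications, and will differ from the analysis script linked at the end of the post.</p>

```python
import random
import statistics

def simulate(n_people=100, n_dishes=6, want_prob=0.8, single_line=True, seed=0):
    """Toy buffet model: returns the wait time for each person."""
    rng = random.Random(seed)
    dish_time = [rng.uniform(5, 15) for _ in range(n_dishes)]  # seconds per serving
    people = []
    for _ in range(n_people):
        speed = rng.uniform(0.5, 1.5)  # personal serving-speed factor
        wants = [d for d in range(n_dishes) if rng.random() < want_prob]
        people.append((speed, wants))
    waits = []
    if single_line:
        # One queue past one table: each person starts only after the
        # person ahead has passed the whole table (a serial simplification).
        t = 0.0
        for speed, wants in people:
            t += speed * sum(dish_time[d] for d in wants)
            waits.append(t)
    else:
        # One queue per dish: each person repeatedly joins the shortest
        # remaining line (earliest free time) among the dishes they want.
        free = [0.0] * n_dishes
        for speed, wants in people:
            t, todo = 0.0, set(wants)
            while todo:
                d = min(todo, key=lambda d: free[d])  # pick the shortest line
                t = free[d] = max(t, free[d]) + speed * dish_time[d]
                todo.remove(d)
            waits.append(t)
    return waits

single = simulate(single_line=True)
separate = simulate(single_line=False)
mean_single = statistics.mean(single)
mean_separate = statistics.mean(separate)
q1, _, q3 = statistics.quantiles(single, n=4)
inequality_single = q3 - q1  # the inequality measure used in the text
```

Comparing <code class="language-plaintext highlighter-rouge">mean_single</code> with <code class="language-plaintext highlighter-rouge">mean_separate</code>, and the corresponding interquartile ranges, gives the two statistics plotted in the figures below.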
<h2 id="number-of-dishes-wanted">Number of dishes wanted</h2>
<p>Let us suppose that not everyone wants the same number of dishes, but
that all dishes are equally attractive. First we look at the case
when there are a limited number of dishes.</p>
<div>
<figure>
<center><img src="/res/buffet/wait-ineq-few-dishes.png" /></center>
<figcaption class="imagecaption"><p>100 people serving themselves in a buffet with 6 dishes</p>
</figcaption>
</figure>
</div>
<p>We see that when there aren’t very many dishes and most people want
all of them, it is faster to have a single line. This may be
counter-intuitive, but it is due to the fact that people do not
optimally distribute themselves, but instead choose the shortest line.
Suppose for example that one dish is much slower to serve than all of
the others. People who choose this food last will have to wait
approximately the same amount of time as they would have if there was
a single line and they ended up at the end, because this dish serves
as the bottleneck. However, the people who are at the front of this
line will still need to wait in more lines for the other dishes,
because other people tried to serve themselves these dishes first. As
a result, having multiple lines can sometimes increase the amount of
time for the fastest people and not decrease the amount of time for
the slowest people.</p>
<p>Additionally, there is a large inequality in wait times, i.e. some
people will get through the line quickly, while others will be stuck
in line for a long time. This is the case for both serving styles,
but is especially pronounced for the case with multiple lines.</p>
<p>Let us also examine the case when there are many dishes to choose
from.</p>
<div>
<figure>
<center><img src="/res/buffet/wait-ineq-many-dishes.png" /></center>
<figcaption class="imagecaption"><p>100 people serving themselves in a buffet with 20 dishes</p>
</figcaption>
</figure>
</div>
<p>When there are many dishes to choose from (here 20), no matter how
many dishes people may want (within reason), individual lines reduce
both the mean wait time and the inequality in wait times compared to a
single line. Intuitively, this is because people can distribute
themselves and they only have to wait for the dishes they want to eat.</p>
<p>Additionally, let’s look at the case when there are many people.</p>
<div>
<figure>
<center><img src="/res/buffet/wait-ineq-many-people.png" /></center>
<figcaption class="imagecaption"><p>500 people serving themselves in a buffet with 6 dishes</p>
</figcaption>
</figure>
</div>
<p>In this case, we see the counter-intuitive result again: the mean wait
time is quite a bit higher for individual lines when most people want
most of the foods; however, inequality is still lower.</p>
<p>Finally, we can examine the case when there are very few people.</p>
<div>
<figure>
<center><img src="/res/buffet/wait-ineq-veryfew-people.png" /></center>
<figcaption class="imagecaption"><p>30 people serving themselves in a buffet with 6 dishes</p>
</figcaption>
</figure>
</div>
<p>In this case, separate lines are better for both mean wait time and
equality.</p>
<h2 id="fairness">Fairness</h2>
<p>A fair system is one in which the amount of time someone waits is
proportional to the number of dishes they want. In an unfair
scenario, someone who only wants one dish must wait for the same
amount of time as someone who wants all of the dishes.</p>
<p>Let’s look at whether this form of fairness holds. First, we look at
how long someone must wait depending on how many dishes they want.</p>
<div>
<figure>
<center><img src="/res/buffet/fairness.png" /></center>
<figcaption class="imagecaption"><p>Average wait time differs depending on how many dishes a person would like</p>
</figcaption>
</figure>
</div>
<p>As expected, when there is only one line, everyone must wait for
approximately the same amount of time, no matter how much food they
want to eat. Relative to the amount of food they take, people who want
all of the dishes wait comparatively little, while someone who only
wants one dish must still wait a very long time. By contrast, when
there are multiple lines, the amount of time people wait is
proportional to the number of dishes they want to try.</p>
<p>Similarly, it might be fairer if someone who can serve themself
quickly had a shorter waiting time than someone who is slower.</p>
<div>
<figure>
<center><img src="/res/buffet/speed-vs-time.png" /></center>
<figcaption class="imagecaption"><p>Points
represent people. There is no significant correlation (\(p>.2\))
between time spent waiting and serving speed</p>
</figcaption>
</figure>
</div>
<p>Unfortunately this does not seem to be the case in either system.
Rather, people who are slow to serve themselves take approximately the
same amount of time in line as those who are fast.</p>
<h2 id="summary-and-conclusions">Summary and conclusions</h2>
<p>In summary, when there are a lot of people present, if everyone wants
most of the food at the buffet, a single line counter-intuitively
reduces the mean wait time. However, this single line substantially
increases inequality in wait times, meaning that some people will have
to wait for a long time while others can go through immediately.
Additionally, people who only want a small amount of food must wait a
long time to serve themselves. A more fair but slightly less
efficient system is one where there is a separate station for each
dish, but this can be inefficient when most people want most of the
dishes available.</p>
<p>This analysis leaves out a few factors which are difficult to account
for. For example, it assumes the amount of time taken to walk from
one food to another is negligible, and that people know <em>a priori</em>
what food they would like to eat and where it is located. Both of
these have the potential to slow down serving times in the case with
separate lines. This analysis also doesn’t account for several other
factors which are important in real life. For example, it assumes
that space is not an issue. It also assumes there is enough seating
to accommodate everybody; if only a limited amount of seating is
available, a high inequality is desirable as it prevents everyone from
going to the dining area at one time.</p>
<p>One method which is often employed to speed up single lines is having
more than one identical line, or two sides of the same line. Likewise,
in the case of separate lines, there are sometimes “stations” which
have identical dishes. In both cases, because we assume people balance
themselves by going to the shortest line, doubling the number of
copies of all dishes would be expected to approximately cut the mean
wait time in half.</p>
<h3>Code/Data:</h3>
<ul>
<li><a href="/res/buffet/buffet_model.py">Data analysis script</a></li>
</ul>
Sat, 02 Mar 2019 00:00:00 +0000
/2019/03/02/are-buffets-efficient.html

Optimality in card shuffling
<p>Many powerful minds have devoted countless decades to the academic
study of card games. However, relatively little attention has been
given to the best way to shuffle a deck of cards. There are several
different methods that people have developed to shuffle cards:</p>
<ul>
<li><a href="https://www.youtube.com/watch?v=O8W9ipOSmWA">The riffle shuffle</a>:
The classic shuffling method. Cut the deck approximately in half.
Take one half in each hand, and let the cards from both decks fall
on top of one another.</li>
<li><a href="https://www.youtube.com/watch?v=x5tLNHuvf6s">The overhand shuffle</a>:
Another very popular method. Hold the cards in one hand, and take
the cards with your other hand, and let them fall on top.</li>
<li><a href="https://www.youtube.com/watch?v=0ZXhPWkro9A">The Hindu shuffle</a>: A
method popular in India. It is a different style of performing an
overhand shuffle.</li>
<li><a href="https://youtu.be/pnULsmjX4VA?t=26">The pile shuffle</a>: Create
several sub-decks of cards, by placing cards one-by-one into a
(potentially random) sub-deck. Then, combine the sub-decks together.</li>
</ul>
<p><a href="http://www.ams.org/samplings/feature-column/fcarc-shuffle">It has been claimed</a>
that seven, eight, or more riffle shuffles are necessary to obtain
complete randomization. However, these studies assume that any
difference in probability between a shuffled deck and a fully random
permutation can be exploited by the players. This is an important
model for casinos where large amounts of money are at stake, but for
games between friends where perfect randomness is not needed, seven or
eight shuffles in between hands causes a substantial delay in the
game.</p>
<p>Thus, below I describe how many shuffles you need <em>in practice</em>
instead of <em>in theory</em>. My evaluation looks for patterns and
irregularities in hands that would be dealt. It accounts for three
types of patterns—suits, ranks, and clusters/straights—by looking
at the joint distribution of these frequencies compared to a null
distribution. (For instance, a six card hand containing four of one
suit and two of another is highly unlikely.) I simulate a number of
different shuffling methods and find the probability of obtaining the
arrangements generated by these shuffles in a truly random deck.</p>
<p>Each of the shuffles starts with either fully ordered decks (i.e. a
new deck of cards), or else with cards in an order representative of a
specific card game. I compare these to a deck which started out fully
randomized before the shuffle as a control, which is therefore
guaranteed to be fully randomized after the shuffle. All shuffles
should be compared to the randomized deck, i.e. the red line in the
plots.</p>
<div>
<figure>
<center><img src="/res/shuffle/legend.png" /></center>
<figcaption class="imagecaption"><p>Card game deck patterns used in all figures which follow.</p>
</figcaption>
</figure>
</div>
<h2 id="riffle-shuffle">Riffle shuffle</h2>
<p>The riffle shuffle is arguably the most popular shuffling method. The
standard accepted mathematical model for the riffle shuffle is called
the
<a href="https://en.wikipedia.org/wiki/Gilbert%E2%80%93Shannon%E2%80%93Reeds_model">Gilbert-Shannon-Reeds model</a>
(GSR model). (Yes, that’s
<a href="https://en.wikipedia.org/wiki/Claude_Shannon">Claude Shannon</a>.) This
model assumes that the shuffler is an expert and can dynamically
adjust the rate at which cards are laid down from each hand during the
shuffle. It states that the probability of putting down a card from
either hand is proportional to the relative size of the deck in that
hand. For example, if your left hand had 30 cards and your right hand
had 10, you would have a 3/4 probability of the next card coming from
your left hand.</p>
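<p>The GSR model is straightforward to simulate. The sketch below is my own illustration of the model described above, not the code used to generate the figures:</p>

```python
import random

def gsr_riffle(deck, rng=random):
    """One riffle shuffle under the Gilbert-Shannon-Reeds model."""
    # Cut the deck at a binomially distributed point near the middle.
    cut = sum(rng.random() < 0.5 for _ in deck)
    left, right = deck[:cut], deck[cut:]
    shuffled, i, j = [], 0, 0
    while i < len(left) or j < len(right):
        n_left, n_right = len(left) - i, len(right) - j
        # Drop the next card from a hand with probability proportional
        # to the number of cards remaining in that hand.
        if rng.random() < n_left / (n_left + n_right):
            shuffled.append(left[i])
            i += 1
        else:
            shuffled.append(right[j])
            j += 1
    return shuffled
```

Note that each half keeps its internal order, so a single riffle only interleaves the two halves; repeated shuffles are what produce randomness.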
<p>Hesitant to accept this definition, I developed another model which
includes a skill parameter \(\lambda\). High \(\lambda\) means that
there is a low probability of two consecutive cards being dealt from
the same hand, whereas low \(\lambda\) means there will be large
bunches of cards from the same hand in the deck. This is motivated by
the fact that, given the speed at which cards are shuffled, most
people do not have the reaction time necessary to adjust the rate at
which they let cards fall from either hand, and also by the fact that
casino dealers come close to alternating cards from each hand.</p>
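<p>One way to sketch this skill-parameter model in code (my own illustrative reading of it, which may differ from the model as fit to the data):</p>

```python
import random

def skill_riffle(deck, lam=0.45, rng=random):
    """One riffle shuffle with skill parameter lam: after each card is
    dropped, the shuffler switches hands with probability lam, so high
    lam approaches perfect alternation and low lam gives long runs."""
    cut = len(deck) // 2
    hands = [list(deck[:cut]), list(deck[cut:])]
    current = rng.random() < 0.5  # which hand drops the first card
    shuffled = []
    while hands[0] or hands[1]:
        if not hands[current]:     # current hand is empty;
            current = not current  # finish with the other hand
        shuffled.append(hands[current].pop(0))
        if rng.random() < lam:
            current = not current
    return shuffled
```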
<p>In order to test these models, I collected data from my own shuffles
to determine the most likely model. I found that my model was
slightly more likely given the data, but if you account for the fact
that my model has a parameter whereas the GSR model has none, the GSR
model is a slightly better fit to the data.</p>
<p>Because they were close, I decided to test both cases, and also to
vary the skill level \(\lambda\). I tested four cases: a novice
shuffler (\(\lambda=0.3\)), an average shuffler (\(\lambda=0.45\), the
best fit to my shuffling data), an expert shuffler (\(\lambda=0.8\)),
and the ideal GSR case (which has no parameter).</p>
<div>
<figure>
<center><img src="/res/shuffle/riffle.png" /></center>
<figcaption class="imagecaption"><p>Effectiveness of riffle shuffles as a function of the number of consecutive shuffles.</p>
</figcaption>
</figure>
</div>
<p>As we can see, people comfortable with a riffle shuffle need
approximately <strong>4 shuffles</strong> in order to randomize the deck, which is
approximately
<a href="https://projecteuclid.org/download/pdf_1/euclid.aoap/1177005705">half of the theoretical recommendation</a>.
An average card player does not have any advantage over a professional
casino dealer in this regard. Of course, if you’re not riffle
shuffling very well, it will take far more shuffles to achieve the
same degree of randomness.</p>
<h2 id="overhandhindu-shuffle">Overhand/Hindu shuffle</h2>
<p>The overhand shuffle is equivalent to dividing the deck into a number
of sub-decks and then recombining those decks in reverse order. From
watching several YouTube videos of people performing the overhand
shuffle, I found a wide variety in the number of times people will
divide the deck when performing the shuffle. Most people do it around
5 times, but some people consistently do more or fewer than this.</p>
<p>Here, we consider the cases of 5 cuts, 3 cuts, and 8 cuts, as these
values cover the typical range of cuts.</p>
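<p>Following the description above, a single overhand shuffle can be sketched as follows; choosing the cut points uniformly at random is an assumption on my part:</p>

```python
import random

def overhand(deck, n_cuts=5, rng=random):
    """One overhand shuffle: split the deck into n_cuts + 1 packets at
    random positions, then stack the packets in reverse order."""
    points = sorted(rng.sample(range(1, len(deck)), n_cuts))
    bounds = [0] + points + [len(deck)]
    packets = [deck[a:b] for a, b in zip(bounds, bounds[1:])]
    return [card for packet in reversed(packets) for card in packet]
```

With zero cuts the deck is returned unchanged, which is why many repetitions are needed before the order is destroyed.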
<div>
<figure>
<center><img src="/res/shuffle/overhand.png" /></center>
<figcaption class="imagecaption"><p>Effectiveness of overhand shuffles as a function of the number of consecutive shuffles.</p>
</figcaption>
</figure>
</div>
<p>It takes many more overhand shuffles to randomize the deck. Assuming
an average number of cuts, it takes approximately <strong>25 shuffles</strong>, which
is a factor of 20-60 fewer than
<a href="https://www.math.upenn.edu/~pemantle/papers/overhand2.pdf">the theoretical result</a>.
When you only cut the deck 3 times during an overhand shuffle, this
number jumps to almost 40. Nevertheless, this goes against the
theoretical finding, and suggests that the overhand shuffle is a valid
and useful method, even if it is a bit more time consuming.</p>
<h2 id="pile-shuffle">Pile shuffle</h2>
<p>The pile shuffle has many variants. In the strict form, the shuffler
deals all of the cards into some number of piles, and then stacks the
cards on top of each other.</p>
<p>Clearly this strict form is both deterministic and highly patterned,
and thus it is rather ineffective. Sometimes, people will add a
slight bit of randomization by picking up the piles in a different
order than they laid them down. More frequently, people will perform
this deal by randomizing the order in which they lay down cards into
the decks. Sometimes they will do so while keeping the decks
approximately the same size, and sometimes they will disregard deck
size. (However, note that
<a href="http://blog.maxshinnpotential.com/2017/07/05/can-humans-generate-random-numbers.html">people are notoriously bad at randomizing</a>,
so these should be considered the maximum limits of randomization
rather than the method’s true amount of randomization.)</p>
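<p>A sketch of these pile-shuffle variants, with a flag controlling whether pile sizes are kept approximately equal; the names and details here are illustrative rather than the exact simulation code:</p>

```python
import random

def pile_shuffle(deck, n_piles=8, balanced=True, rng=random):
    """One pile shuffle: deal each card into a randomly chosen pile,
    then stack the piles in a random order. With balanced=True, each
    card goes to a random pile among the currently smallest ones."""
    piles = [[] for _ in range(n_piles)]
    for card in deck:
        if balanced:
            smallest = min(len(p) for p in piles)
            candidates = [p for p in piles if len(p) == smallest]
        else:
            candidates = piles
        rng.choice(candidates).append(card)
    rng.shuffle(piles)  # pick up the piles in a random order
    return [card for pile in piles for card in pile]
```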
<div>
<figure>
<center><img src="/res/shuffle/piles.png" /></center>
<figcaption class="imagecaption"><p>Effectiveness of pile shuffles as a function of the number of piles.</p>
</figcaption>
</figure>
</div>
<p>We see that pile shuffling is not very efficient when only performed
once. The reason appears to be that this method does not randomize
the ranks in the deck. If you started with an ordered deck, you can
be nearly certain that if you pick up a 3, the next card will not be
a 2. Most versions of pile shuffling are ineffective for this reason.
The only version which works is the one which keeps the piles at
an approximately similar size throughout the duration of the shuffle,
while using at least eight piles. However, this also assumes that the
shuffler is able to generate near-random numbers, which is impossible
without either a random number generator or knowledge of strategies
for generating random numbers without one.</p>
<h2 id="miscellaneous-results">Miscellaneous results</h2>
<p>I simulated two styles of deals from the shuffled deck: one where the
top 6 cards were taken from the deck, and one where there were 4
players, and 6 cards were dealt to each in a clockwise manner. Results
were nearly identical for both cases, so only results for the former
are included here.</p>
<p>I also simulated a mixed case which combines riffle shuffles with
overhand shuffles, with the hypothesis that adding a few overhand
shuffles could reduce the number of riffle shuffles needed to
randomize the deck. Unfortunately, this turned out to not be the
case. Adding one or two overhand shuffles to different places in the
riffle shuffle sequence was not able to reduce the number of riffle
shuffles needed to randomize the deck.</p>
<p>If we assume that the riffle shuffle takes approximately 5 seconds to
perform and the overhand shuffle takes 2 seconds to perform, it takes
40 seconds to randomize the deck using the overhand shuffle but 20
seconds to randomize it using the riffle shuffle. If we assume that
four cards can be dealt per second and piles can be straightened and
stacked at a rate of one second per subpile, a suitable pile shuffle
would take 21 seconds. However, this is also assuming suitable
randomization, and thus, cards may not be as randomized as in the
other methods.</p>
<p>There are other considerations in choosing the shuffling method as
well; for instance, the overhand shuffle is considered to be less
damaging to the cards than a riffle shuffle, which may bend the cards.</p>
<h2 id="summary">Summary</h2>
<p>From this analysis, we have learned the following:</p>
<ul>
<li>The riffle shuffle is highly efficient, requiring only 4 shuffles in
order to make a deck random for most practical purposes.</li>
<li>When using 5 deck cuts in the overhand shuffle, about 25 shuffles
are necessary to randomize the deck. While this takes longer to
perform, 25 overhand shuffles are about as effective as 4 riffle
shuffles.</li>
<li>10 or more riffle shuffles may be required if the shuffler lacks
experience.</li>
<li>The pile shuffle should generally be avoided, unless using eight or
more piles, distributing cards randomly, ensuring all piles are
approximately the same size, and ideally finding a way to circumvent
the limitations humans have in generating random numbers.</li>
<li>Combining overhand shuffles with riffle shuffles does not increase
randomization compared to just using riffle shuffles.</li>
</ul>
<h3>Code/Data:</h3>
<ul>
<li><a href="/res/shuffle/cards.py">Python library for performing these analyses</a></li>
<li><a href="/res/shuffle/make_figures.py">Python script to make the figures in this post</a></li>
<li><a href="/res/shuffle/analyze_riffle_data.py">Python script for riffle shuffle experiment</a></li>
<li><a href="/res/shuffle/riffle_data.txt">Data from riffle shuffle experiment</a></li>
</ul>
Sun, 05 Nov 2017 00:00:00 +0000
/2017/11/05/optimality-in-card-shuffling.html

Can humans generate random numbers?
<p>It is widely known that humans cannot generate sequences of random
binary numbers
<a href="http://dx.doi.org/10.1037/h0032060">(e.g. see Wagenaar (1972))</a>. The
main problem is that we see true randomness as being “less random”
than it truly is.</p>
<p>A fun party trick (if you attend the right parties) is to have one
person generate a 10-15 digit random binary number by herself, and
another generate a random binary sequence using coin flips. You,
using your magical abilities, can identify which one was generated
with the coin.</p>
<p>The trick to distinguishing a human-generated sequence from a random
sequence is by <em>finding the number of times the sequence switches
between runs of <code class="language-plaintext highlighter-rouge">0</code>s and <code class="language-plaintext highlighter-rouge">1</code>s</em>. For instance, <code class="language-plaintext highlighter-rouge">0011</code> switches once, but
<code class="language-plaintext highlighter-rouge">0101</code> switches three times. A truly random sequence will have a switch
probability of 50%, while a human-generated sequence will typically
have a switch probability greater than 50%. To demonstrate
this, I have generated two sequences
below, one by myself and one with a random number generator:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(A) 11101001011010100010
(B) 11111011001001011000
</code></pre></div></div>
<p>For the examples above, sequence (A) switches 13 times, and sequence
(B) switches 9 times. As you can guess, sequence (A) was mine, and
sequence (B) was a random number generator.</p>
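<p>Counting switches takes only a couple of lines of Python; applying it to the two sequences above reproduces the counts in the text:</p>

```python
def count_switches(seq):
    """Count the positions where consecutive digits differ."""
    return sum(a != b for a, b in zip(seq, seq[1:]))

count_switches("11101001011010100010")  # sequence (A) → 13
count_switches("11111011001001011000")  # sequence (B) → 9
```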
<p>However, when there is a need to generate random numbers, is it
possible for humans to use some type of procedure to quickly generate
random numbers? For simplicity, let us assume that the switch
probability is the only bias that people have when generating random
numbers.</p>
<h2 id="can-we-compensate-manually">Can we compensate manually?</h2>
<p>If we know that the switch probability is abnormal, it is possible
(but much more difficult than you might expect) to generate a sequence
which takes this into account. If you have time to sit and consider
the binary sequence you have generated, this can work. It is
especially effective for short sequences, but is not efficient for
longer sequences. Consider the following longer sequence, which I
generated myself.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1101100011011010100
1101010111101011000
0101011110101001011
0100110101011110100
1010010101010111111
</code></pre></div></div>
<p>This sequence is 100 digits long, and switches 62 times. A simple
algorithm which will equalize the number of switches is to replace
digits at the beginning of the string with zero until the desired
number of switches has been obtained. For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000000000000
0001010111101011000
0101011110101001011
0100110101011110100
1010010101010111111
</code></pre></div></div>
<p>But intuitively, this “feels” much less random. Why might that be?
In a truly random string, not only will the switch probability be
approximately 50%, but the <em>switch probability of switching</em> will be
approximately 50%, which we will call “2nd order switching”. What
exactly is 2nd order switching? Consider the truly random string:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>01000011011100010000
</code></pre></div></div>
<p>Now, let’s generate a new binary sequence, where a <code class="language-plaintext highlighter-rouge">0</code> means that a
switch did not happen at a particular location, and a <code class="language-plaintext highlighter-rouge">1</code> means a
switch did happen at that location. This resulting sequence will be
one digit shorter than the initial sequence. Doing this to the above
sequence, we obtain</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1100010110010011000
</code></pre></div></div>
<p>For reference, notice that the sum of these digits is equal to the
total number of switches. We define the 2nd order switches as the
number of switches in this sequence of switches.</p>
<p>We can generalize this to \(n\)‘th order switches by taking the sum of
the sequence once we have recursively found the sequence of switches
\(n\) times. So the number of 1st order switches is equal to the
number of switches in the sequence, the 2nd order is the number of
switches in the switch sequence, the number of 3rd order switches is
equal to the number of switches in the switch sequence of the switch
sequence, and so on.</p>
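<p>The recursion is perhaps easier to see in code. A short sketch (the function names are mine):</p>

```python
def switch_seq(seq):
    """The switch sequence: 1 where consecutive digits differ, else 0."""
    return "".join("1" if a != b else "0" for a, b in zip(seq, seq[1:]))

def nth_order_switches(seq, n):
    """Number of n'th order switches: count the 1s after applying
    switch_seq n times (n=1 is the ordinary switch count)."""
    for _ in range(n - 1):
        seq = switch_seq(seq)
    return switch_seq(seq).count("1")
```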
<p>Incredibly, in an infinitely-long truly random sequence, <em>the
percentage of \(n\)‘th order switches will always be 50%, for all
\(n\)</em>. In other words, no matter how many times we find the
sequence of switches, we will always have about 50% of the digits be
<code class="language-plaintext highlighter-rouge">1</code>.</p>
<p>Returning to our naive algorithm, we can see why this does not mimic a
random sequence: its 2nd order switch probability is only 24%. In a
truly random sequence, it would be close to 50%.</p>
<h2 id="is-there-a-better-algorithm">Is there a better algorithm?</h2>
<p>So what if we make a smarter algorithm? In particular, what if our
algorithm is based on the very concept of switch probability? If we
find the sequence of \(n\)‘th order switches, can we get a random
sequence?</p>
<p>It would take a very long time to do this by hand, so I have written
some code to do it for me. In particular, this code specifies a
switch probability, and generates a sequence of arbitrary length based
on this (1st order) switch probability. Then, it will also find the
\(n\)‘th order switch probability.</p>
<p>As a first measure, we can check if high order switch probabilities
eventually become approximately 50%. Below, we plot across the switch
probability the average of the \(n\)‘th order difference, which is
very easy to calculate for powers of 2 (see Technical Notes).</p>
<div>
<figure>
<center><img src="/res/randombinary/nthdiffmeans-2powers.png" /></center>
<figcaption class="imagecaption"><p>As we
increase the precision by powers of two, we get sequences that have
\(n\)‘th switch probabilities increasingly close to 50%, no matter what
the 1st order switch probability was.</p>
</figcaption>
</figure>
</div>
<p>From this, it would be easy to conclude that the \(n\)‘th switch
probability of a sequence approximates a random sequence as \(n \to \infty\).
But is this true? What if we do the powers of 2 plus one?</p>
<div>
<figure>
<center><img src="/res/randombinary/nthdiffmeans-2powers-plus1.png" /></center>
<figcaption class="imagecaption"><p>As we increase the precision by powers of two plus one, we
get no closer to a random sequence than the 2nd order switch
probability.</p>
</figcaption>
</figure>
</div>
<p>As we see, even though the switch probability approaches 50%, there is
“hidden information” in the second order switch probability which
makes this sequence non-random.</p>
<h2 id="is-it-possible">Is it possible?</h2>
<p>Mathematicians have already figured out how we can turn biased coins
(i.e. coins that have a \(p \neq 0.5\) probability of landing heads) into
fair coin flips. This was
<a href="https://mcnp.lanl.gov/pdf_files/nbs_vonneumann.pdf">famously described by Von Neumann</a>.
(While the Von Neumann procedure is not optimal,
<a href="http://dl.acm.org/citation.cfm?id=1070587">it is close</a>, and its
simplicity makes it appropriate for our purposes as a heuristic
method.) To summarize this method, suppose you have a coin which
comes up heads with a probability \(p \neq 0.5\). Then in order to obtain
a random sequence, flip the coin twice. If the coins come up with the
same value, discard both values. If they come up with different
values, discard the second value. This takes on average
\(n/(p-p^2)\) coin flips to generate a sequence of \(n\) values.</p>
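<p>In code, the Von Neumann procedure is essentially a one-liner. A sketch, using non-overlapping pairs as described above:</p>

```python
def von_neumann(bits):
    """Turn a biased bit string into an unbiased one: read the bits in
    non-overlapping pairs, discard equal pairs, and keep the first bit
    of each unequal pair."""
    return "".join(a for a, b in zip(bits[0::2], bits[1::2]) if a != b)

von_neumann("110011100111110111110111100011")  # → "10001"
```

The string in the usage line is the switch sequence from the worked example that follows, and the output matches the bits extracted there.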
<p>In our case, we want to correct for a biased switch probability.
Thus, we must generate a sequence of random numbers, find the
switches, apply this technique, and then map the switches back to the
initial choices. So for example,</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1011101000101011010100101000010
</code></pre></div></div>
<p>has switches at</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>110011100111110111110111100011
</code></pre></div></div>
<p>So converting this into pairs, we have</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11 00 11 10 01 11 11 01 11 11 01 11 10 00 11
</code></pre></div></div>
<p>Applying the procedure, we get</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1 0 0 0 1
</code></pre></div></div>
<p>and mapping it back, we get a random sequence of</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>100001
</code></pre></div></div>
<p>(With a bias of \(p=0.7\) in this example, the predicted length of our
initial sequence of 31 was 6.5. This is a sequence of length 6.)</p>
<p>It is possible to recreate this precisely with a sliding window.
However, there is an easier way. In humans, the speed-limiting step
is performing these calculations, not generating biased binary digits.
If we are willing to sacrifice theoretical efficiency (i.e. using as
few digits as possible) for simplicity, we can also look at our initial
sequence in chunks of 3. We then discard the sequences <code class="language-plaintext highlighter-rouge">101</code>,
<code class="language-plaintext highlighter-rouge">010</code>, <code class="language-plaintext highlighter-rouge">111</code>, and <code class="language-plaintext highlighter-rouge">000</code>, but keep the most frequent digit
in the triple for the other observations, namely keeping a <code class="language-plaintext highlighter-rouge">0</code> for
<code class="language-plaintext highlighter-rouge">001</code> and <code class="language-plaintext highlighter-rouge">100</code>, or a <code class="language-plaintext highlighter-rouge">1</code> for <code class="language-plaintext highlighter-rouge">110</code> and <code class="language-plaintext highlighter-rouge">011</code>.
(Note that this is only true because we assume an equal probability of
<code class="language-plaintext highlighter-rouge">0</code> or <code class="language-plaintext highlighter-rouge">1</code> in the initial sequence. A more general choice
procedure would be a <code class="language-plaintext highlighter-rouge">0</code> if we observe <code class="language-plaintext highlighter-rouge">110</code> or <code class="language-plaintext highlighter-rouge">001</code>, and
a <code class="language-plaintext highlighter-rouge">1</code> if we observe <code class="language-plaintext highlighter-rouge">100</code> or <code class="language-plaintext highlighter-rouge">011</code>. However, this is more
difficult for humans to compute.) A proof that this method generates
truly random sequences is trivial.</p>
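<p>The whole pipeline, taking switches, filtering triplets, and mapping back, can be sketched as follows; the choice to start the reconstructed sequence from 1 is arbitrary:</p>

```python
def triplet_debias(seq):
    """De-bias a sequence with a skewed switch probability: take its
    switch sequence in chunks of 3, discard 000/111/010/101, keep the
    majority digit of each remaining chunk, then integrate the
    corrected switch sequence back into a binary sequence."""
    switches = "".join("1" if a != b else "0" for a, b in zip(seq, seq[1:]))
    kept = []
    for i in range(0, len(switches) - 2, 3):
        chunk = switches[i:i + 3]
        if chunk not in ("000", "111", "010", "101"):
            kept.append(max(chunk, key=chunk.count))  # majority digit
    out, cur = ["1"], "1"  # reconstruct choices from switches
    for s in kept:
        if s == "1":
            cur = "0" if cur == "1" else "1"
        out.append(cur)
    return "".join(out)
```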
<p>When we apply this method to the sequences in Figures 1 and 2, we get
the following, which shows both powers of two and powers of two plus
one.</p>
<div>
<figure>
<center><img src="/res/randombinary/nthdiffmeans-2powers-plus1-triplet.png" /></center>
<figcaption class="imagecaption"><p>Using the triplet method, no differences of the sequence show
structure, i.e. the probability of a <code class="language-plaintext highlighter-rouge">0</code> or <code class="language-plaintext highlighter-rouge">1</code> is approximately 50%
in all differences. These values mimic those of the two previous
figures. The variance increases at the edges because the triplets
<code class="language-plaintext highlighter-rouge">010</code> and <code class="language-plaintext highlighter-rouge">101</code> are common when the switch
probability is high, and the triplets <code class="language-plaintext highlighter-rouge">000</code> and <code class="language-plaintext highlighter-rouge">111</code> are common when
it is low, so more triplets are discarded and the resulting sequence is shorter.</p>
</figcaption>
</figure>
</div>
<h2 id="testing-for-randomness-of-these-methods">Testing for randomness of these methods</h2>
<p>While there are many definitions of random sequences, the normalized
<a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy</a>
is especially useful for our purposes. In short, normalized entropy
divides the sequence into blocks of size \(k\) and looks at the
probability of finding any given sequence of length \(k\). If it is a
uniform probability, i.e. no block is any more likely than another
block, the function gives a value of 1, but if some occur more
frequently than others, it gives a value less than 1. In the extreme
case where only one block appears, it gives a value of 0.</p>
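<p>One reasonable implementation (using non-overlapping blocks; the choice between overlapping and non-overlapping blocks is a detail not specified above):</p>

```python
from collections import Counter
from math import log2

def normalized_entropy(bits, k):
    """Entropy of the distribution of non-overlapping length-k blocks,
    divided by the maximum possible entropy (k bits), so the result is
    1 for a uniform block distribution and 0 if only one block appears."""
    blocks = [tuple(bits[i:i + k]) for i in range(0, len(bits) - k + 1, k)]
    n = len(blocks)
    h = -sum(c / n * log2(c / n) for c in Counter(blocks).values())
    return h / k

print(normalized_entropy([0, 1] * 50, 2))              # 0.0: only block 01 appears
print(normalized_entropy([0, 0, 0, 1, 1, 0, 1, 1], 2))  # 1.0: all four blocks, equally
```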
<p>Intuitively, if we have a high switch probability, we would expect to
see the blocks <code class="language-plaintext highlighter-rouge">01</code> and <code class="language-plaintext highlighter-rouge">10</code> more frequently than <code class="language-plaintext highlighter-rouge">00</code> or <code class="language-plaintext highlighter-rouge">11</code>.
Likewise, if we have a low switch probability, we would see more <code class="language-plaintext highlighter-rouge">00</code>
and <code class="language-plaintext highlighter-rouge">11</code> blocks than <code class="language-plaintext highlighter-rouge">01</code> or <code class="language-plaintext highlighter-rouge">10</code>. Similar relationships extend
beyond blocks of size 2. Entropy is useful for determining randomness
because it makes no assumptions about the form of the non-random
component.</p>
<p>As we can see below, sequences from the triplet method are
indistinguishable from random, but all other methods show non-random
patterns.</p>
<div>
<figure>
<center><img src="/res/randombinary/binary-sequence-entropy.png" /></center>
<figcaption class="imagecaption"><p>Using entropy, we can see which methods are able to mimic a
random sequence. Of those we have considered here, only the triplet
method is able to generate random values, as the normalized entropy is
approximately equal to 1 for all sequence lengths.</p>
</figcaption>
</figure>
</div>
<h2 id="conclusions">Conclusions</h2>
<p>Humans are not very effective at generating random numbers. With
random binary numbers, human-generated random numbers tend to switch
back and forth between sequences of <code class="language-plaintext highlighter-rouge">0</code>s and <code class="language-plaintext highlighter-rouge">1</code>s too quickly. Even
when made aware of this bias, it is difficult to compensate for it.</p>
<p>However, there may be ways that humans can generate sequences which
are closer to random sequences. One way is to split the sequence into
three-digit triplets, discarding the entire triplet for <code class="language-plaintext highlighter-rouge">000</code>, <code class="language-plaintext highlighter-rouge">111</code>,
<code class="language-plaintext highlighter-rouge">010</code>, and <code class="language-plaintext highlighter-rouge">101</code>, and taking the most frequent number in all other
triplets. When the key non-random element is switch probability, this
creates an easy-to-compute method for generating a random binary
sequence.</p>
<p>Nevertheless, this triplet method relies on the assumption that the
only bias humans have when generating random binary numbers is the
switch probability bias. Since no data were analyzed here, it is
still necessary to look into whether human-generated sequences have
more non-random elements than just a change in switch probability.</p>
<h2 id="more-information">More information</h2>
<h3 id="technical-notes">Technical notes</h3>
<p>The 1st order switch sequence is also called differencing. Similarly,
the sequence of \(n\)‘th order switches is equal to the \(n\)‘th
difference. The 1st difference can also be thought of as a sliding
XOR window, i.e. a parity sliding window of size 2.</p>
<p>Using the concept of a sliding XOR window raises the idea that the
\(n\)‘th difference can be represented as a sliding parity window of
size greater than 2. By construction, for a sequence of length \(N\),
a sliding window of size \(k\) would end up producing a sequence of
length \(N-k+1\). Since the \(n\)‘th difference produces a sequence
of length \(N-n\), this means that a sliding window of length \(k\)
would be limited to only the \((k-1)\)‘th difference.</p>
<p>It turns out that this holds only for window sizes that are powers
of two, and the proof is a simple induction. For the base case, the 0th
difference corresponds to a window of size 1. The parity of a single
binary digit is just the digit itself. So it holds trivially in this
case.</p>
<p>For the induction step, suppose the \((k-1)\)‘th difference is equal
to the parity for a window of size \(k\) where \(k\) is a power of 2.
Let \(x\) be any bit in the \((2k-1)\)‘th difference. Let
the window of \(x\) be the appropriate window of \(2k\) digits of the
original sequence which determines the value of \(x\). We need to
show that the parity of the window of \(x\) is equal to the value of
\(x\).</p>
<p>Without loss of generality, suppose the sequence of bits is of length
\(2k\) (the window of \(x\)), and \(x\) is the single bit that is the
\((2k-1)\)‘th difference.</p>
<p>Let us split the sequence in half, and apply the assumption to the
first half and then to the second half separately. We notice that the
parity of the parities of these halves is equal to the parity of the
entire sequence. Splitting the sequence in half tells us that the
first bit in the \((k-1)\)‘th difference is equal to the parity of the
first half of the bits, and the second bit in the \((k-1)\)‘th
difference is equal to the parity of the second half of the bits.</p>
<p>We can reason that these two bits would be equal iff the parity of the
\(k\)‘th difference is 0, and different iff the parity of the \(k\)‘th
difference is 1, because by the definition of a difference, we switch
every time there is a 1 in the sequence, so an even number of switches
means that the two bits would have the same parity. By applying our
original assumption again, we know that if the parity of the \(k\)‘th
difference is 0, then \(x\) will be 0, and if it is 1, \(x\)
will be 1.</p>
<p>Therefore, \(x\) is equal to the parity of a window of size \(2k\),
and hence the \((2k-1)\)‘th difference is equal to the parity of a
sliding window of size \(2k\). QED.</p>
<div>
<figure>
<center><img src="/res/randombinary/proof.png" /></center>
<figcaption class="imagecaption"><p>A visual version of
the above proof.</p>
</figcaption>
</figure>
</div>
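<p>The claim can also be checked numerically. This sketch verifies, for a random sequence, that the \((k-1)\)‘th difference matches the sliding parity window of size \(k\) exactly when \(k\) is a power of two:</p>

```python
from functools import reduce
from operator import xor
import random

def nth_difference(seq, n):
    """Apply the switch operation (XOR of adjacent bits) n times."""
    for _ in range(n):
        seq = [a ^ b for a, b in zip(seq, seq[1:])]
    return seq

def parity_windows(seq, k):
    """Parity (XOR of all bits) of every sliding window of size k."""
    return [reduce(xor, seq[i:i + k]) for i in range(len(seq) - k + 1)]

random.seed(0)
seq = [random.randint(0, 1) for _ in range(64)]
for k in (1, 2, 4, 8, 16):  # powers of two: the identity holds
    assert nth_difference(seq, k - 1) == parity_windows(seq, k)
# For k = 3 the identity fails in general: the 2nd difference works out to
# s[i] XOR s[i+2] (the middle bit cancels), not the parity of all three bits.
```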
<h3>Code/Data:</h3>
<ul>
<li><a href="/res/randombinary/randombinary.py">Analysis script</a></li>
</ul>
Wed, 05 Jul 2017 00:00:00 +0000
/2017/07/05/can-humans-generate-random-numbers.html

When to stop and when to keep going
<p>I was recently posed the following puzzle:</p>
<blockquote>
<p>Imagine you are offered a choice between two different bets. In
(A), you must make 2/3 soccer shots, and in (B), you must make
5/8. In either case, you receive a \($100\) prize for winning the bet.
Which bet should you choose?</p>
</blockquote>
<p>Intuitively, a professional soccer player would want to take the
second bet, whereas a hopeless case like me would want to take the
first. However, suppose you have no idea whether your skill level is
closer to Lionel Messi or to Max Shinn. The puzzle continues:</p>
<blockquote>
<p>You are offered the option to take practice shots to determine your
skill level at a cost of \($0.01\) for each shot. Assuming you and
the goalie never fatigue, how do you decide when to stop taking
practice shots and choose a bet?</p>
</blockquote>
<p>Clearly it is never advisable to take more than \(100/.01=10000\)
practice shots, but how many <em>should</em> we take? A key to this question
is that you do not have to determine the number of shots to take
beforehand. Therefore, rather than determining a fixed number of
shots to take, we will instead need to determine a decision procedure
for when to stop shooting and choose a bet.</p>
<p>There is no single “correct” answer to this puzzle, so I have
documented my approach below.</p>
<h2 id="approach">Approach</h2>
<p>To understand my approach, first realize that there are a finite
number of potential states that the game can be in, and that you can
fully define each state based on how many shots you have made and how
many you have missed. The sum of these is the total number of shots
you have taken, and the order does not matter. Additionally, we
assume that all states exist, even if you will never arrive at that
state by the decision procedure.</p>
<p>An example of a state is taking 31 shots, making 9 of them, and
missing 22 of them. Another example is taking 98 shots, making 1 of
them and missing 97 of them. Even though we may have already made a
decision before taking 98 shots, the concept of a state does not
depend on the procedure used to “get there”.</p>
<p>Using this framework, it is sufficient to show which decision we
should take given what state we are in. My approach is as follows:</p>
<ol>
<li>Find a tight upper bound \(B \ll 10000\) on the number of practice
shots to take. This limits the number of states to work with.</li>
<li>Determine the optimal choice based on each potential state after
taking \(B\) total shots. Once \(B\) shots have been taken, it is
always best to have chosen either bet (A) or bet (B), so choose the
best bet without the option of shooting again.</li>
<li>Working backwards, starting with states with \(B-1\) shots and
moving down to \(B-2,...,0\), determine the expected value of each
of the three choices: select bet (A), select bet (B), or shoot
again. Use this to determine the optimal choice to make at that
position.</li>
</ol>
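<p>A condensed sketch of this backward induction (the bound is set to an illustrative 50 shots rather than the 479 derived below; the grid approximation of the skill-level posterior follows the appendix, and the function names are my own):</p>

```python
from functools import lru_cache
from math import comb

PRIZE, COST = 100.0, 0.01

def bet_value(shots, need, p):
    """Expected prize of a bet requiring `need` makes of `shots` at skill p."""
    return PRIZE * sum(comb(shots, j) * p**j * (1 - p)**(shots - j)
                       for j in range(need, shots + 1))

def expected_bet(shots, need, k, n, grid=100):
    """Bet value averaged over the Beta(k+1, n-k+1) posterior for the skill."""
    num = den = 0.0
    for i in range(1, grid):
        p = i / grid
        w = p**k * (1 - p)**(n - k)       # unnormalized posterior density
        num, den = num + w * bet_value(shots, need, p), den + w
    return num / den

B = 50   # illustrative bound; the post derives B = 479

@lru_cache(maxsize=None)
def value(k, n):
    """Best expected net winnings from the state: k makes out of n shots."""
    stop = max(expected_bet(3, 2, k, n), expected_bet(8, 5, k, n))
    if n == B:                            # step 2: forced choice at the bound
        return stop
    p = (k + 1) / (n + 2)                 # posterior-mean estimate of skill
    shoot = -COST + p * value(k + 1, n + 1) + (1 - p) * value(k, n + 1)
    return max(stop, shoot)               # step 3: best of the three choices
```

<p>Evaluating <code>value(0, 0)</code> then gives the expected winnings of the whole strategy before any practice shots are taken.</p>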
<p>The advantage of this approach is that the primary criterion we will
work with is the expected value for each decision. This means that if
we play the game many times we will maximize the amount of money we
earn. As a convenient consequence of this, we know how much money we can
expect to earn given our current state.</p>
<p>The only reason this procedure is necessary is that we don’t know
our skill level. If we could determine with 100% accuracy what our
skill level was, we would never need to take any shots at all. Thus,
a key part of this procedure is estimating our skill level.</p>
<h2 id="what-if-you-know-your-skill-level">What if you know your skill level?</h2>
<p>We define skill level as the probability \(p_0\) that you will make a
shot. So if you knew your probability of making each shot, we could
find your expected payoff from each bet. On the plot below, we show
the payoff (in dollars) of each bet on the y-axis, and how it changes
with skill on the x-axis.</p>
<div>
<figure>
<center><img src="/res/soccer/winning_prob_binom.png" /></center>
<figcaption class="imagecaption"><p>Assuming
we have a precise knowledge of your skill level, we can find how much
money you can expect to make from each bet.</p>
</figcaption>
</figure>
</div>
<p>The first thing to notice is the obvious: as our skill improves, the
amount of money we can expect to win increases. Second, we see that
there is some point (the “equivalence point”) at which the bets are
equal; we compute this numerically to be \(p_0 = 0.6658\). We should
choose bet (A) if our skill level is worse than \(0.6658\), and bet (B) if
it is greater than \(0.6658\).</p>
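<p>The equivalence point can be found numerically; a sketch using bisection on the two binomial tail probabilities:</p>

```python
from math import comb

def win_prob(shots, need, p):
    """Probability of making at least `need` of `shots` at skill level p."""
    return sum(comb(shots, j) * p**j * (1 - p)**(shots - j)
               for j in range(need, shots + 1))

# Bisect for the skill level where the two bets pay the same:
# below it bet (A) (2 of 3) wins more often; above it bet (B) (5 of 8) does.
lo, hi = 0.5, 0.9
for _ in range(50):
    mid = (lo + hi) / 2
    if win_prob(3, 2, mid) > win_prob(8, 5, mid):
        lo = mid
    else:
        hi = mid
print(round(lo, 4))   # equivalence point, approximately 0.6658
```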
<p>But suppose our guess is poor. We notice that <em>the consequence for
guessing too high is less than the consequence for guessing too low</em>.
It is better to bias your choice towards (A) unless you obtain
substantial evidence that you have a high skill level and (B) would be
a better choice. In other words, the potential gains from choosing
(A) over (B) are larger than the potential gains for choosing (B) over
(A).</p>
<h2 id="finding-a-tight-upper-bound">Finding a tight upper bound</h2>
<p>Quantifying this intuition, we compute the maximal possible gain of
choosing (A) over (B) and (B) over (A) as the maximum distance between
the curves on each side of the equivalence point. In other words, we
find the skill level at which the incentive is strongest to choose one
bet over the other, and then find what the incentive is at these
points.</p>
<div>
<figure>
<center><img src="/res/soccer/winning_prob_binom_lines.png" /></center>
<figcaption class="imagecaption"><p>We
see here the locations where the distance between the curves is
greatest, showing the skill levels where it is most advantageous to
choose (A) or (B).</p>
</figcaption>
</figure>
</div>
<p>This turns out to be \($4.79\) for choosing (B) over (A), and
\($17.92\) for choosing (A) over (B). Since each shot costs
\($0.01\), we conclude that it is never a good idea to take more than
479 practice shots. Thus, our upper bound \(B=479\).</p>
<h2 id="determining-the-optimal-choice-at-the-upper-bound">Determining the optimal choice at the upper bound</h2>
<p>Because we will never take more than 479 shots, we use this as a
cutoff point, and force a decision once 479 shots have been taken. So
for each possible combination of successes and failures, we must
find whether bet (A) or bet (B) is better.</p>
<p>In order to determine this, we need two pieces of information: first,
we need the expected value of bets (A) and (B) given \(p_0\) (i.e. the
curve shown above); second, we need the distribution representing our
best estimate of \(p_0\). Remember, it is not enough to simply choose
(A) when our predicted skill is less than \(0.6658\) and (B) when it
is greater than \(0.6658\); since we are biased towards choosing (A),
we need a probability distribution representing potential values of
\(p_0\). Then, we can find the expected value of each bet given the
distribution of \(p_0\) (see appendix for more details). This can be
computed with a simple integral, and is easy to approximate
numerically.</p>
<p>Once we have performed these computations, in addition to having
information about whether (A) or (B) was chosen, we also know the
expected value of the chosen bet. This will be critical for
determining whether it is beneficial to take more shots before we have
reached the upper bound.</p>
<h2 id="determining-the-optimal-choice-below-the-upper-bound">Determining the optimal choice below the upper bound</h2>
<p>We then go down one level: if 478 shots have been taken, with \(k\)
successes and \((478-k)\) failures, should we choose (A), should we
choose (B), or should we take another shot? Remember, we would like
to select the choice which will give us the highest expected outcome.</p>
<p>Based on this principle, it is only advisable to take another shot if
it would influence the outcome; in other words, if you would choose
the same bet no matter what the outcome of your next shot, it does not
make sense to take another shot, because you lose \($0.01\) without
gaining any information. It only makes sense to take the shot if the
information gained from taking the shot increases the expected value
by more than \($0.01\).</p>
<p>Thus, we would only like to take another shot if the information
gained is worth more than \($0.01\). We can compute this by finding the
expected value of each of the three options (choose (A), choose (B),
or shoot again). Using our previous experiments to judge the
probability of a successful shot (see appendix), we can find the
expected payoff of taking another shot. If it is greater than
choosing (A) or (B), we take the shot.</p>
<p>Working backwards, we continue until we are on our first shot, where
we assume we have a \(50\)% chance of succeeding. Once we reach this
point, we have a full decision tree, indicating which action we should
take based on the outcome of each shot, and the entire decision
process can be considered solved.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Here is the decision tree, plotted in raster form.</p>
<div>
<figure>
<center><img src="/res/soccer/decision-tree.png" /></center>
<figcaption class="imagecaption"><p>Starting at
the point (0,0), go one to the right for every shot that you take, and
one up for every shot that you make. Red indicates you should shoot
again, blue indicates you should choose (A), and green indicates you
should choose (B).</p>
</figcaption>
</figure>
</div>
<p>Looking more closely at the beginning, we see that unless you are
really good, you should choose (A) rather quickly.</p>
<div>
<figure>
<center><img src="/res/soccer/decision-tree-zoomed.png" /></center>
<figcaption class="imagecaption"><p>An
identical plot to that above, but zoomed in near the beginning.</p>
</figcaption>
</figure>
</div>
<p>We can also look at the amount of money you will win on average if you
play by this strategy. As expected, when you make more shots, you
will have a higher chance of winning more money.</p>
<div>
<figure>
<center><img src="/res/soccer/value-tree.png" /></center>
<figcaption class="imagecaption"><p>For each state in
the previous figures, the values show the expected winnings under this strategy.</p>
</figcaption>
</figure>
</div>
<p>We can also look at the zoomed in version.</p>
<div>
<figure>
<center><img src="/res/soccer/value-tree-zoomed.png" /></center>
<figcaption class="imagecaption"><p>An
identical plot to the one above, but zoomed in near the beginning.</p>
</figcaption>
</figure>
</div>
<p>This algorithm grows in memory and computation time like \(O(B^2)\),
meaning that if we double the size of the upper bound, we quadruple
the amount of memory and CPU time we require.</p>
<p>This may not be the best strategy, but it seems to be a principled
strategy which works reasonably well with a relatively small runtime.</p>
<h2 id="appendix-determining-the-distribution-of-p_0">Appendix: Determining the distribution of \(p_0\)</h2>
<p>In order to find the distribution for \(p_0\), we consider the
distribution of \(p_0\) for a single shot. The chance that we make a
shot is \(100\)% if \(p_0=1\), \(0\)% if \(p_0=0\), \(50\)% if
\(p_0=0.5\), and so on. Thus, the distribution of \(p_0\) from a
single successful trial is proportional to \(f(p)=p\) for \(0 ≤ p ≤ 1\). Similarly,
if we miss the shot, then the distribution is proportional to \(f(p)=(1-p)\) for
\(0≤p≤1\). Since these probabilities are independent, we can multiply
them together and find that, for \(n\) shots, \(k\) successes, and
\((n-k)\) failures, we have \(f(p)=p^k (1-p)^{n-k}/c\) for some
normalizing constant \(c\). It turns out, this is identical to the
beta distribution, with parameters \(α=k+1\) and \(β=n-k+1\).</p>
<p>However, we need a point estimate of \(p_0\) to compute the expected
value of taking another shot. We cannot simply use the ratio \(k/n\)
for two practical reasons: first, it is undefined when no shots have
been taken, and second, when the first shot has been taken, we have a
\(100\)% probability of one outcome and a \(0\)% probability of the
other. If we want to assume a \(50\)% probability of making the shot
initially, an easy way to solve this problem is to use the ratio
\((k+1)/(n+2)\) instead of \(k/n\) to estimate the probability.
Interestingly, this quick and dirty solution is equivalent to finding
the mean of the beta distribution. When no shots have been taken,
\(k=0\) and \(n=0\), so \(α=1\) and \(β=1\), which is equivalent to the
uniform distribution, hence our non-informative prior.</p>
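<p>In code, the estimate and its edge-case behavior (a trivial sketch; the function name is my own):</p>

```python
def p_hat(k, n):
    """Estimated make probability after k makes in n shots: the mean of
    the Beta(k+1, n-k+1) posterior (Laplace's rule of succession)."""
    return (k + 1) / (n + 2)

print(p_hat(0, 0))   # 0.5: the uniform prior's mean, before any shots
print(p_hat(1, 1))   # about 0.667 rather than the naive 1/1 = 100%
```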
<h3>Code/Data:</h3>
<ul>
<li><a href="/res/soccer/find-best-strategy.py">Analysis script</a></li>
</ul>
Wed, 08 Mar 2017 00:00:00 +0000
/2017/03/08/when-to-stop-and-when-to-keep-going.html

Which hints are best in Towers?
<p>There is a wonderful collection of puzzles by Simon Tatham called the
<a href="http://www.chiark.greenend.org.uk/~sgtatham/puzzles/">Portable Puzzle Collection</a>
which serves as a fun distraction. The game “Towers” is a simple puzzle
where you must fill in a
<a href="https://en.wikipedia.org/wiki/Latin_square">Latin square</a> with
numbers \(1 \ldots N\), only one of each per row/column, as if the
squares contained towers of this height. The number of towers visible
from the edges of rows and columns are given as clues. For example,</p>
<div>
<figure>
<center><img src="/res/towers/example-board.png" /></center>
<figcaption class="imagecaption"><p>An example starting board from the Towers game.</p>
</figcaption>
</figure>
</div>
<p>Solved, the board would appear as,</p>
<div>
<figure>
<center><img src="/res/towers/solved-example.png" /></center>
<figcaption class="imagecaption"><p>The previous example solved.</p>
</figcaption>
</figure>
</div>
<p>In more advanced levels, not all of the hints are given.
Additionally, in these levels, hints can also be given in the form of
the value of particular cells. For example, the initial conditions of
the puzzle may be,</p>
<div>
<figure>
<center><img src="/res/towers/hard-level.png" /></center>
<figcaption class="imagecaption"><p>A more difficult example board.</p>
</figcaption>
</figure>
</div>
<p>With such different types of hints, a natural question is whether
some hints are better than others.</p>
<h2 id="how-will-we-approach-the-problem">How will we approach the problem?</h2>
<p>We will use an
<a href="https://en.wikipedia.org/wiki/Shannon_information">information-theoretic</a>
framework to understand how useful different hints are. This allows
us to measure the amount of information that a particular hint gives
about the solution to a puzzle in bits, a nearly-identical unit to
that used by computers to measure file and memory size.</p>
<p>Information theory is based on the idea that random variables
(quantities which can take on one of many values probabilistically)
are not always independent, so sometimes knowledge of the value of one
random variable can change the probabilities for a different random
variable. For instance, one random variable may be a number 1-10, and
a second random variable may be whether that number is even or odd. A
bit is an amount of information equal to the best possible yes or no
question, or (roughly speaking) information that can cut the number of
possible outcomes in half. Knowing whether a number is even or odd
gives us one bit of information, since it specifies that the first
random variable can only be one of five numbers instead of one of ten.</p>
<p>Here, we will define a few random variables. Most importantly, we
will have the random variable describing the correct solution of the
board, which could be any possible board. We will also have random
variables which represent hints. There are two types of hints:
initial cell value hints (where one of the cells is already filled in)
and tower visibility hints (which define how many towers are visible
down a row or column).</p>
<p>The number of potential Latin squares of size \(N\) grows very fast.
For a \(5×5\) square, there are 161,280 possibilities, and for a
\(10×10\), there are over \(10^{47}\). Thus, for computational
simplicity, we analyze a \(4×4\) puzzle with a mere 576 possibilities.</p>
<h2 id="how-useful-are-initial-cell-value-hints">How useful are “initial cell value” hints?</h2>
<p>First, we measure the entropy, or the maximal information content that
a single cell will give. For the first cell chosen, there is an equal
probability that any of the values (<code class="language-plaintext highlighter-rouge">1</code>, <code class="language-plaintext highlighter-rouge">2</code>, <code class="language-plaintext highlighter-rouge">3</code>, or <code class="language-plaintext highlighter-rouge">4</code>) will be
found in that cell. Since there are four equally likely options, this gives us 2 bits
of information.</p>
<p>What about the second initial cell value? Interestingly, it depends
both on the location and on the value. If the second clue is in the
same row or column as the first, it will give less information. If it
is the same number as the first, it will also give less information.</p>
<p>Counter-intuitively, in the 4×4 board, this means we gain <em>more</em> than
2 bits of information from the second hint. This is because, once we
reveal the first cell’s value, the probabilities of each of the other
cell’s possible values are not equal as they were before. Since we
are not choosing from the same row or column as our first choice, it is
more likely that this cell will be equal to the first cell’s value
than to any other value. Therefore, if we reveal a value which is
different, it will provide more information.</p>
<p>Intuitively, for the 4×4 board, suppose we reveal the value of a cell
and it is <code class="language-plaintext highlighter-rouge">4</code>. There cannot be another <code class="language-plaintext highlighter-rouge">4</code> in the same column or row,
so if we are to choose a hint from a different column or row, we are
effectively choosing from the remaining 3×3 grid. There must be 3 <code class="language-plaintext highlighter-rouge">4</code>
values in the 3×3 grid, so the probability of selecting a <code class="language-plaintext highlighter-rouge">4</code> is 1/3. We
have an even probability of selecting a <code class="language-plaintext highlighter-rouge">1</code>, <code class="language-plaintext highlighter-rouge">2</code>, or <code class="language-plaintext highlighter-rouge">3</code>, so each
other value has a probability of 2/9. Because these are more surprising
finds, we gain about 2.17 bits of information from each of them.</p>
<p>Consequently, selecting a cell in the same row or column, or one which
has the same value as the first, will give an additional 1.58 bits of
information.</p>
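<p>These probabilities can be verified by brute force, since there are only 576 boards. The enumeration below is a sketch, representing a square as four row permutations:</p>

```python
from itertools import permutations

# Enumerate all 4x4 Latin squares as 4-tuples of row permutations.
rows = list(permutations(range(1, 5)))
disjoint = lambda a, b: all(x != y for x, y in zip(a, b))
squares = [(a, b, c, d)
           for a in rows
           for b in rows if disjoint(a, b)
           for c in rows if disjoint(a, c) and disjoint(b, c)
           for d in rows if disjoint(a, d) and disjoint(b, d) and disjoint(c, d)]
print(len(squares))   # 576

# Condition on a first hint: cell (0, 0) holds a 4. Then look at a cell
# in a different row and column, say (1, 1).
cond = [s for s in squares if s[0][0] == 4]
p4 = sum(s[1][1] == 4 for s in cond) / len(cond)
p_other = sum(s[1][1] == 1 for s in cond) / len(cond)
print(p4, p_other)    # 1/3 for another 4, 2/9 for each of 1, 2, and 3
```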
<h2 id="how-about-tower-visibility-hints">How about “tower visibility” hints?</h2>
<p>In a 4×4 puzzle, it is very easy to compute the information gained if
the hint is a <code class="language-plaintext highlighter-rouge">1</code> or a <code class="language-plaintext highlighter-rouge">4</code>. A hint of <code class="language-plaintext highlighter-rouge">1</code> always gives the same
amount of information as a single square: it tells us that the cell on
the edge of the hint must be a <code class="language-plaintext highlighter-rouge">4</code>, and gives no information about the
rest of the squares. If only one tower can be seen, the tallest tower
must come first. Thus, it must give 2 bits of information.</p>
<p>Additionally, we know that if the hint is equal to <code class="language-plaintext highlighter-rouge">4</code>, the only
possible combination for the row is <code class="language-plaintext highlighter-rouge">1</code>, <code class="language-plaintext highlighter-rouge">2</code>, <code class="language-plaintext highlighter-rouge">3</code>, <code class="language-plaintext highlighter-rouge">4</code>. Thus, this
gives an amount of information equal to the entropy of a single row,
which turns out to be 4.58 bits.</p>
<p>For a hint of <code class="language-plaintext highlighter-rouge">2</code> or <code class="language-plaintext highlighter-rouge">3</code>, the information content is not as
immediately clear, but we can calculate them numerically. For a hint
of <code class="language-plaintext highlighter-rouge">2</code>, we have 1.13 bits, and for a hint of <code class="language-plaintext highlighter-rouge">3</code>, we have 2 bits.</p>
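<p>These values can all be computed by counting, for each hint, how many of the 24 orderings of a row are consistent with it; the information gained about the row is then \(\log_2\) of 24 divided by the number of consistent orderings. A sketch:</p>

```python
from itertools import permutations
from math import log2

def visible(row):
    """Count the left-to-right maxima: towers seen from the row's near end."""
    best = seen = 0
    for h in row:
        if h > best:
            best, seen = h, seen + 1
    return seen

perms = list(permutations(range(1, 5)))       # the 24 possible rows
for hint in (1, 2, 3, 4):
    consistent = sum(visible(p) == hint for p in perms)
    print(hint, round(log2(len(perms) / consistent), 2))
# hint 1: 2.0 bits, hint 2: 1.13, hint 3: 2.0, hint 4: 4.58
```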
<p>Conveniently, due to the fact that the reduction of entropy in a row
must be equal to the reduction of entropy in the entire puzzle, we can
compute values for larger boards. Below, we show the information
gained about the solution from each possible hint (indicated by the
color). In general, it seems higher hints are usually better, but a
hint of <code class="language-plaintext highlighter-rouge">1</code> is generally better than one of <code class="language-plaintext highlighter-rouge">2</code> or <code class="language-plaintext highlighter-rouge">3</code>.</p>
<div>
<figure>
<center><img src="/res/towers/information-by-board-size.png" /></center>
<figcaption class="imagecaption"><p>For each board size, the information content of each
potential hint is plotted.</p>
</figcaption>
</figure>
</div>
<h2 id="conclusion">Conclusion</h2>
<p>In summary:</p>
<ul>
<li>The more information given by a hint for a puzzle, the easier that
hint makes it to solve the puzzle.</li>
<li>Of the two types of hints, usually the hints about the tower
visibility are best.</li>
<li>On small boards (of size less than 5), hints about individual cells
are very useful.</li>
<li>The more towers visible from a row or column, the more information is
gained about the puzzle from that hint.</li>
</ul>
<p>Of course, remember that all of the hints combined of any given puzzle
must be sufficient to completely solve the puzzle (assuming the puzzle
is solvable), so the information content provided by the hints must be
equal to the entropy of the puzzle of the given size. When combined,
we saw in the “initial cell value” section that hints may become more or less
effective, so these entropy values cannot be directly added to
determine which hints provide the most information. Nevertheless,
this serves as a good starting point in determining which hints are
the most useful.</p>
<h2 id="more-information">More information</h2>
<h3 id="theoretical-note">Theoretical note</h3>
<p>For initial cell hints, it is possible to compute the information
content analytically for any size board. For a board of size \(N×N\)
with \(N\) symbols, we know that the information contained in the
first hint is \(-\log(1/N)\) bits. Suppose this play uncovers token
<code class="language-plaintext highlighter-rouge">X</code>. Using this first play, we construct a sub-board where the row
and column of the first hint are removed, leaving us with an
\((N-1)×(N-1)\) board. If we choose a cell from this board, it has a
\(1/(N-1)\) probability of being <code class="language-plaintext highlighter-rouge">X</code> and an equal chance of being
anything else, giving a \(\frac{N-2}{(N-1)^2}\) probability of each of
the other tokens. Thus, the information gained is
\(-\log\left(\frac{N-2}{(N-1)^2}\right)\) if the
value is different from the first, and
\(-\log\left(1/(N-1)\right)\) if they are the same; these
expressions are approximately equal for large \(N\). Note how no
information is gained when the second square is revealed if \(N=2\).</p>
<p>Similarly, when a single row is revealed (for example by knowing that
\(N\) towers are visible from the end of a row or column) we know that
the entropy must be reduced by \(-\sum_{i=1}^N \log(1/i)\). This is
because the first element revealed in the row gives \(-\log(1/N)\)
bits, the second gives \(-\log(1/(N-1))\) bits, and so on.</p>
<h3 id="solving-a-puzzle-algorithmically">Solving a puzzle algorithmically</h3>
<p>Most of these puzzles are solvable without backtracking, i.e. the next
move can always be logically deduced from the state of the board
without the need for trial and error. By incorporating the
information from each hint into the column and row states and then
integrating this information across rows and columns, it turned out to
be surprisingly simple to write a quick and dirty algorithm to solve
the puzzles. This algorithm, while probably not of optimal
computational complexity, worked reasonably well. Briefly,</p>
<ol>
<li>Represent the initial state of the board by a length-\(N\) list of
lists, where each of the \(N\) lists represents a row of the board,
and each sub-list contains all of the possible combinations of this
row (there are \(N!\) of them to start). Similarly, define an
equivalent (yet redundant) data structure for the columns.</li>
<li>Enforce each condition on the start of the board by eliminating the
impossible combinations using the number of towers visible from
each row and column, and using the cells given at initialization.
Update the row and column lists accordingly.</li>
<li>Now, the possible moves for certain squares will be restricted by
the row and column limitations; for instance, if only 1 tower is
visible in a row or column, the tallest tower in the row or column
must be on the edge of the board. Iterate through the cells,
restricting the potential rows by the limitations on the column and
vice versa. For example, if we know the position of the tallest
tower in a particular <em>column</em>, eliminate the corresponding <em>rows</em>
which do not have the tallest tower in this position in the row.</li>
<li>After sufficient iterations of (3), there should only be one
possible ordering for each row (assuming it is solvable without
backtracking). The puzzle is now solved.</li>
</ol>
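<p>The steps above can be sketched in Python. This is a minimal,
unoptimized reimplementation of the idea rather than the original script;
the clue-list arguments (with 0 meaning "no clue") and the <code>givens</code>
dictionary are my own conventions:</p>

```python
from itertools import permutations

def visible(seq):
    """Number of towers seen from the front (taller towers hide shorter ones)."""
    count, tallest = 0, 0
    for h in seq:
        if h > tallest:
            count, tallest = count + 1, h
    return count

def candidates(n, front, back, fixed):
    """All orderings of 1..n consistent with the two visibility clues
    (0 means no clue) and with the fixed cells {index: height}."""
    return [p for p in permutations(range(1, n + 1))
            if (not front or visible(p) == front)
            and (not back or visible(p[::-1]) == back)
            and all(p[i] == h for i, h in fixed.items())]

def solve(n, left, right, top, bottom, givens=None):
    """Steps 1-4: build row and column candidate lists from the clues and
    given cells, then cross-filter rows against columns to a fixed point."""
    givens = givens or {}
    rows = [candidates(n, left[r], right[r],
                       {c: h for (gr, c), h in givens.items() if gr == r})
            for r in range(n)]
    cols = [candidates(n, top[c], bottom[c],
                       {r: h for (r, gc), h in givens.items() if gc == c})
            for c in range(n)]
    changed = True
    while changed:
        changed = False
        for r in range(n):
            # Keep a row ordering only if every cell value is still
            # achievable by some candidate for the corresponding column.
            kept = [p for p in rows[r]
                    if all(any(q[r] == p[c] for q in cols[c]) for c in range(n))]
            changed |= len(kept) < len(rows[r])
            rows[r] = kept
        for c in range(n):
            kept = [q for q in cols[c]
                    if all(any(p[c] == q[r] for p in rows[r]) for r in range(n))]
            changed |= len(kept) < len(cols[c])
            cols[c] = kept
    return [r[0] for r in rows]  # unique if solvable without backtracking
```

<p>For example, the fully-clued 3×3 puzzle with left clues (3, 2, 1), right
clues (1, 2, 2), top clues (3, 2, 1), and bottom clues (1, 2, 2) resolves to
the Latin square whose first row is (1, 2, 3).</p>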
<p>This is not a very efficient algorithm, but it is fast enough and
memory-efficient enough for all puzzles which might be fun for a human
to solve. This algorithm also does not work with puzzles which
require backtracking, but could be easily modified to do
so.</p>
<h3>Code/Data:</h3>
<ul>
<li><a href="/res/towers/compute_information.py">Analysis script</a></li>
<li><a href="/res/towers/solve.py">Script to solve a Towers puzzle</a></li>
</ul>
Sat, 28 Jan 2017 00:00:00 +0000
/2017/01/28/which-hints-are-best-in-towers.html
Tags: puzzle, game, towers, algorithm, information-theory, Math

When should you leave for the bus?
<p>Anyone who has taken the bus has at one time or another wondered,
“When should I plan to be at the bus stop?” or more importantly, “When
should I leave if I want to catch the bus?” Many bus companies
suggest
<a href="http://www.matatransit.com/ridersguide/how-to-ride/">arriving</a>
<a href="http://www.riderta.com/howtoride">a</a>
<a href="http://routes.valleymetro.org/">few</a>
<a href="http://www.metrotransit.org/ride-the-bus">minutes</a>
<a href="http://atltransit.org/guide/tips/">early</a>, but there seem to be no
good analyses on when to leave for the bus. I decided to find out.</p>
<h2 id="finding-a-cost-function">Finding a cost function</h2>
<p>Suppose we have a bus route where a bus runs every \(I\) minutes, so if
you don’t catch your bus, you can always wait for the next bus.
However, since more than just your time is at stake for missing the
bus (e.g. missed meetings, stress, etc.), we assume there is a penalty
\(\delta\) for missing the bus in addition to the extra wait time.
\(\delta\) here is measured in minutes, i.e. how many minutes of your
time you would exchange to be guaranteed to avoid missing the bus.
\(\delta=0\) therefore means that you have no reason to prefer one bus
over another, and that you only care about minimizing your lifetime
bus wait time.</p>
<p>Assuming we will not be late enough to need to catch the third bus, we
can model this with two terms, representing the cost to you (in
minutes) of catching each of the next two buses, weighted by the
probability that you will catch that bus:</p>
\[C(t) = \left(E(T_B) - t\right) P\left(T_B > t + L_W\right) + \left(I + E(T_B) - t + \delta\right) P(T_B < t + L_W)\]
<p>where \(T_B\) is the random variable representing the time at which
the bus arrives, \(L_W\) is the random variable representing the
amount of time it takes to walk to the bus stop, and \(t\) is the time
you leave. (\(E\) is expected value and \(P\) is the probability.) We
wish to choose a time to leave the office \(t\) which minimizes the cost
function \(C\).</p>
<p>If we assume that \(T_B\) and \(L_W\) are Gaussian, then it can be shown that
the optimal time to leave (which minimizes the above function) is</p>
\[t = -\mu_W - \sqrt{\left(\sigma_B^2 + \sigma_W^2\right)\left(2\ln\left(\frac{I+\delta}{\sqrt{\sigma_B^2+\sigma_W^2}}\right)-2\ln\left(\sqrt{2\pi}\right)\right)}\]
<p>where \(\sigma_B^2\) is the variance of the bus arrival time,
\(\sigma_W^2\) is the variance of your walk, and \(\mu_W\) is the expected
duration of your walk. In other words, you should plan to arrive at
the bus stop on average \(\sqrt{\left(\sigma_B^2 + \sigma_W^2\right)\left(2\ln\left(\left(I+\delta\right)/\sqrt{\sigma_B^2+\sigma_W^2}\right)-2\ln\left(\sqrt{2\pi}\right)\right)}\) minutes before your bus arrives.</p>
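<p>As a sanity check, we can minimize \(C(t)\) numerically and compare the
result to the closed form. A minimal sketch, with the bus’ mean arrival
time set to zero and illustrative values assumed for the walk parameters
(only \(\sigma_B^2=5.7\) comes from the data below):</p>

```python
import math

# Sanity check of the closed-form optimum. The bus's mean arrival time
# is set to 0, so t < 0 means leaving before the bus's expected arrival.
# mu_W and var_W are illustrative assumptions; var_B = 5.7 is the value
# estimated from the shuttle data.
I, delta = 15.0, 0.0        # inter-bus interval and miss penalty (minutes)
mu_W = 5.0                  # mean walk duration (assumed)
var_B, var_W = 5.7, 4.0     # variances of bus arrival time and walk time
sig = math.sqrt(var_B + var_W)

def norm_cdf(x, mu, sigma):
    """CDF of a Gaussian, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def cost(t):
    # C(t) = (E[T_B] - t) + (I + delta) * P(T_B < t + L_W), where
    # T_B - L_W is Gaussian with mean -mu_W and variance var_B + var_W.
    return -t + (I + delta) * norm_cdf(t, -mu_W, sig)

# Closed-form optimal leaving time from the equation above.
t_closed = -mu_W - sig * math.sqrt(2 * math.log((I + delta) / sig)
                                   - 2 * math.log(math.sqrt(2 * math.pi)))
# Brute-force minimum of the cost function on a fine grid.
t_grid = min((-20 + 0.001 * k for k in range(20001)), key=cost)
```

<p>With these numbers, both methods give a leaving time of roughly
\(t \approx -8.6\): after the five-minute walk, you arrive about three and
a half minutes before the bus’ mean arrival time.</p>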
<p>Note that one deliberate oddity of the model is that the cost function
measures not just wait time but also walking time. I chose to optimize
this because, in the end, what matters is the total time you spend
getting onto the bus.</p>
<h2 id="what-does-this-mean">What does this mean?</h2>
<p>The most important factors to consider when choosing which bus
to take are the variability in the bus’ arrival time and the
variability in the time it takes you to walk to the bus stop. The time
by which you should arrive early scales approximately linearly with the
combined standard deviation of these two sources of variability.</p>
<p>Additionally, it scales approximately as the square root of the
logarithm of your value of time and of the interval between buses. So
even very high values of time and very infrequent buses do not
substantially change the time at which you should plan to arrive. For
approximation purposes, you might replace this term with a constant,
anywhere from 2-5 minutes depending on the frequency of
the bus.</p>
<h2 id="checking-the-assumption">Checking the assumption</h2>
<p>First, we need to collect some data to assess whether the bus
arrival time (\(T_B\)) is normally distributed. I wrote scripts to
scrape data from Yale University’s Blue Line campus shuttle route.
Many bus systems (including Yale’s) now have real-time arrival
predictions, so I used individual predictions from Yale’s system as
the expected arrival time, simulating somebody checking it to see
when the next bus comes.</p>
<p>For our purposes, the expected arrival time looks close enough to a
Gaussian distribution:</p>
<div>
<figure>
<center><img src="/res/bus/isnormal.png" /></center>
<figcaption class="imagecaption"><p>It actually looks like a Gaussian!</p>
</figcaption>
</figure>
</div>
<h2 id="so-what-time-should-i-leave">So what time should I leave?</h2>
<p>When estimating the \(\sigma_B^2\) parameter, we only examine bus
arrival times which are 10 minutes away or more. This is because if
the bus is too near in the future, you cannot use a real-time
prediction to plan when to leave, which defeats the purpose of the
present analysis. The variance in arrival time for the Yale buses is
\(\sigma_B^2=5.7\).</p>
<p>We use an inter-bus interval of \(I=15\) minutes.</p>
<p>While the variability of the walk to the bus stop \(\sigma_W^2\) is
unique to each person, I consider two cases: one where we assume
that the walking-time variability is small (\(\sigma_W^2=0\))
compared to the bus’ variability, representing the case where the bus
stop is (for instance) located right outside one’s office building. I
also consider the case where the walking-time variability is comparable
to the variability for the bus (\(\sigma_W^2=5\)), representing the case
where one must walk a long distance to the bus stop.</p>
<p>Finally, I consider the case where we strongly prioritize catching the
desired bus (\(\delta=60\) corresponding to, e.g., an important meeting)
and also the case where we seek to directly minimize the expected wait
time (\(\delta=0\) corresponding to, e.g., the commute home).</p>
<div>
<figure>
<center><img src="/res/bus/variants.png" /></center>
<figcaption class="imagecaption"><p>Even though the
shape of the optimization function changes greatly, the optimal
arrival time changes very little.</p>
</figcaption>
</figure>
</div>
<p>We can also look at a spectrum of different cost tradeoffs for missing
the bus (values of \(\delta\)) and variance in the walk time (values
of \(\sigma_W^2 = var(W)\)). Because they appear similarly in the
equations, we can also consider these values to be changes in the
interval of the bus arrival \(I\) and the variance of the bus’ arrival
time \(\sigma_B^2=var(B)\) respectively.</p>
<div>
<figure>
<center><img src="/res/bus/howearly.png" /></center>
<figcaption class="imagecaption"><p>Across all
reasonable values, the optimal time to plan to arrive is between 3.5
and 8 minutes early.</p>
</figcaption>
</figure>
</div>
<h2 id="conclusion">Conclusion</h2>
<p>So to summarize:</p>
<ul>
<li>If it always takes you approximately the same amount of time to walk
to the bus stop, plan to be there 3-4 minutes early on your commute
home, or 5-6 minutes early if it’s the last bus before an important
meeting.</li>
<li>If you have a long walk to the bus stop which can vary in duration,
plan to arrive at the bus stop 4-5 minutes early if you take the bus
every day, or around 7-8 minutes early if you need to be somewhere
for a special event.</li>
<li>These estimates assume that you know how long it takes you on
average to walk to the bus stop. As we saw previously, if you need
to be somewhere at a certain time, arriving a minute early is much
better than arriving a minute late. If you don’t need to be
somewhere, just make your best guess.</li>
<li>The best way to reduce waiting time is to decrease variability.</li>
<li>These estimates also assume that the interval between buses is
sufficiently large. If it is small, as in the case of a subway,
there are
<a href="http://erikbern.com/2016/07/09/waiting-time-math.html">different factors</a>
that govern the time you spend waiting.</li>
<li>This analysis focuses on buses with an expected arrival time, not
with a scheduled arrival time. When buses have schedules, they will
usually wait at the stop if they arrive early. This situation would
require a different analysis than what was performed here.</li>
</ul>
<h3>Code/Data:</h3>
<ul>
<li><a href="/res/bus/2016-07-28-data.csv">Data</a></li>
<li><a href="/res/bus/getyalebus.py">Data collection script</a></li>
<li><a href="/res/bus/analyzebus.py">Data analysis script</a></li>
</ul>
Mon, 01 Aug 2016 00:00:00 +0000
/2016/08/01/when-should-you-leave-for-the-bus.html
Tags: data, bus, stats, modeling, Data