In the video he argues that, based on the historical data, it is a bad idea to invest in exciting new technologies and the companies that create them. Historically, investors have overestimated the future growth of new innovative firms and underestimated how long it would take for dying industries to become irrelevant.
To illustrate the point, Ben uses the example of the declining rail industry. From 1900 through 2019, rail companies went from a 63% share of the US stock market to less than 1%, the ultimate example of a declining industry. Yet over that same period, rail stocks beat the US market as a whole, as well as innovative new transportation technologies like road transportation and air transportation stocks.
Investors had overestimated how quickly the railway companies would become obsolete, leading them to value those stocks too low. Similarly, they overestimated how well car and airplane companies would do, causing those stocks to become overvalued and deliver lower returns.
The moral of the story is that great companies are not necessarily great investments if you pay too much for them, and when new technologies come out investors get excited and do just that. Additionally, bad companies could be good investments if you can get them cheap enough.
Since the approximate start of the age of information in 1971, the software industry has grown more than any other, from basically non-existent in 1971, to the largest industry by market capitalization at the end of 2019 at nearly 15% of the US stock market. The oil industry on the other hand has seen a massive decline in market capitalization, from nearly 15% of the US market in 1971, to about 3% at the end of 2019. Over this period, a dollar invested in the oil index grew to $134, while a dollar invested in the software index grew to $76.
A second example is that you would have made more money holding oil stocks than technology stocks between 1971 and 2019. Most likely, investors are overestimating how quickly renewable energy will make oil obsolete, leading to oil stocks being undervalued and having higher returns.
So counterintuitively, it seems like you’re better off investing in cheap dying stocks over expensive growth stocks. In other words, value stocks (those with low multiples) outperform growth stocks (those with high multiples). This is a well-known phenomenon called the value premium and is the basis for value investing.
I really love podcasts. Not only do they provide great entertainment value as an alternative to audiobooks, but they are also one of the last open ecosystems on the web. Anyone can start a podcast by publishing an RSS feed on their website without having to rely on a central platform (thus nobody can “ban” your podcast). Once published, listeners can consume their favorite podcasts from any RSS reader, including many specially made for podcasts like PocketCasts and Overcast.
This arrangement is beneficial to creators because it gives them full freedom of expression without having to worry about the censors on platforms like YouTube, and it gives them complete freedom of choice on how to monetize their work. It is equally beneficial to consumers who get to choose among hundreds of independently developed podcast apps to find the one with the best features for them. If a consumer wants to switch podcast players they can also do so while taking their subscriptions with them.
However, over the last few years Spotify has been making moves that could threaten this open ecosystem.
Embrace
In 2019, Spotify acquired the podcast networks Gimlet Media, Anchor FM, and Parcast. However, they did not limit access to podcasts produced by those networks, so users could still listen using their client of choice.
Extend
In May 2020, Spotify announced that it had acquired an exclusive license to The Joe Rogan Experience (a popular comedy podcast) for $100 million. Starting in September 2020, Joe’s podcast will be removed from all 3rd party podcasting apps and made available only in Spotify’s own podcasts section.
If the Joe Rogan license is a commercial success then it seems likely that the shows from the other podcast networks that Spotify owns will also be made exclusive to their own apps.
Extinguish
If Spotify chooses to continue on their current path of exclusive content it will break interoperability with other podcast apps and force listeners of those shows to use the Spotify podcast client. I suspect that many listeners will also transfer their existing subscriptions into Spotify to avoid needing two separate podcast clients.
If Spotify gains enough market share then it will effectively become the de facto gatekeeper of podcasts (similar to how Google Play is the de facto gatekeeper of Android apps despite side loading and alternative app stores). Once that happens many of the benefits of podcasts will be destroyed. Creators will no longer have full creative freedom as they risk annoying the Spotify censors and having a large portion of their audience taken away from them. Consumers will no longer have choice in podcast clients if they want to listen to shows that are exclusive to Spotify.
I really hope that Spotify’s attempt to centralize the podcasting ecosystem around their apps is a colossal failure. However, the Embrace, Extend, and Extinguish strategy is quite effective, so I fear they may succeed.
As a small and feeble attempt to protest the direction Spotify is moving in, I have decided to cancel my Spotify Premium subscription.
I was recently wondering which of the popular web search engines provided the best results and decided to try to design an objective benchmark for evaluating them. My hypothesis was that Google would score the best, followed by StartPage (a Google aggregator) and then Bing and its aggregators.
Usually when evaluating search engine performance there are two methods I’ve seen used:
Have humans search for things and rate the results
Create a dataset of mappings between queries and “ideal” result URLs
The problem with having humans rate search results is that it is expensive and the results are hard to replicate. Creating a dataset of “correct” webpages to return for each query makes the experiment repeatable, but it is also expensive up front and subject to the biases of whoever creates the dataset.
Instead of using either of those methods I decided to evaluate the search engines on the specific task of answering factual questions from humans asked in natural language. Each engine is scored by how many of its top 10 results contain the correct answer.
Although this approach is not very effective at evaluating the quality of any single query, I believe that in aggregate, over thousands of queries, it should provide a reasonable estimate of how well each engine can answer users’ questions.
To source the factoid questions, I use the Stanford Question Answering Dataset (SQuAD) which is a popular natural language dataset containing 100k factual questions and answers from Wikipedia collected by Mechanical Turk workers.
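Pulling the question/answer pairs out of the SQuAD JSON is straightforward. Here is a minimal sketch assuming the v1.1 file layout (the file name is just a placeholder):
import json

# Load the SQuAD v1.1 JSON file (file name is a placeholder)
with open('squad-v1.1.json') as f:
    squad = json.load(f)

# Walk the nested structure: articles -> paragraphs -> question/answer pairs
qa_pairs = []
for article in squad['data']:
    for paragraph in article['paragraphs']:
        for qa in paragraph['qas']:
            # Use the first annotated answer for each question
            qa_pairs.append((qa['question'], qa['answers'][0]['text']))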
Here are some sample questions from the dataset:
Q: How did the black death make it to the Mediterranean and Europe?
A: merchant ships
Q: What is the largest city of Poland?
A: Warsaw
Q: In 1755 what fort did British capture?
A: Fort Beauséjour
Some of the questions in the dataset are also rather ambiguous such as the one below:
Q: What order did British make of French?
A: expulsion of the Acadian
This is because the dataset is designed to train question answering models that have access to the context that contains the answer. In the case of SQuAD each Q/A pair comes with the paragraph from Wikipedia that contains the answer.
However, I don’t believe this is a huge problem since most likely all search engines will perform poorly on those types of questions and no individual one will be put at a disadvantage.
Collecting data
To get the results from each search engine, I wrote a Python script that connects to Firefox via Selenium and performs searches just like regular users via the browser.
The first 10 results are extracted using CSS rules specific to each search engine and then those links are downloaded using the requests library. To check if a particular result is a “match” or not we simply perform an exact match search of the page source code for the correct answer (both normalized to lowercase).
Again this is not a perfect way of determining whether any single page really answers a query, but in aggregate it should provide a good estimate.
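The core of the matching step boils down to something like this sketch (the Selenium scraping and the per-engine CSS selectors are left out, and count_matches is just an illustrative name):
import requests

def count_matches(result_urls, answer):
    """Count how many of the top results contain the expected answer."""
    matches = 0
    for url in result_urls:
        try:
            page = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        # Exact substring match, with both sides normalized to lowercase
        if answer.lower() in page.lower():
            matches += 1
    return matches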
Some search engines are harder to scrape due to rate limiting. The most aggressive rate limiters were: Qwant, Yandex, and Gigablast. They often blocked me after just two queries (on a new IP) and thus there are fewer results available for those engines. Also, Cliqz, Lycos, Yahoo!, and YaCy were all added mid experiment, so they have fewer results too.
I scraped results for about 2 weeks and collected about 3k queries for most engines. Below is a graph of the number of queries that were scraped from each search engine.
Crunching the numbers
Now that the data is collected, there are lots of ways to analyze it. For each query we have the number of matching documents, and for the latter half of the queries the list of result links was also saved.
The first thing I decided to do was see which search engine had the highest average number of matching documents.
Much to my surprise, Google actually came in second to Ecosia. I was rather shocked by this since Ecosia’s gimmick is planting trees with its ad revenue, not beating Google’s search results.
Also surprising is the number of Bing aggregators (Ecosia, DuckDuckGo, Yahoo!) that all came in ahead of Bing itself. One reason may be that those engines each apply their own ranking on top of the results returned by Bing and some claim to also search other sources.
Below is a chart with the exact scores of each search engine.
To further understand why the Bing aggregators performed so well, I wanted to check how much of their own ranking was being used. I computed the average Levenshtein distance between each pair of search engines’ results pages, which is the minimum number of single-result edits (insertions, deletions, or substitutions) required to change one results page into the other.
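As a rough sketch, the distance between two results pages (each treated as an ordered list of URLs) can be computed with the standard dynamic programming recurrence:
def edit_distance(results_a, results_b):
    """Minimum insertions, deletions, or substitutions needed to turn one
    list of result URLs into the other."""
    m, n = len(results_a), len(results_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                dp[i][j] = j
            elif j == 0:
                dp[i][j] = i
            elif results_a[i - 1] == results_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],      # deletion
                                   dp[i][j - 1],      # insertion
                                   dp[i - 1][j - 1])  # substitution
    return dp[m][n]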
Of the three, Ecosia was the most different from pure Bing with an average edit distance of 8. DuckDuckGo was the second most different with edit distance of 7, followed by Yahoo! with a distance of 5.
Interestingly the edit distances of Ecosia, DuckDuckGo, and Yahoo! seem to correlate well with their overall rankings where Ecosia came in 1st, DuckDuckGo 3rd, and Yahoo! 5th. This would indicate that whatever modifications these engines have made to the default Bing ranking do indeed improve search result quality.
Closing thoughts
This was a pretty fun little experiment to do, and I am happy to see some different results from what I expected. I am making all the collected data and scripts available for anyone who wants to do their own analysis.
This study does not account for features besides search result quality such as instant answers, bangs, privacy, etc. and thus it doesn’t really show which search engine is “best” just which one provides the best results for factoid questions.
I plan to continue using DuckDuckGo as my primary search engine despite it coming in 3rd place. The results of the top 6 search engines are all pretty close, so I would expect the experience across them to be similar.
So I was recently asked why I prefer to use free and open source software over more conventional and popular proprietary software and services.
A few years ago I was an avid Google user. I was deeply embedded in the Google ecosystem and used their products everywhere. I used Gmail for email, Google Calendar and Contacts for PIM, YouTube for entertainment, Google Newsstand for news, Android for mobile, and Chrome as my web browser.
I would upload all of my family photos to Google Photos and all of my personal documents to Google Drive (which were all in Google Docs format). I used Google Domains to register my domain names for websites where I would keep track of my users using Google Analytics and monetize them using Google AdSense.
I used Google Hangouts (one of Google’s previous messaging plays) to communicate with friends and family and Google Wallet (with debit card) to buy things online and in-store.
My home is covered with Google Homes (1 in my office, 1 in my bedroom, 1 in the main living area) which I would use to play music on my Google Play Music subscription and podcasts from Google Podcasts.
I have easily invested thousands of dollars into my Google account to buy movies, TV shows, apps, and Google hardware devices. This was truly the Google life.
Then one day, I received an email from Google that changed everything.
“Your account has been suspended”
Just the thing you want to wake up to in the morning. An email from Google saying that your account has been suspended due to a perceived Terms of Use violation. No prior warning. No appeals process. No number to call. Trying to sign in to your Google account yields an error and all of your connected devices are signed out. All of your Google data, your photos, emails, contacts, calendars, purchased movies and TV shows. All gone.
I nearly had a heart attack, until I saw that the Google account that had been suspended was in fact not my main personal Google account, but a throwaway Gmail account that I created years prior for a project. I hadn’t touched the other account since creation and forgot it existed. Apparently my personal Gmail was listed as the recovery address for the throwaway account and that’s why I received the termination email.
Although I was able to breathe a sigh of relief this time, the email was a wake-up call. I was forced to critically reevaluate my dependence on a single company for all the tech products and services in my life.
I found myself to be a frog in a heating pot of water and I made the decision that I was going to jump out.
Leaving Google
Today there are plenty of lists on the internet providing alternatives to Google services, such as this and this. However, the “DeGoogle” movement was still in its infancy when I was making the move.
The first Google service I decided to drop was Gmail, the heart of my online identity. I migrated to Fastmail with my own domain in case I needed to move again (hint: glad I did, now I self host my email). Fastmail also provided calendar and contacts solutions so that took care of leaving Google Calendar and Contacts.
Migrating away from Google was not a fast or easy process. It took years to get where I am now and there are still several Google services that I depend on: YouTube and Google Home.
Eventually, my Google Homes will grow old and become unsupported, at which point hopefully the Mycroft devices will have matured and become available for purchase. YouTube may never be replaced (although I do hope projects like PeerTube succeed), but I find the compromise of using only one or two Google services to be acceptable.
At this point losing my Google account due to a mistake in their machine learning would largely be inconsequential and my focus has shifted to leaving Amazon which I use for most of my shopping and cloud services.
The reason that I moved to mostly FOSS applications is that it seems to be the only software ecosystem where everything works seamlessly together and I don’t have to cede control to any single company. Alternatively I could have simply split my service usage up evenly across Google, Microsoft, Amazon, and Apple but I don’t feel that they would have worked as nicely together.
Overall I’m very happy with the open source ecosystem. I use Ubuntu with KDE on all of my computers and Android (no GApps) on my mobile phone. I’ve ordered the PinePhone “Brave Heart” and hope to one day be able to use it or one of its successors as a daily driver with Ubuntu Touch or Plasma Mobile.
I don’t want to give the impression that I exclusively use open source software either, I do use a number of proprietary apps including: Sublime Text, Typora, and Cloudron.
If you’ve decided to move to another email provider it’s possible to take all of your old emails and folders with you. The easiest way I’ve found to do this is using the mail client Mozilla Thunderbird.
Thunderbird new account dialog. File > New > Existing mail account.
With Thunderbird installed, sign in to both your old and new email accounts. This is provider dependent, but in general if you are using a popular email service like Gmail, Yahoo, Outlook, etc. then Thunderbird can auto-discover the IMAP and SMTP settings. If you have two-factor authentication set up on your email account you may need to create an app password.
If you are unsure, here are the instructions for a few popular services:
When you set up your old account make sure you set Thunderbird to download the entire email history not just the last few months.
Account settings where you can set how many emails Thunderbird will download. Edit > Account Settings.
Once you are signed in to both accounts you should see all of your emails and folders in the old account. You may want to wait for Thunderbird to finish downloading emails if necessary.
To move emails, simply select the inbox of your old mail account, use Ctrl + A to select all the emails, then drag them to the new inbox. You will also need to drag each of the folders from the old email account to the new one.
If you’d like to just move a couple of emails you can select them individually and drag them to the new email account.
I recently took the CICS Make course available at the University of Massachusetts Amherst with my colleague Hannah Dacayanan. At the end of the course we were required to build a final project that involved interacting with the physical world using computers.
We decided to build a robot cat. More specifically, a small motorized car that would follow a laser emitted from a laser pointer around a flat surface. This was inspired by popular viral online videos of real cats trying to pounce on red dots from laser pointers.
Hardware
The first thing that needed to be done was to get all the hardware and put it together. The four main components used were a car kit provided to us by the class, a Raspberry Pi 4, the Raspberry Pi camera, and the L298N motor driver.
The car kit with motors and L298N motor driver attached. Raspberry Pi 4 and camera sitting on top.
Hannah assembled the car kit over our Thanksgiving break and I attached the L298N driver and Raspberry Pi via the Pi’s onboard GPIO pins.
The L298N supports two motors simultaneously; each motor’s positive and negative leads connect to the outputs shown on the diagram. The Raspberry Pi enables each motor by supplying power to the pins labeled “A Enable” and “B Enable”, and controls its direction and speed by sending signals to the four pins between the enable pins. The top two control motor A and the bottom two motor B.
The direction of the motor is controlled by which pins are active and the speed of the motor is controlled by PWM on the active pin.
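As a minimal sketch (using the left motor’s pins listed below; the PWM frequency and duty cycle are arbitrary values for illustration), spinning one motor forward with RPi.GPIO looks something like this:
import RPi.GPIO as GPIO

# Left motor channel: enable on pin 11, forward on 13, backward on 15
GPIO.setmode(GPIO.BOARD)
GPIO.setup([11, 13, 15], GPIO.OUT, initial=GPIO.LOW)

GPIO.output(11, GPIO.HIGH)  # enable the channel
pwm = GPIO.PWM(13, 100)     # PWM on the "forward" pin at 100 Hz
pwm.start(75)               # duty cycle sets the speed (~75% here)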
The Raspberry connected to the L298N motor driver via GPIO.
We used six GPIO pins from the Raspberry Pi to control the motors, the first three (11, 13, 15) for the left, and the last three (19, 21, 23) for the right.
At this point the hardware is completely done.
Software
For the cat to follow the laser, we needed some software on the Raspberry Pi to take a frame from the camera and tell us where in that frame the laser is, if at all.
There are two possible approaches: a deep learning model or a hand-crafted algorithm. We opted to try the deep learning approach first.
Lasernet
To build a neural network that can both recognize and localize lasers in an image, we decided to use TensorFlow. We didn’t want to have to label tons of training data, and generating synthetic data yielded poor results, so instead we went with a semi-supervised network. Lasernet takes a frame of video as input and outputs the likelihood of a laser existing in the image. The network uses an attention mechanism, which is where we will get our localization from.
First let’s import everything
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (
Input,
Conv2D,
Activation,
Reshape,
Flatten,
Lambda,
Dense,
)
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow.keras.backend as K
Then we can define some global settings that are useful for us to use in the network
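They look something like this (the 128×128 resolution and the DEPTH of zero are mentioned below; the other values are representative rather than exact):
# Global model settings
RESOLUTION = 128  # input frames are RESOLUTION x RESOLUTION
CHANNELS = 3      # color channels in each frame (assumed RGB)
FILTERS = 32      # filters per encoder convolution (representative value)
KERNEL = (3, 3)   # convolution kernel size (representative value)
DEPTH = 0         # number of extra encoder convolutions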
Now, we get to actually building the network. The first layer is the input for our image at the resolution specified in the settings (currently 128×128), then the second is a 2d convolutional layer using the number of filters and kernel specified in the settings.
‘same’ padding is used to keep the output of the convolutional layer the same shape as its input. This is important for when the network outputs a probability distribution for each pixel in the attention mechanism.
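In Keras that looks roughly like this:
# Input layer for a single camera frame
inputs = Input(shape=(RESOLUTION, RESOLUTION, CHANNELS), name='frame_input')

# First encoder convolution; 'same' padding keeps the 128x128 spatial shape
encoder = Conv2D(FILTERS, KERNEL, activation='relu', padding='same',
                 name='encoder_conv_0')(inputs)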
You could optionally add more convolutional layers to the network with the following code
for i in range(DEPTH):
encoder = Conv2D(FILTERS, KERNEL, activation='relu', padding='same', name=f'encoder_conv_{i + 1}')(encoder)
In our production model settings, though, ‘DEPTH’ is set to zero, which just uses the first convolutional layer.
Next, we write the attention mechanism. Attention shows us where the neural network “looks” to determine whether there exists a laser in the image or not. In theory, the pixel with the highest attention weight should be where the laser is.
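Here is roughly what those layers look like (a sketch: the attention convolution uses a single filter so it outputs one weight per pixel, and the weights are multiplied against the input frame):
# Attention Conv: learn a single-channel "interest" map over the frame
attention_conv = Conv2D(1, KERNEL, padding='same',
                        name='attention_conv')(encoder)

# Attention Flatten: 128x128x1 -> 16,384-long vector
attention_flatten = Flatten(name='attention_flatten')(attention_conv)

# Attention Softmax: per-pixel weights between 0 and 1 that sum to 1
attention_softmax = Activation('softmax',
                               name='attention_softmax')(attention_flatten)

# Attention Reshape: back to the input resolution
attention_reshape = Reshape((RESOLUTION, RESOLUTION, 1),
                            name='attention_reshape')(attention_softmax)

# Attention Output: weight each input pixel by its attention weight
# (the single-channel weights broadcast over the color channels)
attention_output = Lambda(lambda t: t[0] * t[1],
                          name='attention_output')([inputs, attention_reshape])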
There is a lot going on in that code, but basically here’s what each layer does:
Attention Conv: takes an image or output of another convolutional layer and transforms it in some way that is learned by the neural network. In this case it will output a 128×128 matrix.
Attention Flatten: flattens the 128×128 matrix into a 16,384 item vector.
Attention Softmax: applies softmax activation to the vector and outputs another 16,384 item long vector with values between 0 and 1 that sum to 1. The i’th item of this vector is the weight of the i’th pixel in the input.
Attention Reshape: reshapes the softmax vector back to the input resolution.
Attention Output: multiplies the pixel weights by the pixels element-wise. Pixels with higher weights are preserved while those with lower weights are suppressed.
Now, we just need to get all of the outputs setup.
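A sketch of the output heads:
# Classifier 1 Flatten: attention weights back to a 16,384-long vector
classifier_1_flatten = Flatten(name='classifier_1_flatten')(attention_reshape)

# Classifier 2 Flatten: the attention-weighted image as a vector
classifier_2_flatten = Flatten(name='classifier_2_flatten')(attention_output)

# Classifier 1: the maximum attention weight, used directly as the
# probability that a laser is present
classifier_1 = Lambda(lambda v: K.max(v, axis=-1, keepdims=True),
                      name='classifier_1')(classifier_1_flatten)

# Classifier 2: a dense layer that learns to "see" the laser in the
# attention-weighted image
classifier_2 = Dense(1, activation='sigmoid',
                     name='classifier_2')(classifier_2_flatten)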
Classifier 1 Flatten: converts the attention weight matrix to a vector again (this is equivalent to the Attention Softmax layer)
Classifier 2 Flatten: converts the output of the attention mechanism to a vector
Classifier 1: Outputs the maximum probability of the attention weights
Classifier 2: Uses a general feed-forward dense layer to learn how to “see” a laser.
Both classifiers will be trained to predict whether there is a laser in the input image, outputting either a 1 or a 0. Classifier 1 forces the attention mechanism to produce a weight greater than 0.5 when a laser exists and to produce all weights less than 0.5 when one does not. Classifier 2 is used for laser detection in production.
Finally, there is one last part of the network. We have it try to reconstruct the original image from just the attention weights. The idea here is that the easiest thing for the network to reconstruct should be the laser (since that’s the only thing in common between all images) which should encourage the attention mechanism to highlight that in the weights.
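A minimal version of that reconstruction output is a convolution over the attention weights (the real decoder could be deeper; this sketch assumes pixel values normalized to [0, 1]):
# Reconstruction: rebuild the frame from the attention weights alone
reconstruction = Conv2D(CHANNELS, KERNEL, padding='same',
                        activation='sigmoid',
                        name='reconstruction')(attention_reshape)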
The last thing needed is to compile the model. We use binary cross entropy for the two classifiers and mean squared error for the reconstruction loss. The optimizer is Adam.
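Putting it all together, the compile step looks something like this:
# Assemble the full model and attach one loss per output
model = Model(inputs=inputs,
              outputs=[classifier_1, classifier_2, reconstruction])
model.compile(optimizer='adam',
              loss={'classifier_1': 'binary_crossentropy',
                    'classifier_2': 'binary_crossentropy',
                    'reconstruction': 'mse'})
model.summary()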
I won’t go over the code for loading and reading the training data, but you can find the complete training script here. That said, there are a few interesting things we did in preprocessing.
Recall that, in production, Lasernet is fed a continuous stream of frames from the Raspberry Pi camera. What we do is take the average of the previous 10 frames and diff that with the current frame, then send the diff to Lasernet instead of the raw frame.
This produces images where anything that is not really moving tends to be blacked out.
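In code, the preprocessing is a small function like this sketch (the frame format and the [0, 1] normalization are assumptions):
import collections

import numpy as np

# Rolling window of the last 10 frames
history = collections.deque(maxlen=10)

def preprocess(frame):
    """Return the diff between this frame and the average of the last 10."""
    frame = frame.astype(np.float32) / 255.0
    if history:
        diff = np.abs(frame - np.mean(np.stack(history), axis=0))
    else:
        diff = np.zeros_like(frame)
    history.append(frame)
    return diff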
Train it
It’s training time! I trained Lasernet on my Nvidia GeForce GTX 1060 6GB for a day or so and here is the result.
The white dot is the network’s prediction for each frame, and the red circle is the moving average of predictions.
Catcarlib
Now that the neural network was done, we needed a library for actually driving the car using the GPIO pins on the Raspberry Pi. For this purpose, we created catcarlib.
First we use the Raspberry Pi GPIO Python library to help us in controlling the pins on the board. Let’s import that.
import RPi.GPIO as GPIO
Next, we need to initialize GPIO with the correct pins that are connected to the car. In our case those were:
Left Motor Power: 11
Left Motor Forward: 13
Left Motor Backward: 15
Right Motor Power: 23
Right Motor Forward: 21
Right Motor Backward: 19
channels = [
{'label': 'LEFT_MOTOR_POWER', 'pin': 11},
{'label': 'LEFT_MOTOR_FORWARD', 'pin': 13},
{'label': 'LEFT_MOTOR_BACKWARD', 'pin': 15},
{'label': 'RIGHT_MOTOR_POWER', 'pin': 23},
{'label': 'RIGHT_MOTOR_FORWARD', 'pin': 21},
{'label': 'RIGHT_MOTOR_BACKWARD', 'pin': 19},
]
GPIO.setmode(GPIO.BOARD)
GPIO.setup([i['pin'] for i in channels], GPIO.OUT, initial=GPIO.LOW)
state = [False for i in channels]
To review, we set the GPIO mode to use the pin numbers on the board. Then we set up each pin in channels and initialize it to 0. The state list is used to keep track of the current state of each pin.
Now it would be good to write some helper functions for things like resetting all the pins, getting the index of each action in the state, and enabling pins.
def reset(state):
for i in range(len(state)):
state[i] = False
def getIndexFromLabel(label):
for i, channel in enumerate(channels):
if channel['label'] == label:
return i
return None
def commit(state):
GPIO.output([i['pin'] for i in channels], state)
def enableByLabel(state, labels):
for label in labels:
state[getIndexFromLabel(label)] = True
reset: resets the states to all zero
getIndexFromLabel: gets the index of a particular action in the state list
commit: sends the current state to the pins
enableByLabel: enables a list of actions
At last, we can write the functions to actually move the car. Below is the function for moving the car forward. First it resets the state to a blank slate of all zeros. Then it enables the power for both motors and the forward pins. Finally, it commits the changes to the GPIO pins.
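Using the helpers above, it looks something like this:
def forward(state):
    # Start from a blank slate of all zeros
    reset(state)
    # Power both motors and set both to spin forward
    enableByLabel(state, [
        'LEFT_MOTOR_POWER', 'LEFT_MOTOR_FORWARD',
        'RIGHT_MOTOR_POWER', 'RIGHT_MOTOR_FORWARD',
    ])
    # Push the new state out to the GPIO pins
    commit(state)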
The functions for left, right, and backwards can be written in much the same way. For left we want the left wheel to go forward and the right wheel to go backward. Going right is the opposite of left, and going backward is the opposite of going forward.
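For example, a left turn is just a different set of labels:
def left(state):
    # Left wheel forward, right wheel backward
    reset(state)
    enableByLabel(state, [
        'LEFT_MOTOR_POWER', 'LEFT_MOTOR_FORWARD',
        'RIGHT_MOTOR_POWER', 'RIGHT_MOTOR_BACKWARD',
    ])
    commit(state)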
Again you can see the full catcarlib.py on GitHub.
Putting everything together
So, now we have all the hardware ready, Lasernet trained, and catcarlib to control the car. Let’s see how it does.
To be honest with you, I was a little disappointed with the performance at this point. I was just hoping for more.
I tried a number of different things to improve the performance, mostly sanity checks to reduce false positives: checking the color of the pixel where the network predicted the laser to be, or the Euclidean distance between the prediction and the image’s brightest pixel.
Ultimately nothing worked well enough to bring the performance to where I wanted it to be.
Catseelib
Since the neural network approach didn’t seem to be working out, I decided to try to come up with a handcrafted algorithm that could do the job. I’ll call it catseelib.
The approach of taking the diff between the current frame and the average of the last 10 was pretty useful, so let’s start there. Then we can just take the pixel with the highest magnitude in the diff. That should be the laser if there is minimal background noise.
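In NumPy that idea is only a few lines (this sketch reuses the same frame-diff preprocessing as before):
import numpy as np

def find_laser(diff):
    """Return the (row, col) of the highest-magnitude pixel in the diff."""
    magnitude = np.abs(diff)
    if magnitude.ndim == 3:
        # Collapse the color channels into a single 2D magnitude map
        magnitude = magnitude.sum(axis=-1)
    return np.unravel_index(np.argmax(magnitude), magnitude.shape)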
To make sure that the only thing in the diff was the laser, I pointed the camera straight down under the head of the cat. Let’s see how well that works.
A few weeks ago my OnePlus 3 camera stopped focusing properly, meaning I could no longer take photos of text or scan QR codes. After doing a bit of research on the OnePlus community forums I found that this was likely an issue with the hardware stabilization of the camera getting stuck.
If your stabilizer is stuck you may be able to fix it by just giving your phone a good shake. However, the solution for me was to take a refrigerator magnet and move it over the external camera module several times (you should hear the camera moving back and forth inside).
If the magnet solution does not work, it seems like some people have had success opening their phone and placing a metal object like a paperclip or staple next to the camera module. Below is a video I found on YouTube of someone doing so. That said, I believe that opening your phone will void your warranty, so if you are near a OnePlus service center I would recommend you give them a visit to see if they can fix your problem first.
After purchasing audiobooks on Audible you may want to store the files on your computer in case Amazon decides to pull the books later on. Audible allows you to download encrypted copies of your books from your account library.
Clicking on the “Download” link for any audiobook will download a .aax file to your computer. This file contains audio data that has been encrypted using a 4-byte key unique to your Audible account. Because the key is so short it is trivial to break it using brute force and there is plenty of software available specifically for that purpose. In this blog post, I’ll be covering two ways to decrypt the file.
OpenAudible
OpenAudible is a free, open-source graphical program available for Linux, Windows, and macOS. It’s specifically designed to remove DRM from your Audible files and hides a lot of the complexity.
EDIT: OpenAudible appears to have become closed source and paid software. You can buy it if you want or try to find an old version, but see below for a free method.
Once you install OpenAudible from its website you can drag and drop the .aax files you downloaded from Audible into it. They will show up in a list at the bottom of the window.
With your audiobooks loaded select them (Ctrl + A) and right-click to select “Convert to MP3”.
OpenAudible will convert each of your audiobooks to a DRM-free mp3 file and save them in the ~/OpenAudible folder on your computer. If you can’t find the mp3 files then right-click one of the books and select “Show MP3”.
One nice thing about OpenAudible over the FFMPEG method is that the book’s metadata (author, reader, publisher, etc.) will be preserved in the resulting mp3 file.
FFMPEG
ffmpeg is a popular free and open-source command line utility for processing video and audio. It can decrypt the Audible DRM but requires you to input the specific 4-byte encryption key unique to your Audible account. You can brute force your downloaded .aax files (you only need to get the key from one, and it will work for the others) using this website.
Once you’ve gotten your key you can use it to convert your .aax files to mp3s with ffmpeg. A typical invocation looks something like the following (replace XXXX with your key; the file names are just placeholders):
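ffmpeg -activation_bytes XXXX -i audiobook.aax -vn -c:a libmp3lame audiobook.mp3
The -vn flag drops the embedded cover art stream and -c:a libmp3lame re-encodes the audio as mp3.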
I was recently tasked with solving a fun epidemiology puzzle for one of my university classes. Below is an excerpt from the assignment describing the scenario.
11 people get sick enough to go to a local hospital with severe diarrhea and vomiting that lasts four days or so in each patient. All the patients turn out to all have the same strain of norovirus.
It turned out that they all knew each other and over the summer had been sharing produce from their gardens. The nurse’s hypothesis was that one person had norovirus, and had transmitted the virus to others on the food. She made a list, numbered the patients, starting with the patient that had first shared, and who they had shared with. It turned out a total of 16 people had shared produce, so she contacted the additional people who had not gotten sick, and asked them who they had shared produce with and when. In the end, she came up with the list below. So, patient 1 first shared vegetables with patient 12, then with patient 14. Patient 2 first shared vegetables with patient 5, then with patient 15, and so on. And patient 1 never got ill, while patient 2 did. Any time that two people come in contact with each other, the virus can move either way. For example, it would be possible for patient 2 to have infected patient 5, or patient 5 to infect patient 2.
After studying the list, she said, “I know who started this!” She asked that patient where they had been recently and it turned out they’d been on a cruise ship that had had a severe outbreak of norovirus! Based on her data, which patient was the one who went on the cruise and started the epidemic?
Data
Below is the dataset of patients and who they met with.
Patient  Meetings  Sick
1        12,14     FALSE
2        5,15      TRUE
3        6,16      TRUE
4        1,7,11    TRUE
5        10,3,16   FALSE
6        13,2      FALSE
7        2,8       TRUE
8        3,10      TRUE
9        15,5      TRUE
10       9         TRUE
11       14        TRUE
12       13,15     FALSE
13       16,3      TRUE
14       9         TRUE
15       16,5      FALSE
16       9         TRUE
Rules
So based on the above passage we can derive some simplified rules to use in solving the puzzle
Meetings happen sequentially going left to right
Rows are in chronological order going top to bottom
We don’t move onto the next row until all the meetings of the current row are complete
Each meeting has only two people
When two people meet the disease can go either direction
Solution
Theory
To solve this we need to find an algorithm that can identify patient zero based on their interactions with others. At first, I thought about using a graph-based approach to model each meeting, but the temporal nature of the data makes that untenable. Instead, I opted for a much simpler and intuitive approach that takes advantage of the ordering.
Since we know the precise sequence in which meetings occurred and we know that each meeting contains only two people, we can generate a list of interaction tuples from the dataset. For example, 1 meets with 12, then 14, and then 2 meets with 5. So we could have a list like so:
[(1, 12), (1, 14), (2, 5)]
Once we have our sequential list of interactions we can iterate through them to simulate the effect of any given individual being patient zero. Then it’s just a game of trial and error trying out different possible patients. If we find a contradiction in our simulation based on the data we were given (i.e. someone gets sick in the simulation but was healthy in the table) then we know that our guess for patient zero was wrong and can move on to the next one. But if we get all the way to the end of the simulation and everyone who was supposed to get sick is sick and everyone who was supposed to be healthy is healthy, then we found our culprit.
Code
So now that we have a game-plan, we just need to code it up and find out who got everyone sick.
We’ll be using the Pandas python library for working with our table.
import pandas as pd
I’ve placed the table into a CSV file called data.csv which we’ll open as a Pandas DataFrame.
# Load the dataset into a DataFrame
df = pd.read_csv('data.csv')
We need to get the list of interactions in chronological order from the table. To do this I’ll use a Python generator to iterate over the rows, and for each row I’ll split the meetings up and yield them.
def get_interactions():
# Iterate through the rows of the DataFrame
for index, row in df.iterrows():
# Get each meeting in order
for meeting in row['Meetings'].split(','):
# Yield the interaction
yield {row['Patient'], int(meeting)}
Great, so far so good. All we have left is the actual simulation to write. For this, what we’ll do is keep a set of patients who are sick in the simulation. It will start with just our guess for patient zero and grow as they interact with others.
When we iterate through the interactions there are three possible situations that can happen:
The interaction has no sick people in it
The interaction has one sick person and one healthy person
The interaction has two sick people
If the interaction has no sick people or two sick people then we just move along to the next one. But if it has one sick person and one healthy person then we need to make the healthy person “sick” by adding them to the sick_people set. However, before doing that we check with our real data to see if the healthy person was recorded as being sick. If they were then they get added and we keep going, but if they are supposed to be healthy then we know that our hypothesis was wrong and can return False.
Finally, if we make it through all of the iterations without invalidating our hypothesis then it must be true and we return True.
# Test if our patient zero hypothesis is correct
def test_hypothesis(id, interactions):
# A set of sick people in the simulation
# starts with just patient zero
sick_people = {id}
# Iterate over the interactions in
# chronological order
for interaction in interactions:
# Check if the interaction has at least
# one sick person in it
if sick_people.intersection(interaction):
# If there is a sick person then
# check if everyone in this interaction
# was supposed to get sick.
for person in interaction:
# If they were then add them to the set
if df[df['Patient'] == person]['Sick'].bool():
sick_people.add(person)
# If they weren't then we are done and
# can return False
else:
return False
return True
Alright, well that’s it! Just add a few more lines of code to run our functions and let’s see who it was.
# Get list of interactions
interactions = list(get_interactions())
# Iterate through the 16 candidates
for candidate in range(1, 17):
# Check if our guess is correct
if test_hypothesis(candidate, interactions):
# Yay! We found them.
print('It was {}!'.format(candidate))
break
Run it!
(env) kyle@IntelNuc:~/Code/Python/Virus Spread$ python who_did_it.py
It was 7!
Well, it looks like it was patient #7 who got everyone sick. Mystery solved.
I decided to write another script using similar code to produce a tree diagram of the infection, which is pictured below. As you can see, the norovirus does indeed start with patient #7 and moves to all of the other sick people from there. I encourage you to follow the path of the tree through the table to confirm the results for yourself.
Can you find a pattern that returns to its starting point after more than two time steps?
What’s the longest you can see a pattern go without repeating a configuration?
To answer these questions I decided to build my own implementation of Conway’s Game of Life in Python to brute force all possible starting positions.
"""
Kyle's Game of Life Implementation
1) live cells die if they have 0, 1, or 4+ neighbors
2) empty cells have a birth if they have exactly three neighbors
"""
import numpy as np
# Create a blank board
board = np.zeros((5, 5))
def iterate(board):
"""
This function takes the current board state and returns the next state.
"""
conv_board = np.zeros((7, 7))
conv_board[1:6, 1:6] = board
conv = np.lib.stride_tricks.as_strided(
conv_board,
(5, 5, 3, 3), # view shape
(56, 8, 56, 8) # strides
)
# The new board
b = np.zeros((5, 5))
for i in range(5):
for j in range(5):
# Count the number of neighbor live cells
if conv[i, j, 1, 1] == 1:
# Subtract itself from total count
b[i, j] = conv[i, j].sum() - 1
else:
b[i, j] = conv[i, j].sum()
# Cells with 0, 1, or 4+ die
b[np.any([b <= 1, b >= 4], axis=0)] = 0
# Living cells with 2 neighbors get to keep living
b[np.all([b == 2, board == 1], axis=0)] = 1
# Dead cells with 2 neighbors stay dead
b[np.all([b == 2, board == 0], axis=0)] = 0
# All cells with 3 neighbors live
b[b == 3] = 1
# Return the new board state
return b
if __name__ == '__main__':
while input('Continue? [y/n] ') == 'y':
print(board)
board = iterate(board)
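The interactive loop above only steps a single board. The actual brute-force run isn’t shown here, but enumerating every starting position and counting steps until a configuration repeats could look something like this sketch (it reuses iterate and numpy from above, and it is slow):
def run_game(start):
    """Play one game until a configuration repeats; return the steps taken."""
    board = start
    seen = {board.tobytes()}
    steps = 0
    while True:
        board = iterate(board)
        steps += 1
        key = board.tobytes()
        if key in seen:
            return steps
        seen.add(key)

# Enumerate all 2**25 starting positions of the 5x5 board
longest = 0
total_steps = 0
for n in range(2 ** 25):
    start = np.array([(n >> i) & 1 for i in range(25)],
                     dtype=float).reshape(5, 5)
    steps = run_game(start)
    total_steps += steps
    longest = max(longest, steps)

print('average steps:', total_steps / 2 ** 25)
print('longest game:', longest)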
Results
It took approximately two hours to play all 2^25 = 33,554,432 possible starting positions. There were 300,477,379 total steps taken, with an average of 8.95 steps per game. The game with the longest period was 39 steps.