For the past few years I’ve been hosting my blog on Linode, but I’ve decided to switch over to WordPress.com on their “Personal” plan since it was a bit cheaper.
So far everything seems to be working well, and I was able to migrate all of the posts, pages, and comments from my previous WordPress installation. However, WordPress.com doesn’t support the www subdomain, which is annoying, so I either have to use the naked domain (kylepiira.com) or another subdomain like blog.kylepiira.com. For now, I’ve opted for the naked domain, although I think it looks much uglier than the version with www.
Typical web developers, breaking something that’s worked for the last 30 years in favor of the new hotness.
I also had to disable the AMP feature which is enabled by default by going through Settings → General → Performance.
In the video, he argues that, based on the historical data, it is a bad idea to invest in exciting new technologies and the companies that create them. Historically, investors have overestimated the future growth of new innovative firms and underestimated how long it would take for dying industries to become irrelevant.
To illustrate the point, Ben uses the example of the declining rail industry, the ultimate example of a dying industry. From 1900 through 2019, rail companies declined from a 63% share of the US stock market to less than 1%. Yet over that same period, rail stocks beat not only road and air transportation stocks (the innovative new technologies of the day) but the US market as a whole.
Investors had overestimated how quickly the railway companies would become obsolete, leading them to value those stocks too low. Similarly, they overestimated how well car and airplane companies would do, causing those stocks to become overvalued and deliver lower returns.
The moral of the story is that great companies are not necessarily great investments if you pay too much for them, and when new technologies come out, investors get excited and do just that. Conversely, bad companies can be good investments if you can buy them cheaply enough.
Since the approximate start of the age of information in 1971, the software industry has grown more than any other, from basically non-existent in 1971, to the largest industry by market capitalization at the end of 2019 at nearly 15% of the US stock market. The oil industry on the other hand has seen a massive decline in market capitalization, from nearly 15% of the US market in 1971, to about 3% at the end of 2019. Over this period, a dollar invested in the oil index grew to $134, while a dollar invested in the software index grew to $76.
A second example: you would have made more money holding oil stocks instead of technology stocks between 1971 and 2019. Most likely, investors are overestimating how quickly renewable energy will make oil obsolete, leading to oil stocks being undervalued and earning higher returns.
So, counterintuitively, it seems like you’re better off investing in cheap dying stocks over expensive growth stocks. In other words, value stocks (those with low multiples) outperform growth stocks (those with high multiples). This is a well-known phenomenon called the value premium and is the basis for value investing.
So earlier today I was trying out Microsoft’s online office suite and noticed something interesting. Whenever you create a new Word, Excel, or PowerPoint file from the OneDrive interface, it is automatically created in the OpenDocument file format (odt, ods, odp) rather than the Microsoft Office format (docx, xlsx, pptx). Interestingly, if you create it from Office.com, it uses the Microsoft format instead.
I really love podcasts. Not only do they provide great entertainment value as an alternative to audiobooks, but they are also one of the last open ecosystems on the web. Anyone can start a podcast by publishing an RSS feed on their website without having to rely on a central platform (thus nobody can “ban” your podcast). Once published, listeners can consume their favorite podcasts from any RSS reader, including many made specifically for podcasts like PocketCasts and Overcast.
This arrangement is beneficial to creators because it gives them full freedom of expression without having to worry about the censors on platforms like YouTube, and it gives them complete freedom of choice on how to monetize their work. It is equally beneficial to consumers who get to choose among hundreds of independently developed podcast apps to find the one with the best features for them. If a consumer wants to switch podcast players they can also do so while taking their subscriptions with them.
However, over the last few years Spotify has been making moves that could threaten this open ecosystem.
Later in 2019, Spotify acquired the podcast networks Gimlet Media, Anchor FM, and Parcast. However, they did not limit access to podcasts produced on those networks, so users could still listen using their client of choice.
In May 2020, Spotify announced that it had acquired an exclusive license to The Joe Rogan Experience (a popular comedy podcast) for $100 million. Starting in September 2020, Joe’s podcast will be removed from all 3rd-party podcasting apps and made available only in Spotify’s own podcasts section.
If the Joe Rogan license is a commercial success then it seems likely that the shows from the other podcast networks that Spotify owns will also be made exclusive to their own apps.
If Spotify chooses to continue on its current path of exclusive content, it will break interoperability with other podcast apps and force listeners of those shows to use the Spotify podcast client. I suspect that many listeners will also transfer their existing subscriptions to Spotify to avoid needing two separate podcast clients.
If Spotify gains enough market share, it will effectively become the de facto gatekeeper of podcasts (similar to how Google Play is the de facto gatekeeper of Android apps despite sideloading and alternative app stores). Once that happens, many of the benefits of podcasts will be destroyed. Creators will no longer have full creative freedom, as they risk annoying the Spotify censors and having a large portion of their audience taken away from them. Consumers will no longer have a choice in podcast clients if they want to listen to shows that are exclusive to Spotify.
I really hope that Spotify’s attempt to centralize the podcasting ecosystem around their apps is a colossal failure; however, the Embrace, Extend, and Extinguish strategy is quite effective, and thus I fear they may succeed.
As a small and feeble protest against the direction Spotify is moving, I have decided to cancel my Spotify Premium subscription.
I’ve decided that I’m going to be reformatting my 25 TB of external storage capacity (for storing datasets, backups, etc.) to exFAT. Most of it is currently ext4 or NTFS.
exFAT is great because, like its predecessor FAT32, it has read-write compatibility with Linux, Windows, and macOS. But while FAT32 caps files at 4 GB and partitions at 16 TB, exFAT supports files up to 16 EB and partitions up to 64 ZB. Lots more room to grow.
It’ll be a slow process since I can only format one drive at a time and need to copy the data to another drive and back again. So far I’ve converted 4 TB of data.
So my university has shut down the campus for the remainder of the semester due to Coronavirus concerns and asked all students to attend classes remotely (mainly using Zoom for live-streaming lectures). I went looking for an open-source, cross-platform video conferencing solution with a fast onboarding process to keep in touch with fellow students, and found Jitsi to fit the bill.
It’s free, it’s FOSS, and there are no accounts required to create a chat session on their website. You just need to enter a name for your room, and they give you a link to share for people to join.
The only officially supported web browser is Google Chrome, which kinda sucks. It seems to work okay in Firefox, except I couldn’t get it to detect any of my microphones (your mileage may vary). Instead, I’m using it in Falkon and it works flawlessly.
Unfortunately, it also doesn’t appear that video chats are end-to-end encrypted, which means whoever runs the server can see the raw footage (though you can self-host).
Overall it’s good enough and it looks like the public service is hosted by 8×8, which is a public VoIP company, so I’m not overly concerned about eavesdropping (due to the lack of end-to-end encryption). I’ll keep an eye out for better options but for now I’m sticking with Jitsi.
I was recently wondering which of the popular web search engines provides the best results and decided to design an objective benchmark for evaluating them. My hypothesis was that Google would score the best, followed by StartPage (a Google aggregator), and then Bing and its aggregators.
Usually, when evaluating search engine performance, there are two methods I’ve seen used:

1. Have humans search for things and rate the results
2. Create a dataset of mappings between queries and “ideal” result URLs
The problem with having humans rate search results is that it is expensive and the results are hard to replicate. Creating a dataset of “correct” webpages to return for each query makes the experiment repeatable, but it is also expensive upfront and depends on the subjective biases of whoever creates the dataset.
Instead of using either of those methods I decided to evaluate the search engines on the specific task of answering factual questions from humans asked in natural language. Each engine is scored by how many of its top 10 results contain the correct answer.
Although this approach is not very effective at evaluating the quality of any single query, I believe that, in aggregate over thousands of queries, it should provide a reasonable estimate of how well each engine can answer users’ questions.
To source the factoid questions, I used the Stanford Question Answering Dataset (SQuAD), a popular natural language dataset containing 100k factual questions and answers about Wikipedia articles, collected by Mechanical Turk workers.
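Pulling the Q/A pairs out of the dataset is straightforward. Here is a minimal sketch, assuming the standard SQuAD v1.1 JSON layout; the function name and file path are my own, not from the experiment’s actual scripts:

```python
import json

def load_squad_pairs(path):
    """Extract (question, answer) pairs from a SQuAD v1.1 JSON file."""
    with open(path) as f:
        squad = json.load(f)
    pairs = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa["answers"]:  # keep only questions that have an answer
                    pairs.append((qa["question"], qa["answers"][0]["text"]))
    return pairs
```

Each question can have several crowd-sourced answer spans; this sketch just takes the first one.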
Here are some sample questions from the dataset:
Q: How did the black death make it to the Mediterranean and Europe?
A: merchant ships
Q: What is the largest city of Poland?
A: Warsaw
Q: In 1755 what fort did British capture?
A: Fort Beauséjour
Some of the questions in the dataset are also rather ambiguous such as the one below:
Q: What order did British make of French?
A: expulsion of the Acadian
This is because the dataset is designed to train question-answering models that have access to the context containing the answer. In the case of SQuAD, each Q/A pair comes with the paragraph from Wikipedia that contains the answer.
However, I don’t believe this is a huge problem since most likely all search engines will perform poorly on those types of questions and no individual one will be put at a disadvantage.
To get the results from each search engine, I wrote a Python script that connects to Firefox via Selenium and performs searches just like regular users via the browser.
The first 10 results are extracted using CSS rules specific to each search engine, and those links are then downloaded using the requests library. To check whether a particular result is a “match,” we simply perform an exact substring search of the page source for the correct answer (both normalized to lowercase).
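The match check itself is just a normalized substring test. A minimal sketch of that step, with function names of my own and the simplifying assumption that a plain requests.get is enough to fetch each result page:

```python
import requests

def page_matches(page_source, answer):
    """Exact substring match, with both sides normalized to lowercase."""
    return answer.lower() in page_source.lower()

def count_matches(result_urls, answer):
    """Download each of the top-10 result pages and count those containing the answer."""
    matches = 0
    for url in result_urls[:10]:
        try:
            page = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # an unreachable page simply doesn't count as a match
        if page_matches(page, answer):
            matches += 1
    return matches
```

A score of 0–10 per query then aggregates naturally across the whole question set.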
Again this is not a perfect way of determining whether any single page really answers a query, but in aggregate it should provide a good estimate.
Some search engines are harder to scrape due to rate limiting. The most aggressive rate limiters were Qwant, Yandex, and Gigablast; they often blocked me after just two queries (on a new IP), so there are fewer results available for those engines. Also, Cliqz, Lycos, Yahoo!, and YaCy were all added mid-experiment, so they have fewer results too.
I scraped results for about 2 weeks and collected about 3k queries for most engines. Below is a graph of the number of queries that were scraped from each search engine.
Crunching the numbers
Now that the data is collected, there are lots of ways to analyze it. For each query we have the number of matching documents, and for the latter half of the queries the list of result links was also saved.
The first thing I decided to do was see which search engine had the highest average number of matching documents.
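Computed per engine, that is just a mean over queries. A quick sketch, assuming the scraped data is available as (engine, query, n_matching) records; that record shape is my assumption, not the script’s actual storage format:

```python
from collections import defaultdict

def average_matches(records):
    """Mean number of matching documents per engine.

    records: iterable of (engine, query, n_matching) tuples.
    """
    totals = defaultdict(lambda: [0, 0])  # engine -> [sum of matches, query count]
    for engine, _query, n_matching in records:
        totals[engine][0] += n_matching
        totals[engine][1] += 1
    return {engine: total / count for engine, (total, count) in totals.items()}
```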
Much to my surprise, Google actually came in second to Ecosia. I was rather shocked by this, since Ecosia’s gimmick is planting trees with its ad revenue, not beating Google’s search results.
Also surprising is the number of Bing aggregators (Ecosia, DuckDuckGo, Yahoo!) that all came in ahead of Bing itself. One reason may be that those engines each apply their own ranking on top of the results returned by Bing, and some claim to also search other sources.
Below is a chart with the exact scores of each search engine.
To further understand why the Bing aggregators performed so well, I wanted to check how much their own ranking differed from Bing’s. I computed the average Levenshtein distance between each pair of search engines: the minimum number of single-result edits (insertions, deletions, or substitutions) required to change one results page into the other.
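The same dynamic-programming recurrence used for strings works over ranked URL lists, treating each URL as a single symbol. A sketch (my own implementation, not the one from the experiment’s scripts):

```python
def edit_distance(a, b):
    """Levenshtein distance between two ranked result lists (or any sequences)."""
    # One-row dynamic programming: dp[j] holds the distance between
    # the first i elements of `a` and the first j elements of `b`.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (x != y),  # substitution (free if the URLs match)
            )
    return dp[-1]
```

Two identical results pages score 0; two pages with no URLs in common score 10.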
Of the three, Ecosia was the most different from pure Bing, with an average edit distance of 8. DuckDuckGo was the second most different with an edit distance of 7, followed by Yahoo! with a distance of 5.
Interestingly, the edit distances of Ecosia, DuckDuckGo, and Yahoo! correlate well with their overall rankings, where Ecosia came in 1st, DuckDuckGo 3rd, and Yahoo! 5th. This would indicate that whatever modifications these engines make to the default Bing ranking do indeed improve search result quality.
This was a pretty fun little experiment to do, and I am happy to see some different results from what I expected. I am making all the collected data and scripts available for anyone who wants to do their own analysis.
This study does not account for features besides search result quality, such as instant answers, bangs, and privacy, and thus it doesn’t really show which search engine is “best,” just which one provides the best results for factoid questions.
I plan to continue using DuckDuckGo as my primary search engine despite it coming in 3rd place. The results of the top 6 search engines are all pretty close, so I would expect the experience across them to be similar.