📝 tldr;
Whether or not you need a break from the ads, here's how Spotify recommends their next song:
They use an algorithm called ANNOY (Approximate Nearest Neighbors Oh Yeah).
ANNOY relies on objects (songs in this case) represented as “vectors”. Vectors are points in an n-dimensional space where points closer to each other are considered more similar than points far away from each other.
Instead of scanning the entire vector space to find the exact nearest neighbors of your song (which would need O(N) time), ANNOY uses a binary-tree-based approach to find the approximate nearest neighbors which brings down the time complexity to O(log N).
🙋♀️ Credit to our Contributing Author
Bhavana wrote this amazing article to share with our 21,001+ readers! If you’re also interested in being a byte-sized design writer, consider applying here!
Bhavana Hindupur is a Principal Software Engineer at Microsoft. She brings experience from her tenure at tech giants Google and Amazon, where she designed and implemented several large-scale solutions in the cloud.
Read more from her at thepeoplessoftwareengineer
💡 What are Vectors?
In the context of search, vectors are a way of representing data as points in a multidimensional space where the distance between the points is an indicator of similarity.
💪 Brute Force Approach (k Nearest Neighbors or kNN)
We could scan the entire vector space to find the nearest neighbors to our query vector (the original song).
But this would be a slow process since Spotify has 100,000,000 songs 🙀!
🙌🏽 ANN Oh Yeah!
Vocabulary: ANNOY is an “Approximate Nearest Neighbors” (ANN) implementation.
Explanation: The idea behind ANN is to optimize the nearest neighbors search by being okay with finding the almost closest neighbors rather than the exact closest ones.
🔎 How does ANNOY work exactly?
Setup:
It repeatedly divides the vector space into half until each region contains a small number of vectors, essentially creating a binary tree representation of it.
It chooses a random entry point to start dividing the vector space in half.
Query:
Given the query vector, it traverses the tree to find the region that contains the query vector.
It then finds the nearest neighbors by running the “brute force” kNN algorithm on just the vectors in that region. (Region R6 from the above diagram)
🤔 But what if the nearest song is in a different region?
Problem: Sometimes you can get unlucky and have a query vector that is in one region but whose true nearest neighbors are in a different region.As expected, the results from this tree won’t be very good in this case.
Mitigation: To mitigate this, ANNOY builds many trees (aka a Forest) starting from different entry points and averages out the results from several trees instead of just one.
🗞️ Official Article
Link to the official in depth article:
(Subscribing is the best way to support this newsletter)
Keep reading with a 7-day free trial
Subscribe to Byte-Sized Design to keep reading this post and get 7 days of free access to the full post archives.