Introduction to Large-Scale Similarity Search : HNSW , IVF , LSH
The world is awash in data, and finding the right information within vast stores of multimedia content is becoming increasingly challenging. Imagine trying to find a specific product in a million-item online catalog, or searching for a similar image in a massive database. This is where large-scale similarity search comes into play, offering efficient ways to find items similar to a given query – be it a product, a picture, a piece of text, or even a complex idea.
In this post, we'll explore the fundamental concepts of large-scale similarity search and introduce some of the most powerful techniques used to tackle this challenge. We'll dive into the Faiss library, a popular open-source toolkit specifically designed for efficient similarity search, and explore basic code examples to get you started. This post will serve as an overview and springboard for further exploration of this exciting and rapidly evolving field.
Why is Similarity Search Important?
The rise of AI and machine learning has led to an explosion of high-dimensional data representations. These representations, often called embeddings, capture complex relationships between data points, allowing for powerful analysis and understanding. But finding similar embeddings within a large dataset is a computationally intensive task.
One area where similarity search is revolutionizing the field is in Retrieval-Augmented Generation (RAG). RAG combines traditional information retrieval with language models to improve the accuracy and relevance of generated text. By leveraging similarity search to find relevant documents, RAG models can access a broader knowledge base and produce more informative and contextually rich outputs.
The Challenge of Large-Scale Similarity Search
Traditional databases and search engines struggle to handle the demands of large-scale similarity search. They rely on structured queries and indexing methods that are not designed for the dynamic nature of high-dimensional data. This is where specialized techniques come into play.
Faiss Framework
Faiss, developed by Facebook AI Research, is a powerful library specifically designed for efficient similarity search. It offers a range of indexing methods, each optimized for different trade-offs between performance, accuracy, and memory usage. Faiss also supports GPU acceleration, making it ideal for handling massive datasets.
Starting with the Foundation
we start by installing and importing the necessary packages :
pip install faiss-cpu # for cpu
#pip install faiss-gpu # for gpu
import time
import faiss
import numpy as np
We then define some constants: d
for the dimensionality of our vectors (128), nb
for the number of base vectors (10000), and nq
for the number of query vectors (100).
# Dataset of vectors (10000 vectors of 128 dimensions)
d = 128 # dimension
nb = 10000 # number of base vectors
nq = 100 # number of query vectors
We also initialize a random seed for reproducible results.
np.random.seed(1234) # For reproducibility
We generatec also two sets of random vectors xb
represents our base vectors (10000 x 128) and xq
represents our query vectors (100 x 128).
# Random vectors
xb = np.random.random((nb, d)).astype('float32')
xq = np.random.random((nq, d)).astype('float32')
These are essentially our data points.
1. Hierarchical Navigable Small World (HNSW)
How it works: HNSW is a graph-based indexing method where vectors are organized in layers of a small-world graph. Each node (vector) in the graph is connected to its closest neighbors. During search, the algorithm navigates the graph to quickly converge on the nearest vectors.
Advantages: HNSW offers high accuracy and fast search times, particularly for high-dimensional datasets.
Key Parameters:
-
M (number of connections per node): Controls the number of neighbors connected to each node in the graph. Higher values improve accuracy at the cost of memory.
-
efConstruction and efSearch: These parameters control the depth of exploration during index construction and search. Higher values lead to better search accuracy but require more computation.
IndexHNSWFlat VS IndexHNSWSQ (Scalar Quantization)
We start by creating two HNSW indexes - HNSWFlat and HNSWSQ - to see how they perform.
HNSWFlat is the basic HNSW implementation :
# 1. HNSW index with Flat (no quantization)
index_hnswflat = faiss.IndexHNSWFlat(d, 32) # 32 is the number of connections per layer
# Add the data
start = time.time()
index_hnswflat.add(xb)
indexing_time_hnswflat = time.time() - start
# Search with HNSWFlat
start = time.time()
D, I = index_hnswflat.search(xq, 5) # Search 5 nearest neighbors
search_time_hnswflat = time.time() - start
While HNSWSQ incorporates Scalar Quantization (SQ) for faster indexing.
# 2. HNSW with Scalar Quantization (SQ)
quantizer_sq = faiss.IndexScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
index_hnswsq = faiss.IndexHNSWFlat(d, 32)
# Add the data
start = time.time()
index_hnswsq.add(xb)
indexing_time_hnswsq = time.time() - start
# Search with HNSWSQ
start = time.time()
D, I = index_hnswsq.search(xq, 5)
search_time_hnswsq = time.time() - start
Both indexes use a "flat" storage method, meaning they don't compress the original vectors.
print(f"HNSWFlat Indexing Time: {indexing_time_hnswflat:.4f} seconds")
print(f"HNSWFlat Search Time: {search_time_hnswflat:.4f} seconds")
print(f"HNSWSQ Indexing Time: {indexing_time_hnswsq:.4f} seconds")
print(f"HNSWSQ Search Time: {search_time_hnswsq:.4f} seconds")
Both HNSW methods are significantly slower to index compared to IVF-based methods because HNSW builds a graph of neighbors, which is computationally expensive. HNSWSQ is slightly faster than HNSWFlat because Scalar Quantization (SQ) reduces the dimensionality of the vectors by representing them more compactly during indexing.
Both HNSWFlat and HNSWSQ have slightly higher search times than IVF-based methods due to the graph traversal needed to find the nearest neighbors. HNSW is known for high accuracy, especially in high-dimensional spaces, but at the cost of slower indexing and search times.
2 . Inverted File Index (IVF)
How it works: The Inverted File Index (IVF) is another commonly used approach in similarity search for large-scale datasets. It divides the dataset into buckets or "lists," and only a subset of the buckets is searched for each query.
Advantages: IVF's main advantage is its lower memory requirement compared to HNSW.
Key Parameters:
-
nlist: Number of clusters (or buckets). A higher number improves precision but increases the indexing time.
-
nprobe: Number of clusters searched during a query. Increasing this parameter improves recall at the cost of search speed.
IndexIVFFlat without Product Quantization vs IndexIVFPQ with Product Quantization
IVFFlat uses a flat index with no quantization .
# Number of clusters (nlist)
nlist = 100
# 3. Flat index (no quantizer)
quantizer = faiss.IndexFlatL2(d) # No quantization
index_ivfflat = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
# Train the index
start = time.time()
index_ivfflat.train(xb)
index_ivfflat.add(xb)
indexing_time_ivfflat = time.time() - start
# Search with IVFFlat
index_ivfflat.nprobe = 10 # How many clusters to search
start = time.time()
D, I = index_ivfflat.search(xq, 5) # Search 5 nearest neighbors
search_time_ivfflat = time.time() - start
While IVFPQ incorporates Product Quantization (PQ) for memory efficiency.
# Number of sub-vectors for PQ (m) and number of bits per sub-vector (nbits)
m = 8
nbits = 8
# 4. IVF with Product Quantization (PQ)
index_ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
# Train the index
start = time.time()
index_ivfpq.train(xb)
index_ivfpq.add(xb)
indexing_time_ivfpq = time.time() - start
# Search with IVFPQ
index_ivfpq.nprobe = 10 # Adjust number of clusters to search
start = time.time()
D, I = index_ivfpq.search(xq, 5) # Search 5 nearest neighbors
search_time_ivfpq = time.time() - start
The key idea is that IVF divides the dataset into clusters (buckets) and searches only a subset of these buckets for each query. We set nlist
to 100, meaning we have 100 clusters.
print(f"IVFFlat Indexing Time: {indexing_time_ivfflat:.4f} seconds")
print(f"IVFFlat Search Time: {search_time_ivfflat:.4f} seconds")
print(f"IVFPQ Indexing Time: {indexing_time_ivfpq:.4f} seconds")
print(f"IVFPQ Search Time: {search_time_ivfpq:.4f} seconds")
The IVFPQ takes significantly longer to index compared to IVFFlat. This is because IVFPQ involves additional steps of quantization after the initial clustering. It applies Product Quantization (PQ), which requires learning a codebook for compressing each vector into multiple quantized sub-vectors.
IVFPQ has a slightly longer search time due to the overhead of decompression and reconstructing vectors from the quantized representation during the search process. However, the difference is minimal, and the memory efficiency gained with IVFPQ is often worth the extra search time.
3. Locality Sensitive Hashing (LSH)
How it works: LSH transforms high-dimensional vectors into lower-dimensional "hash" values. Similar vectors are more likely to share the same hash values, allowing for efficient search by focusing on buckets containing relevant hashes.
Advantages: LSH offers a fast and scalable approach to similarity search, particularly for large datasets and high-dimensional spaces.
Key Parameters:
-
Number of Hash Tables: Controls the trade-off between accuracy and speed. More hash tables improve accuracy but increase search time.
-
Number of Hash Functions per Table: The number of hash functions used to generate each hash value.
IndexLSH
A hashing-based method that uses random projections to create hash values for each vector.
# 5. LSH
nbits = 16 # Number of bits for hash
index_lsh = faiss.IndexLSH(d, nbits)
# Add data to the index
start = time.time()
index_lsh.add(xb)
indexing_time_lsh = time.time() - start
# Search with LSH
start = time.time()
D, I = index_lsh.search(xq, 5) # Search 5 nearest neighbors
search_time_lsh = time.time() - start
Similar vectors are more likely to share the same hash value, allowing for fast search by focusing on buckets containing relevant hashes. We use 16 bits for our hash.
print(f"LSH Indexing Time: {indexing_time_lsh:.4f} seconds")
print(f"LSH Search Time: {search_time_lsh:.4f} seconds")
LSH is extremely fast to index because it only hashes the data points into hash buckets based on random projections. There is no training process like in IVF-based methods or HNSW, so the indexing is nearly instantaneous.
LSH search is relatively slower compared to IVFFlat and HNSW because of the random nature of LSH, which may require searching multiple hash buckets to find the nearest neighbors. LSH typically offers fast indexing but may sacrifice accuracy and search speed compared to more complex methods like HNSW or IVFPQ.
This was just a quick peek into the world of large-scale similarity search. We've only touched on the basics, but there's so much more to discover . In future posts inchallah, we'll dive into real-world applications, explore advanced indexing methods, and even build some cool projects using Faiss.