Python Reference
LorannIndex
- class lorann.LorannIndex(data, n_clusters, global_dim, quantization_bits=8, rank=32, train_size=5, euclidean=False, balanced=False)
Initializes a LorannIndex object. The initializer does not build the actual index.
- Parameters:
data (ndarray[tuple[int, ...], dtype[float32]]) – Index points as an \(m \times d\) numpy array.
n_clusters (int) – Number of clusters. For \(m\) index points, a good starting point is to set n_clusters to around \(\sqrt{m}\).
global_dim (Optional[int]) – Globally reduced dimension (\(s\)). Must be either None or an integer that is a multiple of 32. Higher values increase recall but also increase query latency. A good starting point is global_dim = None if \(d < 200\), global_dim = 128 if \(200 \leq d \leq 1000\), and global_dim = 256 if \(d > 1000\).
quantization_bits (Optional[int]) – Number of bits used for quantizing the parameter matrices. Must be None, 4, or 8. Defaults to 8. None turns off quantization. Setting quantization_bits = 4 lowers memory consumption without affecting query latency but can reduce recall on some data sets.
rank (int) – Rank (\(r\)) of the parameter matrices. Must be 16, 32, or 64 if quantization_bits is not None. Defaults to 32. Setting rank = 64 is mainly useful only if no exact re-ranking is performed in the query phase.
train_size (int) – Number of nearby clusters (\(w\)) used for training the reduced-rank regression models. Defaults to 5, but lower values can be used if \(m \gtrsim 500\,000\) to speed up index construction.
euclidean (bool) – Whether to use Euclidean distance instead of (negative) inner product as the dissimilarity measure. Defaults to False.
balanced (bool) – Whether to use balanced clustering. Defaults to False.
- Raises:
ValueError – If the input parameters are invalid.
- Returns:
None
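The starting-point heuristics given for n_clusters and global_dim can be collected into a small helper. The sketch below is illustrative only: `suggest_params` and `make_index` are hypothetical names that are not part of lorann, and rounding \(\sqrt{m}\) to the nearest integer is an assumption. Only the constructor signature is taken from this reference.

```python
import math


def suggest_params(m, d):
    """Return (n_clusters, global_dim) starting points for m points of dimension d.

    Hypothetical helper, not part of lorann: it only encodes the rules of
    thumb given in the parameter descriptions.
    """
    n_clusters = max(1, round(math.sqrt(m)))  # around sqrt(m); rounding is an assumption
    if d < 200:
        global_dim = None
    elif d <= 1000:
        global_dim = 128
    else:
        global_dim = 256
    return n_clusters, global_dim


def make_index(data):
    """Create (but do not build) a LorannIndex for an m x d float32 array."""
    import lorann  # imported lazily so suggest_params works without lorann

    m, d = data.shape
    n_clusters, global_dim = suggest_params(m, d)
    return lorann.LorannIndex(data, n_clusters, global_dim)
```

These values are starting points only. Recall and latency still depend on the data set, so they are worth tuning afterwards.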
- build(approximate=True, training_queries=None, n_threads=-1)
Builds the LoRANN index.
- Parameters:
approximate (bool) – Whether to turn on various approximations during index construction. Defaults to True. Setting approximate to False slows down the index construction but can slightly increase the recall, especially if no exact re-ranking is used in the query phase.
training_queries (Optional[ndarray[tuple[int, ...], dtype[float32]]]) – An optional matrix of training queries used to build the index. Can be useful in the out-of-distribution setting where the training and query distributions differ. Ideally there should be at least as many training query points as there are index points.
n_threads (int) – Number of CPU threads to use (set to -1 to use all cores).
- Raises:
RuntimeError – If the index has already been built.
ValueError – If the input parameters are invalid.
- Return type:
None
- search(q, k, clusters_to_search, points_to_rerank, return_distances=False, n_threads=-1)
Performs an approximate nearest neighbor query for single or multiple query vectors.
Can handle either a single query vector or multiple query vectors in parallel. The query is given as a numpy vector or as a numpy matrix where each row contains a query.
- Parameters:
q (ndarray[tuple[int, ...], dtype[float32]]) – The query object. Can be either a single query vector or a matrix with one query vector per row.
k (int) – The number of nearest neighbors to be returned.
clusters_to_search (int) – Number of clusters to search.
points_to_rerank (int) – Number of points to re-rank using exact search. If points_to_rerank is set to 0, no re-ranking is performed and the original data does not need to be kept in memory. In this case the final returned distances are approximate distances.
return_distances (bool) – Whether to also return distances. Defaults to False.
n_threads (int) – Number of CPU threads to use (set to -1 to use all cores). Only has effect if multiple query vectors are provided.
- Raises:
RuntimeError – If the index has not been built.
ValueError – If the input parameters are invalid.
- Return type:
Union[ndarray[tuple[int, ...], dtype[int32]], Tuple[ndarray[tuple[int, ...], dtype[int32]], ndarray[tuple[int, ...], dtype[float32]]]]
- Returns:
If return_distances is False, returns a vector or matrix of indices of the approximate nearest neighbors in the original input data for the corresponding query. If return_distances is True, returns a tuple where the first element contains the nearest neighbors and the second element contains their distances to the query.
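To make the role of points_to_rerank concrete, the following pure-Python sketch shows one way a re-ranking step can work: keep the candidates with the smallest approximate dissimilarities, rescore them exactly against the original vectors, and return the best k. It illustrates the idea only and is not lorann's implementation. The function `rerank` and all of its arguments are hypothetical, and negative inner product is assumed as the dissimilarity.

```python
def rerank(query, data, candidates, approx_scores, k, points_to_rerank):
    """Illustrative re-ranking step (hypothetical, not lorann's internals).

    candidates: ids of candidate points.
    approx_scores: their approximate dissimilarities to the query
    (smaller is better).
    """
    # Order candidate positions by approximate dissimilarity.
    order = sorted(range(len(candidates)), key=lambda i: approx_scores[i])
    if points_to_rerank == 0:
        # No re-ranking: the original vectors in `data` are never touched.
        return [candidates[i] for i in order[:k]]

    shortlist = [candidates[i] for i in order[:points_to_rerank]]

    def exact(idx):
        # Exact dissimilarity: negative inner product with the original vector.
        return -sum(a * b for a, b in zip(query, data[idx]))

    return sorted(shortlist, key=exact)[:k]
```

In this sketch a larger points_to_rerank gives the exact scoring more chances to correct ordering errors made by the approximate scores, at the cost of more exact distance computations.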
- exact_search(q, k, return_distances=False, n_threads=-1)
Performs an exact nearest neighbor query for single or multiple query vectors.
Can handle either a single query vector or multiple query vectors in parallel. The query is given as a numpy vector or as a numpy matrix where each row contains a query.
- Parameters:
q (ndarray[tuple[int, ...], dtype[float32]]) – The query object. Can be either a single query vector or a matrix with one query vector per row.
k (int) – The number of nearest neighbors to be returned.
return_distances (bool) – Whether to also return distances. Defaults to False.
n_threads (int) – Number of CPU threads to use (set to -1 to use all cores). Only has effect if multiple query vectors are provided.
- Raises:
ValueError – If the input parameters are invalid.
- Return type:
Union[ndarray[tuple[int, ...], dtype[int32]], Tuple[ndarray[tuple[int, ...], dtype[int32]], ndarray[tuple[int, ...], dtype[float32]]]]
- Returns:
If return_distances is False, returns a vector or matrix of indices of the exact nearest neighbors in the original input data for the corresponding query. If return_distances is True, returns a tuple where the first element contains the nearest neighbors and the second element contains their distances to the query.
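Because exact_search returns the true nearest neighbors, it can serve as ground truth when tuning the parameters of search. The sketch below computes recall for a batch of queries. `recall_at_k` and `measure_recall` are hypothetical helpers, and the default search settings are placeholders rather than recommendations. The queries are assumed to be passed as a matrix, so that both calls return one row of ids per query.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of exact neighbors recovered, pooled over all queries.

    Both arguments are 2-D sequences with one row of neighbor ids per query.
    """
    hits = 0
    total = 0
    for approx_row, exact_row in zip(approx_ids, exact_ids):
        truth = set(int(i) for i in exact_row)
        hits += len(truth.intersection(int(i) for i in approx_row))
        total += len(truth)
    return hits / total


def measure_recall(index, queries, k=10, clusters_to_search=32, points_to_rerank=100):
    """Compare search against exact_search on an already built index.

    The default search settings are placeholders, not recommendations.
    """
    approx = index.search(queries, k, clusters_to_search, points_to_rerank)
    exact = index.exact_search(queries, k)
    return recall_at_k(approx, exact)
```

A typical use is to raise clusters_to_search or points_to_rerank until the measured recall reaches the desired level.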
- save(fname)
Saves the index to a file on the disk.
- Parameters:
fname (str) – The filename to save the index to.
- Raises:
OSError – If saving to the specified file fails.
- Return type:
None
- Returns:
None
- classmethod load(fname)
Loads a LorannIndex from a file on the disk.
- Parameters:
fname (str) – The filename to load the index from.
- Raises:
OSError – If loading from the specified file fails.
- Returns:
The loaded LorannIndex object.
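A save followed by a load can be wrapped in a short round-trip helper. The sketch below relies only on the documented save(fname) method and load(fname) classmethod. The name `save_and_reload` is hypothetical and not part of lorann.

```python
def save_and_reload(index, fname):
    """Save an index to fname and load it back through the class's load method.

    Works for any object that offers save(fname) and a classmethod
    load(fname), as LorannIndex does according to this reference.
    """
    index.save(fname)               # documented to raise OSError if saving fails
    return type(index).load(fname)  # documented to raise OSError if loading fails
```

Calling `save_and_reload(index, "example.lorann")` on a built index would return a new LorannIndex read from that file, where the file name is only an example.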
- get_vector(idx)
Retrieves a stored vector by its integer index.
- Parameters:
idx (int) – The index of the vector to retrieve.
- Return type:
ndarray[tuple[int, ...], dtype[float32]]
- Returns:
The vector at the specified index.
- Raises:
IndexError – If the index is out of bounds.
- get_dissimilarity(u, v)
Calculates the dissimilarity between two vectors. The dimensions of the vectors should match the dimension of the index.
- Parameters:
u (ndarray[tuple[int, ...], dtype[float32]]) – The first vector.
v (ndarray[tuple[int, ...], dtype[float32]]) – The second vector.
- Return type:
float
- Returns:
The dissimilarity between the two vectors.
- Raises:
ValueError – If the vectors are not of the same dimension as the index.
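The two dissimilarity measures named in this reference, negative inner product and Euclidean distance, can be written out in plain Python for comparison. The function below is a hypothetical stand-in, not lorann's code. This page does not state whether lorann reports the plain or the squared Euclidean distance, so the plain distance is assumed here.

```python
import math


def dissimilarity(u, v, euclidean=False):
    """Reference dissimilarity between two equal-length vectors.

    Hypothetical stand-in for illustration. Plain (unsquared) Euclidean
    distance is an assumption; the library's convention may differ.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    if euclidean:
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    # Negative inner product: a larger inner product means a smaller dissimilarity.
    return -sum(a * b for a, b in zip(u, v))
```

With either measure, smaller values indicate closer vectors, which is why the inner product is negated.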
- property n_samples: int
The number of samples in the index.
- property dim: int
The dimensionality of the data in the index.
- property n_clusters: int
The number of clusters in the index.
KMeans
- class lorann.KMeans(n_clusters, iters=25, euclidean=False, balanced=False, max_balance_diff=16, verbose=False)
Initializes a KMeans object. The initializer does not perform the actual clustering.
- Parameters:
n_clusters (int) – The number of clusters (\(k\)).
iters (int) – The number of \(k\)-means iterations. Defaults to 25.
euclidean (bool) – Whether to use Euclidean distance instead of (negative) inner product as the dissimilarity measure. Defaults to False.
balanced (bool) – Whether to ensure clusters are balanced using an efficient balanced \(k\)-means algorithm. Defaults to False.
max_balance_diff (int) – The maximum allowed difference in cluster sizes for balanced clustering. Used only if balanced = True. Defaults to 16.
verbose (bool) – Whether to enable verbose output. Defaults to False.
- Returns:
None
- train(data, n_threads=-1)
Performs the clustering on the provided data.
- Parameters:
data (ndarray[tuple[int, ...], dtype[float32]]) – The data as an \(n \times d\) numpy array.
n_threads (int) – Number of CPU threads to use (set to -1 to use all cores).
- Raises:
ValueError – If the data matrix is invalid.
RuntimeError – If the clustering has already been trained.
- Return type:
List[ndarray[tuple[int, ...], dtype[int32]]]
- Returns:
A list of numpy arrays, each containing the ids of the points assigned to the corresponding cluster.
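Since train returns one array of point ids per cluster, it is often convenient to turn that result into a single label per point. The sketch below does this. `labels_from_clusters` and `cluster` are hypothetical helpers, and the ids are assumed to be row positions in the data matrix.

```python
def labels_from_clusters(clusters, n_points):
    """Convert per-cluster id lists into one cluster label per point."""
    labels = [-1] * n_points  # -1 marks a point that appears in no cluster
    for cluster_id, members in enumerate(clusters):
        for point_id in members:
            labels[int(point_id)] = cluster_id
    return labels


def cluster(data, n_clusters):
    """Run lorann.KMeans on an n x d float32 array and return per-point labels."""
    import lorann  # imported lazily; requires lorann to be installed

    kmeans = lorann.KMeans(n_clusters)
    clusters = kmeans.train(data)
    return labels_from_clusters(clusters, data.shape[0])
```

The per-cluster form returned by train is better suited to iterating over cluster members, while the per-point form is easier to compare against labels from other clustering tools.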
- assign(data, k)
Assigns given data points to their \(k\) nearest clusters.
The dimensionality of the data should match the dimensionality of the data that the clustering was trained on.
- Parameters:
data (ndarray[tuple[int, ...], dtype[float32]]) – The data as an \(m \times d\) numpy array.
k (int) – The number of clusters each point is assigned to.
- Raises:
ValueError – If the data matrix is invalid.
RuntimeError – If the clustering has not been trained.
- Return type:
List[ndarray[tuple[int, ...], dtype[int32]]]
- Returns:
A list of numpy arrays, one for each cluster, containing the ids of the data points assigned to the corresponding cluster.
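The result of assign is grouped by cluster, but a per-point view (which clusters was each point assigned to?) is sometimes easier to work with. The sketch below inverts the grouping. `clusters_per_point` is a hypothetical helper, and the ids are assumed to be row positions in the matrix passed to assign.

```python
def clusters_per_point(assignments, n_points):
    """Invert per-cluster id lists into a list of cluster ids for each point."""
    per_point = [[] for _ in range(n_points)]
    for cluster_id, members in enumerate(assignments):
        for point_id in members:
            per_point[int(point_id)].append(cluster_id)
    return per_point
```

Given the description of assign, each inner list would be expected to contain k cluster ids. This expectation follows from the text above and has not been verified against the library.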
- get_centroids()
Retrieves the centroids of the clusters.
- Raises:
RuntimeError – If the clustering has not been trained.
- Return type:
ndarray[tuple[int, ...], dtype[float32]]
- Returns:
A matrix of centroids, where each row represents a centroid.
- property n_clusters: int
The number of clusters.
- property dim: int
The dimensionality of the data the clustering was trained on.
- property iters: int
The number of k-means iterations.
- property euclidean: bool
Whether Euclidean distance is used as the dissimilarity measure.
- property balanced: bool
Whether the clustering is (approximately) balanced.