C++ Reference#

template<typename DataQuantizer = SQ8Quantizer, typename QueryQuantizer = SQ8Quantizer>
class Lorann : public Lorann::LorannBase#

Public Functions

inline explicit Lorann(float *data, int m, int d, int n_clusters, int global_dim, int rank = 32, int train_size = 5, bool euclidean = false, bool balanced = false)#

Construct a new Lorann object.

NOTE: The constructor does not build the actual index.

Parameters:
  • data – The data matrix as a float array of size \(m \times d\)

  • m – Number of points (rows) in the data matrix

  • d – Number of dimensions (cols) in the data matrix

  • n_clusters – Number of clusters. In general, for \(m\) index points, a good starting point is to set n_clusters as around \(\sqrt{m}\).

  • global_dim – Globally reduced dimension (\(s\)). Must be either -1 or an integer that is a multiple of 32. If global_dim = -1, no dimensionality reduction is used, but the original dimensionality must be a multiple of 32 in this case. Higher values increase recall but also increase the query latency. In general, a good starting point is to set global_dim = -1 if \(d < 200\), global_dim = 128 if \(200 \leq d \leq 1000\), and global_dim = 256 if \(d > 1000\).

  • rank – Rank (\(r\)) of the parameter matrices. Must be 16, 32, or 64. Defaults to 32. Rank = 64 is mainly useful if no exact re-ranking is performed in the query phase.

  • train_size – Number of nearby clusters (\(w\)) used for training the reduced-rank regression models. Defaults to 5, but lower values can be used if \(m \gtrsim 500 000\) to speed up the index construction.

  • euclidean – Whether to use Euclidean distance instead of (negative) inner product as the dissimilarity measure. Defaults to false.

  • balanced – Whether to use balanced clustering. Defaults to false.

inline virtual void search(const float *data, const int k, const int clusters_to_search, const int points_to_rerank, int *idx_out, float *dist_out = nullptr) const override#

Query the index.

Parameters:
  • data – The query vector (dimensionality must match that of the index)

  • k – The number of approximate nearest neighbors retrived

  • clusters_to_search – Number of clusters to search

  • points_to_rerank – Number of points for final (exact) re-ranking. If points_to_rerank is set to 0, no re-ranking is performed and the original data does not need to be kept in memory. In this case the final returned distances are approximate distances.

  • idx_out – The index output array of length k

  • dist_out – The (optional) distance output array of length k

inline virtual void build(const float *query_data, const int query_n, const bool approximate = true, int num_threads = -1) override#

Build the index.

Parameters:
  • query_data – A float array of training queries of size \(n \times d\) used to build the index. Can be useful in the out-of-distribution setting where the training and query distributions differ. Ideally there should be at least as many training query points as there are index points.

  • query_n – The number of training queries

  • approximate – Whether to turn on various approximations during index construction. Defaults to true. Setting approximate to false slows down the index construction but can slightly increase the recall, especially if no exact re-ranking is used in the query phase.

  • num_threads – Number of CPU threads to use (set to -1 to use all cores)

inline void build(const bool approximate = true, int num_threads = -1)#

Build the index.

Parameters:
  • approximate – Whether to turn on various approximations during index construction. Defaults to true. Setting approximate to false slows down the index construction but can slightly increase the recall, especially if no exact re-ranking is used in the query phase.

  • num_threads – Number of CPU threads to use (set to -1 to use all cores)

inline int get_n_samples() const#

Get the number of samples in the index.

Returns:

int

inline int get_dim() const#

Get the dimensionality of the vectors in the index.

Returns:

int

inline int get_n_clusters() const#

Get the number of clusters.

Returns:

int

inline bool get_euclidean() const#

Get whether the index uses the Euclidean distance as the dissimilarity measure.

Returns:

bool

inline void get_vector(const int idx, float *out)#

Get a pointer to a vector in the index.

Parameters:
  • idx – The index to the vector.

  • out – The output buffer.

inline float get_dissimilarity(const float *u, const float *v)#

Compute the dissimilarity between two vectors.

Parameters:
  • u – First vector

  • v – Second vector

Returns:

float The dissimilarity

inline void exact_search(const float *q, int k, int *out, float *dist_out = nullptr) const#

Perform exact k-nn search using the index.

Parameters:
  • q – The query vector (dimension must match the index data dimension)

  • k – The number of nearest neighbors

  • out – The index output array of length k

  • dist_out – The (optional) distance output array of length k

class LorannFP : public Lorann::LorannBase#

Public Functions

inline explicit LorannFP(float *data, int m, int d, int n_clusters, int global_dim, int rank = 24, int train_size = 5, bool euclidean = false, bool balanced = false)#

Construct a new LorannFP object.

NOTE: The constructor does not build the actual index.

Parameters:
  • data – The data matrix as a float array of size \(m \times d\)

  • m – Number of points (rows) in the data matrix

  • d – Number of dimensions (cols) in the data matrix

  • n_clusters – Number of clusters. In general, for \(m\) index points, a good starting point is to set n_clusters as around \(\sqrt{m}\).

  • global_dim – Globally reduced dimension (\(s\)). Must be either -1 or an integer that is a multiple of 64. Higher values increase recall but also increase the query latency. In general, a good starting point is to set global_dim = -1 if \(d < 200\), global_dim = 128 if \(200 \leq d \leq 1000\), and global_dim = 256 if \(d > 1000\).

  • rank – Rank (\(r\)) of the parameter matrices. Defaults to 24. Higher ranks are mainly useful if no exact re-ranking is performed in the query phase.

  • train_size – Number of nearby clusters (\(w\)) used for training the reduced-rank regression models. Defaults to 5, but lower values can be used if \(m \gtrsim 500 000\) to speed up the index construction.

  • euclidean – Whether to use Euclidean distance instead of (negative) inner product as the dissimilarity measure. Defaults to false.

  • balanced – Whether to use balanced clustering. Defaults to false.

inline virtual void search(const float *data, const int k, const int clusters_to_search, const int points_to_rerank, int *idx_out, float *dist_out = nullptr) const override#

Query the index.

Parameters:
  • data – The query vector (dimensionality must match that of the index)

  • k – The number of approximate nearest neighbors retrived

  • clusters_to_search – Number of clusters to search

  • points_to_rerank – Number of points for final (exact) re-ranking. If points_to_rerank is set to 0, no re-ranking is performed and the original data does not need to be kept in memory. In this case the final returned distances are approximate distances.

  • idx_out – The index output array of length k

  • dist_out – The (optional) distance output array of length k

inline virtual void build(const float *query_data, const int query_n, const bool approximate = true, int num_threads = -1) override#

Build the index.

Parameters:
  • query_data – A float array of training queries of size \(n \times d\) used to build the index. Can be useful in the out-of-distribution setting where the training and query distributions differ. Ideally there should be at least as many training query points as there are index points.

  • query_n – The number of training queries

  • approximate – Whether to turn on various approximations during index construction. Defaults to true. Setting approximate to false slows down the index construction but can slightly increase the recall, especially if no exact re-ranking is used in the query phase.

  • num_threads – Number of CPU threads to use (set to -1 to use all cores)

inline void build(const bool approximate = true, int num_threads = -1)#

Build the index.

Parameters:
  • approximate – Whether to turn on various approximations during index construction. Defaults to true. Setting approximate to false slows down the index construction but can slightly increase the recall, especially if no exact re-ranking is used in the query phase.

  • num_threads – Number of CPU threads to use (set to -1 to use all cores)

inline int get_n_samples() const#

Get the number of samples in the index.

Returns:

int

inline int get_dim() const#

Get the dimensionality of the vectors in the index.

Returns:

int

inline int get_n_clusters() const#

Get the number of clusters.

Returns:

int

inline bool get_euclidean() const#

Get whether the index uses the Euclidean distance as the dissimilarity measure.

Returns:

bool

inline void get_vector(const int idx, float *out)#

Get a pointer to a vector in the index.

Parameters:
  • idx – The index to the vector.

  • out – The output buffer.

inline float get_dissimilarity(const float *u, const float *v)#

Compute the dissimilarity between two vectors.

Parameters:
  • u – First vector

  • v – Second vector

Returns:

float The dissimilarity

inline void exact_search(const float *q, int k, int *out, float *dist_out = nullptr) const#

Perform exact k-nn search using the index.

Parameters:
  • q – The query vector (dimension must match the index data dimension)

  • k – The number of nearest neighbors

  • out – The index output array of length k

  • dist_out – The (optional) distance output array of length k

class KMeans#

Public Functions

inline KMeans(int n_clusters, int iters = 25, bool euclidean = false, bool balanced = false, int max_balance_diff = 16, bool verbose = false)#

Construct a new KMeans object.

NOTE: The constructor does not perform the actual clustering.

Parameters:
  • n_clusters – The number of clusters (k)

  • iters – The number of k-means iterations to perform. Defaults to 25.

  • euclidean – Whether to use Euclidean distance instead of (negative) inner product as the dissimilarity measure. Defaults to false.

  • balanced – Whether to ensure clusters are balanced using an efficient balanced k-means algorithm. Defaults to false.

  • max_balance_diff – The maximum allowed difference in cluster sizes for balanced clustering. Used only if balanced = true. Defaults to 16.

  • verbose – Whether to enable verbose output. Defaults to false.

inline std::vector<std::vector<int>> train(const float *data, int n, int m, int num_threads = -1)#

Performs the clustering on the provided data.

Parameters:
  • data – The data matrix

  • n – Number of points (rows) in the data matrix

  • m – Number of dimensions (cols) in the data matrix

  • num_threads – Number of CPU threads to use (set to -1 to use all cores)

Returns:

std::vector<std::vector<int>> Clustering assignments as a vector of vectors where each vector contains the ids of the points assigned to the corresponding cluster

inline std::vector<std::vector<int>> assign(const float *data, int n, int k) const#

Assign given data points to their k nearest clusters.

NOTE: The dimensionality of the data should match the dimensionality of the data that the clustering was trained on.

Parameters:
  • data – The data matrix

  • n – The number of data points (rows) in the data matrix

  • k – The number of clusters each point is assigned to

Returns:

std::vector<std::vector<int>> Clustering assignments as a vector of vectors where each vector contains the ids of the points assigned to the corresponding cluster

inline int get_n_clusters() const#

Get the number of clusters.

Returns:

int

inline int get_iters() const#

Get the number of k-means iterations.

Returns:

int

inline bool get_euclidean() const#

Get whether Euclidean distance is used as the dissimilarity measure.

Returns:

bool

inline bool is_balanced() const#

Get whether balanced k-means is used.

Returns:

bool

inline RowMatrix get_centroids() const#

Get the centroids.

Returns:

float* Centroids as a float array of size n_clusters * dim