Search Data Structures¶
-
class
radbm.search.base.BaseSDS¶ Base Search Data Structure, need to be instantiate. Maintains an index for documents to be retrieved with query. When queried the appropriate index(es) will be returned (not the document(s))
-
batch_insert(documents, indexes, *args, **kwargs)¶ Insert a multiple documents in the data structure by saving their index. The documents are not saved for scalability. This is the default implementation that uses insert. See insert documentation.
Parameters: - documents (numpy.ndarray or torch.Tensor) – The documents to by inserted (will not be saved). The first dimension is for the batch (i.e. documents[0] is a document)
- indexes (object) – A unique identifier for the document, some algorithms require index to be hashable.
- *args – pass to insert (see insert for more details)
- **kwargs – pass to insert (see insert for more details)
-
batch_itersearch(queries, *args, **kwargs)¶ Default function that return a list of generator using itersearch. With ith generator corresponding to the ith query.
Parameters: - queries (numpy.ndarray or torch.Tensor) –
- *args – pass to itersearch (see itersearch for more details)
- **kwargs – pass to itersearch (see itersearch for more details)
Returns: itersearch_list – where the ith generetor yields indexes (set or list) that correspond to the ith query.
Return type: list of generator
-
batch_search(queries, *args, **kwargs)¶ Search in the data structure for the index of a documents. This is the default implementation that uses search. See search documentation.
Parameters: - query (numpy.ndarray or torch.Tensor) –
- *args – pass to search (see search for more details)
- **kwargs – pass to search (see search for more details)
Returns: indexes_list – The indexes of the retrieved documents for each query. If indexes[i] is a list, it should indicate that the indexes are ordered.
Return type: list of (set or list)
-
insert(document, index, *args, **kwargs)¶ Insert a single document in the data structure by saving the index. The document is not saved for scalability. This is the default implementation that uses batch_insert. See batch_insert documentation.
Parameters: - document (numpy.ndarray or torch.Tensor) – The document to by inserted (will not be saved)
- index (object) – A unique identifier for the document, some algorithms require index to be hashable.
- *args – pass to batch_insert (see batch_insert for more details)
- **kwargs – pass to batch_insert (see batch_insert for more details)
-
itersearch(query, *args, **kwargs)¶ Default generator, based on batch_itersearch, that yields indexes.
Parameters: - query (numpy.ndarray or torch.Tensor) –
- *args – pass to batch_itersearch (see batch_itersearch for more details)
- **kwargs – pass to batch_itersearch (see batch_itersearch for more details)
Yields: indexes
-
search(query, *args, **kwargs)¶ Search in the data structure for the index of a documents. This is the default implementation that uses batch_search. See batch_search documentation.
Parameters: - query (numpy.ndarray or torch.Tensor) –
- *args – pass to batch_search (see batch_search for more details)
- **kwargs – pass to batch_search (see batch_search for more details)
Returns: indexes – The indexes of the retrieved documents. If indexes is a list, it should indicate that the indexes are ordered.
Return type: set or list
-
-
class
radbm.search.mbsds.HashingMultiBernoulliSDS(ntables, nlookups=1)¶ This algorithm build N hash tables from documents in the form of Multi-Bernoulli distributions. From a documents (i.e. a distribution) we find the i-th most probable outcome (which is a binary vector) and use it as a key for the corresponding i-th hash table.
For retrieval, a query (i.e. a Multi-Bernoulli distribution) we use the M most probable outcomes (which are binary vectors) to do M lookups in the hash tables. In this context, a lookups means looking in each tables. In total NxM hash table calls will be made.
Parameters: - ntables (int) – The number of tables to build
- nlookups (int, optional (default=1)) – The number of lookups to do when searching
Notes
The Multi-Bernoulli distribution are always parametrized with the vector of log probabilities for the bits to be one.
-
get_buckets_avg_size()¶ Returns: buckets_avg_size – The average number of documents per buckets for each hash tables Return type: list of float
-
get_buckets_max_size()¶ Returns: buckets_max_size – The maximum number of documents per buckets for each hash tables Return type: list of float
-
get_state()¶ Returns: tables – The current hash tables Return type: list of dict
-
insert(log_probs, i)¶ Insert a unique document’s index in the each tables. The document most be a Multi-Bernoulli distribution parametrized in log probabilities.
Parameters: - log_probs (numpy.ndarray (ndim == 1 or 2)) – If ndim==1: The Multi-Bernoulli distribution parametrized in log probabilities If ndim==2: len(log_probs) should be 2. The first element should be log probabilities that bits are zero and the second element should be the log probabilities that bits are one. This is for numerical stability (i.e. when probability are too close to 1)
- i (hashable (e.g. int or tuple)) – The index of the document
-
itersearch(log_probs, nlookups=None, yield_empty=False)¶ Generator that search in the tables with a query in the form of a Multi-Bernoulli distribution parametrized in log probabilities. This will search in each tables with each of the top (nlookups) outcomes. Everytime a set of indexes is found, this generator will yield it. If yield_empty is True, empty set will also be yield.
Parameters: - log_probs (numpy.ndarray (ndim == 1 or 2)) – If ndim==1: The Multi-Bernoulli distribution parametrized in log probabilities If ndim==2: len(log_probs) should be 2. The first element should be log probabilities that bits are zero and the second element should be the log probabilities that bits are one. This is for numerical stability (i.e. when probability are too close to 1)
- nlookups (int, optional) – The upper limit to generate the next most probable outcomes. Not to be confused with the number of item generated. By default, generates every outcomes.
Yields: indexes (set) – The newly found indexes
-
reset()¶ Empty the hash tables
Returns: Return type: self
-
search(log_probs, nlookups=None)¶ Search in the tables with a query in the form of a Multi-Bernoulli distribution parametrized in log probabilities. This will search in each tables with each of the top (nlookups) outcomes.
Parameters: - log_probs (numpy.ndarray (ndim == 1 or 2)) – If ndim==1: The Multi-Bernoulli distribution parametrized in log probabilities If ndim==2: len(log_probs) should be 2. The first element should be log probabilities that bits are zero and the second element should be the log probabilities that bits are one. This is for numerical stability (i.e. when probability are too close to 1)
- nlookups (int, optional) – The number of top outcome to uses for searching. If not specified the default self.nlookups is used.
Returns: indexes – The indexes of each documents found in the search
Return type: set
-
set_state(state)¶ Parameters: state (list of dict) – The hash tables to use
-
class
radbm.search.radius.HammingRadiusSDS(nbits, radius, ntables=1)¶ A Seach Data Structure for binary vectors, it searches within a Hamming radius. This uses hash tables to speed up the search.
Parameters: - nbits (int) – The number of bits the binary vectors will have.
- radius (int) – The Hamming radius to search inside. Should be small as the masks are precomputed. The memory taken is nbits * sum_{i=0}^radius comb(nbits, i).
- ntables (int) – The number of Hash tables to uses for the documents.
-
insert(document, index)¶ Insert a single document in the data structure by saving the index. The document is not saved for scalability. This is the default implementation that uses batch_insert. See batch_insert documentation.
Parameters: - document (numpy.ndarray or torch.Tensor) – The document to by inserted (will not be saved)
- index (object) – A unique identifier for the document, some algorithms require index to be hashable.
- *args – pass to batch_insert (see batch_insert for more details)
- **kwargs – pass to batch_insert (see batch_insert for more details)
-
itersearch(query, yield_empty=False)¶ Generator that search in the tables with a query in the form of a Multi-Bernoulli distribution parametrized in log probabilities. This will search in each tables with each of the top (nlookups) outcomes. Everytime a set of indexes is found, this generator will yield it. If yield_empty is True, empty set will also be yield.
Parameters: - log_probs (numpy.ndarray (ndim == 1 or 2)) – If ndim==1: The Multi-Bernoulli distribution parametrized in log probabilities If ndim==2: len(log_probs) should be 2. The first element should be log probabilities that bits are zero and the second element should be the log probabilities that bits are one. This is for numerical stability (i.e. when probability are too close to 1)
- yield_empty (bool (optinal)) – Whether empty set should be yielded (default False)
Yields: indexes (set) – The newly found indexes
-
reset()¶ Empty the hash tables
Returns: Return type: self
-
search(query)¶ Search in the data structure for the index of a documents. This is the default implementation that uses batch_search. See batch_search documentation.
Parameters: - query (numpy.ndarray or torch.Tensor) –
- *args – pass to batch_search (see batch_search for more details)
- **kwargs – pass to batch_search (see batch_search for more details)
Returns: indexes – The indexes of the retrieved documents. If indexes is a list, it should indicate that the indexes are ordered.
Return type: set or list
-
class
radbm.search.gridsearch.GridSearch(bounds, divisions)¶ An exact symmetric search (query and documents are in the same space) GridSearch separates the space into a grid and fill each cell with documents using a dictionary (cell -> documents). For retrieval, the the cells are searched in order of distance and document are yielded when it is the closest found so far and when the next nearest cell is farther than the document (this is certifying there is no document closer).
Parameters: - bounds (np.ndarray) – shape=(2,d) with d the dimension of the data (query/documents) min_bounds, max_bounds = bounds are the bounds of the grid.
- divisions (np.ndarray, dtype=int) – shape=(d,) with d the dimension of the data (query/documents) the number of cell along each axis.
-
batch_insert(points, indexes)¶ Insert a multiple documents in the data structure by saving their index. The documents are not saved for scalability. This is the default implementation that uses insert. See insert documentation.
Parameters: - documents (numpy.ndarray or torch.Tensor) – The documents to by inserted (will not be saved). The first dimension is for the batch (i.e. documents[0] is a document)
- indexes (object) – A unique identifier for the document, some algorithms require index to be hashable.
- *args – pass to insert (see insert for more details)
- **kwargs – pass to insert (see insert for more details)
-
compute_cell_relative_positions(points)¶ Gives the position relative to the cell of the points :returns: cell_pos – cell_pos in [0, 1] :rtype: np.ndarray
-
compute_cells(points)¶ Returns: cell – The cells of the corresponding points referended by the closest to the origin corner. Return type: np.ndarray, dtype=int
-
compute_scaled_points(points)¶ Rescale point(s)
Parameters: points (np.ndarray) – A point or a batch of points (query/document) Returns: rescalled_points – rescalled_point[i] in [0, self.divisions[i]] Return type: np.ndarray
-
itersearch(point, cell_generator='bruteforce', max_dist=inf, max_items=inf, yield_point=False, yield_dist=False)¶ Generator
Parameters: - point (np.ndarray) –
- cell_generator (str) – should be bruteforce or graphsearch
- max_dist (number) –
- max_items (number) –
- yield_point (bool) –
- yield_dist (bool) –
Yields: - index (int) – if yield_point is False and yield_dist is False
- point (np.ndarray) – if yield_point is True and yield_dist is False
- (dist, index) ((float, int)) – if yield_point is False and yield_dist is True
- (dist, point) ((float, np.ndarray)) – if yield_point is True and yield_dist is True
Search Data Structure Learning¶
Efficient Learnable Binary Access¶
-
class
radbm.search.elba.base.EfficientLearnableBinaryAccess(fq, fd, struct, optim=<class 'torch.optim.adam.Adam'>, lr=0.001)¶ EfficientLearnableBinaryAccess (ELBA) is a base class for concrete models. Given a search data structure and two parametric encoding functions, one for the queries (fq) and one for the documents (fd), both producing a Multi-Bernoulli code (i.e. in [0,1]^n) in its logits form (pre sigmoid). ELBM uses the structure to store and retrieve data using the Multi-Bernoulli code.
Parameters: - fq (torch.nn.Module) – The parametric function of the queries outputting in logits (pre sigmoid).
- fd (torch.nn.Module) – The parametric function of the documents outputting in logits (pre sigmoid).
- struct (BaseSDS subclass) – The structure used for storing and retrieval.
- optim (torch.optim (optional)) – The optimizer (minimizer) class used for the parametric function. (default torch.optim.Adam)
- lr (float (optional)) – The learning rate of the optimizer. (default 0.001)
-
batch_insert(documents, indexes, *args, **kwargs)¶ Insert the index of each documents in the data structure
Parameters: - documents (torch.tensor) – The documents to insert the first dim being the batch.
- indexes (iterable of hashable) – most notable example is a list of int. len(indexes) most be equal to len(documents).
- *args – passed to self.struct.batch_insert
- **kwargs – passed to self.struct.batch_insert
Returns: Return type: self
-
batch_itersearch(queries, *args, **kwargs)¶ Iteratively search in the data structure for the relevant indexes for each queries.
Parameters: - queries (torch.tensor) – The search queries, the first dim being the batch.
- *args – passed to self.struct.batch_itersearch
- **kwargs – passed to self.struct.batch_itersearch
Returns: generator_list – Each generator yield relevant indexes for the corresponding queries. len(generator_list) = len(queries).
Return type: list of generators (of set or list)
-
batch_search(queries, *args, **kwargs)¶ Search in the data structure for the relevant indexes for each queries.
Parameters: - queries (torch.tensor) – The search queries, the first dim being the batch.
- *args – passed to self.struct.batch_search
- **kwargs – passed to self.struct.batch_search
Returns: indexes_list – Is the list of the relevant indexes for each queries. len(indexes_list) = len(queries).
Return type: list of (set or list)
-
class
radbm.search.elba.fbeta.Fbeta(fq, fd, struct, log_match_prob, sim=<function multi_bernoulli_equality>, match_dist=0, nindex=1, ramp=<function Ramp.<locals>.ramp>)¶ The Fbeta model is an ELBA that use the Fbeta Loss to train the Multi-Bernoulli encoders.
Parameters: - fq (torch.nn.Module) – The query Multi-Bernoulli encoder.
- fd (torch.nn.Module) – The document Multi-Bernoulli encoder.
- struct (BaseSDS subclass) – The structure used in ELBA.
- log_match_prob (float) – The log probability that there is a match given a random query and a random document.
- sim (function (torch.Tensor imes torch.Tensor -> torch.Tensor)) – A function taking the query’s Multi-Bernoulli code and the document’s Multi-Bernoulli code and return the bitwise log probability that each bit match.
- match_dist (int (optional)) – The maximum number of dissimilar bits for which we consider that two binary vectors are matching. (default=0)
- nindex (int (optional)) – The number of indexes. (default=1)
- ramp (function (optional)) – A ramping function used for ramping the log2(beta) at each steps.
-
step(q, d, match, l2_ratio=0.01, nbatch=None)¶ Do a training step.
Parameters: - q (torch.Tensor) – A batch of queries.
- d (torch.Tensor) – A batch of documents.
- match (torch.Tensor (dtype=torch.bool)) – A matrix (2D tensor) with match[i,j] indicating if q[i] match with d[j]
- l2_ratio (float (optional)) – The wanted ratio between the Fbeta Loss and L2 regularization. (default 0.01)
- nbatch (int or None (optional)) – Give the number of batch, this is uses for ramping. If None, this uses the final ramping value.
Returns: loss – The loss of the current batch.
Return type: torch.Tensor (size 1)
-
radbm.search.elba.fbeta.fbeta_loss(match, log_match, beta, log_match_prob)¶ Compute the Fbeta Loss
Parameters: - match (torch.Tensor (dtype=torch.bool)) – Indicate if there is a match or not
- log_match (torch.Tensor (dtype=torch.float)) – The log probability of matching given by an estimator
- beta (float) – The beta in the Fbeta Loss
- log_match_prob (float) – The log probability that there is a match given a random query and a random document.
Returns: fbeta_loss – Containing 1 element, this is the Fbeta Loss of the estimator w.r.t match.
Return type: torch.Tensor (dtype=torch.float)
-
class
radbm.search.elba.hbkl.HBKL(fq, fd, struct, nbits, positive_binomial_prob, negative_binomial_prob, *args, **kwargs)¶ Hamming Binomial with KL-divergence
Parameters: - fq (torch.nn.Module) – The query Multi-Bernoulli encoder.
- fd (torch.nn.Module) – The document Multi-Bernoulli encoder.
- struct (BaseSDS subclass) – The structure used in ELBA.
- nbits (int) – The number of bits
- positive_binomial_prob (float [0,1]) – The target binomial probability for matching query-document pairs
- negative_binomial_prob (float [0,1]) – The target binomial probability for non-matching query-document pairs
-
step(q, d, r)¶ Do a training step.
Parameters: - q (torch.Tensor) – A batch of queries.
- d (torch.Tensor) – A batch of documents.
- r (torch.Tensor (dtype=torch.bool)) – A matrix (2D tensor) with r[i,j] indicating if q[i] match with d[j]
Returns: loss – The loss of HBKL.
Return type: torch.Tensor (size 1)
-
class
radbm.search.elba.mihash.MIHash(fq, fd, struct, nbits, match_prob, *args, **kwargs)¶ MIHash as in “MIHash: Online Hashing with Mutual Information” by Fatih Cakir, Kun He, Sarah Adel Bargal and Stan Sclaroff.
Parameters: - fq (torch.nn.Module) – The query Multi-Bernoulli encoder.
- fd (torch.nn.Module) – The document Multi-Bernoulli encoder.
- struct (BaseSDS subclass) – The structure used in ELBA.
- match_prob (float (in [0,1])) – The probability that there is a match given a random query and a random document.
-
step(q, d, match, l2_ratio=0)¶ Do a training step.
Parameters: - q (torch.Tensor) – A batch of queries.
- d (torch.Tensor) – A batch of documents.
- match (torch.Tensor (dtype=torch.bool)) – A matrix (2D tensor) with match[i,j] indicating if q[i] match with d[j]
Returns: loss – The loss (negative mutual information) of the current batch.
Return type: torch.Tensor (size 1)
-
class
radbm.search.elba.mihash.TriangularKernel(centroids, widths=None)¶ Helper Module, compute the triangular kernel.
-
forward(x)¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Moduleinstance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
radbm.search.elba.mihash.categorical_entropy(cat)¶ -(cat*cat.log()).sum() without the annoying 0*inf
Parameters: cat (torch.Tensor (ndim==1)) – The parameter of a Categorical distribution. Returns: ent – The entropy of the Categorical distribution. Return type: torch.Tensor (a single float)
-
radbm.search.elba.mihash.mi_categorical_bernoulli(pos_cat, neg_cat, p)¶ Compute the Multual Information between a categorical and a bernoulli. This use the fact that I(C, B) = H(C) - pH(C | B=1) - (1-p)H(C | B=0) with C = Cat(pi) and B = Ber(p).
Parameters: - pos_cat (torch.tensor (ndim=1, pos_cat.sum()=1)) – The parameters of C | B=1
- neg_cat (torch.tensor (ndim=1, neg_cat.sum()=1)) – The parameters of C | B=0
- p (float) – The parameters of B
Returns: I – The Mutual Information I(C, B)
Return type: torch.tensor (a single float)
-
class
radbm.search.elba.hashnet.HashNet(fq, fd, struct, match_prob, alpha=1, beta=1, *args, **kwargs)¶ HashNet as in “HashNet: Deep Learning to Hash by Continuation” by Zhangjie Cao, Mingsheng Long, Jianmin Wang and Philip S. Yu.
Parameters: - fq (torch.nn.Module) – The query Multi-Bernoulli encoder.
- fd (torch.nn.Module) – The document Multi-Bernoulli encoder.
- struct (BaseSDS subclass) – The structure used in ELBA.
- match_prob (float (in [0,1])) – The probability that there is a match given a random query and a random document.
- alpha (float (optional)) – The HashNet’s alpha used in the adaptative sigmoid (default 1)
- beta (float (optional)) – The HashNet’s beta used for ramping the stage in the tanh(stage*beta*x). (default 1)
-
step(q, d, r, stage=1)¶ Do a training step.
Parameters: - q (torch.Tensor) – A batch of queries.
- d (torch.Tensor) – A batch of documents.
- r (torch.Tensor (dtype=torch.bool)) – A matrix (2D tensor) with r[i,j] indicating if q[i] match with d[j]
- stage (float (optional)) – The stage for computing the tanh(stage*beta*x). (default=1)
Returns: loss – The loss of HashNet.
Return type: torch.Tensor (size 1)