Search Data Structures¶

class radbm.search.base.BaseSDS¶

Base Search Data Structure, need to be instantiate. Maintains an index for documents to be retrieved with query. When queried the appropriate index(es) will be returned (not the document(s))

batch_insert(documents, indexes, *args, **kwargs)¶

Insert a multiple documents in the data structure by saving their index. The documents are not saved for scalability. This is the default implementation that uses insert. See insert documentation.

Parameters:

documents (numpy.ndarray or torch.Tensor) – The documents to by inserted (will not be saved). The first dimension is for the batch (i.e. documents[0] is a document)
indexes (object) – A unique identifier for the document, some algorithms require index to be hashable.
*args – pass to insert (see insert for more details)
**kwargs – pass to insert (see insert for more details)

batch_itersearch(queries, *args, **kwargs)¶

Default function that return a list of generator using itersearch. With ith generator corresponding to the ith query.

Parameters:	queries (numpy.ndarray or torch.Tensor) – args – pass to itersearch (see itersearch for more details) *kwargs – pass to itersearch (see itersearch for more details)
Returns:	itersearch_list – where the ith generetor yields indexes (set or list) that correspond to the ith query.
Return type:	list of generator

batch_search(queries, *args, **kwargs)¶

Search in the data structure for the index of a documents. This is the default implementation that uses search. See search documentation.

Parameters:	query (numpy.ndarray or torch.Tensor) – args – pass to search (see search for more details) *kwargs – pass to search (see search for more details)
Returns:	indexes_list – The indexes of the retrieved documents for each query. If indexes[i] is a list, it should indicate that the indexes are ordered.
Return type:	list of (set or list)

insert(document, index, *args, **kwargs)¶

Insert a single document in the data structure by saving the index. The document is not saved for scalability. This is the default implementation that uses batch_insert. See batch_insert documentation.

Parameters:	document (numpy.ndarray or torch.Tensor) – The document to by inserted (will not be saved) index (object) – A unique identifier for the document, some algorithms require index to be hashable. args – pass to batch_insert (see batch_insert for more details) *kwargs – pass to batch_insert (see batch_insert for more details)

itersearch(query, *args, **kwargs)¶

Default generator, based on batch_itersearch, that yields indexes.

Parameters:	query (numpy.ndarray or torch.Tensor) – args – pass to batch_itersearch (see batch_itersearch for more details) *kwargs – pass to batch_itersearch (see batch_itersearch for more details)
Yields:	indexes

search(query, *args, **kwargs)¶

Search in the data structure for the index of a documents. This is the default implementation that uses batch_search. See batch_search documentation.

Parameters:	query (numpy.ndarray or torch.Tensor) – args – pass to batch_search (see batch_search for more details) *kwargs – pass to batch_search (see batch_search for more details)
Returns:	indexes – The indexes of the retrieved documents. If indexes is a list, it should indicate that the indexes are ordered.
Return type:	set or list

class radbm.search.mbsds.HashingMultiBernoulliSDS(ntables, nlookups=1)¶

This algorithm build N hash tables from documents in the form of Multi-Bernoulli distributions. From a documents (i.e. a distribution) we find the i-th most probable outcome (which is a binary vector) and use it as a key for the corresponding i-th hash table.

For retrieval, a query (i.e. a Multi-Bernoulli distribution) we use the M most probable outcomes (which are binary vectors) to do M lookups in the hash tables. In this context, a lookups means looking in each tables. In total NxM hash table calls will be made.

Parameters:	ntables (int) – The number of tables to build nlookups (int, optional (default=1)) – The number of lookups to do when searching

Notes

The Multi-Bernoulli distribution are always parametrized with the vector of log probabilities for the bits to be one.

get_buckets_avg_size()¶

Returns:	buckets_avg_size – The average number of documents per buckets for each hash tables
Return type:	list of float

get_buckets_max_size()¶

Returns:	buckets_max_size – The maximum number of documents per buckets for each hash tables
Return type:	list of float

get_state()¶

Returns:	tables – The current hash tables
Return type:	list of dict

insert(log_probs, i)¶

Insert a unique document’s index in the each tables. The document most be a Multi-Bernoulli distribution parametrized in log probabilities.

Parameters:

log_probs (numpy.ndarray (ndim == 1 or 2)) – If ndim==1: The Multi-Bernoulli distribution parametrized in log probabilities If ndim==2: len(log_probs) should be 2. The first element should be log probabilities that bits are zero and the second element should be the log probabilities that bits are one. This is for numerical stability (i.e. when probability are too close to 1)
i (hashable (e.g. int or tuple)) – The index of the document

itersearch(log_probs, nlookups=None, yield_empty=False)¶

Generator that search in the tables with a query in the form of a Multi-Bernoulli distribution parametrized in log probabilities. This will search in each tables with each of the top (nlookups) outcomes. Everytime a set of indexes is found, this generator will yield it. If yield_empty is True, empty set will also be yield.

Parameters:

log_probs (numpy.ndarray (ndim == 1 or 2)) – If ndim==1: The Multi-Bernoulli distribution parametrized in log probabilities If ndim==2: len(log_probs) should be 2. The first element should be log probabilities that bits are zero and the second element should be the log probabilities that bits are one. This is for numerical stability (i.e. when probability are too close to 1)
nlookups (int, optional) – The upper limit to generate the next most probable outcomes. Not to be confused with the number of item generated. By default, generates every outcomes.

Yields:

indexes (set) – The newly found indexes

reset()¶

Empty the hash tables

Returns:
Return type:	self

search(log_probs, nlookups=None)¶

Search in the tables with a query in the form of a Multi-Bernoulli distribution parametrized in log probabilities. This will search in each tables with each of the top (nlookups) outcomes.

Parameters:	log_probs (numpy.ndarray (ndim == 1 or 2)) – If ndim==1: The Multi-Bernoulli distribution parametrized in log probabilities If ndim==2: len(log_probs) should be 2. The first element should be log probabilities that bits are zero and the second element should be the log probabilities that bits are one. This is for numerical stability (i.e. when probability are too close to 1) nlookups (int, optional) – The number of top outcome to uses for searching. If not specified the default self.nlookups is used.
Returns:	indexes – The indexes of each documents found in the search
Return type:	set

set_state(state)¶

Parameters:	state (list of dict) – The hash tables to use

class radbm.search.radius.HammingRadiusSDS(nbits, radius, ntables=1)¶

A Seach Data Structure for binary vectors, it searches within a Hamming radius. This uses hash tables to speed up the search.

Parameters:	nbits (int) – The number of bits the binary vectors will have. radius (int) – The Hamming radius to search inside. Should be small as the masks are precomputed. The memory taken is nbits * sum_{i=0}^radius comb(nbits, i). ntables (int) – The number of Hash tables to uses for the documents.

insert(document, index)¶

Insert a single document in the data structure by saving the index. The document is not saved for scalability. This is the default implementation that uses batch_insert. See batch_insert documentation.

Parameters:	document (numpy.ndarray or torch.Tensor) – The document to by inserted (will not be saved) index (object) – A unique identifier for the document, some algorithms require index to be hashable. args – pass to batch_insert (see batch_insert for more details) *kwargs – pass to batch_insert (see batch_insert for more details)

itersearch(query, yield_empty=False)¶

Generator that search in the tables with a query in the form of a Multi-Bernoulli distribution parametrized in log probabilities. This will search in each tables with each of the top (nlookups) outcomes. Everytime a set of indexes is found, this generator will yield it. If yield_empty is True, empty set will also be yield.

Parameters:

log_probs (numpy.ndarray (ndim == 1 or 2)) – If ndim==1: The Multi-Bernoulli distribution parametrized in log probabilities If ndim==2: len(log_probs) should be 2. The first element should be log probabilities that bits are zero and the second element should be the log probabilities that bits are one. This is for numerical stability (i.e. when probability are too close to 1)
yield_empty (bool (optinal)) – Whether empty set should be yielded (default False)

Yields:

indexes (set) – The newly found indexes

reset()¶

Empty the hash tables

Returns:
Return type:	self

search(query)¶

Search in the data structure for the index of a documents. This is the default implementation that uses batch_search. See batch_search documentation.

Parameters:	query (numpy.ndarray or torch.Tensor) – args – pass to batch_search (see batch_search for more details) *kwargs – pass to batch_search (see batch_search for more details)
Returns:	indexes – The indexes of the retrieved documents. If indexes is a list, it should indicate that the indexes are ordered.
Return type:	set or list

class radbm.search.gridsearch.GridSearch(bounds, divisions)¶

An exact symmetric search (query and documents are in the same space) GridSearch separates the space into a grid and fill each cell with documents using a dictionary (cell -> documents). For retrieval, the the cells are searched in order of distance and document are yielded when it is the closest found so far and when the next nearest cell is farther than the document (this is certifying there is no document closer).

Parameters:	bounds (np.ndarray) – shape=(2,d) with d the dimension of the data (query/documents) min_bounds, max_bounds = bounds are the bounds of the grid. divisions (np.ndarray, dtype=int) – shape=(d,) with d the dimension of the data (query/documents) the number of cell along each axis.

batch_insert(points, indexes)¶

Insert a multiple documents in the data structure by saving their index. The documents are not saved for scalability. This is the default implementation that uses insert. See insert documentation.

Parameters:

documents (numpy.ndarray or torch.Tensor) – The documents to by inserted (will not be saved). The first dimension is for the batch (i.e. documents[0] is a document)
indexes (object) – A unique identifier for the document, some algorithms require index to be hashable.
*args – pass to insert (see insert for more details)
**kwargs – pass to insert (see insert for more details)

compute_cell_relative_positions(points)¶: Gives the position relative to the cell of the points :returns: cell_pos – cell_pos in [0, 1] :rtype: np.ndarray

compute_cells(points)¶

Returns:	cell – The cells of the corresponding points referended by the closest to the origin corner.
Return type:	np.ndarray, dtype=int

compute_scaled_points(points)¶

Rescale point(s)

Parameters:	points (np.ndarray) – A point or a batch of points (query/document)
Returns:	rescalled_points – rescalled_point[i] in [0, self.divisions[i]]
Return type:	np.ndarray

itersearch(point, cell_generator='bruteforce', max_dist=inf, max_items=inf, yield_point=False, yield_dist=False)¶

Generator

Parameters:	point (np.ndarray) – cell_generator (str) – should be bruteforce or graphsearch max_dist (number) – max_items (number) – yield_point (bool) – yield_dist (bool) –
Yields:	index (int) – if yield_point is False and yield_dist is False point (np.ndarray) – if yield_point is True and yield_dist is False (dist, index) ((float, int)) – if yield_point is False and yield_dist is True (dist, point) ((float, np.ndarray)) – if yield_point is True and yield_dist is True

Search Data Structure Learning¶

Efficient Learnable Binary Access¶

class radbm.search.elba.base.EfficientLearnableBinaryAccess(fq, fd, struct, optim=<class 'torch.optim.adam.Adam'>, lr=0.001)¶

EfficientLearnableBinaryAccess (ELBA) is a base class for concrete models. Given a search data structure and two parametric encoding functions, one for the queries (fq) and one for the documents (fd), both producing a Multi-Bernoulli code (i.e. in [0,1]^n) in its logits form (pre sigmoid). ELBM uses the structure to store and retrieve data using the Multi-Bernoulli code.

Parameters:

fq (torch.nn.Module) – The parametric function of the queries outputting in logits (pre sigmoid).
fd (torch.nn.Module) – The parametric function of the documents outputting in logits (pre sigmoid).
struct (BaseSDS subclass) – The structure used for storing and retrieval.
optim (torch.optim (optional)) – The optimizer (minimizer) class used for the parametric function. (default torch.optim.Adam)
lr (float (optional)) – The learning rate of the optimizer. (default 0.001)

batch_insert(documents, indexes, *args, **kwargs)¶

Insert the index of each documents in the data structure

Parameters:	documents (torch.tensor) – The documents to insert the first dim being the batch. indexes (iterable of hashable) – most notable example is a list of int. len(indexes) most be equal to len(documents). args – passed to self.struct.batch_insert *kwargs – passed to self.struct.batch_insert
Returns:
Return type:	self

batch_itersearch(queries, *args, **kwargs)¶

Iteratively search in the data structure for the relevant indexes for each queries.

Parameters:	queries (torch.tensor) – The search queries, the first dim being the batch. args – passed to self.struct.batch_itersearch *kwargs – passed to self.struct.batch_itersearch
Returns:	generator_list – Each generator yield relevant indexes for the corresponding queries. len(generator_list) = len(queries).
Return type:	list of generators (of set or list)

batch_search(queries, *args, **kwargs)¶

Search in the data structure for the relevant indexes for each queries.

Parameters:	queries (torch.tensor) – The search queries, the first dim being the batch. args – passed to self.struct.batch_search *kwargs – passed to self.struct.batch_search
Returns:	indexes_list – Is the list of the relevant indexes for each queries. len(indexes_list) = len(queries).
Return type:	list of (set or list)

class radbm.search.elba.fbeta.Fbeta(fq, fd, struct, log_match_prob, sim=<function multi_bernoulli_equality>, match_dist=0, nindex=1, ramp=<function Ramp.<locals>.ramp>)¶

The Fbeta model is an ELBA that use the Fbeta Loss to train the Multi-Bernoulli encoders.

Parameters:

fq (torch.nn.Module) – The query Multi-Bernoulli encoder.
fd (torch.nn.Module) – The document Multi-Bernoulli encoder.
struct (BaseSDS subclass) – The structure used in ELBA.
log_match_prob (float) – The log probability that there is a match given a random query and a random document.
sim (function (torch.Tensor imes torch.Tensor -> torch.Tensor)) – A function taking the query’s Multi-Bernoulli code and the document’s Multi-Bernoulli code and return the bitwise log probability that each bit match.
match_dist (int (optional)) – The maximum number of dissimilar bits for which we consider that two binary vectors are matching. (default=0)
nindex (int (optional)) – The number of indexes. (default=1)
ramp (function (optional)) – A ramping function used for ramping the log2(beta) at each steps.

step(q, d, match, l2_ratio=0.01, nbatch=None)¶

Do a training step.

Parameters:	q (torch.Tensor) – A batch of queries. d (torch.Tensor) – A batch of documents. match (torch.Tensor (dtype=torch.bool)) – A matrix (2D tensor) with match[i,j] indicating if q[i] match with d[j] l2_ratio (float (optional)) – The wanted ratio between the Fbeta Loss and L2 regularization. (default 0.01) nbatch (int or None (optional)) – Give the number of batch, this is uses for ramping. If None, this uses the final ramping value.
Returns:	loss – The loss of the current batch.
Return type:	torch.Tensor (size 1)

radbm.search.elba.fbeta.fbeta_loss(match, log_match, beta, log_match_prob)¶

Compute the Fbeta Loss

Parameters:	match (torch.Tensor (dtype=torch.bool)) – Indicate if there is a match or not log_match (torch.Tensor (dtype=torch.float)) – The log probability of matching given by an estimator beta (float) – The beta in the Fbeta Loss log_match_prob (float) – The log probability that there is a match given a random query and a random document.
Returns:	fbeta_loss – Containing 1 element, this is the Fbeta Loss of the estimator w.r.t match.
Return type:	torch.Tensor (dtype=torch.float)

class radbm.search.elba.hbkl.HBKL(fq, fd, struct, nbits, positive_binomial_prob, negative_binomial_prob, *args, **kwargs)¶

Hamming Binomial with KL-divergence

Parameters:

fq (torch.nn.Module) – The query Multi-Bernoulli encoder.
fd (torch.nn.Module) – The document Multi-Bernoulli encoder.
struct (BaseSDS subclass) – The structure used in ELBA.
nbits (int) – The number of bits
positive_binomial_prob (float [0,1]) – The target binomial probability for matching query-document pairs
negative_binomial_prob (float [0,1]) – The target binomial probability for non-matching query-document pairs

step(q, d, r)¶

Do a training step.

Parameters:	q (torch.Tensor) – A batch of queries. d (torch.Tensor) – A batch of documents. r (torch.Tensor (dtype=torch.bool)) – A matrix (2D tensor) with r[i,j] indicating if q[i] match with d[j]
Returns:	loss – The loss of HBKL.
Return type:	torch.Tensor (size 1)

class radbm.search.elba.mihash.MIHash(fq, fd, struct, nbits, match_prob, *args, **kwargs)¶

MIHash as in “MIHash: Online Hashing with Mutual Information” by Fatih Cakir, Kun He, Sarah Adel Bargal and Stan Sclaroff.

Parameters:	fq (torch.nn.Module) – The query Multi-Bernoulli encoder. fd (torch.nn.Module) – The document Multi-Bernoulli encoder. struct (BaseSDS subclass) – The structure used in ELBA. match_prob (float (in [0,1])) – The probability that there is a match given a random query and a random document.

step(q, d, match, l2_ratio=0)¶

Do a training step.

Parameters:	q (torch.Tensor) – A batch of queries. d (torch.Tensor) – A batch of documents. match (torch.Tensor (dtype=torch.bool)) – A matrix (2D tensor) with match[i,j] indicating if q[i] match with d[j]
Returns:	loss – The loss (negative mutual information) of the current batch.
Return type:	torch.Tensor (size 1)

class radbm.search.elba.mihash.TriangularKernel(centroids, widths=None)¶

Helper Module, compute the triangular kernel.

forward(x)¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

radbm.search.elba.mihash.categorical_entropy(cat)¶

-(cat*cat.log()).sum() without the annoying 0*inf

Parameters:	cat (torch.Tensor (ndim==1)) – The parameter of a Categorical distribution.
Returns:	ent – The entropy of the Categorical distribution.
Return type:	torch.Tensor (a single float)

radbm.search.elba.mihash.mi_categorical_bernoulli(pos_cat, neg_cat, p)¶

Compute the Multual Information between a categorical and a bernoulli. This use the fact that I(C, B) = H(C) - pH(C | B=1) - (1-p)H(C | B=0) with C = Cat(pi) and B = Ber(p).

Parameters:	pos_cat (torch.tensor (ndim=1, pos_cat.sum()=1)) – The parameters of C \| B=1 neg_cat (torch.tensor (ndim=1, neg_cat.sum()=1)) – The parameters of C \| B=0 p (float) – The parameters of B
Returns:	I – The Mutual Information I(C, B)
Return type:	torch.tensor (a single float)

class radbm.search.elba.hashnet.HashNet(fq, fd, struct, match_prob, alpha=1, beta=1, *args, **kwargs)¶

HashNet as in “HashNet: Deep Learning to Hash by Continuation” by Zhangjie Cao, Mingsheng Long, Jianmin Wang and Philip S. Yu.

Parameters:

fq (torch.nn.Module) – The query Multi-Bernoulli encoder.
fd (torch.nn.Module) – The document Multi-Bernoulli encoder.
struct (BaseSDS subclass) – The structure used in ELBA.
match_prob (float (in [0,1])) – The probability that there is a match given a random query and a random document.
alpha (float (optional)) – The HashNet’s alpha used in the adaptative sigmoid (default 1)
beta (float (optional)) – The HashNet’s beta used for ramping the stage in the tanh(stage*beta*x). (default 1)

step(q, d, r, stage=1)¶

Do a training step.

Parameters:	q (torch.Tensor) – A batch of queries. d (torch.Tensor) – A batch of documents. r (torch.Tensor (dtype=torch.bool)) – A matrix (2D tensor) with r[i,j] indicating if q[i] match with d[j] stage (float (optional)) – The stage for computing the tanh(stagebetax). (default=1)
Returns:	loss – The loss of HashNet.
Return type:	torch.Tensor (size 1)