Search data structures

Basic search Data structures

class radbm.search.DictionarySearch
get_state()
Returns:table – The current hash table
Return type:dict
insert(document, index)

Insert a single document in the data structure by saving the index. The document is not saved for scalability. This is the default implementation that uses batch_insert. See batch_insert documentation.

Parameters:
  • document (numpy.ndarray or torch.Tensor) – The document to by inserted (will not be saved)
  • index (object) – A unique identifier for the document, some algorithms require index to be hashable.
  • *args – pass to batch_insert (see batch_insert for more details)
  • **kwargs – pass to batch_insert (see batch_insert for more details)
search(query)

Search in the data structure for the index of a documents. This is the default implementation that uses batch_search. See batch_search documentation.

Parameters:
  • query (numpy.ndarray or torch.Tensor) –
  • *args – pass to batch_search (see batch_search for more details)
  • **kwargs – pass to batch_search (see batch_search for more details)
Returns:

indexes – The indexes of the retrieved documents. If indexes is a list, it should indicate that the indexes are ordered.

Return type:

set or list

set_state(state)
Parameters:state (dict) – The hash table to use
class radbm.search.KeyValueHeap(*items, return_counts=False, key=<function identity>)

A key-value heap. I.e, the objects stored in the heap are not (necessarily) those used for comparisons. Similar to the built-in sorted algorithm, A custom key function can be supplied to customize the sort order. The code of the class is an adaptation of the standard python heapq algorithm that support the key keyword and returns the number of swaps and comparison of the insert and pop methods; useful for academic analysis.

Parameters:
  • *items (objects) – The initial to insert in the heap. The items themselve need not to be comparable. However, key(item1) < key(item2) must work.
  • return_counts (bool (optional)) – Whether or not to return the number of swaps and comparisons in the pop method. Those quantities will always be returned in the batch_insert and insert methods. (default: False)
  • key (function (optional)) – A function that takes an item an return a object (e.g. a float) that can be compared. (default: identity)
batch_insert(items, key=None)

A non-standard batch_insert methods. It does not take an indexes arguments.

Parameters:
  • items (iterables of objects) – The items to insert in the heap.
  • key (function or None (optional)) – A function that takes an item an return a object (e.g. a float) that can be compared. If None, the key provided at initialization will be used. (default: None)
Returns:

  • swap_count (int) – The total number of swaps performed in the heap while inserting every items.
  • comp_count (int) – The total number of comparisons performed in the heap while inserting every items.

Search in the data structure for the index of a documents. This is the default implementation that uses search. See search documentation.

Parameters:
  • query (numpy.ndarray or torch.Tensor) –
  • *args – pass to search (see search for more details)
  • **kwargs – pass to search (see search for more details)
Returns:

indexes_list – The indexes of the retrieved documents for each query. If indexes[i] is a list, it should indicate that the indexes are ordered.

Return type:

list of (set or list)

insert(*item, key=None)

A non-standard insert methods. It does not take an index arguments.

Parameters:
  • item (object) – The item to insert in the heap.
  • key (function or None (optional)) – A function that takes an item an return a object (e.g. a float) that can be compared. If None, the key provided at initialization will be used. (default: None)
Returns:

  • swap_count (int) – The total number of swaps performed in the heap while inserting the item.
  • comp_count (int) – The total number of comparisons performed in the heap while inserting the item.

keys(key=None)

Iter over the keys in the heap (not in order) using the key method.

Parameters:key (function or None (optional)) – A function that takes an item an return a object (e.g. a float) that can be compared. If None, the key provided at initialization will be used. (default: None)
Yields:k (comparable) – The result of key(item).
overwrite(heap)

overwrite the heap, no check is performed to see if this is a heap.

pop(return_counts=None, key=None)

Pop the smallest item off the heap, maintaining the heap invariant.

Parameters:
  • return_counts (boolean or None (optional)) – If True, the swap and comparison count will be given. If None, the return_counts provided at initialization will be used. (default: None)
  • key (function or None (optional)) – A function that takes an item an return a object (e.g. a float) that can be compared. If None, the key provided at initialization will be used. (default: None)
Returns:

  • minitem (object) – The smallest item in the heap
  • swap_count (int (if return_counts is True)) – The total number of swaps performed in the heap while inserting the item.
  • comp_count (int (if return_counts is True)) – The total number of comparisons performed in the heap while inserting the item.

search(key=None)

Search in the data structure for the index of a documents. This is the default implementation that uses batch_search. See batch_search documentation.

Parameters:
  • query (numpy.ndarray or torch.Tensor) –
  • *args – pass to batch_search (see batch_search for more details)
  • **kwargs – pass to batch_search (see batch_search for more details)
Returns:

indexes – The indexes of the retrieved documents. If indexes is a list, it should indicate that the indexes are ordered.

Return type:

set or list

sort(key=None)

Sort the item of the heap in linear time by popping them (O(n)) and reinserting in the heap. The new (sorted) order is still a heap and can still be used as such.

Parameters:key (function or None (optional)) – A function that takes an item an return a object (e.g. a float) that can be compared. If None, the key provided at initialization will be used. (default: None)
Returns:self – With the items sorted.
Return type:KeyValueHeap

Binary-based data structures

class radbm.search.binary.HammingMultiProbing(insert_radius=0, search_radius=0, probing='all', halt_cost=inf)

The binary version of the multi-probing algorithm Modeling lsh for performance tuning.

Parameters:
  • insert_radius (int (optional)) – The Hamming distance radius for insertion. The ball of radius insert_radius around each document will map to the document’s index in the hash table. Can be overwritten in the batch_insert and insert methods using the radius keyword. (default: 0)
  • search_radius (int (optional)) – The Hamming distance radius for searching. The ball of radius search_radius around each query will be looked up in the hash table. Can be overwritten in the batch_search and search methods using the radius keyword. (default: 0)
  • probing (str (optional)) – Should be ‘all’ or ‘align’. If ‘all’, search will retrieve all documents matching with one of the query’s tag. If ‘align’, search will retrieve document s.t. at least one of their i-th tag matchs with the i-th tag of the query. (default: ‘all’)
  • halt_cost (float (optional)) – The cost at which to halt when generating candidates in the itersearch and batch_itersearch methods. The value can be overwritten in those methods using the halt_cost keyword. (default: np.inf)
batch_insert(documents, indexes, radius=None)
Parameters:
  • documents (numpy.ndarray or torch.tensor (dtype: bool, ndim: 2 or 3)) – The documents, must be binary vectors.
  • indexes (iterable of hashable) – A sequence of unique identifier for each corresponding document.
  • radius (None or int (optional)) – The radius at which to insert. If None, self.insert_radius will be used. (default: None)
Returns:

self

Return type:

HammingMultiProbing

Raises:

TypeError – If documents.dtype is not boolean.

Parameters:
  • queries (numpy.ndarray or torch.tensor (dtype: bool, ndim: 2 or 3)) – The queries, must be binary vectors.
  • radius (None or int (optional)) – The radius at which to search. If None, self.search_radius will be used. (default: None)
Returns:

indexes – Each set contains the retrieved documents’ identifier of the corresponding query.

Return type:

list if sets

Raises:

TypeError – If queries.dtype is not boolean.

clear()

Clear the hash table

get_state()
Returns:tables – The current hash table
Return type:dict
itersearch(query, halt_cost=None, yield_cost=False, yield_empty=False, yield_duplicates=False)
Parameters:
  • query (numpy.ndarray or torch.tensor (dtype: bool, ndim: 1 or 2)) – The query, must be a binary vector.
  • halt_cost (None or float (optional)) – The cost at which to halt. If None, self.halt_cost will be used. (default: None)
  • yield_cost (bool (optional)) – Whether to give the total cost each time a set of candidates is yielded. (default: False)
  • yield_empty (bool (optional)) – Whether to yield an empty set when there is a table miss (i.e., when multi-probing tries to lookup for a binary vector not present in the database). (default: False)
  • yield_duplicates (bool (optional)) – Whether to yield documents’ identifier each time they are found. Even if they were found before. (default: False)
Yields:
  • indexes (set (if yield_cost is False (default))) – A set containing the retrieved documents’ identifier of the corresponding query.
  • indexes, cost (set, float (if yield_cost is True)) – If yield_cost is True, tuples will be yieled where costs is the total cost of the search so far.
Raises:

TypeError – If queries.dtype is not boolean.

set_state(state)
Parameters:state (dict) – The hash table to use
class radbm.search.binary.BernoulliMultiProbing(insert_number=1, search_number=1, insert_alternate_tags=True, search_alternate_tags=False, probing='all', halt_cost=inf, cost=<function BernoulliMultiProbing.<lambda>>)

The a variant of the multi-probing algorithm Modeling lsh for performance tuning where each the the documents are Multi-Bernoulli distrutions and what is inserted in the hash table is the most probable outcomes. Similarly, the queries are Multi-Bernoulli and the hash table search is performed using the most probable outcomes.

Parameters:
  • insert_number (int (optional)) – The number of outcomes to insert. Each document’s outcomes will map to the document’s index in the hash table. Can be overwritten in the batch_insert and insert methods using the number keyword. (default: 1)
  • search_number (int (optional)) – The number of outcomes to search for in the hash table. The query’s search_number most probable outcomes will be looked up in the hash table. Can be overwritten in the batch_search and search methods using the number keyword. (default: 1)
  • insert_alternate_tags (bool (optional)) – Whether to take outcomes from each Multi-Bernoulli in rotation, discarding the probability of the outcomes. This only affect the batch_insert and insert methods and can be overwritten when calling them. (default: True)
  • search_alternate_tags (bool (optional)) – Whether to take outcomes from each Multi-Bernoulli in rotation, discarding the probability of the outcomes. This only affect the batch_search and search methods and can be overwritten when calling them. (default: False)
  • probing (str (optional)) – Should be ‘all’ or ‘align’. If ‘all’, search will retrieve all documents matching with one of the query’s tag. If ‘align’, search will retrieve document s.t. at least one of their i-th tag matchs with the i-th tag of the query. (default: ‘all’)
  • halt_cost (float (optional)) – The cost at which to halt when generating candidates in the itersearch and batch_itersearch methods. The value can be overwritten in those methods using the halt_cost keyword.
  • cost (function (optional) (int swap, int comp, int size) -> float cost) – The cost function tells how the itertsearch show compute the engine cost. Swap, comp and size are all statistics from the heap.pop() and heap.push() methods used to compute the outcomes in decreasing order of probabilities. Swap is the number of heap swap, comp is the number of heap comparison, and size is the current heap size. (default: lambda swap,comp,size: swap+comp+1)
batch_insert(documents, indexes, number=None, alternate_tags=None)
Parameters:
  • documents (torch.Tensor or np.ndarray (dtype: float, ndim: 2 or 3)) – The logits (i.e., before applying the sigmoid) of the Multi-Bernoulli. The first dimension corresponds to the batch, the last corresponds to the bits and if ndim==3, the second dimension is for multiple Multi-Bernoulli.
  • indexes (iterable of hashable) – A sequence of unique identifier for each corresponding document.
  • number (None or int (optional)) – The number of outcomes to insert. If None, self.insert_number will be used. (default: None)
  • alternate_tags (None or bool (optional)) – Overwrite the insert_alternate_tags. (default: None)
Returns:

self

Return type:

BernoulliMultiProbing

Raises:
  • TypeError – If documents.dtype is not a float.
  • ValueError – If len(documents) != len(indexes).
batch_itersearch(queries, alternate_tags=None, halt_cost=None, yield_cost=False, yield_empty=False, yield_duplicates=False)
Parameters:
  • queries (torch.Tensor or numpy.ndarray(dtype: float, ndim: 2 or 3)) – The logits (i.e., before applying the sigmoid) of the Multi-Bernoulli. The first dimension corresponds to the batch, the last corresponds to the bits and if ndim==3, the second dimension is for multiple Multi-Bernoulli.
  • alternate_tags (None or bool (optional)) – Overwrite the search_alternate_tags. (default: None)
  • halt_cost (None or float (optional)) – The cost at which to halt. If None, self.halt_cost will be used. (default: None)
  • yield_cost (bool (optional)) – Whether to give the total cost each time a set of candidates is yielded. (default: False)
  • yield_empty (bool (optional)) – Whether to yield an empty set when there is a table miss (i.e., when multi-probing tries to lookup for a binary vector not present in the database). (default: False)
  • yield_duplicates (bool (optional)) – Whether to yield documents’ identifier each time they are found. Even if they were found before. (default: False)
Yields:
  • indexes (set (if yield_cost is False (default))) – A set containing the retrieved documents’ identifier of the corresponding query.
  • indexes, cost (set, float (if yield_cost is True)) – If yield_cost is True, tuples will be yieled where costs is the total cost of the search so far.
Raises:

TypeError – If queries.dtype is not a float.

Parameters:
  • queries (torch.Tensor or numpy.ndarray(dtype: float, ndim: 2 or 3)) – The logits (i.e., before applying the sigmoid) of the Multi-Bernoulli. The first dimension corresponds to the batch, the last corresponds to the bits and if ndim==3, the second dimension is for multiple Multi-Bernoulli.
  • number (None or int (optional)) – The number of outcomes to search for in the hash table. If None, self.search_number will be used. (default: None)
  • alternate_tags (None or bool (optional)) – Overwrite the search_alternate_tags. (default: None)
Returns:

indexes – Each set contains the retrieved documents’ identifier of the corresponding query.

Return type:

list of sets

Raises:

TypeError – If queries.dtype is not a float.

clear()

Clear the hash table

get_state()
Returns:tables – The current hash table
Return type:dict
set_state(state)
Parameters:state (dict) – The hash table to use

Superset search data structure

class radbm.search.superset.SupersetTrieSearch(halt_cost=inf, search_type='dfs')

This implement the superset search algorithm described in A New Method to Index and Query Sets where the documents and queries are indexed by sets.

Parameters:
  • halt_cost (float (optional)) – The number of outcomes to insert. Each document’s outcomes will map to the document’s index in the hash table. Can be overwritten in the batch_insert and insert methods using the number keyword. (default: inf)
  • search_type (str (optional)) – Should be either ‘dfs’ or ‘bfs’ for depth first search or breadth first search respectively. This only affects the itersearch method by changing the order the document will be outputed. This does not affects the search method since it wait for every relevant node to be explored. (default: ‘dfs’)
insert(document, index)
Parameters:
  • document (iterable of hashable with order (e.g. int or str)) – The document’s code
  • indexes (hashable) – A unique identifier for the document.
Returns:

self

Return type:

SupersetTrieSearch

itersearch(query, halt_cost=None, yield_cost=False, yield_empty=False, yield_duplicates=False, search_type=None)
Parameters:
  • query (iterable of hashable with order (e.g. int or str)) – The query’s code to search for superset.
  • halt_cost (None or float (optional)) – The cost at which to halt. If None, self.halt_cost will be used. (default: None)
  • yield_cost (bool (optional)) – Whether to give the total cost each time a set of candidates is yielded. (default: False)
  • yield_empty (bool (optional)) – Whether to yield an empty set when we explore a possible node which could hold a superset the data was inserted. (default: False)
  • yield_duplicates (bool (optional)) – Whether to yield documents’ identifier each time they are found. Even if they were found before. (default: False)
  • search_type (None, str (optional)) – It overwrite the search_type set at initialization. If None it self.search_type will be used. (default: ‘dfs’)
Yields:
  • indexes (set (if yield_cost is False (default))) – A set containing the retrieved documents’ identifier of the corresponding query.
  • indexes, cost (set, float (if yield_cost is True)) – If yield_cost is True, tuples will be yieled where costs is the total cost of the search so far.
Raises:

ValueError – If search_type not ‘dfs’ nor ‘bfs’.

search(query)
Parameters:query (iterable of hashable with order (e.g. int or str)) – The query’s code to search for superset.
Returns:indexes – A set containing the retrieved documents’ identifier of the corresponding query.
Return type:set
class radbm.search.superset.PrioritySupersetTrieSearch(halt_cost=inf, search_type='dfs')

This implement a variant of the superset search algorithm described in A New Method to Index and Query Sets where the documents and are indexed by sets but where the queries provide a priority over each element in the set. When searching, the algorithm will assume the query contains every possible element to search in the trie. After this exploration is done, the search algorithm will remove the element with the least priority and resume the exploration of the trie. This repeat until every document is generated or up to a halt_cost.

Notes

The documents should be iterable of int in {0, 1, …, n} where n is the number of possible elements.

Parameters:
  • halt_cost (float (optional)) – At what cost we stop searching. Each node considered cost 1 and looking inside the hash table cost 1.
  • search_type (str (optional)) – Should be either ‘dfs’ or ‘bfs’ for depth first search or breadth first search respectively. This only affects the itersearch method by changing the order the document will be outputed. This does not affects the search method since it wait for every relevant node to be explored. (default: ‘dfs’)
insert(document, index)
Parameters:
  • document (iterable of hashable with order (e.g. int or str)) – The document’s code
  • indexes (hashable) – A unique identifier for the document.
Returns:

self

Return type:

PrioritySupersetTrieSearch

itersearch(query, halt_cost=None, yield_cost=False, yield_empty=False, yield_duplicates=False, search_type=None, start_phase=0)
Parameters:
  • query (torch.Tensor or numpy.ndarray(dtype: float, ndim: 2)) – The priorities of each element’s presence in the query’s set. The search will start with every element present (except whan start_phase is provided and greater than zero) and elements will be remove one at a time starting with the element having the less (smaller) priority to further search in the trie.
  • halt_cost (None or float (optional)) – The cost at which to halt. If None, self.halt_cost will be used. (default: None)
  • yield_cost (bool (optional)) – Whether to give the total cost each time a set of candidates is yielded. (default: False)
  • yield_empty (bool (optional)) – Whether to yield an empty set when we explore a possible node which could hold a superset the data was inserted. (default: False)
  • yield_duplicates (bool (optional)) – Whether to yield documents’ identifier each time they are found. Even if they were found before. (default: False)
  • search_type (None, str (optional)) – It overwrite the search_type set at initialization. If None it self.search_type will be used. (default: ‘dfs’)
  • start_phase (int (optional)) – How many element to remove from the query’s set before exploring. (default: 0)
Yields:
  • indexes (set (if yield_cost is False (default))) – A set containing the retrieved documents’ identifier of the corresponding query.
  • indexes, cost (set, float (if yield_cost is True)) – If yield_cost is True, tuples will be yieled where costs is the total cost of the search so far.
Raises:

ValueError – If search_type not ‘dfs’ nor ‘bfs’.

search(query, halt_cost=None)
Parameters:
  • query (torch.Tensor or numpy.ndarray(dtype: float, ndim: 2)) – The priorities of each element’s presence in the query’s set. The search will start with every element present (except whan start_phase is provided and greater than zero) and elements will be remove one at a time starting with the element having the less (smaller) priority to further search in the trie.
  • halt_cost (None or float (optional)) – The cost at which to halt. If None, self.halt_cost will be used. (default: None)
Returns:

indexes – A set containing the retrieved documents’ identifier of the corresponding query.

Return type:

set

Search data structure reduction

class radbm.search.reduction.HammingReduction(fq, fd, struct)
class radbm.search.reduction.BernoulliReduction(fq, fd, struct)

Creating custom data structures

BaseSDS is the base class used to construct new data structures. Only one of batch_insert or insert method needs to be overwritten. Same for batch_search, search and batch_itersearch, itersearch. One can implement get_state and set_state to use the save and load methods.

class radbm.search.base.BaseSDS

Base Search Data Structure, need to be instantiate. Maintains an index for documents to be retrieved with query. When queried the appropriate index(es) will be returned (not the document(s))

batch_insert(documents, indexes, *args, **kwargs)

Insert a multiple documents in the data structure by saving their index. The documents are not saved for scalability. This is the default implementation that uses insert. See insert documentation.

Parameters:
  • documents (numpy.ndarray or torch.Tensor) – The documents to by inserted (will not be saved). The first dimension is for the batch (i.e. documents[0] is a document)
  • indexes (object) – A unique identifier for the document, some algorithms require index to be hashable.
  • *args – pass to insert (see insert for more details)
  • **kwargs – pass to insert (see insert for more details)
batch_itersearch(queries, *args, **kwargs)

Default function that return a list of generator using itersearch. With ith generator corresponding to the ith query.

Parameters:
  • queries (numpy.ndarray or torch.Tensor) –
  • *args – pass to itersearch (see itersearch for more details)
  • **kwargs – pass to itersearch (see itersearch for more details)
Returns:

itersearch_list – where the ith generetor yields indexes (set or list) that correspond to the ith query.

Return type:

list of generator

Search in the data structure for the index of a documents. This is the default implementation that uses search. See search documentation.

Parameters:
  • query (numpy.ndarray or torch.Tensor) –
  • *args – pass to search (see search for more details)
  • **kwargs – pass to search (see search for more details)
Returns:

indexes_list – The indexes of the retrieved documents for each query. If indexes[i] is a list, it should indicate that the indexes are ordered.

Return type:

list of (set or list)

insert(document, index, *args, **kwargs)

Insert a single document in the data structure by saving the index. The document is not saved for scalability. This is the default implementation that uses batch_insert. See batch_insert documentation.

Parameters:
  • document (numpy.ndarray or torch.Tensor) – The document to by inserted (will not be saved)
  • index (object) – A unique identifier for the document, some algorithms require index to be hashable.
  • *args – pass to batch_insert (see batch_insert for more details)
  • **kwargs – pass to batch_insert (see batch_insert for more details)
itersearch(query, *args, **kwargs)

Default generator, based on batch_itersearch, that yields indexes.

Parameters:
  • query (numpy.ndarray or torch.Tensor) –
  • *args – pass to batch_itersearch (see batch_itersearch for more details)
  • **kwargs – pass to batch_itersearch (see batch_itersearch for more details)
Returns:

Return type:

Iterable of set of indexes.

search(query, *args, **kwargs)

Search in the data structure for the index of a documents. This is the default implementation that uses batch_search. See batch_search documentation.

Parameters:
  • query (numpy.ndarray or torch.Tensor) –
  • *args – pass to batch_search (see batch_search for more details)
  • **kwargs – pass to batch_search (see batch_search for more details)
Returns:

indexes – The indexes of the retrieved documents. If indexes is a list, it should indicate that the indexes are ordered.

Return type:

set or list

To create custom reduction:

class radbm.search.reduction.base.ModulePointwiseReduction(struct)

Abstract class, queries_reduction and documents_reduction methods need to be overwritten. A Pointwise reduction is the simplest form of reduction. Each document is transformed without taking the other documents in consideration. Similarly, each query is transformed without looking at the database.

Direct subclass of PointwiseReduction that implement get_state and set_state. Allowing the save and load methods. This is done in a way that any attribute of the class torch.nn.Module will be saved.

Parameters:struct (BaseSDS subclass) – The data structure to reduce to.
class radbm.search.reduction.base.PointwiseReduction(struct)

Abstract class, queries_reduction and documents_reduction methods need to be overwritten. A Pointwise reduction is the simplest form of reduction. Each document is transformed without taking the other documents in consideration. Similarly, each query is transformed without looking at the database.

Parameters:struct (BaseSDS subclass) – The data structure to reduce to.
batch_insert(documents, indexes, *args, **kwargs)

Insert the index of each documents in the data structure

Parameters:
  • documents (torch.tensor) – The documents to insert the first dim being the batch.
  • indexes (iterable of hashable) – most notable example is a list of int. len(indexes) most be equal to len(documents).
  • *args – passed to self.struct.batch_insert
  • **kwargs – passed to self.struct.batch_insert
Returns:

Return type:

self

batch_itersearch(queries, *args, **kwargs)

Iteratively search in the data structure for the relevant indexes for each queries.

Parameters:
  • queries (torch.tensor) – The search queries, the first dim being the batch.
  • *args – passed to self.struct.batch_itersearch
  • **kwargs – passed to self.struct.batch_itersearch
Returns:

generator_list – Each generator yield relevant indexes for the corresponding queries. len(generator_list) = len(queries).

Return type:

list of generators (of set or list)

Search in the data structure for the relevant indexes for each queries.

Parameters:
  • queries (torch.tensor) – The search queries, the first dim being the batch.
  • *args – passed to self.struct.batch_search
  • **kwargs – passed to self.struct.batch_search
Returns:

indexes_list – Is the list of the relevant indexes for each queries. len(indexes_list) = len(queries).

Return type:

list of (set or list)