Dataset and Database Loaders¶
Conjunctive Boolean RSS¶
-
class
radbm.loaders.rss.conjunctive_boolean.ConjunctiveBooleanRSS(k, l, m, n, n_queries=None, mode='balanced', n_positives=None, which='train', backend='numpy', device='cpu', rng=numpy.random)¶ -
generate_subsets(dterms)¶ Parameters: - documents (numpy.ndarray (ndim: 2, shape: (n, l))) – documents[i] is the i-th document.
- k (int) – The size of each query.
- rng (numpy.random.generator.Generator) –
Returns: queries – queries[i] is a subset of documents[i] sampled uniformly.
Return type: numpy.ndarray (ndim: 2, shape: (n, k))
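A minimal sketch of the sampling described above, written with the standard library rather than numpy so that it is self-contained. The function name `sample_queries` and the list-of-lists representation are assumptions made for illustration; this is not radbm's implementation.

```python
import random

def sample_queries(documents, k, rng=None):
    """For each document, draw a size-k subset of its terms uniformly at random."""
    if rng is None:
        rng = random.Random()
    # Random.sample draws k distinct elements without replacement,
    # so every size-k subset of a document is equally likely.
    return [rng.sample(list(doc), k) for doc in documents]

# Two toy documents with l = 4 index terms each.
documents = [[0, 3, 5, 7], [1, 2, 4, 9]]
queries = sample_queries(documents, k=2, rng=random.Random(0))
```

Passing a seeded generator, as in the usage above, makes the sampled queries reproducible.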
-
residual_log_prob(k, l, m)¶ Numerically stable numpy.log((comb(m-k, l-k)-1)/(comb(m, l)-1))
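One way such a quantity could be computed without overflowing is to work with log-binomials and use the identity log(C - 1) = log C + log(1 - 1/C). The sketch below follows that idea; the function names are hypothetical, it assumes k <= l and 0 < l < m, and it may differ from what radbm actually does.

```python
import math

def log_comb(n, r):
    """Natural log of the binomial coefficient C(n, r), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)

def log_comb_minus_one(n, r):
    """log(C(n, r) - 1), computed as log C + log1p(-exp(-log C))."""
    if r == 0 or r == n:
        # C(n, r) == 1, so C - 1 == 0 and its log is -inf.
        return -math.inf
    lc = log_comb(n, r)
    return lc + math.log1p(-math.exp(-lc))

def residual_log_prob_sketch(k, l, m):
    """log((C(m-k, l-k) - 1) / (C(m, l) - 1)) without forming huge integers."""
    return log_comb_minus_one(m - k, l - k) - log_comb_minus_one(m, l)

value = residual_log_prob_sketch(2, 5, 10)
```

For small arguments the result can be checked against the direct formula; for large m the direct formula overflows floating point while the log-space version stays finite.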
-
-
class
radbm.loaders.rss.mnist.MnistCB(k, l, m, n, n_queries=None, queries_transform='r0', documents_transform='r0', queries_noise=0, documents_noise=0, path=None, download=True, mode='balanced', n_positives=None, which='train', backend='numpy', device='cpu', rng=numpy.random)¶ Conjunctive Boolean RSS with MNIST index terms.
Parameters: - k (int) – The number of index terms per query.
- l (int) – The number of index terms per document.
- m (int) – The total number of index terms.
- n (int) – The database size.
- n_queries (int) – The number of queries to sample.
- queries_transform (str (optional)) – Should be one of ‘r0’, ‘r1’, ‘r2’, ‘r3’, ‘sr0’, ‘sr1’, ‘sr2’ or ‘sr3’. It applies a rotation and/or a reflection to the index terms of the queries. It follows dihedral group notation, e.g. sr2 indicates that a 180 degree rotation followed by a vertical reflection is performed. (default: ‘r0’, i.e. no transformation)
- documents_transform (str (optional)) – Same as queries_transform but for the documents’ index terms.
- path (str or None (optional)) – The path where the MNIST dataset is located. If None, it is searched for in the current directory and in the home directory. If it is not found and download is True, it will be downloaded to this location (or to the home directory if path is None). (default: None)
- download (bool (optional)) – Whether MNIST may be downloaded if it is not found. (default: True)
- mode (str (optional)) – Should be ‘balanced’ or ‘block’. It dictates the behaviour of the batch method (see the batch method). (default: ‘balanced’)
- which (str (optional)) – Should be ‘train’, ‘valid’ or ‘test’. The initial dataset to use. Changing this attribute directly results in undefined behaviour. To change which dataset is used, call train(), valid() or test(). (default: ‘train’)
- backend (str (optional)) – Should be ‘numpy’ or ‘torch’. It dictates the type of data produced by the batch, iter_queries and iter_documents methods. Changing this attribute directly results in undefined behaviour. To change the backend, call numpy() or torch(). (default: ‘numpy’)
- device (str (optional)) – Should be ‘cpu’ or ‘cuda’ and cannot be ‘cuda’ if backend is ‘numpy’. Similar to backend, to modify it we need to call cpu() or cuda(). (default: ‘cpu’)
- rng (np.random.generator.Generator) – The random number generator used to generate the batches and the database/queries. Should be used for reproducibility.
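The transform names accepted by queries_transform and documents_transform can be illustrated with a small pure-Python sketch. The helper names are hypothetical, and two conventions are assumptions rather than facts about radbm: rotations are taken counter-clockwise, and the reflection s is taken to flip the image upside down.

```python
def rot90(image):
    """Rotate a 2D grid (list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]

def reflect(image):
    """Flip the grid upside down (assumed meaning of the reflection 's')."""
    return [list(row) for row in image[::-1]]

VALID_TRANSFORMS = ('r0', 'r1', 'r2', 'r3', 'sr0', 'sr1', 'sr2', 'sr3')

def apply_transform(image, name):
    """Apply a dihedral transform such as 'r1' or 'sr2' to a 2D grid."""
    if name not in VALID_TRANSFORMS:
        raise ValueError(f'unknown transform: {name!r}')
    out = [list(row) for row in image]
    for _ in range(int(name[-1])):  # rotations are applied first
        out = rot90(out)
    if name.startswith('s'):        # then the optional reflection
        out = reflect(out)
    return out

image = [[1, 2],
         [3, 4]]
rotated = apply_transform(image, 'r2')     # 180 degree rotation
combined = apply_transform(image, 'sr2')   # rotation, then reflection
```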
-
batch(size, n_positives=None)¶ Parameters: - size (int) – The batch size, i.e. the number of queries and documents to return.
- n_positives (int (optional if mode!='unbalanced')) – The number of positive samples. If mode=='balanced', this is overridden to size//2.
Returns: - queries (np.ndarray or torch.Tensor (dtype: float, shape: (size, k, 28, 28))) – A batch of queries.
- documents (np.ndarray or torch.Tensor (dtype: float, shape: (size, l, 28, 28))) – A batch of documents.
- relevants (np.ndarray or torch.Tensor (dtype: bool, shape: (size,) or (size, size))) – If mode=='balanced', then relevants.shape is (size,) and relevants[i] indicates whether queries[i] matches documents[i] (i.e. queries[i]’s index terms are a subset of documents[i]’s index terms). By construction, relevants[:size//2] is always True and relevants[size//2:] is always False (this is why the mode is called ‘balanced’). Otherwise, if mode=='block', then relevants.shape is (size, size) and relevants[i, j] indicates whether queries[i] matches documents[j]. By construction, the diagonal (relevants[i, i]) is always True, while the off-diagonal entries may be True or False, depending on the probability that a random query matches a random document.
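The matching rule behind relevants (a query matches a document when its index terms are a subset of the document's) can be sketched as follows. Index terms are represented here as plain integers rather than images, and the helper names are hypothetical; this illustrates the two output shapes, not how radbm builds its batches.

```python
def matches(query_terms, document_terms):
    """A query matches a document if its terms are a subset of the document's."""
    return set(query_terms) <= set(document_terms)

def block_relevants(queries, documents):
    """'block' layout: entry [i][j] tells whether queries[i] matches documents[j]."""
    return [[matches(q, d) for d in documents] for q in queries]

def paired_relevants(queries, documents):
    """'balanced' layout: entry [i] tells whether queries[i] matches documents[i]."""
    return [matches(q, d) for q, d in zip(queries, documents)]

documents = [[0, 1, 2], [2, 3, 4]]
queries = [[2], [3, 4]]  # each query was drawn from the document at the same index

block = block_relevants(queries, documents)
paired = paired_relevants(queries, documents)
```

In this toy case the first query also matches the second document, which is the kind of incidental off-diagonal match the ‘block’ mode description refers to.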
-
iter_documents(batch_size, maximum=inf, rng=numpy.random)¶ Generator of the documents in the database with their indexes.
Parameters: - batch_size (int) – The batch size used for each yield.
- maximum (int (optional)) – The maximum number of documents to yield. (default: np.inf)
Yields: - documents (np.ndarray or torch.Tensor (dtype: float, shape: (batch_size, l, 28, 28))) – A batch of documents.
- indexes (list of int) – The index of each document, i.e. indexes[i] is the index of documents[i].
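The yield pattern described above (batches of items together with their indexes, capped by maximum) might look like the following sketch. The function `iter_batches` is hypothetical, and the choice to truncate the last batch when maximum is reached is an assumption, not documented radbm behaviour.

```python
import math

def iter_batches(items, batch_size, maximum=math.inf):
    """Yield (batch, indexes) pairs covering at most `maximum` items."""
    limit = min(len(items), maximum)  # an int, since len(items) is finite
    start = 0
    while start < limit:
        stop = min(start + batch_size, limit)
        # indexes[i] is the position of batch[i] in the full collection
        yield items[start:stop], list(range(start, stop))
        start = stop

database = ['a', 'b', 'c', 'd', 'e']
all_batches = list(iter_batches(database, batch_size=2))
capped = list(iter_batches(database, batch_size=2, maximum=3))
```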
-
iter_queries(batch_size, maximum=inf, rng=numpy.random)¶ Generator of the queries and the indexes of their respective relevant documents.
Parameters: - batch_size (int) – The batch size used for each yield.
- maximum (int (optional)) – The maximum number of queries to yield. (default: np.inf)
Yields: - queries (np.ndarray or torch.Tensor (dtype: float, shape: (batch_size, k, 28, 28))) – A batch of queries.
- relevants (list of list of int) – The indexes of the relevant documents of each query, i.e. relevants[i] is the list of indexes of the documents relevant to queries[i].
Creating custom database loaders¶
-
class
radbm.loaders.base.IRLoader(mode, which, backend, device, rng=None)¶ A subclass of Loader meant for Information Retrieval. This introduces the notion of mode, which governs the way batches are produced.
Parameters: mode (str) – should be in IRLoader.get_available_modes()
-
class
radbm.loaders.base.Loader(which, backend, device, rng=None)¶ An abstract class managing numpy vs torch, cpu vs gpu and train vs valid vs test. This should be subclassed for a particular dataset (e.g. MNIST).
Parameters: - which (str) – Which version of the dataset to use. Should be ‘train’, ‘valid’ or ‘test’.
- backend (str) – Which backend to use, should be ‘numpy’ or ‘torch’.
- device (str) – Which device to use, should be ‘cpu’ or ‘cuda’. ‘cuda’ is only available if backend=='torch'.
- rng (numpy.random.RandomState) – A random number generator for reproducibility.
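The constraints on which, backend and device listed above can be summarised in a small validation function. `check_loader_state` is a hypothetical helper written for illustration; radbm's own checks may be organised differently.

```python
def check_loader_state(which, backend, device):
    """Raise ValueError if the combination of settings is not allowed."""
    if which not in ('train', 'valid', 'test'):
        raise ValueError(f"which must be 'train', 'valid' or 'test', got {which!r}")
    if backend not in ('numpy', 'torch'):
        raise ValueError(f"backend must be 'numpy' or 'torch', got {backend!r}")
    if device not in ('cpu', 'cuda'):
        raise ValueError(f"device must be 'cpu' or 'cuda', got {device!r}")
    if device == 'cuda' and backend != 'torch':
        # numpy arrays live on the CPU, so 'cuda' requires the torch backend
        raise ValueError("device 'cuda' requires backend 'torch'")

check_loader_state('train', 'torch', 'cuda')  # allowed, returns None
```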
-
cpu()¶ Transfers all registered data (registered using register_switch) to the CPU.
-
cuda()¶ Transfers all registered data (registered using register_switch) to the GPU.
Raises: ValueError – If backend=='numpy'
-
dynamic_cast(data)¶ Casts data according to the current state of the class. E.g. when backend=='torch', device=='cuda' and the input data is a numpy.ndarray, the array will be converted to a torch.Tensor and transferred to the GPU.
Parameters: data (numpy.ndarray or torch.Tensor) – The data to cast.
Returns: casted_data – The cast data.
Return type: numpy.ndarray or torch.Tensor
-
get_rng()¶ Utility method to get the rng.
Returns: rng – The rng used inside the utility class TorchNumpyRNG.
Return type: numpy.random.RandomState
-
numpy()¶ Converts all registered data (registered using register_switch) into numpy.ndarray.
Raises: ValueError – If device=='cuda'
-
register_switch(name, data)¶ This function should only be used when subclassing. It registers data so that when a user calls numpy(), torch(), cpu() or cuda(), each registered value is converted to the appropriate format.
Parameters: - name (str) – The name of the data. setattr is used, so one can later use self.<name> to reach the data.
- data (numpy.ndarray or torch.Tensor) – The data to register.
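The registration pattern described above (store data under a name with setattr, then convert every registered value when the format changes) can be illustrated with a stand-in class. `SwitchableData` is hypothetical and uses list and tuple in place of numpy.ndarray and torch.Tensor so that it runs without either library; it is not radbm's Loader.

```python
class SwitchableData:
    """Hypothetical stand-in illustrating the register-then-convert pattern."""

    def __init__(self):
        self._registered = []   # names of the attributes to convert
        self.backend = 'list'   # stand-in for 'numpy'

    def register_switch(self, name, data):
        """Remember `name` and expose `data` as self.<name>."""
        self._registered.append(name)
        setattr(self, name, data)

    def _convert_all(self, convert):
        for name in self._registered:
            setattr(self, name, convert(getattr(self, name)))
        return self

    def as_list(self):          # plays the role of numpy()
        self.backend = 'list'
        return self._convert_all(list)

    def as_tuple(self):         # plays the role of torch()
        self.backend = 'tuple'
        return self._convert_all(tuple)

store = SwitchableData()
store.register_switch('terms', [1, 2, 3])
store.as_tuple()  # store.terms is now a tuple
```

Keeping the list of registered names in one place means a new format only needs one conversion function rather than per-attribute code.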
-
set_rng(rng)¶ Utility method to set the rng.
Parameters: rng (numpy.random.RandomState or TorchNumpyRNG) – The rng to use going forward.
Returns: self
Return type: Loader
-
test()¶ Switch to testing dataset.
-
torch()¶ Converts each registered data (using register_switch) into torch.Tensor
-
train()¶ Switch to training dataset.
-
valid()¶ Switch to validation dataset.