Metrics for evaluating the searches
-
radbm.metrics.hamming.conditional_hamming_counts(documents, queries, relevances, batch_size=100) Compute the number of relevant and irrelevant matches at each Hamming distance.
Parameters: - documents (torch.Tensor (2D, dtype=torch.bool)) – The binary representation of a batch of documents (the database). documents[i] is the ith document. documents.shape[1] must equal queries.shape[1], and documents should be on the same device as queries.
- queries (torch.Tensor (2D, dtype=torch.bool)) – The binary representation of a batch of queries. queries[i] is the ith query. queries.shape[1] must equal documents.shape[1], and queries should be on the same device as documents.
- relevances (list of set of int) – len(relevances) must equal len(queries). For each query, it gives the set of relevant documents (identified by their indexes). Explicitly, j in relevances[i] iff queries[i] matches documents[j].
- batch_size (int (optional)) – The number of queries for which the Hamming distances are computed at a time. If it is too big, the results might not fit in RAM (or on the GPU). (default 100)
Returns: - relevant_counts (torch.Tensor (1D, dtype=torch.float)) – len(relevant_counts) = queries.shape[1] + 1 (also equal to documents.shape[1] + 1) and relevant_counts[i] is the number of relevant documents at Hamming distance i.
- irrelevant_counts (torch.Tensor (1D, dtype=torch.float)) – len(irrelevant_counts) = queries.shape[1] + 1 (also equal to documents.shape[1] + 1) and irrelevant_counts[i] is the number of irrelevant documents at Hamming distance i.
Notes
This assumes that each set in relevances is small compared to len(documents). It should be run on a GPU; otherwise it is quite slow.
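As a rough illustration of what the two returned vectors contain, the following pure-Python sketch computes the same kind of counts by brute force. It is not the radbm implementation: the name conditional_hamming_counts_reference is hypothetical, it works on nested lists of bool rather than torch tensors, it ignores batching, and it returns lists of int instead of float tensors.

```python
def conditional_hamming_counts_reference(documents, queries, relevances):
    """Brute-force sketch (hypothetical helper, not part of radbm).

    documents, queries: lists of equal-length lists of bool.
    relevances: list of sets; relevances[i] holds the indexes of the
    documents relevant to queries[i].
    """
    n_bits = len(queries[0])
    relevant_counts = [0] * (n_bits + 1)
    irrelevant_counts = [0] * (n_bits + 1)
    for i, query in enumerate(queries):
        for j, document in enumerate(documents):
            # Hamming distance: number of positions where the bits differ.
            dist = sum(q != d for q, d in zip(query, document))
            if j in relevances[i]:
                relevant_counts[dist] += 1
            else:
                irrelevant_counts[dist] += 1
    return relevant_counts, irrelevant_counts


documents = [
    [True, True, False],
    [False, False, False],
    [True, False, True],
]
queries = [
    [True, True, False],
    [False, False, True],
]
relevances = [{0}, {2}]

relevant_counts, irrelevant_counts = conditional_hamming_counts_reference(
    documents, queries, relevances
)
print(relevant_counts)    # [1, 1, 0, 0]
print(irrelevant_counts)  # [0, 1, 2, 1]
```

Both vectors have length 3 + 1 = 4 because the codes are 3 bits long, and together they account for all 2 × 3 query-document pairs.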
-
radbm.metrics.hamming.hamming_distance(x, y, *args, **kwargs) Compute the Hamming distance.
Parameters: - x (torch.Tensor (dtype=torch.bool)) –
- y (torch.Tensor (dtype=torch.bool)) –
- *args – Passed to sum
- **kwargs – Passed to sum
Returns: z – The Hamming distance between x and y
Return type: torch.Tensor (dtype=torch.int64)
-
radbm.metrics.hamming.hamming_pr_curve(documents, queries, relevances, batch_size=100, return_valid_dists=False) Compute the precision-recall curve w.r.t. the Hamming distance, i.e. the precision and recall for each Hamming distance decision threshold.
Parameters: - documents (torch.Tensor (2D, dtype=torch.bool)) – The binary representation of a batch of documents (the database). documents[i] is the ith document. documents.shape[1] must equal queries.shape[1], and documents should be on the same device as queries.
- queries (torch.Tensor (2D, dtype=torch.bool)) – The binary representation of a batch of queries. queries[i] is the ith query. queries.shape[1] must equal documents.shape[1], and queries should be on the same device as documents.
- relevances (list of set of int) – len(relevances) must equal len(queries). For each query, it gives the set of relevant documents (identified by their indexes). Explicitly, j in relevances[i] iff queries[i] matches documents[j].
- batch_size (int (optional)) – The number of queries for which the Hamming distances are computed at a time. If it is too big, the results might not fit in RAM (or on the GPU). (default 100)
- return_valid_dists (bool (optional)) – Some distances might have an undefined precision. In those cases the returned value is nan by default. If return_valid_dists is True, those values are omitted and dists is returned along with precisions and recalls. See the Returns section for more info. (default False)
Returns: - dists (torch.Tensor (1D, dtype=torch.int64) if return_valid_dists is True) – Only present if return_valid_dists is True. It corresponds to the valid distances, i.e. those where the precision is defined.
- precisions (torch.Tensor (1D, dtype=torch.float)) – If return_valid_dists is False, len(precisions) = queries.shape[1] + 1 (also equal to documents.shape[1] + 1) and precisions[i] is the precision w.r.t. a Hamming distance of i. Otherwise, len(precisions) = len(dists) and precisions[i] is the precision w.r.t. a Hamming distance of dists[i].
- recalls (torch.Tensor (1D, dtype=torch.float)) – If return_valid_dists is False, len(recalls) = queries.shape[1] + 1 (also equal to documents.shape[1] + 1) and recalls[i] is the recall w.r.t. a Hamming distance of i. Otherwise, len(recalls) = len(dists) and recalls[i] is the recall w.r.t. a Hamming distance of dists[i].
Notes
This assumes that each set in relevances is small compared to len(documents). It should be run on a GPU; otherwise it is quite slow.
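One plausible reading of the decision threshold is that a document is retrieved when its Hamming distance to the query is at most i. Under that assumption (the text above does not say whether the threshold is inclusive), a curve of this kind can be sketched from per-distance counts of relevant and irrelevant documents. The helper name pr_curve_from_counts and the example counts are hypothetical, and the sketch uses plain Python lists rather than torch tensors.

```python
import math
from itertools import accumulate


def pr_curve_from_counts(relevant_counts, irrelevant_counts):
    """Sketch of a precision-recall curve over Hamming thresholds.

    Assumes (not confirmed by the source) that threshold i retrieves every
    document at Hamming distance <= i.
    """
    cum_relevant = list(accumulate(relevant_counts))      # true positives
    cum_irrelevant = list(accumulate(irrelevant_counts))  # false positives
    total_relevant = cum_relevant[-1]
    precisions, recalls = [], []
    for tp, fp in zip(cum_relevant, cum_irrelevant):
        retrieved = tp + fp
        # Precision is undefined (nan) when nothing is retrieved yet.
        precisions.append(tp / retrieved if retrieved else math.nan)
        recalls.append(tp / total_relevant if total_relevant else math.nan)
    return precisions, recalls


# Hypothetical counts for 2-bit codes: nothing lies at distance 0.
relevant_counts = [0, 2, 0]
irrelevant_counts = [0, 1, 3]

precisions, recalls = pr_curve_from_counts(relevant_counts, irrelevant_counts)

# Analogue of return_valid_dists=True: keep only distances with a defined precision.
valid_dists = [d for d, p in enumerate(precisions) if not math.isnan(p)]
valid_precisions = [precisions[d] for d in valid_dists]
valid_recalls = [recalls[d] for d in valid_dists]
```

Here the precision at distance 0 is nan because no document is retrieved at that threshold, which is the situation return_valid_dists is meant to handle.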
-
radbm.metrics.sswr.ChronoSSWR(relevant, delta_generator, N, eta=1, recall=1, allow_halt=False, on_duplicate_candidates='raise') The Chronometer Sequential Search Work Ratio (SSWR) metric used for quick retrieval tasks. It uses the generating time (in seconds) as a measure of the work done by the delta_generator.
Parameters: - relevant (set of index) – The set of elements that need to be retrieved
- delta_generator (generator of set of index) – This should generate sets of candidates (indexes)
- N (int) – The number of documents in the database
- eta (positive float (optional)) – The relative importance of generating time (in seconds) versus oracle calls, e.g. eta=2 implies that 1 second of generating time is equivalent to 2 oracle calls. (default 1)
- recall (float in [0,1] (optional)) – The minimal fraction of relevant documents that should be generated (default 1)
- allow_halt (bool (optional)) – Allow the generator to stop before generating all the needed indexes. This is similar, but not equal, to the case where the generator produces all the other indexes at once and then stops. (default False)
- on_duplicate_candidates (str (optional)) – Should be 'raise' or 'ignore'. Sets what should be done if the same index is generated twice. If 'raise', a RuntimeError will be raised. Otherwise, if 'ignore', the duplicated candidate(s) will be removed. (default 'raise')
Returns: work_ratio – By default, the SSWR. If allow_halt is True, a tuple with the SSWR and a boolean indicating whether the generator halted abruptly.
Return type: float (or (float, bool) if allow_halt)
Raises: - ValueError – If on_duplicate_candidates is not in {'raise', 'ignore'}.
- RuntimeError – If on_duplicate_candidates=='raise' and an index is generated twice.
- LookupError – If delta_generator stops without generating enough relevant documents.
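To make the delta_generator argument more concrete, here is a minimal sketch of a generator that yields sets of candidate indexes in order of increasing Hamming distance from a query. Everything in it is an assumption for illustration: the name delta_generator_by_distance is hypothetical, the codes are plain lists of bool, and skipping empty sets is a design choice of this sketch rather than a documented requirement.

```python
def delta_generator_by_distance(query, documents):
    """Yield sets of document indexes, nearest Hamming distance first.

    Hypothetical example of a delta generator; each index is produced
    exactly once, so no duplicate candidates are generated.
    """
    n_bits = len(query)
    buckets = [set() for _ in range(n_bits + 1)]
    for j, document in enumerate(documents):
        dist = sum(q != d for q, d in zip(query, document))
        buckets[dist].add(j)
    for bucket in buckets:
        if bucket:  # skip distances at which no document lies
            yield bucket


documents = [
    [True, True, False],
    [False, False, False],
    [True, False, True],
]
query = [True, True, False]

deltas = list(delta_generator_by_distance(query, documents))
print(deltas)  # [{0}, {1, 2}]
```

A generator of this shape could presumably be passed as delta_generator, with relevant holding the indexes of the matching documents and N = len(documents).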
-
radbm.metrics.sswr.CounterSSWR(relevant, delta_generator, N, eta=1, recall=1, allow_halt=False, on_duplicate_candidates='raise') The Counter Sequential Search Work Ratio (SSWR) metric used for quick retrieval tasks. It uses the number of calls to the generator as a measure of the work done by the delta_generator.
Parameters: - relevant (set of index) – The set of elements that need to be retrieved
- delta_generator (generator of set of index) – This should generate sets of candidates (indexes)
- N (int) – The number of documents in the database
- eta (positive float (optional)) – The relative importance of one generator call versus one oracle call, e.g. eta=2 implies that 1 generator call is equivalent to 2 oracle calls. (default 1)
- recall (float in [0,1] (optional)) – The minimal fraction of relevant documents that should be generated (default 1)
- allow_halt (bool (optional)) – Allow the generator to stop before generating all the needed indexes. This is similar, but not equal, to the case where the generator produces all the other indexes at once and then stops. (default False)
- on_duplicate_candidates (str (optional)) – Should be 'raise' or 'ignore'. Sets what should be done if the same index is generated twice. If 'raise', a RuntimeError will be raised. Otherwise, if 'ignore', the duplicated candidate(s) will be removed. (default 'raise')
Returns: work_ratio – By default, the SSWR. If allow_halt is True, a tuple with the SSWR and a boolean indicating whether the generator halted abruptly.
Return type: float (or (float, bool) if allow_halt)
Raises: - ValueError – If on_duplicate_candidates is not in {'raise', 'ignore'}.
- RuntimeError – If on_duplicate_candidates=='raise' and an index is generated twice.
- LookupError – If delta_generator stops without generating enough relevant documents.
-
radbm.metrics.sswr.HaltSSWR(relevant, monitor, N, max_halt, recall=1, on_duplicate_candidates='raise') The Sequential Search Work Ratio (SSWR) metric used for quick retrieval tasks. This function uses monitor as the delta generator and monitor.get_value() for the cost of generating all the previous delta candidates. One might consider using ChronoSSWR or CounterSSWR, which implement a precise monitor, instead.
Parameters: - relevant (set of index) – The set of elements that need to be retrieved
- monitor (object) – Iterator of sets of indexes that also implements get_value() -> float
- N (int) – The number of documents in the database
- max_halt (int) – The maximum number of delta_candidates the generator is allowed to give before halting.
- recall (float in [0,1] (optional)) – The minimal fraction of relevant documents that should be generated (default 1)
- on_duplicate_candidates (str (optional)) – Should be 'raise' or 'ignore'. Sets what should be done if the same index is generated twice. If 'raise', a RuntimeError will be raised. Otherwise, if 'ignore', the duplicated candidate(s) will be removed. (default 'raise')
Returns: - sswrs (np.ndarray (shape=(max_halt+1), dtype=np.float)) – The SSWR w.r.t. every halting point between 0 and max_halt, i.e. sswrs[i] is the SSWR with halting i.
- halts (np.ndarray (shape=(max_halt+1), dtype=np.bool)) – halts[i] indicates whether halting was used, i.e. halts[i]==False implies that all documents were found before the ith delta_candidates were produced
Raises: - ValueError – If on_duplicate_candidates is not in {'raise', 'ignore'}.
- RuntimeError – If on_duplicate_candidates=='raise' and an index is generated twice.
Notes
The aforementioned index is any hashable object (i.e. hash(index) exists) that uniquely identifies a document in the database. It could be the document itself, but that would be memory heavy. The most common choice is an integer, but in some cases it is practical to carry more information, for example packed in a tuple.
-
radbm.metrics.sswr.HaltingChronoSSWR(relevant, delta_generator, N, max_halt, eta=1, recall=1, on_duplicate_candidates='raise') The Chronometer Sequential Search Work Ratio (SSWR) metric used for quick retrieval tasks. It uses the number of calls to the generator as a measure of the work done by the delta_generator, w.r.t. every possible halting point from 0 to max_halt.
Parameters: - relevant (set of index) – The set of elements that need to be retrieved
- delta_generator (generator of set of index) – This should generate sets of candidates (indexes)
- N (int) – The number of documents in the database
- max_halt (int) – The maximum number of delta_candidates the generator is allowed to give before halting.
- eta (positive float (optional)) – The relative importance of one generator call versus one oracle call, e.g. eta=2 implies that 1 generator call is equivalent to 2 oracle calls. (default 1)
- recall (float in [0,1] (optional)) – The minimal fraction of relevant documents that should be generated (default 1)
- on_duplicate_candidates (str (optional)) – Should be 'raise' or 'ignore'. Sets what should be done if the same index is generated twice. If 'raise', a RuntimeError will be raised. Otherwise, if 'ignore', the duplicated candidate(s) will be removed. (default 'raise')
Returns: - sswrs (np.ndarray (shape=(max_halt+1), dtype=np.float)) – The SSWR w.r.t. every halting point between 0 and max_halt, i.e. sswrs[i] is the SSWR with halting i.
- halts (np.ndarray (shape=(max_halt+1), dtype=np.bool)) – halts[i] indicates whether halting was used, i.e. halts[i]==False implies that all documents were found before the ith delta_candidates were produced
Raises: - ValueError – If on_duplicate_candidates is not in {'raise', 'ignore'}.
- RuntimeError – If on_duplicate_candidates=='raise' and an index is generated twice.
-
radbm.metrics.sswr.HaltingCounterSSWR(relevant, delta_generator, N, max_halt, eta=1, recall=1, on_duplicate_candidates='raise') The Counter Sequential Search Work Ratio (SSWR) metric used for quick retrieval tasks. It uses the number of calls to the generator as a measure of the work done by the delta_generator, w.r.t. every possible halting point from 0 to max_halt.
Parameters: - relevant (set of index) – The set of elements that need to be retrieved
- delta_generator (generator of set of index) – This should generate sets of candidates (indexes)
- N (int) – The number of documents in the database
- max_halt (int) – The maximum number of delta_candidates the generator is allowed to give before halting.
- eta (positive float (optional)) – The relative importance of one generator call versus one oracle call, e.g. eta=2 implies that 1 generator call is equivalent to 2 oracle calls. (default 1)
- recall (float in [0,1] (optional)) – The minimal fraction of relevant documents that should be generated (default 1)
- on_duplicate_candidates (str (optional)) – Should be 'raise' or 'ignore'. Sets what should be done if the same index is generated twice. If 'raise', a RuntimeError will be raised. Otherwise, if 'ignore', the duplicated candidate(s) will be removed. (default 'raise')
Returns: - sswrs (np.ndarray (shape=(max_halt+1), dtype=np.float)) – The SSWR w.r.t. every halting point between 0 and max_halt, i.e. sswrs[i] is the SSWR with halting i.
- halts (np.ndarray (shape=(max_halt+1), dtype=np.bool)) – halts[i] indicates whether halting was used, i.e. halts[i]==False implies that all documents were found before the ith delta_candidates were produced
Raises: - ValueError – If on_duplicate_candidates is not in {'raise', 'ignore'}.
- RuntimeError – If on_duplicate_candidates=='raise' and an index is generated twice.
-
radbm.metrics.sswr.MatchingOracleCost(N, K, k)
Returns: cost – The expected number of calls it takes to find k documents out of K in a database containing N documents (without replacement)
Return type: float
Notes
This function does not check whether the inputs are valid, i.e. that N >= K >= k.
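For intuition about the quantity described above: when N documents are examined in a uniformly random order without replacement and K of them are relevant, the expected position of the kth relevant document is k(N + 1)/(K + 1), a standard result for the negative hypergeometric distribution. The sketch below evaluates that closed form and checks it by brute force on small inputs. Whether radbm computes the cost with exactly this formula is an assumption, and both function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def expected_oracle_calls(N, K, k):
    """Closed form: expected draws (without replacement) to find k of K relevant among N."""
    return k * (N + 1) / (K + 1)


def brute_force_expected_calls(N, K, k):
    """Average, over all placements of the K relevant documents, of the
    position of the kth relevant one. Only valid for 1 <= k <= K <= N."""
    total = Fraction(0)
    n_placements = 0
    for positions in combinations(range(1, N + 1), K):
        # combinations() yields sorted tuples, so positions[k - 1] is the
        # position at which the kth relevant document is found.
        total += positions[k - 1]
        n_placements += 1
    return total / n_placements


closed_form = expected_oracle_calls(4, 2, 2)
exact = brute_force_expected_calls(4, 2, 2)
print(closed_form, exact)  # 3.3333333333333335 10/3
```

Like the documented function, the closed-form helper performs no input validation.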
-
radbm.metrics.sswr.SSWR(relevant, monitor, N, recall=1, allow_halt=False, on_duplicate_candidates='raise') The Sequential Search Work Ratio (SSWR) metric used for quick retrieval tasks. This function uses monitor as the delta generator and monitor.get_value() for the cost of generating all the previous delta candidates. One might consider using ChronoSSWR or CounterSSWR, which implement a precise monitor, instead.
Parameters: - relevant (set of index) – The set of elements that need to be retrieved
- monitor (object) – Iterator of sets of indexes that also implements get_value() -> float
- N (int) – The number of documents in the database
- recall (float in [0,1] (optional)) – The minimal fraction of relevant documents that should be generated (default 1)
- allow_halt (bool (optional)) – Allow the generator to stop before generating all the needed indexes. This is similar, but not equal, to the case where the generator produces all the other indexes at once and then stops. (default False)
- on_duplicate_candidates (str (optional)) – Should be 'raise' or 'ignore'. Sets what should be done if the same index is generated twice. If 'raise', a RuntimeError will be raised. Otherwise, if 'ignore', the duplicated candidate(s) will be removed. (default 'raise')
Returns: work_ratio – By default, the SSWR. If allow_halt is True, a tuple with the SSWR and a boolean indicating whether the generator halted abruptly.
Return type: float (or (float, bool) if allow_halt)
Raises: - ValueError – If on_duplicate_candidates is not in {'raise', 'ignore'}.
- RuntimeError – If on_duplicate_candidates=='raise' and an index is generated twice.
- LookupError – If allow_halt is False and delta_generator stops without generating enough relevant documents.
Notes
The aforementioned index is any hashable object (i.e. hash(index) exists) that uniquely identifies a document in the database. It could be the document itself, but that would be memory heavy. The most common choice is an integer, but in some cases it is practical to carry more information, for example packed in a tuple.
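The monitor argument is described only as an iterator of sets of indexes that implements get_value() -> float. A minimal object satisfying that description might look like the sketch below, which charges a fixed cost per generated set. The class name CountingMonitor, its eta argument, and the way the cost is accumulated are all assumptions made for illustration, not the monitors radbm uses internally. The example also uses tuples as indexes, as the note above allows.

```python
class CountingMonitor:
    """Hypothetical monitor: iterates over delta candidates and reports
    a cost proportional to the number of sets generated so far."""

    def __init__(self, delta_generator, eta=1.0):
        self._iterator = iter(delta_generator)
        self.eta = eta    # assumed cost of one generator call
        self.calls = 0    # number of sets generated so far

    def __iter__(self):
        return self

    def __next__(self):
        delta = next(self._iterator)  # StopIteration propagates unchanged
        self.calls += 1
        return delta

    def get_value(self):
        # Cost of generating all the previous delta candidates.
        return float(self.eta * self.calls)


# Indexes only need to be hashable; here each one is a (shard, position) tuple.
deltas = [{("shard0", 3)}, {("shard1", 7), ("shard0", 1)}]
monitor = CountingMonitor(deltas, eta=2.0)

first = next(monitor)
cost_after_first = monitor.get_value()
remaining = list(monitor)
cost_at_end = monitor.get_value()
print(cost_after_first, cost_at_end)  # 2.0 4.0
```

A time-based monitor would follow the same shape, with get_value() returning the elapsed generating time instead of a call count.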