Bioscape Result Assessment

From irefindex
Revision as of 16:48, 15 July 2010 by PaulBoddie (Added notes about disambiguation using unambiguous/reliable gene mentions.)
Note: this documentation covers an unreleased product and is for internal use only.

The suggestions produced by Bioscape's search activities can be assessed wherever "gold standard" data is available to confirm whether each particular result can be regarded as genuine.

BioCreative 2 Gene Normalisation

In the bsindex distribution, a script is available to export filtered results from Bioscape for assessment against the BioCreative gold standard:

python scripts/bsindex_export_bc2gn_results.py --bionames <generation> --results <generation> --methods human_gene --min-score 1 --output <output>

Once result data is available, this data can be scored through comparison to the gold standard file:

python scripts/bsindex_score_bc2gn_results.py gold <output>

A number of options to the scoring script help compare different sets of results:

python scripts/bsindex_score_bc2gn_results.py gold <output files> --pretty

The --pretty option provides a table with the following columns:

  1. Output filename
  2. Number of true positive results
  3. Number of false positive results
  4. Number of false negative results
  5. Precision
  6. Recall
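
The last two columns follow from the first three counts. As a minimal sketch, assuming the scoring script uses the standard definitions of precision and recall (the function name and the handling of empty result sets are illustrative assumptions, not taken from the script itself):

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw counts; 0.0 where a ratio is undefined."""
    predicted = true_pos + false_pos   # everything the system suggested
    expected = true_pos + false_neg    # everything listed in the gold standard
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / expected if expected else 0.0
    return precision, recall


# Illustrative counts only, not real Bioscape output.
precision, recall = precision_recall(true_pos=60, false_pos=20, false_neg=40)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision falls as false positives accumulate, and recall falls as gold standard genes are missed, so the two columns are best read together when comparing result files.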

Combining the output of this script with other Unix commands can be convenient:

python scripts/bsindex_score_bc2gn_results.py gold <output files> --pretty | sort -n -k 5

The above combination should sort the entries on the precision column in order of increasing precision.

Comparing BioCreative Results and Bioscape Results

In order to compare results from BioCreative and Bioscape in the Web interface, the gold standard data must be imported; this involves the following processes:

  1. Import of the gene identifiers and names referenced in the gold standard data file.
  2. Text searching using these names in the appropriate documents, so that regions of text may be shown to provide results.
  3. Propagation of region and gene name information in order to produce specific gene references.

With this information available to Bioscape, it becomes possible to see each result set in the same document and to perform further analysis on the accuracy of Bioscape results.

Isolating Correct and Incorrect Bioscape Results

Using BioCreative results, it is possible to take a selection of Bioscape results and to assess them according to a number of criteria:

  • Correctness: whether each Bioscape result is correct or not - this can already be assessed using the export and scoring scripts described above, but only at the document level.
  • Correspondence: whether each BioCreative result corresponds to any Bioscape results - although this can be done using the scripts at the document level, it now becomes possible to consider the correspondence at the mention level.
  • Ambiguity: how many Bioscape suggestions exist for each BioCreative result - many suggestions indicate ambiguity, whereas a single suggestion is unambiguous.
  • Location: whether Bioscape results appear in places not associated with BioCreative results, and whether these happen to correspond to BioCreative suggestions for a particular document.

Thus, each Bioscape result can be classified as follows:

Class                                                             | At known location | Predicts correct gene at location         | Predicts correct gene for document
------------------------------------------------------------------+-------------------+-------------------------------------------+-----------------------------------
True positive at "true" BioCreative mention location              | Yes               | Yes                                       | Yes
False positive at "true" BioCreative mention location             | Yes               | No (may co-exist with correct suggestion) | No
True positive at wrong "true" BioCreative mention location        | Yes               | No (may co-exist with correct suggestion) | Yes
True positive at "false" unknown-to-BioCreative mention location  | No                | No                                        | Yes
False positive at "false" unknown-to-BioCreative mention location | No                | No                                        | No

Another way of expressing these result categories is as follows:

True positive
  At "true" known location: Bioscape suggestion matches (true positive for mention)
  At "wrong" known location or "false" unknown location: Bioscape suggestion matches a suggestion for the document ("accidental" true positive for document)
False positive
  At "true" or "wrong" known location: Bioscape suggestion does not match (and is inappropriate for the document)
  At "false" unknown location: Bioscape suggestion neither appears at a recognised place nor is appropriate for the document
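
The classification above can be sketched as a small decision procedure. This is an illustration only: the tuple layout `(start, end, gene)`, the use of overlapping character ranges to decide whether a location is "known", and the class labels are all assumptions, not Bioscape's actual data model.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open character ranges share at least one position."""
    return a_start < b_end and b_start < a_end


def classify(result, gold_mentions):
    """Classify one Bioscape result against the gold mentions of its document.

    result and each gold mention are hypothetical (start, end, gene) tuples.
    """
    start, end, gene = result
    # Genes that the gold standard places at this location.
    genes_here = {g for (s, e, g) in gold_mentions if overlaps(start, end, s, e)}
    # Genes that the gold standard places anywhere in the document.
    genes_in_doc = {g for (_, _, g) in gold_mentions}

    if genes_here:  # known location
        if gene in genes_here:
            return "true positive at true location"
        if gene in genes_in_doc:
            return "true positive at wrong location"
        return "false positive at true location"
    if gene in genes_in_doc:  # unknown location, "accidental" document-level match
        return "true positive at unknown location"
    return "false positive at unknown location"


# Made-up gold mentions for a single document.
gold = [(0, 5, "G1"), (10, 15, "G2")]
for candidate in [(0, 5, "G1"), (0, 5, "G2"), (0, 5, "G9"), (20, 25, "G2"), (20, 25, "G9")]:
    print(candidate, "->", classify(candidate, gold))
```

Only the first class is a true positive at the mention level; the second and fourth count as correct only when results are scored per document.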

Assessing Ambiguous and Unambiguous Suggestions

For each BioCreative result, zero, one or many suggestions may have been made by Bioscape. For various purposes, we may wish to divide the BioCreative gold standard data into a number of sets of gene mentions:

  1. Those for which results are unambiguously suggested by Bioscape - this can be used to assess the reliability of unambiguous suggestions (albeit at genuine mention locations)
  2. Those for which results are ambiguously suggested by Bioscape - this can be used to assess disambiguation method performance (at least at locations where a correct suggestion has been made)
  3. Those for which no results are suggested by Bioscape - this can be used to assess improved detection techniques
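
A minimal sketch of this three-way division follows. It assumes that each gold standard mention is a hashable tuple and that the Bioscape suggestions have already been collected into a mapping from mention to the set of suggested gene identifiers; both structures are hypothetical.

```python
def partition_by_ambiguity(gold_mentions, suggestions_at):
    """Split gold mentions by the number of Bioscape suggestions made for them.

    suggestions_at maps a gold mention to the set of gene ids suggested there.
    Returns (unambiguous, ambiguous, undetected) lists of mentions.
    """
    unambiguous, ambiguous, undetected = [], [], []
    for mention in gold_mentions:
        suggested = suggestions_at.get(mention, set())
        if len(suggested) == 1:
            unambiguous.append(mention)   # exactly one suggestion
        elif suggested:
            ambiguous.append(mention)     # two or more competing suggestions
        else:
            undetected.append(mention)    # nothing suggested at all
    return unambiguous, ambiguous, undetected


# Made-up mentions: (document, start, end, gold gene).
gold = [("doc1", 0, 5, "G1"), ("doc1", 10, 15, "G2"), ("doc1", 20, 25, "G3")]
suggestions = {gold[0]: {"G1"}, gold[1]: {"G2", "G7"}}
unambiguous, ambiguous, undetected = partition_by_ambiguity(gold, suggestions)
print(len(unambiguous), len(ambiguous), len(undetected))
```

Note that a mention counted as unambiguous here is not necessarily correct: the single suggestion still has to be compared with the gold gene to measure reliability.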

Assessing the Influence of Unambiguous or Reliable Gene Mentions

Result locations where many suggestions are made by Bioscape and where one suggestion is known to be correct offer the possibility of assessing different disambiguation techniques, since it is impossible to always be incorrect when choosing a suggestion at such a location: eventually one choice must be correct. Thus, one can measure the reliability of disambiguation using this subset of the entire collection of Bioscape suggestions.
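
One way to make this measurement concrete is to score a disambiguation strategy only on locations that have several suggestions, one of which is known to be correct. The sketch below assumes a hypothetical representation of each location as a pair of candidate gene list and gold gene, and treats a strategy as any function that picks one candidate.

```python
def disambiguation_accuracy(locations, choose):
    """Fraction of ambiguous, answerable locations where choose() picks the gold gene.

    locations: list of (candidates, correct_gene) pairs; candidates is a list.
    choose: callable taking the candidate list and returning one candidate.
    Returns None when no location qualifies.
    """
    # Keep only locations with competing suggestions that include the correct gene.
    usable = [(c, g) for (c, g) in locations if len(c) > 1 and g in c]
    if not usable:
        return None
    hits = sum(1 for (c, g) in usable if choose(c) == g)
    return hits / len(usable)


# Made-up data; the last two locations are excluded (unambiguous / unanswerable).
locations = [
    (["A", "B"], "A"),
    (["C", "D"], "D"),
    (["E"], "E"),
    (["F", "G"], "H"),
]
print(disambiguation_accuracy(locations, choose=lambda candidates: candidates[0]))
```

A trivial strategy such as "always take the first candidate" gives a baseline against which a real disambiguation method can be compared.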

It is reasonable to expect that gene mentions in a given document will be related: a gene mentioned in one location may be mentioned in a different way in another location, and a gene mentioned in one location may be related through some correspondence to another gene mentioned elsewhere. If a gene is mentioned with certainty in one location, that information could be used in locations where it is uncertain which gene is being mentioned, perhaps because a number of different suggestions have been made. Thus, a "reliable" gene mention could be used to influence the choice of a gene suggestion at a location which involves ambiguity.
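
As a minimal sketch of this idea, the function below resolves an ambiguous location only when exactly one of its candidates is also among the genes reliably mentioned elsewhere in the same document. The function name and data structures are hypothetical, and the sketch covers only the simplest case (the same gene mentioned twice), not relationships between different genes.

```python
def disambiguate(candidates, reliable_genes):
    """Return the one candidate supported by a reliable mention, else None.

    candidates: gene ids suggested at an ambiguous location.
    reliable_genes: set of gene ids reliably mentioned in the same document.
    """
    supported = [gene for gene in candidates if gene in reliable_genes]
    # Zero supported candidates gives no evidence; several leaves the ambiguity intact.
    return supported[0] if len(supported) == 1 else None


# Made-up identifiers.
print(disambiguate(["G1", "G2", "G3"], {"G2", "G9"}))
print(disambiguate(["G1", "G2"], {"G1", "G2"}))
```

Returning nothing when the evidence is absent or contradictory keeps the method conservative; a scoring approach could instead rank the candidates rather than accept or reject them.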

Strategies for Acquiring Reliable Gene Mentions

One challenge that arises is that of choosing "reliable" gene mentions for use in disambiguation. Although the BioCreative data can tell us which mentions are correct, this is obviously not suitable as a general solution: if such information covered the PubMed document collection in its entirety, there would be no need to detect gene mentions in the first place. However, there are some approaches or strategies which may be used to acquire potentially reliable gene mention information:

  • The precision of Bioscape for unambiguous mentions can be measured, and given a high enough precision through the use of a number of scoring methods, it may be assumed that applying such methods will generally yield high quality gene suggestions for use as reliable gene mentions for disambiguation.
  • There are datasets which seek to offer reliable gene-related information about PubMed documents, such as the GeneRIF dataset and the gene2pubmed table in the Entrez Gene download files (as described in the README file).

By taking such reliable information, either by detecting unambiguous mentions or by consulting lists of annotations, and by scoring results as reliable, it then becomes possible to compile lists of unambiguous or reliable genes which are thought to be referenced in a particular document. Such lists of genes can then be used for disambiguation purposes, as summarised in the diagram below.

[Diagram: incoming data is scored and stored in a table of genes in documents, which provides scores for disambiguation]
  Result detection -> Score as unambiguous -> Storage -> Disambiguate
  Acquire annotation data -> Score as reliable -> Storage
  Result detection -> Disambiguate
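
The "acquire annotation data" step can be sketched for gene2pubmed as follows. This assumes the file is tab-separated with the columns tax_id, GeneID and PubMed_ID and a header line starting with "#", which should be checked against the README in the Entrez Gene download files. The function name and the sample rows are made up for illustration.

```python
from collections import defaultdict


def load_gene2pubmed(lines, tax_id="9606"):
    """Build a table of document -> set of gene ids from gene2pubmed-style lines.

    Assumes three tab-separated columns: tax_id, GeneID, PubMed_ID.
    Only rows for the given taxonomy id (9606 is human) are kept.
    """
    genes_in_doc = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header/comment line
        row_tax, gene_id, pubmed_id = line.rstrip("\n").split("\t")[:3]
        if row_tax == tax_id:
            genes_in_doc[pubmed_id].add(gene_id)
    return dict(genes_in_doc)


# Made-up rows in the assumed layout, not real annotations.
SAMPLE = [
    "#tax_id\tGeneID\tPubMed_ID\n",
    "9606\t1001\t5550001\n",
    "9606\t1002\t5550001\n",
    "10090\t2001\t5550001\n",
    "9606\t1001\t5550002\n",
]
print(load_gene2pubmed(SAMPLE))
```

With the real file, the same function could be called on an open file object, since it only needs an iterable of lines.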

Choice of Methods for Unambiguous Gene Mentions

It could be argued that any unambiguous suggestion of a gene made by Bioscape could be suitable for use as a reliable gene mention. However, Bioscape is likely to suggest many occurrences of a gene which are not supported, due mostly to the nature of Bioscape's searching process which merely attempts to speculatively match search terms from a lexicon against the text of each document of interest. Thus, a degree of quality control is required to make sure that any suggestions originating from Bioscape appear to make a genuine attempt to suggest the presence of a particular gene.

A number of scoring methods appear to universally improve the quality of Bioscape's suggestions for the purposes of an assessment involving BioCreative documents. The following methods ensure that only names used to refer to human genes are considered, and that suggestions which could also be related to chemical or molecular names or symbols, or to a list of keywords that are considered uninformative, are excluded:

  • human_gene
  • not_chemical_name_mention
  • not_uninformative_keyword_mention

The following method excludes suggestions for genes for which only a single name has been found in a document:

  • confirmed_by_multiple_names
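
The intent of this method can be illustrated with a short sketch. It is not Bioscape's implementation: the `(gene_id, matched_name)` pairs and the function name are assumptions, and names are compared exactly as matched.

```python
from collections import defaultdict


def genes_confirmed_by_multiple_names(suggestions, min_names=2):
    """Return the genes suggested through at least min_names distinct names.

    suggestions: iterable of hypothetical (gene_id, matched_name) pairs
    collected from a single document.
    """
    names_for_gene = defaultdict(set)
    for gene_id, name in suggestions:
        names_for_gene[gene_id].add(name)
    return {gene for gene, names in names_for_gene.items() if len(names) >= min_names}


# Made-up suggestions: G1 is found under two names, G2 under one name twice.
doc_suggestions = [("G1", "p53"), ("G1", "TP53"), ("G2", "BRCA1"), ("G2", "BRCA1")]
print(genes_confirmed_by_multiple_names(doc_suggestions))
```

Repeated occurrences of the same name do not count as confirmation here; only distinct names do.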

To ensure that suggestions are not competing, at least with others at exactly the same mention location, the following method may be applied:

  • unambiguous_gene_mention_at_overlapping_location

By combining these methods, it should be possible to obtain a set of unambiguous mentions of sufficiently high quality.
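
The combination can be sketched as a filter that keeps only suggestions satisfying every method named above. The representation of a suggestion as a dictionary with a set of satisfied method names is hypothetical; only the method names themselves come from this page.

```python
REQUIRED_METHODS = frozenset({
    "human_gene",
    "not_chemical_name_mention",
    "not_uninformative_keyword_mention",
    "confirmed_by_multiple_names",
    "unambiguous_gene_mention_at_overlapping_location",
})


def select_reliable(suggestions, required=REQUIRED_METHODS):
    """Keep the suggestions whose satisfied methods include every required one."""
    return [s for s in suggestions if required <= s["methods"]]


# Made-up suggestions; "methods" holds the names of the methods each one satisfied.
suggestions = [
    {"gene": "G1", "methods": set(REQUIRED_METHODS)},
    {"gene": "G2", "methods": {"human_gene", "not_chemical_name_mention"}},
]
print([s["gene"] for s in select_reliable(suggestions)])
```

Passing a smaller `required` set makes it possible to measure how much each method contributes to precision when the filtered results are scored against the gold standard.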