Bioscape Issues and Tasks

From irefindex
Revision as of 17:37, 16 February 2009 by PaulBoddie (talk | contribs)

Please note that this documentation covers an unreleased product and is for internal use only.


As described in a document originally produced by Ian Donaldson, here are the current issues, wishes and related notes.

Disambiguation and Elimination

Use synonyms to disambiguate from other bioentities (from the same or different organisms)

  • Some papers (Peregrine, P132, BioCreative #2) suggest only assigning gene identities when mentions are supported by synonyms or (on P133) by uncommon, individual words from the "long-form names" of a gene
  • Some papers (Hakenberg et al., P141) suggest other forms of synonym, typically originating from other data sources (GeneOntology) as well as chromosome information
    • The "disambiguated by competing names" method (which counts the number of names used in a document for a bioentity) consistently raises precision by 4-5%, which suggests that it does help disambiguation
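A minimal sketch of the "competing names" idea, assuming a lexicon that maps each name to the bioentities it may denote. The function name, the lexicon layout and the entity identifiers are all hypothetical, and the cited paper may score candidates differently.

```python
from collections import defaultdict

def rank_by_competing_names(mentions, lexicon):
    """Rank candidate entities by the number of distinct names supporting them.

    mentions: names found in one document.
    lexicon: dict mapping a name to the set of entity ids it may denote.
    """
    support = defaultdict(set)
    for name in mentions:
        for entity in lexicon.get(name, ()):
            support[entity].add(name)
    # Most distinct supporting names first; ties broken by entity id.
    return sorted(((entity, len(names)) for entity, names in support.items()),
                  key=lambda pair: (-pair[1], pair[0]))

# Hypothetical lexicon: "CAT" is ambiguous between two entities.
lexicon = {
    "CAT": {"gene:A", "gene:B"},
    "catalase": {"gene:A"},
    "chloramphenicol acetyltransferase": {"gene:B"},
}
ranking = rank_by_competing_names(["CAT", "catalase", "CAT"], lexicon)
print(ranking)
```

Here the ambiguous "CAT" is backed by a second name only for gene:A, so gene:A ranks first.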

Use synonyms and capitalisation to identify genuine mentions which resemble English words

Search for ambiguous and unambiguous names separately

Handle ambiguous names and English words by searching for unambiguous names in the same abstract

UMLS term disambiguation (http://www.nlm.nih.gov/research/umls/)

Score according to length in order to decide between overlapping matches ("IL1 receptor" is preferred to "IL1")
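One way to apply the length rule is a greedy longest-first selection over match spans. This is only a sketch: the tuple layout and function name are assumptions, and length is the only score used here.

```python
def resolve_overlaps(matches):
    """Keep the longest match wherever matches overlap.

    matches: list of (start, end, text) tuples, end exclusive.
    """
    chosen = []
    # Longest first; the earlier start breaks ties.
    for m in sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0])):
        if all(m[1] <= c[0] or m[0] >= c[1] for c in chosen):
            chosen.append(m)
    return sorted(chosen)

text = "the IL1 receptor binds IL1"
matches = [(4, 7, "IL1"), (4, 16, "IL1 receptor"), (23, 26, "IL1")]
kept = resolve_overlaps(matches)
print(kept)
```

The first "IL1" is dropped because it lies inside "IL1 receptor"; the second one does not overlap anything and is kept.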

Disqualifiers (surrounding words which indicate false positives)

Following words: gene, cell, cells, cell type, domain, DNA binding site, mediated, interactor, protooncogene, costimulates, heterodimer, transcripts, corepressor, exerts, suppresses, encodes
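The following-word check could look roughly like this. The word list is the one given above; allowing a hyphen before the word (as in "IL2-mediated") and ignoring case are assumptions, and the notes do not say whether a disqualified mention is dropped or merely scored down.

```python
import re

DISQUALIFIERS = [
    "gene", "cell", "cells", "cell type", "domain", "DNA binding site",
    "mediated", "interactor", "protooncogene", "costimulates", "heterodimer",
    "transcripts", "corepressor", "exerts", "suppresses", "encodes",
]
# Longer phrases first so that "cell type" is tried before "cell".
_alternatives = "|".join(
    re.escape(w) for w in sorted(DISQUALIFIERS, key=len, reverse=True))
# Assumption: whitespace or a hyphen may separate the mention from the word.
FOLLOWING = re.compile(r"[\s-]+(?:" + _alternatives + r")\b", re.IGNORECASE)

def is_disqualified(text, end):
    """True if the text right after offset `end` starts with a disqualifier."""
    return FOLLOWING.match(text, end) is not None

text = "p53 gene expression falls when p53 binds MDM2"
second_end = text.index("p53", 3) + 3
print(is_disqualified(text, 3), is_disqualified(text, second_end))
```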

Unspecific synonyms

Added:
  • purely numeric names
  • pN (N being a number)
  • N kDa
  • N kD
  • N k
From BioThesaurus (http://pir.georgetown.edu/pirwww/iprolink/biothesaurus/supplement/BioThesaurus_Supplement.pdf):
  • N k (protein(s))
  • N aa long hypothetical protein(s)
  • N kaa long hypothetical protein(s)
  • hypothetical protein precursor(s)
  • unnamed protein product(s)
  • conserved (hypothetical)/expressed/hypothetical (conserved)/novel/predicted/putative (exported)/unknown (polyprotein(s)/protein(s)/orf(s))
Action:
Update the scoring for uninformative names in the following file:
bioscape/modules/text/sql/importdb-score-uninformative-pgsql.sql.in
Others:
tRNA, RNA, DNA, mRNA, snRNA
Action:
Check for the presence of such terms in the chemical/molecule name lexicon
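For experimenting outside the database, the uninformative-name patterns listed above can be approximated with regular expressions. This is a sketch only: it is not the content of the SQL file named above, and the reading of the slash-separated BioThesaurus entry (a qualifier followed by one of polyprotein/protein/orf) and the case-insensitive matching are assumptions.

```python
import re

_QUALIFIER = (r"(?:conserved(?: hypothetical)?|expressed"
              r"|hypothetical(?: conserved)?|novel|predicted"
              r"|putative(?: exported)?|unknown)")
_NOUN = r"(?:polyproteins?|proteins?|orfs?)"

UNINFORMATIVE = [re.compile(p, re.IGNORECASE) for p in (
    r"\d+",                                    # purely numeric names
    r"p\d+",                                   # pN
    r"\d+\s*k(?:da?)?",                        # N kDa, N kD, N k
    r"\d+\s*k\s+proteins?",                    # N k protein(s)
    r"\d+\s*k?aa long hypothetical proteins?", # N aa / N kaa long ...
    r"hypothetical protein precursors?",
    r"unnamed protein products?",
    _QUALIFIER + r" " + _NOUN,
)]

def is_uninformative(name):
    """True if the whole name matches one of the patterns."""
    name = name.strip()
    return any(p.fullmatch(name) for p in UNINFORMATIVE)

for candidate in ("70 kDa", "putative exported protein", "IL1 receptor"):
    print(candidate, is_uninformative(candidate))
```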

Recognition

Conjunctions and enumerations:

  • HAP2, 3, 4
  • HAP2, 3 and 4
  • HAP2-4
  • HAP-2, -3, -4
  • HAP2/4
  • HAP2 to HAP4
  • freac1-freac7
  • M and B creatine kinase
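The forms above could be expanded into individual names along these lines. The sketch makes several assumptions: "HAP2/4" is read as a pair rather than a range, the shared-head form is limited to single capital letters, and the expanded names would still need checking against the lexicon. All names in the code are hypothetical.

```python
import re

# HAP2-4, HAP2 to HAP4, freac1-freac7
RANGE = re.compile(r"^([A-Za-z]+)-?(\d+)\s*(?:-|to)\s*(?:\1)?-?(\d+)$")
# HAP2, 3, 4 / HAP2, 3 and 4 / HAP-2, -3, -4 / HAP2/4
LIST = re.compile(
    r"^([A-Za-z]+)(-?)(\d+)((?:\s*(?:,|/|and|,\s*and)\s*-?\d+)+)$")
# M and B creatine kinase
SHARED_HEAD = re.compile(r"^([A-Z]) and ([A-Z]) (.+)$")

def expand_enumeration(text):
    """Return the individual names implied by an enumeration."""
    text = text.strip()
    m = RANGE.match(text)
    if m and int(m.group(2)) <= int(m.group(3)):
        stem, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
        return [f"{stem}{n}" for n in range(lo, hi + 1)]
    m = LIST.match(text)
    if m:
        stem, hyphen = m.group(1), m.group(2)
        numbers = [m.group(3)] + re.findall(r"\d+", m.group(4))
        return [f"{stem}{hyphen}{n}" for n in numbers]
    m = SHARED_HEAD.match(text)
    if m:
        return [f"{m.group(1)} {m.group(3)}", f"{m.group(2)} {m.group(3)}"]
    return [text]

for form in ("HAP2, 3 and 4", "HAP2-4", "HAP-2, -3, -4",
             "M and B creatine kinase"):
    print(form, "->", expand_enumeration(form))
```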

Acronym recognition and expansion/equivalence:

Techniques:

  • Compare words preceding acronyms in parentheses with acronym initials
    • The most conservative acronym disambiguation approach involves comparing the list of candidates suggested by an acronym with those suggested by the accompanying "explanation"
    • However, adopting a name/synonym disambiguation method, such as the "disambiguated by competing names" method, seems to overlap with such an acronym disambiguation technique
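The first technique (comparing the words before a parenthesised acronym with its initials) might be sketched as follows. It assumes one word per acronym letter and ignores digits in the acronym, so forms such as "interleukin-2 receptor (IL2R)" are not handled; the function name is hypothetical.

```python
import re

PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]*)\)")

def find_acronym_definitions(text):
    """Return (acronym, long_form) pairs whose initials agree."""
    pairs = []
    for m in PAREN.finditer(text):
        acronym = m.group(1)
        letters = [c.lower() for c in acronym if c.isalpha()]
        # Take as many preceding words as the acronym has letters.
        words = re.findall(r"[A-Za-z0-9]+", text[:m.start()])
        candidate = words[-len(letters):]
        if len(candidate) == len(letters) and all(
                w[0].lower() == c for w, c in zip(candidate, letters)):
            pairs.append((acronym, " ".join(candidate)))
    return pairs

sentence = "Expression of tumor necrosis factor (TNF) was reduced."
print(find_acronym_definitions(sentence))
```

The resulting pairs could then feed the candidate comparison described above, with the long form and the acronym each suggesting their own list of bioentities.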

Multi-word descriptions and authoritative names (involving commas, parentheses)

Orthographic variation/tokenisation

  • Skipping terms: "type" as in "IL type 1"
  • Greek letters converted to Latin equivalents
  • Hyphens removed, but inserted after every Greek letter
  • Hyphens added at alphabetic/numeric boundaries
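The rules above could be combined into a single normalisation step so that spelling variants share one key. The order of the steps is an assumption, as is the choice to spell Greek letters out ("beta") rather than map them to single Latin letters; the Greek table is deliberately incomplete, and a hyphen is only inserted after a Greek letter when something follows it.

```python
import re

# Partial, illustrative table of Greek letters.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "κ": "kappa"}
_GREEK_CLASS = "[" + "".join(GREEK) + "]"

def normalise(name):
    """Apply the tokenisation rules to produce a comparison key."""
    # 1. Skip the term "type" ("IL type 1" is treated like "IL 1").
    s = re.sub(r"\btype\b\s*", "", name, flags=re.IGNORECASE)
    # 2. Remove existing hyphens.
    s = s.replace("-", "")
    # 3. Spell out Greek letters, with a hyphen after each one that is
    #    followed by further characters.
    s = re.sub(_GREEK_CLASS + r"(?=\w)", lambda m: GREEK[m.group()] + "-", s)
    s = re.sub(_GREEK_CLASS, lambda m: GREEK[m.group()], s)
    # 4. Add hyphens at alphabetic/numeric boundaries.
    s = re.sub(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])", "-", s)
    return s

for variant in ("IL1", "IL-1", "TGF-β1", "TGFβ1", "IL type 1"):
    print(variant, "->", normalise(variant))
```

With this ordering "IL1" and "IL-1" share a key, as do "TGF-β1" and "TGFβ1". Whitespace is left alone, so "IL type 1" becomes "IL 1" rather than "IL-1".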

Synonyms

Synonyms shorter than six characters are searched only in all upper case or with an initial upper-case letter:

  • ("Change this for yeast.")
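A sketch of how the search forms for a short synonym could be generated. The function name and the `short_limit` parameter are hypothetical, and returning longer synonyms unchanged is an assumption; the parameter is there because the rule is expected to differ for yeast.

```python
def search_forms(synonym, short_limit=6):
    """Return the case variants to search for a synonym."""
    if len(synonym) < short_limit:
        all_upper = synonym.upper()
        initial_upper = synonym[:1].upper() + synonym[1:].lower()
        return sorted({all_upper, initial_upper})
    return [synonym]

print(search_forms("cdc28"))
print(search_forms("catalase"))
```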

Organism-specific

Prefixes ("h") and suffixes ("p") in organism-specific rules
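Organism-specific affixes could be handled by generating extra name variants per organism. The rule table below is purely illustrative: attaching the "h" prefix to human and the "p" suffix to yeast is an assumption, since the note above does not name the organisms.

```python
# Hypothetical rule table; the organism keys are assumptions.
AFFIX_RULES = {
    "human": {"prefixes": ("h",), "suffixes": ()},
    "yeast": {"prefixes": (), "suffixes": ("p",)},
}

def name_variants(name, organism):
    """Return the name plus its organism-specific affixed forms."""
    rules = AFFIX_RULES.get(organism, {})
    variants = {name}
    for prefix in rules.get("prefixes", ()):
        variants.add(prefix + name)
    for suffix in rules.get("suffixes", ()):
        variants.add(name + suffix)
    return sorted(variants)

print(name_variants("TERT", "human"))
print(name_variants("Cdc28", "yeast"))
```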

General

Usage of BioThesaurus, BioLexicon

Usage of euGenes (http://eugenes.org/)

Full-text searching

Manual curation lists for adding/removing names for specific bioentities, whole organisms

Case-sensitive searching of ambiguous names:

  • Case-insensitive searching only for numeric names or for names longer than 5 characters
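The case rule could be expressed when compiling the search pattern for each name. In this sketch "numeric names" is read as names containing at least one digit, which is an assumption, and the word-boundary matching is illustrative only.

```python
import re

def compile_name_pattern(name):
    """Compile a pattern that is case-insensitive only where the rule allows."""
    case_insensitive = len(name) > 5 or any(c.isdigit() for c in name)
    flags = re.IGNORECASE if case_insensitive else 0
    return re.compile(r"\b" + re.escape(name) + r"\b", flags)

print(bool(compile_name_pattern("CAT").search("the cat sat")))
print(bool(compile_name_pattern("catalase").search("Catalase activity")))
```

Short alphabetic names such as "CAT" then only match in the exact case given, which keeps them away from the English word "cat".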

Remove the subtype specifier if the organism has only one subtype ("aminocyclase 1" becomes "aminocyclase")
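A sketch of the subtype rule over the names of one organism. It assumes the specifier is a trailing number, and it adds the shortened form as an extra synonym rather than replacing the original name; both choices are assumptions.

```python
import re
from collections import defaultdict

SUBTYPE = re.compile(r"^(.*\S)\s+(\d+)$")

def add_unsuffixed_names(names):
    """Add "X" as a name when "X N" exists for exactly one N."""
    subtypes_by_stem = defaultdict(set)
    for name in names:
        m = SUBTYPE.match(name)
        if m:
            subtypes_by_stem[m.group(1)].add(m.group(2))
    extra = {stem for stem, subtypes in subtypes_by_stem.items()
             if len(subtypes) == 1}
    return sorted(set(names) | extra)

organism_names = ["aminocyclase 1", "kinase 1", "kinase 2"]
print(add_unsuffixed_names(organism_names))
```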

Pre-search each name against all PubMed abstracts and score it according to the number of results returned, filtering out names as a result

Stop words

  • Common English words
  • Protein family terms
  • Non-protein molecules
  • Experimental words