http://irefindex.vib.be/wiki/index.php?title=Gene_Ontology_similarity_measurement&feed=atom&action=historyGene Ontology similarity measurement - Revision history2024-03-29T05:36:42ZRevision history for this page on the wikiMediaWiki 1.33.0http://irefindex.vib.be/wiki/index.php?title=Gene_Ontology_similarity_measurement&diff=4080&oldid=prevPaulBoddie: Attempted to make the descriptions more coherent, adding a discussion of observed term frequencies as the basis for the probability or specificity of a concept in an ontology.2012-02-29T18:01:09Z<p>Attempted to make the descriptions more coherent, adding a discussion of observed term frequencies as the basis for the probability or specificity of a concept in an ontology.</p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">Revision as of 18:01, 29 February 2012</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l11" >Line 11:</td>
<td colspan="2" class="diff-lineno">Line 11:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div># Each concept has a probability associated with it (defining the probability of "encountering an instance" of that concept).</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div># Each concept has a probability associated with it (defining the probability of "encountering an instance" of that concept).</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div># Where a concept ''<del class="diffchange diffchange-inline">c1</del>'' is subsumed by ''<del class="diffchange diffchange-inline">c2</del>'' (as in ''<del class="diffchange diffchange-inline">c1 </del>is-a <del class="diffchange diffchange-inline">c2</del>'') then the probability of encountering an instance of ''<del class="diffchange diffchange-inline">c1</del>'' is no greater than that of encountering an instance of ''<del class="diffchange diffchange-inline">c2</del>''.</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div># Where a concept ''<ins class="diffchange diffchange-inline">c<sub>specific</sub></ins>'' is subsumed by ''<ins class="diffchange diffchange-inline">c<sub>general</sub></ins>'' (as in ''<ins class="diffchange diffchange-inline">c<sub>specific</sub> </ins>is-a <ins class="diffchange diffchange-inline">c<sub>general</sub></ins>'') then the probability of encountering an instance of ''<ins class="diffchange diffchange-inline">c<sub>specific</sub></ins>'' is no greater than that of encountering an instance of ''<ins class="diffchange diffchange-inline">c<sub>general</sub></ins>''.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div># Where a single root concept exists, since it subsumes all possible concepts, the probability of encountering an instance of it is 1.</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div># Where a single root concept exists, since it subsumes all possible concepts, the probability of encountering an instance of it is 1.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div># Since information content is defined as ''-log p(c)'' for a concept ''c'', less probable concepts have higher information content.</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div># Since information content is defined as ''-log p(c)'' for a concept ''c'', less probable concepts have higher information content.</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l17" >Line 17:</td>
<td colspan="2" class="diff-lineno">Line 17:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The probability of each concept was defined by the cumulative frequency of all nouns subsumed by that concept divided by the total noun frequency of the corpus.</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The probability of each concept was defined by the cumulative frequency of all nouns subsumed by that concept divided by the total noun frequency of the corpus.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del class="diffchange diffchange-inline">{{Note|</del></div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">== Applying Information Content </ins>to <ins class="diffchange diffchange-inline">Communications ==</ins></div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del class="diffchange diffchange-inline">I find the application of information content </del>to <del class="diffchange diffchange-inline">be less than helpful in this context since it is often used to analyse or illustrate properties of communications representations as described in [http://www.cmh.edu/stats/model/InfoModel.htm these notes about information theory and data compression]. When deriving the information content, one first divides 1 by ''p(c)'', which appears to define the "granularity of the state space" or the number of distinct states required to represent the communication of an occurrence of ''c''. Taking the logarithm of this result (''log 1/p(c)'' and thus ''-log p(c)'') then defines the number of digits or bits (if a base-2 logarithm is used) required to encode such an outcome.</del></div></td><td colspan="2"> </td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del class="diffchange diffchange-inline">My </del>first <del class="diffchange diffchange-inline">instinct was </del>to consider the specificity of ontology terms by first counting those subsumed by a particular term (including itself) ''n<sub>subtree</sub>'' and then dividing by the total number of terms ''n<sub>total</sub>'' to give the "coverage" of a particular term, subtracting this from 1 to give the specificity of a term. Obviously, this only considers features of the ontology itself and not external information such as the word frequencies used by Resnik, but "how specific a term is" is a familiar concept, at least.</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">Information content is often used to analyse or illustrate properties of communications representations as described in [http://www.cmh.edu/stats/model/InfoModel.htm these notes about information theory and data compression]. When deriving the information content, one </ins>first <ins class="diffchange diffchange-inline">divides 1 by ''p(c)'', which appears to define the "granularity of the state space" or the number of distinct states required to represent the communication of an occurrence of ''c''. Taking the logarithm of this result (''log 1/p(c)'' and thus ''-log p(c)'') then defines the number of digits or bits (if a base-2 logarithm is used) required to encode such an outcome.</ins></div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del class="diffchange diffchange-inline">|Paul}}</del></div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">Thus, if ''c'' is highly probable, occurring with ''p(c) = 0.5'', then with a base-2 logarithm ''-log p(c) = -(-1) = 1'', indicating that a single bit (with a value of 1, say) is enough to signal the presence of ''c'' in a message, whereas all other values would be encoded with an initial bit distinguishing them from ''c'' (therefore, with a value of 0) and additional bits employed if necessary. This can be visualised using a tree:</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">* ''c'' (''p(c) = 0.5'')</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">* (not ''c'')</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">** ''d'' (''p(d) = 0.25'')</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">** (not ''d'')</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">*** ''e'' (''p(e) = 0.15'')</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">*** ''f'' (''p(f) = 0.1'')</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">Clearly, in a communications context, the aim is to minimise the size of the message by giving the shortest encodings to the most frequent values.</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">== Returning to Concept Similarity ==</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">An initial attempt to translate the notion of information content to concept similarity is </ins>to consider the specificity of ontology terms: count the terms subsumed by a particular term (including itself), ''n<sub>subtree</sub>'', divide by the total number of terms, ''n<sub>total</sub>'', to give the "coverage" of that term, and subtract this from 1 to give its specificity, ''1 - n<sub>subtree</sub>/n<sub>total</sub>''. This only considers features of the ontology itself and not external information such as the word frequencies used by Resnik, but "how specific a term is" is a familiar concept, at least<ins class="diffchange diffchange-inline">, and one which upholds the concept hierarchy within the information content framework.</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">To measure specificity more accurately, one might associate observed frequencies with the ontology terms, maintaining the general property that more general terms (such as ''c<sub>general</sub>'') subsume more specific terms (such as ''c<sub>1</sub>'', ''c<sub>2</sub>'', ...) such that each term's resultant frequency ''r'' is defined as...</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">''r(c<sub>general</sub>) = r(c<sub>1</sub>) + r(c<sub>2</sub>) + ... + r(c<sub>n</sub>) + f(c<sub>general</sub>)''</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">...where ''c<sub>1</sub>'' to ''c<sub>n</sub>'' are the terms directly subsumed by ''c<sub>general</sub>'' and ''f'' is the observed frequency of the term itself</ins>. <ins class="diffchange diffchange-inline">Since the resultant frequency of any given term includes contributions from the entire subtree of the ontology of which it is the root node, the hierarchical information encoded in the more naive approach is preserved.</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">The remaining difficulty lies in defining what the "observed frequency" of a term is.</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">== Concept Comparison ==</ins></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>When comparing two concepts, ''c1'' and ''c2'', Resnik refers to the set of concepts subsuming ''c1'' and ''c2'' which in a hierarchy will be the common ancestors of ''c1'' and ''c2''. Given a measure for each concept which assigns higher values for concepts further from the root of the hierarchy (more specific terms in an ontology consisting of ''is-a'' relationships directed towards the root), the common ancestor of ''c1'' and ''c2'' furthest from the root (the most specific common ancestor, or "lowest common ancestor (LCA)" according to Schlicker et al.) is likely to provide the highest scoring concept subsuming ''c1'' and ''c2''.</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>When comparing two concepts, ''c1'' and ''c2'', Resnik refers to the set of concepts subsuming ''c1'' and ''c2'' which in a hierarchy will be the common ancestors of ''c1'' and ''c2''. Given a measure for each concept which assigns higher values for concepts further from the root of the hierarchy (more specific terms in an ontology consisting of ''is-a'' relationships directed towards the root), the common ancestor of ''c1'' and ''c2'' furthest from the root (the most specific common ancestor, or "lowest common ancestor (LCA)" according to Schlicker et al.) is likely to provide the highest scoring concept subsuming ''c1'' and ''c2''.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>[[Category:Bioscape]]</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>[[Category:Bioscape]]</div></td></tr>
</table>PaulBoddiehttp://irefindex.vib.be/wiki/index.php?title=Gene_Ontology_similarity_measurement&diff=3017&oldid=prevPaulBoddie: Added concept comparison notes.2010-09-30T14:04:05Z<p>Added concept comparison notes.</p>
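The concept comparison described in these notes can be sketched in Python as follows. This is only an illustration under stated assumptions: the toy hierarchy, the information content values and the function names are all made up, and a term is treated as subsuming itself so that comparing a term with its own ancestor yields that ancestor.

```python
# Hypothetical toy hierarchy: each term maps to its parents (is-a links).
PARENTS = {
    "entity": [],
    "animal": ["entity"],
    "plant": ["entity"],
    "dog": ["animal"],
    "cat": ["animal"],
}

# Hypothetical information content values (higher means more specific).
IC = {"entity": 0.0, "animal": 1.0, "plant": 1.0, "dog": 1.74, "cat": 3.32}


def subsumers(term):
    """Return the term itself together with all of its ancestors."""
    found = {term}
    for parent in PARENTS[term]:
        found |= subsumers(parent)
    return found


def most_informative_common_ancestor(c1, c2):
    """Pick the highest scoring concept that subsumes both c1 and c2."""
    common = subsumers(c1) & subsumers(c2)
    return max(common, key=lambda term: IC[term])


def similarity(c1, c2):
    """Score a pair of concepts by their best shared subsumer."""
    return IC[most_informative_common_ancestor(c1, c2)]
```

In this toy hierarchy, "dog" and "cat" share "animal" as their most specific common ancestor, whereas "dog" and "plant" share only the root and so score 0.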
<table class="diff diff-contentalign-left" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">Revision as of 14:04, 30 September 2010</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l22" >Line 22:</td>
<td colspan="2" class="diff-lineno">Line 22:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>My first instinct was to consider the specificity of ontology terms by first counting those subsumed by a particular term (including itself) ''n<sub>subtree</sub>'' and then dividing by the total number of terms ''n<sub>total</sub>'' to give the "coverage" of a particular term, subtracting this from 1 to give the specificity of a term. Obviously, this only considers features of the ontology itself and not external information such as the word frequencies used by Resnik, but "how specific a term is" is a familiar concept, at least.</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>My first instinct was to consider the specificity of ontology terms by first counting those subsumed by a particular term (including itself) ''n<sub>subtree</sub>'' and then dividing by the total number of terms ''n<sub>total</sub>'' to give the "coverage" of a particular term, subtracting this from 1 to give the specificity of a term. Obviously, this only considers features of the ontology itself and not external information such as the word frequencies used by Resnik, but "how specific a term is" is a familiar concept, at least.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|Paul}}</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>|Paul}}</div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">When comparing two concepts, ''c1'' and ''c2'', Resnik refers to the set of concepts subsuming ''c1'' and ''c2'' which in a hierarchy will be the common ancestors of ''c1'' and ''c2''. Given a measure for each concept which assigns higher values for concepts further from the root of the hierarchy (more specific terms in an ontology consisting of ''is-a'' relationships directed towards the root), the common ancestor of ''c1'' and ''c2'' furthest from the root (the most specific common ancestor, or "lowest common ancestor (LCA)" according to Schlicker et al.) is likely to provide the highest scoring concept subsuming ''c1'' and ''c2''.</ins></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>[[Category:Bioscape]]</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>[[Category:Bioscape]]</div></td></tr>
</table>PaulBoddiehttp://irefindex.vib.be/wiki/index.php?title=Gene_Ontology_similarity_measurement&diff=3015&oldid=prevPaulBoddie: Some initial notes.2010-09-28T17:06:10Z<p>Some initial notes.</p>
<p><b>New page</b></p><div>Some notes about measuring similarity of Gene Ontology terms and thus genes (and perhaps even proteins) on this basis.<br />
<br />
The starting point for this investigation is the paper [http://www.biomedcentral.com/1471-2105/7/302/ Schlicker et al., "A new measure for functional similarity of gene products based on Gene Ontology"] which in turn leads to the following papers:<br />
<br />
* [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.55.5277 Resnik (1995), "Using Information Content to Evaluate Semantic Similarity in a Taxonomy"]<br />
* [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.6442 Resnik (1999), "Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language"]<br />
* [http://www.ncbi.nlm.nih.gov/pubmed/12835272?dopt=AbstractPlus&holding=f1000,f1000m,isrctn Lord et al., "Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation."]<br />
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2655092/?tool=pubmed Sheehan et al., "A relation based measure of semantic similarity for Gene Ontology annotations"]<br />
<br />
Resnik defines the information of each concept (or term) in a taxonomy (or ontology) using the notion of [http://en.wikipedia.org/wiki/Self-information information content] by stating that...<br />
<br />
# Each concept has a probability associated with it (defining the probability of "encountering an instance" of that concept).<br />
# Where a concept ''c1'' is subsumed by ''c2'' (as in ''c1 is-a c2'') then the probability of encountering an instance of ''c1'' is no greater than that of encountering an instance of ''c2''.<br />
# Where a single root concept exists, since it subsumes all possible concepts, the probability of encountering an instance of it is 1.<br />
# Since information content is defined as ''-log p(c)'' for a concept ''c'', less probable concepts have higher information content.<br />
<br />
The probability of each concept was defined by the cumulative frequency of all nouns subsumed by that concept divided by the total noun frequency of the corpus.<br />
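As a rough illustration of this definition, the following Python sketch computes cumulative frequencies, probabilities and information content over a small taxonomy. The term names, the counts and the function names are assumptions made up for this example (they come from neither Resnik's corpus nor the Gene Ontology), and a base-2 logarithm is assumed so that the result is expressed in bits.

```python
import math

# Hypothetical toy taxonomy: each term maps to its direct children.
# This sketch assumes a tree; a term with several parents would be
# counted more than once.
CHILDREN = {
    "entity": ["animal", "plant"],
    "animal": ["dog", "cat"],
    "plant": [],
    "dog": [],
    "cat": [],
}

# Hypothetical observed frequency of each term on its own.
OBSERVED = {"entity": 0, "animal": 10, "plant": 50, "dog": 30, "cat": 10}


def cumulative_frequency(term):
    """Observed frequency of a term plus that of every term it subsumes."""
    return OBSERVED[term] + sum(cumulative_frequency(c) for c in CHILDREN[term])


def probability(term, root="entity"):
    """Cumulative frequency of a term divided by the total frequency."""
    return cumulative_frequency(term) / cumulative_frequency(root)


def information_content(term, root="entity"):
    """Information content -log2 p(c), in bits."""
    return -math.log2(probability(term, root))
```

With these made-up counts the root has a probability of 1, "animal" has a probability of 0.5 and thus an information content of 1 bit, and the rarer term "cat" has a higher information content than "animal".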
<br />
{{Note|<br />
I find the application of information content to be less than helpful in this context since it is often used to analyse or illustrate properties of communications representations as described in [http://www.cmh.edu/stats/model/InfoModel.htm these notes about information theory and data compression]. When deriving the information content, one first divides 1 by ''p(c)'', which appears to define the "granularity of the state space" or the number of distinct states required to represent the communication of an occurrence of ''c''. Taking the logarithm of this result (''log 1/p(c)'' and thus ''-log p(c)'') then defines the number of digits or bits (if a base-2 logarithm is used) required to encode such an outcome.<br />
<br />
My first instinct was to consider the specificity of ontology terms: count the terms subsumed by a particular term (including itself), ''n<sub>subtree</sub>'', divide by the total number of terms, ''n<sub>total</sub>'', to give the "coverage" of that term, and subtract this from 1 to give its specificity, ''1 - n<sub>subtree</sub>/n<sub>total</sub>''. This only considers features of the ontology itself and not external information such as the word frequencies used by Resnik, but "how specific a term is" is a familiar concept, at least.<br />
|Paul}}<br />
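The specificity measure described in the note above might be sketched in Python as follows. The toy ontology and the function names are hypothetical, and the sketch assumes a tree with a single root, so that the total number of terms equals the size of the root's subtree.

```python
# Hypothetical toy ontology: each term maps to its direct children.
CHILDREN = {
    "entity": ["animal", "plant"],
    "animal": ["dog", "cat"],
    "plant": [],
    "dog": [],
    "cat": [],
}


def subtree_size(children, term):
    """Count the terms subsumed by a term, including the term itself."""
    return 1 + sum(subtree_size(children, c) for c in children[term])


def specificity(children, term, root):
    """One minus the "coverage" of a term (n_subtree / n_total)."""
    coverage = subtree_size(children, term) / subtree_size(children, root)
    return 1 - coverage
```

Here the root covers all five terms and so has a specificity of 0, while a leaf term covers only itself and has the highest specificity available in the ontology.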
<br />
[[Category:Bioscape]]</div>PaulBoddie