PolySearch2 Algorithm


PolySearch2 System Overview

PolySearch 2.0 consists of five basic components: 1) a web-based user interface for constructing queries and for displaying and synopsizing results; 2) a computational API for selecting, ranking and integrating content; 3) a collection of biomedical synonyms (custom thesauruses); 4) a general text search engine, built with ElasticSearch, for retrieving relevant articles and records from heterogeneous databases; and 5) a collection of internal text corpora and biomedical databases. An outline of PolySearch2's general design is given in Figure 1.

Figure 1: PolySearch 2.0 system overview showing the resources that PolySearch 2.0 uses and the system architecture.


PolySearch 2.0 Sentence scoring, ranking and integration

Some of the more important rules that PolySearch's pattern recognition system applies when scoring, ranking and integrating sentences are as follows:

  • For the main pattern "Query Word-Association Word-Thesaurus Word", PolySearch searches for compact patterns first. If a compact pattern cannot be found, then PolySearch searches for general patterns. If a general pattern cannot be found, then PolySearch searches for relaxed patterns.
  • Compact patterns:
    • The query word and the association word must be within 5 words (tokens) of each other.
    • A "Query Word-Association Word-Thesaurus Word" pattern must be established (i.e. all three types of words are present) within 10 words (tokens) of the query word.
    • A stop word such as "that", "which", "whereas" or "no" cannot be in a "Query Word-Association Word-Thesaurus Word" pattern.
    • Once a "Query Word-Association Word-Thesaurus Word" pattern is established, any thesaurus words that come after that phrase can also meet the pattern recognition criteria.
    • Once a "Query Word-Association Word-Thesaurus Word" pattern is established, if another association word or stop word is seen, the pattern resets.
  • General patterns:
    • All relevant words must be within 40 words (tokens) of each other.
    • A "Query Word-Association Word-Thesaurus Word" pattern must be established (i.e. all three types of words are present) within 15 words (tokens) of the query word.
    • A stop word such as "that", "which", "whereas" or "no" cannot be in a "Query Word-Association Word-Thesaurus Word" pattern.
    • Once a "Query Word-Association Word-Thesaurus Word" pattern is established, any thesaurus words that come after that phrase can also meet the pattern recognition criteria.
    • Once a "Query Word-Association Word-Thesaurus Word" pattern is established, if another association word or stop word is seen, the pattern resets.
  • Relaxed patterns:
    • All relevant words must be within 45 words (tokens) of each other.
    • The query word and the association word must be within 30 words (tokens) of each other.
    • A "Query Word-Association Word-Thesaurus Word" pattern must be established (i.e. all three types of words are present) within 40 words (tokens) of the query word.
    • Once a "Query Word-Association Word-Thesaurus Word" pattern is established, any thesaurus words that come after that phrase can also meet the pattern recognition criteria.
    • Once a "Query Word-Association Word-Thesaurus Word" pattern is established, if another association word is seen, the pattern resets.
  • For the "Association Word-Query Word-Thesaurus Word" pattern (mainly for Gene/Protein searches), the association word must have a suffix of -ate, -fer, -ment, -ing, -ion, -lex, -es, or -ions. In addition, all three words must be within 10 words (tokens) of each other.
  • For the "Query Word-Thesaurus Word-Association Word" pattern (mainly for Gene/Protein searches), the association word must be one of "complex", "complexes", "inhibitor", "inhibitors", "interaction", or "interactions". In addition, all three words must be within 8 words (tokens) of each other.
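The rules above can be illustrated with a minimal Python sketch. It is not PolySearch's actual implementation: the token labels ("Q" for query word, "A" for association word, "T" for thesaurus word, "S" for stop word, "O" for anything else), the names Tier, match_tier, find_associations and association_word_allowed, and the choice to treat a "reset" as ending the scan from the current query word are all assumptions made for illustration. Only the numeric limits, the suffix list and the word list are taken from the rules listed above. Distances are measured from the query word, which is the leftmost relevant word in the main pattern.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Tier:
    name: str
    max_qa: Optional[int]    # max query-to-association distance (None = not stated)
    max_pattern: int         # pattern must complete within this many tokens of the query word
    max_span: Optional[int]  # max distance between relevant words (None = not stated)
    stop_breaks: bool        # True if a stop word breaks or resets the pattern


# Limits copied from the compact, general and relaxed rule lists.
TIERS = [
    Tier("compact", 5, 10, None, True),
    Tier("general", None, 15, 40, True),
    Tier("relaxed", 30, 40, 45, False),
]


def match_tier(labels: List[str], tier: Tier) -> List[int]:
    """Return indices of thesaurus tokens matched by Q-A-T patterns under one tier."""
    hits = set()
    for q, label in enumerate(labels):
        if label != "Q":
            continue
        assoc = None          # index of the association word, once found
        established = False   # True once Q, A and T have all been seen
        for i in range(q + 1, len(labels)):
            dist = i - q
            if tier.max_span is not None and dist > tier.max_span:
                break
            if labels[i] == "S" and tier.stop_breaks:
                break         # stop word inside or after the pattern
            if labels[i] == "A":
                if established:
                    break     # a second association word resets the pattern
                if assoc is None and (tier.max_qa is None or dist <= tier.max_qa):
                    assoc = i
            elif labels[i] == "T" and assoc is not None:
                if not established and dist > tier.max_pattern:
                    break     # pattern was not completed in time
                established = True
                hits.add(i)   # later thesaurus words also qualify
    return sorted(hits)


def find_associations(labels: List[str]) -> Tuple[Optional[str], List[int]]:
    """Try compact, then general, then relaxed; return the first tier that matches."""
    for tier in TIERS:
        hits = match_tier(labels, tier)
        if hits:
            return tier.name, hits
    return None, []


# Association-word filters for the two Gene/Protein patterns.
AQT_SUFFIXES = ("ate", "fer", "ment", "ing", "ion", "lex", "es", "ions")
QTA_WORDS = {"complex", "complexes", "inhibitor", "inhibitors",
             "interaction", "interactions"}


def association_word_allowed(word: str, pattern: str) -> bool:
    """pattern is 'AQT' (suffix rule) or 'QTA' (fixed word list)."""
    w = word.lower()
    if pattern == "AQT":
        return w.endswith(AQT_SUFFIXES)
    if pattern == "QTA":
        return w in QTA_WORDS
    raise ValueError(f"unknown pattern: {pattern}")


print(find_associations(["Q", "O", "A", "O", "T", "O", "T"]))  # ('compact', [4, 6])
print(find_associations(["Q", "A", "S", "T"]))                 # ('relaxed', [3])
```

In the second call the stop word blocks both the compact and the general tier, so the sentence is only accepted by the relaxed tier, whose rule list does not mention stop words.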

PolySearch2 Algorithmic Improvement

PolySearch2 incorporates substantial algorithmic improvements that strengthen the scoring, ranking, and selection of association candidates under the algorithm introduced by the original PolySearch. Please refer to the previous section for a detailed description of the original PolySearch algorithm. This section describes the algorithmic improvements in PolySearch2. These include: 1) a tightness measure to further discriminate association patterns, 2) a weight boost that favours database records over articles in the text corpora, 3) a larger collection of system filter words, and 4) the removal of borderline associations.

PolySearch2 Tightness Measure

PolySearch2 incorporates a tightness measure to further discriminate association patterns.



This project is supported by the Canadian Institutes of Health Research (award #111062), Alberta Innovates - Health Solutions, and by The Metabolomics Innovation Centre (TMIC), a nationally-funded research and core facility that supports a wide range of cutting-edge metabolomic studies. TMIC is funded by Genome Alberta, Genome British Columbia, and Genome Canada, a not-for-profit organization that is leading Canada's national genomics strategy with $900 million in funding from the federal government.