PolySearch 2.0 is a web server for text mining and semi-automated discovery of text associations between various types of biomedical entities.
A critical task in biomedical text mining is to discover potential associations between various types of biomedical entities. PolySearch 2.0 (polysearch.ca) is an online text-mining system for identifying relationships between human diseases, genes, proteins, drugs, metabolites, toxins, metabolic pathways, organs, tissues, subcellular organelles, positive health effects, negative health effects, drug actions, Gene Ontology terms, MeSH terms, ICD-10 medical codes, biological taxonomies and chemical taxonomies. PolySearch 2.0 supports a generalized 'Given X, find all associated Ys' query, where X and Y can be selected from the aforementioned biomedical entities. For example, 'Find all associated diseases with Bisphenol A'. PolySearch 2.0 searches for associations against comprehensive collections of free-text corpora, including local versions of MEDLINE abstracts, PubMed Central full-text articles, Wikipedia full-text articles, and US Patent application abstracts. PolySearch 2.0 also searches 14 widely used, text-rich biological databases such as UniProt, DrugBank and HMDB to improve its accuracy and coverage. PolySearch 2.0 maintains an extensive thesaurus of biological terms and exploits the latest search engine technology to rapidly retrieve relevant articles and database records. PolySearch 2.0 also generates, ranks, and annotates associative candidates and presents results with relevancy statistics and highlighted key sentences to facilitate user interpretation.
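To make the 'Given X, find all associated Ys' idea concrete, here is a minimal sketch of sentence-level co-occurrence counting against a thesaurus. It is an illustration only and not PolySearch's actual scoring method: the function names, the tiny thesaurus, and the counting rule are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical toy thesaurus: canonical disease name -> synonyms.
# (Illustrative only; the real thesauri hold over a million entries.)
DISEASE_THESAURUS = {
    "obesity": ["obesity", "obese"],
    "diabetes": ["diabetes", "diabetic"],
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mentions(sentence, terms):
    """Return True if any term occurs in the sentence as a whole word or phrase."""
    lowered = sentence.lower()
    return any(
        re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
        for term in terms
    )


def find_associations(query_terms, thesaurus, documents):
    """Rank thesaurus entries by how many sentences mention them with the query.

    Assumption: a candidate Y counts as associated with X whenever both
    appear in the same sentence. Returns (ranked_counts, key_sentences).
    """
    counts = Counter()
    key_sentences = {}
    for document in documents:
        for sentence in split_sentences(document):
            if not mentions(sentence, query_terms):
                continue
            for name, synonyms in thesaurus.items():
                if mentions(sentence, synonyms):
                    counts[name] += 1
                    key_sentences.setdefault(name, []).append(sentence)
    return counts.most_common(), key_sentences
```

A call such as `find_associations(["bisphenol a", "bpa"], DISEASE_THESAURUS, documents)` returns the candidate diseases ordered by co-occurrence count, together with the sentences that support each one. Counting at sentence level is a deliberate simplification; a production system would weight evidence more carefully.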
PolySearch was first published in the 2008 NAR Web Server Issue and has been cited 140+ times to date. It was one of the first web-enabled text mining tools to support comprehensive and associative text searches of PubMed abstracts. It is quite popular and has handled more than 23,500 jobs since 2009. Key limitations of PolySearch have been its long search times (2-3 minutes) and its small number of searchable databases.
The major updates for PolySearch version 2.0 include:
- A complete re-implementation of the underlying text-mining framework based on the latest search engine technology (Elasticsearch). This has led to the ability to search against all thesaurus types simultaneously, with a significant performance improvement and a nearly 25X acceleration in search times.
- A complete upgrade and re-implementation of the web interface using the latest web technology standards (HTML5 & Twitter Bootstrap). The new server features a quick search mode and an advanced mode with optional configurations for experienced users. The new web interface is also compatible with mobile devices.
- Significantly expanded custom thesauri, from 9 to 20 categories and from just 3000 to over 1.13 million term entries. In particular, we have expanded the thesauri to include toxins, food metabolites, biological taxonomies, pathways, as well as Gene Ontology, MeSH terms, and ICD-10 codes. The thesauri also feature many manually curated terms and synonyms for health effects, drug effects, adverse effects, and chemical taxonomies.
- Significantly expanded number of text corpora and databases (by >80%) to include a total of 6 free-text corpora and 14 bioinformatics databases. The latest server searches against over 43 million articles covering MEDLINE abstracts, PubMed Central full-text articles, Wikipedia articles, US Patent abstracts, and open access textbooks.
- Significantly expanded support for caching and updating. The cache keeps track of the latest computed results and provides almost instant feedback. New documents are automatically retrieved and indexed to ensure that our corpora contain the latest documents.
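The interplay between caching and corpus updates described in the last item can be sketched as follows. This is a hedged illustration, not the server's implementation: the `ResultCache` class, its method names, and the rule that re-indexing invalidates older results are all assumptions.

```python
class ResultCache:
    """Keeps the latest computed result per query key.

    Assumption: each entry is tagged with a corpus version, and an entry
    is reused only if no new documents have been indexed since it was
    computed.
    """

    def __init__(self):
        self._store = {}          # key -> (corpus_version, result)
        self._corpus_version = 0  # incremented whenever the corpus changes

    def bump_corpus_version(self):
        """Call after new documents are indexed so stale results are recomputed."""
        self._corpus_version += 1

    def get_or_compute(self, key, compute):
        """Return the cached result for key, or run compute() and store it."""
        entry = self._store.get(key)
        if entry is not None and entry[0] == self._corpus_version:
            return entry[1]
        result = compute()
        self._store[key] = (self._corpus_version, result)
        return result
```

A repeated query, for example one keyed by `("bisphenol a", "disease")`, is then answered from the cache until `bump_corpus_version()` is called, at which point the next request triggers a fresh search.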
- Yifeng Liu, University of Alberta, Edmonton, Canada (yifeng at ualberta.ca),
- Yongjie Liang, University of Alberta, Edmonton, Canada (yongjiel at ualberta.ca),
- David Wishart, University of Alberta, Edmonton, Canada (dwishart at ualberta.ca)
Liu Y., Liang Y., Wishart D.S. (2015) PolySearch 2.0: A significantly improved text-mining system for discovering associations between human diseases, genes, drugs, metabolites, toxins, and more. Nucleic Acids Res. 2015 (Web Server Issue) Manuscript submitted.
Cheng D., Knox C., Young N., Stothard P., Damaraju S., Wishart D.S. (2008) PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites. Nucleic Acids Res. 2008 Jul 1;36(Web Server issue):W399-405.
This project is supported by the Canadian Institutes of Health Research (award #111062), Alberta Innovates - Health Solutions, and by The Metabolomics Innovation Centre (TMIC), a nationally-funded research and core facility that supports a wide range of cutting-edge metabolomic studies. TMIC is funded by Genome Alberta, Genome British Columbia, and Genome Canada, a not-for-profit organization that is leading Canada's national genomics strategy with $900 million in funding from the federal government.