* Geraint Duck, Robert Stevens, David Robertson and Goran Nenadic. Ambiguity and Variability of Database and Software Names in Bioinformatics
There are now numerous options available to achieve various tasks in bioinformatics but, as yet, little progress has been made to capture the common practice by analysing usage and mentions of databases and tools within the literature. In this paper we analyse the variability and ambiguity of database and software name mentions and provide a set of 30 full-text documents manually annotated on the mention level. Our analyses show that identification of mentions of databases and tools is not a task that can be achieved through dictionary matching alone: our baseline dictionary look-up achieved an F-score of just over 50%. This is primarily because of high variability and ambiguity in database and software mentions contained within the literature and due to the extensive number of resources available. We characterise the issues with various mention types and propose potential ways of capturing additional database and software mentions in the literature.
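As a rough illustration of why plain dictionary look-up struggles, the sketch below matches a small name list against one sentence and scores the result with a balanced F-score. The sentence, the dictionary, the gold spans, and the helper names (`dictionary_lookup`, `f_score`, `span_of`) are all hypothetical and are not taken from the paper; exact, case-sensitive matching is an assumption made to keep the example small.

```python
import re

def dictionary_lookup(text, names):
    """Return (start, end) character spans where a dictionary name occurs verbatim."""
    spans = set()
    for name in names:
        # Case-sensitive whole-token match, so variants such as "Blast" are missed.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            spans.add((match.start(), match.end()))
    return spans

def f_score(predicted, gold):
    """Balanced F-score over exact span matches."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def span_of(text, mention):
    """Span of the first occurrence of `mention` in `text`."""
    start = text.index(mention)
    return (start, start + len(mention))

# Hypothetical sentence, dictionary, and gold annotations (illustration only).
text = "We ran BLAST and ClustalW; Figure R shows the Blast hits."
dictionary = ["BLAST", "ClustalW", "R"]
gold = {span_of(text, m) for m in ("BLAST", "ClustalW", "Blast")}

predicted = dictionary_lookup(text, dictionary)
score = f_score(predicted, gold)
print(f"F-score: {score:.2f}")
```

In this toy case the ambiguous name "R" produces a false positive (it matches a figure label) and the variant spelling "Blast" is missed, which are the two kinds of error the abstract attributes to ambiguity and variability.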
* Aron Henriksson, Hans Moen, Maria Skeppstedt, Ann-Marie Eklund, Vidas Daudaravičius and Martin Hassel. Synonym Extraction of Medical Terms from Clinical Text Using Combinations of Word Space Models
In information extraction, it is useful to know if two signifiers have the same or very similar semantic content. Maintaining such information in a controlled vocabulary is, however, costly. Here it is demonstrated how synonyms of medical terms can be extracted automatically from a large corpus of clinical text using distributional semantics. By combining Random Indexing and Random Permutation, different lexical semantic aspects are captured, effectively increasing our ability to identify synonymic relations between terms. 44% of 340 synonym pairs from MeSH are successfully extracted in a list of ten suggestions. The models can also be used to map abbreviations to their full-length forms; simple pattern-based filtering of the suggestions yields substantial improvements.
* Fiona Callaghan, Matthew Jackson, Dina Demner-Fushman, Swapna Abhyankar and Clement McDonald. NLP-derived information improves the estimates of risk of disease compared to estimates based on manually extracted data alone.
Natural language processing (NLP) enables researchers to extract large quantities of information from free text that otherwise could only be extracted manually. This information can then be used to answer clinical research questions via statistical analysis. However, NLP extracts information with some degree of error (the sensitivity and specificity of state-of-the-art NLP methods are typically 80-90%), and most statistical methods assume that the information has been observed "without measurement error". As we show in this paper, if an NLP-derived smoking status predictor is used, for example, to estimate the risk of smoking-related cancer without any adjustment for measurement error, the estimate is biased. Conversely, if a smaller subset of manually extracted data is used alone, then the estimate is unbiased but imprecise, and the corresponding inference methods tend to have low power to detect significant relationships. We propose using a statistical measurement error method, a maximum likelihood (ML) method, that combines information from NLP with manually validated data to produce unbiased estimates that also have good power to detect a significant signal. This method has the potential to open up large free-text databases to statistical analysis for clinical research. With a case study using smoking status to predict smoking-related cancer and simulations, we demonstrate that the ML method performs better under a variety of scenarios than using either NLP or manually extracted data alone.
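The bias described above can be made concrete with a small expected-value calculation. The sketch below is not the authors' ML method; it only shows how an unadjusted risk difference shrinks when exposure status comes from an imperfect classifier. It assumes that misclassification is independent of the outcome given the true status, and every number in it (prevalence, risks, sensitivity, specificity) is hypothetical.

```python
def observed_risk_difference(prevalence, sensitivity, specificity,
                             risk_exposed, risk_unexposed):
    """Expected risk difference when exposure is read off an imperfect classifier.

    Assumes classifier errors do not depend on the outcome (an assumption of
    this sketch, not a claim about the paper's data).
    """
    # Share of the population flagged as exposed, split by true status.
    flagged_true = sensitivity * prevalence
    flagged_false = (1 - specificity) * (1 - prevalence)
    ppv = flagged_true / (flagged_true + flagged_false)

    # Share of the population flagged as unexposed, split by true status.
    cleared_true = specificity * (1 - prevalence)
    cleared_false = (1 - sensitivity) * prevalence
    npv = cleared_true / (cleared_true + cleared_false)

    # Each flagged group is a mixture of truly exposed and truly unexposed people.
    risk_flagged = ppv * risk_exposed + (1 - ppv) * risk_unexposed
    risk_cleared = (1 - npv) * risk_exposed + npv * risk_unexposed
    return risk_flagged - risk_cleared, ppv, npv

# Hypothetical values: 30% smokers, 85% sensitivity and specificity.
true_rd = 0.10 - 0.02
observed_rd, ppv, npv = observed_risk_difference(0.30, 0.85, 0.85, 0.10, 0.02)
print(f"true risk difference:     {true_rd:.3f}")
print(f"observed risk difference: {observed_rd:.3f}")
```

Under these assumptions the observed difference equals the true difference multiplied by (PPV + NPV - 1), so it is pulled towards zero whenever the classifier is imperfect.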
* Tobias Kuhn and Michael Krauthammer. Image Mining from Gel Diagrams in Biomedical Publications
Authors of biomedical publications often use gel images to report experimental results such as protein-protein interactions or protein expressions under different conditions. Gel images offer a way to concisely communicate such findings, not all of which need to be explicitly discussed in the article text. This fact together with the abundance of gel images and their shared common patterns makes them prime candidates for image mining endeavors. We introduce an approach for the detection of gel images, and present an automatic workflow to analyze them. We are able to detect gel segments and panels at high accuracy, and present first results for the identification of gene names in these images. While we cannot provide a complete solution at this point, we present evidence that this kind of image mining is feasible.
* Seung-Cheol Baek and Jong C. Park. Use of clue word annotations as the silver-standard in training models for biological event extraction
Current state-of-the-art approaches to biological event extraction train models by reconstructing relevant graphs from training sentences, where labeled nodes correspond to tokens that indicate the presence of events and the relations between nodes correspond to the relations between these events and their participants. Since multi-word expressions may also indicate events, these approaches use heuristic rules to define target graphs to reconstruct by mapping various clue words into single tokens. Since training instances define actual problems to solve, the method of deriving graphs must affect the system performance, but there has not been any related study on this aspect, to the best of our knowledge. In this study, we propose an incorporation of an EM algorithm into supervised learning to look for training graphs that are more favorable for model construction. We evaluate our algorithm on the development dataset in the 2009 BioNLP shared task and show that this algorithm makes a statistically meaningful improvement on the performance of trained models over a supervised learning algorithm on a fixed set of training graphs.
* Pontus Stenetorp, Hubert Soyer, Sampo Pyysalo, Sophia Ananiadou and Takashi Chikayama. Size (and Domain) Matters: Evaluating Semantic Word Space Representations for Biomedical Text
Despite the availability of large corpora of unannotated biomedical scientific texts, domain machine learning-based systems tend to draw only on comparatively small manually annotated corpora. In this work, we explore opportunities to support supervised machine learning through the use of word representations induced from large unannotated corpora. We evaluate a number of established methods extrinsically, by studying the capacity of induced representations to support machine learning-based natural language processing tasks, specifically named entity recognition on three different corpora and semantic category disambiguation on a large automatically acquired corpus. Experiments demonstrate both a clear benefit of many semantic representations on both tasks and all corpora as well as a strong domain dependence, indicating that semantic representations should be induced on documents drawn from the domain relevant to the supervised learning tasks they aim to support. All of the code and resources introduced in this study are freely available from http://wordreprs.nlplab.org/
* Gerold Schneider, Simon Clematide, Gintare Grigonyte and Fabio Rinaldi. Using syntax features and document discourse for relation extraction on PharmGKB and CTD
We present an approach to the extraction of relations between pharmacogenomics entities like drugs, genes and diseases which is based on syntax and on discourse. Discourse in particular has not been widely studied as a means of improving text mining. We learn syntactic features semi-automatically from lean document-level annotation. We show how a simple Maximum Entropy based machine learning approach helps to estimate the relevance of candidate relations based on dependency-based features found in the syntactic path connecting the involved entities. Maximum Entropy based relevance estimation of candidate pairs conditioned on syntactic features improves relation ranking by 68% (relative) as measured by AUCiP/R and by 60% as measured by TAP-k (k=10). We also show that automatically recognizing document-level discourse characteristics to expand and filter acronyms improves term recognition and interaction detection by 12% (relative), as measured by AUCiP/R and by TAP-k (k=10). Our pilot study uses PharmGKB and CTD as resources.
* Agnieszka Mykowiecka and Malgorzata Marciniak. Terminology Extraction from Medical Texts in Polish
The paper presents the first results of terminology extraction from hospital discharge documents written in Polish. To begin, we characterise the language of the texts, which differs significantly from general Polish. Then, we describe our approach to the extraction task, which consists of two steps. The first one identifies candidates for terms, and is supported by linguistic knowledge. The second step is based on statistics: it ranks and filters candidates for domain terms with the help of the C-value method. In order to count the frequencies of phrases, we decided to use their artificial base forms. The paper presents the pros and cons of this approach. Finally, we describe the results and present two types of evaluation: the first indicates how many terms we are able to identify in a real text with the help of the extracted terminology, while the second tests how many proper terms have been extracted.
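For readers unfamiliar with the C-value method, the sketch below implements its commonly cited formulation: a candidate's frequency is reduced by the average frequency of the longer candidates that contain it, and the result is weighted by the base-2 logarithm of the candidate's length in words. Whether the paper uses exactly this variant is not stated in the abstract, so treat it as an assumption; the candidate phrases and counts are hypothetical, and the function names are made up for this example.

```python
import math

def contains(longer, shorter):
    """True if token tuple `shorter` occurs contiguously inside the strictly longer `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))

def c_values(freq):
    """Map each candidate (a tuple of tokens) to its C-value.

    `freq` holds total corpus frequencies, including nested occurrences.
    """
    scores = {}
    for candidate, f_candidate in freq.items():
        # Frequencies of the longer candidates in which this one is nested.
        nesting = [f for other, f in freq.items() if contains(other, candidate)]
        if nesting:
            adjusted = f_candidate - sum(nesting) / len(nesting)
        else:
            adjusted = f_candidate
        # log2 of the length in words; single-word candidates score 0 here.
        scores[candidate] = math.log2(len(candidate)) * adjusted
    return scores

# Hypothetical candidate phrases with hypothetical frequencies.
freq = {
    ("cell", "carcinoma"): 8,
    ("basal", "cell", "carcinoma"): 5,
    ("squamous", "cell", "carcinoma"): 1,
}
scores = c_values(freq)
for term, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(" ".join(term), round(value, 2))
```

The nested phrase is penalised because most of its occurrences are explained by the longer phrases that contain it, which is the ranking behaviour the second step of the abstract relies on.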
* Artjom Klein, Alexandre Riazanov, Khaleel Al-Rababah, Mauno Vihinen and Christopher Baker. Towards a next generation protein mutation grounding system for full texts
Mutation grounding is an automated process which links mutation annotations to specific protein sequences and their variants. This is a non-trivial algorithmic task and a number of approaches have been developed, although the scalability of existing implementations remains an issue that hinders their adoption. In this work, through rational redesign of the algorithm, we transform a proof-of-concept mutation grounding prototype, which shows acceptable performance on a modest homogeneous corpus, into a robust system capable of processing a wide range of publications with high precision and recall.
* Brita Keller, Jannik Strötgen and Michael Gertz. Event-centric Document Similarity for Biomedical Literature
Identifying similar documents for a given query document helps users to explore large document collections. However, most existing techniques are based on the vector space model and handle documents only as bags of words. Thus, more complex information that can be used for calculating similarities is not taken into account. For example, events play an important role in the biomedical literature and could be valuable to identify similar documents. In this paper, we present an event-centric document similarity model for biomedical literature and demonstrate the effectiveness of our approach based on experiments using the GENIA corpus.
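The limitation of the bag-of-words view can be seen in a toy comparison. The sketch below is not the authors' similarity model; it contrasts cosine similarity over word counts with a simple set overlap over event tuples. The two sentences, the `(type, agent, target)` event representation, and the use of Jaccard overlap are all assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(bag_a, bag_b):
    """Cosine similarity between two word-count bags."""
    dot = sum(count * bag_b[token] for token, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard(set_a, set_b):
    """Overlap between two sets of event tuples."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Two hypothetical sentences that use the same words with reversed roles.
doc_1 = "proteinA phosphorylates proteinB"
doc_2 = "proteinB phosphorylates proteinA"
word_similarity = cosine(Counter(doc_1.split()), Counter(doc_2.split()))

# Hypothetical event tuples (type, agent, target), as an event extractor might emit.
events_1 = {("Phosphorylation", "proteinA", "proteinB")}
events_2 = {("Phosphorylation", "proteinB", "proteinA")}
event_similarity = jaccard(events_1, events_2)

print(f"bag-of-words cosine: {word_similarity:.2f}")
print(f"event overlap:       {event_similarity:.2f}")
```

The word-based score treats the two sentences as identical, while the event-based score separates them because the participant roles differ.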
* Katrin Tomanek, Philipp Daumke, Frank Enders, Jens Huber, Katharina Theres and Marcel Müller. An Interactive De-Identification-System
We present an interactive system for de-identification of unstructured clinical records. De-identification is performed semi-automatically in an interactive manner where the system suggests phrases of identifying information which then need to be reviewed and verified by a human. The combination of automatic methods and manual approval ensures a high level of privacy and data security on the one hand and high throughput rates on the other hand.
* Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Sophia Ananiadou and Akiko Aizawa. Normalisation with the brat rapid annotation tool
We introduce new functionality for the BRAT rapid annotation tool, focusing on support for the manual annotation of text with normalisation annotations that identify entries in external resources such as ontologies and entity databases. The tool is available under an open-source license at http://brat.nlplab.org/
* Kai Hakala, Sofie Van Landeghem, Suwisa Kaewphan, Tapio Salakoski, Yves Van de Peer and Filip Ginter. CyEVEX: Literature-scale network integration and visualization through Cytoscape
EVEX is a literature-scale event extraction resource, publicly available via a web application and as a relational database. In this paper we present CyEVEX, a plug-in which integrates EVEX with the widely used Cytoscape network analysis platform, making the text mining data readily available for integration with experimental data sources and subsequent biological analysis. CyEVEX can populate existing networks with edges corresponding to EVEX events, as well as add new nodes to the network, revealing novel interesting genes and proteins and their relationships within the existing network.
* Maria Skeppstedt and Hercules Dalianis. Using active learning and pre-tagging for annotating clinical findings in health record text.
A method that combines pre-tagging with a version of active learning is proposed for annotating named entities in clinical text.
* Ying Yan, Senay Kafkas, Matthew Conroy and Dietrich Rebholz-Schuhmann. Towards Generating a Corpus Annotated for Prokaryote-Drug Relations.
In the biomedical text mining community, a corpus that supports the extraction of information on drug-prokaryote relations is currently an essential requirement. Understanding the relations between drugs and bacteria is vital for the development of antibiotics and the docking of other drugs, and it also contributes to various biological research purposes. In this study, we describe our ongoing efforts to develop such a corpus.