i2b2 NLP Shared Task: Medication Extraction Challenge


Northeastern University: College of Computer and Information Science, Boston, MA


Professor Emeritus Carole Hafner


I took part in this summer project to learn more about information extraction, classification, and natural language processing.


The i2b2 Shared Tasks and Challenges serve two purposes: obtaining annotations for new data sets (each team is assigned a portion of the training corpus to annotate) and sharing information about the latest biomedical text processing systems. The 2009 challenge, "Medication Extraction," involved extracting medication-related information (e.g., brand name, generic name, dosage size, frequency, and purpose) from narrative patient records.


I worked with another graduate student to analyze the training data during the annotation phase and to build a framework for extracting the required information. We briefly considered using Apache UIMA and cooperating with researchers at the Boston VA, but time and resource constraints ruled both out. We implemented a straightforward, semi-supervised system that was trained on publicly available data, such as the Orange Book and RxNorm, and incorporated some fuzzy logic to account for acronyms, misspellings, and shorthand. Unfortunately, it was not nearly good enough to be competitive. We later learned that the top-performing systems were not built from scratch and were heavily supplemented with outside resources, some of which we had only just begun to learn about. Winning teams also tended to be repeat participants who specialized in biomedical text mining and healthcare informatics. In the end, this was a great learning experience: participating in the competition gave me the background I needed to understand and appreciate the approaches used by the top-performing teams.
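
To give a sense of what tolerant lexicon lookup for misspellings and shorthand can look like, here is a minimal sketch. It is not the project's actual code: the tiny `LEXICON` and `SHORTHAND` tables, the function names, and the 0.8 similarity cutoff are all assumptions made for illustration, and approximate string matching from Python's standard `difflib` module stands in for whatever matching the original system used. A real system would load its drug names from a resource such as RxNorm or the Orange Book.

```python
import difflib
import re

# Hypothetical mini-lexicon; a real system would load names from
# a resource such as RxNorm or the Orange Book.
LEXICON = [
    "acetaminophen",
    "atorvastatin",
    "hydrochlorothiazide",
    "lisinopril",
    "metformin",
]

# Hypothetical table of clinical shorthand mapped to lexicon entries.
SHORTHAND = {
    "hctz": "hydrochlorothiazide",
    "apap": "acetaminophen",
}


def match_medication(token, cutoff=0.8):
    """Return the lexicon entry closest to token, or None if nothing is close."""
    word = token.lower()
    # Exact shorthand lookup comes first, since abbreviations rarely
    # resemble the full name closely enough for approximate matching.
    if word in SHORTHAND:
        return SHORTHAND[word]
    # Approximate match: keep the single best candidate above the cutoff.
    candidates = difflib.get_close_matches(word, LEXICON, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None


def find_medications(text):
    """Return (token, normalized name) pairs for tokens that match the lexicon."""
    found = []
    for token in re.findall(r"[A-Za-z]+", text):
        name = match_medication(token)
        if name is not None:
            found.append((token, name))
    return found


if __name__ == "__main__":
    print(find_medications("Pt takes lisinoprl and HCTZ daily"))
```

With these assumed tables, the misspelled "lisinoprl" resolves to "lisinopril" and "HCTZ" resolves to "hydrochlorothiazide", while ordinary words such as "daily" are ignored. The cutoff trades recall for precision: lowering it catches worse misspellings but starts matching unrelated words.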

Technologies Used

Linux, NLTK, Orange Book, Python, RxNorm, Subversion


  • Annotated training data
  • Source code and documentation
  • Evaluation results