(Thanks to Bevan Koopman for helping compiling this activity)
The following are instructions on the RUSSIR Health Search Tutorial hands-on sessions.
This tutorial demostrates using a number of different tools to implement a search system in the health domain. These tools require some setup and configuration. To use the burden in this we have distributed them pre-build via a docker image.
1) First, download and install docker for your operating system.
Now we need two docker images: one for elastic and the other for this tutorial.
2) For elastic:
docker pull docker.elastic.co/elasticsearch/elasticsearch:6.3.1
3) For this tutorial:
docker pull ielabgroup/health-search-tutorial
Alternative:
If the docker image was saved to file (not obtained via docker hub) then it can be loaded via:
gunzip health-search-tutorial-docker.tar.gz
docker load -i health-search-tutorial-docker.tar
.)In the task you will:
docker run --net="host" -it ielabgroup/health-search-tutorial
This will run the start the docker image and you will be presented with a bash shell on a Ubuntu virtual machine. By default you will start off in the /health-search-tutorial
directory. This contains all the scripts and documents for this tutorial.
In this direction are the following files:
File | Description |
---|---|
README.md |
This file |
adhoc-queries.json |
Test query topics |
annotate_docs.py |
Script to take plain text docs and annotated with UMLS concepts; see annotate_docs.py -h for info. |
docs |
500 sample clinical trial documents in plain text |
index_docs.py |
Script to take annotated documents and index in Elastic; see index_docs.py -h for details |
qrels-clinical_trials.txt |
Qrels for the test query topics and docs. |
requirements.txt |
List of python dependecies. |
search_docs.py |
Script for searching with Elastic; see search_docs.py -h for details. |
Have a look at the clinical trials documents in the docs
directory:
less docs/NCT02631304
We will now identify medical concepts from free-text.
First start the QuickUMLS service in the background:
python /QuickUMLS-master/server.py /quickumls-data &
Then lets do a test to identify some medical concepts from the phases “Family history of lung cancer”. Run:
python annotate_docs.py "Family history of lung cancer"
You should see some JSON returns with concepts identified; e.g.:
{'C0260515': 'Family history of cancer', 'C0728711': 'Family history of lung cancer'}
Now we will use the same script to annotated all the documents in docs
:
python annotate_docs.py -d docs
This will annotated each document and write the output to annotated_docs
as individual JSON file.
Note: this process can be very slow on docker. You can Ctrl C to cancel it if its taking too long. We have already run this and the output is all in annotated_docs
.
Now, take a look in some files (e.g., less annotated_docs/NCT02631304.json
) and you will see the document now has three fields:
text
containing the original test;cuis
contained the UMLS concept ids for concepts found in the text; andconcepts
containing the umls concepts names.First, start Elastic. We will use the standard docker images provided by elastic to do this.
Open a new terminal window and make sure you run this not in the tutorial docker but on your host machine:
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:6.3.1
Ensure that the docker service is up and running by visiting: http://localhost:9200/
Return to your tutorial docker shell.
Now we want to index all the annotated documents to Elastic:
python index_docs.py -d annotated_docs
This process will take each JSON document in annoted_docs
and submit to Elatic search.
You can check the indexed docs by searching for all docs in your browser as:
http://localhost:9200/sigir-health-search/_search?pretty
You’ll see the documents have been indexed with the three fields.
Have a play with various searching by setting q=queryterm
; e.g., to search for “cancer”
http://localhost:9200/sigir-health-search/_search?q=cancer&pretty
To run a search and get a results in TREC format run:
python search_docs.py -q "lung cancer"
(remember the quotes for query term)
This will produce a ranked list in trec_eval format.
We have also provided a set of query topics to use in adhoc-queries.json
. Note that this is a special test collection we developed of clinical trials; it has multiple query variations, provided by different doctors, for a single query topic. To do a search using these topics run:
python search_docs.py -f adhoc-queries.json
Note, because there are multiple query variations for a topic, the script will choose a variation from a random person and run that - this highlights some of issues in query variation in health search.
Run the script again but this time pipe the output to a results file:
python search_docs.py -f adhoc-queries.json > trec_results.txt
We can now use the qrels to evaluate our system using trec_eval:
trec_eval qrels-clinical_trials.txt trec_results.txt
Now we want to understand a bit about how different search can be done. Let us look at some concept-based searches. First, lets run some annotation to understand different medical concepts:
Case 1: python annotate_docs.py "metastatic breast cancer"
gives us:
{'C0278488': 'metastatic breast cancer'}
But
Case 2: python annotate_docs.py "breast cancer"
gives us
{'C0006142': 'Breast cancer', 'C0678222': 'Breast cancer'}
Notice the different medical concepts.
Now let’s look at how this affects search:
python search_docs.py -q "C0278488"
yields 5 results.python search_docs.py -q "C0006142 C0006142"
yields 8 different results.python search_docs.py -q "metastatic breast cancer"
yields 107 different results.python search_docs.py -q "breast cancer"
yields 103 different results.We leave you now to experiment and discuss different search options and what impact they may have.