ISP – Gaurav Trivedi

AMIA 2017

I attended my first AMIA meeting last week. It was an exciting experience to meet with close to 2,500 informaticians at once. It was also a bit overwhelming due to the scale of the event as well as being in the company of famous researchers whose papers you have read:

Twitter log from the 2017 AMIA Annual Symposium held from Nov 4 – 8 in Washington DC. AMIA brings together informatics researchers, professionals, students, and everyone using informatics in health care…

If you weren’t able to attend the event in person, the good news is that a lot of informaticians are big into documenting stuff on Twitter. Check out my Twitter moment here and the hashtag #AMIA2017 for more…

Announcing NLPReViz…

Update – 5 Nov’18: Our paper was featured in AMIA 2018 Fall Symposium’s Year-in-Review!

We have released the source code for our NLPReViz project. Head to http://nlpreviz.github.io to check out its project page.

Also, here’s our new JAMIA publication on it:

Gaurav Trivedi, Phuong Pham, Wendy W Chapman, Rebecca Hwa, Janyce Wiebe, Harry Hochheiser; NLPReViz: an interactive tool for natural language processing on clinical text. Journal of the American Medical Informatics Association. 2017. DOI: 10.1093/jamia/ocx070.

On Interactive Machine Learning

When talking about machine learning, you may encounter many terms such as “online learning,” “active learning,” and “human in the loop” methods. Here are some of my thoughts on the relationship between interactive machine learning and machine learning in general. This is an extract from my answers to my comprehensive exam.

Traditionally, machine learning has been classified into supervised and unsupervised learning families. In supervised learning, the training data, $\mathcal{D}$ , consists of N feature vectors, each paired with a desired label provided by a teacher:

Training Set $\hspace{10pt} \mathcal{D} = \{(\textbf{x}_i, y_i)\}_{i=1}^{N}$

where, $\textbf{x}_i \in \mathcal{X}$ is a d-dimensional feature vector

and $y_i \in \mathcal{Y}$ is the known label for it

The task is to learn a function, $f : \mathcal{X} \to \mathcal{Y}$ , which can be used on unseen data.

In unsupervised learning, our data consists of vectors $\textbf{x}_i$ , but no target label $y_i$ . Common tasks under this category include clustering, density estimation and discovering patterns. A combination of these two is called semi-supervised learning, which has a mixture of labeled and unlabeled data in the training set. The algorithm assigns labels for missing data points using certain similarity measures.
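To make the last point concrete, here is a minimal sketch in which each unlabeled point simply takes the label of its nearest labeled neighbor. The choice of Euclidean distance as the similarity measure, the function name `assign_labels`, and the toy data are all assumptions made for illustration; practical semi-supervised methods are considerably more involved.

```python
import math

def assign_labels(labeled, unlabeled):
    """Give each unlabeled vector the label of its nearest labeled vector.

    labeled:   list of (x, y) pairs, where x is a tuple of floats and y a label
    unlabeled: list of x tuples that have no label
    Euclidean distance is an assumed similarity measure.
    """
    completed = []
    for x in unlabeled:
        # Find the labeled example closest to x and borrow its label
        _, nearest_y = min(labeled, key=lambda pair: math.dist(pair[0], x))
        completed.append((x, nearest_y))
    return completed

# Toy data (hypothetical): two labeled points, two unlabeled points
labeled = [((0.0, 0.0), "neg"), ((5.0, 5.0), "pos")]
unlabeled = [(0.5, 1.0), (4.0, 4.5)]
print(assign_labels(labeled, unlabeled))
```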

While researchers are actively working on improving unsupervised learning techniques, supervised machine learning has been the dominant form of learning to date. However, traditional supervised algorithms assume that we have training data along with the labels readily available. They are not concerned with the process of obtaining the target values $y_i$ for the training dataset. Often, obtaining labeled data is one of the main bottlenecks in applying these techniques in domain-specific applications. Further, current approaches do not provide easy mechanisms for the end-users to correct problems when models deviate from the desired learning concept. NLP models are often built by experts in linguistics and/or machine learning, with limited or no scope for the end-users to provide input. Here the domain experts, or the end-users, provide input to models only as annotations for a large batch of training data. This approach can be expensive, inefficient and even infeasible in many situations, including many problems in the clinical domain such as building models for analyzing EMR data.

“Human-in-the-loop” algorithms may be able to leverage the capabilities of a domain expert during the learning process. These algorithms can optimize their learning behavior through interaction with humans. Interactive Machine Learning (IML) is a subset of this class of algorithms. It is defined as the process of building machine learning models iteratively through end-user input. It allows the users to review model outputs and make corrections by giving feedback for building revised models. The users are then able to see model changes and verify them. This feedback loop allows end-users to refine the models further with every iteration. Some early examples that fit this definition include applications in image segmentation, interactive document clustering, document retrieval, bug triaging and even music composition. You can read more about this in the article titled "Power to the People: The Role of Humans in Interactive Machine Learning" (Amershi et al., 2014).

Interactive machine learning builds on a variety of styles of learning algorithms:

Reinforcement Learning: In this class of learning we still want to learn $f : \mathcal{X} \to \mathcal{Y}$ , but we only see samples of $\textbf{x}_i$ with no target output $y_i$ . Instead of $y_i$ , we get feedback from a critic about the goodness of the predicted output. The goal of the learner is to optimize for the reward function by selecting outputs that get the best scores from the critic. The critic can be a human or any other agent. There need not be a human-in-the-loop for the algorithm to be classified under reinforcement learning. Several recent examples of this type include building systems that learn to play games such as Flappy Bird, Mario etc.
Active Learning: Active learning algorithms try to minimize the number of labeled training examples they need. Such an algorithm would ask an oracle to give labels such that it can achieve higher accuracy with the smallest number of queries. These queries contain a batch of examples to be labeled. For example, in SVMs, one could select for labeling the examples that are closest to the margin hyperplanes to reduce the number of queries.
Online Algorithms: Online learning algorithms are used when training data is available in sequential order, say due to the nature of the problem or memory constraints, as opposed to a batch learning technique where all the training data is available at once. The algorithm must adapt to the continuous stream of data made available to it. Formulating the learning problem to handle this situation forms the core of designing algorithms under this class.
A commonly used example would be the online gradient descent method for linear regression: Suppose we are trying to learn the parameters $\mathbf{w}$ for $f(\mathbf{x}) = w_0 + w_1 x_1 + \ldots + w_d x_d$ . We update the weights when we receive the $i$ th training example by taking the gradient of the defined error function:
$\mathbf{w}_{new} \leftarrow \mathbf{w} - \alpha \times \nabla_{\mathbf{w}} Error_i (\mathbf{w})$ , where $\alpha$ is the learning rate.
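The online update for linear regression can be sketched in a few lines of Python. The post does not fix a particular error function, so this sketch assumes the squared error $Error_i(\mathbf{w}) = \frac{1}{2}(f(\mathbf{x}_i) - y_i)^2$ ; the function names, the learning rate, and the toy data stream are likewise illustrative assumptions.

```python
def predict(w, x):
    """f(x) = w_0 + w_1*x_1 + ... + w_d*x_d"""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

def online_update(w, x, y, alpha):
    """One online gradient descent step on a single example (x, y).

    Assumes Error_i(w) = 0.5 * (f(x_i) - y_i)^2, so the gradient with
    respect to w_j is (f(x_i) - y_i) * x_j, taking x_0 = 1 for the bias.
    """
    residual = predict(w, x) - y
    features = [1.0] + list(x)
    return [wj - alpha * residual * xj for wj, xj in zip(w, features)]

def stream():
    # Toy stream (hypothetical) drawn from y = 1 + 2*x, one example at a time
    for _ in range(2000):
        for x1 in (0.0, 1.0, 2.0, 3.0):
            yield (x1,), 1.0 + 2.0 * x1

w = [0.0, 0.0]
for x, y in stream():
    # The learner never sees the whole dataset, only the current example
    w = online_update(w, x, y, alpha=0.05)
print(w)
```

Since the toy stream is noise-free, the weights settle close to the generating values $w_0 = 1$ and $w_1 = 2$ . With noisy data one would typically decay $\alpha$ over time.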

This is how the relationship between supervised, interactive machine learning, and human-in-the-loop algorithms may be represented in a Venn diagram.

Interactive machine learning methods can include all or some of these learning techniques. The property common to all interactive machine learning methods is the tight interaction loop between the human and the learning algorithm. Most of the effort in interactive machine learning has been about designing interactions for each step of this loop. My work on interactive clinical and legal text analysis also follows this pattern. You are welcome to check out those posts as well!
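Here is a toy sketch of such a loop, borrowing the active learning idea of querying the example closest to the current decision boundary. Everything in it is an assumption made for illustration: the model is a one-dimensional threshold, the human is simulated by an `oracle` function, and the names `fit_threshold` and `interactive_loop` are hypothetical. A real system would put a user interface in place of the oracle call.

```python
def fit_threshold(labeled):
    """Place the decision boundary midway between the closest opposite labels."""
    largest_neg = max(x for x, y in labeled if not y)
    smallest_pos = min(x for x, y in labeled if y)
    return (largest_neg + smallest_pos) / 2.0

def interactive_loop(pool, oracle, seed_labels, rounds):
    """Alternate between querying the (simulated) human and revising the model."""
    labeled = list(seed_labels)
    pool = list(pool)
    threshold = fit_threshold(labeled)
    for _ in range(rounds):
        if not pool:
            break
        # Query selection: the example the current model is least sure about
        query = min(pool, key=lambda x: abs(x - threshold))
        pool.remove(query)
        # Human feedback, simulated here by the oracle function
        labeled.append((query, oracle(query)))
        # Revised model, which the user could now review and verify
        threshold = fit_threshold(labeled)
    return threshold, len(labeled) - len(seed_labels)

# Hidden concept known only to the simulated human: x is positive when x >= 62
oracle = lambda x: x >= 62
threshold, n_queries = interactive_loop(
    pool=range(1, 100),
    oracle=oracle,
    seed_labels=[(0, False), (100, True)],
    rounds=7,
)
print(threshold, n_queries)
```

In this toy setting the boundary is pinned down with a handful of queries rather than labels for the entire pool, which is the kind of saving the interaction loop is meant to offer.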

References

Amershi et al. (2014), Power to the People: The Role of Humans in Interactive Machine Learning. Available: https://www.microsoft.com/en-us/research/publication/power-to-the-people-the-role-of-humans-in-interactive-machine-learning/.

Hey, I passed another exam!

Today, I have completed three years of having a blog. I took to blogging as a way to document my PhD experiences (and for learning to write :D). Still, it has been very satisfying to see tens of thousands of visitors finding posts of interest to them here.

Coincidentally, I also passed my PhD comprehensive exam today and wanted to write up a post to help future students understand these milestones. As a PhD student you take so many courses and exams, but you also need to pass a few extra special ones. Different departments and schools have their own requirements but the motivation behind having each of the milestones is similar.

ISP has three main exams on the way to a PhD. You first finish all your coursework and take a preliminary exam, or prelims, with a 3-member committee of your choice. The goal here is to prove your ability to do original research by presenting the work you’ve done till then. At this point, you already have or are on your way toward your first publication in the program. After taking this exam and completing the coursework, you are eligible to receive your master’s (or second master’s) degree.

This is what a typical timeline for a PhD student in my department looks like. Of course, you can expect everyone to have their own custom version of it.

Next is the comprehensive exam (comps). The committee structure is similar to the prelims, but here you pick three topics related to your research and decide on a member responsible for each. By working with your committee members, you prepare a reading list of recent publications, important papers and book chapters.

Each of the committee members will select a list of questions for you to answer. You get 9 days to answer these questions. It may be challenging to keep up with all the papers in the list if it has a lot of items. Usually it is a good idea to include those papers that you have referred to in your prior research work.

I immensely enjoyed this process and was reminded of the Illustrated guide to a PhD by Matt Might, especially the one about “Reading research papers takes you to the edge of human knowledge”. If you haven’t seen those posts and intend to pursue a PhD, I would definitely recommend them.

Most of the questions in my exam were subjective, open-ended problems, except the first one, which made me wonder if I was interpreting it correctly. I guess it was only there as a loosener ^[1] .

After you send in your written answers, you do an oral presentation in front of all three committee members. I was also asked a few follow-up questions based on my responses. Overall, it went smoothly and everyone left pleased with my presentation.

Footnotes

[1] A term used in cricket for an easy first ball of the over.

Learning from multiple annotators

I recently prepared a deck of slides for my machine learning course. In the presentation, I talk about some of the recently proposed methods on learning from multiple annotators. In these methods we do not assume the labels that we get from the annotators to be the ground truth, as we do in traditional machine learning, but try to find “truth” from noisy data.

There are two main directions of work in this area. One focuses on finding the consensus labels first and then doing traditional learning, while the other approach is to learn a consensus model directly. In the second approach, we may estimate the consensus labels during the process of building a classifier itself.
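As a rough illustration of the first direction, the simplest way to obtain consensus labels is a per-item majority vote over the annotators, after which any traditional learner can be trained on the result. This is only a baseline sketch and not necessarily one of the methods covered in the slides; the function name, the tie-breaking rule, and the toy annotations are assumptions.

```python
from collections import Counter

def consensus_labels(annotations):
    """Majority vote per item.

    annotations: dict mapping an item id to the list of labels that
    different annotators gave it. Ties go to the label seen first,
    which is an arbitrary assumed rule.
    """
    return {
        item: Counter(labels).most_common(1)[0][0]
        for item, labels in annotations.items()
    }

# Toy annotations (hypothetical): three annotators, three items
annotations = {
    "note_1": ["pos", "pos", "neg"],
    "note_2": ["neg", "neg", "neg"],
    "note_3": ["neg", "pos", "pos"],
}
print(consensus_labels(annotations))
```

Majority voting treats every annotator as equally reliable, which is exactly the assumption that the more careful approaches try to relax.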

Here are the slides for the presentation. I would be happy to receive your comments and suggestions.

Talk: Human-Data Interaction

This week I attended a high-energy ISP seminar on Human-Data Interaction by Saman Amirpour. Saman is an ISP graduate student who also works with the CREATE Lab. His work-in-progress project on the Explorable Visual Analytics tool serves as a good introduction to this post:

While this may have some resemblance with other projects such as the famous Gapminder Foundation led by Hans Rosling, Saman presented a bigger picture in his talk and provided motivation for the emergence of a new field: Human-Data Interaction.

Big data is a term that gets thrown around a lot these days and probably needs no introduction. There are three parts of the big data problem, involving data collection, knowledge discovery and communication. Although we are able to collect massive amounts of data easily, the real challenge lies in using it to our advantage. Unfortunately, we do not yet have enough sophistication in our machine learning algorithms to handle this. You really can’t do without the human in the loop for making some sense of the data and asking intelligent questions. And as this Wired article points out, visualization is the key for allowing us humans to do this. But our present-day tools are not well suited for this purpose and it is difficult to handle high-dimensional data. We have a tough time intuitively understanding such data. For example, try visualizing a 4D analog of a cube in your head!

So now the relevant question that one could ask is whether human-data interaction (or HDI) is really any different from the long-existing areas of visualization and visual analytics. Saman suggests that HDI addresses much more than visualization alone. It involves answering 4 big questions on:

Steering: To help navigate the high-dimensional space. This is the main area of interest for researchers in the visualization area.

But we also need to solve problems with:

Sense-making: How can we help the users to make discoveries from the data? Sometimes, the users may not even start with the right questions in mind!
Communication: The data experts need a medium to share their models that can in turn allow others to ask new questions.
And finally, all of this needs to be done after solving the Technical challenges in building the interactive systems that support all of this.

Tools that can sufficiently address these challenges are the way to go in the future. They can truly help humans in their sense-making processes by providing them with responsive and interactive methods to not only test and validate their hypotheses but also communicate them.

Saman devoted the rest of the talk to demo some of the tools that he contributed towards and gave some examples of beautiful data visualizations. Most of them were accompanied by a lot of gasping sounds from the audience. He also presented some initial guidelines for building HDI interfaces based on these experiences.

Talk: The Signal Processing Approach to Biomarkers

A biomarker is a measurable indicator of a biological condition. Usually it is seen as a substance or a molecule introduced in the body but even physiological indicators may function as dynamic biomarkers for certain diseases. Dr. Sejdić and his team at the IMED Lab work on finding innovative ways to measure such biomarkers. During the ISP seminar last week, he presented his work on using low-cost devices with simple electronics such as accelerometers and microphones, to capture the unique patterns of physiological variables. It turns out that by analyzing these patterns, one can differentiate between healthy and pathological conditions. Building these devices requires an interdisciplinary investigation and insights from signal processing, biomedical engineering and also machine learning.

Listening to the talk, I felt that Dr. Sejdić is a researcher who is truly an engineer at heart as he described his work on building an Asperometer. It is a device that is placed on the throat of a patient to find out when they have swallowing difficulties (dysphagia). The device picks up the vibrations from the throat and does a bit of signal processing magic to identify problematic scenarios. Do you remember the little flap called the epiglottis that guards the entrance to your windpipe, from your high school biology? Well, that thing is responsible for directing the food into the oesophagus (food pipe) while eating and preventing it from going into wrong places (like the lungs!). As it moves to cover the windpipe, it produces a characteristic motion pattern that the accelerometer records. The Asperometer can then distinguish between regular and irregular patterns to find out when we should be concerned. The current gold standard for these assessments involves using some ‘Barium food’ and X-rays to visualize its movement. As you may have realized, the Asperometer is not only unobtrusive but also appears to be a safer way of doing so. There are a couple of issues left to iron out though, such as removing sources of noise in the signal due to speech or even breathing through the mouth. We can, however, still use it in controlled usage scenarios in the presence of a specialist.

The remaining part of the talk briefly dealt with Dr. Sejdić’s investigations of gait, handwriting processes and preference detection, again with the help of signal processing and some simple electronics on the body. He is building on work in biomedical engineering to study age- and disease-related changes in our bodies. The goal is to explore simple instruments providing useful information that can ultimately help to prevent, retard or reverse such diseases.