Interactive Natural Language Processing for Legal Text

· Posted in Artificial Intelligence, HCI, Machine Learning, Projects

Update: We received the best student paper award for our paper at JURIX’15!

In an earlier post, I talked about my work on Natural Language Processing in the clinical domain. The main idea behind the project is to enable domain experts to build machine learning models for analyzing text. We do this by designing usable NLP tools, so that domain experts neither have to send their datasets to machine learning experts nor have to understand the inner workings of the algorithms. The post also features a demo video of the prototype tool that we have built.

I was presenting this work at one of my program’s bi-weekly meetings when Jaromir, a fellow ISP graduate student, pointed out that such an approach could be useful for his work as well. Jaromir also holds a degree in Law and works on building AI systems for legal applications. As a result, we ended up collaborating on a project that applies the approach to statutory analysis. While the main topic of the project is the framework in which a human expert cooperates with a machine learning text classification algorithm, we also ended up augmenting our approach with a new way of capturing and re-using knowledge. In our tool, datasets and models are treated separately and are not tied together. So, if you have built a classification model for, say, statutes from the state of Alaska, you need not start from scratch when you need to analyze laws from Kansas. Re-using the existing model gives us a better starting point on all the performance measures and lets us build the new model with fewer training examples.
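To make the idea of keeping models separate from datasets a little more concrete, here is a minimal sketch of a warm start. It is not the classifier used in our tool: the `Perceptron` class, the `featurize` helper, the toy sentences, and the "relevant / not relevant" labels are all made up for illustration. The only point is that a model trained on one state's statutes can be handed over as the starting point for another state's.

```python
from collections import defaultdict


def featurize(text):
    """Bag-of-words features: lowercase tokens mapped to their counts."""
    counts = defaultdict(int)
    for token in text.lower().split():
        counts[token] += 1
    return counts


class Perceptron:
    """A tiny linear classifier whose weights are not tied to any dataset."""

    def __init__(self, weights=None):
        # Copy the weights so that a re-used model does not alter the original.
        self.weights = defaultdict(float, weights or {})

    def score(self, text):
        return sum(self.weights[tok] * n for tok, n in featurize(text).items())

    def predict(self, text):
        return 1 if self.score(text) > 0 else 0

    def update(self, text, label):
        """Apply one perceptron update; return True if the weights changed."""
        if self.predict(text) == label:
            return False
        step = 1.0 if label == 1 else -1.0
        for tok, n in featurize(text).items():
            self.weights[tok] += step * n
        return True

    def fit(self, examples, epochs=10):
        for _ in range(epochs):
            changed = [self.update(text, label) for text, label in examples]
            if not any(changed):
                break
        return self


def mistakes(model, examples):
    """Number of examples the model currently gets wrong."""
    return sum(model.predict(text) != label for text, label in examples)


# Hypothetical toy data: 1 = relevant provision, 0 = not relevant.
alaska = [
    ("the department shall report each outbreak", 1),
    ("hospitals shall notify the department of emergencies", 1),
    ("this chapter may be cited as the revenue act", 0),
    ("the fee for a fishing license is set by the board", 0),
]
kansas = [
    ("the secretary shall report each outbreak to the county", 1),
    ("the fee for a hunting license is set by the commission", 0),
]

alaska_model = Perceptron().fit(alaska)
cold_model = Perceptron()                      # starts from scratch
warm_model = Perceptron(alaska_model.weights)  # re-uses the Alaska knowledge

print("cold start mistakes on Kansas:", mistakes(cold_model, kansas))
print("warm start mistakes on Kansas:", mistakes(warm_model, kansas))

# Either model can then keep learning from newly labeled Kansas examples.
warm_model.fit(kansas)
```

On real statutes the gap would of course depend on how much vocabulary and structure the two states share; the sketch only shows where the re-used weights enter.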

The results of the cold start (Kansas) and the knowledge re-use (Alaska) experiment. In the Figure KS stands for Kansas, AK for Alaska, 1p and 2p for the first (ML model-oriented) and second (interaction-oriented) evaluation perspectives, P for precision, R for recall, F1 for F1 measure, and ROC with a number for an ROC curve of the ML classifier trained on the specified number of documents.

We will be presenting this work at JURIX’15 during the 28th year of the conference focusing on legal information systems. Previously, we had presented portions of this work at the AMIA Summit on Clinical Research Informatics and at the ACM IUI Workshop on Visual Text Analytics.

References

Jaromír Šavelka, Gaurav Trivedi, and Kevin Ashley. 2015. Applying an Interactive Machine Learning Approach to Statutory Analysis. In Proceedings of the 28th International Conference on Legal Knowledge and Information Systems (JURIX ’15). Braga, Portugal. [PDF] – Awarded the Best Student Paper (Top 0.01%).

Machines learn to play Tabla

· Posted in Artificial Intelligence, Fun, Machine Learning

If you follow machine learning topics in the news, I am sure by now you would have come across Andrej Karpathy‘s blog post on The Unreasonable Effectiveness of Recurrent Neural Networks.[1] Apart from the post itself, I have found it very fascinating to read about the diverse applications that its readers have found for it. Since then I have spent several hours hacking with different machine learning models to compose tabla rhythms:

Although Tabla does not have a standardized musical notation that is accepted by all, it does have a language based on the ‘bols’ (literally, verbalize in English) or the sounds of the strokes played on it. These ‘bols’ may be expressed in written form and, when pronounced in Indian languages, sound similar to the drums. For example, the ‘theka’ for the commonly used 16-beat cycle, Teental, is written as follows:

Dha | Dhin | Dhin | Dha | Dha | Dhin | Dhin | Dha |
Dha | Tin  | Tin  | Ta  | Ta  | Dhin | Dhin | Dha

For this task, I made use of Abhijit Patait‘s software – TaalMala, which provides a GUI environment for composing Tabla rhythms by writing them out in this language. The bols can then be synthesized to produce the sound of the drum. In his software, Abhijit extended the tabla language to make it easier for users to compose tabla rhythms: square brackets after a bol specify the number of beats within which it must be played. You can also lay more emphasis on a particular bol by adding ‘+’ symbols, which increase its intensity when synthesized to sound. Variations of standard bols can be defined as well, based on the different hand strokes used:

Dha1 = Na + First Closed then Open Ge

Now that we are armed with this background knowledge, it is easy to see how we may attempt to learn tabla like a standard Natural Language Processing language model. Predictive modeling of tabla has been previously explored in "N-gram modeling of tabla sequences using variable-length hidden Markov models for improvisation and composition" (Avinash Sastry, 2011). However, I was not able to access the datasets used in the study and had to rely on the compositions that came with the TaalMala software.[2] This is a much smaller database than what you would otherwise use to train a neural network: it comprises 207 rhythms with 6,840 bols in all. I trained a char-rnn and sampled some compositions after priming it with different seed text such as “Dha”, “Na” etc. Given below is a minute-long composition sampled from my network. We can see that not only has the network learned the TaalMala notation, but it has also picked up some common phrases used in compositions, such as the occurrence of the phrase “TiRa KiTa”, repetitions of “Tun Na” etc.:

Ti [0.50] | Ra | Ki | Te | Dha [0.50] | Ti [0.25] | Ra | Ki
| Ta | Tun [0.50] | Na | Dhin | Na 
| Tun | Na | Tun | Na | Dha | Dhet | Dha | Dhet | Dha | Dha
| Tun | Na | Dha | Tun | Na | Ti | Na | Dha | Ti | Te | Ki |
Ti | Dha [0.50] | Ti [0.25] | Ra | Ki | Te | Dhin [0.50] |
Dhin | Dhin | Dha | Ge | Ne | Dha | Dha | Tun | Na | Ti
[0.25] | Ra | Ki | Ta | Dha [0.50] | Ti [0.25] | Ra | Ki |
Te | Dha [1.00] | Ti | Dha | Ti [0.25] | Ra | Ki | Te | Dha
[0.50] | Dhet | Dhin | Dha | Tun | Na | Ti [0.25] | Ra | Ki
| Ta | Dha [0.50] | Ti [0.25] | Ra | Ki | Te | Ti | Ka | Tra
[0.50] | Ti | Ti | Te | Na [0.50] | Ki [0.50] | Dhin [0.13]
| Ta | Ti [0.25] | Ra | Ki | Te | Tra | Ka | Ti [0.25] | Ra
| Ki | Te | Dhin [0.50] | Na [0.25] | Ti [0.25] | Ra | Ki |
Te | Tra | Ka | Dha [0.34] | Ti [0.25] | Ra | Ki | Ta | Tra
| Ka | Tra [0.50] | Ki [0.50] | Tun [0.50] | Dha [0.50] | Ti
[0.25] | Ra | Ki | Ta | Tra | Ka | Ta | Te | Ti | Ta | Kat |
Ti | Dha | Ge | Na | Dha | Ti [0.25] | Ra | Ki | Te | Dha
[0.50] | Dhin | Dhin | Dhin | Dha | Tun | Na | Ti | Na | Ki
| Ta | Dha [0.50] | Dha | Ti [0.50] | Ra | Ki | Te | Tun
[0.50] | Tra [0.25] | Ti [0.25] | Ra | Ki | Te | Tun | Ka |
Ti [0.25] | Ra | Ki | Te | Dha [0.50] | Ki [0.25] | Ti | Dha
| Ti | Ta | Dha | Ti | Dha [0.50] | Ti | Na | Dha | Ti
[0.25] | Ra | Ki | Te | Dhin [0.50] | Na | Ti [0.25] | Ra |
Ki | Te | Tra | Ka | Dha [0.50] | Ti [0.50] | Ra | Ki | Te |
Tun [0.50] | Na | Ki [0.25] | Te | Dha | Ki | Dha [0.50] |
Ti [0.25] | Ra | Ki | Te | Dha [0.50] | Ti [0.25] | Ra | Ki
| Te | Dha [0.50] | Tun | Ti [0.25] | Ra | Ki | Te | Dhin
[0.50] | Na | Ti [0.25] | Te | Dha | Ki [0.25] | Te | Ki |
Te | Dhin [0.50] | Dhin | Dhin | Dhin | Dha | Dha | Tun | Na
| Na | Na | Ti [0.25] | Ra | Ki | Ta | Ta | Ka | Dhe [0.50]
| Ti [0.25] | Ra | Ki | Te | Ti | Re | Ki | Te | Dha [0.50]
| Ti | Dha | Ge | Na | Dha | Ti [0.25] | Ra | Ki | Te | Ti |
Te | Ti | Te | Ti | Te | Dha [0.50] | Ti [0.25] | Te | Ra |
Ki | Te | Dha [0.50] | Ki | Te | Dha | Ti [0.25]
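One way to check which phrases actually recur in a sample like the one above is to parse the notation and count short runs of bols. The sketch below is my own reading of the notation as it appears in this post (a bol name, optionally followed by a bracketed beat count, with ‘|’ as the separator); the function names are made up, and it does not handle the ‘+’ emphasis marks or anything else TaalMala may support.

```python
import re
from collections import Counter

# A token is a bol name (letters, optionally a trailing digit as in "Dha1")
# followed by an optional bracketed beat count such as "[0.25]".
TOKEN = re.compile(r"([A-Za-z]+\d*)\s*(?:\[(\d+(?:\.\d+)?)\])?")


def parse_bols(text):
    """Split '|'-separated notation into (bol, beats) pairs.

    beats is None when no bracket follows the bol.
    """
    pairs = []
    for chunk in text.split("|"):
        chunk = " ".join(chunk.split())  # collapse line breaks and padding
        if not chunk:
            continue
        match = TOKEN.fullmatch(chunk)
        if match is None:
            raise ValueError(f"unrecognized token: {chunk!r}")
        bol, beats = match.groups()
        pairs.append((bol, float(beats) if beats else None))
    return pairs


def ngram_counts(bols, n):
    """Count every run of n consecutive bol names."""
    names = [bol for bol, _ in bols]
    return Counter(tuple(names[i:i + n]) for i in range(len(names) - n + 1))


# The first three lines of the sampled composition above.
sample = """Ti [0.50] | Ra | Ki | Te | Dha [0.50] | Ti [0.25] | Ra | Ki
| Ta | Tun [0.50] | Na | Dhin | Na
| Tun | Na | Tun | Na | Dha | Dhet | Dha | Dhet | Dha | Dha"""

bols = parse_bols(sample)
for phrase, count in ngram_counts(bols, 2).most_common(3):
    print(" ".join(phrase), count)
```

Running the same counts over the training compositions and over the sampled ones would give a rough sense of whether the network is reproducing the phrases it saw or inventing new ones.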

Here’s a loop that I synthesized by pasting a sampled composition 4 times, one after another:

Of course, I also tried training n-gram models with various smoothing methods using the SRILM toolkit. Adding spaces between letters is a quick hack that lets you train character-level models using existing word-level toolkits. Which one produces better compositions? I can’t tell for now, but I am trying to collect more data and hope to add updates to this post as and when I find time to work on it. I am not confident that simple perplexity scores are sufficient to judge the differences between two models, especially when it comes to the rhythmic quality of the compositions. There are many ways in which one can extend this work. For one, there is the possibility of training on different kinds of compositions (kaidas, relas, laggis etc.), on different rhythm cycles, and also on compositions from different gharanas. All of this would require collecting a bigger composition database.
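The spacing hack mentioned above can be sketched in a few lines. This is only an illustration with made-up function names, and it assumes that the marker used for real spaces (an underscore here) never appears in the compositions themselves.

```python
def to_char_tokens(line, space_marker="_"):
    """Rewrite a line so that every character becomes its own 'word'.

    Real spaces are replaced by space_marker so they survive tokenization.
    """
    return " ".join(space_marker if ch == " " else ch for ch in line.strip())


def from_char_tokens(line, space_marker="_"):
    """Invert to_char_tokens, e.g. to read text sampled from the model."""
    return "".join(" " if tok == space_marker else tok for tok in line.split(" "))


original = "Dha | Dhin"
spaced = to_char_tokens(original)
print(spaced)
print(from_char_tokens(spaced))
```

A file converted this way can be fed to a word-level n-gram toolkit unchanged, since each "word" it sees is a single character of the original notation.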

And then there is scope for allowing humans to interactively edit compositions at places where the AI goes wrong, while using the samples generated by it as an infinite source of inspiration.

Finally, here’s a link to the work in progress playlist of the rhythms I have sampled till now.

References

  1. Avinash Sastry (2011), N-gram modeling of tabla sequences using variable-length hidden Markov models for improvisation and composition. Available: https://smartech.gatech.edu/bitstream/handle/1853/42792/sastry_avinash_201112_mast.pdf?sequence=1.

Footnotes

  1. If you encountered a lot of new topics in this post, you may find this post on Understanding natural language using deep neural networks and the series of videos on Deep NN by Quoc Le helpful.
  2. On the other hand, Avinash Sastry‘s work uses a more elaborate Humdrum notation for writing tabla compositions, which is not as easy for tabla players to comprehend.


Bike ride from Pittsburgh to DC

· Posted in Fun, Opinion

This week I did a 335 mi (540 km) bicycle tour from Pittsburgh to Washington DC along with a group of 3 other folks from the school. This is the longest I have ever biked, and we covered the distance over a period of 5 days. The entire trip is divided into two trails: the 150-mile Great Allegheny Passage (GAP) from Pittsburgh to Cumberland, followed by the 184.5-mile Chesapeake and Ohio Canal (C&O Canal) Towpath.

We carried camping equipment on our bikes and enjoyed a lot of flexibility in deciding where to stay each night, although we roughly followed the original plan that our group agreed upon before starting the trip. We biked for 8-12 hours during the day and stayed overnight at each of the following cities:

Day  City                       Cumulative miles  Daily miles  Elevation (ft)
0    Pittsburgh, PA             0                 0            720
1    Ohiopyle, PA               77                77           1,230
2    Frostburg, MD              134               57           1,832
3    Little Orleans, MD         193               59           450
4    Harpers Ferry, WV          273               80           264
5    Georgetown, Washington DC  335               62           10

Mile 0 of the GAP trail. The C&O trail begins from here onwards.

If there’s one change I could make in this schedule, it would be to avoid staying over at Harpers Ferry, which involved climbing a footbridge without any ramp for the bikes. It is even more difficult if you are carrying a lot of weight on your bike racks. On the positive side, it allowed us to experience the main streets of Harpers Ferry, which is rightly called “a place in time”. Another tip that you could use is to take the Western Maryland Trail near Hancock. It runs parallel to the route and is paved, which provides a welcome break after long hours of riding on the C&O trail.

There are lots of campsites near the trail. The hiker-biker camps near most major towns on the C&O trail are free to use. We also camped at commercial campgrounds, like the Trail Inn Campground in Frostburg, where we could use a shower. You can also get your laundry done at these places and save some luggage space. For food and drinks, I suggest that you follow the general long-distance biking guidelines about eating at regular intervals while on the bike. I also strongly recommend using a hydration backpack, though it adds to the weight you have to carry on your shoulders.

Here's a picture of our bikes with our panniers and the camping equipment.

I used a hybrid bike, a Raleigh Misceo, and was very comfortable riding it through all parts of the trail. I was expecting a couple of flat tires, especially on the C&O sections with loose gravel and other debris on the trail, but didn’t face any problems. As long as you are not using a road bike with narrow tires, you should be good on these trails. Finally, to get back to Pittsburgh, we rented a minivan and put our bikes in the trunk, which had ample space for 4 bikes with their front wheels taken off.

If you decide to take this tour in the future, there are plenty of online guides available for both the GAP and the C&O Canal trails. For a paper-based guide, I would recommend buying the Trailbook published by the Allegheny Trail Alliance. We also created a small webapp called the GAP Map that helped us plan our trip and prepare a schedule.

Here are some of the scenic views along the tour as captured from my phone camera:

Monongahela River

View of the Monongahela river.

McKeesport

A short stop near Buena Vista.

Cumberland

Along the trail near Cumberland.

East Continental Divide

Elevation Chart marking the good news for us at the East Continental Divide.

C&O Trail Bridge

One of many bridges on the C&O Trail.

C&O Canal Bike Path

Bike path on the C&O Canal trail. It also has several lock houses along the way which have been renovated and can be used for overnight stay.

Harpers Ferry

Shops in Harpers Ferry.

C&O Canal

A section of the C&O Canal that once ferried goods between Washington DC and Cumberland.


Mathematics, Tabla and the Arts

· Posted in Opinion

Spring break is here and I finally have ample time to practice my tabla. In the absence of a regular schedule and a teacher, I rely on online videos to improve my skills. Following my YouTube recommendations, I came across this talk given by Manjul Bhargava to a group of school children in Bangalore. Not many of you may know that Dr. Bhargava is not only the 2014 Fields Medal winner but also an accomplished tabla player who has studied under one of the greatest tabla players of our times – Zakir Hussain.

I thought I should post this on my blog for it is certainly the kind of talk that I would have cherished as a kid attending it. Also, I really liked the way he simplified and explained a reasonably difficult concept to his audience. I am sure it would have made a lot of minds curious about the topic:

If you found this interesting, you can find a nice tutorial on it with the title Mathematics for Poets and Drummers by Dr. Rachel Hall (also has an extended version that I haven’t been through yet). Also if this talk inspired you to pick up tabla, I found this very useful series of videos on a YouTube channel by Tej Singh for beginning and intermediate tabla players.


Clinical Text Analysis Using Interactive Natural Language Processing

· Posted in HCI, Machine Learning, Projects

I am working on a project to support the use of Natural Language Processing in the clinical domain. Modern NLP systems often make use of machine learning techniques. However, physicians and other clinicians who are interested in analyzing clinical records may be unfamiliar with these methods. Our project aims to enable such domain experts to make use of Natural Language Processing through a point-and-click interface. It combines novel text visualizations to help its users make sense of NLP results, revise models and understand changes between revisions. It allows them to make any necessary corrections to computed results, thus forming a feedback loop and helping improve the accuracy of the models.

Here’s the walk-through video of the prototype tool that we have built:

At this point we are redesigning some portions of our tool based on feedback from a formative user study with physicians and clinical researchers. Our next step would be to conduct an empirical evaluation of the tool to test our hypotheses about its design goals.

We will be presenting a demo of our tool at the AMIA Summit on Clinical Research Informatics and also at the ACM IUI Workshop on Visual Text Analytics in March.

References

  1. Gaurav Trivedi. 2015. Clinical Text Analysis Using Interactive Natural Language Processing. In Proceedings of the 20th International Conference on Intelligent User Interfaces Companion (IUI Companion ’15). ACM, New York, NY, USA, 113-116. DOI 10.1145/2732158.2732162 [Presentation] [PDF]
  2. Gaurav Trivedi, Phuong Pham, Wendy Chapman, Rebecca Hwa, Janyce Wiebe, Harry Hochheiser. 2015. An Interactive Tool for Natural Language Processing on Clinical Text. Presented at 4th Workshop on Visual Text Analytics (IUI TextVis 2015), Atlanta. http://vialab.science.uoit.ca/textvis2015/ [PDF]
  3. Gaurav Trivedi, Phuong Pham, Wendy Chapman, Rebecca Hwa, Janyce Wiebe, and Harry Hochheiser. 2015. Bridging the Natural Language Processing Gap: An Interactive Clinical Text Review Tool. Poster presented at the 2015 AMIA Summit on Clinical Research Informatics (CRI 2015). San Francisco. March 2015. [Poster][Abstract]


Learning from multiple annotators

· Posted in HCI, ISP, Machine Learning

I recently prepared a deck of slides for my machine learning course. In the presentation, I talk about some of the recently proposed methods on learning from multiple annotators. In these methods we do not assume the labels that we get from the annotators to be the ground truth, as we do in traditional machine learning, but try to find “truth” from noisy data.

There are two main directions of work in this area. One focuses on finding the consensus labels first and then doing traditional learning, while the other learns a consensus model directly. In the second approach, we may estimate the consensus labels during the process of building the classifier itself.
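As a rough illustration of the first direction, the simplest way to get consensus labels is a majority vote over the annotators, after which any standard learner can be trained on the result. The sketch below is only that baseline, not one of the methods covered in the slides; the function names, annotator names and labels are made up.

```python
from collections import Counter


def consensus_label(votes):
    """Majority label among the votes, or None when the top labels tie."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def consensus_labels(annotations):
    """annotations maps item -> {annotator: label}; returns item -> consensus."""
    return {item: consensus_label(list(by_annotator.values()))
            for item, by_annotator in annotations.items()}


def agreement_with_consensus(annotations, consensus):
    """Fraction of decided items on which each annotator matches the consensus."""
    hits, totals = Counter(), Counter()
    for item, by_annotator in annotations.items():
        if consensus[item] is None:
            continue  # skip items where the vote was tied
        for annotator, label in by_annotator.items():
            totals[annotator] += 1
            hits[annotator] += int(label == consensus[item])
    return {annotator: hits[annotator] / totals[annotator] for annotator in totals}


# Hypothetical binary labels from three annotators.
annotations = {
    "doc1": {"ann_a": 1, "ann_b": 1, "ann_c": 0},
    "doc2": {"ann_a": 0, "ann_b": 0, "ann_c": 0},
    "doc3": {"ann_a": 1, "ann_b": 0, "ann_c": 0},
    "doc4": {"ann_a": 1, "ann_b": 0},
}

consensus = consensus_labels(annotations)
print(consensus)
print(agreement_with_consensus(annotations, consensus))
```

The agreement scores hint at why the second direction is attractive: once you start estimating how reliable each annotator is, it becomes natural to estimate the labels and the classifier together rather than fixing the labels up front.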

Here are the slides for the presentation. I would be happy to receive your comments and suggestions.


Macbook 2011 Problem

· Posted in Opinion

Update: Apple finally owned up to the problem and is offering free repairs starting Feb 20, 2015.

As everyone was starting to queue up for the sale of iPhone 6 today, I was trying to get my MacBook working again at the Genius Bar. I had been facing problems with a garbled display for the last couple of days, which then turned into a critical problem that left my computer unable to start up. It would show the Apple logo and the loading animation but then get stuck with a gray screen.

I am not sure how everyone’s experience at the Apple store has been, but I have always had a terrible time explaining any of my problems to the “geniuses” out there. It takes a lot of patience to listen to the condescending way they talk to you (“You seem to be having a problem with the logic board – you know, the brain of the computer”). I don’t know if they are trained to talk like that to anyone not wearing a blue t-shirt or if it is their everyday experiences that make them act like that.

Turns out that the early 2011 version of the MacBooks with ATI graphics cards came with a manufacturing defect that Apple is refusing to own up to. There are several threads on popular sites and on Apple’s discussion forums that talk about it. In fact, a class action lawsuit has also been filed against Apple over this issue. You can also support the change.org petition and sign your name along with 15,000 supporters there:

In case you have been facing the same issue and are looking to buy some more time to take a final backup and move your stuff, moving the graphics driver to a temporary directory helped me with that. Here‘s the StackExchange answer that you could make use of. And if you own a MacBook Pro from 2011 and haven’t faced this problem yet, please do bookmark the link, for you will eventually need it when you do any graphics-intensive work.

 

Edits:

1. Zach Clawson has compiled a list of actions that you could possibly take if you are facing a similar problem – https://people.cam.cornell.edu/~zc227/extras/early2011mbp_graphics.html.

2. I had written this post as a rant after a rather disappointing visit to the Apple Store. I have edited some portions since then to provide a more objective view.

3. Added bug report screenshots:

The bug report I filed, and the update marking it as a duplicate.

 

I have filed a bug report for this issue as well to make sure that it is in their system (which they later updated to be a duplicate of another issue – see picture).