Making makefiles for your research code

Edit: This post needs a refresh using modern methods, namely Docker and Kubernetes. I hope to find some time to write a post on them one of these days…

There has been a lot of discussion lately about reproducibility in computer science [1]. It is a bit disappointing to learn that a lot of the research described in recent papers is not reproducible, even though the only equipment needed to conduct a good part of these experiments is something you already have access to. In the study described in the paper here, only about half of the projects could even be built to begin with, and that is before we even talk about reproducing the same results. So why is it so hard for people who are the leaders in the field to publish code that can be easily compiled? There could be a lot of factors, like lack of incentives, time constraints, maintenance costs, etc., that have already been put forward by the folks out there, so I won't really go into that. This post is about my experiences with building research code. And I have had my own moments of struggle with it now and then!

One is always concerned about not wasting too much effort on seemingly unproductive tasks while working on a research project. But spending time on preparing proper build scripts could in fact be more efficient when you need to send code to your advisor, collaborate with others, or publish it… Plus, it makes things easier for your own use. This becomes much more important for research code, which often has several different pieces delicately stitched together just to show the specific point that you are trying to make in your project. There's no way even you would remember how it all worked once you are done with it.

In one of my current projects I have code from four different sources, all written in different programming languages. My code (like most other projects) builds upon a couple of projects and work done by other people. It certainly doesn't have to be structured the way it is right now, and that structure would hardly be acceptable for anything other than a research project. But it seemed quite logical to reuse the existing code as much as possible to save time and effort. This, however, makes it very hard to compile and run the project when you don't remember the steps involved in between.

One way to tackle this problem is to write an elaborate readme file. You could even use simple markdown tags to format it nicely. But that is not quite elegant enough as a substitute for build scripts: you'd never know how many "not-so-obvious" steps you skipped while documenting them. Besides, it wouldn't be as simple as running a single build command to try out the cool thing that you made. A readme, on the other hand, should carry other important stuff like a short introduction to the code, how to use it, and a description of the "two"-step build process that you chose for it.

Luckily this is not a new problem, and generations of programmers have provided us with excellent tools for getting around it. They offer a mechanism to document your readme steps in a very systematic way. And there’s no reason you shouldn’t use them!

One such program that you may already know about is make. Here's a short and sweet introduction to make by @mattmight. Allow me to take this a little further to demonstrate why these tools are indeed so useful. Let's start with something simple. A very basic makefile could read something like:
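(A minimal sketch, assuming a single C source file called hello.c just for illustration:)

    # note: the command lines under each target must be indented with a tab character
    CC = gcc

    # rebuild the `hello` executable whenever hello.c changes
    hello: hello.c
        $(CC) -o hello hello.c

    # `make clean` removes the build output
    clean:
        rm -f hello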

But its advantages become clearer when you have to handle a more complicated scenario. So let's cook up an example for that; say, I'd like to convert some Python code to C (don't ask why!) using Cython and then create an executable by compiling the converted C code. Here's how I'd probably write a makefile for it:
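(A sketch with made-up file names: fib.py gets converted to fib.c with Cython's --embed option, and the result is compiled against a hard-coded Python 2.7 install; the include path and library name here are just what a typical Linux setup would look like:)

    # fib: the final executable, built from the Cython-generated C file
    # (the Python version and include path are hard-coded for now)
    fib: fib.c
        gcc -I /usr/include/python2.7 -o fib fib.c -lpython2.7

    # fib.c: generated from fib.py; --embed adds a main() so we can build an executable
    fib.c: fib.py
        cython --embed fib.py -o fib.c

    clean:
        rm -f fib fib.c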

Now the wins are quite obvious. It saves you from remembering such a long build command and also documents the steps you need to follow for building the code. But we still have a couple of issues left if you were to distribute your code. You'd notice that I have hard-coded my Python version as well as the paths to the include directories in my makefile. Running this on a different computer would certainly cause problems. One way to handle this is to declare all the variables at the beginning of your makefile:
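(Continuing the same hypothetical example:)

    # everything machine-specific lives up here, in one place
    PYTHON_VERSION = 2.7
    PYTHON_INCLUDE = /usr/include/python$(PYTHON_VERSION)
    PYTHON_LIB     = python$(PYTHON_VERSION)
    CYTHON         = cython

    fib: fib.c
        gcc -I $(PYTHON_INCLUDE) -o fib fib.c -l$(PYTHON_LIB)

    fib.c: fib.py
        $(CYTHON) --embed fib.py -o fib.c

    clean:
        rm -f fib fib.c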

This makes it quite easy for the poor souls using your code to edit the variables according to their configurations. All of the things to change are conveniently located at the top. But wouldn't it be nice if you could save them from all of this manual labor of finding the right paths for linking libraries, versions of the software installed, etc. as well? Reading and understanding your code is already hard enough :D. A shell script that figures these out could be quite useful, no?

The awesome people behind GNU Autotools have already done the hard work and have given us a bunch of tools to do exactly what we need here. These tools include libtool, automake and autoconf to help you create and configure your makefiles.

To write a configure script, you'd first need a configure.ac file. This is used by the autoconf tool to generate a script that fills in the variables in the makefile. Using these tools will make sure that all of your projects have a consistent two-step build process: anyone wanting to run your code simply runs the configure script followed by make to build your project. No manual tweaking of variables is required during these steps.

There are a couple of other helper tools that offer you the luxury of using macros that further cut down your work in writing these files. Let us continue with our Cython example here.

With just two statements in my configure.ac, I'd be able to generate a configure script that fills in my makefile variables:
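(A minimal sketch for the hypothetical fib project; the configure script generated from this will produce Makefile from a Makefile.in template:)

    AC_INIT([fib], [0.1])
    AC_OUTPUT([Makefile])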

And to tell it what to fill in, I'll add some placeholder text in my makefile and call it Makefile.in:
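(Again a sketch; the @...@ placeholders are what configure substitutes, and the PYTHON_* names assume the python-devel macro used a bit further below:)

    # values between @...@ get filled in by the configure script
    PYTHON_CPPFLAGS = @PYTHON_CPPFLAGS@
    PYTHON_LDFLAGS  = @PYTHON_LDFLAGS@
    CYTHON          = @CYTHON@

    fib: fib.c
        gcc $(PYTHON_CPPFLAGS) -o fib fib.c $(PYTHON_LDFLAGS)

    fib.c: fib.py
        $(CYTHON) --embed fib.py -o fib.c

    clean:
        rm -f fib fib.c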

At this point I can run autoconf to generate the configure script that would do all the work of figuring out and filling in the variables.
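Putting it all together, building the project boils down to something like:

    $ autoconf       # turn configure.ac into a configure script (you run this once, before distributing)
    $ ./configure    # fill in Makefile from Makefile.in on the user's machine
    $ make           # build the project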

I can even code my own checks here, so let's add a couple. With my configure script, I'd like to not only assign the paths for linking Python libraries but also check whether the user has all the prerequisites installed on the system to be able to compile the code. You have the option to prompt the user to install the missing pieces, or even start an installation for them. We'll stop at just printing a message asking the user to do the needful. So let's go back to our configure.ac.
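(A sketch of what those checks might look like. Note that AC_PYTHON_DEVEL is not part of core autoconf; it comes from the Autoconf Archive and has to be made available to your project, for example by copying the macro file in:)

    AC_INIT([fib], [0.1])

    # make sure cython is on the user's PATH, and stop with a message if it isn't
    AC_PATH_PROG([CYTHON], [cython], [no])
    if test "x$CYTHON" = "xno"; then
        AC_MSG_ERROR([Cython was not found. Please install Cython and re-run configure.])
    fi

    # check for the Python headers and libraries, version 2.5 or newer
    # (AC_PYTHON_DEVEL comes from the Autoconf Archive, not core autoconf)
    AC_PYTHON_DEVEL([>= '2.5'])

    AC_OUTPUT([Makefile])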

Here I have added some code to check if Cython is available on the user's machine. Note that with the AC_PYTHON_DEVEL macro, I am also making sure that the Python installed on the user's machine is newer than version 2.5. You can add more checks here depending on what else is needed for your code to build and run. The best part is that a lot of macros are already available, so you don't have to write them from scratch.

There's more stuff that you could explore here: alternatives like CMake provide a more cross-platform approach to managing your build process and also have GUIs for these steps. A couple of other tools that can handle the configuration portion, such as pkg-config, exist as well, but they may not come pre-installed on most operating systems, unlike make. There are a few language-specific project managers that you could also consider (like Rake for Ruby). If you are dealing with a Java project, then Ant or Maven are also good candidates; IDEs such as NetBeans create configuration files for them automatically. There are also a lot of newer (relatively speaking) projects out there that let you easily package code involving web applications (more on this here) and make them ready for deployment on other machines.

Footnotes

  1. You might also be interested in this article, written in response to the article raising questions about reproducibility.

Presentations on the Cloud

[Screenshot: A presentation on Google Drive, with an old Microsoft Office-y feel.]

Like many of you, I have been using Google Docs (or Google Drive) for a long time. It works just fine when you need to work with a group and have several members contributing to a project. In fact, it is the only application that I use for collaborating on documents and spreadsheets. You sometimes wonder how we even managed back when it wasn't possible to edit your documents online.

But when it comes to making presentations online, I haven't been able to find a very usable solution. I have never found Google's interface good enough; it takes some effort and time to get used to so many toolbars inside a browser.

[Screenshot: Keynote on iCloud, with a super easy interface.]

While I don't really "create" new presentations in the cloud, I do tend to edit them quite often and make a lot of changes before presenting. I would recall points that I should (or shouldn't 🙂 ) have included at times when I didn't have access to my computer, or when I was on my office computer, which runs a different operating system, or, even worse, on mobile.

Keynote on iCloud offers something that seems just right for my needs. It has a super easy-to-use interface that looks very familiar across devices and has all the features that I frequently use. It is so much more convenient to revise presentations with it. You can seamlessly convert and download your presentations in the format of your choice when you are done. Or, if you don't depend on the presenter view a lot, you can also play the presentation right from the browser.

I must admit that I am an Apple fan-boy when talking about user interfaces. iCloud not only offers the same desktop-like interface across all devices but presents all of that with very neat designs. Take a look at the home page for iCloud, for example:

[Screenshot: The iCloud home page.]

iCloud has many more things to offer with just as stunning interfaces. I haven't explored the other available apps since I haven't really found many use cases for them. For mail, calendar and contacts I still prefer to use good old Google with its familiar power-user functions.

URI for me!

A Google search for my name yields more than 518,000 results. Nah, I am not popular (I wish!), but it turns out that there are a lot of "Gaurav Trivedi"s in this world. Yes, with the same first name and the same last name. A search on Facebook will give you results along with their pictures as well. So I do share my name with loads of "real" people. For the first time I wished that my parents had given me a middle name; it would have been easier to stand out.

[Screenshot: The omniscient Google has sympathies for me!]

Fortunately, I have had a systematic strategy of using trivedigaurav as my identifier online (you have been notified now!), for example on this site, www.trivedigaurav.com, and on Twitter (@trivedigaurav). Now that I know so many others with the same name exist, I have come to realise that I've been quite lucky to have that ID available to me, especially on popular sites. I am proud of the me from 5 years ago who could think ahead 😉

But as an aspiring researcher, is this something that I should be concerned about? Would it be a good idea to have a pen name now that I'll be starting to author more academic writing? Here's a question on StackExchange that deals with the same problem.


Update 8/24/17:

I got myself an ORCID iD: orcid.org/0000-0001-8472-2139, but haven't really made use of it yet!

Talk: Intelligent Tutoring Systems

Starting this week, I am adding a new feature on the blog. Every week I’ll be posting something about a talk or a colloquium that I attend. Serves as good talk notes, a writing practice and an assignment all in one full scoop? You bet it does!

The program that I am pursuing, the Intelligent Systems Program, provides a collaborative atmosphere for both students and faculty by giving them regular opportunities to present their research. It not only helps them gather feedback from others but also introduces their work to the new members of the program (like me!). As part of these efforts, we have a series of talks called the ISP Colloquium Series.

For the first set of talks in the ISP Colloquium Series this semester, we had Mohammad Falakmasir and Roya Hosseini present two of their award-winning papers, both on intelligent tutoring systems.

1. A Spectral Learning Approach to Knowledge Tracing by Mohammad Falakmasir

To develop intelligent tutoring systems that adapt to the student's requirements, one needs a way to determine the student's knowledge of the skills being taught. This is commonly done by modeling the student's knowledge with a small set of parameters. After estimating these parameters from sequences of students' responses to quiz questions, one can predict how students will perform on future questions. This information can then be used to adapt the tutor to keep a pace that students are comfortable with. The paper proposes the use of a spectral learning [1] algorithm over techniques such as Expectation Maximization (EM) to estimate the parameters that model knowledge. EM is known to be a time-consuming algorithm, and the results of this paper show that similar or higher prediction accuracy can be achieved while significantly improving the knowledge tracing time.

To design experiments with this new method, Mohammad and his co-authors analyzed data collected using a software tutor. This tool had been used in an introductory programming class at Pitt for over 9 semesters. They could then compare the performance of their new method against EM-based learning of the parameters, using both prediction accuracy and root mean squared error as metrics. They trained on data from the first semester and tested against the second semester, then trained on the first two semesters and predicted the results for the third one, and so on. This allowed them to back their results, which show a time improvement by a factor of 30(!), with a robust statistical analysis.

2. KnowledgeZoom for Java: A Concept-Based Exam Study Tool with a Zoomable Open Student Model by Roya Hosseini

Roya talked about open student modeling, as opposed to a hidden model, for representing students' skills and knowledge. In her paper, she goes on to propose that a visual presentation of this model could be helpful during exam preparation: using it, one could quickly review the entire syllabus and identify the topics that need more work. I find it to be a very interesting concept and, again, something that I would personally like to use.

The authors designed a software tutor called KnowledgeZoom (KZ) that can be used as an exam preparation tool for Java classes. It is based on a concept-level model of knowledge about Java and object-oriented programming. Each question is indexed with these concepts: it specifies the prerequisite concepts needed to answer it, as well as the outcome concepts that could be mastered by working on that particular question. The students are provided with a zoomable tree explorer that visually presents this information; each node is drawn with a size and color that indicate the importance of the concept and the student's knowledge in that area, respectively. Another component of the tool provides students with a set of questions and adaptively recommends new ones: based on the information from the ontology and the indexing of the questions discussed above, it can calculate how prepared a student is to attempt a particular question.

The method was evaluated in a classroom study where students could use multiple tools (including KZ) to answer Java questions, and the features that KZ introduces were statistically compared against the other tools. The results demonstrated that KZ helped students reach their goals faster in moving from easy to harder questions. I was impressed by the fact that, on top of these results, the authors decided to back it up with a subjective analysis by the students: students preferred KZ over the others by a great margin, and the authors also received valuable feedback from them during this analysis.

While these tutors can currently support only concept-based subjects like programming and math, where one can get by with objective-style test questions, the fact that we can intelligently adapt to a student's pace of learning is something that is really promising. I wish I could use some of these tools in my own courses!

Footnotes

  1. You can find out more about spectral learning algorithms here: http://www.cs.cmu.edu/~ggordon/spectral-learning/.

Further Reading

  1. M. H. Falakmasir, Z. A. Pardos, G. J. Gordon, and P. Brusilovsky, “A Spectral Learning Approach to Knowledge Tracing”, In Proceedings of the 6th International Conference on Educational Data Mining (EDM 2013), Memphis, TN, July 2013. Available: http://people.cs.pitt.edu/~falakmasir/images/EDMPaper2013.pdf
  2. P. Brusilovsky, D. Baishya, R. Hosseini, J. Guerra, and M. Liang, “KnowledgeZoom for Java: A Concept-Based Exam Study Tool with a Zoomable Open Student Model”, In Proceedings of ICALT 2013, Beijing, China. Available: http://people.cs.pitt.edu/~hosseini/papers/kz.pdf

My First Post

Lately I have found myself reading a lot about academic blogging. There is no dearth of articles that aggressively advertise blogging by academics and its benefits, such as the ones here, here and here. Evidently, I have been able to convince myself to start a new blog (and hence this post). Along the way I have also gained some insight into the drawbacks, but the positives seem to overwhelmingly outweigh the reasons for not blogging.

If you have been through my about me page, you'd know that I am a first-year graduate student. In fact, I'll be starting my graduate studies this week, and it would be nice to try my hand at blogging at the beginning of my grad school journey. I have a couple of my own reasons for taking up this project. Allow me to discuss some of them, and a bit on how I plan on taking this blog further.

[Screenshot: Composing my first blog post!]
  1. Blogging as a writing exercise
    Learning to write for a wider audience is one important skill that can be developed by blogging. It may not be easy for you to read through my initial posts, but I hope to improve their quality over time. It would be a good plan to cut one's dependency on advisors, course instructors and co-authors for improving the quality of one's writing. In order to succeed, researchers ought to be able to communicate their ideas well.
  2. Blogging for fun
    Coming up with ideas for a blog post, and the planning process itself, can actually be taken up as a recreational activity. Blogging, being so much more flexible than formal academic writing, leaves a lot of scope for creativity. For some of the academics who blog, acting upon the crazy bursts of inspiration during the process has been a source of new ideas for their own research work. But here's a caveat, and it exists for the very same reason that blogging is useful: one could risk spending a little too much time and energy blogging, getting lost in one's train of thought and procrastinating on more pressing matters (deadlines!). That's something that I'd like to keep at the back of my head.
  3. Blogging for record keeping
    A searchable database containing all my ideas and reviews of other projects could potentially serve as a very useful resource down the line. WordPress offers a variety of ways to post from mobile devices, by email, etc., so it is never too difficult to quickly write about anything worth noting. While there are tools with which I could do this privately (more on this later), blog posts can encourage other readers to pitch in with their own ideas and work in a collaborative way. What better way to explore social computing than participating in it!

I hope this also motivates some more people to start blogging. It would be exciting to see where this leads. And if you have made it till here, thanks for reading my first post!