Edit: This post needs a refresh using modern methods namely, Docker and Kubernetes. I hope to find some time to write a post on them one of these days…
There has been a lot of discussion lately about Reproducibility in computer science  . It is a bit disappointing to know that a lot of the research described in recent papers is not reproducible. This is despite that only equipment needed to conduct a good part of these experiments is something that you already have access to. In this study described in the paper here, only about half of the projects could be even built to begin with. And we are not talking about reproducing the same results as yet. So why is it so hard for people who are the leaders in the field to publish code that can be easily compiled? There could be a lot of factors like – lack of incentives, time constraints, maintenance costs etc. that have already been put forward by the folks out there, so I wouldn’t really go into that. This post is about my experiences with building research code. And, I have had my own moments of struggle with them now and then!
One is always concerned about not wasting too much effort on seemingly unproductive tasks while working on a research project. But spending time on preparing proper build scripts could in fact be more efficient when you need to send code to your advisor, collaborate with others, publish it … Plus it makes things easy for your own use. This becomes much more important for research code which often has several different pieces delicately stitched together, just to show the specific point that you are trying to make in your project. There’s no way even you would remember how it worked once you are done with it.
In one of my current projects I have some code from four different sources, all written in different programming languages. My code (like most other projects) builds upon a couple of projects and work done by other people. It certainly doesn’t have to be the way it has been structured right now and it wouldn’t be unacceptable to do so for any other use but a research project. It seemed quite logical to reuse the existing code as much as possible to save time and effort. However, this makes it very hard to compile and run the project when you don’t remember the steps involved in between.
One way to tackle this problem is to write an elaborate readme file. You could even use simple markdown tags to format it nicely as well. But that is not quite elegant enough when used as a substitute for build scripts. You wouldn’t know how many “not-so-obvious” steps you’d skip during documentation. Besides it wouldn’t be as simple as to running a build command to try out the cool thing that you made. A readme on the other hand should carry other important stuff like a short introduction to the code, how to use it and a description of the “two”-step build process that you chose for it.
Luckily this is not a new problem, and generations of programmers have provided us with excellent tools for getting around it. They offer a mechanism to document your readme steps in a very systematic way. And there’s no reason you shouldn’t use them!
One such program that you may already know about is
make. Here’s a short and sweet introduction to make by @mattmight. Allow me to take this a little further to demonstrate why these tools are indeed so useful. Let’s start from something simple. A very basic makefile could read something like:
But its advantage seems more clear when you’d like to handle a more complicated scenario. So let’s cook up an example for that; say, I’d like to convert some Python code to C (don’t ask why!) using Cython and then create an executable by compiling the converted C code. Here’s how I’d probably write a makefile for it:
Now the wins are quite obvious. It saves you from remembering such a long build command and also documents the steps you need to follow for building the code. But we still have a couple of issues left if you were to distribute your code. You’d notice that I have hard-coded my python versions as well as the path to the include directories in my makefile. Running this on a different computer would certainly cause problems. One way to handle this is to declare all the variables in the beginning of your make file:
This makes it quite easy for the poor souls using your code to edit the variables according to their configurations. All of the things to change are conveniently located at the top. But wouldn’t it be nice if you could save them from all of this manual labor of finding the right paths for linking libraries, versions of the software installed etc. as well? Reading and understanding your code is already hard enough :D. A shell script could have been quite useful, no?
These awesome people at GNU Autotools have already done the hard work and have given us a bunch of tools just to do exactly what we need here. These tools includes
autoconf to help you create and configure your makefiles.
To write a configure script, you’d first need a
configure.ac file. This can be used by the autoconf tool to generate a script to fill the variables in the makefile. Using these tools will make sure that all of your projects have a consistent two-step build process. So that anyone wanting to run your code would have to simply run the
configure script followed by
make to build your project. No manual tweaking of variables is required during these steps.
There are couple of other helper tools that offer you the luxury of using macros that cut your work further in writing these files. Let us continue with our cython example here.
With just two statements in my
configure.ac, I’d be able to create configuration file to fill in my makefile variables:
And to tell what to fill in, I’ll add some placeholder text in my makefile and call it
At this point I can run
autoconf to generate the
configure script that would do all the work of figuring out and filling in the variables.
I can even code my own checks here. So let’s add a couple: With my
configure script, I’d like to not only assign the path for linking python libraries but also check if the user has all the pre-requisites installed on the system to be able to compile the code. You have an option to prompt the user to install the missing pieces or even start an installation for them. We’ll stop ourselves at just printing a message for the user to do the needful. So let’s go back to our
Here I have added some code to check if cython is available on the user’s machine. Also note that with the
AC_PYTHON_DEVEL macro, I am also making sure that the python installed on the user’s machine is newer than version 2.5. You can add more checks here depending on what else is needed for your code to build and run. The best part is that a lot of macros are already available so you don’t have to write them from scratch.
There’s more stuff that you could explore here: alternatives like cmake provide a more cross-platform approach to managing your build processes and also have GUIs to do these steps. A couple of other tools which could handle the configuration portions such as
pkg-config exist as well but may not come pre-installed on most OS, unlike make. There are a few language specific project managers that you could also consider (like Rake for Ruby). If you are dealing with a Java project then Ant or Maven are also good candidates. IDEs such as Netbeans create configuration files for them automatically. There are a lot of newer (relatively speaking) projects out there that let you easily package code involving web applications (more on this here) and make them ready for deployment on other machines.