Continuous Integration

This post is based on a presentation I gave to my lab group. I hope you find it helpful!

The main message of this presentation is that continuous integration is a super helpful tool for doing good science. It does this by achieving two main things:

  1. It helps me make sure the code I am writing does what I expect it to do (even as I make changes to it)
  2. It makes my code more accessible to others (and therefore helps me communicate ideas)

In addition, continuous integration pipelines are actually very easy to implement.

I’ll cover three things I typically do in my continuous integration pipelines, in increasing order of complexity to implement:

  1. Formatting
  2. Type checking
  3. Unit testing (& linting)

1. Formatting

1.1. Why format?

The first thing I run on a pipeline is a formatter; this just checks that my code adheres to a set of rules around what code should look like (e.g. strings always use double quotes, ", instead of single quotes ').

This is valuable for two reasons:

  1. If we all agree on a consistent code format with which we write Python code, it’s one less piece of cognitive overhead when diving into someone’s code
  2. It acts as a first layer of bug catching, making sure that (at a minimum) everything can be parsed

1.2. How to format?

The formatter I recommend using is black. It’s extremely opinionated (“any color so long as it’s black”), but it is also widely adopted and super easy to use.

There are two ways I use black:

  1. To actually make the formatting changes to my code. In this case, I run black . in the project directory.
  2. To check if any formatting changes would be made. In this case, I run black --check --diff . in the project directory. This is also the command I use in a continuous integration pipeline, since the pipeline won’t be changing any code itself.
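
As a small illustration of the kind of changes black makes (a toy example of my own, not something from black’s documentation):

# before running `black .`: single quotes, inconsistent spacing
def greet(name = 'world'):
    return 'hello, '+name


# after running `black .`: double quotes, normalized spacing
def greet(name="world"):
    return "hello, " + name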

2. Type checking

2.1. Why type hint / type check?

Python is not a statically typed language, but you can add type hints. Consider the following function signature:

def test_data(flatten_x=False, max_size=None):

This signature leaves lots of questions unanswered! For instance, what is max_size when it’s not None? What does this function return? Type hinting can help answer these questions - a type-hinted equivalent of the function signature above is:

def test_data(flatten_x: bool = False, max_size: Optional[int] = None) -> Generator[Tuple[str, TestInstance], None, None]:

Now, I know that max_size is an int if it’s not None, and that the function returns a generator yielding pairs of strings and TestInstance objects.

In my opinion, this is the major advantage of using type hints; it allows me to look back at code I wrote a long time ago, and much more quickly understand what was going on.

Type hints are definitely a bit clunky, but there are plenty of functions I look back on where I really have to think to work out what the inputs and outputs even are.
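
For the hinted signature to actually work, Optional and Generator need to be imported from typing. A self-contained sketch (TestInstance and the function body here are hypothetical stand-ins, not the project’s real code):

from dataclasses import dataclass
from typing import Generator, Optional, Tuple


@dataclass
class TestInstance:
    # hypothetical stand-in for whatever the project's real TestInstance is
    x: Tuple[float, ...]
    y: int


def test_data(
    flatten_x: bool = False, max_size: Optional[int] = None
) -> Generator[Tuple[str, TestInstance], None, None]:
    # hypothetical body, just to put the signature in context
    instances = [TestInstance(x=(0.0, 1.0), y=i) for i in range(max_size or 3)]
    for i, instance in enumerate(instances):
        yield f"instance_{i}", instance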

2.2. How to type check

mypy allows you to actually check the type hints (e.g. do I say a function accepts a string and then pass an integer to it?). It lets me catch some very non-trivial bugs in the code - I find this especially valuable in large ML pipelines, which can take a while to run, since these bugs might otherwise take a long time to surface.
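
As a hypothetical example of the kind of bug this catches (the function names here are made up), running mypy . in the project directory flags the call below long before any pipeline runs:

from typing import List, Optional


def subset(values: List[float], max_size: Optional[int] = None) -> List[float]:
    return values[:max_size]


def build_batch(config: dict) -> List[float]:
    # bug: values read from e.g. a yaml config are often strings
    max_size: str = config.get("max_size", "32")
    # mypy flags this call: a str is not a valid Optional[int]
    return subset([1.0, 2.0, 3.0], max_size=max_size)

Nothing complains until build_batch is actually called at runtime, possibly deep into a long pipeline; mypy catches the mismatch statically.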

3. Unit tests

3.1. Why unit test?

Unit tests consist of testing small chunks (units) of your code to make sure they do what you expect them to do.

In addition to letting you make sure the code behaves as expected, unit testing also encourages the good programming practice of splitting your program into many smaller functions instead of a monolithic function which does everything (since such a function would be much harder to test).

When writing unit tests for research, be forgiving to yourself; things can move quickly and the overhead required to test a new piece of functionality may not be worth it (but it probably is).

I find unit testing especially important in machine learning because we have so many silent failure modes:

(Figure: examples of silent failure modes in machine learning)

3.2. How to write unit tests

I use pytest to run all my unit tests. Using pytest gets you lots of other things for free, such as flake8 (via the pytest-flake8 plugin), which checks for PEP 8 compliance.

A unit test is just a function with an assert statement: I expect something to be a certain way in my code, and make sure this is the case.

For example, say I’ve written a function to memoize the reading of a particular file, and I want to make sure that objects created using this function actually share memory. I would write the following test:

import geopandas
from shapely.geometry import Point

# read_geopandas (the memoized reader under test) and LABELS_FILENAME
# are imported from the project's own code


def test_labels_share_memory(tmpdir):
    # write a small GeoJSON file into pytest's temporary directory
    geopandas.GeoDataFrame(
        data={"a": [1, 2, 3]}, geometry=[Point(1, 1), Point(2, 2), Point(3, 3)]
    ).to_file(tmpdir / LABELS_FILENAME, driver="GeoJSON")

    # read it twice; the memoized reader should return the exact same object
    labels = read_geopandas(tmpdir / LABELS_FILENAME)
    labels_2 = read_geopandas(tmpdir / LABELS_FILENAME)

    assert labels is labels_2
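
The read_geopandas being tested isn’t shown here; as an assumption on my part (not necessarily the project’s actual implementation), a minimal sketch of such a memoized reader using functools.lru_cache:

from functools import lru_cache

import geopandas


@lru_cache(maxsize=None)
def read_geopandas(path) -> geopandas.GeoDataFrame:
    # repeated calls with the same path return the same cached object,
    # which is exactly what the `labels is labels_2` assertion checks
    return geopandas.read_file(str(path))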

When writing tests, there are three places I focus on:

3.2.1. Testing the data

Especially when I do lots of complex data transformations, I find unit tests helpful to make sure the data that ultimately goes into the model makes sense (and is what I expect it to be).

One thing I have found really helpful here is to actually include (small) data files in the tests folder, so that I don’t have to programmatically create the data I am testing.

For instance, here I include a tif file which I’ll be testing against.

When testing data, it can be hard to test for specific values, so instead I often sanity check my data - I make sure it’s within reasonable bounds:

  • here I check latitudes and longitudes are correctly bounded.
  • here I check the temperature is correctly bounded.

This isn’t a failsafe - the data could still be incorrect - but I find it strikes a balance between ease of implementation and checks on the data.
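
The linked checks live in the project repositories; as a hypothetical sketch of what such a sanity check can look like (these arrays and bounds are made up for illustration):

import numpy as np


def test_instance_bounds():
    # hypothetical arrays standing in for one processed training instance
    latitudes = np.array([14.5, 15.1, 13.9])
    longitudes = np.array([-17.2, -16.8, -17.0])
    temperatures_kelvin = np.array([280.0, 295.3, 310.2])

    assert ((latitudes >= -90) & (latitudes <= 90)).all()
    assert ((longitudes >= -180) & (longitudes <= 180)).all()
    # surface temperatures far outside this range suggest a unit mix-up
    assert ((temperatures_kelvin > 150) & (temperatures_kelvin < 350)).all()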

I also try to test complicated calculations.
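
That calculation isn’t reproduced here, but as a hypothetical illustration, pinning a formula like NDVI to a couple of hand-computed values is cheap and catches sign or ordering mistakes:

import numpy as np


def compute_ndvi(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
    # hypothetical helper: normalized difference vegetation index
    return (nir - red) / (nir + red)


def test_ndvi_known_values():
    red = np.array([0.2, 0.4])
    nir = np.array([0.6, 0.4])
    # (0.6 - 0.2) / (0.6 + 0.2) = 0.5 and (0.4 - 0.4) / (0.4 + 0.4) = 0.0
    np.testing.assert_allclose(compute_ndvi(red, nir), [0.5, 0.0])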

3.2.2. Testing the models

When testing algorithmic implementations, (a) it can be a nuisance to kick off a long-running process just to watch it fail while you are still implementing things, and (b) there are, again, so many silent failure modes.

So I find unit tests really help here too. As an example, here I test a (slightly modified) LSTM implementation against the PyTorch implementation.
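
That test isn’t reproduced here, but a minimal sketch of the same idea - recomputing one step of torch.nn.LSTMCell from its own weights and checking the outputs match - looks like this (my own sketch, not the test from the linked repository):

import torch


def manual_lstm_cell_step(cell, x, h, c):
    # recompute one LSTMCell step from its weights (PyTorch gate order: i, f, g, o)
    gates = x @ cell.weight_ih.T + cell.bias_ih + h @ cell.weight_hh.T + cell.bias_hh
    i, f, g, o = gates.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c_new = f * c + i * g
    h_new = o * torch.tanh(c_new)
    return h_new, c_new


def test_manual_step_matches_pytorch():
    torch.manual_seed(0)
    cell = torch.nn.LSTMCell(input_size=3, hidden_size=5)
    x = torch.randn(2, 3)
    h, c = torch.zeros(2, 5), torch.zeros(2, 5)

    h_ref, c_ref = cell(x, (h, c))
    h_new, c_new = manual_lstm_cell_step(cell, x, h, c)

    assert torch.allclose(h_ref, h_new, atol=1e-5)
    assert torch.allclose(c_ref, c_new, atol=1e-5)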

3.2.3. Testing where it’s easy to test

Often, a piece of functionality is super easy to test (especially once all the continuous integration infrastructure is in place):

from typing import List

import numpy as np

# CropClassifications (an Enum of crop classes) and to_one_hot come from the project code


def test_to_one_hot():
    # one-hot encode every crop class
    encodings: List = []
    for x in CropClassifications:
        encodings.append(to_one_hot(x.name))
    encodings_np = np.array(encodings)

    # each encoding should have exactly one entry set
    for i in range(encodings_np.shape[0]):
        assert encodings_np[i].sum() == 1

In these cases, I find it worthwhile to just add a test. In addition to catching silly mistakes now, the test gets run every time the pipeline runs, so it also insulates me against silly mistakes I might make in the future.

4. Bringing it all together

To wrap it all up, I’ll share two examples of continuous integration workflows on GitHub:

  • This workflow additionally tests the code on lots of different platforms - this helps me make sure lots of other people can run the code as well
  • This workflow is simpler - you can see all the tools described above being run together

And here is an example of the tests coming in handy, catching several bugs I initially made and preventing them from appearing in main.

On a final note, I recommend the I don’t like notebooks talk by Joel Grus, which has significantly informed how I think about programming for research.