Our experience with Hypothesis testing (and why we love it so much!)

Miniature Article

Intro

At Scille, we are developing Parsec: an open-source zero-trust file sharing solution (think of it as Dropbox with full client-side encryption).

The project has been in development for 4 years (5k commits on master, 12 contributors) and has been used in production for about a year now. According to cloc (and excluding autogenerated files), Parsec is composed of 42k SLOC (lines of code excluding empty lines and comments) for its main code and 31k SLOC for its tests.

Parsec is written entirely in Python. This 40/60 split between tests and main code illustrates well the amount of $METRIC (time, SLOC, energy, cognitive load… pick any one you want ^^) one should invest to keep a codebase in good shape on any reasonably complex project.

As a matter of fact, Parsec has more than its share of complexity, and as a result we devoted far more than 40% of our time and energy to improving its testing. So we thought it would be good to share this knowledge with others ;-)

So what makes Parsec complex? As we said earlier, the basic idea of the project is:

  • users install the Parsec client on their computer

  • they create organizations on a Parsec server and invite other users to join their organization

  • the server acts as a relay between clients to store data and notify changes, but only works with encrypted data

  • each client accesses the data from a virtual filesystem (using FUSE on Linux and WinFSP on Windows)

  • the client can work while being offline (given that, unlike servers, users tend to reside in the countryside with a poor internet connection rather than in a datacenter with gigabit fiber...) and the data automatically syncs up with the server when back online

From a technical point of view, this means we have a dumb server and smart clients: any operation on decrypted data is performed on the client side (e.g. file conflict resolution, new user invitation), with a high risk of losing connection before the operation can be completed, and an equally high risk of clashing with changes made by other users when syncing back up with the server.

On top of that, multiple operations can run in parallel on very different timescales. For instance you could be trying to access a file not in your local cache (hence it must be fetched from the server over your poor countryside connection...) while, at the same time, saving a file on disk (which, for extra fun, could be the very same file you are trying to read, who knows!).

To solve this concurrency mess, we made the choice of going asynchronous with trio (this library is totally awesome and you should totally check it out!). However this didn't come without its own share of complexity and, long story short, we wrote some plugins to improve the situation quite a bit ;-)

Enter Hypothesis

The key takeaway here is that we ended up with a lot of complex moving parts interacting with each other. That's where Hypothesis entered and saved the day (yeah, at least the day, most likely the entire year ^^).

Hypothesis is a “property-based testing” tool, as opposed to the typical “example-based testing”. Both approaches consist of performing operations on some data and asserting on the result, but the way of setting up the data differs: example-based testing requires the data to be set up manually, whereas property-based testing works by writing a specification for the data.

Property-based testing comes from the Haskell world with the QuickCheck library (but don't worry you don't need to know that a monad is a monoid in the category of endofunctors to read this article ^^).

Here is a trivial example to illustrate:

import pytest
from hypothesis import given
from hypothesis.strategies import integers

def adder(a: int, b: int) -> int:  # Most useless function ever!
    return a + b

# example-based testing
@pytest.mark.parametrize("a,b,result", [(0, 0, 0), (-1, 0, -1), (1000, 555, 1555)])
def test_adder_by_examples(a, b, result):
    assert adder(a, b) == result

# property-based testing
@given(integers(), integers())
def test_adder_by_property(a, b):
    assert adder(a, b) == a + b

As you can see, example-based testing requires the developer to find good testcases (if you don't know pytest.mark.parametrize, just think of it as a for loop over testcases), whereas property-based testing leaves this task to the testing framework.

From this example one could think Hypothesis is just syntactic sugar around random.getrandbits and that our test is roughly equivalent to:

import random

def test_poor_man_property():
    for tries in range(100):  # Arbitrary number of testcases
        a = random.randint(-2**32, 2**32)
        b = random.randint(-2**32, 2**32)
        assert adder(a, b) == a + b

Hypothesis is (fortunately!) a lot smarter than that:

  • random.randint generates uniformly distributed values, which means a lot of similar, uninteresting testcases. Hypothesis knows better and picks much more diverse values (typically here the min and max values, then zero, then the same value for a and b, then a == -b, etc.)

  • Hypothesis stores all the generated examples in a local database. This allows it to retry previously failing examples between runs, and gives it more context to choose new interesting testcases.

  • Once an error case has been discovered, Hypothesis tries to simplify it as much as possible (this is called "shrinking"). For instance, if adder wasn't able to support negative values, Hypothesis could detect that a == -423245454 is a falsifying example by trying a random value, then shrink from there to end up with a == -1, which is a much better starting point for debugging our code ;-)

  • Finally, Hypothesis also monitors the generation of testcases to ensure it doesn't end up being inefficient (for instance integers().filter(lambda x: 11 < x < 20) is a very inefficient way of doing integers(min_value=11, max_value=20), and Hypothesis will complain about it).
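Shrinking can even be observed directly with hypothesis.find, which returns the smallest example satisfying a predicate (a quick sketch, independent of the adder example above):

```python
from hypothesis import find
from hypothesis import strategies as st

# Ask Hypothesis for an example of a negative integer: whatever random
# negative value the search stumbles on first, shrinking walks it back
# to the one with the smallest magnitude.
smallest_negative = find(st.integers(), lambda x: x < 0)
print(smallest_negative)
```

On a typical run this prints -1: the minimal integer satisfying the predicate.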

Stateless mode

The example we just saw uses what Hypothesis calls "stateless mode" (as opposed to "stateful mode", which we will get to). This is the common use case for Hypothesis: generate meaningful test parameters based on a strategy and keep a clear separation between the test case and the parameter configuration, which usually makes the test very simple and readable.

Back to Parsec: as we said earlier, the Parsec server is tasked with storing all the encrypted data from the users (we call them blocks). To do so, the server relies on object storage (AWS S3 or OpenStack Swift) and also embeds a RAID5-like redundancy algorithm: encrypted blocks are chunked and dispatched over at least three different object storage providers. Striping blocks in this way guarantees data redundancy (the system can survive the loss of one provider) and, moreover, no single provider ever stores all the chunks of a block, so none of them could rebuild an encrypted block on its own.

The code responsible for this is quite short, but also very critical. Our tests on it should ensure that:

  • blocks can be split into chunks (to be dispatched over providers)
  • blocks can be rebuilt from chunks (gathered from different providers)
  • blocks can be rebuilt with one missing storage

On top of that we'd like to test this with an arbitrary number of object storage providers and a random block size. With example-based testing we would have to write a fair amount of testcases to cover all those requirements, however Hypothesis makes this much simpler:

@given(
    block=st.binary(max_size=2 ** 8),
    nb_blockstores=st.integers(min_value=3, max_value=16),
)
def test_split_block(block, nb_blockstores):
    ...

The test parameters Hypothesis generates from the integer and binary strategies cover the basic cases (small or big blocks), but also more advanced ones (a block of zero bytes, an even number of providers) and even the nasty ones we may not have thought of (a block size smaller than the number of providers).

The overall test code is quite short: it first splits the given block between the different providers, then ensures it can retrieve the original block with a missing provider, no matter which one it is.
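To make this concrete, here is a self-contained sketch of such a split/rebuild pair together with its Hypothesis test. Note that split_block and rebuild_block below are a toy XOR-parity implementation written for this article, not Parsec's actual code:

```python
import functools
import operator

from hypothesis import given
from hypothesis import strategies as st

def split_block(block: bytes, nb_blockstores: int) -> list:
    # Cut the block into nb_blockstores - 1 data chunks + 1 XOR parity chunk
    nb_data = nb_blockstores - 1
    chunk_len = max(1, -(-len(block) // nb_data))  # ceil division
    padded = block.ljust(chunk_len * nb_data, b"\x00")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(nb_data)]
    parity = bytes(functools.reduce(operator.xor, col) for col in zip(*chunks))
    return chunks + [parity]

def rebuild_block(chunks: list, block_size: int) -> bytes:
    # At most one entry of `chunks` may be None (the lost provider)
    if None in chunks:
        missing = chunks.index(None)
        present = [c for c in chunks if c is not None]
        rebuilt = bytes(functools.reduce(operator.xor, col) for col in zip(*present))
        chunks = chunks[:missing] + [rebuilt] + chunks[missing + 1:]
    return b"".join(chunks[:-1])[:block_size]

@given(
    block=st.binary(max_size=2 ** 8),
    nb_blockstores=st.integers(min_value=3, max_value=16),
)
def test_split_block(block, nb_blockstores):
    chunks = split_block(block, nb_blockstores)
    # The block must be recoverable no matter which single provider is lost
    for missing in range(len(chunks)):
        damaged = chunks[:missing] + [None] + chunks[missing + 1:]
        assert rebuild_block(damaged, len(block)) == block

test_split_block()  # run the Hypothesis engine on the property
```

The test itself is just a roundtrip with a simulated provider loss; all the combinatorics of block sizes and provider counts are delegated to the strategies.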

This approach brings high-quality test coverage while keeping the test code small and simple. Hypothesis is especially good at this kind of testing, where you apply an operation, then the reverse operation, and finally compare the result with the input data.

A typical example of this is how PyPy tests its implementation of the json module:

jsondata = strategies.recursive(
    strategies.none() |
    strategies.booleans() |
    strategies.floats(allow_nan=False) |
    strategies.text(),
    lambda children: strategies.lists(children) |
        strategies.dictionaries(strategies.text(), children))

@given(jsondata)
def test_roundtrip(d):
    assert json.loads(json.dumps(d)) == d

As you can see, most of the test is telling Hypothesis how to generate JSON data ;-)

Stateful mode

As a matter of fact, we don't use the stateless mode much in Parsec (that's why the last section ended with an example from PyPy 😉). The real killer feature for us is the stateful mode: the idea here is to describe a testing state machine.

As we said earlier, the main goal of Parsec is to expose a virtual filesystem that syncs its data with a remote server. To achieve that we have multiple components working closely together (local storage with encryption, client/server communication, sync on remote changes, sync on local changes, virtual FS mountpoint, etc.). As we discovered, most bugs come from interactions between components, so unit tests don't help much and we focused on integration testing. However, finding good testcases for component interactions is a tedious task that grows exponentially with the number of involved components :'(

As you guessed, Hypothesis's stateful mode is here to save the day: this time, instead of asking Hypothesis to run a single test function with its generated data, we make it run an entire state machine!
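Before diving into our real code, here is a minimal, self-contained state machine to give a feel for the API (a toy example written for this article, not from Parsec: it tests a naive list-backed "set" against Python's built-in set as oracle):

```python
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule

class NaiveSetMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.items = []       # naive "set" under test, backed by a list
        self.oracle = set()   # trusted reference implementation

    @rule(x=st.integers(min_value=0, max_value=9))
    def add(self, x):
        if x not in self.items:
            self.items.append(x)
        self.oracle.add(x)

    @rule(x=st.integers(min_value=0, max_value=9))
    def discard(self, x):
        if x in self.items:
            self.items.remove(x)
        self.oracle.discard(x)

    @rule()
    def check(self):
        # Both implementations must always agree
        assert set(self.items) == self.oracle

# Hypothesis exposes the machine as a regular unittest/pytest test case
TestNaiveSet = NaiveSetMachine.TestCase
```

Hypothesis will call add, discard and check in random sequences, looking for an ordering of rules that breaks the assertion, and will shrink any failing sequence to a minimal one.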

Here is a typical example of what we use it for:

BLOCK_SIZE = 16
PLAYGROUND_SIZE = BLOCK_SIZE * 1

class FSOfflineRestartAndRWFile(...):
    @initialize()  # Hypothesis always calls this first
    async def init(self):
        ...
        # Hypothesis will run multiple test cases, each time starting by calling this function
        # hence we must start by resetting any internal state
        await self.reset_all()
        ...
        # We create a new file that will receive all the write/read/truncate operations
        await self.workspace.touch("/foo.txt")
        # We create a reference file to compare the behavior
        self.file_oracle = FileOracle()

    # During a test case, Hypothesis will choose multiple rules to call

    @rule()
    async def restart(self):
        # Simulate the fact we close the application and restart it later
        await self.restart_user_fs(self.device)
        self.workspace = self.user_fs.get_workspace(self.wid)

    @rule(
        size=st.integers(min_value=0, max_value=PLAYGROUND_SIZE),
        offset=st.integers(min_value=0, max_value=PLAYGROUND_SIZE),
    )
    async def atomic_read(self, size, offset):
        async with await self.workspace.open_file("/foo.txt", "rb") as f:
            await f.seek(offset)
            content = await f.read(size)
        expected_content = self.file_oracle.read(size, offset)
        assert content == expected_content

    @rule(
        offset=st.integers(min_value=0, max_value=PLAYGROUND_SIZE),
        content=st.binary(max_size=PLAYGROUND_SIZE),
    )
    async def atomic_write(self, offset, content):
        async with await self.workspace.open_file("/foo.txt", "rb+") as f:
            await f.seek(offset)
            await f.write(content)
        self.file_oracle.write(offset, content)

    @rule(length=st.integers(min_value=0, max_value=PLAYGROUND_SIZE))
    async def atomic_truncate(self, length):
        await self.workspace.truncate("/foo.txt", length=length)
        self.file_oracle.truncate(length)

A few points here:

  • The code is asynchronous (using trio), but you can just pretend the async/await parts don't exist

  • However, if you care about asynchronous programming: coroutine scheduling is non-deterministic, but Hypothesis has us covered here

  • We must take extra care not to keep any state between two testcases (otherwise the test won't be deterministic, but don't worry: Hypothesis will detect this and invite you to fix your test ^^)

  • We test everything against a file oracle (a blackbox that gives a reference implementation of something). Here the oracle is a simple in-memory implementation
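For reference, such a FileOracle can be just a handful of lines. The sketch below is a plausible in-memory implementation written for this article (Parsec's actual oracle may differ in details, e.g. how writes past the end of file are handled):

```python
class FileOracle:
    """Tiny in-memory reference implementation of read/write/truncate."""

    def __init__(self):
        self._buffer = b""

    def read(self, size, offset):
        return self._buffer[offset:offset + size]

    def write(self, offset, content):
        if not content:
            return
        gap = offset - len(self._buffer)
        if gap > 0:
            # Writing past the end of file zero-fills the hole
            self._buffer += b"\x00" * gap
        self._buffer = (
            self._buffer[:offset] + content + self._buffer[offset + len(content):]
        )

    def truncate(self, length):
        gap = length - len(self._buffer)
        if gap > 0:
            self._buffer += b"\x00" * gap
        else:
            self._buffer = self._buffer[:length]
```

The point of the oracle is precisely to be this dumb: a trivially correct implementation that the real, complicated one must agree with on every sequence of operations.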

From there Hypothesis will be able to find really nasty test cases (what if you write something, then restart the application, then truncate the write and finally read the whole thing?). For instance, if for some reason the local storage is broken and doesn't persist its memory cache when the application restarts, we'll get something like:

$  py.test tests/core/fs/test_fs_offline_restart_and_rwfile.py
[...]
Falsifying example:
state = FSOfflineRestartAndRWFile()
async def steps():
    await state.init()
    await state.atomic_truncate(length=1)
    await state.restart()
    await state.atomic_read(offset=0, size=1)
    await state.teardown()
state.trio_run(steps)
[...]

The best part about this is that we can now simply copy & paste this output as a new test and start tweaking it to debug!

From there on, it would be tempting to start adding more and more rules to our test state machine (in our case we could, for instance, also test the synchronization of our file with the server).

However this is not a good idea: the more rules your state machine contains, the less each one of them gets tested! Instead it's much better to divide tests across topics; for instance, folder operations in Parsec are covered by their own dedicated state machine.

Conclusion

Hypothesis brings a whole new way of testing to the table. It's still not a silver bullet, but rather a very efficient way to test some specific use cases (is there a better feeling than replacing hundreds of lines of code with a small and elegant Hypothesis test? 😋).

However, writing Hypothesis tests requires switching to a mindset different from traditional example-based testing. Hence it is often puzzling for beginners to determine where Hypothesis is a good fit. A rule of thumb would be:

  • you are writing a kind of encoder/decoder: Hypothesis is simply the perfect solution for you ;-)

  • you can write a simple oracle (or even better: you already have one available, given you are reimplementing a protocol!)

  • your work is of the "hard to compute, simple to check" kind (see how PyPy tests str.find for instance)
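The str.find case is a nice illustration of that last point: finding the first occurrence takes a real algorithm, but checking a claimed answer is trivial. A sketch of such a test (written for this article, not PyPy's actual test):

```python
from hypothesis import given
from hypothesis import strategies as st

@given(s=st.text(max_size=50), sub=st.text(max_size=5))
def test_str_find(s, sub):
    i = s.find(sub)
    if i == -1:
        assert sub not in s
    else:
        # The match is where find() claims it is...
        assert s[i:i + len(sub)] == sub
        # ...and no earlier index matches (find returns the FIRST occurrence)
        assert all(s[j:j + len(sub)] != sub for j in range(i))

test_str_find()  # run the Hypothesis engine on the property
```

The test never computes the expected index itself; it only checks that whatever index find() returned has the properties a correct answer must have.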