Tests as Documentation
Wouldn’t it be nice if?
When disciplined programmers write unit tests, they often point out that their tests document the software under test. This documentation is more appropriate than informal and potentially ambiguous comments written in English. Take the simple example of adding two numbers. We might document it using informal language:
/**
* Adds the two arguments.
*
* @param a Add this argument to the other one.
* @param b Add this argument to the other one.
* @return The sum of the two arguments.
*/
With unit tests, we might instead write something more formal and unambiguous:
assertEqual(add(2, 2), 4)
assertEqual(add(4, 3), 7)
... and so on
It might be argued that both these forms of documentation complement each other. After all, while the unit tests have less room for misinterpretation, they are incomplete; for example, what about add(88, 37)? The English description makes up for this shortcoming.
We could reword our English to be a little more succinct:
/**
* Passing 0 as one argument returns the other argument,
* otherwise, the result is the same as subtracting 1 from one argument and
* adding 1 to the other argument then passing those values instead.
* e.g. add(2, 8) is the same as add(1, 9) and so on until one of the arguments reaches 0.
*/
Wouldn’t it be nice if we could express this formally in unit tests? You can; read on.
While this example is trivial, it scales in proportion to the amount of discipline that the programmer is willing to exercise by controlling side-effects in their program. If we write our programs such that most of our methods retain the property of referential transparency, we can use this advanced method of tests as documentation. When we refactor our code to make tests easier to write, it is often the case that we are doing exactly this anyway. Win win!
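The inductive description of add above can be checked mechanically. What follows is a minimal sketch in plain Java that does not use Reductio; the class name AddSpec, the helper names, the seed and the choice of small non-negative random inputs are assumptions made purely for illustration.

```java
import java.util.Random;

class AddSpec {
    // The function under test.
    static int add(int a, int b) {
        return a + b;
    }

    // First clause: passing 0 as one argument returns the other argument.
    static boolean zeroIsIdentity(int n) {
        return add(0, n) == n && add(n, 0) == n;
    }

    // Second clause: moving 1 from one argument to the other leaves the result unchanged.
    static boolean shiftingOnePreservesSum(int a, int b) {
        return add(a, b) == add(a - 1, b + 1);
    }

    // Checks both clauses against `runs` pseudo-random pairs of small non-negative integers.
    static boolean holdsForRandomInputs(int runs, long seed) {
        Random random = new Random(seed);
        for (int i = 0; i < runs; i++) {
            int a = random.nextInt(1000);
            int b = random.nextInt(1000);
            if (!zeroIsIdentity(a) || !shiftingOnePreservesSum(a, b)) {
                System.out.println("Falsified with arguments: " + a + ", " + b);
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(holdsForRandomInputs(100, 42L) ? "OK, passed 100 tests." : "Failed.");
    }
}
```

The two helper methods mirror the two clauses of the English comment, so the test reads as the same specification in executable form.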
Let’s scale up a little
We’ll give a slightly less trivial example next, but not one so complicated that it takes away from the important points. In fact, let’s unit test a specific part of the Java Collections library: the java.util.Collections.reverse method. There are various ways of testing this method and we will choose one here that serves to illustrate the point of unit tests as documentation.
The reverse method can be described as follows:
- Reversing the empty list always yields the same list
- Reversing a list with one element always yields the same list
- For any two lists (let’s call them ‘a’ and ‘b’), appending b to a and then reversing yields the same list as reversing b and then appending the reverse of a to it. Since this statement is a little convoluted, let’s write it with some pseudo-Java syntax notation:
(a.append(b)).reverse() == b.reverse().append(a.reverse())
It is an interesting observation here that we have completely specified the reverse method. That is, under some reasonable assumptions, it is not possible to write a method that is not equivalent to reverse that also satisfies our statements above. This is the ultimate form of code documentation!
We will ignore the first statement for the sake of interest and brevity and focus on expressing the other two. This is because statements 2 and 3 have free variables, while statement 1 is merely an assertion that does not illustrate any interesting points. Let us start with the second statement and articulate it using Reductio:
Property p2 = property(arbInteger, new F<Integer, Property>() {
  public Property f(Integer i) {
    return prop(single(i).equals(reverse(single(i))));
  }
});
That pretty much sums up statement 2 doesn’t it? What about statement 3:
Property p3 = property(arbLinkedList(arbInteger), arbLinkedList(arbInteger),
    new F2<LinkedList<Integer>, LinkedList<Integer>, Property>() {
  public Property f(LinkedList<Integer> a, LinkedList<Integer> b) {
    final LinkedList<Integer> x = reverse(append(a, b));
    final LinkedList<Integer> y = append(reverse(b), reverse(a));
    return prop(x.equals(y));
  }
});
Is that it?
Yep. Notwithstanding the absence of statement 1, we have completely specified the behaviour for the Java Collections.reverse method. We have exhaustive and formal documentation instead of one or the other as we traditionally do. What an improvement!
Yeah but I want to unit test it too
That’s not hard either. How many unit tests do you want to run? By default, Reductio will run 100 unit tests per Property declaration. You can adjust this and various other factors about how your unit tests are executed. If you want to take the default, then a few more lines of code are enough to do just that:
list(p2, p3).foreach(new Effect<Property>() {
  public void e(Property p) {
    summary.println(p.check());
  }
});
If you run this line of code, you will see the result of your 200 unit tests on the standard output:
OK, passed 100 tests.
OK, passed 100 tests.
Is that too magical for you? Don’t believe me? Want to see it fail? OK, let’s fail it. In the expression of statement 3, swap the order of the final append by changing append(reverse(b), reverse(a)) to append(reverse(a), reverse(b)), then run again. What did you see? Here is what I saw:
OK, passed 100 tests.
Falsified after 4 passed tests with arguments: [[3, 2, -3, 4, -3],[2, -3, 4, -3]]
Yep, it failed alright
When those two list values are used as our free variables, the property is false and the unit test fails.
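To make the idea of a falsified property concrete, here is a minimal sketch in plain Java of what a property check amounts to: generate random values for the free variables and stop at the first pair that makes the property false. This is not Reductio's implementation; ReverseSpec, randomList, firstCounterExample, the seed and the size and value bounds are all hypothetical names and choices made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.BiPredicate;

class ReverseSpec {
    // Non-destructive reverse: copies the list before calling Collections.reverse.
    static List<Integer> reverse(List<Integer> as) {
        List<Integer> copy = new ArrayList<>(as);
        Collections.reverse(copy);
        return copy;
    }

    // Non-destructive append: copies the first list before adding the second.
    static List<Integer> append(List<Integer> as, List<Integer> bs) {
        List<Integer> copy = new ArrayList<>(as);
        copy.addAll(bs);
        return copy;
    }

    // Statement 3 as expressed in the post.
    static boolean statement3(List<Integer> a, List<Integer> b) {
        return reverse(append(a, b)).equals(append(reverse(b), reverse(a)));
    }

    // Deliberately broken variant: the reversed lists are appended in the wrong order.
    static boolean brokenStatement3(List<Integer> a, List<Integer> b) {
        return reverse(append(a, b)).equals(append(reverse(a), reverse(b)));
    }

    // Generates a list of 0 to 5 integers in the range -5..5 (arbitrary bounds).
    static List<Integer> randomList(Random random) {
        int size = random.nextInt(6);
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            list.add(random.nextInt(11) - 5);
        }
        return list;
    }

    // Returns the first pair of lists that falsifies the property, or null if every run passes.
    static List<List<Integer>> firstCounterExample(
            BiPredicate<List<Integer>, List<Integer>> property, int runs, long seed) {
        Random random = new Random(seed);
        for (int i = 0; i < runs; i++) {
            List<Integer> a = randomList(random);
            List<Integer> b = randomList(random);
            if (!property.test(a, b)) {
                return Arrays.asList(a, b);
            }
        }
        return null;
    }

    static String check(BiPredicate<List<Integer>, List<Integer>> property) {
        List<List<Integer>> failure = firstCounterExample(property, 100, 1L);
        return failure == null ? "OK, passed 100 tests." : "Falsified with arguments: " + failure;
    }

    public static void main(String[] args) {
        System.out.println(check(ReverseSpec::statement3));
        System.out.println(check(ReverseSpec::brokenStatement3));
    }
}
```

The broken variant still holds whenever one of the two lists is empty, which is why a handful of runs can pass before a counterexample turns up.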
Other Resources
- A Case for Automated Testing
- Reductio EqualsHashCode
- Reductio Website
- More Reductio examples
- Reductio Manual
Complete Runnable Source Code
import fj.Effect;
import fj.F;
import fj.F2;
import static fj.data.List.list;
import static reductio.Arbitrary.arbInteger;
import static reductio.Arbitrary.arbLinkedList;
import static reductio.CheckResult.summary;
import reductio.Property;
import static reductio.Property.prop;
import static reductio.Property.property;

import java.util.Collections;
import static java.util.Collections.singletonList;
import java.util.LinkedList;

public class ListReverse {
  public static void main(String[] args) {
    Property p2 = property(arbInteger, new F<Integer, Property>() {
      public Property f(Integer i) {
        return prop(single(i).equals(reverse(single(i))));
      }
    });

    Property p3 = property(arbLinkedList(arbInteger), arbLinkedList(arbInteger),
        new F2<LinkedList<Integer>, LinkedList<Integer>, Property>() {
      public Property f(LinkedList<Integer> a, LinkedList<Integer> b) {
        final LinkedList<Integer> x = reverse(append(a, b));
        final LinkedList<Integer> y = append(reverse(b), reverse(a));
        return prop(x.equals(y));
      }
    });

    list(p2, p3).foreach(new Effect<Property>() {
      public void e(Property p) {
        summary.println(p.check());
      }
    });
  }

  static <A> LinkedList<A> single(A a) {
    return new LinkedList<A>(singletonList(a));
  }

  static <A> LinkedList<A> reverse(LinkedList<A> as) {
    LinkedList<A> aas = new LinkedList<A>(as);
    Collections.reverse(aas);
    return aas;
  }

  static <A> LinkedList<A> append(LinkedList<A> as1, LinkedList<A> as2) {
    LinkedList<A> aas = new LinkedList<A>(as1);
    aas.addAll(as2);
    return aas;
  }
}
June 17th, 2008 at 4:17 pm
I think the problem I’m having in understanding this post is that for the simple cases you’ve presented, the implementations of add and reverse themselves are both more accurate and more readable than either an English description or a test case.
Of course I have never really done either English docs OR unit testing terribly well.
June 17th, 2008 at 4:36 pm
If you call “reverse” a “non-trivial example”, then we definitely have different definitions for this adjective.
I also notice it’s a frightening amount of code to test such simple cases.
Could you show Reductio in action on some more practical matters, such as database testing?
June 17th, 2008 at 5:54 pm
Seal,
I called it a “less trivial” example. The real question is whether or not the example scales to — as you point out — a database. The answer is yes, but requires a level of programming discipline that most Java programmers have never encountered. I have refrained from presenting this here for fear of scaring people away.
Here is another interesting question (perhaps more interesting). What property makes it the case that these trivial examples do not scale to your database (etc.) application? For example, when does a trivial example no longer apply and why?
I’ll let you ponder this latter question rather than reveal any hints just yet.
June 18th, 2008 at 1:48 am
Tony,
If this example will scare Java programmers away, then Reductio will never take off, because this is the kind of real world programming that we do every day. I’m a bit disappointed by your response because it seems like a cop-out, as if you already know that your approach doesn’t scale beyond trivial examples.
Please prove me wrong and show me that Reductio will make my life easier than with TestNG or JUnit and I’ll adopt it in a heartbeat.
Until Reductio becomes popular in the Java world, the burden of proof is on you.
June 18th, 2008 at 5:07 am
Hi Seal,
I meant that a thorough explanation of why these examples scale will scare Java programmers away. I know this because there is already proof out there, and I’d be preaching to the choir if I were to repeat it.
I’m not sure why popularity shifts a burden of proof; this sounds like a classic logical fallacy to me.
I’d really rather that you try to find a counter-example, not because I think the burden of proof is on you, but because I believe that in doing so, you will have some very euphoric moments of realisation.
I can give you a hint; notice that I mentioned referential transparency. What happens when you violate this property in your software? Take the given example: addAll and reverse are not referentially transparent. This results in poor composition, but luckily a fix was available without too much effort (i.e. copy the list, etc.).
To take your database example, suppose you had a method:
List<Person> persons = getPersonsFromDatabase();
This method is difficult to test because it is not referentially transparent. Suppose that I passed the database itself as an argument and received one back:
PersonsAndDatabase persons_db = getPersonsFromDatabase(db);
Now there is no distinction between this method and the example I gave in the post that should concern you. The real question is, can I write my code this way? I think I’ve done enough paraphrasing of some important concepts here (with the danger of undermining their importance and depth), so I will stop here and hopefully you can take it yourself for a little.
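As a rough illustration of that second signature, here is a minimal Java sketch that treats the database as an immutable value. Everything in it is an assumption made for illustration: the fields of Person, Database and PersonsAndDatabase, the reads counter standing in for a side effect, and the PersonQueries class are hypothetical and not taken from any real library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class Person {
    final String name;

    Person(String name) {
        this.name = name;
    }
}

// Hypothetical immutable stand-in for a database: an "update" produces a new value.
final class Database {
    final List<Person> persons;
    final int reads; // what would normally be a hidden side effect, made explicit

    Database(List<Person> persons, int reads) {
        this.persons = Collections.unmodifiableList(new ArrayList<>(persons));
        this.reads = reads;
    }
}

// The result of a query together with the database as it stands after the query.
final class PersonsAndDatabase {
    final List<Person> persons;
    final Database database;

    PersonsAndDatabase(List<Person> persons, Database database) {
        this.persons = persons;
        this.database = database;
    }
}

class PersonQueries {
    // Referentially transparent: the result depends only on the argument,
    // and the argument itself is never modified.
    static PersonsAndDatabase getPersonsFromDatabase(Database db) {
        Database after = new Database(db.persons, db.reads + 1);
        return new PersonsAndDatabase(db.persons, after);
    }

    public static void main(String[] args) {
        Database db = new Database(Arrays.asList(new Person("Ann"), new Person("Bob")), 0);
        PersonsAndDatabase result = getPersonsFromDatabase(db);
        System.out.println(result.persons.size() + " persons, reads = " + result.database.reads);
    }
}
```

Because the method takes a value and returns a value, a test can build any Database it likes and check the result, just as the post does with lists.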
Finally, although it needn’t be stated (nor has it ever been my greatest intention), Reductio is preferable to JUnit/TestNG simply by the mere fact that it automates otherwise manual tasks. Indeed, I have considered writing a “JUnit compiler” from Reductio, so that if I were forced to use JUnit, I could use Reductio behind the scenes and generate JUnit source code (which is still easier than using JUnit manually). This is also the case for a test framework that I wrote some time ago called JTiger.
I hope this helps and I’m more than happy to delve deeper if you are.
July 4th, 2008 at 1:33 am
I’m a fan of
[1,2,3].reverse.should == [3,2,1]
July 4th, 2008 at 9:28 am
Pat,
What about all the other possible list values?! This does not disambiguate the reverse function; not even close.
Wouldn’t you prefer this? (forall a. forall b. forall c.)
[a, b, c].reverse.should == [c, b, a]
July 4th, 2008 at 6:24 pm
Not really. Tests are not a mathematical proof. They’re a tool to help me design my code, give me reasonable confidence that it does what I want, and serve as documentation for the future. With all due respect, the example you give absolutely sucks as documentation when compared to my example.
July 4th, 2008 at 6:44 pm
No, tests are not a mathematical proof; that’s what the type system does. Indeed, it is possible to prove some functions with a clever enough type system.
I’m afraid that your example sucks:
def reverse[A](as: List[A]) = if(as.length == 3) then reallyReverse(as) else Nil
Your test passes here yet the function is wrong. I don’t like hints about program behaviour (and such a small hint it is); I much prefer unambiguous specification. The fact that a proof is not given is a known limitation of computation theory.
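The flawed function above can be rendered in Java to show the gap concretely. This is a hypothetical sketch: the class FlawedReverse and its method names are made up for illustration. The single example test passes, while statement 2 from the post (reversing a one-element list yields the same list) is falsified by any input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class FlawedReverse {
    // Correct only for lists of exactly three elements; returns an empty list otherwise.
    static List<Integer> reverse(List<Integer> as) {
        if (as.size() != 3) {
            return new ArrayList<>();
        }
        List<Integer> copy = new ArrayList<>(as);
        Collections.reverse(copy);
        return copy;
    }

    // The single example-based test from the discussion.
    static boolean exampleTest() {
        return reverse(Arrays.asList(1, 2, 3)).equals(Arrays.asList(3, 2, 1));
    }

    // Statement 2 of the post: reversing a one-element list yields the same list.
    static boolean singletonProperty(int i) {
        List<Integer> single = Collections.singletonList(i);
        return reverse(single).equals(single);
    }

    public static void main(String[] args) {
        System.out.println("example test passes: " + exampleTest());
        System.out.println("singleton property holds for 7: " + singletonProperty(7));
    }
}
```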
July 5th, 2008 at 4:07 am
Hi Tony,
There’s no denying that it’s possible to write implementations for looser tests that satisfy the tests but don’t satisfy the requirements. My barometer when testing is confidence, not unambiguous specification. For example, when I write the test
[1,2,3].reverse.should == [3,2,1]
and it passes, I’m moderately confident that it does what I expect. If I add another test with four elements, I’m even more confident.
When TDDing, the steps go
1. Write failing test
2. Make it pass
3. Refactor to remove duplication, generalizing as necessary
Your intentionally flawed implementation is obviously not sufficiently generalized. So you’re absolutely right that your implementation passes the test but breaks the spec, but I’d argue that it’s completely contrived to begin with.
Anyway, my main criticism with your example is entirely within the context of the post’s title, “Tests as Documentation.” In my opinion, your tests don’t serve as documentation at all. The test takes up 25+ lines of code and relies on understanding this mathematical property of list reversal. Compare that with mine, which is one line, shows a list, the operation performed on it, and the expected result. If you think that your test is better documentation, you’re deluding yourself. Note that your test is on the same order of complexity as the implementation of reverse itself!
Good tests, in my experience, follow the pattern of specification by example. That is to say, the clearest, most succinct way to specify code is to give examples of how to use it.
July 5th, 2008 at 7:06 am
Pat,
When you say you are now “moderately confident”, you are making an over-confident judgement. That my counter-example is obviously flawed is only for the purpose of demonstration. In practice, the fact that you tested one element from a function domain of an enormous size offers very very little confidence and should not be used to form any conclusions about correctness whatsoever — except that, for this element of the function domain, this property holds — this is incredibly weak both in terms of verification and documentation.
TDD is what I refer to as an anti-concept, so I will not enter any discussion around it. For me (and perhaps not you), it is equivalent to discussing that which does not exist.
My tests take up 25 lines because it is Java. Java is an awful language that really should not be used for anything practical. Let’s use a more reasonable language:
So with 3 lines of code we have perfect documentation and it is absolutely clear. The fact that I have effectively “rewritten reverse” is an essential property of software verification and is inescapable (even in its most diluted form when writing assert this and that); see Goedel’s Incompleteness.
If you think that
reverse [1,2,3] == [3,2,1]
serves as better (or is useful at all!) documentation, then it is not I who is “deluded” (these kinds of discussions remind me of tennis; your hit :)). Indeed, I am reasonably confident any objective person (who is not tainted with ideas from the pseudo-science that envelops the programming community) would much prefer an inductive definition of reverse to the clumsy “assert silliness” (this needs a name) that does not document very much at all and so destroys any meaningful discussion. That would be an interesting experiment to try on someone.
Have you ever written proofs, for example using the Coq proof assistant (or perhaps that which you can achieve using Haskell’s type system)? Also, have you read the QuickCheck paper (Koen Claessen and John Hughes)?
I agree with your last paragraph, though we likely have a different definition of “succinct”. Mine is more in line with the mathematical and computer science texts.
July 9th, 2008 at 12:15 am
The art of programming is the creation of unambiguous instructions for a computer to follow. Some people program in assembly, some in C++, and some in Prolog, but thanks to various results such as the equivalence of all Turing-complete languages, all of these are ultimately equivalent.
Suppose you are writing a piece of code and you REALLY want it to be correct. One way to do so is to write it TWICE (perhaps even in two different languages), then check to ensure that the two implementations match. It is not efficient (takes at least 2x as long plus time to do the comparison although it may save some debugging time), but it IS pretty reliable.
In the end, that is what you are describing here. If you are creating Reductio specifications so complete that they unambiguously specify the behavior of the function, then you are effectively “writing it in Reductio”. (A sufficiently clever person could create a compiler from unambiguous Reductio specifications to working code.) Then you compare this version with the version you wrote in Scala (or Java).
In my opinion, it’s an excellent academic exercise particularly on tiny examples like reverse(), and it might be useful on those rare cases where very tricky code needs to be 100% accurate without regard to the time spent coding it (certain datastructure manipulations in an OS kernel perhaps). But it is NOT the same as writing tests — creating these Reductio specifications takes at least as much time as writing the code itself (probably more — Reductio is nice, but it’s still not as easy to write or to think in as Scala). But I wouldn’t expect to be using it on a real project.
July 9th, 2008 at 5:59 am
Michael,
I was with you the whole way — I have only kind of alluded to what you mention about “rewriting the program” without mentioning it explicitly.
But here’s the part that grabs me. You admit that rewriting the program is an unavoidable aspect of testing, then in the very next breath, claim you “won’t be using this on a real project”. Yet, you feel it is OK to instead “rewrite the program” using ‘assert this and assert that’? I can’t understand this.
Why do you think specification-based testing is not appropriate for a ‘real project’? Is it because you think real projects are full of side-effects, making testing difficult? I think this is a common fallacy and I am prepared to show why it is false.
Here is an assertion to get started: almost all ‘real projects’ contain code that, if written in a sufficiently disciplined manner (this is significantly higher than the mainstream using TDD or what-have-you), can be tested using automated (JUnit et al. exemplify manual testing) and compositional methods, since by implication, the side-effects (of which there are very few in almost all ‘real world’ applications) have been separated in such a way that the pure code can be examined independently.
It is in these cases — where the code has been written by a professional — that automated testing is certainly going to be favoured over traditional ‘assert this and that’ testing.
I’m still baffled why you think you can escape the principle of rewriting the program by avoiding automated (in the true sense) testing.