I had that haunting feeling that I had forgotten something during the last editing pass on the previous post about test-driving the datatypes we dropped into our Jaro-Winkler test harness. Our graphs looked incomplete somehow…
D’oh! Error bars. Thanks to Neil Brown for pointing out our omission.
The statistics we have to offer
criterion computes some statistics for you as it chews through its benchmarking duties. It applies resampling in ways we haven't investigated, identifies outliers and their possible effect on variance, remarks on the confidence interval, and computes standard deviations for the mean and for its lower and upper bounds within a benchmark's iterations.
As with all statistics, these numbers risk lulling the trusting reader into a submissive state.
We were a little too trusting at first. If it finds that outliers have affected your run, criterion warns you but doesn't attempt a re-run. We had thought it would, so consider this a heads-up: check for crazy variances, and look for that warning! My laptop apparently sleepwalks during the full moon: there were three giant execution times recorded during our first overnight benchmarking run, which took 600 minutes. Only in the past couple of days did we realize the entire run should actually take only 50 minutes! We have no idea what happened there, but we have been on the lookout for large variances ever since.
So what is the consequence of the 95% confidence interval criterion reported to us? It's pretty sure that it got the right measurement for that run of that particular benchmark. It does not take into account whatever sort of balancing act your OS might be doing during that benchmark that it didn't do for the others, even within the same execution of the whole suite. Not that we expect criterion to be able to do that sort of thing†! We're just couching the implications of the statistics.
All of that rant is setup for the following disclaimer.
Disclaimer: We ran these measurements on a laptop with as few other processes running as we could manage. We are not benchmarking pros, so we really have no idea how best to quantify or even qualify the consistency of our measurements.
Bringing error bars into the equation might satiate our need for statistical significance, but we just want to highlight how much the OS environment, for example, might affect a run, and that the standard deviation as computed by criterion does little to assuage that.
Here’s another graph to bring home our disclaimer. We’ve executed the entire benchmark multiple times; here’s the same slice of four such runs lined up. Recall that each of the lines on this graph achieved that 95% confidence interval nod.
We can’t explain those slight differences. Red, orange, and green were run back to back one night, and blue was run over lunch the next day. Otherwise, the system was in a very similar state. Maybe we ought to reboot just before benchmarking? (We tried booting into single-user mode and benchmarking there, but all the times were roughly twice as slow. /shrug)
An example graph
Even so, a graph looks so much more dapper with error bars. So we jumped through all the hoops‡ to get a PNG with error bars out of OpenOffice. (I chose not to install the graphics libraries criterion relies on for its graphs.) This is from a different set of data than the numbers in the previous post, but you get the idea. See the full graph PNG if this is too hard to read.
The error bars in that graph are ± the standard deviation.
Here’s another thing we learned during this experiment: Google Spreadsheets does not support error bars. Swammi really gives it to ’em there!
The only other statistic we’re prepared to offer is a characterization of our SNR, the signal-to-noise ratio: the mean divided by the standard deviation. The higher that quotient is, the less jittery the results were.
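As a quick sanity check on that definition, here is how the quotient falls out of a list of sample timings. This is plain Haskell, not criterion's internals, and the timings are made up for illustration; we also use the population standard deviation here, which may differ slightly from what criterion reports.

```haskell
-- Mean of a non-empty sample.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Population standard deviation: sqrt of the mean squared deviation.
-- (criterion's bootstrapped estimate may differ; this is just a sketch.)
stdDev :: [Double] -> Double
stdDev xs = sqrt (mean [(x - m) ^ (2 :: Int) | x <- xs])
  where m = mean xs

-- The "SNR" figure from the post: mean divided by standard deviation.
snr :: [Double] -> Double
snr xs = mean xs / stdDev xs

main :: IO ()
main = print (snr [10.1, 10.2, 9.9, 10.0, 9.8])  -- illustrative timings, in ms
</imports>
```
So a benchmark whose timings hover around 10 ms with a 0.14 ms spread scores an SNR of roughly 70, which by our cutoff below would count as one of the jittery ones.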
We don’t have any intuitions about “good” SNR values, but we guess those are mostly pretty good? Again, this just means that each particular benchmark within that particular execution of the suite was pretty consistent. It doesn’t necessarily say anything about how much we can trust any comparisons between them, but it does suggest that nothing funky was going on, so it doesn’t per se warn against comparing the variants measured within that benchmark run. On that note, we used
sleep 360 && … to let the display get to sleep and the computer settle before the benchmark started. If the benchmark was not done by the time we woke the display, we restarted the run.
FYI, the eight SNRs below 200 are ByteString * ByteString, ByteString * String, UArray Int Char * UArray Int Char with tuples, String * ByteString, Seq Char * UArray Int Char, and a few of the variants we have yet to post discussions about: a further optimization of symmetric ByteString, and two of the refinement steps between the String spec and the primary jaro algorithm we settled on for the previous post. We see no notable pattern in those, though again, we’re not too familiar with SNRs.
So that’s our graph and a characterization of our SNRs. Any suggestions on how to characterize our confidence in the comparisons? We suppose you could derive that from a characterization of how consistent the state of the machine was throughout the execution of the entire benchmarking suite, beyond just turning off AirPort, for example. (We’re working with what we have…)
Thanks again, Neil, for the reminder.
And thank you for reading! Please stay tuned for that last post about optimizing jaro. The request has been made, so our code will be posted too.
† That’d be really awesome, though. Maybe it could jumble up some portions of the iterations of a benchmark, trying to balance them across the entirety of that run of the benchmarking suite? Hrmmmm… (k)