Update 2 June 2011: we’ve since discovered an error in our benchmarking suite. Data.Text was at a disadvantage and actually performs on par with String. So, the Data.Text results below are inaccurate — everything else should be fine AFAWK.
I don’t really know if this is technically the same “functional time”. But maybe by “functional” Evan meant whenever it’s ready — so here goes. I’m Nick. Thanks for reading. I have to apologize ahead of time: I cannot even feign the capability to compete with Evan’s cultural allusions.
A quick recap of the last post: when optimizing a straightforward implementation of the Jaro string distance, we found some success with datatype refinements to IntSet and ByteString (surprise surprise). But then we found out an inconvenient truth about current GHC profiling: namely, that it isn’t to be trusted.
This post includes some more of our experiments and design considerations. In particular, we focus on trying out four alternative types for representing the strings.
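To keep the algorithm itself in view before we get into representations, here is a minimal, unoptimized sketch of the Jaro metric on plain Strings. This is our own illustrative reconstruction, not the benchmarked code; all names here (jaroStr, go, and so on) are invented for this sketch, and it makes no attempt at the re-ordering or strictness tricks discussed below.

```haskell
import Data.List (sort)

-- Jaro similarity on plain Strings: a greedy, windowed matching pass
-- followed by a transposition count over the matched characters.
jaroStr :: String -> String -> Double
jaroStr s1 s2
  | mInt == 0 = 0
  | otherwise = (m / len1 + m / len2 + (m - t) / m) / 3
  where
    n1 = length s1
    n2 = length s2
    len1 = fromIntegral n1
    len2 = fromIntegral n2
    window = max 0 (max n1 n2 `div` 2 - 1)

    -- greedy left-to-right matching within the window:
    -- idxs are the matched indices into s2 (in s1-traversal order),
    -- chars are the matched characters of s1 (in order)
    (idxs, chars) = go 0 [] [] s1
    go _ js cs [] = (reverse js, reverse cs)
    go i js cs (c : rest) =
      let lo = max 0 (i - window)
          hi = min (n2 - 1) (i + window)
          cand = [ j | j <- [lo .. hi], j `notElem` js, s2 !! j == c ]
      in case cand of
           []      -> go (i + 1) js cs rest
           (j : _) -> go (i + 1) (j : js) (c : cs) rest

    mInt = length idxs
    m = fromIntegral mInt

    -- half the number of positions where the two matched sequences disagree
    t = fromIntegral
          (length [ () | (c1, j) <- zip chars (sort idxs), c1 /= s2 !! j ])
        / 2
```

On the classic example pair “MARTHA”/“MARHTA” this yields six matches and one transposition, i.e. a similarity of 17/18.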
tl;w(on’t)r
The following chart summarizes the results; the scalar is an average time for an iteration of the outermost loop of our testing harness and the label is the type used to represent words.
Disclaimer: These are all the same algorithm, with different data structures just plugged in. With algorithms tailored to their interfaces, these datatypes might perform better!
ByteString wins again. In this case, we think the chart indicates that much of that performance benefit is due to the support for O(1) random access. This claim is supported by the comparable success of UArray Int Char. Read on for a more thorough discussion of the highlights:
- we couldn’t squeeze any performance gains out of mutability and unsafePerformIO,
- don’t underestimate the benefits of a data declaration with strict fields for your inner loop’s state!, and
- Data.Sequence severely underperforms, but we’re not sure why.
Prompted by a comment on the last post by Erik Hesselink (Thanks, Erik!), we included a variant for Bryan O’Sullivan’s Data.Text. We have not investigated why it only performs on par with String (which shares the same time-complexities for the interface we use); the only caveat we’ll mention is the minimality of this algorithm. It’s basically a strict left fold, a length, and random access otherwise. There’s really no opportunity for fusion to kick in that we’ve noticed.
We’d be happy to send our code your way if you’d like — just let us know in the comments. You can dig deeper if you like, but we’re just investigating a “plug-and-play” like use here.
Experimental Setup
As a first step, we identify the essential interface to the String-like datatype for calculating the Jaro distance. This type class is used in the code we actually benchmarked; each instance of Jaro is saturated with INLINE pragmas, and the jaro and matches definitions were cut-and-pasted for each datatype in order to avoid dictionary passing at run time. We couldn’t get all of the overhead removed via SPECIALISE — we have yet to determine why not.
```haskell
class Jaro c where
  fromBS :: ByteString -> c
  iLen   :: c -> Int
  index  :: c -> Int -> Char
  foldl' :: (b -> Char -> b) -> b -> c -> b

  {-# INLINE gLen #-}
  gLen :: Enum n => c -> n
  gLen = toEnum . iLen
```
The test harness uses a strict ByteString to read in the test data. The fromBS method converts the input word to the appropriate type for the jaro variant under test. We use Control.DeepSeq.rnf in order to exclude the conversion from our measurement of execution time. Otherwise, the methods simply provide access to the length of the word, random-access indexing, and a strict left fold.
The Jaro type class provides all the necessary methods for implementing the Jaro algorithm.
```haskell
jaro :: (Jaro c, Fractional n, Enum n) => c -> c -> n
jaro s1 s2
  | 0 == m    = 0
  | otherwise = (m / gLen s1 + m / gLen s2 + (m - t) / m) / 3
  where
    m = toEnum mInt
    (x1, x2) = if iLen s2 < iLen s1 then (s2, s1) else (s1, s2)
    (mInt, x1', hits2) = matches x1 x2

    t = (/ 2) $ toEnum $ loop hits2 x1' (0 :: Int)
      where
        loop (i : is) (c1 : cs1) = loop is cs1 . f
          where f = if c1 /= x2 `index` i then succ else id
        loop _ _ = id
```
The definition of jaro re-orders the input words so that the outer loop of matches traverses the shorter word. That traversal yields the number of matches, mInt; the string of matched characters from the short string, x1'; and the ascending indices of matched characters in the long string, hits2. The latter two are traversed simultaneously in order to compute the transpositions.
The matches function consists of a left fold over the short input word, which carries the following state type, St. The data declaration uses strictness annotations (“bangs”) on its fields in order to specify that they should never contain thunks. Along with the -funbox-strict-fields compilation flag, this technique pushes these values nearer to the CPU at run time — they are the data manipulated by our program’s hot-spot, and any access-time overhead is costly.
```haskell
data St = St
  { idx1, m :: !Int
  , mk      :: !(String -> String)
  , used    :: !IntSet
  }

{-# INLINE matches #-}
matches :: Jaro c => c -> c -> (Int, String, [Int])
matches s1 s2 = (m, mk [], IntSet.toAscList used)
  where
    len2   = iLen s2
    window = max 0 (max (iLen s1) len2 `div` 2 - 1)

    St {m, mk, used} = foldl' snoc nil s1
      where
        nil = St { idx1 = 0, m = 0, mk = id, used = IntSet.empty }

        snoc st@St{..} c1 =
          (case inline List.find
                  (\idx -> IntSet.notMember idx used && c1 == s2 `index` idx)
                  [lo .. hi] of
             Nothing   -> st
             Just idx2 -> st { m = m + 1, mk = mk . (c1 :)
                             , used = IntSet.insert idx2 used }
          ) { idx1 = idx1 + 1 }
          where
            lo = max 0 (idx1 - window)
            hi = min (len2 - 1) (idx1 + window)
```
The INLINE pragma merely mitigates the function call overhead; we’re not aware of any optimizations it enables within jaro.
This particular definition of jaro and matches was the result of iterating on the various data-structure and slight algorithmic choices while refining the original specification from the previous post. The primary differences are some manually eliminated common subexpressions, some explicit inlining via GHC.Exts.inline, and some trivial changes to the types.
As a reminder, profiling was unfortunately of little use because of its current detrimental interactions with optimization, as mentioned in this GHC ticket. On our available machines, this bug either prevented any timing information or caused invalid compilation! Thus, we only had our own coarse-grained wall-clock measurements to work with, computed carefully with Data.Time.getCurrentTime and Control.DeepSeq.rnf.
Jaro instances
The instances for Jaro are crucial to the performance of the jaro variant that uses the datatype. Here, we have omitted the INLINE pragmas that saturate these instance declarations for the sake of presentation, but they are crucial to avoiding dictionary passing at run time. Also, recall that the execution of fromBS is not included in the measurements.
For ByteString, String, and Text, the instances are immediate.
```haskell
instance Jaro ByteString where
  fromBS = id
  iLen   = BS.length
  index  = BS.index
  foldl' = BS.foldl'

instance Jaro [Char] where
  fromBS = BS.unpack
  iLen   = length
  index  = (!!)
  foldl' = List.foldl'

instance Jaro T.Text where
  -- NB gets special input treatment instead
  fromBS = error "Text from BS"
  iLen   = T.length
  index  = T.index
  foldl' = T.foldl'
```
The functions used in these instances are defined in heavily optimized libraries. Surprisingly, this is apparently not the case for Seq Char.
```haskell
instance Jaro (Seq Char) where
  fromBS = Seq.fromList . BS.unpack
  iLen   = Seq.length
  index  = Seq.index
  foldl' = \snoc nil -> List.foldl' snoc nil . Fold.toList
```
The foldl' definition via the conversion to String consistently outperforms those via Data.Sequence.foldlWithIndex and Data.Foldable.foldl' by about 0.1 seconds, which is a degradation of about 15% to 50% relative to the execution times across our tests in this experiment. That would seem to merit some investigation; we haven’t any theories yet.
```haskell
instance Jaro (UArray Int Char) where
  fromBS = \bs -> IA.listArray (0, BS.length bs - 1) (BS.unpack bs)
  iLen   = IA.numElements
  index  = (IA.!)
  foldl' = \snoc nil arr ->
    let (lo, hi) = IA.bounds arr
        w !idx !acc
          | hi < idx  = acc
          | otherwise = w (succ idx) (snoc acc (arr IA.! idx))
    in w lo nil
```
criterion Harness
Our main function reads in the input file as a ByteString and then immediately prepares some thunks for converting that data into the various pairs for the cross product of datatypes (one for the short word, one for the long). We also use Data.Text’s lazy IO in order to thunk its input until it’s needed; we don’t convert it from a ByteString since Text supports richer encodings.
```haskell
type C0 = ByteString
type C1 = UArray Int Char
…

main = do
  ns      <- BS.lines `fmap` BS.readFile wordsFile
  ns_text <- (map (id &&& id) . T.lines . LT.toStrict) `fmap` LT.readFile wordsFile
  let ns00 = map (fromBS &&& fromBS) ns :: [(C0, C0)]
      ns01 = map (fromBS &&& fromBS) ns
      ns02 = map (fromBS &&& fromBS) ns
      …
      ns33 = map (fromBS &&& fromBS) ns :: [(C3, C3)]
```
We basically hand everything off to criterion from here. We use rnf “in the right places” in order to force the conversion to take place outside of the execution window that criterion measures. Each bgroup corresponds to one of the sixteen pairs of representation types (e.g. ByteString and UArray Int Char) and contains all of the benchmarks that use that input data. Combined with the cfgPerformGC option to criterion, this should ensure that only the ByteString for the entire file and one conversion from that data are resident in memory at any given time.
```haskell
let cfg = defaultConfig { cfgPerformGC   = ljust True
                        , cfgSummaryFile = ljust "crit.csv" }
defaultMainWith cfg (return ())
  [ bgroup "00"
      [ bench "jaro0"  $ rnf ns00 `seq` whnfIO (critrun jaro0  ns00)
      , bench "jaro0'" $ whnfIO (critrun jaro0' ns00)
      , … {- a bench for each variant that uses ns00 -} ]
  , bgroup "01" [ bench "jaro01" $ rnf ns01 `seq` whnfIO (critrun jaro01 ns01) ]
  , bgroup "02" [ bench "jaro02" $ rnf ns02 `seq` whnfIO (critrun jaro02 ns02) ]
  , … {- a bgroup for each input type except Text -}
  , bgroup "text"
      [ bench "jaro_text" $ rnf ns_text `seq` whnfIO (critrun jaro_text ns_text) ]
  ]
```
We compile with
```
ghc -O2 -funbox-strict-fields -fforce-recomp \
    -fspec-constr-count=6 --make Main
```
We just bumped the SpecConstr count until the warnings went away; 6 didn’t seem offensively large, and there were at most two call patterns (recall that GHC reduces the count as part of its heuristic).
Results and Discussion
Measurements
We used the 8K /usr/share/dict/propernames file on our Mac OS X distributions to run the tests, since Jaro-Winkler is designed primarily to work on names. Our test harness traverses the diagonalization of the names in this file and sums the similarity metrics between all of the words (this diverges from the first post’s test harness, but it’s much faster since we’re running so many variants and iterations). The result is a nonsensical number, but it forces the necessary evaluation, has a small memory footprint, and represents a hash of sorts of the file’s contents. All of the jaro variants computed the same sum (without the profiling, that is), so we’re confident they are equivalent. As this is an exercise in autodidacticism, we did not compare for correctness against the golden C implementation of Jaro-Winkler (linked to by the Wikipedia page referenced above).
It is well understood that standard Haskell Strings limit performance; this is why ByteString was written in the first place. For this experiment, we primarily focused on various data structures for representing the words during the two loops in matches. Since each loop accesses its respective argument differently (short word: outer loop; long word: inner loop), we tried all sixteen combinations of our four datatypes of interest. Direct benchmarking via Data.Time.getCurrentTime and Control.DeepSeq.rnf yielded the following measurements.
| Outer loop | Inner loop | Measured time (s) |
|---|---|---|
| ByteString | ByteString | 0.22892195665577 |
| ByteString | UArray Int Char | 0.3117424055312 |
| ByteString | String | 0.4186133953307 |
| ByteString | Seq Char | 0.62530112468937 |
| UArray Int Char | ByteString | 0.32210345232228 |
| UArray Int Char | UArray Int Char | 0.33752040827015 |
| UArray Int Char | String | 0.45714450323322 |
| UArray Int Char | Seq Char | 0.68065535986164 |
| String | ByteString | 0.37693604194859 |
| String | UArray Int Char | 0.40947800600269 |
| String | String | 0.49661373102406 |
| String | Seq Char | 0.74052640163639 |
| Seq Char | ByteString | 0.53333888256291 |
| Seq Char | UArray Int Char | 0.58034939014652 |
| Seq Char | String | 0.68305074178913 |
| Seq Char | Seq Char | 0.8523849221442 |
The Datatypes
The graph from the introduction tells the big story: ByteString is best. We chose the other types in order to explore our straightforward and simple hypotheses. We included String since it’s the most familiar and default type in Haskell for words; we were surprised it fared so well. UArray Int Char was intended as a less optimized version of ByteString: we presumed that its implementation is basically the same with fewer possible rewrites and slightly worse access times, and so we expected a constant factor of overhead compared to ByteString. Our use of Data.Sequence is slightly vestigial: an earlier algorithmic variant used its queue-like interface (O(1) tail and snoc) to slide the matching window across the second string. Without profiling, we haven’t the means to determine what’s “wrong” with Data.Sequence; the time complexities listed in its documentation suggest it should do better than String.
In order to better compare the types, we chart the times with a fixed type for the short word and average across the measured times for each of the possible types for the longer word. In other words (har), we fix the type of the outer loop and vary the type of the inner loop.
Next, we do the opposite: fix the inner loop type and vary the outer loop type, again averaging the results to a single scalar.
Without going into statistics, the best we can do is eyeball it. It does seem that the “slope” of the second graph is steeper, which indicates that the datatype used for the inner loop has a greater effect on performance. That sounds reasonable.
Algorithmic Variations
Strict Tuple Patterns
Unaccustomed to low-level optimizing, we didn’t immediately remember to employ a data declaration with strict, unboxed fields. Its performance benefit at the time was startling: a factor of 2 or 3. So we’ve retained variants of jaro that use simple tuples saturated with strictness annotations, just to see the difference.
| Datatype | Measured time (s) | Degradation |
|---|---|---|
| ByteString | 0.35290416443089 | 54.16% |
| UArray Int Char | 0.45579727375248 | 35.04% |
| String | 0.61194113933781 | 23.22% |
| Seq Char | 0.99888140403965 | 17.19% |
The data declaration with strict fields is clearly crucial.
Mutable bit array
All of the measured variants discussed above use a Data.IntSet to track which of the characters in the second string have already been matched. Since this is a persistent data structure — even though it’s so tightly packed and efficient — we thought that changing it to a mutable bit array would be a big win. However, all of our attempts at doing so (including ST, unsafePerformIO, re-using a singly allocated array to allow Data.ByteString.Internal.inlinePerformIO within the loop, cloning Data.ByteString.foldl' to work with a monadic snoc, etc.) failed to perform as well as ByteString + IntSet. We peeked at the Core enough to suppose that there were bounds checks interfering with some optimizations; the IntSet code had 10 instances of the loops whereas the array code had only 4. After eliminating the bounds checks (with some really low-level coding to circumvent the Data.Array API), there were a matching 10 instances of the loops in the array-based code, but it remained less performant. Without profiling and a better understanding/navigation of Core, we’re unable to determine why.
The variant with the singly allocated array, using only Data.ByteString.Internal.inlinePerformIO within the inner loop, had a measured time of 0.3209107 s — a 40.18% performance degradation.
Inlining List.find
In the definition of matches, we use GHC.Exts.inline to force the inlining of Data.List.find. Since it’s a library function, this is the only way we are aware of to force it to be inlined. Without inlining it, the ByteString jaro averages 0.29225487196186 s, which is a 27.67% performance degradation. We surmise that its unfolding enables a fusion rule for filter and enumFromTo.
Having never worked this hard at optimizing a function, we were surprised that a use of a standard function like List.find wasn’t optimally handled by GHC automatically. In particular, we thought it was very likely to unfold functions with such short definitions; perhaps the library must mark it INLINABLE with a pragma? Odd that such a standard library wouldn’t. Looking now at the definition of Data.List.find, it seems it’d be more efficient to case directly on the result of the filter instead of converting to a Maybe. Is listToMaybe INLINEd even though it has no such pragma? (At the time of writing, those links hit base-4.3.1.0.)
Eliminating the double traversal
The counting of transpositions in jaro traverses the strings of matched characters. With respect to the short string, the characters occur in the exact same order within the outer loop of matches. These characters are essentially traversed twice, and — presumably worse still — a list is constructed in between, bound to x1' in jaro.
We wrote a version using circular programming to avoid the construction of the string of matched characters from the short word. The resulting fold counts the transpositions “as” it determines the matches.
```haskell
data St n = St
  { idx1, m :: !Int
  , t2      :: n
  , used    :: !IntSet
  , hits2   :: String
  }

matches_t :: (Jaro c, Fractional n, Enum n) => c -> c -> (Int, n)
matches_t s1 s2 = (m, t2 / 2)
  where
    window = max 0 (max (iLen s1) len2 `div` 2 - 1)
    len2   = iLen s2

    St {m, used, t2} = foldl' snoc nil s1
      where
        nil = St { idx1 = 0, m = 0, t2 = 0, used = IntSet.empty
                 , hits2 = map (index s2) (IntSet.toAscList used) }

        snoc st@St{..} c1 =
          (case inline List.find
                  (\idx -> IntSet.notMember idx used && c1 == s2 `index` idx)
                  [lo .. hi] of
             Nothing   -> st
             Just idx2 -> st { m = m + 1, hits2 = tail hits2
                             , used = IntSet.insert idx2 used
                             , t2 = (if c1 /= head hits2 then succ else id) t2 }
          ) { idx1 = idx1 + 1 }
          where
            lo = max 0 (idx1 - window)
            hi = min (len2 - 1) (idx1 + window)
```
The circularity is between the outermost binding of used and the initial value of the hits2 field in the nil state. Divergence is avoided because, during the traversal, none of the other fields depend on hits2 or t2, and their fields in St are not strict. As a result, those fields accumulate as thunks, which only collapse once t2 is forced. This ultimately causes the rest of the traversal to take place first.
With ByteStrings, this algorithm and the primary algorithm perform the same at optimization level -O1. We surmise that the thunks that build up are operationally comparable to the construction of the list of matched characters, x1'. That would still eliminate the redundant traversal, but for the fact that the thunks merely accumulate during the fold; their collapse happens only once t2 is forced, and is effectively a second traversal. Furthermore, with -O2 the circular program is significantly less performant than the primary algorithm; we surmise that the circularity prevents some of the more aggressive optimizations.
Though the circular programming removes the redundant traversal from the code, it fails to eliminate it from the execution. We surmise the existence of a much more invasive alteration of the primary algorithm where the outer loop counts the transpositions for the prefix of the longer string that is beyond the lower limit of the matching window for the shorter string’s last character. This ought to reduce the redundancy of the second traversal. We have not yet implemented it.
“Surmise, surmise, surmise”, you say. We’re trying to not look at the generated Core, especially since -O2 does a lot of unfolding. Perhaps heap profiling would be a good indicator? … You know, we do have research to do.
criterion
We jacked our code into Bryan O’Sullivan’s criterion package late in our process. During development we did simple timing with Data.Time.getCurrentTime and a handful of iterations for quick turnaround.
For the numbers and graphs in this post, though, we ran everything in a single criterion harness. Fortunately, comparing the two data sets at the shared data points validates our simple method.
Phew! Everything just about lines up. The criterion benchmarking was run overnight, with no other processes — we assume that sort of OS-level inequity in the execution environment is where the seemingly constant offset comes from.
Conclusions & Questions
We kind of already concluded above. ByteString was best. We’re surprised we couldn’t beat IntSet by using a mutable array — that’s probably our most interesting result. But all of the IO-escaping (even via inlinePerformIO) seemed to inhibit some key optimizations. Maybe with the right refactoring and compilation flags, the mutable bit array would out-perform IntSet. Suffice it to say it apparently wouldn’t be a small, obvious change.
We also didn’t try the other unboxed array packages; we just hadn’t heard of them yet. So maybe there’s something else out there for mutable bit arrays that would outperform IntSet and not block those aggressive optimizations, as we’re supposing was happening with our attempts. Please let us know!
If you’re interested in getting access to the code we used:
- Leave a comment saying so! We’ll need some motivation to package it up since the semester is over…
- Brace yourself! There’s a lot of “duplicate” code. We manually circumvented SPECIALIZE pragmas, since our initial tests indicated some overhead. Also, our naming conventions got a little frantic at the end there. Maybe we’ll clean it up before posting it if you ask nice.
If you’re just after the fastest jaro definition (in order to try to beat it, perhaps? Hmm?), just wait for the next post. We’d love it if someone could show us some further tricks! We’d also really like to know why we’ve failed to speed it up with the mutable bit array.
Also, we are concerned about the apparent SPECIALIZE overhead we noticed. Is this a known issue? Are we missing something?
This point about missing flags deserves repetition. In fact…
Disclaimer: we could have overlooked crucial INLINEs or compilation flags.
Those sorts of things could really affect the results we’re getting, and so possibly invalidate some of our conclusions (e.g. those regarding inlining List.find or Seq Char’s slow folds). Unfortunately, we can’t even find Acovea on the Internets and aren’t otherwise familiar with the multitude of GHC flags. … Help?
Thanks again for reading! Our next post will discuss some more invasive algorithmic changes; there’s more we can do to speed up jaro, albeit not by much. Fuse, fuse, fuse.
Good work guys!
You could also try `Vector Char` and `Vector Word8` (the only library that implements the full stream fusion framework). In particular `Data.Vector.Unboxed.Vector Word8` should be comparable to bytestring.
Thanks! Will do.
Nice write-up!
Would you mind posting the complete working code of the best implementation so far. I’d like to have a look at strictness and specialization issues. Please also post the command you use to compile the code.
Can do! Sorry for stalling 🙂
Are there any particular datatypes you’re more or less interested in? Or just ByteString, since it’s been the fastest? We can serve up a slice for you.

Very interesting!
I’m looking to reuse this benchmark in order to do some benchmarking/optimization for the Text library. Is the full source code available somewhere, and is there any particular copyright note I should be aware of?
We haven’t actually considered licensing yet, but I strongly doubt it’ll be an obstacle to you. (Un)Fortunately, Evan is at this summer school right now, but I imagine we won’t need to communicate much to decide on something soon.
Just to add: one of the nice things about criterion is not just that it gives a more accurate measurement, but also that it gives you confidence intervals on the timing. It’s hard to compare the point estimates in your table without knowing the variability, so that’s one aspect where criterion does/will help you 🙂
I’m actually working on a new package vector-bytestring that provides the type synonym:
type ByteString = Data.Vector.Storable.Vector Word8
and exports the same API as bytestring but adapted to Vectors. I can present something in about a week.
ByteString would be enough. What I want to look at is code like this: jaro isn’t strict in s1 and s2 (as they aren’t used in the first branch or used to decide which branch to take) even though it probably should be. In addition, you probably want to SPECIALIZE the function to the actual types.

Pingback: D’oh! Bars | Functional Kansans
Very interesting. Just for the fun and experience, I’m working on a competing version with a very different implementation. I’ll reply to the next post with benchmark runs of your code and mine from my machine.
Have you considered using -fllvm? I saw a roughly 10% improvement using llvm over gcc.
We’ve since discovered an error in our benchmarking suite. Data.Text was at a disadvantage and actually performs on par with String. So, the Data.Text results were inaccurate — everything else should be fine AFAWK.

We found the error preparing our code for release (surprise surprise). Expect that post to go live today.
Pingback: Henny Penny Blues: An apologetic, but enthusiastic release of the Jaro benchmarking code | Functional Kansans
Pingback: Incremental optimizations | Functional Kansans