As we refined the presentation of the algorithms for our first post, we developed multiple “naive” implementations. Between edits, we spent some time incrementally optimizing the definition below towards the primary algorithm — the one that we tested with all the various data types in the data types post — specialized to strings (i.e. `jaro2 :: String -> String -> Double`).
```haskell
import Data.List (find, sort)

jaro :: String -> String -> Double
jaro s1 s2
  | 0 == m    = 0
  | otherwise = (m / n1 + m / n2 + (m - t) / m) / 3
  where
    n1 = fromIntegral $ length s1
    n2 = fromIntegral $ length s2
    (used, m1) = matches s1 s2
    m2 = map (s2 !!) $ sort used
    m  = fromIntegral $ length m1
    t  = (/ 2) $ fromIntegral $ length $ filter id $ zipWith (/=) m1 m2

matches :: String -> String -> ([Int], String)
matches s1 s2 = foldl snoc initst (zip s1 [0 ..])
  where
    window = max 0 (max (length s1) (length s2) `div` 2 - 1)
    initst = ([], "")
    snoc st@(used, m1) (c1, idx1) =
      case find p [lo .. hi] of
        Nothing   -> st
        Just idx2 -> (idx2 : used, m1 ++ [c1])
      where
        p idx2 = idx2 `notElem` used && c1 == s2 !! idx2
        lo = max 0 (idx1 - window)
        hi = min (length s2 - 1) (idx1 + window)
```
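As a quick sanity check (our worked example, not from the benchmarks), consider the classic pair "martha" / "marhta": all six characters find a partner within the window, and the matched sequences differ at two positions, so there is one transposition.

```haskell
-- Worked example of the Jaro formula above (our arithmetic):
-- s1 = "martha", s2 = "marhta" gives m = 6 matches; the matched
-- characters m1 = "martha" and m2 = "marhta" differ at two positions,
-- so t = 2 / 2 = 1, and the score is (1 + 1 + 5/6) / 3 = 17/18.
example :: Double
example = (6 / 6 + 6 / 6 + (6 - 1) / 6) / 3   -- roughly 0.944
```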
The error bars in the graph below are +/- the standard deviation as reported by the `criterion` package. See the full graph PNG if this is too hard to read.
We decrypt the labels for you in the table below, but you can mostly read this bar graph from top to bottom as chronological order. In other words, the lower bars mostly include the optimizations of the bars above, with a few major exceptions. The “IntSet” variation abandons all of the optimizations between it and the original “List.sort”. The “! patterns” variation is a dead end, and is replaced with the variant right below it, “data with ! fields…”. And all of the variants under the “beta-reduce in `t`” label are one-shot variations of that same algorithm — they stop accumulating. Moreover, “(`jaro2`)” is the primary algorithm at the `String` type, not a refinement of the variants above it.
For the longest time, we were stuck at the “outer loop = shorter string” version. All of the variations below that one explore refactorings that we had assumed GHC does automatically. In particular, we were mistaken to assume that `notElem e . List.takeWhile (<= e)` would get fused and that multiple occurrences of `length s2` would get common-subexpression-eliminated. We continued testing a few other things we take for granted, but found no more surprises: those last four variations don’t seem to make a difference.
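For concreteness, here is a sketch (our code, with a hypothetical name) of the single pass we had assumed `notElem e . List.takeWhile (<= e)` would fuse into:

```haskell
-- One pass over a sorted list: report that e is absent as soon as we see
-- an element greater than e. This is what the composition
-- `notElem e . takeWhile (<= e)` computes, but GHC does not fuse the two
-- traversals for us; the short-circuiting loop has to be written by hand.
notElemSorted :: Int -> [Int] -> Bool
notElemSorted e = go
  where
    go []       = True
    go (x : xs)
      | x > e     = True    -- sorted: e cannot appear later
      | x == e    = False
      | otherwise = go xs
```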
| Label | Description |
|---|---|
| `List.sort` | The naive algorithm above |
| `List.sort + all` | Difference list instead of `m1 ++ [c1]` |
| `List.insert` | Reset to just the naive algorithm, but maintain the list of second-string indices in sorted order and use this to short-circuit the `notElem` check |
| `difference list` | Also use a difference list instead of `m1 ++ [c1]` |
| `data with ! fields` | Also use strict unboxed fields for the loop state and strict |
| `inline List.find + fuse foldl’/zip` | Also inline `List.find` and manually fuse the `foldl’` with the `zip` |
| `IntSet` | Reset to just the naive algorithm, but use an `IntSet` for the matched indices in the second string |
| `difference list` | as above |
| `! patterns` | Also add strictness annotations on the tuple patterns over the state and strict |
| `data with ! fields instead` | Forgo tuples for the unboxed strict fields as above |
| `inline List.find` | as above |
| `fuse foldl’/zip` | as above |
| `m in state` | Also count the matches in the loop state instead of outside the loop |
| `outer loop = shorter string` | Also use the shorter string for the outer loop |
| `fused filter/zipWith` | Also manually fuse the loops for calculating `t` |
| `(+1) instead of succ` | Also change all uses of `succ` to `(+1)` |
| `(jaro2)` | Not actually a refinement; this is the primary algorithm specialized to `String` |
| `beta in t, cse (length s2)` | When calculating `t`, beta-reduce and manually eliminate the common subexpression `length s2` |
| `un-cse (length s2)` | Do not manually eliminate the common subexpression `length s2` |
| `IntSet.toAscList in t` | Convert the `IntSet` to an ascending list with `IntSet.toAscList` when calculating `t` |
| `named pred` | Un-inline the predicate passed to `List.find` |
| `manually inline snoc` | Manually inline `snoc` |
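To illustrate the “difference list” rows above, here is a minimal sketch (our names, not the benchmarked code) of the representation that turns the O(n) snoc `m1 ++ [c1]` into an O(1) operation:

```haskell
-- A difference list represents a list as a function prepending it to a
-- yet-unknown tail, so appending one element composes functions instead
-- of copying the whole prefix.
type DList a = [a] -> [a]

emptyD :: DList a
emptyD = id

snocD :: DList a -> a -> DList a
snocD f c = f . (c :)   -- O(1), versus O(length) for xs ++ [c]

toListD :: DList a -> [a]
toListD f = f []
```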
The Best `jaro` Yet
When deriving all of these specifications after having run the experiment, we of course ended up improving on the primary algorithm. These variations come from improving the loop in the calculation of `t`; they all use the same `matches` function, which differs from the primary one only in that it builds the list of matched characters (`m1`) simply using the `(:)` constructor instead of a snoc or a difference list, which is the basic idea of all these optimizations. As a result, however, the characters are in reversed order, and so the `IntSet` of matched indices in the second string needs to be traversed in reverse order as well. Since the two have the same length, it doesn’t matter in which order we traverse them in tandem.
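A sketch of that `matches` variant (our reconstruction from the description above; names are ours): the only change from the primary algorithm is accumulating with `(:)`, so the matched characters come back reversed.

```haskell
import qualified Data.IntSet as IntSet
import Data.List (find)

-- Like the primary matches, but the matched characters are accumulated
-- with (:) instead of a snoc, so the returned String is in *reverse*
-- match order and must be paired with the IntSet traversed descending.
matchesRev :: String -> String -> (IntSet.IntSet, String)
matchesRev s1 s2 = foldl go (IntSet.empty, "") (zip s1 [0 ..])
  where
    window = max 0 (max (length s1) (length s2) `div` 2 - 1)
    go st@(used, m1) (c1, idx1) =
      case find p [lo .. hi] of
        Nothing   -> st
        Just idx2 -> (IntSet.insert idx2 used, c1 : m1)  -- O(1) cons
      where
        p idx2 = idx2 `IntSet.notMember` used && c1 == s2 !! idx2
        lo = max 0 (idx1 - window)
        hi = min (length s2 - 1) (idx1 + window)
```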
`Data.IntSet` does not export a `toDescList` analog to `toAscList`, so we experimented with the options in the graph. The best one fuses the `t` loop with the flattening (to a list) of the `IntSet`.
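One way to realize that fusion (a sketch under our naming, not necessarily the exact benchmarked code) is to CPS the ascending `IntSet.foldl`: the outermost application of the step corresponds to the largest index, so the set is effectively consumed in descending order, in step with the reversed match list, and no intermediate list is built.

```haskell
import qualified Data.IntSet as IntSet

-- Count mismatches between the reversed matched characters m1rev and the
-- characters of s2 at the matched indices, visiting the IntSet in
-- descending order. IntSet.foldl itself visits ascending, but folding up
-- a continuation reverses the order in which the work is performed.
mismatchesDesc :: String -> String -> IntSet.IntSet -> Int
mismatchesDesc s2 m1rev used = IntSet.foldl step finish used m1rev 0
  where
    finish _ n = n
    step k i (c : cs) n = k cs (if c /= s2 !! i then n + 1 else n)
    step k _ []       n = k [] n  -- lengths agree when inputs come from matches
    -- t in jaro is then: fromIntegral (mismatchesDesc s2 m1rev used) / 2
```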
In order to continue optimizing the algorithm, we will need to investigate the arithmetic primitives (probably by inspecting the generated assembly) or change the interface to, for example, work on multiple longer strings (i.e. inner loops) at once.
We actually do have one idea within the current interface, but its effects will likely only be visible for really long words: most of the `t` loop can be fused with the `matches` outer loop, excluding the phase of the outer loop where the window (i.e. `hi`) has reached the end of the longer string.
Qualifying our Experimental Design
Recall that profiling does not interact well with optimization in GHC. Hence, none of these changes to the algorithm were inspired by hard evidence: we can’t trust the profiling information right now. Each change is merely a possibility we considered when reading the source for inefficiencies. We were (uncomfortably) flying blind.
Another consideration is that some of these variants could behave drastically differently under compilation flags (or, less likely, different RTS flags) we haven’t considered. We sought out the tool Don Stewart used, acovea, but could not find anything but broken links.
Code on its way
Sorry for another round of delays, but yet again the code exploring these optimizations will be polished up and released a bit later. One of the paper deadlines we’re under was extended, which is always a blessing, but it also means that its import continues to loom over our time budget. This code needs some work, since we’ve been quite lax with our naming conventions (ticks ad nauseam…). Thanks!