Incremental optimizations

As we refined the presentation of the algorithms for our first post, we developed multiple “naive” implementations. Between edits, we spent some time incrementally optimizing the definition below towards the primary algorithm (the one we tested with all the various data types in the data types post), specialized to strings, i.e. jaro2 :: String -> String -> Double.

import Data.List (find, sort)

jaro :: String -> String -> Double
jaro s1 s2
  | 0 == m    = 0
  | otherwise = (m / n1 + m / n2 + (m - t) / m) / 3
  where
    n1 = fromIntegral $ length s1
    n2 = fromIntegral $ length s2
    (used, m1) = matches s1 s2
    -- the matched characters of s2, in index order
    m2 = map (s2 !!) $ sort used
    m = fromIntegral $ length m1
    -- t is half the number of positional mismatches
    t = (/ 2) $ fromIntegral $ length $
        filter id $ zipWith (/=) m1 m2

matches :: String -> String -> ([Int], String)
matches s1 s2 = foldl snoc initst (zip s1 [0..]) where
  -- a character of s1 may only match within a window around its index
  window = max 0 (max (length s1) (length s2) `div` 2 - 1)
  initst = ([], "")

  snoc st@(used, m1) (c1, idx1) =
    case find p [lo .. hi] of
      Nothing   -> st
      Just idx2 -> (idx2 : used, m1 ++ [c1])
    where
      -- an index of s2 matches if it is unused and the characters agree
      p idx2 = idx2 `notElem` used && c1 == s2 !! idx2
      lo = max 0 (idx1 - window)
      hi = min (length s2 - 1) (idx1 + window)
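
As a quick sanity check, the classic MARTHA/MARHTA example should score 17/18: six matches and one transposition over two six-letter words.

ghci> jaro "martha" "marhta"
0.9444444444444445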

The error bars in the graph below are +/- the standard deviation as reported by the criterion package. See the full graph PNG if this is too hard to read.

We decrypt the labels for you in the table below, but you can mostly read this bar graph from top to bottom as chronological order. In other words, the lower bars mostly include the optimizations of the bars above them, with a few major exceptions. The “IntSet” variation abandons all of the optimizations between it and the original “List.sort”. The “! patterns” variation is a dead end and is replaced by the variant right below it, “data with ! fields instead”. All of the variants from the “beta in t” label down are one-shot variations of that same algorithm; they stop accumulating. Moreover, “(jaro2)” is the primary algorithm at the String type, not a refinement of jaro.

For the longest time, we were stuck at the “outer loop = shorter string” version. All of the variations below that one explore refactorings that we had assumed GHC performs automatically. In particular, we were mistaken to assume that notElem e . List.takeWhile (<= e) would get fused and that multiple occurrences of length s2 would get common-subexpression-eliminated. We continued testing a few other things we take for granted, but found no more surprises: those last four variations don’t seem to make a difference.
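
For instance, here is a hand-fused form of that membership test, a minimal sketch of what we had expected GHC to derive from notElem e . List.takeWhile (<= e) on a sorted list (the name notMember is ours):

notMember :: Int -> [Int] -> Bool
notMember e = go where
  go []     = True
  go (x:xs)
    | x > e     = True   -- sorted list: e cannot occur further on
    | x == e    = False
    | otherwise = go xs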

Variation label: explanation
List.sort: The naive algorithm above
List.sort + all: Difference list instead of snoc, unboxed strict fields for the loop state, strict foldl', inlined List.find, and idx1 carried in the state instead of via zip
List.insert: Reset to just the naive algorithm, but maintain the list of second-string indices in sorted order and use it to short-circuit the notElem in the List.find predicate
difference list: Also use a difference list instead of snoc to build m1
data with ! fields: Also use strict unboxed fields for the loop state and strict foldl'
inline List.find + fuse foldl'/zip: Also inline List.find and carry idx1 in the state instead of via zip
IntSet: Reset to just the naive algorithm, but use Data.IntSet to track the used indices from the second string
difference list: As above
! patterns: Also add strictness annotations on the tuple patterns over the state and strict foldl'
data with ! fields instead: Forgo the tuples for the unboxed strict fields as above (see the sketch after this table)
inline List.find: As above
fuse foldl'/zip: As above
m in state: Also count the matches in the loop state instead of outside of matches
outer loop = shorter string: Also use the shorter string for the outer loop
fused filter/zipWith: Also manually fuse the loops for calculating t
(+1) instead of succ: Also change all succs to (+ 1) in order to avoid an overflow check!
(jaro2): Not actually a refinement; this is the primary algorithm with String for both input words
beta in t, cse (length s2): When calculating t, carry the Int through the loop instead of building a function in the loop and applying it to 0; also inline pred
un-cse (length s2): Do not manually eliminate the common length s2 subexpressions within matches; this confirms that GHC does not combine those subexpressions automatically
IntSet.toAscList in t: Convert the IntSet to a list outside of matches instead of inside
lifted nil: Define nil in a local where clause instead of in parallel with the foldl' invocation
named pred: Un-inline the List.find predicate (not a lambda-lift, just a binding)
manually inline snoc: Manually inline snoc
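
For the curious, here is a minimal sketch of what the accumulated loop state might look like by the “m in state” row; the field layout and names are our guesses for exposition, not the released code:

import qualified Data.IntSet as IntSet

data St = St
  !IntSet.IntSet       -- used: matched indices into the second string
  {-# UNPACK #-} !Int  -- idx1, carried in the state instead of via zip
  {-# UNPACK #-} !Int  -- m, the running match count
  (String -> String)   -- m1 as a difference list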

The Best jaro Yet

When deriving all of these specifications after having run the experiment, we of course ended up improving on the primary algorithm. These variations come from improving the loop in the calculation of t; they all use the same matches function, which differs from the primary one only in that it builds the list of matched characters (m1) simply with the (:) constructor instead of snoc or a difference list, which is the basic idea of all these optimizations. As a result, however, the characters are in reversed order, and so the IntSet of matched indices into the second string needs to be traversed in reverse order as well. Since the two sequences are the same length, it doesn’t matter in which direction we traverse them in tandem, so long as both go the same way.
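
Concretely, where the naive snoc above produces (idx2 : used, m1 ++ [c1]), these variants build the new state along the lines of the following (our reconstruction, with used now an IntSet):

      Just idx2 -> (IntSet.insert idx2 used, c1 : m1)  -- m1 comes out reversed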

Unfortunately, Data.IntSet does not export a toDescList analogue of toAscList, so we experimented with the options in the graph. The best one fuses the t loop with the flattening of the IntSet (to a list) via IntSet.fold.
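
Here is a minimal sketch of that fusion under our assumptions; the name mismatches is ours, and we index with s2 !! idx2 for clarity where the benchmarked code can do better. Since IntSet.fold visits keys in ascending order, the accumulator is a function that consumes the reversed match list on the way back out, so each index takes its character only after all larger indices have taken theirs:

import qualified Data.IntSet as IntSet

mismatches :: String -> IntSet.IntSet -> String -> Int
mismatches s2 used m1rev = fst (IntSet.fold step nil used m1rev)
  where
    nil cs = (0 :: Int, cs)
    -- pairs each index with a character in descending index order
    step idx2 rest cs = case rest cs of
      (n, c1:cs') -> (if c1 /= s2 !! idx2 then n + 1 else n, cs')
      (n, [])     -> (n, [])  -- unreachable: m1rev and used have equal size

As before, t is half of this count.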

In order to continue optimizing the algorithm, we will need to investigate the arithmetic primitives (probably by inspecting the generated assembly) or change the interface to, for example, work on multiple longer strings (i.e. the inner loop) at once.

We actually do have one idea within the current interface, but its effects will likely only be visible for really long words: most of the t loop can be fused with the matches outer loop, excluding the phase of the outer loop where the window (i.e. hi) has reached the end of the longer string.

Qualifying our Experimental Design

Recall that profiling does not interact well with optimization in GHC. Hence, none of these changes to the algorithm were inspired by hard evidence: we can’t trust the profiling information right now. Each change is merely a possibility we considered when looking at the source for inefficiencies. We were (uncomfortably) flying blind.

Another consideration is that some of these variants could behave drastically differently under compilation flags (or, less likely, different RTS flags) we haven’t considered. We sought out the tool Don Stewart used, Acovea, but could not find anything but broken links.

Code on its way

Sorry for another round of delays, but yet again the code exploring these optimizations will be polished up and released a bit later. One of the paper deadlines we’re under was extended, which is always a blessing, but it also means that its import continues to loom over our time budget. This code needs some work, since we’ve been quite lax with our naming conventions (ticks ad nauseam…). Thanks!

2 responses to “Incremental optimizations”

  1. The source code for Acovea is still available via the Internet Archive, here. The downloads are at the top right of the page, just under the logo. There are also some build instructions.
