Incremental optimizations

As we refined the presentation of the algorithms for our first post, we developed multiple “naive” implementations. Between edits, we spent some time incrementally optimizing the definition below towards the primary algorithm (the one we tested with all the various data types in the data types post), specialized to strings, i.e. jaro2 :: String -> String -> Double.

import Data.List (find, sort)

jaro :: String -> String -> Double
jaro s1 s2
  | 0 == m    = 0
  | otherwise = (m / n1 + m / n2 + (m - t) / m) / 3
  where
    n1 = fromIntegral $ length s1
    n2 = fromIntegral $ length s2
    (used, m1) = matches s1 s2
    -- the matched characters of s2, in index order
    m2 = map (s2 !!) $ sort used
    m = fromIntegral $ length m1
    -- t is half the number of positional mismatches
    t = (/ 2) $ fromIntegral $ length $
        filter id $ zipWith (/=) m1 m2

matches :: String -> String -> ([Int], String)
matches s1 s2 = foldl snoc initst (zip s1 [0..]) where
  -- a character of s1 may only match within a window around its index
  window = max 0 (max (length s1) (length s2) `div` 2 - 1)
  initst = ([], "")

  snoc st@(used, m1) (c1, idx1) =
    case find p [lo .. hi] of
      Nothing   -> st
      Just idx2 -> (idx2 : used, m1 ++ [c1])
    where
      -- an index of s2 matches if it is unused and the characters agree
      p idx2 = idx2 `notElem` used && c1 == s2 !! idx2
      lo = max 0 (idx1 - window)
      hi = min (length s2 - 1) (idx1 + window)
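
As a quick sanity check, the classic MARTHA/MARHTA example should score 17/18: six matches and one transposition over two six-letter words.

ghci> jaro "martha" "marhta"
0.9444444444444445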

The error bars in the graph below are +/- the standard deviation as reported by the criterion package. See the full graph PNG if this is too hard to read.

We decrypt the labels for you in the table below, but you can mostly read this bar graph from top to bottom as chronological order. In other words, the lower bars mostly include the optimizations of the bars above them, with a few major exceptions. The “IntSet” variation abandons all of the optimizations between it and the original “List.sort”. The “! patterns” variation is a dead end and is replaced by the variant right below it, “data with ! fields instead”. All of the variants from the “beta in t” label down are one-shot variations of that same algorithm; they stop accumulating. Moreover, “(jaro2)” is the primary algorithm at the String type, not a refinement of jaro.

For the longest time, we were stuck at the “outer loop = shorter string” version. All of the variations below that one explore refactorings that we had assumed GHC performs automatically. In particular, we were mistaken to assume that notElem e . List.takeWhile (<= e) would get fused and that multiple occurrences of length s2 would get common-subexpression-eliminated. We continued testing a few other things we take for granted, but found no more surprises: those last four variations don’t seem to make a difference.
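
For instance, here is a hand-fused form of that membership test, a minimal sketch of what we had expected GHC to derive from notElem e . List.takeWhile (<= e) on a sorted list (the name notMember is ours):

notMember :: Int -> [Int] -> Bool
notMember e = go where
  go []     = True
  go (x:xs)
    | x > e     = True   -- sorted list: e cannot occur further on
    | x == e    = False
    | otherwise = go xs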

Variation label: explanation
List.sort: The naive algorithm above
List.sort + all: Difference list instead of snoc, unboxed strict fields for the loop state, strict foldl', inlined List.find, and idx1 carried in the state instead of via zip
List.insert: Reset to just the naive algorithm, but maintain the list of second-string indices in sorted order and use it to short-circuit the notElem in the List.find predicate
difference list: Also use a difference list instead of snoc to build m1
data with ! fields: Also use strict unboxed fields for the loop state and strict foldl'
inline List.find + fuse foldl'/zip: Also inline List.find and carry idx1 in the state instead of via zip
IntSet: Reset to just the naive algorithm, but use Data.IntSet to track the used indices from the second string
difference list: As above
! patterns: Also add strictness annotations on the tuple patterns over the state and strict foldl'
data with ! fields instead: Forgo the tuples for the unboxed strict fields as above (see the sketch after this table)
inline List.find: As above
fuse foldl'/zip: As above
m in state: Also count the matches in the loop state instead of outside of matches
outer loop = shorter string: Also use the shorter string for the outer loop
fused filter/zipWith: Also manually fuse the loops for calculating t
(+1) instead of succ: Also change all succs to (+ 1) in order to avoid an overflow check!
(jaro2): Not actually a refinement; this is the primary algorithm with String for both input words
beta in t, cse (length s2): When calculating t, carry the Int through the loop instead of building a function in the loop and applying it to 0; also inline pred
un-cse (length s2): Do not manually eliminate the common length s2 subexpressions within matches; this confirms that GHC does not combine those subexpressions automatically
IntSet.toAscList in t: Convert the IntSet to a list outside of matches instead of inside
lifted nil: Define nil in a local where clause instead of in parallel with the foldl' invocation
named pred: Un-inline the List.find predicate (not a lambda-lift, just a binding)
manually inline snoc: Manually inline snoc
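
For the curious, here is a minimal sketch of what the accumulated loop state might look like by the “m in state” row; the field layout and names are our guesses for exposition, not the released code:

import qualified Data.IntSet as IntSet

data St = St
  !IntSet.IntSet       -- used: matched indices into the second string
  {-# UNPACK #-} !Int  -- idx1, carried in the state instead of via zip
  {-# UNPACK #-} !Int  -- m, the running match count
  (String -> String)   -- m1 as a difference list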

The Best jaro Yet

When deriving all of these specifications after having run the experiment, we of course ended up improving on the primary algorithm. These variations come from improving the loop in the calculation of t; they all use the same matches function, which differs from the primary one only in that it builds the list of matched characters (m1) simply with the (:) constructor instead of snoc or a difference list, which is the basic idea of all these optimizations. As a result, however, the characters are in reversed order, and so the IntSet of matched indices into the second string needs to be traversed in reverse order as well. Since the two sequences are the same length, it doesn’t matter in which direction we traverse them in tandem, so long as both go the same way.
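
Concretely, where the naive snoc above produces (idx2 : used, m1 ++ [c1]), these variants build the new state along the lines of the following (our reconstruction, with used now an IntSet):

      Just idx2 -> (IntSet.insert idx2 used, c1 : m1)  -- m1 comes out reversed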

Unfortunately, Data.IntSet does not export a toDescList analogue of toAscList, so we experimented with the options in the graph. The best one fuses the t loop with the flattening of the IntSet (to a list) via IntSet.fold.
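
Here is a minimal sketch of that fusion under our assumptions; the name mismatches is ours, and we index with s2 !! idx2 for clarity where the benchmarked code can do better. Since IntSet.fold visits keys in ascending order, the accumulator is a function that consumes the reversed match list on the way back out, so each index takes its character only after all larger indices have taken theirs:

import qualified Data.IntSet as IntSet

mismatches :: String -> IntSet.IntSet -> String -> Int
mismatches s2 used m1rev = fst (IntSet.fold step nil used m1rev)
  where
    nil cs = (0 :: Int, cs)
    -- pairs each index with a character in descending index order
    step idx2 rest cs = case rest cs of
      (n, c1:cs') -> (if c1 /= s2 !! idx2 then n + 1 else n, cs')
      (n, [])     -> (n, [])  -- unreachable: m1rev and used have equal size

As before, t is half of this count.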

In order to continue optimizing the algorithm, we will need to investigate the arithmetic primitives (probably by inspecting the generated assembly) or change the interface to, for example, work on multiple longer strings (i.e. the inner loop) at once.

We actually do have one idea within the current interface, but its effects will likely only be visible for really long words: most of the t loop can be fused with the matches outer loop, excluding the phase of the outer loop where the window (i.e. hi) has reached the end of the longer string.

Qualifying our Experimental Design

Recall that profiling does not interact well with optimization in GHC. Hence, none of these changes to the algorithm were inspired by hard evidence: we can’t trust the profiling information right now. Each change is merely a possibility we considered when looking at the source for inefficiencies. We were (uncomfortably) flying blind.

Another consideration is that some of these variants could behave drastically differently under compilation flags (or, less likely, different RTS flags) we haven’t considered. We sought out the tool Don Stewart used, Acovea, but could not find anything but broken links.

Code on its way

Sorry for another round of delays, but yet again the code exploring these optimizations will be polished up and released a bit later. One of the paper deadlines we’re under was extended, which is always a blessing, but it also means that its import continues to loom over our time budget. This code needs some work, since we’ve been quite lax with our naming conventions (ticks ad nauseam…). Thanks!

2 responses to “Incremental optimizations”

  1. The source code for Acovea is still available via the Internet Archive, here. The downloads are at the top right of the page, just under the logo. There are also some build instructions.
