View Patterns and “And Patterns” for Record Patterns

View Patterns

Man oh man, I’m fond of view patterns.

I don’t know when it happened, but I became slightly obsessed with scoping the Haskell I write. I try both to make as few bindings as possible (without obfuscating too much) and to minimize the scope of each binding. I find view patterns crucial for the first one.

This post was prompted by the newest way I found to use view patterns. I had pondered using them to simplify some of my nested record patterns, but I had thought it would interfere with @-patterns. It was a pleasant surprise to realize that changes like the following work just right.

The pattern tds@TDS {rename_ = Rename {ctor_ = ren}}
becomes       tds@(ctor_ . rename_ -> ren).

The improvement here, in my opinion, is that we eliminate the redundancy between the constructors and their field patterns. Of course, multiple constructors in the same datatype can have fields with the same label, so understand that the view pattern is less specific than the record syntax I shared above — unless TDS and Rename are the only constructors of their respective types.
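To make that concrete, here is a minimal sketch that compiles with ViewPatterns; the TDS and Rename declarations are hypothetical stand-ins with just enough structure to show both spellings (and they are single-constructor, so the view is total, per the caveat above).

{-# LANGUAGE ViewPatterns #-}

-- Hypothetical stand-ins for the real types.
data TDS    = TDS    { rename_ :: Rename }
data Rename = Rename { ctor_   :: String }

-- Nested record pattern:
viaRecord :: TDS -> (TDS, String)
viaRecord tds@TDS {rename_ = Rename {ctor_ = ren}} = (tds, ren)

-- The same binding via a view pattern, composing the field selectors:
viaView :: TDS -> (TDS, String)
viaView tds@(ctor_ . rename_ -> ren) = (tds, ren)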

“And Patterns”

If you want multiple fields from a “level” of the pattern, then you need to do some extra plumbing. I suggest

infixr 3 :&, &:
f &: g = \x -> f x :& g x
data x :& y = x :& y

Thus

tds@TDS {rename_ = Rename {ctor_ = ren},
         fs_ = fs@FS{targets_ = ots, require_ = req}}

becomes

tds@(rename_ &: fs_ -> (ctor_ -> ren) :&
                fs@(targets_ &: require_ -> ots :& req))

Even I admit that’s a little dense; it’s unfortunate that the field name and the corresponding pattern get separated. We’d need a pattern combinator akin to &: in order to mitigate that separation, but, without a language extension, that would require metaprogramming. The Template Haskell seems a little out of hand…

(&) :: Q Pat -> Q Pat -> Q Pat
p1 & p2 = viewP [| id &: id |] [p| p1 :& p2 |]

tds@$([p| rename_ -> ctor_ -> ren |] &
      [p| fs@(fs_ -> $([p| targets_ -> ots |] &
                       [p| require_ -> req |])) |])

Also, it seems TH doesn’t currently parse view patterns inside quotations.

The & combinator, though, is pretty nice; I call it an and pattern. The semantics are simply that both of its pattern arguments need to match in order for it to match. Quasiquotation offers another way forward with the hypothetical

tds@[andp| rename_ -> ctor_ -> ren &
           fs_ -> fs@(targets_ -> ots & require_ -> req) |]

where & has really low fixity and corresponds to a use of & as defined above. It’d probably be easiest to write such a customized pattern parser where view patterns are limited to applications of a named function and type ascriptions are disallowed; that way you’re not parsing expressions and types.
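To tie the section together, here is a hedged, self-contained version of the &:-based rewrite above that compiles today, without Template Haskell or quasiquotation; the TDS, Rename, and FS declarations (and their field types) are hypothetical stand-ins.

{-# LANGUAGE ViewPatterns, TypeOperators #-}

infixr 3 :&, &:
data x :& y = x :& y
(&:) :: (a -> x) -> (a -> y) -> a -> x :& y
f &: g = \x -> f x :& g x

-- Hypothetical record declarations for the example.
data TDS    = TDS    { rename_ :: Rename, fs_ :: FS }
data Rename = Rename { ctor_ :: String }
data FS     = FS     { targets_ :: [String], require_ :: Bool }

example :: TDS -> (TDS, String, FS, [String], Bool)
example tds@(rename_ &: fs_ -> (ctor_ -> ren) :&
               fs@(targets_ &: require_ -> ots :& req))
  = (tds, ren, fs, ots, req)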

Deferring implicit imports (Rosetta use-clauses)

Intro

I’m going to use this post to checkpoint some of my recent research. I realized the other day that I don’t know any practical motivations for this work, so I’m going to summarize it in this post and seek feedback regarding its worthwhileness.

Background

Rosetta is a specification language with full-spectrum dependent types and first-class records. These features turn out not to play very well with a particular import (or opening) construct that we are hoping to support in Rosetta.

use-expressions

The syntax for the import mechanism we’re after is use t1 in t2. In-house, we call it a use-expression. Such a term is well-formed if t1 is a well-formed term denoting a record and t2 is well-formed under the assumption that the fields exported by t1 are in scope as variables. Thus we say that a use-expression binds the fields of t1 in t2.

In more general vocabulary, I characterize use-expressions as an implicit import mechanism in order to emphasize the fact that the names it brings into scope do not themselves occur in its syntax. This is a key distinction: an explicit import mechanism is much simpler to formalize since the names are immediately available.
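A rough Haskell analogy (mine, not a Rosetta construct) of that distinction:

import Data.List (sort, nub)  -- explicit: the names it binds appear in its own syntax
import Data.List              -- implicit: it binds whatever Data.List's interface exports

sortedUnique :: [Int] -> [Int]
sortedUnique = sort . nub     -- sort and nub were named at an import site

shortestFirst :: [[Int]] -> [[Int]]
shortestFirst = sortOn length -- sortOn arrived only via the open import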

Metavariables

As a consequence of the dependent types and the fact that Rosetta is a logic, we have a lot of metavariables in our metatheory. These can arise for a variety of reasons. One example is what Rosetta jargon deems “uninterpreted values”: Rosetta declarations can bring a value into scope by specifying its type without specifying its value. So, given the declaration F :: integer -> type, all the rest of the spec sees is that F is a type constructor. The value of F is represented as a metavariable. Furthermore, there’s not yet enough information to know anything but the type of an application of F. Thus, the tools also generate a metavariable for an application of F. Until more details are available about F (i.e. it is refined, perhaps via user interaction), all we know about this metavariable is that it is equivalent to the application of F. The same (or an equivalent) metavariable represents all applications of F to the same (or an equivalent) argument.

The Question

I sought to answer the question: What happens when you use a term that has a metavariable as its type? The essence of the problem is that the names being bound are not immediately available. Since the type of the used term is just a metavariable, the names of its exported fields cannot necessarily be determined. This complicates the metatheory, since conventional term representations tacitly assume that names can be resolved (i.e. occurrences can be related to the corresponding declaration) immediately.

I have experimented with various context formers (e.g. …, using u# = x :: T, … |- … where u# is a fresh name), typing and term equivalence rules (e.g. use-bound names become projections when the name is known, or else all name occurrences under a use-binding become a metavariable that is suitably related to the used term), restrictions on well-formedness that would partially simplify the metatheory (e.g. names bound by a use-clause do/must not eclipse existing names), and term representations (e.g. locally nameless with explicit substitutions). The metatheory either gets pretty complicated — i.e. lots of metavariables everywhere (disclaimer: perhaps it’s already this complicated in most provers?) — or the restrictions are unnatural for the user (e.g. some kinds of shadowing are ignored: (let x :: int = 4 in x) == (let x :: integer = 4 in use <x :: bool = True> in x)).

The Rub

I neglected to motivate implicit import mechanisms above. Their major benefit is that they decouple the import site (i.e. the use-expression) from the interface of the imported record/module (i.e. the definition of the used value). This decoupling improves modularity in a traditional software engineering way. They’re quite similar to Java’s lib.piece.* import mechanism.

There’s actually a dual to this benefit that is itself a perhaps more significant drawback: if the used value is extended with another field, then all import sites will suddenly incur an extra binding that might not have been foreseen — this is akin to unwanted variable capture! This drawback can only be safely ignored when the interface of the imported record/module/library is very stable or standardized. But, in these cases, it’s also highly unlikely that the type of the record/module/library would be a metavariable.

Thus the interplay between metavariable types and implicit imports that the research set out to formalize has the two following properties.

  1. It either complicates the metatheory, such as with an explosion of metavariables, or it incurs behavior that is unnatural to the user, such as use-bound bindings not being allowed to eclipse other bindings.
  2. It doesn’t actually arise often in practice: you only tend to import quite stable interfaces — otherwise it’s not very beneficial to import, since there’s not many names in the interface; or it’s not very safe, because of the risk of unintended variable capture.

So, supporting implicit imports complicates the language metatheory, and the feature both doesn’t really come up that often and is kind of risky to use in the first place. That’s why I think it’s probably hard to motivate.

Incremental optimizations

As we refined the presentation of the algorithms for our first post, we developed multiple “naive” implementations. Between edits, we spent some time incrementally optimizing the definition below towards the primary algorithm — the one that we tested with all the various data types in the data types post — specialized to strings (i.e. jaro2 :: String -> String -> Double).

jaro :: String -> String -> Double
jaro s1 s2
 | 0 == m = 0
 | otherwise = (m / n1 + m / n2 + (m - t) / m) / 3
 where 
  n1 = fromIntegral $ length s1
  n2 = fromIntegral $ length s2
  (used, m1) = matches s1 s2
  m2 = map (s2 !!) $ sort used
  m = fromIntegral $ length m1
  t = (/ 2) $ fromIntegral $ length $ 
      filter id $ zipWith (/=) m1 m2

matches :: String -> String -> ([Int], String)
matches s1 s2 = foldl snoc initst (zip s1 [0..]) where
 window = max 0 (max (length s1) (length s2) `div` 2 - 1)
 initst = ([], "")

 snoc st@(used, m1) (c1, idx1) =
   case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> (idx2 : used, m1 ++ [c1])
   where
     p idx2 = idx2 `notElem` used && c1 == s2 !! idx2
     lo = max 0 (idx1 - window)
     hi = min (length s2 - 1) (idx1 + window)

The error bars in the graph below are +/- the standard deviation as reported by the criterion package. See the full graph PNG if this is too hard to read.

We decrypt the labels for you in the table below, but you can mostly read this bar graph from top to bottom as chronological order. In other words, the lower bars mostly include the optimizations of the bars above, with a few major exceptions. The “IntSet” variation abandons all of the optimizations between it and the original “List.sort”. The “! patterns” variation is a dead-end, and is replaced with the variant right below it, “data with ! fields…”. And all of the variants under the “beta-reduce in t” label are one-shot variations of that same algorithm — they stop accumulating. Moreover, “(jaro2)” is the primary algorithm at the String type, not a refinement of jaro.

For the longest time, we were stuck at the “outer loop = shorter string” version. All of the variations below that one explore refactorings that we had assumed GHC does automatically. In particular, we were mistaken to assume that notElem e . List.takeWhile (<= e) would get fused and that multiple occurrences of length s2 would get common-subexpression-eliminated. We continued testing a few other things we take for granted, but found no more surprises: those last four variations don’t seem to make a difference.

Variation label: explanation

List.sort: The naive algorithm above
List.sort + all: Difference list instead of snoc, unboxed strict fields for loop state, strict foldl', inline List.find, carry idx1 in state instead of via zip
List.insert: Reset to just the naive algorithm, but maintain the list of second-string indices in sorted order and use this to short-circuit the notElem in the List.find predicate
difference list: Also use a difference list instead of snoc to build m1
data with ! fields: Also use strict unboxed fields for loop state and strict foldl'
inline List.find + fuse foldl'/zip: Also inline List.find and carry idx1 in state instead of via zip
IntSet: Reset to just the naive algorithm, but use Data.IntSet to track used indices from the second string
difference list: as above
! patterns: Also add strictness annotations on the tuple patterns over the state and strict foldl'
data with ! fields instead: Forgo tuples for the unboxed strict fields as above
inline List.find: as above
fuse foldl'/zip: as above
m in state: Also count the matches in the loop state instead of outside of matches
outer loop = shorter string: Also use the shorter string for the outer loop
fused filter/zipWith: Also manually fuse the loops for calculating t
(+1) instead of succ: Also change all succs to (+ 1) in order to avoid an overflow check!
(jaro2): Not actually a refinement — this is the primary algorithm with String for both input words
beta in t, cse (length s2): When calculating t, carry the Int through the loop instead of building a function in the loop and applying it to 0; also inline pred
un-cse (length s2): Do not manually eliminate the common length s2 subexpressions within matches — this confirms that GHC does not combine those subexpressions automatically
IntSet.toAscList in t: Convert the IntSet to a list outside of matches instead of inside
lifted nil: Define nil in a local where clause instead of in parallel with the foldl' invocation
named pred: Un-inline the List.find predicate (not a lambda-lift, just a binding)
manually inline snoc: Manually inline snoc
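To make a couple of the denser labels concrete, here is a hedged sketch of the “difference list instead of snoc” change: instead of appending each matched character with m1 ++ [c1], the loop state carries a function that conses the characters onto whatever list it is eventually applied to.

-- Illustrative only; 'used' and the indices are as in the naive matches above.
-- Snoc version: each match pays for a full append of m1.
snocMatch :: ([Int], String) -> (Char, Int) -> ([Int], String)
snocMatch (used, m1) (c1, idx2) = (idx2 : used, m1 ++ [c1])

-- Difference-list version: the state carries mk :: String -> String, and the
-- matched string is recovered at the end with mk [].
dlMatch :: ([Int], String -> String) -> (Char, Int) -> ([Int], String -> String)
dlMatch (used, mk) (c1, idx2) = (idx2 : used, mk . (c1 :))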

The Best jaro Yet

When deriving all of these specifications after having run the experiment, we of course ended up improving on the primary algorithm. These variations come from improving the loop in the calculation of t; they all use the same matches function, which differs from the primary one only in that it builds the list of matched characters (m1) simply with the (:) constructor instead of a snoc or a difference list, which is the basic idea behind all of these optimizations. As a result, however, the characters are in reversed order, and so the IntSet of matched indices in the second string needs to be traversed in reverse order as well. Since the two sequences have the same length, it doesn’t matter in which direction we traverse them in tandem.

Unfortunately Data.IntSet does not export a toDescList analog to toAscList, so we experimented with the options in the graph. The best one fuses the t loop with the flattening (to a list) of the IntSet via IntSet.fold.

In order to continue optimizing the algorithm, we will need to investigate the arithmetic primitives (probably by inspecting the generated assembly) or change the interface to, for example, work on multiple longer strings (i.e. inner loop) at once.

We actually do have one idea within the current interface, but its effects will likely only be visible for really long words: most of the t loop can be fused with the matches outer loop, excluding the phase of the outer loop where the window (i.e. hi) has reached the end of the longer string.

Qualifying our Experimental Design

Recall that profiling is not interacting well with optimization in GHC. Hence, none of these changes to the algorithm were inspired by any hard evidence: we can’t trust the profiling information right now. Each change is merely a possibility we considered when looking at the source for inefficiencies. We were (uncomfortably) flying blind.

Another consideration is that some of these variants could behave drastically differently under compilation flags (or, less likely, different RTS flags) we haven’t considered. We sought out the tool Don Stewart used, Acovea, but could not find anything but broken links.

Code on its way

Sorry for another round of delays, but yet again the code exploring these optimizations will be polished up and released a bit later. One of the paper deadlines we’re under was extended, which is always a blessing, but it also means that its import continues to loom over our time budget. This code needs some work, since we’ve been quite lax with our naming conventions (ticks ad nauseam…). Thanks!

Henny Penny Blues: An apologetic, but enthusiastic release of the Jaro benchmarking code

First things first: we accidentally sabotaged Data.Text. Its variant used tuples for the outer loop state instead of the data type with strict fields — quite a disadvantage. It now performs a bit better than String. Please accept our apologies for screaming that the sky was falling.

I’ve posted a prettied-up slice of our benchmarking suite on my university personal page. It’s the main algorithm from the data types post for ByteString, UArray Int Char, String, Seq Char, and Data.Text.

We also include a ByteString version that has some further optimizations we came up with after the fact. Please try to best it! Enjoy!

SPECIALISE pragma vs copy-and-paste

Turns out the slight overhead of SPECIALISE we were seeing was most likely incidental and just happened to look systematic. We had decided to use the pragma in this code release, since it’s just so much cleaner that way. That gave Text a significant performance boost and that’s how we realized we had copied from the wrong variant when writing the Text variant. This is exactly why we were so concerned that the SPECIALISE pragma somehow had some overhead: copy-and-pasting is just begging for bugs! … so ashamed …
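For readers who haven’t used the pragma, here is a tiny, self-contained illustration (a toy function, not our jaro) of the two approaches being compared: the SPECIALISE pragma asks GHC to compile an extra copy of an overloaded definition at the named monomorphic type, which is what we otherwise got by copy-and-paste.

-- Overloaded definition plus a SPECIALISE pragma: GHC compiles an extra copy
-- at [Double], and call sites at that type use it, avoiding dictionary passing.
{-# SPECIALISE average :: [Double] -> Double #-}
average :: Fractional n => [n] -> n
average xs = sum xs / fromIntegral (length xs)

-- Manual specialization via copy-and-paste: the same body, duplicated at the
-- monomorphic type.
averageD :: [Double] -> Double
averageD xs = sum xs / fromIntegral (length xs)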

Here’s what we get when we run this code as released (with SPECIALISE pragmas) and also with manual specialization via copy-and-paste. Note that use of the SPECIALISE pragma shows no overhead.

variant                 average time (s)   slowdown factor   standard deviation   SNR

with the SPECIALISE pragma:
spec                    0.857              4.47              0.00238              360.83
ByteString              0.237              1.23              0.00217              109.28
UArray Int Char         0.345              1.80              0.00124              278.67
String                  0.505              2.63              0.00196              258.11
Seq Char                0.841              4.38              0.00247              340.12
Text                    0.487              2.54              0.00220              221.25
optimized ByteString    0.192              1.00              0.00082              234.68

with manual specialization via copy-and-paste-and-annotate:
spec                    0.857              4.46              0.00359              238.89
ByteString              0.236              1.23              0.00122              192.89
UArray Int Char         0.345              1.80              0.00129              266.63
String                  0.505              2.63              0.00184              274.63
Seq Char                0.841              4.38              0.00240              350.24
Text                    0.487              2.53              0.00206              236.75
optimized ByteString    0.192              1.00              0.00107              180.06

(I just ran the SPECIALISEd version again in the background while editing this post and String slightly outperformed Text. This nondeterminism is driving me nuts.)

edit-distance

Also, check out Max Bolingbroke’s edit-distance library. That’s the only “related work” we found on hackage, and from his source code comments it looks like he’s waaay better at optimizing than we are.

That last post we promised is still in the works… thanks for your patience. Due dates, due dates, due dates!

D’oh! Bars

I had that haunting feeling that I had forgotten something during the last editing pass on that previous post about test driving the datatypes dropped into the Jaro-Winkler test harness we made. Our graphs looked incomplete somehow…

D’oh! Error bars. Thanks to Neil Brown for pointing out our omission.

The statistics we have to offer

criterion computes some statistics for you as it chews through its benchmarking duties. It applies resampling in ways we haven’t investigated, identifies outliers and their possible effect on variance, remarks on the confidence interval, and computes the standard deviations for the mean and for its lower and upper bounds within a benchmark’s iterations.

As with all statistics, these numbers risk lulling the trusting reader into a submissive state.

We were a little too trusting at first. If it finds that outliers have affected your run, criterion warns you but doesn’t attempt a re-run. We had thought it would. So just a heads up about that: check for crazy variances and look for that warning! My laptop apparently sleepwalks during the full moon: there were three giant execution times recorded during our first overnight benchmarking run. It took 600 minutes. Only in the past couple of days did we realize it should actually take only 50 minutes to do the entire run! We have no idea what happened there, but we have been on the lookout for large variances ever since.

So what is the consequence of the 95% confidence interval criterion reported to us? It’s pretty sure that it got the right measurement for that run of that particular benchmark. It does not take into account whatever sort of balancing act your OS might be doing for that benchmark that it didn’t do for others, even within the same run of the execution of the whole suite. Not that we expect criterion to be able to do that sort of thing! We’re just couching the implications of the statistics.

All of that rant is setup for

Disclaimer: We ran these measurements on a laptop with as few other processes as we could muster. We are not benchmarking pros, and so we really have no idea how best to quantify or even qualify the consistency of our measurements.

Bringing error bars into the equation might satiate our need for statistical significance, but we just want to highlight how much, for example, the OS environment might affect the run and that the standard deviation as computed by criterion does little to assuage that.

Here’s another graph to bring home our disclaimer. We’ve executed the entire benchmark multiple times; here’s the same slice of four such runs lined up. Recall that each of the lines on this graph achieved that 95% confidence interval nod.

We can’t explain those slight differences. Red, orange, and green were run back to back one night, and blue was run over lunch the next day. Otherwise, the system was in a very similar state. Maybe we ought to reboot just before benchmarking? (We tried booting into single-user mode and benchmarking there, but all the times were roughly twice as slow. /shrug)

An example graph

Even so, a graph looks so much more dapper with error bars. So we jumped through all the hoops to get a PNG with error bars out of Open Office. (I chose not to install the graphics libraries criterion relies on for its graphs.) This is from a different set of data than the numbers in the previous post, but you get the idea. See the full graph PNG if this is too hard to read.

The error bars in that graph are +/- the standard deviation.

Here’s another thing we learned during this experiment: Google Spreadsheets does not support error bars. Swammi really gives it to ’em there!

Signal-to-Noise Ratio

The only other statistic we’re prepared to offer is a characterization of our SNR. That’s the mean divided by the standard deviation. The higher that quotient is, the less jittery the results were.

We don’t have any intuitions about “good” SNR values, but we guess those are mostly pretty good? Again, this just means that each particular benchmark within the set of benchmarks of that particular execution was pretty consistent. It doesn’t necessarily say anything about how much we can trust any comparisons between them. It does suggest that nothing funky was going on, so it doesn’t per se warn against comparing between the variants measured within that benchmark run. On that note, we used sleep 360 && … to let the display get to sleep and the computer settle before the benchmark started. If it was not done by the time we woke the display, we restarted the run.

FYI, the eight SNRs below 200 are ByteString * ByteString, ByteString * String, UArray Int Char * UArray Int Char with tuples, String * ByteString, Seq Char * UArray Int Char, and a few of the variants we have yet to post discussions about: a further optimization of symmetric ByteString and two of the refinement steps between the String spec and the primary jaro algorithm we settled on for the previous post. We see no notable pattern in those — again, though, we’re not too familiar with SNRs.

Summary

So that’s our graph and a characterization of our SNRs. Any suggestions on how to characterize our confidence in the comparisons? We suppose you could derive that from a characterization of how consistent the state of the machine was throughout the execution of the entire benchmarking suite. Other than just turning off AirPort, for example. (We’re working with what we have…)

Thanks again, Neil, for the reminder.

And thank you for reading! Please still stay tuned for that last post about optimizing jaro. The request has been made, so our code will be posted too.


Footnotes

† That’d be really awesome, though. Maybe it could jumble up some portions of the iterations of a benchmark, trying to balance them across the entirety of that run of the benchmarking suite? Hrmmmm… (k)

That last Bugzilla comment gets me every time. (k)

Trying some datatypes on for time (+ a few algorithmic variations)

Update 2 June 2011: we’ve since discovered an error in our benchmarking suite. Data.Text was at a disadvantage and actually performs on par with String. So, the Data.Text results below are inaccurate — everything else should be fine AFAWK.

I don’t really know if this is technically the same “functional time”. But maybe by “functional” Evan meant whenever it’s ready — so here goes. I’m Nick. Thanks for reading. I have to apologize ahead of time: I cannot even feign the capability to compete with Evan’s cultural allusions.

A quick recap of the last post: when optimizing a straight-forward implementation of the Jaro string distance, we found some success with datatype refinements to IntSet and ByteString (surprise surprise). But then we found out an inconvenient truth about the current GHC profiling, namely that it isn’t to be trusted.

This post includes some more of our experiments and design considerations. In particular, we focus on trying out four alternative types for representing the strings.

tl;w(on’t)r

The following chart summarizes the results; the scalar is an average time for an iteration of the outermost loop of our testing harness and the label is the type used to represent words.

Disclaimer: These are all the same algorithm, with different data structures just plugged in. With algorithms tailored to their interfaces, these datatypes might perform better!

ByteString wins again. In this case, we think the chart indicates that much of that performance benefit is due to the support for O(1) random access. This claim is supported by the comparable success of UArray Int Char. Read on for a more thorough discussion of the highlights:

  • we couldn’t squeeze any performance gains out of mutability and unsafePerformIO,
  • don’t underestimate the benefits of a data declaration with strict fields for your inner loop’s state!, and
  • Data.Sequence severely underperforms, but we’re not sure why.

Prompted by a comment to the last post by Erik Hesselink (Thanks, Erik!), we included a variant for Bryan O’Sullivan’s Data.Text. We have not investigated why it only performs on par with String (which shares the same time-complexities for the interface we use); the only caveat we’ll mention is the minimality of this algorithm: it’s basically just a strict left fold, a length, and some random access. There’s really no opportunity for fusion to kick in that we’ve noticed.

We’d be happy to send our code your way if you’d like — just let us know in the comments. You can dig deeper if you like, but we’re just investigating a “plug-and-play” like use here.

Experimental Setup

As a first step, we identify the essential interface to the String-like datatype for calculating the Jaro distance. This type class is used in the code we actually benchmarked; each instance of Jaro is saturated with INLINE pragmas and the jaro and matches definitions were cut-and-pasted for each datatype in order to avoid dictionary passing at run-time. We couldn’t get all the overhead removed via SPECIALISE — we have yet to determine why not.

class Jaro c where
  fromBS :: ByteString -> c
  iLen :: c -> Int
  index :: c -> Int -> Char
  foldl' :: (b -> Char -> b) -> b -> c -> b
  {-# INLINE gLen #-}
  gLen :: Enum n => c -> n
  gLen = toEnum . iLen

The test harness uses a strict ByteString to read in the test data. The fromBS method is used to convert the input word to the appropriate type for the jaro variant under test. We use Control.DeepSeq.rnf in order to exclude the conversion from our measurement of execution time. Otherwise, the methods simply provide access to the length of the word, random-access indexing, and a strict left fold.

The Jaro type class provides all the necessary methods for implementing the Jaro algorithm.

jaro :: (Jaro c, Fractional n, Enum n) => c -> c -> n
jaro s1 s2
  | 0 == m = 0
  | otherwise =
    (m / gLen s1 + m / gLen s2 + (m - t) / m) / 3
  where
    m = toEnum mInt

    (x1, x2) =
      if iLen s2 < iLen s1 then (s2, s1) else (s1, s2)

    (mInt, x1', hits2) = matches x1 x2

    t = (/ 2) $ toEnum $ loop hits2 x1' (0 :: Int) where
      loop (i : is) (c1 : cs1) = loop is cs1 . f
        where f = if c1 /= x2 `index` i then succ else id
      loop _ _ = id

The definition of jaro re-orders the input words so that the outer-loop of matches traverses the shorter word. That traversal yields the number of matches, mInt; the string of matched characters from the short string, x1'; and the ascending indices of matched characters in the long string, hits2. The latter two are simultaneously traversed in order to compute the transpositions.

The matches function consists of a left fold over the short input word, which carries the following state type, St. The data declaration uses strictness annotations on its fields in order to specify that they should never contain thunks. Along with the -funbox-strict-fields compilation flag, this technique pushes these values nearer to the CPU at run-time — they are the data manipulated by our program’s hot-spot and any access-time overhead is costly.

data St = St { idx1, m :: !Int, mk :: !(String -> String),
               used :: !IntSet }

{-# INLINE matches #-}
matches :: Jaro c => c -> c -> (Int, String, [Int])
matches s1 s2 = (m, mk [], IntSet.toAscList used) where
  len2 = iLen s2

  window = max 0 (max (iLen s1) len2 `div` 2 - 1)

  St {m, mk, used} = foldl' snoc nil s1 where
    nil = St { idx1 = 0, m = 0, mk = id,
               used = IntSet.empty }
    snoc st@St{..} c1 =
      (case inline List.find
                 (\idx -> IntSet.notMember idx used &&
                          c1 == s2 `index` idx)
         [lo .. hi] of
         Nothing -> st
         Just idx2 -> st { m = m + 1, mk = mk . (c1 : ),
                           used = IntSet.insert idx2 used }
      ) { idx1 = idx1 + 1 }
      where lo = max 0 (idx1 - window)
            hi = min (len2 - 1) (idx1 + window)

The INLINE pragma merely mitigates the function call overhead; we’re not aware of any optimizations it enables within jaro.

This particular definition of jaro and matches was the result of iterating on the various data and slight algorithmic choices when refining the original specification from the previous post. The primary differences are some manually eliminated common subexpressions, some explicit inlining via GHC.Exts.inline, and some trivial changes to the types.

As a reminder, profiling was unfortunately of little use because of its current detrimental interactions with optimization, as mentioned in this GHC ticket. For our available machines, this bug either prevented any timing information or caused invalid compilation! Thus, we only had our own coarse-grained wall-clock time measurements to work with, computed carefully with Data.Time.getCurrentTime and Control.DeepSeq.rnf.

Jaro instances

The instances for Jaro are crucial to the performance of the jaro variant that uses the datatype. Here, we have omitted the INLINE pragmas that saturate these instance declarations for the sake of presentation, but they are crucial to avoiding dictionary passing at run time. Also, recall that the execution of fromBS is not included in the measurements.

For ByteString, String, and Text, the instances are immediate.

instance Jaro ByteString where
  fromBS = id
  iLen = BS.length
  index = BS.index
  foldl' = BS.foldl'

instance Jaro [Char] where
  fromBS = BS.unpack
  iLen = length
  index = (!!)
  foldl' = List.foldl'

instance Jaro T.Text where
  -- NB get special input treatment instead
  fromBS = error "Text from BS"
  iLen = T.length
  index = T.index
  foldl' = T.foldl'

The functions used in these instances are defined in heavily optimized libraries. Surprisingly, this is apparently not the case for Seq Char.

instance Jaro (Seq Char) where
  fromBS = Seq.fromList . BS.unpack
  iLen = Seq.length
  index = Seq.index
  foldl' = \snoc nil -> List.foldl' snoc nil . Fold.toList

The foldl' definition via the conversion to String consistently outperforms those via Data.Sequence.foldlWithIndex and Data.Foldable.foldl' by about 0.1 seconds, which is a degradation of about 15% to 50% compared to the execution times of all of our tests in this experiment. That would seem to merit some investigation; we haven’t any theories yet.
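For reference, the two direct alternatives we measured against the toList-based method look roughly like the following standalone sketches (illustrative only, not the instance methods themselves).

import Data.Sequence (Seq)
import qualified Data.Foldable as Fold
import qualified Data.Sequence as Seq

-- Via Data.Foldable's strict left fold.
foldlViaFoldable :: (b -> Char -> b) -> b -> Seq Char -> b
foldlViaFoldable = Fold.foldl'

-- Via Data.Sequence's indexed fold, ignoring the index.
foldlViaWithIndex :: (b -> Char -> b) -> b -> Seq Char -> b
foldlViaWithIndex snoc nil = Seq.foldlWithIndex (\acc _ c -> snoc acc c) nil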

instance Jaro (UArray Int Char) where
  fromBS = \bs -> IA.listArray (0, BS.length bs - 1)
                 (BS.unpack bs)
  iLen = IA.numElements
  index = (IA.!)
  foldl' = \snoc nil arr ->
    let (lo, hi) = IA.bounds arr
        w !idx !acc | hi < idx = acc
                    | otherwise =
          w (succ idx) (snoc acc (arr IA.! idx))
    in w lo nil

criterion Harness

Our main function reads in the input file as a ByteString and then immediately prepares some thunks for converting that data into the various pairs for the cross product of datatypes (one for the short word, one for the long). We also use Data.Text's lazy IO in order to thunk its input until it’s needed; we don’t convert it from a ByteString since it supports richer encodings.

type C0 = ByteString
type C1 = UArray Int Char
…

main = do
  ns <- BS.lines `fmap` BS.readFile wordsFile
  ns_text <- (map (id &&& id) . T.lines . LT.toStrict)
               `fmap` LT.readFile wordsFile

  let ns00 = map (fromBS &&& fromBS) ns :: [(C0, C0)]
      ns01 = map (fromBS &&& fromBS) ns
      ns02 = map (fromBS &&& fromBS) ns
      …
      ns33 = map (fromBS &&& fromBS) ns :: [(C3, C3)]

We basically hand everything off to criterion from here. We use rnf “in the right places” in order to force the conversion to take place outside of the execution window that criterion measures. Each bgroup corresponds to one of the sixteen pairs of representation types (e.g. ByteString and UArray Int Char) and contains all of the benchmarks that use that input data. Combined with the cfgPerformGC option to criterion, this should ensure that only the ByteString for the entire file and one conversion from that data are resident in memory at any given time.

  let cfg = defaultConfig
        { cfgPerformGC = ljust True,
          cfgSummaryFile = ljust "crit.csv" }

  defaultMainWith cfg (return ()) [
    bgroup "00" [
        bench "jaro0" $
          rnf ns00 `seq` whnfIO (critrun jaro0 ns00)
      , bench "jaro0'" $ whnfIO (critrun jaro0' ns00)
      , … {- a bench for each variant that uses ns00 -}],
    bgroup "01" [
        bench "jaro01" $
          rnf ns01 `seq` whnfIO (critrun jaro01 ns01) ],
    bgroup "02" [
        bench "jaro02" $
          rnf ns02 `seq` whnfIO (critrun jaro02 ns02) ],
    … {- a bgroup for each input type except Text -},
    bgroup "text" [
      bench "jaro_text" $
        rnf ns_text `seq` whnfIO (critrun jaro_text ns_text) ]
               ]

We compile with

ghc -O2 -funbox-strict-fields -fforce-recomp
    -fspec-constr-count=6 --make Main

We just bumped the SpecConstr count until the warnings went away; 6 didn’t seem offensively large, and there were two call patterns at most (recall that GHC reduces the count as part of its heuristic).

Results and Discussion

Measurements

We used the 8K /usr/share/dict/propernames file on our Mac OS X distributions to run the tests, since Jaro-Winkler is designed primarily to work on names. Our test harness traverses the diagonalization of the names in this file and sums the similarity metrics between all of the words (this diverges from the first post’s test harness, but it’s much faster since we’re running so many variants and iterations). The result is a nonsensical number, but it forces the necessary evaluation, has a small memory footprint, and represents a hash of sorts of the file’s contents. All of the jaro variants computed the same sum (without the profiling, that is), so we’re confident they are equivalent. Since this work was an exercise in autodidacticism, we did not compare for correctness against the golden C implementation of Jaro-Winkler (linked to by the Wikipedia page referenced above).

It is well understood that standard Haskell Strings limit performance; this is why ByteString was written in the first place. For this experiment, we primarily focused on various data structures for representing the words during the two loops in matches. Since each loop accesses the respective argument differently (short word – outer loop, long word – inner loop), we tried all sixteen combinations of our 4 datatypes of interest. Direct benchmarking via Data.Time.getCurrentTime and Control.DeepSeq.rnf yielded the following measurements.

Outer loop        Inner loop        Measured time (s)
ByteString        ByteString        0.22892195665577
ByteString        UArray Int Char   0.3117424055312
ByteString        String            0.4186133953307
ByteString        Seq Char          0.62530112468937
UArray Int Char   ByteString        0.32210345232228
UArray Int Char   UArray Int Char   0.33752040827015
UArray Int Char   String            0.45714450323322
UArray Int Char   Seq Char          0.68065535986164
String            ByteString        0.37693604194859
String            UArray Int Char   0.40947800600269
String            String            0.49661373102406
String            Seq Char          0.74052640163639
Seq Char          ByteString        0.53333888256291
Seq Char          UArray Int Char   0.58034939014652
Seq Char          String            0.68305074178913
Seq Char          Seq Char          0.8523849221442

The Datatypes

The graph from the introduction tells the big story: ByteString is best. We chose the other types in order to explore our straight-forward and simple hypotheses. We included String since it’s the most familiar and default type in Haskell for words. We were surprised it fared so well. UArray Int Char was intended as a less optimized version of ByteString. We presumed that its implementation is basically the same with fewer possible rewrites and slightly worse access-times, and so we expected a constant factor of overhead compared to ByteString. Our use of Data.Sequence is slightly vestigial: an earlier algorithmic variant used its queue-like interface (O(1) tail and snoc) to slide the matching window across the second string. Without profiling, we haven’t the means to determine what’s “wrong” with Data.Sequence; the time complexities listed in its documentation suggest it should do better than String.

In order to better compare the types, we chart the times with a fixed type for the short word and average across the measured times for each of the possible types for the longer word. In other words (har), we fix the type of the outer loop and vary the type of the inner loop.

Next, we do the opposite: fix the inner loop type and vary the outer loop type, again averaging the results to a single scalar.

Without going into statistics, the best we can do is eyeball it. It does seem that the “slope” of the second graph is steeper, which indicates that the datatype used for the inner loop has a greater effect on performance. That sounds reasonable.

Algorithmic Variations

Strict Tuple Patterns

Unaccustomed to low-level optimizing, we didn’t immediately remember to employ a data declaration with strict, unboxed fields. Its performance benefit at the time was startling; a factor of 2 or 3. So we’ve retained variants of jaro that use simple tuples saturated with strictness annotations just to see the difference.

Datatype          Measured time (s)   Degradation
ByteString        0.35290416443089    54.16%
UArray Int Char   0.45579727375248    35.04%
String            0.61194113933781    23.22%
Seq Char          0.99888140403965    17.19%

The data declaration with strict fields is clearly crucial.
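For concreteness, here is a hedged sketch of the two shapes of loop state being compared in that table; the record mirrors the St declaration shown earlier, and the tuple version stands in for the variants saturated with strictness annotations.

{-# LANGUAGE BangPatterns #-}

import Data.IntSet (IntSet)

-- Tuple state: strictness comes only from the bang patterns at each use
-- site, and the fields stay boxed.
type StTuple = (Int, Int, String -> String, IntSet)

bumpTuple :: StTuple -> StTuple
bumpTuple (!idx1, !m, mk, !used) = (idx1 + 1, m, mk, used)

-- Record with strict fields: together with -funbox-strict-fields, the Int
-- fields are unboxed in the constructor and can never hold thunks.
data St = St { idx1, m :: !Int, mk :: !(String -> String), used :: !IntSet }

bumpSt :: St -> St
bumpSt st = st { idx1 = idx1 st + 1 }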

Mutable bit array

All of the measured variants discussed above use a Data.IntSet to track which of the characters in the second string have already been matched. Since this is a persistent data structure — even though it’s so tightly packed and efficient — we thought that changing it to a mutable bit array would be a big win. However, all of our attempts at doing so (ST, unsafePerformIO, re-using a singly allocated array to allow Data.ByteString.Internal.inlinePerformIO within the loop, cloning Data.ByteString.foldl' to work with a monadic snoc, etc.) failed to perform as well as ByteString + IntSet. We peeked at the Core enough to suppose that there were bounds checks interfering with some optimizations; the IntSet code had 10 instances of the loops whereas the array code had only 4. After eliminating the bounds checks (with some really low-level coding to circumvent the Data.Array API), there were a matching 10 instances of the loops in the array-based code, but it remained less performant. Without profiling and a better understanding/navigation of Core, we’re unable to determine why.

The variant with the singly allocated array and using only Data.ByteString.Internal.inlinePerformIO within the inner loop had a measured time of 0.3209107s, which is a 40.18% performance degradation.
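To give a flavor of what those attempts looked like, here is a minimal ST-based sketch of the idea (not the exact code we benchmarked, and it only covers the tracking of used indices, not the transposition bookkeeping), in which the IntSet is replaced by an STUArray Int Bool.

{-# LANGUAGE BangPatterns #-}

import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newArray, readArray, writeArray)

-- Track the already-matched indices of s2 in a mutable bit array.
matchesST :: String -> String -> (Int, String)
matchesST s1 s2 = runST $ do
  let len2   = length s2
      window = max 0 (max (length s1) len2 `div` 2 - 1)
  used <- newArray (0, len2 - 1) False :: ST s (STUArray s Int Bool)
  let go !_    !m mk []         = return (m, mk [])
      go !idx1 !m mk (c1 : cs1) = do
        let lo = max 0 (idx1 - window)
            hi = min (len2 - 1) (idx1 + window)
            find' i | i > hi    = return Nothing
                    | otherwise = do
                        u <- readArray used i
                        if not u && c1 == s2 !! i
                          then return (Just i)
                          else find' (i + 1)
        hit <- find' lo
        case hit of
          Nothing   -> go (idx1 + 1) m mk cs1
          Just idx2 -> do
            writeArray used idx2 True
            go (idx1 + 1) (m + 1) (mk . (c1 :)) cs1
  go 0 0 id s1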

Inlining List.find

In the definition of matches, we use GHC.Exts.inline to force the inlining of Data.List.find. Since it’s a library function, this is the only way we are aware of to force it to be inlined. Without inlining it, the ByteString jaro averages 0.29225487196186, which is a 27.67% performance degradation. We surmise that its unfolding is enabling a fusion rule for filter and enumFromTo.

Having never worked this hard at optimizing a function, we were surprised that a use of a standard function like List.find wasn’t optimally handled by GHC automatically. In particular, we thought GHC was very likely to unfold functions with such short definitions; perhaps the library must mark them INLINABLE with a pragma? It seems odd that such a standard library function wouldn’t be. Looking now at the definition of Data.List.find, it seems it’d be more efficient to case directly on the result of the filter instead of converting to a Maybe. Is listToMaybe INLINEd even though it has no such pragma? (At the time of writing, those links hit base-4.3.1.0.)
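For reference, the definitions in question look like this in base-4.3 (reproduced from memory, so treat it as a close paraphrase): find really is just filter post-processed into a Maybe.

find :: (a -> Bool) -> [a] -> Maybe a
find p = listToMaybe . filter p

listToMaybe :: [a] -> Maybe a
listToMaybe []      = Nothing
listToMaybe (a : _) = Just a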

Eliminating the double traversal

The counting of transpositions in jaro traverses the strings of matched characters. With respect to the short string, the characters occur in the exact same order within the outer loop of matches. These characters are essentially traversed twice, and — presumably worse still — a list is constructed in between, bound to x1' in jaro.

We wrote a version using circular programming to avoid the construction of the string of matched characters from the short word. The resulting fold counts the transpositions “as” it determines the matches.

data St n = St { idx1, m :: !Int, t2 :: n,
                 used :: !IntSet, hits2 :: String }

matches_t ::
  (Jaro c, Fractional n, Enum n) => c -> c -> (Int, n)
matches_t s1 s2 = (m, t2 / 2) where
  window = max 0 (max (iLen s1) len2 `div` 2 - 1)
  len2 = iLen s2

  St {m, used, t2} = foldl' snoc nil s1 where
    nil = St { idx1 = 0, m = 0, t2 = 0,
               used = IntSet.empty,
               hits2 =
                 map (index s2) (IntSet.toAscList used) }
    snoc st@St{..} c1 =
      (case inline List.find
                 (\idx -> IntSet.notMember idx used &&
                          c1 == s2 `index` idx)
         [lo .. hi] of
         Nothing -> st
         Just idx2 ->
           st { m = m + 1, hits2 = tail hits2,
                used = IntSet.insert idx2 used,
                t2 = (if c1 /= head hits2
                      then succ else id) t2 }
      ) { idx1 = idx1 + 1 }
      where lo = max 0 (idx1 - window)
            hi = min (len2 - 1) (idx1 + window)

The circularity is between the outermost binding of used and the initial value of the hits2 field in the nil state. Divergence is avoided because, during the traversal, none of the other fields depend on hits2 or t2 and their fields in St are not strict. As a result, those fields accumulate as thunks, which only collapse once t2 is forced. This ultimately causes the rest of the traversal to take place first.

With ByteStrings, this algorithm and the primary algorithm perform the same at optimization level -O1. We surmise that the thunks that build up are operationally comparable to the construction of the list of matched characters, x1'. That would still eliminate the redundant traversal, but for the fact that the thunks merely accumulate during the fold; their collapse happens only once t2 is forced, and is effectively a second traversal. Furthermore, with -O2 the circular program is significantly less performant than the primary algorithm; we surmise that the circularity prevents some of the more aggressive optimizations.

Though the circular programming removes the redundant traversal from the code, it fails to eliminate it from the execution. We surmise the existence of a much more invasive alteration of the primary algorithm where the outer loop counts the transpositions for the prefix of the longer string that is beyond the lower limit of the matching window for the shorter string’s last character. This ought to reduce the redundancy of the second traversal. We have not yet implemented it.

“Surmise, surmise, surmise”, you say. We’re trying to not look at the generated Core, especially since -O2 does a lot of unfolding. Perhaps heap profiling would be a good indicator? … You know, we do have research to do.

criterion

We eventually jacked our code into Bryan O’Sullivan’s criterion package late into our process. In particular, during development we did simple timing with Data.Time.getCurrentTime with a handful of iterations for quick turnaround.

For the numbers and graphs in this post, though, we ran everything in a single criterion harness. Fortunately, comparing the two data sets at the shared data points validates our simple method.

Phew! Everything just about lines up. The criterion benchmarking was run overnight, with no other processes — we assume that sort of OS-level inequity in the execution environment is where the seemingly constant offset comes from.

Conclusions & Questions

We kind of already concluded above. ByteString was best. We’re surprised we couldn’t beat IntSet by using a mutable array — that’s probably our most interesting result. But all of the IO-escaping (even via inlinePerformIO) seemed to inhibit some key optimizations. Maybe with the right refactoring and compilation flags, the mutable bitarray would out-perform IntSet. Suffice it to say it apparently wouldn’t be a small, obvious change.

We also didn’t try the other unboxed array packages. We just hadn’t heard of them yet. So maybe there’s something else out there for mutable bitarrays that would outperform IntSet and not block those aggressive optimizations as we’re supposing was happening with our attempts. Please let us know!

If you’re interested in getting access to the code we used:

  1. Leave a comment saying so! We’ll need some motivation to package it up since the semester is over…
  2. Brace yourself! There’s a lot of “duplicate” code. We manually circumvented SPECIALIZE pragmas, since our initial tests indicated some overhead. Also, our naming conventions got a little frantic at the end there. Maybe we’ll clean it up before posting it if you ask nice.

If you’re just after the fastest jaro definition (in order to try to beat it, perhaps? Hmm?), just wait for the next post. We’d love it if someone could show us some further tricks! We’d also really like to know why we’ve failed to speed it up with the mutable bit array.

Also, we are concerned about the apparent SPECIALIZE overhead we noticed. Is this a known issue? Are we missing something?

This point about missing flags deserves repetition. In fact…

Disclaimer: we could have overlooked crucial INLINEs or compilation flags.

Those sorts of things could really affect the results we’re getting and so possibly invalidate some of our conclusions (e.g. those regarding inlining List.find or Seq Char’s slow folds). Unfortunately, we can’t even find Acovea on the Internets and aren’t otherwise familiar with the multitude of GHC flags. … Help?

Thanks again for reading! Our next post will discuss some more invasive algorithmic changes; there’s more we can do to speed up jaro, albeit not by much. Fuse, fuse, fuse.

Exercises in Optimization (A O(n)-Part Series)

Greetings and welcome to the first entry of Functional Kansans, a new blog brought to you by a few of the people interested in functional programming at the University of Kansas.

We open the blog with a series of posts by myself and Nick Frisby detailing our recent exploration into the world of Haskell program optimization.  Before we begin, I should state that this work arose from what was supposed to be a class-wide project for Andy Gill’s advanced functional programming class that Nick and I co-opted as our own, perhaps greedily so.  The following content could not have been possible without Andy’s excellent instruction and the support of our other classmates (who hopefully will have posts of their own on this blog at some point).

The Problem

The target of our optimization efforts was the Jaro-Winkler algorithm, a metric for judging the similarity of two strings.  Specifically, we looked at the Jaro distance metric given that its time-complexity dominates that of the Winkler variant.

PSEUDO-CODE

This section provides pseudo-code snippets in order to explain the algorithm.

The Jaro measure is computed in terms of the string lengths, number of matches, m, and the number of transpositions, t, and yields a similarity metric between 0 and 1.

jaro s1 s2 = (m / n1 + m / n2 + (m - t) / m) / 3
  where
    n1 = length s1
    n2 = length s2
    m  = matches s1 s2
    t  = transpositions s1 s2

A pair of characters from each string match if they are equal and not too far apart.  The maximal match distance is defined as a sliding window, framed by the list of indices [lo .. hi], whose size is determined relative to the length of the two input strings.  Additionally and importantly, each character can only match once.

matches s1 s2 = snd (foldl snoc nil (zip s1 [0..])) where
  nil = ([], 0)

  snoc st@(used, m) (c1, idx1) =
      case find p [lo .. hi] of
         Nothing -> st
         Just idx2 -> (idx2 : used, m + 1)
    where
      p idx2 = idx2 `notElem` used && c1 == s2 !! idx2

To calculate the number of transpositions, count the number of differences in the in-order matched characters.

transpositions s1 s2 =
  length $ filter (\(c1, c2) -> c1 /= c2) $
  zip (filter wasMatched s1) (filter wasMatched s2)

Note that since each character can only match once and only with a character from the other string, there’s the same number of matched characters in each string; the zip doesn’t truncate either argument. Also note that we avoid specifying a full definition for wasMatched, but suffice it to say that an implementation that matches its current type of Char -> Bool is not possible.

The next sections will present the starting implementation of this algorithm and a discussion of the profiling techniques used in our optimization efforts.

The Starting Point

As was just mentioned, wasMatched cannot have the type of Char -> Bool, at least as the transpositions function is currently specified, because it has a missing dependency on the indices calculated in the matches function; the character value on its own is insufficient information for determining if a position in the string was matched.  Presented below is a naive implementation of the algorithm that corrects that deficiency by having matches return the string consisting of all matched characters, in order, from the left string (s1) and all matched indices from the right string (s2).  Please also note that the transpositions function has been “folded into” the definition of jaro in its where clause, mainly for the purpose of brevity.

jaro :: String -> String -> Double
jaro s1 s2
 | 0 == m = 0
 | otherwise = (m / n1 + m / n2 + (m - t) / m) / 3
 where 
  n1 = fromIntegral $ length s1
  n2 = fromIntegral $ length s2
  (used, m1) = matches s1 s2
  m2 = map (s2 !!) $ sort used
  m = fromIntegral $ length m1
  t = (/ 2) $ fromIntegral $ length $ 
      filter id $ zipWith (/=) m1 m2

matches :: String -> String -> ([Int], String)
matches s1 s2 = foldl snoc initst (zip s1 [0..]) where
 window = max 0 (max (length s1) (length s2) `div` 2 - 1)
 initst = ([], "")

 snoc st@(used, m1) (c1, idx1) =
   case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> (idx2 : used, m1 ++ [c1])
   where
     p idx2 = idx2 `notElem` used && c1 == s2 !! idx2
     lo = max 0 (idx1 - window)
     hi = min (length s2 - 1) (idx1 + window)

From this point the path to an optimized implementation is simple: find what is slow and make it faster.  Flippancy aside, we made the decision to identify time-cost hotspots using GHC’s profiling functionality.  Many notable bloggers and Haskell hackers have also advocated for the direct inspection of the Core code that GHC produces to augment profiling information; however, we chose not to.
The honest truth of the matter is that Core is not a fun language to read, except perhaps to gluttons for syntactic punishment.  Furthermore, the practice quickly becomes infeasible for any significantly large code base.

Profiling

So we made up our mind.  Profiling and profiling only.

Unfortunately, running a single iteration of jaro takes such a trivially small amount of time that no worthwhile profiling data would be collected.  To sidestep this issue we devised the following test harness that calls jaro a large number of times.

import Control.DeepSeq (rnf)
import Data.List (sortBy)
import Data.Time (diffUTCTime, getCurrentTime)

words_file :: FilePath
words_file = "/usr/share/dict/propernames"

main :: IO ()
main = do
  ns <- lines `fmap` readFile words_file
  now <- length ns `seq` getCurrentTime
  let xs :: [Double]
      xs = sortBy (flip compare)
           [ jaro n m | n <- ns, m <- ns ]
  now2 <- rnf xs `seq` getCurrentTime
  print (diffUTCTime now2 now)

In short, the harness reads in a line-delimited file, splits it into words, and then calculates the jaro similarity factor for the cross product of the word list. Also note that there is some additional plumbing to measure execution time with the Data.Time library and to force evaluation of the list to normal form using the Control.DeepSeq library. This combination provides us with informal benchmarking capabilities that should confirm any results we find with profiling.

For all profiling experiments in this post we will be using the following compilation flags: -rtsopts -prof -auto-all -O2. The first three are the standard flags required for profiling all top-level definitions, with the final flag being the simplest way to activate most of the compiler optimizations that would make a difference for us. There may be a future post discussing other optimization flags that would be of benefit; however, for now we accept the -O2 flag as “good enough.”

Likewise we will be using the following runtime system options: -p -i0.001. There are no surprises here: we simply activate time profiling and change the time scale for profiling ticks to 1 ms. Note that with GHC 7 most runtime system options were disabled by default, hence the need for the -rtsopts flag during compilation.

Profiling our naive implementation with the above settings gives us the following information:

total time  =        0.88 secs
total alloc = 3,255,736,936 bytes
wall clock = 14.746186 secs 
COST CENTRE                MODULE           %time %alloc

main                       Main              61.4   36.8
matches                    Main              29.8   47.7
jaro                       Main               8.8   15.5

Fantastic.  So now we know that matches is the dominant function, accounting for roughly 30% of our total run time and almost 48% of our total allocation.  Looking back at the matches function, though, the code is fairly dense and, at least to less experienced eyes, does not appear to reflect any obvious optimization opportunities.  To gain more information we will introduce our own cost centers to measure any internal expressions that we think may account for a large portion of our run time.

matches :: String -> String -> ([Int], String)
matches s1 s2 = {-# SCC "matchesFold" #-}
                foldl snoc initst (zip s1 [0..]) where
 window = {-# SCC "matchesWindow" #-}
          max 0 (max (length s1) (length s2) `div` 2 - 1)
 initst = ([], "")

 snoc st@(used, m1) (c1, idx1) =
   case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> {-# SCC "matchesSnoc" #-}
                  (idx2 : used, m1 ++ [c1])
   where
     p idx2 = {-# SCC "matchesPred" #-}
              idx2 `notElem` used && c1 == s2 !! idx2
     lo = max 0 (idx1 - window)
     hi = {-# SCC "matchesHi" #-}
          min (length s2 - 1) (idx1 + window)

Running the profiling with the new cost centers does in fact give us quite a bit more information to work with:

total time  =        1.06 secs
total alloc = 2,883,573,832 bytes
wall clock  = 15.011385 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              30.5   41.5
matchesPred                Main              25.0    0.0
matches                    Main              13.5   18.3
jaro                       Main              12.6   17.5
matchesFold                Main               9.6   14.6
matchesWindow              Main               4.2    1.5
matchesHi                  Main               3.3    2.5
matchesSnoc                Main               1.3    4.1

Hold on, though.  The introduction of our own cost centers affected our total time and allocation by fairly significant amounts.  That can’t be right, can it?  We’ll touch on this more in a later section, but for now, thankfully, we can rely on the independent benchmarking that we built into our test harness (recorded above in the wall clock field) to confirm any speed-ups that we find.

Down the Rabbit Hole

Looking at the profiling information, we see that the dominant cost center we introduced is matchesPred, which measures our find predicate:

p idx2 = idx2 `notElem` used && c1 == s2 !! idx2

There are a few things going on in this function, so let’s work from left to right.  Recall that used is simply a list of integers, making notElem a linear traversal.
It should be possible to find a better container type for used that speeds up this check without penalizing the other operations over used.

We decided to change used to be modeled as an IntSet (note: we imported qualified as IS) rather than a list of integers.  Membership tests and inserts on an IntSet are O(min(n, W)), effectively constant for machine-sized keys, rather than the linear traversal that notElem performs on a list, so at the very least the allocation behavior should improve, likely leading to an improvement in time-cost as well.  We also gain the benefit of using a container that better matches the semantics we’re looking for; after all, a set of matched indices needs no room for duplicates.

matches :: String -> String -> (IntSet, String)
matches s1 s2 = {-# SCC "matchesFold" #-}
                foldl snoc initst (zip s1 [0..]) where
 window = {-# SCC "matchesWindow" #-}
          max 0 (max (length s1) (length s2) `div` 2 - 1)
 initst = (IS.empty, "")

 snoc st@(used, m1) (c1, idx1) =
   case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> {-# SCC "matchesSnoc" #-}
                  (idx2 `IS.insert` used, m1 ++ [c1])
   where
     p idx2 = {-# SCC "matchesPred" #-}
              idx2 `IS.notMember` used && c1 == s2 !! idx2
     lo = max 0 (idx1 - window)
     hi = {-# SCC "matchesHi" #-}
          min (length s2 - 1) (idx1 + window)

And its profiling information:

total time  =        0.72 secs
total alloc = 2,490,396,576 bytes
wall clock  = 14.79711 sec

COST CENTRE                MODULE           %time %alloc

main                       Main              41.1   48.1
matchesPred                Main              20.7    0.0
matches                    Main              11.7   11.6
matchesFold                Main               9.4   17.0
jaro                       Main               9.4   12.7
matchesWindow              Main               4.0    1.7
matchesSnoc                Main               1.8    6.2
matchesHi                  Main               1.8    2.8

As expected, this led to a significant decrease in total allocation during the execution of the program.  Unfortunately, while the profiling time did decrease by roughly 30%, the reduction in wall clock time was not as significant.  That leaves matchesPred as the dominant cost center, so we revisit it looking for another optimization opportunity.  The only other operation in this expression that is not a primitive comparison is the string indexing.  A String is just a linked list of Char, so (!!) is a linear traversal rather than efficient random access, and again we go on the quest for an alternative container type.

ByteStrings

The obvious and immediate choice here would be to use a ByteString as our new container type.  ByteStrings are so attractive because they provide constant time indexing and length checks in addition to extremely efficient stream-fused traversals, all of which should provide some big wins across the entire algorithm.

When considering which types to change, it is important to note that string s2 is the only one ever indexed into.  However, because s1 and s2 are built from the same word list, it seems to make sense to have them be of the same type to avoid unnecessary calls to pack or unpack, which can be a major source of inefficiency when using ByteStrings.  This is something that we will discuss in more detail in the conclusion, mainly to lead into the next post in our series, but for now we operate under the assumption that the preceding statement is correct.

The resultant types of our functions are shown below:

jaro :: ByteString -> ByteString -> Double
matches :: ByteString -> ByteString -> (IntSet, String)

For the most part the available ByteString operations mirror the String operations provided by the Prelude, so the code changes are predominantly just qualifications to tell GHC which version to use (note: we imported qualified as BS).  Given this, we will avoid showing the updated code in full for brevity’s sake.  The profiling results, though, are shown below in their entirety.
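
That said, the flavor of the change is easy to convey.  For instance, the find predicate measured by matchesPred becomes

p idx2 = idx2 `IS.notMember` used && c1 == s2 `BS.index` idx2

where BS.index is a constant-time lookup.  (We assume the Data.ByteString.Char8 interface here, since the algorithm folds over and compares characters rather than Word8 values.)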

total time  =        0.51 secs
total alloc = 2,739,268,472 bytes
wall clock  = 13.066119 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              64.4   43.7
matchesPred                Main              10.2    0.0
matches                    Main               9.8   10.5
jaro                       Main               5.9   13.6
matchesFold                Main               4.7   22.5
matchesSnoc                Main               2.9    5.6
matchesHi                  Main               1.4    2.6
matchesWindow              Main               0.6    1.5

The first thing that we notice is that the switch to ByteStrings has pushed our allocation back up to a level near where it was before we made the switch to an IntSet to carry our match indices.  The tradeoff, though, is significant performance increases across the board, shaving almost two full seconds off of the wall clock time.

Looking at our code again, we recognize that the expressions within the matchesWindow, matchesPred, and matchesHi cost centers have now all been reduced to either constant-time or primitive operations.  To clear up our profiling information from this point on, we will be removing those cost centers since there is not much else we can do to optimize them directly.  That does not mean that we are done, though.  Even though our test harness is now the dominant cost center by a very large margin, we still have a few tricks up our sleeves that might make the Jaro algorithm even faster.

Down the Rabbit Hole in Space Jam

All of the changes we have made up to this point have been, for the most part, algorithm agnostic.  We have identified a few hot spots and attempted to mitigate their cost by changing the data representations of the objects we are working with to make those operations more efficient; however, the structure of the code itself was left largely unchanged.  What follows in this section is an exploration of more “invasive” optimization attempts.  If the previous section was an allusion to simply following the white rabbit (profiling output) to guide our path, this section would best be described as again following the rabbit, but this time asking more in-depth questions like, “Why is Bill Murray here?”

The Bill Murray in this case is the large amount of allocation still happening in the matches function, specifically in our matchesFold cost center.  Whenever you see allocation issues with a fold you should think to yourself, maybe now is the time to try a strict version of that fold.  In fact, the ByteString library provides a very nice and efficient version of foldl' to use.  The issue, though, is that we are not currently folding over the string directly, but rather over the zip of the string and an incrementing counter.  To use the ByteString traversals we must fuse that zip in by tracking the counter at each iteration of the fold.

While we are fusing traversals we should also pay attention to the jaro function.  Currently we traverse string s2 by indexing into it, only to traverse the result again with a zip and then a filter.  In parallel we also calculate the value of m by taking the length of our matches string.  All of these passes can, and should, be fused into a single loop.  The new code for all of these changes is shown below:

jaro :: ByteString -> ByteString -> Double
jaro s1 s2
 | 0 == m = 0
 | otherwise = (m / n1 + m / n2 + (m - t) / m) / 3
 where
   n1 = fromIntegral $ BS.length s1
   n2 = fromIntegral $ BS.length s2

   (used, m1) = matches s1 s2
   m = fromIntegral $ length m1

   t = loop (IS.toAscList used) m1 0 where
     loop (i : is) (c : m1') = loop is m1' . f
         where f = if c /= s2 `BS.index` i then (+ 1)
                   else id
     loop _ _ = (/ 2)

matches :: ByteString -> ByteString -> (IntSet, String)
matches s1 s2 = {-# SCC "matchesFold" #-}
                snd $ BS.foldl' snoc initst s1 where
 window = max 0
            (max (BS.length s1) (BS.length s2) `div` 2 - 1)
 initst = (0, (IS.empty, ""))

 snoc (idx1, st@(used, m1)) c1 =
   case find p [lo .. hi] of
     Nothing -> (idx1 + 1, st)
     Just idx2 -> {-# SCC "matchesSnoc" #-}
                  (idx1 + 1, (idx2 `IS.insert` used
                             , m1 ++ [c1]))
   where
     p idx2 = idx2 `IS.notMember` used
              && c1 == s2 `BS.index` idx2
     lo = max 0 (idx1 - window)
     hi = min (BS.length s2 - 1) (idx1 + window)
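
As a quick sanity check on the fused transposition loop, consider the classic pair “martha” / “marhta” (an example of ours, not drawn from the word list): all six characters match, so m = 6, and the t/h swap contributes two mismatched positions, so t = 2 / 2 = 1, giving (6/6 + 6/6 + (6 - 1)/6) / 3 ≈ 0.944.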

And its profiling information:

total time  =        0.77 secs
total alloc = 2,698,599,372 bytes
wall clock  = 11.650707 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              62.3   44.3
matches                    Main              21.9   32.6
jaro                       Main               9.6   13.5
matchesFold                Main               3.5    1.3
matchesSnoc                Main               2.6    8.3

Looking at this profiling information it’s hard to instantly declare whether this was a success or not.  Recall that we removed some custom cost centers and, as discussed previously, that probably distorted the total time and allocation results.  However, if we operate under the assumption that the relative cost percentages are accurate, then we can claim that we shrank the % allocation for the matchesFold cost center significantly.  If that wasn’t indicative enough of a clear win, we also shaved almost another full second and a half off of the wall clock time.  The tradeoff, though, was that % allocation rose in a few places, notably in the matches function itself.

Let’s see if we can fix this by pushing strictness a bit further.  Rather than have each iteration of our fold return nested tuples, let’s instead have it work over a record type with strict fields.  Furthermore, let’s hit it with the sledgehammer that is the -funbox-strict-fields compilation flag to see if we can squeeze any more performance out of it.  Note that the code below requires the RecordWildCards and NamedFieldPuns language extensions: these let us write the more concise patterns St{used, m1} and St{..}.
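
The record we fold over carries the running index alongside the accumulated state, with every field strict; the declaration below is our assumed reconstruction (only the later variant, with mk in place of m1, appears explicitly further down):

data St = St { idx1 :: !Int, used :: !IntSet
             , m1 :: !String }  -- assumed declaration; mirrors the variant shown later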

matches :: ByteString -> ByteString -> (IntSet, String)
matches s1 s2 = (used, m1) where
 window = max 0
            (max (BS.length s1) (BS.length s2) `div` 2 - 1)
 St{used, m1} = {-# SCC "matchesFold" #-}
                BS.foldl' snoc initst s1
 initst = St { idx1 = 0, used = IS.empty, m1 = "" }

 snoc st@St{..} c1 =
   (case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> {-# SCC "matchesSnoc" #-}
                  st { used = idx2 `IS.insert` used
                     , m1 = m1 ++ [c1] }
   ) { idx1 = idx1 + 1 }
   where
     p idx2 = idx2 `IS.notMember` used
              && c1 == s2 `BS.index` idx2
     lo = max 0 (idx1 - window)
     hi = min (BS.length s2 - 1) (idx1 + window)

And its profiling information:

total time  =        0.26 secs
total alloc = 2,525,880,780 bytes
wall clock  = 11.152646 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              68.5   47.4
matches                    Main              16.3   31.8
jaro                       Main               8.9   14.5
matchesFold                Main               3.5    1.7
matchesSnoc                Main               2.7    4.7

While not as impressive a change as some of our previous tweaks, moving to the strict record type still improved our performance by about half a second in wall clock time.  The last thing we can attempt in order to decrease our allocation is to move away from using append in the matchesSnoc cost center and instead build our matches string with a difference list: we cons each character onto a function of type String -> String and apply it to [] at the end.

data St = St { idx1 :: !Int, used :: !IntSet
             , mk :: !(String -> String) }

matches :: ByteString -> ByteString -> (IntSet, String)
matches s1 s2 = (used, mk []) where
 window = max 0
            (max (BS.length s1) (BS.length s2) `div` 2 - 1)
 St{used, mk} = {-# SCC "matchesFold" #-}
                BS.foldl' snoc initst s1
 initst = St { idx1 = 0, used = IS.empty, mk = id }

 snoc st@St{..} c1 =
   (case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> {-# SCC "matchesSnoc" #-}
                  st { used = idx2 `IS.insert` used
                     , mk = mk . (c1 : ) }
   ) { idx1 = idx1 + 1 }
   where
     p idx2 = idx2 `IS.notMember` used
              && c1 == s2 `BS.index` idx2
     lo = max 0 (idx1 - window)
     hi = min (BS.length s2 - 1) (idx1 + window)

And its profiling information:

total time  =        0.67 secs
total alloc = 2,523,483,604 bytes
wall clock  = 11.279014 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              82.2   47.4
matches                    Main              10.9   31.9
jaro                       Main               4.6   14.5
matchesFold                Main               1.2    1.7
matchesSnoc                Main               1.0    4.6

Looking at the total allocation and wall clock time one would be tempted to say that this change did not really do much.  Interestingly enough, though, if you look at the %time numbers, almost all cost centers pushed a significant amount of time to the main function.  Theoretically a more efficient test harness could leverage this change for some bigger gains down the road.

Results

We finish by stripping out our SCC annotations, adding an INLINE pragma for matches, manually inlining our find predicate, and performing common subexpression elimination where the ByteString lengths were used.  This exhausts our optimization efforts, at least for this post.  The resultant profiling information is shown below.
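
We will spare you the full final listing; a sketch of what those tweaks amount to (take it as an approximation rather than the exact code) looks like this: the find-based predicate becomes a hand-written search loop, and the ByteString lengths are computed once and shared.

{-# INLINE matches #-}
matches :: ByteString -> ByteString -> (IntSet, String)
matches s1 s2 = (used, mk []) where
 len1 = BS.length s1              -- lengths computed once and shared (CSE)
 len2 = BS.length s2
 window = max 0 (max len1 len2 `div` 2 - 1)
 St{used, mk} = BS.foldl' snoc initst s1
 initst = St { idx1 = 0, used = IS.empty, mk = id }

 snoc st@St{..} c1 =
   (go (max 0 (idx1 - window))) { idx1 = idx1 + 1 }
   where
     hi = min (len2 - 1) (idx1 + window)
     -- manual inlining of the old find predicate as a search loop
     go idx2
       | idx2 > hi = st
       | idx2 `IS.notMember` used && c1 == s2 `BS.index` idx2 =
           st { used = idx2 `IS.insert` used, mk = mk . (c1 :) }
       | otherwise = go (idx2 + 1)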

total time  =        0.87 secs
total alloc = 2,593,431,612 bytes
wall clock  = 10.096731 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              63.8   46.1
jaro                       Main              36.1   53.9

Compared to the profiling information from our starting point, we have sped up our implementation by almost a full five seconds of wall clock time, roughly one third of our total execution time.  Furthermore, our total allocation shrank by over six hundred million bytes.  Not too shabby for a few hours’ worth of work.

The Pool of Tears or: How I Learned to Stop Crying and Love Benchmarking Libraries

This blog post is not a complete summary of our optimization efforts; to be honest, we got quite “stupid” with it.  In fact, Nick will be following this post with a few more detailing our efforts to see if we could find a data type that could outperform ByteStrings, as well as his own wild journeys off into the world of minute algorithmic variations.  Regardless of whether you have enjoyed this post or not (Space Jam jokes are not for everyone), I encourage you to check the future ones out.  The reason I mention this, in addition to providing a moderately decent introduction to the next posts in this series, is to set up the following anecdote.

Towards the end of our optimization fun-time, perhaps a week in at that point, we were playing with a variation of the Jaro algorithm that was barely recognizable.  The code was rife with language extensions, unboxed types and primitives, and more calls to unsafePerformIO than I honestly feel comfortable admitting.  Given how far we had travelled away from the original implementation, we built a few sanity checks into our test harness to confirm that our manual optimizations weren’t changing the input/output behavior of jaro.  Another thing we got into the habit of doing was clocking the time with both a profiled and a non-profiled build, if only to have a small, meaningless victory with every test we ran.  We had just made another arbitrary change and run the tests when something terrible happened: the sanity check passed for the non-profiled version of the code and failed for the profiled version.

Profiling had just affected the evaluation of our code.  That’s not good.

Those of us fortunate enough to have the newest generation of unibody Macs (read: not me) already knew that there were some issues with profiling on that architecture, given that most attempts to profile came back with zeros across the board.  Looking for answers, Nick turned to the Haskell Cafe for help, where he was pointed to this GHC bug report.  In short, profiling does not play nicely with the optimization flags, including the default -O0 flag (no optimizations).  Basically, profiling is currently broken and not to be trusted.

Obviously this was a problem, given that I had just written an entire blog post detailing an optimization methodology based almost entirely on profiling.  Oh, that reminds me: all of the profiling numbers in this post are probably wrong, so take them with a grain of salt.  This brings up a very good point that I neglected to mention earlier: there is no magic bullet of optimization.  Profiling, while powerful when it works, will never be entirely sufficient on its own.  Thankfully we included benchmarking code in our test harness, so we were never proceeding truly blind.

That statement leads to my next point: do whatever you can to avoid writing your own benchmarking code.  For any well-established programming language, someone else will invariably have already developed a robust benchmarking library.  If you can, use it.  For example, Bryan O’Sullivan has developed a fantastic benchmarking library for Haskell called criterion that I wish we had used from the beginning.  Instead, we spent almost as much time changing and improving our test harness as we did optimizing the Jaro algorithm.  Even with all of that effort, when it came time to execute a large batch of tests, roughly 100 tests with an estimated run time of 10+ hours, we rewrote everything to use criterion.  The reason was simple: it was the better testing framework.
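
For the curious, a minimal criterion harness in the spirit of ours might look like the sketch below (the file name, benchmark label, and grouping are illustrative assumptions, and the jaro from this post is presumed to be in scope):

import Criterion.Main
import qualified Data.ByteString.Char8 as BS

main :: IO ()
main = do
  -- load the word list up front; only the jaro cross product is benchmarked
  ns <- fmap (map BS.pack . words) (readFile "words.txt")
  defaultMain
    [ bench "jaro cross product" $
        nf (\ws -> [ jaro n m | n <- ws, m <- ws ]) ns ]

criterion then handles the timing, the repeated runs, and the statistics that our hand-rolled harness only approximated.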

I think both of these points are probably things that I’ve heard before.  Undoubtedly some professor tried to sagely pass on this knowledge during my undergraduate days, probably during the phase where I thought I would never use any of that material ever again (I’m still crossing my fingers that this holds true for C++).  The fact of the matter, though, is that it didn’t really ring true until I ran into these issues myself.

So if you get only one thing out of reading this post it should be the following: Participate in your own optimization experiment and feel free to get “stupid” with it.  Hopefully you will learn an enormous amount about the language you are working in.  Even more importantly, hopefully you will have a lot of fun.

Thank you for reading and see you next time.  Same functional time.  Same functional place.  Same functional channel.

-Evan