Category Archives: Haskell

A Redshift Detour: Proving in Haskell with Singleton Types

Disclaimer: I do eventually remark about unsoundness.

Colors blunderingly plucked from
Ethan Schoonover’s style.css.

Full code in this gist.

Edit 10:52 Central 11 Oct 2012: added mention of kamatsu’s proof.

I’m going to do some proving in Haskell in this post using singleton types. There are no new fundamentals here, but I’ve not seen much demonstration of these techniques. As I understand it, there’s a research agenda (with a strong UPenn contingent) to enrich Haskell’s type system to the point where translation from a vanilla Agda development to Haskell would be mechanical. I’m guessing that the code I’m sharing here is a very human-readable version of what such a tool from the future might output from a student’s first Agda homework assignment of the semester.

This post is self-contained, but it relies heavily on concepts that also live in at least a couple of Hackage packages that I’m aware of. Richard Eisenberg’s singletons package and his Haskell 2012 paper with Stephanie Weirich provide tremendously convenient support for dependently-typed programming with singleton types. In this post, I’m going to focus more on proving things, and I use singleton types to do so. I remark below on how this sort of proving can help with programming.

The other package that has a lot of related content is Edward Kmett’s constraints package. In particular, my evidence for implications is largely isomorphic to Kmett’s \\ function.

OK, on with the show.

A Framework for Constructive Type Theory in Haskell

One of the fundamental lessons from Harper’s lectures at this year’s Oregon Programming Language Summer School was the coarse difference between classical and constructive logic. Classical logicians only care about provability, whereas constructive logicians are also interested in evidence. In this post, I’m modeling propositions as Constraints and their proofs as dictionaries. Actually, I’m treating a proof as an implication with a trivial antecedent, the () Constraint. The evidence of an implication is a function that squeezes a dictionary for the consequent’s Constraint out of a dictionary for the antecedent’s Constraint. For ease of use, I’m modeling dictionaries via rank-2 continuations instead of the Dict GADT.

type p :=> q = forall a. p => (q => a) -> a

type Proof p = () :=> p

That’s really all of the CTT fundamentals we’ll need; GHC’s type rules for constraints take care of the rest.
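As a quick sanity check, here’s a tiny, self-contained use of these two synonyms. The proposition Eq Int is trivially true (its dictionary is globally available), so its proof just runs the continuation. This is my own minimal sketch, not code from the post’s gist:

```haskell
{-# LANGUAGE ConstraintKinds, RankNTypes, TypeOperators #-}

-- The post's two definitions, repeated so this compiles standalone.
type p :=> q = forall a. p => (q => a) -> a
type Proof p = () :=> p

-- Evidence for a trivially true proposition: the Eq Int dictionary
-- is globally available, so we can invoke the continuation directly.
eqInt :: Proof (Eq Int)
eqInt k = k

-- Discharging the proved constraint at a use site.
same :: Bool
same = eqInt (3 == (3 :: Int))
```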

Just to be clear, this is just one methodology of “proving” in Haskell. Here’s a different proof by kamatsu of the ultimate theorem discussed in this post. My motivation here is to show some techniques for “thinly” veiled dependent type theory in Haskell.


We’re ultimately going to prove that addition on naturals is commutative, so we need to put some familiar pieces on the board.

data Nat = Z | S Nat

data Nat1 :: Nat -> * where
  Z1 :: Nat1 Z
  S1 :: Nat1 n -> Nat1 (S n)

Nat is a datatype, but we’ll be using it as a type via promotion. Nat1 is the singleton type for Nat. Its key property as a singleton type is that a value of type Nat1 n fully determines the type n (and vice versa). Since the value determines the type, this gives us leverage for dependent typing. Again, see the Eisenberg and Weirich paper for more. Also note that the singletons package could generate this Nat1 declaration, as well as many other kinds of declaration.


Naturals are an inductive datatype, and so they come with an induction principle. In dependent programming and CTT, we can understand an induction principle as a dependently-typed fold operator, called a “dependent eliminator.” In Haskell with singleton types, it looks like this for Nat1.

type BaseCase      c = Proof (c Z)
type InductiveStep c = forall n. Nat1 n -> c n :=> c (S n)

ind_Nat :: Predicate c ->
           BaseCase c -> InductiveStep c ->
           Nat1 n -> Proof (c n)
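The signature above mentions a Predicate type that the post never shows; presumably it’s just a value-less proxy that pins down the constraint-level predicate c. A minimal, hypothetical reconstruction (polykinded here; the gist’s version may be specialized to Nat -> Constraint):

```haskell
{-# LANGUAGE ConstraintKinds, KindSignatures, PolyKinds #-}

import Data.Kind (Constraint)

-- Hypothetical reconstruction: a value-less proxy fixing the
-- constraint-level predicate c at the type level.
data Predicate (c :: k -> Constraint) = Predicate
```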

This is familiar from undergrad: a predicate c over the naturals is an inductive property if it has a proof for zero and if a proof for any n implies a proof for 1 + n. Induction says that, given those two proofs, I can build a proof for any n, which I think is a pretty simple reading of the type of ind_Nat. The definition of ind_Nat just puts this familiar notion of induction to work.

ind_Nat _ base _ Z1 = \k -> base k

ind_Nat pc base step (S1 n1) = \k ->
  ind_Nat pc base step n1 $ -- provides the inductive hypothesis
  step n1 k

In the above definitions, it probably helps to unfold the definition of Proof to see that k has a qualified type and that the code between the lambda binding of k and its occurrence is just building up a suitable context for that qualified type from the base and step arguments and the recursive call to ind_Nat.

Now that we have induction, let’s prove an inductive property!

Another common ingredient in the singleton type recipe is the class for building Nat1 from a Nat — this is the realizer of the “vice versa” part I slipped in when grossly explaining singleton types. This class effectively characterizes what it means for a type to be a natural, so I certainly hope it’s inductive…

class Natural (n :: Nat) where nat1 :: Nat1 n

inductive_Natural :: Nat1 n -> Proof (Natural n)
inductive_Natural = ind_Nat (Predicate :: Predicate Natural)
  (\k -> k) -- base case
  (\_ k -> k) -- inductive step

In fact, it is inductive, and in the most trivial way possible, as can be seen from those two arguments to ind_Nat. In a sense, “naturalness” is inductive by definition, which reminds me that we haven’t actually defined Natural yet, we’ve just declared it. We need instances.

In this case, I’m confident the instances are pretty self-explanatory, but I want to show a slightly odd way of deriving their shape. In Agda, you can put placeholders (called holes) in your incomplete proofs, and Agda will tell you what property each placeholder actually needs to prove. We can get much the same behavior here by inspecting the type errors generated by the most naive “placeholders”, \k -> k and \_ k -> k, for the base case and inductive step. Since we haven’t declared the necessary instances of Natural, we get the following errors, in which GHC specifies what instances we need in order for Natural to be an inductive property of Nats.

    Could not deduce (Natural 'Z) ...
    Possible fix: add an instance declaration for (Natural 'Z)

    Could not deduce (Natural ('S m)) ...
    or from (Natural m)
    Possible fix: add an instance declaration for (Natural ('S m))

To determine what implications are necessary for a more complicated property, this technique might be helpful. We’ll try it out again for the commutativity proof later.

The requisite instances are the usual ones for a singleton type’s builder class.

instance Natural Z where nat1 = Z1
instance Natural m => Natural (S m) where nat1 = S1 nat1
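For completeness, here is the development so far assembled into one compilable module (with a stand-in Predicate proxy, since the post elides it), plus a small demonstration that the proof really yields a usable Natural dictionary: it recovers nat1 for an arbitrary Nat1 and folds it back to an Int. The roundTrip function is my own illustration, not from the gist:

```haskell
{-# LANGUAGE ConstraintKinds, DataKinds, GADTs, KindSignatures,
             RankNTypes, ScopedTypeVariables, TypeOperators #-}

import Data.Kind (Constraint)

type p :=> q = forall a. p => (q => a) -> a
type Proof p = () :=> p

data Nat = Z | S Nat

data Nat1 :: Nat -> * where
  Z1 :: Nat1 Z
  S1 :: Nat1 n -> Nat1 (S n)

-- Stand-in for the post's elided Predicate proxy.
data Predicate (c :: Nat -> Constraint) = Predicate

type BaseCase      c = Proof (c Z)
type InductiveStep c = forall n. Nat1 n -> c n :=> c (S n)

ind_Nat :: Predicate c -> BaseCase c -> InductiveStep c
        -> Nat1 n -> Proof (c n)
ind_Nat _  base _    Z1      = \k -> base k
ind_Nat pc base step (S1 n1) = \k ->
  ind_Nat pc base step n1 (step n1 k)  -- inductive hypothesis

class Natural (n :: Nat) where nat1 :: Nat1 n
instance Natural Z where nat1 = Z1
instance Natural m => Natural (S m) where nat1 = S1 nat1

inductive_Natural :: Nat1 n -> Proof (Natural n)
inductive_Natural = ind_Nat (Predicate :: Predicate Natural)
  (\k -> k)   -- base case
  (\_ k -> k) -- inductive step

-- Count the constructors of a singleton.
toInt :: Nat1 n -> Int
toInt Z1     = 0
toInt (S1 n) = 1 + toInt n

-- Use the proved dictionary: rebuild the singleton via nat1.
roundTrip :: forall n. Nat1 n -> Int
roundTrip n = inductive_Natural n (toInt (nat1 :: Nat1 n))
```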

Now the inductive_Natural definition type checks; we’ve “proved” it. “Proved” because Haskell wasn’t designed for proving things: the pervasive laziness bites us here insomuch as the GHC Haskell type system is not a sound realization of the Curry-Howard isomorphism. For example, here’s another definition of inductive_Natural that GHC happily accepts as well-typed.

inductive_Natural = let x = x in x

Just because we have a well-typed term of type Proof doesn’t mean that Proof is actually true. Note well that RHSs can be accidentally equivalent to this one in any number of insidious ways. If you want soundness, use a proof assistant. That is, extract your Haskell code from proofs in Coq, in Isabelle, or in Agda. CRUCIAL DISCLAIMER: I never said the extraction would be easy…

So why do this?

Extracting Haskell via proof assistants is one end of the spectrum. But common Haskell practice has been creeping in that direction from the other end of the spectrum for quite some time now. Just because our type system isn’t sound doesn’t mean we can’t write valid proofs in it, if it would behoove our program. … And when would it do that?

I’ll be honest: the majority of the times I’ve applied these techniques, I’ve later realized I didn’t actually need them — there was a better simplification to be made. However, not always. One lesson I’ve learned while enjoying Conor McBride’s online lectures is that Curry-Howard is vastly easier to use when the recursion schemes of your type definitions match the recursion schemes of the relevant value-level definitions that involve those types. When they don’t match up, however, you might be able to overcome the mismatch via the techniques sketched in this post. For example, if you define addition as I do in the next section, you might have to do a bit of extra work in order to get the accumulator-based definition of reverse for vectors (ie fixed-length lists encoded with a GADT) to type-check. Note that I said might; I’m just estimating based on my experiences…

Finally, the proof

Recall that we’re proving that Plus is commutative; here’s the set up.

type family Plus (n :: Nat) (m :: Nat) :: Nat
type instance Plus Z     m = m
type instance Plus (S n) m = S (Plus n m)
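As an aside on how this family computes, its value-level analogue on singletons mirrors it equation for equation. This self-contained sketch (repeating the post’s Nat definitions) type-checks precisely because GHC reduces Plus the same way in each clause; plus1 itself is my addition, not code from the post:

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeFamilies #-}

data Nat = Z | S Nat

data Nat1 :: Nat -> * where
  Z1 :: Nat1 Z
  S1 :: Nat1 n -> Nat1 (S n)

type family Plus (n :: Nat) (m :: Nat) :: Nat
type instance Plus Z     m = m
type instance Plus (S n) m = S (Plus n m)

-- Value-level addition on singletons, mirroring the family
-- equation-for-equation; each clause checks by reducing Plus.
plus1 :: Nat1 n -> Nat1 m -> Nat1 (Plus n m)
plus1 Z1     m = m
plus1 (S1 n) m = S1 (plus1 n m)
```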

class    (Plus n m ~ Plus m n) => Plus_Comm (n :: Nat) (m :: Nat)
instance (Plus n m ~ Plus m n) => Plus_Comm  n          m

plus_Comm :: forall n. Nat1 n -> forall m. Nat1 m ->
  Proof (Plus_Comm n m)

The property we’re really after is Plus n m ~ Plus m n, for any n and m. Note that this is precisely the declared type of plus_Comm. The Plus_Comm class just organizes the bindings of n and m (in an arbitrary way) so that we can use the ind_Nat function.

Since Plus recurs on its left argument, we’ll induct on that one.

-- ill-typed first guess
plus_Comm n1 = ind_Nat (Predicate :: Predicate (Plus_Comm n))
  (\k -> k)   -- the most naive first guess at base case
  (\_ k -> k) -- the most naive first guess at inductive step

This doesn’t type-check, but again the type errors are telling.

    Couldn't match type `m' with `Plus m 'Z'
      `m' is a rigid type variable bound by
          the type signature for
            plus_Comm :: Nat1 n -> Nat1 m -> Proof (Plus_Comm m n)
          at InductiveConstraints.hs:72:22

    Could not deduce (Plus n ('S m1) ~ 'S (Plus n m1))
    or from (Plus_Comm n m1)

The first one comes from our base case. It says we need to explain to GHC how Plus Z m ~ Plus m Z. Recall that Plus inspects its first argument, and in this property that’s a universally quantified type variable of kind Nat. We just need to use induction again, but on m this time.

-- well-typed actual proof
plus_Comm n1 = ind_Nat (Predicate :: Predicate (Plus_Comm n))
  -- base case
  (let plus0 :: forall n. Nat1 n -> Proof (Plus_Comm Z n)
       plus0 = ind_Nat (Predicate :: Predicate (Plus_Comm Z))
              (\x -> x)   -- base case
              (\_ x -> x) -- inductive step
   in \x -> plus0 n1 x)
  -- inductive step below

The second error comes from the naive guess for the inductive step. It says that GHC doesn’t know how Plus_Comm n m1 should imply Plus n ('S m1) ~ 'S (Plus n m1). Again, Plus is blocked on a variable, this time n. So let’s induct on n in another lemma.

  -- the well-typed inductive step
  (let lemma :: forall m. Nat1 m -> Nat1 n ->
                Plus_Comm n m :=> Plus_Comm_Lemma m n
       lemma _ = ind_Nat
         (Predicate :: Predicate (Plus_Comm_Lemma m))
         (\x -> x) (\_ x -> x)
   in \m1 x -> lemma m1 n1 x)

class    (Plus n (S m) ~ S (Plus n m)) => Plus_Comm_Lemma m n
instance (Plus n (S m) ~ S (Plus n m)) => Plus_Comm_Lemma m n

Again, the class just imposes some binding order on the ~ constraint so that we can use our ind_Nat function.

Q.E.D. … sort of, as I qualified before.

That’s very similar to one way to prove commutativity in Coq, by the way. In that sense, the above is an actual proof, regardless of Haskell’s type system’s unsoundness. With the appropriate totality analysis (and maybe something else, I don’t know), I’m pretty sure we could verify that we used only sound inference in the above. In other words, the Core coercions that the proof reduces to when GHC compiles it will be safe at run-time.

Final Remarks

I didn’t show any interesting inductive type classes; Natural is a pretty cheesy example, but there are other inductive classes. However, the proof of ~ properties is a particularly interesting thing to do because there’s no other way to extend it. At least not for something like Plus commutativity, as far as I know.

That’s also the reason I used ConstraintKinds as properties: because I wanted to reason about ~ type equalities. There are probably other possible models for properties in the Haskell type system, but I can’t think of anything as interesting at the moment.

Thanks for reading. I hope you enjoyed it.


Project Redshift: expanding the generic-deriving/GHC.Generics universe, Phase 1

Summary: the :.: type from generic-deriving/GHC.Generics precludes the use of indexed datatypes inside of the represented datatypes; I propose a way to lift that restriction, but don’t have a compelling example that needs the added expressivity.

In generic-deriving — and hence in GHC.Generics — the :.: type (read “compose”) is used to represent internal applications. I’ll explain how in a bit.

newtype (f :.: g) p = Comp1 (f (g p))

In this post, I’m proposing an alternative to :.:. This alternative is designed to allow indexed datatypes as the f argument of :.:; currently such use of indexed datatypes results in either ill-typed or unusable instances of Generic1. I explain why they’re currently not supported, stop to wonder about the benefit of adding this support for indexed datatypes, and then demonstrate how to do it.

The Problem with :.:

In generic-deriving, f and g parameters of :.: are both supposed to be representations. In GHC.Generics, f is a normal type (ie an unrepresented type like [], Maybe, or IO) and g is a representation. That discrepancy is my fault — I talked Pedro into it. So if it’s ultimately the wrong move… mea culpa. (If you’re interested in why it’s beneficial, please ask in the comments and I’ll blog about it.)

To set the stage, let’s see :.: in action. Consider representing the MyList type.

data MyList f a = MyNil | MyCons [f a] (MyList f a)

The :.: representation type handles applications of types (excepting recursive occurrences — that’s what Rec1 is for). In the MyCons constructor, two types are applied: [] and f. These applications are what I call internal applications; they occur inside the datatype’s fields. Since there are two internal applications in MyList, there are two resulting uses of :.: in the representation.

instance Functor f => Generic1 (MyList f) where
  type Rep1 (MyList f) =
    U1     :+:     [] :.: (f :.: Par1)   :*:   Rec1 (MyList f)

In order to define the methods of Generic1 for representations involving :.:, we use fmap — hence the Functor f constraint. In general, any internally applied type must be a Functor and results in a corresponding constraint in the Generic1 instance.

  from1 MyNil = L1 U1
  from1 (MyCons x y) =
    R1 $ Comp1 (fmap (Comp1 . fmap Par1) x) :*: Rec1 y
  to1 (L1 U1) = MyNil
  to1 (R1 (Comp1 x :*: Rec1 y)) =
    MyCons (fmap (fmap unPar1 . unComp1) x) y

I’ll let you convince yourself that that instance declaration type-checks in GHC 7.6.1. It’s not the full GHC.Generics representation of MyList, but it’s sufficiently characteristic for this post.

That Functor f constraint is the focus of this post. In this example, it prevents us from using generic functions on MyList f when f is an indexed datatype, because indexed datatypes generally cannot be Functors. My proposed alternative to :.: eliminates the requirement that internally applied types be Functors.

… Why?

I actually don’t have a devastatingly compelling reason to accommodate internal applications of indexed datatypes. I figured out the stuff in this post before realizing that the generic functions I had in mind all ultimately require (independent of the jump through Generic1) that internally applied types be Functors. There is, though, still the slightest hope for usefulness.

I have come up with one class whose generic definition would work with internally applied indexed datatypes: the EqT class. And indeed, instances of EqT are tedious to write. However, two things render this class a mediocre motivator.

  • It’s not terribly common.

  • I don’t know how to write a generic definition of EqT in generic-deriving/GHC.Generics. (I am, however, making promising progress on one with my polyvariadic variant of generic-deriving.)


So I have some questions for you, the reader:

  • What are the generic functions used on indexed datatypes?

  • Do any of them not require the internally applied types to be covariant?

  • Is EqT the only one? Is it compelling to you?

If you’re still interested in how to remove that Functor constraint, read on.

My proposal: :@: in place of :.:

Full code in this gist.

We can eliminate that constraint by using :@: in place of :.:.

-- newtype (f :.: g  ) p = Comp1 (f (       g   p))

newtype    (t :@: rep) p = App1 { unApp1 :: t (UnRep1 rep p) }

The UnRep1 type family is the linchpin. It is only ever applied to representations of constructor arguments: Par1, K1, Rec1, and :@:.

type family UnRep1 (rep :: * -> *) (p :: *) :: *

type instance UnRep1 Par1     p = p
type instance UnRep1 (K1 i c) p = c
type instance UnRep1 (Rec1 f) p = f p
type instance UnRep1 (t :@: rep) p = t (UnRep1 rep p)

This family — as you may have guessed — in a sense does the opposite of Rep1: it converts a representation of a constructor argument back into the actual constructor argument. Note well that the names I’ve given are a bit misleading, since UnRep1 is not fully inverse to Rep1. The Rep1 family describes the structure of an entire datatype declaration, whereas the UnRep1 family forgets the structure of a type term — so it’s more so an inversion of the spine view. What UnRep1 is actually undoing is the reification of a constructor argument’s structure performed by the generic-deriving Template Haskell code or by the -XDeriveGeneric extension of GHC.
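To see UnRep1 in action on the earlier MyList example: the representation of the MyCons field [f a] unwinds back to the field’s actual type. Here is a self-contained sketch (reusing Par1, K1, and Rec1 from GHC.Generics, and instantiating f to Maybe so the check is concrete); the fieldType witness is my illustration:

```haskell
{-# LANGUAGE KindSignatures, TypeFamilies, TypeOperators,
             UndecidableInstances #-}

import Data.Type.Equality ((:~:)(Refl))
import GHC.Generics (Par1, K1, Rec1)

newtype (t :@: rep) p = App1 (t (UnRep1 rep p))

type family UnRep1 (rep :: * -> *) (p :: *) :: *
type instance UnRep1 Par1        p = p
type instance UnRep1 (K1 i c)    p = c
type instance UnRep1 (Rec1 f)    p = f p
type instance UnRep1 (t :@: rep) p = t (UnRep1 rep p)

-- The representation of MyCons's field [f a] unwinds back to the
-- field's actual type; with f = Maybe and p = Int both sides of
-- this equality reduce to [Maybe Int].
fieldType :: UnRep1 ([] :@: (Maybe :@: Par1)) Int :~: [Maybe Int]
fieldType = Refl
```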

Since I’m showing that :@: totally supplants :.:, I need to show that generic definitions for :.: can also be defined for :@:. I’ll just do Functor — classes like Eq and Show can be derived via -XStandaloneDeriving. Enum also works.

We’ll need the UnGeneric1 class in order to do this modularly. It supports UnRep1 in the same way that the Generic1 class supports Rep1, and it’s not otherwise remarkable.

class UnGeneric1 (rep :: * -> *) where
  fromUn1 ::        rep p -> UnRep1 rep p
  toUn1   :: UnRep1 rep p ->        rep p
instance UnGeneric1 Par1     where
  fromUn1 = unPar1 ; toUn1 = Par1
instance UnGeneric1 (K1 i c) where
  fromUn1 = unK1   ; toUn1 = K1
instance UnGeneric1 (Rec1 f) where
  fromUn1 = unRec1 ; toUn1 = Rec1
instance UnGeneric1 (t :@: rep) where
  fromUn1 = unApp1 ; toUn1 = App1

Using fromUn1, toUn1, and a small type ascription, the Functor instance for :@: is just a tweak of the conventional instance for :.:.

-- instance (Functor f, Functor g) => Functor (f :.: g) where
--   fmap f = Comp1 . fmap (     (fmap f)) . unComp1

instance (Functor t, Functor rep, UnGeneric1 rep) =>
  Functor (t :@: rep) where
     fmap f = App1  . fmap (wrap (fmap f)) . unApp1 where
       wrap g = fromUn1 .
         g . (id :: forall x. rep x -> rep x) . toUn1

It’s also a drop-in replacement for :.: in Rep1 instances, and it simplifies the method definitions, to boot.

instance Generic1 (MyList f) where
  type Rep1 (MyList f) =
    U1      :+:      [] :@: (f :@: Par1)   :*:   Rec1 (MyList f)
  from1 MyNil = L1 U1
  from1 (MyCons x y) = R1 $ App1 x :*: Rec1 y
  to1 (L1 U1) = MyNil
  to1 (R1 (App1 x :*: Rec1 y)) = MyCons x y

Note well the absence of the Functor f constraint — that’s basically the whole point.


So, using :@: in place of :.: does two things.

  • It weakens the context of Generic1 instances. As a particular result, datatypes that contain internal applications of indexed datatypes can be instances of Generic1.

  • It slightly complicates the instances of generic classes for type application. In particular, it requires a bit of additional coercion; compare and contrast the Functor instance for :.: to the one for :@: to get a feel for it.

I am not currently aware of an existing generic definition that actually requires this weakening of the Generic1 instances’ contexts. I’m working on one for EqT, but it’s going to be rather complicated and isn’t the most compelling example. It’d be really nice to have an existing example that benefits from :@: — you don’t by chance know of one, do you?

I do have other possible generic definitions in mind that would benefit from :@:, but they’re rather experimental. Thankfully, they are enough for me to believe that :@: might actually end up being useful!

GHC syntax proposal: Guarded Instances

In GHC ticket #5590, I proposed some new syntactic sugar with an elaborative semantics. The objective is to make it easier to use type families for instance selection.

SPJ commented:

I’m afraid I do not understand the proposal in this ticket, yet anyway; an example or two doesn’t tell me its syntax, static semantics, internal language, or dynamic semantics.

I’m writing this post to emphasize that my proposal is technically just a new syntax with an elaboration semantics. I’ll still start with another example, but it’s a better one than I used in the ticket description. I’ll close with a more formal take on the proposed syntax and its elaboration semantics.

There are plenty of simpler examples of this technique, but I’m sharing an example here that

  1. uses rather familiar notions,
  2. I find compelling, and
  3. demonstrates the idea both for a class and for a type family.

In combination with total type families (possibly implemented via NewAxioms), this sugar could succinctly supplant much of the functionality of OverlappingInstances — stare deep into the Oleg.

Motivating Example

(Full code on hpaste.)

Consider writing a Functor instance for the * -> * composition type, :.:.

newtype (:.:) (f :: * -> *) (g :: * -> *) a = Comp {unComp :: f (g a)}

The wrinkle with writing a Functor instance for :.: is that there are two equally valid implementations, but they are syntactically ambiguous. Because they have identical instance heads, GHC won’t allow both of them. Conal Elliott laments this problem in the TypeCompose package.

class Contra f where contramap :: (a -> b) -> f b -> f a

instance (Functor f, Functor g) => Functor (f :.: g) where
  fmap f = Comp . fmap (fmap f) . unComp

instance (Contra f, Contra g) => Functor (f :.: g) where
  fmap f = Comp . contramap (contramap f) . unComp

Solution with current GHC syntax

Here’s a solution to this wrinkle that I consider to be the most direct. My basic proposal in #5590 merely eliminates the syntactic overhead of this workaround — it involves no new semantics or expressivity.

The two Functor instances are distinguished semantically based on the variance of f (or of g — it seems to be a stylistic choice). This semantic information is not present in the instance head, so GHC cannot distinguish the instances. The distinction is present in the constraints, but GHC ignores those during instance selection.

The idea of using the instance contexts to differentiate instances raises even more issues, because of the various extensions to instances like Overlapping or Incoherent. Type families, on the other hand, do not raise such issues because their instances are more restricted than are class instances. So we turn to type families.

We encode variance, the property we wish to use to distinguish our two Functor instances, as a type family. We’ll keep it simple for now and consider only co- and contra-variant parametric data types.

{-# LANGUAGE DataKinds, PolyKinds #-} -- etc.

data V = CoV | ContraV -- datakind for variance; see footnote [1]

data PROXY a = PROXY -- polykinded proxy, for passing types explicitly

-- polykinded variance w.r.t. the next argument of t
type family Variance (t :: kD -> kR) :: V

type instance Variance [] = CoV
type instance Variance Maybe = CoV
type instance Variance (->) = ContraV
type instance Variance ((->) a) = CoV -- etc.

We’ll need an example contravariant type of kind * -> *.

newtype IntSink a = IntSink (a -> Int)

type instance Variance IntSink = ContraV
instance Contra IntSink where
  contramap f (IntSink x) = IntSink (x . f)

This is the type for which we need that second Functor instance, the one that uses Contra. For example, our goal is to make the test function below well-typed.

ex :: (IntSink :.: IntSink) String
ex = Comp $ IntSink $ \(IntSink sink) -> sink "payload"

test :: (IntSink :.: IntSink) Int
test = fmap length ex

The well-typedness of test requires the instance Functor (IntSink :.: IntSink).

All of the code up to this point is motivation for the example; we’re just now getting into the bits relevant to #5590.

Using the Variance type family, we declare an auxiliary class that accommodates our two instances with distinct instance heads.

class Functor_V f (v :: V) where
  fmap_V :: PROXY v -> (a -> b) -> f a -> f b

instance (Functor f, Functor g) => Functor_V (f :.: g) CoV where
  fmap_V _ f = Comp . fmap (fmap f) . unComp

instance (Contra f, Contra g) => Functor_V (f :.: g) ContraV where
  fmap_V _ f = Comp . contramap (contramap f) . unComp

Note that Functor_V is totally determined by Functor and our choice to branch based on Variance. The method bodies are thus just transliterations of our intended Functor instances. We include the additional PROXY argument so that the method’s type determines all of the class parameters.

Finally, we declare the instance of Functor for :.: by delegating to the auxiliary class Functor_V.

{-# LANGUAGE ScopedTypeVariables #-}

instance (fvar ~ Variance f, Functor_V (f :.: g) fvar) =>
  Functor (f :.: g) where
  fmap = fmap_V (PROXY :: PROXY fvar)

Problem solved; our objective function test now type-checks as-is.

Not just for classes

Left-nestings of :.: will require an instance of Variance for :.:. We can use the same approach for the instance of Variance for :.: that we used for its Functor instance: use an auxiliary family indexed by V.

type instance Variance (f :.: g) = Variance_V (f :.: g) (Variance f)

type family Variance_V (a :: kD -> kR) (v :: V) :: V
type instance Variance_V (f :.: g) CoV = Variance g
type instance Variance_V (f :.: g) ContraV = NegateV (Variance g)

type family NegateV (v :: V) :: V
type instance NegateV CoV = ContraV
type instance NegateV ContraV = CoV
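A quick check that these families compute as intended: two contravariant layers compose to a covariant one, which is exactly why (IntSink :.: IntSink) ends up with the fmap-able behavior. This standalone sketch repeats just enough of the post’s declarations to state the equality:

```haskell
{-# LANGUAGE DataKinds, KindSignatures, PolyKinds, TypeFamilies,
             TypeOperators, UndecidableInstances #-}

import Data.Type.Equality ((:~:)(Refl))

data V = CoV | ContraV

newtype (f :.: g) a = Comp (f (g a))
newtype IntSink a = IntSink (a -> Int)

type family Variance (t :: kD -> kR) :: V
type instance Variance IntSink   = ContraV
type instance Variance (f :.: g) = Variance_V (f :.: g) (Variance f)

type family Variance_V (a :: kD -> kR) (v :: V) :: V
type instance Variance_V (f :.: g) CoV     = Variance g
type instance Variance_V (f :.: g) ContraV = NegateV (Variance g)

type family NegateV (v :: V) :: V
type instance NegateV CoV     = ContraV
type instance NegateV ContraV = CoV

-- Variance (IntSink :.: IntSink) reduces step by step to
-- NegateV ContraV, i.e. CoV: the composition is covariant.
check :: Variance (IntSink :.: IntSink) :~: CoV
check = Refl
```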

Both Variance_V and Functor_V just get an additional parameter. And both Variance (f :.: g) and Functor (f :.: g) instances just delegate to the *_V counterpart by passing in the appropriate V argument. These are two instantiations of the general technique:

Delegate to an auxiliary class/family that dispatches based on the result of a relevant type family.

My proposal in #5590 is syntactic sugar for that technique.

Solution with the proposed syntax

In the new syntax, we would write the following instead of the Functor_V class, its instances, and the delegating Functor instance for :.:.

guarded instance Functor (f :.: g) with (Variance f)

instance (Functor f, Functor g) => Functor (f :.: g) with CoV where
  fmap f = Comp . fmap (fmap f) . unComp

instance (Contra f, Contra g) => Functor (f :.: g) with ContraV where
  fmap f = Comp . contramap (contramap f) . unComp

It’s much the same for type instances; the following replaces the Variance_V class and its instances as well as the Variance instance for :.:.

type guarded instance Variance (f :.: g) with (Variance f)

type instance Variance (f :.: g) with CoV = Variance g
type instance Variance (f :.: g) with ContraV = NegateV (Variance g)

There are three major benefits of this syntax.

  1. The user isn’t required to invent names for the auxiliary class/family and duplicate the methods’ signatures.
  2. Instance selection is still straight-forward: a guarded instance just forces a bit of type computation and then proceeds with an extra type parameter.
  3. Everything remains open. In our example, the kind of the with parameter is closed since it’s a datakind. If it instead involved *, new with-instances could be declared as usual.

The New Syntax

My proposal enriches the notion of instance head (involving the new keyword with) and adds a new declaration form (involving the new keyword guarded).

In the rest of this post, I only discuss guarded instances of classes. Guarded instances of families are always simpler.

The definition of instance head is enriched to allow zero or more with-clauses on the end. A with-clause is the keyword with followed by one or more type expressions, which are separated syntactically in the same way as the parameters of a multi-parameter type class. These type expressions are called with arguments. Each occurrence of with in an instance head associates to the left; the left argument of a with-clause is called its source. For example, (a, b) with E1 with E2 is a with-clause with argument E2 and source (a, b) with E1. As a result of enriching instance heads, all instance declarations can now include zero or more with-clauses at the end of their head.

Each guarded instance has the syntactic form
guarded instance Θ => H with T0 T1 … Tn.

n must be positive. Θ is an instance context, H is the head of the guarded instance (and possibly includes further with-clauses), and each Ti is a type expression. All type variables occurring in those Tis must be bound by H or Θ.


Any instance (guarded or vanilla) with a with-clause in its head has a head of the shape H with Ω. For this instance head to be well-formed, there must be a guarded instance in scope of the shape Hg with Ωg where H is an instance of Hg with the substitution σ and Ω is of the same kind as the application of σ to Ωg.

Elaboration semantics

Every guarded instance guarded instance Θ => H with Ω elaborates to two conventional declarations. Let C be H's class. The first generated declaration is an auxiliary class C' that is a fresh copy of C with a new parameter for each type in Ω (added after the existing parameters) and a new PROXY argument in every method for each new class parameter. The second generated declaration is an instance of C with context (Θ, H[C'/C] Ω) that defines each method by applying the corresponding method of C' to the appropriately typed proxy arguments.

Every instance with a head that includes with-clauses Θ => H with Ω elaborates to a single conventional instance (Θ, Θg) => H[C'/C] Ω, where C' and Θg are determined by the corresponding guarded instance for H.

Using polykinds, the elaborator could more efficiently re-use class copies that require the same number of additional parameters.


I’m confident this discussion is better than my original ticket, but it still seems to make it overly complicated. I’m only proposing a little sugar for the underlying technique that uses type families to encode richer instance selection. On the other hand, that technique isn’t even well-known or particularly simple. What can I say? It has worked pretty well for me, and I’d like to be able to use it with less syntactic overhead. YMMV.

Thanks for reading along! Please let me know what you think.

[1] — While I’ve kept the definition of V minimal for this example, it should in general be at least a complete lattice with four elements, co-, contra-, in-, and non-variance. Invariance is both co- and contra-, like a -> a; non-variance is neither co- nor contra-. NB even this lattice is insufficient for some higher-order data types, like data X a f = X (f a), where the variance w.r.t. one parameter depends on the variance of a later parameter. I’m not sure how Variance interacts with non-parametric types (= indexed? or >= indexed?).

Henny Penny Blues: An apologetic, but enthusiastic release of the Jaro benchmarking code

First things first: we accidentally sabotaged Data.Text. Its variant used tuples for the outer-loop state instead of the data type with strict fields — quite a disadvantage. It now performs a bit better than String. Please accept our apologies for screaming that the sky was falling.

I’ve posted a prettied-up slice of our benchmarking suite on my university personal page. It’s the main algorithm from the data types post for ByteString, UArray Int Char, String, Seq Char, and Data.Text.

We also include a ByteString version that has some further optimizations we came up with after the fact. Please try to best it! Enjoy!

SPECIALISE pragma vs copy-and-paste

Turns out the slight overhead of SPECIALISE we were seeing was most likely incidental and just happened to look systematic. We had decided to use the pragma in this code release, since it’s just so much cleaner that way. That gave Text a significant performance boost and that’s how we realized we had copied from the wrong variant when writing the Text variant. This is exactly why we were so concerned that the SPECIALISE pragma somehow had some overhead: copy-and-pasting is just begging for bugs! … so ashamed …

Here’s what we get when we run this code as released (with SPECIALISE pragmas) and also with manual specialization via copy-and-paste. Note that use of the SPECIALISE pragma shows no overhead.

variant                 average time (s)  slow down factor  standard deviation  SNR
with the SPECIALISE pragma
spec                    0.857             4.47              0.00238             360.83
ByteString              0.237             1.23              0.00217             109.28
UArray Int Char         0.345             1.80              0.00124             278.67
String                  0.505             2.63              0.00196             258.11
Seq Char                0.841             4.38              0.00247             340.12
Text                    0.487             2.54              0.00220             221.25
optimized ByteString    0.192             1.00              0.00082             234.68
with manual specialization via copy-and-paste-and-annotate
spec                    0.857             4.46              0.00359             238.89
ByteString              0.236             1.23              0.00122             192.89
UArray Int Char         0.345             1.80              0.00129             266.63
String                  0.505             2.63              0.00184             274.63
Seq Char                0.841             4.38              0.00240             350.24
Text                    0.487             2.53              0.00206             236.75
optimized ByteString    0.192             1.00              0.00107             180.06

(I just ran the SPECIALISEd version again in the background while editing this post and String slightly outperformed Text. This nondeterminism is driving me nuts.)


Also, check out Max Bolingbroke’s edit-distance library. That’s the only “related work” we found on hackage, and from his source code comments it looks like he’s waaay better at optimizing than we are.

That last post we promised is still in the works… thanks for your patience. Due dates, due dates, due dates!

D’oh! Bars

I had that haunting feeling that I had forgotten something during the last editing pass on that previous post about test driving the datatypes dropped into the Jaro-Winkler test harness we made. Our graphs looked incomplete somehow…

D’oh! Error bars. Thanks to Neil Brown for pointing out our omission.

The statistics we have to offer

criterion computes some statistics for you as it chews through its benchmarking duties. It applies resampling in ways we haven’t investigated, identifies outliers and their possible effect on variance, remarks on the confidence interval, and computes the standard deviations for the mean and for its lower and upper bounds within a benchmark’s iterations.

As with all statistics, these numbers risk lulling the trusting reader into a submissive state.

We were a little too trusting at first. If it finds that outliers have affected your run, criterion warns you but doesn’t attempt a re-run. We had thought it would. So just a heads up about that: check for crazy variances, look for that warning! My laptop apparently sleepwalks during the full moon: there were three giant execution times recorded during our first overnight benchmarking run. It took 600 minutes. Only in the past couple of days did we realize it should actually take only 50 minutes to do the entire run! We have no idea what happened there, but we have been on the lookout for large variances ever since.

So what is the consequence of the 95% confidence interval that criterion reported to us? It’s pretty sure that it got the right measurement for that run of that particular benchmark. It does not take into account whatever sort of balancing act your OS might be doing for that benchmark that it didn’t do for others, even within the same run of the whole suite. Not that we expect criterion to be able to do that sort of thing! We’re just couching the implications of the statistics.

All of that rant is setup for

Disclaimer: We ran these measurements on a laptop with as few other processes as we could muster. We are not benchmarking pros, and so we really have no idea how best to quantify or even qualify the consistency of our measurements.

Bringing error bars into the equation might satiate our need for statistical significance, but we just want to highlight how much, for example, the OS environment might affect the run and that the standard deviation as computed by criterion does little to assuage that.

Here’s another graph to bring home our disclaimer. We’ve executed the entire benchmark multiple times; here’s the same slice of four such runs lined up. Recall that each of the lines on this graph achieved that 95% confidence interval nod.

We can’t explain those slight differences. Red, orange, and green were run back to back one night, and blue was run over lunch the next day. Otherwise, the system was in a very similar state. Maybe we ought to reboot just before benchmarking? (We tried booting into single-user mode and benchmarking there, but all the times were roughly twice as slow. /shrug)

An example graph

Even so, a graph looks so much more dapper with error bars. So we jumped through all the hoops to get a PNG with error bars out of Open Office. (I chose not to install the graphics libraries criterion relies on for its graphs.) This is from a different set of data than the numbers in the previous post, but you get the idea. See the full graph PNG if this is too hard to read.

The error bars in that graph are +/- the standard deviation.

Here’s another thing we learned during this experiment: Google Spreadsheets does not support error bars. Swammi really gives it to ’em there!

Signal-to-Noise Ratio

The only other statistic we’re prepared to offer is a characterization of our SNR. That’s the mean divided by the standard deviation. The higher that quotient is, the less jittery the results were.

We don’t have any intuitions about “good” SNR values, but we guess those are mostly pretty good? Again, this just means that each particular benchmark within the set of benchmarks of that particular execution was pretty consistent. It doesn’t necessarily say anything about how much we can trust any comparisons between them. It does suggest that nothing funky was going on, so it doesn’t per se warn against comparing between the variants measured within that benchmark run. On that note, we used sleep 360 && … to let the display get to sleep and the computer settle before the benchmark started. If it was not done by the time we woke the display, we restarted the run.

FYI, the eight SNRs below 200 are ByteString * ByteString, ByteString * String, UArray Int Char * UArray Int Char with tuples, String * ByteString, Seq Char * UArray Int Char, and a few of the variants we have yet to post discussions about: a further optimization of symmetric ByteString and two of the refinement steps between the String spec and the primary jaro algorithm we settled on for the previous post. We see no notable pattern in those — again, though, we’re not too familiar with SNRs.


So that’s our graph and a characterization of our SNRs. Any suggestions on how to characterize our confidence in the comparisons? We suppose you could derive that from a characterization of how consistent the state of the machine was throughout the execution of the entire benchmarking suite. Other than just turning off AirPort, for example. (We’re working with what we have…)

Thanks again, Neil, for the reminder.

And thank you for reading! Please still stay tuned for that last post about optimizing jaro. The request has been made, so our code will be posted too.


† That’d be really awesome, though. Maybe it could jumble up some portions of the iterations of a benchmark, trying to balance them across the entirety of that run of the benchmarking suite? Hrmmmm… (k)

That last Bugzilla comment gets me every time. (k)

Trying some datatypes on for time (+ a few algorithmic variations)

Update 2 June 2011: we’ve since discovered an error in our benchmarking suite. Data.Text was at a disadvantage and actually performs on par with String. So, the Data.Text results below are inaccurate — everything else should be fine AFAWK.

I don’t really know if this is technically the same “functional time”. But maybe by “functional” Evan meant whenever it’s ready — so here goes. I’m Nick. Thanks for reading. I have to apologize ahead of time: I cannot even feign the capability to compete with Evan’s cultural allusions.

A quick recap of the last post: when optimizing a straight-forward implementation of the Jaro string distance, we found some success with datatype refinements to IntSet and ByteString (surprise surprise). But then we found out an inconvenient truth about the current GHC profiling, namely that it isn’t to be trusted.

This post includes some more of our experiments and design considerations. In particular, we focus on trying out four alternative types for representing the strings.


The following chart summarizes the results; the scalar is an average time for an iteration of the outermost loop of our testing harness and the label is the type used to represent words.

Disclaimer: These are all the same algorithm, with different data structures just plugged in. With algorithms tailored to their interfaces, these datatypes might perform better!

ByteString wins again. In this case, we think the chart indicates that much of that performance benefit is due to the support for O(1) random access. This claim is supported by the comparable success of UArray Int Char. Read on for a more thorough discussion of the highlights:

  • we couldn’t squeeze any performance gains out of mutability and unsafePerformIO,
  • don’t underestimate the benefits of a data declaration with strict fields for your inner loop’s state, and
  • Data.Sequence severely underperforms, but we’re not sure why.

Prompted by a comment on the last post by Erik Hesselink (Thanks, Erik!), we included a variant for Bryan O’Sullivan’s Data.Text. We have not investigated why it only performs on par with String (which shares the same time-complexities for the interface we use); the only caveat we’ll mention is the minimality of this algorithm. It’s basically a strict left fold, length, and random access. There’s really no opportunity for fusion to kick in that we’ve noticed.

We’d be happy to send our code your way if you’d like — just let us know in the comments. You can dig deeper if you like, but we’re just investigating a “plug-and-play” like use here.

Experimental Setup

As a first step, we identify the essential interface to the String-like datatype for calculating the Jaro distance. This type class is used in the code we actually benchmarked; each instance of Jaro is saturated with INLINE pragmas and the jaro and matches definitions were cut-and-pasted for each datatype in order to avoid dictionary passing at run-time. We couldn’t get all the overhead removed via SPECIALISE — we have yet to determine why not.

class Jaro c where
  fromBS :: ByteString -> c
  iLen :: c -> Int
  index :: c -> Int -> Char
  foldl' :: (b -> Char -> b) -> b -> c -> b
  {-# INLINE gLen #-}
  gLen :: Enum n => c -> n
  gLen = toEnum . iLen

The test harness uses a strict ByteString to read in the test data. The fromBS method is used to convert the input word to the appropriate type for the jaro variant under test. We use Control.DeepSeq.rnf in order to exclude the conversion from our measurement of execution time. Otherwise, the methods simply provide access to the length of the word, random-access indexing, and a strict left fold.

The Jaro type class provides all the necessary methods for implementing the Jaro algorithm.

jaro :: (Jaro c, Fractional n, Enum n) => c -> c -> n
jaro s1 s2
  | 0 == m = 0
  | otherwise =
    (m / gLen s1 + m / gLen s2 + (m - t) / m) / 3
  where
    m = toEnum mInt

    (x1, x2) =
      if iLen s2 < iLen s1 then (s2, s1) else (s1, s2)

    (mInt, x1', hits2) = matches x1 x2

    t = (/ 2) $ toEnum $ loop hits2 x1' (0 :: Int) where
      loop (i : is) (c1 : cs1) = loop is cs1 . f
        where f = if c1 /= x2 `index` i then succ else id
      loop _ _ = id

The definition of jaro re-orders the input words so that the outer-loop of matches traverses the shorter word. That traversal yields the number of matches, mInt; the string of matched characters from the short string, x1'; and the ascending indices of matched characters in the long string, hits2. The latter two are simultaneously traversed in order to compute the transpositions.

The matches function consists of a left fold over the short input word, which carries the following state type, St. The data declaration uses strictness annotations in order to specify that the fields should never contain thunks. Along with the -funbox-strict-fields compilation flag, this technique pushes these values nearer to the CPU at run-time — they are the data manipulated by our program’s hot-spot and any access-time overhead is costly.

data St = St { idx1, m :: !Int, mk :: !(String -> String),
               used :: !IntSet }

{-# INLINE matches #-}
matches :: Jaro c => c -> c -> (Int, String, [Int])
matches s1 s2 = (m, mk [], IntSet.toAscList used) where
  len2 = iLen s2

  window = max 0 (max (iLen s1) len2 `div` 2 - 1)

  St {m, mk, used} = foldl' snoc nil s1 where
    nil = St { idx1 = 0, m = 0, mk = id,
               used = IntSet.empty }
    snoc st@St{..} c1 =
      (case inline List.find
                 (\idx -> IntSet.notMember idx used &&
                          c1 == s2 `index` idx)
         [lo .. hi] of
         Nothing -> st
         Just idx2 -> st { m = m + 1, mk = mk . (c1 : ),
                           used = IntSet.insert idx2 used }
      ) { idx1 = idx1 + 1 }
      where lo = max 0 (idx1 - window)
            hi = min (len2 - 1) (idx1 + window)

The INLINE pragma merely mitigates the function call overhead; we’re not aware of any optimizations it enables within jaro.

This particular definition of jaro and matches was the result of iterating on the various data and slight algorithmic choices when refining the original specification from the previous post. The primary differences are some manually eliminated common subexpressions, some explicit inlining via GHC.Exts.inline, and some trivial changes to the types.

As a reminder, profiling was unfortunately of little use because of its current detrimental interactions with optimization, as mentioned in this GHC ticket. For our available machines, this bug either prevented any timing information or caused invalid compilation! Thus, we only had our own coarse-grain wall-clock time measurements to work with, computed carefully with Data.Time.getCurrentTime and Control.DeepSeq.rnf.

Jaro instances

The instances for Jaro are crucial to the performance of the jaro variant that uses the datatype. Here, we have omitted the INLINE pragmas that saturate these instance declarations for the sake of presentation, but they are crucial to avoiding dictionary passing at run time. Also, recall that the execution of fromBS is not included in the measurements.

For ByteString, String, and Text, the instances are immediate.

instance Jaro ByteString where
  fromBS = id
  iLen = BS.length
  index = BS.index
  foldl' = BS.foldl'

instance Jaro [Char] where
  fromBS = BS.unpack
  iLen = length
  index = (!!)
  foldl' = List.foldl'

instance Jaro T.Text where
  -- NB get special input treatment instead
  fromBS = error "Text from BS"
  iLen = T.length
  index = T.index
  foldl' = T.foldl'

The functions used in these instances are defined in heavily optimized libraries. This is, surprisingly, apparently not the case for Seq Char.

instance Jaro (Seq Char) where
  fromBS = Seq.fromList . BS.unpack
  iLen = Seq.length
  index = Seq.index
  foldl' = \snoc nil -> List.foldl' snoc nil . Fold.toList

The foldl' definition via the conversion to String consistently outperforms those via Data.Sequence.foldlWithIndex and Data.Foldable.foldl' by about 0.1 seconds, which is a degradation of about 15% to 50% compared to the execution times of all of our tests in this experiment. That would seem to merit some investigation; we haven’t any theories yet.

instance Jaro (UArray Int Char) where
  fromBS = \bs -> IA.listArray (0, BS.length bs - 1)
                 (BS.unpack bs)
  iLen = IA.numElements
  index = (IA.!)
  foldl' = \snoc nil arr ->
    let (lo, hi) = IA.bounds arr
        w !idx !acc | hi < idx = acc
                    | otherwise =
          w (succ idx) (snoc acc (arr IA.! idx))
    in w lo nil

criterion Harness

Our main function reads in the input file as a ByteString and then immediately prepares some thunks for converting that data into the various pairs for the cross product of datatypes (one for the short word, one for the long). We also use Data.Text‘s lazy IO in order to thunk its input until it’s needed; we don’t convert it from a ByteString since Data.Text supports richer encodings.

type C0 = ByteString
type C1 = UArray Int Char

main = do
  ns <- BS.lines `fmap` BS.readFile wordsFile
  ns_text <- (map (id &&& id) . T.lines . LT.toStrict)
               `fmap` LT.readFile wordsFile

  let ns00 = map (fromBS &&& fromBS) ns :: [(C0, C0)]
      ns01 = map (fromBS &&& fromBS) ns
      ns02 = map (fromBS &&& fromBS) ns
      ns33 = map (fromBS &&& fromBS) ns :: [(C3, C3)]

We basically hand everything off to criterion from here. We use rnf “in the right places” in order to force the conversion to take place outside of the execution window that criterion measures. Each bgroup corresponds to one of the sixteen pairs of representation types (e.g. ByteString and UArray Int Char) and contains all of the benchmarks that use that input data. Combined with the cfgPerformGC option to criterion, this should ensure that only the ByteString for the entire file and one conversion from that data are resident in memory at any given time.

  let cfg = defaultConfig
        { cfgPerformGC = ljust True,
          cfgSummaryFile = ljust "crit.csv" }

  defaultMainWith cfg (return ()) [
    bgroup "00" [
        bench "jaro0" $
          rnf ns00 `seq` whnfIO (critrun jaro0 ns00)
      , bench "jaro0'" $ whnfIO (critrun jaro0' ns00)
      , … {- a bench for each variant that uses ns00 -}],
    bgroup "01" [
        bench "jaro01" $
          rnf ns01 `seq` whnfIO (critrun jaro01 ns01) ],
    bgroup "02" [
        bench "jaro02" $
          rnf ns02 `seq` whnfIO (critrun jaro02 ns02) ],
    … {- a bgroup for each input type except Text -},
    bgroup "text" [
      bench "jaro_text" $
        rnf ns_text `seq` whnfIO (critrun jaro_text ns_text) ] ]

We compile with

ghc -O2 -funbox-strict-fields -fforce-recomp
    -fspec-constr-count=6 --make Main

We just bumped the SpecConstr count until the warnings went away; 6 didn’t seem offensively large, and there were two call patterns at most (recall that GHC reduces the count as part of its heuristic).

Results and Discussion


We used the 8K /usr/share/dict/propernames file on our Mac OS X distributions to run the tests, since Jaro-Winkler is designed to primarily work on names. Our test harness traverses the diagonalization of the names in this file, and sums the similarity metrics between all of the words (this diverges from the first post’s test harness, but it’s much faster since we’re running so many variants and iterations). The result is a nonsensical number, but it forces the necessary evaluation, has a small memory footprint, and represents a hash of sorts of the file’s contents. All of the jaro variants computed the same sum (without the profiling, that is), so we’re confident they are equivalent. As an exercise in autodidacticism, this work did not inspire us to compare for correctness against the golden C implementation of Jaro-Winkler (linked to by the Wikipedia page referenced above).
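The shape of that harness traversal can be sketched as a small fold (the sumPairs name and the toy equality metric below are ours, just to keep the example self-contained and separate from the real jaro):

```haskell
import Data.List (tails)

-- Sum a similarity metric over every unordered pair drawn from the
-- word list -- one plausible shape of the harness described above.
sumPairs :: (String -> String -> Double) -> [String] -> Double
sumPairs metric ws = sum [metric a b | (a : rest) <- tails ws, b <- rest]

main :: IO ()
main = print (sumPairs (\a b -> if a == b then 1 else 0) ["ab", "cd", "ab"])
-- only the pair ("ab","ab") scores with this toy metric, so the sum is 1.0
```

Summing keeps the memory footprint constant while still forcing every metric computation.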

It is well understood that standard Haskell Strings limit performance; this is why ByteString was written in the first place. For this experiment, we primarily focused on various data structures for representing the words during the two loops in matches. Since each loop accesses the respective argument differently (short word – outer loop, long word – inner loop), we tried all sixteen combinations of our 4 datatypes of interest. Direct benchmarking via Data.Time.getCurrentTime and Control.DeepSeq.rnf yielded the following measurements.

Outer loop Inner loop Measured time (s)
ByteString ByteString 0.22892195665577
ByteString UArray Int Char 0.3117424055312
ByteString String 0.4186133953307
ByteString Seq Char 0.62530112468937
UArray Int Char ByteString 0.32210345232228
UArray Int Char UArray Int Char 0.33752040827015
UArray Int Char String 0.45714450323322
UArray Int Char Seq Char 0.68065535986164
String ByteString 0.37693604194859
String UArray Int Char 0.40947800600269
String String 0.49661373102406
String Seq Char 0.74052640163639
Seq Char ByteString 0.53333888256291
Seq Char UArray Int Char 0.58034939014652
Seq Char String 0.68305074178913
Seq Char Seq Char 0.8523849221442

The Datatypes

The graph from the introduction tells the big story: ByteString is best. We chose the other types in order to explore our straight-forward and simple hypotheses. We included String since it’s the most familiar and default type in Haskell for words. We were surprised it fared so well. UArray Int Char was intended as a less optimized version of ByteString. We presumed that its implementation is basically the same with fewer possible rewrites and slightly worse access-times, and so we expected a constant factor of overhead compared to ByteString. Our use of Data.Sequence is slightly vestigial: an earlier algorithmic variant used its queue-like interface (O(1) tail and snoc) to slide the matching window across the second string. Without profiling, we haven’t the means to determine what’s “wrong” with Data.Sequence; the time complexities listed in its documentation suggest it should do better than String.

In order to better compare the types, we chart the times with a fixed type for the short word and average across the measured times for each of the possible types for the longer word. In other words (har), we fix the type of the outer loop and vary the type of the inner loop.

Next, we do the opposite: fix the inner loop type and vary the outer loop type, again averaging the results to a single scalar.

Without going into statistics, the best we can do is eyeball it. It does seem that the “slope” of the second graph is steeper, which indicates that the datatype used for the inner loop has a greater effect on performance. That sounds reasonable.

Algorithmic Variations

Strict Tuple Patterns

Unaccustomed to low-level optimizing, we didn’t immediately remember to employ a data declaration with strict, unboxed fields. Its performance benefit at the time was startling; a factor of 2 or 3. So we’ve retained variants of jaro that use simple tuples saturated with strictness annotations just to see the difference.

Datatype Measured time (s) Degradation
ByteString 0.35290416443089 54.16%
UArray Int Char 0.45579727375248 35.04%
String 0.61194113933781 23.22%
Seq Char 0.99888140403965 17.19%

The data declaration with strict fields is clearly crucial.
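To illustrate the shape of the difference, here is a toy fold with a strict two-field state (the P type and sumLen are ours, not from the benchmark): with a plain lazy pair the accumulator’s components could build chains of thunks, whereas strict fields (especially with -funbox-strict-fields) keep each counter evaluated at every step.

```haskell
import Data.List (foldl')

-- Toy loop state with strict fields, analogous in shape to St above.
data P = P !Int !Int

-- Compute the sum and length of a list in one pass; the strict fields
-- ensure neither counter accumulates a chain of thunks.
sumLen :: [Int] -> (Int, Int)
sumLen = done . foldl' step (P 0 0)
  where
    step (P s n) x = P (s + x) (n + 1)
    done (P s n)   = (s, n)

main :: IO ()
main = print (sumLen [1 .. 10])  -- (55,10)
```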

Mutable bit array

All of the measured variants discussed above use a Data.IntSet to track which of the characters in the second string have already been matched. Since this is a persistent data structure — even though it’s so tightly packed and efficient — we thought that changing it to be a mutable bit array would be a big win. However, all of our attempts at doing so, including ST, unsafePerformIO, re-using a singly allocated array to allow Data.ByteString.Internal.inlinePerformIO within the loop, cloning Data.ByteString.foldl' to work with a monadic snoc, etc. failed to perform as well as ByteString + IntSet. We peeked at the Core enough to suppose that there were bounds checks interfering with some optimizations; the IntSet had 10 instances of the loops whereas the array code had only 4. After eliminating the bounds checks (with some really low-level coding to circumvent the Data.Array API), there were a matching 10 instances of the loops in the array based code, but it remained less performant. Without profiling and better understanding/navigation of Core, we’re unable to determine why.

The variant with the singly allocated array and using only Data.ByteString.Internal.inlinePerformIO within the inner loop had a measured time of 0.3209107s, which is a 40.18% performance degradation.

Inlining List.find

In the definition of matches, we use GHC.Exts.inline to force the inlining of Data.List.find. Since it’s a library function, this is the only way we are aware of to force it to be inlined. Without inlining it, the ByteString jaro averages 0.29225487196186, which is a 27.67% performance degradation. We surmise that its unfolding is enabling a fusion rule for filter and enumFromTo.

Having never worked this hard at optimizing a function, we were surprised that a use of a standard function like List.find wasn’t optimally handled by GHC automatically. In particular, we thought it was very likely to unfold functions with such short definitions; perhaps the library must mark it as INLINABLE with a pragma? Odd that such a standard library wouldn’t. Looking now at the definition of Data.List.find, it seems it’d be more efficient to case directly on the result of the filter instead of converting to a Maybe. Is listToMaybe INLINEd even though it has no such pragma? (At the time of writing, those links hit base-
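For reference, here are the two shapes side by side (findViaFilter and findDirect are our names; the library definition of find is roughly the first, and in practice GHC may well produce the same code for both once everything unfolds):

```haskell
import Data.Maybe (listToMaybe)

-- Data.List.find is (roughly) listToMaybe composed with filter.
findViaFilter :: (a -> Bool) -> [a] -> Maybe a
findViaFilter p = listToMaybe . filter p

-- Casing directly on the filtered list, as mused about above.
findDirect :: (a -> Bool) -> [a] -> Maybe a
findDirect p xs = case filter p xs of
  y : _ -> Just y
  []    -> Nothing

main :: IO ()
main = do
  print (findDirect even [1, 3, 4, 6 :: Int])  -- Just 4
  print (findViaFilter (> 10) [1, 3 :: Int])   -- Nothing
```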

Eliminating the double traversal

The counting of transpositions in jaro traverses the strings of matched characters. With respect to the short string, the characters occur in the exact same order within the outer loop of matches. These characters are essentially traversed twice, and — presumably worse still — a list is constructed in between, bound to x1' in jaro.

We wrote a version using circular programming to avoid the construction of the string of matched characters from the short word. The resulting fold counts the transpositions “as” it determines the matches.

data St n = St { idx1, m :: !Int, t2 :: n,
                 used :: !IntSet, hits2 :: String }

matches_t ::
  (Jaro c, Fractional n, Enum n) => c -> c -> (Int, n)
matches_t s1 s2 = (m, t2 / 2) where
  window = max 0 (max (iLen s1) len2 `div` 2 - 1)
  len2 = iLen s2

  St {m, used, t2} = foldl' snoc nil s1 where
    nil = St { idx1 = 0, m = 0, t2 = 0,
               used = IntSet.empty,
               hits2 =
                 map (index s2) (IntSet.toAscList used) }
    snoc st@St{..} c1 =
      (case inline List.find
                 (\idx -> IntSet.notMember idx used &&
                          c1 == s2 `index` idx)
         [lo .. hi] of
         Nothing -> st
         Just idx2 ->
           st { m = m + 1, hits2 = tail hits2,
                used = IntSet.insert idx2 used,
                t2 = (if c1 /= head hits2
                      then succ else id) t2 }
      ) { idx1 = idx1 + 1 }
      where lo = max 0 (idx1 - window)
            hi = min (len2 - 1) (idx1 + window)

The circularity is between the outermost binding of used and the initial value of the hits2 field in the nil state. Divergence is avoided because, during the traversal, none of the other fields depend on hits2 or t2 and their fields in St are not strict. As a result, those fields accumulate as thunks, which only collapse once t2 is forced. This ultimately causes the rest of the traversal to take place first.

With ByteStrings, this algorithm and the primary algorithm perform the same at optimization level -O1. We surmise that the thunks that build up are operationally comparable to the construction of the list of matched characters, x1'. That would still eliminate the redundant traversal, but for the fact that the thunks merely accumulate during the fold; their collapse happens only once t2 is forced, and is effectively a second traversal. Furthermore, with -O2 the circular program is significantly less performant than the primary algorithm; we surmise that the circularity prevents some of the more aggressive optimizations.

Though the circular programming removes the redundant traversal from the code, it fails to eliminate it from the execution. We surmise the existence of a much more invasive alteration of the primary algorithm where the outer loop counts the transpositions for the prefix of the longer string that is beyond the lower limit of the matching window for the shorter string’s last character. This ought to reduce the redundancy of the second traversal. We have not yet implemented it.

“Surmise, surmise, surmise”, you say. We’re trying to not look at the generated Core, especially since -O2 does a lot of unfolding. Perhaps heap profiling would be a good indicator? … You know, we do have research to do.


We eventually jacked our code into Bryan O’Sullivan’s criterion package late into our process. In particular, during development we did simple timing with Data.Time.getCurrentTime with a handful of iterations for quick turnaround.

For the numbers and graphs in this post, though, we ran everything in a single criterion harness. Fortunately, comparing the two data sets at the shared data points validates our simple method.

Phew! Everything just about lines up. The criterion benchmarking was run overnight, with no other processes — we assume that sort of OS-level inequity in the execution environment is where the seemingly constant offset comes from.

Conclusions & Questions

We kind of already concluded above. ByteString was best. We’re surprised we couldn’t beat IntSet by using a mutable array — that’s probably our most interesting result. But all of the IO-escaping (even via inlinePerformIO) seemed to inhibit some key optimizations. Maybe with the right refactoring and compilation flags, the mutable bitarray would out-perform IntSet. Suffice it to say it apparently wouldn’t be a small, obvious change.

We also didn’t try the other unboxed array packages. We just hadn’t heard of them yet. So maybe there’s something else out there for mutable bitarrays that would outperform IntSet and not block those aggressive optimizations as we’re supposing was happening with our attempts. Please let us know!

If you’re interested in getting access to the code we used:

  1. Leave a comment saying so! We’ll need some motivation to package it up since the semester is over…
  2. Brace yourself! There’s a lot of “duplicate” code. We manually circumvented SPECIALIZE pragmas, since our initial tests indicated some overhead. Also, our naming conventions got a little frantic at the end there. Maybe we’ll clean it up before posting it if you ask nice.

If you’re just after the fastest jaro definition (in order to try to beat it, perhaps? Hmm?), just wait for the next post. We’d love it if someone could show us some further tricks! We’d also really like to know why we’ve failed to speed it up with the mutable bit array.

Also, we are concerned about the apparent SPECIALIZE overhead we noticed. Is this a known issue? Are we missing something?

This point about missing flags deserves repetition. In fact…

Disclaimer: we could have overlooked crucial INLINEs or compilation flags.

Those sorts of things could really affect the results we’re getting and so possibly invalidate some of our conclusions (e.g. those regarding inlining List.find or Seq Char’s slow folds). Unfortunately, we can’t even find Acovea on the Internets and aren’t otherwise familiar with the multitude of GHC flags. … Help?

Thanks again for reading! Our next post will discuss some more invasive algorithmic changes; there’s more we can do to speed up jaro, albeit not by much. Fuse, fuse, fuse.

Exercises in Optimization (An O(n)-Part Series)

Greetings and welcome to the first entry of Functional Kansans, a new blog brought to you by a few of the people interested in functional programming at the University of Kansas.

We open the blog with a series of posts by myself and Nick Frisby detailing our recent exploration into the world of Haskell program optimization.  Before we begin, I should state that this work arose from what was supposed to be a class-wide project for Andy Gill’s advanced functional programming class that Nick and I co-opted as our own, perhaps greedily so.  The following content could not have been possible without Andy’s excellent instruction and the support of our other classmates (who hopefully will have posts of their own on this blog at some point).

The Problem

The target of our optimization efforts was the Jaro-Winkler algorithm, a metric for judging the similarity of two strings.  Specifically, we looked at the Jaro distance metric given that its time-complexity dominates that of the Winkler variant.


This section provides pseudo-code snippets in order to explain the algorithm.

The Jaro measure is computed in terms of the string lengths, the number of matches, m, and the number of transpositions, t, and yields a similarity metric between 0 and 1.

jaro s1 s2 = (m / n1 + m / n2 + (m - t) / m) / 3
  where
    n1 = length s1
    n2 = length s2
    m  = matches s1 s2
    t  = transpositions s1 s2

A pair of characters from each string match if they are equal and not too far apart.  The maximal match distance is defined as a sliding window, framed by the list of indices [lo .. hi], whose size is determined relative to the length of the two input strings.  Additionally and importantly, each character can only match once.

matches s1 s2 = snd (foldl snoc nil (zip s1 [0..])) where
  nil = ([], 0)

  snoc st@(used, m) (c1, idx1) =
      case find p [lo .. hi] of
        Nothing   -> st
        Just idx2 -> (idx2 : used, m + 1)
    where p idx2 = idx2 `notElem` used && c1 == s2 !! idx2

To calculate the number of transpositions, count the positions at which the in-order matched characters differ; t is half that count.

transpositions s1 s2 =
  (`div` 2) $ length $ filter (\(c1, c2) -> c1 /= c2) $
  zip (filter wasMatched s1) (filter wasMatched s2)

Note that since each character can only match once and only with a character from the other string, there are the same number of matched characters in each string; the zip doesn’t truncate either argument. Also note that we avoid specifying a full definition for wasMatched, but suffice it to say that an implementation that matches its current type of Char -> Bool is not possible.
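As a quick sanity check on the formula, consider the classic example pair from the record-linkage literature, “MARTHA” and “MARHTA”: there are m = 6 matches, and the matched characters disagree at two positions, so t = 1 once the mismatch count is halved. Plugging those numbers in:

```haskell
-- Worked Jaro example for "MARTHA" / "MARHTA": m = 6 matches,
-- two of the matched positions disagree, so t = 2 / 2 = 1.
main :: IO ()
main = print ((m / 6 + m / 6 + (m - t) / m) / 3)  -- ≈ 0.944
  where m = 6; t = 1 :: Double
```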

The next sections will present the starting implementation of this algorithm and a discussion of the profiling techniques used for our optimization efforts.

The Starting Point

As was just mentioned, wasMatched cannot have the type of Char -> Bool, at least as the transpositions function is currently specified, because it has a missing dependency on the indices calculated in the matches function; the character value on its own is insufficient information for determining if a position in the string was matched.  Presented below is a naive implementation of the algorithm that corrects that deficiency by having matches return the string consisting of all matched characters, in order, from the left string (s1) and all matched indices from the right string (s2).  Please also note that the transpositions function has been “folded into” the definition of jaro in its where clause, mainly for the purpose of brevity.

jaro :: String -> String -> Double
jaro s1 s2
 | 0 == m = 0
 | otherwise = (m / n1 + m / n2 + (m - t) / m) / 3
 where
  n1 = fromIntegral $ length s1
  n2 = fromIntegral $ length s2
  (used, m1) = matches s1 s2
  m2 = map (s2 !!) $ sort used
  m = fromIntegral $ length m1
  t = (/ 2) $ fromIntegral $ length $
      filter id $ zipWith (/=) m1 m2

matches :: String -> String -> ([Int], String)
matches s1 s2 = foldl snoc initst (zip s1 [0..]) where
 window = max 0 (max (length s1) (length s2) `div` 2 - 1)
 initst = ([], "")

 snoc st@(used, m1) (c1, idx1) =
   case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> (idx2 : used, m1 ++ [c1])
   where
    p idx2 = idx2 `notElem` used && c1 == s2 !! idx2
    lo = max 0 (idx1 - window)
    hi = min (length s2 - 1) (idx1 + window)

From this point the path to an optimized implementation is simple: find what is slow and make it faster.  Flippancy aside, we made the decision to identify time-cost hotspots using GHC’s profiling functionality.  Many notable bloggers and Haskell hackers have also advocated for the direct inspection of the Core code that GHC produces to augment profiling information; however, we chose not to.
The honest truth of the matter is that Core is not a fun language to read, except perhaps to gluttons for syntactic punishment.  Furthermore, the practice quickly becomes infeasible for any significantly large code base.


So we made up our mind.  Profiling and profiling only.

Unfortunately, running a single iteration of jaro takes such a trivially small amount of time that no worthwhile profiling data would be collected.  To sidestep this issue we devised the following test harness that calls jaro a large number of times.

import Control.DeepSeq (rnf)
import Data.List (sortBy)
import Data.Time (diffUTCTime, getCurrentTime)

words_file :: FilePath
words_file = "/usr/share/dict/propernames"

main :: IO ()
main = do
  ns <- lines `fmap` readFile words_file
  now <- length ns `seq` getCurrentTime
  let xs :: [Double]
      xs = sortBy (flip compare)
           [ jaro n m | n <- ns, m <- ns ]
  now2 <- rnf xs `seq` getCurrentTime
  print (diffUTCTime now2 now)

In short, the harness reads a line-delimited file, splits it into words, and then calculates the jaro similarity factor for the cross product of the word list.  Also note that there is some additional plumbing to measure execution time with the Data.Time library and to force evaluation of the list to normal form using the Control.DeepSeq library.  This combination provides us with informal benchmarking capabilities that should confirm any results we find with profiling.

For all profiling experiments in this post we will be using the following compilation flags: -rtsopts -prof -auto-all -O2.  The first three flags represent the standard flags required for profiling all top-level definitions with the final flag representing the simplest way to activate most of the compiler optimizations that would make a difference for us.  There may be a future post discussing other optimization flags that would be of benefit, however for now we accept the O2 flag as “good enough.”

Likewise we will be using the following runtime system options: -p -i0.001.  There are no surprises here: we simply activate profiling and change the time scale for profiling ticks to 1 ms.  Note that with GHC 7 most runtime system options were disabled by default, hence the need to use the -rtsopts flag during compilation.

Profiling our naive implementation with the above settings gives us the following information:

total time  =        0.88 secs
total alloc = 3,255,736,936 bytes
wall clock = 14.746186 secs 
COST CENTRE                MODULE           %time %alloc

main                       Main              61.4   36.8
matches                    Main              29.8   47.7
jaro                       Main               8.8   15.5

Fantastic.  So now we know that matches is the dominant function, accounting for roughly 30% of our total run time and almost 48% of our total allocation.  Looking back at the matches function, though, the code is fairly dense and, at least to less experienced eyes, does not appear to reflect any obvious optimization opportunities.  To gain more information we will introduce our own cost centers to measure any internal expressions that we think may account for a large portion of our run time.

matches :: String -> String -> ([Int], String)
matches s1 s2 = {-# SCC "matchesFold" #-}
                foldl snoc initst (zip s1 [0..]) where
 window = {-# SCC "matchesWindow" #-}
          max 0 (max (length s1) (length s2) `div` 2 - 1)
 initst = ([], "")

 snoc st@(used, m1) (c1, idx1) =
   case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> {-# SCC "matchesSnoc" #-}
                  (idx2 : used, m1 ++ [c1])
   where
    p idx2 = {-# SCC "matchesPred" #-}
             idx2 `notElem` used && c1 == s2 !! idx2
    lo = max 0 (idx1 - window)
    hi = {-# SCC "matchesHi" #-}
         min (length s2 - 1) (idx1 + window)

Running the profiling with the new cost centers does in fact give us quite a bit more information to work with:

total time  =        1.06 secs
total alloc = 2,883,573,832 bytes
wall clock  = 15.011385 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              30.5   41.5
matchesPred                Main              25.0    0.0
matches                    Main              13.5   18.3
jaro                       Main              12.6   17.5
matchesFold                Main               9.6   14.6
matchesWindow              Main               4.2    1.5
matchesHi                  Main               3.3    2.5
matchesSnoc                Main               1.3    4.1

Hold on, though.  The introduction of our own cost centers affected our total time and allocation by fairly significant numbers.  That can’t be right, can it?  We’ll touch on this more in a later section, but for now, thankfully, we can rely on the independent benchmarking that we built into our test harness (recorded above in the wall clock field) to confirm any speed ups that we find.

Down the Rabbit Hole

Looking at the profiling information, we see that the dominant cost center we introduced is matchesPred which measures our find predicate:

p idx2 = idx2 `notElem` used && c1 == s2 !! idx2

There are a few things going on in this function, so let’s work from left to right.  Recall that used is simply a list of integers, making notElem a linear traversal. It should be possible to find a better container type for used that increases the speed of this check without penalizing the other operations over used.

We decided to change used to be modeled as an IntSet (note: we imported qualified as IS) rather than a list of integers.  In the worst case, the complexity of the operations should be the same as a list; however, in practice we know that at the very least the allocation behavior should improve, likely leading to an improvement in time-cost as well.  We also gain the benefit of using a container that better matches the semantics we’re looking for;  after all, a set of matched indices needs no room for duplicates.
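A tiny, stand-alone illustration of the IntSet operations the new predicate relies on (the list [3, 1, 3] is just for demonstration):

```haskell
import qualified Data.IntSet as IS

main :: IO ()
main = do
  let used = foldr IS.insert IS.empty [3, 1, 3]  -- the duplicate 3 collapses
  print (IS.size used)           -- 2
  print (1 `IS.notMember` used)  -- False: index 1 was already matched
  print (4 `IS.notMember` used)  -- True: index 4 is still free to match
```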

matches :: String -> String -> (IntSet, String)
matches s1 s2 = {-# SCC "matchesFold" #-}
                foldl snoc initst (zip s1 [0..]) where
 window = {-# SCC "matchesWindow" #-}
          max 0 (max (length s1) (length s2) `div` 2 - 1)
 initst = (IS.empty, "")

 snoc st@(used, m1) (c1, idx1) =
   case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> {-# SCC "matchesSnoc" #-}
                  (idx2 `IS.insert` used, m1 ++ [c1])
   where
    p idx2 = {-# SCC "matchesPred" #-}
             idx2 `IS.notMember` used && c1 == s2 !! idx2
    lo = max 0 (idx1 - window)
    hi = {-# SCC "matchesHi" #-}
         min (length s2 - 1) (idx1 + window)

And its profiling information:

total time  =        0.72 secs
total alloc = 2,490,396,576 bytes
wall clock  = 14.79711 sec

COST CENTRE                MODULE           %time %alloc

main                       Main              41.1   48.1
matchesPred                Main              20.7    0.0
matches                    Main              11.7   11.6
matchesFold                Main               9.4   17.0
jaro                       Main               9.4   12.7
matchesWindow              Main               4.0    1.7
matchesSnoc                Main               1.8    6.2
matchesHi                  Main               1.8    2.8

As expected, this led to a significant decrease in total allocation during the execution of the program.  Unfortunately, while the profiling time did decrease by roughly 30%, the reduction in wall clock time was not as significant.  That leaves matchesPred as the dominant cost center, so we revisit it looking for another optimization opportunity.  The only other function in this expression that is not a primitive comparison operator is the string indexing.  Obviously strings are not designed for efficient random access, so again we go on the quest for an alternative container type.


The obvious and immediate choice here would be to use a ByteString as our new container type.  ByteStrings are so attractive because they provide constant time indexing and length checks in addition to extremely efficient stream-fused traversals, all of which should provide some big wins across the entire algorithm.
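For instance, with Data.ByteString.Char8 (imported qualified as BS below), both length and indexing are constant-time operations, unlike their String counterparts:

```haskell
import qualified Data.ByteString.Char8 as BS

main :: IO ()
main = do
  let s2 = BS.pack "MARHTA"
  print (BS.length s2)     -- 6; O(1), no list traversal as with String
  print (s2 `BS.index` 3)  -- 'H'; O(1) random access
```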

When considering which types to change, it is important to note that string s2 is the only one ever indexed into.  However, because s1 and s2 are built from the same word list it would seem to make sense to have them be of the same type to avoid unnecessary calls to pack or unpack which can sometimes be a major source of inefficiency when using ByteStrings.  This is something that we will discuss in more detail in the conclusion, mainly to lead in to the next post in our series, but for now we operate under the assumption that the preceding statement is correct.

The resultant types of our functions are shown below:

jaro :: ByteString -> ByteString -> Double
matches :: ByteString -> ByteString -> (IntSet, String)

For the most part the available ByteString operations mirror the String operations provided by the Prelude, so code changes are predominantly just qualifications to tell GHC which version to use (note: we imported qualified as BS).  Given this, we will avoid showing the updated code for brevity’s sake.  The profiling results, though, are shown below in their entirety.

total time  =        0.51 secs
total alloc = 2,739,268,472 bytes
wall clock  = 13.066119 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              64.4   43.7
matchesPred                Main              10.2    0.0
matches                    Main               9.8   10.5
jaro                       Main               5.9   13.6
matchesFold                Main               4.7   22.5
matchesSnoc                Main               2.9    5.6
matchesHi                  Main               1.4    2.6
matchesWindow              Main               0.6    1.5

The first thing that we notice is that the switch to ByteStrings has pushed our allocation back up to a level near where it was before we made the switch to an IntSet to carry our match indices.  The tradeoff, though, is significant performance increases across the board, shaving almost two full seconds off of the wall clock time.

Looking at our code again, we recognize that the expressions within the matchesWindow, matchesPred, and matchesHi cost centers have now all been reduced to either constant time or primitive operations.  To clear up our profiling information from this point on we will be removing those cost centers since there is not much else we can do to optimize them directly.  That does not mean that we are done, though.  Even though our test harness is now the dominant cost center by a very large margin, we still have a few tricks up our sleeves that we have yet to exhaust to see if we can make the Jaro algorithm even faster.

Down the Rabbit Hole in Space Jam

All of the changes we have made up to this point have been, for the most part, algorithm impartial.  We have identified a few hot spots and attempted to mitigate their cost by changing the data representations of the objects we are working with to make those operations more efficient; however, the structure of the code itself was left largely unchanged. What follows in this section is an exploration of more “invasive” optimization attempts.  If the previous section was an allusion to simply following the white rabbit (profiling output) to guide our path, this section would best be described as again following the rabbit, but this time asking more in-depth questions like, “Why is Bill Murray here?”

The Bill Murray in this case is the large amount of allocation still happening in the matches function, specifically in our matchesFold cost center.  Whenever you see allocation issues with a fold you should think to yourself, maybe now is the time to try a strict version of that fold.  In fact, the ByteString library provides a very nice and efficient version of foldl' to use.  The issue, though, is that we are not currently folding over the string directly, but rather over the zip of the string and an incrementing counter.  To use the ByteString traversals we must fuse that zip in by tracking the counter at each iteration of the fold.
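The pattern is easier to see in isolation. Here is a minimal sketch (upperIndices is a made-up example, not part of our code) of fusing `zip s [0..]` into a strict fold by threading the index through the accumulator:

```haskell
import qualified Data.ByteString.Char8 as BS
import Data.Char (isUpper)

-- Made-up example: indices of the uppercase characters, with the
-- zip-with-[0..] fused into the accumulator of a strict fold.
upperIndices :: BS.ByteString -> [Int]
upperIndices s = reverse (snd (BS.foldl' step (0 :: Int, []) s))
  where
    step (i, acc) c
      | isUpper c = (i + 1, i : acc)  -- record the current index
      | otherwise = (i + 1, acc)      -- always advance the counter

main :: IO ()
main = print (upperIndices (BS.pack "aBcD"))  -- [1,3]
```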

While we are fusing traversals we should also pay attention to the jaro function.  Currently we traverse string s2 indexing into it, only to again traverse the result with a zip, and then a filter.  In parallel we also calculate the value of m by taking the length of our matches string. All of these passes can, and should, be fused into a single loop.  The new code for all of these changes is shown below:

jaro :: ByteString -> ByteString -> Double
jaro s1 s2
 | 0 == m = 0
 | otherwise = (m / n1 + m / n2 + (m - t) / m) / 3
 where
   n1 = fromIntegral $ BS.length s1
   n2 = fromIntegral $ BS.length s2

   (used, m1) = matches s1 s2
   m = fromIntegral $ length m1

   t = loop (IS.toAscList used) m1 0 where
     loop (i : is) (c : m1') = loop is m1' . f
         where f = if c /= s2 `BS.index` i then (+ 1)
                   else id
     loop _ _ = (/ 2)

matches :: ByteString -> ByteString -> (IntSet, String)
matches s1 s2 = {-# SCC "matchesFold" #-}
                snd $ BS.foldl' snoc initst s1 where
 window = max 0
            (max (BS.length s1) (BS.length s2) `div` 2 - 1)
 initst = (0, (IS.empty, ""))

 snoc (idx1, st@(used, m1)) c1 =
   case find p [lo .. hi] of
     Nothing -> (idx1 + 1, st)
     Just idx2 -> {-# SCC "matchesSnoc" #-}
                  (idx1 + 1, (idx2 `IS.insert` used
                             , m1 ++ [c1]))
   where
    p idx2 = idx2 `IS.notMember` used
             && c1 == s2 `BS.index` idx2
    lo = max 0 (idx1 - window)
    hi = min (BS.length s2 - 1) (idx1 + window)

And its profiling information:

total time  =        0.77 secs
total alloc = 2,698,599,372 bytes
wall clock  = 11.650707 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              62.3   44.3
matches                    Main              21.9   32.6
jaro                       Main               9.6   13.5
matchesFold                Main               3.5    1.3
matchesSnoc                Main               2.6    8.3

Looking at this profiling information it’s hard to instantly declare whether this was a success or not.  Recall that we removed some custom cost centers and, as discussed previously, that probably distorted total time and allocation results.  However, if we operate under the assumption that the relational cost percentage numbers are accurate then we can claim that we shrunk the % allocation for the matchesFold cost center significantly.  If that wasn’t indicative enough of a clear win, we also shaved almost another two full seconds off of the wall clock time. The trade off, though, was that % allocation rose in a few places, notably in the matches function itself.

Let’s see if we can fix this by pushing strictness a bit further.  Rather than have each iteration of our fold return nested tuples, let’s instead have it work over a record type with strict fields.  Furthermore, let’s hit it with the sledgehammer that is the -funbox-strict-fields compilation flag to see if we can squeeze any more performance out of it.  Note that the code below requires the RecordWildCards and NamedFieldPuns language extensions: these let us write the more concise patterns St{used, m1} and St{..}.
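To see why the bangs matter, here is a tiny illustration (with hypothetical names, not our actual record): strict fields force both counters at every step of the fold, instead of letting chains of (+) thunks accumulate in a lazy accumulator.

```haskell
-- Hypothetical accumulator with strict fields: constructing an Acc
-- evaluates both counters, so a left fold over it stays thunk-free.
data Acc = Acc { count :: !Int, total :: !Int }

step :: Acc -> Int -> Acc
step (Acc c t) x = Acc (c + 1) (t + x)

main :: IO ()
main = let Acc c t = foldl step (Acc 0 0) [1 .. 100]
       in print (c, t)  -- (100,5050)
```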

data St = St { idx1 :: !Int, used :: !IntSet
             , m1 :: !String }

matches :: ByteString -> ByteString -> (IntSet, String)
matches s1 s2 = (used, m1) where
 window = max 0
            (max (BS.length s1) (BS.length s2) `div` 2 - 1)
 St{used, m1} = {-# SCC "matchesFold" #-}
                BS.foldl' snoc initst s1
 initst = St { idx1 = 0, used = IS.empty, m1 = "" }

 snoc st@St{..} c1 =
   (case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> {-# SCC "matchesSnoc" #-}
                  st { used = idx2 `IS.insert` used
                     , m1 = m1 ++ [c1] }
   ) { idx1 = idx1 + 1 }
   where
    p idx2 = idx2 `IS.notMember` used
             && c1 == s2 `BS.index` idx2
    lo = max 0 (idx1 - window)
    hi = min (BS.length s2 - 1) (idx1 + window)

And its profiling information:

total time  =        0.26 secs
total alloc = 2,525,880,780 bytes
wall clock  = 11.152646 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              68.5   47.4
matches                    Main              16.3   31.8
jaro                       Main               8.9   14.5
matchesFold                Main               3.5    1.7
matchesSnoc                Main               2.7    4.7

While not as impressive a change as some of our previous tweaks, moving to the strict record type still increased our performance by about half a second in wall clock time.  The last thing we can attempt to decrease our allocation is moving away from using append in the matchesSnoc cost center and instead trying to build our matches string using cons.
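The cons-based trick is the classic difference-list idea: represent the string-built-so-far as a function String -> String, so each “snoc” is an O(1) composition and the actual string is materialized only once, at the end. A minimal sketch:

```haskell
-- Difference-list sketch: each step composes (c :) onto the builder,
-- so appending a character costs O(1); apply to [] once at the end.
main :: IO ()
main = do
  let mk = foldl (\k c -> k . (c :)) id "abc"
  print (mk [])  -- "abc"
```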

data St = St { idx1 :: !Int, used :: !IntSet
             , mk :: !(String -> String) }

matches :: ByteString -> ByteString -> (IntSet, String)
matches s1 s2 = (used, mk []) where
 window = max 0
            (max (BS.length s1) (BS.length s2) `div` 2 - 1)
 St{used, mk} = {-# SCC "matchesFold" #-}
                BS.foldl' snoc initst s1
 initst = St { idx1 = 0, used = IS.empty, mk = id }

 snoc st@St{..} c1 =
   (case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> {-# SCC "matchesSnoc" #-}
                  st { used = idx2 `IS.insert` used
                     , mk = mk . (c1 :) }
   ) { idx1 = idx1 + 1 }
   where
    p idx2 = idx2 `IS.notMember` used
             && c1 == s2 `BS.index` idx2
    lo = max 0 (idx1 - window)
    hi = min (BS.length s2 - 1) (idx1 + window)

And its profiling information:

total time  =        0.67 secs
total alloc = 2,523,483,604 bytes
wall clock  = 11.279014 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              82.2   47.4
matches                    Main              10.9   31.9
jaro                       Main               4.6   14.5
matchesFold                Main               1.2    1.7
matchesSnoc                Main               1.0    4.6

Looking at the total allocation and wall clock time one would be tempted to say that this change did not really do much.  Interestingly enough, though, if you look at the %time numbers, almost all cost centers pushed a significant amount of time to the main function.  Theoretically a more efficient test harness could leverage this change for some bigger gains down the road.


We finish by stripping out our SCC annotations, adding an INLINE pragma for matches, manually inlining our find predicate, and performing common-subexpression elimination where the ByteString lengths were used.  This exhausts our optimization efforts, at least for this post.  The resultant profiling information is shown below.

total time  =        0.87 secs
total alloc = 2,593,431,612 bytes
wall clock  = 10.096731 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              63.8   46.1
jaro                       Main              36.1   53.9

Compared to the profiling information from our starting point we have sped up our implementation by almost a full five seconds of wall clock time; a whopping one third of our total execution time.  Furthermore, our total allocation shrank by almost one billion bytes.  Not too shabby for a few hours’ worth of work.

The Pool of Tears or:  How I Learned to Stop Crying and Love Benchmarking Libraries

This blog post is not a complete summary of our optimization efforts; to be honest, we got quite “stupid” with it.  In fact, Nick will be following this post with a few more detailing our efforts to see if we could find a data type that could outperform ByteStrings and his own wild journeys off into the world of minute algorithmic variations.  Regardless of whether you have enjoyed this post or not (Space Jam jokes are not for everyone), I encourage you to check the future ones out.  The reason I mention this, in addition to providing a moderately decent introduction to the next posts in this series, is to set up the following anecdote.

Towards the end of our optimization fun-time, perhaps a week in at that point, we were playing with a variation of the Jaro algorithm that was barely recognizable.  The code was rife with language extensions, unboxed types and primitives, and more calls to unsafePerformIO than I honestly feel comfortable admitting.  Given how far we had travelled from the original implementation, we built a few sanity checks into our test harness to confirm that our manual optimizations weren’t changing the input/output behavior of jaro.  Another thing we got into the habit of doing was clocking the time with both a profiled and non-profiled version, if only to have a small, meaningless victory with every test we ran.  We had just made another arbitrary change and ran the tests when something terrible happened: the sanity check passed for the non-profiled version of the code and failed for the profiled version.

Profiling had just affected the evaluation of our code.  That’s not good.

Those of us fortunate enough to have the newest generation of unibody Macs (read: not me) already knew that there were some issues with profiling for that architecture, given that most attempts to profile came back with zeros across the board.  Looking for answers, Nick turned to the Haskell Cafe for help, where he was pointed to this GHC bug report.  In short, profiling does not play nicely with the optimization flags, including the default -O0 flag (no optimizations).  Basically, profiling is currently broken and not to be trusted.

Obviously this was a problem, given that I had just written an entire blog post detailing an optimization methodology based almost entirely on profiling.  Oh, that reminds me: all of the profiling numbers in this post are probably wrong, so take them with a grain of salt.  This brings up a very good point that I neglected to mention earlier: there is no magic bullet of optimization.  Profiling, while powerful when it works, will never be entirely sufficient on its own.  Thankfully we included benchmarking code in our test harness, so we were never proceeding truly blind.

That statement leads to my next point:  Do whatever you can to avoid writing your own benchmarking code.  For any well-established programming language, invariably someone else will have already developed a robust benchmarking library.  If you can, use it.  For example, Bryan O’Sullivan has developed a fantastic benchmarking library for Haskell called criterion that I wish we had used from the beginning.  Instead, we spent almost as much time changing and improving our test harness as we did optimizing the Jaro algorithm.  Even with all of that effort, when it came time to execute a large batch of tests, roughly 100 tests with an estimated run time of 10+ hours, we rewrote everything to use criterion.  The reason was simple: it was the better testing framework.

I think both of these points are probably things that I’ve heard before.  Undoubtedly some professor tried to sagely pass on this knowledge during my undergraduate days, probably during the phase where I thought I would never use any of that material ever again (I’m still crossing my fingers that this holds true for C++).  The fact of the matter, though, is that it didn’t really ring true until I ran into these issues myself.

So if you get only one thing out of reading this post it should be the following: Participate in your own optimization experiment and feel free to get “stupid” with it.  Hopefully you will learn an enormous amount about the language you are working in.  Even more importantly, hopefully you will have a lot of fun.

Thank you for reading and see you next time.  Same functional time.  Same functional place.  Same functional channel.