Exercises in Optimization (A O(n)-Part Series)

Greetings and welcome to the first entry of Functional Kansans, a new blog brought to you by a few of the people interested in functional programming at the University of Kansas.

We open the blog with a series of posts by myself and Nick Frisby detailing our recent exploration into the world of Haskell program optimization.  Before we begin, I should state that this work arose from what was supposed to be a class-wide project for Andy Gill’s advanced functional programming class that Nick and I co-opted as our own, perhaps greedily so.  The following content would not have been possible without Andy’s excellent instruction and the support of our other classmates (who hopefully will have posts of their own on this blog at some point).

The Problem

The target of our optimization efforts was the Jaro-Winkler algorithm, a metric for judging the similarity of two strings.  Specifically, we looked at the Jaro distance metric given that its time-complexity dominates that of the Winkler variant.

Pseudo-Code

This section provides pseudo-code snippets in order to explain the algorithm.

The Jaro measure is computed in terms of the string lengths, number of matches, m, and the number of transpositions, t, and yields a similarity metric between 0 and 1.

jaro s1 s2 = (m / n1 + m / n2 + (m - t) / m) / 3
  where n1 = length s1; n2 = length s2
        m = matches s1 s2
        t = transpositions s1 s2

A pair of characters from each string match if they are equal and not too far apart.  The maximal match distance is defined as a sliding window, framed by the list of indices [lo .. hi], whose size is determined relative to the length of the two input strings.  Additionally and importantly, each character can only match once.

matches s1 s2 = snd (foldl snoc nil (zip s1 [0..])) where
  nil = ([], 0)
  window = max 0 (max (length s1) (length s2) `div` 2 - 1)

  snoc st@(used, m) (c1, idx1) =
      case find p [lo .. hi] of
         Nothing -> st
         Just idx2 -> (idx2 : used, m + 1)
    where
      p idx2 = idx2 `notElem` used && c1 == s2 !! idx2
      lo = max 0 (idx1 - window)
      hi = min (length s2 - 1) (idx1 + window)

To calculate the number of transpositions, count the positions at which the in-order matched characters differ, then halve that count (this halving shows up as the (/ 2) in the full implementation below).

transpositions s1 s2 =
  (`div` 2) $ length $ filter (\(c1, c2) -> c1 /= c2) $
  zip (filter wasMatched s1) (filter wasMatched s2)

Note that since each character can match at most once, and only with a character from the other string, both strings contain the same number of matched characters; the zip doesn’t truncate either argument. Also note that we avoid specifying a full definition for wasMatched, but suffice it to say that no implementation can actually have its stated type of Char -> Bool.
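
As a quick worked example (the classic pair from the Jaro-Winkler literature, not drawn from our test data): for "MARTHA" and "MARHTA" every character finds a match (m = 6), the in-order matched strings "MARTHA" and "MARHTA" differ at two positions, so t = 1 after halving, and jaro = (6/6 + 6/6 + (6 - 1)/6) / 3 ≈ 0.944.
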
The next sections present our starting implementation of the algorithm and discuss the profiling techniques we used to guide our optimization efforts.

The Starting Point

As was just mentioned, wasMatched cannot have the type Char -> Bool, at least as the transpositions function is currently specified, because it is missing a dependency on the indices calculated in the matches function; the character value on its own is insufficient for determining whether a position in the string was matched.  Presented below is a naive implementation of the algorithm that corrects that deficiency by having matches return the string consisting of all matched characters, in order, from the left string (s1) and all matched indices from the right string (s2).  Please also note that the transpositions computation has been “folded into” the definition of jaro in its where clause, mainly for brevity.

jaro :: String -> String -> Double
jaro s1 s2
 | 0 == m = 0
 | otherwise = (m / n1 + m / n2 + (m - t) / m) / 3
 where 
  n1 = fromIntegral $ length s1
  n2 = fromIntegral $ length s2
  (used, m1) = matches s1 s2
  m2 = map (s2 !!) $ sort used
  m = fromIntegral $ length m1
  t = (/ 2) $ fromIntegral $ length $ 
      filter id $ zipWith (/=) m1 m2

matches :: String -> String -> ([Int], String)
matches s1 s2 = foldl snoc initst (zip s1 [0..]) where
 window = max 0 (max (length s1) (length s2) `div` 2 - 1)
 initst = ([], "")

 snoc st@(used, m1) (c1, idx1) =
   case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> (idx2 : used, m1 ++ [c1])
   where
     p idx2 = idx2 `notElem` used && c1 == s2 !! idx2
     lo = max 0 (idx1 - window)
     hi = min (length s2 - 1) (idx1 + window)

From this point the path to an optimized implementation is simple: find what is slow and make it faster.  Flippancy aside, we decided to identify time-cost hotspots using GHC’s profiling functionality.  Many notable bloggers and Haskell hackers have also advocated direct inspection of the Core code that GHC produces to augment profiling information; however, we chose not to go that route.
The honest truth of the matter is that Core is not a fun language to read, except perhaps for gluttons for syntactic punishment.  Furthermore, the practice quickly becomes infeasible for any significantly large code base.

Profiling

So we made up our mind.  Profiling and profiling only.

Unfortunately, running a single iteration of jaro takes such a trivially small amount of time that no worthwhile profiling data would be collected.  To sidestep this issue we devised the following test harness that calls jaro a large number of times.

import Control.DeepSeq (rnf)
import Data.List       (sortBy)
import Data.Time       (diffUTCTime, getCurrentTime)

words_file :: FilePath
words_file = "/usr/share/dict/propernames"

main :: IO ()
main = do
  ns <- lines `fmap` readFile words_file
  now <- length ns `seq` getCurrentTime
  let xs :: [Double]
      xs = sortBy (flip compare)
           [ jaro n m | n <- ns, m <- ns ]
  now2 <- rnf xs `seq` getCurrentTime
  print (diffUTCTime now2 now)

In short, the harness reads a line-delimited file, splits it into words, and then calculates the jaro similarity for the cross product of the word list.  Also note that there is some additional plumbing to measure execution time with the Data.Time library and to force full evaluation of the result list using the Control.DeepSeq library.  This combination provides us with informal benchmarking capabilities that should confirm any results we find with profiling.

For all profiling experiments in this post we will be using the following compilation flags: -rtsopts -prof -auto-all -O2.  The first three are the standard flags required for profiling all top-level definitions, and the final flag is the simplest way to activate most of the compiler optimizations that matter for us.  There may be a future post discussing other optimization flags of benefit, but for now we accept -O2 as “good enough.”

Likewise we will be using the following runtime system options: -p -i0.001.  There are no surprises here: we simply activate time profiling and change the time scale for profiling ticks to 1 ms.  Note that as of GHC 7 most runtime system options are disabled by default, hence the need for the -rtsopts flag during compilation.
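
Putting these together, a typical compile-and-run cycle looks something like the following (the file and executable names are ours, not taken from the original code):

ghc -O2 -prof -auto-all -rtsopts --make Jaro.hs -o jaro
./jaro +RTS -p -i0.001 -RTS

The -p report is written to jaro.prof next to the executable.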

Profiling our naive implementation with the above settings gives us the following information:

total time  =        0.88 secs
total alloc = 3,255,736,936 bytes
wall clock  = 14.746186 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              61.4   36.8
matches                    Main              29.8   47.7
jaro                       Main               8.8   15.5

Fantastic.  So now we know that matches is the dominant function, accounting for roughly 30% of our total run time and almost 48% of our total allocation.  Looking back at the matches function, though, the code is fairly dense and, at least to less experienced eyes, does not appear to reflect any obvious optimization opportunities.  To gain more information we will introduce our own cost centers to measure any internal expressions that we think may account for a large portion of our run time.

matches :: String -> String -> ([Int], String)
matches s1 s2 = {-# SCC "matchesFold" #-}
                foldl snoc initst (zip s1 [0..]) where
 window = {-# SCC "matchesWindow" #-}
          max 0 (max (length s1) (length s2) `div` 2 - 1)
 initst = ([], "")

 snoc st@(used, m1) (c1, idx1) =
   case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> {-# SCC "matchesSnoc" #-}
                  (idx2 : used, m1 ++ [c1])
   where
     p idx2 = {-# SCC "matchesPred" #-}
              idx2 `notElem` used && c1 == s2 !! idx2
     lo = max 0 (idx1 - window)
     hi = {-# SCC "matchesHi" #-}
          min (length s2 - 1) (idx1 + window)

Running the profiling with the new cost centers does in fact give us quite a bit more information to work with:

total time  =        1.06 secs
total alloc = 2,883,573,832 bytes
wall clock  = 15.011385 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              30.5   41.5
matchesPred                Main              25.0    0.0
matches                    Main              13.5   18.3
jaro                       Main              12.6   17.5
matchesFold                Main               9.6   14.6
matchesWindow              Main               4.2    1.5
matchesHi                  Main               3.3    2.5
matchesSnoc                Main               1.3    4.1

Hold on, though.  The introduction of our own cost centers affected our total time and allocation by fairly significant amounts.  That can’t be right, can it?  We’ll touch on this more in a later section, but for now, thankfully, we can rely on the independent benchmarking built into our test harness (recorded above in the wall clock field) to confirm any speed-ups that we find.

Down the Rabbit Hole

Looking at the profiling information, we see that the dominant cost center we introduced is matchesPred which measures our find predicate:

p idx2 = idx2 `notElem` used && c1 == s2 !! idx2

There are a few things going on in this function, so let’s work from left to right.  Recall that used is simply a list of integers, making notElem a linear traversal.
It should be possible to find a better container type for used that speeds up this check without penalizing other operations over used.

We decided to model used as an IntSet (note: we imported it qualified as IS) rather than a list of integers.  In the worst case, the complexity of the operations should be no worse than with a list; in practice, we know that at the very least the allocation behavior should improve, likely leading to an improvement in time-cost as well.  We also gain the benefit of using a container that better matches the semantics we’re looking for; after all, a set of matched indices needs no room for duplicates.

matches :: String -> String -> (IntSet, String)
matches s1 s2 = {-# SCC "matchesFold" #-}
                foldl snoc initst (zip s1 [0..]) where
 window = {-# SCC "matchesWindow" #-}
          max 0 (max (length s1) (length s2) `div` 2 - 1)
 initst = (IS.empty, "")

 snoc st@(used, m1) (c1, idx1) =
   case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> {-# SCC "matchesSnoc" #-}
                  (idx2 `IS.insert` used, m1 ++ [c1])
   where
     p idx2 = {-# SCC "matchesPred" #-}
              idx2 `IS.notMember` used && c1 == s2 !! idx2
     lo = max 0 (idx1 - window)
     hi = {-# SCC "matchesHi" #-}
          min (length s2 - 1) (idx1 + window)

And its profiling information:

total time  =        0.72 secs
total alloc = 2,490,396,576 bytes
wall clock  = 14.79711 sec

COST CENTRE                MODULE           %time %alloc

main                       Main              41.1   48.1
matchesPred                Main              20.7    0.0
matches                    Main              11.7   11.6
matchesFold                Main               9.4   17.0
jaro                       Main               9.4   12.7
matchesWindow              Main               4.0    1.7
matchesSnoc                Main               1.8    6.2
matchesHi                  Main               1.8    2.8

As expected, this led to a significant decrease in total allocation during the execution of the program.  Unfortunately, while the profiling time did decrease by roughly 30%, the reduction in wall clock time was not as significant.  That leaves matchesPred as still the dominant cost center, so we revisit it looking for another optimization opportunity.  The only other function in this expression that is not a primitive comparison operator is the string indexing.  Obviously strings are not designed for efficient random access, so again we go on a quest for an alternative container type.

ByteStrings

The obvious and immediate choice here would be to use a ByteString as our new container type.  ByteStrings are so attractive because they provide constant time indexing and length checks in addition to extremely efficient stream-fused traversals, all of which should provide some big wins across the entire algorithm.

When considering which types to change, it is important to note that string s2 is the only one ever indexed into.  However, because s1 and s2 are built from the same word list it makes sense for them to share a type, avoiding unnecessary calls to pack or unpack, which can be a major source of inefficiency when using ByteStrings.  This is something that we will discuss in more detail in the conclusion, mainly to lead into the next post in our series, but for now we operate under the assumption that the preceding statement is correct.

The resultant types of our functions are shown below:

jaro :: ByteString -> ByteString -> Double
matches :: ByteString -> ByteString -> (IntSet, String)

For the most part the available ByteString operations mirror the String operations provided by the Prelude, so the code changes are predominantly just qualifications to pick out the ByteString versions of functions (note: we imported qualified as BS).  Given this, we will not walk through the updated code; a brief sketch follows for reference, and the profiling results are shown below in their entirety.
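
Here is roughly what the qualified matches looks like at this stage (a sketch reconstructed from the fused version shown later; SCC annotations are omitted, and we assume the fold still runs over the unpacked s1 zipped with an index list):

import           Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as BS
import           Data.IntSet           (IntSet)
import qualified Data.IntSet           as IS
import           Data.List             (find)

matches :: ByteString -> ByteString -> (IntSet, String)
matches s1 s2 = foldl snoc initst (zip (BS.unpack s1) [0..]) where
 window = max 0 (max (BS.length s1) (BS.length s2) `div` 2 - 1)
 initst = (IS.empty, "")

 snoc st@(used, m1) (c1, idx1) =
   case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> (idx2 `IS.insert` used, m1 ++ [c1])
   where
     p idx2 = idx2 `IS.notMember` used && c1 == s2 `BS.index` idx2
     lo = max 0 (idx1 - window)
     hi = min (BS.length s2 - 1) (idx1 + window)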

total time  =        0.51 secs
total alloc = 2,739,268,472 bytes
wall clock  = 13.066119 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              64.4   43.7
matchesPred                Main              10.2    0.0
matches                    Main               9.8   10.5
jaro                       Main               5.9   13.6
matchesFold                Main               4.7   22.5
matchesSnoc                Main               2.9    5.6
matchesHi                  Main               1.4    2.6
matchesWindow              Main               0.6    1.5

The first thing that we notice is that the switch to ByteStrings has pushed our allocation back up to a level near where it was before we made the switch to an IntSet to carry our match indices.  The tradeoff, though, is significant performance increases across the board, resulting in almost two full seconds shaved off of the wall clock time.

Looking at our code again, we recognize that the expressions within the matchesWindow, matchesPred, and matchesHi cost centers have now all been reduced to either constant time or primitive operations.  To clear up our profiling information from this point on we will be removing those cost centers since there is not much else we can do to optimize them directly.  That does not mean that we are done, though.  Even though our test harness is now the dominant cost center by a very large margin, we still have a few tricks up our sleeves that we have yet to exhaust to see if we can make the Jaro algorithm even faster.

Down the Rabbit Hole in Space Jam

All of the changes we have made up to this point have been, for the most part, agnostic to the algorithm itself.  We have identified a few hot spots and attempted to mitigate their cost by changing the data representations of the objects we are working with to make those operations more efficient; however, the structure of the code itself was left largely unchanged.  What follows in this section is an exploration of more “invasive” optimization attempts.  If the previous section was an allusion to simply following the white rabbit (profiling output) to guide our path, this section would best be described as again following the rabbit, but this time asking more in-depth questions like, “Why is Bill Murray here?”

The Bill Murray in this case is the large amount of allocation still happening in the matches function, specifically in our matchesFold cost center.  Whenever you see allocation issues with a fold you should think to yourself, maybe now is the time to try a strict version of that fold.  In fact, the ByteString library provides a very nice and efficient version of foldl' to use.  The issue, though, is that we are not currently folding over the string directly, but rather over the zip of the string and an incrementing counter.  To use the ByteString traversals we must fuse that zip in by tracking the counter at each iteration of the fold.

While we are fusing traversals we should also pay attention to the jaro function.  Currently we traverse string s2 indexing into it, only to again traverse the result with a zip, and then a filter.  In parallel we also calculate the value of m by taking the length of our matches string. All of these passes can, and should, be fused into a single loop.  The new code for all of these changes is shown below:

jaro :: ByteString -> ByteString -> Double
jaro s1 s2
 | 0 == m = 0
 | otherwise = (m / n1 + m / n2 + (m - t) / m) / 3
 where
   n1 = fromIntegral $ BS.length s1
   n2 = fromIntegral $ BS.length s2

   (used, m1) = matches s1 s2
   m = fromIntegral $ length m1

   t = loop (IS.toAscList used) m1 0 where
     loop (i : is) (c : m1') = loop is m1' . f
         where f = if c /= s2 `BS.index` i then (+ 1)
                   else id
     loop _ _ = (/ 2)

matches :: ByteString -> ByteString -> (IntSet, String)
matches s1 s2 = {-# SCC "matchesFold" #-}
                snd $ BS.foldl' snoc initst s1 where
 window = max 0
            (max (BS.length s1) (BS.length s2) `div` 2 - 1)
 initst = (0, (IS.empty, ""))

 snoc (idx1, st@(used, m1)) c1 =
   case find p [lo .. hi] of
     Nothing -> (idx1 + 1, st)
     Just idx2 -> {-# SCC "matchesSnoc" #-}
                  (idx1 + 1, (idx2 `IS.insert` used
                             , m1 ++ [c1]))
   where
     p idx2 = idx2 `IS.notMember` used
              && c1 == s2 `BS.index` idx2
     lo = max 0 (idx1 - window)
     hi = min (BS.length s2 - 1) (idx1 + window)

And its profiling information:

total time  =        0.77 secs
total alloc = 2,698,599,372 bytes
wall clock  = 11.650707 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              62.3   44.3
matches                    Main              21.9   32.6
jaro                       Main               9.6   13.5
matchesFold                Main               3.5    1.3
matchesSnoc                Main               2.6    8.3

Looking at this profiling information it’s hard to instantly declare whether this was a success or not.  Recall that we removed some custom cost centers and, as discussed previously, that probably distorted the total time and allocation results.  However, if we operate under the assumption that the relative cost percentages are accurate, then we can claim that we shrank the % allocation for the matchesFold cost center significantly.  If that wasn’t indicative enough of a clear win, we also shaved almost another two full seconds off of the wall clock time.  The tradeoff, though, was that % allocation rose in a few places, notably in the matches function itself.

Let’s see if we can fix this by pushing strictness a bit further.  Rather than have each iteration of our fold return nested tuples, let’s instead have it work over a record type with strict fields.  Furthermore, let’s hit it with the sledgehammer that is the -funbox-strict-fields compilation flag to see if we can squeeze any more performance out of it.  Note that the code below requires the RecordWildCards and NamedFieldPuns language extensions: these let us write the more concise patterns St{used, m1} and St{..}.

-- Strict accumulator record for the fold (candidates for -funbox-strict-fields)
data St = St { idx1 :: !Int, used :: !IntSet, m1 :: !String }

matches :: ByteString -> ByteString -> (IntSet, String)
matches s1 s2 = (used, m1) where
 window = max 0
            (max (BS.length s1) (BS.length s2) `div` 2 - 1)
 St{used, m1} = {-# SCC "matchesFold" #-}
                BS.foldl' snoc initst s1
 initst = St { idx1 = 0, used = IS.empty, m1 = "" }

 snoc st@St{..} c1 =
   (case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> {-# SCC "matchesSnoc" #-}
                  st { used = idx2 `IS.insert` used
                     , m1 = m1 ++ [c1] }
   ) { idx1 = idx1 + 1 }
   where
     p idx2 = idx2 `IS.notMember` used
              && c1 == s2 `BS.index` idx2
     lo = max 0 (idx1 - window)
     hi = min (BS.length s2 - 1) (idx1 + window)

And its profiling information:

total time  =        0.26 secs
total alloc = 2,525,880,780 bytes
wall clock  = 11.152646 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              68.5   47.4
matches                    Main              16.3   31.8
jaro                       Main               8.9   14.5
matchesFold                Main               3.5    1.7
matchesSnoc                Main               2.7    4.7

While not as impressive a change as some of our previous tweaks, moving to the strict record type still increased our performance by about half a second of wall clock time.  The last thing we can try in order to decrease our allocation is to move away from append in the matchesSnoc cost center and instead build our matches string with cons, accumulating a difference list (a String -> String function that prepends characters) and applying it once at the end.

data St = St { idx1 :: !Int, used :: !IntSet
             , mk :: !(String -> String) }

matches :: ByteString -> ByteString -> (IntSet, String)
matches s1 s2 = (used, mk []) where
 window = max 0
            (max (BS.length s1) (BS.length s2) `div` 2 - 1)
 St{used, mk} = {-# SCC "matchesFold" #-}
                BS.foldl' snoc initst s1
 initst = St { idx1 = 0, used = IS.empty, mk = id }

 snoc st@St{..} c1 =
   (case find p [lo .. hi] of
     Nothing -> st
     Just idx2 -> {-# SCC "matchesSnoc" #-}
                  st { used = idx2 `IS.insert` used
                     , mk = mk . (c1 : ) }
   ) { idx1 = idx1 + 1 }
   where
     p idx2 = idx2 `IS.notMember` used
              && c1 == s2 `BS.index` idx2
     lo = max 0 (idx1 - window)
     hi = min (BS.length s2 - 1) (idx1 + window)

And its profiling information:

total time  =        0.67 secs
total alloc = 2,523,483,604 bytes
wall clock  = 11.279014 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              82.2   47.4
matches                    Main              10.9   31.9
jaro                       Main               4.6   14.5
matchesFold                Main               1.2    1.7
matchesSnoc                Main               1.0    4.6

Looking at the total allocation and wall clock time one would be tempted to say that this change did not really do much.  Interestingly enough, though, if you look at the %time numbers, almost all cost centers pushed a significant amount of time to the main function.  Theoretically a more efficient test harness could leverage this change for some bigger gains down the road.

Results

We finish by stripping out our SCC annotations, adding an INLINE pragma for matches, manually inlining our find predicate, and performing common subexpression elimination where the ByteString lengths were used.  This exhausts our optimization efforts, at least for this post.  A sketch of these final tweaks and the resultant profiling information are shown below.
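
Roughly, the final matches ends up looking like the following (a sketch rather than the exact final listing; the go helper is just an inlined version of find, and the St record, imports, and language extensions from the previous listing are assumed):

{-# INLINE matches #-}
matches :: ByteString -> ByteString -> (IntSet, String)
matches s1 s2 = (used, mk []) where
 len2   = BS.length s2    -- common subexpression: the length of s2, computed once
 window = max 0 (max (BS.length s1) len2 `div` 2 - 1)
 St{used, mk} = BS.foldl' snoc initst s1
 initst = St { idx1 = 0, used = IS.empty, mk = id }

 snoc st@St{..} c1 = (go lo) { idx1 = idx1 + 1 }
   where
     -- the old `find p [lo .. hi]` written out as a manual loop
     go i | i > hi    = st
          | hit i     = st { used = i `IS.insert` used, mk = mk . (c1 :) }
          | otherwise = go (i + 1)
     hit i = i `IS.notMember` used && c1 == s2 `BS.index` i
     lo = max 0 (idx1 - window)
     hi = min (len2 - 1) (idx1 + window)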

total time  =        0.87 secs
total alloc = 2,593,431,612 bytes
wall clock  = 10.096731 secs

COST CENTRE                MODULE           %time %alloc

main                       Main              63.8   46.1
jaro                       Main              36.1   53.9

Compared to the profiling information from our starting point we have sped up our implementation by almost a full five seconds of wall clock time, a whopping one third of our total execution time.  Furthermore, our total allocation shrank by more than 600 million bytes.  Not too shabby for a few hours’ worth of work.

The Pool of Tears or:  How I Learned to Stop Crying and Love Benchmarking Libraries

This blog post is not a complete summary of our optimization efforts; to be honest, we got quite “stupid” with it.  In fact, Nick will be following this post with a few more detailing our efforts to see if we could find a data type that could outperform ByteStrings and his own wild journeys off into the world of minute algorithmic variations.  Regardless of whether you have enjoyed this post or not (Space Jam jokes are not for everyone), I encourage you to check out the future ones.  The reason I mention this, in addition to providing a moderately decent introduction to the next posts in this series, is to set up the following anecdote.

Towards the end of our optimization fun-time, perhaps a week in at that point, we were playing with a variation of the Jaro algorithm that was barely recognizable.  The code was rife with language extensions, unboxed types and primitives, and more calls to unsafePerformIO than I honestly feel comfortable admitting.  Given how far we had travelled away from the original implementation, we built a few sanity checks into our test harness to confirm that our manual optimizations weren’t changing the input/output behavior of jaro.  Another thing we got into the habit of doing was clocking the time with both a profiled and non-profiled version, if only to have a small, meaningless victory with every test we ran.  We had just made another arbitrary change and run the tests when something terrible happened: the sanity check passed for the non-profiled version of the code and failed for the profiled version.

Profiling had just affected the evaluation of our code.  That’s not good.
Those of us fortunate enough to have the newest generation of unibody Macs (read: not me) already knew that there were some issues with profiling for that architecture, given that most attempts to profile came back with zeros across the board.  Looking for answers, Nick turned to the Haskell Cafe for help where he was pointed to this GHC bug report.  In short, profiling does not play nicely with the optimization flags, including the default -O0 flag (no optimizations).  Basically, profiling is currently broken and not to be trusted.

Obviously this was a problem given how I had just written an entire blog post that detailed an optimization methodology based almost entirely on profiling.  Oh that reminds me, all of the profiling numbers in this post are probably wrong so take them with a grain of salt.  This brings up a very good point that I neglected to mention earlier:  There is no magic bullet of optimization.  Profiling, while powerful when it works, will never be entirely sufficient on its own.  Thankfully we included benchmarking code in our test harness, so we were never proceeding truly blind.

That statement leads to my next point: do whatever you can to avoid writing your own benchmarking code.  For any well-established programming language, invariably someone else will have already developed a robust benchmarking library.  If you can, use it.  For example, Bryan O’Sullivan has developed a fantastic benchmarking library for Haskell called criterion that I wish we had used from the beginning.  Instead, we spent almost as much time changing and improving our test harness as we did optimizing the Jaro algorithm.  Even with all of that effort, when it came time to execute a large batch of tests, roughly 100 tests with an estimated run time of 10+ hours, we rewrote everything to use criterion.  The reason was simple: it was the better testing framework.
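
For the curious, a minimal criterion harness for this kind of measurement might look like the following sketch (assuming the String version of jaro from the starting point is in scope; with the ByteString version you would map BS.pack over the word list first):

import Criterion.Main (bench, defaultMain, nf)

main :: IO ()
main = do
  ns <- lines `fmap` readFile "/usr/share/dict/propernames"
  defaultMain
    [ bench "jaro cross product" $
        nf (\ws -> sum [ jaro a b | a <- ws, b <- ws ]) ns
    ]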

I think both of these points are probably things that I’ve heard before.  Undoubtedly some professor tried to sagely pass on this knowledge during my undergraduate days, probably during the phase where I thought I would never use any of that material ever again (I’m still crossing my fingers that this holds true for C++).  The fact of the matter, though, is that it didn’t really ring true until I ran into these issues myself.

So if you get only one thing out of reading this post it should be the following: Participate in your own optimization experiment and feel free to get “stupid” with it.  Hopefully you will learn an enormous amount about the language you are working in.  Even more importantly, hopefully you will have a lot of fun.

Thank you for reading and see you next time.  Same functional time.  Same functional place.  Same functional channel.

-Evan


4 responses to “Exercises in Optimization (A O(n)-Part Series)”

  1. Isn’t ByteString the wrong type for this? You’re dealing with text, and ByteStrings represent bytes, so you will go wrong when dealing with characters encoded as multiple bytes. Text would probably be a better choice.

    • Erik, thanks for the question.

      Nothing should go wrong in this example because the file we are dealing with only has characters from the Basic Latin subset of Unicode.

      In general, though, you’re right; Data.ByteString.Char8 truncates characters to an 8 bit representation, so it only covers a portion of Unicode.
      If you want something that covers the full Unicode range of Haskell’s Char type, then yes, you would want to use Data.Text.
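
      For instance, in GHCi (a quick illustration of the truncation):

      ghci> import qualified Data.ByteString.Char8 as BS
      ghci> BS.unpack (BS.pack "\955")   -- 'λ' (U+03BB) keeps only its low byte
      "\187"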

      I must admit a bit of naivety here, because I haven’t used the Data.Text library a great amount, but I believe the tradeoff for supporting the entirety of Unicode is a performance hit.
      At the very least you are now using a minimum of another 8 bits to store each character (it actually may be more with data type overhead) so allocation will be higher.
      Additionally, the internal representations are fairly different, with ByteString using a ForeignPtr to a Word8 and letting C do most of the heavy lifting and Text using its own version of a packed Array and keeping everything “in house.”
      The result is operations with very different complexities; for example, length in Data.ByteString is O(1) and length in Data.Text is O(n).
      Both libraries are obviously heavily optimized and fused behind the scenes, so I’m not sure what the actual performance difference is in real world time, but my intuition is that ByteString should be faster for most, if not all, operations.

      Nick’s next post details some more in-depth investigation into alternative data type representations for our strings, but unfortunately Text wasn’t one of them. My guess, though, is that it would perform similarly to the (UArray Int Char) representation we test, with Text maybe being a bit faster thanks to fusion rules written with text-manipulation use cases in mind.

      You’ve got me curious now, though, so I’ll probably revisit this and test Text directly when I get the time.

  2. Pingback: Trying some datatypes on for time (+ a few algorithmic variations) | Functional Kansans

  3. Pingback: Incremental optimizations | Functional Kansans
