Sponsored By Aspose - File Format APIs for .NET

Aspose are the market leader of .NET APIs for file business formats – natively work with DOCX, XLSX, PPT, PDF, MSG, MPP, images formats and many more!

Functionally Implementing Intersperse

This week while seeming to put off finishing many other blog posts on type classes, Collective Intelligence, the war on foreach and so on, I found myself intrigued by solving a simple problem in F# and look at the tradeoffs.  This post is meant to be a little dive into several ways to solve a problem and seeing where the pitfalls lie.

The Intersperse Goal

Let’s look at the Haskell function named intersperse.  This function takes a separator and a given list and then, you guessed it, intersperses the separator in the given list.  Let’s look at the Haskell implementation:

intersperse             :: a -> [ a ] -> [ a ]
intersperse _   []      = []
intersperse _   [ h ]   = [ h ]
intersperse sep (h:t)  = h : sep : intersperse sep t

As you will note from the above function, if the list is empty, we return empty.  Else, if it is a singleton, then we return it as such.  Otherwise, we cons the head of the list to the separator then to the recursive call to the intersperse function with the tail of the list.  Seems simple enough. 

Since Haskell treats strings as lists as well, we can take advantage of this uniform behavior between strings and lists.  For example, let’s take a few items and run them through to see what we get:

*Main> :m +Data.List
*Main Data.List> :t intersperse
intersperse :: a -> [ a ] -> [ a ]
*Main Data.List> intersperse ',' "foo"
"f,o,o"
*Main Data.List> intersperse (-1) [1..3]
[1,-1,2,-1,3]

Given this implementation, it should be rather easy for us to implement this capability in F#.  Unfortunately, as we noted above, strings are not treated as lists, but instead as IEnumerable<char>.  So, to get this uniform behavior, let’s deal with IEnumerable<T> so that we can use both strings and lists.

The Pattern Matching Way

When dealing with lists, we have easy ways to tease the data apart with both a head and a tail, much as we had above in Haskell.  In F#, this behavior is no different.  But, this behavior does not exist for sequences IEnumerable<T>.  To implement this is ultimately not hard and requires the use of an Active Pattern.  Let’s show how that might work:

let tl s = 
  match Seq.length s with
  | 0 -> failwith "Empty seq"
  | 1 -> Seq.empty 
  | _ -> Seq.skip 1 s
 
let (|SeqCons|SeqNil|) = function
  | s when Seq.is_empty s -> SeqNil
  | s -> SeqCons(Seq.hd s, tl s)

Because the tail function does not exist on the existing Seq module, I added my own version to it with the F# September 2008 CTP.  This is no more than checking the length and if there is more than one, return a new sequence that skips the current item, else an empty sequence.  As for our active pattern, we define two cases, the SeqNil, or empty case, and the SeqCons, or the cons case.  When we are an empty list, we return the SeqNil pattern, else we have the SeqCons which gives us both the head and the tail of the list.

Once we have this in place, writing such functions as inits and tails are rather trivial such as the following:

let rec inits<'a> : 'a seq -> 'a seq seq = function
  | SeqNil         -> Seq.singleton Seq.empty
  | SeqCons(x, xs) -> 
      Seq.append <| Seq.singleton Seq.empty <| Seq.map (Seq.append [x]) (inits xs)
      
let rec tails<'a> : 'a seq -> 'a seq seq = function
  | SeqNil               -> Seq.singleton Seq.empty
  | SeqCons(_,xs) as xxs ->
      Seq.append <| Seq.singleton xxs <| tails xs      

The two functions I defined above are ones that return all initial segments to a list, and return all final segments to the list.  Such examples of running this are below:

> inits "abc";;
val it : seq<seq<char>>
= seq [seq []; seq ['a']; seq ['a'; 'b']; seq ['a'; 'b'; 'c']]
> tails "abc";;
val it : seq<seq<char>> = seq ["abc"; seq ['b'; 'c']; seq ['c']; seq []]

But, let’s get to the problem we’re actually here to solve.  What about our intersperse?  Now that we have our function defined, it’s rather easy such as the following:

let rec intersperse sep = function
  | SeqNil             -> Seq.empty
  | SeqCons(x, SeqNil) -> Seq.singleton x
  | SeqCons(x, xs)     -> 
      Seq.append <| Seq.singleton x <| Seq.singleton sep 
        |> Seq.append <| intersperse sep xs  

As you can see, we cover three cases.  The first case is the SeqNil, in which we’ll return an empty sequence.  The second case, we check if we have a head and an empty tail, and if so, we return a singleton of the head.  Finally, we have a head and a tail in which we append the head to the separator, and then ultimately to the intersperse recursive call.  We can now verify the results of the following:

> intersperse ',' "foo";;
val it : seq<char> = seq ['f'; ','; 'o'; ','; ...]
> intersperse -1 [1..3];;
val it : seq<int> = seq [1; -1; 2; -1; ...]

And there we have it, but is this the right way?  The answer is no.  Although the code looks rather clean, we are incurring a huge penalty with the tail and skip implementations.  The reason why is that they restart to the head of the sequence each time which incurs a bit of a performance penalty.  So, our code looks clean, but ultimately, this isn’t the right solution here.

How bad is it?  Well, let’s calculate 100 items with an intersperse of –1.  Let’s first look at a basic List implementation:

> #time;;

--> Timing now on

> module List =
-   let rec intersperse sep = function
-     | []      -> []
-     | [x]     -> [x]
-     | (x::xs) -> x :: sep :: intersperse sep xs;;

module List = begin
  val intersperse : 'a -> 'a list -> 'a list
end

Real: 00:00:00.000, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
> List.intersperse -1 [1..100];;
Real: 00:00:00.000, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
val it : int list
= [1; -1; 2; -1; 3; -1; 4; -1; 5; -1; 6; -1; 7; -1; 8; -1; 9; -1; 10; -1; 11;
   -1; 12; -1; 13; -1; 14; -1; 15; -1; 16; -1; 17; -1; 18; -1; 19; -1; 20; -1;
   21; -1; 22; -1; 23; -1; 24; -1; 25; -1; 26; -1; 27; -1; 28; -1; 29; -1; 30;
   -1; 31; -1; 32; -1; 33; -1; 34; -1; 35; -1; 36; -1; 37; -1; 38; -1; 39; -1;
   40; -1; 41; -1; 42; -1; 43; -1; 44; -1; 45; -1; 46; -1; 47; -1; 48; -1; 49;
   -1; 50; -1; ...]

Now we have a baseline of a really small time of nothing on the CPU clock and no real activity on the GC.  What about our function using a sequence pattern matching?  Let’s take a look at the numbers:

> intersperse -1 [1..100];;
Real: 00:00:01.086, CPU: 00:00:00.998, GC gen0: 65, gen1: 1, gen2: 0
val it : seq<int> = seq [1; -1; 2; -1; ...]

Above really points out the performance penalty we pay.  We have quite a bit of time tick off from our real time and CPU time for just 100 elements.  And of course the GC went into action even into the first generation.  Once again, this shows this code is not ideal.  And you don’t want to try with 1000 elements…  So, let’s move on.

The List Way

Instead, of dealing with the performance black hole that is our Seq implementation, it’s better to take that performance penalty up front by caching it to a list.  What we could do is cast the sequence to a list and then perform the operation as if we were pattern matching on a list, which is very efficient.  Something like that might look like the following:

let intersperse' sep : 'a seq -> 'a seq =
  let rec intersperseInner = function
    | []      -> []
    | [x]     -> [x]
    | (x::xs) -> x :: sep :: intersperseInner xs
  Seq.to_list >> intersperseInner >> Seq.of_list

What we’re doing here is creating an inner function that deals with the list.  Then we can implement the logic much as above.  Ultimately, we’d like to have a list version already declared somewhere else such as we did above for our performance check,  such as the following:

module List =
  let rec intersperse sep = function
    | []      -> []
    | [x]     -> [x]
    | (x::xs) -> x :: sep :: intersperse sep xs

module Seq =
  let intersperse sep : 'a seq -> 'a seq =
    Seq.to_list >> List.intersperse sep >> Seq.of_list

That’s a “good” solution in terms of the list implementation, but still we take that performance hit of going back and forth between lists and sequences for our Seq version.  How bad is it?  Well, let’s analyze our numbers.

> Seq.intersperse -1 [1..100];;
Real: 00:00:00.007, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
val it : seq<int>
= [1; -1; 2; -1; 3; -1; 4; -1; 5; -1; 6; -1; 7; -1; 8; -1; 9; -1; 10; -1; 11;
   -1; 12; -1; 13; -1; 14; -1; 15; -1; 16; -1; 17; -1; 18; -1; 19; -1; 20; -1;
   21; -1; 22; -1; 23; -1; 24; -1; 25; -1; 26; -1; 27; -1; 28; -1; 29; -1; 30;
   -1; 31; -1; 32; -1; 33; -1; 34; -1; 35; -1; 36; -1; 37; -1; 38; -1; 39; -1;
   40; -1; 41; -1; 42; -1; 43; -1; 44; -1; 45; -1; 46; -1; 47; -1; 48; -1; 49;
   -1; 50; -1; ...]

The answer isn’t too bad actually and there is no GC hit either.  As I like to explore all my options, is there yet another way?

The Imperative Way

There is one more way we could look at the problem through the use of an imperative sequence expression.    By tracking whether we are at the first item, then we can decide whether to emit the separator or not.  Let’s take a look at that possible implementation:

let interperse'' sep s = 
  seq { let notFirst = ref false 
        for x in s do 
          if !notFirst then yield sep; 
          yield x; 
          notFirst := true
      } 

By using a reference cell, we’re able to track whether we are at our first record, and if we are, we don’t emit.  Then we yield our first result and then ensure our reference cell is set to true, so that every time thereafter, we emit the separator.  By far, this is the most efficient solution in terms of performance, but not necessarily readability, as we are setting a reference cell each and every time we’re in the loop.  Before we jump to any conclusions, let’s look at the numbers:

> intersperse'' -1 [1..100] |> Seq.iter (fun _ -> ());;
Real: 00:00:00.003, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
val it : unit = ()

As you can see, this is the best performing of all the sequence based intersperse functions. 

Conclusion

It’s always interesting to look at multiple solutions to even a simple problem.  By analyzing different ways of doing the intersperse function, we were able to find which implementation was the better performer of the lot.  When sitting in the F# world, we have many of these tools at our disposal.  But, with that, we also are hampered in ways especially in regards to pattern matching over sequences.  Readability and maintainability are big pieces in the applications we write, but performance can be right up there, depending on the domain.  But, what might read well, just might also be some of the worst performing code.

This entry was posted in F#, Functional Programming, Haskell. Bookmark the permalink. Follow any comments here with the RSS feed for this post.
  • http://codebetter.com/members/Matthew.Podwysocki/default.aspx Matthew.Podwysocki

    @Paul,

    Actually my target audience is anyone who’s willing to learn something new and maybe step outside of that OO world you live in. I think I write from that perspective to give people just a glance on what functional programming is and the fact that it’s not a scary world after all.

    Ok, took the whole naming thing to heart. Yes, things like that can matter for a beginner who doesn’t yet have a grasp on the ins and outs of the language. I changed it a little and will keep it in mind for the future.

    Matt

  • PaulSmith

    I agree with Avner. Admittedly, I’m probably not your target audience, I’m here mainly for the OO blogs. However, I’ll take a peak at your posts from time-to-time trying to soak in a little bit at a time until it all just clicks for me (I know… wishful thinking).

    I took one look at the Haskell function and decided it would take too much time to try and vaguely understand it and skimmed over the rest. I didn’t even know it was an elementary topic. I would have stuck with it longer had the variables been spelled out (it does make a little more sense to me now). I’m not saying this because I think you “should” do it differently – just to give you a complete beginners perspective.

  • http://codebetter.com/members/Matthew.Podwysocki/default.aspx Matthew.Podwysocki

    @Avner

    Tim is correct that the x:xs convention is quite similar to i (or idx) or the for loops. The source code that I posted is directly from Haskell and those are the conventions used. Perhaps in the future when explaining elementary concepts, it might be helpful to change the code as to suit the the beginner. That’s why I took the time to explain the code instead of just pasting it. The code alone to an outsider might not be all that helpful.

    Matt

  • http://www.partario.com/blog/ Tim Robinson

    Avner, a lot of Haskell is like that: there’s a range of short variable names like this. The language is light on keywords, with plenty of symbols (although almost all of these are operators defined in the standard library, rather than in the language itself).

    Coming from a C# background I found Haskell’s syntax quite cryptic at first (“why do I have to stick $ signs in the middle of my function calls?”). It didn’t take me long to learn the syntax, though, and most of the underlying concepts have equivalents in modern C#.

    I’d say the x:xs convention for head:tail is equivalent to naming all of your ‘for’ loop variables ‘i': it’s not obvious to newcomers, but once you recognise it, you’ll see it used consistently throughout Haskell source code.

  • http://weblogs.asp.net/avnerk Avner

    Coming from an imperative/OO background, I can see a form of evolution, in the past decade or two, in the way sample code and tutorials are written in those languages. It’s rarer to see variables named “x” and “y”, and more common to see practices such descriptive variable names and code comments within samples.
    But for some reason, most blog posts and explanations of functional programming I’ve seen seem to use semi-obfuscated syntax. It’s hard enough for me, coming from an OO background, to wrap my head around the concepts. Why make it harder by making the examples less clear? For instance, I don’t know Haskell at all, so it took me a minute to figure out this syntax:
    intersperse sep (x:xs) = x : sep : intersperse sep xs

    it would have taken me half the time (and with more complicated code, it means a lot) to understand this line:

    intersperse sep (head:tail) = head : sep : intersperse sep tail

    I’m not even talking about the common C# practice of using full words (separator vs. sep), but just avoiding the oh-so-meaningless “x”.