Inlining and generic programming

Hello all,

First of all, I'm sorry that this email is so absurdly long. But it's not easy to explain the problem at hand, so I took a step-by-step approach. The executive summary is: GHC can do a great job with inlining, but often it doesn't, and I don't understand why. So I have some questions, which are highlighted in the text below. In general, any insights regarding inlining or improving the performance of generics are welcome. My final goal is to be able to state that generic functions (in particular using GHC.Generics) will have no runtime overhead whatsoever when compared to a handwritten type-specific version.

The setting

Generic programming is based on representing datatypes in a uniform way using a small set of representation types. Functions defined on those representation types can then be applied to all datatypes, because we can convert between datatypes and their representations.
However, generic functions tend to be slower than their specialised counterparts, because they have to deal with the conversions. But clever inlining (together with other compiler optimisations) can completely remove this overhead. The problem I'm tackling is how to tell GHC exactly what it should do when optimising generic code.

Simplified example

I'll focus on the problem of optimising a non-trivial function for generic enumeration of terms. My experience shows that GHC does quite well at optimising simple functions, especially consumers (like generic equality). But producers are trickier.

The particular implementation of these functions doesn't really matter. What's important is that we have a way to interleave lists (|||) and a way to diagonalise a matrix into a list (diag). We mark these functions as NOINLINE because inlining them will only make the core code more complicated (and may prevent rules from firing).
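For readers who want something concrete to experiment with, here is a minimal sketch of what such functions could look like. It is only one plausible formulation of fair interleaving and diagonalisation; the helper names skew and combine are assumptions made for illustration, and the NOINLINE pragmas follow the description above.

```haskell
infixr 5 |||

-- Fair interleaving: alternate between the two lists so that
-- neither side is starved, even if one of them is infinite.
(|||) :: [a] -> [a] -> [a]
[]       ||| ys = ys
(x : xs) ||| ys = x : (ys ||| xs)
{-# NOINLINE (|||) #-}

-- Diagonalise a (possibly infinite) matrix into a single list by
-- walking its anti-diagonals.
diag :: [[a]] -> [a]
diag = concat . foldr skew [] . map (map (\x -> [x]))
{-# NOINLINE diag #-}

-- Hypothetical helper: shift the accumulated diagonals by one position
-- and merge in the next row.
skew :: [[a]] -> [[a]] -> [[a]]
skew []       ys = ys
skew (x : xs) ys = x : combine (++) xs ys

-- Hypothetical helper: zipWith that keeps the leftover tail of the
-- longer list instead of dropping it.
combine :: (a -> a -> a) -> [a] -> [a] -> [a]
combine _ xs       []       = xs
combine _ []       ys       = ys
combine f (x : xs) (y : ys) = f x y : combine f xs ys
```

Both functions are lazy enough to be used on infinite inputs, which is what makes them suitable for enumerating recursive datatypes.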

Suppose we have a type of Peano natural numbers:

data Nat = Ze | Su Nat deriving Eq

Implementing enumeration on this type is simple:

enumNat :: [Nat]
enumNat = [Ze] ||| map Su enumNat

Now, a generic representation of Nat in terms of sums and products could look something like this:

type RepNat = Either () Nat

That is, either a singleton (for the Ze case) or a Nat (for the Su case). Note that I am building a shallow representation, since at the leaves we have Nat, and not RepNat. This mimics the situation with current generic programming libraries (in particular GHC.Generics).

We'll need a way to convert between RepNat and Nat:

toNat :: RepNat -> Nat
toNat (Left ()) = Ze
toNat (Right n) = Su n

fromNat :: Nat -> RepNat
fromNat Ze     = Left ()
fromNat (Su n) = Right n

(In fact, since we're only dealing with a generic producer we won't need the fromNat function.)

To get an enumeration for RepNat, we first need to know how to enumerate units and sums:
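A minimal sketch of such enumerators, under some assumptions: the names enumU, enumEither and enumRepNat are hypothetical, and (|||) is given a simple interleaving definition so that the block is self-contained. Only enumNatFromRep, Nat, RepNat and toNat are names taken from the text.

```haskell
infixr 5 |||

-- Assumed definition of fair interleaving.
(|||) :: [a] -> [a] -> [a]
[]       ||| ys = ys
(x : xs) ||| ys = x : (ys ||| xs)
{-# NOINLINE (|||) #-}

data Nat = Ze | Su Nat deriving (Eq, Show)

type RepNat = Either () Nat

toNat :: RepNat -> Nat
toNat (Left ()) = Ze
toNat (Right n) = Su n

-- Enumerating the unit type: there is exactly one value.
enumU :: [()]
enumU = [()]

-- Enumerating a sum: interleave the enumerations of both sides.
enumEither :: [a] -> [b] -> [Either a b]
enumEither ea eb = map Left ea ||| map Right eb

-- The representation is shallow, so the recursive position refers
-- back to the enumeration of Nat itself.
enumRepNat :: [RepNat]
enumRepNat = enumEither enumU enumNatFromRep

-- Enumeration of Nat obtained by converting from the representation.
enumNatFromRep :: [Nat]
enumNatFromRep = map toNat enumRepNat

-- The handwritten version, for comparison.
enumNat :: [Nat]
enumNat = [Ze] ||| map Su enumNat
```

The goal of the optimisation exercise is for enumNatFromRep to compile to the same core as enumNat.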

enumNatFromRep finally starts with (|||) directly. But its second argument, lvl6_ryw, is a map of lvl5_ryv, which is itself a map! At this stage I expected GHC to be aware of the fusion law for map, but it seems that it isn't.
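The fusion law mentioned here can be stated to GHC as a rewrite rule. The following is a sketch only: the rule names are hypothetical, (|||) is an assumed interleaving function, and whether the rules actually fire depends on the optimisation level and on how they interact with the rules for map that ship with base (GHC may warn that another rule could fire first).

```haskell
infixr 5 |||

-- Assumed definition of fair interleaving; kept NOINLINE so that
-- rules mentioning it still have something to match on.
(|||) :: [a] -> [a] -> [a]
[]       ||| ys = ys
(x : xs) ||| ys = x : (ys ||| xs)
{-# NOINLINE (|||) #-}

-- Hypothetical rule names. Both equations hold semantically for the
-- definitions above; GHC does not check this, it only checks types.
{-# RULES
"sketch/map-map" forall f g xs.  map f (map g xs)  = map (f . g) xs
"sketch/map-|||" forall f xs ys. map f (xs ||| ys) = map f xs ||| map f ys
  #-}

-- A candidate for the first rule: two consecutive maps.
twoMaps :: [Int] -> [Int]
twoMaps xs = map (+ 1) (map (* 2) xs)

-- A candidate for the second rule: a map over an interleaving.
mapOverInterleave :: [Int] -> [Int] -> [Int]
mapOverInterleave xs ys = map negate (xs ||| ys)
```

Compiling with -O and -ddump-rule-firings is one way to see which of these rules, if any, were applied.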

Note how toNat is entirely gone from the second part of the enumeration (lvl5_ryF). Strangely enough, the enumerator for Ze (lvl4_ryE) is still very complicated: map toNat ([Left ()]). Why doesn't GHC simplify this to just [Ze]? Apparently because GHC doesn't simplify map over a single element list.

(Note, in particular, that we do not need INLINE pragmas on the from/to methods. This might just be because GHC thinks these are small and inlines them anyway, but in general we want to make sure they are inlined, so we typically use pragmas there.)

Now we need to implement enumeration generically. We do this by giving an instance for each representation type:

We explicitly tell GHC to inline each case, as before. Note that for products I'm not using the more natural list comprehension syntax because I don't quite understand how that gets translated into core.
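To make the two formulations concrete, here is a sketch of enumerating a product both ways. It uses ordinary pairs rather than a dedicated product representation type, and the function names and the definition of diag are assumptions made for illustration.

```haskell
-- Assumed definition of diagonalisation, with hypothetical helpers.
diag :: [[a]] -> [a]
diag = concat . foldr skew [] . map (map (\x -> [x]))
{-# NOINLINE diag #-}

skew :: [[a]] -> [[a]] -> [[a]]
skew []       ys = ys
skew (x : xs) ys = x : combine (++) xs ys

combine :: (a -> a -> a) -> [a] -> [a] -> [a]
combine _ xs       []       = xs
combine _ []       ys       = ys
combine f (x : xs) (y : ys) = f x y : combine f xs ys

-- Product enumeration written with explicit maps, so that rules
-- matching on map have something syntactically visible to match.
pairsByMap :: [a] -> [b] -> [(a, b)]
pairsByMap xs ys = diag (map (\x -> map (\y -> (x, y)) ys) xs)

-- The same enumeration written with nested list comprehensions.
pairsByComprehension :: [a] -> [b] -> [(a, b)]
pairsByComprehension xs ys = diag [[(x, y) | y <- ys] | x <- xs]
```

The two functions produce the same list; they may differ in the core GHC generates for them, which is the concern raised above.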

GEnum' is the class used for instantiating the generic representation types, and GEnum is used for user types. We use a default signature to provide a default method that can be used when we have a Representable instance for the type in question. This makes instantiating Nat very easy:

instance GEnum Nat
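A self-contained sketch of how this setup could fit together is shown below. The representation types U, :+: and the method names to, from, genum and genum' are assumptions made for illustration (only GEnum, GEnum', Representable and Rec are named in the text), and products are left out because Nat does not need them.

```haskell
{-# LANGUAGE DefaultSignatures #-}
{-# LANGUAGE FlexibleContexts  #-}
{-# LANGUAGE TypeFamilies      #-}
{-# LANGUAGE TypeOperators     #-}

infixr 5 |||
infixr 5 :+:

-- Assumed definition of fair interleaving.
(|||) :: [a] -> [a] -> [a]
[]       ||| ys = ys
(x : xs) ||| ys = x : (ys ||| xs)
{-# NOINLINE (|||) #-}

-- Assumed representation types: unit, binary sum, recursive position.
data U        = U
data a :+: b  = L a | R b
newtype Rec a = Rec a

-- Conversion between a datatype and its representation.
class Representable a where
  type Rep a
  to   :: Rep a -> a
  from :: a -> Rep a

-- Enumeration on representation types.
class GEnum' a where
  genum' :: [a]

instance GEnum' U where
  {-# INLINE genum' #-}
  genum' = [U]

instance (GEnum' a, GEnum' b) => GEnum' (a :+: b) where
  {-# INLINE genum' #-}
  genum' = map L genum' ||| map R genum'

instance GEnum a => GEnum' (Rec a) where
  {-# INLINE genum' #-}
  genum' = map Rec genum

-- Enumeration on user types, with a generic default.
class GEnum a where
  genum :: [a]
  default genum :: (Representable a, GEnum' (Rep a)) => [a]
  genum = map to genum'

data Nat = Ze | Su Nat deriving (Eq, Show)

instance Representable Nat where
  type Rep Nat = U :+: Rec Nat
  to (L U)       = Ze
  to (R (Rec n)) = Su n
  from Ze     = L U
  from (Su n) = R (Rec n)

-- The empty instance picks up the default method.
instance GEnum Nat
```

With this in place, take 3 (genum :: [Nat]) enumerates Ze, Su Ze and Su (Su Ze), the same prefix as the handwritten enumNat.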

Unfortunately, the core code generated in this situation (with the same RULES as before) is not nice at all:

Unfortunately this doesn't change the generated core code. After some more debugging, looking at the generated code at each simplifier iteration, I believe that this is because a2_r79M got lifted out too soon, preventing the rule from applying. With some imagination I decided to try the -fno-full-laziness flag to prevent let-floating. I'm not sure this is a good idea in general, but in this particular case it gives much better results:

Note how our enum is now of the shape `[Leaf] ||| diag y`, which is good. The only catch is that there are some `Rec`s still lying around, with their associated newtype coercions, and a function a1_r72i that basically wraps the recursive enumeration in a Rec, only to be unwrapped in the body of `$fGEnumTree_$cgenum`. I don't know how to get GHC to simplify this code any further.

Question: why do I need -fno-full-laziness for the ft/diag rule to apply?

Question: why is GHC not getting rid of the Rec newtype in this case?

I have also played with -O2, in particular because of the SpecConstr optimisation, but found that it does not affect these particular examples (perhaps it only becomes important with larger datatypes). I have also experimented with phase control in the rewrite rules and the inline pragmas, but didn't find it necessary for this example. In general, anyway, my experience with the inliner is that it is extremely fragile, especially across different GHC versions, and it's hard to get any guarantees of optimisation. I have also played with the -funfolding-* options before, with mixed results. [1] It's also a pity that certain flags are not explained in detail in the user's manual [2,3], like -fliberate-case, -fspec-constr-count and -fspec-constr-threshold, for instance.

Thank you for reading this. Any insights are welcome. In particular, I'm wondering if I might be missing some details regarding strictness.

RE: Inlining and generic programming

Your example was unusual in that it used a lot of top-level definitions. GHC treats them slightly specially. Given:

x = g 4

y = f x

GHC does not transform into this:

y = f (g 4)

which it would do in a nested let. Why not? Because the latter will generate code that dynamically allocates a thunk for (g 4), while the former will make a static thunk.

(An alternative would be to treat them uniformly and only pull out those nested thunks at the very last minute; but GHC doesn’t do that right now.)

A disadvantage is that it’s not statically visible to the simplifier that x is used once. If we have a RULE for f (g n), it might not fire -- because of the worry that someone else might be sharing x.

I think this is the root cause of much of your trouble.

Incidentally, giving x an INLINE pragma makes no difference. GHC is very cautious about duplicating non-values, and currently not even INLINE will make it less cautious. That’s another thing we could consider changing.

I’ll respond to part 2 (about generic programming) separately.

Simon

If we used that to move the cast out of the way, the RULE would match too.

But GHC is nowhere near clever enough to do either of these things. And it's far from obvious what to do in general.

=================

Bottom line: the choices made by the constraint solver can affect exactly where casts are inserted into the code. GHC knows how to move casts around to stop them getting in the way of its own transformations, but is helpless if they get in the way of RULES.

I am really not sure how to deal with this. But it is very interesting!

Simon

Re: Inlining and generic programming

Simon,

Thanks a lot for looking into this. One question regarding maps that I still don't understand: can you explain whether it is indeed expected that GHC won't fuse `map f . map g` into `map (f . g)` by default? And likewise that it won't rewrite `map f [x]` ~> `[f x]`?

I understand your explanation for why this can result in different behaviour. But I think we should try to find a way to address this. We came up with DefaultSignatures to simplify instantiating generic functions, but now it turns out that using them can make the code slower! This is rather unexpected. Could we perhaps have a way to let users specify rewrite rules involving `cast`?