Shading is the practice of renaming a dependency and embedding it in a project to be sure it won’t conflict with another version of itself (it’s a good time to go watch or rewatch Rich Hickey’s Spec-ulation).

For Unrepl we rely on shading extensively: we don’t want the code injected by the client to interfere with running code, or even with other tools running a different strain of Unrepl.

That’s how we ended up with the idea of content-defined shading: choose a granularity of shading (e.g. a namespace, all deps, or the whole project), compute a hash of it (in our case SHA1, thus we are SHA-ding!) and use the hash in the renaming process.

Doing so we end up with stable names that don’t depend on date, version or commit.

Traditionally, transient data structures use a mutable box to determine whether a node can be modified in place or not. Somehow it acts like a transaction: when the mutable box contains a non-null value it means the transient is still “open” (editable nodes have not been shared yet so they can be modified). Thus each node has a reference to a reference object (when the node was created by a persistent operation, the reference to the reference itself is null).
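To make the scheme concrete, here is a rough sketch (in Clojure, not the actual Java implementation used by Clojure’s transients) of the traditional approach: every editable node points to the transient’s box, and persistent! closes the transient by emptying that box.

(defrecord Node [edit children])       ; edit: the owner's box, or nil for nodes created persistently

;; box is e.g. (atom (Object.)), created when the transient is created
(defn editable? [node box]
  (and (identical? (:edit node) box)   ; was this node created by this very transient?
       (some? @box)))                  ; and is the transient still "open"?

;; persistent! does (reset! box nil); from then on editable? is false for every node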

When I was working on confluent map I found another way to track transient node ownership.

The main idea is to put the flag not in the node but in its parent. Since there is one flag per child, it’s better to store them as a bitmap. Space-wise this solution is not greedier than using a reference type: a 32-bit bitmap vs a 32-bit reference (with compressed pointers) plus an additional object.

The role of these flags is to tell whether a child is exclusively owned (not shared) by its parent. Now when one traverses a transient data structure from its root, one just has to check that the whole ownership chain is exclusive. When it is, the node is editable (mutable). No mutability required.
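Here is a minimal sketch of that check (the node representation – an :ownership bitmap plus a :children vector – is hypothetical, not the actual confluent-map code):

(defn editable-path?
  "path is the sequence of child indices leading from the root to the node."
  [root path]
  (loop [node root, path (seq path)]
    (if-let [[i & is] path]
      (if (bit-test (:ownership node) i)          ; is this child exclusively owned?
        (recur (nth (:children node) i) (seq is))
        false)                                     ; shared somewhere: must copy instead
      true)))                                      ; whole chain exclusive: safe to mutate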

Tangentially related notes:

In confluent map, I didn’t even need a separate bitmap, since I had a spare slot left in the main bitmap.

Long time no post.
In the years since the last post I’ve worked on several projects – let me introduce some of them!

First there’s xforms, a collection of transducer-related stuff. Xforms is really great if you need to do any kind of aggregation; it also provides transducer versions of some core functions and several new transducing contexts (strings, io). Plus it has optimizations for dealing with key-value pairs without ever allocating a pair object (in the best case).

Xforms was initially a clojure/jvm lib but Mike Fikes started porting it to cljs. However I was not happy with having to either split the codebase or break the API to solve the “macros-and-code-in-one-file” problem. With his and António’s expert knowledge to guide me, I figured out a couple of macros which allow writing cljc code that mixes macros and code and works on clj/jvm, cljs/clj and self-hosted cljs (yes, there are macros to figure out the cljs flavor). This is really useful when porting clj/jvm code to cljs/*. These macros are packaged in a library named Macrovich.
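For the curious, this is roughly what using it looks like (a sketch from memory of Macrovich’s README – the net.cgrand.macrovich namespace with deftime, usetime and case – so treat the details as an assumption; my.lib and my-macro are hypothetical names):

(ns my.lib
  #?(:cljs (:require-macros [net.cgrand.macrovich :as macros]
                            [my.lib :refer [my-macro]]))
  #?(:clj (:require [net.cgrand.macrovich :as macros])))

(macros/deftime
  ;; only exists at macro-definition time (clj/jvm, or the compiler stage of self-hosted cljs)
  (defmacro my-macro [x]
    `(vector ~x ~x)))

(macros/usetime
  ;; only exists at run time
  (defn use-it []
    (my-macro 42)))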

With colleagues at HCA Datalab we worked on Powderkeg (“Keg” for friends) which basically turns Apache Spark into a giant transducing context. Plus it works without any AOT. Start the REPL, connect to the cluster, run your transducers on RDDs. Benefits: you can experiment against real data with a tight feedback loop, and you can test your computations with no dependencies on Spark.

The part of Keg that makes the REPL-no-AOT experience possible has been repurposed into Portkey, which allows deploying freshly REPLed functions as AWS Lambdas. A sub-project is aws-clj-sdk, an AWS API generated from the machine-readable service descriptions provided by Amazon (like the official SDKs or Python’s Boto). Kimmo Koskinen and Baptiste Dupuch are hard at work on both projects.

Last, there’s Unrepl, which aims to provide better REPLs and tooling in general without adding a single dependency to your project. All that is needed is a plain socket REPL and an Unrepl-powered tool (like Spiral for Emacs, Vimpire for Vim, or Unravel for the terminal).

Transducers are powerful and easy to grasp when presented as transformations of reducing functions. However once you scratch their surface you quickly realize that’s not their true nature: they transform stateful processes.

In a previous post, I explained why seeded transduce forces transducers to return stateful reducing functions. However this can be fixed. The current implementation of transduce reads:

(defn transduce
  "reduce with a transformation of f (xf). If init is not
  supplied, (f) will be called to produce it. f should be a reducing
  step function that accepts both 1 and 2 arguments, if it accepts
  only 2 you can add the arity-1 with 'completing'. Returns the result
  of applying (the transformed) xf to init and the first item in coll,
  then applying xf to that result and the 2nd item, etc. If coll
  contains no items, returns init and f is not called. Note that
  certain transforms may inject or skip items."
  {:added "1.7"}
  ([xform f coll] (transduce xform f (f) coll))
  ([xform f init coll]
     (let [f (xform f)
           ret (clojure.core.protocols/coll-reduce coll f init)]
       (f ret))))

To fix it you have to make the seeded case the special case and not the normal case:

(defn fseed [f init]
  (fn
    ([] init)
    ([x] (f x))
    ([x y] (f x y))))

(defn transduce
  "reduce with a transformation of f (xf). If init is not
  supplied, (f) will be called to produce it. f should be a reducing
  step function that accepts both 1 and 2 arguments, if it accepts
  only 2 you can add the arity-1 with 'completing'. Returns the result
  of applying (the transformed) xf to init and the first item in coll,
  then applying xf to that result and the 2nd item, etc. If coll
  contains no items, returns init and f is not called. Note that
  certain transforms may inject or skip items."
  {:added "1.7"}
  ([xform f init coll] (transduce xform (fseed f init) coll))
  ([xform f coll]
     (let [f (xform f)
           ret (clojure.core.protocols/coll-reduce coll f (f))]
       (f ret))))

By making the seeded reduce the special case, the init value can be wrapped in a composite init value by transducer-returned reducing functions. Now writing, for example, a pure take is possible.
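As an illustration, here is a sketch (mine, not from the original post) of such a pure take on top of the modified transduce above: the remaining count travels inside a composite accumulator instead of living in a volatile.

(defn pure-take [n]
  (fn [rf]
    (fn
      ([] [n (rf)])           ; composite init: [remaining downstream-accumulator]
      ([[_ acc]] (rf acc))    ; completion: unwrap, then complete downstream
      ([[remaining acc] x]
       (let [acc (rf acc x)
             remaining (dec remaining)]
         (if (or (zero? remaining) (reduced? acc))
           (reduced [remaining (unreduced acc)])
           [remaining acc]))))))

;; with the modified transduce: (transduce (pure-take 3) + (range 10)) ;=> 3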

This modification fixes the seeded reduce case, but it requires allocating intermediate objects at each step, which goes against the promise that transducers alleviate allocative pressure (a promise inherited from reducers). So it’s fixable, but for performance reasons transducer-returned reducing functions remain stateful.

Once you go stateful… you have to follow a few rules:

- the 0-arity must always delegate to the 0-arity of the downstream reducing function,

- in any arity, the accumulator must be used in a linear manner, the only allowed operation on it being the 2-arity of the downstream function,

- don’t forget to check for reduced return values!

So the accumulator values are passed around but never used, except by the most downstream reducing function: the one that was passed to transduce by the user! Why not, then, encapsulate the accumulator state in a process?

A process in this model has only two operations: process one input, and complete; these map respectively to the 1-arity and 0-arity of a function:

(fn
  ([] ...)   ; completes the process, return value is unspecified
  ([x] ...)) ; processes x, returns true when no more input should be fed in
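For example, here is a sketch (not from the post) of two transformers in this model – note how the mapping one needs neither an accumulator nor reduced wrappers, while the taking one keeps its state locally:

(defn mapping [f]
  (fn [process]
    (fn
      ([] (process))             ; complete the downstream process
      ([x] (process (f x))))))   ; feed (f x) downstream, propagate its "stop" signal

(defn taking [n]
  (fn [process]
    (let [remaining (volatile! n)]
      (fn
        ([] (process))
        ([x] (let [r (vswap! remaining dec)]
               ;; stop once downstream says so or the quota is exhausted
               (or (process x) (not (pos? r)))))))))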

Conclusion

To me the main advantage of transducers as process transformers is that their interface is a bit less complected; especially by not having to deal with reduced wrappers – because of that it may be easier to have them support primitive types.

The process model also seems to be less of a mismatch when reasoning about transducers in other contexts (e.g. sequences, channels).

A better name

There has been much discussion on how to best type them but not that much about how to name them in a more patterned way. What about BuilderDecoratorFactories?

First rule: you don’t call a transducer.
Second rule: only composition is allowed.
Third rule: better stateful than pure.

Here is the rationale behind each rule:

First rule: you don’t call a transducer.

Transducers may be so-called “stateful”, that is they create stateful reducing functions. As a consequence you should not hold onto these for too long (otherwise state goes sour…). And there’s no better way to not hold onto them for too long than to never get a hold of them at all!

That’s why transduce, sequence, eduction and chan all take a transducer: to spare you the perils of having to call it.

Second rule: only composition is allowed.

It’s a direct consequence of the first rule: you should only compose transducers.
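In practice (a plain sketch using only core functions): build the pipeline with comp and hand the result, uncalled, to a transducing context.

(def xf (comp (map inc) (filter odd?) (partition-all 3)))

(into [] xf (range 10))                            ;=> [[1 3 5] [7 9]]
(sequence xf (range 10))                           ;=> ([1 3 5] [7 9])
(transduce (comp xf (map count)) + 0 (range 10))   ;=> 5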

Third rule: better stateful than pure.

When one wants to pass state from one step to the next there are two options: be pure and pass it in the accumulator (and you’ll use the 1-arity to clean up) or be stateful.

It turns out that the pure option is a bad one, for two reasons. First, it increases object churn. Second, it changes the type of the accumulator, and this is a problem because:

- in transduce an init value may be specified and this init value must be of the type of the accumulator,

- we don’t have a way to map from the result domain to the accumulator domain.

So it means the user has to be aware of an implementation detail of your transducer (the way you smuggle state in the accumulator) to craft a proper init value. It’s an abstraction leak, and that’s bad. Be stateful.
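For reference, here is what the stateful option typically looks like – a sketch essentially equivalent to clojure.core/dedupe’s transducer (the name dedupe-xf is mine): the state lives in a volatile created when the reducing function is built, and the accumulator (and therefore the init value) is left untouched.

(defn dedupe-xf []
  (fn [rf]
    (let [prev (volatile! ::none)]
      (fn
        ([] (rf))
        ([acc] (rf acc))
        ([acc x]
         (let [p @prev]
           (vreset! prev x)
           (if (= p x)
             acc
             (rf acc x))))))))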

The goal is to tightly arrange a given sequence of n words within page margins, maximizing overall neatness. To be more precise, we wish to minimize the sum, over all lines except the last, of the cubes of the number of blank characters at the end of each line. See the comments in the code for more details.

The algorithm has O(n^2) time and space complexities.

Below is my take on it and I believe the complexity to be O(n*width). I achieve linearity in the words count by laying out the text back to front: the key insight is that the optimal layout of some words only depends on the amount of space left on the current line, it does not depend on the layout of the words before them.

(defn layout [words width]
  (let [cat (fn [word {u :ugliness [l & ls] :lines}]
              ; adds the word to the start of the first line
              {:ugliness u :lines (cons (cons word l) ls)})
        br (fn [rem {u :ugliness ls :lines}]
             ; adds a break (creates a new line)
             {:ugliness (+ u (* rem rem rem)) :lines (cons () ls)})
        layout ; layout words, knowing that the first line shouldn't have more than rem characters
        (fn [layout words rem]
          (if-let [[word & ws] (seq words)]
            (cond
              (= rem width) ; blank line ahead
                (cat word (layout ws (- rem (count word))))
              (< (count word) rem) ; enough room for a space and the current word
                (cat word (min-key :ugliness
                            (layout ws (- rem (count word) 1))
                            (br rem (layout ws width))))
              :else (br rem (layout words width)))
            {:ugliness 0 :lines (list ())}))
        mlayout (memoize layout)
        layout (fn self [words rem] (mlayout self words rem))]
    (:lines (layout words width))))
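A quick check (mine, not part of the original exercise):

(layout ["The" "quick" "brown" "fox"] 10)
;=> (("The" "quick") ("brown" "fox"))
;; "The quick" leaves one blank on a 10-char line, so the total ugliness is 1^3 = 1;
;; the trailing blanks of the last line are free.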

This is an exercise I prepared for a Lambda Next workshop but that we never used.

The common advice about macros is that they should emit as little code as possible and delegate to ancillary functions as soon as possible. Here is an example from clojure.java.jdbc:

(defmacro transaction [& body]
  `(transaction* (fn [] ~@body)))

I still think this is good advice but it has unintended consequences. The problem with this piece of code is that all closed-over objects in body are going to be retained longer than expected, longer than they would have been retained if the macro had emitted all the logic implemented in transaction* instead of delegating to it. (See this discussion as an example of issues created by such code.)

The closure object has references to all closed-over objects and since a closure can be called many times, it can’t get rid of them. So the only time where they are going to be garbage collectible is once the closure itself becomes collectible… and a closure can’t be collected while it’s executing.

Helpfully there’s a low-level feature for that:

(defmacro transaction [& body]
  `(transaction* (^:once fn* [] ~@body)))

It instructs the compiler that the closure is one-shot and that closed-over references should be cleared, thus allowing referenced objects to be garbage collected before the closure returns.

This problem is not specific to macros, but in most cases it can easily be solved: the closure is an implementation detail and the macro writer knows enough about its life-cycle to fix it. However any regular closure (fn or reify) is going to prevent its closed-overs from being garbage collected while one of its (java) methods is running, because the closure is referenced by the stack.

During the last LambdaNext workshop a delegate stumbled on such a case while experimenting with reducers (and incidentally it made me understand a memory issue I worked around last year):
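The original snippet isn’t reproduced in this extract; a minimal reproduction along these lines (my sketch, not the workshop code) should exhibit the retention: the reducible returned by r/map is a reify that closes over the source sequence, so while its coll-reduce method runs the seq’s head stays reachable from the stack and nothing already realized can be collected.

(require '[clojure.core.reducers :as r])

;; the (map identity ...) wrapper forces a lazy-seq source even on recent Clojure versions
(reduce + 0 (r/map identity (map identity (range 1e8))))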

(Depending on your memory settings, you may have to tweak the length of the range to exhibit the problem; more details here)

If one modifies reducers to not use (java) methods but external extensions:

(in-ns 'clojure.core.reducers)

(defrecord Folder [coll xf])

(defn folder
  "Given a foldable collection, and a transformation function xf,
  returns a foldable collection, where any supplied reducing
  fn will be transformed by xf. xf is a function of reducing fn to
  reducing fn."
  {:added "1.5"}
  ([coll xf]
    (Folder. coll xf)))

(extend-type Folder
  clojure.core.protocols/CollReduce
  (coll-reduce
    ([fldr f1]
      (clojure.core.protocols/coll-reduce (:coll fldr) ((:xf fldr) f1) (f1)))
    ([fldr f1 init]
      (clojure.core.protocols/coll-reduce (:coll fldr) ((:xf fldr) f1) init)))
  CollFold
  (coll-fold [fldr n combinef reducef]
    (coll-fold (:coll fldr) n combinef ((:xf fldr) reducef))))

(defn tarjan
  "Returns the strongly connected components of a graph specified by its nodes
  and a successor function succs from node to nodes.
  The used algorithm is Tarjan's one."
  [nodes succs]
  (letfn [(sc [env node]
            ; env is a map from nodes to stack length or nil,
            ; nil means the node is known to belong to another SCC
            ; there are two special keys: ::stack for the current stack
            ; and ::sccs for the current set of SCCs
            (if (contains? env node)
              env
              (let [stack (::stack env)
                    n (count stack)
                    env (assoc env node n ::stack (conj stack node))
                    env (reduce (fn [env succ]
                                  (let [env (sc env succ)]
                                    (assoc env node (min (or (env succ) n) (env node)))))
                                env (succs node))]
                (if (= n (env node))
                  ; no link below us in the stack, call it a SCC
                  (let [nodes (::stack env)
                        scc (set (take (- (count nodes) n) nodes))
                        ; clear all stack lengths for these nodes since this SCC is done
                        env (reduce #(assoc %1 %2 nil) env scc)]
                    (assoc env ::stack stack ::sccs (conj (::sccs env) scc)))
                  env))))]
    (::sccs (reduce sc {::stack () ::sccs #{}} nodes))))
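A small usage sketch (not from the original post), with the graph given as an adjacency map used both as the node source and as the successor function:

(def g {:a #{:b} :b #{:c} :c #{:a :d} :d #{}})

(tarjan (keys g) g)
;=> #{#{:a :b :c} #{:d}}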

As always, if you need some short-term help with Clojure (code review, consulting, training etc.), contact me!

Decaying lists can also maintain a state that is updated after each conj or decay. Below the implementation, there is an example of state management.

You can also cap the length of the decaying list by using the :capacity option. A quick note on capacity: increasing the capacity by the half-life value (e.g. going from 1000 to 1050 when the half-life is 50) doubles the range the decaying list remembers.

Sometimes in a seq pipeline, you know that some intermediate results are, well, intermediate and as such don’t need to be persistent but, on the whole, you still need the laziness.

You can’t always opt for reducers because, while non-persistent (and thus faster), they imply giving up laziness.

So you can’t mix and match transformations on sequences and on reducers when you care for laziness. Or, wait!, maybe you can.

At their core, reducers are just a clever way to compose big functions while giving the user the illusion of manipulating collections. So we should be able to apply a reducer pipeline if we get hold of the composed function. This is exactly what the below seq-seq function does:

Of course seq-seq could be made to support chunked sequences and, if you try to play with it, beware of r/mapcat.

Update:

I added reverse-conses to account for the fact that when several items are added during a single call to f1 (e.g. during an r/mapcat “step”) they were added in the wrong order – if only everything were more set-like.