This is exactly how I work, and I went through a similar process to the article writer in coming to learn that premature abstraction fails for the same reasons premature optimisation does: before you have enough data to analyse (in the form of code that executes slowly or obvious refactoring candidates), you’re working blind, and it’s sheer luck if what you decided to optimise or abstract would’ve been a hot path or good model. Having only two similar things isn’t enough to predict what the “pivot points” of an abstraction over them and future things might be.

A Rust library includes the actual compiled functions like you’d expect, but it also contains a serialized copy of the compiler’s metadata about that library, giving function prototypes and data structure layouts and generics and so forth. That way, Rust can provide all the benefits of precompiled headers without the hassle of having to write things twice.

Of course, the downside is that Rust’s ABI effectively depends on accidental details of the compiler’s internal data structures and serialization system, which is why Rust is not getting a stable ABI any time soon.

I should’ve been clearer. Rust will not recompile third-party crates most of the time. It will if you run cargo clean, if you change compile options (e.g., activate or deactivate LTO), or if you upgrade the compiler, but during regular development it won’t happen often. However, there is a build for cargo check, a build for cargo test, and yet another build for cargo build, so you might still end up compiling your project three times.

I mentioned keeping crates under control, because it takes our C.I. system at work ~20 minutes to build one of my projects. About 5 minutes is spent building the project a first time to run the unit tests, then another 10 minutes to compile the release build; the other 5 minutes is spent fetching, building, and uploading a Docker image for the application. The C.I. always starts from a clean slate, so I always pay the compilation price, and it slows me down if I test a container in a staging environment, realize there’s a bug, fix the bug, and repeat.

One way to make sure that your build doesn’t take longer than needed is to be selective in your choice of third-party crates (I have found that the quality of crates varies a lot) and to make sure that a crate pays for itself. serde and rayon are two great libraries that I’m happy to include in my project; on the other hand, env_logger pulls in a few transitive dependencies just for coloring the logs it generates. However, neither journalctl nor docker container logs show colors, so I am paying a cost without getting any benefit.

Definitely: this is why MLton does it; it’s a whole-program optimizing compiler. The compilation-speed tradeoff is so severe that its users usually resort to another SML implementation for actual development and debugging, and only use MLton for release builds.
If we can figure out how to make whole program optimization detect which already compiled bits can be reused between builds, that may make the idea more viable.

In the last discussion, I argued for a multi-stage process that improves developer productivity, especially keeping the mind flowing. The final result is as optimized as possible, but with no wait times: you always have something to use.

Exactly. I think developing with something like smlnj, then compiling the final result with mlton is a relatively good workflow. Testing individual functions is faster with Common Lisp and SLIME, and testing entire programs is faster with Go, though.

Won’t that make optimizations extremely hard? I haven’t watched the video, so I don’t know the details (and the Jai language primer makes no mentions of contexts), but if you can’t tell statically what’s in scope, it seems to me that most analyses will have to conservatively assume that the universe is in scope, no?

Things may have changed from the last demo I saw of Jai contexts, but this seems to be something intended to be used sparingly, or at least the context should contain only a few root object pointers. Functions that use context simply desugar to context-passing-style. The really interesting problem is what to do about higher-order code.

One other thing that makes this easier: Jai is focused on fast full compilation, so it doesn’t suffer from the usual restrictions imposed by separate compilation. It would be possible to do conservative global analysis (very cheaply!) to compute which functions need which partitions of the whole context.

Scope and optimization here are separate questions and I don’t see how they’re related. Regarding scope, I don’t know the full details but I would assume you have to declare the global variables beforehand, so it’s not like you can introduce arbitrary variables into the context. The compiler knows exactly which static addresses are accessible and which are not. Perhaps that answers your question?

Yes, every tool should have a custom format that needs a badly cobbled together parser (in awk or whatever) that will break once the format is changed slightly or the output accidentally contains a space. No, jq doesn’t exist, can’t be fitted into Unix pipelines, and we will be stuck with sed and awk until the end of times, occasionally trying to solve the worst failures with find -print0 and xargs -0.

JSON replaces these problems with different ones. Different tools will use different constructs inside JSON (named lists, unnamed ones, different layouts and nesting strategies).

In a JSON shell tool world you will have to spend time parsing and re-arranging JSON data between tools; as well as constructing it manually as inputs. I think that would end up being just as hacky as the horrid stuff we do today (let’s not mention IFS and quoting abuse :D).

Side story: several months back I had a co-worker who wanted me to write some code that parsed his data stream and did something with it (it was plotting-related, IIRC).

Me: “Could I have these numbers in one-record-per-row plaintext format please?”

Co: “Can I send them to you in JSON instead?”

Me: “Sure. What will be the format inside the JSON?”

Co: “…. it’ll just be JSON.”

Me: “But in what form? Will there be a list? Names of the elements inside it?”

Co: “…”

Me: “Can you write me an example JSON message and send it to me, that might be easier.”

Co: “Why do you need that, it’ll be in JSON?”

Grrr :P

Anyway, JSON is a format, but you still need a format inside this format: element names, overall structures. Using JSON does not make every tool use the same format; that’s strictly impossible. One tool’s stage1.input-file is different from another tool’s output-file.[5].filename, especially if those tools are for different tasks.
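To make that concrete, here’s a small sketch in Python (the tool names and shapes are invented for illustration): two programs can both emit perfectly valid JSON and still need glue code between them.

```python
import json

# Hypothetical shapes from two tools that both "just use JSON".
tool_a_output = json.loads('{"stage1": {"input-file": "a.txt"}}')
tool_b_example = json.loads('{"output-file": [{"filename": "b.txt"}]}')

# Glue is still required: translate tool A's structure into tool B's.
def a_to_b(a_doc):
    return {"output-file": [{"filename": a_doc["stage1"]["input-file"]}]}

print(json.dumps(a_to_b(tool_a_output)))
```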

I think that would end up being just as hacky as the horrid stuff we do today (let’s not mention IFS and quoting abuse :D).

Except that standardized, popular formats like JSON have the side effect of growing tool ecosystems that solve most of the problems they bring. Autogenerators, transformers, and so on come with the territory if it’s a data format. We usually don’t get this when random people create formats for their own use: we have to fully custom-build the part handling the format rather than adapt an existing one.

Still, even XML, which had the best tooling I have used so far for a general-purpose format (XSLT and XSD above all), was unable to handle partial results.

The issue is probably due to their history, as a representation of a complete document / data structure.

Even s-expressions (the simplest format of the family) have the same issue.

Now we should also note that pipelines can be created on the fly, even from binary data manipulations. So a single dictated format would probably impose too many restrictions, if you want the system to actually enforce and validate it.

XML and its ecosystem were extremely complex. I used s-expressions with partial results in the past: you just have to structure the data to make it easy to get a piece at a time (I can’t recall the details right now). Another format I used, trying to balance efficiency, flexibility, and complexity, was XDR. Too bad it didn’t get more attention.

“So a single dictated format would probably pose too restrictions, if you want the system to actually enforce and validate it.”

The L4 family usually handles that by standardizing on an interface description language, with all of it auto-generated. Works well enough for them. Camkes is an example.

Indeed, to do what I did back then with XSLT, people now use JavaScript, which is less coherent, way more powerful, and in no way simpler.

While I am definitely not a proponent of JavaScript, computations in XSLT are incredibly verbose and convoluted, mainly because XSLT for some reason needs to be XML and XML is just a poor syntax for actual programming.

That, and the fact that my transformations worked fine with xsltproc but did just nothing in browsers, without any decent way to debug the problem, made me put XSLT away as an esolang — a lot of fun for an afternoon, but not what I would use to actually get things done.

That said, I’d take XML output from Unix tools and some kind of jq-like processor any day over manually parsing text out of byte streams.

I loved it back when I was writing HTML and wanted something more flexible that machines could handle; XHTML was my use case as well. Once I was a better programmer, I realized it was probably an overkill standard that could’ve been something simpler, with a series of tools each doing their own little job; maybe even different formats for different kinds of things. W3C ended up creating a bunch of those anyway.

“Pipelines are integrated on the fly.”

Maybe put it in the OS like a JIT. As far as byte streams go, that’s mostly what XDR did: minimally structured byte streams. Just tie the data types, layouts, and so on to whatever language the OS or platform uses the most.

JSON replaces these problems with different ones. Different tools will use different constructs inside JSON (named lists, unnamed ones, different layouts and nesting strategies).

This is true, but it does not mean that having some kind of common interchange format does not improve things. So yes, it does not tell you what the data will contain (but “custom text format, possibly tab-separated” is, again, not better). I know the problem, since I often work with JSON that contains or misses things. But the solution is not to avoid JSON but rather to have specifications. JSON has a number of possible schema formats, which puts it at a big advantage over most custom formats.

The other alternative is of course something like ProtoBuf, because it forces the use of proto files, which is at least some kind of specification. That throws away the human readability, which I didn’t want to suggest to a Unix crowd.

Thinking about it, an established binary interchange format with schemas and a transport is in some ways reminiscent of COM & CORBA in the nineties.

Using a whitespace-separated table such as the one suggested in the article is somewhat vulnerable to continuing to appear to work after the format has changed while actually misinterpreting the data (e.g., if a new column were inserted at the beginning, your pipeline could happily continue, since all it needs is at least two columns with numbers in them). JSON is more likely to either continue working correctly, ignoring the new column, or fail with an error. Arguably it is the key-value aspect that’s helpful here, not specifically JSON. As you point out, there are other issues with using JSON in a pipeline.
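A small Python sketch of that failure mode (the column values are made up): positional parsing silently picks up the wrong fields after the format changes, while key-based access is unaffected.

```python
import json

# Positional parsing of a whitespace table: after a new first column is
# inserted, "the first two columns" silently become the wrong values.
row_before = "12 34"
row_after = "extra 12 34"          # format changed: a column was prepended
x, y = row_after.split()[0], row_after.split()[1]   # now "extra" and "12"

# Key-based records keep working (or fail loudly on a missing key).
rec = json.loads('{"x": 12, "y": 34, "extra": 0}')
x2, y2 = rec["x"], rec["y"]        # still 12 and 34
```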

In my day-to-day work, there are times when I wish some tools would produce JSON and other times when I wish a JSON output was just textual (as recommended in the article). Ideally, tools should be able to produce different kinds of outputs, and I find libxo (mentioned by @apy) very interesting.

I spent very little time thinking about this after reading your comment, and wonder what, for example, the coreutils would look like if they accepted/returned JSON as well as plain text.

A priori we have this awful problem of making everyone understand every one else’s input and output schemas, but that might not be necessary. For any tool that expects a file as input, we make it accept any JSON object that contains the key-value pair "file": "something". For tools that expect multiple files, have them take an array of such objects. Tools that return files, like ls for example, can then return whatever they want in their JSON objects, as long as those objects contain "file": "something". Then we should get to keep chaining pipes of stuff together without having to write ungodly amounts of jq between them.
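A tiny Python sketch of that convention (all names are hypothetical): producers emit objects that at least carry a "file" key, and consumers ignore everything else.

```python
import json

# A hypothetical ls-like "tool" may return whatever extra keys it wants,
# as long as each object carries a "file" key (invented convention).
ls_output = [
    {"file": "a.txt", "size": 120, "mode": "rw-"},
    {"file": "b.txt", "size": 300, "mode": "r--"},
]

def downstream(objs):
    # A consumer that only looks at the agreed-upon key.
    return [o["file"] for o in objs]

print(json.dumps(downstream(ls_output)))
```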

I have no idea how much people have tried doing this or anything similar. Is there prior art?

In FreeBSD we have libxo which a lot of the CLI programs are getting support for. This lets the program print its output and it can be translated to JSON, HTML, or other output forms automatically. So that would allow people to experiment with various formats (although it doesn’t handle reading in the output).

But as @Shamar points out, one problem with JSON is that you need to parse the whole thing before you can do much with it. One can hack around it but then they are kind of abusing JSON.

powershell uses objects for its pipelines, i think it even runs on linux nowadays.

i like json, but for shell pipelining it’s not ideal:

the unstructured nature of the classic output is a core feature. you can easily mangle it in ways the program’s author never assumed, and that makes it powerful.

with line based records you can parse incomplete (as in the process is not finished) data more easily. you just have to split after a newline. with json, technically you can’t begin using the data until a (sub)object is completely parsed. using half-parsed objects seems not so wise.
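A quick Python illustration of that point (assuming the stream arrives in chunks): truncated JSON is unusable until the object closes, while line records split cleanly at the last newline.

```python
import json

# Truncated JSON: nothing usable until the (sub)object completes.
partial = '{"items": [1, 2, 3'
try:
    json.loads(partial)
    parsed = True
except json.JSONDecodeError:
    parsed = False            # have to wait for the closing brackets

# Line-based records: everything before the last newline is ready now.
buffer = "row1\nrow2\nrow"    # last record still incomplete
done, _, leftover = buffer.rpartition("\n")
rows = done.split("\n")       # complete records so far
```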

if you output json, you probably have to keep in memory the structure of the object tree you are generating, like “currently i’m in a list in an object in a list”. that’s not ideal sometimes (one doesn’t have to use real serialization all the time, but it’s nicer than just printing the correct tokens at the right places).

json is “javascript object notation”. not everything is ideally represented as an object. that’s why relational databases are still in use.

Salut mec! I did my undergrad at UdeM and I don’t recall needing a book for this class. I did the class with Marc Feeley and it was sufficient to attend the lectures and to make sure to preview and review the slides. The projects were good, and if you apply yourself and do them, I think you’ll get a lot out of this class without spending $100. TAPL is a cool book, but that would be more useful for Stephan’s grad-level class or a class by Brigitte Pientka at McGill.

I’m mostly looking into this for my personal benefit; this is the only class I’m taking this semester (as I also work full time) and I’d like to get really involved in the subject. I will probably take other classes in the same vein over the next semesters.

Don’t reach for a profiler, don’t try to set a global variable or do start/stop timing in code. Don’t even start figuring out how to configure a logger to print timestamps and use a log output format.

Instead, rerun the command and pipe it through a command that attaches timestamps to every output line.
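As a sketch of what such a command does (a minimal Python stand-in for a timestamping pipe filter like moreutils’ ts):

```python
import time

def stamp_lines(lines, clock=time.monotonic):
    # Prefix every line with the elapsed seconds since the filter started,
    # mimicking a timestamping pipe filter such as `ts`.
    start = clock()
    for line in lines:
        yield "%8.3f %s" % (clock() - start, line)

for out in stamp_lines(["starting build", "build finished"]):
    print(out)
```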

I quickly read the article before bed yesterday (the server seems overloaded this morning), and I’ll be really interested in seeing how he parallelized pretty-printing. In my own project, ppbert, parsing takes about 1/4 of the execution time and pretty-printing takes 3/4; I’d love to reduce the time it takes to pretty-print.

Well, I threw 10 min at Googling for a CompSci or other work parallelizing pretty-printing. Gave me nothing. Most searches on compiler-related stuff gives me too much data. Tried Bing but it kept talking about parallel ports for printers. Nevermind on that.

Point being that a parallel algorithm and write-up for pretty-printers would be worthwhile, so that there’s at least one of them in Google. :) Anyone wanting to experiment might also just port regular algorithms to parallel languages like Cilk or Chapel, twiddling with data- and thread-parallel variants until they go faster. Warning: if not used on hierarchically structured programs, the expected vs. actual speedups might be all over the place, since it might not be clear how to best divide the work. If it doesn’t parallelize cleanly, then Amdahl’s Law ruins it.

Well, one prior work on parallel printing is GNU Parallel. Parallel printing is actually one of its most important features, advertised as such on the front page. To quote: “GNU parallel makes sure output from the commands is the same output as you would get had you run the commands sequentially”.

Appreciate the tip as that might be useful. For the search, I was aiming for something more efficient with it being pretty-printing done multi-threaded or otherwise in the same program. Still surprised GNU Parallel wasn’t in the results given the page has some terms I used.

Last year, I sat down and decided to learn how to use ed. For most people, my former self included, ed is this incredibly unintuitive editor that you quit by SIGKILLing after eat flaming death! fails to work. I was pleasantly surprised to find a simple underlying model that guided the entire design of the editor: addresses and commands. After I learned different types of addresses (e.g., line numbers, line number ranges, regular expressions, etc.) and some commands, I was able to write a small Python program entirely with ed.

I’ve had a similar experience with awk and jq: they can look like gibberish, yet they become easy-to-use and powerful tools in your arsenal once you understand their underlying models (patterns and actions for awk, json-in/json-out filters for jq).

Are there other types that are always on the heap? It turns out there are!! Here they are:

Vec (an array on the heap)

String (a string on the heap)

That sounds wrong to me. Both Vec<T> and String — as far as I know — are triples: a length (usize), a capacity (usize), and a pointer into the heap. Those three values can be stored on the stack; however, the pointer will always point to a memory block on the heap.

I want something like the OSX iTerm which just seemed to do what I want without a lot of friction. I’d prefer tabs and panes over windows but as long as I get fast rendering, 24-bit color, and can remap the keystrokes I’m happy. Tilix seems to fit the bill. I’d like to take advantage of triggers but haven’t gotten around to rebuilding vte.

I’ve also been using tilix for a few weeks now. I was a user of urxvt before, but I wanted to have tabs (otherwise I end up with 15-20 terms opened) and I wanted better font rendering than urxvt. I tried alacritty, but its lack of features was problematic (no true underlining, no scrollback, can’t dim the terminal when it’s not the active window, etc.)

Tilix is a pretty good terminal emulator. I have one problem with it: when I launch it, it sometimes fails to raise to the top in Openbox, so I need to Alt-Tab to it.

I like this book, and I like Wirth’s approach to compilers in general (i.e., keeping them very simple), but the typesetting in his documents is so bad that it makes reading them harder than it needs to be.

ISTR the printed copy I had was much more readable than this PDF, although once I learned that all his works were of course prepared on the same OS, computer, et cetera that he himself built, it made sense why they all had kind of lousy formatting. Which is really too bad, I agree.

I needed to implement a simple (though not small) database, and I was unsure how to persist data to disk as it was my first such project. I was thrilled to find this paper: it is short, to the point, and explains a simple and easy-to-implement technique; the only thing I’m sad about is that I did not think of it myself :)

What is it about inference rules, regular expressions, and BNF grammars that allow us to understand them perfectly well, even with the syntactical variance? If simple syntactic differences can cause confusion with overlines and substitution, shouldn’t their expression be re-thought entirely? For instance, would the example that mixes overlines and substitution not be clearer if two for-all quantifiers (∀) were used?

Presumably, if you used a quantifier, you’d still have the problem of delimiting the extent of the quantified variables. As discussed in the talk, you would need to signal grouping with an overline, parens, binding dots, or similar.

What is it about inference rules, regular expressions, and BNF grammars that allow us to understand them perfectly well, even with the syntactical variance?

Regular expressions and BNF grammars are easy to understand because they’re just powerful enough for their intended purpose - specifying the syntax of programming languages, which is an easy task provided you don’t purposefully make it hard.

On the other hand, I’m not sure we understand inference rules that well.

For instance, would the example that mixes overlines and substitution not be clearer if two for-all quantifiers (∀) were used?

Not really. The trouble with a substitution like t[xs -> ts] is that it doesn’t really mean what it literally says. You don’t actually have two separate lists xs of variables and ts of terms to substitute them with. You have a single list ss of pairs of the form (x,t), where x is a variable and t is a term to substitute it with. (Alternatively, you may insist on using two separate lists, but then you would need to maintain an invariant that the lists have the same length. IMO, this makes life unnecessarily hard.)
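A concrete way to see the single-list-of-pairs view, sketched in Python over a toy term language (the representation is invented for illustration):

```python
# Toy terms: a variable is a string; an application is a tuple of terms.
# Simultaneous substitution takes one mapping from variables to terms,
# instead of two parallel lists that must be kept the same length.
def subst(term, mapping):
    if isinstance(term, str):                      # a variable
        return mapping.get(term, term)
    return tuple(subst(t, mapping) for t in term)  # an application

t = ("f", "x", ("g", "y"))
print(subst(t, {"x": "a", "y": "b"}))  # t[xs -> ts] as a single dict
```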