Given that, isn't this whole idea of source code generation a misunderstanding? That is, if there is a code generator for something, then why not make that something a proper function which can receive the required parameters and do the right action that the "would generated" code would have done?

If it is being done for performance reasons, then that sounds like a shortcoming of the compiler.

If it is being done to bridge two languages, then that sounds like a lack of interface library.

Am I missing something here?

I know that code is data as well. What I don't understand is, why generate source code? Why not make it into a function which can accept parameters and act on them?


@Utku, the better reasons to do code generation often relate to wanting to provide a higher-level description than your current language can express. Whether the compiler can or can't create efficient code doesn't really have anything to do with it. Consider parser generators -- a lexer generated by flex or a parser generated by bison will almost certainly be more predictable, more correct, and often faster to execute than equivalents hand-written in C; and built from far less code (thus also being less work to maintain).
– Charles Duffy, Dec 2 '17 at 1:04


Maybe you come from a language which doesn't have many functional elements, but in many languages functions are first class -- you can pass them around, so in those types of languages code is data, and you can treat it just like that.
– Restioson, Dec 3 '17 at 9:04


@Restioson in a functional language code isn't data. First-class functions mean exactly that: functions are data. And not necessarily particularly good data: you can't necessarily mutate them just a bit (like mutating all additions within the functions into subtractions, say). Code is data in homoiconic languages. (Most homoiconic languages have first-class functions, but the reverse is not true.)
– Lyndon White, Dec 4 '17 at 7:34

27 Answers

Technically, if we generate code, it is not source, even if it is text that is readable by humans. Source code is original code, generated by a human or other true intelligence, not mechanically translated and not immediately reproducible from (true) source (directly or indirectly).

If something can be generated, then that thing is data, not code.

I would say everything is data anyway. Even source code. Especially source code! Source code is just data in a language designed to accomplish programming tasks. This data is to be translated, interpreted, compiled, generated as needed into other forms — of data — some of which happen to be executable.

The processor executes instructions out of memory. The same memory that is used for data. Before the processor executes instructions, the program is loaded into memory as data.

So, everything is data, even code.

Given that [generated code is data], isn't this whole idea of code generation a misunderstanding?

It is perfectly fine to have multiple steps in compilation, one of which can be intermediate code generation as text.

That is, if there is a code generator for something, then why not make that something a proper function which can receive the required parameters and do the right action that the "would generated" code would have done?

That's one way, but there are others.

The output of code generation is text, which is something designed to be used by a human.

Not all text forms are intended for human consumption. In particular, generated code (as text) is typically intended for compiler consumption not human consumption.

Source code is considered the original: the master — what we edit & develop; what we archive using source code control. Generated code, even when human-readable text, is typically regenerated from the original source code. Generated code, generally speaking, doesn't have to be under source control since it is regenerated during build.

Practical reasoning

OK, I know that code is data as well. What I don't understand is, why generate source code?

From this edit, I assume you are asking on a rather practical level, not theoretical Computer Science.

The classical reason for generating source code in static languages like Java was that such languages simply did not come with easy-to-use in-language tools to do very dynamic stuff. For example, back in the formative days of Java, it simply was not possible to easily create a class with a dynamic name (matching a table name from a DB) and dynamic methods (matching attributes from that table) with dynamic data types (matching the types of said attributes). Especially since Java puts a whole deal of importance, nay, guarantees, on being able to catch type errors at compile time.

So, in such a setting, a programmer can only create Java code and write a lot of lines of code manually. Often, the programmer will find that whenever a table changes, he has to go back and change the code to match; and if he forgets that, bad things happen. Hence, the programmer will get to the point where he writes some tools that do it for him. And thus begins the road to ever more intelligent code generation.

(Yes, you could generate the bytecode on the fly, but programming such a thing in Java would not be something a random programmer would do just in between writing a few lines of domain code.)

Compare this to languages that are very dynamic, for example Ruby, which I would consider the antithesis to Java in most respects (note that I am saying this without valuing either approach; they are simply different). Here it is 100% normal and standard to dynamically generate classes, methods etc. at runtime, and most importantly, the programmer can do it trivially right in the code, without going on a "meta" level. Yes, things like Ruby on Rails come with code generation, but we found in our work that we basically use that as a kind of advanced "tutorial mode" for new programmers, but after a while it gets superfluous (as there is so little code to write in that ecosystem that when you know what you are doing, writing it manually gets faster than cleaning up the generated code).
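To make the contrast concrete, here is a minimal sketch (in Python, purely for illustration; the table name and columns are invented) of the two approaches the answer contrasts: a dynamic language can build a class at runtime directly, with no intermediate source text at all.

```python
# Sketch of the "dynamic language" approach: build a class whose name and
# attributes mirror a (hypothetical) database table, entirely at runtime.
def make_record_class(table_name, columns):
    """Create a class at runtime; no source text is ever generated."""
    def __init__(self, **values):
        for col in columns:
            setattr(self, col, values.get(col))
    # type(name, bases, dict) is the runtime equivalent of a class statement
    return type(table_name.title(), (object,), {"__init__": __init__, "columns": columns})

# Hypothetical schema, as if read from the database catalog:
User = make_record_class("user", ["id", "name", "email"])
u = User(id=1, name="Ada", email="ada@example.com")
```

In a language without this facility, the equivalent effect requires a separate tool that writes the class out as source text and a build step that compiles it, which is exactly where source code generation comes from.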

These are just two practical examples from the "real world". Then you have languages like LISP, where the code is data, literally. On the other hand, in compiled languages (without a runtime engine like Java or Ruby have), there is (or was; I have not kept up with modern C++ features...) simply no concept of defining class or method names at runtime, so code generation in the build process is the tool of choice for most things (other, more C/C++-specific examples would be things like flex, yacc etc.).

I think this is better than the more up-voted answers. In particular, the example mentioned with Java and database programming does a much better job of actually addressing why code generation is used and is a valid tool.
– Panzercrisis, Nov 29 '17 at 13:57

These days, is it possible in Java to create dynamic tables from a DB? Or only by using an ORM?
– Noumenon, Dec 1 '17 at 5:18

"(or was, I have not kept up with modern C++ features...)" surely this has been possible in C++ for over two decades thanks to function pointers? I haven't tested it, but I'm sure it should be possible to allocate a char array, fill it with machine code and then cast a pointer to the first element to a function pointer and then run it? (Assuming the target platform doesn't have some security measure to stop you doing that, which it might well do.)
– Pharap, Dec 2 '17 at 19:43


"allocate a char array, fill it with machine code and then cast a pointer to the first element to a function pointer and then run it?" Apart from being undefined behaviour, it's the C++ equivalent of "generate the bytecode on the fly". It falls into the same category of "not considered by ordinary programmers"
– Caleth, Aug 2 '18 at 11:22


@Pharap, "surely this has been possible in C++ for over two decades" ... I had to chuckle a little bit; it is about two-ish decades since I last coded C++. :) But my sentence about C++ was badly formulated anyway. I have changed it a bit; it should be clearer what I meant now.
– AnoE, Aug 2 '18 at 11:28

If it is being done for performance reasons, then that sounds like a shortcoming of the compiler.

True. I don't care about performance unless I'm forced to.

If it is being done to bridge two languages, then that sounds like a lack of interface library.

Hmm, no idea what you're talking about.

Look, it's like this: generated and retained source code is always and forever a pain in the butt. It exists for one reason only: someone wants to work in one language while someone else insists on working in another, and neither one can be bothered to figure out how to interoperate between them, so one of them figures out how to turn their favorite language into the imposed language so they can do what they want.

Which is fine until I have to maintain it. At which point you can all go die.

Is it an anti-pattern? Sigh, no. Many languages wouldn't even exist if we weren't willing to say goodbye to the shortcomings of previous languages, and generating the code of the older languages is how many new languages start.

It's the code base left in a half-converted Frankenstein-monster patchwork that I can't stand. Generated code is untouchable code. I hate looking at untouchable code. Yet people keep checking it in. WHY? You might as well be checking in the executable.

Well, now I'm ranting. My point is we're all "generating code". It's when you treat generated code like source code that you're making me crazy. Just because it looks like source code doesn't make it source code.

ARG!!! It doesn't matter what it looks like!!! Text, binary, DNA: if it's not the SOURCE, it's not what you should touch when making changes. It's no one's business if my compilation process has 42 intermediate languages that it goes through. Stop touching them. Stop checking them in. Make your changes at the source.
– candied_orange, Nov 29 '17 at 5:07

@utku: "If something is not meant to be consumed by a human, it shouldn't be text": I completely disagree. Some counter-examples off the top of my head: the HTTP protocol, MIME encodings, PEM files -- pretty much anything that uses base64 anywhere. There are lots of reasons to encode data into a 7-bit safe stream even if no human should ever see it. Not to mention the much larger space of things that normally a human should never interact with, but that they may want to occasionally: log files, /etc/ files on Unix, etc.
– Daniel Pryden, Nov 29 '17 at 14:03


I don't think "programming with punch cards" means what you think it means. I've been there, I've done that, and yeah, it was a pain; but it has no connection to "generated code." A deck of punched cards is just another kind of file--like a file on disk, or a file on tape, or a file on an SD card. Back in the day, we would write data to decks of cards, and read data from them. So, if the reason we generate code is because programming with punch cards is a pain, then that implies that programming with any kind of data storage is a pain.
– Solomon Slow, Nov 29 '17 at 17:46

The most frequent use case for code generators I had to work with in my career involved generators which

took some high level meta-description for some kind of data model or database schema as input (maybe a relational schema, or some kind of XML schema)

and produced boiler-plate CRUD code for data access classes as output, and maybe additional things like corresponding SQLs or documentation.

The benefit here is that from one line of a short input specification you get 5 to 10 lines of debuggable, type-safe, bug-free (assuming the code generator's output is mature) code you would otherwise have to implement and maintain manually. You can imagine how much this reduces maintenance and evolution effort.
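A toy version of such a generator can be sketched in a few lines (Python here purely for illustration; the one-line schema format and the emitted accessor class are invented, not any real tool's output):

```python
# Toy boiler-plate generator: one schema entry in, many lines of accessor
# code out -- the expansion ratio the answer describes.
TEMPLATE = """class {cls}:
    def __init__(self):
{inits}
{accessors}"""

def generate(cls_name, fields):
    inits = "\n".join(f"        self._{f} = None" for f in fields)
    accessors = "\n".join(
        f"    def get_{f}(self):\n        return self._{f}\n"
        f"    def set_{f}(self, value):\n        self._{f} = value"
        for f in fields
    )
    return TEMPLATE.format(cls=cls_name, inits=inits, accessors=accessors)

# One line of specification...
source = generate("Customer", ["id", "name"])
# ...expands into a full class definition that would otherwise be hand-written.
print(source)
```

A real generator would of course also emit type information, SQL, and documentation, but the economics are the same: the short specification is what you maintain, and the long output is regenerated at build time.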

Let me also respond to your initial question

Is source code generation an anti-pattern?

No, not source code generation per se, but there are indeed some pitfalls. As stated in The Pragmatic Programmer, one should avoid the usage of a code generator when it produces code which is hard to understand. Otherwise, the increased efforts to use or debug this code may easily outweigh the effort saved by not writing the code manually.

I would also like to add that it is most times a good idea to physically separate generated parts of code from manually written code, in such a way that re-generation does not overwrite any manual changes. However, I have also dealt more than once with the situation where the task was to migrate some code written in an old language X to another, more modern language Y, with the intention to do the maintenance afterwards in language Y. This is a valid use case for one-time code generation.

I agree with this answer. Using something like Torque for Java, I can automatically generate Java source files with fields matching the SQL database. This makes CRUD operations much easier. The major benefit is type safety, including only being able to reference fields that exist in the database (thank you, autocomplete).
– MTilsted, Nov 29 '17 at 12:34

Yes, for statically typed languages this is the important part: you can make sure your hand-written code actually fits the generated one.
– Paŭlo Ebermann, Nov 29 '17 at 23:15

"migrate some code written in old language" - even then, the one-time code generation may be a big pain. For example, after some manual changes you detect a bug in the generator and need to redo the generation after the fix. Luckily, Git or the like can usually ease the pain.
– maaartinus, Dec 1 '17 at 3:46

I've encountered two use cases for generated (at build time, and never checked in) code:

Automatically generate boilerplate code such as getters/setters, toString, equals, and hashCode from a language built to specify such things (e.g. Project Lombok for Java)

Automatically generate DTO type classes from some interface spec (REST, SOAP, whatever) to then be used in the main code. This is similar to your language bridge issue, but ends up being cleaner and simpler, with better type handling than trying to implement the same thing without generated classes.

Highly repetitive code in inexpressive languages. For instance, I had to write code that essentially did the same thing on many similar but not identical data structures. It probably could have been done with something like a C++ template (hey, isn't that code generation?). But I was using C. Code generation saved me writing lots of near-identical code.
– Nick Keighley, Nov 29 '17 at 9:29


@NickKeighley Perhaps your toolchain did not permit you to use another, more suitable language?
– Wilson, Nov 29 '17 at 10:23


You don't usually get to pick and choose your implementation language. The project was in C, that wasn't an option.
– Nick Keighley, Nov 29 '17 at 10:49


@Wilson the more expressive languages often use code generation (e.g. Lisp macros, Ruby on Rails); they just don't require it to be saved as text in the meantime.
– Pete Kirkham, Nov 29 '17 at 10:53


Yeah, code-generation is essentially meta-programming. Languages like Ruby allow you to do meta-programming in the language itself, but C does not so you have to use code-generation instead.
– Sean Burton, Nov 29 '17 at 11:45

Sussman had many interesting things to say about such things in his classic "Structure and Interpretation of Computer Programs", mainly about the code-data duality.

For me, the major use of ad-hoc code generation is making use of an available compiler to convert some little domain-specific language into something I can link into my programs. Think BNF, think ASN.1 (actually, don't, it is ugly), think data dictionary spreadsheets.

Trivial domain-specific languages can be a huge time saver, and outputting something that can be compiled by standard language tools is the way to go when creating such things. Which would you rather edit: a non-trivial hand-hacked parser in whatever native language you are writing, or the BNF for an auto-generated one?

By outputting text that is then fed to some system compiler, I get all of that compiler's optimisation and system-specific configuration without having to think about it.

I am effectively using the compiler input language as just another intermediate representation, what is the problem? Text files are not inherently source code, they can be an IR for a compiler, and if they happen to look like C or C++ or Java or whatever, who cares?

Now, if you are hard of thinking, you might edit the OUTPUT of the toy language parser, which will clearly disappoint the next time someone edits the input language files and rebuilds. The answer is to not commit the auto-generated IR to the repo and have it generated by your toolchain (and to avoid having such people in your dev group; they are usually happier working in marketing).

This is not so much a failure of expressiveness in our languages as an expression of the fact that sometimes you can get (or massage) parts of the specification into a form that can be automatically converted into code, and that will usually beget far fewer bugs and be far easier to maintain.
If I can give our test and configuration guys a spreadsheet they can tweak, plus a tool they then run that takes that data and spits out a complete hex file for the flash on my ECU, then that is a huge time saving over having someone manually translate the latest setup into a set of constants in the language of the day (complete with typos).
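The spreadsheet-to-constants workflow can be sketched in a few lines (Python for illustration; the parameter names and values are invented, not from any real ECU):

```python
# Toy data-to-source tool: tabular configuration in, a C source file of
# constants out, which the ordinary C toolchain then compiles. A real tool
# would read the rows from a spreadsheet export instead of a literal.
rows = [("MAX_RPM", 6500), ("IDLE_RPM", 800), ("NUM_CYLINDERS", 4)]

def emit_c_constants(rows):
    lines = ["/* Generated file -- do not edit; regenerate from the data table. */"]
    for name, value in rows:
        lines.append(f"const int {name} = {value};")
    return "\n".join(lines) + "\n"

print(emit_c_constants(rows))
```

Because the output is ordinary source text, the compiler catches malformed entries for free, which is part of why text is such a convenient intermediate representation here.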

Same thing with building models in Simulink and then generating C with RTW, then compiling to target with whatever tool makes sense. The intermediate C is unreadable; so what? The high-level Matlab RTW stuff only needs to know a subset of C, and the C compiler takes care of the platform details. The only time a human has to grovel through the generated C is when the RTW scripts have a bug, and that sort of thing is far easier to debug with a nominally human-readable IR than with just a binary parse tree.

You can of course write such things to output bytecode or even executable code, but why would you do that? We have tools for converting an IR to those things.

This is good, but I'd add that there is a tradeoff when determining which IR to use: using C as an IR makes some things easier and other things harder, when compared to, say, x86 assembly language. The choice is even more significant when choosing between, say, Java language code and Java bytecode, as there are many more operations that only exist in one or the other language.
– Daniel Pryden, Nov 29 '17 at 14:21


But x86 assembly language makes a poor IR when targeting an ARM or PPC core! All things are a trade-off in engineering; that's why they call it engineering. One would hope that the possibilities of the Java bytecode were a strict superset of the possibilities of the Java language, and this is generally true as you get closer to the metal, irrespective of toolchain and where you inject the IR.
– Dan Mills, Nov 29 '17 at 14:29

Oh, I totally agree: my comment was in response to your final paragraph questioning why you'd ever output bytecode or some lower-level thing -- sometimes you do need the lower level. (In Java specifically, there are a lot of useful things you can do with bytecode that you can't do in the Java language itself.)
– Daniel Pryden, Nov 29 '17 at 16:38


I don't disagree, but there is a cost to using an IR closer to the metal: not only reduced generality, but also the fact that you usually end up responsible for more of the really annoying low-level optimisation. The fact that we generally think these days in terms of optimising algorithm choice rather than implementation is a reflection of just how far compilers have come. Sometimes you have to go really close to the metal in these things, but think twice before throwing away the compiler's ability to optimise by using too low-level an IR.
– Dan Mills, Nov 29 '17 at 16:58


"they are usually happier working in marketing" Catty, but funny.
– dmckee, Dec 1 '17 at 2:40

Pragmatic answer: is the code generation necessary and useful? Does it provide something that is genuinely very useful and needed for the proprietary codebase, or does it seem to just create another way of doing things in a way that contributes more intellectual overhead for sub-optimal results?

OK, I know that code is data as well. What I don't understand is, why generate code? Why not make it into a function which can accept parameters and act on them?

If you have to ask this question and there's no clear answer, then probably the code generation is superfluous and merely contributing exoticism and a great deal of intellectual overhead to your codebase.

... then such questions need not be raised, since they are immediately answered by the impressive results, as with Open Shading Language (OSL):

OSL uses the LLVM compiler framework to translate shader networks into machine code on the fly (just in time, or "JIT"), and in the process heavily optimizes shaders and networks with full knowledge of the shader parameters and other runtime values that could not have been known when the shaders were compiled from source code. As a result, we are seeing our OSL shading networks execute 25% faster than the equivalent shaders hand-crafted in C! (That's how our old shaders worked in our renderer.)

In such a case you don't need to question the existence of the code generator. If you work in this type of VFX domain, then your immediate response is usually more along the lines of "shut up and take my money!" or "wow, we also need to make something like this."

translate shader networks into machine code. This sounds like a compiler rather than a code generator, no?
– Utku, Nov 29 '17 at 4:32


It basically takes a nodal network the user connects and generates intermediary code which is compiled JIT by LLVM. The distinction between compiler and code generator is kind of fuzzy. Were you thinking more along the lines of code generation features in languages, like templates in C++ or the C preprocessor?
– user204677, Nov 29 '17 at 4:34

I was thinking of any generator that would output source code.
– Utku, Nov 29 '17 at 4:37

I see, where the output is still for human consumption I assume. OpenSL also generates intermediary source code but it's low-level code that's close to assembly for LLVM consumption. It's typically not code that's meant to be maintained (instead the programmers maintain the nodes used to generate the code). Most of the time I think those types of code generators are more likely to be abused than useful enough to justify their worth, especially if you have to constantly regenerate the code as part of your build process. Sometimes they still have a genuine place though to address shortcomings...
– user204677, Nov 29 '17 at 4:39

... of the language(s) available when used for a particular domain. Qt has one of those controversial ones with its meta-object compiler (MOC). The MOC reduces the boilerplate you would normally need to provide properties, reflection, signals and slots and so forth in C++, but not to such an extent as to clearly justify its existence. I often think Qt could have been better without the cumbersome burden of the MOC's code generation.
– user204677, Nov 29 '17 at 4:39

No, generating intermediate code is not an anti-pattern. The answer to the other part of your question, "Why do it?", is a very broad (and separate) question, though I will give some reasons anyway.

Historical ramifications of never having intermediate human-readable code

Let's take C and C++ as examples since they are among the most famous languages.

You should take notice that the classical pipeline for compiling C code outputs not machine code but rather human-readable assembly code. Likewise, old C++ compilers used to physically compile C++ code into C code. In that chain of events, you could compile from human-readable code 1 to human-readable code 2 to human-readable code 3 to machine code. "Why?" Why not?

If intermediate, human-readable code was never generated, we might not even have C or C++ at all. That is certainly a possibility; people take the path of least resistance to their goals, and if some other language gained steam first because of C development stagnation, C might have died while it was still young. Of course, you could argue "But then maybe we would be using some other language, and maybe it would be better." Maybe, or maybe it would be worse. Or maybe we would all still be writing in assembly.

Why use intermediate human-readable code?

Sometimes intermediate code is desired so that you can modify it before the next step in building. I will admit this point is the weakest.

Sometimes it's because the original work was not done in any human-readable language at all but in a GUI modeling tool instead.

Sometimes you need to do something very repetitive, and the language should not cater to what you are doing because it is such a niche thing or such a complicated thing that it has no business increasing the complexity or the grammar of the programming language just to accommodate you.

Sometimes you need to do something very repetitive, and there is no possible way to get what you want into the language in a generic way; either it cannot be represented by or conflicts with the language's grammar.

One of the goals of computers is to reduce human effort, and sometimes code that is unlikely to ever be touched again (low likelihood of maintenance) can have meta-code written to generate your longer code in a tenth of the time. If I can do it in 1 day instead of 2 weeks and it's not likely to ever be maintained, then I had better generate it. And on the off chance that someone 5 years from now is annoyed because they actually do need to maintain it, they can spend the 2 weeks writing it out fully if they want to, or lose 1 week maintaining the awkward code (but we are still 1 week ahead at that point), and that's if that maintenance needs to be done at all.

I am sure there are more reasons I am overlooking.

Example

I have worked on projects before where code needs to be generated based on data or information in some other document. For example, one project had all of its network messages and constant data defined in a spreadsheet and a tool that would go through the spreadsheet and generate a lot of C++ and Java code that let us work with those messages.

I am not saying that was the best way to set up that project (I wasn't part of its startup), but that was what we had, and it was hundreds (maybe even thousands, not sure) of structures and objects and constants that were being generated; at that point it's probably too late to try to redo it in something like Rhapsody. But even if it were redone in something like Rhapsody, then we still have code generated from Rhapsody anyway.

Also, having all that data in a spreadsheet was good in one way: it allowed us to represent the data in ways we could not have if it were all just in source code files.

Example 2

When I did some work in compiler construction, I used the tool ANTLR to do my lexing and parsing. I specified a language grammar, then used the tool to spit out a ton of code in either C++ or Java, then used that generated code alongside my own code and included it in the build.

How else should that have been done? Perhaps you could come up with another way; there probably are other ways. But for that work, the other ways would have been no better than the generated lex/parse code I had.

I've used intermediate code as a sort of file format and debugging trace when two systems were incompatible but had a stable API of some kind, in a very esoteric scripting language. It wasn't meant to be read manually, but it could have been, in the same way XML could have been. But this is more common than you'd think; after all, web pages work this way, as somebody pointed out.
– joojaa, Nov 30 '17 at 21:38

We have an amazing tool to turn source code text into binary, called a compiler. Its inputs are well-defined (usually!), and it has been through plenty of work to refine how it does optimisation. If you actually want to use the compiler to carry out some operations, you want to use an existing compiler and not write your own.

Plenty of people do invent new programming languages and write their own compilers. Pretty much without exception, they are all doing this because they enjoy the challenge, not because they need the features which that language provides. Everything which they do could be done in another language; they are simply creating a new language because they like those features. What that won't get them though is a well-tuned, fast, efficient, optimising compiler. It'll get them something which can turn text into binary, sure, but it will not be as good as all existing compilers.

Text is not just something which humans read and write. Computers are perfectly at home with text too. In fact formats like XML (and other related formats) are successful because they use plain text. Binary file formats are often obscure and poorly-documented, and a reader cannot easily find out how they work. XML is relatively self-documenting, making it easier for people to write code which uses XML-formatted files. And all programming languages are set up to read and write text files.

So, suppose you want to add some new facility to make your life easier. Perhaps it's a GUI layout tool. Perhaps it's the signals-and-slots interfaces which Qt provides. Perhaps it's the way that TI's Code Composer Studio lets you configure the device you're working with and pull the right libraries into the build. Perhaps it's taking a data dictionary and auto-generating typedefs and global variable definitions (yes, this is still very much a thing in embedded software). Whatever it is, the most efficient way to leverage your existing compiler is to create a tool which will take your configuration of whatever-it-is and automatically produce code in your language of choice.

It's easy to develop and easy to test, because you know what's going in and you can read the source code that it spits out. You don't need to spend man-years on building a compiler to rival GCC. You don't need to learn a complete new language, or require other people to. All you need to do is automate this one little area, and everything else stays the same. Job done.

Still, the advantage of XML's text-basedness is just that, if necessary, it can be read and written by humans (they don't normally bother once it works, but certainly do during development). In terms of performance and space-efficiency, binary formats are generally much better (which very often does not matter, though, because the bottleneck is somewhere else).
– leftaroundabout, Nov 30 '17 at 9:49

@leftaroundabout If you need that performance and space-efficiency, sure. The reason many applications have gone to XML-based formats these days is that performance and space-efficiency are not the top criteria that they once were, and history has shown how poorly binary file formats are maintained. (Old MS Word documents for a classic example!) The point remains though - text is just as suited for computers to read as humans.
– Graham, Nov 30 '17 at 11:06

Sure, a badly-designed binary format may in effect actually perform worse than a properly thought through text format, and even a decent binary format is often not much more compact than XML packed with some general-purpose compression algorithm. IMO the best of both worlds is to use a human-readable specification through algebraic data types, and automatically generate an efficient binary representation from the AST of these types. See e.g. the flat library.
– leftaroundabout, Nov 30 '17 at 11:23

A bit of a more pragmatic answer, focusing on why and not on what is and isn't source code. Note that generating source code is part of the build process in all of these cases, so the generated files shouldn't find their way into source control.

Interoperability/simplicity

Take Google's Protocol Buffers, a prime example: you write a single high-level protocol description which can then be used to generate the implementation in multiple languages - often different parts of the system are written in different languages.

Implementation/technical reasons

Take TypeScript - browsers can't interpret it, so the build process uses a transpiler (code-to-code translator) to generate JavaScript. In fact, many new or esoteric compiled languages start with transpiling to C before they get a proper compiler.

Ease of use

For embedded projects (think IoT) written in C and using only a single binary (RTOS or no OS), it is quite easy to generate a C array containing the data, to be compiled as if it were normal source code, as opposed to linking the data in directly as resources.
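
As a sketch of how little machinery this needs: the following Python script (names and output format are illustrative, in the spirit of `xxd -i`) turns a blob of bytes into a C array ready to be compiled into the firmware image.

```python
# Illustrative generator: raw bytes -> C source. A build step would write
# the returned string to e.g. assets_gen.c and compile it like any other file.

def bytes_to_c_array(name: str, data: bytes, per_line: int = 8) -> str:
    lines = []
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        lines.append("    " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(lines)
    return (f"const unsigned char {name}[] = {{\n{body}\n}};\n"
            f"const unsigned int {name}_len = {len(data)}u;\n")

print(bytes_to_c_array("firmware_blob", b"\x7fELF"))
```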

Edit

Expanding on protobuf: code generation allows the generated objects to be first-class classes in any language. In a compiled language, a generic parser would by necessity return a key-value structure - which means you need a lot of boilerplate code, you miss out on some compile-time checks (on keys and types of values in particular), get worse performance, and get no code completion. Imagine all those void* in C or that huge std::variant in C++ (if you have C++17); some languages may have no such feature at all.
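
To make the contrast concrete, here is a minimal Python sketch (the `Person` message and its fields are invented): the dict is what a generic parser can give you, the class stands in for what generated code gives you.

```python
# Generic parsing: a string-keyed structure. Key typos and type errors
# surface only at runtime, and IDEs can't offer completion on the keys.
generic = {"name": "Ada", "id": 1}

# Generated code: a real class with declared, typed fields, standing in
# for what a protoc-style generator would emit. Attribute typos are
# caught by linters/IDEs, and code completion works.
class Person:
    __slots__ = ("name", "id")

    def __init__(self, name: str, id: int):
        self.name = name
        self.id = id

p = Person(name="Ada", id=1)
print(generic["name"], p.name)
```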

For the first reason, I think the OP's idea would be to have a generic implementation in each language (which takes the protocol buffers description and then parses/consumes the on-the-wire format). Why would this be worse than generating code?
– Paŭlo EbermannNov 29 '17 at 23:12

@PaŭloEbermann apart from the usual performance argument, such a generic interpretation would make it impossible to use those messages as first-class objects in compiled (and possibly interpreted) languages - in C++ for example such an interpreter would by necessity return a key-value structure. Of course you can then get that kv into your classes, but it can turn into a lot of boilerplate code. And there is code completion too. And compile-time checking - your compiler won't check your literals for typos.
– Jan DorniakNov 29 '17 at 23:21

I agree ... could you add this into the answer?
– Paŭlo EbermannNov 29 '17 at 23:25

It's also a workaround for having to write a full, down-to-native-object-code compiler for a more-expressive language. Generate C, let a compiler with a good optimizer take care of the rest.
– BlrflNov 29 '17 at 18:01

Not always. Sometimes you have one or more databases containing some definitions for e.g. signals on a bus. Then you want to pull this information together, maybe do some consistency checks and then write code that interfaces between the signals coming from the bus and the variables you expect to have in your code. If you can show me a language that has meta-programming that makes it easy to use some client provided Excel sheets, a database and other data-sources and creates the code I need, with some necessary checks on data validity and consistency, then by all means show me.
– CodeMonkeyNov 30 '17 at 6:25

@CodeMonkey: something like Ruby on Rails' ActiveRecord implementation comes to mind. There's no need to duplicate the database table schema in the code. Just map a class to a table and write business logic using the column names as properties. I can't imagine any sort of pattern that could be produced by a code generator that couldn't also be managed by Ruby meta-programming. C++ templates are also extremely powerful, albeit a bit arcane. Lisp macros are another powerful in-language meta-programming system.
– kevin clineDec 1 '17 at 8:11

@kevincline what I meant was code that was based on some data from the database (could be constructed from it), but not the database itself. I.e. I have information about which signals I receive in Excel Table A. I have a Database B with information on these signals, etc. Now I want to have a class that accesses these signals. There's no connection to the database or the Excel sheet on the machine that runs the code. Using really complicated C++ Templating to generate this code at compile time, instead of a simple code generator. I'll pick codegen.
– CodeMonkeyDec 4 '17 at 11:43

Source code generation is not always an anti-pattern. For example, I am currently writing a framework which by given specification generates code in two different languages (Javascript and Java). The framework uses the generated Javascript to record browser actions of the user, and uses the Java code in Selenium to actually execute the action when the framework is in replay mode. If I did not use code generation, I would have to manually make sure that both are always in sync, which is cumbersome and also is a logical duplication in some way.

If however one is using source code generation for replacing features like generics, then it is an anti-pattern.

You could, of course, write your code once in ECMAScript and run it in Nashorn or Rhino on the JVM. Or, you could write a JVM in ECMAScript (or try to compile Avian to WebAssembly using Emscripten) and run your Java code in the browser. I'm not saying those are great ideas (well, they are probably terrible ideas :-D ), but at least they are possible if not feasible.
– Jörg W MittagNov 29 '17 at 9:15

In theory, it is possible, but it is not a general solution. What happens if I cannot run one of the languages inside another? One additional thing: I just created a simple NetLogo model using the code generation and have interactive documentation of the system, which is always in sync with the recorder and the replayer. And in general, creating a requirement and then generating code keeps things which run together semantically in sync.
– Hristo VrigazovNov 29 '17 at 9:36

Maybe a good example where the intermediary code turned out to be the reason of success? I can offer you HTML.

I believe it was important for HTML to be simple and static - it made it easy to write browsers, allowed mobile browsers to appear early, etc. As further experiments (Java applets, Flash) showed, more complex and powerful languages lead to more problems. It turned out that users actually were endangered by Java applets, and visiting such websites was about as safe as trying game cracks downloaded via DC++. Plain HTML, on the other hand, is harmless enough to allow us to check out any site with reasonable belief in the security of our device.

However, HTML would be nowhere near where it is now if it wasn't computer generated. My answer wouldn't even show up on this page until someone manually rewrote it from the database into HTML file. Luckily you can make usable HTML in almost any programming language :)

That is, if there is a code generator for something, then why not make that something a proper function which can receive the required parameters and do the right action that the "would generated" code would have done?

Can you imagine a better way to display the question and all of the answers and comments to the user than by using HTML as generated in-between code?

Yes, I can imagine a better way. HTML is a legacy of a decision by Tim Berners-Lee to allow the quick creation of a text-only web browser. That was perfectly fine at the time, but we wouldn't do the same with the benefit of hindsight. CSS has made all the various presentation element types (DIV,SPAN,TABLE,UL,etc.) unnecessary.
– kevin clineDec 1 '17 at 8:25

@kevincline I am not saying that HTML as such is without flaws, I was pointing out that introducing markup language (that can be generated by a program) worked out very well in this case.
– DžurisDec 1 '17 at 23:31

So HTML+CSS is better than just HTML. I've even written internal documentation for some projects I've worked on directly in HTML+CSS+MathJax. But most web pages I visit seem to have been produced by code generators.
– David KDec 2 '17 at 16:04

Because it's faster and easier (and less error-prone) than writing the code manually, especially for tedious, repetitive tasks. You can also use the high-level tool to verify and validate your design before writing a single line of code.

Common use cases:

Modeling tools like Rose or Visual Paradigm;

Higher-level languages like Embedded SQL or an interface definition language that must be preprocessed into something compilable;

Lexer and parser generators like flex/bison;

As for your "why not just make it a function and pass parameters to it directly", note that none of the above are execution environments in and of themselves. There's no way to link your code against them.

Sometimes, your programming language just doesn't have the facilities you want, making it actually impossible to write functions or macros to do what you want. Or maybe you could do what you want, but the code to write it would be ugly. A simple Python script (or similar) can then generate the required code as part of your build process, which you then #include into the actual source file.

How do I know this? Because it's a solution I've reached for multiple times when working with various different systems, most recently SourcePawn. A simple Python script that parses a simple line of source code and produces two or three lines of generated code is far better than manually crafting the generated code, when you end up with two dozen such lines (creating all my cvars).
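
For illustration, a Python expander in that spirit (the one-line spec format and the SourcePawn-ish output it emits are invented for this sketch, not my actual script):

```python
# One short declaration line in, the repetitive boilerplate out. The
# generated text would be written to a file and #included by the build.

def expand(spec_line: str) -> str:
    # spec format (hypothetical): "name default description"
    name, default, desc = spec_line.split(maxsplit=2)
    return (f'ConVar g_cv_{name};\n'
            f'g_cv_{name} = CreateConVar("{name}", "{default}", "{desc}");\n')

print(expand("max_players 16 Maximum number of players"))
```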

Text form is required for easy consumption by humans. Computers also process code in text form quite easily. Therefore generated code should be generated in the form that is easiest to generate and easiest to consume by computers, and that is very often readable text.

And when you generate code, the code generation process itself often needs to be debugged - by humans. It's very, very useful if the generated code is human readable so humans can detect problems in the code generation process. Someone has to write the code to generate code, after all. It doesn't happen out of thin air.

Generating Code, just once

Not all source code generation is a case of generating some code, never touching it by hand, and regenerating it from the original source whenever it needs updating.

Sometimes you generate code just once, discard the original source, and from then on maintain the new source.

This sometimes happens when porting code from one language to another.
Particularly if one doesn't expect to later port over new changes from the original (e.g. the old-language code is not going to be maintained, or it is actually complete, as in the case of some math functionality).

One common case is that a code generator written to do this might only translate 90% of the code correctly, and the last 10% then needs to be fixed up by hand. That is still a lot faster than translating 100% by hand.

Such code generators are often very different from the kind that full language translators (like Cython or f2c) produce, since the goal is to maintain the code only once going forward. They are often made as a one-off, to do exactly what they have to. In many ways this is the next-level version of using a regex/find-replace to port code. "Tool-assisted porting", you could say.

Generating Code, just once, from e.g. a website scrape.

Closely related is generating the code from some source you don't want to access again - e.g. if the actions needed to generate the code are not repeatable or consistent, or performing them is expensive.
I am working on a pair of projects right now:
DataDeps.jl and
DataDepsGenerators.jl.

DataDeps.jl helps users download data (like standard ML datasets).
To do this it needs what we call a RegistrationBlock.
That is some code specifying some metadata,
like where to download the files from, a checksum, and a message explaining to the user any terms/conditions and what the licensing status of the data is.

Writing those blocks can be annoying.
And that information is often available in (structured or unstructured) forms on the websites where the data is hosted.
So DataDepsGenerators.jl uses a web scraper to generate the RegistrationBlock code for some sites that host a lot of data.

It might not generate them correctly, so the dev using the generated code can and should check and correct it. Odds are they want to make sure it hasn't mis-scraped the licensing information, for example.

Importantly, users/devs working with DataDeps.jl do not need to install or use the webscraper to use the RegistrationBlock code that was generated.
(And not needing to download and install a web scraper saves a fair bit of time, particularly for CI runs.)
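
A toy Python version of the idea (the metadata dict stands in for the scraper's output, and the emitted block only mimics the shape of a DataDeps registration; both are illustrative):

```python
meta = {  # what a scraper might recover from a dataset's landing page
    "name": "iris",
    "url": "https://example.org/iris.csv",
    "checksum": "abc123",
    "message": "Released under CC-BY; please cite the original source.",
}

# Generate the registration block once; a human reviews it, fixes any
# mis-scraped fields (especially licensing), and commits it as source.
block = (f'register(DataDep("{meta["name"]}",\n'
         f'    "{meta["message"]}",\n'
         f'    "{meta["url"]}",\n'
         f'    "{meta["checksum"]}"))\n')
print(block)
```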

Generating source code once is not an anti-pattern, and it normally cannot be replaced with metaprogramming.

"report" is an English word that means something other than "port again". Try "re-port" to make that sentence clearer. (Commenting because too small for a suggested edit.)
– Peter CordesNov 30 '17 at 4:38

Good catch @PeterCordes I have rephrased.
– Lyndon WhiteNov 30 '17 at 4:46

Faster but potentially much less maintainable, depending on how horrible the generated code is. Fortran to C was a thing back in the day (C compilers were more widely available, so people would use f2c + cc), but the resulting code was not really a good starting point for a C version of the program, AFAIK.
– Peter CordesNov 30 '17 at 4:49

Potentially, potentially not. It is not the fault in the concept of code generators that some code generators make non-maintainable code. In particular, a hand crafted tool, that doesn't have to catch every case can often make perfectly nice code. If 90% of the code is just list of array constants for example then generating those arrays constructors as an one off can trivially be done very nicely, and low effort. (On the other hand the C code output by Cython can't be maintained by humans. Because it is not intended to be. Just like you say for f2c back in the day)
– Lyndon WhiteNov 30 '17 at 4:56

The big table was just the simplest most reduced argument. Similar can be said for say converting for-loops or conditions. Indeed sed goes a long way, but sometimes one needs a bit more expressive power. The line between program logic and data is often a fine one. Sometimes the distinction isn't useful. JSON is (/was) just javascript object constructor code. In my example I am also generating object constructor code (is it data? maybe (maybe not since sometimes it has function calls). Is it better treated as code? yes.)
– Lyndon WhiteNov 30 '17 at 5:42

Generation of "source" code is an indication of a shortcoming in the language being generated. Is using tools to overcome this an anti-pattern? Absolutely not - let me explain.

Typically code generation is used because there exists a higher-level definition that can describe the resulting code much less verbosely than the lower-level language. So code generation facilitates efficiency and terseness.

When I write C++, I do so because it allows me to write code more efficiently than using assembler or machine code. Still, machine code is generated by the compiler. In the beginning, C++ was simply a preprocessor that generated C code. General-purpose languages are great for generating general-purpose behavior.

In the same way, by using a DSL (domain-specific language) it is possible to write terse code, though perhaps constrained to a specific task. This makes it less complicated to generate the correct behavior. Remember that code is a means to an end. What a developer is looking for is an efficient way to generate behavior.

Ideally the generator can create fast code from an input that is simpler to manipulate and understand. If this is fulfilled, not using a generator is the anti-pattern. This anti-pattern typically comes from the notion that "pure" code is "cleaner", much in the same way a woodworker or other artisan might look at the use of power tools, or the use of CNC, to "generate" workpieces (think golden hammer).

On the other hand, if the source of the generated code is harder to maintain, or the generator produces code that is not efficient enough, the user is falling into the trap of using the wrong tools (sometimes because of the same golden hammer).

Source code generation absolutely does mean that the generated code is data. But it is first-class data, data that the rest of the program can manipulate.

The two most common types of data that I am aware of that are integrated into source code are graphical information about windows (number and placement of various controls) and ORMs. In both cases, integration via code generation makes manipulating the data easier, because you don't have to go through extra "special" steps to use it.

When working with the original (1984) Macs, dialog and window definitions were created using a resource editor that kept the data in a binary format. Using these resources in your application was harder than it would have been if the "binary format" had been Pascal.

So, no, source code generation is not an anti-pattern; it allows making the data part of the application, which makes it easier to use.

Code generation is an anti-pattern when it costs more than it accomplishes. This situation occurs when generation takes place from A to B where A is almost the same language as B, but with some minor extensions that could be done just by coding in A with less effort than all the custom tooling and build staging for A to B.

The trade off is more prohibitive against code generation in languages that don't have meta-programming facilities (structural macros) because of the complications and inadequacies of achieving metaprogramming through the staging of external text processing.

The poor trade off could also have to do with the quantity of use. Language A could be substantially different from B, but the whole project with its custom code generator only uses A in one or two small places, so that the total amount of complexity (small bits of A, plus the A -> B code generator, plus the surrounding build staging) exceeds the complexity of a solution just done in B.

Basically, if we commit to code generation, we should probably "go big or go home": make it have substantial semantics, and use it a lot, or don't bother.

Why did you remove the "When Bjarne Stroustrup first implemented C++ ..." paragraph? I think it was interesting.
– UtkuNov 30 '17 at 18:22

@Utku Other answers cover this from the point of view of compiling an entire, sophisticated language, in which the rest of a project is entirely written. I don't think it's representative of the majority of what is called "code generation".
– KazDec 1 '17 at 16:36

I didn't see this stated clearly (I did see it touched upon by one or two answers, but it didn't seem very clear)

Generating code (as you said, as though it was data) is not a problem--it's a way to reuse a compiler for a secondary purpose.

Editing generated code is one of the most insidious, evil, horrific anti-patterns you will ever come across. Do not do this.

At best, editing generated code pulls a bunch of poor code into your project (the ENTIRE set of code is now truly SOURCE CODE--no longer data). At worst the code pulled into your program is highly redundant, poorly named garbage that is nearly completely unmaintainable.

I suppose a third category is code you use once (gui generator?) then edit to help you get started/learn. This is a little of each--it CAN be a good way to start but your GUI generator will be targeted at using "Generatable" code that won't be a great start for you as a programmer--In addition, you might be tempted to use it again for a second GUI which means pulling redundant SOURCE code into your system.

If your tooling is smart enough to disallow any edits whatsoever of generated code, go for it. If not, I'd call it one of the worst anti-patterns out there.

Data is the information exactly in the form you need (and value). Code is also information, but in an indirect or intermediate form. In essence, code is also a form of data.

More specifically, code is information that lets machines offload humans from having to process that information all by themselves.

Offloading humans from information processing is the most important motive. Intermediate steps are acceptable as long as they make life easy. That's why intermediate information mapping tools exist. Like code generators, compilers, transpilers, etc.

why generate source code? Why not make it into a function which can
accept parameters and act on them?

Let's say someone offers you such a mapping function, whose implementation is obscure to you. As long as the function works as promised, would you care if internally it's generating source code or not?

Inasmuch as you stipulate later on that code is data, your proposition reduces to "If something can be generated, then that thing is not code." Would you say, then, that assembly code generated by a C compiler is not code? What if it happens to coincide exactly with assembly code that I write by hand? You're welcome to go there if you wish, but I won't be coming with you.

Let's start instead with a definition of "code". Without getting too technical, a pretty good definition for the purposes of this discussion would be "machine-actionable instructions for performing a computation."

Given that, isn't this whole idea of source code generation a misunderstanding?

Well yes, your starting proposition is that code cannot be generated, but I reject that proposition. If you accept my definition of "code" then there should be no conceptual problem with code generation in general.

That is, if there is a code generator for something, then why not make that something a proper function which can receive the required parameters and do the right action that the "would generated" code would have done?

Well that's an entirely different question, about the reason for employing code generation, rather than about its nature. You are proposing the alternative that instead of writing or using a code generator, one writes a function that computes the result directly. But in what language? Gone are the days when anyone wrote directly in machine code, and if you write your code in any other language then you depend on a code generator in the form of a compiler and / or assembler to produce a program that actually runs.

Why, then, do you prefer to write in Java or C or Lisp or whatever? Even assembler? I assert that it's at least in part because those languages provide abstractions for data and operations that make it easier to express the details of the computation you want to perform.

The same is true of most higher-level code generators, too. The prototypical cases are probably scanner and parser generators such as lex and yacc. Yes, you could write a scanner and a parser directly in C or in some other programming language of your choice (even raw machine code), and sometimes one does. But for a problem of any significant complexity, using a higher-level, special-purpose language such as lex's or yacc's makes the hand-written code easier to write, read, and maintain. Usually much smaller, too.
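
A toy Python illustration of the principle (not lex itself): the token rules are a small declarative table, and a few lines of "generator" compile them into a working scanner, replacing the hand-written character-by-character loop.

```python
import re

TOKEN_SPEC = [            # (token name, pattern) - order matters, as in lex
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]

def make_scanner(spec):
    # Combine all rules into one master regex with named groups.
    master = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in spec))
    def scan(text):
        return [(m.lastgroup, m.group()) for m in master.finditer(text)
                if m.lastgroup != "SKIP"]
    return scan

scan = make_scanner(TOKEN_SPEC)
print(scan("x = 42"))  # [('IDENT', 'x'), ('OP', '='), ('NUMBER', '42')]
```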

You should also consider what exactly you mean by "code generator". I would consider C preprocessing and the instantiation of C++ templates to be exercises in code generation; do you object to these? If not, then I think you'll need to perform some mental gymnastics to rationalize accepting those but rejecting other flavors of code generation.

If it is being done for performance reasons, then that sounds like a shortcoming of the compiler.

Why? You are basically positing that one should have a universal program to which the user feeds data, some classified as "instructions" and others as "input", and which proceeds to perform the computation and emit more data that we call "output". (From a certain point of view, one might call such a universal program an "operating system".) But why do you suppose that a compiler should be as effective at optimizing such a general-purpose program as it is at optimizing a more specialized program? The two programs have different characteristics and different capabilities.

If it is being done to bridge two languages, then that sounds like a lack of interface library.

You say that as if having a universal-to-some-degree interface library would necessarily be a good thing. Perhaps it would, but in many cases such a library would be big and difficult to write and maintain, and maybe even slow. And if such a beast in fact does not exist to serve the particular problem at hand, then who are you to insist that one be created, when a code generation approach can solve the problem much more quickly and easily?

Am I missing something here?

Several things, I think.

I know that code is data as well. What I don't understand is, why generate source code? Why not make it into a function which can accept parameters and act on them?

Code generators transform code written in one language to code in a different, usually lower-level language. You're asking, then, why people would want to write programs using multiple languages, and especially why they might want to mix languages of subjectively different levels.

But I touched on that already. One chooses a language for a particular task based in part on its clarity and expressiveness for that task. Inasmuch as smaller code has fewer bugs on average and is easier to maintain, there is also a bias toward higher-level languages, at least for large-scale work. But a complex program involves many tasks, and often some of them can be more effectively addressed in one language, whereas others are more effectively or more concisely addressed in another. Using the right tool for the job sometimes means employing code generation.

The compiler's duty is to take code written in human-readable form and convert it to machine-readable form. Hence, if the compiler cannot create code that is efficient, then the compiler is not doing its job properly. Is that wrong?

A compiler will never be optimized for your task. The reason for that is simple: it's optimized to do many tasks. It's a general purpose tool used by many people for many different tasks. Once you know what your task is, you can approach the code in a domain-specific manner, making tradeoffs that the compilers could not.

As an example, I've worked on software where an analyst may need to write some code. They could write their algorithm in C++, and add in all the bounds checks and memoization tricks that they depend on, but that requires knowing a lot about the inner workings of the code. They would rather write something simple, and let me throw an algorithm at it to generate the final C++ code. Then I can do exotic tricks to maximize performance like static analysis that I would never expect my analysts to endure. Code generation allows them to write in a domain-specific manner which lets them get the product out the door easier than any general purpose tool could.

I have also done the exact opposite. I have another piece of work that I've done which had a mandate "no code generation." We still wanted to make life easy on those using the software, so we used massive amounts of template metaprogramming to make the compiler generate the code on the fly. Thus, I only needed the general purpose C++ language to do my job.

However, there's a catch. It was tremendously difficult to guarantee that the errors were readable. If you've ever used template metaprogrammed code before, you know that a single innocent mistake can generate an error that takes 100 lines of incomprehensible class names and template arguments to understand what went wrong. This effect was so pronounced that the recommended debugging process for syntax errors was "Scroll through the error log until you see the first time one of your own files has an error. Go to that line, and just squint at it until you realize what you did wrong."

Had we used code generation, we could have had much more powerful error handling capabilities, with human readable errors. C'est la vie.

There are a few different ways of using code generation. They can be divided into three major groups:

Generating code in a different language as output from a step in the compilation process. For the typical compiler this would be a lower-level language, but it could be to another high-level language as in the case of the languages which compile to JavaScript.

Generating or transforming code in the source code language as a step in the compilation process. This is what macros do.

Generating code with a tool separately from the regular compilation process. The output from this is code which lives as files together with the regular source code and is compiled along with it. For example entity classes for an ORM might be auto-generated from a database schema, or data transfer objects and service interfaces might be generated from an interface specification like a WSDL file for SOAP.

I would guess you are talking about the third kind of generated code, since this is the most controversial form. In the first two forms the generated code is an intermediate step which is very cleanly separated from the source code. But in the third form there is no formal separation between source code and generated code, except that the generated code probably has a comment which says "don't edit this code". It still opens the risk of developers editing the generated code, which would be really ugly. From the viewpoint of the compiler, the generated code is source code.

Nevertheless, such forms of generated code can be really useful in a statically typed language. For example, when integrating with ORM entities, it is really useful to have strongly-typed wrappers for the database tables. Sure, you could handle the integration dynamically at runtime, but you would lose type safety and tool support (code completion). A major benefit of a statically typed language is the support of the type system at the time of writing rather than just at runtime. (Conversely, this type of code generation is not very prevalent in dynamically typed languages, since in such a language it provides no benefit compared to runtime conversions.)
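
A minimal Python sketch of that third kind (the schema, table, and emitted dataclass are all invented): a build-step tool reads the schema and writes a typed entity class to a regular source file.

```python
SCHEMA = {"users": {"id": "int", "email": "str"}}  # stand-in for a real DB schema

def generate_entity(table: str, columns: dict) -> str:
    fields = "\n".join(f"    {name}: {typ}" for name, typ in columns.items())
    return ("from dataclasses import dataclass\n\n"
            f"@dataclass\nclass {table.capitalize()}:\n{fields}\n")

code = generate_entity("users", SCHEMA["users"])
print(code)  # the build writes this to e.g. users_entity.py alongside the sources

# From then on the type checker and IDE see real, typed attributes:
ns = {}
exec(code, ns)
u = ns["Users"](id=1, email="a@b.c")
print(u.email)
```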

That is, if there is a code generator for something, then why not make
that something a proper function which can receive the required
parameters and do the right action that the "would generated" code
would have done?

Because type safety and code completion are features you want at compile time (and while writing code in an IDE), but regular functions are only executed at runtime.

There might be a middle ground though: F# supports the concept of type providers, which are basically strongly typed interfaces generated programmatically at compile time. This concept could probably replace many uses of code generation and provide a cleaner separation of concerns.

Processor instruction sets are fundamentally imperative, but programming languages can be declarative. Running a program written in a declarative language inevitably requires some type of code generation. As mentioned in this answer and others, a major reason for generating source code in a human-readable language is to take advantage of the sophisticated optimizations performed by compilers.

Wrong definition of source code. The source code is mostly for humans working on it (and that mere fact defines it, see also what is free software by the FSF). Assembler code generated with gcc -fverbose-asm -O -S is not source code (and is not only or mostly data), even if it is some textual form always fed to GNU as and sometimes read by humans.
– Basile StarynkevitchNov 30 '17 at 10:11

Also, many languages implementations compile to C code, but that generated C is not genuine source code (e.g. cannot be easily worked upon by humans).
– Basile StarynkevitchNov 30 '17 at 10:16

Lastly, your hardware (e.g. your AMD or Intel chip, or your computer motherboard) is interpreting machine code (which is obviously not source code). BTW the IBM 1620 had keyboard-typable (BCD) machine code, but that fact did not make it "source code". Not all code is source.
– Basile StarynkevitchNov 30 '17 at 10:18

@BasileStarynkevitch Ah, you got me there. I shouldn't try to compress my witty statement too much, or they change their meaning. Right, source code should be the most original code that is fed into the first compilation stage.
– BergiNov 30 '17 at 11:20

No, source code is code for humans. It is as difficult and as subjective to define as music (vs. sound). It is not a matter of trying to find the software consuming it.
– Basile StarynkevitchNov 30 '17 at 11:21
