After doing a bit of programming, I've become quite curious about language design itself. I'm still a novice (I've been doing it for about a year), so the majority of my code pertains to only two fields: GUI design in Python and basic algorithms in C/C++. I have become intrigued with how the actual languages themselves are written. I mean this in both senses: how a language is literally written (i.e., what language the language is written in), as well as how various features come about, like significant whitespace (Python) or object orientation (C++ and Python).

Where would one start learning how to write a language? What are some of the fundamentals of language design, things that would make it a "complete" language?



I saw a question on here a while ago similar to this, and the most common answer was: don't. Writing your own language is a major ballache.
– benhowdle89, Mar 1 '11 at 10:24

Generally speaking, there are two major aspects: one is to define a language, its syntax, semantics, etc.; the other is to write an actual compiler or interpreter for that language. With only one year of experience, you probably should not do either.
– user281377, Mar 1 '11 at 10:30

@ammoQ I'm not aiming to write a language, at least not right now. However, I'm intrigued by the process and what's involved.
– RectangleTangle, Mar 1 '11 at 10:44

There are three famously difficult tasks in programming: compilers, kernels, and pessimistic transaction coordinators. One should be really, really unhappy with what he or she finds available before starting a new compiler.
– user7071, Jun 4 '12 at 2:38

6 Answers

As with "normal" programming, LOGO is probably the language you should start with.

If you have lots of time, you can start by writing a naive LOGO interpreter first (you will probably run into some obstacles and make some common mistakes, but that's the whole point of this exercise).
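To give a taste of that exercise, here is a minimal sketch of a naive LOGO-style interpreter in Python. The three-command vocabulary and the turtle state are my own simplification, not real LOGO:

    # A deliberately naive interpreter for a tiny LOGO-like language.
    # Supports: FORWARD n, RIGHT n, LEFT n. Everything else is an error.
    import math

    def run(source):
        x, y, heading = 0.0, 0.0, 0.0  # turtle state: position, angle in degrees
        tokens = source.upper().split()
        i = 0
        while i < len(tokens):
            cmd = tokens[i]
            if cmd in ("FORWARD", "RIGHT", "LEFT"):
                arg = float(tokens[i + 1])  # naive: assumes an argument follows
                i += 2
            else:
                raise SyntaxError(f"unknown command: {cmd}")
            if cmd == "FORWARD":
                x += arg * math.cos(math.radians(heading))
                y += arg * math.sin(math.radians(heading))
            elif cmd == "RIGHT":
                heading = (heading + arg) % 360
            else:  # LEFT
                heading = (heading - arg) % 360
        return x, y, heading

    print(run("FORWARD 10 RIGHT 90 FORWARD 5"))  # -> (10.0, 5.0, 90.0), up to float rounding

Splitting on whitespace as a "tokenizer" is exactly the kind of naive shortcut that breaks down later (no nested blocks, no error positions), which is the point of the exercise.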

The next step is to look at how context-free parsing works, LL(k)/LR(k)/Earley parsers, ASTs and so on.
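To illustrate what these parsers produce, here is a sketch of a hand-written recursive-descent parser (effectively LL(1)) that turns arithmetic expressions into an AST. The grammar and the tuple-shaped AST nodes are a toy example of my own, not from any particular compiler:

    # Grammar (toy example):
    #   expr   -> term (('+' | '-') term)*
    #   term   -> factor (('*' | '/') factor)*
    #   factor -> NUMBER | '(' expr ')'
    import re

    TOKEN_RE = re.compile(r"\s*(?:(\d+)|(\S))")

    def tokenize(src):
        for number, op in TOKEN_RE.findall(src):
            yield ("NUM", int(number)) if number else ("OP", op)

    class Parser:
        def __init__(self, src):
            self.tokens = list(tokenize(src)) + [("EOF", None)]
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos]

        def next(self):
            tok = self.tokens[self.pos]
            self.pos += 1
            return tok

        def expr(self):
            node = self.term()
            while self.peek() in (("OP", "+"), ("OP", "-")):
                op = self.next()[1]
                node = (op, node, self.term())  # AST node: (operator, left, right)
            return node

        def term(self):
            node = self.factor()
            while self.peek() in (("OP", "*"), ("OP", "/")):
                op = self.next()[1]
                node = (op, node, self.factor())
            return node

        def factor(self):
            kind, value = self.next()
            if kind == "NUM":
                return value
            if (kind, value) == ("OP", "("):
                node = self.expr()
                assert self.next() == ("OP", ")"), "expected ')'"
                return node
            raise SyntaxError(f"unexpected token: {value}")

    print(Parser("1 + 2 * (3 - 4)").expr())  # ('+', 1, ('*', 2, ('-', 3, 4)))

Note how each grammar rule maps one-to-one onto a parse function; that correspondence is what makes recursive descent the usual first parser people write by hand.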

That's obviously just the first step of processing the source code, but once you have it, you can progress to symbol tables and could probably write a LOGO-to-C compiler. (You could, of course, compile to machine code directly, but at this point it wouldn't add much to the "feel" while making debugging a nightmare.)
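As a rough sketch of the "compile to C" idea, here is how a list of already-parsed LOGO-like commands might be emitted as C source. The runtime functions (turtle_forward, turtle_turn) and the header name are hypothetical; they would live in a small C support library linked with the generated code:

    # Emit C source from a list of already-parsed (command, argument) pairs.
    def emit_c(commands):
        lines = [
            '#include "turtle.h"  /* hypothetical runtime library */',
            "",
            "int main(void) {",
        ]
        for cmd, arg in commands:
            if cmd == "FORWARD":
                lines.append(f"    turtle_forward({arg});")
            elif cmd == "RIGHT":
                lines.append(f"    turtle_turn({arg});")
            elif cmd == "LEFT":
                lines.append(f"    turtle_turn(-{arg});")
            else:
                raise ValueError(f"cannot compile: {cmd}")
        lines += ["    return 0;", "}"]
        return "\n".join(lines)

    print(emit_c([("FORWARD", 10), ("RIGHT", 90), ("FORWARD", 5)]))

A real compiler would walk an AST and consult a symbol table here; for a language this small, a flat command list is enough to see the shape of the translation.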

There are many ways to understand how compilers are built. In the simplest definition, a compiler is a program that takes your source code and converts it into an executable form of one kind or another (VM bytecode or machine language).

So in order to convert programs, a compiler first has to understand them. The compiler you write has to successfully handle the millions of possible valid programs that can be written in the language. Therefore, in order to understand them, it has to ...

a. Parse them: This step is itself composed of many steps, since a program can contain data and other constructs. How do you recognize that a sentence is a valid English statement? You take the rules of English grammar and apply them to the vocabulary in question. A compiler starts the same way: it first has to recognize valid lexical tokens (keywords, identifiers, literals, and so on), which requires reading characters one by one and matching them against a template.
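A minimal sketch of that character-by-character matching, assuming a toy token set of keywords, identifiers, numbers, and single-character symbols:

    # A hand-rolled lexer: read characters one at a time and group them
    # into tokens by matching against simple "templates" (character classes).
    KEYWORDS = {"if", "else", "while"}  # toy keyword set for illustration

    def lex(src):
        tokens, i = [], 0
        while i < len(src):
            ch = src[i]
            if ch.isspace():
                i += 1
            elif ch.isalpha():                     # template: letters -> word
                j = i
                while j < len(src) and src[j].isalnum():
                    j += 1
                word = src[i:j]
                tokens.append(("KEYWORD" if word in KEYWORDS else "IDENT", word))
                i = j
            elif ch.isdigit():                     # template: digits -> number
                j = i
                while j < len(src) and src[j].isdigit():
                    j += 1
                tokens.append(("NUMBER", src[i:j]))
                i = j
            else:                                  # anything else: single symbol
                tokens.append(("SYMBOL", ch))
                i += 1
        return tokens

    print(lex("if x1 > 42"))
    # [('KEYWORD', 'if'), ('IDENT', 'x1'), ('SYMBOL', '>'), ('NUMBER', '42')]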

Above the token level sits the language grammar, which fundamentally defines what is legal syntactically. Writing a parser by hand for every grammar that comes along is laborious and not always practical, hence parser generators: they take a grammar and generate a parser for it. How does the recognition itself work? There are many approaches, ranging from using regular expressions to reading each character one by one and matching until a valid lexical token is encountered.
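Contrast the hand-rolled loop above with the regular-expression approach just mentioned. Here is a sketch using Python's re module, where the token "templates" are specified declaratively, closer in spirit to what a lexer generator consumes (the token names are arbitrary, and keywords would be classified by a lookup afterwards):

    import re

    # The same toy tokens as above, but specified as regular expressions.
    TOKEN_SPEC = [
        ("NUMBER", r"\d+"),
        ("IDENT",  r"[A-Za-z]\w*"),
        ("SYMBOL", r"[^\s\w]"),
        ("SKIP",   r"\s+"),
    ]
    MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

    def regex_lex(src):
        # Note: characters matching no pattern are silently skipped here;
        # a production lexer would report them as errors with a position.
        for match in MASTER_RE.finditer(src):
            if match.lastgroup != "SKIP":
                yield (match.lastgroup, match.group())

    print(list(regex_lex("if x1 > 42")))
    # [('IDENT', 'if'), ('IDENT', 'x1'), ('SYMBOL', '>'), ('NUMBER', '42')]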

b. Making sense of what has been read: You can form a grammatically valid sentence that still makes no sense. The same thing needs to be checked in a compiler: what does the syntax actually mean? This is semantics.
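As a sketch of one such semantic check, here is a pass that rejects programs using a variable before it is defined; the statement and AST shapes are my own toy convention from the parser sketch above:

    # Semantic analysis pass: the parser accepted "y = x + 1" as grammatically
    # valid, but it is only meaningful if x was defined first.
    def check_defined(statements):
        defined = set()
        for target, expr in statements:          # statement: (variable, expression AST)
            for name in names_in(expr):
                if name not in defined:
                    raise NameError(f"'{name}' used before definition")
            defined.add(target)

    def names_in(node):
        if isinstance(node, str):                # a variable reference
            yield node
        elif isinstance(node, tuple):            # (op, left, right)
            for child in node[1:]:
                yield from names_in(child)
        # numbers contribute no names

    check_defined([("x", 1), ("y", ("+", "x", 1))])  # x = 1; y = x + 1: fine
    try:
        check_defined([("y", ("+", "x", 1))])        # y = x + 1 with x undefined
    except NameError as e:
        print("rejected:", e)                        # rejected: 'x' used before definition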

c. You know what it means; now you want to say the same thing in another language: What is the equivalent of what was just parsed in, say, assembly? Suppose you have perfectly parsed and recognized an if statement. Now you convert it to its equivalent in assembly.
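For instance, here is a sketch of lowering an already-parsed if statement into assembly-like pseudo-instructions. The instruction names are illustrative (loosely in the spirit of x86), not any real ISA:

    # Lower: if (x < 10) { x = x + 1; }  into assembly-like pseudo-instructions.
    label_counter = 0

    def new_label():
        global label_counter
        label_counter += 1
        return f".L{label_counter}"

    def compile_if(var, limit, increment):
        end = new_label()
        return [
            f"    cmp  {var}, {limit}",       # compare x with the limit
            f"    jge  {end}",                # if x >= limit, skip the body
            f"    add  {var}, {increment}",   # body: x = x + increment
            f"{end}:",                        # join point after the if
        ]

    print("\n".join(compile_if("x", 10, 1)))

The key idea is that structured control flow (if, while) becomes compares, conditional jumps, and labels; everything else in code generation is elaboration on that.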

d. Meanwhile, optimize what you can: If there are any reasonable code optimizations that can be done along the way, do them.
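One classic optimization that can be done along the way is constant folding: computing constant subexpressions at compile time. A sketch over the tuple-shaped ASTs used above:

    # Constant folding: evaluate subtrees whose operands are all constants,
    # so 2 * (3 - 4) is computed at compile time rather than at run time.
    OPS = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}

    def fold(node):
        if not isinstance(node, tuple):
            return node                       # a number or a variable name
        op, left, right = node
        left, right = fold(left), fold(right)
        if isinstance(left, int) and isinstance(right, int):
            return OPS[op](left, right)       # both sides constant: compute now
        return (op, left, right)

    print(fold(("+", "x", ("*", 2, ("-", 3, 4)))))  # ('+', 'x', -2)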

This is a brief overview, in plain language, of how compilers work. Of course there is a lot more to it; you could write volumes on it (and volumes have already been written).

What you should do:

Read some good theory on compilers. The Dragon Book is recommended.

Download an open-source compiler. Plenty are available.

Try to map what you have read from the book to the code.

Break something and see how it works.

Add something and see how it works.

Then:

Write something on your own.

Search for interesting problems to solve and solve them.

Browse the bug lists of those open-source projects and try sending patches.

Remember, you can read a lot, but you will be tested only when you write code. So write code. Fail often, and learn from it. Use the feedback... repeat the cycle again.

You can start with the implementation of Python if you want; it's just C. If you want something simpler, check out SIOD (Scheme in One Defun), a very small-footprint Scheme implementation. Since Scheme has such a simple syntax, it is very easy to understand.
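To see why Scheme's simple syntax matters, here is a sketch of an s-expression reader and a tiny evaluator in Python, handling only numbers and a few arithmetic primitives, nowhere near a real Scheme:

    # A tiny Scheme-flavored reader and evaluator. Scheme's syntax is just
    # parenthesized lists, so the whole parser fits in a few lines.
    def parse(src):
        tokens = src.replace("(", " ( ").replace(")", " ) ").split()
        def read(pos):
            if tokens[pos] == "(":
                items, pos = [], pos + 1
                while tokens[pos] != ")":
                    item, pos = read(pos)
                    items.append(item)
                return items, pos + 1
            tok = tokens[pos]
            try:
                return int(tok), pos + 1
            except ValueError:
                return tok, pos + 1           # a symbol
        expr, _ = read(0)
        return expr

    PRIMITIVES = {"+": lambda *a: sum(a),
                  "*": lambda *a: a[0] * a[1],
                  "-": lambda a, b: a - b}

    def evaluate(expr):
        if isinstance(expr, int):
            return expr
        op, *args = expr
        return PRIMITIVES[op](*(evaluate(a) for a in args))

    print(evaluate(parse("(+ 1 (* 2 (- 7 4)))")))  # 7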

Very few of today's popular languages are the product of a comprehensive up-front design. Instead, there was an initial implementation by one or two people who were frustrated in some way with the available languages. Then some other people tried it and liked it, and over time, features were added. Eventually some desirable feature can't be added without breaking existing programs. Then someone creates a new language.

Completeness is partially a matter of opinion, but not totally. Paul Graham wrote a fine article about the relative power of programming languages, introducing an idea now widely known as the Blub Paradox.

As in conceptual art, you pretty much need to learn everything that has been done in the field since Java's inception (a couple of years ago I'd have said since C's inception) to get a decent language out.

And still you risk producing a useless behemoth that cannot be successfully applied to any interesting domain.

...or yet another Lisp dialect, or another CoffeeScript.
(Granted, those have practical applications and are usually conceptually sound, but put in perspective, they tend to raise more problems than they solve, and making problems worse is not a good reason to write a new language.)