I have broad exposure to programming and have come across languages including BASIC, FORTRAN, COBOL, LISP, LOGO, Java, C++, C, MATLAB, Mathematica, Python, Ruby, Perl, JavaScript, Assembly and so on. I can't understand how people create programming languages and devise compilers for them. I also can't understand how people create operating systems like Windows, Mac, UNIX, DOS and so on. The other thing that is mysterious to me is how people create libraries like OpenGL, OpenCL, OpenCV, Cocoa, MFC and so on. The last thing I am unable to figure out is how scientists devise an assembly language and an assembler for a microprocessor. I would really like to learn all of this, and I am 15 years old. I have always wanted to be a computer scientist, someone like Babbage, Turing, Shannon, or Dennis Ritchie.

I have already read Aho's Compiler Design and Tanenbaum's OS concepts book, and they only discuss concepts and code at a high level. They don't go into the details and nuances of how to devise a compiler or operating system. I want a concrete understanding so that I can create one myself, not just an understanding of what a thread, semaphore, process, or parsing is. I asked my brother about all this. He is an SB student in EECS at MIT and hasn't got a clue how to actually create all this stuff in the real world. All he knows is an understanding of compiler design and OS concepts like the ones mentioned here (i.e. threads, synchronization, concurrency, memory management, lexical analysis, intermediate code generation and so on).

There are either too many possible answers, or good answers would be too long for this format. Please add details to narrow the answer set or to isolate an issue that can be answered in a few paragraphs.


Learn as many programming languages as you can. That way you will learn from their concepts as well as their mistakes. Why be content with dwarfs, when you can stand on the shoulder of giants?
–
sbiOct 24 '11 at 13:54

I can't understand how people create programming languages and devise compilers for it.

It is surprising to me, but lots of people do look at programming languages as magical. When I meet people at parties or whatever, if they ask me what I do I tell them that I design programming languages and implement the compilers and tools, and it is surprising the number of times people -- professional programmers, mind you -- say "wow, I never thought about it, but yeah, someone has to design those things". It's like they thought that languages just spring up wholly formed with tool infrastructures around them already.

They don't just appear. Languages are designed like any other product: by carefully making a series of tradeoffs amongst competing possibilities. The compilers and tools are built like any other professional software product: by breaking the problem down, writing one line of code at a time, and then testing the heck out of the resulting program.

Language design is a huge topic. If you're interested in designing a language, a good place to start is by thinking about what the deficiencies are in a language that you already know. Design decisions often arise from considering a design defect in another product.

Alternatively, consider a domain that you are interested in, and then design a domain-specific language (DSL) that specifies solutions to problems in that domain. You mentioned LOGO; that's a great example of a DSL for the "line drawing" domain. Regular expressions are a DSL for the "find a pattern in a string" domain. LINQ in C#/VB is a DSL for the "filter, join, sort and project data" domain. HTML is a DSL for the "describe the layout of text on a page" domain, and so on. There are lots of domains that are amenable to language-based solutions. One of my favourites is Inform7, which is a DSL for the "text-based adventure game" domain; it is probably the highest-level serious programming language I've ever seen. Pick a domain you know something about and think about how to use language to describe problems and solutions in that domain.

Once you have sketched out what you want your language to look like, try to write down precisely what the rules are for determining what is a legal and illegal program. Typically you'll want to do this at three levels:

lexical: what are the rules for words in the language, what characters are legal, what do numbers look like, and so on.

syntactic: how do words of the language combine into larger units? In C# larger units are things like expressions, statements, methods, classes, and so on.

semantic: given a syntactically legal program, how do you figure out what the program does?

Write down these rules as precisely as you possibly can. If you do a good job of that then you can use that as the basis for writing a compiler or interpreter. Take a look at the C# specification or the ECMAScript specification to see what I mean; they are chock-full of very precise rules that describe what makes a legal program and how to figure out what one does.
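To make the lexical level concrete, here is a minimal sketch in Python. The token categories and regexes are invented for illustration, not taken from any real language spec; the point is that once you have written the "rules for words" down precisely, the tokenizer almost writes itself:

```python
import re

# A hypothetical lexical grammar for a tiny language, written as
# (token-name, regex) pairs -- the "rules for words" described above.
TOKEN_RULES = [
    ("NUMBER",  r"\d+(?:\.\d+)?"),     # what numbers look like
    ("IDENT",   r"[A-Za-z_]\w*"),      # what names look like
    ("OP",      r"[+\-*/=]"),          # legal operator characters
    ("LPAREN",  r"\("),
    ("RPAREN",  r"\)"),
    ("SKIP",    r"\s+"),               # whitespace is legal but ignored
]

MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_RULES))

def tokenize(source):
    """Split source text into (kind, text) tokens, or fail on illegal chars."""
    tokens, pos = [], 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if not m:
            raise SyntaxError(f"illegal character {source[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("x = 2 * (3 + 4)"))
```

Notice that every decision the code makes traces back to a rule you wrote down first; if a character isn't covered by any rule, the program is lexically illegal, exactly as the spec says.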

One of the best ways to get started writing a compiler is by writing a high-level-language-to-high-level-language compiler. Write a compiler that takes in strings in your language and spits out strings in C# or JavaScript or whatever language you happen to know; let the compiler for that language then take care of the heavy lifting of turning it into runnable code.
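As a sketch of what such a translator can look like, here is a toy "set ... to ..." language (invented for illustration) compiled to JavaScript; the JavaScript engine then does the heavy lifting:

```python
# A minimal high-level-to-high-level compiler: it translates a hypothetical
# toy language ("set x to 3 + 4") into JavaScript source text.
def compile_to_js(program):
    js_lines = []
    for line in program.strip().splitlines():
        words = line.split()
        # grammar of the toy language: "set <name> to <expr>" or "print <expr>"
        if words[0] == "set" and words[2] == "to":
            name, expr = words[1], " ".join(words[3:])
            js_lines.append(f"var {name} = {expr};")
        elif words[0] == "print":
            js_lines.append(f"console.log({' '.join(words[1:])});")
        else:
            raise SyntaxError(f"unknown statement: {line}")
    return "\n".join(js_lines)

source = """
set x to 3 + 4
print x
"""
print(compile_to_js(source))
```

This cheats, of course: it leans on JavaScript's expression syntax instead of parsing expressions itself. That is exactly the appeal of targeting a high-level language first.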

In particular you might find this post interesting; here I list most of the tasks that the C# compiler performs for you during its semantic analysis. As you can see, there are a lot of steps. We break the big analysis problem down into a series of problems that we can solve individually.

Finally, if you're looking for a job doing this stuff when you're older then consider coming to Microsoft as a college intern and trying to get into the developer division. That's how I ended up with my job today!

@Thorbjørn: Let's be clear about the terminology. A "compiler" is any device that translates from one programming language to another. One of the nice things about having a C# compiler that turns C# into IL, and an IL compiler (the "jitter") that turns IL into machine code, is that you get to write the C# compiler to IL (easy!), and put the processor-specific optimizations in the jitter. It's not that compiler optimizations are "not being done", it's that the jit compiler team does them for us. See blogs.msdn.com/b/ericlippert/archive/2009/06/11/…
–
Eric LippertJun 16 '11 at 13:19

I didn't realize that C# string concats are automatically converted to String.Concat calls at compile time. That is going to save me some typing time.
–
MufasaJun 23 '11 at 20:50

+1 for Inform7. Which, if I recall correctly, compiles to Inform6, which then compiles to.. something else, possibly something C-based. Which would make it high-level-language-to-high-level-language-to-high-level-language.
–
dlras2Jul 9 '11 at 2:59

@Cyclotis04: Inform6 compiles to Z-code, which is a famous extremely early example of a bytecode-based virtual machine. That's how all those Infocom games in the 1980s could be both larger than memory and portable to multiple architectures; the games were compiled to z-code and then z-code interpreters with code memory paging were implemented for multiple machines. Nowadays of course you can run a zcode interpreter on a wristwatch if you need to, but back in the day that was high tech. See en.wikipedia.org/wiki/Z-machine for details.
–
Eric LippertJul 9 '11 at 3:36

+1. Let's Build A Compiler was a real eye-opener for me, and unlike the Dragon Book, it was actually designed to be easily comprehensible by mere mortals.
–
Mason WheelerJun 15 '11 at 20:13

What's interesting about Crenshaw's intro is that it ends (spoiler: it's incomplete) just about the time you'd run smack into the issues that would make you realize, hey, I really should have designed my language fully before starting to implement it. And then you say, hey, if I have to write a full language specification, why not do it in a formal notation that I can then feed into a tool to generate a parser? And then you're doing it like everyone else.
–
kindallJun 16 '11 at 22:28

@kindall, you need to have done it by hand in order to realize that there is a reason to use the tools.
–
user1249Jun 17 '11 at 6:57

"I would really like to learn this stuff". If you are long-term serious:

Go to college and specialize in software engineering. Take every compiler class you can get. The people teaching those classes are better educated and more experienced than you; it's good to have their expert perspectives present the information to you in ways you'll never get from just reading code.

Stick with math classes through high school and continue in college for all 4 years. Focus on non-standard math: logic, group theory, meta-mathematics. This will force you to think abstractly. It will enable you to read the advanced theory papers on compiling and understand why those theories are interesting and useful. You can ignore those advanced theories only if you want to be forever behind the state of the art.

Collect/read the standard compiler texts: Aho/Ullman, etc. They contain what the community generally agrees is fundamental stuff. You might not use everything from those books, but you should know it exists, and you should know why you aren't using it. I thought Muchnick was great, but it is for pretty advanced topics.

Build a compiler. Start NOW by building a rotten one. This will teach you some issues. Build a second one. Repeat. This experience builds huge synergy with your book learning.

A really good place to start is to learn about BNF (Backus Naur Form), parsers, and parser-generators. BNF is effectively universally used in compiler land, and you can't realistically talk to your fellow compiler-types if you don't know it.
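To make that concrete, here is the classic toy example: an arithmetic grammar in BNF and a hand-written recursive-descent parser that mirrors it rule for rule. This is a sketch, not a production parser; it evaluates as it parses instead of building a tree:

```python
# A tiny BNF grammar for arithmetic, with one Python function per rule:
#
#   expr   ::= term   (("+" | "-") term)*
#   term   ::= factor (("*" | "/") factor)*
#   factor ::= NUMBER | "(" expr ")"

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {peek()!r}")
        pos += 1

    def expr():
        value = term()
        while peek() in ("+", "-"):
            op = peek(); eat(op)
            value = value + term() if op == "+" else value - term()
        return value

    def term():
        value = factor()
        while peek() in ("*", "/"):
            op = peek(); eat(op)
            value = value * factor() if op == "*" else value / factor()
        return value

    def factor():
        nonlocal pos
        if peek() == "(":
            eat("(")
            value = expr()
            eat(")")
            return value
        value = float(peek())
        pos += 1
        return value

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected trailing token {peek()!r}")
    return result

print(parse(["2", "*", "(", "3", "+", "4", ")"]))
```

Once you can read the BNF and see how it maps onto code like this, parser generators (which automate exactly this mapping) stop looking like magic.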

If you want a great first introduction to compiling, and to the direct value of BNF not just as documentation but as a tool-processable metalanguage, see this tutorial (not mine) on building "meta" compilers (compilers that build compilers), based on a paper from 1964 (yes, you read that right): "META II: a syntax-oriented compiler writing language" by Val Schorre (http://doi.acm.org/10.1145/800257.808896).
This is IMHO one of the single best comp-sci papers ever written: it teaches you to build compiler-compilers in 10 pages. I learned initially from this paper.

What I wrote about above is a lot from personal experience, and I think it has served me pretty well. YMMV, but IMHO, not by much.

@nbt None of the above is necessary. But all of the above helps. Really a lot.
–
Konrad RudolphJun 16 '11 at 8:56

If you want to read the advanced technical papers on compiler theory, you better be mathematically competent. You can decide to ignore that literature, and your theory and therefore compilers will be poorer for it. The naysayers here all make the point that you can build a compiler without a lot of formal education, and I agree. They seem to imply you can build really good compilers without it. That's not a bet I'd care to take.
–
Ira BaxterJun 16 '11 at 21:20

CS is a discipline that's genuinely useful to language design and implementation. Not mandatory, of course, but there have been decades of research that can and should be leveraged, and there is no reason at all to repeat others' mistakes.
–
Donal FellowsJun 17 '11 at 21:36

Using simulators, you actually build a complete computer system from the ground up. While many commenters have stated that your question is too broad, this book actually answers it while staying very manageable. When you're done, you'll have written a game in a high-level language (that you designed), which uses your own OS's functionality, which is compiled into a VM language (that you designed) by your compiler, which is translated into an assembly language (that you designed) by your VM translator, which is assembled into machine code (that you designed) by your assembler, which runs on your computer system, which you put together from chips that you designed using boolean logic and a simple hardware description language.

Take a step back. A compiler is simply a program that translates a document in one language into a document in another language. Both languages ought to be well-defined and specific.

The languages do not have to be programming languages. They can be any language whose rules can be written down. You've probably seen Google Translate; that's a compiler because it can translate one language (say, German) into another (Japanese, perhaps).

Another example of a compiler is an HTML rendering engine. Its input is an HTML file and the output is a series of instructions to draw the pixels on the screen.

When most people talk about a compiler, they are usually referring to a program that translates a high-level programming language (such as Java, C, Prolog) into a low-level one (assembly or machine code). That can be daunting. But it's not so bad when you take a generalist's view that a compiler is a program that translates one language into another.

Can you write a program that reverses every word in a string? For example:

When the cat's away, the mice will play.

becomes

nehW eht s'tac yawa, eht ecim lliw yalp.

That's not a difficult program to write, but you need to think about some things:

What is a "word"? Can you define which characters make up a word?

Where do words start and end?

Are words separated by only one space, or can there be more--or less?

Does punctuation need to be reversed, too?

What about punctuation inside a word?

What happens to capital letters?

The answers to these questions help the language be well-defined. Now go ahead and write the program. Congratulations, you've just written a compiler.
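One possible set of answers, sketched in Python: here a "word" is defined as a run of letters and apostrophes, and everything else (spaces, punctuation, however much of it there is) passes through untouched:

```python
import re

# Reverse each "word" in a string. The definition of "word" is the whole
# language design decision: letters and apostrophes, nothing else.
def reverse_words(text):
    return re.sub(r"[A-Za-z']+", lambda m: m.group()[::-1], text)

print(reverse_words("When the cat's away, the mice will play."))
```

Different answers to the questions above (should punctuation reverse? should case follow position?) would give a different, equally valid language and a different compiler.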

How about this: Can you write a program that takes a series of drawing instructions and outputs a PNG (or JPEG) file? Maybe something like this:

What comes after the word "line"? What comes after "color"? Likewise for "background", "box", etc.

What is a number?

Is an empty input file allowed?

Is it OK to capitalize the words?

Are negative numbers allowed?

What happens if you don't give the "image" directive?

Is it OK to not specify a color?

Of course, there are more questions to answer, but if you can nail them down, you have defined a language. The program you write to do the translation is, you guessed it, a compiler.
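The original drawing-language listing isn't shown here, so the directives below ("image", "color", "line", "box") and their argument counts are my invention; but a hypothetical front half of such a compiler, which turns the text into a checked list of drawing commands (actually emitting a PNG is left to an imaging library), might look like this:

```python
SAMPLE = """
image 100 50
color red
line 0 0 99 49
box 10 10 30 30
"""

# Hypothetical directives and how many arguments each takes.
KNOWN = {"image": 2, "color": 1, "line": 4, "box": 4, "background": 1}

def parse_drawing(source):
    commands = []
    for n, line in enumerate(source.strip().splitlines(), 1):
        word, *args = line.split()
        if word not in KNOWN:                    # is this directive legal?
            raise SyntaxError(f"line {n}: unknown directive {word!r}")
        if len(args) != KNOWN[word]:             # right number of arguments?
            raise SyntaxError(f"line {n}: {word} takes {KNOWN[word]} args")
        # "what is a number?" -- here: an optionally negative integer
        args = [int(a) if a.lstrip("-").isdigit() else a for a in args]
        commands.append((word, args))
    if not commands or commands[0][0] != "image":
        raise SyntaxError("program must start with an 'image' directive")
    return commands

print(parse_drawing(SAMPLE))
```

Every `raise SyntaxError` is one of the questions above, answered in code.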

You see, writing a compiler isn't that difficult. The compilers you've used in Java or C are just bigger versions of these two examples. So go for it! Define a simple language and write a program to make that language do something. Sooner or later you're going to want to extend your language. For instance, you may want to add variables or arithmetic expressions. Your compiler will become more complex but you'll understand every bit of it because you wrote it yourself. That's how languages and compilers come about.

Don't believe that there's anything magic about a compiler or an OS: there is not. Remember the programs you wrote to count all the vowels in a string, or add up the numbers in an array? A compiler is no different in concept; it's just a whole lot bigger.

Every program has three phases:

read some stuff

process that stuff: translate the input data to the output data

write some other stuff – the output data

Think about it: what is input to the compiler? A string of characters from a source file.

What is output from the compiler? A string of bytes that represent machine instructions to the target computer.

So what is the compiler's "process" phase? What does that phase do?

If you consider that the compiler – like any other program – has to include these three phases, you'll have a good idea of how a compiler is constructed.
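The three phases, sketched for a deliberately trivial "compiler" (the file names and the uppercasing "translation" are just placeholders for your real source language and target language):

```python
import os, tempfile

def compile_file(in_path, out_path, translate):
    with open(in_path) as f:            # phase 1: read some stuff
        source = f.read()
    output = translate(source)          # phase 2: translate input to output
    with open(out_path, "w") as f:      # phase 3: write some other stuff
        f.write(output)

# Demo: the "translation" here is simply uppercasing the source.
src = os.path.join(tempfile.gettempdir(), "demo.src")
out = os.path.join(tempfile.gettempdir(), "demo.out")
with open(src, "w") as f:
    f.write("hello")
compile_file(src, out, str.upper)
with open(out) as f:
    print(f.read())  # HELLO
```

A real compiler differs only in the middle function: `translate` becomes lexing, parsing, analysis, and code generation.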

Several others have given excellent answers. I'll just add a few more suggestions. First, a good book for what you're trying to do is Appel's Modern Compiler Implementation texts (take your pick of C, Java, or Standard ML). This book takes you through a complete implementation of a compiler for a simple language, Tiger, to MIPS assembly that can be run in an emulator, along with a minimal runtime support library. For a single pass through everything necessary to make a compiled language work, it's a pretty good book.[1]

Finally, I mentioned that Appel has his text in C, Java, and Standard ML - if you're serious about compiler construction and programming languages, I recommend learning ML and using that version of Appel. The ML-family languages have strong type systems and are predominantly functional - features that will be different from many other languages, so learning them if you don't already know a functional language will hone your language craft. Also, their pattern-matching and functional mindsets are extremely well-suited to the kinds of manipulations you often need to do in a compiler, so compilers written in ML-based languages are typically much shorter and easier to understand than compilers written in C, Java, or similar languages. Harper's book on Standard ML is a pretty good guide to get you started; working through that should prepare you to take on Appel's Standard ML compiler implementation book. If you learn Standard ML, it will also be pretty easy to pick up OCaml for later work; IMO, it has better tooling for the working programmer (it integrates more cleanly with the surrounding OS environment, produces executable programs easily, and has some spectacular compiler-building tools like ulex and Menhir).

[1] For long-term reference, I prefer the Dragon Book, as it has more details on the things I'm likely to refer to, such as the inner workings of parser algorithms, and has broader coverage of different approaches; but Appel's book is very good for a first pass. Basically, Appel teaches you one way to do things the whole way through the compiler and guides you through it. The Dragon Book covers different design alternatives in more detail, but provides far less guidance on how to get something working.

The basics of doing this are not as complicated as you'd think. The first step is to create your grammar. Think of the English language's grammar: in the same way, you can parse a sentence once you know it has a subject and a predicate. For more on that, read about Context Free Grammars.

Once you have the grammar down (the rules of your language), writing a compiler is as simple as following those rules. Compilers usually translate into machine code, but unless you want to learn x86, I suggest you look at MIPS or make your own virtual machine.

Compilers typically have two parts, a scanner and a parser. Basically, the scanner reads in the code and separates it out into tokens. The parser looks at the structure of those tokens. Then the compiler goes through and follows some rather simple rules to convert it to whatever code you need it to be in (assembly, intermediate code like bytecode, etc.). If you break it down into smaller and smaller pieces, this eventually isn't daunting at all.

Uhm. The compiler, after scanning/parsing, needs to do type-checking/inference, optimization, register allocation, etc., etc. These steps can be anything but simple. (When using interpreted code, you just defer these parts to the runtime stage.)
–
MackeJun 16 '11 at 7:31

Then you run it through an assembler, and turn it into something like this:

$A9 $00
$4C $10 $00

Only it's all squashed up, like this:

$A9 $00 $4C $10 $00

It's really not magic.

You can't write that in Notepad, because Notepad saves text characters, not raw bytes. You would use a hex editor, or simply write the bytes out programmatically. You write those bytes out to a file, name it "a.exe" or "a.out", then tell the OS to run it.

Of course, modern CPUs and operating systems are really quite complicated, but that's the basic idea.
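For instance, those five bytes from the example above can be written out programmatically like this (a sketch; the file name follows the convention mentioned above):

```python
# The $-prefixed hex notation above becomes a Python bytes literal.
machine_code = bytes([0xA9, 0x00, 0x4C, 0x10, 0x00])

with open("a.out", "wb") as f:   # "wb": raw bytes, not text
    f.write(machine_code)

print(machine_code.hex(" "))  # a9 00 4c 10 00
```

The file on disk contains exactly those five bytes and nothing else; whether the OS will actually run it depends on the executable format the OS expects, which is its own rabbit hole.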

If you want to write a new compiler, here is how it's done:

1) Write an interpreted language using something like the calculator example in pyparsing (or any other good parsing framework). That will get you up to speed on the basics of parsing.

2) Write a translator. Translate your language into, say, Javascript. Now your language will run in a browser.

3) Write a translator to something lower level, like LLVM, C, or Assembly.

You can stop here, this is a compiler. It's not an optimizing compiler, but that wasn't the question. You might also need to consider writing a linker and assembler, but do you really want to?

4) (Insane) Write an optimizer. Large teams work for decades on this.

4) (Sane) Get involved in an existing community. GCC, LLVM, PyPy, the core team working on any interpreter.

I can remember a point in my programming career when I was in a similar state of confusion to yours: I had read up on the theory quite a bit, the Dragon book, the Tiger book (red), but still hadn't much of a clue how to put it all together.

What did tie it together was finding a concrete project to do (and then finding out that I only needed a small subset of all the theory).

The Java VM provided me with a good starting point: it's conceptually a "processor" but it's highly abstracted from the messy details of actual CPUs. It also affords an important and often overlooked part of the learning process: taking things apart before putting them together again (like kids used to do with radio sets in the old days).

Play around with a decompiler and the Hello, World class in Java. Read the JVM spec and try to understand what's going on. This will give you grounded insight into just what the compiler is doing.

Then play around with code that creates the Hello, World class. (In effect you're creating an application-specific compiler, for a highly specialized language in which you can only say Hello, World.)

Try writing code that will be able to read in Hello, World written in some other language, and output the same class. Make it so you can change the string from "Hello, World" to something else.

Now try compiling (in Java) a class that computes some arithmetic expression, like "2*(3+4)". Take this class apart, write a "toy compiler" that can put it together again.
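A sketch of that "toy compiler" idea in Python: it borrows Python's own parser for the front end, so the focus stays on generating and then executing a made-up stack bytecode, much like JVM bytecode:

```python
import ast, operator

OPS = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL", ast.Div: "DIV"}

def compile_expr(node, code):
    if isinstance(node, ast.Constant):
        code.append(("PUSH", node.value))
    elif isinstance(node, ast.BinOp):
        compile_expr(node.left, code)       # post-order: operands first,
        compile_expr(node.right, code)      # operator last -- stack style
        code.append((OPS[type(node.op)], None))
    else:
        raise SyntaxError(f"unsupported node {node!r}")
    return code

def run(code):
    stack = []
    impl = {"ADD": operator.add, "SUB": operator.sub,
            "MUL": operator.mul, "DIV": operator.truediv}
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(impl[op](a, b))
    return stack.pop()

bytecode = compile_expr(ast.parse("2*(3+4)", mode="eval").body, [])
print(bytecode)   # [('PUSH', 2), ('PUSH', 3), ('PUSH', 4), ('ADD', None), ('MUL', None)]
print(run(bytecode))  # 14
```

Swap the tuple list for real JVM class-file bytes and you have the exercise described above.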

There are excellent answers in this thread, but I just wanted to add mine as I too once had the same question. (Also, I would like to point out that the book suggested by Joe-Internet is an excellent resource.)

First is the question of how a computer works.
This is how: Input -> Compute -> Output.

First consider the “Compute” part.
We'll look at how Input and Output work later.

A computer essentially consists of a processor(or CPU) and some memory(or RAM).
The memory is a collection of locations, each of which can store a finite number of bits, and each such memory location can itself be referenced by a number, called the address of the memory location. The processor is a gadget which can fetch data from the memory, perform some operations based on the data, and write some data back to the memory. How does the processor figure out what to read and what to do after reading the data from memory?

To answer this, we need to understand the structure of a processor.
The following is a fairly simple view.
A processor essentially consists of two parts. One is a set of memory locations built inside the processor that serve as its working memory; these are called "registers". The second is a bunch of electronic machinery built to perform certain operations using the data in the registers. There are two special registers called the "Program Counter" (pc) and the "Instruction Register" (ir).
The processor considers the memory to be partitioned into three parts. The first part is the "program memory", which stores the computer program being executed. The second is the "data memory". The third is used for some special purposes; we'll talk about it later.
The Program Counter contains the location of the next instruction to read from the program memory. The Instruction Register contains a number which refers to the current operation being performed. Each operation that a processor can perform is referred to by a number called the opcode of the operation. How a computer essentially works is: it reads the memory location referenced by the Program Counter into the Instruction Register (and increments the Program Counter so that it points to the memory location of the next instruction). Next, it reads the Instruction Register and performs the desired operation. For example, the instruction could be to read a specific memory location into a register, to write to some register, or to perform some operation using the values of two registers and write the output to a third register.
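That fetch-decode-execute cycle can be sketched as a toy interpreter (the instruction set here is invented for illustration; real processors do the same thing in hardware):

```python
# A toy processor: program memory holds (opcode, operand) pairs, and the
# pc register drives the classic fetch-decode-execute cycle.
def run(program_memory, data_memory):
    pc = 0                      # Program Counter
    acc = 0                     # one general-purpose register
    while True:
        ir = program_memory[pc]          # fetch into the Instruction Register
        pc += 1                          # pc now points at the next instruction
        opcode, operand = ir             # decode
        if opcode == "LOAD":             # execute:
            acc = data_memory[operand]   #   memory -> register
        elif opcode == "ADD":
            acc += data_memory[operand]  #   operate on register and memory
        elif opcode == "STORE":
            data_memory[operand] = acc   #   register -> memory
        elif opcode == "HALT":
            return data_memory

# Compute data_memory[0] + data_memory[1] and store it in data_memory[2].
mem = run([("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", None)], [3, 4, 0])
print(mem)  # [3, 4, 7]
```

Replace the string opcodes with numbers and you have machine code; give the numbers names again and you have assembly.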

Now how does the computer perform Input/Output? I'll provide a very simplified answer.
See http://en.wikipedia.org/wiki/Input/output and http://en.wikipedia.org/wiki/Interrupt for more.
It uses two things: that third part of the memory, and something called interrupts. Every device attached to a computer must be able to exchange data with the processor. It does so using the third part of the memory mentioned earlier: the processor allocates a slice of memory to each device, and the device and processor communicate via that slice of memory. But how does the processor know which location refers to which device, and when a device needs to exchange data? This is where interrupts come in. An interrupt is essentially a signal to the processor to pause what it is currently doing, save all its registers to a known location, and then start doing something else. There are many interrupts; each is identified by a unique number, and for each interrupt there is a special program associated with it. When the interrupt occurs, the processor executes the program corresponding to that interrupt. Depending on the BIOS and how the hardware devices are connected to the motherboard, every device gets a unique interrupt and a slice of memory. While booting up, the operating system, with the help of the BIOS, determines the interrupt and memory location of each device and sets up the special programs for the interrupts to properly handle the devices. So when a device needs some data or wants to send in some data, it signals an interrupt. The processor pauses what it is doing, handles the interrupt, and then gets back to what it was doing. There are many kinds of interrupts, such as for the hdd, keyboard, etc. An important one is the system timer, which invokes an interrupt at regular intervals. There are also opcodes that can trigger interrupts, called software interrupts.

Now we can almost understand how an operating system works. When it boots up, the OS sets up the timer interrupt, so that control returns to the OS at regular intervals. It also sets up other interrupts to handle other devices. Now, when the computer is running a bunch of programs and the timer interrupt fires, the OS gains control and performs important tasks such as process management, memory management, etc. An OS also usually provides an abstract way for programs to access the hardware devices, rather than letting them access devices directly. When a program wants to access a device, it calls some code provided by the OS, which then talks to the device. There is a lot of theory involved here, dealing with concurrency, threads, locks, memory management, etc.

Now, one can in theory write a program directly using opcodes. This is what is called machine code. This is obviously very painful. An assembly language for the processor is nothing but mnemonics for these opcodes, which makes it easier to write programs. A simple assembler is a program that takes a program written in assembly and replaces the mnemonics with the appropriate opcodes.
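A minimal assembler in that spirit might look like this (the mnemonics and opcode bytes in the table are illustrative, not a real instruction set):

```python
# Replace each mnemonic with its opcode byte; operands are hex bytes.
OPCODES = {"LDA": 0xA9, "JMP": 0x4C, "NOP": 0xEA}

def assemble(source):
    output = bytearray()
    for line in source.strip().splitlines():
        mnemonic, *operands = line.split()
        output.append(OPCODES[mnemonic])             # mnemonic -> opcode
        output.extend(int(o, 16) for o in operands)  # operands as hex bytes
    return bytes(output)

print(assemble("LDA 00\nJMP 10 00").hex(" "))  # a9 00 4c 10 00
```

Real assemblers add labels, symbol tables, and relocation on top of this, but the core job really is this mechanical substitution.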

How does one go about designing a processor and an assembly language? To learn that, you have to read some books on computer architecture (see chapters 1-7 of the book referred to by joe-internet). This involves learning about boolean algebra, how to build simple combinational circuits to add, multiply, etc., how to build memory and sequential circuits, how to build a microprocessor, and so on.

Now, how does one write computer languages? One could start off by writing a simple assembler in machine code, then use that assembler to write a compiler for a simple subset of C, then use that subset of C to write a more complete version of C, and finally use C to write a more complicated language such as Python or C++. Of course, to write a language you must first design it (the same way you design a processor). Again, look at some textbooks for that.

And how does one write an OS? First you target a platform such as x86. Then you figure out how it boots and when your OS will get invoked. A typical PC boots this way: it starts up, and the BIOS performs some tests. Then the BIOS reads the first sector of the hdd and loads its contents to a specific location in memory. Then it sets up the CPU to start executing this loaded data. This is the point where your OS gets invoked. A typical OS at this point loads the rest of itself into memory. Then it initializes the devices, sets up other things, and finally greets you with the login screen.

So to write an OS, you must write the "boot-loader". Then you must write code to handle the interrupts and devices. Then you must write all the code for process management, device management, etc. Then you must write an API which lets the programs running in your OS access devices and other resources. And finally, you must write code that reads a program from disk, sets it up as a process, and starts executing it.

Of course, my answer is overly simplified and probably of little practical use. In my defence, I'm now a graduate student in theory, so I have forgotten a lot of these things. But you can google a lot of this stuff and find out more.

ANTLR is a good starting point. It's a parser-generator framework, similar to Lex and Yacc. There's a GUI called ANTLRWorks that simplifies the process.

In the .NET world, there is the Dynamic Language Runtime, which can be used to generate code. I've written an expression language called Zentrum that generates code using the DLR. It will show you how to parse and execute statically and dynamically typed expressions.

For a simple introduction to how compilers work and how to create your own programming language, I would recommend the new book http://createyourproglang.com, which focuses on language design theory (lexers, parsers, interpreters, etc.) without requiring knowledge of OS/CPU internals.

It uses the same tools that were used to create the recently popular Coffee Script and Fancy programming languages.

If all you say is true, you have the profile of a promising researcher, and a concrete understanding can be obtained only one way: studying. And I'm not saying "Read all these high-level computer science books (especially these) written by this genius!"; I mean: you must be around high-level people in order to become a computer scientist like Charles Babbage, Alan Turing, Claude Shannon or Dennis Ritchie. I'm not despising self-taught people (I'm one of them), but there are not many people like you out there. I seriously recommend the Symbolic Systems Program (SSP) at Stanford University. As their website says:

The Symbolic Systems Program (SSP) at Stanford University focuses on computers and minds: artificial and natural systems that use symbols to represent information. SSP brings together students and faculty interested in different aspects of the human-computer relationship, including...

I'm going to suggest something a little out of left field: learn Python (or perhaps Ruby, but I have a lot more experience in Python so that's what I'll discuss). And not merely dabble in it, but really get to know it at a deep level.

There are several reasons I suggest this:

Python is an exceptionally well-designed language. While it has a few warts, it has fewer IMHO than many other languages. If you are a budding language designer, it's good to expose yourself to as many good languages as possible.

Python's standard implementation (CPython) is open-source and well-documented, making it easier to understand how the language works under the hood.

Python is compiled to a simple byte code which is easier to understand than assembly and which works the same on all platforms Python runs on. So you'll learn about compilation (since Python does compile your source code to byte code) and interpretation (as this byte code is interpreted in the Python virtual machine).

Python has lots of proposed new features, documented in numbered PEPs (Python Enhancement Proposals). PEPs are interesting to read to see how the language designers considered implementing a feature before choosing the way they actually did it. (PEPs that are still under consideration are especially interesting in this regard.)

Python has a mix of features from various programming paradigms, so you will learn about various ways to approach solving problems and have a wider range of tools to consider including in your own language.

Python makes it pretty easy to extend the language in various ways with decorators, metaclasses, import hooks, etc., so you can play with new language features to an extent without actually leaving the language. (As an aside: blocks of code are first-class objects in Ruby, so you can actually write new control structures such as loops! I get the impression that Ruby programmers don't necessarily consider that to be extending the language, though; it's just how you program in Ruby. But it's pretty cool.)

In Python, you can actually disassemble the bytecode generated by the compiler, or even write your own from scratch and have the interpreter execute it (I have done this myself, and it was mind-bending but fun).
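For instance, the standard-library dis module will show you the byte code CPython generated for any function. A quick sketch (the exact opcode names vary between Python versions):

```python
import dis

def add_one(x):
    return x + 1

# Print a human-readable disassembly of the byte code the compiler produced.
dis.dis(add_one)

# The same information is available programmatically:
ops = [ins.opname for ins in dis.get_instructions(add_one)]
print(ops)  # includes e.g. 'LOAD_FAST' and a return instruction
```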

Python has good libraries for parsing. You can parse Python code into an abstract syntax tree and then manipulate it using the ast module. The PyParsing module is useful for parsing arbitrary languages, such as ones you design. You could in theory write your first language compiler in Python if you wanted (and it could generate C, assembly, or even Python output).
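A tiny sketch of the ast module in action (the exact ast.dump output differs slightly between Python versions):

```python
import ast

# Parse a one-line program into an abstract syntax tree.
tree = ast.parse("x = 1 + 2")
print(ast.dump(tree))  # roughly: Module(body=[Assign(targets=[Name(id='x', ...)], value=BinOp(...))])

# Walk the tree and collect the node types that appear.
node_types = [type(node).__name__ for node in ast.walk(tree)]
print(node_types)
```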

This investigative approach could go well with a more formal approach, as you will begin to recognize concepts you've studied in the language you're working with, and vice versa.

Compilers and programming languages (and everything involved in building one, such as defining a formal grammar and translating to assembly) are a very complex topic which requires a great deal of understanding about systems as a whole. This type of course is typically offered as a 3rd/4th-year Comp Sci class at university.

I would highly recommend you first get a better understanding of operating systems in general and of how existing languages are compiled/executed (i.e. natively (C/C++), in a VM (Java), or by an interpreter (Python/JavaScript)).

I believe we used the book Operating System Concepts by Abraham Silberschatz, Peter B. Galvin, and Greg Gagne in my Operating Systems course (in 2nd year). It was an excellent book which gave a thorough walkthrough of each component of an operating system; a bit pricey, but well worth it, and older/used copies should be floating around.

Well, I think your question could be rewritten as, "What are the core practical concepts of a computer science degree?", and the total answer, of course, is to get your own Bachelor's in Computer Science.

Fundamentally, you create your own programming language compiler by reading a text file, extracting information from it, and performing transformations on the text based on the information you've read from it, until you have transformed it into bytes that can be read by the loader (cf. Linkers and Loaders by Levine). A trivial compiler is a fairly rigorous project when done for the first time.

An operating system's heart is the kernel, which manages resources (e.g., memory allocation/deallocation), and switches between tasks/processes/programs.

An assembler is a text->byte transformation.
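To make that concrete, here is a toy sketch in Python. The two mnemonics and their opcode encodings are made up for illustration and don't correspond to any real machine:

```python
# A toy assembler for a hypothetical two-instruction machine, just to show
# that assembling really is a text -> byte transformation.
# The opcodes (LOAD=0x01, ADD=0x02) are invented for this example.
OPCODES = {"LOAD": 0x01, "ADD": 0x02}

def assemble(source):
    program = bytearray()
    for line in source.splitlines():
        line = line.split(";")[0].strip()  # strip comments and whitespace
        if not line:
            continue
        mnemonic, operand = line.split()
        program.append(OPCODES[mnemonic])  # emit the opcode byte
        program.append(int(operand))       # emit a one-byte immediate operand
    return bytes(program)

print(assemble("LOAD 5\nADD 3").hex())  # -> 01050203
```

A real assembler adds labels, a symbol table, and relocation information on top of this core loop, but the essence is the same.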

If you are interested in this stuff, I would suggest writing an x86 assembler, on Linux, that supports some subset of standard x86 assembly. That will be a fairly straightforward entry point and will introduce you to these issues. It is not a baby project, and it will teach you many things.

I would recommend writing it in C; C is the lingua franca for that level of work.

On the other hand, this is a fine place for a very-high-level language. As long as you can dictate the individual bytes in a file, you can make a compiler or assembler (the assembler is easier) in whatever language. Say, Perl. Or VBA. Heavens, the possibilities!
– Ian Dec 31 '12 at 3:48

People learn by doing. Only a small number can see symbols scrawled on the board and jump immediately from theory to practice. Unfortunately, those people are often dogmatic, fundamentalist and the loudest about it.

I was blessed to be exposed to the PDP-8 as my first assembly language. The PDP-8 had only eight instructions, which were so simple it was easy to imagine them being implemented by a few discrete components, which in fact they were. It really removed the "magic" from computers.

Another gateway to the same revelation is the MIX assembly language Knuth uses in his examples. MIX seems archaic today, but it still has that demystifying effect.

Another good introductory book is N. Wirth's "Compilerbau" (Compiler Construction) from 1986, which is about 100 pages long and presents concise, well-designed code for the toy language PL/0, including a parser, code generator, and virtual machine. It also shows how to write a parser that reads in the grammar to parse in EBNF notation. The book is in German, but I wrote a summary and translated the code to Python as an exercise; see http://www.d12k.org/cmplr/w86/intro.html.

It's a big topic but rather than brush you off with a pompous "go read a book, kid" instead I'll gladly give you pointers to help you wrap your head around it.

Most compilers and/or interpreters work like this:

Tokenize: Scan the code text and break it into a list of tokens.

This step can be tricky because you can't just split the string on spaces. You have to recognize that if (bar) foo += "a string"; is a list of 8 tokens: WORD, OPEN_PAREN, WORD, CLOSE_PAREN, WORD, ASSIGN_ADD, STRING_LITERAL, TERMINATOR. Since splitting on spaces won't work, you have to read the characters as a sequence: if you encounter an alphanumeric character, you keep reading characters until you hit a non-alphanumeric character, and the string you just read is a WORD to be classified further later. You can decide for yourself how granular your tokenizer is: whether it swallows "a string" as one token called STRING_LITERAL to be further parsed later, or whether it sees OPEN_QUOTE, UNPARSED_TEXT, CLOSE_QUOTE, or whatever. This is just one of the many choices you have to make for yourself as you're coding it.
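Here is a minimal sketch of such a tokenizer in Python, covering just the tokens needed for the example statement (the token names and the granularity chosen are one option among many):

```python
# Hand-rolled tokenizer for the example statement. Each token is a
# (TYPE, text) pair; unknown characters raise a SyntaxError.
def tokenize(code):
    tokens, i = [], 0
    while i < len(code):
        c = code[i]
        if c.isspace():
            i += 1
        elif c.isalnum() or c == "_":
            j = i
            while j < len(code) and (code[j].isalnum() or code[j] == "_"):
                j += 1
            tokens.append(("WORD", code[i:j]))       # classified further later
            i = j
        elif c == '"':
            j = code.index('"', i + 1)               # find the closing quote
            tokens.append(("STRING_LITERAL", code[i:j + 1]))
            i = j + 1
        elif code.startswith("+=", i):
            tokens.append(("ASSIGN_ADD", "+=")); i += 2
        elif c == "(":
            tokens.append(("OPEN_PAREN", c)); i += 1
        elif c == ")":
            tokens.append(("CLOSE_PAREN", c)); i += 1
        elif c == ";":
            tokens.append(("TERMINATOR", c)); i += 1
        else:
            raise SyntaxError(f"unexpected character: {c!r}")
    return tokens

print(tokenize('if (bar) foo += "a string";'))  # 8 tokens
```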

Lex: So now you have a list of tokens. You probably tagged some tokens with an ambiguous classification like WORD because, during the first pass, you didn't spend too much effort trying to figure out the context of each string of characters. So now read your list of source tokens again and reclassify each of the ambiguous tokens with a more specific token type based on the keywords in your language. If you have a WORD such as "if", and "if" is in your list of special keywords as the symbol IF, you change the symbol type of that token from WORD to IF; any WORD that is not in your special keywords list, such as the WORD foo, is an IDENTIFIER.

Parse: So now you've turned if (bar) foo += "a string"; into a list of lexed tokens that looks like this: IF OPEN_PAREN IDENTIFIER CLOSE_PAREN IDENTIFIER ASSIGN_ADD STRING_LITERAL TERMINATOR. The next step is recognizing sequences of tokens as statements. This is parsing. You do this using a grammar such as:

STATEMENT := ASSIGN_EXPRESSION | IF_STATEMENT

IF_STATEMENT := IF, PAREN_EXPRESSION, STATEMENT

ASSIGN_EXPRESSION := IDENTIFIER, ASSIGN_OP, VALUE

PAREN_EXPRESSION := OPEN_PAREN, VALUE, CLOSE_PAREN

VALUE := IDENTIFIER | STRING_LITERAL | PAREN_EXPRESSION

ASSIGN_OP := EQUAL | ASSIGN_ADD | ASSIGN_SUBTRACT | ASSIGN_MULT

In these productions, "|" between terms means "match any one of these"; commas between terms mean "match this sequence of terms".

How do you use this? Starting with the first token, you try to match your sequence of tokens against these productions. First you try to match your token list against STATEMENT. The rule for STATEMENT says "a STATEMENT is either an ASSIGN_EXPRESSION or an IF_STATEMENT", so you try ASSIGN_EXPRESSION first. Its rule says "an ASSIGN_EXPRESSION is an IDENTIFIER followed by an ASSIGN_OP followed by a VALUE", so you look up the grammar rule for IDENTIFIER. There is no grammar rule for IDENTIFIER, which means IDENTIFIER is a "terminal": it doesn't require further parsing, so you can try to match it directly against your current token. But your first source token is an IF, and IF is not the same as an IDENTIFIER, so the match fails. What now? You go back to the STATEMENT rule and try its next alternative: IF_STATEMENT. You look up IF_STATEMENT; it starts with IF. You look up IF; it's a terminal. You compare it with your first token; the IF token matches. Awesome, keep going. The next term is PAREN_EXPRESSION. It's not a terminal, so what's its first term? PAREN_EXPRESSION starts with OPEN_PAREN. You look up OPEN_PAREN; it's a terminal. You match OPEN_PAREN against your next token; it matches... and so on.

The easiest way to approach this step is to have a function called parse() which you pass the source-code token you're trying to match and the grammar term you're trying to match it against. If the grammar term is not a terminal, then you recurse: you call parse() again, passing it the same source token and the first term of that grammar rule. This is why it's called a "recursive descent parser". The parse() function returns (or modifies) your current position in the stream of source tokens; it essentially passes back the last token in the matched sequence, and you continue the next call to parse() from there.
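A minimal sketch of such a recursive-descent parser in Python, working on (type, text) token pairs and using tuples as makeshift AST nodes (both are illustrative choices; only the one ASSIGN_OP used in the example is handled, and the TERMINATOR is left for the caller):

```python
# One method per grammar production; each method consumes tokens and
# returns a tuple-shaped AST node.
class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def expect(self, ttype):
        if self.peek() != ttype:
            raise SyntaxError(f"expected {ttype}, got {self.peek()}")
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def statement(self):          # STATEMENT := ASSIGN_EXPRESSION | IF_STATEMENT
        if self.peek() == "IF":
            return self.if_statement()
        return self.assign_expression()

    def if_statement(self):       # IF_STATEMENT := IF, PAREN_EXPRESSION, STATEMENT
        self.expect("IF")
        cond = self.paren_expression()
        return ("if", cond, self.statement())

    def assign_expression(self):  # ASSIGN_EXPRESSION := IDENTIFIER, ASSIGN_OP, VALUE
        target = self.expect("IDENTIFIER")
        self.expect("ASSIGN_ADD")  # simplified: only handles +=
        return ("assign_add", target[1], self.value())

    def paren_expression(self):   # PAREN_EXPRESSION := OPEN_PAREN, VALUE, CLOSE_PAREN
        self.expect("OPEN_PAREN")
        v = self.value()
        self.expect("CLOSE_PAREN")
        return v

    def value(self):              # VALUE := IDENTIFIER | STRING_LITERAL | PAREN_EXPRESSION
        if self.peek() == "IDENTIFIER":
            return ("var", self.expect("IDENTIFIER")[1])
        if self.peek() == "STRING_LITERAL":
            return ("str", self.expect("STRING_LITERAL")[1])
        return self.paren_expression()

# The lexed tokens for: if (bar) foo += "a string";
tokens = [("IF", "if"), ("OPEN_PAREN", "("), ("IDENTIFIER", "bar"),
          ("CLOSE_PAREN", ")"), ("IDENTIFIER", "foo"), ("ASSIGN_ADD", "+="),
          ("STRING_LITERAL", '"a string"'), ("TERMINATOR", ";")]
print(Parser(tokens).statement())
# -> ('if', ('var', 'bar'), ('assign_add', 'foo', ('str', '"a string"')))
```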

Each time parse() matches a production like ASSIGN_EXPRESSION, you create a structure representing that piece of code. This structure contains references to the original source tokens. You start building a tree of these structures. We'll call this entire structure the Abstract Syntax Tree (AST).

Compile and/or Execute: For certain productions in your grammar, you create handler functions that, given an AST structure, compile or execute that chunk of the AST.

So let's look at the piece of your AST that has type ASSIGN_ADD. As an interpreter, you have an ASSIGN_ADD_execute() function. This function is passed the piece of the AST that corresponds to the parse tree for foo += "a string". It knows that the first term in the structure must be an IDENTIFIER and the second term is the VALUE, so it passes the VALUE term to a VALUE_eval() function, which returns an object representing the evaluated value in memory. Then ASSIGN_ADD_execute() does a lookup of "foo" in your variables table and stores a reference to whatever VALUE_eval() returned.

That's an interpreter. A compiler's handler functions would instead translate the AST into byte code or machine code rather than executing it.

Steps 1 to 3, and some of 4, can be made easier using tools like Flex and Bison (a.k.a. Lex and Yacc), but writing an interpreter yourself from scratch is probably the most empowering exercise any programmer could attempt. All other programming challenges seem trivial after summiting this one.

My advice is start small: a tiny language, with a tiny grammar, and try parsing and executing a few simple statements, then grow from there.

You make what I consider the classic mistake people make when they think about compiling: believing the problem is about parsing. PARSING IS TECHNICALLY EASY; there are great technologies for doing it. The hard part of compiling is semantic analysis, optimizing at high and low levels of program representation, and code generation, with growing emphasis these days on PARALLEL code. You trivialize this completely in your answer: "a compiler would have handler functions translate the AST into byte code". There are 50 years of compiler theory and engineering hiding in there.
– Ira Baxter Jun 17 '11 at 15:31

The computer field is only complicated because it has had time to evolve in many directions.
At its heart it is just about machines that compute.

My favorite very basic computer is Harry Porter's Relay Computer.
It gives a flavor of how a computer works at the base level.
Then you can start to appreciate why things like languages and operating systems are needed.

The thing is, it's hard to understand anything without understanding what needs it.
Good luck, and don't just read stuff.
Do stuff.

If you are interested in understanding the essence of programming languages, I would suggest that you work through the PLAI book (http://www.cs.brown.edu/~sk/Publications/Books/ProgLangs/) to understand the concepts and their implementation. It will also help you with the design of your own language.

If you really have an interest in compilers and have never written one before, you could start by designing a calculator for computing arithmetic formulas (a kind of DSL, as Eric mentioned). There are many aspects you would need to consider for this kind of compiler:

Allowed numbers

Allowed operators

The operator priorities

Syntax validation

Variable look up mechanism

Cycle detection

Optimization

For example, given the following formulas, your calculator should be able to calculate the value of x:

a = 1
b = 2
c = a + b
d = (3 + b) * c
x = a - d / b

It's not an extremely difficult compiler to start with, but it could get you thinking about the basic ideas of what a compiler is, help you improve your programming skills, and push you to control the quality of your code (this is actually a perfect problem to apply Test-Driven Development (TDD) to, to improve software quality).
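As a sketch of where such a calculator could end up, here is a minimal Python version. It borrows Python's own ast module for tokenizing and parsing so the example can focus on variable look-up and evaluation; a from-scratch version would replace ast.parse with your own tokenizer and parser:

```python
import ast

def evaluate(program):
    """Evaluate 'name = expression' lines top to bottom; return the variables."""
    env = {}
    for line in program.strip().splitlines():
        name, expr = line.split("=", 1)
        node = ast.parse(expr.strip(), mode="eval")
        env[name.strip()] = eval_node(node.body, env)
    return env

def eval_node(node, env):
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        return env[node.id]                  # variable look-up mechanism
    if isinstance(node, ast.BinOp):          # operator priority came from the parser
        left, right = eval_node(node.left, env), eval_node(node.right, env)
        if isinstance(node.op, ast.Add):
            return left + right
        if isinstance(node.op, ast.Sub):
            return left - right
        if isinstance(node.op, ast.Mult):
            return left * right
        if isinstance(node.op, ast.Div):
            return left / right
    raise ValueError(f"unsupported expression: {ast.dump(node)}")

env = evaluate("""
a = 1
b = 2
c = a + b
d = (3 + b) * c
x = a - d / b
""")
print(env["x"])  # -6.5, since d = 15 and x = 1 - 15/2
```

Cycle detection and optimization are left as exercises; this version simply evaluates lines in order, so a variable used before it is defined raises a KeyError.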