Deliverable code?

I wonder what LtUers think about deliverability. What is deliverable code? When I think about dynamic vs. static languages, a big difference is the nature of the end product. A scripting language might conclude with a script deliverable, but does anyone want to deliver a script? Dynamic languages are "programmer friendly", but the final compiled deliverable is almost certainly "compromised" (i.e. slower or less safe). Languages like Java and C# require an intermediate virtual machine. This is only acceptable due to the pervasiveness of the environments, and it probably compromises other languages using the VM. Can we have it both ways? Good deliverable code and a highly productive development environment? For example, I have occasionally developed an application in Lisp and then converted it to C (but not recently). Surely there is a better way?

You may want to look into an old OSF project from the late '80s called "Architecture Neutral Distribution Format" (ANDF), that was pitched toward the question of delivering retargetable executables without delivering source. It didn't go much of anywhere, partially because that problem is somewhat alleviated for VM-based languages, partially because the web makes it easier to simply deliver multiple compiled versions of a given piece of source, and partially because it's a really hard problem.

When I think about dynamic vs. static languages a big difference is the nature of the end product.

I can already see the froth starting at the corners of Ehud's mouth. ;)

The difference you're seeing is mainly an artifact of history and community emphasis, and doesn't have much to do with anything technical.

Compilers are available for most of the popular languages: Python has Psyco; PHP has Roadsend, Quercus, Zend, and Phalanger; and Java, in addition to JIT VMs, has GCJ. Lisp and Scheme have long had good compilers, and converting to C by hand shouldn't be necessary these days unless you need to deliver non-Lisp human-readable source, which is a different issue.

For languages other than C that are used heavily in open source and free software, there's been less emphasis on compilers, since delivery via source is considered a benefit. Plenty of people are running Python, Perl, PHP and Ruby applications direct from source (LtU is an example), and there often aren't good reasons to bother compiling those applications, because they tend to be I/O bound.

In the Java case, the VM is considered a benefit because you can deliver the same binary representation across platforms. There are also other advantages to delivering programs using an intermediate representation, e.g. it's much easier to plug them into web frameworks, testing frameworks, IDEs, or other applications.

This means that the definition of "good deliverable code" isn't necessarily just a monolithic native executable. I'd say good deliverability means having a range of options, from interpreter-like to bytecode-compiled to native code compiled.

Can we have it both ways? Good deliverable code and a highly productive development environment?

Of course, and there are quite a few good examples. I'd point to Lisp and Scheme, and Paul would point to OCaml. While the latter isn't usually considered "dynamic" (depending on exact definition), it nevertheless has a REPL and can be interactively developed in an IDE. Some Schemes are not much less dynamic than OCaml from a technical perspective, other than the typing issue, but they manage to behave dynamically anyway.

And that's the crux of this: what is a dynamic language? The term generally refers to a whole family of features that are commonly found together, but few if any of those features are technically unique to the languages people think of as dynamic languages. For example, if you can compile and load code into a running system, as you can in an OCaml or SML REPL, it goes a long way towards blurring the distinction. Support for a bytecode compiler also tends to blur the distinction, as in the Java case. In general, the distinction has been getting blurrier as the state of the art has improved.
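
The "compile and load code into a running system" feature is easy to demonstrate in any language whose compiler is available at runtime. A minimal Python sketch (the function and registry names here are invented for illustration):

```python
# Sketch: extending a running program with new code, the kind of
# "dynamic" feature a REPL-equipped static language can also offer.

registry = {}

def load_definition(source):
    """Compile a string of source and install its top-level callable
    definitions into the live registry, much as a REPL would."""
    namespace = {}
    exec(compile(source, "<dynamic>", "exec"), namespace)
    for name, value in namespace.items():
        if callable(value):
            registry[name] = value

# The system is already running; we hot-load a new function into it.
load_definition("def greet(who):\n    return 'hello, ' + who")
print(registry["greet"]("world"))  # hello, world
```

The same interaction in an OCaml or SML REPL compiles the new definition (to bytecode or native code) before loading it, which is exactly why the line between "dynamic" and "static" blurs here.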

Surely there is a better way?

Yes. Dynamic languages can easily do a better job of generating deliverables if they want to, and static languages can, a little less easily, support dynamic features if they want to.

The asymmetry in ease described in the previous sentence is really central to the distinction between language types. The main reason we associate dynamic features with dynamic languages is that those features come for free in language implementations which don't do much static analysis and rely heavily on runtime structures to achieve their semantics. We can get most of the dynamic features worth having in more statically sophisticated languages, but it requires more work.

Can we have it both ways? Good deliverable code and a highly productive development environment?

Of course, and there are quite a few good examples. I'd point to Lisp and Scheme, and Paul would point to OCaml.

Actually, I'd point to Common Lisp, many Schemes, some SML implementations (SML/NJ and MLton), as well as O'Caml. I can't seem to find the post now, but some time back I think I nailed down a good chunk of what frustrated me about the dynamic-language love that I often see, which boils down to the benefits of the development environments—benefits that I too had believed to accrue only to languages firmly in the Smalltalk/Lisp camp. So to find type-and-go development without tons of type declarations, time-travel debugging, a very expressive type system (at least if your name is Oleg), and compile-to-single-deliverable-native-binary in the form of O'Caml has been a real eye-opener, and has regrettably led me to bouts of extremism in its advocacy. But in my more rational, calmer moments, I write stuff like Whys and Wherefores, which I think is my best writing on LtU with respect to how I actually see the static and dynamic typing issue, and which I think supports Anton's point: modern research on optional static typing, gradual typing, etc. and efforts on the static language (if you will) side to escape from the edit/compile/link/test/crash/debug cycle are blurring many, if not most, of the traditional distinctions that have been made in the industry, which I think is all to the good.

Update: I also neglected to add that I think that this will ultimately be addressed within the context of addressing security concerns. That is, as more and more code gets exchanged among mutually distrustful parties in increasingly-indirect ways—separately-compiled modules downloaded from the Internet and deliberately linked by the user; DLLs automagically downloaded and loaded at runtime; user-scripted code attached to otherwise innocuous "documents" (think MS Word macro viruses); distributed systems serializing code over network connections; the list goes on—I think the importance of Foundational Proof-Carrying Code will become increasingly apparent, and it so happens that the road to FPCC looks an awful lot like it involves TILs and TALs (Typed Intermediate Languages and Typed Assembly Languages). I can pretty easily envision a general-purpose code distribution mechanism that is essentially a TIL that has a certified verifier associated with it, as well as JIT compilation to an appropriate TAL and thence to native code. This would seem to me to be the logical intersection of Appel et al's work in A Very Modal Model of a Modern Major General Type System and Chang et al's (including Adam Chlipala, who has posted here before) work on certified verifiers.

When I think about dynamic vs. static languages a big difference is the nature of the end product.

I apologize for vagueness here. What I wanted to do is compare environments that favor compilation vs. those that favor interpretation. Choosing interpretation will have a speed penalty at least, but this may be a necessary tradeoff. Reactive systems seem to be the best example. A reactive system must create and operate on data as it moves along in time, so there don't seem to be opportunities to compile in this case. But given that this is a real problem and not just a choice, we must deal with the efficiency and safety issues in this context.

I apologize for vagueness here. What I wanted to do is compare environments that favor compilation vs. those that favor interpretation.

Do you mean compilation to native code? Of course, many language implementations, of dynamic languages or otherwise, use compilation to bytecode or some other intermediate representation. Are you categorizing those as environments which favor interpretation (of the IR)?

If I understand what it is you're interested in, a couple of properties that you should probably consider are:

How much metadata the compilation process retains about the program. Historically, compilation to native code has involved discarding most metadata, leaving a native code representation of the program about which little is known, so little runtime manipulation is possible. That's still a common approach. However, languages such as Java, which (canonically) compile to bytecode first and then may JIT-compile to native code, retain more metadata and are correspondingly better able to support runtime manipulation of a program. The various bytecode manipulation libraries for Java (e.g. CGLIB, ASM, BCEL) demonstrate this, allowing classes to be created and extended, and interfaces to be implemented, at runtime. In theory, a language could be statically compiled to native code but still retain the necessary metadata to support runtime manipulation of programs, and interaction with IR-based modules. So in this case it may not actually be whether the deliverable uses an IR that's important, but rather the availability of metadata, which IRs happen to be good at providing.

The cost of compilation. It's cheaper to compile to a typical IR than to native code, particularly optimized native code. If you're going to do runtime code generation and manipulation, then an IR helps for this reason. The JIT compilation approach can delay compilation to native code until the benefits can be predicted to outweigh the costs.

With both of the above points, having the deliverable involve an IR provides benefits. In particular, the second point may almost mandate an IR in some cases, even if you can avoid it on the first point.

Of course, there are other factors in the choice between native code and other kinds of deliverables, such as protection of source code, or resistance to modification (where native code has something of an obscurity advantage), or speed of initialization. Protection of source code and resistance to modification are both at odds with retention of metadata. That could perhaps be addressed, to some extent, by metadata encryption. Speed of initialization of IR-based systems can be addressed by things like heap snapshots. You can have most of your cake and eat most of it, but it can require a lot of work.
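
The metadata point above has a cheap analogue in Python, where classes are ordinary runtime objects. The effect is loosely similar to what Java bytecode libraries like CGLIB or ASM enable, though the mechanism differs; the class names here are invented:

```python
# Sketch: creating and extending classes in a running program. Python
# retains full class metadata at runtime, so this needs no special
# tooling.

class Base:
    def describe(self):
        return "base"

# Build a subclass at runtime from a name, a tuple of bases, and an
# attribute dict -- roughly what a bytecode library does when it emits
# a new class in memory.
Derived = type("Derived", (Base,), {
    "describe": lambda self: "derived from " + Base.describe(self),
})

print(Derived().describe())  # derived from base
```

A natively compiled program could support the same thing, but only if the compiler deliberately kept the class metadata around.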

"Environments that favor interpretation" is just an expression to get around words like "dynamic". The word "reactive" seems to get to the point. A reactive application is just a simple or complicated interpreter. Also the interpreter is simply composing code that is already compiled and available in a library. Such an interpreter is probably more like a DSL than a complete language in the traditional sense, although it might be used that way. Most of the "AI" languages like Lisp, Jess, Prolog, and many more have JVM implementations.

While I have no serious problem with the JVM approach or other intermediate languages and VMs, I am not persuaded that this is the best approach for these environments (i.e. reactive, AI). An interpreter of the type needed is quite simple and only composes compiled code on a one-time basis. The composition will not be used again. It is hard to see how just-in-time compilation would be faster in this case. It would be nice to see some research or test results for this situation.
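
The kind of one-shot composition in question can be sketched in a few lines of Python (the operation names are invented for illustration):

```python
# Sketch: the interpreter wires together already-compiled library
# primitives for each incoming event, runs the result once, and throws
# it away -- so there is little for a JIT to amortize.

LIBRARY = {                       # stands in for compiled library code
    "scale":  lambda x, n: x * n,
    "offset": lambda x, n: x + n,
}

def interpret(program, value):
    """Run a throwaway pipeline like [("scale", 3), ("offset", 1)]."""
    for op, arg in program:
        value = LIBRARY[op](value, arg)
    return value

# Each event arrives with its own one-time program.
print(interpret([("scale", 3), ("offset", 1)], 10))  # 31
```

Since nearly all the time is spent inside the precompiled primitives, the dispatch overhead of interpreting the composition is usually negligible.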

What I find odd is the assumption that the form of typing (dynamic vs. static) has any effect whatsoever on the form of delivered executable (script, virtual machine, native executable). Especially considering that all three forms have advantages and disadvantages, and are useful in different contexts and for different purposes. The very strong trend in functional languages, at least, is to provide implementations of all three, and allow the developer to choose which one to use at any given point. I know that both Haskell and OCaml provide all three, and developers use all three. And I believe that Lisp and Scheme provide all three. It's weird, backwards, poorly specified languages like Java, Ruby, and C++ which force a single executable form.

Certainly none of the three you mention do. C/C++ interpreters are available. Java native compilers are available. I'm sure someone has done a native compiler for Ruby.

I'm not sure how meaningful the distinction between "interpreter" and "VM" is anyway--the usual distinction is that an interpreter is a machine whose instruction set is the source language, while a VM is a machine whose instruction set is some IL. Many common examples of the former of course do a quick compile into an IL under the hood, and then employ a VM like Parrot (or even the JVM). Likewise the distinction between VM and hardware: the .NET platform usually aggressively compiles MSIL to native code whenever a .NET application is run--as do many JVM implementations.

The three languages find themselves often deployed to specific application types, where a given mode of translation and execution is preferable. C++ is routinely used for standalone apps where performance is a concern; compilation to native code is therefore the dominant means of deploying and executing C++ code. Ruby and the numerous "P" languages are used for (dare I say it) scripts and other lightweight (and I/O-bound) apps; fast startup and portability are concerns; performance is not. Thus, they are usually deployed as source and "interpreted". Java and many of the .NET languages occupy a middle ground where delivery of an IL makes the most sense.

What I find odd is the assumption that the form of typing (dynamic vs. static)

"Dynamic language" is not usually exactly synonymous with "dynamic typing", although there tends to be a lot of overlap. The wikipedia definition looks as good as any:

a class of high level programming languages which share many common runtime behaviors that other languages might perform during compilation, if at all. These behaviors could include extension of the program, by adding new code, or by extending objects and definitions, or by modifying the type system, all during program execution.

These kinds of features tend to be more difficult to implement in native code compilers. Relatively pure interpreters and (to varying degrees) compilers to intermediate representations tend to have better dynamic properties. You sometimes see this in a single language: e.g. both OCaml and Gambit Scheme have better dynamic capabilities in their bytecode versions than in their native code versions.

Native code compilation is only supported on Linux and FreeBSD
i386 platforms.
For installation, type:

./configure -prefix `pwd`
make world

make opt
make metaocamlopt
make install

The native code compiler can be run using the command metaocamlopt.

I do believe that Concoqtion, being based on MetaOCaml, inherits this limitation. But the limitation appears to have nothing to do with anything other than available resources to perform the other desired ports.

But the limitation appears to have nothing to do with anything other than available resources to perform the other desired ports.

Yes, and I didn't mean to imply that, I was just stating the simple fact that at the moment, most of these features/extensions are bytecode only. Glad to see they've added native code to MetaOCaml! I wonder if anyone's done any analysis on code generation performance; I'm curious about runtime code gen overheads, and OCaml's well designed core seems like it would be a good testbed.

For Gambit, the interpreter supports serializable closures and continuations, but the compiler doesn't. (That case may not be as straightforward as bytecode vs. native code, but the same general principle applies.)

That's correct, due to the manner in which the stack is copied. In fact, in a way, it's even worse: to build the library, you need to have the O'Caml source tree available, because the library uses non-public APIs, if you will, to do its dirty work.

Luckily, in that same entry, Oleg also writes:

For comparison we show another implementation, which requires the monadic style of writing code. That implementation however is in pure OCaml and works both for byte-code--interpreted as well as ocamlopt-compiled programs. That implementation is, too, strongly typed and needs no bizarre 'a cont types and no loopholes.

The monadic code is predictably ugly, so I asked Oleg if he would please integrate it with Jacques Carette's, Lydia E. van Dijk's, and his Syntax extension for Monads in OCaml, which he was kind enough to do in the 1.1 version. For those who aren't aware, the syntax extension essentially brings Haskell's "do-notation" to O'Caml, making O'Caml's support for monads every bit as complete as Haskell's, modulo the fact that Haskell runs everything inside an über-mondo IOSingingDancing monad because it's pure and lazy, while O'Caml doesn't, because it's impure and strict.
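
For anyone wondering what "monadic style" actually demands of the programmer, here is a minimal continuation monad sketched in Python rather than OCaml. This shows the general technique only, not Oleg's implementation:

```python
# A minimal continuation monad, to illustrate why code written in
# monadic style looks so different from direct-style code.

def unit(x):
    """Wrap a value: a computation that just hands x to its continuation."""
    return lambda k: k(x)

def bind(m, f):
    """Sequence: run m, feed its result to f, continue with f's result."""
    return lambda k: m(lambda x: f(x)(k))

def run(m):
    return m(lambda x: x)

# Direct style:  (1 + 2) * 10
# Monadic style: every intermediate step is threaded through bind.
prog = bind(unit(1),
            lambda a: bind(unit(2),
                           lambda b: unit((a + b) * 10)))
print(run(prog))  # 30
```

Do-notation (or the OCaml syntax extension) exists precisely to hide the nested binds and lambdas above behind something that reads like direct style.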

I just want to clarify, for anyone who may not be paying close attention, that the existence of the monadic version of delimited continuations, with or without the syntax extension, doesn't affect the original point, which is that you can do things with bytecode OCaml that you can't do with native code OCaml, for the kinds of reasons I described.

Part of the point of my response was to recapitulate that, from a feature point of view, you actually can use exactly the same delimited continuations in either bytecode or native O'Caml; it's only the implementation approach and syntax that varies slightly. Also, given that the native implementation requires access to the source tree, wouldn't it be more accurate to say that the native implementation relies, not on the fact that the implementation is a bytecoded one, but rather specifically upon the structure of stack frames—a structure that could just as well be provided in native code, but wasn't, because delimited continuations weren't provided by O'Caml out of the box? These seem to me to be accidents of history, not fundamental properties of bytecode vs. native implementations.

Part of the point of my response was to recapitulate that, from a feature point of view, you actually can use exactly the same delimited continuations in either bytecode or native O'Caml; it's only the implementation approach and syntax that varies slightly.

I'm not sure I'm understanding you: afaik, the monadic version requires that the code that uses the delimited continuation be written in monadic style. Whether or not you use the syntax extension to do that, for the user of delimited continuations it's more than just a matter of the syntax varying slightly. The same end result is only achieved for certain values of "same", i.e. if you ignore that the programmer is required to use a different programming style.

Also, given that the native implementation requires access to the source tree, wouldn't it be more accurate to say that the native implementation relies, not on the fact that the implementation is a bytecoded one, but rather specifically upon the structure of stack frames—a structure that could just as well be provided in native code, but wasn't, because delimited continuations weren't provided by O'Caml out of the box?

I'm saying that it's not a coincidence that the necessary structure was available for the bytecoded version: it's a function of the fact that bytecode and bytecode interpreters carry and use metadata that's often seen as unnecessary for natively-compiled programs.

These seem to me to be accidents of history, not fundamental properties of bytecode vs. native implementations.

I don't think they're just accidents, but in my first comment in this thread I used the phrase "artifact of history" about a related issue. It's not just an accident that bytecode implementations tend to have more metadata available to them. While it's possible for a native code compiler to keep all the same metadata, they have tended not to because they didn't think they needed to. Bytecode implementations, and even more so dynamic languages, didn't have the same degree of choice to throw away info.

There are also other factors which make dynamic behavior more difficult to implement at the native code level, particularly if you're talking about optimized native code. For example, dynamic behavior tends to be much simpler to implement if you rely on indirection, whereas many optimizations depend on removing indirection. Consider a direct function call to a machine address, vs. a call via a name that has to be looked up at runtime in a symbol table. In the former case, to achieve hot-loading of code at runtime, you need to keep track of all your linking information at runtime, so you can modify addresses throughout the runtime image as necessary. In the symbol table case, you just need to change the address in one place, and you don't need any additional information to locate that place. So being native and optimized has big consequences for the cost of implementing dynamic features, which also helps to explain why bytecode implementations and dynamic languages have had an edge here.
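
The symbol-table side of that trade-off can be sketched in a few lines of Python (names invented for illustration):

```python
# Every call pays one extra lookup, but hot-swapping a definition means
# changing a single table entry, with no relinking of the running image.

symbols = {}                        # runtime symbol table

def define(name, fn):
    symbols[name] = fn

def call(name, *args):              # one indirection per call
    return symbols[name](*args)

define("tax", lambda amount: amount * 0.25)
before = call("tax", 100)           # 25.0

# Hot-load a replacement: one entry changes; no addresses are patched.
define("tax", lambda amount: amount * 0.5)
after = call("tax", 100)            # 50.0
print(before, after)
```

An optimizing native compiler would want to inline or hard-code the call to `tax`, which is exactly what makes the later swap expensive: every call site would need to be found and patched.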

Indeed. That paper nicely underscores the point that to support dynamic behaviors, compilers need to keep more program metadata around. Notice how much the approach described by the paper depends on dynamic type info and dynamic descriptions of module signatures.

Precisely. In my recent readings here, I've belatedly arrived at the conclusion that "dynamic typing" is not the problem, but type erasure is—a point that I think you've been making all along, actually, by reinforcing the point that dynamic types reflect things that we know statically about our programs. This sounds quite a lot like the point of Proof-Carrying Code, which seems to have connections to TILs and TALs, etc. And if you have an interpreter for a TIL (or TAL, for that matter), don't you have something functionally identical to a "dynamically typed" language?

Oldtimers may remember how amazed I was when I discovered that generics were not always compiled by macro expansion and type erasure, as usually done by Ada compilers, and that the generic type information can (and should) be kept in the IL. I first noticed this in the context of cross language runtimes. I think the issue is analogous, and indeed the conclusion is the same: the fact that compilers can optimize away information that is needed for some types of dynamic (i.e., run time) behaviour does not mean that this implementation choice is mandatory, when dynamic behaviour is what you are after.

As usual, it is crucial to distinguish between implementation details and language (i.e., semantic) issues. The distinction is often subtle, as the examples above illustrate, but it is a distinction we should keep on emphasizing, so as not to confuse those outside the field.

I don't think such a drastic measure is warranted. In fact, type erasure is arguably desirable, as you can build dynamism on top of type-erasure semantics to support generics, dynamic loading, etc., and you only pay for what you use [1]. Using term-level representation types [2], and using the already provided mechanisms for dynamic loading as Oleg describes here [3], I think one can build a safe linking mechanism, and the benefit is that it's not hard-coded policy provided by the runtime, but a replaceable policy that can be overridden by the program.
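
As a rough illustration of the safe-linking idea, here is a Python sketch with an invented signature format; this is not the mechanism from the cited references, just the general shape of it:

```python
# Sketch: each module carries a runtime (term-level) description of its
# signature, and the linker checks it before wiring modules together.
# Only code that links dynamically pays for the representation.

class LinkError(Exception):
    pass

def link(module, expected_sig):
    """Check a dynamically loaded module against the signature the
    importer was compiled against, then hand back its exports."""
    for name, ty in expected_sig.items():
        if module["sig"].get(name) != ty:
            raise LinkError("mismatch on %r: wanted %s" % (name, ty))
    return module["exports"]

math_mod = {
    "sig":     {"double": "int -> int"},
    "exports": {"double": lambda n: n * 2},
}

exports = link(math_mod, {"double": "int -> int"})
print(exports["double"](21))  # 42
```

Because the check is ordinary library code rather than a fixed runtime service, a program could substitute a stricter or looser linking policy of its own.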

...and point well taken. I should clarify my stance by recapitulating my opinion that there are two major domains to be tackled in programming language design today:

Concurrency

Security

I can envision an embedding of the Pi Calculus into a type-erased type system à la TyPiCal, but addressing security seems to me to involve Proof-Carrying Code, which involves type passing. I am, however, open to further information as to why this belief may not be warranted.

There's no good reason to hang on to the info once something's passed a security check, so doing type erasure at load-time would be valid - as would supplying type-erased code plus the means to reconstitute the typed version in order to trade checking time against running time.

Which reminds me, this doesn't have to be incompatible with a type system that's only semi-decidable if along with the type you pass the typing as well. Or perhaps some sensible residue thereof, like the number of steps taken to finish checking.

I'm not sure what type of security you're arguing for, but I agree that at its lowest level, proof-carrying code is necessary to ensure some minimum safety properties if loading native code. I believe memory safety is all you need for capabilities, and then on top of that, you can build something like Marc Stiegler's Emily, a capability-secure variant of OCaml; from that, we can get Oleg's lightweight static capabilities, which yields security nirvana for us humble programmers. :-)

I don't think it is desirable to "dump type erasure" entirely. In particular, type erasure is just an optimization technique exploiting parametricity, and there are good reasons to want that, e.g. Theorems for free. And of course, type passing can be pretty costly, if you have it for every little polymorphic function.

The paper discusses this a bit, and we carefully designed the system in such a way that parametricity and type erasure still apply fully to the ML core language. Only functors are type-passing.

Well, while technically that erases types, it really just means that you represent the type information differently. It's basically an implementation detail, and does not reduce the costs nor recover parametricity, so it probably does not address any of the concerns people have about type passing.

Philippa Cowderoy: There's no good reason to hang on to the info once something's passed a security check, so doing type erasure at load-time would be valid

That depends on how much dynamism you want. Note that the paper describes a component system where you can not only import components dynamically, you can also create and export them dynamically via pickling - which is important for higher-order distributed programming.

Also, the signatures you check against at load time are generally not fully static, but often contain references to types coming from other dynamically loaded components. This is crucial to be able to express "dynamic type sharing" in a language with abstract types. Since you generally cannot know in advance which types may be referred to later in such a "dynamic signature" you have to keep a runtime representation for them (in the Alice system, that representation is computed lazily, however).

Well, while technically that erases types, it really just means that you represent the type information differently. It's basically an implementation detail, and does not reduce the costs nor recover parametricity, so it probably does not address any of the concerns people have about type passing.

It does not recover full parametricity no, but [2] that I reference provides an analysis of what sorts of theorems representation types provide for free. With representation types we trade parametricity for polytypism, which sometimes may be warranted. The point of these term-level representation types, is that you only pay for them when you use them, which is far better than the overhead present in .NET, for example, where types and metadata are a fixed overhead cost.

There's no good reason to hang on to the info once something's passed a security check

Sure there is. Most notably, the chance that it will have to pass another security check, as execution flow passes into a new security domain. Yes, this chews up cycles and bytes, but frankly those are getting cheap, and this is arguably the best possible place to spend 'em.

(This was meant to be a response to Philippa's most recent, but I evidently hit the wrong button.)

Sure there is. Most notably, the chance that it will have to pass another security check, as execution flow passes into a new security domain.

Agreed, this would be one situation where it would be needed, but I don't find this approach to security palatable. I'm firmly of the capability security persuasion, and memory safety is all you really need for capabilities, so I agree: erase all types once all type computation is complete. :-)