Plasma Language Reference


As the language is under development this is a working draft.
Many choices may be described only as bullet points.
As the language develops these will be filled out and terms will be
clarified.

Lexical analysis and parsing

The "front end" passes of Plasma compilation work as follows:

Tokenisation converts a character stream into a token stream.

Parsing converts the token stream into an AST.

AST→Core transformation converts the AST into the core representation.
This phase also performs symbol resolution, converting textual identifiers
in the AST into unique references.

Lexical analysis

Input files are UTF-8

Comments begin with a # and extend to the end of line.

Curly braces for blocks/scoping

Whitespace is only significant when it separates two tokens that would
otherwise form a single token

Statements and declarations are not delimited. The end of a statement can
be determined by the statement alone. Therefore: there are no statement
terminators or separators (such as semicolons in C) nor significant
whitespace (as in Python or Haskell).

String constants are surrounded by double quotes and may contain the
following escapes: \n \r \t \v \f \b \\. Escaping the double quote
character is not currently supported, and neither are character codes.
Escaping any other character yields that character as is; this allows
\' to work as many programmers may expect, even though it’s not necessary.
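The escape rules above can be sketched as a small decoder. This is an illustration in Python, not the compiler's implementation; the name `unescape` and the choice to reject a trailing backslash are assumptions made for the sketch.

```python
# Escapes named in the text; each maps to a single character.
ESCAPES = {
    "n": "\n", "r": "\r", "t": "\t",
    "v": "\v", "f": "\f", "b": "\b",
    "\\": "\\",
}


def unescape(body: str) -> str:
    """Decode the text found between the double quotes of a string constant."""
    out = []
    chars = iter(body)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        nxt = next(chars, None)
        if nxt is None:
            # Assumption: a lone trailing backslash is treated as an error.
            raise ValueError("dangling backslash")
        # Any escape not in the table yields the escaped character itself.
        out.append(ESCAPES.get(nxt, nxt))
    return "".join(out)
```

The sketch works on the body of the string only, so it does not model the fact that the tokeniser currently cannot accept an escaped double quote.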

Parsing

Plasma’s EBNF is given in pieces throughout this document as concepts are
introduced.
However the top level and some shared definitions are given here.
In this EBNF syntax I use ( and ) to denote groups and ?, + and * to denote
optional, one or more, and zero or more.

A note on case and style.

Sometimes it is necessary to use case to distinguish symbols in
different namespaces that may appear in the same expression. For example
type names and type variables can both appear in type expressions. In other
situations there is no requirement but it can be useful to adopt a
convention that makes it easier to read code.

Disambiguation based on case is done as part of AST→Core transformation.

Plasma either requires or suggests the following cases in the following
situations.

Symbol             Requirement          Suggestion   Notes
Variable           first letter lower   lower_case
Function Name      first letter lower   lower_case
Module Name        -                    UpperCase    Case insensitive
Type Name          first letter upper   UpperCase
Type Variable      first letter lower   lower_case
Data constructor   first letter upper   UpperCase    to distinguish construction from function application or variable use.
Field selector     first letter lower   lower_case   Must be the same as function names.
Interface          ?                    UpperCase
Instance           ?                    lower_case   not first class, but may appear in expressions.
Resources          -                    lower_case

Note that there may be more symbol namespaces in the future.

The rationale for these decisions is:

Variables, functions and field selectors

The most common symbols should be in lower case. Using _ to separate
words is preferred, but not enforced.

Module names

It is useful to visually distinguish module names from other symbols that
can appear within expressions. Currently module names can only be used to
module-qualify other symbols, but this may change in the future. This may
also become a requirement rather than a suggestion.

Types and type variables

Type variables must be distinguished from types. Note that variables
don’t need to be distinguished from functions as this is available from
context: free variables do not exist and a bound variable has the same
semantics as a defined function name.

List(t) is a list of t, where List is the list type and t is a type
variable that may stand for any type. Note that this is the same as
Haskell but the opposite of Mercury.
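As an illustration of case-based disambiguation inside a type expression, the following Python sketch labels each identifier by its first letter. The identifier pattern and the function name `classify_type_symbols` are assumptions; the real check happens during the AST→Core transformation and is not shown here.

```python
import re

# Assumed identifier shape: letters, digits and underscores.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def classify_type_symbols(type_expr: str) -> list[tuple[str, str]]:
    """Label each identifier in a type expression by its first letter."""
    labelled = []
    for name in IDENT.findall(type_expr):
        # First letter upper: type name. Otherwise: type variable.
        kind = "type name" if name[0].isupper() else "type variable"
        labelled.append((name, kind))
    return labelled
```

For `List(t)` this reports `List` as a type name and `t` as a type variable.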

Data constructors

Code that does different things should look different.
Therefore data construction should stand apart from function calls, and
hence it is useful if data constructors begin with capital letters.
It could be argued that the same is true for field selection. Suggestions
welcome.

Interfaces and Interface Instances

Interfaces are to instances as types are to values.
This is reflected in our decision to suggest that interfaces should be
CamelCase and instances lower_case.
Also, instances and module qualifiers can both appear within
expressions as a prefix to another symbol.
Instances will also appear distinct from module qualifiers.

Environment

The environment is a concept we will consider for Plasma’s scoping rules.
The environment maps symbols to their underlying items (modules, types,
functions, variables etc). Even though no environment exists at runtime,
and the compile-time structure is an implementation detail of the compiler
(pre.env), it is useful to think of scoping in these terms, as it explains
most scoping behaviours.

Some languages allow overloading of symbols, usually based on a symbol’s
type and sometimes on its arity. Plasma does not support any overloading.

Scopes

When a new name is defined it is added to the current environment.

print!(x) # x does not exist.
x = "hello" # x (a variable) is added to the environment.
print!(x) # We may now refer to x.

When a nested block starts, it creates a new environment based upon the old
environment.

x = "hello"
if (...) {
print!(x) # Ok
}

When a nested block ends, the original environment is restored.

if (...) {
x = "Hello"
print!(x) # Ok
}
print!(x) # Error
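The three rules above can be modelled with a chain of environments. This Python sketch is only a way to think about scoping; the class `Env` and its method names are invented for the illustration and do not describe the compiler's pre.env structure.

```python
class Env:
    """Maps names to items; each nested block gets a child environment."""

    def __init__(self, parent=None):
        self.parent = parent
        self.items = {}

    def _chain(self):
        # Walk from the innermost environment out to the top level.
        env = self
        while env is not None:
            yield env
            env = env.parent

    def define(self, name, item):
        self.items[name] = item

    def visible(self, name):
        return any(name in env.items for env in self._chain())

    def lookup(self, name):
        for env in self._chain():
            if name in env.items:
                return env.items[name]
        raise KeyError(f"{name} is not in scope")

    def nested(self):
        return Env(parent=self)


outer = Env()
outer.define("x", "hello")   # x is added to the current environment
inner = outer.nested()       # a nested block starts
inner.define("y", "Hello")   # y exists only inside the block
# When the block ends we go back to using `outer`, where y is unknown.
```

Restoring the original environment needs no work in this model: the outer environment was never modified by the block.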

Variables

When all paths (of a branching statement) assign the same variable, then
that variable is added to the original environment. No earlier
declaration is required. This only makes sense for branching statements,
and only happens with variables.

if (...) {
import SortedListSet as Set
# Some code using Set
} else {
import RBTreeSet as Set
# Some code using Set
}
# Cannot use Set here.

Allowing this for other symbols, such as the imports above, would be
desirable; however it complicates the type system. Instead, interfaces
provide a less direct way to accomplish this kind of thing.
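The rule for variables bound by a branching statement amounts to a set intersection. The Python sketch below assumes each branch has already been summarised as the set of variables it binds; the function name is hypothetical, and branches that do not fall through are not modelled.

```python
def bindings_after_branches(outer: set[str], branches: list[set[str]]) -> set[str]:
    """Variables visible after a branching statement.

    `outer` holds the variables visible before the statement. Each element
    of `branches` holds the variables bound by one branch. A missing else
    branch should be passed as an empty set.
    """
    bound_on_every_path = set.intersection(*branches) if branches else set()
    return outer | bound_on_every_path
```

With branches binding `{x, t}` and `{x}`, only `x` joins the outer environment; `t` stays local to its branch.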

Shadowing

Shadowing refers to a new binding with the same name as an old binding being
permitted and dominant in an inner or later scope.
Shadowing is not permitted for variables at all. It is permitted for other
symbols.

Note

TODO: Decide on rules for a symbol of one type overriding a symbol of
another type. For example it should probably be an error for a module
import to shadow an interface declaration. But it’s probably okay for a
variable to overload a function, unless that function is defined within
another function (a closure).

Variables

A variable cannot shadow another variable.

x = 3
x = 4 # Error
if (...) {
x = 5 # Error
}

Note

We are considering a special syntax to use with variables that allows
shadowing.

Other symbols

Symbols other than variables allow shadowing. For example, a module import
can shadow existing symbols with its contents (types, functions etc.),
including when import is used with a wildcard. Therefore we can use a
different Set implementation in the inner scope:

import SortedListSet as Set
...
# some code
...
{
import RBTreeSet as Set
...
# some code using RBTreeSets
...
}
...
# back to SortedListSet
...

(Yes, module imports may appear within function bodies and so-on.)
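The difference between variables and other symbols can be sketched as a check made when a name is defined. This Python model is an assumption-laden illustration: `Scope`, `ShadowError` and `try_define` are invented names, and only the variable-shadows-variable case is rejected because the rules for mixed kinds are still a TODO above.

```python
class ShadowError(Exception):
    pass


class Scope:
    def __init__(self, parent=None):
        self.parent = parent
        self.symbols = {}  # name -> (kind, item)

    def find(self, name):
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        return None

    def define(self, kind, name, item):
        existing = self.find(name)
        # A variable may not shadow a variable in this or any outer scope.
        if kind == "variable" and existing is not None and existing[0] == "variable":
            raise ShadowError(f"variable {name} is already bound")
        # Other symbols (for example module imports) may shadow.
        self.symbols[name] = (kind, item)


def try_define(scope, kind, name, item):
    """Return True if the definition is accepted, False if it is rejected."""
    try:
        scope.define(kind, name, item)
        return True
    except ShadowError:
        return False
```

This sketch accepts a second import of the same name in the same scope without complaint; the warning for unobservable bindings is not modelled.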

However, a binding that cannot be observed such as:

import SortedListSet as Set
import RBTreeSet as Set

doesn’t make sense, and the compiler should generate a warning.

TODO: Figure out if context always tells us enough about the role of a
symbol that modules do not need to shadow types and constructors. I suspect
this is true but I’ll have to define the rest of the language first.

Namespaces

The environment maps names to items. Names might be qualified and if so the
qualifier is required to refer to that name. For example.

import imports one or more symbols from a module. Lines one and two add a
module name (RBTreeMap or Map, respectively) to the current environment.
Line three imports only the getpid function from IO and names it
getpid in the current environment. While line four imports everything in
IO, adding them all to the current environment.

Code using symbols imported by lines one and two will require module
qualification (either RBTreeMap or Map), while code using getpid (or
other symbols from IO) will not.

A module cannot be used without an import declaration.

Module exports

ExportDirective := export IdentList
| export *

Symbols can be exported with export directives.

export my_function

If a module has no export directives then nothing is exported, which
probably makes the module useless.

TypeField will need lookahead, so for now all fields must be named, but
the anonymous name (_) is supported.

TODO: We use vertical bars to separate or types. Vertical bars mean "or"
and are used in Haskell, but in C commas (for enums) and semicolons (for
unions) are used. Which is best? Mercury uses semicolons as these mean
"or" in Mercury.

TODO: We use parens around the arguments of constructors, like Mercury, and
because fancy brackets aren’t required. However curly braces would be more
familiar to C programmers.

Builtin types

How "builtin" these are varies. Ints are completely builtin and handled by
the compiler, whereas a List has some compiler support (for special symbols
and no imports required to say "List(t)") but operations may be via library
calls.

Int

Uint

Int8, Int16, Int32, Int64

Uint8, Uint16, Uint32, Uint64

Char (a unicode codepoint)

Float (NIY)

Array(t)

List(t)

String (neither a CString nor a list of chars).

Function types

These types are implemented in the standard library.

CString

Map(t)

Set(t)

etc…

User types

User defined types support discriminated unions (here a Map is
either a Node or Empty), and generics (k and v are type parameters).

t is not a type parameter but Ord itself may be a parameter to another
interface, which is what enables t to represent different types in different
situations; compare may also represent different functions in different
situations.

Note that in this case each member has a definition. This is what makes
this an interface instance (plus the different keyword), rather than an
(abstract) interface. The importance of this distinction is that interfaces
cannot be used by code directly, instances can.

Code can now use this instance.

r = ord_int.compare(3, 4)

Interfaces can also be used as parameter types for other interfaces.
Here we define a sorting algorithm interface using an instance (o) of the
Ord interface.

This is useful when an instance defines one or more operators, as it makes
using the interface more convenient. Suitable instances for the basic types
such as Int are implicitly made available in this way.

Only one implicit instance for the given interface and types may be used at
a time.

Resources

ResourceDefinition := 'resource' UpperIdent 'from' QualifiedIdent

This defines a new resource. The resource has the given name and is a
child resource of the specified resource. SuperRes is the ultimate
resource and is already defined, along with its child resources such as
IO. See Handling effects below.

TODO: it should also work when the expression is for example an if or switch
expression that returns multiple items. This is a property of the if or
switch expression, not the assignment statement.

var1, var2 = if (...) then expr1, expr2 else expr3, expr4

Plasma is a single assignment language. Each variable can only be assigned
to once along any execution path, and must be assigned on each execution
path that falls-through (see Environment).

The special symbol _ can be used to ignore the return value of a function.
It can be used to selectively capture only some values.

div, _ = div_and_quot(7, 5)

Or to ignore the result of a function call that affects a resource.

_ = close!(file)

Function call

Call := ExprPart1 '!'? '(' Expr ( ',' Expr )* ')'

Function calls often return values, however functions that do not return
anything can be called as a statement. Such a function only makes sense if
it affects a resource, and therefore will have a !. However the grammar and
semantics allow functions that don’t have an effect (the compiler will
almost certainly optimize these away).

function_name!(arg1, arg2)

Calls may also be expressions (see below); as an expression a call might
still use or observe some resource. However only one call per statement may
observe the same or a related resource; this ensures that effects happen in
a clear order.

Return

A function that returns one or more values must always end in a return
statement, or a branching statement that (indirectly) ends in a return
statement on each branch.

TODO: This will need to be relaxed for code that aborts.

TODO: Named returns.

Functions that return nothing may optionally use a return statement; this
can be used to implement early return.

Functions and blocks do not have values.
This is deliberate to keep functions and expressions semantically
separate.
This means that the last statement of a block does not have any special
significance as it does in some other languages.

Pattern matching

Pattern matching is also a statement (as well as an expression).
Cases are tried in the order they are written, the compiler should provide a
warning if a case will never be executed, or a value is not covered by any
cases. All variables in the pattern (the LHS of the →) must be new.

If a variable bound by one of the cases is used outside the match (like
beer) then it must be bound by every case. If it is not used outside the
match, then each case’s variable of the same name is "named apart"
(see Environment).

Currently either all cases must have a return statement or none of them.
TODO: Matches where some return and others do not will be added in the future.

If-then-else

Plasma’s single-assignment rules imply that if the "then" part of an
if-then-else binds a non-local variable, then there must be an else part
that also binds the variable (or does not fall-through). Else branches
aren’t required if the then branch does not fall-through or does not bind
anything (it may have an effect).

Loops

Note

Not implemented yet.

Note

I’m seeking feedback on this section in particular.

# Loop over both structures in a pairwise way.
for [x <- xs, y <- ys] {
# foo0 and foo form an accumulator starting at 0. The value of foo
# becomes the value of foo0 in the next iteration.
accumulator foo0 foo initial 0
# The loop body.
z = f(x, y)
foo = foo0 + bar(x)
# This loop has three outputs. "list" and "sum" are names of
# reductions. Reductions are instances of the reduction
# interfaces. They "reduce" the values produced by each iteration
# into a single value.
output zs = list of z
output sum = sum of x
# foo is not visible outside the loop, an output is required to
# expose it. value is a keyword, it is handled specially and
# simply takes the last value encountered.
output foo_final = value of foo
}

Note

The accumulator syntax will probably change after the introduction of
some kind of state variable notation.

TODO: Introduce a more concise syntax for one-liners and expressions.
Similar in succinctness to using map and foldl calls.

The loop will iterate over corresponding items from multiple inputs. When
they’re not of equal length the loop will stop after the shortest one is
exhausted. This decision allows them to be used with a mix of finite and
infinite sequences.
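For readers who know Python, the intended behaviour is close to `zip`. The sketch below mirrors the example loop with one finite and one infinite input; `pairwise_loop`, `f` and `bar` are stand-ins chosen for the illustration and say nothing about how Plasma will implement loops.

```python
from itertools import count


def pairwise_loop(xs, ys, f, bar):
    foo = 0                    # accumulator foo0/foo, initial 0
    zs = []                    # output zs = list of z
    total = 0                  # output sum = sum of x
    for x, y in zip(xs, ys):   # stops when the shorter input is exhausted
        z = f(x, y)
        foo = foo + bar(x)
        zs.append(z)
        total += x
    return zs, total, foo      # foo plays the role of foo_final


# A finite list paired with an infinite sequence 10, 11, 12, ...
zs, total, foo_final = pairwise_loop(
    [1, 2, 3], count(10), lambda x, y: x + y, lambda x: 2 * x
)
```

The loop ends after three iterations even though `count(10)` never does.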

Looping over the Cartesian combination of all items should also be supported
(syntax not yet defined, maybe use &). This is equivalent to using nested
loops in many other languages.
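The equivalence with nested loops can be seen in Python, where `itertools.product` plays the role of the not-yet-defined Cartesian syntax. This is only an analogy; the input values are arbitrary.

```python
from itertools import product

xs = [1, 2]
ys = ["a", "b"]

# Nested loops, as written in many other languages.
nested = []
for x in xs:
    for y in ys:
        nested.append((x, y))

# The same pairs produced by a single Cartesian iteration.
combined = list(product(xs, ys))
```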

Valid input structures are: lists, arrays and sequences. Sequences are
coroutines and therefore can be used to iterate over the keys and values of
a dictionary, or generate a list of numbers.

TODO: Possibly allow this to work on keys and values in dictionaries. If
the keys are unmodified during the loop then the output dictionary can be
rebuilt more easily, its structure doesn’t need to change. Lua has the
ability to require keys to be sorted, or to drop this requirement.

The output declarations include a reduction. This is how the loop should
build the result.

TODO: Reduction isn’t a good word for it, since the output type can be
either a scalar or a vector.

The reduction can be completely different from the type of any of the
inputs. This builds an array from a list (or other ADT). This uses the
array reduction.

for [x <- xs] {
y = f(x)
output ys = array of y
}

Many reductions will be possible: array, list, sequence, min, max,
sum, product, concat_list. Developers will be able to create their
own as these are interfaces.

Loops are implemented in terms of coroutines. Coroutines return the values
for the inputs and the loop body and coroutines handle building the value of
the outputs (list and sum are coroutines above). Coroutines offer the most
flexibility as some of their state is kept on the stack.

Simpler implementations should be used as an optimisation when it is
possible. In these cases some loops may be optimised to calls to map or
foldl, or even simpler inline code.

Auto-parallelisation (a future goal) will work better with reductions that
are known to be either:

Order independent

Associative / commutative, but whose input type is the same as the output

Mergeable, with a known identity value.
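To see why these properties help, consider a reduction that is associative and has an identity value: the input can be split into chunks, each chunk reduced independently, and the partial results merged. The Python sketch below runs sequentially and uses an invented helper name, `chunked_reduce`, but each chunk could be handed to a separate worker.

```python
from functools import reduce


def chunked_reduce(items, merge, identity, chunk_size):
    """Reduce each chunk separately, then merge the partial results.

    Correct only when `merge` is associative and `identity` is its
    identity value; `chunk_size` must be positive.
    """
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    # Each partial result depends on one chunk only.
    partials = [reduce(merge, chunk, identity) for chunk in chunks]
    return reduce(merge, partials, identity)
```

The identity value is what makes empty inputs and empty chunks harmless.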

Accumulators are implemented more directly (not coroutines). However they
require the iterations to be processed in a specific order and may inhibit
parallelisation. A dependency analysis on the body and separating out the
code for each accumulator may mitigate this, especially if it can be
combined with the same analyses as reductions above.

Expressions

Expressions are broken into two parts. This allows us to parse call
expressions properly, with the correct precedence and without a left
recursive grammar. Binary operators are described as a left recursive
grammar, but are not implemented this way, their precedence rules are
documented below.

Array elements may be accessed by subscripting the array. E.g.
a[3] will retrieve the 3rd element (subscripts are 1-based). A dash before
the subscript expression will count backwards from the end of the array:
a[-2] is the second last element. This syntax currently clashes with unary
minus and so is currently unimplemented. Array slices will use the .. token
and are also unimplemented.
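The subscript rules can be written down as a small Python function over a list. This is a sketch of the described behaviour only; the name `subscript` is hypothetical, and rejecting index 0 is an assumption because the text does not define it.

```python
def subscript(array, index):
    """1-based subscripting; a negative index counts back from the end."""
    if index == 0 or abs(index) > len(array):
        # Assumption: index 0 and out-of-range indices are errors.
        raise IndexError(f"subscript {index} is out of range")
    position = index - 1 if index > 0 else len(array) + index
    return array[position]
```

For a four-element array, `subscript(a, 3)` and `subscript(a, -2)` select the same element.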

TODO: Streams

Any control-flow statement is also an expression.

x = if (...) { statements } else { statements }

In this case the branches cannot bind anything visible outside of
themselves, and the value of a branch is the value of the last statement in
that branch.

TODO: Pattern matching expressions.

Ideas

These are just ideas at this stage, they are probably bad ideas.

If a multi-return expression is used as a sub-expression in another context
then that expression is in-turn duplicated.

x, y = multi_value_expr + 3

is

x0, y0 = multi_value_expr
x = x0 + 3
y = y0 + 3

Therefore calls involved in these expressions must not "use resources".

Another idea to consider is that a multiple return expression in the context
of function application applies as many arguments as values it returns. We
probably won’t do this.

... = bar(foo(), z)

Is the same as

x, y = foo()
... = bar(x, y, z)

Handling effects (IO, destructive update)

Plasma is a pure language, so we need a way to handle effects like IO and
destructive update. We do this with resources. A function that uses a
resource (such as print()) may only be called from functions that declare
that they use a resource. This means that a callee cannot use a resource
that a caller doesn’t expect (resource usage is transitive) and anyone
looking at a function’s signature can tell that it might use a resource.

A resource usage declaration looks like:

func main() -> Int uses IO

Here main() declares that it uses (technically may use) the IO
resource. Resources can be either used or observed; and a function may
use or observe any number of resources (decided statically). An observed
resource may be read but is never updated, a used resource may be read or
updated. This distinction allows two observations of a resource to commute
(code may be re-arranged during optimisation), but two uses of a resource
may not commute.

Developers may declare new resources; the standard library will provide some
resources including the IO resource. Examples of IO’s children might be
Filesystem and Time; Filesystem might have children for open files
(WIP), although none of these have been decided / implemented.

A call is valid if:

                     Callee is Pure   Callee may Observe   Callee may Use
Caller is Pure       Y                N                    N
Caller may Observe   Y                Y                    N
Caller may Use       Y                Y                    Y

You’ll find that this is very intuitive.
It’s shown in a table for completeness.
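The table reduces to a single comparison if the three levels are ordered Pure < Observe < Use. The Python sketch below encodes that reading; the `Effect` enum and `call_is_valid` are names invented for the illustration, and the resource hierarchy is ignored here.

```python
from enum import IntEnum


class Effect(IntEnum):
    PURE = 0
    OBSERVE = 1
    USE = 2


def call_is_valid(caller: Effect, callee: Effect) -> bool:
    """A callee may need no more than its caller has declared."""
    return callee <= caller
```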

Resource hierarchy

Resources form a hierarchy (not yet defined). For a call to be valid either
the resource, or its parent must be available in the caller. For example if
mkdir() uses the Filesystem resource, which is a child of IO then any
caller that uses IO can call mkdir().
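A minimal Python sketch of this check follows. It assumes the hierarchy is stored as a child-to-parent map and that any ancestor (not only the direct parent) satisfies the requirement; both are assumptions, as are the helper names, and Filesystem and Time are just the undecided examples mentioned earlier.

```python
# Hypothetical hierarchy: child -> parent (None marks the top of this sketch).
PARENT = {"IO": None, "Filesystem": "IO", "Time": "IO"}


def ancestors_or_self(resource):
    """Yield the resource, then its parent, grandparent and so on."""
    while resource is not None:
        yield resource
        resource = PARENT[resource]


def resource_available(required, declared):
    """True if the caller declared the required resource or an ancestor of it."""
    return any(r in declared for r in ancestors_or_self(required))
```

A caller that declares IO can therefore call something that needs Filesystem, but not the other way round.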

Temporary resources (NIY)

Some resources can be created and destroyed, and rather than always being a
part of their parent (Filesystem is always a part of IO) they are
subsumed by their parent instead. For example an array uses some memory as
its resource; that memory is allocated when the array is initialised and
freed when it goes out of scope (it is unique). But if the memory resource
is created and destroyed within the same function, its caller does not need
the uses declaration. Memory and possibly some other resources are special
cases.

Resources in statements

Every call that uses a resource must have the ! suffix. For example:

print!("Hello world\n")

This makes it clear to anyone reading the code that something happens or
changes, or might be observed to have happened or changed. This is also the
entire reason to have it in the language; it serves no other function, but
the compiler will make sure that it is present on every call that either
uses or observes something.

Multiple calls with ! may be used in the same statement, provided that
their resources do not overlap, or they are all observing the resource and
not modifying it. (Note that we are debating this at the moment.)
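One possible reading of this rule is sketched below in Python. It assumes "overlap" means the two resources are the same or one is an ancestor of the other, and that each call is summarised as a (mode, resource) pair; the function names and the example hierarchy are hypothetical, and the rule itself is still under debate.

```python
def related(a, b, parent):
    """True if a and b are the same resource or one is an ancestor of the other."""
    def chain(resource):
        seen = []
        while resource is not None:
            seen.append(resource)
            resource = parent.get(resource)
        return seen

    return a in chain(b) or b in chain(a)


def statement_ok(calls, parent):
    """calls is a list of (mode, resource), with mode "observe" or "use"."""
    for i, (mode_a, res_a) in enumerate(calls):
        for mode_b, res_b in calls[i + 1:]:
            both_observe = mode_a == mode_b == "observe"
            # Overlapping resources are only allowed when both calls observe.
            if related(res_a, res_b, parent) and not both_observe:
                return False
    return True
```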

Commutativity of resources

Optimisation may cause code to be executed in a different order than
written. The following reorderings of two related (ancestor/descendant)
resources are legal.

          None   Observe   Use
None      Y      Y         Y
Observe   Y      Y         N
Use       Y      N         N

Non-related resources may be reordered freely.
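For two related resources the table can be expressed as a short predicate. This Python sketch uses the strings "none", "observe" and "use" as stand-ins for the three modes; the function name is an assumption.

```python
def may_reorder(first: str, second: str) -> bool:
    """Whether two operations on related resources may swap places.

    Each argument is "none", "observe" or "use".
    """
    if "none" in (first, second):
        return True
    # Two observations commute; anything involving a use does not.
    return first == second == "observe"
```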

Higher order code (NIY)

This aspect of Plasma is currently under consideration.

Higher order functions need to handle resources, otherwise their
usefulness is reduced.

Resource usage from such code needs to be safe (WRT order of operations).

We want to encourage polymorphism here, otherwise people will write
higher-order abstractions that can’t be used with resources.

We’d prefer to make code concise that isn’t intended to be used with
resources, but ought to be resource-capable anyway.

Proposal 1

Higher order values may have uses/observes declarations (added to their
type); values without such declarations are pure. All higher order calls
have the usual ! sigil and the statement rules apply.

This has the disadvantage that it is not as concise, and that people who
aren’t planning to use resources won’t write resource-capable code. If that
code is in a library it may be annoying to modify if it later needs to be
used with a resource.

Other proposals

There are several other ideas and their combinations that may help.

All higher order code implicitly uses resources, a function like map
therefore also uses that resource since it contains such calls. Unless
there’s a way to show this in the function’s signature I’m not fond of it.
Unless-unless we leave that to type inference.

Require all higher-order code to handle resources, users may feel that the
compiler is being overly-pedantic.

Higher order calls are exempt from the one-resource-per-statement rule,
making the code more concise (it still includes a !).

Either expressions have a well-ordered declarative semantics, or
resources must be declared as don’t-care ordering so they can be placed
in the same statements.

Linking to and storing as data (NIY)

Linking a resource with a real piece of data, such as a file descriptor,
is highly desirable. Likewise putting such data inside a structure to be
used later, such as a pool of warmed-up database connections, will be
necessary.

There are a couple of ideas. We could add information to the types to say
that they are resources and what their parent resource type is. So that the
variable can stand-in for the resource.