Safely Composable Type-Specific Languages

Transcription

Safely Composable Type-Specific Languages

Cyrus Omar, Darya Kurilova, Ligia Nistor, Benjamin Chung, Alex Potanin, and Jonathan Aldrich
Carnegie Mellon University and Victoria University of Wellington

Abstract. Programming languages often include specialized syntax for common datatypes (e.g. lists) and some also build in support for specific specialized datatypes (e.g. regular expressions), but user-defined types must use general-purpose syntax. Frustration with this causes developers to use strings, rather than structured data, with alarming frequency, leading to correctness, performance, security, and usability issues. Allowing library providers to modularly extend a language with new syntax could help address these issues. Unfortunately, prior mechanisms either limit expressiveness or are not safely composable: individually unambiguous extensions can still cause ambiguities when used together. We introduce type-specific languages (TSLs): logic associated with a type that determines how the bodies of generic literals, able to contain arbitrary syntax, are parsed and elaborated, hygienically. The TSL for a type is invoked only when a literal appears where a term of that type is expected, guaranteeing noninterference. We give evidence supporting the applicability of this approach and formally specify it with a bidirectionally typed elaboration semantics for the Wyvern programming language.

Keywords: extensible languages; parsing; bidirectional typechecking; hygiene

1 Motivation

Many data types can be seen, semantically, as modes of use of general-purpose product and sum types. For example, lists can be seen as recursive sums by observing that a list can either be empty, or be broken down into a product of the head element and the tail, another list.
In an ML-like functional language, sums are exposed as datatypes and products as tuples and records, so list types can be defined as follows:

    datatype 'a list = Nil | Cons of 'a * 'a list

In class-based object-oriented languages, objects can be seen as products of their instance data and classes as the cases of a sum type [9]. In low-level languages, like C, structs and unions expose products and sums, respectively. By defining user-defined types in terms of these general-purpose constructs, we immediately benefit from powerful reasoning principles (e.g. induction), language support (e.g. pattern matching) and compiler optimizations. But these semantic benefits often come at a syntactic cost. For example, few would claim that writing a list of numbers as a sequence of Cons cells is convenient:

    Cons(1, Cons(2, Cons(3, Cons(4, Nil))))
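A sketch of this encoding outside ML may make the syntactic cost concrete. The following Python classes are our own illustration (not from the paper), mirroring the datatype declaration above:

```python
# List as a recursive sum: either Nil, or Cons of a head paired with a tail
# list, mirroring `datatype 'a list = Nil | Cons of 'a * 'a list`.
from dataclasses import dataclass
from typing import Any

class MLList:
    """Base class standing in for the sum type."""

@dataclass
class Nil(MLList):
    pass

@dataclass
class Cons(MLList):
    head: Any
    tail: MLList

def to_python_list(xs: MLList) -> list:
    # Structural iteration over the two cases of the sum.
    out = []
    while isinstance(xs, Cons):
        out.append(xs.head)
        xs = xs.tail
    return out

# The verbose general-purpose syntax the paper contrasts with [1, 2, 3, 4]:
example = Cons(1, Cons(2, Cons(3, Cons(4, Nil()))))
```

Writing `example` out by hand is exactly the inconvenience that literal syntax like [1, 2, 3, 4] removes.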

Lists are a common data structure, so many languages include literal syntax for introducing them, e.g. [1, 2, 3, 4]. This syntax is semantically equivalent to the general-purpose syntax shown above, but brings cognitive benefits both when writing and reading code by focusing on the content of the list, rather than the nature of the encoding. Using terminology from Green's cognitive dimensions of notations [8], it is more terse, visible and maps more closely to the intuitive notion of a list. Stoy, in discussing the value of good notation, writes [31]: "A good notation thus conceals much of the inner workings behind suitable abbreviations, while allowing us to consider it in more detail if we require: matrix and tensor notations provide further good examples of this. It may be summed up in the saying: a notation is important for what it leaves out." Although list, number and string literals are nearly ubiquitous features of modern languages, some languages provide specialized literal syntax for other common collections (like maps, sets, vectors and matrices), external data formats (like XML and JSON), query languages (like regular expressions and SQL), markup languages (like HTML and Markdown) and many other types of data.
For example, a language with built-in notation for HTML and SQL, supporting type-safe splicing via curly braces, might define:

1 let webpage : HTML = <html><body><h1>results for {keyword}</h1>
2   <ul id="results">{to_list_items(query(db,
3     SELECT title, snippet FROM products WHERE {keyword} in title))}
4   </ul></body></html>

as shorthand for:

1 let webpage : HTML = HTMLElement(Dict.empty(), [BodyElement(Dict.empty(),
2   [H1Element(Dict.empty(), [TextNode("Results for " + keyword)]),
3   ULElement((Dict.add Dict.empty() ("id","results")), to_list_items(query(db,
4     SelectStmt(["title", "snippet"], "products",
5       [WhereClause(InPredicate(StringLit(keyword), "title"))]))))])])

When general-purpose notation like this is too cognitively demanding for comfort, but a specialized notation as above is not available, developers turn to run-time mechanisms to make constructing data structures more convenient. Among the most common strategies in these situations, no matter the language paradigm, is to simply use a string representation, parsing it at run-time:

1 let webpage : HTML = parse_html("<html><body><h1>results for "+keyword+"</h1>
2   <ul id=\"results\">" + to_string(to_list_items(query(db, parse_sql(
3   "SELECT title, snippet FROM products WHERE "+keyword+" in title")))) +
4   "</ul></body></html>")

Though recovering some of the notational convenience of the literal version, this is still more awkward to write, requiring explicit conversions to and from structured representations (parse_html and to_string, respectively) and escaping when the syntax of the data language interferes with the syntax of string literals (line 2). Such code also causes a number of problems that go beyond cognitive load. Because parsing occurs at run-time, syntax errors will not be discovered statically, causing potential run-time errors in production scenarios.
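The hazard of the string-based pattern can be shown in a few lines. The sketch below is ours, using Python's sqlite3 module rather than the hypothetical language above; table and column names follow the paper's running example. It contrasts string concatenation with a structured, parameterized query:

```python
# String-built SQL versus a parameterized query, using sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (title TEXT, snippet TEXT)")
conn.execute("INSERT INTO products VALUES ('widget', 'a fine widget')")

keyword = "'; DROP TABLE products --"  # a malicious "search keyword"

# String splicing: the keyword's content becomes part of the SQL text itself,
# so an attacker controls the query's structure (the injection hazard).
unsafe_query = ("SELECT title, snippet FROM products WHERE title LIKE '%"
                + keyword + "%'")

# Parameterized query: the keyword travels as data, never as SQL syntax.
rows = conn.execute(
    "SELECT title, snippet FROM products WHERE title LIKE ?",
    ("%" + keyword + "%",),
).fetchall()
```

Run this way, the parameterized query simply finds no matching rows, while the concatenated string smuggles a DROP TABLE statement into the query text.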
Run-time parsing also incurs performance overhead, particularly relevant when code like this is executed often (as on a heavily-trafficked website). But the most serious issue with this code is that it is highly insecure: it is

vulnerable to cross-site scripting attacks (line 1) and SQL injection attacks (line 3). For example, if a user entered the keyword '; DROP TABLE products --, the entire product database could be erased. These attack vectors are considered to be two of the most serious security threats on the web today [26]. Although developers are cautioned to sanitize their input, it can be difficult to verify that this was done correctly throughout a codebase. The best way to avoid these problems today is to avoid strings and other similar conveniences and insist on structured representations. Unfortunately, situations like this, where maintaining strong correctness, performance and security guarantees entails significant syntactic overhead, causing developers to turn to less structured solutions that are more convenient, are quite common (as we will discuss in Sec. 5). Adding new literal syntax to a language is generally considered to be the responsibility of the language's designers. This is largely for technical reasons: not all syntactic forms can unambiguously coexist in the same grammar, so a designer is needed to decide which syntactic forms are available, and what their semantics should be. For example, conventional notations for sets and maps are both delimited by curly braces. When Python introduced set literals, it chose to distinguish them based on whether the literal contained only values (e.g. {3}) or key-value pairs (e.g. {"x": 3}). But this causes an ambiguity with the syntactic form {}: should it mean an empty set or an empty map (called a dictionary in Python)? The designers of Python avoided the ambiguity by choosing the latter interpretation (in this case, for backwards compatibility reasons). Were this power given to library providers in a decentralized, unconstrained manner, the burden of resolving ambiguities would instead fall on developers who happened to import conflicting extensions.
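Python's resolution of this ambiguity is directly observable:

```python
# Python's choice, as described above: empty braces denote a dictionary,
# never a set, while non-empty braces are disambiguated by their contents.
empty = {}
assert type(empty) is dict        # {} is an empty dict, for backwards compatibility
assert type({3}) is set           # values only: a set literal
assert type({"x": 3}) is dict     # key-value pairs: a dict literal
empty_set = set()                 # the only way to write an empty set
```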
Indeed, this is precisely the situation with SugarJ [6] and other extensible languages generated by Sugar* [7], which allow library providers to extend the base syntax of the host language with new forms in a relatively unconstrained manner. These new forms are imported transitively throughout a program. To resolve syntactic ambiguities that arise, clients must manually augment the composed grammar with new rules that allow them to choose the correct interpretation explicitly. This is difficult to do, requiring a reasonably thorough understanding of the underlying parser technology (in Sugar*, generalized LR parsing), and it increases the cognitive load of using the conflicting notations (e.g. both sets and maps) together, because disambiguation tokens must be used. These kinds of conflicts occur in a variety of circumstances: HTML and XML, different variants of SQL, JSON literals and maps, or differing implementations ("desugarings") of the same syntax (e.g. two regular expression engines). Code that uses these common abstractions together is very common in practice [13]. In this work, we will describe an alternative parsing strategy that sidesteps these problems by building into the language only a delimitation strategy, which ensures that ambiguities do not occur. The parsing and elaboration of literal bodies occurs during typechecking, rather than in the initial parsing phase. In particular, the typechecker defers responsibility to library providers, by treating the body of the literal as a term of the type-specific language (TSL) associated with the type it is being checked against. The TSL definition is responsible for elaborating this term using only general-purpose syntax. This strategy permits significant semantic flexibility: the meaning of a form like {} can differ depending on its type, so it is safe to use it for empty sets, maps and

JSON literals. This frees these common forms from being tied to the variant of a data structure built into a language's standard library, which may not provide the precise semantics that a programmer needs (for example, Python dictionaries do not preserve key insertion order). We present our work as a variant of an emerging programming language called Wyvern [22]. To allow us to focus on the essence of our proposal and provide the community with a minimal foundation for future work, the variant of Wyvern we develop here is simpler than the variant we previously described: it is purely functional (there are no effects other than non-termination) and it does not enforce a uniform access principle for objects (fields can be accessed directly), so objects are essentially just recursive labeled products with simple methods. It also adds recursive sum types, which we call case types, similar to those found in ML. One can refer to our version of the language as TSL Wyvern when the variant being discussed is not clear. Our work substantially extends and makes concrete a mechanism we sketched in a short workshop paper [23]. The paper is organized as a language design for TSL Wyvern:

In Sec. 2, we introduce TSL Wyvern with a practical example. We introduce both inline and forward referenced literal forms, splicing, case and object types and an example of a TSL definition.

In Sec. 3, we specify the layout-sensitive concrete syntax of TSL Wyvern with an Adams grammar and introduce the abstract syntax of TSL Wyvern.

In Sec. 4, we specify the static semantics of TSL Wyvern as a bidirectionally typed elaboration semantics, which combines two key technical mechanisms:

1. Bidirectional Typechecking: By distinguishing locations where an expression must synthesize a type from locations where an expression is being analyzed against a known type, we precisely specify where generic literals can appear and how dispatch to a TSL definition (an object with a parse method serving as metadata of a type) occurs.

2. Hygienic Elaboration: Elaboration of literals must not cause the inadvertent capture or shadowing of variables in the context where the literal appears. It must, however, remain possible for the client to do so in those portions of the literal body treated as spliced expressions. The language cannot know a priori where these spliced portions will be. We give a clean type-theoretic formulation that achieves this notion of hygiene.

In Sec. 5, we gather initial data on how broadly applicable our technique may be by conducting a corpus analysis, finding that existing code often uses strings where specialized syntax might be more appropriate. In Sec. 6, we briefly report on the current implementation status of our work. We discuss related work in Sec. 7 and conclude in Sec. 8 with a discussion of present limitations and future research directions.

2 Type-Specific Languages in Wyvern

We begin with an example in Fig. 1 showing several different TSLs being used in a fragment of a web application showing search results from a database. We will review this example below to develop intuitions about TSLs in Wyvern; a formal and more detailed description will follow. For clarity of presentation, we color each character by the TSL it is governed by. Black is the base language and comments are in italics.
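The capture hazard that hygiene prevents can be illustrated concretely. The sketch below is our own toy AST manipulation in Python, not the paper's formalism: an elaboration that introduces a binding named x would capture a client's spliced x unless the elaboration-introduced binder is renamed first.

```python
# Toy illustration of hygiene: an elaboration introduces a binding, and a
# client's spliced expression mentions a variable of the same name. ASTs are
# nested tuples; ("spliced",) marks where client code is inserted.

def subst_spliced(ast, client_ast):
    """Replace the ("spliced",) placeholder with the client's AST."""
    if ast == ("spliced",):
        return client_ast
    if isinstance(ast, tuple):
        return tuple(subst_spliced(part, client_ast) for part in ast)
    return ast

client = ("var", "x")  # the client's x, bound in the client's own context

# Naive elaboration: binds "x" itself, so the client's x gets captured.
naive = ("let", "x", ("num", 1), ("spliced",))
captured = subst_spliced(naive, client)

# Hygienic elaboration: the introduced binder is renamed to a fresh name
# before splicing, so the client's x keeps referring to the client's context.
hygienic = ("let", "_fresh0", ("num", 1), ("spliced",))
safe = subst_spliced(hygienic, client)
```

In the captured result, the client's x has silently become a reference to the elaboration's binding; in the hygienic result it remains free.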

1 let imagebase : URL = <images.example.com>
2 let bgimage : URL = <%imagebase%/background.png>
3 new : SearchServer
4   def resultsfor(searchquery, page)
5     serve(~) (* serve : HTML -> Unit *)
6       >html
7         >head
8           >title Search Results
9           >style ~
10            body { background-image: url(%bgimage%) }
11            #search { background-color: %darken(#aabbcc, 10pct)% }
12        >body
13          >h1 Results for <{HTML.Text(searchquery)}
14          >div[id="search"]
15            Search again: < SearchBox("Go!")
16          < (* fmt_results : DB * SQLQuery * Nat * Nat -> HTML *)
17            fmt_results(db, ~, 10, page)
18              SELECT * FROM products WHERE {searchquery} in title

Fig. 1: Wyvern Example with Multiple TSLs

<literal body here, <inner angle brackets> must be balanced>
{literal body here, {inner braces} must be balanced}
[literal body here, [inner brackets] must be balanced]
`literal body here, ``inner backticks`` must be doubled`
'literal body here, ''inner single quotes'' must be doubled'
"literal body here, ""inner double quotes"" must be doubled"
12xyz (* no delimiters necessary for number literals; suffix optional *)

Fig. 2: Inline Generic Literal Forms

2.1 Inline Literals

Our first TSL appears on the right-hand side of the variable binding on line 1. The variable imagebase is annotated with its type, URL. This is a named object type declaring several fields representing the components of a URL: its protocol, domain name, port, path and so on (below). We could have created a value of type URL using the general-purpose introductory form new, which forward references an indented block of field and method definitions beginning on the line after it appears:

1 objtype URL
2   val protocol : String
3   val subdomain : String
4   (* ... *)

1 let imagebase : URL = new
2   val protocol = "http"
3   val subdomain = "images"
4   (* ... *)

This is tedious.
By associating a TSL with the URL type (we will show how later), we can instead introduce precisely this value using conventional notation for URLs by placing it in the body of a generic literal, <images.example.com>. Any other delimited form in Fig. 2 can equivalently be used when the constraints indicated can be obeyed. The type annotation on imagebase (or equivalently, ascribed directly to the literal) implies that this literal's expected type is URL, so the body of the literal (the characters between the angle brackets, in blue) will be governed by the URL TSL during the typechecking phase. This TSL will parse the body (at compile-time) and produce an elaboration: a Wyvern abstract syntax tree (AST) that explicitly instantiates a new object of type URL using general-purpose forms only, as if the above had been written directly.
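The effect of such an elaboration can be sketched in ordinary code. The function below is our own simplification in Python; the field names follow the objtype URL sketch above, and the parsing rules (e.g. defaulting the protocol to "http" and taking the first label as the subdomain) are assumptions for illustration. The point is that the literal body is parsed once, ahead of run-time, into a structured value:

```python
# A toy "URL TSL": parse the literal body into a structured URL value,
# standing in for the compile-time elaboration described above.
from dataclasses import dataclass

@dataclass
class URL:
    protocol: str
    subdomain: str
    domain: str

def parse_url_literal(body: str) -> URL:
    protocol, rest = "http", body          # assumed default protocol
    if "://" in body:
        protocol, rest = body.split("://", 1)
    labels = rest.split(".")
    return URL(protocol=protocol,
               subdomain=labels[0],
               domain=".".join(labels[1:]))

imagebase = parse_url_literal("images.example.com")
```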

2.2 Splicing

In addition to supporting conventional notation for URLs, this TSL supports splicing another Wyvern expression of type URL to form a larger URL. The spliced term is here delimited by percent signs, as seen on line 2 of Fig. 1. The TSL chooses to parse code between percent signs as a Wyvern expression, using its abstract syntax tree (AST) to construct the overall elaboration. A string-based representation of the URL is never constructed at run-time. Note that the delimiters used to go from Wyvern to a TSL are controlled by Wyvern, while the TSL controls how to return to Wyvern.

2.3 Layout-Delimited Literals

On line 5 of Fig. 1, we see a call to a function serve (not shown) which has type HTML -> Unit. Here, HTML is a user-defined case type, having cases for each HTML tag as well as some other structures, such as text nodes and sequencing. Declarations of some of these cases can be seen on lines 2-6 of Fig. 4 (note that TSL Wyvern also includes simple product types for convenience, written T1 * T2). We could again use Wyvern's general-purpose introductory form for case types, e.g. BodyElement((attrs, child)). But, as discussed in the introduction, this can be cognitively demanding. Thus, we have associated a TSL with HTML that provides a simplified notation for writing HTML, shown being used on lines 6-18 of Fig. 1. This literal body is layout-delimited, rather than delimited by explicit tokens as in Fig. 2, and introduced by a form of forward reference, written ~ ("tilde"), on the previous line. Because the forward reference occurs in a position where the expected type is HTML, the literal body is governed by that type's TSL. The forward reference will be replaced by the general-purpose term, of type HTML, generated by the TSL during typechecking. Because layout was used as a delimiter, there are no syntactic constraints on the body, unlike with inline forms (Fig. 2).
For HTML, this is quite useful, as all of the inline forms impose constraints that would conflict with some valid HTML, requiring awkward and error-prone escaping. It also avoids issues with leading indentation in multi-line literals, as the parser strips this automatically for layout-delimited literal bodies.

2.4 Implementing a TSL

Portions of the implementation of the TSL for HTML are shown on lines 8-15 of Fig. 4. A TSL is associated with a named type using a general mechanism for associating a statically-known value with a named type, called its metadata. Type metadata, in this context, is comparable to class annotations in Java or class/type attributes in C#/F#, and internalizes the practice of writing metadata using comments, so that it can be checked by the language and accessed programmatically more easily. This can be used for a variety of purposes: to associate documentation with a type, to mark types as being deprecated, and so on. Note that we allow programs to extract the metadata value of a named type T programmatically using the form metadata[T]. For the purposes of this work, metadata values will always be of type HasTSL, an object type that declares a single field, parser, of type Parser. The Parser type is an object type declaring a single method, parse, that transforms a ParseStream extracted from a literal body to a Wyvern AST. An AST is a value of type Exp, a case type that encodes the abstract syntax of Wyvern expressions.

1 casetype HTML
2   Empty
3   Seq of HTML * HTML
4   Text of String
5   BodyElement of Attributes * HTML
6   StyleElement of Attributes * CSS
7   (* ... *)
8   metadata = new : HasTSL
9     val parser = ~
10      start <- '>body'= attributes start>
11        fn (attrs, child) => Inj(BodyElement, Pair(attrs, child))
12      start <- '>style'= attributes EXP>
13        fn (attrs, e) => StyleElement((%attrs%, %e%))
14      start <- '<'= EXP>
15        fn (e) => %e% : HTML

Fig. 4: A Wyvern case type with an associated TSL.

1 objtype HasTSL
2   val parser : Parser
3 objtype Parser
4   def parse(ps : ParseStream) : Result
5   metadata : HasTSL = new
6     val parser = (* parser generator *)
7 casetype Result
8   OK of Exp * ParseStream
9   Error of String * Location
10 casetype Exp
11   Var of ID
12   Lam of ID * Type * Exp
13   Ap of Exp * Exp
14   Inj of ID * Exp
15   (* ... *)
16   Spliced of ParseStream
17   metadata : HasTSL = new
18     val parser = (* quasiquotes *)

Fig. 5: Some of the types included in the Wyvern prelude.

Fig. 5 shows portions of the declarations of these types, which live in the Wyvern prelude (a collection of types that are automatically loaded before any other). Notice, however, that the TSL for HTML is not provided as an explicit parse method but instead as a declarative grammar. A grammar is specialized notation for defining a parser, so we can implement a grammar-based parser generator as a TSL atop the lower-level interface exposed by Parser. We do so using a layout-sensitive grammar formalism developed by Adams [1]. Wyvern is itself layout-sensitive and has a grammar that can be written down using this formalism, as we will discuss, so it is sensible to expose it to TSL providers as well. Most aspects of this formalism are conventional. Each non-terminal (e.g. the designated start non-terminal) is defined by a number of disjunctive rules, each introduced using <-. Each rule defines a sequence of terminals (e.g. '>body') and non-terminals (e.g.
start, or one of the built-in non-terminals ID, EXP or TYPE, representing Wyvern identifiers, expressions and types, respectively). Unique to Adams grammars is that each terminal and non-terminal in a rule can also have an optional layout constraint associated with it. The layout constraints available are = (meaning that the leftmost column of the annotated term must be aligned with that of the parent term), > (the leftmost column must be indented further) and >= (the leftmost column may be indented further). Note that the leftmost column is not simply the first character, in the case of terms that span multiple lines. For example, a production rule of the form A <- B= C>= D> approximately reads as: term B must be at the same indentation level as term A, term C may be at the same or a greater indentation level than term A, and term D must be at an indentation level greater than term A's. In particular, if D contains a NEWLINE character, the next line must be indented past the position of the

left-most character of A (typically, though not always, constructed so that it must appear at the beginning of a line). There are no constraints relating D to B or C other than the standard sequencing constraint: the first character of D must be further along in the file than the others. Using Adams grammars, the syntax of real-world languages like Python and Haskell can be written declaratively. Each rule is followed, in an indented block, by a spliced function that generates an elaboration given the elaborations recursively generated by each of the n non-terminals in the rule, ordered left-to-right. Elaborations are of type Exp, which is a case type containing each form in the abstract syntax of Wyvern (as well as an additional case, Spliced, that is used internally), which we will describe later. Here, we show how to generate an elaboration using the general-purpose introductory form for case types (on line 11, Inj corresponds to the introductory form for case types) as well as using quasiquotes (line 13). Quasiquotes are expressions written in concrete syntax that are not evaluated for their value, but rather evaluate to their corresponding syntax trees. We observe that quasiquotes too fall into the pattern of "specialized notation associated with a type": quasiquotes for expressions, types and identifiers are simply TSLs associated with Exp, Type and ID (Fig. 5). They support the Wyvern concrete syntax as well as an additional delimited form, written with %s, that supports "unquoting": splicing another AST into the one being generated. Again, splicing is safe and structural, not string-based. We can see how HTML splicing works on lines 12-15: we simply include the Wyvern expression non-terminal EXP in our rule and insert it into our quoted result where appropriate. The type that the spliced Wyvern expression will be expected to have is determined by where it is placed.
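Structural splicing of this kind amounts to substituting already-parsed ASTs for placeholder nodes. A minimal sketch (ours, in Python, with nested tuples standing in for the Exp case type):

```python
# Structural (not string-based) splicing: replace ("spliced", name) placeholder
# nodes in a quasiquoted AST with separately-parsed ASTs. Node shapes are a
# toy encoding, not Wyvern's actual Exp representation.

def splice(ast, env):
    if isinstance(ast, tuple):
        if ast and ast[0] == "spliced":
            return env[ast[1]]          # substitute the pre-parsed AST
        return tuple(splice(part, env) for part in ast)
    return ast                          # leaves (labels, literals) pass through

# Roughly analogous to line 13 of Fig. 4: a StyleElement whose attributes and
# CSS payload are unquoted with %...%.
quoted = ("inj", "StyleElement",
          ("pair", ("spliced", "attrs"), ("spliced", "e")))
elaboration = splice(quoted, {"attrs": ("var", "a"), "e": ("var", "css")})
```

Because substitution operates on trees rather than strings, no re-parsing or escaping of the spliced material is ever needed.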
On line 13 it is known to be CSS by the declaration of HTML, and on line 15, it is known to be HTML by the use of an explicit ascription.

3 Syntax

3.1 Concrete Syntax

We will begin our formal treatment by specifying the concrete syntax of Wyvern declaratively, using the same layout-sensitive formalism that we have introduced for TSL grammars, developed recently by Adams [1]. Adams grammars are useful because they allow us to implement layout-sensitive syntax, like that we've been describing, without relying on context-sensitive lexers or parsers. Most existing layout-sensitive languages (e.g. Python and Haskell) use hand-rolled context-sensitive lexers or parsers (keeping track of, for example, the indentation level using special INDENT and DEDENT tokens), but these are more problematic because they cannot be used to generate editor modes, syntax highlighters and other tools automatically. In particular, we will show how the forward references we have described can be correctly encoded in this formalism without requiring a context-sensitive parser or lexer. It is also useful that the TSL for Parser, above, uses the same parser technology as the host language, so that it can be used to generate the quasiquote TSL for Exp more easily.

3.2 Program Structure

The concrete syntax of TSL Wyvern is shown in Fig. 6. An example Wyvern program showing several unique syntactic features of TSL Wyvern is shown in Fig. 7 (left).
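To make the layout constraints concrete, the following sketch (ours) checks the three Adams constraints over leftmost columns, including the multi-line case where the leftmost column is not simply the column of a term's first character:

```python
# Toy check of Adams-style layout constraints (=, >, >=) over leftmost columns.

def leftmost_column(lines):
    """Leftmost column over all non-blank lines of a (multi-line) term."""
    return min(len(l) - len(l.lstrip()) for l in lines if l.strip())

def satisfies(parent_lines, child_lines, constraint):
    p = leftmost_column(parent_lines)
    c = leftmost_column(child_lines)
    return {"=": c == p, ">": c > p, ">=": c >= p}[constraint]

parent = ["let x ="]                    # leftmost column 0
child = ["  >h1 Hello", "    more"]     # leftmost column 2
```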

1 objtype T
2   val y : HTML
3 let page : HTML->HTML = (fn(x) => ~)
4   >html
5     >body
6       <{x}
7 page(case(5 : Nat))
8   Z(_) => (new : T).y
9     val y = ~
10      >h1 Zero!
11  S(x) => ~
12    >h1 Successor!

objtype[T, (y[named[HTML]]), ()];
elet(easc[arrow[named[HTML], named[HTML]]](elam(x.lit[>html >body <{x}])), page.
  eap(page; ecase(easc[named[Nat]](lit[5])) {
    erule[Z](_.eprj[y](easc[named[T]](enew { eval[y](lit[>h1 Zero!]); })));
    erule[S](x.lit[>h1 Successor!]); }))

Fig. 7: An example Wyvern program demonstrating all three forward referenced forms. The corresponding abstract syntax is on the right.

The top level of a program (the p non-terminal) consists of a series of named type declarations (object types using objtype or case types using casetype) followed by an expression, e. Each named type declaration can also include a metadata declaration. Metadata is simply an expression associated with the type, used to store TSL logic (and in future work, other metadata). In the grammar, sequences of top-level declarations use the form p= to signify that all the succeeding p terms must begin at the same indentation. We do not specify separate compilation here, as this is an orthogonal issue.

3.3 Forward Referenced Blocks

Wyvern makes extensive use of forward referenced blocks to make its syntax clean. In particular, layout-delimited TSLs, new expressions for introducing objects, and case expressions for eliminating case types and tuples all make use of forward referenced blocks. Fig. 7 shows these in use (assuming suitable definitions of Nat and HTML). Each line in the concrete syntax can contain either zero or one forward references. We distinguish these in the grammar by defining separate non-terminals e and ẽ[fwd], where the parameter fwd is the particular forward reference form that occurs.
Note particularly the rule for let (which permits an expression to span multiple lines and so can be used to support multiple forward references in a single expression).

3.4 Abstract Syntax

The concrete syntax of a Wyvern program, p, is parsed to a program in the abstract syntax, ρ, shown in Fig. 8. Forward references are internalized. Note that all literal forms are unified into the abstract literal form lit[body], including the layout-delimited form and number literals. The body remains completely unparsed at this stage. The abstract syntax for the example in Fig. 7 is shown to its right and demonstrates the key rewriting done at this stage. Simple product types can be rewritten as object types in this phase. We assume that this occurs so that we can avoid specifying them separately in the remainder of the paper, though we continue to use tuple notation for concision.

4 Bidirectional Typechecking and Elaboration

We will now specify a type system for the abstract syntax in Fig. 8. Conventional type systems are specified using a typing judgement written like Γ ⊢Θ e : τ, where the typing context, Γ, maps bound variables to types, and the named type context, Θ, maps
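The internalization of forward references can be sketched as a small rewriting over lines: a ~ on a line is replaced by a lit[...] node whose body is the following, more-indented block, with the block's shared leading indentation stripped. A toy Python version (ours, not the actual Wyvern parser):

```python
# Toy rewriting: replace the forward reference '~' on the first line with a
# lit[...] node containing the subsequent layout-delimited block, stripping
# the block's shared leading indentation.

def internalize_forward_ref(lines):
    head, block = lines[0], lines[1:]
    assert "~" in head and block
    strip = min(len(l) - len(l.lstrip()) for l in block if l.strip())
    body = "\n".join(l[strip:] for l in block)
    return head.replace("~", "lit[" + body + "]", 1)

rewritten = internalize_forward_ref([
    "serve(~)",
    "  >html",
    "    >body",
])
```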

ρ ::= θ; e
θ ::= objtype[T, ω, e]; θ | casetype[T, χ, e]; θ
e ::= x | easc[τ](e) | elet(e; x.e) | elam(x.e) | eap(e; e) | enew {m} | eprj[l](e)
      | einj[C](e) | ecase(e) {r} | etoast(e) | emetadata[T] | lit[body]
m ::= eval[l](e); m | edef[l](x.e); m
r ::= erule[C](x.e); r
τ ::= named[T] | arrow[τ, τ]
ω ::= l[τ]; ω
χ ::= C[τ]; χ
ê ::= x | hasc[τ](ê) | hlet(ê; x.ê) | hlam(x.ê) | hap(ê; ê) | hnew {m̂} | hprj[l](ê)
      | hinj[C](ê) | hcase(ê) {r̂} | htoast(ê) | hmetadata[T] | spliced[e]
m̂ ::= hval[l](ê); m̂ | hdef[l](x.ê); m̂
r̂ ::= hrule[C](x.ê); r̂
i ::= x | iasc[τ](i) | ilet(i; x.i) | ilam(x.i) | iap(i; i) | inew {ṁ} | iprj[l](i)
      | iinj[C](i) | icase(i) {ṙ} | itoast(i)
ṁ ::= ival[l](i); ṁ | idef[l](x.i); ṁ
ṙ ::= irule[C](x.i); ṙ

Fig. 8: Abstract syntax of TSL Wyvern programs (ρ), type declarations (θ), types (τ), external terms (e), translational terms (ê), internal terms (i) and auxiliary forms. Metavariable T ranges over type names, l over object member (field and method) labels, C over case labels, x over variables and body over literal bodies. Tuple types are a mode of use of object types, so they are not included in the abstract syntax. For concision, we continue to write unit as () and pairs as (i1, i2) in abstract syntax as needed.

type names to their declarations. Such typing judgements do not fully specify whether, when writing a typechecker, the type should be considered an input or an output. In some situations, a type propagates in from the surrounding syntactic context (e.g. when the term appears as a function argument, or an explicit ascription has been provided), so that we simply need to analyze e against it. In others, we need to synthesize a type for e (e.g. when the term appears at the top-level). Here, this distinction is crucial: a literal can only appear in an analytic context.
Bidirectional type systems [28] make this distinction explicit by specifying the type system instead using two simultaneously defined typechecking judgements corresponding to these two situations. To support TSLs, we need to also, simultaneously with this process, perform an elaboration from external terms, which contain literals, to internal terms, i, the syntax for which is shown on the right side of Fig. 8. Internal terms contain neither literals nor the form for accessing the metadata of a named type explicitly (the elaboration process inserts the statically known metadata value, tracked by the named type context, directly). This manner of specifying a type-directed mapping from external terms to a smaller collection of internal terms, which are the only terms that are given a dynamic
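The synthesis/analysis split can be sketched with a tiny checker. The Python below is our own toy, not TSL Wyvern's rules: it synthesizes types for variables, ascriptions and applications, analyzes lambdas and literals, and rejects a literal in a synthetic position, mirroring the restriction above.

```python
# Toy bidirectional typechecker. Terms and types are nested tuples, e.g.
# ("lam", "x", body) and ("arrow", t1, t2); ("named", "Nat") is a named type.

def synth(ctx, e):
    kind = e[0]
    if kind == "var":
        return ctx[e[1]]
    if kind == "asc":                        # ascription: analyze body, return tau
        _, tau, body = e
        analyze(ctx, body, tau)
        return tau
    if kind == "ap":                         # application synthesizes
        _, func, arg = e
        tf = synth(ctx, func)
        assert tf[0] == "arrow"
        analyze(ctx, arg, tf[1])
        return tf[2]
    raise TypeError("cannot synthesize a type for " + kind)

def analyze(ctx, e, tau):
    kind = e[0]
    if kind == "lam":                        # lambdas are analytic
        assert tau[0] == "arrow"
        _, x, body = e
        analyze({**ctx, x: tau[1]}, body, tau[2])
    elif kind == "lit":
        # In TSL Wyvern this is the dispatch point: the TSL associated with
        # the expected named type would parse the literal body here.
        assert tau[0] == "named", "literals analyze only against named types"
    else:                                    # fall back to synthesis + comparison
        assert synth(ctx, e) == tau
```

A literal reaching `synth` raises an error, just as lit[body] has no synthesis rule in the formal system.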

4.2 External Terms

The bidirectional typechecking and elaboration rules for external terms are specified beginning in Fig. 10. Most of the rules are standard for a simply typed lambda calculus with labeled sums and labeled products, and the elaborations are direct to a corresponding internal form. We refer the reader to standard texts on type systems (e.g. [9]) to understand the basic constructs, and to course material on bidirectional typechecking for background. In our presentation, as in many simple formulations, all introductory forms are analytic and all elimination forms are synthetic, though this can be relaxed in practice to support some additional idioms. The introductory form for object types, enew {m}, prevents the manual introduction of parse streams (only the semantics can introduce parse streams, to permit us to enforce hygiene, as we will discuss below). The auxiliary judgement Γ ⊢T;Θ m ⇝ ṁ ⇐ ω analyzes the member definitions m against the member declarations ω while rewriting them to the internal member definitions, ṁ. Method definitions involve a self-reference, so the judgement keeps track of the type name, T. We implicitly assume that member definitions and declarations are congruent up to reordering. The introductory form for case types is written einj[C](e), where C is the case name and e is the associated data. The type of the data associated with each case is stored in the case type's declaration, χ. Because the introductory form is analytic, multiple case types can use the same case names (unlike in, for example, ML). The elimination form, ecase(e) {r}, performs simple exhaustive case analysis (we leave support for nested pattern matching as future work) using the auxiliary judgement Γ ⊢Θ r ⇝ ṙ ⇐ χ ⇒ τ, which checks that each case in χ appears in a rule in the rule sequence r, elaborating it to the internal rule sequence ṙ. Every rule must synthesize the same type, τ.
The rule T-metadata shows how the appropriate metadata is extracted from the named type context and inserted directly in the elaboration. We will return to the rule T-toast when discussing hygiene.

4.3 Literals

In the example in Fig. 4, we showed a TSL being defined using a parser generator based on Adams grammars. As we noted, a parser generator can itself be seen as a TSL for a parser, and a parser is the fundamental construct that becomes associated with a type to form a TSL. The declaration for the prelude type Parser, shown in Fig. 5, specifies an object type with a parse function taking in a ParseStream and producing a Result, which is a case type indicating either that parsing succeeded, in which case an elaboration of type Exp is paired with the remaining parse stream (to allow one parser to call another), or that parsing failed, in which case an error message and location are provided. This function is called by the typechecker when analyzing the literal form, as specified by the key rule of our system, T-lit, shown in Fig. 11. Note that we do not explicitly handle failure in the specification, but in practice we would use the data provided in the failure case to report the error to the user. The rule T-lit operates as follows:

1. This rule requires that the prelude is available. For technical reasons, we include a check that the prelude was actually included in the named type context.

    Θ₀ ⊆ Θ    T ↪ [δ, i_m : HasTSL] ∈ Θ    parsestream(body) = i_ps
    i_ap(i_prj[parse](i_prj[parser](i_m)); i_ps) ⇓ i_inj[OK]((i_ast, i′_ps))
    i_ast ↑ ê    Γ; ∅ ⊢_Θ ê ⇝ i ⇐ named[T]
    ------------------------------------------------------------------ (T-lit)
    Γ ⊢_Θ lit[body] ⇝ i ⇐ named[T]

Fig. 11: Statics for external terms, e, continued. This is the key rule (described below).

2. The metadata of the type the literal is being checked against, which must be of type HasTSL, is extracted from the named type context. Note that in a language with subtyping or richer forms of type equality, which would be necessary for situations where the metadata might serve other roles, the check that i_m defines a TSL would be performed explicitly (as an additional premise).

3. A parse stream, i_ps, which is an internal term of type named[ParseStream], is generated from the body of the literal. This is an object that allows the TSL to read the body and supports some additional conveniences, discussed further below.

4. The parse method is called with this parse stream. If it produces the appropriate case containing a reified elaboration, i_ast (of type Exp), and the remaining parse stream, i′_ps, then parsing was successful. Note that we use shorthand for pairs in the rule for concision, and the relation i ⇓ i′ defines evaluation to a value (the maximal transitive closure, if it exists, of the small-step evaluation relation in Fig. 15).

5. The reified elaboration is dereified into a corresponding translational term, ê, as specified in Fig. 12. The syntax for translational terms mirrors that of external terms, but does not include literal forms. It adds the form spliced[e], representing an external term spliced into a literal body. The key rule is U-Spl. The only way to generate a translational term of this form is by asking for (a portion of) a parse stream to be parsed as a Wyvern expression.
The reified form, unlike the translational form it corresponds to, does not contain the expression itself, but rather just the portion of the parse stream that should be treated as spliced. Because parse streams (and thus portions thereof) can originate only metatheoretically (i.e. from the compiler), we know that e must be an external term written concretely by the TSL client in the body of the literal being analyzed. This is key to guaranteeing hygiene in the final step, below. The convenience methods parse_exp and parse_id return a value having this reified form corresponding to the first external term found in the parse stream (but, as just described, not the term itself) paired with the remainder of the parse stream. These methods are not themselves treated specially by the compiler but, for convenience, are associated with ParseStream.

6. The final step is to typecheck and elaborate this translational term using the bidirectional typing judgements shown in Fig. 14. These judgements have a form similar to those for external terms, but with the addition of an outer typing context, written Γ_out in the rules. This holds the context that the literal appeared in, so that the main typing context can be emptied to ensure that the elaboration is hygienic, as we will describe next. Each rule in Fig. 10 should be thought of as having a corresponding rule in Fig. 14. Two examples are shown for concision.
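The six steps of T-lit can be sketched operationally. The Python below is an illustrative approximation, not the Wyvern implementation: ParseStream, Ok, and the helper names are invented for this sketch, and dereification is stubbed out (the real rules are in Fig. 12).

```python
from dataclasses import dataclass

@dataclass
class ParseStream:
    body: str          # step 3: only the "compiler" constructs these

@dataclass
class Ok:              # the success case of the (notional) Result type
    ast: object        # reified elaboration, of (notional) type Exp
    rest: ParseStream  # remaining stream, so one parser can call another

def dereify(ast):
    # step 5: reified AST -> translational term. Stubbed here; Fig. 12
    # gives the actual rules. We treat the AST as already translational.
    return ast

def check_literal(named_ctx, outer_ctx, type_name, body, check_translational):
    # steps 1-2: look up the type's metadata (which must implement HasTSL)
    metadata = named_ctx[type_name]
    # step 4: run the TSL's user-defined parser on the literal body
    result = metadata.parse(ParseStream(body))
    if not isinstance(result, Ok):
        raise SyntaxError(f"TSL for {type_name} rejected literal: {result!r}")
    translational = dereify(result.ast)               # step 5
    # step 6: typecheck hygienically -- the local context starts empty;
    # only spliced fragments may consult outer_ctx (Sec. 4.4)
    return check_translational(outer_ctx, {}, translational, type_name)
```

A toy TSL would then be an object whose parse method returns Ok(ast, rest); the compiler never exposes ParseStream construction to user code.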

    i ↑ ê    i ↑ τ

    i_id ↔ x
    --------------------------------- (U-Var)
    i_inj[Var](i_id) ↑ x

    i₁ ↑ τ    i₂ ↑ ê
    --------------------------------- (U-Asc)
    i_inj[Asc]((i₁, i₂)) ↑ hasc[τ](ê)

    i_id ↔ x    i ↑ ê
    --------------------------------- (U-Lam)
    i_inj[Lam]((i_id, i)) ↑ hlam(x.ê)

    i₁ ↑ ê₁    i₂ ↑ ê₂
    --------------------------------- (U-Ap)
    i_inj[Ap]((i₁, i₂)) ↑ hap(ê₁, ê₂)

    body(i_ps) = body    eparse(body) = e
    --------------------------------- (U-Spl)
    i_inj[Spliced](i_ps) ↑ spliced[e]

    i_id ↔ T
    --------------------------------- (U-N)
    i_inj[Named](i_id) ↑ named[T]

    i₁ ↑ τ₁    i₂ ↑ τ₂
    --------------------------------- (U-A)
    i_inj[Arrow]((i₁, i₂)) ↑ arrow[τ₁, τ₂]

Fig. 12: Dereification rules, used by rule T-lit (above) to determine the translational term encoded by an internal term of type named[Exp]. We assume a bijection between internal terms of type named[ID] (written i_id) and variables, type names, and case and member labels.

    i ↓ i′    τ ↓ i

    x ↔ i_id
    --------------------------------- (R-Var)
    x ↓ i_inj[Var](i_id)

    τ ↓ i₁    i ↓ i₂
    --------------------------------- (R-Asc)
    i_asc[τ](i) ↓ i_inj[Asc]((i₁, i₂))

    x ↔ i_id    i ↓ i′
    --------------------------------- (R-Lam)
    i_lam(x.i) ↓ i_inj[Lam]((i_id, i′))

    i₁ ↓ i′₁    i₂ ↓ i′₂
    --------------------------------- (R-Ap)
    i_ap(i₁; i₂) ↓ i_inj[Ap]((i′₁, i′₂))

    T ↔ i_id
    --------------------------------- (R-N)
    named[T] ↓ i_inj[Named](i_id)

    τ₁ ↓ i₁    τ₂ ↓ i₂
    --------------------------------- (R-A)
    arrow[τ₁, τ₂] ↓ i_inj[Arrow]((i₁, i₂))

Fig. 13: Reification rules, used by the itoast ("to AST") operator (Fig. 15) to permit generating an internal term of type named[Exp] corresponding to the value of the argument (a form of serialization).

4.4 Hygiene

A concern with any term rewriting system is hygiene: how should variables in the elaboration be bound? In particular, if the rewriting system generates an open term, then it is making assumptions about the names of variables in scope at the site where the TSL is being used, which is incorrect. Those variables should only be identifiable up to alpha renaming, and only the user of a TSL knows which variables are in scope. The strictest rule would simply reject all open terms, but in our setting this would prevent even spliced terms from referring to local variables. Spliced terms are written by the TSL client, who is aware of variable bindings at the use site, so this should be permitted. Furthermore, the variables in spliced terms should be bound as the client expects.
The elaboration should not be able to surreptitiously or accidentally shadow variables in spliced terms that may be otherwise bound at the use site (e.g. by introducing a variable tmp outside a spliced term that leaks into the spliced term). The solution to both of these issues, given what we have outlined above, is now quite simple: we have constructed the system so that we know which sub-terms originate from the TSL client, marking them as spliced[e]. These terms are permitted to refer only to

    Γ_out; Γ ⊢_Θ ê ⇝ i ⇐ τ    Γ_out; Γ ⊢_Θ ê ⇝ i ⇒ τ

    x : τ ∈ Γ
    --------------------------------- (H-var)
    Γ_out; Γ ⊢_Θ x ⇝ x ⇒ τ

    Γ_out; Γ, x : τ₁ ⊢_Θ ê ⇝ i ⇐ τ₂
    ------------------------------------------------- (H-abs)
    Γ_out; Γ ⊢_Θ hlam(x.ê) ⇝ ilam(x.i) ⇐ arrow[τ₁, τ₂]

    Γ_out ⊢_Θ e ⇝ i ⇐ τ
    --------------------------------- (H-spl-A)
    Γ_out; Γ ⊢_Θ spliced[e] ⇝ i ⇐ τ

    Γ_out ⊢_Θ e ⇝ i ⇒ τ
    --------------------------------- (H-spl-S)
    Γ_out; Γ ⊢_Θ spliced[e] ⇝ i ⇒ τ

Fig. 14: Statics for translational terms, ê. Each rule in Fig. 10 corresponds to an analogous rule here obtained by threading the outer context through opaquely (e.g. the rules for variables and functions, shown here). The outer context is used only by the rules for spliced[e], representing external terms that were spliced into TSL bodies. Note that elaboration is implicitly capture-avoiding here (see Sec. 6).

    i ↦ i′
    --------------------------------- (D-Toast-1)
    itoast(i) ↦ itoast(i′)

    i val    i ↓ i′
    --------------------------------- (D-Toast-2)
    itoast(i) ↦ i′

Fig. 15: Dynamics for internal terms, i. Only internal terms have a dynamic semantics. Most constructs in TSL Wyvern are standard and omitted, as our focus in this paper is on the statics. The only novel internal form, itoast(i), extracts an AST (of type named[Exp]) from the value of i, as shown.

Spliced sub-terms are permitted to refer only to variables in the client's context, Γ_out, as seen in the premises of the two rules pertaining to this form (one for analysis, one for synthesis). The portions of the elaboration that aren't marked in this way were generated by the TSL provider, so they can refer only to variables introduced earlier in the elaboration, tracked by the context Γ, initially empty. The two are kept separate. If the TSL wishes to introduce values into spliced terms, it must do so via a function application (as in the TSL for Parser discussed earlier), ensuring that the client has full control over variable binding.

4.5 From Values to ASTs

Under this formulation, elaborations containing free variables are always erroneous. In some rewriting systems, a free variable is not an error, but is instead replaced with the AST corresponding to the value of the variable at the generation site. We permit this explicitly by including the form toast(e).
This simply takes the value of e and reifies it, producing a term of type Exp, as specified in Figs. 13 and 15. The rules for reification, used here, and dereification, used in the literal rule above, are dual. The TSL associated with Exp, implementing quasiquotes, can perform free variable analysis and insert this form automatically, so it need not be inserted manually in most cases. That is, while Var("x") : Exp elaborates to x, which is ill-typed in an empty context, x : Exp produces the translational term htoast(spliced[x]), which will elaborate to itoast(x) in the context where the quotation appears (i.e. in the TSL definition), thus behaving as described without requiring that quotations be entirely implemented by the language. This can be seen as a form of serialization and could be implemented as a library using reflection or compile-time metaprogramming techniques (e.g. [20]).
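The hygiene rules of Fig. 14 can be sketched operationally. The Python below is an illustrative approximation (terms are tagged tuples; names are our own, not TSL Wyvern's): TSL-generated code resolves variables in a local context that starts empty, while spliced client code resolves them in the use-site context, so nothing the TSL introduces can capture the client's variables.

```python
def check_hyg(outer_ctx, local_ctx, term):
    """Check a translational term; return its type."""
    tag = term[0]
    if tag == "var":
        # H-var: TSL-generated variables resolve in the local context only
        _, x = term
        if x not in local_ctx:
            raise NameError(f"elaboration refers to unbound variable {x!r}")
        return local_ctx[x]
    if tag == "lam":
        # H-abs: binders introduced by the TSL extend the local context
        _, x, x_ty, body = term
        return ("arrow", x_ty,
                check_hyg(outer_ctx, {**local_ctx, x: x_ty}, body))
    if tag == "spliced":
        # H-spl: client-written code sees only the use-site context, and is
        # unaffected by variables the TSL happened to introduce around it
        _, x = term
        return outer_ctx[x]
    raise ValueError(f"unknown term form: {term!r}")
```

For example, if the TSL wraps a spliced client expression in its own binder named tmp, the client's tmp is still the one that is seen inside the splice: checking ("lam", "tmp", "Str", ("spliced", "tmp")) against a use-site context where tmp : Int yields arrow[Str, Int], not arrow[Str, Str].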

    Γ ⊢_Θ i ⇐ τ    Γ ⊢_Θ i ⇒ τ

    T ↪ [ot[ω], μ] ∈ Θ    Γ ⊢_Θ^T ṁ : ω
    --------------------------------- (IT-new)
    Γ ⊢_Θ inew{ṁ} ⇒ named[T]

Fig. 16: Statics for internal terms, i. Each rule in Fig. 10 except T-metadata corresponds to an analogous rule here obtained by removing the elaboration portion. Only the rule for object introduction differs, in that we no longer restrict the introduction of parse streams (internal terms are never written directly by users of the language).

4.6 Metatheory

The semantics we have defined constitute a type-safe language. We outline the key theorems and lemmas here, referring the reader to an accompanying technical report for fuller details [24]. The two key theorems are internal type safety and type preservation of the elaboration process. To prove internal type safety, we must define a bidirectional typing judgement for the internal language, shown and described in Fig. 16 (by the external type preservation theorem, however, we should never need to actually implement this). We must also define a well-formedness judgement for named type contexts (not shown).

Theorem 1 (Internal Type Safety). If ⊢ Θ and ∅ ⊢_Θ i ⇐ τ or ∅ ⊢_Θ i ⇒ τ, then either i val or i ↦ i′ such that ∅ ⊢_Θ i′ ⇐ τ.

Proof. The dynamics, which we omit for concision, are standard, so the proof is by a standard preservation and progress argument. The only interesting case of the proof involves itoast(i), for which we need the following lemma.

Lemma 1 (Reification). If Θ₀ ⊆ Θ and ∅ ⊢_Θ i ⇐ τ then i ↓ i′ and ∅ ⊢_Θ i′ ⇐ named[Exp].

Proof. The proof is by a straightforward induction. Analogous lemmas about reification of identifiers and types are similarly straightforward.

If the elaboration of a closed, well-typed external term generates an internal term of the same type, then internal type safety implies that evaluation will not go wrong, achieving type safety. We generalize this argument to open terms by defining a well-formedness judgement for contexts (not shown).
The relevant theorem is below:

Theorem 2 (External Type Preservation). If ⊢ Θ and Θ ⊢ Γ and Γ ⊢_Θ e ⇝ i ⇐ τ or Γ ⊢_Θ e ⇝ i ⇒ τ, then Γ ⊢_Θ i ⇐ τ.

Proof. We proceed by induction over the typing derivation. Nearly all the elaborations are direct, so the proof is by straightforward applications of induction hypotheses and lemmas about well-formed contexts. The only cases of note are:

- e = enew{m}. Here the corresponding rule for the elaboration is identical but more permissive, so the induction hypothesis applies.
- e = emetadata[T]. Here, the elaboration generates the metadata value directly. Well-formedness of Θ implies that the metadata term is of the type assigned.

- e = lit[body]. Here, we need to apply internal type safety as well as a mutually defined type preservation lemma about translational terms, below.

Lemma 2 (Translational Type Preservation). If ⊢ Θ and Θ ⊢ Γ_out and Θ ⊢ Γ and dom(Γ_out) ∩ dom(Γ) = ∅ (which we can assume implicitly due to alpha renaming) and Γ_out; Γ ⊢_Θ ê ⇝ i ⇐ τ or Γ_out; Γ ⊢_Θ ê ⇝ i ⇒ τ, then Γ_out ∪ Γ ⊢_Θ i ⇐ τ.

Proof. The proof by induction over the typing derivation follows the same outline as above for all the shared cases. The outer context is threaded through opaquely when applying the inductive hypothesis. The only rules of note are the two for spliced external terms, which require applying the external type preservation theorem recursively. This is well-founded by a metric measuring the size of the spliced external term, written in concrete syntax, since we know it was derived from a portion of the literal body.

Moving up to the level of programs, we can prove the correctness-of-compilation theorem below. Together, these results imply that derivation of the compilation judgement produces an internal term that does not go wrong.

Theorem 3 (Compilation). If ρ ⇝ Θ; i : τ then ⊢ Θ and ∅ ⊢_Θ i ⇐ τ.

Proof. We simply need a lemma about checking type declarations and the result follows straightforwardly.

Lemma 3 (Type Declaration). If Θ₀ ⊢ θ ⇝ Θ then Θ₀ ⊆ Θ.

Proof. The proof is a simple induction using the definition of ⊢ Θ (not shown).

4.7 Decidability

Because we are executing user-defined parsers during typechecking, we do not have a straightforward statement of decidability (i.e. termination) of typechecking: the parser might not terminate, because TSL Wyvern is not a total language (due to self-reference in methods). Undecidability of typechecking arises strictly for this reason; typechecking of terms not containing literals is guaranteed to terminate. Termination of parsers and parser generators has previously been studied (e.g.
[15]) and these techniques can be applied to user-defined parsing code to increase confidence in termination. Few compilers, even those with high demands for correctness (e.g. CompCert [17]), have made it a priority to fully verify and prove termination of the parser, because it is perceived that most bugs in compilers arise from incorrect optimization passes, not initial parsing.

5 Corpus Analysis

We performed a corpus analysis on existing Java code to assess how frequently there are opportunities to use TSLs. As a lower bound for this metric, we examined String arguments passed into Java constructors, for two reasons:

1. The String type may be used to represent a large variety of notations, many of which could be expressed using TSLs.
2. We hypothesized that opportunities to use TSLs would often arise when instantiating an object.

Methodology. We ran our analysis on a recent version ( r) of the Qualitas Corpus [33], consisting of 107 Java projects, and searched for constructors that used Strings that could be substituted with TSLs. To perform the search, we used command line tools, such as grep and sed, and text editor features such as search and substitution. After we found the constructors, we chose those that took at least one String as an argument. Via a visual scan of the names of the constructors and their String arguments, we inferred how the constructors and the arguments were intended to be used. Some additional details are provided in the technical report [24].

Results. We found 124,873 constructors and that 19,288 (15%) of them could use TSLs. Table 1 gives more details on the types of String arguments we found that could be substituted with TSLs. The Identifier category comprises process IDs, user IDs, column or row IDs, etc. that usually must be unique; the Pattern category includes regular expressions, prefixes and suffixes, delimiters, format templates, etc.; the Other category contains Strings used for ZIP codes, passwords, queries, IP addresses, versions, HTML and XML code, etc.; and the Directory path and URL/URI categories are self-explanatory.

Table 1: Types of String arguments in Java constructors that could use TSLs

    Type of String                              Number    Percentage
    Identifier                                  15,642    81%
    Directory path                                 823     4%
    Pattern                                        495     3%
    URL/URI                                        396     2%
    Other (ZIP code, password, query,
      HTML/XML, IP address, version, etc.)       1,932    10%
    Total:                                      19,288   100%

Limitations. There are three limitations to our corpus analysis. First, the proxy that we chose for finding how often TSLs could be used in existing Java code is imprecise: our corpus analysis focused exclusively on Java constructors and thus did not consider other programming constructs, such as method calls, assignments, etc., that could also use TSLs.
We also did not count types that could themselves have a TSL associated with them (e.g. URL), only uses of Strings that we hypothesized might not have been Strings had better syntax been available. Second, our search for constructors using command line tools and text editor features may not have identified every Java constructor present in the corpus. Finally, the inference of the intended functionality of the constructor and the passed-in String argument was based on the authors' programming experience and was thus subjective. Despite these limitations, our corpus analysis shows that there are many potential use cases where type-specific languages could be considered, given that numerous String arguments appeared to specify a parseable format.
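The scan described under Methodology can be approximated in code. The sketch below is a rough, illustrative Python version: the regular expression and the classification heuristics are our own assumptions for demonstration, not the paper's actual tooling, which relied on grep/sed and manual review of each match.

```python
import re
from collections import Counter

# Matches constructor calls that pass a string literal as the first
# argument, e.g. new File("/tmp/a.txt"). A real scan would need a Java
# parser to handle comments, escapes, and non-leading String arguments.
NEW_WITH_STRING = re.compile(r'new\s+([A-Z]\w*)\s*\(\s*"([^"]*)"')

def classify(s):
    """Crude guess at the notation a string literal encodes."""
    if re.fullmatch(r'[A-Za-z_][\w.-]*', s):
        return "Identifier"
    if s.startswith(("http://", "https://")):
        return "URL/URI"
    if "/" in s or "\\" in s:
        return "Directory path"
    if any(c in s for c in "*+?[]{}|^$%"):
        return "Pattern"
    return "Other"

def scan(java_source):
    """Count string-literal constructor arguments by category."""
    counts = Counter()
    for _cls, arg in NEW_WITH_STRING.findall(java_source):
        counts[classify(arg)] += 1
    return counts
```

Running scan over a project's concatenated sources yields a per-category tally analogous to Table 1, though without the manual judgment the paper's numbers rest on.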
