Cloud Security, For Real This Time: Homomorphic Encryption and the Future of Data Privacy. That’s the title of my presentation at the next Central Ohio OWASP Quarterly Seminar, on 27 February at 1:00 p.m. Dan King, from Microsoft, will be talking about single sign-on for federated Dynamics CRM, very practical stuff which is in real world use today. I, on the other hand, will be talking about technologies which don’t quite exist in fully practical forms today, but which I predict will change the Internet in the decade to come and which I find mind-expanding to even read about.

In the meantime, I’ll leave you with some decidedly weird Python code:

I had an interesting bug in my compiler: The parser would fail on blank lines. To a certain degree, this makes sense; the formal grammar of the language does not include blank lines. This is invalid input! On the other hand, every programming language ever invented, as far as I know, simply ignores them. That sounds simple, but, as the author of the compiler, it is necessary to ask: Where is the correct place to ignore a blank line?

If a compiler exists to transform source code into an executable program, what is the rationale for having lexing as a separate "phase?" Partially, it’s an implementation detail: Treating lexing separately can improve performance and makes certain features easier to implement. There is a theoretical justification as well; if you specify the PL as a context-free grammar using terminals which are lexemes specified with regular expressions, then it makes sense to implement these concerns separately.

Compiler textbooks do not always do a great job of explaining the border between lexing and parsing in a general sense. For example, "Compilers: Principles, Techniques, & Tools," ("the dragon book"), Second Edition, has this to say on p. 6: "Blanks separating the lexemes would be discarded by the lexical analyzer." That is true enough for the toy example in question on that page, but clearly wrong for a language with significant whitespace, like Python or Haskell. Significant whitespace is a reasonably common programming language feature, but the dragon book mostly ignores it.

Whitespace is not significant in C# or VB.NET, but Roslyn does not discard it. Instead, Roslyn syntax nodes store whitespace, comments, etc., in SyntaxTrivia fields. The intention is that the full, original source text can be rebuilt from the syntax tree.

Lexers tend to treat keywords (reserved words) distinctly from identifiers such as variable or type names. The dragon book does suggest why this is the case, but, as Frans Bouma pointed out, does so in terms "requiring knowledge found hundreds of pages further on [in the book]." Other books have a similar problem: "Modern Compiler Implementation in ML" states, "A lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language." But it does this well before the notion of a "unit in the grammar of a programming language," is really explained.

The Coursera compilers course finally made this click for me: The point of a lexer is to provide input to a parser. Put differently, a lexer should supply a valid sequence of terminals to the parser, given valid (source code) input. So you need to understand parsers before you can really understand a lexer. Indeed, it may make more sense to teach compilers "backwards."

In the same way that lexers operate on a sequence of characters or code points, parsers operate on a sequence of terminals. Hence, the lexer should aim to produce a valid enumeration of terminals in the parser’s context-free grammar, given valid input.

Therefore: If the parser’s terminals are characters, then you do not need a lexer and you are doing scannerless parsing. If the parser’s terminals are "more than individual characters," keywords, formatted numbers, identifiers, and the like, then the job of the lexer is to produce terminals, and only terminals, on correctly-formatted source code.

This tells me that the correct place for me to remove blank lines in my compiler is in the lexer.

I’m building a small compiler for a toy language which emits .NET executables, implemented in F#. Demo compilers are a dime a dozen, but there are a few things which make this project distinct:

No lexer or parser generators are used. The entire compiler is written from the ground up, and is intended to be fairly simple code.

The source code is idiomatic F#, and is almost totally purely functional.

The project started as a fairly simple port of Jack Crenshaw’s code from his classic series, "Lets Build A Compiler," but deviated pretty quickly as I found I wanted to take the project in a slightly different direction.

In the name of simplicity, Crenshaw avoids using distinct lexing, parsing, etc. phases in his code. I completely understand his reasoning, but found that this was confusing to me personally, because the code I was writing did not resemble the code I read in compiler textbooks. Indeed, many "educational" compiler implementations swing the other way, using a "nanopass" (many distinct phases) design.

F# code directly ported from Pascal is ugly, and I wanted to use purely functional code whenever possible.

I had been attempting to port one "chapter" of Crenshaw’s code at a time, but I fairly quickly discovered that this is exactly backwards; it makes far more sense to implement a complete compiler and then break it down into chapter-sized steps. It is easier to design a road when you know what the destination will be! I do recognize the value of Crenshaw’s "learn by doing / just write the code" approach. I want to go back and "chapterize" the compiler when I’m done.

Crenshaw’s code emits 68000 ASM for SK*DOS as a string; mine emits a binary .NET assembly as a file.

As I develop the compiler, I have found that I use a back-and-forth cycle of adding features and beautifying code. New features tend to be pretty ugly until I understand how the code needs to work in the end, and then I go back and rewrite the uglier bits using the knowledge I have gained in the implementation.

For example, I have just added support for decoration and dereferencing of local variables, and, with that, multiple-line programs. This resulted in a considerable expansion of both the parser and the code generator. So I’m going to go back and simplify a number of things, especially error handling. I’m going to use a combination "railway oriented" design and error nodes in the parser.

The code does have moments of beauty, though. Here’s the top level of the compiler:

Coursera, Udacity, MIT Open Courseware, and other such sites are useful to me because they decouple the desire to learn college-level material from the expense and regulations of earning (another) diploma. The latter isn’t compelling to me today, but the former certainly is.

First, I found all three courses to be an effective way for me to learn the material presented. This is mostly because all three professors did an excellent job of selecting the exercises to complete for the course. Most of what I learned in all three courses came from completing the exercises. The quality of the lectures and the availability of written materials for offline study varied from course to course. But it’s hard to be too critical considering that the courses are free. For someone like me who cares much more about increasing my knowledge than receiving credit or some sort of official certificate, it almost seems foolish not to take at least one course at any given time.

Completing the courses gave me a chance to practice with some programming languages I don’t use in my professional work right now, like Scala, Python 3, and R.

Perhaps surprisingly, given that all three courses use Coursera’s services, the means of submitting assignments was wildly different from class to class.

For the Scala course, assignments were submitted via sbt, and graded by unit tests on the server. These tests were occasionally fallible; when the server was especially loaded, it could decide that a submission was experiencing infinite recursion (a real hazard in functional programming) when, in fact, it was simply executing the code slowly. At most other times, however, the tests were accurate and the feedback was pretty reasonable.

For the Social Network Analysis course, the programming assignments were an optional add-on. You had to attach the homework via Coursera’s site, and it would be graded somehow. It’s not clear to me what the mechanism was, but it always accepted my submissions. The final programming assignment was graded using a peer review system where three other students would rate your submission, and you would evaluate the submissions of three more students. I really appreciated the feedback I got on this; it seems that the other students put real effort into their feedback.

In the Linear Algebra course, the programming assignments were all in Python. You submit them using a custom Python script which analyzes your code using regular expressions and unit tests on the client and server. This system was the buggiest of the three; the client and server code were occasionally out of sync, and frequently rejected correct code. These issues were fixed as the course went on, but it made for some very frustrating hours.

The use of instructor-supplied unit testing varied from class to class, as well. Unit tests were always included in the Scala class assignments, although the grader would run additional tests on your code. The Linear Algebra class assignments occasionally included unit tests, but mostly did not. The grader always ran unit tests, but they were not usable by the students. This deficit was made up for by other students who would post reasonable unit tests to the class forum. The Social Network Analysis class never included any unit tests in the assignments at all.

My biggest disappointment with all of these courses, however, is non-technical. Coursera, like nearly everyone in academia, has an honor code which forbids sharing work. This stands in stark contrast to most other collaborative systems for learning programming, like Exercism, Project Euler, 4Clojure (and arguably GitHub) where you are expected to first solve a problem yourself, and then work with other developers to refine your solution into the best possible implementation. You learn how to do the work well, not just how to make tests pass.

With Coursera (with the sole exception of the Social Network Analysis final problem), you just can’t do that at all. There are forums, where you are allowed to opaquely discuss the assignment and your approach to it, but you cannot show any actual code for the assignment itself. In a real college course, you could review your work privately with a TA, so you’d at least get some feedback beyond pass/fail. There is no such option at Coursera. For me, this is the biggest barrier to real learning in Coursera, and it seems unnecessary, given that "cheating" on a zero-credit course you take for fun robs only yourself.

There are other areas in which I think Coursera could profit by deviating from the way things are done in college. It’s not clear to me why a fixed schedule is required for these classes. Why not give students the option of "auditing" a course on their own time? Since the lectures are pre-recorded and the assignments are, for the most part, graded mechanically, it seems like this should be possible. Having a more flexible schedule would allow students who cannot commit the roughly 6 through 16 hours per week that most of the classes seem to require.

These criticisms aside, however, the big picture is that all of the classes I have taken so far have been good because the professors did an excellent job of designing the courses. As long as Coursera continues to recruit such qualified teachers, I think their classes will be a good investment of my time.

I wrote a 15*15, NYT-style crossword puzzle for Strange Loop. On the NYT difficulty scale, it’s roughly a Wednesday-level puzzle. However, it was written for Strange Loop and thus does presume familiarity with functional programming and math, and has a few "inside jokes."

Behavioral interviewing also works — where you’re not giving someone a hypothetical, but you’re starting with a question like, “Give me an example of a time when you solved an analytically difficult problem.” The interesting thing about the behavioral interview is that when you ask somebody to speak to their own experience, and you drill into that, you get two kinds of information. One is you get to see how they actually interacted in a real-world situation, and the valuable “meta” information you get about the candidate is a sense of what they consider to be difficult.

I have interviewed candidates this way for years, and I can’t recall ever being wrong in my assessment of a developer’s general intelligence and technical skills. However, I have made mistakes! In particular, this method doesn’t necessarily give you a good indication of how well the candidate will get along with other members of the team, and whether or not they will behave professionally. I’m not as good at assessing that as I am at assessing technical skill.

Unsurprisingly, Google also found

[...] that brainteasers are a complete waste of time. How many golf balls can you fit into an airplane? How many gas stations in Manhattan? A complete waste of time. They don’t predict anything. They serve primarily to make the interviewer feel smart.

The article doesn’t mention anything about whiteboard coding, but I also find that useful in technical interviews.

I have to disagree with the headline, however. While big data might not be able to tell you who to hire, it clearly told Google that they were doing it incorrectly!

YAML’s security risks are in no way limited to Rails or Ruby. YAML documents should be treated as executable code and firewalled accordingly. Deserializing arbitrary types is user-controlled, arbitrary code execution.

It’s Not Just Ruby

A few weeks ago, I had a need to parse Jasmine’s jasmine.yml in some C# code. I spent some time looking at existing YAML parsers for .NET and ended up deciding that spending a couple of hours writing a lightweight, purpose-specific parser for jasmine.yml made more sense for my use case than including an off-the-shelf YAML parser which invariably turned out to be quite a heavyweight project.

Because all of these attacks landed within the Ruby community, observers have occasionally characterized this as a crisis for Rails or Ruby. I think that’s (at least a little) misguided. The real focus of our attention should be YAML.

It is very easy to demonstrate that the same vulnerabilities exist in other platforms. Here’s a two-line "attack" on Clojure’s YAML parser from Justin Leitgeb. I have to use the term "attack" here loosely, because all he is doing is deserializing an arbitrary class, which the YAML spec allows. But, in most environments, deserializing an arbitrary class is tantamount to code execution. The PyYAML library for Python has nearly the same vulnerability (though there is a workaround for it). I think that YAML-based attacks have tended to target Ruby projects simply because use of YAML is quite a bit more common amongst Rubyists, and certain prominent Ruby libraries in an internal way — users of these libraries may have no idea that the JSON input they supply, e.g., might be vulnerable to an internal YAML parser.

YAML leverages these primitives, and adds a simple typing system and aliasing mechanism to form a complete language for serializing any native data structure.

That is, in practice, remote code execution. If an external actor, such as the author of a YAML document, can cause your application to instantiate an arbitrary type, then they can probably execute code on your server.

On the other hand, even in environments which make a very careful distinction between data type construction and code execution, one can imagine vulnerable code. Consider the following Haskell data type:

Contrived, sure. But you may have more innocuously-named types which you don’t plan for random users to spin up. Haskell, indeed, makes such attacks harder, but it’s not a free pass.

It’s difficult to overstate the danger of remote code authentication. If someone can execute code on your server, they can probably own your data center.

The YAML spec is largely mute on the issue of security. The word "security" does not appear in the document at all, and malicious user input isn’t discussed, as far as I can see.

Types in the YAML Spec

The encoding of arbitrary types is discussed in the last section of the YAML spec, "Recommended Schemas." It specifies "tag resolution," which is, in practice, the mapping of YAML content to instantiated types during deserialization. This section defines four schemas which a compliant parser should understand. The first three, "Failsafe," "JSON," and "Core," define , define tag resolutions for common types like strings, numbers, lists, maps, etc., but don’t appear dangerous to me.

However, the last section of "Recommended Schemas" is a catchall called "Other Schemas." It notes,

None of the above recommended schemas preclude the use of arbitrary explicit tags. Hence YAML processors for a particular programming language typically provide some form of local tags that map directly to the language’s native data structures (e.g., !ruby/object:Set).

While such local tags are useful for ad-hoc applications, they do not suffice for stable, interoperable cross-application or cross-platform data exchange.

In practice, most YAML deserialization code will, by default, attempt to instantiate any type specified in the YAML file.

Some YAML parsers have a "safe" mode where they will only deserialize types specified in the tag resolution. For example, PyYAML has safe_load. Its documentation notes,

Warning: It is not safe to call yaml.load with any data received from an untrusted source! yaml.load is as powerful as pickle.load and so may call any Python function. Check the yaml.safe_load function though.

On the other hand, it’s not hard to imagine cases where allowing arbitrary object instantiation makes sense, such as in an application configuration file. An example in the .NET space would be XAML files. If you are defining a form or a workflow, then you want to be able to allow the instantiation of custom controls. There is no standard way to do this with, say, JSON files, so using a format like YAML makes sense here. (So does using a non-domain-specific language like Ruby, but that presumes that Ruby is available and might not be suitable for cross-language work.)

HTML input elements have a placeholder attribute which you can use to show a bit of text to prompt the end user. Although you can make an editable div by using the contenteditable attribute, it will not support the placeholder attribute. I needed to do both, so I ended up reinventing the placeholder attribute for editable divs. Here’s how I did it.

I wanted the "placeholder" in the editable div to behave as much as possible like a "real" placeholder in an input element. So I started by making a list of requirements:

The placeholder text should cosmetically resemble an input’s placeholder.

All markup used should be valid.

The placeholder text should not appear in the DOM.

The placeholder text should disappear when the div contains any text or when it has input focus.

I decided to use a data-placeholder attribute in lieu of a placeholder attribute since this is valid for a div. In order to prevent the placeholder text from appearing in the DOM, I used the CSS :before pseudo-element to hold the text. Making the placeholder text disappear when the div has focus is easy, but making it disappear when the div contains text is harder, because the :contains pseudo-class doesn’t exist. So I had to use a bit of JavaScript and another data- attribute for that.

div[data-placeholder] The placeholder should only be shown on divs which have a data-placeholder attribute.

:not(:focus) Hide the placeholder when the div is focused.

:not([data-div-placeholder-content]) Hide the placeholder when the div has a data-div-placeholder-content attribute. See JavaScript below. This is my workaround for the problem that :contains doesn’t exist.

:before This is a CSS pseudo element. The user will see the text, but it won’t be in the (standard) DOM.

content: attr(data-placeholder); The content of the pseudo element should be the text contained in the data-placeholder attribute.

The rest is just cosmetics.

In order to hide the "placeholder" text when the div contains user-supplied text, I had to resort to a bit of JavaScript/jQuery:

This code runs whenever any div containing a data-placeholder attribute on the document receives a keydown or keypress event. It then sets the data-div-placeholder-content attribute of that div if it contains any text and clears it if not. Hopefully, that much is obvious. But what’s up with the change and input events? Do those even exist?

Well, sort of. divs don’t normally fire the change event when the user types. However, I needed a way to set the attribute state correctly if I ever change the div text in JavaScript code. Since there is no obvious event to use, I just send change. So code like this:

$("#myDiv").text("foo");

becomes:

$("#myDiv").text("foo").trigger("change");

Without explicitly triggering change, both the "foo" and the placeholder text would be shown, which is obviously undesirable.

input is a relatively new event and is not supported by all popular browsers at this time. It’s a better choice than keydown and keypress, but I include those two, as well as input, in an attempt to support as many browsers as possible.

I laughed at this, not because it implies some kind of reckless disregard for process and community, but because, in 1999, at a former employer, when our VCS was Microsoft SourceSafe, this was just the way that we went about our business. Times have changed!

I recently completed the Coursera Social Network Analysis class. This was my first time taking a Coursera class. In this post, I will describe my experience with Coursera generally, and review the Social Network Analysis class in particular.

The Social Network Analysis class is an introductory level class. No previous experience with social network analysis is presumed. The professor is Lada Adamic, of the University of Michigan. If you take the class, you will learn what social network analysis is, the basic mathematical techniques used, a few interesting applications, and have the chance to use software tools such as Gephi, R, NetLogo, and others.

One point which is perhaps not obvious to outsiders is that “social network analysis” is actually relevant to analyzing almost any network which can be modeled as a directed or undirected graph. The class includes lectures on using the techniques presented here on networks which are not actually social networks.

There is quite a bit more math underlying real-world social network analysis then an outsider might expect. Most so-called "social media experts" would probably find it incomprehensible. However, this class really only skims the surface of the math. Students who are not “mathematically inclined” probably won’t have any problem getting through the class. Students who really like math will probably want to follow up with something more in-depth.

The class has an optional programming component. You can opt to not complete any of the programming assignments, and still pass the class. However, I think the most valuable things I learned from the class came from completing the programming assignments. It’s one thing to hear about, say, a mathematical formula for determining some property of the network, but just reading this formula on the slide sometimes doesn’t really stick in my memory. If I actually implement this analysis in R, on the other hand, I have to think about what all of the variables mean. Also, in completing the programming assignments I had the opportunity to use some tools which I had never heard of before, like NetLogo, and other tools, like R, which I had heard about, but never actually used.

If you do complete the programming assignments, then you will receive a “with distinction” on the certificate (a PDF, really) you receive on successful completion of the class. Coursera, however, is not at this time an accredited institution, so you should probably take this class to fulfill your love of learning, rather than because you think it will enhance your resume.

I think the programming assignments would be somewhat difficult for students without prior programming experience. None of them were difficult for me to complete, but somebody without a general sense of common programming syntax, and without real-world experience of setting up slightly-novice-unfriendly tools, might be frustrated by the fairly low level of support included in the course. It is assumed that you are able to install and configure, for example, the R system, pick a text editor and a REPL, install libraries from various locations, figure out differing syntax in various systems, and deal with at least one required library which doesn’t always seem to return accurate results. No professional programmer will have a problem with any of this, but I can see several roadblocks to non-programmers. This is unfortunate, because the programming itself is not difficult, and really does help with understanding the material in the class.

Like just about any class I’ve ever taken anywhere, online or offline, there are “bugs” in the assignments. Judging by forum posts, many people seem to find this surprising, but, in my experience, it happens in traditional classrooms as well. In a few cases, a multiple-choice question had more than one correct answer. In other cases, selecting the correct answer would not give you credit. The Coursera staff would eventually fix these problems.

In contrast to the Functional Programming class, where your grade was completely determined by unit tests and static analysis against your Scala code, the Social Network Analysis class was more “traditional” in that there were multiple-choice-style homeworks and a timed final exam.

Because the class is free, the measure of its value is, “Is taking this course worth my time?” For me, the answer turned out to be a definite yes. I probably spent about 5 hours a week listening to the lectures and completing the assignments. Prior to taking the class, I spent some time reading about the subject matter. I found taking the class to be a more efficient way to really understand what the field was about. For that reason, I recommend it to anyone who needs to interact with networks, social or otherwise, and is not familiar with the field.