CMSC 330, Spring 2013

Organization of Programming Languages

Project 2

Due 11:59pm Thu, Feb 28, 2013

Introduction

In this project, you will implement classes and methods for working
with regular expressions. When you are done with this project, you
will be able to compile regular expressions to NFAs and DFAs, use them
to recognize strings. You will also conduct an experiment comparing
the performance of different regular expression implementations.

Code Overview

To get you started, we've supplied you several files containing code
that you'll use for this project. The file example.rb gives
some Ruby code showing how to use the other files. We discuss the
other three files in turn.

Regular expressions (re.rb)

The file re.rb contains five classes that we will use to
represent regular expressions as abstract syntax trees (ASTs):
REEpsilon (the empty string), REConst (one
character), REStar (Kleene closure), REUnion
(disjunction), and REConcat (concatenation). Each of these
classes has a constructor that takes the appropriate number of
arguments depending on how many subexpressions each syntactic form
has. For example, here are some regular expressions along with code
to create them using these classes.

The regular expression corresponding to r4 above is depicted
below; this figure illustrates why we call it an abstract syntax tree:

Each of the five classes also has accessor methods to get the child
expressions. These are created with the Ruby method attr_accessor
:name, which creates a getter method name to
retrieve field @name and a setter method name= to write
to it. For example, REUnion has two fields, expr1
and expr2, so we could write r3.expr1 (r3
is defined in the table above) to get the left disjunct
(REConst.new("a") in this case) and r3.expr2 to get the right
disjunct (REConst.new("b")).

We also included some utility methods in these classes: An ==
method to compare two regular expressions structurally, a
hash method to construct hash keys for regexps,
and a
to_s method to present a regular expression in textual
form. This last method adds lots of extra parentheses so that
precedence is clear.

Parsing regular expressions (parser.rb)

As you have no doubt guessed, constructing regular expression
ASTs by hand is tedious and error-prone. To help make it easier for
you to test your project, we've included a class Parser in
parser.rb to translate the more usual string representation
of a regular expression into the appropriate AST. To use
Parser, all you need to do is instantiate it and invoke its
parse(s) method, e.g., r = Parser.new.parse("abc")
will give you the AST r4 above.

This particular parser is not very sophisticated, and we do not
guarantee that it is bug free. Regular expressions in this language
are over upper- and lowercase alphabetic characters. As usual, Kleene
closure is *, and binds more tightly than conjunction
(represented by juxtaposition, as usual), which itself binds more
tightly than disjunction, which is |.

Finite automata (fa.rb)

Finally, fa.rb contains a class NFA that we will use
to represent non-deterministic and deterministic finite automata
(recall that a DFA is just a special case of an NFA). An
NFA instance contains three fields to represent the key parts of the
NFA: @start, the initial state; @final, an array
containing the final states; and @delta, a hash containing
the transitions.

In these NFAs, a "state" is an arbitrary Ruby object. For example,
the file example.rb uses integers to represent states. In
our formal translation of regexps to NFAs, we labeled states with
regexps, and you can do the same here as well. (Note that states
cannot be nil.)

The set of transitions @delta has, roughly speaking, type
Hash<State, Hash<String, Array<State>>>: it takes a
state and and produces a hash that takes a
symbol (a string of length 1) and produces an array of states. Thus,
@delta[s1][x] is an array containing all the states that we
transition to from state s1 on input symbol x. We
may examine this data structure directly during grading, so please do
not change the representation of @delta.

The class NFA includes several methods to help construct and
work with NFAs. The method set_start(s) sets the NFA's start
state to s. The method make_final(s) adds
s to the list of final states. The method add_trans(s1,
sym, s2) adds a transition from s1 to s2 on
symbol sym. Note that although sym is a string, it
must always have length 1. Next, the method states returns
an array containing all the states in the NFA. This includes the
start state, all the final states, and any state appearing in a
transition. The method to_s gives a readable representation
of an NFA. Finally, accept?(s) returns an accepting state if
the NFA accepts s, and nil otherwise.

Part 1: Implement check() and trans()

For the first part of the project, you will implement the
check(e) and trans(e) functions from slides 61 and 68,
respectively. Both of these functions work by recursively visiting
the AST of the regular expression. There are several ways to visit
ASTs in an object-oriented language, but we'll do something simple for
this project: For each of the five regular expression classes, you
will implement methods check and trans. (Thus,
instead of check(e) as we wrote in the slides, in this
project you'll call e.check, and similarly for
trans.) For the compound regular expressions (REStar,
REUnion, REConcat), these will likely recursively call check and
trans on the subexpressions. For example, REUnion#check will
call @expr1.check and @expr2.check.

Your method trans should return an Array where each element
is itself an Array whose first component is a String (a symbol) and
whose second component is a regexp. See the code for REConst#trans
for an example (we've already filled this in for you).

Part 2: Translate regular expressions to NFAs

Next, implement the method NFA.of_re(r), which, given a
regexp r returns an NFA accepting the same language. Use the
reduce algorithm given on slide 72.

Part 3: NFAs to DFAs

Next, write a method NFA#is_dfa? that returns true iff self
is a deterministic finite automaton, and write a method
NFA#to_dfa that returns a new DFA that accepts the same
language as self (which should not be modified). Use the algorithm on
slide 81 for NFA#to_dfa. You do not need to (and should not)
minimize the resulting DFA. Both is_dfa? and to_dfa
should work correctly if self has an implicit dead state (i.e., if
there are states that do not include transitions for all possible
inputs).

Part 4: Regexp bake off

In parts 1-3, you developed code to generate DFAs from regexps. For
the last part of the project, your job is to develop and conduct an
experiment that answers the following question: Which implementation
performs regular expression matching the fastest: (1) regexps
translated to NFAs, as implemented in parts 1-2; (2) regexps
translated to DFAs, as implemented in parts 1-3; or (3) native regexps
in Ruby?

To answer this question, at a minimum you will need to pick at least
six regular expressions, match the regexps against a large number of
different strings under each of the three approaches (enough so that
it takes at least a second or so), and time the results. Be sure to
add ^ and $ to the Ruby regexp to make it do exact matching, so that
you're comparing times for the same thing. For example, if you run
your implementation on the regexp "330", you should use the Ruby
regexp /^330$/.

On a Unix machine, you can use the "time" command to run a program and
measure how long it took (use "real," the wall-clock time, as your
measure). In Ruby, you can use Time.now.usec to get a high-resolution
(microsecond) time. To help produce more accurate results, run your
time trials for each variant at least 3 times and average the results.

Some things you will need to think about for this experiment are: If
you match a single regexp r against many different strings, you can
amortize the cost of translating r to an NFA/DFA, and only do that
once. How does that affect the results? Does the particular regular
expression matter? For example, does it matter if the regexp is long?
Includes many disjunctions? Includes Kleene closure? Does it matter
if the string being matched is long?

Write up your experiment and results in a short (1-2 page) plain text
or pdf file called experiment.txt or experiment.pdf,
and submit it along with the rest of your project file. Your writeup
should contain the following:

Your experimental design. Include enough information here that we
could reproduce the experiment if we wanted to. List the regular
expressions you use, how many times you run them, describe the kinds
of strings you match against, etc. (Tip: you can use the wget
url command on Linuxlab to download web pages; these might
be a good source of large amounts of text to search through.)

A description of the equipment on which you ran the experiment (if you
used a linuxlab machine, just give us the name of the machine; if you
used your own machine, describe its processor, speed, available
memory, and operating system).

The results of the experiment. Summarize the results of the
measurements you took. For example, if you ran several different
regexps against large amounts of text, you could report the average
search time for each regexp and each implementation (NFAs, DFAs, Ruby
internal) per input line.

A very brief analysis speculating why the experimental result is
as reported.

This part of the project will worth approximately 20% of the project
grade, and will be graded on a scale of 0-3 points:

3 points - solid experiment well above and beyond the minimum

2 points - solid, but minimal experiment

1 point - problems with the experiment

0 points - serious issues with experiment

We will not grade you on grammar or spelling, unless there are
problems to the point that we cannot understand your writeup.

Hints and Tips

Several of the project parts depend on each other. When we grade
your project, we will plug in correct versions of any necessary
methods; for example, when grading your regexp to nfa conversion, we
will "monkey patch" your code with our versions of check() and
trans(). While this should help make grading equitable, it also means
you need to be extra careful not to change any of the interfaces we
supply you with, since we will be relying on those in grading.

Beware that a hash cannot itself be
used as a key in another hash. For example, if you had h = {
} and then g = { }; g[h] = 1 then Ruby will
complain. This issue may arise if you use hashes to represent sets;
note that the NFA-to-DFA reduction takes a set of NFA states and
combines them into a single DFA state. If the latter is represented
as a hash-based set, then you will not be able to store it as the
key for a hash.

Public Tests

The zip file containing the public tests has several Ruby programs in
it; the public tests are in the files public_check_ol.rb,
public_is_dfa_ol.rb, etc. Once you have copied your
fa.rb and re.rb files into the
p2_public_tests directory, you should be able to run the
files there. For example,

If you look inside the test files, you will see that they invoke the
script tester.rb, which can be called in several ways. Here
are the possible sets of arguments:

checkre-file

The file re-file is expected to contain
regular expressions, one per line. For each, line it will call our
parser (in parser.rb) which in turn will invoke your code
to read in the regular expression and convert it to an NFA. Then it
will call the check method on each NFA and print the
result.

transre-file

The file re-file file is read in and each
regular expression is converted to an NFA, as above, but this time
it will invoke the trans method on each NFA and then will
print the resulting set of transitions.

of_rere-filestr-file

After reading in each regular expression in re-file and
converting it to an NFA, it will check whether the strings in
str-file (one per line) are matched by each NFA. It will
print the results of the match, and will print whether the matching
results differ from what Ruby's matcher would do, if you had typed
in the regular expression directly.

is_dfanfa-file

Reads in each NFA from the input file and invokes (and prints the
output of) your is_dfa? method on the NFA.

to_dfanfa-filestr-file

Converts each NFA in the input file to a DFA (using your
to_dfa method) and then attempts to accept each string in
str-file using the DFA. Makes sure that the behavior for the
DFA is the same as the original NFA it was converted from.

The public tests themselves invoke tester.rb and then compare
its output with an output file in the outputs/ directory. As
an example, the test public_check_ol.rb will invoke ruby
tester.rb check inputs/re-ol-2, which produces

false
true
false
false
false
false
true
false
true
false

(This output is compared against that in
outputs/check-re-ol-2.out.)
You can see that the input file is a regular expression file
inputs/re-ol-2 in which each line has a regular expression on
it. For each regexp, it will print whether that regexp accepts the
empty string (by calling the check method on the NFA your
code constructed from the regexp).

As another example, the test file public_of_re_ol.rb calls
ruby tester.rb of_re inputs/re-ol-2 inputs/str-ol-2 and thus
tries to match each string in inputs/str-ol-2 to each regular
expression in inputs/re-ol-2 (and the output is compared
against the file outputs/of_re-ol-2.out).

You are advised to construct your own tests by modifying these public
tests to have different strings to match, different regular
expressions to match against, etc.

Submission

You may either submit over the web or under command line interface.

Submitting over the web

Put your files (re.rb, fa.rb, and write-up) in a zip archive. Don't put things under a directory but on the top level.

Submit your archive directly to the
submit server
by clicking on the submit link in the column "web submission".

Next, use the submit dialog to submit your schedule.rb file directly.

Select your file using the "Browse" button,
then press the "Submit project!" button.

Submitting in command line interface

Make a directory and put your files in it. Download .submit and submit.jar in Project Files to the same directory. Then run the following command under that directory:

java -jar submit.jar

The first time you submit this way you will be asked to enter your
directory ID and password. All files in the directory (and its subdirectories) will then be put in a jar file and submitted to the submit server. If your submission is successful you will see the message:

Successful submission # received for project 2

Academic Integrity

The Campus Senate has adopted a policy asking students to include the
following statement on each assignment in every course: "I pledge on
my honor that I have not given or received any unauthorized assistance
on this assignment." Consequently your program is requested to
contain this pledge in a comment near the top.

Please carefully read the academic honesty section of the
course syllabus. Any evidence of impermissible cooperation on
projects, use of disallowed materials or resources, or unauthorized
use of computer accounts, will be submitted to the Student
Honor Council, which could result in an XF for the course, or
suspension or expulsion from the University. Be sure you understand
what you are and what you are not permitted to do in regards to
academic integrity when it comes to project assignments. These
policies apply to all students, and the Student Honor Council does not
consider lack of knowledge of the policies to be a defense for
violating them. Full information is found in the course
syllabus---please review it at this time.