Everything Scheme

Wednesday, May 31, 2006

How to Write an Unhygienic Macro - Introducing an Identifier Into the Lexical Context of a Macro Call

The standard syntax-rules macro system of R5RS Scheme is hygienic

If a macro transformer inserts a binding for an identifier (variable or keyword), the identifier will in effect be renamed throughout its scope to avoid conflicts with other identifiers.

and referentially transparent

If a macro transformer inserts a free reference to an identifier, the reference refers to the binding that was visible where the transformer was specified, regardless of any local bindings that may surround the use of the macro.

These properties of the macro system make the majority of macros easy to write and understand. In some situations it is nevertheless convenient to break the rules. A classic example is the if-it macro.

Syntax: (if-it test consequent alternative)

Semantics: An if-it expression is evaluated as follows: first, test is evaluated and its result is bound to it. If the result is a true value, then consequent is evaluated and its value(s) is(are) returned. Otherwise is evaluated and its value(s) is(are) returned. In consequent and alternative references to it will be bound to the it inserted by if-it.

It isn't too difficult to write such a macro with the syntax-case macro system. It is however a little tricky to ensure the resulting if-it macro can be used by other macros without knowledge of how if-it is implemented - at least until one discovers the following easy-to-use technique.

The solution uses two seldomly used functions namely syntax-local-introduce and syntax-local-get-shadower. The last of these are relatively unknown - a search reveals it is only used four times in total in the PLT code base - therefore a quote from the documentation is in order.

(syntax-local-get-shadower identifier) returns identifier if no binding in the current expansion context shadows identifier, if identifier has no module context, and if the current expansion context is not a module. If a binding of inner-identifier shadows identifier, the result is the same as (syntax-local-get-shadower inner-identifier), except that it has the location and properties of identifier. Otherwise, the result is the same as identifier with its module context (if any) removed and the current module context (if any) added. Thus, the result is an identifier corresponding to the innermost shadowing of identifier in the current context if its shadowed, and a module-contextless version of identifier otherwise.

In short syntax-local-get-shadower allows the macro writer to break referential transparency. The call (syntax-local-get-shadower#'it) will return the identifier to which it is bound at the site of the macro call (and not at the site of definition).

Inserting the result of (syntax-local-get-shadower#'it)directly into the template of if-it won't work as expected though. The macro system is hygienic by default, so the identifier will subjected to renaming and will therefore not bind uses at the call site. Preventing renaming is easy though, just call syntax-local-introduce.

To test that if-it behaves properly when used in the definition of other macros and also works with the module system I used the following tests. The four tests all return #t.

Saturday, May 20, 2006

Selection of the top search results

A google for "Scheme" gives "Results 1 - 100 of about 356,000,000" in 0.40 seconds. Selecting these top 100 results could be done by sorting the 356,000,000 results after rank and returning the 100 greatest. It is however very seldom that a user will ever want to examine more than the top 500 results, thus it is unnecessary to sort more than the top results.

The trick to find the n greatest elements of a list xs is to use a heap.

Begin by inserting the first n elements of the list into a heap. The elements of the heap should be thought of as the top n candidates.

For each remaining element x of the list we compare it with the minimum element of the heap. If x is smaller than the minimum element, then the n heap elements are all greater than x, and x is therefore not in the top n. If on the other hand x is larger than the minimum element, then x is inserted into the heap instead of the previous minimum element.

Now the heap contains the top-n elements, and by continually deleting the minimum element, we will get a sorted list of the top n elements.

To find the top n elements out of m elements in this way will use time O(m log(n)). This is to be compared with O(m log(m)) from the naïve solution.

Sunday, May 07, 2006

The Search Engine Materializes

Having an index over Scheme source is no fun without a way to perform searches and present results in an human-friendly way. This weekend I therefore wrote a little web-servlet, which takes a query, looks up the terms in the index, and returns links to documents containing all terms.

It didn't take many searches to realize that I need to spend some time on 1) ranking the results and 2) supporting both case sensitive and case insensitive queries.

In general ranking is difficult to get right, hopefully the narrow scope of indexing Scheme source only will help. Managing Gigabytes explains ranking in details, and since I keep track of the term frequencies in the index, the ground work has been done.

Supporting both case sensitive and insensitive searches with the same index can be done with a little trick: after tokenizing all terms are converted to lower case before they are put in the index. When a search is made the query is likewise converted to lower case before a search is made. The returned document numbers can be used directly for a case insensitive search. To avoid false matches for a case sensitive search, the actual documents are retrieved from a repository and a simple minded search for the query terms are made.

This approach makes sense when indexes are large and the repository is available. My plan is put the index on a web-server without the repository, so instead I plan to make to indexes one for each type of search mode. Fortunately the code written so far is prepared for different indexes.