I've done it! Ye mighty auto tokenizer allows you to define a grammar and tokenize strings. I've never done this before, looked at other code, or read anything on how to do it, so there is a chance I've used a known pattern without realizing it.

You feed it a token grammar and a string, and this son-of-a-gun will give you the next token. Warning: like any declarative language this will do what you tell it to do, therefore do not give it faulty grammars (like ones that will accept nothing)!

Legal disclaimer: By downloading the said file, knowingly or not, you agree to have no rights to its code or your knowledge of the knowledge gained upon mentally processing it. You have no copying rights, understanding rights, or right to process any thoughts derived from the knowledge of the said file. You are however given the right to live and breathe under the condition you do so without violation of any stated rule in this disclaimer.

As for the approach, I've never done anything like this before; I just looked at the rules for grammars and created factories for them. The parser does most of the work, but you can see from the [poorly (un)commented] code how it works.

One more thing; I will eventually add some form of faulty grammar detection. Basically if you can reach an end node without a decision or entering a node from a node then that node is faulty. A node is also faulty if it contains a grammar inside itself that is faulty - i.e. node.getInside() is faulty, so I will conjure up the code for it at some time.

It does a lot more than string.indexOf; it allows you to define a grammar for parsing. That was just an example to show you how to define a grammar for a simple sentence, which could end with a period. You can define any grammar you want, html, css, a command system in French, you can even define a programming language and have it parse that.

Next I will upgrade the code to return the parse tree too, which will make it "ye mighty auto lexer", or something.

I've made a further update to have it select from a list of grammars, returning the grammar that gave the best match. Note that ambiguous grammars will give ambiguous results, such as one grammar accepting white space (only) and another beginning with being able to accept white space. On another note I've added in detection of faulty grammar!
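Picking the best match from a list of grammars can be sketched roughly as follows. This is not the code from the post: BestGrammarSketch, best, run and the three demo grammars are hypothetical names, and each grammar is reduced to a function that returns the matched length or -1. The tie rule (first listed grammar wins) is also an assumption, chosen to show why a whitespace-only grammar and a grammar that may start with whitespace give an ambiguous result.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntPredicate;
import java.util.function.ToIntFunction;

final class BestGrammarSketch {

    /** Counts how many characters from 'from' onwards satisfy the predicate. */
    static int run(String s, int from, IntPredicate p) {
        int i = from;
        while (i < s.length() && p.test(s.charAt(i))) {
            i++;
        }
        return i - from;
    }

    /** Hypothetical grammars: each returns the matched length, or -1 for no token. */
    static Map<String, ToIntFunction<String>> demoGrammars() {
        Map<String, ToIntFunction<String>> g = new LinkedHashMap<>(); // keeps listing order
        g.put("whitespace", s -> {
            int n = run(s, 0, c -> c == ' ');
            return n > 0 ? n : -1;
        });
        g.put("paddedWord", s -> {            // optional spaces, then optional letters
            int ws = run(s, 0, c -> c == ' ');
            int n = ws + run(s, ws, Character::isLetter);
            return n > 0 ? n : -1;
        });
        g.put("number", s -> {
            int n = run(s, 0, Character::isDigit);
            return n > 0 ? n : -1;
        });
        return g;
    }

    /** Name of the grammar with the longest match; ties go to the first one listed. */
    static String best(String input, Map<String, ToIntFunction<String>> grammars) {
        String bestName = null;
        int bestLen = -1;
        for (Map.Entry<String, ToIntFunction<String>> e : grammars.entrySet()) {
            int len = e.getValue().applyAsInt(input);
            if (len > bestLen) {              // strictly greater, so an equal match never replaces
                bestLen = len;
                bestName = e.getKey();
            }
        }
        return bestName;                      // null when no grammar produced a token
    }

    public static void main(String[] args) {
        Map<String, ToIntFunction<String>> g = demoGrammars();
        System.out.println(best("  abc", g)); // paddedWord matches more than whitespace
        System.out.println(best("   ", g));   // both match 3 characters: ambiguous, first wins
    }
}
```

With only spaces as input, both "whitespace" and "paddedWord" match the same length, so the answer depends purely on listing order, which is the ambiguity mentioned above.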

... and then passed the grammar the provided sentences to see how much of it is accepted by the grammar.

It is achievable using regular expressions (or JavaCC). However, figuring this out by yourself from scratch, without resorting to any textbook or reference on how it is (or could be) achieved, is something not a lot of people can claim to have done. It also allows me to implement it anywhere, such as on an embedded device restricted to C/assembler, or just about anywhere else I can think of.

There is more to be added; for example, it currently tells you how much has been accepted, but does not quite tokenize to the last reasonable end for a grammar. I've modified the code to determine if a string is acceptable in its entirety, but I will make one final modification to make it only return a reasonable string that is entirely compatible (if you get what I mean).

Okay, I've made the next update - the parser will now return the largest valid token from a string, which it was not doing before. Before, the code could either determine whether a string completely satisfies a grammar or extract the largest string that is accepted. Now it will extract the largest valid token, hence it is a complete tokenizer & lexer.

Now if I want I could have some factories that load a grammar from file or something, or even a regular expression.

... if you were to call Parser.parse ("This is a lovely valid sentence. This is the next sentence!", sentence), it would return you the number of characters that form a valid sentence, which would be the length of "This is a lovely valid sentence."

If you then were to call Parser.parse ("       This is not a valid sentence because sentences do not begin with whitespace.", sentence), it would return -1 to tell you that it could not return a token. If you were to pass whitespace instead of sentence then it would return 7, i.e. the seven leading whitespace characters!
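The return values described above (a length on success, -1 when no token can be returned) can be illustrated with a stand-in. This is only a sketch: ParseContractSketch, sentence and whitespace are hypothetical hand-written matchers, not the posted Parser.parse or its grammar factories, and the sentence rule used here (a letter first, then letters and spaces, ending at the first '.', '!' or '?') is an assumption. The second input assumes seven leading spaces, matching the 7 quoted above.

```java
final class ParseContractSketch {

    /** Length of the leading sentence, or -1 if the input does not start with one. */
    static int sentence(String s) {
        if (s.isEmpty() || !Character.isLetter(s.charAt(0))) {
            return -1;                        // sentences may not begin with whitespace
        }
        for (int i = 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '.' || c == '!' || c == '?') {
                return i + 1;                 // include the terminator in the token
            }
            if (!Character.isLetter(c) && c != ' ') {
                return -1;                    // character outside the assumed rule
            }
        }
        return -1;                            // ran out of input before a terminator
    }

    /** Length of the leading run of whitespace, or -1 if there is none. */
    static int whitespace(String s) {
        int i = 0;
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
            i++;
        }
        return i > 0 ? i : -1;
    }

    public static void main(String[] args) {
        String first = "This is a lovely valid sentence. This is the next sentence!";
        String second = "       This is not a valid sentence.";
        int n = sentence(first);
        System.out.println(n + " -> " + first.substring(0, n));
        System.out.println(sentence(second));
        System.out.println(whitespace(second));
    }
}
```

The caller turns the returned length into the token itself with substring(0, n), and can then call the matcher again on the remainder to get the next token.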

Edit: I did a little speed test with a valid sentence of 12 031 characters; it searched 2 140 577 nodes and took 1.469 seconds to parse. Bear in mind that the grammar describing sentences branches at each character, which increases the search space. If you have a specific word then it generates a list that does not require so many nodes, so if your grammar defines the text "some_reserved_keyword" then it will not need to branch at each letter.

... as the one you wrote will accept "123............123e123e123e123". And never infinitely accept an option; take out my grammar check and turn on debugging in Parser.java to see the damage that could do.

Ah lol; I write {} as [optional] repeats and [] as options, whereas you write {} as options! And although I don't do it on purpose, I [often] tend to write optional words in square brackets when writing [normal] sentences in chat.

Ok, I've written what appears to be the final [necessary] update. There was a possibility of terminating early and not getting the largest token size: the parser could accept every character in the string and reach an end node without fully satisfying the grammar (i.e. an incomplete branch). Although it never occurred (in my presence), that bug has been fixed. The only expected updates (as of now) will be commenting.

Note that most of the work takes place in the parser, which is just your everyday depth-first search using iteration as opposed to recursion! The rest are simple factories, which I may update to be instances, thereby removing the need for implementing code to know about lists and collections.
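An iterative depth-first search for the longest token might look something like the sketch below. It is a guess at the general shape, not the posted Parser: IterativeDfsSketch, Node, Choice and the two demo grammars are hypothetical, and every node is assumed to consume exactly one character, so each step advances through the input and the loop always terminates.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.IntPredicate;

final class IterativeDfsSketch {

    /** Hypothetical grammar node: consumes one character, then offers its successors. */
    static final class Node {
        final IntPredicate accepts;
        final boolean end;                    // true if a token may stop after this node
        final List<Node> next = new ArrayList<>();

        Node(IntPredicate accepts, boolean end) {
            this.accepts = accepts;
            this.end = end;
        }
    }

    /** One untried choice: test 'node' against the character at 'pos'. */
    private static final class Choice {
        final Node node;
        final int pos;

        Choice(Node node, int pos) {
            this.node = node;
            this.pos = pos;
        }
    }

    /** Length of the longest prefix that reaches an end node, or -1 if there is none. */
    static int parse(String input, List<Node> entry) {
        int best = -1;
        Deque<Choice> choices = new ArrayDeque<>(); // explicit stack instead of recursion
        for (Node n : entry) {
            choices.push(new Choice(n, 0));
        }
        while (!choices.isEmpty()) {
            Choice c = choices.pop();
            if (c.pos >= input.length() || !c.node.accepts.test(input.charAt(c.pos))) {
                continue;                     // dead branch: fall back to the next pending choice
            }
            int after = c.pos + 1;
            if (c.node.end && after > best) {
                best = after;                 // remember the longest complete token so far
            }
            for (Node n : c.node.next) {
                choices.push(new Choice(n, after));
            }
        }
        return best;
    }

    /** sentence = letter { letter | ' ' letter } ( '.' | '!' ) */
    static List<Node> sentence() {
        Node letter = new Node(Character::isLetter, false);
        Node space = new Node(ch -> ch == ' ', false);
        Node stop = new Node(ch -> ch == '.' || ch == '!', true);
        letter.next.add(letter);
        letter.next.add(space);
        letter.next.add(stop);
        space.next.add(letter);
        return List.of(letter);
    }

    /** whitespace = ' ' { ' ' } */
    static List<Node> whitespace() {
        Node ws = new Node(ch -> ch == ' ', true);
        ws.next.add(ws);
        return List.of(ws);
    }

    public static void main(String[] args) {
        System.out.println(parse("This is a lovely valid sentence. This is the next sentence!", sentence()));
        System.out.println(parse(" Hi.", sentence()));
        System.out.println(parse("   x", whitespace()));
    }
}
```

An end node only updates the best length seen so far and never stops the search, so a branch that consumes more input later still wins. The sketch does not remember visited (node, position) pairs, so a heavily branching grammar would revisit states.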

That's all nice, but a product without proper 'marketing' is worthless.

Very true! Well my purpose of writing it was down to two motivations: we don't have anything like this in our code repository at work, especially for embedded [memory starved] systems, and it was really nagging me that I knew it could be done with little difficulty. Apart from that it offers nothing above the others, although I am considering left recursion removal, or at least detection of it.

If I can achieve removal of left recursion I'll think about pushing it as a useful tool, second to that I'm working on a C++ version. As for some 'yaw' value, I'll think of something!

It's not really about it being more powerful or better, it's all personal; I literally woke up at 6am Wednesday morning, after going to sleep at 1, thinking that this is possible. I had no idea I'd have a solution by Friday morning (or Thursday evening, when I completely figured it out). My first idea (on Wednesday morning) was terribly dumb; I'm quite surprised at how easy it was. It's basically a really simple search for the longest valid placement of letters that leads to the end node!

Technically speaking iterative deepening should not only find the solution, but will do so without getting caught in an infinite loop on left recursion! Source code is available online for anyone who wants to have a try! I'll even race ya!! On second thoughts it wouldn't! It's still infinite, so it will tell you if the entire string is valid, but not return a token.

I've made a further advancement in the code, which I need to update some time. Basically the parser can have the saving of the stack eliminated as it is redundant in the presence of the "choice stack". This [should] give it a significant speed improvement.

Have you tried http://www.antlr.org/ ? I've used it a lot and it is a very convenient and efficient solution for this type of problem. Especially given that it comes with a grammar design/interpret/debug tool.

That does seem interesting; my code currently works as a lex/tokenizer compiler, but I'm also looking to make it generate parse trees. In fact it is practically capable of doing that! What you have here though is the code to do so yourself; you can examine it and see how simple these programs are. Note that there is an error I've updated on my workstation but not here. Since I'm at work I have no access to it.

The nodes should [also] correspond to Von Neumann architecture, where the NODE structure can compile to code/commands. Of course I have no current ambitions for that, though it would be fun. Second to fun, it would be nice to have the code interpret the grammar in text, i.e. you could give it this text stream ...

It compiles and walks the tree; just import the Jar into Eclipse. In the Parser class there is a boolean to debug_output the tree walking. I've made some additions in the office as I tend to do tool researching out of hours, so when I get back into the office I'll give you the update that's got it screwing around with a particular type of config file I use. It lexes it, and I'll even make it parse the file (on Monday at around 18:20 GMT after I finish work).

If there is trouble compiling, try creating a java project (in Eclipse) and importing into the src(source) folder. If you don't use Eclipse then maybe someone could cook up an ant makefile (I have no idea how easy/hard they are).

So if that code isn't enough (at the moment) it should be at 18:20GMT on Monday - maybe over the weekend if I go in! So I guess I'll have to do something to make this more 'visible'; and outputting to the console is so pre 2k, I'll output to a JFrame or something. I will [eventually] make it read the grammars from text in the not too distant future!

Type-3, or regular, languages (those described by regular expressions) are only a subset of the context-free languages (Type-2, described by a CFG). A very simple example of a language you can't describe with a regular expression is a^n b a^n (that is, some number of a's, then b, and then the same number of a's again).
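For comparison, the same language is easy to recognize with a context-free rule such as S -> 'a' S 'a' | 'b'. A minimal recursive-descent sketch follows; AnBAnSketch, parseS and accepts are names made up for this illustration. The recursion depth does the counting that a regular expression has no way to do.

```java
final class AnBAnSketch {

    /** Parses S -> 'a' S 'a' | 'b' starting at pos; returns the end index, or -1 on failure. */
    static int parseS(String s, int pos) {
        if (pos >= s.length()) {
            return -1;
        }
        char c = s.charAt(pos);
        if (c == 'b') {
            return pos + 1;                   // base case: the single 'b' in the middle
        }
        if (c == 'a') {
            int inner = parseS(s, pos + 1);   // match the nested S first
            if (inner >= 0 && inner < s.length() && s.charAt(inner) == 'a') {
                return inner + 1;             // then the 'a' that closes this level
            }
        }
        return -1;
    }

    /** True if the whole string has the form a^n b a^n (n may be 0). */
    static boolean accepts(String s) {
        return parseS(s, 0) == s.length();
    }

    public static void main(String[] args) {
        System.out.println(accepts("aabaa"));
        System.out.println(accepts("aaba"));
    }
}
```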

Every time I try to use a damned parser generator I fail. I've put serious effort into it multiple times. Tried JavaCC and antlr. I just don't seem to be able to wrap my head around how to approach solving ambiguity problems.

keldon85's API looks nice and straightforward. I would worry that it silently ignores ambiguity though.

Although this topic has not been posted in for 180 days, I believe responding here is justified as there are unanswered questions.

I've uploaded the code to github now, but you should be aware that another library (parboiled) does the same job, and goes further to return a parse tree and provides a nice way to express your grammars.

Quote

Wow, finally! I have been waiting for this answer for nearly 3 years now (anyway, interesting).

Oh, I thought I pretty much responded to your question (though I didn't quote you). Sorry about that.

