Re: [Ocaml-lib-devel] regexp->DFA

Attached regexp module. The main limitation
of this code is to be seen in this line:
let posarray = Array.make !counter 0 in
where there is a position array indexed by the
alphabet. Obviously no good for 32 bit input.
What this code does: regexps are built out of integers.
This code *classifies* its input, it doesn't just
return true or false.
The type parameter 'a is the type of symbol used
to designate the class, it is captured in the
term
`REGEXP_code of 'a
This term does not match any inputs and is
equivalent to `REGEXP_epsilon except that it
marks a terminal state.
It's called 'code' because in ocamllex it
is the bit inside the { } used to create
a token from the lexbuf.
Equipped with an appropriate driver,
the output can be used to construct a string
classifier or lexer.
This routine was derived from the Felix
routine which works well enough to lex
itself. It was derived from the standard
algorithm given in the Dragon Book.
There is no state minimisation, however
the matrix shares transition vectors
between equivalent characters.
Regexp would not normally be constructed by the user.
Instead they'd use regular definitions, and there
would also be sugar for arrays of int and/or strings,
and also charsets.
--
John Skaller, mailto:skaller@...
voice: 061-2-9660-0850,
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net
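
A rough sketch of the classification idea described in the message above. It is not the attached module, which is not reproduced in this thread. Only `REGEXP_code and `REGEXP_epsilon are named in the post; the other constructors and the functions final, deriv and classify are assumptions made for illustration. To stay short it uses regexp derivatives instead of the Dragon Book DFA construction the post mentions.

```ocaml
(* Hypothetical regexp type over integer symbols. Only `REGEXP_code and
   `REGEXP_epsilon come from the post; the rest is assumed. *)
type 'a regexp = [
  | `REGEXP_epsilon
  | `REGEXP_code of 'a                      (* matches nothing, marks a class *)
  | `REGEXP_sym of int
  | `REGEXP_seq of 'a regexp * 'a regexp
  | `REGEXP_alt of 'a regexp * 'a regexp
  | `REGEXP_star of 'a regexp
]

(* Some codes if r accepts the empty input, listing the codes reached. *)
let rec final (r : 'a regexp) : 'a list option =
  match r with
  | `REGEXP_epsilon | `REGEXP_star _ -> Some []
  | `REGEXP_code x -> Some [x]
  | `REGEXP_sym _ -> None
  | `REGEXP_seq (a, b) ->
    (match final a, final b with
     | Some xs, Some ys -> Some (xs @ ys)
     | _ -> None)
  | `REGEXP_alt (a, b) ->
    (match final a, final b with
     | None, x | x, None -> x
     | Some xs, Some ys -> Some (xs @ ys))

(* Derivative of r by symbol c; None means no input starting with c matches. *)
let rec deriv (c : int) (r : 'a regexp) : 'a regexp option =
  match r with
  | `REGEXP_epsilon | `REGEXP_code _ -> None
  | `REGEXP_sym d -> if c = d then Some `REGEXP_epsilon else None
  | `REGEXP_alt (a, b) ->
    (match deriv c a, deriv c b with
     | None, x | x, None -> x
     | Some a', Some b' -> Some (`REGEXP_alt (a', b')))
  | `REGEXP_seq (a, b) ->
    let left =
      match deriv c a with
      | None -> None
      | Some a' -> Some (`REGEXP_seq (a', b)) in
    let right =
      match final a with
      | None -> None
      | Some _ -> deriv c b in
    (match left, right with
     | None, x | x, None -> x
     | Some x, Some y -> Some (`REGEXP_alt (x, y)))
  | `REGEXP_star a ->
    (match deriv c a with
     | None -> None
     | Some a' -> Some (`REGEXP_seq (a', r)))

(* Classify a whole input: the list of codes it reaches, [] if none. *)
let classify (r : 'a regexp) (input : int list) : 'a list =
  let rec go r = function
    | [] -> (match final r with Some codes -> codes | None -> [])
    | c :: rest -> (match deriv c r with Some r' -> go r' rest | None -> [])
  in
  go r input

(* Example: "1 followed by any number of 2s" versus "a single 3". *)
let sample : string regexp =
  `REGEXP_alt
    (`REGEXP_seq (`REGEXP_sym 1,
       `REGEXP_seq (`REGEXP_star (`REGEXP_sym 2), `REGEXP_code "one-then-twos")),
     `REGEXP_seq (`REGEXP_sym 3, `REGEXP_code "three"))
```

With these definitions, classify sample [1; 2; 2] reports the first class and classify sample [3] the second, so the result says which alternative matched and not just whether something did.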

Hello,
Well, I read part of your code, as I am currently reading the Dragon
Book...
Well, to speak freely, I feel that regexp is such a large and complex
field that the code you provide does not fit in the general layout of
extlib (I am not an author of extlib and I am just giving my opinion).
To my mind extlib is an extension of the basic code of the stdlib. But
it should be "general purpose" enough. Even if regexp is commonly used,
I think it would fit better in a self-contained library!
To my mind, it should be a standalone library, which could be a
prerequisite for extlib, if extlib wants to use the toplevel functionality
of it...
Just to give you an example:
- People writing lex-like programs should be interested -- very
interested -- in being able to use the internal representation of the DFA,
to merge two regexps for example (when you write a lex-like program you
need to merge some regexps in order to create a DFA, with some
intermediate break points). For example:
aaa { B }
aab { C }
If you use two high-level regexps you get
aaa -> B
{ put back all characters if you have "aa" and not "a" }
aab -> C
If you merge your DFAs
    |-> a -> B
aa -|
    |-> b -> C
(i.e. something more efficient at character recognition)
- People using your code as a simple regexp should only be interested in
toplevel recognition (for example going through a file looking for a
certain pattern).
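
The merged form of the aaa/aab example can be sketched as a small trie, where the two rules share the states for the common prefix "aa" and only branch on the last character. This is an illustration only: the node type and the add_rule and classify functions are hypothetical names, it handles literal strings and not general regexps, and it is not how the attached module represents its DFA.

```ocaml
(* One state of the merged automaton: an optional token for accepting
   states, plus the outgoing transitions. *)
type 'a node = {
  mutable token : 'a option;
  next : (char, 'a node) Hashtbl.t;
}

let make_node () = { token = None; next = Hashtbl.create 4 }

(* Add the rule "word { tok }", reusing states that already exist. *)
let add_rule (root : 'a node) (word : string) (tok : 'a) : unit =
  let node = ref root in
  String.iter
    (fun c ->
       let child =
         match Hashtbl.find_opt (!node).next c with
         | Some n -> n
         | None ->
           let n = make_node () in
           Hashtbl.add (!node).next c n;
           n
       in
       node := child)
    word;
  (!node).token <- Some tok

(* Walk the automaton once over the input, with no backtracking. *)
let classify (root : 'a node) (input : string) : 'a option =
  let rec go node i =
    if i = String.length input then node.token
    else
      match Hashtbl.find_opt node.next input.[i] with
      | Some n -> go n (i + 1)
      | None -> None
  in
  go root 0

let lexer : string node = make_node ()
let () = add_rule lexer "aaa" "B"
let () = add_rule lexer "aab" "C"
```

After both rules are added the root has a single outgoing transition on 'a', so "aaa" and "aab" are recognized in one pass without putting characters back.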
Maybe you should put your code in a library that ships alongside Felix,
make Felix use the code (as a prerequisite) and allow people to extend
functionality based on your code?
What I propose is to remove this code (if possible) from Felix
and make a library (let's call it felix-regexp) and let people use it as
a standard OCaml library.
People reading the very few mails I send to this list should know that I
don't like very big libraries containing the whole world... I prefer small
parts, which are to my mind more flexible for developers (and a nightmare
for distributing source code).
As usual I don't say that what I propose must be done -- I just say it
is the way I would work on it -- which is not necessarily the best way.
To give you some more hints on my personal position: I need to be able
to merge DFAs (and maybe introduce other concepts). The code I read is
a very good start for this kind of thing, but I don't have the level to
maintain this kind of library! (regexp is not a hacker field, there is
a lot of theory in it).
Kind regards
Sylvain Le Gall
ps: what is the licence which covers your code (I suppose it is MIT/BSD)?
On Sun, May 30, 2004 at 06:55:08AM +1000, skaller wrote:
> Attached regexp module. The main limitation
> of this code is to be seen in this line:
>
> let posarray = Array.make !counter 0 in
>
> where there is a position array indexed by the
> alphabet. Obviously no good for 32 bit input.
>
> What this code does: regexps are built out of integers.
> This code *classifies* its input, it doesn't just
> return true or false.
>
> The type parameter 'a is the type of symbol used
> to designate the class, it is captured in the
> term
>
> `REGEXP_code of 'a
>
> This term does not match any inputs and is
> equivalent to `REGEXP_epsilon except that it
> marks a terminal state.
>
> It's called 'code' because in ocamllex it
> is the bit inside the { } used to create
> a token from the lexbuf.
>
> Equipped with an appropriate driver,
> the output can be used to construct a string
> classifier or lexer.
>
> This routine was derived from the Felix
> routine which works well enough to lex
> itself. It was derived from the standard
> algorithm given in the Dragon Book.
>
> There is no state minimisation, however
> the matrix shares transition vectors
> between equivalent characters.
>
> Regexp would not normally be constructed by the user.
> Instead they'd use regular definitions, and there
> would also be sugar for arrays of int and/or strings,
> and also charsets.
>
> --
> John Skaller, mailto:skaller@...
> voice: 061-2-9660-0850,
> snail: PO BOX 401 Glebe NSW 2037 Australia
> Checkout the Felix programming language http://felix.sf.net
>
>

On Sun, 2004-05-30 at 08:40, Sylvain LE GALL wrote:
> Hello,
>
> Well, I read part of your code, as I am currently reading the Dragon
> Book...
>
> Well, to speak freely, I feel that regexp is such a large and complex
> field that the code you provide does not fit in the general layout of
> extlib (I am not an author of extlib and I am just giving my opinion).
I would have thought it could be the basis of a Str replacement.
I also have used it to write an ocamllex replacement.
Both Str and ocamllex are part of the standard distro.
--
John Skaller, mailto:skaller@...
voice: 061-2-9660-0850,
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

On Sun, 2004-05-30 at 08:40, Sylvain LE GALL wrote:
> Maybe you should put your code in a library that ships alongside Felix,
> make Felix use the code (as a prerequisite) and allow people to extend
> functionality based on your code?
I'm desperately trying to automate the Felix build.
Adding a requirement for an external tool makes that
more difficult, especially since Ocaml doesn't have
any package management to speak of.
For example all my code is literate programmed.
To remove the requirement to install the literate
programming tool it is shipped inside the Felix tarball.
That LP tool (interscript) is a very powerful package
integrator.
It has taken a month to actually get Felix to build on
6 platforms. [Now that Felix hooks your fileutils we have
to go through that process again ..;(]
> People reading the very few mails I send to this list should know that I
> don't like very big libraries containing the whole world... I prefer small
> parts, which are to my mind more flexible for developers (and a nightmare
> for distributing source code).
Felix is a programming language compiler.
Its packaging concept is like Python or Ocaml:
monolithic. One tarball. As you say, to avoid
a 'nightmare' distributing it. In particular
clients are expected to be C/C++ programmers
with no knowledge of Ocaml.. I'd like them
all to learn Ocaml .. but they're downloading
the package to learn *Felix* :)
There are a LOT of parts of the code that could
be split out. For example a unification algorithm.
And quite a bit of Unicode handling stuff
(Felix specifies 8 and 32 bit strings and
UTF-8 encoded Unicode as input)
> To give you some more hints on my personal position: I need to be able
> to merge DFAs (and maybe introduce other concepts).
To do that, you convert the DFA back to regexp, join the regexp,
and build a new DFA. There's no way to merge DFAs directly,
they're monolithic. This is a real pain for an extensible
compiler.
Similarly Yacc-like parsers aren't extensible ..
because they're driven by a DFA.
> regexp is not a hacker field, there is
> a lot of theory in it ).
But the theory for this is old enough and simple enough to
be second year computer science level (or I'd not have been
able to make it work) and is in the Dragon Book.
> ps: what is the licence which covers your code (I suppose it is MIT/BSD)?
It's FFAU -- free for any use ~ public domain.
--
John Skaller, mailto:skaller@...
voice: 061-2-9660-0850,
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

On Sun, 2004-05-30 at 08:40, Sylvain LE GALL wrote:
> To my mind extlib is an extension of the basic code of the stdlib. But
> it should be "general purpose" enough. Even if regexp is commonly used,
> I think it would fit better in a self-contained library!
BTW: I originally thought to make Felix use Extlib,
since Extlib contains some data structures I want,
variable length array and PSet/PMap for example.
Unfortunately, these components are inseparable
because they're unified by the Enum concept,
and I have some issues with enums.
I also can't just reuse the parts I want,
since Extlib is LGPL and Felix is FFAU.
So I either use the package as a whole,
or don't use any of it (or make a subset
of it, package it separately, and use that,
getting the problems of *both* dependence
and loss of separate maintenance)
The point being: it is a fairly typical
problem that there is a tension between
separability and integration. In particular
it is precisely Enums which simultaneously
make Extlib components easy to use separately,
whilst at the same time making them all
dependent on Enums :)
--
John Skaller, mailto:skaller@...
voice: 061-2-9660-0850,
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net

Hello,
On Sun, May 30, 2004 at 01:47:40PM +1000, skaller wrote:
> On Sun, 2004-05-30 at 08:40, Sylvain LE GALL wrote:
>
> > Maybe you should put your code in a library that ships alongside Felix,
> > make Felix use the code (as a prerequisite) and allow people to extend
> > functionality based on your code?
>
> I'm desperately trying to automate the Felix build.
> Adding a requirement for an external tool makes that
> more difficult, especially since Ocaml doesn't have
> any package management to speak of.
>
> For example all my code is literate programmed.
> To remove the requirement to install the literate
> programming tool it is shipped inside the Felix tarball.
> That LP tool (interscript) is a very powerful package
> integrator.
>
> It has taken a month to actually get Felix to build on
> 6 platforms. [Now that Felix hooks your fileutils we have
> to go through that process again ..;(]
Yep, it is generally a pain to get anything to work on more than one
platform.
>
> > People reading the very few mails I send to this list should know that I
> > don't like very big libraries containing the whole world... I prefer small
> > parts, which are to my mind more flexible for developers (and a nightmare
> > for distributing source code).
>
> Felix is a programming language compiler.
> Its packaging concept is like Python or Ocaml:
> monolithic. One tarball. As you say, to avoid
> a 'nightmare' distributing it. In particular
> clients are expected to be C/C++ programmers
> with no knowledge of Ocaml.. I'd like them
> all to learn Ocaml .. but they're downloading
> the package to learn *Felix* :)
>
> There are a LOT of parts of the code that could
> be split out. For example a unification algorithm.
> And quite a bit of Unicode handling stuff
> (Felix specifies 8 and 32 bit strings and
> UTF-8 encoded Unicode as input)
>
No problem. As I said, I know the problems with splitting code, but I also
know the advantages!
I prefer to see the advantages, but I cannot ignore that the disadvantages
are big enough to keep people from considering it.
> > To give you some more hints on my personal position: I need to be able
> > to merge DFAs (and maybe introduce other concepts).
>
> To do that, you convert the DFA back to regexp, join the regexp,
> and build a new DFA. There's no way to merge DFAs directly,
> they're monolithic. This is a real pain for an extensible
> compiler.
>
> Similarly Yacc-like parsers aren't extensible ..
> because they're driven by a DFA.
You mean "(aaa|aab)". I don't remember the algorithm precisely, but I was
thinking you could introduce extra final states in your DFA in order to
recognize intermediate forms (i.e. if you simply merge the regexps, you
will recognize aaa and aab but you cannot distinguish between them).
>
> > regexp is not a hacker field, there is
> > a lot of theory in it ).
>
> But the theory for this is old enough and simple enough to
> be second year computer science level (or I'd not have been
> able to make it work) and is in the Dragon Book.
>
I think you are a bit unrealistic about this! I understand the theory
but I don't feel like I am at the top of this field. Moreover, I don't
think that everyone knows regexps by the end of their studies!
But maybe I am just not good enough.
> > ps: what is the licence which covers your code (I suppose it is MIT/BSD)?
>
> It's FFAU -- free for any use ~ public domain.
>
Ok,
Kind regards
Sylvain Le Gall
ps: I understand your code and I am able to use it, and I understand the
theory behind it. I am not so bad at understanding theory, but I don't
think I have the level of the people who, for example, wrote "pcre".

On Sun, 2004-05-30 at 19:37, sylvain.le-gall@... wrote:
> > To do that, you convert the DFA back to regexp, join the regexp,
> > and build a new DFA. There's no way to merge DFAs directly,
> > they're monolithic. This is a real pain for an extensible
> > compiler.
> >
> > Similarly Yacc-like parsers aren't extensible ..
> > because they're driven by a DFA.
>
> You mean "(aaa|aab)".
Basically I need to be able to add grammar productions
to my parser, and new literal kinds to my lexer.
I can't do either at the moment (except by
recompiling the compiler)
It's really easy to add new data types, you just can't
add special syntax to use it or literals of it.
Adding new grammar productions is fairly easy for
a recursive descent parser :)
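
To illustrate why a recursive descent parser is easy to extend, here is a minimal sketch in which productions are ordinary functions kept in a mutable list, so a new one can be registered at run time without rebuilding anything. The token type, the production type, and the add_production and parse_atom names are all hypothetical; this is not Felix's parser.

```ocaml
(* Hypothetical token and production types for the sketch. *)
type token = Int of int | Sym of string

(* A production consumes a prefix of the tokens and returns a value
   together with the remaining tokens, or None if it does not apply. *)
type production = token list -> (int * token list) option

(* The grammar is just a list of productions that can grow at run time. *)
let productions : production list ref = ref []

let add_production (p : production) : unit =
  productions := p :: !productions

(* Try each registered production in turn (most recently added first). *)
let parse_atom (tokens : token list) : (int * token list) option =
  List.fold_left
    (fun found p -> match found with Some _ -> found | None -> p tokens)
    None !productions

(* Base grammar: an atom is an integer literal. *)
let () =
  add_production (function
    | Int n :: rest -> Some (n, rest)
    | _ -> None)

(* Before the extension, a leading "-" is not understood. *)
let before = parse_atom [Sym "-"; Int 3]

(* Extension added later: atom ::= "-" atom. It calls back into
   parse_atom, so it composes with every other production. *)
let () =
  add_production (function
    | Sym "-" :: rest ->
      (match parse_atom rest with
       | Some (n, rest') -> Some (-n, rest')
       | None -> None)
    | _ -> None)

let after = parse_atom [Sym "-"; Int 3]
```

A table-driven parser would need its tables regenerated for the same change, which is the difficulty described above.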
> I think you are a bit unrealistic about this! I understand the theory
> but I don't feel like I am at the top of this field. Moreover, I don't
> think that everyone knows regexps by the end of their studies!
But one does not need to know it fully. Just the basic idea,
and the name "Dragon Book" .. or "CiteSeer" or "Google" is enough :)
--
John Skaller, mailto:skaller@...
voice: 061-2-9660-0850,
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net