Hello,
(this is a long post, you may skip this introduction)
I was sticking around D's NG for about a year and following it's
advancements very closely. I was busy ruthlessly wasting my spare time
for testing this or that of cool features, converting some personal C++
projects. Maybe it's time to introduce myself: I'm student in the middle
of getting my master's degree in Moscow Institute of Physics and
Technology. My main interests for the last years were C++,
metaprogramming, compiler theory and, lately, robotics and
microcontrollers. I got attracted to D2 as a natural way to reinvest my
C++ knowledge (honestly, C++0x is such a mess) and, in fact, looked
forward to do some hackery on some real-world compiler (that would be
DMD). As sort of warm-up to get better grasp on the language and
real-world interpreters alike I did some work on DMDScript (porting it
to D2 among others) it's available on
http://dsource.org/projects/dmdscript-2. Note that I haven't touched it
in couple of dmd releases, it may need cosmetic fixes.
To summarize, I'm no stranger to regular expressions and parsing and
want to utilize this experience to help community, what follows is my
first attempt at proposal, any comments and ideas are warmly welcome.
--- somewhat informal draft ---
Having regular expression (regex) engine in standard library is totally
expected for any modern language.
The solid and performant implementation of regex is one of the greatest
points of pride (if not of sale).
While concrete results and benchmarks got frequently biased, there are
some general conclusions:
regexes usually came in two flavors: Finite Automation (FA, be it
deterministic or nondeterministic) and backtracking, each of these
with their set of traits.
FA are more stable O(N) in time on input, however they are usually more
limited as implementing features such as backreferences becomes
problematic, also not to mention the time for constructing DFA from pattern.
Backtracking allows easily implementing some interesting features like
backreferencing, lookahead/lookbehind and such, but has a dreadful
exponential time on input in worst case.
Not willing to go deeper into the performance/feature/usage topic,
fairly good picture of various regular expression engines can be found
here: http://lh3lh3.users.sourceforge.net/reb.shtml. I though have now
idea why he claims RE2 is feature-rich.
Now speaking of D, what we have now is essentially an optimized
backtracking with some not implemented features it may very well have
had. Also consider important issue that need proper resolution
http://d.puremagic.com/issues/show_bug.cgi?id=941. This situation begs
for a change.
The proposal consists of three parts:
1) Enhance and bugfix the existing engine in std.regex. Sometime ago I
got accustomed with this engine internals and confident that I can get
it done. That's a short-term solution at best and something I plan to do
in spare time even if it's not considered a compelling GSOC proposal ;)
2) Make another one FA-based from scratch, here things go somewhat
speculative.
I plan to start with NFA, leaving DFA on later. Probably NFA with cached
DFA states as optimization tuning aka memory vs time. Someone brought
http://swtch.com/~rsc/regexp/regexp1.html, it could be a starting point,
I see some optimization opportunities I'd like to try. Supporting
unicode of all flavors is of no question. The API for this engine may
stay the same or it even may be integrated into existing one, making
the type of engine a compile-time parameter with default being
Regex.Backtracking. Suppose:
auto dfa = regex!Regex.NFA("(D|N)?FA","g");
3) When I saw Don's comments on fixing CTFE, I remembered something D
can do _better_ than most others mainstream compiled languages -
metaprogramming. So having a StaticRegex compiled at compile-time
(inevitable calambour) would give us an edge over comparative
implementations. Also I can imagine quite a lot of applications not
requiring *any* regex from non-constant pattern. Major exceptions are
editors with find/replace and certain scripting/parsing toolkits.
It looks quite natural:
...
enum sregNum= StaticRegex!"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)";
...
assert(match("PI: 3.1415926",sregNum).hit == "3.1415926");
The best speed gain most likely would be obtained by generating the
whole DFA at compile time, I believe such possibility was previously
discussed on the NG.
This diversity in kinds of regexes may warrant making regex engine an
interface ( for homogeneous storage and parameter passing) with concrete
implementations being classes.
Another thing to sort out is style: are we sticking with ECMA or
switching to Perl / GNU? Shall we provide also different versions of
syntax?
--
Dmitry Olshansky

Before fully embracing the contents of that page, look at its critics too.

3) When I saw Don's comments on fixing CTFE, I remembered something D
can do _better_ than most others mainstream compiled languages -
metaprogramming. So having a StaticRegex compiled at compile-time
(inevitable calambour) would give us an edge over comparative
implementations.

But it inflates the binary too, if overused, because each RE pattern becomes a
compiled function (or more than one).

Another thing to sort out is style: are we sticking with ECMA or
switching to Perl / GNU? Shall we provide also different versions of
syntax?

One RE syntax in a language is more than enough :-) But there are some features
of the Python REs that I'd like to see bolted-on to the ECMA REs used by D. So
I suggest a superset of ECMA.
Bye,
bearophile

Before fully embracing the contents of that page, look at its critics too.

I've seen them, and they are expected, also I thought I clearly
presented some of logical considerations.

3) When I saw Don's comments on fixing CTFE, I remembered something D
can do _better_ than most others mainstream compiled languages -
metaprogramming. So having a StaticRegex compiled at compile-time
(inevitable calambour) would give us an edge over comparative
implementations.

But it inflates the binary too, if overused, because each RE pattern becomes a
compiled function (or more than one).

Well, in case that result would be a data structure or bytecode created
at the compile-time using CTFE. The function for matching or replacing
using the resulting structure/program would be the same as run-time one.
It's "inflating" binary like any static data. Then
StaticRegex!"whatever" should be convenience wrapper hiding all this
complexity. Anyway I partly agree, there are limitations and
considerations when using such technique. And it would rely on CTFE that
does not leak memory like hell.

Another thing to sort out is style: are we sticking with ECMA or
switching to Perl / GNU? Shall we provide also different versions of
syntax?

One RE syntax in a language is more than enough :-) But there are some
features of the Python REs that I'd like to see bolted-on to the ECMA REs used
by D.

Verbose regular expressions, that allow to put space and comments, to format
REs more like code and less like a cryptic puzzle language.
(?:...) A non-grouping version of regular parentheses. Matches whatever regular
expression is inside the parentheses, but the substring matched by the group
cannot be retrieved after performing a match or referenced later in the pattern.
(?P<name>...) Similar to regular parentheses, but the substring matched by the
group is accessible via the symbolic group name name. A symbolic group is also
a numbered group.
(?P=name) Matches whatever text was matched by the earlier group named name.
(?=...) Lookahead assertion. Matches if ... matches next, but doesn't consume
any of the string.
(?!...) Negative lookahead assertion. Matches if ... doesn't match next. The
contained pattern must only match strings of some fixed length.
(?<=...) Positive lookbehind assertion. Matches if the current position in the
string is preceded by a match for ... that ends at the current position. The
contained pattern must only match strings of some fixed length.
(?<!...) Negative lookbehind assertion. Matches if the current position in the
string is not preceded by a match for .... The contained pattern must only
match strings of some fixed length. Patterns which start with negative
lookbehind assertions may match at the beginning of the string being searched.
Bye,
bearophile

Verbose regular expressions, that allow to put space and comments, to format
REs more like code and less like a cryptic puzzle language.
(?:...) A non-grouping version of regular parentheses. Matches whatever
regular expression is inside the parentheses, but the substring matched by the
group cannot be retrieved after performing a match or referenced later in the
pattern.
(?P<name>...) Similar to regular parentheses, but the substring matched by the
group is accessible via the symbolic group name name. A symbolic group is also
a numbered group.
(?P=name) Matches whatever text was matched by the earlier group named name.
(?=...) Lookahead assertion. Matches if ... matches next, but doesn't consume
any of the string.
(?!...) Negative lookahead assertion. Matches if ... doesn't match next. The
contained pattern must only match strings of some fixed length.
(?<=...) Positive lookbehind assertion. Matches if the current position in the
string is preceded by a match for ... that ends at the current position. The
contained pattern must only match strings of some fixed length.
(?<!...) Negative lookbehind assertion. Matches if the current position in the
string is not preceded by a match for .... The contained pattern must only
match strings of some fixed length. Patterns which start with negative
lookbehind assertions may match at the beginning of the string being searched.

FWIW: (probably due to domain?) I have never used positive lookahead, and even
less lookbehinds.
Also, what I like in py regexes is that they don't match anywhere as default
(there is a find method).
What I find annoying is single-line-only as default (like D regexes,
unfortunately).
Denis
--
_________________
vita es estrany
spir.wikidot.com

Verbose regular expressions, that allow to put space and comments, to format
REs more like code and less like a cryptic puzzle language.
(?:...) A non-grouping version of regular parentheses. Matches whatever
regular expression is inside the parentheses, but the substring matched by the
group cannot be retrieved after performing a match or referenced later in the
pattern.
(?P<name>...) Similar to regular parentheses, but the substring matched by the
group is accessible via the symbolic group name name. A symbolic group is also
a numbered group.
(?P=name) Matches whatever text was matched by the earlier group named name.
(?=...) Lookahead assertion. Matches if ... matches next, but doesn't consume
any of the string.
(?!...) Negative lookahead assertion. Matches if ... doesn't match next. The
contained pattern must only match strings of some fixed length.
(?<=...) Positive lookbehind assertion. Matches if the current position in the
string is preceded by a match for ... that ends at the current position. The
contained pattern must only match strings of some fixed length.
(?<!...) Negative lookbehind assertion. Matches if the current position in the
string is not preceded by a match for .... The contained pattern must only
match strings of some fixed length. Patterns which start with negative
lookbehind assertions may match at the beginning of the string being searched.

Yeah, that's something I vaguely called "... not implemented features it
may very well have had".
More then that (?:...), (?!...), (?<=...) area part of ECMA v3 spec. And I got
a preliminary support for them in my patch
herehttp://d.puremagic.com/issues/show_bug.cgi?id=5673.
Others (except (?P<name>) and (?P=name) ) also considered common extensions and
I planed to add them plus regex comment (#...) where all of ... simply have no
effect on matching.

Others (except (?P<name>) and (?P=name) ) also considered common extensions
and I planed to add them plus regex comment (#...) where all of ... simply
have no effect on matching.

Beside the (#...) comments in Python you have also the verbose regex, that
allow to put whispace and free #... comments with no parentheses. I find this
one of the nicest features, because it allows you to format your regex.
Bye,
bearophile

Others (except (?P<name>) and (?P=name) ) also considered common extensions
and I planed to add them plus regex comment (#...) where all of ... simply
have no effect on matching.

Beside the (#...) comments in Python you have also the verbose regex, that
allow to put whispace and free #... comments with no parentheses. I find this
one of the nicest features, because it allows you to format your regex.
Bye,
bearophile

You can also format regex with
r"\d+" // matches the integer part
~ r"(?:\.\d+)?"; // matches the fractional part
:)

Others (except (?P<name>) and (?P=name) ) also considered common extensions
and I planed to add them plus regex comment (#...) where all of ... simply
have no effect on matching.

Beside the (#...) comments in Python you have also the verbose regex, that
allow to put whispace and free #... comments with no parentheses. I find this
one of the nicest features, because it allows you to format your regex.
Bye,
bearophile

You can also format regex with
r"\d+" // matches the integer part
~ r"(?:\.\d+)?"; // matches the fractional part
:)

Others (except (?P<name>) and (?P=name) ) also considered common
extensions
and I planed to add them plus regex comment (#...) where all of ...
simply
have no effect on matching.

Beside the (#...) comments in Python you have also the verbose
regex, that
allow to put whispace and free #... comments with no parentheses. I
find this
one of the nicest features, because it allows you to format your regex.
Bye,
bearophile

You can also format regex with
r"\d+" // matches the integer part
~ r"(?:\.\d+)?"; // matches the fractional part
:)

Nice trick, thank you!

Or even without '~' using implicit adjacent strings concatenation
(though I vaguely remember that the conclusion was that ~ operator
string will also happen at compile time soon)
string pat = `\d+` // matches the integer part
`(?:\.\d+)?`; // matches the fractional part
assert(pat == `\d+(?:\.\d+)?`);
This I think somewhat diminishes the need for verbose regex, isn't it?
It may be even documented somewhere in examples...
Overall for now it seems that having different syntax style (like perl
and etc.) is not something people find useful/interesting.
So I'll be sticking with ECMA + extensions.
--
Dmitry Olshansky

Others (except (?P<name>) and (?P=name) ) also considered common extensions
and I planed to add them plus regex comment (#...) where all of ... simply
have no effect on matching.

Beside the (#...) comments in Python you have also the verbose regex, that
allow to put whispace and free #... comments with no parentheses. I find this
one of the nicest features, because it allows you to format your regex.

Before fully embracing the contents of that page, look at its critics too.

3) When I saw Don's comments on fixing CTFE, I remembered something D
can do _better_ than most others mainstream compiled languages -
metaprogramming. So having a StaticRegex compiled at compile-time
(inevitable calambour) would give us an edge over comparative
implementations.

But it inflates the binary too, if overused, because each RE pattern becomes a
compiled function (or more than one).

Another thing to sort out is style: are we sticking with ECMA or
switching to Perl / GNU? Shall we provide also different versions of
syntax?

One RE syntax in a language is more than enough :-) But there are some
features of the Python REs that I'd like to see bolted-on to the ECMA REs used
by D. So I suggest a superset of ECMA.

Having regular expression (regex) engine in standard library is
totally expected for any modern language.
The solid and performant implementation of regex is one of the
greatest points of pride (if not of sale).

Couldn't agree more.

While concrete results and benchmarks got frequently biased, there are
some general conclusions:
regexes usually came in two flavors: Finite Automation (FA, be it
deterministic or nondeterministic) and backtracking, each of these
with their set of traits.
FA are more stable O(N) in time on input, however they are usually
more limited as implementing features such as backreferences becomes
problematic, also not to mention the time for constructing DFA from
pattern.

For an NFA, the complexity of the regular expression also affects the
execution time as there will probably me more states to process per
character. There is also more storage and copying required for
captures. I haven't looked closely at using a DFA, and I'm not sure how
captures would work with one.

Backtracking allows easily implementing some interesting features like
backreferencing, lookahead/lookbehind and such, but has a dreadful
exponential time on input in worst case.
Not willing to go deeper into the performance/feature/usage topic,
fairly good picture of various regular expression engines can be found
here: http://lh3lh3.users.sourceforge.net/reb.shtml. I though have
now idea why he claims RE2 is feature-rich.
Now speaking of D, what we have now is essentially an optimized
backtracking with some not implemented features it may very well have
had. Also consider important issue that need proper resolution
http://d.puremagic.com/issues/show_bug.cgi?id=941. This situation begs
for a change.
The proposal consists of three parts:
1) Enhance and bugfix the existing engine in std.regex. Sometime ago I
got accustomed with this engine internals and confident that I can get
it done. That's a short-term solution at best and something I plan to
do in spare time even if it's not considered a compelling GSOC
proposal ;)

I would really like to see this.

2) Make another one FA-based from scratch, here things go somewhat
speculative.
I plan to start with NFA, leaving DFA on later. Probably NFA with
cached DFA states as optimization tuning aka memory vs time. Someone
brought http://swtch.com/~rsc/regexp/regexp1.html, it could be a
starting point, I see some optimization opportunities I'd like to
try. Supporting unicode of all flavors is of no question. The API for
this engine may stay the same or it even may be integrated into
existing one, making the type of engine a compile-time parameter with
default being Regex.Backtracking. Suppose:
auto dfa = regex!Regex.NFA("(D|N)?FA","g");

I think the existing range based API is very good. I also like the idea
of the regular expression engine being a compile time parameter.

3) When I saw Don's comments on fixing CTFE, I remembered something D
can do _better_ than most others mainstream compiled languages -
metaprogramming. So having a StaticRegex compiled at compile-time
(inevitable calambour) would give us an edge over comparative
implementations. Also I can imagine quite a lot of applications not
requiring *any* regex from non-constant pattern. Major exceptions are
editors with find/replace and certain scripting/parsing toolkits.
It looks quite natural:
...
enum sregNum= StaticRegex!"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)";
...
assert(match("PI: 3.1415926",sregNum).hit == "3.1415926");
The best speed gain most likely would be obtained by generating the
whole DFA at compile time, I believe such possibility was previously
discussed on the NG.

This would be a real nice feature.

This diversity in kinds of regexes may warrant making regex engine an
interface ( for homogeneous storage and parameter passing) with
concrete implementations being classes.
Another thing to sort out is style: are we sticking with ECMA or
switching to Perl / GNU? Shall we provide also different versions of
syntax?

Hi Dmitry,
It's good to know that there is interest in improving regular
expressions in the D standard library. I've run into a number of
problems using std.regex myself, so I'm keen so see either fixes for
std.regex or an improved replacement.

Thanks for the interest.
Indeed there are problems, though the engine itself is quite capable and
optimized, I think it could be fixed.

Wow, I'll definitely look into it.
Overall it seems solid even though incomplete. From glance it seems that
parsing a pattern is somewhat convulted, but I'll defer any critics or
decisions until I looked in closer detail.
I see you also using the test from std.regex, sadly it's somewhat broken
(the one with test vectors).
All it's checking proves that *there is a match*, not *what* was matched.
Anyway, I'm gathering what I have in a way of fixes as my first pull
request, wish me luck ;)

Having regular expression (regex) engine in standard library is
totally expected for any modern language.
The solid and performant implementation of regex is one of the
greatest points of pride (if not of sale).

Couldn't agree more.

While concrete results and benchmarks got frequently biased, there are
some general conclusions:
regexes usually came in two flavors: Finite Automation (FA, be it
deterministic or nondeterministic) and backtracking, each of these
with their set of traits.
FA are more stable O(N) in time on input, however they are usually
more limited as implementing features such as backreferences becomes
problematic, also not to mention the time for constructing DFA from
pattern.

For an NFA, the complexity of the regular expression also affects the
execution time as there will probably me more states to process per
character. There is also more storage and copying required for
captures. I haven't looked closely at using a DFA, and I'm not sure how
captures would work with one.

Right, for NFA the execution time in worst case O(n*m) n - input, m -
pattern length.
As for DFA, it seems the process of construction full DFA eagerly from
NFA would grab a lot of ram & time O(2^^m) in worst case, which makes it
useful only on certain amounts of input, hence the idea to compute it at
compile time.

Backtracking allows easily implementing some interesting features like
backreferencing, lookahead/lookbehind and such, but has a dreadful
exponential time on input in worst case.
Not willing to go deeper into the performance/feature/usage topic,
fairly good picture of various regular expression engines can be found
here: http://lh3lh3.users.sourceforge.net/reb.shtml. I though have
now idea why he claims RE2 is feature-rich.
Now speaking of D, what we have now is essentially an optimized
backtracking with some not implemented features it may very well have
had. Also consider important issue that need proper resolution
http://d.puremagic.com/issues/show_bug.cgi?id=941. This situation begs
for a change.
The proposal consists of three parts:
1) Enhance and bugfix the existing engine in std.regex. Sometime ago I
got accustomed with this engine internals and confident that I can get
it done. That's a short-term solution at best and something I plan to
do in spare time even if it's not considered a compelling GSOC
proposal ;)

I would really like to see this.

2) Make another one FA-based from scratch, here things go somewhat
speculative.
I plan to start with NFA, leaving DFA on later. Probably NFA with
cached DFA states as optimization tuning aka memory vs time. Someone
brought http://swtch.com/~rsc/regexp/regexp1.html, it could be a
starting point, I see some optimization opportunities I'd like to
try. Supporting unicode of all flavors is of no question. The API for
this engine may stay the same or it even may be integrated into
existing one, making the type of engine a compile-time parameter with
default being Regex.Backtracking. Suppose:
auto dfa = regex!Regex.NFA("(D|N)?FA","g");

I think the existing range based API is very good. I also like the idea
of the regular expression engine being a compile time parameter.

3) When I saw Don's comments on fixing CTFE, I remembered something D
can do _better_ than most others mainstream compiled languages -
metaprogramming. So having a StaticRegex compiled at compile-time
(inevitable calambour) would give us an edge over comparative
implementations. Also I can imagine quite a lot of applications not
requiring *any* regex from non-constant pattern. Major exceptions are
editors with find/replace and certain scripting/parsing toolkits.
It looks quite natural:
...
enum sregNum= StaticRegex!"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)";
...
assert(match("PI: 3.1415926",sregNum).hit == "3.1415926");
The best speed gain most likely would be obtained by generating the
whole DFA at compile time, I believe such possibility was previously
discussed on the NG.

This would be a real nice feature.

This diversity in kinds of regexes may warrant making regex engine an
interface ( for homogeneous storage and parameter passing) with
concrete implementations being classes.
Another thing to sort out is style: are we sticking with ECMA or
switching to Perl / GNU? Shall we provide also different versions of
syntax?

Hi Dmitry,
It's good to know that there is interest in improving regular
expressions in the D standard library. I've run into a number of
problems using std.regex myself, so I'm keen so see either fixes for
std.regex or an improved replacement.

Thanks for the interest.
Indeed there are problems, though the engine itself is quite capable
and optimized, I think it could be fixed.

Wow, I'll definitely look into it.
Overall it seems solid even though incomplete. From glance it seems
that parsing a pattern is somewhat convulted, but I'll defer any
critics or decisions until I looked in closer detail.

I'm open to any criticisms or suggestions you may have.

I see you also using the test from std.regex, sadly it's somewhat
broken (the one with test vectors).
All it's checking proves that *there is a match*, not *what* was matched.

You're right. I'm now updating the the code to make use of the other
info in the test vectors for more reliable tests.

Anyway, I'm gathering what I have in a way of fixes as my first pull
request, wish me luck ;)

Hi Dmitry,
It's good to know that there is interest in improving regular
expressions in the D standard library. I've run into a number of
problems using std.regex myself, so I'm keen so see either fixes for
std.regex or an improved replacement.

Thanks for the interest.
Indeed there are problems, though the engine itself is quite capable
and optimized, I think it could be fixed.

Wow, I'll definitely look into it.
Overall it seems solid even though incomplete. From glance it seems
that parsing a pattern is somewhat convulted, but I'll defer any
critics or decisions until I looked in closer detail.

I'm open to any criticisms or suggestions you may have.

Ok, I think I got my criticisms sorted out.
Sorry for taking so much time, I've got really involved week, attending
to a couple events and, in fact, helping organize one.
First things first - I like your structured way of expressing regex VM
instructions by introducing distinct types for each, but let's examine
what good does it get you in the end.
Function printProgram gets us first impressions:
Inst* inst = cast(Inst*)&program[pos];
final switch( *instType )
{
...
case REInst.IChar:
InstIChar* inst = cast(InstIChar*)&program[pos]; //disadvantage
writefln( "IChar %s", inst.c ); //benefit
pos += InstIChar.sizeof;//not used //well to actually
benefit of introduced types, should it be (*inst).sizeof ? ...
Another use case before jumping to conclusions:
auto instChar = cast(InstChar*)inst;//disadvantage
if ( instChar.c != thisChar )//benefit
return 0;
state_pc += InstChar.sizeof;//not used
Which means that the whole typing isn't exploited much, but gets in your
way far to often.
The idea to avoid all the mess with double casting is to introduce tag
union:
struct IR{
uint opcode;
union{
struct {
dchar c;
}
struct {
ubyte[8] bitmap;
}
//etc..
}
}
Then the double cast is removed:
auto inst = cast(IR*)&program[pos];
switch(inst.InstType){
...
case REInst.IChar:
writefln( "IChar %s", inst.c ); pos += InstIChar.sizeof;
}
The only problem that remains is that we've lost the actual size of
struct which varies based on an opcode value, and constructors...
Solving both at once :
struct WrapedIR(REInst InstType,T)
{
enum length = instLength(InstType);
property auto bytes(){// see later
return (cast(byte*)data)[0..length];
}
T data;
alias data this;
}
where instLength is the unsafe place :
size_t instLength(REInst type){
switch(type){
case REInst.Char: case REInst.IChar:
return sizeof(uint)+sizeof(dchar);
...
}
}
and having helper function (which may use property) like :
auto makeInst(IRInst inst)() { return WrappedIR!(inst,IR)(); }
will get us pretty much the same thing (or better) everywhere you create
instructions:
auto charInst = makeInst!(REInst.Char)();
Which gets us to how you are creating instructions actually:
static string MakeREInst( string instType, string instName )
{
string result =
"byte["~instType~".sizeof] "~instName~"Buf;"~
instType~"* "~instName~" =
cast("~instType~"*)&"~instName~"Buf[0];"~
"*"~instName~" = "~instType~"();";
return result;
}
which is this after mix-in:
byte[instType.sizeof] instNameBuf;
instType* instName = cast(InstType*)&instNameBuf[0];
*instName = instType();
Even disregarding the suggested approach to IR representation, the
better alternative I think would be the opposite - stack allocate
struct, and treat it as array of bytes when need be.
Isn't this one of prime cases for introducing types in the first place?
Also, why mixins?
instType instName = instType(); //can't be any worse then
mixin(MakeREInst( "instType", "instName" );
Then for getting bytes of any struct use this one-liner (analogous to
'bytes' of the above IR tag union):
ubyte[] bytesOf(T)(T* t){
return (cast(byte*)t)[0..T.sizeof];
}
Essentially you need to treat instructions as bytes exactly once: when
appending or inserting them into VM program.
program ~= splitInstBuf //append magically mixed-in buffer
...
becomes
program ~= bytesOf(&splitInst)
...
Doesn't look half bad, and compiler wouldn't allow missing '&' if anything.
Look & feel clearly could be enhanced by changing pointer argument to
ref argument in bytesOf, but I don't think that current DMD inliner
works well with ref. (and we _do_ need inliner for such functions)
One more unrelated thing is the way you do the insertion of some IR
instructions (this one from parseRegex, there are more):
program =
program[0..progConcatStart] // gets a slice of program, but since it's
followed by other data it has capacity of 0
~ splitInstBuf //and so this always allocates new array!
~ program[progConcatStart..$]; //and this might or
might not allocate
So no matter how innocent it looks, it's 1-2 allocation _every_ time.
To solve this troublesome consequence you need something like this (not
actually tested ):
program.length += splitInstBuf.length;
memmove(
&program[progConcatStart+splitInstBuf.length],
&program[progConcatStart],
program[0].sizeof*(program.length-progConcatStart)
);
program[progConcatStart..progConcatStart+splitBuf.length] = splitBuf[0..$];
...or simply cheat ;) and use std.array.insert:
insert(program,progConcatStart,splitBuf);
To further improve things also check on how to exploit Appender in
Phobos, it should be faster for continuous appending.
--
Dmitry Olshansky