As far as I know these usages were incorrect anyway: we resize
the hash bucket array based on the number of keys it holds,
not on the number of buckets that are used.

Another bug that this revealed is that the old code would allow
HvMAX(hv) to fall to 0, even though every other part of the
core expects it to have a minimum of 7 (meaning 8 buckets).

As part of this we change the hard-coded 7 to a defined constant,
PERL_HASH_DEFAULT_HvMAX.
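A sketch of the shape (the real definition lives in hv.h alongside the
other hash defaults):

    /* name the magic number: HvMAX holds the highest bucket index,
       so a default of 7 means 8 buckets */
    #define PERL_HASH_DEFAULT_HvMAX 7

    /* e.g. when setting up a fresh bucket array: */
    HvMAX(hv) = PERL_HASH_DEFAULT_HvMAX;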

After this patch there remains one use of HvFILL in core, the one used
for scalar(%hash), which I plan to remove in a later patch.

The split patch introduced a buffer read overrun in sv_dump() when
stringifying empty strings. This bug had always existed, but was probably
never triggered because we almost always have at least one of the extflags
set, so sv_dump() never saw an empty buffer. Not so with the new
compflags. :-(
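The failure mode, roughly (illustrative, not the literal sv_dump() code):
the flag-stringifying code skips a leading separator that is not there
when no flags were appended:

    /* buf normally holds ",FLAG1,FLAG2,..."; with no flags set it is "" */
    if (*buf)
        sv_catpv(t, buf + 1);   /* only skip the leading ',' when one exists */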

Historically, the special whitespace-splitting behavior of C<split " ">
was only available if Perl could resolve the first argument to
split at compile time, meaning only under various arcane circumstances.

This manifested as oddities like

my $delim = $cond ? " " : qr/\s+/;
split $delim, $string;

and

split $cond ? " " : qr/\s+/, $string;

not behaving the same as:

($cond ? split(" ", $string) : split(/\s+/, $string))

which isn't very convenient.

This patch changes this by adding a new flag, PMf_SPLIT, to the
op_pmflags, which lets pp_regcomp() know it was called as part of
split, so that RXf_SPLIT can be passed into run-time regex
compilation. We also preserve the original flags passed into the
compilation process, by adding a new property to the regexp
structure, "compflags", and related macros for accessing it; this
lets us compare flags when deciding whether we need to recompile,
so pattern caching works properly.
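In outline, simplifying heavily (PMf_SPLIT, RXf_SPLIT and RX_COMPFLAGS as
described above; the variables and the recompile condition here are
illustrative stand-ins for the real caching logic):

    /* in pp_regcomp(), sketch: mark run-time compilations from split */
    U32 rx_flags = base_flags;          /* hypothetical accumulated flags */
    if (pm->op_pmflags & PMf_SPLIT)
        rx_flags |= RXf_SPLIT;

    /* sketch of the cache check: the saved compflags tell us whether the
       cached regexp was compiled with the same flags we would use now */
    if (!re || RX_COMPFLAGS(re) != rx_flags)
        re = re_compile(pattern, rx_flags);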

This was reverted because it broke C<split("\x{20}", $thing)>, and
because one might argue that it is not #1 that does the wrong thing,
but rather the behavior of #2 that is wrong. In other words, we might
expect all three to behave the same as #3, and instead of "fixing" the
behavior of #1 to be like #2, we should really fix the behavior of #2
to behave like #3. (Which is what we did.)

Also, it doesn't make sense to move the special-case detection logic
further from the regex engine. We really want the regex engine to decide
this stuff itself; otherwise split " ", ... wouldn't work properly with
an alternate engine. (Imagine we add a special regexp meta-pattern that
behaves the same as " " does in a split /.../; for instance, we might make
split /(*SPLITWHITE)/ trigger the same behavior as split " ".)

The other major change resulting from this patch is that it effectively
reverts commit cccd1425414e6518c1fc8b7bcaccfb119320c513, which
was intended to get rid of RXf_SPLIT and RXf_SKIPWHITE
and free up bits in the regex flags structure.

But we don't want to get rid of these flags, and it turns out that
RXf_SEEN_LOOKBEHIND is used only in the same situation as the new
RXf_MODIFIES_VARS. So I have renamed RXf_SEEN_LOOKBEHIND to
RXf_NO_INPLACE_SUBST, and instead of using two flags we use
only the one. Which in turn allows RXf_SPLIT and RXf_SKIPWHITE to
have their bits back.

xs_init() must pass a static char* when creating &DynaLoader::boot_DynaLoader.

newXS() assumes that the passed pointer to the filename is in static storage,
or will otherwise outlive the PVCV that it is about to create, and hence that
it's safe to copy the pointer, not the value, to CvFILE. Hence xs_init()
must not use an auto array to "store" the filename, as that would be on the
stack and become invalid as soon as xs_init() returns. The analogous bug
fix was made in universal.c by commit 157e3fc8c802010d in Feb 2006.
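For reference, the safe shape (essentially what a generated perlmain.c does):

    /* static storage: the pointer newXS() saves into CvFILE stays valid */
    static const char file[] = __FILE__;

    EXTERN_C void boot_DynaLoader(pTHX_ CV *cv);

    EXTERN_C void
    xs_init(pTHX)
    {
        dXSUB_SYS;
        /* DynaLoader is a special case: bootstrapped by hand */
        newXS("DynaLoader::boot_DynaLoader", boot_DynaLoader, file);
    }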

Spotted by compiling for ithreads with gcc 4.8.0's ASAN and running
dist/B-Deparse/t/deparse.t

If the heredoc terminator we are searching for is longer than the bytes
remaining in s, then the memNE() would read beyond initialised memory.
Hence change the loop bounds to avoid this case, and change the failure case
below to reflect the revised end-of-loop condition.

It doesn't matter that the loop no longer increments shared->herelines,
because the failure case calls S_missingterm(), which croaks.
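Schematically (identifier names are illustrative stand-ins for the
toke.c code):

    /* only test offsets where a full len-byte terminator can still fit */
    while (s < bufend - len + 1 && memNE(s, terminator, len))
        s = skip_to_next_line(s);   /* hypothetical line advance */
    if (s >= bufend - len + 1)
        missingterm();              /* croaks, so herelines is moot */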

PerlIO_find_layer should not be using memEQ() off the end of the layer name.

PerlIO_find_layer was using memEQ() to compare the name of the desired layer
with each layer in the array of known layers. However, it always used the
length of the desired layer's name for the comparison, regardless of the
length of the name it was being compared with, resulting in out-of-bounds
reads.
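The shape of the fix is simply (sketch):

    /* compare lengths first, so memEQ() never reads past either name */
    if (strlen(f->name) == len && memEQ(f->name, name, len))
        return f;   /* found the layer */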

The CRTL's fstat() sets errno to EVMSERR and vaxc$errno to RMS$_IOP
when called on a process-permanent file (i.e., stdin, stdout, or
stderr). That error generally means a rewind operation on a file
that cannot be rewound. It's odd that fstat is doing such a thing,
but we can at least protect ourselves from the effects by
saving errno and restoring it when the call succeeds.
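A minimal sketch of the guard (the wrapper name is hypothetical):

    #include <errno.h>
    #include <sys/stat.h>

    /* call fstat() without letting a *successful* call clobber errno */
    static int
    fstat_keep_errno(int fd, struct stat *st)
    {
        int saved_errno = errno;
        int ret = fstat(fd, st);
        if (ret == 0)
            errno = saved_errno;    /* success: undo any spurious EVMSERR */
        return ret;
    }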

* If the hash is not OOK omit any iterator status information
instead of showing -1/NULL
* If the hash is OOK then add the RAND value from the iterator
and if the LASTRAND is not the same show it too
* Tweak tests to test the above.

In 596a6cbd6bcaa8e6a4 I added a somewhat desperate hack to detect
whether a file is record-oriented so that we preserve record semantics
when PL_rs has been set. I did it by calling fstat(), which is
already a pretty icky thing to be doing on every record read, but
it turns out things are even worse because fstat() sets errno under
some conditions even when it's successful, specifically when the file
is a Process-Permanent File (PPF), i.e., standard input or output.

So save errno before the fstat and restore it before doing the
read so if the read fails we get a proper errno. This gets
t/io/errno.t passing.

Side note: instead of fstat() here, we probably need to store a
pointer to the FAB (File Access Block) in the PerlIO struct so all
the metadata about the file is always accessible. This would
require setting up completion routines in PerlIOUnix_open and
PerlIOStdio_open.

When using threads + fork() on Linux with IO operations in the threads,
PL_perlio_mutex may be left in a locked state at the time of the fork()
call, potentially leading to deadlock in the child process at subsequent
IO operations. (Threads are pre-empted and not continued in the child
process after the fork.)

Therefore, ensure that the PL_perlio_mutex is unlocked in the child
process, right after fork(), by using atfork_lock/unlock.

(The RT ticket gives ways to reproduce the problem, but they are not
easily added to Perl's test suite.)
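A sketch of the mechanism; in core the handlers are Perl_atfork_lock()
and Perl_atfork_unlock(), registered via the PTHREAD_ATFORK macro:

    /* take all interpreter mutexes (including PL_perlio_mutex) before the
       fork, and release them in both parent and child afterwards, so the
       child never inherits a mutex locked by a thread it doesn't have */
    pthread_atfork(Perl_atfork_lock,     /* prepare */
                   Perl_atfork_unlock,   /* parent  */
                   Perl_atfork_unlock);  /* child   */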

The code now uses regular returns instead of setjmp() and longjmp() for
signalling the need for pattern compilation to restart. By avoiding this,
and the corresponding need to mark many variables as volatile, we make
the code less fragile, and Address Sanitizer is now happy.

The netbsd-5.0.2 compiler pointed out that the recent changes to add
longjmp()s to speed up some regex compilations could result in clobbering a
few values. These depend on the compiled code, and so didn't show up in
other compilers' warnings. This patch reinitializes them after a
longjmp().

[With a lot of hand editing in regcomp.c, to propagate the changes through
subsequent commits.]

Remove volatile qualifiers. Remove the variable jump_ret. Move the
initialisation of restudied back to the declaration. This reverts several of
the changes made by commits 5d51ce98fae3de07 and bbd61b5ffb7621c2.

However, I can't see a cleaner way to avoid code duplication when restarting
the parse than the approach I've taken here: the label redo_first_pass is
now inside an if (0) block, which is clear but ugly.
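i.e. the control flow is, schematically:

    /* sketch of the shape in S_reg(): the block is skipped on the first
       pass and reachable only via the goto */
    if (0) {
      redo_first_pass:
        /* reset whatever state the aborted first pass clobbered,
           then fall through and parse again */
    }
    /* ... sizing parse; on discovering the pattern must be UTF-8: ... */
    if (need_utf8_restart)      /* hypothetical condition */
        goto redo_first_pass;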

The regex parse needs to be restarted if it turns out that it should be done
as UTF-8, not bytes. Using setjmp()/longjmp() complicates compilation
considerably, triggering warnings about missing volatile qualifiers and
hitting code generation errors from clang's ASAN. Using goto is much clearer.

In S_regclass(), create listsv as a mortal, claiming a reference if needed.

The SV listsv is sometimes stored in an array generated near the end of
S_regclass(). In other cases it is not used, and it needs to be freed if
any of the warnings that S_regclass() can trigger turn out to be fatal.

The simplest solution to this problem is to declare it from the start as a
mortal, and claim a (new) reference to it if it is *not* to be freed. This
permits the removal of all other code related to ensuring that it is freed
at the right time, but not freed prematurely if a call to a warning returns.
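In outline (the condition name is an illustrative stand-in; the SV's
initial contents shown here are just an example):

    /* created mortal: reclaimed automatically if a fatalized warning croaks */
    SV *listsv = sv_2mortal(newSVpvs("# comment\n"));

    /* ... parse the character class; any croak frees listsv for us ... */

    if (listsv_is_stored)                     /* hypothetical condition */
        SvREFCNT_inc_simple_void_NN(listsv);  /* claim the reference: the SV
                                                 now survives mortality */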

As documented in pod/perlreguts.pod, the call graph for regex parsing
involves several levels of functions in regcomp.c, sometimes recursing more
than once.

The top-level compiling function, S_reg(), calls S_regbranch() to parse each
single branch of an alternation. In turn, that calls S_regpiece() to parse
a simple pattern followed by a quantifier, which calls S_regatom() to parse
that simple pattern. S_regatom() can call S_regclass() to handle classes,
but can also recurse into S_reg() to handle subpatterns and some other
constructions. Some other routines also call S_reg(), sometimes using an
alternative pattern that they generate dynamically to represent their input.

These routines all return a pointer to a regnode structure, and take a
pointer to an integer that holds flags, which is also used to return
information.

Historically, it has not been clear when and why they return NULL, and
whether the return value can be ignored. In particular, the "Jumbo regexp
patch" (commit c277df42229d99fe, from Nov 1997) added code with two calls
from S_reg() to S_regbranch(), one of which checks the return value and
generates a LONGJMP node if it is NULL, while the other is called in void
context, and so ignores both the return value and the possibility that it
is NULL.

After some analysis I have untangled the possible return values from these
5 functions (and related functions which call S_reg()).

Starting from the top:
S_reg() will return NULL and set the flags to TRYAGAIN at the end of pragma-
like constructions that it handles. Otherwise, historically it would return
NULL if S_regbranch() returned NULL. In turn, S_regbranch() would return
NULL if S_regpiece() returned NULL without setting TRYAGAIN. If S_regpiece()
returns TRYAGAIN, S_regbranch() loops, and ultimately will not return NULL.

S_regpiece() returns NULL with TRYAGAIN if S_regatom() returns NULL with
TRYAGAIN, but (historically) if S_regatom() returned NULL without setting
the flags to TRYAGAIN, S_regpiece() would too. Where S_regatom() calls
S_reg() it has similar behaviour when passing back return values, although
often it is able to loop instead on getting a TRYAGAIN.

Which gets us back to S_reg(), which can only *generate* NULL in conjunction
with TRYAGAIN. NULL without TRYAGAIN could only be returned if a routine it
called generated it. All other functions that these call that return regnode
structures cannot return NULL. Hence

1) in the loop of functions called, there is no source for a return value of
NULL without the TRYAGAIN flag being set
2) a return value of NULL with TRYAGAIN set from an inner function does not
propagate out past S_regbranch()

Hence the only return values that most functions can generate are non-NULL,
or NULL with TRYAGAIN set, and as S_regbranch() catches these, it cannot
return NULL. The longest sequence of functions that can return NULL (with
TRYAGAIN set) is S_reg() -> S_regatom() -> S_regpiece() -> S_regbranch().
Rapidly returning right round the loop back to S_reg() is not possible.

Hence the code added by commit c277df42229d99fe to handle a NULL return from
S_regbranch(), along with some other code, is dead.

The return value isn't used (yet). Previously the code was returning END,
which is a macro for the regnode number for "End of program", which happens
to be 0. It happens that 0 is valid C for a NULL pointer, but using it in
this way makes the intent unclear. Not only is orig_emit of a valid type,
it's semantically the correct thing to return, as it's the most recently
added node.

Test that UTF-8 in the look-ahead of (?(?=...)...) restarts the sizing parse.

S_reg() recurses into itself to parse various constructions used as the
conditionals in conditional matching. Look-aheads and look-behinds can turn
out to need to be sized as UTF-8, which can cause the inner S_reg() to use
the macro REQUIRE_UTF8 to restart the parse. Test that this is
handled correctly.

Test that S_grok_bslash_N() copes if S_reg() restarts the sizing parse.

S_reg() can discover, midway through parsing the pattern to determine its
size, that the pattern will actually need to be encoded as UTF-8. If
calculations so far have been done in terms of bytes, the macro
REQUIRE_UTF8 is used to restart the parse, so that sizes can be calculated
correctly for UTF-8.

It is possible to trigger this restart when processing multi-character
charnames interpolated into the pattern using \N{}. Test that this is
handled correctly.

I believe that this code was rendered unreachable when perl 5.001 added
code to S_nextchar() to skip over embedded comments. Adrian Enache noted
this in March 2003, and proposed a patch which removed it. See
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-03/msg00840.html

The patch wasn't applied at that time, and when he sent it again in August
he omitted that hunk. See
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-08/msg01820.html

Commit 8d919b0a35f2b57a changed the storage location of the string in
SVt_REGEXP. It updated most code to deal with this, but missed the use of
SvPVX_const() in Perl_sv_uni_display(). This breaks dumping regular
expressions which have the UTF-8 flag set.

Inserting into a hash that is being traversed with each()
has always produced undefined behavior. With hash traversal
randomization this is more pronounced, and at the same
time relatively easy to spot. At the cost of an extra U32
in the xpvhv_aux structure we can detect that the xhv_rand
has changed and then produce a warning if it has.

It was suggested on IRC that this should produce a fatal
error, but I couldn't see a clean way to manage that with
"strict"; it was much easier to create a "severe" (internal)
warning, which is enabled by default but suppressible with
C<no warnings "internal";> if people /really/ want it.
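The detection itself is cheap; a sketch, with field names following the
description above:

    /* in the iterator: xhv_rand is reseeded when the hash is modified, so
       a mismatch against the value cached at traversal start means an
       insert happened mid-traversal */
    if (HvAUX(hv)->xhv_last_rand != HvAUX(hv)->xhv_rand) {
        Perl_ck_warner_d(aTHX_ packWARN(WARN_INTERNAL),
            "Use of each() on hash after insertion without resetting "
            "hash iterator results in undefined behavior");
        HvAUX(hv)->xhv_last_rand = HvAUX(hv)->xhv_rand;  /* warn only once */
    }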

ensure that inserting into a hash causes its hash iteration order to change

This serves two functions: it makes it harder for an attacker
to learn useful information by viewing the output of keys(),
and it makes "insert during traversal" errors much easier to
spot, as they will almost always produce degenerate behavior.

perturb insertion order and update xhv_rand during insertion and S_hsplit()

When inserting into a hash results in a collision, the order of the items
in the bucket chain is predictable (FILO), which can be used to determine
that a collision has occurred.

When a hash is too small for the number of items it holds, we double
its size and remap the items as required. During this process the
keys in a bucket reverse order, which exposes to an attacker the
information that a collision has occurred.

We therefore use PL_hash_rand_bits and the S_ptr_hash()
infrastructure to randomly "perturb" the order in which colliding
items are inserted into the bucket chain. During insertion and
remapping, instead of doing a simple "insert at top" we check the low
bit of PL_hash_rand_bits: depending on whether it is set, we insert
at the top of the chain or second from the top. The end result is
that the order within a bucket is less predictable, which should
make it harder for an attacker to spot a collision.
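Concretely, the insertion step becomes something like (simplified from
the hv_common()/S_hsplit() pattern described above):

    HE **oentry = &HvARRAY(hv)[hash & HvMAX(hv)];
    if (*oentry && (PL_hash_rand_bits & 1)) {
        /* chain is non-empty and the coin says: second from the top */
        HeNEXT(entry) = HeNEXT(*oentry);
        HeNEXT(*oentry) = entry;
    }
    else {
        /* classic insert at the top of the chain */
        HeNEXT(entry) = *oentry;
        *oentry = entry;
    }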

Every insert (via hv_common()) and every bucket doubling (via hsplit())
updates PL_hash_rand_bits using "randomish" data
like the hashed bucket address, the hash of the inserted item, and
the address of the inserted item.

This also updates the xhv_rand of the hash, if there is one, during
S_hsplit(), so that the iteration order changes when S_hsplit() is
called. This too is intended to make it harder for an attacker to
acquire information about collisions.

S_ptr_hash() - A new static function in hv.c which can be used to
hash a pointer or integer.
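Something along these lines (the mixing constants here are illustrative,
not the ones in hv.c):

    /* fold a pointer-sized integer down to a well-mixed U32 */
    static U32
    S_ptr_hash(PTRV u)
    {
        u >>= 3;                        /* discard always-zero alignment bits */
        u *= (PTRV)0x9e3779b9;          /* multiplicative scramble */
        return (U32)(u ^ (u >> 16));    /* fold high bits into the low word */
    }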

PL_hash_rand_bits - A new interpreter variable used as a cheap
provider of "semi-random" state for use by the hash infrastructure.

xpvhv_aux.xhv_rand - Used as a mask which is xored against the
xpvhv_aux.riter during iteration to randomize the order in which the
actual buckets are visited.
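So bucket traversal becomes, in sketch form:

    /* riter still counts 0..HvMAX as before; the xor permutes which bucket
       is actually visited.  Because HvMAX+1 is a power of two, xor-then-mask
       is a bijection, so every bucket is still visited exactly once. */
    entry = HvARRAY(hv)[(HvAUX(hv)->riter ^ HvAUX(hv)->xhv_rand) & HvMAX(hv)];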

PL_hash_rand_bits is initialized at interpreter start from the random
hash seed, and then modified by "mixing in" the result of ptr_hash()
on the bucket array pointer in the hv (HvARRAY(hv)) every time
hv_auxinit() allocates a new iterator structure.

The net result is that every hash has its own iteration order, which
should make it much more difficult to determine what the current hash
seed is.

This required some tests to be restructured, as they tested for something
that was not necessarily true: we never guaranteed that two hashes with
the same keys would produce the same key order. We merely promised that
using keys(), values(), or each() on the same hash, without any
insertions in between, would produce the same order of visiting the
keys/values.