Description:
------------
Using preg_match_all with the PREG_SET_ORDER flag and an optional capture group might return either an empty string or a missing element.
This depends on whether there is a matched capture group after the non-matched capture group.
From an interface perspective, this means that I have to check for two representations to test whether an optional capture group was matched.
The test script demonstrates such a case.
When considering a fix, the first priority should be that whatever the representation is (NULL, an empty string or a missing element.), it should be consistent, no matter if there is a matched capture group after or not.
If this cannot be fixed due to backwards compatibility Issues. Could we add a PREG_KEEP_NONMATCHES or PREG_SET_ORDER_2 flag?
Test script:
---------------
<?php
preg_match_all("#(a)?(b)(c)?#","b",$matches,PREG_SET_ORDER);
var_dump($matches);
Expected result:
----------------
array(1) {
[0] => array(3) {
[0] => string(1) "b" [1] => string(0) "" [2] => string(1) "b" [3] => string(0) ""
}
}
Actual result:
--------------
array(1) {
[0] => array(3) {
[0] => string(1) "b" [1] => string(0) "" [2] => string(1) "b"
}
}

Hmm, yes, I misunderstood this report: 61780 is about turning empty strings from unmatched subpatterns into NULLs while this is about *adding* trailing unmatched subpatterns.
I don't see anything in the 61780 changes that clearly indicate this is also fixed, but some parts look like they might do it accidentally. Can anyone check? (It'd be nice if 3v4l could do master too...)

For PREG_PATTERN_ORDER unmatched subpatterns are already added[1];
something like that would also have to be done for PREG_SET_ORDER.
However, that would cause BC issues, and adding another flag isn't
appealing to me. I also think that NULL and unset are close enough
to stick with the current behavior for now; checking `isset()`
would suffice.
[1] <https://github.com/php/php-src/blob/PHP-7.1.0/ext/pcre/php_pcre.c#L862-L871>

[2017-01-23 15:48 UTC] tomasyorke at hotmail dot com

As noted by cmb, this bug does not appear with the PREG_PATTERN_ORDER flag.
Thus I switched the flag and my code to use PREG_PATTERN_ORDER and implemented my own preg_set_order() function.
But I found a nearly identical bug with the PREG_OFFSET_CAPTURE flag.
Consider the code
preg_match_all("#(a)?(b)(c)?#","b",$matches,PREG_OFFSET_CAPTURE);
var_dump($matches);
Returns
array(4) {
[0] => array(1) {
[0] => array(2) {
[0] => string(1) "b" [1] => int(0)
}
} [1] => array(1) {
[0] => array(2) {
[0] => string(0) "" [1] => int(-1)
}
} [2] => array(1) {
[0] => array(2) {
[0] => string(1) "b" [1] => int(0)
}
} [3] => array(1) {
[0] => string(0) ""
}
}
As you can see, the third optional capture group returns something different than the first optional capture group.
This is too a bug. Again, I don't care what representation is chosen. But it must be the same for any optional capture no matter if trailing or not.

[2017-01-23 15:59 UTC] tomasyorke at hotmail dot com

Regarding BC
The proposed change would be to change the representation of missed capture groups from 2 representations to 1 representation.
The only code that would break is code that supported only the more obscure representation that only appears with trailing groups. And if that code exists, then it will still break on non trailing missing groups even without the change.

Regarding the BC break: when fixing <https://bugs.php.net/61780> I
would have preferred to unset unmatched captures, but that
appeared too much of a BC break (consider somebody is count()ing
the $matches). Therefore I changed the empty strings to NULL.
Adding unmatched trailing captures as NULL would still have the
same BC concerns – I don't think we can do that before PHP 7.2,
and even that might require the RFC process.

[2017-01-23 17:43 UTC] tomasyorke at hotmail dot com

I'm glad we didn't go the unset route. Having fixed sized arrays is useful for looping by groups/matches independently of the Flag used.
You also can very easily know how many groups/matches the regex returned/contained.
By making the output array dynamicly sized. You only allow for an easy way to determine one of these things. While making the other more complex.

[2017-01-23 17:53 UTC] tomasyorke at hotmail dot com

If you wish to delay this change until 7.2, that sounds fine.
But why would you need an RFC for a bugfix?

Since we now have PREG_UNMATCHED_AS_NULL, the solution appears to
be straight forward: fill up NULL values at the end, only if
PREG_UNMATCHED_AS_NULL is used. Since this option is only
available as of PHP 7.2.0, the resulting BC break seems to be
acceptable.

@requinix: That's impossible for BC reasons. I agree with @cmb that this should only be done for PREG_UNMATCHED_AS_NULL.

[2018-06-09 22:51 UTC] php at tecnopolis dot ca

For those stuck on older PHP's, I'm wondering if the following is sufficient to work around this bug?
$pattern.='(.??)';
preg_match($pattern,$haystack,$matches);
array_pop($matches);
I can't yet find any instance where the fairly innocuous (.??) could possibly cause a problem, it should match in all cases, when there's chars left or not, and being non-greedy will not eat earlier matches' chars.