So I went with this regex: "\\(\\b\\sw\\)\\|\\(?:[^A-Z]\\([A-Z]\\)\\)".

Note that I want to skip [^A-Z] and only get the bounds for [A-Z], otherwise it would be simple.

What I expected: since there are two capturing groups, tied with an \\|, (match-string 1) should give me what I want.

What I got: if a word start matches, it's in (match-string 1), and if a subword start matches, it's in (match-string 2). That seems pretty useless, since it leaves no way to collect the matches generically. How could I rebuild the regex to have both cases in (match-string 1)?

1 Answer

I don't know where your expectation comes from, but I don't know of any regular expression engine where it would hold by default. In every engine I'm aware of, unnamed capturing groups are numbered in order of appearance, that is, by the position of their opening parenthesis, and Emacs Lisp is no exception.

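Your pattern has two distinct capturing groups, which are numbered from left to right (the shy group \\(?: ... \\) is not counted). Hence the group in the first branch gets number 1 and the group in the second branch gets number 2, regardless of which branch actually matches.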

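I don't see, though, why this should be “useless”. Since the groups are in distinct branches of an alternation, only one of them can participate in any given match, and match-string returns nil for the other. So you can just use (or (match-string 1) (match-string 2)) to extract the matching text.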
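As a minimal sketch of that approach, the following collects every word or subword start from a string. The function name my/subword-start-strings and the sample input are made up for illustration. Binding case-fold-search to nil is my assumption about what you want: with case folding enabled, [A-Z] would also match lowercase letters.

```elisp
;; Hypothetical helper, not part of Emacs: collect the text of every
;; word or subword start in STRING, using the original two-group regex.
(defun my/subword-start-strings (string)
  "Return a list of the characters that start a word or subword in STRING."
  (let ((case-fold-search nil)          ; assumption: [A-Z] means uppercase only
        (regexp "\\(\\b\\sw\\)\\|\\(?:[^A-Z]\\([A-Z]\\)\\)")
        (start 0)
        (result '()))
    (while (string-match regexp string start)
      ;; Only one branch matched, so exactly one of the groups is non-nil.
      (push (or (match-string 1 string) (match-string 2 string)) result)
      ;; Every match consumes at least one character, so START advances.
      (setq start (match-end 0)))
    (nreverse result)))

(my/subword-start-strings "fooBar bazQux")
;; => ("f" "B" "b" "Q")
```

If you need positions rather than text, the same trick works with (or (match-beginning 1) (match-beginning 2)).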

Alternatively, you can give explicit numbers to both groups: "\\(?1:\\b\\sw\\)\\|\\(?:[^A-Z]\\(?1:[A-Z]\\)\\)". With this pattern, both groups get the same number, and the matching text always ends up in (match-string 1).
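Since you said you are after the bounds, here is a rough sketch that uses the explicitly numbered pattern and reads only group 1. Again, my/subword-start-positions and the sample string are hypothetical, and binding case-fold-search to nil is my assumption so that [A-Z] matches uppercase letters only.

```elisp
;; Hypothetical helper, not part of Emacs: collect the start position of
;; every word or subword in STRING, using the explicitly numbered regex.
(defun my/subword-start-positions (string)
  "Return a list of 0-based indices where a word or subword starts in STRING."
  (let ((case-fold-search nil)          ; assumption: [A-Z] means uppercase only
        (regexp "\\(?1:\\b\\sw\\)\\|\\(?:[^A-Z]\\(?1:[A-Z]\\)\\)")
        (start 0)
        (positions '()))
    (while (string-match regexp string start)
      ;; Both branches store their capture in group 1, so no `or' is needed.
      (push (match-beginning 1) positions)
      (setq start (match-end 0)))
    (nreverse positions)))

(my/subword-start-positions "fooBar bazQux")
;; => (0 3 7 10)
```

Note that (match-beginning 0) would be off by one for subword starts, because the whole match also includes the [^A-Z] character in front of the capital.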