To avoid false matches in a multibyte encoding, this module uses an anchoring technique that ensures every match position falls on a character boundary. cf. perlfaq6, "How can I match strings with multibyte characters?"
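The following is a minimal sketch of the anchoring idea in core Perl, not this module's actual implementation. The variable names ($char, $naive, $anchored) are illustrative assumptions; the byte ranges are the ones this document gives for a legal Shift-JIS character.

```perl
use strict;
use warnings;

# One Shift-JIS character: a single byte, or a leading byte plus a trailing byte.
my $char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# "\x83\x41" is one double-byte character whose trailing byte is 0x41 ('A').
my $str = "\x83\x41" . "xyz";

# Naive byte-wise match: wrongly finds 'A' inside the double-byte character.
my $naive = ($str =~ /A/) ? 1 : 0;

# Anchored match: step over whole characters from the start of the string,
# so 'A' can only be tried at a character boundary.
my $anchored = ($str =~ /\A(?:$char)*?A/) ? 1 : 0;

print "naive=$naive anchored=$anchored\n";   # prints "naive=1 anchored=0"
```

The anchored pattern can never start the search in the middle of a double-byte character, which is why the false match disappears.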

PATTERN and MODIFIER are each specified as a string. The following modifiers are allowed.

    i  case-insensitive pattern (only for ASCII alphabets)
    I  case-insensitive pattern (Greek, Cyrillic, fullwidth Latin)
    j  hiragana-katakana-insensitive pattern (but halfwidth katakana
       are not considered.)
    s  treat string as single line
    m  treat string as multiple lines
    x  ignore whitespace (i.e. [\x20\n\r\t\f]) unless backslashed
       or inside a character class; but comments are not recognized!
    o  once parsed (not compiled!) and the result is cached internally.

o modifier

    while (<DATA>) {
        print replace($_, '(perl)', '<strong>$1</strong>', 'igo');
    }

is more efficient than

    while (<DATA>) {
        print replace($_, '(perl)', '<strong>$1</strong>', 'ig');
    }

because in the latter case the pattern is parsed again every time the function is called.

If a reference to a scalar is specified as the first argument, the substitution is made in place on the referenced scalar and the number of substitutions made is returned. If a string (not a reference) is specified as the first argument, the substituted string is returned and the specified string is left unchanged.

This function emulates CORE::split(' ', STRING, LIMIT). It returns the list obtained by splitting STRING on whitespace, including "\x81\x40" (IDEOGRAPHIC SPACE). Leading whitespace characters do not produce any field.

    regexp   meaning
    ^        match the start of the string
             match the start of any line with 'm' modifier
    $        match the end of the string, or before newline at the end
             match the end of any line with 'm' modifier
    .        match any character except \n
             match any character with 's' modifier
    \A       only at beginning of string
    \Z       at the end of the string, or before newline at the end
    \z       only at the end of the string (eq. '(?!\n)\Z')
    \C       match a single C char (octet), i.e. [\0-\xFF] in perl.
    \j       match any character, i.e. [\0-\x{FCFC}] in this module.
    \J       match any character except \n, i.e. [^\n] in this module.

    * \j and \J are extensions by this module. e.g.

      match($_, '(\j{5})\z') returns last five chars including \n at the end
      match($_, '(\J{5})\Z') returns last five chars excluding \n at the end

A character class can include literal characters, metacharacters, and predefined character classes. Ranges in a character class are supported. The endpoints of a range are specified by literal characters or metacharacters.

Users need not be aware of the legal ranges of leading and trailing bytes in Shift-JIS, as this module properly skips illegal byte sequences when a character range is expanded. For example, [\x{8340}-\x{8396}] is equivalent to [\x{8340}-\x{837E}\x{8380}-\x{8396}], since 0x7F is illegal as a trailing byte in Shift-JIS. So [\0-\x{fcfc}] matches any one Shift-JIS character. In a character class, any character or byte sequence that does not match any one Shift-JIS character (say, re('[\xA0-\xFF]')) causes the module to croak.
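The effect of skipping illegal trailing bytes can be checked with a small core-Perl sketch. The helper legal_double_byte() is a hypothetical name introduced here for illustration; it is not part of this module and only encodes the leading and trailing byte ranges given in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a 16-bit code is a legal Shift-JIS
# double-byte character (legal leading byte and legal trailing byte).
sub legal_double_byte {
    my ($code) = @_;
    my ($lead, $trail) = ($code >> 8, $code & 0xFF);
    my $lead_ok  = ($lead  >= 0x81 && $lead  <= 0x9F) || ($lead  >= 0xE0 && $lead  <= 0xFC);
    my $trail_ok = ($trail >= 0x40 && $trail <= 0x7E) || ($trail >= 0x80 && $trail <= 0xFC);
    return $lead_ok && $trail_ok;
}

# Expand the range \x{8340}-\x{8396}, keeping only legal characters.
my @members = grep { legal_double_byte($_) } 0x8340 .. 0x8396;
my $has_7f  = grep { $_ == 0x837F } @members;

printf "%d characters, 0x837F included: %s\n",
    scalar(@members), $has_7f ? "yes" : "no";
# prints "86 characters, 0x837F included: no"
```

The range spans 87 code values, and dropping the single illegal one (trailing byte 0x7F) leaves 86 characters.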

Character classes that match a non-Shift-JIS substring are not supported (use \C or alternation).

Since version 0.13, the POSIX character equivalence classes [=x=] are supported, where x can be any character literal or metacharacter (\xhh, \x{hhhh}) that belongs to the character equivalents. Equivalence classes that name different members of the same set of equivalents have identical meanings. Character equivalence classes are used inside a character class.

For a kana collation symbol that may be voiced or semi-voiced, the equivalents include the two-character sequence(s) that write the voiced or semi-voiced form in halfwidth katakana.

\p{Halfwidth} matches an ASCII graphic character excluding QUOTATION MARK, APOSTROPHE, and HYPHEN-MINUS. \p{Fullwidth} matches a double-byte character corresponding to \p{Halfwidth}. Note: the \p{Fullwidth} character for 0x5C (\) is FULLWIDTH YEN SIGN and that for 0x7E (~) is FULLWIDTH MACRON.

\p{MSWin} matches a character of Microsoft CP932. \p{NEC} matches an NEC special character or an NEC-selected IBM extended character. \p{IBM} matches an IBM extended character. \p{Vendor} matches a vendor-defined character in Microsoft CP932, i.e. it is equivalent to [\p{NEC}\p{IBM}].

\p{Kanji0} matches a kanji of the minimum kanji class of JIS X 4061; \p{Kanji1} matches a kanji of the level 1 kanji of JIS X 0208; \p{Kanji2} matches a kanji of the level 2 kanji of JIS X 0208; \p{Kanji} matches a kanji of the basic kanji class of JIS X 4061.

\p{Prop}, \P{^Prop}, [\p{Prop}], etc. are equivalent to each other; and their complements are \P{Prop}, \p{^Prop}, [\P{Prop}], [^\p{Prop}], etc. \pP, \P^P, [\pP], etc. are equivalent to each other; and their complements are \PP, \p^P, [\PP], [^\pP], etc. [[:class:]] is equivalent to [^[:^class:]]; and their complements are [[:^class:]] or [^[:class:]].

In \p{Prop}, \P{Prop}, and [:class:] expressions, Prop and class are case-insensitive. E.g. \p{digit}, [:BoxDrawing:], etc. are also accepted. The prefixes Is and In for \p{Prop} and \P{Prop} (e.g. \p{IsProp}, \P{InProp}, etc.) are optional. But \p{isProp}, \p{ISProp}, etc. are not accepted, since the prefixes Is and In are case-sensitive.

An embedded modifier, (?iIjsmxo), that appears at the beginning of the 'regexp', or that follows one of the regular expressions ^, \A, or \G at the beginning of the 'regexp', is allowed to contain the I, j, and o modifiers.

Using the 'e' modifier in a replacement and looping in a while-clause are not supported by this module. They can be used only via the usual syntax (i.e. in the m// or s/// operators).

Use the regular expression '\A(\j*?)' or '\G(\j*?)' to avoid mismatching a single-byte character on the trailing byte of a double-byte character, or a double-byte character on the two bytes before and after a character boundary.

Don't forget that $1 corresponds to '(\j*?)', so the backreferences you intend to use begin at $2.
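The shift in capture numbering can be seen in a core-Perl sketch, where a hand-written character pattern stands in for \j. This is an assumption made for illustration only; $char, $prefix, $letter, and $digit are names introduced here, and the module itself is not used.

```perl
use strict;
use warnings;

# Stand-in for \j: one legal Shift-JIS character (single or double byte).
my $char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# One double-byte character (trailing byte 0x41), followed by "A1B2".
my $str = "\x83\x41" . "A1B2";

my ($prefix, $letter, $digit);
if ($str =~ /\A((?:$char)*?)([A-Z])(\d)/) {
    # $1 is the skipped prefix; the groups of interest start at $2.
    ($prefix, $letter, $digit) = ($1, $2, $3);
    printf "prefix bytes=%d letter=%s digit=%s\n",
        length($prefix), $letter, $digit;
    # prints "prefix bytes=2 letter=A digit=1"
}
```

The first group swallows the double-byte character, so the letter and digit arrive in $2 and $3 rather than $1 and $2.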

Note: When matching on a very long string, the special regular expression \R{padG} may be safer than \G(\j*?), because with the former the repeat count of * is less likely to exceed a limit.

A legal Shift-JIS character in this module must match the following regular expression:

[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]
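As a rough illustration, the expression above can be wrapped into a whole-string check in core Perl. The function name looks_like_sjis() is hypothetical and is not the module's own checking function; it only applies the regular expression given above to every character of the string.

```perl
use strict;
use warnings;

# Hypothetical checker: true if the whole byte string is a sequence of
# characters that each match the legal-character expression above.
sub looks_like_sjis {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC])*\z/
        ? 1 : 0;
}

my @results = map { looks_like_sjis($_) } (
    "abc",          # single bytes only
    "\x83\x41",     # legal double-byte character
    "\x83",         # leading byte with no trailing byte
    "\x83\x7F",     # 0x7F is not a legal trailing byte
    "\xA0",         # neither a single byte nor a leading byte
);
print "@results\n";   # prints "1 1 0 0 0"
```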

Any string from an external resource should be checked with the function ShiftJIS::String::issjis(), unless you know for certain that it is encoded in Shift-JIS.

Use of an illegal Shift-JIS string may lead to odd results.

Some Shift-JIS double-byte characters have a trailing byte in the range of [\x40-\x7E], viz.,

@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Perl's lexical analyzer takes no special care of these characters, so they sometimes cause trouble. For example, a quoted literal ending with a double-byte character whose trailing byte is 0x5C causes a fatal error, since the trailing byte 0x5C escapes the closing quote.

Such a problem doesn't arise when the string is obtained from an external resource. But writing a script that contains Shift-JIS double-byte characters requires the greatest care.

To define a Shift-JIS string literal, the use of a single-quoted heredoc (<< '') or of \xhh metacharacters is recommended.
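A short sketch of both ways of writing such a literal follows. To keep the example pure ASCII, the heredoc body uses an ordinary backslash as a stand-in for a double-byte character whose trailing byte is 0x5C; the variable names and the EOS terminator are illustrative assumptions.

```perl
use strict;
use warnings;

# \xhh metacharacters: 0x95 0x5C is a double-byte character whose
# trailing byte is a backslash; written this way it cannot escape the quote.
my $by_escape = "\x95\x5C";
printf "%d bytes, last byte 0x%02X\n",
    length($by_escape), ord(substr($by_escape, -1));
# prints "2 bytes, last byte 0x5C"

# Single-quoted heredoc: the body is taken literally, so a line that
# ends in byte 0x5C is stored as is.
my $by_heredoc = <<'EOS';
ends with a backslash \
EOS
chomp $by_heredoc;
printf "heredoc last byte 0x%02X\n", ord(substr($by_heredoc, -1));
# prints "heredoc last byte 0x5C"
```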