Hello, I'm working on files where words have been tagged for part-of-speech categories in French. This is done automatically and the tagging software often makes predictable mistakes which can be corrected together. The word "en" can be a pronoun or a preposition. I'm trying to put together a script to search for and replace this (tab separated):

Code

en PRO:PER en

with this

Code

en PRP en

when the first word on the next line is a present participle (in French these end in -ant). For example:

Code

en PRO:PER en
vieillissant VER:ppre vieillir

needs to become

Code

en PRP en
vieillissant VER:ppre vieillir

Here's the bit of script I've been trying to use:

Code

# get arguments off the command line
($pattern1, $pattern2, $input_file, $output_file) = @ARGV;

# open input file
open $IN, "<:encoding(utf-8)", $input_file or die "unable to open $input_file for reading!\n";

# open output file
open $OUT, ">:encoding(utf-8)", $output_file or die "unable to open $output_file for writing!\n";

# loop over lines
while ($line = <$IN>) {
    # test to see whether the line matches the pattern,
    # also setting up a backreference
    # replace pattern1 with pattern2
    if ($line =~ /$pattern1/x) {
        $line =~ s/($pattern1)/$pattern2/gx;
        # write the current line number and the line with the match highlighted to the output file
        print $OUT $.,": ",$line;
    }
}

# close all filehandles
close $IN;
close $OUT;

This works fine with simple terms, but when I enter my search text "pattern1" as a regex, it does not recognise it. Here's the syntax I've been using:

Code

pattern1 en\tPRO:PER\ten\n(?!.*ant\b)
pattern2 en\tPRP\ten\n

I'm fairly new to this, so I'm probably doing something very obviously wrong. Any help would be much appreciated.

G. R.

Some possible clues. If your file is not too large, you may slurp it into a variable and then use your regex in multiline mode.
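To illustrate the slurp idea, here is a minimal sketch rather than a finished script. It assumes the rule described in the question (retag "en" only when the first field of the next line ends in -ant), so it uses a positive look-ahead. The sample data and the variable names ($sample, $text, $fixed) are made up for the example, and an in-memory filehandle stands in for the real input file.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the real tab-separated file.
my $sample = join '',
    "en\tPRO:PER\ten\n",
    "vieillissant\tVER:ppre\tvieillir\n",
    "il\tPRO:PER\til\n",
    "en\tPRO:PER\ten\n",
    "parle\tVER:pres\tparler\n";

# In-memory filehandle; a real script would open the input file here.
open my $IN, '<', \$sample or die "cannot open in-memory sample: $!";

# Slurp mode: with $/ undefined, <$IN> returns the whole file as one string.
my $text = do { local $/; <$IN> };
close $IN;

# /m lets ^ match at the start of every line; /g replaces every occurrence.
# (?= ... ) checks the first field of the next line without consuming it.
(my $fixed = $text) =~ s/^en\tPRO:PER\ten\n(?=[^\t\n]*ant\t)/en\tPRP\ten\n/mg;

print $fixed;
```

In this sample only the first "en" is retagged, because only that one is followed by a line whose first field ends in -ant.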

Another possibility is to read line by line and always keep a sliding window or buffer of two lines in memory and to have two separate regexes, one for the first line and one for the second.
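The two-line buffer could look roughly like the sketch below. It is only one way to do it, again assuming the rule from the question; the sample data and the names $prev and @out are invented for the example, and the output is collected in an array instead of being written to a file.

```perl
use strict;
use warnings;

# Hypothetical input; a real script would read from the opened input file.
my $sample = "en\tPRO:PER\ten\nvieillissant\tVER:ppre\tvieillir\n"
           . "en\tPRO:PER\ten\nparle\tVER:pres\tparler\n";
open my $IN, '<', \$sample or die "cannot open in-memory sample: $!";

my @out;     # collected output lines
my $prev;    # the line held back until its successor has been read

while (my $line = <$IN>) {
    if (defined $prev) {
        # First regex tests the held line, second regex tests the
        # first field of the line that follows it.
        if ($prev =~ /^en\tPRO:PER\ten$/ && $line =~ /^[^\t\n]*ant\t/) {
            $prev =~ s/PRO:PER/PRP/;
        }
        push @out, $prev;
    }
    $prev = $line;
}
push @out, $prev if defined $prev;   # flush the final line
close $IN;

print @out;
```

This never holds more than two lines in memory, so it should also work for files too large to slurp.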

In either case, I would recommend separating the problems: start testing with actual regexes hard-coded in your program and, only once this works properly, start passing regexes as command-line arguments. That will probably create other problems, which you can solve as a second step.
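As a rough illustration of why the two steps are worth separating, the sketch below runs the same substitution twice: once hard-coded, once with the pattern and replacement held as plain strings, the way they would arrive from @ARGV. The sample text and variable names are hypothetical. The point it shows is that escapes such as \t inside the pattern are understood by the regex engine, while the same escapes inside a replacement string are inserted as literal backslash sequences.

```perl
use strict;
use warnings;

my $text = "en\tPRO:PER\ten\nvieillissant\tVER:ppre\tvieillir\n";

# Step 1: pattern and replacement hard-coded in the program.
(my $step1 = $text) =~ s/en\tPRO:PER\ten\n(?=[^\t\n]*ant\t)/en\tPRP\ten\n/g;

# Step 2: the same values as plain strings (single quotes keep the
# backslashes, just as the shell would hand them over in @ARGV).
my ($pattern1, $pattern2) =
    ('en\tPRO:PER\ten\n(?=[^\t\n]*ant\t)', 'en\tPRP\ten\n');

my $re = qr/$pattern1/;                     # \t and \n are handled by the regex engine
(my $step2 = $text) =~ s/$re/$pattern2/g;   # replacement is inserted verbatim

print "step 1 matches the intended output\n"
    if $step1 eq "en\tPRP\ten\nvieillissant\tVER:ppre\tvieillir\n";
print "step 2 contains a literal backslash-t\n"
    if index($step2, '\t') >= 0;
```

So the pattern can be passed on the command line more or less as is, but the replacement would need its escapes translated (or real tabs passed in) before use.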

You have improved the style of your code, but you still have not addressed the main issue which Laurent pointed out in his early replies. Your look-ahead assertion will never match. The diamond (<>) operator, by default, reads up to the next newline. (That newline is the last character of the string in $_.) Your substitution will work everywhere that you intend, but it will also be made in places where you expect the assertion to prevent it.

Laurent suggested that you use slurp mode (refer to $INPUT_RECORD_SEPARATOR in perldoc perlvar). Using a regex on a multi-line string introduces two new issues: do you want a dot to match a newline character, and do you want the anchors (^ and $) to refer to the start/end of the line or of the string? Use the /s and /m switches on the substitute command to specify your requirements (refer to s/PATTERN/REPLACEMENT/ in perldoc perlop).

Good Luck, Bill
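A small sketch of the points above, using made-up strings. The first part shows what a look-ahead sees when the string is a single line as returned by <>; the second part shows what /s and /m change on a multi-line string. All variable names are invented for the example.

```perl
use strict;
use warnings;

# One line as read by <>: the newline is the last character.
my $line = "en\tPRO:PER\ten\n";

# Nothing follows the newline, so there is nothing for a look-ahead to inspect.
my $pos_hit = $line =~ /en\tPRO:PER\ten\n(?=.*ant\b)/ ? 1 : 0;   # 0: positive look-ahead fails
my $neg_hit = $line =~ /en\tPRO:PER\ten\n(?!.*ant\b)/ ? 1 : 0;   # 1: negative look-ahead always succeeds

# A multi-line string, as obtained in slurp mode.
my $text = "a\nb\n";

my $dot_plain   = $text =~ /a.b/  ? 1 : 0;   # 0: dot does not match a newline
my $dot_s       = $text =~ /a.b/s ? 1 : 0;   # 1: /s lets dot match a newline
my $caret_plain = $text =~ /^b/   ? 1 : 0;   # 0: ^ matches only at the start of the string
my $caret_m     = $text =~ /^b/m  ? 1 : 0;   # 1: /m lets ^ match after each newline

print "$pos_hit $neg_hit $dot_plain $dot_s $caret_plain $caret_m\n";
```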

Thanks for your answer, and for pointing out the problems in my code. With help from the web I've managed to cobble together two options which appear to work as intended. (I tried the /s and /m switches, but this seems to do the job.) The first uses Path::Tiny and writes directly into the file, the second uses a subroutine to read in the whole file, then writes output into another file. I'm not sure which is advisable... and I still ultimately need to set up the different terms as command-line arguments.

Both slurping options are correct. Each has a place. I use option 2 when I need a quick answer. Option 1 is probably better for production work.

You still do not seem to understand how your regex works on a multi-line string. In your case, the /m option does not make any difference because you do not use the anchors that it affects. The /s does make a difference, especially with the greedy match (.*) in your assertion. My best guess is that you need a non-greedy (.*?) match and the /s, but you know your data better than I.
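One way to check this on a sample is sketched below, with invented data in which "en" is followed by a finite verb and a participle appears only two lines later. In this particular sample, /s lets both the greedy and the non-greedy dot reach the later line, so a character class that cannot cross a tab or newline is shown as a possible alternative. Whether that suits the real data is for the original poster to judge.

```perl
use strict;
use warnings;

# Hypothetical sample: the word right after "en" is not a participle,
# but a participle occurs two lines further down.
my $text = "en\tPRO:PER\ten\nparle\tVER:pres\tparler\nchantant\tVER:ppre\tchanter\n";

# With /s the dot crosses newlines, so the look-ahead can see later lines.
my $greedy_s = $text =~ /en\tPRO:PER\ten\n(?=.*ant\b)/s  ? 1 : 0;
my $lazy_s   = $text =~ /en\tPRO:PER\ten\n(?=.*?ant\b)/s ? 1 : 0;

# A class excluding tab and newline confines the test to the
# first field of the very next line.
my $confined = $text =~ /en\tPRO:PER\ten\n(?=[^\t\n]*ant\t)/ ? 1 : 0;

print "greedy with /s: $greedy_s\n";
print "lazy with /s:   $lazy_s\n";
print "confined class: $confined\n";
```

Here both /s variants report a match even though the next word is "parle", while the confined version does not.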

It is probably worth the effort to prepare a sample of fake test data which contains only special cases. Remember that a successful test is the one that finds an error.

Good Luck, Bill
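A test sample of that kind might be organised like the sketch below. The helper name retag, the case list and the expected outcomes are all assumptions made for illustration, based on the rule stated in the question; the substitution inside retag is just one candidate to be tested.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution under test.
sub retag {
    my ($text) = @_;
    $text =~ s/^en\tPRO:PER\ten\n(?=[^\t\n]*ant\t)/en\tPRP\ten\n/mg;
    return $text;
}

# Each case: [description, input, whether the tag is expected to change].
my @cases = (
    ['participle on next line',  "en\tPRO:PER\ten\nvieillissant\tVER:ppre\tvieillir\n", 1],
    ['finite verb on next line', "en\tPRO:PER\ten\nparle\tVER:pres\tparler\n",           0],
    ['en on the last line',      "en\tPRO:PER\ten\n",                                     0],
    ['participle two lines on',  "en\tPRO:PER\ten\nparle\tVER:pres\tparler\nchantant\tVER:ppre\tchanter\n", 0],
);

my @failed;
for my $case (@cases) {
    my ($name, $input, $expect_change) = @$case;
    my $changed = retag($input) ne $input ? 1 : 0;
    push @failed, $name if $changed != $expect_change;
}

print @failed ? "failed: @failed\n" : "all cases behaved as expected\n";
```

Further cases worth considering might include a noun ending in -ant (for example "enfant") on the next line, where testing the VER:ppre tag instead of the spelling could give a different answer.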