Batch Search and Replace and Selective Pretranslation in OmegaT

In this post I want to share three scripts that can do an extended search and replace in an OmegaT project. The search and replace templates for each script are specified in external plain text files stored inside the project folder, so the same scripts can be used without any modification for different projects with different sets of search and replace patterns: the user only needs to edit those plain text files. On top of text modification, the scripts can do simple math on what they find, which lets the user set up a per-project unit converter.
Each script should be accompanied by its own external file located in a subfolder named .ini in the project’s root (details under each script further on). The format of these files is the same for all three:

Only one empty line in the file — the very last one

Each line consists of three blocks:

Search pattern (regex aware)

Tab

Replace pattern

So, if you need to replace “Владимир Владимирович” (taking into consideration the different cases of Russian nouns) with “the President of Russian Federation”, here’s what you need to specify in the substitution file:
Владимир\p{L}?+\sВладимирович\p{L}?+	the President of Russian Federation
If you need to convert miles into kilometers, here’s what you specify:
(\d+)(\s?)mile(s?)	${(it[1] as int) * 1.6}$2km
Likewise, if you want to convert Fahrenheit into Centigrade, here’s the line:
(\d+)°F	${((it[1] as int) - 32 ) * 5 / 9 }°C
To make sure the abbreviation “EU” is always spelled in uppercase, you can put either
\b([Ee][Uu])\b	${(it[1] as String).toUpperCase()}
or
\b([Ee][Uu])\b	EU
There’s a TAB character in each of the above examples. It separates the search pattern from the replace pattern.
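To make the file format concrete, here is a minimal sketch of how such a file could be read and applied. It is written in Java rather than Groovy (both use the same java.util.regex engine), it is not the code of the actual scripts, and the names RuleFileSketch, parse and apply are made up for the illustration. It only handles plain templates with $1, $2 group references; the ${...} expressions from the examples above need Groovy to be evaluated and are not covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not the code of the OmegaT scripts.
final class RuleFileSketch {
    // Splits every line at its first TAB into {searchPattern, replaceTemplate}.
    // A line without a TAB (for example a stray empty line) is rejected.
    static List<String[]> parse(String fileContent) {
        List<String[]> rules = new ArrayList<>();
        // split() drops the single empty string produced by the final line break
        for (String line : fileContent.split("\\R")) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("No TAB in line: \"" + line + "\"");
            }
            rules.add(new String[] {line.substring(0, tab), line.substring(tab + 1)});
        }
        return rules;
    }

    // Applies the rules one after another, in the order they appear in the file.
    static String apply(List<String[]> rules, String text) {
        for (String[] rule : rules) {
            text = text.replaceAll(rule[0], rule[1]);
        }
        return text;
    }
}
```

With the single rule \b([Ee][Uu])\b, a TAB and EU, the text “the eu and Eu rules” comes out as “the EU and EU rules”.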

So, now to the scripts themselves. As usual, each heading is a link to pastebin.com where you can download the scripts, and under each heading there is a listing of the script.

This script performs a global search and replace in the whole project. The file where search and replace patterns are specified should be named search_replace.ini. The sequence in which these patterns are listed matters. You should back up your project before using the script, as there’s no way to revert the changes.

The external file for this script should be named pretranslate.ini. This script works only on segments that don’t have a translation yet. If the source segment contains a match for any search pattern, the target gets populated with the source text after all possible substitutions have been applied, prefixed with {PRETRAN}. Segments where nothing was found are left intact.
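The decision described above can be sketched as follows. This is a simplified Java illustration under my own assumptions, not the script itself: PretranslateSketch and pretranslate are hypothetical names, rules are passed as {searchPattern, replaceTemplate} pairs, and each rule is assumed to work on the result of the previous one. A return value of null stands for “leave the segment intact”.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration, not the code of the OmegaT script.
final class PretranslateSketch {
    static final String PREFIX = "{PRETRAN}";

    // Returns the proposed target for an untranslated segment,
    // or null when none of the search patterns matches.
    static String pretranslate(String source, List<String[]> rules) {
        String result = source;
        boolean found = false;
        for (String[] rule : rules) {
            Pattern search = Pattern.compile(rule[0]);
            if (search.matcher(result).find()) {
                found = true;
                result = search.matcher(result).replaceAll(rule[1]);
            }
        }
        return found ? PREFIX + result : null;
    }
}
```

For a rule that replaces colour with color, the source “My colour” gives “{PRETRAN}My color”, while a source without any match gives null.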

UPDATE: This script was updated on October 15, 2013 (the listings here and on pastebin.com are up to date; you may need to update your local copy).

The update includes a new location for segment_substitution.ini and new behavior when the script is invoked while some text is selected: the selection gets inserted into the target at the cursor position after all the possible transformations. If no transformations are possible, the text is inserted as is, which might be a nifty way to insert things without changing your clipboard content.

Big thanks go to Yu Tang, who helped me figure out how to make math possible in these scripts.

24 thoughts on “Batch Search and Replace and Selective Pretranslation in OmegaT”

Here’s a short script to make a tab-separated glossary into a pretranslate.ini file, which will only search for whole words/phrases. It’s sorted from longest to shortest entries in the source language.

Thanks for good suggestions, Steve. I think I’ll manage to make it possible to limit the script’s functionality to current/selected file, but not being a programmer and trying to avoid scripting as much as I can, I can tell you right away there won’t be a “cancel” button any time soon. Somehow GUI programming with Swing isn’t the easiest thing for my feeble mind. I’ll email you when I have updated the script.

A question: when you are running replace_with_template and more than one replacement string applies to the same selected text… how does the script manage that? Does it apply only the first replacement it finds in segment_substitution? Does it let you select which replacement to use? I am sorry to ask, but my ability to construe Groovy statements is limited 🙂

The script reads the search and replace templates (or expressions) defined in segment_substitution.ini one by one, in the order they’re listed, and applies each as soon as a match is found. A template is applied to all of its matches within the segment at once. This means that if the same search expression is listed more than once, it will most likely be applied only once, the very first time it’s found, unless the replacements done in between produce a new match for it. Matching also works on substrings, so you have to be rather careful about tailoring segment_substitution.ini, and usually it’s a good idea to list longer search expressions first.
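A small Java sketch shows why the order matters. It assumes nothing about the script beyond rules being applied one after another; RuleOrderSketch and the car/carpet rules are invented for the example.

```java
// Hypothetical illustration of sequential rule application.
final class RuleOrderSketch {
    // Each rule is {searchPattern, replaceTemplate}; rules run in array order.
    static String apply(String[][] rules, String text) {
        for (String[] rule : rules) {
            text = text.replaceAll(rule[0], rule[1]);
        }
        return text;
    }

    public static void main(String[] args) {
        String[][] shortFirst = {{"car", "auto"}, {"carpet", "rug"}};
        String[][] longFirst = {{"carpet", "rug"}, {"car", "auto"}};
        // "car" also matches inside "carpet", so the second rule never fires
        System.out.println(apply(shortFirst, "a car on a carpet")); // a auto on a autopet
        // with the longer expression first, both rules do their job
        System.out.println(apply(longFirst, "a car on a carpet"));  // a auto on a rug
    }
}
```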

I hope it answers your question, but if not, I’d be happy to try to clarify more.

Thank you, it does answer my question. I am glad the script is smart enough to apply changes recursively. Kudos to you and Yu for that! I will take note of your suggestion of listing longer search expressions first. Cheers!

([a-zA-Z]+)\s(\d+),\s(\d+) $2 de $1 de $3
January enero
February febrero
March marzo
April abril
May mayo
June junio
July julio
August agosto
September septiembre
October octubre
November noviembre
December diciembre

Hi Hector, usually this error happens when there’s an extra empty line in segment_substitution. There shouldn’t be any empty lines except the last one. If you’re sure you don’t have any empty lines, and still get this error, you can send me or share somewhere your segment_substitution.ini and source+target text, I’ll take a look.

Maybe different OSes treat the end of the file differently. Or maybe it’s something that changed with the updated Groovy scripting engine in OmegaT. Not being a programmer, I don’t even know how to investigate it, nor do I feel inclined to.

Hi Kos, Thank you for sharing the script, it’s a real life saver!
Now I can remove leading and trailing spaces, double spaces, and capitalize all words at the start of a sentence – all at once. This will save me a lot of time.

Hi Kos, thank you for sharing the scripts.
There is only one prob here:
The third script works well except that the replacement always goes to the beginning of the current segment no matter where the cursor is.

There’s very little I can do about it. Of course, the script could record where the cursor is and restore that position once the target has been replaced with the text that went through all the substitutions, but after all the changes the script makes, the recorded position probably won’t point to the same place in the text as when the script was invoked.

I use this script all the time to clean up translated segments (all the capitalization and everything) and it saves hours. There is one “bug” I can’t figure out how to fix: it doesn’t seem to work with dollar signs ($).
For example, to batch replace anything like “$ 100” (which the Google Translate API keeps putting in) with “$100”, I tried this:
\$\s → $
This seems to make it terminate. Do you think there is an easy way to fix this? I looked at the code but couldn’t figure it out. Thx.

I have a related question. I use this script to do all the sentence capitalization. And, please don’t laugh at my noobness, but this is my statement:
\.\sa → . A
\.\sb → . B
\.\sc → . C
So there are 26 lines, one for each letter of the alphabet. I have a sneaking suspicion that they can all be combined into one statement. I tried some different ways, but having such limited knowledge I couldn’t figure it out. Am I doing this wrong?

It’s not wrong per se, as it works and gets the thing done. But you are right, it can be combined into one neat line:
\.\s(\p{Ll}) → . ${((it[1] as String).toUpperCase())}
Works with any alphabet where there’s a concept of lower and upper cases.
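For readers who want to try the same idea outside OmegaT, this is roughly what that one-liner does, sketched in Java (Java 9 or later). SentenceCapsSketch and capitalize are hypothetical names; the pattern is the one from the line above.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical illustration of the capitalization rule.
final class SentenceCapsSketch {
    // A period, one whitespace character, then any lowercase letter (\p{Ll})
    private static final Pattern AFTER_PERIOD = Pattern.compile("\\.\\s(\\p{Ll})");

    static String capitalize(String text) {
        // The captured letter is upper-cased; the whitespace becomes a plain space,
        // just as in the ". " of the replacement template
        return AFTER_PERIOD.matcher(text)
                .replaceAll(m -> ". " + m.group(1).toUpperCase(Locale.ROOT));
    }
}
```

Calling capitalize on “first. second. third” gives “first. Second. Third”. Because \p{Ll} is a Unicode category, the same code also handles Cyrillic or Greek lowercase letters.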

Hi Kos,
No wonder I couldn’t figure it out – look at that thing! I will give that a shot, and look at it more closely so it can also be a learning opportunity for me. Thanks so much for your advice and help! I think your blog is great and always look forward to seeing new recipes – please keep it up. Best regards.

Yeah, it looks a bit complicated because the script is written in such a way that it can execute subscripts defined on each line, as in the one above.
By using () in the search we can capture and group things, and then refer to them in the replacement. The first group is referred to by $1, the second — by $2 and so on.
So we can search for
[uU](ncle) [sS](am)
and replace with
U$1 S$2
to make sure “Uncle Sam” is always capitalized.
To run subscripts, we have to use ${} (that’s where the subscript is going to be run) and “it” (in this case it’s the result of the search). So, it[1] is basically $1 for the subscript, but in the subscript it can be declared as any type, not only String, which allows you to do unit conversions or almost anything else Groovy is capable of.
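The difference between a plain group reference and a computed replacement can be sketched in Java like this (Java 9 or later). It is only an analogy for what the ${} subscripts do, not how the script evaluates them, and GroupsAndMathSketch, uncleSam and milesToKm are names invented for the example.

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of group references and computed replacements.
final class GroupsAndMathSketch {
    // Plain template: the regex engine fills in $1 and $2 by itself.
    static String uncleSam(String text) {
        return text.replaceAll("[uU](ncle) [sS](am)", "U$1 S$2");
    }

    // Computed replacement: group 1 is treated as a number, not as text.
    static String milesToKm(String text) {
        Pattern miles = Pattern.compile("(\\d+)(\\s?)mile(s?)");
        return miles.matcher(text).replaceAll(m -> {
            BigDecimal km = new BigDecimal(m.group(1)).multiply(new BigDecimal("1.6"));
            // quoteReplacement keeps the computed text from being read as a template
            return Matcher.quoteReplacement(km.toPlainString() + m.group(2) + "km");
        });
    }
}
```

Here uncleSam turns “uncle sam” into “Uncle Sam”, and milesToKm turns “10 miles” into “16.0 km”.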

My favorite recipe stopped working today when I upgraded to OmegaT 4.0.0. I get an ever-spinning cursor, no response, and have to use the task manager to kill the program. If you update to 4.0 and solve this mystery, please post the solution. Thanks, as always, for the great recipes.