I would like to do a ‘search and replace’ on about 100 data files with unstructured text. Each of the files is about 50 MB in size and 50k lines long. There are about 20k terms to search and replace, stored in another file, 'terms_list.csv': a CSV file with three columns, COL1, COL2, and COL3. I need to search each of the 100 data files for the words in COL1 and COL2 of 'terms_list.csv' and, if either word is found, replace it with the corresponding word in COL3.

With my basic knowledge of shell scripting, I wrote the following shell script using AWK/sed in a loop. It reads the 20k lines of ‘terms_list.csv’ one by one, searches for COL1 and COL2 in each of the 100 files, and, if either is found, replaces it with COL3.
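The script itself is not reproduced in this excerpt; based on the comments it received below (three awk calls per term line, plus a sed -i per term line), it likely resembled the following sketch, shown here with tiny inline sample inputs so it can run standalone:

```shell
#!/usr/bin/env bash
# Hedged reconstruction of the original (slow) approach: three awk forks
# per term line, plus one full sed -i rewrite of every data file per term
# line -- roughly 20k * 100 whole-file rewrites in the real workload.

# Tiny sample inputs so the sketch is self-contained:
printf '%s\n' 'Mangifera indica,M. indica,Mangiferaindica' > terms_list.csv
printf '%s\n' 'ID000001,Mangifera indica is a mango. M. indica is popular.' > text_data_file_0001.csv

while read -r LINE; do
    x=$(echo "$LINE" | awk -F, '{print $1}')   # TermFull
    y=$(echo "$LINE" | awk -F, '{print $2}')   # TermAbbreviated
    z=$(echo "$LINE" | awk -F, '{print $3}')   # TermJoined
    for DATA_FILE in text_data_file_*.csv; do
        # One full rewrite of the file per term line: the main bottleneck.
        # Note the unescaped '.' in "M. indica" is treated as a regex
        # metachar here -- one of the bugs discussed in the comments.
        sed -i "s/$x/$z/g; s/$y/$z/g" "$DATA_FILE"
    done
done < terms_list.csv

cat text_data_file_0001.csv
```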

I am positive there is better/more efficient code than the above to accomplish this task, as it requires a lot of disk reads/writes. Any suggestions for improvement? If there are better tools (Perl/Python) for this task, could you please give me some suggestions/directions to look at?

Below is sample data for both files:

‘text_data_file_0001.csv’: one of the 100 data files; it contains unstructured data as below, with ‘TermFull’ and ‘TermAbbreviated’ among the text. [Each file is about 50 MB and 50k lines.]

ID000001,Mangifera indica, commonly known as mango, is a species of flowering plant in the sumac and poison ivy family Anacardiaceae. M. indica is a popular fruit in India.
ID000002,Oryza sativa, commonly known as Asian rice, is the plant species most commonly referred to in English as rice. O. sativa contains two major subspecies: the sticky, short-grained japonica or sinica variety, and the nonsticky, long-grained indica rice variety.

'terms_list.csv' file: the search terms 'TermFull' and 'TermAbbreviated' and the replacement term 'TermJoined' are stored in ‘terms_list.csv’, which contains 20k lines as below:

Mangifera indica,M. indica,Mangiferaindica
Oryza sativa,O. sativa,Oryzasativa

Required output file ‘text_data_file_0001.csv’ is as below, with ‘TermFull’ and ‘TermAbbreviated’ replaced by ‘TermJoined’:

ID000001,Mangiferaindica, commonly known as mango, is a species of flowering plant in the sumac and poison ivy family Anacardiaceae. Mangiferaindica is a popular fruit in India.
ID000002,Oryzasativa, commonly known as Asian rice, is the plant species most commonly referred to in English as rice. Oryzasativa contains two major subspecies: the sticky, short-grained japonica or sinica variety, and the nonsticky, long-grained indica rice variety.

Whenever you're running a while read loop (or, really, any other loop) in shell, if you care about efficiency and are iterating over lots of items or lines, you never want to fork or use any external tool inside that loop. That means no $(...), no awk, no pipelines, no sed. (Running a single copy of awk that loops over all your content, as in Ed's answer, is great; running a separate copy of awk per line of content is horrible; a separate copy per column in that line is even worse.)
– Charles Duffy May 15 at 19:46

BTW, while IFS=, read -r x y z _; do would avoid the need for those three awk invocations, though it would still leave you with a sed -i per line, which is still a serious inefficiency.
– Charles Duffy May 15 at 19:49
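Charles Duffy's suggestion can be sketched as follows: the shell's own read builtin splits each CSV line on commas, so no external process is forked per line (the sample input is inline so the sketch runs standalone):

```shell
#!/usr/bin/env bash
# Splitting a CSV line with the read builtin instead of three awk forks.
printf '%s\n' 'Oryza sativa,O. sativa,Oryzasativa' > terms_list.csv

# IFS=, applies only to this read; _ soaks up any trailing fields.
while IFS=, read -r x y z _; do
    printf 'x: %s | y: %s | z: %s\n' "$x" "$y" "$z"
done < terms_list.csv
```

The per-line sed -i that follows in the original loop would remain, so this alone fixes the forking, not the repeated whole-file rewrites.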

Thanks @CharlesDuffy. I am learning a lot reading and trying to digest your recommendations and Ed's solution. I need to work on my rudimentary scripting skills.
– Insilico May 15 at 20:00

As another aside, echo "File: $DATA_FILE x: $x | y: $y | z: $z" -- passing everything to echo as a single string -- is actually less buggy than the current code; see BashPitfalls #14 or the shellcheck.net wiki page SC2086 (btw, running your code through shellcheck before asking SO questions is always a good idea).
– Charles Duffy May 15 at 20:04
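The quoting point Charles references can be seen with a tiny sketch (the variable name is illustrative): an unquoted expansion undergoes word splitting, so internal whitespace in the value is mangled.

```shell
#!/usr/bin/env bash
# Word splitting on unquoted expansions (ShellCheck SC2086 / BashPitfalls #14).
y='M.  indica'                 # note the doubled space inside the value

unquoted=$(echo $y)            # split into words, rejoined by echo with single spaces
quoted=$(echo "$y")            # passed to echo as one intact argument

printf 'unquoted: <%s>\n' "$unquoted"
printf 'quoted:   <%s>\n' "$quoted"
```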

There is a case that still isn't covered by this: if the input CSV file contains commas within one of the fields, you have to resort to a CSV parser for pre-processing. Commas are ignored when escaping regex metacharacters in the third solution.

I actually just want the OP to be comparing apples to apples when she looks at the scripts, and to have something that works somewhat robustly in both cases. She definitely has RE metachars in her terms_list.csv, as we can see '.'s in there, and I'm sure she'd want them treated literally; I'm also fairly sure she'd have no idea they weren't being treated literally by a script if it wasn't pointed out :-)!
– Ed Morton May 16 at 13:34

@EdMorton your point was spot on. However, I am curious: how did you arrive at the conclusion that I am female or male? ;)
– Insilico yesterday

Massage to suit. It looks like your terms_list file might contain RE metachars, for example, so you should think about whether or not you want them used in a regexp (as you were doing in sed, and as we are doing above with gsub()), or whether to use a string operation with index() and substr() instead of gsub(). Also think about how you want to handle partial matches, and/or cases where the first replacement creates a term that didn't previously exist and is then matched by the second replacement, and so on.

I suspect something like this (untested) would be both flexible and robust enough for your needs (it'll certainly be far more robust than the sed script you were running, and literally orders of magnitude faster than your shell loop + sed approach):
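Ed's script isn't reproduced in this excerpt; the following is a hedged, portable sketch of the same single-pass idea: load all terms into memory once, escape regexp metachars so they match literally, then make one pass over each data file. GNU awk's \< \> word-boundary anchors, which Ed's version adds around each term, are left out here for portability and noted in a comment.

```shell
#!/usr/bin/env bash
# Hedged sketch of the single-pass awk idea, with inline sample inputs.
printf '%s\n' 'Mangifera indica,M. indica,Mangiferaindica' > terms_list.csv
printf '%s\n' 'ID000001,Mangifera indica is a mango. M. indica is popular.' > text_data_file_0001.csv

result=$(awk -F, '
NR == FNR {                       # first file: build the old -> new map
    old1 = $1; old2 = $2; new = $3
    gsub(/[^a-zA-Z0-9 ]/, "\\\\&", old1)   # escape metachars: match literally
    gsub(/[^a-zA-Z0-9 ]/, "\\\\&", old2)
    gsub(/&/, "\\\\&", new)       # & in a gsub replacement means "the match"
    # With GNU awk, wrapping each key as "\\<" old1 "\\>" would also
    # prevent partial-word matches, as recommended above.
    map[old1] = new
    map[old2] = new
    next
}
{                                 # remaining files: one pass, all terms
    for (re in map) gsub(re, map[re])
    print
}' terms_list.csv text_data_file_0001.csv)

printf '%s\n' "$result"
```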

The gsub()s escape any metachars so every char in the original and replacement text gets treated literally, and then we add word boundaries around the whole of every original term to avoid partial matches.

Thank you Ed Morton for your expert guidance. I am trying to understand the code and run it on my data. As I said to Charles above, I need to work on my rudimentary scripting skills.
– Insilico May 15 at 20:04

You're welcome, feel free to ask questions (after a glance at the awk man page of course :-) ). The gsub()s at the start are explained where I set escOld and escNew in my answer at stackoverflow.com/a/29626460/1745001 (there it's sed but it's the same concept of turning RE metachars into literal chars).
– Ed Morton May 15 at 20:04

After a bit of trial and error, I liked the approach by BenjaminW. However, I also adapted the approach for Perl and found that Perl (v5.22.1) outperformed sed (GNU sed 4.2.2) in this case. (With very similar code -- a ';' is needed at the end of each pattern -- Perl was roughly three times faster than sed; please see the time command output. The file outputs had similar word-count results.) I have posted the sed and Perl code below, which worked for my current need.
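The actual sed and Perl scripts aren't included in this excerpt; below is a hedged sketch of the generate-once, run-once-per-file pattern (the file names replace.sed and replace.pl are illustrative). The Perl program is simply the sed script with a ';' appended to each s///g command, run once per data file with perl -i -p.

```shell
#!/usr/bin/env bash
# Sketch: turn terms_list.csv into a sed script, derive a perl script from
# it, then make a single in-place pass per data file. Inline sample inputs.
printf '%s\n' 'Mangifera indica,M. indica,Mangiferaindica' > terms_list.csv
printf '%s\n' 'ID000001,Mangifera indica is a mango. M. indica is popular.' > text_data_file_0001.csv

# One substitution command per term, escaping metachars such as '.'
awk -F, '{
    old1 = $1; old2 = $2; new = $3
    gsub(/[^a-zA-Z0-9 ]/, "\\\\&", old1)
    gsub(/[^a-zA-Z0-9 ]/, "\\\\&", old2)
    printf "s/%s/%s/g\n", old1, new
    printf "s/%s/%s/g\n", old2, new
}' terms_list.csv > replace.sed

sed 's/$/;/' replace.sed > replace.pl   # perl needs ';' after each s///g

# sed version (one pass per file), applied to a copy for comparison:
cp text_data_file_0001.csv sed_out.csv
sed -i -f replace.sed sed_out.csv

# perl version (one pass per file), in place:
perl -i -p replace.pl text_data_file_0001.csv
```

Both passes produce identical output on the sample; on the real 20k-term list the single pass per file is what removes the per-term whole-file rewrites.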

@EdMorton, @CharlesDuffy, @BenjaminW, thanks again for your comments and solutions. The info you provided was immensely useful, and I have learned a lot in the past week. I have taken your suggestions and summarized/documented them below for naive coders like me.

Thanks @EdMorton. Be careful about the metacharacters in the substitution pattern! I have '.' in my data, and in a regexp it matches any single character; it needs to be escaped with a backslash.

Thanks to @CharlesDuffy: when iterating over lots of items or lines with any loop in shell, using any external tool inside that loop will considerably decrease efficiency. The new code below is highly efficient compared to the code above.

@EdMorton I accepted Benjamin's answer first, but I think I misunderstood his comment to add it as a new answer. Anyway, I have accepted his post as the accepted answer. BTW, this is my first question and I am not familiar with the etiquette; I should probably pay more attention and read some SO guidelines (if available).
– Insilico 22 hours ago

No no, I meant that if you have your own, separate solution (like the Perl one), it's better to add it as an answer instead of making it part of the question. The question shouldn't contain solutions.
– Benjamin W. 22 hours ago