Search and Replace

Search and Replace is one of the most common operations that computers
automate. Searching and replacing across hundreds, thousands or even millions
of documents just cannot be done by a human. The time required, the errors
introduced and the replacements missed make it essential to find a set of tools
to automate the whole process.

Commonly an 'exact search' is required, whereby the text is found exactly as
entered. However, to search for special characters such as carriage returns,
line feeds and tabs, special 'escape codes' such as \r, \n and \t must be
allowed, and these are interpreted specially. A literal backslash is entered
as \\.
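
As a rough illustration of the escape codes described above, the following Python sketch translates \r, \n, \t and \\ into the characters they stand for and then performs an exact (literal) replace. The names ESCAPES, unescape and exact_replace are invented for this example and do not come from any particular tool.

```python
# Hypothetical table of escape codes and the characters they represent.
ESCAPES = {"r": "\r", "n": "\n", "t": "\t", "\\": "\\"}


def unescape(text: str) -> str:
    """Turn escape codes such as \\t into the characters they represent."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        # A backslash followed by a known code consumes two characters.
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in ESCAPES:
            out.append(ESCAPES[text[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def exact_replace(data: str, search: str, replacement: str) -> str:
    """Replace every literal occurrence of search after decoding escapes."""
    return data.replace(unescape(search), unescape(replacement))


# Replace each tab with a comma and a space; the line ending is untouched.
result = exact_replace("name\tvalue\r\n", r"\t", ", ")
```

Here the search text is typed as the two characters backslash and t, and only the decoded tab character is matched in the data.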

Very often the found text has to be replaced in a different arrangement, for
example by re-arranging it or substituting other text. A very common example is
reformatting dates from US to EU format: mm/dd/yyyy to dd-mm-yyyy. To do this,
parts of the found text have to be 'captured' and then substituted into the
replacement string. Usually the fragments of captured text are stored in
'macros' or 'variables' named $1, $2, $3 etc, with $0 representing the entire
match.
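
The date example can be sketched with Python's standard re module. Note that Python writes captured groups as \1, \2, \3 (and \g<0> for the whole match) rather than $1, $2, $3; the pattern and the function name us_to_eu are assumptions made for this illustration, and the pattern does not check that the month and day are valid.

```python
import re

# Three captured groups: month, day and year of a mm/dd/yyyy date.
US_DATE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")


def us_to_eu(text: str) -> str:
    """Rewrite mm/dd/yyyy dates as dd-mm-yyyy using captured groups."""
    # \2 is the day, \1 the month, \3 the year.
    return US_DATE.sub(r"\2-\1-\3", text)


converted = us_to_eu("Invoice dated 12/31/2023")
# converted is "Invoice dated 31-12-2023"
```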

Search and Replace Tools

Binary Data Formats

The replace tools used differ depending on the format of the files being
processed. Often, as with Microsoft products, the actual data is stored in a
compressed binary format that is incomprehensible to humans, but is fast and
small to store. Modifying these files naively can easily result in corrupted
Word documents, although if the search and replacement text have identical
lengths it may work by blind luck. Some popular tools work around this problem
by interfacing directly with the native Microsoft application, avoiding data
loss and corruption.

Windows, Mac and Unix Line Feeds

Mainframe format

Files that come from a Mainframe are usually encoded using EBCDIC, which is
an alternative to ASCII. If Mainframe files have already been converted to
ASCII, any numbers stored in packed formats will have been corrupted and lost
by the conversion, so it is essential to use a tool that operates on the
original EBCDIC data.
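
To show why the original bytes matter, here is a minimal Python sketch that reads one record containing an EBCDIC text field followed by a packed-decimal number. The 8-byte record layout, the use of the cp037 code page and the helper name unpack_decimal are all assumptions made for illustration; real Mainframe files define their own layouts and code pages.

```python
def unpack_decimal(raw: bytes) -> int:
    """Decode a packed-decimal field: two digits per byte, sign in the last nibble."""
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)      # high nibble is one digit
        digits.append(byte & 0x0F)    # low nibble is the next digit
    digits.append(raw[-1] >> 4)       # last byte: one digit plus the sign
    sign_nibble = raw[-1] & 0x0F
    value = int("".join(str(d) for d in digits))
    # 0xD marks a negative number; other sign nibbles are treated as positive.
    return -value if sign_nibble == 0x0D else value


# Hypothetical record: 5 bytes of EBCDIC text, then 3 bytes of packed decimal.
record = bytes.fromhex("C8C5D3D3D612345C")

name = record[:5].decode("cp037")      # text is safe to translate
amount = unpack_decimal(record[5:])    # number must be read from raw bytes
```

Only the text field is translated to characters. Running the whole record through a character conversion would treat the bytes 12 34 5C as text and destroy the number.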

Data Size

Many tools rely on loading an entire file into main memory before processing
it. Even as memory sizes grow, the text files we process also seem to grow to
match! Loading huge files into memory is a very naive approach these days, and
can slow a computer to a crawl or even crash it. Tools such as TextPipe Pro
work around this issue by processing files in large chunks.
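
Chunked processing can be sketched as follows in Python. This is only one possible approach under simple assumptions (a literal search string and text-mode streams) and is not a description of how any particular product works. The function name replace_in_chunks and its parameters are invented for the example. The key detail is holding back the last few characters of each chunk, so that a match split across two chunks is still found.

```python
import io


def replace_in_chunks(src, dst, search: str, replacement: str,
                      chunk_size: int = 1 << 20) -> None:
    """Copy src to dst, replacing search with replacement, one chunk at a time."""
    if not search:
        raise ValueError("search text must not be empty")
    keep = len(search) - 1   # longest partial match that can end a chunk
    buffer = ""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        out = []
        pos = 0
        # Replace every complete match currently in the buffer.
        while True:
            idx = buffer.find(search, pos)
            if idx == -1:
                break
            out.append(buffer[pos:idx])
            out.append(replacement)
            pos = idx + len(search)
        rest = buffer[pos:]
        # Hold back the tail in case a match continues in the next chunk.
        cut = max(len(rest) - keep, 0)
        out.append(rest[:cut])
        buffer = rest[cut:]
        dst.write("".join(out))
    dst.write(buffer)        # too short to contain a match


# A tiny chunk size forces matches to straddle chunk boundaries.
src = io.StringIO("one two one two")
dst = io.StringIO()
replace_in_chunks(src, dst, "two", "2", chunk_size=4)
```

With real files, src and dst would be file objects opened in text mode, and chunk_size would be left at a large value so that memory use stays bounded regardless of file size.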