You can't reliably grep XML and HTML files directly.

The issue is that the standards for these formats say nothing about newlines. It is perfectly valid for megabytes of HTML to be all on one line, or even on zero lines (no newline at end of file). We had one of these a few months ago, and couldn't get the poster to run wc, or to believe the output when he saw it.

Even if the file has some newlines somewhere, any line with a tag on it can have lots of other junk on it too, or the tag can be separated from its text by a line break.
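
To see the problem concretely, here is a minimal sketch using made-up sample input (the count_lines helper name is only for illustration). A document with no newline at all has zero lines as far as wc is concerned, and grep -c reports matching lines, not matching tags.

```shell
# count_lines: hypothetical helper; prints the number of newline
# characters on standard input (tr strips the padding some wc versions add).
count_lines() {
    wc -l | tr -d ' '
}

# Made-up sample document, written without any newline at all.
doc='<html><body><p>one</p><p>two</p></body></html>'

# wc sees zero lines because there is no newline character.
printf '%s' "$doc" | count_lines

# grep -c counts matching lines, so two <p> tags on one line count once.
printf '%s\n' "$doc" | grep -c '<p>'
```

The first command prints 0 and the second prints 1, even though the sample holds two paragraphs.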

The best start I can recommend is to run it through awk like this:

AWK='
BEGIN { RS = ">" }          # break the input after every ">"
{ printf("%s>\n", $0) }     # put back the ">" that awk stripped
'

awk "${AWK}"

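Setting RS (the record separator) to > breaks the input after every HTML/XML tag. awk drops the separator from each record, so the printf puts the > back to restore the way it looks. (If the input ends with a newline after the last >, that newline becomes a final record of its own and gets a > appended too.)

After that, it becomes a whole lot easier to parse with another awk.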
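
As a rough sketch of what that second awk might look like, the pipeline below pulls out the text of each paragraph. The function names and the sample input are invented for illustration, and it assumes the paragraph text itself contains no > and no line breaks.

```shell
# split_tags: the RS trick from above, wrapped in a function
# (the function name is only for illustration).
split_tags() {
    awk 'BEGIN { RS = ">" } { printf("%s>\n", $0) }'
}

# paragraph_text: second awk stage. After splitting, the text in front of
# each closing </p> tag ends a line, so an end-of-line match finds it.
paragraph_text() {
    awk '/<\/p>$/ { sub(/<\/p>$/, ""); print }'
}

# Made-up sample input, all on one line with no trailing newline.
printf '<html><body><p>one</p><p>two</p></body></html>' |
    split_tags |
    paragraph_text
```

This prints one and two on separate lines. Keeping the split in its own stage means the second awk can stay a simple line-oriented match.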

-r, remove input file after conversion
-v, verbose
-h, usage and options (help)
-m, manual
-l, see this script
manual:
DESCRIPTION
html2txt converts ASCII files with HTML content to plain text. It replaces the
previous suffix, if any, with a "txt" suffix. It skips the following files:
- binary files
- directories
- files that already have the same name as <input_file>.txt
Option -r removes the input file after conversion.
EXAMPLES
Use find with xargs to run the script recursively on multiple files. For
example, to convert all html files to text recursively:
find . -name "*.html" | xargs html2txt
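
The pipeline above splits file names on whitespace, so a name like "my page.html" would reach the script as two arguments. One alternative, sketched here on the assumption that html2txt accepts several file names per call (as the xargs example implies), is to let find run the command itself with -exec ... {} +. The convert_tree function name is made up for illustration.

```shell
# convert_tree DIR CMD [ARGS...]: hypothetical wrapper that runs CMD on
# every *.html file below DIR. find passes each name as a single argument,
# so spaces in file names are safe.
convert_tree() {
    dir=$1
    shift
    find "$dir" -type f -name '*.html' -exec "$@" {} +
}

# Intended use, assuming html2txt is on the PATH:
#   convert_tree . html2txt
```

Passing the command as arguments also makes it easy to do a dry run first, for example with echo in place of html2txt.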
