If you often write HTML in an editor and then paste into WordPress, you’ll notice that sometimes annoying formatting tags (like <span> tags) are added. Using simple shell scripts, you can automatically clean up that garbage HTML formatting with a few simple commands.

Why use shell scripting? If you’re new to programming, it’s much, much better to start small. Not only are you less likely to give up, but you’ll have opportunities to stop and learn along the way. That said, your first programs can be really useful even if they’re also really simple.

Why Shell Scripting?

First, let’s define “shell scripting” as writing scripts to be run in the Bash shell. Technically speaking, scripts written for other shells, such as PowerShell, could also be termed “shell scripting.” But why focus on shell scripting in general, and Bash scripting in particular, in the first place?

You can develop in small steps, in an interactive way. For example, let’s say you’ve decided you’ll use tar to do your compression, but you’re not yet sure which of its options you want. Just play around with it at the prompt until you get the result you want, then copy/paste the command you used into your script.
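As a rough illustration of that workflow, here is the kind of experiment you might run at the prompt before settling on a command. The scratch directory and file names are made up for the example; only the tar options themselves are standard.

```shell
# Work in a scratch directory so the experiment can't touch real files
workdir=$(mktemp -d)
mkdir "$workdir/notes"
echo "first draft" > "$workdir/notes/draft.txt"

# Attempt 1: -c creates an archive, -f names the archive file
tar -cf "$workdir/notes.tar" -C "$workdir" notes

# Attempt 2: adding -z compresses the archive with gzip
tar -czf "$workdir/notes.tar.gz" -C "$workdir" notes

# -t lists an archive's contents, a quick check that it holds what you expect
listing=$(tar -tzf "$workdir/notes.tar.gz")
echo "$listing"
```

Once one of the attempts does what you want, that line is the one to paste into your script.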

1. Collecting Long Lists of Parameters

The easiest and most straightforward way to use a shell script is as a kind of shortcut for an existing command. Some command line programs have a ton of flags, and their syntax isn’t always clear. But you can take one of these commands, with all its complicated options, and throw it into a shell script with a name that’s easier to enter. Consider, for instance, a script that runs Pandoc on a Markdown file and creates an ODT file, using a template file.

The first line of the script directs the system to use the Bash shell to run it. The next one takes the first argument at the command line ($1), and runs Pandoc with a set of flags on it. It’s worth noting there are other ways to do this, such as using the alias command on Unix-ish systems. But making small shell scripts means you can keep them handy (such as in your ~/bin folder), quickly copy (or sync) them elsewhere, and change them with any text editor. Save your script with a file name that’s easy to remember and type (e.g. “markdown2odt.sh”). Don’t forget to give it executable permissions.
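A minimal sketch of such a wrapper, and of saving it and making it executable, might look like the following. The exact flags are assumptions for illustration: `--reference-doc` is Pandoc's option (in version 2.0 and later) for styling ODT output from a reference document, and the path `$HOME/templates/reference.odt` is a placeholder. A scratch directory stands in for ~/bin so the example doesn't overwrite anything.

```shell
# A scratch directory stands in for ~/bin here, so nothing real is overwritten
script_dir=$(mktemp -d)
script="$script_dir/markdown2odt.sh"

# Write the two-line script; the quoted 'EOF' keeps $1 and $HOME from expanding now
cat > "$script" <<'EOF'
#!/bin/bash
pandoc --from=markdown --to=odt --reference-doc="$HOME/templates/reference.odt" --output="${1%.md}.odt" "$1"
EOF

# Give the script executable permissions, then confirm them
chmod +x "$script"
ls -l "$script"
```

With the script in a folder on your PATH, running `markdown2odt.sh notes.md` should then write notes.odt next to the original file.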

The DocBook XML format has no convention for inline styles, so if we convert HTML to DocBook, all this formatting gets tossed out. Then we can use Pandoc to convert the DocBook back to HTML, and we get a nice bit of markup that you can (for example) paste into WordPress. Rather than do this with individual calls to Pandoc, the following script chains them together to:

Convert the exported HTML file to DocBook, which has no inline styles (before the pipe)

Convert the DocBook back into what is now nice, clean HTML formatting (after the pipe)

#!/bin/bash
# HTML -> DocBook (drops inline styles) -> clean HTML, written back to the same file
pandoc -w docbook "$1" | pandoc -r docbook -w html -o "$1" -

Explaining Standard Input/Output

The above takes advantage of the terminal concepts of “standard input” and “standard output.” If you were to run the first part of the command, you’d get a whole bunch of XML shown in the terminal. The reason is that we haven’t given Pandoc any other output (such as a file) to use. So it’s using the only fallback it’s got: standard output, in this case the terminal.

On the other hand, the dash character at the end of the second Pandoc command means it should use “standard input.” Run by itself, the command would simply wait for you to provide some text via its default input, by typing on the keyboard. When we combine them, you can almost imagine the first command spitting out a bunch of XML to the terminal, where it is immediately piped into the second command as input.
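The same three ideas can be tried with ordinary utilities before involving Pandoc at all. In this small sketch, printf, cat and tr are stand-ins chosen for the demonstration, and the variable name `shout` is made up:

```shell
# With no output file given, the result goes to standard output (the terminal)
printf '<p>hello</p>\n'

# A dash tells cat (like many tools) to read standard input instead of a named file
printf '<p>hello</p>\n' | cat -

# Chained: the first command's output becomes the second command's input
shout=$(printf '<p>hello</p>\n' | tr 'a-z' 'A-Z')
echo "$shout"
```

The pipe character is what connects one command's standard output to the next command's standard input, which is the role it plays between the two Pandoc calls.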

The result is, if you save this as “clean-html.sh,” you can run it on any HTML file to get rid of those bothersome styles. The best part is Pandoc will read from the file, then overwrite it at the end, meaning there are no temp files littered about.

3. Running Programs on Multiple HTML Files

Some programs allow you to specify wildcards such as the asterisk at the command line. This allows you to, for example, move all JPG images to your “Pictures” folder:

mv *.jpg ~/Pictures
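It can help to see that it is the shell, rather than mv, that expands the wildcard. In this small sketch (the file names are invented for the demonstration), putting echo in front of the command prints what mv would actually receive, without moving anything:

```shell
# Scratch directory with two JPGs and one file that should not match
demo=$(mktemp -d)
touch "$demo/beach.jpg" "$demo/city.jpg" "$demo/notes.txt"

# Prefixing the command with echo prints it after the shell has expanded *.jpg
expanded=$(cd "$demo" && echo mv *.jpg ~/Pictures)
echo "$expanded"
```

The printed line lists both JPG files by name, followed by the full path to your Pictures folder.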

But other programs expect to work on one file at a time, and Pandoc is one of them: hand it several input files and it combines them into a single document instead of processing each one separately. So what happens when we have a whole directory full of exported HTML files and we want to clean up the HTML formatting? Do we need to run our “clean-html.sh” script on each one of them manually?

No, because we’re not newbies. We can wrap our piped command in a “for-each” loop. This will go to each HTML file in the current directory in turn, and perform the clean operation on it. Let’s also add a little message via the echo statement to let us know all the files have been taken care of:
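A minimal sketch of that loop follows. It is wrapped in a function with a made-up name, `clean_all_html`, so it can be pasted at the prompt and tried; the pipeline inside is the one from clean-html.sh, and the wording of the message is an assumption.

```shell
# Hypothetical function name; the pipeline inside is the one from clean-html.sh
clean_all_html() {
  local file
  for file in *.html; do
    # If nothing matches, the pattern stays as the literal "*.html"; skip it
    [ -e "$file" ] || continue
    pandoc -w docbook "$file" | pandoc -r docbook -w html -o "$file" -
  done
  echo "All HTML files have been cleaned up!"
}
```

To use it, change into the folder holding the exported files and run `clean_all_html`. In a script of its own, the same loop could sit directly under the `#!/bin/bash` line without the function wrapper.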

Provide the user with additional export options like PDF (adds choices based on input, via if-then or case statements).
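As one possible shape for that idea, here is a small sketch using a case statement. The function name `export_doc`, its arguments and the list of formats are all hypothetical. It relies on Pandoc inferring the output format from the extension passed to -o, and PDF output additionally needs a PDF engine (such as a LaTeX installation) on the system.

```shell
# Hypothetical helper: export_doc FILE [FORMAT], with FORMAT defaulting to html
export_doc() {
  local input=$1
  local format=${2:-html}
  case "$format" in
    html|odt|pdf)
      # Pandoc infers the output format from the extension given to -o
      pandoc "$input" -o "${input%.*}.$format"
      ;;
    *)
      echo "Unsupported format: $format" >&2
      return 1
      ;;
  esac
}
```

For example, `export_doc notes.md pdf` would ask Pandoc to write notes.pdf, while an unrecognized format prints an error and returns a non-zero status.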

As you can see, with shell scripts you can build things a little at a time, testing them out at the prompt and tacking them onto your scripts as you go.

What do you say, does shell scripting seem a little less intimidating now? Are you ready to try your hand at automating your dullest tasks? If you decide to jump in, let us know how it goes below in the comments!


Aaron has been elbow-deep in technology as a business analyst and project manager for going on fifteen years, and has been a loyal Ubuntu user for almost as long (since the Breezy Badger). His interests include open source, small business applications, integration of Linux and Android, and computing in plain text mode.