This blog is about the Linux Command Line Interface (CLI), with an occasional foray into GUI territory.
Instead of just giving you information like some man page, I hope to illustrate each command in real-life scenarios.

The output messages above provide a clue as to why the input PDF file was problematic: the file does not "conform to Adobe's published PDF specification." To its credit, gs "repaired or ignored" the problem and went on to extract the pages successfully. In this particular example, gs is more error-tolerant than its counterpart, pdftk.

Note that with no explicit instruction, the default layout is 1x2 in the landscape orientation.

a2ps provides shortcuts to specify the number of rows and columns for common configurations. For example, -2 is equivalent to 1 row and 2 columns. Valid shortcuts are -1 to -9.

Paper size

North American a2ps users need to modify the output paper size. a2ps was written in Europe, and uses A4 as the default paper size. North America uses a different standard, and a common paper size is called Letter - 8.5 x 11 inches. Printing A4 on Letter-sized paper results in text being cropped at the end of each line in the right column.

To modify the paper size to Letter:

$ a2ps -M Letter .emacs

Instead of overriding the paper size on every run, you can change the default locally (per user) or globally (system-wide). To specify the paper size for a user, add the following line to $HOME/.a2ps/a2psrc:

Options: --medium=Letter
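As a minimal sketch of the per-user change, the snippet below appends that line to the a2psrc file, creating the directory if needed. The helper name set_a2ps_medium is hypothetical, and the optional second argument (a stand-in home directory) exists only so the function can be tried without touching your real configuration.

```shell
# Hypothetical helper: make MEDIUM the default a2ps paper size for one user.
# $1 = medium name (e.g. Letter), $2 = home directory (defaults to $HOME).
set_a2ps_medium() {
    medium=$1
    home_dir=${2:-$HOME}
    rc_file="$home_dir/.a2ps/a2psrc"

    mkdir -p "$home_dir/.a2ps"
    # Append the line only if the exact line is not already present.
    line="Options: --medium=$medium"
    if ! grep -qxF -e "$line" "$rc_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}

# Usage (not run here):
#   set_a2ps_medium Letter
```

Running it twice leaves a single line in the file, so it is safe to repeat.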

To change the default system-wide, edit the file /etc/a2ps-site.cfg. Look for the Options: --medium line and change the value to Letter.

Preview of output

By default, a2ps sends the output to the printer. Sometimes you want to override that behavior in order to preview the output. You may redirect the output to a PostScript file or directly to ghostview.

The -P parameter normally specifies the printer name. However, display is a special name that redirects the output to ghostview.
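The two preview routes might look like the sketch below. The -o option writes the PostScript to a file, and -P display is the special printer name described above. The wrapper name a2ps_preview and the default file name preview.ps are hypothetical.

```shell
# Hypothetical wrapper: write the a2ps output to a PostScript file
# instead of sending it to the printer.
# $1 = input file, $2 = output file (defaults to preview.ps).
a2ps_preview() {
    src=$1
    out=${2:-preview.ps}
    a2ps -o "$out" "$src"
}

# Usage (not run here):
#   a2ps_preview .emacs emacs.ps    # then open emacs.ps in a PostScript viewer
#   a2ps -P display .emacs          # or hand the output straight to ghostview
```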

Multiple files

a2ps also supports multiple input files. By default, each file begins printing on a new sheet ("file alignment"). Empty cells in the layout are not filled. For example, given 2 input files - 1.txt and 2.txt - that are each 1 page long, each file will be printed on a separate sheet, leaving the sheets half empty.

You can control the file alignment using the -A parameter. If the file alignment is fill, a2ps starts each file in the next available cell, leaving no empty cell between it and the previous file.
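To see what fill saves, the sketch below counts sheets for the default 2-cells-per-sheet layout under both alignments. It is plain shell arithmetic, not something a2ps provides; the function names are made up for illustration, and the page counts are passed in by hand.

```shell
# Sheets used when every file starts on a new sheet (the default).
# Arguments: page count of each file; assumes 2 pages (cells) per sheet.
sheets_per_file_alignment() {
    total=0
    for pages in "$@"; do
        total=$(( total + (pages + 1) / 2 ))   # round up per file
    done
    echo "$total"
}

# Sheets used with -A fill: files share sheets, so round up only once.
sheets_fill_alignment() {
    pages_sum=0
    for pages in "$@"; do
        pages_sum=$(( pages_sum + pages ))
    done
    echo $(( (pages_sum + 1) / 2 ))
}

sheets_per_file_alignment 1 1   # prints 2
sheets_fill_alignment 1 1       # prints 1
```

For the two 1-page files in the example, the matching command would be a2ps -A fill 1.txt 2.txt, which puts both files on one sheet.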

Tuesday, January 14, 2014

Want a tool for editing image files, but shun GIMP because of its steep learning curve? I needed an app to edit screenshots for my blog posts, and Pinta turned out to be the perfect tool for that purpose.
I use pinta to edit the following image file formats: JPG, PNG, TIFF, BMP, ICO, TGA, ORA.

The main drawing canvas is sandwiched between 2 columns of tool sub-windows. The left column is organized into Tools and Palette; the right, Layers, Images, and History. The sub-windows are docked by default, but you can hide them or make them float. My preference is to keep the default configuration - the operations that I need most often are conveniently located.

The Tools window in the left column has the typical selection tools - rectangle select, ellipse select, lasso select - and geometric shape drawing tools - rectangles, rounded rectangles, ellipses, and freeform shapes. Mousing over a tool icon displays brief instructions on using the tool. The tools are self-explanatory. However, I could not get the lasso select and the freeform shape drawing tools to work on pinta version 1.3.

The right column houses the very useful History window. Every edit operation you complete in the current session - Text, Ellipse, etc. - is recorded there. Clicking an operation reverts the image to its state right after that operation. If you prefer the more traditional Undo and Redo features, they are available in the Edit menu.

Occasionally, you may venture into the menus and sub-menus to get at editing functions that are not exposed in the windows. For editing screenshots, I frequent the Image menu, which contains the cropping, rotating, and resizing functions. I find Crop to Selection particularly useful. You first use a selection tool to mark a region of the original image. Crop to Selection then reduces the image to the selected region, eliminating everything else.

Pinta is easy to use. So easy that I just shrugged when I realized that this software does not come with a user manual. If you don't know GIMP, I suggest that you start with pinta because you will be productive within minutes. If you have modest image editing requirements, you may never need to graduate to the more powerful GIMP.

Thursday, January 9, 2014

This post describes how to scan pages from a printed book and convert the scanned images to text using Optical Character Recognition (OCR) technology.

The tools that I use are:

SimpleScan

tesseract

Preparation

SimpleScan is a GUI scan application that comes pre-installed in many Linux distributions (including Debian Wheezy).

To manually install it on Debian:

$ sudo apt-get install simple-scan

tesseract is a command-line OCR program.

To install:

$ sudo apt-get install tesseract-ocr

If English is the language used, that is all you need to install. If you require another language, you must install additional tesseract language packs. Examples are tesseract-ocr-rus for Russian, tesseract-ocr-deu for German, and tesseract-ocr-fra for French.

OCR Procedure

The first parameter is the input image filename. The second parameter is the desired basename of the output text file. The default txt extension is added to the basename, e.g., out.txt.

If the language is not English, you need to specify the language on the command line using a 3-character language code (refer to the tesseract man page). The following command specifies the use of 3 languages: Russian, German and French.

$ tesseract OnWritingWell.jpg myout -l rus+deu+fra
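The -l argument is simply the language codes joined with plus signs. A small sketch of wrapping that up follows; join_langs and ocr_image are hypothetical helper names, and the default of eng when no language is given is an assumption made for this example.

```shell
# Join language codes with '+', e.g. "rus deu fra" -> "rus+deu+fra".
join_langs() {
    joined=""
    for code in "$@"; do
        joined="${joined:+$joined+}$code"
    done
    echo "$joined"
}

# Hypothetical wrapper: OCR an image into BASENAME.txt.
# $1 = input image, $2 = output basename, remaining args = language codes
# (falls back to eng when none are given).
ocr_image() {
    image=$1
    base=$2
    shift 2
    if [ "$#" -eq 0 ]; then
        set -- eng
    fi
    langs=$(join_langs "$@")
    tesseract "$image" "$base" -l "$langs"
}

# Usage (not run here):
#   ocr_image OnWritingWell.jpg myout rus deu fra   # writes myout.txt
```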

Accuracy

In the above example, there were a total of 734 words. Within the output text file, 119 words (16% of the total) required some form of manual correction. This roughly translates to 84% OCR accuracy. The sample size is too small to be scientific or statistically valid. What accuracy are you getting from OCR?
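The arithmetic behind those figures can be reproduced with a short sketch. The word counts are the ones quoted above; the function name ocr_accuracy is hypothetical, and counting the words that need correction is still a manual step.

```shell
# Print OCR accuracy as a percentage with one decimal place.
# $1 = total words, $2 = words needing manual correction.
ocr_accuracy() {
    awk -v total="$1" -v bad="$2" \
        'BEGIN { printf "%.1f\n", 100 * (total - bad) / total }'
}

ocr_accuracy 734 119   # prints 83.8
```

The total word count can come from wc -w on the output text file, for example wc -w < myout.txt.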