BUSINESS TECHNOLOGY

Now, PC's That Read A Page and Store It

By JOHN MARKOFF

Published: August 17, 1988

The ability of computer programs to accurately read a printed page, including both text and graphics, and store the page's contents in a computer is improving rapidly, and the cost of the software is dropping. This will soon make the new technology widely accessible to many personal computer users and could significantly speed the work flow in many offices.

Known as optical character recognition, the technology has the potential to push word processing and the handling of documents to new levels of sophistication. It could make it possible to quickly and inexpensively convert large volumes of printed documents to computer storage. Documents received by facsimile machines from distant points by telephone will also be easily converted for computer processing and storage.

Besides changing office routines, such developments are expected to raise thorny copyright issues, since printed material can more easily be converted to a computerized form and then altered through use of word processing programs.

In the past, optical character recognition systems, also known as OCR readers, either had insufficient power and accuracy or were priced too high for the average personal computer user. But inexpensive and powerful 32-bit microprocessors and the development of new software are bringing potent new OCR technologies within the $800 to $2,500 range.

The new software is known as a ''page recognition'' system and is distinguished from its predecessors in that it can recognize virtually unlimited numbers of fonts and font sizes, distinguish text from line drawings and half-tones and correctly read multiple columns of text.

Earlier inexpensive OCR systems could read only one or a few fonts produced by a typewriter.

Last week, the Caere Corporation, a Los Gatos, Calif., company that has manufactured bar code scanning products, announced a new page recognition system called Omnipage. It is available for Apple Macintosh computers as a software program and for I.B.M.-compatible machines as a system that works with a plug-in co-processing circuit board. The cost of the Macintosh software is $800. The co-processing board for the I.B.M. will cost $1,995.

Next month the Palantir Corporation, which has sold a $30,000 OCR system intended for large corporate users, will announce a low-cost version of its system for I.B.M. and compatible personal computers. The cost of the Santa Clara, Calif., company's system is expected to be $2,500.

Both products offer increased accuracy. Earlier OCR systems made extensive corrections necessary. For example, even a system that claims 95 percent accuracy - the general limit of such products - could require as many as 100 corrections on a simple double-spaced typed page. In contrast, the new systems claim accuracies that approach 100 percent on many documents.
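The 100-correction figure is consistent with a page of roughly 2,000 characters, a count assumed here for illustration and not stated by the makers. A minimal sketch of the arithmetic, with the function name `expected_corrections` chosen only for this example:

```python
def expected_corrections(chars_on_page, accuracy):
    """Estimate how many characters a reader gets wrong on one page.

    chars_on_page is an assumed figure; accuracy is the claimed
    fraction of characters recognized correctly (0.95 for 95 percent).
    """
    return round(chars_on_page * (1 - accuracy))


# Assumed page size of 2,000 characters (hypothetical).
print(expected_corrections(2000, 0.95))   # older systems
print(expected_corrections(2000, 0.999))  # a system approaching 100 percent
```

Under the same assumption, a system that is 99.9 percent accurate would leave only a couple of errors per page.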

The makers of the newer systems suggest that the technology will spawn a host of new uses. For example, a company that would otherwise throw away most of the hundreds of resumes it receives could use an OCR system to scan each resume in about 30 seconds and store it on a computer disk for later reference.

The new systems will also make it possible to send a typed or type-set document by facsimile machine and then convert the data automatically to text for editing.

''It is the kind of software that makes it worthwhile to go out and buy a desktop scanner,'' said Jonathan Seybold, publisher of the Seybold Report on Publishing Systems, of Media, Pa. ''Until now the mass-market OCR software has been aimed at reading typewritten materials, and that hasn't been terribly useful.''

Richard Shaffer, editor of Technologic, said: ''These things don't have to be perfect. They just have to be better than the average secretary.''

Despite the new-found enthusiasm, some researchers are still cautious about the impact of the new page recognition systems. ''The history of OCR is that people have been saying that it would take off for some time, but it hasn't yet,'' said Richard Casey, a computer scientist who specializes in document recognition systems at I.B.M.'s Almaden Research Center in San Jose, Calif.

Many others, however, expect the new systems to dramatically ease the task of moving information from paper to computers. At a demonstration last week, the Caere system accurately recognized passages of text in business magazines.

The evolution of document scanners has been relatively long, technologically speaking. The first OCR research was done by I.B.M. at the company's Endicott Laboratory during the mid-1930's. It took three decades before the technology was available commercially. In 1964, an OCR system was developed for the Fireman's Fund Insurance Company by Recognition Equipment Inc.

These systems relied on a technique referred to as matrix-matching. Each character of a document is compared with a template stored in a computer's memory. The approach works best when limited to a single font of one size. Several special ''machine readable'' fonts were developed for recognition systems.
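Matrix-matching can be sketched in a few lines. The example below is a toy illustration, not any vendor's implementation: the 5-by-5 templates, the `agreement` score and the `match` function are all assumptions made for this sketch. Each scanned character is compared, dot by dot, with every stored template, and the template that agrees in the most positions wins.

```python
# Hypothetical 5x5 templates for a single font: '#' is ink, '.' is paper.
TEMPLATES = {
    "I": ["#####", "..#..", "..#..", "..#..", "#####"],
    "L": ["#....", "#....", "#....", "#....", "#####"],
    "T": ["#####", "..#..", "..#..", "..#..", "..#.."],
}


def agreement(glyph, template):
    """Count the positions where the scanned glyph and a template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def match(glyph):
    """Return the character whose template agrees with the glyph most."""
    return max(TEMPLATES, key=lambda ch: agreement(glyph, TEMPLATES[ch]))


# A 'T' with one stray speck of ink in the bottom-right corner.
noisy_t = ["#####", "..#..", "..#..", "..#..", "..#.#"]
print(match(noisy_t))
```

Because the comparison is made position by position, a character in a different font or size would line up poorly with every template, which is why the approach works best with one font of one size.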

In the late 1970's increased computing power made it possible to apply pattern recognition technologies to the problem of recognizing text. This approach looks for characteristic features of a particular letter or number. For example, the software can be trained to recognize the pointed tip of the letter A. This approach greatly broadened the number of fonts that could be recognized, but was susceptible to defects in characters - a break in an o, for instance, might make that letter read as a c.
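The weakness with broken characters can be shown with a toy feature test. The sketch below assumes a single, hypothetical feature, whether a character encloses a patch of paper, and uses it to tell an o from a c; real feature-based readers would combine many such tests. The names `has_enclosed_hole` and `classify` are invented for this example.

```python
def has_enclosed_hole(glyph):
    """Return True if some paper ('.') is fully surrounded by ink ('#').

    Paper reachable from the edge of the glyph is flood-filled; any
    paper left unreached must be enclosed by ink.
    """
    rows, cols = len(glyph), len(glyph[0])
    stack = [
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if glyph[r][c] == "." and (r in (0, rows - 1) or c in (0, cols - 1))
    ]
    seen = set(stack)
    while stack:
        r, c = stack.pop()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and glyph[nr][nc] == "." and (nr, nc) not in seen):
                seen.add((nr, nc))
                stack.append((nr, nc))
    total_paper = sum(row.count(".") for row in glyph)
    return len(seen) < total_paper


def classify(glyph):
    """Toy rule: a closed loop reads as 'o', an open one as 'c'."""
    return "o" if has_enclosed_hole(glyph) else "c"


clean_o = [".###.", "#...#", "#...#", "#...#", ".###."]
broken_o = [".###.", "#...#", "#....", "#...#", ".###."]  # gap on the right
print(classify(clean_o), classify(broken_o))
```

A single missing dot of ink opens the loop, the feature disappears, and the broken o is read as a c.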

In contrast to these techniques, the software designed by engineers at Palantir and Caere is based on a series of methods that are used to examine an entire page, making assumptions about the content of a document before attempting to recognize individual characters.

The Caere program first looks for dense areas on a page and then applies tests to determine whether these areas are graphics rather than text. It then tries to recognize individual columns, paragraphs and line spacing. Only after determining where each character lies on the page does it identify the individual characters.
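Caere has not published its methods, so the following is only a sketch of one common way to perform the column-finding step: scan across the page and treat any sufficiently wide vertical strip with no ink as a gutter between columns. The page representation, the `find_columns` name and the `min_gutter` parameter are assumptions made for this illustration.

```python
def find_columns(page, min_gutter=2):
    """Split a page into text columns separated by blank vertical strips.

    page is a list of equal-purpose text rows; any non-space character
    counts as ink. Returns (start, end) positions, end exclusive.
    """
    width = max(len(row) for row in page)
    # For each horizontal position, is there ink anywhere down the page?
    ink = [
        any(x < len(row) and row[x] != " " for row in page)
        for x in range(width)
    ]
    columns, start, gap = [], None, 0
    for x, has_ink in enumerate(ink):
        if has_ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gutter:
                # The column ended where the blank strip began.
                columns.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        columns.append((start, width - gap))
    return columns


page = [
    "xxxx xxx    xx xxxxx",
    "xx xxxxx    xxxx xxx",
    "xxxxx xx    xxx xxxx",
]
print(find_columns(page))
```

Single spaces between words do not split a column here because they fall at different positions on different lines; only the gutter is blank from top to bottom.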

In addition to a series of recognition tests similar to those in the Caere software, the Palantir system relies on a series of special dictionaries to aid in identifying individual words.
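How Palantir applies its dictionaries has not been disclosed. One plausible use, sketched below under that caveat, is to settle characters the reader is unsure of: when a position could be either of two letters, the program tries the combinations and keeps the first one that spells a known word. The tiny word list and the `resolve` function are hypothetical.

```python
from itertools import product

# Hypothetical word list standing in for a real dictionary.
DICTIONARY = {"clock", "cloak", "block"}


def resolve(candidates, dictionary=DICTIONARY):
    """Pick a spelling for one word from per-position character guesses.

    candidates holds one string per character position, listing the
    possible characters with the reader's best guess first.
    """
    for letters in product(*candidates):
        word = "".join(letters)
        if word in dictionary:
            return word
    # No dictionary word fits: fall back to the best guess at each position.
    return "".join(options[0] for options in candidates)


# The reader cannot decide between 'l' and '1', or between 'c' and 'o'.
print(resolve(["c", "l1", "co", "c", "k"]))
```

Words that are not in the dictionary, such as proper names, simply keep the reader's first guesses.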