New York -- It appears that Googlebot’s obsession with text has reached a new level. Search engine giant Google Inc. last week announced that it has begun using optical character recognition (OCR), a technology that allows the search engine to read text from scanned documents saved in Adobe PDF format and include the words from those documents in its search results.
Scanned documents have been popping up in Google's search index for a while now. That is because Google has long indexed the metadata of Adobe Portable Document Format (PDF) documents, and even offers the option of converting the documents to HTML.
OCR converts pictures of words into actual words. Until now, however, Google had been able to index the full text only of those PDF documents that contained actual text data.
To be more precise, a scanned document is a photograph of the entire page, pixel by pixel, capturing the text along with images, paper defects, holes, stains and so on. It is a true copy of the analog original. A digital document, on the other hand, stores the text itself as binary character data.
“In the past, scanned documents were rarely included in search results as we could not be sure of their content,” Evin Levey, a Google product manager, said in a Google blog post. “We had occasional clues from references to the document -- so you might get a search result with a title but no snippet highlighting your query.”
Indexing scanned documents is much harder than indexing PDFs that contain actual text data, according to Google, because scans might contain the ring of a coffee cup, ink smudges or fold creases in the paper. “Today, that changes,” Levey added. “We are now able to perform OCR on any scanned documents that we find stored in Adobe’s PDF format.”
By converting images of text into text, Google expands its already massive index. “This (OCR) technology enables us to effortlessly convert a picture (of a thousand words) into a thousand words -- words that can be searched and indexed so that these valuable documents are more easily found,” Levey wrote.
Google is also making this converted text of scanned PDFs available on its search results pages via the “View as HTML” link. As an example, this scan of a Consumer Product Safety Commission (CPSC) document about aluminum wiring repair from 2004 can also be viewed as HTML.
A similar search, “Repairing Aluminum Wiring,” on Yahoo Search also showed the CPSC PDF as the top result, but Yahoo’s “View as HTML” link displayed only blank pages. Microsoft’s Live Search and Ask.com both returned the CPSC PDF as the top result, but neither offered a “View as HTML” link.
OCR technology, which has been around for some time, needs to be pretty smart to get anywhere near the reading ability of a human. Google’s Levey explains: “To people reading these documents, the dissimilarity between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible.”
Now you can search for specific words found inside a scanned document, so long as it is a PDF. Google will presumably expand the range of index-worthy formats over time.
Consider what recognition involves. A computer can only “read” the binary code making up a character, a word, a sentence. Take a circle on a scanned page. Should it be read as a zero, the letter ‘O’, just a circle, or the ring from my coffee cup? The process is painstaking and error-prone, although its accuracy has been improving over the years.
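To make the circle problem concrete, here is a minimal Python sketch of one way a program might guess at a round shape from the characters next to it. It is an illustration only, not Google's method: the `ROUND` placeholder, the function names and the neighbour-voting rule are all assumptions made for this example.

```python
# Hypothetical illustration of the "circle" ambiguity; not Google's method.
# ROUND stands in for a round shape the recognizer could not classify.
ROUND = "\u25cb"


def resolve_round_shape(left: str, right: str) -> str:
    """Guess what a round shape is from its immediate neighbours."""
    neighbours = (left, right)
    digits = sum(c.isdigit() for c in neighbours)
    letters = sum(c.isalpha() for c in neighbours)
    if digits == 0 and letters == 0:
        return ""  # isolated ring: assume a stain, not text
    # Mostly digits nearby suggests a zero; otherwise assume the letter O.
    return "0" if digits > letters else "O"


def read_line(raw: str) -> str:
    """Replace every unresolved round shape in a line of recognized text."""
    out = []
    for i, ch in enumerate(raw):
        if ch != ROUND:
            out.append(ch)
            continue
        left = raw[i - 1] if i > 0 else " "
        right = raw[i + 1] if i + 1 < len(raw) else " "
        out.append(resolve_round_shape(left, right))
    return "".join(out)


print(read_line("2" + ROUND + "04"))     # 2004
print(read_line("G" + ROUND + "OGLE"))   # GOOGLE
```

A real OCR engine weighs far more evidence (shape, size, dictionaries, page layout), which is part of why the process remains error-prone.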
A few examples of the new technology in action: repairing aluminum wiring, spin lock performance, steady success in a volatile world.
“This is a small but important step forward in our mission of making the entire world’s information accessible and useful,” said Levey.
Still, the importance of the change should not be understated: scores of print-only books and documents (mostly relating to history, academia, government and archives) that had been uploaded but could not be indexed can now be readily searched.