Google Now Indexes Scanned Documents
Posted on October 30, 2008 at 7:51 PM EDT
Google has announced that it will begin including scanned documents in its search results, a feat that requires an immense amount of processing power and advanced image recognition technology. Unlike standard text documents, scanned files contain no text data that Google's spiders can index. To get around this, Google uses Optical Character Recognition (OCR) technology, which converts images of words into digital text. In the past Google indexed these image files as best it could, but could typically search only file titles and nearby metadata, not the contents of the documents. From now on, the text within these scanned documents will appear in normal search results. When you encounter a scanned document, you'll be able to view it in its original form as a PDF or as converted text (click "View As HTML").
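To make the difference between metadata-only and full-text search concrete, here is a minimal sketch of a toy inverted index. It assumes OCR has already run and produced plain text. The file names, the `title` and `ocr_text` fields, and the helper functions are all made up for illustration and say nothing about how Google's own pipeline is built.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, fields):
    """Map each token to the set of document ids whose chosen fields contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field in fields:
            for token in tokenize(doc.get(field, "")):
                index[token].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)


# Hypothetical scanned documents: "title" stands in for metadata,
# "ocr_text" stands in for text recovered by an OCR step (not real output).
docs = {
    "scan-001.pdf": {
        "title": "Annual report 1987",
        "ocr_text": "Revenue rose in the fourth quarter",
    },
    "scan-002.pdf": {
        "title": "Meeting minutes",
        "ocr_text": "The committee approved the annual budget",
    },
}

title_only = build_index(docs, ["title"])             # the old situation
full_text = build_index(docs, ["title", "ocr_text"])  # with OCR text included

print(search(title_only, "budget"))  # []
print(search(full_text, "budget"))   # ['scan-002.pdf']
print(search(full_text, "annual"))   # ['scan-001.pdf', 'scan-002.pdf']
```

With only titles indexed, a query for "budget" finds nothing, because the word appears only inside the scanned page. Once the OCR text is indexed as well, the same query returns the document.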