On 11/2/07, Tom Knight <tk@csail.mit.edu> wrote:
I'm amazed that Google is not using the "text behind image" feature of PDF files to handle this. Modern OCR programs produce text behind the page image, which is then searchable and selectable for cut and paste. I wonder why this obviously useful idea is not used in their scanning and OCR. I guess they want to be the only people who can search and index the pages.
I applaud Google's efforts to digitize old books and have been downloading some occasionally for quite a while now. Unless my memory is seriously flawed (quite possible :-( ), the PDF files I downloaded early on were searchable, but none in the last 6 months or so have been. This makes me think Google has changed its policy on this, perhaps for copyright liability reasons. In any case, this greatly diminishes the usefulness of these books for me. Since the online versions are searchable, it is clear that OCR has been done on them, but the resulting text is not made available in the downloads.

I have also been somewhat disappointed in the quality, with missing and illegible pages quite common. This seems to be true across the board, both in old math books and in others.

If Google employees or anyone else can shed more light on these matters, I would be interested to know what is going on.

Jim