September 15th, 2010
The completely manual, last-ditch option to recover a PDF
I was recently given a damaged PDF file made up of scanned pages. Commercial repair tools recovered only the first page. I managed to find one way to get a little further, and I am presenting it here for advanced users in case it helps others.
First of all, I have dealt with damaged PDFs a few times, and if you need the text content of a PDF this method will not work. Grab one of the many commercial PDF repair tools and try that instead. I have a few I like to use, and in general they recover a lot, but no automated approach is going to get you 100% of the way there...
Tools Needed
Photoshop (or perhaps another image editing program)
Textpad (or another good text/hex editor)
Knowing that a scanned image is, most of the time, stored as a JPEG within the PDF, I figured I would start by changing the file extension from .pdf to .jpg and see if the file opened in Photoshop...
It did! But only the first page. So now the tricky bit...
I opened the PDF in Textpad and started scanning down the file, searching for "endobj". Basically, you will see constructs that look like:
endobj
13 0 obj
<< /Type /XObject /Subtype /Image /Width 1275 /Height 1649
/BitsPerComponent 8 /ColorSpace /DeviceRGB
/Filter /DCTDecode /Length 239648 >>
stream
followed by a bunch of binary data. If you look closely at the first line of the binary you'll see "JFIF" (the identifier in a JPEG/JFIF file header).
The endobj is the end of the previous object, and the 13 0 obj starts a new object. The /Filter /DCTDecode entry is what marks the stream as JPEG data.
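If scrolling through a large file by eye gets tedious, the same search can be scripted. What follows is only a minimal sketch in Python (my choice of language here, not part of the original workflow), and the function name find_jpeg_streams is made up for illustration. It assumes each image object is laid out like the one above, with /DCTDecode in the dictionary and the stream keyword shortly after it, which may not hold for every damaged file.

```python
def find_jpeg_streams(data: bytes):
    """Return (offset, starts_with_soi, has_jfif) for each /DCTDecode stream.

    Hypothetical helper: it looks for the /DCTDecode filter name, then for the
    next "stream" keyword, and inspects the first bytes of the stream data.
    """
    results = []
    pos = 0
    while True:
        hit = data.find(b"/DCTDecode", pos)
        if hit == -1:
            break
        keyword = data.find(b"stream", hit)
        if keyword == -1:
            break
        start = keyword + len(b"stream")
        # The stream keyword is followed by CRLF or LF before the data begins.
        if data[start:start + 2] == b"\r\n":
            start += 2
        elif data[start:start + 1] == b"\n":
            start += 1
        # A JPEG begins with the two bytes FF D8; JFIF files carry the
        # text "JFIF" a few bytes later.
        starts_with_soi = data[start:start + 2] == b"\xff\xd8"
        has_jfif = b"JFIF" in data[start:start + 16]
        results.append((start, starts_with_soi, has_jfif))
        pos = start
    return results
```

To try it, read the damaged file with open("copy.pdf", "rb").read() (the filename is a placeholder) and pass the bytes in. Each reported offset is where the binary image data begins.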
So... the method goes like this:
Make a copy of the file first (obviously). Then remove everything in the file down to and including the next endobj, save the file, open it in Photoshop (hopefully getting the next page) and save it as a JPG. Repeat for each subsequent image.
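For files with many pages, the cut-and-save loop above could in principle be automated. The sketch below is an assumption-laden Python version, not the method I actually used: instead of trimming the file and letting Photoshop find the image, it copies the bytes between each stream keyword and the following endstream into its own file. The names extract_jpegs and save_jpegs and the page_001.jpg naming scheme are all hypothetical. If the file ends abruptly before an endstream, it keeps whatever is left, which should give a partial image much like the half page I got by hand.

```python
from pathlib import Path


def extract_jpegs(data: bytes):
    """Return the raw bytes of every /DCTDecode stream found in data."""
    images = []
    pos = 0
    while True:
        hit = data.find(b"/DCTDecode", pos)
        if hit == -1:
            break
        keyword = data.find(b"stream", hit)
        if keyword == -1:
            break
        start = keyword + len(b"stream")
        # Skip the end-of-line that follows the stream keyword.
        if data[start:start + 2] == b"\r\n":
            start += 2
        elif data[start:start + 1] == b"\n":
            start += 1
        end = data.find(b"endstream", start)
        if end == -1:
            # Truncated file: keep whatever image data remains.
            images.append(data[start:])
            break
        chunk = data[start:end]
        # Drop the single end-of-line that precedes endstream, if present.
        if chunk.endswith(b"\r\n"):
            chunk = chunk[:-2]
        elif chunk.endswith((b"\n", b"\r")):
            chunk = chunk[:-1]
        images.append(chunk)
        pos = end
    return images


def save_jpegs(pdf_path, out_dir):
    """Write each extracted image to out_dir and return the paths written."""
    data = Path(pdf_path).read_bytes()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, image in enumerate(extract_jpegs(data), start=1):
        target = out / f"page_{i:03d}.jpg"
        target.write_bytes(image)
        paths.append(target)
    return paths
```

Calling save_jpegs("copy.pdf", "pages") would write one .jpg per image found. Searching for endstream rather than trusting the /Length value is a deliberate choice for damaged files, though it could cut an image short in the unlikely case that those bytes occur inside the image data.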
In the case of my file I only got one and a half more pages before the file abruptly ended, but that was better than nothing. I truly hope this helps others. Granted, it's an edge case for repairing a PDF file and will only work to get the images out of a PDF, but as it was for me, something is better than nothing.