How to convert a PDF from archive.org for reading on a tablet
Archive.org is a great website full of rare books available for free download. The only problem is that most of their PDF files are quite unreadable, not just on a tablet but even on a powerful desktop PC, because they are optimized for minimal size; a tablet can take around 5-10 seconds to render a single page. This post shows a method for converting those PDFs into larger files (around 4x the size) that can be browsed comfortably offline, even on a tablet.
The conversion process can be outlined in 3 steps:
- Convert the PDF into a series of JPEG images using Ghostscript.
- Downscale and (optionally) crop the images with the batch tool of IrfanView.
- Convert these images into a PDF file again using the img2pdf Python tool.
An attentive reader might ask: Why do I need to downscale the images? Why don’t we just generate smaller images with Ghostscript right away?
The answer: The anti-aliasing of Ghostscript is quite poor, so to get optimal quality it is better to export larger-than-required JPEG images with it, and then downscale them with software that has decent anti-aliasing.
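To see why the resampling step matters, here is a toy illustration in plain Python. It is not what Ghostscript or IrfanView do internally, and the function names are made up for this sketch. A 4x4 grayscale checkerboard stands in for fine detail such as thin strokes of text; it is halved once by keeping every second pixel and once by averaging each 2x2 block.

```python
# Toy illustration of aliasing when downscaling. This is NOT the actual
# algorithm used by Ghostscript or IrfanView; all names are hypothetical.
# Both functions assume the image dimensions are divisible by `factor`.

def downscale_nearest(img, factor):
    """Keep only every `factor`-th pixel (no anti-aliasing)."""
    return [row[::factor] for row in img[::factor]]

def downscale_box(img, factor):
    """Average each factor x factor block (a simple anti-aliasing filter)."""
    out = []
    for y in range(0, len(img), factor):
        row = []
        for x in range(0, len(img[0]), factor):
            block = [img[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) // len(block))
        out.append(row)
    return out

# 4x4 checkerboard: alternating black (0) and white (255) pixels.
checker = [[255 * ((x + y) % 2) for x in range(4)] for y in range(4)]

print(downscale_nearest(checker, 2))  # [[0, 0], [0, 0]]
print(downscale_box(checker, 2))      # [[127, 127], [127, 127]]
```

Without averaging, the result is solid black and the detail is lost entirely; with averaging, the result is a mid-gray that preserves the overall brightness. Real resampling filters are more sophisticated, but the principle is the same.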
The steps in detail:
1. Convert the PDF into a series of JPEG images using Ghostscript.
This example uses Ghostscript in a Cygwin environment (with Console2). Make sure that Ghostscript is properly installed under Cygwin. Download any PDF, for example: Cicero: De Natura Deorum (as “cicero.pdf”).
Then execute the following command:
gs -dNOPAUSE -sDEVICE=jpeg -r300 -sOutputFile=p%03d.jpg cicero.pdf
This will generate a JPEG image for each page at 300 DPI.
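If you prefer to drive this step from a script, the command can be assembled in Python. The following is only a sketch: the helper names are hypothetical, the 6 x 9 inch page size (432 x 648 points) is an assumed example rather than the actual size of the Cicero scan, and -dBATCH is an extra Ghostscript switch not present in the command above, added here so that Ghostscript exits once the last page is written instead of waiting at its prompt.

```python
import shlex

POINTS_PER_INCH = 72  # PDF page sizes are measured in points (1/72 inch)

def gs_jpeg_command(pdf_path, dpi=300, pattern="p%03d.jpg"):
    """Build the Ghostscript argument list for rendering pages to JPEG."""
    return [
        "gs", "-dNOPAUSE",
        "-dBATCH",  # assumption: added so gs quits after the last page
        "-sDEVICE=jpeg", f"-r{dpi}",
        f"-sOutputFile={pattern}",
        pdf_path,
    ]

def rendered_size(width_pt, height_pt, dpi=300):
    """Pixel dimensions of a page of the given size (in points) at `dpi`."""
    return (round(width_pt * dpi / POINTS_PER_INCH),
            round(height_pt * dpi / POINTS_PER_INCH))

cmd = gs_jpeg_command("cicero.pdf")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)

# Assumed example: a 6 x 9 inch page is 432 x 648 points.
print(rendered_size(432, 648))  # (1800, 2700)
```

The size calculation shows why the output is larger than required: a page of that assumed size comes out at 1800 x 2700 pixels, well beyond most tablet screens.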
Once all 702 pages have been converted, move on to the next step:
2. Downscale and crop the images
Many batch tools exist (such as those in Photoshop or GIMP), but here we’ll use a handy old program called IrfanView. In the “File” menu select “Batch Conversion/Rename”:
Select the input files by navigating to the directory and clicking the “Add all” button. Then select the output directory. Click on “Options” for the JPEG settings:
Then click “OK”, and on the batch window click “Advanced” for the resize options:
There are 3 important settings on this page:
- Select the “resize” option, and set its parameters.
- Check the “Preserve aspect ratio” option.
- Also check the “Use Resample function” option, because this enables the anti-aliasing.
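When choosing the resize parameters, it helps to know what “Preserve aspect ratio” will produce for a given bounding box. The sketch below shows the usual fit-within-a-box calculation; the function name is hypothetical, and both the 1800 x 2700 source size and the 800 x 1280 tablet screen are assumed example values, not settings taken from IrfanView.

```python
def fit_within(width, height, max_width, max_height):
    """Largest size that fits the box while keeping the aspect ratio.

    The scale is capped at 1.0 so that small images are never enlarged.
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return round(width * scale), round(height * scale)

# Assumed example: a 1800 x 2700 page image on an 800 x 1280 screen.
print(fit_within(1800, 2700, 800, 1280))  # (800, 1200)
```

Here the width is the limiting dimension, so the page ends up 800 pixels wide and 1200 pixels tall, slightly shorter than the screen.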
Then click “OK”, and on the batch window click “Start Batch”. It will finish quickly.
3. Convert the images into a PDF file again
Download the img2pdf tool and either install it the regular way, or just copy the two “.py” files from the “src” folder. Then execute the following command:
python img2pdf.py -o cicero_optimized.pdf *.jpg
The good thing about img2pdf is that it doesn’t decode and re-encode the JPEG files; it just inserts them into the PDF unmodified, so no further quality is lost in this step.
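One detail to watch is page order. The *.jpg wildcard is typically expanded in alphabetical order, which matches the page order only as long as all numbers have the same width. With the p%03d.jpg pattern this holds for up to 999 pages, but a longer book would put p1000.jpg before p999.jpg. A minimal sketch of building the file list in numeric order follows; the helper names are hypothetical, and the command-line form is simply the one used above.

```python
import re

def page_number(filename):
    """Extract the page number from names like 'p007.jpg'."""
    match = re.search(r"(\d+)", filename)
    return int(match.group(1)) if match else -1

def img2pdf_command(image_files, output_pdf):
    """Build the img2pdf command line with the pages in numeric order."""
    ordered = sorted(image_files, key=page_number)
    return ["python", "img2pdf.py", "-o", output_pdf, *ordered]

# A plain alphabetical sort would place p1000.jpg before p999.jpg.
files = ["p1000.jpg", "p999.jpg", "p002.jpg"]
print(img2pdf_command(files, "cicero_optimized.pdf"))
# To actually run it: subprocess.run(command, check=True)
```

For the 702-page example in this post the plain wildcard is sufficient; the numeric sort only matters for longer books.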