
How to convert a PDF from archive.org for a tablet

Archive.org is a great website full of rare books available for free download. The only problem is that most of its PDF files are nearly unreadable, not just on a tablet but even on a powerful desktop PC: they are optimized for minimal file size, and a tablet can take around 5-10 seconds to render a single page. This post shows a method for converting those PDFs into larger files (around 4x the size) that can be browsed comfortably offline, even on a tablet.

The conversion process can be outlined in 3 steps:

  1. Convert the PDF into a series of JPEG images using Ghostscript.
  2. Downscale and (optionally) crop the images with the batch tool of IrfanView.
  3. Convert these images back into a PDF file using the img2pdf Python tool.

An attentive reader might ask: why do the images need to be downscaled at all? Why not generate smaller images with Ghostscript right away?

The answer: Ghostscript’s anti-aliasing is quite poor, so for the best quality it is better to export larger-than-required JPEG images with it and then downscale them with software that has decent anti-aliasing.
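To make the oversampling idea concrete, here is a small sketch of the arithmetic. The page size (6 x 9 inches) and the target height (1280 pixels) are hypothetical numbers chosen only for illustration; they are not taken from the book used in this post.

```python
def rendered_size(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def downscale_factor(rendered_height_px, target_height_px):
    """Fraction by which the rendered image must shrink to reach the target height."""
    return target_height_px / rendered_height_px


# Hypothetical page: 6 x 9 inches, rendered at 300 DPI, shown at 1280 pixels high.
width_px, height_px = rendered_size(6, 9, 300)
factor = downscale_factor(height_px, 1280)
print(width_px, height_px, round(factor, 2))  # prints: 1800 2700 0.47
```

Under these assumptions the exported image is roughly twice as large as needed in each direction, which gives the downscaling software enough pixels to average.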

The steps in detail:

 

1. Convert the PDF into a series of JPEG images using Ghostscript

This example uses Ghostscript in a Cygwin environment (with Console2). Make sure that Ghostscript is properly installed under Cygwin. Download any PDF, for example Cicero’s De Natura Deorum, and save it as “cicero.pdf”.

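Then execute the following command (the -dBATCH switch makes Ghostscript exit after the last page instead of waiting at its interactive prompt):

gs -dNOPAUSE -dBATCH -sDEVICE=jpeg -r300 -sOutputFile=p%03d.jpg cicero.pdf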

This will generate a JPEG image for each page at 300 DPI:
[Screenshot: gs_convert1]
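If you prefer to script this step, the same Ghostscript call can be wrapped in Python. This is only a sketch: the function names build_gs_command and render_pages are made up for this example, and it assumes the gs executable is on the PATH. It passes -dBATCH so that Ghostscript exits after the last page instead of waiting at its prompt.

```python
import subprocess


def build_gs_command(pdf_path, dpi=300, pattern="p%03d.jpg"):
    """Return the Ghostscript argument list that renders each page to a JPEG."""
    return [
        "gs",
        "-dNOPAUSE",               # do not pause after each page
        "-dBATCH",                 # exit when the last page is done
        "-sDEVICE=jpeg",           # write JPEG images
        f"-r{dpi}",                # rendering resolution in DPI
        f"-sOutputFile={pattern}", # %03d becomes the zero-padded page number
        pdf_path,
    ]


def render_pages(pdf_path, dpi=300):
    """Run Ghostscript; raises CalledProcessError if it reports a failure."""
    subprocess.run(build_gs_command(pdf_path, dpi), check=True)
```

Calling render_pages("cicero.pdf") in the directory that contains the PDF should produce the same p001.jpg, p002.jpg, ... files as the command line above.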

Once all 702 pages have been converted, move on to the next step:

2. Downscale and crop the images

Many batch tools exist (such as those of Photoshop or GIMP), but here we’ll use an old, handy program called IrfanView. In the “File” menu, select “Batch Conversion/Rename”:

[Screenshot: irfan_view_1]

Select the input files by navigating to the directory and clicking the “Add all” button. Then select the output directory. Click on “Options” for the JPEG settings:

[Screenshot: irfan_view_2]

Then click “OK”, and on the batch window click “Advanced” for the resize options:

[Screenshot: irfan_view_3]

There are 3 important settings on this page:

  • Select the “resize” option, and set its parameters.
  • Check the “Preserve aspect ratio” option.
  • Also check the “Use Resample function” option, because this enables the anti-aliasing.

Then click “OK”, and on the batch window click “Start Batch”. It will finish quickly.
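For readers without IrfanView, the same downscaling can be approximated with the Pillow library for Python. This is a swapped-in alternative, not the method used in this post. The function names, the target height of 1280 pixels and the JPEG quality of 85 are assumptions made for this sketch, and Pillow has to be installed separately.

```python
from pathlib import Path


def fit_height(width, height, target_height):
    """Return (width, height) scaled to target_height, preserving the aspect ratio."""
    scale = target_height / height
    return max(1, round(width * scale)), target_height


def downscale_folder(src_dir, dst_dir, target_height=1280, quality=85):
    """Resize every p*.jpg from src_dir into dst_dir using a resampling filter."""
    from PIL import Image  # Pillow, imported lazily; assumed to be installed

    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("p*.jpg")):
        with Image.open(path) as img:
            size = fit_height(img.width, img.height, target_height)
            # LANCZOS resampling plays the role of the "Use Resample function" box
            resized = img.resize(size, Image.LANCZOS)
            resized.save(out / path.name, quality=quality)
```

The fit_height helper corresponds to the “Preserve aspect ratio” option: only the height is given, and the width follows from it. Cropping is not covered by this sketch.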

3. Convert the images into a PDF file again

Download the img2pdf tool and either install it the regular way, or just copy the two “.py” files from the “src” folder. Then execute the following command:

python img2pdf.py -o cicero_optimized.pdf *.jpg

The good thing about img2pdf is that it does not decode and re-encode the JPEG files; it inserts them into the PDF unmodified.
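img2pdf can also be used as a library from a Python script. The sketch below assumes a release of img2pdf that provides the img2pdf.convert function (installed with pip, for example); the helper names page_files and images_to_pdf are made up for this example. It relies on the zero-padded file names from step 1, so plain alphabetical sorting keeps the pages in order for up to 999 pages.

```python
from pathlib import Path


def page_files(folder, pattern="p*.jpg"):
    """Return the page images of a folder in page order as path strings."""
    return sorted(str(p) for p in Path(folder).glob(pattern))


def images_to_pdf(folder, output_pdf):
    """Embed the JPEG files of a folder into a single PDF file."""
    import img2pdf  # imported lazily; assumed to provide img2pdf.convert

    files = page_files(folder)
    with open(output_pdf, "wb") as f:
        # convert() returns the finished PDF as bytes
        f.write(img2pdf.convert(files))
```

For example, images_to_pdf(".", "cicero_optimized.pdf") would do the same job as the command above when run from the directory with the downscaled images.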
