Jul 13, 2013

Python converting PDF to Image

I have a task to generate thumbnails of uploaded PDF's. And seems like there no really solid decisions yet. Just garbage on the surface in google results. after googling for a while I found out about many ways to do so. E.g. use python stdin/out to run external command line tool. It might work for you to. But it seemed not so pythonic for me. So I have searched fo better decision. My current is for now to install ImageMagick and MagicWand binding.

Install ImageMagick.

I have used PIL a while ago to work with images. But it made me cry, before I have met sorl-thumbnails. It helped me a lot. But now I have to deal with PDF's. And ImageMagick seems like a complete decision to master it all.  It has to convert pdf to my direct desirables - jpeg. So to install ImageMagick I have used brew. Like this:
brew install imagemagick
However there are many other ways to do so, depending on a platform. But I strongly recommend to look at brew.

Anyway installing ImageMagick is tricky. And in order to have it installed to work with pdf's we need to have freetype and ghostscript packages. In case of absence of ghostscript you could have error like so:
wand.exceptions.DelegateError: Postscript delegate failed 'file.pdf': No such file or directory @ error/pdf.c/ReadPDFImage/682
In case of freetype package absence you will have you PDF rendered without fonts. So be sure those 2 are certainly installed.

Installing Wand

There are several high level bindings for ImageMagick for python, But I have chosen wand as my favorable here.
This is strongly depends on a platform. But nowdays fortunately I can do:
pip install Wand
And I'm happy with it.
Wand is simple enough for my task so I can do convert PDF to image and do simple transformations of my choice with it.

Working with it

Now that we have those things installed we may convert a pdf into image and resize it afterwards.
from wand.image import Image
# Converting first page into JPG
with Image(filename="/thumbnail.pdf[0]") as img:
     img.save(filename="/temp.jpg")
# Resizing this image
with Image(filename="/temp.jpg") as img:
     img.resize(200, 150)
     img.save(filename="/thumbnail_resize.jpg")
I'm sure there are better solutions here.  Note this is a simplified example to show the whole point of this method.

Feel free to suggest better solution in comments.

11 comments:

  1. Hey Lurii! Thanks for sharing this incredible piece of information. Let me tell you about my experience. I have been using this JPG to PDF converter. It is quite nice and moreover, its free and cool.

    ReplyDelete
  2. Thanks for the post! I noticed that when I did this, the image quality was rather poor. I tried resizing the image before saving it, but that ruined the image. It basically looked like a blank white page, with just a few specks here and there. Do yo have any advice? Thanks!

    ReplyDelete
  3. on some systems you will also need to install ghostscript in order to read pdf in imagemagic

    ReplyDelete
  4. I had a problem with quality as well converting a multi-page pdf to a bunch of PNGs. Try this:

    Image(filename="/thumbnail.pdf[0]", resolution=300) as img:

    300 worked fine for me since I'm just using OCR on the images, you may need to bump resolution up.

    ReplyDelete
  5. PDFDelegateFailed `The system cannot find the file specified.
    ' @ error/pdf.c/ReadPDFImage/800

    this error is coming !! what should i do man !?

    ReplyDelete
  6. PDFDelegateFailed `The system cannot find the file specified.
    ' @ error/pdf.c/ReadPDFImage/800

    i am getting some path error , how should i solve this ???

    ReplyDelete
  7. okay i solved the problem, sometimes the pdf is in binary form so we need to give commands like this.

    from wand.image import Image
    # Converting first page into JPG
    with Image(filename="/thumbnail.pdf[0]") as img:
    img.save(blob=response.rendered_content)
    # Resizing this image
    with Image(filename="/temp.jpg") as img:
    img.resize(200, 150)
    img.save(filename="/thumbnail_resize.jpg")

    response = PDFTemplateResponse(
    request,
    template='reports/report_email.html',
    filename='weekly_report.pdf',
    context=render_data,
    cmd_options={'load-error-handling': 'ignore'})

    ReplyDelete