Converting DjVu to PDF and preserving the table of contents: how is it possible?

9

5

I tried several online and offline tools, but the table of contents (TOC) information was not preserved during the conversion.

I would like to convert a 5000-page Finnish dictionary, which is in DjVu format and has about 5000 hierarchically structured TOC entries for finding words quickly.

Any idea how to preserve the TOC information during DjVu to PDF conversion?

user1198559

Posted 2014-08-23T06:36:47.883

Reputation: 370

Answers

5

Update: user3124688 has coded up this process in the script dpsprep.


I don't know of any tools that will do the conversion for you. You certainly should be able to do it, but it might take a little work. I'll outline the basic process. You'll need the open-source command-line utilities pdftk and djvused (part of DjVuLibre). These are available from your package manager (GNU/Linux) or their websites (Windows, OS X).

  • step 1: convert the file text

    First, use any tool to convert the DJVU file to a PDF (without bookmarks).

    Suppose the files are called filename.djvu and filename.pdf.

  • step 2: extract DJVU outline

    Next, output the DJVU outline data to a file, like this:

    djvused "filename.djvu" -e 'print-outline' > bmarks.out
    

    This file lists the DJVU document's bookmarks in a serialized tree format. In fact it's just an S-expression (SEXPR), and can be easily parsed. The format is as follows:

    file ::= (bookmarks
               <bookmark>*)
    bookmark ::= (name
                   page
                   <bookmark>*)
    name ::= "<character>*"
    page ::= "#<digit>+"
    

    For example:

    (bookmarks
      ("bmark1"
        "#1")
      ("bmark2"
        "#5"
        ("bmark2subbmark1"
          "#6")
        ("bmark2subbmark2"
          "#7"))
      ("bmark3"
        "#9"
        ...))
    
  • step 3: convert DJVU outline to PDF metadata format

    Now we need to convert these bookmarks into the format that pdftk uses for PDF metadata. That file has the following format:

    file ::= <entry>*
    entry ::= BookmarkBegin
              BookmarkTitle: <title>
              BookmarkLevel: <number>
              BookmarkPageNumber: <number>
    title ::= <character>*
    

    So our example would become:

     BookmarkBegin
     BookmarkTitle: bmark1
     BookmarkLevel: 1
     BookmarkPageNumber: 1
     BookmarkBegin
     BookmarkTitle: bmark2
     BookmarkLevel: 1
     BookmarkPageNumber: 5
     BookmarkBegin
     BookmarkTitle: bmark2subbmark1
     BookmarkLevel: 2
     BookmarkPageNumber: 6
     BookmarkBegin
     BookmarkTitle: bmark2subbmark2
     BookmarkLevel: 2
     BookmarkPageNumber: 7
     BookmarkBegin
     BookmarkTitle: bmark3
     BookmarkLevel: 1
     BookmarkPageNumber: 9
    

    Basically, you just need to write a script to walk the SEXPR tree, keeping track of the level, and output the name, page number and level of each entry it comes to, in the correct format.

  • step 4: extract PDF metadata and splice in converted bookmarks

    Once you've got the converted list, output the PDF metadata from your converted PDF file:

    pdftk "filename.pdf" dump_data > pdfmetadata.out
    

    Now, open the file and find the line that begins with NumberOfPages:

    Insert the converted bookmarks after this line. Save the new file as pdfmetadata.in.

  • step 5: create PDF with bookmarks

    Now we can create a new PDF file incorporating this metadata:

    pdftk "filename.pdf" update_info "pdfmetadata.in" output out.pdf
    

    The file out.pdf should be a copy of your PDF with the bookmarks imported from the DJVU file.
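
Steps 3 and 4 above can be sketched in Python roughly as follows. This is only a minimal illustration, not the dpsprep implementation: the function names (parse_outline, to_pdftk, splice) are made up for this example, the parser is hand-rolled for the grammar shown in step 2, and it assumes every page target has the plain "#<digits>" form. Octal escapes in titles are not decoded.

```python
import re

# Tokens of the outline grammar: parentheses, double-quoted strings
# (backslash escapes allowed), and bare symbols such as `bookmarks`.
TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+')


def unquote(token):
    # Strip the surrounding quotes and drop the backslash in front of
    # escaped characters. Octal escapes are NOT decoded (assumption: the
    # outline is plain ASCII or was dumped as UTF-8).
    return re.sub(r'\\(.)', r'\1', token[1:-1])


def parse_outline(text):
    """Return the outline as nested (title, page, children) tuples."""
    tokens = TOKEN.findall(text)
    if not tokens:
        return []  # the dump is empty: no outline
    pos = 0

    def read_list():
        nonlocal pos
        pos += 1  # consume "("
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(read_list())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return items

    def build(raw):
        title, target = unquote(raw[0]), unquote(raw[1])
        # Assumption: targets follow the grammar from step 2 ("#<digit>+").
        match = re.fullmatch(r"#(\d+)", target)
        if match is None:
            raise ValueError(f"unsupported page target: {target!r}")
        return (title, int(match.group(1)), [build(c) for c in raw[2:]])

    root = read_list()  # ["bookmarks", <bookmark>, ...]
    return [build(raw) for raw in root[1:]]


def to_pdftk(nodes, level=1):
    """Walk the tree depth-first and emit pdftk bookmark lines (step 3)."""
    lines = []
    for title, page, children in nodes:
        lines += [
            "BookmarkBegin",
            f"BookmarkTitle: {title}",
            f"BookmarkLevel: {level}",
            f"BookmarkPageNumber: {page}",
        ]
        lines += to_pdftk(children, level + 1)
    return lines


def splice(metadata, bookmark_lines):
    """Insert the bookmark lines after the NumberOfPages: line (step 4)."""
    out = []
    for line in metadata.splitlines():
        out.append(line)
        if line.startswith("NumberOfPages:"):
            out.extend(bookmark_lines)
    return "\n".join(out) + "\n"


EXAMPLE = '''(bookmarks
  ("bmark1" "#1")
  ("bmark2" "#5"
    ("bmark2subbmark1" "#6")
    ("bmark2subbmark2" "#7"))
  ("bmark3" "#9"))'''

if __name__ == "__main__":
    print("\n".join(to_pdftk(parse_outline(EXAMPLE))))
```

For real files, you would read bmarks.out and pdfmetadata.out, pass their contents through parse_outline, to_pdftk and splice, and write the result to pdfmetadata.in. For titles with non-ASCII characters (likely in a Finnish dictionary), it may help to dump the outline with djvused's -u option and to use pdftk's dump_data_utf8 and update_info_utf8 variants; I have not verified that combination on a large file.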

pyrocrasty

Posted 2014-08-23T06:36:47.883

Reputation: 1 332

3

Based on the very clear outline above given by user @pyrocrasty (thank you!), I have implemented a DJVU to PDF converter which preserves both OCR'd text and the bookmark structure. You may find it here:

https://github.com/kcroker/dpsprep

Acknowledgements for the OCR data go to @zetah on the Ubuntu forums!

user3124688

Posted 2014-08-23T06:36:47.883

Reputation: 31

I had a DJVU file with non-numeric text in the bookmark page number fields, so the parser didn't read them. I replaced j.split('#')[1] with (int(re.findall(r'\d+', j.split('#')[1])[0])+1) and it worked great. Debian Jessie needed: sudo apt-get install pdftk djvulibre-bin python-pip ruby ruby-dev libmagickwand-dev; sudo pip install sexpdata; sudo gem install iconv pdfbeads – None – 2016-12-17T23:17:49.977