Combine multiple PDF files into a single PDF file in Clojure

Packing scattered pieces into a single entity is a task I run into constantly: saving a slide deck that consists of a series of ordered images, for example, or a newspaper that is published as one PDF file per page. Loose files not only clutter the file system but are also hard to navigate, because neither the file time nor the file name is guaranteed to follow the natural page order. To smooth the offline reading experience, it is better to pack them into a single PDF file.

I had used clj-pdf before to pack a set of images into a PDF, so it was the first option that came to mind, and its collate function looked like the thing we want. Here is how to combine PDF files into one file with it:

 
;; Read a whole PDF file into a byte array, one of the input types collate accepts.
(defn read-pdf-bytes [file-path]
  (java.nio.file.Files/readAllBytes
    (java.nio.file.Paths/get file-path (into-array String []))))
 
(require 'clj-pdf.core 'clojure.java.io)

(clj-pdf.core/collate
  {:size :a2} ; every page of the result is forced to this size
  (java.io.FileOutputStream. (clojure.java.io/file "d:\\merged.pdf"))
  (read-pdf-bytes "C:\\Download\\01.pdf")
  (read-pdf-bytes "C:\\Download\\02.pdf")
  (read-pdf-bytes "C:\\Download\\03.pdf")
  (read-pdf-bytes "C:\\Download\\04.pdf"))
 

The problem with this solution is that all the PDF files to be merged must have the same page size, and we must specify that size in our code. Otherwise a page is cropped to fit if it is larger than the specified size, or padded with empty white space if it is smaller. In other words, collate is not the merge I thought it was.
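
Whether a set of files shares one page size can be checked up front by reading each page's media box. The following is only a sketch: it assumes Apache PDFBox 2.0.x is on the classpath (its PDDocument/load entry point is specific to that line of releases), and page-sizes and same-page-size? are hypothetical helper names, not part of clj-pdf.

```clojure
(import '(org.apache.pdfbox.pdmodel PDDocument)
        '(java.io File))

(defn page-sizes
  "Hypothetical helper: returns [width height] in points, rounded to whole
  numbers, for every page of the PDF at `path`."
  [path]
  (with-open [doc (PDDocument/load (File. ^String path))]
    ;; mapv is eager, so all pages are read before the document is closed.
    (mapv (fn [page]
            (let [box (.getMediaBox page)]
              [(Math/round (double (.getWidth box)))
               (Math/round (double (.getHeight box)))]))
          (.getPages doc))))

(defn same-page-size?
  "True when every page of every file in `paths` has the same rounded size."
  [paths]
  (<= (count (distinct (mapcat page-sizes paths))) 1))
```

If same-page-size? returns false for the inputs, collate will crop or pad at least some of the pages.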

To achieve what we want, we need another Clojure library: pdfboxing, a wrapper around PDFBox. Its merge-pdfs function is exactly what we need.

 
(require '[pdfboxing.merge :as pdf])

(pdf/merge-pdfs
  ;; Prefix every file name with the directory; pages are merged in this order.
  :input (mapv #(str "e:\\" %)
               ["01.pdf"
                "02.pdf"
                "03.pdf"
                "04.pdf"
                "05.pdf"
                "07.pdf"
                "06.pdf"
                "08.pdf"])
  :output "e:\\merged.pdf")
 
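Typing the input list by hand gets tedious for a long document. Since file names do not always sort in page order as plain strings ("10.pdf" sorts before "2.pdf"), the list can be built by sorting on the number inside each name. This is a minimal sketch using only the standard library; numeric-key and pdfs-in-page-order are hypothetical helper names, and it assumes the first run of digits in each file name is the page number.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn numeric-key
  "Returns the first run of digits in `file-name` as a long,
  or Long/MAX_VALUE when the name contains no digits."
  [file-name]
  (if-let [digits (re-find #"\d+" file-name)]
    (Long/parseLong digits)
    Long/MAX_VALUE))

(defn pdfs-in-page-order
  "Returns the paths of the .pdf files directly inside `dir`, sorted by the
  number in their names, with the name itself as a tie-breaker."
  [dir]
  (->> (.listFiles (io/file dir))
       (filter #(and (.isFile %)
                     (str/ends-with? (str/lower-case (.getName %)) ".pdf")))
       (sort-by (juxt #(numeric-key (.getName %)) #(.getName %)))
       (mapv #(.getPath %))))
```

The result can be passed directly as the :input argument. Write the merged file to a different directory, or it will be picked up as an input on the next run.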

Unable to delete the PDF files after merging

The code above merges the PDF files as desired, but there is one problem. Once the separate PDF files were merged I no longer needed them, so I tried to delete them, only to be told that the files were open in the JVM and could not be deleted. Evidently the files were opened for merging but never closed to release the resources.

So I looked at the source code of the merge-pdfs function:

 
(defn merge-pdfs
  "merge multiple PDFs into output file"
  [& {:keys [output input]}]
  {:pre [(arg-check output input)]}
  (let [merger (PDFMergerUtility.)]
      (doseq [f input]
        (.addSource merger (FileInputStream. (File. f))))
      (.setDestinationFileName merger output)
      (.mergeDocuments merger)))
 
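The source above wraps every input in a FileInputStream that nothing ever closes, which would explain the lock. On a release without the fix, one possible workaround is to open the streams ourselves and close them once the merge is done. This is a sketch rather than pdfboxing's API: merge-and-release is a hypothetical name, and it assumes PDFBox 2.0.x on the classpath and that the unclosed streams are the only thing keeping the files open.

```clojure
(import '(org.apache.pdfbox.multipdf PDFMergerUtility)
        '(java.io File FileInputStream InputStream))

(defn merge-and-release
  "Hypothetical helper: merges the PDFs at `input` (a seq of paths) into
  `output`, then closes every input stream it opened."
  [input output]
  (let [merger  (PDFMergerUtility.)
        streams (atom [])]
    (try
      (doseq [f input]
        (let [s (FileInputStream. (File. ^String f))]
          ;; Remember each stream so it can be closed even if a later open fails.
          (swap! streams conj s)
          (.addSource merger ^InputStream s)))
      (.setDestinationFileName merger output)
      (.mergeDocuments merger)
      (finally
        ;; Close our own streams whether or not the merge succeeded.
        (doseq [^InputStream s @streams]
          (.close s))))))
```

After (merge-and-release inputs "e:\\merged.pdf") returns, the inputs should be deletable with clojure.java.io/delete-file, provided the assumption about the streams holds.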

The class that performs the merging is PDFMergerUtility. In the latest trunk codebase the problem has already been solved, but not in any release so far, including the newest one, 2.0.9.

The problem was solved in the following PDFBox commit:

 
https://github.com/apache/pdfbox/commit/341083ad02f99c2c00a746108f19c6e597bff4a1#diff-c445f796d4350b59834d5e812187aac9
 

That commit added a new document merge mode, OPTIMIZE_RESOURCES_MODE. The commit time is Apr 15, 2018, 1:20 AM GMT+8. The version I was using is 2.0.6. As of this writing, the newest release is 2.0.9, which was released before the commit and therefore does not contain it. To use the new mode we need the newest trunk snapshot. I tried the following dependency coordinates:

 
[org.apache.pdfbox/pdfbox "trunk-SNAPSHOT"]
[org.apache.pdfbox/pdfbox "LATEST-TRUNK-SNAPSHOT"]
 

Neither of them works. I couldn't find a way to retrieve the latest trunk snapshot of PDFBox with the Clojure build tool.

But I found the latest development snapshot build on pdfbox.apache.org. To use the fix, download the jars from the 3.0.0-SNAPSHOT/ directory, including preflight, fontbox and pdfbox, and put them on the classpath of your Clojure REPL JVM.
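
As an alternative to copying jars by hand, build tools can usually resolve snapshot versions once the repository that hosts them is declared. The fragment below is an untested sketch for Leiningen. It rests on two assumptions that the steps above did not verify: that the PDFBox snapshot is published to the Apache snapshots repository, and that it is published under the version string 3.0.0-SNAPSHOT. The project name and version are placeholders.

```clojure
;; project.clj (sketch; pdf-merge-demo and its version are placeholders)
(defproject pdf-merge-demo "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.9.0"]
                 ;; Assumed version string, taken from the snapshot directory name.
                 [org.apache.pdfbox/pdfbox "3.0.0-SNAPSHOT"]]
  ;; Snapshots are not served by the default repositories, so add one that has them.
  :repositories [["apache-snapshots"
                  {:url "https://repository.apache.org/content/repositories/snapshots/"
                   :snapshots true
                   :releases false}]])
```

If the dependency does not resolve this way, the manual jar download described above still applies.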

Now restart the Clojure REPL, and you can change the merge mode as below:

 
user> (def foo (org.apache.pdfbox.multipdf.PDFMergerUtility.))
#'user/foo
user> (.getDocumentMergeMode foo)
#<DocumentMergeMode PDFBOX_LEGACY_MODE>
user> (.setDocumentMergeMode foo org.apache.pdfbox.multipdf.PDFMergerUtility$DocumentMergeMode/OPTIMIZE_RESOURCES_MODE)
nil
user> (.getDocumentMergeMode foo)
#<DocumentMergeMode OPTIMIZE_RESOURCES_MODE>
user> 
 

Now we can redefine merge-pdfs, or add a new version that allows the merge mode to be specified.

Change the code of merge-pdfs as below:

 
(defn merge-pdfs
  "merge multiple PDFs into output file"
  [& {:keys [output input]}]
  {:pre [(arg-check output input)]}
  (let [merger (PDFMergerUtility.)]
    ;; Switch from the legacy document merge mode to the new optimize mode.
    (.setDocumentMergeMode merger org.apache.pdfbox.multipdf.PDFMergerUtility$DocumentMergeMode/OPTIMIZE_RESOURCES_MODE)
    (doseq [f input]
      (.addSource merger (FileInputStream. (File. f))))
    (.setDestinationFileName merger output)
    (.mergeDocuments merger)))
 

Then re-evaluate the pdfboxing.merge namespace in the REPL. Now you can merge the files and delete the source PDF files without restarting the JVM.
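
The other option mentioned earlier, a separate version that lets the caller pick the merge mode, might look like the sketch below. The names merge-modes and merge-pdfs-with-mode and the keywords :legacy and :optimize are hypothetical, not part of pdfboxing. The sketch assumes the same snapshot jars as the REPL session, and it leaves out the arg-check precondition so that it stands alone.

```clojure
(import '(org.apache.pdfbox.multipdf PDFMergerUtility
                                     PDFMergerUtility$DocumentMergeMode)
        '(java.io File FileInputStream InputStream))

(def merge-modes
  "Hypothetical keyword names for the two DocumentMergeMode constants."
  {:legacy   PDFMergerUtility$DocumentMergeMode/PDFBOX_LEGACY_MODE
   :optimize PDFMergerUtility$DocumentMergeMode/OPTIMIZE_RESOURCES_MODE})

(defn merge-pdfs-with-mode
  "Merges the PDFs in `input` into `output` using `mode`,
  which is :optimize by default."
  [& {:keys [output input mode] :or {mode :optimize}}]
  (let [merger     (PDFMergerUtility.)
        merge-mode (or (get merge-modes mode)
                       (throw (IllegalArgumentException.
                                (str "Unknown merge mode: " mode))))]
    (.setDocumentMergeMode merger merge-mode)
    (doseq [f input]
      (.addSource merger ^InputStream (FileInputStream. (File. ^String f))))
    (.setDestinationFileName merger output)
    (.mergeDocuments merger)))

(comment
  ;; Example call; the paths are the ones used earlier in this post.
  (merge-pdfs-with-mode :input  ["e:\\01.pdf" "e:\\02.pdf"]
                        :output "e:\\merged.pdf"
                        :mode   :optimize))
```

Keeping the original merge-pdfs untouched and adding a second function means existing callers keep the legacy behaviour until they opt in.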