HTML to PDF

Generating PDF from HTML in Adobe Experience Manager

Overview

PDF is one of the most common and most informative document formats on the internet. In this article, you will learn how to generate PDFs programmatically from the HTML and CSS of one or more web pages. We will make use of two open-source Java libraries: Flying Saucer and iText.

To modify the HTML dynamically, we use Jsoup, another open-source Java library.

The PDF generator is a component that the author of a website can use to generate a PDF. The generated PDF contains the content of the website, laid out in a similar fashion and order as the page structure. All of this functionality is built on the Java Flying Saucer API and Jsoup.

 

Brief Introduction to Java Flying Saucer API

Flying Saucer (also called XHTML renderer) is a pure Java library for rendering XML, XHTML, and CSS content. Because it can save rendered XHTML to PDF (using iText), Flying Saucer is often used as a server-side library for generating PDF documents. It works with an XML/XHTML document and uses CSS to determine how to lay the document out visually. This CSS may be embedded in the document or linked from it.

Flying Saucer supports print-related features such as pagination and page headers and footers. The API reads the document layout from the CSS, lays the document out, and renders it as a PDF. Using Flying Saucer, we can generate a PDF on the fly that is available for immediate download.

How the PDF Generator Works

  • The interface has a button; clicking it triggers an AJAX call (using the GET method).
  • The author selects the path of the page to convert in a dialog, and that path is sent as a parameter to the servlet.
  • The response is a PDF that opens in a new tab; the user can save it under any name.
  • Requests are made to fetch all the HTML of the page and the CSS applied to it.
  • An ITextRenderer object is used to set the layout and generate the PDF.
  • To render images, a MediaReplacedElementFactory is implemented; it replaces each image element with an iText image element that the Flying Saucer API can render.
  • A utility class built on Jsoup handles all HTML manipulation, i.e. removing certain tags such as buttons and the header and footer of the page.
  • Special CSS is applied in certain cases to make sure the PDF layout looks good.

 

Steps:

  1. We will start by creating a Maven Archetype 10 project in Eclipse. For PDF generation, we will create an AEM component in our project where the author can enter the path of the root page to convert into a PDF.
    Component Structure in CRXDE
    PDF Generator Component in Edit Mode

    We need to add the following dependencies for Flying Saucer, iText and Jsoup to the core pom.xml of the Maven project:

    For these dependencies to take effect, we also need to list them in the Embed-Dependency tag in the core pom.xml file:

     

     

  2. For PDF generation, we need to provide the HTML of the page or pages as a string. Flying Saucer takes the HTML and converts it into a PDF, applying the styles from the CSS files. We pass the page path authored in the component above as a parameter to a Sling Servlet via an AJAX call (using the GET method). Once we have the page path, we can extract the HTML of that page in AEM as a string using the following logic:

    Here, the “filePath” parameter is the full path to the HTML page under /content, including the “.html” extension.
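
The original snippet is not reproduced here, so the following is only a minimal sketch of one way to do this, assuming the rendered page can be requested over HTTP from the AEM instance. The class name PageHtmlFetcher, its method names, and the baseUrl parameter are hypothetical and do not come from the article; authentication and error handling are left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; names are illustrative and not taken from the article. */
final class PageHtmlFetcher {

    private PageHtmlFetcher() {
    }

    /** Turns an authored page path such as /content/site/en into /content/site/en.html. */
    static String toFilePath(String pagePath) {
        if (pagePath == null || !pagePath.startsWith("/content/")) {
            throw new IllegalArgumentException("Expected a page path under /content: " + pagePath);
        }
        return pagePath.endsWith(".html") ? pagePath : pagePath + ".html";
    }

    /** Reads the whole stream and decodes it as UTF-8 text. */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Assumption: the rendered page is reachable at baseUrl + filePath without a login. */
    static String fetchHtml(String baseUrl, String filePath) throws IOException {
        try (InputStream in = URI.create(baseUrl + filePath).toURL().openStream()) {
            return readAll(in);
        }
    }
}
```

A call such as PageHtmlFetcher.fetchHtml(baseUrl, PageHtmlFetcher.toFilePath(pagePath)) would then return the page markup as a string. The same readAll() helper can be reused for stylesheets.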

    Similarly, we need to extract the CSS from all our external style sheets and include it internally with our HTML, since external CSS won't always be applied to the PDF. The best practice is to keep the styles inline or internal. We can use the following piece of code to extract the CSS as a string from all the external links, using the same logic as above.

    Here, cssArray[] is an array of the paths to all the CSS files whose styles we need in our PDF. The extracted CSS is placed at the beginning of our HTML string in a <style> tag.
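
As a rough illustration of the inlining step, the sketch below wraps already-loaded stylesheet texts in a single <style> element and adds it to the HTML string. The class InlineCss and its methods are hypothetical names. One choice here is an assumption rather than the article's behavior: when an opening <head> tag is found, the block is inserted right after it so the result remains a single document, and only otherwise is it placed at the very beginning of the string.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; names are illustrative and not taken from the article. */
final class InlineCss {

    // Matches <head> or <head ...attributes>, but not <header>.
    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    private InlineCss() {
    }

    /** Joins the stylesheet texts in the given order inside one <style> element. */
    static String styleBlock(List<String> cssTexts) {
        StringBuilder css = new StringBuilder("<style type=\"text/css\">\n");
        for (String text : cssTexts) {
            css.append(text).append('\n');
        }
        return css.append("</style>\n").toString();
    }

    /** Adds the style block after the opening <head> tag, or at the start if there is none. */
    static String inline(String html, List<String> cssTexts) {
        String block = styleBlock(cssTexts);
        Matcher head = HEAD_OPEN.matcher(html);
        if (head.find()) {
            return html.substring(0, head.end()) + block + html.substring(head.end());
        }
        return block + html;
    }
}
```

The order of the list is preserved, so a stylesheet placed last can override rules of equal specificity from the earlier ones.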
    Now we can pass the returned string, which contains both the HTML and the CSS, to the following outputPDF() method. This method first converts the string into a W3C DOM Document, which is then passed to the setDocument() method of the ITextRenderer object.
    Once the document is set, you must call layout() to perform the actual layout of the document and then createPDF() to draw the document into a PDF file.
    Upon successful creation of the PDF, an output stream with the application/pdf content type is sent as the response, which opens the PDF in a new browser tab.
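
The first part of that method, turning the string into a W3C DOM Document, needs only the JDK's XML parser. The sketch below shows that conversion in isolation; the class name XhtmlDom is hypothetical, and the renderer calls described above (setDocument(), layout(), createPDF()) are deliberately left out because they require the Flying Saucer library.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper; the name is illustrative and not taken from the article. */
final class XhtmlDom {

    private XhtmlDom() {
    }

    /** Parses a well-formed XHTML string into a W3C DOM Document. */
    static Document parse(String xhtml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Wrap the string in a reader so the parser works on characters, not bytes.
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }
}
```

Note that this parser accepts only well-formed XML. Markup with unclosed tags such as a bare <br> is rejected with a SAXException, so the page HTML has to be cleaned up into XHTML before this step.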

    To render the images in this document, we make use of MediaReplacedElementFactory.
    In Flying Saucer, you have to implement a ReplacedElementFactory so that you can replace the image markup with the image data before rendering.
    The following code snippet gets the original rendition of the images and displays them in the PDF:

    Finally, you need to register your ReplacedElementFactory with Flying Saucer when rendering, using this piece of code in the generatePDF() method:

     

  3. Custom CSS: The way content shows up on a website and the way it should show up in a PDF differ a lot, so to get the right layout in the PDF, we write a separate CSS file with styles specific to the PDF. Because a PDF is paged media, the CSS that applies is the CSS marked with a “media” attribute of “print” or “all”. The path to this CSS file can be added at the end of the array of CSS file paths.
  4. Most PDFs need some kind of header or footer along with a page number. Flying Saucer supports the CSS standards for paged media and recognizes the @page rule in CSS.
    Hence, to add any page properties, we can simply add the rule to the CSS in the HTML, and it will be reflected in the PDF.
    For our use case, we added the properties dynamically by appending the following code to the CSS string; you can also add it in the HTML itself or through JS or jQuery. The following CSS means that the generated PDF will have a page size of 8.5 inches by 11 inches, a margin of 26 mm at the top and 16 mm at the left, right and bottom, and “Page x of y” printed at the top right corner of each page.
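
The original CSS is not reproduced here, so the block below is a reconstruction from the description above rather than the article's exact code. It builds an @page rule with the stated size and margins and a top-right margin box that uses the standard page and pages counters, and appends it to an existing CSS string. The class name PdfPageCss and its methods are hypothetical.

```java
/** Hypothetical helper; names are illustrative and not taken from the article. */
final class PdfPageCss {

    private PdfPageCss() {
    }

    /** Returns an @page rule: 8.5in x 11in, 26mm top margin, 16mm on the other sides. */
    static String pageRule() {
        return String.join("\n",
                "@page {",
                "  size: 8.5in 11in;",
                // Margin shorthand order is top, right, bottom, left.
                "  margin: 26mm 16mm 16mm 16mm;",
                "  @top-right {",
                "    content: \"Page \" counter(page) \" of \" counter(pages);",
                "  }",
                "}",
                "");
    }

    /** Appends the page rule to the CSS collected so far. */
    static String appendTo(String css) {
        return css + "\n" + pageRule();
    }
}
```

Calling PdfPageCss.appendTo(css) before the CSS is wrapped in the <style> tag keeps the page setup in one place, separate from the styles taken from the site.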

  5. To remove unwanted tags such as the header, footer and links, we make use of Jsoup. Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. With methods like remove(), we can remove unwanted tags from our DOM document.
    In the snippet below, we remove the sections of the HTML, such as the header, footer and buttons, that we don’t want to show up in the PDF.

    This way you can get a PDF from the content of your web page. To get a PDF with the content of a page along with its child pages, we can simply get the HTML of all the pages, append them together in series, and feed that HTML to our generatePDF() method.

    References:

  1. https://stackoverflow.com/questions/10316607/render-image-from-servlet-in-flyingsaucer-generated-pdf
  2. https://flyingsaucerproject.github.io/flyingsaucer/r8/guide/users-guide-R8.html#xil_43

 

 
