HTML to PDF

Generating PDF from HTML in Adobe Experience Manager

Overview

PDF is one of the most common and most informative document formats on the internet. In this article, you will learn how to generate PDFs programmatically from the HTML and CSS of one or more web pages. We will use two open-source Java libraries: Flying Saucer and iText.

To modify the HTML dynamically, we use jsoup, another open-source Java library.

The PDF generator is a component that a website author can use to generate a PDF. The generated PDF contains the content of the page, laid out in a similar fashion and in the same order as the page structure. This functionality is built on the Flying Saucer API and jsoup.

 

Brief Introduction to Java Flying Saucer API

Flying Saucer (also called XHTML renderer) is a pure Java library for rendering XML and XHTML content styled with CSS. Because it can save rendered XHTML to PDF (using iText), Flying Saucer is often used as a server-side library for generating PDF documents. It takes an XML/XHTML document and uses CSS to determine how to lay it out visually. This CSS may be embedded in the document or linked from it.

Flying Saucer supports print-related features such as pagination and page headers and footers. It reads the document layout from the CSS, lays the document out, and renders it as a PDF. With Flying Saucer we can generate a PDF on the fly and make it available for immediate download.
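Because Flying Saucer works on an XML/XHTML document, the markup it receives has to be well-formed. The following is a minimal sketch, using only the JDK's XML parser, of how one might check that before rendering; the class name XhtmlCheck and its methods are made up for illustration and are not part of Flying Saucer.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks whether markup is well-formed XML.
final class XhtmlCheck {

    /** Parses the markup into a W3C DOM Document; throws SAXException if it is not well-formed. */
    static Document parse(String markup) throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr
        builder.setErrorHandler(new DefaultHandler());
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    /** Returns true if the markup can be parsed as XML, false otherwise. */
    static boolean isWellFormed(String markup) {
        try {
            parse(markup);
            return true;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            return false;
        }
    }
}
```

A self-closed `<br/>` passes this check, while a bare `<br>` does not. Typical web pages contain plenty of markup like the latter, which is why the HTML is cleaned up with jsoup before it is handed to the renderer.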

How the PDF Generator Works

  • The interface has a button; clicking it triggers an AJAX call (using the GET method).
  • The author selects the path of the page to convert in a dialog, and that path is sent to the servlet as a parameter.
  • The servlet responds with a PDF, which opens in a new tab; the user can save it under any name.
  • Internal requests fetch all the HTML of the page and the CSS that is applied to it.
  • An ITextRenderer object is used to set the layout and generate the PDF.
  • To render images, a MediaReplacedElementFactory is implemented; it replaces each image element with an iText image element that the Flying Saucer API can render.
  • A utility class based on jsoup handles all the HTML manipulation, i.e. the removal of certain tags such as buttons and the header and footer of the page.
  • Special CSS is applied in certain cases to make sure the PDF layout looks good.
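The first two points above boil down to a GET request that carries the authored page path as a parameter. As a rough sketch of how such a URL could be assembled, here is a small helper; the servlet path /bin/pdfgenerator and the parameter name pagePath are assumptions made for this example, and URLEncoder.encode(String, Charset) requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the GET URL that the button would call.
final class PdfRequestUrl {

    /** Appends the URL-encoded page path as the "pagePath" query parameter (name assumed). */
    static String build(String servletPath, String pagePath) {
        return servletPath + "?pagePath=" + URLEncoder.encode(pagePath, StandardCharsets.UTF_8);
    }
}
```

For example, PdfRequestUrl.build("/bin/pdfgenerator", "/content/my site/en") yields /bin/pdfgenerator?pagePath=%2Fcontent%2Fmy+site%2Fen, so paths with spaces or special characters reach the servlet intact.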

 

Steps:

  1. We will start by creating a Maven Archetype 10 project in Eclipse. For PDF generation, we will create an AEM component under our project in which the author can set the path of the root page to convert into a PDF.
    Component Structure in CRXDE
    PDF Generator Component in Edit Mode

    We need to add the following dependencies for Flying Saucer, iText and jsoup to the core pom.xml of the Maven project:

    <dependency>
        <groupId>com.lowagie</groupId>
        <artifactId>itext</artifactId>
        <version>2.1.7</version>
    </dependency>
    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>itextpdf</artifactId>
        <version>5.5.6</version>
    </dependency>
    <dependency>
        <groupId>org.xhtmlrenderer</groupId>
        <artifactId>flying-saucer-core</artifactId>
        <version>9.1.11</version>
    </dependency>
    <dependency>
        <groupId>org.xhtmlrenderer</groupId>
        <artifactId>flying-saucer-pdf</artifactId>
        <version>9.1.11</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.9.2</version>
    </dependency>

    For these dependencies to be embedded in the bundle, we also need to list them in the Embed-Dependency tag of the core pom.xml file:

     

    <Embed-Dependency>
        jsoup,
        itext,
        itextpdf,
        flying-saucer-core,
        flying-saucer-pdf
    </Embed-Dependency>

     

  2. To generate the PDF, we need to provide the HTML of the page or pages as a string. Flying Saucer takes the HTML and converts it into a PDF, applying the styles from the CSS files. We pass the page path authored in the component above as a parameter to a Sling servlet via an AJAX call (using the GET method). Given the page path, we can render the page and capture its HTML as a string in AEM using the following logic:
    Your code
    .............
    .................
        // build an internal GET request for the page and render it with WCM mode disabled
        HttpServletRequest req = requestResponseFactory.createRequest("GET", filePath);
        WCMMode.DISABLED.toRequest(req);
        // the rendered markup is written to this in-memory stream
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        HttpServletResponse resp = requestResponseFactory.createResponse(os);
        requestProcessor.processRequest(req, resp, request.getResourceResolver());

        // this is the HTML from the page as a string
        String fileContent = os.toString(CharEncoding.UTF_8);
    		
    .................	
    ..............

    Here, the “filePath” parameter is the full path of the page under /content, including the “.html” extension.
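As a small illustration of that convention, the sketch below turns an authored page path into the path that is requested. The class and method names are hypothetical, and the check for the /content prefix is an assumption based on the description above.

```java
// Hypothetical helper: derives the ".html" request path from an authored page path.
final class PagePaths {

    /** Returns the page path with a ".html" extension; rejects paths outside /content. */
    static String toHtmlPath(String pagePath) {
        if (pagePath == null || !pagePath.startsWith("/content/")) {
            throw new IllegalArgumentException("Expected a page path under /content: " + pagePath);
        }
        // drop a trailing slash so that we never produce "/.html"
        String trimmed = pagePath.endsWith("/")
                ? pagePath.substring(0, pagePath.length() - 1)
                : pagePath;
        return trimmed.endsWith(".html") ? trimmed : trimmed + ".html";
    }
}
```

With this helper, PagePaths.toHtmlPath("/content/mysite/en") returns "/content/mysite/en.html", and a path that already ends in ".html" is returned unchanged.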

    Similarly, we need to extract the CSS from all our external style sheets and include it internally with our HTML, since external CSS is not reliably applied to the PDF. The best practice is to keep the styles inline or internal. We can use the following code to extract the CSS as a string from all the external style sheets, using the same logic as above.

    ...........
    .............
    
    StringBuilder cssString = new StringBuilder();

    // paths of the style sheets whose rules the PDF needs
    String[] cssArray = { /* ...all the stylesheets you want to add, such as: */
            "/etc/designs/external/tether.css",
            "/etc/designs/pdf-generator/clientlib-all.css",
            "/etc/designs/pdf-generator/pdf-styles.css" };

    // extract the content of each CSS file
    for (String cssFile : cssArray) {
        // getStringFromPath wraps the request logic shown above
        cssString.append(getStringFromPath(cssFile, request, requestResponseFactory, requestProcessor));
    }

    // the caller places the returned CSS in a <style> tag at the start of the HTML
    return cssString.toString();

    }
    ..............
    .........
    

    Here, cssArray[] is an array of paths to all the CSS files whose styles we need in our PDF. The extracted CSS is placed at the beginning of our HTML string in a <style> tag.
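The idea of collecting several style sheets and placing them in a <style> tag can be sketched without any AEM APIs. In the example below the loader function stands in for the request logic that fetches each file, and the class and method names are hypothetical. As a slight variation on the description above, the sketch inserts the style block before </head> when the markup has one and only falls back to prepending it otherwise.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: gathers external CSS and embeds it into the HTML string.
final class InlineCss {

    /** Concatenates the content of every style sheet, as returned by the loader. */
    static String collect(List<String> cssPaths, Function<String, String> loader) {
        StringBuilder css = new StringBuilder();
        for (String path : cssPaths) {
            css.append(loader.apply(path)).append('\n');
        }
        return css.toString();
    }

    /** Wraps the CSS in a style tag and places it before </head>, or at the start if there is no head. */
    static String embed(String html, String css) {
        String styleBlock = "<style type=\"text/css\">" + css + "</style>";
        int headEnd = html.indexOf("</head>");
        if (headEnd < 0) {
            return styleBlock + html;
        }
        return html.substring(0, headEnd) + styleBlock + html.substring(headEnd);
    }
}
```

Passing the loader in as a function keeps the string handling separate from the repository access, so this part can be unit-tested without a running AEM instance.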
    Now we can pass the resulting string, which contains both the HTML and the CSS, to the following outputPDF() method. This method first converts the string into a W3C DOM Document and then passes that document to the setDocument() method of an ITextRenderer object.
    Once the document is set, you must call layout() to perform the actual layout of the document and then createPDF() to draw the document into a PDF.
    On success, the PDF is sent as the response with the application/pdf content type, so the browser can open it in a new tab or save it.

    // DocumentException is com.lowagie.text.DocumentException (iText 2.x, used by flying-saucer-pdf)
    public static void outputPDF(String htmlString, SlingHttpServletResponse response, ResourceResolver resourceResolver)
            throws IOException, DocumentException {

        // headers must be set before anything is written to the response body
        response.setHeader("Expires", "0");
        response.setHeader("Cache-Control", "must-revalidate, post-check=0, pre-check=0");
        response.setHeader("Pragma", "public");
        response.setContentType("application/pdf");
        // "attachment" asks the browser to download the file; "inline" displays it in the tab
        response.setHeader("Content-disposition", "attachment; filename=Sample.pdf");

        // jsoup tidies the HTML; W3CDom converts it to the W3C DOM that Flying Saucer expects
        org.jsoup.nodes.Document document = Jsoup.parse(htmlString);
        Document doc = new W3CDom().fromJsoup(document);

        ITextRenderer renderer = new ITextRenderer();
        // resolve <img> elements from the repository (see MediaReplacedElementFactory below)
        renderer.getSharedContext().setReplacedElementFactory(new MediaReplacedElementFactory(renderer.getSharedContext().getReplacedElementFactory(), resourceResolver));
        renderer.setDocument(doc, null);
        renderer.layout();

        OutputStream os = response.getOutputStream();
        renderer.createPDF(os, false);
        // complete the PDF
        renderer.finishPDF();

        os.flush();
        os.close();
    }

    To render the images in the document, we use a MediaReplacedElementFactory.
    Flying Saucer lets you implement a ReplacedElementFactory so that you can replace markup with image data before rendering.
    The following code reads the original rendition of each image and places it in the PDF:

    import java.io.IOException;
    import java.io.InputStream;
    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import org.apache.commons.io.IOUtils;
    import org.apache.sling.api.resource.Resource;
    import org.apache.sling.api.resource.ResourceResolver;
    import org.w3c.dom.Element;
    import org.xhtmlrenderer.extend.FSImage;
    import org.xhtmlrenderer.extend.ReplacedElement;
    import org.xhtmlrenderer.extend.ReplacedElementFactory;
    import org.xhtmlrenderer.extend.UserAgentCallback;
    import org.xhtmlrenderer.layout.LayoutContext;
    import org.xhtmlrenderer.pdf.ITextFSImage;
    import org.xhtmlrenderer.pdf.ITextImageElement;
    import org.xhtmlrenderer.render.BlockBox;
    import org.xhtmlrenderer.simple.extend.FormSubmissionListener;
    import com.lowagie.text.BadElementException;
    import com.lowagie.text.Image;

    /**
     * Replaces <img> elements whose src is a repository path with the binary
     * data of the image's original rendition.
     */
    public class MediaReplacedElementFactory implements ReplacedElementFactory {

        private static final String ORIGINAL_RENDITION = "jcr:content/renditions/original/jcr:content";

        private final ReplacedElementFactory superFactory;
        private final ResourceResolver resourceResolver;

        // two-argument constructor, matching the call in outputPDF()
        public MediaReplacedElementFactory(ReplacedElementFactory superFactory, ResourceResolver resourceResolver) {
            this.superFactory = superFactory;
            this.resourceResolver = resourceResolver;
        }

        @Override
        public ReplacedElement createReplacedElement(LayoutContext layoutContext, BlockBox blockBox, UserAgentCallback userAgentCallback, int cssWidth, int cssHeight) {
            Element element = blockBox.getElement();
            if (element == null) {
                return null;
            }

            // Replace any img tag that points into the repository with the image's binary data.
            if ("img".equals(element.getTagName())) {
                String imageSrc = element.getAttribute("src");
                if (imageSrc != null && imageSrc.startsWith("/")) {
                    imageSrc = imageSrc.replace("_jcr_content", "jcr:content");
                    Resource original = this.resourceResolver.resolve(imageSrc).getChild(ORIGINAL_RENDITION);
                    Node node = (original != null) ? original.adaptTo(Node.class) : null;
                    if (node != null) {
                        try (InputStream input = node.getProperty("jcr:data").getBinary().getStream()) {
                            FSImage fsImage = new ITextFSImage(Image.getInstance(IOUtils.toByteArray(input)));
                            if (cssWidth != -1 || cssHeight != -1) {
                                fsImage.scale(cssWidth, cssHeight);
                            }
                            return new ITextImageElement(fsImage);
                        } catch (IOException | RepositoryException | BadElementException e) {
                            // the image could not be read: fall through to the default factory
                        }
                    }
                }
            }

            // everything else is handled by the factory we wrap
            return this.superFactory.createReplacedElement(layoutContext, blockBox, userAgentCallback, cssWidth, cssHeight);
        }

        @Override
        public void reset() {
            this.superFactory.reset();
        }

        @Override
        public void remove(Element e) {
            this.superFactory.remove(e);
        }

        @Override
        public void setFormSubmissionListener(FormSubmissionListener listener) {
            this.superFactory.setFormSubmissionListener(listener);
        }
    }
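The path handling in the factory is easy to get wrong, so it can help to look at it in isolation. The sketch below reproduces only the string logic (the leading-slash check, the _jcr_content replacement and the original-rendition suffix) in a hypothetical helper class; it does not touch the repository.

```java
import java.util.Optional;

// Hypothetical helper: the path logic of the image factory, without any JCR access.
final class ImagePaths {

    /** Maps an img src to a repository path; external or missing sources yield an empty result. */
    static Optional<String> toRepositoryPath(String src) {
        if (src == null || !src.startsWith("/")) {
            return Optional.empty();
        }
        // URLs use "_jcr_content" where the repository node is named "jcr:content"
        return Optional.of(src.replace("_jcr_content", "jcr:content"));
    }

    /** Returns the path of the node that holds the binary of the original rendition. */
    static String originalRenditionPath(String assetPath) {
        return assetPath + "/jcr:content/renditions/original/jcr:content";
    }
}
```

For instance, a src of "/content/site/_jcr_content/image.png" maps to "/content/site/jcr:content/image.png", while an absolute URL such as "http://example.com/a.png" is left to the default factory.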
    

    Finally, register your ReplacedElementFactory with Flying Saucer before rendering, using this line in the outputPDF() method:

    renderer.getSharedContext().setReplacedElementFactory(new MediaReplacedElementFactory( renderer.getSharedContext().getReplacedElementFactory(), resourceResolver));

     

  3. Custom CSS: The way content should appear in a PDF differs a lot from the way it appears on a website, so to get a good layout in the PDF we write a separate CSS file with styles specific to the PDF. Because a PDF is paged media, the CSS that applies is the CSS whose “media” attribute is “print” or “all”. The path to this CSS file can be added at the end of the array of CSS file paths.
  4. Most PDFs need some kind of header or footer along with page numbers. Flying Saucer supports the CSS standards for paged media and recognizes the @page rule in the CSS.
    Hence, to set any page properties, we can simply add them to the CSS and they will be reflected in the PDF.
    For our use case, we added the properties dynamically by appending the following rule to the CSS string; you can also add it in the HTML itself, or through JS or jQuery. With the following style, the generated PDF has a page size of 8.5 by 11 inches, a margin of 26 mm at the top and 16 mm at the left, right and bottom, and “Page x of y” printed at the top right corner of every page.

    StringBuilder css = new StringBuilder();

    // the quotes inside the CSS must be escaped in the Java string literal
    css.append("@page { size: 8.5in 11in; margin: 26mm 16mm 16mm 16mm; "
            + "@top-right { content: \"Page \" counter(page) \" of \" counter(pages); } }");
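If the page size or margins need to vary, the same rule can be assembled from parameters. This is only a sketch with hypothetical names; it produces the rule described above when called with the values from this article.

```java
// Hypothetical helper: builds an @page rule with a top-right margin box.
final class PageCss {

    /** Assembles the @page rule from a size, a margin shorthand and the top-right content. */
    static String pageRule(String size, String margin, String topRightContent) {
        return "@page { size: " + size + "; margin: " + margin + "; "
                + "@top-right { content: " + topRightContent + "; } }";
    }

    /** CSS content expression that prints "Page x of y". */
    static String pageXofY() {
        return "\"Page \" counter(page) \" of \" counter(pages)";
    }
}
```

Calling PageCss.pageRule("8.5in 11in", "26mm 16mm 16mm 16mm", PageCss.pageXofY()) gives the letter-size rule used in this step; the margin shorthand follows the usual CSS order of top, right, bottom, left.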
  5. To remove unwanted tags such as the header, footer and links, we use jsoup, a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. With methods like remove(), we can remove unwanted tags from the document.
    In the snippet below we remove the sections of the HTML that we do not want to show up in the PDF, such as the header, footer and buttons.

    // Parse the HTML (assumed here to be the page markup as a String).
    // Jsoup.parse(String, String) treats its second argument as a base URI,
    // not a charset, so the one-argument overload is used.
    String[] arr = { "header", "button", "footer", "script", "title" /* , ...all the elements you want to remove */ };
    Document doc = Jsoup.parse(html);

    // removing tags that are not required
    for (String tag : arr) {
        doc.select(tag).remove();
    }
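For readers who want to try the idea without adding jsoup to the classpath, the same removal can be sketched with the JDK's own DOM API. This is a swapped-in technique rather than the jsoup code above: it only accepts well-formed XHTML, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: removes elements by tag name using the JDK DOM API instead of jsoup.
final class TagStripper {

    /** Parses well-formed XHTML; wraps parser errors in an unchecked exception. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed XHTML", e);
        }
    }

    /** Removes every element with one of the given tag names; returns how many were detached. */
    static int removeTags(Document doc, String... tagNames) {
        int removed = 0;
        for (String tag : tagNames) {
            NodeList matches = doc.getElementsByTagName(tag);
            // the NodeList is live, so keep removing the first match until none are left
            while (matches.getLength() > 0) {
                Node node = matches.item(0);
                node.getParentNode().removeChild(node);
                removed++;
            }
        }
        return removed;
    }
}
```

The loop removes item(0) repeatedly instead of iterating by index because the list shrinks as nodes are detached. jsoup's select(...).remove() hides that detail, which is one reason it is the more convenient choice here.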
    

    This way you can generate a PDF from the content of your web page. To get a PDF with the content of a page along with its child pages, we can simply get the HTML of all the pages, append it in order, and feed the combined HTML to our outputPDF() method.
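One way to append several pages is sketched below. It assumes that the body markup of each page has already been extracted as a string, and it starts every page after the first on a new PDF page using the CSS page-break-before property; that page break is an addition for this example and is not part of the component described above. The class name is hypothetical.

```java
import java.util.List;

// Hypothetical helper: joins the body markup of several pages into one document.
final class MultiPageHtml {

    /** Wraps each page body in a div; every div after the first forces a page break. */
    static String combine(List<String> pageBodies) {
        StringBuilder html = new StringBuilder("<html><head></head><body>");
        for (int i = 0; i < pageBodies.size(); i++) {
            String style = (i == 0) ? "" : " style=\"page-break-before: always\"";
            html.append("<div").append(style).append('>')
                .append(pageBodies.get(i))
                .append("</div>");
        }
        return html.append("</body></html>").toString();
    }
}
```

Building a single document with one html root keeps the input predictable; concatenating complete documents would leave it to the HTML parser to merge several html and body elements.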

    References:

  1. https://stackoverflow.com/questions/10316607/render-image-from-servlet-in-flyingsaucer-generated-pdf
  2. https://flyingsaucerproject.github.io/flyingsaucer/r8/guide/users-guide-R8.html#xil_43

 

 
