Extract values of contents inside the Google's PDF viewer - java

(Edited)
I am posting this question to detail why I need to use 'CTRL+F' in Google Chrome to search for a string in a PDF. Can anyone please suggest a solution?
Testing requirement: Access a test application, open the PDF and check whether a test-data string is present in the PDF or not.
Application overview: The test application opens the PDF in the embedded Google PDF viewer HTML view (sample screenshot attached below as 'Sample PDF viewer.jpg'). In the PDF viewer, the PDF pages cannot be captured as web elements. So the only remaining way is to press 'CTRL+F', type the string into the search box and extract the result from the search box.
Attachments:
The element's value which I'm looking to extract
Sample PDF viewer
(Original)
I'm able to open the 'find box' (CTRL + F) in the Chrome browser using the Robot class and enter the desired string in it. How can I extract the number of results from the find box using Selenium in Java?
Attached image highlights the result (in yellow) I want to extract using selenium in java code.
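Chrome's find box is part of the browser's own UI rather than the page DOM, so Selenium cannot read it directly. One possible workaround, sketched here under the assumption that the PDF text can be obtained outside the viewer (for example by downloading the file and extracting its text with a PDF library, which is not shown), is to count the matches yourself. The class and method names below are hypothetical, and the count only approximates what the find box would display (case-insensitive, non-overlapping matches).

```java
import java.util.Locale;

// Hypothetical helper: counts case-insensitive, non-overlapping matches of a
// query in already-extracted text. How the text is extracted is out of scope.
final class FindBarCount {

    static int countMatches(String text, String query) {
        if (text == null || query == null || query.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = query.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = 0;
        // Advance past each match so that matches do not overlap.
        while ((from = haystack.indexOf(needle, from)) >= 0) {
            count++;
            from += needle.length();
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from the PDF under test.
        String extracted = "Invoice 42: total due. INVOICE copy attached.";
        System.out.println(countMatches(extracted, "invoice")); // prints 2
    }
}
```

A test could then assert that countMatches(...) is greater than zero for the expected test data instead of reading the find box.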

Related

PDF Validation task in ASP.Net web Application while uploading

I've been given a PDF validation task in my ASP.NET web application. I need to run a preflight check on uploaded files for the following points:
Check for the presence of a barcode or text in a defined area.
Check for embedded font issues.
Check for image transparency issues.
Check the PDF version.
I have looked at the available options such as iTextSharp, but they do not fulfil my requirements. Please help.
My name is Tilal Ahmad and I am developer evangelist at Aspose.
You may try Aspose.Pdf for .NET to accomplish your requirements:
- Check for presence of barcode or text in a defined area.
To check for text in a defined PDF page region, please see the following documentation link of Aspose.Pdf for .NET and analyze the extracted text string (extractedText).
Extract Text from an particular page region
To check for the presence of a barcode in a defined PDF page area, first convert the specific page region to an image and then use Aspose.Barcode to detect the barcode in that image.
- Check for embedded font issues.
If you meant to embed un-embedded fonts into PDF document by "checking missing embedded fonts...to correct it" then you may try this documentation link of Aspose.Pdf for .NET for the purpose.
Embedding fonts in an existing PDF document
- Check version.
You may load your PDF document into an Aspose.Pdf.Document() object and get the PDF version as follows.
Aspose.Pdf.Document doc = new Aspose.Pdf.Document("input.pdf");
Console.WriteLine("PDF version: {0}",doc.Version);
- Check for image transparency issue.
The image transparency issue needs some investigation. If possible, please post your sample document and details in the Aspose.PDF forum. We will look into it and guide you.
Furthermore, if you want to validate against a PDF/A standard, you may load the PDF document into an Aspose.Pdf.Document() object and use the Validate method.
Validate PDF document for PDFA standard
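As a library-independent cross-check of the version point above, the version declared in the file header can be read directly: a PDF file begins with a marker of the form %PDF-M.m. The following is a minimal Java sketch, not the Aspose API; the class and method names are hypothetical. Note that from PDF 1.4 onwards a /Version entry in the document catalog may override the header value, which this sketch does not inspect.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads the version declared in the "%PDF-M.m" header.
// It does not look at a /Version entry in the document catalog.
final class PdfHeaderVersion {

    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    // Expects the first bytes of the file; only the first 1024 are searched.
    static String fromHeader(byte[] firstBytes) {
        int len = Math.min(firstBytes.length, 1024);
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        String head = new String(firstBytes, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the first bytes of an uploaded file.
        byte[] sample = "%PDF-1.7\n1 0 obj\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("PDF version: " + fromHeader(sample)); // prints PDF version: 1.7
    }
}
```

A null result means no header marker was found in the bytes supplied, which is itself a useful rejection signal at upload time.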

Get text from a section of a pdf page with IcePdf

I'm writing some Java classes to compare libraries for extracting the text of a portion of a PDF page. I have tested iText and PDFBox. Now I want to write a class using IcePdf, but I don't know how, because I can only extract the full page text and not a specific portion.
Can someone help me?

How to Make Existing PDF Text Searchable using any Java Library? With OCR

Is there any Java library, open source or paid, that can make the text of an existing PDF searchable?
How can OCR be applied to a PDF using PDFBox, so that the PDF text becomes searchable programmatically?
I searched a lot and didn't find any solution.
Can anyone post code for OCR with PDFBox?
Try Apache PDFBox.
To extract text: Text extraction.
Any java library? How to make searchable text using any java library? Open source or Paid.
You can achieve this using Gnostice XtremeDocumentStudio for Java. For more details, follow the link below.
http://www.gnostice.com/nl_article.asp?id=289&t=How_to_convert_scanned_images_to_searchable_PDF_in_Java
FYI, in the article, we have demonstrated how to convert scanned image to searchable PDF. In fact the input can be any scanned document (images, PDF or DOCX).
Disclaimer: I work for Gnostice.
You can use PDFBox to extract images from a PDF file, and then use the OCR system of your choice (for example, Tesseract) to obtain the text. Alternatively, if the PDF is mixed text and images, you can use Ghostscript to create an image of each PDF page, and then run OCR.
If you then need a searchable PDF file, build a new PDF by writing the text first, and then drawing the image over top of the text. The text will be searchable, but you will only see the image.
Note that OCR engines like Tesseract and Google Vision will return positional information for each word, so you will be able to place the text in the correct position.
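Placing each OCR word correctly requires converting its pixel box to PDF coordinates. The sketch below assumes the OCR engine reports boxes in pixels with the origin at the top-left of the page image, that the image was rendered at a known DPI, and that the PDF page uses the default user space (origin at the bottom-left, 72 points per inch). The class and method names are hypothetical and the page size is an example value.

```java
// Hypothetical helper: maps an OCR word box (pixels, top-left origin) to the
// lower-left corner of the same box in PDF points (bottom-left origin).
final class OcrToPdfCoords {

    static final double POINTS_PER_INCH = 72.0;

    // Returns {x, y} in PDF points for the lower-left corner of the word box.
    static double[] toPdfPoint(double xPx, double yPx, double heightPx,
                               double dpi, double pageHeightPt) {
        double x = xPx * POINTS_PER_INCH / dpi;
        // The bottom edge of the box in pixels is yPx + heightPx; flip the y axis.
        double y = pageHeightPt - (yPx + heightPx) * POINTS_PER_INCH / dpi;
        return new double[] { x, y };
    }

    public static void main(String[] args) {
        // Example values: a word at pixel (300, 150), 50 px tall, on a page
        // rendered at 300 DPI whose height is 792 points (US Letter).
        double[] p = toPdfPoint(300, 150, 50, 300, 792);
        System.out.println(p[0] + ", " + p[1]); // prints 72.0, 744.0
    }
}
```

The resulting point is where the text would be positioned before the page image is drawn over it.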

How to generate the PDF in CQ5.6.1 using page content in cq5

How can I generate a PDF in CQ 5.6.1 using page content?
My site has a 'Generate PDF' button; on click of the button I have to generate a PDF file using the same page's content.
Please let me know whether there is an out-of-the-box PDF generator in CQ or whether I need to get a licensed product to generate the PDF.
Thanks.
Adobe CQ is integrated with the Apache FOP, a formatter able to create PDF files. This tutorial describes how to enable content rewriter providing PDF version of the content under the .pdf extension.
However, please keep in mind that this approach requires manually writing the XSLT transform file able to process your page (and every component on it) and output the XSL-FO document.
In a previous project (CQ 5.5) we used https://code.google.com/p/wkhtmltopdf/ to create PDF files.. worked pretty good!
I used PhantomJS to create a custom PDF from CQ5 pages, for example when you do not want to display the right trail in the PDF or you want to disable the header and footer. All of this can be achieved with the help of PhantomJS.
Create a servlet which will execute a command on your server:
phantomjs <custom.js> 'page_url' nameofthepdf.pdf
Here custom.js will show or hide HTML content based on your needs.
This will work for any page, whether it comes from CQ5 or any other tool.
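The server-side command execution described above could look roughly like the following Java sketch, which a servlet might call. It assumes the phantomjs binary is on the server's PATH; the class name, the example URL and the output file name are hypothetical. Passing each argument as a separate list element avoids shell quoting problems with the page URL.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds and runs "phantomjs <script> <url> <output.pdf>".
final class PhantomPdfCommand {

    // Each argument is its own list element, so no shell quoting is needed.
    static List<String> build(String scriptPath, String pageUrl, String outputPdf) {
        return Arrays.asList("phantomjs", scriptPath, pageUrl, outputPdf);
    }

    // Runs the command and returns the process exit code (0 usually means success).
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // forward PhantomJS output to the server's console/log
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; call run(...) where PhantomJS is installed.
        List<String> cmd = build("custom.js", "http://localhost:4502/content/page.html", "page.pdf");
        System.out.println(String.join(" ", cmd));
    }
}
```

In a servlet, the page URL should come from trusted configuration or be validated before it is passed to the command.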

Convert html to pdf with linked documents inline

I need to convert a bundle of static HTML documents into a single PDF file programmatically on the server side on a Java/J2EE platform using a batch process preferably. The pdf files would be distributed to site users for offline browsing of the web pages.
The major points of the requirements are:
The banner at the top should not be present in the final pdf document.
The navigation bar on the left should be transformed into pdf bookmarks from html hyperlinks.
All hyperlinked contents (html/pdf/doc/docx etc.) present in the web pages should be part of the final pdf document with pdf bookmarks.
Is there any standard open source way of doing this?
Try Apache FOP. I just used it to convert XML to PDF and I think you can do the same with HTML/DOM. The website has a whole section on running FOP in a Java application and there's example code for DOM to PDF.
You can try iText - but I am not sure whether it handles all that you require.
Moreover, it is always better if you explore many options and then decide what you can and cannot do. In many cases there won't be any library/API that will out of the box support all that you ask for.
You can try www.alt-soft.com Xml2PDF for this
