Extracting text between two bookmarks using Apache PDFBox (Java)

I am using Apache PDFBox to read a PDF document whose hierarchy is defined by bookmarks. The hierarchy is a tree, with content only at the leaf level.
I am trying to extract the text between two leaf-level bookmarks using the following calls:
Stripper.setStartBookmark(),
Stripper.setEndBookmark(),
Stripper.writeText()
However, this returns the text of the whole page instead. In short, my problem is similar to the one mentioned in this thread.
Is there a way to extract the contents between two bookmarks?
If so, what should I change in my code?

I am guessing that your bookmark does not contain the correct data.
It sounds like the bookmark you are using is only pointing to the page where your content starts, rather than a location on the page.
Here is an example of a bookmark that contains location data:
<Title Action="GoTo" Style="bold" Page="2 FitH 518">
Title Name
</Title>
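To make the difference concrete, here is a minimal sketch in plain Java (no PDF library involved) that reads a destination string in the form shown above and reports whether it carries a vertical offset. The class and method names are hypothetical, and the field layout (page number, fit mode, then coordinates) is an assumption based on the example; other tools may export bookmarks differently.

```java
import java.util.OptionalDouble;

public class BookmarkDestination {

    /** The page number is assumed to be the first token, e.g. 2 in "2 FitH 518". */
    static int pageNumber(String pageAttr) {
        return Integer.parseInt(pageAttr.trim().split("\\s+")[0]);
    }

    /**
     * Returns the top coordinate if the destination has one.
     * Assumed layouts: "page FitH top", "page FitBH top", "page XYZ left top zoom".
     * Anything else (e.g. "page Fit") only points at the page as a whole.
     */
    static OptionalDouble topOffset(String pageAttr) {
        String[] parts = pageAttr.trim().split("\\s+");
        if (parts.length < 2) {
            return OptionalDouble.empty();
        }
        int topIndex;
        switch (parts[1]) {
            case "FitH":
            case "FitBH":
                topIndex = 2;
                break;
            case "XYZ":
                topIndex = 3;
                break;
            default:
                return OptionalDouble.empty();
        }
        if (parts.length <= topIndex) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(parts[topIndex]));
        } catch (NumberFormatException e) {
            // A placeholder such as "null" means no usable position was stored.
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        for (String dest : new String[] {"2 FitH 518", "2 Fit"}) {
            System.out.println(dest + " -> page " + pageNumber(dest)
                    + ", has position: " + topOffset(dest).isPresent());
        }
    }
}
```

If the bookmarks in your document look like the second case, there is no on-page location to start or stop at, which would explain getting the whole page back.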

Related

Split PDF Into Separate Files By Child Bookmarks

I am trying to split a PDF file (a book) into multiple files by child bookmarks, in code.
Use case: the table of contents of a book is available to a user. The user can select up to n sections (not necessarily sequential) to preview. The application needs to extract these sections and merge them into a single preview PDF.
While looking for a solution online I found a few tools: Aspose, Spire (E-IceBlue), etc.
All of them can split a PDF by pages (top-level bookmarks), but I need to split by child bookmarks. This means the area to extract can start and/or end in the middle of a page.
Ideally I would be able to do this in Java code, but if someone knows a solution in any other programming language or a CLI program, that would also be great.
It depends whether you insist that the non-chosen content on a page be redacted or not. For example, if section 6.3.2 takes up the middle half of a page, do you care if the end of 6.3.1 and the beginning of 6.3.3 are shown in the output on the same page?
If you don't care, cpdf can do this easily. Just output the bookmark data as JSON:
cpdf -list-bookmarks-json -utf8 in.pdf > marks.json
Then you can parse this JSON to show the list of bookmarks, and choose which pages to extract based on child bookmark page numbers.
As for redaction, you could use -add-rectangle or -hard-box to clean up the output based on the coordinates from the JSON bookmarks file, but that's not real redaction -- it just removes the content from view.
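The page-selection step can be sketched roughly as follows, in plain Java. This assumes the bookmarks have already been parsed from the JSON file into (level, title, page) entries; the Bookmark class, the method names, and the sample data in the usage note are hypothetical. A section is taken to run from its own page up to and including the page where the next bookmark at the same or a higher level starts, since it may end partway down that page.

```java
import java.util.ArrayList;
import java.util.List;

public class BookmarkPageRanges {

    /** Hypothetical holder for one parsed bookmark entry. */
    static final class Bookmark {
        final int level;   // 0 = top level, 1 = child, ...
        final String title;
        final int page;    // 1-based page the bookmark points to

        Bookmark(int level, String title, int page) {
            this.level = level;
            this.title = title;
            this.page = page;
        }
    }

    /** Returns {firstPage, lastPage}, 1-based and inclusive, for the bookmark at index. */
    static int[] pageRange(List<Bookmark> marks, int index, int lastPageOfDocument) {
        Bookmark start = marks.get(index);
        for (int i = index + 1; i < marks.size(); i++) {
            Bookmark next = marks.get(i);
            if (next.level <= start.level) {
                // Keep the page where the next section begins: this section
                // may end in the middle of it.
                return new int[] {start.page, Math.max(start.page, next.page)};
            }
        }
        // No later bookmark at this level: the section runs to the end of the document.
        return new int[] {start.page, lastPageOfDocument};
    }

    /** Joins the ranges of the chosen bookmarks into a string such as "10-12,12". */
    static String toRangeSpec(List<Bookmark> marks, List<Integer> chosen, int lastPageOfDocument) {
        List<String> parts = new ArrayList<>();
        for (int index : chosen) {
            int[] r = pageRange(marks, index, lastPageOfDocument);
            parts.add(r[0] == r[1] ? String.valueOf(r[0]) : r[0] + "-" + r[1]);
        }
        return String.join(",", parts);
    }
}
```

For example, with sections 6.1 (page 10), 6.2 (page 12) and 6.3 (page 12), selecting 6.1 gives pages 10 to 12 and selecting 6.2 gives page 12 only. If the next section happens to start at the very top of a page, this sketch includes one page too many, which fits the "don't care" case described above.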

How do I use CSS selectors in JSoup to select all elements containing images, including in a data-src (HTML5 dataset) attribute?

I'm trying to use JSoup to parse any web page and programmatically identify the elements that are content blocks, defined as any element that occurs multiple times and contains text, a link, and an image. All was going well until I got to http://fansided.com/. Images on this page show up not in an <img> tag, but in an attribute like data-background="http://cdn.fansided.com/wp-content/blogs.dir/314/files/2015/01/8O7hjxQ-268x150.png".
Is there a way to use a single CSS selector (perhaps a regex?) that will select all elements containing images, regardless of their type?
Try this:
Document doc = Jsoup.connect("http://fansided.com/").userAgent("Mozilla").get();
Elements select = doc.select("[data-background],[style~=background:url]");
It selects any element that has a data-background attribute, or whose style attribute matches the regular expression background:url.

how to identify corresponding html object using label using java

I have an HTML file and a label name, and I need to identify the corresponding HTML object.
Can you please help me identify the object?
I am using jsoup to parse the file.
I could not attach a screenshot.
The page has a top row of labels, with the HTML objects below them:
program, study, study status, study manager (all are labels, with the HTML objects below)
text box, dropdown, dropdown, text box (HTML objects)
When working with HTML, it pays to be aware of the structure of the HTML elements, and not how they are rendered on-screen. In your case, you will need to find a way to identify the elements you seek in the HTML code, and then use one of the Document#getElement(s|)By(.+) methods to find them.

creating java help using single HTML file

I have one HTML file that contains 200 definitions, and I don't want to create 200 HTML files. I want to create Java help from that file so that when the user clicks an entry in the TOC (table of contents) list, they reach that particular definition without scrolling through the HTML documentation.
First study how to build a tree-like structure using JTree, and then study how to show an HTML page in a JFrame using JEditorPane.
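A minimal sketch of how the two pieces might fit together is shown below. It is an assumption-laden example, not a complete help system: the file name help.html, the Topic class, and the sample entries are hypothetical. Each tree node carries the name of an anchor in the single HTML file, and selecting a node calls JEditorPane.scrollToReference to jump to it.

```java
import java.io.File;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.JEditorPane;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JSplitPane;
import javax.swing.JTree;
import javax.swing.SwingUtilities;
import javax.swing.tree.DefaultMutableTreeNode;

public class SingleFileHelp {

    /** One TOC entry: the label shown in the tree and the anchor name in the HTML file. */
    static final class Topic {
        final String title;
        final String anchor;

        Topic(String title, String anchor) {
            this.title = title;
            this.anchor = anchor;
        }

        @Override
        public String toString() {
            return title; // JTree uses toString() as the node label
        }
    }

    /** Builds a flat TOC tree with one child node per definition. */
    static DefaultMutableTreeNode buildToc(Map<String, String> titleToAnchor) {
        DefaultMutableTreeNode root = new DefaultMutableTreeNode("Definitions");
        for (Map.Entry<String, String> e : titleToAnchor.entrySet()) {
            root.add(new DefaultMutableTreeNode(new Topic(e.getKey(), e.getValue())));
        }
        return root;
    }

    public static void main(String[] args) {
        // Hypothetical entries; in practice there would be one per definition.
        Map<String, String> toc = new LinkedHashMap<>();
        toc.put("Bookmark", "bookmark");
        toc.put("Outline", "outline");

        SwingUtilities.invokeLater(() -> {
            JEditorPane pane = new JEditorPane();
            pane.setEditable(false);
            try {
                pane.setPage(new File("help.html").toURI().toURL()); // assumed file name
            } catch (IOException ex) {
                pane.setText("Could not load help.html: " + ex.getMessage());
            }

            JTree tree = new JTree(buildToc(toc));
            tree.addTreeSelectionListener(event -> {
                Object node = tree.getLastSelectedPathComponent();
                if (node instanceof DefaultMutableTreeNode) {
                    Object topic = ((DefaultMutableTreeNode) node).getUserObject();
                    if (topic instanceof Topic) {
                        pane.scrollToReference(((Topic) topic).anchor);
                    }
                }
            });

            JFrame frame = new JFrame("Help");
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.add(new JSplitPane(JSplitPane.HORIZONTAL_SPLIT,
                    new JScrollPane(tree), new JScrollPane(pane)));
            frame.setSize(800, 600);
            frame.setVisible(true);
        });
    }
}
```

As far as I know, scrollToReference looks for named anchors, so each definition in the HTML file would need something like <a name="bookmark"></a> in front of it; an id attribute alone may not be picked up.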

Replace text with an image docx4j

I have a Word template. It contains the word "photo", which has to be replaced with an image. This has to be done with docx4j.
How do I do this?
If you are specifically looking to replace text with an image (which is not possible using docx4j, as answered above), you can replace a bookmark with an image as an alternative.
Just open your Word template, position the cursor at the desired location, choose Insert -> Bookmark, and name your bookmark.
I followed the instructions here to replace this bookmark with an image.
Disclosure: I manage the docx4j project
The VariableReplace code doesn't handle images.
The best way to do this would be to use data bound content controls, specifically a picture content control pointing via XPath at a base-64 encoded image in an XML document (see Getting Started for details).
However, if you want to replace a word with an image, you can do so, but you'll have to write a bit of glue code. It is pretty straightforward.
First, find the word. You can do this using XPath or TraversalUtil (again, see Getting Started for details).
Hopefully it is in a run (w:r/w:t) by itself. If not, you'll need to split the run up so you don't replace adjacent text.
Then, add the image. See the sample ImageAdd.
I suggest you have a look at the XML created when you add an image in Word (ie save and unzip your docx, then look at document.xml). Take care that the XML representing the image is at the correct level (eg child of w:p).
