I convert HTML files to PDF using the Flying Saucer project. These documents contain repetitive information (premises and their addresses); let's call them elements. At the end of a document I need to create an index. Each index entry should have a page number referring to the page where the element was added. The number of elements that fit on one page varies.
How can I create a document index? Or how can I get notified when the library adds a certain type of HTML element to the PDF document?
Try this:
In CSS
ol.toc a::after { content: leader('.') target-counter(attr(href), page);}
In HTML
<h1>Table of Contents</h1>
<ol class='toc'>
<li><a href="#chapter1">Loomings</a></li>
<li><a href="#chapter2">The Carpet-Bag</a></li>
<li><a href="#chapter3">The Spouter-Inn</a></li>
</ol>
<div id="chapter1">Loomings</div>
I found a possible answer. You have to start working with the org.xhtmlrenderer.render.BlockBox class. The method public void layout(LayoutContext c, int contentStart) places an HTML element in the PDF document. It runs over an element several times, and the page number is valid only after the last iteration.
If you mark the elements you want to index, for example with a class attribute, you can get the page number with the following code:
String cssClass = getElement().getAttribute("class");
if ("index".equals(cssClass)) { // only elements marked with class="index"
int pageNumber = c.getRootLayer().getPages().size();
/* ... */
}
I wrote a method that inserts a div containing the text passed as a parameter.
Then I noticed I need to add various HTML content into that div. The current method boils down to these five lines:
//engine is the WebEngine object of some WebView object
Node html = engine.getDocument().getChildNodes().item(0);
Node body = html.getChildNodes().item(1);
Element e = engine.getDocument().createElement("div");
e.setTextContent(msg);
body.appendChild(e);
So here comes my question. Is there a way of parsing some HTML content into an Element object, so I can append that element to the document?
Example HTML String: <b>SomeText</b>
I solved the problem with JavaScript! I can append any HTML data with JS.
Example:
engine.executeScript("document.body.innerHTML += '<div><b>SomeText</b></div>' ");
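One caveat with building the script by string concatenation: if the HTML comes from a variable, a single quote, backslash or line break in it ends the JavaScript string early. Below is a minimal sketch of escaping the text first. ScriptBuilder, toJsLiteral and buildAppendScript are hypothetical names, not part of JavaFX; the resulting string would be passed to engine.executeScript(...) as in the example above.

```java
class ScriptBuilder {

    // Hypothetical helper: turns arbitrary text into a single-quoted
    // JavaScript string literal by escaping the characters that would
    // otherwise terminate or corrupt the literal.
    static String toJsLiteral(String s) {
        StringBuilder sb = new StringBuilder("'");
        for (char ch : s.toCharArray()) {
            switch (ch) {
                case '\\': sb.append("\\\\"); break; // backslash
                case '\'': sb.append("\\'");  break; // single quote
                case '\n': sb.append("\\n");  break; // line feed
                case '\r': sb.append("\\r");  break; // carriage return
                default:   sb.append(ch);
            }
        }
        return sb.append("'").toString();
    }

    // Builds the same kind of script as the one-liner above, but from a variable.
    static String buildAppendScript(String html) {
        return "document.body.innerHTML += " + toJsLiteral(html) + ";";
    }

    public static void main(String[] args) {
        // Assumed usage: engine.executeScript(buildAppendScript(msg));
        System.out.println(buildAppendScript("<div><b>It's SomeText</b></div>"));
    }
}
```

This only covers the characters handled in the switch; it is not a complete JavaScript escaper.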
I recently created a tool for this; I hope it helps.
https://github.com/graycatdeveloper/JavaFXHtmlText
Hi, I am using Jsoup to parse an HTML file. After parsing, I want to check whether the file contains the <html> tag. I am using the following code to check that:
htmlDom = parser.parse("<p>My First Heading</p>clk");
Elements pe = htmlDom.select("html");
System.out.println("size "+pe.size());
The output I get is "size 1" even though there is no HTML tag present. My guess is that this is because the HTML tag is not mandatory and is therefore implicit. The same applies to the head and body tags. Is there any way I could check for sure whether these tags are present in the input file?
Thank you.
It does not return 1 because the tag is implicit, but because it is present in the Document object htmlDom after you have parsed the custom HTML.
That is because Jsoup tries to conform to the HTML5 parsing rules, and thus adds missing elements and tries to fix a broken document structure. I'm quite sure you would get a 1 in return if you were to run the following as well:
Elements pe = htmlDom.select("head");
System.out.println("size "+pe.size());
To parse the HTML without Jsoup trying to clean it or make it valid, you can instead use the included XML parser (Parser.xmlParser()), as below, which parses the HTML as it is.
String customHtml = "<p>My First Heading</p>clk";
Document customDoc = Jsoup.parse(customHtml, "", Parser.xmlParser());
So, as opposed to your assumption in the comments of the question, this is very much possible to do with Jsoup.
I am parsing an HTML page using jsoup. Here is what I did so far:
doc = Jsoup.connect("http://www.marketimyilmazlar.com/index.php?route=product/category&path=141_77").get();
Element page_clips = doc.getElementById("page_clips");
Element page_clip_content = page_clips.getElementById("content");
Elements allProductPricesOnPage = page_clip_content.getElementsByClass("price");
Now, when I write:
allProductNamesOnPage.get(0);
it returns the following:
<div class="name">
<a href="http://www.marketimyilmazlar.com/index.php?route=product/product&path=141_77&product_id=4309"> here is the text</a>
</div>
What I want to do is get the "here is the text" part of that object. Can anyone help me with this?
Thanks
You might want to iterate over the Elements you have gathered and print their prices one by one:
Elements allProductPricesOnPage = page_clip_content
.getElementsByClass("price");
for (Element el : allProductPricesOnPage) {
System.out.println(el.text());
}
This gives:
19.99 TL KDV Dahil
9.99 TL KDV Dahil
14.99 TL KDV Dahil
This works because Elements implements Iterable (see the Javadoc), which gives you access to the individual Element objects in the collection.
Each of these Element objects, which repeat within your HTML, holds the information you want to extract.
If you want to extract only the text, you can call the text() method:
String text = allProductNamesOnPage.get(0).text();
This method gets the text of an Element and its combined children. So if you want to ensure that you are only extracting text from the a element, call text() on the first child element:
String text = allProductNamesOnPage.get(0).child(0).text();
See here: http://jsoup.org/cookbook/extracting-data/attributes-text-html
For example, I have HTML content like this:
<div>go to the text from here.<br> from there <br> Go to the text</div>
In the above content, I want to wrap each occurrence of the word "text" in a span tag, as in the expected output below, using Java.
I'm using the org.w3c.dom package.
I tried the following without success:
Element e = doc.createElement("span");
String text = preElement.getTextContent();
if (text.indexOf("text") >= 0) {
e.setTextContent("text");
}
// Afterwards, how do I insert this into the document? How do I use the insertBefore method for text in the middle?
Expected Output:
<div>go to the <span>text</span> from here.<br> from there <br> Go to the <span>text</span></div>
Please help.
You have to use the splitText method on your text node to split it into three nodes (two calls, one at each end of the word), isolating the word you need to wrap in your element. Then you only have to replace the text node you just isolated (use replaceChild) with the new element. There is no need to create a new text node; you can simply put the one you removed into the element you added.
Java implementation reference: http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Text.html#splitText%28int%29 and http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Node.html#replaceChild%28org.w3c.dom.Node,%20org.w3c.dom.Node%29
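The splitText/replaceChild steps described above can be sketched roughly as follows. This is only an illustration under some assumptions: SpanWrapper, parseFragment and wrapWord are hypothetical names, the input is well-formed XML (so <br/> instead of <br>), and only text nodes that are direct children of the given element are processed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class SpanWrapper {

    // Hypothetical helper: parses a well-formed XML fragment and returns its root element.
    static Element parseFragment(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
    }

    // Wraps every occurrence of word found in the direct text children of parent in a <span>.
    static void wrapWord(Element parent, String word) {
        if (word.isEmpty()) {
            return; // nothing to wrap; also avoids looping forever on an empty match
        }
        Document doc = parent.getOwnerDocument();
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE) {
                Text text = (Text) child;
                int idx = text.getData().indexOf(word);
                if (idx >= 0) {
                    Text match = text.splitText(idx);            // first split: before | word + rest
                    Text rest = match.splitText(word.length());  // second split: word | rest
                    Element span = doc.createElement("span");
                    parent.replaceChild(span, match);            // span takes the word's place
                    span.appendChild(match);                     // reuse the removed text node
                    next = rest;                                 // keep scanning the remainder
                }
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        Element div = parseFragment(
                "<div>go to the text from here.<br/> from there <br/> Go to the text</div>");
        wrapWord(div, "text");
        System.out.println(div.getElementsByTagName("span").getLength()); // prints 2
    }
}
```

The match is case-sensitive. Text inside nested elements would need a recursive walk, which is left out here.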
With Jsoup it is easy to count how many times a particular tag occurs in a text. For example, I am trying to see how many times the anchor tag is present in the given text.
String content = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(content);
Elements links = doc.select("a[href]"); // a with href
System.out.println(links.size());
This gives me a count of 4. If I have a sentence and I want to know if the sentence contains any html tags or not, is it possible with Jsoup? Thank you.
You are possibly better off with a regular expression, but if you really want to use JSoup, you can match all elements and then subtract 4, as JSoup automatically adds four elements: first the root element, and then an <html>, a <head> and a <body> element.
This might loosely look like:
// attempt to count html elements in string - incorrect code, see below
public static int countHtmlElements(String content) {
Document doc = Jsoup.parse(content);
Elements elements = doc.select("*");
return elements.size()-4;
}
However this gives a wrong result if the text contains a <html>, <head> or <body>; compare the results of:
// gives a correct count of 2 html elements
System.out.println(countHtmlElements("some <b>text</b> with <i>markup</i>"));
// incorrectly counts 0 elements, as the body is subtracted
System.out.println(countHtmlElements("<body>this gives a wrong result</body>"));
So to make this work, you would have to check for the "magic" tags separately; that is why I feel a regular expression might be simpler.
More failed attempts to make this work: using parseBodyFragment instead of parse does not help, as the input gets sanitized by JSoup in the same way. Likewise, counting with doc.select("body *") saves you the trouble of subtracting 4, but it still yields the wrong count if a <body> is involved. It might work only in an application where you are sure that no <html>, <head> or <body> elements are present in the strings to be checked.
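For comparison, the regular-expression route mentioned above could look roughly like this. It is a sketch, not a real HTML parser: HtmlTagCounter, countTags and containsTags are hypothetical names, and the pattern is an assumption that treats anything of the form <name ...> as an opening tag, so it will be fooled by a > inside an attribute value or by text that merely looks like a tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlTagCounter {

    // Matches an opening or self-closing tag such as <b>, <a href='x'> or <br/>.
    // Closing tags like </b> are not matched, so each element is counted once.
    private static final Pattern OPEN_TAG = Pattern.compile("<[a-zA-Z][^<>]*>");

    // Counts opening tags in the raw string, without any parser-added elements.
    static int countTags(String content) {
        Matcher m = OPEN_TAG.matcher(content);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // True if the string contains at least one opening tag.
    static boolean containsTags(String content) {
        return OPEN_TAG.matcher(content).find();
    }

    public static void main(String[] args) {
        System.out.println(countTags("some <b>text</b> with <i>markup</i>"));    // 2
        System.out.println(countTags("<body>this gives a wrong result</body>")); // 1
        System.out.println(containsTags("a plain sentence, 1 < 2"));             // false
    }
}
```

Because nothing is added to the input, an explicit <body> is counted like any other tag, which avoids the subtraction problem described above.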