iText - the PDF is not good - java

I have the following HTML:
<div align='center' style='height:50px'>
<H1>A Simple Sample Web Page</H1>
<IMG SRC='http://sheldonbrown.com/images/scb_eagle_contact.jpeg'>
<H4>By Sheldon Brown</H4>
<H2>Demonstrating a few HTML features</H2>
</div>
HTML is really a very simple language. '
<P>
'command, which will insert a blank line.If you would like to make a link or
bookmark to this page, the URL is:
<BR>
http://sheldonbrown.com/web_sample1.html
</center>
But the image appears behind the text instead of below it!
What's wrong?
If iText cannot handle it, which library would be better?
This is my code:
// step 1
Document document = new Document();
// step 2
PdfWriter.getInstance(document, new FileOutputStream("C:\\hello-world.pdf"));
document.open();
String content = "<div align='center' style='height:50px'><H1>A Simple Sample Web Page</H1><IMG SRC='http://sheldonbrown.com/images/scb_eagle_contact.jpeg'><H4>By Sheldon Brown</H4><H2>Demonstrating a few HTML features</H2></div>HTML is really a very simple language. '<P>' command, which will insert a blank line.If you would like to make a link or bookmark to this page, the URL is:<BR> http://sheldonbrown.com/web_sample1.html</center>";
// use the snippet for the PDF document
List<Element> objects = HTMLWorker.parseToList(new StringReader(content), null);
for (Element element : objects) {
    document.add(element);
}
document.close();

Do you have any CSS applied to this HTML? Have you managed to view this HTML in a browser (and which one)? It renders just as you describe here: http://jsfiddle.net/TjUSJ/.
Maybe you want to remove the height style property from that <div>? It looks like the image renders in the middle, but it is actually rendering at 50px from the top. See this other fiddle, without the height styling: http://jsfiddle.net/TjUSJ/1/
Also, remember that the <center> tag is deprecated.

The problem was that I was using an old version.
I switched to the latest one, 5.1.2, and it works!

Related

Is there a way to create a container for elements in itextpdf 7

I want to create a kind of container so that page breaks in an iText document are evaluated for multiple elements together.
This is what I would do in HTML:
<div style="page-break-inside: avoid">
<p>Row 1</p>
<p>Row 2</p>
</div>
How would I do that in itextpdf?
You can achieve page-break-inside: avoid with the setKeepTogether method applied to a Div.
Div div = new Div()
        .setKeepTogether(true);
To add elements to your div, simply use .add, as in:
div.add(new Paragraph("Hello World"));
This is covered in the iText knowledge base, in chapter 4 of the iText 7 Building Blocks ebook, under the section "Grouping elements with the Div class".
If, on the other hand, you want to force a page break, you can add one anywhere in your document with document.add(new AreaBreak());

Extract image id with Jsoup

I am trying to extract a specific captcha image id using the Jsoup API. The HTML image tag looks like this:
<img id="wlspispHIPBimg03256465465dsd5456" style="display: inline; width: 200px; height: 100px;" aria-hidden="true" src="https://users/hip/data/rnd=435cb60d0a6b63ef4">
This is my code to obtain the attribute id="wlspispHIPBimg03256465465dsd5456":
doc = Jsoup.connect("http://go.microsoft.com/fwlink/?LinkID=614866&clcid")
.timeout(0).get();
Elements images = doc.select("img[src~=(?i)]");
for (Element image : images) {
System.out.println(image.attr("id"));
}
The problem is that I can't get the id of the captcha image.
You need to find something in the HTML that distinguishes this img tag from every other tag in the document. That can't be deduced from the code you posted, so I am using my imagination here:
Element imageEl = doc.select("img[src*=rnd]").first();
This exploits the fact that the source of the image contains "rnd" in its path. To get the best solution you must look at the page yourself. It also helps a lot to learn the CSS selectors that Jsoup supports.
I think you simply can't accomplish this using only Jsoup: the DOM is modified at runtime with JavaScript, and Jsoup does not execute it.
See also this other question.

style attribute not being displayed using jsoup

I am using Jsoup to fetch all images of a particular manga chapter from online-manga sites using only the first page link.
I have successfully retrieved the total page number and the src of the first page, for example: if supplied with this link "http://www.mangapanda.com/feng-shen-ji/1/1" the output will be:
Total page : 49
Title : Feng Shen Ji 1
ImageURL : http://i15.mangapanda.com/feng-shen-ji/1/feng-shen-ji-2974919.jpg
What I want to do now is fetch the src of the second page and then auto-increment to get the rest. The link to the second page is in the HTML as:
<div id="prefetchimg" style="background-image: url("http://i34.mangapanda.com/feng-shen-ji/1/feng-shen-ji-2974921.jpg");"></div>
But when I use Jsoup like this:
String url = "http://www.mangapanda.com/feng-shen-ji/1";
Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
Elements div = doc.select("div");
for (Element divParse : div) {
    if (divParse.id().equals("prefetchimg")) {
        System.out.println(divParse);
    }
}
I only get
<div id="prefetchimg"></div>
Instead of
<div id="prefetchimg" style="background-image: url("http://i34.mangapanda.com/feng-shen-ji/1/feng-shen-ji-2974921.jpg");"></div>
How do I get the style attribute?
@eltabo said:
OK, in your case, the tag has been modified by a JavaScript function, so Jsoup can't see this attribute.
And this is true: Jsoup only works on the static HTML of a page. For HTML that is modified by JavaScript, use a tool such as HtmlUnit.
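Once the page has been rendered by a JavaScript-capable tool such as HtmlUnit, the style attribute is just a string, and the image address still has to be pulled out of it. The following is a minimal sketch of that last step only, using the JDK's regex classes. The class and method names are made up for illustration, and the style string is the one shown in the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackgroundImageUrl {
    // Matches url(...) with optional single or double quotes around the address.
    private static final Pattern URL_IN_STYLE =
            Pattern.compile("url\\(\\s*[\"']?([^\"')]+?)[\"']?\\s*\\)");

    /** Returns the first url(...) target found in a CSS style string, if any. */
    static Optional<String> extract(String style) {
        if (style == null) {
            return Optional.empty();
        }
        Matcher m = URL_IN_STYLE.matcher(style);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String style = "background-image: url(\"http://i34.mangapanda.com/"
                + "feng-shen-ji/1/feng-shen-ji-2974921.jpg\");";
        System.out.println(extract(style).orElse("no url found"));
    }
}
```

Returning an Optional keeps the "attribute missing or empty" case explicit, which is exactly the situation the question ran into with the unrendered page.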

Freemarker, PDF, Header/Footer and page-breaks

A common use of Freemarker is generating PDFs.
Unfortunately, I have to generate a PDF with a lot of pages, and "they" are asking me to add a header with some information and a footer with something like "page 2/60", etc.
Searching the web, I found how to create a macro template, but it only shares some common tags (like CSS); it doesn't tell Freemarker how to manage a multi-page PDF.
In addition, I sometimes have a "page-break" CSS class inside the ftl, so I can't determine when and where a new page is created.
I'm using Freemarker 2.3 on Java.
Thanks for any help.
You can specify a header and a footer (including page numbers) with CSS.
This will work if the tool used to transform your XHTML into the PDF byte array supports the paged media instructions.
In the CSS:
@page {
    @top-center {content: element(header)} /* Header */
    @bottom-center {content: element(footer)} /* Footer */
}
#header {position: running(header);}
#footer {position: running(footer);}
#pagenumber:before {content: counter(page);}
#pagecount:before {content: counter(pages);}
In the HTML:
<div id="header">YOUR HEADER HERE</div>
<div id="footer">Page <span id="pagenumber" /> / <span id="pagecount" /></div>
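To show how the two fragments fit into one document, here is a rough sketch that assembles a complete XHTML string in plain Java. In a Freemarker setup the same markup would normally live in the .ftl template; the class name, method name, and parameters below are hypothetical, and the sketch assumes the PDF renderer supports the paged media rules mentioned above. The header and footer divs are placed before the body content, which is commonly recommended for running elements so that they are available from the first page.

```java
class PagedXhtml {
    /**
     * Builds a minimal XHTML page with a running header and footer.
     * headerText and bodyHtml are inserted as-is; escape them first
     * if they can contain untrusted markup.
     */
    static String build(String headerText, String bodyHtml) {
        String css = String.join("\n",
                "@page {",
                "  @top-center { content: element(header); }",
                "  @bottom-center { content: element(footer); }",
                "}",
                "#header { position: running(header); }",
                "#footer { position: running(footer); }",
                "#pagenumber:before { content: counter(page); }",
                "#pagecount:before { content: counter(pages); }");
        return String.join("\n",
                "<html xmlns=\"http://www.w3.org/1999/xhtml\">",
                "<head><style type=\"text/css\">",
                css,
                "</style></head>",
                "<body>",
                "<div id=\"header\">" + headerText + "</div>",
                "<div id=\"footer\">Page <span id=\"pagenumber\" /> / "
                        + "<span id=\"pagecount\" /></div>",
                bodyHtml,
                "</body>",
                "</html>");
    }

    public static void main(String[] args) {
        System.out.println(build("YOUR HEADER HERE", "<p>First paragraph</p>"));
    }
}
```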

Selenium WebDriver findElements() Fails on Single Quotes

My goal is to parse a block of HTML code like below to obtain the text, comments and replies fields as separate parts of the block:
<div id='fooID' class='foo'>
<p>
This is the top caption of picture's description</p>
<p>
T=<img src="http://www.mysite.com/images/img23.jpg" alt="" width="64" height="108"/> </p>
<p>
And here is more text to describe the photo.</p>
<div class=comments>(3 comments)</div>
<div id='reply13' class='replies'>
<a href=javascript:getReply('13',1)>Show reply </a></div>
</div>
My problem is that Selenium's WebDriver does not seem to support non-string identifiers in the HTML (notice that the class attribute in the HTML is written as 'foo' as opposed to "foo"). From all the examples that I have seen in both the Selenium docs and in other SO posts, the latter format is what WebDriver commonly expects.
Here is the relevant part of my Java code with my various (unsuccessful) attempts:
java.util.List<WebElement> elementList = driver.findElements(By.xpath("//div[@class='foo']"));
java.util.List<WebElement> elementList = (List<WebElement>) ((JavascriptExecutor)driver).executeScript("return $('.foo')[0]");
java.util.List<WebElement> elementList = driver.findElements(By.xpath("//div[contains(@class, 'foo')]"));
java.util.List<WebElement> elementList = driver.findElements(By.cssSelector("div." + foo_tag)); // where foo_tag = "'foo'".replace("'", "\'");
java.util.List<WebElement> elementList = driver.findElements(By.cssSelector("'foo'"));
Is there a sure way of handling this? Or is there an alternative, better way of extracting the above fields?
Other info:
I'm an HTML noob, but have made efforts to understand the structure of the HTML code/tags
Using Firefox (and, accordingly, FirefoxDriver)
Your help/suggestions greatly appreciated!
It's invalid HTML, so Selenium won't have a chance. You should fix it.
You will have a better chance with HTMLAgilityPack:
http://htmlagilitypack.codeplex.com/
It is a little better when it comes to badly formed HTML (which this is).
Below is an SO post that lists a few different options for a few different languages, with tools like HTMLAgilityPack. You should find a suitable one:
Options for HTML scraping?
The problem is that the HTML specification doesn't know single quotes, as far as I know. So you don't have a problem with the Selenium WebDriver; the problem is the HTML.
Do you have the chance to edit the HTML code?
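One thing that may be worth checking before editing the markup: whether an attribute value was written with single or double quotes is not visible once the markup has been parsed, so the quote style by itself should not change what an XPath such as //div[@class='foo'] matches. The sketch below illustrates this with the JDK's own XML parser and XPath engine, not with Selenium. The class name is made up, and the snippet is a simplified, well-formed version of the block in the question. The unquoted values in the original (class=comments and the href) are a separate matter that this does not cover.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class QuoteStyleDemo {
    /** Counts the nodes matched by an XPath expression in a well-formed markup string. */
    static int countMatches(String markup, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            // Parsing or XPath errors are reported as unchecked for brevity.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String singleQuoted = "<div id='fooID' class='foo'><p>caption</p></div>";
        String doubleQuoted = "<div id=\"fooID\" class=\"foo\"><p>caption</p></div>";
        String xpath = "//div[@class='foo']";
        // Both variants parse to the same attribute value, so both counts are equal.
        System.out.println(countMatches(singleQuoted, xpath));
        System.out.println(countMatches(doubleQuoted, xpath));
    }
}
```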
