Convert HTML string content to a Java Map - java

I have the following HTML string content and I want to convert it into a Java Map.
<div dir="ltr"><div class="gmail_quote"><div dir="ltr"><div><div dir="ltr"><div><p style="font-family:arial,sans-serif;font-size:13px">Notification for shipment event group "Picked up" for 13 May 14.<u></u><u></u></p>
<div class="MsoNormal" align="center" style="font-family:arial,sans-serif;font-size:13px;text-align:center">
<hr size="2" width="100%" align="center"></div><table border="0" cellpadding="0" style="font-family:arial,sans-serif;font-size:13px">
<tbody><tr><td style="padding:0.75pt">
<p class="MsoNormal">
AWB Number: 8841965182<br>
Pickup Date: 2014-05-13 20:11:00<br>
Service: P<br>
Pieces: 1<br>
I tried jsoup, but it did not work.

Take a look at Boilerpipe.
A similar question has been asked here on SO.
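If only the "Key: Value" lines are needed, a minimal sketch without any HTML library could look like the following. It assumes the input is just the content of the paragraph that holds the pairs, that each pair ends with a <br>, and that the key never contains a colon. The class and method names (HtmlFieldsToMap, toMap) are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HtmlFieldsToMap {

    // Splits the fragment on <br> tags, strips any remaining tags, and treats
    // the text before the first ':' on each piece as the key.
    static Map<String, String> toMap(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String piece : html.split("(?i)<br\\s*/?>")) {
            String text = piece.replaceAll("<[^>]*>", " ").trim();
            int colon = text.indexOf(':');
            if (colon > 0) {
                // Only the first colon separates key and value, so a value
                // such as "2014-05-13 20:11:00" keeps its own colons.
                fields.put(text.substring(0, colon).trim(),
                           text.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<p class=\"MsoNormal\">\n"
                + "AWB Number: 8841965182<br>\n"
                + "Pickup Date: 2014-05-13 20:11:00<br>\n"
                + "Service: P<br>\n"
                + "Pieces: 1<br>\n";
        System.out.println(toMap(html));
    }
}
```

A LinkedHashMap is used so the entries keep the order in which they appear in the mail. For anything less regular than this fragment, a real HTML parser is the safer choice.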

Related

Tomcat 8.5 resolves same variable differently

When expanding an HTML page with embedded variables from JSP code, my code produced inexplicable results. The String variable completename expands at first to
http://www.formatika.de/cococo.de/products/Sources/Isabelle/Doc/Tutorial/document/Isa-logics.pdf
and 2 lines later to
http://localhost:8080/cococo.de/products/Sources/Isabelle/Doc/Tutorial/document/Isa-logics.pdf
in the following code fragment:
<div>
<table border="0" align="center" cellspacing="2" cellpadding="2">
<tr align="center"><td align="center">
<div>
<a href="<%=completename %>" title="<%=showname%>" target="_blank"><%=filename%><br><br>
<iframe src='<%=completename %>' width='<%=width%>' height='<%=height%>' type='application/pdf'>
</iframe>
</a>
</div>
</td></tr>
</table>
</div>
It can be observed at this URL:
http://formatika.de/print.jsp?content=source&file=products/Sources/Isabelle/Doc/Tutorial/document/Isa-logics.pdf
Does anyone know where to look? I use Apache Tomcat 8.0.27 with Java EE 6 Web.
Found the answer: Tomcat is still hidden behind IIS, so the host name was resolved to "localhost:8080" instead of the host name used at IIS.

JSoup scrape HTML document by attribute value

I want to make a dynamic website and need some pics off the internet. I decided to scrape them off flickr and include the owners on my website but am running into problems scraping. I'll post part of the HTML below but if you want to check the source code yourself, here's the website. https://www.flickr.com/explore
HTML:
<div class="thumb ">
<span class="photo_container pc_ju">
<a data-track="photo-click" href="/photos/sheilarogers13/15586482942/in/explore-2014-10-20" title="Lake District" class="rapidnofollow photo-click"><img id="photo_img_15586482942" src="https://c2.staticflickr.com/4/3945/15586482942_6a7154363f_z.jpg"width="508" height="339" alt="Lake District" class="pc_img " border="0"><div class="play"></div></a>
</span>
<div class="meta">
<div class="title"><a data-track="photo-click" href="/photos/sheilarogers13/15586482942/in/explore-2014-10-20" title="Lake District" class="title">Lake District</a></div>
<div class="attribution-block">
<span class="attribution">
<span>by </span>
******<a data-track="owner" href="/photos/sheilarogers13" title="sheilarogers22" class="owner">sheilarogers22</a>******
</span>
</div>
<span class="inline-icons">
<a data-track="favorite" href="#" class="rapidnofollow fave-star-inline canfave" title="Add this photo to your favorites?"><img width="12" height="12" alt="[★]" src="https://s.yimg.com/pw/images/spaceball.gif" class="img"><span class="fave-count count">99+</span></a>
<a title="Comments" href="#" class="rapidnofollow comments-icon comments-inline-btn">
<img width="12" height="12" alt="Comments" src="https://s.yimg.com/pw/images/spaceball.gif">
<span class="comment-count count">57</span>
</a>
<img width="12" height="12" alt="" src="https://s.yimg.com/pw/images/spaceball.gif">
</span>
</div>
</div>
I want the line where I put asterisks, in order to be able to give credit to the authors of the pictures.
My code:
Elements pgElem = doc.select("div.thumb").select("div.meta").select("[data-track]");
However, the above code gives me all four data-track elements in my div.meta, and I only want the one whose value is owner.
I checked the JSoup documentation and it says that attributes with values are found using [attr=value], but I can't seem to get it to work. I've tried:
.select("[data-track=owner]")
.select("[data-track='owner']")
but neither works. Thoughts?
Elements pgElem = doc.select("div.thumb").select("div.meta").select("[data-track]");
Elements ownerElements = new Elements();
for (Element element : pgElem) {
    if (!element.getElementsByAttributeValueContaining("data-track", "owner").isEmpty()) {
        ownerElements.add(element);
    }
}
Actually, I just gave it another spin and this works for me:
doc.select("div.thumb").select("div.meta").select("[data-track=owner]")
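For reference, here is a small self-contained sketch of the same [attr=value] selector run against a trimmed copy of the markup from the question. It assumes jsoup is on the classpath; the class and method names (OwnerLinks, firstOwner) are made up for illustration, and the base URI is passed only so that the relative href can be made absolute.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class OwnerLinks {

    // Returns the first owner link inside div.thumb > ... > div.meta,
    // or null when the page contains none.
    static Element firstOwner(Document doc) {
        return doc.select("div.thumb div.meta a[data-track=owner]").first();
    }

    public static void main(String[] args) {
        // Trimmed copy of the markup from the question.
        String html = "<div class=\"thumb \"><div class=\"meta\">"
                + "<div class=\"title\"><a data-track=\"photo-click\" "
                + "href=\"/photos/sheilarogers13/15586482942/in/explore-2014-10-20\" "
                + "class=\"title\">Lake District</a></div>"
                + "<span class=\"attribution\"><span>by </span>"
                + "<a data-track=\"owner\" href=\"/photos/sheilarogers13\" "
                + "title=\"sheilarogers22\" class=\"owner\">sheilarogers22</a></span>"
                + "</div></div>";

        Document doc = Jsoup.parse(html, "https://www.flickr.com/");
        Element owner = firstOwner(doc);
        if (owner != null) {
            // text() is the visible name, absUrl() resolves the relative href.
            System.out.println(owner.text() + " -> " + owner.absUrl("href"));
        }
    }
}
```

If the selector still returns nothing on the live page, it is worth checking that the HTML jsoup downloaded actually contains these elements, since content added later by JavaScript will not be present.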

Jsoup parsing for nested html

I have an HTML page to parse with Jsoup, and I lose track because of its weird structure. The HTML can be summarized like this (every line is nested one level inside the line above it):
<html>
<body class="page3078">
<div id="mainCapsule">
<div id="contentCapsule" class="capsule">
<div id="content">
<div id="subCapsule" class="clearFix" xmlns="">
<div id="contentLeft">
<iframe width="635" height="1000" frameborder="0" src="apps/Results.aspx">
#document
<html xmlns="http://www.w3.org/1999/xhtml">
<body style="background:none;">
<form id="form1" action="Results.aspx" method="post" name="form1">
<div class="pressContent">
<div class="tableCapsule details">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr class="even">
Basically, I want to get the text inside the tag with class "even". I tried selecting the class directly, like this:
doc.getElementsByClass("even")
It didn't work. I tried a parent > child relationship with the select method, which didn't work either. I also tried this, going through the second html tag:
doc.select("body.page3078 > html > body > #form1 > th");
That didn't work either. Where am I going wrong?
One comment summarizes the start of a solution:
As mentioned here, you need to get the page from the iframe in a separate jsoup parser. This page isn't weird at all - it's just that a separate page is shown in the iframe. – Boris the Spider
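A rough sketch of that two-step approach, assuming jsoup is on the classpath: read the iframe's src from the outer page, then parse the framed page as its own document and select the row there. The class and method names (IframeRows, iframeUrl, evenRowText), the example.com base URL, and the table cell contents are placeholders, not taken from the real site.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class IframeRows {

    // Absolute URL of the first iframe under #contentLeft, or "" if there is none.
    static String iframeUrl(Document outer) {
        Element frame = outer.select("#contentLeft iframe[src]").first();
        return frame == null ? "" : frame.absUrl("src");
    }

    // Text of all rows with class "even" in the framed document.
    static String evenRowText(Document inner) {
        return inner.select("tr.even").text();
    }

    public static void main(String[] args) throws Exception {
        // Step 1: the outer page only contains the iframe element, not its content.
        Document outer = Jsoup.parse(
                "<div id=\"contentLeft\"><iframe src=\"apps/Results.aspx\"></iframe></div>",
                "http://example.com/site/"); // hypothetical base URL
        String url = iframeUrl(outer);
        System.out.println(url);

        // Step 2: against the live site the framed page would be fetched separately:
        // Document inner = Jsoup.connect(url).get();
        // A literal stands in for it here so the sketch runs offline.
        Document inner = Jsoup.parse(
                "<table><tbody><tr class=\"even\"><td>first</td><td>second</td></tr></tbody></table>");
        System.out.println(evenRowText(inner));
    }
}
```

Selectors such as body.page3078 > html > body cannot match, because the outer document never contains the nested html element; only the second document does.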

Using embedded CSS in Mail

I am generating and sending a mailer from a servlet by replacing placeholders in a mail template. Can I use embedded styles in the email, as below, instead of inline styling?
I am using a newsletter email that has three placeholders: header image, email body, and email footer. The problem is that, because the header image is within an anchor tag, I get a border around the image.
Is it possible to get rid of the border by using embedded CSS?
Is there an alternative solution, given that the whole ###HEADER_IMAGE### placeholder is replaced by an image tag rather than just the image source?
The HTML code is as below.
<html>
<style>
a img
{
border-style : none;
}
</style>
<table width="590">
<tr>
<td colspan="2">
<a href="#" target="_blank">
###HEADER_IMAGE###
</a>
</td>
</tr>
</table>
<div>
###EMAIL_BODY###
</div>
<div>
###EMAIL_FOOTER###
</div>
</html>
Thanks for Reply
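Since the servlet already generates the complete image tag, one possible alternative is to put the border reset on the tag itself, so the result does not depend on the mail client keeping the style block (some clients are known to drop it). The following is only a sketch under that assumption; MailTemplateFiller, headerImageTag, and fill are made-up names, and the values are assumed to be trusted and already HTML-escaped.

```java
class MailTemplateFiller {

    // Builds an <img> tag that carries its own border reset, both as the
    // legacy border attribute and as an inline style.
    static String headerImageTag(String src, String alt) {
        return "<img src=\"" + src + "\" alt=\"" + alt
                + "\" border=\"0\" style=\"border:0;\">";
    }

    // Replaces the three placeholders used in the template from the question.
    static String fill(String template, String headerSrc, String body, String footer) {
        return template
                .replace("###HEADER_IMAGE###", headerImageTag(headerSrc, "Header"))
                .replace("###EMAIL_BODY###", body)
                .replace("###EMAIL_FOOTER###", footer);
    }

    public static void main(String[] args) {
        String template = "<a href=\"#\" target=\"_blank\">###HEADER_IMAGE###</a>"
                + "<div>###EMAIL_BODY###</div><div>###EMAIL_FOOTER###</div>";
        // The image URL is a placeholder for illustration.
        System.out.println(fill(template, "http://example.com/header.png",
                "Body text", "Footer text"));
    }
}
```

The embedded a img rule can stay in the template as well; the inline version simply acts as a fallback when the rule is ignored.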

Extracting href from a class within other div/id classes with jsoup

Hello, I am trying to extract the first href from within the "title" class in the following source (this is only part of the page; I am parsing the entire page):
<div id="atfResults" class="list results ">
<div id="result_0" class="result firstRow product" name="0006754023">
<div id="srNum_0" class="number">1.</div>
<div class="image">
<a href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">
<img src="http://ecx.images-amazon.com/images/I/31ZcWU6HN4L._AA115_.jpg" class="productImage" alt="Product Details">
</a>
</div>
<div class="data">
<div class="title">
<a class="title titleHover" href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">Essential Modern Classics - The Hobbit</a>
<span class="ptBrand">by J. R. R. Tolkien</span>
<span class="bindingAndRelease">(<span class="binding">Paperback</span> - 2 Apr 2009)</span>
</div>
I have tried several variations of both the select function and getElementsByClass, but all of them have given me a "null" value, for example:
Document firstSearchPage = Jsoup.connect(fullST).get();
Element link = firstSearchPage.select("div.title").first();
If someone could help me with a solution to this problem and recommend some areas of reading so I can avoid this problem in future it would be greatly appreciated.
The CSS selector div.title returns a <div class="title">, not a link as you seem to think. If you want an <a class="title">, then you should use the a.title selector.
Element link = document.select("a.title").first();
String href = link.absUrl("href");
// ...
Or if an <a class="title"> can appear elsewhere in the document outside a <div class="title"> before that point, then you need the following more specific selector:
Element link = document.select("div.title a.title").first();
String href = link.absUrl("href");
// ...
This will return the first <a class="title"> which is a descendant of a <div class="title">.
