Android use JSoup parse HTML convert to String

I am trying to display HTML content in a rich-text view, so I parse the URL and try to get everything inside <div class="margin-box"></div> as a String value.
But I cannot parse the URL.
My code is below.
Use Jsoup to parse the URL:
Document document = Jsoup.parse(news_url);
String news_content = CommonUtil.newsContent(document);
Data Capture
public static String newsContent(Document document){
Elements elements = document.select("div.margin-box");
String newsContent = elements.toString();
return newsContent;
}
Then I get this debug result:
Show URL parse unsuccessful.
What I actually want is a value like the one below:
<div>
<p>
<img src="http://p1.pstatp.com/large/1c67000332373537f0ff" img_width="640" img_height="360" inline="0" alt="************" onerror="javascript:errorimg.call(this);">
</p>
<p class="pgc-img-caption">***********</p><p>*************************************</p>
<p><img src="http://p3.pstatp.com/large/1c6e0000841ab42ca326" img_width="640" img_height="425" inline="0" alt="**********" onerror="javascript:errorimg.call(this);"></p>
<p class="pgc-img-caption">********************************</p>
<p><img src="http://p1.pstatp.com/large/1c6d00008eebccce3e2f" img_width="550" img_height="375" inline="0" alt="************" onerror="javascript:errorimg.call(this);"></p>
<p class="pgc-img-caption">*********</p><p>**************************</p><p>*********************</p><p>*****************</p></div>
What did I do wrong?
Full HTML block
There are no elements inside the div class

It is useful to check first whether Jsoup can parse the content at all: http://try.jsoup.org/~8W0oCmiiYnFL01nUM6HDbQ9wwTA
You are using Jsoup.parse(String), which expects the HTML itself to be stored in the string, not a URL. If you want to use parse to retrieve the HTML source, you have to pass a java.net.URL and a timeout in milliseconds:
String url = "http://servertrj.com/news/index/208";
Document doc = Jsoup.parse(new URL(url), 3000);
Most of the time you will see the connect(...).get() syntax used to pull the HTML source. Compare your code with this simple example:
String url = "http://servertrj.com/news/index/208";
String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36";
Document doc = Jsoup.connect(url).userAgent(userAgent).get();
Elements elements = doc.select(".margin-box");
System.out.println(elements.size() + "\n" + elements.toString());
Output:
1
<div class="margin-box">
<p style="margin: 0px 0px 15px; padding: 0px; border: 0px; line-height: 30px; font-family: "Microsoft YaHei;, SimSun, Verdana, Arial; color: rgb(0, 0, 0); font-size: 15px;">[... truncated because of spam detection, but same as try.jsoup]</p>
</div>

Related

Simple way to display currency symbol in html2pdf for iText 7

I updated my code from iText 5.0 to iText 7 and html2pdf 2.0 according to this post. In the earlier version the rupee symbol rendered properly, but because of a CSS issue I changed the code. Now the complete page converts to PDF properly except for the rupee symbol.
I tried setting the font in the HTML style tag itself, like * { font-family: Arial; }.
I also tried writing the rupee symbol as the entity &#x20b9; and as the literal character ₹, but neither helped.
My HTML:
<html>
<head>
<style>
* { font-family: Arial; }
</style>
<title>HTML div</title>
</head>
<body>
<p style="margin-bottom: 0in; padding-left: 60px;">
<div style="font-size: 450%; text-indent: 150px;">
<strong>BUY <span style="color: #ff420e;">2</span> GET
</strong>
</div>
</p>
<div
style="float: left; display: inline-block; margin: 10px; text-align: right; font-size: 70%; line-height: 27; transform: rotate(270deg);">Offer
Expiry Date : 30/11/2017</Div>
<div
style="float: left; display: inline-block; margin: 10px; text-align: right; font-size: 350%;">
₹
<!-- ₹ -->
</div>
<div
style="float: left; display: inline-block; margin: auto; font-size: 1500%; color: red; font-weight: bold;">99</div>
<div
style="float: left; display: inline-block; margin: 10px; text-align: left; font-size: 250%; line-height: 10;">OFF</div>
<div
style="position: absolute; height: 40px; font-size: 250%; line-height: 600px; color: red; text-indent: 50px">Pepsi
2.25 Pet Bottle ltr</div>
<div
style="position: absolute; height: 40px; font-size: 245%; line-height: 694px; text-indent: 50px">
MRP: ₹ <span style="color: #ff420e;">654</span>
</div>
</body>
</html>
Java code:
import java.io.File;
import java.io.IOException;
import com.itextpdf.html2pdf.HtmlConverter;
public class Test {
    final static String DEST = "D://Workspace_1574973//POP//sample_12.pdf";
    final static String SRC = "D://Workspace_1574973//POP//src//com//resources//test.html";
    public static void main(String[] args) throws Exception {
        createPdf(SRC, DEST);
    }
    // Converts the HTML file at src into a PDF file at dest
    public static void createPdf(String src, String dest) throws IOException {
        HtmlConverter.convertToPdf(new File(src), new File(dest));
    }
}
The earlier code, which rendered the symbols correctly:
log.info("Creating file start");
OutputStream file = new FileOutputStream(new File("font_check.pdf"));
Document document = new Document(PageSize.A4);
PdfWriter writer = PdfWriter.getInstance(document, file);
document.open();
InputStream is = new ByteArrayInputStream(fileTemplate.getBytes());
XMLWorkerHelper.getInstance().parseXHtml(writer, document, is);
document.close();
file.close();
log.info("Creating file end");
Is there a simple approach to achieve this with minimal, optimized code?
I have to generate thousands of PDFs in one go, so performance must not suffer.
Please let me know if anyone has achieved this with the latest version.
Edit: Also, how do I set a particular paper size, such as A6, A3 or A4?

I hope you don't mind: I don't have the reputation to write a simple comment, so I'll post a full answer instead. I parse HTML for my work, and I read SO sometimes. There is a lot on the subject of UTF-8 here. Most software systems can handle characters beyond code point 255, such as the Indian rupee sign (U+20B9), through UTF-8. Most of the time, however, the programmer has to request that behavior explicitly.
In HTML, for instance, adding this line usually helps:
String UTF8MetaTag = "<meta http-equiv='Content-Type' content='text/html; charset=utf-8' />";
I have not used HTMLToPDF, so I might not be the right person to answer your question. But having dealt with UTF-8 foreign-language characters for three years, I know that configuring software to handle the full Unicode range is usually very easy, and also always mandatory.
Here is an SO post about using HTMLToPDF and UTF-8 to handle Japanese kanji characters. Most likely the same approach handles all of UTF-8, but that is not guaranteed:
HTML2PDF support for japanese language(utf8) is not working
Here are a few posts about using HTML2PDF in PHP:
Converting html 2 pdf (php) using hebrew returns "???"
Having æøå chars in HTML2PDF charset
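To illustrate the point about requesting UTF-8 explicitly, here is a minimal plain-Java sketch. It does not use iText or html2pdf, and the class and method names (RupeeEncodingDemo, roundTrip) are made up for this illustration. It only shows that the rupee sign survives an encode/decode round trip in UTF-8 but not in a single-byte charset such as ISO-8859-1. This is relevant to calls like fileTemplate.getBytes() in the question, which use the platform default charset. Whether encoding is the actual cause of the missing symbol in the PDF is not established in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RupeeEncodingDemo {
    // U+20B9 INDIAN RUPEE SIGN, written as an escape so the source file's own encoding does not matter.
    static final String RUPEE = "\u20B9";

    // Encodes the text to bytes and decodes it again with the same charset.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // UTF-8 represents the rupee sign as three bytes, so the round trip is lossless.
        System.out.println(RUPEE.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(RUPEE.equals(roundTrip(RUPEE, StandardCharsets.UTF_8)));
        // ISO-8859-1 has no rupee sign; getBytes substitutes '?', so the symbol is lost.
        System.out.println(roundTrip(RUPEE, StandardCharsets.ISO_8859_1));
    }
}
```

If the template is turned into bytes somewhere in the pipeline, passing StandardCharsets.UTF_8 explicitly (instead of relying on the platform default) removes one possible source of "?" characters.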

java jsoup - How to get all links from a href searching by a text

I have a lot of lines like these in a web page:
<a href="City1/Waves321.aspx"><span><span style="font-family: Courier New">Title</span></span></a>
<span style="font-family: Courier New"> (txt)</span></li></ul>
<a href="City2/Waves761.aspx"><span><span style="font-family: Courier New">Title</span></span></a>
<span style="font-family: Courier New"> (txt)</span></li></ul>
and I want to get only:
City1/Waves321.aspx
City2/Waves761.aspx
and so on: every a href that comes before "Title".
I tested with this code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ListLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        String address;
        Document doc = Jsoup.connect(url).timeout(10*1000).get();
        // Select anchors whose href attribute matches the regex "Waves"
        Elements links = doc.select("a[href~=(Waves)]");
        //String linkText = links.text();
        for (Element link : links) {
            String linkHref = link.attr("href");
            address = url + linkHref;
            System.out.println(address);
        }
    }
}
It works for most of the links, but it misses the ones where "Title" is on a new line, like this:
<a href="City/Waves321.aspx"><span><span style="font-family: Courier New">
Title</span></span></a><span style="font-family: Courier New"> (txt)</span></li></ul>
I cannot change the web page's code (by the way :/)
How can I achieve this in Jsoup?

You can do it like this:
Elements e = doc.getElementsByTag("a");
e.stream().forEach(p -> System.out.println(p.attr("href")));

Modifying HTML using java

I am trying to read an HTML file and add a link to some of the text.
For example, I want to add a link to the text "Campaign0":
<td><p style="overflow: hidden; text-indent: 0px; "><span style="font-family: SansSerif;">101</span></p></td>
<td><p style="overflow: hidden; text-indent: 0px; "><span style="font-family: SansSerif;">Campaign0</span></p></td>
<td><p style="overflow: hidden; text-indent: 0px; "><span style="font-family: SansSerif;">unknown</span></p></td>
Link to be added:
<a href="Second.html">
I need a Java program that modifies the HTML to add a hyperlink around "Campaign0".
How do I do this with Jsoup?
I tried this with Jsoup:
File input = new File("D://First.html");
Document doc = Jsoup.parse(input, "UTF-8", "");
Element span = doc.select("span").first(); // this only gets the first span tag :(
span.wrap("<a href=\"Second.html\"></a>");
Is this correct? It's not working :(
In short, is there anything like:
if <span>Campaign0</span> is found
then replace it with <a href="Second.html"><span>Campaign0</span></a>
using Jsoup or any other technology from within Java code?

Your code seems pretty much correct. To find the span elements with "Campaign0", "Campaign1", etc., you can use the JSoup selector "span:containsOwn(Campaign0)". See additional documentation for JSoup selectors at jsoup.org.
After finding the elements and wrapping them with the link, calling doc.html() should return the modified HTML code. Here's a working sample:
input.html:
<table>
<tr>
<td><p><span>101</span></p></td>
<td><p><span>Campaign0</span></p></td>
<td><p><span>unknown</span></p></td>
</tr>
<tr>
<td><p><span>101</span></p></td>
<td><p><span>Campaign1</span></p></td>
<td><p><span>unknown</span></p></td>
</tr>
</table>
Code:
File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8", "");
// Link target assumed to be Second.html, as given in the question
Element span = doc.select("span:containsOwn(Campaign0)").first();
span.wrap("<a href=\"Second.html\"></a>");
span = doc.select("span:containsOwn(Campaign1)").first();
span.wrap("<a href=\"Second.html\"></a>");
String html = doc.html();
BufferedWriter htmlWriter =
new BufferedWriter(new OutputStreamWriter(new FileOutputStream("output.html"), "UTF-8"));
htmlWriter.write(html);
htmlWriter.close();
output:
<html>
<head></head>
<body>
<table>
<tbody>
<tr>
<td><p><span>101</span></p></td>
<td><p><a href="Second.html"><span>Campaign0</span></a></p></td>
<td><p><span>unknown</span></p></td>
</tr>
<tr>
<td><p><span>101</span></p></td>
<td><p><a href="Second.html"><span>Campaign1</span></a></p></td>
<td><p><span>unknown</span></p></td>
</tr>
</tbody>
</table>
</body>
</html>

displaying cdata in XML to be rendered as html

I know that something similar has been asked many times but I cannot find a solution that works in my situation.
I'm generating CData section within an XML using java (StringBuffer) and I'm putting a simple HTML code as shown below:
public String createXML(OrderDetailBean orderBean) throws ParserConfigurationException {
logger.info("Starting to Create the XML");
getConnectionProperties(); //Load properties file and set the Connection parameters
// Create document
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbf.newDocumentBuilder();
Document doc = builder.newDocument();
//Configuring the Factory to get a validating parser (ie one that understands name and spaces)
dbf.setNamespaceAware(true);
dbf.setValidating(true);
//Create doc type
DOMImplementation domImpl = doc.getImplementation();
DocumentType doctype = domImpl.createDocumentType("paymentService", "-//CompanyName//DTD CompanyName PaymentService v1//EN", "http://dtd.CompanyName.com/Service_v1.dtd");
doc.appendChild(doctype);
/******** Add ROOT element: PaymentService ********/
Element rootElement = doc.createElement("paymentService");
//Add Attributes to the Root Element
rootElement.setAttribute("version", "1.4");
rootElement.setAttribute("Code", Code);
/******** Add first element: submit ********/
Element elementSubmit = doc.createElement("submit");
/******** Add second element: order *******/
Element elementOrder = doc.createElement("order");
elementOrder.setAttribute("orderCode", ""+System.currentTimeMillis());
// Add THIRD child element for CData
Element elementOrderContent = doc.createElement("orderContent");
StringBuffer orderContent = new StringBuffer();
orderContent.append("<![CDATA[<center><table> <tr><td class=\"one width190\" align=\"left\" valign=\"top\">");
orderContent.append("<span style=\" font-family: Arial, Helvetica, sans-serif; font-size: 12pt; color: #002469;\">");
orderContent.append("Product:</span> </td><tr><td class=\"one\" align=\"left\" valign=\"top\"><span style=\" font-family: Arial, Helvetica, sans-serif; font-size: 12pt; color: #002469;\">");
orderContent.append("<strong>Product title</strong></span></td></tr> </table></center>]]>");
logger.info("The orderContent Element in XML : "+orderContent.toString());
Text orderContentText = doc.createTextNode(orderContent.toString());
logger.debug("Converted Text for Order Content is: "+orderContentText);
elementOrderContent.appendChild(orderContentText);
elementOrder.appendChild(elementOrderContent); //Add third Order Child: OrderContent
elementSubmit.appendChild(elementOrder); //Add Order Element to Submit
rootElement.appendChild(elementSubmit); //Add First Element (Submit) to Root Element (PaymentService)
doc.appendChild(rootElement); //Add Root Element to XML Doc
String stringXML = convertDocintoString(doc); //print the XML to File
logger.info("The XML Generated is: " + stringXML);
return stringXML;
}
This part is fine. I'm then converting that XML(XML Document) into String using XMLSerializer as shown below:
/*
* Convert the XML Document into a String: Serialize DOM Document to generate the xml String
*/
public String convertDocintoString(Document doc) {
logger.info("Converting the XML Document into String XML");
//OutputFormat format = new OutputFormat(doc);
OutputFormat format = new OutputFormat(doc, "UTF-8", true);
//format.setIndenting(true);
XMLSerializer serializer;
String outXML = null;
try {
StringWriter stringOut = new StringWriter ();
serializer = new XMLSerializer(stringOut, format);
serializer.asDOMSerializer();
serializer.serialize(doc);
outXML = stringOut.toString();
logger.debug("The XML String IS: " + outXML);
}
catch (FileNotFoundException e) {
e.printStackTrace();
logger.debug("XML Document Not Found for Serialization!", e);
}
catch (IOException e) {
e.printStackTrace();
logger.debug((new StringBuilder("Issues when converting the XML Document into String XML")).append(e).toString());
}
return outXML;
}
In the step above, I noticed that all the '<' and '>' characters get replaced by &lt; and &gt;. But I believe that this is normal.
Now when I try to display that CDATA block in an HTML page, it is rendered as literal text rather than as HTML, i.e. exactly like the first code block I pasted above. Can somebody please suggest what is happening here and what I am doing wrong? The HTML output is:
<html lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<META http-equiv='Pragma' content='no-cache'>
<META http-equiv='Expires' content='0'>
<title>Select Method</title>
<style type="text/css" media="screen"> #import url(/pictures/dispatcher.css);</style>
<script type="text/javascript" src="/jsp/js/jquery-1.6.2.min.js"></script>
</head>
<body >
<div id="ordercontainer"><font ><b>Your Details</b></font>
<br/><font ><![CDATA[<input type="hidden" name="MC_mycustomvar" value="M_ and MC_ combined"><center><table><tr><td class="one width190" align="left" valign="top"><span style=" font-family: Arial, Helvetica, sans-serif; font-size: 12pt; color: #002469;">Product:</span>&nbsp;&nbsp;</td><tr><td class="one" align="left" valign="top"><span style=" font-family: Arial, Helvetica, sans-serif; font-size: 12pt; color: #002469;"><strong>Product title</strong></span></td></tr></table></center>]]></font><br/>
</body>
</html>
Thanks

You need to use the method org.w3c.dom.Document.createCDATASection(String data)
Anything you pass in the data parameter should be wrapped in CDATA in the resulting node.
// Add THIRD child element for CData
Element elementOrderContent = doc.createElement("orderContent");
StringBuffer orderContent = new StringBuffer();
// Note: Removed the <![CDATA[ ]]> from this string concat
orderContent.append("<center><table> <tr><td class=\"one width190\" align=\"left\" valign=\"top\">");
orderContent.append("<span style=\" font-family: Arial, Helvetica, sans-serif; font-size: 12pt; color: #002469;\">");
orderContent.append("Product:</span> </td><tr><td class=\"one\" align=\"left\" valign=\"top\"><span style=\" font-family: Arial, Helvetica, sans-serif; font-size: 12pt; color: #002469;\">");
orderContent.append("<strong>Product title</strong></span></td></tr> </table></center>");
logger.info("The orderContent Element in XML : "+orderContent.toString());
// HERE IS THE UPDATED LINE
Text orderContentText = doc.createCDATASection(orderContent.toString());
logger.debug("Converted Text for Order Content is: "+orderContentText);
elementOrderContent.appendChild(orderContentText);
elementOrder.appendChild(elementOrderContent); //Add third Order Child: OrderContent
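
As a minimal, self-contained sketch of the difference between the two node types, the following uses only the JDK's DOM and javax.xml.transform APIs (not the XMLSerializer from the question). The class and method names (CdataDemo, serialize) and the shortened markup are made up for this illustration. It serializes the same markup once as a text node and once as a CDATA section:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class CdataDemo {
    // Builds <orderContent>...</orderContent> around the markup and returns the serialized XML.
    static String serialize(String markup, boolean asCdata) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element orderContent = doc.createElement("orderContent");
            doc.appendChild(orderContent);
            // A text node gets its markup characters escaped; a CDATA section is written verbatim.
            orderContent.appendChild(asCdata ? doc.createCDATASection(markup) : doc.createTextNode(markup));

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XML serialization failed", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<strong>Product title</strong>";
        System.out.println(serialize(markup, false)); // markup characters are escaped
        System.out.println(serialize(markup, true));  // markup is wrapped in a CDATA section
    }
}
```

With createTextNode, a hand-written <![CDATA[ prefix is treated as ordinary text and escaped like everything else, which is why the marker has to be removed from the string when switching to createCDATASection.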

Replace a substring with a StringBuffer substring

I have a huge string containing a complete HTML page, obtained via Jsoup. I have made changes to a substring of the HTML using the StringBuffer replace API (replace(int start, int end, String str)). The StringBuffer is populated correctly, but when I try to replace the substring of the HTML with the new StringBuffer value, it does not work.
Here is the code snippet:
html = html.replace(divStyle1.trim(), heightwidthM.toString().trim());
The initial big html is
<!DOCTYPE html>
<html xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class="SAF" id="global-header-light">
<head>
</head>
<body>
**<div style="background-image: url(http://aka-cdn-ns.adtech.de/rm/ads/23274/HPWomenLOFT_1381687318.jpg);background-repeat: no-repeat;-webkit-background-size: 1001px 2059px; height: 2059px; width: 1001px; text-align: center; margin: 0 auto;">**
<div style="height:2058px; padding-left:0px; padding-top:36px;">
<iframe style="height:90px; width:728px;"/>
</div>
</div>
</body>
</html>
The divStyle1 string is
background-image: url(http://aka-cdn-ns.adtech.de/rm/ads/23274/HPWomenLOFT_1381687318.jpg);background-repeat: no-repeat;-webkit-background-size: 1001px 2059px; height: 2059px; width: 1001px; text-align: center; margin: 0 auto;
And the String buffer has value
background-image: url(http://aka-cdn-ns.adtech.de/rm/ads/23274/HPWomenLOFT_1381687318.jpg);background-repeat: no-repeat;-webkit-background-size: 1001px 2059px; height:720px; width:900px; text-align: center; margin: 0 auto;
The replacement does not work. Here divStyle1 is a substring of the HTML above (as a String), and heightwidthM is the StringBuffer value that should replace it. It doesn't throw any errors, but it doesn't change anything either.
Thanks
Swaraj

This is very easy with Jsoup:
String html = "<!DOCTYPE html>\n<html xmlns:og=\"http://opengraphprotocol.org/schema/\" xmlns:fb=\"http://www.facebook.com/2008/fbml\" xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\" class=\"SAF\" id=\"global-header-light\">\n<head>\n\n</head>\n<body>\n\n\n**<div style=\"background-image: url(http://aka-cdn-ns.adtech.de/rm/ads/23274/HPWomenLOFT_1381687318.jpg);background-repeat: no-repeat;-webkit-background-size: 1001px 2059px; height: 2059px; width: 1001px; text-align: center; margin: 0 auto;\">** \n\n<div style=\"height:2058px; padding-left:0px; padding-top:36px;\">\n\n\n<iframe style=\"height:90px; width:728px;\"/>\n\n\n\n</div>\n</div>\n\n</body>\n</html>";
String newStyle = "background-image: url(http://aka-cdn-ns.adtech.de/rm/ads/23274/HPWomenLOFT_1381687318.jpg);background-repeat: no-repeat;-webkit-background-size: 1001px 2059px; height:720px; width:900px; text-align: center; margin: 0 auto;";
Document document = Jsoup.parse(html);
document.body().child(0).attr("style", newStyle);
System.out.println(document.html());

Coming back to my suggestion, if you don't mind trying, you can do something like this:
Document newDocument = Jsoup.parse(html, "", Parser.htmlParser()); // html is your HTML string
Elements yourStyles = newDocument.select("div[style]"); // selects all divs that have a style attribute
yourStyles.get(0).attr("style", newStyle); // takes the first such div and sets its style attribute to your new value
System.out.println(newDocument.outerHtml());
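
For the original String.replace attempt, the thread does not establish why nothing changed. One thing worth checking is whether the search string really occurs character for character in the HTML, because String.replace matches literally and returns the input unchanged, without any error, when there is no match. A minimal plain-Java sketch of that check follows. The class and method names (ReplaceCheckDemo, replaceStrict) and the shortened style values are made up for this illustration.

```java
public class ReplaceCheckDemo {
    // Replaces oldText with newText, but fails loudly when oldText does not occur,
    // instead of silently returning the input unchanged as String.replace does.
    static String replaceStrict(String html, String oldText, String newText) {
        if (!html.contains(oldText)) {
            throw new IllegalArgumentException("search text not found in html");
        }
        return html.replace(oldText, newText);
    }

    public static void main(String[] args) {
        String html = "<div style=\"height: 2059px; width: 1001px;\"></div>";

        // Exact match: the style is replaced.
        System.out.println(replaceStrict(html, "height: 2059px; width: 1001px;", "height:720px; width:900px;"));

        // One missing space is enough for the literal match to fail; replace returns the input as-is.
        String unchanged = html.replace("height:2059px; width: 1001px;", "height:720px; width:900px;");
        System.out.println(unchanged.equals(html));
    }
}
```

Setting the attribute through Jsoup, as in the answers above, avoids the problem because it does not depend on the exact text of the serialized HTML.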
