Replace placeholders in an HTML file in Java

I am using the jsoup library to parse an HTML file in Java.
I want to replace the placeholders in that HTML file.
At present I wrap each placeholder in <span id = "id_1"> xx </span> and replace its text.
I have tried many other approaches, without success.
Document doc = Jsoup.parse(new File("abc.html"), "UTF-8");
doc.getElementById("id_1").text("MUKUL");
The placeholders in my HTML file look like <%= name %>, and I want to replace each of them with a suitable value. For now I have changed my HTML file to put the placeholders in <span id = "id_1"> xx </span> tags, but I don't want to change my HTML template.
Can anyone please suggest a cleaner, better way to achieve this?
Why I am not converting it to a String: the HTML file contains some Japanese characters, and whenever I convert it to a String, some of the Japanese characters get distorted and turn into junk data.
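One possible approach, sketched here with only the standard library rather than jsoup: distorted characters of this kind often come from decoding the file's bytes with the platform default charset, so decoding with an explicit UTF-8 charset and then substituting the <%= name %> placeholders with a regex may be enough. The class name PlaceholderFiller, the fill method and the sample values are hypothetical, and the sketch assumes the template is stored as UTF-8 and that placeholder names consist of word characters only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
class PlaceholderFiller {
    // Matches placeholders of the form <%= name %> and captures the name.
    private static final Pattern PLACEHOLDER = Pattern.compile("<%=\\s*(\\w+)\\s*%>");

    static String fill(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Assumption of this sketch: unknown names are left untouched.
            String replacement = values.getOrDefault(m.group(1), m.group());
            // quoteReplacement keeps '$' and '\' in values from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for file contents. For a real file, something like
        // new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
        // decodes the bytes with an explicit charset.
        byte[] bytes = "<p>\u3053\u3093\u306b\u3061\u306f <%= name %></p>"
                .getBytes(StandardCharsets.UTF_8);
        String template = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(fill(template, Collections.singletonMap("name", "MUKUL")));
    }
}
```

Values are inserted verbatim, so anything that may contain < or & would need HTML escaping before it is passed in.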

Related

How to write data to pdf file which contains html tags using itext lib in Java

I have a String that contains some HTML tags and comes from a database. I want to write it to a PDF file with the styling that the HTML tags in the String describe. I tried to use XMLWorkerHelper like this:
String html = "What is the equation of the line passing through the "
        + "point (2,-3) and making an angle of -45<sup>2</sup> with the positive "
        + "X-axis?";
XMLWorkerHelper.getInstance().parseXHtml(writer, document, new StringReader(html));
but it only reads the data inside the HTML tag (in this case only the 2) and simply ignores the rest of the string. I want the entire String with its HTML formatting.
With HTMLWorker it works perfectly, but that class is deprecated, so please let me know how to achieve this.
I am using the iText 5 library.

Avoid removal of spaces and newline while parsing html using jsoup

I have a sample code as below.
String sample = "<html>\n"
        + "<head>\n"
        + "</head>\n"
        + "<body>\n"
        + "This is a sample on parsing html body using jsoup\n"
        + "This is a sample on parsing html body using jsoup\n"
        + "</body>\n"
        + "</html>";
Document doc = Jsoup.parse(sample);
String output = doc.body().text();
I get the output as
This is a sample on parsing html body using jsoup This is a sample on parsing html body using jsoup
But I want the output as
This is a sample on parsing html body using jsoup
This is a sample on parsing html body using jsoup
How do I parse it so that I get this output? Or is there another way to do so in Java?
You can disable pretty printing for the document to get the output you want, but you also have to change .text() to .html().
Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();
The HTML specification requires that multiple whitespace characters are collapsed into a single whitespace. Therefore, when parsing the sample, the parser correctly eliminates the superfluous whitespace characters.
I don't think you can change how the parser works. You could add a preprocessing step where you replace multiple whitespace characters with non-breaking spaces (&nbsp;), which will not collapse. The side effect, of course, is that those spaces would be non-breaking, which doesn't matter if you just want to use the rendered text, as in doc.body().text().

Escape special characters of html string in java

I have some HTML content as a string.
String attachment = "<div style=\"color:black;font-style:normal;font-size:10pt;font-family:verdana;\"><div><span style=\"background-color: rgb(255,255,255);\">This is special \"'; </span></div></div>";
If I try to add this as multipart form data, I get an exception. The cause turns out to be the special characters inside the HTML, namely " and '. So I tried escaping the entire string using
org.apache.commons.lang.StringEscapeUtils.escapeJava(attachment);
After doing this the exception disappeared and everything worked. But the double quotes used for attributes such as style are also escaped by this method, which is not desired.
Instead of <div style="color:black;
it was sent as <div style=\"color:black;
So far I have realized that I need to escape only the text inside the HTML content, not the entire string. I could extract the text content using jsoup or something else and then rebuild the HTML.
But is there a generic easy solution to do this?

Formatting issue with HTML while using JSoup for Java

I'm trying to scrape text off a website with JSoup. I can get the text cleanly (with no formatting at all, just the text), or with all the formatting still attached (i.e. <br> along with <p> and </p>).
However, I can't get the formatted version to include <br/> at all, and that's the only thing that was specifically requested to go along with the text.
For example, I can get this:
<p><br>Worldwide database</p>
and this:
Worldwide database
but I can't get this, which is my desired result:
Worldwide database<br/>
I don't see any <br/> tags while looking at the HTML via the Firebug plugin in Firefox, so I'm wondering if that might be the issue. Or maybe there's an issue with the methods I'm using in my code to pull the text?
Anyways, here's my code:
Elements descriptionHTML = doc.select("div[jsname]"); // <-- Get access to the text w/ JSoup
String descText = descriptionHTML.text(); // <-- Get the text w/o any formatting at all
// This prints out the desired text with the <p><br> and </p>, but no <br/>
for (Element link : descriptionHTML) {
    String jsname = link.attr("jsname");
    if (jsname.equals("C4s9Ed")) {
        System.out.println(link);
        break;
    }
}
I'd really appreciate any help with this issue.
Thanks,
Jack
HTML does not define a closing tag for <br> elements. XHTML, however, requires that the tag be marked as empty: <br />. JSoup parses both, but prints only normal HTML (<br>).
If you use the XML parser in Jsoup, the <br> tags are not closed and so Jsoup tries to guess where to place matching closing tags </br> which are neither HTML nor XHTML compliant.
If you want to keep the line-break information and strip out all other tags, I think you need to program that part outside of Jsoup. You could, for example, replace all <br> and <br /> strings with a unique marker string, say "_brSplitPos_", then parse the document with JSoup, print out the text only, and replace each "_brSplitPos_" with <br />:
String html = "<div>This<br>is<br />a<br>test</div>";
html = html.replaceAll("<br\\s*/?>", "_brSplitPos_"); // matches <br>, <br/> and <br />
Document docH = Jsoup.parse(html);
String onlyText = docH.text();
onlyText = onlyText.replace("_brSplitPos_", "<br />");
System.out.println(onlyText);

Extract text between html tags parsed from xml

Can anyone help me in extracting text from within the html tags to plain text?
I have parsed an XML document and got some output as the body, which contains HTML tags. Now I want to remove the tags and use the text.
Thanks in advance!
You can use an HTML parser like JSoup.
For example, if the HTML is
<div style="height:240px;"><br>test: example<br>test1:example1</div>
You can get the html using
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
div.html();
Try an HTML parser.
If the HTML is escaped, i.e. &lt; instead of <, you might have to decode it first.
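As a rough illustration of that decoding step, here is a minimal sketch that handles only a handful of common entities using plain string replacement. The class name EntityDecoder and the method decodeBasic are hypothetical; for the full entity set a library routine (for example StringEscapeUtils.unescapeHtml from Apache Commons Lang) is likely the better choice.

```java
// Hypothetical helper: decodes a few common HTML entities only.
class EntityDecoder {
    static String decodeBasic(String s) {
        // &amp; is decoded last so that "&amp;lt;" becomes "&lt;" rather than "<".
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&#39;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // Escaped markup as it might arrive inside an XML body.
        String escaped = "&lt;div&gt;test: example&lt;/div&gt;";
        System.out.println(decodeBasic(escaped));
    }
}
```

The decoded string can then be handed to an HTML parser as usual.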
Considering your requirements, you might try the Jericho HTML Parser.
Take a look at the TextExtractor class:
Using the default settings, the source segment:
"<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>"
produces the text "One Two Three".
If all you want to do is remove HTML tags from a string, you can do this:
String output = input.replaceAll("(?s)\\<.*?\\>", " ");
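To make the effect of that one-liner concrete, here is a small self-contained sketch that wraps the same regex in a method. The class name TagStripper and the extra whitespace clean-up are additions for illustration, not part of the answer above.

```java
// Hypothetical wrapper around the regex shown above.
class TagStripper {
    static String strip(String input) {
        // Replace every <...> run with a space; (?s) lets '.' match newlines inside a tag.
        String noTags = input.replaceAll("(?s)<.*?>", " ");
        // Added for illustration: collapse the leftover whitespace and trim the ends.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<div style=\"height:240px;\"><br>test: example<br>test1:example1</div>";
        System.out.println(strip(html)); // prints "test: example test1:example1"
    }
}
```

Because each tag is replaced with a space, words that are split by inline tags come apart: "<b>O</b>ne" becomes "O ne". A regex also knows nothing about comments, scripts or a > inside an attribute value, so a real parser is safer for anything beyond simple fragments.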
