Issue: Jsoup to parse string with < followed by word - java

I'm using Jsoup to parse a string that contains a < followed by a word, but I'm not getting the full text back.
String input ="<p>testing with less than <string</p>";
String s = Jsoup.parse(input).text();
The extracted text is "testing with less than" instead of "testing with less than <string".

String input = "<p>testing with less than <string</p>";
System.out.println(input);
Output:
<p>testing with less than <string</p>
If we print the input, we get the entire string, as shown.
String s1 = Jsoup.parse(input).text();
System.out.println(s1);// when we use method text()
Output:
testing with less than
If we use the jsoup text() method, we get the plain text without HTML tags.
But we still do not get the entire input string, because of the "<" character.
The reason becomes clear in the following example.
String s2 = Jsoup.parse(input).html();
System.out.println(s2);// when we use method html()
Output:
<html>
<head></head>
<body>
<p>testing with less than
<string></string> //the end tag is auto generated by the method
</p>
</body>
</html>
If we use the jsoup html() method, we get the entire formatted HTML.
Here we can clearly see that the word following the "<" character inside another HTML element is parsed as the start of a new HTML tag.
(When only a start tag is present, the parser generates the matching end tag automatically, whether or not the tag name is valid.)
This is why text() does not return the entire input in the first example.
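One possible workaround, sketched here under assumptions, is to escape any "<" that does not start a known tag before handing the string to Jsoup. The class name, the method name, and the small tag whitelist (p, div, span, br, b, i) are hypothetical and would need to be adapted to the markup actually expected in the input.

```java
import java.util.regex.Pattern;

class StrayLessThanEscaper {
    // Assumption: only these tag names are treated as real markup.
    // A "<" not followed by one of them (optionally preceded by "/") is escaped.
    private static final Pattern STRAY_LT =
            Pattern.compile("<(?!/?(?:p|div|span|br|b|i)\\b)");

    // Replaces every stray "<" with the entity "&lt;" and leaves whitelisted tags alone.
    static String escapeStrayLessThan(String html) {
        return STRAY_LT.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        String input = "<p>testing with less than <string</p>";
        // prints: <p>testing with less than &lt;string</p>
        System.out.println(escapeStrayLessThan(input));
    }
}
```

Passing the escaped string to Jsoup.parse(...).text() should then return "testing with less than <string", because the entity is decoded back to "<" in the text output.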

Related

Avoid removal of spaces and newline while parsing html using jsoup

I have a sample code as below.
String sample = "<html>\n"
        + "<head>\n"
        + "</head>\n"
        + "<body>\n"
        + "This is a sample on parsing html body using jsoup\n"
        + "This is a sample on parsing html body using jsoup\n"
        + "</body>\n"
        + "</html>";
Document doc = Jsoup.parse(sample);
String output = doc.body().text();
I get the output as
This is a sample on parsing html body using jsoup This is a sample on parsing html body using jsoup
But I want the output as
This is a sample on parsing html body using jsoup
This is a sample on parsing html body using jsoup
How do I parse it so that I get this output? Or is there another way to do so in Java?
You can disable the pretty printing of your document to get the output like you want it. But you also have to change the .text() to .html().
Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();
The HTML specification requires that multiple whitespace characters are collapsed into a single whitespace. Therefore, when parsing the sample, the parser correctly eliminates the superfluous whitespace characters.
I don't think you can change how the parser works. You could add a preprocessing step where you replace multiple whitespaces with non-breakable spaces (&nbsp;), which will not collapse. The side effect, though, would of course be that those would be, well, non-breakable (which doesn't matter if you really just want to use the rendered text, as in doc.body().text()).

Delete tabulation character from text retrieved with Jsoup

I'm parsing an HTML file using Jsoup. When getting the text of an h1, it also retrieves tabs and newlines.
'Name' is what I'm trying to retrieve from here:
<h1>\n\t\t\tNAME\n\t\t</h1>
I'm trying to get rid of these characters this way:
String name = document.select( "div header > h1" ).first().ownText().replaceAll( "[^a-zA-Z]+", "" ).trim().toUpperCase();
But this is the result:
NTTTTNAMETNTTT
How can I get the text without all the tab and newline characters?
It seems that the html really contains the strings "\t" and "\n" literally. In that case you probably should clean the source prior to parsing. Something like this should do:
String html = Jsoup.connect(URL).userAgent("Mozilla/5.0").execute().body();
html = html.replaceAll("\\\\[nt]", "");
Document doc = Jsoup.parse(html);
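To see why the question's regex leaves stray letters behind while the replacement above does not, here is a small standalone sketch. It assumes, as the answer does, that the text contains the two-character sequences backslash-n and backslash-t rather than real control characters. The class and method names are hypothetical.

```java
class LiteralEscapeCleaner {
    // Removes the literal two-character sequences \n and \t (a backslash followed by a letter).
    static String stripLiteralTabsAndNewlines(String text) {
        return text.replaceAll("\\\\[nt]", "");
    }

    // The approach from the question: drop every non-letter, then upper-case.
    // Only the backslashes are removed, so the letters n and t survive.
    static String lettersOnly(String text) {
        return text.replaceAll("[^a-zA-Z]+", "").trim().toUpperCase();
    }

    public static void main(String[] args) {
        // Java source for the literal characters: \n\t\t\tNAME\n\t\t
        String raw = "\\n\\t\\t\\tNAME\\n\\t\\t";
        System.out.println(lettersOnly(raw));                 // prints: NTTTNAMENTT
        System.out.println(stripLiteralTabsAndNewlines(raw)); // prints: NAME
    }
}
```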

Escape special characters of html string in java

I have HTML content as a string.
String attachment = "<div style=\"color:black;font-style:normal;font-size:10pt;font-family:verdana;\"><div><span style=\"background-color: rgb(255,255,255);\">This is special \"'; </span></div></div>";
If I try to add this as multipart form data I get an exception. The cause turns out to be the special characters inside the HTML, namely " and '. So I tried escaping the entire string using
org.apache.commons.lang.StringEscapeUtils.escapeJava(attachment);
After doing this the exception disappeared and it was working fine. But the double quotes used for the attributes, like style, are also escaped by this method, which is not desired.
Instead of <div style="color:black;
it was sent as <div style=\"color:black;
So far I have realized that I need to escape only the text inside the HTML content and not the entire string. I could extract the text content using jsoup or something else and then build the HTML again.
But is there a generic easy solution to do this?

Formatting issue with HTML while using JSoup for Java

I'm trying to scrape "text" off of a website with JSoup. I can get the text cleanly (with no formatting at all, just the text), or with all the formatting still attached (i.e. < br > along with < p > and < /p >).
However, I can't seem to get the formatted version to include < br/ > to any extent, and that's the only thing that was specifically requested to go along with the text.
For example, I can get this:
<p><br>Worldwide database</p>
and this:
Worldwide database
but I can't get this, which is my desired result:
Worldwide database<br/>
I don't see any < br / >'s while looking at the HTML code via the FireBug plugin on Firefox so I'm wondering if that might be the issue? Or maybe there's an issue with the methods I'm using in my code to pull the text?
Anyways, here's my code:
Elements descriptionHTML = doc.select("div[jsname]"); // <-- Get access to the text w/ JSoup
String descText = descriptionHTML.text(); // <-- Get the text w/o any formatting at all
// This prints out the desired text with the <p><br> and </p>, but no <br/>
for (Element link : descriptionHTML)
{
    String jsname = link.attr("jsname");
    if (jsname.equals("C4s9Ed")) {
        System.out.println(link);
        break;
    }
}
I'd really appreciate any help with this issue.
Thanks,
Jack
HTML does not define a closing tag for <br> elements. XHTML however requires that the tag is marked as empty: <br />. JSoup parses both, but will print out only normal HTML (<br>).
If you use the XML parser in Jsoup, the <br> tags are not closed and so Jsoup tries to guess where to place matching closing tags </br> which are neither HTML nor XHTML compliant.
If you want to keep the line break info and strip out all other tags, I think you need to program that part outside of Jsoup. You could for example replace all <br> and <br /> strings with a unique placeholder string, say "_brSplitPos_", then parse the document with JSoup, print out the text only, and replace the "_brSplitPos_" with <br />:
String html = "<div>This<br>is<br />a<br>test</div>";
html = html.replaceAll("<br(?:\\s+/)?>", "_brSplitPos_");
Document docH = Jsoup.parse(html);
String onlyText = docH.text();
onlyText = onlyText.replace("_brSplitPos_", "<br />");
System.out.println(onlyText);
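The regex in the snippet above matches <br> and <br /> but not <br/> or upper-case <BR>. If those variants can occur in the input (an assumption about the scraped pages), a broader pattern can be used for the placeholder step. The sketch below covers only the string handling around the Jsoup call. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class BrPlaceholder {
    static final String PLACEHOLDER = "_brSplitPos_";

    // Matches <br>, <br/>, <br /> in any letter case.
    private static final Pattern BR =
            Pattern.compile("<br\\s*/?>", Pattern.CASE_INSENSITIVE);

    // Step 1: swap every line-break tag for the placeholder before parsing.
    static String markBreaks(String html) {
        return BR.matcher(html).replaceAll(PLACEHOLDER);
    }

    // Step 2: after extracting plain text, turn the placeholder back into <br />.
    static String restoreBreaks(String text) {
        return text.replace(PLACEHOLDER, "<br />");
    }

    public static void main(String[] args) {
        String html = "<div>This<br>is<br />a<BR/>test</div>";
        // prints: <div>This_brSplitPos_is_brSplitPos_a_brSplitPos_test</div>
        System.out.println(markBreaks(html));
    }
}
```

In the flow described above, the output of markBreaks would be passed to Jsoup.parse(...).text(), and the resulting text to restoreBreaks.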

How to ignore <BR> tag while printing in console

I am trying to print a table in which one column heading is "Assigned <BR> On". When I print this heading to the console, the BR tag breaks the table layout: everything up to "Assigned" is printed on one line, and "On" drops to the next line. I am using Selenium WebDriver, and my goal is to print the table with the same structure it has on the web page. I have not found a way to ignore the <BR> tag.
A regex replacement on the string may help:
String testString = "abc/d <BR> ff</BR>";
testString = testString.replaceAll("<BR>|</BR>", "");
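A slightly broader variant, given here as a sketch only, assumes the heading string may contain either literal BR markup in any case or the line break it renders as. It replaces both with a space and collapses the result so the heading stays on one console line. The class and method names are hypothetical.

```java
class HeadingFlattener {
    // Replaces <br>, <BR>, <br/>, </BR> and similar with a space,
    // then collapses all whitespace runs (including newlines) into single spaces.
    static String flattenHeading(String raw) {
        return raw.replaceAll("(?i)</?br\\s*/?>", " ")
                  .replaceAll("\\s+", " ")
                  .trim();
    }

    public static void main(String[] args) {
        System.out.println(flattenHeading("Assigned <BR> On")); // prints: Assigned On
        System.out.println(flattenHeading("Assigned\nOn"));     // prints: Assigned On
    }
}
```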
