Why am I getting extra line breaks when converting HTML to text? - java

I am using Jsoup to convert an HTML string to plain text. I want to preserve the line breaks and ignore the HTML tags, but the converted text contains extra empty lines that throw off my string.
String htmlString = "<p>Hello this is a description. </p><p>I know Just checking how it looks.</p><p></p><p><code>Add a line.</code></p><p>This is a notmal line <span style=\"color:#F9931A\">Adding orange</span></p><ul><li><p>one </p></li><li><p>two</p></li></ul>";
HtmlToPlainText convert = new HtmlToPlainText();
Document html = Jsoup.parse(htmlString,"", Parser.xmlParser());
String plainText = convert.getPlainText(html);
System.out.println("This is the description: " + plainText);
OUTPUT:
Hello this is a description.
I know Just checking how it looks.
Add a line.
This is a notmal line Adding orange
* one
* two
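One possible post-processing step, sketched with the standard library only: collapse runs of blank lines in the converted text. This is an assumption about what "extra empty lines" means here, and the class and method names are hypothetical. It does not change how Jsoup or HtmlToPlainText produce the text, and it also removes blank lines you may have wanted to keep between paragraphs.

```java
public class BlankLineCollapser {

    // Hypothetical helper: removes blank (empty or whitespace-only) lines
    // and trims leading/trailing whitespace from the whole text.
    static String collapseBlankLines(String text) {
        return text
                .replaceAll("\\r\\n?", "\n")        // normalize line endings to \n
                .replaceAll("(?m)^[ \\t]+$", "")    // empty out whitespace-only lines
                .replaceAll("\\n{2,}", "\n")        // squeeze consecutive newlines into one
                .trim();
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by convert.getPlainText(html)
        String raw = "\nHello this is a description.\n\n\nI know Just checking how it looks.\n \n * one\n * two\n";
        System.out.println(collapseBlankLines(raw));
    }
}
```

If paragraph breaks should survive, squeezing to two newlines instead of one (replace "\n" with "\n\n" in the third call) would keep a single empty line between blocks.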

Related

XPATH - getAttribute() and writer.append() to CSV file (Selenium/Selenide)

I'm trying to get the "title" attribute value from the element below and save it in a CSV file:
<img src="images/i.png" title="Uwagi: łacina, nieczytelne
Data urodzenia: 25.02.1808 r.">
Whole html here.
I located the element with the XPath below (it works):
SelenideElement uwagi = $(By.xpath("//div[@id='table_b_wrapper']//table[@id='table_b']//tbody//tr[1]//img[contains(@title,'Uwagi')]"));
//tr[1] is just one example row from this table; the XPath itself is fine.
Then I tried to write it to my CSV file with:
writer.append(uwagi+";"); //using ; as separator
The problem is that this value, "Uwagi: łacina, nieczytelne
Data urodzenia: 25.02.1808 r.",
is split into two parts that are saved as separate cells, like here
I need the whole value in one cell (i.e. the J1731 and A1732 values should be a single cell).
Strangely, when I ran System.out.println(uwagi.getAttribute("title"));
only the second part of the attribute value (Data urodzenia: 25.02.1808 r.) was displayed in the console.
How can I save this title attribute value as one cell in the CSV?
Regards
Tomes
Remove the newline character from the title. The code below replaces \n (the newline character) with a single space, which is what your shared HTML needs.
Also, in Selenide you can use $x for XPath selectors:
SelenideElement uwagi = $x("//table[@id='table_b']//tr[@role='row'][1]//img[contains(@title,'Uwagi')]");
//using css selector
uwagi = $("#table_b tr[role='row'] img[title^='Uwagi']");
//or even shorter
uwagi = $("#table_b img[title^='Uwagi']");
String uwagiTitle = uwagi.getAttribute("title").replace("\n", " ");
writer.append(uwagiTitle+";");
I've found a solution. I changed:
FileWriter writer = new FileWriter(pathString, Charset.forName("Cp1250"));
to
CsvWriter writer = new CsvWriter(pathString, ';', Charset.forName("Cp1250"));
using also:
<dependency>
<groupId>net.sourceforge.javacsv</groupId>
<artifactId>javacsv</artifactId>
<version>2.0</version>
</dependency>
Based on the info from: link
Then I changed writer.append to writer.write.
The rest is the same:
...
SelenideElement xxx = $x("//img[contains(@title,'Uwagi')]");
String str = xxx.getAttribute("title");
writer.write(str);
...
Result: picture
Regards
Tomes
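For readers who want to see why a CSV writer keeps the value in one cell: CSV libraries typically wrap a field in double quotes when it contains the separator, a quote, or a line break, and double any embedded quotes (the convention described in RFC 4180). Below is a minimal hand-rolled sketch of that quoting, assuming ';' as the separator as in the question. The class and method names are hypothetical, this is not the javacsv implementation, and whether a quoted line break shows up as a single cell still depends on the program that opens the file.

```java
public class CsvField {

    // Hypothetical helper: quote a field if it contains the separator,
    // a double quote, or a line break; embedded quotes are doubled.
    static String escape(String value, char separator) {
        boolean needsQuotes = value.indexOf(separator) >= 0
                || value.indexOf('"') >= 0
                || value.indexOf('\n') >= 0
                || value.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return value;
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // The title attribute from the question, including its line break
        String title = "Uwagi: łacina, nieczytelne\nData urodzenia: 25.02.1808 r.";
        // With a plain FileWriter this would be writer.append(escape(title, ';') + ";")
        System.out.println(escape(title, ';') + ";");
    }
}
```

Replacing the line break with a space, as in the first answer, avoids the issue entirely; quoting keeps the original value intact instead.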

JSoup - How to parse nested texts?

I'm parsing html of a website with JSoup. I want to parse this part:
<td class="lastpost">
This is a text 1<br>
<a href="post/13594">Website Page - 1</a>
</td>
I want the result to look like this:
String text = "This is a text 1";
String textNo = "Website Page - 1";
String link = "post/13594";
How can I get the parts like this?
Your code would only get all the text that is in the td elements that you are selecting. If you want to store the text in separate variables, you should grab the parts separately like the following code. Extra comments added so you can understand how/why it is getting each piece.
// Get the first td element that has class="lastpost"
Element lastPost = document.select("td.lastpost").first();
// Get the first a element that is a child of the td
Element linkElement = lastPost.getElementsByTag("a").first();
// This text is the first child node of td, get that node and call toString
String text = lastPost.childNode(0).toString();
// This is the text within the a (link) element
String textNo = linkElement.text();
// This text is the href attribute value of the a (link) element
String link = linkElement.attr("href");

Sanitize HTML string

I have an HTML string like:
<p dir="ltr"><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u>bold</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u> </u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u>all</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u> </u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u>in</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u> </u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u>one</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></p>
I want to sanitize the HTML so that it looks like <b><i><u> bold all in one </b></i></u>
I tried this: webText = webText.replaceAll("(</?(?:b|i|u)>)\\1+", "$1").replaceAll("</(b|i|u)><\\1>", "");
But it does not help; the HTML remains cluttered. How can I fix this? Is there any other regex or JSON way?
Regex may help here, but in general regular expressions do not work well as an HTML parser once things get complex. Jsoup is a great HTML library, and I can really recommend it.
Unfortunately your HTML is still valid HTML, so the solution is tricky.
A good place to start is the Jsoup documentation, especially the part on its selector syntax.
Here's something to start with:
final String html = ... // your html from above
// Parse the html string into a document
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
/*
* Select all elements, which ...
*
* (a) have text (i.e. are not empty)
* (b) have no children of their own
*
* Iterate over those found and print them.
*/
for (Element element : doc.select("*:matches(^..+?$):not(:has(*))")) {
    System.out.println(element);
}
Result:
<u>bold</u>
<u>all</u>
<u>in</u>
<u>one</u>
If you need literally <b><i><u> bold all in one </b></i></u>:
final String html = ... // your html from above
// As above
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
// All text of the document
String text = doc.text();
// Create an element and its children
Element element = new Element(Tag.valueOf("b"), "");
element.appendElement("i").appendElement("u").text(text);
System.out.println(element);
Result:
<b><i><u>bold all in one</u></i></b>
You could try the method below to remove unwanted HTML tags (Html here is Android's android.text.Html):
public String stripHtml(String html) {
    return Html.fromHtml(html).toString();
}

What is wrong with the printing? It stops at some point and does not print the rest

Here is my code to delete everything inside the <>.
public static void main(String[] args) throws FileNotFoundException {
    Scanner console = new Scanner(System.in);
    Scanner Theinput = GetUserInput(console);
    while (Theinput.hasNextLine()) {
        String Input = Theinput.nextLine();
        Scanner text = new Scanner(Input);
        if (text.hasNext()) {
            String MyNewText = Input;
            while (MyNewText.contains("<") || MyNewText.contains(">")) {
                int Max = MyNewText.indexOf(">");
                int Min = MyNewText.indexOf("<");
                String Replacement = "";
                String ToReplacement = MyNewText.substring(Min, Max + 1);
                MyNewText = MyNewText.replaceAll(ToReplacement, Replacement);
            }
            System.out.println(MyNewText);
        }
        else {
            System.out.println();
        }
    }
}
I am basically trying to convert this text:
<HEAD>
<TITLE>Basic HTML Sample Page</TITLE>
</HEAD>
<BODY BGCOLOR="WHITE">
<CENTER>
<H1>A Simple Sample Web Page</H1>
<IMG SRC="http://sheldonbrown.com/images/scb_eagle_contact.jpeg">
<H4>By Sheldon Brown</H4>
<H2>Demonstrating a few HTML features</H2>
</CENTER>
HTML is really a very simple language. It consists of ordinary text, with
commands that are enclosed by "<" and ">" characters, or bewteen an "&" and a ";". <P>
You don't really need to know much HTML to create a page, because you can copy bits
of HTML from other pages that do what you want, then change the text!<P>
This page shows on the left as it appears in your browser, and the corresponding HTML
code appears on the right. The HTML commands are linked to explanations of what they do.
<H3>Line Breaks</H3>
HTML doesn't normally use line breaks for ordinary text. A white space of any size is
treated as a single space. This is because the author of the page has no way of knowing
the size of the reader's screen, or what size type they will have their browser set for.<P>
If you want to put a line break at a particular place, you can use the "<BR>" command,
or, for a paragraph break, the "<P>" command, which will insert a blank line.
The heading command ("<4></4>") puts a blank line above and below the heading text.
<H4>Starting and Stopping Commands</H4>
Most HTML commands come in pairs: for example, "<H4>" marks the beginning of a size 4
heading, and "</H4>" marks the end of it. The closing command is always the same as the
opening command, except for the addition of the "/".<P>
Modifiers are sometimes included along with the basic command, inside the opening
command's < >. The modifier does not need to be repeated in the closing command.
<H1>This is a size "1" heading</H1>
<H2>This is a size "2" heading</H2>
<H3>This is a size "3" heading</H3>
<H4>This is a size "4" heading</H4>
<H5>This is a size "5" heading</H5>
<H6>This is a size "6" heading</H6>
<center>
<H4>Copyright ?1997, by
Sheldon Brown
</H4>
If you would like to make a link or bookmark to this page, the URL is:
<BR> http://sheldonbrown.com/web_sample1.html</body>
My output stops at the point shown below; everything before it works.
I have no idea what happened to the println():
This is a size "5" heading
This i
String out = "<TITLE>Basic HTML Sample Page</TITLE>".replaceAll("</?[a-zA-Z0-9]+?>", "");
System.out.println(out);
You can try a regular expression like this, but it can't handle something like
<a
href="http://google.com"
target="_blank"
>google</a>
Maybe you should consider using a parser, for example Jsoup.
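If you want to stay with the standard library, a slightly broader pattern is one option. The sketch below assumes a tag is simply "<", then anything that is not ">", then ">". Because [^>] also matches line breaks, a tag with attributes spread over several lines is removed too, provided the whole text is passed in as one string rather than line by line. The class and method names are hypothetical. It is still not a parser: it will also delete quoted text such as "<" and ">" in ordinary prose, and it does not understand comments, scripts, or entities.

```java
import java.util.regex.Pattern;

public class TagStripper {

    // "<", then any run of characters other than ">", then ">".
    // The negated class [^>] matches newlines as well.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Hypothetical helper: removes everything that looks like a tag.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String multiLine = "<a\nhref=\"http://google.com\"\ntarget=\"_blank\"\n>google</a>";
        System.out.println(stripTags(multiLine));
        System.out.println(stripTags("<TITLE>Basic HTML Sample Page</TITLE>"));
    }
}
```

Using a compiled Pattern with a fixed expression also avoids passing text taken from the input to replaceAll as if it were a regular expression.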

jsoup clean includes unwanted carriage return

This is currently vexing me.
Jsoup is including an extra line break in the returned string if the string includes <br />
eg.
String html ="TEST<br />TEST";
Jsoup.clean(html, org.jsoup.safety.Whitelist.basic());
returns
TEST\n<br />TEST
Any advice on how to avoid the inclusion of the troublesome \n?
Have you tried .text() or .ownText() from the Element class?
//If you want the whole page
String url = "http://www.yourwebsite.com";
Document doc = Jsoup.connect(url).get();
System.out.println(doc.text());
//If you want some specific part of the page
Elements elems = doc.select("query");
for (Element element : elems) {
    System.out.println(element.text() + "\n");
    System.out.println(element.ownText() + "\n\n");
}
Suppose each element is <p>Hello<b> there</b> now!</p>
The method text() would return Hello there now!
The method ownText() would return Hello now!
To make it easier to understand: text() returns all the text within the tag you selected, including the text of its children. ownText() returns only the text that belongs to the tag itself, not the text from its children.
About the query in doc.select("query");, you can search here for any pattern you want.
Cleaner cleaner = new Cleaner(WHITE_LIST);
Document clean = cleaner.clean(body);
Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.prettyPrint(false);
clean.outputSettings(outputSettings);
return clean.body().html();
outputSettings.prettyPrint(false);
