using Jsoup to extract a table inside several divs

using Jsoup to extract a table inside several divs - java

I am trying to use jsoup so as to have access to a table embedded inside multiple div's of an html page.The table is under the outer division with id "content-top". I will give the inner divs leading to the table: content-top -> center -> middle-right-col -> result .
Under the div result; is table round. This is the table that i want to access and whose rows I need to traverse and print out the data contained in them. Below is the java code I have been trying to use but yielding no results :
Document doc = Jsoup.connect("http://www.calculator.com/#").data("express", "sin(x)").data("calculate","submit").post();
// give the application time to calculate result before retrieving result from results table
try {
Thread.sleep(10000);
}
catch(InterruptedException ex)
{
Thread.currentThread().interrupt();
}
Elements content = doc.select("div#result") ;
Element tables = content.get(0) ;
Elements table_rows = tables.select("tr") ;
Iterator iterRows = table_rows.iterator();
while (iterRows.hasNext()) {
Element tr = (Element)iterRows.next();
Elements table_data = tr.select("td");
Iterator iterData = table_data.iterator();
int tdCount = 0;
String f_x_value = null;
String result = null;
// process new line
while (iterData.hasNext()) {
Element td = (Element)iterData.next();
switch (tdCount++) {
case 1:
f_x_value = td.text();
f_x_value = td.select("a").text();
break;
case 2:
result = td.text();
result = td.select("a").text();
break;
}
}
System.out.println(f_x_value + " " + result ) ;
}
The above code crashes and hardly does what I want it to do. PLEASE CAN ANYONE PLEASE HELP ME !!!

public static String do_conversion (String str)
{
char c;
String output = "{";
for(int i = 0; i < str.length(); i++)
{
c = str.charAt(i);
if(c=='e')
output += "{mathrm{e}}";
else if(c=='(')
output += '{';
else if(c==')')
output += '}';
else if(c=='+')
output += "{cplus}";
else if(c=='-')
output += "{cminus}";
else if(c=='*')
output += "{cdot}";
else if(c=='/')
output += "{cdivide}";
else output += c; // else copy the character normally
}
output += ", mathrm{d}x}";
return output;
}
#Syam S

The page doesnt directly give you a table in a div with id as "result". It uses an ajax class to a php file and get the process done. So what you need to do here is to first build a json like
{"expression":"sin(x)","intVar":"x","upperBound":"","lowerBound":"","simplifyExpressions":false,"latex":"\\displaystyle\\int\\limits^{}_{}{\\sin\\left(x\\right)\\, \\mathrm{d}x}"}
The expression key hold the expression that you want to evaluate, the latex is a mathjax expression and then post it to int.php. This expects two arguments namely q which is the above json and v which seems to a constant value 1380119311. I didnt understand what this is.
Now this will return a response like
<html>
<head></head>
<body>
<table class="round">
<tbody>
<tr class="">
<th>$f(x) =$</th>
<td>$\sin\left(x\right)$</td>
</tr>
<tr class="sep odd">
<th>$\displaystyle\int{f(x)}\, \mathrm{d}x =$</th>
<td>$-\cos\left(x\right)$</td>
</tr>
</tbody>
</table>
<!-- Finished in 155 ms -->
<p id="share"> <img src="layout/32x32xshare.png.pagespeed.ic.i3iroHP5fI.png" width="32" height="32" /> <a id="share-link" href="http://www.integral-calculator.com/#expr=sin%28x%29" onclick="window.prompt("To copy this link to the clipboard, press Ctrl+C, Enter.", $("share-link").href); return false;">Direct link to this calculation (for sharing)</a> </p>
</body>
</html>
The table in this expression gives you the result and the site uses mathjax to display it like
A sample program would be
import java.io.IOException;
import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupParser6 {
public static void main(String[] args) {
try {
// Integral
String url = "http://www.integral-calculator.com/int.php";
String q = "{\"expression\":\"sin(4x) * e^(-x)\",\"intVar\":\"x\",\"upperBound\":\"\",\"lowerBound\":\"\",\"simplifyExpressions\":false,\"latex\":\"\\\\displaystyle\\\\int\\\\limits^{}_{}{\\\\sin\\\\left(4x\\\\right){\\\\cdot}{\\\\mathrm{e}}^{-x}\\\\, \\\\mathrm{d}x}\"}";
Document integralDoc = Jsoup.connect(url).data("q", q).data("v", "1380119311").post();
System.out.println(integralDoc);
System.out.println("\n*******************************\n");
//Differential
url = "http://www.derivative-calculator.net/diff.php";
q = "{\"expression\":\"sin(x)\",\"diffVar\":\"x\",\"diffOrder\":1,\"simplifyExpressions\":false,\"showSteps\":false,\"latex\":\"\\\\dfrac{\\\\mathrm{d}}{\\\\mathrm{d}x}\\\\left(\\\\sin\\\\left(x\\\\right)\\\\right)\"}";
Document differentialDoc = Jsoup.connect(url).data("q", q).data("v", "1380119305").post();
System.out.println(differentialDoc);
System.out.println("\n*******************************\n");
//Calculus
url = "http://calculus-calculator.com/calculation/integrate.php";
Document calculusDoc = Jsoup.connect(url).data("expression", "sin(x)").data("intvar", "x").post();
String outStr = StringEscapeUtils.unescapeJava(calculusDoc.toString());
Document formattedOutPut = Jsoup.parse(outStr);
formattedOutPut.body().html(formattedOutPut.select("div.isteps").toString());
System.out.println(formattedOutPut);
} catch (IOException e) {
e.printStackTrace();
}
}
}
Update based on comment.
The unescape works perfectly well. In MathJax you could right click and view the command. So if you go to your site http://calculus-calculator.com/ and try the sin(x) equation there and right click the result and view TexCommand like
The you could see the commands are exactly the ones which we get after unsescape. The demo site is not rendering it. May be a limitation of the demo site, thats all.

Related

Missing Table Elements When Scraping

URL: https://stats.nba.com/player/1628381/defense-dash/
Attempting to get:
`<table>
<tbody>
<!----><tr data-ng-repeat="(i, row) in page" index="0">
<td class="player">Overall</td>
<td>45</td>
<td>45</td>
<td>5.7</td>
<td>12.3</td>
<td>46.6</td>
<td>100%</td>
<td>46.7</td>
<td>-0.1</td>
</tr><!---->
</tbody>
</table> `
My coding:
public static void getData(String url, String Name, int ID) throws
IOException
{
String html = Jsoup.connect(url).execute().body();
html = html.replaceAll("<!---->", "");
html = html.replaceAll("<!--", "");
html = html.replaceAll("-->", "");
Document doc = Jsoup.parse(html);
Elements tableElements = doc.select("table");
System.out.println("Elements " + tableElements);
for (Element tableElement : tableElements)
{
String tableId = tableElement.id();
if (tableId.isEmpty()) {
continue;
}
String fileName = "table" + Name + tableId + ID + ".csv";
System.out.println(fileName);
FileWriter writer = new FileWriter(new File("C:\\Users\\noman\\eclipse-workspace\\Senior Project\\src\\", fileName));
//System.out.println(doc);
Elements tableRowElements = tableElement.select(":not(thead) tr td");
for (int i = 0; i < tableRowElements.size(); i++) {
Element row = tableRowElements.get(i);
Elements rowItems = row.select("td");
for (int j = 0; j < rowItems.size(); j++) {
writer.append(rowItems.get(j).text());
if (j != rowItems.size() - 1) {
writer.append(',');
}
}
writer.append('\n');
}
Problem is no elements are being found. this same code works on another site perfectly which (seemingly) no differences in how they store data
Is there something different with this website preventing web-scraping? or a subtle difference maybe?
Please note HTML code provided is a shorten version

As said at the comments, the data you are looking for is loaded dynamically, but, you can fetch it with a simple GET request from this link -
https://stats.nba.com/stats/playerdashptshotdefend?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0&PlayerID=1628381&Season=2018-19&SeasonSegment=&SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision=
EDIT
To find this link I've used the browser's developer tools and checked for xhr requests.
You can see that the link includes several parameters, among them the playerID which is identical to the number that appears in your intial link. By changing its value you can get stats of other players.

How to generate XPath query matching a specific element in Jsoup?

_ Hi , this is my web page :
<html>
<head>
</head>
<body>
<div> text div 1</div>
<div>
<span>text of first span </span>
<span>text of second span </span>
</div>
<div> text div 3 </div>
</body>
</html>
I'm using jsoup to parse it , and then browse all elements inside the page and get their paths :
Document doc = Jsoup.parse(new File("C:\\Users\\HC\\Desktop\\dataset\\index.html"), "UTF-8");
Elements elements = doc.body().select("*");
ArrayList all = new ArrayList();
for (Element element : elements) {
if (!element.ownText().isEmpty()) {
StringBuilder path = new StringBuilder(element.nodeName());
String value = element.ownText();
Elements p_el = element.parents();
for (Element el : p_el) {
path.insert(0, el.nodeName() + '/');
}
all.add(path + " = " + value + "\n");
System.out.println(path +" = "+ value);
}
}
return all;
my code give me this result :
html/body/div = text div 1
html/body/div/span = text of first span
html/body/div/span = text of second span
html/body/div = text div 3
in fact i want get result like this :
html/body/div[1] = text div 1
html/body/div[2]/span[1] = text of first span
html/body/div[2]/span[2] = text of second span
html/body/div[3] = text div 3
please could any one give me idea how to get reach this result :) . thanks in advance.

As asked here a idea.
Even if I'm quite sure that there better solutions to get the xpath for a given node. For example use xslt as in the answer to "Generate/get xpath from XML node java".
Here the possible solution based on your current attempt.
For each (parent) element check if there are more than one element with this name.
Pseudo code: if ( count (el.select('../' + el.nodeName() ) > 1)
If true count the preceding-sibling:: with same name and add 1.
count (el.select('preceding-sibling::' + el.nodeName() ) +1

This is my solution to this problem:
StringBuilder absPath=new StringBuilder();
Elements parents = htmlElement.parents();
for (int j = parents.size()-1; j >= 0; j--) {
Element element = parents.get(j);
absPath.append("/");
absPath.append(element.tagName());
absPath.append("[");
absPath.append(element.siblingIndex());
absPath.append("]");
}

This would be easier, if you traversed the document from the root to the leafs instead of the other way round. This way you can easily group the elements by tag-name and handle multiple occurences accordingly. Here is a recursive approach:
private final List<String> path = new ArrayList<>();
private final List<String> all = new ArrayList<>();
public List<String> getAll() {
return Collections.unmodifiableList(all);
}
public void parse(Document doc) {
path.clear();
all.clear();
parse(doc.children());
}
private void parse(List<Element> elements) {
if (elements.isEmpty()) {
return;
}
Map<String, List<Element>> grouped = elements.stream().collect(Collectors.groupingBy(Element::tagName));
for (Map.Entry<String, List<Element>> entry : grouped.entrySet()) {
List<Element> list = entry.getValue();
String key = entry.getKey();
if (list.size() > 1) {
int index = 1;
// use paths with index
key += "[";
for (Element e : list) {
path.add(key + (index++) + "]");
handleElement(e);
path.remove(path.size() - 1);
}
} else {
// use paths without index
path.add(key);
handleElement(list.get(0));
path.remove(path.size() - 1);
}
}
}
private void handleElement(Element e) {
String value = e.ownText();
if (!value.isEmpty()) {
// add entry
all.add(path.stream().collect(Collectors.joining("/")) + " = " + value);
}
// process children of element
parse(e.children());
}

Here is the solution in Kotlin. It's correct, and it works. The other answers are wrong and caused me hours of lost work.
fun Element.xpath(): String = buildString {
val parents = parents()
for (j in (parents.size - 1) downTo 0) {
val parent = parents[j]
append("/*[")
append(parent.siblingIndex() + 1)
append(']')
}
append("/*[")
append(siblingIndex() + 1)
append(']')
}

Java Jsoup - Element isn't removed from Elements

I will start from beginning, there's html with pattern like this:
<div id="post_message_(some numeric id)">
<div style="some style things">
<div class="smallfont" style="some style">useless text</div>
<table cellpading="6" cellspaceing=.......> a lot of text inside i dont need</table>
</div>
Text i need
</div>
those div's with styles and that table is optional, sometimes there's just
<div id="post">
Text i need
</div>
And i want to parse that text to String. Here;s the code I'm using
Elements divsInside = element.getElementById("post_message_" + id).getElementsByTag("div");
for(Element div : divsInside) {
if(div != null && div.attr("style").equals("margin:20px; margin-top:5px; ")) {
System.out.println(div.html());
div.remove();
System.out.println("div removed");
}
}
I added those print lines to check if it finds them and yes, it does find correct ones, but later when I'm parsing it to String:
String message = Jsoup.parse(divsInside.html().replaceAll("(?i)<br[^>]*>", "br2n")).text()
.replaceAll("br2n", "\n");
String contains all that removed stuff again for some reasons.
I tried removing them by iterators, or making full for and removing elements by indexes, buut the result is the same.

So you want to get Text i need. Use Element's ownText() method which Gets the text owned by this element only; does not get the combined text of all children.
private static void test(String htmlFile) {
File input = null;
Document doc = null;
Element specificIdDiv = null;
try {
input = new File(htmlFile);
doc = Jsoup.parse(input, "ASCII", "");
doc.outputSettings().charset("ASCII");
doc.outputSettings().escapeMode(EscapeMode.base);
/** Get Element id = post_message_1 **/
specificIdDiv = doc.getElementById("post_message_1");
if (specificIdDiv != null ) {
System.out.println("content: " + specificIdDiv.ownText());
}
} catch (Exception e) {
e.printStackTrace();
}
}

Selectable Table in JSP, JavaScript

I'm creating a betting site using JSP and Servlets.
I need to create a selectable table of betting coefficients.
It looks like:
The scenario is:
User presses on the coefficients and makes a bet that can consist of different matches. I need to make the cells with coefficients selectable and allow only one selection in a row. Then i should get all the coefficients that were selected and put them into the request and do some stuff with them in my servlet.
Can I do this using JSP, Servlets and HTML ? Or I need some javascript code?
I know almost nothing about javascript, so some links or small code listings would help me a lot.
Thanks in advance

There are two things you must do: 1. Set your table id as table 2.
On each row declare onclick = 'choose(n);', where n the order of
rows beginning from 1; If you know JSP hope you know how make n
increment by each row;
And pay Attention I am setting trLength by subtracting 1 from rows.length, only when there is header on table. So there must be header anyway, otherwise code won't work
<!DOCTYPE html>
<html>
<body onkeydown="chooseNext(event);">
<table id="table">
<th>name</th>
<th>lastname</th>
<th>score</th>
<tr onclick="choose(1);">
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr onclick="choose(2);">
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
<tr onclick="choose(3);">
<td>John</td>
<td>Doe</td>
<td>80</td>
</tr>
</table>
<script>
/**
*
*
*
* #author Abdullaev Omonullo, 2015
*/
var IC = 1;
function setRow(value) {
IC = value;
}
function getRow() {
return IC;
}
/**** Initialization ****/
var rows = document.getElementById('table').getElementsByTagName('tr');
var trLength = rows.length - 1;
try {
highlightAll();
} catch (error) {
alert('error on getting table: ' + error);
}
choose(getRow());
function highlightAll() {
var i = 1;
for (; i <= trLength; i++) {
highlightByIndex(i);
}
}
/****END Initilization END****/
function chooseNext(event) {
if (event.keyCode == 38) {
goUp();
} else if (event.keyCode == 40) {
goDown();
}
}
function goUp() {
if (IC != 1) {
choose(IC - 1);
}
}
function goDown() {
if (IC != trLength) {
choose(IC + 1);
}
}
function highlight(rowIndex) {
var row = rows[rowIndex];
var oldRow = rows[rowIndex];
highlightByIndex(parseInt(IC));
row.style.backgroundColor = "aqua";
IC = rowIndex;
}
function highlightByIndex(i) {
var row = rows[i];
var colors = [ '#C9EAFF', '#E8F6FF' ];
row.style.backgroundColor = colors[i % 2];
}
function choose(rowIndex) {
highlight(rowIndex);
}
</script>

you need to use javascript, try datatables it has the ability to do what you need.
http://www.datatables.net/examples/server_side/select_rows.html

How do I preserve line breaks when using jsoup to convert html to plain text?

I have the following code:
public class NewClass {
public String noTags(String str){
return Jsoup.parse(str).text();
}
public static void main(String args[]) {
String strings="<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
"<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> googlez</p></BODY> </HTML> ";
NewClass text = new NewClass();
System.out.println((text.noTags(strings)));
}
And I have the result:
hello world yo googlez
But I want to break the line:
hello world
yo googlez
I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.
If there's a <br> in the markup I parse, how can I get a line break in my resulting output?

The real solution that preserves linebreaks should be like this:
public static String br2nl(String html) {
if(html==null)
return html;
Document document = Jsoup.parse(html);
document.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
document.select("br").append("\\n");
document.select("p").prepend("\\n\\n");
String s = document.html().replaceAll("\\\\n", "\n");
return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}
It satisfies the following requirements:
if the original html contains newline(\n), it gets preserved
if the original html contains br or p tags, they gets translated to newline(\n).

With
Jsoup.parse("A\nB").text();
you have output
"A B"
and not
A
B
For this I'm using:
descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = descrizione.replaceAll("br2n", "\n");

Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
We're using this method here:
public static String clean(String bodyHtml,
String baseUri,
Whitelist whitelist,
Document.OutputSettings outputSettings)
By passing it Whitelist.none() we make sure that all HTML is removed.
By passsing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and line breaks are preserved.

On Jsoup v1.11.2, we can now use Element.wholeText().
String cleanString = Jsoup.parse(htmlString).wholeText();
user121196's answer still works. But wholeText() preserves the alignment of texts.

Try this by using jsoup:
public static String cleanPreserveLineBreaks(String bodyHtml) {
// get pretty printed html with preserved br and p tags
String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
// get plain text with preserved line breaks by disabled prettyPrint
return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
}

For more complex HTML none of the above solutions worked quite right; I was able to successfully do the conversion while preserving line breaks with:
Document document = Jsoup.parse(myHtml);
String text = new HtmlToPlainText().getPlainText(document);
(version 1.10.3)

You can traverse a given element
public String convertNodeToText(Element element)
{
final StringBuilder buffer = new StringBuilder();
new NodeTraversor(new NodeVisitor() {
boolean isNewline = true;
#Override
public void head(Node node, int depth) {
if (node instanceof TextNode) {
TextNode textNode = (TextNode) node;
String text = textNode.text().replace('\u00A0', ' ').trim();
if(!text.isEmpty())
{
buffer.append(text);
isNewline = false;
}
} else if (node instanceof Element) {
Element element = (Element) node;
if (!isNewline)
{
if((element.isBlock() || element.tagName().equals("br")))
{
buffer.append("\n");
isNewline = true;
}
}
}
}
#Override
public void tail(Node node, int depth) {
}
}).traverse(element);
return buffer.toString();
}
And for your code
String result = convertNodeToText(JSoup.parse(html))

Based on the other answers and the comments on this question it seems that most people coming here are really looking for a general solution that will provide a nicely formatted plain text representation of an HTML document. I know I was.
Fortunately JSoup already provide a pretty comprehensive example of how to achieve this: HtmlToPlainText.java
The example FormattingVisitor can easily be tweaked to your preference and deals with most block elements and line wrapping.
To avoid link rot, here is Jonathan Hedley's solution in full:
package org.jsoup.examples;
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;
import java.io.IOException;
/**
* HTML to plain-text. This example program demonstrates the use of jsoup to convert HTML input to lightly-formatted
* plain-text. That is divergent from the general goal of jsoup's .text() methods, which is to get clean data from a
* scrape.
* <p>
* Note that this is a fairly simplistic formatter -- for real world use you'll want to embrace and extend.
* </p>
* <p>
* To invoke from the command line, assuming you've downloaded the jsoup jar to your current directory:</p>
* <p><code>java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]</code></p>
* where <i>url</i> is the URL to fetch, and <i>selector</i> is an optional CSS selector.
*
* #author Jonathan Hedley, jonathan#hedley.net
*/
public class HtmlToPlainText {
private static final String userAgent = "Mozilla/5.0 (jsoup)";
private static final int timeout = 5 * 1000;
public static void main(String... args) throws IOException {
Validate.isTrue(args.length == 1 || args.length == 2, "usage: java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]");
final String url = args[0];
final String selector = args.length == 2 ? args[1] : null;
// fetch the specified URL and parse to a HTML DOM
Document doc = Jsoup.connect(url).userAgent(userAgent).timeout(timeout).get();
HtmlToPlainText formatter = new HtmlToPlainText();
if (selector != null) {
Elements elements = doc.select(selector); // get each element that matches the CSS selector
for (Element element : elements) {
String plainText = formatter.getPlainText(element); // format that element to plain text
System.out.println(plainText);
}
} else { // format the whole doc
String plainText = formatter.getPlainText(doc);
System.out.println(plainText);
}
}
/**
* Format an Element to plain-text
* #param element the root element to format
* #return formatted text
*/
public String getPlainText(Element element) {
FormattingVisitor formatter = new FormattingVisitor();
NodeTraversor traversor = new NodeTraversor(formatter);
traversor.traverse(element); // walk the DOM, and call .head() and .tail() for each node
return formatter.toString();
}
// the formatting rules, implemented in a breadth-first DOM traverse
private class FormattingVisitor implements NodeVisitor {
private static final int maxWidth = 80;
private int width = 0;
private StringBuilder accum = new StringBuilder(); // holds the accumulated text
// hit when the node is first seen
public void head(Node node, int depth) {
String name = node.nodeName();
if (node instanceof TextNode)
append(((TextNode) node).text()); // TextNodes carry all user-readable text in the DOM.
else if (name.equals("li"))
append("\n * ");
else if (name.equals("dt"))
append(" ");
else if (StringUtil.in(name, "p", "h1", "h2", "h3", "h4", "h5", "tr"))
append("\n");
}
// hit when all of the node's children (if any) have been visited
public void tail(Node node, int depth) {
String name = node.nodeName();
if (StringUtil.in(name, "br", "dd", "dt", "p", "h1", "h2", "h3", "h4", "h5"))
append("\n");
else if (name.equals("a"))
append(String.format(" <%s>", node.absUrl("href")));
}
// appends text to the string builder with a simple word wrap method
private void append(String text) {
if (text.startsWith("\n"))
width = 0; // reset counter if starts with a newline. only from formats above, not in natural text
if (text.equals(" ") &&
(accum.length() == 0 || StringUtil.in(accum.substring(accum.length() - 1), " ", "\n")))
return; // don't accumulate long runs of empty spaces
if (text.length() + width > maxWidth) { // won't fit, needs to wrap
String words[] = text.split("\\s+");
for (int i = 0; i < words.length; i++) {
String word = words[i];
boolean last = i == words.length - 1;
if (!last) // insert a space if not the last word
word = word + " ";
if (word.length() + width > maxWidth) { // wrap and reset counter
accum.append("\n").append(word);
width = word.length();
} else {
accum.append(word);
width += word.length();
}
}
} else { // fits as is, without need to wrap text
accum.append(text);
width += text.length();
}
}
#Override
public String toString() {
return accum.toString();
}
}
}

text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = descrizione.replaceAll("br2n", "\n");
works if the html itself doesn't contain "br2n"
So,
text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "<pre>\n</pre>")).text();
works more reliable and easier.

Try this:
public String noTags(String str){
Document d = Jsoup.parse(str);
TextNode tn = new TextNode(d.body().html(), "");
return tn.getWholeText();
}

Use textNodes() to get a list of the text nodes. Then concatenate them with \n as separator.
Here's some scala code I use for this, java port should be easy:
val rawTxt = doc.body().getElementsByTag("div").first.textNodes()
.asScala.mkString("<br />\n")

Try this by using jsoup:
doc.outputSettings(new OutputSettings().prettyPrint(false));
//select all <br> tags and append \n after that
doc.select("br").after("\\n");
//select all <p> tags and prepend \n before that
doc.select("p").before("\\n");
//get the HTML from the document, and retaining original new lines
String str = doc.html().replaceAll("\\\\n", "\n");

This is my version of translating html to text (the modified version of user121196 answer, actually).
This doesn't just preserve line breaks, but also formatting text and removing excessive line breaks, HTML escape symbols, and you will get a much better result from your HTML (in my case I'm receiving it from mail).
It's originally written in Scala, but you can change it to Java easily
def html2text( rawHtml : String ) : String = {
val htmlDoc = Jsoup.parseBodyFragment( rawHtml, "/" )
htmlDoc.select("br").append("\\nl")
htmlDoc.select("div").prepend("\\nl").append("\\nl")
htmlDoc.select("p").prepend("\\nl\\nl").append("\\nl\\nl")
org.jsoup.parser.Parser.unescapeEntities(
Jsoup.clean(
htmlDoc.html(),
"",
Whitelist.none(),
new org.jsoup.nodes.Document.OutputSettings().prettyPrint(true)
),false
).
replaceAll("\\\\nl", "\n").
replaceAll("\r","").
replaceAll("\n\\s+\n","\n").
replaceAll("\n\n+","\n\n").
trim()
}

/**
* Recursive method to replace html br with java \n. The recursive method ensures that the linebreaker can never end up pre-existing in the text being replaced.
* #param html
* #param linebreakerString
* #return the html as String with proper java newlines instead of br
*/
public static String replaceBrWithNewLine(String html, String linebreakerString){
String result = "";
if(html.contains(linebreakerString)){
result = replaceBrWithNewLine(html, linebreakerString+"1");
} else {
result = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", linebreakerString)).text(); // replace and html line breaks with java linebreak.
result = result.replaceAll(linebreakerString, "\n");
}
return result;
}
Used by calling with the html in question, containing the br, along with whatever string you wish to use as the temporary newline placeholder.
For example:
replaceBrWithNewLine(element.html(), "br2n")
The recursion will ensure that the string you use as newline/linebreaker placeholder will never actually be in the source html, as it will keep adding a "1" untill the linkbreaker placeholder string is not found in the html. It wont have the formatting issue that the Jsoup.clean methods seem to encounter with special characters.

Based on user121196's and Green Beret's answer with the selects and <pre>s, the only solution which works for me is:
org.jsoup.nodes.Element elementWithHtml = ....
elementWithHtml.select("br").append("<pre>\n</pre>");
elementWithHtml.select("p").prepend("<pre>\n\n</pre>");
elementWithHtml.text();

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

using Jsoup to extract a table inside several divs - java

Related

Missing Table Elements When Scraping

How to generate XPath query matching a specific element in Jsoup?

Java Jsoup - Element isn't removed from Elements

Selectable Table in JSP, JavaScript

How do I preserve line breaks when using jsoup to convert html to plain text?

Categories

Resources