jsoup clean includes unwanted carriage return - java

This is currently vexing me.
Jsoup is including an extra line break in the returned string if the string includes <br />,
e.g.
String html ="TEST<br />TEST";
Jsoup.clean(html, org.jsoup.safety.Whitelist.basic());
returns
TEST\n<br />TEST
Any advice on how to avoid the inclusion of the troublesome \n?

Have you tried .text() or .ownText() from the Element class?
//If you want the whole page
String url = "http://www.yourwebsite.com";
Document doc = Jsoup.connect(url).get();
System.out.println(doc.text());
//If you want some specific part of the page
Elements elems = doc.select("query");
for (Element element : elems) {
    System.out.println(element.text() + "\n");
    System.out.println(element.ownText() + "\n\n");
}
If each element were <p>Hello<b> there</b> now!</p>:
text() would return Hello there now!
ownText() would return Hello now!
To make it easier to understand: text() returns all the text within the element you selected, including the text of its children. ownText() returns only the text that belongs to the element itself, not the text of its children.
As for the query in doc.select("query"), you can put any jsoup selector pattern there.

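// The key line is prettyPrint(false): it stops jsoup from adding newlines to the output.
// WHITE_LIST is your Whitelist; body must be a Document, e.g. from Jsoup.parseBodyFragment(html).
Cleaner cleaner = new Cleaner(WHITE_LIST);
Document clean = cleaner.clean(body);
Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.prettyPrint(false);
clean.outputSettings(outputSettings);
return clean.body().html();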

Related

JSoup, how to return data from a dynamic <a href> tag

Very new to JSoup, trying to retrieve a changeable value that is stored within an <a> tag, specifically from the following website and HTML.
Snapshot of HTML
The results after "constituency/" are changeable and depend on the user's input. I am able to retrieve the h2 tags themselves but not the information within. At the moment the best return I can get is just tags, using the method below.
The desired return would be something that I can substring down into
Dublin Bay South
The actual return is
<well.col-md-4.h2></well.col-md-4.h2>
private String jSoupTDRequest(String aLine1, String aLine3) throws IOException {
    String constit = "";
    String h2 = "h2";
    String url = "https://www.whoismytd.com/search?utf8=✓&form-input="+aLine1+"%2C+"+aLine3+"+Ireland";
    //Switch to try catch if time
    Document doc = Jsoup.connect(url)
            .timeout(6000).get();
    //Scrape elements from relevant section
    Elements body = doc.select("well.col-md-4.h2");
    Element e = new Element("well.col-md-4.h2");
    constit = e.toString();
    return constit;
}
I am extremely new to JSoup and scraping in general. I would appreciate any input from someone who knows what they're doing, or any alternate ways to get the desired result.
Change the code under your "Scrape elements from relevant section" comment as follows:
Select the very first <div class="well"> element first.
Element tdsDiv = doc.select("div.well").first();
Select the very first <a> link element next. This link points to the constituency.
Element constLink = tdsDiv.select("a").first();
Get the constituency name by grabbing this link's text content.
constit = constLink.text();
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;
import java.io.IOException;

@DisplayName("JSoup, how to return data from a dynamic <a href> tag")
class JsoupQuestionTest {
    private static final String URL = "https://www.whoismytd.com/search?utf8=%E2%9C%93&form-input=Kildare%20Street%2C%20Dublin%2C%20Ireland";

    @Test
    void findSomeText() throws IOException {
        String expected = "Dublin Bay South";
        Document document = Jsoup.connect(URL).get();
        // Select the link by its href and read its text content
        String actual = document.getElementsByAttributeValue("href", "/constituency/dublin-bay-south").text();
        Assertions.assertEquals(expected, actual);
    }
}

How to parse HTML text and links with java and jsoup

I need to parse text from a webpage. The text is presented in this way:
nonClickableText= link1 link2 nonClickableText2= link1 link2
I want to be able to convert all to a string in java. The non clickable text should remain like it is while the clickable text should be replaced with its actual link.
So in java I would have this:
String parsedHTML = "nonClickableText= example.com example.com nonClickableText2= example3.com example4.com";
Here are some pictures: first second
What exactly is link1 and link2? According to your example
"... nonClickableText2= example3.com example4.com"
they can be different, so what would be the source besides the href?
Based on your images, the following code should give you everything you need to adapt to your final string presentation. First we grab the <strong> block and then go through its child nodes, using each <a> child together with the text node that precedes it:
String htmlString = "<html><div><p><strong>\"notClickable1\"<a rel=\"nofollow\" target=\"_blank\" href=\"example1.com\">clickable</a>\"notClickable2\"<a rel=\"nofollow\" target=\"_blank\" href=\"example2.com\">clickable</a>\"notClickable3\"<a rel=\"nofollow\" target=\"_blank\" href=\"example3.com\">clickable</a></strong></p></div></html>";
Document doc = Jsoup.parse(htmlString); //can be replaced with Jsoup.connect("yourUrl").get();
String parsedHTML = "";
Element container = doc.select("div>p>strong").first();
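for (Node node : container.childNodes()) {
    Node previous = node.previousSibling();
    // Only handle <a> elements that directly follow a text node
    if (node.nodeName().equals("a") && previous != null && previous.nodeName().equals("#text")) {
        parsedHTML += previous.toString().replaceAll("\"", "");
        parsedHTML += "= " + node.attr("href") + " ";
    }
}
// trim() returns a new string, so the result has to be assigned
parsedHTML = parsedHTML.trim();
System.out.println(parsedHTML);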
Output:
notClickable1= example1.com notClickable2= example2.com notClickable3= example3.com

Sanitize HTML string

I have an HTML string like:
<p dir="ltr"><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u>bold</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u> </u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u>all</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u> </u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u>in</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u> </u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u>one</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></p>
I want to sanitize the html like <b><i><u> bold all in one </b></i></u>
I tried this method: webText = webText.replaceAll("(</?(?:b|i|u)>)\\1+", "$1").replaceAll("</(b|i|u)><\\1>", "");
But it is of no use. The html remains clumsy. What should I do to mend the same? Is there any other Regex or JSON way?
Regex may help here, but in general regular expressions do not serve well as an HTML parser once things get complex. Jsoup is a great HTML library, and I can really recommend it.
Unfortunately your HTML is still valid HTML, so the solution is tricky.
It is best to start with the Jsoup documentation, especially the part on its selector syntax.
Here's something to start with:
final String html = ... // your html from above
// Parse the html string into a document
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
/*
 * Select all elements which ...
 *
 * (a) have text (at least not empty)
 * (b) have no child elements of their own
 *
 * Iterate over those found and print them.
 */
for (Element element : doc.select("*:matches(^..+?$):not(:has(*))")) {
    System.out.println(element);
}
Result:
<u>bold</u>
<u>all</u>
<u>in</u>
<u>one</u>
If you need literally <b><i><u> bold all in one </b></i></u>:
final String html = ... // your html from above
// As above
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
// All text of the document
String text = doc.text();
// Create an element and its children
Element element = new Element(Tag.valueOf("b"), "");
element.appendElement("i").appendElement("u").text(text);
System.out.println(element);
Result:
<b><i><u>bold all in one</u></i></b>
You could try the method below to strip the HTML tags (Html here is Android's android.text.Html, and the result is plain text without any tags):
public String stripHtml(String html)
{
    return Html.fromHtml(html).toString();
}
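Since the question also asks about a regex route, here is a minimal sketch using only java.lang.String. It assumes that the input is a fragment containing nothing but plain <b>, <i> and <u> tags without attributes, and that all of the text is meant to end up bold, italic and underlined. The class and method names are hypothetical and not part of jsoup. It removes every one of those tags and wraps the remaining text once:

```java
class NestedTagCollapser {

    // Removes every opening and closing <b>, <i> and <u> tag (case-insensitive,
    // attribute-free only) and wraps the remaining text in a single <b><i><u>.
    // Assumption: the whole fragment should share the same formatting.
    static String collapse(String fragment) {
        String text = fragment.replaceAll("(?i)</?[biu]>", "");
        return "<b><i><u>" + text + "</u></i></b>";
    }

    public static void main(String[] args) {
        String messy = "<b><i><u><b><i><u>bold</u></i></b></u></i>"
                + "<i><u> </u></i><i><u>all</u></i></b>";
        System.out.println(collapse(messy)); // <b><i><u>bold all</u></i></b>
    }
}
```

Other tags, such as the surrounding <p dir="ltr">, are left untouched by the pattern and would end up inside the new wrapper, so for anything beyond this narrow case a parser is the safer choice.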

Parsing elements within div using Jsoup

Here is the html I'm trying to parse:
<div class="entry">
<img src="http://www.example.com/image.jpg" alt="Image Title">
<p>Here is some text</p>
<p>Here is some more text</p>
</div>
I want to get the text within the <p>'s into one ArrayList. I've tried using Jsoup for this.
Document doc = Jsoup.parse(line);
Elements descs = doc.getElementsByClass("entry");
for (Element desc : descs) {
    String text = desc.getElementsByTag("p").first().text();
    myArrayList.add(text);
}
But this doesn't work at all. I'm quite new to Jsoup but it seems it has its limitations. If I can get the text within <p> into one ArrayList using Jsoup, how can I accomplish that? If I must use some other means to parse the html, let me know.
I'm using a BufferedReader to read the html file one line at a time.
You could change your approach to the following:
Document doc = Jsoup.parse(line);
Elements pElems = doc.select("div.entry > p");
for (Element pElem : pElems) {
    // text() returns the visible text; data() is meant for script/style content
    myArrayList.add(pElem.text());
}
Not sure why you are reading the html line by line. However if you want to read the whole html use the code below:
String line = "<div class=\"entry\">" +
"<img src=\"http://www.example.com/image.jpg\" alt=\"Image Title\">" +
"<p>Here is some text</p>" +
"<p>Here is some more text</p>" +
"</div>";
Document doc = Jsoup.parse(line);
Elements descs = doc.getElementsByClass("entry");
List<String> myArrayList = new ArrayList<String>();
for (Element desc : descs) {
    Elements paragraphs = desc.getElementsByTag("p");
    for (Element paragraph : paragraphs) {
        myArrayList.add(paragraph.text());
    }
}
In your for-loop:
Elements ps = desc.select("p");
(http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#select(java.lang.String))
Try this:
Document doc = Jsoup.parse(line);
String text = doc.select("p").first().text();

Regex to add <span> tag before <a>

I need to write a util to add a <span> tag before any <a> tag.
Test string points to <p>Acd Event with an image <img src="image.jpg">
This needs to be changed to
Test string points to <p><span class="test_class">Acd Event</span> with an image <img src="image.jpg">
As you can see, the tag needs to be added only when the URL points to a physical page, not when it's an image.
I was planning to use regex to achieve this, but w/o any luck so far.
Any pointer on this will be highly appreciated.
-Thanks
Turning my comment into an answer, regular expressions aren't the right tool for the job. I'd advise the use of a parser such as John Cowan's 'TagSoup' to write some code to filter the HTML. If you prefer something more DOM-like than SAX-like, there's NekoHTML.
If you're absolutely certain you want to go down the regular expression route and you're using PCRE or another regex engine that supports look-ahead, you can use assertions, thus this regex may do the job for you:
s.replaceAll("<a[^>]*?>(?!<img.*)(.+?)</a>", "<span class=\"test_class\">$0</span>");
I haven't tested that, but the gist is correct. The important thing there is (?!<img.*), which asserts that you don't want to match <img followed by anything at that position. That may do the job for you, but I'm still of the opinion that a little bit of parsing is the best route.
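The look-ahead idea above can be tried out in plain Java with java.util.regex. This is only a sketch under assumptions: the sample markup, the http://www.acdevents.com href and the class and method names are made up for illustration, and the pattern is tightened slightly (a word boundary after <a, optional whitespace before <img) compared with the one-liner above. It wraps a text link in the span and leaves a link around an image alone:

```java
import java.util.regex.Pattern;

class AnchorWrapper {

    // An <a ...> element whose content does not start with an <img> tag.
    // (?!\s*<img\b) is the negative look-ahead that rejects image links.
    private static final Pattern TEXT_LINK = Pattern.compile(
            "<a\\b[^>]*>(?!\\s*<img\\b)(.+?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Wraps every matching link in <span class="test_class">; $0 is the whole match.
    static String wrapTextLinks(String html) {
        return TEXT_LINK.matcher(html)
                .replaceAll("<span class=\"test_class\">$0</span>");
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://www.acdevents.com\">Acd Event</a>"
                + " with an image <a href=\"image.jpg\"><img src=\"image.jpg\"></a></p>";
        System.out.println(wrapTextLinks(html));
    }
}
```

The pattern still breaks on nested or malformed markup, which is why the parser-based route remains the more robust option.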
If you have a library like jQuery on the page you could do it with something like this:
$("a").wrap("<span class='test_class' />");
Or if you need to do some check against the URL first:
$("a").each(function(){
    var element = $(this);
    var href = element.attr("href");
    // Skip anchors without an href attribute
    if (href && href.indexOf("http://someUrl") > -1){
        element.wrap("<span class='test_class' />");
    }
});
If you don't have jQuery you could do it like this:
var elements = document.body.getElementsByTagName("a");
for (var i = 0; i < elements.length; i++) {
    var element = elements[i];
    var clone = element.cloneNode(true);
    var parent = element.parentNode;
    var span = document.createElement("span");
    span.setAttribute("class", "test_class");
    span.appendChild(clone);
    parent.replaceChild(span, element);
}
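You could do something very similar in Java using the W3C Document interface (note that this parser only accepts HTML that is well-formed XML):
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
// parse(String) expects a URI, so wrap the HTML string in an InputSource
Document doc = builder.parse(new InputSource(new StringReader(yourJavaHtmlString)));
NodeList nodes = doc.getElementsByTagName("a");
for (int i = 0; i < nodes.getLength(); i++) {
    Element element = (Element) nodes.item(i);
    String href = element.getAttribute("href");
    if (!href.equals("http://www.acdevents.com")) {
        // cloneNode and getParentNode return Node, not Element
        Node clone = element.cloneNode(true);
        Node parent = element.getParentNode();
        Element span = doc.createElement("span");
        span.setAttribute("class", "test_class");
        span.appendChild(clone);
        parent.replaceChild(span, element);
    }
}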
