Getting the particular (pre-formatted) text (from a website) using JSoup - java

I'm new to JSoup, and I want to get the text written in this specific HTML tag:
<pre class="cg-msgbody cg-view-msgbody"><span class="cg-msgspan"><span>**the text I want to get is present here, how can I get it using JSoup?**</span></span></pre>
Any help would be appreciated.
Thanks!

String html = "<pre class=\"cg-msgbody cg-view-msgbody\">"
+ "<span class=\"cg-msgspan\">"
+ "<span>**the text I want to get is present here, "
+ "how can I get it using JSoup?**</span>"
+ "</span>"
+ "</pre>";
org.jsoup.nodes.Document document = Jsoup.parse(html);
// Take the last <span>, which is the innermost one holding the text
Element innerSpan = document.select("span").last();
System.out.println("Text: " + innerSpan.text());
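On a real page there are usually other span elements, so taking the last span in the whole document can pick up the wrong one. A more targeted variant, assuming the class names from the question stay stable, is to anchor the selector on the pre element. The class name PreTextExample and the method extractMessage are made up for this sketch:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PreTextExample {
    // Returns the text of the span nested inside span.cg-msgspan within the
    // message <pre>, or null if the HTML has no such element.
    static String extractMessage(String html) {
        Document document = Jsoup.parse(html);
        // "pre.cg-msgbody" matches <pre> elements carrying the cg-msgbody class;
        // "span.cg-msgspan > span" takes the span directly inside the outer span.
        Element inner = document.select("pre.cg-msgbody span.cg-msgspan > span").first();
        return inner == null ? null : inner.text();
    }

    public static void main(String[] args) {
        // Sample markup (hypothetical): the trailing span is there only to show
        // that it is not the one being selected.
        String html = "<pre class=\"cg-msgbody cg-view-msgbody\">"
                + "<span class=\"cg-msgspan\"><span>some message text</span></span>"
                + "<span>unrelated trailing span</span>"
                + "</pre>";
        System.out.println("Text: " + extractMessage(html));
    }
}
```

For a live page, Jsoup.parse(html) could be swapped for Jsoup.connect(url).get(), as long as the site serves this markup without JavaScript.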

Related

How to get the value from the input tag in java?

<input type="text" name="n1" value="howru"/>
The markup above is the whole question: I want to extract the value attribute (howru) using Java. This is plain HTML, not JSP, so I am looking for a core Java solution without any JSP tags.
The question isn't entirely clear, but it seems you want to extract the value from the tag without using JSP or servlets. Let's add a couple more tags to make the example clearer.
This can be done using jsoup: https://jsoup.org/cookbook/
String inputTag = "<input type=\"text\" name=\"n1\" value=\"howru\"/>"
+ "<input type=\"text\" id=\"textField2\" name=\"n2\" value=\"howru2\"/>"
+ "<input type=\"hidden\" id=\"hiddenField\" name=\"n3\" value=\"howru3\"/>";
Document document = Jsoup.parse(inputTag);
Elements elementByTag = document.select("input[type=text]");
System.out.println("Element By Tag(First Input Tag--text):" + elementByTag.get(0).attr("value"));
System.out.println("Element By Tag(Second Input Tag--text):" + elementByTag.get(1).attr("value"));
Element elementByID = document.getElementById("textField2");
System.out.println("Element By ID(Second Input Tag--text):" + elementByID.attr("value"));
elementByID = document.getElementById("hiddenField");
System.out.println("Element By ID(Hidden Field):" + elementByID.attr("value"));
Output
Element By Tag(First Input Tag--text):howru
Element By Tag(Second Input Tag--text):howru2
Element By ID(Second Input Tag--text):howru2
Element By ID(Hidden Field):howru3
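The input in the question has only a name attribute, so looking it up by name may be the most direct route. Here is a minimal sketch using an attribute selector; the class InputValueByName and the method valueOf are hypothetical names, and the name is assumed to be a simple identifier with no characters that need escaping in a CSS selector:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class InputValueByName {
    // Returns the value attribute of the first <input> with the given name,
    // or null when no such input exists.
    // Assumption: 'name' is a plain identifier such as "n1".
    static String valueOf(String html, String name) {
        Document document = Jsoup.parse(html);
        Element input = document.select("input[name=" + name + "]").first();
        return input == null ? null : input.attr("value");
    }

    public static void main(String[] args) {
        String html = "<input type=\"text\" name=\"n1\" value=\"howru\"/>";
        System.out.println("Value of n1: " + valueOf(html, "n1"));
    }
}
```

This avoids relying on the position of the tag in the document, which matters once the page contains more than a handful of inputs.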

How to parse HTML text and links with java and jsoup

I need to parse text from a webpage. The text is presented in this way:
nonClickableText= link1 link2 nonClickableText2= link1 link2
I want to convert all of this to a single string in Java. The non-clickable text should stay as it is, while the clickable text should be replaced with the link it points to.
So in java I would have this:
String parsedHTML = "nonClickableText= example.com example.com nonClickableText2= example3.com example4.com";
Here are some pictures: first second
What exactly are link1 and link2? According to your example
"... nonClickableText2= example3.com example4.com"
they can be different, so what would be the source besides the href?
Based on your images, the following code should give you everything you need to adapt it to your final string representation. First we grab the <strong> block and then go through its child nodes, using each <a> child together with the text node that precedes it:
String htmlString = "<html><div><p><strong>\"notClickable1\"<a rel=\"nofollow\" target=\"_blank\" href=\"example1.com\">clickable</a>\"notClickable2\"<a rel=\"nofollow\" target=\"_blank\" href=\"example2.com\">clickable</a>\"notClickable3\"<a rel=\"nofollow\" target=\"_blank\" href=\"example3.com\">clickable</a></strong></p></div></html>";
Document doc = Jsoup.parse(htmlString); //can be replaced with Jsoup.connect("yourUrl").get();
String parsedHTML = "";
Element container = doc.select("div>p>strong").first();
for (Node node : container.childNodes()) {
    Node previous = node.previousSibling();
    // Only handle <a> nodes that directly follow a text node
    if (node.nodeName().equals("a") && previous != null && previous.nodeName().equals("#text")) {
        parsedHTML += previous.toString().replace("\"", "");
        parsedHTML += "= " + node.attr("href") + " ";
    }
}
// trim() returns a new string, so the result has to be assigned
parsedHTML = parsedHTML.trim();
System.out.println(parsedHTML);
Output:
notClickable1= example1.com notClickable2= example2.com notClickable3= example3.com

Java parsing HTML with dynamic pages

I've come to a halt.
For a school project we have to parse a large number of links in this format: http://us.imdb.com/M/title-exact?Desperado%20(1995). If you open this link, you'll see that the page gets built dynamically.
How could I use jsoup.org or something similar to get HTML to my procedures? I'm trying to parse some names out of these pages.
I tried this:
Document doc;
doc = (Document) Jsoup.connect(url).get();
System.out.println("text : " + doc.title());
but it returns 403.
Help:(
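Calling connect(String url) on its own may not send the request parameters the server expects, such as a user agent, so it can help to set them explicitly before fetching the result.
Try it this way (the data and cookie values are placeholders):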
Document doc = Jsoup.connect("http://www.imdb.com/title/tt0112851/")
.data("query", "Java")
.userAgent("Mozilla")
.cookie("auth", "token")
.timeout(3000)
.get();
String title = doc.title();
System.out.println("text : " + title);
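To check what a configured request would send before hitting the server, the Connection can be built and inspected without calling get(). The sketch below is only an illustration: the class RequestSetup, the method configure, the "Mozilla/5.0" user agent string, and the 10-second timeout are all assumptions, and nothing here guarantees that a particular site will accept the request.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class RequestSetup {
    // Builds (but does not send) a request with a browser-like User-Agent.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0") // placeholder value, an assumption
                .timeout(10_000);         // timeout in milliseconds
    }

    public static void main(String[] args) {
        Connection connection = configure("http://www.imdb.com/title/tt0112851/");
        // Inspect what would be sent; no network traffic happens here.
        System.out.println("User-Agent: " + connection.request().header("User-Agent"));
        System.out.println("Timeout (ms): " + connection.request().timeout());
    }
}
```

Calling get() on the returned Connection sends the request. By default jsoup throws org.jsoup.HttpStatusException for error responses such as 403, so catching that exception and printing its status code is one way to see what the server answered.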

What would the Jsoup selector be for TD elements that have no children and no COLSPAN attribute?

So I'm trying to parse through a web page that is relatively messy. It contains several key-value pairs that I would like to extract. The unifying theme of these pairs is that they are non-empty, they have no children, and they do not have a COLSPAN attribute. Here's what I've tried, which seems to make sense logically but does not yield any results.
Elements tds = document.select("td:not([colspan]):not(:has(*))");
So I want TDs that:
Do not contain COLSPAN
Do not have any children
Seems like I must be close, but just not having any luck. Any thoughts?
I came up with an answer that uses a loop to remove those elements that you don't want to select.
http://jsoup.org/apidocs/org/jsoup/select/Selector.html
I mocked up a table that has the two situations you are trying to keep out of your select.
String html =
"<table>" +
"<thead><tr><th>Col1</th><th>Col2</th><th>Col3</th></tr></thead>" +
"<tbody>" +
"<tr><td>row1col1</td><td>row1col2</td><td>row1col3</td></tr>" +
"<tr><td colspan='3'>row2fullrow</td></tr>" +
"<tr><td></td><td>row3col2</td><td><strong>row3col3</strong></td></tr>" +
"<tr><td>row4col1</td><td colspan='2'><strong>row4col2and3</strong></td></tr>" +
"</tbody>" +
"</table>";
Document doc = Jsoup.parse(html);
for(Element td : doc.select("td")) {
if (td.children().size() > 0 || td.hasAttr("colspan")) {
td.remove();
}
}
System.out.println(doc);
+++++++++++++++++++++++
UPDATE
+++++++++++++++++++++++
I played around with it a little more and came up with this (which proves your select does work). Your HTML must have some other little thing that I don't represent with mine.
String html =
"<table>" +
"<thead><tr><th>Col1</th><th>Col2</th><th>Col3</th></tr></thead>" +
"<tbody>" +
"<tr><td>row1col1</td><td>row1col2</td><td>row1col3</td></tr>" +
"<tr><td colspan='3'>row2fullrow</td></tr>" +
"<tr><td></td><td>row3col2</td><td><strong>row3col3</strong></td></tr>" +
"<tr><td id='x'>row4col1</td><td colspan='2'><strong>row4col2and3</strong></td></tr>" +
"</tbody>" +
"</table>";
Document doc = Jsoup.parse(html);
System.out.println(doc.select("td:not([colspan]):not(:has(*))"));
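The question also asks for cells that are non-empty, and the selector alone still returns the empty td from the mock-up. One way to handle that, sketched here with the hypothetical names LeafCells and leafCellTexts, is to keep the selector as it is and filter with hasText() in Java:

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LeafCells {
    // Collects the text of every <td> that has no colspan attribute,
    // no child elements, and some non-blank text.
    static List<String> leafCellTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element td : doc.select("td:not([colspan]):not(:has(*))")) {
            if (td.hasText()) { // skips empty and whitespace-only cells
                texts.add(td.text());
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<table><tbody>"
                + "<tr><td>row1col1</td><td colspan='2'>row1wide</td></tr>"
                + "<tr><td></td><td><strong>bold</strong></td><td>row2col3</td></tr>"
                + "</tbody></table>";
        System.out.println(leafCellTexts(html));
    }
}
```

If the selector returns nothing on the real page, printing doc.select("td").size() first can show whether the table is present in the fetched HTML at all.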

How to get text & Other tags between specific tags using Jericho HTML parser?

I have an HTML file which contains a specific tag, e.g. <TABLE cellspacing=0>, with the end tag </TABLE>. I want to get everything between those tags. I am using the Jericho HTML parser in Java to parse the HTML. Is it possible to get the text and other tags between specific tags with Jericho?
For example:
<TABLE cellspacing=0>
<tr><td>HELLO</td>
<td>How are you</td></tr>
</TABLE>
Expected result:
<tr><td>HELLO</td>
<td>How are you</td></tr>
Once you have found the Element of your table, all you have to do is call getContent().toString(). Here's a quick example using your sample HTML:
Source source = new Source("<TABLE cellspacing=0>\n" +
" <tr><td>HELLO</td> \n" +
" <td>How are you</td></tr>\n" +
"</TABLE>");
Element table = source.getFirstElement();
String tableContent = table.getContent().toString();
System.out.println(tableContent);
Output:
<tr><td>HELLO</td>
<td>How are you</td></tr>
Aby, this code walks through all the td elements and prints them to the screen. Maybe it will help you.
List<Element> elementListTd = source.getAllElements(HTMLElementName.TD);
// Iterate over the list of "td" elements on the page.
// palavraCompara (the text to look for) and tabela (a list collecting matches)
// are assumed to be defined elsewhere.
for (Element element : elementListTd) {
    if (element.getAttributes() != null) {
        String td = element.getAllElements().toString();
        System.out.println("TD: " + td);
        // Raw content between <td> and </td>, including nested tags
        System.out.println(element.getContent());
        // Plain text with the markup stripped
        String conteudoAtributo = element.getTextExtractor().toString();
        System.out.println(conteudoAtributo);
        if (td.contains(palavraCompara)) {
            tabela.add(conteudoAtributo);
        }
    }
}
