I am trying to parse this URL: http://online.wsj.com/public/page/news-wall-street-heard.html?dsk=y
Document jDoc1 = Jsoup.connect(url1).get();
System.out.println(jDoc1.text());
But the output of the second line above contains all the tags inside the textarea, plus the text of the other tags. The output looks like this:
..
..
<ul class="">
<li><a data-time="1dy" data-frequency="1mi" class="mdm_time">1 Day</a></li>
<li><a data-time="5dy" data-frequency="15mi" class="mdm_time">5 Days</a></li>
..
..
All the HTML inside the textarea is printed, along with the text of the other tags. I want either to remove this tag from the Document or to get it as an Element so that I can remove it myself.
I hope I have explained everything clearly. Please help me solve this.
EDIT :
As per suggestion, I did this :
System.out.println(jDoc1.select("textarea"));
And the output is:
<textarea id="wsj_autocomplete_template" style="display:none">
&lt;div&gt;
&lt;div class="acHeadline hidden" &gt;
&lt;/div&gt;
&lt;div class="dropdownContainerClass"&gt;
&lt;div class="suggestionblock hidden" templateType="C1"&gt;
....
...
..
It certainly selects the textarea, but it cannot parse the inner elements, possibly because they contain &lt; instead of <. Is there any workaround for this?
If you want to remove the entire textarea tag, use doc.select("textarea").remove();. If you want to get the content of the textarea, use doc.select("textarea").text(). Note that I'm using the text() method here instead of toString() or html(): it returns the unescaped text rather than HTML escape codes.
If you then want to manipulate that HTML, you can parse it again: Document textareaDoc = Jsoup.parseBodyFragment(doc.select("textarea").text());
Example
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class WSJParser {
    public static void main(String[] args) {
        String url = "http://online.wsj.com/public/page/news-wall-street-heard.html?dsk=y";
        try {
            Document doc = Jsoup.connect(url).get();
            // doc.select("textarea").remove(); // Removes the entire textarea tag
            // Re-parse the unescaped textarea content as an HTML fragment
            Document textareaDoc = Jsoup.parseBodyFragment(doc.select("textarea").text());
            System.out.println(textareaDoc);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
If I understand correctly, what you want is this:
Elements textareas = Jsoup.connect(url1).get().select("textarea");
for (Element textarea : textareas) {
    Elements elements = textarea.select("*");
    for (Element element : elements) {
        System.out.println(element.ownText());
    }
}
Related
Hi, I need to scrape a website using jsoup and get the output as key-value pairs. Can anyone suggest how?
The URL I need to scrape is https://www.cpsc.gov/Recalls?field_rc_date_value%5Bmin%5D&field_rc_date_value%5Bmax%5D&field_rc_heading_value=&field_rc_hazard_description_value=&field_rc_manufactured_in_value=&field_rc_manufacturers_value=&field_rc_number_value=
The code I have written is:
package com.jaysons;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ScrapeBody {
    public static void main(String[] args) throws IOException {
        String url = "https://www.cpsc.gov/Recalls?field_rc_date_value%5Bmin%5D&field_rc_date_value%5Bmax%5D&field_rc_heading_value=&field_rc_hazard_description_value=&field_rc_manufactured_in_value=&field_rc_manufacturers_value=&field_rc_number_value=";
        Document doc = Jsoup.connect(url).get();
        Elements content = doc.select("div.views-field views-field-php");
        doc = Jsoup.parse(content.html().replaceAll("</div>", "</div><span>")
                .replaceAll("<div", "</span><div"));
        Elements labels = doc.select("div.remedy");
        for (Element label : labels) {
            System.out.println(String.format("%s %s", label.text().trim(),
                    label.nextElementSibling().text()));
        }
    }
}
I need the output as key-value pairs, like:
Date:OCTOBER 20, 2017
remedy:
units:
website:http://www.bosch-home.com/us
phone:(888) 965-5813
Kindly let me know where I made a mistake.
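There's no need to re-parse the HTML of the content variable and reassign doc. Note also that the selector "div.views-field views-field-php" looks for a <views-field-php> element inside div.views-field; to match an element that has both classes you would chain them, as in "div.views-field.views-field-php".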
Elements content = doc.select("div.views-field > span");
for (Element viewField : content) {
    /*
    each viewField corresponds to one
    <div class="views-field views-field-php">
      <span class="field-content">
        <a href="/Recalls/2018/BSH-Home-Appliances-amplía-retiro-del-mercado-de-lavavajillas">
        <div class="date">
          October 20, 2017
        </div>
        ...
      </span>
    </div>
    */
    Elements divs = viewField.getElementsByTag("div");
    for (Element div : divs) {
        String className = div.className();
        if (className.equals("date")) {
            // store and extract date
        } else if (className.equals("...")) {
            // do something else
        } // else...
    }
}
You can select subelements not only by tag, but also by name, by attributes, and so on. Check the official documentation for more info: https://jsoup.org/cookbook/extracting-data/dom-navigation
Disclaimer: I could not test the code right now.
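To illustrate the "store and extract" step in the loop above, here is a minimal sketch of collecting the fields into key-value pairs. It uses only the Java standard library so it runs without jsoup; the class name KeyValueSketch and the method format are hypothetical, and the two put calls stand in for fields.put(div.className(), div.text()) inside the inner loop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeyValueSketch {
    // Renders each entry as "key:value" on its own line, in insertion order.
    static String format(Map<String, String> fields) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            out.append(entry.getKey()).append(':').append(entry.getValue()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in the order they were found on the page.
        Map<String, String> fields = new LinkedHashMap<>();
        // Assumption: with jsoup these would be fields.put(div.className(), div.text());
        fields.put("date", "October 20, 2017");
        fields.put("phone", "(888) 965-5813");
        System.out.print(format(fields));
    }
}
```

A LinkedHashMap is used rather than a HashMap so that the printed order matches the order of the divs in the page.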
Is it possible to convert the String content below to an ArrayList using split, so that you get something like point A?
<a class="postlink" href="http://test.site/i7xt1.htm">http://test.site/i7xt1.htm<br/>
</a>
<br/>Mirror:<br/>
<a class="postlink" href="http://information.com/qokp076wulpw">http://information.com/qokp076wulpw<br/>
</a>
<br/>Additional:<br/>
<a class="postlink" href="http://additional.com/qokdsfsdwulpw">http://additional.com/qokdsfsdwulpw<br/>
</a>
Point A (desired arraylist content):
http://test.site/i7xt1.htm
Mirror:
http://information.com/qokp076wulpw
Additional:
http://additional.com/qokdsfsdwulpw
I am currently using the code below, but it doesn't produce the desired output (for instance, "Mirror" is added multiple times).
Document doc = Jsoup.parse(string);
Elements links = doc.select("a[href]");
for (Element link : links) {
    Node previousSibling = link.previousSibling();
    while (!(previousSibling.nodeName().equals("u") || previousSibling.nodeName().equals("#text"))) {
        previousSibling = previousSibling.previousSibling();
    }
    String identifier = previousSibling.toString();
    if (identifier.contains("Mirror")) {
        totalUrls.add("MIRROR(s):");
    }
    totalUrls.add(link.attr("href"));
}
Fix your links first. As cricket_007 mentioned, having proper HTML would make this a lot easier.
// Get rid of the bad HTML; \\s* is needed because the </a> sits on the line after the <br/>
String html = yourHtml.replaceAll("<br/>\\s*</a>", "</a>");
String[] lines = html.split("<br/>");
for (String str : lines) {
    String text = Jsoup.parse(str).text();
    // You can go further here, e.g. check whether the piece is a link or a label such as "Mirror:"
}
Now that the errant <br> tags are out of the links, you can split the string on the <br> tags that remain and collect the text of each piece. It's not pretty, but it should work.
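The same idea can be sketched end to end as follows. This version deliberately swaps Jsoup.parse(str).text() for a plain regex that strips tags, only so that it runs with the standard library alone; that shortcut is reasonable for this small, known snippet but is not a general HTML parser. The class name BrSplitSketch and method splitOnBr are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BrSplitSketch {
    static List<String> splitOnBr(String html) {
        // Step 1: remove the stray <br/> that sits just before each </a>.
        String cleaned = html.replaceAll("<br/>\\s*</a>", "</a>");
        List<String> result = new ArrayList<>();
        // Step 2: split on the remaining <br/> tags.
        for (String part : cleaned.split("<br/>")) {
            // Step 3: drop any remaining tags and surrounding whitespace.
            String text = part.replaceAll("<[^>]+>", "").trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html =
            "<a class=\"postlink\" href=\"http://test.site/i7xt1.htm\">http://test.site/i7xt1.htm<br/>\n"
            + "</a>\n"
            + "<br/>Mirror:<br/>\n"
            + "<a class=\"postlink\" href=\"http://information.com/qokp076wulpw\">http://information.com/qokp076wulpw<br/>\n"
            + "</a>";
        for (String item : splitOnBr(html)) {
            System.out.println(item);
        }
    }
}
```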
I looked through multiple forums before asking this question. Basically, what I need is to select part of the text in an HTML file. The HTML is structured something like this:
<div class = "pane big">
<code>
<pre>
SomeText
<a id="par1" href="#par1">¶</a>
MoreText
.
.
.
<a id="par2" href="#par2">¶</a>
MoreText
</pre>
</code>
</div>
So what I need to do is extract the text that follows the par1 anchor by itself, and then get the text that follows the par2 anchor separately. I tried to use Jsoup, but all I could do was select the whole text within the div. I also tried XPath, but the expression I'm evaluating is not accepted, maybe because it's not an XML file to begin with.
An example of the XPath expressions I used is:
/html/body/div/div[2]/code[2]/pre/text()[3]
and the CSS selector:
body > div > div.pane.big > code:nth-child(7) > pre
It's not possible to do that with pure CSS selectors; some additional extracting and appending logic in Java code is needed:
1. Select the pre element.
2. Split its content into a sequence of text parts, using the a elements as the separator.
3. Skip the first part and join the next two (or more) parts.
Here is a simple code sample for that (JDK 1.8 style with the stream API, and the older JDK 1.5 - 1.7 style):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

import static java.util.Arrays.stream;
import static java.util.stream.Collectors.joining;

public class SimpleParser {
    public static void main(String[] args) throws IOException {
        final Document document = Jsoup.parse(new File("div.html"), "UTF-8");
        final Elements elements = document.select("div.pane.big pre");

        System.out.println("JDK 1.8 style");
        System.out.println(
                stream(elements.html().split("\\s+<a.+</a>\\s+"))
                        .skip(1)
                        .collect(joining("\n")));

        System.out.println("\nJDK 1.7 style");
        String[] textParts = elements.html().split("\\s+<a.+</a>\\s+");
        StringBuilder resultText = new StringBuilder();
        // Start at index 1 to skip the text before the first anchor
        for (int i = 1; i < textParts.length; i++) {
            resultText.append(textParts[i]).append("\n");
        }
        System.out.println(resultText.toString());
    }
}
P.S. Note that the last div tag in your HTML sample should be a closing tag (</div>).
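If the parts need to stay separate (the text after par1 on its own, the text after par2 on its own) rather than joined, the splitting step could be sketched like this. It works on a plain String assumed to be the inner HTML of the pre element (for example what elements.html() returns) and uses only java.util.regex; the class name ParagraphSplitSketch and method textByAnchorId are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphSplitSketch {
    // Matches one anchor such as <a id="par1" href="#par1">...</a> and captures its id.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\s+id=\"([^\"]+)\"[^>]*>.*?</a>");

    // Maps each anchor id to the text between that anchor and the next one.
    static Map<String, String> textByAnchorId(String preHtml) {
        Map<String, String> sections = new LinkedHashMap<>();
        Matcher matcher = ANCHOR.matcher(preHtml);
        String currentId = null;
        int start = 0;
        while (matcher.find()) {
            if (currentId != null) {
                sections.put(currentId, preHtml.substring(start, matcher.start()).trim());
            }
            currentId = matcher.group(1);
            start = matcher.end();
        }
        if (currentId != null) {
            // Text after the last anchor runs to the end of the input.
            sections.put(currentId, preHtml.substring(start).trim());
        }
        return sections;
    }

    public static void main(String[] args) {
        String preHtml = "SomeText\n"
                + "<a id=\"par1\" href=\"#par1\">\u00b6</a>\n"
                + "MoreText one\n"
                + "<a id=\"par2\" href=\"#par2\">\u00b6</a>\n"
                + "MoreText two\n";
        for (Map.Entry<String, String> entry : textByAnchorId(preHtml).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Text before the first anchor ("SomeText" here) is skipped, matching step 3 of the list above.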
Wait, so you need the part inside the href attribute, right? Say we have
<a id="par1" href="#iNeedThisPart">¶</a>; do you then want 'iNeedThisPart'?
If that is indeed what you want, then you need to use the CSS query a[href], which selects all 'a' tags that have an 'href' attribute. The Jsoup code for this is as follows:
public List<String> getTextWithinHrefAttribute(final File file) throws IOException {
    final List<String> hrefTexts = new ArrayList<>();
    final Document document = Jsoup.parse(file, "utf-8");
    final Elements ahrefs = document.select("a[href]");
    for (final Element ahref : ahrefs) {
        hrefTexts.add(ahref.attr("href"));
    }
    return hrefTexts;
}
I am assuming that you are parsing from a file, and not crawling a web page.
Here is the html I'm trying to parse:
<div class="entry">
<img src="http://www.example.com/image.jpg" alt="Image Title">
<p>Here is some text</p>
<p>Here is some more text</p>
</div>
I want to get the text within the <p>'s into one ArrayList. I've tried using Jsoup for this.
Document doc = Jsoup.parse(line);
Elements descs = doc.getElementsByClass("entry");
for (Element desc : descs) {
    String text = desc.getElementsByTag("p").first().text();
    myArrayList.add(text);
}
But this doesn't work at all. I'm quite new to Jsoup but it seems it has its limitations. If I can get the text within <p> into one ArrayList using Jsoup, how can I accomplish that? If I must use some other means to parse the html, let me know.
I'm using a BufferedReader to read the html file one line at a time.
You could change your approach to the following:
Document doc = Jsoup.parse(line);
Elements pElems = doc.select("div.entry > p");
for (Element pElem : pElems) {
    // text() returns the visible text; data() is for script/style content and would be empty for <p>
    myArrayList.add(pElem.text());
}
Not sure why you are reading the HTML line by line; when each line is parsed on its own, Jsoup never sees the <div> together with its <p> children. If you want to parse the whole HTML at once, use the code below:
String line = "<div class=\"entry\">" +
        "<img src=\"http://www.example.com/image.jpg\" alt=\"Image Title\">" +
        "<p>Here is some text</p>" +
        "<p>Here is some more text</p>" +
        "</div>";
Document doc = Jsoup.parse(line);
Elements descs = doc.getElementsByClass("entry");
List<String> myArrayList = new ArrayList<String>();
for (Element desc : descs) {
    Elements paragraphs = desc.getElementsByTag("p");
    for (Element paragraph : paragraphs) {
        myArrayList.add(paragraph.text());
    }
}
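If the HTML still has to come through a BufferedReader, one option is to collect all of its lines into a single String first and parse that once. A minimal sketch using only the standard library; the class name WholeHtmlReaderSketch and method readAll are hypothetical, and the StringReader in main stands in for a FileReader on the real file.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.stream.Collectors;

class WholeHtmlReaderSketch {
    // Joins every line from the reader into one String, restoring the line breaks.
    static String readAll(BufferedReader reader) {
        return reader.lines().collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Assumption: a StringReader replaces the real file reader for the demo.
        BufferedReader reader = new BufferedReader(new StringReader(
                "<div class=\"entry\">\n"
                + "<p>Here is some text</p>\n"
                + "<p>Here is some more text</p>\n"
                + "</div>"));
        String html = readAll(reader);
        // html can now be handed to Jsoup.parse(html) in a single call.
        System.out.println(html);
    }
}
```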
In your for-loop:
Elements ps = desc.select("p");
(http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#select(java.lang.String))
Try this:
Document doc = Jsoup.parse(line);
String text = doc.select("p").first().text();
I have an HTML file which contains a specific tag, e.g. <TABLE cellspacing=0>, and the end tag is </TABLE>. Now I want to get everything between those tags. I am using the Jericho HTML parser in Java to parse the HTML. Is it possible to get the text and other tags between specific tags with the Jericho parser?
For example:
<TABLE cellspacing=0>
<tr><td>HELLO</td>
<td>How are you</td></tr>
</TABLE>
Answer:
<tr><td>HELLO</td>
<td>How are you</td></tr>
Once you have found the Element of your table, all you have to do is call getContent().toString(). Here's a quick example using your sample HTML:
Source source = new Source("<TABLE cellspacing=0>\n" +
" <tr><td>HELLO</td> \n" +
" <td>How are you</td></tr>\n" +
"</TABLE>");
Element table = source.getFirstElement();
String tableContent = table.getContent().toString();
System.out.println(tableContent);
Output:
<tr><td>HELLO</td>
<td>How are you</td></tr>
Aby, the code below walks through all the td elements and prints them to the screen. Maybe it will help you.
// palavraCompara (the word to compare) and tabela (the result list) are defined elsewhere
List<Element> elementListTd = source.getAllElements(HTMLElementName.TD);
// Loop through the list of "td" elements on the page
for (Element element : elementListTd) {
    if (element.getAttributes() != null) {
        String td = element.getAllElements().toString();
        String tag = "td";
        System.out.println("TD: " + td);
        System.out.println(element.getContent());
        String conteudoAtributo = element.getTextExtractor().toString();
        System.out.println(conteudoAtributo);
        if (td.contains(palavraCompara)) {
            tabela.add(conteudoAtributo);
        }
    }
}