I have looked through multiple forums before asking this question. Basically, what I need is to select part of the text in an HTML file. The HTML is constructed something like this:
<div class = "pane big">
<code>
<pre>
SomeText
<a id="par1" href="#par1">¶</a>
MoreText
.
.
.
<a id="par2" href="#par2">¶</a>
MoreText
</pre>
</code>
</div>
So what I need to do is extract the text under the par1 anchor by itself, and then get the text under the par2 anchor separately. I tried to use Jsoup, but all I could do was select the whole text within the div. I also tried XPath, but the expression I'm evaluating is not accepted; I'm not sure why, maybe because it's not an XML file to begin with.
An example of the XPath expressions I used is:
/html/body/div/div[2]/code[2]/pre/text()[3]
and the CSS selector:
body > div > div.pane.big > code:nth-child(7) > pre
It's not possible to do that with pure CSS selectors; additional extracting and appending logic is needed in Java code:
Select the pre element.
Split it into a sequence of text parts, using the a elements as separators.
Skip the first part and join the next two (or more) parts.
Here is a simple code sample for that (JDK 1.8 style with the Stream API, and old JDK 1.5–1.7 style):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import java.io.File;
import java.io.IOException;
import static java.util.Arrays.stream;
import static java.util.stream.Collectors.joining;
public class SimpleParser {
public static void main(String[] args) throws IOException {
final Document document = Jsoup.parse(new File("div.html"), "UTF-8");
final Elements elements = document.select("div.pane.big pre");
System.out.println("JDK 1.8 style");
System.out.println(
stream(elements.html().split("\\s+<a.+</a>\\s+"))
.skip(1)
.collect(joining("\n")
));
System.out.println("\nJDK 1.7 style");
String[] textParts = elements.html().split("\\s+<a.+</a>\\s+");
StringBuilder resultText = new StringBuilder();
for (int i = 1; i < textParts.length; i++) {
resultText.append(textParts[i] + "\n");
}
System.out.println(resultText.toString());
}
}
P.S. Note that the last div tag in your HTML code sample should be closed.
Wait, so you need the part inside the href attribute, right? Say we have
<a id="par1" href="#iNeedThisPart">¶</a>, then do you want 'iNeedThisPart'?
If that is indeed what you want, then you need to use the CSS query a[href], which selects all 'a' tags with an 'href' attribute. The Jsoup code for that is as follows:
public List<String> getTextWithinHrefAttribute(final File file) throws IOException{
final List<String> hrefTexts = new ArrayList<>();
final Document document = Jsoup.parse(file, "utf-8");
final Elements ahrefs = document.select("a[href]");
for (final Element ahref : ahrefs) {
hrefTexts.add(ahref.attr("href"));
}
return hrefTexts;
}
I am assuming that you are parsing from a file, and not crawling a web page.
Very new to Jsoup. I'm trying to retrieve a changeable value that is stored within an <h2> tag, specifically from the following website and HTML.
Snapshot of HTML
The results after "constituency/" are changeable and depend on the input of the user. I am able to retrieve the h2 tags themselves, but not the information within. At the moment the best return I can get is just the tags, using the method below.
The desired return would be something that I can substring down into
Dublin Bay South
The actual return is
<well.col-md-4.h2></well.col-md-4.h2>
private String jSoupTDRequest(String aLine1, String aLine3) throws IOException {
String constit = "";
String h2 = "h2";
String url = "https://www.whoismytd.com/search?utf8=✓&form-input="+aLine1+"%2C+"+aLine3+"+Ireland";
//Switch to try catch if time
Document doc = Jsoup.connect(url)
.timeout(6000).get();
//Scrape elements from relevant section
Elements body = doc.select("well.col-md-4.h2");
Element e = new Element("well.col-md-4.h2");
constit = e.toString();
return constit;
}
I am extremely new to Jsoup and scraping in general. I would appreciate any input from someone who knows what they're doing, or any alternate ways to try and get the desired result.
Change your scraping elements from relevant section code as follows:
Select the very first <div class="well"> element first.
Element tdsDiv = doc.select("div.well").first();
Select the very first <a> link element next. This link points to the constituency.
Element constLink = tdsDiv.select("a").first();
Get the constituency name by grabbing this link's text content.
constit = constLink.text();
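Putting the three steps together against a minimal offline stand-in for the page markup (the HTML string and class name below are assumed for illustration, not copied from the live site):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ConstituencyDemo {
    // Simplified stand-in for the real page's markup (assumed structure).
    static final String HTML =
        "<div class=\"well\">"
      + "<h2>Constituency</h2>"
      + "<a href=\"/constituency/dublin-bay-south\">Dublin Bay South</a>"
      + "</div>";

    static String extractConstituency(String html) {
        Document doc = Jsoup.parse(html);
        Element tdsDiv = doc.select("div.well").first();  // step 1
        Element constLink = tdsDiv.select("a").first();   // step 2
        return constLink.text();                          // step 3
    }

    public static void main(String[] args) {
        System.out.println(extractConstituency(HTML)); // Dublin Bay South
    }
}
```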
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;
import java.io.IOException;
@DisplayName("JSoup, how to return data from a dynamic <a href> tag")
class JsoupQuestionTest {
private static final String URL = "https://www.whoismytd.com/search?utf8=%E2%9C%93&form-input=Kildare%20Street%2C%20Dublin%2C%20Ireland";
@Test
void findSomeText() throws IOException {
String expected = "Dublin Bay South";
Document document = Jsoup.connect(URL).get();
String actual = document.getElementsByAttributeValue("href", "/constituency/dublin-bay-south").text();
Assertions.assertEquals(expected, actual);
}
}
I am writing a Selenium test on our web page. There is one label field that has one or more text segments separated by a right-caret icon. I am trying to extract the individual text segments from the label into a list.
This is what the html looks like in the DOM. In this case there are 3 individual text segments: "MainSchedule", "Container1", and "Container1.2".
<p class="MuiTypography-root MuiTypography-body1" style="word-break: break-all;">
"MainSchedule"
<svg aria-hidden="true" focusable="false" data-prefix="fas" data-icon="caret-right" class="svg-inline--fa fa-caret-right fa-w-6 sm-icon" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 192 512" style="margin: 0px 5px;">
<path fill="currentColor" d="M0 384.662V127.338c0-17.818 21.543-26.741 34.142-14.142l128.662 128.662c7.81 7.81 7.81 20.474 0 28.284L34.142 398.804C21.543 411.404 0 402.48 0 384.662z"/>
</svg>
"Container1"
<svg aria-hidden="true" focusable="false" data-prefix="fas" data-icon="caret-right" class="svg-inline--fa fa-caret-right fa-w-6 sm-icon" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 192 512" style="margin: 0px 5px;">
<path fill="currentColor" d="M0 384.662V127.338c0-17.818 21.543-26.741 34.142-14.142l128.662 128.662c7.81 7.81 7.81 20.474 0 28.284L34.142 398.804C21.543 411.404 0 402.48 0 384.662z"/>
</svg>
"Container1.2"
</p>
I can easily get the paragraph object with
WebElement label = driver.findElement(By.cssSelector("p.MuiTypography-root"));
but when I try to do a getText() off of label it returns all 3 of the text segments in one string with no breaks to show where the image icons are.
Using the Chrome tools I can look at the element's properties and on the "p.MuiTypography-root" I see the "firstChild" text content is the first text segment "MainSchedule". I have tried
label.findElement(By.xpath("first-child"))
and it just throws an error. From that "firstChild" I can step through the "nextSibling" in the Chrome tools and find the ones that hold the individual text segments. But I have not figured out how to code this to read them.
I am writing my tests in java.
You can't do this directly in Selenium, because you need to return text fragments, and the Selenium finders all return web elements.
However, there are XPath selectors you can use for this, which will return the specific text fragments you need. The basic approach is an XPath selector like this:
//p[contains(@class, 'MuiTypography-root')]/text()[position() = 1]
This will return the first fragment of text inside the <p> element - so, this (after trimming off the excess whitespace):
"MainSchedule"
How do we use the above XPath selector? We will replace the hard-coded "1": we determine the number of possible text fragments we need to extract, and we build a loop accordingly.
We use the XPath classes and parsers provided in Java as follows:
import java.io.IOException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
...
// assuming Firefox (I guess you are using Chrome):
System.setProperty("webdriver.gecko.driver", "your/path/here/geckodriver.exe");
WebDriver driver = new FirefoxDriver();
String uri = "your URL in here";
driver.navigate().to(uri);
// Here is where we use the Java parser and xpath classes:
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document doc = docBuilder.parse(uri);
XPath xPath = XPathFactory.newInstance().newXPath();
// count how many <svg> tags there are.
String svgCounter = "count(//p[contains(@class, 'MuiTypography-root')]/svg)";
String count = xPath.compile(svgCounter).evaluate(doc);
// There can be up to this many pieces of text we need to extract:
int max = Integer.parseInt(count) + 1;
String expressionOne = "//p[contains(@class, 'MuiTypography-root')]/text()[position() = %s]";
for (int i = 1; i <= max; i++) {
String result = xPath.compile(String.format(expressionOne, i)).evaluate(doc).trim();
if (!result.isBlank()) {
System.out.println(result);
}
}
driver.quit();
The above print statements print the following:
"MainSchedule"
"Container1"
"Container1.2"
Points to note:
(1) This approach assumes that you have a well-formed HTML document which can be parsed at this step:
Document doc = docBuilder.parse(uri);
(2) The above code assumes there is one <p> element with an unspecified number of child <svg> tags. If you have multiple such <p> elements in your page, then you will need to adjust the above code accordingly, to process each <p> element one-by-one.
(3) If you don't have a well-formed HTML document, the above approach may fail. There is a hackier approach you can take, in that case - but it is not really recommended because it involves using a regular expression to split up a string of HTML - almost never a good idea. Often, this will be brittle and fail in surprising ways.
The hack goes like this:
String expressionTwo = "//p[contains(@class, 'MuiTypography-root')]";
WebElement element1 = driver.findElement(By.xpath(expressionTwo));
String html = element1.getAttribute("innerHTML").replace('\n', ' ');
String[] items = html.split("<svg .*?</svg>");
for (String item : items) {
System.out.println(item.trim());
}
In this case, we use the innerHTML attribute to get the string we need to manipulate.
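The split itself can be exercised offline, without Selenium, on a trimmed-down copy of the markup from the question (the class and helper names below are mine):

```java
public class SvgSplitDemo {
    // innerHTML as it might come back from getAttribute("innerHTML"):
    // a trimmed-down version of the markup in the question.
    static final String INNER_HTML =
        "\"MainSchedule\"\n"
      + "<svg class=\"svg-inline--fa\" viewBox=\"0 0 192 512\"><path/></svg>\n"
      + "\"Container1\"\n"
      + "<svg class=\"svg-inline--fa\" viewBox=\"0 0 192 512\"><path/></svg>\n"
      + "\"Container1.2\"";

    static String[] splitSegments(String html) {
        // Same hack as above: flatten newlines (so . matches everything),
        // then split on each <svg>...</svg> block.
        return html.replace('\n', ' ').split("<svg .*?</svg>");
    }

    public static void main(String[] args) {
        for (String item : splitSegments(INNER_HTML)) {
            System.out.println(item.trim());
        }
    }
}
```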
Hi, I need to scrape a web site using Jsoup, and I need to get the output in key-value pairs. Can anyone suggest how?
The URL I need to scrape is https://www.cpsc.gov/Recalls?field_rc_date_value%5Bmin%5D&field_rc_date_value%5Bmax%5D&field_rc_heading_value=&field_rc_hazard_description_value=&field_rc_manufactured_in_value=&field_rc_manufacturers_value=&field_rc_number_value=
The code I wrote is:
package com.jaysons;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ScrapeBody {
public static void main( String[] args ) throws IOException{
String url = "https://www.cpsc.gov/Recalls?field_rc_date_value%5Bmin%5D&field_rc_date_value%5Bmax%5D&field_rc_heading_value=&field_rc_hazard_description_value=&field_rc_manufactured_in_value=&field_rc_manufacturers_value=&field_rc_number_value=";
Document doc = Jsoup.connect(url).get();
Elements content = doc.select("div.views-field views-field-php");
doc = Jsoup.parse( content.html().replaceAll("</div>", "</div><span>")
.replaceAll("<div", "</span><div") );
Elements labels = doc.select("div.remedy");
for (Element label : labels) {
System.out.println(String.format("%s %s", label.text().trim(),
label.nextElementSibling().text()));
}
}
}
I need the output in key-value pairs, like:
Date:OCTOBER 20, 2017
remedy:
units:
website:http://www.bosch-home.com/us
phone:(888) 965-5813
Kindly let me know where I made a mistake.
There's no need to reassign and re-parse the value of the content variable.
Elements content = doc.select("div.views-field > span");
for (Element viewField : content) {
/*
each viewField corresponds to one
<div class="views-field views-field-php">
<span class="field-content">
<a href="/Recalls/2018/BSH-Home-Appliances-amplía-retiro-del-mercado-de-lavavajillas">
<div class="date">
October 20, 2017
</div>
...
</span>
</div>
*/
Elements divs = viewField.getElementsByTag("div");
for (Element div : divs) {
String className = div.className();
if (className.equals("date")) {
// store and extract date
} else if (className.equals("...")) {
// do something else
} // else...
}
}
You can select subelements not only by tag, but also by name, by attributes, etc. Check the official documentation for more info: https://jsoup.org/cookbook/extracting-data/dom-navigation
Disclaimer: I could not test the code right now.
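As a sketch of how the class-name dispatch above could collect the fields into a map: the sample markup and field classes here are illustrative, shaped like the comment above rather than taken from the live CPSC page.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RecallDemo {
    // Simplified stand-in for one recall entry (assumed structure).
    static final String HTML =
        "<div class=\"views-field views-field-php\"><span class=\"field-content\">"
      + "<a href=\"/Recalls/2018/example\">"
      + "<div class=\"date\">October 20, 2017</div>"
      + "<div class=\"title\">Example Recall</div>"
      + "</a></span></div>";

    static Map<String, String> extractFields(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Document doc = Jsoup.parse(html);
        for (Element span : doc.select("div.views-field > span")) {
            for (Element div : span.getElementsByTag("div")) {
                // key = the inner div's class name, value = its text
                fields.put(div.className(), div.text());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        extractFields(HTML).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```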
I am trying to parse this URL: http://online.wsj.com/public/page/news-wall-street-heard.html?dsk=y
Document jDoc = Jsoup.connect(url1).get();
System.out.println(jDoc.text());
But the output of the second line (above) is all the tags inside the textarea, plus the text of other tags. The output looks like:
..
..
<ul class="">
<li><a data-time="1dy" data-frequency="1mi" class="mdm_time">1 Day</a></li>
<li><a data-time="5dy" data-frequency="15mi" class="mdm_time">5 Days</a></li>
..
..
All the HTML inside the textarea is getting printed, along with the text of other tags. I either want to remove this tag from the Document, or want to get it as an element so that I can remove it by hand.
I hope I was able to explain everything clearly. Please help me solve this.
EDIT :
As per suggestion, I did this :
System.out.println(jDoc1.select("textarea"));
And the output is:
<textarea id="wsj_autocomplete_template" style="display:none">
<div>
<div class="acHeadline hidden" >
</div>
<div class="dropdownContainerClass">
<div class="suggestionblock hidden" templateType="C1">
....
...
..
Certainly it is selecting the textarea, but it is not able to parse the inner elements, possibly because the content uses &lt; entities instead of literal < characters. Is there any workaround for this?
If you want to remove the entire textarea tag, use doc.select("textarea").remove();. Or if you want to get the content of the textarea, use doc.select("textarea").text(). Note that here I'm using the text() method instead of the toString() or html() methods; this gives the exact text rather than HTML escape codes.
Again, if you want to manipulate this HTML, you can parse it again, like Document textareaDoc = Jsoup.parseBodyFragment(doc.select("textarea").text());
Example
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class WSJParser {
public static void main(String[] args) {
String url = "http://online.wsj.com/public/page/news-wall-street-heard.html?dsk=y";
try {
Document doc = Jsoup.connect(url).get();
//doc.select("textarea").remove(); // Removes the entire text area tag
Document textareaDoc = Jsoup.parseBodyFragment(doc.select("textarea").text());
System.out.println(textareaDoc);
} catch (IOException e) {
e.printStackTrace();
}
}
}
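The same two-step parse can be seen offline on an inline snippet, without hitting the WSJ site (the sample markup and the acHeadline class are assumed for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TextareaDemo {
    // Extracts the escaped markup from the textarea and re-parses it.
    static String headline(String html) {
        Document doc = Jsoup.parse(html);
        // text() unescapes the entities, giving real markup...
        String inner = doc.select("textarea").text();
        // ...which can then be parsed as a fragment in its own right.
        Document textareaDoc = Jsoup.parseBodyFragment(inner);
        return textareaDoc.select("div.acHeadline").text();
    }

    public static void main(String[] args) {
        String html = "<html><body><textarea id=\"tpl\">"
                    + "&lt;div class=\"acHeadline\"&gt;Hello&lt;/div&gt;"
                    + "</textarea></body></html>";
        System.out.println(headline(html)); // Hello
    }
}
```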
If I understand correctly, what you want is this
Elements textareas = Jsoup.connect(url1).get().select("textarea");
for (Element textarea : textareas) {
Elements elements = textarea.select("*");
for (Element element : elements) {
System.out.println(element.ownText());
}
}
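The distinction this relies on: text() flattens the text of all descendants, while ownText() returns only the element's direct text. A minimal sketch (the sample markup is mine):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class OwnTextDemo {
    public static void main(String[] args) {
        Element div = Jsoup.parseBodyFragment(
            "<div>outer <span>inner</span></div>").selectFirst("div");
        System.out.println(div.text());    // outer inner
        System.out.println(div.ownText()); // outer
    }
}
```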
I want to split the following string according to the td tags:
<html>
<body>
<table>
<tr><td>data1</td></tr>
<tr><td>data2</td></tr>
<tr><td>data3</td></tr>
<tr><td>data4</td></tr>
</table>
</body>
</html>
I've tried split("h2"); and split("[h2]");, but this way the split method splits the HTML code where it finds "h" or "2" and, if I am not mistaken, also "h2".
My ultimate goal is to retrieve everything between <td> and </td>
Can anyone please tell me how to do this using only split()?
Thanks a lot
No.
That would mean, in essence, parsing HTML with regex. We don't do that 'round these parts.
Here is how to solve your optimal goal:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String html = ""; // your html
Pattern p = Pattern.compile("<td>([^<]*)</td>", Pattern.MULTILINE | Pattern.DOTALL);
for (Matcher m = p.matcher(html); m.find(); ) {
    String tag = m.group(1);
    System.out.println(tag);
}
Please note that this code was written without a compiler, but it gives the idea.
BUT why do you want to parse HTML using regex? I agree with the other answers: use an HTML or XML parser (if your HTML is well-formed).
You cannot successfully parse HTML (or in your case, get the data between TD tags) with regular expressions. You should take a look at a simple HTML parser:
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;
public static List<String> extractTDs(String html) throws IOException {
final List<String> tdList = new ArrayList<String>();
ParserDelegator parserDelegator = new ParserDelegator();
ParserCallback parserCallback = new ParserCallback() {
StringBuffer buffer = new StringBuffer();
public void handleText(final char[] data, final int pos) {
buffer.append(data);
}
public void handleEndTag(Tag t, final int pos) {
if(Tag.TD.equals(t)) {
tdList.add(buffer.toString());
}
buffer = new StringBuffer();
}
};
parserDelegator.parse(new StringReader(html), parserCallback, true);
return tdList;
}
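Fed the table from the earlier question, the callback collects the cell texts in document order. Here is a self-contained, runnable version of the sketch above (the class name TdExtractor and the wiring in main are mine):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class TdExtractor {
    public static List<String> extractTDs(String html) throws IOException {
        final List<String> tdList = new ArrayList<>();
        ParserCallback parserCallback = new ParserCallback() {
            StringBuffer buffer = new StringBuffer();
            @Override
            public void handleText(final char[] data, final int pos) {
                buffer.append(data);
            }
            @Override
            public void handleEndTag(Tag t, final int pos) {
                if (Tag.TD.equals(t)) {
                    tdList.add(buffer.toString());
                }
                buffer = new StringBuffer(); // reset for the next cell
            }
        };
        new ParserDelegator().parse(new StringReader(html), parserCallback, true);
        return tdList;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><table>"
                    + "<tr><td>data1</td></tr><tr><td>data2</td></tr>"
                    + "<tr><td>data3</td></tr><tr><td>data4</td></tr>"
                    + "</table></body></html>";
        System.out.println(extractTDs(html)); // [data1, data2, data3, data4]
    }
}
```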
String.split or regexes should not be used to parse markup languages, as they have no notion of depth (HTML is a recursive grammar and needs a recursive parser). Consider what would happen if your <td> looked like:
<td>
<table><tr><td> td inside a td? </td></tr></table>
</td>
A regex would greedily match everything between the outer <td>...</td>, giving you unwanted results.
You should use an HTML parser like Johan mentioned.
You should really use an HTML parser, such as NekoHTML or HtmlParser.
If you have a very small set of controlled HTML, you could (although I generally recommend against it) use a regex such as
(?<=\\<td\\>)\\w+(?=\\</td\\>)
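In Java source, that pattern (with the redundant escapes dropped) can be applied like this; note it only works while the cell content is plain word characters, with no nested tags, spaces, or entities:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TdRegexDemo {
    static List<String> cells(String html) {
        // Lookbehind/lookahead keep the <td> delimiters out of the match.
        Pattern p = Pattern.compile("(?<=<td>)\\w+(?=</td>)");
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<tr><td>data1</td></tr><tr><td>data2</td></tr>";
        System.out.println(cells(html)); // [data1, data2]
    }
}
```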