Java get single element from webpage - java

Hi, after some searching I still haven't found an answer. I would like to read a single element of a webpage into a String variable. I know how to do this in C but would like to know how to do it in Java.
For example (pseudocode):
document.nav(the webpage)
String value = document.getElementbyid(theid)
So if some webpage contains
<body>
<p id="element1">the value i want</p>
</body>
I need to get that value from the webpage into a String variable.
Thanks

You can use jsoup for that:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
String url = "http://www.example.com"; // or whatever goes here
Document document = Jsoup.connect(url).followRedirects(false).timeout(60000 /* wait up to 60 sec for response */).get();
String value = document.body().select("#element1" /* CSS selector */).get(0).text();
If you need another input format, please refer to the jsoup cookbook.
It's not really necessary to specify the timeout etc. for the connection. You could just use
Document document = Jsoup.connect(url).get();
I only included the timeout in case the webpage takes a really long time to load. You may also want to follow redirects, which jsoup does by default when followRedirects(false) is left out.

Related

Jsoup extract Hrefs from the HTML content

My problem is that I am trying to get the hrefs from this site with Jsoup
https://www.amazon.de/s?k=kissen&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss_2
but it does not work.
I tried to select the links by their class like this
Elements elements = documentMainSite.select(".a-link-normal");
and after that I tried to extract the hrefs with the following piece of code.
for (Element element : elements) {
String href = element.attributes().get("href");
}
but unfortunately it gives me nothing.
Can someone please tell me where my mistake is?
I don't just connect to the website. I also save each href in a string by extracting it with
String href = element.attributes().get("href");
After that I print the href string, but it is empty.
On the other hand, the code works with another CSS selector, so it has nothing to do with the code itself. It's just the CSS selector (.a-link-normal) that is probably wrong.
You won't get the search results by simply connecting to the URL via Jsoup.
Document document = Jsoup.connect(yourUrl).get();
String bodyText = document.getElementsByTag("body").get(0).text();
Here is the translation of the body text, which I got from the above code.
Enter the characters below We ask for your understanding and want to
be sure that you are not a bot. For best results, please use a browser
that accepts cookies. Type the characters you see in the image: Enter
characters Try another image Continue shopping Terms & Conditions
Privacy Policy © 1996-2015, Amazon.com, Inc. or its affiliates
You either need to get past the captcha or emulate a browser, for example by means of Selenium.

How can I check a variable's value in the page source using Selenium

I have a scenario where I need to check a variable's value in the page source.
For example, for the URL below
https://www.seniorhousingnet.com/seniorliving-detail/overture-fair-ridge-62-apartment-homes_3955-fair-ridge-drive_fairfax_va_22033-581333
click "view page source", then find the variable "leadtype".
I know we need to use driver.getPageSource() to get the page source in Selenium, but I need to check the leadtype value for a particular property. If it is SHN-enhanced, one piece of logic applies; if the leadtype value is different, we need to apply other logic. Please let me know how to check the leadtype value in this scenario.
I hope you are working in Java; Java provides multiple libraries for reading HTML content.
Once you have the page source, parse it into an HTML document object and navigate to the desired node. When you have the node of your choice, you can get its attributes, its value, and its other properties as well.
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.10.2</version>
</dependency>
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
log("%s\n\t%s",
headline.attr("title"), headline.absUrl("href"));
}
See also: the jsoup library site, the jsoup tutorial, and the Baeldung jsoup tutorial.
There is also a Stack Overflow question about HTML parsers; please check it as well.
You don't have to parse the HTML to get the value. That JS line is actually executed and the adobeDTM variable then holds the data. You can access that using adobeDTM.leadType but you will need to execute JavaScript in order to obtain the value.
String leadType = (String) ((JavascriptExecutor) driver).executeScript("return adobeDTM.leadType");
leadType now contains "shn-enhanced" (according to my code execution).

How to validate that at least one element in a html string has content?

I have a WYSIWYG editor that I can't modify, and it sometimes returns <p></p>, which obviously looks like an empty field to the person using the editor.
So I need to add some validation on my back end, which uses Java.
Should be rejected:
<p></p>
<p> </p>
<div><p> </p></div>
Should be accepted:
<p>a</p>
<div><p>a</p></div>
<p> </p>
<div><p>a</p></div>
Basically, as long as any element contains some content, we will accept it and save it.
I am looking for libraries that I should look at and ideas for how to approach this. Thanks.
You could take a look at the jsoup library. It's pretty fast.
It takes HTML and can return the text from it (see the example from their website below).
Extract attributes, text, and HTML from elements:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link."
I would advise you to do it on the client side, because it is natural for the browser to do this. You need to hook into the send or "save" step of your WYSIWYG editor; a lot of them offer this ability.
The JavaScript would be
function stripIfEmpty(html) {
    var tmp = document.createElement("DIV");
    tmp.innerHTML = html;
    var contentText = tmp.textContent || tmp.innerText || "";
    if (contentText.trim().length === 0) {
        return "";
    } else {
        return html;
    }
}
If you need to do it on the back end, then the only correct solution is to use a library that parses HTML, like jsoup, as @Dmytro Pastovenskyi showed.
If you want to use the back end but can allow the check to be fuzzy rather than strict, you can use a regex like replaceAll("\\<[^>]*>","") then trim, then check whether the string is empty.
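A minimal sketch of that fuzzy regex check, using only the JDK. The class name HtmlTextCheck and the method name hasText are made up for illustration. It treats anything between '<' and '>' as a tag, so it is not a real HTML parser: entities such as &nbsp; count as text, and a literal '<' inside the content can confuse it.

```java
final class HtmlTextCheck {
    // Fuzzy check (assumption: a "tag" is anything between '<' and '>').
    // Removes tag-like runs, then reports whether any non-whitespace text is left.
    static boolean hasText(String html) {
        if (html == null) {
            return false;
        }
        String withoutTags = html.replaceAll("<[^>]*>", "");
        return !withoutTags.trim().isEmpty();
    }

    public static void main(String[] args) {
        // Samples taken from the question.
        String[] samples = {
            "<p></p>", "<p> </p>", "<div><p> </p></div>", "<p>a</p>", "<div><p>a</p></div>"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + (hasText(s) ? "accept" : "reject"));
        }
    }
}
```

On the back end you would call something like HtmlTextCheck.hasText(editorHtml) before saving and reject the input when it returns false.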
You can use regular expressions (built into Java).
For example,
"<p>\\s*\\w+\\s*</p>"
would match a <p> tag whose content is a single run of at least one word character, optionally surrounded by whitespace.
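To see how that pattern behaves, here is a small sketch using java.util.regex. The class name ParagraphRegexDemo and the method containsNonEmptyP are hypothetical. Note that \w+ only covers letters, digits and underscore, so with this exact pattern a paragraph containing two words separated by a space, or punctuation, is not matched; you would probably need to loosen it for real editor output.

```java
import java.util.regex.Pattern;

final class ParagraphRegexDemo {
    // The pattern from the answer above, compiled once.
    static final Pattern NON_EMPTY_P = Pattern.compile("<p>\\s*\\w+\\s*</p>");

    // True if the input contains at least one <p> element matching the pattern.
    static boolean containsNonEmptyP(String html) {
        return NON_EMPTY_P.matcher(html).find();
    }

    public static void main(String[] args) {
        String[] samples = {"<p>a</p>", "<p> </p>", "<div><p>a</p></div>", "<p>two words</p>"};
        for (String s : samples) {
            System.out.println(s + " -> " + containsNonEmptyP(s));
        }
    }
}
```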

How to get HTML table content from an in-complete webpage (HTML only) online?

Question: How do I get table content from an HTML page in Java?
Requirement: It has to be an online page, not a local file.
I want to extract the URL of the first subject at:
https://discussions.apple.com/community/ipad/using_ipad?view=discussions#/?page=3
I tried the following code to get the first subject on page 3:
String url_page3 = "https://discussions.apple.com/community/ipad/using_ipad?view=discussions#/?page=3";
String key = "td.jive-table-cell-subject > a[href]";
Document doc = Jsoup.connect(url_page3).maxBodySize(0).timeout(0).get();
Element e = doc.select(key).first();
System.out.println(e.attr("abs:href"));
It returns the first subject on page 1 (even when I change the connected URL to page 4, page 5, ...).
Why does this happen? Is there any other method I could try?
The reason is simple. The part of the URL after the hash sign (#) does not matter to the server, so it sends only the first page. I guess the other pages are sent out via AJAX, so you need to check the network traffic to find that URL. Then you can also read the next pages.
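To illustrate the point about the part after '#', here is a small sketch with java.net.URI that splits the question's URL into what an HTTP client would request and the fragment that stays on the client. The class name FragmentDemo and both method names are made up for this example; it does not fetch anything.

```java
import java.net.URI;

final class FragmentDemo {
    // The fragment is everything after the first '#'; it is never sent in the HTTP request.
    static String fragmentOf(String url) {
        return URI.create(url).getRawFragment();
    }

    // The portion of the URL that a client actually requests from the server.
    static String sentToServer(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        String page3 = "https://discussions.apple.com/community/ipad/using_ipad?view=discussions#/?page=3";
        String page4 = "https://discussions.apple.com/community/ipad/using_ipad?view=discussions#/?page=4";
        System.out.println("fragment:  " + fragmentOf(page3));
        System.out.println("requested: " + sentToServer(page3));
        // Both URLs lead to the same request, which is why page 1 always comes back.
        System.out.println("same request: " + sentToServer(page3).equals(sentToServer(page4)));
    }
}
```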

How to parse content with <pre>?

I am using jsoup to parse a number of things.
I am trying to parse this tag
<pre>Hello World</pre>
but just can't get it to work.
How would I parse this using jsoup?
Here is what I have so far:
Document jsDoc = null;
jsDoc = Jsoup.connect(url).get();
Elements titleElements = jsDoc.getElementsByTag("pre");
Works fine for me with latest Jsoup:
String html = "<p>lorem ipsum</p><pre>Hello World</pre><p>dolor sit amet</p>";
Document document = Jsoup.parse(html);
Elements pres = document.select("pre");
for (Element pre : pres) {
System.out.println(pre.text());
}
Result:
Hello World
If you get nothing, then the HTML which you're parsing simply doesn't contain any <pre> element. Check it yourself by
System.out.println(document.html());
Perhaps the URL is wrong. Perhaps there's some JavaScript which alters the HTML DOM with new elements (Jsoup doesn't interpret nor execute JS). Perhaps the site expects a real browser instead of a bot (change the user agent then). Perhaps the site requires a login (you'd need to maintain cookies). Who knows. You can figure this all out with a real webbrowser like Firefox or Chrome.
