I'm trying to parse (with jsoup) some specific text from a website, but it doesn't work for me.
LINK TO SITE
It's the number "43" in red text that I am interested in at the top-right of the page.
This is what I tried:
String test;

public void scan(String url) throws Exception {
    Document document = Jsoup.connect(url).get();
    Elements votes = document.select("#malicious-votes .pull-right");
    test = votes.text();
}

public String returnVotes() {
    return test;
}
~ ~ ~
public static void main(String[] args) throws Exception {
    Scan_VirusTotal virustotal = new Scan_VirusTotal();
    virustotal.scan("https://www.virustotal.com/sv/url/cbf2d00f974d212b6700e7051f8b23f2038e876173066af41780e09481ef1cdd/analysis/1407146081");
    System.out.println(virustotal.returnVotes());
}
This prints nothing. Other elements work fine with this exact method, so I'm really confused as to why this particular piece of text won't parse.
Ideas? Thanks.
EDIT - added some HTML from the page as requested:
<div style="display:block" class="pull-right value text-red" id="malicious-votes">44</div>
Try using this instead:
Elements votes = document.select("#malicious-votes");
test = votes.text();
I tried $("#malicious-votes .pull-right") in the browser console on the given page, and it gives me an empty array. But $("#malicious-votes") gives me the vote div, which itself has the class pull-right.
Your selector should be:
"#malicious-votes", not "#malicious-votes .pull-right".
"#malicious-votes .pull-right" selects any elements with class pull-right that are descendants of #malicious-votes. What you want is the #malicious-votes element itself.
Related
I'm just learning Cucumber. Is it possible to get whatever '(.*)' matches into a variable?
@Then("^ Page is '(.*)' on PDP$")
For example: @Then("^ Page is 'display' on PDP$"). I want to get the word 'display' into a variable, because I want to work with it in my method.
I want something like this:
@Then("^ Page is '(.*)' on PDP$")
public void title() {
    String str = '(.*)';
    ...
}
When you write a Cucumber step with a capture group, the matched word is passed to the step method through its arguments.
@Then("^ Page is '(.*)' on PDP$")
public void title(String str) {
    // str already holds the word captured by '(.*)'; no extra assignment needed
    ...
}
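For example, a hypothetical feature line Then Page is 'display' on PDP would call the method with str = "display". A minimal sketch of a complete step class, assuming the classic cucumber.api.java.en.Then annotation (adjust the import for newer io.cucumber versions):

import cucumber.api.java.en.Then;

public class PdpSteps {
    @Then("^ Page is '(.*)' on PDP$")
    public void title(String str) {
        // str holds whatever '(.*)' matched in the step text, e.g. "display"
        System.out.println("Captured: " + str);
    }
}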
I am taking the HTML code from a website, and I would like to extract the value "31 983" from the following element using Jsoup:
<span class="counter nowrap">31 983</span>
The code below is almost ready, but it does not pick up this value. Could you please help me?
public class TestWebscrapper {

    private static WebDriver driver;

    @BeforeClass
    public static void before() {
        System.setProperty("webdriver.chrome.driver", "src/main/resources/chromedriver.exe");
        driver = new ChromeDriver();
    }

    @Test
    public void typeAllegroUserCodeIntoAllegroPageToAuthenticate() {
        String urlToAuthencicateToTypeUserCode = "https://www.test.pl/";
        driver.get(urlToAuthencicateToTypeUserCode);
        Document doc = Jsoup.parse(driver.getPageSource());
        // how to take below value:
        System.out.println(doc.attr("counter nowrap"));
    }

    @AfterClass
    public static void after() {
        driver.quit();
    }
}
I was trying to use doc.attr, but it does not help.
Jsoup uses CSS selectors to find elements in the HTML source. To achieve what you want, use:
// select the first element containing given classes
Element element = doc.select(".counter.nowrap").first();
// get the text from this element
System.out.println(element.text());
I'm afraid that in your case there may be many elements with the classes counter and nowrap, so you may have to iterate over them or try a different selector to address exactly the one you want. It's hard to tell without the webpage URL.
Answering your original question, how to select by attribute:
Element element = doc.select("span[class=counter nowrap]").first();
or just:
Element element = doc.select("[class=counter nowrap]").first();
I have a method which gets all URLs from a page (and optionally checks whether each one is valid).
But it works only for one page. I want to check the whole website, so I need to make it recursive.
private static FirefoxDriver driver;

public static void main(String[] args) throws Exception {
    driver = new FirefoxDriver();
    driver.get("https://example.com/");
    List<WebElement> allURLs = findAllLinks(driver);
    report(allURLs);
    // here are my trials for recursion
    for (WebElement element : allURLs) {
        driver.get(element.getAttribute("href"));
        List<WebElement> allUrls = findAllLinks(driver);
        report(allUrls);
    }
}

public static List<WebElement> findAllLinks(WebDriver driver)
{
    List<WebElement> elementList = driver.findElements(By.tagName("a"));
    elementList.addAll(driver.findElements(By.tagName("img")));
    List<WebElement> finalList = new ArrayList<>();
    for (WebElement element : elementList)
    {
        if (element.getAttribute("href") != null)
        {
            finalList.add(element);
        }
    }
    return finalList;
}

public static void report(List<WebElement> allURLs) throws Exception {
    for (WebElement element : allURLs) {
        System.out.println("URL: " + element.getAttribute("href") + " returned " + isLinkBroken(new URL(element.getAttribute("href"))));
    }
}
See the comment "here are my trials for recursion". But it goes through the first page, then again through the first page, and that's all.
You're trying to write a web crawler. I am a big fan of code reuse. Which is to say I always look around to see if my project has already been written before I spend the time writing it myself. And there are many versions of web crawlers out there. One written by Marilena Panagiotidou pops up early in a Google search. Leaving out the imports, her basic version looks like this.
public class BasicWebCrawler {

    private HashSet<String> links;

    public BasicWebCrawler() {
        links = new HashSet<String>();
    }

    public void getPageLinks(String URL) {
        //4. Check if you have already crawled the URLs
        //(we are intentionally not checking for duplicate content in this example)
        if (!links.contains(URL)) {
            try {
                //4. (i) If not add it to the index
                if (links.add(URL)) {
                    System.out.println(URL);
                }
                //2. Fetch the HTML code
                Document document = Jsoup.connect(URL).get();
                //3. Parse the HTML to extract links to other URLs
                Elements linksOnPage = document.select("a[href]");
                //5. For each extracted URL... go back to Step 4.
                for (Element page : linksOnPage) {
                    getPageLinks(page.attr("abs:href"));
                }
            } catch (IOException e) {
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        //1. Pick a URL from the frontier
        new BasicWebCrawler().getPageLinks("http://www.mkyong.com/");
    }
}
Probably the most important thing to note here is how the recursion works. A recursive method is one that calls itself. Your example above is not recursion: you have a method findAllLinks that you call once on a page, and then once for every link found on that page. Notice how Marilena's getPageLinks method calls itself once for every link it finds on the page at a given URL. In calling itself it creates a new stack frame, generates a new set of links from a page, and calls itself again once for every link, and so on.
Another important thing to note about a recursive function is when it stops calling itself. In this case Marilena's recursive function keeps calling itself until it can't find any new links. If the page you are crawling links to pages outside its domain, this program could run for a very long time. And, incidentally, what probably happens in that case is where this website got its name: a StackOverflowError.
Make sure you are not visiting the same URL twice: keep a table of already-visited URLs. Since every page is likely to start with a header that links to the home page, you could otherwise end up visiting it over and over again.
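One way to combine both points (a visited-URL table and no risk of blowing the stack) is an iterative crawl with an explicit work queue. The sketch below is my own variant, not Marilena's code; the contains(domain) check is a stand-in for whatever domain boundary you actually want to enforce:

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class IterativeCrawler {

    public static void crawl(String startUrl, String domain) {
        Set<String> visited = new HashSet<>();      // the "table" of already visited URLs
        Deque<String> frontier = new ArrayDeque<>(); // queue instead of the call stack
        frontier.add(startUrl);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            // Skip empty hrefs, URLs outside the target domain, and URLs seen before.
            if (url.isEmpty() || !url.contains(domain) || !visited.add(url)) {
                continue;
            }
            System.out.println(url);
            try {
                Document document = Jsoup.connect(url).get();
                for (Element link : document.select("a[href]")) {
                    frontier.add(link.attr("abs:href"));
                }
            } catch (IOException e) {
                System.err.println("For '" + url + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical start URL and domain; replace with your own.
        crawl("https://example.com/", "example.com");
    }
}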
I'm wondering if it's possible to tie the text of a doc comment of a method to a String value in code. For example:
String myHelloMethodDocComment = "This is a doc comment. \n @param arg1";

{Some sort of magic, maybe here maybe somewhere else}

public void printHello(String world){
    System.out.print("Hello " + world);
}

public static void main(String args[]){
    printHello("world");
}
In this example, if I generate Javadoc or mouse over the line printHello("world") in main using an editor that supports doc comments, I would ideally want to see
"This is a doc comment.
@param arg1"
in the mouse-over window.
Is this possible, maybe using something in the new Java 8 doctree api that I just haven't found?
Page page = new Page();
page.populateProductList( driver.findElement( By.xpath("//div[@id='my_25_products']") ) );

class Page
{
    public final static String ALL_PRODUCTS_PATTERN = "//div[@id='all_products']";

    private List<Product> productList = new ArrayList<>();

    public void populateProductList(final WebElement pProductSectionElement)
    {
        final List<WebElement> productElements = pProductSectionElement.findElements(By.xpath(ALL_PRODUCTS_PATTERN));
        for ( WebElement productElement : productElements ) {
            // test 1 - works
            // System.out.println( "Product block: " + productElement.getText() );
            final Product product = new Product(productElement);
            // test 2 - wrong
            // System.out.println( "Title: " + product.getUrl() );
            // System.out.println( "Url: " + product.getTitle() );
            productList.add(product);
        }
        // test 3 - works
        //System.out.println(productElements.get(0).findElement(Product.URL_PATTERN).getAttribute("href"));
        //System.out.println(productElements.get(1).findElement(Product.URL_PATTERN).getAttribute("href"));
    }
}

class Product
{
    public final static String URL_PATTERN = "//div[@class='url']/a";
    public final static String TITLE_PATTERN = "//div[@class='title']";

    private String url;
    private String title;

    public Product(final WebElement productElement)
    {
        url = productElement.findElement(By.xpath(URL_PATTERN)).getAttribute("href");
        title = productElement.findElement(By.xpath(TITLE_PATTERN)).getText();
    }

    /* ... */
}
The webpage I am trying to 'parse' with Selenium has a lot of code. I need to deal with just a smaller portion of it that contains the products grid.
To the populateProductList() call I pass the resulting portion of the DOM that contains all the products.
(Running that XPath in Chrome returns the expected all_products node.)
In that method, I split the products into 25 individual WebElements, i.e., the product blocks.
(Here I also confirm that works in Chrome and returns the list of nodes, each containing the product data)
Next I want to iterate through the resulting list and pass each WebElement into the Product() constructor that initializes the Product for me.
Before I do that, I run a small test and print out the product block (see test 1); individual blocks are printed out in each iteration.
After performing the product assignments (again, xpath confirmed in Chrome) I run another test (see test 2).
Problem: this time the test returns only the url/title pair from the FIRST product for EACH iteration.
Among other things, I tried moving the Product's findElement() calls into the loop and still had the same problem. Next, I tried running findElements() and doing a get(i).getAttribute("href") on the result; this time it correctly returned individual product URLs (see test 3).
Then, when I do a findElements(URL_PATTERN) on a single productElement inside the loop, it magically returns ALL product URLs... This means that findElement() always returns the first product from the set of 25 products, whereas I would expect the WebElement to contain only one product.
I think this looks like a problem with references, but I have not been able to come up with anything or find a solution online.
Any help with this? Thanks!
java 1.7.0_15, Selenium 2.45.0 and FF 37
The problem is in the XPath of the Product locators.
The XPath expression below means you are looking for a matching element which can be anywhere in the document, not relative to the parent as you are thinking!
//div[@class='url']/a
This is why it always returns the same first element.
So, in order to make it relative to the parent element, it should be as given below (just a . before the //):
public final static String URL_PATTERN = ".//div[@class='url']/a";
public final static String TITLE_PATTERN = ".//div[@class='title']";
Now the search is for a matching child element relative to the parent.
XPath in Selenium works as given below:
/a/b/c --> Absolute - from the root
//a/b --> Matching element which can be anywhere in the document (even outside the parent).
.//a/b --> Matching element inside the given parent
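A tiny sketch of the difference in code (the helper class and method are hypothetical; productElement is one product block, as in the question):

import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;

class XpathScope {
    static String productUrl(WebElement productElement) {
        // ".//" anchors the search inside productElement; a leading "//" would
        // search the whole document and always return the first product's link.
        return productElement.findElement(By.xpath(".//div[@class='url']/a"))
                             .getAttribute("href");
    }
}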