JSoup selector contains for two elements - java

I have a situation where I need a JSoup selector to match either of two possible words. I have already done it for the first word, but I'm struggling to express some kind of logical OR ("or contains another word"). The code I already have:
Iterator<Element> activity = table.select("td[class=xl75], td[class=xl71], td[class=xl73]:contains(word1)").iterator();
I have tried to edit it this way:
Iterator<Element> activity = table.select("td[class=xl75], td[class=xl71], td[class=xl73]:contains(word1):contains(word2)").iterator();
but it's not working. Any ideas how to have both of the two words included in one selector?

You can consider using regex matching for this kinda work.
Where your selector:
td[class=xl75], td[class=xl71], td[class=xl73]:contains(word1):contains(word2)
Can be rewritten as the following code:
td[class=xl75], td[class=xl71], td[class=xl73]:matches((word1)|(word2))
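Chaining :contains(word1):contains(word2) requires both words at once, which is why the second attempt fails; the alternation in the regex gives the OR. A minimal plain-Java sketch of that alternation follows. It assumes that :matches(regex) performs an unanchored, find-style match against the element's text; the class name AlternationDemo and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

final class AlternationDemo {
    // Same pattern as in the selector: succeeds if either word occurs anywhere.
    private static final Pattern EITHER = Pattern.compile("(word1)|(word2)");

    // Assumption: mirrors an unanchored "find" against the cell text.
    static boolean containsEither(String cellText) {
        return EITHER.matcher(cellText).find();
    }

    public static void main(String[] args) {
        System.out.println(containsEither("some word1 here"));
        System.out.println(containsEither("only word2"));
        System.out.println(containsEither("neither of them"));
    }
}
```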

Related

Get element from multiple div class with colon in css html

There are 2 classes with the same name
<div class="website text:middle"> A</div>
<div class="website text:middle"> 1</div>
How to get A and 1? I tried using getElementById with :eq(0) and it gives out null
Method getElementById queries for elements with a specified id, not class; I'm not sure what you were trying to query with :eq(0) either.
Try:
// String html = ...
Document doc = Jsoup.parse(html);
List<String> result = doc.getElementsByClass("text:middle").eachText();
// result = ["A", "1"]
EDIT
You can query for elements that match multiple classes! See Jsoup select div having multiple classes.
However, a colon (:) is a special character in css and needs to be escaped when it appears as part of a class name in a selector query. I don't think that jsoup currently supports this and simply treats everything after a colon as a pseudo-class.
To add to Janez's correct answer - while jsoup's CSS selector (currently) doesn't support escaping a : character in the class name, there are other ways to get it to work if you want to use the select() method instead of getElementsByXXX -- e.g. if you want to combine selectors in one call:
Elements divs = doc.select("div[class=website text:middle]");
That will find div elements with the literal attribute class="website text:middle".
Or:
Elements divs = doc.select("div[class~=text:middle]");
That finds elements with the class attribute that matches the regex /text:middle/.
For the presented data though, I think the getElementsByClass() DOM method is the way to go and the most general. I just wanted to show a couple of alternatives for other cases.
document.querySelectorAll(".website")[0] // 0 is child index
You should use querySelector; it is fully supported by every browser.

How to select all classNames of a page using java?

I'm using selenium (with java) to search all the classNames of a Page and then use Regex to only save the className(s) which have "insignia" in them.
I tried using the code below with a regex to search for classNames which have a mention of "insignia" in them, but it didn't return any result.
System.out.println(driver.findElements(By.className(".*\\binsignia\\b.*")).get(1).getAttribute("src"));
You can't use Regex inside a locator string. You can use a CSS selector and find all elements that contain "insignia" in the class name.
System.out.println(driver.findElements(By.cssSelector("[class*='insignia']")).get(1).getAttribute("src"));
driver.findElements(By.xpath("//*[contains(@class,'insignia')]"))
is going to return the WebElements whose class attribute contains insignia. Note that XPath contains() does a plain substring match, not a regex match.

Does driver.findElement(By.className) locate a class that has multiple class names?

I've got this HTML:
<class = "abc pqr"></class>
So if I do a driver.findElement(By.className("abc")), will WebDriver actually find the class in the DOM structure?
What I want to know is, does the By.className work if we provide just a substring of the class?
The short answer is yes! Either By.className("abc") or By.className("pqr") is perfectly fine in this case.
Note that this is not using a sub-string. In your element <class = "abc pqr"/>, this is a space-separated list of classes!
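The difference between matching one class out of a space-separated list and matching a substring can be sketched in plain Java. This is only an illustration of the two semantics, not how WebDriver is implemented; the class ClassMatchDemo and its method names are hypothetical.

```java
import java.util.Arrays;

final class ClassMatchDemo {
    // Token match: the name must equal one whole entry of the space-separated list.
    static boolean hasClassToken(String classAttr, String name) {
        return Arrays.asList(classAttr.trim().split("\\s+")).contains(name);
    }

    // Substring match: roughly what a [class*='...'] CSS selector expresses.
    static boolean classContains(String classAttr, String part) {
        return classAttr.contains(part);
    }

    public static void main(String[] args) {
        String classAttr = "abc pqr";
        System.out.println(hasClassToken(classAttr, "abc"));     // whole token
        System.out.println(hasClassToken(classAttr, "ab"));      // only part of a token
        System.out.println(hasClassToken(classAttr, "abc pqr")); // two tokens, not one
        System.out.println(classContains(classAttr, "ab"));      // substring
    }
}
```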
If you still need the answer - yes, it will. It will not work, though, if you try to search for both classes at once, like By.className("abc pqr").
If you want to search by part of class use css selector. Imagine you have
<a class="superclass secondaryclass">
Then you can find it with:
By.cssSelector("a[class*='super']")
as *= searches by substring (not an exact match).
If you still want to search by XPath and substring, then you can do some tricks like the following, which emulates ends-with (it matches when the class attribute ends with 'super'):
//a[substring(@class, string-length(@class) - 4) = 'super']
Or try to use //a[contains(@class, 'super')]
There are too many ways to do it :)
Short answer: no. You will have to use one of the below methods for selecting an element based on multiple class names.
driver.findElement(By.cssSelector(".abc.pqr"));
or
driver.findElement(By.xpath("//*[@class='abc pqr']"));

How to extract links from a web content?

I have downloaded a web page and I want to extract all the links in that file. These links include both absolute and relative ones. For example, we have:
<script type="text/javascript" src="/assets/jquery-1.8.0.min.js"></script>
or
<a href="http://stackoverflow.com/" />
so after reading the file, what should I do?
This isn't that complicated to do, if you want to use the builtin regex system from Java. The hard bit is finding the right regex to match URLs[1][2]. For the sake of the answer, I'm gonna just assume you've done that, and stored that as a Pattern with syntax along the lines of this:
Pattern url = Pattern.compile("your regex here");
and some way of iterating through each line. What you'll want to do is define an ArrayList<String>:
ArrayList<String> urlsFound = new ArrayList<>();
From there, you'll have some loop to iterate through your file (assuming each line is a <? extends CharSequence> line), and inside you'll put this:
Matcher urlMatch = url.matcher(line);
while (urlMatch.find()) urlsFound.add(urlMatch.group());
What this does is create a Matcher for your line and the URL-matching Pattern from before. Then, it loops until #find() returns false (i.e., there are no more matches) and adds the match (with #group()) to the list, urlsFound.
At the end of your loop, urlsFound will contain all the matches for all of the URLs on the page. Note that this can get quite memory-intensive if you've got a lot of text, as urlsFound will get quite big, and you'll be creating and ditching a lot of Matchers.
1: I found a few good sites with a quick Google search; the cream of the crop seem to be here and here, as far as I can tell. Your needs may vary.
2: You'll need to make sure that the entire URL is captured with a single group, or this won't work at all. It can be tweaked to work if there are multiple parts, though.
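The loop described above can be sketched end to end as follows. The pattern here is only a placeholder assumption: it captures the value of a double-quoted href or src attribute in group 1 and is not a general URL regex. The class name LinkExtractor and the method extractLinks are hypothetical; the sample lines are the two from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkExtractor {
    // Placeholder pattern (an assumption): value of a double-quoted href or src attribute.
    private static final Pattern URL =
            Pattern.compile("(?:href|src)\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(Iterable<? extends CharSequence> lines) {
        List<String> urlsFound = new ArrayList<>();
        for (CharSequence line : lines) {
            Matcher urlMatch = URL.matcher(line);
            // find() advances to the next match until there are none left on this line.
            while (urlMatch.find()) {
                // Group 1 is the attribute value, without the surrounding quotes.
                urlsFound.add(urlMatch.group(1));
            }
        }
        return urlsFound;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "<script type=\"text/javascript\" src=\"/assets/jquery-1.8.0.min.js\"></script>",
                "<a href=\"http://stackoverflow.com/\" />");
        System.out.println(extractLinks(lines));
    }
}
```

Because the pattern captures the attribute value as written, relative links such as /assets/jquery-1.8.0.min.js come back unresolved; resolving them against the page's base URL would be a separate step.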

Regex href parsing

A regex question in Java.
I'm scraping Id numbers from the href attribute of <a> elements. I have a bunch of links like these in a string:
Whatever
After the 'pdf' and slash comes an Id number, which I'm interested in.
So I must get all Ids from multiple occurrences of this kind of URL in the string. What would be the best regex for it?
Thanks in advance.
If you know that the url will be exactly this, your regex can just be:
someplacelol\\.com/pdf/([0-9]+)/
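As a sketch of how that pattern could be applied to collect every Id from a string, assuming the links really have the someplacelol.com/pdf/<digits>/ shape: the class name PdfIdExtractor and the sample markup in main are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PdfIdExtractor {
    // Group 1 captures the run of digits between "pdf/" and the next slash.
    private static final Pattern PDF_ID =
            Pattern.compile("someplacelol\\.com/pdf/([0-9]+)/");

    static List<String> extractIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher m = PDF_ID.matcher(html);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical input; the real links were not shown in full.
        String html = "<a href=\"http://someplacelol.com/pdf/12345/\">Whatever</a> "
                + "<a href=\"http://someplacelol.com/pdf/678/\">Other</a>";
        System.out.println(extractIds(html));
    }
}
```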
I'm no regex artist but you should be able to get the url out of the element with:
\<a\s.*?href=(?:\"([\w\.:/?=&#%_\-]*)\"|([^\"][\w\.:/?=&#%_\-]*[^\"\>])).*?\>
The first group will contain the URL.
From there you should be able to extract the number without too much difficulty. I tested that regex on the source of this page and it was able to correctly identify all of the hrefs in all of the <a> elements.
Please don't comment and say it breaks for <a id="<<<>><><<>>href=" href="<a href="> because OP has provided in his description of the problem that ridiculous abuses of the HTML standard such as this one will not be present in his trial cases.
Also, if for some weird reason, an element has 2 hrefs, only the first will be grabbed. You could probably address that if you cared.
Edit: added whitespace requirement after <a so it won't match things like <asdffsdfsfg href="lol">.
