Get attribute values from all elements - java

Code:
Document doc = Jsoup.connect("things.com").get();
Elements jpgs = doc.select("img[src$=.jpg]");
String links = jpgs.attr("src");
System.out.print("all: " + jpgs);
System.out.print("src: " + links);
Output:
all:
<img alt="Apple" src="apple.jpg">
<img alt="Cat" src="cat.jpg">
<img alt="Boat" src="boat.jpg">
src: apple.jpg
Jsoup gave the attribute value for first element. How can I get the others (cat.jpg and boat.jpg)?
Thank you.

You loop through jpgs and read the attribute from each element via Element#attr, since the documentation for Elements#attr (note the s) says:
Get an attribute value from the first matched element that has the attribute.
(The emphasis on "first" is mine.)
So for instance:
for (Element e : jpgs) {
// use e.attr("src") here
}
Using Java 8 streams, you can also collect them into a List<String>. Elements extends ArrayList<Element>, so stream() is available directly and takes no type argument:
List<String> links = jpgs.stream()
.map(element -> element.attr("src"))
.collect(Collectors.toList());
This needs java.util.List and java.util.stream.Collectors imported.
The boring old-fashioned way is:
List<String> srcs = new ArrayList<String>(jpgs.size());
for (Element e : jpgs) {
srcs.add(e.attr("src"));
}
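Newer jsoup releases also offer Elements#eachAttr, which collects the attribute from every matched element that has it. A minimal sketch, assuming jsoup 1.10.2 or later is on the classpath; the class name, method name and sample HTML are made up for illustration:

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class AllSrcValues {
    // Returns the src of every <img> whose src ends in .jpg.
    static List<String> jpgSources(String html) {
        Document doc = Jsoup.parse(html);
        // eachAttr skips matched elements that lack the attribute.
        return doc.select("img[src$=.jpg]").eachAttr("src");
    }

    public static void main(String[] args) {
        // Hypothetical markup mirroring the output shown in the question.
        String html = "<img alt=\"Apple\" src=\"apple.jpg\">"
                + "<img alt=\"Cat\" src=\"cat.jpg\">"
                + "<img alt=\"Boat\" src=\"boat.jpg\">";
        System.out.println(jpgSources(html));
    }
}
```

With the three images from the question this prints [apple.jpg, cat.jpg, boat.jpg].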

Elements#attr will only return the first match.
Elements#attr Source Code
public String attr(String attributeKey) {
for (Element element : this) {
if (element.hasAttr(attributeKey))
return element.attr(attributeKey);
}
return "";
}
Solution
To obtain the result you want, loop over your Elements and read the attribute from each one:
for (Element e : jpgs) {
System.out.println(e.attr("src"));
}

Related

Java Code Optimization(jsoup)

Is there an efficient way to optimize this code? Most of it looks identical. I just started learning jsoup and don't really know how to do that.
Document doc = Jsoup.connect("http://www.blocket.se/hela_sverige/bilar?ca=11&cg=1020&w=3&md=th").get();
Elements partOne = doc.select("a[title=Flera bilder]");
for (Element element : partOne) {
String myElementOne = element.attr("abs:href");
System.out.println(myElementOne);
}
Elements partTwo = doc.select("a[title=\"\"]");
for (Element element : partTwo) {
String myElementTwo = element.attr("abs:href");
System.out.println(myElementTwo);
}
Elements partThree = doc.select("a[title=Bild]");
for (Element element : partThree) {
String myElementThree = element.attr("abs:href");
System.out.println(myElementThree);
}
The partOne, partTwo and partThree blocks are basically identical; just replace all of the parameter differences with variables and extract to a method:
void someMethodName(Document doc, String selector) {
Elements partOne = doc.select(selector);
for (Element element : partOne) {
String myElementOne = element.attr("abs:href");
System.out.println(myElementOne);
}
}
Example invocation:
someMethodName(doc, "a[title=Flera bilder]");
Alternatively, if you have access to Guava:
Iterable<Element> it = Iterables.concat(
doc.select("a[title=Flera bilder]"),
doc.select("a[title=\"\"]"),
doc.select("a[title=Bild]"));
for (Element element : it) {
String myElement = element.attr("abs:href");
System.out.println(myElement);
}
Andy's solution of course does the job. However, since you asked specifically for ways of optimizing the jsoup calls, I would suggest learning more about CSS selectors and regular expressions. For example, this will do fine in your case:
Elements allParts = doc.select("a[title~=^Flera bilder$|^$|^Bild$]");
for (Element element : allParts) {
String elStr = element.attr("abs:href");
System.out.println(elStr);
}
Here, I use the ~= operator, which matches an attribute's value against a regular expression. It lets me combine all three of your select statements into one.
An alternative is to use the , operator to join all three selectors into one:
Elements allParts2 = doc.select("a[title=Flera bilder],a[title=\"\"],a[title=Bild]");

how to extract email id using jsoup?

Elements elements = doc.select("span.st");
for (Element e : elements) {
out.println("<p>Text : " + e.text()+"</p>");
}
Each element e contains text with an email ID in it. How can I extract the email ID from it? I have seen that the jsoup API docs provide :matches(regex), but I didn't understand how to use it. I'm trying to use
^[a-zA-Z0-9_!#$%&'*+/=?`{|}~^.-]+@[a-zA-Z0-9.-]+$
which I found while googling.
Thanks in advance for your help.
:matches(regex) is useful if you want to find elements whose text matches a given regex (e.g. find all nodes that contain an email address).
I think this is not what you want. Instead, you need to extract the email from e.text() using a regex. In your case:
Elements elements = doc.select("span.st");
for (Element e : elements) {
out.println("<p>Text : " + e.text()+"</p>");
out.println(extractEmail(e.text()));
}
// ...
public static String extractEmail(String str) {
Matcher m = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9.-]+").matcher(str);
// Return the first match, or null if the text contains no email address.
if (m.find()) {
return m.group();
}
return null;
}
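If a span may hold more than one address, the same Matcher loop can collect all of them. A sketch using only java.util.regex; the class and method names are hypothetical, and the pattern is a deliberately loose approximation rather than a full email validator:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtractor {
    // Loose pattern: local part, '@', a domain label, a dot, and the rest of the domain.
    private static final Pattern EMAIL =
            Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9.-]+");

    // Returns every substring of text that matches the pattern, in order.
    static List<String> extractEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // In the real program the argument would be e.text().
        System.out.println(extractEmails("Contact alice@example.com or bob@example.org today"));
    }
}
```

Because the last character class allows dots, an address followed directly by a full stop would keep that trailing dot, so the result may need trimming.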

Retrieving Reviews from Amazon using JSoup

I'm using JSoup to retrieve reviews from a particular Amazon webpage, and what I have now is this:
Document doc = Jsoup.connect("http://www.amazon.com/Presto-06006-Kitchen-Electric-Multi-Cooker/product-reviews/B002JM202I/ref=sr_1_2_cm_cr_acr_txt?ie=UTF8&showViewpoints=1").get();
String title = doc.title();
Element reviews = doc.getElementById("productReviews");
System.out.println(reviews);
This gives me the block of html which has the reviews but I want only the text without all the tags div etc. I want to then write all this information into a file. How can I do this? Thanks!
Use the text() method:
System.out.println(reviews.text());
While text() will get you a bunch of text, you'll want to first use jsoup's select(...) methods to subdivide the problem into individual review elements. I'll give you the first big division, but it will be up to you to subdivide it further:
public static List<Element> getReviewList(Element reviews) {
List<Element> revList = new ArrayList<Element>();
Elements eles = reviews.select("div[style=margin-left:0.5em;]");
for (Element element : eles) {
revList.add(element);
}
return revList;
}
If you analyze each element, you should see how Amazon further subdivides the information it holds, including the title of the review, the date of the review and the body text.
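For the file-writing part of the question, one option is to collect each review's text() into a list and hand it to java.nio.file.Files. A sketch under the assumption that the texts have already been extracted; the class name, method name, sample strings and temp-file target are hypothetical:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ReviewWriter {
    // Writes one review per line, UTF-8 encoded, replacing any existing content.
    static void writeReviews(List<String> reviewTexts, Path target) throws IOException {
        Files.write(target, reviewTexts, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // In the real program these strings would come from element.text().
        List<String> texts = List.of("Great cooker.", "Stopped working after a month.");
        Path out = Files.createTempFile("reviews", ".txt");
        writeReviews(texts, out);
        System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8));
    }
}
```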

How to Iterate Through Multiple Maps

So essentially, I have two hashmaps, one containing the following values:
rId33=image23
rId32=image22
rId37=image2
And the other containing this data:
{image2.jpeg=C:\Documents and Settings\image2.jpeg, image22.jpeg=C:\Documents and Settings\image22.jpeg, image23.jpeg=C:\Documents and Settings\image23.jpeg}
I basically want to iterate through the first map and look for a matching key. If a match is found, I want to get the associated value, look for a matching key in the second map, and pull out its associated value (the file path).
I was thinking of doing something like this, for example (the following is simplified)...
String val2 = "rId33";
for (String rID: map.keySet())
{
if (rID.contains(val2))
{
//enter code here
}
}
I was looking at the methods available for something like .getValue or something, but I'm not entirely sure how to do that. Any help would be appreciated. Thanks in advance for any replies.
Edited Code with Help From Bozho
else if ("v:imagedata".equals(qName) && headingCount > 0)
{
val2 = attributes.getValue("r:id");
String rID = imageMap.get(val2);
String path = imageLocation.get(rID + ".jpeg");
for (String rels: imageMap.keySet())
{
if (rels.contains(val2))
{
inImage = true;
image docImage = new image();
imageCount++;
docImage.setRelID(val2);
docImage.setPath(path);
addImage(docImage);
}
}
From what I see you don't need to iterate. Just:
String value1 = map1.get(key1);
if (value1 != null) {
String path = map2.get(value1 + ".jpeg");
}
If you don't always know whether it's value1 + ".jpeg", but you just know that the key starts with the first value, then you can iterate the 2nd map with:
for (Map.Entry<String, String> entry : map2.entrySet()) {
String key2 = entry.getKey();
String value2 = entry.getValue();
if (key2.startsWith(value1)) {
return value2;
}
}
But note that the first code snippet is O(1) (both lookups take constant time on average), while the second is O(n).
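Put together with the sample data from the question, the direct lookup might look like the sketch below. The map contents mirror the question; the class name, the method name and the assumption that every file key is the first map's value plus ".jpeg" are mine:

```java
import java.util.HashMap;
import java.util.Map;

public class ImagePathLookup {
    // Chains two hash lookups: relationship id -> image name -> file path.
    static String findPath(Map<String, String> relToImage,
                           Map<String, String> imageToPath,
                           String relId) {
        String imageName = relToImage.get(relId);
        if (imageName == null) {
            return null; // unknown relationship id
        }
        return imageToPath.get(imageName + ".jpeg");
    }

    public static void main(String[] args) {
        Map<String, String> relToImage = new HashMap<>();
        relToImage.put("rId33", "image23");
        relToImage.put("rId32", "image22");
        relToImage.put("rId37", "image2");

        Map<String, String> imageToPath = new HashMap<>();
        imageToPath.put("image2.jpeg", "C:\\Documents and Settings\\image2.jpeg");
        imageToPath.put("image22.jpeg", "C:\\Documents and Settings\\image22.jpeg");
        imageToPath.put("image23.jpeg", "C:\\Documents and Settings\\image23.jpeg");

        System.out.println(findPath(relToImage, imageToPath, "rId33"));
    }
}
```

A prefix match would be ambiguous for this particular data, because image2 is also a prefix of image22.jpeg and image23.jpeg, which is another reason to prefer the exact-key lookup.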
And to answer the question as it is formulated in the title:
Get the iterators of both maps, and call it1.next() and it2.next() within a while loop. If either map has no more elements (!it.hasNext()), break.
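The parallel walk over two iterators could be sketched as follows. It pairs entries purely by iteration order, so it only makes sense for maps with a predictable order such as LinkedHashMap or TreeMap; the class name, method name and sample entries are hypothetical:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParallelMapWalk {
    // Pairs up the values of both maps position by position and stops
    // as soon as either map runs out of entries.
    static List<String> zipValues(Map<String, String> first, Map<String, String> second) {
        List<String> pairs = new ArrayList<>();
        Iterator<Map.Entry<String, String>> it1 = first.entrySet().iterator();
        Iterator<Map.Entry<String, String>> it2 = second.entrySet().iterator();
        while (it1.hasNext() && it2.hasNext()) {
            pairs.add(it1.next().getValue() + " -> " + it2.next().getValue());
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("rId33", "image23");
        first.put("rId32", "image22");

        Map<String, String> second = new LinkedHashMap<>();
        second.put("image23.jpeg", "C:\\Documents and Settings\\image23.jpeg");

        // The second map has one entry, so only one pair is produced.
        System.out.println(zipValues(first, second));
    }
}
```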
That seems very inefficient. The entire point of a hash map is to do fast lookups. Do you really need to use that contains call on rID? In other words, can you change your hash map so that it directly contains the verbatim strings you want to search for and not just strings that contain the strings you want to search for as substrings? If yes, you could then use the answer given already. If not and if you must work with these data structures for whatever reason, the way to do what you're trying to do is something like:
String val2 = "rId33";
String path = null; // must be initialized, otherwise the null check below will not compile
for (String rID: map.keySet())
{
if (rID.contains(val2))
{
path = secondMap.get(map.get(rID)+".jpeg");
break;
}
}
if (path == null)
{
//value not found
}

JSoup - Select all comments

I want to select all comments from a document using JSoup. I would like to do something like this:
for(Element e : doc.select("comment")) {
System.out.println(e);
}
I have tried this:
for (Element e : doc.getAllElements()) {
if (e instanceof Comment) {
}
}
But the following error occurs in eclipse "Incompatible conditional operand types Element and Comment".
Cheers,
Pete
Since Comment extends Node you need to apply instanceof to the node objects, not the elements, like this:
for(Element e : doc.getAllElements()){
for(Node n: e.childNodes()){
if(n instanceof Comment){
System.out.println(n);
}
}
}
In Kotlin, you can use jsoup to get every Comment of the whole Document or of a specific Element with:
fun Element.getAllComments(): List<Comment> {
return this.allElements.flatMap { element ->
element.childNodes().filterIsInstance<Comment>()
}
}
