I have this HTML block:
<ul class="list_attachments"><li>
<a href="www.site1.com"><img src='pdf.png' alt='pdf'/> File1</a></li><li>
<a href="www.site2.com"><img src='pdf.png' alt='pdf'/> File2</a></li>
</ul>
I would like to extract every "a href" entry, in particular the link URL and the file name.
So I tried this:
String[] fileName = new String[2];
String[] url = new String[2];
int i = 0;
Elements attachments = document.select(".list_attachments");
for (Element attachment : attachments) {
fileName[i] = attachment.text();
url[i] = attachment.select("a").attr("href");
i++;
}
But the result is:
fileName[0] = "File1 File2";
url[0] = "www.site1.com";
The problem is that there is only one attachment element instead of two as I expected.
How to solve this? Thanks.
.select(".list_attachments") can select only <ul class="list_attachments"> so it returns Elements with only one <ul> element. I suspect that you wanted to select all <a ...> elements which exist inside .list_attachments and then take their text and href. In that case your code should look more like
Elements anchors = document.select(".list_attachments a");
for (Element anchor : anchors) {
fileName[i] = anchor.text();
url[i] = anchor.attr("href");
i++;
}
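If you do not want to size the arrays up front, here is a minimal sketch of the same idea using lists (the selector comes from the answer above; document is the parsed Document from the question):
List<String> fileNames = new ArrayList<>();
List<String> urls = new ArrayList<>();
// every <a> element inside the element with class "list_attachments"
for (Element anchor : document.select(".list_attachments a")) {
    fileNames.add(anchor.text());      // e.g. "File1"
    urls.add(anchor.attr("href"));     // e.g. "www.site1.com"
}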
With the following code I am able to get the desired text from the website, but I am unable to get the associated link for that text. I have tried several combinations of methods; at most, what I get is the entire outer HTML, as shown below:
<li class="list-item">
<h4><a class="bold" href="abacavir.htm">Abacavir </a> </h4>
Abacavir is an antiviral drug that is effective against the HIV-1 virus.</li>
Here is the code:
public static void main(String[] args) throws Exception {
Map<String,String> drugLinks = new LinkedHashMap<String,String>();
final int OK = 200;
//String currentURL;
//int page = 1;
int status = OK;
Connection.Response response = null;
Document doc = null;
String[] keywords = {"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"};
//String keyword = "a";
for (String keyword : keywords){
final String url = "https://www.medindia.net/doctors/drug_information/home.asp?alpha=" + keyword;
response = Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.execute();
status = response.statusCode();
doc = response.parse();
Element tds = doc.select("div.related-links.top-gray.col-list.clear-fix").first();
Elements links = tds.select("li[class=list-item]");
for (Element link : links){
System.out.println("generic::"+link.select("a[href]").text());
System.out.println("link::"+link.attr("abs:a"));
}
}
}
Output
generic::Abacavir
link::
generic::Abacavir Sulfate and Lamivudine
link::
generic::Abacavir Sulfate, Lamivudine and Zidovudine
link::
generic::Abaloparatide
link::
generic::Abarelix
link::
How do I get the absolute links from the given HTML?
To get the link from the element, you can use:
link.select("a").attr("href")
However, that will only give you the relative link.
The full link will be:
"https://www.medindia.net/doctors/drug_information/" + link.select("a").attr("href")
Here is the HTML snippet:
<label class="abc">
<span class="bcd">Text1</span>
Text2
</label>
How do I extract just Text2 using a Selenium script? I know how to extract Text2 by getting the innerHTML of the "abc" class and then removing the innerHTML of the "bcd" class, but I am looking for a better way to do this.
Try this:
WebElement element = driver.findElement(By.xpath("//label[contains(text(),'Text2')]"));
String test = element.getText();
xpath:
//label[@class='abc']/child::text()
For Python:
returnText = driver.execute_script("return document.evaluate(\"//label[@class='abc']/child::text()\", document, null, XPathResult.STRING_TYPE, null).stringValue;")
Sometimes there are extra whitespace text nodes and the result is evaluated as a single string, so you may want to collect the matching text nodes into an array instead:
returnText = []
returnText = self.driver.execute_script("var iterator = document.evaluate(\"//label[@class='abc']/child::text()\", document, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null); var arrayXpath = new Array(); var thisNode = iterator.iterateNext(); while (thisNode) {arrayXpath.push(thisNode.textContent); thisNode = iterator.iterateNext(); } return arrayXpath;")
for item in returnText:
    print(item)
Try this:
String text2 = driver.findElement(By.xpath("//label[@class='abc']")).getText();
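For a Java Selenium script, here is a rough sketch of the same text-node idea (the label locator comes from the question's markup); JavascriptExecutor collects only the label's own text nodes, so the child span's text is never included:
WebElement label = driver.findElement(By.xpath("//label[@class='abc']"));
String text2 = (String) ((JavascriptExecutor) driver).executeScript(
    "var result = '';" +
    "for (var i = 0; i < arguments[0].childNodes.length; i++) {" +
    "  var node = arguments[0].childNodes[i];" +
    "  if (node.nodeType === Node.TEXT_NODE) { result += node.textContent; }" +
    "}" +
    "return result.trim();", label);
// text2 should be "Text2"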
I will start from the beginning. There is HTML with a pattern like this:
<div id="post_message_(some numeric id)">
<div style="some style things">
<div class="smallfont" style="some style">useless text</div>
<table cellpading="6" cellspaceing=.......> a lot of text inside i dont need</table>
</div>
Text i need
</div>
Those divs with styles and that table are optional; sometimes there is just:
<div id="post">
Text i need
</div>
And I want to parse that text into a String. Here's the code I'm using:
Elements divsInside = element.getElementById("post_message_" + id).getElementsByTag("div");
for(Element div : divsInside) {
if(div != null && div.attr("style").equals("margin:20px; margin-top:5px; ")) {
System.out.println(div.html());
div.remove();
System.out.println("div removed");
}
}
I added those print lines to check whether it finds them, and yes, it finds the correct ones. But later, when I parse it to a String:
String message = Jsoup.parse(divsInside.html().replaceAll("(?i)<br[^>]*>", "br2n")).text()
.replaceAll("br2n", "\n");
the String contains all that removed content again for some reason.
I tried removing the elements with iterators, and with a plain for loop removing them by index, but the result is the same.
So you want to get "Text i need". Use Element's ownText() method, which gets the text owned by this element only and does not include the combined text of all its children.
private static void test(String htmlFile) {
File input = null;
Document doc = null;
Element specificIdDiv = null;
try {
input = new File(htmlFile);
doc = Jsoup.parse(input, "ASCII", "");
doc.outputSettings().charset("ASCII");
doc.outputSettings().escapeMode(EscapeMode.base);
/** Get Element id = post_message_1 **/
specificIdDiv = doc.getElementById("post_message_1");
if (specificIdDiv != null ) {
System.out.println("content: " + specificIdDiv.ownText());
}
} catch (Exception e) {
e.printStackTrace();
}
}
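Assuming the file contains the first HTML pattern above (with id post_message_1), this should print:
content: Text i need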
I extracted data from an HTML page and then parsed the anchor tags, which look like the ones below. I tried different approaches, such as extracting substrings, to pull out only the title and href values, but it's not working. Can anyone help me? This is a small snippet of my output.
My code:
doc = Jsoup.connect("myurl").get();
Elements link = doc.select("a[href]");
String stringLink = null;
for (int i = 0; i < link.size(); i++)
{
stringLink = link.get(i).toString();
System.out.println(stringLink);
}
Output
<a class="link" title="Waf Ad" href="https://www.facebook.com/waf.ad.54" data-jsid="anchor" target="_blank"><img class="_s0 _rw img" src="https://fbcdn-profile-a.akamaihd.net/hprofile-ak-ash1/t5/186729_100007938933785_508764241_q.jpg" alt="Waf Ad" data-jsid="img" /></a>
<a class="link" title="Ana Ga" href="https://www.facebook.com/ata.ga.31392410" data-jsid="anchor" target="_blank"><img class="_s0 _rw img" src="https://fbcdn-profile-a.akamaihd.net/hprofile-ak-ash1/t5/186901_100002334679352_162381693_q.jpg" alt="Ana Ga" data-jsid="img" /></a>
You can use the attr() method of the Element class to extract the value of an attribute.
For example:
String href = link.attr("href");
String title = link.attr("title");
See this page for more: Extract attributes, text, and HTML from elements
To get the page title, you can use
Document doc = Jsoup.connect("myurl").get();
String title = doc.title();
For getting the individual links from the different hrefs, you can use this
Elements links = doc.select("a[href]");
for(Element ele : links) {
System.out.println(ele.attr("href").toString());
}
The attr() method returns the value of the specified attribute on the matched element.
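Putting that together with the question's code, here is a minimal sketch (the "myurl" placeholder is from the question):
Document doc = Jsoup.connect("myurl").get();
for (Element link : doc.select("a[href]")) {
    String title = link.attr("title");   // e.g. "Waf Ad"
    String href = link.attr("href");     // e.g. "https://www.facebook.com/waf.ad.54"
    System.out.println(title + " -> " + href);
}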
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Solution {
public static void main(String[] args){
Scanner scan = new Scanner(System.in);
int testCases = Integer.parseInt(scan.nextLine());
while (testCases-- > 0) {
String line = scan.nextLine();
boolean matchFound = false;
Pattern r = Pattern.compile("<(.+)>([^<]+)</\\1>");
Matcher m = r.matcher(line);
while (m.find()) {
System.out.println(m.group(2));
matchFound = true;
}
if ( ! matchFound) {
System.out.println("None");
}
}
}
}
REGULAR EXPRESSION EXPLANATION: in <(.+)>([^<]+)</\1>, <(.+)> matches an opening tag and captures its name in group 1, ([^<]+) captures the text between the tags (it must not contain another <, so empty elements produce no match and for nested tags only the innermost pair matches), and </\1> requires a closing tag with the same name via the backreference to group 1. m.group(2) is therefore the text content that gets printed; for example, the line <h1>Hello World</h1> prints Hello World.
I have to parse an HTML page and extract the value of the name attribute in the HTML below, which is referenced by a JavaScript function. How do I do that using Jsoup?
<input type="hidden" name="fields.DEPTID.value"/>
JS:
departmentId.onChange = function(value) {
var departmentId = dijit.byId("departmentId");
if (value == null || value == "") {
document.transferForm.elements["fields.DEPTID.value"].value = "";
document.transferForm.elements["fields.DEPTID_DESC.value"].value = "";
} else {
document.transferForm.elements["fields.DEPTID.value"].value = value;
document.transferForm.elements["fields.DEPTID_DESC.value"].value = departmentId.getDisplayedValue();
var locationID = departmentId.store.getValue(departmentId.item, "loctID");
var locationDesc = departmentId.store.getValue(departmentId.item, "loct");
locationComboBox = dijit.byId("locationId");
if (locationComboBox != null) {
if (locationID != "") {
setLocationComboBox(locationID, locationDesc);
} else {
setLocationComboBox("AMFL", "AMFL - AMY FLORIDA");
}
}
}
};
I'll try to teach you from the top:
//Connect to the url, and get its source html
Document doc = Jsoup.connect("url").get();
//Get ALL the elements in the page that meet the query
//you passed as parameter.
//I'm querying for all the script tags that have the
//name attribute inside it
Elements elems = doc.select("script[name]");
//That Elements variable is a collection of
//Element. So now, you'll loop through it, and
//get all the stuff you're looking for
for (Element elem : elems) {
String name = elem.attr("name");
//Now you have the name attribute
//Use it to whatever you need.
}
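If what you actually need is the name attribute of the hidden input shown in the question, the same attr() pattern applies; here is a minimal sketch (html stands for the page source containing that input):
Document doc = Jsoup.parse(html);
Element input = doc.select("input[type=hidden][name]").first();
if (input != null) {
    String name = input.attr("name");   // "fields.DEPTID.value" for the input shown above
    System.out.println(name);
}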
Now, if you want some help with Jsoup queries to get any other elements you might need, here is the API documentation: Jsoup selector API
Hope that helped =)