Extract image id with Jsoup

I am trying to extract a specific captcha image id using the Jsoup API. The HTML image tag looks like this:
<img id="wlspispHIPBimg03256465465dsd5456" style="display: inline; width: 200px; height: 100px;" aria-hidden="true" src="https://users/hip/data/rnd=435cb60d0a6b63ef4">
This is my code to obtain the attribute id="wlspispHIPBimg03256465465dsd5456":
doc = Jsoup.connect("http://go.microsoft.com/fwlink/?LinkID=614866&clcid")
.timeout(0).get();
Elements images = doc.select("img[src~=(?i)]");
for (Element image : images) {
System.out.println(image.attr("id"));
}
The problem is that I can't get the id of the captcha image.

You need to find something in the HTML that distinguishes this img tag from every other tag in the document. That can't be deduced from the code you posted, so I am using my imagination here:
Element imageEl = doc.select("img[src*=rnd]").first();
This exploits the fact that the source of the image contains "rnd" in its path. To get the best solution you must look at the page yourself. It also helps a lot to learn the CSS selectors that Jsoup supports.
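As a rough sketch of the same idea, the id can also be matched by its prefix with an attribute-prefix selector. This assumes that the prefix "wlspispHIPBimg" stays stable between page loads and that the img tag is actually present in the HTML Jsoup receives; neither is confirmed by the question. The class and method names are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class CaptchaIdSketch {

    // Returns the id of the first <img> whose id starts with idPrefix,
    // or null when the parsed HTML contains no such element.
    static String findImageId(String html, String idPrefix) {
        Document doc = Jsoup.parse(html);
        // [attr^=value] matches attributes whose value starts with "value".
        Element image = doc.select("img[id^=" + idPrefix + "]").first();
        return image == null ? null : image.attr("id");
    }

    public static void main(String[] args) {
        // The tag from the question, shortened; the prefix is an assumption.
        String html = "<img id=\"wlspispHIPBimg03256465465dsd5456\" aria-hidden=\"true\" "
                + "src=\"https://users/hip/data/rnd=435cb60d0a6b63ef4\">";
        System.out.println(findImageId(html, "wlspispHIPBimg"));
    }
}
```

If this returns null on the live page, the tag is probably not in the downloaded HTML at all, and no selector will help.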

I think you simply can't accomplish this using only Jsoup: the DOM is modified at runtime by JavaScript, and Jsoup does not execute JavaScript.
See also this other question.

Related

How to convert selenium webdriver xpath value into css

The current code has long absolute XPath values that need to be converted and shortened to CSS selectors:
driver.findElement(By.xpath("html/body/div[1]/div/div[2]/div/form/div[3]/div[2]/button")).click();

driver.findElement(By.xpath("html/body/div[1]/div/div[2]/div/form/div[3]/div[2]/button")).click();
You could probably have chosen a better XPath expression than the one above. Without seeing the actual HTML, it looks like you have written down the complete path from the root. Is it possible to make it shorter and more robust?
Consider the following example:
<html itemscope itemtype="http://schema.org/QAPage">
<body class = "question-page new-topbar">
<div id="notify-container"></div>
<div id="topbar-wrapper"></div>
<button id="button1"></button>
</body>
</html>
You want to click on button1. You can find it using the complete XPath:
driver.findElement(By.xpath("html/body/button")).click();
or you can find it using XPath together with one of the element's other attributes, in this case its id:
driver.findElement(By.xpath("//button[@id='button1']")).click();
or, as you wanted, you can use a CSS selector:
driver.findElement(By.cssSelector("button[id='button1']")).click();
If you want us to help you with converting your xpath into css selector, you will need to copy and paste your html code in your question as well. Without looking at the actual code, we can not be 100% sure.
You may find the following link useful when trying to convert xpath into css selector.
https://www.simple-talk.com/dotnet/.net-framework/xpath,-css,-dom-and-selenium-the-rosetta-stone/

How to get img url from html text

In Android development, how can I get an image URL or load an image from the HTML text shown below? I get it from HTML, and I want to get an image URL from this code:
<p style="text-align: justify;"><span style="font-size: 16px;"><img class="aligncenter wp-image-2699 size-large" src="http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-1024x258.jpg" alt="helal parti-1" width="750" height="189" srcset="http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-300x76.jpg 300w, http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-768x193.jpg 768w, http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-1024x258.jpg 1024w, http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-560x141.jpg 560w, http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1.jpg 1564w" sizes="(max-width: 750px) 100vw, 750px" /></span></p>
I want to get src="http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-1024x258.jpg"

You should probably use an HTML parser -- there are several Java HTML parsing libraries that can be found with a quick search.
A quick-and-dirty way, however, would be to search the input string for the src=" declaration, like so:
int index = input.indexOf("src=\"");
String substr = input.substring(index + 5);
int endIndex = substr.indexOf("\"");
String imgUrl = substr.substring(0, endIndex);
Disclaimer: I haven't tested this, so it may have errors. It also makes a lot of assumptions which may not be true -- which is why you should use a library for this sort of thing!
Edit: Fixed one error after testing (had to use a different machine than the one I'm typing this on). It should work for you now -- but again, you should really use a library.
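For what it's worth, here is a slightly more defensive sketch of the same string-based approach, using only java.util.regex. It is not a substitute for a real HTML parser: it assumes the src value is double-quoted and sits inside an img tag. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcSketch {

    // Matches an <img ...> tag and captures the double-quoted value of its src attribute.
    // Requiring whitespace directly before "src" keeps srcset and data-src from matching.
    private static final Pattern IMG_SRC =
            Pattern.compile("<img\\b[^>]*?\\ssrc\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the src value of every <img> tag found, in order of appearance.
    // The list is empty when the input contains no matching tag.
    static List<String> extractImgSrcs(String html) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = IMG_SRC.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Placeholder markup shaped like the HTML in the question.
        String html = "<p><img class=\"aligncenter\" src=\"http://example.com/a-1024x258.jpg\" "
                + "srcset=\"http://example.com/a-300x76.jpg 300w\" /></p>";
        System.out.println(extractImgSrcs(html));
    }
}
```

Unlike the indexOf version, this returns an empty list instead of throwing when there is no src attribute, and it handles more than one image.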

To create an image in HTML, just type:
<img src="www.example.com/images/this-one.png" alt="an image" />
You may nest the img tag inside a p or span tag (or both) as needed.
Just a note: code is much easier to read when you use external style sheets.

Iterate over HTML tags

I am developing some kind of RSS application: the app downloads the content provided by an RSS feed and shows it to the user.
The post's content has tags like p, img and h2. I want to iterate over them in order and create TextViews and ImageViews depending on the tag.
For example, I want to show this HTML code:
<body>
<h2>Some text</h2>
<img src="image1.jpg">
<p>A lot of text</p>
</body>
as
<TextView />
<ImageView />
<TextView />
I think Jsoup is an option, but I am not sure how to use it or if Android includes a native solution.
I also want to incorporate lazy loading for images. I've found the Ion library, but maybe there are simpler solutions for my use case.
EDIT:
As @Vogabe suggested, I am iterating over the tags using Jsoup. This is the code, in case someone finds it useful:
Document document = Jsoup.parse(htmlContent);
Elements elements = document.getAllElements();
for (Element element:elements) {
Tag tag = element.tag();
if (tag.getName().equalsIgnoreCase("p")) {
// ...
}
}

Jsoup is a good solution for parsing HTML pages and retrieving data from them. The select() method accepts a CSS selector and returns the HTML elements that match it.
These 2 links should get you started:
http://jsoup.org/cookbook/extracting-data/selector-syntax
http://jsoup.org/cookbook/extracting-data/dom-navigation
There are other parsers out there, but I do not have experience with them.
JSoup is widely adopted and very easy to use.
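To make that concrete, here is a minimal sketch of walking the top-level tags of the body in document order and deciding which kind of view each one would become. Plain strings stand in for the Android views, so TagWalkSketch and planViews are illustrative names, not part of Jsoup or Android. It also assumes the tags of interest are direct children of body; an img nested inside a p would need a recursive walk.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TagWalkSketch {

    // Visits the direct children of <body> in document order and records
    // which view each tag would map to. Unhandled tags are skipped.
    static List<String> planViews(String html) {
        List<String> plan = new ArrayList<>();
        Document document = Jsoup.parse(html);
        for (Element element : document.body().children()) {
            switch (element.tagName()) {
                case "h2":
                case "p":
                    // Text-bearing tags would become a TextView.
                    plan.add("TextView: " + element.text());
                    break;
                case "img":
                    // Images would become an ImageView loaded from the src URL.
                    plan.add("ImageView: " + element.attr("src"));
                    break;
                default:
                    break;
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        String html = "<body><h2>Some text</h2><img src=\"image1.jpg\"><p>A lot of text</p></body>";
        planViews(html).forEach(System.out::println);
    }
}
```

In the real app, each branch would create the view and add it to the layout instead of appending a string.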

How do I get this text using Jsoup?

How do I get "this text" from the following HTML code using Jsoup?
<h2 class="link title"><a href="myhref.html">this text<img width=10
height=10 src="img.jpg" /><span class="blah">
<span>Other texts</span><span class="sometime">00:00</span></span>
</a></h2>
When I try
String s = document.select("h2.title").select("a[href]").first().text();
it returns
this textOther texts00:00
I tried to read the API documentation for Selector in Jsoup but could not figure out much.
Also, how do I get an element with class="link title blah" (multiple classes)? Forgive me, I know only a little about both Jsoup and CSS.

Use Element#ownText() instead of Element#text().
String s = document.select("h2.link.title a[href]").first().ownText();
Note that you can select elements with multiple classes by concatenating the class selectors, as in h2.link.title, which selects <h2> elements that have at least both the link and title classes.

iText - the PDF is not good

I have the following HTML:
<div align='center' style='height:50px'>
<H1>A Simple Sample Web Page</H1>
<IMG SRC='http://sheldonbrown.com/images/scb_eagle_contact.jpeg'>
<H4>By Sheldon Brown</H4>
<H2>Demonstrating a few HTML features</H2>
</div>
HTML is really a very simple language. '
<P>
'command, which will insert a blank line.If you would like to make a link or
bookmark to this page, the URL is:
<BR>
http://sheldonbrown.com/web_sample1.html
</center>
But the image appears behind the text instead of below it!
What's wrong?
If iText cannot handle it, which library would be better?
This is my code:
// step 1
Document document = new Document();
// step 2
PdfWriter.getInstance(document, new FileOutputStream("C:\\hello-world.pdf"));
document.open();
String content = "<div align='center' style='height:50px'><H1>A Simple Sample Web Page</H1><IMG SRC='http://sheldonbrown.com/images/scb_eagle_contact.jpeg'><H4>By Sheldon Brown</H4><H2>Demonstrating a few HTML features</H2></div>HTML is really a very simple language. '<P>' command, which will insert a blank line.If you would like to make a link or bookmark to this page, the URL is:<BR> http://sheldonbrown.com/web_sample1.html</center>";
// use the snippet for the PDF document
List<Element> objects = HTMLWorker.parseToList(new StringReader(content), null);
for (Element element : objects)
document.add(element);
document.close();

Do you have any CSS applied to this HTML? Have you managed to view this HTML some other way, for example in a browser (and if so, which one)? It renders the way you describe here: http://jsfiddle.net/TjUSJ/.
Maybe you want to remove the height style property from that <div>? It looks as if the image renders in the middle, but it is actually rendering 50px from the top. See this other fiddle, without the height style: http://jsfiddle.net/TjUSJ/1/
Also, remember that the <center> tag is deprecated.

The problem was that I was using an old version.
I switched to the latest one, 5.1.2, and it works!
