Why does JSoup create empty files named after the link? - java

I'm trying to get all the images of a website, but why does Jsoup not only get the images of the page but also create a document named after the part of the link that follows the slash?
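Elements imageElements = document.select("img[src$=.png], img[src$=.jpg], img[src$=.jpeg]");
for (Element imageElement : imageElements) {
    // Resolve the src attribute to an absolute URL
    String strImageURL = imageElement.attr("abs:src");
    // ... (the rest of the loop is not shown in the question)
}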

I found a way to fix it; it is probably a prnt.sc issue.
Instead of selecting everything with the "img" tag like this:
Elements img = doc.getElementsByTag("img");
I selected only the images whose src ends in .png, .jpg, or .jpeg. Here is the code:
Elements imageElements = document.select("img[src$=.png], img[src$=.jpg], img[src$=.jpeg]");
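The same filter can also be applied in plain Java before a file is written. The download code is not shown in the question, so this is only a sketch under the assumption that each file is named after the text following the last slash of the URL; the class and method names (`ImageUrlFilter`, `fileNameOf`, `looksLikeImage`) and the example.com URLs are hypothetical.

```java
import java.util.Locale;

final class ImageUrlFilter {

    // Returns the part of the URL after the last slash, without query or fragment.
    static String fileNameOf(String url) {
        String path = url;
        int cut = path.indexOf('?');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        cut = path.indexOf('#');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // True only if the file name ends in one of the extensions used in the selector.
    static boolean looksLikeImage(String url) {
        String name = fileNameOf(url).toLowerCase(Locale.ROOT);
        return name.endsWith(".png") || name.endsWith(".jpg") || name.endsWith(".jpeg");
    }

    public static void main(String[] args) {
        // Hypothetical URLs, for illustration only.
        System.out.println(looksLikeImage("https://example.com/img/shot.png"));
        System.out.println(looksLikeImage("https://example.com/gallery"));
    }
}
```

Skipping every URL for which `looksLikeImage` returns false would avoid writing a file for a src that has no image extension.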

Related

Use Jsoup to get every files contain in the web page

Right now I'm trying to implement a web crawler that lists every file (images, etc.) and file extension (.jpg, .png, etc.) contained within a website using Jsoup, and I don't know how to extract the file elements from the URL.
So far I know how to get all the text contained in the page at the URL by doing something like this:
val doc = Jsoup.connect(link).get
val body: Element = doc.body()
val allText: String = body.text

Extracting image src in HTML with jsoup

I tried to get the .png file's link with this code.
e = rawData.select("img[class=competitive-rank]");
for (Element el : e) {
    playerRankIconURL = el.attr("src");
    println(playerRankIconURL);
}
But it doesn't seem to work properly. What am I doing wrong?
Your selector is looking for an img with the class competitive-rank, but there isn't one. It's the div which has that class.
You probably instead want to select an img which is contained by a div with that class, which you could do with the selector div.competitive-rank img.

Control the origin of images in an SWT browser

I'm programming a desktop application using SWT and I use the browser in parts of the interface because of the flexibility.
I easily can introduce external images. An image in the file system:
<img src="/home/user/image.jpg" />
Or an image on the web:
<img src="http://some.cl/image.jpg" />
Can I obtain the images from a stream? Somewhere in my code I want to write something like this:
OutputStream getExternalResource(String resourcePath)
I want to arbitrarily control the origin of the request.
I don't know of a direct way to do this. All I can think of is using JavaScript to set the image data as a base64 string in the src of the image, using org.eclipse.swt.browser.Browser.execute(String) or maybe org.eclipse.swt.browser.BrowserFunction.
The images should have an id which can be used in javascript:
<img id="image1" />
Edit: on the other hand, maybe it's easier to parse the HTML beforehand and set the image base64 string there.
Depending on how you get the HTML, you could do one of the following:
If you create the HTML yourself, just use <img src="data:image/png;base64....: convert the image to base64 and put it in the src attribute.
If you read the HTML from an external source, you could use Jsoup to parse the HTML and replace each image's src attribute with a base64 string. Afterwards, use Browser.setText(String) to set the HTML of the browser. Be aware that in that case relative paths (in links or images) don't work.
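import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "html"; // placeholder for the HTML read from the external source
Document doc = Jsoup.parse(html);
Elements img = doc.getElementsByTag("img");
for (Element element : img) {
    String src = element.attr("src");
    // READ image using the existing src, convert to base64 (using java.util.Base64)
    String base64 = "";
    // Replace the original src with an inline data URI
    element.attr("src", "data:image/png;base64," + base64);
}
String newHtml = doc.html();
browser.setText(newHtml);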
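The step left as a comment in the loop above (converting the image to base64) could look roughly like the following sketch. It uses only java.util.Base64 from the JDK; the class and method names (`DataUriSketch`, `toDataUri`) are hypothetical, and how the image bytes are obtained (for example with java.nio.file.Files.readAllBytes for a local file) is left to the caller.

```java
import java.util.Base64;

final class DataUriSketch {

    // Builds a data URI such as "data:image/png;base64,...." from raw image bytes.
    static String toDataUri(byte[] imageBytes, String mimeType) {
        String base64 = Base64.getEncoder().encodeToString(imageBytes);
        return "data:" + mimeType + ";base64," + base64;
    }

    public static void main(String[] args) {
        // Three placeholder bytes stand in for real image data.
        byte[] placeholder = {1, 2, 3};
        System.out.println(toDataUri(placeholder, "image/png"));
    }
}
```

The returned string can be passed directly as the new value of the src attribute.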
If you have control over the HTML page, i.e. it is generated by your code, possibly from a template, then you can embed the image.
The bytes of the image need to be base64 encoded and appended to the src attribute of the image tag, as described here: http://www.techerator.com/2011/12/how-to-embed-images-directly-into-your-html/

JSOUP gives weird output with URL

I am using Jsoup to parse an HTML page and extract all text from it. The code below works fine with other URLs but gives weird output with this URL: http://gumgum-public.s3.amazonaws.com/numbers.html
Document doc = null;
doc = Jsoup.connect("http://gumgum-public.s3.amazonaws.com/numbers.html").maxBodySize(0).get();
String parsedText = doc.body().text();
System.out.println("Output-"+parsedText);
Output-
Output-This is a test page
Output-This is a test page
The HTML page contains a large set of numbers. Please help.
Thanks
Then your solution is the following:
Download the page.
Slice it into smaller parts.
Add a tag before and after each part.
Send each part to Jsoup.
Get the content of each part.
Concatenate the parts.
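The slicing and wrapping steps above could be sketched in plain Java as follows. This is only a sketch under assumptions: the names `ChunkSketch`, `slice`, and `wrap`, the `<p>` wrapper tag, and the chunk size are hypothetical, the input is assumed to be plain text without markup characters, and the download and the Jsoup call are left out.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkSketch {

    // Splits text into parts of at most maxLen characters, preferring to cut after a space.
    static List<String> slice(String text, int maxLen) {
        if (maxLen <= 0) {
            throw new IllegalArgumentException("maxLen must be positive");
        }
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxLen, text.length());
            if (end < text.length()) {
                // Look for the last space inside the window so numbers are not cut in half.
                int space = text.lastIndexOf(' ', end - 1);
                if (space > start) {
                    end = space + 1;
                }
            }
            parts.add(text.substring(start, end));
            start = end;
        }
        return parts;
    }

    // Adds a tag before and after one part.
    static String wrap(String part) {
        return "<p>" + part + "</p>";
    }

    public static void main(String[] args) {
        String numbers = "1 2 3 4 5 6 7 8 9 10 11 12";
        for (String part : slice(numbers, 10)) {
            // Each wrapped part would then be sent to Jsoup and the extracted text concatenated.
            System.out.println(wrap(part));
        }
    }
}
```

Joining the parts returned by `slice` gives back the original text, so no content is lost between the slicing and concatenation steps.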

Unable to click on images selector SELENIUM JAVA

I am unable to click on a PNG image and encounter an error.
HTML:
<a onmouseover="i2uiSetMenuCoords(this,event)" href="javascript:showMenu('9721')"><img hspace="1" src="./skins/e2-modern/images/dropdown.png" border="0px"></a>
Code:
if (navigateToDetails) {
SearchListSelectorExt selector = new SearchListSelectorExt();
//switchToFrame(getFrames(FRAME_TYPE.rcp_content));
//switchToFrame(getHeaderFrames());
WebElement element= selector.get(By.xpath("//a[contains(@src,'./skins/e2-modern/images/dropdown.png'"));
Object value = selector.getElementValue(element);
systemDocID = value.toString();
selector.clickName(systemDocID);
//selector.clickName(CustomerItem);
}
Your XPath is wrong. Use the XPath below:
//a/img[contains(@src,'/skins/e2-modern/images/dropdown.png')]
Hope this helps. Let me know if it does not work.
Try the XPath below:
//img[contains(@src,'dropdown.png')]
Here, we look directly for an img tag whose src attribute contains the text dropdown.png.
If more than one web element satisfies the above XPath, make it unique by adding extra attributes or a parent:
//a/img[contains(@src,'dropdown.png')]
//img[@hspace='1' and contains(@src,'dropdown.png')]
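XPath addresses attributes with the `@` prefix (for example `@src`), and the src attribute in the question's HTML sits on the img element, not on the a element. To check such expressions outside Selenium, one option is the JDK's javax.xml.xpath API. The following is a sketch under assumptions: the class name `XPathCheck` and its members are hypothetical, and the snippet is a shortened copy of the question's HTML with the img tag self-closed so that it parses as XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathCheck {

    // Shortened, well-formed copy of the HTML from the question.
    static final String SNIPPET =
            "<a href=\"javascript:showMenu('9721')\">"
            + "<img hspace=\"1\" src=\"./skins/e2-modern/images/dropdown.png\" border=\"0px\"/>"
            + "</a>";

    // Returns how many nodes in the XML match the XPath expression.
    static int countMatches(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The a element has no src attribute, so this matches nothing.
        System.out.println(countMatches(SNIPPET, "//a[contains(@src,'dropdown.png')]"));
        // The img element carries the src attribute, so this matches one node.
        System.out.println(countMatches(SNIPPET, "//img[contains(@src,'dropdown.png')]"));
    }
}
```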
