Download all images like wget does, with Java on the client side - java

It's easy to download all images from a website using wget.
But I need this feature on the client side, ideally in Java.
I know wget's source can be accessed online, but I don't know any C and the source is quite complex. Of course, wget also has other features which "blow up the source" for me.
Since Java has a built-in HttpClient, but I don't know how sophisticated wget really is, could you tell me how hard it would be to re-implement the "download all images recursively" feature in Java?
How is this done, exactly? Does wget fetch the HTML source of the given URL, extract all URLs with certain file endings (.jpg, .png) from the HTML, and download them? Does it also search for images in the stylesheets linked from that HTML document?
How would you do this? Would you use regular expressions to search for (both relative and absolute) image URLs within the HTML document and let HttpClient download each of them? Or is there already a Java library that does something similar?

In Java you could use the Jsoup library to parse any web page and extract anything you want.
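For example, here is a minimal sketch (my own example, assuming Jsoup on the classpath and Java 9+; the URL is a placeholder) that collects the absolute URLs of all <img> tags on a page and downloads each one. Images referenced only from stylesheets would need an extra pass that fetches the linked CSS files and matches their url(...) values.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;

public class ImageGrabber {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; replace with the page you want to scan
        Document doc = Jsoup.connect("http://example.com/").get();

        for (Element img : doc.select("img[src]")) {
            String src = img.absUrl("src");   // resolves relative URLs against the page
            if (src.isEmpty()) {
                continue;
            }
            String fileName = src.substring(src.lastIndexOf('/') + 1);

            // Stream the image straight to a local file
            try (InputStream in = new URL(src).openStream();
                 FileOutputStream out = new FileOutputStream(fileName)) {
                in.transferTo(out);           // java.io.InputStream#transferTo, Java 9+
            }
        }
    }
}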

For me, crawler4j was the open-source library for recursively crawling (and replicating) a site, e.g. like this (their QuickStart example; it also supports crawling URLs found in CSS):
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
            + "|png|mp3|mp4|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new URL and the second parameter is
     * the new URL. You should implement this function to specify whether
     * the given URL should be crawled or not (based on your crawling logic).
     * In this example, we are instructing the crawler to ignore URLs that
     * have css, js, gif, ... extensions and to only accept URLs that start
     * with "http://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}
More webcrawlers and HTML parsers can be found here.

I found this program, which downloads images; it is open source.
You could get the images on a website from the <img> tags. Look into the following question; it might help you.
Get all Images from WebPage Program | Java

Related

Upload files using selenium

How can I upload files from the local machine via the window prompt using Selenium WebDriver?
I want to perform the following actions:
click on the 'Browse' option in the window,
from the window prompt, go to the particular location on the local machine where the file is kept,
select the file and click on 'Open' to upload the file.
Have you tried using sendKeys() on the proper file input control?
WebElement fileInput = driver.findElement(By.id("some id"));
fileInput.sendKeys("C:/path/to/file.extension");
I have used the three different ways below to upload a file in Selenium WebDriver.
First, the simple case of just finding the element and typing the absolute path of the document into it. But we need to make sure the HTML field is a file input, e.g. <input type="file" name="uploadsubmit">.
Here is the simple code:
WebElement element = driver.findElement(By.name("uploadsubmit"));
element.sendKeys("D:/file.txt");
driver.findElement(By.name("uploadSubmit")).click();   // submit the upload form
String validateText = driver.findElement(By.id("message")).getText();
Assert.assertEquals("File uploaded successfully", validateText);
The second case is uploading using the Robot class, which generates native system input events to take control of the mouse and keyboard; a rough sketch follows.
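This is my own example, not code from the linked article; it assumes the native file dialog is already open and has keyboard focus, and the Robot constructor's checked AWTException is not handled here. The idea is to put the path on the clipboard, paste it, and press Enter:

import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.datatransfer.StringSelection;
import java.awt.event.KeyEvent;

// Assumes the native "Open file" dialog is already open and focused
StringSelection path = new StringSelection("D:/file.txt");
Toolkit.getDefaultToolkit().getSystemClipboard().setContents(path, null);

Robot robot = new Robot();
robot.keyPress(KeyEvent.VK_CONTROL);   // Ctrl+V pastes the file path
robot.keyPress(KeyEvent.VK_V);
robot.keyRelease(KeyEvent.VK_V);
robot.keyRelease(KeyEvent.VK_CONTROL);
robot.keyPress(KeyEvent.VK_ENTER);     // Enter confirms the dialog
robot.keyRelease(KeyEvent.VK_ENTER);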
The third option is to use AutoIt (an open-source tool).
You can find the above three examples here: File Uploads with Selenium Webdriver
Selenium Webdriver doesn't really support this. Interacting with non-browser windows (such as native file upload dialogs and basic auth dialogs) has been a topic of much discussion on the WebDriver discussion board, but there has been little to no progress on the subject.
I have, in the past, been able to work around this by capturing the underlying request with a tool such as Fiddler2, and then just sending the request with the specified file attached as a byte blob.
If you need cookies from an authenticated session, WebDriver.manage().getCookies() should help you in that aspect.
Edit: I have code for this somewhere that worked; I'll see if I can get hold of something you can use.
public RosterPage UploadRosterFile(String filePath){
    Face().Log("Importing Roster...");
    LoginRequest login = new LoginRequest();
    login.username = Prefs.EmailLogin;
    login.password = Prefs.PasswordLogin;
    login.rememberMe = false;
    login.forward = "";
    login.schoolId = "";

    //Set up request data
    String url = "http://www.foo.bar.com" + "/ManageRoster/UploadRoster";
    String javaScript = "return $('#seasons li.selected').attr('data-season-id');";
    String seasonId = (String)((IJavaScriptExecutor)Driver().GetBaseDriver()).ExecuteScript(javaScript);
    javaScript = "return Foo.Bar.data.selectedTeamId;";
    String teamId = (String)((IJavaScriptExecutor)Driver().GetBaseDriver()).ExecuteScript(javaScript);

    //Send request and parse the response into the new driver URL
    MultipartForm form = new MultipartForm(url);
    form.SetField("teamId", teamId);
    form.SetField("seasonId", seasonId);
    form.SendFile(filePath, LoginRequest.sendLoginRequest(login));
    String response = form.ResponseText.ToString();
    String newURL = StaticBaseTestObjs.RemoveStringSubString("http://www.foo.bar.com" + response.Split('"')[1].Split('"')[0], "amp;");

    Face().Log("Navigating to URL: " + newURL);
    Driver().GoTo(new Uri(newURL));
    return this;
}
Where MultiPartForm, LoginRequest, and LoginResponse are helper classes (the original answer links to their source).
The code above is in C#, but there are equivalent base classes in Java that will do what you need them to do to mimic this functionality.
The most important part of all of that code is the MultiPartForm.SendFile method, which is where the magic happens.
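A rough Java counterpart of that multipart upload could be sketched with Apache HttpClient's mime module (an assumed dependency, not what the answer above used; the URL, field names, and placeholder variables simply mirror the C# snippet, and exception handling is omitted):

import java.io.File;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// Placeholders standing in for the values gathered in the C# example
String teamId = "...";                       // read via the JavaScript executor
String seasonId = "...";
String filePath = "C:/path/to/roster.csv";   // hypothetical local file
String cookieHeader = "...";                 // built from driver.manage().getCookies()

CloseableHttpClient client = HttpClients.createDefault();
HttpPost post = new HttpPost("http://www.foo.bar.com/ManageRoster/UploadRoster");
post.setHeader("Cookie", cookieHeader);      // reuse the authenticated session

HttpEntity entity = MultipartEntityBuilder.create()
        .addTextBody("teamId", teamId)
        .addTextBody("seasonId", seasonId)
        .addBinaryBody("file", new File(filePath))
        .build();
post.setEntity(entity);

try (CloseableHttpResponse response = client.execute(post)) {
    String body = EntityUtils.toString(response.getEntity());
    // parse the redirect URL out of "body", as the C# code does with its response
}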

What URL do I use to open a String object in a web browser

If I have an HTML String object, how can I get the browser (using Selenium in Java) to open that String as an HTML page? I have seen this done before but I don't remember the format the URL needs to take.
For this example, let's say the string is :
<h2>This is a <i>test</i></h2>
I looked through this page and couldn't find the answer but I might be overlooking it. For example I tried this URL and it didn't work for me:
data:<h2>This is a <i>test</i></h2>
Here is a link to the documentation: http://en.wikipedia.org/wiki/Data_URI_scheme. You need to specify the MIME type of the data. Try data:text/html,<h2>This is a <i>test</i></h2>
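A minimal Java sketch of that, assuming a ChromeDriver binary on the PATH (any WebDriver implementation should behave the same way):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class OpenHtmlString {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        String html = "<h2>This is a <i>test</i></h2>";
        // The MIME type (text/html) tells the browser how to render the data URI;
        // special characters may need percent-encoding
        driver.get("data:text/html," + html);
    }
}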

Replace all URLs in a HTML

I'm crawling some HTML files with crawler4j and I want to replace all links in those pages with custom links. Currently I can get the source HTML and a list of all outgoing links with this code:
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String html = htmlParseData.getHtml();
List<WebURL> links = htmlParseData.getOutgoingUrls();
However, a simple foreach loop with search & replace won't get me what I want. The problem is that WebURL.getURL() returns the absolute URL, but sometimes the links in the HTML are relative and sometimes they are not.
I want to handle all links (images, URLs, JavaScript files, etc.). For instance, I want to replace images/img.gif with view.php?url=http://www.domain.com/images/img.gif.
The only solution that comes to mind is a somewhat complicated regex, but I'm afraid I'm going to miss some rare cases. Has this been done already? Is there a library or some tool to achieve this?
I think you can make use of a regular expression for this. For example:
...
// text holds the page HTML; prefix is e.g. "view.php?url=http://www.domain.com"
String regex = "\\/[^.]*\\/[^.]*\\.";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    String imageLink = matcher.group();
    text = text.replace(imageLink, prefix + imageLink);
}
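If a Java library is an option, an alternative sketch using Jsoup could look like this (my own example; it assumes the page's base URL is known so relative links can be resolved with absUrl(), and URLs that only appear inside CSS or JavaScript would still need separate handling):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// html is the page source, e.g. from htmlParseData.getHtml()
Document doc = Jsoup.parse(html, "http://www.domain.com/");
for (Element el : doc.select("[href], [src]")) {
    String attr = el.hasAttr("href") ? "href" : "src";
    String absolute = el.absUrl(attr);          // resolves relative URLs against the base
    if (!absolute.isEmpty()) {
        el.attr(attr, "view.php?url=" + absolute);
    }
}
String rewritten = doc.outerHtml();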
Does it have to be a Java solution? PhantomJS in combination with pjscrape can scrape a page to find all URLs.
You just have to create a JavaScript configuration file.
getlinks.js:
pjs.addSuite({
    url: 'http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html',
    noConflict: true,
    scraper: function() {
        var links = _pjs.$('a').map(function() {
            // convert relative URLs to absolute
            var link = _pjs.toFullUrl($(this).attr('href'));
            return link;
        });
        return links.toArray();
    }
});

pjs.config({
    // options: 'stdout' or 'file' (set in config.outFile)
    log: 'stdout',
    // options: 'json' or 'csv'
    format: 'json',
    // options: 'stdout' or 'file' (set in config.outFile)
    writer: 'file',
    outFile: 'scrape_output.json'
});
Then run the command phantomjs pjscrape.js getlinks.js. In this example, the output is stored in a file (it can also be logged to the console).
Here is the (partial) output:
* Suite 0 starting
* Opening http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html
* Scraping http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html
* Suite 0 complete
* Writing 145 items
["http://stackoverflow.com/users/login?returnurl=%2fquestions%2f14138297%2freplace-all-urls-in-a-html","http://careers.stackoverflow.com","http://chat.stackoverflow.com","http://meta.stackoverflow.com","http://stackoverflow.com/about","http://stackoverflow.com/faq","http://stackoverflow.com/","http://stackoverflow.com/questions","http://stackoverflow.com/tags","http://stackoverflow.com/users","http://stackoverflow.com/badges","http://stackoverflow.com/unanswered","http://stackoverflow.com/questions/ask", ...
"http://creativecommons.org/licenses/by-sa/3.0/","http://creativecommons.org/licenses/by-sa/3.0/","http://blog.stackoverflow.com/2009/06/attribution-required/"]
* Saved 145 items

Editing response content on doView()

I have a simple JSR 286 portlet that displays a user manual (pure HTML code, not a JSP).
Currently, my doView method just contains this:
public class UserManualPortlet extends GenericPortlet
{
    @Override
    protected void doView(RenderRequest request, RenderResponse response)
            throws PortletException, IOException
    {
        PortletRequestDispatcher rd = getPortletContext().getRequestDispatcher(
                "/html/usermanual.html");
        rd.include(request, response);
    }
}
This works as expected; however, I'm having trouble when including images. I'm aware that the path to images should be something like:
<img src='<%=renderResponse.encodeURL(renderRequest.getContextPath() + "/html/image.jpg")%>'/>
However, my HTML file containing the user manual is used elsewhere, so I would like to keep it as a pure HTML file.
Is there a way to dynamically replace my plain image URLs with something like the example above? Perhaps using the PrintWriter of the response?
If such a thing is not possible, I think I would need to generate a JSP file during my Maven build.
Any solutions or ideas are welcome.
With JSR 286 portlets you have a better way of referencing resources: create a ResourceURL using renderResponse.createResourceURL() and then set the resource ID on it. That should give more consistent results across all portlet containers.
That said, if you want to modify the generated content from your usermanual.html but don't want to convert it to a JSP, then instead of using a request dispatcher I would load the file contents myself, parse it and do the URL replacements in the same pass, and then print the whole content to the portlet's response.
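A minimal sketch of that second approach, under my own assumptions (Java 8, the manual lives at /html/usermanual.html as in the question, and it only uses relative src attributes):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

import javax.portlet.GenericPortlet;
import javax.portlet.PortletException;
import javax.portlet.RenderRequest;
import javax.portlet.RenderResponse;

public class UserManualPortlet extends GenericPortlet {
    @Override
    protected void doView(RenderRequest request, RenderResponse response)
            throws PortletException, IOException {
        response.setContentType("text/html");

        // Load the static HTML shipped with the portlet
        InputStream in = getPortletContext().getResourceAsStream("/html/usermanual.html");
        String html;
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            html = reader.lines().collect(Collectors.joining("\n"));
        }

        // Prefix relative image paths with the encoded context path
        String prefix = response.encodeURL(request.getContextPath() + "/html/");
        html = html.replaceAll("src=\"(?!https?:)([^\"]+)\"", "src=\"" + prefix + "$1\"");

        response.getWriter().write(html);
    }
}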

Scraping Data. Save File?

I am trying to scrape data from a website which uses JavaScript to load much of its content. Right now I am using jSoup to parse HTML pages; however, since much of the content is loaded using JavaScript, I haven't been able to parse the data I want.
How should I go about getting this JavaScript-generated content? Should I first save the page, then load and parse it with jSoup? If so, what should I use to load the JavaScript content before saving? Is there an API you would recommend that can output HTML?
I am currently using Java.
You might be interested in checking out pjscrape (disclaimer: this is my project). It's a command-line tool using PhantomJS to allow scraping using JavaScript and jQuery in a full browser context - among other things, you can define a "ready" function for the page and wait to scrape until the function (which might check for the existence of certain DOM elements, etc) returns true.
The other option, depending on the page, is to use a console like Firebug to figure out what data is being loaded (i.e. what URLs are being retrieved by the AJAX calls on the page), and scrape the data directly from those URLs.
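If you go that second route, fetching one of those AJAX URLs directly from Java is straightforward; here is a small sketch using the jSoup library the question already mentions (the endpoint URL is a hypothetical placeholder):

import org.jsoup.Connection;
import org.jsoup.Jsoup;

// Hypothetical endpoint discovered in Firebug's network panel
String ajaxUrl = "http://example.com/api/data?page=1";

// ignoreContentType(true) lets Jsoup return JSON or other non-HTML bodies
Connection.Response response = Jsoup.connect(ajaxUrl)
        .ignoreContentType(true)
        .execute();
String json = response.body();
// parse the JSON with your favourite library (Gson, Jackson, ...)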
If the data are generated with JavaScript, then the data are still present in the downloaded page.
It is better to parse them directly on the fly, as you do with plain HTML or text parsing.
If you cannot isolate the tokens with the jSoup API, just parse them from the raw String, as plain text.
I tried using HtmlUnit; however, I found it very slow.
I ended up invoking the curl command-line tool from within Java, which worked for my purposes:
String command = "curl " + url;
Process p = Runtime.getRuntime().exec(command);
BufferedReader stdInput = new BufferedReader(new InputStreamReader(p.getInputStream()));
String s;
String html = "";
while ((s = stdInput.readLine()) != null) {
    html = html + s + "\n";
}
return html;
