I need to create an automated process (preferably using Java) that will:
Open browser with specific url.
Login, using the username and password specified.
Follow one of the links on the page.
Refresh the browser.
Log out.
This is basically done to gather some statistics for analysis. Every time a user follows the link, a bunch of data is generated for that particular user and saved in the database. What I need to do is hit the page every 5-15 minutes using around 10 fake users.
Can you think of a simple way of doing that? There has to be an alternative to the endless manual login-refresh-logout process...
Try Selenium.
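For instance, a minimal Selenium WebDriver sketch of one open-login-click-refresh-logout pass could look like the following; the URL, element ids, and link texts here are placeholders for whatever your page actually uses:
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class OneCycle {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://example.com/login");                   // 1. open the url
            driver.findElement(By.id("username")).sendKeys("user1");  // 2. log in
            driver.findElement(By.id("password")).sendKeys("secret");
            driver.findElement(By.id("login_box_button")).click();
            driver.findElement(By.linkText("Statistics")).click();    // 3. follow the link
            driver.navigate().refresh();                               // 4. refresh the browser
            driver.findElement(By.linkText("Logout")).click();        // 5. log out
        } finally {
            driver.quit();
        }
    }
}
Run that in a loop or from a scheduler, once per fake user, and you have your 5-15 minute ping.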
It's not Java, but JavaScript. You could do something like:
window.location = "<url>";
document.getElementById("username").value = "<email>";
document.getElementById("password").value = "<password>";
document.getElementById("login_box_button").click();
...
etc
With this kind of structure you can easily cover steps 1-3. Throw in some for loops for the page refreshes and you're done.
Use HtmlUnit if you want FAST, SIMPLE, Java-based web interaction/crawling.
For example, here is some simple code that prints a bunch of output and shows how to access all IMG elements of the loaded page.
import java.io.IOException;
import java.net.MalformedURLException;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitTest {
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("http://www.google.com");
        System.out.println(page.getTitleText());

        // Walk every element in the page and report the IMG tags.
        for (HtmlElement node : page.getHtmlElementDescendants()) {
            if (node.getTagName().toUpperCase().equals("IMG")) {
                System.out.println("NAME: " + node.getTagName());
                System.out.println("WIDTH: " + node.getAttribute("width"));
                System.out.println("HEIGHT: " + node.getAttribute("height"));
                System.out.println("TEXT: " + node.asText());
                System.out.println("XML: " + node.asXml());
            }
        }
    }
}
Example #2: Accessing named input fields, entering data, and clicking:
final HtmlPage page = webClient.getPage("http://www.google.com");

// Type a query into the search box, then click the search button.
HtmlElement inputField = page.getElementByName("q");
inputField.type("Example input");
HtmlElement btnG = page.getElementByName("btnG");
Page secondPage = btnG.click();

if (secondPage instanceof HtmlPage) {
    System.out.println(page.getTitleText());
    System.out.println(((HtmlPage) secondPage).getTitleText());
}
NB: You can use page.refresh() on any Page object.
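Putting the pieces together for the original question, here is a rough sketch of the whole cycle (open, log in, follow the link, refresh, log out) repeated on a schedule for several fake users. The username/password field names, the login_box_button id, the "Statistics" and "Logout" link texts, and the URL are all placeholders; swap in whatever your page actually uses:
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlPasswordInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class StatsPinger {

    // One full cycle for one fake user: open, log in, follow the link, refresh, log out.
    static void runOnce(String loginUrl, String user, String pass) throws Exception {
        final WebClient webClient = new WebClient();
        HtmlPage page = webClient.getPage(loginUrl);
        ((HtmlTextInput) page.getElementByName("username")).type(user);
        ((HtmlPasswordInput) page.getElementByName("password")).type(pass);
        page = ((HtmlElement) page.getElementById("login_box_button")).click();
        page = page.getAnchorByText("Statistics").click(); // the link that generates the data
        page = (HtmlPage) page.refresh();
        page.getAnchorByText("Logout").click();
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("user1", "user2"); // ... around 10 fake users
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(users.size());
        for (final String user : users) {
            // Each user repeats the cycle at its own randomly chosen interval between 5 and 15 minutes.
            long delayMinutes = ThreadLocalRandom.current().nextLong(5, 16);
            scheduler.scheduleWithFixedDelay(() -> {
                try {
                    runOnce("http://example.com/login", user, "secret");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }, 0, delayMinutes, TimeUnit.MINUTES);
        }
    }
}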
You could use Jakarta JMeter
I'm making a little script in Java to check iPhone IMEI numbers.
There is this site from Apple:
https://appleonlinefra.mpxltd.co.uk/search.aspx
You have to enter an IMEI number. If the number is OK, it takes you to this page:
https://appleonlinefra.mpxltd.co.uk/Inspection.aspx
Otherwise, you stay on the /search.aspx page.
I want to open the search page, enter an IMEI, submit, and check whether the URL has changed. In my code there is a working IMEI number.
Here is my Java code:
HtmlPage page = webClient.getPage("https://appleonlinefra.mpxltd.co.uk/search.aspx");
HtmlTextInput imei_input = (HtmlTextInput)page.getElementById("ctl00_ContentPlaceHolder1_txtIMEIVal");
imei_input.setValueAttribute("012534008614194");
//HtmlAnchor check_imei = page.getAnchorByText("Rechercher");
//Tried both ways of getting the anchor; neither works
HtmlAnchor anchor1 = (HtmlAnchor)page.getElementById("ctl00_ContentPlaceHolder1_imeiValidate");
page = anchor1.click();
System.out.println(page.getUrl());
I can't work out where the problem comes from, since I often use HtmlUnit for this and have never had this issue. Maybe it is because of the short loading time after submitting?
Thank you in advance
You can do this by using the connection wrapper that HtmlUnit provides. Here is an example:
new WebConnectionWrapper(webClient) {
    public WebResponse getResponse(WebRequest request) throws IOException {
        WebResponse response = super.getResponse(request);
        if (request.getUrl().toExternalForm().contains("Inspection.aspx")) {
            String content = response.getContentAsString("UTF-8");
            WebResponseData data = new WebResponseData(content.getBytes("UTF-8"), response.getStatusCode(),
                    response.getStatusMessage(), response.getResponseHeaders());
            response = new WebResponse(data, request, response.getLoadTime());
        }
        return response;
    }
};
With the connection wrapper above, you can check any request and response that passes through HtmlUnit.
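For this particular check, a minimal usage sketch could look like the following. It reuses the element ids from the question and relies on the fact that the WebConnectionWrapper constructor installs the wrapper on the given WebClient:
final AtomicBoolean imeiAccepted = new AtomicBoolean(false);

// Flag any request for Inspection.aspx -- per the question, the site only goes there for a valid IMEI.
new WebConnectionWrapper(webClient) {
    @Override
    public WebResponse getResponse(WebRequest request) throws IOException {
        WebResponse response = super.getResponse(request);
        if (request.getUrl().toExternalForm().contains("Inspection.aspx")) {
            imeiAccepted.set(true);
        }
        return response;
    }
};

HtmlPage page = webClient.getPage("https://appleonlinefra.mpxltd.co.uk/search.aspx");
((HtmlTextInput) page.getElementById("ctl00_ContentPlaceHolder1_txtIMEIVal")).setValueAttribute("012534008614194");
((HtmlAnchor) page.getElementById("ctl00_ContentPlaceHolder1_imeiValidate")).click();
webClient.waitForBackgroundJavaScript(10000); // give the postback time to finish
System.out.println("IMEI accepted? " + imeiAccepted.get());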
Problem Statement:
I want to crawl this page: http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0
Let's say I want to parse the address, which is "24, Middle Gap Road, The Peak, Hong Kong".
What I did:
I first tried to load the page with jsoup alone, but then I noticed that the page takes some time to load, so I also plugged in HtmlUnit to wait for the page to load first.
Code I wrote:
public static void parseByHtmlUnit() throws Exception {
    String url = "http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0";
    WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38);
    webClient.waitForBackgroundJavaScriptStartingBefore(30000);
    HtmlPage page = webClient.getPage(url);
    synchronized (page) {
        page.wait(30000);
    }
    try {
        Document doc = Jsoup.parse(page.asXml());
        String address = ElementsUtil.getTextOrEmpty(doc.select(".addr"));
        System.out.println("address" + address);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Expected output:
In the console, I should get this output:
address 24, Middle Gap Road, The Peak, Hong Kong
Actual output:
address
How about this?
final Document document = Jsoup.parse(
new URL("http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0"),
30000
);
System.out.println(document.select(".addr").text());
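If the address does turn out to be injected by JavaScript rather than present in the initial HTML, note that in the question the waitForBackgroundJavaScriptStartingBefore call runs before the page is loaded, so there is nothing for it to wait on yet. A sketch that lets HtmlUnit finish the background JavaScript after loading and only then hands the rendered markup to jsoup (keeping the .addr selector and the url string from the question) would be:
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38);
webClient.getOptions().setThrowExceptionOnScriptError(false); // tolerate the site's own JS errors
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(30000); // wait *after* the load so AJAX/DOM work can finish

Document doc = Jsoup.parse(page.asXml());
System.out.println("address " + doc.select(".addr").text());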
I am using Node.js and Soda with Selenium.
I am hitting problems trying to get the JavaScript to fire events, such as on entry of data. The script can't even get through the client-creation flow on IE. I'm only worried about IE 9+, by the way.
Am I missing anything? This works fine in the latest Firefox. Sample code used for Firefox is posted below:
var soda = require('soda');
var assert = require('assert');

var browser = soda.createClient({
    url: 'http://192.168.12.1/', // localhost website
    port: 5555,
    host: '192.168.1.3', // grid ip
    browser: 'firefox',
    browserTimeout: 5
});

var testname = 'Nits' + parseInt(Math.random() * 100000000); // create unique name for new person.
console.log('random name: ' + testname);

browser
    .chain
    .session()
    .setSpeed(speed)
    .setTimeout(20000)
    .open('/')
    .and(login('dev#dev.com', 'x1212GQtpS'))
    .and(killClient(killAtEnd))
    .end(function(err) {
        console.log('done');
        if (err) {
            console.log(err);
        } else {
            console.log("We're Good");
        }
        if (closeAtEnd) {
            browser.testComplete(function() {
                console.log('killing browser');
            });
        }
    });

function login(user, pass) {
    return function(browser) {
        browser
            .click('css=a#loginButton')
            .type('css=input.input-medium.email', user)
            .type('css=input.input.pwd', pass)
            .clickAndWait('css=a.btn.login')
            .assertTextPresent('Clients', function() { console.log('logged in ok'); });
    };
}
Can you be more specific about exactly where the script stops, fails, or throws an error? I suspect that IE can't handle the way you put your CSS selectors together (i.e. the class-name combinators such as 'input.input.pwd'). I would recommend trying a different selector, or rewriting/simplifying your existing one until you get it working, for example targeting the field by a single class ('css=input.pwd') or by its name or id attribute if it has one.
I am creating an app in Java that will take all the information from a public website and load it into the app for people to read, using jsoup. I was trying the same kind of function with Facebook, but it wasn't working the same way. Does anyone have a good idea about how I should go about this?
Thanks,
Calland
public String scrapeEvents(String... args) throws Exception {
    Document doc = Jsoup.connect("http://www.facebook.com/cedarstreettimes?fref=ts").get();
    Elements elements = doc.select("div._wk");
    String s = elements.toString();
    return s;
}
Edit: I found this link of information, but I'm a little confused about how to manipulate it to get only the content of what the specific user posts on their wall: http://developers.facebook.com/docs/getting-started/graphapi/
I had a look at the source of that page -- the thing that is tripping up the parse is that all the real content is wrapped in comments, like this:
<code class="hidden_elem" id="u_0_42"><!-- <div class="fbTimelineSection ...> --></code>
There is JS on the page that lifts that data into the real DOM, but as jsoup doesn't execute JS it stays as comments. So before extracting the content, we need to emulate that JS and "un-hide" those elements. Here's an example to get you started:
String url = "https://www.facebook.com/cedarstreettimes?fref=ts";
String ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.33 (KHTML, like Gecko) Chrome/27.0.1438.7 Safari/537.33";
Document doc = Jsoup.connect(url).userAgent(ua).timeout(10*1000).get();

// move the hidden commented out html into the DOM proper:
Elements hiddenElements = doc.select("code.hidden_elem");
for (Element hidden : hiddenElements) {
    for (Node child : hidden.childNodesCopy()) {
        if (child instanceof Comment) {
            hidden.append(((Comment) child).getData()); // comment data parsed as html
        }
    }
}

Elements articles = doc.select("div[role=article]");
for (Element article : articles) {
    if (article.select("span.userContent").size() > 0) {
        String text = article.select("span.userContent").text();
        String imgUrl = article.select("div.photo img").attr("abs:src");
        System.out.println(String.format("%s\n%s\n\n", text, imgUrl));
    }
}
That example pulls out the article text and any photo that is associated with it.
(It's possibly better to use the FB API than this method; I wanted to show how you can emulate little bits of JS to make a scrape work properly.)
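For reference, if you wire the snippet above into your scrapeEvents method, these are the jsoup imports it relies on:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;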
I am facing a problem because I'm a newbie to Grails.
I'm building a website for reading stories, and my goal now is to save the content of a story across several pages so I can get a list and then paginate it easily. So I did the following.
In the domain layer I created two domain classes. One is called Story:
class Story {
    String title
    List pages
    static hasMany = [users: User, pages: Page]
    static belongsTo = [User]
    static mapping = {
        users lazy: false
        pages lazy: false
    }
}
and, of course, a domain class called Page:
class Page {
    String content
    Story story
    static belongsTo = Story
    static constraints = {
        content(blank: false, size: 3..300000)
    }
}
And the controller's save method goes like this:
def save = {
    def storyInstance = new Story(params)
    def pages = new Page(params)
    String content = pages.content
    String[] contentArr = content.split("\r\n")
    int i = 0
    StringBuilder page = new StringBuilder()
    for (String line : contentArr) {
        i++
        page.append(line + "\r\n")
        if (i % 10 == 0) {
            pages.content = page
            storyInstance.addToPages(pages)
            page = new StringBuilder()
        }
    }
    if (storyInstance.save(flush: true)) {
        flash.message = "${message(code: 'default.created.message', args: [message(code: 'story.label', default: 'Story'), storyInstance.id])}"
        redirect(action: "viewstory", id: storyInstance.id)
    } else {
        render(view: "create", model: [storyInstance: storyInstance])
    }
}
I know it looks messy, but it's a prototype. Anyway, the problem is that I expect the "storyInstance.addToPages(pages)" line to add an instance of the page to the set of pages every time the condition is true, but what actually happens is that it gives me only the last instance, with the last page_idx, while I thought it should save the pages one by one so that I could get a list of pages for every story.
Why does this happen, and is there a simpler way to do it than what I did?
Any help here is appreciated.