I've been waiting for an idea and I think I finally have one. I am going to attempt to make an Android app using web scraping that will let me navigate and use the forums on Roblox (.com if you really want to look it up) better than I can now. Not only are the forums pretty bad in general, but they are even worse on my Android device (Samsung Galaxy Player). Can anyone give me any pointers or advice? I'm not sure what libraries I should use... This is my first big attempt at coding :)
Oh, and obviously I would want to give it a feature to reply to posts, but I'm not sure how login for that kind of thing would work...
EDIT: I got the idea from this application: GooglePlay, Github
You should look up how to get the data from the website, make sure you understand HTML, and find a simple way to handle HTML.
Get the page (use the example in the question): Read data from webpage
A bit about HTML: http://www.w3schools.com/html/default.asp
Handle HTML: http://jsoup.org/cookbook/extracting-data/dom-navigation
To log in, you should do a POST request with the login information to the standard login page, then keep the cookies that were generated and pass them with your other requests.
A little about handling cookies: Java: Handling cookies when logging in with POST
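Since the Jsoup cookbook is linked above, here is a minimal sketch of that login flow using Jsoup. The URL and field names are assumptions; inspect the actual login form to find the real ones.

import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// POST the credentials to the login page (URL and field names are placeholders)
Connection.Response login = Jsoup.connect("https://www.roblox.com/login")
        .data("username", "yourName")
        .data("password", "yourPassword")
        .method(Connection.Method.POST)
        .execute();

// Keep the session cookies and pass them with every later request
Map<String, String> cookies = login.cookies();
Document forum = Jsoup.connect("https://www.roblox.com/Forum/default.aspx")
        .cookies(cookies)
        .get();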
Some things you also might want to think about:
Linear or branched view of posts in the forum?
Should you get a notification if someone posts a new post?
Your own search function?
Signature?
You can use the Jsoup library for Java; it makes parsing HTML data easy. Example: in the doc object you get the complete web page:
// Fetch and parse the page (requires org.jsoup.Jsoup, org.jsoup.nodes.Document,
// and org.jsoup.select.Elements)
Document doc = Jsoup.connect(url).get();
// Select the elements you want, e.g. every <div class="abc">
Elements headlinesCat1 = doc.select("div[class=abc]");
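From there you can, for example, walk the selected elements (a sketch, using org.jsoup.nodes.Element):

// Print the text content of each matched element
for (Element e : headlinesCat1) {
    System.out.println(e.text());
}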
I am trying to parse a website, specifically this one. It does not provide an API for that, as it does for BF4 and other titles, but the owner said that I should just parse the data.
The problem I have is that Jsoup retrieves the data, but if you look carefully, the website makes a new HTTP GET and only after that is the search completed.
From what I could gather, I think it sends some parameters in the header too.
If I just use Jsoup to call that link, I get some data, but where the search results should be I get the message:
Please activate Javascript to see the search results.
Is there a way to get the data? I really need this; any help is very much appreciated.
You need a JavaScript-capable client, e.g., HtmlUnit or Selenium.
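For example, with HtmlUnit you can let the page's JavaScript run before reading the result. A minimal sketch (the URL is a placeholder, and the 5-second wait is an arbitrary choice):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

try (WebClient client = new WebClient()) {
    client.getOptions().setJavaScriptEnabled(true); // on by default, shown for clarity
    HtmlPage page = client.getPage("http://example.com/search"); // placeholder URL
    client.waitForBackgroundJavaScript(5000); // give the async GET time to finish
    System.out.println(page.asXml()); // the DOM after the scripts have run
}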
I visit this link daily to find my lectures at school. Every time, I have to scroll down the list to find my own class and then post the form so I can view the result. Is there any way I could make a direct link to the preferred content? I'm looking to create a simple WebView app in Android showing individual form categories.
EDIT: Really, any method for converting the aspx info into another format would do the trick. Preferably a direct link to each form item, but if I can convert every single item to a .xml file or anything else, I could work with it. But it has to be automated.
You can capture the outgoing request and write a simple application to POST the data back to the page. The WebClient class is useful for this.
Looking at the request in Chrome's developer tools, I see that the form posts back to itself and then redirects to the result page. Presumably, you should POST the form data to the initial page, which will then cause it to perform the redirect.
The form contains a large amount of ViewState data which may or may not need to be included in the request to make it work.
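As a sketch of that POST-back idea (using Jsoup here rather than WebClient; the URL and the drop-down's field name are placeholders you would read from the captured request):

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// First GET the form page to pick up the hidden ASP.NET state fields
Connection.Response form = Jsoup.connect("http://example.com/lectures.aspx")
        .method(Connection.Method.GET)
        .execute();
Document page = form.parse();
String viewState = page.select("input[name=__VIEWSTATE]").val();

// Then POST the form back to itself with your class selected
Document result = Jsoup.connect("http://example.com/lectures.aspx")
        .data("__VIEWSTATE", viewState)
        .data("classDropDown", "myClass") // placeholder field name and value
        .cookies(form.cookies())
        .post();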
A completely different approach would be to find a browser extension, such as a macro recorder, which emulates your actions. This plugin (I haven't tried it myself) appears to do exactly that.
I want to retrieve a set of results consisting of everything produced by looping over all the options of one of the request-form fields.
I'm using the Java language and the HtmlUnit API.
I have managed to do this looping form-fill by using the URL to 'fill' the field's variables (I don't know if it's the best method, and I'm actually quite worried it's one of the worst... but it's the one I could do with the knowledge I have).
But I'm having problems figuring out how to make the program submit the form in order to reach the result page, and how to download (scrape) that page before moving to the next.
NOTES:
- If you have a better way of filling the 'request-form', that is welcome as well.
UPDATE:
This solves the issues when using HtmlUnit API (thank you, touti):
// Click the search button by its name attribute and capture the resulting page
HtmlPage resultado = pageNow.getElementByName("buscar").click();
// Dump the result page as plain text
System.out.println(resultado.asText());
A better way than loading both the request and response pages is still hugely welcome, though!
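One possible direction: drive the option loop through HtmlUnit itself instead of through the URL. A sketch, where "campo" is a hypothetical name for the select field:

import com.gargoylesoftware.htmlunit.html.HtmlOption;
import com.gargoylesoftware.htmlunit.html.HtmlSelect;

HtmlSelect select = (HtmlSelect) pageNow.getElementByName("campo"); // hypothetical field
for (HtmlOption option : select.getOptions()) {
    select.setSelectedAttribute(option, true); // pick the next option
    HtmlPage resultado = pageNow.getElementByName("buscar").click();
    System.out.println(resultado.asText()); // scrape before moving on
    // note: you may need to re-fetch pageNow each iteration if the click replaces it
}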
You can simulate the click on your submit input using jQuery, like this:
$("#submit_id").trigger("click");
I want to get the list of all image URLs from the HTML source of a web page (both absolute and relative URLs). I used Jsoup to parse the HTML, but it's not giving me all the images. For example, when I parse the google.com HTML source it shows zero images. In the google.com HTML source, image links are in this form:
"background:url(/intl/en_com/images/srpr/logo1w.png)
And in rediff.com the image links are in this form:
videoArr[j]=new Array("http://ishare.rediff.com/video/entertainment/bappi-da-the-first-indian-in-grammy-jury/2684982","http://datastore.rediff.com/h86-w116/thumb/5E5669666658606D6A6B6272/v3np2zgbla4vdccf.D.0.bappi.jpg","Bappi Da - the first Indian In Grammy jury","http://mypage.rediff.com/profile/getprofile/LehrenTV/12669275","LehrenTV","(2:33)");
j = 1
videoArr[j]=new Array("http://ishare.rediff.com/video/entertainment/bebo-shahid-jab-they-met-again-/2681664","http://datastore.rediff.com/h86-w116/thumb/5E5669666658606D6A6B6272/ra8p9eeig8zy5qvd.D.0.They-Met-Again.jpg","Bebo-Shahid : Jab they met again!","http://mypage.rediff.com/profile/getprofile/LehrenTV/12669275","LehrenTV","(2:17)");
Not all images are within "img" tags. I also want to extract images that are not within "img" tags, as shown in the HTML source above.
How can I do this? Please help me with this.
Thanks
This is going to be a bit difficult, I think. You basically need a library that will download a web page, construct the page's DOM, and execute any JavaScript that may alter the DOM. After all that is done, you have to extract all the possible images from the DOM. Another option is to intercept all the calls the library makes to download resources, examine each URL, and if the URL is an image, record that URL.
My suggestion would be to start by playing with HtmlUnit (http://htmlunit.sourceforge.net/gettingStarted.html). It does a good job of building the DOM. I'm not sure what types of hooks it has for intercepting the methods that download resources. Of course, if it doesn't provide the hooks, you can always use AspectJ or simply modify the HtmlUnit source code. Good luck; this sounds like a reasonably interesting problem. You should post your solution when you figure it out.
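A sketch of the DOM part of that approach: load the page with HtmlUnit, let any scripts run, then collect the src of every img element that ended up in the DOM. The URL is a placeholder, and note this still misses CSS background images:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlImage;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

try (WebClient client = new WebClient()) {
    HtmlPage page = client.getPage("http://example.com"); // placeholder URL
    client.waitForBackgroundJavaScript(5000); // let scripts mutate the DOM
    for (Object o : page.getByXPath("//img")) {
        System.out.println(((HtmlImage) o).getSrcAttribute());
    }
}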
If you just want every image referred to in the page, can't you just scan the HTML and any linked JavaScript or CSS with a simple regex? How likely is it that you'd get [-:_./%a-zA-Z0-9]*(.jpg|.png|.gif) in the HTML/JS/CSS as something that's not an image? I'd guess not very likely. And you should be allowing for broken links anyway.
Karthik's suggestion would be more correct, but I imagine it's more important to you to just get absolutely everything and filter out uninteresting images.
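For illustration, the regex scan described above might look like this in Java (a sketch; htmlSource stands for whatever page/JS/CSS text you have already fetched, and the pattern is the one suggested, with the dots escaped):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern img = Pattern.compile("[-:_./%a-zA-Z0-9]*\\.(?:jpg|png|gif)");
Matcher m = img.matcher(htmlSource); // htmlSource: fetched HTML/JS/CSS text
while (m.find()) {
    System.out.println(m.group()); // candidate image URL
}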
I've often used Stack Overflow as a source of answers, but now I have a very specific question and I can't really find any data on the internet. I trust you to be as helpful as always! :D
Basically, I'm relying on Mozilla's XULRunner and its XPCOM objects to analyze the HTTP stream of an SWT browser in a Java application on Linux.
Heavily based on Snippet128 and Snippet321 from the Java SWT website (can't post more than one URL, sorry :/), my browser so far can parse all of the HTTP headers using an nsIHttpHeaderVisitor, and do some pretty things like printing them in a tree.
Full source is here.
Now... That's already pretty good. It covers the majority of what I want to do (school assignment at first, going a bit further than asked!).
But what I would really like is to be able to get the raw "content" data from every HTTP request: HTML of course, but also CSS and images.
I've been trying different ways to achieve this goal, but everything failed so far:
Using an XPCOM object - which one?
nsIInputStream would be a good one, but I can't seem to find where the right stream actually is... The nsIHttpChannel open() method (which returns an nsIInputStream) seems to be called by the SWT browser, leaving me with no way to get the stream back.
nsIRequest: no luck.
Another listener that I might have missed? I just spent an hour trying to use the nsIHttpActivityObserver interface, but it doesn't give me any HTTP content (merely GETs and 200 OKs).
Using another object
The SWT Browser itself, for instance. Well, it kind of works: its getText() method gives me the HTML source of the page I'm visiting. But I want more!
I'm really stuck here, and I would greatly appreciate any help.
Cheers!
Florent
Perhaps nsITraceableChannel can help you?