I've often used StackOverflow as a source of answers, but now I have a very specific question and I can't really find anything about it on the internet. I trust you to be as helpful as always! :D
Basically, I'm relying on Mozilla's XULRunner and its XPCOM objects to analyze the HTTP stream of an SWT browser in a Java application on Linux.
Heavily based on Snippet128 and Snippet321 from the Java SWT website (can't post more than 1 URL, sorry :/ ), my browser so far can parse all of the HTTP headers using an nsIHttpHeaderVisitor, and do some pretty things like displaying them in a tree.
Full source is here.
Now... That's already pretty good. It covers the majority of what I want to do (school assignment at first, going a bit further than asked!).
But what I would really like is to be able to get the raw "content" data from every HTTP request: HTML of course, but also CSS and images.
I've been trying different ways to achieve this goal, but everything has failed so far:
Using an XPCOM object - which one?
nsIInputStream would be a good one. But I can't seem to find where the right stream actually is... The nsIHttpChannel open() method (which returns an nsIInputStream) seems to be called by the SWT browser itself, leaving me with no way of getting the stream back.
nsIRequest : no luck.
another Listener that I might have missed? I just spent an hour trying to use the nsIHttpActivityObserver interface, but it doesn't give me any HTTP content (merely the GETs and 200 OKs).
Using another object
the SWT Browser itself, for instance. Well, it kind of works: its getText() method gives me the HTML source of the page I'm visiting. But I want more!
I'm really stuck here, and I would greatly appreciate any help.
Cheers!
Florent
Perhaps nsITraceableChannel can help you?
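To flesh that out a little: the usual trick is to observe "http-on-examine-response" notifications and splice a listener into each channel via nsITraceableChannel.setNewListener(). Below is a rough sketch of what that could look like through the JavaXPCOM bindings (org.mozilla.interfaces) that Snippet128/321 already rely on; the interface and constant names follow what JavaXPCOM generates, but verify them against your XULRunner version, and note that actually copying the body (rather than just passing it through) would need something like nsIStorageStream.

import org.mozilla.interfaces.*;
import org.mozilla.xpcom.Mozilla;

public class ResponseTracer implements nsIObserver {

    // Call once after the Mozilla/XULRunner runtime has been initialized.
    public void register() {
        nsIServiceManager sm = Mozilla.getInstance().getServiceManager();
        nsIObserverService obs = (nsIObserverService) sm.getServiceByContractID(
                "@mozilla.org/observer-service;1", nsIObserverService.NS_IOBSERVERSERVICE_IID);
        // Fired for every HTTP response before its body is delivered to the browser.
        obs.addObserver(this, "http-on-examine-response", false);
    }

    public void observe(nsISupports subject, String topic, String data) {
        nsITraceableChannel traceable = (nsITraceableChannel) subject
                .queryInterface(nsITraceableChannel.NS_ITRACEABLECHANNEL_IID);
        TracingListener tee = new TracingListener();
        // Splice ourselves into the listener chain, remembering the original listener.
        tee.original = traceable.setNewListener(tee);
    }

    public nsISupports queryInterface(String iid) {
        if (nsIObserver.NS_IOBSERVER_IID.equals(iid) || nsISupports.NS_ISUPPORTS_IID.equals(iid)) {
            return this;
        }
        return null;
    }

    static class TracingListener implements nsIStreamListener {
        nsIStreamListener original;

        public void onStartRequest(nsIRequest request, nsISupports context) {
            original.onStartRequest(request, context);
        }

        public void onDataAvailable(nsIRequest request, nsISupports context,
                nsIInputStream stream, long offset, long count) {
            // The raw response body flows through 'stream'. To capture it you would
            // copy the bytes (e.g. via nsIStorageStream) and hand a fresh stream to
            // the original listener; here it is simply passed through untouched.
            original.onDataAvailable(request, context, stream, offset, count);
        }

        public void onStopRequest(nsIRequest request, nsISupports context, long statusCode) {
            original.onStopRequest(request, context, statusCode);
        }

        public nsISupports queryInterface(String iid) {
            if (nsIStreamListener.NS_ISTREAMLISTENER_IID.equals(iid)
                    || nsIRequestObserver.NS_IREQUESTOBSERVER_IID.equals(iid)
                    || nsISupports.NS_ISUPPORTS_IID.equals(iid)) {
                return this;
            }
            return null;
        }
    }
}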
We have a problem (we are a group).
We have to use jsoup in Java for a university project. We can parse HTML with it, but the problem is that we have to parse an HTML page that updates when you click a button (https://www.bundestag.de/services/opendata).
(Screenshots of the first and second slide omitted.)
We want to access all the XMLs from "Wahlperiode 20". But when you click the slide buttons, the HTML updates while the URL stays the same, so you never have access to all the XMLs at once: the list only changes through the slide buttons.
Another idea was to figure out how the URLs of the XMLs we want are built, so that we wouldn't have to deal with the slide buttons and could access the XML URLs directly. But they are all built differently.
So we are all at a loss about how to go on. I hope y'all can help us :)
It's rather ironic that you are attempting to hack[1] some data out of an open-data website. There is surely an API!!
The problem is that websites aren't static resources; they have javascript, and that javascript can fetch more data in response to e.g. the user clicking a 'next page' button.
What you're doing is called 'scraping': Using automated tools to attempt to query for data via a communication channel (namely: This website) which is definitely not meant for that. This website is not meant to be read with software. It's meant to be read with eyeballs. If someone decides to change the design of this page and you did have a working scraper, it would then fail after the design update, for example.
You have, in broad strokes, 3 options:
Abort this plan, this is crazy
This data is surely open, and open data tends to come with APIs; things meant to be queried by software and not by eyeballs. Go look for it, and call the German government, I'm sure they'll help you out! If they've really embraced the REST principles of design, send an Accept header that includes e.g. application/json and application/xml and does not include text/html, and see if the site just responds with the data in JSON or XML format.
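As a rough illustration of that content-negotiation idea (the URL here is just the page you already have; whether the server honours the Accept header is exactly what you would be testing):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AcceptProbe {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Ask for machine-readable formats only; deliberately leave out text/html.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.bundestag.de/services/opendata"))
                .header("Accept", "application/json, application/xml")
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // If the site does content negotiation, this won't be text/html.
        System.out.println(response.headers().firstValue("Content-Type").orElse("?"));
        System.out.println(response.body().substring(0, Math.min(500, response.body().length())));
    }
}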
I strongly advise you to fully exhaust this option before moving on to the next ones, as those are really bad: lots of work, and the code will be extremely fragile (any update to the site by the Bundestag website folks will break it).
Use your browser's network inspection tools
In just about every browser there are 'dev tools'. For example, in Vivaldi, it's under the "Tools" menu and is called "Developer tools". You can also usually right-click anywhere on a web page and there will be an option for 'Inspect', 'Inspector', or 'Development Tools'. Open that now, and find the 'network' tab. When you (re)load this page, you'll see all the resources it's loading (images, the HTML itself, CSS, the works). Look through it and find the interesting stuff. In this specific case, the loading of wahlperioden.json is of particular interest.
Let's try this out:
curl 'https://www.bundestag.de/static/appdata/filter/wahlperioden.json'
[{"value":"20","label":"WP 20: seit 2021"},{"value":"19","label":"WP 19: 2017 - 2021"},(rest omitted - there are a lot of these)]
That sounds useful, and as it's JSON you can just read this stuff with a JSON parser. No need to use JSoup (JSoup is great as a library, but it's a library you reach for when all other options have failed; any code written with JSoup is fragile and complicated simply because scraping sites is fragile and complicated).
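For instance, a minimal sketch that fetches and parses that file, assuming the Jackson library (com.fasterxml.jackson) is on the classpath:

import java.net.URL;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class Wahlperioden {
    public static void main(String[] args) throws Exception {
        // Each element looks like {"value":"20","label":"WP 20: seit 2021"}.
        JsonNode periods = new ObjectMapper()
                .readTree(new URL("https://www.bundestag.de/static/appdata/filter/wahlperioden.json"));
        for (JsonNode period : periods) {
            System.out.println(period.get("value").asText() + ": " + period.get("label").asText());
        }
    }
}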
Then click on the buttons that 'load new data' and check whether network traffic ensues. It does: when you do so, a call goes out. I'm seeing this URL being loaded:
https://www.bundestag.de/ajax/filterlist/de/services/opendata/866354-866354?limit=10&noFilterSet=true&offset=10
The format is rather obvious: offset=10 means "start from the 10th element" (as I just clicked 'next page') and limit=10 means "return no more than 10 entries".
The HTML it returns is also incredibly basic, which is great news, as that makes it easy to scrape. Just write a loop that keeps calling this URL while bumping the offset part (first iteration: no offset; second: offset=10; third: offset=20; keep going until the HTML you get back is blank, then you've got it all).
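A rough sketch of that loop, using JSoup just to pull the .xml links out of each fragment (the a[href$=.xml] selector is a guess; inspect the returned HTML and adjust):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class OpenDataList {
    public static void main(String[] args) throws Exception {
        String base = "https://www.bundestag.de/ajax/filterlist/de/services/opendata/866354-866354"
                + "?limit=10&noFilterSet=true";
        for (int offset = 0; ; offset += 10) {
            Document fragment = Jsoup.connect(base + "&offset=" + offset).get();
            // Guessing the XML files show up as plain anchors; adjust the selector as needed.
            Elements links = fragment.select("a[href$=.xml]");
            if (links.isEmpty()) {
                break; // an empty page means we've walked past the last entry
            }
            for (Element link : links) {
                System.out.println(link.absUrl("href"));
            }
        }
    }
}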
For future reference: Browser emulation
Javascript can also generate entire HTML documents on its own; that's not something jsoup can ever do for you: the only way to obtain such HTML is to actually let the javascript do its work, which means you need an entire browser. Tools like Selenium will start a real browser but let you use JSoup-like constructs to retrieve information from the page (instead of what browsers usually do, which is to transmit the rendered data to your eyeballs). This tends to always work, but it is incredibly complicated and quite slow (you're running an entire browser and really rendering the site, even if you can't see it; that's all happening under the hood!).
Selenium isn't meant as a scraping tool; it's meant as a front-end testing tool. But you can use it to scrape stuff, and you will have to if the HTML is generated by javascript. Fortunately, you're lucky here.
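For completeness, the Selenium route would look roughly like this (assumes the selenium-java dependency plus a matching driver binary such as chromedriver; for content that loads late you may also need explicit waits):

import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class BrowserEmulation {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver(); // starts a real Chrome instance
        try {
            driver.get("https://www.bundestag.de/services/opendata");
            // After the page's javascript has run, query the rendered DOM.
            List<WebElement> links = driver.findElements(By.cssSelector("a"));
            for (WebElement link : links) {
                System.out.println(link.getAttribute("href"));
            }
            // Buttons (like the slider's 'next') can be clicked too, e.g.:
            // driver.findElement(By.cssSelector("button.slick-next")).click();
            // (that selector is a guess; inspect the page for the real one)
        } finally {
            driver.quit();
        }
    }
}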
Option 1 is vastly superior to option 2, and option 2 is vastly superior to option 3, at least for this case. Good luck!
[1] I'm using the definition of 'using a tool or site to accomplish something it was obviously not designed for'. The sense of 'I bought half an IKEA cupboard and half an IKEA bookshelf that are completely unrelated, and put them together anyway; look at how awesome this thingie is' - that sense of 'hack'. Not the sense of 'illegal'.
I visit this link daily to find my lectures at school. Every time, I have to scroll down the list to find my own class and then post the form so I can view the result. Is there any way I could make a direct link to the preferred content? I'm looking to create a simple WebView app in Android showing individual form categories.
EDIT: Really, any method for converting the aspx info into another format would do the trick. Preferably a direct link to each form item, but if I can convert every single item to a .xml file or anything else, I could work with that. It has to be automated, though.
You can capture the outgoing request and write a simple application to POST the data back to the page. The WebClient class is useful for this.
Looking at the request in Chrome's developer tools, I see that the form posts back to itself and then redirects to the result page. Presumably, you should POST the form data to the initial page, which will then cause it to perform the redirect.
The form contains a large amount of ViewState data which may or may not need to be included in the request to make it work.
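The answer above refers to .NET's WebClient; since the question mentions an Android/Java app, here is the same idea sketched with Java's built-in HttpClient. The field names below (__VIEWSTATE, __EVENTVALIDATION, and the made-up class selector field) and the URL are placeholders: copy the actual names and values from the request you captured in the developer tools.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormPoster {
    public static void main(String[] args) throws Exception {
        // Field names are placeholders; copy the real ones from the captured request.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("__VIEWSTATE", "...value copied from the page...");
        fields.put("__EVENTVALIDATION", "...value copied from the page...");
        fields.put("classDropDown", "10A"); // hypothetical name of the class selector

        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) body.append('&');
            body.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }

        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL) // follow the redirect to the result page
                .build();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/lectures.aspx")) // placeholder for the real page
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body.toString()))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}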
A completely different approach would be to find a browser extension, such as a macro recorder, which emulates your actions. This plugin (I haven't tried it myself) appears to do exactly that.
I've been waiting for an idea and I think I finally have one. I'm going to attempt to make an Android app, using web scraping, that will let me navigate and use the forums on Roblox (.com if you really want to look it up) better than I can now. Not only are the forums pretty bad in general, they are even worse on my Android device (Samsung Galaxy Player). Can anyone give me any pointers or advice? I'm not sure what libraries I should use... This is my first big attempt at coding :)
Oh, and obviously I would want to give it a feature to reply to posts, but I'm not sure how login for that kind of thing would work...
EDIT: I got the idea from this application: GooglePlay, Github
You should look up how to get the data from the website, and you should also make sure that you understand HTML. You also need a simple way to handle HTML.
Get the page (use the example in the question): Read data from webpage
A bit about html: http://www.w3schools.com/html/default.asp
Handle html: http://jsoup.org/cookbook/extracting-data/dom-navigation
To log in, you should do a POST request with the login information to the standard login page, then keep the cookies that were generated and pass them along with your other requests.
A little about handling cookies: Java: Handling cookies when logging in with POST
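Jsoup itself can do this cookie dance; here is a minimal sketch (the URLs and parameter names are made up, check the site's actual login form):

import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LoginExample {
    public static void main(String[] args) throws Exception {
        // POST the login form; URL and parameter names are hypothetical.
        Connection.Response login = Jsoup.connect("https://www.roblox.com/login")
                .data("username", "me", "password", "secret")
                .method(Connection.Method.POST)
                .execute();
        // Keep the session cookies and send them with every later request.
        Map<String, String> cookies = login.cookies();
        Document forum = Jsoup.connect("https://www.roblox.com/forum")
                .cookies(cookies)
                .get();
        System.out.println(forum.title());
    }
}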
Some things you also might want to think about:
Linear or branched view of posts in the forum?
Should you get a notification if someone posts a new post?
Your own search function?
Signature?
You can use the JSoup library for Java; you can easily parse HTML data with it. For example, the doc object below holds the complete web page:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
// Fetch and parse the page straight from the URL.
Document doc = Jsoup.connect(url).get();
// Select the elements of interest, e.g. divs with class "abc".
Elements headlinesCat1 = doc.select("div.abc");
How can I print out a document (taken from a database, or from the current fields of a form) in Java with a customized page size? The most important thing is that I want to customize the page to my requirements (text alignment may also be needed). I'm not a hardcore Java coder. Your help will be a big help to me.
Thanks.
It's not clear what "(taken from a database or current fields from the form)" means. I suggest going through the 2D Graphics tutorial, where printing in Java is described in detail.
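For the custom page size part specifically, a minimal sketch using the java.awt.print API (the sizes below are arbitrary; Paper works in points of 1/72 inch):

import java.awt.Graphics;
import java.awt.Graphics2D;
import java.awt.print.PageFormat;
import java.awt.print.Paper;
import java.awt.print.Printable;
import java.awt.print.PrinterException;
import java.awt.print.PrinterJob;

public class CustomSizePrint implements Printable {

    public int print(Graphics g, PageFormat pf, int pageIndex) {
        if (pageIndex > 0) {
            return NO_MORE_PAGES; // only one page in this sketch
        }
        Graphics2D g2 = (Graphics2D) g;
        g2.translate(pf.getImageableX(), pf.getImageableY());
        g2.drawString("Hello, custom page size!", 20, 20); // your text and alignment go here
        return PAGE_EXISTS;
    }

    public static void main(String[] args) throws PrinterException {
        PrinterJob job = PrinterJob.getPrinterJob();
        PageFormat pf = job.defaultPage();
        Paper paper = new Paper();
        // Sizes are in points (1/72 inch); e.g. a 4x6 inch page:
        paper.setSize(4 * 72, 6 * 72);
        paper.setImageableArea(18, 18, 4 * 72 - 36, 6 * 72 - 36); // quarter-inch margins
        pf.setPaper(paper);
        job.setPrintable(new CustomSizePrint(), pf);
        if (job.printDialog()) {
            job.print();
        }
    }
}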
Everywhere I've worked that wanted well-formatted output from a Java back-end, we've deployed Apache FOP (http://xmlgraphics.apache.org/fop/), which allowed us to use XSLT to convert XML to PDF. It works really well, but has a pretty steep learning curve.
I'm writing a Java application which parses links from HTML and uses them to request their content. The area of URL encoding when we have no idea of the "intent" of the URL's author is very thorny. For example, when to use %20 versus + is a complex issue (%20 vs +), and a browser would perform this encoding for a URL containing an un-encoded space.
There are many other situations in which a browser would change the content of a parsed url before requesting a page, for example:
http://www.Example.com/þ
... when parsed & requested by a browser becomes ...
http://www.Example.com/%C3%BE
.. and...
http://www.Example.com/&amp;
... when parsed & requested by a browser becomes ...
http://www.Example.com/&
So my question is: instead of re-inventing the wheel, is there perhaps a Java library I haven't found that does this job? Failing that, can anyone point me towards a reference implementation in a common browser's source? Or perhaps pseudocode? Failing that, any recommendations on approach are welcome!
Thanks,
Jon
HtmlUnit can certainly pick URLs out of HTML and resolve them (and much more).
I don't know whether it handles your corner cases, though. I would imagine it will handle the second, since that is a normal, if slightly funny-looking, use of HTML and a URL. I don't know what it will do with the first, in which an invalid (un-encoded) URL is embedded in the HTML.
I also know that if you find that HtmlUnit does something differently from how real browsers do it, write a JUnit test case to prove it, and file a bug report, its maintainers will happily fix it with great alacrity.
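A rough sketch of picking URLs out of a page with HtmlUnit (assuming a reasonably recent HtmlUnit 2.x; the example just grabs all anchors):

import java.net.URL;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class LinkExtractor {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(false); // not needed just to read hrefs
            HtmlPage page = webClient.getPage("http://www.example.com/");
            for (HtmlAnchor anchor : page.getAnchors()) {
                // Resolves the href against the page's base URL; whether it normalizes
                // the corner cases above is exactly what is worth testing.
                URL resolved = page.getFullyQualifiedUrl(anchor.getHrefAttribute());
                System.out.println(resolved);
            }
        }
    }
}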
How about using java.net.URLEncoder.encode() and java.net.URLDecoder.decode()?
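For what it's worth, those two do application/x-www-form-urlencoded encoding (spaces become +), which is right for query parameters but not for path segments; for paths, the multi-argument java.net.URI constructor is a closer fit. A small comparison, reusing the þ example from the question:

import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        // Form-style encoding: spaces become '+', fine for query strings.
        System.out.println(URLEncoder.encode("a b", StandardCharsets.UTF_8)); // prints: a+b

        // The multi-argument URI constructor quotes illegal characters per component.
        URI uri = new URI("http", "www.Example.com", "/þ", null);
        System.out.println(uri.toASCIIString()); // prints: http://www.Example.com/%C3%BE
    }
}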