Java HTML element reading

I am trying to create a Java program that can detect changes in HTML elements on a web page. For example: http://timer.onlineclock.net/
With each passing second, the HTML elements of the clock change the source of the image they display. Is there any way, using Java, that I can EFFICIENTLY open a connection to this page and see when these elements change?
I have used HtmlUnit, but I decided it takes too long to load a page to be considered efficient enough.
The only way I know how to do it with a URL is to use a BufferedReader to read the page and then use regular expressions to parse an HTML element out of the source, but this would require me to "reload" the page every time I want to see the properties of an element. Can anybody give me any suggestions on how I can detect these changes within milliseconds, without using much network bandwidth?

Your best bet is to learn and use JavaScript instead of server-side Java. A JavaScript program runs on the client side (i.e. the web browser), as opposed to the server side.
A typical HTML document consists of elements (e.g. text, paragraphs, list items). With JavaScript you can create timers, react to user events, and manipulate those elements.
http://www.w3schools.com/js/default.asp is probably a good introduction to JavaScript; I suggest you spend some time on it.

The page in question appears to be a...Javascript digital clock.
If you want the current time, try new Date();.
If you want code to be called at a constant rate, try the java.util.Timer class. You can schedule a task to run every second, which is the same frequency you would get by polling the page.
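For instance, a minimal sketch using java.util.Timer (the per-second work here is just a placeholder print):
import java.util.Date;
import java.util.Timer;
import java.util.TimerTask;

public class OncePerSecond {
    public static void main(String[] args) {
        Timer timer = new Timer();
        // Run immediately, then once every 1000 ms.
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                // Replace this with whatever should happen each second.
                System.out.println(new Date());
            }
        }, 0, 1000);
    }
}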
If you want to use the page as an external source of time, try the Network Time Protocol (http://en.wikipedia.org/wiki/Network_Time_Protocol) instead. It will provide much lower latency and is actually designed for this purpose.
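A minimal sketch of an NTP query, assuming the Apache Commons Net library is on the classpath (the pool.ntp.org server is just an example):
import java.net.InetAddress;
import java.util.Date;
import org.apache.commons.net.ntp.NTPUDPClient;
import org.apache.commons.net.ntp.TimeInfo;

public class NtpExample {
    public static void main(String[] args) throws Exception {
        NTPUDPClient client = new NTPUDPClient();
        client.setDefaultTimeout(3000); // give up after three seconds
        try {
            client.open();
            TimeInfo info = client.getTime(InetAddress.getByName("pool.ntp.org"));
            // The transmit timestamp is the moment the server sent its reply.
            long serverTime = info.getMessage().getTransmitTimeStamp().getTime();
            System.out.println(new Date(serverTime));
        } finally {
            client.close();
        }
    }
}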

Related

Parsing an updating HTML page using jsoup

We have a problem (we are a group). We have to use jsoup in Java for a university project. We can parse HTML with it, but the problem is that we have to parse an HTML page which updates when you click on a button (https://www.bundestag.de/services/opendata).
[Screenshots omitted: "First Slide" and "Second Slide" views of the page's slide widget.]
We want to access all the XML files from "Wahlperiode 20". When you click the slide buttons, the HTML content updates but the URL stays the same, so you never have access to all the XML links at once, because the HTML only updates through the slide buttons.
Another idea was to work out how the URLs of the XML files we want are built, so that we don't have to deal with the slide buttons and can access the XML URLs directly. But they are all built differently.
So we are all desperate about how to go on. I hope y'all can help us :)
It's rather ironic that you are attempting to hack[1] some data out of an open-data website. There is surely an API!
The problem is that websites aren't static resources; they have JavaScript, and that JavaScript can fetch more data in response to, e.g., the user clicking a 'next page' button.
What you're doing is called 'scraping': using automated tools to query for data via a communication channel (namely: this website) which is definitely not meant for that. This website is not meant to be read with software; it's meant to be read with eyeballs. For example, if someone changed the design of this page and you had a working scraper, it would break right after the design update.
You have, in broad strokes, 3 options:
Abort this plan, this is crazy
This data is surely open, and open data tends to come with APIs: things meant to be queried by software rather than by eyeballs. Go look for it, and call the German government; I'm sure they'll help you out! If they've really embraced REST design principles, send an Accept header that includes e.g. application/json and application/xml and does not include text/html, and see if the site responds with the data in JSON or XML format.
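That probe might look like the following sketch, using the standard java.net.http client from Java 11+ (whether the site actually honors the Accept header is exactly what you would be testing):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AcceptProbe {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.bundestag.de/services/opendata"))
                // Ask for machine-readable formats only; deliberately omit text/html.
                .header("Accept", "application/json, application/xml")
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.headers().firstValue("Content-Type").orElse("(none)"));
    }
}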
I strongly advise you to fully exhaust this option before moving on, because the alternatives are really bad: lots of work, and the code will be extremely fragile (any update to the site by the Bundestag web team will break it).
Use your browser's network inspection tools
Just about every browser has 'dev tools'. In Vivaldi, for example, they are under the "Tools" menu, called "Developer tools". You can also usually right-click anywhere on a web page and pick 'Inspect', 'Inspector', or 'Development Tools'. Open that now and find the 'network' tab. When you (re)load the page, you'll see all the resources it loads (images, the HTML itself, CSS, the works). Look through them and find the interesting stuff. In this specific case, the loading of wahlperioden.json is of particular interest.
Let's try this out:
curl 'https://www.bundestag.de/static/appdata/filter/wahlperioden.json'
[{"value":"20","label":"WP 20: seit 2021"},{"value":"19","label":"WP 19: 2017 - 2021"},(rest omitted - there are a lot of these)]
That looks useful, and as it's JSON you can read this stuff with any JSON parser. No need to use jsoup (jsoup is great as a library, but it's one you reach for when all other options have failed; any code written with jsoup is fragile and complicated, simply because scraping sites is fragile and complicated).
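As a hedged sketch, reading those entries with the Gson library (the value and label field names come straight from the response above; the JSON literal stands in for the fetched body):
import java.util.List;
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;

public class Wahlperioden {
    // Mirrors one element of the JSON array returned by wahlperioden.json.
    static class Entry {
        String value;
        String label;
    }

    public static void main(String[] args) {
        String json = "[{\"value\":\"20\",\"label\":\"WP 20: seit 2021\"}]";
        List<Entry> entries = new Gson().fromJson(json, new TypeToken<List<Entry>>() {}.getType());
        entries.forEach(e -> System.out.println(e.value + " -> " + e.label));
    }
}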
Then click the buttons that load new data and check whether network traffic ensues. It does: I see this URL being loaded:
https://www.bundestag.de/ajax/filterlist/de/services/opendata/866354-866354?limit=10&noFilterSet=true&offset=10
The format is rather obvious: offset=10 means 'start from the 10th element' (I had just clicked 'next page'), and limit=10 means 'return no more than 10 entries'.
The HTML this endpoint returns is also incredibly basic, which is great news, as that makes it easy to scrape. Just write a loop that keeps calling this URL while increasing the offset parameter (first iteration: no offset; second: offset=10; third: offset=20; keep going until the HTML you get back is blank, and then you have it all).
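A sketch of that loop, using Java 11's java.net.http client plus jsoup for the returned fragments; note the a[href$=.xml] selector is an assumption and needs checking against the real markup:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class OpenDataScraper {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String base = "https://www.bundestag.de/ajax/filterlist/de/services/opendata/866354-866354"
                + "?limit=10&noFilterSet=true";
        for (int offset = 0; ; offset += 10) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(base + "&offset=" + offset))
                    .build();
            String html = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            if (html.isBlank()) break; // blank response: we have paged past the end
            Document doc = Jsoup.parse(html, "https://www.bundestag.de/");
            // Assumed selector: links whose href ends in .xml.
            doc.select("a[href$=.xml]").forEach(a -> System.out.println(a.absUrl("href")));
        }
    }
}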
For future reference: Browser emulation
JavaScript can also generate entire HTML documents on its own, which is not something jsoup can ever do for you: the only way to obtain such HTML is to actually let the JavaScript do its work, which means you need an entire browser. Tools like Selenium will start a real browser but let you use jsoup-like constructs to retrieve information from the page (instead of what browsers usually do, which is transmit the rendered data to your eyeballs). This tends to always work, but it is incredibly complicated and quite slow: you're running an entire browser and really rendering the site, even if you can't see it; that is all happening under the hood.
Selenium isn't meant as a scraping tool; it's meant as a front-end testing tool. But you can use it to scrape stuff, and you will have to if the HTML is generated by JavaScript. Fortunately, you're lucky here.
Option 1 is vastly superior to option 2, and option 2 is vastly superior to option 3, at least for this case. Good luck!
[1] I'm using the definition of: using a tool or site to accomplish something it was obviously not designed for. The sense of 'I bought half an IKEA cupboard and half of an IKEA bookshelf that are completely unrelated, and put them together anyway, look at how awesome this thingie is' - that sense of 'hack'. Not the sense of 'illegal'.

Parsing a web page which changes in real time in Java

Here's what I want to do. I'm quite a beginner with this, so it may be a lame question, but I want to implement a GUI application in Java which gets data from sports livescore pages,
e.g.
http://www.futbol24.com/Live/
http://livescore.com/
and parse it (somehow) in my app. Then I will be able to store it in, for example, a JTable, save full-time results in a database, play a sound after a goal is scored, and so on.
What is the best way to do this?
It would be almost impossible to parse an HTML document from a live web page and reliably get specific information from it. Even if you did manage to work out exactly where in the document the data is, the page structure could change at any time. The scores might not even be in the HTML - they could be fetched by JavaScript in the page.
I suggest you find an RSS feed of the information you want. Then you'll only have a nice, small piece of XML to parse. That's what it's for.
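For instance, a minimal sketch of pulling item titles out of an RSS feed with the JDK's built-in XML parser; the feed URL is a placeholder, since I don't know whether these particular sites offer one:
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RssReader {
    public static void main(String[] args) throws Exception {
        // Hypothetical feed URL: replace with a real RSS feed of the scores you want.
        URL feed = new URL("http://example.com/livescores.rss");
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(feed.openStream());
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            System.out.println(item.getElementsByTagName("title").item(0).getTextContent());
        }
    }
}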

How to programmatically submit a filled form and scrape the resulting page?

I want to retrieve a set of results, consisting of all results produced by looping over all the options of one of the request-form fields.
I'm using Java language, and HtmlUnit API.
I have managed to do this looping form-fill by using the URL to 'fill' the field's variables (I don't know if it's the best method, and I'm actually quite worried it's one of the worst, but it's the one I could manage with the knowledge I have).
But I'm having problems figuring out how to make the program submit the form in order to reach the results page, and how to download (scrape) that page before moving on to the next.
NOTES:
If you have a better way of filling the request form, that is welcome as well.
UPDATE:
This solves the issue when using the HtmlUnit API (thank you, touti):
HtmlPage resultado = pageNow.getElementByName("buscar").click();
System.out.println(resultado.asText());
A better way than loading both the request and the response pages is still hugely welcome, though!
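For reference, a fuller sketch of that HtmlUnit approach, assuming a reasonably recent HtmlUnit version; the URL and the campo field name are hypothetical (only buscar comes from the snippet above):
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class FormSubmit {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            HtmlPage page = client.getPage("http://example.com/form"); // placeholder URL
            // Fill the field directly instead of encoding values into the URL.
            HtmlTextInput field = page.getElementByName("campo"); // hypothetical field name
            field.type("some value");
            // Click the submit button and capture the page it leads to.
            HtmlPage result = page.getElementByName("buscar").click();
            System.out.println(result.asText());
        }
    }
}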
You can simulate the click on your submit input using jQuery, like this:
$("#submit_id").trigger("click");

How do I send a query to a website and parse the results?

I want to do some development in Java. I'd like to be able to access a website, say for example
www.chipotle.com
On the top right, they have a place where you can enter your zip code, and it will give you all of the nearest locations. The program will just have an empty box for the user to enter their zip code, and it will query the actual Chipotle server to retrieve the nearest locations. How do I do that, and how is the data I receive structured?
This will probably be a followup question as to what methods I should use to parse the data.
Thanks!
First you need to know the parameters needed to execute the query and the URL these parameters should be submitted to (the action attribute of the form). With that, your application has to make an HTTP request to that URL with your own parameters (possibly only the zip code), and finally parse the answer.
This can be done with standard Java API classes, but it won't be very robust. A better solution would be Apache HttpClient.
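A minimal sketch with Apache HttpClient 4.x; the endpoint and the zip parameter name are hypothetical, since the real form action has to be read out of the page first:
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class StoreLocator {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            // Hypothetical action URL and parameter name: read the real ones from the form.
            HttpGet get = new HttpGet("https://www.chipotle.com/locations/search?zip=10001");
            try (CloseableHttpResponse response = client.execute(get)) {
                String body = EntityUtils.toString(response.getEntity());
                System.out.println(body); // hand this off to your parser
            }
        }
    }
}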
This will probably be a followup question as to what methods I should use to parse the data.
It very much depends on what the website actually returns.
If it returns static HTML, a regular (strict) or permissive HTML parser should be used.
If it returns dynamic HTML (i.e. HTML with embedded Javascript) you may need to use something that evaluates the Javascript as part of the content extraction process.
There may also be a web API designed for programs (like yours) to use. Such an API would typically return the results as XML or JSON so that you don't have to scrape the results out of an HTML document.
Before you go any further you should check the Terms of Service for the site. Do they say anything about what you are proposing to do?
A lot of sites DO NOT WANT people to scrape their content or provide wrappers for their services. For instance, if they earn income from ads shown on their site, what you are proposing to do could divert visitors from their site and cause a loss of potential or actual income.
If you don't respect a website's ToS, you could be on the receiving end of lawyers' letters... or worse. In addition, they may already be using technical means to make life difficult for people scraping their service.

Why does GWT wrap() method iterate the whole DOM unnecessarily?

I'm using a custom GWT component that wraps around an existing textbox in my HTML page.
The page returns a list of information, so as larger sets of information are loaded, the GWT loading process takes longer and longer.
Looking at the source code of the wrap() method - it appears it iterates through the DOM looking for matching ids.
Isn't this unnecessary? Is there a way to make it just iterate to my component and then stop?
GWT does a bunch of DOM housekeeping, and you are just not going to get around it easily, or at all.
It sounds like there's a point at which you have so many text inputs that you need to rethink how you're approaching this anyway. Dynamically creating input fields in a form panel in GWT is pretty easy and fast; you can very simply and quickly download a JSON structure with the data you need for the input fields in your original HTML page load, convert it to a dictionary or simple array in GWT, and use it to populate your form.
Once done, you can clear the pointer to the data so it will be GC'd if you no longer need it.
To access the data from JavaScript, look at creating a native (JSNI) method; it's very easy to do. If it makes sense, you can format the JSON data as a dictionary, and GWT's Dictionary class will map directly onto it.
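A sketch of both techniques; the inputData variable and its fields are hypothetical, and the page is assumed to define it in a script tag before the GWT module loads:
import com.google.gwt.i18n.client.Dictionary;

public class DataAccess {
    // JSNI native method: reads a global JS variable, e.g. var inputData = {"field1": "a"};
    public static native String rawField1() /*-{
        return $wnd.inputData.field1;
    }-*/;

    // GWT's Dictionary class maps directly onto a JS object defined in the host page.
    public static String field(String key) {
        Dictionary data = Dictionary.getDictionary("inputData");
        return data.get(key);
    }
}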
I use these techniques all the time and they are robust and pretty much as fast as javascript can populate the DOM.
