Parsing an updating HTML page using jsoup [closed] - java

We have a problem (we are a group).
We have to use jsoup in Java for a university project. We can parse HTML with it, but the problem is that we have to parse an HTML page which updates when you click on a button (https://www.bundestag.de/services/opendata).
[Screenshots: "First Slide", "Second Slide"]
We want to access all the XMLs from "Wahlperiode 20". But when you click on the slide buttons, the HTML code updates while the URL stays the same, so you never have access to all the XMLs at once because the page keeps changing via the slide buttons.
Another idea was to find out how the URLs of the XMLs we want are built, so that we wouldn't have to deal with the slide buttons and could access the XML URLs directly. But they are all built differently.
So we are at a loss for how to go on. I hope y'all can help us :)

It's rather ironic that you are attempting to hack[1] some data out of an opendata website. There is surely an API!!
The problem is that websites aren't static resources; they have javascript, and that javascript can fetch more data in response to e.g. the user clicking a 'next page' button.
What you're doing is called 'scraping': using automated tools to query for data via a communication channel (namely: this website) which is definitely not meant for that. This website is not meant to be read with software; it's meant to be read with eyeballs. If someone decides to change the design of this page and you did have a working scraper, for example, it would fail right after the design update.
You have, in broad strokes, 3 options:
Abort this plan, this is crazy
This data is surely open, and open data tends to come with APIs: things meant to be queried by software and not by eyeballs. Go look for it, and call the German government, I'm sure they'll help you out! If they've really embraced the REST principles of design, then send an Accept header that includes e.g. application/json and application/xml and does not include text/html, and see if the site just responds with the data in JSON or XML format.
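As a sketch of that probe with the JDK's built-in HttpClient (Java 11+); whether the bundestag.de server honors content negotiation at all is exactly what you'd be testing:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AcceptHeaderProbe {
    public static void main(String[] args) throws Exception {
        // Ask for machine-readable formats only; whether the server honors
        // this is precisely what we're testing here.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.bundestag.de/services/opendata"))
                .header("Accept", "application/json, application/xml")
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.headers().firstValue("Content-Type").orElse("(none)"));
        System.out.println(response.body());
    }
}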
I strongly advise you to fully exhaust these options before moving on, as the next options are really bad: lots of work, and the code will be extremely fragile (any update to the site by the bundestag website folks will break it).
Use your browser's network inspection tools
In just about every browser there are 'dev tools'. For example, in Vivaldi, it's under the "Tools" menu and is called "Developer tools". You can also usually right-click anywhere on a web page and there will be an option for 'Inspect', 'Inspector', or 'Development Tools'. Open that now and find the 'network' tab. When you (re)load the page, you'll see all the resources it's loading (the images, the HTML itself, the CSS, the works). Look through it and find the interesting stuff. In this specific case, the loading of wahlperioden.json is of particular interest.
Let's try this out:
curl 'https://www.bundestag.de/static/appdata/filter/wahlperioden.json'
[{"value":"20","label":"WP 20: seit 2021"},{"value":"19","label":"WP 19: 2017 - 2021"},(rest omitted - there are a lot of these)]
That looks useful, and as it's JSON, you can just read this stuff with a JSON parser. No need to use JSoup (JSoup is a great library, but it's one to reach for when all other options have failed, and any code written with JSoup is fragile and complicated simply because scraping sites is fragile and complicated).
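For instance, with the Jackson library (one option among many JSON parsers), the value/label fields visible in the response above can be read in a few lines:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URL;

public class Wahlperioden {
    public static void main(String[] args) throws Exception {
        JsonNode periods = new ObjectMapper().readTree(
                new URL("https://www.bundestag.de/static/appdata/filter/wahlperioden.json"));
        for (JsonNode period : periods) { // the response is a JSON array
            System.out.println(period.get("value").asText() + ": " + period.get("label").asText());
        }
    }
}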
Then click on the buttons that load new data and check whether network traffic ensues. And so it does: I'm seeing this URL being loaded:
https://www.bundestag.de/ajax/filterlist/de/services/opendata/866354-866354?limit=10&noFilterSet=true&offset=10
The format is rather obvious: offset=10 means 'start from the 10th element' (as I just clicked 'next page') and limit=10 means 'no more than 10 entries per request'.
The HTML this returns is also incredibly basic, which is great news, as that makes it easy to scrape. Just write a loop that keeps calling this URL, modifying the offset part each time (first iteration: no offset; second: offset=10; third: offset=20; keep going until the HTML you get back is blank, and then you've got it all).
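In jsoup terms, a minimal version of that loop could look like this (hedged: the a[href] selector and the empty-page stop condition are assumptions you'd verify against the actual fragments):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class OpenDataScraper {
    public static void main(String[] args) throws Exception {
        String base = "https://www.bundestag.de/ajax/filterlist/de/services/opendata/866354-866354?limit=10&noFilterSet=true";
        for (int offset = 0; ; offset += 10) {
            Document page = Jsoup.connect(base + "&offset=" + offset).get();
            Elements links = page.select("a[href]"); // assumed: the fragment is a plain list of anchors
            if (links.isEmpty()) break; // blank page = we've seen everything
            for (Element link : links) {
                System.out.println(link.absUrl("href"));
            }
        }
    }
}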
For future reference: Browser emulation
Javascript can also generate entire HTML on its own; that's not something jsoup can ever do for you: the only way to obtain such HTML is to actually let the javascript do its work, which means you need an entire browser. Tools like Selenium will start a real browser but let you use JSoup-like constructs to retrieve information from the page (instead of what browsers usually do, which is to transmit the rendered data to your eyeballs). This tends to always work, but it is incredibly complicated and quite slow (you're running an entire browser and really rendering the site, even if you can't see it - that's happening under the hood!).
Selenium isn't meant as a scraping tool; it's meant as a front-end testing tool. But you can use it to scrape stuff, and you will have to if the HTML is generated by javascript. Fortunately, you're lucky here and don't need it.
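For completeness, a bare-bones Selenium sketch (the CSS selectors are placeholders, not the real ones on the bundestag page, and a robust script would also wait for the javascript to re-render the list):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class BrowserEmulation {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver(); // needs chromedriver installed
        try {
            driver.get("https://www.bundestag.de/services/opendata");
            // Placeholder selectors - inspect the real page for the actual ones.
            driver.findElement(By.cssSelector("button.slick-next")).click();
            driver.findElements(By.cssSelector("a[href$='.xml']"))
                  .forEach(a -> System.out.println(a.getAttribute("href")));
        } finally {
            driver.quit();
        }
    }
}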
Option 1 is vastly superior to option 2, and option 2 is vastly superior to option 3, at least for this case. Good luck!
[1] I'm using the definition of: Using a tool or site to accomplish something it was obviously not designed for. The sense of 'I bought half an ikea cupboard and half of an ikea bookshelf that are completely unrelated, and put them together anyway, look at how awesome this thingie is' - that sense of 'hack'. Not the sense of 'illegal'.

Related

How to use an input field to permanently save to a webpage?

First of all, I'm not a programming expert. I'm fluent in VB, functional with HTML & PHP, and somewhat fluent with Java.
I have created a password-protected side of my business' website that basically has commonly needed reference material and a lot of organized links to other websites that we frequently use. Right now, if I want to add a new link, I have to go into the HTML and code the button. (Side note: bookmark synchronization via xMarks is what we have been using. While it's functional, I need something that can be more easily accessed on multiple computers, sometimes even public computers and computers owned by clients, so I don't want to be limited by xMarks... we basically store URLs in notes on our smartphones so we can type them in when we need them. Archaic, I know.)
It seems that it would be possible to simply have a form: one field for the URL, one field for the title, and when I click submit it would be permanently added as a button on that page... but I can't even really figure out where to start. I feel like this is probably a job for Java, but I just don't know what direction to go.
You don't have to write the code for me (by all means, if you have the desire, feel free) I just need to know what direction to go!
This is a job for "any programming language" (that is supported by your server, or for which you are willing to add support to your server).
Of your tags, you could use Java or PHP. My personal preference would probably be Perl or Python.
The basics would be:
An HTML form submitting to a server-side program that adds the data to a database. For a low-traffic system like this, that database could be SQLite.
Plus: a server-side program that generates a list of links from the database. It would query the database for all the links (possibly adding paging when the list gets to a certain size), then loop over the results and output the HTML for each one.
Using a template language inside your programming language would be wise. Make sure you look up how to defend yourself from SQL injection and XSS.
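As a hedged illustration in Java (one of your tags), both halves can live in a single servlet. The class, table, and field names are made up, and it assumes the jakarta.servlet API, an SQLite JDBC driver on the classpath, and an existing links(title, url) table:

import jakarta.servlet.http.HttpServlet;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical servlet: POST adds a link, GET renders the stored links.
public class LinksServlet extends HttpServlet {
    private static final String DB = "jdbc:sqlite:links.db";

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        try (Connection c = DriverManager.getConnection(DB);
             PreparedStatement ps = c.prepareStatement(
                     "INSERT INTO links(title, url) VALUES (?, ?)")) {
            ps.setString(1, req.getParameter("title")); // bound parameters, not string
            ps.setString(2, req.getParameter("url"));   // concatenation: no SQL injection
            ps.executeUpdate();
        } catch (SQLException e) {
            throw new IOException(e);
        }
        resp.sendRedirect(req.getRequestURI()); // back to the page with the list
    }

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/html;charset=UTF-8");
        try (Connection c = DriverManager.getConnection(DB);
             Statement st = c.createStatement();
             ResultSet rs = st.executeQuery("SELECT title, url FROM links")) {
            while (rs.next()) {
                // Escape stored values before writing them out: no XSS.
                resp.getWriter().printf("<p><a href=\"%s\">%s</a></p>%n",
                        escape(rs.getString("url")), escape(rs.getString("title")));
            }
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}

In a real version, the hand-written printf would give way to the template language mentioned above.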
This can be easily done using PHP.

Find direct link in an aspx form or read forms with Android

I visit this link daily to find my lectures at school. Every time, I have to scroll down the list to find my own class and then post it so I can view the result. Is there any way I could make a direct link to the preferred content? I'm looking to create a simple WebView app in Android showing individual form categories.
EDIT: Really, any method for converting the aspx info into another format would do the trick. Preferably a direct link to each form item. But if I can convert every single item to a .xml file or anything else, I could work with it. But it has to be automated.
You can capture the outgoing request and write a simple application to POST the data back to the page. The WebClient class is useful for this.
Looking at the request in Chrome's developer tools, I see that the form posts back to itself and then redirects to the result page. Presumably, you should POST the form data to the initial page, which will then cause it to perform the redirect.
The form contains a large amount of ViewState data which may or may not need to be included in the request to make it work.
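The answer above names .NET's WebClient; since the question mentions Android, here is a rough Java equivalent using HttpURLConnection. The field names (including whether __VIEWSTATE is needed at all) are placeholders to be copied from the request captured in the developer tools:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class FormPoster {
    public static String postForm(String pageUrl, String viewState, String classId) throws Exception {
        // Field names are placeholders - copy the real ones from the captured request.
        String body = "__VIEWSTATE=" + URLEncoder.encode(viewState, "UTF-8")
                + "&class=" + URLEncoder.encode(classId, "UTF-8");
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setInstanceFollowRedirects(false); // we want to see the redirect, not follow it
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }
        // A form that posts back to itself typically answers with a redirect
        // to the result page; its target is in the Location header.
        return conn.getHeaderField("Location");
    }
}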
A completely different approach would be to find a browser extension, such as a macro recorder, which emulates your actions. This plugin (I haven't tried it myself) appears to do exactly that.

Making a Safari Reader-style application

I've been inspired by Safari's Reader feature, which lets you ignore all content on a webpage except the story (all the text, links, and images which comprise the point of the page, but none of the markup, antecedents, or consequents). I want to make a Java-based version of it as a lightweight "browser".
My problem is this: I don't know exactly how to discern the main content. Upon inspection of such Reader-recognized pages as MSN articles and fan fictions, I realized that the actual text that Reader recognizes is not only hard to find, but inconsistent and broken up with seemingly random tags. For instance, whereas the news link starts its story with <div class="postBody"> and every paragraph is in <p>s, the fiction linked starts with <div class="chapter_content" id="chapter_container"> and every paragraph starts with <br /><div style='float:left; height:1.0em; width:3.0em;'></div> but is not within its own container.
Since Safari supports this "Reader" interface, there obviously IS a way of doing this, so I won't ask if it exists. Instead, I want to know this: what is a good, fast, Java-supported algorithm for extracting the title and body of a story on a webpage, regardless of how the page itself is constructed?
For context, I've already created a basic browser with a JEditorPane as the window, whose EditorKit is set to an HTMLEditorKit, and am using the setPage(URL page) method to show the target page, but this can change if needed.
If you're willing to use a service, you should look into the Instapaper or Readability APIs; otherwise, you can peep into arc90 lab's JavaScript proof-of-concept implementation of Readability. You can also find several ports of Readability to Java and several other languages on GitHub.
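If you want to experiment without a service, one crude heuristic (far simpler than Readability's real scoring, and purely illustrative) is to pick whichever container holds the most direct paragraph text. It would miss pages like the fan-fiction example above, where paragraphs aren't wrapped at all, but it's a starting point. With jsoup:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MainContentGuess {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.com/some-article").get();
        Element best = null;
        int bestScore = -1;
        // Score each container by the amount of text in its direct <p> children.
        for (Element candidate : doc.select("div, article, section")) {
            int score = 0;
            for (Element child : candidate.children()) {
                if (child.tagName().equals("p")) score += child.text().length();
            }
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        System.out.println(doc.title()); // the 'title' half of the problem
        if (best != null) System.out.println(best.text());
    }
}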

How can I highlight text - strictly timed - a la Karaoke without Flash on a web page. What technology choice?

I would like to display the whole text of a poem, then have text highlighted according to a pre-established time sequence. Something like Karaoke, but without any sound track. A user would then be able to read it at exactly the "right" tempo.
I figure I can generate a subtitle track (for example, with something like Aegisub - although it keeps crashing on my Mac) with the timing data. Something line by line, such as:
1
00:00:18,067 --> 00:00:20,067
Twinkle twinkle little star
2
00:00:20,467 --> 00:00:22,467
How I wonder what you are
... or better still, a word or syllable at a time.
I don't want to use Flash for iPad/iPhone reasons.
My exact question is this as I'm somewhat naive: What would be the best technology to use? I don't need an exact solution, just some pointers on where I should concentrate my efforts. Does Timed Text in HTML5 (TTML) have anything I could use on this? Or SMIL?
Someone posted a karaoke display engine built in JS: https://github.com/sk89q/ricekaraoke
You can use Javascript and CSS to accomplish what you want. You can wrap each word in a span, then apply styles to the span elements at the proper timing intervals. If you can store timing information about when you want corresponding words highlighted, you can use setInterval to add styles at the appropriate times. If you want to use HTML5 features, you might look into using Canvas or SVG to enable more advanced animations.
You can achieve a karaoke effect using a javascript library from Mozilla called popcorn.js. You can download it from http://mozillapopcorn.org/
Here is a tutorial http://net.tutsplus.com/articles/news/a-look-at-popcorn/
Here is a demo http://danharper.me/demo/a-look-at-popcorn/
Lots of links to related info at the bottom of the second link.

How do I send a query to a website and parse the results?

I want to do some development in Java. I'd like to be able to access a website, say for example
www.chipotle.com
On the top right, they have a place where you can enter your zip code and it will give you all of the nearest locations. My program will just have an empty box where the user inputs their zip code, and it will query the actual Chipotle server to retrieve the nearest locations. How do I do that, and also, how is the data I receive stored?
This will probably be a followup question as to what methods I should use to parse the data.
Thanks!
First you need to know the parameters needed to execute the query and the URL to which these parameters should be submitted (the action attribute of the form). With that, your application will have to make an HTTP request to that URL with your own parameters (possibly only the zip code). Finally, parse the answer.
This can be done with the standard Java API classes, but it won't be very robust. A better solution would be HttpClient. Here are some examples.
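As a rough sketch with the JDK's own HttpClient (Java 11+), where the endpoint and the zip parameter name are placeholders you'd read out of the form's action attribute and input names, not Chipotle's real API:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StoreLocator {
    public static void main(String[] args) throws Exception {
        String zip = "10001";
        // Endpoint and parameter name below are illustrative placeholders.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.chipotle.com/locator?zip=" + zip))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // HTML, XML or JSON - whatever the site returns
    }
}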
This will probably be a followup question as to what methods I should use to parse the data.
It very much depends on what the website actually returns.
If it returns static HTML, a regular (strict) or permissive HTML parser should be used (see the sketch after this list).
If it returns dynamic HTML (i.e. HTML with embedded Javascript) you may need to use something that evaluates the Javascript as part of the content extraction process.
There may also be a web API designed for programs (like yours) to use. Such an API would typically return the results as XML or JSON so that you don't have to scrape the results out of an HTML document.
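For the static-HTML case, a permissive parser such as jsoup keeps the code short; the div.location selector below is a made-up placeholder for whatever the real markup uses:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ResultParser {
    public static void main(String[] args) {
        String html = "<html><body><div class=\"location\">123 Main St</div></body></html>";
        Document doc = Jsoup.parse(html); // forgiving: copes with sloppy real-world markup
        for (Element location : doc.select("div.location")) { // placeholder selector
            System.out.println(location.text());
        }
    }
}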
Before you go any further you should check the Terms of Service for the site. Do they say anything about what you are proposing to do?
A lot of sites DO NOT WANT people to scrape their content or provide wrappers for their services. For instance, if they get income from ads shown on their site, what you are proposing could divert visitors away from their site, with a resulting loss of potential or actual income.
If you don't respect a website's ToS, you could be on the receiving end of lawyers' letters... or worse. In addition, they could already be using technical means to make life difficult for people scraping their service.
