We have a problem (we are a group).
We have to use jsoup in Java for a university project. We can parse HTML with it. The problem is that we have to parse an HTML page that updates when you click a button (https://www.bundestag.de/services/opendata).
[Screenshots: first and second slide of the paginated XML list]
We want to access all XML files from "Wahlperiode 20". When you click the slide buttons, the HTML content updates but the URL stays the same, so you never have access to all the XML links at once, because the page keeps being updated through the slide buttons.
Another idea was to work out how the URLs of the XML files are built, so that we don't have to deal with the slide buttons and can access the XML URLs directly. But they are all built differently.
So we are all at a loss how to go on. I hope y'all can help us :)
It's rather ironic that you are attempting to hack[1] out some data from an open-data website. There is surely an API!
The problem is that websites aren't static resources; they have javascript, and that javascript can fetch more data in response to e.g. the user clicking a 'next page' button.
What you're doing is called 'scraping': Using automated tools to attempt to query for data via a communication channel (namely: This website) which is definitely not meant for that. This website is not meant to be read with software. It's meant to be read with eyeballs. If someone decides to change the design of this page and you did have a working scraper, it would then fail after the design update, for example.
You have, in broad strokes, 3 options:
Abort this plan, this is crazy
This data is surely open, and open data tends to come with APIs: things meant to be queried by software, not by eyeballs. Go look for it, and call the German government; I'm sure they'll help you out! If they've really embraced the REST principles of design, send an Accept header that includes e.g. application/json and application/xml and does not include text/html, and see if the site just responds with the data in JSON or XML format.
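For example, with Java 11's built-in HttpClient, a minimal sketch of that check could look like this (whether the server actually honours content negotiation is exactly what you'd be testing):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ContentNegotiationCheck {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Ask for machine-readable formats only; whether the server honours this is the open question.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.bundestag.de/services/opendata"))
                .header("Accept", "application/json, application/xml")
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Content-Type: " + response.headers().firstValue("Content-Type").orElse("?"));
        System.out.println(response.body().substring(0, Math.min(500, response.body().length())));
    }
}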
I strongly advise you fully exhaust these options before moving on to your next options, as the next options are really bad: Lots of work and the code will be extremely fragile (any updates on the site by the bundestag website folks will break it).
Use your browser's network inspection tools
In just about every browser there's 'dev tools'. For example, in Vivaldi, it's under the "Tools" menu and is called "Developer tools". You can also usually right click anywhere on a web page and there will be an option for 'Inspect', 'Inspector', or 'Development Tools'. Open that now, and find the 'network' tab. When you (re)load this page, you'll see all the resources it's loading (so: images, the HTML itself, CSS, the works). Look through it, find the interesting stuff. In this specific case, the loading of wahlperioden.json is of particular interest.
Let's try this out:
curl 'https://www.bundestag.de/static/appdata/filter/wahlperioden.json'
[{"value":"20","label":"WP 20: seit 2021"},{"value":"19","label":"WP 19: 2017 - 2021"},(rest omitted - there are a lot of these)]
That sounds useful, and as it's JSON you can just read this stuff with a JSON parser. No need to use JSoup (JSoup is great as a library, but it's a library you reach for when all other options have failed, and any code written with JSoup is fragile and complicated simply because scraping sites is fragile and complicated).
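For instance, a minimal sketch that fetches and parses that JSON from Java (assuming you add a small JSON library such as org.json to the project):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.json.JSONArray;
import org.json.JSONObject;

public class Wahlperioden {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.bundestag.de/static/appdata/filter/wahlperioden.json"))
                .build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // The endpoint returns a plain JSON array of {value, label} objects.
        JSONArray periods = new JSONArray(body);
        for (int i = 0; i < periods.length(); i++) {
            JSONObject period = periods.getJSONObject(i);
            System.out.println(period.getString("value") + " -> " + period.getString("label"));
        }
    }
}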
Then click on the buttons that 'load new data' and check whether network traffic ensues. And so it does: when you do so, you notice a call going out. I'm seeing this URL being loaded:
https://www.bundestag.de/ajax/filterlist/de/services/opendata/866354-866354?limit=10&noFilterSet=true&offset=10
The format is rather obvious: offset=10 means "start from the 10th element" (as I just clicked 'next page') and limit=10 means "return no more than 10 entries".
The HTML this returns is also incredibly basic, which is great news, as that makes it easy to scrape. Just write a loop that keeps calling this URL, modifying the offset= part (first iteration: no offset; second: offset=10; third: offset=20; keep going until the HTML you get back is blank, then you've got it all).
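A rough jsoup sketch of that loop (the a[href$=.xml] selector is an assumption about how the entries are marked up; check the fragment you actually get back and adjust it):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class OpenDataScraper {
    private static final String BASE =
            "https://www.bundestag.de/ajax/filterlist/de/services/opendata/866354-866354?limit=10&noFilterSet=true&offset=";

    public static void main(String[] args) throws Exception {
        for (int offset = 0; ; offset += 10) {
            // Each call returns a small HTML fragment containing one 'page' of entries.
            Document page = Jsoup.connect(BASE + offset).get();

            // Assumption: each entry links directly to an .xml file; inspect the
            // fragment in your browser's network tab and adjust the selector.
            Elements links = page.select("a[href$=.xml]");
            if (links.isEmpty()) {
                break; // an empty page means we've seen everything
            }
            for (Element link : links) {
                System.out.println(link.absUrl("href"));
            }
        }
    }
}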
For future reference: Browser emulation
Javascript can also generate entire HTML on its own; not something jsoup can ever do for you: The only way to obtain such HTML is to actually let the javascript do its work, which means you need an entire browser. Tools like selenium will start a real browser but let you use JSoup-like constructs to retrieve information from the page (instead of what browsers usually do, which is to transmit the rendered data to your eyeballs). This tends to always work, but is incredibly complicated and quite slow (you're running an entire browser and really rendering the site, even if you can't see it - that's happening under the hood!).
Selenium isn't meant as a scraping tool; it's meant as a front-end testing tool. But you can use it to scrape stuff, and you will have to if the HTML is generated by javascript. Fortunately, you're lucky here and don't need it.
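For completeness, a rough Selenium sketch (this assumes the selenium-java dependency and a matching browser driver are installed; the selector is again a guess you'd have to verify):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class BrowserEmulationSketch {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver(); // starts a real Chrome under the hood
        try {
            driver.get("https://www.bundestag.de/services/opendata");
            // Javascript-rendered content may need an explicit wait (WebDriverWait)
            // before the elements below actually exist in the DOM.
            for (WebElement link : driver.findElements(By.cssSelector("a[href$='.xml']"))) {
                System.out.println(link.getAttribute("href"));
            }
        } finally {
            driver.quit();
        }
    }
}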
Option 1 is vastly superior to option 2, and option 2 is vastly superior to option 3, at least for this case. Good luck!
[1] I'm using the definition of: Using a tool or site to accomplish something it was obviously not designed for. The sense of 'I bought half an ikea cupboard and half of an ikea bookshelf that are completely unrelated, and put them together anyway, look at how awesome this thingie is' - that sense of 'hack'. Not the sense of 'illegal'.
How can I extract an email ID based on the website that I have got from Place Details using the Google Maps API?
I want to extract the generic email ID from these place details or from the website. I have attached a screenshot to show what I have. I am using OpenRefine. I am able to parse the phone number, but I can't parse the email IDs, as they are not present in these place details, although I can see them on the website.
Thank you in advance.
This is the code I am using:
"https://maps.googleapis.com/maps/api/place/details/json?placeid=" + cells['PlaceID'].value + "&key=AIzaSyCSp-f-_FWHw8jfNFF9yd7mgUxaX-DZo8g"
According to the Google Maps Places API documentation, this is not something that can be returned from the API. I believe Google does not want its API to be used as a source for spamming companies. Based on that, I can't think of any legal option that would give you this email.
The only solution I can think of would be to get the URL attribute from the Google Maps API result and then scrape that page, hoping there is a way to find the email address in the HTML code, but:
I don't think that Google would present it in an easily scrapable way;
I believe it would go against Google's terms of use.
So, to summarize, I don't think there is an easy and legal solution to your question.
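If you decide to try that route anyway, a rough Java sketch of the idea could look like the following; it reads the website field from the Place Details response and then naively regexes the company's own page for something email-shaped, with all the fragility and terms-of-use caveats above (the regex and the error handling are deliberately minimal):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.json.JSONObject;

public class EmailFromWebsite {
    public static void main(String[] args) throws Exception {
        String placeId = args[0];
        String apiKey = args[1];
        HttpClient client = HttpClient.newHttpClient();

        // 1. Read the 'website' field from the Place Details result (if present).
        String detailsUrl = "https://maps.googleapis.com/maps/api/place/details/json?placeid="
                + placeId + "&key=" + apiKey;
        String details = client.send(HttpRequest.newBuilder(URI.create(detailsUrl)).build(),
                HttpResponse.BodyHandlers.ofString()).body();
        String website = new JSONObject(details).getJSONObject("result").optString("website", null);
        if (website == null) return;

        // 2. Fetch the company's own site and look for something email-shaped.
        //    Naive, fragile, and possibly against the site's / Google's terms of use.
        String html = client.send(HttpRequest.newBuilder(URI.create(website)).build(),
                HttpResponse.BodyHandlers.ofString()).body();
        Matcher m = Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}").matcher(html);
        if (m.find()) System.out.println(m.group());
    }
}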
I am using the jQuery Mentions Input plugin on my textarea for mentioning names, like Facebook.
When I try to edit the mentioned text, the already-mentioned name changes to plain text. How can I handle this?
Actually, there is no such built-in option. jQuery is a JavaScript library which you can build on for your own use.
You cannot just ask jQuery to do your work for you; you need to code it out yourself!
When Facebook implements this feature, it stores the mention in a form like [userId:userName], which is then rendered back as the name with a link to the user's profile.
I would suggest you do the same: when a user edits the names, replace that text fully. Don't just append to it; replace it as a whole. That way the plain text isn't written out, and the mention ID keeps the new name inside it.
I am building a plugin that requires a user to click on a long link in chat. Something like this:
http://www.example.com/sample.php?code=124ds8g89fgfg9fd9g76hg89f7d698d67fgh7
Whenever I send this to the user:
sender.sendMessage(url);
only part of the link is copied, for example: http://www.example.com/sample.php?code=124ds8
How do I make Minecraft accept the full URL length? I do not want to use third-party URL shorteners either. Thanks!
Minecraft will automatically wrap the message onto the next line, or cut off part of the link/message if it is too long; it can only fit so many characters into the chat box.
Your best bet is to use something like goo.gl or bit.ly to shorten the links into something more manageable.
I am writing Java code.
I want to search for a string on Google and Google Images from my Java code.
Previously I wasn't even able to search text; then I registered with Google and could do it.
Now I want to search for a string against Google Images.
How can I do it?
Regards
Manjot
It's pretty similar to text search; you just need to specify the URL as
http://ajax.googleapis.com/ajax/services/search/images
Standard arguments are described here; image-specific ones are here.
I agree with ChssPly76, but you have to include the version number in the URL, as follows:
http://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=
Put q= whatever you want to search for.
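A minimal Java sketch of calling that URL (the query string must be URL-encoded; the query and Referer values here are just placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class GoogleImageSearch {
    public static void main(String[] args) throws Exception {
        String query = URLEncoder.encode("eiffel tower", "UTF-8");
        URL url = new URL("http://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=" + query);

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Identifying your application via a Referer header is commonly recommended for this API.
        conn.setRequestProperty("Referer", "http://www.example.com/");

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            StringBuilder json = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
            // The response is JSON; parse responseData.results[*].url with any JSON library.
            System.out.println(json);
        }
    }
}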
I hope it will help you.
Cheers.