How to parse content with <pre>?

How to parse content with <pre>? - java

I am using jsoup to parse a number of things.
I am trying to parse this tag
<pre>HEllo Worl<pre>
But just cant get it to work.
How would i parse this using jsoup?\
Document jsDoc = null;
jsDoc = Jsoup.connect(url).get();
Elements titleElements = jsDoc.getElementsByTag("pre");
Here is what i have so far.

Works fine for me with latest Jsoup:
String html = "<p>lorem ipsum</p><pre>Hello World</pre><p>dolor sit amet</p>";
Document document = Jsoup.parse(html);
Elements pres = document.select("pre");
for (Element pre : pres) {
System.out.println(pre.text());
}
Result:
Hello World
If you get nothing, then the HTML which you're parsing simply doesn't contain any <pre> element. Check it yourself by
System.out.println(document.html());
Perhaps the URL is wrong. Perhaps there's some JavaScript which alters the HTML DOM with new elements (Jsoup doesn't interpret nor execute JS). Perhaps the site expects a real browser instead of a bot (change the user agent then). Perhaps the site requires a login (you'd need to maintain cookies). Who knows. You can figure this all out with a real webbrowser like Firefox or Chrome.

Related

Unable to retrieve table elements using jsoup

I'm new to using jsoup and I am struggling to retrieve the tables with class name: verbtense with the headers: Present and Past, under the div named Indicative from the from this site: https://www.verbix.com/webverbix/Swedish/misslyckas
I have started off trying to do the following, but there are no results from the get go:
Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements tables = document.select("table[class=verbtense]"); // empty
I also tried this, but again no results:
Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements divs = document.select("div");
if (!divs.isEmpty()) {
for (Element div : divs) {
// all of these are empty
Elements verbTenses = div.getElementsByClass("verbtense");
Elements verbTables = div.getElementsByClass("verbtable");
Elements tables = div.getElementsByClass("table verbtable");
}
}
What am I doing incorrectly?

The page you are trying to scrape have dynamically generated content on the client side (with javascript), therfore you won be able to extact data using that link
You might me able to scrape some content from the API call that this webpage is making eg https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas
Inspect browser console to see what page is doing, and do the same

The first catch is that this page loads its content asynchronously using AJAX and uses JavaScript to add the content to the DOM. You can even see the loader for a short time.
Jsoup can't parse and execute JavaScript so all you get is the initial page :(
The next step would be to check what the browser is doing and what is the source of this additional content. You can check it using Chrome's debugger (Ctrl + Shift + i). If you open Network tab, select only XHR communication and refresh the page you can see two requests:
One of them gets such content https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas
as you can see it's a JSON with HTML fragments and this content seems to have verbs forms you need. But here's another catch because unfortunately Jsoup can't parse JSON :( So you'll have to use another library to get the HTML fragment and then you can parse it using Jsoup.
General advice to download JSON is to ignore content type (Jsoup will complain it doesn't support JSON):
String json = Jsoup.connect("https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas").ignoreContentType(true).execute().body();
then
you'll have to use some JSON parsing library for example json-simple
to obtain html fragment and then you can parse it to HTML with Jsoup:
String json = Jsoup.connect(
"https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas")
.ignoreContentType(true).execute().body();
System.out.println(json);
JSONObject jsonObject = (JSONObject) JSONValue.parse(json);
String htmlFragmentObtainedFromJson = (String) ((JSONObject) jsonObject.get("p1")).get("html");
Document document = Jsoup.parse(htmlFragmentObtainedFromJson);
System.out.println(document);
Now you can try your initial approach with using selectors to get what you want from document object.

Java Jsoup Submitting a form

I am trying to figure out how to submit a form using Jsoup.
On Xfinity's website, I am trying to input an address and get back the resulting page after clicking on "Show me deals" from the url below:
https://www.xfinity.com/learn/offers
Here is my current code:
public String getISP() throws IOException {
Connection.Response addressFormResponse = Jsoup.connect("https://www.xfinity.com/learn/offers")
.data("Address.SingleStreetAddress", address)
.method(Connection.Method.POST)
.execute();
Document doc = addressFormResponse.parse();
System.out.println(doc.title());
System.out.println(doc.location());
if (doc.location().contains("Active Address")) {
return "Comcast XFinity";
}
return "Cannot find an ISP";
}
The current code only returns the same webpage, how would I get back the resulting page?

Jsoup is a HTML parser library, it provides functionality for extracting and manipulating data on HTML page. If you need traverse websites, submit forms, click elements, it's better to use another tools, like selenium, HTTP client (which are often used for automated test of web applications) or web crawler libraries like crawler4j.

I would tend to disagree with Daniil's answer in that neither HTTP Client or crawler4j support javascript which is required for this page. Selenium is probably the best solution.
What follows is an example of how to use jsoup to fetch a page, fill out a form, and submit it. The result is json and so you would then pass that string to gson or similar. I did not that the page was very flaky just in a regular browser, and sometimes would catch the address input and sometimes would barf on the same input.
Document doc = Jsoup.connect("https://www.xfinity.com/learn/offers").get();
FormElement form = (FormElement) doc.selectFirst("[data-form-dealfinder-localization]");
Element input = form.selectFirst("#Address_StreetAddress");
input.val("2000 YALE AVE E, SEATTLE, WA 98102");
String json = form.submit().ignoreContentType(true).execute().body();
System.out.println(json);

Jsoup extract Hrefs from the HTML content

My problem is that I try to get the Hrefs from this site with JSoup
https://www.amazon.de/s?k=kissen&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss_2
but it does not work.
I tried to select the class from the Href like this
Elements elements = documentMainSite.select(".a-link-normal");
and after that I tried to extract the Hrefs with the following piece of code.
for (Element element : elements) {
String href = element.attributes().get("href");
}
but unfortunately it gives me nothing...
Can someone tell me where is my mistake please?
I don't just connect to the website. I also save the hrefs in a string by extracting them with
String href = element.attributes().get("href");
after that I've print the href String but is empty.
On another side the code works with another css selector. so it has nothing to do with the code by it self. its just the css selector (.a-link-normal) that is probably wrong.

You won't get anything by simply connecting to the url via Jsoup.
Document document = Jsoup.connect(yourUrl).get();
String bodyText = document.getElementsByTag("body").get(0).text();
Here is the translation of the body text, which I got from the above code.
Enter the characters below We ask for your understanding and want to
be sure that you are not a bot. For best results, please use a browser
that accepts cookies. Type the characters you see in the image: Enter
characters Try another image Continue shopping Terms & Conditions
Privacy Policy © 1996-2015, Amazon.com, Inc. or its affiliates
Either you need to bypass captcha or emulate a browser by means of Selenium, for example.

JSoup check if <HTML>,<HEAD> and <BODY> tags are present

Hi I am using JSoup to parse a HTML file. After parsing, I want to check if the file contains the tag. I am using the following code to check that,
htmlDom = parser.parse("<p>My First Heading</p>clk");
Elements pe = htmlDom.select("html");
System.out.println("size "+pe.size());
The output I get is "size 1" even though there is no HTML tag present. My guess is that it is because the HTML tag is not mandatory and that it is implicit. Same is the case for Head and Body tag. Is there any way I could check for sure if these tags are present in the input file?
Thank you.

It does not return 1 because the tag is implicit, but because it is present in the Document object htmlDom after you have parsed the custom HTML.
That is because Jsoup will try to conform the HTML5 Parsing Rules, and thus adds missing elements and tries to fix a broken document structure. I'm quite sure you would get a 1 in return if you were to run the following aswell:
Elements pe = htmlDom.select("head");
System.out.println("size "+pe.size());
To parse the HTML without Jsoup trying to clean or make your HTML valid, you can instead use the included XMLParser, as below, which will parse the HTML as it is.
String customHtml = "<p>My First Heading</p>clk";
Document customDoc = Jsoup.parse(customHtml, "", Parser.xmlParser());
So, as opposed to your assumption in the comments of the question, this is very much possible to do with Jsoup.

Java get single element from webpage

Hi so after some searching still not found an answer but i would like to get a single element of a webpage to a String Variable. I know how to do this in C but would like to know in java
eg:
document.nav(the webpage)
String value = document.getElementbyid(theid)
Thanks
so eg:
some webpage has
<body>
<P id=element1>the value i want</p>
</body>
and i need to get that value from the webpage into a String variable

You can use jsoup for that:
String url = "http://www.example.com"; // or whatever goes here
Document document = Jsoup.connect(url).followRedirects(false).timeout(60000/*wait up to 60 sec for response*/).get();
String value = document.body().select("#element1" /*css selector*/).get(0).text();
If you need another input format please refer to the cookbook
It's not really necessary to specify timeout ect. for the connection. You could just use
Document document = Jsoup.connect(url).get();
I only included the timeout if the webpage takes really long to load. You also may want to follow redirects.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to parse content with <pre>? - java

Related

Unable to retrieve table elements using jsoup

Java Jsoup Submitting a form

Jsoup extract Hrefs from the HTML content

JSoup check if <HTML>,<HEAD> and <BODY> tags are present

Java get single element from webpage

Categories

Resources