How to select classes that have dash "-" with jsoup? - java

Suppose I want to select the following html with jsoup:
<p class="foo bar-baz">Hello World!</p>
I can select it from the Document object doc with doc.select("p.foo"). This look for paragraphs with class foo. I want to be more specific so I then try to go with doc.select("p.foo.bar-baz"). I know I can specify additional classes if I separate with dot however in the example above the dash seems to be causing problems. What do I need to do to also select class bar-baz?

Turns out the problem was that I was relying on the html from developers tools in chrome but the html of the Document object was different. It's not the same code but what happened essentially was that chrome would show this:
<p class="foo bar-baz">Hello World!</p>
When the html of the Document object in reality only had this:
<p class="foo">Hello World!</p>
Naturally that was the reason for the null pointer. I then tried to select elements with dashes and it had problem dealing with them (as pointed by luksch in a comment above).
Iam not sure why they would show different html but now I know I can only rely on the Document html to select elements.

Related

JSOUP Deleting closing and/or opening divs only

Hello Im googling for hours now and can't find answer...(or smt close to it)
What i am trying to do is, lets say i have this code(very simplified):
<div id="one"><div id="two"><div id="three"></div></div></div>
And what i want to do is delete specific amount of this elements , lets say 2 of them. So the result would be:
<div id="one"><div id="two"><div id="three"></div>
Or i want to delete this opening elements (again specific amount of them, lets say 2 again) but without knowing their full name (so we can assume if real name is id="one_54486464" i know its one_ ... )
So after deleting I get this result:
<div id="three"></div></div></div>
Can anyone suggest way to achieve this results? It does not have to Include JSOUP, any better. more simple or more efficient way is welcomed :) (But i am using JSOUP to parse document to get to the point where i am left with )
I hope i explain myself clearly if you have any question please do ask... Thanks :)
EDIT: Those elements that i want to delete are on very end of the HTML document(so nothing, nothing is behind them not body tag html tag nothing...)
Please keep that HTML document would have many across whole code and i want to delete only specific amount at the end of the document...
For the opening divs THOSE are on very beginning of my HTML document and nothing is before them... So i need to remove specific amount from the beginning without knowing their specific ID only start of it. Also this div has closing too somewhere in the document and that closing i want to keep there.
For the first case, you can get the element's html (using the html() method) and use some String methods on it to delete a couple of its closing tags.
Example:
e.html().replaceAll("(((\\s|\n)+)?<\\/div>){2}$","");
This will remove the last 2 closing div tags, to change the number of tags to be remove, just change the number between the curly brackets {n}
(this is just an example and is probably unreliable, you should use some other String methods to decide which parts to discard)
For the second case, you can select the inner element(s) and add some additional closing tags to it/them.
Example:
String s = e.select("#two").first().html() + "</div></div>";
To select an element that has an ID that starts with some String you can use this e.select("div[id^=two]")
You can find more details on how to select elements here
After Titus suggested regular expressions I decided to write regex for deleting opening divs too.
So I convert Jsoup Document to String then did the parsing on a string and then convert back to Jsoup Document so I can use Jsoup functions.
ADD: What I was doing is that I was parsing and connecting two pages to one seamless. So there was no missing opening div or closing... so my HTML code stay with no errors therefore I was able to convert it back to Jsoup Document without complications.

correcting parsed URLs in java

I am creating a HTML parser that gets the HTML from a given URL, finds the navigation menu html, and puts it into a String. The URLs in the HTML that are being copied into the String need part of the URL added (the "www.stackoverflow.com" part). How can I go about finding the existing URLs in the String and adding the missing part to it so that they work.
The URLs in the String are of the form:
<a href="/qestions/11744851.cfm">
and I need to make them in the following form:
<a href="www.stackoverflow.com/questions/11744851.cfm">
Try using this regular expression with the ReplaceAll() method:
str = subString.replaceAll("<a href=\"(.*)\">", "<a href=\"http://www.stackoverflow/$1\">");
If the XHTML is valid XML, the easiest way is to parse it as XML and use XPath (for example /body/div/a#href , where /body/div is path to menu section in HTML.
There is also a project called HTMLParser (http://htmlparser.sourceforge.net/), you may want to give it a try (according to the page, it has 'link extraction, for crawling through web pages or harvesting email addresses'; but I've never used it, so I can't help much).
If on the other hand the HTML is anything but valid, you may want to use http://ccil.org/~cowan/XML/tagsoup/ - it might work, or it might not, on websites we've tried, it did pretty good.
Edit: adding missing part may be done using simple concatenation after finding interesting parts

Parsing HTML content with sibling tags in Java (or) Finding content between two <open> tags

Background: I'm writing a Java program to go through HTML files and replace all the content in tags that are not <script> or <style> with Lorem Ipsum. I originally did this with a regex just removing everything between a > and a <, which actually worked quite well (blasphemous I know), but I'm trying to turn this into a tool others may find useful so I wouldn't dare threaten the sanctity of the universe any more by trying to use regex on html.
I'm trying to use HtmlCleaner, a Java library that attracted me because it has no other dependencies. However, trying to implement it I've been unable to deal with html like this:
<div>
This text is in the div <span>but this is also in a span.</span>
</div>
The problem is simple. When the TagNodeVisitor reaches the div, if I replace its contents with the right amount of lipsum, it will eliminate the span tag. But if I drill down to only TagNodes with no other children, I would miss the first bit of text.
HtmlCleaner has a ContentNode object, but that object has no replace method. Anything I can think of to deal with this seems like it must be far too complicated. Is anyone familiar with a way to deal with this, with HtmlCleaner or some other parsing library you're more familiar with?
You can pretty much do anything you want with JSoup setters
Would that suit you ?
Element div = doc.select("div").first(); // <div></div>
div.html("<p>lorem ipsum</p>"); // <div><p>lorem ipsum</p></div>
HtmlCleaner's ContentNode has a method getContent() that returns a java.lang.StringBuilder. This is mutable and can be changed to whatever value you want.

How to find URLs in HTML using Java

I have the following... I wouldn't say problem, but situation.
I have some HTML with tags and everything. I want to search the HTML for every URL. I'm doing it now by checking where it says 'h' then 't' then 't' then 'p', but I don't think is a great solution
Any good ideas?
Added: I'm looking for some kind of pseudocode but, just in case, I'm using Java for this project in particular
Try using a HTML parsing library then search for <a> tags in the HTML document.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a with href
not all url are in tags, some are text
and some are in links or other tags
You shouldn't scan the HTML source to achieve this.
You will end up with link elements that are not necessarily in the 'text' of the page, i.e you could end up with 'links' of JS scripts in the page for example.
Best way is still that you use a tool made for the job.
You should grab HTML tags and cover the most likely ones to have 'links' inside them (say: <h1>, <p>, <div> etc) . HTML parsers provide regex-like functionalities to filter through the content of the tags, something similar to your logic of "starts with HTTP".
[attr^=value], [attr$=value],
[attr*=value]: elements with
attributes that start with, end with,
or contain the value, e.g.
select("[href*=/path/]")
See: jSoup.
You may want to have a look at XPath or Regular Expressions.
Use a DOM parser to extract all <a href> tags, and, if desired, additionally scan the source for http:// outside of those tags.
The best way should be to google for regexes. One example is this one:
/^(https?):\/\/((?:[a-z0-9.\-]|%[0-9A-F]{2}){3,})(?::(\d+))?((?:\/(?:[a-z0-9\-._~!$&'()+,;=:#]|%[0-9A-F]{2})))(?:\?((?:[a-z0-9\-._~!$&'()+,;=:\/?#]|%[0-9A-F]{2})))?(?:#((?:[a-z0-9\-._~!$&'()+,;=:\/?#]|%[0-9A-F]{2})*))?$/i
found in a hacker news article. As far as I can follow it, it looks good. But there is, as far as I know, no formal regex for this problem. So the best solution is to google for some and try which one matches most of what you want.

Check and insert alt attribute for images which miss it

I want to write a Java tool to assess HTML pages of an existing site and if any image has no alt attribute, the tool will insert alt="" to that image. One approach is using an HTML parser (like HtmlCleaner) to generate the DOM then adding the alt attribute to the images in the DOM before writing back the HTML.
However, this approach won't keep the original HTML intact and probably cause some unpredictable side effects, esp. when the existing amount of HTML pages is huge and there is no guarantee about their being well-formed.
Is there any safer way to accomplish this (i.e. should keep the original HTML intact and only add the alt attribute)?
Short of writing some horrible mess of regexp or other string manipulation code, I don't believe that there is another way of doing this.
I do question why you want to do this? The only reason I can imagine is to pass some sort of automatic validation, but the reason for requiring alt tags is a matter of usability. Adding empty alt tags does not help that in any way. You are just hiding the problem.
Instead I'd suggest writing a bit of Javascript that throws a red border around any image missing an alt tag and making the front end designers add meaningful alt tags to every image thus flagged.
It's kind of pointless to add empty alt tags to your layout. I second Kris in that it's defeating the purpose of having the alt tags in the first place and I agree with David Dorward's comment.
But, if there is some ulterior motive here, you could do it after the fact in the browser with javascript (or, preferably, jQuery). The client's browser certainly won't be able to change the original HTML and is smart enough to parse through it even if it's not perfectly well-formed.
Using jQuery, place this script in the head section of your page:
<script language="javscript">
$(function() {
$('img:not([alt])').attr('alt','');
});
</script>
And make sure you include the jQuery library.
I've used the Jericho HTML Parser library in the past with success for parsing HTML. It's supposed to work well with poorly formed HTML. This would alter the original HTML though.

Categories