Pretty print ("indentation-only") HTML documents in Java (no JTidy)

We generate HTML files with Apache Velocity, a generic template engine. The generated HTML is rather ugly and not correctly indented.
In my case the HTML is stored in a String, and I want to transform it so that it looks pretty-printed.
I've already given JTidy a try, but it changes the HTML source when I pipe the raw HTML through it. Sometimes it adds or removes HTML tags.
My question:
Is there a Java library or something else out there that (only!) pretty-prints my HTML code without adding or removing tags from my HTML document? It should only do the indentation, so that it looks pretty-printed! Nothing more, nothing less. Any ideas? :-)
Code suggestions, hints, or tips are also welcome.
Best regards

Maybe a little too late, but I found a solution to this with Jsoup.
You can get the "pretty" version of the HTML by using only the parser and, if needed, stop Jsoup from adding the missing html, head, and body elements by passing it a different parser (here, the XML parser).
I got the answer from this Jsoup question.
Here it is:
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.parser.Parser;
    public static String formatHTML(String html) {
        // Unlike the HTML parser, the XML parser does not add html, head, or body elements.
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        return doc.toString();
    }
I hope this helps.
Regards

Find any SAX parser example in Java. Do indent++ for opening tags and indent-- for closing tags, and write the content with the indentation you have counted.

Why don't you write a simple Java parser to pretty-print the HTML yourself? Here is a sketch:
Track opening and closing tags, for example, and
keep a counter to figure out the current indentation level.
Perhaps use a stack to push and pop the indentation level.
Just iterate through the HTML string and push the current indentation level on the stack when you see a tag.
If you see a nested tag, increment the indentation level and keep going.
When you see a closing tag, pop the stack to go back to the previous indent level.
I wanted to give you a rough idea here; you can use this as a starting point. I have written many Perl-based pretty printers. You could use Perl to script a parser fairly quickly.
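The counter idea above could be sketched in plain Java roughly like this. It is only a minimal sketch, not a tested library: the class name HtmlIndenter, the two-space indent, and the list of void elements are my own assumptions. It splits the input into tags and text with a regular expression, so it will get confused by a '>' inside an attribute value or a comment, and it puts every tag and every text run on its own line (which can change rendering inside pre blocks or between inline elements). It never adds or removes a tag, though.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlIndenter {
    // Elements without a closing tag; they must not increase the depth.
    private static final Set<String> VOID_TAGS = new HashSet<>(Arrays.asList(
            "area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"));

    // Alternatives: a normal tag (group 1 = "/" if closing, group 2 = name),
    // any other <...> construct (comment, doctype), a run of text, or a stray '<'.
    private static final Pattern TOKEN = Pattern.compile(
            "<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|<[^>]*>|[^<]+|<");

    static String indent(String html) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String token = m.group().trim();
            if (token.isEmpty()) {
                continue; // whitespace between tags
            }
            String name = m.group(2);
            boolean closing = name != null && "/".equals(m.group(1));
            boolean opening = name != null && !closing
                    && !token.endsWith("/>")
                    && !VOID_TAGS.contains(name.toLowerCase());
            if (closing) {
                depth = Math.max(0, depth - 1); // never go below zero on stray closers
            }
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(token).append('\n');
            if (opening) {
                depth++;
            }
        }
        return out.toString();
    }
}
```

Calling HtmlIndenter.indent("<div><p>Hi</p><br></div>") puts each of the six tokens on its own line, with the p element and br one level deep and the text two levels deep.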

Related

How to fix hanging html tags in HTML fragment?

I am getting a possibly malformed HTML fragment from an external source:
<p>Include all the information someone would need to answer your <i><i>question<p>
How can I make it safe for rendering within a bigger HTML document, closing all hanging HTML tags, in Java?

You can try to parse the incoming string as XML; there are plenty of tools that do that. If parsing fails, the HTML is malformed (for instance, not all tags are correctly closed).
If you need stricter validation, you can additionally validate it against an XSD.
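A minimal sketch of such a check with the XML parser that ships with the JDK might look like the following. The class and method names are made up for illustration, and wrapping the fragment in a root element is my own assumption so that fragments with several top-level elements can be parsed. Keep in mind that perfectly valid HTML can fail this check too, for example an unclosed br tag or an entity such as &nbsp; that XML does not define.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    static boolean isWellFormed(String fragment) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The input comes from an external source, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            // Hypothetical wrapper element so that a fragment has a single root.
            String wrapped = "<root>" + fragment + "</root>";
            builder.parse(new InputSource(new StringReader(wrapped)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed XML
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

This only tells you that something is wrong; it does not repair the fragment.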

You can achieve that by writing your own custom parser in Java and fixing the tags.
The idea is to collect all opening tags and look for the matching closing tag of each one in the string.
Where no closing tag is found, you can insert one.
You need to handle duplicated tags as well as tags that are already valid before and after the broken ones.
Otherwise, you can try one of these handy open-source parsers, which help with exactly that:
http://java-source.net/open-source/html-parsers
http://htmlcleaner.sourceforge.net/ looks like a good option.
Hope this helps.

How do I truncate HTML strings to remove broken invalid HTML fragments?

In my Java webapp, I create summary text from long HTML text. During truncation, HTML fragments in the string often break, producing an HTML string with invalid, broken fragments, like this example:
Visit this link <img src="htt
Is there a Java library that deals with this better, so that broken fragments like the one above are avoided?
Or could I let this be included in the HTML pages and somehow deal with it in client-side code?

Browsers can usually cope with almost any garbage you feed them (unless it's XHTML). So if the only thing that happens to the input (assuming it was valid HTML of some kind) is that it gets sliced, all you really have to worry about is getting rid of incomplete opening tags; you can't detect broken 'endings' of tags, since on their own they aren't special in any way. I'd take the slice I've generated and scan it from the end; if I encounter a stray '<', I'd drop it and everything after it. Likewise, I'd keep track of the last opened tag: if the next closing tag after it doesn't close that exact tag, its closing tag was probably cut off, so I'd insert it.
This would still generate a lot of garbage, but it would at least fix some rudimentary problems.
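The "scan from the end and drop a stray '<'" step could look roughly like this in Java. The class and method names are hypothetical, and the sketch assumes that a literal '<' never appears in the text itself (i.e. it is always written as &lt;).

```java
class TruncationCleaner {
    // Removes a tag that was cut in half at the end of a truncated HTML string.
    static String dropDanglingTag(String html) {
        int lastOpen = html.lastIndexOf('<');
        int lastClose = html.lastIndexOf('>');
        // A '<' with no '>' after it means the cut fell inside a tag.
        if (lastOpen > lastClose) {
            return html.substring(0, lastOpen);
        }
        return html;
    }
}
```

For the example from the question, dropDanglingTag("Visit this link <img src=\"htt") returns "Visit this link " (with the trailing space), while a string that ends in a complete tag is returned unchanged.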
A better way would be to maintain a stack of opened tags and generate or remove the needed, broken, or unnecessary ones as they appear. A stack is the right structure since HTML tags mustn't 'cross' [by the spec; AFAIR this has been the case since HTML 4], i.e. <span><div></span></div> isn't valid.
A much better way would be to parse the document first as SGML/HTML/XML (depending on the exact HTML doctype) and truncate it afterwards; then you can just remove nodes without damaging the structure.
Note that it's impossible to say for sure whether a tag is correct without knowing the exact algorithm you use to generate this 'garbled' content.
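Here is a rough sketch of the stack approach mentioned above, limited to the simplest case: it only appends closing tags for elements that are still open at the end of the string. The class name, the regular expression, and the list of void elements are my own assumptions. It does not repair crossed tags in the middle of the input, and it should run after any half-cut tag at the end has been removed.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagCloser {
    // Elements without a closing tag; they are never pushed on the stack.
    private static final Set<String> VOID_TAGS = new HashSet<>(Arrays.asList(
            "area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"));

    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    static String closeOpenTags(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            if (VOID_TAGS.contains(name) || m.group().endsWith("/>")) {
                continue;
            }
            if (m.group(1).isEmpty()) {
                open.push(name); // opening tag
            } else if (open.contains(name)) {
                // Closing tag: unwind the stack down to its opener.
                String popped;
                do {
                    popped = open.pop();
                } while (!popped.equals(name));
            }
            // A closing tag without any opener is left alone.
        }
        StringBuilder out = new StringBuilder(html);
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

For example, closeOpenTags("<div><b>Hello <i>wor") gives "<div><b>Hello <i>wor</i></b></div>".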

I used owasp-java-html-sanitizer to fix those broken fragments and generate safe HTML markup from Java.
    import org.owasp.html.PolicyFactory;
    import org.owasp.html.Sanitizers;
    PolicyFactory policy = Sanitizers.LINKS.and(Sanitizers.IMAGES);
    String safeHTML = policy.sanitize(htmlString);
This seemed to be the easiest of all the solutions I came across.

get part with regex

I need to get everything between
onmouseout="this.style.backgroundColor='#fff'">
and the following <
In this case:
onmouseout="this.style.backgroundColor='#fff'">example<
I would like to get the word example.
Here is a more complicated example where it should work as well:
onmouseout="this.style.backgroundColor='#fff'">going to drink?<br></span><span title="Juist!" onmouseover="this.style.backgroundColor='#ebeff9'" onmouseout="this.style.backgroundColor='#fff'">Exactly!</span></span></div></div>
So here I need two of them back (and not joined).
Could someone help? I suck at regex.
Someone edited my tag to javascript.
I need a solution to use in Java; I just get a file as plain text. So JavaScript or HTML solutions are not really helpful.

Regex with HTML? Well, if you only have to parse a few lines, then OK. But in general it is better to use an HTML parser (because HTML is not a regular language).
This is pure gold: https://stackoverflow.com/a/1732454/434171
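For the narrow case described in the question (a fixed marker followed by text up to the next '<'), a java.util.regex sketch could look like this. The class and method names are made up for illustration, and the marker string is copied from the question; anything beyond this simple pattern is better left to a real HTML parser, as said above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkerExtractor {
    // The literal marker from the question; Pattern.quote escapes its special characters.
    private static final String MARKER =
            "onmouseout=\"this.style.backgroundColor='#fff'\">";

    // Group 1 captures everything after the marker up to (not including) the next '<'.
    private static final Pattern TEXT_AFTER_MARKER =
            Pattern.compile(Pattern.quote(MARKER) + "([^<]*)<");

    static List<String> extract(String input) {
        List<String> results = new ArrayList<>();
        Matcher m = TEXT_AFTER_MARKER.matcher(input);
        while (m.find()) {
            results.add(m.group(1)); // one entry per occurrence, not joined
        }
        return results;
    }
}
```

On the longer example from the question this returns a list with two separate entries, "going to drink?" and "Exactly!".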

getting text that will be displayed to user from html

Bit of a random one: I want to have a play with some NLP stuff and I would like to:
Get all the text that will be displayed to the user in a browser from HTML.
My ideal output would not have any tags in it and would only have full stops (and any other punctuation used) and newline characters, though I can tolerate a fairly reasonable amount of failure in this (random other stuff ending up in the output).
If there was a way of inserting a newline or full stop in situations where the content was likely not to continue, that would be an added bonus, e.g.:
items in a ul or option tag could be separated by full stops (or, to be honest, just ignored).
I am working in Java, but would be interested in seeing any code that does this.
I can (and will if required) come up with something to do this; I just wondered if there was anything out there like this already, as it would probably be better than what I come up with in an afternoon ;-).
An example of the code I might write if I do end up doing this would be to use a SAX parser to find content in p tags, strip it of any span, strong, etc. tags, and add a full stop if I hit a div or another p without having had a full stop.
Any pointers or suggestions are very welcome.

Hmmm ... almost any HTML parser could be used to create the effect you want -- just run through all of the tags and emit only the text elements, and emit a LF for the closing tag of every block element. As you say, a SAX implementation would be simple and straight-forward.
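A minimal sketch of that idea with the SAX parser from the JDK is below. It assumes the input is already well-formed XML (for example after running it through an HTML cleaner), since SAX rejects ordinary tag soup. The class name and the sets of block and skipped elements are my own choices, not a complete list.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class VisibleTextExtractor extends DefaultHandler {
    // Closing one of these emits a line feed (illustrative selection).
    private static final Set<String> BLOCK_TAGS = new HashSet<>(Arrays.asList(
            "p", "div", "li", "br", "tr", "option",
            "h1", "h2", "h3", "h4", "h5", "h6"));
    // Text inside these elements is never shown to the user.
    private static final Set<String> SKIPPED_TAGS =
            new HashSet<>(Arrays.asList("script", "style", "head"));

    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (SKIPPED_TAGS.contains(qName.toLowerCase())) {
            skipDepth++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = qName.toLowerCase();
        if (SKIPPED_TAGS.contains(name)) {
            skipDepth--;
        } else if (skipDepth == 0 && BLOCK_TAGS.contains(name)) {
            text.append('\n');
        }
    }

    static String extract(String xhtml) {
        VisibleTextExtractor handler = new VisibleTextExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return handler.text.toString();
    }
}
```

Inline tags such as strong or span simply disappear, so "<p>Hello <strong>world</strong></p>" yields "Hello world" followed by a line feed. Inserting full stops instead of line feeds would be a small change in endElement.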

I would just strip out everything that has <> tags, and if you want a full stop at the end of every sentence, check for closing tags and place a full stop there.
If you have
<strong> test </strong>
(or other tags that only change the look of the text), you could add conditions so that no full stop is placed there.

HTML parsers seem to be a reasonable starting point for this.
There are a number of them; for example, HTMLCleaner and NekoHTML seem to work fine.
They are good because they fix the tags, which lets you process them more consistently, even if you are just removing them.
But as it turns out, you probably also want to get rid of script tags, metadata, etc. In that case you are better off working with well-formed XML, which these tools produce for you from "wild" HTML.
There are many SO questions relating to this (like this one); you should search for "HTML parsing" though ;-)

HTML Parser to extract text out of the body (in java)

I am working on a project that requires me to carry out some text manipulation on the text that I obtain from web pages.
The first step towards doing this would be to find a parser that extracts the required body text and ignores the redundant information. I am not sure how to do this, since I am extremely new to programming. I would really appreciate any help I could get.
Thanks in advance

I found this HTML parser very useful. It also provides a sample: http://jericho.htmlparser.net/docs/index.html

I am just now doing it using HTMLParser, available at Sourceforge:
http://sourceforge.net/projects/htmlparser/
Seems very easy and straightforward, but since you claim to be new at this, here is an example with source code:
http://kickjava.com/src/org/htmlparser/parserapplications/StringExtractor.java.htm
