Encoding strings into HTML in Java - java

I need to encode a Java String to display properly as part of an HTML document, e.g. all new lines or tab characters. So I have e.g. "one\ntwo", and I want to get something like "one<br>two". Do you know any library that would do it for me?

Try using the preformatted-text tag <pre>, which keeps line breaks and tabs as they are written.

Spring Framework's HtmlUtils.htmlEscape(String) method should do the trick. From its documentation:
Turn special characters into HTML character references. Handles complete character set defined in HTML 4.01 recommendation.

You don't need a library for those simple cases. A simple loop over string.replaceAll() should do the trick.
If you are looking for more fancy conversions (such as done here on SO or in a wiki) you can check out the Java Wikipedia API. Code example here. Although I guess it may be a bit overkill for your needs.
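For the simple case in the question, a hand-written loop is enough. The following is a minimal sketch, not a complete HTML encoder: the class and method names are made up for illustration, and rendering a tab as four non-breaking spaces is an assumption, since HTML has no tab equivalent outside <pre>.

```java
final class HtmlLineBreaks {
    /** Escapes HTML metacharacters, then maps line breaks to <br> and tabs to spaces. */
    static String toHtml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\r': break; // dropped; the '\n' of a "\r\n" pair produces the <br>
                case '\n': out.append("<br>");   break;
                case '\t': out.append("&nbsp;&nbsp;&nbsp;&nbsp;"); break; // assumption: tab = 4 spaces
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHtml("one\ntwo"));  // one<br>two
        System.out.println(toHtml("a < b & c")); // a &lt; b &amp; c
    }
}
```

Escaping happens in the same pass as the line-break conversion, so the inserted <br> tags are not escaped themselves.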

Related

Cleaning a String from HTML code and accents with Java

I need to clean an HTML string of accents and HTML accent codes. I have found a lot of code that does this, but none of it seems to work with the file I need to clean.
This file contains words like Postulación Ayudantías and also Gestión or Árbol.
I found a lot of code that uses text.normalize and regular expressions to clean the String. It works well with short strings, but I'm using very long strings, and the code that works with short strings doesn't work with long ones.
I am really lost here and I need help please!
These are the approaches I tried that didn't work:
Easy way to remove UTF-8 accents from a string? (returns "?" for every accent in the String)
and I used regular expressions to remove the HTML accent codes, but neither is working:
string=string.replaceAll("&aacute;","a");
string=string.replaceAll("&eacute;","e");
string=string.replaceAll("&iacute;","i");
string=string.replaceAll("&oacute;","o");
string=string.replaceAll("&uacute;","u");
string=string.replaceAll("&ntilde;","n");
Edit: nvm, the replaceAll is working; I wrote it wrong ("/&aacute; instead of "&aacute;)
Any help or ideas?
I think there are several options that would work. I would suggest that you first use StringEscapeUtils.unescapeHtml4(String) to unescape your HTML entities (that is, convert them to ordinary Unicode characters in a Java String).
Then you could use an ASCIIFoldingFilter (from Apache Lucene) to fold them to their ASCII equivalents.
You need to differentiate whether you're talking about a whole HTML document containing tags and so forth or just a string containing HTML encoded data.
If you're working with an entire HTML document, say, something returned by fetching a web page, then the solution is really more than could fit into a stack overflow answer, since you basically need an HTML parser to navigate the data.
However, if you're just dealing with a string that's HTML encoded, then you first need to decode it. There are lots of utilities to do so, such as the Apache Commons Lang library StringEscapeUtils class. See this question for an example.
Once you've decoded the string, you need to iterate over it character by character and replace anything that's unwanted. Your current method won't work for hex encoded items, and you're going to end up having to build a huge table to cover all the possible HTML entities.
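Once the entities are decoded, the accents themselves can often be removed without a lookup table by using the JDK's java.text.Normalizer. This is a sketch of that swapped-in technique, not the method either answer describes; the class and method names are hypothetical. It assumes the input has already been read with the correct charset (a "?" for every accent usually suggests it was not, though that is a guess about the asker's setup).

```java
import java.text.Normalizer;

final class AccentStripper {
    /** Decomposes accented letters (NFD) and removes the combining marks. */
    static String stripAccents(String input) {
        // NFD splits e.g. U+00F3 (o with acute) into 'o' followed by U+0301.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        System.out.println(stripAccents("Gesti\u00f3n \u00c1rbol")); // Gestion Arbol
    }
}
```

This only handles characters that have a canonical decomposition; letters such as ø or ß would still need explicit replacements.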

Java Regex for Finding a Pattern and Getting Value in It?

I am working on a plugin that will parse HTML files. I have a naming convention like this:
<!--$include="a.html" -->
or
<!--$include="a.html"-->
Both forms are equivalent.
Following this pattern (similar to server-side includes), I want to search an HTML file.
The question is:
How do I find that pattern and get its value (a.html in my example; it is variable)?
It should be something like:
while (!finishedWholeFile) {
    fileName = findPatternFunc(htmlFile);
    replaceFunc(fileName, something);
}
PS: I don't know which is better: using regex in Java, or implementing it differently (for example with .indexOf()). If regex performs well in this situation, I want to use it.
Any ideas?
You mean like this?
<!--\$include=\"(?<htmlName>[a-z_-]*)\.html\"\s?-->
Read a file into a string then
str = str.replaceAll("(?<=<!--\\$include=\")[^\"]+(?=\" ?-->)", something);
will replace the filenames with the string something, then the string can be written back to the file.
(Note: this replaces any text inside the double quotes, not just valid filenames.)
If you only want to replace filenames with the html extension, swap the [^\"]+ for [^.]+\\.html.
Using regex for this task is fine performance wise, but see e.g.
How to use regular expressions to parse HTML in Java? and Java Regex performance etc.
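To get the filename out rather than replace it in place, a capturing group with Pattern and Matcher is one option. The sketch below is illustrative only: the class and method names are invented, and replaceIncludes swaps out the whole comment, which is an assumption about what the asker's replaceFunc is meant to do (the replaceAll above replaces only the filename).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IncludeFinder {
    // Matches <!--$include="NAME"--> with optional whitespace before the closing -->.
    private static final Pattern INCLUDE =
            Pattern.compile("<!--\\$include=\"([^\"]+)\"\\s*-->");

    /** Returns every include filename, in document order. */
    static List<String> findIncludes(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = INCLUDE.matcher(html);
        while (m.find()) {
            names.add(m.group(1)); // group 1 is the text between the quotes
        }
        return names;
    }

    /** Replaces each whole include comment with the given text. */
    static String replaceIncludes(String html, String replacement) {
        // quoteReplacement stops '$' and '\' in the replacement from being treated specially.
        return INCLUDE.matcher(html).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String html = "<p><!--$include=\"a.html\" --></p><!--$include=\"b.htm\"-->";
        System.out.println(findIncludes(html)); // [a.html, b.htm]
    }
}
```

Compiling the Pattern once and reusing it avoids recompiling the regex for every file.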
I have used this pattern:
"<!--\\$include=\"(.+)(.)(html|htm)\"-->"

How to decode HTML codes using Java? [duplicate]

Closed 10 years ago.
Possible Duplicate:
Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?
I need to extract paragraphs (like title in StackOverflow) from an html file.
I can use regular expressions in Java to extract the fields I need but I have to decode the fields obtained.
EXAMPLE
field extracted:
Paging Lucene&#39;s search results
field after decoding:
Paging Lucene's search results
Is there any class in java that will allow me to convert these html codes?
Use methods provided by Apache Commons Lang
import org.apache.commons.lang.StringEscapeUtils;
// ...
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);
Do not try to solve everything with regexps.
While you can do some parts with them, such as replacing entities, the much better approach is to actually use a (robust) HTML parser.
See this question: RegEx match open tags except XHTML self-contained tags
for why this is a bad idea to do with the regexp swiss army chainsaw. Seriously, read this question and the top answer; it is a Stack Overflow highlight!
Chuck Norris can parse HTML with regex.
The bad news is: there is more than one way to encode characters.
https://en.wikipedia.org/wiki/Character_encodings_in_HTML
For example, the character 'λ' can be represented as &lambda;, &#955; or &#x3BB;
And if you are really unlucky, some web site relies on a browser's ability to guess character meanings. &#153; for example is not valid, yet many browsers will interpret it as ™.
Clearly it is a good idea to leave this to a dedicated library instead of trying to hack a custom regular expression yourself.
So I strongly recommend:
Feed string into a robust HTML parser
Get parsed (and fully decoded) string back

I want to stop < being converted to &lt; in Java when I save it as part of a String

I am storing some HTML as a String which I want to output to a JSP.
Is there a simple utility function that I should use for this, or should I write my own? I could write it easily, but I'd rather do it the most common way.
Thanks,
You don't need to do anything special. A simple expression will output the string without escaping: ${str}
It takes extra work to get escaping, such as using the JSTL <c:out/> tag. You must be doing something like that, either in the JSP, or to the String before the JSP is rendered.
I think you have to reinvent the wheel here, as Java does not provide any class that can do this for you. You can encode and decode URLs. The other way is to use someone else's wheel, such as Apache Commons Lang.
cheers
I'm not sure I completely follow your question. If you just want to print a string without escaping the output, try:
<% out.println(str); %>

get text between html tags

Possible duplicate: RegEx matching HTML tags and extracting text
I need to get the text between HTML tags like <p></p> or whatever. My pattern is this:
Pattern pText = Pattern.compile(">([^>|^<]*?)<");
Does anyone know a better pattern? This one is not very useful. I need it to index the content of a web page.
Thanks
SO is about to descend on you. But let me be the first to say, don't use regular expressions to parse HTML. Here is a list of Java HTML Parsers. Look around until you see an API that suits your fancy and use that instead.
It looks like you are trying to use the | operator inside a negative set, which is neither working nor needed. Just specify the characters that you don't want to match:
Pattern pText = Pattern.compile(">([^<>]*?)<");
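As a rough sketch of how the corrected pattern behaves with a Matcher (the class and method names are hypothetical): it also matches the empty gap between adjacent tags such as </p><p>, so blank captures are filtered out here. It remains a regex over markup, with the limitations the other answers point out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagTextRegex {
    // Text that sits between a '>' and the next '<', with no angle brackets inside.
    private static final Pattern TEXT = Pattern.compile(">([^<>]*?)<");

    /** Collects the non-blank text runs found between tags. */
    static List<String> textBetweenTags(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = TEXT.matcher(html);
        while (m.find()) {
            String text = m.group(1).trim();
            if (!text.isEmpty()) { // skip the empty match between "</p>" and "<p>"
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(textBetweenTags("<p>Hello</p><p>World</p>")); // [Hello, World]
    }
}
```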
Don't use regular expressions when parsing HTML.
Use XPath instead (if your HTML is well formed). You can reference text nodes using the text() function very easily.
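A minimal sketch of the XPath approach using only the JDK's javax.xml.xpath API. It assumes the input is well-formed XML with no DOCTYPE and no namespaces (real-world HTML often is not); the class name, method name, and the //p/text() expression are chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathTextExtractor {
    /** Returns the text nodes that are direct children of <p> elements. */
    static List<String> paragraphTexts(String xhtml) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//p/text()",
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getNodeValue());
            }
            return texts;
        } catch (XPathExpressionException e) {
            // Also raised when the input cannot be parsed as XML.
            throw new IllegalArgumentException("XPath evaluation failed; input may not be well-formed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>first</p><p>second</p></body></html>";
        System.out.println(paragraphTexts(page)); // [first, second]
    }
}
```

For pages that are not well formed, one of the HTML parsers mentioned above would have to produce the document tree first.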
