Possible Duplicate:
Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?
I need to extract paragraphs (like title in StackOverflow) from an html file.
I can use regular expressions in Java to extract the fields I need but I have to decode the fields obtained.
EXAMPLE
field extracted:
Paging Lucene's search results (with **;** among **'** and **s**)
field after decoding:
Paging Lucene's search results
Is there any class in Java that will allow me to decode these HTML codes?
Use the methods provided by Apache Commons Lang:
import org.apache.commons.lang.StringEscapeUtils;
// ...
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);
Do not try to solve everything with regular expressions.
While you can do some parts that way, such as replacing entities, the much better approach is to use a robust HTML parser.
See the question "RegEx match open tags except XHTML self-contained tags" for why this is a bad idea to do with the regexp Swiss army chainsaw. Seriously, read that question and its top answer; it is a Stack Overflow highlight!
Chuck Norris can parse HTML with regex.
The bad news is: there is more than one way to encode characters.
https://en.wikipedia.org/wiki/Character_encodings_in_HTML
For example, the character 'λ' can be represented as &lambda;, &#955; or &#x3BB;
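As a small illustration that the numeric forms name the same character, here is a sketch using only the standard library. It assumes the decimal and hexadecimal references for λ are &#955; and &#x3BB; (code point U+03BB); the class and method names are made up for this example, and it is not meant as a replacement for a decoding library.

```java
class CodePointDemo {
    /** Converts a Unicode code point to a String; also works above U+FFFF. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int fromDecimal = Integer.parseInt("955");     // digits of &#955;
        int fromHex = Integer.parseInt("3BB", 16);     // digits of &#x3BB;
        System.out.println(fromDecimal == fromHex);    // prints true
        System.out.println(fromCodePoint(fromHex));    // the character itself
    }
}
```

Named references such as &lambda; cannot be computed this way; they need a lookup table, which is one more reason to rely on a library.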
And if you are really unlucky, a web site relies on browsers guessing what a character is meant to be. For example, &#153; is not a valid reference, yet many browsers will interpret it as ™.
Clearly it is a good idea to leave this to a dedicated library instead of trying to hack a custom regular expression yourself.
So I strongly recommend:
Feed string into a robust HTML parser
Get parsed (and fully decoded) string back
I have a huge XML file which contains a lot of hexadecimal character references, e.g. &#xA7;. How can I convert these references throughout the file to the corresponding characters, e.g. § for &#xA7;, in Java?
<md.first.line.cite>UT ST &#xA7;&#x2002;10-2-409</md.first.line.cite>
You mentioned that you want to convert to §, so this is one way to convert the hex value to the desired character:
System.out.println((char)0xA7);
Or
int hex=0xA7;
System.out.println((char)hex);
Let's start with what &#xA7; and &#x2002; mean. These are XML character references. They represent the Unicode code points U+00A7 and U+2002. (This is real XML syntax, not just some random nuisance escape sequence that needs to be dealt with.)
So, if you parse that XML with a conformant XML parser, the parser will automatically take care of translating the references to the corresponding Unicode code-points. Your application should not need to do any translating.
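A minimal sketch of that behaviour with the XML parser that ships with Java SE. It assumes the element content is UT ST &#xA7;&#x2002;10-2-409, as the code points named above suggest; the class and method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlReferenceDemo {
    /** Parses an XML snippet and returns the text of its root element. */
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // The parser has already replaced character references with characters.
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml =
                "<md.first.line.cite>UT ST &#xA7;&#x2002;10-2-409</md.first.line.cite>";
        String text = rootText(xml);
        System.out.println(text.indexOf((char) 0xA7) >= 0);  // prints true
        System.out.println(text.contains("&#"));             // prints false
    }
}
```

For a huge file, a streaming parser (SAX or StAX) would be the more likely choice than DOM, but the references are resolved the same way.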
This implies that you are NOT using a proper XML parser in your application. Bad idea! Doing your own XML parsing by string bashing or using regexes tends to give inflexible code and/or unreliable results when faced with "variant" XML. So my main recommendation would be:
Use a standard off-the-shelf XML parser.
If your XML is non-compliant, consider using Jsoup or similar to extract information from the XML.
If you are already deep down the rabbit hole of string bashing, etc., the best thing to do would be to extract the entire encoded XML text segment and convert it to a String using existing library code. The standard Java SE class libraries don't provide this functionality, but you could use StringEscapeUtils.unescapeXml() from org.apache.commons.text. (The version from org.apache.commons.lang3 has been deprecated.)
I need to clean an HTML string of accented characters and HTML accent codes. I have found a lot of code that does this, but none of it seems to work with the file I need to clean.
This file contains words like Postulación Ayudantías and also Gestión or Árbol
I found a lot of code that uses text.normalize and regular expressions to clean the string. It works well with short strings, but my strings are very long and the same code doesn't work with them.
I am really lost here and I need help, please!
This is the code I tried that didn't work:
Easy way to remove UTF-8 accents from a string? (return "?" for every accent in the String)
I also used replaceAll to remove the HTML accent codes, but that isn't working either:
string = string.replaceAll("&aacute;", "a");
string = string.replaceAll("&eacute;", "e");
string = string.replaceAll("&iacute;", "i");
string = string.replaceAll("&oacute;", "o");
string = string.replaceAll("&uacute;", "u");
string = string.replaceAll("&ntilde;", "n");
Edit: never mind, the replaceAll is working; I had written it wrong ("/&aacute; instead of "&aacute;).
Any help or ideas?
I think there are several options that would work. I would suggest that you first use StringEscapeUtils.unescapeHtml4(String) to unescape your HTML entities (that is, convert them to ordinary Java characters).
Then you could use an ASCIIFoldingFilter to filter to "ASCII" equivalents.
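If pulling in Lucene only for ASCIIFoldingFilter is too much, the second step can also be sketched with the standard library. This is a different technique from the filter: it decomposes each character with java.text.Normalizer and drops the combining marks. It assumes the string has already been unescaped, and the class and method names are made up for the example.

```java
import java.text.Normalizer;

class AccentStripper {
    /** Removes diacritics: decompose to NFD, then drop the combining marks. */
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of source-file encoding.
        System.out.println(stripAccents("Gesti\u00F3n"));  // prints Gestion
        System.out.println(stripAccents("\u00C1rbol"));    // prints Arbol
    }
}
```

Letters that have no decomposition (for example ø or ß) pass through unchanged, which is where a folding table like the Lucene filter covers more cases.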
You need to differentiate whether you're talking about a whole HTML document containing tags and so forth or just a string containing HTML encoded data.
If you're working with an entire HTML document, say, something returned by fetching a web page, then the solution is really more than could fit into a stack overflow answer, since you basically need an HTML parser to navigate the data.
However, if you're just dealing with a string that's HTML encoded, then you first need to decode it. There are lots of utilities to do so, such as the Apache Commons Lang library StringEscapeUtils class. See this question for an example.
Once you've decoded the string, you need to iterate over it character by character and replace anything that's unwanted. Your current method won't work for hex encoded items, and you're going to end up having to build a huge table to cover all the possible HTML entities.
I am facing a very difficult problem, which is the following:
I have a number of HTML-formatted strings. They were generated from a document element whose text was edited as RTF and saved as HTML (to display it on a website).
The problem is that some RTF elements that are converted to HTML seem to be unusable in HTML, which causes a crash. One of the characters disallowed in HTML is, for example, %0b.
According to http://www.tutorialspoint.com/html/html_url_encoding.htm it has no function, or at least I can't figure out why it is needed (in fact, it can't even be copied).
My question is: is there a function (I have already searched) that can remove all non-HTML characters from such an RTF-to-HTML string?
I just need to remove them when the HTML is loaded, so that there are no display problems.
Use methods provided by Apache Commons Lang
import org.apache.commons.lang.StringEscapeUtils;
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);
Credit to: #jlordo
Or you can use replaceAll("%0b", "");
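Note that %0b is the URL-encoded form of the byte 0x0B (vertical tab). If the string contains the actual control character and not the literal text %0b, a sketch like the following may be closer to what is needed. It assumes that tab, line feed and carriage return should be kept; the class and method names are invented for the example.

```java
class ControlCharCleaner {
    /** Removes ASCII control characters except tab, line feed and carriage return. */
    static String stripControlChars(String input) {
        // \p{Cntrl} is [\x00-\x1F\x7F]; the intersection excludes \t, \n and \r
        // from the set of characters to remove.
        return input.replaceAll("[\\p{Cntrl}&&[^\\t\\n\\r]]", "");
    }

    public static void main(String[] args) {
        String dirty = "line one" + (char) 0x0B + "\nline two";
        String clean = stripControlChars(dirty);
        System.out.println(clean.equals("line one\nline two"));  // prints true
    }
}
```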
I need to encode a Java String to display properly as part of an HTML document, e.g. all new lines or tab characters. So I have e.g. "one\ntwo", and I want to get something like "one<br>two". Do you know any library that would do it for me?
Try wrapping the text in the preformatted-text tag <pre>.
Spring Framework's HtmlUtils.htmlEscape(String) method should do the trick:
Turn special characters into HTML character references. Handles complete character set defined in HTML 4.01 recommendation.
You don't need a library for those simple cases. A simple loop over string.replaceAll() should do the trick.
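A minimal sketch of such a chain of replacements, using only the standard library. It uses String.replace (literal replacement) where the text above mentions replaceAll (regex), since no pattern matching is needed here. Rendering a tab as four non-breaking spaces is an assumption, and the class and method names are made up for the example.

```java
class HtmlText {
    /** Escapes HTML-special characters, then turns line breaks and tabs into markup. */
    static String toHtml(String text) {
        return text
                .replace("&", "&amp;")    // first, so later entities are not re-escaped
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("\r\n", "\n")    // normalize Windows line endings
                .replace("\n", "<br>")
                .replace("\t", "&nbsp;&nbsp;&nbsp;&nbsp;");  // assumption: tab = 4 spaces
    }

    public static void main(String[] args) {
        System.out.println(toHtml("one\ntwo"));  // prints one<br>two
        System.out.println(toHtml("a < b"));     // prints a &lt; b
    }
}
```

Escaping before inserting the <br> tags matters: done the other way round, the tags themselves would be escaped.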
If you are looking for more fancy conversions (such as done here on SO or in a wiki) you can check out the Java Wikipedia API. Code example here. Although I guess it may be a bit overkill for your needs.
I want to parse HTML code and create objects from their text representation in a table. I have several columns and I want to save the content of certain columns in every row.
Now, I have the HTML code and I understand I should use Pattern and Matcher to get those strings, but I don't know how to write the required regular expression.
This is a row I will be parsing:
<tr><td>Delirium</td><td>65...</tr>
So, I want to extract Delirium from that string. How do I write a regular expression that says
get me the string that is between the string htm"> and </a></td>
?
This is a common question on SO and the answer is always the same: regular expressions are a poor and limited tool for parsing HTML because HTML is not a regular language.
You should be using an HTML parser, for example HTML Parser.
If you're curious what I mean by "regular language", have a look at JMD, Markdown and a Brief Overview of Parsing and Compilers. Basically a regular expression is a DFA (deterministic finite automaton or deterministic finite state machine). HTML requires a PDA (pushdown automaton) to parse. A PDA is a DFA with a stack. It's how it handles recursive elements.
htm">(.+)</a></td>
This matches one or more characters (the .+ part) between htm"> and </a></td>. The parentheses around .+ form a capturing group, so after matching with Pattern and Matcher you can read the text in between with group(1).
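A sketch of how that pattern could be used with Pattern and Matcher. Two things here are assumptions and not part of the original answer: the sample row (its href values are invented) and the reluctant quantifier .+?, which stops at the first </a></td> where a greedy .+ would run to the last one on the line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkTextExtractor {
    // Same idea as htm">(.+)</a></td>, but reluctant so it stops at the first match.
    private static final Pattern LINK_TEXT = Pattern.compile("htm\">(.+?)</a></td>");

    /** Returns the first captured link text, or null if the row does not match. */
    static String firstLinkText(String row) {
        Matcher m = LINK_TEXT.matcher(row);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical row; the href value is invented for this example.
        String row = "<tr><td><a href=\"delirium.htm\">Delirium</a></td><td>65</td></tr>";
        System.out.println(firstLinkText(row));  // prints Delirium
    }
}
```

This only holds up while the markup keeps exactly this shape, which is the point the parser-based answer makes.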
http://www.regular-expressions.info/java.html