I need to strip accents and HTML accent entities from an HTML string. I have found plenty of code that claims to do this, but none of it seems to work with the file I need to clean.
This file contains words like Postulación Ayudantías, Gestión and Árbol.
Most of the code I found uses text.normalize and regular expressions to clean the string. That works well for short strings, but my strings are very long, and the same code does not work on them.
I am really lost here and I need help please!
This is the code I tried that didn't work:
Easy way to remove UTF-8 accents from a string? (returns "?" for every accent in the string)
I also used replaceAll with regular expressions to remove the HTML accent codes, but neither approach is working:
string=string.replaceAll("&aacute;","a");
string=string.replaceAll("&eacute;","e");
string=string.replaceAll("&iacute;","i");
string=string.replaceAll("&oacute;","o");
string=string.replaceAll("&uacute;","u");
string=string.replaceAll("&ntilde;","n");
Edit: never mind, the replaceAll is working; I had written it wrong ("/&aacute;" instead of "&aacute;").
Any help or ideas?
I think there are several options that would work. I would suggest that you first
use StringEscapeUtils.unescapeHtml4(String) (from Apache Commons Lang 3) to unescape your HTML entities, that is, convert them to the corresponding Unicode characters in a normal Java String.
Then you could use Lucene's ASCIIFoldingFilter to fold the accented characters to their ASCII equivalents.
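If you would rather not pull in extra libraries, here is a rough JDK-only sketch of the same two steps. Note the substitutions: a deliberately tiny, hypothetical entity table stands in for unescapeHtml4, and java.text.Normalizer stands in for ASCIIFoldingFilter. The class and method names (AccentCleaner, clean, and so on) are made up for illustration, and Map.of assumes Java 9 or later.

```java
import java.text.Normalizer;
import java.util.Map;

class AccentCleaner {
    // Hypothetical, deliberately tiny entity table (assumption: only these
    // entities occur). A real table is much larger; a library covers it all.
    private static final Map<String, String> ENTITIES = Map.of(
        "&aacute;", "\u00e1",  // a with acute
        "&eacute;", "\u00e9",  // e with acute
        "&iacute;", "\u00ed",  // i with acute
        "&oacute;", "\u00f3",  // o with acute
        "&uacute;", "\u00fa",  // u with acute
        "&ntilde;", "\u00f1"); // n with tilde

    // Step 1: turn the known named entities into real characters.
    static String unescapeKnownEntities(String s) {
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        return s;
    }

    // Step 2: NFD splits an accented letter into base letter + combining mark,
    // then \p{M} removes the combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    static String clean(String html) {
        return stripAccents(unescapeKnownEntities(html));
    }
}
```

Calling AccentCleaner.clean(yourString) runs both steps. Neither step depends on the length of the string, so this may also help narrow down why other snippets fail on long input.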
You need to differentiate whether you're talking about a whole HTML document containing tags and so forth or just a string containing HTML encoded data.
If you're working with an entire HTML document, say, something returned by fetching a web page, then the solution is really more than could fit into a Stack Overflow answer, since you basically need an HTML parser to navigate the data.
However, if you're just dealing with a string that's HTML encoded, then you first need to decode it. There are lots of utilities to do so, such as the Apache Commons Lang library StringEscapeUtils class. See this question for an example.
Once you've decoded the string, you need to iterate over it character by character and replace anything that's unwanted. Your current method won't work for hex encoded items, and you're going to end up having to build a huge table to cover all the possible HTML entities.
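To illustrate the point about hex-encoded items, here is a minimal sketch that decodes only numeric character references, both decimal (&#243;) and hexadecimal (&#xF3;), using nothing but the JDK. It does not handle named entities such as &aacute;, which is where the large table comes in. NumericEntityDecoder and decode are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumericEntityDecoder {
    // Group 1 captures hex digits (&#xF3;), group 2 captures decimal digits (&#243;).
    private static final Pattern NUMERIC_REF =
        Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    static String decode(String s) {
        Matcher m = NUMERIC_REF.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            out.append(s, last, m.start());
            String hex = m.group(1);
            try {
                int codePoint = (hex != null)
                    ? Integer.parseInt(hex, 16)
                    : Integer.parseInt(m.group(2));
                out.appendCodePoint(codePoint);
            } catch (IllegalArgumentException ex) {
                // Out-of-range or invalid code point: keep the reference as written.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }
}
```

After this step the string holds real characters, so the character-by-character replacement described above has something to work on.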
I have a problem with extracting text from scientific articles.
I use PDFBox to extract text from PDFs. The problem is not the extraction process itself but some special math notations, which cause trouble when I write the extracted text into an XML file: a special character that is not extracted correctly breaks it. Instead of the character, HTML character codes are inserted into the XML file and ruin the whole file. How can I fix this issue?
The HTML codes I mean look like these, and at the moment number 218 is the one causing trouble. But I guess that different math notations will be replaced by different HTML codes and cause the same problem later on.
I have already tried the following string cleanups, but they didn't help:
nextWord=nextWord.replaceAll("[-+.^:,]", "");
nextWord=nextWord.replaceAll("\\s+", "");
nextWord=nextWord.replaceAll("[^\\x00-\\x7F]", "");
You could add a pre-check before writing each line to the file, to verify that the text contains only unambiguous characters. The pattern below covers the basic characters found in a typical textbook; add or remove characters to suit your content.
// Returns true only if the word consists solely of whitelisted characters.
public boolean isValidCharacters(String word) {
    // Whitelist of allowed characters; extend or trim it as needed.
    String pattern = "^[a-zA-Z0-9~@#$^*()_+={}|\\,.?: -]*$";
    return word.matches(pattern);
}
You can write something yourself with a regex, or, if you have other String manipulations to do, the Apache StringUtils class is really great. It has isAlpha() and isNumeric() methods that are easy to use.
https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html
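If rejecting whole words is too strict, another option is to sanitize the text before it goes into the XML file. The sketch below assumes the file breaks because of code points that XML 1.0 does not allow, or because of unescaped markup characters; that is a guess about the cause, not something the question confirms. XmlTextSanitizer and its methods are hypothetical names, and only the JDK is used.

```java
class XmlTextSanitizer {
    // Character ranges permitted by the XML 1.0 "Char" production.
    static boolean isAllowedInXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops code points XML 1.0 forbids and escapes the markup characters.
    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (!isAllowedInXml10(cp)) {
                return; // skip forbidden characters such as most control codes
            }
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

Unlike replaceAll("[^\\x00-\\x7F]", ""), this keeps legitimate non-ASCII characters such as Greek letters, which probably matters for math notation.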
I am facing a very difficult problem, which is the following:
I have a number of HTML-formatted strings. They were generated by a Document element, where the text was edited as RTF and saved as HTML (to display it on a website).
The problem is that some RTF elements that are parsed to HTML seem to be unusable in HTML, which causes it to crash. One of the characters disallowed in HTML is, e.g., %0b.
According to http://www.tutorialspoint.com/html/html_url_encoding.htm it has no function, or at least I can't figure out why it is needed (in fact, it can't even be copied).
My question is: is there a function out there (I have already searched) that can eliminate all non-HTML characters from such an RTF-to-HTML string?
I just need to eliminate them when the HTML is loaded, so that there aren't any display problems.
Use methods provided by Apache Commons Lang
import org.apache.commons.lang.StringEscapeUtils;
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);
Credit to: @jlordo
Or you can use replaceAll("%0b", "");
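Keep in mind that %0b is the URL-encoded form of the vertical tab character (U+000B). If the string holds the raw character rather than the literal text "%0b", the replaceAll above will not match it. A small sketch under that assumption, removing every control character except tab, line feed and carriage return (ControlCharStripper is a hypothetical name):

```java
class ControlCharStripper {
    // \p{Cntrl} matches U+0000..U+001F and U+007F; the intersection (&&)
    // with [^\t\n\r] keeps tab, line feed and carriage return untouched.
    static String strip(String html) {
        return html.replaceAll("[\\p{Cntrl}&&[^\\t\\n\\r]]", "");
    }
}
```

You would call ControlCharStripper.strip(html) right after loading the string and before handing it to whatever renders the HTML.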
I have a div that displays some text, which I get from the DB. This text can contain special characters like "\",">","<" etc. When I try to display this text in my page, these special characters are not visible, for obvious reasons. How should I handle this situation?
Since you have mentioned database, I am assuming that you have Java involved...
That being said, you can take a look at Apache's StringEscapeUtils and escape your strings accordingly.
In your JavaScript you can write a function that replaces all the special characters with their HTML codes.
Have a look at this answer: Convert special characters to HTML in Javascript
Alternatively, write a function on the Java side that converts all of these (or the expected) special characters and returns the result to the front end.
e.g.
String convert(String text) {
    // Escape '&' first so the entities produced below are not escaped twice.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
}
Possible Duplicate:
Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?
I need to extract paragraphs (like title in StackOverflow) from an html file.
I can use regular expressions in Java to extract the fields I need but I have to decode the fields obtained.
EXAMPLE
field extracted:
Paging Lucene&#39;s search results
field after decoding:
Paging Lucene's search results
Is there any class in java that will allow me to convert these html codes?
Use methods provided by Apache Commons Lang
import org.apache.commons.lang.StringEscapeUtils;
// ...
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);
Do not try to solve everything by regexp.
While you can do some parts - such as replacing entities, the much better approach is to actually use a (robust) HTML parser.
See this question: RegEx match open tags except XHTML self-contained tags
for why this is a bad idea to do with the regexp swiss army chainsaw. Seriously, read this question and the top answer, it is a stack overflow highlight!
Chuck Norris can parse HTML with regex.
The bad news is: there is more than one way to encode characters.
https://en.wikipedia.org/wiki/Character_encodings_in_HTML
For example, the character 'λ' can be represented as &lambda;, &#955; or &#x3BB;.
And if you are really unlucky, some website relies on a browser's ability to guess character meanings. For example, &#153; is not valid, yet many browsers will interpret it as ™.
Clearly it is a good idea to leave this to a dedicated library instead of trying to hack a custom regular expression yourself.
So I strongly recommend:
Feed string into a robust HTML parser
Get parsed (and fully decoded) string back
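As a rough illustration of those two steps, the sketch below uses the HTML parser that ships with the JDK (javax.swing.text.html). That choice is mine, made only to keep the example free of dependencies; the parser is old and targets HTML 3.2, so a dedicated parsing library is probably the better pick for real pages. HtmlTextExtractor and extractText are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HtmlTextExtractor {
    // Feeds the HTML to the parser and collects the decoded text nodes.
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // The parser has already resolved character references here.
                text.append(data);
            }
        };
        try {
            // Third argument true: ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

The point of the design is that no regular expression ever touches the markup: tags are dropped and entities decoded by the parser itself.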
I need to encode a Java String to display properly as part of an HTML document, e.g. all new lines or tab characters. So I have e.g. "one\ntwo", and I want to get something like "one<br>two". Do you know any library that would do it for me?
Try using the preformatted-text tag, <pre>.
Spring Framework's HtmlUtils.htmlEscape(String) method should do the trick:
Turn special characters into HTML character references. Handles complete character set defined in HTML 4.01 recommendation.
You don't need a library for those simple cases. A simple loop over string.replaceAll() should do the trick.
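A minimal sketch of such a loop, using the literal String.replace rather than replaceAll so that no regex escaping is needed. Two things here go beyond the question and are my own assumptions: escaping &, < and > before inserting the <br> tags, and rendering a tab as four non-breaking spaces. HtmlLineBreaks and toHtml are hypothetical names.

```java
class HtmlLineBreaks {
    // Order matters: '&' must be escaped first, and "\r\n" handled before "\n".
    private static final String[][] REPLACEMENTS = {
        {"&", "&amp;"},
        {"<", "&lt;"},
        {">", "&gt;"},
        {"\r\n", "<br>"},
        {"\n", "<br>"},
        {"\t", "&nbsp;&nbsp;&nbsp;&nbsp;"} // assumption: one tab shown as four spaces
    };

    static String toHtml(String text) {
        String result = text;
        for (String[] r : REPLACEMENTS) {
            result = result.replace(r[0], r[1]);
        }
        return result;
    }
}
```

With this, HtmlLineBreaks.toHtml("one\ntwo") gives "one<br>two", as asked in the question.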
If you are looking for more fancy conversions (such as done here on SO or in a wiki) you can check out the Java Wikipedia API. Code example here. Although I guess it may be a bit overkill for your needs.