I am facing a very difficult problem:
I have a number of HTML-formatted Strings. They were generated by a Document element, where the text was edited as RTF and saved as HTML (to display it on a website).
The problem is that some RTF elements that are parsed to HTML seem to be unusable in HTML, which makes it crash. One of the characters disallowed in HTML is, for example, %0b.
According to http://www.tutorialspoint.com/html/html_url_encoding.htm it has no function, or at least I can't figure out why it is needed (in fact, it can't even be copied).
My question: is there a function out there (I have already searched) that can eliminate all non-HTML characters from such an rtf2html string?
I just need to eliminate them when the HTML is loaded, so that there aren't any display problems.
Use methods provided by Apache Commons Lang
import org.apache.commons.lang.StringEscapeUtils;
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);
Credit to: @jlordo
Alternatively, you can use replaceAll("%0b", "") to remove that sequence directly.
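If the string contains the actual vertical-tab character (U+000B, which is what %0b decodes to) rather than the literal text "%0b", a JDK-only sketch along these lines might help. The class and method names are made up for illustration, and keeping tab, line feed and carriage return is an assumption about what you want to preserve.

```java
class ControlCharStripper {
    // Removes ASCII control characters (U+0000 to U+001F and U+007F),
    // but keeps tab, line feed and carriage return (an assumption).
    static String stripControlChars(String input) {
        return input.replaceAll("[\\p{Cntrl}&&[^\\t\\n\\r]]", "");
    }

    public static void main(String[] args) {
        // The vertical tab between the two words is removed.
        String dirty = "<p>line one\u000Bline two</p>";
        System.out.println(stripControlChars(dirty));
    }
}
```

This only deletes characters; it does not decode entities, so it can be combined with the unescapeHtml call above if both are needed.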
I have a problem with extracting text from scientific articles.
I use PDFBox to extract text from PDFs. The problem is not in the extraction process itself but with some special math notation: when I write the extracted text into an XML file, special characters that were not extracted correctly cause trouble. Instead of the expected character, HTML codes are inserted into the XML file and ruin the whole file. How can I fix this issue?
At the moment, the HTML code with number 218 is the one causing trouble, but I guess that other math notations will produce other HTML codes and cause the same problem later.
I have already tried the following string cleanups, but they didn't help:
nextWord=nextWord.replaceAll("[-+.^:,]", "");
nextWord=nextWord.replaceAll("\\s+", "");
nextWord=nextWord.replaceAll("[^\\x00-\\x7F]", "");
You could add a pre-check before writing each line to the file, to verify that the text does not contain problematic characters. The pattern below contains the basic characters found in a typical textbook. You may add or remove characters to suit your content.
public boolean isValidCharacters(String word) {
    // Whitelist of allowed characters; the trailing '-' is a literal hyphen.
    String pattern = "^[a-zA-Z0-9~##$^*()_+={}|\\,.?: -]*$";
    return word.matches(pattern);
}
You can write something yourself with a regex, or, if you have other String manipulation to do, the Apache Commons StringUtils class is really useful. It has isAlpha() and isNumeric() methods that are easy to use.
https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html
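Another option, instead of whitelisting by hand, is to drop only the code points that XML 1.0 does not allow at all. The following is a minimal JDK-only sketch; the class and method names are hypothetical, and it assumes that silently dropping the offending characters is acceptable for your data.

```java
class XmlCharFilter {
    // True for code points allowed by the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Copies the text code point by code point, skipping anything XML forbids.
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Note that this does not escape markup characters such as & or <; that is still the job of whatever XML writer you use.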
I need to clean accents and HTML accent codes out of an HTML string. Of course I have found a lot of code that does this; however, none of it seems to work with the file I need to clean.
This file contains words like Postulación Ayudantías and also Gestión or Árbol
I found a lot of code that uses text.normalize and regexes to clean the String. It works well with short strings, but I'm using very long strings, and it doesn't work with those.
I am really lost here and I need help, please!
This is the code I tried that didn't work:
Easy way to remove UTF-8 accents from a string? (it returns "?" for every accent in the String)
I also used regular expressions to remove the HTML accent codes, but neither is working:
string=string.replaceAll("á","a");
string=string.replaceAll("é","e");
string=string.replaceAll("í","i");
string=string.replaceAll("ó","o");
string=string.replaceAll("ú","u");
string=string.replaceAll("ñ","n");
Edit: never mind, the replaceAll is working; I had written it wrong ("/á instead of "á).
Any help or ideas?
I think there are several options that would work. I would suggest that you first use StringEscapeUtils.unescapeHtml4(String) to unescape your HTML entities (that is, convert them to the ordinary Unicode characters in a Java String).
Then you could use an ASCIIFoldingFilter (from Apache Lucene) to reduce them to their ASCII equivalents.
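If pulling in Lucene just for the folding step is too much, a JDK-only alternative is to decompose the characters with java.text.Normalizer and drop the combining marks. This is a different technique from ASCIIFoldingFilter and covers fewer cases (it only handles characters that decompose into a base letter plus marks); the class name is made up for illustration.

```java
import java.text.Normalizer;

class AccentStripper {
    // NFD splits an accented letter into base letter + combining mark(s);
    // the regex then removes every combining mark (Unicode category M).
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Accented letters are written as Unicode escapes to avoid
        // source-encoding problems.
        System.out.println(stripAccents("Postulaci\u00f3n Ayudant\u00edas"));
        System.out.println(stripAccents("Gesti\u00f3n \u00c1rbol"));
    }
}
```

This works on the already-unescaped string, so the entities still have to be decoded first.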
You need to differentiate whether you're talking about a whole HTML document containing tags and so forth or just a string containing HTML encoded data.
If you're working with an entire HTML document, say, something returned by fetching a web page, then the solution is really more than could fit into a stack overflow answer, since you basically need an HTML parser to navigate the data.
However, if you're just dealing with a string that's HTML encoded, then you first need to decode it. There are lots of utilities to do so, such as the Apache Commons Lang library StringEscapeUtils class. See this question for an example.
Once you've decoded the string, you need to iterate over it character by character and replace anything that's unwanted. Your current method won't work for hex encoded items, and you're going to end up having to build a huge table to cover all the possible HTML entities.
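To illustrate the point about hex-encoded items: numeric character references can be decoded generically without any table, as in the rough JDK-only sketch below. The class name and the digit limits in the pattern are my own choices, not taken from any library. Named entities such as &aacute; are deliberately not handled; those need the table mentioned above, or a library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumericEntityDecoder {
    // Matches decimal (&#39;) and hexadecimal (&#x27;) character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:(\\d{1,7})|[xX]([0-9a-fA-F]{1,6}));");

    static String decode(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1))
                    : Integer.parseInt(m.group(2), 16);
            // Leave the reference untouched if it is not a valid code point.
            String replacement = Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, decode("Lucene&#39;s") gives back the string with a plain apostrophe.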
I need to extract paragraphs (like the title on Stack Overflow) from an HTML file.
I can use regular expressions in Java to extract the fields I need, but I then have to decode the extracted fields.
EXAMPLE
field extracted:
Paging Lucene's search results (with **;** among **'** and **s**)
field after decoding:
Paging Lucene's search results
Is there any class in Java that will allow me to decode these HTML codes?
Use methods provided by Apache Commons Lang
import org.apache.commons.lang.StringEscapeUtils;
// ...
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);
Do not try to solve everything by regexp.
While you can do some parts this way, such as replacing entities, the much better approach is to actually use a (robust) HTML parser.
See this question: "RegEx match open tags except XHTML self-contained tags" for why this is a bad idea to do with the regexp Swiss army chainsaw. Seriously, read this question and the top answer; it is a Stack Overflow highlight!
Chuck Norris can parse HTML with regex.
The bad news is: there is more than one way to encode characters.
https://en.wikipedia.org/wiki/Character_encodings_in_HTML
For example, the character 'λ' can be represented as &lambda;, &#955; or &#x3BB;.
And if you are really unlucky, some web sites rely on browsers' ability to guess what a character means. For example, &#153; is not valid, yet many browsers will interpret it as ™.
Clearly it is a good idea to leave this to a dedicated library instead of trying to hack a custom regular expression yourself.
So I strongly recommend:
Feed string into a robust HTML parser
Get parsed (and fully decoded) string back
Here is my string:
String str = "<pre><font size=\"5\"><strong><u>LVI . The Day of Battle</u></strong></font>\n<font\nsize=\"4\"><strong>";
I want to remove all HTML tags from a string using StringTokenizer, but I don't understand how to use StringTokenizer in this situation. When I use str.replaceAll("\\<.*?>",""), it does not remove all the tags, because some tags continue on the next line of the string, as in the string above. I want it to work for everything between < and >. How can I do it? (I want to achieve it using StringTokenizer.) Thanks.
As a general rule, you shouldn't parse HTML with anything except an HTML parsing library. Writing your own parser creates a security risk and exposes your applications to possible attack vectors like Cross Site Scripting and various other bugs. Again: don't parse HTML with regex or a simple tokenizer. An exception to this rule may be if you have a small set of known HTML data inputs and you will use your code on that data only. In this scenario, you can and should verify that your code is doing the correct thing for each input.
That said, your original regex is very close. The dot wildcard matches everything except newlines, and so if we add to your regex the possibility of newlines in addition to the dot wildcard, we get positive results on your test string.
String result = str.replaceAll("<(.|\r|\n|\f)*?>","");
DO NOT USE THIS CODE ON UNKNOWN INPUT! DO NOT USE IT IN PRODUCTION! IT IS NOT A SAFE OR CORRECT APPROACH TO PARSING HTML.
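For completeness, the same effect can be had by switching the regex into DOTALL mode, so that the dot itself matches line terminators. This is only a sketch for the small, known input from the question (the class name is made up), and the warning above applies to it just the same.

```java
import java.util.regex.Pattern;

class TagStripper {
    // DOTALL makes '.' match line terminators too, so a tag may span lines.
    private static final Pattern TAG = Pattern.compile("<.*?>", Pattern.DOTALL);

    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String str = "<pre><font size=\"5\"><strong><u>LVI . The Day of Battle"
                + "</u></strong></font>\n<font\nsize=\"4\"><strong>";
        System.out.println(stripTags(str));
    }
}
```

The inline form str.replaceAll("(?s)<.*?>", "") is equivalent to the compiled pattern used here.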
Trying to process HTML with regexes or StringTokenizer alone is... painful.
This answer is compulsory reading before you go any further.
If your HTML files are simple, you might get away with removing the newlines, then applying a regex, then reformatting the HTML - or try multiline regexes.
But you should really look at using a proper HTML parser. See this question (and probably many others...)
It is better to use an HTML parser library instead of StringTokenizer. Please have a look at the demonstration below:
Download jsoup-1.6.1.jar core library from http://jsoup.org/download.
Add this library to your classpath.
Play with your HTML as you like. Example below is the code for converting HTML content to text format:
import org.jsoup.Jsoup;

public class HtmlParser {
    // Parses the HTML and returns only its text content.
    public static String removeAllHtml(String htmlContent) {
        return Jsoup.parse(htmlContent).text();
    }

    public static void main(String[] args) {
        String htmlContent = "<pre><font size=\"5\"><strong><u>LVI . The Day of Battle</u></strong></font><font size=\"4\"><strong>";
        System.out.println(removeAllHtml(htmlContent));
    }
}
I need to encode a Java String to display properly as part of an HTML document, e.g. all new lines or tab characters. So I have e.g. "one\ntwo", and I want to get something like "one<br>two". Do you know any library that would do it for me?
Try using the preformatted-text tag <pre>.
Spring Framework's HtmlUtils.htmlEscape(String) method should do the trick:
"Turn special characters into HTML character references. Handles complete character set defined in HTML 4.01 recommendation."
You don't need a library for those simple cases. A simple loop over string.replaceAll() should do the trick.
If you are looking for more fancy conversions (such as done here on SO or in a wiki) you can check out the Java Wikipedia API. Code example here. Although I guess it may be a bit overkill for your needs.
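For the simple case, a chain of replacements could look roughly like the sketch below. It uses String.replace (literal matching) rather than replaceAll (regex), the class name is made up, and rendering a tab as four non-breaking spaces is purely an assumption about how you want it displayed.

```java
class HtmlTextEncoder {
    static String toHtml(String text) {
        return text
                .replace("&", "&amp;")    // must come first, or later entities get re-escaped
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("\r\n", "\n")    // normalise Windows line endings
                .replace("\n", "<br>")
                .replace("\t", "&nbsp;&nbsp;&nbsp;&nbsp;"); // assumption: tab = 4 spaces
    }

    public static void main(String[] args) {
        System.out.println(toHtml("one\ntwo"));
    }
}
```

The order matters: the markup characters are escaped before the <br> tags are inserted, so the inserted tags themselves are not escaped.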