I have a problem extracting text from scientific articles.
I use PDFBox to extract text from PDF files. The problem is not with the extraction process itself but with certain special math notations: when I write the extracted text to an XML file, any special character that was not extracted correctly causes trouble. Instead of the original symbol, HTML codes are inserted into the XML file and ruin the whole file. How can I fix this?
The HTML codes I mean look like these and, at the moment, number 218 is the one causing trouble. But I guess that different math notations will be replaced by different HTML codes and cause the same problem later.
I have already tried the following string cleanups, but they didn't help:
nextWord=nextWord.replaceAll("[-+.^:,]", "");
nextWord=nextWord.replaceAll("\\s+", "");
nextWord=nextWord.replaceAll("[^\\x00-\\x7F]", "");
You could add a pre-check before writing each line to the file to verify that the text contains no ambiguous characters. The pattern below contains the basic characters found in a typical textbook; add or remove characters to suit your content.
public boolean isValidCharacters(String word){
    String pattern= "^[a-zA-Z0-9~##$^*()_+={}|\\,.?: -]*$";
    return word.matches(pattern);
}
You can write something yourself with a regex or, if you have other String manipulations to do, the Apache Commons Lang StringUtils class is really useful. It has isAlpha() and isNumeric() methods that are easy to use.
https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html
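If the XML file breaks because the extracted text contains characters that XML 1.0 does not allow, or unescaped markup characters, then filtering and escaping the text before writing it may be enough. Below is a minimal sketch under that assumption. XmlTextCleaner, isXmlChar and toXmlText are hypothetical names for illustration; they are not part of PDFBox or Commons Lang.

```java
// Hypothetical helper (not part of PDFBox): makes extracted text safe to use as XML text content.
class XmlTextCleaner {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops characters that XML 1.0 forbids and escapes &, < and >.
    static String toXmlText(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        raw.codePoints().forEach(cp -> {
            if (!isXmlChar(cp)) {
                return; // skip, for example, most control characters
            }
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default: out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The vertical tab (U+000B) is not a legal XML 1.0 character and is dropped.
        System.out.println(toXmlText("a < b & c\u000B"));
    }
}
```

Unlike the [^\\x00-\\x7F] cleanup in the question, this keeps legitimate non-ASCII symbols and only removes what XML itself rejects.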
I need to clean an HTML string of accents and HTML accent codes. I have found a lot of code that does this, but none of it seems to work with the file I need to clean.
This file contains words like Postulación, Ayudantías, Gestión and Árbol.
I found many snippets that use text.normalize and regexes to clean the String. They work well with short strings, but I'm using very long strings, and they don't work on those.
I am really lost here and I need help, please!
These are the approaches I tried that didn't work:
Easy way to remove UTF-8 accents from a string? (returns "?" for every accent in the String)
and I used regular expressions to remove the HTML accent codes, but neither is working:
string=string.replaceAll("á","a");
string=string.replaceAll("é","e");
string=string.replaceAll("í","i");
string=string.replaceAll("ó","o");
string=string.replaceAll("ú","u");
string=string.replaceAll("ñ","n");
Edit: never mind, the replaceAll is working; I had written it wrong ("/á instead of "á)
Any help or ideas?
I think there are several options that would work. I would suggest that you first
use StringEscapeUtils.unescapeHtml4(String) to unescape your HTML entities (that is, convert them to ordinary characters in a Java String).
Then you could use an ASCIIFoldingFilter to fold them to their "ASCII" equivalents.
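If pulling in Lucene just for ASCIIFoldingFilter is too heavy, the second step can be approximated with the JDK's java.text.Normalizer instead. This is a swapped-in technique, not the filter suggested above, and the sketch assumes the HTML entities have already been unescaped. AccentStripper and stripAccents are hypothetical names.

```java
import java.text.Normalizer;

// Hypothetical helper: removes accents by decomposing characters and dropping the combining marks.
class AccentStripper {

    static String stripAccents(String text) {
        // NFD splits an accented letter into its base letter plus combining mark(s).
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches the combining marks, which are then removed.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes stand for: Postulación Ayudantías Gestión Árbol
        String sample = "Postulaci\u00f3n Ayudant\u00edas Gesti\u00f3n \u00c1rbol";
        System.out.println(stripAccents(sample)); // prints "Postulacion Ayudantias Gestion Arbol"
    }
}
```

This only handles letters that decompose into a base letter plus marks; characters such as ø or ß are left unchanged.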
You need to differentiate whether you're talking about a whole HTML document containing tags and so forth or just a string containing HTML encoded data.
If you're working with an entire HTML document, say, something returned by fetching a web page, then the solution is really more than could fit into a Stack Overflow answer, since you basically need an HTML parser to navigate the data.
However, if you're just dealing with a string that's HTML encoded, then you first need to decode it. There are lots of utilities to do so, such as the Apache Commons Lang library StringEscapeUtils class. See this question for an example.
Once you've decoded the string, you need to iterate over it character by character and replace anything that's unwanted. Your current method won't work for hex encoded items, and you're going to end up having to build a huge table to cover all the possible HTML entities.
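For the decimal and hex-encoded items specifically, no lookup table is needed, because the number in the reference is the code point itself. Here is a rough sketch of that decoding step using only the JDK. NumericEntityDecoder and decode are hypothetical names, and named entities such as the accent codes are deliberately not handled here; those still need a table or a library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes numeric character references, decimal and hexadecimal.
class NumericEntityDecoder {

    // Group 1 captures hex digits (after &#x), group 2 captures decimal digits (after &#).
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String text) {
        Matcher m = NUMERIC_REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            // Leave the reference untouched if the number is not a valid code point.
            String replacement = Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Gesti&#243;n &#xC1;rbol"));
    }
}
```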
I am facing a very difficult problem:
I have a number of HTML-formatted Strings. They were generated by a Document element, where the text was edited as RTF and saved as HTML (to display it on a website).
The problem is that some RTF elements that are parsed to HTML seem to be unusable in HTML, which causes a crash. One of the characters disallowed in HTML is, for example, %0b.
According to http://www.tutorialspoint.com/html/html_url_encoding.htm it has no function, or at least I can't figure out why it is needed (in fact, it can't even be copied).
My question is: is there a function (I have already searched) that can eliminate all non-HTML characters from such an rtf2html string?
I just need to eliminate them when the HTML is loaded, so that there are no display problems.
Use methods provided by Apache Commons Lang
import org.apache.commons.lang.StringEscapeUtils;
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);
Credit to: #jlordo
Or you can use replaceAll("%0b", "");
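Note that replaceAll("%0b", "") only helps if the string contains the literal text %0b. If it actually holds the raw control character (U+000B, vertical tab), a character-class regex can strip it together with the other control characters. This is a sketch under that assumption; ControlCharStripper and stripControlChars are hypothetical names.

```java
// Hypothetical helper: removes raw control characters such as U+000B (vertical tab).
class ControlCharStripper {

    // Removes ASCII control characters except tab, line feed and carriage return.
    static String stripControlChars(String text) {
        // [\p{Cntrl}&&[^\t\n\r]] is a class intersection: control characters minus the three kept ones.
        return text.replaceAll("[\\p{Cntrl}&&[^\\t\\n\\r]]", "");
    }

    public static void main(String[] args) {
        String html = "<p>line one\u000Bline two</p>";
        System.out.println(stripControlChars(html)); // prints "<p>line oneline two</p>"
    }
}
```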
I have a div that displays some text, which I get from the DB. This text can contain special characters like "\", ">", "<", etc. When I try to display this text in my page, these special characters are not visible, for obvious reasons. How should I handle this situation?
Since you have mentioned database, I am assuming that you have Java involved...
That being said, you can take a look at Apache's StringEscapeUtils and escape your strings accordingly.
In your JavaScript you can write a function that replaces all the special characters with their codes.
Have a look at this answer: Convert special characters to HTML in Javascript
Alternatively, write a function on the Java side that converts these (or the expected) special characters and returns the result to the front end.
e.g.
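public static String convert(String text) {
    // Escape & first so the entities produced below are not escaped a second time.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
}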
I am working on a plugin that will parse HTML files. I have a naming convention like this:
<!--$include="a.html" -->
or
<!--$include="a.html"-->
which is equivalent.
I want to search an HTML file for this pattern (it is similar to server-side includes).
The question is:
How do I find that pattern and get its value (a.html in my example; it is variable)?
It should work like this:
while(!notFinishedWholeFile){
fileName = findPatternFunc(htmlFile)
replaceFunc(fileName,something)
}
PS: I don't know which is better: using a regex in Java or implementing it differently (for example with .indexOf()). If a regex performs well in this situation, I want to use it.
Any ideas?
You mean like this?
<!--\$include=\"(?<htmlName>[a-z-_]*).html\"\s?-->
Read a file into a string then
str = str.replaceAll("(?<=<!--\\$include=\")[^\"]+(?=\" ?-->)", something);
will replace the filenames with the string something, then the string can be written back to the file.
(Note: this replaces any text inside the double quotes, not just valid filenames.)
If you only want to replace filenames with the html extension, swap the [^\"]+ for [^.]+\\.html.
Using regex for this task is fine performance-wise, but see e.g.
How to use regular expressions to parse HTML in Java? and Java Regex performance etc.
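If the goal is to read each included filename rather than replace it, a Pattern/Matcher loop with a capturing group is one way to do it. This is a minimal sketch; IncludeFinder and findIncludes are hypothetical names, and the pattern assumes the include comments look exactly like the two forms in the question, with optional whitespace before the closing -->.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the filenames named in <!--$include="..."--> comments.
class IncludeFinder {

    // Group 1 captures everything between the double quotes.
    private static final Pattern INCLUDE =
            Pattern.compile("<!--\\$include=\"([^\"]+)\"\\s*-->");

    static List<String> findIncludes(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = INCLUDE.matcher(html);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<p>Intro</p><!--$include=\"a.html\" --><!--$include=\"b.html\"-->";
        System.out.println(findIncludes(html)); // prints "[a.html, b.html]"
    }
}
```

The while (m.find()) loop plays the role of the loop sketched in the question: each iteration yields the next filename, which can then be passed to whatever does the replacement.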
I have used that pattern:
"<!--\\$include=\"(.+)(.)(html|htm)\"-->"
Here is my string:
String str = "<pre><font size="5"><strong><u>LVI . The Day of Battle</u></strong></font>
<font
size="4"><strong>";
I want to remove all HTML tags from a string using StringTokenizer, but I don't understand how to use StringTokenizer for this. When I use str.replaceAll("\\<.*?>",""), it does not remove all tags, because some tags continue on the next line of the string, as in the string above. I want it to work for everything between < and >. How can I do that? (I want to achieve it using StringTokenizer.) Thanks.
As a general rule, you shouldn't parse HTML with anything except an HTML parsing library. Writing your own parser creates a security risk and exposes your applications to possible attack vectors like Cross Site Scripting and various other bugs. Again: don't parse HTML with regex or a simple tokenizer. An exception to this rule may be if you have a small set of known HTML data inputs and you will use your code on that data only. In this scenario, you can and should verify that your code is doing the correct thing for each input.
That said, your original regex is very close. The dot wildcard matches everything except newlines, and so if we add to your regex the possibility of newlines in addition to the dot wildcard, we get positive results on your test string.
String result = str.replaceAll("<(.|\r|\n|\f)*?>","");
DO NOT USE THIS CODE ON UNKNOWN INPUT! DO NOT USE IT IN PRODUCTION! IT IS NOT A SAFE OR CORRECT APPROACH TO PARSING HTML.
Trying to process HTML with regexes or StringTokenizer alone is... painful.
This answer is compulsory reading before you go any further.
If your HTML files are simple, you might get away with removing the newlines, then applying a regex, then reformatting the HTML - or try multiline regexes.
But you should really look at using a proper HTML parser. See this question (and probably many others...)
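For the simple-file case only, the multiline-regex idea can be sketched with the (?s) flag (DOTALL), which lets the dot match line terminators so that a tag split across lines is still removed. TagStripper and stripTags are hypothetical names, and this is not a substitute for a real HTML parser on unknown input.

```java
// Hypothetical helper: strips anything between < and >, even across line breaks.
class TagStripper {

    static String stripTags(String html) {
        // (?s) enables DOTALL, so '.' also matches \n and \r; *? keeps each match as short as possible.
        return html.replaceAll("(?s)<.*?>", "");
    }

    public static void main(String[] args) {
        String str = "<pre><font size=\"5\"><strong><u>LVI . The Day of Battle</u></strong></font>\n<font\nsize=\"4\"><strong>";
        System.out.println(stripTags(str));
    }
}
```

Text that legitimately contains < or > (for example inside scripts or comments) will be mangled by this, which is one reason a parser is the safer choice.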
It is better to use an HTML parser library instead of StringTokenizer. Please have a look at the demonstration below:
Download the jsoup-1.6.1.jar core library from http://jsoup.org/download.
Add this library to your classpath.
Play with your HTML as you like. The example below converts HTML content to plain text:
import org.jsoup.Jsoup;

public class HtmlParser {
    public static String removeAllHtml(String htmlContent) {
        return Jsoup.parse(htmlContent).text();
    }

    public static void main(String[] args) {
        String htmlContent = "<pre><font size=\"5\"><strong><u>LVI . The Day of Battle</u></strong></font><fontsize=\"4\"><strong>";
        System.out.println(removeAllHtml(htmlContent));
    }
}