I have some Java code that looks like this:
String xml = "<string>" + escapeXml(input) + "</string>";
protected String escapeXml(String input) {
    return input.replaceAll("&", "&amp;")
                .replaceAll("'", "&apos;")
                .replaceAll("\"", "&quot;")
                .replaceAll("<", "&lt;")
                .replaceAll(">", "&gt;");
}
input is a variable UTF-8 encoded string.
What I'm finding is that in some cases the xml string ends up being just <string>, without the closing </string>. Why might this be? Is it possible for Java to evaluate escapeXml into something that truncates the string before </string> can be appended to it?
UPDATE: In response to Sotirios, let me add some clarifications. The xml string is being saved to a SQLite database column, which in turn is parsed by another utility. So far, I've noticed that this behavior occurs when the xml string saved to the database is either <string> or <string> with some non-ASCII Unicode character afterwards.
input is being fed automatically from a hook into an Android function. Because everything is running on Android in a non-standard configuration, it's a bit difficult to debug to learn exactly what's going on. I was hoping that there might be some obvious answer involving Java strings.
I never got to the bottom of this, but I did fix my problem by modifying the escapeXml function to use a proper XML encoder (org.apache.commons.lang library). I don't see how that would make a difference, but it did, and now the xml string is properly constructed.
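For reference, the replacement looks roughly like this (a sketch assuming commons-lang 2.x, where StringEscapeUtils.escapeXml handles the five predefined XML entities):

import org.apache.commons.lang.StringEscapeUtils;

// Sketch of the commons-lang based encoder; assumes commons-lang 2.x.
// escapeXml replaces &, <, >, " and ' with their XML entities.
protected String escapeXml(String input) {
    return StringEscapeUtils.escapeXml(input);
}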
I have a problem with Strings and Unicode in Java.
I'm currently working on a Discord bot and need to pass it a string containing an emoji. For this I use the Java-specific escape form; for example, I want the "fire" emoji. If I put the Java escape sequence (\uD83D\uDD25) into a string manually, it works, but if I use the return value of my method (also a String) instead, it no longer works.
Hence the question: does it make a difference whether the escape sequence is entered manually or produced programmatically? Maybe Java can't recognize that the second one is also Unicode?
Thanks for your help
String emoji1 = "\uD83D\uDD25";
String emoji2 = convertToJava(":fire:"); // returns a String with the content "\uD83D\uDD25"
msg.addReaction(ReactionEmoji.of(id, emoji1, isAnimated)).block(); // this works
msg.addReaction(ReactionEmoji.of(id, emoji2, isAnimated)).block(); // this fails with an "unknown emoji" error
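One likely explanation (a guess, since convertToJava isn't shown): the method returns the twelve-character escape text \uD83D\uDD25 (backslash, 'u', hex digits) rather than the two-char surrogate pair. A quick sketch to check for that, and to decode such escape text if it is the case; the decoding code is illustrative, not from the post:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// "literal" stands in for what convertToJava(":fire:") might actually return
// if it builds the escape text instead of the real characters.
String literal = "\\uD83D\\uDD25";  // 12 chars: the escape text itself
String emoji = "\uD83D\uDD25";      // 2 chars: the actual surrogate pair
System.out.println(emoji.length());   // 2
System.out.println(literal.length()); // 12 -- if your method's result prints this, that's the bug

// Decoding escape text into real characters:
Matcher m = Pattern.compile("\\\\u([0-9a-fA-F]{4})").matcher(literal);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    char decoded = (char) Integer.parseInt(m.group(1), 16);
    m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(decoded)));
}
m.appendTail(sb);
System.out.println(sb.toString().equals(emoji)); // true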
I have a problem with extracting text from scientific articles.
I use PDFBox to extract text from PDFs. The problem is not with the extraction process itself but with certain special math notations that cause trouble when I write the extracted text into an XML file: instead of the character, a numeric reference such as &#218; (or another similar HTML code) is inserted into the XML file and ruins the whole file. How can I fix this issue?
The HTML codes I mean look like &#218;, and at the moment number 218 is the trouble. But I guess different math notations will be replaced by different HTML codes and cause the same problem afterward.
I have already tried the following string cleanups, but they didn't help:
nextWord=nextWord.replaceAll("[-+.^:,]", "");
nextWord=nextWord.replaceAll("\\s+", "");
nextWord=nextWord.replaceAll("[^\\x00-\\x7F]", "");
You could add a pre-check before writing each line to the file, to verify that the text contains no ambiguous characters. The pattern below covers the basic characters found in any given textbook; add or remove characters to suit your content.
public boolean isValidCharacters(String word) {
    // Letters, digits, and common punctuation; adjust as per your content.
    String pattern = "^[a-zA-Z0-9~#$^*()_+={}|,.?: -]*$";
    return word.matches(pattern);
}
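A short usage sketch (writer is an illustrative name; nextWord is from the question):

// Skip any word that fails the check before it reaches the XML file.
if (isValidCharacters(nextWord)) {
    writer.write(nextWord);
}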
You can write something yourself with a regex, or if you have other String manipulations to do, the Apache StringUtils class is really great. It has isAlpha() and isNumeric() methods that are easy to use.
https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html
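A few examples of those checks (a sketch using commons-lang3's StringUtils from the Javadoc linked above):

import org.apache.commons.lang3.StringUtils;

// Character-class checks from commons-lang3.
StringUtils.isAlpha("abc");         // true: letters only
StringUtils.isNumeric("123");       // true: digits only
StringUtils.isAlphanumeric("ab12"); // true: letters and digits
StringUtils.isAlpha("ab-c");        // false: contains '-'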
After hours of googling and searching within SO, I finally come to the place where I need to ask you! :)
The situation is the following:
A webservice delivers data in a CDATA. This data is parsed and put into our model. Using Spring MVC we access the model inside the JSP files to create... here comes the point... JSON! Don't ask, historically! ;-)
Now, somehow someone had the glorious idea to put multiple (back)slashes into a title property. The getTitle() method returns the string "/// Glasvegas \\". This of course doesn't work if we do a JavaScript eval() on the JSON (created within the JSP) to get the JavaScript JSON object. It simply interprets the backslashes as escape characters, making the JSON invalid.
I tried to use the escapeHtml() methods from Apache Commons and the Spring Framework, but they both just ignore the backslashes while encoding all other special characters correctly.
Then I tried to write my own method:
public static String escapeHTML(String string) {
    String foreslash = "\\\\"; // two backslash characters
    String regex = "\\\\";     // a regex matching a single backslash
    System.out.println(string.replaceAll(regex, foreslash));
    string.replaceAll(regex, foreslash);
    return string;
}
In the console output the string is correctly replaced, but if I break at the return and inspect the variable 'string' in the debugger, it's still "/// Glasvegas \\". The same appears in the generated JSP.
So, I'm kind of lost here.
Regards,
ASP
Strings are immutable. The name of the method replaceAll makes it sound as though you're actually modifying the string object itself, but you're not: the method just returns the result of the operation. That's why you get the correct output from System.out.println. But then you make an error of thought, assuming that just because the call stands by itself, not inside a System.out.println, the Java code should understand by itself that this time you want the change to be permanent in the string object ;)
Try rewriting the end of your method like this:
System.out.println(string.replaceAll(regex,foreslash));
return string.replaceAll(regex,foreslash);
Also, the variable name "foreslash" makes it sound as though 92 is the code for a forward slash. Maybe it is, I don't know. Your regular expression then looks for a backslash. That's a bit confusing!
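Putting that together, a corrected method might look like this (a sketch with a name of my choosing; it uses the non-regex String.replace to double each backslash, assuming that's what the JSON output needs):

// Sketch of a fixed version: the result is returned instead of discarded.
// String.replace matches literally, so no regex escaping is needed.
public static String escapeBackslashes(String string) {
    return string.replace("\\", "\\\\"); // each \ becomes \\
}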
I have several blocks of text that I need to be able to paste inline in my code for some unit tests. It would make the code difficult to read if they were externalized, so is there some web tool where I can paste in my text and it will generate the code for a StringBuffer that preserves its formatting? Or even a String, I'm not that picky at this point.
It seems like a code generator like this must exist somewhere on the web. I tried to Google for one, but I have yet to come up with a set of search terms that doesn't fill my results with Java examples and documentation.
I suppose I could write one myself, but I'm in a bit of a time crunch and would rather not duplicate effort.
If I understood it correctly, any text editor which supports regexes should make it an easy task. For instance Notepad++ - just replace ^(.+)$ with "\1"+, then copy the result into the code, remove the last + and add String s = to the beginning :)
If you want to externalize them, use a properties file or something like that to read the text.
If you are looking for a simple tool to break your text into concatenated strings joined together with a StringBuffer, most modern IDEs will do it for you automatically. Here's how.
Copy the block of text into the IDE.
Surround it in double quotes and assign it to a String variable. (This step may not be required.)
Press Enter wherever you want to wrap the text to the next line; the IDE will automatically break the literal into quoted pieces and join them with +.
Note that the compiler folds literal-to-literal concatenations like "addas" + "addasfdas" into a single String constant at compile time; only concatenation involving runtime values is compiled to StringBuilder (formerly StringBuffer) calls. A small sketch of the resulting pattern follows.
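Here is what the IDE-produced pattern looks like (illustrative text):

// The three literals below are folded into one constant at compile time,
// so the line splitting costs nothing at runtime.
String s = "first line\n"
         + "second line\n"
         + "third line\n";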
The SQuirreL SQL client has a function called "convert to string buffer"; it works nicely.
I am trying to create an XML document (an RSS feed) and have worked out all the kinks in it except for one character-encoding issue. The problem is that I declare a UTF-8 encoding, like so: <?xml version="1.0" encoding="UTF-8"?>, except the document itself is not actually encoded as UTF-8.
I am using the org.apache.ecs.xml package to create all the tags. I then use doc.output(stream) to write the content. This method does not seem to write output using UTF-8, and I don't know how to make that happen. Until I do, some symbols (the British pound sign is what I first noticed) aren't rendered properly in most readers.
--Updated with more information--
I ended up using a bad solution (as explained in the comments) to fix this problem. The correct answer seems to be: don't use the org.apache.ecs.xml library. Thank you all for the help. StackOverflow wins again.
The simplest workaround is probably going to be changing your code like follows:
XMLDocument doc = new XMLDocument(1.0, false, Charset.defaultCharset().toString());
I'm guessing they're just using the default encoding to write characters to the stream. So pass the default encoding to the prologue and you should be fine.
I'll agree with other posters that this is probably the least of your worries. Looking at the source repository for ECS, it doesn't appear to have been updated for four years (the "ECS2" repository likewise).
And some self-promotion: if you're looking to build XML documents using a simple interface, the Practical XML library has a builder. It uses the standard JDK serialization mechanism for output.
Any chance you can write to a Writer rather than an OutputStream? That way you could specify the encoding.
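The general shape of that idea (a sketch; I haven't verified whether ecs's output() accepts a Writer, so treat that overload as an assumption):

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// An OutputStreamWriter pins the character encoding, whereas writing to a
// raw OutputStream leaves it to the platform default.
Writer writer = new OutputStreamWriter(stream, StandardCharsets.UTF_8);
doc.output(writer); // assumes a Writer overload exists in the library
writer.flush();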
Here is a solution my co-worker came up with that I THINK is the correct way to do it, but what do I know. Instead of using doc.output(stream), we used:
try {
    IOUtils.write(doc.toString(), stream, "UTF-8");
} catch (IOException e) {
    throw new RuntimeException(e);
}
To be honest I don't completely understand the problem yet, which is why I am having problems in the first place. It seems that #subtenante's solution went through and converted any character that UTF-8 could not represent into the corresponding Unicode entity. This solution seems to write to the stream using the UTF-8 encoding, like I originally wanted doc.output to. I don't know the exact difference, just that both solved my problem. Any further comments to help me understand the problem would be appreciated.
I'm not familiar with this package, but from the source on the web I suspect it may be broken:
http://kickjava.com/src/org/apache/ecs/xml/XMLDocument.java.htm
contains stuff like
for (int i = 0; i < prolog.size(); i++) {
    ConcreteElement e = (ConcreteElement) prolog.elementAt(i);
    e.output(out);
    // XXX really this should use line separator!
    // XXX should also probably check for pretty print
    // XXX also probably have difficulties with encoding
which suggests problems.
We use XOM (http://www.xom.nu) and that specifically has a setEncoding() on its Serializer so I would suggest changing packages...
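For illustration, XOM serialization with an explicit encoding looks roughly like this (a sketch; the Serializer(OutputStream, String) constructor is XOM's API, while the surrounding variable names are illustrative):

import nu.xom.Document;
import nu.xom.Serializer;

// The Serializer takes the target encoding up front, so the declared and the
// actual encoding cannot drift apart.
Serializer serializer = new Serializer(outputStream, "UTF-8");
serializer.write(xomDoc); // xomDoc is a nu.xom.Document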
Here is a function I wrote to convert all non-ASCII characters to their corresponding entities. It might help you sanitize some PCDATA content before output.
/**
 * Creates XML entities for non-ASCII characters in the given String.
 */
public static String xmlEntitify(String in) {
    StringBuffer b = new StringBuffer();
    for (int i = 0; i < in.length(); i++) {
        char c = in.charAt(i);
        if (c < 128) {
            // plain ASCII passes through untouched
            b.append(c);
        } else if (c == '\ufeff') {
            // BOM character, just remove it
        } else {
            // everything else becomes a numeric character reference, e.g. &#x00DA;
            String cstr = Integer.toHexString(c).toUpperCase();
            while (cstr.length() < 4) {
                cstr = "0" + cstr;
            }
            b.append("&#x");
            b.append(cstr);
            b.append(";");
        }
    }
    return b.toString();
}
Read your input stream into a String content, and write xmlEntitify(content) to the output stream; a usage sketch follows.
Your output is guaranteed to contain only ASCII characters: no more encoding problems.
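The sketch, leaning on the commons-io IOUtils already shown in this thread (the stream variables are illustrative):

import org.apache.commons.io.IOUtils;

// Read everything as UTF-8, entitify, and write pure ASCII out.
String content = IOUtils.toString(inputStream, "UTF-8");
IOUtils.write(xmlEntitify(content), outputStream, "US-ASCII");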
UPDATE
Given the comments, I'll be even bolder: if you are not sanitizing your data, you are asking for trouble. I guess you are at least already replacing the < and & characters in your PCDATA. If not, you definitely should. I have another version of the above method which, instead of the first if, has:
if (c < 128 && c != '&' && c != '<' && c != '>' && c != '"') {
    b.append(c);
}
so that these characters are also converted to their corresponding Unicode entity.
This converts all of my PCDATA to Unicode-friendly, ASCII-only strings. I have had no more encoding problems since I started using this technique. I never output XML PCDATA that has not been passed through this method: this is not sweeping the problem under the carpet, it is getting rid of it by being as generic as can be.