From time to time we have encountered a very strange encoding problem in Tomcat in our production environment.
I have not yet been able to pinpoint exactly where in the code the problem happens, but it involves the replacement of non-ASCII characters with approximate ASCII characters.
For example, replacing the character 'å' with 'a'. Since the site is in Swedish, the characters 'å', 'ä' and 'ö' are quite common. But for some reason the replacement of the 'ö' character always works, so a string like "Köp inte grisen i säcken" becomes "Kop inte grisen i säcken", i.e. the 'ä' is not replaced as it should be, while the 'ö' character is.
Some quick facts about the problem:
It happens very seldom (we have noticed it 3-4 times, the first time maybe 1-2 years ago).
A restart of the troubled server makes the problem go away (until the next time).
It has never occurred on more than one front end server at the same time.
It doesn't always happen on the same front end server.
No user input on the front end is involved.
All front end servers connect to the same CMS and DB, with the relevant config being identical.
All front end servers have the same relevant configuration (linux config, tomcat config, java environment config like "file.encoding" etc), and are started using the same script (all according to the hosting/service provider).
All front end servers use the same exact war file for the site, and the same jar files.
No other encoding problems can be seen on the site while this character replacement problem occurs.
We have never been able to reproduce the problem in any other environment.
We use Tomcat 5.5 and Java 5, because of CMS requirements.
I can only think of two plausible causes for this behavior:
The hosting provider sometimes starts/restarts the front end servers in a different way, maybe with another user account with other environment variables or other file access rights, or maybe using some other script than the normal one.
Some process running during Tomcat or webapp startup depends upon some other process, and sometimes (intermittently but seldom) these two (or more) processes happen to run in an order that causes this encoding defect.
But even if 1 or 2 above is the case, it still doesn't fully explain what really happens. What exact difference could explain this, given that "file.encoding", "file.encoding.pkg", "sun.io.unicode.encoding", "sun.jnu.encoding" and all other relevant system properties and environment variables match on all front end machines (verified visually using a debug page while the problem was occurring)?
Can someone think of a plausible explanation for this strange intermittent behavior? Simply upgrading the Tomcat and/or Java version is not really a relevant answer, since we don't know whether that would solve the problem, and it still wouldn't explain what the problem was. I'm more interested in understanding exactly what causes the problem.
Regards
/Jimi
UPDATE:
I think I have found the code that performs the character replacements. On initialization (triggered by the first call to do a replacement) it builds a HashMap<Character, String> and fills it like this:
lookup.put(new Character('å'), "a");
Then when it should replace characters in a String, it loops over each character and for each one does a lookup in the hash map with the character as the key; if a replacement String is found it is used, otherwise the original character is used.
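For reference, here is a minimal sketch of that kind of lookup-based replacement. The class and method names are hypothetical (this is not the original code), and the map is initialized in a static block here rather than lazily on first use as described above:

import java.util.HashMap;
import java.util.Map;

public class AsciiApproximator {

    // Hypothetical reconstruction of the lookup table described above.
    private static final Map<Character, String> LOOKUP = new HashMap<Character, String>();
    static {
        LOOKUP.put(Character.valueOf('å'), "a");
        LOOKUP.put(Character.valueOf('ä'), "a");
        LOOKUP.put(Character.valueOf('ö'), "o");
    }

    public static String replaceCharacters(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = LOOKUP.get(Character.valueOf(c));
            sb.append(replacement != null ? replacement : String.valueOf(c));
        }
        return sb.toString();
    }
}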
This part of the code is more than 3 years old, and written by a developer who is long gone. If I were to rewrite this code today, I would do something totally different, and that might even solve the problem. But it would still not explain exactly what happened. Can someone see some possible explanation?
Normalize the input to Normalization Form C (NFC) before doing the replacement.
For instance, ä can be just 1 character, U+00E4, or it can be two characters, a (U+0061) and the combining diaeresis U+0308.
If your replacement just looks for the composed form, then the decomposed form will still remain as \u0061\u0308, because neither of those characters matches \u00e4:
import java.text.Normalizer;

public static void main(String[] args) {
    String decomposed = "\u0061\u0308"; // 'a' followed by combining diaeresis
    String composed = "\u00e4";         // precomposed 'ä'
    System.out.println(decomposed);
    System.out.println(composed);
    System.out.println(composed.equals(decomposed));
    System.out.println(Normalizer
            .normalize(decomposed, Normalizer.Form.NFC).equals(composed));
}
Output
ä
ä
false
true
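To wire this into the replacement described in the question, the input would be normalized before the map lookup. A short sketch follows; replaceCharacters stands in for whatever the existing replacement routine is actually called (matching the hypothetical sketch in the question above), and note that java.text.Normalizer only exists from Java 6, so on the Java 5 stack mentioned in the question an equivalent (for example ICU4J's Normalizer) would be needed:

import java.text.Normalizer;

// Collapse decomposed sequences such as U+0061 U+0308 into the single
// precomposed character U+00E4 that the lookup map actually contains.
String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);
String result = replaceCharacters(normalized); // hypothetical name for the existing routine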
Related
I'm currently working on an application in Eclipse where I'm running a really huge SQL statement that spans about 20 lines when split in Notepad to fit on the screen. Thus I want the string for the query to be formatted somewhat more readably than as a single line. Autoformatting has always worked for me in Eclipse, but somehow now neither Ctrl + Alt + F nor right-clicking and selecting the "Format" option from the menu works to get a line break after a certain number of characters.
I already checked the preferences, where I tried a custom profile with 120 and 100 character line widths, but that didn't fix anything so far. I really don't know why Eclipse won't let me format this anymore. Normally Eclipse would split the string into several lines in this case, but I don't know why this no longer works.
However, other formatting is still fixed when running autoformat (e.g. if(xyz){ still becomes if (xyz) {).
Thank you for your help in advance.
As far as I can tell, autoformat as you described was never supported (at least as far back as 2008), and I have been using Eclipse much longer than that.
You can do one of several things.
Simply insert the cursor in the string and hit a return.
Toggle word wrap (Alt+Shift+Y).
Try writing a regex to do what you want (not certain whether this will work).
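For illustration, this is roughly the kind of manually wrapped string that option 1 produces: placing the cursor inside the literal and pressing Enter makes Eclipse split it into concatenated pieces. The query and its table/column names below are hypothetical:

String sql = "SELECT o.id, o.created, c.name "
        + "FROM orders o "
        + "JOIN customers c ON c.id = o.customer_id "
        + "WHERE o.created >= ? "
        + "ORDER BY o.created DESC";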
Context
I have a SOAP web service which is served by a JBoss EAP instance and is called via a SoapUI client.
In the result returned by this web service there may be an XML string like this:
The same string will be rendered as follows in the SoapUI client:
As you can observe, during the transport of this message some characters (specifically <) have been encoded to &lt;: this is normal, as the encoder wants to avoid the string being interpreted as markup when it is just output to be returned as-is.
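For reference, standard XML character escaping of text content boils down to replacing the markup-significant characters before the text is embedded in the envelope. A minimal hand-rolled sketch (this is not the JBoss encoder, just an illustration of the expected transformation):

// '&' must be replaced first so already-produced entities are not double-escaped.
static String escapeXmlText(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
}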
Problem
What we have observed is that when the string is too long, the encoding just goes wrong. I've tried to analyze and understand, and this is all I can get:
Towards the end of the string, some < characters are left as such and are not converted into &lt;
Very weirdly, an XML tag that is well-formed on the server side:
<calculationPeriod>
...some stuff
</calculationPeriod>
... has its second c converted into &lt;, which completely breaks the XML:
<cal<ulationPeriod>
...some stuff
</calculationPeriod>
My question
Honestly, I have no idea how to debug this issue any further. All I can notice is that:
When inside the web service (stack that I control), the response is normally formed and encoded in XML using the open tag <.
Once out in the SoapUI client (all across the stack there are generic JBoss calls and RMI invocations), the message gets corrupted like this.
It is important to note that this only happens when the string is particularly long. I have one output of length 8192 characters (before encoding) that goes fine, while another output of length 9567 characters (before encoding) goes wrong and is the subject of this question.
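One way to narrow this down, assuming both the well-formed payload (captured inside the web service) and the corrupted payload (captured in SoapUI) can be saved as strings, is to find the first index where they diverge and check whether it lines up with a buffer-sized boundary such as 8192. A small debugging sketch:

// Returns the first index at which the two captured payloads differ;
// if one is a strict prefix of the other, returns the shorter length;
// returns -1 if the payloads are identical.
static int firstDifference(String expected, String actual) {
    int len = Math.min(expected.length(), actual.length());
    for (int i = 0; i < len; i++) {
        if (expected.charAt(i) != actual.charAt(i)) {
            return i;
        }
    }
    return expected.length() == actual.length() ? -1 : len;
}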
Apologies :)
I'm sorry that I'm not able to provide a reproducible test case, and for using a title that means nothing and everything.
I'm open to providing any additional information for those who may help, and to rephrasing the question once I get a clearer picture of what the problem is.
I've of course searched the web a lot, but I can't find anything similar; probably I'm not searching with the right keywords.
We are escaping the special characters for exact (phrase) search using quotes (" "), and it works fine, except for a few cases where it throws an
ArrayIndexOutOfBoundsException at org.apache.solr.spelling.WordBreakSolrSpellChecker.getSuggestions
search text : "PRINTING 9-27 TEST CARDS ADD-ON MATT LAMINATION ON 2-SIDE OF TEST CARDS PER BOX OF 100 PCS". the config is spellcheck.dictionary is default and commented the spellcheck.dictionary wordbreak
we cannot apply any patch now, checked the issue LUCENE-5494
any of you suggest any work around to get the results in-spite of the exception. any configuration changes to suppress suggest or spellcheck. commenting word break dictionary also didn't help. solr version 4.10.4
Due to security reasons, I cannot post anything related to code and sorry for the minimal information in the query.
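As a generic illustration only (this is not the asker's code; it assumes a SolrJ 4.x client and a hypothetical core URL), spellcheck can normally be suppressed per request by sending spellcheck=false, provided the request handler does not force it:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Hypothetical core URL; adjust to the real deployment.
HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");

SolrQuery query = new SolrQuery("\"MATT LAMINATION ON 2-SIDE OF TEST CARDS\"");
query.set("spellcheck", "false"); // skip the SpellCheckComponent for this request
QueryResponse response = solr.query(query);

Whether the component actually honors the flag depends on how the request handler is configured, so this is only a starting point.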
Anyway, it might be useful to someone like me. The reason the exception still showed up even after commenting out the wordbreak dictionary is that the changes made in the solrconfig.xml file were not being picked up. I was testing on my local machine, which is a standalone environment. Restarting the container (WebLogic) did not pick up the changes, and reloading the core through the admin screen also didn't help. Re-importing and reloading, and then restarting the container, did the trick.
I'm told to write code that takes a string of text and checks whether its encoding equals a specific encoding that we want. I've searched a lot but haven't found anything. I found a method (getEncoding()), but it only works with files, and that is not what I want. I'm also told that I should use the standard Java library, not Mozilla or Apache libraries.
I really appreciate any help. Thanks in advance.
What you are thinking of is "internationalization". There are libraries for this, like Loc4j, but you can also get this using java.util.Locale in Java. However, in general text is just text: it is a token with a certain value. No localization information is stored in the characters themselves, which is why a file normally provides the encoding in its header. A console or terminal can also provide localization using certain commands/functions.
Unless you know the source encoding and the tokens used, you will have only a limited ability to guess what encoding was used on the other end. If you still want to do this, you will need to go into deeper areas such as the kind of statistical analysis used in decryption. That in turn requires databases on the usage of different tokens, and, depending on the quality of the text, the databases and the algorithms, a certain amount of text is required. Special cases, like writing Swedish with e.g. a US encoding (using a for å and ä, or o for ö), will require even more advanced analysis.
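If the data is available as raw bytes (a Java String in memory carries no encoding of its own), the closest the standard library gets to "checking the encoding" is verifying that the bytes decode cleanly under the expected charset. A sketch:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Returns true if the bytes form a valid sequence in the given charset.
static boolean decodesCleanly(byte[] data, Charset charset) {
    CharsetDecoder decoder = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        decoder.decode(ByteBuffer.wrap(data));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}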
EDIT
Since I got a comment that encoding and internationalization are different things, I will add some comments. It is possible to work with different encodings while working plainly in English (like some English special characters). It is also possible to work with encodings directly, using for example Charset. However, for many applications that use different encodings it may still be efficient to use Locale, since that library can do a lot of operations on text in different encodings.
Thanks for your answers and contributions, but these two links did the trick. I had already seen these two pages, but they didn't seem to work for me because I was thinking of getting the encoding directly and then comparing it with the specific one.
This is one of them
This is another one.
This seems to be a fairly common issue, but I've not yet been able to find a solution, perhaps because it comes in so many flavors. Here it is, though. I'm trying to read some comma-delimited files (occasionally the delimiters can be a little more exotic than commas, but commas will suffice for now).
The files are supposed to be standardized across the industry, but lately we've seen files coming in with many different character sets. I'd like to be able to set up a BufferedReader to compensate for this.
What is a fairly standard way of doing this, and of detecting whether it was successful or not?
My first thought is to loop through character sets from simple to complex until I can read the file without an exception. Not exactly ideal, though...
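That trial-and-error idea can be expressed with the standard library by opening the file with a strict decoder and moving on to the next candidate when decoding fails. A sketch, with the candidate list left up to the caller:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Try candidate charsets in order ("simple -> complex") and return a reader
// over the first one that decodes the whole file without errors.
static BufferedReader openWithFirstWorkingCharset(String path, Charset... candidates) throws IOException {
    for (Charset cs : candidates) {
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(path),
                cs.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)));
        try {
            while (reader.read() != -1) {
                // just validating that the whole file decodes cleanly
            }
            reader.close();
            // Reopen from the start now that this charset is known to work.
            return new BufferedReader(new InputStreamReader(new FileInputStream(path), cs));
        } catch (CharacterCodingException e) {
            reader.close(); // malformed or unmappable input: try the next candidate
        }
    }
    throw new IOException("None of the candidate charsets decoded " + path + " cleanly");
}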
Thanks for your attention.
Mozilla's universalchardet is supposed to be the most efficient detector out there, and juniversalchardet is the Java port of it. There is one more port as well. Read this SO question for more information: Character Encoding Detection Algorithm
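If pulling in a library is acceptable, juniversalchardet is typically used roughly as below; the API names are from the juniversalchardet port and are worth double-checking against the exact version you pick up:

import java.io.FileInputStream;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

// Feed raw bytes from the file to the detector and ask it for its best guess.
static String detectCharset(String path) throws IOException {
    UniversalDetector detector = new UniversalDetector(null);
    FileInputStream in = new FileInputStream(path);
    try {
        byte[] buf = new byte[4096];
        int read;
        while ((read = in.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, read);
        }
    } finally {
        in.close();
    }
    detector.dataEnd();
    return detector.getDetectedCharset(); // may be null if nothing could be detected
}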