When I make web queries, for accented characters, I get special character encodings back as strings such as "\u00f3" , but I need to replace it with the actual character, like "ó" before making another query.
How would I find these cases without actually looking for each one, one by one?
It seems you're handling JSON formatted data.
Use any of the many freely available JSON libraries to handle this (and other parsing issues) for you instead of trying to do it manually.
The one from JSON.org is pretty widely used, but there are surely others that work just as well.
Related
My current requirement is to store Unicode and other special characters, such as double quotes in MySQL tables. For that purpose, as many have suggested, we should use Apache's StringEscapeUtils.escapeJava() method. The problem is, although this method does replace special characters with their respective unicodes (\uxxxx), the MySQL table stores them as uxxxx and not \uxxxx. Due to this, when I try to decode it while fetching from the database, StringEscapeUtils.unescapeJava() fails (since it cannot find the '\').
Here are my questions:
Why is it happening (that is, '\' are skipped by the table).
What is the solution for this?
Don't use Unicode "codepoints" (\uxxxx), use UTF8.
Dont' use any special functions. Instead announce that everything is UTF-8 (utf8mb4 in MySQL).
See Best Practice
(If you are being provided \uxxxx, then you are stuck with converting to utf8 first. If your real question is on how to convert, then ask it that way.)
`
Application needs to validate the different input XML(s) messages for non-printable ascii characters. We currently know two options to do this.
Change the XSD to include the restriction.
Validate the input xml string in java application using Regular Expression
Which approach is better in terms of performance as our application has to return the response within a few seconds? Is there any other option available to do this?
It's mainly a matter of opinion but if you have an XSD that seems to be the natural place to include the validations. The only thing you may need to consider is that via XSD you will either fail or pass, whereas with ad-hoc java validation you can ignore non-printable, or replace or take an action without failing the input completely.
The only characters that are (a) ASCII, (b) non-printable, and (c) allowed in XML 1.0 documents are CR, NL, and TAB. I find it hard to see why excluding those three characters is especially important, but if you already have an XSD schema, then it makes sense to add the restriction there.
The usual approach is not to make these three characters invalid, but to treat them as equivalent to space characters, which you can do by using a data type that has the whitespace facet value "normalize" or "collapse".
As I looked for a reliable character for splitting strings, I found out an earlier post about using "((char)007)" as a split character so i decided to use that for a request/response project I'm building.
But when I send data with "((char)007)" between data parts that need to be seperated, the data arrives at the other end of the socket like this instead "teq□weq□1231□21231".
So splitting this data properly is unsuccessful at the moment. Any ideas about why this happens and what kind of approach I might follow to fix this, what else I can use for splitting, any ideas would be appreciated, thanks.
If you are printing control characters (BELL) then your console may not print it out properly.
In any case, consider just sending a structure like a serialized object (be careful with deserializing user-supplied content) or perhaps JSON. Any structure with a standardized format will do better in the long term versus arbitrary splitting on a magic character
The following code is very well known to convert accented chars into plain Text:
Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
I replaced my "hand made" method by this one, but i need to understand the "regex" part of the replaceAll
1) What is "InCombiningDiacriticalMarks" ?
2) Where is the documentation of it? (and similars?)
Thanks.
\p{InCombiningDiacriticalMarks} is a Unicode block property. In JDK7, you will be able to write it using the two-part notation \p{Block=CombiningDiacriticalMarks}, which may be clearer to the reader. It is documented here in UAX#44: “The Unicode Character Database”.
What it means is that the code point falls within a particular range, a block, that has been allocated to use for the things by that name. This is a bad approach, because there is no guarantee that the code point in that range is or is not any particular thing, nor that code points outside that block are not of essentially the same character.
For example, there are Latin letters in the \p{Latin_1_Supplement} block, like é, U+00E9. However, there are things that are not Latin letters there, too. And of course there are also Latin letters all over the place.
Blocks are nearly never what you want.
In this case, I suspect that you may want to use the property \p{Mn}, a.k.a. \p{Nonspacing_Mark}. All the code points in the Combining_Diacriticals block are of that sort. There are also (as of Unicode 6.0.0) 1087 Nonspacing_Marks that are not in that block.
That is almost the same as checking for \p{Bidi_Class=Nonspacing_Mark}, but not quite, because that group also includes the enclosing marks, \p{Me}. If you want both, you could say [\p{Mn}\p{Me}] if you are using a default Java regex engine, since it only gives access to the General_Category property.
You’d have to use JNI to get at the ICU C++ regex library the way Google does in order to access something like \p{BC=NSM}, because right now only ICU and Perl give access to all Unicode properties. The normal Java regex library supports only a couple of standard Unicode properties. In JDK7 though there will be support for the Unicode Script propery, which is just about infinitely preferable to the Block property. Thus you can in JDK7 write \p{Script=Latin} or \p{SC=Latin}, or the short-cut \p{Latin}, to get at any character from the Latin script. This leads to the very commonly needed [\p{Latin}\p{Common}\p{Inherited}].
Be aware that that will not remove what you might think of as “accent” marks from all characters! There are many it will not do this for. For example, you cannot convert Đ to D or ø to o that way. For that, you need to reduce code points to those that match the same primary collation strength in the Unicode Collation Table.
Another place where the \p{Mn} thing fails is of course enclosing marks like \p{Me}, obviously, but also there are \p{Diacritic} characters which are not marks. Sadly, you need full property support for that, which means JNI to either ICU or Perl. Java has a lot of issues with Unicode support, I’m afraid.
Oh wait, I see you are Portuguese. You should have no problems at all then if you only are dealing with Portuguese text.
However, you don’t really want to remove accents, I bet, but rather you want to be able to match things “accent-insensitively”, right? If so, then you can do so using the ICU4J (ICU for Java) collator class. If you compare at the primary strength, accent marks won’t count. I do this all the time because I often process Spanish text. I have an example of how to do this for Spanish sitting around here somewhere if you need it.
Took me a while, but I fished them all out:
Here's regex that should include all the zalgo chars including ones bypassed in 'normal' range.
([\u0300–\u036F\u1AB0–\u1AFF\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62])
Hope this saves you some time.
Is there any real way to represent a URL (which more than likely will also have a query string) as a filename in Java without obscuring the original URL completely?
My first approach was to simply escape invalid characters with arbitrary replacements (for example, replacing "/" with "_", etc).
The problem is, as in the example of replacing with underscores is that a URL such as "app/my_app" would become "app_my_app" thus obscuring the original URL completely.
I have also attempted to encode all the special characters, however again, seeing crazy %3e %20 etc is really not clear.
Thank you for any suggestions.
Well, you should know what you want here, exactly. Keep in mind that the restrictions on file names vary between systems. On a Unix system you probably only need to escape the virgule somehow, whereas on Windows you need to take care of the colon and the question mark as well.
I guess, the safest thing would be to encode anything that could potentially clash (everything non-alphanumeric would be a good candidate, although you migth adapt this to the platform) with percent-encoding. It's still somewhat readable and you're guaranteed to get the original URL back.
Why? URL-encoding is already defined in an RFC: there's not much point in reinventing it. Basically you must have an escape character such as %, otherwise you can't tell whether a character represents itself or an escape. E.g. in your example app_my_app could represent app/my/app. You therefore also need a double-escape convention so you can represent the escape character itself. It is not simple.