How do I store accented characters in S3 metadata? - java

I am trying to store accented characters such as ò in the metadata of an S3 object. I am using the REST API which according to this page only accepts US-ASCII: http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html
Is there a way to convert Strings in Scala or Java from Bòrd to B\u00F2rd?
I have tried using Normalizer.normalize(str, Normalizer.Form.NFD) but the character when submitted to S3 is still causing an error because it appears as ò. When I try to print out the returned String it is also showing ò.

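A normalized Unicode string is only normalized in terms of how characters are composed or decomposed; it is not necessarily converted to ASCII. NFD, for example, splits ò into the letter o followed by a combining grave accent, which is still a non-ASCII character. Using NFKC would be more likely to convert characters to ASCII forms, but it certainly would not do so reliably.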
It sounds like what you want is to escape non-ascii characters. You could use e.g. UnicodeEscaper from commons-lang, and UnicodeUnescaper to translate back.
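If pulling in commons-lang is not an option, the escaping itself is small enough to sketch by hand. The class below is a hypothetical helper (the name AsciiEscaper and both method names are made up for illustration, not part of any library). It assumes the metadata values never contain a literal backslash followed by u, since that sequence would be ambiguous when translating back.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a library class: turns every non-ASCII UTF-16
// unit into a backslash-u escape with four hex digits, and back again.
final class AsciiEscaper {
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);                                  // plain ASCII passes through
            } else {
                sb.append(String.format("\\u%04X", (int) c));  // e.g. U+00F2 for the accented o
            }
        }
        return sb.toString();
    }

    static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(s, last, m.start());
            sb.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        sb.append(s, last, s.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        String escaped = escape("B\u00F2rd");
        System.out.println(escaped);            // ASCII-only form, safe for the metadata header
        System.out.println(unescape(escaped));  // original string again
    }
}
```

Supplementary characters (above U+FFFF) come out as two escapes, one per surrogate, and are reassembled correctly by unescape.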


Azure search indexer base64encode function

I have a question regarding the base64Encode function in the Azure Search indexer. When I encode the same string in Java, I get a different result than the indexer produces:
In azure
{
sourceString= "00cbc05fc051e634d7d485c7879fe7bdb4f6509a"
base64EncodedString= "MDBjYmMwNWZjMDUxZTYzNGQ3ZDQ4NWM3ODc5ZmU3YmRiNGY2NTA5YQ2",
}
In Java
{
sourceString= "00cbc05fc051e634d7d485c7879fe7bdb4f6509a"
base64EncodedString= "MDBjYmMwNWZjMDUxZTYzNGQ3ZDQ4NWM3ODc5ZmU3YmRiNGY2NTA5YQ==",
}
Why does the Azure result end in "2" while the Java result ends in "=="?
Both are decoded to the same string.
The "2" at the end of the indexer field mapping output is the number of equal signs it replaces: "==" becomes "2".
Standard base64 encoding uses equal signs as padding characters at the end of a string to make the length a multiple of 4, but they're not necessary to decode the original string.
Standard encoding also uses characters that are meaningful in URL query strings, and encoded strings are sometimes passed through URLs, so there are variants that swap out or omit characters to make the encoding URL-safe.
The indexer has 2 implementations of base64Encode and defaults to using HttpServerUtility.UrlTokenEncode, which replaces all equal signs at the end of encoded strings with the count of those equal signs. The other implementation simply omits the equal signs, and you can choose between the two behaviors by setting useHttpServerUtilityUrlTokenEncode (defaults to true but you probably want false).
You can encode the string 00>00?00 in the indexer/Java to see exactly which behavior you're getting, and check this table to see how to convert between them.
N.B. - using standard base64 decoding with HttpServerUtility.UrlTokenEncode is very misleading and should be avoided. Try encoding and decoding a, aa, aaa, sometimes you get the original string back and sometimes you don't.
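To reproduce the indexer's default output on the Java side, something along these lines should work. This is a sketch of the behavior described above (URL-safe alphabet, trailing padding replaced by its count), not code taken from the indexer; the class name UrlTokenCodec and its methods are hypothetical, and the handling of empty input is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper mimicking the UrlTokenEncode-style output described above.
final class UrlTokenCodec {

    // URL-safe base64 without '=' padding, followed by the number of '=' that were dropped.
    static String encode(byte[] data) {
        String body = Base64.getUrlEncoder().withoutPadding().encodeToString(data);
        if (body.isEmpty()) {
            return body; // assumption: empty input yields an empty token
        }
        int padCount = (4 - body.length() % 4) % 4;
        return body + padCount;
    }

    // Drops the trailing count digit and decodes the rest as unpadded URL-safe base64.
    static byte[] decode(String token) {
        if (token.isEmpty()) {
            return new byte[0];
        }
        return Base64.getUrlDecoder().decode(token.substring(0, token.length() - 1));
    }

    // Converts a token into standard, padded base64.
    static String toStandardBase64(String token) {
        if (token.isEmpty()) {
            return token;
        }
        int padCount = token.charAt(token.length() - 1) - '0';
        StringBuilder sb = new StringBuilder(
                token.substring(0, token.length() - 1).replace('-', '+').replace('_', '/'));
        for (int i = 0; i < padCount; i++) {
            sb.append('=');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String source = "00cbc05fc051e634d7d485c7879fe7bdb4f6509a";
        String token = encode(source.getBytes(StandardCharsets.UTF_8));
        System.out.println(token);                   // ends in "YQ2", as in the indexer output
        System.out.println(toStandardBase64(token)); // ends in "YQ==", as in the Java output
    }
}
```

If useHttpServerUtilityUrlTokenEncode is set to false, the trailing digit is absent and Base64.getUrlDecoder() can be applied to the value directly, since Java's decoder does not require padding.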

Split combined arabic characters in individual characters

I'm trying to convert "combined Arabic characters" (like ﻼ ) into the individual characters that compose them (e.g. ﻝ ا). I wasn't able to do this in Java or C#, and I need to split the complete list of such characters.
In C#, I'm taking the Unicode character and converting it to Windows-1256, expecting to get the 2 or 3 bytes of the individual characters that the combined character is made of, but I wasn't able to get this to work.
string unicodeWord = ((char)sc).ToString();
byte[] arabicBytes = System.Text.Encoding.GetEncoding(1256).GetBytes(unicodeWord);
but the result is always ?.
Can you help me with this? I have no problem using either Java or C#.
Thanks a lot!
string input = "ﻼ";
string normalized = input.Normalize(NormalizationForm.FormKC);
Note that there are different normalization forms with different results; FormKC (compatibility decomposition followed by canonical composition) results in ل and ا.
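Since the question allows Java as well, the same idea can be sketched with java.text.Normalizer. The class and method names below are made up for illustration; the example assumes the input is the ligature U+FEFC shown in the question.

```java
import java.text.Normalizer;

// Hypothetical wrapper around the standard Normalizer API.
final class LigatureSplitter {

    // NFKC replaces presentation-form ligatures with their compatibility decomposition.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String parts = decompose("\uFEFC"); // the lam-alef ligature from the question
        // Print each resulting code point in U+XXXX notation.
        parts.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

For this input the result consists of the two base letters U+0644 (lam) and U+0627 (alef), not the positional presentation forms.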

Convert Freebase Unicode codepoints to Java String

I'm doing some Freebase queries. Sometimes the result of the query contains Unicode characters. How could I convert those characters into a Java String? (e.g., The_Police_$0028band$0029 → The_Police_(band)). I've tried:
new String(arg_in_byte,"UTF-8")
but it doesn't work. I saw in another question that one solution is the method replaceAll but I think that there is some other method that will be cleaner.
Those aren't UTF-8 encoded; they use Freebase's own escaping of Unicode code points. If your Java client library for Freebase doesn't include the necessary decoding method, you'll need to write one yourself: take the four digits after the dollar sign ($), interpret them as a hexadecimal integer, and convert that to a Java character (Java also uses Unicode code points internally).
Here is some documentation on the encoding:
http://wiki.freebase.com/wiki/MQL_key_escaping
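A minimal sketch of such a decoder, following only the rule stated above (a dollar sign followed by four hex digits). The class name MqlKeyDecoder is hypothetical, and any other details of the MQL key format are not covered here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical decoder for keys that use "$" plus four hex digits per escaped character.
final class MqlKeyDecoder {
    private static final Pattern ESCAPE = Pattern.compile("\\$([0-9A-Fa-f]{4})");

    static String decode(String key) {
        Matcher m = ESCAPE.matcher(key);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(key, last, m.start());                    // text before the escape
            sb.append((char) Integer.parseInt(m.group(1), 16)); // hex digits -> char
            last = m.end();
        }
        sb.append(key, last, key.length());                     // remaining tail
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("The_Police_$0028band$0029")); // The_Police_(band)
    }
}
```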

escaped html won't unescaped (now: unescaped html won't escape back)

I'm currently using the Apache Commons Lang library.
When I tried unescaping this string: &#128512;
it returned the same string: &#128512;
String characters = "&#128512;";
StringEscapeUtils.unescapeHtml(characters);
Output: &#128512;
But when I tried unescaping a String with one fewer digit, it worked:
String characters = "&#12851;";
StringEscapeUtils.unescapeHtml(characters);
Output: ㈳
Any ideas? When I unescape the String "&#128512;" with an online unescaping utility, it works, so maybe it's a bug in the Apache Commons Lang library? Or can anyone recommend another library?
Thanks.
UPDATES:
I'm now able to unescape the String successfully. The problem now is that when I escape the result of that unescape, it doesn't give back the original String (&#128512;).
unescapeHtml() leaves &#128512; untouched because, as the documentation says, it only unescapes HTML 4.0 entities, which are limited to 65,536 characters. Unfortunately, 128,512 is far beyond that limit.
Have you tried using unescapeXml()?
XML character references can go up to 1,114,111 (10FFFFh).
This is the Unicode character with code point U+1F600 (decimal 128512), GRINNING FACE.
Refer the URL for details
The String you have mentioned is the HTML escape of U+1F600. If you unescape it using Apache Commons Lang, it gives you the required smiley.
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
Regarding your update that it's not converting back to &#128512;:
You can also represent a character using a Numeric Character Reference, of the form &#dddd;, where dddd is the decimal value representing the character's Unicode scalar value. You can alternatively use a hexadecimal representation &#xhhhh;, where hhhh is the hexadecimal value equivalent to the decimal value.
A good site for this
Have added few SoP to help you understand this unicode better.
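To make the surrogate-pair and numeric-character-reference points concrete, here is a library-free sketch that decodes decimal and hexadecimal references and encodes them back. It is not the Commons Lang implementation; NumericCharRefs and its methods are hypothetical names, and escape only handles non-ASCII code points (it does not touch &, < or >).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for numeric character references such as the decimal
// and hexadecimal forms described above, including supplementary characters.
final class NumericCharRefs {
    private static final Pattern REF =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));");

    static String unescape(String s) {
        Matcher m = REF.matcher(s);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(s, last, m.start());
            int cp;
            try {
                cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                cp = -1; // too large for an int, treat as invalid
            }
            if (Character.isValidCodePoint(cp)) {
                sb.appendCodePoint(cp); // writes a surrogate pair above U+FFFF
            } else {
                sb.append(m.group());   // leave invalid references untouched
            }
            last = m.end();
        }
        sb.append(s, last, s.length());
        return sb.toString();
    }

    // Iterates by code point, so a surrogate pair becomes ONE decimal reference.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String face = unescape("&#128512;");
        System.out.println(face.length());                         // 2 (two UTF-16 chars)
        System.out.println(face.codePointCount(0, face.length())); // 1 (one code point)
        System.out.println(escape(face));                          // &#128512;
    }
}
```

The round trip only works when the escaper walks code points. An escaper that walks individual char values emits two references, one per surrogate, which is one way to end up with a string that does not match the original.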
Well - the solution is pretty easy:
Use org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 instead (unless you're using Java <1.5, which you probably aren't).
String characters = "&#128512;";
StringEscapeUtils.unescapeHtml4(characters);
I think the problem is that there is no Unicode character "&#128512;",
so the method simply returns this string.
The documentation of the function only says:
Returns: a new unescaped String, null if null string input
If it's a HTML specific question, then you can just use JavaScript for this purpose.
You can do
escape("&#128512;") which gives you %26%23128512%3B
unescape("%26%23128512%3B") which gives you back &#128512;

XML, java , unicode

For text that goes into XML: if a single Unicode character is written as \ue123 in Java,
how can a string of two characters be written?
Note: I tried \u123\u123, but it didn't work.
Well \u123\u123 doesn't work because \u needs to be followed by four hex digits. But this should work fine:
String text = "\u0123\u0123";
Note that this is just the Java string literal side - it has nothing to do with XML. XML has different ways of escaping the characters it needs to, but if you use an appropriate encoding (e.g. UTF-8) you shouldn't need to escape non-ASCII characters.
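A small check of the literal, as a sketch (the class name TwoCharLiteral is hypothetical): each escape contributes exactly one char, and encoding the string as UTF-8 is what would end up in the XML file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: two four-digit escapes give a two-character string.
final class TwoCharLiteral {

    static String twoChars() {
        return "\u0123\u0123"; // U+0123 twice
    }

    public static void main(String[] args) {
        String text = twoChars();
        System.out.println(text.length());                                // 2 chars
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length); // 4 bytes in UTF-8
    }
}
```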
