How to know what variant of Base64 encoding Java class uses? - java

In my Java program I use the com.ibm.xml.enc.dom.Base64 class for encoding/decoding binary files. How can I know which variant of Base64 encoding this class uses?

There are indeed several implementations of Base64.
The idea behind this encoding is to find a way to carry raw bytes through the different network layers without their being altered.
Each layer reads bytes, and you don't want your raw data to be cut off (corrupted) because a random sequence of bytes is interpreted as an "end-of-request" marker. That's why application data is encoded as printable characters.
(more details: http://www.comptechdoc.org/independent/networking/protocol/protlayers.html)
Most Base64 tables use A-Z, a-z and 0-9 for the first 62 characters. The differences among implementations lie in the last two characters and the padding character.
The most common implementations use + and / for the last two characters of the table. But you might also find - and _, which are used by the URL-safe variant.
For your class com.ibm.xml.enc.dom.Base64, nothing is specified in the doc:
https://www.ibm.com/support/knowledgecenter/en/SSYKE2_6.0.0/com.ibm.java.security.api.doc/xmlsec/com/ibm/xml/enc/dom/Base64.html#Base64()
So you can assume that it uses the most common Base64 implementation. If you have doubts, just generate examples from random raw bytes and double-check that the output uses + and / for the 63rd and 64th characters of the table.
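For example, here is a minimal sketch of that probing idea, shown with the JDK's own java.util.Base64 encoders for comparison (the probe bytes and the comparison class are my choice, not anything from the IBM documentation). Feed the same two bytes to com.ibm.xml.enc.dom.Base64 and compare the output to see which alphabet it uses:

import java.util.Base64;

public class Base64AlphabetProbe {
    public static void main(String[] args) {
        // These two bytes produce the 6-bit values 62 and 63, i.e. exactly
        // the two table entries that differ between Base64 variants.
        byte[] probe = {(byte) 0xFB, (byte) 0xF0};

        // Standard alphabet ends with '+' and '/'
        System.out.println(Base64.getEncoder().encodeToString(probe));    // +/A=
        // URL-safe alphabet ends with '-' and '_'
        System.out.println(Base64.getUrlEncoder().encodeToString(probe)); // -_A=
    }
}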
If you need to write a generic Base64 decoder able to handle different variants of Base64, you would need to check for these special characters, check the length of the string, and check the characters used for padding. From this information you can deduce which implementation to use.
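A minimal sketch of that idea, assuming the only variants you need to handle are the standard and URL-safe alphabets with optional padding (again built on the JDK's java.util.Base64 rather than the IBM class):

import java.util.Base64;

// Normalize URL-safe characters and missing padding, then decode
// with the standard decoder.
static byte[] decodeAnyVariant(String encoded) {
    String normalized = encoded.replace('-', '+').replace('_', '/');
    int remainder = normalized.length() % 4;
    if (remainder == 2) {
        normalized += "==";
    } else if (remainder == 3) {
        normalized += "=";
    }
    return Base64.getDecoder().decode(normalized);
}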
There are many more details about the variants on Wikipedia:
https://en.wikipedia.org/wiki/Base64

Related

String that cannot be represented in UTF-8

I am creating a set of tests for the size of a String. To do so I am using something like myString.getBytes("UTF-8").length > MAX_SIZE, for which Java has a checked exception, UnsupportedEncodingException.
Just for curiosity, and to further consider other possible test scenarios, is there a text that cannot be represented by UTF-8 character encoding?
BTW: I did my homework, but nowhere (that I can find) does it specify that UTF-8/Unicode indeed includes ALL the characters which are possible. I know that its size is 2^32 and many of them are still empty, but the question remains.
The official FAQ from the Unicode Consortium is pretty clear on the matter, and is a great source of information on all questions related to UTF-8, UTF-16, etc.
In particular, notice the following quote (emphasis mine):
Q: What is a UTF?
A: A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for UTF; the two terms are merely synonyms for the same concept.
Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping must map all code points (except surrogate code points) to unique byte sequences. This includes reserved (unassigned) code points and the 66 noncharacters (including U+FFFE and U+FFFF).
So, as you can see, by definition, all UTF encodings (including UTF-8) must be able to handle all Unicode code points (except the surrogate code points of course, but they are not real characters anyways).
Additionally, here is a quote directly from the Unicode Standard that also talks about this:
The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.
As you can see, the specified range covers the whole Unicode code space (excluding the surrogate range, of course).
is there a text that cannot be represented by UTF-8 character encoding?
Java strings use UTF-16, and standard UTF-8 is designed to handle every Unicode codepoint that UTF-16 can handle (and then some).
However, do be careful, because Java also uses a Modified UTF-8 in some areas, and that does have some differences/limitations from standard UTF-8.
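To tie this back to the size test in the question, here is a minimal sketch (my own example, using the standard java.nio.charset API) that measures the UTF-8 length without the checked exception, and also shows the one caveat from the answers above: an unpaired surrogate is not a valid code point and gets replaced rather than encoded.

import java.nio.charset.StandardCharsets;

public class Utf8SizeCheck {
    public static void main(String[] args) {
        // The Charset overload of getBytes never throws
        // UnsupportedEncodingException, unlike getBytes("UTF-8").
        String text = "字 plus some ASCII";
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);

        // A lone surrogate has no UTF-8 representation; the JDK encoder
        // substitutes '?' instead of producing bytes for it.
        String loneSurrogate = "\uD800";
        byte[] bytes = loneSurrogate.getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length + " byte: " + (char) bytes[0]); // 1 byte: ?
    }
}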

Why Java char uses UTF-16?

I have been reading about how Unicode code points have evolved over time, including this article by Joel Spolsky, which says:
Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct.
But despite all this reading, I couldn't find the real reason that Java uses UTF-16 for a char.
Isn't UTF-8 far more efficient than UTF-16? For example, if I had a string which contains 1024 ASCII-range characters, UTF-16 will take 1024 * 2 bytes (2KB) of memory.
But if Java used UTF-8, it would be just 1KB of data. Even if the string has a few characters which need 2 or 3 bytes, it will still only take about a kilobyte. For example, suppose in addition to the 1024 characters, there were 10 characters of "字" (code point U+5B57, UTF-8 encoding e5 ad 97). In UTF-8, this will still take only (1024 * 1 byte) + (10 * 3 bytes) = 1KB + 30 bytes.
So this doesn't answer my question. 1KB + 30 bytes for UTF-8 is clearly less memory than 2KB for UTF-16.
Of course it makes sense that Java doesn't use ASCII for a char, but why does it not use UTF-8, which has a clean mechanism for handling arbitrary multi-byte characters when they come up? UTF-16 looks like a waste of memory in any string which has lots of non-multibyte chars.
Is there some good reason for UTF-16 that I'm missing?
Java used UCS-2 before transitioning over to UTF-16 in 2004/2005. The reason for the original choice of UCS-2 is mainly historical:
Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that could hold any character.
This, and the birth of UTF-16, is further explained by the Unicode FAQ page:
Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.
As @wero has already mentioned, random access cannot be done efficiently with UTF-8. So all things weighed up, UCS-2 was seemingly the best choice at the time, particularly as no supplementary characters had been allocated by that stage. This then left UTF-16 as the easiest natural progression beyond that.
Historically, one reason was the performance characteristics of random access or iterating over the characters of a String:
UTF-8 encoding uses a variable number (1-4) bytes to encode a Unicode character. Therefore accessing a character by index: String.charAt(i) would be way more complicated to implement and slower than the array access used by java.lang.String.
Even today, Python uses a fixed-width format for strings internally, storing either 1, 2, or 4 bytes per character depending on the maximum size of a character in that string.
Of course, this is no longer a pure benefit in Java, since, as nj_ explains, Java no longer uses a fixed-width format. But at the time the language was developed, Unicode was a fixed-width format (now called UCS-2), and this would have been an advantage.
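A small illustration of the trade-off being discussed (my own example): charAt(i) indexes 16-bit code units in constant time, but a supplementary character occupies two of them, so code-point-level work needs the codePoint* methods.

public class CharVsCodePoint {
    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so it is stored as a surrogate pair.
        String s = "A\uD83D\uDE00";

        System.out.println(s.length());                      // 3 code units
        System.out.println(s.codePointCount(0, s.length())); // 2 code points

        // charAt returns the high surrogate, not the whole character.
        System.out.println((int) s.charAt(1));               // 55357 (0xD83D)
        System.out.println(s.codePointAt(1));                // 128512 (0x1F600)
    }
}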

Is there a drastic difference between UTF-8 and UTF-16

I call a web service that gives me back a response XML with UTF-8 encoding. I checked that in Java using the getAllHeaders() method.
Now, in my Java code, I take that response, do some processing on it, and later pass it on to a different service.
Now, I googled a bit and found out that by default the encoding in Java for strings is UTF-16.
In my response XML, one of the elements had a character É. This got garbled in the post-processing request that I make to the other service.
Instead of sending É, it sent some gibberish. Now I want to know: is there really a big difference between these two encodings? And if I want to know what É converts to from UTF-8 to UTF-16, how can I do that?
Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.
Main UTF-8 pros:
Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to the US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
No null bytes, which allows the use of null-terminated strings; this introduces a great deal of backwards compatibility too.
Main UTF-8 cons:
Many common characters have different lengths, which slows down indexing and calculating a string length terribly.
Main UTF-16 pros:
Most reasonable characters, like Latin, Cyrillic, Chinese and Japanese, can be represented with 2 bytes. Unless really exotic characters are needed, this means that the 16-bit subset of UTF-16 can be used as a fixed-length encoding, which speeds up indexing.
Main UTF-16 cons:
Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
In general, UTF-16 is usually better for in-memory representation, while UTF-8 is extremely good for text files and network protocols.
There are two things:
the encoding in which you exchange data;
the internal string representation of Java.
You should not be preoccupied with the second point ;) The thing is to use the appropriate methods to convert from your data (byte arrays) to Strings (char arrays, ultimately), and to convert from Strings back to your data.
The most basic classes you can think of are CharsetDecoder and CharsetEncoder, but there are plenty of others. String.getBytes() and all Readers and Writers are other possibilities, and there are all the static methods of Character as well.
If you see gibberish at some point, it means you failed to decode or encode from the original byte data to Java strings. But again, the fact that Java strings use UTF-16 is not relevant here.
In particular, you should be aware that when you create a Reader or Writer, you should specify the encoding; if you fail to do so, the default JVM encoding will be used, and it may, or may not, be UTF-8.
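Here is a minimal sketch of both points (my own example; the windows-1252 charset stands in for one plausible wrong default): decoding bytes with the wrong charset produces the gibberish described in the question, and specifying the charset explicitly fixes it.

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {
    public static void main(String[] args) throws IOException {
        String original = "É";                                    // U+00C9

        // É encoded as UTF-8 is the two bytes 0xC3 0x89.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Decoding those bytes with the wrong charset yields mojibake: Ã‰
        System.out.println(new String(utf8, Charset.forName("windows-1252")));

        // Decoding with the charset the data was actually written in is fine.
        System.out.println(new String(utf8, StandardCharsets.UTF_8));     // É

        // The same rule applies to Readers and Writers: always pass the
        // charset explicitly instead of relying on the JVM default.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());                            // É
        }
    }
}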
This website provides UTF-to-UTF conversion:
http://www.fileformat.info/convert/text/utf2utf.htm
UTF-32 is arguably the most human-readable of the Unicode Encoding Forms, because its big-endian hexadecimal representation is simply the Unicode Scalar Value without the “U+” prefix, zero-padded to eight digits. While a UTF-32 representation does make the programming model somewhat simpler, the increased average storage size has real drawbacks, making a complete transition to UTF-32 less compelling.
HOWEVER
UTF-32 is the same as the old UCS-4 encoding and remains fixed-width. Why can it remain fixed-width? Because UTF-16 is now the format that can encode the fewest characters, it sets the limit for all formats: it was defined that 1,112,064 is the total number of code points that will ever be defined by either Unicode or ISO 10646. Since Unicode is now only defined from 0 to 10FFFF, UTF-32 sounds like a somewhat pointless encoding, as it is 32 bits wide but only about 21 bits are ever used, which makes it very wasteful.
UTF-8: Generally speaking, you should use UTF-8. Most HTML documents use this encoding.
It uses at least 8 bits of data to store each character. This can lead to more efficient storage, especially when the text contains mostly ASCII characters. But characters outside the ASCII range may require two to four bytes each!
UTF-16:
This encoding uses at least 16 bits to encode characters, including lower-order ASCII characters and higher-order non-ASCII characters.
If you are encoding text consisting mostly of non-English or non-ASCII characters, UTF-16 may result in a smaller file size. But if you use UTF-16 to encode mostly ASCII text, it will use more space.

JAVA - writing chars to a file in one byte per char

I couldn't find any documentation about this...
I want to write a bunch of chars to a file and make sure that the file's size in bytes equals the number of chars.
does anyone know what class to use?
I want to write a bunch of chars to a file and make sure that the file's size in bytes equals the number of chars.
Okay - so you need to pick an encoding which only uses a single byte per character, such as ISO-8859-1. Create a FileOutputStream, wrap it in an OutputStreamWriter specifying the encoding, and you're away. However, you need to be aware that you're limiting the range of characters which can be represented in your file.
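A minimal sketch of that recipe (the file name and sample text are just illustrative):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class OneBytePerChar {
    public static void main(String[] args) throws IOException {
        String text = "café";  // every char is within U+0000..U+00FF

        // ISO-8859-1 maps each char in U+0000..U+00FF to exactly one byte,
        // so the resulting file is text.length() bytes long.
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("chars.txt"), StandardCharsets.ISO_8859_1)) {
            out.write(text);
        }
    }
}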
Take a "Writer"
Writer do output chars
http://docs.oracle.com/javase/1.4.2/docs/api/java/io/FileWriter.html
http://docs.oracle.com/javase/1.4.2/docs/api/java/io/OutputStreamWriter.html
OutputStream do output bytes
You may try to use another encoding.
In that case you should supply a CharsetEncoder, as it has an onUnmappableCharacter method:
http://docs.oracle.com/javase/1.4.2/docs/api/java/nio/charset/CharsetEncoder.html#onUnmappableCharacter%28java.nio.charset.CodingErrorAction%29
First figure out which kinds of chars you are going to be talking about.
In C a char is eight bits, even if you need two or more chars in sequence to represent one glyph, or in human terms, one typed character. It gets worse: there are also glyphs that represent two "typed" characters, like the conjoined ff and ll glyphs you often see in typesetting.
If you are talking about C chars, then by definition every file contains the same number of chars as bytes. If you are talking about any other meaning of the word character, then you need to make some choices.
Eight-bit characters are guaranteed for the ASCII character set in UTF-8, which is by far the best character set to choose going forward, as it has explicit support in web protocols (thank you, W3C!). This means that as long as you verify that every Java char in your string is less than 128 (as an integer value), you will get one byte per char with UTF-8.
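A minimal sketch of that check (the method name is my own):

// Returns true when UTF-8 will encode this string as one byte per char,
// i.e. when every char is in the ASCII range.
static boolean isOneBytePerCharInUtf8(String s) {
    for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) >= 128) {
            return false;
        }
    }
    return true;
}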
ISO-8859-1 is a character set which also uses only one byte per character. The downside to ISO-8859-1 is that it tends to not be the default character set of anything other than Microsoft systems. Even within the Microsoft realm, UTF-8 has been making a lot of headway.
The cost to convert between the two is not overly high, but the extensibility of the two differ dramatically. Basically, if you are using ISO-8859-1 and someone tells you that the next product must support language "X", then in some cases, you must first convert to a different character set and then add the language support. With UTF-8 such a need to convert to another character set prior to adding support is rare. I mean very rare, like so rare that you should consider just using images because the language is likely dead, is likely of historical interest only, and is likely to have been documented as a dialect from a lesser tribe on an island where the primary language has full support.

Difference between UTF-8 and UTF-16?

Difference between UTF-8 and UTF-16?
Why do we need these?
MessageDigest md = MessageDigest.getInstance("SHA-256");
String text = "This is some text";
md.update(text.getBytes("UTF-8")); // Change this to "UTF-16" if needed
byte[] digest = md.digest();
I believe there are a lot of good articles about this around the Web, but here is a short summary.
Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.
Main UTF-8 pros:
Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to the US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
No null bytes, which allows the use of null-terminated strings; this introduces a great deal of backwards compatibility too.
UTF-8 is independent of byte order, so you don't have to worry about Big Endian / Little Endian issue.
Main UTF-8 cons:
Many common characters have different lengths, which slows down indexing by code point and calculating a code point count terribly.
Even though byte order doesn't matter, sometimes UTF-8 still has BOM (byte order mark) which serves to notify that the text is encoded in UTF-8, and also breaks compatibility with ASCII software even if the text only contains ASCII characters. Microsoft software (like Notepad) especially likes to add BOM to UTF-8.
Main UTF-16 pros:
BMP (basic multilingual plane) characters, including Latin, Cyrillic, most Chinese (the PRC made support for some codepoints outside BMP mandatory), most Japanese can be represented with 2 bytes. This speeds up indexing and calculating codepoint count in case the text does not contain supplementary characters.
Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, which means that the total length is still divisible by two and allows the use of 16-bit char as the primitive component of the string.
Main UTF-16 cons:
Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
Using it as a fixed-length encoding “mostly works” in many common scenarios (especially in US / EU / countries with Cyrillic alphabets / Israel / Arab countries / Iran and many others), often leading to broken support where it doesn't. This means the programmers have to be aware of surrogate pairs and handle them properly in cases where it matters!
It's variable length, so counting or indexing code points is costly, though less so than with UTF-8.
In general, UTF-16 is usually better for in-memory representation because BE/LE is irrelevant there (just use native order) and indexing is faster (just don't forget to handle surrogate pairs properly). UTF-8, on the other hand, is extremely good for text files and network protocols because there is no BE/LE issue and null-termination often comes in handy, as well as ASCII-compatibility.
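To make the size trade-off concrete, here is a small sketch (my own example) comparing encoded lengths for an ASCII string and a CJK string; note that Java's generic "UTF-16" charset prepends a 2-byte BOM, while UTF-16BE/UTF-16LE do not.

import java.nio.charset.StandardCharsets;

public class EncodedSizes {
    public static void main(String[] args) {
        String ascii = "hello";    // 5 ASCII characters
        String cjk = "字符编码";    // 4 BMP characters outside ASCII

        // UTF-8: 1 byte per ASCII char, 3 bytes per CJK char here.
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);  // 5
        System.out.println(cjk.getBytes(StandardCharsets.UTF_8).length);    // 12

        // UTF-16: 2 bytes per BMP char, plus a 2-byte BOM for "UTF-16".
        System.out.println(ascii.getBytes(StandardCharsets.UTF_16).length); // 12
        System.out.println(cjk.getBytes(StandardCharsets.UTF_16).length);   // 10
    }
}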
They're simply different schemes for representing Unicode characters.
Both are variable-length - UTF-16 uses 2 bytes for all characters in the basic multilingual plane (BMP) which contains most characters in common use.
UTF-8 uses between 1 and 3 bytes for characters in the BMP and up to 4 for characters in the current Unicode range of U+0000 to U+10FFFF; the original design was extensible up to U+7FFFFFFF if that ever became necessary... but notably all ASCII characters are represented in a single byte each.
For the purposes of a message digest it won't matter which of these you pick, so long as everyone who tries to recreate the digest uses the same option.
See this page for more about UTF-8 and Unicode.
(Note that all Java characters are UTF-16 code units within the BMP; to represent characters above U+FFFF you need to use surrogate pairs in Java.)
Security: Use only UTF-8
Difference between UTF-8 and UTF-16? Why do we need these?
There have been at least a couple of security vulnerabilities in implementations of UTF-16. See Wikipedia for details.
CVE-2008-2938
CVE-2012-2135
WHATWG and W3C have now declared that only UTF-8 is to be used on the Web.
The [security] problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that is now the mandatory encoding for all things.
Other groups are saying the same.
So while UTF-16 may continue being used internally by some systems such as Java and Windows, what little use of UTF-16 you may have seen in the past for data files, data exchange, and such, will likely fade away entirely.
This is unrelated to UTF-8 vs. UTF-16 in general (although the code below does convert to UTF-16, and the BE/LE part can be set with a single line), yet it is the fastest way to convert a String to a byte[]. For instance, it is good for exactly the case provided (a hash code), since String.getBytes(enc) is relatively slow.
import java.nio.ByteBuffer;

// Copies the String's UTF-16 code units straight into a byte[]
// (big-endian unless you set ByteOrder.LITTLE_ENDIAN on the buffer first).
static byte[] toBytes(String s) {
    byte[] b = new byte[s.length() * 2];
    ByteBuffer.wrap(b).asCharBuffer().put(s);
    return b;
}
A simple way to differentiate UTF-8 and UTF-16 is to identify the commonalities between them.
Other than sharing the same Unicode number for a given character, each is its own format.
UTF-8 tries to represent each Unicode code point with one byte (if it is ASCII), otherwise with two, three, or four bytes.
UTF-16 tries to represent each Unicode code point with two bytes to start with. If two bytes are not sufficient, it uses four bytes (a surrogate pair); it never needs more than that.
Theoretically UTF-16 is more space-efficient, but in practice UTF-8 is more space-efficient, as most of the characters being processed (around 98% of data) are ASCII, which UTF-8 represents with a single byte while UTF-16 uses two.
Also, UTF-8 is a superset of the ASCII encoding, so ASCII data is accepted as-is by any UTF-8 processor. This is not true for UTF-16: UTF-16 cannot be read as ASCII, and this is a big hurdle for UTF-16 adoption.
Another point to note is that all of Unicode as of now fits in at most 4 bytes of UTF-8 (considering all languages of the world). This is the same as UTF-16, so there is no real saving in space compared to UTF-8 ( https://stackoverflow.com/a/8505038/3343801 ).
So people use UTF-8 wherever possible.
