Translate UTF-8 character encoding function from PHP to Java

I am trying to translate a PHP encoding function into an Android Java method. Because Java's string length function handles UTF-8 strings differently, I have failed to make the translated Java code consistent with the PHP code when converting the second, UTF-8 string str2. The first, non-UTF-8 string works fine.
The original PHP code is:
function myhash_php($string, $key) {
    $strLen = strlen($string);
    $keyLen = strlen($key);
    $j = 0; $hash = "";
    for ($i = 0; $i < $strLen; $i++) {
        $ordStr = ord(substr($string, $i, 1));
        if ($j == $keyLen) { $j = 0; }
        $ordKey = ord(substr($key, $j, 1));
        $j++;
        $hash .= strrev(base_convert(dechex($ordStr + $ordKey), 16, 36));
    }
    return $hash;
}
$str1 = "good friend" ;
$str2 = "好友" ; // strlen($str2) == 6
$key = "iuyhjf476" ;
echo "php encode str1 '". $str1 ."'=".myhash_php($str1, $key)."<br>";
echo "php encode str2 '". $str2 ."'=".myhash_php($str2, $key)."<br>";
The PHP output is:
php encode str1 'good friend'=s5c6g6o5u3o5m4g4b4z516
php encode str2 '好友'=a9u7m899x6p6
The current translated Java code, which produces a wrong result, is:
public static String hash_java(String string, String key) {
    //Integer strLen = byteLenUTF8(string); // consistent with php strlen("好友") == 6
    //Integer keyLen = byteLenUTF8(key);    // byteLenUTF8("好友") == 6
    Integer strLen = string.length(); // "好友".length() == 2
    Integer keyLen = key.length();
    int j = 0;
    String hash = "";
    int ordStr, ordKey;
    for (int i = 0; i < strLen; i++) {
        ordStr = ord_java(string.substring(i, i + 1)); // string is String, php substr($string,$i,$n) == java string.substring(i, i+n)
        // ordStr = ord_java(string[i]); // string is byte[], php substr($string,$i,$n) == java string.substring(i, i+n)
        if (j == keyLen) { j = 0; }
        ordKey = ord_java(key.substring(j, j + 1));
        j++;
        hash += strrev(base_convert(dechex(ordStr + ordKey), 16, 36));
    }
    return hash;
}

// return the ASCII code of the first character of str
public static int ord_java(String str) {
    return (int) str.charAt(0);
}

public static String dechex(int input) {
    String hex = Integer.toHexString(input);
    return hex;
}

public static String strrev(String str) {
    return new StringBuilder(str).reverse().toString();
}

public static String base_convert(String str, int fromBase, int toBase) {
    return Integer.toString(Integer.parseInt(str, fromBase), toBase);
}
String str1 = "good friend" ;
String str2 = "好友" ;
String key = "iuyhjf476" ;
Log.d(LogTag,"java encode str1 '"+ str1 +"'="+hash_java(str1, key)) ;
Log.d(LogTag,"java encode str2 '"+ str2 +"'="+hash_java(str2, key)) ;
The Java output is:
java encode str1 'good friend'=s5c6g6o5u3o5m4g4b4z516
java encode str2 '好友'=arh4ng
The encoded output of the UTF-8 string str2 from the Java method is not correct. How can I fix this?

Do not use string literals for testing - this is prone to yield unexpected results unless you are fully aware of what you are doing and how the source file is encoded. For UTF-8 you should treat everything as raw bytes and never use a String for en/decoding. Example in PHP:
$test1 = pack( 'H*', '414243' ); // "ABC" in hexadecimal: 2 digits per byte
$test2 = pack( 'H*', 'e5a5bde58f8b' ); // "好友" in hexadecimal, UTF-8 encoded, 3 bytes per character
Example in Java:
byte[] test1 = new byte[] { 0x41, 0x42, 0x43 }; // "ABC"
byte[] test2 = new byte[] { (byte)0xe5, (byte)0xa5, (byte)0xbd, (byte)0xe5, (byte)0x8f, (byte)0x8b }; // "好友"
Only this way can you make sure your test is set up correctly and independent of how the source file is encoded. If your Java file were encoded in UTF-8 and your PHP file in UTF-16LE, you would fail even worse, simply because you did not separate the definition (raw bytes) from the assumption (strings interpreted according to a text encoding).
(This is also a big misunderstanding when people want to en/decrypt texts: they operate on (any programming language's) String rather than the actual bytes and then wonder why different results occur with a different programming language.)

In Java, convert the string to a byte array, using UTF-8 character encoding. Then, apply your encoding algorithm to this byte array instead of the string.
Your PHP program seems to implicitly do the same thing, treating e.g. the character 好 as three individual byte values, according to UTF-8 encoding.
EDIT:
In the comments, you say you receive the string from the user entering it on Android. So, you start with a Java String coming from some UI widget.
And you need that Java String to give the same result that the given PHP function produces when fed with the same UTF-8 string. The resulting string only uses ASCII characters, so its character encoding is less problematic (it doesn't matter whether it's e.g. ISO-8859-1 or UTF-8).
The PHP string datatype is ignorant of the encoding; it just stores a sequence of bytes. So in general it might contain ISO-8859-1 bytes where one byte represents one character, or UTF-8 byte sequences where characters often occupy multiple bytes, or any other encoding. The PHP string does not know how its bytes are meant to be interpreted as characters; it just sees and counts bytes.
So, what your PHP string calls "characters" effectively are the bytes of the UTF-8 encoding, and the Java side must emulate this behaviour when running its algorithm.
Java has a String data type that is very different from PHP's: it is not based on byte sequences, but (mainly) sees a string as a sequence of characters. So, if you work with the characters of the Java String, you will not see the same sequence of elements that PHP sees.
When Java iterates over a String like "好友", there are two steps, one for each of the two characters (seeing the character's Unicode code point number), while PHP has six steps, one for each byte of the UTF-8 representation, seeing the byte value.
So, to emulate PHP, in Java you have to convert the String to a byte[] array using UTF-8 encoding. This way, one Java byte will correspond to one PHP character.
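For instance, here is a minimal sketch of such a byte-based variant, reusing the helper methods dechex, base_convert and strrev from the question; the method name hash_java_utf8 is just illustrative, and java.nio.charset.StandardCharsets is assumed to be imported:

public static String hash_java_utf8(String string, String key) {
    byte[] strBytes = string.getBytes(StandardCharsets.UTF_8); // 6 bytes for "好友", like PHP strlen()
    byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
    int j = 0;
    StringBuilder hash = new StringBuilder();
    for (int i = 0; i < strBytes.length; i++) {
        int ordStr = strBytes[i] & 0xFF; // emulate PHP ord(): unsigned byte value 0..255
        if (j == keyBytes.length) { j = 0; }
        int ordKey = keyBytes[j] & 0xFF;
        j++;
        hash.append(strrev(base_convert(dechex(ordStr + ordKey), 16, 36)));
    }
    return hash.toString();
}

With the inputs from the question this reproduces the PHP results, including a9u7m899x6p6 for "好友".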
Remark
By the way, the wording "UTF-8 string" does not make sense in Java.
That is different from PHP where e.g. "Maß" as ISO-8859-1 string (having a length of 3) differs from "Maß" as UTF-8 string (having a length of 4).
In Java, Strings are sequences of characters, and that's the reason why e.g. "好友" has a length of 2, as it's just two characters that happen to come from a non-Latin script. [This is true for most Unicode characters you'll typically encounter, but there are exceptions.] In Java, terms like UTF-8 matter only when converting between strings and byte sequences.
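To illustrate the remark, a quick sketch (assuming java.nio.charset.StandardCharsets is imported):

String s = "Maß";
System.out.println(s.length());                                      // 3 characters
System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length);  // 3 bytes
System.out.println(s.getBytes(StandardCharsets.UTF_8).length);       // 4 bytes: ß becomes two bytes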

Related

Java, trying to create a specific network byte header based on length of command

I'm running into some trouble when attempting to create a network byte header. The header should be 2 bytes long, which simply defines the length of the following command.
For example, the command String "HED>0123456789ABCDEF" is 20 characters long, which is 0014 as hex signed two's complement. Creating the network byte header for this command works, as the command is under 124 characters. The following snippet of code essentially works out the byte header and prefixes the command with \u0000\u0014 when the command is under 124 characters.
However, for commands that are 124 characters or longer, the code in the if block doesn't work. Therefore, I looked into possible alternatives and tried a couple of things around generating hex characters and setting them as the network byte header, but as they aren't bytes this isn't going to work (as seen in the else block). Instead, the else block simply returns 0090 for a command that is 153 characters long, which is technically correct, but I'm not able to use this 'length' header the same way as the if block's length header.
public static void main(String[] args) {
    final String commandHeader = "HED>";
    final String command = "0123456789ABCDEF";
    short commandLength = (short) (commandHeader.length() + command.length());
    char[] array;
    if (commandLength < 124)
    {
        final ByteBuffer bb = ByteBuffer.allocate(2).putShort(commandLength);
        array = new String(bb.array()).toCharArray();
    }
    else
    {
        final ByteBuffer bb = ByteBuffer.allocate(2).putShort(commandLength);
        array = convertToHex(bb.array());
    }
    final String message = new String(array) + commandHeader + command;
    System.out.println(message);
}
private static char[] convertToHex(byte[] data) {
    final StringBuilder buf = new StringBuilder();
    for (byte b : data) {
        int halfByte = (b >>> 4) & 0x0F;
        int twoHalves = 0;
        do {
            if ((0 <= halfByte) && (halfByte <= 9))
                buf.append((char) ('0' + halfByte));
            else
                buf.append((char) ('a' + (halfByte - 10)));
            halfByte = b & 0x0F;
        } while (twoHalves++ < 1);
    }
    return buf.toString().toCharArray();
}
Furthermore, I have managed to get this working in Python 2 with the following three lines, no less! This returns the network byte header for a 153-character command as \x00\x99:
msg_length = len(str_header + str_command)
command_length = pack('>h', msg_length)
command = command_length + str_header + str_command
Also simply replicated by running Python 2 and entering the following commands:
In [1]: import struct
In [2]: struct.pack('>h', 153)
Out[2]: '\x00\x99'
Any assistance, or light that could be shed to resolve this issue would be greatly appreciated.
The basic problem is that you (try to) convert fundamentally binary data to character data. Furthermore, you do it using the platform's default charset, which varies from machine to machine.
I think you have mischaracterized the problem slightly, however. I am confident that it arises when command.length() is at least 124, so that commandLength, which includes the length of commandHeader, too, is at least 128. You would also find that there are some (much) larger command lengths that worked, too.
The key point here is that when any of the bytes in the binary representation of the length have their most-significant bit set, that is meaningful to some character encodings, especially UTF-8, which is a common (but not universal) default. Unless you get very lucky, binary lengths containing any such bytes will not be correctly decoded into characters in UTF-8. Moreover, they may get decoded into characters successfully but differently on machines that use different charsets for the purpose.
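A quick sketch of that failure mode, assuming a UTF-8 default charset (needs java.nio.ByteBuffer and java.util.Arrays): 153 encodes as 0x0099, and a stand-alone 0x99 byte is not valid UTF-8, so it decodes to the replacement character U+FFFD.

byte[] raw = ByteBuffer.allocate(2).putShort((short) 153).array();  // {0x00, (byte) 0x99}
String asText = new String(raw);                                    // lossy on a UTF-8 platform
System.out.println(Arrays.toString(asText.getBytes()));             // [0, -17, -65, -67], not [0, -103]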
You also have another, related inconsistency. You are formatting data for transmission over the network, which is a byte-oriented medium. The transmission will be a sequence of bytes. But you are measuring and reporting the number of characters in the decoded internal representation, not the number of bytes in the encoded representation that will go over the wire. The two counts are the same for your example command, but they would differ for some strings that you could express in Java.
Additionally, your code is inconsistent with your description of the format wanted. You say that the "network byte header" should be four bytes long, but your code emits only two.
You can address all these issues by taking character encoding explicitly into account, and by avoiding the unneeded and inappropriate conversion of raw binary data to character data. The ByteBuffer class you're already using can help with that. For example:
public static void main(String[] args) throws IOException {
    String commandHeader = "HED>";
    // a 128-byte command
    String command = "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF";
    // Convert characters to bytes, and do so with a specified charset
    // Note that ALL Java implementations are required to support UTF-8
    byte[] commandHeaderBytes = commandHeader.getBytes("UTF-8");
    byte[] commandBytes = command.getBytes("UTF-8");
    // Measure the command length in bytes, since that's what the receiver
    // will need to know
    int commandLength = commandHeaderBytes.length + commandBytes.length;
    // Build the whole message in your ByteBuffer
    // Allow a 4-byte length field, per spec
    ByteBuffer bb = ByteBuffer.allocate(commandLength + 4);
    bb.putInt(commandLength)
      .put(commandHeaderBytes)
      .put(commandBytes);
    // DO NOT convert to a String or other character type. Output the
    // bytes directly.
    System.out.write(bb.array());
    System.out.println();
}

Will the result of String.getBytes() ever contain zeros?

I have tried numerous Strings with random characters, and except for the empty string "", their .getBytes() byte arrays never seem to contain any 0 values (like {123, -23, 54, 0, -92}).
Is it always the case that the .getBytes() byte arrays contain no zero, except for an empty string?
Edit: the previous test code is as follows. Now I have learned that in Java 8 the result always seems to be "contains no 0" if the String is made up of (char) (random.nextInt(65535) + 1), and "contains 0" if the String contains (char) 0.
private static String randomString(int length) {
    Random random = new Random();
    char[] chars = new char[length];
    for (int i = 0; i < length; i++) {
        int integer = random.nextInt(65535) + 1;
        chars[i] = (char) (integer);
    }
    return new String(chars);
}

public static void main(String[] args) throws Exception {
    for (int i = 1; i < 100000; i++) {
        String s1 = randomString(10);
        byte[] bytes = s1.getBytes();
        for (byte b : bytes) {
            if (b == 0) {
                System.out.println("contains 0");
                System.exit(0);
            }
        }
    }
    System.out.println("contains no 0");
}
It does depend on your platform's default encoding. But in many encodings, the '\0' (null) character will result in getBytes() returning an array with a zero in it.
System.out.println("\0".getBytes()[0]);
This will work with the US-ASCII, ISO-8859-1 and the UTF-8 encodings:
System.out.println("\0".getBytes("US-ASCII")[0]);
System.out.println("\0".getBytes("ISO-8859-1")[0]);
System.out.println("\0".getBytes("UTF-8")[0]);
If you have a byte array and you want the string that corresponds to it, you can also do the reverse:
byte[] b = { 123, -23, 54, 0, -92 };
String s = new String(b);
However this will give different results for different encodings, and in some encodings it may be an invalid sequence.
And the characters in it may not be printable.
Your best bet is the ISO-8859-1 encoding, where only the null character cannot be printed:
byte[] b = { 123, -23, 54, 0, -92 };
String s = new String(b, "ISO-8859-1");
System.out.println(s);
System.out.println((int) s.charAt(3));
Edit
In the code that you posted, it's also easy to get "contains 0" if you specify the UTF-16 encoding:
byte[] bytes = s1.getBytes("UTF-16");
It's all about encoding, and you haven't specified it. When you haven't passed it as an argument to the getBytes method, it takes your platform default encoding.
To find out what that is on your platform, run this:
System.out.println(System.getProperty("file.encoding"));
On MacOS, it's UTF-8; on Windows it's likely to be one of the Windows codepages like Cp-1252. You can also specify the platform default on the command line when you run Java:
java -Dfile.encoding=UTF16 <the rest>
If you run your code that way you'll also see that it contains 0.
Is it always the case that the .getBytes() byte arrays contain no zero, except for an empty string?
No, there is no such guarantee. First, and most importantly, .getBytes() returns "a sequence of bytes using the platform's default charset". As such there is nothing preventing you from defining your own custom charset that explicitly encodes certain values as 0s.
More practically, many common encodings will include zero bytes, notably to represent the NUL character. But even if your strings don't include NULs, it's possible for the byte sequence to include 0s. In particular, UTF-16 (which Java uses internally) represents characters in two bytes, meaning ASCII characters (which only need one) are paired with a 0 byte.
You could also very easily test this yourself by trying to construct a String from a sequence of bytes containing 0s with an appropriate constructor, such as String(byte[] bytes) or String(byte[] bytes, Charset charset). For example (notice my system's default charset is UTF-8):
System.out.println("Default encoding: " + System.getProperty("file.encoding"));
System.out.println("Empty string: " + Arrays.toString("".getBytes()));
System.out.println("NUL char: " + Arrays.toString("\0".getBytes()));
System.out.println("String constructed from {0} array: " +
Arrays.toString(new String(new byte[]{0}).getBytes()));
System.out.println("'a' in UTF-16: " +
Arrays.toString("a".getBytes(StandardCharsets.UTF_16)));
prints:
Default encoding: UTF-8
Empty string: []
NUL char: [0]
String constructed from {0} array: [0]
'a' in UTF-16: [-2, -1, 0, 97]

Converting string to binary and back again does not give the same string

I'm writing a Simplified DES algorithm to encrypt and subsequently decrypt a string. Suppose I have the initial character '(', which has the binary value 00101000, which I get using the following algorithm:
public void getBinary() throws UnsupportedEncodingException {
    byte[] plaintextBinary = text.getBytes("UTF-8");
    for (byte b : plaintextBinary) {
        int val = b;
        int[] tempBinRep = new int[8];
        for (int i = 0; i < 8; i++) {
            tempBinRep[i] = (val & 128) == 0 ? 0 : 1;
            val <<= 1;
        }
        binaryRepresentations.add(tempBinRep);
    }
}
After I perform the various permutations and shifts, '(' and its binary equivalent are transformed into 10001010, whose ASCII equivalent is Š. When I come around to decryption and pass the same character through the getBinary() method, I now get the binary string 11000010 and another binary string 10001010, which translates into ASCII as x(.
Where is this rogue x coming from?
Edit: The full class can be found here.
You haven't supplied the decrypting code, so we can't know for sure, but I would guess you missed the encoding when populating your String. Java Strings are encoded in UTF-16 by default. Since you're forcing UTF-8 when encrypting, I'm assuming you're doing the same when decrypting. The problem is that when you convert your encrypted bytes to a String for storage, if you let it default to UTF-16, you probably end up with a two-byte character, because 10001010 is 138, which is beyond the 127 range of ASCII characters that get represented with a single byte.
So the "x" you're getting is the byte for the code page, followed by the actual character's byte. As suggested in the comments, you'd do better to just store the encrypted bytes as bytes, and not convert them to Strings until they're decrypted.

How to replace/remove 4(+)-byte characters from a UTF-8 string in Java?

Because MySQL 5.1 does not support 4 byte UTF-8 sequences, I need to replace/drop the 4 byte sequences in these strings.
I'm looking for a clean way to replace these characters, e.g. using Apache libraries. Replacing the characters with a question mark is fine for this case, although an ASCII equivalent would be nicer, of course.
N.B. The input is from external sources (e-mail names) and upgrading the database is not a solution at this point in time.
We ended up implementing the following method in Java for this problem.
Basically it replaces characters with a code point higher than that of the last 3-byte UTF-8 character.
The offset calculations are there to make sure we stay on Unicode code point boundaries.
public static final String LAST_3_BYTE_UTF_CHAR = "\uFFFF";
public static final String REPLACEMENT_CHAR = "\uFFFD";

public static String toValid3ByteUTF8String(String s) {
    final int length = s.length();
    StringBuilder b = new StringBuilder(length);
    for (int offset = 0; offset < length; ) {
        final int codepoint = s.codePointAt(offset);
        // do something with the codepoint
        if (codepoint > CharUtils.LAST_3_BYTE_UTF_CHAR.codePointAt(0)) {
            b.append(CharUtils.REPLACEMENT_CHAR);
        } else {
            if (Character.isValidCodePoint(codepoint)) {
                b.appendCodePoint(codepoint);
            } else {
                b.append(CharUtils.REPLACEMENT_CHAR);
            }
        }
        offset += Character.charCount(codepoint);
    }
    return b.toString();
}
Another simple solution is to use the regular expression [^\u0000-\uFFFF]. For example, in Java:
text.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
5-byte UTF-8 sequences begin with a 111110xx byte and 6-byte UTF-8 sequences begin with a 1111110x byte. It is important to note that no continuation bytes of 1-4-byte UTF-8 sequences contain values that large, because continuation bytes always have the form 10xxxxxx.
Therefore you can just go through the bytes, and every time you see a byte of the form 111110xx, emit only a '?' to the output stream/array while skipping the next 4 bytes of the input; analogously for the 6-byte sequences.
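A sketch of that byte-scanning approach; for illustration it also treats 4-byte lead bytes (11110xxx) the same way, since those are the sequences MySQL's 3-byte utf8 cannot store. The method name and the use of ByteArrayOutputStream are just one possible choice.

static byte[] replace4PlusByteSequences(byte[] utf8) {
    java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream(utf8.length);
    int i = 0;
    while (i < utf8.length) {
        int lead = utf8[i] & 0xFF;
        int seqLen;
        if (lead >= 0xFC)      seqLen = 6; // 1111110x
        else if (lead >= 0xF8) seqLen = 5; // 111110xx
        else if (lead >= 0xF0) seqLen = 4; // 11110xxx
        else if (lead >= 0xE0) seqLen = 3; // 1110xxxx
        else if (lead >= 0xC0) seqLen = 2; // 110xxxxx
        else                   seqLen = 1; // ASCII byte (or stray continuation byte)
        int end = Math.min(i + seqLen, utf8.length);
        if (seqLen >= 4) {
            out.write('?');              // emit a single '?' for the whole sequence
        } else {
            out.write(utf8, i, end - i); // copy 1-3 byte sequences unchanged
        }
        i = end;
    }
    return out.toByteArray();
}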

My java class implementation of XOR encryption has gone wrong

I am new to Java, but I am very fluent in C++ and C#, especially C#. I know how to do XOR encryption in both C# and C++. The problem is that the algorithm I wrote in Java to implement XOR encryption seems to be producing wrong results. The results are usually a bunch of spaces, and I am sure that is wrong. Here is the class below:
public final class Encrypter {
    public static String EncryptString(String input, String key)
    {
        int length;
        int index = 0, index2 = 0;
        byte[] ibytes = input.getBytes();
        byte[] kbytes = key.getBytes();
        length = kbytes.length;
        char[] output = new char[ibytes.length];
        for (byte b : ibytes)
        {
            if (index == length)
            {
                index = 0;
            }
            int val = (b ^ kbytes[index]);
            output[index2] = (char) val;
            index++;
            index2++;
        }
        return new String(output);
    }

    public static String DecryptString(String input, String key)
    {
        int length;
        int index = 0, index2 = 0;
        byte[] ibytes = input.getBytes();
        byte[] kbytes = key.getBytes();
        length = kbytes.length;
        char[] output = new char[ibytes.length];
        for (byte b : ibytes)
        {
            if (index == length)
            {
                index = 0;
            }
            int val = (b ^ kbytes[index]);
            output[index2] = (char) val;
            index++;
            index2++;
        }
        return new String(output);
    }
}
Strings in Java are Unicode - and Unicode strings are not general holders for bytes like ASCII strings can be.
You're taking a string and converting it to bytes without specifying what character encoding you want, so you're getting the platform default encoding - probably US-ASCII, UTF-8 or one of the Windows code pages.
Then you're preforming arithmetic/logic operations on these bytes. (I haven't looked at what you're doing here - you say you know the algorithm.)
Finally, you're taking these transformed bytes and trying to turn them back into a string - that is, back into characters. Again, you haven't specified the character encoding (but you'll get the same as you got converting characters to bytes, so that's OK), but, most importantly...
Unless your platform default encoding uses a single byte per character (e.g. US-ASCII), then not all of the byte sequences you will generate represent valid characters.
So, two pieces of advice come from this:
Don't use strings as general holders for bytes
Always specify a character encoding when converting between bytes and characters.
In this case, you might have more success if you specifically give US-ASCII as the encoding. EDIT: This last sentence is not true (see comments below). Refer back to point 1 above! Use bytes, not characters, when you want bytes.
If you use non-ASCII strings as keys you'll get pretty strange results. The bytes in the kbytes array will be negative. Sign extension then means that val will come out negative, and the cast to char will then produce a character in the FF80-FFFF range.
These characters will certainly not be printable, and depending on what you use to check the output you may be shown "box" or some other replacement characters.
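A minimal byte-based sketch of the advice from both answers (names are illustrative): XOR the raw bytes, keep the ciphertext as a byte[], and only turn bytes back into text, with an explicit charset, after the reverse XOR.

import java.nio.charset.StandardCharsets;

public final class XorCipher {
    // XOR data against a repeating key; applying it twice with the same key restores the input
    public static byte[] xor(byte[] data, byte[] key) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ key[i % key.length]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] plain = "good friend".getBytes(StandardCharsets.UTF_8);
        byte[] key = "iuyhjf476".getBytes(StandardCharsets.UTF_8);

        byte[] cipher = xor(plain, key);     // store or transmit these bytes, not a String built from them
        byte[] decrypted = xor(cipher, key); // XOR is its own inverse

        System.out.println(new String(decrypted, StandardCharsets.UTF_8)); // good friend
    }
}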
