Encode french character in Java over smpp? - java

This is my code i am tring to Send Message Over SMPP but as Output ? is coming:
public class Encoding
{
public static void main(String[] args) throws SocketTimeoutException, AlreadyBoundException, VersionException, SMPPProtocolException, UnsupportedOperationException, IOException
{
SubmitSM sm=new SubmitSM();
String strMessage="Pour se désinscrire du service TT ZONE, envoyez GRATUITEMENT « DTTZ » ";
String utf8 = new String(strMessage.getBytes("UTF-8"));
UCS2Encoding uc = UCS2Encoding.getInstance(true);
sm.setDataCoding(2);
sm.setMessageText(utf8);
System.out.println(sm.getMessageText());
}
}

Your problem is here:
String strMessage="Pour se désinscrire du service TT ZONE, envoyez GRATUITEMENT « DTTZ » ";
String utf8 = new String(strMessage.getBytes("UTF-8"));
Why do you do that at all? Since the UCS2Encoding class accepts a String as an argument, it will take care of the encoding itself.
Just do:
sm.setMessageText(strMessage);
As I mentioned in the other question you asked, you are mixing a LOT of concepts. Remind that a String is a sequence of chars; it is independent of the encoding you use. The fact that internally Java uses UTF-16 is totally irrelevant here. It could use UTF-32 or EBCDIC, or even use carrier pigeons, the process itself would not change:
encode decode
String (char[]) --------> byte[] --------> String (char[])
And by using the String constructor taking a byte array as an argument, you create a seqeunce of chars from these bytes using the default JVM encoding. Which may, or may not, be UTF-8.
In particular, if you are using Windows, the default encoding will be windows-1252. Let us replace encode and decode above with the charset names. What you do is:
UTF-8 windows-1252
String (char[]) -------> byte[] --------------> String (char[])
"Houston, we have a problem!"
For more details, see the javadocs for Charset, CharsetEncoder and CharsetDecoder.

Related

java convert utf-8 2 byte char to 1 byte char

There are many similar questions, but no one helped me.
utf-8 can be 1 byte or 2,3,4.
ISO-8859-15 is allways 2 bytes.
But I need 1 byte character like code page Code "page 863" (IBM863).
http://en.wikipedia.org/wiki/Code_page_863
For example "é" is code point 233 and is 2 bytes long in utf 8, how can I convert it to IBM863 (1 byte) in Java?
Running on JVM -Dfile.encoding=UTF-8 possible?
Of course that conversion would mean that some characters can be lost, because IBM863 is smaller.
But I need the language specific characters, like french, è, é etc.
Edit1:
String text = "text with é";
Socket socket = getPrinterSocket( printer);
BufferedWriter bwOut = getPrinterWriter(printer,socket);
...
bwOut.write("PRTXT \"" + text + "\n");
...
if (socket != null)
{
bwOut.close();
socket.close();
}
else
{
bwOut.flush();
}
Its going a label printer with Fingerprint 8.2.
Edit 2:
private BufferedWriter getPrinterWriter(PrinterLocal printer, Socket socket)
throws IOException
{
return new BufferedWriter(new OutputStreamWriter(socket.getOutputStream()));
}
First of all: there is no such thing as "1 byte char" or, in fact, "n byte char" for whatever n.
In Java, a char is a UTF-16 code unit; depending on the (Unicode) code point, either one, or two chars, are necessary to represent a code point.
You can use the following methods:
Character.toChars() to turn a Unicode code point into a char array representing this code point;
a CharsetEncoder to perform the char[] to byte[] conversion;
a CharsetDecoder to perform the byte[] to char[] conversion.
You obtain the two latter from a Charset's .new{Encoder,Decoder}() methods.
It is crucially important here to know what your input is exactly: is it a code point, is it an encoded byte array? You'll have to adapt your code depending on this.
Final note: the file.encoding setting defines the default charset to use when you don't specify a charset to use, for instance in a FileReader constructors; you should avoid not specifying a charset to begin with!
byte[] someUtf8Bytes = ...
String decoded = new String(someUtf8Bytes, StandardCharsets.UTF8);
byte[] someIso15Bytes = decoded.getBytes("ISO-8859-15");
byte[] someCp863Bytes = decoded.getBytes("cp863");
If you start with a string, use just getBytes with a proper encoding.
If you want to write strings with a proper encoding to a socket, you can either use OutputStream instead of PrintStream or Writer and send byte arrays, or you can do:
new BufferedWriter(new OutputStreamWriter(socket.getOutputStream(), "cp863"))

How to unescape html special characters in Java?

I have some text strings that I need to process and inside the strings there are HTML special characters. For example:
10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂
I would like to convert those characters to utf-8.
I used org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 but didn't have any luck. Is there an easy way to deal with this problem?
Apache commons-text library has the StringEscapeUtils class that has the unescapeHtml4() utility method.
String utf8Str = StringEscapeUtils.unescapeHtml4(htmlStr);
You may also need unescapeXml()
#Bohemian 's code is correct, It works for me, your un-encoded string is 10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂.
Now, I'm adding another answer instead of commenting on Bohemian's answer because there are two things that still need to be mentioned:
I copy-pasted your string into HTML code and the browser can't render your characters properly, because your String is incorrectly encoded, i. e. the string has encoded the high surrogate and the low one for two-bytes-chars separately, instead of encoding the whole codepoint (it seems the original string is a UTF-16 encoded string, maybe a Java String?).
You want the string to be re-encoded to UTF-8.
Once you have your String unencoded by StringEscapeUtils.unescapeHtml(htmlStr) (which un-encodes your string successfully despite being encoded incorrectly), it doesn't have much sense talking about "string encodings" as java strings are "unaware" about encodings. (they use UTF-16 internally though).
If you need a group of bytes containing a UTF-8 encoded "string", you need to get the "raw" bytes from a String encoded as UTF-8:
String javaStr = StringEscapeUtils.unescapeHtml(htmlStr);
byte[] rawUft8String = javaStr.getBytes("UTF-8");
And do with such byte array whatever you need.
Now if what you need is to write a UTF-8 encoded string to a File, instead of that byte array you need to specify the encoding when you create the proper java.io.Writer.
Try this code to un-encode your string (change the file path first) and then open the resulting file in any editor that supports UTF-8:
java.io.Writer approach (better):
public static void main(String[] args) throws IOException {
String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
String javaString = StringEscapeUtils.unescapeHtml(str);
try(Writer output = new OutputStreamWriter(
new FileOutputStream("/path/to/testing.txt"), "UTF-8")) {
output.write(javaString);
}
}
java.io.OutputStream approach (if you already have a "raw string"):
public static void main(String[] args) throws IOException {
String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
String javaString = StringEscapeUtils.unescapeHtml(str);
try(OutputStream output = new FileOutputStream("/path/to/testing.txt")) {
for (byte b : javaString.getBytes(Charset.forName("UTF-8"))) {
output.write(b);
}
}
}

Remove Non-Ansi Chars from a UTF String and Keep Others

We have a java lib accpeting a UTF8 string as the input. But if there is any char which is a non-ansi char in the input, the lib may crash. So, we want to remove all non-ansi char from the string. But how to do that in java?
Thanks,
Try this, I pulled this from here so haven't tested it
// Create a encoder and decoder for the character encoding
Charset charset = Charset.forName("US-ASCII");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();
// This line is the key to removing "unmappable" characters.
encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
String result = inString;
try {
// Convert a string to bytes in a ByteBuffer
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(inString));
// Convert bytes in a ByteBuffer to a character ByteBuffer and then to a string.
CharBuffer cbuf = decoder.decode(bbuf);
result = cbuf.toString();
} catch (CharacterCodingException cce) {
String errorMessage = "Exception during character encoding/decoding: " + cce.getMessage();
cce.printStackTrace()
}
Take a look at String.codePointAt(index). That can give you the Unicode code point for a given character, and from there you could remove those outside your range.
How you handle the fact that a character has been removed is on your end, but keep in mind that the string you'll be sending to the library isn't necessarily the same as that provided by the client. This may or may not cause problems.
I'm not sure what you mean by ANSI here. Do you mean the Windows 1252 character encoding that people typically call ANSI? That's not ASCII and it's also not IS0-8859-1, so make sure you get your code pages correct.

How can I check if a String is encodable in some encoding?

The following test fails on converted Latin1, because illegal characters are replaced with byte with the value 63 (question mark). The problem is that these characters should better cause some exception ...
#Test
public void testEncoding() throws UnsupportedEncodingException {
final String czech = "Řízeček a šampáňo a žízeň";
// okay
final byte[] bytesInLatin2 = czech.getBytes("ISO8859-2");
// different bytes, but okay
final byte[] bytesInWin1250 = czech.getBytes("Windows-1250");
// different bytes, but okay
final byte[] bytesInUtf8 = czech.getBytes("UTF-8");
// nonsense; Ř,č,... are not in Latin1 code set!!!
final byte[] bytesInLatin1 = czech.getBytes("ISO8859-1");
System.out.println(Arrays.toString(bytesInLatin2));
System.out.println(Arrays.toString(bytesInWin1250));
System.out.println(Arrays.toString(bytesInUtf8));
System.out.println(Arrays.toString(bytesInLatin1));
System.out.flush();
final String latin2 = new String(bytesInLatin2, "ISO8859-2");
final String win1250 = new String(bytesInWin1250, "Windows-1250");
final String utf8 = new String(bytesInUtf8, "UTF-8");
final String latin1 = new String(bytesInLatin1, "ISO8859-1");
Assert.assertEquals("latin2", czech, latin2);
Assert.assertEquals("win1250", czech, win1250);
Assert.assertEquals("utf8", czech, utf8);
Assert.assertEquals("latin1", czech, latin1); // this test will fail!
}
There are many situations where the data are finally corrupted because of this behaviour of Java. Is there any library available to validate Strings if they are encodable with some encoding?
I suspect you're looking for CharsetEncoder.canEncode(CharSequence).
Charset latin2 = Charset.forName("ISO8859-2");
boolean validInLatin2 = latin2.newEncoder().canEncode(czech);
...
As an alternative to Jon Skeet's suggestion, you can also use CharsetEncoder class to do the encoding directly (with the encode method), but first call the onMalformedInput and onUnmappableCharacter methods to specify what the encoder should do when it encounters bad input.
That way most of the time you're just doing a simple encode call, but if anything goes wrong you'll get an exception.

MD5 Hash of ISO-8859-1 string in Java

I'm implementing an interface for digital payment service called Suomen Verkkomaksut. The information about the payment is sent to them via HTML form. To ensure that no one messes with the information during the transfer a MD5 hash is calculated at both ends with a special key that is not sent to them.
My problem is that for some reason they seem to decide that the incoming data is encoded with ISO-8859-1 and not UTF-8. The hash that I sent to them is calculated with UTF-8 strings so it differs from the hash that they calculate.
I tried this with following code:
String prehash = "6pKF4jkv97zmqBJ3ZL8gUw5DfT2NMQ|13466|123456||Testitilaus|EUR|http://www.esimerkki.fi/success|http://www.esimerkki.fi/cancel|http://www.esimerkki.fi/notify|5.1|fi_FI|0412345678|0412345678|esimerkki#esimerkki.fi|Matti|Meikäläinen||Testikatu 1|40500|Jyväskylä|FI|1|2|Tuote #101|101|1|10.00|22.00|0|1|Tuote #202|202|2|8.50|22.00|0|1";
String prehashIso = new String(prehash.getBytes("ISO-8859-1"), "ISO-8859-1");
String hash = Crypt.md5sum(prehash).toUpperCase();
String hashIso = Crypt.md5sum(prehashIso).toUpperCase();
Unfortunately both hashes are identical with value C83CF67455AF10913D54252737F30E21. The correct value for this example case is 975816A41B9EB79B18B3B4526569640E according to Suomen Verkkomaksut's documentation.
Is there a way to calculate MD5 hash in Java with ISO-8859-1 strings?
UPDATE: While waiting answer from Suomen Verkkomaksut, I found an alternative way to make the hash. Michael Borgwardt corrected my understanding of String and encodings and I looked for a way to make the hash from byte[].
Apache Commons is an excellent source of libraries and I found their DigestUtils class which has a md5hex function which takes byte[] input and returns a 32 character hex string.
For some reason this still doesn't work. Both of these return the same value:
DigestUtils.md5Hex(prehash.getBytes());
DigestUtils.md5Hex(prehash.getBytes("ISO-8859-1"));
You seem to misunderstand how string encoding works, and your Crypt class's API is suspect.
Strings don't really "have an encoding" - an encoding is what you use to convert between Strings and bytes.
Java Strings are internally stored as UTF-16, but that does not really matter, as MD5 works on bytes, not Strings. Your Crypt.md5sum() method has to convert the Strings it's passed to bytes first - what encoding does it use to do that? That's probably the source of your problem.
Your example code is pretty nonsensical as the only effect this line has:
String prehashIso = new String(prehash.getBytes("ISO-8859-1"), "ISO-8859-1");
is to replace characters that cannot be represented in ISO-8859-1 with question marks.
Java has a standard java.security.MessageDigest class, for calculating different hashes.
Here is the sample code
include java.security.MessageDigest;
// Exception handling not shown
String prehash = ...
final byte[] prehashBytes= prehash.getBytes( "iso-8859-1" );
System.out.println( prehash.length( ) );
System.out.println( prehashBytes.length );
final MessageDigest digester = MessageDigest.getInstance( "MD5" );
digester.update( prehashBytes );
final byte[] digest = digester.digest( );
final StringBuffer hexString = new StringBuffer();
for ( final byte b : digest ) {
final int intByte = 0xFF & b;
if ( intByte < 10 )
{
hexString.append( "0" );
}
hexString.append(
Integer.toHexString( intByte )
);
}
System.out.println( hexString.toString( ).toUpperCase( ) );
Unfortunately for you it produces the same "C83CF67455AF10913D54252737F30E21" hash. So, I guess your Crypto class is exonerated. I specifically added the prehash and prehashBytes length printouts to verify that indeed 'ISO-8859-1' is used. In this case both are 328.
When I did presash.getBytes( "utf-8" ) it produced "9CC2E0D1D41E67BE9C2AB4AABDB6FD3" (and the length of the byte array became 332). Again, not the result you are looking for.
So, I guess Suomen Verkkomaksut does some massaging of the prehash string that they did not document, or you have overlooked.
Not sure if you solved your problem, but I had a similar problem with ISO-8859-1 encoded strings with nordic ä & ö characters and calculating a SHA-256 hash to compare with stuff in documentation. The following snippet worked for me:
import java.security.MessageDigest;
//imports omitted
#Test
public void test() throws ProcessingException{
String test = "iamastringwithäöchars";
System.out.println(this.digest(test));
}
public String digest(String data) throws ProcessingException {
MessageDigest hash = null;
try{
hash = MessageDigest.getInstance("SHA-256");
}
catch(Throwable throwable){
throw new ProcessingException(throwable);
}
byte[] digested = null;
try {
digested = hash.digest(data.getBytes("ISO-8859-1"));
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
String ret = BinaryUtils.BinToHexString(digested);
return ret;
}
To transform bytes to hex string there are many options, including the apache commons codec Hex class mentioned in this thread.
If you send UTF-8 encoded data that they treat as ISO-8859-1 then that could be the source of your problem. I suggest you either send the data in ISO-8859-1 or try to communicate to Suomen Verkkomaksut that you're sending UTF-8. In a http-based protocol you do this by adding charset=utf-8 to Content-Type in the HTTP header.
A way to rule out some issues would be to try a prehash String that only contains characters that are encoded the same in UTF-8 and ISO-8859-1. From what I can see you can achieve this by removing all "ä" characters in the string you'e used.

Categories