Is there a way to fix wrong encoded strings? - java

I am getting this string via a message broker (Stomp):
João
and this is how it is supposed to be:
João
Is there a way to revert this in Java?!
Thanks!

U+00C3 Ã c3 83 LATIN CAPITAL LETTER A WITH TILDE
U+00C2 Â c3 82 LATIN CAPITAL LETTER A WITH CIRCUMFLEX
U+00A3 £ c2 a3 POUND SIGN
U+00E3 ã c3 a3 LATIN SMALL LETTER A WITH TILDE
I'm having trouble determining how this could be a data (encoding) conversion problem. Is it possible the data is just bad?
If the data isn't bad, then we have to assume you are misinterpreting the encoding. We don't know the original encoding and, unless you're doing something different, Java strings are UTF-16 internally. I don't see how João encoded in any common encoding could be interpreted as João in UTF-16.
Just to be sure, I whipped up this Python script, which found no match. I'm not entirely sure it covers all encodings or that I'm not missing a corner case, FWIW.
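#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import encodings
import pkgutil
import sys

good = 'João'
bad = 'João'

# "aliases" is a helper module in the encodings package, not a codec.
false_positives = {"aliases"}
found = set(name for imp, name, ispkg in pkgutil.iter_modules(encodings.__path__) if not ispkg)
found.difference_update(false_positives)
print(found)

# Encode the good string with every codec x, then decode the bytes with every codec y.
for x in found:
    for y in found:
        res = None
        try:
            res = good.encode(x).decode(y)
            print(res, x, y)
        except Exception:
            pass  # this pair of codecs cannot handle the string
        if res is not None:
            if res == bad:
                print("FOUND")
                sys.exit(1)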

In some cases a hack works. But best is to prevent it from ever happening.
I had this problem before when I had a servlet that printed the correct headers, HTTP content type and encoding on the page, but IE would submit forms encoded with latin1 instead of the correct encoding. So I created a quick and dirty hack (a request wrapper that detects whether the client is indeed IE and converts the data) to fix it for new data, which worked fine. For the data in the database that was already messed up, I used the following hack.
Unfortunately my hack doesn't work perfectly for your example string, but it looks very close (just an extra Â in your broken string compared to my 'theoretical cause' reproduced broken string). So perhaps my guess of "latin1" is wrong, and you should try others (such as in that other link posted by Tomas).
package peter.test;

import java.io.UnsupportedEncodingException;

/**
 * User: peter
 * Date: 2012-04-12
 * Time: 11:02 AM
 */
public class TestEncoding {
    public static void main(String args[]) throws UnsupportedEncodingException {
        // In some cases a hack works. But best is to prevent it from ever happening.
        String good = "João";
        String bad = "João";
        // this line demonstrates what the "broken" string should look like if it is reversible.
        String broken = breakString(good, bad);
        // here we show that it is fixable if broken like breakString() does it.
        fixString(good, broken);
        // this line attempts to fix the string, but it is not fixable unless broken in the same way as breakString()
        fixString(good, bad);
    }

    private static String fixString(String good, String bad) throws UnsupportedEncodingException {
        // read the Java chars as if they were latin1 (if this works, it should result in the
        // same number of bytes as Java characters; if using UTF8, it would be more bytes)
        byte[] bytes = bad.getBytes("latin1");
        // take the raw bytes, and try to convert them to a string as if they were UTF8
        String fixed = new String(bytes, "UTF8");
        System.out.println("Good: " + good);
        System.out.println("Bad: " + bad);
        System.out.println("bytes1.length: " + bytes.length);
        System.out.println("fixed: " + fixed);
        System.out.println();
        return fixed;
    }

    private static String breakString(String good, String bad) throws UnsupportedEncodingException {
        byte[] bytes = good.getBytes("UTF8");
        String broken = new String(bytes, "latin1");
        System.out.println("Good: " + good);
        System.out.println("Bad: " + bad);
        System.out.println("bytes1.length: " + bytes.length);
        System.out.println("broken: " + broken);
        System.out.println();
        return broken;
    }
}
And the result (with Sun jdk 1.7.0_03):
Good: João
Bad: João
bytes1.length: 5
broken: João
Good: João
Bad: João
bytes1.length: 5
fixed: João
Good: João
Bad: João
bytes1.length: 6
fixed: Jo�£o
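One possible explanation for the remaining mismatch, offered as a guess that nothing in this thread confirms: the text may have been mis-decoded twice. Encoding "João" as UTF-8 and decoding it as Latin-1 gives "João"; doing the same again gives J, o, Ã, the invisible control character U+0083, Â, £, o. If that control character gets lost somewhere along the way, what remains is exactly "João", and the loss would also explain why a reverse pass cannot recover it. The class and method names below are made up for this sketch.

```java
import java.nio.charset.StandardCharsets;

final class DoubleMojibake {

    /** Simulates one round of damage: UTF-8 bytes misread as Latin-1. */
    static String breakOnce(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Reverses one round of damage: Latin-1 bytes reread as UTF-8. */
    static String fixOnce(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String good = "Jo\u00E3o";                       // "João", written with an escape
        String twiceBroken = breakOnce(breakOnce(good)); // 7 chars, one of them U+0083

        // Dropping U+0083 leaves J, o, U+00C3, U+00C2, U+00A3, o: the string from the question.
        String asReceived = twiceBroken.replace("\u0083", "");
        System.out.println(asReceived.equals("Jo\u00C3\u00C2\u00A3o"));

        // With U+0083 still present, two reverse passes restore the original.
        System.out.println(fixOnce(fixOnce(twiceBroken)).equals(good));
    }
}
```

Both lines should print true. If this guess matches your data, the repair has to run before the control character is stripped, for example on the raw bytes as they arrive from the broker.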

Related

Reading Unicode Characters using Selenium Webdriver in Java

I am trying to get the text of the following web element including its unicode character (Copyright symbol).
© 2021 ABC Inc. All rights reserved.
I tried getWebDriver().findElement(elem).getText() but that gives me the following output.
? 2021 ABC Inc. All rights reserved.
I saw a few posts on this from earlier but still could not figure out how to go about reading this web element so that I capture unicode symbol (©) as well.
Appreciate any suggestions in this regard.
Thanks!
Update: I am confused after finding that number 169 decimal is the COPYRIGHT SIGN character in both Unicode and Windows-1252. So I have no idea what is really going on!
I will leave this Answer as-is in case the code is helpful to anyone trying to solve this mystery.
This is likely due to a limited (non-Unicode) character set and encoding in use by whatever means you generated your output.
Here is demo code showing your example string dumped to console via System.out by using the current default Charset, using UTF-8, and using the limited legacy Windows-1252.
See this example code run live at IdeOne.com.
import java.util.*;
import java.lang.*;
import java.io.*;
import java.nio.charset.StandardCharsets ;
import java.nio.charset.Charset ;

/* Name of the class has to be "Main" only if the class is public. */
class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        String blurb = "© 2021 ABC Inc. All rights reserved." ;

        // The character set and encoding currently in use by `System.out` is not known, some default.
        System.out.println( "----------| default |--------------------------" );
        System.out.println( "blurb: " + blurb ) ;

        // Let's set the character set and encoding to UTF-8 by wrapping `System.out` in a `PrintStream`.
        System.out.println( "----------| UTF-8 |--------------------------" );
        try
        {
            PrintStream printStream = new PrintStream( System.out , true , StandardCharsets.UTF_8.name() );
            printStream.println( "blurb: " + blurb );
        }
        catch ( UnsupportedEncodingException e )
        {
            e.printStackTrace();
        }

        // In contrast, try Windows-1252 character set.
        System.out.println( "----------| windows-1252 |--------------------------" );
        // Verify windows-1252 charset is available on the current JVM.
        String windows1252CharSetName = "windows-1252";
        boolean isWindows1252CharsetAvailable = Charset.availableCharsets().keySet().contains( windows1252CharSetName );
        if ( isWindows1252CharsetAvailable )
        {
            System.out.println( "isWindows1252CharsetAvailable = " + isWindows1252CharsetAvailable );
        } else
        {
            System.out.println( "FAIL - No charset available for name: " + windows1252CharSetName );
        }

        // Print the blurb.
        try
        {
            PrintStream printStream = new PrintStream( System.out , true , windows1252CharSetName );
            printStream.println( "blurb: " + blurb );
        }
        catch ( UnsupportedEncodingException e )
        {
            e.printStackTrace();
        }
    }
}
When run.
----------| default |--------------------------
blurb: © 2021 ABC Inc. All rights reserved.
----------| UTF-8 |--------------------------
blurb: © 2021 ABC Inc. All rights reserved.
----------| windows-1252 |--------------------------
isWindows1252CharsetAvailable = true
blurb: � 2021 ABC Inc. All rights reserved.
As expected, we see the COPYRIGHT SIGN character (code point 169 decimal) appear properly for Unicode but fail for Windows-1252. According to Wikipedia,
Recommended reading: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
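One way to dig into the mystery from the update is to compare the bytes each charset produces for ©. The sketch below assumes the JVM provides windows-1252 (the demo above checks for it); the class name CopyrightBytes is made up. Windows-1252 encodes © as the single byte A9, while UTF-8 needs the two bytes C2 A9, so a console that decodes its input as UTF-8 would treat a lone A9 as malformed and show a replacement character. Whether that is what happened on IdeOne is an assumption, not something verified here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CopyrightBytes {

    /** Formats bytes as space-separated upper-case hex, e.g. "C2 A9". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sign = "\u00A9"; // COPYRIGHT SIGN, code point 169
        System.out.println("UTF-8:        " + hex(sign.getBytes(StandardCharsets.UTF_8)));
        System.out.println("windows-1252: " + hex(sign.getBytes(Charset.forName("windows-1252"))));
    }
}
```

Seen this way, the character itself is representable in both charsets; what differs is the byte sequence, and the reader of those bytes has to use the same charset as the writer.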

How to fix "GetStatus Write RFID_API_UNKNOWN_ERROR data(x)- Field can Only Take Word values" Android RFID 8500 Zebra

I am trying to develop an application to read and write RF tags. Reading is flawless, but I'm having issues with writing. Specifically, I get the error "GetStatus Write RFID_API_UNKNOWN_ERROR data(x)- Field can Only Take Word values".
I have tried reverse-engineering the Zebra RFID API Mobile app by obtaining the .apk and decoding it, but the code is obfuscated and I am not able to decipher why that application's Write works and mine doesn't.
I see the error in https://www.ptsmobile.com/rfd8500/rfd8500-rfid-developer-guide.pdf on page 185, but I have no idea what's causing it.
I tried forcibly converting the writeData to hex before I realized that the API does that on its own. I also tried changing the length of the writeData, but it just gets a null value. I'm so lost.
public boolean WriteTag(String sourceEPC, long Password, MEMORY_BANK memory_bank, String targetData, int offset) {
    Log.d(TAG, "WriteTag " + targetData);
    try {
        TagData tagData = null;
        String tagId = sourceEPC;
        TagAccess tagAccess = new TagAccess();
        tagAccess.getClass();
        TagAccess.WriteAccessParams writeAccessParams = tagAccess.new WriteAccessParams();
        String writeData = targetData; //write data in string
        writeAccessParams.setAccessPassword(Password);
        writeAccessParams.setMemoryBank(MEMORY_BANK.MEMORY_BANK_USER);
        writeAccessParams.setOffset(offset); // start writing from word offset 0
        writeAccessParams.setWriteData(writeData);
        // set retries in case of partial write happens
        writeAccessParams.setWriteRetries(3);
        // data length in words
        System.out.println("length: " + writeData.length()/4);
        System.out.println("length: " + writeData.length());
        writeAccessParams.setWriteDataLength(writeData.length()/4);
        // 5th parameter bPrefilter flag is true which means API will apply pre filter internally
        // 6th parameter should be true in case of changing EPC ID it self i.e. source and target both is EPC
        boolean useTIDfilter = memory_bank == MEMORY_BANK.MEMORY_BANK_EPC;
        reader.Actions.TagAccess.writeWait(tagId, writeAccessParams, null, tagData, true, useTIDfilter);
    } catch (InvalidUsageException e) {
        System.out.println("INVALID USAGE EXCEPTION: " + e.getInfo());
        e.printStackTrace();
        return false;
    } catch (OperationFailureException e) {
        //System.out.println("OPERATION FAILURE EXCEPTION");
        System.out.println("OPERATION FAILURE EXCEPTION: " + e.getResults().toString());
        e.printStackTrace();
        return false;
    }
    return true;
}
With
Password being 00
sourceEPC being the Tag ID obtained after reading
Memory Bank being MEMORY_BANK.MEMORY_BANK_USER
target data being "8426017056458"
offset being 0
It just keeps giving me "GetStatus Write RFID_API_UNKNOWN_ERROR data(x)- Field can Only Take Word values" and I have no idea why. I don't know what a "Word value" is either, even though I've searched for it. This all happens under the OperationFailureException. Any help would be appreciated, as there are almost no resources online for this kind of thing.
Even though this question is a bit old, I had the same problem, so as far as I know this should be the answer.
Your target data "8426017056458" has length 13, and in writeAccessParams.setWriteDataLength(writeData.length()/4) you divide it by four. Integer division gives 13 / 4 = 3 words, which covers only 12 hex digits, so the data you are trying to write is longer than the WriteDataLength you set. That is what throws the error.
One 'word' is 4 hex digits, i.e. 16 bits. So you have to pad your data to a whole number of words first and convert it to hex.
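The padding step can be sketched in plain Java without touching the reader API. This is only an illustration: the class WordPadding and its methods are made up, and padding with zeros on the left is an assumption (the thread does not say which side the reader expects), so check that against your tag layout.

```java
final class WordPadding {

    /** One 16-bit word is written as four hex digits. */
    static final int HEX_DIGITS_PER_WORD = 4;

    /** Left-pads a hex string with zeros until its length is a multiple of four. */
    static String padToWholeWords(String hex) {
        int remainder = hex.length() % HEX_DIGITS_PER_WORD;
        if (remainder == 0) {
            return hex;
        }
        StringBuilder padded = new StringBuilder();
        for (int i = remainder; i < HEX_DIGITS_PER_WORD; i++) {
            padded.append('0');
        }
        return padded.append(hex).toString();
    }

    /** Number of 16-bit words covered by an already padded hex string. */
    static int wordCount(String paddedHex) {
        return paddedHex.length() / HEX_DIGITS_PER_WORD;
    }

    public static void main(String[] args) {
        String padded = padToWholeWords("8426017056458");
        System.out.println(padded + " -> " + wordCount(padded) + " words");
    }
}
```

For the value from the question this prints 0008426017056458 -> 4 words, so the padded string and the word count passed to setWriteDataLength would agree.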

Mapping characters to keycodes for international keysets

I built a Pi Zero keyboard emulator as described here:
https://www.rmedgar.com/blog/using-rpi-zero-as-keyboard-setup-and-device-definition
I make it type text that it reads from a local text file (everything developed in Java - for reasons :) ).
My problem now is that the keyboard layouts configured on the various computers my Pi Zero is attached to differ a lot (German, English, French, ...). Depending on the computer, this leads to typing mistakes (e.g., z instead of y).
So I built some "translation tables" that map characters to the keycodes that fit the computer. Such a table looks like this:
public scancodes_en_us() {
    //We have (Character, (scancode, modifier))
    table.put("a",Pair.create("4","0"));
    table.put("b",Pair.create("5","0"));
    table.put("c",Pair.create("6","0"));
    table.put("d",Pair.create("7","0"));
    table.put("e",Pair.create("8","0"));
    table.put("f",Pair.create("9","0"));
    table.put("g",Pair.create("10","0"));
    table.put("h",Pair.create("11","0"));
    table.put("i",Pair.create("12","0"));
    table.put("j",Pair.create("13","0"));
    table.put("k",Pair.create("14","0"));
    table.put("l",Pair.create("15","0"));
    table.put("m",Pair.create("16","0"));
    table.put("n",Pair.create("17","0"));
    table.put("o",Pair.create("18","0"));
    table.put("p",Pair.create("19","0"));
    table.put("q",Pair.create("20","0"));
    table.put("r",Pair.create("21","0"));
    table.put("s",Pair.create("22","0"));
    table.put("t",Pair.create("23","0"));
    table.put("u",Pair.create("24","0"));
    table.put("v",Pair.create("25","0"));
    table.put("w",Pair.create("26","0"));
    table.put("x",Pair.create("27","0"));
    table.put("y",Pair.create("28","0"));
    table.put("z",Pair.create("29","0"));
    table.put("A",Pair.create("4","2"));
    table.put("B",Pair.create("5","2"));
    table.put("C",Pair.create("6","2"));
    table.put("D",Pair.create("7","2"));
    table.put("E",Pair.create("8","2"));
    table.put("F",Pair.create("9","2"));
    table.put("G",Pair.create("10","2"));
    table.put("H",Pair.create("11","2"));
    table.put("I",Pair.create("12","2"));
    table.put("J",Pair.create("13","2"));
    table.put("K",Pair.create("14","2"));
    table.put("L",Pair.create("15","2"));
    table.put("M",Pair.create("16","2"));
    table.put("N",Pair.create("17","2"));
    table.put("O",Pair.create("18","2"));
    table.put("P",Pair.create("19","2"));
    table.put("Q",Pair.create("20","2"));
    table.put("R",Pair.create("21","2"));
    table.put("S",Pair.create("22","2"));
    table.put("T",Pair.create("23","2"));
    table.put("U",Pair.create("24","2"));
    table.put("V",Pair.create("25","2"));
    table.put("W",Pair.create("26","2"));
    table.put("X",Pair.create("27","2"));
    table.put("Y",Pair.create("28","2"));
    table.put("Z",Pair.create("29","2"));
    table.put("1",Pair.create("30","0"));
    table.put("2",Pair.create("31","0"));
    table.put("3",Pair.create("32","0"));
    table.put("4",Pair.create("33","0"));
    table.put("5",Pair.create("34","0"));
    table.put("6",Pair.create("35","0"));
    table.put("7",Pair.create("36","0"));
    table.put("8",Pair.create("37","0"));
    table.put("9",Pair.create("38","0"));
    table.put("0",Pair.create("39","0"));
    table.put("!",Pair.create("30","2"));
    table.put("@",Pair.create("31","2"));
    table.put("#",Pair.create("32","2"));
    table.put("$",Pair.create("33","2"));
    table.put("%",Pair.create("34","2"));
    table.put("^",Pair.create("35","2"));
    table.put("&",Pair.create("36","2"));
    table.put("*",Pair.create("37","2"));
    table.put("(",Pair.create("38","2"));
    table.put(")",Pair.create("39","2"));
    table.put(" ",Pair.create("44","0"));
    table.put("-",Pair.create("45","0"));
    table.put("=",Pair.create("46","0"));
    table.put("[",Pair.create("47","0"));
    table.put("]",Pair.create("48","0"));
    table.put("\\",Pair.create("49","0"));
    table.put(";",Pair.create("51","0"));
    table.put("'",Pair.create("52","0"));
    table.put("`",Pair.create("53","0"));
    table.put(",",Pair.create("54","0"));
    table.put(".",Pair.create("55","0"));
    table.put("/",Pair.create("56","0"));
    table.put("_",Pair.create("45","2"));
    table.put("+",Pair.create("46","2"));
    table.put("{",Pair.create("47","2"));
    table.put("}",Pair.create("48","2"));
    table.put("|",Pair.create("49","2"));
    table.put(":",Pair.create("51","2"));
    table.put("\"",Pair.create("52","2"));
    table.put("~",Pair.create("53","2"));
    table.put("<",Pair.create("54","2"));
    table.put(">",Pair.create("55","2"));
    table.put("?",Pair.create("56","2"));
}
Having such a table for many different keyboard layouts is a pain. Is there some more clever version to map a character to the scancode for a specific keyboard layout?
If not - is there some kind of archive where I can find such a character to scancode mapping for many different keyboard layouts?
Thank you very much
Look at how localization works; the approaches all share the same idea. Create a version for each locale as a property file, then have a class that loads the properties based on the locale.
You would develop a loader class like this:
public scancodes(Locale locale) {
    // load locale property file or download if missing
    // read the property and store to the table
    ResourceBundle scanCodes = ResourceBundle.getBundle("codes", locale);
}
And your codes_locale looks like:
codes_de.properties
a=4,0
b=5,0
By doing this, you separate the locale-specific characters from your logic code, and you don't need to bundle all keyboards inside your app. You can download them as needed.
You can access a tutorial here
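To illustrate the property-file idea without needing files on the classpath, here is a small sketch that parses the same "character=scancode,modifier" format from a string using java.util.Properties. The class LayoutTable and the y/z values are illustrative assumptions (on a German layout the y and z keys trade places); a real application would load codes_de.properties through ResourceBundle or a file stream instead.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class LayoutTable {

    /** Parses lines of the form "character=scancode,modifier" into a lookup table. */
    static Map<String, int[]> parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory reader
        }
        Map<String, int[]> table = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            String[] parts = props.getProperty(key).split(",");
            table.put(key, new int[] {
                Integer.parseInt(parts[0].trim()),
                Integer.parseInt(parts[1].trim())
            });
        }
        return table;
    }

    public static void main(String[] args) {
        // Hypothetical excerpt of codes_de.properties.
        Map<String, int[]> german = parse("a=4,0\ny=29,0\nz=28,0\n");
        int[] z = german.get("z");
        System.out.println("z -> scancode " + z[0] + ", modifier " + z[1]);
    }
}
```

Keep in mind that characters with a special meaning in property files (such as =, :, #, ! and the space) have to be escaped with a backslash when they are used as keys.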
If I understood correctly what you are trying to do, then you don't have to map anything at all. Just use a pre-made format (like Unicode, which works for all languages I know of): send a char code and translate it to its matching char.
Example file reader - char interpreter:
JFileChooser fc = new JFileChooser();
fc.setFileSelectionMode(JFileChooser.FILES_ONLY);
fc.showOpenDialog(null);
File textFile = fc.getSelectedFile();
if(textFile.getName().endsWith(".txt")) {
    System.out.println(textFile.getAbsolutePath());
    FileInputStream input = new FileInputStream(textFile);
    BufferedReader reader = new BufferedReader(new InputStreamReader(input, "UNICODE"));
    char[] buffer = new char[input.available() / 2 - 1];
    System.out.println("Bytes left: " + input.available());
    int read = reader.read(buffer);
    System.out.println("Read " + read + " characters");
    for(int i = 0; i < read; i++) {
        System.out.print("The letter is: " + buffer[i]);
        System.out.println(", The key code is: " + (int) buffer[i]);
    }
}
You can later use the key code to emulate a key press on your computer.
For scan-code mappings you can visit following sites:
Keyboard scancodes
Scan Codes Demystified
My solution is to determine the list of keycodes at runtime; it'll save you a lot of caffeine and headaches.
package test;

import java.util.HashMap;
import java.util.Map;
import javax.swing.KeyStroke;

public class Keycode {
    /**
     * List of chars, can be stored in file
     * @return
     */
    public String getCharsets() {
        return "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSVWXYZ12567890!@#$%^&*() -=[]\\;'`,./_+{}|:\\~<>?";
    }

    /**
     * Determines the keycode on runtime
     * @return
     */
    public Map<Character, Integer> getScancode() {
        Map<Character, Integer> table = new HashMap<>();
        String charsets = this.getCharsets();
        for( int index = 0 ; index < charsets.length() ; index++ ) {
            Character currentChar = charsets.charAt(index);
            // A char argument widens to int here, so this resolves to
            // getKeyStroke(int keyCode, int modifiers).
            KeyStroke keyStroke = KeyStroke.getKeyStroke(currentChar.charValue(), 0);
            // only for example i've used Map, but you should populate it by your table
            // table.put("a",Pair.create("4","0"));
            table.put(currentChar, keyStroke.getKeyCode());
        }
        return table;
    }

    public static void main(String[] args) {
        System.out.println(new Keycode().getScancode());
    }
}

Java string is terminated when its printed

I have a string which has huge data
String string = "afsa fd fdsfdsfdsfdsfdsfds fdsfds fdsf dsfds fdsfds fdsf dsfdsf dsfdsfsdfsdf JsonStr [{\"apk_name\":\"Android System\",\"apk_package\":\"android\",\"apk_versioncode\":17},{\"apk_name\":\"Bubble\",\"apk_package\":\"bz.ktk.bubble\",\"apk_versioncode\":21},{\"apk_name\":\"Kingsoft Office\",\"apk_package\":\"cn.wps.moffice_eng\",\"apk_versioncode\":74},{\"apk_name\":\"Math Workout\",\"apk_package\":\"com.akbur.mathsworkout\",\"apk_versioncode\":118},{\"apk_name\":\"Apollo\",\"apk_package\":\"com.andrew.apollo\",\"apk_versioncode\":2},{\"apk_name\":\"Tags\",\"apk_package\":\"com.android.apps.tag\",\"apk_versioncode\":101},{\"apk_name\":\"com.android.backupconfirm\",\"apk_package\":\"com.android.backupconfirm\",\"apk_versioncode\":17},{\"apk_name\":\"Bluetooth Share\",\"apk_package\":\"com.android.bluetooth\",\"apk_versioncode\":17},{\"apk_name\":\"Browser\",\"apk_package\":\"com.android.browser\",\"apk_versioncode\":17},{\"apk_name\":\"Calculator\",\"apk_package\":\"com.android.calculator2\",\"apk_versioncode\":17},{\"apk_name\":\"Calendar\",\"apk_package\":\"com.android.calendar\",\"apk_versioncode\":17},{\"apk_name\":\"Cell Broadcasts\",\"apk_package\":\"com.android.cellbroadcastreceiver\",\"apk_versioncode\":17},{\"apk_name\":\"Certificate Installer\",\"apk_package\":\"com.android.certinstaller\",\"apk_versioncode\":17},{\"apk_name\":\"Chrome\",\"apk_package\":\"com.android.chrome\",\"apk_versioncode\":1916122},{\"apk_name\":\"Contacts\",\"apk_package\":\"com.android.contacts\",\"apk_versioncode\":17},{\"apk_name\":\"Package Access Helper\",\"apk_package\":\"com.android.defcontainer\",\"apk_versioncode\":17},{\"apk_name\":\"Clock\",\"apk_package\":\"com.android.deskclock\",\"apk_versioncode\":203},{\"apk_name\":\"Dev Tools\",\"apk_package\":\"com.android.development\",\"apk_versioncode\":1},{\"apk_name\":\"Basic Daydreams\",\"apk_package\":\"com.android.dreams.basic\",\"apk_versioncode\":17},{\"apk_name\":\"Photo 
Screensavers\",\"apk_package\":\"com.android.dreams.phototable\",\"apk_versioncode\":17},{\"apk_name\":\"Email\",\"apk_package\":\"com.android.email\",\"apk_versioncode\":410000},{\"apk_name\":\"Exchange Services\",\"apk_package\":\"com.android.exchange\",\"apk_versioncode\":500000},{\"apk_name\":\"Face Unlock\",\"apk_package\":\"com.android.facelock\",\"apk_versioncode\":17},{\"apk_name\":\"Black Hole\",\"apk_package\":\"com.android.galaxy4\",\"apk_versioncode\":1},{\"apk_name\":\"Gallery\",\"apk_package\":\"com.android.gallery3d\",\"apk_versioncode\":40001},{\"apk_name\":\"HTML Viewer\",\"apk_package\":\"com.android.htmlviewer\",\"apk_versioncode\":17},{\"apk_name\":\"Input Devices\",\"apk_package\":\"com.android.inputdevices\",\"apk_versioncode\":17},{\"apk_name\":\"Android keyboard (AOSP)\",\"apk_package\":\"com.android.inputmethod.latin\",\"apk_versioncode\":17},{\"apk_name\":\"Key Chain\",\"apk_package\":\"com.android.keychain\",\"apk_versioncode\":17},{\"apk_name\":\"Fused Location\",\"apk_package\":\"com.android.location.fused\",\"apk_versioncode\":17},{\"apk_name\":\"Magic Smoke Wallpapers\",\"apk_package\":\"com.android.magicsmoke\",\"apk_versioncode\":17},{\"apk_name\":\"Messaging\",\"apk_package\":\"com.android.mms\",\"apk_versioncode\":17},{\"apk_name\":\"Music Visualization Wallpapers\",\"apk_package\":\"com.android.musicvis\",\"apk_versioncode\":17},{\"apk_name\":\"Nfc Service\",\"apk_package\":\"com.android.nfc\",\"apk_versioncode\":17},{\"apk_name\":\"Bubbles\",\"apk_package\":\"com.android.noisefield\",\"apk_versioncode\":1},{\"apk_name\":\"Package installer\",\"apk_package\":\"com.android.packageinstaller\",\"apk_versioncode\":17},{\"apk_name\":\"Phase Beam\",\"apk_package\":\"com.android.phasebeam\",\"apk_versioncode\":1},{\"apk_name\":\"Phone\",\"apk_package\":\"com.android.phone\",\"apk_versioncode\":17},{\"apk_name\":\"Search Applications 
Provider\",\"apk_package\":\"com.android.providers.applications\",\"apk_versioncode\":17},{\"apk_name\":\"Calendar Storage\",\"apk_package\":\"com.android.providers.calendar\",\"apk_versioncode\":17},{\"apk_name\":\"Contacts Storage\",\"apk_package\":\"com.android.providers.contacts\",\"apk_versioncode\":17},{\"apk_name\":\"Download Manager\",\"apk_package\":\"com.android.providers.downloads\",\"apk_versioncode\":17},{\"apk_name\":\"Downloads\",\"apk_package\":\"com.android.providers.downloads.ui\",\"apk_versioncode\":17},{\"apk_name\":\"DRM Protected Content Storage\",\"apk_package\":\"com.android.providers.drm\",\"apk_versioncode\":17},{\"apk_name\":\"Media Storage\",\"apk_package\":\"com.android.providers.media\",\"apk_versioncode\":511},i am still testing this :) ";
when I print this string like
System.out.println(TAG + string);
The string is truncated on console, why is it so?
System.out is redirected to logcat. Logcat messages have a maximum length of about 1k and extra characters are truncated.
If you need to log longer messages, use your own logging/file-writing solution.
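A simple workaround along those lines is to split the message into pieces that each fit under the limit and log them one by one. The sketch below is plain Java; the class name LogChunker is made up, and the chunk size you pass in (for example 1000, to stay under the "about 1k" mentioned above) is an assumption you should adjust for your device.

```java
import java.util.ArrayList;
import java.util.List;

final class LogChunker {

    /** Splits text into consecutive pieces of at most maxLen characters. */
    static List<String> chunk(String text, int maxLen) {
        if (maxLen <= 0) {
            throw new IllegalArgumentException("maxLen must be positive");
        }
        List<String> parts = new ArrayList<>();
        for (int start = 0; start < text.length(); start += maxLen) {
            int end = Math.min(text.length(), start + maxLen);
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Tiny demo with a chunk size of 3; on a device each part would go to the logger instead.
        for (String part : chunk("abcdefg", 3)) {
            System.out.println(part);
        }
    }
}
```

The demo prints abc, def and g on separate lines. The split is by char index, so a surrogate pair could in principle be cut in half at a chunk boundary; for the ASCII JSON in the question that does not matter.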

Creating a UTF-8 File in Java

I'm currently making a program that saves Chinese Words onto a text file. I create the text file in java, and then try and write words to it. However, the text file I create is never encoded in UTF-8. This is the code I'm using, why doesn't it work? I was told that there was a bug inherent in Java but I have no idea how to get around it.
public void createFile(String name) {
    try {
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(name +".txt"), "UTF-8"));
        out.write("");
    }
    catch(java.io.IOException e) {
        System.err.println("Something went wrong.");
    }
}
Also, do I have another option aside from text files with which I could still use UTF encoding?
Also I'm testing its encoding by opening the TextEdit application and trying to write Chinese characters. Could this also be a problem?
First, files themselves don't have encodings. They're a bunch of 0s and 1s. If you write "asdf" in utf-8, it's completely indistinguishable from plain old ascii7.
If you were writing in, say, utf-16, then the byte-order mark (BOM) would be a pretty clear indication that it's written in utf-16, even with an empty string, but utf-8 does not require such a marker to be present.
Therefore, your editor has no way of knowing that this file is supposed to be written in utf-8. You could write utf-8's BOM to your file by:
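out.write(new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF });
However, in this case, out would have to be an OutputStream, such as the FileOutputStream. (BufferedWriter and OutputStreamWriter do not accept byte arrays for input.) Note that OutputStream.write(int) writes only the lowest 8 bits of its argument, so the three BOM bytes have to be passed as a byte array rather than as a single int.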
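Here is a minimal sketch of the BOM approach, assuming you want the marker followed by the UTF-8 text in one file. The class Utf8Bom and its method names are invented for illustration; only standard java.nio calls are used.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8Bom {

    /** The three bytes of the UTF-8 byte-order mark. */
    private static final byte[] BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    /** Returns the BOM followed by the UTF-8 encoding of the text. */
    static byte[] withBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] all = new byte[BOM.length + body.length];
        System.arraycopy(BOM, 0, all, 0, BOM.length);
        System.arraycopy(body, 0, all, BOM.length, body.length);
        return all;
    }

    /** Writes the text to the file as UTF-8, preceded by the BOM. */
    static void write(Path file, String text) throws IOException {
        Files.write(file, withBom(text));
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("bom-demo", ".txt");
        write(file, "\u4E2D\u6587"); // two Chinese characters, three UTF-8 bytes each
        System.out.println(Files.readAllBytes(file).length + " bytes written to " + file);
    }
}
```

The file ends up with 9 bytes: 3 for the BOM and 6 for the two characters. Some tools dislike a BOM in UTF-8 files, so treat it as a hint for editors rather than a requirement.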
This may be a TextEdit usage issue.
If there are no non-ASCII characters in the file you're writing, TextEdit's algorithm to determine encoding will likely land on ASCII or a Latin-1 variant.
You can specify a text file's encoding in the File->Open dialog. I'm not sure whether TextEdit remembers this decision on future double-clicks of this file.
Try the following code. It worked for me. The file was written out as UTF-8. I was able to open it with Notepad++, which verified that the encoding was UTF-8. The characters encoded correctly. I got the characters from http://www.khngai.com/chinese/charmap/tbluni.php.
package testutf8;

import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;

public class TestUTF8 {
    public static void main(String[] args) throws FileNotFoundException, UnsupportedEncodingException, IOException {
        String str = "Unicode Character Map, 0x4E00 - 0x4FFF\n" +
"4E00 一 丁 丂 七 丄 丅 丆 万 丈 三 上 下 丌 不 与 丏\n" +
"4E10 丐 丑 丒 专 且 丕 世 丗 丘 丙 业 丛 东 丝 丞 丟\n" +
"4E20 丠 両 丢 丣 两 严 並 丧 丨 丩 个 丫 丬 中 丮 丯\n" +
"4E30 丰 丱 串 丳 临 丵 丶 丷 丸 丹 为 主 丼 丽 举 丿\n" +
"4E40 乀 乁 乂 乃 乄 久 乆 乇 么 义 乊 之 乌 乍 乎 乏\n" +
"4E50 乐 乑 乒 乓 乔 乕 乖 乗 乘 乙 乚 乛 乜 九 乞 也\n" +
"4E60 习 乡 乢 乣 乤 乥 书 乧 乨 乩 乪 乫 乬 乭 乮 乯\n" +
"4E70 买 乱 乲 乳 乴 乵 乶 乷 乸 乹 乺 乻 乼 乽 乾 乿\n" +
"4E80 亀 亁 亂 亃 亄 亅 了 亇 予 争 亊 事 二 亍 于 亏\n" +
"4E90 亐 云 互 亓 五 井 亖 亗 亘 亙 亚 些 亜 亝 亞 亟\n" +
"4EA0 亠 亡 亢 亣 交 亥 亦 产 亨 亩 亪 享 京 亭 亮 亯\n" +
"4EB0 亰 亱 亲 亳 亴 亵 亶 亷 亸 亹 人 亻 亼 亽 亾 亿\n" +
"4EC0 什 仁 仂 仃 仄 仅 仆 仇 仈 仉 今 介 仌 仍 从 仏\n" +
"4ED0 仐 仑 仒 仓 仔 仕 他 仗 付 仙 仚 仛 仜 仝 仞 仟\n" +
"4EE0 仠 仡 仢 代 令 以 仦 仧 仨 仩 仪 仫 们 仭 仮 仯\n" +
"4EF0 仰 仱 仲 仳 仴 仵 件 价 仸 仹 仺 任 仼 份 仾 仿\n" +
"4F00 伀 企 伂 伃 伄 伅 伆 伇 伈 伉 伊 伋 伌 伍 伎 伏\n" +
"4F10 伐 休 伒 伓 伔 伕 伖 众 优 伙 会 伛 伜 伝 伞 伟\n" +
"4F20 传 伡 伢 伣 伤 伥 伦 伧 伨 伩 伪 伫 伬 伭 伮 伯\n" +
"4F30 估 伱 伲 伳 伴 伵 伶 伷 伸 伹 伺 伻 似 伽 伾 伿\n" +
"4F40 佀 佁 佂 佃 佄 佅 但 佇 佈 佉 佊 佋 佌 位 低 住\n" +
"4F50 佐 佑 佒 体 佔 何 佖 佗 佘 余 佚 佛 作 佝 佞 佟\n" +
"4F60 你 佡 佢 佣 佤 佥 佦 佧 佨 佩 佪 佫 佬 佭 佮 佯\n" +
"4F70 佰 佱 佲 佳 佴 併 佶 佷 佸 佹 佺 佻 佼 佽 佾 使\n" +
"4F80 侀 侁 侂 侃 侄 侅 來 侇 侈 侉 侊 例 侌 侍 侎 侏\n" +
"4F90 侐 侑 侒 侓 侔 侕 侖 侗 侘 侙 侚 供 侜 依 侞 侟\n" +
"4FA0 侠 価 侢 侣 侤 侥 侦 侧 侨 侩 侪 侫 侬 侭 侮 侯\n" +
"4FB0 侰 侱 侲 侳 侴 侵 侶 侷 侸 侹 侺 侻 侼 侽 侾 便\n" +
"4FC0 俀 俁 係 促 俄 俅 俆 俇 俈 俉 俊 俋 俌 俍 俎 俏\n" +
"4FD0 俐 俑 俒 俓 俔 俕 俖 俗 俘 俙 俚 俛 俜 保 俞 俟\n" +
"4FE0 俠 信 俢 俣 俤 俥 俦 俧 俨 俩 俪 俫 俬 俭 修 俯\n" +
"4FF0 俰 俱 俲 俳 俴 俵 俶 俷 俸 俹 俺 俻 俼 俽 俾 俿\n";
        FileOutputStream fos = new FileOutputStream("tmp.txt");
        Writer out = new OutputStreamWriter(fos, "UTF-8");
        out.write(str);
        out.close();
    }
}
Try UTF-8 instead of UTF8. This might solve your problem.
I noticed that you didn't close your stream:
out.close();
Of course you didn't include the code that wrote the actual characters either...
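To avoid forgetting close(), a try-with-resources block can do it for you. The following is a sketch rather than a drop-in fix: the class Utf8RoundTrip is made up, it writes to a temporary file instead of name + ".txt", and it reads the text back to confirm that the characters survive.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8RoundTrip {

    /** Writes text to a temp file as UTF-8, reads it back, and returns what was read. */
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("utf8-demo", ".txt");
            // The writer is closed (and therefore flushed) when the try block ends.
            try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                out.write(text);
            }
            String readBack = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            Files.delete(file);
            return readBack;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String chinese = "\u4E2D\u6587"; // two Chinese characters, written with escapes
        System.out.println(roundTrip(chinese).equals(chinese));
    }
}
```

This should print true. If an editor still shows the wrong characters for a file written this way, the file is most likely fine and the editor is guessing the wrong encoding when opening it.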
