What are \xHEX characters and is there a table for them? - java

When reading a text file, I get these characters; when printed to the console they come out as blanks or �:
['\x80', '\xc3', '\x94', '\x99', '\x98','\x9d', '\x9c', '\xa9', '\xa6', '\xe2']
What are these \xHEX characters? Is there a link to a table to look up these characters?
SOLVED:
It's not an ASCII text file; it is a Unicode UTF-8 file. That is why I was unable to get the correct characters.
For Java:
import java.io.*;

File infile = new File("/home/foo/bar.txt");
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(infile), "UTF-8"));
String str;
while ((str = in.readLine()) != null) {
    System.out.println(str);
}
If System.out.println complains, try:
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(str);
For Python, simply:
import codecs

infile = '/home/foo/bar.txt'
reader = codecs.open(infile, 'r', 'utf-8')
for line in reader:
    print line

Here is a link to a list of all Unicode characters:
http://en.wikipedia.org/wiki/List_of_Unicode_characters
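As a concrete example, the bytes \xe2 \x80 \x9c from the question's list are the UTF-8 encoding of a single character, the left double quotation mark (U+201C). A quick sketch to check this (assumes Java 7+ for StandardCharsets):

import java.nio.charset.StandardCharsets;

// Decode three of the question's bytes as UTF-8: together they form one character.
byte[] raw = { (byte) 0xe2, (byte) 0x80, (byte) 0x9c };
String s = new String(raw, StandardCharsets.UTF_8);
System.out.println(s);                            // prints "
System.out.printf("U+%04X%n", (int) s.charAt(0)); // prints U+201C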
Also, if you are using Eclipse, make sure your project's "Text File Encoding" is set to UTF-8:
Project -> Properties -> Resources -> Text File Encoding.
I had a similar problem with Cyrillic characters :)

I would suggest that your text file is not really a "text file". Two of those bytes, \xc3 and \x80, together form the Unicode character 'À' when decoded as UTF-8. The rest, I guess, are parts of other multi-byte sequences or non-printable characters. It seems that your file has a raw sequence of bytes that don't have to be characters.
You've got a table here.

Please note that Java encodes characters in a Unicode format (\u...). It is possible to display the number '80', but not its character representation '\x80', on the console.
For a list, please refer to an ASCII character table, like this one
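A small sketch of the difference between the number and the character (what the second println shows depends on the console's encoding):

char c = '\u00C0';            // Java escapes characters as \uXXXX, not \xHH
System.out.println((int) c);  // prints 192 (0xC0): the number is always safe
System.out.println(c);        // prints À only if the console encoding can represent it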

Related

Shift_JIS to UTF_8 conversion of full width tilde [~] character returns a thicker character

I'm processing Shift_JIS files and outputting UTF-8 files. Most of the characters are displayed as expected when viewed in a text editor, except for the full width tilde character [~]. It becomes thicker, similar to this: [~].
note: this is not the same character, I just don't know how to type it here so I bolded it
When I type it manually in the UTF-8 file, I get the regular version.
Here is my code:
try (BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream(inFile), Charset.forName("Shift_JIS")))) {
    try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(outFile), StandardCharsets.UTF_8))) {
        IOUtils.copy(in, out);
    }
}
I also tried using "MS932" and also tried not using IOUtils.
To read Shift_JIS files made with Windows, you have to use Charset.forName("Windows-31J") rather than Charset.forName("Shift_JIS").
Java distinguishes Shift_JIS from Windows-31J: "Shift_JIS" in documents about Windows means Windows-31J (MS932) in Java, while "Shift_JIS" in documents about AIX means Shift_JIS in Java.
The character mappings for Windows-31J and Shift_JIS are slightly different. For example, the byte sequence 0x8160 is mapped to U+301C (WAVE DASH) in Shift_JIS and to U+FF5E (FULLWIDTH TILDE) in Windows-31J. Microsoft's IME uses U+FF5E to represent the character ~.
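Applied to the code above, this is a one-charset change (a sketch; everything else stays as posted):

// Decode with Windows-31J so 0x8160 becomes U+FF5E (FULLWIDTH TILDE),
// matching what Microsoft's IME produces, then write UTF-8 as before.
try (BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream(inFile), Charset.forName("Windows-31J")));
     BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(outFile), StandardCharsets.UTF_8))) {
    IOUtils.copy(in, out);
}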

character ° encoding and visualization in txt file

I have a field in a table that contains the string "Address Pippo p.2 °".
My program reads this value and writes it to a txt file, but the output is:
"Address Pippo p.2 Â°" (the Â is unwanted)
I have a problem because the txt file is a positional file.
I open the file with these Java instructions:
FileWriter fw = new FileWriter(file, true);
PrintWriter pw = new PrintWriter(fw);
I want to write the string without the strange characters.
Any help for me?
Thanks in advance
Try encoding the string into UTF-8 like this,
File file = new File("D://test.txt");
FileWriter fw = new FileWriter(file, true);
PrintWriter pw = new PrintWriter(fw);
String test = "Address Pippo p.2 °";
ByteBuffer byteBuffer = Charset.forName("UTF-8").encode(test);
test = StandardCharsets.UTF_8.decode(byteBuffer).toString();
pw.write(test);
pw.close();
Java uses Unicode. When you write text to a file, it gets encoded using a particular character encoding. If you don't specify it explicitly, it will use a "system default encoding" which is whatever is configured as default for your particular JVM instance. You need to know what encoding you've used to write the file. Then you need to use the same encoding to read and display the file content. The funny characters you are seeing are probably due to writing the file using UTF-8 and then trying to read and display it in e.g. Notepad using Windows-1252 ("ANSI") encoding.
Decide what encoding you want and stick to it for both reading and writing. To write using Windows-1252, use:
Writer w = new OutputStreamWriter(new FileOutputStream(file, true), "windows-1252");
And if you write in UTF-8, then tell Notepad that you want it to read the file in UTF-8. One way to do that is to write the character '\uFEFF' (Byte Order Mark) at the beginning of the file.
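A minimal sketch of that BOM trick (using the file and the string from the question; assumes Java 7+ for StandardCharsets):

// Writing the BOM first lets editors like Notepad detect the file as UTF-8.
Writer w = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8);
w.write('\uFEFF');               // U+FEFF, the Byte Order Mark
w.write("Address Pippo p.2 °");
w.close();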
If you use UTF-8, be aware that non-ASCII characters will throw the subsequent bytes out of position. So if, for example, a telephone field must always start at byte position 200, then having a non-ASCII character in an address field before it will make the telephone field start at byte position 201 or 202. Using windows-1252 encoding you won't have this issue, but that encoding can't encode all Unicode characters.
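To see the effect on a positional file, compare the character count with the encoded byte count (a quick sketch using the string from the question):

String s = "Address Pippo p.2 °";
System.out.println(s.length());                                          // 19 characters
System.out.println(s.getBytes(StandardCharsets.UTF_8).length);           // 20 bytes: ° takes two bytes
System.out.println(s.getBytes(Charset.forName("windows-1252")).length);  // 19 bytes: ° is the single byte 0xB0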

file reading encoding trouble

I have a file to read, do something with its information, and then write it back to another file. The problem is that the original file contains some characters from Asian languages, like 坂本龍一, 東京事変 and メリー (I guess they're Chinese, Japanese and Korean). I can see them using Notepad++.
The problem is that when I read them and write them back via Java they get corrupted, and I see weird stuff in my output file like ???????? or Жанна БичевÑ?каÑ?
I think I've got something wrong with the encoding, but I have no idea which one to use or how to use it.
Can someone help me? Here's my code:
String fileToRead = SONG_2M;
Scanner scanner = new Scanner(new File(fileToRead), "UTF-8");
while (scanner.hasNextLine()) {
    String line = scanner.nextLine();
    String[] songData = line.split("\t");
    if (/* something */) {
        // save the string in the map
    }
}
scanner.close();
saveFile("coded_artist_small2.txt");
}
public void saveFile(String fileToSave) throws FileNotFoundException, UnsupportedEncodingException {
    PrintWriter writer = new PrintWriter(fileToSave, "UTF-8");
    for (Entry<String, Integer> entry : artistsMap.entrySet()) {
        writer.println(entry.getKey() + DELIMITER + entry.getValue());
    }
    writer.close();
}
It is likely that your input file is not, in fact, encoded in UTF-8 (a variable-width Unicode encoding). For instance, the character 坂 you are seeing is Unicode U+5742. If your file were interpreted as ASCII, the bytes 0x57 and 0x42 would be displayed as the characters 'W' and 'B'.
If you're unsure of your file's encoding, one guess is that it uses your platform's default encoding. Try removing the explicit encoding when you set up the Scanner, i.e. make the second line of your code
Scanner scanner = new Scanner(new File(fileToRead));
If you know the file is Unicode, note that there are several different Unicode encodings. See this answer for a more comprehensive Unicode reader that deals with the various Unicode encodings.
For your output, you need to decide how you want the file encoded: some Unicode encoding (e.g. UTF-8) or ASCII.
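If you are unsure what the file actually contains, dumping its first few bytes in hex can settle the question (a sketch; fileToRead as in the question):

// "ef bb bf" at the start suggests a UTF-8 BOM; "ff fe" or "fe ff" suggests UTF-16.
try (InputStream in = new FileInputStream(fileToRead)) {
    byte[] head = new byte[4];
    int n = in.read(head);
    for (int i = 0; i < n; i++) {
        System.out.printf("%02x ", head[i] & 0xff);
    }
}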

Why is my String returning "\ufffd\ufffdN a m e"

This is my method
public void readFile3() throws IOException
{
    try
    {
        FileReader fr = new FileReader(Path3);
        BufferedReader br = new BufferedReader(fr);
        String s = br.readLine();
        int a = 1;
        while (a != 2)
        {
            s = br.readLine();
            a++;
        }
        Storage.add(s);
        br.close();
    }
    catch (IOException e)
    {
        System.out.println(e.getMessage());
    }
}
For some reason I am unable to read the file, which only contains this: "
Name
Intel(R) Core(TM) i5-2500 CPU # 3.30GHz "
When I debug the code, the String s is returned as "\ufffd\ufffdN a m e" and I have no clue where those extra characters are coming from. This is preventing me from properly reading the file.
\ufffd is the Unicode replacement character; it is used when you try to read a code that has no representation in Unicode. I suppose you are on a Windows platform (or at least the file you read was created on Windows). Windows supports many formats for text files; the most common is ANSI, where each character is represented by its ANSI code.
But Windows can also use UTF-16 directly, where each character is represented by its Unicode code as a 16-bit integer, so with 2 bytes per character. Those files use special markers (Byte Order Marks, in Windows dialect) to say:
that the file is encoded with 2 (or even 4) bytes per character
whether the encoding is little or big endian
(Reference: Using Byte Order Marks on MSDN)
As you get N a m e after the first two replacement characters, and not Name, I suppose you have a UTF-16 encoded text file. Notepad can transparently edit those files (without even telling you the actual format), but other tools do have problems with them...
The excellent vim can read files with different encodings and convert between them.
If you want to use this kind of file directly in Java, you have to use the UTF-16 charset. From the Java SE 7 javadoc on Charset: UTF-16 Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark
You must specify the encoding when reading the file; in your case it is probably UTF-16.
Reader reader = new InputStreamReader(new FileInputStream(fileName), "UTF-16");
BufferedReader br = new BufferedReader(reader);
Check the documentation for more details: InputStreamReader class.
Check to see if the file is .odt, .rtf, or something other than .txt. This may be what's causing the extra UTF-16 characters to appear. Also, make sure that (even if it is a .txt file) your file is encoded as UTF-8.
Perhaps you have non-ASCII characters such as '®' in your document.

Why is the first character of the first line of a file in Windows a 0?

So I'm reading a plain text file in Java, and I'd like to identify which lines start with "abc". I did the following:
Charset charset = StandardCharsets.UTF_8;
BufferedReader br = Files.newBufferedReader(file.toAbsolutePath(), charset);
String line;
while ((line = br.readLine()) != null) {
    if (line.startsWith("abc")) {
        // Do something
    }
}
But if the first line of the file is "abcd", it won't match. By debugging I've found out that the first character is a 0 (a non-printable character), and because of this it won't match. Why is that? How can I robustly identify which lines start with "abc"?
EDIT: perhaps I should point out that I'm creating the file using notepad
Windows has a few problems with UTF-8, and as such it is a heavy user of the UTF-8 BOM (Byte Order Mark).
If my guess is correct, the first three bytes would then be (in hexadecimal): 0xef, 0xbb, 0xbf.
Given that, for instance, Excel creates UTF-8 CSV files with a BOM prefix, I wouldn't be surprised at all if Notepad did as well...
edit: not surprisingly, it seems this is the case: see here.
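One robust way to deal with it (a sketch, not the only option) is to read the first character and skip it when it is the decoded BOM, U+FEFF:

BufferedReader br = Files.newBufferedReader(file.toAbsolutePath(), StandardCharsets.UTF_8);
br.mark(1);                 // remember the start of the stream
if (br.read() != 0xFEFF) {  // first character is not a BOM,
    br.reset();             // so rewind and read it normally
}
String line;
while ((line = br.readLine()) != null) {
    if (line.startsWith("abc")) {
        // Do something
    }
}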
