Appending extended ASCII in strings - Java

I'm trying to use extended ASCII character 179 (looks like a pipe).
Here is how I use it.
String cmd = "";
char pipe = (char) 179;
// cmd = "02|CO|0|101|03|0F"
cmd ="02"+pipe+"CO"+pipe+"0"+pipe+"101"+pipe+"03"+pipe+"0F";
System.out.println("cmd "+cmd);
Output
cmd 02³CO³0³101³03³0F
But the output looks like this instead. I have read that extended ASCII characters are not always displayed correctly.
Is my code correct, with the character just not displayed properly,
or is my code wrong?
I'm not concerned about showing this string to the user; I need to send it to a server.
EDIT
The vendor's API document states that we need to use ASCII 179 (which looks like a pipe). The server-side code expects 179 (part of extended ASCII) as the pipe/vertical line, so I cannot use 124 (the regular pipe).
EDIT 2
Here is the table for extended ASCII.
On the other hand, this table shows that ASCII 179 is "³" (superscript three). Why
are there different interpretations of the same value, and which one should I
consider?
EDIT 3
My default charset value is the following (is this related to my problem?):
System.out.println("Default Charset=" + Charset.defaultCharset());
Default Charset=windows-1252
Thanks!
I have referred to
How to convert a char to a String?
How to print the extended ASCII code in java from integer value

Use the code below.
String cmd = "";
// U+2502 BOX DRAWINGS LIGHT VERTICAL, the Unicode equivalent of code page 437's character 179
char pipe = '\u2502';
cmd = "02" + pipe + "CO" + pipe + "0" + pipe + "101" + pipe + "03" + pipe + "0F";
System.out.println("cmd " + cmd);
System.out.println("int value: " + (int) pipe);
Output:
cmd 02│CO│0│101│03│0F
int value: 9474
I am using IntelliJ. This is the output I am getting.

Your code is correct; concatenating String values and char values does what one expects. It's the value 179 that is wrong. Google "unicode 179" and you'll find "Unicode Character 'SUPERSCRIPT THREE' (U+00B3)", which is exactly the ³ you printed. You could simply say char pipe = '|'; instead of using an integer. Or even better: String pipe = "|"; which also gives you the flexibility to use more than one character :)
In response to the new edits...
May I suggest that you fix this rather low-level problem not at the Java String level, but instead replace the byte that encodes this character just before sending the bytes to the server?
E.g. something like this (untested)
byte[] bytes = cmd.getBytes(StandardCharsets.US_ASCII); // every character is ASCII, so this is safe
for (int i = 0; i < bytes.length; i++) {
    if (bytes[i] == '|') {
        bytes[i] = (byte) 179; // swap in the vendor's separator byte
    }
}
// send the command bytes to the server
// don't forget endline bytes/chars or whatever the protocol might require. good luck :)
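Putting the pieces together, here is a minimal end-to-end sketch. The host, port, and trailing newline are assumptions, since the question doesn't describe the vendor's transport:
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SendCommand {
    public static void main(String[] args) throws Exception {
        // build the command with an ordinary '|' as a placeholder
        String cmd = "02|CO|0|101|03|0F";
        // encode as ASCII, then swap each '|' for the vendor's byte 179
        byte[] bytes = cmd.getBytes(StandardCharsets.US_ASCII);
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '|') {
                bytes[i] = (byte) 179;
            }
        }
        // hypothetical endpoint; replace with the vendor's host and port
        try (Socket socket = new Socket("example.com", 9000)) {
            OutputStream out = socket.getOutputStream();
            out.write(bytes);
            out.write('\n'); // assumption: the protocol is line-terminated
            out.flush();
        }
    }
}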


Interpret a string from one encoding to another in java

I've looked around for answers to this (I'm sure they're out there), and I'm not sure it's possible.
So, I got a HUGE file that contains the word "för". I'm using RandomAccessFile because I know where it is (kind of) and can therefore use the seek() function to get there.
To know that I've found it, I have a String "för" in my program that I check for equality. Here's the problem: when I ran the debugger and got to "för", what I actually had to compare against was "för".
So my program terminates without finding any "för".
This is the code I use to get a word:
private static String getWord(RandomAccessFile file) throws IOException {
    StringBuilder stb = new StringBuilder();
    char c = (char) file.read();
    int end;
    do {
        stb.append(c);
        end = file.read();
        if (end == -1)
            return "-1";
        c = (char) end;
    } while (c != ' ');
    return stb.toString().trim(); // note: trim() returns a new String, so its result must be used
}
So basically I return all the characters from the current point in the file up to the first ' ' character. I do get the word, but since (char) file.read() reads a single byte (I think), the UTF-8 'ö' becomes the two characters 'Ã' and '¶'?
One reason for this guess is that if I open my file with encoding UTF-8 it's "för" but if I open the file with ISO-8859-15 in the same place we now have exactly what my getWord method returns: "för"
So my question:
When I'm sitting with a "för" and a "för", is there any way to fix this? Like saying "read "för" as if it was an UTF-8 string" to get "för"?
If you have to use a RandomAccessFile, you should read the content into a byte[] first and then convert the complete array to a String, something along the lines of:
byte[] buffer = new byte[whatever];
int length = file.read(buffer); // the number of bytes actually read
String result = new String(buffer, 0, length, "UTF-8");
This is only to give you a general impression of what to do; you'll still have to handle end-of-file (read() returning -1), partially filled buffers, etc.
This will not work correctly if you start reading in the middle of a UTF-8 sequence, but so will any other method.
You are using RandomAccessFile.read(). This reads single bytes. UTF-8 sometimes uses several bytes for one character.
Different methods to read UTF-8 from a RandomAccessFile are discussed here: Java: reading strings from a random access file with buffered input
If you don't necessarily need a RandomAccessFile, you should definitely switch to reading characters instead of bytes.
If possible, I would suggest Scanner.next() which searches for the next word by default.
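If you do switch to character-based reading, here is a minimal sketch using Scanner; the file name is a placeholder, and it assumes the file really is UTF-8:
import java.io.File;
import java.io.IOException;
import java.util.Scanner;

public class WordScan {
    public static void main(String[] args) throws IOException {
        // Scanner decodes the stream as UTF-8, so a multi-byte character
        // such as 'ö' arrives as a single char; next() returns
        // whitespace-delimited tokens, i.e. words
        try (Scanner sc = new Scanner(new File("huge.txt"), "UTF-8")) {
            while (sc.hasNext()) {
                if (sc.next().equals("för")) {
                    System.out.println("found it");
                    break;
                }
            }
        }
    }
}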
If you already have the mis-decoded string, you can also undo the damage by encoding it back to the bytes it came from and then decoding those bytes as UTF-8:
import java.nio.charset.Charset;
String encodedString = new String(originalString.getBytes("ISO-8859-15"), Charset.forName("UTF-8"));

Replace non-UTF-compliant characters in a meaningful way rather than simply removing them

My application is malfunctioning in many areas because of special characters in strings.
Eg 1: you can see the ? character that was displayed instead of ’.
Text :
The Hilton Paris La Defense hotel is located at the foot of the Grande Arche at the very heart of Europe’s largest business district and puts you in easy reach of some of Paris’ most famous attractions. Only a few minutes from the...
Eg 2: a parser exception while parsing XML containing special characters (like ’, &, etc.) using AXIOM.
XMLStreamReader parser = XMLInputFactory.newInstance().createXMLStreamReader(new StringBufferInputStream(responseXML));
OMElement documentElement = new StAXOMBuilder(parser).getDocumentElement();
I found many posts about removing such characters when they are found.
Eg :
How to remove bad characters that are not suitable for utf8 encoding in MySQL?
remove non-UTF-8 characters from xml with declared encoding=utf-8 - Java
And I'm using the following code to remove the non-UTF-compliant characters.
if (null == inString) return null;
byte[] byteArr = inString.getBytes();
for (int i = 0; i < byteArr.length; i++) {
    byte ch = byteArr[i];
    if (!(ch < 0x00FD && ch > 0x001F) || ch == '&' || ch == '#') {
        byteArr[i] = ' ';
    }
}
return new String(byteArr);
But this leads to another problem: it also removes informative characters like ’.
What I want to do is replace them in a meaningful way rather than simply removing them. Eg: ’ can be replaced by ', & can be replaced by 'and', etc.
Is there any standard way to do this rather than manually replacing them one by one?
The javadoc for StringBufferInputStream says
Deprecated. This class does not properly convert characters into bytes. As of JDK 1.1, the preferred way to create a stream from a string is via the StringReader class.
Don't use it.
The file is read as bytes, no matter where it comes from. Never convert your data to a String if you need it as bytes in the first place.
If you're reading from a file, use a FileInputStream. (Never use FileReader, since it doesn't allow you to specify the encoding.)
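For the AXIOM snippet in the question, a minimal fix along those lines is to hand the parser characters directly instead of a byte stream; this is a sketch reusing the question's responseXML variable:
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

// StringReader feeds the parser chars as-is, so nothing is
// mangled the way the deprecated StringBufferInputStream mangles non-ASCII bytes
XMLStreamReader parser = XMLInputFactory.newInstance()
        .createXMLStreamReader(new StringReader(responseXML));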

How to remove Ascii code from the JTextArea?

My Java program gets some weather information from an API, but the text has weird letters in it. It looks like ASCII code.
Here is an example:
Min temp: 0°C (32°F), which should be: Min temp: 0C (32F) (I think).
How can I change it?
Well, one solution can be to do the following before displaying it:
String withoutDegSymbol = str.replaceAll("°", "");
where str contains your temperature data.
Try this:
String s = "0°C (32°F)".replaceAll("[\u0080-\u00FF]", "");
or, if you have HTML character references in your text, use
String s = "Min temp: 0&#xb0;C (32&#xb0;F)".replaceAll("&#x.+?;", "");
If you are using non-ASCII characters in your code, your IDE may have asked you in what format to save the file when you saved it. In the Eclipse IDE, for example, using such a character prompts you to save your code in UTF-8 format. Hope this helps.
You need to know that ° is the Unicode character 'DEGREE SIGN' (U+00B0), and its HTML entity (hex) encoding is &#xb0;.
So if this is the only problem you have (I mean, the degree sign is the only special character you use), you can convert it manually like this:
String s = "Min temp: 0&#xb0;C (32&#xb0;F)".replaceAll("&#xb0;", "°");
System.out.println(s);
If you have other special characters too, you can use the StringEscapeUtils class (from Apache Commons Lang)
or the jsoup library to convert them.
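For example, a minimal sketch with Apache Commons Lang; the input string is an assumption about what the API returns:
import org.apache.commons.lang3.StringEscapeUtils;

public class Unescape {
    public static void main(String[] args) {
        String raw = "Min temp: 0&#xb0;C (32&#xb0;F)"; // hypothetical API response
        // decodes every HTML character reference, not just the degree sign
        String clean = StringEscapeUtils.unescapeHtml4(raw);
        System.out.println(clean); // Min temp: 0°C (32°F)
    }
}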

why is String.split("£", 2) not working?

I have a text file with 1000 lines in the following format:
19 x 75 Bullnose Architrave/Skirting £1.02
I am writing a method that reads the file in line by line - this works OK.
I then want to split each string using the "£" as a delimiter and write it out to
an ArrayList<String> in the following format:
19 x 75 Bullnose Architrave/Skirting, Metre, 1.02
This is how I have approached it (productList is the ArrayList, declared/instantiated outside the try block):
try {
    br = new BufferedReader(new FileReader(aFile));
    String inputLine = br.readLine();
    String delim = "£";
    while (inputLine != null) {
        String[] halved = inputLine.split(delim, 2);
        String lineOut = halved[0] + ", Metre, " + halved[1]; // ArrayIndexOutOfBoundsException here
        productList.add(lineOut);
        inputLine = br.readLine();
    }
}
The String is not splitting and I keep getting an ArrayIndexOutOfBoundsException. I'm not very familiar with regex. I've also tried using the old StringTokenizer but get the same result.
Is there an issue with £ as a delim or is it something else? I did wonder if it is something to do with the second token not being read as a String?
Any ideas would be helpful.
Here are some of the possible causes:
The encoding of the file doesn't match the encoding that you are using to read it, and the "pound" character in the file is getting "mangled" into something else.
The file and your source code are using different pound-like characters. For instance, Unicode has more than one code point that looks like a "pound sign": the Pound Sterling character (U+00A3) and the Lira character (U+20A4) ... then there is the Roman semuncia character (U+10192).
You are trying to compile a UTF-8 encoded source file without telling the compiler that it is UTF-8 encoded.
Judging from your comments, this is an encoding mismatch problem; i.e. the "default" encoding being used by Java doesn't match the actual encoding of the file. There are two ways to address this:
Change the encoding of the file to match Java's default encoding. You seem to have tried that and failed. (And it wouldn't be the way I'd do this ...)
Change the program to open the file with a specific (non-default) encoding, since FileReader always uses the platform default; e.g. change
new FileReader(aFile)
to
new InputStreamReader(new FileInputStream(aFile), encoding)
where encoding is the name of the file's actual character encoding. The names of the encodings understood by Java are listed here, but my guess is that it is "ISO-8859-1" (aka Latin-1).
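Here is a minimal sketch of the corrected read loop; it assumes the file is actually UTF-8 (swap in "ISO-8859-1" if that turns out to be the real encoding):
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

// open the file with an explicit encoding rather than the platform default
BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(aFile), "UTF-8"));
String inputLine;
while ((inputLine = br.readLine()) != null) {
    String[] halved = inputLine.split("£", 2);
    if (halved.length == 2) { // guard against lines without a pound sign
        productList.add(halved[0] + ", Metre, " + halved[1]);
    }
}
br.close();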
This is probably a case of encoding mismatch. To check for this,
Print delim.length() and make sure it is 1.
Print inputLine.length() and make sure it is the right value (42).
If one of them is not the expected value then you have to make sure you are using UTF-8 everywhere.
You say delim.length() is 1, so this is good. On the other hand, if inputLine.length() is 34, this is very wrong. For "19 x 75 Bullnose Architrave/Skirting £1.02" you should get 42 if all was as expected. If your file was UTF-8 encoded but read as ISO-8859-1 or similar, you would have gotten 43.
Now I am a little at a loss. To debug this you could print individually each character of the string and check what is wrong with them.
for (int i = 0; i < inputLine.length(); i++)
    System.err.println("debug: " + i + ": " + inputLine.charAt(i) + " (" + inputLine.codePointAt(i) + ")");
Many thanks for all your replies.
Specifying the encoding within the read and saving the original text file as UTF-8 has worked.
However, the experience has taught me that delimiting text using "£", or indeed any other character that may have different representations in different encodings, is a poor strategy.
I have decided to take a different approach:
1) Find the last space in the input string and replace it with "xxx" or similar.
2) Split this using the delimiter "xxx." (the trailing regex dot also consumes the "£"), which should split the strings and rip out the "£" (see the sketch below).
3) Carry on...
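A short sketch of that plan; variable names follow the earlier snippet, and it relies on split() treating "xxx." as a regex in which the dot swallows the "£":
// replace the last space with a marker, then split on the marker
// plus one wildcard character, which consumes the '£'
int cut = inputLine.lastIndexOf(' ');
String marked = inputLine.substring(0, cut) + "xxx" + inputLine.substring(cut + 1);
String[] halved = marked.split("xxx.", 2);
productList.add(halved[0] + ", Metre, " + halved[1]);
// "19 x 75 Bullnose Architrave/Skirting £1.02"
// -> "19 x 75 Bullnose Architrave/Skirting, Metre, 1.02"
This of course assumes the description itself never contains "xxx" followed by another character.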

Java App : Unable to read iso-8859-1 encoded file correctly

I have a file which is encoded as ISO-8859-1, and contains characters such as ô.
I am reading this file with java code, something like:
File in = new File("myfile.csv");
InputStream fr = new FileInputStream(in);
byte[] buffer = new byte[4096];
while (true) {
    int byteCount = fr.read(buffer, 0, buffer.length);
    if (byteCount <= 0) {
        break;
    }
    String s = new String(buffer, 0, byteCount, "ISO-8859-1");
    System.out.println(s);
}
However, the ô character is always garbled, usually printing as a ?.
I have read around the subject (and learnt a little on the way) e.g.
http://www.joelonsoftware.com/articles/Unicode.html
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
http://www.ingrid.org/java/i18n/utf-16/
but still can not get this working
Interestingly this works on my local pc (xp) but not on my linux box.
I have checked that my JDK supports the required charsets (they are standard, so this is no surprise) using:
System.out.println(java.nio.charset.Charset.availableCharsets());
I suspect that either your file isn't actually encoded as ISO-8859-1, or System.out doesn't know how to print the character.
I recommend that to check for the first, you examine the relevant byte in the file. To check for the second, examine the relevant character in the string, printing it out with
System.out.println((int) s.charAt(index));
In both cases the result should be 244 decimal; 0xf4 hex.
See my article on Unicode debugging for general advice (the code presented is in C#, but it's easy to convert to Java, and the principles are the same).
In general, by the way, I'd wrap the stream with an InputStreamReader with the right encoding - it's easier than creating new strings "by hand". I realise this may just be demo code though.
EDIT: Here's a really easy way to prove whether or not the console will work:
System.out.println("Here's the character: \u00f4");
Parsing the file as fixed-size blocks of bytes is not good: what if some character has a byte representation that straddles two blocks? Use an InputStreamReader with the appropriate character encoding instead:
BufferedReader br = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("myfile.csv"), "ISO-8859-1"));
char[] buffer = new char[4096]; // character (not byte) buffer
while (true) {
    int charCount = br.read(buffer, 0, buffer.length);
    if (charCount == -1) break; // reached end-of-stream
    String s = String.valueOf(buffer, 0, charCount);
    // alternatively, we can append to a StringBuilder
    System.out.println(s);
}
Btw, remember to check that the unicode character can indeed be displayed correctly. You could also redirect the program output to a file and then compare it with the original file.
As Jon Skeet suggests, the problem may also be console-related. Try System.console().printf(s) to see if there is a difference.
@Joel - your own answer confirms that the problem is a difference between the default encoding on your operating system (UTF-8, the one Java has picked up) and the encoding your terminal is using (ISO-8859-1).
Consider this code:
public static void main(String[] args) throws IOException {
    byte[] data = { (byte) 0xF4 };
    String decoded = new String(data, "ISO-8859-1");
    if (!"\u00f4".equals(decoded)) {
        throw new IllegalStateException();
    }
    // write default charset
    System.out.println(Charset.defaultCharset());
    // dump bytes to stdout
    System.out.write(data);
    // will encode to default charset when converting to bytes
    System.out.println(decoded);
}
By default, my Ubuntu (8.04) terminal uses the UTF-8 encoding. With this encoding, this is printed:
UTF-8
?ô
If I switch the terminal's encoding to ISO 8859-1, this is printed:
UTF-8
ôô
In both cases, the same bytes are being emitted by the Java program:
5554 462d 380a f4c3 b40a
The only difference is in how the terminal is interpreting the bytes it receives. In ISO 8859-1, ô is encoded as 0xF4. In UTF-8, ô is encoded as 0xC3B4. The other characters are common to both encodings.
If you can, try to run your program in a debugger to see what's inside your 's' string after it is created. It is possible that it has the correct content, but the output is garbled after the System.out.println(s) call. In that case, there is probably a mismatch between what Java thinks is the encoding of your output and the character encoding of your terminal/console on Linux.
Basically, if it works on your local XP PC but not on Linux, and you are parsing the exact same file (i.e. you transferred it in a binary fashion between the boxes), then it probably has something to do with the System.out.println call. I don't know how you verify the output, but if you do it by connecting with a remote shell from the XP box, then there is the character set of the shell (and the client) to consider.
Additionally, what Zach Scrivena suggests is also true - you cannot assume that you can create strings from chunks of data in that way - either use an InputStreamReader or read the complete data into an array first (obviously not going to work for a large file). However, since it does seem to work on XP, then I would venture that this is probably not your problem in this specific case.
