I am unable to convert extended ASCII characters (having codes greater than 128) into their codes.
I am using (int)'�' for the conversion, but it is giving
� -- 65533
I am using the following Java function:
static String decodeCandidateId2(String CandidateId) {
    byte[] valueDecoded = Base64.decodeBase64(CandidateId.getBytes());
    CandidateId = new String(valueDecoded);
    String key = "#!#&%$##&^%$";
    String output = "";
    for (int i = 0; i < CandidateId.length(); i++) {
        int ascii = 0;
        ascii = (int) CandidateId.charAt(i) - ((int) key.charAt((i - 1) % key.length()));
        output += Character.toString((char) ascii);
    }
    return output;
}
If CandidateId = "VXJaWl9lVlV0XpRbZHVZWFpeXVuAV5NVW3BRW19XVQ==", the current output is 12979#2248゚6#5854998ᄑ1゚07008921, but I need to get 12979#224866#5854998#1507008921 as output.
Can anyone please help me get the correct code?
Strictly speaking, there is no such thing as "Extended ASCII" - or there are several different colloquial definitions of this non-standard term. ASCII codes are from 0 to 127. Full stop.
The Java type char has values ranging from 0 to 65535, which are code points in the Basic Multilingual Plane of the Unicode character set.
Your encoding algorithm uses 16-bit subtraction. "Negative" values will be in the range 32768 to 65535 (since char values are unsigned). However, you seem to want to only deal with values in the range 0 to 255. To do that, you can force your arithmetic to be modulo 256 - e.g. by ANDing the result with 0xFF.
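As an illustration only, here is a minimal sketch of the question's decoder with that masking applied. It assumes java.util.Base64 (Java 8+) in place of the Apache Commons decoder from the question, decodes the bytes as ISO-8859-1 so that each byte 0-255 maps to the char with the same numeric value, and uses Math.floorMod so the question's (i - 1) key index cannot go negative on the first iteration. The key is the question's own, so treat this as a sketch of the masking idea, not a verified drop-in fix:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Decoder {
    static String decodeCandidateId2(String candidateId) {
        byte[] decoded = Base64.getDecoder().decode(candidateId);
        // ISO-8859-1 maps byte 0xNN to char U+00NN, so no byte values are mangled.
        String text = new String(decoded, StandardCharsets.ISO_8859_1);
        String key = "#!#&%$##&^%$";
        StringBuilder output = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            // Mask with 0xFF to force modulo-256 arithmetic: without it, a
            // "negative" difference wraps into 65408..65535 as a 16-bit char.
            int value = (text.charAt(i) - key.charAt(Math.floorMod(i - 1, key.length()))) & 0xFF;
            output.append((char) value);
        }
        return output.toString();
    }
}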
I have a Python background, and I don't understand how byte casting returns the decimal value of a char according to ASCII.
Here are some code examples:
// C#
string s = "abc123éé";
int[] x = new int[255];
for (int i = 0; i < s.Length; i++) {
    x[(byte)s[i] - (byte)'0']++;
}
If we look at the first iteration, the cast is applied to the char 'a', and it returns 97.
// Java
char a = 'a';
System.out.println((byte)a);
Same as in C#, it returns 97 too. But in Python 3, it does not return the decimal value of the char:
>>> a = bytes("a", encoding="utf-8")
>>> a
b'a'
And now, coming to my questions:
How / why does byte casting work like this?
I know that byte's value range is -128 to 127 but char's is 0 to 255. Why doesn't it throw an exception even though 'é' has the value 233?
What is the difference in Python at this point?
Only for Java; I do not use Python:
How / why does byte casting work like this?
It is specified by the Java Language Specification, mostly JLS-5.1.3: "...A narrowing conversion of a char to an integral type T likewise simply discards all but the n lowest order bits, where n is the number of bits used to represent type T. In addition to a possible loss of information about the magnitude of the numeric value, this may cause the resulting value to be a negative number, even though chars represent 16-bit unsigned integer values..."
("Why?" because it is so specified)
I know that byte's value range is -128 to 127 but char's is 0 to 255. Why doesn't it throw an exception even though 'é' has the value 233?
Wrong: chars are 0 to 65535 (or '\u0000' to '\uFFFF'), see JLS-4.2.1.
There is no reason for an exception; it simply results in the byte value -23 (the same bits as 'é', or int 233).
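A quick demonstration of that narrowing conversion (plain Java, nothing beyond java.lang):

public class NarrowingDemo {
    public static void main(String[] args) {
        char e = '\u00E9';            // 'é', numeric value 233 (0xE9)
        byte b = (byte) e;            // narrowing keeps only the low 8 bits
        System.out.println((int) e);  // 233
        System.out.println(b);        // -23: same bits 0xE9, read as signed
        System.out.println(b & 0xFF); // 233 again: masking undoes the sign extension
    }
}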
I must pass on the last point/question; I do not know enough Python.
I am trying to get a char from an int value > 0xFFFF. But instead, I always get back the same char value which, when cast to an int, prints the value 65535 (0xFFFF).
I can't understand how symbols for Unicode code points > 0xFFFF are supposed to be generated.
int hex = 0x10FFFF;
char c = (char)hex;
System.out.println((int)c);
I expected the output to be 0x10FFFF. Instead, the output comes back as 65535.
This is because, while an int is 4 bytes, a char is only 2 bytes. Thus, you can't represent all values in a char that you can in an int. Using a standard unsigned integer representation, you can only represent the range of values from 0 to 2^16 - 1 == 65535 in a 2-byte value, so if you convert any number outside that range to a 2-byte value and back, you'll lose data.
int is 4 bytes. char is 2 bytes.
Your number was well within the range an int can hold, but not within the range a char can.
So when you converted that number to a char, the high-order bits were discarded; the low 16 bits of 0x10FFFF happen to be 0xFFFF, which is what it printed, i.e. 65535.
Your number was too big for a char, which is 2 bytes, but small enough to fit in an int, which is 4 bytes. The narrowing cast keeps only the low 16 bits of 0x10FFFF, which are 0xFFFF, i.e. 65535; that's why you got that value. Also, if a char were big enough to fit your number, converting it back to an int would have returned the decimal value of 0x10FFFF, which is 1114111.
Unfortunately, I think you were expecting a Java char to be the same thing as a Unicode code point. They are not the same thing.
The Java char, as already expressed by other answers, can only support code points that can be represented in 16 bits, whereas Unicode needs 21 bits to support all code points.
In other words, a Java char on its own, only supports Basic Multilingual Plane characters (code points <= 0xFFFF). In Java, if you want to represent a Unicode code point that is in one of the extended planes (code points > 0xFFFF), then you need surrogate characters, or a pair of characters to do that. This is how UTF-16 works. And, internally, this is how Java strings work as well. Just for fun, run the following snippet to see how a single Unicode code point is actually represented by 2 characters if the code point is > 0xFFFF:
// Printing string length for a string with
// a single unicode code point: 0x22BED.
System.out.println("𢯭".length()); // prints 2, because it uses a surrogate pair.
If you want to safely convert an int value that represents a Unicode code point to a char (or chars to be more exact), and then convert it back to an int code point, you will have to use code like this:
public static void main(String[] args) {
int hex = 0x10FFFF;
System.out.println(Character.isSupplementaryCodePoint(hex)); // prints true because hex > 0xFFFF
char[] surrogateChars = Character.toChars(hex);
int codePointConvertedBack = Character.codePointAt(surrogateChars, 0);
System.out.println(codePointConvertedBack); // prints 1114111
}
Alternatively, instead of manipulating char arrays, you can use a String, like this:
public static void main(String[] args) {
int hex = 0x10FFFF;
System.out.println(Character.isSupplementaryCodePoint(hex)); // prints true because hex > 0xFFFF
String s = new String(new int[] {hex}, 0, 1);
int codePointConvertedBack = s.codePointAt(0);
System.out.println(codePointConvertedBack); // prints 1114111
}
For further reading: Java Character Class
I've got a huge string of bits (with some \n in it too) that I pass as a parameter to a method, which should split the bits into groups of 8 and convert each group to a byte using parseInt().
The thing is, every time a substring of 8 bits starts with a 1, the resulting byte is a negative number. For example, the first substring is '10001101', and the resulting byte is -115. I can't seem to figure out why; can someone help? It works fine with the other substrings.
Here's my code, if needed:
static String bitsToBytes(String geneString) {
    String geneString_temp = "", sub;
    for (int i = 0; i < geneString.length(); i = i + 8) {
        sub = geneString.substring(i, i + 8);
        if (sub.indexOf("\n") != -1) {
            if (sub.indexOf("\n") != geneString.length())
                sub = sub.substring(0, sub.indexOf("\n")) + sub.substring(sub.indexOf("\n") + 1, sub.length()) + geneString.charAt(i + 9);
        }
        byte octet = (byte) Integer.parseInt(sub, 2);
        System.out.println(octet);
        geneString_temp = geneString_temp + octet;
    }
    geneString = geneString_temp + "\n";
    return geneString;
}
In Java, byte is a signed type, meaning that when the most significant bit is set to 1, the number is interpreted as negative.
This is precisely what happens when you print your byte here:
System.out.println(octet);
Since PrintStream does not have an overload of println that takes a single byte, the overload that takes an int gets called. Since octet's most significant bit is set to 1, the number gets sign-extended by replicating its sign bit into bits 9..32, resulting in printout of a negative number.
byte is a signed two's complement integer. So this is a normal behavior: the two's complement representation of a negative number has a 1 in the most-significant bit. You could think of it like a sign bit.
If you don't like this, you can use the following idiom:
System.out.println( octet & 0xFF );
This will pass the byte as an int while preventing sign extension. You'll get an output as if it were unsigned.
Java doesn't have unsigned types, so the only other thing you could do is store the numbers in a wider representation, e.g. short.
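To tie the two answers above together, here is a short demo using the '10001101' substring from the question:

public class SignDemo {
    public static void main(String[] args) {
        byte octet = (byte) Integer.parseInt("10001101", 2); // parses to 141, narrows to -115
        System.out.println(octet);            // -115: println(int) sign-extends the byte
        System.out.println(octet & 0xFF);     // 141: the mask keeps only the low 8 bits
        short wider = (short) (octet & 0xFF); // a wider type can hold 0..255 directly
        System.out.println(wider);            // 141
    }
}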
In Java, all integer types except char are signed, and the most significant bit is the sign bit.
Note that Integer.parseInt parses a signed int, so "10001101" parses to the positive value 141; it only becomes negative when you narrow it to a byte. If you need to parse binary strings that use all 32 bits as an unsigned value, try Integer.parseUnsignedInt instead.
The problem I am facing occurs when I try to type cast some ASCII values to char.
For example:
(char)145 //returns ?
(char)129 //also returns ?
but it is supposed to return a different character. It happens to many other values as well.
I hope I have been clear enough.
ASCII is a 7-bit encoding system. Some programs even use this to detect whether a file is binary or textual. Characters below 32 are control characters and are used as directives (for instance newlines or printer directives).
The program will still work, however. A char is simply stored in sixteen bits (the same size as a short). But the values you are casting have no visible interpretation, which means the textual output of both values leads to nothing useful. On the other hand, comparisons like (char) 145 == (char) 129 will still work (and return false), simply because to the processor there is no difference between a short and a character.
If you are interested in converting your value such that only the lowest seven bits count (thus modifying the value so that it falls in the valid ASCII range), you can use masking:
int value = 145;
value &= 0x7f;         // keep only the low 7 bits: 145 & 0x7f == 17
char c = (char) value; // now guaranteed to be in the ASCII range 0-127
The char type is Unicode 16-bit, UTF-16. So you could do (char) 265 for c-with-circumflex ('ĉ'). ASCII is 7 bits, 0 - 127.
String s = "" + ((char)145) + ((char)129);
The above is a string of two Unicode characters (each 2 bytes, UTF-16).
byte[] bytes = s.getBytes(StandardCharsets.US_ASCII); // ASCII with '?' for chars outside 7 bits
s = new String(bytes, StandardCharsets.US_ASCII);     // "??"
bytes = s.getBytes(StandardCharsets.ISO_8859_1);      // ISO-8859-1, Latin-1
bytes = s.getBytes("Windows-1252");                   // Windows Latin-1
bytes = s.getBytes(StandardCharsets.UTF_8);           // no information loss
s = new String(bytes, StandardCharsets.UTF_8);        // original string
In Java, String/char/Reader/Writer handle text (in Unicode), whereas byte[]/InputStream/OutputStream handle binary data, bytes.
Bytes must always be associated with an encoding to yield text.
Answer: as soon as there is a conversion from text to some encoding that cannot represent that char, a question mark may be written.
These expressions evaluate to true:
((char) 145) == '\u0091';
((char) 129) == '\u0081';
These UTF-16 values map to the Unicode code points U+0091 and U+0081:
0091;<control>;Cc;0;BN;;;;;N;PRIVATE USE ONE;;;;
0081;<control>;Cc;0;BN;;;;;N;;;;;
These are both control characters without visible graphemes (the question mark acts as a substitution character) and one of them is private use so has no designated purpose. Neither are in the ASCII set.
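A small check of those claims, using only standard java.lang.Character methods (nothing here is specific to the question's code):

public class ControlCheck {
    public static void main(String[] args) {
        System.out.println(((char) 145) == '\u0091'); // true
        System.out.println(((char) 129) == '\u0081'); // true
        // Both code points are in the Cc (control) general category:
        System.out.println(Character.getType((char) 145) == Character.CONTROL); // true
        System.out.println(Character.getType((char) 129) == Character.CONTROL); // true
    }
}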
I apologize if this question is a bit simplistic, but I'm somewhat puzzled as to why my professor has made the following the statement:
Notice that read() returns an integer value. Using an int as a return type allows read() to use -1 to indicate that it has reached the end of the stream. You will recall from your introduction to Java that an int is equal to a char which makes the use of the -1 convenient.
The professor was referencing the following sample code:
public class CopyBytes {
    public static void main(String[] args) throws IOException {
        FileInputStream in = null;
        FileOutputStream out = null;
        try {
            in = new FileInputStream("Independence.txt");
            out = new FileOutputStream("Independence.txt");
            int c;
            while ((c = in.read()) != -1) {
                out.write(c);
            }
        } finally {
            if (in != null) {
                in.close();
            }
            if (out != null) {
                out.close();
            }
        }
    }
}
This is an advanced Java course, so obviously I've taken a few introductory courses prior to this one. Maybe I'm just having a "blonde moment" of sorts, but I'm not understanding in what context an integer could be equal to a character when making comparisons. The instance method read() returns the integer value -1 when it reaches EOF. That I understand perfectly.
Can anyone shed light on the statement in bold?
In Java, char is a more specific kind of int. I can write:
char c = 65;
The following code prints out "A". I need the cast there so Java knows I want the character representation and not the integer one:
public static void main(String... str) {
System.out.println((char) 65);
}
You can look up the int to character mapping in an ASCII table.
And per your teacher, int allows for more values. Since -1 isn't a character value, it can serve as a flag value.
To a computer a character is just a number (that may at some point be mapped to a picture of a letter for display to the user). Languages usually have a special character type to distinguish between "just a number" and "a number that refers to a character", but inside, it's still just some sort of integer.
The reason why read() returns an int is to have "one extra value" to represent EOF. All the values of char are already defined to mean something else, so it uses a larger type to get more values.
It means your professor has been spending too much time programming in C. The definition of read for InputStream (and FileInputStream) is:
Reads the next byte of data from the input stream. The value byte is returned as an int in the range 0 to 255. If no byte is available because the end of the stream has been reached, the value -1 is returned.
(See http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#read())
A char in Java, on the other hand, represents a Unicode character, and is treated as an integer in the range 0 to 65535. (In C, a char is an 8-bit integral value, either 0 to 255 or -128 to 127.)
Please note that in Java, a byte is actually an integer in the range -128 to 127; but the definition of read has been specified to avoid the problem, by decreeing that it will return 0 to 255 anyway. The javadoc is using "byte" in a loose sense here.
The char data type in Java is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
The int data type in Java is a 32-bit signed two's complement integer. It has a minimum value of -2,147,483,648 and a maximum value of 2,147,483,647 (inclusive).
Since a char cannot be negative (it is a number between 0 and 65,535) and an int can be negative, an int can represent every possible char value plus the extra -1 that read() uses to signify nothing is left.
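For reference, a quick check of those ranges (plain java.lang constants, nothing question-specific):

public class Ranges {
    public static void main(String[] args) {
        System.out.println((int) Character.MIN_VALUE); // 0
        System.out.println((int) Character.MAX_VALUE); // 65535
        System.out.println(Integer.MIN_VALUE);         // -2147483648
        System.out.println(Integer.MAX_VALUE);         // 2147483647
    }
}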
What your professor is referring to is the fact that characters are just integers used in a special context. If we ignore Unicode and other encoding types and focus on the old days of ASCII, there was an ASCII table (http://www.asciitable.com/). A string of characters is really just a sequence of integers; for example, "TUV" would be 84 followed by 85 followed by 86.
The 'char' type is an integer internally in the JVM and is more or less a hint that this integer should only be used in a character context.
You can even cast between them.
char a = (char) 65;
int i = (int) 'A';
Those two variables hold the same data in memory, but the compiler and JVM treat them slightly differently.
Because of this, read() returns an integer instead of char so as to allow a -1, which is not a valid character code. Values other than -1 can be cast to a char, while -1 indicates EOF.
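A minimal sketch of that pattern, assuming an InputStream like the one in the question (the dump method and class name here are just for illustration):

import java.io.IOException;
import java.io.InputStream;

public class ReadDemo {
    // Prints each byte of the stream as a character.
    static void dump(InputStream in) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            // Inside the loop, c is guaranteed to be in 0..255, so the
            // narrowing cast can never confuse a real byte with EOF.
            System.out.print((char) c);
        }
    }
}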
Of course, Unicode changes all of this with multi-byte character and code points. I'll leave that as an exercise to you.
I am not sure what the professor means but what it all comes down to is computers only understand 1's and 0's we don't understand 1's and 0's all that we'll so we use a code system first Morris code then ascii now utf -16 ... It varies from computer to computer how accurate numbers(int) is.you know in the real world int is infinate they just keep counting.char also has a size.in utf _16 let's just say it's 16 bits (I will let you read up on that) so if char and int both take 16 bits as the professor says they are the same (size) and reading 1 char is the same as 1int . By the way to be politically correct char is infinite as well.Chinese characters French characters and the character I just made up but can't post cause its not supported.so think of the code system for int and char. -1 int is eof char.(eof = end of file) good luck, I hope this helped.what I don't understand is reading and writing to the same file?