Is there any way in Java so that I can get the Unicode equivalent of any character? E.g.:
Suppose a method getUnicode(char c). A call getUnicode('÷') should return \u00f7.
You can do it for any Java char using the one liner here:
System.out.println( "\\u" + Integer.toHexString('÷' | 0x10000).substring(1) );
But it will only work for Unicode characters up to Unicode 3.0, which is why I specified that you can do it for any Java char.
Java was designed well before Unicode 3.1 came out, and hence Java's char primitive is inadequate to represent Unicode 3.1 and up: there is no longer a "one Unicode character to one Java char" mapping (instead, a monstrous hack - surrogate pairs - is used).
So you really have to check your requirements here: do you need to support Java char or any possible Unicode character?
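If you do need every possible Unicode character, a minimal sketch that works on code points rather than chars (the method name here is mine, purely illustrative):

static String toUnicodeEscape(int codePoint) {
    if (codePoint <= 0xFFFF) {
        return String.format("\\u%04x", codePoint);
    }
    // supplementary characters are stored as a surrogate pair of two chars
    return String.format("\\u%04x\\u%04x",
            (int) Character.highSurrogate(codePoint),
            (int) Character.lowSurrogate(codePoint));
}

For example, toUnicodeEscape("𝄞".codePointAt(0)) returns \ud834\udd1e.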
If you have Java 5, use char c = ...; String s = String.format ("\\u%04x", (int)c);
If your source isn't a Unicode character (char) but a String, you must use charAt(index) to get the Unicode character at position index.
Don't use codePointAt(index) for this, because it can return values above U+FFFF (full Unicode code points need up to 21 bits), which can't be represented with just 4 hex digits (they need up to 6). See the docs for an explanation.
[EDIT] To make it clear: this answer doesn't use Unicode code points but the method which Java uses to represent Unicode characters internally (i.e. surrogate pairs), since char is 16 bit while Unicode code points need up to 21 bits. The question should really be: "How can I convert a char to a 4-digit hex number?", since it's not (really) about Unicode.
private static String toUnicode(char ch) {
return String.format("\\u%04x", (int) ch);
}
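For a supplementary character this helper simply shows the two surrogate chars Java actually stores, e.g.:

String s = "𝄞"; // U+1D11E MUSICAL SYMBOL G CLEF, stored as a surrogate pair
System.out.println(toUnicode(s.charAt(0)) + toUnicode(s.charAt(1))); // \ud834\udd1e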
char c = 'a';
String a = Integer.toHexString(c); // gives you a = "61"
I found this nice code on the web:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
public class Unicode {
public static void main(String[] args) {
System.out.println("Use CTRL+C to quite to program.");
// Create the reader for reading in the text typed in the console.
InputStreamReader inputStreamReader = new InputStreamReader(System.in);
BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
try {
String line = null;
while ((line = bufferedReader.readLine()) != null && line.length() > 0) {
for (int index = 0; index < line.length(); index++) {
// Convert the integer to a hexadecimal code.
String hexCode = Integer.toHexString(line.codePointAt(index)).toUpperCase();
// but it must be a four-digit value, so pad with leading zeros.
String hexCodeWithAllLeadingZeros = "0000" + hexCode;
String hexCodeWithLeadingZeros = hexCodeWithAllLeadingZeros.substring(hexCodeWithAllLeadingZeros.length()-4);
System.out.println("\\u" + hexCodeWithLeadingZeros);
}
}
} catch (IOException ioException) {
ioException.printStackTrace();
}
}
}
Original Article
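Note that the loop above indexes by char, so a supplementary character would come out as its full code point at the high surrogate's index followed by a stray low surrogate, and the 4-digit padding trick silently truncates 5- and 6-digit code points. A variant that iterates by code point instead (Java 8+) and prints the same \uXXXX escapes, surrogates included:

line.codePoints().forEach(cp -> {
    for (char ch : Character.toChars(cp)) { // one or two chars per code point
        System.out.printf("\\u%04X%n", (int) ch);
    }
});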
Are you set on using Unicode escapes? With Java it's simpler if you write your program to use the "dec" value (the HTML code), because then you can simply cast between the char and int data types:
char a = 98;
char b = 'b';
char c = (char) (b+0002);
System.out.println(a);
System.out.println((int)b);
System.out.println((int)c);
System.out.println(c);
Gives this output:
b
98
100
d
I am investigating some mess that has been made of our language support (it is used in our IDN functionality, if that rings a bell)...
I used an SQL GUI client to quickly see the structure of our language definitions. So, when I do select charcodes from ourCharCodesTable where language = 'myLanguage';, I get results for some values of 'myLanguage', E.G.:
myLanguage = "ASCII":
result = "-0123456789abcdefghijklmnopqrstuvwxyz"
myLanguage = "Russian":
result = "-0123456789абвгдежзийклмнопрстуфхцчшщъьюяѐѝ"
(BTW: can already see a language mistake here, if you are a polyglot like me!)
I thought: "OK, I can work with this! Let's write a Java program and put some logic to find mistakes..."
I need my logic to receive one char at a time from the 'result' and, according to the current table context, apply my logic to flag if it should or should not be there...
However! When I am at:
myLanguage = "Belarusian" :
One would think this language is rather similar to Russian, but the very format of the result, as it comes from the database, is totally different: result = "U+002D\nU+0030\nU+0030..."!
And, there's another format!
myLanguage = "Chinese" :
result = "#\nU+002D;U+002D;U+003D,U+004D,U+002D\nU+0030;U+0030;U+0030"
FWIW: the charcodes column is of CLOB type.
I know U+002D is '-' and U+0030 is '0'...
My current idea is to:
1] Check if the entire response is in 'щ' format or in 'U+0449' format (whether the 'U+****'s are separated with ';', ',' or '\n' - I am just going to treat them as standalone chars)
a. If it is the "easy one", just send the char on to my testing method
b. If it is the "hard one", get the hex part (0449), convert to decimal (1097) and cast to char (щ)
So, again, my questions are:
What is this "U+043E;U+006F,U+004D" format?
If it is a widely-used standard, does Java offer any methods to convert a whole String of these into a char array?
UPDATED
What is this "U+043E;U+006F,U+004D" format?
In a comment, OP provided a link to https://www.iana.org/domains/idn-tables/tables/academy_zh_1.0.txt, which has the following text:
This table conforms to the format specified in RFC 3743.
RFC 3743 can be found at https://www.rfc-editor.org/rfc/rfc3743
If it is a widely-used standard, does Java offer any methods to convert a whole String of these into a char array?
It is not a widely-used standard, so Java does not offer that natively, but it is easy to convert to a regular String using a regex, so you can then process the string normally.
// Java 11+
static String decodeUnicode(String input) {
// quoteReplacement guards against a decoded '$' or '\' being re-interpreted by replaceAll
return Pattern.compile("U\\+[0-9A-F]{4,6}").matcher(input).replaceAll(mr ->
Matcher.quoteReplacement(Character.toString(Integer.parseInt(mr.group().substring(2), 16))));
}
// Java 9+
static String decodeUnicode(String input) {
return Pattern.compile("U\\+[0-9A-F]{4,6}").matcher(input).replaceAll(mr ->
Matcher.quoteReplacement(new String(new int[] { Integer.parseInt(mr.group().substring(2), 16) }, 0, 1)));
}
// Java 1.5+
static String decodeUnicode(String input) {
StringBuffer buf = new StringBuffer();
Matcher m = Pattern.compile("U\\+[0-9A-F]{4,6}").matcher(input);
while (m.find()) {
String hexString = m.group().substring(2);
int codePoint = Integer.parseInt(hexString, 16);
String unicodeCharacter = new String(new int[] { codePoint }, 0, 1);
m.appendReplacement(buf, Matcher.quoteReplacement(unicodeCharacter)); // quote any '$' or '\'
}
return m.appendTail(buf).toString();
}
Test
System.out.println(decodeUnicode("#\nU+002D;U+002D;U+003D,U+004D,U+002D\nU+0030;U+0030;U+0030"));
Output
#
-;-;=,M,-
0;0;0
U+0000 is a representation of a Unicode codepoint, and the format is defined in Appendix A of the Unicode Standard. The number is simply the hex-encoded value of the represented codepoint. For historical reasons it is always left-padded with 0 to at least 4 digits, but can be up to 6 digits long.
It is not primarily meant as a machine-readable encoding, but rather as a human-readable representation of Unicode codepoints for use in running text (i.e. paragraphs such as this one). Note especially that this format has no way to distinguish a four-digit number followed by some digits from a 5- or 6-digit number. So U+123456 could be interpreted in 3 different ways: U+1234 followed by the text 56, U+12345 followed by the text 6, or U+123456. This makes it unsuited for automatic replacement and for use as a general-purpose encoding.
As such there is no built-in functionality to parse this into its equivalent String or similar in Java.
The following code can be used to parse a single Unicode codepoint reference into the appropriate codepoint in a String:
public static String codePointToString(String input) {
if (!input.startsWith("U+")) {
throw new IllegalArgumentException("Malformed input, doesn't start with U+");
}
int codepoint = Integer.parseInt(input.substring(2), 16);
if (codepoint < 0 || codepoint > Character.MAX_CODE_POINT) {
throw new IllegalArgumentException("Malformed input, codepoint value out of valid range: " + codepoint);
}
return Character.toString(codepoint);
}
(Before Java 11 the return line needs to use new String(new int[] { codepoint }, 0, 1) instead).
And if you want to replace all Unicode codepoints represented in a text by their actual text (which might render it unreadable in some cases) you can use this (together with the method above):
private static final Pattern PATTERN = Pattern.compile("U\\+[0-9A-Fa-f]{4,6}"); // hex digits only
public static String decodeCodePoints(String input) {
return PATTERN
.matcher(input)
.replaceAll(result -> codePointToString(result.group()));
}
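Example usage:

System.out.println(decodeCodePoints("U+002D;U+0030")); // prints: -;0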
Actually, I wrote an open-source library called MgntUtils that has a utility which can help you here. The codes that you see are Unicode sequences, where each U+XXXX represents a character. The utility in the library can convert any string in any language (including special characters) into Unicode sequences and vice versa. Here is a sample of how it works:
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
The library can be found on Maven Central or on GitHub. It comes as a Maven artifact, with sources and Javadoc.
Here is the Javadoc for the class StringUnicodeEncoderDecoder.
I'm trying to write code that picks a word from a file according to an index entered by the user, but the method readChar() from the RandomAccessFile class is returning Japanese characters. I must admit it's not the first time I've seen this on my Lenovo laptop; sometimes installation wizards show normal characters mixed with Japanese characters. Do you think it comes from the laptop or rather from the code?
This is the code:
package com.project;
import java.io.*;
import java.util.StringTokenizer;
public class Main {
public static void main(String[] args) throws IOException {
int N, i=0;
char C;
char[] charArray = new char[100];
String fileLocation = "file.txt";
BufferedReader buffer = new BufferedReader(new InputStreamReader(System.in));
do {
System.out.println("enter the index of the word");
N = Integer.parseInt(buffer.readLine());
if (N!=0) {
RandomAccessFile word = new RandomAccessFile(new File(fileLocation), "r");
do {
word.seek((2*(N-1))+i);
C = word.readChar();
charArray[i] = C;
i++;
}while(charArray[i-1] != ' ');
System.out.println("the word of index " + N + " is: " );
for (char carTemp : charArray )
System.out.print(carTemp);
System.out.print("\n");
}
}while(N!=0);
buffer.close();
}
}
I get this output:
瑯潕啰灰灥敲牃䍡慳獥攨⠩⤍ഊੴ瑯潌䱯潷睥敲牃䍡慳獥攨⠩⤍ഊ捯潭浣捡慴琨⡓却瑲物楮湧朩⤍ഊ捨桡慲牁䅴琨⡩楮湴琩⤍ഊੳ獵畢扳獴瑲物楮湧木⠠獴瑡慲牴琠楮湤摥數砬Ⱐ敮湤搠楮湤摥數砩⤍ഊੴ瑲物業洨⠩Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 100 out of bounds for length 100
at Main.main(Main.java:21)
There are many things wrong, all of which have to do with fundamental misconceptions.
First off: A file on your disk - never mind the File interface in Java, or any other programming language; the file itself - does not and cannot store text. Ever. It stores bytes. That is, raw data, as (on every machine that's been relevant for decades, but historically there have been other ways to do it) quantified in bits, which are organized into groups of 8 that are called bytes.
Text is an abstraction; an interpretation of some particular sequence of byte values. It depends - fundamentally and unavoidably - on an encoding. Because this isn't a blog, I'll spare you the history lesson here, but suffice to say that Java's char type does not simply store a character of text. It stores an unsigned two-byte value, which may represent a character of text. Because there are more characters of text in Unicode than two bytes can represent, sometimes two adjacent chars in an array are required to represent a character of text. (And, of course, there is probably code out there that abuses the char type simply because someone wanted an unsigned equivalent of short. I may even have written some myself. That era is a blur for me.)
Anyway, the point is: using .readChar() is going to read two bytes from your file, and store them into a char within your char[], and the corresponding numeric value is not going to be anything like the one you wanted - unless your file happens to be encoded using the same encoding that Java uses natively, called UTF-16.
You cannot properly read and interpret the file without knowing the file encoding. Full stop. You can at best delude yourself into believing that you can read it. You also cannot have "random access" to a text file - i.e., indexing according to a number of characters of text - unless the encoding in question is constant width. (Otherwise, of course, you can't just calculate the distance-in-bytes into the file where a given character of text is; it depends on how many bytes the previous characters took up, which depends on which characters they are.) Many text encodings are not constant width. One of the most popular, UTF-8 - which frankly is the sane default recommendation for most tasks these days - is not. In which case you are simply out of luck for the problem you describe.
At any rate, once you know the encoding of your file, the expected way to retrieve a character of text from a file in Java is to use one of the Reader classes, such as InputStreamReader:
An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.
(Here, charset simply means an instance of the class that Java uses to represent text encodings.)
You may be able to fudge your problem description a little bit: seek to a byte offset, and then grab the text characters starting at that offset. However, there is no guarantee that the "text characters starting at that offset" make any sense, or in fact can be decoded at all. If the offset happens to be in the middle of a multi-byte encoding for a character, the remaining part isn't necessarily valid encoded text.
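For instance, a minimal sketch of reading the file with an explicit charset (assuming here that the file is UTF-8; substitute whatever encoding the file actually uses):

try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream("file.txt"), java.nio.charset.StandardCharsets.UTF_8))) {
    int c;
    while ((c = reader.read()) != -1) {
        // c is one decoded UTF-16 char value, however many bytes it took on disk
    }
}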
char is 16 bits, i.e. 2 bytes.
seek seeks to a byte boundary.
If the file contains chars then they are at even offsets: 0, 2, 4...
The expression (2*(N-1))+i is even iff i is even; if i is odd, you are sure to land in the middle of a char, and thus read garbage.
i starts at zero, but you increment it by 1, i.e., by half a character.
Your seek argument should probably be (2*(N-1+i)).
Alternative explanation: your file does not contain chars at all; for example, you created an ASCII file in which each character is a single byte.
In that case, the error is attempting to read ASCII (an obsolete character encoding) with a readChar function.
But if the file contains ASCII, the purpose of multiplying by 2 in the seek argument is obscure. It apparently serves no useful purpose.
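In that single-byte case, a byte-oriented sketch (assuming one byte per character, which only holds for single-byte encodings such as plain ASCII):

RandomAccessFile raf = new RandomAccessFile("file.txt", "r");
raf.seek(N - 1);    // byte offset equals character offset only for single-byte encodings
int b = raf.read(); // read one byte
char ch = (char) b; // valid for 7-bit ASCII data only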
I changed the encoding of the file to UTF-16 and modified the program to display the right indexes, those that represent the beginning of each word. Now it works fine, thank you guys.
import java.io.*;
public class Main {
public static void main(String[] args) throws IOException {
int N, i=0, j=0, k=0;
char C;
char[] charArray = new char[100];
String fileLocation = "file.txt";
BufferedReader buffer = new BufferedReader(new InputStreamReader(System.in));
DataInputStream in = new DataInputStream(new FileInputStream(fileLocation));
boolean EOF=false;
do {
try {
j++;
C = in.readChar();
if((C==' ')||(C=='\n')){
System.out.print(j+1+"\t");
}
}catch (IOException e){
EOF=true;
}
}while (EOF!=true);
System.out.println("\n");
do {
System.out.println("enter the index of the word");
N = Integer.parseInt(buffer.readLine());
if (N!=0) {
RandomAccessFile word = new RandomAccessFile(new File(fileLocation), "r");
do {
word.seek((2*(N-1+i)));
C = word.readChar();
charArray[i] = C;
i++;
}while(charArray[i-1] != ' ' && charArray[i-1] != '\n');
System.out.print("the word of index " + N + " is: " );
for (char carTemp : charArray )
System.out.print(carTemp);
System.out.print("\n");
i=0;
charArray = new char[100];
}
}while(N!=0);
buffer.close();
}
}
Because MySQL 5.1 does not support 4-byte UTF-8 sequences, I need to replace/drop the 4-byte sequences in these strings.
I'm looking for a clean way to replace these characters.
Replacing the characters with a question mark, as the Apache libraries do, is fine for this case, although an ASCII equivalent would be nicer, of course.
N.B. The input is from external sources (e-mail names) and upgrading the database is not a solution at this point in time.
We ended up implementing the following method in Java for this problem.
Basically, it replaces every character with a codepoint higher than the last 3-byte UTF-8 char with the replacement character.
The offset calculations are there to make sure we stay on Unicode codepoint boundaries.
// defined in our CharUtils helper class, referenced below
public static final String LAST_3_BYTE_UTF_CHAR = "\uFFFF"; // highest codepoint encodable in 3 UTF-8 bytes
public static final String REPLACEMENT_CHAR = "\uFFFD";     // Unicode replacement character
public static String toValid3ByteUTF8String(String s) {
final int length = s.length();
StringBuilder b = new StringBuilder(length);
for (int offset = 0; offset < length; ) {
final int codepoint = s.codePointAt(offset);
// anything above \uFFFF would need 4 bytes in UTF-8, so replace it
if (codepoint > CharUtils.LAST_3_BYTE_UTF_CHAR.codePointAt(0)) {
b.append(CharUtils.REPLACEMENT_CHAR);
} else {
if (Character.isValidCodePoint(codepoint)) {
b.appendCodePoint(codepoint);
} else {
b.append(CharUtils.REPLACEMENT_CHAR);
}
}
offset += Character.charCount(codepoint);
}
return b.toString();
}
Another simple solution is to use the regular expression [^\u0000-\uFFFF]. For example, in Java:
text.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
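For example (the emoji here is just an illustrative supplementary character; Java's regex engine treats it as a single code point, so the whole surrogate pair is replaced by one \uFFFD):

String cleaned = "a\uD83D\uDE00b".replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD"); // "a\uFFFDb"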
5-byte UTF-8 sequences begin with a 111110xx byte, and 6-byte UTF-8 sequences begin with a 1111110x byte. It is important to note that no continuation byte of a 1-4-byte UTF-8 sequence is ever that large, because continuation bytes always have the form 10xxxxxx.
Therefore you can just go through the bytes, and every time you see a byte of the kind 111110xx, emit a '?' to the output stream/array while skipping the next 4 bytes of the input; proceed analogously for the 6-byte sequences.
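A minimal sketch of that byte-level scan (assuming the input is already a raw UTF-8 byte array; the method name is illustrative):

static byte[] stripLongSequences(byte[] utf8) {
    java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream(utf8.length);
    for (int i = 0; i < utf8.length; i++) {
        int b = utf8[i] & 0xFF;
        if ((b & 0xFC) == 0xF8) {        // 111110xx: lead byte of a 5-byte sequence
            out.write('?');
            i += 4;                      // skip its 4 continuation bytes
        } else if ((b & 0xFE) == 0xFC) { // 1111110x: lead byte of a 6-byte sequence
            out.write('?');
            i += 5;                      // skip its 5 continuation bytes
        } else {
            out.write(b);
        }
    }
    return out.toByteArray();
}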
I am new to Java, but I am very fluent in C++ and C#, especially C#. I know how to do XOR encryption in both C# and C++. The problem is that the algorithm I wrote in Java to implement XOR encryption seems to produce wrong results: the results are usually a bunch of spaces, and I am sure that is wrong. Here is the class:
public final class Encrypter {
public static String EncryptString(String input, String key)
{
int length;
int index = 0, index2 = 0;
byte[] ibytes = input.getBytes();
byte[] kbytes = key.getBytes();
length = kbytes.length;
char[] output = new char[ibytes.length];
for(byte b : ibytes)
{
if (index == length)
{
index = 0;
}
int val = (b ^ kbytes[index]);
output[index2] = (char)val;
index++;
index2++;
}
return new String(output);
}
public static String DecryptString(String input, String key)
{
int length;
int index = 0, index2 = 0;
byte[] ibytes = input.getBytes();
byte[] kbytes = key.getBytes();
length = kbytes.length;
char[] output = new char[ibytes.length];
for(byte b : ibytes)
{
if (index == length)
{
index = 0;
}
int val = (b ^ kbytes[index]);
output[index2] = (char)val;
index++;
index2++;
}
return new String(output);
}
}
Strings in Java are Unicode - and Unicode strings are not general holders for bytes like ASCII strings can be.
You're taking a string and converting it to bytes without specifying what character encoding you want, so you're getting the platform default encoding - probably US-ASCII, UTF-8 or one of the Windows code pages.
Then you're performing arithmetic/logic operations on these bytes. (I haven't looked at what you're doing here - you say you know the algorithm.)
Finally, you're taking these transformed bytes and trying to turn them back into a string - that is, back into characters. Again, you haven't specified the character encoding (but you'll get the same as you got converting characters to bytes, so that's OK), but, most importantly...
Unless your platform default encoding uses a single byte per character (e.g. US-ASCII), then not all of the byte sequences you will generate represent valid characters.
So, two pieces of advice come from this:
Don't use strings as general holders for bytes
Always specify a character encoding when converting between bytes and characters.
In this case, you might have more success if you specifically give US-ASCII as the encoding. EDIT: This last sentence is not true (see comments below). Refer back to point 1 above! Use bytes, not characters, when you want bytes.
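A minimal sketch of advice 1, keeping the cipher data as bytes and using Base64 (Java 8+) only for transport; the method name is illustrative:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

static byte[] xor(byte[] data, byte[] key) {
    byte[] out = new byte[data.length];
    for (int i = 0; i < data.length; i++) {
        out[i] = (byte) (data[i] ^ key[i % key.length]); // repeat the key over the data
    }
    return out;
}

// usage: encode the raw cipher bytes instead of forcing them into a String
byte[] cipher = xor("input".getBytes(StandardCharsets.UTF_8), "key".getBytes(StandardCharsets.UTF_8));
String transportable = Base64.getEncoder().encodeToString(cipher);
byte[] decrypted = xor(Base64.getDecoder().decode(transportable), "key".getBytes(StandardCharsets.UTF_8));
String plain = new String(decrypted, StandardCharsets.UTF_8); // "input"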
If you use non-ASCII strings as keys, you'll get pretty strange results. The bytes in the kbytes array will be negative; sign extension then means that val will come out negative, and the cast to char will produce a character in the FF80-FFFF range.
These characters will certainly not be printable, and depending on what you use to check the output you may be shown "box" or some other replacement characters.
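To make the sign extension concrete: a key byte such as (byte) 0xE0 is -32 in Java, so b ^ kbytes[index] is computed on the sign-extended int 0xFFFFFFE0, and casting the negative result to char keeps its low 16 bits, landing in FF80-FFFF. Masking avoids this:

int val = (b ^ kbytes[index]) & 0xFF; // force the XOR result into 0..255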
I am actually writing a program to generate some truly random numbers, so I am trying to write an algorithm that calculates them from various factors. And I want to define my own encoding, so that when the data is converted, it can only contain strings of certain characters.
For example, the user wants only small letters, so I want something like this:
char[] result;
String output = new String(result, "MY OWN ENCODING");
Is there a way ?
Thanks a lot!
You probably want to create a converter class:
public class Converter {
private String charset;
public Converter(String charset) {this.charset = charset;}
public char[] convert(char[] input) {
char[] result = new char[input.length];
// do your own magic
return result;
}
}
Once the magic converter logic is implemented and in place:
Converter converter = new Converter("abcdefghijklmnopqrstuvwxyz");
char[] result = converter.convert(myCharArray);
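One possible "magic" for the placeholder above, purely as an illustration (folding each input char onto the allowed charset by taking its value modulo the charset length; the mapping is lossy, which is fine for one-way uses like random-string generation):

for (int i = 0; i < input.length; i++) {
    result[i] = charset.charAt(input[i] % charset.length()); // char values are non-negative, so % is safe
}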
You seem to have a misunderstanding of what a "character encoding" is.
Character encodings, which you can specify as the second argument in the String(byte[], String) constructor, do not have anything to do with what you seem to have in mind.
A character encoding is what specifies how characters in a string are converted to raw bytes and vice versa. For example, the ASCII encoding specifies that a byte with the value 65 represents the character 'A', 66 is 'B', etc.
A character encoding is not something that you can use to "convert" or "encode" a char[] with into a String object, and is not something that can filter out for example non-lower-case characters or transform arbitrary characters to lower-case characters. Forget about character encodings for this purpose, because that's not what character encodings are for.
Just write some method that converts the output of your random generator to characters in a suitable range. Here's a simple example of a program that generates a string from random lower-case characters:
import java.util.Random;

public class Example {
public static void main(String[] args) {
System.out.println(generate(10));
}
public static String generate(int len) {
Random random = new Random();
StringBuilder sb = new StringBuilder();
for (int i = 0; i < len; ++i) {
sb.append((char)('a' + random.nextInt(26)));
}
return sb.toString();
}
}