How to read UTF-8 characters from a file as bytes? - java

I am not able to read UTF-8 characters from a file as bytes.
The UTF-8 characters display as question marks (?) when the bytes are converted to characters.
The code snippet below shows the file reading.
Please tell me how to read UTF-8 characters from a file,
and please tell me what is wrong with the byte-array reading process.
public static void getData() {
    FormFile file = actionForm.getFile("UTF-8");
    try {
        byte[] fileContents = file.getFileData();
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < fileContents.length; i++) {
            sb.append((char) fileContents[i]);
        }
        System.out.println(sb.toString());
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
}
Output: ??Docum??ents (the input file content is "ÞDocumÿents"; it contains some Spanish characters.)

This is the problem:
for (int i = 0; i < fileContents.length; i++) {
    sb.append((char) fileContents[i]);
}
You're converting each byte to a char just by casting it. That only works for ASCII: bytes above 127 are negative in Java, and the cast sign-extends them into invalid chars, which print as question marks.
To read text from an InputStream, you adapt it via InputStreamReader, specifying the character encoding.
The simplest way of reading the whole of a file into a string would be to use Guava:
String text = Files.toString(file, Charsets.UTF_8);
Or to convert a byte array:
String text = new String(fileContents, "UTF-8");
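If you'd rather work with the stream directly, here is a minimal sketch of the InputStreamReader approach (assuming a plain file path rather than the FormFile above):
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("input.txt"), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line); // lines arrive already decoded from UTF-8
    }
}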

Related

Problem in reading text from the file using FileInputStream in Java

I have a file input.txt on my system and I want to read data from that file using FileInputStream in Java. There is no error in the code, but it still does not work: it does not display the output. Here is the code; can anyone kindly help me out?
package com.company;

import java.io.FileInputStream;
import java.io.InputStream;

public class Main {
    public static void main(String[] args) {
        byte[] array = new byte[100];
        try {
            InputStream input = new FileInputStream("input.txt");
            System.out.println("Available bytes in the file: " + input.available());
            // Read bytes from the input stream
            input.read(array);
            System.out.println("Data read from the file: ");
            // Convert byte array into string
            String data = new String(array);
            System.out.println(data);
            // Close the input stream
            input.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Use the utility class Files:
Path path = Paths.get("input.txt");
try {
    String data = Files.readString(path, Charset.defaultCharset());
    System.out.println(data);
} catch (Exception e) {
    e.printStackTrace();
}
For binary (non-text) data, one should use Files.readAllBytes.
available() is not the file length; it is just the number of bytes already buffered by the system. Reading more will block while the disk device is physically read.
String.getBytes(Charset) and new String(byte[], Charset) explicitly specify the charset of the actual bytes. The String then keeps the text in Unicode, so it can combine all the scripts of the world.
Java was designed with text as Unicode, in response to the situation at the time with C and C++. So in a String you can mix Arabic, Greek, Chinese and math symbols. For that, binary data (byte[], InputStream, OutputStream) must be given the encoding (Charset) the bytes are in; a conversion to Unicode then happens on the text side (String, char, Reader, Writer).
FileInputStream.read(byte[]) fills at most one buffer and returns the number of bytes actually read; you must use that return value and repeat the call until it returns -1, as in the sketch below.
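A minimal sketch of such a read loop (assuming the file holds UTF-8 text; the buffer size is arbitrary):
byte[] chunk = new byte[8192];
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
try (InputStream input = new FileInputStream("input.txt")) {
    int n;
    // read() returns the number of bytes actually read, or -1 at end of stream
    while ((n = input.read(chunk)) != -1) {
        buffer.write(chunk, 0, n);
    }
}
System.out.println(new String(buffer.toByteArray(), StandardCharsets.UTF_8));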

Text content of file stored in String not converting Unicode to ISO_8859_1

I am trying to convert Unicode into ISO_8859_1. It is quite easy when the Unicode is declared in a Java String variable, e.g.
String myString = "\u00E9checs";
byte[] bytesOfString = myString.getBytes();
String encoded_String = new String(bytesOfString, StandardCharsets.ISO_8859_1);
System.out.println(encoded_String);
Output:
échecs
So far so good, but when I try to convert the same text saved in a file, it is not converted, just printed as it is. Here is the code that reads from the file and performs the conversion:
String path = "st.txt"; // st.txt contains only one line, i.e. \u00E9checs
FileInputStream inputStream = null;
Scanner sc = null;
try {
    inputStream = new FileInputStream(path);
    sc = new Scanner(inputStream);
    while (sc.hasNextLine()) {
        byte[] bytesOfString = sc.nextLine().getBytes();
        String encoded_String = new String(bytesOfString, StandardCharsets.ISO_8859_1);
        System.out.println(encoded_String);
    }
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
} finally {
    if (inputStream != null) {
        inputStream.close();
    }
    if (sc != null) {
        sc.close();
    }
}
Output:
\u00E9checs
Note:
This is test code, so I am using a single line in the file; I need to apply the same procedure to a large file, which is why I am using the Scanner class to keep memory usage low.
Could anyone guide me on how to achieve the same result for text in a file as I get when the Unicode is declared directly in a Java String variable?
Thank you in advance; I look forward to your early response.
This is the problem:
byte[] bytesOfString = sc.nextLine().getBytes();
String encoded_String = new String(bytesOfString, StandardCharsets.ISO_8859_1);
So:
there are some 8859-1 bytes in a file
the scanner reads them, decoding with the platform's default charset
getBytes() then re-encodes that text into bytes, again using the default charset
and then new String(..., ISO_8859_1) decodes those bytes under the pretense that they're 8859-1
You should use a Scanner that expects 8859-1 input:
new Scanner(inputStream, StandardCharsets.ISO_8859_1);
and then nextLine() will do the correct conversion; no more code-juggling is needed, as the sketch below shows.
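Applied to the question's loop, a minimal sketch (the Scanner constructor taking a Charset requires Java 10+; on older JDKs pass the charset name "ISO-8859-1" as a String):
try (Scanner sc = new Scanner(new FileInputStream(path), StandardCharsets.ISO_8859_1)) {
    while (sc.hasNextLine()) {
        // nextLine() already returns correctly decoded text
        System.out.println(sc.nextLine());
    }
}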

FileInputStream read method keeps returning 194

I'm currently teaching myself Java I/O. I can read basic ASCII characters from a .txt file, but when I get to other Latin-1 characters within the 255 range, it prints 194 instead of the correct decimal number for the character.
For example, I can read abcdefg from the txt file, but if I throw in a character like © I don't get 169; for some reason I get 194. I tested this by printing all the chars between 1 and 255 in a loop, and that works; reading the same input does not, so I'm a little perplexed. I understand I could use a Reader object, but I want to cover the basics first by learning the byte streams. Here is what I have:
InputStream io = null;
try {
    io = new FileInputStream("thing.txt");
    int yeet = io.read();
    System.out.println(yeet);
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
See the UTF-8 encoding table and Unicode characters: the hex encoding of © in UTF-8 is c2 a9, i.e. 194 169. Your file is UTF-8 encoded, so © occupies two bytes, and read() returns a single byte per call; you read the first byte of that sequence, which is 194.
P.S. Read a file character by character/UTF8 is another good example of Java encodings, code points, etc.
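To get the decoded character instead of a raw byte, one option is to wrap the stream in an InputStreamReader; a minimal sketch using the same thing.txt:
try (Reader reader = new InputStreamReader(new FileInputStream("thing.txt"), StandardCharsets.UTF_8)) {
    int c = reader.read(); // decodes a full character, consuming both bytes of ©
    System.out.println(c); // prints 169, the code point of ©
}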
I have some solutions for you.
The first solution: there is a full explanation of this in the book on this site.
The second solution: I have some sample code for you.
public class Example {
    public static void main(String[] args) throws Exception {
        String str = "hey\u6366";
        byte[] charset = str.getBytes("UTF-8");
        String result = new String(charset, "UTF-8");
        System.out.println(result);
    }
}
Output:
hey捦
Let us understand the above program. First we converted the given Unicode string to UTF-8 bytes, for later verification, using the getBytes() method:
String str = "hey\u6366";
byte[] charset = str.getBytes("UTF-8");
Then we converted the charset byte array back to Unicode by creating a new String object as follows:
String result = new String(charset, "UTF-8");
System.out.println(result);
Good luck

Java - Printing unicode from text file doesn't output corresponding UTF-8 character

I have a text file with numerous Unicode escape values, and I am trying to print the corresponding UTF-8 characters in the console, but all it prints is the hex string. If I copy any of the values and paste them into a System.out it works fine, but not when reading them from the text file.
The following is my code for reading the file, which contains lines of values like \u00C0, \u00C1, \u00C2, \u00C3; those literals are printed to the console instead of the characters I want.
private void printFileContents() throws IOException {
    Path encoding = Paths.get("unicode.txt");
    try (Stream<String> stream = Files.lines(encoding)) {
        stream.forEach(v -> System.out.println(v));
    } catch (IOException e) {
        e.printStackTrace();
    }
}
This is the method I used to parse the HTML that had the Unicode values in the first place.
private void parseGermanEncoding() {
    try {
        File encoding = new File("encoding.html");
        Document document = Jsoup.parse(encoding, "UTF-8", "http://example.com/");
        Element table = document.getElementsByClass("codetable").first();
        Path f = Paths.get("unicode.txt");
        try (BufferedWriter wr = new BufferedWriter(new FileWriter(f.toFile()))) {
            for (Element row : table.select("tr")) {
                Elements tds = row.select("td");
                String unicode = tds.get(0).text();
                if (unicode.startsWith("U+")) {
                    unicode = unicode.substring(2);
                }
                wr.write("\\u" + unicode);
                wr.newLine();
            }
            wr.flush();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
You will need to convert the string from its current encoding to a UTF-8 string. You could follow these steps: 1. convert the string to a byte array using myString.getBytes("UTF-8"), and 2. build the UTF-8 decoded string using new String(byteArray, "UTF-8"). The code block needs to be surrounded with try/catch for UnsupportedEncodingException.
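A literal sketch of those two steps (myString stands for a line read from the file); note that encoding to UTF-8 and immediately decoding as UTF-8 is a round trip, so it only changes anything if the bytes were mis-decoded earlier:
try {
    byte[] byteArray = myString.getBytes("UTF-8"); // step 1: encode to UTF-8 bytes
    String utf8String = new String(byteArray, "UTF-8"); // step 2: decode them back
    System.out.println(utf8String);
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}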
Thanks to OTM's comment above I was able to get a working solution. You take the Unicode escape, parse its hex digits with Integer.parseInt(..., 16), and finally cast the result to char to get the actual value. This solution is based on this post provided by OTM: How to convert a string with Unicode encoding to a string of letters.
private void printFileContents() throws IOException {
    Path encoding = Paths.get("unicode.txt");
    try (Stream<String> stream = Files.lines(encoding)) {
        stream.forEach(v -> {
            // Strip the leading \u so only the hex digits remain
            String hex = v.startsWith("\\u") ? v.substring(2) : v;
            // Parse the hex digits into an int value
            int parse = Integer.parseInt(hex, 16);
            // Cast the int to char to get the actual character
            String output = "" + (char) parse;
            System.out.println(output);
        });
    } catch (IOException e) {
        e.printStackTrace();
    }
}

How do I convert List<String[]> values from UTF-8 to String?

I want to convert some Greek text from UTF-8 to String, because it is not recognized by Java as it is. Then I want to populate it into a JTable, so I use a List to help me out. Below is the code snippet:
String[][] rowData;
List<String[]> myEntries;
//...
try {
    this.fileReader = new FileReader("D:\\Book1.csv");
    this.reader = new CSVReader(fileReader, ';');
    myEntries = reader.readAll();
    // here I want to convert every value from UTF-8 to String
    convertFromUTF8(myEntries); //???
    this.rowData = myEntries.toArray(new String[0][]);
} catch (FileNotFoundException ex) {
    Logger.getLogger(VJTable.class.getName()).log(Level.SEVERE, null, ex);
} catch (IOException ex) {
    Logger.getLogger(VJTable.class.getName()).log(Level.SEVERE, null, ex);
}
//...
//...
I created a method:
public String convertFromUTF8(List<String[]> s) {
    String out = null;
    try {
        for (String stringValues : s) {
            out = new String(s.getBytes("ISO-8859-1"), "UTF-8");
        }
    } catch (java.io.UnsupportedEncodingException e) {
        return null;
    }
    return out;
}
but I cannot continue, because there is no getBytes() method for a List.
What should I do? Any idea would be very helpful. Thank you in advance.
The problem is your use of FileReader, which only supports the platform's default character set:
this.fileReader = new FileReader("D:\\Book1.csv");
The javadoc for FileReader is very clear on this:
"The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream."
The appropriate way to get a Reader with a character set specified is as follows:
this.fileStream = new FileInputStream("D:\\Book1.csv");
this.fileReader = new InputStreamReader(fileStream, "utf-8");
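With that Reader in place, the rest of the snippet can stay as it was, minus the conversion step; a sketch assuming the same opencsv CSVReader constructor the question uses:
this.reader = new CSVReader(fileReader, ';');
myEntries = reader.readAll(); // values arrive as correctly decoded Strings
this.rowData = myEntries.toArray(new String[0][]); // no convertFromUTF8 call needed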
To decode UTF-8 bytes to a Java String, you can do something like this (taken from this post):
Charset UTF8_CHARSET = Charset.forName("UTF-8");

String decodeUTF8(byte[] bytes) {
    return new String(bytes, UTF8_CHARSET);
}
Once you've read the data into a String, you no longer have control over the encoding; Java stores Strings internally as UTF-16. If the CSV file you're reading is written with UTF-8 encoding, you should read it as UTF-8 into a byte array and then decode that byte array into a Java String using the method above. Once you have the complete String, you can think about splitting it into the list of Strings based on the delimiter or other parameters (I don't have details about your data).
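For example, a minimal sketch (hypothetical path, reusing the decodeUTF8 helper above):
byte[] bytes = Files.readAllBytes(Paths.get("D:\\Book1.csv"));
String text = decodeUTF8(bytes); // decode all the file's bytes as UTF-8 at once
String[] rows = text.split("\r?\n"); // then split into rows, and each row on ';'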
