Reading UTF-8 characters using Scanner - java

public boolean isValid(String username, String password) {
    boolean valid = false;
    try {
        Scanner files = new Scanner(new BufferedReader(new FileReader("files/students.txt")));
        while (files.hasNext()) {
            System.out.println(files.next());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return valid;
}
How come, when I read a file that was written in UTF-8 (by another Java program), it displays weird symbols followed by the string itself?
I wrote the file using this:
private static void addAccount(String username, String password) {
    File file = new File(file_name);
    try {
        DataOutputStream dos = new DataOutputStream(new FileOutputStream(file, true));
        dos.writeUTF(username + "::" + password + "\n");
        dos.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Here is a simple way to do that: pass the charset name to the Scanner constructor:
File words = new File(path);
Scanner s = new Scanner(words, "UTF-8");

From the FileReader Javadoc:
Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
So perhaps something like new InputStreamReader(new FileInputStream(file), "UTF-8").
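For instance, a minimal sketch of the question's loop with the encoding made explicit (file name taken from the question; untested):
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// the same loop as in the question, but with an explicit UTF-8 reader
try (Scanner files = new Scanner(new InputStreamReader(
        new FileInputStream("files/students.txt"), StandardCharsets.UTF_8))) {
    while (files.hasNext()) {
        System.out.println(files.next());
    }
}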

When using DataOutput.writeUTF/DataInput.readUTF, the first two bytes form an unsigned 16-bit big-endian integer denoting the encoded length of the string in bytes.
First, two bytes are read and used to construct an unsigned 16-bit integer in exactly the manner of the readUnsignedShort method. This integer value is called the UTF length and specifies the number of additional bytes to be read. These bytes are then converted to characters by considering them in groups. The length of each group is computed from the value of the first byte of the group. The byte following a group, if any, is the first byte of the next group.
This length prefix is the likely cause of your issue. You would need to skip the first two bytes and then tell your Scanner to use UTF-8 to read properly (strictly speaking, writeUTF emits modified UTF-8, which differs from standard UTF-8 for NUL and supplementary characters).
That being said, I see no reason to use DataOutput/DataInput here. You can simply use FileReader and FileWriter instead; note that these use the default system encoding.
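For instance, a minimal sketch of a matched pair with an explicit charset and no length prefix (reusing the question's file, username, and password variables; untested):
// writing: a Writer with an explicit charset instead of DataOutputStream
try (Writer out = new OutputStreamWriter(
        new FileOutputStream(file, true), StandardCharsets.UTF_8)) {
    out.write(username + "::" + password + "\n");
}
// reading: a Scanner using the same charset; there is no length prefix to skip
try (Scanner in = new Scanner(file, "UTF-8")) {
    while (in.hasNextLine()) {
        System.out.println(in.nextLine());
    }
}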

Related

The java.io.InputStream class is the superclass of all classes representing an input stream of bytes. How can it read a file that contains characters?

I have a file named mark_method.json containing ABCDE in it, and I am reading this file using the InputStream class.
By definition, the InputStream class reads an input stream of bytes. How does this work? The file doesn't contain bytes, it contains characters, doesn't it?
I am trying to understand how a stream that reads bytes can read characters from the file.
public class MarkDemo {
    public static void main(String args[]) throws Exception {
        InputStream is = null;
        try {
            is = new FileInputStream("C:\\Users\\s\\Documents\\EB\\EB_02_09_2020_with_page_number_and_quote_number\\Old_images\\mark_method.json");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (is != null) {
                is.close();
            }
        }
    }
}
Every piece of data on a computer is stored in bits and bytes. The content of a file is stored as bytes, too.
We have programs which convert these bytes into human-readable form; that is why we see the mark_method.json file as containing characters and not bytes.
A character is a byte (at least in ASCII).
Each byte from 0 to 127 has a character value. For example, 0 is the null character, 0x0A is '\n', 0x0D is '\r', 0x41 is 'A', and so on.
The implementation only knows bytes. It doesn't know that the character U+2709 is ✉; in UTF-16BE it only sees the two bytes 0x27 and 0x09 (in UTF-8 it would be the three bytes 0xE2 0x9C 0x89).
Only the text editor interprets the bytes and shows the matching symbol or letter.
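A small demo of that point (the byte values shown are what Java's standard charsets actually produce):
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] ascii = "A".getBytes(StandardCharsets.US_ASCII);  // [65]
byte[] utf8 = "\u2709".getBytes(StandardCharsets.UTF_8); // [-30, -100, -119], i.e. 0xE2 0x9C 0x89
System.out.println(Arrays.toString(ascii));
System.out.println(Arrays.toString(utf8));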
I think what you are actually asking here is how to convert the bytes you read from file using FileInputStream in to a Java String object you can print and manipulate.
FileInputStream does not have any read methods for directly producing a String object so if that is what you want, you need to further manipulate the input you get.
Option one is to use the Scanner class:
Scanner scanner = new Scanner(is);
String word = scanner.next();
Another option is to read the bytes and use the constructor of the String class that works with a byte array:
byte [] bytes = new byte[10];
is.read(bytes);
String text = new String(bytes);
Note that for simplicity I just assumed you can read 10 valid bytes from your file.
In real code you would need some logic to make sure you are reading the correct number of bytes.
Also, if your file is not stored in your system's default character set, you will need to specify the character set as a parameter to the String constructor.
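For example, a sketch that decodes only the bytes actually read, with the charset stated explicitly (UTF-8 is just an assumption here):
// read() returns how many bytes were actually read; decode only those
// (a real loop would also check for -1 at end of stream)
int n = is.read(bytes);
String text = new String(bytes, 0, n, StandardCharsets.UTF_8);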
Finally, you can use another wrapper class, BufferedReader, which has a readLine method that takes care of all the logic needed to read the bytes representing a line of text from a file and return them as a String.
BufferedReader in = new BufferedReader(new FileReader("foo.in"));
String line = in.readLine();
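And if the file is not in the default charset, the same idea with the encoding made explicit (UTF-8 here is just an assumption):
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream("foo.in"), StandardCharsets.UTF_8));
String line = in.readLine();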

java - modify and return a BufferedInputStream

I have a BufferedInputStream that I got from a FileInputStream object, like:
BufferedInputStream bufferedInputStream = new BufferedInputStream(fileInputStream);
Now I want to remove the chars { and } from the bufferedInputStream (I know the file has those chars in it).
I thought that I could easily do it somehow like a string replace, but I saw that there is no simple way of doing that with a BufferedInputStream.
Any ideas how I can remove those specific chars from the BufferedInputStream and return the new, modified BufferedInputStream?
EDIT:
In the end I want to detect the charset of the file, but the chars { and } are causing me some issues, so I want to remove them before detecting the charset. This is how I am trying to detect the charset:
static String detectCharset(File file) {
    try (FileInputStream fileInputStream = new FileInputStream(file);
         BufferedInputStream bufferedInputStream = new BufferedInputStream(fileInputStream)) {
        CharsetDetector charsetDetector = new CharsetDetector();
        charsetDetector.setText(bufferedInputStream);
        charsetDetector.enableInputFilter(true);
        CharsetMatch cm = charsetDetector.detect();
        return cm.getName();
    } catch (Exception e) {
        return null;
    }
}
NB: A note to respond to the edit you made to your question: you can't really filter } from a bag of bytes unless you know the encoding, so if you want to filter } out in order to guess at the encoding, you're in a chicken-and-egg situation. I do not understand how removing { and } would help a charset detector, though; that sounds like the detector is buggy or you're misinterpreting what it is doing. If you must, treat this as 'removing bytes 123 and 125 from an InputStream' instead of 'removing the chars { and } from an InputStream', and you're closer to a workable job definition. The same principle applies, except you'd write a FilterInputStream instead of a FilterReader, with almost the same methods, except 123 and 125 instead of '{' and '}' (sketched right below).
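A minimal sketch of that byte-level variant (the class name is made up):
class RemoveBracesInputStream extends java.io.FilterInputStream {
    public RemoveBracesInputStream(java.io.InputStream in) {
        super(in);
    }
    @Override
    public int read() throws java.io.IOException {
        while (true) {
            int b = in.read();
            if (b != 123 && b != 125) return b; // 123 = '{', 125 = '}'; -1 (EOF) also falls through
        }
    }
    // NB: as with the FilterReader below, for real use you would also override
    // read(byte[], int, int), since FilterInputStream delegates bulk reads
    // straight to the wrapped stream, bypassing this filter.
}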
-- original answer --
[1] InputStream refers to bytes; Reader is the same concept, except for characters. It does not make sense to say "filter all { from an InputStream"; it would make sense to say "filter all occurrences of byte 123 from an InputStream". If the encoding is UTF-8 or ASCII the two are equivalent, but there is no guarantee of that, and it's not 'nice' code in any fashion. To read files as text, this is how:
import java.nio.file.*;
Path p = Paths.get("/path/to/file");
try (BufferedReader br = Files.newBufferedReader(p)) {
    // operate on the reader here
}
note that, unlike most Java methods, the methods in Files assume UTF-8. You can specify the encoding explicitly (Files.newBufferedReader(p, [ENCODING HERE])) instead. You should never rely on the system default encoding being the right one; you cannot read a file as text unless you know which text encoding it was written in!
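For example (ISO-8859-1 is just a placeholder charset here):
try (BufferedReader br = Files.newBufferedReader(p, StandardCharsets.ISO_8859_1)) {
    // operate on the reader here
}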
If you must use old API:
try (FileInputStream fis = new FileInputStream("/path/to/file");
     InputStreamReader isr = new InputStreamReader(fis, StandardCharsets.UTF_8);
     BufferedReader br = new BufferedReader(isr)) {
    // operate on the reader here
}
note that you MUST specify the charset here or things break in subtle ways.
[2] To filter out certain characters, you can either do it 'inline' (in the code that reads chars from the reader), which is trivial, or you can create a wrapper reader that does it for you. Something like:
class RemoveBracesReader extends java.io.FilterReader {
    public RemoveBracesReader(java.io.Reader in) {
        super(in);
    }
    public int read() throws java.io.IOException {
        while (true) {
            int c = in.read();
            if (c != '{' && c != '}') return c; // -1 (end of stream) also falls through here
        }
    }
    // NB: for real use you would also override read(char[], int, int), which
    // FilterReader delegates straight to the wrapped reader, bypassing this
    // filter for bulk reads.
}
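Usage could then look something like this (a sketch, reusing the Path p from above):
try (Reader r = new RemoveBracesReader(Files.newBufferedReader(p))) {
    int c;
    while ((c = r.read()) != -1) {
        // braces never show up here
    }
}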

How to use AsynchronousFileChannel to read to a StringBuffer efficiently

So you know you can use AsynchronousFileChannel to read an entire file to a String:
AsynchronousFileChannel fileChannel = AsynchronousFileChannel.open(filePath, StandardOpenOption.READ);
long len = fileChannel.size();
ReadAttachment readAttachment = new ReadAttachment();
readAttachment.byteBuffer = ByteBuffer.allocate((int) len);
readAttachment.asynchronousChannel = fileChannel;
CompletionHandler<Integer, ReadAttachment> completionHandler = new CompletionHandler<Integer, ReadAttachment>() {
    @Override
    public void completed(Integer result, ReadAttachment attachment) {
        String content = new String(attachment.byteBuffer.array());
        try {
            attachment.asynchronousChannel.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        completeCallback.accept(content);
    }

    @Override
    public void failed(Throwable exc, ReadAttachment attachment) {
        exc.printStackTrace();
        exceptionError(errorCallback, completeCallback, String.format("error while reading file [%s]: %s", path, exc.getMessage()));
    }
};
fileChannel.read(
        readAttachment.byteBuffer,
        0,
        readAttachment,
        completionHandler);
Suppose that now, I don't want to allocate an entire ByteBuffer, but read line by line. I could use a ByteBuffer of fixed width and keep calling read many times, always copying and appending to a StringBuffer until I get to a new line... My only concern is: because the encoding of the file that I am reading could use multiple bytes per character (UTF-something), it may happen that the read bytes end with an incomplete character. How can I make sure that I'm converting the right bytes into strings and not messing up the encoding?
UPDATE: the answer is in the comments on the selected answer, but it basically points to CharsetDecoder.
If you have a clear ASCII separator, which you have in your case ('\n'), you won't need to worry about incomplete characters, as that character maps to a single byte (and in UTF-8 that byte never occurs inside a multi-byte sequence).
So just search for the '\n' byte in your input and convert everything before it into a String. Loop until no more newlines are found. Then compact the buffer and reuse it for the next read. If you don't find a newline, you'll have to allocate a bigger buffer, copy the content of the old one, and only then call read again.
EDIT: As mentioned in the comment, you can pass the ByteBuffer to a CharsetDecoder on the fly and translate it into a CharBuffer (then append to a StringBuilder or whatever the preferred solution is).
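A rough sketch of that decoder-based approach (untested; the helper name and buffer handling are mine, not from the answer):
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;

// Call once after each completed read; incomplete trailing bytes stay in the buffer.
static void decodeChunk(ByteBuffer bytes, CharsetDecoder decoder,
                        CharBuffer chars, StringBuilder out) {
    bytes.flip();                        // switch the buffer to draining mode
    decoder.decode(bytes, chars, false); // false = more input may follow
    chars.flip();
    out.append(chars);
    chars.clear();
    bytes.compact();                     // keep any incomplete trailing bytes for the next read
}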
Try Scanner:
Scanner sc = new Scanner(FileChannel.open(filePath, StandardOpenOption.READ));
String line = sc.nextLine();
Note that FileChannel is an InterruptibleChannel.

How to read a string stream in Java discarding illegal characters?

I have to parse a stream of bytes coming from a TCP connection that's supposed to only give me printable characters, but in reality that's not always the case. I've seen some binary zeros in there, at the start and end of some fields. I have no control over the source of the data and I need to process the "dirty" lines. If I could just filter out the invalid characters, that'd be OK. The relevant code is as such:
srvr = new ServerSocket(myport);
skt = srvr.accept();
// Tried with no encoding argument too
in = new Scanner(skt.getInputStream(), "ISO-8859-1");
in.useDelimiter("[\r\n]");
for (;;) {
    String myline = in.next();
    if (!myline.equals(""))
        ProcessRecord(myline);
}
I get an exception at every line that has "dirt." What's a good way to filter out invalid characters while still being able to obtain the rest of the string?
You have to wrap your InputStream in a Reader built on a CharsetDecoder with an ignoring error handler:
// let's create a decoder for ISO-8859-1 which will just ignore invalid data
CharsetDecoder decoder = Charset.forName("ISO-8859-1").newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE);
decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
// let's wrap the input stream with the decoder
InputStream is = skt.getInputStream();
in = new Scanner(new InputStreamReader(is, decoder));
You can also use CodingErrorAction.REPLACE together with CharsetDecoder.replaceWith(...) to substitute a character of your choice in case of a coding error.
The purest solution is to filter the InputStream itself (binary, byte-level I/O).
in = new Scanner(new DirtFilterInputStream(skt.getInputStream()), "Windows-1252");
public class DirtFilterInputStream extends InputStream {
    private InputStream in;

    public DirtFilterInputStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        int ch = in.read();
        if (ch != -1) {
            if (ch == 0) {
                ch = read(); // skip NUL bytes by reading the next byte instead
            }
        }
        return ch;
    }
}
(For a complete implementation you would override the other methods as well and delegate to the original stream; only read() is strictly required, but the inherited bulk read funnels every byte through it.)
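Inside DirtFilterInputStream, a sketch of what an efficient bulk-read override could look like (my addition, not from the answer; the inherited default is already correct, just slow):
@Override
public int read(byte[] b, int off, int len) throws IOException {
    if (len == 0) return 0;
    int w;
    do {
        int n = in.read(b, off, len);
        if (n == -1) return -1;
        w = off;
        for (int r = off; r < off + n; r++) {
            if (b[r] != 0) b[w++] = b[r]; // compact the zero bytes out in place
        }
    } while (w == off); // everything read was zeros; read again
    return w - off;
}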
Windows-1252 is Windows Latin-1, an extension of Latin-1 (ISO-8859-1) that assigns printable characters to the 0x80 - 0x9F range.
I was completely off base. I get the "dirty" strings no problem (and no, I have no option to clean up the data source; it comes from a closed system and I just have to grin and deal with it), but trying to store them in PostgreSQL is what gets me the exception. That means I have total freedom to clean them up before processing.

After changing file encoding Windows get it wrong

I wanted to change a file's encoding from one to another (it doesn't matter which).
But when I open the resulting file (w.txt), its content is messed up; Windows does not interpret it correctly.
What target encoding should I put in args[1] so that it will be interpreted correctly by Windows Notepad?
import java.io.*;
import java.nio.charset.Charset;

public class Kodowanie {
    public static void main(String[] args) throws IOException {
        args = new String[2];
        args[0] = "plik.txt";
        args[1] = "ISO8859_2";
        String linia, s = "";
        File f = new File(args[0]), f1 = new File("w.txt");
        FileInputStream fis = new FileInputStream(f);
        InputStreamReader isr = new InputStreamReader(fis,
                Charset.forName("UTF-8"));
        BufferedReader in = new BufferedReader(isr);
        FileOutputStream fos = new FileOutputStream(f1);
        OutputStreamWriter osw = new OutputStreamWriter(fos,
                Charset.forName(args[1]));
        BufferedWriter out = new BufferedWriter(osw);
        while ((linia = in.readLine()) != null) {
            out.write(linia);
            out.newLine();
        }
        out.close();
        in.close();
    }
}
input:
Ala
ma
Kota
output:
?Ala
ma
Kota
Why is there a '?'?
The default encoding on Western-locale Windows systems is Cp1252.
US-ASCII is a subset of Unicode (a pretty small one, by the way). You are reading a file in UTF-8 and then writing it back in US-ASCII, so the encoder has to make a decision when a given character cannot be expressed in the reduced 7-bit US-ASCII subset. Classically, such a character is replaced by a default character, like '?'.
Take into account that characters in UTF-8 are often multi-byte, whereas US-ASCII is only 7 bits wide. This means that all Unicode code points above 127 cannot be expressed in US-ASCII. That could explain the question marks you see once the file has been converted.
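A quick way to see that replacement behaviour (the Polish word here is just an example):
// getBytes() replaces unmappable characters with '?' (0x3F)
byte[] b = "żółw".getBytes(StandardCharsets.US_ASCII);
System.out.println(new String(b, StandardCharsets.US_ASCII)); // prints ???w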
I had answered a similar question, Reading Strange Unicode Characters in Java; perhaps it helps.
I also recommend reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
