Reading from a file containing unmappable characters - java

I am attempting to use File and Scanner to read through a .txt file and copy the useful information in it into a separate file. Some of these files contain Chinese characters, and it's causing my Scanner to throw the following error: "java.nio.charset.UnmappableCharacterException:". The Chinese characters are of no importance, so how do I make the Scanner ignore the Chinese characters and keep searching the rest of the file for useful information?
Here is the code:
try {
File source = new File(this.parentDirectory + File.separator + this.fileName.getText());
Scanner reader = new Scanner(source);
StringBuilder str = new StringBuilder();
while (reader.hasNextLine()) {
str.append(reader.nextLine());
str.append("\n");
}
if (reader.ioException() != null) {
throw reader.ioException();
}
reader.close();
this.input.setText(str.toString());
} catch (FileNotFoundException e1) {
JOptionPane.showMessageDialog(this, "File not found!");
return;
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}

A scanner implicitly converts between an external sequence of bytes, and the 16-bit Unicode characters used by all Java Strings.
You need to know the actual encoding used for the external data (i.e., the file content). Then you declare your Scanner as
Scanner reader = new Scanner(file, charset);
Having done that correctly, then there should be no 'unmappable' characters.
If you don't specify the charset explicitly, then the platform default is used, which is probably UTF-8.
Alternatively, it seems that you're not really using the Scanner to any significant degree; you're just using it to collect lines. You could drop down a level and use a FileInputStream to read the file as a sequence of bytes, and use whatever heuristics you think appropriate to determine the 'useful' parts of the file.
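If the unreadable characters really can be discarded, one possible approach is to configure a CharsetDecoder to drop them instead of reporting an error. The following is a minimal sketch, not a definitive fix: the class name LenientFileReader and the method readLenient are made up for illustration, and decoding as US-ASCII in main is only an assumption that the useful data is plain ASCII.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reads a text file and silently drops any bytes
// that cannot be decoded in the given charset.
class LenientFileReader {

    static String readLenient(Path path, Charset charset) throws IOException {
        // IGNORE drops the offending bytes; REPLACE would substitute U+FFFD instead.
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);

        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(path), decoder))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the useful content is ASCII, so everything else is dropped.
        System.out.print(readLenient(Paths.get(args[0]), StandardCharsets.US_ASCII));
    }
}
```

Dropping bytes this way loses information, so it only makes sense when the discarded characters are truly irrelevant. If the real encoding of the file is known, passing that charset is the safer choice.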

Related

Can't read integer file

I'm trying to read data from a file that contains integers, but the Scanner doesn't read anything from that file.
I've tried to read the file from the Scanner :
// switch() blablabla
case POPULATION:
try {
while (sc.hasNextInt()) {
this.listePops.add(sc.nextInt());
}
} catch (Exception e) {
System.err.println("~ERREUR~ : " + e.getMessage());
}
break;
And if I try to print each sc.nextInt() to the console, it just prints a blank line and then stops.
Now when I read the same file as a String:
?652432
531345
335975
164308
141220
1094283
328278
270582
// (Rest of the data)
So, I guess it can't read the file as a list of integers since there's a question mark at the beginning, but the problem is that this question mark doesn't appear anywhere in my file, so I can't remove it. What am I supposed to do?
If the first character in the file shows up as a question mark (?) and you don't know where it came from, it is usually the UTF-8 Byte Order Mark (BOM). This means the file was saved as UTF-8. The Microsoft Notepad application will add a BOM to the saved text file if that file was saved in UTF-8 instead of ANSI. There are also other BOMs for UTF-16, UTF-32, etc.
Reading the text file as String doesn't look like a bad idea now. Changing the save format of the file can work too, but that BOM may serve an intended purpose for another application, so that may not be a viable option. Let's read the file as String lines (read the comments in the code):
// Variable to hold the value of the UTF-8 BOM:
final String UTF8_BOM = "\uFEFF";
// List to hold the Integer numbers in file.
List<Integer> listePops = new ArrayList<>();
// 'Try With Resources' used to auto-close the file and free resources.
// The charset is given explicitly so the BOM decodes to U+FEFF on every platform.
try (Scanner reader = new Scanner(new File("data.txt"), "UTF-8")) {
String line;
int lineCount = 0;
while (reader.hasNextLine()) {
line = reader.nextLine();
line = line.trim();
// Skip blank lines (if any):
if (line.isEmpty()) {
continue;
}
lineCount++;
/* Is this the first line and is there a BOM at the
start of this line? If so, then remove it. */
if (lineCount == 1 && line.startsWith(UTF8_BOM)) {
line = line.substring(1);
}
// Validate Line Data:
// Is the line a String representation of an Integer Number?
if (line.matches("\\d+")) {
// Yes... then convert that line to Integer and add it to the List.
listePops.add(Integer.parseInt(line));
}
// Move onto next file line...
}
}
catch (FileNotFoundException ex) {
// Do what you want with this exception (but don't ignore it):
System.err.println(ex.getMessage());
}
// Display the gathered List contents:
for (Integer ints : listePops) {
System.out.println(ints);
}
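As an alternative sketch that keeps the hasNextInt()/nextInt() loop from the question, the BOM can be skipped once before any integers are read. The class name BomSkippingIntReader and the method readInts are hypothetical, and the sketch assumes the file is UTF-8 with an optional BOM at the very start.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper: reads whitespace-separated integers from a UTF-8 file,
// tolerating an optional Byte Order Mark at the start.
class BomSkippingIntReader {

    static List<Integer> readInts(File file) throws FileNotFoundException {
        List<Integer> numbers = new ArrayList<>();
        try (Scanner sc = new Scanner(file, "UTF-8")) {
            // The pattern can match nothing, so skip() does not throw
            // when the file has no BOM.
            sc.skip("\uFEFF?");
            while (sc.hasNextInt()) {
                numbers.add(sc.nextInt());
            }
        }
        return numbers;
    }

    public static void main(String[] args) throws FileNotFoundException {
        for (Integer n : readInts(new File(args[0]))) {
            System.out.println(n);
        }
    }
}
```

Unlike the line-based version, this loop stops at the first token that is not an integer, so it suits files that contain nothing but numbers.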

Problem in reading text from the file using FileInputStream in Java

I have a file input.txt on my system and I want to read data from that file using FileInputStream in Java. There is no error in the code, but it still does not work: it does not display the output. Here is the code; could anyone kindly help me out?
package com.company;
import java.io.FileInputStream;
import java.io.InputStream;
public class Main {
public static void main(String[] args) {
// write your code here
byte[] array = new byte[100];
try {
InputStream input = new FileInputStream("input.txt");
System.out.println("Available bytes in the file: " + input.available());
// Read byte from the input stream
input.read(array);
System.out.println("Data read from the file: ");
// Convert byte array into string
String data = new String(array);
System.out.println(data);
// Close the input stream
input.close();
} catch (Exception e) {
e.getStackTrace();
}
}
}
Use the utility class Files.
Path path = Paths.get("input.txt");
try {
String data = Files.readString(path, Charset.defaultCharset());
System.out.println(data);
} catch (Exception e) {
e.printStackTrace();
}
For binary data, non-text, one should use Files.readAllBytes.
available() is not the file length; it is only an estimate of the number of bytes that can be read without blocking. Reading more than that may block while the data is physically read from the disk.
String.getBytes(Charset) and new String(byte[], Charset) let you state explicitly which charset the bytes are in. A String always holds its text as Unicode, so it can combine all the scripts of the world.
Java was designed with text as Unicode, in response to the encoding situation at the time in C and C++. In a String you can therefore mix Arabic, Greek, Chinese and math symbols. The flip side is that binary data (byte[], InputStream, OutputStream) must be given the encoding (Charset) its bytes are in, and a conversion to Unicode then happens for text (String, char, Reader, Writer).
FileInputStream.read(byte[]) returns the number of bytes it actually read, and that result must be used. A single call fills at most one buffer, so it must be repeated until it returns -1.
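The last two points can be sketched as follows, for cases where FileInputStream has to be kept. The class name StreamToString and the method readAll are illustrative only, and UTF-8 in main is an assumption about the file's encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a whole file through FileInputStream,
// honouring the count returned by each read() call.
class StreamToString {

    static String readAll(String fileName, Charset charset) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] buffer = new byte[100];
        try (InputStream input = new FileInputStream(fileName)) {
            int count;
            // read() may fill only part of the buffer; -1 signals end of file.
            while ((count = input.read(buffer)) != -1) {
                collected.write(buffer, 0, count);
            }
        }
        // Decode once at the end, so a multi-byte character is never
        // split across two buffers.
        return new String(collected.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the file is UTF-8 encoded.
        System.out.println(readAll(args[0], StandardCharsets.UTF_8));
    }
}
```

Collecting the bytes first and decoding afterwards avoids two problems in the original code: trailing zero bytes from the unused part of the array, and text longer than the buffer being cut off.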

Why does my java.util.Scanner not read my File?

My Scanner doesn't read my existing File. The file is read fine by a BufferedReader, but BufferedReaders don't support the UTF-8 encoding which my file needs.
I've already used a BufferedReader (even with UTF-8, which didn't give me letters like "ä" (a German letter) but gave me awkward question mark symbols instead). And I've of course already used a Scanner.
public ArrayList<String> getThemefile2() {
Scanner s;
try {
s = new Scanner(themefile);
} catch (FileNotFoundException e) {
e.printStackTrace();
return new ArrayList<>();
}
ArrayList<String> list = new ArrayList<>();
while (s.hasNextLine()) {
list.add(s.nextLine());
}
s.close();
return list;
}
It just returns an empty ArrayList, but doesn't trigger the FileNotFoundException. themefile is an existing File.
If you're using Java 8+, I would recommend using the Files#lines method, which decodes the file as UTF-8 by default:
try (Stream<String> stream = Files.lines(themefile.toPath())) {
List<String> list = stream.collect(Collectors.toList()); // store the result in a variable
} catch (IOException e) {
e.printStackTrace();
}
Documentation:
Files#lines
Collectors
File#toPath
You need to specify the encoding for the file, if it's anything other than your system default. In your case, this will be where you create the Scanner.
s = new Scanner(themefile, "UTF-8");
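Without the file to look at, we're all just guessing at the problem.
Here's one guess: hasNextLine() returns false on the very first call, so the while loop exits immediately and you get an empty ArrayList. That happens if the file is empty. It also happens if the Scanner hits an I/O or decoding error while reading, for example bytes that are not valid in the charset it is using: Scanner does not throw in that case but simply stops, and the error is only available from s.ioException(). A missing newline at the end of the file is not enough to cause this, because an unterminated last line still counts as a line.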
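To tell an empty file apart from a read that failed silently, the Scanner can be asked for its last I/O error after the loop. This is only a sketch: the class name CheckedScannerRead and the method readLines are hypothetical, and the charset name is whatever the file is believed to use.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper: reads all lines with a Scanner and surfaces
// any I/O or decoding error that the Scanner kept to itself.
class CheckedScannerRead {

    static List<String> readLines(File file, String charsetName) throws IOException {
        List<String> lines = new ArrayList<>();
        try (Scanner s = new Scanner(file, charsetName)) {
            while (s.hasNextLine()) {
                lines.add(s.nextLine());
            }
            // Scanner records the last IOException instead of throwing it.
            IOException hidden = s.ioException();
            if (hidden != null) {
                throw hidden;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the file is UTF-8 encoded.
        for (String line : readLines(new File(args[0]), "UTF-8")) {
            System.out.println(line);
        }
    }
}
```

If this throws a MalformedInputException, the charset passed to the Scanner most likely does not match the encoding the file was saved in.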
I had the same problem with API level 28.
This worked for me:
s = new Scanner(new FileReader(themefile));
You must also import:
import java.io.FileReader;

Reading a text file to a string ALWAYS results in empty string?

For the record, I know that reading the text file to a string does not ALWAYS result in an empty string, but in my situation, I can't get it to do anything else.
I'm currently trying to write a program that reads text from a .txt file, manipulates it based on certain arguments, and then saves the text back into the document. No matter how many different ways I've tried, I can't seem to actually get text from .txt file. The string just returns as an empty string.
For example, I pass in the arguments "-c 3 file1.txt" and parse the arguments for the file (the file is always passed in last). I get the file with:
File inputFile = new File(args[args.length - 1]);
When I debug the code, it seems to recognize the file as file1.txt, and if I pass in the name of a different file, which doesn't exist, an error is thrown. So it is correctly recognizing this file. From here I have attempted every type of file text parsing I can find online, from old Java version techniques up to Java 8 techniques. None have worked. A few I've tried are:
String fileText = "";
try {
Scanner input = new Scanner(inputFile);
while (input.hasNextLine()) {
fileText = input.nextLine();
System.out.println(fileText);
}
input.close();
} catch (FileNotFoundException e) {
usage();
}
or
String fileText = null;
try {
fileText = new String(Files.readAllBytes(Paths.get(filename)), StandardCharsets.UTF_8);
} catch (IOException e) {
e.printStackTrace();
}
I've tried others too. Buffered readers, scanners, etc. I've tried recompiling the project, I've tried 3rd party libraries. Still just getting an empty string. I'm thinking it must be some sort of configuration issue, but I am stumped.
For anyone wondering, the file seems to be in the correct place, when I reference the wrong location an exception is thrown. And the file DOES in fact have text in it. I've quadruple checked.
Even though your first code snippet might read the file, it does in fact not store the contents of the file in your fileText variable but only the file's last line.
With
fileText = input.nextLine();
you set fileText to the contents of the current line thereby overwriting the previous value of fileText. You need to store all the lines from your file. E.g. try
static String read( String path ) throws IOException {
StringBuilder sb = new StringBuilder();
try (BufferedReader br = new BufferedReader(new FileReader(path))) {
for (String line = br.readLine(); line != null; line = br.readLine()) {
sb.append(line).append('\n');
}
}
return sb.toString();
}
My suggestion would be to create a method for reading the file into a string which throws an exception with a descriptive message whenever an unexpected state is found. Here is a possible implementation of this idea:
public static String readFile(Path path) {
String fileText;
try {
if(Files.size(path) == 0) {
throw new RuntimeException("File has zero bytes");
}
fileText = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
if(fileText.trim().isEmpty()) {
throw new RuntimeException("File contains only whitespace");
}
return fileText;
} catch (IOException e) {
throw new RuntimeException(e);
}
}
This method checks for three anomalies:
File not found (reported as an IOException, which is rethrown as a RuntimeException)
File is empty
File contains only whitespace

Read special characters in java

I have a question, I'm trying to read from a file, a set of key and value pairs ( Like a Dictionary). For this I'm using the following code:
InputStream is = this.getClass().getResourceAsStream(PROPERTIES_BUNDLE);
properties=new Hashtable();
InputStreamReader isr=new InputStreamReader(is);
LineReader lineReader=new LineReader(isr);
try {
while (lineReader.hasLine()) {
String line=lineReader.readLine();
if(line.length()>1 && line.substring(0,1).equals("#")) continue;
if(line.indexOf("=")!=-1){
String key=line.substring(0,line.indexOf("="));
String value=line.substring(line.indexOf("=")+1,line.length());
properties.put(key, value);
}
}
} catch (IOException e) {
e.printStackTrace();
}
And the readLine function.
public String readLine() throws IOException{
int tmp;
StringBuffer out=new StringBuffer();
//Read in data
while(true){
//Check the bucket first. If empty read from the input stream
if(bucket!=-1){
tmp=bucket;
bucket=-1;
}else{
tmp=in.read();
if(tmp==-1)break;
}
//If new line, then discard it. If we get a \r, we need to look ahead so can use bucket
if(tmp=='\r'){
int nextChar=in.read();
if(nextChar!='\n')bucket=nextChar;//Swallows the \n of \r\n, but keeps any other character for the next read
break;
}else if(tmp=='\n'){
break;
}else{
//Otherwise just append the character
out.append((char) tmp);
}
}
return out.toString();
}
Everything is fine, however I want it to be able to parse special characters. For example: ó that would be codified into \u00F3, however in this case it's not replacing it with the correct character... What would be the way to do it?
EDIT: Forgot to say that since I'm using JavaME the Properties class or anything similar does not exist, that's why it may seem that I'm reinventing the wheel...
If it's encoded with UTF-16, can you not just use the following?
InputStreamReader isr = new InputStreamReader(is, "UTF16");
This would recognize your special characters right from the get-go and you wouldn't need to do any replacements.
You need to ensure that the character encoding set in your InputStreamReader matches that of the file. If it doesn't match, some characters can come out incorrect.
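The effect of a mismatched encoding can be seen by decoding the same bytes twice. This is an illustrative sketch with made-up names (CharsetMismatchDemo, decode), written for Java SE; on Java ME it would need adapting, for example StringBuffer instead of StringBuilder and no try-with-resources.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical demo: decodes a byte array through an InputStreamReader
// using the charset named by the caller.
class CharsetMismatchDemo {

    static String decode(byte[] bytes, String charsetName) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charsetName)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // The letter ó (U+00F3) encoded as UTF-8 takes two bytes.
        byte[] utf8Bytes = {(byte) 0xC3, (byte) 0xB3};
        System.out.println(decode(utf8Bytes, "UTF-8"));      // one character
        System.out.println(decode(utf8Bytes, "ISO-8859-1")); // two characters
    }
}
```

Decoded as UTF-8, the two bytes give the single character ó. Decoded as ISO-8859-1, each byte becomes its own character (Ã followed by ³), which is the kind of garbling a wrong charset produces.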
