In Java, is there a way to detect whether a file is ANSI or UTF-8? The problem I am having is that if someone creates a CSV file in Excel, it's UTF-8. If they create it in Notepad, it's ANSI.
I am wondering if I can detect the type of file and then handle it accordingly.
Thanks.
You could try something like this. It relies on Excel including a Byte Order Mark (BOM), which a quick search suggests it does although I can't verify it, and on the fact that Java treats the BOM as a particular "character", \uFEFF.
FileInputStream fis = new FileInputStream(file);
BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
String line = br.readLine();
if (line != null && line.startsWith("\uFEFF")) {
    // it's UTF-8, throw away the BOM character and continue
    line = line.substring(1);
} else {
    // it's not UTF-8, reopen
    br.close(); // also closes fis
    fis = new FileInputStream(file); // reopen from the start
    br = new BufferedReader(new InputStreamReader(fis, "Cp1252"));
    line = br.readLine();
}
// now line contains the first line, and br.readLine() will get the next
There is more information on the UTF-8 Byte Order Mark and on detecting encodings at http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
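If you would rather not depend on the reader translating the BOM, you can also inspect the first three raw bytes yourself; for a UTF-8 BOM they are 0xEF 0xBB 0xBF. A minimal sketch (the class and method names are just for illustration):

import java.io.*;

public class BomSniffer {
    // Returns true if the file starts with the UTF-8 BOM bytes EF BB BF.
    static boolean hasUtf8Bom(File file) throws IOException {
        try (FileInputStream fis = new FileInputStream(file)) {
            byte[] bom = new byte[3];
            int read = fis.read(bom);
            return read == 3
                    && bom[0] == (byte) 0xEF
                    && bom[1] == (byte) 0xBB
                    && bom[2] == (byte) 0xBF;
        }
    }

    public static void main(String[] args) throws IOException {
        File file = new File(args[0]);
        // Fall back to Cp1252 ("ANSI" on Western Windows) when no BOM is found.
        String encoding = hasUtf8Bom(file) ? "UTF-8" : "Cp1252";
        System.out.println(file + " looks like " + encoding);
    }
}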
Related
A user uploads a file with the character encoding Cp1252.
Since my MySQL table columns use the collation utf8_bin, I try to convert the file to UTF-8 before loading the data into the table with the LOAD DATA INFILE command.
Java source code:
OutputStream output = new FileOutputStream(destpath);
InputStream input = new FileInputStream(filepath);
BufferedReader reader = new BufferedReader(new InputStreamReader(input, "windows-1252"));
BufferedWriter writ = new BufferedWriter(new OutputStreamWriter(output, "UTF8"));
String in;
while ((in = reader.readLine()) != null) {
    writ.write(in);
    writ.newLine();
}
writ.flush();
writ.close();
reader.close();
It seems that the characters are not converted correctly: the converted file has � and box symbols in multiple places. How do I convert the file to UTF-8 correctly? Thanks.
One way of verifying the conversion process is to configure the charset decoder and encoder to bail out on errors instead of silently replacing the erroneous characters with special characters:
CharsetDecoder inDec = Charset.forName("windows-1252").newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
CharsetEncoder outEnc = StandardCharsets.UTF_8.newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
try (FileInputStream is = new FileInputStream(filepath);
     BufferedReader reader = new BufferedReader(new InputStreamReader(is, inDec));
     FileOutputStream fw = new FileOutputStream(destpath);
     BufferedWriter out = new BufferedWriter(new OutputStreamWriter(fw, outEnc))) {
    for (String in; (in = reader.readLine()) != null; ) {
        out.write(in);
        out.newLine();
    }
}
Note that the output encoder is configured for symmetry here; since UTF-8 can encode every Unicode character, it will never actually report an error. Still, the symmetric configuration helps once you want to reuse the same code for other conversions.
Further, note that this won't help if the input file is in a different encoding whose bytes nevertheless decode to valid characters. One thing to consider is whether "windows-1252" was actually meant to be the system's default encoding (and whether the two are really the same). If in doubt, you may use Charset.defaultCharset() instead of Charset.forName("windows-1252") when the intended conversion is really default → UTF-8.
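In that case the only change is the decoder's charset (a sketch; whether the default charset is the right choice depends on how the file was actually produced):

CharsetDecoder inDec = Charset.defaultCharset().newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);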
I have a simple Java program that reads a line from one file and writes it to another file.
My source file has words like it's, but the destination file ends up with words like it�s.
I am using BufferedReader br = new BufferedReader(new FileReader(inputFile)); to read the source file and PrintWriter writer = new PrintWriter(resultFile, "UTF-8"); to write the destination file.
How do I get the actual character into my destination file too?
You need to specify a character set when creating the reader, otherwise the platform default encoding is used. FileReader does not accept one, so construct an InputStreamReader on a FileInputStream instead:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile), "UTF-8"));
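On Java 7 and later, java.nio.file can do the same thing a little more concisely (a sketch, assuming inputFile is a java.io.File):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

BufferedReader br = Files.newBufferedReader(inputFile.toPath(), StandardCharsets.UTF_8);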
I know this question is a bit old, but I thought I'd put down my answer anyway.
You can use java.nio.file.Files to read the file and java.io.RandomAccessFile to write to your destination file. For example:
public void copyContentsOfFile(File source, File destination) {
    Path p = Paths.get(source.toURI());
    try {
        byte[] bytes = Files.readAllBytes(p);
        try (RandomAccessFile raf = new RandomAccessFile(destination, "rw")) {
            // write the raw bytes; writeBytes(String) would drop the high byte of each char
            raf.write(bytes);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
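If the goal is really just a byte-for-byte copy, the JDK already provides one, which sidesteps the charset question entirely (a sketch using java.nio.file, available since Java 7):

import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

Files.copy(source.toPath(), destination.toPath(), StandardCopyOption.REPLACE_EXISTING);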
I have a Java program which calls a Windows bat file that does some processing and generates an output file.
Process p = Runtime.getRuntime().exec("cmd /c " + filename);
Now I read the file with the following code (filexists() is a function which checks whether the file exists or not). The output file contains only a single line.
if (filexists("output.txt") == true) {
    String FileLine;
    FileInputStream fstream = new FileInputStream("output.txt");
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    FileLine = br.readLine();
    br.close(); // also closes fstream
}
The variable FileLine contains 3 junk characters at the start. I also checked a few other files in the program, and no other file has this issue, only this one created via the Runtime call.
ï»¿9087.
As you can see, three junk characters appear in the output. When the file is opened with Notepad++, I cannot see those junk characters.
Please suggest a fix.
This is happening because you have not specified the file encoding when reading the file. FileInputStream itself does not take an encoding; wrap it in an InputStreamReader instead. Assuming your file is UTF-8 encoded, you need to do something like this:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("output.txt"), "UTF-8"));
Change the encoding as per the encoding of your file.
That looks like the byte order mark for UTF-8 encoding. See https://en.wikipedia.org/wiki/Byte_order_mark
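Note that reading the file as UTF-8 does not remove the BOM; it simply arrives as the single character \uFEFF at the start of the first line, so you still have to strip it yourself. A minimal sketch:

BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream("output.txt"), "UTF-8"));
String fileLine = br.readLine();
if (fileLine != null && fileLine.startsWith("\uFEFF")) {
    fileLine = fileLine.substring(1); // drop the BOM character
}
br.close();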
Maybe it's an issue with the file encoding, though I am not sure.
Can you please try the following piece of code and see if it works for you:
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream("output.txt"), "UTF8"));
String str;
while ((str = in.readLine()) != null) {
    System.out.println(str);
}
I'm having trouble reading a UTF-8 encoded text file in Hebrew.
I read all Hebrew characters successfully, except for two letters: 'מ' and 'א'.
Here is how I read it:
FileInputStream fstream = new FileInputStream(SCHOOLS_LIST_PATH);
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
// Read File Line By Line
while ((strLine = br.readLine()) != null) {
    if (strLine.contains("zevel")) {
        continue;
    }
    schools.add(getSchoolFromLine(strLine));
}
Any idea?
Thanks,
Tomer
You're using InputStreamReader without specifying the encoding, so it's using the default for your platform - which may well not be UTF-8.
Try:
new InputStreamReader(in, "UTF-8")
Note that it's not obvious why you're using DataInputStream here... just create an InputStreamReader around the FileInputStream.
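Putting that together, the loop might look like this (a sketch, assuming the file really is UTF-8 and reusing the identifiers from the question):

BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(SCHOOLS_LIST_PATH), "UTF-8"));
String strLine;
while ((strLine = br.readLine()) != null) {
    if (strLine.contains("zevel")) {
        continue;
    }
    schools.add(getSchoolFromLine(strLine));
}
br.close();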
In Java, I am trying to parse an HTML file that contains complex text such as Greek symbols.
I encounter a known problem when the text contains a left-facing quotation mark. Text such as
mutations to particular “hotspot” regions
becomes
mutations to particular “hotspot�? regions
I have isolated the problem by writing a simple text copy method:
public static int CopyFile() {
    try {
        StringBuffer sb = null;
        String NullSpace = System.getProperty("line.separator");
        Writer output = new BufferedWriter(new FileWriter(outputFile));
        String line;
        BufferedReader input = new BufferedReader(new FileReader(myFile));
        while ((line = input.readLine()) != null) {
            sb = new StringBuffer();
            // Parsing would happen
            sb.append(line);
            output.write(sb.toString() + NullSpace);
        }
        return 0;
    } catch (Exception e) {
        return 1;
    }
}
Can anybody offer some advice on how to correct this problem?
★My solution
InputStream in = new FileInputStream(myFile);
Reader reader = new InputStreamReader(in, "utf-8");
Reader buffer = new BufferedReader(reader);
Writer output = new BufferedWriter(new FileWriter(outputFile));
int r;
while ((r = buffer.read()) != -1) { // read through the buffered reader
    if (r < 126) {
        output.write(r);
    } else {
        // escape everything non-ASCII as a numeric character reference
        output.write("&#" + Integer.toString(r) + ";");
    }
}
output.flush();
output.close();
buffer.close();
The file read is not in the same encoding (probably UTF-8) as the file written (probably ISO-8859-1).
Try the following to generate a file with UTF-8 encoding:
BufferedWriter output = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile),"UTF8"));
Unfortunately, determining the encoding of a file is very difficult. See Java : How to determine the correct charset encoding of a stream
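If you do know (or decide on) both encodings, the copy loop only needs explicitly configured streams on both sides; a sketch assuming the input is UTF-8 and UTF-8 output is wanted:

BufferedReader input = new BufferedReader(
        new InputStreamReader(new FileInputStream(myFile), "UTF-8"));
Writer output = new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"));
String line;
while ((line = input.readLine()) != null) {
    output.write(line + System.getProperty("line.separator"));
}
input.close();
output.close();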
In addition to what Thierry-Dimitri Roy wrote, if you know the encoding you have to create your reader with a bit of extra work, since FileReader cannot take one. From the docs:
Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
The Javadoc for FileReader says:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
In your case the default character encoding is probably not appropriate. Find what encoding the input file uses, and specify it. For example:
FileInputStream fis = new FileInputStream(myFile);
InputStreamReader isr = new InputStreamReader(fis, "charset name goes here");
BufferedReader input = new BufferedReader(isr);
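On Java 7 and later you can also pass a java.nio.charset.Charset object instead of a name string, which avoids the checked UnsupportedEncodingException; for example, if the file turns out to be UTF-8:

InputStreamReader isr = new InputStreamReader(fis, StandardCharsets.UTF_8);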