Read and write a file in Java whilst keeping the special characters - java

After reading and writing the file, the bullet points are replaced with the unreadable replacement character "�". Here is the code:
String str = FileUtils.readFileToString(new File(sourcePath), "UTF-8");
nextTextFile.append(redactStrings(str, redactedStrings));
FileUtils.writeStringToFile(new File(targetPath), nextTextFile.toString());
Link to sample file
generated file with funny characters

I checked this on Windows: if the source file is encoded in UTF-8, the following code, making use of java.nio, produces the desired output on the console and writes a file that is encoded in UTF-8 as well:
public static void main(String[] args) {
    Path inPath = Paths.get(sourcePath);
    Path outPath = Paths.get(targetPath);
    try {
        List<String> lines = Files.readAllLines(inPath, StandardCharsets.UTF_8);
        lines.forEach(line -> {
            System.out.println(line);
        });
        Files.write(outPath, lines, StandardCharsets.UTF_8);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Please note that the source file has to be encoded in UTF-8; otherwise this may throw an IOException saying something like "Input length = 1". Experiment with the other StandardCharsets constants if UTF-8 does not meet your requirements, or make sure the encoding of the source file really is UTF-8.
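Alternatively, the original snippet can keep using Commons IO. The likely culprit there is the writeStringToFile call without a charset, which falls back to the platform default encoding. A minimal sketch of the same read/redact/write flow with the encoding made explicit on both sides (redactStrings and the variables are the ones from the question):
String str = FileUtils.readFileToString(new File(sourcePath), "UTF-8");
nextTextFile.append(redactStrings(str, redactedStrings));
// specify the charset on write as well, instead of relying on the platform default
FileUtils.writeStringToFile(new File(targetPath), nextTextFile.toString(), "UTF-8");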

Related

UTF-8 cannot properly pass Japanese character (Hiragana and Katakana) strings as an argument

For example, the file I need is found at this file path, which will be passed as an argument:
"C:\Users\user.name\docs\jap\あああいいいうううえええおおおダウンロード\filename.txt"
I used this code to decode the characters:
String new_path = new String(args[0].getBytes("Shift_JIS"), StandardCharsets.UTF_8);
System.out.println(new_path);
However, the output is:
C:\Users\user.name\docs\jap\あああい�?�?�?�?�?えええおおお�?ウンロード\filename.txt
Some of the characters have not been decoded properly. I have already changed the text encoding and the encoding of the console to UTF-8, but it still didn't work.
However, if I just print the string literal directly, it displays fine:
System.out.println("C:\\Users\\user.name\\docs\\jap\\あああいいいうううえええおおおダウンロード\\filename.txt");
which displays:
C:\Users\user.name\docs\jap\あああいいいうううえええおおおダウンロード\filename.txt
Please tell me how to read the other characters correctly; it would really be a great help. Thanks!
public static void main(String[] args) throws UnsupportedEncodingException {
    // your original code
    String newPath = new String(args[0].getBytes("Shift_JIS"), StandardCharsets.UTF_8);
    System.out.println(newPath);
    // instead, just use the argument as it is
    newPath = args[0];
    System.out.println(newPath);
}
With that, you should see "あああいいいうううえええおおおダウンロード". A Java String is already a sequence of Unicode characters, so args[0] does not need re-decoding. Only when you start from a byte array do you need to construct the String with the charset that actually matches those bytes; once the String is built correctly, you can encode it to any charset.
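A small sketch (my own illustration, not code from the original answer) of why the Shift_JIS/UTF-8 round trip corrupts the text, while decoding bytes with the charset they were actually encoded in works:
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    public static void main(String[] args) throws Exception {
        String path = "あああいいいうううえええおおおダウンロード";

        // Wrong: encode with one charset, decode with another -> mojibake
        String broken = new String(path.getBytes("Shift_JIS"), StandardCharsets.UTF_8);
        System.out.println(broken);

        // Right: decode bytes with the charset they were actually encoded in
        byte[] utf8Bytes = path.getBytes(StandardCharsets.UTF_8);
        String restored = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(restored); // prints the original text
    }
}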

Disable auto-escaping when creating an HTML file with PrintWriter

I don't know why this is so much harder than expected.
I'm trying to create an HTML file with Java, and it is not working. The code creates a file, but the contents are not what I inputted.
My simplified code is as follows:
File file = new File("text.html");
PrintWriter out = null;
try {
    out = new PrintWriter(file);
    out.write("<b>Hello World!</b>");
} catch (Exception e) { }
out.close();
Instead of the contents "Hello World!" in bold, the HTML file contains the escaped form "&lt;b&gt;Hello World!&lt;/b&gt;". When I open the file with TextWrangler, I see that all my angle brackets have been escaped into &lt; and &gt;, which breaks all the formatting and is NOT what I want.
How do I avoid this?
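For what it's worth, PrintWriter itself performs no HTML escaping; it writes the characters exactly as given, so any escaping you see is being introduced somewhere else (for example by a templating layer or by the tool used to view the file). A minimal sketch, with the charset made explicit and the writer closed via try-with-resources (both are my own additions, not in the original code):
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class WriteHtml {
    public static void main(String[] args) throws IOException {
        File file = new File("text.html");
        // PrintWriter writes the string verbatim; no angle brackets are escaped
        try (PrintWriter out = new PrintWriter(file, StandardCharsets.UTF_8.name())) {
            out.write("<b>Hello World!</b>");
        }
    }
}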

BeanIO - UnidentifiedRecordException when parsing UTF8 file

I have a problem when parsing a file that is encoded as UTF-8.
I have two files which are completely identical except for their encoding (I simply copied the file and saved it as UTF-8, so the contents are identical). One is encoded as ANSI, the other as UTF-8. The ANSI-encoded file is parsed successfully, while the other file causes BeanIO to throw an UnidentifiedRecordException when BeanReader.read() is called:
org.beanio.UnidentifiedRecordException: Unidentified record at line 1
I have tried to solve this by explicitly setting the encoding to UTF8 using this code:
public static BeanReader getBeanReader(File file, StreamBuilder builder) {
    StreamFactory factory = StreamFactory.newInstance();
    factory.define(builder);
    InputStream iStream;
    try {
        iStream = new FileInputStream(file);
    } catch (FileNotFoundException e) {
        throw new CustomException("Could not create BeanReader, file not found", e);
    }
    Reader reader = new InputStreamReader(iStream, StandardCharsets.UTF_8);
    return factory.createReader("reader", reader);
}
which doesn't solve the issue.
What could be the reason for this error?
As it is the first line that is reported as erroneous: did you save the UTF-8 file without a BOM (that infamous zero-width space at the start of the file)? A BOM is read as part of the first record and can prevent BeanIO from identifying it.
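If the BOM is indeed the problem and you cannot re-save the file without it, one option (my own suggestion, not from the original answer) is to strip it with Commons IO's BOMInputStream before handing the stream to BeanIO. A sketch of the question's factory method adapted accordingly:
import java.io.*;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.BOMInputStream;
import org.beanio.BeanReader;
import org.beanio.StreamFactory;
import org.beanio.builder.StreamBuilder;

public static BeanReader getBeanReader(File file, StreamBuilder builder) throws IOException {
    StreamFactory factory = StreamFactory.newInstance();
    factory.define(builder);
    // BOMInputStream skips a leading UTF-8 BOM if present and passes everything else through
    InputStream iStream = new BOMInputStream(new FileInputStream(file));
    Reader reader = new InputStreamReader(iStream, StandardCharsets.UTF_8);
    return factory.createReader("reader", reader);
}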

Java Apache FileUtils readFileToString and writeStringToFile problems

I need to read a file (actually a .pdf) into a String and then write it back to a file. Between those two steps I'll apply some patches to the string, but that is not important in this case.
I've developed the following JUnit test case:
String f1String=FileUtils.readFileToString(f1);
File temp=File.createTempFile("deleteme", "deleteme");
FileUtils.writeStringToFile(temp, f1String);
assertTrue(FileUtils.contentEquals(f1, temp));
This test converts a file to a string and writes it back. However, the test is failing.
I think it may be because of the encodings, but the FileUtils documentation does not go into much detail about this.
Can anyone help?
Thanks!
Added for further understanding:
Why I need this?
I have very large PDFs on one machine that are replicated on another one. The first machine is in charge of creating those PDFs. Because of the second machine's poor connectivity and the large size of the PDFs, I don't want to sync the whole files, only the changes made.
To create and apply patches, I'm using Google's DiffMatchPatch library, which creates patches between two strings. So I need to load a PDF into a String, apply a generated patch, and write it back to a file.
A PDF is not a text file. Decoding (into Java characters) and re-encoding of binary files that are not encoded text is asymmetrical. For example, if the input bytestream is invalid for the current encoding, you can be assured that it won't re-encode correctly. In short - don't do that. Use readFileToByteArray and writeByteArrayToFile instead.
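A minimal sketch of that suggested byte-array round trip (assuming the same f1 and Commons IO setup as in the question); unlike the String version, this is lossless for arbitrary binary data:
byte[] bytes = FileUtils.readFileToByteArray(f1);
File temp = File.createTempFile("deleteme", "deleteme");
FileUtils.writeByteArrayToFile(temp, bytes);
assertTrue(FileUtils.contentEquals(f1, temp));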
Just a few thoughts:
There might actually be some BOM (byte order mark) bytes in one of the files that either get stripped when reading or added during writing. Is there a difference in file size (if it is the BOM, the difference should be 2 or 3 bytes)?
The line breaks might not match, depending on which system the files were created on, i.e. one might have CR LF while the other has only LF or CR (1 byte difference per line break).
According to the JavaDoc, both methods use the default encoding of the JVM, which should be the same for both operations. Still, try and test with an explicitly set encoding (the JVM's default encoding can be queried with System.getProperty("file.encoding")).
Ed Staub's answer points out why my solution is not working, and he suggested using bytes instead of Strings. In my case I need a String, so the final working solution I found is the following:
@Test
public void testFileRWAsArray() throws IOException {
    String f1String = "";
    byte[] bytes = FileUtils.readFileToByteArray(f1);
    for (byte b : bytes) {
        f1String = f1String + ((char) b);
    }
    File temp = File.createTempFile("deleteme", "deleteme");
    byte[] newBytes = new byte[f1String.length()];
    for (int i = 0; i < f1String.length(); ++i) {
        char c = f1String.charAt(i);
        newBytes[i] = (byte) c;
    }
    FileUtils.writeByteArrayToFile(temp, newBytes);
    assertTrue(FileUtils.contentEquals(f1, temp));
}
By casting between byte and char, I get a symmetric conversion.
Thank you all!
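As a side note (my own observation, not part of the original answers, and assuming a Commons IO version with the Charset overloads): the same byte-per-char symmetry can be obtained more concisely by reading and writing with ISO-8859-1, which maps every byte value 0-255 to the char with the same value:
// ISO-8859-1 maps each byte 0x00-0xFF to the identical char value, so the
// String round trip is lossless even for binary data such as a PDF
String f1String = FileUtils.readFileToString(f1, StandardCharsets.ISO_8859_1);
File temp = File.createTempFile("deleteme", "deleteme");
FileUtils.writeStringToFile(temp, f1String, StandardCharsets.ISO_8859_1);
assertTrue(FileUtils.contentEquals(f1, temp));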
Try this code...
public static String fetchBase64binaryEncodedString(String path) {
    File inboundDoc = new File(path);
    byte[] pdfData;
    try {
        pdfData = FileUtils.readFileToByteArray(inboundDoc);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    byte[] encodedPdfData = Base64.encodeBase64(pdfData);
    String attachment = new String(encodedPdfData);
    return attachment;
}
// How to decode it
public void testConversionPDFtoBase64() throws IOException {
    String path = "C:/Documents and Settings/kantab/Desktop/GTR_SDR/MSDOC.pdf";
    File origFile = new File(path);
    String encodedString = CreditOneMLParserUtil.fetchBase64binaryEncodedString(path);
    // now decode it
    byte[] decodeData = Base64.decodeBase64(encodedString.getBytes());
    String decodedString = new String(decodeData);
    // or actually give the path to the pdf file.
    File decodedfile = File.createTempFile("DECODED", ".pdf");
    FileUtils.writeByteArrayToFile(decodedfile, decodeData);
    Assert.assertTrue(FileUtils.contentEquals(origFile, decodedfile));
    // Frame frame = new Frame("PDF Viewer");
    // frame.setLayout(new BorderLayout());
}

How to save Chinese Characters to file with java?

I use the following code to save Chinese characters to a .txt file, but when I open it with Wordpad, I can't read it.
StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;
FileOutputStream fos;
fos = new FileOutputStream(FileName, Append);
for (int i = 0; i < Shanghai_StrBuf.length(); i++) {
    fos.write(Shanghai_StrBuf.charAt(i));
}
fos.close();
What can I do? I know that if I cut and paste Chinese characters into Wordpad, I can save them into a .txt file. How do I do that in Java?
There are several factors at work here:
Text files have no intrinsic metadata for describing their encoding (for all the talk of angle-bracket taxes, there are reasons XML is popular)
The default encoding for Windows is still an 8-bit (or double-byte) "ANSI" character set with a limited range of values - text files written in this format are not portable
To tell a Unicode file from an ANSI file, Windows apps rely on the presence of a byte order mark at the start of the file (not strictly true - Raymond Chen explains). In theory, the BOM is there to tell you the endianess (byte order) of the data. For UTF-8, even though there is only one byte order, Windows apps rely on the marker bytes to automatically figure out that it is Unicode (though you'll note that Notepad has an encoding option on its open/save dialogs).
It is wrong to say that Java is broken because it does not write a UTF-8 BOM automatically. On Unix systems, it would be an error to write a BOM to a script file, for example, and many Unix systems use UTF-8 as their default encoding. There are times when you don't want it on Windows, either, like when you're appending data to an existing file: fos = new FileOutputStream(FileName,Append);
Here is a method of reliably appending UTF-8 data to a file:
private static void writeUtf8ToFile(File file, boolean append, String data)
        throws IOException {
    boolean skipBOM = append && file.isFile() && (file.length() > 0);
    Closer res = new Closer();
    try {
        OutputStream out = res.using(new FileOutputStream(file, append));
        Writer writer = res.using(new OutputStreamWriter(out, Charset.forName("UTF-8")));
        if (!skipBOM) {
            writer.write('\uFEFF');
        }
        writer.write(data);
    } finally {
        res.close();
    }
}
Usage:
public static void main(String[] args) throws IOException {
    String chinese = "\u4E0A\u6D77";
    boolean append = true;
    writeUtf8ToFile(new File("chinese.txt"), append, chinese);
}
Note: if the file already exists, you choose to append, and the existing data isn't UTF-8 encoded, the only thing that code will create is a mess.
Here is the Closer type used in this code:
public class Closer implements Closeable {
    private Closeable closeable;

    public <T extends Closeable> T using(T t) {
        closeable = t;
        return t;
    }

    @Override
    public void close() throws IOException {
        if (closeable != null) {
            closeable.close();
        }
    }
}
This code makes a Windows-style best guess about how to read the file based on byte order marks:
private static final Charset[] UTF_ENCODINGS = { Charset.forName("UTF-8"),
        Charset.forName("UTF-16LE"), Charset.forName("UTF-16BE") };

private static Charset getEncoding(InputStream in) throws IOException {
    charsetLoop: for (Charset encodings : UTF_ENCODINGS) {
        byte[] bom = "\uFEFF".getBytes(encodings);
        in.mark(bom.length);
        for (byte b : bom) {
            if ((0xFF & b) != in.read()) {
                in.reset();
                continue charsetLoop;
            }
        }
        return encodings;
    }
    return Charset.defaultCharset();
}
private static String readText(File file) throws IOException {
    Closer res = new Closer();
    try {
        InputStream in = res.using(new FileInputStream(file));
        InputStream bin = res.using(new BufferedInputStream(in));
        Reader reader = res.using(new InputStreamReader(bin, getEncoding(bin)));
        StringBuilder out = new StringBuilder();
        for (int ch = reader.read(); ch != -1; ch = reader.read()) {
            out.append((char) ch);
        }
        return out.toString();
    } finally {
        res.close();
    }
}
Usage:
public static void main(String[] args) throws IOException {
    System.out.println(readText(new File("chinese.txt")));
}
(System.out uses the default encoding, so whether it prints anything sensible depends on your platform and configuration.)
If you can rely on the default character encoding being UTF-8 (or some other Unicode encoding), you may use the following:
Writer w = new FileWriter("test.txt");
w.append("上海");
w.close();
The safest way is to always explicitly specify the encoding:
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
w.append("上海");
w.close();
P.S. You may use any Unicode characters in Java source code, even as method and variable names, if the -encoding parameter for javac is configured right. That makes the source code more readable than the escaped \uXXXX form.
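For example (a minimal illustration; the file name is a placeholder), compiling a UTF-8 source file explicitly looks like this:
javac -encoding UTF-8 ShanghaiWriter.java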
Be very careful with the approaches proposed. Even specifying the encoding for the file as follows:
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
will not work if you're running under an operating system like Windows. Even setting the system property for file.encoding to UTF-8 does not fix the issue. This is because Java fails to write a byte order mark (BOM) for the file. Even if you specify the encoding when writing out to a file, opening the same file in an application like Wordpad will display the text as garbage because it doesn't detect the BOM. I tried running the examples here in Windows (with a platform/container encoding of CP1252).
The following bug exists to describe the issue in Java:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
The solution for the time being is to write the byte order mark yourself to ensure the file opens correctly in other applications. See this for more details on the BOM:
http://mindprod.com/jgloss/bom.html
and for a more correct solution see the following link:
http://tripoverit.blogspot.com/2007/04/javas-utf-8-and-unicode-writing-is.html
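Here is a minimal sketch of writing the BOM yourself (my own illustration of the approach described above, not code from the linked articles). U+FEFF written through a UTF-8 writer becomes the three BOM bytes EF BB BF at the start of the file:
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8WithBom {
    public static void main(String[] args) throws IOException {
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("test.txt"), StandardCharsets.UTF_8)) {
            w.write('\uFEFF');   // BOM, encoded as EF BB BF
            w.write("上海");      // the actual content
        }
    }
}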
Here's one way among many. Basically, we're just specifying that the conversion be done to UTF-8 before outputting bytes to the FileOutputStream:
String FileName = "output.txt";
StringBuffer Shanghai_StrBuf=new StringBuffer("\u4E0A\u6D77");
boolean Append=true;
Writer writer = new OutputStreamWriter(new FileOutputStream(FileName,Append), "UTF-8");
writer.write(Shanghai_StrBuf.toString(), 0, Shanghai_StrBuf.length());
writer.close();
I manually verified this against the images at http://www.fileformat.info/info/unicode/char/ . In the future, please follow Java coding standards, including lower-case variable names. It improves readability.
Try this,
StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;
Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(FileName, Append), "UTF8"));
for (int i = 0; i < Shanghai_StrBuf.length(); i++) {
    out.write(Shanghai_StrBuf.charAt(i));
}
out.close();
