java.util.Scanner to read files with different character encoding - java

I use Java to read a list of files. Some of them have a different encoding, ANSI instead of UTF-8. java.util.Scanner is unable to read these files and returns an empty output string.
I tried another approach:
FileInputStream fis = new FileInputStream(my_file);
BufferedReader br = new BufferedReader(new InputStreamReader(fis));
InputStreamReader isr = new InputStreamReader(fis);
isr.getEncoding();
I am not sure how to change the character encoding for the ANSI ones. UTF-8 and ANSI files are mixed in the same folder. I tried to use Apache Tika for this.
After I get the encoding of the file, I use Scanner, but I still get empty output.
Scanner scanner = new Scanner(my_file, detector.getCharset().toString());
line = scanner.nextLine();
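For reference, a minimal detect-then-read sketch using Tika's CharsetDetector (the ICU4J port shipped in tika-parsers) might look like the following. This is a sketch only: verify the class names and the setText/detect calls against your Tika version, and note that setText(InputStream) requires a stream with mark/reset support.
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Scanner;

// Sketch only: class names are from tika-parsers and may differ between Tika versions.
static Scanner scannerFor(File myFile) throws IOException {
    CharsetDetector detector = new CharsetDetector();
    try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(myFile))) {
        detector.setText(in); // setText(InputStream) needs mark/reset support
        CharsetMatch match = detector.detect();
        // Build the Scanner with the detected charset name
        return new Scanner(myFile, match.getName());
    }
}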

There is a library called juniversalchardet, which can help you guess the right encoding. It was updated recently and is currently located on GitHub:
https://github.com/albfernandez/juniversalchardet
However, there is no fail-safe tool to detect encodings, as there are many unknowns:
Is this file text at all, or some PNG?
Is it stored in a single-byte or a multi-byte encoding?
Which of the possible encodings of that width was used?
Some guesswork can be done by counting the number of control characters that are not commonly used. When a file contains many control characters, it is likely that you've chosen the wrong encoding. (Then try the next one.)
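As an illustration of that heuristic (my own sketch, not part of any library; the 1% threshold is arbitrary): decode the bytes with a candidate charset and count control characters other than tab, CR and LF.
import java.nio.charset.Charset;

// Rough plausibility check for a candidate charset.
static boolean looksPlausible(byte[] data, Charset candidate) {
    String decoded = new String(data, candidate);
    long controls = decoded.chars()
            .filter(c -> c < 0x20 && c != '\t' && c != '\r' && c != '\n')
            .count();
    return controls < decoded.length() / 100.0; // fewer than 1% control characters
}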
Juniversalchardet tries multiple, more successful approaches to determine the encoding (even for Chinese ones). It also provides a convenient way to open a reader from a file with the correct encoding already selected:
(Snippet taken from https://github.com/albfernandez/juniversalchardet#creating-a-reader-with-correct-encoding and adapted)
import org.mozilla.universalchardet.ReaderFactory;

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;

public class TestCreateReaderFromFile {

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java TestCreateReaderFromFile FILENAME");
            System.exit(1);
        }
        // readLine() lives on BufferedReader, so declare it as such (the factory returns one)
        BufferedReader reader = null;
        try {
            File file = new File(args[0]);
            reader = ReaderFactory.createBufferedReader(file);
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // print each line to the console
            }
        }
        finally {
            if (reader != null) {
                reader.close();
            }
        }
    }
}
Edit: Added ScannerFactory
/*
(C) Copyright 2016-2017 Alberto Fernández <infjaf@gmail.com>
Adapted by Fritz Windisch 2018-11-15
The contents of this file are subject to the Mozilla Public License Version
1.1 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.mozilla.org/MPL/
Software distributed under the License is distributed on an "AS IS" basis,
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
for the specific language governing rights and limitations under the
License.
Alternatively, the contents of this file may be used under the terms of
either the GNU General Public License Version 2 or later (the "GPL"), or
the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
in which case the provisions of the GPL or the LGPL are applicable instead
of those above. If you wish to allow use of your version of this file only
under the terms of either the GPL or the LGPL, and not to allow others to
use your version of this file under the terms of the MPL, indicate your
decision by deleting the provisions above and replace them with the notice
and other provisions required by the GPL or the LGPL. If you do not delete
the provisions above, a recipient may use your version of this file under
the terms of any one of the MPL, the GPL or the LGPL.
*/
import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;
import java.util.Scanner;
import org.mozilla.universalchardet.UniversalDetector;
import org.mozilla.universalchardet.UnicodeBOMInputStream;
/**
 * Creates a Scanner from a file with the correct encoding.
 */
public final class ScannerFactory {

    private ScannerFactory() {
        throw new AssertionError("No instances allowed");
    }

    /**
     * Create a scanner from a file with correct encoding
     * @param file The file to read from
     * @param defaultCharset the charset to use if the encoding cannot be determined
     * @return Scanner for the file with the correct encoding
     * @throws java.io.IOException if an I/O error occurs
     */
    public static Scanner createScanner(File file, Charset defaultCharset) throws IOException {
        Charset cs = Objects.requireNonNull(defaultCharset, "defaultCharset must be not null");
        String detectedEncoding = UniversalDetector.detectCharset(file);
        if (detectedEncoding != null) {
            cs = Charset.forName(detectedEncoding);
        }
        if (!cs.toString().contains("UTF")) {
            return new Scanner(file, cs.name());
        }
        // For UTF encodings, skip the byte order mark (BOM) if present.
        Path path = file.toPath();
        return new Scanner(new UnicodeBOMInputStream(new BufferedInputStream(Files.newInputStream(path))), cs.name());
    }

    /**
     * Create a scanner from a file with correct encoding. If the charset cannot be
     * determined, the system default charset is used.
     * @param file The file to read from
     * @return Scanner for the file with the correct encoding
     * @throws java.io.IOException if an I/O error occurs
     */
    public static Scanner createScanner(File file) throws IOException {
        return createScanner(file, Charset.defaultCharset());
    }
}
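Usage is then a one-liner; for example (the file name is a placeholder):
try (Scanner scanner = ScannerFactory.createScanner(new File("my_file.txt"))) {
    while (scanner.hasNextLine()) {
        System.out.println(scanner.nextLine());
    }
}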

Your approach will not give you the right encoding.
FileInputStream fis = new FileInputStream(my_file);
BufferedReader br = new BufferedReader(new InputStreamReader(fis));
InputStreamReader isr = new InputStreamReader(fis);
isr.getEncoding();
This will return the encoding being used by this InputStreamReader (read the javadoc), not the encoding of the characters written in the file (my_file in your case). And if the encoding is wrong, Scanner won't be able to read the file properly.
In fact, do correct me if I am wrong, but there is no way to get the encoding used for a particular file with 100% accuracy. There are a few projects which have a better success rate at guessing the encoding, but not 100% accuracy. On the other hand, if you know the encoding used, then you can read the file using:
Scanner scanner = new Scanner(my_file, "charset");
scanner.nextLine();
Also, find out the correct charset name used in Java for "ANSI". Note that "ANSI" is not a single charset: on most Western-European Windows systems it means windows-1252 (Cp1252), while plain 7-bit text is US-ASCII.
Whichever path you go, be on the lookout for any IOException which might point you in the right direction.
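If you are unsure which names your JVM accepts, you can print the platform default and the full list of supported charsets; a quick sketch:
import java.nio.charset.Charset;

public class ListCharsets {
    public static void main(String[] args) {
        System.out.println(Charset.defaultCharset()); // platform default, e.g. windows-1252
        Charset.availableCharsets().keySet().forEach(System.out::println); // all supported names
    }
}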

To make Scanner work with a different encoding, you have to provide the correct one to its constructor.
To determine the file encoding it is better to use an external lib (e.g. https://github.com/albfernandez/juniversalchardet). But if you definitely know the possible encodings, you can check manually, e.g. for the UTF-8 byte order mark described on Wikipedia:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedList;
import java.util.List;
import java.util.Scanner;

public static void main(String... args) throws IOException {
    List<String> lines = readLinesFromFile(new File("d:/utf8.txt"));
}

public static List<String> readLinesFromFile(File file) throws IOException {
    try (Scanner scan = new Scanner(file, getCharsetName(file))) {
        List<String> lines = new LinkedList<>();
        while (scan.hasNextLine())
            lines.add(scan.nextLine());
        return lines;
    }
}

private static String getCharsetName(File file) throws IOException {
    try (InputStream in = new FileInputStream(file)) {
        // Look for the UTF-8 byte order mark (EF BB BF); fall back to ASCII otherwise.
        if (in.read() == 0xEF && in.read() == 0xBB && in.read() == 0xBF)
            return StandardCharsets.UTF_8.name();
        return StandardCharsets.US_ASCII.name();
    }
}

Related

Why does my BufferedReader code leak memory?

I've got a wrapper for BufferedReader that reads in files one after the other to create an uninterrupted stream across multiple files:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.zip.GZIPInputStream;

/**
 * reads in a whole bunch of files such that when one ends it moves to the
 * next file.
 *
 * @author isaak
 */
class LogFileStream implements FileStreamInterface {
    private ArrayList<String> fileNames;
    private BufferedReader br;
    private boolean done = false;

    /**
     * @param files an array list of files to read from, order matters.
     * @throws IOException
     */
    public LogFileStream(ArrayList<String> files) throws IOException {
        fileNames = new ArrayList<String>();
        for (int i = 0; i < files.size(); i++) {
            fileNames.add(files.get(i));
        }
        setFile();
    }

    /**
     * advances the file that this class is reading from.
     *
     * @throws IOException
     */
    private void setFile() throws IOException {
        if (fileNames.size() == 0) {
            this.done = true;
            return;
        }
        if (br != null) {
            br.close();
        }
        // if the file is a .gz file do a little extra work,
        // otherwise read it in with a standard FileReader;
        // in either case, set the buffer size to 128kb.
        if (fileNames.get(0).endsWith(".gz")) {
            InputStream fileStream = new FileInputStream(fileNames.get(0));
            InputStream gzipStream = new GZIPInputStream(fileStream);
            // TODO this probably needs to be modified to work well on any
            // platform, UTF-8 is standard for debian/novastar though.
            Reader decoder = new InputStreamReader(gzipStream, "UTF-8");
            // note that the buffer size is set to 128kb instead of the standard 8kb.
            br = new BufferedReader(decoder, 131072);
            fileNames.remove(0);
        } else {
            FileReader filereader = new FileReader(fileNames.get(0));
            br = new BufferedReader(filereader, 131072);
            fileNames.remove(0);
        }
    }

    /**
     * returns true if there are more lines available to read.
     * @return true if there are more lines available to read.
     */
    public boolean hasMore() {
        return !done;
    }

    /**
     * Gets the next line from the correct file.
     * @return the next line from the files, if there isn't one it returns null
     * @throws IOException
     */
    public String nextLine() throws IOException {
        if (done == true) {
            return null;
        }
        String line = br.readLine();
        if (line == null) {
            setFile();
            return nextLine();
        }
        return line;
    }
}
If I construct this object on a large list of files (300MB worth of files), then print nextLine() over and over again in a while loop, performance continually degrades until there is no more RAM to use. This happens even if I'm reading files that are ~500kb and using a virtual machine that has 32MB of memory.
I want this code to be able to run on positively massive data-sets (hundreds of gigabytes worth of files) and it is a component of a program that needs to run with 32MB or less of memory.
The files that are used are mostly labeled CSV files, hence the use of Gzip to compress them on disk. This reader needs to handle gzip and uncompressed files.
Correct me if I'm wrong, but once a file has been read through and has had its lines spat out, the data from that file, the objects related to that file, and everything else should be eligible for garbage collection?
With Java 8, GZIP support has moved from Java code to native zlib usage.
Non-closed GZIP streams leak native memory (I really said "native", not "heap", memory) and it is far from easy to diagnose. Depending on the application's usage of such streams, the operating system may reach its memory limit quite fast.
The symptom is that the operating system's process memory usage is not consistent with the JVM memory usage reported by Native Memory Tracking: https://docs.oracle.com/javase/8/docs/technotes/guides/vm/nmt-8.html
You will find full story details at http://www.evanjones.ca/java-native-leak-bug.html
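A minimal sketch of the safe pattern (Java 8+ for lines(); the file name is a placeholder): let try-with-resources close the whole stream chain, which frees the native zlib buffers promptly.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

static long countLines(String gzFile) throws IOException {
    // Closing the outer reader closes the wrapped GZIPInputStream, releasing its native memory.
    try (BufferedReader br = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(new FileInputStream(gzFile)), StandardCharsets.UTF_8))) {
        return br.lines().count();
    }
}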
The last call to setFile won't close your BufferedReader, so you are leaking resources.
Indeed, in nextLine you read the first file until the end. When the end is reached you call setFile and check if there is another file to process. However, if there is no more file, you return immediately without closing the last BufferedReader used.
Furthermore, if you don't process all the files, you will have a resource still in use.
There is at least one leak in your code: Method setFile() does not close the last BufferedReader because the if (fileNames.size() == 0) check comes before if (br != null) check.
However, this could lead to the described effect only if LogFileStream is instantiated multiple times.
It would also be better to use LinkedList instead of ArrayList as fileNames.remove(0) is more 'expensive' on the ArrayList than on the LinkedList. You could instantiate it using following single line in the constructor: fileNames = new LinkedList<>(files);
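Putting those fixes together, a corrected setFile() for the class above might look like this (a sketch; note it also decodes plain files as UTF-8, where the original used the platform default via FileReader):
private void setFile() throws IOException {
    // Close the previous reader first, even when no files remain.
    if (br != null) {
        br.close();
        br = null;
    }
    if (fileNames.isEmpty()) {
        this.done = true;
        return;
    }
    String name = fileNames.remove(0);
    InputStream in = new FileInputStream(name);
    if (name.endsWith(".gz")) {
        in = new GZIPInputStream(in);
    }
    br = new BufferedReader(new InputStreamReader(in, "UTF-8"), 131072);
}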
Every once in a while, you should close() the BufferedReader (note that Readers have no flush(); closing is what releases them). So whenever setFile() replaces the reader, close the old one first, i.e. just before every call like br = new BufferedReader(decoder, 131072).
The GC can only reclaim a reader after you close it. If you are using Java 7 or above, consider the try-with-resources statement, which is a better way to deal with IO operations: https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html

How to get encoding type of a .txt or .sql file

Is there a way to get the encoding of an existing .txt file? For example: you know a customer needs a specific encoding, and you want to automate the process of .sql data delivery. You read the target encoding from a client config and compare it to the current encoding of the file to be delivered; if they differ, you change the encoding. I could not find a solution so far; any help would be appreciated.
There is no explicit declaration of text encoding in files, but you can guess the encoding by analyzing specific byte sequences that are characteristic of a certain encoding.
Chardet does exactly that and tries to guess. If it can't say for sure what the encoding is, it will give you a list with confidence values (e.g. "90% this is utf8"). The project includes both a Python module and a command line tool. For a Java version, see JChardet.
My 2 cents: if you just need a quick way to detect an encoding, the command-line chardet tool is the way to go.
juniversalchardet is one of the best available APIs for detecting the encoding type. Please check out this link; you can go through the list of encoding types it supports.
Working Example from the site
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
    public static void main(String[] args) throws java.io.IOException {
        byte[] buf = new byte[4096];
        String fileName = args[0];
        java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

        // (1) Instantiate the detector.
        UniversalDetector detector = new UniversalDetector(null);

        // (2) Feed it data until it is done or the stream ends.
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }

        // (3) Signal the end of the data.
        detector.dataEnd();

        // (4) Query the detected charset, if any.
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5) Reset the detector before reusing it, and release the stream.
        detector.reset();
        fis.close();
    }
}
Hope this helps!

Reading Java Properties file without escaping values

My application needs to use a .properties file for configuration.
In the properties files, users are allowed to specify paths.
Problem
Properties files need values to be escaped, e.g.
dir = c:\\mydir
Needed
I need some way to accept a properties file where the values are not escaped, so that the users can specify:
dir = c:\mydir
Why not simply extend the Properties class to escape the backslashes (by doubling them) as the file is loaded? A good feature of this is that through the rest of your program you can still use the original Properties class.
public class PropertiesEx extends Properties {
    // Note: this overloads Properties.load(InputStream) rather than overriding it.
    public void load(FileInputStream fis) throws IOException {
        Scanner in = new Scanner(fis);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (in.hasNextLine()) {
            out.write(in.nextLine().replace("\\", "\\\\").getBytes());
            out.write("\n".getBytes());
        }
        InputStream is = new ByteArrayInputStream(out.toByteArray());
        super.load(is);
    }
}
Using the new class is as simple as:
PropertiesEx p = new PropertiesEx();
p.load(new FileInputStream("C:\\temp\\demo.properties"));
p.list(System.out);
The stripping code could also be improved upon but the general principle is there.
Two options:
use the XML properties format instead
write your own parser for a modified .properties format without escapes
You can "preprocess" the file before loading the properties, for example:
public InputStream preprocessPropertiesFile(String myFile) throws IOException {
    Scanner in = new Scanner(new FileReader(myFile));
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while (in.hasNextLine()) {
        out.write(in.nextLine().replace("\\", "\\\\").getBytes());
        out.write('\n'); // keep the line breaks, or all properties collapse into one line
    }
    in.close();
    return new ByteArrayInputStream(out.toByteArray());
}
And your code could look this way
Properties properties = new Properties();
properties.load(preprocessPropertiesFile("path/myfile.properties"));
This way, your .properties file can look the way you need it to, while the property values are still ready to use.
*I know there should be better ways to manipulate files, but I hope this helps.
The right way would be to provide your users with a property file editor (or a plugin for their favorite text editor) which allows them entering the text as pure text, and would save the file in the property file format.
If you don't want this, you are effectively defining a new format for the same (or a subset of the) content model as the property files have.
Go the whole way and actually specify your format, and then think about a way to either
transform the format to the canonical one, and then use this for loading the files, or
parse this format and populate a Properties object from it.
Both of these approaches will only work directly if you actually can control your property object's creation, otherwise you will have to store the transformed format with your application.
So, let's see how we can define this. The content model of normal property files is simple:
A map of string keys to string values, both allowing arbitrary Java strings.
The escaping which you want to avoid serves just to allow arbitrary Java strings, and not just a subset of these.
An often sufficient subset would be:
A map of string keys (not containing any whitespace, : or =) to string values (not containing any leading or trailing white space or line breaks).
In your example dir = c:\mydir, the key would be dir and the value c:\mydir.
If we want our keys and values to contain any Unicode character (other than the forbidden ones mentioned), we should use UTF-8 (or UTF-16) as the storage encoding - since we have no way to escape characters outside of the storage encoding. Otherwise, US-ASCII or ISO-8859-1 (as normal property files) or any other encoding supported by Java would be enough, but make sure to include this in your specification of the content model (and make sure to read it this way).
Since we restricted our content model so that all "dangerous" characters are out of the way, we can now define the file format simply as this:
<simplepropertyfile> ::= (<line> <line break> )*
<line> ::= <comment> | <empty> | <key-value>
<comment> ::= <space>* "#" < any text excluding line breaks >
<key-value> ::= <space>* <key> <space>* "=" <space>* <value> <space>*
<empty> ::= <space>*
<key> ::= < any text excluding ':', '=' and whitespace >
<value> ::= < any text starting and ending not with whitespace,
not including line breaks >
<space> ::= < any whitespace, but not a line break >
<line break> ::= < one of "\n", "\r", and "\r\n" >
Every \ occurring in either key or value now is a real backslash, not anything which escapes something else.
Thus, for transforming it into the original format, we simply need to double it, like Grekz proposed, for example in a filtering reader:
public class DoubleBackslashFilter extends FilterReader {
    private boolean bufferedBackslash = false;

    public DoubleBackslashFilter(Reader org) {
        super(org);
    }

    public int read() throws IOException {
        if (bufferedBackslash) {
            bufferedBackslash = false;
            return '\\';
        }
        int c = super.read();
        if (c == '\\')
            bufferedBackslash = true;
        return c;
    }

    public int read(char[] buf, int off, int len) throws IOException {
        int read = 0;
        if (bufferedBackslash) {
            buf[off] = '\\';
            read++;
            off++;
            len--;
            bufferedBackslash = false;
        }
        if (len > 1) {
            // Read only half the buffer so there is room to double every backslash.
            int step = super.read(buf, off, len / 2);
            if (step < 0) {
                return read > 0 ? read : -1; // propagate EOF correctly
            }
            for (int i = 0; i < step; i++) {
                if (buf[off + i] == '\\') {
                    // Shift everything from here one char to the right, duplicating the backslash.
                    System.arraycopy(buf, off + i, buf, off + i + 1, step - i);
                    // Adjust parameters.
                    step++;
                    i++;
                }
            }
            read += step;
        }
        return read;
    }
}
Then we would pass this Reader to our Properties object (or save the contents to a new file).
Instead, we could simply parse this format ourselves.
public Properties parse(Reader in) throws IOException {
    BufferedReader r = new BufferedReader(in);
    Properties prop = new Properties();
    Pattern keyValPattern = Pattern.compile("\\s*=\\s*");
    String line;
    while ((line = r.readLine()) != null) {
        line = line.trim(); // remove leading and trailing space
        if (line.equals("") || line.startsWith("#")) {
            continue; // ignore empty and comment lines
        }
        // String.split takes a String regex, so split via the Pattern instead.
        String[] kv = keyValPattern.split(line, 2);
        // the pattern also grabs space around the separator.
        if (kv.length < 2) {
            // no key-value separator. TODO: throw an exception or simply ignore this line?
            continue;
        }
        prop.setProperty(kv[0], kv[1]);
    }
    r.close();
    return prop;
}
Again, using Properties.store() after this, we can export it in the original format.
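For example, a round trip under those assumptions could look like this (file names are placeholders):
Properties prop = parse(new FileReader("simple.properties"));
try (FileWriter out = new FileWriter("escaped.properties")) {
    prop.store(out, "converted to the standard escaped format");
}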
Based on @Ian Harrigan's answer, here is a complete solution to get Netbeans properties files (and other escaping properties files) right, from and to ASCII text files:
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
/**
 * This class allows to handle Netbeans properties files.
 * It is based on the work of : http://stackoverflow.com/questions/6233532/reading-java-properties-file-without-escaping-values.
 * It overrides both load methods in order to load a netbeans property file, taking into account the \ that
 * were escaped by java properties original load methods.
 * @author stephane
 */
public class NetbeansProperties extends Properties {
    @Override
    public synchronized void load(Reader reader) throws IOException {
        BufferedReader bfr = new BufferedReader(reader);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String readLine = null;
        while ((readLine = bfr.readLine()) != null) {
            out.write(readLine.replace("\\", "\\\\").getBytes());
            out.write("\n".getBytes());
        }//while
        InputStream is = new ByteArrayInputStream(out.toByteArray());
        super.load(is);
    }//met

    @Override
    public void load(InputStream is) throws IOException {
        load(new InputStreamReader(is));
    }//met

    @Override
    public void store(Writer writer, String comments) throws IOException {
        PrintWriter out = new PrintWriter(writer);
        if (comments != null) {
            out.print('#');
            out.println(comments);
        }//if
        List<String> listOrderedKey = new ArrayList<String>();
        listOrderedKey.addAll(this.stringPropertyNames());
        Collections.sort(listOrderedKey);
        for (String key : listOrderedKey) {
            String newValue = this.getProperty(key);
            out.println(key + "=" + newValue);
        }//for
        out.flush(); // PrintWriter buffers; flush so the caller's writer sees the data
    }//met

    @Override
    public void store(OutputStream out, String comments) throws IOException {
        store(new OutputStreamWriter(out), comments);
    }//met
}//class
You could try using Guava's Splitter: split on '=' and build a map from the resulting Iterable.
The disadvantage of this solution is that it does not support comments.
@pdeva: one more solution
// Reads entire file into a String (Scanner is available since Java 1.5)
Scanner scan = new Scanner(new File("C:/workspace/Test/src/myfile.properties"));
scan.useDelimiter("\\Z");
String content = scan.next();
scan.close();

// Use Apache commons-lang StringEscapeUtils.escapeJava() to escape java characters
ByteArrayInputStream bi = new ByteArrayInputStream(StringEscapeUtils.escapeJava(content).getBytes());

// Load the properties file
Properties properties = new Properties();
properties.load(bi);
It's not an exact answer to your question, but a different solution that may fit your needs. In Java, you can use / as the path separator and it will work on Windows, Linux, and OS X alike. This is especially useful for relative paths.
In your example, you could use:
dir = c:/mydir

Read resource text file to String in Java [closed]

Is there a way to read a text file in the resource into a String?
I suppose this is a popular requirement, but I couldn't find any utility after Googling.
Yes, Guava provides this in the Resources class. For example:
URL url = Resources.getResource("foo.txt");
String text = Resources.toString(url, StandardCharsets.UTF_8);
You can use the old Stupid Scanner trick one-liner to do that without any additional dependency like Guava:
String text = new Scanner(AppropriateClass.class.getResourceAsStream("foo.txt"), "UTF-8").useDelimiter("\\A").next();
Guys, don't use 3rd party stuff unless you really need that. There is a lot of functionality in the JDK already.
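One caveat with the one-liner: the Scanner (and the stream under it) is never closed, and next() throws NoSuchElementException on an empty file. A slightly safer variant under the same no-dependency constraint:
String text;
try (Scanner scanner = new Scanner(
        AppropriateClass.class.getResourceAsStream("foo.txt"), "UTF-8").useDelimiter("\\A")) {
    text = scanner.hasNext() ? scanner.next() : "";
}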
Pure and simple, jar-friendly, Java 8+ solution
This simple method below will do just fine if you're using Java 8 or greater:
/**
 * Reads given resource file as a string.
 *
 * @param fileName path to the resource file
 * @return the file's contents
 * @throws IOException if read fails for any reason
 */
static String getResourceFileAsString(String fileName) throws IOException {
    ClassLoader classLoader = ClassLoader.getSystemClassLoader();
    try (InputStream is = classLoader.getResourceAsStream(fileName)) {
        if (is == null) return null;
        try (InputStreamReader isr = new InputStreamReader(is);
             BufferedReader reader = new BufferedReader(isr)) {
            return reader.lines().collect(Collectors.joining(System.lineSeparator()));
        }
    }
}
And it also works with resources in jar files.
About text encoding: InputStreamReader uses the default system charset if you don't specify one. You may want to specify it yourself to avoid decoding problems, like this:
new InputStreamReader(is, StandardCharsets.UTF_8);
Avoid unnecessary dependencies
Always prefer not depending on big, fat libraries. Unless you are already using Guava or Apache Commons IO for other tasks, adding those libraries to your project just to be able to read from a file seems a bit too much.
For Java 7:
new String(Files.readAllBytes(Paths.get(getClass().getResource("foo.txt").toURI())));
For Java 11:
Files.readString(Paths.get(getClass().getClassLoader().getResource("foo.txt").toURI()));
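Note that both toURI() variants assume the resource is a plain file on disk; for a resource inside a jar, Paths.get(URI) throws FileSystemNotFoundException. A stream-based alternative that works in both cases (Java 9+ for readAllBytes()):
try (InputStream is = getClass().getClassLoader().getResourceAsStream("foo.txt")) {
    String text = new String(is.readAllBytes(), StandardCharsets.UTF_8);
}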
yegor256 has found a nice solution using Apache Commons IO:
import org.apache.commons.io.IOUtils;
String text = IOUtils.toString(this.getClass().getResourceAsStream("foo.xml"),
"UTF-8");
Guava has a "toString" method for reading a file into a String:
import com.google.common.base.Charsets;
import com.google.common.io.Files;
String content = Files.toString(new File("/home/x1/text.log"), Charsets.UTF_8);
This method does not require the file to be on the classpath (unlike Jon Skeet's earlier answer).
apache-commons-io has a utility named FileUtils:
URL url = Resources.getResource("myFile.txt");
File myFile = new File(url.toURI());
String content = FileUtils.readFileToString(myFile, "UTF-8"); // or any other encoding
I like akosicki's answer with the Stupid Scanner Trick. It's the simplest approach I see without external dependencies that works in Java 8 (and in fact all the way back to Java 5). Here's an even simpler answer if you can use Java 9 or higher (since InputStream.readAllBytes() was added in Java 9):
String text = new String(AppropriateClass.class.getResourceAsStream("foo.txt")
.readAllBytes());
If you're concerned about the filename being wrong and/or about closing the stream, you can expand this a little:
String text = null;
InputStream stream = AppropriateClass.class.getResourceAsStream("foo.txt");
if (null != stream) {
    text = new String(stream.readAllBytes()); // readAllBytes() returns byte[], so wrap it in a String
    stream.close();
}
You can use the following code (Java 7+):
new String(Files.readAllBytes(Paths.get(getClass().getResource("example.txt").toURI())));
I often had this problem myself. To avoid dependencies on small projects, I often write a small utility function when I don't need commons-io or the like. Here is the code to load the content of the file into a string buffer:
StringBuffer sb = new StringBuffer();
BufferedReader br = new BufferedReader(new InputStreamReader(getClass().getResourceAsStream("path/to/textfile.txt"), "UTF-8"));
for (int c = br.read(); c != -1; c = br.read()) sb.append((char)c);
System.out.println(sb.toString());
Specifying the encoding is important in that case, because you might have
edited your file in UTF-8, and then put it in a jar, and the computer that opens
the file may have CP-1251 as its native file encoding (for example); so in
this case you never know the target encoding, therefore the explicit
encoding information is crucial.
Also, the loop that reads the file char by char may seem inefficient, but it is used on a BufferedReader, so it is actually quite fast.
If you want to get your String from a project resource like the file
testcase/foo.json in src/main/resources in your project, do this:
String myString=
new String(Files.readAllBytes(Paths.get(getClass().getClassLoader().getResource("testcase/foo.json").toURI())));
Note that the getClassLoader() method is missing on some of the other examples.
Here's a solution using Java 11's Files.readString:
public class Utils {
public static String readResource(String name) throws URISyntaxException, IOException {
var uri = Utils.class.getResource("/" + name).toURI();
var path = Paths.get(uri);
return Files.readString(path);
}
}
Use Apache Commons IO's FileUtils; it has a readFileToString method.
I'm using the following for reading resource files from the classpath:
import java.io.IOException;
import java.io.InputStream;
import java.net.URISyntaxException;
import java.util.Scanner;
public class ResourceUtilities
{
public static String resourceToString(String filePath) throws IOException, URISyntaxException
{
try (InputStream inputStream = ResourceUtilities.class.getClassLoader().getResourceAsStream(filePath))
{
return inputStreamToString(inputStream);
}
}
private static String inputStreamToString(InputStream inputStream)
{
try (Scanner scanner = new Scanner(inputStream).useDelimiter("\\A"))
{
return scanner.hasNext() ? scanner.next() : "";
}
}
}
No third party dependencies required.
At least as of Apache commons-io 2.5, the IOUtils.toString() method supports an URI argument and returns contents of files located inside jars on the classpath:
IOUtils.toString(SomeClass.class.getResource(...).toURI(), ...)
With set of static imports, Guava solution can be very compact one-liner:
toString(getResource("foo.txt"), UTF_8);
The following imports are required:
import static com.google.common.io.Resources.getResource;
import static com.google.common.io.Resources.toString;
import static java.nio.charset.StandardCharsets.UTF_8;
package test;

import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class Main {

    public static void main(String[] args) {
        try {
            String fileContent = getFileFromResources("resourcesFile.txt");
            System.out.println(fileContent);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Use this function to read the content of a file; it must exist in the "resources" folder.
    public static String getFileFromResources(String fileName) throws Exception {
        ClassLoader classLoader = Main.class.getClassLoader();
        InputStream stream = classLoader.getResourceAsStream(fileName);
        String text = null;
        try (Scanner scanner = new Scanner(stream, StandardCharsets.UTF_8.name())) {
            text = scanner.useDelimiter("\\A").next();
        }
        return text;
    }
}
Guava also has Files.readLines() if you want a return value as List<String> line-by-line:
List<String> lines = Files.readLines(new File("/file/path/input.txt"), Charsets.UTF_8);
See here for a comparison of three ways (BufferedReader vs. Guava's Files vs. Guava's Resources) to get a String from a text file.
Here is my approach, which worked fine:
public String getFileContent(String fileName) {
    String filePath = "myFolder/" + fileName + ".json";
    try (InputStream stream = Thread.currentThread().getContextClassLoader().getResourceAsStream(filePath)) {
        return IOUtils.toString(stream, "UTF-8");
    } catch (IOException e) {
        // Log your exception; every path must return or throw for this to compile.
        throw new UncheckedIOException(e);
    }
}
If you include Guava, then you can use:
String fileContent = Files.asCharSource(new File(filename), Charset.forName("UTF-8")).read();
(Other answers mentioned other Guava methods, but those are deprecated.)
The following code works for me:
compile group: 'commons-io', name: 'commons-io', version: '2.6'
@Value("classpath:mockResponse.json")
private Resource mockResponse;
String mockContent = FileUtils.readFileToString(mockResponse.getFile(), "UTF-8");
I made a no-dependency static method like this:
import java.nio.file.Files;
import java.nio.file.Paths;
public class ResourceReader {

    public static String asString(String resourceFileName) {
        try {
            return new String(Files.readAllBytes(Paths.get(
                    new CheatClassLoaderDummyClass().getClass().getClassLoader()
                            .getResource(resourceFileName).toURI())));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

class CheatClassLoaderDummyClass { // cheat class loader - for sql file loading
}
I like Apache commons utils for this type of stuff and use this exact use-case (reading files from classpath) extensively when testing, especially for reading JSON files from /src/test/resources as part of unit / integration testing. e.g.
public class FileUtils {
public static String getResource(String classpathLocation) {
try {
String message = IOUtils.toString(FileUtils.class.getResourceAsStream(classpathLocation),
Charset.defaultCharset());
return message;
}
catch (IOException e) {
throw new RuntimeException("Could not read file [ " + classpathLocation + " ] from classpath", e);
}
}
}
For testing purposes, it can be nice to catch the IOException and throw a RuntimeException - your test class could look like e.g.
@Test
public void shouldDoSomething() {
    String json = FileUtils.getResource("/json/input.json");
    // Use json as part of test ...
}
public static byte[] readResourceStream(String resourcePath) throws IOException {
    ByteArrayOutputStream byteArray = new ByteArrayOutputStream();
    // try-with-resources ensures the stream is closed
    try (InputStream in = CreateBffFile.class.getResourceAsStream(resourcePath)) {
        // Copy the stream in 4kb chunks
        byte[] buffer = new byte[4096];
        for (;;) {
            int nread = in.read(buffer);
            if (nread <= 0) {
                break;
            }
            byteArray.write(buffer, 0, nread);
        }
    }
    return byteArray.toByteArray();
}

Charset charset = StandardCharsets.UTF_8;
String content = new String(FileReader.readResourceStream("/resource/...*.txt"), charset);
String lines[] = content.split("\\n");

How to save Chinese Characters to file with java?

I use the following code to save Chinese characters into a .txt file, but when I opened it with Wordpad, I couldn't read it.
StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;
FileOutputStream fos;
fos = new FileOutputStream(FileName, Append);
for (int i = 0;i < Shanghai_StrBuf.length(); i++) {
fos.write(Shanghai_StrBuf.charAt(i));
}
fos.close();
What can I do? I know that if I cut and paste Chinese characters into Wordpad, I can save them into a .txt file. How do I do that in Java?
There are several factors at work here:
Text files have no intrinsic metadata for describing their encoding (for all the talk of angle-bracket taxes, there are reasons XML is popular)
The default encoding for Windows is still an 8-bit (or double-byte) "ANSI" character set with a limited range of values - text files written in this format are not portable
To tell a Unicode file from an ANSI file, Windows apps rely on the presence of a byte order mark at the start of the file (not strictly true - Raymond Chen explains). In theory, the BOM is there to tell you the endianess (byte order) of the data. For UTF-8, even though there is only one byte order, Windows apps rely on the marker bytes to automatically figure out that it is Unicode (though you'll note that Notepad has an encoding option on its open/save dialogs).
It is wrong to say that Java is broken because it does not write a UTF-8 BOM automatically. On Unix systems, it would be an error to write a BOM to a script file, for example, and many Unix systems use UTF-8 as their default encoding. There are times when you don't want it on Windows, either, like when you're appending data to an existing file: fos = new FileOutputStream(FileName,Append);
Here is a method of reliably appending UTF-8 data to a file:
private static void writeUtf8ToFile(File file, boolean append, String data)
throws IOException {
boolean skipBOM = append && file.isFile() && (file.length() > 0);
Closer res = new Closer();
try {
OutputStream out = res.using(new FileOutputStream(file, append));
Writer writer = res.using(new OutputStreamWriter(out, Charset
.forName("UTF-8")));
if (!skipBOM) {
writer.write('\uFEFF');
}
writer.write(data);
} finally {
res.close();
}
}
Usage:
public static void main(String[] args) throws IOException {
String chinese = "\u4E0A\u6D77";
boolean append = true;
writeUtf8ToFile(new File("chinese.txt"), append, chinese);
}
Note: if the file already existed and you chose to append and existing data wasn't UTF-8 encoded, the only thing that code will create is a mess.
Here is the Closer type used in this code:
public class Closer implements Closeable {
private Closeable closeable;
public <T extends Closeable> T using(T t) {
closeable = t;
return t;
}
@Override public void close() throws IOException {
if (closeable != null) {
closeable.close();
}
}
}
This code makes a Windows-style best guess about how to read the file based on byte order marks:
private static final Charset[] UTF_ENCODINGS = { Charset.forName("UTF-8"),
Charset.forName("UTF-16LE"), Charset.forName("UTF-16BE") };
private static Charset getEncoding(InputStream in) throws IOException {
charsetLoop: for (Charset encodings : UTF_ENCODINGS) {
byte[] bom = "\uFEFF".getBytes(encodings);
in.mark(bom.length);
for (byte b : bom) {
if ((0xFF & b) != in.read()) {
in.reset();
continue charsetLoop;
}
}
return encodings;
}
return Charset.defaultCharset();
}
private static String readText(File file) throws IOException {
Closer res = new Closer();
try {
InputStream in = res.using(new FileInputStream(file));
InputStream bin = res.using(new BufferedInputStream(in));
Reader reader = res.using(new InputStreamReader(bin, getEncoding(bin)));
StringBuilder out = new StringBuilder();
for (int ch = reader.read(); ch != -1; ch = reader.read())
out.append((char) ch);
return out.toString();
} finally {
res.close();
}
}
Usage:
public static void main(String[] args) throws IOException {
System.out.println(readText(new File("chinese.txt")));
}
(System.out uses the default encoding, so whether it prints anything sensible depends on your platform and configuration.)
If you can rely that the default character encoding is UTF-8 (or some other Unicode encoding), you may use the following:
Writer w = new FileWriter("test.txt");
w.append("上海");
w.close();
The safest way is to always explicitly specify the encoding:
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
w.append("上海");
w.close();
P.S. You may use any Unicode characters in Java source code, even as method and variable names, if the -encoding parameter for javac is configured right. That makes the source code more readable than the escaped \uXXXX form.
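On Java 7+ the same explicit-encoding write can be done with java.nio in one call; a sketch (Java 11's Files.writeString is the equivalent for strings):
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

Files.write(Paths.get("test.txt"), "上海".getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);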
Be very careful with the approaches proposed. Even specifying the encoding for the file as follows:
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
will not work if you're running under an operating system like Windows. Even setting the system property for file.encoding to UTF-8 does not fix the issue. This is because Java fails to write a byte order mark (BOM) for the file. Even if you specify the encoding when writing out to a file, opening the same file in an application like Wordpad will display the text as garbage because it doesn't detect the BOM. I tried running the examples here in Windows (with a platform/container encoding of CP1252).
The following bug exists to describe the issue in Java:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
The solution for the time being is to write the byte order mark yourself to ensure the file opens correctly in other applications. See this for more details on the BOM:
http://mindprod.com/jgloss/bom.html
and for a more correct solution see the following link:
http://tripoverit.blogspot.com/2007/04/javas-utf-8-and-unicode-writing-is.html
Here's one way among many. Basically, we're just specifying that the conversion be done to UTF-8 before outputting bytes to the FileOutputStream:
String FileName = "output.txt";
StringBuffer Shanghai_StrBuf=new StringBuffer("\u4E0A\u6D77");
boolean Append=true;
Writer writer = new OutputStreamWriter(new FileOutputStream(FileName,Append), "UTF-8");
writer.write(Shanghai_StrBuf.toString(), 0, Shanghai_StrBuf.length());
writer.close();
I manually verified this against the images at http://www.fileformat.info/info/unicode/char/ . In the future, please follow Java coding standards, including lower-case variable names. It improves readability.
Try this:
StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;
Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(FileName, Append), "UTF8"));
for (int i = 0; i < Shanghai_StrBuf.length(); i++)
    out.write(Shanghai_StrBuf.charAt(i));
out.close();
