I'm trying to parse a very large file (~1.2 GB). Some lines of the file are bigger than the maximum allowed String size.
FileReader fileReader = new FileReader(filePath);
BufferedReader bufferedReader = new BufferedReader(fileReader);
String line;
while ((line = bufferedReader.readLine()) != null) {
//Do something
}
bufferedReader.close();
Error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
at java.lang.StringBuffer.append(StringBuffer.java:369)
at java.io.BufferedReader.readLine(BufferedReader.java:370)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at sax.parser.PrettyPrintXML.format(PrettyPrintXML.java:30)
line 30 :
while ((line = bufferedReader.readLine()) != null) {
Can anyone suggest an alternative approach for this case?
You are using readLine() on a file that doesn't have lines, so it tries to read the entire file as a single line. This does not scale.
Solution: don't. Read a chunk at a time, or maybe even a character at a time: whatever is dictated by the unstated structure of your file.
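For example, a minimal sketch of chunk-at-a-time reading with a plain char buffer; the buffer size and the processChunk handler are placeholders of mine, since the actual structure of your file isn't stated:
char[] buffer = new char[8192]; // chunk size is arbitrary
try (Reader reader = new BufferedReader(new FileReader(filePath))) {
    int read;
    while ((read = reader.read(buffer, 0, buffer.length)) != -1) {
        // only the first 'read' characters of the buffer are valid here
        processChunk(buffer, read); // hypothetical per-chunk handler
    }
}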
I believe the maximum String length is 2^31-1 [2,147,483,647] characters, and a 1.2 GB text file (assuming it is a text file) holds about 1,200,000,000 characters. Why do you need to read all the data? What are you using it for? Can you split the file up into several files, or read and parse it as smaller strings? Need more info.
You can use Apache Commons IO:
https://commons.apache.org/proper/commons-io/description.html
example:
InputStream in = new URL( "http://commons.apache.org" ).openStream();
try {
System.out.println( IOUtils.toString( in ) );
} finally {
IOUtils.closeQuietly(in);
}
Related
I have several XML files (GBs in size) that are to be converted to JSON. I am easily able to convert small files (kilobytes in size) using the JSON library (org.json - https://mvnrepository.com/artifact/org.json/json/20180813).
Here's the code that I am using:
static String line="",str="";
BufferedReader br = new BufferedReader(new FileReader(link));
FileWriter fw = new FileWriter(outputlink);
JSONObject jsondata = null;
while ((line = br.readLine()) != null)
{
str+=line;
}
jsondata = XML.toJSONObject(str);
But the large files (even the <100 MB ones) are taking too long to process, and the larger ones are throwing java.lang.OutOfMemoryError: Java heap space. So, how can I optimize the code to process large files (or is there any other approach/library)?
UPDATE
I have updated the code, and I am now converting the XML to JSON segment by segment.
My XML:
<PubmedArticleSet>
<PubmedArticle>
</PubmedArticle>
<PubmedArticle>
</PubmedArticle>
...
</PubmedArticleSet>
So I am ignoring the root node <PubmedArticleSet> (I will add it later), converting each <PubmedArticle> </PubmedArticle> block to JSON and writing it out one at a time:
br = new BufferedReader(new FileReader(link));
fw = new FileWriter(outputlink,true);
StringBuilder str = new StringBuilder();
br.readLine(); // to skip the first three lines and the root
br.readLine();
br.readLine();
while ((line = br.readLine()) != null) {
JSONObject jsondata = null;
str.append(line);
System.out.println(str);
if (line.trim().equals("</PubmedArticle>")) { // split here
jsondata = XML.toJSONObject(str.toString());
String jsonPrettyPrintString = jsondata.toString(PRETTY_PRINT_INDENT_FACTOR);
fw.append(jsonPrettyPrintString.toString());
System.out.println("One done"); // One section done
str= new StringBuilder();
}
}
fw.close();
I am no longer getting the heap error, but the processing still takes hours for files in the ~300 MB range. Kindly provide any suggestions to speed up this process.
This statement is the main reason that kills your performance:
str+=line;
This causes the allocation, copying, and deallocation of numerous String objects.
You need to use a StringBuilder:
StringBuilder builder = new StringBuilder();
while ( ... ) {
builder.append(line);
}
It may also help (to a lesser extent) to read the file in larger chunks and not line by line.
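For instance, a rough sketch of reading in chunks into the StringBuilder; the chunk size is arbitrary, and link is the file path variable from your code:
StringBuilder builder = new StringBuilder();
char[] buffer = new char[64 * 1024]; // arbitrary chunk size
try (Reader reader = new BufferedReader(new FileReader(link))) {
    int read;
    while ((read = reader.read(buffer)) != -1) {
        builder.append(buffer, 0, read); // append only the characters actually read
    }
}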
The I/O operation of reading a large file is very time-consuming. Try using a library to handle this for you. For example, with Apache Commons IO:
File xmlFile= new File("D:\\path\\file.xml");
String xmlStr= FileUtils.readFileToString(xmlFile, "UTF-8");
JSONObject xmlJson = XML.toJSONObject(xmlStr);
I have a Java program that uploads files from local storage to the MinIO browser. The file size is around 900 MB. When I execute the Java program I get:
java.lang.OutOfMemoryError: Java heap space
I tried increasing the heap size both in eclipse.ini and under Run --> Configurations --> Project to -Xms4096M -Xmx8192M.
After increasing the heap size, when I executed the program I received:
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
How can I upload large files to MinIO using Java?
This is what my Java program looks like:
StringBuilder stringBuilder = new StringBuilder();
File[] files = new File(path).listFiles();
showFiles(files);
System.out.println(pathList);
ListIterator<String> itr=pathList.listIterator();
while(itr.hasNext()){
String relativePath=itr.next();
if(relativePath!=null) {
String absolutePath=path+(relativePath).replaceFirst("minio_files", "");
System.out.println(absolutePath);
System.out.println(relativePath);
File f =new File(absolutePath);
BufferedReader reader = new BufferedReader(new FileReader(f));
String line = null;
String ls = System.getProperty("line.separator");
while ((line = reader.readLine()) != null) {
stringBuilder.append(line);
stringBuilder.append(ls);
}
if(stringBuilder.length()!=0) {
// delete the last new line separator
stringBuilder.deleteCharAt(stringBuilder.length() - 1);
}
reader.close();
// Create a InputStream for object upload.
ByteArrayInputStream bais = new ByteArrayInputStream(stringBuilder.toString().getBytes("UTF-8"));
Do you absolutely need to remove a trailing line separator from your text file?
If this is not absolutely required, you could let the MinIO client libraries handle the upload transparently:
String absolutePath=path+(relativePath).replaceFirst("minio_files", "");
File f =new File(absolutePath);
minio.putObject("bucketName", f.getName(), absolutePath);
According to the MinIO docs, this allows uploads of up to 5 GB. This is easier to implement and faster than any other solution.
If you absolutely need to remove a trailing line separator, you should at least pre-size the StringBuilder (and use the correct code to remove the trailing line separator):
File f = new File(absolutePath);
stringBuilder.ensureCapacity((int) f.length()+2);
try (BufferedReader reader = new BufferedReader(new FileReader(f))) {
String line;
String ls = System.getProperty("line.separator");
while ((line = reader.readLine()) != null) {
stringBuilder.append(line);
stringBuilder.append(ls);
}
if (stringBuilder.length() != 0) {
// delete the last new line separator
stringBuilder.setLength(stringBuilder.length() - ls.length());
}
}
Please beware that this code can never upload files larger than about 2GB:
arrays in Java cannot be larger than Integer.MAX_VALUE-5 elements
therefore a StringBuilder cannot be used to create strings with more than Integer.MAX_VALUE-5 characters
transforming the string into a UTF-8 encoded byte array cannot produce a byte array longer than Integer.MAX_VALUE-5 bytes
since UTF-8 is a multibyte encoding, transforming a string with Integer.MAX_VALUE-5 characters into a byte array might not even be possible
Our application needs to read a file with a single line, and that single line contains a large amount of data. What we do is read the line from the file, store it in a string, tokenize the string on "-", and store the tokens in a list. Some entries from that list are then checked.
The method is as follows:
public boolean checkMessage(String filename) {
boolean retBool = true;
LinkedList tokenList;
int size;
String line = "";
try {
File file = new File(filename);
FileInputStream fs = new FileInputStream(file);
InputStreamReader is = new InputStreamReader(fs);
BufferedReader br = new BufferedReader(is);
while ((line = br.readLine()) != null) {
line = line.trim();
tokenList = tokenizeString(line, "-");
if (tokenList == null) {
retBool = false;
resultMsg = "Error in File.java "
}
if (retBool) {
retBool = checkMessagePart(tokenList);
}
}
}
The error occurs at the line while ((line = br.readLine()) != null)
The error is:
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
at java.lang.StringBuffer.append(StringBuffer.java:322)
at java.io.BufferedReader.readLine(BufferedReader.java:363)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
Actually, increasing the heap size didn't work. The size of the file being read is more than 1 GB. I also tried reading it as chunks of bytes, but adding the read data to a StringBuilder or a list again triggers the memory error.
If the problem is that you cannot read the file into a String, then don't do it. Read it token by token using some other method. The easy way is to use a Scanner with the right delimiter ("-" in your case). If you find its performance lacking, you could resort to implementing your own version of BufferedReader in which the "lines" are split on that character instead of the normal line terminators.
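A minimal sketch of the Scanner approach (exception handling omitted; the UTF-8 charset is an assumption, and the per-token check is left as a placeholder):
try (Scanner scanner = new Scanner(new File(filename), "UTF-8")) {
    scanner.useDelimiter("-");
    while (scanner.hasNext()) {
        String token = scanner.next();
        // check each token here instead of collecting the whole line into a list
    }
}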
I have a big file (about 30 MB), and here is the code I use to read data from the file:
StringBuilder sb = new StringBuilder();
BufferedReader br = new BufferedReader(new FileReader(file));
try {
    String line = br.readLine();
    while (line != null) {
        sb.append(line).append("\n");
        line = br.readLine();
    }
} finally {
    br.close();
}
Then I need to split the content I read, so I use
String[] inst = sb.toString().split("GO");
The problem is that sometimes a substring exceeds the maximum String length and I can't get all the data inside the string. How can I get around this?
Thanks
Use Scanner s = new Scanner(input).useDelimiter("GO"); and call s.next() to read each chunk.
WHY PART: The erroneous result may be the outcome of a non-contiguous heap segment, as the CMS collector doesn't de-fragment memory.
(This does not answer the how-to-solve part, though.)
You may opt to load and process the whole string partwise, i.e. using substring.
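A rough sketch of what that could look like, assuming sb is the StringBuilder from the question; the window size is arbitrary, and a token split across two windows would still need extra handling:
int window = 1 << 20; // ~1M characters per part, arbitrary
for (int i = 0; i < sb.length(); i += window) {
    String part = sb.substring(i, Math.min(i + window, sb.length()));
    // process this part, e.g. split it on "GO"
}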
My program must read text files - line by line.
The files are in UTF-8.
I am not sure that the files are correct - they may contain unprintable characters.
Is it possible to check for this without going down to the byte level?
Thanks.
Open the file with a FileInputStream, then use an InputStreamReader with the UTF-8 Charset to read characters from the stream, and use a BufferedReader to read lines, e.g. via BufferedReader#readLine, which will give you a string. Once you have the string, you can check for characters that aren't what you consider to be printable.
E.g. (without error checking), using try-with-resources (which is available in any vaguely modern Java version):
String line;
try (
InputStream fis = new FileInputStream("the_file_name");
InputStreamReader isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
BufferedReader br = new BufferedReader(isr);
) {
while ((line = br.readLine()) != null) {
// Deal with the line
}
}
While it's not hard to do this manually using BufferedReader and InputStreamReader, I'd use Guava:
List<String> lines = Files.readLines(file, Charsets.UTF_8);
You can then do whatever you like with those lines.
EDIT: Note that this will read the whole file into memory in one go. In most cases that's actually fine - and it's certainly simpler than reading it line by line, processing each line as you read it. If it's an enormous file, you may need to do it that way as per T.J. Crowder's answer.
Just found out that with the Java NIO (java.nio.file.*) you can easily write:
List<String> lines=Files.readAllLines(Paths.get("/tmp/test.csv"), StandardCharsets.UTF_8);
for(String line:lines){
System.out.println(line);
}
instead of dealing with FileInputStreams and BufferedReaders...
If you want to check whether a string has unprintable characters, you can use a regular expression:
[^\p{Print}]
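For example, a small sketch applying it to each line as it is read (reusing the br/line readLine loop from the answers above; uses java.util.regex.Pattern, and note that Java's \p{Print} class matches ASCII printables unless UNICODE_CHARACTER_CLASS is enabled):
Pattern unprintable = Pattern.compile("[^\\p{Print}]");
while ((line = br.readLine()) != null) {
    if (unprintable.matcher(line).find()) {
        // this line contains at least one character outside the printable class
    }
}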
How about the following:
FileReader fileReader = new FileReader(new File("test.txt"));
BufferedReader br = new BufferedReader(fileReader);
String line = null;
// if no more lines the readLine() returns null
while ((line = br.readLine()) != null) {
// reading lines until the end of the file
}
Source: http://devmain.blogspot.co.uk/2013/10/java-quick-way-to-read-or-write-to-file.html
I can find the following ways to do it:
private static final String fileName = "C:/Input.txt";
public static void main(String[] args) throws IOException {
Stream<String> lines = Files.lines(Paths.get(fileName));
lines.toArray(String[]::new);
List<String> readAllLines = Files.readAllLines(Paths.get(fileName));
readAllLines.forEach(s -> System.out.println(s));
File file = new File(fileName);
Scanner scanner = new Scanner(file);
while (scanner.hasNext()) {
System.out.println(scanner.next());
}
The answer by @T.J. Crowder is for Java 6 - in Java 7 the valid answer is the one by @McIntosh - though the use of Charset.forName for UTF-8 is discouraged in favor of StandardCharsets.UTF_8:
List<String> lines = Files.readAllLines(Paths.get("/tmp/test.csv"),
StandardCharsets.UTF_8);
for(String line: lines){ /* DO */ }
Reminds me a lot of the Guava way posted by Skeet above - and of course the same caveats apply. That is, for big files (Java 7):
BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
for (String line = reader.readLine(); line != null; line = reader.readLine()) {}
If every char in the file is properly encoded in UTF-8, you won't have any problem reading it using a reader with the UTF-8 encoding. Up to you to check every char of the file and see if you consider it printable or not.
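As a rough sketch, assuming "unprintable" means ISO control characters or unassigned code points (adjust the condition to your own definition of printable):
static boolean isPrintable(String line) {
    for (int i = 0; i < line.length(); ) {
        int cp = line.codePointAt(i);
        // treat control characters and unassigned code points as unprintable
        if (Character.isISOControl(cp) || !Character.isDefined(cp)) {
            return false;
        }
        i += Character.charCount(cp); // advance correctly past surrogate pairs
    }
    return true;
}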