I have a 20GB folder which consists of 358 txt files, 733,019,372 lines in total, and every txt file has the format below:
77 clueweb12-0211wb-83-00000
88 clueweb12-0211wb-83-00001
82 clueweb12-0211wb-83-00002
82 clueweb12-0211wb-83-00003
64 clueweb12-0211wb-83-00004
80 clueweb12-0211wb-83-00005
83 clueweb12-0211wb-83-00006
75 clueweb12-0211wb-83-00007
My goal is to traverse all the txt files recursively, reading each file line by line, split each line into two parts (e.g. 88 and clueweb12-0211wb-83-00001), and put these parts into a LinkedHashMap<String, List<String>>. After that, the program takes docIds (e.g. clueweb12-0211wb-83-00006) from the user as an argument and prints the score belonging to each docId (83 in this case). If a non-existing docId is encountered, -1 should be returned as its score. For example:
clueweb12-0003wb-22-11553,foo,clueweb12-0109wb-78-15059,bar,clueweb12-0302wb-50-22339
should print out : 84,-1,19,-1,79
And I take the path to the folder from the user as an argument as well.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.*;
import static java.nio.file.FileVisitResult.CONTINUE;
public class App extends SimpleFileVisitor<Path>{
public LinkedHashMap<String, List<String>> list = new LinkedHashMap<>(); // Put there scores and docIds
@Override
public FileVisitResult visitFile(Path path, BasicFileAttributes attr) throws IOException {
File file = new File(path.toString());
BufferedReader br = new BufferedReader(new FileReader(file));
String line;
while((line = br.readLine()) != null){
if(list.containsKey(line.split(" ")[0])){
list.get(line.split(" ")[0]).add(line.split(" ")[1]);
}
else{
list.put(line.split(" ")[0],new ArrayList(Arrays.asList(line.split(" ")[1])));
}
}
return CONTINUE;
}
public static void main(String args[]) throws IOException {
if (args.length < 2) {
System.err.println("Usage: java App spamDir docIDs ...");
return;
}
Path spamDir = Paths.get(args[0]);
String[] docIDs = args[1].split(",");
App ap = new App();
Files.walkFileTree(spamDir, ap);
ArrayList scores = new ArrayList(); // keep scores in that list
//Search the Lists in LinkedHashMap
for(int j=0; j<docIDs.length; j++){
Set set = ap.list.entrySet();
Iterator i = set.iterator();
int counter = 0;
while(i.hasNext()){
// if LinkedHashMap has the docID add it to scores List
Map.Entry me = (Map.Entry) i.next();
ArrayList searchList = (ArrayList) me.getValue();
if(searchList.contains(docIDs[j])){
scores.add(me.getKey());
counter++;
break;
}
else {
continue;
}
}
// if LinkedHashMap has not the docId add -1 to scores List
if(counter == 0){
scores.add("-1");
}
}
String joined = String.join("," , scores);
System.out.println(joined);
}
}
But I encountered this problem:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3181)
at java.util.ArrayList.grow(ArrayList.java:261)
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
at java.util.ArrayList.add(ArrayList.java:458)
at ceng.bim208.App.visitFile(App.java:35)
at ceng.bim208.App.visitFile(App.java:18)
at java.nio.file.Files.walkFileTree(Files.java:2670)
at java.nio.file.Files.walkFileTree(Files.java:2742)
at ceng.bim208.App.main(App.java:58)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
I tried using -Xmx2048M to increase the heap size, but it doesn't solve my problem. What should I do?
In addition, if I run the program on a different path (containing 2 txt files of the same format), it works correctly.
This sounds like an interview homework assignment. I bet the task here is to take the mental leap and separate the data which you must hold in memory from the data which you can store in an index while you parse the files.
No matter how much memory you have, you'll run out of it eventually if you keep doing it like this. In your case there are some useful tips which you can use to fix this:
Don't put everything into memory. If you can, build an index in a separate file which holds only the necessary data.
Process the files as streams: this means you parse the InputStream line by line, file by file, and this way you don't have to keep them in memory.
In your case this:
public LinkedHashMap<String, List<String>> list
fills up the memory with the parsed Strings. From what I understand you don't need to store the Strings themselves but only the score. If you clarify what your task is I can help you further but currently it is not clear what your task is.
My task is to take the docIds as a command-line argument and print out their scores.
What you need is a lookup for the scores:
Map<String, Map<Integer, Integer>> docIdsWithScoresAndCounts;
or
Map<String, List<Integer>> docIdsWithScores;
depending on whether you want to count how many times a score appeared. The outer Map holds the doc ids as keys and the inner maps are lookups themselves for score -> count. This is a tricky variation of the counting sort algorithm: you only need to keep track of the doc ids and the scores of each doc id and since the scores are limited in size (how many digits can they have?) you end up with O(1) memory consumption. The rest of the data can be thrown away.
Note that you only need to store the keys of the doc ids you are interested in. You can throw away the rest.
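To make that concrete, here is a minimal sketch of the store-only-what-you-need approach, assuming the two-column "score docId" line format from the question (the class name and variable names are illustrative):
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Stream;

public class ScoreLookup {
    public static void main(String[] args) throws IOException {
        Path spamDir = Paths.get(args[0]);
        // Pre-seed the map with the requested docIds only; -1 marks "not found".
        Map<String, Integer> wanted = new LinkedHashMap<>();
        for (String id : args[1].split(",")) {
            wanted.put(id, -1);
        }
        try (Stream<Path> files = Files.walk(spamDir)) {
            for (Path p : (Iterable<Path>) files.filter(Files::isRegularFile)::iterator) {
                try (Stream<String> lines = Files.lines(p)) {
                    lines.forEach(line -> {
                        String[] parts = line.split(" ", 2); // "score docId"
                        if (parts.length == 2 && wanted.containsKey(parts[1])) {
                            wanted.put(parts[1], Integer.valueOf(parts[0]));
                        }
                    });
                }
            }
        }
        StringJoiner out = new StringJoiner(",");
        for (Integer score : wanted.values()) {
            out.add(score.toString());
        }
        System.out.println(out);
    }
}
This keeps memory proportional to the number of requested docIds rather than to the hundreds of millions of input lines.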
The following does it in the new style and corrects some faults.
public Map<String, List<String>> map = new HashMap<>();
@Override
public FileVisitResult visitFile(Path path, BasicFileAttributes attr)
throws IOException {
    try (Stream<String> lines = Files.lines(path)) {
        lines.forEach(line -> {
            String[] keyValue = line.split(" ", 2);
            map.compute(keyValue[0],
                (key, oldList) -> {
                    List<String> list = oldList == null
                            ? new ArrayList<>()
                            : oldList;
                    list.add(keyValue[1]);
                    return list;
                });
        });
    }
return CONTINUE;
}
A LinkedHashMap maintains the order of adding, which needlessly costs memory here.
The splitting should be done once.
The file should be closed; Files.lines in a try-with-resources allows terse coding.
(Tip) The Charset (encoding) is not given, hence the platform default is used. One might consider adding it as a parameter.
The Map.compute comes in handy to decide on the old value (a List) whether to create a new one.
One might save on memory by not storing a List<String> but something like List<byte[]> with the bytes something like:
byte[] bytes = keyValue[1].getBytes(Charset.defaultCharset());
String s = new String(bytes, Charset.defaultCharset());
Compared to a String of plain ASCII you'll save half the bytes (a char is two bytes), at least on JVMs before Java 9's compact strings.
Probably a database, like an embedded Java Derby or H2, will do better.
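For instance, a minimal sketch of the database route using plain JDBC with embedded H2 (this assumes the h2 jar is on the classpath; table and column names are illustrative):
import java.sql.*;

try (Connection conn = DriverManager.getConnection("jdbc:h2:./scores");
     Statement st = conn.createStatement()) {
    st.execute("CREATE TABLE IF NOT EXISTS scores(doc_id VARCHAR PRIMARY KEY, score INT)");
    try (PreparedStatement ins = conn.prepareStatement("INSERT INTO scores VALUES (?, ?)")) {
        // for each parsed line: ins.setString(1, docId); ins.setInt(2, score); ins.addBatch();
        // then ins.executeBatch() every few thousand rows
    }
}
Once loaded, each docId lookup is a single indexed SELECT instead of a scan over an in-memory map.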
Related
I'm facing this problem and don't really know how to deal with it. I need to process a CSV file that can contain 100 or 100 thousand lines.
I need to do some validations before proceeding with the processing; one of them is to check whether each document has the same typeOfDoc. Let me explain:
Content of file:
document;typeOfDoc
25693872076;2
25693872076;2
...
25693872076;1
This validation consists of checking whether a document appears with different typeOfDoc values along the file, and if it does, flagging it as invalid.
Initially I thought of two for-loops: iterate over the first occurrence of a document (which I assume is correct, because I don't know what I'm going to receive), and for that document iterate over the rest of the file to check for other occurrences of it; if the same document appears with a typeOfDoc different from its first occurrence, I store this validation on an object to show that this file has one document with two different types. But... you can imagine where that is going: quadratic time. That can't work with 100k lines, or even with 100.
Which is the better way to do that?
Something that can help.
This is how I open and process the file (try-catch, close(), and proper names omitted):
List<String> lines = new BufferedReader(new FileReader(path)).lines().skip(1).collect(Collectors.toList());
for (String line : lines) {
String[] arr = line.split(";");
String document = arr[0];
String typeOfDoc = arr[1];
for (String line2 : lines) {
String[] arr2 = line2.split(";");
String document2 = arr2[0];
String typeOfDoc2 = arr2[1];
if (document.equals(document2) && !typeOfDoc.equals(typeOfDoc2)) {
...create object to show that error on grid...
}
}
}
You can look for duplicate keys with conflicting values in a HashMap, which makes this much easier.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class App {
    public static void main(String[] args) throws IOException {
        String delimiter = ";";
        Map<String, String> map = new HashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get("somefile.txt"))) {
            // note: skip the header line with lines.skip(1) if the file has one
            lines.forEach(line -> checkAndPutInMap(line, map, delimiter));
        }
    }

    private static void checkAndPutInMap(String line, Map<String, String> map, String delimiter) {
        String[] parts = line.split(delimiter); // split once instead of twice
        String document = parts[0];
        String typeOfDoc = parts[1];
        if (map.containsKey(document) && !map.get(document).equals(typeOfDoc)) {
            // ...create object to show that error on grid...
        } else {
            map.put(document, typeOfDoc);
        }
    }
}
I am running a large loop in Java where, in every pass, data is populated in a HashMap.
The loop is very long, so I cannot hold the complete HashMap in memory. So I need to find a way to export the HashMap to a file after every 1000 iterations or so.
I was thinking about exporting the HashMap using serialization after every 1000 steps to a file, clearing the HashMap variable and repeating the process by appending the next to the same file. But the problem would then occur while retrieving the complete HashMap from the file as there would be metadata appended to the file every time I export. So is there any other way to do this?
Edit:
The HashMap structure is given below:
HashMap<Key, double[]>
Key {
String name;
BitSet set;
}
Yes. You have the right idea: flush to file and clear the map every N iterations, which would look something like this:
public void exportHashTable() throws IOException {
    HashMap<String, Object> map = new HashMap<>();
    for (int i = 0; i < iterations; i++) { // iterations = the length of your large loop
        // Some logic that populates the map ..
        map.put("key" + i, "value" + i);
        if (i % 1000 == 0) {
            appendToFile(map); // flush the current batch to disk
            map.clear();
        }
    }
    appendToFile(map); // flush the final partial batch
}
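The appendToFile helper above is not spelled out anywhere; here is a minimal sketch, assuming one "key,value" CSV line per entry (the file name and format are illustrative):
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.Map;

private void appendToFile(Map<String, Object> map) throws IOException {
    // Open in append mode so each batch adds plain lines to the same file,
    // avoiding the per-write metadata that Java serialization would add.
    try (BufferedWriter w = Files.newBufferedWriter(Paths.get("export.csv"),
            StandardCharsets.UTF_8,
            StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
        for (Map.Entry<String, Object> e : map.entrySet()) {
            w.write(e.getKey() + "," + e.getValue());
            w.newLine();
        }
    }
}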
To import, you don't have to read the entire file; you can read it line by line, provided you exported it (rather than serialized it). Say you export it as CSV or maybe even JSON. In that case you can rebuild the HashMap N rows at a time, then clear and proceed further.
public void importHashTable() throws IOException {
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
        String line;
        while ((line = br.readLine()) != null) {
            // process the line, add to hashmap or do some other operation
        }
    }
}
I have a massive file of approximately 32000 lines. I am performing some operations over its content in Java, so I created a smaller, minified version of it to test my program. That works fine, but when I use the actual file (the larger one, of 32000 lines), it explodes, saying:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at translator.MainLinkedHashMap.createLinkedHashMapFromString(MainLinkedHashMap.java:100)
at translator.MainLinkedHashMap.main(MainLinkedHashMap.java:25)
Please note the ArrayIndexOutOfBoundsException: 1.
I've been debugging and I saw in the debugger that the LinkedHashMap, where I am storing the lines of the file, has 30400 entries instead of 32000.
Is this saying that Java ran out of memory? (The file is not so big itself, 2M, but it has a lot of lines.)
Thanks.
UPDATE: Here is the code:
private static LinkedHashMap<String, String> createLinkedHashMapFromString(String rawString) {
LinkedHashMap<String, String> resultMap = new LinkedHashMap<String, String>();
String [] values = rawString.split(",");
for (int i = 0; i < values.length; i++) {
values[i] = values[i].trim();
}
String [] pair = null;
for (String value : values) {
pair = value.split("=");
resultMap.put(pair[0], pair[1]);
}
return resultMap;
}
I don't know the content of your file, but the exception is 100% thrown in this block:
for (String value : values) {
pair = value.split("=");
resultMap.put(pair[0], pair[1]);
}
in line resultMap.put(pair[0], pair[1]);
Simply put, the result of String#split is an array of just 1 element (remember that the first element of an array is at index 0), and that is why you are getting the error. I bet that not all "lines" in your file are in the form you expect them to be.
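A minimal defensive variant of that loop, which skips malformed entries instead of crashing (whether skipping or logging is right depends on your data):
for (String value : values) {
    String[] pair = value.split("=");
    if (pair.length < 2) {
        continue; // malformed entry without '=', skip it rather than crash
    }
    resultMap.put(pair[0], pair[1]);
}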
I am a beginner in Java. Basically, I have loaded each text document and stored each individual word of the document in a HashMap. After that, I tried storing all the HashMaps in an ArrayList. Now I am stuck on how to retrieve all the words from the HashMaps in the ArrayList!
private static long numOfWords = 0;
private String userInputString;
private static long wordCount(String data) {
long words = 0;
int index = 0;
boolean prevWhiteSpace = true;
while (index < data.length()) {
//Intialise character variable that will be checked.
char c = data.charAt(index++);
//Determine whether it is a space.
boolean currWhiteSpace = Character.isWhitespace(c);
//If previous is a space and character checked is not a space,
if (prevWhiteSpace && !currWhiteSpace) {
words++;
}
//Assign current character's determination of whether it is a spacing as previous.
prevWhiteSpace = currWhiteSpace;
}
return words;
} //
public static ArrayList StoreLoadedFiles()throws Exception{
final File f1 = new File ("C:/Users/Admin/Desktop/dataFiles/"); //specify the directory to load files
String data=""; //reset the words stored
ArrayList<HashMap> hmArr = new ArrayList<HashMap>(); //array of hashmap
for (final File fileEntry : f1.listFiles()) {
Scanner input = new Scanner(fileEntry); //load files
while (input.hasNext()) { //while there are still words in the document, continue to load all the words in a file
data += input.next();
input.useDelimiter("\t"); //similar to split function
} //while loop
String textWords = data.replaceAll("\\s+", " "); //remove all found whitespaces
HashMap<String, Integer> hm = new HashMap<String, Integer>(); //Creates a Hashmap that would be renewed when next document is loaded.
String[] words = textWords.split(" "); //store individual words into a String array
for (int j = 0; j < numOfWords; j++) {
int wordAppearCount = 0;
if (hm.containsKey(words[j].toLowerCase().replaceAll("\\W", ""))) { //replace non-word characters
wordAppearCount = hm.get(words[j].toLowerCase().replaceAll("\\W", "")); //remove non-word character and retrieve the index of the word
}
if (!words[j].toLowerCase().replaceAll("\\W", "").equals("")) {
//Words stored in hashmap are in lower case and have special characters removed.
hm.put(words[j].toLowerCase().replaceAll("\\W", ""), ++wordAppearCount);//index of word and string word stored in hashmap
}
}
hmArr.add(hm);//stores every single hashmap inside an ArrayList of hashmap
} //end of for loop
return hmArr; //return hashmap ArrayList
}
public static void LoadAllHashmapWords(ArrayList m){
for(int i=0;i<m.size();i++){
m.get(i); //stuck here!
}
Firstly, your logic won't work correctly. In the StoreLoadedFiles() method you iterate through the words with for (int j = 0; j < numOfWords; j++). The numOfWords field is initialized to zero, so this loop won't execute at all. You should initialize it with the length of the words array.
Having said that, to retrieve the values from each hashmap in a list of hashmaps, you should first iterate through the list, and for each hashmap take its entry set. A Map.Entry is basically the pair that you store in the hashmap. So when you invoke the map.entrySet() method it returns a java.util.Set<Map.Entry<Key, Value>>. A set is returned because the keys are unique.
So a complete program will look like.
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map.Entry;
import java.util.Scanner;
public class FileWordCounter {
public static List<HashMap<String, Integer>> storeLoadedFiles() {
final File directory = new File("C:/Users/Admin/Desktop/dataFiles/");
List<HashMap<String, Integer>> listOfWordCountMap = new ArrayList<HashMap<String, Integer>>();
Scanner input = null;
StringBuilder data;
try {
for (final File fileEntry : directory.listFiles()) {
input = new Scanner(fileEntry);
input.useDelimiter("\t");
data = new StringBuilder();
while (input.hasNext()) {
data.append(input.next());
}
input.close();
String wordsInFile = data.toString().replaceAll("\\s+", " ");
HashMap<String, Integer> wordCountMap = new HashMap<String, Integer>();
for(String word : wordsInFile.split(" ")){
String strippedWord = word.toLowerCase().replaceAll("\\W", "");
int wordAppearCount = 0;
if(strippedWord.length() > 0){
if(wordCountMap.containsKey(strippedWord)){
wordAppearCount = wordCountMap.get(strippedWord);
}
wordCountMap.put(strippedWord, ++wordAppearCount);
}
}
listOfWordCountMap.add(wordCountMap);
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} finally {
if(input != null) {
input.close();
}
}
return listOfWordCountMap;
}
public static void loadAllHashmapWords(List<HashMap<String, Integer>> listOfWordCountMap) {
for(HashMap<String, Integer> wordCountMap : listOfWordCountMap){
for(Entry<String, Integer> wordCountEntry : wordCountMap.entrySet()){
System.out.println(wordCountEntry.getKey() + " - " + wordCountEntry.getValue());
}
}
}
public static void main(String[] args) {
List<HashMap<String, Integer>> listOfWordCountMap = storeLoadedFiles();
loadAllHashmapWords(listOfWordCountMap);
}
}
Since you are a beginner in Java programming, I would like to point out a few best practices that you could start using from the beginning.
Closing resources : In your while loop to read from files you open a Scanner with Scanner input = new Scanner(fileEntry);, but you never close it. This causes resource leaks. You should always use a try-catch-finally block and close resources in the finally block (or use try-with-resources).
Avoid unnecessary redundant calls : If an operation gives the same result on every pass of a loop, try moving it outside the loop to avoid redundant calls. In your case, for example, setting the scanner delimiter with input.useDelimiter("\t"); is essentially a one-time operation after a scanner is initialized, so you could move it outside the while loop.
Use StringBuilder instead of String : Repeated string manipulation such as concatenation should be done with a StringBuilder (or StringBuffer when you need synchronization) instead of += or +. This is because String is immutable, meaning its value cannot be changed, so each concatenation creates a new String object. This results in a lot of unused instances in memory, whereas StringBuilder is mutable and can be changed in place.
Naming convention : The usual naming convention in Java is camelCase: start with a lower-case letter and capitalize the first letter of each subsequent word. So it's standard practice to name a method storeLoadedFiles as opposed to StoreLoadedFiles. (This could be opinion based ;))
Give descriptive names : It's good practice to give descriptive names. It helps in later code maintenance. For instance, it's better to use a name like wordCountMap as opposed to hm. If someone goes through your code in the future, descriptive names will give them a better and faster understanding of it. Again opinion based.
Use generics as much as possible : This avoids additional casting overhead.
Avoid repetition : Similar to point 2, if an operation produces the same result and is used multiple times, move it to a variable and use the variable. In your case you were calling words[j].toLowerCase().replaceAll("\\W", "") multiple times. The result is the same every time, but it creates unnecessary instances and repeated work. So you could move it into a String variable and use that elsewhere.
Try using the for-each loop wherever possible : This relieves us from taking care of indexing.
These are just suggestions. I tried to include most of them in my code, but I won't say it's perfect. Since you are a beginner, if you start including these best practices now, they'll get ingrained in you. Happy coding.. :)
for (HashMap<String, Integer> map : m) {
for(Entry<String,Integer> e:map.entrySet()){
//your code here
}
}
or, if using java 8 you can play with lambda
m.stream().forEach((map) -> {
map.entrySet().stream().forEach((e) -> {
//your code here
});
});
But before all that, you have to change the method signature to public static void LoadAllHashmapWords(List<HashMap<String, Integer>> m); otherwise you would have to use a cast.
P.S. Are you sure your extraction method works? I've tested it a bit and got a list of empty hashmaps every time.
I have this input:
5
it
your
reality
real
our
The first line is the number of strings coming after it. And I should store them this way (pseudocode):
associative_array = [ 2 => ['it'], 3 => ['our'], 4 => ['real', 'your'], 7 => ['reality']]
As you can see, the keys of the associative array are the lengths of the strings stored in the inner arrays.
So how can I do this in Java? I come from the PHP world, so a comparison with PHP would be very welcome.
MultiMap<Integer, String> m = new MultiHashMap<Integer, String>();
for(String item : originalCollection) {
m.put(item.length(), item);
}
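(Note: MultiMap is not a JDK type; this snippet assumes a multimap library such as Apache Commons Collections, or Guava, where the type is called Multimap.)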
djechlin already posted a better version, but here's a complete standalone example using just JDK classes:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
public class Main {
public static void main(String[] args) throws Exception{
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
String firstLine = reader.readLine();
int numOfRowsToFollow = Integer.parseInt(firstLine);
Map<Integer,Set<String>> stringsByLength = new HashMap<>(numOfRowsToFollow); //worst-case size
for (int i=0; i<numOfRowsToFollow; i++) {
String line = reader.readLine();
int length = line.length();
Set<String> alreadyUnderThatLength = stringsByLength.get(length); //int boxed to Integer
if (alreadyUnderThatLength==null) {
alreadyUnderThatLength = new HashSet<>();
stringsByLength.put(length, alreadyUnderThatLength);
}
alreadyUnderThatLength.add(line);
}
System.out.println("results: "+stringsByLength);
}
}
a sample run (input, then output) looks like this:
3
bob
bart
brett
results: {4=[bart], 5=[brett], 3=[bob]}
Java doesn't have associative arrays. But it does have HashMaps, which mostly accomplish the same goal. In your case, you can have multiple values for any given key, so what you could do is make each entry in the HashMap an array or a collection of some kind. ArrayList is a likely choice. That is:
HashMap<Integer, ArrayList<String>> words = new HashMap<Integer, ArrayList<String>>();
I'm not going to go through the code to read your list from a file or whatever, that's a different question. But just to give you the idea of how the structure would work, suppose we could hard-code the list. We could do it something like this:
ArrayList<String> set = new ArrayList<String>();
set.add("it");
words.put(Integer.valueOf(2), set);
set = new ArrayList<String>(); // a fresh list; calling clear() here would also empty the list already in the map
set.add("your");
set.add("real");
words.put(Integer.valueOf(4), set);
Etc.
In practice, you probably would regularly be adding words to an existing set. I often do that like this:
void addWord(String word)
{
Integer key=Integer.valueOf(word.length());
ArrayList<String> set=words.get(key);
if (set==null)
{
set=new ArrayList<String>();
words.put(key,set);
}
// either way we now have a set
set.add(word);
}
Side note: I often see programmers end a block like this by putting "set" back into the Hashmap, i.e. "words.put(key,set)" at the end. This is unnecessary: it's already there. When you get "set" from the Hashmap, you're getting a reference, not a copy, so any updates you make are just "there", you don't have to put it back.
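A tiny self-contained demonstration of that point (the class name is illustrative):
import java.util.ArrayList;
import java.util.HashMap;

public class ReferenceDemo {
    public static void main(String[] args) {
        HashMap<Integer, ArrayList<String>> words = new HashMap<>();
        ArrayList<String> set = new ArrayList<>();
        words.put(2, set);                // the map now holds a reference to 'set'
        set.add("it");                    // mutating 'set' mutates what the map sees
        System.out.println(words.get(2)); // prints [it], with no second put needed
    }
}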
Disclaimer: This code is off the top of my head. No warranties expressed or implied. I haven't written any Java in a while so I may have syntax errors or wrong function names. :-)
As your key appears to be a small integer, you could use a list of lists. In this case the simplest solution is to hand-roll a multimap like
Map<Integer, Set<String>> stringsByLength = new LinkedHashMap<>();
for (String s : strings) {
    Integer len = s.length();
    Set<String> set = stringsByLength.get(len);
    if (set == null)
        stringsByLength.put(len, set = new LinkedHashSet<>());
    set.add(s);
}
private HashMap<Integer, List<String>> map = new HashMap<Integer, List<String>>();
void addStringToMap(String s) {
int length = s.length();
if (map.get(length) == null) {
map.put(length, new ArrayList<String>());
}
map.get(length).add(s);
}
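If Java 8 is available, the same idea collapses to one line with computeIfAbsent (an alternative to the above, using the same map field):
void addStringToMap(String s) {
    // Creates the list on first use, then appends; no explicit null check needed.
    map.computeIfAbsent(s.length(), k -> new ArrayList<String>()).add(s);
}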