Read data from multiple files and apply business logic - java

Hi all, please help me achieve this scenario: I have multiple files like aaa.txt, bbb.txt, and ccc.txt with data as follows.
aaa.txt:
100110,StringA,22
200110,StringB,2
300110,StringC, 12
400110,StringD,34
500110,StringE,423
bbb.txt as:
100110,StringA,20.1
200110,StringB,2.1
300110,StringC, 12.2
400110,StringD,3.2
500110,StringE,42.1
and ccc.txt as:
100110,StringA,2.1
200110,StringB,2.1
300110,StringC, 11
400110,StringD,3.2
500110,StringE,4.1
Now I have to read all three files (they are huge) and report the result as:
100110: (22, 20.1, 2.1)
The issue is the size of the files and how to achieve this in an optimized way.

I assume you have some sort of code to handle reading the files line by line, so I'll pseudocode a scanner that can keep pulling lines.
The easiest way to handle this would be to use a Map. In this case, I'll just use a HashMap.
HashMap<String, String[]> map = new HashMap<>();

while (aaa.hasNextLine()) {
    String[] lineContents = aaa.nextLine().split(",");
    String[] array = new String[3];
    array[0] = lineContents[2].trim();
    map.put(lineContents[0], array);
}

while (bbb.hasNextLine()) {
    String[] lineContents = bbb.nextLine().split(",");
    String[] array = map.get(lineContents[0]);
    if (array == null) {
        // ID not seen in aaa.txt; register it now
        array = new String[3];
        map.put(lineContents[0], array);
    }
    array[1] = lineContents[2].trim();
}

// same for ccc, writing into index 2
To make this concurrent, you would probably use a concurrent map such as ConcurrentHashMap.
Then you'd create 3 threads that just read and put, as sketched below.
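A minimal sketch of that threaded variant, assuming unique IDs per file and Java 11+; the file names come from the question, while the pool size and error handling are illustrative:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

public class ConcurrentMerge {
    public static void main(String[] args) throws InterruptedException {
        List<String> files = List.of("aaa.txt", "bbb.txt", "ccc.txt");
        ConcurrentMap<String, String[]> map = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(files.size());
        for (int i = 0; i < files.size(); i++) {
            final int column = i;                 // this file's slot in the value array
            final String file = files.get(i);
            pool.submit(() -> {
                try (Stream<String> lines = Files.lines(Paths.get(file))) {
                    lines.forEach(line -> {
                        String[] parts = line.split(",");
                        // computeIfAbsent creates the shared array atomically;
                        // each thread only ever writes its own column
                        map.computeIfAbsent(parts[0], k -> new String[3])[column] = parts[2].trim();
                    });
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        // an ID missing from one of the files leaves "null" in that slot
        map.forEach((id, values) -> System.out.println(id + ": (" + String.join(", ", values) + ")"));
    }
}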

Unless you are doing a lot of processing on loading these files, or are reading a lot of smaller files, it might work better as a sequential operation.

If your files are all ordered by ID, simply maintain an array of Scanner or BufferedReader objects pointing at your files, read the lines one by one, and write the result to an output file as you go (see the sketch below).
Doing so, you will only keep in memory as many lines as there are files. It is both time and memory efficient.
If your files are not ordered, you can sort them first, e.g. with the Unix sort command.
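A rough sketch of that approach, assuming every file lists exactly the same IDs in the same order (the file names are the ones from the question):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SortedMerge {
    public static void main(String[] args) throws IOException {
        String[] names = {"aaa.txt", "bbb.txt", "ccc.txt"};
        BufferedReader[] readers = new BufferedReader[names.length];
        for (int i = 0; i < names.length; i++) {
            readers[i] = Files.newBufferedReader(Paths.get(names[i]));
        }
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("result.txt")))) {
            String line;
            while ((line = readers[0].readLine()) != null) {
                String[] first = line.split(",");
                StringBuilder sb = new StringBuilder(first[0]).append(": (").append(first[2].trim());
                // assumption: the other files carry the same ID on the same line number
                for (int i = 1; i < readers.length; i++) {
                    sb.append(", ").append(readers[i].readLine().split(",")[2].trim());
                }
                out.println(sb.append(')'));
            }
        } finally {
            for (BufferedReader r : readers) {
                r.close();
            }
        }
    }
}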

Related

It takes too much time to operate on BLOB

I have to read a BLOB column which contains only text. It worked quite efficiently before (reading 100k BLOBs in 3 minutes), but it is taking an awful amount of time in a different environment, although with the same hardware.
Here's my code:
while (rs.next()) {
    is = rs.getBinaryStream(3);
    while ((len = is.read(buffer)) != -1) {
        baos.write(buffer, 0, len);
    }
    is.close();
    blobByte = baos.toByteArray();
    baos.close();
    String blob = new String(blobByte);
    String msisdn = rs.getString(2);
    blobData = blob.split("\\|");
    // some operations
}
I took a jstack dump every 5 seconds and found the application always on this line:
blobData = blob.split("\\|");
and sometimes on:
new String(blobByte);
My Java options:
-ms10g -mx12g -XX:NewSize=1g -XX:MaxNewSize=1g
Is some part of my code un-optimized? Or is there a significantly efficient way to read BLOB?
You get an InputStream for a BLOB precisely to avoid having the entire BLOB data in memory. But then, you do the exact opposite:
You use a ByteArrayOutputStream to transfer the whole data into a byte[] array. Note that the data then exists twice in memory, once inside ByteArrayOutputStream's own buffer, then in the copy created and returned by baos.toByteArray().
Then, you convert the entire array into a potentially humongous String via new String(blobByte), creating a third copy of the entire data (including the charset conversion).
split("\\|") will run over the entire String, creating substrings for each sequence between the delimiters, which implies yet another copy of the entire data into the substrings (minus the delimiter characters). By then, you have four copies of the entire data in memory; depending on the source's buffering, it might be five. Additionally, an array containing references to all these substrings is created and populated.
Not every copy operation can be avoided, but we can avoid having the entire data in memory at once:
try (Scanner s = new Scanner(is).useDelimiter("\\|")) {
    while (s.hasNext()) {
        String next = s.next();
        System.out.println(next); // replace with actual processing
    }
}
When you are able to process items individually, not keeping a reference to the previous item(s), these strings may get garbage collected, with a minor collection in the best case.
Even when your processing requires a String[] array with all elements, which makes one copy of the entire data (in the form of individual strings) unavoidable, you can still avoid all the other copies:
try (Scanner s = new Scanner(is).useDelimiter("\\|")) {
    List<String> list = new ArrayList<>();
    while (s.hasNext()) list.add(s.next());
    System.out.println(list); // replace with actual processing as List
    String[] array = list.toArray(new String[0]); // when an array really is required
}
Starting with Java 9, you can use
try (Scanner s = new Scanner(is).useDelimiter("\\|")) {
    List<String> list = s.tokens().collect(Collectors.toList());
    System.out.println(list); // replace with actual processing as List
}
or
try (Scanner s = new Scanner(is).useDelimiter("\\|")) {
    String[] array = s.tokens().toArray(String[]::new);
    System.out.println(Arrays.toString(array)); // replace with actual processing
}
But processing the elements individually, without holding all of them in memory, is the preferred way.
Another possible optimization is to avoid multiple (internal) Pattern.compile("\\|") calls by doing it once yourself and passing the prepared Pattern instead of the "\\|" string to the useDelimiter method.
Note that all of these examples use the system's default charset, just like your original code. Since the default charset of the environment running your code is not necessarily the same as the database's, you should be explicit, i.e. use new Scanner(is, charset), just like you should have used new String(blobByte, charset) in your original code instead of new String(blobByte).
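Putting both tips together, a sketch (assumptions: UTF-8 stands in for the real database charset, process is a placeholder, and Scanner(InputStream, Charset) needs Java 10+; older versions take the charset name as a String):

// compiled once, e.g. before the ResultSet loop, instead of on every useDelimiter("\\|") call
Pattern delimiter = Pattern.compile("\\|");

try (Scanner s = new Scanner(is, StandardCharsets.UTF_8).useDelimiter(delimiter)) {
    while (s.hasNext()) {
        process(s.next()); // stand-in for the actual processing
    }
}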
Or you use a CLOB in the first place.

Java Reference while adding elements to a list

First of all, thanks for the help.
I'm aware of Java's reference-passing mechanism, and I need to read one million lines (a word plus a list of integers per line) from a text file and put them into some structures that are class attributes: one HashMap and two ArrayLists.
The problem is that with the code below, written to save memory by reusing the list "termine_frequenza", whenever I get an element from the "frequency" ArrayList or the "dictionaryMapTD" HashMap, the list that comes back is always the last list that I added.
Moving the declaration of the ArrayList "termine_frequenza" into the while loop obviously solves the problem, but then I get a predictable "GC overhead limit exceeded" error because of the repeated allocation (I tried to increase the heap and to disable the limit, but the GC saturates the CPU trying to free memory).
The question is simple: how can I save memory and at the same time read the data correctly? Thanks.
// Class attributes
private HashMap<String, ArrayList> dictionaryMapTD;
private ArrayList<String> words;
private ArrayList<ArrayList> frequency;

// This is the code of a method of the class that reads from a file
br = new BufferedReader(new FileReader("dictionary.txt"));
s = br.readLine();
String[] splitted;
ArrayList<Integer> termine_frequenza = new ArrayList<>();
while (s != null)
{
    termine_frequenza.clear();
    splitted = s.split(" ");
    words.add(splitted[0]);
    for (int i = 1; i < splitted.length; i++)
    {
        termine_frequenza.add(Integer.valueOf(splitted[i]));
    }
    frequency.add(termine_frequenza);
    dictionaryMapTD.put(splitted[0], termine_frequenza);
    s = br.readLine();
}
// END
Change your Xms/Xmx parameters in your eclipse.ini file.
I set them to -Xms256m -Xmx7024m for 3,000,000 records.
If that has no effect, then set the parameters for the application itself.
In Eclipse, go to
Run Configurations -> Arguments -> VM Arguments
for your application and put
-Xms256m
-Xmx7024m
Then in your code move
termine_frequenza = new ArrayList<>();
inside the while loop and remove
termine_frequenza.clear();
The GC should not complain.
In my case it runs for 7,000,000 records.
Let me know if it helps.
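For reference, a sketch of the corrected loop with the question's own variable names — a fresh list per line means the map and the frequency list no longer all point at the last list read:

String s = br.readLine();
while (s != null)
{
    String[] splitted = s.split(" ");
    // a fresh list per line; reusing one list is what made every
    // entry show the last line's numbers
    ArrayList<Integer> termine_frequenza = new ArrayList<>(splitted.length - 1);
    words.add(splitted[0]);
    for (int i = 1; i < splitted.length; i++)
    {
        termine_frequenza.add(Integer.valueOf(splitted[i]));
    }
    frequency.add(termine_frequenza);
    dictionaryMapTD.put(splitted[0], termine_frequenza);
    s = br.readLine();
}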

Java: Getting the 500 most common words in a text via HashMap

I'm storing my wordcount into the value field of a HashMap, how can I then get the 500 top words in the text?
public ArrayList<String> topWords(int numberOfWordsToFind, ArrayList<String> theText) {
    //ArrayList<String> frequentWords = new ArrayList<String>();
    ArrayList<String> topWordsArray = new ArrayList<String>();
    HashMap<String, Integer> frequentWords = new HashMap<String, Integer>();
    int wordCounter = 0;
    for (int i = 0; i < theText.size(); i++) {
        if (frequentWords.containsKey(theText.get(i))) {
            // find value and increment
            wordCounter = frequentWords.get(theText.get(i));
            wordCounter++;
            frequentWords.put(theText.get(i), wordCounter);
        } else {
            // new word
            frequentWords.put(theText.get(i), 1);
        }
    }
    for (int i = 0; i < theText.size(); i++) {
        if (frequentWords.containsKey(theText.get(i))) {
            // what to write here?
            frequentWords.get(theText.get(i));
        }
    }
    return topWordsArray;
}
One other approach you may wish to look at is to think of this another way: is a Map really the right conceptual object here? It may be good to think of this as being a good use of a much-neglected-in-Java data structure, the bag. A bag is like a set, but allows an item to be in the set multiple times. This simplifies the 'adding a found word' very much.
Google's guava-libraries provides a Bag structure, though there it's called a Multiset. Using a Multiset, you could just call .add() once for each word, even if it's already in there. Even easier, though, you could throw your loop away:
Multiset<String> words = HashMultiset.create(theText);
Now you have a Multiset, what do you do? Well, you can call entrySet(), which gives you a collection of Multiset.Entry objects. You can then stick them in a List (they come in a Set), and sort them using a Comparator. Full code might look like this (using a few other fancy Guava features to show them off):
Multiset<String> words = HashMultiset.create(theText);
List<Multiset.Entry<String>> wordCounts = Lists.newArrayList(words.entrySet());
Collections.sort(wordCounts, new Comparator<Multiset.Entry<String>>() {
    public int compare(Multiset.Entry<String> left, Multiset.Entry<String> right) {
        // Note reversal of 'right' and 'left' to get descending order
        // (getCount() returns a primitive int, so compare rather than compareTo)
        return Integer.compare(right.getCount(), left.getCount());
    }
});
// wordCounts now contains all the words, sorted by count descending

// Take the first 50 entries (alternative: use a loop; this is simple because
// it copes easily with < 50 elements)
Iterable<Multiset.Entry<String>> first50 = Iterables.limit(wordCounts, 50);

// Guava-ey alternative: use a Function and Iterables.transform, but in this case
// the 'manual' way is probably simpler:
for (Multiset.Entry<String> entry : first50) {
    topWordsArray.add(entry.getElement()); // the question's result list
}
and you're done!
There are guides showing how to sort a HashMap by its values; after the sorting you can just iterate over the first 500 entries.
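For instance, with Java 8 streams — a sketch reusing frequentWords and numberOfWordsToFind from the question:

List<String> topWords = frequentWords.entrySet().stream()
        .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
        .limit(numberOfWordsToFind) // e.g. 500
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());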
Take a look at the TreeBidiMap provided by the Apache Commons Collections package. http://commons.apache.org/collections/api-release/org/apache/commons/collections/bidimap/TreeBidiMap.html
It allows you to sort the map according to both the key or the value set.
Hope it helps.
Zhongxian

Working with a List of Lists in Java

I'm trying to read a CSV file into a list of lists (of strings), pass it around for getting some data from a database, build a new list of lists of new data, then pass that list of lists so it can be written to a new CSV file. I've looked all over, and I can't seem to find an example on how to do it.
I'd rather not use simple arrays since the files will vary in size and I won't know what to use for the dimensions of the arrays. I have no issues dealing with the files. I'm just not sure how to deal with the list of lists.
Most of the examples I've found will create multi-dimensional arrays or perform actions inside the loop that's reading the data from the file. I know I can do that, but I want to write object-oriented code. If you could provide some example code or point me to a reference, that would be great.
ArrayList<ArrayList<String>> listOLists = new ArrayList<ArrayList<String>>();
ArrayList<String> singleList = new ArrayList<String>();
singleList.add("hello");
singleList.add("world");
listOLists.add(singleList);
ArrayList<String> anotherList = new ArrayList<String>();
anotherList.add("this is another list");
listOLists.add(anotherList);
Here's an example that reads a list of CSV strings into a list of lists and then loops through that list of lists and prints the CSV strings back out to the console.
import java.util.ArrayList;
import java.util.List;

public class ListExample
{
    public static void main(final String[] args)
    {
        // sample CSV strings...pretend they came from a file
        String[] csvStrings = new String[] {
            "abc,def,ghi,jkl,mno",
            "pqr,stu,vwx,yz",
            "123,345,678,90"
        };

        List<List<String>> csvList = new ArrayList<List<String>>();

        // pretend you're looping through lines in a file here
        for (String line : csvStrings)
        {
            String[] linePieces = line.split(",");
            List<String> csvPieces = new ArrayList<String>(linePieces.length);
            for (String piece : linePieces)
            {
                csvPieces.add(piece);
            }
            csvList.add(csvPieces);
        }

        // write the CSV back out to the console
        for (List<String> csv : csvList)
        {
            // dumb logic to place the commas correctly
            if (!csv.isEmpty())
            {
                System.out.print(csv.get(0));
                for (int i = 1; i < csv.size(); i++)
                {
                    System.out.print("," + csv.get(i));
                }
            }
            System.out.print("\n");
        }
    }
}
Pretty straightforward I think. Just a couple points to notice:
I recommend using "List" instead of "ArrayList" on the left side when creating list objects. It's better to pass around the interface "List" because then if later you need to change to using something like Vector (e.g. you now need synchronized lists), you only need to change the line with the "new" statement. No matter what implementation of list you use, e.g. Vector or ArrayList, you still always just pass around List<String>.
In the ArrayList constructor, you can leave the list empty and it will default to a certain size and then grow dynamically as needed. But if you know how big your list might be, you can sometimes save some performance. For instance, if you knew there were always going to be 500 lines in your file, then you could do:
List<List<String>> csvList = new ArrayList<List<String>>(500);
That way you would never waste processing time waiting for your list to grow dynamically. This is why I pass "linePieces.length" to the constructor. Not usually a big deal, but helpful sometimes.
Hope that helps!
If you really want to handle CSV files properly in Java, it's not a good idea to implement a CSV reader/writer yourself. Check out the library below.
http://opencsv.sourceforge.net/
When your CSV document includes double quotes or newlines, you will face difficulties.
If you want to learn an object-oriented approach, reading an existing implementation (in Java) will help. Also, managing one row as a bare List isn't a good way to go: CSV doesn't allow rows to have different column counts.
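A minimal opencsv sketch (assumptions: the current com.opencsv package name — older SourceForge-era releases used au.com.bytecode.opencsv — and its classic readNext() loop):

import com.opencsv.CSVReader; // au.com.bytecode.opencsv in older releases
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OpenCsvExample {
    public static void main(String[] args) throws Exception {
        List<List<String>> rows = new ArrayList<>();
        try (CSVReader reader = new CSVReader(new FileReader("something.csv"))) {
            String[] line;
            while ((line = reader.readNext()) != null) {
                rows.add(Arrays.asList(line)); // quoting and embedded newlines handled for you
            }
        }
        System.out.println(rows);
    }
}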
The example provided by #tster shows how to create a list of lists. I will provide an example of iterating over such a list.

Iterator<List<String>> iter = listOlist.iterator();
while (iter.hasNext()) {
    Iterator<String> siter = iter.next().iterator();
    while (siter.hasNext()) {
        String s = siter.next();
        System.out.println(s);
    }
}
Something like this would work for reading:
String filename = "something.csv";
BufferedReader input = null;
List<List<String>> csvData = new ArrayList<List<String>>();
try
{
    input = new BufferedReader(new FileReader(filename));
    String line = null;
    while ((line = input.readLine()) != null)
    {
        String[] data = line.split(",");
        csvData.add(Arrays.asList(data)); // Arrays.toList doesn't exist; asList is the real method
    }
}
catch (Exception ex)
{
    ex.printStackTrace();
}
finally
{
    if (input != null)
    {
        try { input.close(); } catch (IOException ignored) {}
    }
}
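On Java 7+, a sketch of the same loop letting try-with-resources handle the close:

List<List<String>> csvData = new ArrayList<>();
try (BufferedReader input = new BufferedReader(new FileReader("something.csv")))
{
    String line;
    while ((line = input.readLine()) != null)
    {
        // copy into a real ArrayList; Arrays.asList alone is fixed-size
        csvData.add(new ArrayList<>(Arrays.asList(line.split(","))));
    }
}
catch (IOException ex)
{
    ex.printStackTrace();
}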
I'd second what xrath said - you're better off using an existing library to handle reading / writing CSV.
If you do plan on rolling your own framework, I'd also suggest not using List<List<String>> as your implementation - you'd probably be better off implementing CSVDocument and CSVRow classes (that may internally use a List<CSVRow> or List<String> respectively), though for users, only expose an immutable List or an array.
Simply using List<List<String>> leaves too many unchecked edge cases and relies on implementation details - like, are headers stored separately from the data? Or are they in the first row of the List<List<String>>? What if I want to access data by column header rather than by index?
What happens when you call things like:
// reads CSV data, 5 rows, 5 columns
List<List<String>> csvData = readCSVData();
csvData.get(1).add("extraDataAfterColumn");
// now row 1 has a value in (nonexistant) column 6
csvData.get(2).remove(3);
// values in columns 4 and 5 moved to columns 3 and 4,
// attempting to access column 5 now throws an IndexOutOfBoundsException.
You could attempt to validate all this when writing out the CSV file, and this may work in some cases... but in others, you'll be alerting the user of an exception far away from where the erroneous change was made, resulting in difficult debugging.
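A bare-bones sketch of that idea; CSVDocument and CSVRow are the hypothetical classes named above, not an existing library:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical wrapper: rows are fixed-width and exposed read-only
final class CSVRow {
    private final List<String> cells;

    CSVRow(List<String> cells, int expectedWidth) {
        if (cells.size() != expectedWidth) {
            throw new IllegalArgumentException(
                "expected " + expectedWidth + " columns, got " + cells.size());
        }
        this.cells = new ArrayList<>(cells);
    }

    public List<String> cells() {
        return Collections.unmodifiableList(cells); // callers can't add or remove columns
    }
}

final class CSVDocument {
    private final List<String> header;
    private final List<CSVRow> rows = new ArrayList<>();

    CSVDocument(List<String> header) {
        this.header = new ArrayList<>(header);
    }

    public void addRow(List<String> cells) {
        rows.add(new CSVRow(cells, header.size())); // width validated at insert time
    }

    public List<CSVRow> rows() {
        return Collections.unmodifiableList(rows);
    }
}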
import java.util.ArrayList;
import java.util.List;

public class Test {
    public static void main(String[] args) {
        List<Integer> ls = new ArrayList<>();
        ls.add(1);
        ls.add(2);
        List<Integer> ls1 = new ArrayList<>();
        ls1.add(3);
        ls1.add(4);
        List<List<Integer>> ls2 = new ArrayList<>();
        ls2.add(ls);
        ls2.add(ls1);
        List<List<List<Integer>>> ls3 = new ArrayList<>();
        ls3.add(ls2);
        methodRecursion(ls3);
    }

    // walks arbitrarily nested lists and prints the leaf elements
    private static void methodRecursion(List<?> ls3) {
        for (Object ls4 : ls3) {
            if (ls4 instanceof List) {
                methodRecursion((List<?>) ls4);
            } else {
                System.out.print(ls4);
            }
        }
    }
}
Also, here is an example of how to print a list of lists using the enhanced for loop:
public static void main(String[] args) {
    int[] a = {1, 3, 7, 8, 3, 9, 2, 4, 10};
    List<List<Integer>> triplets;
    // sumOfThreeNaive is assumed to be defined elsewhere; it returns
    // all triplets from the array that sum to the given target
    triplets = sumOfThreeNaive(a, 13);
    for (List<Integer> list : triplets) {
        for (int triplet : list) {
            System.out.print(triplet + " ");
        }
        System.out.println();
    }
}

How can I form an ordered list of values extracted from HashMap?

My problem is actually more nuanced than the question suggests, but wanted to keep the header brief.
I have a HashMap<String, File> of File objects as values. The keys are String name fields which are part of the File instances. I need to iterate over the values in the HashMap and return them as a single String.
This is what I have currently:
private String getFiles()
{
    Collection<File> fileCollection = files.values();
    StringBuilder allFilesString = new StringBuilder();
    for (File file : fileCollection) {
        allFilesString.append(file.toString());
    }
    return allFilesString.toString();
}
This does the job, but ideally I want the separate File values to be appended to the StringBuilder in order of int fileID, which is a field of the File class.
Hope I've made that clear enough.
Something like this should work:
List<File> fileCollection = new ArrayList<File>(files.values());
Collections.sort(fileCollection, new Comparator<File>()
{
    public int compare(File fileA, File fileB)
    {
        final int retVal;
        if (fileA.fileID > fileB.fileID)
        {
            retVal = 1;
        }
        else if (fileA.fileID < fileB.fileID)
        {
            retVal = -1;
        }
        else
        {
            retVal = 0;
        }
        return retVal;
    }
});
Unfortunately there is no way of getting data out of a HashMap in any particular order. You have to either put all the values into a TreeSet with a Comparator that uses the fileID, or put them into an ArrayList and sort them with Collections.sort, again with a Comparator that compares the way you want.
The TreeSet method doesn't work if there are any duplicate IDs, and it may be overkill since you're not going to be adding things to or removing things from the set. Collections.sort is a good solution for cases like this, where you take the whole HashMap's values, sort them, and toss the sorted collection away as soon as you've generated the result.
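A sketch of the TreeSet route (File here is the asker's own class with a getFileId() accessor, not java.io.File):

Set<File> sorted = new TreeSet<File>(new Comparator<File>() {
    public int compare(File a, File b) {
        return Integer.compare(a.getFileId(), b.getFileId());
    }
});
sorted.addAll(files.values());
// caveat from above: files sharing a fileID would collapse to one entry

StringBuilder allFilesString = new StringBuilder();
for (File file : sorted) {
    allFilesString.append(file); // iterates ascending by fileID
}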
OK, this is what I've come up with. Seems to solve the problem, returns a String with the File objects nicely ordered by their fileId.
public String getFiles()
{
    List<File> fileList = new ArrayList<File>(files.values());
    Collections.sort(fileList, new Comparator<File>()
    {
        public int compare(File fileA, File fileB)
        {
            if (fileA.getFileId() > fileB.getFileId())
            {
                return 1;
            }
            else if (fileA.getFileId() < fileB.getFileId())
            {
                return -1;
            }
            return 0;
        }
    });

    StringBuilder allFilesString = new StringBuilder();
    for (File file : fileList) {
        allFilesString.append(file.toString());
    }
    return allFilesString.toString();
}
I've never used Comparator before (relatively new to Java), so would appreciate any feedback if I've implemented anything incorrectly.
Why not collect it in an array, sort it, then concatenate it?
-- MarkusQ
You'll have to add your values() Collection to an ArrayList and sort it using Collections.sort() with a custom Comparator instance before iterating over it.
BTW, note that it's pointless to initialize the StringBuffer with the size of the collection, since you'll be adding far more than 1 character per collection element.
Create a temporary List, then add each pair of data into it. Sort it with Collections.sort() according to your custom comparator then you will have the List in your desired order.
Here is the method you're looking for: http://java.sun.com/javase/6/docs/api/java/util/Collections.html#sort(java.util.List,%20java.util.Comparator)
I created LinkedHashMap-like classes a dozen times before LinkedHashMap was added to the collections framework.
What you probably want to do is create your own "TreeHashMap"-style collection: a HashMap for lookups paired with a tree-sorted view for ordered iteration.
Creating a second collection and appending anything added to both isn't really a size hit, and you get the performance of both (at the cost of a little extra time on each add).
Doing it as a new collection helps your code stay clean and neat. The collection class should just be a few lines long, and should simply replace your existing HashMap.
If you get in the habit of always wrapping your collections, this stuff just works; you never even think about it.
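A sketch of that wrapping idea (FileStore is a hypothetical name, and File is again the asker's own class, not java.io.File): every add feeds both a HashMap and a TreeMap keyed by fileID:

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class FileStore {
    private final Map<String, File> byName = new HashMap<>();
    private final SortedMap<Integer, File> byId = new TreeMap<>();

    public void add(String name, File file) {
        byName.put(name, file);           // fast lookup by name
        byId.put(file.getFileId(), file); // always-sorted view by ID
    }

    public File get(String name) {
        return byName.get(name);
    }

    public Collection<File> inIdOrder() {
        return byId.values(); // iterates ascending by fileID
    }
}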
StringBuffer allFilesString = new StringBuffer(fileCollection.size());
Unless each file.toString() averages one character, you are probably making the StringBuffer too small. (If you can't size it correctly, you may as well not set it and keep the code simpler.) You may get better results by making it some multiple of the collection size. Additionally, StringBuffer is synchronized, while StringBuilder is not and is therefore more efficient here.
Remove the unnecessary if statement:
List<File> fileCollection = new ArrayList<File>(files.values());
Collections.sort(fileCollection, new Comparator<File>() {
    public int compare(File a, File b) {
        // Integer.compare avoids the overflow that plain subtraction
        // (a.fileID - b.fileID) can suffer for extreme values
        return Integer.compare(a.fileID, b.fileID);
    }
});
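On Java 8+, the whole anonymous class collapses to a one-liner (assuming fileID is accessible as in the snippet above):

fileCollection.sort(Comparator.comparingInt(f -> f.fileID)); // ascending by fileID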
