I'm reading values from a file and storing them in a HashMap, using a BufferedReader, in the following manner --
while ((String str = buffread.readLine()).length() > 1)
{
    hashMap.put(str.substring(0, 5), str);
}
I can also verify that the hashmap has all data that was initially present in the file.
Now, I'm trying to write the values of that exact HashMap to another file in the following manner --
FileWriter outFile = new FileWriter("file path");
PrintWriter out = new PrintWriter(outFile);
Set entries = hashMap.entrySet();
Iterator entryIter = entries.iterator();
while (entryIter.hasNext()) {
    Map.Entry entry = (Map.Entry) entryIter.next();
    Object value = entry.getValue(); // Get the value.
    out.println(value.toString());
}
But this seems to write fewer entries into the file than the value of hashMap.size(), or essentially, fewer than the number of entries that were initially read from the source file.
I have a hunch that it's because of the PrintWriter and FileWriter; if anyone could point me to why this issue is occurring, it would be of great help.
Regards
p1nG
Perhaps you left this out of the code you posted, but are you explicitly calling flush() and close() on the PrintWriter/FileWriter objects when you are done with them?
Each call to println() does not necessarily cause a line to be written to the underlying OutputStream/file.
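For example, a minimal sketch (the Map type and file path are placeholders from the question) that lets try-with-resources flush and close the writer for you:

// Sketch only: try-with-resources closes (and therefore flushes) the PrintWriter
// and the underlying FileWriter even if an exception is thrown mid-loop.
public void writeValues(Map<String, String> hashMap, String path) throws IOException {
    try (PrintWriter out = new PrintWriter(new FileWriter(path))) {
        for (Map.Entry<String, String> entry : hashMap.entrySet()) {
            out.println(entry.getValue());
        }
    } // out is flushed and closed here, so no buffered lines are lost
}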
Unless the first 5 characters on every line of your source file are unique, this line
hashMap.put(str.substring(0,5),str);
will ensure you're overwriting some entries in the Map.
There is a possibility that something fails when writing to file:
Methods in this (PrintWriter) class never throw I/O exceptions. The client may inquire as to whether any errors have occurred by invoking checkError().
In general, I don't think it's a problem with HashMap, as you said that the data was read correctly.
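For instance, a quick illustrative check right after each println call would surface a swallowed I/O error:

out.println(value.toString());
if (out.checkError()) {
    // PrintWriter swallows IOExceptions; checkError() flushes and reports whether one occurred
    System.err.println("An I/O error occurred while writing the entry");
}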
You can't possibly read a file correctly with that code. You have to check the result of readLine() for null before you do anything else with it, unless you like catching NullPointerExceptions of course.
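A corrected read loop might look like this sketch, which keeps the question's 5-character key assumption but checks for null and guards the substring call:

String str;
while ((str = buffread.readLine()) != null) {
    if (str.length() >= 5) {              // guard the substring(0, 5) call
        hashMap.put(str.substring(0, 5), str);
    }
}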
You don't need the Iterator at this point; just use the key set and iterate over it:
Set<String> keys = hashMap.keySet();
for (String key : keys) {
    out.println(hashMap.get(key));
}
should do it.
Related
I need to check that a List of Strings contains certain predefined strings and, in case all these predefined strings are contained in the list, I need to write the list to a File.
As a first approach I thought to do something like
if(doesTheListContainPredefinedStrings(list))
writeListIntoFile(list);
Where doesTheListContainPredefinedStrings and writeListIntoFile execute loops to check for the predefinedStrings and to write every element of the list to a file, respectively.
But, since in this case I have to worry about performance, I wanted to leverage the fact that in the doesTheListContainPredefinedStrings method I'm already iterating over the elements of the list once.
I also thought about something like
String[] predefinedStrings = {...};
...
PrintWriter pw = new PrintWriter(new FileWriter("fileName"));
int predefinedStringsFound = 0;
for (String string : list)
{
    if (Arrays.asList(predefinedStrings).contains(string))
        predefinedStringsFound++;
    pw.println(string);
}
if (predefinedStringsFound == predefinedStrings.length)
    pw.close();
I added the close() call because I observed that, at least on the system where I'm developing (Ubuntu 19.04), if I don't close the stream the strings aren't written to the file.
Nevertheless, this solution seems really bad, and the file would still be created, so if the list didn't pass the check I'd have to delete it (which requires another access to the storage) anyway.
Could someone suggest a better/the best approach to this and explain why it is better/the best?
Check the reverse case: is any string from predefs missing from the list of strings to check?
Collection<String> predefs; // Your certain predefined strings
List<String> list;          // Your list of strings to check

if (!predefs.parallelStream().anyMatch(s -> !list.contains(s)))
    writeListIntoFile(list);
The above stream pipeline stops as soon as the first string from predefs can't be found in the list of strings to check and returns true; you must not write the file in this case.
It does not check whether the list of strings to check contains any additional strings that are not in predefs.
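If it helps, an equivalent sketch (the "fileName" path is a placeholder from the question) uses containsAll, so the file is only ever created once the check has already passed:

if (list.containsAll(predefs)) {
    // The file is only created here, after the check has succeeded.
    try (PrintWriter pw = new PrintWriter(new FileWriter("fileName"))) {
        for (String s : list) {
            pw.println(s);
        }
    } // closed (and flushed) automatically
}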
I am trying to write a huge amount of data, around 64000 records at a time, to a file. I am getting the exception that I attached below.
The code that I used to write it is:
Path outputpath = Paths.get("file1.json");
try (BufferedWriter writer = Files.newBufferedWriter(outputpath, StandardCharsets.UTF_8, WRITE)) {
    writer.write(jsonObject.toString());
} catch (Exception e) {
    //error msg
}
Here my "jsonObject" is nothing but a json Array which contains 65000 rows .
Can you please help me to write this to my file in an efficient way ,so that I can avoid that heap Space Error.
You've cut your stacktrace a little bit short, but I'll assume the exception happens in jsonObject.toString().
Basically, you have to decide between two things: either allocate more memory or break the big operation into several smaller ones. Adding memory is quick and simple, but if you expect even more data in the future, it won't solve your problem forever. As others have mentioned, use -Xmx and/or -Xms on java command line.
The next thing you could try is to use a different JSON library. Perhaps the one you are using now is not particularly suited for large JSON objects. Or there might be a newer version.
As the last resort, you can always construct the JSON yourself. It's a string after all, and you already have the data in memory. To be as efficient as possible, you don't even need to build the entire string at once, you could just go along and write bits and pieces to your BufferedWriter.
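For example, assuming the object from the question is really an org.json JSONArray (called jsonArray below for clarity), and reusing the outputpath and WRITE option from the question, a rough sketch of that streaming idea:

try (BufferedWriter writer = Files.newBufferedWriter(outputpath, StandardCharsets.UTF_8, WRITE)) {
    writer.write("[");
    for (int i = 0; i < jsonArray.length(); i++) {
        if (i > 0) {
            writer.write(",");
        }
        // Only one element is converted to a String at a time,
        // instead of materialising the whole 65000-row array at once.
        writer.write(jsonArray.get(i).toString());
    }
    writer.write("]");
}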
You can try to iterate through your json object:
Iterator<String> keys = (Iterator<String>) jsonObject.keys();
while (keys.hasNext()) {
    String key = keys.next();
    JSONObject value = jsonObject.getJSONObject(key);
    writer.write(value.toString());
}
PS. You need to check your json object's structure.
Consider that I have a data file storing rules in the following format:
//some header info
//more header info
//Rule: some_uuid_1234
rule "name"
data
data
data
end
//Rule: some_uuid_5678
rule "name2"
data
data
data
end
Now, what I would like is to be able to either read(id) or delete(id) a rule given the ID number. My question therefore is: how could I select a specific rule (perhaps using a regular expression) and then delete it from the file, without altering anything else?
Simply replace <some_id> in your select/delete pattern with the actual ID number.
//Rule: <some_id>.+?rule.+?end
NOTE: Don't forget the single-line (DOTALL) option, so that . matches newlines.
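In Java that option is Pattern.DOTALL. A rough sketch of the delete, assuming the whole file has already been read into a String called content and id holds the UUID:

// '.' matches newlines because of DOTALL; the reluctant quantifiers stop at the first "end".
String regex = "//Rule: " + Pattern.quote(id) + "\\s+rule.+?end\\s*";
String cleaned = Pattern.compile(regex, Pattern.DOTALL)
                        .matcher(content)
                        .replaceFirst("");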
There are 2 solutions I can think of and they have varied performance, so you can choose the one that suits you best.
Index the file
You could write an inverted index for this rule file and keep it updated for any operation that modifies the file. Of course your word index will be limited to one file and the only words in it will be the unique UUIDs. You can use a RandomAccessFile to quickly read() from a given offset. The delete() operation can overwrite the target rule until it encounters the word 'end'. This solution requires more work, but you can retrieve values instantly.
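A rough sketch of that index idea (ruleFile and id are illustrative names, and the //Rule: header lines are assumed to carry the UUIDs):

// Build an offset index once: UUID -> byte offset of its "//Rule:" line.
Map<String, Long> index = new HashMap<>();
try (RandomAccessFile raf = new RandomAccessFile(ruleFile, "r")) {
    long offset = raf.getFilePointer();
    String line;
    while ((line = raf.readLine()) != null) {
        if (line.startsWith("//Rule: ")) {
            index.put(line.substring("//Rule: ".length()).trim(), offset);
        }
        offset = raf.getFilePointer();
    }
}

// read(id): seek straight to the stored offset and read until "end".
try (RandomAccessFile raf = new RandomAccessFile(ruleFile, "r")) {
    raf.seek(index.get(id));
    String line;
    while ((line = raf.readLine()) != null) {
        System.out.println(line);
        if (line.trim().equals("end")) {
            break;
        }
    }
}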
Use a regex
You can alternatively read each line in the file and match it with a regex pattern that matches your rule UUID. Keep reading until you hit the 'end' of the rule and return it. A delete will involve over-writing the rule once you know the desired index. This solution is easy to write but the performance will suck. There is a lot of IO and it could become a bottleneck. (You could also load the entire file into memory and run a regex on the whole string, depending on how large the file / string is expected to be. This can get ugly real quick though.)
Whichever solution you choose you might also want to think about file level locks and how that affects CRUD operations. If this design has not been implemented yet, please consider moving the rules to a database.
I wouldn't use regular expressions to solve this particular problem - it would require loading the whole file in memory, processing it and rewriting it. That's not inherently bad, but if you have large enough files, a stream-based solution is probably better.
What you'd do is process your input file one line at a time and maintain a boolean value that:
becomes true when you find a line that matches the desired rule's declaration header.
becomes false when it's true and you find a line that matches end.
Discard every line encountered while your boolean is set to true; write all other lines to a temporary output file (created, for example, with File#createTempFile).
At the end of the process, overwrite your input file with your temporary output file using File#renameTo.
Note that this solution has the added advantage of being atomic: there is no risk for your input file to be partially written should an error occur in the middle of processing. It will either be overwritten entirely or not at all, which protects you against unexpected IOExceptions.
The following code demonstrates how you could implement that. It's not necessarily a perfect implementation, but it should illustrate the algorithm - lost somewhere in the middle of all that boilerplate code.
public void deleteFrom(String id, File file) throws IOException {
    BufferedReader reader;
    String line;
    boolean inRule;
    File temp;
    PrintWriter writer;

    reader = null;
    writer = null;
    try {
        // Streams initialisation.
        temp = File.createTempFile("delete", "rule");
        writer = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(temp), "utf-8")));
        reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "utf-8"));
        inRule = false;

        // For each line in the file...
        while ((line = reader.readLine()) != null) {
            // If we're parsing the rule to delete, we're only interested in knowing when we're done.
            if (inRule) {
                if (line.trim().equals("end"))
                    inRule = false;
            }
            // Otherwise, look for the beginning of the targeted rule.
            else if (line.trim().equals("rule \"" + id + "\""))
                inRule = true;
            // Normal line, we want to keep it.
            else
                writer.println(line);
        }
    }
    // Stream cleanup.
    finally {
        if (reader != null)
            reader.close();
        if (writer != null)
            writer.close();
    }

    // We're done, copy the new file over the old one.
    temp.renameTo(file);
}
I am trying to read the number of lines in a binary file using readObject, but I get an IOException for EOF. Am I doing this the right way?
FileInputStream istream = new FileInputStream(fileName);
ObjectInputStream ois = new ObjectInputStream(istream);

/** calculate number of items **/
int line_count = 0;
while ((String) ois.readObject() != null) {
    line_count++;
}
readObject() doesn't return null at EOF. You could catch the EOFException and interpret it as EOF, but this would fail to distinguish a normal EOF from a file that has been truncated.
A better approach would be to use some meta-data. That is, rather than asking the ObjectInput how many objects are in the stream, you should store the count somewhere. For example, you could create a meta-data class that records the count and other meta-data and store an instance as the first object in each file. Or you could create a special EOF marker class and store an instance as the last object in each file.
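For example, a minimal sketch of the count-first variant (fileName and lines are illustrative names):

// Writing: store the object count first, then the objects themselves.
try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(fileName))) {
    oos.writeInt(lines.size());
    for (String s : lines) {
        oos.writeObject(s);
    }
}

// Reading: the stored count tells the loop exactly when to stop.
// (readObject() also declares ClassNotFoundException, which the caller must handle.)
try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(fileName))) {
    int count = ois.readInt();
    for (int i = 0; i < count; i++) {
        String s = (String) ois.readObject();
        // process s
    }
}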
I had the same problem today. Although the question is quite old, the problem remains and there was no clean solution provided. Ignoring EOFException should be avoided, as it may be thrown when some object was not saved correctly. Writing null obviously prevents you from using null values for any other purpose. Finally, using available() on the object stream always returns zero, as the number of objects is unknown.
My solution is quite simple. ObjectInputStream is just a wrapper for some other stream, such as FileInputStream. Although ObjectInputStream.available() returns zero, FileInputStream.available() will return some value.
FileInputStream istream = new FileInputStream(fileName);
ObjectInputStream ois = new ObjectInputStream(istream);

/** calculate number of items **/
int line_count = 0;
while (istream.available() > 0) { // check if the file stream is at the end
    ois.readObject();             // read from the object stream, which wraps the file stream
    line_count++;
}
No. Catch EOFException and use that to terminate the loop.
If you write a null object at the end of the file, when you read it back you will get a null value and can terminate your loop.
Just add:
out.writeObject(null);
when you serialize the data.
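On the reading side, the loop can then use the null as its stop condition, for example:

Object obj;
while ((obj = ois.readObject()) != null) {
    String line = (String) obj; // process the object
    line_count++;
}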
It's curious that the API doesn't supply a more elegant solution to this. I guess the EOFException would work but I've always been encouraged to see exceptions as unexpected events whereas here you would often expect the object stream to come to an end.
I tried to work around this by writing a kind of "marker" object to signify the end of the object stream:
import java.io.Serializable;
public enum ObjectStreamStatus implements Serializable {
EOF
}
Then, in the code reading the objects, I checked for this EOF object in the object reading loop.
No, you need to know how many objects there are in the binary file. You could write the number of objects at the beginning of the file (using writeInt, for example) and read it back while loading.
Another option is to call ois.available() and loop until it returns 0. However, I am not sure that this is 100% reliable.
It looks like the problem is with the data that you wrote out. Assuming the data is written as expected by this code, there shouldn't be a problem.
(I see you are reading Strings. ObjectInputStream isn't for reading text files. Use InputStreamReader and BufferedReader.readLine for that. Similarly, if you have written the file with DataOutputStream.writeUTF, read it with DataInputStream.readUTF.)
The available method of ObjectInputStream cannot be used to terminate the loop, as it returns 0 even if there are objects still to be read from the file. Writing a null to the file doesn't seem to be a good solution either, since objects can be null, which would then be interpreted as the end of the file. I think catching the EOFException to terminate the loop is the better practice, since if an EOFException occurs (either because you reached the end of the file or for some other reason), you have to terminate the loop anyway.
The best possible way to end the loop is to add a null object at the end when writing. While reading, the null object can be used as a boundary condition to exit the loop. Catching the EOFException also serves the purpose, but it takes a few milliseconds more.
As part of a project I'm working on, I'd like to clean up a file I generate of duplicate line entries. These duplicates often won't occur near each other, however. I came up with a method of doing so in Java (which basically made a copy of the file, then used a nested while-statement to compare each line in one file with the rest of the other). The problem, is that my generated file is pretty big and text heavy (about 225k lines of text, and around 40 megs). I estimate my current process to take 63 hours! This is definitely not acceptable.
I need an integrated solution for this, however. Preferably in Java. Any ideas? Thanks!
Hmm... 40 megs seems small enough that you could build a Set of the lines and then print them all back out. This would be way, way faster than doing O(n²) I/O work.
It would be something like this (ignoring exceptions):
public void stripDuplicatesFromFile(String filename) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    Set<String> lines = new HashSet<String>(10000); // maybe should be bigger
    String line;
    while ((line = reader.readLine()) != null) {
        lines.add(line);
    }
    reader.close();

    BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
    for (String unique : lines) {
        writer.write(unique);
        writer.newLine();
    }
    writer.close();
}
If the order is important, you could use a LinkedHashSet instead of a HashSet. Since the elements are stored by reference, the overhead of an extra linked list should be insignificant compared to the actual amount of data.
Edit: As Workshop Alex pointed out, if you don't mind making a temporary file, you can simply print out the lines as you read them. This allows you to use a simple HashSet instead of LinkedHashSet. But I doubt you'd notice the difference on an I/O bound operation like this one.
Okay, most answers are a bit silly and slow, since they involve adding lines to some hashset or whatever and then moving them back out of that set again. Let me show the most optimal solution in pseudocode:
Create a hashset for just strings.
Open the input file.
Open the output file.
while not EOF(input)
    Read Line.
    If not(Line in hashSet)
        Add Line to hashset.
        Write Line to output.
    End If.
End While.
Free hashset.
Close input.
Close output.
Please guys, don't make it more difficult than it needs to be. :-) Don't even bother about sorting, you don't need to.
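Translated into Java, that pseudocode might look roughly like this sketch (input and output paths are placeholders):

Set<String> seen = new HashSet<>();
try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"));
     BufferedWriter writer = new BufferedWriter(new FileWriter("output.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        if (seen.add(line)) {   // add() returns false for lines already seen
            writer.write(line);
            writer.newLine();
        }
    }
}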
A similar approach
public void stripDuplicatesFromFile(String filename) throws IOException {
    IOUtils.writeLines(
        new LinkedHashSet<String>(IOUtils.readLines(new FileInputStream(filename))),
        "\n", new FileOutputStream(filename + ".uniq"));
}
Something like this, perhaps:
BufferedReader in = ...;
Set<String> lines = new LinkedHashSet<String>();
for (String line; (line = in.readLine()) != null;)
    lines.add(line); // does nothing if duplicate is already added

PrintWriter out = ...;
for (String line : lines)
    out.println(line);
LinkedHashSet keeps the insertion order, as opposed to HashSet which (while being slightly faster for lookup/insert) will reorder all lines.
You could use a Set from the Collections framework to store unique, already-seen values as you read the file.
Set<String> uniqueStrings = new HashSet<String>();
// read your file, looping on newline, putting each line into variable 'thisLine'
uniqueStrings.add(thisLine);
// finish read
for (String uniqueString:uniqueStrings) {
// do your processing for each unique String
// i.e. System.out.println(uniqueString);
}
If the order does not matter, the simplest way is shell scripting:
<infile sort | uniq > outfile
Try a simple HashSet that stores the lines you have already read.
Then iterate over the file.
If you come across duplicates they are simply ignored (as a Set can only contain every element once).
Read in the file, storing the line number and the line: O(n)
Sort it into alphabetical order: O(n log n)
Remove duplicates: O(n)
Sort it into its original line number order: O(n log n) (see the sketch below)
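A rough in-memory sketch of those four steps (file paths are placeholders):

List<String> lines = Files.readAllLines(Paths.get("input.txt"));

// Pair each line with its original position.
List<Map.Entry<Integer, String>> numbered = new ArrayList<>();
for (int i = 0; i < lines.size(); i++) {
    numbered.add(new AbstractMap.SimpleEntry<>(i, lines.get(i)));
}

// Sort alphabetically so duplicates become adjacent, keep the first of each run.
numbered.sort(Map.Entry.comparingByValue());
List<Map.Entry<Integer, String>> unique = new ArrayList<>();
for (Map.Entry<Integer, String> e : numbered) {
    if (unique.isEmpty() || !unique.get(unique.size() - 1).getValue().equals(e.getValue())) {
        unique.add(e);
    }
}

// Restore the original order and write the result.
unique.sort(Map.Entry.comparingByKey());
try (BufferedWriter w = Files.newBufferedWriter(Paths.get("output.txt"))) {
    for (Map.Entry<Integer, String> e : unique) {
        w.write(e.getValue());
        w.newLine();
    }
}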
The hash set approach is OK, but you can tweak it so you don't have to store all the Strings in memory: keep only a logical pointer to the location in the file, so you can go back and read the actual value only if you need it.
Another creative approach is to append to each line the number of the line, then sort all the lines, remove the duplicates (ignoring the last token, which should be the number), and then sort the file again by the last token, stripping it out in the output.
If you could use UNIX shell commands you could do something like the following:
for(i = line 0 to end)
{
sed 's/\$i//2g' ; deletes all repeats
}
This would iterate through your whole file and only pass each unique occurrence once per sed call. This way you're not doing a bunch of searches you've done before.
There are two scalable solutions, where by scalable I mean disk- rather than memory-based, depending on whether the procedure should be stable or not, where by stable I mean that the order after removing duplicates is the same. If scalability isn't an issue, then simply use memory for the same sort of method.
For the non stable solution, first sort the file on the disk. This is done by splitting the file into smaller files, sorting the smaller chunks in memory, and then merging the files in sorted order, where the merge ignores duplicates.
The merge itself can be done using almost no memory, by comparing only the current line in each file, since the next line is guaranteed to be greater.
The stable solution is slightly trickier. First, sort the file in chunks as before, but indicate in each line the original line number. Then, during the "merge", don't bother storing the result, just the line numbers to be deleted.
Then copy the original file line by line, ignoring the line numbers you have stored above.
Does it matter in which order the lines come, and how many duplicates are you counting on seeing?
If not, and if you're counting on a lot of dupes (i.e. a lot more reading than writing) I'd also think about parallelizing the hashset solution, with the hashset as a shared resource.
I have made two assumptions for this efficient solution:
There is a Blob equivalent of line or we can process it as binary
We can save the offset or a pointer to start of each line.
Based on these assumptions, the solution is:
1. Read a line and save its length as the key in the hashmap, so we have a lighter hashmap. Save a list of offsets as the hashmap entry for all the lines having the length mentioned in the key. Building this hashmap is O(n).
2. While mapping the offsets for each line in the hashmap, compare the line blobs with all existing entries in the list of lines (offsets) for this key length, except entries whose offset is -1. If a duplicate is found, remove both lines and save -1 as the offset in those places in the list.
So consider the complexity and memory usage:
Hashmap memory, space complexity = O(n), where n is the number of lines.
Time complexity: if there are no duplicates but all lines have equal length, with the length of each line = m and the number of lines = n, that would be O(n). Since we assume we can compare blobs, m does not matter.
That was the worst case.
In other cases we save on comparisons, although we will need a little extra space in the hashmap.
Additionally, we can use MapReduce on the server side to split the set and merge the results later, using the length or the start of the line as the mapper key.
void deleteDuplicates(File filename) throws IOException {
    @SuppressWarnings("resource")
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    Set<String> lines = new LinkedHashSet<String>();
    String line;
    String delims = " ";
    System.out.println("Reading the duplicate contents now and writing to file");
    while ((line = reader.readLine()) != null) {
        line = line.trim();
        StringTokenizer str = new StringTokenizer(line, delims);
        while (str.hasMoreElements()) {
            lines.add((String) str.nextElement()); // the Set silently drops duplicate tokens
        }
    }
    reader.close();

    // Write the unique tokens back only after all input has been read,
    // since we are overwriting the very file we were reading from.
    BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
    for (String unique : lines) {
        writer.write(unique + " ");
    }
    writer.close();

    System.out.println(lines);
    System.out.println("Duplicate removal successful");
}
These answers all rely on the file being small enough to store in memory.
If it is OK to sort the file, this is an algorithm that can be used on any sized file.
You need this library: https://github.com/lemire/externalsortinginjava
I assume you start with a file fileDumpCsvFileUnsorted and you will end up with a new file fileDumpCsvFileSorted that is sorted and has no dupes.
ExternalSort.sort(fileDumpCsvFileUnsorted, fileDumpCsvFileSorted);
int numDupes = 0;
File dupesRemoved = new File(fileDumpCsvFileSorted.getAbsolutePath() + ".nodupes");
String previousLine = null;
try (FileWriter fw = new FileWriter(dupesRemoved);
     BufferedWriter bw = new BufferedWriter(fw);
     FileReader fr = new FileReader(fileDumpCsvFileSorted);
     LineIterator lineIterator = new LineIterator(fr)
) {
    while (lineIterator.hasNext()) {
        String nextLine = lineIterator.nextLine();
        if (StringUtils.equals(nextLine, previousLine)) {
            ++numDupes;
            continue;
        }
        bw.write(String.format("%s%n", nextLine));
        previousLine = nextLine;
    }
}
logger.info("Removed {} dupes from {}", numDupes, fileDumpCsvFileSorted.getAbsolutePath());
FileUtils.deleteQuietly(fileDumpCsvFileSorted);
FileUtils.moveFile(dupesRemoved, fileDumpCsvFileSorted);
The file fileDumpCsvFileSorted is now created sorted with no dupes.