I am new to Java IO. Currently, I have these lines of code which generate an input stream from a string:
StringBuilder sb = new StringBuilder();
for (...) {
    sb.append(...);
}
String finalString = sb.toString();
byte[] objectBytes = finalString.getBytes(StandardCharsets.UTF_8);
InputStream inputStream = new ByteArrayInputStream(objectBytes);
Maybe I am misunderstanding something, but is there a better way to generate an InputStream from a String than using getBytes()?
For instance, if the String is really large, say 50 MB, and there is no room to create another copy of it (another 50 MB from getBytes()) due to resource constraints, this could potentially throw an out-of-memory error.
I just want to know whether the lines of code above are an efficient way to generate an InputStream from a String. For instance, is there a way to "stream" the String into an input stream without using additional memory, like a Reader-style abstraction on top of the String?
I think what you're looking for is a StringReader which is defined as:
A character stream whose source is a string.
To use this efficiently, you need to know where in the String the characters you wish to read are located. It supports both random and sequential access, so you can read the entire String, char by char, if you prefer.
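For example, a minimal sketch of reading the String in chunks through a StringReader (the 8192-char buffer size is an arbitrary choice):

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

void readStringAsCharStream(String finalString) throws IOException {
    try (Reader reader = new StringReader(finalString)) {
        char[] buffer = new char[8192];   // read in chunks instead of copying the whole String to a byte[]
        int charsRead;
        while ((charsRead = reader.read(buffer, 0, buffer.length)) != -1) {
            // process buffer[0..charsRead)
        }
    }
}

Note that this gives you a character stream (a Reader); if the consuming API strictly requires an InputStream of bytes, some encoding step is unavoidable.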
You are producing data (writing) and you want to consume it (reading) almost immediately.
The Unix technique is to pipe the output of one process to the input of another process. In Java this also needs at least two threads, which synchronize on producing and consuming.
PipedInputStream in = new PipedInputStream();
PipedOutputStream out = new PipedOutputStream(in);
new Thread(() -> writeAllYouveGot(out)).start();
readAllYouveGot(in);
Here I started a Thread for writing with a Runnable that calls some self-defined method on out. Instead of using new Thread you might prefer an ExecutorService.
Piped I/O is rather seldom used, even though its asynchronous behaviour is ideal here. One can even set the pipe's buffer size on the PipedInputStream. The reason for that rare usage is the need for a second thread.
To complete things, one would probably wrap the binary Input/OutputStreams in new InputStreamReader(in, "UTF-8") and new OutputStreamWriter(out, "UTF-8").
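A minimal sketch combining the pipe with the UTF-8 wrappers; writeAllYouveGot and readAllYouveGot stand in for your own producer and consumer methods (here taking a Writer and a Reader rather than the raw streams), and the pipe size is an arbitrary choice:

import java.io.*;
import java.nio.charset.StandardCharsets;

void pipeProducerToConsumer() throws IOException {
    PipedInputStream in = new PipedInputStream(64 * 1024);    // optional: set the pipe's buffer size
    PipedOutputStream out = new PipedOutputStream(in);

    new Thread(() -> {
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writeAllYouveGot(writer);                          // produce the text chunk by chunk
        } catch (IOException e) {
            e.printStackTrace();
        }
    }).start();

    try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
        readAllYouveGot(reader);                               // consume while the other thread writes
    }
}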
Try something like this (no promises about typos:)
BufferedReader reader = new BufferedReader(new InputStreamReader(yourInputStream, Charset.defaultCharset()));
final char[] buffer = new char[8000];
int charsRead = 0;
while (true) {
    charsRead = reader.read(buffer, 0, 8000);
    if (charsRead == -1) {
        break;
    }
    // Do something with buffer
}
The InputStreamReader converts from byte to char, using the Charset. BufferedReader allows you to read blocks of char.
For really large input streams, you may want to process the input in chunks, rather than reading the entire stream into memory and then processing.
I was recently asked an interview question that had to do with reading from a CSV file and summing up entries in certain cells. When asked to optimize it, I couldn't answer how to deal with running out of memory if we were given a CSV of, say, 100 gigs.
In Java, how exactly does reading from a file work? How do we know when something is too big? How do we deal with that? I was told that you could pass in the intermediate reader object instead of trying to read the entire thing?
The interviewer gave you a hint - BufferedReader. It is an efficient choice for reading a large file line by line.
Small example:
String line;
BufferedReader br = new BufferedReader(new FileReader("c:/test.txt"));
while ((line = br.readLine()) != null) {
    // do processing
}
br.close();
See the BufferedReader documentation for details.
There are several ways to read from a file in Java. Some of them keep all of the file's lines (or data) in memory as you read data delimited by something like a newline character (reading line by line, for example).
For large files you want to process smaller pieces at a time, for instance using the Scanner class (or something like it that reads a specific number of bytes at a time).
Sample code:
FileInputStream inputStream = new FileInputStream(path);
Scanner sc = new Scanner(inputStream, "UTF-8");
while (sc.hasNextLine()) {
    String line = sc.nextLine();
    // System.out.println(line);
}
You can use RandomAccessFile to read the file. It may not be the best solution though.
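For instance, a rough sketch of chunked (and, if needed, positioned) reading with RandomAccessFile; the chunk size and the read-only mode are arbitrary choices:

import java.io.IOException;
import java.io.RandomAccessFile;

void readInChunks(String path) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
        byte[] chunk = new byte[8192];
        int bytesRead;
        while ((bytesRead = raf.read(chunk)) != -1) {
            // process chunk[0..bytesRead); raf.seek(pos) lets you jump to an arbitrary offset
        }
    }
}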
I was supposed to write a method that reads a DNA sequence in order to test some string matching algorithms on it.
I took some existing code I use to read text files (don't really know any others):
try {
    FileReader fr = new FileReader(file);
    BufferedReader br = new BufferedReader(fr);
    while ((line = br.readLine()) != null) {
        seq += line;
    }
    br.close();
}
catch (FileNotFoundException e) { e.printStackTrace(); }
catch (IOException e) { e.printStackTrace(); }
This seems to work just fine for small text files with ~3000 characters, but it takes forever (I just cancelled it after 10 minutes) to read files containing more than 45 million characters.
Is there a more efficient way of doing this?
One thing I notice is that you are doing seq += line. seq is probably a String? If so, remember that Strings are immutable, so what you are actually doing is creating a new String every time you append a line. Use a StringBuilder instead. Also, if possible, you don't want to build the whole string and then process it, because then you go over the data twice. Ideally you want to process as you read, but I don't know your situation.
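A minimal sketch of that change, appending to a StringBuilder as the file is read (the method name and signature are just for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

String readSequence(String file) throws IOException {
    StringBuilder seq = new StringBuilder();
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
        String line;
        while ((line = br.readLine()) != null) {
            seq.append(line);    // amortized O(1) append, no new String per line
        }
    }
    return seq.toString();
}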
The main thing slowing your progress is the "concatenation" of the String seq and line when you call seq += line. I use quotes around concatenation because in Java, Strings cannot be modified once they are created (i.e. they are immutable, as user1598503 mentioned). Initially this is not an issue, as the Strings are small, but once they become very long, e.g. hundreds of thousands of characters, memory must be reallocated for each new String, which takes quite a bit of time. StringBuilder lets you do these appends in place, meaning you will not be creating a new object every single time.
Your problem is not that the reading takes too much time, but that the concatenating takes too much time. Just to verify this I ran your code (it didn't finish) and then simply commented out line 8 (seq += line), and it ran in under a second. You could try using seq = seq.concat(line), since it has been reported to be quite a bit faster most of the time, but I tried that too and it didn't finish in under 1-2 minutes (for a 9.6 MB input file). My solution would be to store your lines in an ArrayList (or a container of your choice); the ArrayList example worked in about 2-3 seconds with the same input file (so the content of your while loop would be list.add(line);). If you really, really want to store your entire file in a single String, you could do something like this (using the Scanner class):
String content = new Scanner(new File("input")).useDelimiter("\\Z").next();
This works in a matter of seconds as well. I should mention that "\Z" is the end-of-input delimiter, which is why it reads the whole thing in one swoop.
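A sketch of the ArrayList variant mentioned above, joining the pieces only at the very end if a single String is really needed:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

List<String> readLines(String file) throws IOException {
    List<String> list = new ArrayList<>();
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
        String line;
        while ((line = br.readLine()) != null) {
            list.add(line);                 // store lines instead of concatenating Strings
        }
    }
    return list;                            // String.join("", list) if one String is required
}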
I am reading and looping over a big file using BufferedReader in Java. I want to ask if there is a way to move back 1000 lines in the loop. It is a simple CSV file and I save each record in a String in every iteration.
BufferedReader br = new BufferedReader(new FileReader("D:\\file.txt"));
String line;
while ((line = br.readLine()) != null) {
    // normal processing
    if (condition) {
        // move 1000 lines back again in the loop above (i.e. re-read 1000 lines again)
    }
}
Is there any way I can move back?
You can use mark() and reset(), provided the buffer is big enough, or you could maintain the most recent 1000 lines in a deque, but why? Re-reading files is just a sign of bad design. Most files can be read left to right in a single pass. Compilers do it: so can you.
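A rough sketch of the deque idea, keeping the most recent 1000 lines in memory so they can be re-processed without touching the file again; condition(line) is a placeholder for whatever test triggers the re-read:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

void process(String path) throws IOException {
    Deque<String> lastLines = new ArrayDeque<>(1000);
    try (BufferedReader br = new BufferedReader(new FileReader(path))) {
        String line;
        while ((line = br.readLine()) != null) {
            if (lastLines.size() == 1000) {
                lastLines.removeFirst();            // drop the oldest line
            }
            lastLines.addLast(line);
            // normal processing
            if (condition(line)) {
                for (String previous : lastLines) {
                    // re-process the last (up to) 1000 lines here
                }
            }
        }
    }
}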
As part of a project I'm working on, I'd like to clean up a file I generate of duplicate line entries. These duplicates often won't occur near each other, however. I came up with a method of doing so in Java (which basically made a copy of the file, then used a nested while-statement to compare each line in one file with the rest of the other). The problem, is that my generated file is pretty big and text heavy (about 225k lines of text, and around 40 megs). I estimate my current process to take 63 hours! This is definitely not acceptable.
I need an integrated solution for this, however. Preferably in Java. Any ideas? Thanks!
Hmm... 40 megs seems small enough that you could build a Set of the lines and then print them all back out. This would be way, way faster than doing O(n²) I/O work.
It would be something like this (ignoring exceptions):
public void stripDuplicatesFromFile(String filename) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    Set<String> lines = new HashSet<String>(10000); // maybe should be bigger
    String line;
    while ((line = reader.readLine()) != null) {
        lines.add(line);
    }
    reader.close();
    BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
    for (String unique : lines) {
        writer.write(unique);
        writer.newLine();
    }
    writer.close();
}
If the order is important, you could use a LinkedHashSet instead of a HashSet. Since the elements are stored by reference, the overhead of an extra linked list should be insignificant compared to the actual amount of data.
Edit: As Workshop Alex pointed out, if you don't mind making a temporary file, you can simply print out the lines as you read them. This allows you to use a simple HashSet instead of LinkedHashSet. But I doubt you'd notice the difference on an I/O bound operation like this one.
Okay, most answers are a bit silly and slow since they involve adding lines to some hashset or whatever and then moving them back out of that set again. Let me show the most optimal solution in pseudocode:
Create a hashset for just strings.
Open the input file.
Open the output file.
While not EOF(input)
    Read Line.
    If not (Line in hashset)
        Add Line to hashset.
        Write Line to output.
    End If.
End While.
Free hashset.
Close input.
Close output.
Please guys, don't make it more difficult than it needs to be. :-) Don't even bother about sorting, you don't need to.
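In Java, that pseudocode comes out roughly like this (the input and output paths are placeholders):

import java.io.*;
import java.util.HashSet;
import java.util.Set;

void stripDuplicates(String inputPath, String outputPath) throws IOException {
    Set<String> seen = new HashSet<>();
    try (BufferedReader reader = new BufferedReader(new FileReader(inputPath));
         BufferedWriter writer = new BufferedWriter(new FileWriter(outputPath))) {
        String line;
        while ((line = reader.readLine()) != null) {
            if (seen.add(line)) {           // add() returns false if the line was already in the set
                writer.write(line);
                writer.newLine();
            }
        }
    }
}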
A similar approach
public void stripDuplicatesFromFile(String filename) throws IOException {
    IOUtils.writeLines(
            new LinkedHashSet<String>(IOUtils.readLines(new FileInputStream(filename))),
            "\n", new FileOutputStream(filename + ".uniq"));
}
Something like this, perhaps:
BufferedReader in = ...;
Set<String> lines = new LinkedHashSet<String>();
for (String line; (line = in.readLine()) != null;)
    lines.add(line); // does nothing if duplicate is already added

PrintWriter out = ...;
for (String line : lines)
    out.println(line);
LinkedHashSet keeps the insertion order, as opposed to HashSet which (while being slightly faster for lookup/insert) will reorder all lines.
You could use a Set from the Collections framework to store unique, seen values as you read the file.
Set<String> uniqueStrings = new HashSet<String>();
// read your file, looping on newline, putting each line into variable 'thisLine'
uniqueStrings.add(thisLine);
// finish read
for (String uniqueString : uniqueStrings) {
    // do your processing for each unique String
    // i.e. System.out.println(uniqueString);
}
If the order does not matter, the simplest way is shell scripting:
<infile sort | uniq > outfile
Try a simple HashSet that stores the lines you have already read.
Then iterate over the file.
If you come across duplicates they are simply ignored (as a Set can only contain every element once).
Read in the file, storing the line number and the line: O(n)
Sort it into alphabetical order: O(n log n)
Remove duplicates: O(n)
Sort it into its original line number order: O(n log n)
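Here is a rough sketch of those four steps in Java, keeping the earliest occurrence of each duplicated line; the file paths and the key/value pair representation are arbitrary choices (everything is held in memory here, but the same steps apply with an external sort for files that don't fit):

import java.io.*;
import java.util.*;

void dedupeKeepingOrder(String inPath, String outPath) throws IOException {
    // 1. Read the file, storing (original line number, line text).
    List<Map.Entry<Integer, String>> numbered = new ArrayList<>();
    try (BufferedReader br = new BufferedReader(new FileReader(inPath))) {
        String line;
        int n = 0;
        while ((line = br.readLine()) != null) {
            numbered.add(new AbstractMap.SimpleEntry<>(n++, line));
        }
    }
    // 2. Sort alphabetically by line text (ties broken by line number) so duplicates become adjacent.
    numbered.sort(Map.Entry.<Integer, String>comparingByValue()
            .thenComparing(Map.Entry.comparingByKey()));
    // 3. Remove adjacent duplicates, keeping the earliest occurrence of each line.
    List<Map.Entry<Integer, String>> unique = new ArrayList<>();
    for (Map.Entry<Integer, String> entry : numbered) {
        if (unique.isEmpty() || !unique.get(unique.size() - 1).getValue().equals(entry.getValue())) {
            unique.add(entry);
        }
    }
    // 4. Sort back into the original line-number order and write out.
    unique.sort(Map.Entry.comparingByKey());
    try (BufferedWriter bw = new BufferedWriter(new FileWriter(outPath))) {
        for (Map.Entry<Integer, String> entry : unique) {
            bw.write(entry.getValue());
            bw.newLine();
        }
    }
}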
The HashSet approach is OK, but you can tweak it so that it does not have to store all the Strings in memory, only a logical pointer to each line's location in the file, so you go back and read the actual value only when you need it.
Another creative approach is to append to each line its line number, then sort all the lines, remove the duplicates (ignoring the last token, which should be the number), then sort the file again by the last token and strip it out in the output.
If you could use UNIX shell commands you could do something like the following:
for (i = line 0 to end)
{
    sed "s/$i//2g"    # deletes all repeats
}
This would iterate through your whole file and only pass each unique occurrence once per sed call. This way you're not doing a bunch of searches you've done before.
There are two scalable solutions, where by scalable I mean disk-based rather than memory-based, depending on whether the procedure needs to be stable or not, where by stable I mean that the order after removing duplicates is the same. If scalability isn't an issue, then simply use memory for the same sort of method.
For the non-stable solution, first sort the file on disk. This is done by splitting the file into smaller files, sorting the smaller chunks in memory, and then merging the files in sorted order, where the merge ignores duplicates.
The merge itself can be done using almost no memory, by comparing only the current line in each file, since the next line is guaranteed to be greater.
The stable solution is slightly trickier. First, sort the file in chunks as before, but indicate in each line the original line number. Then, during the "merge", don't bother storing the result, just the line numbers to be deleted.
Then copy the original file line by line, ignoring the line numbers you have stored above.
Does it matter in which order the lines come, and how many duplicates are you counting on seeing?
If not, and if you're counting on a lot of dupes (i.e. a lot more reading than writing) I'd also think about parallelizing the hashset solution, with the hashset as a shared resource.
I have made two assumptions for this efficient solution:
There is a Blob-like equivalent of a line, or we can process each line as binary.
We can save the offset of (or a pointer to) the start of each line.
Based on these assumptions the solution is:
1. Read a line and save its length as the key in a hashmap, so we have a lighter hashmap. Save a list as the hashmap entry for all the lines having the length given by the key. Building this hashmap is O(n).
2. While mapping the offsets for each line in the hashmap, compare the line blobs with all existing entries in the list of lines (offsets) for this key length, except entries whose offset is -1. If a duplicate is found, remove both lines and save the offset -1 in those places in the list.
So consider the complexity and memory usage:
Hashmap memory, space complexity = O(n), where n is the number of lines.
Time complexity: if there are no duplicates but all lines are of equal length (taking the length of each line as m and the number of lines as n), that would be O(n); since we assume we can compare the blobs directly, m does not matter.
That is the worst case.
In other cases we save on comparisons, although we will need a little extra space in the hashmap.
Additionally, we can use map-reduce on the server side to split the set and merge the results later, using the length or the start of line as the mapper key.
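A rough sketch of the core idea above (group line offsets by line length, and compare actual text only within a group by re-reading it from disk through a second file handle). This is heavily hedged: it assumes a single-byte-per-character encoding, since RandomAccessFile.readLine() does not decode multi-byte charsets, and it only detects duplicates rather than producing the cleaned file:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

void findDuplicatesByLength(String path) throws IOException {
    Map<Integer, List<Long>> offsetsByLength = new HashMap<>();    // line length -> offsets of unique lines seen so far
    try (RandomAccessFile raf = new RandomAccessFile(path, "r");
         RandomAccessFile probe = new RandomAccessFile(path, "r")) {
        long offset = raf.getFilePointer();
        String line;
        while ((line = raf.readLine()) != null) {
            List<Long> candidates =
                    offsetsByLength.computeIfAbsent(line.length(), k -> new ArrayList<>());
            boolean duplicate = false;
            for (long candidateOffset : candidates) {
                probe.seek(candidateOffset);
                if (line.equals(probe.readLine())) {    // only same-length lines are ever compared
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                candidates.add(offset);                 // remember where this unique line starts
            } else {
                // duplicate found at 'offset'; record or skip it as your output step requires
            }
            offset = raf.getFilePointer();
        }
    }
}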
void deleteDuplicates(File filename) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    Set<String> lines = new LinkedHashSet<String>();
    String line;
    String delims = " ";
    System.out.println("Reading the duplicate contents now and writing to file");
    while ((line = reader.readLine()) != null) {
        line = line.trim();
        StringTokenizer str = new StringTokenizer(line, delims);
        while (str.hasMoreElements()) {
            lines.add((String) str.nextElement());
        }
    }
    reader.close();
    BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
    for (String unique : lines) {
        writer.write(unique + " ");
    }
    writer.close();
    System.out.println(lines);
    System.out.println("Duplicate removal successful");
}
These answers all rely on the file being small enough to store in memory.
If it is OK to sort the file, this is an algorithm that can be used on any sized file.
You need this library: https://github.com/lemire/externalsortinginjava
I assume you start with a file fileDumpCsvFileUnsorted and you will end up with a new file fileDumpCsvFileSorted that is sorted and has no dupes.
ExternalSort.sort(fileDumpCsvFileUnsorted, fileDumpCsvFileSorted);

int numDupes = 0;
File dupesRemoved = new File(fileDumpCsvFileSorted.getAbsolutePath() + ".nodupes");
String previousLine = null;
try (FileWriter fw = new FileWriter(dupesRemoved);
     BufferedWriter bw = new BufferedWriter(fw);
     FileReader fr = new FileReader(fileDumpCsvFileSorted);
     LineIterator lineIterator = new LineIterator(fr)
) {
    while (lineIterator.hasNext()) {
        String nextLine = lineIterator.nextLine();
        if (StringUtils.equals(nextLine, previousLine)) {
            ++numDupes;
            continue;
        }
        bw.write(String.format("%s%n", nextLine));
        previousLine = nextLine;
    }
}
logger.info("Removed {} dupes from {}", numDupes, fileDumpCsvFileSorted.getAbsolutePath());

FileUtils.deleteQuietly(fileDumpCsvFileSorted);
FileUtils.moveFile(dupesRemoved, fileDumpCsvFileSorted);
The file fileDumpCsvFileSorted is now created sorted with no dupes.