I'm trying to write an API to replace all the lines containing a certain substring with a different string in a text file.
I’m using a Java 8 stream to filter the lines that contain the given pattern. I’m having a problem with the file-write part.
Files.lines(targetFile).filter(line -> line.contains(plainTextPattern)).parallel()
.map(line-> line.replaceAll(plainTextPattern, replaceWith)).parallel();
The above code reads the file line by line, filters the lines that match the pattern, replaces the pattern with the given string, and gives back a stream of strings containing only the replaced lines.
We need to write these lines back to file. Since we lose the stream once the pipeline ends, I appended the following to the pipeline:
.forEach(line -> {
    try {
        Files.write(targetFile, line.toString().getBytes());
    } catch (IOException e) {
        e.printStackTrace();
    }
});
I was hoping it would write only the modified lines back to the file (since those are what remain in the pipeline) and keep the other lines untouched.
But it seems to truncate the file for each line it processes, keeping only the last processed line and deleting all the lines that were not matched in the pipeline.
Is there something I’m missing about handling files using streams?
Using filter eliminates anything that doesn't match the filter from the stream. (Additionally, for what it's worth, a) you only need to use parallel once, b) parallel isn't that effective on streams coming from I/O sources, c) it's almost never a good idea to use parallel until you've actually tried it non-parallel and found it too slow.)
That said: there's no need to filter out the lines that match the pattern if you're going to do a replaceAll. Your code should look like this:
try (Stream<String> lines = Files.lines(targetFile)) {
    List<String> replaced = lines
            .map(line -> line.replaceAll(plainTextPattern, replaceWith))
            .collect(Collectors.toList());
    Files.write(targetFile, replaced);
}
Sorry to tell you, but this is not how files work. If you want to write to the middle of a file, you need random access: get a RandomAccessFile, seek to the file pointer you want, and write from there.
This only works if the size of the data you want to write is equal to the size of the data you are overwriting. If that is not the case, you have to copy the tail of the file to a temporary buffer and append it after the text you wish to write.
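Not from the original answer, but a rough sketch of the seek-and-overwrite idea with RandomAccessFile (the file name and offset are made up, and it assumes the replacement is exactly as long as the bytes it overwrites):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class OverwriteInPlace {
    public static void main(String[] args) throws IOException {
        // the file name and offset are made up for illustration
        try (RandomAccessFile file = new RandomAccessFile("target.txt", "rw")) {
            byte[] replacement = "REPLACED".getBytes(StandardCharsets.UTF_8);
            file.seek(128);               // move the file pointer to the region to overwrite
            file.write(replacement);      // only safe if the new bytes are the same length as the old ones
        }
    }
}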
And by the way, parallel streams on I/O-bound tasks are often a bad idea.
You might want to implement a stream, as Jenkov has done here:
http://tutorials.jenkov.com/java-howto/replace-strings-in-streams-arrays-files.html
This simple one specifically replaces tokens of the form ${tokenName}.
There are more general algorithms.
I have a Java program that reads lines of a text file into a buffer; when the buffer is full it outputs the lines, so that after all lines have been through the buffer the output is partially sorted.
The output will be in blocks of lines, so I need a way to mark the end of each block in the output. Since the output is lines of text, I'm not sure what character to use as a marker, because the text can contain any characters. I'm thinking of using the ASCII NUL or unit separator, but I'm not sure this would be reliable since either could also appear in the text.
You could use a Map, so you can set a key for every buffer group, something like this:
Map<Integer, Buffer> myMap = new HashMap<>();
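A minimal sketch of that idea (the block size, the List<String> buffer type, and the class name are assumptions for illustration), keying each flushed block by a block number instead of marking boundaries with a special character:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BufferGroups {
    public static void main(String[] args) {
        Map<Integer, List<String>> blocks = new HashMap<>();
        int blockIndex = 0;
        List<String> buffer = new ArrayList<>();

        for (String line : new String[] {"c", "a", "b", "e", "d", "f"}) {
            buffer.add(line);
            if (buffer.size() == 3) {            // pretend the buffer is full
                buffer.sort(null);               // sort this block (natural ordering)
                blocks.put(blockIndex++, buffer);
                buffer = new ArrayList<>();
            }
        }
        if (!buffer.isEmpty()) {
            blocks.put(blockIndex, buffer);
        }
        System.out.println(blocks);              // each key marks one block
    }
}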
If you are not sure how to discriminate lines, I suggest you take a look at a sentence tokenizer tool, which is commonly used in NLP. These programs contain patterns that discriminate sentences from each other. That way, you can send all your data through and get the lines back without worrying about which character to use. There are plenty of libraries for Java that do the job well (assuming your text is in English).
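As an aside (not part of the original answer), java.text.BreakIterator in the standard library is one lightweight way to do sentence splitting without pulling in a full NLP library; a small sketch:

import java.text.BreakIterator;
import java.util.Locale;

public class SentenceSplit {
    public static void main(String[] args) {
        String text = "First sentence. Second sentence! A third one?";
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        // walk the boundaries and print each sentence on its own line
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            System.out.println(text.substring(start, end).trim());
        }
    }
}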
I'm currently working with pulling a CSV file from a URL and modifying its entries. I'm currently using a StreamReader to read each line of the CSV and split it into an array, where I can modify each entry based on its position.
The CSV is generated from an e-form provider where a particular form entry is a Multi-Line field, where a user can add multiple notes. However, when a user enters a new note, they are separating each note by a line return.
CSV Example:
"FName","LName","Email","Note 1: some text
Note 2: some text"
Since my code is splitting each CSV entry by line, once it reaches these notes, it believes it to be a new CSV entry. This is causing my code that modifies the entries to not work since the element positions become incorrect. (CSV entries with empty or single line note fields work fine)
Any ideas on the best approach to take for this? I've tried adding code to replace carriage returns or to skip empty lines but it doesn't seem to help.
You can check whether the first column value in a row is null or not. If it is null, continue reading the next line.
Assuming the CSV example you have provided is supposed to be just one entry in the CSV file (with the last field spanning over several different lines due to newline breaks), you could try something like this, using 2 loops.
1. Keep a variable for the current CSV record (of String[] type), currentRecord, and a recordList (a List or an array) to keep all the CSV records.
2. Read a line of the CSV file.
3. Split it into an array of strings using the comma as the delimiter. Keep this array in a temporary variable.
4. If the size of this array is 1, append this string to the last element (4th) in currentRecord (if currentRecord is not null).
5. Keep reading lines off the CSV file, repeating step 4 until the array size is 4.
6. If the size is 4, then this indicates that the record is the next record in the CSV file, and you can add currentRecord to recordList.
7. Keep repeating steps 2 to 6 until you reach the end of the CSV file.
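A rough sketch of those steps (not from the original answer; the class name, the hard-coded 4-column count, and the naive comma split are assumptions for illustration):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class CsvMerger {
    // Merges continuation lines back into the last field.
    // Assumes 4 columns and a simple comma split (no quoted commas).
    public static List<String[]> readRecords(String path) throws IOException {
        List<String[]> recordList = new ArrayList<>();
        String[] currentRecord = null;
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                if (parts.length == 4) {
                    // start of a new record
                    if (currentRecord != null) {
                        recordList.add(currentRecord);
                    }
                    currentRecord = parts;
                } else if (currentRecord != null) {
                    // continuation of the multi-line note field: append to the last column
                    currentRecord[3] = currentRecord[3] + " " + line;
                }
            }
            if (currentRecord != null) {
                recordList.add(currentRecord);
            }
        }
        return recordList;
    }
}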
It would be better if you can remove the line breaks in the field and clean the CSV file before parsing it though. It'll make things much simpler.
Use a proper CSV library to handle the writing and parsing. There are a few edge cases to handle here, not only the new line. Users could also insert commas or quotes in their notes, and it will become very messy to handle this by yourself.
Try uniVocity-parsers as it can handle all sorts of situations when parsing and writing CSV.
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
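For what it's worth, a minimal sketch of how parsing with uniVocity-parsers typically looks (the file name is made up, and exact settings may vary by version):

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.FileReader;
import java.util.List;

public class ParseWithUnivocity {
    public static void main(String[] args) throws Exception {
        CsvParserSettings settings = new CsvParserSettings();
        // values inside quotes may span multiple lines; the parser handles that for you
        settings.setHeaderExtractionEnabled(true);

        CsvParser parser = new CsvParser(settings);
        List<String[]> rows = parser.parseAll(new FileReader("form-export.csv"));

        for (String[] row : rows) {
            System.out.println(String.join(" | ", row));
        }
    }
}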
I have a file I need to read that's over 50gb large with all characters in one line.
Now comes the tricky part:
I have to split it on all double-quote characters, find a substring (srsName), and get the element after it, which in a for loop over the split substrings has index i+1 (the "value").
Question:
Are there some progressive search implementations or other methods that I could use instead of filling up my memory?
To simplify:
There are quite a lot of those srsName substrings inside the file but I need to read just one of those as all of them have the same value following them.
Something about the file:
It's an XML file being prepared for an XSL transformation. I can't use an XSLT that creates indentation, because I need to do this with as little disk/memory usage as possible.
This is how the value presents itself inside the file.
<sometag:sometext srsName="value">
One way to speed up your search in a massive file is adapting a fast in-memory search algorithm to searching in a file.
One particularly fast algorithm is Knuth–Morris–Pratt: it looks at each character at most twice, and requires a small preprocessing step to construct the "jump table" that tells you to what position you should move to continue your search. That table is constructed in such a way as to not have you jump too far back, so you can do your search by keeping a small "search window" of your file in memory: since you are looking for a word of only seven characters, it is sufficient to keep only the last six characters in memory as your search progresses through the file.
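Not from the original answer, but a sketch of what a streaming Knuth–Morris–Pratt search over a Reader could look like (the class and method names are made up); it keeps only the failure table and a match counter in memory:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

public class StreamingSearch {
    // Returns the offset of the first occurrence of pattern in the stream, or -1.
    public static long indexOf(Reader input, String pattern) throws IOException {
        int[] fail = buildFailureTable(pattern);
        int matched = 0;          // how many pattern chars are currently matched
        long position = 0;        // absolute position in the stream
        int c;
        BufferedReader reader = new BufferedReader(input);
        while ((c = reader.read()) != -1) {
            while (matched > 0 && pattern.charAt(matched) != c) {
                matched = fail[matched - 1];   // fall back instead of rereading input
            }
            if (pattern.charAt(matched) == c) {
                matched++;
            }
            position++;
            if (matched == pattern.length()) {
                return position - pattern.length();  // start offset of the match
            }
        }
        return -1;
    }

    // Standard KMP failure (prefix) table.
    private static int[] buildFailureTable(String pattern) {
        int[] fail = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(k) != pattern.charAt(i)) {
                k = fail[k - 1];
            }
            if (pattern.charAt(k) == pattern.charAt(i)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }
}

The returned offset tells you where srsName starts, so you can read the few characters that follow it to pick out the value.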
You could try using a BufferedReader - http://download.oracle.com/javase/6/docs/api/java/io/BufferedReader.html
This would allow you to specify the number of characters to read in to memory at once using the read method.
I've done it like this:
String myBuff = "";
int charBuff;

// prime a 30-character sliding window
while (myBuff.length() < 30) {
    charBuff = br.read();
    if (charBuff == -1) break;               // end of file
    myBuff += (char) charBuff;
}

// slide the window one character at a time until it starts with the token
while (!myBuff.startsWith("srsName")) {
    charBuff = br.read();
    if (charBuff == -1) break;               // token not found before end of file
    myBuff = myBuff.substring(1) + (char) charBuff;
}

// srsName="value"> -> the value sits between the first pair of double quotes
String value = myBuff.split("\"")[1];
where br is my BufferedReader
I want to write strings to a text file, each time at the bottom of the file. Then, if I search for a certain string in the text file and find it, I want to replace that line with another.
I'm thinking of this: count the rows in the text file, add 1, and then write the string I want to that index. But is it even possible to write to a certain line number in a text file?
And how about updating a certain row to another string?
Thanks!
You do not want to do that: it is a recipe for disaster. If, during the original file modification, you fail to write to it, the original file will be corrupted.
Use a double-write protocol: write the modified file to another file, and only if that write succeeds, rename that file to the original.
Provided your file is not too big, for some definition of "big", I'd recommend creating a List<String> for the destination file: read the original file line by line and add to that list; once the list processing is complete (your question is unclear about what should really happen), write each String to the other file, flush and close, and if that succeeds, rename it to the original.
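A minimal sketch of that write-then-rename approach (the class and method names are made up; it assumes the file lives in a real directory so a temp file can be created next to it):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;

public class SafeReplace {
    public static void replaceLine(Path original, String search, String replacement) throws IOException {
        // read and transform in memory
        List<String> updated = Files.readAllLines(original).stream()
                .map(line -> line.contains(search) ? replacement : line)
                .collect(Collectors.toList());

        // write to a temporary file first, then swap it in only if the write succeeded
        Path temp = Files.createTempFile(original.getParent(), "replace", ".tmp");
        Files.write(temp, updated);
        // note: ATOMIC_MOVE is not supported on every file system
        Files.move(temp, original, StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
    }
}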
If you want to append strings, FileOutputStream has an alternate constructor that you can pass true to, so the file is opened for appending.
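For illustration (the file name is made up), the append flag is the second constructor argument:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class AppendExample {
    public static void main(String[] args) throws IOException {
        // true as the second argument opens the file in append mode
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(
                new FileOutputStream("notes.txt", true), StandardCharsets.UTF_8))) {
            out.println("new line at the bottom of the file");
        }
    }
}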
If you'd like, say, to replace strings in a file without copying it, your best bet would be to rely on RandomAccessFile instead. However, if the line length varies, this is unreliable. For fixed-length records, it should work as follows:
Move to the offset
Write
You can also 'truncate' (via setLength), so if there's a trailing block you need to get rid of, you can discard it that way.
A third solution would be to rely on mmap. This requires a memory-mapped ByteBuffer for the whole file. I'm not considering the overall feasibility of the solution (it works in plain C), but it actually 'looks' more correct if you consider both the Java platform and the operating system.
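Not from the original answer, but a rough sketch of the memory-mapped approach (the file name and offset are made up; like the RandomAccessFile approach, it only works cleanly when the new bytes are the same length as the old ones):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MmapOverwrite {
    public static void main(String[] args) throws Exception {
        // map the whole file read/write and overwrite bytes in place
        try (RandomAccessFile raf = new RandomAccessFile("data.txt", "rw");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, channel.size());
            byte[] replacement = "NEW".getBytes(StandardCharsets.US_ASCII);
            buffer.position(42);          // offset of the bytes to overwrite (made up)
            buffer.put(replacement);      // changes are written back to the file
        }
    }
}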
What would be the most efficient way to split a file in Java?
For example, to get it grid-ready...
(Edit)
Modifying the question.
Basically after scouring the net I understand that there are generally two methods followed for file splitting....
Just split them by the number of bytes
I guess the advantage of this method is that it is fast, but say I have all the data on one line, and suppose the split puts half the data in one piece and the other half in another piece; then what do I do?
Read them line by line
This will keep my data intact, fine, but I suppose it isn't as fast as the above method.
Well, just read the file line by line and start saving it to a new file. Then when you decide it's time to split, start saving the lines to a new place.
Don't worry about efficiency too much unless it's a real problem later.
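A rough sketch of that line-by-line rolling split (the chunk size, file names, and class name are assumptions for illustration):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineSplitter {
    private static final int LINES_PER_CHUNK = 100_000;

    public static void main(String[] args) throws IOException {
        int lineCount = 0;
        int chunkIndex = 0;
        PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("chunk-0.txt")));
        try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (lineCount > 0 && lineCount % LINES_PER_CHUNK == 0) {
                    out.close();                       // time to split: roll to a new file
                    chunkIndex++;
                    out = new PrintWriter(Files.newBufferedWriter(Paths.get("chunk-" + chunkIndex + ".txt")));
                }
                out.println(line);
                lineCount++;
            }
        } finally {
            out.close();
        }
    }
}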
My first impression is that you have something like a comma-separated values (CSV) file. The usual way to read/parse those files is to:
read them line by line
skip headers and empty lines
use String#split(String reg) to split a line into values (reg is chosen to match the delimiter)
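A small sketch of that recipe (the file name and delimiter are assumptions):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SimpleCsvRead {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("data.csv"))) {
            String line = reader.readLine();           // skip the header line
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;                          // skip empty lines
                }
                String[] values = line.split(",");     // reg chosen to match the delimiter
                System.out.println(values.length + " values: " + String.join(" | ", values));
            }
        }
    }
}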