Processing and splitting large files with Java 8 - java

I'm new to Java 8 and have just started using the NIO package for file handling. I need help with processing large files -- varying from 100,000 to 1,000,000 lines per file -- by transforming each line into a specific format and writing the formatted lines to new files. Each generated file must contain at most 100,000 lines. So:
if I have a 500,000-line file for processing, I must transform those
lines and distribute and print them on 5 new files.
if I have a 745,000-line file for processing, I must transform those
lines and print them on 8 new files.
I'm having a hard time figuring out an approach that will efficiently utilize the new features of Java 8. I've started out with determining the number of new files to be generated based on the line count of the large file, and then creating those new empty files:
Path largeFile = Paths.get("path/to/file");
long recordCount = Files.lines(largeFile).count();
int maxRecordOfNewFiles = 100000;
int numberOfNewFiles = 1;
if (recordCount > maxRecordOfNewFiles) {
    numberOfNewFiles = Math.toIntExact(recordCount / maxRecordOfNewFiles);
    if (Math.toIntExact(recordCount % maxRecordOfNewFiles) > 0) {
        numberOfNewFiles++;
    }
}
IntStream.rangeClosed(1, numberOfNewFiles).forEach((i) -> {
    try {
        Path newFile = Paths.get("path/to/newFiles/newFile" + i + ".txt");
        Files.createFile(newFile);
    } catch (IOException iOex) {
        // ignored in this attempt
    }
});
But as I go through the lines of the large file with Files.lines(largeFile).forEach(...), I get lost on how to proceed: how to format the first 100,000 lines and print them to the first of the new files, then the second batch of 100,000 to the second new file, and so on.
Any help will be appreciated. :)

When you start designing batch processes, consider using a framework specialized in that: you may eventually want to handle restarts, scheduling, and so on. Spring Batch is very good for this and already provides what you want: MultiResourceItemWriter writes to multiple files with a maximum number of items per file, and FlatFileItemReader reads data from a file.
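As a rough sketch only (setter names as in recent Spring Batch versions; the Job/Step wiring and the per-line transformation in an ItemProcessor are left out, and the "newFile" output prefix is just an assumption), the reader/writer pair could be configured like this:

FlatFileItemReader<String> reader = new FlatFileItemReader<>();
reader.setResource(new FileSystemResource("path/to/file"));
reader.setLineMapper(new PassThroughLineMapper());

// the delegate does the actual writing to the current output file
FlatFileItemWriter<String> delegate = new FlatFileItemWriter<>();
delegate.setLineAggregator(new PassThroughLineAggregator<>());

// rolls over to a new resource every 100,000 items
MultiResourceItemWriter<String> writer = new MultiResourceItemWriter<>();
writer.setResource(new FileSystemResource("path/to/newFiles/newFile"));
writer.setItemCountLimitPerResource(100_000);
writer.setDelegate(delegate);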
In this case, what you want is to loop over each line of an input file and write a transformation of each line to multiple output files.
One way to do that is to create a Stream over the lines of the input file, map each line and send it to a custom writer. This custom writer would implement the logic of switching writers when it has reached the maximum number of lines per file.
In the following code, MyWriter opens a BufferedWriter to a file. When maxLines is reached (i.e. the number of lines written is a multiple of it), that writer is closed and another one is opened, incrementing currentFile. This way, the fact that we're writing to multiple files is transparent to the reading side.
public static void main(String[] args) throws IOException {
    try (
        MyWriter writer = new MyWriter(10);
        Stream<String> lines = Files.lines(Paths.get("path/to/file"));
    ) {
        lines.map(l -> /* do transformation here */ l).forEach(writer::write);
    }
}

private static class MyWriter implements AutoCloseable {
    private long count = 0, currentFile = 1, maxLines = 0;
    private BufferedWriter bw = null;

    public MyWriter(long maxLines) {
        this.maxLines = maxLines;
    }

    public void write(String line) {
        try {
            if (count % maxLines == 0) {
                // close the current file (if any) and open the next one
                close();
                bw = Files.newBufferedWriter(Paths.get("path/to/newFiles/newFile" + currentFile++ + ".txt"));
            }
            bw.write(line);
            bw.newLine();
            count++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() throws IOException {
        if (bw != null) bw.close();
    }
}

From what I understand of the question, a simple way could be:
BufferedReader buff = new BufferedReader(new FileReader(new File("H:\\Docs\\log.txt")));
Pair<Integer, BufferedWriter> ans = buff.lines().reduce(new Pair<Integer, BufferedWriter>(0, null), (count, line) -> {
    try {
        BufferedWriter w;
        if (count.getKey() % 1000 == 0) {
            if (count.getValue() != null) count.getValue().close();
            w = new BufferedWriter(new FileWriter(new File("f" + count.getKey() + ".txt")));
        } else w = count.getValue();
        w.write(line + "\n"); // do something
        return new Pair<>(count.getKey() + 1, w);
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}, (x, y) -> {
    throw new RuntimeException("Not supported");
});
ans.getValue().close();

Related

Remove word after line from txt file is read

I have this code which is used to read lines from a file and insert them into Postgres:
try {
    BufferedReader reader;
    try {
        reader = new BufferedReader(new FileReader(
                "C:\\in_progress\\test.txt"));
        String line = reader.readLine();
        while (line != null) {
            System.out.println(line);
            Thread.sleep(100);
            Optional<ProcessedWords> isFound = processedWordsService.findByKeyword(line);
            if (!isFound.isPresent()) {
                ProcessedWords obj = ProcessedWords.builder()
                        .keyword(line)
                        .createdAt(LocalDateTime.now())
                        .build();
                processedWordsService.save(obj);
            }
            // read next line
            line = reader.readLine();
        }
        reader.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
} catch (Exception e) {
    e.printStackTrace();
}
How can I remove a line from the file after the line has been inserted into the SQL database?
The issues with the current code:
Adhere to the Single Responsibility Principle. Your code is doing too many things: it reads from a file, performs the findByKeyword() call, prepares the data and hands it off to be stored in the database. It can hardly be tested thoroughly, and it's very difficult to maintain.
Always use try-with-resources to get your resources closed under any circumstances.
Don't catch the general Exception type - your code should only catch those exceptions which are more or less expected and for which there's a clear scenario for handling them. Don't catch all exceptions.
How can I remove a line from the file after the line has been inserted into the SQL database?
It is not possible to remove a line from a file in the literal sense. You can overwrite the contents of the file or replace it with another file.
My advice would be to read the file data into memory, process it, and then write the lines which should be retained back into the same file (i.e. overwrite the file contents).
You could argue that the file is huge and reading it into memory would result in an OutOfMemoryError, and that you want to read a line, process it somehow, store the processed data in the database and then write the line back to a file... so that everything is done line by line, all actions in one go for a single line, and as a consequence all the code is crammed into one method. I hope that's not the case, because otherwise it's a clear XY problem.
Firstly, the file system isn't a reliable means of storing data, and it's not very fast. If the file is massive, reading and rewriting it will take a considerable amount of time, and if all of that is done just to use a tiny bit of information, then this approach is wrong - the information should be stored and structured differently (e.g. in a database) so that the required data can be retrieved easily, and entries that are no longer needed can be removed without any problem.
But if the file is small and doesn't contain critical data, then that's totally fine; I will proceed assuming that's the case.
The overall approach is to generate a Map<String, Optional<ProcessedWords>> based on the file contents, process the non-empty optionals and prepare a list of lines to overwrite the previous file contents.
The code below is based on the NIO2 file system API.
public void readProcessAndRemove(ProcessedWordsService service, Path path) {
    Map<String, Optional<ProcessedWords>> result;
    try (var lines = Files.lines(path)) {
        result = processLines(service, lines);
    } catch (IOException e) {
        result = Collections.emptyMap();
        // log the exception with your logger of choice
        e.printStackTrace();
    }
    List<String> linesToRetain = prepareAndSave(service, result);
    writeToFile(linesToRetain, path);
}
Processing the stream of lines returned by Files.lines():
private static Map<String, Optional<ProcessedWords>> processLines(ProcessedWordsService service,
                                                                  Stream<String> lines) {
    return lines.collect(Collectors.toMap(
        Function.identity(),
        service::findByKeyword
    ));
}
Saving the words for which findByKeyword() returned an empty optional:
private static List<String> prepareAndSave(ProcessedWordsService service,
                                           Map<String, Optional<ProcessedWords>> wordByLine) {
    wordByLine.forEach((k, v) -> {
        if (v.isEmpty()) saveWord(service, k);
    });
    return getLinesToRetain(wordByLine);
}
private static void saveWord(ProcessedWordsService service, String line) {
    ProcessedWords obj = ProcessedWords.builder()
        .keyword(line)
        .createdAt(LocalDateTime.now())
        .build();
    service.save(obj);
}
Generating a list of lines to retain:
private static List<String> getLinesToRetain(Map<String, Optional<ProcessedWords>> wordByLine) {
    return wordByLine.entrySet().stream()
        .filter(entry -> entry.getValue().isPresent())
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
}
Overwriting the file contents using Files.write(). Note: since the varargs OpenOption parameter isn't given any arguments, this call is treated as if the CREATE, TRUNCATE_EXISTING, and WRITE options were present.
private static void writeToFile(List<String> lines, Path path) {
    try {
        Files.write(path, lines);
    } catch (IOException e) {
        // log the exception with your logger of choice
        e.printStackTrace();
    }
}
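For reference, the same call with those default options spelled out explicitly would be:

Files.write(path, lines,
        StandardOpenOption.CREATE,
        StandardOpenOption.TRUNCATE_EXISTING,
        StandardOpenOption.WRITE);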
For Reference
import java.io.*;

public class RemoveLinesFromAfterProcessed {
    public static void main(String[] args) throws Exception {
        String fileName = "TestFile.txt";
        String tempFileName = "tempFile";
        File mainFile = new File(fileName);
        File tempFile = new File(tempFileName);
        try (BufferedReader br = new BufferedReader(new FileReader(mainFile));
             PrintWriter pw = new PrintWriter(new FileWriter(tempFile))
        ) {
            String line;
            while ((line = br.readLine()) != null) {
                if (toProcess(line)) { // #1
                    // process the line and add it to DB
                    // ignore the line (i.e. do not add it to the temp file)
                } else {
                    // add to temp file.
                    pw.write(line + "\n"); // #2
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        // delete the old file
        boolean hasDeleted = mainFile.delete(); // #3
        if (!hasDeleted) {
            throw new Exception("Can't delete file!");
        }
        boolean hasRenamed = tempFile.renameTo(mainFile); // #4
        if (!hasRenamed) {
            throw new Exception("Can't rename file!");
        }
        System.out.println("Done!");
    }

    private static boolean toProcess(String line) {
        // any condition
        // sample condition for example
        return line.contains("aa");
    }
}
Read the file.
1: The condition to decide whether to delete the line or to retain it.
2: Write those lines which you don't want to delete into the temporary file.
3: Delete the original file.
4: Rename the temporary file to the original file name.
The basic idea is the same as what @Shiva Rahul said in his answer.
However, another approach can be: store all the line numbers you want to delete in a list. Once you have all the required line numbers, you can use a LineNumberReader to check each line and duplicate your main file.
Mostly I have used this technique for batch inserts where I was unsure how many lines a particular file might have, and a lot of processing had to be done before removing lines.
It may not be suitable for your case; I'm just posting the suggestion here in case anyone bumps into this thread.
private void deleteLines(String inputFilePath, String outputDirectory, List<Integer> lineNumbers) throws IOException {
    File tempFile = new File("temp.txt");
    File inputFile = new File(inputFilePath);
    // using LineNumberReader we can fetch the line number of each line
    LineNumberReader lineReader = new LineNumberReader(new FileReader(inputFile));
    // writer for writing the lines into the new file
    BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(tempFile));
    String currentLine;
    while ((currentLine = lineReader.readLine()) != null) {
        // if the current line number is present in removeList then put an empty line in the new file
        if (lineNumbers.contains(lineReader.getLineNumber())) {
            currentLine = "";
        }
        bufferedWriter.write(currentLine + System.getProperty("line.separator"));
    }
    // closing statements
    bufferedWriter.close();
    lineReader.close();
    // delete the main file and rename the temp file to the original file name
    boolean delete = inputFile.delete();
    // boolean b = tempFile.renameTo(inputFile); // use this to save the temp file in the same directory
    boolean b = tempFile.renameTo(new File(outputDirectory + inputFile.getName()));
}
To use this function, all you have to do is gather all the required line numbers. inputFilePath is the path of the source file and outputDirectory is where I want to store the file after processing.
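For example (the paths and line numbers below are just placeholders):

deleteLines("C:/data/input.txt", "C:/data/output/", Arrays.asList(2, 5, 9));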

Spring Boot java: Process/Compare lines of very large file

I have this app where I process a very large file: I extract the lines that have the same first 5 characters (I call this currentlineId), use them to create an object and do something with it. Here is a sample of the file contents:
AZDFS12345678998765432345678
AZDFS09876545432345678987654
AZDFS34568987654567890987654
AZDFS12345670987654345678998
AZDFS12345098734567765123456
// the lines above have the same first 5 characters, they create Object1.
FGHJUY121324
FGHJUY089909
FGHJUYTTUTUU
//same for the lines above, they create Object2.
NB: the lines will always be in an order where lines with the same first 5 characters are together (above/below each other), so I won't have lines all over the place.
My current function code:
private void processScpFile(File file) {
    LOGGER.info("Processing File: {} ", file.getName());
    try (var br = new BufferedReader(new FileReader(file))) {
        String currentLine;
        String lastLineId = null;
        List<String> similarLineIdsList = new ArrayList<>();
        while ((currentLine = br.readLine()) != null) {
            if (StringUtils.isEmpty(lastLineId)) {
                lastLineId = currentLine.substring(0, 5);
            }
            if (lastLineId.equals(currentLine.substring(0, 5))) {
                similarLineIdsList.add(currentLine);
            } else if (!lastLineId.equals(currentLine.substring(0, 5))) {
                doSomethinsWithTheList(similarLineIdsList);
                similarLineIdsList.clear();
                similarLineIdsList.add(currentLine);
                lastLineId = currentLine.substring(0, 5);
            }
        }
        doSomethinsWithTheList(similarLineIdsList);
    } catch (IOException e) {
        LOGGER.error("Couldn't read file, {}", e.getMessage(), e);
    }
}
Now this has worked well up until now, but going forward I have to process files where I'd have, for instance, over 100k lines with the same first 5 characters, which makes this process very slow.
Please, do you have any suggestion on how to make this process faster? Thank you.
Edit: just to be precise, it's generating the list of lines with the same first 5 characters that gets slower as the number of similar lines grows.
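There is no accepted answer recorded here, but one direction worth profiling (a hedged sketch only; the GroupConsumer callback below is a made-up name, not something from the thread) is to stop accumulating each group in a list and instead push every line to the downstream consumer as it is read, signalling group boundaries whenever the 5-character prefix changes:

// hypothetical callback interface, invented for this sketch
interface GroupConsumer {
    void startGroup(String lineId);
    void line(String line);
    void endGroup();
}

private void processScpFile(File file, GroupConsumer consumer) {
    try (var br = new BufferedReader(new FileReader(file))) {
        String currentLine;
        String lastLineId = null;
        while ((currentLine = br.readLine()) != null) {
            String lineId = currentLine.substring(0, 5);
            if (!lineId.equals(lastLineId)) {
                if (lastLineId != null) consumer.endGroup();
                consumer.startGroup(lineId);
                lastLineId = lineId;
            }
            consumer.line(currentLine);
        }
        if (lastLineId != null) consumer.endGroup();
    } catch (IOException e) {
        LOGGER.error("Couldn't read file, {}", e.getMessage(), e);
    }
}

This keeps memory usage flat regardless of group size; whether it actually helps depends on what doSomethinsWithTheList does with the list today.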

Creating an inverted index with limited memory in java

I'm curious how to create an inverted index on data that doesn't fit into memory. Right now I'm reading a file directory and indexing the files based on the contents inside each file, and I am using a HashMap to store the index. The code below is a snippet from a function I use, and I call the function on an entire directory. What do I do if this directory is just massive and the HashMap can't fit all the entries? Yes, this does sound like premature optimization; I'm just having fun. I don't want to use Lucene, so don't even mention it, because I'm tired of seeing it as the majority answer to "index" questions. This HashMap is my only constraint; everything else is stored in files so I can easily reference things later on.
I'm just curious how I can do this, since it stores the index in the map like so:
keyword -> file1,file2,file3,etc..(locations)
keyword2 -> file9,file11,file13,etc..(locations)
My thought was to create a file which would somehow be able to update itself to match the format above, but I feel that's not efficient.
Code Snippet
br = new BufferedReader(new FileReader(file));
while ((line = br.readLine()) != null) {
    for (String _word : line.split("\\W+")) {
        word = _word.toLowerCase();
        if (!ignore_words.contains(word)) {
            fileLocations = index.get(word);
            if (fileLocations == null) {
                fileLocations = new LinkedList<Long>();
                index.put(word, fileLocations);
            }
            fileLocations.add(file_offset);
        }
    }
}
br.close();
Update:
So I managed to come up with something, but performance-wise I feel this is slow, especially if there is a large amount of data. I basically created a file that has the word and its offset on each line where the word appeared. Let's name it index.txt.
It has a format like so:
word1:offset
word2:offset
word1:offset <-encountered again.
word3:offset
etc...
I then created a separate file for each word and appended the offset to that file each time the word was encountered in the index.txt file.
So basically the format of the word files is like so:
word1.txt -- Format
word1:offset1:offset2:offset3:offset4...and so on
Each time word1 is encountered in the index.txt file, its offset is appended to the end of the word1.txt file.
Then finally, I go through all the word files I created and overwrite the index.txt file, with the final output in the index file looking like so:
word1:offset1:offset2:offset3:offset4:...
word2:offset9:offset11:offset13:offset14:...
etc..
Then to finish it up, I delete all the word files.
The nasty code snippet for this is below; it's a fair amount of code.
public void createIndex(String word, long file_offset) {
    PrintWriter writer;
    try {
        writer = new PrintWriter(new FileWriter(this.file, true));
        writer.write(word + ":" + file_offset + "\n");
        writer.close();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
}

public void mergeFiles() {
    String line;
    String wordLine;
    String[] contents;
    String[] wordContents;
    BufferedReader reader;
    BufferedReader mergeReader;
    PrintWriter writer;
    PrintWriter mergeWriter;
    try {
        reader = new BufferedReader(new FileReader(this.file));
        while ((line = reader.readLine()) != null) {
            contents = line.split(":");
            writer = new PrintWriter(new FileWriter(
                    new File(contents[0] + ".txt"), true));
            if (this.words.get(contents[0]) == null) {
                this.words.put(contents[0], contents[0]);
                writer.write(contents[0] + ":");
            }
            writer.write(contents[1] + ":");
            writer.close();
        }
        // This could be put in its own method below.
        mergeWriter = new PrintWriter(new FileWriter(this.file));
        for (String word : this.words.keySet()) {
            mergeReader = new BufferedReader(
                    new FileReader(new File(word + ".txt")));
            while ((wordLine = mergeReader.readLine()) != null) {
                mergeWriter.write(wordLine + "\n");
            }
        }
        mergeWriter.close();
        deleteFiles();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
}

public void deleteFiles() {
    File toDelete;
    for (String word : this.words.keySet()) {
        toDelete = new File(word + ".txt");
        if (toDelete.exists()) {
            toDelete.delete();
        }
    }
}
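No answer is recorded for this question here, but since the follow-up is about avoiding the per-line file churn above, here is a hedged sketch of the classic alternative (names such as SpillingIndexer, RUN_LIMIT and spillRun are invented for illustration): buffer postings in a sorted map, spill each full buffer to a sorted "run" file, and merge the runs at the end.

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class SpillingIndexer {
    private static final int RUN_LIMIT = 100_000;             // postings buffered per run (assumed)
    private final TreeMap<String, List<Long>> buffer = new TreeMap<>();
    private final List<Path> runs = new ArrayList<>();
    private int postings = 0;

    public void add(String word, long fileOffset) throws IOException {
        buffer.computeIfAbsent(word, w -> new ArrayList<>()).add(fileOffset);
        if (++postings >= RUN_LIMIT) {
            spillRun();
        }
    }

    private void spillRun() throws IOException {
        Path run = Files.createTempFile("run", ".txt");
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(run))) {
            // TreeMap iterates in key order, so every run file is sorted by word
            buffer.forEach((word, offsets) -> {
                StringBuilder sb = new StringBuilder(word);
                offsets.forEach(o -> sb.append(':').append(o));
                out.println(sb);
            });
        }
        runs.add(run);
        buffer.clear();
        postings = 0;
    }

    // A final merge step would open all run files, read them line by line and
    // concatenate the offset lists of equal words (a k-way merge), which never
    // needs more memory than roughly one line per run.
}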

java split string[] array to multiple files

I'm having a problem figuring out how to split a string to multiple files. At the moment I should get two files both with JSON data. The code below writes to the first file but leaves the second empty. Any ideas why?
public void splitFile(List<String> results) throws IOException {
    int name = 0;
    for (int i = 0; i < results.size(); i++) {
        write = new FileWriter("/home/tom/files/" + name + ".json");
        out = new BufferedWriter(write);
        out.write(results.get(i));
        if (results.get(i).startsWith("}")) {
            name++;
        }
    }
}
Edit: it splits at a line starting with } because that denotes the end of a JSON document.
Enhance the cut-control
Bring together this:
write = new FileWriter("/home/tom/files/"+ name +".json");
out = new BufferedWriter(write);
and this:
name++;
Check for the start, not the end
Check for a line starting with {, and execute those three lines to open the file.
Remember to close and flush
If it's not the first one (a writer is already open), close the previous writer first (out.close();).
Close the last opened writer
if (!results.isEmpty())
out.close();
Result
It should look something like this:
public void splitFile(List<String> results) throws IOException {
    int name = 0;
    BufferedWriter out = null;
    for (int i = 0; i < results.size(); i++) {
        String line = results.get(i);
        if (line.startsWith("{")) {
            if (out != null) // it's not the first
                out.close(); // tell buffered it's going to close, it makes it flush
            FileWriter writer = new FileWriter("/home/tom/files/" + name + ".json");
            out = new BufferedWriter(writer);
            name++;
        }
        if (out == null)
            throw new IllegalArgumentException("first line doesn't start with {");
        out.write(line);
    }
    if (out != null) // there was at least one file
        out.close();
}
I would close your buffered writer after each completed write sequence, i.e. after each iteration through the loop, before you assign write to a new FileWriter.
Closing the BufferedWriter will close the underlying FileWriter, and consequently force a flush on the data written to the disk.
Note: if you're using a distinct FileWriter per loop iteration, then I'd scope that variable to the loop, e.g.
FileWriter write = new FileWriter("/home/tom/files/"+ name +".json");
The same goes for the BufferedWriter. In fact you can write:
BufferedWriter outer = new BufferedWriter(new FileWriter(...
and just deal with outer.
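A hedged sketch of that suggestion follows; note the second true argument to FileWriter is an assumption added here so that several writes to the same file name append to one document instead of overwriting each other:

for (int i = 0; i < results.size(); i++) {
    // append mode, so repeated writes to the same name accumulate in one file
    try (BufferedWriter out = new BufferedWriter(
            new FileWriter("/home/tom/files/" + name + ".json", true))) {
        out.write(results.get(i));
        out.newLine();
    } // try-with-resources flushes and closes the writer here
    if (results.get(i).startsWith("}")) {
        name++;
    }
}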
Try the following code:
public void splitFile(List<String> results) throws IOException {
    int name = 0;
    for (int i = 0; i < results.size(); i++) {
        write = new FileWriter("/home/tom/files/" + name + ".json");
        out = new BufferedWriter(write);
        out.write(results.get(i));
        out.flush();
        out.close(); // you have to close your stream every time in your case.
        if (results.get(i).startsWith("}")) {
            name++;
        }
    }
}

Java - Scanner not scanning after a certain number of lines

I'm doing some relatively simple I/O in Java. I have a .txt file that I'm reading from using a Scanner and a .txt file I'm writing to using a BufferedWriter. Another Scanner then reads that file and another BufferedWriter then creates another .txt file. I've provided the code below just in case, but I don't know if it will help too much, as I don't think the code is the issue here. The code compiles without any errors, but it's not doing what I expect it to. For some reason, charReader will only read about half of its file, then hasNext() will return false, even though the end of the file hasn't been reached. These aren't big text files - statsReader's file is 34 KB and charReader's file is 29 KB, which is even weirder, because statsReader reads its entire file fine, and it's bigger! Also, I do have that code surrounded in a try/catch, I just didn't include it.
From what I've looked up online, this may happen with very large files, but these are quite small, so I'm pretty lost.
My OS is Windows 7 64-bit.
Scanner statsReader = new Scanner(statsFile);
BufferedWriter statsWriter = new BufferedWriter(new FileWriter(outputFile));
while (statsReader.hasNext()) {
    statsWriter.write(statsReader.next());
    name = statsReader.nextLine();
    temp = statsReader.nextLine();
    if (temp.contains("form")) {
        name += " " + temp;
        temp = statsReader.next();
    }
    statsWriter.write(name);
    statsWriter.newLine();
    statsWriter.write(temp);
    if (!(temp = statsReader.next()).equals("-"))
        statsWriter.write("/" + temp);
    statsWriter.write("\t");
    statsWriter.write(statsReader.nextInt() + "\t");
    statsWriter.write(statsReader.nextInt() + "\t");
    statsWriter.write(statsReader.nextInt() + "\t");
    statsWriter.write(statsReader.nextInt() + "\t");
    statsWriter.write(statsReader.nextInt() + "\t");
    statsWriter.write(statsReader.nextInt() + "");
    statsWriter.newLine();
    statsReader.nextInt();
}
Scanner charReader = new Scanner(charFile);
BufferedWriter codeWriter = new BufferedWriter(new FileWriter(codeFile));
while (charReader.hasNext()) {
    color = charReader.next();
    name = charReader.nextLine();
    name = name.replaceAll("\t", "");
    typing = charReader.next();
    place = charReader.nextInt();
    area = charReader.nextInt();
    def = charReader.nextInt();
    shape = charReader.nextInt();
    size = charReader.nextInt();
    spe = charReader.nextInt();
    index = typing.indexOf('/');
    if (index == -1) {
        typeOne = determineType(typing);
        typeTwo = '0';
    } else {
        typeOne = determineType(typing.substring(0, index));
        typeTwo = determineType(typing.substring(index + 1, typing.length()));
    }
}
SSCCE:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

public class Tester {
    public static void main(String[] args) {
        File statsFile = new File("stats.txt");
        File testFile = new File("test.txt");
        try {
            Scanner statsReader = new Scanner(statsFile);
            BufferedWriter statsWriter = new BufferedWriter(new FileWriter(testFile));
            while (statsReader.hasNext()) {
                statsWriter.write(statsReader.nextLine());
                statsWriter.newLine();
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This is a classic problem: You need to flush and close the output stream (in this case statsWriter) before reading the file.
Being buffered, it doesn't actually write to the file with every call to write. Calling flush forces it to complete any pending write operations.
Here's the javadoc for OutputStream.flush():
Flushes this output stream and forces any buffered output bytes to be written out. The general contract of flush is that calling it is an indication that, if any bytes previously written have been buffered by the implementation of the output stream, such bytes should immediately be written to their intended destination.
After you have written your file with your statsWriter, you need to call:
statsWriter.flush();
statsWriter.close();
or simply:
statsWriter.close(); // this will call flush();
This is because you are using a BufferedWriter: it does not write everything out to the file as you call the write functions, but rather in buffered chunks. When you call flush() and close(), it empties all the content it still has in its buffer out to the file, and closes the stream.
You will need to do the same for your second writer.
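On Java 7 and later, a try-with-resources layout (a minimal sketch based on the SSCCE above) avoids having to remember the explicit flush()/close() calls, because the writer is flushed and closed automatically when the block exits:

try (Scanner statsReader = new Scanner(statsFile);
     BufferedWriter statsWriter = new BufferedWriter(new FileWriter(testFile))) {
    while (statsReader.hasNext()) {
        statsWriter.write(statsReader.nextLine());
        statsWriter.newLine();
    }
} // statsWriter is flushed and closed here, before anything reads testFile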
