How can I process a large file via CSVParser? - java

I have a large .csv file (about 300 MB) that is read from a remote host and parsed into a target file, but I don't need to copy all of its lines to the target file. While copying, I need to read each line from the source and, if it passes some predicate, write the line to the target file.
I suppose that Apache Commons CSV (org.apache.commons.csv) can only parse the whole file:
CSVFormat csvFileFormat = CSVFormat.EXCEL.withHeader();
CSVParser csvFileParser = new CSVParser(new FileReader("filePath"), csvFileFormat);
List<CSVRecord> csvRecords = csvFileParser.getRecords();
so I can't use a BufferedReader. Based on my code, a new CSVParser() instance would have to be created for each line, which looks inefficient.
How can I parse a single line (with a known table header) in the case above?

No matter what you do, all of the data in the file has to come over to your local machine, because your system needs to parse it to determine which lines are valid. Whether the file arrives line by line through the parser or you copy the whole file over first and parse it afterwards, everything crosses the wire; you will need to get the data local and then trim the excess.
Calling csvFileParser.getRecords() is already a lost battle, because the documentation explains that this method loads every row of your file into memory. To parse the records while conserving memory, iterate over them instead; according to the documentation, the following code loads only one record into memory at a time:
try (CSVParser csvFileParser = CSVParser.parse(new File("filePath"), StandardCharsets.UTF_8, csvFileFormat)) {
    for (CSVRecord csvRecord : csvFileParser) {
        // qualify the csvRecord; output the qualified row to the new file and flush as needed.
    }
}
Since you explained that "filePath" is not local, the above solution is prone to failure due to connectivity issues. To eliminate them, I recommend copying the entire remote file to a local location, verifying the copy by comparing checksums, parsing the local copy to create your target file, and then deleting the local copy once you are done.
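For completeness, here is a minimal sketch of the filter-and-copy step, assuming the remote file has already been copied to a local "local.csv" and using a hypothetical isQualified predicate; it streams one record at a time and writes the qualifying rows with a CSVPrinter:
import java.io.Reader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

public class CsvFilter {
    public static void main(String[] args) throws Exception {
        CSVFormat format = CSVFormat.EXCEL.withHeader();
        try (Reader in = Files.newBufferedReader(Paths.get("local.csv"));
             Writer out = Files.newBufferedWriter(Paths.get("target.csv"));
             CSVParser parser = new CSVParser(in, format);
             CSVPrinter printer = new CSVPrinter(out, CSVFormat.EXCEL)) {
            printer.printRecord(parser.getHeaderMap().keySet()); // copy the header row
            for (CSVRecord record : parser) {                    // one record in memory at a time
                if (isQualified(record)) {
                    printer.printRecord(record);
                }
            }
        }
    }

    // Hypothetical predicate; replace with the real qualification logic.
    private static boolean isQualified(CSVRecord record) {
        return !record.get(0).isEmpty();
    }
}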

This is a late response, but you CAN use a BufferedReader with the CSVParser:
try (BufferedReader reader = new BufferedReader(new FileReader(fileName), 1048576 * 10)) {
    Iterable<CSVRecord> records = CSVFormat.RFC4180.parse(reader);
    for (CSVRecord line : records) {
        // process each line here
    }
} catch (IOException e) {
    // handle exceptions from your BufferedReader and parser here
}

Related

BufferedReader still reads from a file even after the file has been deleted

I ran a small experiment reading a file with a BufferedReader, and I wanted to see what would happen if I called the delete method on the file before the read was complete. Given that BufferedReader only reads a chunk of the file at a time, I expected the operation to fail, but to my surprise the read was successful.
Here is the code I used:
val file = File("test.txt")
val bufferedReader = BufferedReader(InputStreamReader(FileInputStream(file)), 1)
if (file.delete())
println("file deleted successfully")
println(bufferedReader.readLines().size)
I used a relatively big file for the test (around 300 MB in size), and I also set the buffer size to the minimum possible value; the execution returns this:
file deleted successfully
1303692
Did I misunderstand something here? Could someone please explain this behavior?
The motivation behind this experiment is that I have a method in my application that returns a sequence of all lines in a temporary file, and I wanted to remove the temporary file once all lines were read, like this:
fun getTempFileLines(): Sequence<String> {
val file = File("temp.txt")
val bufferedReader = BufferedReader(InputStreamReader(FileInputStream(file)))
val sequenceOfLines = generateSequence {
bufferedReader.readLine()
}
file.delete()
return sequenceOfLines
}
From https://docs.oracle.com/javase/8/docs/api/java/io/BufferedReader.html
"Reads text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines."
So while the directory entry may already be deleted, the BufferedReader still holds buffered content, and the underlying FileInputStream still holds an open file descriptor; on Unix-like systems the file's data is not reclaimed until the last open descriptor is closed, which is why the read keeps succeeding.
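If the goal is simply to make sure the temporary file is removed once all lines have been consumed, one option (shown here as a Java sketch with an assumed file name, at the cost of materializing the lines in memory) is to read everything while the reader is open and delete the file only after the reader has been closed:
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class TempFileLines {
    // Assumed temp file name; the point is that the delete happens only after
    // the reader (and its file descriptor) has been closed.
    static List<String> getTempFileLines() throws IOException {
        Path path = Paths.get("temp.txt");
        try (BufferedReader reader = Files.newBufferedReader(path)) {
            return reader.lines().collect(Collectors.toList());
        } finally {
            Files.deleteIfExists(path);
        }
    }
}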

Java | method to write a datahandler to a file takes more time than expected

I am trying to read mails from MS Exchange using Camel and getting the attachments as a DataHandler. A 10 MB file takes around 3 hours to write to the target location.
File outputFile = new File(someDirectory, someFileName);
DataHandler attachment_data = destination1Attachments.get("someFileName.txt");
try (FileOutputStream fos = new FileOutputStream(outputFile)) {
attachment_data.writeTo(fos);
}
I have also noticed that a 6 to 7 MB file sometimes takes around 2 to 3 minutes, and when another mail arrives just after that, it takes even more time than expected.
Is this because of GC?
I am trying to find the exact root cause, or any other method to write the data to the file.
Update 1:
Tried wrapping the FileOutputStream in a BufferedOutputStream as mentioned by #user207421 in the comments. Not much change could be found (just a second or so).
This could be due to the default implementation of the write mechanism behind
attachment_data.writeTo(fos);
If DataHandler.getDataSource() != null, then this theory applies:
in that implementation, only 8 bytes are read at a time and written to the stream. The large number of reads and writes might be causing the issue.
Try reading on your own from DataHandler.getInputStream() and writing to the file yourself, increasing how much you read from the input stream per call.
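For example, a minimal sketch of that idea (the method name is hypothetical; attachment_data and outputFile correspond to the objects from the question) that pulls the bytes through getInputStream() with a larger buffer instead of relying on writeTo():
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.activation.DataHandler;

class AttachmentWriter {
    static void writeAttachment(DataHandler attachmentData, File outputFile) throws IOException {
        byte[] buffer = new byte[64 * 1024]; // read 64 KB at a time instead of a few bytes
        try (InputStream in = new BufferedInputStream(attachmentData.getInputStream());
             OutputStream out = new BufferedOutputStream(new FileOutputStream(outputFile))) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}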
One must assume that either the object is loaded entirely into memory or writeTo is very inefficient. Hence, specify the DataFlavor and inspect attachment_data.getTransferDataFlavors().
DataFlavor flavor = new DataFlavor(InputStream.class, "application/octetstream");
try (InputStream in = (InputStream) attachment_data.getTransferData(flavor);
     OutputStream out = new BufferedOutputStream(new FileOutputStream(outputFile))) {
    in.transferTo(out);
}
Some fiddling may be needed.

Java split one line file

I have just realised that I have a file containing only one line with a long string. This file (line) can be 300 MB in size.
I would like to stream some data from this string and save it in another file,
i.e. the line from the file would look like:
String line = "{{[Metadata{"this, is my first, string"}]},{[Metadata{"this, is my second, string"}]},...,{[Metadata{"this, is my 5846 string"}]}}"
Now I would like to take 100 items from this string, from one "Metadata" to another, save them in a file, and continue with the rest of the data.
So in a nutshell, from one line I would like to get N files with, say, 100 Metadata strings each.
BufferedReader reader = new BufferedReader(new StringReader(line));
This is what I've got and I don't know what I can do with the reader.
Probably
reader.read(????)
but I don't know what to put inside :(
Can you please help?
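In case it helps, here is a minimal sketch of one way to do it with plain read() calls (the file names, chunk size, and the simplified "Metadata" marker are assumptions): characters are buffered until 100 markers have been seen, then the chunk is written out and the scan continues, so the whole line is never held in memory at once.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class SplitOneLineFile {
    public static void main(String[] args) throws IOException {
        final String marker = "Metadata";
        final int itemsPerFile = 100;
        try (BufferedReader reader = new BufferedReader(new FileReader("bigOneLine.txt"))) {
            StringBuilder current = new StringBuilder();
            int items = 0, fileIndex = 0, ch;
            while ((ch = reader.read()) != -1) {
                current.append((char) ch);
                // A marker is complete when the buffer now ends with it.
                if (current.length() >= marker.length()
                        && current.indexOf(marker, current.length() - marker.length()) != -1) {
                    items++;
                    if (items > itemsPerFile) {
                        // Flush everything before this marker; keep the marker as the
                        // start of the next chunk.
                        writeChunk(current.substring(0, current.length() - marker.length()), fileIndex++);
                        current.delete(0, current.length() - marker.length());
                        items = 1;
                    }
                }
            }
            if (current.length() > 0) {
                writeChunk(current.toString(), fileIndex);
            }
        }
    }

    private static void writeChunk(String chunk, int index) throws IOException {
        try (Writer out = new FileWriter("chunk-" + index + ".txt")) {
            out.write(chunk);
        }
    }
}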

Efficient way to convert a stream of Strings into grouped lists of strings

I have a function which will receive a Stream<String>. This stream represents the lines in a file (as called by Files.lines(somePath)). The file itself is actually the concatenation of many files into a single file, something like this:
__HEADER__ # for file 1
data
more data
...
__HEADER__ # file 2 starts here
some more data...
...
I need to convert the stream into multiple physical files on the filesystem.
I've tried the simple approach, something along the lines of:
String allLinesJoined = lineStream.collect(Collectors.joining());
// This solution seems to get stuck on the line above ^
String files[] = allLinesJoined.split("__HEADER__");
for (String fileStr : files)
{
// This function will write each fileStr to a separate file
// (filename is determined by contents of fileStr)
writeToPhysicalFile(fileStr);
}
But the input file is about 300 MB (and could get larger), and this solution seems to get stuck on the first line. Maybe it would complete if I had more memory...?
Is there a better way to do this, if my starting point is a Stream<String>, or should I start making other changes so that this bit of code can just read through the file line by line, without using the streaming API?
(the order of the lines does matter, in the context of these files)
tl;dr
I need to turn one big file, represented as a Stream<String>, into many little files. Each little file begins with __HEADER__ and contains all lines after it, up to the next __HEADER__. The current library uses streams to provide the file, but is it even worth trying to do this with streams, or will my life be easier if I change the library to offer non-stream functionality?
Collecting everything into one String first kills the whole idea of streams.
Try forEachOrdered():
Stream<String> lineStream = Files.lines(Paths.get("your_file"));
lineStream.forEachOrdered((s) -> {
if ("HEADER".equals(s)) {
// create new file
}
else {
// append to this file
}
});
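A sketch of what the two branches could look like (the output file naming is an assumption; a one-element array is used so the lambda can swap the current writer):
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class StreamSplitter {
    public static void main(String[] args) throws IOException {
        final BufferedWriter[] current = { null };  // holder so the lambda can swap writers
        final int[] fileIndex = { 0 };
        try (Stream<String> lines = Files.lines(Paths.get("your_file"))) {
            lines.forEachOrdered(s -> {
                try {
                    if (s.startsWith("__HEADER__")) {
                        if (current[0] != null) {
                            current[0].close();
                        }
                        // Assumed naming scheme; derive it from the header contents if needed.
                        Path out = Paths.get("part-" + fileIndex[0]++ + ".txt");
                        current[0] = Files.newBufferedWriter(out);
                    }
                    if (current[0] != null) {
                        current[0].write(s);
                        current[0].newLine();
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } finally {
            if (current[0] != null) {
                current[0].close();
            }
        }
    }
}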

Java delete or modify one record in File

// Create file
FileWriter fstream = new FileWriter(fileName, true);
BufferedWriter out = new BufferedWriter(fstream);
out.write(c.toString());
//Close the output stream
out.close();
// Code I used to read records; I am using | as a separator between name and id
String fileName = folderPath + "listCatalogue.txt";
String line = "";
Scanner scanner;
String name, id;
scanner = new Scanner(new File(fileName)); // scan the file itself, not the path string
System.out.println(scanner.hasNextLine());
while (scanner.hasNextLine()) {
line = scanner.nextLine();
System.out.println(line);
StringTokenizer st = new StringTokenizer(line, "|");
name = st.nextToken();
id = st.nextToken();
catalogues.add(new Catalogue(name, id));
}
Above is the code to create the file and read the file. How can I delete a certain record in the file? What I have found on Google is how to delete the whole file, but not how to delete a certain record, e.g. I provide a name and, if it matches, that record gets deleted. And how can I modify that record? Is it possible to modify a record using File?
Deleting a certain record in a file directly is impossible, for several reasons.
First, "record" is an application-level term; the file system knows only a sequence of bytes. Second, the streams you are using to access the file are abstractions that do not support deletion either.
So, what's the solution?
You can read the file record by record and write each record to another file, skipping the record you want to delete. Not writing that record to the other file gives you the effect of deletion, but this method is not efficient for big files.
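A minimal sketch of that copy-and-skip approach, assuming the name|id line format from the question and a hypothetical nameToDelete:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class DeleteRecord {
    static void deleteRecord(Path file, String nameToDelete) throws IOException {
        Path temp = Paths.get(file.toString() + ".tmp");
        try (BufferedReader reader = Files.newBufferedReader(file);
             BufferedWriter writer = Files.newBufferedWriter(temp)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String name = line.split("\\|", 2)[0];
                if (!name.equals(nameToDelete)) {   // copy every record except the one to delete
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
        // Replace the original file with the filtered copy.
        Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}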
File streams support the mark() and skip() methods. If your data structure lets you compute the position of the record you want to delete, you can jump to that position immediately by calling skip(). The problem is the deletion itself. To solve it, you can design your data structure so that a record can be marked as deleted; in that case you do not really delete the record, you just flag it. Another solution is to write special (e.g. null) values over the record, but your reader then has to be able to skip such null values.
This solution has a disadvantage: if you remove many records, the file will not get any smaller.
As a variant, you can use the RandomAccessFile API; you can also use FileChannel.
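For instance, a minimal sketch of the in-place variant with RandomAccessFile, assuming fixed-length records so the byte offset of a record can be computed; the record is overwritten with blanks rather than removed, and the reader has to skip blank records:
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

public class MarkDeleted {
    // Assumes fixed-length records, so offset = recordIndex * recordLength.
    static void markDeleted(String fileName, long recordIndex, int recordLength) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "rw")) {
            raf.seek(recordIndex * recordLength);
            byte[] blank = new byte[recordLength];
            Arrays.fill(blank, (byte) ' ');
            raf.write(blank); // overwrite the record with spaces instead of removing it
        }
    }
}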
