Java library to split text into smaller files - java

I am looking for a Java library that allows me to specify a max size or max number of lines for the output files, and then splits a large XML/text file into smaller files.
I saw that there is a two-year-old question on SO about the same thing; however, the answers there were for specific cloud platforms. I just want a library for use in Java desktop apps.

You could use Guava's CountingOutputStream to keep track of how much data has been written to a file. Write one line at a time, check the number of bytes written, and once you exceed the threshold, close the file and open a new one.
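A minimal sketch of that rollover logic using only the JDK; the manual `written` counter stands in for what Guava's CountingOutputStream would track for you, and the class and file names are just placeholders:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class FileSplitter {
    /** Splits a text file into parts of at most maxBytes each, keeping lines intact. */
    public static List<Path> split(Path source, Path outDir, long maxBytes) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            BufferedWriter writer = null;
            long written = 0;
            int partNo = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                long lineBytes = (line + System.lineSeparator())
                        .getBytes(StandardCharsets.UTF_8).length;
                // Roll over to a new part file once the threshold would be exceeded.
                if (writer == null || (written + lineBytes > maxBytes && written > 0)) {
                    if (writer != null) writer.close();
                    Path part = outDir.resolve("part-" + partNo++ + ".txt");
                    parts.add(part);
                    writer = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
                    written = 0;
                }
                writer.write(line);
                writer.newLine();
                written += lineBytes;
            }
            if (writer != null) writer.close();
        }
        return parts;
    }
}
```

A max-lines variant works the same way: replace the byte counter with a line counter.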


Recent ways to obtain file size in Java

I know this question has been widely discussed in different posts:
java get file size efficiently
How to get file size in Java
Get size of folder or file
https://www.mkyong.com/java/how-to-get-file-size-in-java/
My problem is that I need to obtain the sizes of a large number of files (regular files existing on a hard disk), and for this I need the solution with the best performance. My intuition is that it should be done through a method that reads the file system metadata directly, rather than obtaining the size by reading the whole file contents. It is difficult to tell from the documentation which specific method is used.
As stated in this page:
Files has the size() method to determine the size of the file. This is
the most recent API and it is recommended for new Java applications.
But this is apparently not the best advice in terms of performance. I have made measurements of different methods:
file.length();
Files.size(path);
BasicFileAttributes attr = Files.readAttributes(path, BasicFileAttributes.class); attr.size();
To my surprise, file.length() is the fastest, despite having to create a File object instead of using the newer Path. I do not know whether this reads the file system metadata or the file contents. So my question is:
What is the fastest, recommended way to obtain file sizes in recent Java versions (9/10/11)?
EDIT
I do not think these details add much to the question. Basically, the benchmark output reads like this:
Length: 49852 with previous instantiation: 84676
Files: 3451537 with previous instantiation: 5722015
Length: 48019 with previous instantiation: 79910
Length: 47653 with previous instantiation: 86875
Files: 83576 with previous instantiation: 125730
BasicFileAttr: 333571 with previous instantiation: 366928
.....
Length is quite consistent. Files is noticeably slow on the first call, but it must cache something, since later calls are faster (though still slower than Length). This is what other people observed in some of the links I referenced above. BasicFileAttr was my hope, but it is still slow.
I am asking what is recommended in modern Java versions, and I consider 9/10/11 "modern". It is not a dependency, nor a limitation, but I suppose Java 11 ought to provide better means of getting file sizes than Java 5. If Java 8 already shipped the fastest way, that is OK.
It is not premature optimisation; at the moment I am optimising a CRC check with an initial size check, because that should be much faster and does not, in theory, need to read the file contents. So I can use the "old" length method directly, and all I am asking is what the new advances are in this respect in modern Java, since the new methods are apparently not as fast as the old ones.
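For reference, here are the three calls being compared, side by side; all three read file system metadata rather than the file contents, so none of them scales with the file size. The temp-file setup is just for illustration:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

public class SizeCalls {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("size", ".bin");
        Files.write(path, new byte[1234]);

        // 1. Legacy java.io: a metadata lookup; returns 0 on error instead of throwing.
        long a = new File(path.toString()).length();

        // 2. NIO convenience method; throws IOException if the file is missing.
        long b = Files.size(path);

        // 3. One metadata call that also yields timestamps etc. -- useful when you
        //    need several attributes, since it hits the file system only once.
        BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
        long c = attrs.size();

        System.out.println(a + " " + b + " " + c); // all three report 1234

        Files.delete(path);
    }
}
```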

Organizing data in java (on a mac)

I'm writing a tool to analyze stock market data. For this I download data and then save all the data corresponding to a stock as a 20x100000 double[][] array in a data.bin file on my hard drive. I know I should put it in some database, but performance-wise this is simply the best method.
Now here is my problem: I need to do updates and search on the data:
Updates: I have to append new data to the end of the array as time progresses.
Search: I want to iterate over different data files to find a minimum or calculate moving averages etc.
I could do both of them by reading the whole file in, updating it, and writing it back, or by searching a specific area... but this is somewhat overkill since I don't need the whole data.
So my question is: Is there a library (in Java) or something similar to open/read/change parts of the binary file without having to open the whole file? Or searching through the file starting at a specific point?
RandomAccessFile allows seeking to a particular position in a file and updating parts of it or appending new data to the end without rewriting everything. See the tutorial here: http://docs.oracle.com/javase/tutorial/essential/io/rafs.html
You could try looking at Random Access Files:
Tutorial: http://docs.oracle.com/javase/tutorial/essential/io/rafs.html
API: http://docs.oracle.com/javase/6/docs/api/java/io/RandomAccessFile.html
... but you will still need to figure out the exact positions you want to read in a binary file.
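For a fixed-shape double[][] array, those positions are easy to compute: if the file is the matrix written row by row, the cell (row, col) sits at `(row * cols + col) * 8` bytes. A sketch under that row-major-layout assumption (the class and method names are invented):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class PriceStore {
    private static final int DOUBLE_BYTES = 8;

    /** Reads one double at (row, col) from a file laid out as rows * cols doubles. */
    static double read(RandomAccessFile raf, int cols, int row, int col) throws IOException {
        raf.seek(((long) row * cols + col) * DOUBLE_BYTES); // jump straight to the cell
        return raf.readDouble();
    }

    /** Appends new values to the end without touching existing data. */
    static void append(RandomAccessFile raf, double[] values) throws IOException {
        raf.seek(raf.length());
        for (double v : values) raf.writeDouble(v);
    }
}
```

Note that appending works naturally here only if new data extends a row; appending a new column to every row would still mean rewriting the file, which is one more argument for an embedded database.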
You might want to consider moving to a database, maybe a small embedded one like H2 (http://www.h2database.com)

Indexing multiple files in one file

I have a program that reads from plain text files. There can be more than 5 million of these files!
When reading them, I find them by name. The names are basically saved as the x and y of a matrix, for example 440x300.txt.
Now I want to put all of them in one big file and index them.
I mean I want to know exactly, for example, that 440x300.txt is saved in the big file from which byte to which byte.
My first idea was to create a separate file and save this info in it, with each line containing something like 440 x 300 150883 173553,
but finding this info will also take a lot of time!
I want to know if there is a better way to find out where they start and end,
to somehow index the files.
Please help.
By the way, I'm programming in Java.
Thanks in advance for your time.
If you only need to read these files, I would archive them in batches, e.g. using the ZIP or JAR format. These support the naming and indexing of files, and you can build, update, and check them using standard tools.
It is possible to place 5 million files in one archive, but using a small number of archives may be more manageable.
BTW: As the files are text, compressing them will also make them smaller. You can try this yourself by creating a ZIP or JAR with, say, 1000 of them.
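A sketch of that approach with the JDK's built-in java.util.zip: the ZIP central directory already acts as the name-to-offset index, so looking an entry up by name needs no scan of the archive (the class name is just a placeholder):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class CellArchive {
    /** Packs a named text entry into a ZIP archive. */
    static void write(Path archive, String name, String content) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(archive))) {
            zos.putNextEntry(new ZipEntry(name)); // e.g. "440x300.txt"
            zos.write(content.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
    }

    /** Looks one entry up by name; the ZIP central directory is the index. */
    static String read(Path archive, String name) throws IOException {
        try (ZipFile zf = new ZipFile(archive.toFile());
             InputStream in = zf.getInputStream(zf.getEntry(name))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

In a real run you would add many entries per archive inside one ZipOutputStream, calling putNextEntry once per file.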
If you want to be able to do direct addressing within your file, then you have two options:
Have an index at the beginning of your file so you can look up the start/end address based on (x, y)
Make all records exactly the same size (in bytes) so you can easily compute the location of a record in the file.
Choosing the right option should be done based on the following criteria:
Do you have records for each cell in your matrix?
Do the matrix values change?
Does the matrix dimension change?
Can the values in the matrix have a fixed byte length (i.e. are they numbers or strings)?
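If the answers point to the fixed-size-record option, no index is needed at all: the position of a cell becomes a pure function of (x, y). A sketch under assumed values for the record size and matrix width (all names and constants here are hypothetical):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MatrixFile {
    // Hypothetical layout: every cell is padded to exactly RECORD_SIZE bytes.
    static final int RECORD_SIZE = 64;
    static final int WIDTH = 1000; // cells per row; adjust to your matrix

    static long offset(int x, int y) {
        return ((long) y * WIDTH + x) * RECORD_SIZE;
    }

    static String readCell(RandomAccessFile raf, int x, int y) throws IOException {
        byte[] buf = new byte[RECORD_SIZE];
        raf.seek(offset(x, y));
        raf.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8).trim(); // strip zero padding
    }

    static void writeCell(RandomAccessFile raf, int x, int y, String text) throws IOException {
        byte[] buf = Arrays.copyOf(text.getBytes(StandardCharsets.UTF_8), RECORD_SIZE);
        raf.seek(offset(x, y));
        raf.write(buf);
    }
}
```

The trade-off is wasted space: every cell occupies RECORD_SIZE bytes regardless of its real length, so this only pays off when cell sizes are similar or the values are fixed-width numbers.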

How to read arbitrary but continuous n lines from a huge file

I would like to read an arbitrary number of lines. The files are normal ASCII text files for the moment (they may be UTF-8/multibyte character files later).
So what I want is a method that reads only specific lines from a file (for example, lines 101-200), and while doing so it should not block anything (i.e. the same file can be read by another thread for lines 201-210, and that should not wait for the first reading operation).
If there are no lines left to read, it should gracefully return whatever it could read. The output of the method could be a List.
The solution I have thought of so far is to read the entire file first to find the number of lines as well as the byte position of each newline character, then use RandomAccessFile to read bytes and convert them to lines. I have to convert the bytes to Strings (but that can be done after the reading is finished). I would avoid the end-of-file exception for reading beyond the file by proper bookkeeping. The solution is a bit inefficient, as it goes through the file twice, but the file size can be really big and we want to keep very little in memory.
If there is a library for such thing that would work, but a simpler native java solution would be great.
As always I appreciate your clarification questions and I will edit this question as it goes.
Why not use Scanner and just loop through hasNextLine() until you get to the count you want, then grab as many lines as you wish? If it runs out, it will fail gracefully. That way you're only reading the file once (unless Scanner reads it fully... I've never looked under the hood, but it doesn't sound like you care, so... there you go :)
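A sketch of that Scanner approach; the line numbering convention (1-based, inclusive) and the names are my own choice:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class LineRange {
    /** Returns lines 'from' to 'to' inclusive (1-based); near EOF it returns what it can. */
    static List<String> readLines(Path file, int from, int to) throws IOException {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(file)) {
            int lineNo = 0;
            while (sc.hasNextLine() && lineNo < to) {
                lineNo++;
                String line = sc.nextLine();   // lines before 'from' are read and discarded
                if (lineNo >= from) result.add(line);
            }
        }
        return result;
    }
}
```

Each call opens its own Scanner, so two threads asking for different ranges do not block each other; the cost is that every call re-reads the file from the start up to its range.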
If you want to minimise memory consumption, I would use a memory mapped file. This uses almost no heap. The amount of the file kept in memory is handled by the OS so you don't need to tune the behaviour yourself.
FileChannel fc = new FileInputStream(fileName).getChannel();
final MappedByteBuffer map = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
If you have a file of 2 GB or more, you need multiple mappings. In the simplest case you can scan the data and remember all the indexes. The indexes themselves could take a lot of space, so you might only remember every Nth, e.g. every tenth.
e.g. a 2 GB file with 40 byte lines could have 50 million lines requiring 400 MB of memory.
Another way around having a large index is to create another memory mapped file.
FileChannel fc = new RandomAccessFile(fileName, "rw").getChannel();
final MappedByteBuffer map2 = fc.map(FileChannel.MapMode.READ_WRITE, 0, fc.size()/10);
The problem being, you don't know how big the file needs to be before you start. Fortunately if you make it larger than needed, it doesn't consume memory or disk space, so the simplest thing to do is make it very large and truncate it when you know the size it needs to be.
This could also be used to avoid re-indexing the file each time you load it (only when it has changed). If the file is only appended to, you could index from the end of the file each time.
Note: This approach can use a lot of virtual memory. For a 64-bit JVM this is no problem, as your limit is likely to be 256 TB. For a 32-bit application, your limit is likely to be 1.5 - 3.5 GB, depending on your OS.
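Putting the pieces above together, a sketch of building the sparse every-Nth-line index over a memory-mapped file (assuming, as in the simplest case, a file under 2 GB so a single mapping suffices; the names are placeholders):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class MappedLineIndex {
    /** Returns the byte offsets where lines 0, N, 2N, ... start. */
    static List<Long> index(Path file, int every) throws IOException {
        List<Long> starts = new ArrayList<>();
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
            // Single mapping: assumes the file is under 2 GB.
            MappedByteBuffer map = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
            long lineNo = 0;
            starts.add(0L); // line 0 starts at offset 0
            for (int pos = 0; pos < map.limit(); pos++) {
                // each '\n' (except a trailing one) begins a new line at pos + 1
                if (map.get(pos) == '\n' && pos + 1 < map.limit()) {
                    lineNo++;
                    if (lineNo % every == 0) starts.add((long) (pos + 1));
                }
            }
        }
        return starts;
    }
}
```

To read line k you would then seek to `starts.get(k / every)` and skip at most N - 1 lines forward, which bounds the scan regardless of file size.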

Shift the file while writing?

Is it possible to shift the contents of a file while writing to it using FileWriter?
I need to write data constants to the head of the file, and if I do that it overwrites the existing content.
What technique should I use to do this, or should I make copies of the file (with the new data on top) on every write?
If you want to overwrite certain bytes of the file and not others, you can use seek and write to do so. If you want to change the content of every byte in the file (by, for example, adding a single byte to the beginning of the file) then you need to write a new file and potentially rename it after you've done writing it.
Think of the answer to the question "what will be the contents of the byte at offset x after I'm done?". If, for a large percent of the possible values of x the answer is "not what it used to be" then you need a new file.
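For the prepend case, the write-a-new-file-and-rename technique looks like this; the class and method names are placeholders, and the whole old file is streamed rather than loaded into memory:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class Prepend {
    /** Writes the header plus the old contents to a temp file, then swaps it in. */
    static void prepend(Path file, String header) throws IOException {
        Path tmp = Files.createTempFile(file.getParent(), "prepend", ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            out.write(header.getBytes(StandardCharsets.UTF_8));
            Files.copy(file, out); // stream the old contents after the new header
        }
        // Replace the original; add StandardCopyOption.ATOMIC_MOVE if the
        // file system supports an atomic swap, so readers never see a half file.
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Creating the temp file in the same directory as the target keeps the final move within one file system, which is what makes an atomic rename possible at all.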
Rather than contenting ourselves with the question "what will be the contents of the byte at offset x after I'm done?", let's change the mindset and ask why the file system, or perhaps the hard disk firmware, can't: a) provide another mode of accessing the file (say, inline insertion); b) increase the length of the file by the number of bytes added at the front, in the middle, or even at the end; c) move each byte from the insertion point forward by newcontent.length positions.
It would be easier and faster to handle these operations at the disk firmware or file system implementation level rather than leaving that job to the application developer. I hope file system writers or hard disk vendors will offer such a feature soon.
Regards,
Samba
