Working with huge file (>10GB)

Working with huge file (>10GB) - java

I was googling and didn't find an answer.
So I have a huge file (>10GB) that I can't hold in memory. The words are separated by "|". I need to find the top 100000 most frequently used phrases.
So I am going to read this file line by line using an InputStream, so that I only need memory for one line. Then I'm planning to parse each line into phrases.
But how can I store the phrases? I want to use a file for this (format: #Phrase# #Count#).
File structure can be like this:
Phrase | Count
"Phrase1" 17
"Phrase2" 5
"Phrase3" 6
Each time I get a phrase I look it up in the file. If there is no such phrase, I append it to the end of the file and set its count to 1. Otherwise I increment the count of this phrase.
Is this possible? I mean, can I write to a certain position in a file? If so, how can I do this? Maybe there are some libs? Or any other suggestions?

Since your goal is finding equal values, sorting all the phrases will work, but since you don't have enough memory to store all the data at once, a disk-based merge-sort is likely your best option.
On Wikipedia, it's called an External merge sort:
One example of external sorting is the external merge sort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, this allows sorting 900 megabytes of data using only 100 megabytes of RAM.
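To make the idea concrete for the phrase-counting case, here is a minimal sketch, not a tuned implementation. It assumes phrases contain no line breaks, and the names ExternalPhraseCount, writeSortedRuns and topK are made up for this example. The input is cut into sorted runs that each fit in memory, then the runs are merged; because equal phrases come out of the merge next to each other, counting them only needs a single running counter.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class ExternalPhraseCount {

    /** Cuts the input into sorted temp files ("runs") of at most maxPhrases phrases, one phrase per line. */
    static List<Path> writeSortedRuns(Path input, int maxPhrases) throws IOException {
        List<Path> runs = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String phrase : line.split("\\|")) {
                    if (phrase.isEmpty()) continue;
                    buffer.add(phrase);
                    if (buffer.size() >= maxPhrases) runs.add(flush(buffer));
                }
            }
        }
        if (!buffer.isEmpty()) runs.add(flush(buffer));
        return runs;
    }

    private static Path flush(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, buffer, StandardCharsets.UTF_8);
        buffer.clear();
        return run;
    }

    /** One open run together with the line it is currently positioned on. */
    private static final class Cursor {
        final BufferedReader reader;
        String head;
        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }
    }

    /** Merges the runs, counts each group of equal phrases and keeps the k most frequent ones. */
    static List<Map.Entry<String, Long>> topK(List<Path> runs, int k) throws IOException {
        PriorityQueue<Cursor> heads = new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (Path run : runs) {
            Cursor c = new Cursor(Files.newBufferedReader(run, StandardCharsets.UTF_8));
            if (c.head != null) heads.add(c); else c.reader.close();
        }
        // Min-heap on the count: the least frequent of the current top k sits on top.
        PriorityQueue<Map.Entry<String, Long>> best =
                new PriorityQueue<>((a, b) -> Long.compare(a.getValue(), b.getValue()));
        String current = null;
        long count = 0;
        while (!heads.isEmpty()) {
            Cursor c = heads.poll();
            if (!c.head.equals(current)) {      // a new phrase starts, so the previous one is complete
                offer(best, current, count, k);
                current = c.head;
                count = 0;
            }
            count++;
            c.head = c.reader.readLine();
            if (c.head != null) heads.add(c); else c.reader.close();
        }
        offer(best, current, count, k);
        List<Map.Entry<String, Long>> result = new ArrayList<>(best);
        result.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return result;
    }

    private static void offer(PriorityQueue<Map.Entry<String, Long>> best, String phrase, long count, int k) {
        if (phrase == null) return;
        best.add(new AbstractMap.SimpleEntry<String, Long>(phrase, count));
        if (best.size() > k) best.poll();
    }
}
```

A call would look like topK(writeSortedRuns(input, 1_000_000), 100_000), where the run size is whatever fits comfortably in your heap. The sketch leaves the temp files behind and opens all runs at once, so with a very large number of runs you would merge in several passes.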

Do not write to the file as you go along. Instead, keep a data structure of key-value pairs where the key is the phrase and the value is the number of times it appears. Once you have read through the input file in its entirety, and everything is counted and properly stored in your data structure, THEN and ONLY THEN should you output the contents of the data structure to a text file in whatever format you have chosen.
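A minimal sketch of that approach, assuming the set of distinct phrases (not the whole file) fits in memory; if it does not, this will run out of heap no matter how the file is read. The class and method names are invented for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class InMemoryPhraseCount {

    /** Reads the input line by line and counts every "|"-separated phrase. */
    static Map<String, Long> count(BufferedReader in) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            for (String phrase : line.split("\\|")) {
                if (!phrase.isEmpty()) counts.merge(phrase, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Returns the n most frequent phrases, most frequent first. */
    static List<String> top(Map<String, Long> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Only one line of the input is held at a time, so the memory cost is driven by the number of distinct phrases rather than by the 10GB file size.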

Related

Java - Sorting and csv: good practice with huge data

I need to sort a huge CSV file (10+ million records) with several algorithms in Java, but I'm running into memory problems.
Basically I have a huge CSV file where every record has 4 fields of different types (String, int, double).
I need to load this CSV into some structure and then sort it by each of the fields.
My idea was: write a Record class (with its own fields), read the CSV file line by line, make a new Record object for every line and put them into an ArrayList. Then call my sorting algorithms for each field.
It doesn't work. I get an OutOfMemoryError when I try to load all the Record objects into my ArrayList.
This way I create tons of objects, and I think that is not a good idea.
What should I do when I have this huge amount of data? Which method or data structure would be less expensive in terms of memory usage?
My point is just to use the sorting algorithms and see how they work with a big data set; it's not important to save the result of the sort to a file.
I know that there are some libs for CSV, but I have to implement it without external libs.
Thank you very much! :D

Cut your file into pieces (depending on the size of the file) and look into merge sort. That way you can sort even big files without using a lot of memory, and it's what databases use when they have to do huge sorts.

I would use an in-memory database such as H2 in in-memory mode (jdbc:h2:mem:), so everything stays in RAM and isn't flushed to disk (provided you have enough RAM; if not, you might want to use the file-based URL).
Create your table in there and write every row from the CSV into it. Provided you set up the indexes properly, sorting and grouping will be a breeze with standard SQL.

How can I output an array with a bunch of assignments from a folder of files?

I have 151 images that I would like to store in an array, together with their file paths and some attributes, which will be String data extracted from the file names.
I am guessing I'll be using File IO/NIO for this, but of these two options:
1. build the array from disk every time the program is run
2. build the array once with a throwaway program, so I can just copy the code of the array and have it hardcoded
option two seems much more sensible. I just don't know how.

Check whether a previously saved record exists. If there is none, build the array and save it. If there is one, read the array from it.
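A rough sketch of that check, under assumptions the question does not state: the images are .png files, the attributes in a file name are separated by underscores, and the saved record is a tab-separated text file. ImageIndex and loadOrBuild are names invented for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ImageIndex {

    /** Returns one "path TAB attr1 TAB attr2 ..." line per image, scanning the folder only if no cache exists yet. */
    static List<String> loadOrBuild(Path imageDir, Path cache) throws IOException {
        if (Files.exists(cache)) {
            return Files.readAllLines(cache, StandardCharsets.UTF_8);
        }
        List<String> records = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(imageDir, "*.png")) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                String stem = name.substring(0, name.lastIndexOf('.'));
                // Assumed naming scheme: attributes separated by '_' in the file name.
                records.add(file + "\t" + stem.replace('_', '\t'));
            }
        }
        Collections.sort(records);
        Files.write(cache, records, StandardCharsets.UTF_8);
        return records;
    }
}
```

With only 151 files the directory scan is cheap, so rebuilding on every start would probably be fine too; the cache mainly saves you from hardcoding the array.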

Indexing multiple files in one file

I have a program that reads from plain text files. The number of these files can be more than 5 million!
When I read them, I find them by name. The names are basically the x and y of a matrix, for example 440x300.txt.
Now I want to put all of them in one big file and index them.
I mean, I want to know exactly at which byte, for example, 440x300.txt starts in the big file and at which byte it ends.
My first idea was to create a separate file and save this info in it, with each line containing something like 440 x 300 150883 173553,
but finding this info will also take a lot of time!
I want to know if there is a better way to find out where they start and end,
some way to index the files.
Please help
By the way I'm programming in Java.
Thanks in advance for your time.

If you only need to read these files, I would archive them in batches, e.g. using the ZIP or JAR format. These formats support the naming and indexing of files, and you can build, update and check the archives using standard tools.
It is possible to place 5 million files in one archive, but using a small number of archives may be more manageable.
BTW: As the files are text, compressing them will also make them smaller. You can try this yourself by creating a ZIP or JAR with, say, 1000 of them.
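For reading one file back out of such an archive, a small sketch with java.util.zip could look like this. ZipLookup and readEntry are made-up names, and the text is assumed to be UTF-8. ZipFile finds entries by name through the archive's own directory, so you do not have to maintain byte offsets yourself.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipLookup {

    /** Returns the text of the named entry (e.g. "440x300.txt"), or null if the archive has no such entry. */
    static String readEntry(File archive, String name) throws IOException {
        try (ZipFile zip = new ZipFile(archive)) {
            ZipEntry entry = zip.getEntry(name);
            if (entry == null) return null;
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) out.write(buffer, 0, n);
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }
}
```

If you do many lookups, keep the ZipFile open instead of reopening it per call as this sketch does.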

If you want to be able to do direct addressing within your file, then you have two options:
Have an index at the beginning of your file so you can lookup the start/end address based on (x, y)
Make all records exactly the same size (in bytes) so you can easily compute the location of a record in your files.
Choosing the right option should be done based on the following criteria:
Do you have records for each cell in your matrix?
Do the matrix values change?
Does the matrix dimension change?
Can the values in the matrix have a fixed byte length (i.e. are they numbers or strings)?
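To illustrate the second option, here is a sketch that assumes every cell has a record, the matrix width is known up front, and all records are padded to the same number of bytes. The names are invented for the example. With those assumptions no index is needed at all, because the position follows from (x, y) by arithmetic.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class FixedRecordMatrix {

    /** Byte offset of cell (x, y) when records are stored row by row and all have recordSize bytes. */
    static long offset(int x, int y, int width, int recordSize) {
        return ((long) y * width + x) * recordSize;
    }

    static void write(RandomAccessFile file, int x, int y, int width, byte[] record) throws IOException {
        file.seek(offset(x, y, width, record.length));
        file.write(record);
    }

    static byte[] read(RandomAccessFile file, int x, int y, int width, int recordSize) throws IOException {
        byte[] record = new byte[recordSize];
        file.seek(offset(x, y, width, recordSize));
        file.readFully(record);
        return record;
    }
}
```

If the files differ a lot in size, padding everything to the largest one wastes space, and the index option is likely the better fit.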

creating a simple index on a text file in java

I need to implement a simple indexing scheme for a big text file. The text file contains key-value pairs, and I need to read back a specific key-value pair without loading the complete file into memory. The text file is huge and contains millions of entries, and the keys are not sorted. Different key-value pairs need to be read depending on user input, so I don't want the complete file to be read every time. Please let me know the exact classes and methods in the Java file handling API that would help to implement this in a simple and efficient way. I want to do this without using an external library such as Lucene.

As the comments pointed out, without an index you're going to need a linear search of the entire file in the worst case, and of half of it on average. But fortunately there are some tricks you can use.
If the file doesn't change much, then create a copy of the file in which the entries are sorted. Ideally make the records in the copy the same length, so that you can go straight to the Nth entry in the sorted file.
If you don't have the disk space for that, then create an index file, which has each key in the original file as the key and its offset into the original file as the value. Again, use fixed-length records. Or better, make this index file a database. Or load the original file into a database. In either case, disk storage is very cheap.
EDIT: To create the index file, open the main file using RandomAccessFile and read it sequentially. Use the 'getFilePointer()' method at the start of each entry to read the position in the file, and store that plus the key in the index file. When looking up something read the file pointer from the index file and use the 'seek(long)' method to jump to the point in the original file.
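A sketch of those two steps, assuming one "key=value" entry per line (the question does not give the format) and ASCII keys, since RandomAccessFile.readLine() does not decode multi-byte characters. To keep it short, the offsets are collected in a map instead of being written to an index file; OffsetIndex is a made-up name.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

final class OffsetIndex {

    /** Scans the data file once and remembers the byte offset at which each entry starts. */
    static Map<String, Long> build(File data) throws IOException {
        Map<String, Long> offsets = new HashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(data, "r")) {
            long start = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                int eq = line.indexOf('=');
                if (eq > 0) offsets.put(line.substring(0, eq), start);
                start = raf.getFilePointer();   // position of the next entry
            }
        }
        return offsets;
    }

    /** Jumps straight to the entry at the given offset and returns its value. */
    static String lookup(File data, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(data, "r")) {
            raf.seek(offset);
            String line = raf.readLine();
            return line.substring(line.indexOf('=') + 1);
        }
    }
}
```

RandomAccessFile reads unbuffered, so the initial scan is slow on a huge file; it only has to happen once, though, and every later lookup is a single seek.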

I'd recommend building an index file. Scan the input file and write every key and its offset into a List, then sort the list and write it to the index file. Then, whenever you want to look up a key, you read in the index file and do a binary search on the list. Once you find the key you need, open the data file as a RandomAccessFile and seek to the position of the key. Then you can read the key and the value.
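The sorted index and the binary search could be sketched like this. It assumes the keys and offsets fit in memory while the index is built and used, and it picks an index file layout (an entry count followed by key/offset pairs) purely for illustration. SortedIndex is a made-up name, and a TreeMap stands in for the "fill a list, then sort it" step.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

final class SortedIndex {
    private final String[] keys;     // sorted ascending
    private final long[] offsets;    // offsets[i] belongs to keys[i]

    private SortedIndex(String[] keys, long[] offsets) {
        this.keys = keys;
        this.offsets = offsets;
    }

    /** Sorts the (key, offset) pairs by key and writes them to the index file. */
    static void write(Map<String, Long> unsorted, File indexFile) throws IOException {
        TreeMap<String, Long> sorted = new TreeMap<>(unsorted);
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(indexFile)))) {
            out.writeInt(sorted.size());
            for (Map.Entry<String, Long> e : sorted.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeLong(e.getValue());
            }
        }
    }

    /** Reads the index file back into two parallel arrays. */
    static SortedIndex read(File indexFile) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(indexFile)))) {
            int n = in.readInt();
            String[] keys = new String[n];
            long[] offsets = new long[n];
            for (int i = 0; i < n; i++) {
                keys[i] = in.readUTF();
                offsets[i] = in.readLong();
            }
            return new SortedIndex(keys, offsets);
        }
    }

    /** Binary search for the key; returns its offset in the data file, or -1 if the key is unknown. */
    long offsetOf(String key) {
        int i = Arrays.binarySearch(keys, key);
        return i >= 0 ? offsets[i] : -1L;
    }
}
```

The offset returned by offsetOf is what you would pass to RandomAccessFile.seek on the data file.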

Inserting data in RandomAccessFile and updating index

I've got a RandomAccessFile in Java where I manage some data. Simplified:
At the start of the file I have an index (an 8-byte long value per dataset, which represents the offset where the real data can be found).
So if I want to know where to find the data of dataset no. 3, for example, I read 8 bytes at offset (2*8). (Indexing starts at 0.)
A dataset itself consists of 4 bytes that give the size of the dataset, followed by all the bytes belonging to the dataset.
That works fine as long as I always rewrite the whole file.
It's pretty important here that dataset no. 3 could have been written as the first entry in the file, so the index is ordered but the data itself is not.
If I insert a new dataset, I always append it to the end of the file. But the number of datasets that can be in one file is limited. If I can store 100 datasets in the file, there will always be 100 entries in the index. If the offset read from the index for a dataset is 0, the dataset is new and will be appended to the file.
But there's one case that is not working for me yet. If I read dataset no. 3 from the file, add some data to it in my application, and want to update it in the file, I have no idea how to do this.
If it has the same length as before, I can simply overwrite the old data. But if the new dataset has more bytes than the old one, I'll have to move all the data in the file that comes after this dataset and update the index entries for those datasets.
Any idea how to do that?
Or is there maybe a better way to manage storing these datasets in a file?
PS: Yes, of course I thought of using a database, but this is not applicable for my project. I really do need simple files.

You can't easily insert data into the middle of a file. You'd basically have to read all the remaining data, write the "new" data and then rewrite the "old" data. Alternatively, you could invalidate the old "slot" (potentially allowing it to be reused later) and then just write the whole new record to the end of the file. Your file format isn't really clear to me, to be honest, but fundamentally you need to be aware that you can't insert (or delete) in the middle of a file.

I've got a RandomAccessFile in Java where i manage some data.
Stop right there. You have a file. You are presently accessing it via RandomAccessFile in Java. However your entire question relates to the file itself, not to RandomAccessFile or Java. You have a major file design problem, as you are assuming facilities like inserting into the middle of a file that don't exist in any filesystem I have used since about 1979.

As the others have answered, there's no real way to make the file longer or shorter without rewriting the whole thing. There are some workarounds, though, and maybe one of them would work after all.
Limit all datasets to a fixed length.
Delete by changing/removing the index and add by always adding to the end of the file. Update by removing the old dataset and adding the new dataset to the end if the new dataset is longer. Compress the file from time to time by actually deleting the "ignored datasets" and moving all valid datasets together (rewriting everything).
If you can't limit the dataset to a fixed length and you intend to update a dataset by making it longer, you can also leave a pointer at the end of the first part of a dataset and continue it later in the file. That gives you a structure like a linked list. If a lot of editing takes place, it would make sense here too to rearrange and compress the file from time to time.
Most solutions have a data overhead but file size is usually not the problem and as mentioned you can let some method "clean it up".
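As an illustration of the "append the longer dataset and repoint the index" workaround, here is a sketch that follows the layout from the question (100 index entries of 8 bytes each, then datasets stored as a 4-byte size plus the bytes, with offset 0 meaning "not written yet"). SlotFile and its methods are made-up names, and compaction and crash safety are left out.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class SlotFile {
    static final int SLOTS = 100;            // fixed number of index entries, as in the question
    static final long HEADER = SLOTS * 8L;   // the index occupies the first 800 bytes

    /** Writes an all-zero index into an empty file; offset 0 then means "dataset not stored yet". */
    static void init(RandomAccessFile f) throws IOException {
        if (f.length() == 0) {
            f.seek(0);
            f.write(new byte[(int) HEADER]);
        }
    }

    /** Overwrites in place if the new data fits into the old space, otherwise appends and repoints the index. */
    static void put(RandomAccessFile f, int slot, byte[] data) throws IOException {
        f.seek(slot * 8L);
        long offset = f.readLong();
        if (offset != 0) {
            f.seek(offset);
            int oldSize = f.readInt();
            if (data.length <= oldSize) {
                f.seek(offset);
                f.writeInt(data.length);
                f.write(data);
                return;
            }
        }
        long end = f.length();
        f.seek(end);
        f.writeInt(data.length);
        f.write(data);
        f.seek(slot * 8L);
        f.writeLong(end);   // the old copy stays behind as dead space until the file is compacted
    }

    /** Returns the bytes of the dataset in the given slot, or null if the slot is empty. */
    static byte[] get(RandomAccessFile f, int slot) throws IOException {
        f.seek(slot * 8L);
        long offset = f.readLong();
        if (offset == 0) return null;
        f.seek(offset);
        byte[] data = new byte[f.readInt()];
        f.readFully(data);
        return data;
    }
}
```

Nothing after the updated dataset ever has to move, which is the point of the workaround. The price is the dead space, which a periodic rewrite of the file would reclaim.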
PS: I hope it's ok to answer such old questions - I couldn't find anything about it in the help center and I'm relatively new here.
