Indexing multiple files in one file - Java

I have a program that reads from plain text files, and there can be more than 5 million of them!
When I read them, I find them by name; the names basically encode the x and y coordinates of a matrix, for example 440x300.txt.
Now I want to put all of them into one big file and index them.
That is, I want to know exactly, for example, at which byte 440x300.txt starts and at which byte it ends inside the big file.
My first idea was to create a separate file and save this info there, with each line containing something like 440 x 300 150883 173553,
but looking this info up will also take a lot of time!
I want to know if there is a better way to find out where each file starts and ends,
i.e. somehow index the files.
Please help.
By the way, I'm programming in Java.
Thanks in advance for your time.

If you only need to read these files, I would archive them in batches, e.g. using ZIP or JAR format. These formats support naming and indexing of entries, and you can build, update and check the archives using standard tools.
It is possible to place 5 million files in one archive, but using a small number of archives may be more manageable.
BTW: As the files are text, compressing them will also make them smaller. You can try this yourself by creating a ZIP or JAR with, say, 1000 of them.
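If you go this route, a minimal sketch could look like the following; the batch.zip name, the tiles directory, and reading the entry into a String are illustrative assumptions, not details from the question.

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipBatch {
    public static void main(String[] args) throws IOException {
        Path zip = Paths.get("batch.zip");   // assumed archive name

        // Build one archive from a directory of the x-by-y text files.
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip));
             Stream<Path> files = Files.list(Paths.get("tiles"))) {   // assumed source directory
            List<Path> paths = files.collect(Collectors.toList());
            for (Path p : paths) {
                out.putNextEntry(new ZipEntry(p.getFileName().toString()));
                Files.copy(p, out);
                out.closeEntry();
            }
        }

        // The ZIP central directory acts as the index: look one file up by name.
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            ZipEntry entry = zf.getEntry("440x300.txt");
            if (entry != null) {
                try (InputStream in = zf.getInputStream(entry)) {
                    System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
                }
            }
        }
    }
}

Because ZipFile reads the central directory rather than scanning the whole archive, looking an entry up by name stays cheap even for large archives.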

If you want to be able to do direct addressing within your file, you have two options:
Have an index at the beginning of your file so you can look up the start/end address based on (x, y).
Make all records exactly the same size (in bytes) so you can easily compute the location of a record in your file (see the sketch below).
Choosing the right option should be based on the following criteria:
Do you have a record for each cell in your matrix?
Do the matrix values change?
Does the matrix dimension change?
Can the values in the matrix have a fixed byte length (i.e. are they numbers or strings)?
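As an illustration of the second option, here is a hedged sketch of fixed-size records addressed by (x, y) with RandomAccessFile; the matrix width, record size, and class name are assumptions, not values from the question.

import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of the fixed-size-record option: every cell of the matrix occupies
// RECORD_SIZE bytes, so the location of (x, y) is a simple computation.
public class MatrixFile implements AutoCloseable {

    static final int WIDTH = 1000;       // cells per row (assumption)
    static final int RECORD_SIZE = 256;  // fixed bytes per cell (assumption)

    private final RandomAccessFile file;

    public MatrixFile(String path) throws IOException {
        file = new RandomAccessFile(path, "rw");
    }

    private long offsetOf(int x, int y) {
        return ((long) y * WIDTH + x) * RECORD_SIZE;
    }

    public byte[] read(int x, int y) throws IOException {
        byte[] record = new byte[RECORD_SIZE];
        file.seek(offsetOf(x, y));
        file.readFully(record);
        return record;
    }

    public void write(int x, int y, byte[] record) throws IOException {
        if (record.length != RECORD_SIZE) {
            throw new IllegalArgumentException("record must be exactly " + RECORD_SIZE + " bytes");
        }
        file.seek(offsetOf(x, y));
        file.write(record);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}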

Related

Working with a huge file (>10GB)

I was googling and didn't find an answer.
I have a huge file (>10GB) that I can't store in memory. The words are separated by "|". I need to find the 100000 most frequently used phrases.
So I am going to read this file line by line using an InputStream, so I only need memory for one line at a time. Then I'm planning to parse each line into phrases.
But how can I store the phrases? I want to use a file for this (format: #Phrase# #Count#).
The file structure could look like this:
Phrase | Count
"Phrase1" 17
"Phrase2" 5
"Phrase3" 6
Each time I get a phrase I look it up in the file; if there is no such phrase, I append it to the end of the file and set its count to 1. Otherwise I increment the count of this phrase.
Is this possible? I mean, can I write to a specific position in a file? If so, how can I do it? Maybe there are some libraries for this? Or any other suggestions?
Since your goal is to group equal values, sorting all the phrases will work; and since you don't have enough memory to hold all the data at once, a disk-based merge sort is likely your best option.
On Wikipedia, it's called an External merge sort:
One example of external sorting is the external merge sort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together - for example, sorting 900 megabytes of data using only 100 megabytes of RAM.
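For illustration, a rough sketch of that approach in Java follows. It assumes the phrases have already been split out to one phrase per line (e.g. by a first pass over the "|"-delimited input); the chunk size and file names are made up.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSortSketch {

    static final int CHUNK_SIZE = 1_000_000; // phrases per in-memory chunk (assumption)

    public static void main(String[] args) throws IOException {
        List<Path> chunks = sortChunks(Paths.get("phrases.txt"));
        mergeChunks(chunks, Paths.get("sorted.txt"));
    }

    // Read the input in chunks that fit in RAM, sort each chunk, spill it to a temp file.
    static List<Path> sortChunks(Path input) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> chunk = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == CHUNK_SIZE) {
                    chunkFiles.add(spill(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                chunkFiles.add(spill(chunk));
            }
        }
        return chunkFiles;
    }

    static Path spill(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path tmp = Files.createTempFile("chunk", ".txt");
        Files.write(tmp, chunk, StandardCharsets.UTF_8);
        return tmp;
    }

    // k-way merge of the sorted chunks using a priority queue of readers.
    static void mergeChunks(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<ChunkReader> heap = new PriorityQueue<>();
        for (Path p : chunks) {
            ChunkReader r = new ChunkReader(p);
            if (r.current != null) heap.add(r);
        }
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            while (!heap.isEmpty()) {
                ChunkReader r = heap.poll();
                out.write(r.current);
                out.newLine();
                if (r.advance()) heap.add(r);
            }
        }
    }

    static class ChunkReader implements Comparable<ChunkReader> {
        final BufferedReader reader;
        String current;

        ChunkReader(Path p) throws IOException {
            reader = Files.newBufferedReader(p, StandardCharsets.UTF_8);
            current = reader.readLine();
        }

        boolean advance() throws IOException {
            current = reader.readLine();
            if (current == null) reader.close();
            return current != null;
        }

        @Override
        public int compareTo(ChunkReader other) {
            return current.compareTo(other.current);
        }
    }
}

Once the phrases are sorted, equal phrases are adjacent, so counting each one is a single sequential pass over the merged output.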
Do not write to the file as you go along. Instead, keep a data structure of key-value pairs where the key is the phrase and the value is the number of times it appears. Once you have read through the input file in its entirety, and everything is counted and properly stored in your data structure, then and only then should you output the contents of the data structure to a text file under your own self-imposed format.
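A minimal sketch of that approach, assuming the distinct phrases (though not the whole input) fit in memory as this answer proposes; the file names are placeholders, while the "|" delimiter and the 100000 figure come from the question.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PhraseCounter {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();

        // One pass over the input, counting entirely in memory.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("huge.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String phrase : line.split("\\|")) {
                    phrase = phrase.trim();
                    if (!phrase.isEmpty()) {
                        counts.merge(phrase, 1, Integer::sum);
                    }
                }
            }
        }

        // Only after everything is counted: pick the top 100000 and write them once.
        List<Map.Entry<String, Integer>> top = counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .limit(100_000)
                .collect(Collectors.toList());

        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("top-phrases.txt"), StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Integer> e : top) {
                out.write(e.getKey() + " " + e.getValue());
                out.newLine();
            }
        }
    }
}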

How to change a specific part of a file using Java?

I was writing a program that implements a dictionary.
Actually, what I did was write a Java applet to show the words defined in a .xml file, using the org.w3c.dom package.
Now I want to add a new feature: users can modify a word in the dictionary in the program, and the modification is then saved to the original .xml file.
Here is my question: what should I do to save the changes? Note that users can only modify one word at a time, so I don't want to load the whole file, modify the relevant part, and rewrite the whole file to disk. Is there a novel way to do that?
An XML file is a sequential text file. This means that there is no formula or other convenient way to locate the n-th word in a dictionary stored in XML. Elements need to be written one after the other, character by character (and one character may or may not result in one byte). Thus, what is called a random update is out.
Look at JAXB for a most convenient way to read and write XML, and invest some work so that a user cannot update in memory and terminate the program without saving.
Reading and writing files in specific formats is a little bit trickier than you portray.
Seen with "XML eyes", you are only changing a portion of the file - but to do that at the file level you need to seek to the position of the change and write new bytes from there. The problem is that the content after that position won't shift to accommodate the new portion you write.
TL;DR - no - you need to read+write the complete XML file when making changes.
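Since both answers land on reading and rewriting the whole document, here is a minimal JAXB round-trip sketch; the Dictionary/Entry classes, element names, and dictionary.xml are assumptions about the question's schema, not taken from it. (On Java 8 the package is javax.xml.bind; newer JDKs need the jakarta.xml.bind dependency instead.)

import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Assumed schema: <dictionary><entry><word/><definition/></entry>...</dictionary>
@XmlRootElement(name = "dictionary")
class Dictionary {
    @XmlElement(name = "entry")
    public List<Entry> entries = new ArrayList<>();
}

class Entry {
    @XmlElement public String word;
    @XmlElement public String definition;
}

public class DictionaryEditor {
    public static void main(String[] args) throws Exception {
        File file = new File("dictionary.xml");           // assumed file name
        JAXBContext ctx = JAXBContext.newInstance(Dictionary.class);

        // Read the whole document into memory.
        Dictionary dict = (Dictionary) ctx.createUnmarshaller().unmarshal(file);

        // Modify one entry in memory.
        dict.entries.stream()
            .filter(e -> "example".equals(e.word))
            .findFirst()
            .ifPresent(e -> e.definition = "an updated definition");

        // Rewrite the complete file - a partial in-place update is not possible with XML.
        ctx.createMarshaller().marshal(dict, file);
    }
}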

Inserting data in RandomAccessFile and updating index

I've got a RandomAccessFile in Java in which I manage some data. Simplified:
At the start of the file there is an index (an 8-byte long value per dataset, which represents the offset where the real data can be found).
So if I want to know where the data of dataset no. 3 is, for example, I read 8 bytes at offset (2 * 8). (Indexing starts at 0.)
A dataset itself consists of 4 bytes representing the size of the dataset, followed by all the bytes belonging to the dataset.
That works fine as long as I always rewrite the whole file.
It's important here that dataset no. 3 could have been written as the first entry in the file, so the index is ordered but the data itself is not.
If I insert a new dataset, I always append it to the end of the file. But the number of datasets that can be in one file is limited. If I can store 100 datasets in the file, there will always be 100 entries in the index. If the offset read from the index for a dataset is 0, the dataset is new and will be appended to the file.
But there's one case which doesn't work for me yet. If I read dataset no. 3 from the file, add some data to it in my application, and then want to update it in the file, I have no idea how to do this.
If it has the same length as before, I can simply overwrite the old data. But if the new dataset has more bytes than the old one, I'll have to move all the data in the file that comes after this dataset and update the index entries for those datasets.
Any idea how to do that?
Or is there maybe a better way to manage storing these datasets in a file?
PS: Yes, of course I thought of using a database, but that is not applicable for my project. I really do need simple files.
You can't easily insert data into the middle of a file. You'd basically have to read all the remaining data, write the "new" data and then rewrite the "old" data. Alternatively, you could invalidate the old slot (potentially allowing it to be reused later) and then just write the whole new record to the end of the file. Your file format isn't really clear to me, to be honest, but fundamentally you need to be aware that you can't insert (or delete) in the middle of a file.
I've got a RandomAccessFile in Java where i manage some data.
Stop right there. You have a file. You are presently accessing it via RandomAccessFile in Java. However your entire question relates to the file itself, not to RandomAccessFile or Java. You have a major file design problem, as you are assuming facilities like inserting into the middle of a file that don't exist in any filesystem I have used since about 1979.
As the others answered, there's no real way to make the file longer or shorter without rewriting the whole thing. There are some workarounds, though, and maybe one of them will work for you after all.
Limit all datasets to a fixed length.
Delete by changing/removing the index entry, and add by always appending to the end of the file. Update by removing the old dataset and adding the new dataset to the end if the new one is longer. Compact the file from time to time by actually deleting the "ignored" datasets and moving all valid datasets together (rewriting everything). A sketch of this approach appears below.
If you can't limit the datasets to a fixed length and you intend to make a dataset longer when updating it, you can also leave a pointer at the end of the first part of a dataset and continue it later in the file. You then get a structure like a linked list. If a lot of editing takes place, it would also make sense here to rearrange and compact the file.
Most solutions have some data overhead, but file size is usually not the problem, and as mentioned you can let some method clean it up.
PS: I hope it's ok to answer such old questions - I couldn't find anything about it in the help center and I'm relatively new here.
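For what it's worth, here is a rough sketch of the "append the new version and repoint the index entry" idea. The 100-slot index of 8-byte offsets and the length-prefixed datasets follow the question's description; the class name and hard-coding the limit are assumptions.

import java.io.IOException;
import java.io.RandomAccessFile;

public class DatasetFile implements AutoCloseable {

    static final int MAX_DATASETS = 100;               // from the question's example
    static final long INDEX_SIZE = MAX_DATASETS * 8L;  // 8-byte offset per dataset

    private final RandomAccessFile file;

    public DatasetFile(String path) throws IOException {
        file = new RandomAccessFile(path, "rw");
        if (file.length() < INDEX_SIZE) {
            file.setLength(INDEX_SIZE); // reserve the index; offset 0 means "no dataset yet"
        }
    }

    public byte[] read(int datasetNo) throws IOException {
        file.seek(datasetNo * 8L);
        long offset = file.readLong();
        if (offset == 0) return null;   // never written
        file.seek(offset);
        byte[] data = new byte[file.readInt()];
        file.readFully(data);
        return data;
    }

    // Update by appending the new version at the end and repointing the index entry.
    // The old bytes stay behind as dead space until some compaction pass rewrites the file.
    public void write(int datasetNo, byte[] data) throws IOException {
        long newOffset = file.length();
        file.seek(newOffset);
        file.writeInt(data.length);
        file.write(data);
        file.seek(datasetNo * 8L);
        file.writeLong(newOffset);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}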

Shift the file while writing?

Is it possible to shift the contents of a file while writing to it using FileWriter?
I need to write data constants to the head of the file, and if I do that it overwrites the existing content.
What technique should I use to do this, or should I make a copy of the file (with the new data on top) on every write?
If you want to overwrite certain bytes of the file and not others, you can use seek and write to do so. If you want to change the content of every byte in the file (by, for example, adding a single byte to the beginning of the file) then you need to write a new file and potentially rename it after you've done writing it.
Think of the answer to the question "what will be the contents of the byte at offset x after I'm done?". If, for a large percentage of the possible values of x, the answer is "not what it used to be", then you need a new file.
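A minimal sketch of the "write a new file and rename it" approach for prepending a header; the file names and the header line are placeholders.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class PrependToFile {
    public static void main(String[] args) throws IOException {
        Path original = Paths.get("data.txt");
        Path temp = Paths.get("data.txt.tmp");

        try (BufferedReader in = Files.newBufferedReader(original, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            out.write("# constants written at the head of the file"); // the new header first
            out.newLine();
            String line;
            while ((line = in.readLine()) != null) {                  // then the old contents
                out.write(line);
                out.newLine();
            }
        }

        // Swap the rewritten copy in place of the original.
        Files.move(temp, original, StandardCopyOption.REPLACE_EXISTING);
    }
}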
Rather than contenting ourselves with the question "what will be the contents of the byte at offset x after I'm done?", let's change the mindset and ask why the file system, or perhaps the hard disk firmware, can't: a) provide another mode of accessing the file (say, inline insertion), b) increase the length of the file by the number of bytes added at the front, in the middle, or even at the end, and c) move each byte from the insertion point onwards by newcontent.length positions.
It would be easier and faster to handle these operations at the disk firmware or file system level rather than leaving that job to the application developer. I hope file system writers or hard disk vendors will offer such a feature soon.

How to write data to a file through Java?

I want to make a GUI application that contains three functions as follows:
Add a record
Edit a record
Delete a record
A record contains two fields - Name and Profession
There are two restrictions for the application:
You can't use a database to store the info; you have to use a flat file.
The whole file should not be rewritten for every add/delete operation.
So, my questions are:
Q1. Which file format would be better? (.xml or .csv or .txt or any other)
Q2. How can we perform the add/delete operation without the whole file being re-written?
The second part of your question is answered here: Best Way to Write Bytes in the Middle of a File in Java
As for the format, I would go with something as simple as possible. You don't want to have to deal with a bunch of markup processing, since with RandomAccessFile you will be going directly to a byte position. A fixed-width style format would be good, so that based on the record number you can calculate the starting position of a record or field in the file without having to read everything before it. The fields would then be padded out to the fixed width with spaces or some other suitable character.
I would go with CSV, zipped. It is both readable and editable externally.
If CSV is your choice, this can help: http://javacsv.sourceforge.net/
Did you look at this? http://sourceforge.net/projects/flatworm/
Also consider Apache Derby and HSQLDB.
Another solution is this http://www.coyotegulch.com/products/jisp/index.html
You can reinvent the wheel, but that is only required if this is an academic assignment...
Given that the whole file must not be rewritten, I would suggest using RandomAccessFile, which allows you to read and write only the record you want.
For the file format, use a binary file with fixed-length records, e.g. Name on 20 characters and Profession on 30.
This will allow you to use the seek() method of RandomAccessFile to directly access your data.
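A hedged sketch of that layout follows: records addressed by record number via seek(), with the 20/30 widths from this answer treated as byte widths for simplicity. The class name, UTF-8 encoding, and space padding are assumptions beyond what the answer specifies.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FixedWidthRecords implements AutoCloseable {

    static final int NAME_WIDTH = 20;
    static final int PROFESSION_WIDTH = 30;
    static final int RECORD_WIDTH = NAME_WIDTH + PROFESSION_WIDTH;

    private final RandomAccessFile file;

    public FixedWidthRecords(String path) throws IOException {
        file = new RandomAccessFile(path, "rw");
    }

    // Add or edit: record n always starts at n * RECORD_WIDTH, so nothing else is rewritten.
    public void write(int recordNo, String name, String profession) throws IOException {
        file.seek((long) recordNo * RECORD_WIDTH);
        file.write(pad(name, NAME_WIDTH));
        file.write(pad(profession, PROFESSION_WIDTH));
    }

    public String[] read(int recordNo) throws IOException {
        file.seek((long) recordNo * RECORD_WIDTH);
        byte[] name = new byte[NAME_WIDTH];
        byte[] profession = new byte[PROFESSION_WIDTH];
        file.readFully(name);
        file.readFully(profession);
        return new String[] {
            new String(name, StandardCharsets.UTF_8).trim(),
            new String(profession, StandardCharsets.UTF_8).trim()
        };
    }

    // Pad (or truncate) a field to its fixed width in bytes.
    private static byte[] pad(String value, int width) {
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[width];
        Arrays.fill(out, (byte) ' ');
        System.arraycopy(raw, 0, out, 0, Math.min(raw.length, width));
        return out;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}

Deletion could be handled in the same spirit, e.g. by blanking the record or flagging it so the slot can be reused, rather than rewriting the file.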
