Creating a simple index on a text file in Java

I need to implement a simple indexing scheme for a big text file. The text file contains key-value pairs, and I need to read back a specific key-value pair without loading the complete file into memory. The file is huge, contains millions of entries, and the keys are not sorted. Different key-value pairs need to be read depending on user input, so I don't want the complete file to be read every time. Please let me know the exact classes and methods in the Java file handling API that would help to implement this in a simple and efficient way. I want to do this without using an external library such as Lucene.

As the comments pointed out, without an index you're going to need a linear search of the entire file in the worst case, and half of it on average. But fortunately there are some tricks you can use.
If the file doesn't change much, then create a copy of the file in which the entries are sorted. Ideally make the records in the copy the same length, so that you can go straight to the Nth entry in the sorted file.
If you don't have the disk space for that, then create an index file, which has all the keys in the original file as key and the offset into the original file as the value. Again, use fixed-length records. Or better, make this index file a database. Or load the original file into a database. In either case, disk storage is very cheap.
EDIT: To create the index file, open the main file using RandomAccessFile and read it sequentially. Use the 'getFilePointer()' method at the start of each entry to read the position in the file, and store that plus the key in the index file. When looking up something read the file pointer from the index file and use the 'seek(long)' method to jump to the point in the original file.
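The scan-and-seek approach described above can be sketched roughly as follows. This is a minimal sketch under assumptions that are not in the original: one entry per line in the form key=value, ASCII content, and a hypothetical class name OffsetIndex. For brevity the offsets are kept in a HashMap rather than written to an index file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: assumes one "key=value" entry per line (an assumption, not from the question).
public class OffsetIndex {

    // Reads the data file once and records the byte offset at which each entry starts.
    static Map<String, Long> build(String dataPath) throws IOException {
        Map<String, Long> index = new HashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(dataPath, "r")) {
            long offset = raf.getFilePointer();      // start of the first entry
            String line;
            while ((line = raf.readLine()) != null) {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    index.put(line.substring(0, eq), offset);
                }
                offset = raf.getFilePointer();       // start of the next entry
            }
        }
        return index;
    }

    // Jumps straight to the stored offset and reads only that one entry.
    static String lookup(String dataPath, Map<String, Long> index, String key) throws IOException {
        Long offset = index.get(key);
        if (offset == null) {
            return null;                             // key is not in the index
        }
        try (RandomAccessFile raf = new RandomAccessFile(dataPath, "r")) {
            raf.seek(offset);
            String line = raf.readLine();
            return line.substring(line.indexOf('=') + 1);
        }
    }
}
```

Note that RandomAccessFile.readLine() is unbuffered and maps each byte to one char, so for a very large or non-ASCII file a buffered reader that counts bytes itself would likely be a better fit for the initial scan.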

I'd recommend building an index file. Scan the input file and write every key and its offset into a List, then sort the list and write it to the index file. Then, whenever you want to look up a key, you read in the index file and do a binary search on the list. Once you find the key you need, open the data file as a RandomAccessFile and seek to the position of the key. Then you can read the key and the value.
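One possible shape for the sorted index file and the binary search is sketched below. The class name SortedIndex, the Entry type and the on-disk layout (an entry count followed by writeUTF/writeLong pairs) are assumptions chosen for illustration, not something the answer specifies.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical index format: int count, then (UTF key, long offset) pairs sorted by key.
public class SortedIndex {

    // One index entry: a key and the byte offset of its record in the data file.
    static final class Entry implements Comparable<Entry> {
        final String key;
        final long offset;

        Entry(String key, long offset) {
            this.key = key;
            this.offset = offset;
        }

        @Override
        public int compareTo(Entry other) {
            return key.compareTo(other.key);
        }
    }

    // Sorts a copy of the entries by key and writes them to the index file.
    static void write(List<Entry> entries, String indexPath) throws IOException {
        List<Entry> sorted = new ArrayList<>(entries);
        Collections.sort(sorted);
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(indexPath)))) {
            out.writeInt(sorted.size());
            for (Entry e : sorted) {
                out.writeUTF(e.key);
                out.writeLong(e.offset);
            }
        }
    }

    // Reads the index file back into a list that is already sorted by key.
    static List<Entry> read(String indexPath) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(indexPath)))) {
            int count = in.readInt();
            List<Entry> entries = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                String key = in.readUTF();
                long offset = in.readLong();
                entries.add(new Entry(key, offset));
            }
            return entries;
        }
    }

    // Binary search on the sorted list; returns the data-file offset, or -1 if the key is absent.
    static long find(List<Entry> sorted, String key) {
        int pos = Collections.binarySearch(sorted, new Entry(key, 0L));
        return pos >= 0 ? sorted.get(pos).offset : -1L;
    }
}
```

The offset returned by find would then be passed to RandomAccessFile.seek(long) on the data file. With millions of keys the loaded list still occupies memory, so fixed-length index records that can be binary-searched directly on disk may be worth considering.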

Related

Working with huge file (>10GB)

I was googling and didn't find an answer.
So I have a huge file (>10GB) that I can't store in memory. The words are separated by "|". I need to find the top 100000 most frequently used phrases.
So I am going to read this file line by line using an InputStream, so that I need memory for only one line. Then I'm planning to parse each line into phrases.
But how can I store the phrases? I want to use file for this (format: #Phrase# #Count#).
File structure can be like this:
Phrase | Count
"Phrase1" 17
"Phrase2" 5
"Phrase3" 6
Each time I get a phrase I look for it in the file. If there is no such phrase, I put it at the end of the file and set its count to 1. Otherwise I increment the count of this phrase.
Is this possible? I mean, can I write to a certain position in a file? If so, how can I do this? Maybe there are some libraries for it? Or any other suggestions?
Since your goal is finding equal values, sorting all the phrases will work, but since you don't have enough memory to store all the data at once, a disk-based merge-sort is likely your best option.
On Wikipedia, it's called an External merge sort:
One example of external sorting is the external merge sort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, for sorting 900 megabytes of data using only 100 megabytes of RAM.
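A compact sketch of that idea in Java might look like the following. It assumes one phrase per line and measures a chunk in lines rather than bytes; the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: assumes one phrase per line; chunkSize is a number of lines, not bytes.
public class ExternalSort {

    // Phase 1: sort chunks that fit in memory and write each one to its own temp file.
    static List<Path> sortChunks(BufferedReader in, int chunkSize) throws IOException {
        List<Path> chunks = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            buffer.add(line);
            if (buffer.size() == chunkSize) {
                chunks.add(flush(buffer));
            }
        }
        if (!buffer.isEmpty()) {
            chunks.add(flush(buffer));
        }
        return chunks;
    }

    private static Path flush(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("chunk", ".txt");
        Files.write(chunk, buffer);                  // one line per element, UTF-8
        buffer.clear();
        return chunk;
    }

    // An open chunk file together with its smallest unread line.
    private static final class Cursor {
        final BufferedReader reader;
        String head;

        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }
    }

    // Phase 2: k-way merge; only one line per chunk is held in memory at a time.
    static void merge(List<Path> chunks, Writer out) throws IOException {
        Comparator<Cursor> byHead = (a, b) -> a.head.compareTo(b.head);
        PriorityQueue<Cursor> heap = new PriorityQueue<>(byHead);
        for (Path chunk : chunks) {
            Cursor c = new Cursor(Files.newBufferedReader(chunk));
            if (c.head != null) heap.add(c); else c.reader.close();
        }
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            out.write(c.head);
            out.write('\n');
            c.head = c.reader.readLine();            // advance this chunk
            if (c.head != null) heap.add(c); else c.reader.close();
        }
    }
}
```

Once the output is sorted, equal phrases sit next to each other, so a single sequential pass can count each run of identical lines.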
Do not write to the file as you go along, instead you should keep a data structure with key value pairs where the key is the phrase and the value is the number of times it appears. Then once you have read through the input file in its entirety, and everything is counted and properly stored in your data structure, THEN and ONLY THEN should you output the contents of the data structure to a text file using your own self-imposed constraints.
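As a rough illustration of this in-memory approach, the sketch below counts phrases in a HashMap and then picks the most frequent ones with a small min-heap. It only works if all distinct phrases fit in memory, which is an assumption that may not hold for a 10GB input; the names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: assumes the set of distinct phrases fits in memory.
public class PhraseCounter {

    // Reads line by line and counts every non-empty phrase separated by '|'.
    static Map<String, Integer> count(BufferedReader in) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            for (String phrase : line.split("\\|")) {
                if (!phrase.isEmpty()) {
                    counts.merge(phrase, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns the n most frequent phrases, most frequent first, using a min-heap of size n.
    static List<String> top(Map<String, Integer> counts, int n) {
        Comparator<Map.Entry<String, Integer>> byCount = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll();                         // drop the currently least frequent phrase
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);                 // heap yields least frequent first
        return result;
    }
}
```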

How to change specific part of a file using java?

I was writing a program that implements a dictionary.
Actually, what I did is just write a Java applet to show the words which are defined in an .xml file. I did that with the org.w3c.dom package.
Now I want to add a new feature so that users can modify a word in the dictionary in the program, and the modification will then be saved to the original .xml file.
Here is my question: what should I do to save the changes? Note that users can only modify one word at a time, so I don't want to load the whole file, modify one part and rewrite the whole file to disk. Is there a novel way to do that?
An XML file is a sequential text file. This means that there is no formula or other convenient way to locate the n-th word in a dictionary stored in XML. Elements need to be written one after the other, character by character (and one character may or may not result in a byte). Thus, what is called a random update, is out.
Look at JAXB for a most convenient way to read and write XML, and invest some work so that a user cannot update in memory and terminate the program without saving.
Reading and writing files in specific formats is a little bit trickier than what you portray.
Seen with "XML eyes" you are only changing a portion of the file - but to do that on the file level you need to seek to the position of change and write new bytes from there. The problem with that is that the content after that position won't adjust according to the new portion you write.
TL;DR - no - you need to read+write the complete XML file when making changes.
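A load, modify and rewrite cycle with the org.w3c.dom package mentioned in the question could be sketched like this. The element and attribute names (an entry element with a word attribute) are purely hypothetical, since the question does not show the dictionary's schema.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical schema: <dictionary><entry word="cat">definition</entry>...</dictionary>
public class XmlRewrite {

    // Parses the whole file, changes one entry in memory, then writes the whole file back.
    static void setDefinition(File xml, String word, String newDefinition) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);

        NodeList entries = doc.getElementsByTagName("entry");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            if (word.equals(entry.getAttribute("word"))) {
                entry.setTextContent(newDefinition);
            }
        }

        // Serialize the modified DOM over the original file.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        try (OutputStream out = new FileOutputStream(xml)) {
            transformer.transform(new DOMSource(doc), new StreamResult(out));
        }
    }
}
```

Writing to a temporary file first and renaming it over the original would reduce the risk of losing the dictionary if the program stops halfway through the write.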

Delete file contents using RandomAccessFile

I have a file which contains lot of zeros and as per the requirement the zeros in the file are invalid. I am using RandomAccessFile api to locate data in the file. Is there way so that all the zeros can be removed from the file using the same api.
You'll have to stream through the file and write out the content, minus the zeros, to a separate temporary file. You can then close and delete the original and rename the new file to the old file name. That's your best alternative for this particular use case.
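A minimal sketch of this copy-and-rename approach follows. It assumes that "zeros" means zero bytes (0x00) rather than the character '0', which the question does not make clear; the class name is hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: treats "zeros" as 0x00 bytes (an assumption).
public class ZeroStripper {

    // Copies every non-zero byte to a temp file, then replaces the original with it.
    static void stripZeroBytes(Path file) throws IOException {
        Path dir = file.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "strip", ".tmp");   // same directory, so the move stays on one filesystem
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file));
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(tmp))) {
            int b;
            while ((b = in.read()) != -1) {
                if (b != 0) {
                    out.write(b);
                }
            }
        }
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```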
You can use RandomAccessFile to read the file's data, and when you reach a point where you need to change the data you can overwrite the existing bytes with an equal number of new bytes. This works only if the new value is exactly the same length as the old value.
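For the equal-length case, an in-place overwrite might look like the sketch below. The helper name and its length check are illustrative additions, not part of the RandomAccessFile API.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper for same-length replacement only.
public class InPlaceOverwrite {

    // Overwrites oldLength bytes at the given offset; refuses if the lengths differ.
    static void overwrite(String path, long offset, int oldLength, byte[] replacement) throws IOException {
        if (replacement.length != oldLength) {
            throw new IllegalArgumentException("replacement must be exactly " + oldLength + " bytes");
        }
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            raf.seek(offset);
            raf.write(replacement);   // bytes after this range are left untouched
        }
    }
}
```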
With RandomAccessFile it is difficult, and equally complex, when the old value and the new value differ in size. It involves a lot of seeks, reads and writes to move data back
Try to read the whole file, change the bits you have to change and write a new file. You might process one line at a time or read the whole file into memory, modify it and write it all back out again. It is a good idea to perform the edit in the following manner:
Read file
Write to Temporary File [just to back-up]
Rename original to back-up
Work on Temporary file.
Remove Backup if you were successful.

Inserting data in RandomAccessFile and updating index

I've got a RandomAccessFile in Java where I manage some data. Simplified:
At the start of the file I have an index (an 8-byte long value per dataset, which represents the offset where the real data can be found).
So if I want to know where to find the data of dataset no. 3, for example, I read 8 bytes at offset (2*8). (Indexing starts with 0.)
A dataset itself consists of 4 bytes which represent the size of the dataset, followed by all the bytes belonging to the dataset.
So that works fine as long as I always rewrite the whole file.
It's pretty important here that dataset no. 3 could have been written as the first entry in the file, so the index is ordered but the data itself is not.
If I insert a new dataset, I always append it to the end of the file. But the number of datasets that can be in one file is limited. If I can store 100 datasets in the file, there will always be 100 entries in the index. If the offset read from the index for a dataset is 0, the dataset is new and will be appended to the file.
But there's one case which is not working for me yet. If I read dataset no. 3 from the file, add some data to it in my application, and want to update it in the file, I have no idea how to do this.
If it has the same length as before, I can simply overwrite the old data. But if the new dataset has more bytes than the old one, I'll have to move all the data in the file which is behind this dataset and update the indexes for these datasets.
Any idea how to do that?
Or is there maybe a better way to manage storing these datasets in a file?
PS: Yes, of course I thought of using a database, but this is not applicable for my project. I really do need simple files.
You can't easily insert data into the middle of a file. You'd basically have to read all the remaining data, write the "new" data and then rewrite the "old" data. Alternatively, you could invalidate the old "slot" (potentially allowing it to be reused later) and then just write the whole new record to the end of the file. Your file format isn't really clear to me, to be honest, but fundamentally you need to be aware that you can't insert (or delete) in the middle of a file.
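The append-and-repoint alternative can be sketched against the layout from the question (a fixed index of 8-byte offsets, then datasets that each start with a 4-byte size). The class and method names are hypothetical, and the old bytes are simply left behind as dead space rather than tracked for reuse.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch of the question's layout: [slots x 8-byte offset][4-byte size + data]...
public class DatasetFile {

    // Writes an empty index; offset 0 means "no dataset stored in this slot yet".
    static void create(RandomAccessFile raf, int slots) throws IOException {
        raf.setLength(0);
        raf.seek(0);
        raf.write(new byte[slots * 8]);
    }

    // Overwrites in place when the length is unchanged, otherwise appends and repoints the index.
    static void put(RandomAccessFile raf, int slot, byte[] data) throws IOException {
        raf.seek(slot * 8L);
        long offset = raf.readLong();
        if (offset != 0) {
            raf.seek(offset);
            if (raf.readInt() == data.length) {
                raf.write(data);                 // pointer is just past the size field
                return;
            }
        }
        long end = raf.length();
        raf.seek(end);
        raf.writeInt(data.length);
        raf.write(data);
        raf.seek(slot * 8L);
        raf.writeLong(end);                      // the old copy, if any, becomes dead space
    }

    // Returns the dataset stored in the slot, or null if the slot is empty.
    static byte[] get(RandomAccessFile raf, int slot) throws IOException {
        raf.seek(slot * 8L);
        long offset = raf.readLong();
        if (offset == 0) {
            return null;
        }
        raf.seek(offset);
        byte[] data = new byte[raf.readInt()];
        raf.readFully(data);
        return data;
    }
}
```

Dead space accumulates with every grown dataset, so an occasional compaction pass that rewrites the file with only live datasets would probably still be needed.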
I've got a RandomAccessFile in Java where i manage some data.
Stop right there. You have a file. You are presently accessing it via RandomAccessFile in Java. However your entire question relates to the file itself, not to RandomAccessFile or Java. You have a major file design problem, as you are assuming facilities like inserting into the middle of a file that don't exist in any filesystem I have used since about 1979.
As the others answered too, there's no real possibility to make the file longer/shorter without rewriting the whole. There are some workarounds and maybe one solution would work after all.
Limit all datasets to a fixed length.
Delete by changing/removing the index and add by always adding to the end of the file. Update by removing the old dataset and adding the new dataset to the end if the new dataset is longer. Compress the file from time to time by actually deleting the "ignored datasets" and moving all valid datasets together (rewriting everything).
If you can't limit the dataset to a fixed length and you intend to update a dataset by making it longer, you can also leave a pointer at the end of the first part of a dataset and continue it later in the file. This gives you a structure like a linked list. If a lot of editing takes place, it would make sense here too to rearrange and compress the file from time to time.
Most solutions have a data overhead but file size is usually not the problem and as mentioned you can let some method "clean it up".
PS: I hope it's ok to answer such old questions - I couldn't find anything about it in the help center and I'm relatively new here.

How to write data to a file through java?

I want to make a GUI application that contains three functions as follows:
Add a record
Edit a record
Delete a record
A record contains two fields - Name and Profession
There are two restrictions for the application
You can't use database to store info. You have to use a flat file.
Total file should not be re-written for every add/delete operation.
So, my questions are mentioned below:
Q1. Which file format would be better? (.xml or .csv or .txt or any other)
Q2. How can we perform the add/delete operation without the whole file being re-written?
The second part of your question is answered here : Best Way to Write Bytes in the Middle of a File in Java
As for the format - I would go with something as simple as possible. You don't want to have to deal with a bunch of markup processing, because with RandomAccessFile you will be going directly to a byte position. A fixed-width format would be good, so that based on the record number you can calculate the starting position of a record or field in the file without having to read everything in the file. The fields would then be padded out to the fixed width with spaces or some other suitable character.
I would go with CSV, zipped. it is both readable, and editable externally.
If CSV is your choice, this can help: http://javacsv.sourceforge.net/
Did you look at this? http://sourceforge.net/projects/flatworm/
Also consider Apache Derby and HSQLDB
Another solution is this http://www.coyotegulch.com/products/jisp/index.html
You can reinvent the wheel, but that is only required if this is an academic assignment...
Given that the whole file must not be rewritten, I would suggest using RandomAccessFile that allow you to read and write only the record you want.
For the file format, a binary file, using fixed length for the record : ex: Name on 20 characters, Profession on 30.
This will allow you to use the seek() method of RandomAccessFile to directly access your data.
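A fixed-length record file along these lines could be sketched as below, using the 20 and 30 character widths from the example. Padding with spaces, silently truncating longer values and storing two bytes per character via writeChars are choices made for the sketch; the class name is hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: Name padded to 20 chars, Profession to 30, two bytes per char.
public class FixedRecordFile {
    static final int NAME_LEN = 20;
    static final int PROFESSION_LEN = 30;
    static final int RECORD_BYTES = (NAME_LEN + PROFESSION_LEN) * 2;

    // Writes (or overwrites) the record with the given number without touching any other record.
    static void write(RandomAccessFile raf, int recordNo, String name, String profession) throws IOException {
        raf.seek((long) recordNo * RECORD_BYTES);
        raf.writeChars(pad(name, NAME_LEN));
        raf.writeChars(pad(profession, PROFESSION_LEN));
    }

    // Reads one record as {name, profession} by seeking directly to its position.
    static String[] read(RandomAccessFile raf, int recordNo) throws IOException {
        raf.seek((long) recordNo * RECORD_BYTES);
        String name = readField(raf, NAME_LEN);
        String profession = readField(raf, PROFESSION_LEN);
        return new String[] { name, profession };
    }

    // Pads with spaces to the fixed width; longer values are truncated.
    private static String pad(String value, int length) {
        if (value.length() > length) {
            return value.substring(0, length);
        }
        StringBuilder sb = new StringBuilder(value);
        while (sb.length() < length) {
            sb.append(' ');
        }
        return sb.toString();
    }

    private static String readField(RandomAccessFile raf, int length) throws IOException {
        char[] chars = new char[length];
        for (int i = 0; i < length; i++) {
            chars[i] = raf.readChar();
        }
        return new String(chars).trim();
    }
}
```

Adding a record is a write at the next free record number, editing is a write at an existing number, and deleting could be done by overwriting the record with blanks and reusing that slot later.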
