How to efficiently manage files on a filesystem in Java? - java

I am creating a few JAX-WS endpoints, for which I want to save the received and sent messages for later inspection. To do this, I am planning to save the messages (XML files) into the filesystem, in some sensible hierarchy. There will be hundreds, even thousands of files per day. I also need to store metadata for each file.
I am considering putting the metadata (just a couple of fields) into a database table, but keeping the XML content itself in files on the filesystem, in order not to bloat the database with content data (that is seldom read).
Is there some simple library that helps me in saving, loading, deleting etc. the files? It's not that tricky to implement it myself, but I wonder if there are existing solutions? Just a simple library that already provides easy access to the filesystem (preferably over different operating systems).
Or do I even need that? Should I just go with raw/custom Java?

Is there some simple library that helps me in saving, loading, deleting etc. the files? It's not that tricky to implement it myself, but I wonder if there are existing solutions? Just a simple library that already provides easy access to the filesystem (preferably over different operating systems).
Java API
Well, if what you need to do is really simple, you should be able to achieve your goal with java.io.File (delete, check existence, read, write, etc.) and a few stream manipulations with FileInputStream and FileOutputStream.
You can also throw in Apache commons-io and its handy FileUtils for a few more utility functions.
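For a concrete feel, here is a minimal sketch of that approach, assuming commons-io is on the classpath; the MessageStore class and the per-day directory layout are made up for illustration:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils;

public class MessageStore {

    private final File baseDir;

    public MessageStore(File baseDir) {
        this.baseDir = baseDir;
    }

    // Save one XML message under a per-day directory, e.g. <base>/2024/05/17/<id>.xml
    public File save(String messageId, String xml) throws IOException {
        File dir = new File(baseDir, dayPath());
        File target = new File(dir, messageId + ".xml");
        FileUtils.writeStringToFile(target, xml, StandardCharsets.UTF_8); // creates parent dirs as needed
        return target;
    }

    public String load(File file) throws IOException {
        return FileUtils.readFileToString(file, StandardCharsets.UTF_8);
    }

    public boolean delete(File file) {
        return FileUtils.deleteQuietly(file);
    }

    private String dayPath() {
        return new java.text.SimpleDateFormat("yyyy/MM/dd").format(new java.util.Date());
    }
}
```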
Java is independent of the OS. You just need to make sure you use File.separator (rather than hard-coding '/' or '\'), or use the constructor File(File parent, String child), so that you never mention the separator explicitly.
The Java file API is relatively high-level, abstracting away the differences between operating systems. Most of the time it's sufficient. It falls short only if you need some relatively OS-specific feature which is not in the API, e.g. checking the physical size of a file on disk (not the logical size), security rights on *nix, free space/quota of the drive, etc.
Most operating systems have an internal buffer for file writing/reading. Using FileOutputStream.write and FileOutputStream.flush ensures the data have been handed to the OS, but not necessarily written to the disk. The Java API also supports this low-level integration to manage these buffering issues (example here), which matters for systems such as databases.
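As a small sketch of that lower-level control, using only standard JDK calls (the class name is made up):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public final class DurableWrite {

    // Write bytes and ask the OS to flush them to the storage device before returning.
    public static void writeDurably(File file, byte[] data) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file)) {
            out.write(data);     // hand the bytes to the OS
            out.flush();         // flush Java-side buffers (a no-op for FileOutputStream, kept for clarity)
            out.getFD().sync();  // force the OS to write the data to the physical device
        }
    }
}
```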
Also, both files and directories are abstracted by File, and you need to check with isDirectory(). This can be confusing, for instance if you have one file x and one directory /x (I don't remember exactly how to handle this issue, but there is a way).
Web service
The web service can either use xs:base64Binary to pass the data, or use MTOM (Message Transmission Optimization Mechanism) if the files are large.
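A rough sketch of what the MTOM variant can look like on a JAX-WS endpoint; the service class and method are hypothetical, while the annotations are the standard JAX-WS/JAXB ones:

```java
import javax.activation.DataHandler;
import javax.jws.WebMethod;
import javax.jws.WebService;
import javax.xml.bind.annotation.XmlMimeType;
import javax.xml.ws.soap.MTOM;

// Enabling MTOM lets large binary parts travel as raw MIME attachments
// instead of inline base64 text in the SOAP body.
@MTOM
@WebService
public class MessageArchiveService {

    @WebMethod
    public String storeMessage(@XmlMimeType("application/octet-stream") DataHandler content) {
        // read content.getInputStream() here and save it to the filesystem
        return "stored";
    }
}
```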
Transactions
Note that the database is transactional and the file system is not, so you might have to add a few checks in case operations fail and are retried.
You could go with a complicated design involving some form of distributed transaction (see this answer), or try to go with a simpler design that provides the level of robustness that you need. A possible design could be:
Update. If the user wants to overwrite a file, you actually create a new one. The level of indirection between the logical file name and the physical file is stored in the database. This way you never overwrite a physical file once it has been written, which keeps rollback consistent.
Create. Same story when the user wants to create a file.
Delete. If the user wants to delete a file, you do it only in the database first. A periodic job polls the file system to identify files which are no longer listed in the database, and removes them (see the sketch below). This two-phase delete ensures that the delete operation can be rolled back.
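A minimal sketch of such a cleanup job, assuming a hypothetical DAO that returns the physical file names still referenced in the metadata table:

```java
import java.io.File;
import java.util.Set;

public class OrphanSweeper {

    private final File storageDir;
    private final FileMetadataDao dao; // hypothetical DAO over the metadata table

    public OrphanSweeper(File storageDir, FileMetadataDao dao) {
        this.storageDir = storageDir;
        this.dao = dao;
    }

    // Delete physical files that the database no longer references.
    public void sweep() {
        Set<String> referenced = dao.listPhysicalFileNames();
        File[] files = storageDir.listFiles();
        if (files == null) {
            return;
        }
        for (File file : files) {
            if (file.isFile() && !referenced.contains(file.getName())) {
                file.delete(); // orphan: no database row points at it any more
            }
        }
    }

    public interface FileMetadataDao {
        Set<String> listPhysicalFileNames();
    }
}
```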
This is not as robust as writing BLOBs in a real transactional database, but it provides some robustness. You could otherwise have a look at commons-transaction, but I feel like the project is dead (2007).

There is DataNucleus, a Java persistence provider. It is a little too heavy for this case, but it supports the JPA and JDO Java standards with different datastores (RDBMS, object storage, XML, JSON, Excel, etc.). If the product is already using JPA or JDO, it might be worth considering DataNucleus, as saving data into different datastores should be transparent. I suppose DataNucleus supports splitting the data into several files, creating the sensible directory/file structure I wanted (in my question), but this is just a guess.
Support for XML and JSON seems to be experimental.

Related

Saving Data from a JavaFX-Application without Database

Unfortunately I couldn't find anything specific to this topic / to my problem. Here we go:
I'm building a JavaFX business application for a friend of mine. Unfortunately I do not have any possibility to connect to a database. I want the application to load a save state from a file. The application contains a list of clients, and the clients have some specific properties. I do not want to hardcode this into a .prop or .txt file, because I'm sure there's a different way of doing this, isn't there?
Thanks in advance, appreciate it!
Lots of choices for persisting data to local storage. The exact choice depends on your needs. You do not describe enough details to make a specific recommendation.
Here is a list of possibilities, roughly in increasing order of complexity of your data.
Text file
If you have small amounts of simple data, save to a text file. You can store each piece in a separate file, or combine them into a single file. Recent versions of Java have new classes to make this easier than ever. See the Oracle Tutorial.
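For instance, a tiny sketch using the java.nio.file classes; the readString/writeString convenience methods assume Java 11 or later, and the file name is arbitrary:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TextFileDemo {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("savestate.txt");                 // Java 11+: Path.of
        Files.writeString(file, "lastOpenedProject=demo\n");  // write the whole file
        String contents = Files.readString(file);             // read it back
        System.out.println(contents);
    }
}
```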
Comma-separated & Tab-delimited
For sets of structured data, write to text files in comma-separated values (CSV) or tab-delimited values. For example a list of people with rows for each person, and columns for name, phone number, and email address.
While reading/writing such files is easy enough to program yourself, I suggest using an established library to eliminate the drudgery, avoid bugs, and save yourself some time. There are a few such libraries written in Java.
My favorite is the Apache Commons CSV project. This library makes easy work of the chore of reading/writing such files. Despite the name, this library supports tab-delimited as well as comma-separated formats. I've written a few Answers here on Stack Overflow showing how to use this library, as you can see here, here, and here.
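As a rough illustration of writing and reading such a file with Commons CSV (the column names and file name are made up; the calls are from the org.apache.commons.csv package):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

public class PeopleCsv {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("people.csv");

        // Write a header row plus one record per person.
        try (CSVPrinter printer = new CSVPrinter(Files.newBufferedWriter(file), CSVFormat.DEFAULT)) {
            printer.printRecord("name", "phone", "email");
            printer.printRecord("Alice", "555-0100", "alice@example.com");
        }

        // Read the records back.
        try (CSVParser parser = CSVParser.parse(file, StandardCharsets.UTF_8, CSVFormat.DEFAULT)) {
            for (CSVRecord record : parser) {
                System.out.println(record.get(0) + " / " + record.get(1));
            }
        }
    }
}
```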
By the way, plain old ASCII defines a few character positions explicitly for delimiting data files, with four levels of grouping (document, group, record/row, and field): the separator control characters at code points 28 through 31. Unicode, of course, inherits these from ASCII. I am puzzled why they have remained so obscure and so infrequently used. They seem much more logical to me than commas and tabs, which may well exist inside the data payload.
Serialization
You can write out the data values stored within an object. This is called serialization. Java has a serialization facility built-in, but be sure to study up on the details.
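A bare-bones sketch of the built-in mechanism; the Client class stands in for whatever your model classes are, and they must implement Serializable:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.List;

public class SaveStateDemo {

    static class Client implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        String email;
    }

    // Write the whole list of clients to a file in one call.
    static void save(List<Client> clients, String fileName) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(fileName))) {
            out.writeObject(clients);
        }
    }

    // Read it back and reconstitute the objects.
    @SuppressWarnings("unchecked")
    static List<Client> load(String fileName) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(fileName))) {
            return (List<Client>) in.readObject();
        }
    }
}
```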
To more simply write out an object’s values and later read them back in to reconstitute an object, I have enjoyed using the Simple XML Serialization project. This works well for relatively simple needs, and is aimed at the situation where you want the structure of a class to drive the process of determining what to write.
Java has other XML binding facilities, both built-in and third-party. These are much more powerful in their flexibility. They are especially good when you want to define and verify the XML structure in a rigid fashion, such as defining an XML DTD or XML Schema against which to validate the data, and perhaps even generate the Java class in which to represent the data.
Embedded database
For more complicated data, use an embedded relational database.
The SQLite database is bundled with many platforms. This is a C-based library, not pure Java. As the name indicates, SQLite is indeed quite "lite", lacking rigid data types and many other common database features. SQLite is meant more as an alternative to writing text files than as a competitor to more serious databases. It is a great product if your needs fit the sweet spot of its capabilities.
My first choice for an embedded database would be the H2 Database Engine. Built in pure Java. Can be run inside your app, or separately as a server (your choice). Has sophisticated relational database features. Has been around for years, is often updated, and is well-worn. The principal author has much experience in the field.
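A minimal sketch of H2's embedded mode over plain JDBC; the JDBC URL and table are just examples, and H2 creates the database file on first connect:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class H2Demo {
    public static void main(String[] args) throws SQLException {
        // Embedded mode: the database files live under ./data next to the application.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./data/appdb", "sa", "")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS client (id BIGINT PRIMARY KEY, name VARCHAR(255))");
            }
            try (PreparedStatement ps = conn.prepareStatement("INSERT INTO client (id, name) VALUES (?, ?)")) {
                ps.setLong(1, 1L);
                ps.setString(2, "Alice");
                ps.executeUpdate();
            }
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT id, name FROM client")) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " " + rs.getString("name"));
                }
            }
        }
    }
}
```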

JSP: Best practices uploading files to server

I am uploading files using a multipart form, Apache FileUpload, etc. It works fine.
But I want to know what the best practices or common practices are when saving files on the server, regarding the following:
Naming the files on the server (i.e. what name is better: some generated UUID, or the row ID generated by the DB table when I insert the file's associated data?)
The best location for the files on the server (what is better? e.g. on a Linux server, which folder or partition should I use? Do I have to encrypt the uploaded files?)
When I put a link to access the files from the browser: is direct access better, or going through a servlet?
If you do it this way (files in the filesystem, metadata in the DB), then the row ID for the filename is not a bad idea (at least it ensures uniqueness). Unfortunately you will have to take care that the filesystem and the database stay in sync, so it will require careful coding.
If you care about performance, the files can be stored on a separate HDD (or NAS). Note that if the number of files is going to be big (thousands), you should not put all of them in one folder, but instead group them into subfolders, each containing at most a few hundred files (see the sketch below). That keeps access times low as the number of files grows. Whether to use encryption should depend on your business needs (do the files contain confidential data?).
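For example, one common way to spread uploads across subfolders is to derive the directory from the database ID; this is purely illustrative, and any stable function of the ID or a hash of the name works:

```java
import java.io.File;

public class UploadLocator {

    private final File uploadRoot;

    public UploadLocator(File uploadRoot) {
        this.uploadRoot = uploadRoot;
    }

    // Group files in buckets of 1000 per directory, e.g. row ID 123456 -> <root>/123/123456
    public File locate(long rowId) {
        File bucket = new File(uploadRoot, String.valueOf(rowId / 1000));
        bucket.mkdirs();
        return new File(bucket, String.valueOf(rowId));
    }
}
```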
A servlet is the better way, as it hides the real storage details from the client and is more robust against future changes in the application. It also has some other benefits (e.g. you can implement your own access control, and you get caching in browsers/proxies out of the box). And it's a must if you use encryption.
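A stripped-down sketch of such a download servlet; the URL mapping, the lookup against the metadata table, and the access checks are placeholders to fill in:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/files/*")
public class FileDownloadServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String id = req.getPathInfo();   // e.g. "/123"
        File file = lookupFile(id);      // resolve via your metadata table
        if (file == null || !file.isFile()) {
            resp.sendError(HttpServletResponse.SC_NOT_FOUND);
            return;
        }
        // Access control, decryption, caching headers, etc. would go here.
        resp.setContentType("application/octet-stream");
        resp.setContentLengthLong(file.length());
        resp.setHeader("Content-Disposition", "attachment; filename=\"" + file.getName() + "\"");
        Files.copy(file.toPath(), resp.getOutputStream());
    }

    private File lookupFile(String id) {
        // placeholder: map the request path to the physical file, e.g. via the database
        return null;
    }
}
```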
After recurring trouble with server file system operations (missing permissions, different behaviour on different platforms), I would recommend just stuffing the file data as BLOBs into your database. This way you do not need to come up with a unique file naming scheme, and all sensitive data lies in one place.
In this case, you will need a servlet for downloading, which IMHO is the better way even for accessing data stored in files.

java - write two files atomically

I am facing a problem for which I don't have a clean solution. I am writing a Java application and the application stores certain data in a limited set of files. We are not using any database, just plain files. Due to some user-triggered action, certain files need to be changed. I need this to be an all-or-nothing operation. That is, either all files are updated, or none of them. It is disastrous if, for example, 2 of the 5 files are changed while the other 3 are not, due to some IOException.
What is the best strategy to accomplish this?
Is embedding an in-memory database, such as hsqldb, a good reason to get this kind of atomicity/transactional behavior?
Thanks a lot!
A safe approach IMO is:
Backup
Maintain a list of processed files
On exception, restore the ones that have been processed from the backed-up copies.
It depends on how heavy the operation is going to be, and on your time limits and the like.
What is the best strategy to accomplish this? Is embedding an in-memory database, such as hsqldb, a good reason to get this kind of atomicity/transactional behavior?
Yes. If you want transactional behavior, use a well-tested system that was designed with that in mind instead of trying to roll your own on top of an unreliable substrate.
File systems do not, in general, support transactions involving multiple files.
Non-Windows file-systems and NTFS tend to have the property that you can do atomic file replacement, so if you can't use a database and
all of the files are under one reasonably small directory
which your application owns and
which is stored on one physical drive:
then you could do the following:
Copy the directory contents using hard-links as appropriate.
Modify the 5 files.
Atomically swap the modified copy of the directory with the original.
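One way to realize the "atomic swap" step on a POSIX filesystem is to reach the live directory through a symlink and retarget that symlink with a single rename. A hedged sketch, assuming Linux rename semantics; the class, the FileEditor callback, and the naming convention are all made up:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class AtomicDirSwap {

    // "current" is a symlink pointing at the live data directory (e.g. data-v1).
    // We hard-link its files into a new version directory, rewrite the files that
    // must change, then atomically retarget the symlink.
    public static void swap(Path current, Path newVersionDir, FileEditor editor) throws IOException {
        Files.createDirectories(newVersionDir);
        Path liveDir = current.toRealPath();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(liveDir)) {
            for (Path file : stream) {
                if (Files.isRegularFile(file)) {
                    // hard links: cheap "copies" sharing the same data blocks
                    Files.createLink(newVersionDir.resolve(file.getFileName()), file);
                }
            }
        }

        // Replace (do not edit in place) the files that need to change.
        editor.modify(newVersionDir);

        Path tmpLink = Paths.get(current.toString() + ".tmp");
        Files.deleteIfExists(tmpLink);
        Files.createSymbolicLink(tmpLink, newVersionDir.toAbsolutePath());
        // On POSIX, rename over an existing symlink replaces it atomically.
        Files.move(tmpLink, current, StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
    }

    public interface FileEditor {
        void modify(Path dir) throws IOException;
    }
}
```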
I've used the Apache Commons Transaction library for atomic file operations with success. It allows you to modify files transactionally and potentially roll back on failures.
Here's a link: http://commons.apache.org/transaction/
My approach would be to use a lock in your Java code, so that only one process can write to a given file at a time. I'm assuming your application is the only one that writes the files.
Even so, if some write problem occurs, then to "roll back" your files you need to have saved a copy of them beforehand, as suggested above.
Can't you lock all the files and only write to them once all files have been locked?

Recommend an indexed file format that can be updated via random access in Java

I need an indexed file format that can hold a few hundred large variable sized binary blobs.
Blobs are around 1-5 MB and the file could be as large as 1 GB. I need to be able to quickly find, read, add and remove blobs without recreating the entire file. I have no need to compress the blobs; however, if blobs are removed, I'd like to reclaim or reuse the space.
Ideally there would be a Java API.
I'm currently doing this with a ZIP format, but there's no known way to update a ZIP file without recreating it and performance is bad.
I've looked into SQLite, but its blob performance was slow, and it's overkill for my needs.
Any thoughts, or should I roll my own?
And if I do roll my own, any book or web page suggestions?
Berkeley DB Java Edition does what you need. It's free.
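As a rough sketch of what storing blobs in Berkeley DB Java Edition looks like (package com.sleepycat.je; the environment directory and key are arbitrary, and recent JE versions throw unchecked exceptions):

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class BlobStoreDemo {
    public static void main(String[] args) {
        File envDir = new File("blob-env");
        envDir.mkdirs(); // the environment directory must exist

        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(envDir, envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database db = env.openDatabase(null, "blobs", dbConfig);

        // put: key -> blob bytes
        DatabaseEntry key = new DatabaseEntry("blob-42".getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry(new byte[] {1, 2, 3});
        db.put(null, key, value);

        // get it back
        DatabaseEntry found = new DatabaseEntry();
        if (db.get(null, key, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            System.out.println("blob size: " + found.getData().length);
        }

        // delete; JE reclaims space over time by cleaning its log files in the background
        db.delete(null, key);

        db.close();
        env.close();
    }
}
```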
You need some virtual file system. Our SolFS is one of the options, yet we have only a JNI layer, as the engine is written in C. There is one more option, CodeBase, but as they don't provide an evaluation version of their file system, I know little about it.
SolFS is well suited for your task, because it lets you have alternative streams for files and associate searchable metadata with each file or even each alternative stream.

Are flat file databases any good? [closed]

Informed opinions needed about the merits of flat file databases. I'm considering using a flat file database scheme to manage data for a custom blog. It would be deployed on a Linux OS variant and written in Java.
What are the possible negatives or positives regarding performance for reading and writing of both articles and comments?
Would article retrieval crap out because it is a flat file rather than an RDBMS if it were to get slashdotted? (Wishful thinking)
I'm not against using an RDBMS, just asking the community their opinion on the viability of such a software architecture scheme.
Follow Up:
In the case of this question I would say "flat file == file system-based". For example, each blog entry and its accompanying metadata would be in a single file, making for many files organized by a date-based folder structure (e.g. blogs\testblog2\2008\12\01 for 12/01/2008).
Flat file databases have their place and are quite workable for the right domain.
Mail servers and NNTP servers of the past really pushed the limits of how far you can take these things (which is actually quite far -- file systems can hold millions of files and directories).
Flat file DBs' two biggest weaknesses are indexing and atomic updates, but if the domain is suitable these may not be an issue.
But you can, for example, with proper locking, do an "atomic" index update using basic file system commands, at least on Unix.
A simple case is having the indexing process run through the data to create the new index file under a temporary name. Then, when you are done, you simply rename (either with the system call rename(2) or the shell mv command) the new file over the old file. Rename and mv are atomic operations on a Unix system (i.e. they either work or they don't, and there is never a missing "in between" state).
Same with creating new entries. Basically, write the file fully to a temp file, then rename or mv it into its final place. Then you never have an "intermediate" file in the "DB". Otherwise, you might have a race condition (such as a process reading a file that is still being written and reaching the end before the writing process is complete -- an ugly race condition).
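In Java, that write-then-rename idiom can look roughly like this (NIO calls; the ATOMIC_MOVE option works when the temp file is created in the same directory, and therefore on the same filesystem, as the target):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class AtomicPublish {

    // Write the full content to a temp file in the same directory, then rename it into place.
    // Readers either see the old complete file or the new complete file, never a half-written one.
    public static void publish(Path target, String content) throws IOException {
        Path tmp = Files.createTempFile(target.getParent(), target.getFileName().toString(), ".tmp");
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
    }
}
```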
If your primary indexing works well with directory names, then that works just fine. You can use a hashing scheme, for example, to create directories and subdirectories to locate new files.
Finding a file using the file name and directory structure is very fast as most filesystems today index their directories.
If you're putting a million files in a directory, there may well be tuning issues you'll want to look into, but out of the box most will handle tens of thousands easily. Just remember that if you need to SCAN the directory, there are going to be a lot of files to scan. Partitioning via directories helps prevent that.
But that all depends on your indexing and searching techniques.
Effectively, a stock off-the-shelf web server serving up static content is a large, flat file database, and the model works pretty well.
Finally, of course, you have the plethora of free Unix file-system-level tools at your disposal, but all of them have issues with zillions of files (forking grep 1,000,000 times to find something in a file has performance tradeoffs -- the overhead simply adds up).
If all of your files are on the same file system, then hard links also give you options (since they, too, are atomic) in terms of putting the same file in different places (basically for indexing).
For example, you could have a "today" directory, a "yesterday" directory, a "java" directory, and the actual message directory.
So, a post could be linked in the "today" directory, the "java" directory (because the post is tagged with "java", say), and in its final place (say /articles/2008/12/01/my_java_post.txt). Then, at midnight, you run two processes. The first one takes all files in the "today" directory, checks their creation date to make sure they're not from "today" (since the process can take several seconds and a new file might sneak in), and renames those files into "yesterday". Next, you do the same thing for the "yesterday" directory, only here you simply delete files if they're out of date.
Meanwhile, the file is still in the "java" and the ".../12/01" directory. Since you're using a Unix file system, and hard links, the "file" only exists once, these are all just pointers to the file. None of them are "the" file, they're all the same.
You can see that while each individual file move is atomic, the bulk is not. For example, while the "today" script is running, the "yesterday" directory can well contain files from both "yesterday" and "the day before" because the "yesterday" script had not yet run.
In a transactional DB, you would do that all at once.
But, simply, it is a tried and true method. Unix, in particular, works VERY well with that idiom, and modern file systems support it quite well, too.
(answer copied and modified from here)
I would advise against using a flat file for anything besides read-only access, because then you'd have to deal with concurrency issues like making sure only one process is writing to the file at once. Instead, I recommend SQLite, a fully functional SQL database that's stored in a file. SQLite already has built-in concurrency, so you don't have to worry about things like file locking, and it's really fast for reads.
If, however, you are doing lots of database changes, it's best to do them all at once inside a transaction. This will write the changes to the file only once, as opposed to every time a change query is issued. This dramatically increases the speed of making multiple changes.
When a change query is issued, whether it's inside a transaction or not, the whole database is locked until that query finishes. This means that extremely large transactions could adversely affect the performance of other processes, because they must wait for the transaction to finish before they can access the database. In practice, I haven't found this to be that noticeable, but it's always good practice to try to minimize the number of database-modifying queries you issue, and it's certainly faster than trying to use a flat file.
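For instance, with the Xerial sqlite-jdbc driver, batching changes into one transaction looks roughly like this; the JDBC URL and the comment table are illustrative and assumed to exist:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BlogWriter {

    // Insert many comments in a single transaction: the database file is updated once,
    // instead of once per INSERT.
    public static void addComments(String[] comments) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:blog.db")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement("INSERT INTO comment (body) VALUES (?)")) {
                for (String comment : comments) {
                    ps.setString(1, comment);
                    ps.executeUpdate();
                }
                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```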
This has been done in ASP.NET with DasBlog. It uses file-based storage.
A few details are listed on this older link.
http://www.hanselman.com/blog/UpcomingDasBlog19.aspx
You can also get more details on http://dasblog.info/Features.aspx
I've heard some mixed opinions on its performance. I'd suggest you research that a bit more to see if that type of system would work well for you. It is the closest thing to what you describe that I have heard of so far.
Writing your own engine in native code can outperform a general purpose database.
However, the quality of the engine and the feature level will never approach that. All the things that databases give you as core features - indexing, transactions, referential integrity - you would have to implement yourself.
There's nothing wrong with reinventing the wheel (after all, Linux was just that), but keep your expectations and time commitment in mind.
I'm answering this not to explain why flat file databases are good or bad; others have done an ample job of that.
However, some have been pointing at SQLite, which does its job just fine. Since you are using Java, your best option would be to use HSQLDB, which does precisely the same as SQLite, but is implemented in Java and embeds into your application.
Most of the time a flat file database is enough to start with. But you will thank your younger self if you begin your project with a database. This could be SQLite, if you don't want to set up a whole database system like PostgreSQL.
Check out http://jsondb.io, an open-source Java-based database that has most of what you are looking for.
It saves data as flat .json files and supports multithreading, encryption, ORM, atomicity, and XPath-based advanced queries.
Disclaimer: I created this database.
Horrible idea. Appending would involve seeking to the end of the file every time you want to add something. Updating would require rewriting the entire file each time. Reading involves a table scan (or maintaining a separate index, which would have the same problems with writing/updating). Just use a database unless, of course, you re-implement all the stuff that an RDBMS already provides to make your solution even moderately scalable.
They seem to work quite well for high-write, low-read, no-update databases, where new data is appended.
Web servers and their cousins rely on them heavily for log files.
DBMS software uses them for its logs as well.
If your design falls within these limits, you're in good company, it seems. You might want to keep metadata and pointers in a database, and set up some kind of fast asynchronous queue-writer to buffer the comments, but the filesystem is already pretty good at that level of buffering and write-locking.
Flat file databases are possible but consider the following.
Databases need to attain all the ACID properties (atomicity, consistency, isolation, durability), and if you're going to ensure that's all done in a flat file (especially with concurrent access), you've basically written a full-blown DBMS.
So why not use a full-blown DBMS in the first place?
You'll save yourself the time and money involved in writing one (and rewriting it many times, I'll guarantee) if you just go with one of the free options (SQLite, MySQL, PostgreSQL, and so on).
You can use flat file databases if the data is small enough and does not need a lot of random access. A big file with lots of random access will be very slow. And there are no complex queries: no joins, no sums, no group by, etc. You also cannot expect to fetch hierarchical data from a flat file. An XML format is much better for complex structures.
