I quote from the Apache Commons page for Commons FileUpload:
This page describes the traditional API of the commons fileupload
library. The traditional API is a convenient approach. However, for
ultimate performance, you might prefer the faster Streaming API.
My Question
What specific differences make Streaming API faster than traditional API?
The key difference is in the way the file is handled, as you noticed yourself with the factory class.
The streaming API does not save anything to disk while you read the input stream. In the end, you'll be able to handle the file faster (at a cost in temporary memory)... but the idea is to avoid saving the binary to disk unless you really want or need to.
After that, you can of course save the data to disk yourself, using a BufferedInputStream, a byte array or similar.
EDIT: The handle you get when you open the stream (fileItemStreamElement.openStream()) is an ordinary InputStream instance. So the answer to your "what if it's a big file" is something like this: Memory issues with InputStream in Java
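For reference, the streaming pattern from the FileUpload documentation looks roughly like this, assuming a servlet environment (openDestination is a made-up helper standing in for wherever you want the bytes to end up):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.http.HttpServletRequest;
import org.apache.commons.fileupload.FileItemIterator;
import org.apache.commons.fileupload.FileItemStream;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
import org.apache.commons.fileupload.util.Streams;

public class StreamingUploadHandler {
    public void handleUpload(HttpServletRequest request) throws Exception {
        // No factory and no temp files: items are read straight off the request body.
        ServletFileUpload upload = new ServletFileUpload();
        FileItemIterator iter = upload.getItemIterator(request);
        while (iter.hasNext()) {
            FileItemStream item = iter.next();
            try (InputStream stream = item.openStream()) {
                if (!item.isFormField()) {
                    // Copy straight to the final destination; the upload is never
                    // buffered in a temp directory or held fully in memory.
                    Streams.copy(stream, openDestination(item.getName()), true);
                }
            }
        }
    }

    // Hypothetical helper: open wherever the file should finally be stored.
    private OutputStream openDestination(String fileName) throws IOException {
        return new FileOutputStream("uploads/" + fileName);
    }
}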
The traditional API, which is described in the User Guide, assumes that file items must be stored somewhere before they are actually accessible by the user. This approach is convenient, because it allows easy access to an item's contents. On the other hand, it is memory and time consuming.
http://commons.apache.org/fileupload/streaming.html
The streaming API should not save to disk OR save in memory. It simply provides a stream you can read from to copy the file to where ever you want. This is a way to avoid having a temp directory and also avoid allocating enough memory to hold the file. This should be faster at least because it is not copied twice, once from the browser to disk/memory and then again from disk/memory to where ever you save it.
Streaming generally refers to an API (like Apache FileUpload's streaming API or StAX) in which data is transmitted and parsed serially at application run time, often in real time, and often from dynamic sources whose contents are not precisely known beforehand.
Traditional models refer to APIs (like traditional file-handling APIs or the DOM API) which provide much more detailed information about the data up front.
For a file-handling API, for example, the traditional approach assumes that file items must be stored somewhere before they are actually accessible by the user. This is convenient because it allows easy access to an item's contents, but it is memory and time consuming.
A streaming API will have a smaller memory footprint and smaller processor requirements, and can have higher performance in certain situations.
It works on the principle of a "cardboard tube" view of the document you are working with: you only ever see the part of the data that is currently passing through the tube.
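To illustrate the "cardboard tube" idea, here is a minimal StAX sketch (the file name is a placeholder): the parser only ever sees the event currently passing by and never builds the whole document in memory.

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxTubeExample {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("document.xml")) { // placeholder file
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            // Events arrive serially; no DOM tree is ever built.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    System.out.println("Element: " + reader.getLocalName());
                }
            }
            reader.close();
        }
    }
}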
Related
What is my specific use case?
I have a set of objects representing e.g. profiles. Objects can be modified (updated), deleted or added. Each object has several properties, but modifying a single property value just marks the whole object as "modified" (so from the persistence layer's point of view, an object is atomic). There are no relations between the objects.
The size of such a set is between 10 and 50,000 (but theoretically there's no limit - the user can append additional objects). A single object's size is up to 500 KB (but usually it will be smaller, around 60 KB).
Objects should be read and updated as fast as possible. There's also one more key requirement: they should be persisted on the hard disk, with the possibility to copy or move them. My app is written in Java and runs on Windows 7-10.
What was my initial approach?
I came to the conclusion that each object can easily be represented as a single JSON file. The problem lies in keeping such a large set of files on disk: the Windows filesystem doesn't seem to handle very large numbers of (even small) files well.
Then I thought that my files could be stored in a virtual filesystem. The first obvious solution was to pack them into a ZIP archive like this:
profiles.zip:
--- profile1.json
--- profile2.json
...
--- profile10000.json
It would be a great solution in terms of portability, and read performance is also OK. BUT it seems new objects can't be appended to a ZIP archive without copying all the files already stored in the archive... or at least I didn't find a way to do it.
What should I do then...?
I've searched for other solutions. I consider using:
A fast relational database - but that feels like taking a sledgehammer to crack a nut. In particular, I don't need to handle relations or transactions (I don't even need a server; it is only for one local user).
NoSQL object databases, e.g. MapDB or Nitrite - they sound OK, but I couldn't find any reliable comparisons or popularity ratings. It is important for me to pick a credible solution.
Some other virtual filesystems that can be managed in Java? Maybe I missed something?
Could you provide any ideas or advice based on experience? I need fast reads and updates of whole objects in large datasets, with portability (achievable in Java on Windows).
It is very hard to answer the question unless we know the size of each object in memory. One suggestion I can give is to try hybrid frameworks which support in-memory access as well as persistence to disk.
Ehcache is one framework which I think will work for you; it easily supports 50,000 objects in memory. Couchbase supports similar options, with the flexibility of immediate or eventual persistence.
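A minimal Ehcache 3 sketch of such a hybrid setup, assuming a heap tier plus a persistent disk tier (the directory, cache name and pool sizes are placeholders):

import java.io.File;
import org.ehcache.Cache;
import org.ehcache.PersistentCacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;
import org.ehcache.config.units.EntryUnit;
import org.ehcache.config.units.MemoryUnit;

public class ProfileStore {
    public static void main(String[] args) {
        PersistentCacheManager manager = CacheManagerBuilder.newCacheManagerBuilder()
                .with(CacheManagerBuilder.persistence(new File("profile-data"))) // placeholder dir
                .withCache("profiles", CacheConfigurationBuilder
                        .newCacheConfigurationBuilder(String.class, String.class,
                                ResourcePoolsBuilder.newResourcePoolsBuilder()
                                        .heap(1_000, EntryUnit.ENTRIES)  // hot objects in memory
                                        .disk(4, MemoryUnit.GB, true))) // persistent disk tier
                .build(true);

        Cache<String, String> profiles = manager.getCache("profiles", String.class, String.class);
        profiles.put("profile1", "{\"name\": \"...\"}"); // the JSON document as the value
        System.out.println(profiles.get("profile1"));
        manager.close();
    }
}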
I am facing a problem for which I don't have a clean solution. I am writing a Java application, and the application stores certain data in a limited set of files. We are not using any database, just plain files. Due to some user-triggered action, certain files need to be changed. I need this to be an all-or-nothing operation. That is, either all files are updated, or none of them. It is disastrous if, for example, 2 of the 5 files are changed while the other 3 are not due to some IOException.
What is the best strategy to accomplish this?
Is embedding an in-memory database, such as HSQLDB, a good way to get this kind of atomicity/transactional behavior?
Thanks a lot!
A safe approach IMO is:
Back up the files
Maintain a list of processed files
On exception, restore the ones that have been processed from their backups
It depends on how heavy it is going to be and on your limits for time and such.
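A rough sketch of that backup-then-restore approach (writeNewContent stands in for whatever modification you actually apply):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class AllOrNothingUpdate {
    public void updateAll(List<Path> files) throws IOException {
        List<Path> processed = new ArrayList<>();
        // 1. Back up every file before touching anything.
        for (Path f : files) {
            Files.copy(f, backupOf(f), StandardCopyOption.REPLACE_EXISTING);
        }
        try {
            // 2. Apply the changes, remembering what has been processed.
            for (Path f : files) {
                writeNewContent(f); // hypothetical: the actual modification
                processed.add(f);
            }
        } catch (IOException e) {
            // 3. On failure, restore the processed files from their backups.
            for (Path f : processed) {
                Files.copy(backupOf(f), f, StandardCopyOption.REPLACE_EXISTING);
            }
            throw e;
        }
    }

    private Path backupOf(Path f) {
        return f.resolveSibling(f.getFileName() + ".bak");
    }

    private void writeNewContent(Path f) throws IOException { /* ... */ }
}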
What is the best strategy to accomplish this? Is embedding an in-memory database, such as HSQLDB, a good way to get this kind of atomicity/transactional behavior?
Yes. If you want transactional behavior, use a well-tested system that was designed with that in mind instead of trying to roll your own on top of an unreliable substrate.
File systems do not, in general, support transactions involving multiple files.
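For illustration, if the five files lived as rows in an embedded HSQLDB database, the all-or-nothing update would collapse into a single transaction (table and column names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class TransactionalUpdate {
    public void updateAll(String[] names, byte[][] contents) throws Exception {
        // Embedded, file-based HSQLDB: no server process required.
        try (Connection conn = DriverManager.getConnection("jdbc:hsqldb:file:appdata/store", "SA", "")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement("UPDATE files SET content = ? WHERE name = ?")) {
                for (int i = 0; i < names.length; i++) {
                    ps.setBytes(1, contents[i]);
                    ps.setString(2, names[i]);
                    ps.executeUpdate();
                }
                conn.commit(); // all updates become visible together
            } catch (Exception e) {
                conn.rollback(); // none of the updates survive
                throw e;
            }
        }
    }
}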
Non-Windows file systems and NTFS tend to have the property that you can do atomic file replacement, so if you can't use a database and
all of the files are under one reasonably small directory
which your application owns and
which is stored on one physical drive:
then you could do the following:
Copy the directory contents using hard-links as appropriate.
Modify the 5 files.
Atomically swap the modified copy of the directory with the original (see the sketch below).
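Since most file systems will not atomically rename one directory over another, a common way to realize that last step is to publish the data through a symbolic link and atomically replace the link. A sketch (note that creating symlinks on Windows may require extra privileges):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicSwap {
    // "currentLink" is a symlink that readers follow; newVersionDir holds the modified copy.
    public static void publish(Path newVersionDir, Path currentLink) throws IOException {
        Path tmpLink = currentLink.resolveSibling(currentLink.getFileName() + ".tmp");
        Files.deleteIfExists(tmpLink);
        Files.createSymbolicLink(tmpLink, newVersionDir);
        // On POSIX-style file systems, rename() replaces the target link atomically.
        Files.move(tmpLink, currentLink, StandardCopyOption.ATOMIC_MOVE);
    }
}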
I've used the Apache Commons Transaction library for atomic file operations with success. It allows you to modify files transactionally and potentially roll back on failures.
Here's a link: http://commons.apache.org/transaction/
My approach would be to use a lock in your Java code, so that only one process can write to a file at any time. I'm assuming your application is the only one which writes the files.
If some write problem occurs even so, then to "roll back" your files you need to save a copy of them, as suggested above.
Can't you lock all the files and only write to them once all files have been locked?
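That could look something like this sketch with java.nio file locks. Note that locking keeps other cooperating processes out while you write, but does not by itself make the update atomic if the JVM crashes mid-write:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class LockAllThenWrite {
    public static void updateAll(List<Path> files) throws IOException {
        List<FileChannel> channels = new ArrayList<>();
        List<FileLock> locks = new ArrayList<>();
        try {
            // Acquire an exclusive lock on every file before writing anything.
            for (Path p : files) {
                FileChannel ch = FileChannel.open(p, StandardOpenOption.WRITE);
                channels.add(ch);
                locks.add(ch.lock()); // blocks until the exclusive lock is granted
            }
            // ... perform all writes through the locked channels here ...
        } finally {
            for (FileLock l : locks) l.release();
            for (FileChannel ch : channels) ch.close();
        }
    }
}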
Basically, I am looking for a simple way to list and access a set of strings in stream form in an abstract manner. The only issue is that Java's file-accessing API can only be used for listing and reading files, and any sort of non-filesystem storage of the data uses a different API. My question is whether there is some common API I could use (whether included in Java or as an external API) so that I could access both in an abstract manner, but also somewhat efficiently.
Essentially I want a set of lazily streamed text files. Something like a Set&lt;InputStream&gt; might be reasonable, except that on a filesystem you would have to open the text streams even if you don't end up wanting to access that file.
Some sort of API like
String[] TextStorage.list()
InputStream TextStorage.open(String elementname);
which could abstractly be used to access filesystems, databases, or some other storage mechanism I invent in the future (maybe fetching something across the internet).
Is there a library which already does this? Can I do this with the already existing Java API? Do I need to write this myself? I'd be surprised if no-one has encountered this problem before, but my google-fu and stackoverflow searches don't seem to find anything.
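For what it's worth, the abstraction is small enough to define yourself. Here is a sketch of the interface from the question with a filesystem-backed implementation (the names are of course made up):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// The abstraction from the question: list elements, open one lazily as a stream.
interface TextStorage {
    String[] list() throws IOException;
    InputStream open(String elementName) throws IOException;
}

// One possible backend; a database- or HTTP-backed class would implement
// the same two methods.
class DirectoryTextStorage implements TextStorage {
    private final File dir;

    DirectoryTextStorage(File dir) {
        this.dir = dir;
    }

    @Override
    public String[] list() {
        return dir.list(); // names only; no stream is opened yet
    }

    @Override
    public InputStream open(String elementName) throws IOException {
        return new FileInputStream(new File(dir, elementName)); // opened lazily, on demand
    }
}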
You might use HSQL:
http://hsqldb.org/
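If you go that route, the same two-method abstraction sketched under the question could be backed by an embedded HSQLDB table, roughly like this (the schema is made up):

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class HsqlTextStorage {
    private final Connection conn;

    public HsqlTextStorage(String dbPath) throws SQLException {
        // Embedded, file-based HSQLDB: no server process needed.
        conn = DriverManager.getConnection("jdbc:hsqldb:file:" + dbPath, "SA", "");
    }

    public String[] list() throws SQLException {
        List<String> names = new ArrayList<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT name FROM texts")) {
            while (rs.next()) names.add(rs.getString(1));
        }
        return names.toArray(new String[0]);
    }

    public InputStream open(String name) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement("SELECT content FROM texts WHERE name = ?")) {
            ps.setString(1, name);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return new ByteArrayInputStream(rs.getBytes(1)); // content is a BLOB/VARBINARY column
            }
        }
    }
}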
I need an indexed file format that can hold a few hundred large, variable-sized binary blobs.
Blobs are around 1-5 MB and the file could be as large as 1 GB. I need to be able to quickly find, read, add and remove blobs without recreating the entire file. I have no need to compress the blobs; however, if blobs are removed, I'd like to reclaim or reuse the space.
Ideally there would be a Java API.
I'm currently doing this with the ZIP format, but there's no known way to update a ZIP file without recreating it, and performance is bad.
I've looked into SQLite, but its blob performance was slow, and it's overkill for my needs.
Any thoughts, or should I roll my own?
And if I do roll my own, any book or web page suggestions?
Berkeley DB Java Edition does what you need. It's free.
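A minimal Berkeley DB JE sketch for the blob case (directory and names are placeholders; JE's background cleaner reclaims space from deleted records):

import java.io.File;
import java.nio.charset.StandardCharsets;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class BlobStore {
    public static void main(String[] args) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("blob-env"), envConfig); // placeholder dir

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database db = env.openDatabase(null, "blobs", dbConfig);

        // Store a blob under a string key.
        DatabaseEntry key = new DatabaseEntry("blob-42".getBytes(StandardCharsets.UTF_8));
        db.put(null, key, new DatabaseEntry(new byte[1024 * 1024]));

        // Read it back.
        DatabaseEntry value = new DatabaseEntry();
        OperationStatus status = db.get(null, key, value, LockMode.DEFAULT);
        System.out.println(status + ", " + value.getData().length + " bytes");

        db.close();
        env.close();
    }
}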
You need some virtual file system. Our SolFS is one of the options, though we have only a JNI layer, as the engine is written in C. There exists one more option, CodeBase, but as they don't provide an evaluation version of their file system, I know little about it.
SolFS is ideally suited for your task, because it lets you have alternative streams for files and associate searchable metadata with each file or even each alternative stream.
I am creating a few JAX-WS endpoints, for which I want to save the received and sent messages for later inspection. To do this, I am planning to save the messages (XML files) into filesystem, in some sensible hierarchy. There will be hundreds, even thousands of files per day. I also need to store metadata for each file.
I am considering putting the metadata (just a couple of fields) into a database table, but keeping the XML content itself in files in a filesystem, in order not to bloat the database with content data (that is seldom read).
Is there some simple library that helps me in saving, loading, deleting etc. the files? It's not that tricky to implement it myself, but I wonder if there are existing solutions. Just a simple library that already provides easy access to the filesystem (preferably across different operating systems).
Or do I even need that, should I just go with raw/custom Java?
Is there some simple library that helps me in saving, loading, deleting etc. the files? It's not that tricky to implement it myself, but I wonder if there are existing solutions. Just a simple library that already provides easy access to the filesystem (preferably across different operating systems).
Java API
Well, if what you need to do is really simple, you should be able to achieve your goal with java.io.File (delete, check existence, read, write, etc.) and a few stream manipulations with FileInputStream and FileOutputStream.
You can also throw in Apache commons-io and its handy FileUtils for a few more utility functions.
Java is independent of the OS. You just need to make sure you use File.separator, or use the constructor File(File parent, String child), so that you don't need to mention the separator explicitly.
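For the simple save/load/delete cases, that usually means something like this plain-java.io sketch:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class MessageFiles {
    // Save: stream the XML bytes to a file, creating parent directories first.
    public static void save(File file, byte[] xml) throws IOException {
        file.getParentFile().mkdirs();
        try (FileOutputStream out = new FileOutputStream(file)) {
            out.write(xml);
        }
    }

    // Load: read the whole file back into memory.
    public static byte[] load(File file) throws IOException {
        byte[] data = new byte[(int) file.length()];
        try (FileInputStream in = new FileInputStream(file)) {
            int off = 0;
            while (off < data.length) {
                int n = in.read(data, off, data.length - off);
                if (n < 0) throw new IOException("Unexpected end of " + file);
                off += n;
            }
        }
        return data;
    }

    // Delete: returns false if the file did not exist or could not be removed.
    public static boolean delete(File file) {
        return file.delete();
    }
}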
The Java file API is relatively high-level to abstract over the differences between the many OSes. Most of the time it's sufficient. It has some shortcomings only if you need a relatively OS-specific feature which is not in the API, e.g. checking the physical size of a file on the disk (not the logical size), security rights on *nix, free space/quota of the hard drive, etc.
Most OSes have an internal buffer for file writing/reading. Using FileOutputStream.write and FileOutputStream.flush ensures the data has been handed to the OS, but not necessarily written to the disk. The Java API also supports this low-level integration, to manage these buffering issues for systems such as databases.
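That low-level integration is exposed through FileDescriptor.sync. A sketch of a write that is forced to the physical disk (whether the sync cost is worth it depends on your durability needs):

import java.io.FileOutputStream;
import java.io.IOException;

public class DurableWrite {
    public static void writeDurably(String path, byte[] data) throws IOException {
        try (FileOutputStream out = new FileOutputStream(path)) {
            out.write(data);    // hands the bytes to the OS
            out.flush();        // a no-op for FileOutputStream, but harmless
            out.getFD().sync(); // asks the OS to push its buffers to the disk
        }
    }
}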
Also, both files and directories are abstracted by File, and you need to check with isDirectory. This can be confusing, for instance if you have one file x and one directory /x (I don't remember exactly how to handle this issue, but there is a way).
Web service
The web service can use either xs:base64Binary to pass the data, or MTOM (Message Transmission Optimization Mechanism) if the files are large.
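Enabling MTOM on a JAX-WS endpoint is mostly declarative. A sketch, with made-up class and method names:

import javax.activation.DataHandler;
import javax.jws.WebService;
import javax.xml.bind.annotation.XmlMimeType;
import javax.xml.ws.soap.MTOM;

// With @MTOM, large binary parts travel as raw MIME attachments instead of
// being inflated by base64 encoding inside the SOAP body.
@MTOM
@WebService
public class MessageArchiveService {
    public void store(@XmlMimeType("application/octet-stream") DataHandler payload) {
        // payload.getInputStream() streams the attachment content
    }
}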
Transactions
Note that the database is transactional and the file system is not, so you might have to add a few checks in case operations fail and are retried.
You could go with a complicated design involving some form of distributed transaction (see this answer), or try to go with a simpler design that provides the level of robustness that you need. A possible design could be:
Update. If the user wants to overwrite a file, you actually create a new one. The level of indirection between the logical file name and the physical file is stored in the database. This way you never overwrite a physical file once it is written, which ensures that rollback stays consistent.
Create. Same story when the user wants to create a file.
Delete. If the user wants to delete a file, you do it only in the database first. A periodic job polls the file system to identify files which are not listed in the database, and removes them (see the sketch below). This two-phase delete ensures that the delete operation can be rolled back.
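The periodic cleanup job in the delete step can be quite small; a sketch, where namesInDatabase stands in for a query against your metadata table:

import java.io.File;
import java.util.Set;

public class OrphanSweeper {
    // Remove physical files whose names are no longer listed in the database.
    public static void sweep(File storageDir, Set<String> namesInDatabase) {
        File[] files = storageDir.listFiles();
        if (files == null) return;
        for (File f : files) {
            if (f.isFile() && !namesInDatabase.contains(f.getName())) {
                f.delete(); // orphan: already deleted in the DB, so reclaim the file
            }
        }
    }
}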
This is not as robust as writing BLOBs into a real transactional database, but it provides some robustness. You could otherwise have a look at commons-transaction, but I feel like the project is dead (2007).
There is DataNucleus, a Java persistence provider. It is a little too heavy for this case, but it supports the JPA and JDO Java standards with different datastores (RDBMS, object storage, XML, JSON, Excel, etc.). If the product is already using JPA or JDO, it might be worth considering DataNucleus, as saving data into different datastores should be transparent. I suppose DataNucleus supports splitting the data into several files, creating the sensible directory/file structure I wanted (in my question), but this is just a guess.
Support for XML and JSON seems to be experimental.