Are transactions on top of "normal file system" possible? - java

It seems to be possible to implement transactions on top of normal file systems using techniques like write-ahead logging, two-phase commit, and shadow paging.
Indeed, it must be possible, because a transactional database engine like InnoDB can be deployed on top of a normal file system. There are also libraries like XADisk.
However, the Apache Commons Transaction project states:
...we are convinced that the main advertised feature transactional file access can not be implemented reliably. We are convinced that no such implementation can be possible on top of an ordinary file system. ...
Why does Apache Commons Transaction claim that implementing transactions on top of normal file systems is impossible?
Is it impossible to do transactions on top of normal file systems?

Windows offers transactions on top of NTFS. See the description here: http://msdn.microsoft.com/en-us/library/windows/desktop/bb968806%28v=vs.85%29.aspx
It's not recommended for use at the moment and there's an extensive discussion of alternative scenarios right in MSDN: http://msdn.microsoft.com/en-us/library/windows/desktop/hh802690%28v=vs.85%29.aspx .
Also, if you take a broad definition of a file system, a DBMS is also a kind of file system, and a file system (like NTFS or ext3) can be implemented on top of (or inside) a DBMS as well. So Apache's statement is, hmm, a bit incorrect.

This answer is pure speculation, but you may be comparing apples and oranges. Or perhaps more accurately, milk and dairy products.
When a database uses a file system, it is only using a small handful of predefined files on the system (per database). These include data files and log files. The one operation that is absolutely necessary for ACID-compliant transactions is the ability to force a write to permanent storage (either disk or static RAM), and I think most file systems provide this capability.
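In Java, for example, that forced write is exposed through FileChannel.force (or FileDescriptor.sync). A minimal sketch of the primitive a write-ahead log relies on (the file name and record format here are made up):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ForcedWrite {
    // Appends a log record and forces it to stable storage before returning.
    // A write-ahead log needs exactly this guarantee before it can acknowledge a commit.
    static void appendAndSync(Path logFile, String record) throws IOException {
        try (FileChannel channel = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            channel.write(ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8)));
            channel.force(true); // flush data and metadata to the device (fsync-like)
        }
    }
}

(Whether the bytes actually reach the platter also depends on the OS and the drive's own cache, which is part of why reliable transactional file access is hard.)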
With this mechanism, the database can maintain locks on objects in the database as well as control access to all objects. Happily, the database has layers of memory/page management built on top of the file system. The "database" itself is written in terms of things like pages, tables, and indexes, not files, directories, and disk blocks.
A more generic transactional system has other challenges. It would need, for instance, atomic actions over more things. For example, if you "transactionally" delete 10 files, all of them would have to disappear at the same time, or none of them at all. I don't think "traditional" file systems have this capability.
In the database world, the equivalent would be deleting 10 tables. Well, you essentially create new versions of the system tables without those tables, within a transaction, while the old tables are still being used. Then you put a full lock on the system tables (preventing reads and writes) and wait until it is granted. Then you swap in the new table definitions (i.e. without the tables), unlock the tables, and clean up the data. (This is intended as an intuitive view of the locking mechanism in this case, not a 100% accurate description.)
So, notice that locking and transactions are deeply embedded in the actions the database is doing. I suspect that the authors of this module came to realize that they would basically have to re-implement all existing file system functionality to support their transactions, and that this was too much scope to take on.

Related

Should I have a single or multiple script files?

I'm creating a program in Java that uses scripting. I'm just wondering whether I should split my scripts into one file for each script (more realistically, one file for each type of script, like "math scripts" and "account scripts"), or whether I should use one clumped file for all scripts.
I'm looking for an answer from more of a technical viewpoint rather than a practical viewpoint if possible, since this question kind of already explained the practical side (separate often modified scripts and large scripts).
In terms of technical performance impact, one could argue that using a single Globals instance is actually more efficient, since any libraries are loaded only once instead of multiple times. Whether to use multiple files, however, really depends. Multiple physical Lua files can be loaded into the same Globals, or a single file can be loaded into that Globals instance; either way, the Globals table contains the same amount of data in the end, regardless of whether it was loaded from multiple files or not. If you use a separate Globals for each file, this is not the case.
Questions like this really depend on what you intend to use Lua for. Using a single Globals instance will use RAM more efficiently, but beyond that will not really give any performance increase. Loading multiple files versus a single file may take slightly longer, due to the time spent opening and closing file handles, but this is such a micro-optimization that it seriously isn't worth the hassle of writing all the code in a single file, not to mention how hard it would be to organize it efficiently.
There are a few advantages to using multiple Globals as well, however: each Globals instance has its own global storage, so changes such as overloading operators on an object's metatable or overriding functions don't carry over to other instances. If this isn't a problem for you, then my suggestion would be to write the code in multiple files and load them all with a single Globals instance. If you do this, though, be careful to structure all your files properly: if you use the global scope a lot, you may find that keeping track of object names becomes difficult, and it is easy to accidentally overwrite values from other files by giving them the same name. To avoid this, each file can define all of its functionality in its own table; these tables then work as individual modules from which you can select features, almost like choosing from a specific file.
In the end it really doesn't make much of a difference, but depending on which you choose you may need to take care to ensure good organization of the code.
Using multiple Globals takes more RAM, but allows each file to have its own custom libraries without affecting the others, at the cost of requiring more structural management on the Java side of your software to keep all the files organized.
Using a single Globals takes less RAM, but all files share the same global scope, which makes customized versions of libraries more difficult and requires more structural organization on the Lua side of the software to prevent names and other functionality from conflicting.
If you intend other users to use your Lua API to extend your software, through an addon system for example, you may wish to use multiple instances of Globals, because making the addon author responsible for ensuring their code won't conflict with other addons is not only dangerous but also a burden that doesn't need to exist. An inexperienced user might come along trying to make an addon, not organize it properly, and mess up parts of the software or other addons.
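To make the single-Globals, multiple-files approach concrete, here is a rough LuaJ sketch (the script file names and the mathscripts table are hypothetical; each file is assumed to put its functions in its own table as suggested above):

import org.luaj.vm2.Globals;
import org.luaj.vm2.LuaValue;
import org.luaj.vm2.lib.jse.JsePlatform;

public class ScriptLoader {
    public static void main(String[] args) {
        // One Globals instance shared by every script file: the standard libraries
        // are loaded once and all scripts see the same global scope.
        Globals globals = JsePlatform.standardGlobals();

        String[] scriptFiles = { "math_scripts.lua", "account_scripts.lua" };
        for (String file : scriptFiles) {
            globals.loadfile(file).call(); // compile and run each chunk in the shared scope
        }

        // Call a function that math_scripts.lua is assumed to define as mathscripts.add(a, b).
        LuaValue mathScripts = globals.get("mathscripts");
        LuaValue result = mathScripts.get("add").call(LuaValue.valueOf(2), LuaValue.valueOf(3));
        System.out.println(result.toint());
    }
}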

Running Neo4j purely in memory without any persistence

I don't want to persist any data but still want to use Neo4j for its graph traversal and algorithm capabilities. In an embedded database, I've configured cache_type = strong, and after all the writes I set the transaction to failure. But my write speeds (node and relationship creation speeds) are slow, and this is becoming a big bottleneck in my process.
So, the question is, can Neo4j be run without any persistence aspects to it at all and just as a pure API? I tried others like JGraphT but those don't have traversal mechanisms like the ones Neo4j provides.
As far as I know, Neo4j data storage and Lucene indexes are always written to files. On Linux, at least, you could set up a ramfs file system to hold the files in memory.
See also:
Loading all Neo4J db to RAM
How many changes do you group in each transaction? You should try to group up to thousands of changes in each transaction since committing a transaction forces the logical log to disk.
However, in your case you could begin your transactions with:
db.tx().unforced().begin();
instead of:
db.beginTx();
This makes the transaction not wait for the logical log to be forced to disk, which makes small transactions much faster, but a power outage could potentially cost you the last couple of seconds of data.
The tx() method sits on GraphDatabaseAPI, which for example EmbeddedGraphDatabase implements.
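Putting both suggestions together, a batched write in the embedded API of that era might look roughly like this sketch (db.tx().unforced().begin() on GraphDatabaseAPI would replace db.beginTx() if you accept the weaker durability):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

public class BatchedWrites {
    // Creates many nodes inside a single transaction, so the logical log is forced
    // to disk once per batch instead of once per change.
    static void createNodes(GraphDatabaseService db, int count) {
        Transaction tx = db.beginTx();
        try {
            for (int i = 0; i < count; i++) {
                Node node = db.createNode();
                node.setProperty("index", i);
            }
            tx.success();
        } finally {
            tx.finish(); // older embedded API; newer versions use try-with-resources and tx.close()
        }
    }
}

Grouping thousands of changes per call this way is usually enough to remove the per-write commit cost described in the question.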
You can try a virtual drive. It would make Neo4j persist to the drive, but it would all happen in memory:
https://thelinuxexperiment.com/create-a-virtual-hard-drive-volume-within-a-file-in-linux/

Sharing nHibernate and hibernate 2nd level cache

Is it possible to share the 2nd level cache between a Hibernate and an NHibernate solution? I have an environment where there are servers running .NET and servers running Java that both access the same database.
There is some overlap in the data they access, so sharing a 2nd level cache would be desirable. Is it possible?
If this is not possible, what are some of the solutions other have come up with?
There is some overlap in the data they access, so sharing a 2nd level cache would be desirable. Is it possible?
This would require (and this is very likely oversimplified):
Being able to access a cache from Java and .Net.
Having cache provider implementations for both (N)Hibernate.
Being able to read/write data in a format compatible with both languages (or there is no point in sharing the cache).
This sounds feasible but:
I'm not aware of an existing ready-to-use solution implementing this (my first idea was Memcache, but AFAIK Memcache stores a serialized version of the data, so it doesn't meet requirement #3, which is the most important).
I wonder if using a language neutral format to store data would not generate too much overhead (and somehow defeat the purpose of using a cache).
If this is not possible, what are some of the solutions other have come up with?
I never had to do this, but if we're talking about a read-write cache and you use two separate caches, you'll have to invalidate a given Java cache region from the .NET side and vice versa. You'll have to write the code to handle that.
As Pascal said, it's improbable that sharing the 2nd level cache is technically possible.
However, you can think about this from a different perspective.
It's unlikely that both applications read and write the same data. So, instead of sharing the cache, what you could implement is a cache invalidation service (using the communications stack of your choice).
Example:
Application A mostly reads Customer data and writes Invoice data
Application B mostly reads Invoice data and writes Customer data
Therefore, Application A caches Customer data and Application B caches Invoice data
When Application A, for example, modifies an invoice, it sends a message to Application B and tells it to evict the invoice from the cache.
You can also evict whole entity types, collections and regions.
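A sketch of what the receiving side of such an invalidation message could look like with Hibernate's Cache API (the handler class and method names are made up; the transport that delivers the message, whether JMS, HTTP, or anything else, is up to you):

import java.io.Serializable;
import org.hibernate.SessionFactory;

public class CacheInvalidationHandler {

    private final SessionFactory sessionFactory;

    public CacheInvalidationHandler(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    // Evict a single cached instance, e.g. the invoice that the other application modified.
    public void onEntityChanged(Class<?> entityClass, Serializable id) {
        sessionFactory.getCache().evictEntity(entityClass, id);
    }

    // Evict every cached instance of an entity type, e.g. after a bulk update on the other side.
    public void onEntityTypeChanged(Class<?> entityClass) {
        sessionFactory.getCache().evictEntityRegion(entityClass);
    }
}

The NHibernate side would need an equivalent handler listening for messages sent from the Java application.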

What is the best way to serialize an EMF model instance?

I have an Eclipse RCP application with an instance of an EMF model populated in memory. What is the best way to store that model for external systems to access? Access may occur during and after run time.
Reads and writes of the model are pretty balanced and can occur several times a second.
I think a database populated using Hibernate + Teneo + EMF would work nicely, but I want to know what other options are out there.
I'm using CDO (Connected Data Objects) in conjunction with EMF to do something similar. If you use the examples in the Eclipse wiki, it doesn't take too long to get it running. A couple of caveats:
For data that changes often, you probably will want to use nonAudit mode for your persistence. Otherwise, you'll save a new version of your EObject with every commit, retaining the old ones as well.
You can choose to commit every time your data changes, or you can choose to commit at less frequent intervals, depending on how frequently you need to publish your updates.
You also have fairly flexible locking options if you choose to do so.
My application uses Derby for persistence, though it will be migrated to SQL Server before long.
There's a 1 hour webinar on Eclipse Live (http://live.eclipse.org/node/635) that introduces CDO and gives some good examples of its usage.
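For reference, once a CDOSession is open (the Net4j connector/session setup from the wiki examples), storing a model instance looks roughly like this sketch (the resource path is arbitrary, and the exact exception thrown by commit() depends on the CDO version, so it is declared broadly here):

import org.eclipse.emf.cdo.eresource.CDOResource;
import org.eclipse.emf.cdo.session.CDOSession;
import org.eclipse.emf.cdo.transaction.CDOTransaction;
import org.eclipse.emf.ecore.EObject;

public class CdoStore {
    // Adds a model instance to a resource in the CDO repository and commits it.
    static void store(CDOSession session, String resourcePath, EObject model) throws Exception {
        CDOTransaction transaction = session.openTransaction();
        try {
            CDOResource resource = transaction.getOrCreateResource(resourcePath);
            resource.getContents().add(model);
            transaction.commit(); // in audit mode every commit also retains the previous revisions
        } finally {
            transaction.close();
        }
    }
}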
I'd go with Teneo to do the heavy lifting unless performance is a real problem (which it won't be unless your models are vast). Even if it is slow you can tune it using JPA annotations.

Strategy for Offline/Online data synchronization

My requirement is that I have a server J2EE web application and a client J2EE web application. Sometimes the client can go offline. When the client comes back online, it should be able to synchronize changes in both directions. I should also be able to control which rows/tables need to be synchronized based on some filters/rules. Are there any existing Java frameworks for doing this? If I need to implement it on my own, what are the different strategies that you can suggest?
One solution I have in mind is maintaining SQL logs and executing the same statements on the other side during synchronization. Do you see any problems with this strategy?
There are a number of Java libraries for data synchronization/replication. Two that I'm aware of are daffodil and SymmetricDS. In a previous life I foolishly implemented (in Java) my own data replication process. It seems like the sort of thing that should be fairly straightforward, but if the data can be updated in multiple places simultaneously, it's hellishly complicated. I strongly recommend you use one of the aforementioned projects to avoid dealing with this complexity yourself.
The biggest issue with synchronization is when the user edits something offline and it is edited online at the same time. You need to merge the two changed pieces of data, or deal with the UI to allow the user to say which version is correct. If you eliminate the possibility of both being edited at the same time, then you don't have to solve this sticky problem.
The usual method is to add a 'modified' field to all tables and compare the client's modified field for a given record against the server's modified date. If they don't match, you replace the server's data.
Be careful with autogenerated keys - you need to make sure your data integrity is maintained when you copy from the client to the server. Strictly running the SQL statements again on the server could put you in a situation where the autogenerated key has changed, and suddenly your foreign keys are pointing to different records than you intended.
Often when importing data from another source, you keep track of the primary key from the foreign source as well as your own personal primary key. This makes determining the changes and differences between the data sets easier for difficult synchronization situations.
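As a rough JDBC illustration of the 'modified' field and foreign-source-key ideas above (the customer table, its columns, and the method signature are all hypothetical):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

public class CustomerSync {
    // Pushes one client-side customer row to the server. The client's own key is kept
    // in a separate source_id column (the server generates its own primary key), and
    // the server row is only overwritten when the modified timestamps differ.
    static void pushCustomer(Connection server, long clientId, String name, Timestamp clientModified)
            throws SQLException {
        try (PreparedStatement select = server.prepareStatement(
                "SELECT modified FROM customer WHERE source_id = ?")) {
            select.setLong(1, clientId);
            try (ResultSet rs = select.executeQuery()) {
                if (!rs.next()) {
                    try (PreparedStatement insert = server.prepareStatement(
                            "INSERT INTO customer (source_id, name, modified) VALUES (?, ?, ?)")) {
                        insert.setLong(1, clientId);
                        insert.setString(2, name);
                        insert.setTimestamp(3, clientModified);
                        insert.executeUpdate();
                    }
                } else if (!clientModified.equals(rs.getTimestamp("modified"))) {
                    try (PreparedStatement update = server.prepareStatement(
                            "UPDATE customer SET name = ?, modified = ? WHERE source_id = ?")) {
                        update.setString(1, name);
                        update.setTimestamp(2, clientModified);
                        update.setLong(3, clientId);
                        update.executeUpdate();
                    }
                }
            }
        }
    }
}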
Your synchronizer needs to identify when data can just be updated and when a human being needs to mediate a potential conflict. I have written a paper that explains how to do this using logging and algebraic laws.
What is best suited as the client-side data store in your application? You can choose from an embedded database like SQLite, a message queue, or some object store, or (if none of these can be used, since it is a web application) files/documents saved on the client using HTML5 storage such as Web SQL Database, IndexedDB, or the localStorage API.
Check the paper Gold Rush: Mobile Transaction Middleware with Java-Object Replication. Microsoft's documentation of occasionally connected systems describes two approaches: service-oriented or message-oriented, and data-oriented. Gold Rush takes the former approach. The latter approach uses database merge replication.
