I am not quite sure how to ask this question but I hope you get my drift...
I am using OrientDB as an embedded database used by a single application. I would like to ensure that, should this application crash, the database is always in a consistent state, so that my application can be started again without having to perform any maintenance on the database or losing any data.
I.e. when I change the database and get a success message, I know that the changes have been written.
Is this supported by OrientDB, and if so, which option enables it?
(P.S. if I knew the generally accepted term for this kind of setup, I could search for it myself...)
OrientDB uses a write-ahead log (WAL): by default it records all operations performed on the data stored on disk in an append-only log. Records of this log are cached and flushed every second. If the application crashes, the WAL/operation log is read on restart and all logged operations are applied again. The WAL also has a notion of transactions, which means that if a transaction was not finished at the time of the crash, all of its applied changes are rolled back. So you can be sure of the following in OrientDB:
All data written more than one second before the crash will be restored.
All data written inside a transaction will be in a consistent state.
You may lose data written during the last second before the crash.
The interval between flushes of the WAL cache can be changed, but shortening it may slow performance.
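As an illustration, here is a minimal sketch using the classic embedded document API (the class name, credentials, and path are illustrative, and the exact API depends on your OrientDB version): wrapping writes in begin()/commit() is what guarantees that a crash mid-transaction is rolled back on restart. The WAL flush interval itself is tuned through OGlobalConfiguration (the storage.wal.* settings); check your version's documentation for the exact property names.

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;

public class AtomicWriteDemo {
    public static void main(String[] args) {
        // "plocal" is the embedded on-disk storage engine; the path is illustrative.
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:./databases/demo");
        if (db.exists()) {
            db.open("admin", "admin");
        } else {
            db.create();
        }
        try {
            db.begin();
            ODocument doc = new ODocument("Person"); // "Person" is a hypothetical class
            doc.field("name", "Alice");
            doc.save();
            // Once commit() returns, the change has been recorded in the WAL;
            // if the process crashes mid-transaction, recovery rolls it back.
            db.commit();
        } catch (RuntimeException e) {
            db.rollback();
            throw e;
        } finally {
            db.close();
        }
    }
}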
I am trying to understand isolation levels and the various issues they address, i.e. dirty read, non-repeatable read, phantom read, and lost update.
I was reading about non-repeatable reads.
I had also read about lost updates.
What confuses me is that both of these look very similar to me: in a non-repeatable read (NRR), Tx B updates the row between two reads of the same row by Tx A, so Tx A gets different results.
In the case of a lost update, Tx B overwrites changes committed by Tx A.
So to me both of these seem quite similar and related.
Is that correct?
My understanding is that using 'optimistic locking' will prevent the 'lost update' issue.
(Based on some very good answers here.)
My confusion:
However, would it also imply/mean that by using 'optimistic locking' we also eliminate the issue of 'non-repeatable read'?
All of these questions pertain to a Java J2EE application with an Oracle database.
NOTE: to avoid distractions, I am not looking for details pertaining to dirty reads and phantom reads; my focus presently is entirely on non-repeatable reads and lost updates.
Non-repeatable reads, lost updates, phantom reads, as well as dirty reads, are about transaction isolation levels, rather than pessimistic/optimistic locking. I believe Oracle's default isolation level is read committed, meaning that only dirty reads are prevented.
Non-repeatable reads and lost updates are indeed somewhat related, as they may or may not occur at the same isolation level. Neither can be avoided by locking alone unless you set the correct isolation level, but you can use versioning (a column whose value is checked against and incremented on every update) to at least detect the issue (and take the necessary action), as sketched below.
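As an illustration of such versioning (the table and column names are hypothetical), a lost update can be detected with a version-checked UPDATE; zero affected rows means someone else changed the row since you read it:

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class OptimisticUpdate {
    // Assumes a table like ACCOUNT(ID, AMOUNT, VERSION) -- purely illustrative.
    static boolean updateWithVersionCheck(Connection conn, long id,
                                          long expectedVersion, BigDecimal newAmount)
            throws SQLException {
        String sql = "UPDATE account SET amount = ?, version = version + 1 "
                   + "WHERE id = ? AND version = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setBigDecimal(1, newAmount);
            ps.setLong(2, id);
            ps.setLong(3, expectedVersion);
            // 0 rows updated => the row changed since we read it:
            // the lost update is detected and can be retried or reported.
            return ps.executeUpdate() == 1;
        }
    }
}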
The purpose of repeatable reads is to provide read-consistent data:
within a query, all the results should reflect the state of the data at a specific point in time.
within a transaction, the same query should return the same results even if it is repeated.
In Oracle, queries are read-consistent as of the moment the query started. If data changes during the query, the query reads the version of the data that existed at the start of the query. That version is available in the "UNDO".
Bottom line: Oracle by default has an isolation level of READ COMMITTED, which guarantees read-consistent data within a query, but not within a transaction.
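To make that concrete, here is a minimal JDBC sketch (connection URL, credentials, and the query are placeholders): with Oracle's default READ COMMITTED each query is consistent as of its own start, while SERIALIZABLE extends read consistency to the whole transaction:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class IsolationDemo {
    public static void main(String[] args) throws Exception {
        // Connection URL and credentials are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/XEPDB1", "user", "password")) {
            conn.setAutoCommit(false);
            // Oracle's default: each query is read-consistent as of its own start.
            conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
            // Uncomment for transaction-wide read consistency:
            // conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT amount FROM account WHERE id = 1")) {
                while (rs.next()) {
                    System.out.println(rs.getBigDecimal(1));
                }
            }
            conn.commit();
        }
    }
}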
You talk about Tx A and Tx B. In Oracle, a session that does not change any data does not have a transaction.
Assume the default isolation level of READ COMMITTED. Assume the J2EE application uses a connection pool and is stateless.
app thread A connects to session X and reads a row.
app thread B connects to session Y and updates the row with commit.
app thread A connects to session Z and reads the same row, seeing a different result.
Notice that there is nothing any database can do here. Even if all the sessions had the SERIALIZABLE isolation level, session Z has no idea what is going on in session X. Besides, thread A cannot leave a transaction hanging in session X when it disconnects.
To your question, notice that app thread A never changed any data. The human user behind app thread A queried the same data twice and saw two different results, that is all.
Now let's do an update:
app thread A connects to session X and reads a row.
app thread B connects to session Y and updates the row with commit.
app thread A connects to session Z and updates the same row with commit.
Here the same row had three different values, not two. The human user behind thread A saw the first value and changed it to the third value without ever seeing the second value! That is what we mean by a "lost update".
The idea behind optimistic locking is to notify the human user that, between the time they queried the data and the time they asked to update it, someone else changed the data first. They should look at the most recent values before confirming the update.
To simplify:
"non-repeatable reads" happen if you query, then I update, then you query.
"lost updates" happen if you query, then I update, then you update. Notice that if you query the data again, you need to see the new value in order to decide what to do next.
Suggested reading: https://blogs.oracle.com/oraclemagazine/on-transaction-isolation-levels
Best regards, Stew Ashton
I'm using Firebase Firestore documents to publish the location of my users on a map, so they see each other. This works fine when all of them have good connectivity, but sometimes their mobiles can't connect to the Firebase servers and it seems that the writes are cached: whenever they recover connectivity all the pending location writes are sent in bulk.
The effect for other users is that they see a person's position stop, and after a while it starts moving really quickly until the map position catches up with the real value. This is annoying and a waste of bandwidth.
I have tried disabling the persistence cache, but this doesn't help (it would only help if the transmitting app died; as long as it lives, the positions are cached in memory).
Maybe the problem is that I shouldn't be using documents for this purpose and there is another Firebase mechanism which allows discarding stale write data for the purposes of real time communication?
All write operations that are performed while the device is offline are queued until the connection with the Firebase servers is reestablished. Unfortunately, there is no API that lets you control which write operations are queued and which are not.
The simplest solution I can think of is to use Firestore transactions, which are currently not persisted to disk and thus will be lost when the application is offline.
So transactions are not supported for offline use; they can't be cached or saved for later. This is because a transaction absolutely requires round-trip communication with the server in order to ensure that the code inside the transaction completes successfully. You can therefore use transactions only while online, because they are network dependent.
You can work around this by only making requests when you can connect to Firestore. Here's a helper function that determines whether you're connected. It's similar to using a transaction, since both methods involve making a read request.
If you don't plan on invoking the function very often, the read costs will probably be negligible. However, to save the cost of reads, you could also consider pinging some server or a Cloud Function instead of Firestore itself. Doing so might be a less accurate way of testing the connection to Firestore, though.
import { initializeApp } from "firebase/app"
import { doc, getDocFromServer, getFirestore } from "firebase/firestore"

// Initialize Firebase with your own project's config (placeholder here).
const app = initializeApp({ /* your firebase config */ })
const db = getFirestore(app)

async function canConnectToFirestore() {
  // navigator.onLine can only say for certain that you're disconnected.
  // For more info: https://developer.mozilla.org/en-US/docs/Web/API/Navigator/onLine
  if (!navigator.onLine)
    return false
  try {
    // A forced server read fails when Firestore is unreachable.
    await getDocFromServer(doc(db, "IrrelevantText", "IrrelevantText"))
    return true
  }
  catch (e) {
    return false
  }
}

async function example() {
  if (await canConnectToFirestore()) console.log("Do something with firestore")
}
I'm using a PostUpdateEventListener registered via
registry.appendListeners(EventType.POST_COMMIT_UPDATE, listener)
and a few other listeners in order to track changes made by Hibernate. This works perfectly; however, I see a problem:
Let's say, for tracking some amount by id, I simply execute
amountByIdConcurrentMap.put(id, amount);
on every POST_COMMIT_UPDATE (let's ignore other operations). The problem is that this call happens some time after the commit. So with two commits writing the same entity shortly one after the other, I can receive the events in the wrong order, ending up with the older amount stored.
Is this really possible or are the operations synchronized somehow?
Is there a way how to prevent or at least detect such situation?
Two questions first, and then a proposal:
Are you sure that you need this optimization? Why not fetch the amount as it is written to the database by querying it there? What gives you reason to work with a cache?
How do you make sure that the calculation of the amount before writing it to the database is properly synchronized, so that multiple threads (or possibly nodes) do not use old data to calculate the amount and thereby overwrite the result of a later calculation?
Assuming you handle question 2 correctly, you have two options:
Pessimistic locking: immediately before the commit you can exclusively update your cache without concurrency issues.
Optimistic locking: in that case you have a kind of timestamp or counter in your database record that you can also put into the cache together with the amount. You can use this value to determine which amount is more recent, as sketched below.
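As a sketch of the optimistic option (assuming the entity carries a version counter, e.g. a Hibernate @Version field, that the listener can read), an atomic merge keeps the newest value even when events arrive out of order:

import java.math.BigDecimal;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical holder pairing the amount with the entity's version value.
record VersionedAmount(long version, BigDecimal amount) {}

class AmountCache {
    private final ConcurrentMap<Long, VersionedAmount> amountById = new ConcurrentHashMap<>();

    // Called from the POST_COMMIT_UPDATE listener. merge() is atomic per key,
    // so a stale event (lower version) can never overwrite a newer amount.
    void onPostCommitUpdate(long id, VersionedAmount incoming) {
        amountById.merge(id, incoming,
                (current, candidate) ->
                        candidate.version() > current.version() ? candidate : current);
    }
}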
No, there are no ordering guarantees, so you'll have to take care to ensure proper synchronization manually.
If the real problem you are solving is caching of entity state and if it is suitable to use second-level cache for the entity in question, then you would get everything out of the box by enabling the L2 cache.
Otherwise, instead of updating the map from the update listeners directly, you could submit tasks to an Executor or messaging system that would asynchronously start a new transaction and select for update the amount for the given id from the database. Then update the map in the same transaction while holding the corresponding row lock in the db, so that map updates for the same id are done serially.
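Here is a hedged sketch of that second approach (the entity class, the session-factory wiring, and the single-thread executor are all assumptions; a per-id striped executor would work as well):

import java.math.BigDecimal;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.hibernate.LockMode;
import org.hibernate.LockOptions;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

// "AccountEntity" stands in for your mapped entity (mapping annotations omitted).
class AccountEntity {
    private Long id;
    private BigDecimal amount;
    public BigDecimal getAmount() { return amount; }
}

class AmountRefresher {
    private final SessionFactory sessionFactory; // assumed to be configured elsewhere
    private final ConcurrentMap<Long, BigDecimal> amountById = new ConcurrentHashMap<>();
    // A single thread serializes refreshes in this JVM; the row lock below
    // additionally serializes against other nodes.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    AmountRefresher(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    // Call this from the POST_COMMIT_UPDATE listener instead of writing the map directly.
    void scheduleRefresh(long id) {
        executor.submit(() -> {
            try (Session session = sessionFactory.openSession()) {
                Transaction tx = session.beginTransaction();
                // SELECT ... FOR UPDATE: the row lock is held until commit,
                // so map updates for the same id happen serially.
                AccountEntity entity = session.get(AccountEntity.class, id,
                        new LockOptions(LockMode.PESSIMISTIC_WRITE));
                if (entity != null) {
                    amountById.put(id, entity.getAmount());
                }
                tx.commit();
            }
        });
    }
}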
I'm unable to find documentation that fully explains what happens when entities are deleted from the Datastore (I'm using JDO deletePersistent) without being in a transaction. I can afford losing data accuracy during parallel updates when not using transactions, for the sake of performance and to avoid contention.
But how can I make sure, when my code is running on different machines at the same time, that a delete operation will not be overridden by a later update/put based on a previous read of that entity on another machine? I'm letting the PersistenceManager take care of implicit updates to attached objects.
EDIT:
Trying to update that entity after deletePersistent will result in an exception, but only when updating the exact same copy that was passed to deletePersistent. If it were a different copy on another machine, would that be treated as updating a deleted entity (not valid), or as an insert or update that puts the entity back?
This is taken from GAE Documentation:
Using Transactions
A transaction is a set of Datastore operations on one or more entities. Each transaction is guaranteed to be atomic, which means that transactions are never partially applied. Either all of the operations in the transaction are applied, or none of them are applied.
An operation may fail when:
Too many users try to modify an entity group simultaneously.
The application reaches a resource limit.
The Datastore encounters an internal error.
Since transactions are guaranteed to be atomic, an ATOMIC operation like a single delete operation will always work inside or outside a transaction.
The answer is yes: even after the object was deleted, if it was read beforehand and the update was committed after the delete was committed, it will be put back, because (as @Nick Johnson commented) inserts and updates are the same operation. I tested this using a 20-second thread sleep after getting the object for update, allowing the object to be deleted and then put back.
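If you need to prevent that resurrection, one hedged approach (sketched below; the persistence unit name and MyEntity are placeholders) is to re-read the entity inside a transaction before updating, so that an update racing with a delete fails instead of putting the entity back:

import javax.jdo.JDOHelper;
import javax.jdo.JDOObjectNotFoundException;
import javax.jdo.PersistenceManager;
import javax.jdo.PersistenceManagerFactory;
import javax.jdo.Transaction;

class SafeUpdate {
    // "transactions-optional" is the usual GAE persistence unit name; adjust as needed.
    private static final PersistenceManagerFactory PMF =
            JDOHelper.getPersistenceManagerFactory("transactions-optional");

    // MyEntity is a placeholder for your persistence-capable class.
    void updateAmount(Object key, long newAmount) {
        PersistenceManager pm = PMF.getPersistenceManager();
        Transaction tx = pm.currentTransaction();
        try {
            tx.begin();
            // Re-read inside the transaction; this throws
            // JDOObjectNotFoundException if the entity was deleted meanwhile.
            MyEntity e = pm.getObjectById(MyEntity.class, key);
            e.setAmount(newAmount); // implicit update, flushed at commit
            tx.commit();
        } catch (JDOObjectNotFoundException gone) {
            // The entity was deleted concurrently; do not re-create it.
        } finally {
            if (tx.isActive()) tx.rollback();
            pm.close();
        }
    }
}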
While a Hadoop job is running, if I write something to HDFS or HBase, will that data be visible to all nodes in the cluster:
1.) Immediately?
2.) If not immediately, then after how much time?
3.) Or can the time really not be determined?
HDFS is strongly consistent, so once a write has completed successfully, the new data should be visible across all nodes immediately. Clearly the actual writing takes some time - see replication pipelining for some details on this.
This is in contrast to eventually consistent systems, where it may take an indefinite time (though often only a few milliseconds) before all nodes see a consistent view of the data.
Systems such as Cassandra have tunable consistency - each read and write can be performed at a different level of consistency to suit the operation being performed.
To the best of my understanding, the data is visible immediately after the write operation is finished.
Let's look at some aspects of the process:
When a client writes to HDFS, the data is written to all replicas, and after the write operation finishes it should be fully available.
There is also only one place holding metadata, the NameNode, which has no notion of isolation that would allow hiding data until some larger piece of work is done.
HBase is a different case: it only writes its log to HDFS immediately, and its HFiles are updated with new data only after compaction. At the same time, once HBase itself writes something into HDFS, that data is visible immediately.
In HDFS, data is visible once it is flushed or synced using the hflush() or hsync() methods (these were introduced in version 0.21, I believe). hflush() gives you a guarantee that the data is visible to all readers. hsync() gives you a guarantee that the data was saved to disk (although it may still be in the disk cache). The write method alone does not give you any guarantees. To answer your question: in HDFS, data is visible to everyone immediately after doing hflush() or hsync().
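For illustration, here is a minimal HDFS client sketch (the file path is arbitrary and the configuration is assumed to point at your cluster) showing where hflush() and hsync() fit into a write:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VisibilityDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/visibility-demo.txt"); // illustrative path
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("first record\n");
            // hflush(): the data becomes visible to all new readers,
            // but may still sit in OS caches on the datanodes.
            out.hflush();
            out.writeBytes("second record\n");
            // hsync(): additionally asks the datanodes to flush to disk.
            out.hsync();
        } // close() also makes everything written so far visible
    }
}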