How to initialize and fill an Apache Ignite database from the first node? - java

I would like to use Apache Ignite as failover read-only storage so my application will be able to access the most sensitive data if main storage (Oracle) is down.
So I need to:
Start nodes
Create schema (execute DDL queries)
Load data from Oracle to Ignite
It seems like this is not the same as database caching, so I don't need to use a cache as such. However, this page says that I need to implement a store to load a large amount of data from 3rd parties.
So, my questions are:
How can I effectively transfer data from Oracle to Ignite? Data streamers?
Who should initiate this transfer? The first node to start? How do I do that? (Tutorials explain how to achieve this via clients; should I follow that advice?)

Actually, I think using a cache store without read/write-through would be a suitable option here. You can configure a CacheJdbcPojoStore, for example, and call IgniteCache#loadCache(...) on your cache once the cluster is up. More on this topic: https://apacheignite.readme.io/docs/3rd-party-store
If you don't want to use a cache store, then IgniteDataStreamer could be a good choice. It is the fastest way to upload a large amount of data to the cluster. Data loading is usually performed from a client node, once all server nodes are up and running.
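For the data-streamer route, a minimal sketch might look like the following. The cache name "PersonCache", the table and column names, and the JDBC connection details are all assumptions; the cache must already be configured on the cluster before streaming into it.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class OracleToIgniteLoader {
    public static void main(String[] args) throws Exception {
        // Join the cluster as a client node (server nodes are already up).
        Ignition.setClientMode(true);
        try (Ignite ignite = Ignition.start();
             // "PersonCache" is a hypothetical cache name; it must match your schema.
             IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("PersonCache");
             Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/SERVICE", "user", "pass");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM person")) {
            while (rs.next()) {
                // Buffered, batched upload; much faster than individual puts.
                streamer.addData(rs.getLong("id"), rs.getString("name"));
            }
        } // closing the streamer flushes any remaining buffered entries
    }
}
```

Running this once from a single client node, after all server nodes have started, covers the "who should init this transfer" part of the question.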

Related

AWS Multi region ElastiCache sync

Need information on best practices for below AWS specific use case,
Our Java web application is deployed in us-east-1 and us-west-2 regions.
It communicates with DynamoDB, and a Memcached-based ElastiCache sits on top of DynamoDB in both regions.
We have DynamoDB replication enabled between us-east-1 and us-west-2.
Route 53 directs API calls to the appropriate region.
Now, the issue is that when we create or update a record in DynamoDB, it gets inserted into DynamoDB and cached in that particular region. The record gets replicated to the other region's DynamoDB as well, but the caches don't stay in sync, since there is no replication between the ElastiCache clusters.
How do we address this issue in a best possible way?
DAX is not the answer here; I'm surprised nobody is pointing this out.
Ivan is wrong about this one.
DAX is indeed a read-write-through cache, but it doesn't capture data changes made through direct access to DynamoDB (obviously), nor data changes made through an "other" DAX cluster.
So in this scenario:
you have two DAX clusters, one in west-1 and one in east-1, and those two clusters are totally independent.
Thus, if you make a data change in west-1 through DAX, that change gets propagated to east-1 at the DynamoDB table level, not at the cache (DAX) level.
In other words, if you update a record that had been accessed from both regions (and was therefore cached in both DAX clusters), you'll still get the same problem.
Fundamentally, this is the problem of syncing a cache layer across regions, which is pretty hard. If you really need it, one way to do it is to build your own Kafka or Kinesis stream that carries the cache-entry changes and is consumed by the cache clusters in multiple regions. This is not possible with ElastiCache on its own right now (but it is possible if you set up Lambda or EC2 for this task only).
Check out the case studies from Netflix.
One option, as others suggested, is to use DynamoDB DAX. It's a write-through cache, meaning that you do not need to synchronise database data with your cache. It happens behind the curtains when you write data using the normal DynamoDB API.
But if you want to still use ElastiCache you can use DynamoDB Streams. You can implement a Lambda function that will be triggered on every DynamoDB update in a table and write new data into ElastiCache.
This article may give you some ideas about how to do this.
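As a rough sketch of that Streams + Lambda approach: the handler below reacts to stream records and mirrors them into the cache. The table's "id" key attribute, the cache endpoint, and the use of a Redis-backed ElastiCache with the Jedis client are all assumptions made for illustration.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
import redis.clients.jedis.Jedis;

public class StreamToCacheHandler implements RequestHandler<DynamodbEvent, Void> {
    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        // "my-cache.example.com" is a placeholder for your ElastiCache endpoint.
        try (Jedis jedis = new Jedis("my-cache.example.com", 6379)) {
            for (DynamodbEvent.DynamodbStreamRecord record : event.getRecords()) {
                String key = record.getDynamodb().getKeys().get("id").getS();
                if ("REMOVE".equals(record.getEventName())) {
                    jedis.del(key); // keep deletes in sync too
                } else {
                    // Cache the record's new image, keyed by the table's "id" attribute.
                    jedis.set(key, record.getDynamodb().getNewImage().toString());
                }
            }
        }
        return null;
    }
}
```

Deploying this Lambda in each region, triggered by that region's replicated stream, is one way to keep both caches warm without cross-region cache replication.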

Java web - sessions design for horizontal scalability

I'm definitely not an expert Java coder. I need to implement sessions in my Java Servlet-based web application, and as far as I know this is normally done through HttpSession. However, this method stores the session data locally on the server, and I don't want such a constraint on horizontal scalability. Therefore I thought of saving sessions in an external database with which the application communicates through a REST interface.
Basically, in my application there are users performing actions such as searches. Therefore what I'm going to persist in sessions is essentially the login data and the metadata associated with searches.
As the main data storage I'm planning to use a graph noSQL database, the question is: let's say I can eventually also use another database of another kind for sessions, which architecture fits better for this kind of situation?
I have currently thought of two possible approaches. The first one uses another DB (such as an SQL DB) to store session data. This way I would have a more distributed workload, since I'm not also using the main storage for sessions. Moreover, I'd have a more organized environment, with session state variables and persistent ones not mixed up.
The second approach instead consists of storing all information relative to a session in the "user node" of the main database. The session id would at that point be just a "shortcut" for authentication. This way I don't have to rely on a second database; however, I move all the workload to the main DB, mixing session data with persistent data.
Is there any standard general architecture to which I can make reference? Do I miss some important point which should constrain my architecture?
Your idea to store sessions in a different location is good. How about using an in-memory cache like Memcached or Redis? Session data is generally not long-lived, so you have options other than a full-blown database. Memcached and Redis can both be clustered and scale horizontally.
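A minimal sketch of the Redis option using the Jedis client; the key naming scheme, the 30-minute TTL, and the pipe-delimited session payload are assumptions, not a standard:

```java
import java.util.UUID;

import redis.clients.jedis.Jedis;

public class SessionStore {
    // Any node of your Redis deployment; host/port are placeholders.
    private final Jedis jedis = new Jedis("localhost", 6379);

    // Create a session with a 30-minute TTL; Redis expires it automatically,
    // so no cleanup job is needed and no session state lives on the app server.
    public String createSession(String userId, String searchMetaJson) {
        String sessionId = UUID.randomUUID().toString();
        jedis.setex("session:" + sessionId, 1800, userId + "|" + searchMetaJson);
        return sessionId;
    }

    public String lookup(String sessionId) {
        return jedis.get("session:" + sessionId); // null if expired or unknown
    }
}
```

Because every app server talks to the same cache, any instance behind the load balancer can resolve any session id, which is exactly the horizontal-scalability property the question asks for.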

What is the best way to index data from Oracle/relational tables into Elasticsearch?

What are the options for indexing large data volumes from an Oracle DB into an Elasticsearch cluster? The requirement is to index 300 million records one time into multiple indexes, plus incremental updates of approximately 1 million changes every day.
I have tried the JDBC plugin for the Elasticsearch river/feeder; both seem to run inside, or require, a locally running Elasticsearch instance. Please let me know if there is a better option for running an Elasticsearch indexer as a standalone job (probably Java-based). Any suggestions will be very helpful.
Thanks.
We use ES as a reporting DB, and when new records are written to SQL we take the following actions to get them into ES:
Write the primary key into a queue (we use RabbitMQ)
A consumer picks up the primary key (when it has time), queries the relational DB to get the info it needs, and then writes the data into ES
This process works great because it handles both new data and old data. For old data, just write a quick script to push the 300M primary keys into Rabbit and you're done!
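A rough sketch of such a standalone worker, using the RabbitMQ Java client and Elasticsearch's plain REST API via the JDK HTTP client. The queue name, index name, hosts, and the JSON mapping are assumptions; `loadRowAsJson` is a stub standing in for the real JDBC lookup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class IndexerWorker {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder RabbitMQ host
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();
        channel.queueDeclare("es-index-keys", true, false, false, null);

        HttpClient http = HttpClient.newHttpClient();
        DeliverCallback onMessage = (consumerTag, delivery) -> {
            String primaryKey = new String(delivery.getBody(), StandardCharsets.UTF_8);
            String doc = loadRowAsJson(primaryKey); // query the relational DB here
            // Index into ES over REST; "myindex" is a placeholder index name.
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9200/myindex/_doc/" + primaryKey))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(doc))
                    .build();
            try {
                http.send(req, HttpResponse.BodyHandlers.ofString());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        channel.basicConsume("es-index-keys", true, onMessage, consumerTag -> { });
    }

    private static String loadRowAsJson(String pk) {
        return "{\"id\": \"" + pk + "\"}"; // stub: replace with a JDBC lookup
    }
}
```

The same worker handles the one-time 300M backfill and the daily incremental updates; only the producer side differs.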
There are many integration options. I've listed a few below to give you some ideas; the right solution is really going to depend on your specific resources and requirements, though.
Oracle GoldenGate can look at the Oracle DB transaction logs and feed the changes in real time to ES.
An ETL tool, for example Oracle Data Integrator, could run on a schedule, pull data from your DB, transform it, and send it to ES.
Create triggers in the Oracle DB so that data updates can be written to ES using a stored procedure. Or use the trigger to write flags to a "changes" table that some external process (e.g. a Java application) monitors and uses to extract data from the Oracle DB.
Get the application that writes to the Oracle DB to also feed ES. Ideally your application and Oracle DB should be loosely coupled - do you have an integration platform that can feed the messages to both ES and Oracle?

Infinispan Operational modes

I have recently started taking a look into Infinispan as our caching layer. After reading through the operation modes in Infinispan as mentioned below.
Embedded mode: This is when you start Infinispan within the same JVM as your applications.
Client-server mode: This is when you start a remote Infinispan instance and connect to it using a variety of different protocols.
Firstly, I am now confused about which of the above two modes is best suited to my application.
I have a very simple use case: we have client-side code that calls our REST service using the service's main VIP; the call gets load balanced to an individual service server where our service is deployed, which then interacts with the Cassandra database to retrieve the data based on the user id. The picture below makes everything clear.
Suppose, for example, a client is looking for data for userId = 123. It will call our REST service using the main VIP and get load balanced to any of our four service servers - say it gets load balanced to Service1. Service1 will then call the Cassandra database to get the record for userId = 123 and return it to the client.
Now we are planning to cache the data using Infinispan, since compaction is killing our performance, so that our read performance gets a boost. So I started looking into Infinispan and stumbled upon the two modes mentioned above. I am not sure what the best way to use Infinispan is in our case.
Secondly, what I am expecting from the Infinispan cache, supposing I go with embedded mode, is something like this.
If yes, then how will the Infinispan caches interact with each other? It is possible that at some point we will be looking for data for userIds that live in another service instance's Infinispan cache, right? So what will happen in that scenario? Will Infinispan take care of those things as well? If yes, what configuration setup do I need to make sure this works fine?
Pardon my ignorance if I am missing anything. Any clear information on my two questions above would be appreciated.
With regards to your second image: yes, the architecture will look exactly like that.
If yes, then how Infinispan cache will interact with each other?
Please, take a look here: https://docs.jboss.org/author/display/ISPN/Getting+Started+Guide#GettingStartedGuide-UsingInfinispanasanembeddeddatagridinJavaSE
Infinispan will manage this using the JGroups protocol, sending messages between nodes. The cluster will be formed and the nodes will be clustered. After that you will see the expected behaviour of entries being replicated across the relevant nodes.
And here we go to your next question:
It might be possible that at some time, we will be looking for data for those userId's that will be on another Service Instance Infinispan cache? Right? So what will happen in that scenario? Will infinispan take care of those things as well?
Infinispan was developed for exactly this scenario, so you don't need to worry about it at all. If you have, for example, 4 nodes and set distribution mode with numOwners=2, your cached data will live on exactly 2 nodes at any moment. When you issue a GET command on a non-owner node, the entry will be fetched from an owner.
You can also set the clustering mode to replication, where all nodes contain all entries. Please read more about the modes here: https://docs.jboss.org/author/display/ISPN/Clustering+modes and choose what is best for your use case.
Additionally, when you add a new node to the cluster, a state transfer will take place and synchronize/rebalance entries across the cluster. Non-blocking state transfer is already implemented, so your cluster will still be capable of serving responses during the joining phase. See: https://community.jboss.org/wiki/Non-BlockingStateTransferV2
The same applies to removing/crashing nodes in your cluster. There is an automatic rebalancing process, so, for example, entries (numOwners=2) which after a crash live on only one node will be replicated so that they again live on 2 nodes, according to the numOwners property in distribution mode.
To sum it up, your cluster stays up to date, and it does not matter which node you ask for a particular entry. If that node does not contain it, the entry will be fetched from an owner.
if yes, then what configuration setup I need to have to make sure this thing is working fine.
The aforementioned getting started guide is full of examples, plus you can find some configuration file examples in the Infinispan distribution: ispn/etc/config-samples/*
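For instance, a minimal embedded setup in distribution mode with numOwners=2 via the programmatic API could look like this; the cache name "userCache" and the sample entry are hypothetical, and each of your four service instances would run the same code:

```java
import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class EmbeddedGrid {
    public static void main(String[] args) {
        // Clustered (JGroups-backed) cache manager; nodes started with the
        // same configuration discover each other and form a cluster.
        DefaultCacheManager manager = new DefaultCacheManager(
                GlobalConfigurationBuilder.defaultClusteredBuilder().build());

        // Distribution mode with two owners per entry, as described above.
        manager.defineConfiguration("userCache", new ConfigurationBuilder()
                .clustering().cacheMode(CacheMode.DIST_SYNC)
                .hash().numOwners(2)
                .build());

        Cache<String, String> cache = manager.getCache("userCache");
        // A GET from any node fetches from an owner node if needed.
        cache.put("userId:123", "recordFromCassandra");
    }
}
```

The equivalent settings can also be expressed in the XML files shipped in the config-samples directory.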
I would also suggest taking a look at this source: http://refcardz.dzone.com/refcardz/getting-started-infinispan where you can find even more basic and very quick configuration examples.
This source also provides decision-related information for your first question: "Should I use embedded mode or remote client-server mode?" From my point of view, using a remote cluster is the more enterprise-ready solution (see: http://howtojboss.com/2012/11/07/data-grid-why/). Your caching layer is then very easily scalable, highly available and fault tolerant, and is independent of your database layer and application layer because it simply sits between them.
And you could be interested about this feature as well: https://docs.jboss.org/author/display/ISPN/Cache+Loaders+and+Stores
I think the newest Infinispan release supports a special compatibility mode for users interested in accessing Infinispan in multiple ways.
Follow the link below to configure your cache environment to support either embedded or remote access:
Interoperability between Embedded and Remote Server Endpoints

any benefit from using Hazelcast instead of MongoDB to store user sessions/keys?

We are running a MongoDB instance to store data in collections; no problems with it, and Mongo is our main data store.
Today, we are going to develop OAuth2 support for the product and have to store user sessions (security key, access token, etc.). The access token has to be re-validated against the authentication server only after a defined timeout, so that not every request waits for validation by the authentication server.
The first request for a secured resource (create) shall always be authenticated against the authentication server. Any subsequent request will be validated internally (in the cache) against the internal timeout, and only if it has expired will another request to the authentication server be issued.
To satisfy these requirements, we have to introduce some kind of distributed cache to store the user sessions etc. (with TTL support) and expire them based on a TTL, as I wrote above.
Two options here:
store user sessions in Hazelcast and share them across all app servers - a nice choice, persisting all user sessions in an eviction-enabled map.
store user sessions in MongoDB - and do the same.
Do you see any benefits of using Hazelcast instead of storing this temporary data inside Mongo? Any significant performance improvements you're aware of?
I'm new to Hazelcast, so I'm not aware of all its killer features.
Disclaimer: I am the founder of Hazelcast...
Hazelcast is much simpler and simplicity matters a lot.
You can embed Hazelcast into your application (if your application is written in Java). No need to deploy and maintain a remote NoSQL cluster.
Hazelcast works directly with your application objects. No JSON or any other format; write and read Java objects.
You can execute Java code on your in-memory data. No need to fetch and process data; send your code over to the data.
You can listen for updates on your data: "Notify me when this map or key is updated".
Hazelcast has a rich set of data structures like queue, topic, semaphores, locks, multimap etc. Imagine sharing a queue across multiple nodes and being able to do blocking queue poll/take operations... this is really cool :)
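A minimal sketch of the embedded, TTL-based usage the question describes; the map name, key format, and 15-minute TTL are assumptions (in Hazelcast 3.x the IMap import is com.hazelcast.core.IMap):

```java
import java.util.concurrent.TimeUnit;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class TokenCache {
    public static void main(String[] args) {
        // Embeds a Hazelcast node in this JVM; instances on other app servers
        // with the same config discover each other and form a cluster.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> sessions = hz.getMap("oauthSessions");

        // Per-entry TTL: the token is evicted automatically after 15 minutes,
        // at which point the next request re-validates against the auth server.
        sessions.put("accessToken:abc123", "userId=42", 15, TimeUnit.MINUTES);

        String user = sessions.get("accessToken:abc123"); // null once the TTL expires
    }
}
```

The per-entry TTL overload of put is what removes the need for a separate expiry job, which is the main operational difference from rolling your own TTL handling on top of Mongo.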
Hazelcast is an in-memory data grid, so it should be significantly faster than MongoDB for this kind of usage. They also have pre-made session clustering code for Java servlets if you do not want to create that yourself.
Code for the session clustering here on github. Or here for Maven artifact.
