Elastic search java specific configuration

Elastic search java specific configuration - java

I try to access a local instance of elasticserach through java api.
According to elastic search doc, I can use the "cluster.name" property to specify the name of the cluster to use. Perfect.
Sadly can't I specify the node name to use? I can see that this one is also configurable in configuration.
Maybe it would be a bad practice??
Also, I can seehere that I can define a custom service ID which I did, but how to specify it to my java Transport Client?
Thank you so much for your help.

The whole point of Elasticsearch is to create a cluster of highly available data. Not all nodes contain all the data and not all nodes might be up all the time. If you want to connect to a single node by specifying its name and for some reason that node is down (it was killed, it is being upgraded, it is being re-provisioned, data was wiped to be reindexed, etc), then your client wouldn't be able to run queries and get results.
Instead, if you connect to the cluster, ES will make sure to route your queries to the cluster nodes that are up, whatever the state of certain nodes that might be down. So the best practice is to always connect via the cluster.name to get the best out of your ES cluster.
As for the `SERVICE_ID, it's not something you specify in your code, it's simply the name you want to give to the Elasticsearch service when running on Windows.

Related

How to initialize and fill Apache Ignite database by first node?

I would like to use Apache Ignite as failover read-only storage so my application will be able to access the most sensitive data if main storage (Oracle) is down.
So I need to
Start nodes
Create schema (execute DDL queries)
Load data from Oracle to Ignite
Seems like it's not the same as database caching and I don't need to use Cache. However, this page says that I need to implement a store to load a large amount of data from 3rd parties.
So, my questions are:
How to effectively transfer data from Oracle to Ignite? Data Streamers?
Who should init this transfer? First started node? How to do that? (tutorials explain how to achieve that via clients, should I follow this advice?)

Actually, I think, use of a cache store without read/write-through would be a suitable option here. You can configure a CacheJdbcPojoStore, for example, and call IgniteCache#loadCache(...) on your cache, once the cluster is up. More on this topic: https://apacheignite.readme.io/docs/3rd-party-store
If you don't want to use a cache store, then IgniteDataStreamer could be a good choice. This is the fastest way to upload big amount of data to the cluster. Data loading is usually performed from a client node, when all server nodes are up and running.

Architecture on AWS : Running a distributed algorithm on dynamic nodes

As shown in the digram,the pet-project that I am working on has two following components.
a) The "RestAPI layer" (set of micro-services)
b) "Scalable Parallelized Algorithm" component.
I am planing on running this on AWS.I realized that I can use ElasticBeanTalk to deploy my RestAPI module.(Spring Boot JAR with embedded tomcat)
I am thinking how to architect the "Scalable Parallelized Algorithm" component.Here are some design details about this:
This consist of couple of Nodes which share the same data stored on
S3.
Each node perform the "algorithm" on a chunk of S3 data.One node works as master node and rest of the nodes send the partial result to
this node.(embarrassingly parallel,master-slave paradigm).Master node
get invoked by the RestAPI layer.
A "Node" is a Spring Boot application which communicates with other nodes through HTTP.
Number of "Nodes" is dynamic ,which means I should be able to manually add a new Node depend on the increasing data size of S3.
There is a "Node Registry" on Redis which contains IPs of all the nodes.Each node register itself , and use the list of IPs in the
registry to communicate with each other.
My questions:
1) Shall I use EC2 to deploy "Nodes" or can I use ElasticBeanStalk to deploy these nodes as well.I know with EC2 I can manage the number of nodes depend on the size of S3 data, but is it possible to do this with ElasticBeanStalk?
2) Can I use
Inet4Address.getLocalHost().getHostAddress()
to get the IP of the each Node ? Do EC2 instances have more than one IP ? This IP should be allow the RestAPI Layer to communicate with the "master" Node.
3) Whats the component I should use expose my RestAPI layer to the external world ? But I dont want to expose my "Nodes".
Update :
I cant use MapReduce since the nodes have state. ie, During initialization , each Node read its chunk of data from S3 and create the "vector space" in memory.This a time consuming process , so thats why this should be stored in memory.Also this system need near-real-time response , cannot use a "batch" system like MR.

1) I would look into CloudFormation to help you automate and orchestrate the Scalable Parallelized Algorithm. Read this FAQ
https://aws.amazon.com/cloudformation/faqs/
2) With regards to question #2, EC2 instances can have a private and public ip, depending on how you configure them. You can query the AWS EC2 Metadata service from the instance to obtain the information like this:
curl http://169.254.169.254/latest/meta-data/public-ipv4
or
curl http://169.254.169.254/latest/meta-data/local-ipv4
Full reference to EC2 instance metadata:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html
3) Check out the API Gateway service, it might be what you are looking for:
https://aws.amazon.com/api-gateway/faqs/

Some general principles
Use infrastructure automation: CloudFormation or Troposphere over CloudFormation. This would make your system clean and easy to maintain.
Use Tagging: this keeps your AWS account nice and tidy. Also you can do funky scripts like describe all instances based on Tags, which can be a one-liner CLI/SDK call returning all the IPs of your "slave" instances.
Use more Tags, it can be really powerful.
ElasticBeanstalk VS "manual" setup
ElasticBeanstalk sounds like a good choice to me, but it's important to see, it's using the same components which I would recommend:
Create an AMI which contains your Slave Instance ready to go, or
Create an AMI and use UserData to configure your Slave, or
Create an AMI and/or use an orchestration tool like Chef or Puppet to configure your slave instance.
Use this AMI in an Autoscaling Launch Config
Create an AutoScalingGroup which can be on a fix number of instances or can scale based on a metric.
Pro setup: if you can somehow count the jobs waiting for execution, that can be a metric for scaling up or down automatically
Pro+ tip: use the Master node to create the jobs, put the jobs into an SQS queue. The length of the queue is a good metric for scaling. Failed jobs are back in the queue and will be re-executed. ( The SQS message contains only a reference, not the full data of the job.)
Using a queue would decouple your environment which is highly recommended
To be clear, ElasticBeanstalk does something similar. Actually if you create a multi node Beanstalk stack, it will run a CloudFromation template, create an ELB, an ASG, a LCFG, and Instances. You just have a bit less control but also less management overhead.
If you go with Beanstalk, you need Worker Environment which also creates the SQS queue for you. If you go for a Worker Environment, you can find tutorials, working examples, which makes your start easier.
Further to read:
Background Task Handling for AWS Elastic Beanstalk
Architectural Overview
2) You can use CLI, it has some filtering capabilities, or you can use other commands like jq for filtering/formatting the output.
Here is a similar example.
Note: Use tags and then you can easily filter the instances. Or you can query based on the ELB/ASG.
3) Exposing your API via the API Gateway sounds a good solution. I assume you want to expose only the Master node(s) since thats what managing the tasks.

Configuring storm cluster for production cluster

We have configured storm cluster with one nimbus server and three supervisors. Published three topologies which does different calculations as follows
Topology1 : Reads raw data from MongoDB, do some calculations and store back the result
Topology2 : Reads the result of topology1 and do some calculations and publish results to a queue
Topology3 : Consumes output of topology2 from the queue, calls a REST Service, get reply from REST service, update result in MongoDB collection, finally send an email.
As new bee to storm, looking for an expert advice on the following questions
Is there a way to externalize all configurations, for example a config.json, that can be referred by all topologies?
Currently configuration to connect MongoDB, MySql, Mq, REST urls are hard-coded in java file. It is not good practice to customize source files for each customer.
Wanted to log at each stage [Spouts and Bolts], Where to post/store log4j.xml that can be used by cluster?
Is it right to execute blocking call like REST call from a bolt?
Any help would be much appreciated.

Since each topology is just a Java program, simply pass the configuration into the Java Jar, or pass a path to a file. The topology can read the file at startup, and pass any configuration to components as it instantiates them.
Storm uses slf4j out of the box, and it should be easy to use within your topology as such. If you use the default configuration, you should be able to see logs either through the UI, or dumped to disk. If you can't find them, there are a number of guides to help, e.g. http://www.saurabhsaxena.net/how-to-find-storm-worker-log-directory/.
With storm, you have the flexibility to push concurrency out to the component level, and get multiple executors by instantiating multiple bolts. This is likely the simplest approach, and I'd advise you start there, and later introduce the complexity of an executor inside of your topology for asynchronously making HTTP calls.
See http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html for the canonical overview of parallelism in storm. Start simple, and then tune as necessary, as with anything.

AWS Elasticache SDK doesn't work for Redis ElastiCache

I want to dynamically configure my API servers depending on the name of the "cluster".
So I'm using AmazonElastiCacheClient to discover the clusters name and need to extract the endpoint of the one that has a specific name.
The problem is that I can find it but there doesn't seem to be a way to get an endpoint.
foundCluster.getCacheNodes() returns an empty list, even if there is 1 Redis instance appearing in the AWS console, in-sync and running.
foundCluster.getConfigurationEndpoint() returns null.
Any idea?

Try adding
DescribeCacheClustersRequest.setShowCacheNodeInfo(true);

I am making a guess:
AWS Elastic Cache with redis currenlty supports only single node clusers (so no auto discovery etc). I am not sure this is due this. Memcached based cluster is different.
"At this time, ElastiCache supports single-node Redis cache clusters." http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/CacheNode.Redis.html

Infinispan Operational modes

I have recently started taking a look into Infinispan as our caching layer. After reading through the operation modes in Infinispan as mentioned below.
Embedded mode: This is when you start Infinispan within the same JVM as your applications.
Client-server mode: This is when you start a remote Infinispan instance and connect to it using a variety of different protocols.
Firstly, I am confuse now which will be best suited to my application from the above two modes.
I have a very simple use case, we have a client side code that will make a call to our REST Service using the main VIP of the service and then it will get load balanced to individual Service Server where we have deployed our service and then it will interact with the Cassandra database to retrieve the data basis on the user id. Below picture will make everything clear.
Suppose for example, if client is looking for some data for userId = 123 then it will call our REST Service using the main VIP and then it will get load balanced to any of our four service server, suppose it gets load balanced to Service1, and then service1 will call Cassandra database to get the record for userId = 123 and then return back to Client.
Now we are planning to cache the data using Infinispan as compaction is killing our performance so that our read performance can get some boost. So I started taking a look into Infinispan and stumble upon two modes as I mentioned below. I am not sure what will be the best way to use Infinispan in our case.
Secondly, As from the Infinispan cache what I will be expecting is suppose if I am going with Embedded Mode, then it should look like something like this.
If yes, then how Infinispan cache will interact with each other? It might be possible that at some time, we will be looking for data for those userId's that will be on another Service Instance Infinispan cache? Right? So what will happen in that scenario? Will infinispan take care of those things as well? if yes, then what configuration setup I need to have to make sure this thing is working fine.
Pardon my ignorance if I am missing anything. Any clear information will make things more clear to me to my above two questions.

With regards to your second image, yes, architecture will exactly look like this.
If yes, then how Infinispan cache will interact with each other?
Please, take a look here: https://docs.jboss.org/author/display/ISPN/Getting+Started+Guide#GettingStartedGuide-UsingInfinispanasanembeddeddatagridinJavaSE
Infinispan will manage it using JGroups protocol and sending messages between nodes. The cluster will be formed and nodes will be clustered. After that you can experience expected behaviour of entries replication across particular nodes.
And here we go to your next question:
It might be possible that at some time, we will be looking for data for those userId's that will be on another Service Instance Infinispan cache? Right? So what will happen in that scenario? Will infinispan take care of those things as well?
Infinispan was developed for this scenario so you don't need to worry about it at all. If you have for example 4 nodes and setting distribution mode with numberOfOwners=2, your cached data will live on exactly 2 nodes in every moment. When you issue GET command on NON owner node, entry will be fetched from the owner.
You can also set clustering mode to replication, where all nodes contain all entries. Please, read more about modes here: https://docs.jboss.org/author/display/ISPN/Clustering+modes and choose what is the best for your use case.
Additionally, when you add new node to the cluster there will StateTransfer take place and synchronize/rebalance entries across cluster. NonBlockingStateTransfer is implemented already so your cluster will be still capable of serving responses during that joining phase. See: https://community.jboss.org/wiki/Non-BlockingStateTransferV2
Similarly for removing/crashing nodes in your cluster. There will be automatic rebalancing process so for example some entries (numOwners=2) which after crash live only at one node will be replicated respectively to live on 2 nodes according to numberOfOwners property in distribution mode.
To sum it up, your cluster will be still up to date and this does not matter which node you are asking for particular entry. If it does not contain it, entry will be fetched from the owner.
if yes, then what configuration setup I need to have to make sure this thing is working fine.
Aforementioned getting started guide is full of examples plus you can find some configuration file examples in the Infinispan distribution: ispn/etc/config-samples/*
I would suggest you to take a look at this source too: http://refcardz.dzone.com/refcardz/getting-started-infinispan where you can find even more basic and very quick configuration examples.
This source also provides decision related information for your first question: "Should I use embedded mode or remote client-server mode?" From my point of view, using remote cluster is more enterprise-ready solution (see: http://howtojboss.com/2012/11/07/data-grid-why/). Your caching layer is very easily scalable, high-available and fault tolerant then and is independent of your database layer and application layer because it simply sits between them.
And you could be interested about this feature as well: https://docs.jboss.org/author/display/ISPN/Cache+Loaders+and+Stores

I think in newest Infinispan release supports to work in a special, compatibility mode for those users interested in accessing Infinispan in multiple ways .
follow below link to configure your cache environment to support either embedded or remotely.
Interoperability between Embedded and Remote Server Endpoints

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.