nodes in Elasticsearch - java

So I'm getting started with Elasticsearch, and I created a few nodes on my machine using:
elasticsearch -Des.node.name=Node-2
Now, as far as I understand it, a node is another machine/server in a cluster (correct me if I'm wrong).
1. In order to add nodes to a cluster, do the machines need to be on the same network? Can I have a node in the US and another node in the EU as part of the same cluster, or do they need to be in the same building, on the same network?
2. What is the idea behind nodes? To split the data across multiple machines/nodes, and also to split the computing power for certain queries?

By default Elasticsearch looks for nodes running the same cluster name on the same network. If you want to configure things differently, take a look at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html
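For illustration, here's a minimal sketch with the Elasticsearch 1.x-era Java API (the cluster name "my-cluster" is an assumption): two JVMs running this on the same network would find each other and form one cluster.

import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class StartNode {
    public static void main(String[] args) {
        // Any node started with this cluster name on the same network is
        // discovered via zen discovery and joins the same cluster.
        Node node = nodeBuilder().clusterName("my-cluster").node();
        // ... use node.client() to index and search ...
    }
}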
The idea is to split data across multiple machines in case it doesn't fit on one machine, AND to prevent data loss in case a node fails (by default each shard has one replica on another node), AND to split query computation power. (Elasticsearch automatically splits your query into queries for all the separate nodes and aggregates the results.)
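As a sketch of how shards and replicas are set per index (again the 1.x Java API; the index name "sales" and the shard count are assumptions, though number_of_replicas defaults to 1 anyway):

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class CreateIndex {
    public static void main(String[] args) {
        // Join the cluster as a client-only node (holds no data itself).
        Node node = nodeBuilder().clusterName("my-cluster").client(true).node();
        Client client = node.client();
        // 5 primary shards spread across the data nodes, 1 replica of each
        // shard on another node for failover.
        client.admin().indices().prepareCreate("sales")
              .setSettings(ImmutableSettings.settingsBuilder()
                      .put("number_of_shards", 5)
                      .put("number_of_replicas", 1)
                      .build())
              .execute().actionGet();
        node.close();
    }
}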
Hope this answers your questions :)

Related

Hazelcast data affinity with preferred member as primary

I have a clustered system set up with Hazelcast to store my data. Each node in the cluster is responsible for connecting to a service on localhost and piping data from this service into the Hazelcast cluster.
I would like this data to be stored primarily on the node that received it, and also processed on that node. I'd like the data to be readable and writable on other nodes with moderately less performance requirements.
I started with a naive implementation that does exactly as I described, with no special considerations. I noticed performance suffered quite a bit (we had a separate implementation using Infinispan to compare it with). Generally speaking, there is little logical intersection between the data I'm processing from each individual service. It's stored in a Hazelcast cluster so it can be read, and occasionally written, from all nodes, and for failover scenarios. I still need to read the last good state of a failed node if either the Hazelcast member or the local service fails on that node.
So my first attempt at co-locating the data and reducing network chatter was to key much of the data with a serverId (a number from 1 to 3 on, say, a 3-node system) and include this in the key; the key then implements PartitionAware. I didn't notice an improvement in performance, so I decided to execute the logic itself on the cluster, keyed the same way (with a PartitionAware Runnable submitted to a DurableExecutorService). I figured that if I couldn't select which member the logic would be processed on, I could at least execute it on the same member consistently, co-located with the data.
That made performance even worse, as all data and all execution tasks were being stored and run on a single node. I figured this meant node #1 was getting partitions 1 to 90, node #2 was getting 91 to 180, and node #3 was getting 181 to 271 (or some variant of this; I don't have complete knowledge of the key hash algorithm or of exactly how my int serverId translates to a partition number). So hashing serverId 1, 2, 3 resulted in, e.g., the oldest member getting all the data and execution tasks.
My next attempt was to set backup count to (member count) - 1 and enable backup reads. That improved things a little.
I then looked into ReplicatedMap, but it doesn't support indexing or predicates. One of my motivations for moving to Hazelcast was its more comprehensive support (and, from what I've seen, better performance) for indexing and querying map data.
I'm not convinced any of these are the right approaches (especially since mapping 3 node numbers to partition numbers doesn't match how partitions are intended to be used). Is there anything else I can look at that would provide this kind of layout, with one member being the preferred primary for a piece of data while still having readable backups on one or more other members after a failure?
Thanks!
Data grids provide scalability: you can add or remove storage nodes to adjust capacity, and for this to work the grid needs to be able to rebalance the data load. Rebalancing means moving some of the data from one place to another. So, as a general rule, the placement of data is out of your control and may change while the grid runs.
Partition awareness will keep related items together; if they move, they move together. A runnable/callable accessing both can then be satisfied from one JVM, so it will be more efficient.
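For reference, a minimal sketch of such a partition-aware key (field names are illustrative, not from the question):

import java.io.Serializable;
import com.hazelcast.core.PartitionAware;

public class AffinityKey implements PartitionAware<String>, Serializable {
    private final long id;
    private final String groupKey; // e.g. the question's serverId

    public AffinityKey(long id, String groupKey) {
        this.id = id;
        this.groupKey = groupKey;
    }

    @Override
    public String getPartitionKey() {
        // Hazelcast derives the partition from this value rather than from the
        // whole key, so all keys sharing a groupKey land on the same member.
        return groupKey;
    }
    // equals() and hashCode() over both fields should also be implemented.
}

Submitting work with IExecutorService.executeOnKeyOwner(task, key) then runs it on the member owning that partition. Note that with only three distinct groupKey values, everything concentrates on at most three partitions, which is exactly the skew described in the question.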
There are two possible improvements if you really need data local to a particular node: read-backup-data or near-cache. See this answer.
Either or both will help reads, but not writes.
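A sketch of both options in programmatic config (the map name "data" and the backup count are assumptions; 2 backups corresponds to member count - 1 on a 3-node cluster):

import com.hazelcast.config.Config;
import com.hazelcast.config.NearCacheConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class LocalReadsConfig {
    public static void main(String[] args) {
        Config config = new Config();
        config.getMapConfig("data")
              .setBackupCount(2)            // replicate entries to 2 other members
              .setReadBackupData(true)      // serve gets from local backup copies
              .setNearCacheConfig(new NearCacheConfig()
                      .setInvalidateOnChange(true)); // evict stale near-cache entries
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}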

Representing and performing IOs on graphs and subgraphs

I have a problem in which I need to perform CRUD operations on cyclic graphs. Now, I know there are a bunch of graph databases out there, but I have a specific set of use cases that are not supported by those databases (or at least I'm not aware of such support).
Following are my constructs:
Node: Can have multiple sources and targets
Directed edge: Connects two nodes
Node Group: Multiple nodes (connected with edges) forming a group (simply put, it's a smaller graph)
Directed graph: Comprises multiple nodes, node groups and edges. The graph can be cyclic.
Following are the functionalities I can have:
I can simply create a node by defining the incoming and outgoing edge definitions.
I can create a simple graph by adding nodes and connecting them with edges.
I can perform standard graph traversals.
I can group the nodes of a graph and call it a Node Group, and then use multiple instances of this Node Group (just like a node) in another, bigger graph. This can create complex hierarchies.
I can create multiple graphs which in turn use any of the above constructs.
I can make changes to Node and Node Group definitions, which means there can be structural changes to the graph. If I make changes to a Node or Node Group definition, all the instances of this node in all the graphs should be updated too.
Now, I understand that all of this can be done best with a relational database, which will ensure that the relationships stay intact and that querying is simple. But performance will take a hit when the graphs are complex and multiple such graphs have to be updated.
So, I was wondering if there is a hybrid/better approach to storing, retrieving and updating these graphs that would be much faster compared to relational databases.
Any ideas would be really helpful. Thanks in advance!
I wouldn't rule out graph databases. You can easily build the missing features yourself, using extra properties/nodes/connections that serve your needs.
E.g. for creating a group, you could create a node with a property like type:Group that shares the same groupId with all the nodes belonging to that group.
Another option would be for group members to have an extra connection towards their Group: Node-belongsToGroup->GroupNode.
With either of the above solutions, connecting a Node/Group to another Group would just require creating a connection to that Group node.
The same goes for Definitions, e.g. Node-isOfType->DefinitionNode. Then updateDefinition would update all nodes that belong to that Definition.
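To make this concrete, here's a rough sketch of the modeling with the embedded Neo4j 3.x Java API, purely as an illustration (the labels and relationship names like belongsToGroup are the ones suggested above; any graph database would do):

import java.io.File;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class GraphModelSketch {
    public static void main(String[] args) {
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase(new File("graph.db"));
        try (Transaction tx = db.beginTx()) {
            Node group = db.createNode(Label.label("Group"));
            group.setProperty("groupId", "g1");

            // A member node points at its group and at its definition:
            Node member = db.createNode(Label.label("Node"));
            member.createRelationshipTo(group, RelationshipType.withName("belongsToGroup"));

            Node definition = db.createNode(Label.label("Definition"));
            member.createRelationshipTo(definition, RelationshipType.withName("isOfType"));
            tx.success();
        }
        db.shutdown();
    }
}

updateDefinition then becomes "find all nodes with an isOfType relationship to this Definition node and update them": one traversal instead of a cross-graph search.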
Based on the above, I think it would be easy to create an API like the following:
createGroup
isGroup
addNodesToGroup
createDefinition
updateDefinition
setNodeDefinition
getNodeDefinition
As far as scalability is concerned, you could check OrientDB: Distributed-Architecture / comparison to Neo4j:
...only one server can be the master, so the Neo4j write throughput is limited to the capacity of the single Master server. This means that Neo4j isn’t able to scale on writes.
OrientDB, instead, supports a Multi-Master + Sharded architecture: all the servers are masters. The throughput is not limited by a single server. With OrientDB, the global throughput is the sum of the throughput of all the servers.
API references: Java API / SQL reference

NiFi: limit the number of concurrent tasks of a NiFi processor in a NiFi cluster

The question says it all. How can I do one of the following things:
How can I limit the number of concurrent tasks running for one processor cluster-wide?
Is there any unique, short ID for the node I'm running on? I could append this ID to the name of the database table to load into (see details below) and so have an exclusive table per connection.
I have a NiFi cluster and a self-written, specialized processor that loads huge amounts of data into a database via JDBC (up to 20 million rows per second). It uses some database-vendor-specific tuning tricks to be really fast in my particular case. One of these tricks requires an exclusive, empty table to load into for each connection.
At the moment, my processor opens one connection per node in the NiFi cluster (it takes a connection from the DBCPConnectionPool). With about 90-100 nodes in the cluster, I'd get 90-100 connections, all of them bulk-loading data at the same time.
I'm using NiFi 1.3.0.0.
Any help or comment is highly appreciated. Sorry for not showing any code; it's about 700 lines and wouldn't really help with the question. But I plan to put it in Git as part of the open-source project Kylo.
A common way of breaking up work in NiFi is to split a flow file into multiple flow files on the primary node; the other nodes then each pull a flow file and process it.
In your case, each flow file would contain a range of values to pull from the table. Say you had a hundred rows and wanted only 3 nodes to pull data. You'd create 3 flow files, each with its own attribute values:
start-row-id=1, end-row-id=33
start-row-id=34, end-row-id=66
start-row-id=67, end-row-id=100
Then a node would pick up a flow file from a remote process group or a queue (such as JMS or SQS). Since there are only 3 flow files, no more than 3 nodes would be loading data at the same time.
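As a sketch, a custom processor along these lines (all names are illustrative), scheduled to run on the primary node only, could emit the three range flow files:

import java.util.Collections;
import java.util.Set;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;

public class EmitRowRanges extends AbstractProcessor {

    static final Relationship REL_SUCCESS =
            new Relationship.Builder().name("success").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) {
        // Illustrative ranges; in practice they'd be computed from the table size.
        int[][] ranges = { {1, 33}, {34, 66}, {67, 100} };
        for (int[] range : ranges) {
            FlowFile ff = session.create();
            ff = session.putAttribute(ff, "start-row-id", String.valueOf(range[0]));
            ff = session.putAttribute(ff, "end-row-id", String.valueOf(range[1]));
            session.transfer(ff, REL_SUCCESS);
        }
    }
}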

Enforce a partition to be stored on a specific executor

I have a 5-partition RDD and 5 workers/executors.
How can I ask Spark to save each of the RDD's partitions on a different worker (IP)?
Am I right in saying that Spark can save several partitions on one worker and zero partitions on other workers?
That is, I can specify the number of partitions, but Spark can still cache everything on a single node.
Replication is not an option since RDD is huge.
Workarounds I have found
getPreferredLocations
RDD's getPreferredLocations method does not provide a 100% guarantee that a partition will be stored on the specified node. Spark will try during spark.locality.wait, but afterward it will cache the partition on a different node.
As a workaround, you can set a very high value for spark.locality.wait and override getPreferredLocations. The bad news: you cannot do that from Java; you need to write Scala code (or at least Scala internals wrapped with Java code). I.e.:
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class NodeAffinityRDD[U: ClassTag](prev: RDD[U]) extends RDD[U](prev) {
  val nodeIPs = Array("192.168.2.140", "192.168.2.157", "192.168.2.77")
  // RDD's abstract members, delegating data and partitioning to the parent:
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    firstParent[U].iterator(split, context)
  override protected def getPartitions: Array[Partition] = firstParent[U].partitions
  // Pin each partition to a worker, round-robin over the known IPs:
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(nodeIPs(split.index % nodeIPs.length))
}
SparkContext's makeRDD
SparkContext has a makeRDD method, which lacks documentation. As I understand it, I can specify preferred locations and then set a high value for spark.locality.wait. The bad news: the preferred locations are discarded on the first shuffle/join/cogroup operation.
Both approaches share the drawback that a too-high spark.locality.wait can cause your cluster to starve if some of the nodes become unavailable.
P.S. More context
I have up to 10,000 sales-XXX.parquet files, each representing the sales of different goods in different regions. Each sales-XXX.parquet can vary from a few KB to a few GB, and together they can take up tens or hundreds of GB in HDFS.
I need full-text search across all sales. I have to index each sales-XXX.parquet one by one with Lucene, and now I have two options:
Keep the Lucene indexes in Spark. There is already a solution for this, but it looks pretty suspicious. Are there better solutions?
Keep the Lucene indexes on the local file system. Then I can map-reduce over the results of each worker's index lookup. But this approach requires that each worker node keep an equal amount of data. How can I ensure that Spark keeps an equal amount of data on each worker node?

MongoDB related scaling issue

Just FYI, this question is not exactly about MongoDB, but it happens to use MongoDB. I am assuming we might end up using MongoDB features such as sharding in a good design, hence the mention of MongoDB. Also, FWIW, we use Java.
So we have around 100 million records in a certain collection, of which we need to select all the items that have some date field set to tomorrow. Usually this query returns 10 million records.
You can assume that we have N (say, ten) machines at hand, and that MongoDB is sharded on record_id.
Each record that we will process is independent of the other records we are reading. No records will be written as part of this batch job.
What I am looking for is:
1. No centralized workload distribution across the machines.
2. Fair, or almost fair, workload distribution.
(Not sure if the following requirement can be fulfilled without compromising requirement 1.)
3. Fault tolerance (if one of the batch machines goes down, we want another machine to take over its load).
Any good solutions that have already worked in a similar situation?
I can speak in the context of MongoDB.
Requirements 1 and 2 are addressed through sharding. I'm not sure I follow your question, though, as it sounds like requirement 1 says you don't want to centralize the workload while requirement 2 says you want to distribute it evenly.
In any case, with the proper shard key, you will distribute your workload across your shards. http://docs.mongodb.org/manual/sharding/
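For example, here's a hedged sketch of decentralized work splitting with the Java driver (the host, database, collection and field names are all assumptions): each of the N batch machines runs the same job with its own index and selects a disjoint slice via a $mod filter on the shard key, so no coordinator is needed.

import java.util.Arrays;
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class BatchWorker {
    public static void main(String[] args) {
        int workerIndex = Integer.parseInt(args[0]); // 0 .. N-1
        int workerCount = Integer.parseInt(args[1]); // N

        MongoClient client = new MongoClient("mongos-host", 27017);
        MongoCollection<Document> items =
                client.getDatabase("mydb").getCollection("items");
        // { record_id: { $mod: [N, i] } } deterministically partitions the
        // matching records across the N workers; the "set to tomorrow" date
        // criteria would be appended to the same query document.
        Document query = new Document("record_id",
                new Document("$mod", Arrays.asList(workerCount, workerIndex)));
        for (Document doc : items.find(query)) {
            // process(doc): records are independent, nothing is written back
        }
        client.close();
    }
}

If a machine dies, its slice can be re-run elsewhere by re-submitting the same (index, count) pair, which covers requirement 3 at the job level while replica sets cover it at the storage level.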
Requirement 3 is performed via replica sets in MongoDB. http://docs.mongodb.org/manual/replication/
I would have to understand your application and use case better to know for certain, but pulling 10M records out of 100M as your typical access pattern doesn't sound like the right document model is in place. Keep in mind that collection <> table and document <> record. I would look into storing those records at a higher logical granularity so you pull fewer documents; this can significantly improve performance.
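For instance, a sketch of coarser-grained documents (field names and values are assumptions): grouping related records into one document means tomorrow's batch reads far fewer documents for the same data.

import java.util.Arrays;
import org.bson.Document;

public class BucketExample {
    // One document per logical group of records instead of one per record:
    static Document bucket() {
        return new Document("group_id", "region-42")
                .append("due_date", "2014-06-02") // query this once per bucket
                .append("records", Arrays.asList(
                        new Document("record_id", 1).append("payload", "..."),
                        new Document("record_id", 2).append("payload", "...")));
    }
}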
