I have a problem in which I need to perform CRUD operations on cyclic graphs. Now I know that there are a bunch of graph databases out there, but I have a specific set of use cases which are not supported in those databases (or at least I'm not aware of them).
Following are my constructs:
Node: Can have multiple sources and targets
Directed edge: Connects two nodes
Node Group: Multiple nodes (connected with edges) forming a group (simply put, it's a smaller graph)
Directed graph: Comprises multiple nodes, node groups and edges. The graph can be cyclic.
Following are the functionalities I can have:
I can simply create a node by defining the incoming and outgoing edge definitions.
I can create a simple graph by adding nodes and connecting them with edges.
I can perform standard graph traversals.
I can group the nodes of a graph into a Node Group and then use multiple instances of that Node Group (just like a node) in another, bigger graph. This can create complex hierarchies.
I can create multiple graphs which in turn use any of the above constructs.
I can make changes to Node and Node Group definitions, which means there can be structural changes to the graph. If I make changes to a Node or Node Group definition, all the instances of this node in all the graphs should be updated too.
Now I understand that all of this can be done best with a relational database which will ensure that the relationships are intact and querying is simple. But the performance will take a hit when there are complex graphs and multiple of those graphs are to be updated.
So, I was wondering if there is a hybrid/better approach to storing, retrieving and updating these graphs that would be much faster compared to relational databases.
Any ideas would be really helpful. Thanks in advance!
I wouldn't rule out graph databases. You can easily build the missing features yourself, using extra properties/nodes/connections that serve your needs.
E.g. to create a group, you could create a node with a property such as type:Group and give it a groupId that it shares with all the nodes belonging to that group.
Another option would be for group members to have an extra connection towards their Group: Node-belongsToGroup->GroupNode.
With either of the above solutions, connecting a Node/Group to another Group only requires creating a connection to that Group's node.
The same goes for Definitions, e.g. Node-isOfType->DefinitionNode. Then updateDefinition would update all nodes that belong to that Definition.
Based on the above, I think it would be easy to create an API like the following:
createGroup
isGroup
addNodesToGroup
createDefinition
updateDefinition
setNodeDefinition
getNodeDefinition
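To make the idea a bit more concrete, here is a minimal in-memory sketch of such an API in Python. It is not tied to any particular graph database: the GraphStore class, the snake_case method names and the "def:filter" / "g1" ids are all made up for illustration, and only the belongsToGroup and isOfType connections come from the suggestions above.

```python
from collections import defaultdict


class GraphStore:
    """Tiny in-memory stand-in for a graph database (hypothetical API)."""

    def __init__(self):
        self.props = {}                # node id -> property dict
        self.edges = defaultdict(set)  # (source id, edge label) -> target ids

    def add_node(self, node_id, **props):
        self.props[node_id] = dict(props)

    def connect(self, source, label, target):
        self.edges[(source, label)].add(target)

    def sources(self, label, target):
        """All nodes that have a `label` edge pointing at `target`."""
        return sorted(s for (s, l), ts in self.edges.items()
                      if l == label and target in ts)

    # --- groups: Node-belongsToGroup->GroupNode ---
    def create_group(self, group_id):
        self.add_node(group_id, type="Group")

    def is_group(self, node_id):
        return self.props[node_id].get("type") == "Group"

    def add_nodes_to_group(self, group_id, node_ids):
        for node_id in node_ids:
            self.connect(node_id, "belongsToGroup", group_id)

    # --- definitions: Node-isOfType->DefinitionNode ---
    def create_definition(self, def_id, **fields):
        self.add_node(def_id, type="Definition", **fields)

    def set_node_definition(self, node_id, def_id):
        # A node has exactly one definition, so replace any previous edge.
        self.edges[(node_id, "isOfType")] = {def_id}

    def get_node_definition(self, node_id):
        targets = self.edges.get((node_id, "isOfType"))
        return next(iter(targets)) if targets else None

    def update_definition(self, def_id, **fields):
        """Change the definition once and report which instances are affected."""
        self.props[def_id].update(fields)
        return self.sources("isOfType", def_id)


store = GraphStore()
store.create_definition("def:filter", inputs=1, outputs=1)
for node_id in ("a", "b"):
    store.add_node(node_id)
    store.set_node_definition(node_id, "def:filter")
store.create_group("g1")
store.add_nodes_to_group("g1", ["a", "b"])
affected = store.update_definition("def:filter", outputs=2)
```

Because every instance only points at its definition, a definition change is a single write, and the list returned by update_definition tells you which instances may need structural follow-up (for example, re-checking their edges).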
As far as scalability is concerned, you could check OrientDB: Distributed-Architecture / comparison to Neo4j
...only one server can be the master, so the Neo4j write throughput is limited to the capacity of the single Master server. This means that Neo4j isn’t able to scale on writes.
OrientDB, instead, supports a Multi-Master + Sharded architecture: all the servers are masters. The throughput is not limited by a single server. With OrientDB, the global throughput is the sum of the throughput of all the servers.
api ref:
java api / sql ref
I have a clustered system set up with Hazelcast to store my data. Each node in the cluster is responsible for connecting to a service on localhost and piping data from this service into the Hazelcast cluster.
I would like this data to be stored primarily on the node that received it, and also processed on that node. I'd like the data to be readable and writable on other nodes with moderately less performance requirements.
I started with a naive implementation that does exactly as I described with no special considerations. I noticed performance suffered quite a bit (we had a separate implementation using Infinispan to compare it with). Generally speaking, there is little logical intersection between the data I'm processing from each individual service. It's stored in a Hazelcast cluster so it can be read and occasionally written from all nodes and for failover scenarios. I still need to read the last good state of the failed node if either the Hazelcast member fails on that node or the local service fails on that node.
So my first attempt at co-locating the data and reducing network chatter was to key much of the data with a serverId (number from 1 to 3 on, say, a 3-node system) and include this in the key. The key then implements PartitionAware. I didn't notice an improvement in performance so I decided to execute the logic itself on the cluster and key it the same way (with a PartitionAware/Runnable submitted to a DurableExecutorService). I figured if I couldn't select which member the logic could be processed on, I could at least execute it on the same member consistently and co-located with the data.
That made performance even worse, as all data and all execution tasks were being stored and run on a single node. I figured this meant node #1 was getting partitions 1 to 90, node #2 was getting 91 to 180, and node #3 was getting 181 to 271 (or some variant of this, without complete knowledge of the key hash algorithm and exactly how my int serverId translates to a partition number). So hashing serverId 1, 2 and 3 resulted in, e.g., the oldest member getting all the data and execution tasks.
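To illustrate why this happens, here is a rough Python sketch. It assumes the default partition count of 271 and uses CRC32 as a stand-in hash (Hazelcast hashes the serialized key with its own algorithm), so the actual partition numbers will differ, but the effect is the same: three distinct partition keys can never occupy more than three partitions, and nothing ties those partitions to three different members.

```python
import zlib

PARTITION_COUNT = 271  # Hazelcast's default partition count


def partition_id(partition_key: str) -> int:
    """Stand-in hash for illustration; Hazelcast hashes the serialized key itself."""
    return zlib.crc32(partition_key.encode()) % PARTITION_COUNT


server_ids = [1, 2, 3]
entry_ids = range(1000)

# Case 1: PartitionAware key whose partition key is just the serverId.
by_server = {partition_id(str(sid)) for sid in server_ids for _ in entry_ids}

# Case 2: each entry is partitioned by its full key.
by_full_key = {partition_id(f"{sid}:{eid}") for sid in server_ids for eid in entry_ids}

print(len(by_server), "partitions hold all 3000 entries when keyed by serverId")
print(len(by_full_key), "partitions are used when keyed by the full key")
```

Which member owns those few partitions depends on the partition table, not on the serverId value, so they can easily all end up on the same member.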
My next attempt was to set backup count to (member count) - 1 and enable backup reads. That improved things a little.
I then looked into ReplicatedMap, but it doesn't support indexing or predicates. One of my motivations for moving to Hazelcast was its more comprehensive support (and, from what I've seen, better performance) for indexing and querying map data.
I'm not convinced any of these are the right approaches (especially since mapping 3 node numbers to partition numbers doesn't match up to how partitions were intended to be used). Is there anything else I can look at that would provide this kind of layout, with one member being a preferred primary for data and still having readable backups on 1 or more other members after failure?
Thanks!
Data grids provide scalability: you can add or remove storage nodes to adjust capacity, and for this to work the grid needs to be able to rebalance the data load. Rebalancing means moving some of the data from one place to another. So as a general rule, the placement of data is out of your control and may change while the grid runs.
Partition awareness will keep related items together; if they move, they move together. A runnable/callable that accesses both can then be served from a single JVM, which is more efficient.
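A toy Python model of that behaviour, under loud assumptions: the round-robin assign function is invented for illustration (a real grid uses its own assignment strategy), CRC32 stands in for the real key hash, and the member and key names are made up. The point is only that entries sharing a partition key share a partition, so they share an owner both before and after a rebalance.

```python
import zlib

PARTITION_COUNT = 271


def partition_of(partition_key: str) -> int:
    # Stand-in hash, not the grid's real algorithm.
    return zlib.crc32(partition_key.encode()) % PARTITION_COUNT


def assign(members):
    """Toy partition table: partition id -> owning member (round-robin)."""
    return {p: members[p % len(members)] for p in range(PARTITION_COUNT)}


def owner(table, partition_key):
    return table[partition_of(partition_key)]


# (entry key, partition key): three entries that belong to the same customer.
entries = [("order-1", "customer-A"), ("order-2", "customer-A"), ("invoice-9", "customer-A")]

before = assign(["m1", "m2", "m3"])
after = assign(["m1", "m2", "m3", "m4"])  # a member joins, the table is rebalanced

owners_before = {owner(before, pk) for _, pk in entries}
owners_after = {owner(after, pk) for _, pk in entries}
```

Both sets contain exactly one member. Which member that is may change when the table is rebalanced, which is why you can rely on co-location but not on a specific placement.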
There are two possible improvements if you really need data local to a particular node: read-backup-data or near-cache. See this answer.
Both or either will help reads, but not writes.
I don't think I fully understand Apache Ignite cache persistence yet. I'm probably missing an overview.
What I would like to achieve is something like this: Three data nodes that store the cache data persistently and replicated, either on their own separate disks or in a single 3rd-party DB. As long as one of these nodes is available, all data shall be available to the cluster nodes. Configs for these three nodes must have the PersistenceConfiguration, I guess? What about the backups setting? Must this be set to 2? What is the correct setting so that all data is available as long as one of the three nodes is available?
Do all data nodes have to be available for write operations to the cache? Or is one enough and the other two will replicate once they connect?
Other worker nodes shall use the cache but not store it on disk. Configs for these nodes should not have persistence set, I guess?
Sorry for all these questions. You see I may need some background information for the data store.
Thanks for any help!
Ignite native persistence can solve your problem. You can enable it by adding PersistentStoreConfiguration to IgniteConfiguration. Here is documentation on how to use it: https://apacheignite.readme.io/docs/distributed-persistent-store#section-usage
Every node that has persistence enabled will write its primary and backup partitions to disk, so when restarted, it will have this data available locally. If other nodes connect to the cluster after that, they will see the data, and it will be replicated to new nodes if needed.
Judging by your needs, you should use a replicated cache. All data in the cache will be stored on all nodes at the same time. When a node with data persisted on disk starts, it will have all the data available, just like you need. A replicated cache is effectively equivalent to having all data backed up on every node, so you don't have to configure backups additionally. Here is documentation on cache modes: https://apacheignite.readme.io/docs/cache-modes
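As a sanity check of that equivalence, here is a small Python simulation. It is a toy model, not Ignite's affinity function: place_copies simply puts the primary and its backups on consecutive nodes, and the node names and partition count are made up. It shows that with three data nodes, keeping a copy of every partition on every node (backups=2, which is what a replicated cache gives you) survives the loss of any two nodes, while backups=1 does not.

```python
from itertools import combinations


def place_copies(nodes, partitions, backups):
    """Toy placement: primary plus `backups` copies on the next nodes in ring order."""
    copies = min(backups + 1, len(nodes))
    return {p: {nodes[(p + i) % len(nodes)] for i in range(copies)}
            for p in range(partitions)}


def survives(placement, failed):
    """True if every partition still has at least one live copy."""
    return all(owners - set(failed) for owners in placement.values())


def tolerates(nodes, backups, failures, partitions=16):
    """True if the data survives every possible combination of `failures` lost nodes."""
    placement = place_copies(nodes, partitions, backups)
    return all(survives(placement, failed) for failed in combinations(nodes, failures))


data_nodes = ["data-1", "data-2", "data-3"]
full_copies_ok = tolerates(data_nodes, backups=2, failures=2)  # copy on every node
one_backup_ok = tolerates(data_nodes, backups=1, failures=2)   # only two copies
```

Here full_copies_ok is True and one_backup_ok is False.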
To restrict cache data to be stored on particular nodes only, you can create three server nodes that will store the data and start the other nodes as clients. You can find the difference here: https://apacheignite.readme.io/docs/clients-vs-servers
If you need more than three server nodes, you can use a cache node filter. It is a predicate that specifies which nodes should store the data of a particular cache. Here is the JavaDoc for the CacheConfiguration.setNodeFilter method: https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/configuration/CacheConfiguration.html#setNodeFilter(org.apache.ignite.lang.IgnitePredicate)
Another option to enable persistence is to use CacheStore. It enables you to replicate your data in any external database, but it has lower performance and fewer features available, so I would recommend going with the native one. Here is documentation on 3rd party persistence: https://apacheignite.readme.io/v2.2/docs/3rd-party-store
Imagine a Cassandra cluster needs to be accessed by a client application. In the Java API we create a cluster instance and send the read or write request via a Session. If we use read/write consistency ONE, how does the API select the actual node (coordinator node) to forward the request to? Is it a random selection? Please help me figure this out.
Cassandra drivers use the "gossip" protocol (and a process called node discovery) to gain information about the cluster. If a node becomes unavailable, the client driver automatically tries other nodes and schedules reconnection times with the dead one(s). According to the DataStax docs:
Gossip is a peer-to-peer communication protocol in which nodes
periodically exchange state information about themselves and about
other nodes they know about. The gossip process runs every second and
exchanges state messages with up to three other nodes in the cluster.
The nodes exchange information about themselves and about the other
nodes that they have gossiped about, so all nodes quickly learn about
all other nodes in the cluster.
Essentially, the list of nodes that you provide your client to connect to are the initial contact points for gaining information on the entire cluster. This is why your client can communicate with all nodes in the cluster (if need be) even though you may only provide a small subset of nodes in your connection string.
Once your driver has the gossip information on the cluster, it can then make intelligent decisions about which node to run a query on. Node selection is not a process of voting or random selection. Based on the gossip information returned, the client driver applies its Load Balancing Policy. While it does take several factors into consideration, basically it tries to pick the node with the lowest network "distance" from the client.
Edit 20200322
Let me expand a bit on the point about the Load Balancing policy. I encourage developers of high-performance applications to use the TokenAwarePolicy. This policy hashes the partition key values to a "token," and uses this hash to determine which node(s) is responsible for the resulting token range. This has the effect of skipping the intermediate step of selecting a "coordinator" node, and sends the queries directly to the node which contains the requested data.
However, if you are using a non-token aware load balancing policy, or running a query which does not filter on a partition key, then the original process described above applies.
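For anyone who wants to see the token-aware idea spelled out, here is a simplified Python sketch. It makes several assumptions for brevity: each node owns a single token (real clusters usually use many virtual nodes), replicas are simply the next nodes clockwise on the ring, and an MD5-derived value stands in for the Murmur3 hash that Cassandra's default partitioner uses. The TokenRing class and the node names are hypothetical and are not part of any driver API.

```python
import bisect
import hashlib


def token(partition_key: str) -> int:
    """Stand-in hash mapped to a signed 64-bit token (not real Murmur3)."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class TokenRing:
    def __init__(self, node_tokens):
        # node_tokens: node name -> the token that closes the range it owns
        self.ring = sorted((t, n) for n, t in node_tokens.items())
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, partition_key, replication_factor=1):
        """Owner of the key's token, followed by the next nodes clockwise."""
        # The first node whose token is >= the key's token owns it; wrap at the end.
        start = bisect.bisect_left(self.tokens, token(partition_key)) % len(self.ring)
        count = min(replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = TokenRing({"node-a": -(2 ** 62), "node-b": 0, "node-c": 2 ** 62})
targets = ring.replicas("user:42", replication_factor=2)
```

A token-aware policy would send the query for "user:42" to one of the nodes in targets, so the node that receives the request already holds the data.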
So I'm getting started with Elasticsearch, and I created a few nodes on my machine using:
elasticsearch -Des.node.name=Node-2
Now, as far as I understand, a node is another machine/server in a cluster (correct me if I'm wrong).
1. In order to add nodes to a cluster, do these machines need to be on the same network? Can I have a node in the US and another node in the EU as part of the same structure? Or do they need to be in the same building, on the same network?
2. What is the idea with nodes? To split the data across multiple machines/nodes and also split the power to compute certain queries?
By default ElasticSearch looks for nodes running with the same cluster name on the same network. If you want to configure things differently, take a look at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html
The idea is to split data across multiple machines in case it doesn't fit on one machine AND to prevent data loss in case a node fails (by default all data is replicated 3 times) AND to split query computation power. (ElasticSearch automatically splits your query into queries for all separate nodes, and aggregates the results.)
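Roughly, the split-and-aggregate part works like the Python sketch below. This is a toy model rather than Elasticsearch code: the shard count, the CRC32 stand-in hash and the index/search helper names are assumptions for illustration. Each document is routed to one shard by hashing its id modulo the number of primary shards, and a search is sent to every shard and the partial results are merged.

```python
import zlib

NUM_SHARDS = 3  # hypothetical number of primary shards

# One dict per shard; in a real cluster the shards are spread over the nodes.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(doc_id: str) -> int:
    # Stand-in hash; routing is hash(id) modulo the number of primary shards.
    return zlib.crc32(doc_id.encode()) % NUM_SHARDS


def index(doc_id, doc):
    shards[shard_for(doc_id)][doc_id] = doc


def search(predicate):
    """Scatter the query to every shard, then gather and merge the matching ids."""
    partial = [[doc_id for doc_id, doc in shard.items() if predicate(doc)]
               for shard in shards]
    return sorted(doc_id for hits in partial for doc_id in hits)


index("1", {"user": "kim", "text": "hello"})
index("2", {"user": "lee", "text": "world"})
index("3", {"user": "kim", "text": "again"})
kim_docs = search(lambda doc: doc["user"] == "kim")
```

After this, kim_docs holds the ids "1" and "3", regardless of which shards the documents landed on.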
Hope this answers your questions :)
I need to extract a subgraph (a subset of nodes and edges) based on user-defined conditions such as attribute values and labels.
This is already feasible using either a query language such as Cypher or Gremlin, or simply coded using Java methods.
However, since I'm dealing with large graphs, I wish to keep the extracted subgraph for further querying, and even iterate the subextraction-querying process.
I've seen these discussions: Extract subgraph in neo4j, Extracting subgraph from neo4j database. However, I couldn't figure out the answer for my case.
I was thinking of some alternatives :
Building a new index each time I need to extract a subgraph
Use a cache to store the nodes/edges which might be useful for arithmetic computation such as average etc.
Create a new instance of embedded neo4j; however, this is really costly!
Another point: is getByID cheaper than an index lookup? I know this depends on the case: large graphs or small index ...
You could just create a new neo4j java embedded database to hold your results and query further? No need to boot up another server IMHO.
Also, getByID is generally cheaper than an index lookup, since you avoid the index roundtrip. Index lookups are great for more complex lookups like text matching etc.
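Independent of where the result is stored, the iterated extract-then-query loop itself looks something like this Python sketch. It uses plain dicts and lists instead of Neo4j, and the extract_subgraph helper plus the sample labels and attributes are made up for illustration. Each extraction returns a smaller graph of the same shape, so it can be filtered again.

```python
def extract_subgraph(nodes, edges, keep):
    """Induced subgraph: nodes whose properties pass `keep`, plus edges between them."""
    kept_nodes = {node_id: props for node_id, props in nodes.items() if keep(props)}
    kept_edges = [(src, dst) for src, dst in edges
                  if src in kept_nodes and dst in kept_nodes]
    return kept_nodes, kept_edges


nodes = {
    1: {"label": "Person", "age": 34},
    2: {"label": "Person", "age": 19},
    3: {"label": "City"},
    4: {"label": "Person", "age": 52},
}
edges = [(1, 2), (1, 3), (2, 4), (4, 1)]

# First extraction: keep only Person nodes.
people, people_edges = extract_subgraph(nodes, edges, lambda p: p["label"] == "Person")

# Second extraction runs on the previous result, not on the full graph.
adults, adult_edges = extract_subgraph(people, people_edges, lambda p: p["age"] >= 21)
```

The dict keyed by node id plays the role of getByID here: once a subgraph is held this way, lookups inside it do not need an index at all.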