In-memory weighted graph for linked data (RDF) - Java

It is a subset of the DBpedia dataset, which needs to be loaded into memory.
Operations will mainly involve traversing the graph, based on edge weights, in both directions.
There are three types of queries:
Finding a path between two nodes.
Given a subject and a predicate, find the object.
Given an object and a predicate, find the subject.
It is a sparse graph.
Performance with respect to time is important.
Implementation Language: Java
Use of weights: filtering. Edges with a weight higher than some threshold will be selected.
As it will be a sparse graph, using an adjacency matrix would be highly space-inefficient.
The initial idea was to create a Java object for every subject and let each subject store a 2D array with three columns: predicate, object, and weight.
But with this kind of data structure it will be hard to generate a list of subjects given an object, even though it is good at generating the objects given a subject.
Is there a better (more time-efficient) way to achieve this, one that would make traversals in both directions easier?
The graph in question is Categories (SKOS).
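One common way to get fast lookups in both directions is to keep two mirrored adjacency indexes, one keyed by subject and one by object. A minimal sketch, assuming string node identifiers and a double weight per edge (all class and method names here are illustrative, not from any particular library):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One weighted RDF edge: subject --predicate--> object, with a weight.
class Triple {
    final String subject, predicate, object;
    final double weight;
    Triple(String s, String p, String o, double w) {
        subject = s; predicate = p; object = o; weight = w;
    }
}

// Two mirrored indexes so lookups are cheap in both directions.
class TripleStore {
    // subject -> predicate -> outgoing edges
    private final Map<String, Map<String, List<Triple>>> bySubject = new HashMap<>();
    // object -> predicate -> incoming edges
    private final Map<String, Map<String, List<Triple>>> byObject = new HashMap<>();

    void add(String s, String p, String o, double w) {
        Triple t = new Triple(s, p, o, w);
        bySubject.computeIfAbsent(s, k -> new HashMap<>())
                 .computeIfAbsent(p, k -> new ArrayList<>()).add(t);
        byObject.computeIfAbsent(o, k -> new HashMap<>())
                .computeIfAbsent(p, k -> new ArrayList<>()).add(t);
    }

    // Given subject and predicate, return objects whose edge weight exceeds the threshold.
    List<String> objects(String s, String p, double threshold) {
        List<String> out = new ArrayList<>();
        for (Triple t : bySubject.getOrDefault(s, Map.of()).getOrDefault(p, List.of()))
            if (t.weight > threshold) out.add(t.object);
        return out;
    }

    // Given object and predicate, return subjects whose edge weight exceeds the threshold.
    List<String> subjects(String o, String p, double threshold) {
        List<String> out = new ArrayList<>();
        for (Triple t : byObject.getOrDefault(o, Map.of()).getOrDefault(p, List.of()))
            if (t.weight > threshold) out.add(t.subject);
        return out;
    }
}
```

The mirrored index roughly doubles the memory spent on edge references, but both query directions become a pair of map lookups, and a path query between two nodes can run a bidirectional BFS over the same maps.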

How to Divide a weighted cyclic graph into n graphs, breaking as few connections as possible

Background for the question:
At the start of every new year at my daughter's school, the principal always talks about how difficult it is to divide the kids into classes, because they have many requests as to who they want to be in class with, and the kindergarten also has some recommendations.
In my mind that is just a weighted, cyclic graph with kids as nodes and requests/recommendations as edges that needs to be split.
The question:
Imagine if you will, a graph with cycles and weighted edges, possibly disconnected.
I would like to divide that graph into n smaller graphs, with at least s and at most t nodes in each, while breaking as few edges as possible.
I assume it is NP-hard to solve exactly, so it might really be an optimization problem.
Does this graph problem have a name?
Are there any Java libraries that can help me solve this?
Thanks,
Jesper
A component of a graph is also a graph. Last year I had to write code for image stitching which made use of a directed-graph data structure with 64,000 pixels, and it was not slow at all.
The resulting image is the last one, and the code is on GitHub.
Well, in your case you can create a graph data structure: create a one-dimensional array of LinkedLists, and in each list store small objects that hold three values: the first student, the student that the first one is connected to, and a number expressing the strength of the bond between them. For example, if student 1 is connected to student 5 with strength 100, you create a new node Connected c = new Connected(1, 5, 100); and add it to the corresponding entry of the array: array[i].add(c), where i == 1.
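A minimal sketch of that structure, with illustrative names (the generic array creation is the usual Java wart when mixing arrays and generics):

```java
import java.util.LinkedList;

// One weighted edge: student `first` is connected to student `second`
// with the given bond strength.
class Connected {
    private final int first, second, strength;
    Connected(int first, int second, int strength) {
        this.first = first; this.second = second; this.strength = strength;
    }
    int getFirst()    { return first; }
    int getSecond()   { return second; }
    int getStrength() { return strength; }
}

class RequestGraph {
    public static void main(String[] args) {
        int numberOfStudents = 30; // illustrative size
        // Adjacency list: array[i] holds all edges that start at student i.
        @SuppressWarnings("unchecked")
        LinkedList<Connected>[] array = new LinkedList[numberOfStudents];
        for (int i = 0; i < array.length; i++) array[i] = new LinkedList<>();

        // Student 1 is connected to student 5 with strength 100.
        array[1].add(new Connected(1, 5, 100));
    }
}
```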
Search: which students is a given student connected to? When you want to find out which students a student is connected to, you simply go to that array index:
```java
for (Connected c : array[i]) {
    int firstStudent  = c.getFirst();
    int secondStudent = c.getSecond();
    int strength      = c.getStrength();
}
```
Your linked list should have an iterator so that you can traverse it.
This problem sounds like a minimum-cut problem, but for a directed graph.
In graph theory, a minimum cut of a graph is a cut (a partition of the vertices of a graph into two disjoint subsets that are joined by at least one edge) that is minimal in some sense. (Wikipedia)
One non-optimized solution to the problem you described is to find all connected components and find the minimum cut of each (which splits it in two), repeating until all resulting components satisfy your criteria.
Now the issue with this method is that the algorithms I know for minimum cut work on undirected graphs. In my search on the topic I found an article on this problem which might be of use.
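For the component-finding half of that loop, a plain breadth-first search is enough; a minimal sketch over an undirected adjacency-list graph (the minimum-cut step itself would come from a library or your own Stoer-Wagner implementation):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class Components {
    // adj.get(v) lists the neighbours of vertex v in an undirected graph.
    static List<List<Integer>> connectedComponents(List<List<Integer>> adj) {
        int n = adj.size();
        boolean[] seen = new boolean[n];
        List<List<Integer>> components = new ArrayList<>();
        for (int start = 0; start < n; start++) {
            if (seen[start]) continue;
            List<Integer> component = new ArrayList<>();
            Queue<Integer> queue = new ArrayDeque<>();
            queue.add(start);
            seen[start] = true;
            while (!queue.isEmpty()) {
                int v = queue.poll();
                component.add(v);
                for (int w : adj.get(v)) {
                    if (!seen[w]) { seen[w] = true; queue.add(w); }
                }
            }
            components.add(component);
        }
        return components;
    }
}
```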

Algorithms for checking overlapping genomic regions

I have two large lists of genomic regions in the form of two BED files, and there are many tools that help me check the overlap between the two lists.
Any two regions (one from list A, another from list B) are called overlapping as long as they overlap in any of their coordinates. There are tools available to do that, but I wish to write an efficient algorithm where I maintain a hash-table-like structure over list A, then iterate over all the regions in list B, and for each region from list B use a quick lookup to tell whether any of the regions in list A overlap with it.
I specifically need an efficient solution since both lists are very large. Thanks very much.
One option would be to:
Create a 1-dimensional R-tree of the regions in one BED file. Insert a range for each exon.
For each region in the other BED file, search the R-tree for intersections of each of that region's exons.
For Java, there are multiple implementations of R-trees. One I've used that supports 1-dimensional ranges is SIRtree, in the library JTS. It provides simple methods to insert ranges and search for intersections.
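A minimal sketch of that usage; the package name assumes a recent JTS release (older releases use com.vividsolutions.jts instead of org.locationtech.jts), and the coordinates and labels are made up:

```java
import java.util.List;
import org.locationtech.jts.index.strtree.SIRtree;

class OverlapCheck {
    public static void main(String[] args) {
        SIRtree index = new SIRtree();

        // Index every region of list A as a 1-dimensional interval [start, end].
        index.insert(100, 250, "A:regionX");   // illustrative coordinates and labels
        index.insert(400, 900, "A:regionY");

        // For each region of list B, ask the tree which intervals intersect it.
        List hits = index.query(200, 500);      // items whose ranges overlap [200, 500]
        for (Object hit : hits) {
            System.out.println("overlaps " + hit);
        }
    }
}
```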
Any data structure represented in memory will be a scalability concern for sufficiently large BED files. You can address that either by increasing the amount of memory available to the VM (hardware and the -Xmx setting) or by representing your data structure on disk.

What is a good design pattern/structure to represent a Directed Acyclic Graph in Java?

I need to store a reasonably large Directed Acyclic Graph in Java (on the order of 100,000 nodes, depth between 7 and 20, irregularly shaped, average depth 13).
What would be the best-performing data structure(s) to store it, if the predominant operations I need after building the data structure are:
99% of operations: find the full set of ascendant paths (from the root down to a given node)
1% of operations: find all children, or more often, all ancestors, of a given node.
As may be obvious, I'd like the first operation to be O(1) if possible, as opposed to O(average depth).
Please note that for the purposes of this question, the data structure is write-once: after I build it from a list of nodes and edges, the graph topology will never change.
My naive implementation would be to store it as a combination of:
HashMap<Integer, Integer[]> childrenPerParent;
HashMap<Integer, Integer[]> ascendantPaths;
E.g. I store, for each node: a list of children of that node; and separately, a set of paths to the root from that node.
Downside: this seems very wasteful in terms of space (we basically store each of the inner graph nodes many times over in ascendantPaths; given the size estimates, we would store an extra 100,000 * 13 = 1.3 million node copies in ascendantPaths, each of which is an object to be created and stored).
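For concreteness, here is one way that naive write-once layout could be filled in, memoising the root-to-node paths so the 99% query is a single map lookup; all names are illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Naive write-once layout: children per parent, plus every root-to-node path
// materialised up front so the dominant query is a plain map lookup.
class Dag {
    final Map<Integer, List<Integer>> childrenPerParent = new HashMap<>();
    final Map<Integer, List<Integer>> parentsPerChild   = new HashMap<>();
    final Map<Integer, List<List<Integer>>> ascendantPaths = new HashMap<>();

    void addEdge(int parent, int child) {
        childrenPerParent.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
        parentsPerChild.computeIfAbsent(child, k -> new ArrayList<>()).add(parent);
    }

    // Recursively materialise all root-to-node paths, memoised per node.
    List<List<Integer>> pathsTo(int node) {
        List<List<Integer>> cached = ascendantPaths.get(node);
        if (cached != null) return cached;
        List<List<Integer>> paths = new ArrayList<>();
        List<Integer> parents = parentsPerChild.get(node);
        if (parents == null || parents.isEmpty()) {
            paths.add(new ArrayList<>(List.of(node)));   // node is a root
        } else {
            for (int parent : parents) {
                for (List<Integer> parentPath : pathsTo(parent)) {
                    List<Integer> path = new ArrayList<>(parentPath);
                    path.add(node);
                    paths.add(path);
                }
            }
        }
        ascendantPaths.put(node, paths);
        return paths;
    }
}
```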
I would recommend using Neo4j. It's a graph database implemented in Java with a lot of low-level optimizations (e.g., node and edge attributes are stored in their own blocks so that node identities and their edges can be packed), and it mmaps the on-disk database. Following an edge takes time independent of the number of nodes in the graph or the number of edges on the origin node.

K-D Tree vs R-Tree for small, dynamic data

I have been reading several SO posts regarding K-D Trees vs. R-Trees but I still have some questions regarding my specific application.
For my Java application, I want to maintain a relatively small number of spatial data points (a few hundred thousand). The key is that the data will not be bulk loaded, but rather inserted frequently and incrementally. I should also mention that I will be performing a good number of periodic range queries on sub-regions of the spatial domain.
I have read that K-D Trees do not typically support incremental building and that R-trees are more suitable for this since they maintain a balanced state.
However, after looking into the solutions suggested here:
Java commercial-friendly R-tree implementation?
I did not find that the implementations were easy to work with for returning a list of points in range searches. However, I have found http://java-ml.sourceforge.net/ to have a very nice implementation of a K-D Tree that works quickly and outperforms standard array storage for a test set of points (~25K). Additionally, I have read that R-trees store redundant information when dealing with points (since a point is a rectangle with min = max).
Since I am working with a smaller number of points, are the differences between the two structures less important than, say, if I was working with a database application storing millions of points?
It is incorrect that R-trees can't store points. They are designed to support rectangles, and need to do so at inner nodes. But a good implementation should store points at the leaf level, and roughly have double the data capacity there.
You can trivially store points, and expose them as "rectangles" with min = max to the tree management code.
Your data isn't small. Small would be something like 100 objects. For 100 objects, an R-tree won't make much sense, as it would likely consist of a single leaf only. For good performance, an R-tree needs a good fan-out. k-d-trees always have a fan-out of 2; they are binary trees. At 100k objects, a k-d-tree will get pretty deep. Assuming a fan-out of 100 (for dynamic R-trees you should then allow up to 200 objects per page), you can store 1 million points in a 3-level tree.
I've used the ELKI R*-tree, and it is really fast. But it's not commercial friendly, unless you get a different license: it's AGPL-3 licensed, which is a copyleft license.
Furthermore, the API isn't designed for standalone use. If you want to use them, the best way is to work with the full ELKI framework, instead of trying to rip out the R*-tree.
If your data is low dimensional (say, 3-dimensional) and has a finite bound, don't underestimate the performance of simple grid-based approaches. In particular for in-memory operations. In many cases, I wouldn't even go to an Octree, but just define the optimal grid for my use case, and then implement it using object lists. Keep sorted by one coordinate within each grid cell to further accelerate performance.
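A minimal sketch of such a grid for 2D points, with an illustrative cell size and without the per-cell sorting mentioned above:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Fixed uniform grid over a bounded 2D domain; each cell keeps its own point list.
class GridIndex {
    private final double cellSize;
    private final Map<Long, List<double[]>> cells = new HashMap<>();

    GridIndex(double cellSize) { this.cellSize = cellSize; }

    private long key(double x, double y) {
        long cx = (long) Math.floor(x / cellSize);
        long cy = (long) Math.floor(y / cellSize);
        return (cx << 32) ^ (cy & 0xffffffffL);   // pack the two cell coordinates into one key
    }

    void insert(double x, double y) {
        cells.computeIfAbsent(key(x, y), k -> new ArrayList<>()).add(new double[]{x, y});
    }

    // Range query: visit every cell the box touches, then filter the candidates exactly.
    List<double[]> range(double minX, double minY, double maxX, double maxY) {
        List<double[]> result = new ArrayList<>();
        long cx0 = (long) Math.floor(minX / cellSize), cx1 = (long) Math.floor(maxX / cellSize);
        long cy0 = (long) Math.floor(minY / cellSize), cy1 = (long) Math.floor(maxY / cellSize);
        for (long cx = cx0; cx <= cx1; cx++) {
            for (long cy = cy0; cy <= cy1; cy++) {
                List<double[]> cell = cells.get((cx << 32) ^ (cy & 0xffffffffL));
                if (cell == null) continue;
                for (double[] p : cell)
                    if (p[0] >= minX && p[0] <= maxX && p[1] >= minY && p[1] <= maxY)
                        result.add(p);
            }
        }
        return result;
    }
}
```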
If you want to frequently add/remove/update data points, you may want to look at the PH-Tree. There is an open-source Java version available: www.phtree.org
It works a bit like a quadtree, but is much more efficient because it uses binary hypercubes and prefix sharing.
It has excellent update performance (no rebalancing required) and is quite memory efficient. It works better with larger datasets, but 100K should be fine for 2 or 3 dimensions.

Quadtree with HashMap

I am considering using a HashMap as the backing structure for a QuadTree. I believe I can use Morton sequencing to uniquely identify each square of my area of interest. I know that my QuadTree will have a height of at most 16. From my calculations, that would lead to a matrix of 65,536 x 65,536, which should give me at most 4,294,967,296 cells. Does anyone know if that is too many elements for a HashMap? I could always write a QuadTree using an actual tree, but I thought that I could get better performance with a HashMap.
Morton sequence of height 1 == (2x2) == 4
Morton sequence of height 2 == (4x4) == 16
Morton sequence of height 3 == (8x8) == 64
Morton Sequencing example for a tree of max height 3.
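For reference, the Morton code for a height-16 tree can be computed by interleaving the bits of the two 16-bit cell coordinates; a minimal sketch (one assumed way to build the keys):

```java
class Morton {
    // Interleave the low 16 bits of x and y into a 32-bit Morton code:
    // bit i of x goes to bit 2i, bit i of y goes to bit 2i+1.
    static long mortonCode(int x, int y) {
        long code = 0;
        for (int i = 0; i < 16; i++) {
            code |= ((long) (x >> i) & 1L) << (2 * i);
            code |= ((long) (y >> i) & 1L) << (2 * i + 1);
        }
        return code;
    }
}
```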
Here is what I know:
I will get data in lat/lon over a known rectangular area.
The data will not completely cover the whole area and will likely be consolidated into chunks somewhere in that area (the worst case is data in all 4,294,967,296 cells).
The resolution of the data ends up breaking the area down into the 65k by 65k grid of rectangles.
I also know that I will likely get about 10 queries for every insert/update of the data.
A HashMap is not a good idea.
There is a better solution, used in navigation systems:
Assign each quadtree cell a letter: A (left, upper), B (right, upper), C and D.
Now you can address each quad cell via a string:
ABACD: this identifies a cell at level 5 (A -> B -> A -> C -> D).
Search the internet for details on that specific quadtree coding.
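A minimal sketch of that string addressing, assuming coordinates normalised to the unit square and screen-style y (growing downward), with A = upper left, B = upper right, C = lower left, D = lower right; all names are illustrative:

```java
class QuadKey {
    // Build the cell address for a point in [0,1) x [0,1), one letter per level.
    static String cellAddress(double x, double y, int levels) {
        StringBuilder key = new StringBuilder();
        double minX = 0, minY = 0, size = 1.0;
        for (int level = 0; level < levels; level++) {
            size /= 2;                                    // each level halves the cell side
            boolean right = x >= minX + size;
            boolean lower = y >= minY + size;             // assumes y grows downward
            key.append(lower ? (right ? 'D' : 'C') : (right ? 'B' : 'A'));
            if (right) minX += size;
            if (lower) minY += size;
        }
        return key.toString();
    }
}
```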
Don't forget: you decide the subdivision rule (when to subdivide a cell into smaller ones), and that decides how many cells you get. The number you give is far too high.
It is only a theoretical calculation, which reminds me exactly of the Google Maps quadtree.
Further, it is important to know which type of quadtree you need for your application:
point quadtree, region quadtree (bounding box), or line quadtree.
If you know of any existing quadtree implementation in Java, please post a comment or edit this answer.
Further, you cannot implement a one-size-fits-all solution.
You have to know approximately how many elements you will support.
The theoretical maximum, which is not equal to the expected maximum, is not a good approach.
You have to know that because you must decide whether to store the data in main memory or on disk, and this also influences the structure of the quadtree. The "ABCD" solution is suitable for dynamic loading from disk.
The Google approach stores images in the quadtree; this is different from the points you want to store, so I doubt that your calculation is realistic.
If you want to store all streets of all countries in the world, you can estimate that number, because the number of points is known (from OpenStreetMap, TomTom (Tele Atlas), or Navteq (Nokia Maps)).
If you realize that you have to store the quadtree on disk, then the size is probably open-ended and limited only by disk space.
I think that implementing a quadtree as an actual tree will give you better results. Implementing such a big database in a HashMap is a bad idea anyway, because if you have a lot of collisions, the performance of a HashMap degrades badly.
And apparently you know exactly how much data you have. In that case, a HashMap is totally redundant. A HashMap is meant for when you do not know how much data there is. But in this case, you know that every node of the tree has four elements. So why even bother using a HashMap?
Also, your table is apparently at least 4 GB large. On most systems, that just barely fits in memory, and there is also Java VM overhead, so why store this in memory at all? It would be better to find a data structure that works well on disk. One such data structure for spatial data (which I assume you have, since you are using a quadtree) is an R-tree.
Whoa, we're getting a number of concepts here all at once. First of all, what are you trying to achieve? Store a quadtree? A matrix of cells? Hash lookups?
If you want a quadtree, why use a hash map? You know there can be at most four child nodes per node. A hash map is useful for an arbitrary number of key-value mappings where quick lookup is necessary. If you're only ever going to have four, a hash might not even be important. Also, while you can nest maps, it's a bit unwieldy. You're better off using some existing data structure or writing your own.
Also, what are you trying to achieve with the quadtree? Quickly looking up a cell in the matrix? Some coordinate-mapping function might serve you much better there.
Finally, I'm not so much worried about the number of nodes in a hash map as I am about that amount purely on its own: 65,536² cells would end up being 4 GiB of memory even at one byte per cell.
I think it would be best to pedal all the way back to the question "what is my goal with this data", then find out which data structures could help you with that (keeping requirements such as lookups in mind) while managing to fit it in memory.
Definitely use directly linked nodes for both space and speed reasons.
With data this big I'd avoid Java altogether. You'll be constantly at the mercy of the garbage collector. Go for a language closer to the metal: C or C++, Pascal/Delphi, Ada, etc.
Put the four child pointers in an array so that you can refer to leaves as packed arrays of 2-bit indices (a nice reason to use Ada, which will let you define such things with no bit fiddling at all). I guess this is Morton sequencing. I did not know that term.
This method of indexing children in itself is a reason to avoid Java. Including a child array in a node class instance will cost you a pointer plus an array size field: 8 or 16 bytes per node that aren't needed in some other languages. With 4 billion cells, that's a lot.
In fact you should do the math. If you use implicit leaf cells, you still have 1 billion nodes to represent. If you use 32-bit indices to reference them (to save memory compared to 64-bit pointers), the minimum is 16 bytes per node. Say node attributes are a mere 4 bytes. Then you have 20 gigabytes just for a full tree, even with none of the Java overhead.
Better have a good budget for RAM.
It is true that most typical quad-trees will simply use nodes with four child node pointers and traverse that, without any mention of hashmaps. However, it is also possible to write an efficient quadtree-like spatial indexing method that stores all its nodes in a big hashmap.
The benefit is that by using the Morton sequence (or another similarly generated value) as the key, you become able to retrieve nodes at any level with only one pointer dereference.
In "traditional" quadtree implementations we get cache misses due to repeated pointer dereferencing while looking up nodes, and this becomes the main bottleneck. So provided that the cost of encoding the coordinate space and getting a hash is lower than the cost of dereferencing the node pointers along the search path, such an implementation could be faster. Particularly if the map is very deep (having sparse locations requiring high precision).
You don't really need the Morton sequence, and you hardly need to think of it as a quadtree when doing this. A very simple example implementation:
In order to retrieve a quad of some level, use { x, y, level } as the hashmap key, where x and y are quantized to that level. You only need to include the level in the key if you are storing several levels in the same map.
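A minimal sketch of that keying scheme (the record syntax assumes Java 16+; all names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Key for one quad cell: integer cell coordinates quantized to a given level.
record CellKey(int x, int y, int level) {}

class HashQuad<T> {
    private final Map<CellKey, T> cells = new HashMap<>();
    private final double rootSize;          // side length of the level-0 cell

    HashQuad(double rootSize) { this.rootSize = rootSize; }

    private CellKey keyFor(double x, double y, int level) {
        double cellSize = rootSize / (1 << level);        // each level halves the cell side
        return new CellKey((int) Math.floor(x / cellSize),
                           (int) Math.floor(y / cellSize), level);
    }

    void put(double x, double y, int level, T value) {
        cells.put(keyFor(x, y, level), value);
    }

    // One hash lookup reaches the cell at any level, with no pointer chasing down a tree.
    T get(double x, double y, int level) {
        return cells.get(keyFor(x, y, level));
    }
}
```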
Whether this is still a quadtree is up for discussion, but the functionality is the same.
