Algorithms in checking overlapping genomic regions

Algorithms in checking overlapping genomic regions - java

I have two large list of genomic regions in the form of two bed files, and there are many tools help me check the overlap of the two list.
Any given region (one from list A, another from list B), as long as they overlap in any of their coordinates, they are called overlap. There are available tools to do that. But I wish to write an efficient algorithms that I can maintain a hash-table like structure in list A, and then I iterate all the regions in list B, and for each regions from list B I can use a quick algorithms to tell if some of the regions in list A overlap with this specific regions from list B.
I specifically need an efficient solutions since both lists are very large. Thanks very much.

One option would be to:
Create a 1-dimensional R-tree of the regions in one BED file. Insert a range for each exon.
For each region in the other BED file, search the R-tree for
intersections of each of that region's exons.
For Java, there are multiple implementations of R-trees. One I've used that supports 1-dimensional ranges is SIRtree, in the library JTS. It provides simple methods to insert ranges and search for intersections.
Any data structure represented in memory will be a scalability concern for sufficiently large BED files. You can address that by either increasing the amount of memory available to the VM (hardware and the -Xmx setting) or by representing your data structure on disk.

Related

In memory weighted graph for a linked data (RDF)

It is a subset of dbpedia dataset which needs to be loaded in memory.
Operation will mainly involve traversing the graph based on edge weights in both directions
3 types of queries
Finding Path between 2 nodes
Given Subject and Predicate find object.
Given Object and Predicate find subject.
It is a sparse Graph
Performance w.r.t time is important
Implementation Language: Java
Use of weights: Filtering. The edges with weight higher than some threshold will be selected.
As it will be a sparse graph, using the Matrix data structure will be HIGHLY space inefficient.
Initial idea was to make java-objects for every subject and let each subject store a 2d array of 3 columns for predicate, object and weight.
But with this kind of data structure it will be hard to generate a list of subjects given an object. Even though it is good with generating the objects given subject.
Is there any proper(more time efficient) way to achieve this? Which could make traversals in both directions easier.
The graph in question is Categories(SKOS)

Java help for modelling a in memory distributed array matrix

I want to distribute a huge two dimensional array Object[][]. Mostly holding primitives but also String, Date and Color objects. I want to allow different machines to modify data in this matrix. My problem is that I need some help with the modelling.
As far as I can see there are only distributed maps available like hazelcalst or openhft. But maps are not fitting my needs exactly. So what I can do is store all rows (or all columns) of my array in a map. But when I act on a whole row/column it is blocked for all other threads even if they operate on another index. When I store every "cell" in a distibuted map the row/column index for the map key might me bigger than the whole array itself. Are there better tools or modelling approaches for this scenario? I would need a matrix where I could distribute data and lock just one single cell.

K-D Tree vs R-Tree for small, dynamic data

I have been reading several SO posts regarding K-D Trees vs. R-Trees but I still have some questions regarding my specific application.
For my Java application, I want to maintain a relatively small number of spatial data points (a few hundred thousand). The key is that data insertion will not be bulk loaded, but rather, frequently and incrementally inserted. I should also mention that I will be performing a good number of periodic range queries on sub-regions of the spatial domain.
I have read that K-D Trees do not typically support incremental building and that R-trees are more suitable for this since they maintain a balanced state.
However, after looking into the solutions suggested here:
Java commercial-friendly R-tree implementation?
I did not find that the implementations were easy to work with for returning a list of points in range searches. However, I have found: http://java-ml.sourceforge.net/ to have a very nice implementation of a K-D Tree that works quickly and outperforms standard array storage for a test set of points (~25K). Additionally, I have read that R-trees store redundant information when dealing with points (since a point is a rectangle with min=max).
Since I am working with a smaller number of points, are the differences between the two structures less important than, say, if I was working with a database application storing millions of points?

It is incorrect that R-trees can't store points. They are designed to support rectangles, and will need to do so at inner nodes. But a good implementation should store points at the leaf level, and roughly have the double data capacity there.
You can trivially store point, and expose them as a "rectangles" with min=max to the tree management code.
Your data isn't small. Small would be like 100 objects. For 100 objects, an R-tree won't make much sense, as it would likely consists of a single leaf only. For good performance, an R-tree needs a good fan-out. k-d-tree always have a fan-out of 2; they are binary trees. At 100k objects, a k-d-tree will be pretty deep. Assuming that you have a fanout of 100 (for dynamic r-trees, you then should allow up to 200 objects per page), you can store 1 million points in a 3-level tree.
I've used the ELKI R*-tree, and it is really fast. But it's not commercial friendly, unless you get a different license: it's AGPL-3 licensed, which is a copyleft license.
Furthermore, the API isn't designed for standalone use. If you want to use them, the best way is to work with the full ELKI framework, instead of trying to rip out the R*-tree.
If your data is low dimensional (say, 3-dimensional) and has a finite bound, don't underestimate the performance of simple grid-based approaches. In particular for in-memory operations. In many cases, I wouldn't even go to an Octree, but just define the optimal grid for my use case, and then implement it using object lists. Keep sorted by one coordinate within each grid cell to further accelerate performance.

If you want to frequently add/remove/update data points, you may want to look at the PH-Tree. The is on open source Java version available: www.phtree.org
It works a bit like a quadtree, but is much more efficient by using binary hypercubes and prefix-sharing.
It has excellent update performance (no rebalancing required) and is quite memory efficient. It works better with larger datasets, but 100K should be fine for 2 or 3 dimensions.

Quadtree with HashMap

I am considering using a HashMap as the backing structure for a QuadTree. I believe I can use Morton sequencing to uniquely identify each square of my area of interest. I know that my QuadTree will have a height of at most 16. From my calculations, that would be lead to a matrix of 65,536 x 65,536 which should give me at most 4,294,967,296 cells. Does anyone know if that is too many elements for a HashMap? I could always write up a QuadTree using a Tree but I thought that I could get better performance with a HashMap.
Morton sequence of height 1 == (2x2) == 4
Morton sequence of height 2 == (4x4) == 16
Morton sequence of height 3 == (8x8) == 64
Morton Sequencing example for a tree of max height 3.
Here is what I know:
I will get data in lat/lon over a know rectangular area.
The data will not completely cover the whole area and will likely be
consolidated into chunks somewhere in that area. (worse case is data in all 4,294,967,296 cells)
The resolution of the data ends up breaking down the area into 65k by 65k rectangle.
I also know that I will likely get 10 to 1 queries to insert/update of
the data.

Hashmap is not a good idea.
There is a better solution, used in navigation systems:
Assign each Quadtree cell a letter: A (Left,upper), B(right, upper) , C and D.
Now you can adress each quad cell via a String:
ABACE: this identifies the cell in level 5. (A->B->A->C->E)
Search internet for details on that specific Quadtree coding.
Dont forgett: You decide the sub division rule (when to subdivide a cell into smaller ones), and that decides how many cells you get. The number you give is far to high.
It is only an theroetical calculation which reminds me 1:1 on Google Maps Quad tree.
Further it is import to know which type of Quadtree you need for your Application:
Point Quadtree, Region Quadtree (bounbding box), Line Quadtree.
If you know any existing Quadtree implementation in java. please post a comment, or edit this answer.
Further you cannot implement a one for all solution.
You have to know aproxmetly how many elements you will suport.
The theroretical maximum , which is not equal to the expected maximum, is not a good approach.
You have to know that because you must decide whether to store that in main memory, or on disk, this also influences the structure of the quadtree. The "ABCD" solution is suitable
for dynamic loading from disk.
The google approach stores images in the quadtree, this is different from points you want to store, so i doubt that your calculation is realistic.
If you want to store all streets of all countries in the world, you can estimate that
number because the number of points are known (Either OpenStreetMap, TomTom (Teelatlas), or (Nokia Maps) Navteq.
If you realized that you have to store the quadtree on disk, then proably the size is open, and limited by only the disk space.

I think that implementing a Quad Tree as a Tree will give you better results. Actually implementing such a big database in a HashMap is a bad idea anyways. Because if you have a lot of collisions, the performance of a HashMap decreases badly.
And apparently you know exactly how much data you have. In that case, a HashMap is totally redundant. A HashMap is meant for when you do not know how much data there is. But in this case, you know that every node of the tree has four elements. So why even bother using a HashMap.?
Also, your table is apparently at least 4GB large. On most systems, that just barely fits in your memory. And since there is also Java VM overhead, why do you store this in memory? It would be better to find a datastructure that works well on disks. One such datastructure for spatial data (which I assume you are having, since you are using a quad tree), is an R-Tree.

Whoa, we're getting a number of concepts here all at once. First of all, what are you trying to reach? Store a quad tree? A matrix of cells? Hash lookups?
If you want a quad tree, why use a hash map? You know there could be at most 4 child nodes to each node. A hash map is useful for an arbitrary number of key-value mappings where quick lookup is necessary. If you're only going to have 4, a hash might not even be important. Also, while you can nest maps, it's a bit unwieldy. You're better off using some data structure or writing your own.
Also, what are you trying to reach with the quad tree? Quickly looking up a cell in the matrix? Some coordinate mapping function might serve you much better there.
Finally, I'm not so much worried about that amount of nodes in a hash map, as I am by the amount purely on its own. 65536² cells would end up being 4 GiB of memory even at one byte per cell.
I think it would be best to pedal all the way back to the question "what is my goal with this data", then find out which data structures could help you with that (keepign requirements such as lookups in mind) while managing to fit it in memory.

Definitely use directly linked nodes for both space and speed reasons.
With data this big I'd avoid Java altogether. You'll be constantly at the mercy of the garbage collector. Go for a language closer to the metal: C or C++, Pascal/Delphi, Ada, etc.
Put the four child pointers in an array so that you can refer to leaves as packed arrays of 2-bit indices (a nice reason to use Ada, which will let you define such things with no bit fiddling at all). I guess this is Morton sequencing. I did not know that term.
This method of indexing children in itself is a reason to avoid Java. Including a child array in a node class instance will cost you a pointer plus an array size field: 8 or 16 bytes per node that aren't needed in some other languages. With 4 billion cells, that's a lot.
In fact you should do the math. If you use implicit leaf cells, you still have 1 billion nodes to represent. If you use 32-bit indices to reference them (to save memory vice 64-bit pointers), the minimum is 16 bytes per node. Say node attributes are a mere 4 bytes. Then you have 20 Gigabytes just for a full tree even with none of the Java overhead.
Better have a good budget for RAM.

It is true that most typical quad-trees will simply use nodes with four child node pointers and traverse that, without any mention of hashmaps. However, it is also possible to write an efficient quadtree-like spatial indexing method that stores all its nodes in a big hashmap.
The benefit is that by using the Morton sequence (or another similarly generated value) as the key, you become able to retrieve nodes at any level with only one pointer dereference.
In "traditional" quadtree implementations we get cache misses due to repeated pointer dereferencing while looking up nodes, and this becomes the main bottleneck. So provided that the cost of encoding the coordinate space and getting a hash is lower than the cost of dereferencing the node pointers along the search path, such an implementation could be faster. Particularly if the map is very deep (having sparse locations requiring high precision).
You don't really need the Morton sequence, and you hardly need to think of it as a quadtree when doing this. A very simple example implementation:
In order to retrieve a quad of some level, use { x, y, level } as the hashmap key, where x and y are quantized to that level. You only need to include the level in the key if you are storing several levels in the same map.
Whether this is still a quadtree is up for discussion, but the functionality is the same.

Array of Structs are always faster than Structs of arrays?

I was wondering if the data layout Structs of Arrays (SoA) is always faster than an Array of Structs (AoS) or Array of Pointers (AoP) for problems with inputs that only fits in RAM programmed in C/JAVA.
Some days ago I was improving the performance of a Molecular Dynamic algorithm (in C), summarizing in this algorithm it is calculated the force interaction among particles based on their force and position.
Original the particles were represented by a struct containing 9 different doubles, 3 for particles forces (Fx,Fy,Fz) , 3 for positions and 3 for velocity. The algorithm had an array containing pointers to all the particles (AoP). I decided to change the layout from AoP to SoA to improve the cache use.
So, now I have a Struct with 9 array where each array stores forces, velocity and positions (x,y,z) of each particle. Each particle is accessed by it own array index.
I had a gain in performance (for an input that only fits in RAM) of about 1.9x, so I was wondering if typically changing from AoP or AoS to SoA it will always performance better, and if not in which types of algorithms this do not occurs.

Much depends of how useful all fields are. If you have a data structure where using one fields means you are likely to use all of them, then an array of struct is more efficient as it keeps together all the things you are likely to need.
Say you have time series data where you only need a small selection of the possible fields you have. You might have all sorts of data about an event or point in time, but you only need say 3-5 of them. In this case a structure of arrays is more efficient because a) you don't need to cache the fields you don't use b) you often access values in order i.e. caching a field, its next value and its next is useful.
For this reason, time-series information is often stored as a collection of columns.

This will depend on how exactly you access the data.
Try to imagine, what exactly happens in the hardware when you access your data, in either SoA or AoS.
To reason about your question, you must consider following things -
If the cache is absent, the performance should be the same, assuming that memory access latency is equal for all the elements of the data.
Now with the cache, if you access consecutive address locations, definitely you will get performance improvement. This is exactly valid in your case. When you have AoS, The locations are not consecutive in the memory, so you must lose some performance there.
You must be accessing in for loops your data like for(int i=0;i<1000000;i++) Fx[i] = 0. So if the data is huge in quantity, you will easily see the small performance benefits. If your data was small, this would not matter much.
Finally, you also don't know about the DRAM that you are using. It will have some benefits when you access consecutive data. For example to understand why it is like that you can refer to wiki.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.