Multiple sequence alignment for general strings java

Multiple sequence alignment for general strings java - java

I'm working in hadoop and I must align n strings in java and i want an algorithm which compute general strings (no bioinformatics, genome, etc) in Java. Es.
ASFHASFHASDSAAPJEIHRA <-- seq1
AAPSOFHASFDSOISISN--A <-- seq2
AWP-JWRAIADSDIA--N--A <-- seq3
AOPSJD-A-JDSSDSOQOSSJ <-- seq4
100000000011000000000 <-- score
There's someone can help me for a name, library or something?

You could write your own dynamic programming algorithm, but the complexity is: O(N^k) if N is the sequence length and k the number of sequences. Suppose you have k=2 sequences:
The you have a 2D grid where each point in your grid corresponds to a pair of characters. So position (1,1) corresponds to word1[1] and word2[1]. Horizontal and vertical edges in this grid correspond to insertions and deletions, while diagonal ones are matches or mismatches. For each you have to device a penalty. In your example a match = +1 while the other possibilities are +0. When you arrive at the bottom right of the grid you have the optimal alignment score.

Related

Maximum participant intersection area of given n geometric shapes

I have n geometric shapes defined in GeoJson, I would like to calculate the intersection which involves maximum number of shapes.
I have the following constraints;
none of the shapes may intersects (no intersection between any of shapes, 0-participant intersection)
all shapes may intersects (there is an intersection that is in all shapes, n-participant intersection)
there might be more than one intersection with k participant (shape A B C intersects, shape D E F intersects, there are 2 3-participant intersection, doesn't need to find both, return when first 3-participant found)
So, as a starting point, I though I could do that by using brute force (trying to intersects n, n-1, n-2 combination of given shapes) but it's time complexity will be O(n!). I'm wondering that if the algorithm could be optimized?
EDIT:
Well, I forgot the tell about data types. I'm using Esri/geometry library for shapes. Specifically, Polygon class instances.

This problem feels like you can construct hard cases that cannot be solved efficiently, specifically if the shapes are not convex. Here are two ideas that you could try:
1. Iterative Intersection
Keep a list L of (disjoint) polygons with counts that is empty in the beginning. Now iterate through your given polygons P. For each polygon p from P intersect it with all polygons l from L. If there is an intersection between p and l then remove l from L and
add set_intersection(l, p) with previous count of l +1
add set_minus(l, p) with `previous count of l'
remember set_minus(p, l) and proceed to the next entry of L
when you are through all elements of L then add the remaining part of p to L with count 1.
Eventually you will have a list of disjoint polygons with counts that is equvalent to the number of participant polygons.
2. Space Decomposition
Build a bounding box around all polygons. Then iteratively split that space (similar to a KD-Tree). For each half (rectangle), compute the number of polygons from P intersecting that rectangle. Proceed best-first (always evaluate the rectangle that has the highest count). When you are at a certain level of the KD-Tree then stop and evaluate by brute-force or Iterative Intersection.
Both methods will benefit from a filter using minimum bounding rectangles around the polygons.

A list of unprocessed (non-empty) intersections together with the polygon indices from which the intersection was created is maintained. Initially it is filled with the single polygons. Always the intersection from most polygons and least maximal index of involved polygons is taken out and intersected with all polygons with higher indices (to avoid looking at the same subset of polygons again and again). This allows an efficient pruning:
if the number of polygons involved in the currently processed intersection and the remaining indices do not surpass the highest number of polygons to have a non-empty intersection, we do not need to pursue this intersection any longer.
Here is the algorithm in Python-like notation:
def findIntersectionFromMostMembers(polygons):
unprocessedIntersections = [(polygon, {i}) for i, polygon in enumerate(polygons)]
unprocessedIntersections.reverse() # make polygon_0 the last element
intersectionFromMostMembers, indicesMost = (polygons[0], {0})
while len(unprocessedIntersections) > 0:
# last element has most involved polygons and least maximal polygon index
# -> highest chance of being extended to best intersection
intersection, indices = unprocessedIntersections.pop() # take out unprocessed intersection from most polygons
if len(indices) + n - max(indices) - 1 <= len(indicesMost):
continue # pruning 1: this intersection cannot beat best found solution so far
for i in range(max(indices)+1, n): # pruning 2: only look at polyong with higher indices
intersection1 = intersection.intersect(polygons[i])
if not intersection1.isEmpty(): # pruning 3: empty intersections do not need to be considered any further
unprocessedIntersections.insertSorted(intersection1, indices.union({i}), key = lambda(t: len(t[1]), -max(t[1])))
if len(indices)+1 > len(indicesMost):
intersectionFromMostMembers, indicesMost = (intersection1, indices.union({i}))
return intersectionFromMostMembers, indicesMost
The performance highly depends on how many polygons in average do have an area in common. The fewer (<< n) polygons have areas in common, the more effective pruning 3 is. The more polygons have areas in common, the more effective pruning 1 is. Pruning 2 makes sure that no subset of polygons is considered twice. The worst scenario seems to be when a constant fraction of n (e.g. n/2) polygons have some area in common. Up to n=40 this algorithm terminates in reasonable time (in a few seconds or at most in a few minutes). If the non-empty intersection from most polygons involves only a few (any constant << n) polygons, much bigger sets of polygons can be processed in reasonable time.

Java permutations of offsets

Had a question regarding generating a list of 10-digit phone numbers on a PhonePad, given a set of possible moves and a starting number.
The PhonePad:
1 2 3
4 5 6
7 8 9
* 0 #
Possible moves:
The same number of moves a Queen in chess can make (so north, south, east, west, north-east, north-west, south-east, south-west... n-spaces per each orientation)
Starting number: 5
So far I have implemented the PhonePad as a 2-dimensional char array, implemented the possible moves a Queen can make in a HashMap (using offsets of x and y), and I can make the Queen move one square using one of the possible moves.
My next step is to figure out an algorithm that would give me all 10-digit permutations (phone numbers), using the possible moves in my HasMap. Repetition of a number is allowed. * and # are not allowed in the list of phone numbers returned.
I would imagine starting out with
- 5555555555, 5555555551, 5555555552... and so on up to 0,
- 5555555515, 5555555155, 5555551555.. 5155555555.. and with the numbers 2 upto 0
- 5555555151, 5555551515, 5555515155.. 5151555555.. and with numbers 2 upto 0
... and so on for a two digit combination
Any suggestions on a systematic approach generating 10-digit combinations? Even a pseudocode algorithm is appreciated! Let me know if further clarification is required.
Thanks in advance! :)

In more detail, the simplest approach would be a recursive method, roughly like:
It accepts a prefix string initially empty, a current digit (initially '5'), and a number of digits to generate (initially 10).
If the number of digits is 1, it will simply output the prefix concatenated with the current digit.
If the number of digits is greater than 1, then it will make a list of all possible next digits and call itself recursively with (prefix + (current digit), next digit, (number of digits)-1 ) as the arguments.
Other approaches, and refinements to this one, are possible of course. The "output" action could be writing to a file, adding to a field in the current class or object, or adding to a local variable collection (List or Set) that will be returned as a result. In that last case, the (ndigits>1) logic would have to combine results from multiple recursive calls to get a single return value.

Hilbert sort by divide and conquer algorithm?

I'm trying to sort d-dimensional data vectors by their Hilbert order, for bulk-loading a spatial index.
However, I do not want to compute the Hilbert value for each point explicitly, which in particular requires setting a particular precision. In high-dimensional data, this involves a precision such as 32*d bits, which becomes quite messy to do efficiently. When the data is distributed unevenly, some of these calculations are unnecessary, and extra precision for parts of the data set are necessary.
Instead, I'm trying to do a partitioning approach. When you look at the 2D first order hilbert curve
1 4
| |
2---3
I'd split the data along the x-axis first, so that the first part (not necessarily containing half of the objects!) will consist of 1 and 2 (not yet sorted) and the second part will have objects from 3 and 4 only. Next, I'd split each half again, on the Y axis, but reverse the order in 3-4.
So essentially, I want to perform a divide-and-conquer strategy (closely related to QuickSort - on evenly distributed data this should even be optimal!), and only compute the necessary "bits" of the hilbert index as needed. So assuming there is a single object in "1", then there is no need to compute the full representation of it; and if the objects are evenly distributed, partition sizes will drop quickly.
I do know the usual textbook approach of converting to long, gray-coding, dimension interleaving. This is not what I'm looking for (there are plenty of examples of this available). I explicitly want a lazy divide-and-conquer sorting only. Plus, I need more than 2D.
Does anyone know of an article or hilbert-sorting algorithm that works this way? Or a key idea how to get the "rotations" right, which representation to choose for this? In particular in higher dimensionalities... in 2D it is trivial; 1 is rotated +y, +x, while 4 is -y,-x (rotated and flipped). But in higher dimensionalities this gets more tricky, I guess.
(The result should of course be the same as when sorting the objects by their hilbert order with a sufficiently large precision right away; I'm just trying to save the time computing the full representation when not needed, and having to manage it. Many people keep a hashmap "object to hilbert number" that is rather expensive.)
Similar approaches should be possible for Peano curves and Z-curve, and probably a bit easier to implement... I should probably try these first (Z-curve is already working - it indeed boils down to something closely resembling a QuickSort, using the appropriate mean/grid value as virtual pivot and cycling through dimensions for each iteration).
Edit: see below for how I solved it for Z and peano curves. It is also working for 2D Hilbert curves already. But I do not have the rotations and inversion right yet for Hilbert curves.

Use radix sort. Split each 1-dimensional index to d .. 32 parts, each of size 1 .. 32/d bits. Then (from high-order bits to low-order bits) for each index piece compute its Hilbert value and shuffle objects to proper bins.
This should work well with both evenly and unevenly distributed data, both Hilbert ordering or Z-order. And no multi-precision calculations needed.
One detail about converting index pieces to Hilbert order:
first extract necessary bits,
then interleave bits from all dimensions,
then convert 1-dimensional indexes to inverse Gray code.
If the indexes are stored in doubles:
If indexes may be negative, add some value to make everything positive and thus simplify the task.
Determine the smallest integer power of 2, which is greater than all the indexes and divide all indexes to this value
Multiply the index to 2^(necessary number of bits for current sorting step).
Truncate the result, convert it to integer, and use it for Hilbert ordering (interleave and compute the inverse Gray code)
Subtract the result, truncated on previous step, from the index: index = index - i
Coming to your variant of radix sort, i'd suggest to extend zsort (to make hilbertsort out of zsort) with two binary arrays of size d (one used mostly as a stack, other is used to invert index bits) and the rotation value (used to rearrange dimensions).
If top value in the stack is 1, change pivotize(... ascending) to pivotize(... descending), and then for the first part of the recursion, push this top value to the stack, for second one - push the inverse of this value. This stack should be restored after each recursion. It contains the "decision tree" of last d recursions of radix sort procedure (in inverse Gray code).
After d recursions this "decision tree" stack should be used to recalculate both the rotation value and the array of inversions. The exact way how to do it is non-trivial. It may be found in the following links: hilbert.c or hilbert.c.

You can compute the hilbert curve from f(x)=y directly without using recursion or L-systems or divide and conquer. Basically it's a gray code or hamiltonian path traversal. You can find a good description at Nick's spatial index hilbert curve quadtree blog or from the book hacker's delight. Or take a look at monotonic n-ary gray code. I've written an implementation in php including a moore curve.

I already answered this question (and others) but my answer(s) mysteriously disappeared. The Compact Hilbert Index implemention from http://code.google.com/p/uzaygezen/source/browse/trunk/core/src/main/java/com/google/uzaygezen/core/CompactHilbertCurve.java (method index()) already allows one to limit the number of hilbert index bits computed up to a given level. Each iteration of the loop from the mentioned method computes a number of bits equal to the dimensionality of the space. You can easily refactor the for loop to compute just one level (i.e., a number of bits equal to the dimensionality of the space) at a time, going only as deeply as needed to compare lexicographically two numbers by their Compact Hilbert Index.

Computational geometry: find where the triangle is after rotation, translation or reflection on a mirror

I have a small contest problem in which is given a set of points, in 2D, that form a triangle. This triangle may be subject to an arbitrary rotation, may be subject to an arbitrary translation (both in the 2D plane) and may be subject to a reflection on a mirror, but its dimensions were kept unchanged.
Then, they give me a set of points in the plane, and I have to find 3 points that form my triangle after one or more of those geometric operations.
Example:
5 15
8 5
20 10
6
5 17
5 20
20 5
10 5
15 20
15 10
Output:
5 17
10 5
15 20
I bet it's supposed to apply some known algorithm, but I don't know which. The most common are: convex hull, sweep plane, triangulation, etc.
Can someone give a tip? I don't need the code, only a push, please!

A triangle is uniquely defined (ignoring rotations, flips, and translations) by the lengths of its three sides. Label the vertices of your original triangle A,B,C. You're looking
for points D,E,F such that |AB| = |DE|, |AC| = |DF|, and |BC| = |EF|. The length is given by Pythagoras' formula (but you can save a square root operation at each test by comparing
the squares of the line segment lengths...)

The given triangle is defined by three lengths. You want to find three points in the list separated by exactly those lengths.
Square the given lengths to avoid bothering with sqrt.
Find the square of the distance between every pair of points in the list and only note those that coincide with the given lengths: O(V^2), but with a low coefficient because most lengths will not match.
Now you have a sparse graph with O(V) edges. Find every cycle of size 3 in O(V) time, and prune the matches. (Not sure of the best way, but here is one way with proper big-O.)
Total complexity: O(V^2) but finding the cycles in O(V) may be the limiting factor, depending on the number of points. Spatially sorting the list of points to avoid looking at all pairs should improve the asymptotic behavior, otherwise.

This is generally done with matrix math. This Wikipedia article covers the rotation, translation, and reflection matrices. Here's another site (with pictures).

Since the transformations are just rotation, scaling and mirroring, then you can find the points that form the transformed triangle by checking the dot product of two sides of the triangle:
For the original triangle A, B, C, calculate the dot product of AB.AC, BA.BC and CA.CB
For each set of three points D, E, F, calculate dot product of DE.DF and compare against the three dot products found in 1.
This works since AB.AC = |AB| x |AC| x cos(a), and two lengths and an angle between them defines a triangle.
EDIT: Yes, Jim is right, just one dot product isn't enough, you'll need to do all three, including ED.EF and FD.FE. So in the end, there's the same number of calculations in this as there is in the square distances method.

Fast counting of 2D sub-matrices withing a large, dense 2D matrix?

What's a good algorithm for counting submatrices within a larger, dense matrix? If I had a single line of data, I could use a suffix tree, but I'm not sure if generalizing a suffix tree into higher dimensions is exactly straightforward or the best approach here.
Thoughts?
My naive solution to index the first element of the dense matrix and eliminate full-matrix searching provided only a modest improvement over full-matrix scanning.
What's the best way to solve this problem?
Example:
Input:
Full matrix:
123
212
421
Search matrix:
12
21
Output:
2
This sub-matrix occurs twice in the full matrix, so the output is 2. The full matrix could be 1000x1000, however, with a search matrix as large as 100x100 (variable size), and I need to process a number of search matrices in a row. Ergo, a brute force of this problem is far too inefficient to meet my sub-second search time for several matrices.

For an algorithms course, I once worked an exercise in which the Rabin-Karp string-search algorithm had to be extended slightly to search for a matching two-dimensional submatrix in the way you describe.
I think if you take the time to understand the algorithm as it is described on Wikipedia, the natural way of extending it to two dimensions will be clear to you. In essence, you just make several passes over the matrix, creeping along one column at a time. There are some little tricks to keep the time complexity as low as possible, but you probably won't even need them.
Searching an N×N matrix for a M×M matrix, this approach should give you an O(N²⋅M) algorithm. With tricks, I believe it can be refined to O(N²).

Algorithms and Theory of Computation Handbook suggests what is an N^2 * log(Alphabet Size) solution. Given a sub-matrix to search for, first of all de-dupe its rows. Now note that if you search the large matrix row by row at most one of the de-duped rows can appear at any position. Use Aho-Corasick to search this in time N^2 * log(Alphabet Size) and write down at each cell in the large matrix either null or an identifier for the matching row of the sub-matrix. Now use Aho-Corasick again to search down the columns of this matrix of row matches and signal a match where all the rows are present below each other.

This sounds similar to template matching. If motivated you could probably transform your original array with the FFT and drop a log from a brute force search. (Nlog(M)) instead of (NM)

I don't have a ready answer but here's how I would start:
-- You want very fast lookup, how much (time) can you spend on building index structures? When brute-force isn't fast enough you need indexes.
-- What do you know about your data that you haven't told us? Are all the values in all your matrices single-digit integers?
-- If they are single-digit integers (or anything else you can represent as a single character or index value), think about linearising your 2D structures. One way to do this would be to read the matrix along a diagonal running top-right to bottom-left and scanning from top-left to bottom-right. Difficult to explain in words, but read the matrix:
1234
5678
90ab
cdef
as 125369470c8adbef
(get it?)
Now you can index your super-matrix to whatever depth your speed and space requirements demand; in my example key 1253... points to element (1,1), key abef points to element (3,3). Not sure if this works for you, and you'll have to play around with the parameters to your solution. Choose your favourite method for storing the key-value pairs: a hash, a list, or even build some indexes into the index if things get wild.
Regards
Mark

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.