Java Regex for genome puzzle


I was assigned a problem to find genes when given a string of the letters A,C,G, or T all in a row, like ATGCTCTCTTGATTTTTTTATGTGTAGCCATGCACACACACACATAAGA. A gene is started with ATG, and ends with either TAA, TAG, or TGA (the gene excludes both endpoints). The gene consists of triplets of letters, so its length is a multiple of three, and none of those triplets can be the start/end triplets listed above. So, for the string above the genes in it are CTCTCT and CACACACACACA. And in fact my regex works for that particular string. Here's what I have so far (and I'm pretty happy with myself that I got this far):
(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
However, if there is an ATG and end-triplet within another result, and not aligned with the triplets of that result, it fails. For example:
Results for TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGG :
TTGCTTATTGTTTTGAATGGGGTAGGA
ACCTGC
It should find also a GGG but doesn't: TTGCTTATTGTTTTGA(ATG|GGG|TAG)GA
I'm new to regex in general and a little stuck...just a little hint would be awesome!

The problem is that the regular expression consumes the characters that it matches and then they are not used again.
You can solve this by using a zero-width match (in which case you only get the index of the match, not the characters that matched).
Alternatively you can use three similar regular expressions, but each using a different offset:
(?=(.{3})+$)(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
(?=(.{3})+.$)(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
(?=(.{3})+..$)(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
You might also want to consider a different approach that doesn't involve regular expressions, as the regular expressions above would be slow.
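To illustrate the point about consumed characters, here is a minimal sketch comparing a plain match with a zero-width lookahead that captures the same text. The class and method names (OverlapDemo, findAll) and the short sample string are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapDemo {
    // Collects group 1 when the pattern defines a group, otherwise the whole match.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.groupCount() > 0 ? m.group(1) : m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Consuming match: the next find() starts after the first match ends.
        System.out.println(findAll("ATA", "ATATA"));       // [ATA]
        // Zero-width lookahead: nothing is consumed, so the overlap is found too.
        System.out.println(findAll("(?=(ATA))", "ATATA")); // [ATA, ATA]
    }
}
```

The lookahead itself matches the empty string; the text is only available through the capturing group inside it.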

The problem with things like this is that you can slowly build up a regex, rule by rule, until you have something that works.
Then your requirements change and you have to start all over again, because it's nearly impossible for mere mortals to reverse engineer a complex regex.
Personally, I'd rather do it the 'old fashioned' way - use string manipulation. Each stage can be easily commented, and if there's a slight change in the requirements you can just tweak a particular stage.
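A staged, regex-free version might look roughly like the sketch below. It is only one reading of the rules in the question: the names GeneScanner and findGenes are hypothetical, and it assumes a gene must contain at least one triplet (as the + in the asker's regex implies).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GeneScanner {
    static final String START = "ATG";
    static final List<String> STOPS = Arrays.asList("TAA", "TAG", "TGA");

    static List<String> findGenes(String genome) {
        List<String> genes = new ArrayList<>();
        // Stage 1: try every start codon, including ones inside an earlier gene.
        for (int start = genome.indexOf(START); start != -1;
                start = genome.indexOf(START, start + 1)) {
            int geneStart = start + 3;
            // Stage 2: walk forward one triplet at a time, in this start codon's frame.
            for (int pos = geneStart; pos + 3 <= genome.length(); pos += 3) {
                String codon = genome.substring(pos, pos + 3);
                if (codon.equals(START)) {
                    break; // a start codon inside the body disqualifies this candidate
                }
                if (STOPS.contains(codon)) {
                    // Stage 3: keep the body only if it holds at least one triplet.
                    if (pos > geneStart) {
                        genes.add(genome.substring(geneStart, pos));
                    }
                    break;
                }
            }
        }
        return genes;
    }

    public static void main(String[] args) {
        // prints [CTCTCT, CACACACACACA]
        System.out.println(findGenes("ATGCTCTCTTGATTTTTTTATGTGTAGCCATGCACACACACACATAAGA"));
    }
}
```

Because every start codon is tried independently, a gene nested inside another one in a different frame (such as the GGG in the second example of the question) is found as well.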

Here's a possible regex:
(?=(ATG((?!ATG)[ATGC]{3})*(TAA|TAG|TGA)))
A little test-rig:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String source = "TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGGATGATGTAG";
        Matcher m = Pattern.compile("(?=(ATG((?!ATG)[ATGC]{3})*(TAA|TAG|TGA)))").matcher(source);
        System.out.println("source : " + source + "\nmatches:");
        while (m.find()) {
            // Indent each match so that it lines up with its position in the source.
            System.out.print(" ");
            for (int i = 0; i < m.start(); i++) {
                System.out.print(" ");
            }
            System.out.println(m.group(1));
        }
    }
}
which produces:
source : TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGGATGATGTAG
matches:
ATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGGATGA
ATGGGGTAG
ATGACCTGCTAA
ATGTAG

Perhaps you should try with other methods like working with indexes. Something like :
import java.util.ArrayList;
import java.util.List;

public class GeneIndexes {
    public static final String genome = "ATGCTCTCTTGATTTTTTTATGTGTAGCCATGCACACACACACATAAGA";
    public static final String start_codon = "ATG";
    public static final String[] end_codons = {"TAA", "TAG", "TGA"};

    public static void main(String[] args) {
        // Collect the index of every start codon, overlapping occurrences included.
        List<Integer> start_indexes = new ArrayList<Integer>();
        int curIndex = genome.indexOf(start_codon);
        while (curIndex != -1) {
            start_indexes.add(curIndex);
            curIndex = genome.indexOf(start_codon, curIndex + 1);
        }
    }
}
Do the same for the other codons, and check whether the indexes satisfy the triplet rule. By the way, are you sure that a gene excludes the start codon? (An ATG can occur inside a gene.)

Related

How to make a regular expression match based on a condition?

I'm trying to make a conditional regex. I know that there are other posts on Stack Overflow, but they're too specific to their own problems.
The Question
How can I create a regular expression that only looks to match something given a certain condition?
An example
An example of this would be if we had a list of a string(this is in java):
String nums = "42 36 23827";
and we only want to match if there are the same amount of x's at the end of the string as there are at the beginning
What we want in this example
In this example, we would want a regex that checks if there are the same number of x's at the end as there are at the beginning. The conditional part: if there are x's at the beginning, then check whether there are that many at the end; if there are, then it is a match.
Another example
An example of this would be if we had a list of numbers (this is in java) in string format:
String nums = "42 36 23827";
and we want to separate each number into a list
String splitSpace = "Regex goes here";
Pattern splitSpaceRegex = Pattern.compile(splitSpace);
Matcher splitSpaceMatcher = splitSpaceRegex.matcher(nums);
ArrayList<String> splitEquation = new ArrayList<String>();
while (splitSpaceMatcher.find()) {
    // Skip empty matches and strip the surrounding whitespace.
    if (splitSpaceMatcher.group().length() != 0) {
        System.out.println(splitSpaceMatcher.group().trim());
        splitEquation.add(splitSpaceMatcher.group().trim());
    }
}
How can I make this into an array that looks like this:
["42", "36", "23827"]
You could try making a simple regex like this:
String splitSpace = "\\d+\\s+";
But that excludes the "23827" because there is no space after it.
In this example, we would want a regex that checks if it is the end of the string; if it is, then we don't need the space, otherwise we do. As @YCF_L mentioned, we could just make a regex that is \\b\\d+\\b, but I am aiming for something conditional.
Conclusion
So, as a result, the question is, how do we make conditional regular expressions? Thanks for reading and cheers!

There are no conditionals in Java regexes.
I want a regex that checks if there are the same amount of x's at the end as there are in the beginning. The conditional part: If there are x's at the beginning, then check if there are that many at the end, if there are then it is a match.
This may or may not be solvable. If you want to know if a specific string (or pattern) repeats, that can be done using a back reference; e.g.
^(\d+).+\1$
will match a line consisting of an arbitrary number of digits, then one or more other characters, and finally the same digits that were matched at the start. The back reference \1 matches the string matched by group 1.
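As a small illustration of the back reference in Java (the class and method names here are invented for the example):

```java
import java.util.regex.Pattern;

class BackReferenceDemo {
    // Leading digits, at least one other character, then the same digits again.
    static final Pattern SAME_DIGITS = Pattern.compile("^(\\d+).+\\1$");

    static boolean sameDigitsAtBothEnds(String line) {
        return SAME_DIGITS.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(sameDigitsAtBothEnds("42 abc 42")); // true
        System.out.println(sameDigitsAtBothEnds("42 abc 43")); // false
    }
}
```

Note that \1 repeats the exact text captured by the group, not the pattern \d+.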
However if you want the same number of digits at the end as at the start (and that number isn't a constant) then you cannot implement this using a single (Java) regex.
Note that some regex languages / engines do support conditionals; see the Wikipedia Comparison of regular-expression engines page.
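When only the count has to agree, one workaround is to let two small regexes measure the runs and do the comparison in plain Java. The sketch below assumes the literal character x from the question; EdgeRunCheck and its methods are hypothetical names, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EdgeRunCheck {
    static final Pattern LEADING = Pattern.compile("^x*");
    static final Pattern TRAILING = Pattern.compile("x*$");

    // Length of the first match of the pattern, or 0 if there is none.
    static int runLength(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group().length() : 0;
    }

    // The "condition" lives in Java code, not in the regex.
    static boolean sameCountAtBothEnds(String s) {
        return runLength(LEADING, s) == runLength(TRAILING, s);
    }

    public static void main(String[] args) {
        System.out.println(sameCountAtBothEnds("xxabcxx")); // true
        System.out.println(sameCountAtBothEnds("xabcxx"));  // false
    }
}
```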

I would use split, which accepts a regex, like so:
String[] split = nums.split("\\s+"); // ["42", "36", "23827"]
If you want to use Pattern with Matcher, then you can use the pattern \b\d+\b with word boundaries.
String regex = "\\b\\d+\\b";
By using word boundaries, you avoid cases where the number is part of a word; for example, for "123 a4 5678 9b" you will get just ["123", "5678"].

I do not see the "conditional" in the question. The problem is solvable with a straightforward regular expression: \b\d+\b.
A fully fledged Java example would look something like this:
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Ideone {
    public static void main(String args[]) {
        final String sample = "123 45 678 90";
        final Pattern pattern = Pattern.compile("\\b\\d+\\b");
        final Matcher matcher = pattern.matcher(sample);
        final ArrayList<String> results = new ArrayList<>();
        while (matcher.find()) {
            results.add(matcher.group());
        }
        System.out.println(results);
    }
}
Output: [123, 45, 678, 90]

Splitting a String works in Java, doesn't work on Android

So I have this logic which splits a String by 4 characters each. Something like this
0108011305080508000000 gives 0108 0113 0508 0508 0000 00
The logic I used is
String [] splitErrorCode = inputString.split("(?<=\\G....)");
It works great in Java, but when I run it in Android I get the wrong output.
0108 011305080508000000
I have no clue what is going wrong.
After going through the String's split function, I realized Android uses fastSplit, whereas the Java version has a huge splitting logic.
Aren't both the functions supposed to work identically? Why is this a problem? Any comments/ suggestions?

\G in Java was added in Java 6 to mimic the Perl construct:
http://perldoc.perl.org/perlfaq6.html#What-good-is-%5CG-in-a-regular-expression:
You use the \G anchor to start the next match on the same string where the last match left off.
Support for this was very poor. This construct is documented in Python to be used in negative variable-length lookbehinds to limit how far back the lookbehind goes. Explicit support was added.
However, the split method in JDK 7 has a fast path for the common case where the regex is a single character. This avoids the need to compile or use a regex. Here is the method (detailed source redacted):
public String[] split(String regex, int limit) {
/* fastpath if the regex is a
(1)one-char String and this character is not one of the
RegEx's meta characters ".$|()[{^?*+\\", or
(2)two-char String and the first char is the backslash and
the second is not the ascii digit or ascii letter.
*/
char ch = 0;
if (((regex.value.length == 1 &&
".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
(regex.length() == 2 &&
/* Fast path checks as documented in the comment */ )
{
// Fast path computation redacted!
String[] result = new String[resultSize];
return list.subList(0, resultSize).toArray(result);
}
return Pattern.compile(regex).split(this, limit);
}
And before:
public String[] split(String regex, int limit) {
return Pattern.compile(regex).split(this, limit);
}
While this fastpath exists, note that deploying Android programs means it must have source compatibility with Java 6. The Android environment is unable to take advantage of the fast path, therefore it delegates to fastSplit and loses some of the Perl construct supports, such as \A.
As for why they didn't like the traditional always-regex path, it's kind of obvious by itself.
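If the goal is simply fixed-width chunks, a regex-free loop avoids the \G question on either platform. This is a minimal sketch; the class name FixedWidthSplit and the method chunk are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class FixedWidthSplit {
    // Cuts the input into pieces of at most `size` characters, left to right.
    static List<String> chunk(String input, int size) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < input.length(); i += size) {
            parts.add(input.substring(i, Math.min(input.length(), i + size)));
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [0108, 0113, 0508, 0508, 0000, 00]
        System.out.println(chunk("0108011305080508000000", 4));
    }
}
```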

Instead of splitting like you do (and, by the way, this recompiles a pattern for each split operation), just do it like this; it's simpler and performs better:
private static final Pattern ONE_TO_FOUR_DIGITS = Pattern.compile("\\d{1,4}");

// ...

public List<String> splitErrorCodes(final String input)
{
    final List<String> ret = new ArrayList<>(input.length() / 4 + 1);
    final Matcher m = ONE_TO_FOUR_DIGITS.matcher(input);
    while (m.find())
        ret.add(m.group());
    return ret;
}
Of course, an additional check would need to be performed on the shape of input as a whole but it's really not hard to do. Left as an exercise ;)
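One possible form of that check, assuming a valid input consists of digits only (the answer does not say what the expected shape is, so DIGITS_ONLY and the exception are illustrative choices):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ErrorCodeSplitter {
    // Assumption: a well-formed input is a non-empty run of digits.
    private static final Pattern DIGITS_ONLY = Pattern.compile("\\d+");
    private static final Pattern ONE_TO_FOUR_DIGITS = Pattern.compile("\\d{1,4}");

    static List<String> splitErrorCodes(String input) {
        // Validate the input as a whole before cutting it up.
        if (!DIGITS_ONLY.matcher(input).matches()) {
            throw new IllegalArgumentException("not a digit string: " + input);
        }
        List<String> ret = new ArrayList<>(input.length() / 4 + 1);
        Matcher m = ONE_TO_FOUR_DIGITS.matcher(input);
        while (m.find()) {
            ret.add(m.group());
        }
        return ret;
    }
}
```

Without the check, a stray non-digit would silently be skipped by the find() loop and shift the grouping of everything after it.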

Vowel regexp in jflex

I did an exercise using jflex, which is about counting the number of words in an input text file that contain more than 3 vowels. What I ended up doing was defining a token for a word, and then creating a Java function that receives this text as input and checks each character. If it's a vowel, I increment a counter and then check whether it's greater than 3; if it is, I increment the counter for the number of words.
What I want to know is whether there is a regexp that could match a word with more than 3 vowels. I think it would be a cleaner solution. Thanks in advance.
tokens
Letra = [a-zA-Z]
Palabra = {Letra}+

Very simple. Use this if you want to check that a word contains at least 3 vowels.
(?i)(?:[a-z]*[aeiou]){3}[a-z]*
You only care that it contains at least 3 vowels, so the rest can be any alphabetical characters. The regex above works with both String.matches and a Matcher loop, since a valid word (one that contains at least 3 vowels) cannot be a substring of an invalid word (one that contains fewer than 3 vowels).
Beyond the question: for consonants, you can use character class intersection, which is a feature of Java regex: [a-z&&[^aeiou]]. So if you want to check for exactly 3 vowels (for String.matches):
(?i)(?:[a-z&&[^aeiou]]*[aeiou]){3}[a-z&&[^aeiou]]*
If you are using this in Matcher loop:
(?i)(?<![a-z])(?:[a-z&&[^aeiou]]*[aeiou]){3}[a-z&&[^aeiou]]*(?![a-z])
Note that I have to use look-around to make sure that the string matched (exactly 3 vowels) is not part of an invalid string (possible when it has more than 3 vowels).
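For the original task (words with more than 3 vowels), the "at least" pattern can be used in a Matcher loop with the repetition count raised to 4. This is a sketch under that assumption; VowelWords and countWords are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VowelWords {
    // {4} rather than {3}: "more than 3 vowels" means at least four.
    static final Pattern FOUR_PLUS = Pattern.compile("(?i)(?:[a-z]*[aeiou]){4}[a-z]*");

    // Counts the words in the text that contain at least four vowels.
    static int countWords(String text) {
        Matcher m = FOUR_PLUS.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "education" and "beautiful" qualify; prints 2
        System.out.println(countWords("The education system is beautiful today"));
    }
}
```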

Since you yourself wrote a Java method, this can be done as follows in the same:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VowelChecker {
    private static final Pattern vowelRegex = Pattern.compile("[aeiouAEIOU]");

    public static void main(String[] args) {
        System.out.println(checkVowelCount("aeiou", 3));
        System.out.println(checkVowelCount("AEIWW", 3));
        System.out.println(checkVowelCount("HeLlO", 3));
    }

    private static boolean checkVowelCount(String str, int threshold) {
        Matcher matcher = vowelRegex.matcher(str);
        int count = 0;
        while (matcher.find()) {
            if (++count > threshold) {
                return true;
            }
        }
        return false;
    }
}
Here threshold defines the number of vowels you are looking for (since you are looking for greater than 3, hence 3 in the main method). The output is as follows:
true
false
false
Hope this helps!
Thanks,
EG

I ended up using this regexp that I came up with. If anyone has a better one, feel free to post it.
Cons = [bcdBCDfghFGHjklmnJKLMNpqrstPQRSTvwxyzVWXYZ]
Vocal = [aeiouAEIOU]
Match = {Cons}*{Vocal}{Cons}*{Vocal}{Cons}*{Vocal}{Cons}*{Vocal}({Cons}*{Vocal}*|{Vocal}*{Cons}*) | {Vocal}{Cons}*{Vocal}{Cons}*{Vocal}{Cons}*{Vocal}({Cons}*{Vocal}*|{Vocal}*{Cons}*)

How To do this in Regex - code base alterations

I have a complete Java based code base, where members are named:
String m_sFoo;
Array m_arrKeepThings;
Variable/object names include both an m_ prefix to indicate a member and a Hungarian notation type indicator.
I'm looking for a way to perform a one-time code replacement to get (for example, in the two cases above):
Array keepThings;
String foo;
Of course there are many other alternatives, but I hope that based on two examples, I'll be able to perform the full change.
Performance is not an issue as it's a one-time fix.
To clarify, if I had to explain this in lines, it would be:
Match words starting with m_[a-zA-Z].
After m_, drop whatever is there before the first Capital letter.
Change the first capital letter to lower case.

Check out this post: Regex to change to sentence case
Generally I am afraid that you cannot change the case of letters using regular expressions.
I'd recommend implementing a simple utility (using any language you want). You can do it in Java. Just go through your file tree, search for a pattern like m_[sidc]([A-Z]), take the captured sequence, call toLowerCase() and perform the replacement.
Another solution is to search and replace m_sA, then m_sB, ... m_sZ using Eclipse. Total: 26 times. It is a little bit stupid, but probably still faster than implementing and debugging your own code.
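The "capture, lower-case, replace" step of such a utility could be sketched as follows. It assumes Java 9 or later, where Matcher.replaceAll accepts a function, and it generalises the type tag to any run of lowercase letters; MemberRenamer and rename are made-up names, and walking the file tree is left out.

```java
import java.util.regex.Pattern;

class MemberRenamer {
    // m_ prefix, a lowercase type tag, then the first capital letter of the real name.
    static final Pattern MEMBER = Pattern.compile("\\bm_[a-z]+([A-Z])");

    static String rename(String source) {
        // Replace "m_" + tag + capital with just the lower-cased capital.
        return MEMBER.matcher(source).replaceAll(mr -> mr.group(1).toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(rename("String m_sFoo;"));         // String foo;
        System.out.println(rename("Array m_arrKeepThings;")); // Array keepThings;
    }
}
```

Names with no capital letter after the prefix (for example m_foo) are left untouched by this pattern.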

If you are really, really sure that the proposed change won't result in clashes (variables that differ only in their prefix), I would do it with a line of perl:
perl -pi.bak -e "s/\bm_[a-z_]+([A-Z]\w*)\b/this.\l$1/g;" *.java
This performs an in-place edit of your Java sources, keeping a backup with the extension .bak. It replaces your pattern between word boundaries (\b), lower-cases the first letter of the captured name (\l), and applies to every occurrence on a line (g).
You can then perform a diff between the backup files and the result files to see if all went well.

Here is some Java code that works. It is not pure regex, but based on:
Usage:
String str = "String m_sFoo;\n"
+ "Array m_arrKeepThings;\n"
+ "List<? extends Reader> m_lstReaders; // A silly comment\n"
+ "String.format(\"Hello World!\"); /* No m_named vars here */";
// Read the file you want to handle instead
NameMatcher nm = new NameMatcher(str);
System.out.println(nm.performReplacements());
NameMatcher.java
package so_6806699;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
 *
 * @author martijn
 */
public class NameMatcher
{
    private String input;
    public static final String REGEX = "m_[a-z]+([A-Z0-9_\\$\\µ\\£]*)";
    public static final Pattern PATTERN = Pattern.compile(REGEX);

    public NameMatcher(String input)
    {
        this.input = input;
    }

    public String performReplacements()
    {
        Matcher m = PATTERN.matcher(input);
        StringBuilder sb = new StringBuilder();
        int oldEnd = 0;
        while (m.find())
        {
            int start = m.start();
            int end = m.end();
            String match = input.substring(start, end);
            // Keep only what follows the m_ prefix and the type indicator.
            String matchGroup1 = match.replaceAll(REGEX, "$1");
            if (!matchGroup1.isEmpty())
            {
                char[] match_array = matchGroup1.toCharArray();
                match_array[0] = Character.toLowerCase(match_array[0]);
                match = new String(match_array);
            }
            // Copy the untouched text before the match, then the (possibly renamed) match.
            sb.append(input.substring(oldEnd, start));
            oldEnd = end;
            sb.append(match);
        }
        sb.append(input.substring(oldEnd));
        return sb.toString();
    }
}
Demo Output:
String foo;
Array keepThings;
List<? extends Reader> readers; // A silly comment
String.format("Hello World!"); /* No m_named vars here */
Edit 0:
Since dollar signs ($), micro (µ) and pound (£) are valid characters for Java name variables, I edited the regex.
Edit 1: It seems that there are a lot of non-latin characters that are valid (éùàçè, etc). Hopefully you don't have to handle them.
Edit 2: I'm only a human being! So be aware of errors there might be in the code! Make a BACKUP first!
Edit 3: Code improved. A NPE was thrown when the code contains this: m_foo. These will be unhandled.

How do I know if a regexp has more than one possible match?

I am writing Java code that has to distinguish regular expressions with more than one possible match from regular expressions that have only one possible match.
For example:
"abc." can have several matches ("abc1", "abcf", ...),
while "abcd" can only match "abcd".
Right now my best idea was to look for all unescaped regexp special characters.
I am convinced that there is a better way to do it in Java. Ideas?
(Late addition):
To make things clearer - there is NO specific input to test against. A good solution for this problem will have to test the regex itself.
In other words, I need a method whose signature may look something like this:
boolean isSingleResult(String regex)
This method should return true only if there is exactly one possible String s1 for which s1.matches(regex) returns true. (See examples above.)

This sounds dirty, but it might be worth having a look at the Pattern class in the Java source code.
Taking a quick peek, it seems like it 'normalize()'s the given regex (Line 1441), which could turn the expression into something a little more predictable. I think reflection can be used to tap into some private resources of the class (use caution!). It could be possible that, while tokenizing the regex pattern, there are specific indications of whether it has reached some kind of "multi-matching" element in the pattern.
Update
After having a closer look, there is some data within package scope that you can use to leverage the work of the Pattern tokenizer to walk through the nodes of the regex and check for multiple-character nodes.
After compiling the regular expression, iterate through the compiled "Node"s starting at Pattern.root. Starting at line 3034 of the class, there are the generalized types of nodes. For example class Pattern.All is multi-matching, while Pattern.SingleI or Pattern.SliceI are single-matching, and so on.
All these token classes appear to be in package scope, so it should be possible to do this without using reflection, but instead creating a java.util.regex.PatternHelper class to do the work.
Hope this helps.

If it can only have one possible match it isn't reeeeeally an expression, now, is it? I suspect your best option is to use a different tool altogether, because this does not at all sound like a job for regular expressions, but if you insist, well, no, I'd say your best option is to look for unescaped special characters.

The only regular expression that can ONLY match one input string is one that specifies the string exactly. So you need to match expressions with no wildcard characters or character groups AND that specify a start "^" and end "$" anchor.
"the quick" matches:
"the quick brownfox"
"the quick brown dog"
"catch the quick brown fox"
"^the quick brown fox$" matches ONLY:
"the quick brown fox"

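The difference between the two cases above can be reproduced with Matcher.find(), which searches anywhere in the input. Note that String.matches, which the question uses, always tests the whole string, so there the anchors are implied. AnchorDemo and its helper names are invented for this sketch.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // find(): succeeds if the regex matches any part of the input.
    static boolean findsAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // String.matches(): succeeds only if the regex matches the entire input.
    static boolean matchesWhole(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(findsAnywhere("the quick", "catch the quick brown fox"));             // true
        System.out.println(findsAnywhere("^the quick brown fox$", "catch the quick brown fox")); // false
        System.out.println(matchesWhole("the quick", "the quick brown dog"));                    // false
    }
}
```
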
Now I understand what you mean. I live in Belgium...
So this is something that works on most expressions. I wrote it myself, so maybe I forgot some rules.
public static final boolean isSingleResult(String regexp) {
    // Escaped character classes can match more than one character.
    String[] exconexc = "\\d \\D \\w \\W \\s \\S".split(" ");
    for (String s : exconexc) {
        int index = regexp.indexOf(s);
        if (index != -1) // Forbidden char found
        {
            return false;
        }
    }
    // Then remove all remaining escaped characters:
    String regex = regexp.replaceAll("\\\\.", "");
    // Now, all the strings that can mean more than one match
    String[] mtom = "+ . ? | * { [:alnum:] [:word:] [:alpha:] [:blank:] [:cntrl:] [:digit:] [:graph:] [:lower:] [:print:] [:punct:] [:space:] [:upper:] [:xdigit:]".split(" ");
    // iterate all mtom-Strings
    for (String s : mtom) {
        int index = regex.indexOf(s);
        if (index != -1) // Forbidden char found
        {
            return false;
        }
    }
    return true;
}
Martijn

I see that the only way is to check if regexp matches multiple times for particular input.
package com;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AAA {
    public static void main(String[] args) throws Exception {
        String input = "123 321 443 52134 432";
        Pattern pattern = Pattern.compile("\\d+");
        Matcher matcher = pattern.matcher(input);
        int i = 0;
        while (matcher.find()) {
            ++i;
        }
        System.out.printf("Matched %d times%n", i);
    }
}
