How To do this in Regex - code base alterations - java

I have a complete Java based code base, where members are named:
String m_sFoo;
Array m_arrKeepThings;
Variable/object names include both an m_ prefix to indicate a member and a Hungarian notation type indicator.
I'm looking for a way to perform a one-time code replacement that turns the two cases above into:
Array keepThings;
String foo;
Of course there are many other alternatives, but I hope that based on two examples, I'll be able to perform the full change.
Performance is not an issue, as it's a one-time fix.
To clarify, if I had to explain this in lines, it would be:
Match words starting with m_[a-zA-Z].
After m_, drop whatever is there before the first Capital letter.
Change the first capital letter to lower case.

Check out this post: Regex to change to sentence case
Generally I am afraid that you cannot change the case of letters using regular expressions.
I'd recommend implementing a simple utility (in any language you want; Java works fine). Just walk your file tree, search for a pattern like m_[sidc]([A-Z]), take the captured letter, call toLowerCase() on it, and perform the replacement.
Another solution is to search and replace m_sA, then m_sB, ... m_sZ using Eclipse. Total: 26 times. It is a little bit stupid, but probably still faster than implementing and debugging your own code.
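The utility suggested above could be sketched roughly as follows. This is only a minimal illustration, not a tested migration tool: the class name MemberRenamer and the method rename are made up for this example, and the pattern assumes that every type prefix consists of lowercase letters only. Reading and writing the files is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MemberRenamer {
    // "m_", then a lowercase type prefix, then the first capital letter of the real name.
    private static final Pattern MEMBER = Pattern.compile("\\bm_[a-z]+([A-Z])");

    /** Strips the m_ and type prefix and lowercases the first letter of the real name. */
    public static String rename(String source) {
        Matcher m = MEMBER.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Only the captured capital letter is kept, in lower case.
            String replacement = m.group(1).toLowerCase();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rename("String m_sFoo;\nArray m_arrKeepThings;"));
    }
}
```

Names such as m_foo, which have no capital letter after the prefix, are left untouched by this sketch.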

If you are really, really sure that the proposed change won't result in clashes (variables that differ only in their prefix), I would do it with a line of Perl:
perl -pi.bak -e "s/\bm_[a-z_]+([A-Z]\w*)\b/this.\l$1/g;" *.java
This will perform an in-place edit of your Java sources while keeping a backup with the extension .bak. It replaces your pattern between word boundaries (\b), lowercases the first letter of the replacement (\l), and does so multiple times per line (/g).
You can then perform a diff between the backup files and the result files to see if all went well.

Here is some Java code that works. It is not pure regex, but based on:
Usage:
String str = "String m_sFoo;\n"
+ "Array m_arrKeepThings;\n"
+ "List<? extends Reader> m_lstReaders; // A silly comment\n"
+ "String.format(\"Hello World!\"); /* No m_named vars here */";
// Read the file you want to handle instead
NameMatcher nm = new NameMatcher(str);
System.out.println(nm.performReplacements());
NameMatcher.java
package so_6806699;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
*
* @author martijn
*/
public class NameMatcher
{
private String input;
public static final String REGEX = "m_[a-z]+([A-Z0-9_\\$\\µ\\£]*)";
public static final Pattern PATTERN = Pattern.compile(REGEX);
public NameMatcher(String input)
{
this.input = input;
}
public String performReplacements()
{
Matcher m = PATTERN.matcher(input);
StringBuilder sb = new StringBuilder();
int oldEnd = 0;
while (m.find())
{
int start = m.start();
int end = m.end();
String match = input.substring(start, end);
String matchGroup1 = match.replaceAll(REGEX, "$1");
if (!matchGroup1.isEmpty())
{
char[] match_array = matchGroup1.toCharArray();
match_array[0] = Character.toLowerCase(match_array[0]);
match = new String(match_array);
}
sb.append(input.substring(oldEnd, start));
oldEnd = end;
sb.append(match);
}
sb.append(input.substring(oldEnd));
return sb.toString();
}
}
Demo Output:
String foo;
Array keepThings;
List<? extends Reader> readers; // A silly comment
String.format("Hello World!"); /* No m_named vars here */
Edit 0:
Since dollar signs ($), micro (µ) and pound (£) are valid characters in Java variable names, I edited the regex.
Edit 1: It seems that there are a lot of non-latin characters that are valid (éùàçè, etc). Hopefully you don't have to handle them.
Edit 2: I'm only a human being! So be aware of errors there might be in the code! Make a BACKUP first!
Edit 3: Code improved. A NPE was thrown when the code contains this: m_foo. These will be unhandled.

Related

Deal with apostrophe in java regex in replaceALL

I'm trying to replace only the EXACT & WHOLE OCCURRENCES of a pattern using the following code. Apparently the you in you'll is being replaced, giving ###'ll. But what I want is for only the standalone word you to be replaced.
Please suggest.
import java.util.*;
import java.io.*;
public class Fielreadingtest{
public static void main(String[] args) throws IOException {
String MyText = "I knew about you long before I met you. I also know that you’re an awesome person. By the way you’ll be missed. ";
String newLine = System.getProperty("line.separator");
System.out.println("Before:" + newLine + MyText);
String pattern = "\\byou\\b";
MyText = MyText.replaceAll(pattern, "###");
System.out.println("After:" + newLine +MyText);
}
}
/*
Before:
I knew about you long before I met you. I also know that you’re an awesome person. By the way you’ll be missed.
After:
I knew about ### long before I met ###. I also know that ###’re an awesome person. By the way ###’ll be missed.
*/
This being said I have an input file which contains a list of words that I want to skip which looks like this:
Now, as per @Anubhav, I have to use (^|\\s)you([\\s.]|$) to replace exactly you and nothing else. Is my best bet to use a tool like Notepad++ to add this prefix and suffix to all my input words, or to change something in the code itself? The code I'm using is this:
for (String pattern : patternsToSkip) {
line = line.replaceAll(pattern, "");
}
source: https://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount2_source.html?scroll=topic_7_1
You can instead use this regex:
String pattern = "(^|\\s)you([\\s.,;:-]|$)";
This will match "you" only at:
start or preceded by a space
end or followed by a space OR one of the listed punctuation characters
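Rather than editing the word list by hand, the boundary groups could be added in code. The following is a minimal sketch under that assumption; the class SkipWords and its methods are hypothetical names, not part of the original tutorial code. Because the two groups consume the neighbouring whitespace or punctuation, the replacement puts them back with $1 and $2.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SkipWords {
    /** Wraps a literal word in the boundary groups from the regex above. */
    public static String asWholeWord(String word) {
        // Pattern.quote keeps regex meta-characters in the word literal.
        return "(^|\\s)" + Pattern.quote(word) + "([\\s.,;:-]|$)";
    }

    /** Removes every listed word from the line, keeping the surrounding characters. */
    public static String removeWords(String line, List<String> wordsToSkip) {
        for (String word : wordsToSkip) {
            // $1 and $2 restore the whitespace/punctuation that the groups consumed.
            line = line.replaceAll(asWholeWord(word), "$1$2");
        }
        return line;
    }

    public static void main(String[] args) {
        String line = "I met you. By the way you\u2019ll be missed.";
        System.out.println(removeWords(line, Arrays.asList("you")));
    }
}
```

One limitation of this pattern is that a match consumes the space after the word, so the second of two directly adjacent occurrences (as in "you you") is not matched in the same pass.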
You can use a negative lookahead:
\b(you)(?!['’])
Escaped for a Java string:
"\\b(you)(?!['’])"
Your demo input contains a different apostrophe than on my keyboard. I've put both in the negative lookahead.
import java.util.regex.Pattern;
import java.util.regex.Matcher;
/**
<P>{@code java ReplaceYouWholeWordWithAtAtAt}</P>
**/
public class ReplaceYouWholeWordWithAtAtAt {
public static final void main(String[] ignored) {
String sRegex = "\\byou(?!['’])";
String sToSearch = "I knew about you long before I met you. I also know that you’re an awesome person. By the way you’ll be missed.";
String sRplcWith = "###";
Matcher m = Pattern.compile(sRegex).matcher(sToSearch);
StringBuffer sb = new StringBuffer();
while(m.find()) {
m.appendReplacement(sb, sRplcWith);
}
m.appendTail(sb);
System.out.println(sb);
}
}
Output:
[C:\java_code\]java ReplaceYouWholeWordWithAtAtAt
I knew about ### long before I met ###. I also know that youÆre an awesome person. By the way youÆll be missed.

Java Regex for genome puzzle

I was assigned a problem to find genes when given a string of the letters A,C,G, or T all in a row, like ATGCTCTCTTGATTTTTTTATGTGTAGCCATGCACACACACACATAAGA. A gene is started with ATG, and ends with either TAA, TAG, or TGA (the gene excludes both endpoints). The gene consists of triplets of letters, so its length is a multiple of three, and none of those triplets can be the start/end triplets listed above. So, for the string above the genes in it are CTCTCT and CACACACACACA. And in fact my regex works for that particular string. Here's what I have so far (and I'm pretty happy with myself that I got this far):
(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
However, if there is an ATG and end-triplet within another result, and not aligned with the triplets of that result, it fails. For example:
Results for TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGG :
TTGCTTATTGTTTTGAATGGGGTAGGA
ACCTGC
It should find also a GGG but doesn't: TTGCTTATTGTTTTGA(ATG|GGG|TAG)GA
I'm new to regex in general and a little stuck...just a little hint would be awesome!
The problem is that the regular expression consumes the characters that it matches and then they are not used again.
You can solve this by using a zero-width match (in which case you only get the index of the match, not the characters that matched).
Alternatively you can use three similar regular expressions, but each using a different offset:
(?=(.{3})+$)(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
(?=(.{3})+.$)(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
(?=(.{3})+..$)(?<=ATG)(([ACGT]{3}(?<!ATG))+?)(?=TAG|TAA|TGA)
You might also want to consider using a different approach that doesn't involve regular expressions as the above regular expression would be slow.
The problem with things like this is that you can slowly build up a regex, rule by rule, until you have something that works.
Then your requirements change and you have to start all over again, because it's nearly impossible for mere mortals to easily reverse engineer a complex regex.
Personally, I'd rather do it the 'old fashioned' way - use string manipulation. Each stage can be easily commented, and if there's a slight change in the requirements you can just tweak a particular stage.
Here's a possible regex:
(?=(ATG((?!ATG)[ATGC]{3})*(TAA|TAG|TGA)))
A little test-rig:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String source = "TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGGATGATGTAG";
        // The whole gene is captured inside a lookahead, so matches may overlap.
        Matcher m = Pattern.compile("(?=(ATG((?!ATG)[ATGC]{3})*(TAA|TAG|TGA)))").matcher(source);
        System.out.println("source : " + source + "\nmatches:");
        while (m.find()) {
            System.out.print(" ");
            // Indent each match by its start index.
            for (int i = 0; i < m.start(); i++) {
                System.out.print(" ");
            }
            System.out.println(m.group(1));
        }
    }
}
which produces:
source : TCGAATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGGATGATGTAG
matches:
ATGTTGCTTATTGTTTTGAATGGGGTAGGATGACCTGCTAATTGGGGGGGGGGATGA
ATGGGGTAG
ATGACCTGCTAA
ATGTAG
Perhaps you should try with other methods like working with indexes. Something like :
import java.util.ArrayList;
import java.util.List;

public class CodonIndexes { // class name added only to make the snippet compile
    public static final String genome = "ATGCTCTCTTGATTTTTTTATGTGTAGCCATGCACACACACACATAAGA";
    public static final String start_codon = "ATG";
    public final static String[] end_codons = {"TAA", "TAG", "TGA"};

    public static void main(String[] args) {
        // Collect the index of every start codon in the genome.
        List<Integer> start_indexes = new ArrayList<Integer>();
        int curIndex = genome.indexOf(start_codon);
        while (curIndex != -1) {
            start_indexes.add(curIndex);
            curIndex = genome.indexOf(start_codon, curIndex + 1);
        }
    }
}
Do the same for the other codons, and check whether the indexes satisfy the triplet rule. By the way, are you sure that a gene excludes the start codon? (Some ATGs can be found inside a gene.)
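One way to carry the index-based idea through to the end is sketched here. It is an illustration under the rules stated in the question (a gene starts after ATG, is read in triplets, must not contain ATG, TAA, TAG or TGA as a triplet, and ends just before the first stop triplet); the class GeneFinder and the method findGenes are hypothetical names, and empty genes are skipped by assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GeneFinder {
    private static final String START = "ATG";
    private static final List<String> STOPS = Arrays.asList("TAA", "TAG", "TGA");

    /** Returns the genes (without start and stop codons) in order of their start codon. */
    public static List<String> findGenes(String genome) {
        List<String> genes = new ArrayList<>();
        // Try every start codon, including ones that overlap an earlier gene.
        for (int start = genome.indexOf(START); start != -1;
                start = genome.indexOf(START, start + 1)) {
            int bodyStart = start + 3;
            // Walk forward one triplet at a time, aligned to this start codon.
            for (int i = bodyStart; i + 3 <= genome.length(); i += 3) {
                String codon = genome.substring(i, i + 3);
                if (STOPS.contains(codon)) {
                    if (i > bodyStart) {
                        genes.add(genome.substring(bodyStart, i));
                    }
                    break;
                }
                if (codon.equals(START)) {
                    break; // a start codon inside the frame invalidates this candidate
                }
            }
        }
        return genes;
    }

    public static void main(String[] args) {
        System.out.println(findGenes("ATGCTCTCTTGATTTTTTTATGTGTAGCCATGCACACACACACATAAGA"));
    }
}
```

Because every start codon is examined independently, a gene that begins inside another gene and is not aligned with its triplets is still found.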

Regular Expression problem in Java

I am trying to create a regular expression for the replaceAll method in Java. The test string is abXYabcXYZ and the pattern is abc. I want to replace any symbol except the pattern with +. For example the string abXYabcXYZ and pattern [^(abc)] should return ++++abc+++, but in my case it returns ab++abc+++.
public static String plusOut(String str, String pattern) {
pattern= "[^("+pattern+")]" + "".toLowerCase();
return str.toLowerCase().replaceAll(pattern, "+");
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
When I try to replace the pattern with + there is no problem - abXYabcXYZ with pattern (abc) returns abxy+xyz. Pattern (^(abc)) returns the string without replacement.
Is there any other way to write NOT(regex) or group symbols as a word?
What you are trying to achieve is pretty tough with regular expressions, since there is no way to express “replace strings not matching a pattern”. You will have to use a “positive” pattern, telling what to match instead of what not to match.
Furthermore, you want to replace every character with a replacement character, so you have to make sure that your pattern matches exactly one character. Otherwise, you will replace whole strings with a single character, returning a shorter string.
For your toy example, you can use negative lookaheads and lookbehinds to achieve the task, but this may be more difficult for real-world examples with longer or more complex strings, since you will have to consider each character of your string separately, along with its context.
Here is the pattern for “not ‘abc’”:
[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c
It consists of five sub-patterns, connected with “or” (|), each matching exactly one character:
[^abc] matches every character except a, b or c
a(?!bc) matches a if it is not followed by bc
(?<!a)b matches b if it is not preceded by a
b(?!c) matches b if it is not followed by c
(?<!ab)c matches c if it is not preceded by ab
The idea is to match every character that is not in your target word abc, plus every word character that, according to the context, is not part of your word. The context can be examined using negative lookaheads (?!...) and lookbehinds (?<!...).
You can imagine that this technique will fail once you have a target word containing one character more than once, like example. It is pretty hard to express “match e if it is not followed by x and not preceded by l”.
Especially for dynamic patterns, it is by far easier to do a positive search and then replace every character that did not match in a second pass, as others have suggested here.
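For the fixed word abc, the five-alternative pattern above can be tried directly with replaceAll. This is just a small demonstration for that one word; the class NotAbcDemo is a made-up name.

```java
public class NotAbcDemo {
    // Each alternative matches exactly one character that is not part of an "abc".
    static final String NOT_ABC = "[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c";

    public static String plusOut(String text) {
        // Lookarounds inspect the original input, so earlier replacements do not interfere.
        return text.replaceAll(NOT_ABC, "+");
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abXYabcXYZ"));
    }
}
```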
[^ ... ] will match one character that is not any of ...
So your pattern "[^(abc)]" is saying "match one character that is not a, b, c or the left or right bracket"; and indeed that is what happens in your test.
It is hard to say "replace all characters that are not part of the string 'abc'" in a single trivial regular expression. What you might do instead to achieve what you want could be some nasty thing like
while the input string still contains "abc"
find the next occurrence of "abc"
append to the output a string containing as many "+"s as there are characters before the "abc"
append "abc" to the output string
skip, in the input string, to a position just after the "abc" found
append to the output a string containing as many "+"s as there are characters left in the input
or possibly if the input alphabet is restricted you could use regular expressions to do something like
replace all occurrences of "abc" with a single character that does not occur anywhere in the existing string
replace all other characters with "+"
replace all occurrences of the target character with "abc"
which will be more readable but may not perform as well
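The second, placeholder-based recipe might look like this in Java. It is a sketch that assumes the NUL character (\u0000) never occurs in the input and rejects the input otherwise; the class PlaceholderPlusOut is a hypothetical name.

```java
public class PlaceholderPlusOut {
    private static final char PLACEHOLDER = '\u0000'; // assumed not to occur in the input

    public static String plusOut(String str, String word) {
        if (word.isEmpty() || str.indexOf(PLACEHOLDER) >= 0) {
            throw new IllegalArgumentException("empty word or placeholder present in input");
        }
        String marker = String.valueOf(PLACEHOLDER);
        // Step 1: collapse every occurrence of the word into the placeholder.
        String marked = str.replace(word, marker);
        // Step 2: replace every other character with '+'.
        String plussed = marked.replaceAll("[^\\u0000]", "+");
        // Step 3: expand the placeholders back into the word.
        return plussed.replace(marker, word);
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abXYabcXYZ", "abc"));
    }
}
```

String.replace works on literal text, so the word does not need to be escaped in steps 1 and 3.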
Negating regexps is usually troublesome. I think you might want to use negative lookahead. Something like this might work:
String pattern = "(?<!ab).(?!abc)";
I didn't test it, so it may not really work for degenerate cases. And the performance might be horrible too. It is probably better to use a multistep algorithm.
Edit: No I think this won't work for every case. You will probably spend more time debugging a regexp like this than doing it algorithmically with some extra code.
Try to solve it without regular expressions:
String out = "";
int i;
for(i=0; i<text.length() - pattern.length() + 1; ) {
if (text.substring(i, i + pattern.length()).equals(pattern)) {
out += pattern;
i += pattern.length();
}
else {
out += "+";
i++;
}
}
for(; i<text.length(); i++) {
out += "+";
}
Rather than a single replaceAll, you could always try something like:
@Test
public void testString() {
final String in = "abXYabcXYabcHIH";
final String expected = "xxxxabcxxabcxxx";
String result = replaceUnwanted(in);
assertEquals(expected, result);
}
private String replaceUnwanted(final String in) {
final Pattern p = Pattern.compile("(.*?)(abc)([^a]*)");
final Matcher m = p.matcher(in);
final StringBuilder out = new StringBuilder();
while (m.find()) {
out.append(m.group(1).replaceAll(".", "x"));
out.append(m.group(2));
out.append(m.group(3).replaceAll(".", "x"));
}
return out.toString();
}
Instead of using replaceAll(...), I'd go for a Pattern/Matcher approach:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static String plusOut(String str, String pattern) {
StringBuilder builder = new StringBuilder();
String regex = String.format("((?:(?!%s).)++)|%s", pattern, pattern);
Matcher m = Pattern.compile(regex).matcher(str.toLowerCase());
while(m.find()) {
builder.append(m.group(1) == null ? pattern : m.group().replaceAll(".", "+"));
}
return builder.toString();
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
}
Note that you'll need to use Pattern.quote(...) if your String pattern contains regex meta-characters.
Edit: I didn't see a Pattern/Matcher approach was already suggested by toolkit (although slightly different)...
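To illustrate the Pattern.quote(...) note, here is a variant of the same Pattern/Matcher idea with the word quoted before it is spliced into the regex. It is a sketch rather than a drop-in replacement: the class QuotedPlusOut is a made-up name, and unlike the code above it does not lower-case the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedPlusOut {
    public static String plusOut(String str, String word) {
        // Quote the word so that characters like '.' or '(' are matched literally.
        String quoted = Pattern.quote(word);
        Pattern p = Pattern.compile("((?:(?!" + quoted + ").)++)|" + quoted);
        Matcher m = p.matcher(str);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Group 1 is a run of characters outside the word; otherwise the word itself matched.
            out.append(m.group(1) == null ? m.group() : m.group(1).replaceAll(".", "+"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abcXa.c", "a.c"));
    }
}
```

Without the quoting, the word a.c would be treated as a regex and would also match abc.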

How do I know if a regexp has more than one possible match?

I am writing Java code that has to distinguish regular expressions with more than one possible match from regular expressions that have only one possible match.
For example:
"abc." can have several matches ("abc1", "abcf", ...),
while "abcd" can only match "abcd".
Right now my best idea was to look for all unescaped regexp special characters.
I am convinced that there is a better way to do it in Java. Ideas?
(Late addition):
To make things clearer - there is NO specific input to test against. A good solution for this problem will have to test the regex itself.
In other words, I need a method whose signature may look something like this:
boolean isSingleResult(String regex)
This method should return true if and only if there is exactly one possible String s1 for which the expression s1.matches(regex) returns true. (See examples above.)
This sounds dirty, but it might be worth having a look at the Pattern class in the Java source code.
Taking a quick peek, it seems like it 'normalize()'s the given regex (Line 1441), which could turn the expression into something a little more predictable. I think reflection can be used to tap into some private resources of the class (use caution!). It could be possible that while tokenizing the regex pattern, there are specific indications if it has reached some kind of "multi-matching" element in the pattern.
Update
After having a closer look, there is some data within package scope that you can use to leverage the work of the Pattern tokenizer to walk through the nodes of the regex and check for multiple-character nodes.
After compiling the regular expression, iterate through the compiled "Node"s starting at Pattern.root. Starting at line 3034 of the class, there are the generalized types of nodes. For example class Pattern.All is multi-matching, while Pattern.SingleI or Pattern.SliceI are single-matching, and so on.
All these token classes appear to be in package scope, so it should be possible to do this without using reflection, but instead creating a java.util.regex.PatternHelper class to do the work.
Hope this helps.
If it can only have one possible match it isn't reeeeeally an expression, now, is it? I suspect your best option is to use a different tool altogether, because this does not at all sound like a job for regular expressions, but if you insist, well, no, I'd say your best option is to look for unescaped special characters.
The only regular expression that can ONLY match one input string is one that specifies the string exactly. So you need to match expressions with no wildcard characters or character groups AND that specify a start "^" and end "$" anchor.
"the quick" matches:
"the quick brown fox"
"the quick brown dog"
"catch the quick brown fox"
"^the quick brown fox$" matches ONLY:
"the quick brown fox"
Now I understand what you mean. I live in Belgium...
So this is something that works on most expressions. I wrote it myself, so maybe I forgot some rules.
public static final boolean isSingleResult(String regexp) {
// Check the exceptions on the exceptions.
String[] exconexc = "\\d \\D \\w \\W \\s \\S".split(" ");
for (String s : exconexc) {
int index = regexp.indexOf(s);
if (index != -1) // Forbidden char found
{
return false;
}
}
// Then remove all exceptions:
String regex = regexp.replaceAll("\\\\.", "");
// Now, all the strings that can mean more than one match
String[] mtom = "+ . ? | * { [:alnum:] [:word:] [:alpha:] [:blank:] [:cntrl:] [:digit:] [:graph:] [:lower:] [:print:] [:punct:] [:space:] [:upper:] [:xdigit:]".split(" ");
// iterate all mtom-Strings
for (String s : mtom) {
int index = regex.indexOf(s);
if (index != -1) // Forbidden char found
{
return false;
}
}
return true;
}
Martijn
As I see it, the only way is to check whether the regexp matches multiple times for a particular input.
package com;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class AAA {
public static void main(String[] args) throws Exception {
String input = "123 321 443 52134 432";
Pattern pattern = Pattern.compile("\\d+");
Matcher matcher = pattern.matcher(input);
int i = 0;
while (matcher.find()) {
++i;
}
System.out.printf("Matched %d times%n", i);
}
}

How to check a string starts with numeric number?

I have a string which contains alphanumeric characters.
I need to check whether the string starts with a number.
Thanks,
See the isDigit(char ch) method:
https://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html
and pass it the first character of the String, obtained with the String.charAt() method.
Character.isDigit(myString.charAt(0));
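Note that charAt(0) throws a StringIndexOutOfBoundsException on an empty string. A small wrapper that guards against that could look like the following; the class and method names are made up for this example, and treating null and empty strings as "does not start with a digit" is an assumption.

```java
public class StartsWithDigit {
    /** Returns true only if the string is non-empty and its first character is a digit. */
    public static boolean startsWithDigit(String s) {
        // The null and empty checks run first, so charAt(0) is always safe.
        return s != null && !s.isEmpty() && Character.isDigit(s.charAt(0));
    }

    public static void main(String[] args) {
        System.out.println(startsWithDigit("9Hello World!"));
        System.out.println(startsWithDigit("Hello World!"));
        System.out.println(startsWithDigit(""));
    }
}
```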
Sorry I didn't see your Java tag, was reading question only. I'll leave my other answers here anyway since I've typed them out.
Java
String myString = "9Hello World!";
if ( Character.isDigit(myString.charAt(0)) )
{
System.out.println("String begins with a digit");
}
C++:
string myString = "2Hello World!";
if (isdigit( myString[0]) )
{
printf("String begins with a digit");
}
Regular expression:
^[0-9]
Some proof my regex works: Unless my test data is wrong?
I think you ought to use a regex:
import java.util.regex.*;
public class Test {
public static void main(String[] args) {
String neg = "-123abc";
String pos = "123abc";
String non = "abc123";
/* I'm not sure if this regex is too verbose, but it should be
* clear. It checks that the string starts with either a series
* of one or more digits... OR a negative sign followed by 1 or
* more digits. Anything can follow the digits. Update as you need
* for things that should not follow the digits or for floating
* point numbers.
*/
Pattern pattern = Pattern.compile("^(\\d+.*|-\\d+.*)");
Matcher matcher = pattern.matcher(neg);
if(matcher.matches()) {
System.out.println("matches negative number");
}
matcher = pattern.matcher(pos);
if (matcher.matches()) {
System.out.println("positive matches");
}
matcher = pattern.matcher(non);
if (!matcher.matches()) {
System.out.println("letters don't match :-)!!!");
}
}
}
You may want to adjust this to accept floating point numbers, but this will work for negatives. Other answers won't work for negatives because they only check the first character! Be more specific about your needs and I can help you adjust this approach.
This should work:
String s = "123foo";
Character.isDigit(s.charAt(0));
System.out.println(Character.isDigit(s.charAt(0)));
EDIT: I searched for java docs, looked at methods on string class which can get me 1st character & looked at methods on Character class to see if it has any method to check such a thing.
I think, you could do the same before asking it.
EDIT 2: What I mean is, try to do things, read/find & if you can't find anything - ask.
I made a mistake when posting it for the first time. isDigit is a static method on Character class.
Use a regex like ^\d
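In Java, a regex like this has to be applied with find() rather than matches(), because matches() requires the whole string to match the pattern. A minimal sketch, with a made-up class name:

```java
import java.util.regex.Pattern;

public class LeadingDigitRegex {
    // ^ anchors the match at the start of the input; \d is a single digit.
    private static final Pattern LEADING_DIGIT = Pattern.compile("^\\d");

    public static boolean startsWithDigit(String s) {
        // find() looks for the pattern anywhere, and ^ restricts it to the start.
        return LEADING_DIGIT.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(startsWithDigit("123foo"));
        System.out.println(startsWithDigit("foo123"));
    }
}
```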
