I want to extract the first 10 digits, if any exist, from a long string, disregarding leading zeros. Additionally, if the digits are all zeros, return a single zero; if there are no digits, return an empty string. I want to match it in a single find.
For example:
"abcd00111.g2012asd" should match to "1112012"
"aktr0011122222222222ddd" should match to "1112222222"
"asdas000000asdasds0000" should match to "0"
"adsads.cxzv.;asdasd" should match to ""
Here is what I have tried so far:
Pattern p = Pattern.compile("[1-9]{1}+[0-9]{9}");
Matcher m = p.matcher(str);
if (m.find()) {
String match = m.group();
System.out.println(match);
}
The problem is that this regex requires 9 consecutive digits after the first non-zero digit, whereas I need any 9 digits (possibly with non-digit characters in between).
Notice that in the code I have if (m.find()) instead of while (m.find()) because I want to find the match in a single run.
UPDATE
Based on the comments, I understand that it is not possible to do this with a regex in a single run.
The answer does not have to be regex-based, but it should be as efficient as possible, since I will execute this method many times.
I initially claimed that, in the general case, this cannot be done with a single find unless the maximum length of a contiguous digit sequence is known, at least at the level of support of the Java Pattern class. I was wrong about this: Kobi's comment shows that it is possible with a single regex. I will reproduce the comment here:
Oh, and it is sort of possible with a regex, by capturing each of the 10 digits, something like: ^[\D0]*(\d)\D*(?:(\d)\D*(?:(\d)\D*(?:(\d)\D*(?#{6 more times}))?)?)?, but it is really ugly, and doesn't scale well.
You still need to concatenate the groups, though. The logic at the beginning of the regex is quite nice: because [\D0]* is greedy, the first captured digit is the first non-zero digit after all leading zeros, if there is one; otherwise it is the last 0.
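As a rough illustration of the comment above, here is a sketch that builds the nested pattern programmatically and concatenates the captured groups. The class and method names (SingleFindDigits, buildRegex, extract) are made up for this example, and it is only one way to spell out the idea, not a tuned implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SingleFindDigits {
    private static final int MAX_DIGITS = 10;
    private static final Pattern FIRST_DIGITS = Pattern.compile(buildRegex(MAX_DIGITS));

    // Builds ^[\D0]*(\d)\D*(?:(\d)\D*(?:(\d) ... )?)? with maxDigits capturing groups.
    static String buildRegex(int maxDigits) {
        StringBuilder regex = new StringBuilder("^[\\D0]*(\\d)");
        for (int i = 1; i < maxDigits; i++) {
            regex.append("\\D*(?:(\\d)");
        }
        for (int i = 1; i < maxDigits; i++) {
            regex.append(")?");
        }
        return regex.toString();
    }

    // One find(), then the non-null groups are concatenated in order.
    static String extract(String input) {
        Matcher m = FIRST_DIGITS.matcher(input);
        if (!m.find()) {
            return ""; // no digit at all
        }
        StringBuilder out = new StringBuilder(MAX_DIGITS);
        for (int g = 1; g <= m.groupCount(); g++) {
            String digit = m.group(g);
            if (digit == null) {
                break; // fewer than MAX_DIGITS digits were available
            }
            out.append(digit);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("abcd00111.g2012asd"));
        System.out.println(extract("asdas000000asdasds0000"));
    }
}
```

The pattern is compiled once and reused, which matters if the method is called many times.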
If efficiency is not a concern and you just want short code:
String digitOnly = str.replaceAll("\\D+", "");
String noLeadingZero = digitOnly.replaceFirst("^0+", "");
String result = digitOnly.isEmpty() ? "" :
noLeadingZero.isEmpty() ? "0" :
noLeadingZero.substring(0, Math.min(noLeadingZero.length(), 10));
Frankly, looping through the string with a StringBuilder is good enough, and it should be faster than the regex solution.
StringBuilder output = new StringBuilder();
boolean hasDigit = false;
boolean leadingZero = true;
for (int i = 0; i < str.length() && output.length() < 10; i++) {
char currChar = str.charAt(i);
if ('0' <= currChar && currChar <= '9') {
hasDigit = true;
if (currChar != '0') {
output.append(currChar);
leadingZero = false;
} else if (!leadingZero) { // currChar == 0
output.append(currChar);
} // Ignore leading zero
}
}
String result = !hasDigit ? "" :
output.length() == 0 ? "0" :
output.toString();
Performance testing code. Note that you should adjust the parameters so that the test data resembles your actual input; otherwise the numbers are not a good approximation. I doubt the looping method is slower than anything involving regex; however, the difference only becomes significant at large scale.
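For readers who want something to start from, a very rough timing harness could look like the sketch below. It is not the original test code: the class name, the sample input, and the iteration counts are assumptions, and a naive System.nanoTime loop like this is only indicative (a proper benchmark framework gives more trustworthy numbers).

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

final class DigitExtractionTiming {
    private static final Pattern NON_DIGITS = Pattern.compile("\\D+");
    private static final Pattern LEADING_ZEROS = Pattern.compile("^0+");

    // Regex-based variant, with the patterns compiled once.
    static String viaRegex(String str) {
        String digitOnly = NON_DIGITS.matcher(str).replaceAll("");
        String noLeadingZero = LEADING_ZEROS.matcher(digitOnly).replaceFirst("");
        if (digitOnly.isEmpty()) return "";
        if (noLeadingZero.isEmpty()) return "0";
        return noLeadingZero.substring(0, Math.min(noLeadingZero.length(), 10));
    }

    // Loop-based variant: skips zeros until the first non-zero digit is seen.
    static String viaLoop(String str) {
        StringBuilder output = new StringBuilder(10);
        boolean hasDigit = false;
        for (int i = 0; i < str.length() && output.length() < 10; i++) {
            char c = str.charAt(i);
            if (c < '0' || c > '9') continue;
            hasDigit = true;
            if (c != '0' || output.length() > 0) output.append(c);
        }
        if (!hasDigit) return "";
        return output.length() == 0 ? "0" : output.toString();
    }

    static long timeNanos(UnaryOperator<String> f, String input, int iterations) {
        long sink = 0; // consume results so the calls are not optimized away
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) sink += f.apply(input).length();
        long elapsed = System.nanoTime() - start;
        if (sink < 0) throw new IllegalStateException("unreachable");
        return elapsed;
    }

    public static void main(String[] args) {
        String input = "aktr0011122222222222ddd"; // assumption: replace with realistic data
        int iterations = 1_000_000;
        // Warm-up runs so both paths are JIT-compiled before measuring.
        timeNanos(DigitExtractionTiming::viaRegex, input, 100_000);
        timeNanos(DigitExtractionTiming::viaLoop, input, 100_000);
        System.out.printf("regex: %d ms%n",
                timeNanos(DigitExtractionTiming::viaRegex, input, iterations) / 1_000_000);
        System.out.printf("loop:  %d ms%n",
                timeNanos(DigitExtractionTiming::viaLoop, input, iterations) / 1_000_000);
    }
}
```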
String test = "sdfsd0000234.432004gr23.022";
StringBuilder sb = new StringBuilder();
for(int i=0;i<test.length();i++) {
if(Character.isDigit(test.charAt(i)))
sb = sb.append(test.charAt(i));
}
String result = sb.toString();
result = result.replaceFirst("^0*", ""); //Remove leading zeros
System.out.println(result); //Will print 23443200423022
Related
I've seen questions here on SO about how to prefix zeros, but not the other way around!
Can you suggest how to remove the leading zeros from alphanumeric text? Are there any built-in APIs, or do I need to write a method to trim the leading zeros?
Example:
01234 converts to 1234
0001234a converts to 1234a
001234-a converts to 1234-a
101234 remains as 101234
2509398 remains as 2509398
123z remains as 123z
000002829839 converts to 2829839
Regex is the best tool for the job; what it should be depends on the problem specification. The following removes leading zeroes, but leaves one if necessary (i.e. it wouldn't just turn "0" to a blank string).
s.replaceFirst("^0+(?!$)", "")
The ^ anchor makes sure that the 0+ being matched is at the beginning of the input. The (?!$) negative lookahead ensures that the match never consumes the entire string, so at least one character always remains.
Test harness:
String[] in = {
"01234", // "[1234]"
"0001234a", // "[1234a]"
"101234", // "[101234]"
"000002829839", // "[2829839]"
"0", // "[0]"
"0000000", // "[0]"
"0000009", // "[9]"
"000000z", // "[z]"
"000000.z", // "[.z]"
};
for (String s : in) {
System.out.println("[" + s.replaceFirst("^0+(?!$)", "") + "]");
}
See also
regular-expressions.info
repetitions, lookarounds, and anchors
String.replaceFirst(String regex, String replacement)
You can use the StringUtils class from Apache Commons Lang like this:
StringUtils.stripStart(yourString,"0");
If you are using Kotlin, this is the only code you need:
yourString.trimStart('0')
How about the regex way:
String s = "001234-a";
s = s.replaceFirst ("^0*", "");
The ^ anchors to the start of the string (I'm assuming from context your strings are not multi-line here, otherwise you may need to look into \A for start of input rather than start of line). The 0* means zero or more 0 characters (you could use 0+ as well). The replaceFirst just replaces all those 0 characters at the start with nothing.
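To make the ^ versus \A remark concrete: String.replaceFirst compiles the pattern without the MULTILINE flag, so ^ only matches at the start of the input there. The difference shows up once MULTILINE is enabled, as in this small sketch (class and method names are invented for the example):

```java
import java.util.regex.Pattern;

final class LineStartVsInputStart {
    // With MULTILINE, ^ matches at the start of every line.
    private static final Pattern EACH_LINE = Pattern.compile("^0+", Pattern.MULTILINE);
    // \A matches only at the very start of the input, regardless of MULTILINE.
    private static final Pattern INPUT_START = Pattern.compile("\\A0+", Pattern.MULTILINE);

    static String stripEachLine(String s) {
        return EACH_LINE.matcher(s).replaceAll("");
    }

    static String stripInputStart(String s) {
        return INPUT_START.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String twoLines = "007\n008";
        System.out.println(stripEachLine(twoLines));   // zeros removed on both lines
        System.out.println(stripInputStart(twoLines)); // zeros removed on the first line only
    }
}
```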
And if, like Vadzim, your definition of leading zeros doesn't include turning "0" (or "000" or similar strings) into an empty string (a rational enough expectation), simply put it back if necessary:
String s = "00000000";
s = s.replaceFirst ("^0*", "");
if (s.isEmpty()) s = "0";
A clear way without any need of regExp and any external libraries.
public static String trimLeadingZeros(String source) {
for (int i = 0; i < source.length(); ++i) {
char c = source.charAt(i);
if (c != '0') {
return source.substring(i);
}
}
return ""; // or return "0";
}
To go with thelost's Apache Commons answer: using guava-libraries (Google's general-purpose Java utility library which I would argue should now be on the classpath of any non-trivial Java project), this would use CharMatcher:
CharMatcher.is('0').trimLeadingFrom(inputString);
For purely numeric strings whose value fits in an int, you could just do:
String s = Integer.valueOf("0001007").toString();
Use this:
String x = "00123".replaceAll("^0*", ""); // -> 123
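Use the Apache Commons StringUtils class. Note that strip(String str, String stripChars) removes the given characters from both ends, so stripStart is the variant that removes only leading zeros:
StringUtils.stripStart(str, "0");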
Using Regexp with groups:
Pattern pattern = Pattern.compile("(0*)(.*)");
String result = "";
Matcher matcher = pattern.matcher(content);
if (matcher.matches())
{
// first group contains 0, second group the remaining characters
// 000abcd - > 000, abcd
result = matcher.group(2);
}
return result;
Using regex as some of the answers suggest is a good way to do that. If you don't want to use regex then you can use this code:
String s = "00a0a121";
while(s.length()>0 && s.charAt(0)=='0')
{
s = s.substring(1);
}
If you (like me) need to remove all the leading zeros from each "word" in a string, you can modify @polygenelubricants' answer to the following:
String s = "003 d0g 00ss 00 0 00";
s = s.replaceAll("\\b0+(?!\\b)", "");
which results in:
3 d0g ss 0 0 0
I think this is easy to do. Just scan the string from the start, skipping zeros until you find a non-zero char.
// Index of the first character that is not a leading zero
int firstNonZeroIndex = 0;
while (firstNonZeroIndex < str.length() && str.charAt(firstNonZeroIndex) == '0') {
    firstNonZeroIndex++;
}
// Unchanged when there are no leading zeros; empty when the string is all zeros
str = str.substring(firstNonZeroIndex);
Without using regex or the substring() function on String, which can be inefficient:
public static String removeZero(String str){
StringBuffer sb = new StringBuffer(str);
while (sb.length()>1 && sb.charAt(0) == '0')
sb.deleteCharAt(0);
return sb.toString(); // return in String
}
Using Kotlin, it is easy:
value.trimStart('0')
With a regex, you could replace "^0*(.*)" with "$1".
String s="0000000000046457657772752256266542=56256010000085100000";
String removeString="";
for(int i =0;i<s.length();i++){
if(s.charAt(i)=='0')
removeString=removeString+"0";
else
break;
}
System.out.println("original string - "+s);
System.out.println("after removing 0's -"+s.replaceFirst(removeString,""));
If you don't want to use regex or an external library, you can do it with a for loop:
String input = "0000008008451";
String output = input.trim();
for( ;output.length() > 1 && output.charAt(0) == '0'; output = output.substring(1));
System.out.println(output);//8008451
I ran some benchmark tests and found that the fastest way (by far) is this solution:
private static String removeLeadingZeros(String s) {
try {
Integer intVal = Integer.parseInt(s);
s = intVal.toString();
} catch (Exception ex) {
// whatever
}
return s;
}
Regular expressions in particular are very slow over many iterations. (I needed to find the fastest way for a batch job.)
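One thing to keep in mind with the parse-based approach is what happens outside the int range or with non-numeric input: parseInt throws, and the string comes back unchanged. The sketch below illustrates that, with java.math.BigInteger shown as one possible alternative for long digit strings. The class and method names are made up, and no claim is made here about relative speed.

```java
import java.math.BigInteger;

final class ParseBasedStripping {
    // Same idea as above: parse and print again; fall back to the input on failure.
    static String viaParseInt(String s) {
        try {
            return Integer.toString(Integer.parseInt(s));
        } catch (NumberFormatException ex) {
            return s; // not an int: non-numeric text, or a value outside the int range
        }
    }

    // BigInteger has no range limit, so long digit strings are handled too.
    static String viaBigInteger(String s) {
        try {
            return new BigInteger(s).toString();
        } catch (NumberFormatException ex) {
            return s; // non-numeric text stays unchanged
        }
    }

    public static void main(String[] args) {
        System.out.println(viaParseInt("000002829839"));     // fits in an int
        System.out.println(viaParseInt("00009999999999"));   // too large: returned unchanged
        System.out.println(viaBigInteger("00009999999999")); // zeros removed
    }
}
```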
And what about just searching for the first non-zero character?
[1-9]\d*
This regex finds the first digit between 1 and 9 followed by any number of digits, so for "00012345" it returns "12345".
It can be easily adapted for alphanumeric strings.
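A minimal sketch of this search in Java follows. It uses \d* after the first digit so that a single non-zero digit also matches. The alphanumeric pattern is only one possible adaptation and assumes "alphanumeric" means ASCII letters and digits; all names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstNonZeroSearch {
    static final Pattern NUMERIC = Pattern.compile("[1-9]\\d*");
    // Assumed adaptation: start at the first character that is not a zero.
    static final Pattern ALPHANUMERIC = Pattern.compile("[1-9A-Za-z][0-9A-Za-z]*");

    // Returns the first match, or an empty string if nothing matches.
    static String firstMatch(Pattern pattern, String s) {
        Matcher m = pattern.matcher(s);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(NUMERIC, "00012345"));
        System.out.println(firstMatch(ALPHANUMERIC, "0001234a"));
    }
}
```

Note that an all-zero input such as "000" produces no match at all, and that the alphanumeric variant stops at the first character outside the class, such as a hyphen.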
Hey, I need a regex that removes leading zeros.
Right now I am using this code. It works, but it doesn't keep the negative sign.
String regex = "^+(?!$)";
String numbers = txaTexte.getText().replaceAll(regex, "");
After that I split numbers so that the numbers are put in an array.
input :
-0005
0003
-87
output :
-5
3
-87
I was also wondering what regex I could use to get the following.
The words before the arrow are the input and those after it are the output.
The text is in French. Right now I am using the regex below; it works, but not with the apostrophe.
String [] tab = txaTexte.getText().split("(?:(?<![a-zA-Z])'|'(?![a-zA-Z])|[^a-zA-Z'])+")
Un beau JOUR. —> Un/beau/JOUR
La boîte crânienne —> La/boîte/crânienne
C’était mieux aujourd’hui —> C’/était/mieux/aujourd’hui
qu’autrefois —> qu’/autrefois
D’hier jusqu’à demain! —> D’/hier/jusqu’/à/demain
Dans mon sous-sol—> Dans/mon/sous-sol
You might capture an optional hyphen in group 1, then match one or more zeroes, and capture one or more digits starting with a digit 1-9 in group 2.
^(-?)0+([1-9]\d*)$
^ Start of string
(-?) Capture group 1, match optional hyphen
0+ Match 1+ zeroes
([1-9]\d*) Capture group 2, match 1+ digits starting with a digit 1-9
$ End of string
See a regex demo.
In the replacement use group 1 and group 2.
String regex = "^(-?)0+([1-9]\\d*)$";
String text = "-0005";
String numbers = text.replaceAll(regex, "$1$2");
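Since the input in the question spans several lines, ^ and $ only match per line if the MULTILINE flag is set. A hedged sketch of that variant (the class and method names are hypothetical, and a plain String stands in for the text area content):

```java
import java.util.regex.Pattern;

final class SignedLeadingZeros {
    // MULTILINE makes ^ and $ match at the start and end of each line.
    private static final Pattern LEADING_ZEROS =
            Pattern.compile("^(-?)0+([1-9]\\d*)$", Pattern.MULTILINE);

    static String strip(String text) {
        // $1 is the optional hyphen, $2 the number without leading zeros.
        return LEADING_ZEROS.matcher(text).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(strip("-0005\n0003\n-87"));
    }
}
```

Lines without leading zeros, such as -87, do not match and are left untouched. A line consisting only of zeros is also left as is, because group 2 must start with a digit 1-9.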
Here is one way. This preserves the sign:
capture the optional sign,
check for 0 or more leading zeros,
followed by 1 or more digits.
String regex = "^([+-])?0*(\\d+)";
String [] data = {"-1415", "+2924", "-0000123", "000322", "+000023"};
for (String num : data) {
String after = num.replaceAll(regex, "$1$2");
System.out.printf("%8s --> %s%n", num , after);
}
prints
-1415 --> -1415
+2924 --> +2924
-0000123 --> -123
000322 --> 322
+000023 --> +23
If you want to keep -000, 000, +0000 etc. as just 0, try this regex:
^[-+]?0*(0)$|^([-+])?0*(\d+)$
Break down:
^...$ means the entire string should match (^ is the start of the string, $ is the end)
...|... is an alternative
[-+] is a character class that contains only the plus and minus characters. Note that - has a special meaning ("range") in character classes if it's not the first or last character
(...) is a capturing group which can be referenced in the replacement string by $number where number is the 1-based and 1-digit position of the group within the regex (the first group to start is no. 1 etc.)
?, * and + are quantifiers when used outside character classes, meaning "0 or 1 occurrence" (?), "any number of occurrences, including none" (*) and "at least one occurrence" (+)
^[-+]?0*(0)$ thus means: the entire string must be an optional sign, followed by any number of zeros and ending with a single zero which is captured as group 1.
alternatively ^([-+])?0*(\d+)$ means the entire string must be an optional sign which is captured as group 2, followed by any number of zeros and ending in at least one digit which is captured as group 3.
This regex can then be used with String.replaceAll(regex, "$1$2$3") in order to keep only the single 0 from group 1, or the optional sign and the number without leading zeros from groups 2 and 3. Any group that did not participate in the match is replaced by an empty string, which is why this works.
However, regular expressions can be slow, especially if you have to process a lot of strings.
One thing to improve this would be to compile the pattern only once:
//compile the pattern once and reuse it
Pattern p = Pattern.compile("^[+-]?0*(0)$|^([+-])?0*(\\d+)$");
//build a matcher from the pattern and the input string, and do the replacement
String number = p.matcher(txaTexte.getText()).replaceAll("$1$2$3");
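To see how the two alternatives behave, here is a small sketch that runs the precompiled pattern over a few sample inputs. The class name and the sample strings are chosen for illustration only.

```java
import java.util.regex.Pattern;

final class ZeroPreservingStrip {
    private static final Pattern NUMBER =
            Pattern.compile("^[+-]?0*(0)$|^([+-])?0*(\\d+)$");

    static String strip(String s) {
        // Exactly one alternative matches; groups from the other one contribute nothing.
        return NUMBER.matcher(s).replaceAll("$1$2$3");
    }

    public static void main(String[] args) {
        String[] samples = {"-000", "+0000", "-0005", "0003", "-87", "12a"};
        for (String sample : samples) {
            System.out.printf("%6s --> %s%n", sample, strip(sample));
        }
    }
}
```

Inputs that are not plain signed integers, such as 12a, do not match and are returned unchanged.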
If you're working on a large number of strings (> 10000) you might want to use some specialized plain parsing without regex. Consider something like this, which on my machine is about 10x faster than the regex approach with reused pattern:
public static String stripLeadingZeros(String s) {
//nothing to do, return the string as is
if( s == null || s.isEmpty() ) {
return s;
}
char[] chars = s.toCharArray();
int usedChars = 0;
//check if the first character is the sign
boolean hasSign = false;
if(chars[0] == '-' || chars[0] == '+') {
hasSign = true;
usedChars++;
//special case: just a sign
if(chars.length == 1) {
return s;
}
}
//process the rest of the characters
boolean stripZeros = true;
for( int i = usedChars; i < chars.length; i++) {
//not a digit, this isn't a simple integer, stop processing and keep the original string
if( chars[i] < '0' || chars[i] > '9') {
return s;
}
//are we still in zero-stripping mode
if( stripZeros) {
if( chars[i] == '0') {
continue; //check next char
}
//we've found a non-zero char, keep it and end zero-stripping mode
if(chars[i] >= '1' && chars[i] <= '9') {
stripZeros = false;
}
}
//since we are ignoring leading zeros, we just move all digits of the actual number to the left
chars[usedChars++] = chars[i];
}
//handle special case of number 0 (with optional sign)
if( usedChars == (hasSign ? 1 : 0)) {
chars[0] = '0';
usedChars = 1;
}
return new String(chars,0, usedChars);
}
The output should look like this:
Hans4444müller ---> HansIVmüller
Mary555kren ---> MaryVkren
First, I tried to match all repeated digits in a word with this regex:
(\d)\1+ // and replace that with $1
After getting the repeated digit, such as 4, I tried to change it to IV, but unfortunately I couldn't find the correct regex for this.
My idea for the algorithm is: if there is a repeating digit, replace it with its Roman numeral form.
Is there any way to do this with regex?
I don't know Java very well, but I do know regular expressions, C# and JavaScript. I am confident you can adapt one of my techniques to Java.
I have sample code with two different techniques.
The first invokes a function on every match to perform the replacement.
The second iterates over the matches found by your regular expression, converts each match into Roman numerals, then injects the result into your original text.
The link below illustrates technique 1 using DotNetFiddle. The replacement function takes a method name, and that method is invoked for every match. This technique requires very little code.
https://dotnetfiddle.net/o9gG28. If you're lucky, Java has a similar technique available.
Technique 2: a javascript version that loops through every match found by the regex:
https://jsfiddle.net/ActualRandy/rxnzoc3u/81/. The method does some string concatenation using the replacement value.
Here's some code for method 2 using .NET syntax; Java should be similar. The key methods are 'Match' and 'NextMatch'. Match uses your regex to get the first match.
private void btnRegexRep_Click(object sender, RoutedEventArgs e) {
string fixThis = @"Hans4444müller,Mary555kren";
var re = new Regex("\\d+");
string result = "";
int lastIndex = 0;
string lastMatch = "";
//Get the first match using the regular expression:
var m = re.Match(fixThis);
//Keep looping while we can match:
while (m.Success) {
//Get length of text between last match and current match:
int len = m.Index - (lastIndex + lastMatch.Length);
result += fixThis.Substring(lastIndex + lastMatch.Length, len) + GetRomanText(m);
//Save values for next iteration:
lastIndex = m.Index;
lastMatch = m.Value;
m = m.NextMatch();
}
//Append text after last match (or the whole text if nothing matched):
result += fixThis.Substring(lastIndex + lastMatch.Length);
Console.WriteLine(result);
}
private string GetRomanText(Match m) {
// Index equals the digit value, so "4444" maps to "IV"; 0 has no Roman numeral:
string[] roman = new[] { "", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX" };
string result = "";
// Get ASCII value of first digit from the match (remember, 48= ascii 0, 57=ascii 9):
char c = m.Value[0];
if (c >= 48 && c <= 57) {
int index = c - 48;
result = roman[index];
}
return result;
}
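For Java, a comparable approach might use Matcher.appendReplacement and appendTail, which play the role of the manual concatenation above. This is a sketch under a few assumptions: it uses the asker's (\d)\1+ pattern, so only repeated digits are converted; a run of zeros is collapsed to a single "0" because zero has no Roman numeral; and the class and method names are invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedDigitsToRoman {
    // A digit followed by one or more repetitions of the same digit.
    private static final Pattern REPEATED_DIGIT = Pattern.compile("(\\d)\\1+");
    // Index equals the digit value; "0" is kept as-is (assumption).
    private static final String[] ROMAN =
            {"0", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"};

    static String convert(String text) {
        Matcher m = REPEATED_DIGIT.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int digit = m.group(1).charAt(0) - '0';
            // appendReplacement copies the text before the match, then the replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(ROMAN[digit]));
        }
        m.appendTail(out); // copies whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00fc is the letter u with umlaut, escaped to stay encoding-independent.
        System.out.println(convert("Hans4444m\u00fcller"));
        System.out.println(convert("Mary555kren"));
    }
}
```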
I'm looking for a regular expression that detects repeating symbols in a String. So far I haven't found a solution that fits all my requirements.
Requirements are pretty simple:
detect any repeating symbol in a String;
be able to configure the repeat count (e.g. more than twice)
Examples of required detection (of symbol 'a', more than 2 times, true if detected, false otherwise)
"Abcdefg" - false
"AbcdaBCD" - false
"abcd_ab_ab" - true (symbol 'a' used three times)
"aabbaabb" - true (symbol 'a' used four times)
Since I'm not a pro with regexes, a code snippet and explanation would be appreciated!
Thanks!
I think that
(.).*\1
would work:
(.) match a single character and capture
.* match any intervening characters
\1 match the captured group again.
(You'd need to compile with the DOTALL flag, or replace . with [\s\S] or similar if the string contains characters not ordinarily matched by .)
and if you want to require that it is found at least 3 times, just group the last two parts and add a quantifier:
(.)(.*\1){2}
etc.
This is going to be pretty inefficient, though, because it's going to have to do the "search for the next matching character" between every character in the string and the end of the string, making it at least quadratic.
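In Java, the quantified pattern could be wrapped like this. It is a sketch with invented names, where minOccurrences is the total number of times a symbol must appear, so 3 corresponds to "more than twice".

```java
import java.util.regex.Pattern;

final class RepeatedSymbolRegex {
    // True if some char occurs at least minOccurrences times in s.
    static boolean hasSymbolRepeated(String s, int minOccurrences) {
        if (minOccurrences < 2) {
            throw new IllegalArgumentException("minOccurrences must be at least 2");
        }
        // (.) captures a char; each (?:.*\1) requires one further occurrence of it.
        String regex = "(.)(?:.*\\1){" + (minOccurrences - 1) + "}";
        // DOTALL lets . match line terminators as well.
        return Pattern.compile(regex, Pattern.DOTALL).matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasSymbolRepeated("AbcdaBCD", 3));
        System.out.println(hasSymbolRepeated("abcd_ab_ab", 3));
    }
}
```

If the threshold is fixed, compiling the pattern once and reusing it avoids recompilation on every call.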
You might be as well off not using regular expressions, e.g.
char[] cs = str.toCharArray();
Arrays.sort(cs);
int n = numOccurrencesRequired - 1;
for (int i = n; i < cs.length; ++i) {
boolean allSame = true;
for (int j = 1; j <= n && allSame; ++j) {
allSame = cs[i] == cs[i - j];
}
if (allSame) return true;
}
return false;
This sorts all of the same characters together, allowing you just to pass over the string once looking for adjacent equal characters.
Note that this doesn't quite work for any symbol: it will split up multi-char codepoints like 🍕. You can adapt the code above to work with codepoints, rather than chars.
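One way to adapt the sorting idea to code points is sketched below, using String.codePoints(). The class and method names are made up. Because the array is sorted, comparing an element with the one n positions earlier is enough to know that everything in between is equal as well.

```java
final class RepeatedCodePoints {
    // True if some code point occurs at least numOccurrencesRequired (>= 1) times.
    static boolean hasRepeated(String str, int numOccurrencesRequired) {
        int[] cps = str.codePoints().sorted().toArray();
        int n = numOccurrencesRequired - 1;
        for (int i = n; i < cps.length; i++) {
            // Sorted input: equal endpoints imply a run of n + 1 equal code points.
            if (cps[i] == cps[i - n]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Three different emoji that share the same high surrogate (U+D83C).
        String threeEmoji = "\uD83C\uDF55\uD83C\uDF54\uD83C\uDF5F";
        System.out.println(hasRepeated(threeEmoji, 3));   // no code point repeats
        System.out.println(hasRepeated("abcd_ab_ab", 3)); // 'a' occurs three times
    }
}
```

A char-based scan would report the three emoji as a repeat, because their high surrogates are identical; working on code points avoids that.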
Try this regex: (.)(?:.*\1)
It matches any character (.) followed by anything (.*) and then the same character again (\1). To require more repeats, add {n,} at the end, with n being the number of additional occurrences you want to check for.
Yes, such a regex exists, but only because the set of characters is finite.
regex: .*(a.*a|b.*b|c.*c|...|y.*y|z.*z).*
It makes little sense in practice, though. Use another approach:
String string = "something";
// One counter per possible char value, so any character can be used as an index
int[] count = new int[Character.MAX_VALUE + 1];
for (int i = 0; i < string.length(); i++) {
    char c = string.charAt(i);
    count[c]++;
}
Now you have all characters counted and you can use them as you wish.
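If only a yes/no answer is needed, the counting can stop as soon as a threshold is reached. The following sketch uses a HashMap instead of a fixed array; that is a substitution made for this example, as are the names.

```java
import java.util.HashMap;
import java.util.Map;

final class SymbolCounter {
    // True as soon as any char has been seen minOccurrences times.
    static boolean anyAtLeast(String s, int minOccurrences) {
        Map<Character, Integer> counts = new HashMap<>();
        for (int i = 0; i < s.length(); i++) {
            // merge inserts 1 for a new char, otherwise adds 1; it returns the new count.
            int seen = counts.merge(s.charAt(i), 1, Integer::sum);
            if (seen >= minOccurrences) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(anyAtLeast("aabbaabb", 3));
        System.out.println(anyAtLeast("Abcdefg", 3));
    }
}
```

This is a single pass over the string, in contrast to the backtracking regex.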
I have a string and I want to get the first comma, space, or period in it.
int word = title.indexOf(" ", idx);
This gets the first space. How can I make it get the first of space, comma, or period?
I tried using ||, but it didn't work.
ex.
int word = title.indexOf(" " || "," || ".", idx);
This gets the index of the first occurrence of a space, comma or dot, or -1 if none of them could be found:
Pattern pattern = Pattern.compile("[ ,\\.]");
Matcher matcher = pattern.matcher(title);
int index = matcher.find() ? matcher.start() : -1;
Note that you can pre-compile the pattern and reuse it as often as you like.
See also http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Note also that if you want to break a text into single words, you can/should use a BreakIterator instead!
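A short sketch of the BreakIterator route follows, assuming a "word" is any segment that starts with a letter or digit. The class and method names are invented, and the locale is passed in by the caller.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordSplitter {
    static List<String> words(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        List<String> words = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            // Segments made of spaces or punctuation are boundaries too; skip them.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                words.add(segment);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Un beau JOUR.", Locale.FRENCH));
    }
}
```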
What you're doing isn't valid Java syntax. Use the indexOf() method with a space, comma and period, then determine the smallest of these 3 values.
int a = title.indexOf(" ", idx);
int b = title.indexOf(",", idx);
int c = title.indexOf(".", idx);
Then just determine which is the smallest, ignoring any -1 results (indexOf returns -1 when the string is not found).
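The "smallest index" step might be written as follows. This is a sketch: the helper name firstOf and the varargs parameter are choices made for this example.

```java
final class FirstDelimiter {
    // Smallest index at or after fromIndex where any delimiter occurs, or -1.
    static int firstOf(String title, int fromIndex, String... delimiters) {
        int best = -1;
        for (String delimiter : delimiters) {
            int idx = title.indexOf(delimiter, fromIndex);
            // Skip "not found" results, otherwise -1 would always win the comparison.
            if (idx != -1 && (best == -1 || idx < best)) {
                best = idx;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String title = "Hello, world.";
        System.out.println(firstOf(title, 0, " ", ",", "."));
    }
}
```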
A faster way would be to write your own method. Behind the scenes, indexOf will just loop over all the characters. You can do that yourself manually
public static int findFirstOccurrence(String s) {
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == ',' || c == '.' || c == ' ') {
return i;
}
}
return -1;
}
Unfortunately, you can't pass an array of characters to indexOf. Instead, you need to call indexOf three times, or you can match a regex. The code you provided is invalid Java syntax: || is the conditional OR operator, which works on boolean operands, like
if(x || y )