I have a string which contains a contiguous chunk of digits and then a contiguous chunk of characters. I need to split them into two parts (one integer part, and one string).
I tried using String.split("\\D", 1), but it is eating up first character.
I checked all the String API and didn't find a suitable method.
Is there any method for doing this thing?
Use lookarounds: str.split("(?<=\\d)(?=\\D)")
String[] parts = "123XYZ".split("(?<=\\d)(?=\\D)");
System.out.println(parts[0] + "-" + parts[1]);
// prints "123-XYZ"
\d is the character class for digits; \D is its negation. So this zero-matching assertion matches the position where the preceding character is a digit (?<=\d), and the following character is a non-digit (?=\D).
References
regular-expressions.info/Lookarounds and Character Class
Related questions
Java split is eating my characters.
Is there a way to split strings with String.split() and include the delimiters?
Alternate solution using limited split
The following also works:
String[] parts = "123XYZ".split("(?=\\D)", 2);
System.out.println(parts[0] + "-" + parts[1]);
This splits just before we see a non-digit. This is much closer to your original solution, except that since it doesn't actually match the non-digit character, it doesn't "eat it up". Also, it uses limit of 2, which is really what you want here.
API links
String.split(String regex, int limit)
If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter.
There's always an old-fashioned way:
private String[] split(String in) {
int indexOfFirstChar = 0;
for (char c : in.toCharArray()) {
if (Character.isDigit(c)) {
indexOfFirstChar++;
} else {
break;
}
}
return new String[]{in.substring(0,indexOfFirstChar), in.substring(indexOfFirstChar)};
}
(hope it works with digit-only or char-only Strings too - can't test it here - if not, take it as a general idea)
Related
I have a string consisting of 18 digits Eg. 'abcdefghijklmnopqr'. I need to add a blank space after 5th character and then after 9th character and after 15th character making it look like 'abcde fghi jklmno pqr'. Can I achieve this using regular expression?
As regular expressions are not my cup of tea hence need help from regex gurus out here. Any help is appreciated.
Thanks in advance
Regex finds a match in a string and can't preform a replacement. You could however use regex to find a certain matching substring and replace that, but you would still need a separate method for replacement (making it a two step algorithm).
Since you're not looking for a pattern in your string, but rather just the n-th char, regex wouldn't be of much use, it would make it unnecessary complex.
Here are some ideas on how you could implement a solution:
Use an array of characters to avoid creating redundant strings: create a character array and copy characters from the string before
the given position, put the character at the position, copy the rest
of the characters from the String,... continue until you reach the end
of the string. After that construct the final string from that
array.
Use Substring() method: concatenate substring of the string before
the position, new character, substring of the string after the
position and before the next position,... and so on, until reaching the end of the original string.
Use a StringBuilder and its insert() method.
Note that:
First idea listed might not be a suitable solution for very large strings. It needs an auxiliary array, using additional space.
Second idea creates redundant strings. Strings are immutable and final in Java, and are stored in a pool. Creating
temporary strings should be avoided.
Yes you can use regex groups to achieve that. Something like that:
final Pattern pattern = Pattern.compile("([a-z]{5})([a-z]{4})([a-z]{6})([a-z]{3})");
final Matcher matcher = pattern.matcher("abcdefghijklmnopqr");
if (matcher.matches()) {
String first = matcher.group(0);
String second = matcher.group(1);
String third = matcher.group(2);
String fourth = matcher.group(3);
return first + " " + second + " " + third + " " + fourth;
} else {
throw new SomeException();
}
Note that pattern should be a constant, I used a local variable here to make it easier to read.
Compared to substrings, which would also work to achieve the desired result, regex also allow you to validate the format of your input data. In the provided example you check that it's a 18 characters long string composed of only lowercase letters.
If you had a more interesting examples, with for example a mix of letters and digits, you could check that each group contains the correct type of data with the regex.
You can also do a simpler version where you just replace with:
"abcdefghijklmnopqr".replaceAll("([a-z]{5})([a-z]{4})([a-z]{6})([a-z]{3})", "$1 $2 $3 $4")
But you don't have the benefit of checking because if the string doesn't match the format it will just not replaced and this is less efficient than substrings.
Here is an example solution using substrings which would be more efficient if you don't care about checking:
final Set<Integer> breaks = Set.of(5, 9, 15);
final String str = "abcdefghijklmnopqr";
final StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
if (breaks.contains(i)) {
stringBuilder.append(' ');
}
stringBuilder.append(str.charAt(i));
}
return stringBuilder.toString();
How to split a string into equal parts of maximum character length while maintaining word boundaries?
Say, for example, if I want to split a string "hello world" into equal substrings of maximum 7 characters it should return me
"hello "
and
"world"
But my current implementation returns
"hello w"
and
"orld "
I am using the following code taken from Split string to equal length substrings in Java to split the input string into equal parts
public static List<String> splitEqually(String text, int size) {
// Give the list the right capacity to start with. You could use an array
// instead if you wanted.
List<String> ret = new ArrayList<String>((text.length() + size - 1) / size);
for (int start = 0; start < text.length(); start += size) {
ret.add(text.substring(start, Math.min(text.length(), start + size)));
}
return ret;
}
Will it be possible to maintain word boundaries while splitting the string into substring?
To be more specific I need the string splitting algorithm to take into account the word boundary provided by spaces and not solely rely on character length while splitting the string although that also needs to be taken into account but more like a max range of characters rather than a hardcoded length of characters.
If I understand your problem correctly then this code should do what you need (but it assumes that maxLenght is equal or greater than longest word)
String data = "Hello there, my name is not importnant right now."
+ " I am just simple sentecne used to test few things.";
int maxLenght = 10;
Pattern p = Pattern.compile("\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)", Pattern.DOTALL);
Matcher m = p.matcher(data);
while (m.find())
System.out.println(m.group(1));
Output:
Hello
there, my
name is
not
importnant
right now.
I am just
simple
sentecne
used to
test few
things.
Short (or not) explanation of "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:
(lets just remember that in Java \ is not only special in regex, but also in String literals, so to use predefined character sets like \d we need to write it as "\\d" because we needed to escape that \ also in string literal)
\G - is anchor representing end of previously founded match, or if there is no match yet (when we just started searching) beginning of string (same as ^ does)
\s* - represents zero or more whitespaces (\s represents whitespace, * "zero-or-more" quantifier)
(.{1,"+maxLenght+"}) - lets split it in more parts (at runtime :maxLenght will hold some numeric value like 10 so regex will see it as .{1,10})
. represents any character (actually by default it may represent any character except line separators like \n or \r, but thanks to Pattern.DOTALL flag it can now represent any character - you may get rid of this method argument if you want to start splitting each sentence separately since its start will be printed in new line anyway)
{1,10} - this is quantifier which lets previously described element appear 1 to 10 times (by default will try to find maximal amout of matching repetitions),
.{1,10} - so based on what we said just now, it simply represents "1 to 10 of any characters"
( ) - parenthesis create groups, structures which allow us to hold specific parts of match (here we added parenthesis after \\s* because we will want to use only part after whitespaces)
(?=\\s|$) - is look-ahead mechanism which will make sure that text matched by .{1,10} will have after it:
space (\\s)
OR (written as |)
end of the string $ after it.
So thanks to .{1,10} we can match up to 10 characters. But with (?=\\s|$) after it we require that last character matched by .{1,10} is not part of unfinished word (there must be space or end of string after it).
Non-regex solution, just in case someone is more comfortable (?) not using regular expressions:
private String justify(String s, int limit) {
StringBuilder justifiedText = new StringBuilder();
StringBuilder justifiedLine = new StringBuilder();
String[] words = s.split(" ");
for (int i = 0; i < words.length; i++) {
justifiedLine.append(words[i]).append(" ");
if (i+1 == words.length || justifiedLine.length() + words[i+1].length() > limit) {
justifiedLine.deleteCharAt(justifiedLine.length() - 1);
justifiedText.append(justifiedLine.toString()).append(System.lineSeparator());
justifiedLine = new StringBuilder();
}
}
return justifiedText.toString();
}
Test:
String text = "Long sentence with spaces, and punctuation too. And supercalifragilisticexpialidocious words. No carriage returns, tho -- since it would seem weird to count the words in a new line as part of the previous paragraph's length.";
System.out.println(justify(text, 15));
Output:
Long sentence
with spaces,
and punctuation
too. And
supercalifragilisticexpialidocious
words. No
carriage
returns, tho --
since it would
seem weird to
count the words
in a new line
as part of the
previous
paragraph's
length.
It takes into account words that are longer than the set limit, so it doesn't skip them (unlike the regex version which just stops processing when it finds supercalifragilisticexpialidosus).
PS: The comment about all input words being expected to be shorter than the set limit, was made after I came up with this solution ;)
I need to split a string by semicolon ignoring the semicolons that may come as HTML characters.
For instance, given the string:
id=com.google.android;keywords=Android;Operating System;Phone;versions=Gingerbread;ICS;JB
I need to split it into:
id = com.google.android
keywords=Android;Operating System;Phone
versions=Gingerbread;ICS;JB
any ideia how to do this?
A regex like (?<!&#?[0-9a-zA-Z]+); would probably do it. This would prevent matching a semicolon that terminates an entity reference or character reference, though it also catches a few cases that are not technically either by the specs (e.g. it wouldn't match the semicolon at the end of &#foo; or &123;).
(?<!...) is a "negative lookbehind", so you can read this regex as matching a semicolon that is not preceded by a substring that matches &#?[0-9a-zA-Z]+ (i.e. ampersand, optional hash, and one or more alphanumerics). However lookbehinds must have an upper bound on the number of characters they can match, which + doesn't, so you'll have to use a bounded repetition count, like {1,5} rather than the unbounded +. The upper bound needs to be at least as long as the longest entity reference you might see, and if your data might contain arbitrary entity references then you'll have to use something like the length of the string as the upper bound.
String[] keyValuePairs = theString.split(
"(?<!&#?[0-9a-zA-Z]{1," + theString.length() + "});");
If you can specify a smaller bound then that would probably be more efficient.
Edit: Android apparently doesn't like this lookbehind, even with bounded repetition, so you probably won't be able to use a single regex with String.split to do what you're after, you'll have to do the looping yourself, e.g.
Pattern p = Pattern.compile("(?:&#?[0-9a-zA-Z]+)?;");
Matcher m = p.matcher(theString);
List<String> splits = new ArrayList<String>();
int lastEltStart = 0;
while(m.find()) {
if(m.end() - m.start() > 1) {
// this match was an entity/character reference so don't split here
continue;
}
if(m.start() > lastEltStart) {
// non-empty part
splits.add(theString.substring(lastEltStart, m.start()));
}
lastEltStart = m.end();
}
if(lastEltStart < theString.length()) {
// non-empty final part
splits.add(theString.substring(lastEltStart));
}
Since the HTML entites have only two or three numbers between the '&#' and ';' I used the following regex:
(?<!&#\d{2,3});
I'm new to regular expressions, and was wondering how I could get only the first number in a string like 100 2011-10-20 14:28:55. In this case, I'd want it to return 100, but the number could also be shorter or longer.
I was thinking about something like [0-9]+, but it takes every single number separately (100,2001,10,...)
Thank you.
/^[^\d]*(\d+)/
This will start at the beginning, skip any non-digits, and match the first sequence of digits it finds
EDIT:
this Regex will match the first group of numbers, but, as pointed out in other answers, parseInt is a better solution if you know the number is at the beginning of the string
Try this to match for first number in string (which can be not at the beginning of the string):
String s = "2011-10-20 525 14:28:55 10";
Pattern p = Pattern.compile("(^|\\s)([0-9]+)($|\\s)");
Matcher m = p.matcher(s);
if (m.find()) {
System.out.println(m.group(2));
}
Just
([0-9]+) .*
If you always have the space after the first number, this will work
Assuming there's always a space between the first two numbers, then
preg_match('/^(\d+)/', $number_string, $matches);
$number = $matches[1]; // 100
But for something like this, you'd be better off using simple string operations:
$space_pos = strpos($number_string, ' ');
$number = substr($number_string, 0, $space_pos);
Regexs are computationally expensive, and should be avoided if possible.
the below code would do the trick.
Integer num = Integer.parseInt("100 2011-10-20 14:28:55");
[0-9] means the numbers 0-9 can be used the + means 1 or more times. if you use [0-9]{3} will get you 3 numbers
Try ^(?'num'[0-9]+).*$ which forces it to start at the beginning, read a number, store it to 'num' and consume the remainder without binding.
This string extension works perfectly, even when string not starts with number.
return 1234 in each case - "1234asdfwewf", "%sdfsr1234" "## # 1234"
public static string GetFirstNumber(this string source)
{
if (string.IsNullOrEmpty(source) == false)
{
// take non digits from string start
string notNumber = new string(source.TakeWhile(c => Char.IsDigit(c) == false).ToArray());
if (string.IsNullOrEmpty(notNumber) == false)
{
//replace non digit chars from string start
source = source.Replace(notNumber, string.Empty);
}
//take digits from string start
source = new string(source.TakeWhile(char.IsDigit).ToArray());
}
return source;
}
NOTE: In Java, when you define the patterns as string literals, do not forget to use double backslashes to define a regex escaping backslash (\. = "\\.").
To get the number that appears at the start or beginning of a string you may consider using
^[0-9]*\.?[0-9]+ # Float or integer, leading digit may be missing (e.g, .35)
^-?[0-9]*\.?[0-9]+ # Optional - before number (e.g. -.55, -100)
^[-+]?[0-9]*\.?[0-9]+ # Optional + or - before number (e.g. -3.5, +30)
See this regex demo.
If you want to also match numbers with scientific notation at the start of the string, use
^[0-9]*\.?[0-9]+([eE][+-]?[0-9]+)? # Just number
^-?[0-9]*\.?[0-9]+([eE][+-]?[0-9]+)? # Number with an optional -
^[-+]?[0-9]*\.?[0-9]+([eE][+-]?[0-9]+)? # Number with an optional - or +
See this regex demo.
To make sure there is no other digit on the right, add a \b word boundary, or a (?!\d)
or (?!\.?\d) negative lookahead that will fail the match if there is any digit (or . and a digit) on the right.
public static void main(String []args){
Scanner s=new Scanner(System.in);
String str=s.nextLine();
Pattern p=Pattern.compile("[0-9]+");
Matcher m=p.matcher(str);
while(m.find()){
System.out.println(m.group()+" ");
}
\d+
\d stands for any decimal while + extends it to any other decimal coming directly after, until there is a non number character like a space or letter
Java's string split(regex) function splits at all instances of the regex. Python's partition function only splits at the first instance of the given separator, and returns a tuple of {left,separator,right}.
How do I achieve what partition does in Java?
e.g.
"foo bar hello world".partition(" ")
should become
"foo", " ", "bar hello world"
Is there an external library which
provides this utility already?
how would I achieve it without
an external library?
And can it be achieved without an external library and without Regex?
NB. I'm not looking for split(" ",2) as it doesn't return the separator character.
The String.split(String regex, int limit) is close to what you want. From the documentation:
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array.
If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter.
If n is non-positive then the pattern will be applied as many times as possible and the array can have any length.
If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
Here's an example to show these differences (as seen on ideone.com):
static void dump(String[] ss) {
for (String s: ss) {
System.out.print("[" + s + "]");
}
System.out.println();
}
public static void main(String[] args) {
String text = "a-b-c-d---";
dump(text.split("-"));
// prints "[a][b][c][d]"
dump(text.split("-", 2));
// prints "[a][b-c-d---]"
dump(text.split("-", -1));
// [a][b][c][d][][][]
}
A partition that keeps the delimiter
If you need a similar functionality to the partition, and you also want to get the delimiter string that was matched by an arbitrary pattern, you can use Matcher, then taking substring at appropriate indices.
Here's an example (as seen on ideone.com):
static String[] partition(String s, String regex) {
Matcher m = Pattern.compile(regex).matcher(s);
if (m.find()) {
return new String[] {
s.substring(0, m.start()),
m.group(),
s.substring(m.end()),
};
} else {
throw new NoSuchElementException("Can't partition!");
}
}
public static void main(String[] args) {
dump(partition("james007bond111", "\\d+"));
// prints "[james][007][bond111]"
}
The regex \d+ of course is any digit character (\d) repeated one-or-more times (+).
While not exactly what you want, there's a second version of split which takes a "limit" parameter, telling it the maximum number of partitions to split the string into.
So if you called (in Java):
"foo bar hello world".split(" ", 2);
You'd get the array:
["foo", "bar hello world"]
which is more or less what you want, except for the fact that the separator character isn't embedded at index 1. If you really need this last point, you'd need to do it yourself, but hopefully all you specifically wanted was the ability to limit the number of splits.
How about this:
String partition(String string, String separator) {
String[] parts = string.split(separator, 2);
return new String[] {parts[0], separator, parts[1]};
}
BTW, you have to add some input/result checks at this :)
Use:
"foo bar hello world".split(" ",2)
By default the delimiter is whitespace
Is there an external library which provides this utility already?
None that I know of.
how would I achieve it without an external library?
And can it be achieved without an external library and without Regex?
Sure, that's no problem at all; just use String.indexOf() and String.substring(). However, Java does not have tuple datatype, so you'll have to return an array, List or write your own result class.