Java - regular expression fixed length split into array

I've seen an example once before, but cannot find it again, of how to split a fixed-length data stream into an array using regular expressions. Is this actually possible and, if so, does anyone have a basic example?
08/14 1351 XMGYV4 AOUSC LTC .000 .000 VDPJU01PMP 11AUG14:15:17:05.99
I want to store each value as a separate element of an array without using substring.

The problem in this case is that the columns do not all share one fixed width.
Hence each column's width has to be listed individually in the pattern.
String s = " 08/14 1351 XMGYV4 "
+ "AOUSC LTC .000 .000 "
+ "VDPJU01PMP 11AUG14:15:17:05.99 ";
Pattern pattern = Pattern.compile("(.{7,7})(.{11,11})(.)(.{12,12})(.{18,18})(.*)");
Matcher m = pattern.matcher(s);
if (m.matches()) {
for (int i = 1; i <= m.groupCount(); ++i) {
String g = m.group(i);
System.out.printf("[%d] %s%n", i, g);
}
}
This is a sequence of groups such as (.{7,7}), each matching exactly 7 characters (minimum and maximum both 7); (.{7}) is an equivalent shorter form.
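The same idea can be generated from a list of widths instead of being typed by hand. The following is only a sketch under stated assumptions: the helper name splitFixed and the widths 6, 6, 6 are invented for illustration and are not the real column widths of the record in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedWidthSplit {
    // Builds "(.{w1})(.{w2})..." from the given widths and returns the captured fields.
    // Returns null when the line length differs from the sum of the widths
    // (hypothetical helper, not part of any library).
    static String[] splitFixed(String line, int... widths) {
        StringBuilder regex = new StringBuilder();
        for (int w : widths) {
            regex.append("(.{").append(w).append("})");
        }
        Matcher m = Pattern.compile(regex.toString()).matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[widths.length];
        for (int i = 0; i < widths.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Assumed sample: three columns of 6 characters each, padded with spaces.
        for (String field : splitFixed("08/14 1351  XMGYV4", 6, 6, 6)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

The fields keep their padding, so a trim() per field may be wanted afterwards.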

You need to split on a regular expression that matches one or more whitespace characters, i.e. \s+ (written "\\s+" in a Java string literal).
String input = " 08/14 1351 XMGYV4 AOUSC LTC .000 .000 VDPJU01PMP 11AUG14:15:17:05.99 ";
String[] split = input.split("\\s+");
System.out.println(Arrays.toString(split));
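One detail worth checking with this approach: when the input starts with whitespace, split("\\s+") returns an empty string as the first element. A minimal sketch that trims first (the helper name fields is made up for this example):

```java
import java.util.Arrays;

class WhitespaceSplit {
    // Trims first so that leading whitespace does not produce an empty first element.
    static String[] fields(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String input = " 08/14 1351 XMGYV4 AOUSC LTC .000 .000 VDPJU01PMP 11AUG14:15:17:05.99 ";
        // Without trim(): 10 elements, the first one is the empty string.
        System.out.println(Arrays.toString(input.split("\\s+")));
        // With trim(): the 9 actual fields.
        System.out.println(Arrays.toString(fields(input)));
    }
}
```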

Perhaps consider Krayo's solution String[] array = s.split( "\\s+" );?

Related

Regex pattern to convert comma separated String

Changing string with comma separated values to numbered new-line values
For example:
Input: a,b,c
Output:
1.a
2.b
3.c
I'm finding it hard to do this with a regex pattern instead of converting the string to a string array and looping through it.
I'm not really sure that it's possible to achieve this with only a regex and without any kind of loop. As for me, splitting the string into an array and iterating over it is the most straightforward solution:
String value = "a,b,c";
String[] values = value.split(",");
String result = "";
for (int i=1; i<=values.length; i++) {
result += i + "." + values[i-1] + "\n";
}
Sure, it's possible to do this without splitting or any arrays, but the solution is a little awkward, for example:
String value = "a,b,c";
Pattern pattern = Pattern.compile("(?m)^\\w+");
Matcher matcher = pattern.matcher(value.replaceAll(",", "\n"));
StringBuffer s = new StringBuffer();
int i = 0;
while (matcher.find()) {
    // prefix each matched value with its running number
    matcher.appendReplacement(s, ++i + "." + matcher.group());
}
matcher.appendTail(s);
System.out.println(s.toString());
Here each , is replaced with a \n newline, and then we look for a run of word characters at the start of every line with (?m)^\\w+ (the (?m) flag makes ^ match at the start of each line). Whenever such a run is found, we prepend a line number to it. But even here we have to use a loop to set the line number, and this logic is not as clear as the first version.
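If the goal is mainly to avoid writing the loop by hand, the split-based version can also be expressed with streams. This is a sketch, not a regex-only solution; the class and method names are invented for the example, and it assumes Java 8 or later.

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class NumberedLines {
    // Turns "a,b,c" into "1.a\n2.b\n3.c" by pairing each index with its value.
    static String number(String csv) {
        String[] values = csv.split(",");
        return IntStream.range(0, values.length)
                .mapToObj(i -> (i + 1) + "." + values[i])
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        System.out.println(number("a,b,c"));
    }
}
```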

Parsing a string with [3:0] substring in it

I want to store two numbers from a string into two distinct variables - for example, var1 = 3 and var2 = 0 from "[3:0]". I have the following code snippet:
String myStr = "[3:0]";
if (myStr.trim().matches("\\[(\\d+)\\]")) {
// Do something.
// If execution gets here, I want to store 3 and 0 in separate variables or an array
}
Is it possible to do this with split and regular expressions?
Don't call trim(). Enhance your regex instead.
Your regex is missing the pattern for : and the second number, and you don't need to escape the ].
To capture the matched numbers, you need the Matcher:
String myStr = " [3:0] ";
Matcher m = Pattern.compile("\\s*\\[(\\d+):(\\d+)]\\s*").matcher(myStr);
if (m.matches())
System.out.println(m.group(1) + ", " + m.group(2));
Output
3, 0
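Wrapped as a reusable method, the same pattern might look like the sketch below. The names BracketPair and parse are assumptions made for the example; returning null for non-matching input is one possible design choice among several.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketPair {
    // Same regex as above: optional whitespace, "[", digits, ":", digits, "]".
    private static final Pattern PAIR = Pattern.compile("\\s*\\[(\\d+):(\\d+)]\\s*");

    // Returns {first, second}, or null when the text is not of the form "[n:m]".
    // Note: Integer.parseInt still throws for digit runs too large for an int.
    static int[] parse(String text) {
        Matcher m = PAIR.matcher(text);
        if (!m.matches()) {
            return null;
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    public static void main(String[] args) {
        int[] pair = parse(" [3:0] ");
        System.out.println(pair[0] + ", " + pair[1]);
    }
}
```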
You can use replaceAll and split
String myStr = "[3:0]";
if(myStr.trim().matches("\\[\\d+:\\d+\\]")) {
String[] numbers = myStr.replaceAll("[\\[\\]]","").split(":");
}
Moreover, your regex to match the string should be \\[\\d+:\\d+\\]. If you want to avoid trim, you can add \\s* at the start and end to match any surrounding spaces. But trim is not bad.
EDIT
As suggested by Andreas in comments,
String myStr = "[3:0]";
String regExp = "\\[(\\d+):(\\d+)\\]";
Pattern pattern = Pattern.compile(regExp);
Matcher matcher = pattern.matcher(myStr.trim());
if(matcher.find()) {
int a = Integer.parseInt(matcher.group(1));
int b = Integer.parseInt(matcher.group(2));
System.out.println(a + " : " + b);
}
OUTPUT
3 : 0
Without any regular expressions you could do this:
// this will remove the brackets [ and ] and just leave "3:0"
String numberString = myStr.trim().replace("[", "").replace("]", "");
// this will split the string in everything before the : and everything after the : (so two values as an array)
String[] numbers = numberString.split(":");
// get the first value and parse it as a number "3" will become a simple 3
int firstNumber = Integer.parseInt(numbers[0]) ;
// get the second value and parse it from "0" to a plain 0
int secondNumber = Integer.parseInt(numbers[1]);
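Be careful when parsing numbers: depending on your input string and what other forms it might take, Integer.parseInt can throw a NumberFormatException. Leading zeros are accepted ("02" parses as 2), but stray spaces or non-digit characters (e.g. "3: 0" or "3:x") are rejected.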
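To see which inputs actually fail, a tiny helper can be run against a few candidates. The name tryParse is invented here. With the standard Integer.parseInt, a leading zero such as "02" parses as 2, while surrounding spaces or non-digit characters raise NumberFormatException.

```java
class ParseIntCheck {
    // Returns the parsed value, or null if Integer.parseInt rejects the text.
    static Integer tryParse(String text) {
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        for (String candidate : new String[] { "12", "02", " 0", "x", "" }) {
            System.out.println("\"" + candidate + "\" -> " + tryParse(candidate));
        }
    }
}
```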
In case you don't need to validate the input and simply want to get the numbers out of it, you could find indexOf(":") and take the substrings you are interested in, which are:
from just after [ (which is at position 0) up to :
and from just after : up to ] (which is at position length of string - 1)
Your code can look like
String text = "[3:0]";
int colonIndex = text.indexOf(':');
String first = text.substring(1, colonIndex);
String second = text.substring(colonIndex + 1, text.length() - 1);

How to find all occurrences of a substring (with wildcards allowed) in a given String

I'm searching for an efficient way for a wildcard-enabled search in Java. My first approach was of course to use regex. However this approach does NOT find ALL possible matches!
Here's the code:
public static ArrayList<StringOccurrence> matchesWildcard(String string, String pattern, boolean printToConsole) {
Pattern p = Pattern.compile(normalizeWildcards(pattern));
Matcher m = p.matcher(string);
ArrayList<StringOccurrence> res = new ArrayList<StringOccurrence>();
int count = 0;
while (m.find()){
res.add(new StringOccurrence(m.start(), m.end(), count, m.group()));
if(printToConsole)
System.out.println(count + ") " + m.group() + ", " + m.start() + ", " + m.end());
count +=1;
}
return res;
}
For a query q: ab*b and a String str: abbccabbccbbb I get the output:
0) abb, 0, 3
1) abb, 5, 8
But the whole String should be also a result, because it matches the pattern. It seems that the Java-implementation of regex starts each new search after the last match...
Any ideas how this could work (or suggestions for frameworks...)?
If you really need all possible matches, this answer is not useful for you (though maybe another user will find it useful).
If the widest match would be sufficient for you, then use a greedy quantifier. I guess you're using a reluctant one; showing your pattern would be useful.
Google for greedy vs reluctant quantifiers for regex.
Cheers.
ab*b means "a" followed by zero or more "b" followed by a "b". The minimum match would be "ab". Sounds like you're looking for something like a[a-z]*b, where [a-z]* indicates zero or more of any lowercase letter. You may also want to bound it so that the start of the "word" must be an "a" and the end must be a "b": \ba[a-z]*b\b
You are expecting * to mean .* and .*? at the same time (and more).
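The difference between the two readings of * can be checked directly. In this sketch the regexes ab.*b and ab.*?b are assumed translations of the wildcard query ab*b, since normalizeWildcards is not shown in the question; findAll is a helper written for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Collects every match that successive find() calls report.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "abbccabbccbbb";
        System.out.println(findAll("ab.*b", s));   // greedy: [abbccabbccbbb]
        System.out.println(findAll("ab.*?b", s));  // reluctant: [abb, abb]
    }
}
```

Neither form yields both the short matches and the whole string in one pass.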
You should reconsider what you really need. Let's extend your example:
abbccabbccbbbcabb
Do you really want all possibilities?
To achieve what you want you'll have to
iterate p1 over all occurrences of "ab"
from p1+2 on
iterate p2 over all occurrences of "b"
output substring between p1 and p2+1
This is the corresponding Java code:
public static void main( String[] args ){
String s = "abbccabbccbbb";
int f1 = 0;
int p1;
while( (p1 = s.indexOf( "ab", f1 )) >= 0 ){
int f2 = p1 + 2;
int p2;
while( (p2 = s.indexOf( "b", f2 )) >= 0 ){
System.out.println( s.substring( p1, p2 + 1 ) + " " + p1 + ":" + (p2 + 1) );
f2 = p2 + 1;
}
f1 = p1 + 2;
}
}
Below is the output. You may be surprised - maybe that's more than you expect, but then you'll need to refine your specification.
abb 0:3
abbccab 0:7
abbccabb 0:8
abbccabbccb 0:11
abbccabbccbb 0:12
abbccabbccbbb 0:13
abb 5:8
abbccb 5:11
abbccbb 5:12
abbccbbb 5:13
Later
Why is a single regular expression not capable of doing it?
The basic mechanism of pattern matching is to try and match the regex against a string, starting at some position, initially 0. If a match is found, this position is advanced according to the matched string. The pattern matcher never looks back.
A pattern ab.*?b will try and find the next 'b' after an "ab". This means that no match is possible beginning with the same "ab" and ending at some 'b' following that previously found "next 'b'".
In other words: one regex cannot find overlapping substrings.
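If every overlapping match really is required for an arbitrary pattern, one brute-force option is to test each substring with matches() on a restricted region. This is a sketch only: it makes a quadratic number of match attempts, so it suits short inputs, and ab.*b is again an assumed translation of the wildcard query ab*b.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllSubstringMatches {
    // Tries every substring s[i, j) and keeps those the pattern matches in full.
    static List<String> allMatches(String s, Pattern p) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(s);
        for (int i = 0; i < s.length(); i++) {
            for (int j = i + 1; j <= s.length(); j++) {
                m.region(i, j);          // restrict matching to s[i, j)
                if (m.matches()) {
                    out.add(s.substring(i, j) + " " + i + ":" + j);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String match : allMatches("abbccabbccbbb", Pattern.compile("ab.*b"))) {
            System.out.println(match);
        }
    }
}
```

For this input it reports the same ten substrings as the indexOf-based loop above.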

StringUtils.countMatches words starting with a string?

I'm using StringUtils.countMatches to count word frequencies. Is there a way to search text for words starting with some characters?
Example:
searching for art in "artificial art in my apartment" will return 3! I need it to return 2 for words starting with art only.
My solution was to replace \r and \n in the text with a space and modify the code to be:
text = text.replaceAll("(\r\n|\n)"," ").toLowerCase();
searchWord = " "+searchWord.toLowerCase();
StringUtils.countMatches(text, searchWord);
I also tried the following Regex:
patternString = "\\b(" + searchWord.toLowerCase().trim() + "([a-zA-Z]*))";
pattern = Pattern.compile(patternString);
matcher = pattern.matcher(text.toLowerCase());
Questions:
- Does my first solution make sense, or is there a better way to do this?
- Is my second solution faster? I'm working with large text files and a decent number of search words.
Thanks
text = text.replaceAll("(\r\n|\n)"," ").toLowerCase();
searchWord = searchWord.toLowerCase();
String[] words = text.split(" ");
int count = 0;
for (String word : words) {
    // count every word that begins with the search term
    if (word.startsWith(searchWord)) {
        count++;
    }
}
A plain loop gives the same result.
Use a regular expression to count the words that start with art. The pattern to use is:
\b<search-word>
Here, \b matches a word boundary. Of course, the \b needs to be escaped when listed in the pattern string. Below is an example:
String input = "artificial art in my apartment";
Matcher matcher = Pattern.compile("\\bart").matcher(input);
int count = 0;
while (matcher.find()) {
count++;
}
System.out.println(count);
Output: 2
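For an arbitrary search word, the pattern can be built at run time. The sketch below is an illustration with invented names; it assumes the search word begins with a letter, digit or underscore (otherwise \b behaves differently) and uses Pattern.quote so that regex metacharacters in the word are taken literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixWordCount {
    // Counts words in text that start with prefix, ignoring case.
    static int countWordsStartingWith(String text, String prefix) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(prefix), Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWordsStartingWith("artificial art in my apartment", "art"));
    }
}
```

Because \b also treats line breaks as word boundaries, the text does not need its \r and \n replaced beforehand.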

I need to get a substring from a java string Tokenizer

I need to get a substring from a java string tokenizer.
My input string is Pizza-1*Nutella-20*Chicken-65*
StringTokenizer productsTokenizer = new StringTokenizer("Pizza-1*Nutella-20*Chicken-65*", "*");
do
{
try
{
int pos = productsTokenizer .nextToken().indexOf("-");
String product = productsTokenizer .nextToken().substring(0, pos+1);
String count= productsTokenizer .nextToken().substring(pos, pos+1);
System.out.println(product + " " + count);
}
catch(Exception e)
{
}
}
while(productsTokenizer .hasMoreTokens());
My output must be:
Pizza 1
Nutella 20
Chicken 65
I need the product value and the count value in separate variables so that I can insert those values into the database.
I hope you can help me.
You could use String.split() as
String[] products = "Pizza-1*Nutella-20*Chicken-65*".split("\\*");
for (String product : products) {
String[] prodNameCount = product.split("\\-");
System.out.println(prodNameCount[0] + " " + prodNameCount[1]);
}
Output
Pizza 1
Nutella 20
Chicken 65
You invoke the nextToken() method 3 times. That will get you 3 different tokens:
int pos = productsTokenizer .nextToken().indexOf("-");
String product = productsTokenizer .nextToken().substring(0, pos+1);
String count= productsTokenizer .nextToken().substring(pos, pos+1);
Instead you should do something like:
String token = productsTokenizer .nextToken();
int pos = token.indexOf("-");
String product = token.substring(...);
String count= token.substring(...);
I'll let you figure out the proper indexes for the substring() method.
Also instead of using a do/while structure it is better to just use a while loop:
while(productsTokenizer .hasMoreTokens())
{
// add your code here
}
That is, don't assume there is a token.
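Putting both suggestions from this answer together gives something like the following sketch. The substring indexes shown are one reasonable choice, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerParse {
    // Returns one "product count" line per token of the form "name-number".
    static List<String> parse(String input) {
        List<String> lines = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(input, "*");
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();      // call nextToken() once per iteration
            int pos = token.indexOf('-');
            if (pos < 0) {
                continue;                           // skip tokens without a hyphen
            }
            String product = token.substring(0, pos);   // text before the hyphen
            String count = token.substring(pos + 1);    // text after the hyphen
            lines.add(product + " " + count);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : parse("Pizza-1*Nutella-20*Chicken-65*")) {
            System.out.println(line);
        }
    }
}
```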
An alternative answer you may want to use if your input grows:
// find all strings that match START or '*' followed by the name (matched),
// a hyphen and then a positive number (not starting with 0)
Pattern p = Pattern.compile("(?:^|[*])(\\w+)-([1-9]\\d*)");
Matcher finder = p.matcher(products); // products holds the input string
while (finder.find()) {
// possibly check if the new match directly follows the previous one
String product = finder.group(1);
int count = Integer.valueOf(finder.group(2));
System.out.printf("Product: %s , count %d%n", product, count);
}
Some people dislike regex, but this is a good application for it. All you need is "(\\w+)-(\\d+)\\*" as your pattern. Here's a toy example:
String template = "Pizza-1*Nutella-20*Chicken-65*";
String pattern = "(\\w+)-(\\d+)\\*";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(template);
while(m.find())
{
System.out.println(m.group(1) + " " + m.group(2));
}
To explain this a bit more, "(\\w+)-(\\d+)\\*" looks for a (\\w+), which is any run of at least 1 character from [A-Za-z0-9_], followed by a -, followed by a number \\d+, where the + means at least one character in length, followed by a *, which must be escaped. The parentheses capture what's inside of them. There are two sets of capturing parentheses in this regex, so we reference them by group(1) and group(2) as seen in the while loop, which prints:
Pizza 1
Nutella 20
Chicken 65
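Since the question asks for the product and count in separate variables for a database insert, the same pattern can feed a map instead of printing. This is a sketch with invented names; using a LinkedHashMap to keep input order is my assumption about what is wanted, and a repeated product name would overwrite the earlier count.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProductCounts {
    private static final Pattern ITEM = Pattern.compile("(\\w+)-(\\d+)\\*");

    // Maps each product name to its count, in the order the items appear.
    static Map<String, Integer> parse(String input) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = ITEM.matcher(input);
        while (m.find()) {
            counts.put(m.group(1), Integer.parseInt(m.group(2)));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(parse("Pizza-1*Nutella-20*Chicken-65*"));
    }
}
```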
