StringTokenizer for multiple multi-character tokens in Java? - java

I need to split a string on multiple delimiters, some of which are more than one character long, as in the example below:
word1:word2|word3||word4|word5|||word6|word7
I need to tokenize the above string on ':', '|', '||' and '|||'.
Is this possible with StringTokenizer, or else what is the code to tokenize it using a regular-expression split? Note that I also need the delimiter tokens in the resulting array.

You can use StringUtils from the Apache Commons Lang API.
See its Javadocs for details.
It has the following methods:
Substring/Left/Right/Mid - null-safe substring extractions
SubstringBefore/SubstringAfter/SubstringBetween - substring extraction relative to other strings

This is possible with StringTokenizer, but it has to be a multi-step process.
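One way the multi-step process could look, as a minimal sketch: the three-argument StringTokenizer constructor returns each delimiter character as its own token, and a second step glues consecutive '|' characters back together. The class and method names here are made up for illustration, and runs longer than three '|' are simply merged into one token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerWithDelims {
    static List<String> tokenize(String line) {
        List<String> out = new ArrayList<>();
        // Step 1: returnDelims = true makes ':' and '|' come back as one-character tokens.
        StringTokenizer st = new StringTokenizer(line, ":|", true);
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            int last = out.size() - 1;
            // Step 2: append a '|' to the previous token if that token is already a run of '|'.
            if (tok.equals("|") && last >= 0 && out.get(last).startsWith("|")) {
                out.set(last, out.get(last) + tok);
            } else {
                out.add(tok);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("word1:word2|word3||word4|word5|||word6|word7"));
    }
}
```

Words can never start with '|' here because '|' is itself a delimiter, which is what makes the startsWith check safe.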

Obviously, you can split the String like this:
line.split ("[:|]+")
res113: Array[java.lang.String] = Array(word1, word2, word3, word4, word5, word6, word7)
But what were the delimiters? Well - obviously the opposite:
line.split ("[^:|]+")
res114: Array[java.lang.String] = Array("", :, |, ||, |, |||, |)
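The two REPL calls above return words and delimiters in separate arrays. If both are wanted in one list and in their original order, a Matcher loop is one option in plain Java. This is only a sketch: the class name is hypothetical, and the pattern assumes the delimiters are exactly ':' and runs of one to three '|'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterKeepingSplit {
    // A token is a ':', a run of 1 to 3 '|', or a run of non-delimiter characters.
    private static final Pattern TOKEN = Pattern.compile(":|\\|{1,3}|[^:|]+");

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("word1:word2|word3||word4|word5|||word6|word7"));
    }
}
```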

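I don't know whether any API is available for this. You can solve it as below.
The steps are:
1. Take the string.
2. Define the regexes to be replaced (you should know them in advance).
3. Loop over all the expressions.
4. Replace every match with a space.
5. Now you can use StringTokenizer.
String str = "word1:word2|word3||word4|word5|||word6|word7";
// Longest delimiters first, so "|||" is not consumed as three single "|"
String[] tokens = {"[:]", "[|]{3}", "[|]{2}", "[|]"};
for (int i = 0; i < tokens.length; i++) {
    str = str.replaceAll(tokens[i], " ");
    System.out.println(str);
}
// Step 5: the words are now separated by spaces (the delimiters themselves are lost)
java.util.StringTokenizer st = new java.util.StringTokenizer(str);
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}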

Related

Regex pattern matching is getting timed out

I want to split an input string based on a regex pattern using the Pattern.split(String) API. The regex uses both lookaheads and lookbehinds. It is supposed to split on a delimiter (,) but ignore the delimiter when it is enclosed in double quotes ("x,y").
The regex is - (?<!(?<!\Q\\E)\Q\\E)\Q,\E(?=(?:[^\Q"\E]*(?<=\Q,\E)\Q"\E[[^\Q,\E|\Q"\E] | [\Q"\E]]+[^\Q"\E]*[^\Q\\E]*[\Q"\E]*)*[^\Q"\E]*$)
The input string for which this split call times out is -
"","1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]","QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"
I read that lookaround techniques are heavy and can cause timeouts if the string is too long. If I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, the regex is able to detect the delimiters and split.
With the above string, the pattern also fails to detect the first delimiter at [STIFFENER]","QH20426AD3. Again, if I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, the regex is able to detect it.
I am not very experienced with lookarounds in regex. Can someone please give hints about how I can optimize this regex and avoid the timeouts?
Any pointers or article links are appreciated!
If you want to split on a comma only when it is followed by a string that runs from an opening double quote to its closing double quote:
,(?="[^"\\]*(?:\\.[^"\\]*)*")
The pattern matches:
, Match a comma
(?= Positive lookahead, assert that what follows is
"[^"\\]* Match " and 0+ times any char except " or \
(?:\\.[^"\\]*)*" Optionally repeat matching \ to escape any char using the . and again match any chars other than " and \, then match the closing "
) Close lookahead
String string = "\"\",\"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]\",\"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\\\"BOLT,HI-JOK\\\"]\"\n";
String[] parts = string.split(",(?=\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");
for (String part : parts)
System.out.println(part);
Output
""
"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]"
"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"
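If regex backtracking stays a concern, a different technique that avoids regex altogether is a single left-to-right scan that tracks whether the current position is inside double quotes. The sketch below is not the pattern from the answer above; it assumes a backslash escapes the next character and that fields keep their surrounding quotes. The class name is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedCsvSplitter {
    static List<String> split(String input) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\\' && i + 1 < input.length()) {
                // Backslash escape: copy it and the following character verbatim.
                current.append(c).append(input.charAt(++i));
            } else if (c == '"') {
                // An unescaped quote opens or closes a quoted field.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == ',' && !inQuotes) {
                // A comma outside quotes ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        for (String part : split("\"\",\"a,b\",\"c [\\\"d,e\\\"]\"")) {
            System.out.println(part);
        }
    }
}
```

The scan visits each character once, so its running time grows linearly with the input length regardless of how the quotes and escapes are arranged.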

replace string with regex

There is a piece of code that removes C/O, D/O, S/O or W/O, as below:
if (temp.contains(",,"))
{
temp=temp.replace ("C/O,,","");
temp=temp.replace ("S/O,,","");
temp=temp.replace ("D/O,,","");
temp=temp.replace ("W/O,,","");
}
But I want to replace the above with a regex, so that it automatically removes the C, S, D or W prefix when it is followed by the character sequence ",,". I cannot work out which regex to use.
Please help.
You mean this?
temp=temp.replaceAll("[SDWC]/O,,","");
For case-insensitive match,
temp=temp.replaceAll("(?i)[SDWC]/O,,","");
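As a quick check of the case-insensitive variant, here is a small self-contained sketch. The class name and sample inputs are made up for illustration.

```java
class RelationPrefixCleaner {
    // Removes C/O,, S/O,, D/O,, or W/O,, in any letter case.
    static String strip(String temp) {
        return temp.replaceAll("(?i)[SDWC]/O,,", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("S/O,,Ram Kumar"));
        System.out.println(strip("d/o,,Sita Devi"));
    }
}
```

The original `if (temp.contains(",,"))` guard is not needed for correctness, because replaceAll returns the string unchanged when nothing matches.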

How to remove comma after a word pattern in java

Please help me find a regex that removes the comma after a word pattern in Java.
Assume I would like to delete the comma after each pattern, where the input is <Word$TAG>, <Word$TAG>, <Word$TAG>, <Word$TAG>, <Word$TAG> and I want my output to be <Word$TAG> <Word$TAG> <Word$TAG> <Word$TAG> . If I use .replaceAll(), it will replace all commas, but in my <Word$TAG> the Word part may itself contain a comma (,).
For example, Input.txt is as follows
mms§NNP_ACRON, site§N_NN, pe§PSP, ,,,,,§RD_PUNC, link§N_NN, ....§RD_PUNC, CID§NNP_ACRON, team§N_NN, :)§E
and Output.txt
mms§NNP_ACRON site§N_NN pe§PSP ,,,,,§RD_PUNC link§N_NN ....§RD_PUNC CID§NNP_ACRON team§N_NN :)§E
You could search for ", " and replace it with " " (a space), as below:
one.replace(", ", " ");
If you think you may have "myString, ,,," or multiple spaces in between, you could use replaceAll with a regex such as
one.replaceAll(",\\s+", " ");
(?<=[^,\s]),
Try this. Replace with an empty string. See the demo.
http://regex101.com/r/lZ5mN8/5
Match the data you want, not the data you don't want.
You probably want ([^ ]+), and keep the parenthesized (captured) data, separated by whitespace.
You might even want to narrow it down to ([^ ]+§[^ ]+),. Usually, stricter is better.
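A sketch of how the stricter pattern might be applied in Java: the captured Word§TAG part is written back with $1, which drops only the comma that follows it. The class name is hypothetical, and § is written as \u00A7 so the source does not depend on file encoding.

```java
class TagCommaRemover {
    // Capture "Word§TAG" (no spaces inside), then match the comma that follows it.
    static String removeCommas(String s) {
        return s.replaceAll("([^ ]+\u00A7[^ ]+),", "$1");
    }

    public static void main(String[] args) {
        String s = "mms\u00A7NNP_ACRON, site\u00A7N_NN, pe\u00A7PSP, ,,,,,\u00A7RD_PUNC, :)\u00A7E";
        System.out.println(removeCommas(s));
    }
}
```

Commas inside the Word part, such as the run before §RD_PUNC, are part of the captured group and so survive the replacement.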
You could use a positive lookahead assertion to match all the commas that are followed by a space or by the end of the line.
String s = "mms§NNP_ACRON, site§N_NN, pe§PSP, ,,,,,§RD_PUNC, link§N_NN, ....§RD_PUNC, CID§NNP_ACRON, team§N_NN, :)§E";
System.out.println(s.replaceAll(",(?=\\s|$)",""));
Output:
mms§NNP_ACRON site§N_NN pe§PSP ,,,,,§RD_PUNC link§N_NN ....§RD_PUNC CID§NNP_ACRON team§N_NN :)§E

replace a single quote in a string with another single quote

I have a String containing a single quote. I want to replace the single quote with two single quotes.
I tried using
String s="Kathleen D'Souza";
s.replaceAll("'","''");
s.replaceAll("\'","\'\'");
s.replace("'","''");
s.replace("\'","\'\'");
But the single quote is not getting replaced with two single quotes.
Reassign the result of the replacement to s:
String s="Kathleen D'Souza";
s = s.replaceAll("'","''");
Please try:
s = "test ' test";
s.replaceAll("'", "\"") => test " test
s.replaceAll("'", "''") => test '' test
Strings are immutable. Assign the result of replaceAll to your String:
s = s.replaceAll("'","''");
String s="Kathleen D'Souza";
s= s.replace("'", "''");
Try String#replace(). It will replace every occurrence of a single ' with a doubled ''.
Note that with the given solutions, successive single quotes will also be doubled, so Kathleen D''Souza turns into Kathleen D''''Souza. (I've seen users outsmart themselves like this.) If that is something you are concerned about, you can match runs of successive single quotes with:
s = s.replaceAll("''*","''");
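The answers above can be put side by side in a short sketch. The class and method names are invented for illustration, and '+ is used as an equivalent spelling of ''*.

```java
class QuoteDoubler {
    // Doubles every single quote; already doubled quotes become four.
    static String doubleQuotes(String s) {
        return s.replace("'", "''");
    }

    // Collapses any run of single quotes to exactly two.
    static String normalizeQuotes(String s) {
        return s.replaceAll("'+", "''");
    }

    public static void main(String[] args) {
        String s = "Kathleen D'Souza";
        s.replace("'", "''");            // result discarded: s is unchanged
        System.out.println(s);
        s = doubleQuotes(s);             // result assigned: s now holds the doubled quote
        System.out.println(s);
        System.out.println(normalizeQuotes(s));
    }
}
```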

How to split this string using Java Regular Expressions

I want to split the string
String fields = "name[Employee Name], employeeno[Employee No], dob[Date of Birth], joindate[Date of Joining]";
to
name
employeeno
dob
joindate
I wrote the following Java code for this, but it prints only name; the other matches are not printed.
String fields = "name[Employee Name], employeeno[Employee No], dob[Date of Birth], joindate[Date of Joining]";
Pattern pattern = Pattern.compile("\\[.+\\]+?,?\\s*" );
String[] split = pattern.split(fields);
for (String string : split) {
System.out.println(string);
}
What am I doing wrong here?
Thank you
This part:
\\[.+\\]
matches the first [, the .+ then gobbles up the entire string (if no line breaks are in the string) and then the \\] will match the last ].
You need to make the .+ reluctant by placing a ? after it:
Pattern pattern = Pattern.compile("\\[.+?\\]+?,?\\s*");
And shouldn't \\]+? just be \\] ?
The error is that you are matching greedily. You can change it to a non-greedy match:
Pattern.compile("\\[.+?\\],?\\s*")
                      ^
There's an online regular expression tester at http://gskinner.com/RegExr/?2sa45 that will help you a lot when you try to understand regular expressions and how they are applied to a given input.
Would it be better to use negated character classes to match the square brackets? \[(\w+\s)+\w+[^\]]\]
You could also look at a good example of how a negated character class works internally (without backtracking).
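A simpler negated-character-class pattern than the one suggested in the comment is [^\]]*, which matches everything up to the next ] and cannot run past it. The following is a minimal sketch under that assumption, with a made-up class name.

```java
import java.util.regex.Pattern;

class FieldNames {
    // "[", then any characters except "]", then "]", an optional comma and trailing whitespace.
    private static final Pattern BRACKETED = Pattern.compile("\\[[^\\]]*\\],?\\s*");

    static String[] extract(String fields) {
        return BRACKETED.split(fields);
    }

    public static void main(String[] args) {
        String fields = "name[Employee Name], employeeno[Employee No], dob[Date of Birth], joindate[Date of Joining]";
        for (String name : extract(fields)) {
            System.out.println(name);
        }
    }
}
```

Because split discards trailing empty strings by default, the match at the very end of the input does not produce an extra empty element.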
