I have this code to parse some CSV, on the understanding that doubling the quote character escapes it inside a quoted string (as stated in the Apache docs):
private void test() {
char quote = '\'';
char delim = ',';
// should be split into [comma, comma], [quote ', comma]
String inputListValues = "'comma, comma', 'quote '', comma'";
StrTokenizer st = new StrTokenizer(inputListValues, delim, quote);
List<String> vals = new ArrayList<String>();
while (st.hasNext()) {
vals.add(st.nextToken().trim());
}
System.out.println(vals);
// should be split into [quote ', comma], [comma, comma]
String inputListValues2 = "'quote '', comma', 'comma, comma'";
StrTokenizer st2 = new StrTokenizer(inputListValues2, delim, quote);
List<String> vals2 = new ArrayList<String>();
while (st2.hasNext()) {
vals2.add(st2.nextToken().trim());
}
System.out.println(vals2);
}
the output is
vals ArrayList<E> (id=1088)
[0] "comma, comma" (id=1063)
[1] "'quote ''" (id=1036)
[2] "comma'" (id=2123)
vals2 ArrayList<E> (id=2296)
[0] "quote ', comma" (id=1920)
[1] "'comma" (id=1852)
[2] "comma'" (id=1316)
I'm expecting 2 items parsed: [quote ', comma], [comma, comma]
If it didn't work at all, that would be one thing, but changing the order of the items seems to change how they are parsed.
Does anyone have any idea? I'm on the verge of just using another library or regex.
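It's because I started using this with a "CSV parser" in mind, but StrTokenizer is not one. The docs say:
"a, ", b ,", c" - Three tokens "a, " , " b ", ", c" (quoted text untouched)
So spaces are part of the token. In my input there is a space after each comma, so the next token starts with that space rather than with the quote character, and its quotes are then treated as ordinary text, which matches the output above. I then used setTrimmerMatcher, since the docs say of the trimmer matcher:
These characters are trimmed off on each side of the delimiter until the token or quote is found.
The code ended up being: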
StrTokenizer st = new StrTokenizer(toTokenize, DELIM_CHAR, QUOTE_CHAR);
// by default this is a STRING matching, not csv parser, so spaces count as part of the token
// ie "a, ", b ,", c" - Three tokens "a, " , " b ", ", c" (quoted text untouched)
// thus we set the trimmer matcher, which "are trimmed off on each side of the delimiter until the token or quote is found."
st.setTrimmerMatcher(StrMatcher.trimMatcher());
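To make the intended behaviour concrete without depending on Commons Lang, here is a minimal plain-Java sketch of quote-aware splitting in which a doubled quote stands for one literal quote and whitespace outside quotes is trimmed. It is not StrTokenizer's implementation; the class and method names (QuotedSplitSketch, split) are made up for illustration, and edge cases such as unterminated quotes are handled only loosely.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedSplitSketch {

    // Splits input on delim. Inside a quoted section the delimiter is literal
    // and a doubled quote yields one quote character. Whitespace outside
    // quotes is dropped at the start and end of each token.
    static List<String> split(String input, char delim, char quote) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        StringBuilder pendingSpace = new StringBuilder(); // unquoted whitespace not yet committed
        boolean inQuotes = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (inQuotes) {
                if (c != quote) {
                    current.append(c);
                } else if (i + 1 < input.length() && input.charAt(i + 1) == quote) {
                    current.append(quote); // doubled quote -> one literal quote
                    i++;
                } else {
                    inQuotes = false;      // closing quote
                }
            } else if (c == delim) {
                tokens.add(current.toString());
                current.setLength(0);
                pendingSpace.setLength(0); // trailing whitespace is discarded
            } else if (Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    pendingSpace.append(c); // keep only if more text follows
                }
            } else {
                current.append(pendingSpace);
                pendingSpace.setLength(0);
                if (c == quote) {
                    inQuotes = true;
                } else {
                    current.append(c);
                }
            }
        }
        tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        List<String> a = split("'comma, comma', 'quote '', comma'", ',', '\'');
        List<String> b = split("'quote '', comma', 'comma, comma'", ',', '\'');
        System.out.println(a.size() + " tokens: " + a);
        System.out.println(b.size() + " tokens: " + b);
    }
}
```

With both sample inputs this yields two tokens regardless of order, which is the result the question was expecting.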
public static ArrayList<String> getDropDownTextList(By locator) {
List<WebElement> countryList = getDropDownOptions(locator);
ArrayList<String> countryTextList = new ArrayList<String>();
for(WebElement e:countryList) {
String text = e.getText();
text.replace(',' , ' ');
if(text.length()!= 0) {
countryTextList.add(text.concat("\n"));
}
}
return countryTextList;
}
[![enter image description here](https://i.stack.imgur.com/XKyqd.png)](https://i.stack.imgur.com/XKyqd.png)
Apologies , I'm a newbie into programming
text.replace(',' , ' ');
To remove the ", " part which is immediately followed by end of string, you can do:
str = str.replaceAll(", $", "");
This handles the empty list (empty string) gracefully, as opposed to lastIndexOf / substring solutions, which require special treatment of that case.
Example code:
String str = "kushalhs, mayurvm, narendrabz, ";
str = str.replaceAll(", $", "");
System.out.println(str); // prints "kushalhs, mayurvm, narendrabz"
NOTE:
Since there have been some comments and suggested edits about the ", $" part: the expression should match the trailing part that you want to remove.
If your input looks like "a,b,c,", use ",$".
If your input looks like "a, b, c, ", use ", $".
If your input looks like "a , b , c , ", use " , $".
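If the spacing around the trailing comma is not known in advance, one option is a single pattern that tolerates optional whitespace on either side. The sketch below is only an illustration; the class and method names are hypothetical, and it assumes the separator is a comma.

```java
public class TrailingSeparatorSketch {

    // Hypothetical helper: removes one trailing comma together with any
    // whitespace around it; strings without a trailing comma are unchanged.
    static String stripTrailingComma(String s) {
        return s.replaceAll("\\s*,\\s*$", "");
    }

    public static void main(String[] args) {
        System.out.println("[" + stripTrailingComma("a,b,c,") + "]");        // [a,b,c]
        System.out.println("[" + stripTrailingComma("a, b, c, ") + "]");     // [a, b, c]
        System.out.println("[" + stripTrailingComma("a , b , c , ") + "]");  // [a , b , c]
        System.out.println("[" + stripTrailingComma("") + "]");              // []
    }
}
```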
I am working in Java. I have a list of parameters stored in a string which comes from Excel. I want to split it only at the hyphen that starts each new line. One such string is stored in every Excel cell, and I am trying to extract it using Apache POI. The format is as below:
String text =
"- I am string one\n" +
"-I am string two\n" +
"- I am string-three\n" +
"with new line\n" +
"-I am string-four\n" +
"- I am string five";
What I want
array or arraylist which looks like this
[I am string one,
I am string two,
I am string-three with new line,
I am string-four,
I am string five]
What I Tried
I tried to use split function like this:
String[] newline_split = text.split("-");
but the output I get is not what I want
My O/P
[, I am string one,
I am string two,
I am string, // wrong
three // wrong
with new line, // wrong
I am string, // wrong!
four, // wrong!
I am string five]
I might have to tweak split function a bit but not able to understand how, because there are so many hyphens and new lines in the string.
P.S.
If I try splitting only at newlines, then the entry "- I am string-three \n with new line" breaks into two parts, which again is not correct.
EDIT:
Please know that this data inside string is incorrectly formatted just like what is shown above. It is coming from an excel file which I have received. I am trying to use apache poi to extract all the content out of each excel cell in a form of a string.
I intentionally tried to keep the format like what client gave me. For those who are confused about description inside A, I have changed it because I cannot post the contents on here as it is against privacy of my workplace.
You can:
remove line separators (replacing each with a space) when the next line does not start with -: .replaceAll("\\R(?!-)", " ") should do the trick
\R (written as "\\R" in a string literal) has been available since Java 8 and matches any line separator
(?!...) is a negative lookahead: it ensures that no - follows the place where it is used, and it is not part of the match, so the - it checks for is not removed
then remove the - placed at the start of each line (let's also include the whitespace that follows it, to trim the start of the string). In other words, replace each - placed
after line separators: can be represented by "\\R"
after start of string: can be represented by ^
This should do the trick: .replaceAll("(?<=\\R|^)-\\s*","")
split on the remaining line separators: .split("\\R")
Demo:
String text =
"- I am string one\n" +
"-I am string two\n" +
"- I am string-three\n" +
"with new line\n" +
"-I am string-four\n" +
"- I am string five";
String[] split = text.replaceAll("\\R(?!-)", " ")
.replaceAll("(?<=\\R|^)-\\s*","")
.split("\\R");
for (String s: split){
System.out.println("'"+s+"'");
}
Output (surrounded with ' to show start and end of results):
'I am string one'
'I am string two'
'I am string-three with new line'
'I am string-four'
'I am string five'
This is how I would do it:
import java.util.*;
public class MyClass {
public static void main(String args[]) {
String A = "- I am string one \n" +
" -I am string two\n" +
" - I am string-three \n" +
" with new line\n" +
" -I am string-four\n" +
"- I am string five";
String[] s2 = A.split("\r?\n");
List<String> lines = new ArrayList<String>();
String line = "";
for (int i = 0; i < s2.length; i++) {
String ss = s2[i].trim();
if (i == 0) { // first line MUST start with "-"
line = ss.substring(1).trim();
} else if (ss.startsWith("-")) {
lines.add(line);
ss = ss.substring(1).trim();
line = ss;
} else {
line = line + " " + ss;
}
}
lines.add(line);
System.out.println(lines.toString());
}
}
I hope it helps.
A little explanation:
I will process line by line, trimming each one.
If it starts with '-' it means the end of the previous line, so I include it in the list. If not, I concatenate with the previous line.
It looks as if you are splitting on the FIRST - of each line, so you need to remove every "-" that directly follows a newline:
str = str.replace("\n-", "\n");
Then remove the initial "-":
str = str.substring(1);
MSH|^~\&|RAD|MCH|SOARCLIN|MCH|201309281506||ORU^R01|RMS|P|2.4
PID|0001|_MISSING_|059805^a~059805^a~059805^a||RENNER^KATHRYN^
In a string like the above, I need to replace a field based on its position, counted in | (pipe sign) characters.
For example, in the MSH line I want to replace the "MCH" after the 3rd pipe sign (|)
with "ABC":
input : MSH|^~\&|RAD|MCH|SOARCLIN|MCH|201309281506||ORU^R01|RMS|P|2.4
output : MSH|^~\&|RAD|MCH|SOARCLIN|ABC|201309281506||ORU^R01|RMS|P|2.4
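// theString is assumed to be a field holding the line to modify
String repSection( String del, int count, String rep ){
String[] toks = theString.split( Pattern.quote( del ), -1 ); // -1 keeps trailing empty fields
toks[count] = rep;
theString = String.join( del, toks );
return theString;
}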
Call:
String result = repSection( "|", 3, "ABC" );
It depends on counting alone; it doesn't matter what is there between the 3rd and 4th pipe char.
I prefer this to some fancy and difficult to maintain regex.
s = s.replaceAll( "^((?:[^|]*\\|){3})[^|]*", "$1ABC" );
Again, this doesn't care what is between 3rd and 4th pipe symbol.
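As a self-contained illustration of the counting approach, here is a small sketch that takes the line as a parameter instead of relying on a field. The class and method names are hypothetical, and returning the line unchanged for an out-of-range index is an assumption. Fields are counted from 0, so index 3 is the first MCH, while index 5 is the one that reproduces the output shown in the question.

```java
import java.util.regex.Pattern;

public class FieldReplaceSketch {

    // Replaces the field at the given zero-based index; the delimiter is
    // quoted so that "|" is treated literally rather than as regex alternation.
    static String replaceField(String line, String delimiter, int index, String replacement) {
        String[] fields = line.split(Pattern.quote(delimiter), -1); // -1 keeps empty fields
        if (index < 0 || index >= fields.length) {
            return line; // assumption: leave the line untouched if the field does not exist
        }
        fields[index] = replacement;
        return String.join(delimiter, fields);
    }

    public static void main(String[] args) {
        String msh = "MSH|^~\\&|RAD|MCH|SOARCLIN|MCH|201309281506||ORU^R01|RMS|P|2.4";
        System.out.println(replaceField(msh, "|", 5, "ABC"));
        System.out.println(replaceField(msh, "|", 3, "ABC"));
    }
}
```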
I want to remove all special characters from the input text, as well as some restricted words.
The things to remove are supplied dynamically.
(To clarify: the words to exclude are provided dynamically, because the user decides what needs to be excluded. That is why I did not hard-code a regex. The restricted_words_list in my code will eventually come from the database.)
For demonstration purposes, I keep the words in a static String array to confirm whether my code is working properly.
public class TestKeyword {
private static final String[] restricted_words_list={"#","of","an","^","#","<",">","(",")"};
private static final Pattern restrictedReplacer;
private static Set<String> restrictedWords = null;
static {
StringBuilder strb= new StringBuilder();
for(String str:restricted_words_list){
strb.append("\\b").append(Pattern.quote(str)).append("\\b|");
}
strb.setLength(strb.length()-1);
restrictedReplacer = Pattern.compile(strb.toString(),Pattern.CASE_INSENSITIVE);
strb = new StringBuilder();
}
public static void main(String[] args)
{
String inputText = "abcd abc# cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D";
System.out.println("inputText : " + inputText);
String modifiedText = restrictedWordCheck(inputText);
System.out.println("Modified Text : " + modifiedText);
}
public static String restrictedWordCheck(String input){
Matcher m = restrictedReplacer.matcher(input);
StringBuffer strb = new StringBuffer(input.length());//ensuring capacity
while(m.find()){
if(restrictedWords==null)restrictedWords = new HashSet<String>();
restrictedWords.add(m.group()); //m.group() returns what was matched
m.appendReplacement(strb,""); //this writes out what came in between matching words
for(int i=m.start();i<m.end();i++)
strb.append("");
}
m.appendTail(strb);
return strb.toString();
}
}
The output is :
inputText : abcd abc# cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D
Modified Text : abcd abc# cbda ssef jjj the gg wh&at gggg ss%ss ### (()) DhD
Here the words of and an are removed, but only some of the special characters are, not all of those I specified in restricted_words_list.
I have now found a better solution:
String inputText = title;// assigning input
List<String> restricted_words_list = catalogueService.getWordStopper(); // getting all stopper words from database dynamically (inside getWordStopper() method just i wrote a query and getting list of words)
String finalResult = "";
List<String> stopperCleanText = new ArrayList<String>();
String[] afterTextSplit = inputText.split("\\s"); // split and add to list
for (int i = 0; i < afterTextSplit.length; i++) {
stopperCleanText.add(afterTextSplit[i]); // adding to list
}
stopperCleanText.removeAll(restricted_words_list); // remove all word stopper
for (String addToString : stopperCleanText)
{
finalResult += addToString+";"; // add semicolon to cleaned text
}
return finalResult;
public String replaceAll(String regex,
String replacement)
Replaces each substring of this string (which matches the given regular expression) with the given replacement.
Parameters:
regex - the regular expression to which this string is to be
matched
replacement - the string to be substituted for each match.
So you just need to provide replacement parameter with an empty String.
You should change your loop
for(String str:restricted_words_list){
strb.append("\\b").append(Pattern.quote(str)).append("\\b|");
}
to this:
for(String str:restricted_words_list){
strb.append("\\b*").append(Pattern.quote(str)).append("\\b*|");
}
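Because \b is a word boundary, your loop matches a restricted_words_list element only if there is a boundary between a word character and a non-word character on each side of it. For a symbol such as #, that means a word character must sit directly before and after it: the # in t#he is replaced, but abc# has only a space after the #, so it is not. If you add * (which means 0 or more occurrences) to the \\b on either side, the boundary becomes optional and things like abc# match as well. Be aware that an optional boundary also lets words such as of match inside longer words.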
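One way to keep whole-word matching for words while still removing symbols anywhere is to add \b only around entries that consist of word characters. The following is a rough sketch of that idea, not the original poster's code; the class and method names are hypothetical, and the word list is passed in as a parameter rather than read from a database.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;
import java.util.regex.Pattern;

public class RestrictedWordsSketch {

    // Builds one alternation: word-like entries get \b on both sides,
    // everything else (symbols) is matched wherever it appears.
    static Pattern build(List<String> restricted) {
        StringJoiner alternatives = new StringJoiner("|");
        for (String entry : restricted) {
            String quoted = Pattern.quote(entry);
            boolean wordLike = entry.matches("\\w+");
            alternatives.add(wordLike ? "\\b" + quoted + "\\b" : quoted);
        }
        return Pattern.compile(alternatives.toString(), Pattern.CASE_INSENSITIVE);
    }

    static String clean(String input, List<String> restricted) {
        return build(restricted).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> restricted = Arrays.asList("#", "of", "an", "^", "<", ">", "(", ")");
        String input = "abcd abc# cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D";
        System.out.println(clean(input, restricted));
    }
}
```

Removing a word leaves the spaces on both sides of it in place, so a follow-up step that collapses repeated whitespace may be wanted.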
You may consider using a regex directly to replace those special characters with an empty string. See "Java; String replace (using regular expressions)?"; there is also a tutorial here: http://www.vogella.com/articles/JavaRegularExpressions/article.html
You can also do it like this:
String inputText = "abcd abc# cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D";
String regx="([^a-z^ ^0-9]*\\^*)";
String textWithoutSpecialChar=inputText.replaceAll(regx,"");
System.out.println("Without Special Char:"+textWithoutSpecialChar);
String yourSetofString="of|an"; // your restricted words.
String op=textWithoutSpecialChar.replaceAll(yourSetofString,"");
System.out.println("output : "+op);
o/p :
Without Special Char:abcd abc cbda ssef of jjj the gg an what gggg ssss h
output : abcd abc cbda ssef jjj the gg what gggg ssss h
String s = "abcd abc# cbda ssef of jjj t#he g^g an wh&at ggg (blah) and | then";
String[] words = new String[]{ " of ", "|", "(", " an ", "#", "#", "&", "^", ")" };
StringBuilder sb = new StringBuilder();
for( String w : words ) {
if( w.length() == 1 ) {
sb.append( "\\" );
}
sb.append( w ).append( "|" );
}
System.out.println( s.replaceAll( sb.toString(), "" ) );
I have an instruction like:
db.insert( {
_id:3,
cost:{_0:11},
description:"This is a description.\nCool, isn\'t it?"
});
The Eclipse plugin I am using, called MonjaDB splits the instruction by newline and I get each line as a separate instruction, which is bad. I fixed it using ;(\r|\n)+ which now includes the entire instruction, however, when sanitizing the newlines between the parts of the JSON, it also sanitizes the \n and \r within string in the json itself.
How do I avoid removing \t, \r and \n from within JSON strings, which are, of course, delimited by "" or ''?
You need to arrange to ignore whitespace when it appears within quotes. So, as suggested by one of the commenters:
\s+ | ( " (?: [^"\\] | \\ . ) * " ) // White-space inserted for readability
Match Java whitespace or a double-quoted string, where a string consists of " followed by any number of items that are either a non-escape, non-quote character or an escape (backslash) plus any character, and then a final ". This way, whitespace inside strings is never matched on its own.
Then replace each match with $1 if $1 is not null.
Pattern clean = Pattern.compile(" \\s+ | ( \" (?: [^\"\\\\] | \\\\ . ) * \" ) ", Pattern.COMMENTS | Pattern.DOTALL);
StringBuffer sb = new StringBuffer();
Matcher m = clean.matcher( json );
while (m.find()) {
m.appendReplacement(sb, "" );
// Don't put m.group(1) in the appendReplacement because if it happens to contain $1 or $2 you'll get an error.
if ( m.group(1) != null )
sb.append( m.group(1) );
}
m.appendTail(sb);
String cleanJson = sb.toString();
This is totally off the top of my head but I'm pretty sure it's close to what you want.
Edit: I've just got access to a Java IDE and tried out my solution. I had made a couple of mistakes with my code including using \. instead of . in the Pattern. So I have fixed that up and run it on a variation of your sample:
db.insert( {
_id:3,
cost:{_0:11},
description:"This is a \"description\" with an embedded newline: \"\n\".\nCool, isn\'t it?"
});
The code:
String json = "db.insert( {\n" +
" _id:3,\n" +
" cost:{_0:11},\n" +
" description:\"This is a \\\"description\\\" with an embedded newline: \\\"\\n\\\".\\nCool, isn\\'t it?\"\n" +
"});";
// insert above code
System.out.println(cleanJson);
This produces:
db.insert({_id:3,cost:{_0:11},description:"This is a \"description\" with an embedded newline: \"\n\".\nCool, isn\'t it?"});
which is the same json expression with all whitespace removed outside quoted strings and whitespace and newlines retained inside quoted strings.
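The question also mentions strings delimited by '' while the pattern above only recognises double-quoted strings. One possible extension, shown here as an untested-in-MonjaDB sketch with hypothetical class and method names, adds a second alternative for single-quoted strings. It assumes apostrophes never appear outside string literals (for example in comments), since such an apostrophe could be taken as the start of a string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StripOutsideQuotesSketch {

    // Whitespace, or (captured) a double-quoted or single-quoted string
    // in which a backslash escapes the following character.
    private static final Pattern CLEAN = Pattern.compile(
            "\\s+|(\"(?:[^\"\\\\]|\\\\.)*\"|'(?:[^'\\\\]|\\\\.)*')",
            Pattern.DOTALL);

    static String strip(String source) {
        Matcher m = CLEAN.matcher(source);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, "");  // copy the text before the match, drop the match
            if (m.group(1) != null) {
                sb.append(m.group(1));    // put quoted strings back unchanged
            }
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String source = "db.insert( {\n  a : 'x  y' ,\n  b : \"p q\"\n});";
        System.out.println(strip(source));
    }
}
```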