How can you parse the string which has a text qualifier


How can I parse String str = "abc, \"def,ghi\"";
so that I get the output
String[] strs = {"abc", "\"def,ghi\""}
i.e. an array of length 2?
Should I use a regular expression, or is there a method in the Java API or in any other open-source project that lets me do this?
Edited
To give some context: I am reading a text file that holds one record per line. Each record is a list of fields separated by a delimiter (comma or semicolon). I now have to support a text qualifier, similar to what Excel or OpenOffice support. Suppose I have the record
abc, "def,ghi"
Here , is my delimiter and " is my text qualifier, so parsing this string should yield two fields, abc and def,ghi, not three ({abc,def,ghi}).
I hope this clarifies my requirement.
Thanks
Shekhar

The basic algorithm is not too complicated:
public static List<String> customSplit(String input) {
List<String> elements = new ArrayList<String>();
StringBuilder elementBuilder = new StringBuilder();
boolean isQuoted = false;
for (char c : input.toCharArray()) {
if (c == '\"') {
isQuoted = !isQuoted;
// continue; // changed according to the OP comment - \" shall not be skipped
}
if (c == ',' && !isQuoted) {
elements.add(elementBuilder.toString().trim());
elementBuilder = new StringBuilder();
continue;
}
elementBuilder.append(c);
}
elements.add(elementBuilder.toString().trim());
return elements;
}

This question seems appropriate: Split a string ignoring quoted sections
Along that line, http://opencsv.sourceforge.net/ seems appropriate.

Try this -
String str = "abc, \"def,ghi\"";
String regex = "([,]) | (^[\"\\w*,\\w*\"])";
for(String s : str.split(regex)){
System.out.println(s);
}
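The pattern above appears to split this particular input correctly only because the comma happens to be followed by a space. A commonly used alternative is to split on commas that are followed by an even number of quote characters. The following is a minimal sketch of that idea, assuming balanced quotes and no escaped quotes inside fields; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplit {
    // Matches a comma only when an even number of quotes follows it up to
    // the end of the input, i.e. when the comma is outside a quoted field.
    private static final String COMMA_OUTSIDE_QUOTES =
            ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        // The -1 limit keeps trailing empty fields.
        for (String field : line.split(COMMA_OUTSIDE_QUOTES, -1)) {
            fields.add(field.trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // prints [abc, "def,ghi"]
        System.out.println(split("abc, \"def,ghi\""));
    }
}
```

The lookahead rescans the rest of the line for every comma, so for very long lines a character-by-character loop is likely to be faster.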

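Try:
List<String> res = new LinkedList<String>();
// Split on the quote character; the -1 limit keeps trailing empty chunks
// so that the parity check below is reliable.
String[] chunks = str.split("\"", -1);
if (chunks.length % 2 == 0) {
// Mismatched quotes!
}
for (int i = 0; i < chunks.length; i++) {
if (i % 2 == 0) {
// Even-indexed chunks lie outside the quotes: split them on commas.
res.addAll(Arrays.asList(chunks[i].split(",")));
} else {
// Odd-indexed chunks lie between quotes: keep them whole.
res.add(chunks[i]);
}
}
This splits only the portions that are not between quotes.
Call trim() on each element to get rid of the whitespace, and skip elements that are empty after trimming.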

Related

Splitting the string in java is giving different results than expected [duplicate]

This question already has answers here:
Split string on spaces in Java, except if between quotes (i.e. treat \"hello world\" as one token) [duplicate]
(1 answer)
Tokenizing a String but ignoring delimiters within quotes
(14 answers)
Closed 6 years ago.
Hi, I am new to Java and am trying to use the split method provided by Java.
The input is a String in the following format:
broadcast message "Shubham Agiwal"
The desired output is an array with the following elements:
["broadcast","message","Shubham Agiwal"]
My code is as follows:
String str="broadcast message \"Shubham Agiwal\"";
for(int i=0;i<str.split(" ").length;i++){
System.out.println(str.split(" ")[i]);
}
The output I obtained from the above code is
["broadcast","message","\"Shubham","Agiwal\""]
Can somebody let me know what I need to change in my code to get the desired output mentioned above?

It is hard to split this string directly. So I replace every space that lies outside the quotes with '\t' and then split on '\t'. My code is below; you can try it. Others may have a better solution, which we can discuss too.
package com.code.stackoverflow;
/**
* Created by jiangchao on 2016/10/24.
*/
public class Main {
public static void main(String args[]) {
String str="broadcast message \"Shubham Agiwal\"";
char []chs = str.toCharArray();
StringBuilder sb = new StringBuilder();
/*
* false: means that I am out of the ""
* true: means that I am in the ""
*/
boolean flag = false;
for (Character c : chs) {
if (c == '\"') {
flag = !flag;
continue;
}
if (flag == false && c == ' ') {
sb.append("\t");
continue;
}
sb.append(c);
}
String []strs = sb.toString().split("\t");
for (String s : strs) {
System.out.println(s);
}
}
}

This is tedious but it works. The only problem is that if the whitespace inside the quotes is a tab or another whitespace character, it gets replaced with a space character. The loop also assumes that every quoted section contains at least two words.
String str = "broadcast message \"Shubham Agiwal\" better \"Hello java World\"";
Scanner scanner = new Scanner(str).useDelimiter("\\s");
while(scanner.hasNext()) {
String token = scanner.next();
if ( token.startsWith("\"")) { //Concatenate until we see a closing quote
token = token.substring(1);
String nextTokenInQuotes = null;
do {
nextTokenInQuotes = scanner.next();
token += " ";
token += nextTokenInQuotes;
}while(!nextTokenInQuotes.endsWith("\""));
token = token.substring(0,token.length()-1); //Get rid of trailing quote
}
System.out.println("Token is:" + token);
}
This produces the following output:
Token is:broadcast
Token is:message
Token is:Shubham Agiwal
Token is:better
Token is:Hello java World

public static void main(String[] arg){
String str = "broadcast message \"Shubham Agiwal\"";
//First split
String strs[] = str.split("\\s\"");
//Second split for the first part(Key part)
String[] first = strs[0].split(" ");
for(String st:first){
System.out.println(st);
}
//Append " in front of the last part(Value part)
System.out.println("\""+strs[1]);
}
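Another option is to match the tokens instead of splitting on the delimiter. The sketch below uses Pattern and Matcher with a pattern that accepts either a double-quoted run or a run of non-whitespace characters. It assumes that quotes only appear at the start and end of a token and are never escaped; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedTokenizer {
    // Either a double-quoted run (group 1 holds the text without the quotes)
    // or a run of non-whitespace characters (group 2).
    private static final Pattern TOKEN =
            Pattern.compile("\"([^\"]*)\"|(\\S+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Exactly one of the two groups participates in each match.
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [broadcast, message, Shubham Agiwal]
        System.out.println(tokenize("broadcast message \"Shubham Agiwal\""));
    }
}
```

Because the quoted text is captured as is, tabs or repeated spaces inside the quotes are preserved.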

Java Regex to find in a Text all the possible pairs of a list

I have a List of Strings containing names and surnames, and I have a free text.
List<String> names; // contains: "jon", "snow", "arya", "stark", ...
String text = "jon snow and stark arya";
I have to find all the name and surname pairs, if possible with a Java regex (that is, using Pattern and Matcher objects). So I want something like:
List<String> foundNames; // contains: "jon snow", "stark arya"
I have solved this in two ways without regex. The methods are not static because they are part of a NameFinder class, which has a list "names" that contains all the names.
public List<String> findNamePairs(String text) {
List<String> foundNamePairs = new ArrayList<String>();
List<String> names = this.names;
text = text.toLowerCase();
for (String name : names) {
String nameToSearch = name + " ";
int index = text.indexOf(nameToSearch);
if (index != -1) {
String textSubstring = text.substring(index + nameToSearch.length());
for (String nameInner : names) {
if (name != nameInner && textSubstring.startsWith(nameInner)) {
foundNamePairs.add(name + " " + nameInner);
}
}
}
}
removeDuplicateFromList(foundNamePairs);
return foundNamePairs;
}
or in a worse (very bad) way (creating all the possible pairs):
public List<String> findNamePairsInTextNotOpt(String text) {
List<String> foundNamePairs = new ArrayList<String>();
text = text.toLowerCase();
List<String> pairs = getNamePairs(this.names);
for (String name : pairs) {
if (text.contains(name)) {
foundNamePairs.add(name);
}
}
removeDuplicateFromList(foundNamePairs);
return foundNamePairs;
}

You can create a regex using the list of names and then use find to find the names. To ensure you don't have duplicates, you can check if the name is already in the list of found names. The code would look like this.
List<String> names = Arrays.asList("jon", "snow", "stark", "arya");
String text = "jon snow and Stark arya and again Jon Snow";
StringBuilder regexBuilder = new StringBuilder();
for (int i = 0; i < names.size(); i += 2) {
regexBuilder.append("(")
.append(names.get(i))
.append(" ")
.append(names.get(i + 1))
.append(")");
if (i != names.size() - 2) regexBuilder.append("|");
}
System.out.println(regexBuilder.toString());
Pattern compile = Pattern.compile(regexBuilder.toString(), Pattern.CASE_INSENSITIVE);
Matcher matcher = compile.matcher(text);
List<String> found = new ArrayList<>();
int start = 0;
while (matcher.find(start)) {
String match = matcher.group().toLowerCase();
if (!found.contains(match)) found.add(match);
start = matcher.end();
}
for (String s : found) System.out.println("found: " + s);
If you want to be case sensitive just remove the flag in Pattern.compile(). If all matches have the same capitalization you can omit the toLowerCase() in the while loop as well.
But make sure that the list contains an even number of elements (name and surname pairs), as the for-loop will throw an IndexOutOfBoundsException otherwise. Also, the order matters in my code: it will only find the name pairs in the order in which they occur in the list. If you want to match both orders, you can change the regex generation accordingly.
Edit: Since it is unknown which entries are names, which are surnames, and which of them form a name/surname pair, the regex generation must be done differently.
StringBuilder regexBuilder = new StringBuilder("(");
for (int i = 0; i < names.size(); i++) {
regexBuilder.append("(")
.append(names.get(i))
.append(")");
if (i != names.size() - 1) regexBuilder.append("|");
}
regexBuilder.append(") ");
regexBuilder.append(regexBuilder);
regexBuilder.setLength(regexBuilder.length() - 1);
System.out.println(regexBuilder.toString());
This regex will match any of the given names followed by a space and then again any of the names.
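Putting the "any name, a space, any name" idea together, a self-contained sketch could look like the following. It is not the answer's exact code: the class name NamePairFinder and the method findPairs are hypothetical, Pattern.quote and the \b word boundaries are additions that guard against regex metacharacters and partial-word hits, and the names list is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamePairFinder {
    static List<String> findPairs(List<String> names, String text) {
        // Build name1|name2|... with every name quoted literally.
        StringBuilder alternation = new StringBuilder();
        for (String name : names) {
            if (alternation.length() > 0) alternation.append('|');
            alternation.append(Pattern.quote(name));
        }
        String anyName = "(?:" + alternation + ")";
        Pattern pairs = Pattern.compile(
                "\\b" + anyName + " " + anyName + "\\b",
                Pattern.CASE_INSENSITIVE);

        // LinkedHashSet drops duplicates and keeps first-seen order.
        Set<String> found = new LinkedHashSet<>();
        Matcher m = pairs.matcher(text);
        while (m.find()) {
            found.add(m.group().toLowerCase());
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("jon", "snow", "stark", "arya");
        // prints [jon snow, stark arya]
        System.out.println(findPairs(names, "jon snow and Stark arya and again Jon Snow"));
    }
}
```

Matches do not overlap, so in a text such as "jon snow arya" only "jon snow" is reported, and a repeated name such as "jon jon" also counts as a pair.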

removal of repeated string

I have a string something like
JNDI Locations eis/FileAdapter,eis/FileAdapter used by composite
HelloWorld1.0.jar are not available in the
destination domain.
Here eis/FileAdapter occurs twice (eis/FileAdapter,eis/FileAdapter).
I want it to be formatted as
JNDI Locations eis/FileAdapter used by composite
HelloWorld1.0.jar are not available in the
destination domain.
I tried the following:
String[] missingAdapters = ((textMissingAdapterList.item(0)).getNodeValue().trim().split(","));
missingAdapters[0]
but then I lose the second part of the message. Is there a better way to handle this?

In your comment below the question you confirm that the duplicates will always be connected by a comma. Using this information, the following should work (for most cases):
String replaceCustomDuplicates(String str) {
if (str.indexOf(",") < 0) {
return str; // nothing to do
}
StringBuilder result = new StringBuilder(str.length());
for (String token : str.split(" ", -1)) {
if (token.indexOf(",") > 0) {
String[] parts = token.split(",");
if (parts.length == 2 && parts[0].equals(parts[1])) {
token = parts[0];
}
}
result.append(token + " ");
}
return result.delete(result.length() - 1, result.length()).toString();
}
A little demo with your example:
String str = "JNDI Locations eis/FileAdapter,eis/FileAdapter used by composite";
System.out.println(str);
str = replaceCustomDuplicates(str);
System.out.println(str);

That should do it:
String[] missingAdapters = ((textMissingAdapterList.item(0)).getNodeValue().trim().split(","));
String result = missingAdapters[0] + " " + missingAdapters[1].split(" ", 2)[1];
This assumes that the duplicated string you want to drop contains no space.
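A regex with a backreference is another way to collapse such comma-joined repeats. This is only a sketch under the same assumption as above (the repeated token contains no spaces or commas); the class name DuplicateCollapser and the method collapse are invented for the example.

```java
class DuplicateCollapser {
    // A token without whitespace or commas, followed by one or more
    // ",<same token>" repeats. The lookarounds make sure that whole
    // tokens are compared rather than fragments of longer tokens.
    private static final String REPEATED =
            "(?<![^\\s,])([^\\s,]+)(?:,\\1)+(?![^\\s,])";

    static String collapse(String text) {
        // $1 is the first copy of the repeated token.
        return text.replaceAll(REPEATED, "$1");
    }

    public static void main(String[] args) {
        String str = "JNDI Locations eis/FileAdapter,eis/FileAdapter used by composite";
        // prints JNDI Locations eis/FileAdapter used by composite
        System.out.println(collapse(str));
    }
}
```

Tokens that merely share a prefix, such as ab,abc, are left unchanged.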

using tokenizer to read a line

public void GrabData() throws IOException
{
try {
BufferedReader br = new BufferedReader(new FileReader("data/500.txt"));
String line = "";
int lineCounter = 0;
int TokenCounter = 1;
arrayList = new ArrayList < String > ();
while ((line = br.readLine()) != null) {
//lineCounter++;
StringTokenizer tk = new StringTokenizer(line, ",");
System.out.println(line);
while (tk.hasMoreTokens()) {
arrayList.add(tk.nextToken());
System.out.println("check");
TokenCounter++;
if (TokenCounter > 12) {
er = new DataRecord(arrayList);
DR.add(er);
arrayList.clear();
System.out.println("check2");
TokenCounter = 1;
}
}
}
} catch (FileNotFoundException ex) {
Logger.getLogger(Driver.class.getName()).log(Level.SEVERE, null, ex);
}
}
Hello, I am using a tokenizer to read the contents of a line and store them in an ArrayList. The GrabData method shown here does that job.
The only problem is that the company name (the third column in every line) is in quotes and has a comma in it. I have included one line as an example. The tokenizer relies on the comma to separate the line into tokens, but the comma in the company name throws it off, I guess. If it weren't for that comma, everything would work as normal.
Example:
Essie,Vaill,"Litronic , Industries",14225 Hancock Dr,Anchorage,Anchorage,AK,99515,907-345-0962,907-345-1215,essie@vaill.com,http://www.essievaill.com
Any ideas?

First of all, StringTokenizer is considered legacy code. From the Javadoc:
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
Using the split() method you get an array of strings. While iterating through the array, you can check whether the current string starts with a quote and, if so, whether the next one ends with a quote. If both conditions hold, you know the split happened where you did not want it, so you can merge the two strings, process the result as you like, and continue iterating normally. In that pass you advance the index by 2 (i += 2) instead of the usual i++.
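The split-then-merge idea can be sketched as follows. This version generalizes the description slightly by merging as many pieces as needed until one ends with a quote, so a quoted field may contain more than one comma. It assumes that quotes appear only at the start and end of a field; the class name SplitAndMerge and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SplitAndMerge {
    static List<String> parse(String line) {
        // The -1 limit keeps trailing empty fields.
        String[] pieces = line.split(",", -1);
        List<String> fields = new ArrayList<>();
        for (int i = 0; i < pieces.length; i++) {
            StringBuilder field = new StringBuilder(pieces[i]);
            if (pieces[i].startsWith("\"")) {
                // The split cut through a quoted field: glue the following
                // pieces back on, restoring the commas, until the quote closes.
                while (!endsQuoted(field) && i + 1 < pieces.length) {
                    field.append(',').append(pieces[++i]);
                }
            }
            fields.add(field.toString());
        }
        return fields;
    }

    // True when the text is at least two characters long and ends with a
    // quote, so a lone opening quote does not count as closed.
    private static boolean endsQuoted(CharSequence s) {
        return s.length() >= 2 && s.charAt(s.length() - 1) == '"';
    }

    public static void main(String[] args) {
        // prints [Essie, Vaill, "Litronic , Industries", 14225 Hancock Dr]
        System.out.println(parse("Essie,Vaill,\"Litronic , Industries\",14225 Hancock Dr"));
    }
}
```

The quotes are kept in the merged field; strip them afterwards if the record should hold the bare company name.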

You can accomplish this using Regular Expressions. The following code:
String s = "asd,asdasd,asd, \"asdasdasd,asdasdasd\", asdasd, asd";
System.out.println(s);
s = s.replaceAll("(?<=\")([^\"]+?),([^\"]+?)(?=\")", "$1 $2");
s = s.replaceAll("\"", "");
System.out.println(s);
yields
asd,asdasd,asd, "asdasdasd,asdasdasd", asdasd, asd
asd,asdasd,asd, asdasdasd asdasdasd, asdasd, asd
which, from my understanding, is the preprocessing you require for your tokenizer-code to work. Hope this helps.

While StringTokenizer might not handle this natively, a few lines of code will do it. This is probably not the most efficient approach, but it should get the idea across:
while (tk.hasMoreTokens()) {
String token = tk.nextToken();
/* If the item starts with a quote, keep consuming tokens until we
* find the closing quote
*/
if (token.startsWith("\"")) {
while (tk.hasMoreTokens() && !token.endsWith("\"")) {
// append the next token to ours. Don't forget to retain commas!
token += "," + tk.nextToken();
}
if (!token.endsWith("\"")) {
// open quote found but no close quote. Error out.
throw new IllegalArgumentException("Incomplete string:" + token);
}
// remove leading and trailing quotes
token = token.substring(1, token.length() - 1);
}
}

As you can see in the class description, Oracle discourages the use of StringTokenizer.
Instead of the tokenizer I would use the String split() method, which accepts a regular expression as its argument and significantly reduces your code.
String str = "Essie,Vaill,\"Litronic , Industries\",14225 Hancock Dr,Anchorage,Anchorage,AK,99515,907-345-0962,907-345-1215,essie@vaill.com,http://www.essievaill.com";
String[] strs = str.split("(?<! ),(?! )");
List<String> list = new ArrayList<String>(strs.length);
for(int i = 0; i < strs.length; i++) list.add(strs[i]);
Just pay attention to the regex: it splits only on commas that have no space directly before or after them, so it assumes that a comma inside a quoted field is always adjacent to a space and that a delimiter comma never is.

Filter words from string

I want to filter a string.
Basically when someone types a message, I want certain words to be filtered out, like this:
User types: hey guys lol omg -omg mkdj*Omg*ndid
I want the filter to run and:
Output: hey guys lol - mkdjndid
And I need the filtered words to be loaded from an ArrayList that contains several words to filter out. At the moment I am doing if(message.contains("omg")), but that doesn't work if someone types zomg or -omg or similar.

Use replaceAll with a regex built from the bad word:
message = message.replaceAll("(?i)\\b[^\\w -]*" + badWord + "[^\\w -]*\\b", "");
This passes your test case:
public static void main( String[] args ) {
List<String> badWords = Arrays.asList( "omg", "black", "white" );
String message = "hey guys lol omg -omg mkdj*Omg*ndid";
for ( String badWord : badWords ) {
message = message.replaceAll("(?i)\\b[^\\w -]*" + badWord + "[^\\w -]*\\b", "");
}
System.out.println( message );
}

try:
input.replaceAll("(\\*?)[oO][mM][gG](\\*?)", "").split(" ")

Dave gave you the answer already, but I will emphasize the point here. You will face a problem if you implement your algorithm with a simple for-loop that just replaces every occurrence of the filtered word. For example, if you filter the word ass inside the word 'classic' and replace it with 'butt', the resulting word is 'clbuttic', which doesn't make any sense. Thus, I would suggest using a word list, like the ones stored on Linux under the /usr/share/dict/ directory, to check whether the word is valid or needs filtering.
I don't quite get what you are trying to do.
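One simple way to sidestep the 'clbuttic' effect, short of a dictionary lookup, is to replace only whole words. The sketch below is an illustration rather than the approach of any answer here: the class name WholeWordFilter is made up, \b restricts each match to whole words, and Pattern.quote keeps a list entry from being interpreted as regex syntax.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class WholeWordFilter {
    static String filter(String message, List<String> badWords) {
        for (String badWord : badWords) {
            // (?i) ignores case; \b on both sides requires a word boundary,
            // so the word is not removed from inside a longer word.
            String regex = "(?i)\\b" + Pattern.quote(badWord) + "\\b";
            message = message.replaceAll(regex, "");
        }
        return message;
    }

    public static void main(String[] args) {
        List<String> badWords = Arrays.asList("ass", "omg");
        // "classic" is left intact, the standalone "OMG" is removed
        System.out.println(filter("a classic OMG moment", badWords));
    }
}
```

The trade-off is that embedded variants such as zomg are no longer caught, and removing a word leaves the surrounding spaces in place.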

I ran into this same problem and solved it in the following way:
1) Keep a Google spreadsheet with all the words I want to filter out.
2) Download the spreadsheet directly into my code with the loadConfigs method (see below).
3) Replace all l33tsp33k characters with their respective letters.
4) Remove every character that is not a letter from the sentence.
5) Run an algorithm that efficiently checks all possible substrings of the input against the list. This part is key: you don't want to loop over your ENTIRE list every time to see whether a word is in it. In my case, I generate every substring of the input and look it up in a HashMap (O(1) per lookup). This way the runtime grows with the length of the input string, not with the size of the word list.
6) Check that the word is not used in combination with a good word (e.g. bass contains *ss). This list is also loaded from the spreadsheet.
7) In our case we also post the filtered words to Slack, but you can obviously remove that line.
We are using this in our own games and it's working like a charm. Hope you guys enjoy.
https://pimdewitte.me/2016/05/28/filtering-combinations-of-bad-words-out-of-string-inputs/
public static HashMap<String, String[]> words = new HashMap<String, String[]>();
public static void loadConfigs() {
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(new URL("https://docs.google.com/spreadsheets/d/1hIEi2YG3ydav1E06Bzf2mQbGZ12kh2fe4ISgLg_UBuM/export?format=csv").openConnection().getInputStream()));
String line = "";
int counter = 0;
while((line = reader.readLine()) != null) {
counter++;
String[] content = null;
try {
content = line.split(",");
if(content.length == 0) {
continue;
}
String word = content[0];
String[] ignore_in_combination_with_words = new String[]{};
if(content.length > 1) {
ignore_in_combination_with_words = content[1].split("_");
}
words.put(word.replaceAll(" ", ""), ignore_in_combination_with_words);
} catch(Exception e) {
e.printStackTrace();
}
}
System.out.println("Loaded " + counter + " words to filter out");
} catch (IOException e) {
e.printStackTrace();
}
}
/**
* Iterates over a String input and checks whether a cuss word was found in a list, then checks if the word should be ignored (e.g. bass contains the word *ss).
* @param input
* @return
*/
public static ArrayList<String> badWordsFound(String input) {
if(input == null) {
return new ArrayList<>();
}
// remove leetspeak
input = input.replaceAll("1","i");
input = input.replaceAll("!","i");
input = input.replaceAll("3","e");
input = input.replaceAll("4","a");
input = input.replaceAll("@","a");
input = input.replaceAll("5","s");
input = input.replaceAll("7","t");
input = input.replaceAll("0","o");
ArrayList<String> badWords = new ArrayList<>();
input = input.toLowerCase().replaceAll("[^a-zA-Z]", "");
for(int i = 0; i < input.length(); i++) {
for(int fromIOffset = 1; fromIOffset < (input.length()+1 - i); fromIOffset++) {
String wordToCheck = input.substring(i, i + fromIOffset);
if(words.containsKey(wordToCheck)) {
// for example, if you want to say the word bass, that should be possible.
String[] ignoreCheck = words.get(wordToCheck);
boolean ignore = false;
for(int s = 0; s < ignoreCheck.length; s++ ) {
if(input.contains(ignoreCheck[s])) {
ignore = true;
break;
}
}
if(!ignore) {
badWords.add(wordToCheck);
}
}
}
}
for(String s: badWords) {
Server.getSlackManager().queue(s + " qualified as a bad word in a username");
}
return badWords;
}
