Splitting a string in Java gives different results than expected

Hi, I am new to Java and am trying to use the split method that String provides.
The input is a String in the following format:
broadcast message "Shubham Agiwal"
The desired output is an array with the following elements:
["broadcast","message","Shubham Agiwal"]
My code is as follows:
String str="broadcast message \"Shubham Agiwal\"";
for(int i=0;i<str.split(" ").length;i++){
System.out.println(str.split(" ")[i]);
}
The output I obtained from the above code is
["broadcast","message","\"Shubham","Agiwal\""]
Can somebody let me know what I need to change in my code to get the desired output shown above?

It is hard to split this string directly, so I replace each space that lies outside the quotes with '\t' and then split on the tab.
My code is below; you can try it. Maybe others will have a better solution, and we can discuss it too.
package com.code.stackoverflow;
/**
* Created by jiangchao on 2016/10/24.
*/
public class Main {
public static void main(String args[]) {
String str="broadcast message \"Shubham Agiwal\"";
char []chs = str.toCharArray();
StringBuilder sb = new StringBuilder();
/*
* false: means that I am out of the ""
* true: means that I am in the ""
*/
boolean flag = false;
for (Character c : chs) {
if (c == '\"') {
flag = !flag;
continue;
}
if (flag == false && c == ' ') {
sb.append("\t");
continue;
}
sb.append(c);
}
String []strs = sb.toString().split("\t");
for (String s : strs) {
System.out.println(s);
}
}
}
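The same in-quotes flag can also collect tokens directly into a list, which avoids the tab placeholder (and so does not break if the input itself contains tabs). This is only a sketch: the class name QuoteAwareSplitter and its split method are hypothetical names, not part of the answer above, and it assumes there are no escaped quotes inside a quoted section.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the name is an assumption made for this sketch.
final class QuoteAwareSplitter {

    // Splits on spaces that are outside double quotes; the quote characters are dropped.
    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean sawQuote = false; // lets an empty quoted string "" still yield a token
        for (char c : input.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
                sawQuote = true;
            } else if (c == ' ' && !inQuotes) {
                // A space outside quotes ends the current token, if there is one.
                if (current.length() > 0 || sawQuote) {
                    tokens.add(current.toString());
                }
                current.setLength(0);
                sawQuote = false;
            } else {
                current.append(c);
            }
        }
        // Flush the last token.
        if (current.length() > 0 || sawQuote) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [broadcast, message, Shubham Agiwal]
        System.out.println(split("broadcast message \"Shubham Agiwal\""));
    }
}
```

Consecutive spaces outside quotes produce no empty tokens in this version; whether that is desirable depends on the input format.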

This is tedious but it works. The only problem is that if the whitespace in quotes is a tab or other white space delimiter it gets replaced with a space character.
String str = "broadcast message \"Shubham Agiwal\" better \"Hello java World\"";
Scanner scanner = new Scanner(str).useDelimiter("\\s");
while(scanner.hasNext()) {
String token = scanner.next();
if ( token.startsWith("\"")) { //Concatenate until we see a closing quote
token = token.substring(1);
String nextTokenInQuotes = null;
do {
nextTokenInQuotes = scanner.next();
token += " ";
token += nextTokenInQuotes;
}while(!nextTokenInQuotes.endsWith("\""));
token = token.substring(0,token.length()-1); //Get rid of trailing quote
}
System.out.println("Token is:" + token);
}
This produces the following output:
Token is:broadcast
Token is:message
Token is:Shubham Agiwal
Token is:better
Token is:Hello java World

public static void main(String[] arg){
String str = "broadcast message \"Shubham Agiwal\"";
//First split
String strs[] = str.split("\\s\"");
//Second split for the first part(Key part)
String[] first = strs[0].split(" ");
for(String st:first){
System.out.println(st);
}
//Append " in front of the last part(Value part)
System.out.println("\""+strs[1]);
}

Related

Filter words from string

I want to filter a string.
Basically when someone types a message, I want certain words to be filtered out, like this:
User types: hey guys lol omg -omg mkdj*Omg*ndid
I want the filter to run and:
Output: hey guys lol - mkdjndid
And I need the filtered words to be loaded from an ArrayList that contains several words to filter out. Now at the moment I am doing if(message.contains(omg)) but that doesn't work if someone types zomg or -omg or similar.
Use replaceAll with a regex built from the bad word:
message = message.replaceAll("(?i)\\b[^\\w -]*" + badWord + "[^\\w -]*\\b", "");
This passes your test case:
public static void main( String[] args ) {
List<String> badWords = Arrays.asList( "omg", "black", "white" );
String message = "hey guys lol omg -omg mkdj*Omg*ndid";
for ( String badWord : badWords ) {
message = message.replaceAll("(?i)\\b[^\\w -]*" + badWord + "[^\\w -]*\\b", "");
}
System.out.println( message );
}
try:
input.replaceAll("(\\*?)[oO][mM][gG](\\*?)", "").split(" ")
Dave gave you the answer already, but I will emphasize the point here. You will face a problem if you implement your algorithm with a simple for-loop that just replaces every occurrence of the filtered word. As an example, if you filter the word ass in the word 'classic' and replace it with 'butt', the resulting word will be 'clbuttic', which doesn't make any sense. Thus, I would suggest using a word list, like the ones stored on Linux under the /usr/share/dict/ directory, to check whether the word is valid or needs filtering.
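To make the 'clbuttic' problem concrete, here is a small sketch that compares a naive substring replacement with a replacement anchored on regex word boundaries (\b). Note that this is a simpler technique than the dictionary lookup suggested above, not an implementation of it, and the class and method names are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are assumptions made for this sketch.
final class WordBoundaryFilter {

    // Replaces badWord only where it stands as a whole word, ignoring case.
    static String replaceWholeWord(String text, String badWord, String replacement) {
        String regex = "(?i)\\b" + Pattern.quote(badWord) + "\\b";
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String text = "a classic ass";
        // Naive substring replacement also rewrites the inside of "classic".
        System.out.println(text.replace("ass", "butt"));           // prints a clbuttic butt
        // The word-boundary version leaves "classic" alone.
        System.out.println(replaceWholeWord(text, "ass", "butt")); // prints a classic butt
    }
}
```

The trade-off is that a whole-word match will not catch a bad word glued to other letters (such as zomg), which is exactly the case where a list of known good words helps.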
I don't quite get what you are trying to do.
I ran into this same problem and solved it in the following way:
1) Have a google spreadsheet with all words that I want to filter out
2) Directly download the google spreadsheet into my code with the loadConfigs method (see below)
3) Replace all l33tsp33k characters with their respective alphabet letter
4) Replace all special characters but letters from the sentence
5) Run an algorithm that efficiently checks all possible substrings of the input against the list. Note that this part is key: you don't want to loop over your ENTIRE list every time to see if your word is in it. In my case, I generated every substring of the input and checked it against a HashMap (O(1) lookup). This way the runtime grows with the length of the input, not with the size of the list.
6) Check if the word is not used in combination with a good word (e.g. bass contains *ss). This is also loaded through the spreadsheet
7) In our case we also post the filtered words to Slack, but you can obviously remove that line.
We are using this in our own games and it's working like a charm. Hope you guys enjoy.
https://pimdewitte.me/2016/05/28/filtering-combinations-of-bad-words-out-of-string-inputs/
public static HashMap<String, String[]> words = new HashMap<String, String[]>();
public static void loadConfigs() {
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(new URL("https://docs.google.com/spreadsheets/d/1hIEi2YG3ydav1E06Bzf2mQbGZ12kh2fe4ISgLg_UBuM/export?format=csv").openConnection().getInputStream()));
String line = "";
int counter = 0;
while((line = reader.readLine()) != null) {
counter++;
String[] content = null;
try {
content = line.split(",");
if(content.length == 0) {
continue;
}
String word = content[0];
String[] ignore_in_combination_with_words = new String[]{};
if(content.length > 1) {
ignore_in_combination_with_words = content[1].split("_");
}
words.put(word.replaceAll(" ", ""), ignore_in_combination_with_words);
} catch(Exception e) {
e.printStackTrace();
}
}
System.out.println("Loaded " + counter + " words to filter out");
} catch (IOException e) {
e.printStackTrace();
}
}
/**
* Iterates over a String input and checks whether a cuss word was found in a list, then checks if the word should be ignored (e.g. bass contains the word *ss).
* @param input
* @return
*/
public static ArrayList<String> badWordsFound(String input) {
if(input == null) {
return new ArrayList<>();
}
// remove leetspeak
input = input.replaceAll("1","i");
input = input.replaceAll("!","i");
input = input.replaceAll("3","e");
input = input.replaceAll("4","a");
input = input.replaceAll("@","a");
input = input.replaceAll("5","s");
input = input.replaceAll("7","t");
input = input.replaceAll("0","o");
ArrayList<String> badWords = new ArrayList<>();
input = input.toLowerCase().replaceAll("[^a-zA-Z]", "");
for(int i = 0; i < input.length(); i++) {
for(int fromIOffset = 1; fromIOffset < (input.length()+1 - i); fromIOffset++) {
String wordToCheck = input.substring(i, i + fromIOffset);
if(words.containsKey(wordToCheck)) {
// for example, if you want to say the word bass, that should be possible.
String[] ignoreCheck = words.get(wordToCheck);
boolean ignore = false;
for(int s = 0; s < ignoreCheck.length; s++ ) {
if(input.contains(ignoreCheck[s])) {
ignore = true;
break;
}
}
if(!ignore) {
badWords.add(wordToCheck);
}
}
}
}
for(String s: badWords) {
Server.getSlackManager().queue(s + " qualified as a bad word in a username");
}
return badWords;
}

Android - Editing my String so each word starts with a capital

I was wondering if someone could provide me some code or point me towards a tutorial which explains how I can convert my string so that each word begins with a capital.
I would also like to convert a different string to italics.
Basically, my app gets data from several EditText boxes; on a button click the data is pushed to the next page via an intent and concatenated into one paragraph. Therefore, I assume I need to edit my string on the initial page and make sure it is passed through in the same format.
Thanks in advance
You can use WordUtils from Apache Commons (Commons Lang, or Commons Text in newer versions). Its capitalize method will do the work.
For example:
WordUtils.capitalize("i am FINE") = "I Am FINE"
or
WordUtils.capitalizeFully("i am FINE") = "I Am Fine"
Here is a simple function
public static String capEachWord(String source){
String result = "";
String[] splitString = source.split(" ");
for(String target : splitString){
result
+= Character.toUpperCase(target.charAt(0))
+ target.substring(1) + " ";
}
return result.trim();
}
The easiest way to do this is using simple Java built-in functions.
Try something like the following (method names may not be exactly right, doing it off the top of my head):
String label = Capitalize("this is my test string");
public String Capitalize(String testString)
{
String[] brokenString = testString.split(" ");
String newString = "";
for(String s : brokenString)
{
s.charAt(0) = s.charAt(0).toUpper();
newString += s + " ";
}
return newString;
}
Give this a try, let me know if it works for you.
Just add android:inputType="textCapWords" to your EditText in the layout XML. This will make every word start with a capital letter.
Strings are immutable in Java, and String.charAt returns a value, not a reference that you can set (like in C++). Pheonixblade9's will not compile. This does what Pheonixblade9 suggests, except it compiles.
public String capitalize(String testString) {
String[] brokenString = testString.split(" ");
String newString = "";
for (String s : brokenString) {
char[] chars = s.toCharArray();
chars[0] = Character.toUpperCase(chars[0]);
newString = newString + new String(chars) + " ";
}
//the trim removes trailing whitespace
return newString.trim();
}
String source = "hello good old world";
StringBuilder res = new StringBuilder();
String[] strArr = source.split(" ");
for (String str : strArr) {
char[] stringArray = str.trim().toCharArray();
stringArray[0] = Character.toUpperCase(stringArray[0]);
str = new String(stringArray);
res.append(str).append(" ");
}
System.out.print("Result: " + res.toString().trim());

Problem replacing a String in Java

I am trying to replace a URL with a shortened URL inside of a String:
public void shortenMessage()
{
String body = composeEditText.getText().toString();
String shortenedBody = new String();
String [] tokens = body.split("\\s");
// Attempt to convert each item into an URL.
for( String token : tokens )
{
try
{
Url url = as("mycompany", "someapikey").call(shorten(token));
Log.d("SHORTENED", token + " was shortened!");
shortenedBody = body.replace(token, url.getShortUrl());
}
catch(BitlyException e)
{
//Log.d("BitlyException", token + " could not be shortened!");
}
}
composeEditText.setText(shortenedBody);
// url.getShortUrl() -> http://bit.ly/fB05
}
After the links are shortened, I want to print the modified string in an EditText. My EditText is not displaying my messages properly.
For example:
"I like www.google.com" should be "I like [some shortened url]" after my code executes.
In Java, strings are immutable. String.replace() returns a new string which is the result of the replacement. Thus, when you do shortenedBody = body.replace(token, url.getShortUrl()); in a loop, shortenedBody will hold the result of (only the very) last replace.
Here's a fix, using StringBuilder.
public void shortenMessage()
{
String body = composeEditText.getText().toString();
StringBuilder shortenedBody = new StringBuilder();
String [] tokens = body.split("\\s");
// Attempt to convert each item into an URL.
for( String token : tokens )
{
try
{
Url url = as("mycompany", "someapikey").call(shorten(token));
Log.d("SHORTENED", token + " was shortened!");
shortenedBody.append(url.getShortUrl()).append(" ");
}
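catch(BitlyException e)
{
// Not a URL (or shortening failed): keep the original token so it is not lost.
//Log.d("BitlyException", token + " could not be shortened!");
shortenedBody.append(token).append(" ");
}
}
composeEditText.setText(shortenedBody.toString().trim());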
// url.getShortUrl() -> http://bit.ly/fB05
}
You'll probably want String.replaceAll and Pattern.quote to "quote" your string before you pass it to replaceAll, which expects a regex.
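A minimal sketch of why the quoting matters: replaceAll treats its first argument as a regex, so the dots in a URL match any character unless the token is wrapped in Pattern.quote. The class name, the helper replaceLiteral, and the sample text are assumptions for illustration; the short URL is the one from the question's comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are assumptions made for this sketch.
final class LiteralReplace {

    // Quotes both arguments so that neither regex metacharacters in the token
    // nor '$' or '\' in the replacement are interpreted specially.
    static String replaceLiteral(String body, String token, String replacement) {
        return body.replaceAll(Pattern.quote(token), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String body = "I like www.google.com and wwwXgoogle.com";
        // Unquoted: "." matches any character, so both tokens are replaced.
        // prints I like http://bit.ly/fB05 and http://bit.ly/fB05
        System.out.println(body.replaceAll("www.google.com", "http://bit.ly/fB05"));
        // Quoted: only the literal token is replaced.
        // prints I like http://bit.ly/fB05 and wwwXgoogle.com
        System.out.println(replaceLiteral(body, "www.google.com", "http://bit.ly/fB05"));
    }
}
```

If no regex features are needed at all, String.replace(CharSequence, CharSequence) already replaces every occurrence literally, so the quoting step can be skipped by using it instead of replaceAll.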

How can you parse the string which has a text qualifier

How can I parse a String str = "abc, \"def,ghi\"";
such that I get the output as
String[] strs = {"abc", "\"def,ghi\""}
i.e. an array of length 2.
Should I use a regular expression, or is there a method in the Java API or any other open-source project that lets me do this?
Edited
To give context about the problem: I am reading a text file that has a list of records, one on each line. Each record has a list of fields separated by a delimiter (comma or semicolon). Now I have a requirement to support a text qualifier, something like what Excel or OpenOffice supports. Suppose I have the record
abc, "def,ghi"
Here , is my delimiter and " is my text qualifier, so when I parse this string I should get two fields, abc and def,ghi, not three ({abc,def,ghi}).
I hope this clarifies my requirement.
Thanks
Shekhar
The basic algorithm is not too complicated:
public static List<String> customSplit(String input) {
List<String> elements = new ArrayList<String>();
StringBuilder elementBuilder = new StringBuilder();
boolean isQuoted = false;
for (char c : input.toCharArray()) {
if (c == '\"') {
isQuoted = !isQuoted;
// continue; // changed according to the OP comment - \" shall not be skipped
}
if (c == ',' && !isQuoted) {
elements.add(elementBuilder.toString().trim());
elementBuilder = new StringBuilder();
continue;
}
elementBuilder.append(c);
}
elements.add(elementBuilder.toString().trim());
return elements;
}
This question seems appropriate: Split a string ignoring quoted sections
Along that line, http://opencsv.sourceforge.net/ seems appropriate.
Try this -
String str = "abc, \"def,ghi\"";
String regex = "([,]) | (^[\"\\w*,\\w*\"])";
for(String s : str.split(regex)){
System.out.println(s);
}
Try:
List<String> res = new LinkedList<String>();
String[] chunks = str.split("\\\"");
if (chunks.length % 2 == 0) {
// Mismatched escaped quotes!
}
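for (int i = 0; i < chunks.length; i++) {
if (i % 2 == 0) {
// Even chunks lie outside the quotes: split them on commas.
res.addAll(Arrays.asList(chunks[i].split(",")));
} else {
// Odd chunks were between quotes: keep them whole.
res.add(chunks[i]);
}
}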
This will only split up the portions that are not between escaped quotes.
Call trim() if you want to get rid of the whitespace.

Tokenize a string with a space in java

I want to tokenize a string like this
String line = "a=b c='123 456' d=777 e='uij yyy'";
I cannot simply split on spaces like this
String [] words = line.split(" ");
Any idea how can I split so that I get tokens like
a=b
c='123 456'
d=777
e='uij yyy';
The simplest way to do this is by hand implementing a simple finite state machine. In other words, process the string a character at a time:
When you hit a space, break off a token;
When you hit a quote keep getting characters until you hit another quote.
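The two rules above can be sketched as a single pass over the characters with one boolean of state. This is only an illustration: the class name KeyValueTokenizer is hypothetical, the quotes are kept in the output to match the tokens the question asks for, and escaped quotes are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the name is an assumption made for this sketch.
final class KeyValueTokenizer {

    // Breaks off a token at each space outside single quotes; quotes stay in the token.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false; // the only state: are we between two quotes?
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\'') {
                inQuote = !inQuote;
                current.append(c);
            } else if (c == ' ' && !inQuote) {
                // Space outside quotes: break off the token collected so far.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [a=b, c='123 456', d=777, e='uij yyy']
        System.out.println(tokenize("a=b c='123 456' d=777 e='uij yyy'"));
    }
}
```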
Depending on the formatting of your original string, you should be able to use a regular expression as a parameter to the java "split" method: Click here for an example.
The example doesn't use the regular expression that you would need for this task though.
You can also use this SO thread as a guideline (although it's in PHP) which does something very close to what you need. Manipulating that slightly might do the trick (although having quotes be part of the output or not may cause some issues). Keep in mind that regex is very similar in most languages.
Edit: going too much further into this type of task may be ahead of the capabilities of regex, so you may need to create a simple parser.
line.split(" (?=[a-z+]=)")
correctly gives:
a=b
c='123 456'
d=777
e='uij yyy'
Make sure you adapt the [a-z+] part in case your keys structure changes.
Edit: this solution can fail miserably if there is a "=" character in the value part of the pair.
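A short sketch of that failure mode, using the same split expression on one well-behaved line and one line whose quoted value contains an "=" after a space. The class name and the second sample line are made up for the demonstration.

```java
import java.util.Arrays;

// Hypothetical demo class; the name is an assumption made for this sketch.
final class LookaheadSplitDemo {

    // Same expression as above: split on a space that is followed by "<char>=".
    static String[] splitPairs(String line) {
        return line.split(" (?=[a-z+]=)");
    }

    public static void main(String[] args) {
        // prints [a=b, c='123 456', d=777]
        System.out.println(Arrays.toString(splitPairs("a=b c='123 456' d=777")));
        // The space before "x=" inside the quotes is also treated as a separator.
        // prints [a=b, c='1, x=2', d=777]
        System.out.println(Arrays.toString(splitPairs("a=b c='1 x=2' d=777")));
    }
}
```

A tokenizer that tracks whether it is inside quotes does not have this problem, because it never looks at the characters of a quoted value as potential separators.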
StreamTokenizer can help, although it is easiest to set up to break on '=', as it will always break at the start of a quoted string:
String s = "Ta=b c='123 456' d=777 e='uij yyy'";
StreamTokenizer st = new StreamTokenizer(new StringReader(s));
st.ordinaryChars('0', '9');
st.wordChars('0', '9');
while (st.nextToken() != StreamTokenizer.TT_EOF) {
switch (st.ttype) {
case StreamTokenizer.TT_NUMBER:
System.out.println(st.nval);
break;
case StreamTokenizer.TT_WORD:
System.out.println(st.sval);
break;
case '=':
System.out.println("=");
break;
default:
System.out.println(st.sval);
}
}
outputs
Ta
=
b
c
=
123 456
d
=
777
e
=
uij yyy
If you leave out the two lines that convert numeric characters to alpha, then you get d=777.0, which might be useful to you.
Assumptions:
Your variable name ('a' in the assignment 'a=b') can be of length 1 or more
Your variable name ('a' in the assignment 'a=b') can not contain the space character, anything else is fine.
Validation of your input is not required (input assumed to be in valid a=b format)
This works fine for me.
Input:
a=b abc='123 456' &=777 #='uij yyy' ABC='slk slk' 123sdkljhSDFjflsakd#*#&=456sldSLKD)#(
Output:
a=b
abc='123 456'
&=777
#='uij yyy'
ABC='slk slk'
123sdkljhSDFjflsakd#*#&=456sldSLKD)#(
Code:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
// SPACE CHARACTER followed by
// sequence of non-space characters of 1 or more followed by
// first occurring EQUALS CHARACTER
final static String regex = " [^ ]+?=";
// static pattern defined outside so that you don't have to compile it
// for each method call
static final Pattern p = Pattern.compile(regex);
public static List<String> tokenize(String input, Pattern p){
input = input.trim(); // this is important for "last token case"
// see end of method
Matcher m = p.matcher(input);
ArrayList<String> tokens = new ArrayList<String>();
int beginIndex=0;
while(m.find()){
int endIndex = m.start();
tokens.add(input.substring(beginIndex, endIndex));
beginIndex = endIndex+1;
}
// LAST TOKEN CASE
//add last token
tokens.add(input.substring(beginIndex));
return tokens;
}
private static void println(List<String> tokens) {
for(String token:tokens){
System.out.println(token);
}
}
public static void main(String args[]){
String test = "a=b " +
"abc='123 456' " +
"&=777 " +
"#='uij yyy' " +
"ABC='slk slk' " +
"123sdkljhSDFjflsakd#*#&=456sldSLKD)#(";
List<String> tokens = RegexTest.tokenize(test, p);
println(tokens);
}
}
Or, with a regex for tokenizing, and a little state machine that just adds the key/val to a map:
String line = "a = b c='123 456' d=777 e = 'uij yyy'";
Map<String,String> keyval = new HashMap<String,String>();
String state = "key";
Matcher m = Pattern.compile("(=|'[^']*?'|[^\\s=]+)").matcher(line);
String key = null;
while (m.find()) {
String found = m.group();
if (state.equals("key")) {
if (found.equals("=") || found.startsWith("'"))
{ System.err.println ("ERROR"); }
else { key = found; state = "equals"; }
} else if (state.equals("equals")) {
if (! found.equals("=")) { System.err.println ("ERROR"); }
else { state = "value"; }
} else if (state.equals("value")) {
if (key == null) { System.err.println ("ERROR"); }
else {
if (found.startsWith("'"))
found = found.substring(1,found.length()-1);
keyval.put (key, found);
key = null;
state = "key";
}
}
}
if (! state.equals("key")) { System.err.println ("ERROR"); }
System.out.println ("map: " + keyval);
prints out
map: {d=777, e=uij yyy, c=123 456, a=b}
It does some basic error checking, and takes the quotes off the values.
This solution is both general and compact (it is effectively the regex version of cletus' answer):
String line = "a=b c='123 456' d=777 e='uij yyy'";
Matcher m = Pattern.compile("('[^']*?'|\\S)+").matcher(line);
while (m.find()) {
System.out.println(m.group()); // or whatever you want to do
}
In other words, find all runs of characters that are combinations of quoted strings or non-space characters; nested quotes are not supported (there is no escape character).
public static void main(String[] args) {
String token;
String value="";
HashMap<String, String> attributes = new HashMap<String, String>();
String line = "a=b c='123 456' d=777 e='uij yyy'";
StringTokenizer tokenizer = new StringTokenizer(line," ");
while(tokenizer.hasMoreTokens()){
token = tokenizer.nextToken();
value = token.contains("'") ? value + " " + token : token ;
if(!value.contains("'") || value.endsWith("'")) {
//Split the strings and get variables into hashmap
attributes.put(value.split("=")[0].trim(),value.split("=")[1]);
value ="";
}
}
System.out.println(attributes);
}
output:
{d=777, a=b, e='uij yyy', c='123 456'}
In this case, consecutive spaces inside a value are collapsed to a single space.
Here the attributes HashMap contains the values.
import java.io.*;
import java.util.Scanner;
public class ScanXan {
public static void main(String[] args) throws IOException {
Scanner s = null;
try {
s = new Scanner(new BufferedReader(new FileReader("<file name>")));
while (s.hasNext()) {
System.out.println(s.next());
// write the token to the output file here
}
} finally {
if (s != null) {
s.close();
}
}
}
}
java.util.StringTokenizer tokenizer = new java.util.StringTokenizer(line, " ");
while (tokenizer.hasMoreTokens()) {
String token = tokenizer.nextToken();
int index = token.indexOf('=');
String key = token.substring(0, index);
String value = token.substring(index + 1);
}
Have you tried splitting by '=' and creating a token out of each pair of the resulting array?
