Tokenize a string with a space in java - java

I want to tokenize a string like this
String line = "a=b c='123 456' d=777 e='uij yyy'";
I cannot simply split on spaces like this
String [] words = line.split(" ");
Any idea how I can split it so that I get tokens like
a=b
c='123 456'
d=777
e='uij yyy'

The simplest way to do this is by hand implementing a simple finite state machine. In other words, process the string a character at a time:
When you hit a space, break off a token;
When you hit a quote keep getting characters until you hit another quote.
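The two rules above can be sketched as a small hand-written state machine. This is only a minimal illustration, assuming single quotes never nest and there is no escape character; the class and method names (QuoteAwareTokenizer, tokenize) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareTokenizer {

    /** Splits on spaces, except for spaces that sit inside single quotes. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;              // the only state this machine needs

        for (char c : line.toCharArray()) {
            if (c == '\'') {
                inQuotes = !inQuotes;          // entering or leaving a quoted section
                current.append(c);             // keep the quote as part of the token
            } else if (c == ' ' && !inQuotes) {
                if (current.length() > 0) {    // a space outside quotes ends the token
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {            // flush the final token
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("a=b c='123 456' d=777 e='uij yyy'")) {
            System.out.println(token);
        }
    }
}
```

For the string from the question this prints a=b, c='123 456', d=777 and e='uij yyy', one per line.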

Depending on the formatting of your original string, you should be able to use a regular expression as a parameter to the java "split" method: Click here for an example.
The example doesn't use the regular expression that you would need for this task though.
You can also use this SO thread as a guideline (although it's in PHP) which does something very close to what you need. Manipulating that slightly might do the trick (although having quotes be part of the output or not may cause some issues). Keep in mind that regex is very similar in most languages.
Edit: going too much further into this type of task may be ahead of the capabilities of regex, so you may need to create a simple parser.

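line.split(" (?=[a-z]+=)")
correctly gives:
a=b
c='123 456'
d=777
e='uij yyy'
The lookahead splits only on a space that is directly followed by a key and "=". Make sure you adapt the [a-z]+ part in case your keys structure changes.
Edit: this solution can fail miserably if there is a "=" character in the value part of the pair (for example, a quoted value such as 'x y=z').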

StreamTokenizer can help, although it is easiest to set up to break on '=', as it will always break at the start of a quoted string:
String s = "Ta=b c='123 456' d=777 e='uij yyy'";
StreamTokenizer st = new StreamTokenizer(new StringReader(s));
st.ordinaryChars('0', '9');
st.wordChars('0', '9');
while (st.nextToken() != StreamTokenizer.TT_EOF) {
switch (st.ttype) {
case StreamTokenizer.TT_NUMBER:
System.out.println(st.nval);
break;
case StreamTokenizer.TT_WORD:
System.out.println(st.sval);
break;
case '=':
System.out.println("=");
break;
default:
System.out.println(st.sval);
}
}
outputs
Ta
=
b
c
=
123 456
d
=
777
e
=
uij yyy
If you leave out the two lines that convert numeric characters to alpha, then you get d=777.0, which might be useful to you.

Assumptions:
Your variable name ('a' in the assignment 'a=b') can be of length 1 or more
Your variable name ('a' in the assignment 'a=b') can not contain the space character, anything else is fine.
Validation of your input is not required (input assumed to be in valid a=b format)
This works fine for me.
Input:
a=b abc='123 456' &=777 #='uij yyy' ABC='slk slk' 123sdkljhSDFjflsakd#*#&=456sldSLKD)#(
Output:
a=b
abc='123 456'
&=777
#='uij yyy'
ABC='slk slk'
123sdkljhSDFjflsakd#*#&=456sldSLKD)#(
Code:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
// SPACE CHARACTER followed by
// sequence of non-space characters of 1 or more followed by
// first occurring EQUALS CHARACTER
final static String regex = " [^ ]+?=";
// static pattern defined outside so that you don't have to compile it
// for each method call
static final Pattern p = Pattern.compile(regex);
public static List<String> tokenize(String input, Pattern p){
input = input.trim(); // this is important for "last token case"
// see end of method
Matcher m = p.matcher(input);
ArrayList<String> tokens = new ArrayList<String>();
int beginIndex=0;
while(m.find()){
int endIndex = m.start();
tokens.add(input.substring(beginIndex, endIndex));
beginIndex = endIndex+1;
}
// LAST TOKEN CASE
//add last token
tokens.add(input.substring(beginIndex));
return tokens;
}
private static void println(List<String> tokens) {
for(String token:tokens){
System.out.println(token);
}
}
public static void main(String args[]){
String test = "a=b " +
"abc='123 456' " +
"&=777 " +
"#='uij yyy' " +
"ABC='slk slk' " +
"123sdkljhSDFjflsakd#*#&=456sldSLKD)#(";
List<String> tokens = RegexTest.tokenize(test, p);
println(tokens);
}
}

Or, with a regex for tokenizing, and a little state machine that just adds the key/val to a map:
String line = "a = b c='123 456' d=777 e = 'uij yyy'";
Map<String,String> keyval = new HashMap<String,String>();
String state = "key";
Matcher m = Pattern.compile("(=|'[^']*?'|[^\\s=]+)").matcher(line);
String key = null;
while (m.find()) {
String found = m.group();
if (state.equals("key")) {
if (found.equals("=") || found.startsWith("'"))
{ System.err.println ("ERROR"); }
else { key = found; state = "equals"; }
} else if (state.equals("equals")) {
if (! found.equals("=")) { System.err.println ("ERROR"); }
else { state = "value"; }
} else if (state.equals("value")) {
if (key == null) { System.err.println ("ERROR"); }
else {
if (found.startsWith("'"))
found = found.substring(1,found.length()-1);
keyval.put (key, found);
key = null;
state = "key";
}
}
}
if (! state.equals("key")) { System.err.println ("ERROR"); }
System.out.println ("map: " + keyval);
prints out
map: {d=777, e=uij yyy, c=123 456, a=b}
It does some basic error checking, and takes the quotes off the values.

This solution is both general and compact (it is effectively the regex version of cletus' answer):
String line = "a=b c='123 456' d=777 e='uij yyy'";
Matcher m = Pattern.compile("('[^']*?'|\\S)+").matcher(line);
while (m.find()) {
System.out.println(m.group()); // or whatever you want to do
}
In other words, find all runs of characters that are combinations of quoted strings or non-space characters; nested quotes are not supported (there is no escape character).

public static void main(String[] args) {
String token;
String value="";
HashMap<String, String> attributes = new HashMap<String, String>();
String line = "a=b c='123 456' d=777 e='uij yyy'";
StringTokenizer tokenizer = new StringTokenizer(line," ");
while(tokenizer.hasMoreTokens()){
token = tokenizer.nextToken();
value = token.contains("'") ? value + " " + token : token ;
if(!value.contains("'") || value.endsWith("'")) {
//Split the strings and get variables into hashmap
attributes.put(value.split("=")[0].trim(),value.split("=")[1]);
value ="";
}
}
System.out.println(attributes);
}
output:
{d=777, a=b, e='uij yyy', c='123 456'}
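With this approach, consecutive spaces inside a quoted value are collapsed to a single space, because StringTokenizer discards the delimiters. It also assumes that a quoted value consists of at most two space-separated words; a longer quoted value would break the loop.
The attributes HashMap contains the resulting key/value pairs.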

import java.io.*;
import java.util.Scanner;
public class ScanXan {
public static void main(String[] args) throws IOException {
Scanner s = null;
try {
s = new Scanner(new BufferedReader(new FileReader("<file name>")));
while (s.hasNext()) {
System.out.println(s.next());
<write for output file>
}
} finally {
if (s != null) {
s.close();
}
}
}
}

java.util.StringTokenizer tokenizer = new java.util.StringTokenizer(line, " ");
while (tokenizer.hasMoreTokens()) {
String token = tokenizer.nextToken();
int index = token.indexOf('=');
String key = token.substring(0, index);
String value = token.substring(index + 1);
}

Have you tried splitting by '=' and creating a token out of each pair of the resulting array?
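The suggestion above could look roughly like the following sketch. It assumes well-formed input in which values never contain '=' and keys never contain spaces; the class name SplitOnEquals is hypothetical. After splitting on '=', every middle piece holds the previous value followed by the next key, so it is cut at its last space.

```java
import java.util.ArrayList;
import java.util.List;

class SplitOnEquals {

    /** Rebuilds key=value tokens from the pieces produced by split("="). */
    static List<String> tokenize(String line) {
        String[] parts = line.trim().split("=");
        List<String> tokens = new ArrayList<>();
        if (parts.length < 2) {
            return tokens;                          // no key=value pair at all
        }
        String key = parts[0].trim();
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i];
            if (i == parts.length - 1) {
                tokens.add(key + "=" + part.trim()); // last piece is just a value
            } else {
                // middle piece: "<value> <next key>", cut at the last space
                int lastSpace = part.lastIndexOf(' ');
                if (lastSpace < 0) {
                    throw new IllegalArgumentException("Malformed input near: " + part);
                }
                tokens.add(key + "=" + part.substring(0, lastSpace).trim());
                key = part.substring(lastSpace + 1);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a=b c='123 456' d=777 e='uij yyy'"));
    }
}
```

Like the lookahead-based split, this breaks as soon as a value contains an '=' character.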

Related

How to get exact match keyword from the given string using java?

I'm trying to match exactly the AdvanceJava keyword against the given inputText string, but the code executes both the if and the else branch, whereas I want only the AdvanceJava keyword to be matched.
String inputText = ("iwanttoknowrelatedtoAdvancejava").toLowerCase().replaceAll("\\s", "");
String match = "java";
List keywordsList = new ArrayList<>();//where keywordsList{advance,core,programming} -> keywordlist fetch
// from database
Enumeration e = Collections.enumeration(keywordsList);
int size = keywordsList.size();
while (e.hasMoreElements()) {
for (int i = 0; i < size; i++) {
String s1 = (String) keywordsList.get(i);
if (inputText.contains(s1) && inputText.contains(match)) {
System.out.println("Yes we providing " + s1);
} else if (!inputText.contains(s1) && inputText.contains(match)) {
System.out.println("Yes we are working on java");
}
}
break;
}
Thanks
You can do this simply by using the Pattern and Matcher classes:
Pattern p = Pattern.compile("java");
Matcher m = p.matcher("Print this");
m.find();
If you want to find multiple matches in a line, you can call find() and group() repeatedly to extract them all.
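As a rough sketch of calling find() and group() repeatedly, the helper below collects every match in a string. The class name, the sample pattern and the sample text are invented for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllMatches {

    /** Returns every non-overlapping match of regex in text, in order. */
    static List<String> findAll(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> matches = new ArrayList<>();
        while (m.find()) {              // each call moves on to the next match
            matches.add(m.group());     // the text matched by that call
        }
        return matches;
    }

    public static void main(String[] args) {
        // "java", optionally preceded by "advance" or "core", ignoring case
        System.out.println(findAll("(?i)(advance|core)?java",
                "I know corejava, Advancejava and java"));
    }
}
```

Each find() resumes after the end of the previous match, so the loop ends once no further match exists.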
Here's how you can achieve what you seek using pattern matching.
In the first example I have taken your input text as it is. This only improves your algorithm which has O(n^2) performance.
String inputText = ("iwanttoknowrelatedtoAdvancejava").toLowerCase().replaceAll("\\s", "");
String match = "java";
List<String> keywordsList = Arrays.asList("advance", "core", "programming");
for (String keyword : keywordsList) {
Pattern p = Pattern.compile(keyword.concat(match));
Matcher m = p.matcher(inputText);
//System.out.println(m.find());
if (m.find()) {
System.out.println("Yes we are providing " + keyword.concat(match));
}
}
But we can improve this into a better implementation. Here's a more generic version of the above. This code doesn't manipulate the input text before matching; instead, it uses a more generic regular expression that ignores whitespace between the keyword and the match word and matches case-insensitively.
String inputText = "i want to know related to Advance java";
String match = "java";
List<String> keywordsList = Arrays.asList("advance", "core", "programming");
for (String keyword : keywordsList) {
Pattern p = Pattern.compile(MessageFormat.format("(?i)({0}\\s*{1})", keyword, match));
Pattern p1 = Pattern.compile(MessageFormat.format("(?i)({0})", match));
Matcher m = p.matcher(inputText);
Matcher m1 = p1.matcher(inputText);
//System.out.println(m.find());
if(m.find()) {
System.out.println("Yes we are providing " + keyword.concat(match));
} else if(m1.find()) {
System.out.println("Yes we are working with " + match);
}
}
@sithum - Thanks, but it executes both branches of the if/else in the output. Please refer to the screenshot I attached here.
I applied the following logic and it works fine. Please refer to it. Thanks.
String inputText = ("iwanttoknowrelatedtoAdvancejava").toLowerCase().replaceAll("\\s", "");
String match = "java";
List<String> keywordsList = session.createSQLQuery("SELECT QUESTIONARIES_RAISED FROM QUERIES").list(); // Fetch values from database (advance,core,programming)
String uniqueKeyword=null;
String commonKeyword= null;
int size =keywordsList.size();
for(int i=0;i<size;i++){
String s1 = (String) keywordsList.get(i);//get values one by one from list
if(inputText.contains(match)){
if(inputText.contains(s1) && inputText.contains(match)){
Queries q1 = new Queries();
q1.setQuestionariesRaised(s1); //set matched keyword to getter setter method
keywordsList1=session.createQuery("from Queries sentence where questionariesRaised='"+q1.getQuestionariesRaised()+"'").list(); // based on matched keyword fetch according to matched keyword sentence which stored in database
for(Queries ob : keywordsList1){
uniqueKeyword= ob.getSentence().toString();// Store fetched sentence to on string variable
}
break;
}else {
commonKeyword= "java only";
}
}
}
if(uniqueKeyword!= null){
System.out.println("Yes we providing......................" + uniqueKeyword);
}else if(commonKeyword!= null){
System.out.println("Yes we providing " + commonKeyword);
}else{
}

Splitting the string in java is giving different results than expected [duplicate]

This question already has answers here:
Split string on spaces in Java, except if between quotes (i.e. treat \"hello world\" as one token) [duplicate]
(1 answer)
Tokenizing a String but ignoring delimiters within quotes
(14 answers)
Closed 6 years ago.
Hi I am new to Java and trying to use the split method provided by java.
The input is a String in the following format
broadcast message "Shubham Agiwal"
The desired output requirement is to get an array with the following elements
["broadcast","message","Shubham Agiwal"]
My code is as follows
String str="broadcast message \"Shubham Agiwal\"";
for(int i=0;i<str.split(" ").length;i++){
System.out.println(str.split(" ")[i]);
}
The output I obtained from the above code is
["broadcast","message","\"Shubham","Agiwal\""]
Can somebody let me know what I need to change in my code to get the desired output mentioned above?
It is hard to split this string directly. So I replace every space that is outside the quotes with '\t' and then split on '\t'. My code is below; you can try it. Others may have a better solution, and we can discuss that too.
package com.code.stackoverflow;
/**
* Created by jiangchao on 2016/10/24.
*/
public class Main {
public static void main(String args[]) {
String str="broadcast message \"Shubham Agiwal\"";
char []chs = str.toCharArray();
StringBuilder sb = new StringBuilder();
/*
* false: means that I am out of the ""
* true: means that I am in the ""
*/
boolean flag = false;
for (Character c : chs) {
if (c == '\"') {
flag = !flag;
continue;
}
if (flag == false && c == ' ') {
sb.append("\t");
continue;
}
sb.append(c);
}
String []strs = sb.toString().split("\t");
for (String s : strs) {
System.out.println(s);
}
}
}
This is tedious but it works. The only problem is that if the whitespace in quotes is a tab or other white space delimiter it gets replaced with a space character.
String str = "broadcast message \"Shubham Agiwal\" better \"Hello java World\"";
Scanner scanner = new Scanner(str).useDelimiter("\\s");
while(scanner.hasNext()) {
String token = scanner.next();
if ( token.startsWith("\"")) { //Concatenate until we see a closing quote
token = token.substring(1);
String nextTokenInQuotes = null;
do {
nextTokenInQuotes = scanner.next();
token += " ";
token += nextTokenInQuotes;
}while(!nextTokenInQuotes.endsWith("\""));
token = token.substring(0,token.length()-1); //Get rid of trailing quote
}
System.out.println("Token is:" + token);
}
This produces the following output:
Token is:broadcast
Token is:message
Token is:Shubham Agiwal
Token is:better
Token is:Hello java World
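The whitespace limitation mentioned above can be avoided by matching tokens instead of splitting on delimiters. The following is a sketch only, assuming double quotes are always balanced and never escaped; QuotedSplitter is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedSplitter {

    // Either a double-quoted run (group 1 holds its contents)
    // or a run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|\\S+");

    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // group(1) is non-null only when the quoted alternative matched
            tokens.add(m.group(1) != null ? m.group(1) : m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : split("broadcast message \"Shubham Agiwal\"")) {
            System.out.println(token);
        }
    }
}
```

Because the quoted part is taken verbatim from the input, tabs and repeated spaces inside the quotes are preserved.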
public static void main(String[] arg){
String str = "broadcast message \"Shubham Agiwal\"";
//First split
String strs[] = str.split("\\s\"");
//Second split for the first part(Key part)
String[] first = strs[0].split(" ");
for(String st:first){
System.out.println(st);
}
//Append " in front of the last part(Value part)
System.out.println("\""+strs[1]);
}

Java Split method strings into method name and argument

I am writing a small programming language for a game I am making. This language will allow users to define their own spells for the wizard entity outside the internal game code. I have the language written down, but I'm not entirely sure how to change a string like
setSpellName("Fireball")
setSplashDamage(32,5)
into an array which would have the method name and the arguments after it, like
{"setSpellName","Fireball"}
{"setSplashDamage","32","5"}
How could I do this using Java's String.split or regular expressions?
Thanks in advance.
Since you're only interested in the function name and parameters I'd suggest scanning up to the first instance of ( and then to the last ) for the params, as so.
String input = "setSpellName(\"Fireball\")";
String functionName = input.substring(0, input.indexOf('('));
String[] params = input.substring(input.indexOf('(') + 1, input.lastIndexOf(')')).split(",");
To capture the String
setSpellName("Fireball")
Do something like this:
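String[] line = argument.split("\\("); // split takes a regex, so the parenthesis must be escaped
Gets you "setSpellName" at line[0] and "Fireball") at line[1]
Get rid of the last parenthesis like this (replace takes a literal string, so nothing needs escaping)
line[1] = line[1].replace(")", "").trim();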
Build your JSON with the two "cleaned" Strings.
There's probably a better way with Regex, but this is the quick and dirty way.
With String.indexOf() and String.substring(), you can parse out the function and parameters. Once you have parsed them out, wrap each of them in quotes. Then combine them all back together, delimited by commas and wrapped in curly braces.
public static void main(String[] args) throws Exception {
List<String> commands = new ArrayList() {{
add("setSpellName(\"Fireball\")");
add("setSplashDamage(32,5)");
}};
for (String command : commands) {
int openParen = command.indexOf("(");
String function = String.format("\"%s\"", command.substring(0, openParen));
String[] parameters = command.substring(openParen + 1, command.indexOf(")")).split(",");
for (int i = 0; i < parameters.length; i++) {
// Surround parameter with double quotes
if (!parameters[i].startsWith("\"")) {
parameters[i] = String.format("\"%s\"", parameters[i]);
}
}
String combine = String.format("{%s,%s}", function, String.join(",", parameters));
System.out.println(combine);
}
}
Results:
{"setSpellName","Fireball"}
{"setSplashDamage","32","5"}
This is a solution using regex, use this Regex "([\\w]+)\\(\"?([\\w]+)\"?\\)":
String input = "setSpellName(\"Fireball\")";
String pattern = "([\\w]+)\\(\"?([\\w]+)\"?\\)";
Pattern r = Pattern.compile(pattern);
String[] matches;
Matcher m = r.matcher(input);
if (m.find()) {
System.out.println("Found value: " + m.group(1));
System.out.println("Found value: " + m.group(2));
String[] params = m.group(2).split(",");
if (params.length > 1) {
matches = new String[params.length + 1];
matches[0] = m.group(1);
System.out.println(params.length);
for (int i = 0; i < params.length; i++) {
matches[i + 1] = params[i];
}
System.out.println(String.join(" :: ", matches));
} else {
matches = new String[2];
matches[0] = m.group(1);
matches[1] = m.group(2);
System.out.println(String.join(", ", matches));
}
}
([\\w]+) is the first group to get the function name.
\\(\"?([\\w]+)\"?\\) is the second group to get the parameters.
This is a Working DEMO.
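The pattern above accepts only a single word between the parentheses, so a call such as setSplashDamage(32,5) does not match it. A looser sketch is shown below; it captures everything between the first '(' and the last ')' and then splits on commas. It assumes arguments never contain commas or parentheses themselves, and the name CallParser is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CallParser {

    // group 1: the method name, group 2: the raw argument list
    private static final Pattern CALL =
            Pattern.compile("\\s*(\\w+)\\s*\\((.*)\\)\\s*");

    /** Returns the method name followed by its arguments, quotes removed. */
    static List<String> parse(String line) {
        Matcher m = CALL.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a call: " + line);
        }
        List<String> result = new ArrayList<>();
        result.add(m.group(1));
        String args = m.group(2).trim();
        if (!args.isEmpty()) {
            for (String arg : args.split(",")) {
                arg = arg.trim();
                // strip one pair of surrounding double quotes, if present
                if (arg.length() >= 2 && arg.startsWith("\"") && arg.endsWith("\"")) {
                    arg = arg.substring(1, arg.length() - 1);
                }
                result.add(arg);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("setSpellName(\"Fireball\")")); // [setSpellName, Fireball]
        System.out.println(parse("setSplashDamage(32,5)"));      // [setSplashDamage, 32, 5]
    }
}
```

A string argument that itself contains a comma would need a real tokenizer rather than split(",").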

replaceFirst for character "`"

First time here. I'm trying to write a program that takes a string input from the user and encode it using the replaceFirst method. All letters and symbols with the exception of "`" (Grave accent) encode and decode properly.
e.g. When I input
`12
I am supposed to get 28AABB as my encryption, but instead, it gives me BB8AA2
import javax.swing.JOptionPane;
public class CryptoString {
public static void main(String[] args) {
String input = JOptionPane.showInputDialog(null, "Enter the string to be encrypted");
JOptionPane.showMessageDialog(null, "The message " + input + " was encrypted to be "+ encrypt(input));
}
public static String encrypt (String s){
String encryptThis = s.toLowerCase();
String encryptThistemp = encryptThis;
int encryptThislength = encryptThis.length();
for (int i = 0; i < encryptThislength ; ++i){
// read one character of the unmodified copy
String test = encryptThistemp.substring(i, i + 1);
//Took out all code with regard to all cases OTHER than "`" "1" and "2"
//All other cases would have followed the same format, except with a different string replacement argument.
if (test.equals("`")){
encryptThis = encryptThis.replaceFirst("`" , "28");
}
else if (test.equals("1")){
encryptThis = encryptThis.replaceFirst("1" , "AA");
}
else if (test.equals("2")){
encryptThis = encryptThis.replaceFirst("2" , "BB");
}
}
return encryptThis;
}
}
I've tried putting escape characters in front of the grave accent, however, it is still not encoding it properly.
Take a look at how your program works in each loop iteration:
i=0
encryptThis = '12 (I used ' instead of ` to easier write this post)
and now you replace ' with 28 so it will become 2812
i=1
we read character at position 1 and it is 1 so
we replace 1 with AA making 2812 -> 28AA2
i=2
we read character at position 2, it is 2 so
we replace first 2 with BB making 28AA2 -> BB8AA2
Try maybe using appendReplacement from Matcher class from java.util.regex package like
public static String encrypt(String s) {
Map<String, String> replacementMap = new HashMap<>();
replacementMap.put("`", "28");
replacementMap.put("1", "AA");
replacementMap.put("2", "BB");
Pattern p = Pattern.compile("[`12]"); //regex that will match ` or 1 or 2
Matcher m = p.matcher(s);
StringBuffer sb = new StringBuffer();
while (m.find()){//we found one of `, 1, 2
m.appendReplacement(sb, replacementMap.get(m.group()));
}
m.appendTail(sb);
return sb.toString();
}
A note on encryptThistemp.substring(i, i + 1): the second parameter of substring is the end index (exclusive), not a length, so this call always returns exactly one character and is not the cause of the problem.
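The trace earlier in this thread shows that the trouble comes from replacing inside the same string that is still being scanned. As an alternative sketch (not the asker's code), the output can be built in a separate StringBuilder, one input character at a time. Only the three mappings from the question are included, and the class name SinglePassEncoder is made up.

```java
import java.util.HashMap;
import java.util.Map;

class SinglePassEncoder {

    // Only the mappings shown in the question; others would be added the same way.
    private static final Map<Character, String> CODES = new HashMap<>();
    static {
        CODES.put('`', "28");
        CODES.put('1', "AA");
        CODES.put('2', "BB");
    }

    /** Encodes each input character exactly once, left to right. */
    static String encrypt(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toLowerCase().toCharArray()) {
            String code = CODES.get(c);
            // characters without a mapping are copied unchanged
            out.append(code != null ? code : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encrypt("`12")); // 28AABB
    }
}
```

Since already emitted output is never scanned again, a replacement such as "28" cannot be mistaken for an input "2".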

Filter words from string

I want to filter a string.
Basically when someone types a message, I want certain words to be filtered out, like this:
User types: hey guys lol omg -omg mkdj*Omg*ndid
I want the filter to run and:
Output: hey guys lol - mkdjndid
And I need the filtered words to be loaded from an ArrayList that contains several words to filter out. Now at the moment I am doing if(message.contains(omg)) but that doesn't work if someone types zomg or -omg or similar.
Use replaceAll with a regex built from the bad word:
message = message.replaceAll("(?i)\\b[^\\w -]*" + badWord + "[^\\w -]*\\b", "");
This passes your test case:
public static void main( String[] args ) {
List<String> badWords = Arrays.asList( "omg", "black", "white" );
String message = "hey guys lol omg -omg mkdj*Omg*ndid";
for ( String badWord : badWords ) {
message = message.replaceAll("(?i)\\b[^\\w -]*" + badWord + "[^\\w -]*\\b", "");
}
System.out.println( message );
}
try:
input.replaceAll("(\\*?)[oO][mM][gG](\\*?)", "").split(" ")
Dave gave you the answer already, but I will emphasize one point here. You will run into a problem if you implement your algorithm with a simple for-loop that just replaces every occurrence of the filtered word. For example, if you filter the word ass inside the word 'classic' and replace it with 'butt', the result is 'clbuttic', which doesn't make any sense. Thus, I would suggest using a word list, like the ones stored on Linux under the /usr/share/dict/ directory, to check whether a word is valid or needs filtering.
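One way to read this suggestion is sketched below: a word that appears in the dictionary is left alone, and only other words have bad substrings removed. The two small sets are hypothetical stand-ins for a real word list (such as a file under /usr/share/dict/) and for the list of filtered words, and the class name is invented.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

class DictionaryAwareFilter {

    // Hypothetical stand-ins; a real program would load these from files.
    static final Set<String> DICTIONARY =
            new HashSet<>(Arrays.asList("classic", "hey", "guys"));
    static final Set<String> BAD_WORDS =
            new HashSet<>(Arrays.asList("ass", "omg"));

    static String filter(String message) {
        StringBuilder out = new StringBuilder();
        for (String word : message.split(" ")) {
            String cleaned = word;
            // valid dictionary words are never touched, so "classic" survives
            if (!DICTIONARY.contains(word.toLowerCase())) {
                for (String bad : BAD_WORDS) {
                    // Pattern.quote keeps regex metacharacters in the bad word literal
                    cleaned = cleaned.replaceAll("(?i)" + Pattern.quote(bad), "");
                }
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(cleaned);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("classic zomg")); // classic z
    }
}
```

The quality of the result depends entirely on the dictionary: a valid word that is missing from it would still be mangled.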
I don't quite get what you are trying to do.
I ran into this same problem and solved it in the following way:
1) Have a google spreadsheet with all words that I want to filter out
2) Directly download the google spreadsheet into my code with the loadConfigs method (see below)
3) Replace all l33tsp33k characters with their respective alphabet letter
4) Replace all special characters but letters from the sentence
5) Run an algorithm that checks all the possible combinations of words within a string against the list efficiently, note that this part is key - you don't want to loop over your ENTIRE list every time to see if your word is in the list. In my case, I found every combination within the string input and checked it against a hashmap (O(1) runtime). This way the runtime grows relatively to the string input, not the list input.
6) Check if the word is not used in combination with a good word (e.g. bass contains *ss). This is also loaded through the spreadsheet
7) In our case we are also posting the filtered words to Slack, but you can remove that line obviously.
We are using this in our own games and it's working like a charm. Hope you guys enjoy.
https://pimdewitte.me/2016/05/28/filtering-combinations-of-bad-words-out-of-string-inputs/
public static HashMap<String, String[]> words = new HashMap<String, String[]>();
public static void loadConfigs() {
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(new URL("https://docs.google.com/spreadsheets/d/1hIEi2YG3ydav1E06Bzf2mQbGZ12kh2fe4ISgLg_UBuM/export?format=csv").openConnection().getInputStream()));
String line = "";
int counter = 0;
while((line = reader.readLine()) != null) {
counter++;
String[] content = null;
try {
content = line.split(",");
if(content.length == 0) {
continue;
}
String word = content[0];
String[] ignore_in_combination_with_words = new String[]{};
if(content.length > 1) {
ignore_in_combination_with_words = content[1].split("_");
}
words.put(word.replaceAll(" ", ""), ignore_in_combination_with_words);
} catch(Exception e) {
e.printStackTrace();
}
}
System.out.println("Loaded " + counter + " words to filter out");
} catch (IOException e) {
e.printStackTrace();
}
}
/**
* Iterates over a String input and checks whether a cuss word was found in a list, then checks if the word should be ignored (e.g. bass contains the word *ss).
* @param input
* @return
*/
public static ArrayList<String> badWordsFound(String input) {
if(input == null) {
return new ArrayList<>();
}
// remove leetspeak
input = input.replaceAll("1","i");
input = input.replaceAll("!","i");
input = input.replaceAll("3","e");
input = input.replaceAll("4","a");
input = input.replaceAll("@","a");
input = input.replaceAll("5","s");
input = input.replaceAll("7","t");
input = input.replaceAll("0","o");
ArrayList<String> badWords = new ArrayList<>();
input = input.toLowerCase().replaceAll("[^a-zA-Z]", "");
for(int i = 0; i < input.length(); i++) {
for(int fromIOffset = 1; fromIOffset < (input.length()+1 - i); fromIOffset++) {
String wordToCheck = input.substring(i, i + fromIOffset);
if(words.containsKey(wordToCheck)) {
// for example, if you want to say the word bass, that should be possible.
String[] ignoreCheck = words.get(wordToCheck);
boolean ignore = false;
for(int s = 0; s < ignoreCheck.length; s++ ) {
if(input.contains(ignoreCheck[s])) {
ignore = true;
break;
}
}
if(!ignore) {
badWords.add(wordToCheck);
}
}
}
}
for(String s: badWords) {
Server.getSlackManager().queue(s + " qualified as a bad word in a username");
}
return badWords;
}
