What are the different methods to parse strings in Java? [closed] - java

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
For parsing player commands, I've most often used the split method to split a string by delimiters and then just figured out the rest with a series of ifs or switches. What are some different ways of parsing strings in Java?

I really like regular expressions. As long as the command strings are fairly simple, you can write a few regexes that do what could otherwise take a few pages of manual parsing code.
I would suggest you check out http://www.regular-expressions.info for a good intro to regexes, as well as specific examples for Java.
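To make the regex suggestion concrete, here is a minimal sketch of parsing a "verb plus optional arguments" command with a single precompiled pattern. The class name RegexCommand, the pattern, and the returned two-element array are assumptions made for illustration; they are not taken from any answer here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexCommand {
    // Group 1: the verb (one word). Group 2: everything after it, if anything.
    private static final Pattern COMMAND =
            Pattern.compile("^\\s*(\\w+)(?:\\s+(.*?))?\\s*$");

    /** Returns {verb, arguments}, or null when the line is not a command. */
    static String[] parse(String line) {
        Matcher m = COMMAND.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // Group 2 is null when the line contains only a verb.
        String arguments = m.group(2) == null ? "" : m.group(2);
        return new String[] { m.group(1).toLowerCase(), arguments };
    }
}
```

With this sketch, "  Take the brass lamp " yields the verb "take" and the arguments "the brass lamp", while an empty line yields null.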

I assume you're trying to make the command interface as forgiving as possible. If this is the case, I suggest you use an algorithm similar to this:
1. Read in the string
2. Split the string into tokens
3. Use a dictionary to convert synonyms to a common form
   For example, convert "hit", "punch", "strike", and "kick" all to "hit"
4. Perform actions on an unordered, inclusive basis
   Unordered - "punch the monkey in the face" is the same thing as "the face in the monkey punch"
   Inclusive - If the command is supposed to be "punch the monkey in the face" and they supply "punch monkey", you should check how many commands this matches. If it matches only one command, perform that action. It might even be a good idea to have command priorities, so that even if there were multiple matches, it would perform the top action.
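The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the synonym table, the stop-word list, and the two commands are hypothetical, and "priority" is modelled simply as the order of the command list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ForgivingParser {
    // Hypothetical synonym dictionary: every key maps to its common form.
    private static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        for (String s : new String[] { "hit", "punch", "strike", "kick" }) {
            SYNONYMS.put(s, "hit");
        }
    }
    // Hypothetical filler words that carry no meaning for matching.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "in", "a"));
    // Hypothetical commands, highest priority first.
    private static final List<Set<String>> COMMANDS = Arrays.asList(
            (Set<String>) new HashSet<>(Arrays.asList("hit", "monkey", "face")),
            (Set<String>) new HashSet<>(Arrays.asList("hit", "door")));

    /** Tokenizes, lower-cases, maps synonyms, and drops filler words. */
    static Set<String> normalize(String input) {
        Set<String> words = new HashSet<>();
        for (String token : input.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                words.add(SYNONYMS.getOrDefault(token, token));
            }
        }
        return words;
    }

    /** Returns the first command that contains every supplied word, or null. */
    static Set<String> match(String input) {
        Set<String> words = normalize(input);
        if (words.isEmpty()) {
            return null;
        }
        for (Set<String> command : COMMANDS) {
            if (command.containsAll(words)) { // unordered and inclusive
                return command;
            }
        }
        return null;
    }
}
```

Because the words are compared as sets, word order does not matter, and a partial input such as "punch monkey" still resolves to the first command that contains all of its words.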

Parsing manually is a lot of fun... at the beginning:)
In practice, if the commands aren't very sophisticated, you can treat them the same way as those used in command-line interpreters. There's a list of libraries that you can use: http://java-source.net/open-source/command-line. I think you can start with Apache Commons CLI or args4j (which uses annotations). They are well documented and really simple to use. They handle parsing automatically, and the only thing you need to do is read particular fields in an object.
If you have more sophisticated commands, then maybe creating a formal grammar would be a better idea. There is a very good library with graphical editor, debugger and interpreter for grammars. It's called ANTLR (and the editor ANTLRWorks) and it's free:) There are also some example grammars and tutorials.

I would look at Java migrations of Zork, and lean towards a simple Natural Language Processor (driven either by tokenizing or regex) such as the following (from this link):
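    // Requires: import java.util.Vector;
    public static boolean simpleNLP( String inputline, String keywords[])
    {
        int i;
        int to,from;
        boolean status = false;
        if( inputline.length() < 1 || keywords.length == 0) return false; // check for blank and empty lines
        Vector<String> lexed = new Vector<String>();
        from = 0;
        to = 0;
        // Tokenize on spaces, skipping runs of consecutive spaces
        while( to >=0 )
        {
            to = inputline.indexOf(' ',from);
            if( to >= 0){
                if( to > from) lexed.addElement(inputline.substring(from,to));
                from = to;
                while( from < inputline.length()
                && inputline.charAt(from) == ' ') from++;
            }
            else if( from < inputline.length()){
                lexed.addElement(inputline.substring(from)); // last token
            }
        }
        // Succeed when every keyword appears in the input, in order
        to = 0;
        for( i = 0; i < lexed.size(); i++)
        {
            if( lexed.elementAt(i).equalsIgnoreCase(keywords[to])){
                to++;
                if( to >= keywords.length) { status = true; break;}
            }
        }
        return status;
    }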
...
Anything which gives a programmer a reason to look at Zork again is good in my book, just watch out for Grues.
...

Sun itself recommends staying away from StringTokenizer and using the String.split method instead.
You'll also want to look at the Pattern class.

Another vote for ANTLR/ANTLRWorks. If you create two versions of the file, one with the Java code for actually executing the commands, and one without (with just the grammar), then you have an executable specification of the language, which is great for testing, a boon for documentation, and a big timesaver if you ever decide to port it.

If this is to parse command lines I would suggest using Commons Cli.
The Apache Commons CLI library provides an API for processing command line interfaces.

Try JavaCC, a parser generator for Java.
It has a lot of features for interpreting languages, and it's well supported on Eclipse.

@CodingTheWheel Here's your code, cleaned up a bit, run through Eclipse's formatter (Ctrl+Shift+F) and then inserted back here :)
Including the four spaces in front of each line.
    // Requires: import java.util.ArrayList; import java.util.List;
    public static boolean simpleNLP(String inputline, String keywords[]) {
        if (inputline.length() < 1 || keywords.length == 0)
            return false;
        List<String> lexed = new ArrayList<String>();
        for (String ele : inputline.split(" ")) {
            lexed.add(ele);
        }
        boolean status = false;
        int to = 0; // index of the next keyword to match
        for (int i = 0; i < lexed.size(); i++) {
            String s = lexed.get(i);
            if (s.equalsIgnoreCase(keywords[to])) {
                to++;
                if (to >= keywords.length) {
                    status = true;
                    break;
                }
            }
        }
        return status;
    }

A simple string tokenizer on spaces should work, but there are really many ways you could do this.
Here is an example using a tokenizer:
String command = "kick person";
StringTokenizer tokens = new StringTokenizer(command);
String action = null;
if (tokens.hasMoreTokens()) {
action = tokens.nextToken();
}
if (action != null) {
doCommand(action, tokens);
}
Then tokens can be further used for the arguments. This all assumes no spaces are used in the arguments... so you might want to roll your own simple parsing mechanism (like getting the first whitespace and using text before as the action, or using a regular expression if you don't mind the speed hit), just abstract it out so it can be used anywhere.
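As a rough sketch of the "first whitespace" idea mentioned above, the helper below treats the text before the first space as the action and keeps everything after it, spaces included, as a single argument string. The class and method names are made up for this example.

```java
final class FirstWordSplit {
    /** Returns {action, rest}; rest is "" when the command has no arguments. */
    static String[] split(String command) {
        String trimmed = command.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            return new String[] { trimmed, "" };
        }
        // Everything after the first space stays together, so arguments may contain spaces.
        return new String[] {
            trimmed.substring(0, space),
            trimmed.substring(space + 1).trim()
        };
    }
}
```

For "kick tall person" this gives the action "kick" and the argument string "tall person", which the caller can then interpret however the command requires.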

When the separator for the command is always the same String or char (like ";"), I recommend you use the StringTokenizer class.
But when the separator varies or is complex, I recommend you use regular expressions, which can be used through the String class itself (the split method) since Java 1.4. It uses the Pattern class from the java.util.regex package.

If the language is dead simple like just
VERB NOUN
then splitting by hand works well.
If it's more complex, you should really look into a tool like ANTLR or JavaCC.
I've got a tutorial on ANTLR (v2) at http://javadude.com/articles/antlrtut which will give you an idea of how it works.

JCommander seems quite good, although I have yet to test it.

If your text contains some delimiters, then you can use the split method.
If the text contains irregular strings, meaning different formats within it, then you must use regular expressions.

The split method splits a string into an array of substrings around matches of the given regular expression.
It comes in two forms: split(String regex) and split(String regex, int limit). split(String regex) is actually implemented by calling split(String regex, int limit) with a limit of 0. So what do limit > 0 and limit < 0 mean?
The JDK documentation explains: when limit > 0, the pattern is applied at most limit - 1 times, so the array has at most limit elements, and the last element contains all of the input after the last matched delimiter;
limit < 0 places no limit on the length of the array;
limit = 0 also places no limit on the length of the array, but trailing empty strings are discarded.
StringTokenizer is a legacy class that is retained for compatibility reasons, so we should prefer the split method of the String class.
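A small demonstration of the three limit cases may help. The input string "a,b,," and the class name are chosen only for illustration; the outputs in the comments follow from the documented behaviour of String.split.

```java
import java.util.Arrays;

final class SplitLimitDemo {
    /** Formats the result of input.split(regex, limit) for display. */
    static String show(String input, String regex, int limit) {
        return Arrays.toString(input.split(regex, limit));
    }

    public static void main(String[] args) {
        System.out.println(show("a,b,,", ",", 2));  // [a, b,,]    at most 2 elements
        System.out.println(show("a,b,,", ",", -1)); // [a, b, , ]  trailing empty strings kept
        System.out.println(show("a,b,,", ",", 0));  // [a, b]      trailing empty strings dropped
    }
}
```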

Related

Splitting strings like push1234 in java

Please bear with me, as I have a long question here. I have some code in Java that uses an ArrayList to implement a stack. I need to be able to enter the command "push" to add stuff to the stack. However, my problem is that it has to be in the format pushSTUFF.
Where the "STUFF" is anything, upper case, lower case, string, int, etc.. The way I've been trying to implement this is with the string split method where PUSH is the delimiter. Then the command is passed to a switch case.
I quickly realized that the split gets discarded, at least as far as I can tell, and that the switch case is getting pushSTUFF not push as the case input.
In contemplating this problem I came up with a couple of ways I could do this. I just don't know if they are possible or how to do them.
So,
1. Is there a way to split a string like pushSTUFF and keep both parts (the push and the STUFF)
2. Is there a way to split, from a string, something of unknown length or contents (since I don't know what the user will input the STUFF is unknown)
3. Is there a way to tell the switch case to look for the pushSTUFF as opposed to just push (again because STUFF is unknown).
Are any of these even possible to do? If so what would you recommend?
I'm sure there are better ways but as I'm still learning java these seemed like the best for right now. Also I didn't post any code because I didn't feel it was necessary to the question. I will post some if you need it though. Just ask and I will be happy to oblige.
(tl;dr) Is it possible to do any of 1, 2, or 3 above and if so how?
Thanks in advance.
Instead of splitting the strings, you can use regular expressions with groups and iterate over the matching parts of it (as you saw, the split character(s) get discarded).
For #1, you could do something like (pseudocode):
regex = (push)(.*)
stuff = groups[1]
That should also cover #2 since it will match all characters after the push.
I'm not entirely sure what you're asking in #3.
There is a regex tutorial here if you're not familiar with java regular expressions.
You can also take a look at the StringTokenizer, which has an option to keep delimiters.
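In actual Java the pseudocode above might look like the sketch below. Note that Java numbers capture groups from 1 and reserves group(0) for the whole match, so the STUFF part is group(2) here. The class name and the returned array are assumptions for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PushCommand {
    // Group 1 is the literal command word, group 2 is whatever follows it.
    private static final Pattern PUSH = Pattern.compile("(push)(.*)", Pattern.DOTALL);

    /** Returns {"push", stuff}, or null when the input is not a push command. */
    static String[] parse(String input) {
        Matcher m = PUSH.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }
}
```

The first element can then be passed to the switch statement, so the case label only has to be "push" no matter what STUFF the user typed.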
If the format will always be push*SOMETHING* why aren't you using String.substring()?
You can do:
String something = "pushSTUFF".substring(4);
This way, you will always get whatever is behind push.
I really don't understand what you are trying to achieve without seeing the actual code, but your problem seems simple enough to be solved this way.
Use .indexOf and find push:
    public class SplitString {
        public static void main(String[] args) {
            String tosplit = "push1234";
            int ind = tosplit.indexOf("push");               // start of the command word
            String part1 = tosplit.substring(ind, ind + 4);  // "push"
            String part2 = tosplit.substring(ind + 4);       // "1234"
            System.out.println(part1 + " " + part2);
        }
    }
You can search for any Uppercase letter and use String.substring(...)
Find if first character in a string is upper case, Java

select a word from a section of string?

I'm trying to find out if there are any methods in Java which would help me achieve the following.
I want to pass a method a parameter like below
"(hi|hello) my name is (Bob|Robert). Today is a (good|great|wonderful) day."
I want the method to select one of the words inside the parenthesis separated by '|' and return the full string with one of the words randomly selected. Does Java have any methods for this or would I have to code this myself using character by character checks in loops?
You can parse it by regexes.
The regex would be \(\w+(\|\w+)*\); in the replacement you just split the argument on the '|' and return the random word.
Something like
    import java.util.Random;
    import java.util.regex.*;
    public final class Replacer {
        //aText: "(hi|hello) my name is (Bob|Robert). Today is a (good|great|wonderful) day."
        //returns e.g.: "hello my name is Bob. Today is a wonderful day."
        public static String getEditedText(String aText){
            StringBuffer result = new StringBuffer();
            Matcher matcher = fINITIAL_A.matcher(aText);
            while ( matcher.find() ) {
                // quoteReplacement keeps any '$' or '\' in the chosen word literal
                matcher.appendReplacement(result, Matcher.quoteReplacement(getReplacement(matcher)));
            }
            matcher.appendTail(result);
            return result.toString();
        }
        private static final Pattern fINITIAL_A = Pattern.compile(
            "\\((\\w+(\\|\\w+)*)\\)",
            Pattern.CASE_INSENSITIVE
        );
        private static final Random fRANDOM = new Random();
        //aMatcher.group(1): "hi|hello"
        //words: ["hi", "hello"]
        //returns: "hi" or "hello", chosen at random
        private static String getReplacement(Matcher aMatcher){
            // split takes a regex, so the '|' has to be escaped
            String[] words = aMatcher.group(1).split("\\|");
            int index = fRANDOM.nextInt(words.length);
            return words[index];
        }
    }
(Note that this code is written just to illustrate an idea.)
Maybe this helps:
Pass the three strings "hi|hello", "Bob|Robert" and "good|great|wonderful" as arguments to the method.
Inside the method, split each string into an array,
e.g. String[] firststringarray = thatstring.split("\\|"); (the '|' must be escaped because split takes a regex), and do the same for the other two.
Then pick a random element from each array.
As far as I know, Java doesn't have any method to do this directly.
You have to write code for it or use a regex.
I don't think Java has anything that will do what you want directly. Personally, instead of doing things based on regexps or characters, I would make a method something like:
String madLib(Set<String> greetings, Set<String> names, Set<String> dispositions)
{
// pick randomly from each of the sets and insert into your background string
}
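One possible way to fill in that method body is sketched below. Since a Set has no index-based access, each set is copied into a list before a random element is picked. The template sentence is taken from the question; the class name and the pick helper are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

final class MadLib {
    private static final Random RANDOM = new Random();

    /** Picks one element at random; the set must not be empty. */
    static String pick(Set<String> options) {
        List<String> list = new ArrayList<>(options);
        return list.get(RANDOM.nextInt(list.size()));
    }

    static String madLib(Set<String> greetings, Set<String> names, Set<String> dispositions) {
        return String.format("%s my name is %s. Today is a %s day.",
                pick(greetings), pick(names), pick(dispositions));
    }
}
```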
There is no direct support for this. And you should ideally not try a low level solution.
You should search for 'random sentence generator'. The way you are writing
`(Hi|Hello)`
etc. is called a grammar. You have to write a parser for the grammar. Again there are many solutions for writing parsers. There are standard ways to specify grammar. Look for BNF.
The parser and generator problems have been solved many time over, and the interesting part of your problem will be writing the grammar.
Java does not provide any ready-made method for this. You can either use a regex as described by Penartur or write your own Java method to split the strings and pick random words. The StringTokenizer class can help you if you follow the second approach.

Regexp for a string to contain only letters , numbers and space in Java

Requirement: String should contain only letters , numbers and space.
I have to pass a clean name to another API.
Implementation: Java
I came up with this for my requirement
public static String getCleanFilename(String filename) {
if (filename == null) {
return null;
}
return filename.replaceAll("[^A-Za-z0-9 ]","");
}
This works well for a few of my test cases, but I want to know whether I am missing any boundary conditions, or whether there is a better way (in terms of performance) to do it.
Additional to comments: i don't think that performance is an issue in a scenario where user input is taken (and a filename shouldn't be that long...).
But concerning your question: you may reduce the number of replacements by adding an additional + in your regex:
[^A-Za-z0-9 ]+
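If performance ever did matter, a sketch along these lines would combine the + quantifier with a pattern that is compiled only once, instead of on every replaceAll call. The class name and the constant name are assumptions; the behaviour is otherwise the same as the method in the question.

```java
import java.util.regex.Pattern;

final class FilenameCleaner {
    // Compiled once; the '+' removes each run of disallowed characters in one replacement.
    private static final Pattern DISALLOWED = Pattern.compile("[^A-Za-z0-9 ]+");

    static String getCleanFilename(String filename) {
        if (filename == null) {
            return null;
        }
        return DISALLOWED.matcher(filename).replaceAll("");
    }
}
```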
To answer your direct question: \t fails your method, because a tab is whitespace but is not the literal space in your character class, so it gets stripped. Switch to \s ([...\s]) and you're good.
At any rate, your design is probably flawed. Instead of arbitrarily dicking with user input, let the user know what you don't allow and make the correction manual.
EDIT:
If the filename doesn't matter, take the SHA-2 hash of the file name and use that. Guaranteed to meet your requirements.
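A minimal sketch of the hashing idea, assuming SHA-256 as the SHA-2 variant and a lower-case hex encoding of the digest (both are choices made for this example). The result consists only of the characters 0-9 and a-f, so it satisfies the "letters and numbers" requirement.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HashedName {
    /** Returns the SHA-256 digest of the name as 64 lower-case hex characters. */
    static String sha256Hex(String filename) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(filename.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```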

Split textual script into substrings by pattern

Consider following script (it's total nonsense in pseudo-language):
if (Request.hostMatch("asfasfasf.com") && someString.existsIn(new String[] {"brr", "hrr"})) {
if (Request.clientIp("10.0.x.x")) {
somevar = "1";
}
somevar = "2";
}
else {
somevar = "first";
}
string foo = "foo";
// etc. etc.
How would you grab if-block's parameters and contents from it? The if-block has format of:
if<whitespace>(<parameters>)<whitespace>{<contents>}<anything>
I tried using String.split() with a regex pattern of ^if\s*\(|\)\s*\{|\}\s* but this fails miserably. Namely, the problem is that ) { is also found in the inner if-block, and the closing } is found in many places as well. I don't think either lazy or greedy matching works here.
So... any pointers to what might I need here in order to implement this with regex?
I also need to get the remaining string without the if-block's code (so code starting from else { ...). Using just String.split() seems to make it difficult as there is no information about the length of the parts that were parsed away.
I initially created a loop based solution (using String.substring() heavily) for this, but it's dull. I would like to have something fancier instead. Should I go with regex or create a custom, generic function (there are many other cases than just this) that takes the parseable String and the pattern instead (consider the if<whitespace>(... pattern above)?
Edit: Changed returns to variable assignments as it would have not made sense otherwise.
You'd be far better off using (or writing) a parser than trying to do this with Regex.
Regex is great for some things, but for complex parsing like this, it sucks. Another example where it sucks that gets asked a lot here is parsing HTML - you can do it to a limited degree, but for anything complex, a DOM parser is a much better solution.
For a [very] simple parser, what you need is a recursive function that searches for the braces { and }, recursing down a level each time it comes across an opening brace, and returning back up a level when it finds a closing brace. It then needs to store the string contents between the two braces at each level.
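As a rough illustration of brace matching, the sketch below uses a depth counter instead of explicit recursion (a deliberate simplification of the idea above) to find the delimiter that closes a given opening one. It then splits a script that starts with an if-block into parameters, contents, and the remaining text. The class and method names are made up, and quoted strings are handled naively (no escaped quotes).

```java
final class BlockExtractor {
    /** Returns the index of the delimiter closing the one at openIdx, or -1. */
    static int findMatching(String src, int openIdx, char open, char close) {
        int depth = 0;
        boolean inString = false;
        for (int i = openIdx; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '"') {
                inString = !inString; // naive: escaped quotes are not handled
            }
            if (inString) {
                continue; // ignore delimiters inside string literals
            }
            if (c == open) {
                depth++;
            } else if (c == close && --depth == 0) {
                return i;
            }
        }
        return -1;
    }

    /** Returns {parameters, contents, rest} for a script starting with "if", or null. */
    static String[] splitIfBlock(String script) {
        int parenOpen = script.indexOf('(');
        if (!script.trim().startsWith("if") || parenOpen < 0) {
            return null;
        }
        int parenClose = findMatching(script, parenOpen, '(', ')');
        if (parenClose < 0) {
            return null;
        }
        int braceOpen = script.indexOf('{', parenClose);
        if (braceOpen < 0) {
            return null;
        }
        int braceClose = findMatching(script, braceOpen, '{', '}');
        if (braceClose < 0) {
            return null;
        }
        return new String[] {
            script.substring(parenOpen + 1, parenClose),
            script.substring(braceOpen + 1, braceClose),
            script.substring(braceClose + 1)
        };
    }
}
```

The third element is the remaining script (starting at else { ...), which addresses the part of the question about keeping the unparsed text.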
A regular language won't work because a regular grammar can't match things like "any number of open parenthesis followed by any number of close parenthesis". A context-free grammar would be needed for that.
Unless you use a context-free grammar parser for Java or a regular expression extension that makes regular expressions no longer regular, your loop-based solution is probably the fanciest solution.
As per the above, you'll need a parser. One type that's easy to implement (and fun to write!) is a recursive descent parser with backtracking. There is also a plethora of parser generators out there, though most of those have a learning curve. One Java-friendly parser generator is JavaCC.

Regular expression performance in Java -- better few complex or many simple?

I am doing some fairly extensive string manipulations using regular expressions in Java. Currently, I have many blocks of code that look something like:
Matcher m = Pattern.compile("some pattern").matcher(text);
StringBuilder b = new StringBuilder();
int prevMatchIx = 0;
while (m.find()) {
b.append(text.substring(prevMatchIx, m.start()));
String matchingText = m.group(); //sometimes group(n)
//manipulate the matching text
b.append(matchingText);
prevMatchIx = m.end();
}
text = b.toString()+text.substring(prevMatchIx);
My question is which of the two alternatives is more efficient (primarily time, but space to some extent):
1) Keep many existing blocks as above (assuming there isn't a better way to handle such blocks -- I can't use a simple replaceAll() because the groups must be operated on).
2) Consolidate the blocks into one big block. Use a "some pattern" that is the combination of all the old blocks' patterns using the |/alternation operator. Then, use if/else if within the loop to handle each of the matching patterns.
Thank you for your help!
If the order in which the replacements are made matters, you would have to be careful when using technique #1. Allow me to give an example: If I want to format a String so it is suitable for inclusion in XML, I have to first replace all & with &amp; and then make the other replacements (like < to &lt;). Using technique #2, you would not have to worry about this because you are making all the replacements in one pass.
In terms of performance, I think #2 would be quicker because you would be doing less String concatenations. As always, you could implement both techniques and record their speed and memory consumption to find out for certain. :)
I'd suggest caching the patterns and having a method that uses the cache.
Patterns are expensive to compile so at least you will only compile them once and there is code reuse in using the same method for each instance. Shame about the lack of closures though as that would make things a lot cleaner.
private static Map<String, Pattern> patterns = new HashMap<String, Pattern>();
static Pattern findPattern(String patStr) {
if (! patterns.containsKey(patStr))
patterns.put(patStr, Pattern.compile(patStr));
return patterns.get(patStr);
}
public interface MatchProcessor {
public void process(String field);
}
public static void processMatches(String text, String pat, MatchProcessor processor) {
Matcher m = findPattern(pat).matcher(text);
int startInd = 0;
while (m.find(startInd)) {
processor.process(m.group());
startInd = m.end();
}
}
Last time I was in your position I used a product called jflex.
Java's regex doesn't provide the traditional O(N log M) performance guarantees of true regular expression engines (for input strings of length N, and patterns of length M). Instead it inherits from its perl roots exponential time for some patterns. Unfortunately these pathological patterns, while rare in normal use, are all too common when combining regexes as you propose to do (I can attest to this from personal experience).
Consequently, my advice is to either:
a) pre-compile your patterns as "static final Pattern" constants, so they will be initialized once during class initialization (<clinit>); or
b) switch to a lexer package such as jflex, which will provide a more declarative, and far more readable, syntax to approach this sort of cascading/sequential regex processing; and
c) seriously consider using a parser generator package. My current favourite is Beaver, but CUP is also a good option. Both of these are excellent tools and I highly recommend both of them, and as they both sit on top of jflex you can add them as/when you need them.
That being said, if you haven't used a parser-generator before and you are in a hurry, it will be easier to get up to speed with JavaCC. Not as powerful as Beaver/CUP but its parsing model is easier to understand.
Whatever you do, please don't use Antlr. It is very fashionable, and has great cheerleaders, but its online documentation sucks, its syntax is awkward, its performance is poor, and its scannerless design makes several common simple cases painful to handle. You would be better off using an abomination like sablecc(v1).
Note: Yes I have used everything I have mentioned above, and more besides; so this advice comes from personal experience.
First, does this need to be efficient? If not, don't bother -- complexification won't help code maintainability.
Assuming it does, doing them separately is usually the most efficient. This is especially true if there are large blocks of literal text in the expressions: without alternation, the literal text can be used to speed up matching; with alternation, it can't help at all.
If performance is really critical, you can code it several ways and test with sample data.
Option #2 is almost certainly the better way to go, assuming it isn't too difficult to combine the regexes. And you don't have to implement it from scratch, either; the lower-level API that replaceAll() is built on (i.e., appendReplacement() and appendTail()), is also available for your use.
Taking the example that @mangst used, here's how you might process some text to be inserted into an XML document:
    import java.util.regex.*;
    public class Test
    {
        public static void main(String[] args)
        {
            String test_in = "One < two & four > three.";
            Pattern p = Pattern.compile("(&)|(<)|(>)");
            Matcher m = p.matcher(test_in);
            StringBuffer sb = new StringBuffer(); // (1)
            while (m.find())
            {
                // start(n) is -1 when group n did not take part in the match
                String repl = m.start(1) != -1 ? "&amp;" :
                              m.start(2) != -1 ? "&lt;" :
                              m.start(3) != -1 ? "&gt;" : "";
                m.appendReplacement(sb, ""); // (2)
                sb.append(repl);
            }
            m.appendTail(sb);
            System.out.println(sb.toString());
        }
    }
In this very simple example, all I need to know about each match is which capture group participated in it, which I find out by means of the start(n) method. But you can use the group() or group(n) method to examine the matched text, as you mentioned in the question.
Note (1) As of JDK 1.6, we have to use a StringBuffer here because StringBuilder didn't exist yet when the Matcher class was written. Overloads of appendReplacement() and appendTail() that accept a StringBuilder were only added later, in Java 9.
Note (2) appendReplacement(StringBuffer, String) processes the String argument to replace any $n sequence with the contents of the n'th capture group. We don't want that to happen, so we pass it an empty string and then append() the replacement string ourselves.
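For readers on newer JDKs, the same single-pass replacement can be sketched with Matcher.replaceAll(Function<MatchResult, String>), which was added in Java 9. This is an alternative written for illustration, not part of the answer above; the class name is hypothetical. The returned string is still interpreted as a replacement string, so values containing $ or \ would need Matcher.quoteReplacement.

```java
import java.util.regex.Pattern;

final class XmlEscaper {
    // Compiled once and reused for every call.
    private static final Pattern SPECIAL = Pattern.compile("[&<>]");

    /** Escapes &, < and > in a single pass over the text (requires Java 9+). */
    static String escape(String text) {
        return SPECIAL.matcher(text).replaceAll(match -> {
            switch (match.group()) {
                case "&": return "&amp;";
                case "<": return "&lt;";
                default:  return "&gt;";
            }
        });
    }
}
```

Because every character is visited once, an & that is produced by an earlier replacement is never escaped a second time, which is the ordering problem described for technique #1.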
