Split string using custom regex java

I am building a compiler. Some of its specifications are the following:
String literals are enclosed by dollar signs ("$"), e.g. $ string sample $
Comments are enclosed by asterisks ("*"), e.g. * sample comment *
Comments can appear anywhere except between the operands of an operation, e.g. 4 + * sample comment * 5 is not allowed
Now I have to split a line of source code in order to tokenize it.
Example case:
PRINT $ THE FLOAT IS $ * DISPLAY THE RESULT *
Tokenizing it should produce:
PRINT - token is KEYWORD
THE FLOAT IS - token is STRING_LITERAL
DISPLAY THE RESULT - token is COMMENT
I would like to know the most efficient way to do this. Note that I still have to validate each string literal and comment (e.g. check that it is properly enclosed). So far my approach is to split each line on whitespace, and when a lexeme contains a "$" or "*", I validate the string literal. Here is my implementation:
private void getLexemes() {
    for (String line : newSourceCode) {
        String[] lexemesInALine = line.trim().split("[\\s]+");
        for (String lexemeInALine : lexemesInALine) {
            if (!(lexemeInALine.contains("$"))) {
                lexemes.add(lexemeInALine);
                tempTokens.add(findToken(lexemeInALine));
                // replaceFirst takes a regex, so quote the lexeme (a bare "*" would otherwise throw);
                // needs import java.util.regex.Pattern
                line = line.replaceFirst(Pattern.quote(lexemeInALine), "").trim();
            } else {
                validateStringType(line);
                break;
            }
        }
    }
}
Thank you for the help.

I assume your language is deterministic and context-free?
If so, you can't parse it correctly using regular expressions alone.
What you need is a state machine that works on a stream of tokens.
Java comes with two classes that might work for you: StreamTokenizer and StringTokenizer.
But what you really want is to use one of the dozens of parser generators, for example ANTLR.
There are plenty described here:
https://en.wikipedia.org/wiki/Comparison_of_parser_generators
If all this fails, a finite state machine it is.
Something along these lines:
public class Parsy {
enum State { string, comment, token }
void parse(StringTokenizer tokenizer) {
State state = State.token;
List<String> tokens = new ArrayList<>();
while (tokenizer.hasMoreTokens()) {
String token = tokenizer.nextToken();
// figure out type of token
if (token.length() == 1) {
char delim = token.charAt(0);
switch (delim) {
case '$':
switch (state) {
case token: {
// a string literal has started, emit what we have, start a string
printOut(tokens, state);
tokens.clear();
tokens.add(token);
state = State.string;
break;
}
case string: { // parsing a string, so this ends it
printOut(tokens, state);
tokens.clear();
state = State.token;
break;
}
case comment: { // $ is ignored since we are in a comment
tokens.add(token);
break;
}
}
break;
// ...
}
} else {
// not a delimiter token
tokens.add(token);
}
} // end of while
if (state != State.token) {
System.out.println("Oops! Syntax error. I'm still parsing " + state);
}
}
}
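For comparison, here is a minimal character-level sketch of the same state-machine idea applied to the question's example line. It is not the asker's or the answerer's actual code: the class name DollarStarLexer, the "TYPE:text" output format and the generic WORD type (the question's findToken would decide KEYWORD and so on) are assumptions made for illustration. It also assumes that "*" always opens or closes a comment and that surrounding spaces inside a literal or comment are trimmed.

```java
import java.util.ArrayList;
import java.util.List;

class DollarStarLexer {
    enum State { CODE, STRING, COMMENT }

    /** Tokenizes one source line into entries such as "WORD:PRINT". */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        State state = State.CODE;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (state) {
                case CODE:
                    if (c == '$' || c == '*' || Character.isWhitespace(c)) {
                        flushWord(tokens, current);      // a delimiter ends the current word
                        if (c == '$') state = State.STRING;
                        else if (c == '*') state = State.COMMENT;
                    } else {
                        current.append(c);
                    }
                    break;
                case STRING:
                    if (c == '$') {                      // closing dollar sign
                        tokens.add("STRING_LITERAL:" + current.toString().trim());
                        current.setLength(0);
                        state = State.CODE;
                    } else {
                        current.append(c);
                    }
                    break;
                case COMMENT:
                    if (c == '*') {                      // closing asterisk
                        tokens.add("COMMENT:" + current.toString().trim());
                        current.setLength(0);
                        state = State.CODE;
                    } else {
                        current.append(c);
                    }
                    break;
            }
        }
        if (state != State.CODE) {
            // Validation: a literal or comment was opened but never closed
            throw new IllegalArgumentException("Unterminated " + state + " in: " + line);
        }
        flushWord(tokens, current);
        return tokens;
    }

    private static void flushWord(List<String> tokens, StringBuilder current) {
        if (current.length() > 0) {
            tokens.add("WORD:" + current);
            current.setLength(0);
        }
    }
}
```

Calling DollarStarLexer.tokenize("PRINT $ THE FLOAT IS $ * DISPLAY THE RESULT *") returns [WORD:PRINT, STRING_LITERAL:THE FLOAT IS, COMMENT:DISPLAY THE RESULT]. Because the enclosure check falls out of the final state, no separate validation pass is needed in this sketch.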

Related

StreamTokenizer mangles integers and loose periods

I've appropriated and modified the below code which does a pretty good job of tokenizing Java code using Java's StreamTokenizer. Its number handling is problematic, though:
It turns all integers into doubles. I can get past that by testing num % 1 == 0, but this feels like a hack.
More critically, a . following whitespace is treated as a number. "Class .method()" is legal Java syntax, but the resulting tokens are [Word "Class"], [Whitespace " "], [Number 0.0], [Word "method"], [Symbol "("], and [Symbol ")"].
I'd be happy turning off StreamTokenizer's number parsing entirely and parsing the numbers myself from word tokens, but commenting out st.parseNumbers() seems to have no effect.
public class JavaTokenizer {
private String code;
private List<Token> tokens;
public JavaTokenizer(String c) {
code = c;
tokens = new ArrayList<>();
}
public void tokenize() {
try {
// Create the tokenizer
StringReader sr = new StringReader(code);
StreamTokenizer st = new StreamTokenizer(sr);
// Java-style tokenizing rules
st.parseNumbers();
st.wordChars('_', '_');
st.eolIsSignificant(false);
// Don't want whitespace tokens
//st.ordinaryChars(0, ' ');
// Strip out comments
st.slashSlashComments(true);
st.slashStarComments(true);
// Parse the file
int token;
do {
token = st.nextToken();
switch (token) {
case StreamTokenizer.TT_NUMBER:
// A number was found; the value is in nval
double num = st.nval;
if(num % 1 == 0)
tokens.add(new IntegerToken((int)num));
else
tokens.add(new FPNumberToken(num));
break;
case StreamTokenizer.TT_WORD:
// A word was found; the value is in sval
String word = st.sval;
tokens.add(new WordToken(word));
break;
case '"':
// A double-quoted string was found; sval contains the contents
String dquoteVal = st.sval;
tokens.add(new DoubleQuotedStringToken(dquoteVal));
break;
case '\'':
// A single-quoted string was found; sval contains the contents
String squoteVal = st.sval;
tokens.add(new SingleQuotedStringToken(squoteVal));
break;
case StreamTokenizer.TT_EOL:
// End of line character found
tokens.add(new EOLToken());
break;
case StreamTokenizer.TT_EOF:
// End of file has been reached
tokens.add(new EOFToken());
break;
default:
// A regular character was found; the value is the token itself
char ch = (char) st.ttype;
if(Character.isWhitespace(ch))
tokens.add(new WhitespaceToken(ch));
else
tokens.add(new SymbolToken(ch));
break;
}
} while (token != StreamTokenizer.TT_EOF);
sr.close();
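} catch (IOException e) {
// Should not happen with a StringReader; rethrow rather than swallow it silently
throw new java.io.UncheckedIOException(e);
}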
}
public List<Token> getTokens() {
return tokens;
}
}

parseNumbers() is "on" by default. Use resetSyntax() to turn off number parsing and all other predefined character types, then re-enable what you need.
That said, manual number parsing might get tricky with accounting for dots and exponents... With a scanner and regular expressions it should be relatively straightforward to implement your own tokenizer, tailored exactly to your needs. For an example, you may want to take a look at the Tokenizer inner class here: https://github.com/stefanhaustein/expressionparser/blob/master/core/src/main/java/org/kobjects/expressionparser/ExpressionParser.java (about 120 LOC at the end)
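A minimal sketch of the resetSyntax() approach follows. The class name NoNumberTokenizer and the choice to return plain strings are assumptions for illustration; only standard StreamTokenizer calls are used. Digits are registered as word characters, so numbers arrive as word tokens that can be parsed separately, and a period is always an ordinary character.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class NoNumberTokenizer {
    static List<String> tokenize(String code) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(code));
        st.resetSyntax();              // every character becomes "ordinary"; number parsing is off
        st.wordChars('a', 'z');
        st.wordChars('A', 'Z');
        st.wordChars('0', '9');        // digits are now plain word characters
        st.wordChars('_', '_');
        st.whitespaceChars(0, ' ');    // skip whitespace and control characters
        List<String> out = new ArrayList<>();
        try {
            int type;
            while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (type == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);                    // identifier or digit run
                } else {
                    out.add(String.valueOf((char) type)); // single ordinary character
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

With this setup, NoNumberTokenizer.tokenize("Class .method()") returns [Class, ., method, (, )], so the loose period is no longer turned into 0.0. Quote and comment handling (quoteChar, slashSlashComments, slashStarComments) would need to be switched back on in the same way if required.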

I'll look into parboiled when I have a chance. In the meantime, the disgusting workaround I implemented to get it working is:
private static final String DANGLING_PERIOD_TOKEN = "___DANGLING_PERIOD_TOKEN___";
Then in tokenize()
//a period following whitespace, not followed by a digit is a "dangling period"
code = code.replaceAll("(?<=\\s)\\.(?![0-9])", " "+DANGLING_PERIOD_TOKEN+" ");
And in the tokenization loop
case StreamTokenizer.TT_WORD:
// A word was found; the value is in sval
String word = st.sval;
if(word.equals(DANGLING_PERIOD_TOKEN))
tokens.add(new SymbolToken('.'));
else
tokens.add(new WordToken(word));
break;
This solution is specific to my needs of not caring what the original whitespace was (as it adds some around the inserted "token")

Scanning a number and returning the lexeme in the input stream- Java?

I am trying to write a method that will scan the input and return a String representing the lexeme found in the input string.
This is what I have so far but I don't know if I'm going in the right direction-- all help would be appreciated :)
private String scanNumbers(char input)
{
String result= "";
int value = in.read();
if(value != -1)
{
if(isDigit(input))
{
result = Integer.toString(value);
}
}
return result;
}
public static boolean isDigit(char input)
{
return (input >= '0' && input <= '9');
}
Thank you I am new to parsing/lexemes/compilers.

Introduction
Questions that appear to be related to a homework exercise are often slow to be answered on SO. We often wait until the deadline has well passed!
You mention you are new to the topics of parsing/lexemes/compilers, and want some help in writing a Java method to scan the input and return a string representing the lexeme found in the input string. Later you clarify, indicating that you want a method that skips characters until it finds digits.
There is quite a bit of confusion in your question which produces conflicts in what you want to achieve.
It is not clear whether you want to learn about lexical analysis in Java as part of a larger compiler project or only for numbers, whether you are looking for existing tools or methods that do this, or whether you are trying to learn how to program such methods yourself. If you are programming it yourself, it is also unclear whether you only need to read a number or whether that is just an example of the kind of thing you want to do.
Lexical Analysis
Lexical analysis, also known as scanning, is the process of reading a corpus of text composed of characters. This can be done for several purposes, such as data input, linguistic analysis of written material (such as word frequency counting), or as part of language compilation or interpretation. When done as part of compilation it is one (and usually the first) of a sequence of phases that include parsing, semantic analysis, code generation and optimisation. Compiler writers usually rely on generator tools, so to write a compiler in Java one would often use a lexer generator and a parser generator to create the Java code for those compiler components. Sometimes the lexer and parser are hand-written, but that is not a recommended task for a novice; it takes a compiler-writing specialist to build by hand a compiler that is better than one produced with a tool-set. Sometimes, as a class exercise, students are asked to write code that performs a piece of lexical analysis to help them understand the process, but this usually covers only a few lexemes, like your digit exercise.
The term lexeme describes a sequence of characters that composes an individual entity recognised by a lexical analyser. Once recognised it is usually represented by a token. The lexeme is therefore replaced by a token as part of the lexical analysis process. A lexical analyser will sometimes record the lexeme in a symbol table for later use before replacing it by the token. This is how identifiers in programs are often recorded in a compiler.
There are several tools for building lexers in Java. Two of the most common are JLex and JFlex. To illustrate how they work, to recognise an integer whilst skipping whitespace, we would use the following rules:
%%
WHITE_SPACE_CHAR=[\n\ \t\b\012]
DIGIT=[0-9]
%%
{WHITE_SPACE_CHAR}+ { }
{DIGIT}+ { return(new Yytoken(42,yytext(),yyline,yychar,yychar + yytext().length())); }
%%
which would be processed by the tool to produce Java methods to achieve that task.
The notations used to describe the lexemes are usually written as regular expressions. Computer Science theory can help us with the programming of a lexical analyser: regular expressions can be represented by a form of finite state automaton. There is a particular style of coding for matching lexemes that experienced programmers would recognise and use in this situation, which involves a switch inside a loop:
while ( ! eof ) {
switch ( next_symbol() ) {
case symbol:
...
break;
default:
error(diagnostic); break;
}
}
It is often these concepts that a simple lexical programming exercise is intended to introduce to students.
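As a concrete illustration of the switch-inside-a-loop style, here is a small sketch in Java. The class name SwitchLoopLexer, the token spellings INT(...) and OP(...), and the tiny alphabet (digits, whitespace and four operators) are assumptions chosen for the example, not part of any particular exercise.

```java
import java.util.ArrayList;
import java.util.List;

class SwitchLoopLexer {
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {              // while ( ! eof )
            char c = input.charAt(pos);
            switch (c) {                            // switch ( next_symbol() )
                case ' ': case '\t': case '\n':
                    pos++;                          // skip whitespace
                    break;
                case '+': case '-': case '*': case '/':
                    tokens.add("OP(" + c + ")");
                    pos++;
                    break;
                case '0': case '1': case '2': case '3': case '4':
                case '5': case '6': case '7': case '8': case '9': {
                    int start = pos;                // consume the whole run of digits
                    while (pos < input.length()
                            && input.charAt(pos) >= '0' && input.charAt(pos) <= '9') {
                        pos++;
                    }
                    tokens.add("INT(" + input.substring(start, pos) + ")");
                    break;
                }
                default:                            // error(diagnostic)
                    throw new IllegalArgumentException(
                            "Unexpected character '" + c + "' at position " + pos);
            }
        }
        return tokens;
    }
}
```

For example, SwitchLoopLexer.scan("12 + 345") returns [INT(12), OP(+), INT(345)]. Each case corresponds to a transition of the underlying finite state automaton, and the inner digit loop plays the role of the {DIGIT}+ rule.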
Tokenizing in Java
With all those preliminary explanations out of the way, let's get down to your piece of Java code. As mentioned in the comments, there is a difference in Java between reading bytes from an input stream and reading characters: a Java char is a 16-bit UTF-16 code unit, not a single byte. You have used a byte read within a character processing method.
Recognising simple tokens in an input stream, particularly for data entry, is such a common activity that Java has a built-in class for it called StreamTokenizer.
We could implement your task in the following way, for example:
// create a new tokenizer
Reader r = new BufferedReader(new InputStreamReader( System.in ));
StreamTokenizer st = new StreamTokenizer(r);
// print the stream tokens
boolean eof = false;
do {
int token = st.nextToken();
switch (token) {
case StreamTokenizer.TT_EOF:
System.out.println("End of File encountered.");
eof = true;
break;
case StreamTokenizer.TT_EOL:
System.out.println("End of Line encountered.");
break;
case StreamTokenizer.TT_NUMBER:
System.out.println("Number: " + st.nval);
break;
default:
System.out.println((char) token + " encountered.");
if (token == '!') {
eof = true;
}
}
} while (!eof);
However, this does not return the string of the lexeme for a number, only matches the number and gets the value.
I see you have noticed the Java class java.util.Scanner, because your question had that as a tag. This is another class that can perform similar operations.
We could get an integer lexeme from the input like this:
Scanner s = new Scanner(System.in);
System.out.println(s.nextInt());
Solution
Finally, let's rewrite your original code to find the lexeme for an integer, skipping over any unwanted characters; it uses Java regular expression matching.
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Pattern;
public class ReadNumbers {
    static InputStreamReader in = null; // Have input source as a global
    static int value = -1;              // and the current input value
    public static void main(String[] args) {
        try {
            in = new InputStreamReader(System.in); // Set up the input
            value = in.read();                     // pre-fill the input state
            System.out.println(scanNumbers());
        } catch (IOException e) {
            e.printStackTrace(); // print error
        }
    }
    private static String scanNumbers() throws IOException {
        String skipCharacters = "\\s"; // Characters that can be skipped
        String result = "";            // empty string to store lexeme
        // Skip optional characters before the number
        while ((value != -1) && Pattern.matches(skipCharacters, "" + (char) value)) {
            value = in.read(); // pre-load the next character
        }
        // Now collect the digits of the number
        while ((value != -1) && isDigit((char) value)) {
            result = result + (char) value; // append digit character to result
            value = in.read();              // pre-load the next character
        }
        return result;
    }
    public static boolean isDigit(char input) {
        return (input >= '0' && input <= '9');
    }
}
Afterword
The comment from @markspace is interesting and useful, as it points out that not all numbers are solely decimal digits.
Consider numbers in other bases, like hexadecimal. Java allows integer constants to be specified in number bases that do not use just the digits 0..9.
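To make that point concrete, here is a hedged sketch of a lexeme matcher that accepts hexadecimal and binary prefixes as well as plain decimal digits. The class name IntegerLexeme and the regular expression are assumptions for illustration; the pattern deliberately ignores underscores, octal literals and the L suffix that real Java literals allow.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IntegerLexeme {
    // Hexadecimal (0x..), binary (0b..) or decimal digits, tried in that order
    private static final Pattern INT_LITERAL =
            Pattern.compile("0[xX][0-9a-fA-F]+|0[bB][01]+|[0-9]+");

    /** Returns the first integer lexeme found in the input, or "" if there is none. */
    static String firstInteger(String input) {
        Matcher m = INT_LITERAL.matcher(input);
        return m.find() ? m.group() : "";
    }
}
```

Here IntegerLexeme.firstInteger("  x = 0x1F;") returns "0x1F", whereas a digits-only scanner would stop after the leading 0.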

Java replace $^{}

I am writing a program that allows users to input variable names that they can then use in other Strings. For example, if the user enters:
$token aslkdjfna98y
A mapping is created for key "token" and value "aslkdjfna98y". I then want to add this token variable in a URL by specifying that it should be swapped out using this syntax:
http://www.example.com/getThing?token=$^{token}
So here, I would like to swap $^{token} with my mapped value aslkdjfna98y.
I have tried various String.replace and String.replaceAll calls, however I am currently getting stuck in a loop - where it's known that the String contains the text $^{token}, but I cannot replace the text. Here is where I am struggling:
if (request.contains("$^{"))
{
//handle variables
for (String s : variables.keySet())
{
String str = String.format(Locale.US, "$^{%s}", s);
while(request.contains(str))
{
//Stuck Here
request = request.replace(String.format(Locale.US, "$^{%s}", s), variables.get(s));
}
}
}
This could ideally be simplified down to:
request.replaceAll(regex, str);
How can I correctly replace the characters, or how can I improve this to use replaceAll?

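Enclose the string in \\Q and \\E, which makes the regex engine treat everything between them as literal text. This only matters for replaceAll, which takes a regex; replace already treats its argument literally. Quote the replacement too, since $ is special there:
request = request.replaceAll(String.format(Locale.US, "\\Q$^{%s}\\E", s), Matcher.quoteReplacement(variables.get(s)));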

"$^{token}"
I'm not sure what is supposed to go in the token field.
Any letters or digits?
"$^{[a-zA-Z0-9]*}"
Or a fixed number (8) of letters or digits?
"$^{[a-zA-Z0-9]{8}}"
Depending on the language you are using, you might need to escape {, $ and ^. In a Java regex all three must be escaped, e.g. "\\$\\^\\{[a-zA-Z0-9]*\\}".

I was able to simplify the code down to this simple block:
if (variables.get(s) != null) {
request = request.replaceAll(Log.format("\\Q$^{%s}\\E", s), variables.get(s));
}
else {
Log.err("No variable \"%s\" set", s);
}
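An alternative to looping over every variable is a single pass that finds each $^{name} placeholder and looks the name up. The sketch below is one way to do that, not the approach taken in the answers above; the class name PlaceholderExpander and the decision to leave unknown names untouched are assumptions.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderExpander {
    // Matches $^{name}; the variable name is captured in group 1
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\^\\{(\\w+)\\}");

    static String expand(String request, Map<String, String> variables) {
        Matcher m = PLACEHOLDER.matcher(request);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = variables.get(m.group(1));
            // Unknown names are left as they are (an assumption of this sketch)
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in the value from being interpreted
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With a map containing token = aslkdjfna98y, expand("http://www.example.com/getThing?token=$^{token}", map) returns "http://www.example.com/getThing?token=aslkdjfna98y". Each placeholder is replaced exactly once, so a value that itself contains placeholder syntax cannot cause a loop.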

How can I write this? I need help specifically with my nextWord method

How can I write this? I need help specifically with my nextWord method
package code;
import java.util.HashMap;
public class WordCounter {
/**
* Reads a file (identified by inputFilePath), one character at a time, and tracks
* word counts using a java.util.HashMap<String,Integer>.
*
* A word is defined as a contiguous sequence of characters that does not contain
* word separator characters, where word separator characters are: ' ', '\t', '\n',
* ',' and '.'
*
* You may use only CharacterFromFileReader to read characters from the input file.
*
* In order to keep your code readable, break your code into several methods. Only
* the wordCounts method may be public; define meaningful private helper methods that
* you call from the wordCounts method.
* @return a HashMap containing the word->count mappings
*/
public HashMap<String,Integer> wordCounts(String inputPath) {
return new HashMap<String, Integer>();
}
private String nextWord(CharacterFromFileReader iterator) {
while(iterator.hasNext()){
//make loop that takes each word an saves it to a string until u hit a space break
String s = "";
if(iterator.hasNext()){
}
}
return null;
}
}

How about this?
String[] words = str.split("[ \\t\\n,.]+");
To build str, append each character to a StringBuilder in your iterator loop.

I would use a StringBuilder in your nextWord(). Then you can iterate until you reach a word separator character or the end of the stream, and return that word. It should go something like:
StringBuilder sb = new StringBuilder();
char nextChar;
while (iterator.hasNext()) {
    nextChar = iterator.next();
    switch (nextChar) {
        case ' ':
        case '\t':
        case '\n':
        case ',':
        case '.':
            return sb.toString(); // separator reached: the word is complete
        default:
            sb.append(nextChar);  // still inside the word
    }
}
return sb.toString(); // end of stream: return what was collected

I think you might be approaching the problem of defining nextWord the wrong way. It looks like you've filled out a frame for the nextWord function without understanding HOW the function should accomplish what it's going to do. This is backwards. Try following these basic steps:
Write down in English exactly what nextWord should do. Think about any edge cases. What should it return given a sentence? What should it return if there are no characters left in CharacterFromFileReader?
Once you have this definition, write down (in English) step-by-step instructions for how to accomplish the task, e.g. 1) Get next character 2) If character is white space 3) etc.
Try following the instructions yourself using pen and paper. Do they work? If not, edit and repeat this step until they do.
Now that you know HOW the function should work, implement the instructions you've written down in Java code.

I am guessing that the frame for the nextWord()-method was given by the teacher, and if OP would deliver something involving StringBuilder/StringBuffer teacher might scream "we haven't covered that yet". Also, teacher seems very happy with inefficient object-created-for-every-character monstrosities, so I think OP should go with using Strings-only; as in: inefficient but allowed by teacher:
private String nextWord(CharacterFromFileReader iterator) {
    String toReturn = "";
    // Append characters until a word separator or the end of the input is reached
    while (iterator.hasNext()) {
        char c = iterator.next();
        if (c == ' ' || c == '\t' || c == '\n' || c == ',' || c == '.') break;
        toReturn = toReturn + c;
    }
    return toReturn;
}
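Putting the pieces together, the following sketch shows how nextWord could feed the word counts. CharacterFromFileReader is not shown in the question, so an Iterator<Character> stands in for it here; that substitution, the class name WordCountSketch and the charsOf helper are all assumptions made so the example is self-contained.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class WordCountSketch {
    static boolean isSeparator(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == ',' || c == '.';
    }

    /** Reads characters up to the next separator (or end of input) and returns them. */
    static String nextWord(Iterator<Character> chars) {
        StringBuilder word = new StringBuilder();
        while (chars.hasNext()) {
            char c = chars.next();
            if (isSeparator(c)) break;
            word.append(c);
        }
        return word.toString();
    }

    static Map<String, Integer> wordCounts(Iterator<Character> chars) {
        Map<String, Integer> counts = new HashMap<>();
        while (chars.hasNext()) {
            String word = nextWord(chars);
            if (!word.isEmpty()) {              // consecutive separators give empty words
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Hypothetical helper: views a String as an Iterator<Character> for testing. */
    static Iterator<Character> charsOf(String s) {
        return s.chars().mapToObj(i -> (char) i).iterator();
    }
}
```

For the input "the cat, the dog.\nthe end" this yields the=3, cat=1, dog=1 and end=1. Skipping empty words is what keeps ", " (a comma followed by a space) from being counted as a word.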

Need help parsing strings in Java

I am reading in a csv file in Java and, depending on the format of the string on a given line, I have to do something different with it. The three different formats contained in the csv file are (using random numbers):
833
"79, 869"
"56-57, 568"
If it is just a single number (833), I want to add it to my ArrayList. If it is two numbers separated by a comma and surrounded by quotations ("79, 869"), I want to parse out the first of the two numbers (79) and add it to the ArrayList. If it is three numbers surrounded by quotations, where the first two numbers are separated by a dash and the third by a comma ("56-57, 568"), then I want to parse out the third number (568) and add it to the ArrayList.
I am having trouble using str.contains() to determine if the string on a given line contains a dash or not. Can anyone offer me some help? Here is what I have so far:
private static void getFile(String filePath) throws java.io.IOException {
BufferedReader reader = new BufferedReader(new FileReader(filePath));
String str;
while ((str = reader.readLine()) != null) {
if(str.endsWith("\"")){
if (str.contains(charDash)){
System.out.println(str);
}
}
}
}
Thanks!

I recommend using the version of indexOf that takes a char rather than a String, since it is simpler and can be faster (a single loop, without a nested loop).
I.e.
if (str.indexOf('-')!=-1) {
System.out.println(str);
}
(Note the single quotes, so this is a char, rather than a string.)
But then you have to split the line and parse the individual values. At present, you are testing if the whole line ends with a quote, which is probably not what you want.

The following code works for me (note: I wrote it with no optimization in mind - it's just for testing purposes):
public static void main(String args[]) {
ArrayList<String> numbers = GetNumbers();
}
private static ArrayList<String> GetNumbers() {
String str1 = "833";
String str2 = "79, 869";
String str3 = "56-57, 568";
ArrayList<String> lines = new ArrayList<String>();
lines.add(str1);
lines.add(str2);
lines.add(str3);
ArrayList<String> numbers = new ArrayList<String>();
for (Iterator<String> s = lines.iterator(); s.hasNext();) {
String thisString = s.next();
if (thisString.contains("-")) {
numbers.add(thisString.substring(thisString.indexOf(",") + 2));
} else if (thisString.contains(",")) {
numbers.add(thisString.substring(0, thisString.indexOf(",")));
} else {
numbers.add(thisString);
}
}
return numbers;
}
Output:
833
79
568

Although it gets a lot of hate these days, I still really like StringTokenizer for this kind of thing. You can set it up to return the delimiters as tokens and, at least to me, it makes the processing trivial without involving regexes.
You'd create it using ",- as your delimiters, then just kick it off in a loop.
StringTokenizer st = new StringTokenizer(line, "\",-", true);
Then you set up a loop:
while (st.hasMoreTokens()) {
String token = st.nextToken();
Each case becomes its own little part of the loop:
// Use punctuation to set flags that tell you how to interpret the numbers.
// Compare strings with equals(); == only compares references.
if (token.equals("\"")) {
isQuoted = !isQuoted;
} else if (token.equals(",")) {
...
} else if(...) {
...
} else { // The punctuation has been dealt with, must be a number group
// Apply flags to determine how to parse this number.
}
}
I realize that StringTokenizer is outdated now, but I'm not really sure why. Parsing regular expressions can't be faster and the syntax is--well split is a pretty sweet syntax I gotta admit.
I guess if you and everyone you work with is really comfortable with Regular Expressions you could replace that with split and just iterate over the resultant array but I'm not sure how to get split to return the punctuation--probably that "+" thing from other answers but I never trust that some character I'm passing to a regular expression won't do something utterly unexpected.
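A complete version of the StringTokenizer idea, applied to the three line formats in the question, might look like the sketch below. The class and method names are assumptions, and a space is added to the delimiter set (the answer's fragment does not include it) so that the number tokens come back without surrounding blanks.

```java
import java.util.StringTokenizer;

class CsvLineTokenizer {
    /** Returns the number the question wants from one line, or null for an empty line. */
    static String wantedNumber(String line) {
        // returnDelims = true, so quotes, commas, dashes and spaces come back as tokens too
        StringTokenizer st = new StringTokenizer(line, "\",- ", true);
        boolean sawDash = false;
        String first = null;
        String last = null;
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.equals("\"") || token.equals(" ") || token.equals(",")) {
                continue;                       // punctuation that carries no decision
            } else if (token.equals("-")) {
                sawDash = true;                 // format: "56-57, 568"
            } else {
                if (first == null) first = token;
                last = token;                   // remember the most recent number
            }
        }
        // With a dash the third (last) number is wanted, otherwise the first
        return sawDash ? last : first;
    }
}
```

For the three sample lines 833, "79, 869" and "56-57, 568" this returns 833, 79 and 568 respectively.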

Will
if (str.indexOf(charDash.toString()) > -1){
System.out.println(str);
}
do the trick?
Incidentally, this is at least as fast as contains, because contains is itself implemented with indexOf.

Will this work?
if(str.contains("-")) {
System.out.println(str);
}
I wonder if the charDash variable is not what you are expecting it to be.

I think three regexes would be your best bet - because with a match, you also get the bit you're interested in. I suck at regex, but something along the lines of:
.*\-.*, (.+)
.*, (.+)
and
(.+)
ought to do the trick (in order, because the final pattern matches anything including the first two).
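The three-pattern idea can be sketched in Java as follows. The patterns here are tightened variants of the ones above (digits only, optional surrounding quotes), which is an assumption about the data; the class and method names are likewise illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvLineRegex {
    // Tried in order, most specific first; group 1 is the number to keep
    private static final Pattern DASHED = Pattern.compile("\"?\\d+-\\d+,\\s*(\\d+)\"?");
    private static final Pattern PAIR   = Pattern.compile("\"?(\\d+),\\s*\\d+\"?");
    private static final Pattern SINGLE = Pattern.compile("(\\d+)");

    static String wantedNumber(String line) {
        for (Pattern p : new Pattern[] { DASHED, PAIR, SINGLE }) {
            Matcher m = p.matcher(line.trim());
            if (m.matches()) {
                return m.group(1);
            }
        }
        throw new IllegalArgumentException("Unrecognised line: " + line);
    }
}
```

For 833, "79, 869" and "56-57, 568" the method returns 833, 79 and 568. Because these patterns only accept digits, the order matters less than with the looser .* versions, but keeping the most specific pattern first mirrors the advice above.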
