Keeping track of state in JFlex


I'm writing a custom flex file to generate a lexer for use with JSyntaxpane.
The custom language I need to lex has different states that can be embedded into each other in a kind of stack.
I.e., you could be writing an expression that contains a single-quoted string and then embed another expression within that string using a special token, eval(). You can also embed the expression within a double-quoted string.
E.g.:
someExpressionFunction('a single-quoted string with an eval(expression) embedded in it', "a double-quoted string with an eval(expression) embedded in it")
This is a simplification, there are more states than this, but assuming I need to have different states for DOUBLE_STRING and SINGLE_STRING it adequately describes my situation.
What's the best way to ensure I return to the correct state upon closing the eval expression (i.e., return to DOUBLE_STRING if I was in double quotes, SINGLE_STRING if I was in single quotes)?
The solution I've come up with, which works, is to keep track of state using a Stack and some custom methods to use in lieu of using yybegin to start a different state.
private Stack<Integer> stack = new Stack<Integer>();
public void yypushState(int newState) {
    stack.push(yystate());
    yybegin(newState);
}
public void yypopState() {
    yybegin(stack.pop());
}
Is this the best way to achieve this? Is there a simpler built-in function of JFlex I can leverage or a best practice I should know about?

I think that's one very good way of doing it. I actually needed a similar feature to add Groovy GStrings, Python-like strings, and some HTML in JavaDocs.
What I would also like to add is a lexer calling another lexer to parse subsections, something like JavaScript embedded in HTML. But I could not find the time to do it.
I like StackOverflow, but just wondering why didn't you post this on JSyntaxPane's issues?
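The push/pop bookkeeping from the question can be exercised outside JFlex. The following is a minimal sketch, not generated lexer code: the class name StateStackDemo, the numeric state ids, and the stand-in yystate()/yybegin() methods are assumptions made so the example runs on its own. It also guards against popping an empty stack, which the original snippet does not.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Stand-in for a generated lexer: only the state bookkeeping is modelled. */
class StateStackDemo {
    // Hypothetical state ids; JFlex generates the real ones from %state declarations.
    static final int YYINITIAL = 0;
    static final int SINGLE_STRING = 1;
    static final int DOUBLE_STRING = 2;
    static final int EVAL = 3;

    private int currentState = YYINITIAL;
    private final Deque<Integer> stack = new ArrayDeque<>();

    // Minimal imitations of the two methods a JFlex scanner provides.
    int yystate() { return currentState; }
    void yybegin(int newState) { currentState = newState; }

    /** Remembers the current state, then switches to the new one. */
    void yypushState(int newState) {
        stack.push(yystate());
        yybegin(newState);
    }

    /** Returns to the remembered state; falls back to YYINITIAL if nothing was pushed. */
    void yypopState() {
        yybegin(stack.isEmpty() ? YYINITIAL : stack.pop());
    }

    public static void main(String[] args) {
        StateStackDemo lexer = new StateStackDemo();
        lexer.yypushState(DOUBLE_STRING); // saw an opening "
        lexer.yypushState(EVAL);          // saw eval( inside the string
        lexer.yypopState();               // saw the closing ) of eval
        System.out.println(lexer.yystate() == DOUBLE_STRING);
        lexer.yypopState();               // saw the closing "
        System.out.println(lexer.yystate() == YYINITIAL);
    }
}
```

In a real .flex file the two methods would live in the %{ ... %} user-code section and be called from rule actions in place of yybegin.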

Related

RegEx that captures a method and its body [duplicate]

I have tried to develop a regex that captures a method and its body (The modifier is not important), but I could not develop a solid solution. The regex that I came up with so far is this: \\b\\w*\\s*\\w*\\s*\\(.*?\\)\\s*\\{([^}]+)\\}
It does not capture methods correctly because it does not account for balanced curly braces. Thus, it sometimes captures only part of a method rather than all of it. What am I doing wrong, or how could I improve the regex so that it captures the whole method?
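To make the failure concrete, here is a small sketch that runs the regex above against a method containing a nested block. The class name BraceRegexDemo and the sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BraceRegexDemo {
    // The regex from the question, unchanged.
    static final Pattern METHOD =
            Pattern.compile("\\b\\w*\\s*\\w*\\s*\\(.*?\\)\\s*\\{([^}]+)\\}");

    /** Returns the first match in the source, or null if there is none. */
    static String firstMatch(String source) {
        Matcher m = METHOD.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String source = "void f() { if (x) { y(); } z(); }";
        // [^}]+ stops at the first closing brace, so the match ends inside the method.
        System.out.println(firstMatch(source)); // prints: void f() { if (x) { y(); }
    }
}
```

The match ends at the brace that closes the inner if block, so the call to z() and the method's own closing brace are lost.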

You can't do this. It's impossible.
The 'regular' in 'Regular Expression' refers to a certain subset of grammars; the so-called 'Regular Grammars'.
Here's the thing:
Non-Regular Grammars cannot be parsed with regular expressions.
Java (the language) is Non-Regular.
Thus, you can't use regular expressions for this, QED.
So, how do you parse java?
There are many ways; so far, java is still so-called LL(k) parseable, which means that just about every 'parser/grammar' library out there will be capable of parsing java code, and many such libraries ship with a java grammar as an example. These usually aren't quite perfect, but pretty good.
A basic web search gets you many options. Alternatively, javac is free (but GPL, you'd have to GPL anything you build with it), and ecj (the parser that powers eclipse, amongst other things) is open source with a more permissive license. It's also faster. It's also far harder to use, so there's that.
These are fairly complex tools. However, java is a very complex language (most programming languages are). Parsing them is decidedly non-trivial.
Before you think: Geez, surely it can't be this hard, consider:
public void test() {
    {}
    String x = "{";
}
Which is legal java.
Or:
public void test() {
    // method body
\u007D
That really is legal java, that \u007D thing closes it. Of course...
public void test() {
    //{} \u007D
}
Here the \u thing doesn't. It is a real closing brace, but, that is in a comment.
Another one to consider:
public void test() {
    class Foo {
        String y = """
            }
            """;
    }
}
Hopefully, considering the above, you realize you stand absolutely no chance whatsoever unless you use a parser that knows about the entire language spec.
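As one illustration of the parser route, the JDK ships a Compiler Tree API (com.sun.source) on top of javac. The following is a sketch under assumptions: it requires a JDK rather than a bare JRE, the class names MethodExtractor and StringSource are invented here, and it only collects method names; each MethodTree also exposes the body via getBody().

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.MethodTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

class MethodExtractor {
    /** In-memory source file, so the sketch needs nothing on disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Parses the source with javac and returns the names of all declared methods. */
    static List<String> methodNames(String className, String code) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system compiler; run on a JDK");
        }
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null, List.of(new StringSource(className, code)));
        List<String> names = new ArrayList<>();
        try {
            for (CompilationUnitTree unit : task.parse()) {
                new TreeScanner<Void, Void>() {
                    @Override
                    public Void visitMethod(MethodTree method, Void unused) {
                        names.add(method.getName().toString());
                        return super.visitMethod(method, unused); // descend into nested classes
                    }
                }.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        String code = "class Demo { void test() { { } String x = \"{\"; } }";
        System.out.println(methodNames("Demo", code));
    }
}
```

Because the real parser does the tokenizing, braces inside string literals, comments, and nested blocks do not confuse it.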

How to insert, replace and delete regions of a file, in Java?

I'm looking for an implemented class (in Java) which handles insertion, replacement and deletion of text for an existing file, just like StringBuilder does.
The reason I do not want to use a StringBuilder is that I want to avoid having all the file content in memory.
The purpose is to apply patches to a file which contains code in any programming language.

Generally, a class should do one thing, and do it well.
That means that it is unlikely you'll find a single class that reads the code, turns it into an internal representation (parse tree), detects the issue at hand, alters the internal representation, and writes the internal representation back out to disk.
With that in mind, there are a number of projects that you might be able to extend to add in your desired functionality.
Checkstyle parses Java code, with the intent of reporting the stylistic errors. To do this it must read the code, turn it into an internal representation, and detect (formatting) issues. It might be a good starting point, depending on your goals.
PMD is a static code analysis tool. For it to find issues in Java source code, it must: read the code, turn it into an internal representation, and detect (structural) issues.
Note that neither of these tools does everything you wish, but they are close. You would construct a "fixer" that runs on the parsed tree and fixes the detected problem. Then you will need to find out whether the tool provides an "outputter" that reconstructs the text of the code from the internal parsed (tree) representation, and use it to generate the desired text, which you then save to disk.
If the tree-to-text module doesn't exist, you might have to write it.
Source code is subject to rules, and while you might feel that you don't need these extra steps, your code will have a lifetime much greater than it would have by skipping these steps. Simply pasting in a line of code might not make sense with unexpected input. For example, assuming you add the @Override annotation to a Java method, this pseudocode will fail:
currentLine = next();
if (currentLine.detectMethod() && isAnOverride(currentLine.getMethod())) {
    code.insertBefore(currentLine, "@Override");
}
Because someone will feed your code this:
public
void
myMethod(
    String one,
    String two,
    String three)
{
    System.out.println("Haha! I broke you!");
}
possibly leading to:
public
void
@Override
myMethod(
    String one,
    String two,
    String three)
{
    System.out.println("Haha! I broke you!");
}
And you can say "Well, nobody should do that!" But, if the language permits it, then you'll be at odds with the language itself.
If you don't believe me, a line-by-line processor would not detect "public" as a method, nor "void" as a method, but would detect "myMethod(" as the beginning of a (misidentified) package private multi-line method.
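If a patch is already expressed as line ranges (as in a typical diff) and plain text-level editing is acceptable despite the caveats above, the file can be streamed through a temporary file so that only one line is held in memory at a time. This is a minimal sketch using java.nio; StreamingPatcher, replaceLines, and demo are hypothetical names, and line endings are normalized to the platform default as a side effect.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

class StreamingPatcher {
    /**
     * Replaces lines first..last (1-based, inclusive) with the given lines.
     * An empty replacement deletes; last == first - 1 inserts before line first.
     */
    static void replaceLines(Path file, int first, int last, List<String> replacement) {
        try {
            Path tmp = Files.createTempFile(file.toAbsolutePath().getParent(), "patch", ".tmp");
            try (BufferedReader in = Files.newBufferedReader(file);
                 BufferedWriter out = Files.newBufferedWriter(tmp)) {
                boolean written = false;
                int lineNo = 0;
                String line;
                while ((line = in.readLine()) != null) {
                    lineNo++;
                    if (!written && lineNo >= first) {
                        writeAll(out, replacement); // emit the new text once, at the patch point
                        written = true;
                    }
                    if (lineNo < first || lineNo > last) {
                        out.write(line); // copy lines outside the replaced range
                        out.newLine();
                    }
                }
                if (!written) {
                    writeAll(out, replacement); // patch point lies past the end: append
                }
            }
            Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void writeAll(BufferedWriter out, List<String> lines) throws IOException {
        for (String l : lines) {
            out.write(l);
            out.newLine();
        }
    }

    /** Patches a throwaway four-line file and returns the resulting lines. */
    static List<String> demo() {
        try {
            Path file = Files.createTempFile("demo", ".txt");
            Files.write(file, List.of("a", "b", "c", "d"));
            replaceLines(file, 2, 3, List.of("X"));
            List<String> result = Files.readAllLines(file);
            Files.delete(file);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints: [a, X, d]
    }
}
```

When several patches target the same file, applying them in a single pass (sorted by line number) avoids rewriting the file once per patch.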

How do I ensure the format for saving and parsing string representations of Objects correlate properly

I am making a small boardgame program which needs to persist the state of the board to a file, and later read from the file and re-create the board.
I am delegating this functionality to the class shown below. I would like to implement this such that the save format of a square of the board, along with its coordinates, is captured in the SQUARE_SAVE_FORMAT constant, and the regex for reading that same information is captured in the LOAD_REGEX constant. The two should correlate in code and also be easy to relate visually (by that I mean a person should be able to clearly see that they describe the same data).
Is there an idiom or pattern for doing this in Java code ?
public class BoardPersistenceUtility {
    private final String SQUARE_SAVE_FORMAT = "";
    private final String LOAD_REGEX = "";
    public void save(PrintWriter writer, Board board) {
    }
    public Board load(BufferedReader reader) {
        // Implement
        return null;
    }
}
Update 1:
On reading my question again, I guess it might be a bit confusing about what exactly I am looking for. I am specifically looking for the right way to represent SQUARE_SAVE_FORMAT so that it clearly correlates with the regex LOAD_REGEX.
SQUARE_SAVE_FORMAT would ideally be a String which uses special characters/variables that will be replaced with actual values and the result will be saved to a file. LOAD_REGEX is the corresponding regex that will be used to read contents from the file. The regex will use capturing groups so I can re-create the original object from the values I get from the capturing groups.
My question is, what are the idioms around creating such pairs of Strings - one of them a format string to be used for saving data, and the other a regex to be used while reading that data.
Update 2:
On thinking a bit more, I think I have been able to clarify my question a bit better.
If you look at both the Strings, SQUARE_SAVE_FORMAT is a format string which will be used in String.format() to create the text for a square on the board, which will be saved in the file. The constant SQUARE_LOAD_REGEX is a regex which will be used to read the line and capture relevant parts into named groups, so I can re-create the original object. (sorry if my regex is slightly incorrect... I quickly wrote something, but I need to refresh some regex principles to ensure that this is indeed what I need)
If you look at both these Strings visually, it is difficult to correlate them. Perhaps that is because we do not have any named variables in a Java format String. The best we can do is specify an explicit argument index, as in %1$d.
I would like to understand if there is any idiom or pattern to represent such pairs of Strings, where one is used for formatting some data to text and the other is used to read the same text and parse its parts.
public class BoardPersistenceUtility {
    private final String SQUARE_SAVE_FORMAT =
        "%d,%d:%b-%s";
    // Backslashes must be doubled inside a Java string literal.
    private final String SQUARE_LOAD_REGEX =
        "^(?<row>\\d*),(?<col>\\d*):(?<mine>true|false)-(?<status>\\w)$";
    public void save(PrintWriter writer, Board board) {
    }
    public Board load(BufferedReader reader) {
        // Implement
        return null;
    }
}

Note: you call SQUARE_SAVE_FORMAT and LOAD_REGEX "constants" which they are not, as you haven't declared them static final. It's better to keep terminology clear :-)
The simplest way to link these two is to define a class which encloses both as (final) fields. If you plan to define multiple such pairs of information, you can define multiple instances of the class, one for each type of format.
If you really want to keep these as constants, it may be best to define the enclosing class as an enum. Note that Java enums may contain methods too, so you may choose to implement the save/load logic as Strategies in the enum instances themselves, and call these polymorphically, which may help simplify your code.
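A minimal sketch of the enum idea, assuming the square is described by a row, a column, a mine flag, and a one-character status as in the question's format string. The names SquareFormat, V1, save, and load are hypothetical, and the regex uses \d+ rather than \d* so that empty coordinates are rejected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

enum SquareFormat {
    // The format string and its matching regex sit side by side, one pair per constant.
    V1("%d,%d:%b-%s",
       "^(?<row>\\d+),(?<col>\\d+):(?<mine>true|false)-(?<status>\\w)$");

    private final String saveFormat;
    private final Pattern loadPattern;

    SquareFormat(String saveFormat, String loadRegex) {
        this.saveFormat = saveFormat;
        this.loadPattern = Pattern.compile(loadRegex);
    }

    /** Renders one square as a line of text. */
    String save(int row, int col, boolean mine, char status) {
        return String.format(saveFormat, row, col, mine, status);
    }

    /** Returns {row, col, mine, status} as strings; throws if the line is malformed. */
    String[] load(String line) {
        Matcher m = loadPattern.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed square: " + line);
        }
        return new String[] {
            m.group("row"), m.group("col"), m.group("mine"), m.group("status")
        };
    }

    public static void main(String[] args) {
        String line = V1.save(3, 7, true, 'F');
        System.out.println(line); // prints: 3,7:true-F
        System.out.println(String.join("|", V1.load(line))); // prints: 3|7|true|F
    }
}
```

A second on-disk format would become a second constant, and a save-then-load round trip like the one in main is a natural unit test for each constant.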

I'm still not sure what you mean, but need formatting, so answer instead of comment.
First of all, the names are almost completely unrelated; relate them somehow:
SQUARE_DATA_STORE
SQUARE_DATA_REGEX
Second, there's no point in differentiating the "style" of the saved data if there's only a single BoardPersistenceUtility--if there were multiple formats then that information would be captured in a persistence utility subclass, like SquareFormatPersister or something.
Third, according to your text, one string is where the data will actually be stored. The other is a regular expression. The two will, in this case, never be "visually similar"--regular expressions of any complexity will never (much) look like the strings they can represent. (In this case, we have no clue, because we don't know what the board data can look like, of course.)
If your code is so non-self-explanatory that the reader can't figure out that the two fields are related through your comments and your code, something has gone horribly wrong. I'm having a hard time imagining this code is so overwhelmingly complex that their relationship cannot be trivially communicated.
Edit after update
The answer is still no.
You could use a templating mechanism to provide names for the fields, similar to those used in your regex. This might also make the code a bit more self-explanatory as you'd fill the template context with named values (like "row" or "col").
You could use a real parser/generator, but the complexity there is a bit too much.
You could use a DSL (internal using Groovy, JRuby, JavaScript, etc. or external, which brings us back to parsing) and write chunks of the code that way.
IMO you're over-thinking, and over-estimating perceived complexity: except possibly for the templating solution, which IMO is likely over-engineering for the level of difficulty, you'd be far better off writing one or two sentences, which should be more than enough to relate the "fields" of the load and save formats.
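For reference, the templating idea mentioned above could look roughly like this: a single template with named placeholders drives both rendering and parsing, so the two cannot drift apart. This is a sketch, not an existing library; NamedTemplate, the {name} placeholder syntax, and the per-field regex map are all assumptions. It needs Java 9 or later for the StringBuilder overload of appendReplacement.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedTemplate {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    private final String template;
    private final Pattern loadPattern;
    private final List<String> names = new ArrayList<>();

    /** fieldRegexes maps each placeholder name to the regex its value must match. */
    NamedTemplate(String template, Map<String, String> fieldRegexes) {
        this.template = template;
        StringBuilder regex = new StringBuilder("^");
        Matcher m = PLACEHOLDER.matcher(template);
        int literalStart = 0;
        while (m.find()) {
            String name = m.group(1);
            if (!fieldRegexes.containsKey(name)) {
                throw new IllegalArgumentException("No regex for field: " + name);
            }
            // Literal text between placeholders is quoted so it matches itself.
            regex.append(Pattern.quote(template.substring(literalStart, m.start())));
            regex.append("(?<").append(name).append(">")
                 .append(fieldRegexes.get(name)).append(")");
            names.add(name);
            literalStart = m.end();
        }
        regex.append(Pattern.quote(template.substring(literalStart))).append("$");
        this.loadPattern = Pattern.compile(regex.toString());
    }

    /** Fills each {name} placeholder with the corresponding value. */
    String render(Map<String, ?> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = String.valueOf(values.get(m.group(1)));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Reads a rendered line back into a name-to-text map. */
    Map<String, String> parse(String line) {
        Matcher m = loadPattern.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed line: " + line);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : names) {
            result.put(name, m.group(name));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("row", "\\d+");
        fields.put("col", "\\d+");
        fields.put("mine", "true|false");
        fields.put("status", "\\w");
        NamedTemplate square = new NamedTemplate("{row},{col}:{mine}-{status}", fields);
        String line = square.render(Map.of("row", 3, "col", 7, "mine", true, "status", "F"));
        System.out.println(line);
        System.out.println(square.parse(line));
    }
}
```

Whether this is worth it over two well-commented constants is the judgment call discussed above.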

Put comments in your code to explain that they're related, how they're related, what they're used for, and that if one is changed, the other should be modified accordingly.
Implement a unit test to make sure that a saved board can be loaded.
Make sure that your build and release process runs the unit tests, and fails if one of them doesn't pass.

Wanted: a very simple Java RegExp API

I'm tired of writing
Pattern p = Pattern.compile(...
Matcher m = p.matcher(str);
if (m.find()) {
...
over and over again in my code. I was going to write a helper class to make it neater, but then I wondered: is there a library that tries to provide a simpler facade for regular expressions in Java?
I'm thinking something in the style of commons-lang and Guava.
CLARIFICATION: I am actually hoping for some general library that would make working with regular expressions a more streamlined experience, kind of like how Perl does it. The code above was just an example.
I was thinking of something I could use like this:
for (int question : RegEx.findAllInts("SO question #(\\d+)", str)) {
// do something with int
}
Again, this is just an example of one of the many things I'd like to have. Probably not even a good example. APIs are hard.
UPDATE: I guess the answer is "No". Thanks for all the answers, have an upvote.

Why not just write your own wrapper method? Sure, you should not reinvent the wheel but another library also means another dependency.
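Such a wrapper can be quite small. Here is a sketch of the findAllInts helper from the question; the RegEx class name comes from the question's wish list, and the convention that the number is in capture group 1 is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegEx {
    private RegEx() {}

    /** Collects capture group 1 of every match in the input, parsed as an int. */
    static List<Integer> findAllInts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> result = new ArrayList<>();
        while (m.find()) {
            result.add(Integer.parseInt(m.group(1)));
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "SO question #1234, SO question #3456";
        for (int question : findAllInts("SO question #(\\d+)", str)) {
            System.out.println(question);
        }
    }
}
```

The Matcher is a local variable, so concurrent callers do not share any state.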

Pattern should only be compiled once; save it in a static final field. This at least saves you from repeating this step, both at coding time and at runtime. That is to say, for performance reasons this step should not always go hand in hand with creating a Matcher.
In your example, it seems RegEx plays the role of a Matcher object anyway. I hope it's not supposed to be a class with a static method since this would not work in a multithreaded environment -- the find and getInt calls are not connected then. So you need a Matcher of some sort anyway.
And so you're back to precisely the Java API, when design considerations are factored in. No I don't think there's a shorter way to do this correctly and efficiently.
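A short sketch of that split: the Pattern is compiled once and shared, while each call creates its own Matcher. The class QuestionIds and the method firstId are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuestionIds {
    // Compiled once. Pattern instances are immutable and safe to share between threads.
    private static final Pattern QUESTION = Pattern.compile("SO question #(\\d+)");

    /** Returns the first question number in the input, or -1 if there is none. */
    static int firstId(String input) {
        // A Matcher carries per-search state, so every call gets a fresh one.
        Matcher m = QUESTION.matcher(input);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(firstId("see SO question #42")); // prints: 42
        System.out.println(firstId("no match here"));       // prints: -1
    }
}
```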

There is a Java library that extends the features of the built-in Java regex library: have a look at RegExPlus. I haven't tried it personally, but I hope this helps.

Yeah, it's always bugged me, too, having to write so much boilerplate to perform such common tasks. I think it would help a lot if String had a pair of methods like
public String findFirst(String regex)
public String[] findAll(String regex)
These represent the two most commonly performed regex operations that aren't already supported by String methods. If we had those, plus a dynamic replacement facility like Rewriter, we could almost forget about Pattern and Matcher. We would only need them when we're writing something really complicated, like a findAllInts() method. :D
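Since String cannot be extended, the closest approximation is a pair of static helpers. This sketch assumes Java 9 or later for Matcher.results(); the class name Regexes is made up, and returning null from findFirst when nothing matches is a design choice rather than anything the answer specifies.

```java
import java.util.Arrays;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Regexes {
    private Regexes() {}

    /** Returns the first substring matching the regex, or null if there is none. */
    static String findFirst(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    /** Returns every non-overlapping match, in order of appearance. */
    static String[] findAll(String input, String regex) {
        return Pattern.compile(regex).matcher(input).results()
                .map(MatchResult::group)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(findFirst("a1b22c333", "\\d+"));               // prints: 1
        System.out.println(Arrays.toString(findAll("a1b22c333", "\\d+"))); // prints: [1, 22, 333]
    }
}
```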

There is Jakarta Regexp (see the RE class). Have a look at this old thread for advantages of Jakarta's RegExp package over the Java built-in RegEx.

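Since Java 1.4, you can also use String.matches(String regex), which is a facade over Pattern and Matcher. Note that it succeeds only if the regex matches the entire string, so it corresponds to Matcher.matches() rather than to the find() call in the question.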

For the specific example you give, you might be able to improvise something using Guava's splitter:
for (String number : Splitter.onPattern("[^\\d]+").split(input)) {
    // Do something with the number
}
or more specifically, if you had input like
SO question #1234, SO Question #3456, SO Question #5678
you might do
for (String number : Splitter.onPattern("(, )?SO [Qq]uestion #").omitEmptyStrings().split(input)) {
    // Do something
}
It's a bit hacky, but in specific cases it may do what you're after.

Implement a Custom Escaper in Freemarker

Freemarker has the ability to do text escaping using something like this:
<#escape x as x?html>
Foo: ${someVal}
Bar: ${someOtherVal}
</#escape>
xml, xhtml, and html are all built in escapers. Is there a way to register a custom written escaper? I want to generate CSV and have each individual element escaped and that seems like a good mechanism.
I'm trying to do this in Struts 2 if that matters as well.

You seem to be confusing two concepts here. ?xml, ?xhtml and ?html are string built-ins.
<#escape> OTOH is syntax sugar to save you from typing the same expression over and over again. It can be used with any expression, it's not limited to built-ins.
That said, there's unfortunately no built-in for CSV string escaping, and there's no way to write your own without modifying the FreeMarker source (though if you do want to go this way it's pretty straightforward: take a look at freemarker.core.BuiltIn). Perhaps you can get by with ?replace using a regex, or just write and expose an appropriate method and invoke it in your template.
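The Java side of the "write and expose a method" option might look like the sketch below. It has no FreeMarker dependency; CsvEscaper is a hypothetical name, the quoting rules follow the common RFC 4180 convention, and how the method is made visible to the template (for example as an object in the data model or wrapped in a TemplateMethodModelEx) is left as an assumption about your setup.

```java
final class CsvEscaper {
    private CsvEscaper() {}

    /**
     * Quotes a field when it contains a comma, a double quote, or a line break,
     * doubling any embedded double quotes. Other fields are returned unchanged.
     */
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(escape("plain"));        // prints: plain
        System.out.println(escape("a,b"));          // prints: "a,b"
        System.out.println(escape("say \"hi\""));   // prints: "say ""hi"""
    }
}
```

Once the method is reachable from the template under some name, the <#escape> directive can wrap it like any other expression.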

The Javadoc for HtmlEscaper indicates how to instantiate/register that in code (see the header), so I suspect if you implement your own TemplateTransformModel, and register it in a similar fashion then that should work.
