Pattern Matching On Java Syntax (Checkstyle)

I am currently trying to write my own check for Checkstyle.
It is supposed to raise a warning for commented-out code inside a class.
I have the recognition of comments figured out, but now I'm facing the problem of how to recognize whether a comment's content is Java code.
Are there any libraries that already provide this? Just checking for certain keywords such as modifiers, types, scopes, etc. would be too vague in some situations.
tl;dr: I'm looking for a way to find out whether a string is Java code or not (pattern matching).

It would be very hard to determine whether a single line is Java code, as a line can be as little as a single }. That said, if you want to check whether a whole FILE is Java, there are some good regex options, mostly because you can look at the context around a given line.
Even with those, someone could craft a file that is detected as Java while it actually isn't. Still, it would work for most if not all "normal" files.
If a regex is what you're looking for, you might want to search for similar threads on StackOverflow, because there should be a few around (I used one myself a while ago). If you really want to do this in Checkstyle, however, you might be out of luck...

A good heuristic method to determine large blocks of commented code is to check for preceding spaces. A "valid" comment will usually be indented with the actual code:
public class A {
    public void a() {
        // valid comment
        ...
    }
}
Whereas a code block that has been commented with ctrl-7 will directly start with the // characters:
public class A {
//    public void a() {
//        // valid comment
//        ...
//    }
}
Thus, your regular expression would look something like this
^//.*
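A minimal sketch of applying this heuristic to a source string outside of Checkstyle. The class name ColumnZeroCommentFinder and the method findLines are made up for this illustration; only the pattern ^//.* comes from the answer. It reports the 1-based numbers of lines whose comment marker sits in the first column.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ColumnZeroCommentFinder {
    // The pattern from the answer: "//" in the very first column of a line.
    private static final Pattern COLUMN_ZERO_COMMENT = Pattern.compile("^//.*");

    /** Returns the 1-based numbers of all lines that start with "//" in column zero. */
    static List<Integer> findLines(String source) {
        List<Integer> hits = new ArrayList<>();
        // \R splits on any line terminator, so Windows line endings work too.
        String[] lines = source.split("\\R", -1);
        for (int i = 0; i < lines.length; i++) {
            if (COLUMN_ZERO_COMMENT.matcher(lines[i]).matches()) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String source = "public class A {\n"
                + "//    public void a() {\n"
                + "//    }\n"
                + "    // valid comment\n"
                + "}";
        System.out.println(findLines(source));
    }
}
```

The indented comment on line 4 is not reported, which is the point of the heuristic. A hand-written comment that happens to start in column zero would still be a false positive.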

Related

RegEx that captures a method and its body [duplicate]

This question already has answers here:
Java : parse java source code, extract methods
(2 answers)
Closed 1 year ago.
I have tried to develop a regex that captures a method and its body (the modifier is not important), but I could not come up with a solid solution. The regex I have so far is this: \\b\\w*\\s*\\w*\\s*\\(.*?\\)\\s*\\{([^}]+)\\}
It does not capture methods correctly because it does not account for balanced curly braces, so it sometimes captures only part of a method. What am I doing wrong, and how could I improve it so that it captures the whole method?

You can't do this. It's impossible.
The 'regular' in 'Regular Expression' refers to a certain subset of grammars; the so-called 'Regular Grammars'.
Here's the thing:
Non-Regular Grammars cannot be parsed with regular expressions.
Java (the language) is Non-Regular.
Thus, you can't use regular expressions for this, QED.
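To make the limitation concrete, here is a small sketch that runs the pattern from the question against a method with one nested block. The class name BraceRegexDemo and the sample method are invented for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BraceRegexDemo {
    // The pattern from the question, unchanged.
    static final Pattern METHOD = Pattern.compile(
            "\\b\\w*\\s*\\w*\\s*\\(.*?\\)\\s*\\{([^}]+)\\}");

    /** Returns the first match of the pattern, or null if there is none. */
    static String firstMatch(String source) {
        Matcher m = METHOD.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String source = "void a() { if (x) { y(); } z(); }";
        // The match stops at the first '}', so "z(); }" is cut off.
        System.out.println(firstMatch(source));
    }
}
```

The match ends at the closing brace of the inner if block, because [^}]+ has no way of counting how many braces are still open.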
So, how do you parse java?
There are many ways; so far, java is still so-called LL(k) parseable, which means that just about every 'parser/grammar' library out there will be capable of parsing java code, and many such libraries ship with a java grammar as an example. These usually aren't quite perfect, but pretty good.
A basic web search gets you many options. Alternatively, javac is free (but GPL, you'd have to GPL anything you build with it), and ecj (the parser that powers eclipse, amongst other things) is open source with a more permissive license. It's also faster. It's also far harder to use, so there's that.
These are fairly complex tools. However, Java is a very complex language (most programming languages are). Parsing it is decidedly non-trivial.
Before you think: Geez, surely it can't be this hard, consider:
public void test() {
    {}
    String x = "{";
}
Which is legal java.
Or:
public void test() {
    // method body
\u007D
That really is legal Java; the \u007D closes the method. Of course...
public void test() {
    //{} \u007D
}
Here the \u007D does not close anything. It still becomes a real closing brace, but it sits inside a comment.
Another one to consider:
public void test() {
    class Foo {
        String y = """
            }
            """;
    }
}
Hopefully, considering the above, you realize you stand absolutely no chance whatsoever unless you use a parser that knows about the entire language spec.
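As one way of following that advice, the sketch below uses the Compiler Tree API (the com.sun.source packages) that exposes the javac parser mentioned above. It assumes the program runs on a full JDK rather than a bare JRE, and the names MethodExtractor, StringSource and extractMethods are hypothetical. It parses a source string and returns the exact text of every method declaration.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.MethodTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.SourcePositions;
import com.sun.source.util.TreeScanner;
import com.sun.source.util.Trees;

import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

class MethodExtractor {

    /** Keeps source text in memory so the compiler can read it like a file. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String fileName, String code) {
            super(URI.create("string:///" + fileName), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Parses the source and returns the original text of each method found. */
    static List<String> extractMethods(String source) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // null on a bare JRE
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null,
                List.of(new StringSource("Demo.java", source)));
        SourcePositions positions = Trees.instance(task).getSourcePositions();
        List<String> methods = new ArrayList<>();
        try {
            for (CompilationUnitTree unit : task.parse()) {
                new TreeScanner<Void, Void>() {
                    @Override
                    public Void visitMethod(MethodTree node, Void unused) {
                        int start = (int) positions.getStartPosition(unit, node);
                        int end = (int) positions.getEndPosition(unit, node);
                        methods.add(source.substring(start, end));
                        return super.visitMethod(node, unused); // also visit nested methods
                    }
                }.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return methods;
    }
}
```

Because the parser tracks strings, comments and nested blocks itself, the brace inside "{" in the first example above does not confuse it.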

How to insert, replace and delete regions of a file, in Java?

I'm looking for an implemented class (in Java) which handles insertion, replacement and deletion of text for an existing file, just like StringBuilder does.
The reason I do not want to use a StringBuilder is that I want to avoid having all the file content in memory.
The purpose is to apply patches to a file which contains code in any programming language.

Generally, a class should do one thing, and do it well.
That means that it is unlikely you'll find a single class that reads the code, turns it into an internal representation (parse tree), detects the issue at hand, alters the internal representation, and writes the internal representation back out to disk.
With that in mind, there are a number of projects that you might be able to extend to add in your desired functionality.
Checkstyle parses Java code, with the intent of reporting the stylistic errors. To do this it must read the code, turn it into an internal representation, and detect (formatting) issues. It might be a good starting point, depending on your goals.
PMD is a static code analysis tool. For it to find issues in Java source code, it must: read the code, turn it into an internal representation, and detect (structural) issues.
Note that neither of these tools does everything you wish, but they are close. All you would have to do is construct a "fixer" that runs on the parsed tree and fixes the detected problem. Then you need to find out whether the tool provides an "outputter" that reconstructs the text of the code from the internal parsed (tree) representation, and use it to generate the desired text, which you then save to disk.
If the tree-to-text module doesn't exist, you might have to write it.
Source code is subject to rules, and while you might feel that you don't need these extra steps, your code will have a much longer lifetime with them than it would have by skipping them. Simply pasting in a line of code might not make sense with unexpected input. For example, assuming you add the @Override annotation to a Java method, this pseudocode will fail:
currentLine = next();
if (currentLine.detectMethod() and isAnOverride(currentLine.getMethod())) {
    code.insertBefore(currentLine, "@Override");
}
Because someone will feed your code this:
public
void
myMethod(
    String one,
    String two,
    String three)
{
    System.out.println("Haha! I broke you!");
}
possibly leading to:
public
void
@Override
myMethod(
    String one,
    String two,
    String three)
{
    System.out.println("Haha! I broke you!");
}
And you can say "Well, nobody should do that!" But if the language permits it, then you'll be at odds with the language itself.
If you don't believe me, consider that a line-by-line processor would not detect "public" as a method, nor "void" as a method, but would detect "myMethod(" as the beginning of a (misidentified) package-private multi-line method.
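The failure described above can be reproduced with a deliberately naive line-based inserter. The class NaiveOverrideInserter and its METHOD_START pattern are invented for this illustration and are not taken from any real tool; the pattern simply treats any line of the form "words name(" as the start of a method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class NaiveOverrideInserter {
    // Naive assumption: a method starts on the line that contains "name(".
    private static final Pattern METHOD_START =
            Pattern.compile("\\s*(\\w+\\s+)*\\w+\\s*\\(.*");

    /** Inserts "@Override" before every line that looks like a method start. */
    static List<String> insertOverride(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (METHOD_START.matcher(line).matches()) {
                out.add("@Override");
            }
            out.add(line);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
                "public", "void", "myMethod(",
                "String one,", "String two,", "String three)",
                "{", "System.out.println(\"Haha! I broke you!\");", "}");
        // The annotation lands between "void" and "myMethod(", which does not compile.
        insertOverride(input).forEach(System.out::println);
    }
}
```

Only the "myMethod(" line matches, so the annotation is placed after the modifiers and return type instead of before the declaration.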

Java Coding standard: variable declaration

Which of the following is best practice according to Java coding standards
public void function1() {
    boolean valid = false;
    // many lines of code
    valid = validateInputs();
    // many lines of code
}
or
public void function1() {
    // many lines of code
    boolean valid = validateInputs();
    // many lines of code
}
Here 'valid' is not returned; its scope is internal to the function only. Sometimes it is used in just a single if condition.
I usually code similar to the second case. It seems my superior does not like this and modifies the code when I put it for review. Is there some specific reason that my approach is not correct?
The disadvantage I see for the first approach is that it is very difficult to refactor the method to multiple methods at a later point.

I would go for the second approach. It's not so much a matter of Java coding standards as a matter of clean and readable code. Also, you assign the value false to valid in the first case, but that's not really correct, as valid shouldn't have any value at that point.
On a side note, I wouldn't expect a method called validateInputs() to return a boolean. There's no parameter passed, and the name gives no hint that the method returns something. What about refactoring your code to something like boolean validInput = isValid(input)?

I would prefer to declare variables only within the scope in which they are used. This avoids accidentally using a variable when you shouldn't, and it means you can see the declaration and the usage together instead of having to jump to the start of your code to find it.
In the C days, you had to use the first form, because the compilers were not very smart. But the second form was added as it made the code easier to understand AFAIK.

Which one is better is a matter of personal taste. Every place has its own standards, so at work you should follow them.
That's one more reason I think every programmer should have their own personal projects. That way you can also code in your own style at home, so you don't get your mind stuck with just one style.

There should always be reasoning behind a decision.
The second example is better because it is good to initialize values in the declaration.
Google has a good set of standards that is applicable to many C-type languages. The example you are referring to is shown in the 'local variables' section.

To put space before method signature or not?

I would like to know what the overall recommendation is for whitespace between a method's name and its parameters.
That is, the general preference between the following two lines:
public static void main (String[] args) {} // (We'll ignore the question of spaces between the `String` and `[]` .)
public static void main(String[] args) {}
I recently have begun to feel like the former is the better one, especially considering that everything else in a method declaration (e.g. the throws Exception(s) section) is also space-separated.

As @chris mentioned in the comments, the official Java Code Conventions specifically state:
Note that a blank space should not be used between a method name and its opening parenthesis. This helps to distinguish keywords from method calls.
So the difference you call into question is deliberate: methods are treated differently on purpose.

2018 update: since almost all IDEs allow easy invocation/declaration lookup, the main gains of switching the convention here are moot; after 3 years, I've switched back to the "no space after method name" rule just because most style guides in most languages use it... making an exception to existing code styles is IMVHO viable when the gains outweigh the "WTF factor" of the change; in this case, with up-to-date tooling, there are no actual gains, so I'd personally recommend against the alternative proposed below.
I beg to differ with the interpretation of 1999's unmaintained White Space Java conventions. It only says that the space shouldn't be used to help distinguish keywords from method calls. Thus, there ain't no official rules as to whether use the space in contexts where method call can't appear (and thus where no such help is needed) - as such, the rule obviously does apply to invocative context (where calls can appear and where it helps) and doesn't apply to declarative context (where calls can't appear and where it would serve no purpose). Even more - because the conventions state that the whitespace use should help in distinguishing the usage context, using space on declaration actually keeps with the spirit of the rule - it actually allows you to distinguish method invocation from method declaration, even when using simple text search (just search for the method name followed by space).
After switching to it, distinguishing invocations from the declaration got easier. It also highlighted the fact that the parentheses after the name ain't the invocative ones - and that their syntax is different from the call syntax (i.e. type declarations are needed before variable names etc.), as you noticed already.
tl;dr you can use
void method () { } // declaration
void method2 () { // declaration
    method(); // invocation
}
to be able to quickly do searches on declarations/invocations only and to satisfy the convention at the same time.
Note that all the official Java code, as well as most of the code styles in the wild, doesn't use the space in the declarations.

How do I ensure the format for saving and parsing string representations of Objects correlate properly

I am making a small boardgame program which needs to persist the state of the board to a file, and later read from the file and re-create the board.
I am delegating this functionality to the class shown below. I would like to implement this such that the save format of a square of the board, along with its coordinates, is captured in the SQUARE_SAVE_FORMAT constant, and the regex for reading that same information is captured in the LOAD_REGEX constant. Both should correlate in code and also be easy to relate visually (by that I mean that a person should be able to clearly see that they describe the same data).
Is there an idiom or pattern for doing this in Java code ?
public class BoardPersistenceUtility {
    private final String SQUARE_SAVE_FORMAT = "";
    private final String LOAD_REGEX = "";
    public void save(PrintWriter writer, Board board) {
    }
    public Board load(BufferedReader reader) {
        // Implement
        return null;
    }
}
Update 1:
On reading my question again, I guess it might be a bit confusing, about what exactly I am looking for. I am specifically looking for the right way to represent SQUARE_SAVE_FORMAT so that it clearly co-relates with the regex LOAD_REGEX.
SQUARE_SAVE_FORMAT would ideally be a String which uses special characters/variables that will be replaced with actual values and the result will be saved to a file. LOAD_REGEX is the corresponding regex that will be used to read contents from the file. The regex will use capturing groups so I can re-create the original object from the values I get from the capturing groups.
My question is, what are the idioms around creating such pairs of Strings - one of them a format string to be used for saving data, and the other a regex to be used while reading that data.
Update 2:
On thinking a bit more, I think I have been able to clarify my question a bit better.
If you look at both the Strings, SQUARE_SAVE_FORMAT is a format string which will be used in String.format() to create the text for a square on the board, which will be saved in the file. The constant SQUARE_LOAD_REGEX is a regex which will be used to read the line and capture relevant parts into named groups, so I can re-create the original object. (Sorry if my regex is slightly incorrect... I quickly wrote something, but I need to refresh some regex principles to ensure that this is indeed what I need.)
If you look at both these Strings visually, it is difficult to correlate them. Perhaps that is because we do not have any named variables in a Java format String. The best we can do is to specify an explicit argument index such as %1$d, where the number before the $ is the position of the argument.
I would like to understand if there is any idiom or pattern to represent such pairs of Strings, where one is used for formatting some data to text and the other is used to read the same text and parse its parts.
public class BoardPersistenceUtility {
    private final String SQUARE_SAVE_FORMAT =
            "%d,%d:%b-%s";
    // Backslashes must be doubled inside a Java string literal.
    private final String SQUARE_LOAD_REGEX =
            "^(?<row>\\d*),(?<col>\\d*):(?<mine>true|false)-(?<status>\\w)$";
    public void save(PrintWriter writer, Board board) {
    }
    public Board load(BufferedReader reader) {
        // Implement
        return null;
    }
}

Note: you call SQUARE_SAVE_FORMAT and LOAD_REGEX "constants" which they are not, as you haven't declared them static final. It's better to keep terminology clear :-)
The simplest way to link these two is to define a class which encloses both as (final) fields. If you plan to define multiple such pairs of information, you can define multiple instances of the class, one for each type of format.
If you really want to keep these as constants, it may be best to define the enclosing class as an enum. Note that Java enums may contain methods too, so you may choose to implement the save/load logic as Strategies in the enum instances themselves, and call these polymorphically, which may help simplify your code.
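A minimal sketch of the enum idea, assuming a single format. The names SquareFormat and V1 are hypothetical, and the regex is tightened to \d+ and \w+ as an assumption about what a row, column and status look like. The format string and the regex sit side by side in one constant, and the save and load logic lives on the enum instance.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

enum SquareFormat {
    // Format string and matching regex are declared next to each other.
    V1("%d,%d:%b-%s",
       "^(?<row>\\d+),(?<col>\\d+):(?<mine>true|false)-(?<status>\\w+)$");

    private final String saveFormat;
    private final Pattern loadPattern;

    SquareFormat(String saveFormat, String loadRegex) {
        this.saveFormat = saveFormat;
        this.loadPattern = Pattern.compile(loadRegex);
    }

    /** Renders one square as a line of text. */
    String format(int row, int col, boolean mine, String status) {
        return String.format(saveFormat, row, col, mine, status);
    }

    /** Parses a saved line; the named groups row, col, mine and status hold the values. */
    Matcher parse(String line) {
        Matcher m = loadPattern.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a saved square: " + line);
        }
        return m;
    }
}
```

A call such as SquareFormat.V1.format(3, 4, true, "OPEN") yields the line 3,4:true-OPEN, and SquareFormat.V1.parse on that line exposes the same values through group("row"), group("col") and so on. A second file format would become a second enum constant.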

I'm still not sure what you mean, but need formatting, so answer instead of comment.
First of all, the names are almost completely unrelated. Relate them somehow:
SQUARE_DATA_STORE
SQUARE_DATA_REGEX
Second, there's no point in differentiating the "style" of the saved data if there's only a single BoardPersistenceUtility--if there were multiple formats then that information would be captured in a persistence utility subclass, like SquareFormatPersister or something.
Third, according to your text, one string is where the data will actually be stored. The other is a regular expression. The two will, in this case, never be "visually similar"--regular expressions of any complexity will never (much) look like the strings they can represent. (In this case, we have no clue, because we don't know what the board data can look like, of course.)
If your code is so non-self-explanatory that the reader can't figure out that the two fields are related through your comments and your code, something has gone horribly wrong. I'm having a hard time imagining this code is so overwhelmingly complex that their relationship cannot be trivially communicated.
Edit after update
The answer is still no.
You could use a templating mechanism to provide names for the fields, similar to those used in your regex. This might also make the code a bit more self-explanatory as you'd fill the template context with named values (like "row" or "col").
You could use a real parser/generator, but the complexity there is a bit too much.
You could use a DSL (internal using Groovy, JRuby, JavaScript, etc. or external, which brings us back to parsing) and write chunks of the code that way.
IMO you're over-thinking, and over-estimating perceived complexity: except possibly for the templating solution, which IMO is likely over-engineering for the level of difficulty, you'd be far better off writing one or two sentences, which should be more than enough to relate the "fields" of the load and save formats.
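For completeness, the templating option mentioned above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the class NamedTemplate, the {name} placeholder syntax, and the per-field regexes passed to the constructor. The save text and the load regex are both derived from one template string, so they cannot drift apart.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedTemplate {
    // Placeholders look like {row}; names follow Java's rules for named groups.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\{([A-Za-z][A-Za-z0-9]*)\\}");

    private final String template;
    private final List<String> names = new ArrayList<>();
    private final Pattern loadPattern;

    /** fieldRegexes maps each placeholder name to the regex its value must match. */
    NamedTemplate(String template, Map<String, String> fieldRegexes) {
        this.template = template;
        StringBuilder regex = new StringBuilder("^");
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            String name = m.group(1);
            names.add(name);
            // Literal text between placeholders is quoted so it matches itself.
            regex.append(Pattern.quote(template.substring(last, m.start())))
                 .append("(?<").append(name).append('>')
                 .append(fieldRegexes.get(name)).append(')');
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last))).append('$');
        this.loadPattern = Pattern.compile(regex.toString());
    }

    /** Replaces every placeholder with the value stored under its name. */
    String render(Map<String, ?> values) {
        StringBuilder out = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            out.append(template, last, m.start()).append(values.get(m.group(1)));
            last = m.end();
        }
        return out.append(template, last, template.length()).toString();
    }

    /** Reads a rendered line back into a name-to-text map. */
    Map<String, String> parse(String line) {
        Matcher m = loadPattern.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised line: " + line);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : names) {
            fields.put(name, m.group(name));
        }
        return fields;
    }
}
```

With the template "{row},{col}:{mine}-{status}", rendering the values 3, 4, true and OPEN gives 3,4:true-OPEN, and parsing that line returns the same four fields by name. Whether this is worth the extra class for a single format is exactly the over-engineering concern raised above.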

Put comments in your code to explain that they're related, how they're related, what they're used for, and that if one is changed, the other should be modified accordingly.
Implement a unit test to make sure that a saved board can be loaded.
Make sure that your build and release process runs the unit tests, and fails if one of them doesn't pass.
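The round-trip test suggested above might be sketched as follows, in plain Java so that it does not depend on a particular test framework. The class SquareRoundTripCheck and the method roundTrips are hypothetical, and the regex uses \d+ and \w+ as an assumption about the data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SquareRoundTripCheck {
    // These two must be changed together; the check below guards that.
    static final String SQUARE_SAVE_FORMAT = "%d,%d:%b-%s";
    static final Pattern SQUARE_LOAD_REGEX = Pattern.compile(
            "^(?<row>\\d+),(?<col>\\d+):(?<mine>true|false)-(?<status>\\w+)$");

    /** Formats one square, parses it back, and reports whether every field survived. */
    static boolean roundTrips(int row, int col, boolean mine, String status) {
        String saved = String.format(SQUARE_SAVE_FORMAT, row, col, mine, status);
        Matcher m = SQUARE_LOAD_REGEX.matcher(saved);
        return m.matches()
                && Integer.parseInt(m.group("row")) == row
                && Integer.parseInt(m.group("col")) == col
                && Boolean.parseBoolean(m.group("mine")) == mine
                && m.group("status").equals(status);
    }

    public static void main(String[] args) {
        if (!roundTrips(3, 4, true, "OPEN")) {
            throw new AssertionError("save format and load regex disagree");
        }
        System.out.println("round trip ok");
    }
}
```

The same check also exposes values the regex cannot read back: a status containing a space, for example, formats fine but fails to parse, which is the kind of drift the unit test is meant to catch.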
