Java and whitespace-as-syntax (ala Python)? - java

One part of Java syntax bugs the crap out of me: curly braces and semicolons. Is there some sort of translator that will let me use all of the Java syntax except for these? I want to do something like this:
public class Hello:
    public static void main(String[] args):
        System.out.println("I like turtles.")
public class Another:
    public static void somethingelse():
        System.out.println("And boobs")
It's Python's whitespace-as-syntax model, and I've grown to love it. I believe it's cleaner and easier on the eyes. If it doesn't exist, I'm actually considering investing serious time into writing a parser that would do this for me. (Ideally it will open a file, format it with whitespace, and when saved, save it as plain Java syntax with braces and all.)
Would this cause problems elsewhere in the language? What kind of hiccups can I expect to run into? I want to use all of the rest of the Java syntax exactly how it is otherwise, just want to modify this small niggle.
I can already write and read code just fine in Eclipse. And yes, I already know how to use code formatting tools and all the auto-complete options available to me, this is merely a preference in coding style so please don't answer with "You should learn to get used to it" or "You should use an IDE that does braces FOR you"...no. I don't want that.

Can you switch to Jython? Python's syntax, Java's runtime environment.

You might consider instead investing time in learning to use Eclipse and the code formatter.
Python's model is great in that it (more or less) forces all developers into the same code style. Modern software firms have code-style guidelines and use IDEs with formatters to ease enforcement of the coding style. Even if your coworkers have (subjectively, to you) atrocious styles, you can quickly reformat that code into something you find readable.

You can use Jython or Scala, which drop most curly braces and semicolons (as well as dots and parentheses). Their syntax is much more readable, and you still get all the power of the JVM.
Though, if you need an actual translator (to save the result as plain Java code), you can easily write one yourself. Read the input file line by line, tracking the indentation, and each time it changes, delete the colon at the end of the line (if needed) and insert a curly brace: an opening brace for a deeper indent and a closing brace for a shallower one. (This doesn't take several possible cases into account, but most of them are considered bad style.)
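The line-by-line approach described above could be sketched roughly as follows. This is only a minimal illustration, not a complete translator: the class name IndentToBraces and the method translate are made up for this example, it assumes Java 11+, indentation with spaces only, one statement per line, and that every line ending in a colon opens a block (so case labels, ternaries split across lines, annotations and comments would all be mishandled).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper: turns indentation-structured pseudo-Java into braces and semicolons.
class IndentToBraces {

    static List<String> translate(List<String> lines) {
        List<String> out = new ArrayList<>();
        Deque<Integer> open = new ArrayDeque<>(); // indentation of each block header still open
        for (String raw : lines) {
            if (raw.isBlank()) {
                continue; // blank lines carry no structure in this sketch
            }
            int indent = leadingSpaces(raw);
            String body = raw.strip();
            // A line at or to the left of an open header's indentation closes that block.
            while (!open.isEmpty() && indent <= open.peek()) {
                out.add(" ".repeat(open.pop()) + "}");
            }
            if (body.endsWith(":")) {
                // Block header: swap the trailing colon for an opening brace.
                out.add(" ".repeat(indent) + body.substring(0, body.length() - 1) + " {");
                open.push(indent);
            } else {
                // Plain statement: terminate it with a semicolon.
                out.add(" ".repeat(indent) + body + ";");
            }
        }
        // Close whatever is still open at end of input, innermost first.
        while (!open.isEmpty()) {
            out.add(" ".repeat(open.pop()) + "}");
        }
        return out;
    }

    private static int leadingSpaces(String s) {
        int n = 0;
        while (n < s.length() && s.charAt(n) == ' ') {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
            "public class Hello:",
            "    public static void main(String[] args):",
            "        System.out.println(\"I like turtles.\")");
        translate(input).forEach(System.out::println);
    }
}
```

Going the other way (opening a brace-style file and showing it as indentation) is the harder half, since it needs at least a tokenizer that understands strings and comments.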


Is there a quick way to refactor the entire codebase to adopt a different code convention?

I have received ownership of a code base that, although very well written, uses a rather bizarre convention:
public void someMethod(String pName, Integer pAge, Context pContext)
{
...
}
I'd like to make the following two changes to the entire code:
public void someMethod(String name, Integer age, Context context) { ... }
Opening brace on the same line as the method declaration
Use a camelCase name for all parameters of the method, without this weird "p" prefix
Can checkstyle help me here? I'm looking but I can't find a way to rename all parameters in all method signatures to something more pleasant.
If you are willing to use the Eclipse IDE, it offers a very handy feature for auto-formatting code:
http://help.eclipse.org/juno/index.jsp?topic=%2Forg.eclipse.jdt.doc.user%2Freference%2Fpreferences%2Fjava%2Fcodestyle%2Fref-preferences-formatter.htm
It is pretty self-explanatory and straight-forward in my opinion.
Eclipse allows for regex based search and replace operations.
Just open Search > File... there enter the following regex for Containing text:
\b[p]([A-Z][a-z]+)\b
And tick both Case sensitive and Regular expression.
Then press Replace...
In the newly popped up window enter
\1
in the With: field and tick Regular expression.
Edit: Sadly, in its current version Eclipse does not support the \L flag for captured groups, so you are still stuck with an uppercase leading letter.
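If the lower-casing step is the blocker, the same textual rename can be scripted outside the IDE. The sketch below is an assumption-laden illustration, not a refactoring tool: the class ParamPrefixStripper and method stripPrefix are hypothetical names, it needs Java 9+ for the lambda form of replaceAll, and the pattern is widened slightly from the one above (\w* instead of [a-z]+) so that names like pFirstName match too. Being purely textual, it will also rename any field or local that happens to look like pSomething, and it does not check for name collisions.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: strips a leading "p" from identifiers such as pName.
class ParamPrefixStripper {
    // "p", then one uppercase letter (group 1), then the rest of the identifier (group 2).
    private static final Pattern P_PREFIXED = Pattern.compile("\\bp([A-Z])(\\w*)\\b");

    static String stripPrefix(String source) {
        // Lower-case the first captured letter and keep the remainder unchanged.
        return P_PREFIXED.matcher(source).replaceAll(
            m -> Matcher.quoteReplacement(m.group(1).toLowerCase(Locale.ROOT) + m.group(2)));
    }

    public static void main(String[] args) {
        System.out.println(stripPrefix(
            "public void someMethod(String pName, Integer pAge, Context pContext)"));
    }
}
```

Reviewing the resulting diff before committing is advisable, since a regex cannot tell a parameter from any other identifier.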
To answer your question about checkstyle: No, checkstyle is a tool used for analyzing code, not for changing it.
Using checkstyle to format code
(Question from Oct'12)
I also did some research; here's another Stack Overflow question aiming at practically the same thing. The solution offered there is similarly work-intensive.
Can I automatically refactor an entire java project and rename uppercase method parameters to lowercase?
(Question from Oct'10)

Java, how to recognize a part of a token as a separate token?

Hopefully my title isn't completely terrible. I don't really know what this should be called. I'm trying to write a very basic scheme parser in Java. The issue I'm having is with implementation.
I open a file, and I want to parse individual tokens:
while(sc.hasNext()) {
System.out.println(sc.next());
}
Generally, to get tokens, this is fine. But in Scheme, recognizing the beginning and end of a list is crucial; my program's functionality depends on this, so I need a way to treat a token such as:
(define
or
poly))
as multiple tokens, where each parenthesis is its own token:
(
define
poly
)
)
If I can do that, I can properly recognize different symbols to add to my symtab, and know when/how to add nodes to my parse tree.
The Java API shows that the Scanner class doesn't have any methods for doing exactly what I want. The closest thing I could think of is using the parentheses as custom delimiters, which would make each token clean enough to be recognized more easily by my logic, but then what happens to my parentheses?
Another method I'm thinking about is forgoing the Java tokenizer, and just scanning char by char until I find a complete symbol.
What should I do? Try to work around the Java scanner methods, or just do a character by character approach?
First, you need to get your terminology straight. (define is not a single token; it's a ( token followed by a define one. Similarly, poly)) is not a single token, it's three.
Don't let java.util.Scanner (that's what you're using, right?) throw you for a loop -- when you say "Generally, to get tokens, this is fine", I say no, it's not. Don't settle for what it provides if it's not enough.
To correctly tokenize Scheme code, I'd expect you need to at least be able to deal with regular languages. That would probably be very tough to do using Scanner, so here's a couple of alternatives:
learn and apply a tried-and-true parsing tool like Antlr or Lex. Will be beneficial for any of your future parsing projects
roll your own regular expression approach (I don't know Scheme well enough to be sure that this will work) for tokenizing, but don't forget that you need at least context-free for full parsing
learn about parser combinators and recursive descent parsing, which are relatively easy to implement by hand -- and you'll end up learning a ton about Java's type system
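As a rough illustration of the "roll your own regular expression" option, here is a minimal sketch that splits input so every parenthesis becomes its own token. The class name SchemeTokenizer and its tokenize method are invented for this example, and it deliberately ignores string literals, comments and quote characters, which real Scheme source would need.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer: parentheses are always separate tokens.
class SchemeTokenizer {
    // Either a single parenthesis, or a run of characters that are
    // neither whitespace nor parentheses.
    private static final Pattern TOKEN = Pattern.compile("[()]|[^\\s()]+");

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints each token of the input on its own line.
        tokenize("(define (poly x) (* x x))").forEach(System.out::println);
    }
}
```

This only covers tokenizing; building the parse tree from the token list still needs something context-free, such as a small recursive descent routine that recurses on each opening parenthesis.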

Should we use regular expression in Java?

I know regular expressions are very powerful, and to become an expert with them is not easy.
One of my colleagues once wrote a Java class to parse formatted text files. Unfortunately, it caused a StackOverflowError in the first integration test. The bug seemed difficult to find, until another colleague from the structured programming world came over and fixed it quickly by throwing away all the regular expressions and instead using many nested conditional statements and many split and trim calls, and it works very well!
Well, why do we need regular expression in a programming language like Java? As far as I know, the only necessary usage of regular expression is the find/replace function in text editors.
Like everything else: Use with care and KISS
I use regexes quite often, but I don't go over the top and write a 100 character regex, because I know that I (personally) won't understand it later... in fact I think my limit is about 30-40 characters, something larger than that makes me spend too much time scratching my head.
Anything that can be expressed as a regular expression can, by definition, be expressed as a chain of IFs. You use REGEX basically for two reasons:
REGEX libraries tend to have optimized implementations that, most of the time, will be better than a hand-coded "IF" chain for some expressions.
REGEX are usually easier to follow, if properly written, than IF chains, especially for more complex expressions.
If your expression gets too complex, then use the advice given by this answer. If it gets truly nasty, think about learning how to use a parser generator like ANTLR or JavaCC. A simple grammar can usually replace a regex, and it is a lot easier to maintain.
So the multiple nested conditional statements with many split and trim methods are easier for you to debug than a single line or two with regular expressions?
My preference is regular expressions because once you learn them, they are far more maintainable and far easier to read than parsing huge nested if loops.
If you find that a regular expression would get too complex and unmaintainable, use code instead. Regular expressions can get very complex even for things that sound very simple at first. For example, validation of dates in the format mm/dd/yy[yy] is as "simple" as:
^(((((((0?[13578])|(1[02]))[\.\-/]?((0?[1-9])|([12]\d)|(3[01])))|(((0?[469])|(11))[\.\-/]?((0?[1-9])|([12]\d)|(30)))|((0?2)[\.\-/]?((0?[1-9])|(1\d)|(2[0-8]))))[\.\-/]?(((19)|(20))?([\d][\d]))))|((0?2)[\.\-/]?(29)[\.\-/]?(((19)|(20))?(([02468][048])|([13579][26])))))$
Nobody can maintain that. Manually parsing the date will need more code but can be much more readable and maintainable.
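For comparison, a hand-written check might look something like the sketch below. It is not equivalent to the regex above: it assumes a fixed "/" separator and a four-digit year between 1900 and 2099, and it leans on java.time.LocalDate (Java 8+) for the days-per-month and leap-year rules instead of spelling them out. The class name DateCheck and method isValid are hypothetical.

```java
import java.time.DateTimeException;
import java.time.LocalDate;

// Hypothetical validator for dates written as mm/dd/yyyy.
class DateCheck {

    static boolean isValid(String text) {
        String[] parts = text.split("/", -1);
        if (parts.length != 3) {
            return false;
        }
        for (String part : parts) {
            // Reject empty fields, signs and anything else that is not a digit.
            if (part.isEmpty() || !part.chars().allMatch(Character::isDigit)) {
                return false;
            }
        }
        try {
            int month = Integer.parseInt(parts[0]);
            int day = Integer.parseInt(parts[1]);
            int year = Integer.parseInt(parts[2]);
            if (year < 1900 || year > 2099) {
                return false; // assumption: same century range the regex allows
            }
            LocalDate.of(year, month, day); // throws if the day does not exist
            return true;
        } catch (NumberFormatException | DateTimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("02/29/2024"));
        System.out.println(isValid("02/29/2023"));
    }
}
```

It is longer than one line, but each rule sits on its own line and can be changed without re-deriving the whole expression.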
Regular expressions are very powerful and useful for matching TEXT patterns, but are bad for validation with numeric parts like dates.
As always, you should use the best tool for the job. I would define the "best tool" by the most simple, understandable, effective method that fulfills the requirements.
Often regexes will simplify code and make it more readable. But this is not always the case.
Also, I would not jump to conclusions that regexes caused the StackOverflowError.
Regular expressions are a tool (like many others). You should use it when the work to be done could best be done with that tool. To know which tool to use, it helps ask a question like "When could I use regular expressions?". And of course it will become easier to decide which tool to use when you have many different tools in your toolbox and you know them fairly well.
You can use regexes cleverly by splitting them into smaller chunks, something like:
final String REGEX_SOMETHING = "something";
final String REGEX_WHATEVER = "whatever";
..
String REGEX_COMPLETE = REGEX_SOMETHING + REGEX_WHATEVER + ...
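A concrete version of that idea might look like the following sketch, which builds a pattern for simple key = value lines out of named pieces. Everything here (the class ComposedRegex, the constant names, the key/value format itself) is made up for illustration and is not taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example of composing a regex from small named parts.
class ComposedRegex {
    private static final String KEY = "([A-Za-z_][A-Za-z0-9_]*)"; // identifier-like key
    private static final String SEPARATOR = "\\s*=\\s*";          // "=" with optional spaces
    private static final String VALUE = "(.*)";                   // rest of the line

    static final Pattern KEY_VALUE = Pattern.compile(KEY + SEPARATOR + VALUE);

    /** Returns {key, value}, or null if the line does not have the expected shape. */
    static String[] parse(String line) {
        Matcher m = KEY_VALUE.matcher(line.strip());
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] kv = parse("timeout = 30");
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```

Each piece can be read and tested on its own, which is most of the benefit.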
Regular expressions can be easier to read, but they can also be too complicated. It depends on the format of data you want to match.
The Java RE implementation still has some quirks, with the effect that some quite simple expressions (like '((?:[^'\\]|\\.)*)') cause a stack overflow when matching longer strings. So make sure you test with real life data (and more extreme examples, too) - or use a regex engine with a different implementation (there are several ones, also as Java libraries).
Regular expressions are very powerful for finding patterns in content. You can certainly avoid using regular expressions and rely on conditional statements, but you will soon notice that it takes many lines of code to accomplish the same task. Using too many nested conditional statements increases the cyclomatic complexity of your code; as a result, it becomes even more difficult to test because there are too many branches to cover. Further, it also makes the code difficult to read and understand.
Granted, your colleague should have written testcases to test his regular expressions first.
There's no right or wrong answer here. If the task is simple, then there's no need to use regular expression. Otherwise, it is nice to sprinkle a little regular expressions here and there to make your code easy to read.

Need some ideas on how to accomplish this in Java (parsing strings)

Sorry I couldn't think of a better title, but thanks for reading!
My ultimate goal is to read a .java file, parse it, and pull out every identifier. Then store them all in a list. Two preconditions are there are no comments in the file, and all identifiers are composed of letters only.
Right now I can read the file, parse it by spaces, and store everything in a list. If anything in the list is a java reserved word, it is removed. Also, I remove any loose symbols that are not attached to anything (brackets and arithmetic symbols).
Now I am left with a bunch of weird strings, but at least they have no spaces in them. I know I am going to have to re-parse everything with a . delimiter in order to pull out identifiers like System.out.print, but what about strings like this example:
Logger.getLogger(MyHash.class.getName()).log(Level.SEVERE,
After re-parsing by . I will be left with more crazy strings like:
getLogger(MyHash
getName())
log(Level
SEVERE,
How am I going to be able to pull out all the identifiers while leaving out all the trash? Just keep re-parsing by every symbol that could exist in java code? That seems rather lame and time consuming. I am not even sure if it would work completely. So, can you suggest a better way of doing this?
There are several solutions that you can use, other than hacking your own parser:
Use an existing parser, such as this one.
Use BCEL to read bytecode, which includes all fields and variables.
Hack into the compiler or run-time, using annotation processing or mirrors - I'm not sure you can find all identifiers this way, but fields and parameters for sure.
I wouldn't separate the entire file at once according to whitespace. Instead, I would scan the file letter-by-letter, saving every character in a buffer until I'm sure an identifier has been reached.
In pseudo-code:
clean buffer
for each letter l in file:
    if l is '
        toggle "character mode"
    if l is "
        toggle "string mode"
    if l is a letter AND "character mode" is off AND "string mode" is off
        add l to end of buffer
    else
        if buffer is NOT a keyword or a literal
            add buffer to list of identifiers
        clean buffer
Notice some lines here hide further complexity - for example, to check if the buffer is a literal you need to check for all of true, false, and null.
In addition, there are more bugs in the pseudo-code - it will also identify things like the e and L parts of literals (e in floating-point literals, L in long literals) as identifiers. I suggest adding additional "modes" to take care of them, but it's a bit tricky.
Also there are a few more things if you want to make sure it's accurate - for example you have to make sure you work with unicode. I would strongly recommend investigating the lexical structure of the language, so you won't miss anything.
EDIT:
This solution can easily be extended to deal with identifiers with numbers, as well as with comments.
Small bug above - you need to handle \" differently than ", same with \' and '.
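Under the question's stated preconditions (no comments, identifiers made of letters only), the pseudo-code could be turned into Java along these lines. This is a sketch, not a lexer: the class IdentifierScanner is a hypothetical name, the keyword set is abbreviated on purpose, and the known bugs mentioned above (such as the L in 10L being reported as an identifier) are still present. It folds the two "modes" into one variable and skips escaped characters so that \" and \' do not end a literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical scanner following the pseudo-code; letters-only identifiers, no comments.
class IdentifierScanner {
    // Abbreviated on purpose: a real list would hold every Java keyword and literal word.
    private static final Set<String> SKIP = Set.of(
        "class", "public", "static", "void", "new", "return", "if", "else",
        "for", "while", "int", "true", "false", "null");

    static List<String> identifiers(String source) {
        List<String> found = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        char quote = 0; // 0 outside literals, otherwise the quote character that opened one
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++; // skip the escaped character so \" or \' does not close the literal
                } else if (c == quote) {
                    quote = 0;
                }
                continue;
            }
            if (Character.isLetter(c)) {
                buffer.append(c);
                continue;
            }
            flush(buffer, found); // any non-letter ends the current word
            if (c == '"' || c == '\'') {
                quote = c;
            }
        }
        flush(buffer, found);
        return found;
    }

    private static void flush(StringBuilder buffer, List<String> found) {
        if (buffer.length() > 0) {
            String word = buffer.toString();
            if (!SKIP.contains(word)) {
                found.add(word);
            }
            buffer.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(identifiers(
            "Logger.getLogger(MyHash.class.getName()).log(Level.SEVERE, \"oops\");"));
    }
}
```

The result keeps duplicates and source order; collecting into a set instead would be a one-line change if only distinct names are wanted.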
Wow, ok. Parsing is hard -- really hard -- to do right. Rolling your own java parser is going to be incredibly difficult to do right. You'll find there are a lot of edge cases you're just not prepared for. To really do it right, and handle all the edge cases, you'll need to write a real parser. A real parser is composed of a number of things:
A lexical analyzer to break the input up into logical chunks
A grammar to determine how to interpret the aforementioned chunks
The actual "parser" which is generated from the grammar using a tool like ANTLR
A symbol table to store identifiers in
An abstract syntax tree to represent the code you've parsed
Once you have all that, you can have a real parser. Of course you could skip the abstract syntax tree, but you need pretty much everything else. That leaves you with writing about 1/3 of a compiler. If you truly want to complete this project yourself, you should see if you can find an example for ANTLR which contains a preexisting java grammar definition. That'll get you most of the way there, and then you'll need to use ANTLR to fill in your symbol table.
Alternately, you could go with the clever solutions suggested by Little Bobby Tables (awesome name, btw Bobby).

How can I parse REXX code in Java?

I'd like to parse REXX source so that I can analyse the structure of the program from Java.
I need to do things like normalise equivalent logic structures in the source that are syntactically different, find duplicate variable declarations, etc. and I already have a Java background.
Any easier ways to do this than writing a load of code?
REXX is not an easy language to parse with common tools, especially those that expect a BNF grammar. Unlike most languages designed by people exposed to C, REXX doesn't have any reserved words, making the task somewhat complicated. Every term that looks like a reserved word is actually only resolved in its specific context (e.g., "PULL" is only reserved as the first word of a PULL instruction or the second word of a PARSE PULL instruction - you can also have a variable called PULL ("PULL = 1 + 2")). Plus there are some very surprising effects of comments. But the ANSI REXX standard has the full syntax and all the rules.
If you have a BNF grammar for REXX, then JavaCC can help you build an AST (Abstract Syntax Tree) representation of that REXX code.
More accurately, JavaCC will generate the Java classes which will:
parse REXX code and
actually build the AST.
There would still be a "load of code", but you would not be the one writing the classes for that REXX parser; you would only generate them.
Have a look at ANTLR; it does a really nice job of building an AST, transforming it, etc.
It has a nice editor (ANTLRWorks), is built on Java, and can debug your parser / tree walkers while they run in your application. Really worth investigating for any kind of parsing job.
