Sorry for the vague title, I didn't know how to describe the problem with just one line.
Basically I'm trying to build a simple parser (manually) for a language with a syntax similar to XML, like this:
<my_language check="somestring">
*strings here*
</my_language>
Where "strings here" means that anything could appear inside (but most likely code from another language).
An example of complete code could be something like this:
<my_language check="House">
House myHouse = new House();
myHouse.setAddress("somewhere");
</my_language>
<my_language check="House/Garage">
Garage myGarage = new Garage();
myGarage.setCar("some car");
</my_language>
The meaning of the language is not really relevant right now. What I need is a way to parse this using a recursive descent parser (made with just a syntax analyzer and a lexical analyzer).
The grammar for the syntax analyzer is not really a problem... what I'm struggling with is the lexical analyzer that produces the tokens I need.
I recently made another parser like this for a language that was more similar to XML, and I used a StreamTokenizer for the lexical analyzer. In this case, though, I don't know how I can use it.
With a StreamTokenizer I could easily split parts like "my_language check="House">" into tokens, but then I would need to take the code inside the tags as a whole (leaving its formatting intact), and I don't know how to do that. Basically I need to grab the whole code block instead of going word by word, but I know that StreamTokenizer won't let me do that.
So, what approach should I use?
Related
I need to write a parser for propositional logic. I intend to do it by hand, implemented as a recursive descent parser in Java.
My question is about the lexer: is it really needed for this job? I mean, do I need to define a finite state machine to recognize tokens, etc.?
I have seen some examples of simple parsers for arithmetic, and they handle everything in a "single parser", relying only on the grammar rules. They don't seem to care about a separate, independent lexer that provides the tokens for the parser.
Since I want to do this in the most correct way, I'm asking for advice on this job. Any link to related information is welcome.
A bit more information would be useful, i.e. the grammar you want to use and an example input string. I don't know how much you know about Chomsky's hierarchy of grammars, but that's the key. Put simply, a lexer can parse at the word level (Type 3: regular grammars) and a parser can analyse syntax too (Type 2: context-free grammars). (More information here: lexers vs parsers.)
It's possible to use a scannerless parser, but I think you would just end up integrating the lexer into your parser if you write one without deliberately trying to avoid a lexer. In other words, if you write your program, you would call the part that tokenizes the raw input string the lexer and the part that applies the grammar the parser, if that's how you want to name them. But you shouldn't worry too much about the terms. If you write a parser and don't think you need a lexer, chances are high that the lexer is already in your code anyway, but who cares ;) Hope that helps, but feel free to ask if anything is still unclear!
You don't need a "real" lexer for this kind of parser. You do need something that picks off the atoms of your language (identifiers, operators, parentheses). You can do this for simple languages, directly in the parser.
See my SO answer on how to code a simple parser: Is there an alternative for flex/bison that is usable on 8-bit embedded systems?
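To make the two answers above concrete, here is a minimal sketch of a hand-written recursive descent recognizer for propositional formulas that picks its atoms straight off the input string, with no separate lexer class. The grammar, the operator characters (&, |, !) and the class name are illustrative choices, not something either answer prescribes:

// Recognizes formulas such as "(a & !b) | c".
// There is no lexer class: peek() and skipSpaces() pick atoms directly off the string.
public class PropParser {
    private final String in;
    private int pos;

    public PropParser(String input) { this.in = input; }

    // true if the whole input is a well-formed formula
    public boolean parse() {
        boolean ok = expr();
        skipSpaces();
        return ok && pos == in.length();
    }

    // expr := term ('|' term)*
    private boolean expr() {
        if (!term()) return false;
        while (peek() == '|') { pos++; if (!term()) return false; }
        return true;
    }

    // term := factor ('&' factor)*
    private boolean term() {
        if (!factor()) return false;
        while (peek() == '&') { pos++; if (!factor()) return false; }
        return true;
    }

    // factor := '!' factor | '(' expr ')' | variable
    private boolean factor() {
        char c = peek();
        if (c == '!') { pos++; return factor(); }
        if (c == '(') {
            pos++;
            if (!expr() || peek() != ')') return false;
            pos++;
            return true;
        }
        return variable();
    }

    // variable := one or more letters
    private boolean variable() {
        skipSpaces();
        int start = pos;
        while (pos < in.length() && Character.isLetter(in.charAt(pos))) pos++;
        return pos > start;
    }

    private char peek() { skipSpaces(); return pos < in.length() ? in.charAt(pos) : '\0'; }

    private void skipSpaces() { while (pos < in.length() && Character.isWhitespace(in.charAt(pos))) pos++; }

    public static void main(String[] args) {
        System.out.println(new PropParser("(a & !b) | c").parse()); // true
        System.out.println(new PropParser("a & | b").parse());      // false
    }
}

Each grammar rule becomes one method, and the "lexer" is nothing more than peek() plus skipSpaces() - exactly the kind of inlined atom-picking both answers describe.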
Hopefully my title isn't completely terrible. I don't really know what this should be called. I'm trying to write a very basic Scheme parser in Java. The issue I'm having is with the implementation.
I open a file, and I want to parse individual tokens:
while (sc.hasNext()) {
    System.out.println(sc.next());
}
Generally, to get tokens, this is fine. But in Scheme, recognizing the beginning and end of a list is crucial; my program's functionality depends on this, so I need a way to treat a token such as:
(define
or
poly))
As multiple tokens, where each parenthesis is its own token:
(
define
poly
)
)
If I can do that, I can properly recognize different symbols to add to my symtab, and know when/how to add nodes to my parse tree.
The Java API shows that the Scanner class doesn't have any methods for doing exactly what I want. The closest thing I could think of is using the parentheses as custom delimiters, which would make each token clean enough to be recognized more easily by my logic, but then what happens to my parentheses?
Another approach I'm considering is forgoing the Java tokenizer and just scanning character by character until I find a complete symbol.
What should I do? Try to work around the Java scanner methods, or just do a character by character approach?
First, you need to get your terminology straight. (define is not a single token; it's a ( token followed by a define one. Similarly, poly)) is not a single token, it's three.
Don't let java.util.Scanner (that's what you're using, right?) throw you for a loop -- when you say "Generally, to get tokens, this is fine", I say no, it's not. Don't settle for what it provides if it's not enough.
To correctly tokenize Scheme code, I'd expect you to need to handle at least regular languages. That would probably be very tough to do using Scanner, so here are a couple of alternatives:
learn and apply a tried-and-true parsing tool like ANTLR or Lex. It will be beneficial for any of your future parsing projects
roll your own regular expression approach for tokenizing (I don't know Scheme well enough to be sure this will work; a minimal sketch follows this list), but don't forget that you need at least a context-free parser for full parsing
learn about parser combinators and recursive descent parsing, which are relatively easy to implement by hand -- and you'll end up learning a ton about Java's type system
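For the "roll your own regular expression" option above, a minimal sketch of a regex-driven tokenizer could look like the one below. The class name and the token pattern are illustrative guesses; real Scheme has more lexical syntax (comments, quote characters, numeric prefixes, and so on) that this does not try to cover. The important part is that every ( and ) comes out as a token of its own:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SchemeTokens {
    // One alternative per token class: a lone paren, a string literal, or an atom.
    private static final Pattern TOKEN =
            Pattern.compile("[()]|\"(?:\\\\.|[^\"\\\\])*\"|[^\\s()\"]+");

    public static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [(, define, (, square, x, ), (, *, x, x, ), )]
        System.out.println(tokenize("(define (square x) (* x x))"));
    }
}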
I want to create an editor and store the formatted text in a database. I just want a simple editor with functions like the Stack Overflow editor:
_sfdfdgdfgfg_ : for underlined text
/sfdfdgdfgfg/ : for bolded text
I will use a function to replace the first _ with <u> and the second with </u> (and similarly replace / with <b> and </b>).
So my question is: how can I detect the opening and the closing _ and /, especially if they are nested?
For example :
/dsdfdfffddf _ dsdsdssd_/ ffdfddsgds /dfdfehgiuyhbf/ ....
I will use this editor in Java Application.
So what you want is a Java version of Markdown.
Here's what Google finds:
http://code.google.com/p/markdownj/
It will not make you happy, but you should probably take the time to learn how to write parsers (the Dragon Book is good for that). The thing with parsing tasks is that they are easy if you know how to do them and nearly impossible if you don't.
I would write a tokenizer that can recognize tokens like <start_underline, "_"> and <end_underline, "_"> for the format indicators you want to use in your editor, plus one token type for everything else. The result could look like this:
Text: Hello _world_, /how are you?/
Tokens: <text, "Hello ">,<start_underline, "_">,<text, "world">,<end_underline, "_">,<text, ", ">,<start_bold, "/">,<text, "how are you?">,<end_bold, "/">,
Start and end can be tracked fairly easily with boolean variables, because it makes no sense to nest them. That's why I would do that tracking in the tokenizer already.
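A rough sketch of such a tokenizer, just to illustrate the idea (the Token class and the method names here are made up for this example, and the token type strings simply mirror the example stream above):

import java.util.ArrayList;
import java.util.List;

public class MarkupTokenizer {

    // Hypothetical token holder matching the <type, text> pairs above.
    static class Token {
        final String type;
        final String text;
        Token(String type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return "<" + type + ", \"" + text + "\">"; }
    }

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        boolean inUnderline = false;   // start/end tracked with booleans, as described above
        boolean inBold = false;
        StringBuilder text = new StringBuilder();

        for (char c : input.toCharArray()) {
            if (c == '_' || c == '/') {
                if (text.length() > 0) {                       // flush pending plain text
                    tokens.add(new Token("text", text.toString()));
                    text.setLength(0);
                }
                if (c == '_') {
                    tokens.add(new Token(inUnderline ? "end_underline" : "start_underline", "_"));
                    inUnderline = !inUnderline;
                } else {
                    tokens.add(new Token(inBold ? "end_bold" : "start_bold", "/"));
                    inBold = !inBold;
                }
            } else {
                text.append(c);
            }
        }
        if (text.length() > 0) {
            tokens.add(new Token("text", text.toString()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello _world_, /how are you?/"));
    }
}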
After that I would write a parser class that takes these tokens and configures the output to a textarea accordingly.
You can see that this is really just an application of the divide-and-conquer principle. The task "How do I do everything I want with my text?" is split into three parts:
According to a useful structure, what is this string made of? (answer from the tokenizer)
How do I handle a specific text part x, for all possible x? (answer from the parser)
How do I represent the parser's interpretation of this string? (answer from a JTextPane or the like)
Neither the tokenizer nor the parser needs to be a separate class. Because the context is not complicated, they can simply be methods in a subclass of whichever text area component you prefer.
Giving more detailed advice is not helpful, I think; the next step would be an implementation, which you probably want to do yourself. Don't hesitate to ask if you fail to find a good solution to one specific part, though.
You can look at the stackoverflow.com page source and try to integrate it... I guess it should work...
https://blog.stackoverflow.com/2008/12/reverse-engineering-the-wmd-editor/
Here is an example showing how to use MarkdownJ:
First, make sure that MarkdownJ is added as a class library to your Java application.
Second, use this code to invoke MarkdownJ:
MarkdownProcessor m = new MarkdownProcessor();
String html = m.markdown("this is *a sample text*");
System.out.print("<html>"+html+"</html>");
Consider the following script (it's total nonsense, in a pseudo-language):
if (Request.hostMatch("asfasfasf.com") && someString.existsIn(new String[] {"brr", "hrr"})) {
    if (Request.clientIp("10.0.x.x")) {
        somevar = "1";
    }
    somevar = "2";
}
else {
    somevar = "first";
}
string foo = "foo";
// etc. etc.
How would you grab the if-block's parameters and contents from it? The if-block has the format:
if<whitespace>(<parameters>)<whitespace>{<contents>}<anything>
I tried using String.split() with the regex pattern ^if\s*\(|\)\s*\{|\}\s*, but this fails miserably. Namely, the problem is that ) { is also found in the inner if-block, and the closing } is found in many places as well. I don't think either lazy or greedy matching works here.
So... any pointers as to what I might need here in order to implement this with regex?
I also need to get the remaining string without the if-block's code (so the code starting from else { ...). Using just String.split() makes that difficult, as it gives no information about the length of the parts that were split away.
I initially created a loop-based solution (using String.substring() heavily) for this, but it's dull. I would like to have something fancier instead. Should I go with regex, or create a custom, generic function (there are many other cases besides this one) that takes the parseable String and the pattern (consider the if<whitespace>(... pattern above)?
Edit: Changed returns to variable assignments as it would have not made sense otherwise.
You'd be far better off using (or writing) a parser than trying to do this with Regex.
Regex is great for some things, but for complex parsing like this, it sucks. Another example where it sucks, and which gets asked about a lot here, is parsing HTML - you can do it to a limited degree, but for anything complex a DOM parser is a much better solution.
For a [very] simple parser, what you need is a recursive function that searches for the braces { and }, recursing down a level each time it comes across an opening brace and returning back up a level when it finds a closing brace. It then needs to store the string contents between the two braces at each level.
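A minimal sketch of that recursive idea, just as an illustration (class and method names are made up, and it assumes the braces in the input are balanced; a real version would also capture the if parameters and the block positions):

import java.util.ArrayList;
import java.util.List;

public class BraceBlocks {

    private final String src;
    private int pos;
    private final List<String> blocks = new ArrayList<>();   // contents found at every level

    public BraceBlocks(String source) { this.src = source; }

    public List<String> collect() {
        parseLevel();
        return blocks;
    }

    // Scans the current level; recurses on '{', returns to the caller on '}'.
    private void parseLevel() {
        while (pos < src.length()) {
            char c = src.charAt(pos++);
            if (c == '{') {
                int start = pos;               // content starts right after the opening brace
                parseLevel();                  // go down one level
                blocks.add(src.substring(start, pos - 1));   // pos - 1 is the matching '}'
            } else if (c == '}') {
                return;                        // back up one level
            }
        }
    }

    public static void main(String[] args) {
        String code = "if (a) { if (b) { somevar = \"1\"; } somevar = \"2\"; } else { somevar = \"first\"; }";
        // Inner blocks are reported before the blocks that contain them.
        for (String block : new BraceBlocks(code).collect()) {
            System.out.println("[" + block.trim() + "]");
        }
    }
}

Note that a string or comment containing a stray { or } would confuse this; handling those cases is exactly where a real parser starts to pay off.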
A regular language won't work here, because a regular grammar can't match things like "some number of opening braces followed by the same number of matching closing braces". A context-free grammar is needed for that.
Unless you use a context-free grammar parser for Java or a regular expression extension that makes regular expressions no longer regular, your loop-based solution is probably the fanciest solution.
As per the above, you'll need a parser. One type that's easy to implement (and fun to write!) is a recursive descent parser with backtracking. There is also a plethora of parser generators out there, though most of those have a learning curve. One Java-friendly parser generator is JavaCC.
Sorry I couldn't think of a better title, but thanks for reading!
My ultimate goal is to read a .java file, parse it, and pull out every identifier, then store them all in a list. Two preconditions are that there are no comments in the file and that all identifiers are composed of letters only.
Right now I can read the file, split it on spaces, and store everything in a list. If anything in the list is a Java reserved word, it is removed. I also remove any loose symbols that are not attached to anything (brackets and arithmetic symbols).
Now I am left with a bunch of weird strings, but at least they have no spaces in them. I know I am going to have to re-split everything on the . delimiter in order to pull out identifiers like System.out.print, but what about strings like this example:
Logger.getLogger(MyHash.class.getName()).log(Level.SEVERE,
After re-splitting on . I will be left with more crazy strings, like:
getLogger(MyHash
getName())
log(Level
SEVERE,
How am I going to be able to pull out all the identifiers while leaving out all the trash? Just keep re-splitting on every symbol that can appear in Java code? That seems rather lame and time-consuming, and I'm not even sure it would work completely. So, can you suggest a better way of doing this?
There are several solutions that you can use, other than hacking together your own parser:
Use an existing parser, such as this one.
Use BCEL to read bytecode, which includes all fields and variables.
Hack into the compiler or run-time, using annotation processing or mirrors - I'm not sure you can find all identifiers this way, but fields and parameters for sure.
I wouldn't split the entire file at once on whitespace. Instead, I would scan the file character by character, saving every letter in a buffer until I'm sure an identifier has been reached.
In pseudo-code:
clear buffer
for each character c in the file:
    if c is a single quote (')
        toggle "character mode"
    if c is a double quote (")
        toggle "string mode"
    if c is a letter AND "character mode" is off AND "string mode" is off
        add c to the end of buffer
    else
        if buffer is NOT a keyword or a literal
            add buffer to the list of identifiers
        clear buffer
Notice that some lines here hide further complexity - for example, to check whether the buffer is a literal you need to check for true, false, and null.
In addition, there are more bugs in the pseudo-code - it will also identify things like the e and L parts of numeric literals (e in floating-point literals, L in long literals) as identifiers. I suggest adding additional "modes" to take care of them, but it's a bit tricky.
Also, there are a few more things to handle if you want to make sure it's accurate - for example, you have to make sure you work with Unicode. I would strongly recommend studying the lexical structure of the language, so you won't miss anything.
EDIT:
This solution can easily be extended to deal with identifiers with numbers, as well as with comments.
Small bug above - you need to handle \" differently from ", and the same goes for \' and '.
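As an illustration only, a rough Java translation of the pseudo-code above could look like this (the keyword/literal set is deliberately abbreviated, the \" and \' handling from the edit is folded in, and numeric-literal suffixes and Unicode identifiers are still ignored, as noted above):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IdentifierScanner {

    // Abbreviated set of keywords and literal words; the full list is in the JLS.
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
            "class", "public", "private", "static", "void", "new", "return",
            "if", "else", "for", "while", "int", "true", "false", "null"));

    public static List<String> scan(String source) {
        List<String> identifiers = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        boolean inString = false;   // "string mode"
        boolean inChar = false;     // "character mode"
        boolean escaped = false;    // handles \" and \' inside literals

        for (char c : source.toCharArray()) {
            if (inString || inChar) {
                if (escaped) { escaped = false; }
                else if (c == '\\') { escaped = true; }
                else if (inString && c == '"') { inString = false; }
                else if (inChar && c == '\'') { inChar = false; }
                continue;
            }
            if (c == '"') { inString = true; flush(buffer, identifiers); continue; }
            if (c == '\'') { inChar = true; flush(buffer, identifiers); continue; }
            if (Character.isLetter(c)) {
                buffer.append(c);               // still building an identifier
            } else {
                flush(buffer, identifiers);     // any other character ends it
            }
        }
        flush(buffer, identifiers);
        return identifiers;
    }

    private static void flush(StringBuilder buffer, List<String> identifiers) {
        if (buffer.length() > 0 && !RESERVED.contains(buffer.toString())) {
            identifiers.add(buffer.toString());
        }
        buffer.setLength(0);
    }

    public static void main(String[] args) {
        System.out.println(scan(
                "Logger.getLogger(MyHash.class.getName()).log(Level.SEVERE, \"oops\");"));
        // [Logger, getLogger, MyHash, getName, log, Level, SEVERE]
    }
}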
Wow, OK. Parsing is hard -- really hard -- to do right. Rolling your own Java parser is going to be incredibly difficult to get right. You'll find there are a lot of edge cases you're just not prepared for. To really do it right, and handle all the edge cases, you'll need to write a real parser. A real parser is composed of a number of things:
A lexical analyzer to break the input up into logical chunks
A grammar to determine how to interpret the aforementioned chunks
The actual "parser" which is generated from the grammar using a tool like ANTLR
A symbol table to store identifiers in
An abstract syntax tree to represent the code you've parsed
Once you have all that, you have a real parser. Of course you could skip the abstract syntax tree, but you need pretty much everything else. That leaves you writing about a third of a compiler. If you truly want to complete this project yourself, you should see if you can find a preexisting Java grammar definition for ANTLR. That will get you most of the way there, and then you'll need to use ANTLR to fill in your symbol table.
Alternately, you could go with the clever solutions suggested by Little Bobby Tables (awesome name, btw Bobby).