I'm trying to write a simple interactive language (using System.in as the source) with ANTLR, and I have a few problems with it. The examples I've found on the web all use a per-line cycle, e.g.:
while(readline)
    result = parse(line)
    doStuff(result)
But what if I'm writing something like Pascal/SMTP/etc., with a "first line must look like X" requirement? I know it can be checked in doStuff, but I think logically it is part of the syntax.
Or what if a command is split into multiple lines? I can try:
while(readline)
    lines.add(line)
    try
        result = parse(lines)
        lines = []
        doStuff(result)
    catch
        nop
But with this I'm also hiding real errors.
Or I could reparse all lines every time, but:
it will be slow
there are instructions I don't want to run twice
Can this be done with ANTLR, or if not, with something else?
Dutow wrote:
Or I could reparse all lines every time, but:
it will be slow
there are instructions I don't want to run twice
Can this be done with ANTLR, or if not, with something else?
Yes, ANTLR can do this. Perhaps not out of the box, but with a bit of custom code, it sure is possible. You also don't need to re-parse the entire token stream for it.
Let's say you want to parse a very simple language line by line, where each line is either a program declaration, a uses declaration, or a statement.
It should always start with a program declaration, followed by zero or more uses declarations followed by zero or more statements. uses declarations cannot come after statements and there can't be more than one program declaration.
For simplicity, a statement is just a simple assignment: a = 4 or b = a.
An ANTLR grammar for such a language could look like this:
grammar REPL;
parse
  : programDeclaration EOF
  | usesDeclaration EOF
  | statement EOF
  ;

programDeclaration
  : PROGRAM ID
  ;

usesDeclaration
  : USES idList
  ;

statement
  : ID '=' (INT | ID)
  ;

idList
  : ID (',' ID)*
  ;
PROGRAM : 'program';
USES : 'uses';
ID : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
INT : '0'..'9'+;
SPACE : (' ' | '\t' | '\r' | '\n') {skip();};
But we'll need to add a couple of checks, of course. Also, by default, a parser takes a token stream in its constructor, but since we're planning to trickle tokens into the parser line by line, we'll need to create a new constructor in our parser. You can add custom members to your lexer or parser classes by putting them in a @parser::members { ... } or @lexer::members { ... } section respectively. We'll also add a couple of boolean flags to keep track of whether the program declaration has happened already and whether uses declarations are allowed. Finally, we'll add a process(String source) method which, for each new line, creates a lexer that gets fed to the parser.
All of that would look like:
@parser::members {
  boolean programDeclDone;
  boolean usesDeclAllowed;

  public REPLParser() {
    super(null);
    programDeclDone = false;
    usesDeclAllowed = true;
  }

  public void process(String source) throws Exception {
    ANTLRStringStream in = new ANTLRStringStream(source);
    REPLLexer lexer = new REPLLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    super.setTokenStream(tokens);
    this.parse(); // the entry point of our parser
  }
}
Now inside our grammar, we're going to check, through a couple of gated semantic predicates, whether we're parsing declarations in the correct order. And after parsing a certain declaration or statement, we'll want to flip certain boolean flags to allow or disallow declarations from then on. The flipping of these flags is done in each rule's @after { ... } section, which gets executed (not surprisingly) after the tokens from that parser rule are matched.
Your final grammar file now looks like this (including some System.out.println's for debugging purposes):
grammar REPL;
@parser::members {
  boolean programDeclDone;
  boolean usesDeclAllowed;

  public REPLParser() {
    super(null);
    programDeclDone = false;
    usesDeclAllowed = true;
  }

  public void process(String source) throws Exception {
    ANTLRStringStream in = new ANTLRStringStream(source);
    REPLLexer lexer = new REPLLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    super.setTokenStream(tokens);
    this.parse();
  }
}
parse
  : programDeclaration EOF
  | {programDeclDone}? (usesDeclaration | statement) EOF
  ;

programDeclaration
@after {
  programDeclDone = true;
}
  : {!programDeclDone}? PROGRAM ID {System.out.println("\t\t\t program <- " + $ID.text);}
  ;

usesDeclaration
  : {usesDeclAllowed}? USES idList {System.out.println("\t\t\t uses <- " + $idList.text);}
  ;

statement
@after {
  usesDeclAllowed = false;
}
  : left=ID '=' right=(INT | ID) {System.out.println("\t\t\t " + $left.text + " <- " + $right.text);}
  ;

idList
  : ID (',' ID)*
  ;
PROGRAM : 'program';
USES : 'uses';
ID : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
INT : '0'..'9'+;
SPACE : (' ' | '\t' | '\r' | '\n') {skip();};
which can be tested with the following class:
import org.antlr.runtime.*;
import java.util.Scanner;

public class Main {
    public static void main(String[] args) throws Exception {
        Scanner keyboard = new Scanner(System.in);
        REPLParser parser = new REPLParser();
        while (true) {
            System.out.print("\n> ");
            String input = keyboard.nextLine();
            if (input.equals("quit")) {
                break;
            }
            parser.process(input);
        }
        System.out.println("\nBye!");
    }
}
To run this test class, do the following:
# generate a lexer and parser:
java -cp antlr-3.2.jar org.antlr.Tool REPL.g
# compile all .java source files:
javac -cp antlr-3.2.jar *.java
# run the main class on Windows:
java -cp .;antlr-3.2.jar Main
# or on Linux/Mac:
java -cp .:antlr-3.2.jar Main
As you can see, you can only declare a program once:
> program A
program <- A
> program B
line 1:0 rule programDeclaration failed predicate: {!programDeclDone}?
uses cannot come after statements:
> program X
program <- X
> uses a,b,c
uses <- a,b,c
> a = 666
a <- 666
> uses d,e
line 1:0 rule usesDeclaration failed predicate: {usesDeclAllowed}?
and you must start with a program declaration:
> uses foo
line 1:0 rule parse failed predicate: {programDeclDone}?
Here's an example of how to parse input from System.in without first manually parsing it one line at a time and without making major compromises in the grammar. I'm using ANTLR 3.4. ANTLR 4 may have addressed this problem already. I'm still using ANTLR 3, though, and maybe someone else with this problem still is too.
Before getting into the solution, here are the hurdles I ran into that keep this seemingly trivial problem from being easy to solve:
The built-in ANTLR classes that derive from CharStream consume the entire stream of data up-front. Obviously an interactive mode (or any other indeterminate-length stream source) can't provide all the data.
The built-in BufferedTokenStream and derived class(es) will not end on a skipped or off-channel token. In an interactive setting, this means that the current statement can't end (and therefore can't execute) until the first token of the next statement or EOF has been consumed when using one of these classes.
The end of the statement itself may be indeterminate until the next statement begins.
Consider a simple example:
statement: 'verb' 'noun' ('and' 'noun')*
;
WS: //etc...
Interactively parsing a single statement (and only a single statement) isn't possible. Either the next statement has to be started (that is, hitting "verb" in the input), or the grammar has to be modified to mark the end of the statement, e.g. with a ';'.
I haven't found a way to manage a multi-channel lexer with my solution. It doesn't hurt me since I can replace my $channel = HIDDEN with skip(), but it's still a limitation worth mentioning.
A grammar may need a new rule to simplify interactive parsing.
For example, my grammar's normal entry point is this rule:
script
: statement* EOF -> ^(STMTS statement*)
;
My interactive session can't start at the script rule because it won't end until EOF. But it can't start at statement either because STMTS might be used by my tree parser.
So I introduced the following rule specifically for an interactive session:
interactive
: statement -> ^(STMTS statement)
;
In my case, there are no "first line" rules, so I can't say how easy or hard it would be to do something similar for them. It may be a matter of making a rule like so and executing it at the beginning of the interactive session:
interactive_start
: first_line
;
The code behind a grammar (e.g., code that tracks symbols) may have been written under the assumption that the lifespan of the input and the lifespan of the parser object would effectively be the same. For my solution, that assumption doesn't hold. The parser gets replaced after each statement, so the new parser must be able to pick up the symbol tracking (or whatever) where the last one left off. This is a typical separation-of-concerns problem so I don't think there's much else to say about it.
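For what it's worth, the shape of that separation can be as simple as keeping the state in an object owned by the session loop and handing it to each fresh parser. Here is a minimal sketch; the names (SymbolTable, and the setSymbolTable call mentioned below) are hypothetical placeholders of mine, not ANTLR API:

import java.util.HashMap;
import java.util.Map;

// Lives as long as the interactive session, not as long as any single parser.
class SymbolTable {
    private final Map<String, Object> symbols = new HashMap<String, Object>();

    void define(String name, Object value) { symbols.put(name, value); }
    Object lookup(String name) { return symbols.get(name); }
}

The session loop shown later would then call something like parser.setSymbolTable(symbols) on each newly constructed parser before invoking the entry rule.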
The first problem mentioned, the limitations of the built-in CharStream classes, was my only major hang-up. ANTLRStringStream has all the workings that I need, so I derived my own CharStream class from it. The base class's data member is assumed to hold all the characters read so far, so I needed to override all the methods that access it. Then I changed the direct reads into calls to a new method, dataAt, which manages reading from the stream. That's basically all there is to it. Please note that the code here may have unnoticed problems and does no real error handling.
import java.io.IOException;
import java.io.InputStream;

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;

public class MyInputStream extends ANTLRStringStream {

    private InputStream in;

    public MyInputStream(InputStream in) {
        super(new char[0], 0);
        this.in = in;
    }

    @Override
    // copied almost verbatim from ANTLRStringStream
    public void consume() {
        if (p < n) {
            charPositionInLine++;
            if (dataAt(p) == '\n') {
                line++;
                charPositionInLine = 0;
            }
            p++;
        }
    }

    @Override
    // copied almost verbatim from ANTLRStringStream
    public int LA(int i) {
        if (i == 0) {
            return 0; // undefined
        }
        if (i < 0) {
            i++; // e.g., translate LA(-1) to use offset i=0; then data[p+0-1]
            if ((p + i - 1) < 0) {
                return CharStream.EOF; // invalid; no char before first char
            }
        }
        // Read ahead
        return dataAt(p + i - 1);
    }

    @Override
    public String substring(int start, int stop) {
        if (stop >= n) {
            // Read ahead.
            dataAt(stop);
        }
        return new String(data, start, stop - start + 1);
    }

    private int dataAt(int i) {
        ensureRead(i);
        if (i < n) {
            return data[i];
        } else {
            // Nothing to read at that point.
            return CharStream.EOF;
        }
    }

    private void ensureRead(int i) {
        if (i < n) {
            // The data has been read.
            return;
        }
        int distance = i - n + 1;
        ensureCapacity(n + distance);
        // Crude way to copy from the byte stream into the char array.
        for (int pos = 0; pos < distance; ++pos) {
            int read;
            try {
                read = in.read();
            } catch (IOException e) {
                // TODO handle this better.
                throw new RuntimeException(e);
            }
            if (read < 0) {
                break;
            } else {
                data[n++] = (char) read;
            }
        }
    }

    private void ensureCapacity(int capacity) {
        if (capacity > n) {
            char[] newData = new char[capacity];
            System.arraycopy(data, 0, newData, 0, n);
            data = newData;
        }
    }
}
Launching an interactive session is similar to the boilerplate parsing code, except that UnbufferedTokenStream is used and the parsing takes place in a loop:
MyLexer lex = new MyLexer(new MyInputStream(System.in));
TokenStream tokens = new UnbufferedTokenStream(lex);
//Handle "first line" parser rule(s) here.
while (true) {
    MyParser parser = new MyParser(tokens);
    //Set up the parser here.
    MyParser.interactive_return r = parser.interactive();
    //Do something with the return value.
    //Break on some meaningful condition.
}
Still with me? Okay, well that's it. :)
If you are using System.in as source, which is an input stream, why not just have ANTLR tokenize the input stream as it is read and then parse the tokens?
You have to put it in doStuff. For instance, if you're declaring a function, the parse would return a function, right? Without a body at first, and that's fine, because the body will come later. You'd do what most REPLs do.
Related
Let's say we have a Java file that looks like this:
class Something {
public static void main(String[] args){
System.out.println("Hello World!");
}
}
I would like to write some Kotlin code that would go through this Java file and detect how many lines there are in the method body (here it is the main method only). Empty lines are counted!
My approach is to simply use the File.forEachLine method to read the Java file line by line. I can write code to detect the method signature. Now I want to be able to determine where the method ends. I don't know how to do that.
If I simply look for "}" my code could make mistakes assuming that we are at the end of the body of the method while in reality we are at the end of an if statement body within the method.
How can I avoid this pitfall?
One way to approach this is keeping track of the number of open brackets ('{') and close brackets ('}') seen. At the start of the method, the count will increment to 1. Assuming the method is validly structured, at the end of the method the number of unclosed brackets should be 0. Pseudocode like this should work:
int numLines = 1 (assuming the method's start line counts)
int numBrackets = 1 (after finding the method's open bracket)
while (numBrackets > 0)
    if char = '{' -> numBrackets++
    if char = '}' -> numBrackets--
    if char = newline -> numLines++
if numBrackets not 0 -> FAIL
return numLines
Edit
As noted by Gidds below, this pseudocode is insufficient. A more complete answer needs to account for the fact that not all brackets affect method structure. One way to approach this is by keeping track of the context of the current character being parsed, and only incrementing/decrementing numBrackets when in a valid context (not inside a string literal, comment, etc.). Though, as noted by Gidds, this will increase complexity. Updated pseudocode:
int numLines = 1
int numValidBrackets = 1
Context context = Context(MethodStructure)
while (numValidBrackets > 0)
    context.acceptNextChar(char)
    if char = newline -> numLines++
    if context.state() != MethodStructure -> continue
    if char = '{' -> numValidBrackets++
    if char = '}' -> numValidBrackets--
if numValidBrackets not 0 -> FAIL
return numLines
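To make the context idea concrete, here is a minimal Java sketch of the brace counting with a simple hand-rolled context tracker. The class and method names are mine, and the literal/comment handling is deliberately simplified (no text blocks or unicode escapes); the same logic ports directly to Kotlin:

public class MethodLineCounter {

    // Counts lines from the method's opening brace to its matching closing
    // brace. "body" is assumed to start just after the opening '{'. Braces
    // inside string/char literals and comments are ignored.
    static int countBodyLines(String body) {
        int numLines = 1;    // the line holding the opening brace
        int numBrackets = 1; // the method's opening brace is already open
        boolean inString = false, inChar = false;
        boolean inLineComment = false, inBlockComment = false;

        for (int i = 0; i < body.length() && numBrackets > 0; i++) {
            char c = body.charAt(i);
            if (c == '\n') numLines++;

            if (inLineComment) {
                if (c == '\n') inLineComment = false;
            } else if (inBlockComment) {
                if (c == '*' && i + 1 < body.length() && body.charAt(i + 1) == '/') {
                    inBlockComment = false;
                    i++; // consume the '/'
                }
            } else if (inString || inChar) {
                if (c == '\\') i++; // skip the escaped character
                else if (inString && c == '"') inString = false;
                else if (inChar && c == '\'') inChar = false;
            } else {
                switch (c) {
                    case '"':  inString = true;  break;
                    case '\'': inChar = true;    break;
                    case '{':  numBrackets++;    break;
                    case '}':  numBrackets--;    break;
                    case '/':
                        if (i + 1 < body.length()) {
                            char next = body.charAt(i + 1);
                            if (next == '/') { inLineComment = true; i++; }
                            else if (next == '*') { inBlockComment = true; i++; }
                        }
                        break;
                }
            }
        }
        if (numBrackets != 0) throw new IllegalStateException("unbalanced braces");
        return numLines;
    }
}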
I am trying to write a method that will scan the input and return a String representing the lexeme found in the input string.
This is what I have so far, but I don't know if I'm going in the right direction. All help would be appreciated :)
private String scanNumbers(char input)
{
    String result = "";
    int value = in.read();
    if (value != -1)
    {
        if (isDigit(input))
        {
            result = Integer.toString(value);
        }
    }
    return result;
}
public static boolean isDigit(char input)
{
    return (input >= '0' && input <= '9');
}
Thank you I am new to parsing/lexemes/compilers.
Introduction
Questions that appear to be related to a homework exercise are often slow to be answered on SO. We often wait until the deadline has well passed!
You mention you are new to the topics of parsing/lexemes/compilers, and want some help in writing a Java method to scan the input and return a string representing the lexeme found in the input string. Later you clarify, indicating that you want a method that skips characters until it finds digits.
There is quite a bit of confusion in your question which produces conflicts in what you want to achieve.
It is not clear whether you want to learn about performing lexical analysis in Java as part of a larger compiler project, whether you only want to do it for numbers, whether you are looking for existing tools or methods that do this, or whether you are trying to learn how to program such methods yourself. And if you are programming it yourself, it is unclear whether you only need to know about reading a number, or whether this is just an example of the kind of thing you want to do.
Lexical Analysis
Lexical analysis, which is also known as scanning, is the process of reading a corpus of text which is composed of characters. This can be done for several purposes, such as data input, linguistic analysis of written material (such as word-frequency counting) or part of language compilation or interpretation. When done as part of compilation, it is one (and usually the first) of a sequence of phases that include parsing, semantic analysis, code generation, optimisation and so on. In the writing of compilers, code-generator tools are usually used, so if you wanted to write a compiler in Java, a Java lexer generator and a Java parser generator would often be used to create the Java code for those compiler components. Sometimes the lexer and parser are hand-written, but that is not a recommended task for a novice; it would take a compiler-writing specialist to build a better lexer and parser by hand than a tool-set produces. Sometimes, as a class exercise, students are asked to write code to perform a piece of lexical analysis to help them understand the process, but this is often for a few lexemes, like your digit exercise.
The term lexeme describes a sequence of characters that composes an individual entity recognised by a lexical analyser. Once recognised, it is usually represented by a token; the lexeme is therefore replaced by a token as part of the lexical analysis process. A lexical analyser will sometimes record the lexeme in a symbol table for later use before replacing it with the token. This is how identifiers in programs are often recorded in a compiler.
There are several tools for building lexers in Java. Two of the most common are JLex and JFlex. To illustrate how they work: to recognise an integer whilst skipping whitespace, we would use the following rules:
%%
WHITE_SPACE_CHAR=[\n\ \t\b\012]
DIGIT=[0-9]
%%
{WHITE_SPACE_CHAR}+ { }
{DIGIT}+ { return(new Yytoken(42,yytext(),yyline,yychar,yychar + yytext().length())); }
%%
which would be processed by the tool to produce Java methods to achieve that task.
The notations used to describe the lexemes are usually written as regular expressions. Computer science theory can help us with the programming of a lexical analyser: regular expressions can be represented by a form of finite-state automaton. There is a particular style of coding for matching lexemes that experienced programmers would recognise and use in this situation, which involves a switch inside a loop:
while ( ! eof ) {
    switch ( next_symbol() ) {
        case symbol:
            ...
            break;
        default:
            error(diagnostic);
            break;
    }
}
It is often these concepts that a simple lexical programming exercise is intended to introduce to students.
Tokenizing in Java
With all those preliminary explanations out of the way, let's get down to your piece of Java code. As mentioned in the comments, there is a difference in Java between reading bytes from an input stream and reading characters, since characters are in Unicode, where each char is represented by two bytes. You have used a byte read within a character-processing method.
Recognising simple tokens in an input stream, particularly for data entry, is such a common activity that Java has a specific built-in class for it called StreamTokenizer.
We could implement your task in the following way, for example:
// create a new tokenizer
Reader r = new BufferedReader(new InputStreamReader(System.in));
StreamTokenizer st = new StreamTokenizer(r);
st.eolIsSignificant(true); // report line ends, so TT_EOL can actually occur

// print the stream tokens
boolean eof = false;
do {
    int token = st.nextToken();
    switch (token) {
        case StreamTokenizer.TT_EOF:
            System.out.println("End of File encountered.");
            eof = true;
            break;
        case StreamTokenizer.TT_EOL:
            System.out.println("End of Line encountered.");
            break;
        case StreamTokenizer.TT_NUMBER:
            System.out.println("Number: " + st.nval);
            break;
        default:
            System.out.println((char) token + " encountered.");
            if (token == '!') {
                eof = true;
            }
    }
} while (!eof);
However, this does not return the string of the lexeme for a number, only matches the number and gets the value.
I see you have noticed the Java class java.util.Scanner, since your question has that as a tag. This is another class that can perform similar operations.
We could get an integer lexeme from the input like this:
Scanner s = new Scanner(System.in);
System.out.println(s.nextInt());
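Strictly speaking, nextInt() returns the parsed value rather than the lexeme. If you want the raw digit string itself, Scanner can also hand back the matched token; next(String pattern) is standard java.util.Scanner API:

import java.util.Scanner;

public class IntLexeme {
    public static void main(String[] args) {
        Scanner s = new Scanner(System.in);
        // Skips leading whitespace, then returns the digit string itself.
        String lexeme = s.next("[0-9]+");
        System.out.println(lexeme);
    }
}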
Solution
Finally, let's rewrite your original code to find the lexeme for an integer while skipping over unwanted characters, using Java regular-expression matching.
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Pattern;
public class ReadNumbers {

    static InputStreamReader in = null; // Have input source as a global
    static int value = -1;              // and the current input value

    public static void main(String[] args) {
        try {
            in = new InputStreamReader(System.in); // Set up the input
            value = in.read();                     // pre-fill the input state
            System.out.println(scanNumbers());
        } catch (Exception e) {
            e.printStackTrace(); // print error
        }
    }

    private static String scanNumbers() {
        String SkipCharacters = "\\s"; // Characters that can be skipped
        String result = "";            // empty string to store the lexeme
        try {
            // Skip optional characters before the number
            while ((value != -1) && Pattern.matches(SkipCharacters, "" + (char) value)) {
                value = in.read(); // pre-load the next character
            }
            // Now gather the number digits
            while ((value != -1) && isDigit((char) value)) {
                result = result + (char) value; // append digit character to result
                value = in.read();              // pre-load the next character
            }
        } finally {
            return result;
        }
    }

    public static boolean isDigit(char input) {
        return (input >= '0' && input <= '9');
    }
}
Afterword
The comment from @markspace is interesting and useful, as it points out that not all numbers are solely decimal digits.
Consider numbers in other bases, like hexadecimal. Java allows integer constants to be specified in number bases that use more than just the digits 0..9.
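For example, all four of these literals denote the same value (standard Java syntax; binary literals require Java 7 or later), and a lexer for them has to accept more than the characters 0..9:

int dec = 255;        // decimal
int hex = 0xFF;       // hexadecimal: digits 0-9 plus a-f
int oct = 0377;       // octal: marked by the leading zero
int bin = 0b11111111; // binary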
The JSpeech Grammar Format allows the user to specify tags for separate strings in curly brackets, as follows:
<jump> = jump { primitive jump } [up] |
jump [to the] (left { primitive jump_left } |right { primitive jump_right } );
or
<effects> = nothing happens { NOTHING_HAPPENS } | ( [will] die | dies ) { OBJECT_DESTRUCTION } | (get|gets) new (coin|coins) { COIN_INCREASE };
Using tags is described more thoroughly in section 4.6.1 of the referenced specification.
In Sphinx4 you can catch these tags using the getTags() method in RuleParse. So if the user says "jump to the left", the tag "primitive jump_left" will be returned.
Now I would like to do exactly the opposite: given the tag, I would like to match it to the string. So for "NOTHING_HAPPENS" I would like to get "nothing happens", or for "OBJECT_DESTRUCTION" an array with all possible options: "will die, die, dies".
Is there any such method that can parse grammar files in this way, or do I have to hardcode it?
My solution to this is to generate all possible sentences defined by the JSGF file. This can be done easily with the dumpRandomSentences or getRandomSentence methods provided by the Grammar class in Sphinx; the sentences can then be given back to the Recognizer, which will print out the tags.
Sample code from my project:
for (int i = 0; i < 20000; i++) {
    String utterance = grammar.getRandomSentence();
    String tags;
    try {
        tags = parser.getTagString(utterance);
        System.out.println(tags + " ==> " + utterance);
    } catch (GrammarException e) {
        error(e.toString());
    }
}
Hopefully there are a few experts in the EpochX framework around here...I'm not sure that the user group is still active.
I am attempting to implement simple recursion within their representation of a BNF grammar and have run into the following issue:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -9
at java.lang.String.substring(String.java:1911)
at org.epochx.epox.EpoxParser.parse(EpoxParser.java:235)
at org.epochx.epox.EpoxParser.parse(EpoxParser.java:254)
at org.epochx.tools.eval.EpoxInterpreter.eval(EpoxInterpreter.java:89)
at org.epochx.ge.model.epox.SAGE.getFitness(SAGE.java:266)
at org.epochx.ge.representation.GECandidateProgram.getFitness(GECandidateProgram.java:304)
at org.epochx.stats.StatField$7.getStatValue(StatField.java:97)
at org.epochx.stats.Stats.getStat(Stats.java:134)
at org.epochx.stats.StatField$8.getStatValue(StatField.java:117)
at org.epochx.stats.Stats.getStat(Stats.java:134)
at org.epochx.stats.Stats.getStats(Stats.java:162)
at org.epochx.stats.Stats.print(Stats.java:194)
at org.epochx.stats.Stats.print(Stats.java:178)
at org.epochx.ge.model.epox.Tester$1.onGenerationEnd(Tester.java:41)
at org.epochx.life.Life.fireGenerationEndEvent(Life.java:634)
at org.epochx.core.InitialisationManager.initialise(InitialisationManager.java:207)
at org.epochx.core.RunManager.run(RunManager.java:166)
at org.epochx.core.Model.run(Model.java:147)
at org.epochx.ge.model.GEModel.run(GEModel.java:82)
at org.epochx.ge.model.epox.Tester.main(Tester.java:55)
Java Result: 1
My simple grammar is structured as follows, where terminals are passed in separately to the evaluation function:
public static final String GRAMMAR_FRAGMENT = "<program> ::= <node>\n"
+ "<node> ::= <s_list>\n"
+ "<s_list> ::= <s> | <s> <s_list>\n"
+ "<s> ::= FUNCTION( <terminal> )\n"
+ "<terminal> ::= ";
Edit: Terminal creation -
// Generate the input sequences.
inputValues = BoolUtils.generateBoolSequences(4);
argNames = new String[4];
argNames[0] = "void";
argNames[1] = "bubbleSort";
argNames[2] = "int*";
argNames[3] = "numbers";
...
// Evaluate all possible inputValues.
for (final boolean[] vars : inputValues) {
    // Convert to object array.
    final Boolean[] objVars = ArrayUtils.toObject(vars);
    Boolean result = null;
    try {
        interpreter.eval(program.getSourceCode(), argNames, objVars);
        score = (double) program.getParseTreeDepth();
    } catch (final MalformedProgramException e) {
        // Assign worst possible fitness and stop evaluating.
        score = 0;
        break;
    }
}
The stack trace shows that the problem is actually in the EpoxParser; this means it's not so much the grammar that is ill-formed, but rather that the programs that get generated cannot be parsed.
Because you're using the EpoxInterpreter, the generated programs get parsed as Epox programs. Epox is the name used to refer to the language that the tree representation of EpochX uses (a sort of corrupted form of Lisp to which you can add your own literals/functions). The parsing expects S-expression format and tries to identify each function and terminal, building a tree made up of equivalent Node objects (see the org.epochx.epox.* packages). The tree can then be evaluated to run the program.
But in Epox there's no built-in function called FUNCTION, nor any known literals 'void', 'bubbleSort', 'int*' or 'numbers', so the parsing fails. You need to declare these constructs to the EpoxParser so it knows how to parse them into nodes. You can do this with the declareFunction, declareLiteral and declareVariable methods (see the JavaDoc for the EpoxParser: http://www.epochx.org/javadoc/1.4/).
I wrote this regex to parse entries from srt files.
(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$
I don't know if it matters, but this is done using the Scala programming language (the Java regex engine, but with literal strings so that I don't have to double the backslashes).
The \s{1,2} is used because some files will only have line breaks \n and others will have line breaks and carriage returns \n\r.
The first (?s) enables DOTALL mode so that the third capturing group can also match line breaks.
My program basically breaks an srt file using \n\r?\n as a delimiter and uses Scala's nice pattern-matching feature to read each entry for further processing:
val EntryRegex = """(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$""".r
def apply(string: String): Entry = string match {
case EntryRegex(start, end, text) => Entry(0, timeFormat.parse(start),
timeFormat.parse(end), text);
}
Sample entries:
One line:
1073
01:46:43,024 --> 01:46:45,015
I am your father.
Two Lines:
160
00:20:16,400 --> 00:20:19,312
<i>Help me, Obi-Wan Kenobi.
You're my only hope.</i>
The thing is, the profiler shows me that this parsing method is by far the most time consuming operation in my application (which does intensive time math and can even reencode the file several times faster than what it takes to read and parse the entries).
So any regex wizards can help me optimize it? Or maybe I should sacrifice regex / pattern matching succinctness and try an old school java.util.Scanner approach?
Cheers,
(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$
In Java, $ means the end of input or the beginning of a line-break immediately preceding the end of input. \z means unambiguously end of input, so if that is also the semantics in Scala, then \r?$ is redundant and $ would do just as well. If you really only want a CR at the end and not CRLF then \r?\z might be better.
The (?s) should also make the \r? in (.+)\r? redundant: since the + is greedy, the . will always expand to include the \r. If you do not want the \r included in that third capturing group, then make the match lazy: (.+?) instead of (.+).
Maybe
(?s)^\d++\s\s?(.{12}) --> (.{12})\s\s?(.+?)\r?\z
Other fine high-performance alternatives to regular expressions that will run inside a JVM and/or the CLR include JavaCC and ANTLR. For a Scala-only solution, see http://jim-mcbeath.blogspot.com/2008/09/scala-parser-combinators.html
I'm not optimistic, but here are two things to try:
move the (?s) to just before where you need it.
remove the \r?$ and use a greedy .++ for the text .+
^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(?s)(.++)$
To really get good performance, I would refactor the code and regex to use findAllIn. The current code runs the regex once for every Entry in your file. I imagine a single findAllIn pass would perform better... but maybe not...
Check this out:
(?m)^\d++\r?+\n(.{12}) --> (.{12})\r?+\n(.++(?>\r?+\n.++)*+)$
This regex matches a complete .srt file entry in place. You don't have to split the contents up on line breaks first; that's a huge waste of resources.
The regex takes advantage of the fact that there's exactly one line separator (\n or \r\n) separating the lines within an entry (multiple line separators are used to separate entries from each other). Using \r?+\n instead of \s{1,2} means you can never accidentally match two line separators (\n\n) when you only wanted to match one.
This way, too, you don't have to rely on the . in (?s) mode. @Jacob was right about that: it's not really helping you, and it's killing your performance. But (?m) mode is helpful, for correctness as well as performance.
You mentioned java.util.Scanner; this regex would work very nicely with findWithinHorizon(0). But I'd be surprised if Scala doesn't offer a nice, idiomatic way to use it as well.
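For illustration, here is a minimal Java sketch of that Scanner approach; the class name and the choice of System.in are mine, and findWithinHorizon(pattern, 0) simply means "search without a horizon bound":

import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class SrtScan {
    private static final Pattern ENTRY = Pattern.compile(
            "(?m)^\\d++\\r?+\\n(.{12}) --> (.{12})\\r?+\\n(.++(?>\\r?+\\n.++)*+)$");

    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        // Each successful call advances past one complete srt entry.
        while (sc.findWithinHorizon(ENTRY, 0) != null) {
            MatchResult m = sc.match();
            System.out.println(m.group(1) + " --> " + m.group(2)); // timestamps
            System.out.println(m.group(3));                        // subtitle text
        }
    }
}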
I wouldn't use java.util.Scanner or even strings. Everything you're doing will work perfectly on a byte stream as long as you can assume UTF-8 encoding of your files (or a lack of unicode). You should be able to speed things up by at least 5x.
Edit: this is just a lot of low-level fiddling with bytes and indices. Here's something based loosely on things I've done before, which seems about 2x-5x faster, depending on file size, caching, etc. I'm not doing the date parsing here, just returning strings, and I'm assuming the files are small enough to fit in a single block of memory (i.e. < 2 GB). This is rather pedantically careful; if you know, for example, that the date string format is always okay, then the parsing can be faster still (just count the characters after the first line of digits).
import java.io._

abstract class Entry {
  def isDefined: Boolean
  def date1: String
  def date2: String
  def text: String
}

case class ValidEntry(date1: String, date2: String, text: String) extends Entry {
  def isDefined = true
}

object NoEntry extends Entry {
  def isDefined = false
  def date1 = ""
  def date2 = ""
  def text = ""
}

final class Seeker(f: File) {
  private val buffer = {
    val buf = new Array[Byte](f.length.toInt)
    val fis = new FileInputStream(f)
    fis.read(buf)
    fis.close()
    buf
  }
  private var i = 0
  private var d1, d2 = 0
  private var txt, n = 0
  def isDig(b: Byte) = ('0': Byte) <= b && ('9': Byte) >= b
  def nextNL() {
    while (i < buffer.length && buffer(i) != '\n') i += 1
    i += 1
    if (i < buffer.length && buffer(i) == '\r') i += 1
  }
  def digits() = {
    val zero = i
    while (i < buffer.length && isDig(buffer(i))) i += 1
    if (i == zero || i >= buffer.length || buffer(i) != '\n') {
      nextNL()
      false
    }
    else {
      nextNL()
      true
    }
  }
  def dates(): Boolean = {
    if (i + 30 >= buffer.length) {
      i = buffer.length
      false
    }
    else {
      d1 = i
      while (i < d1 + 12 && buffer(i) != '\n') i += 1
      if (i < d1 + 12 || buffer(i) != ' ' || buffer(i + 1) != '-' || buffer(i + 2) != '-' || buffer(i + 3) != '>' || buffer(i + 4) != ' ') {
        nextNL()
        false
      }
      else {
        i += 5
        d2 = i
        while (i < d2 + 12 && buffer(i) != '\n') i += 1
        if (i < d2 + 12 || buffer(i) != '\n') {
          nextNL()
          false
        }
        else {
          nextNL()
          true
        }
      }
    }
  }
  def gatherText() {
    txt = i
    while (i < buffer.length && buffer(i) != '\n') {
      i += 1
      nextNL()
    }
    n = i - txt
    nextNL()
  }
  def getNext: Entry = {
    while (i < buffer.length) {
      if (digits()) {
        if (dates()) {
          gatherText()
          return ValidEntry(new String(buffer, d1, 12), new String(buffer, d2, 12), new String(buffer, txt, n))
        }
      }
    }
    return NoEntry
  }
}
Now that you see that, aren't you glad that the regex solution was so quick to code?