JavaCC generates code that fails while parsing UTF-8 strings

I have an older project, where a JavaCC grammar was used to generate classes to parse a custom language.
Now, several years later I have to adapt the grammar to add functionality (just a minor change).
This works, but when running all tests, I see I have a problem parsing UTF-8 characters.
I don't really have an idea what is causing this.
I reverted my change to the grammar and recreated the classes, but the problem remains.
As soon as I run javacc with the grammar and run my tests, the test with the UTF-8 characters fails.
This is the call I am using:
java -cp javacc-7.0.10.jar javacc -GRAMMAR_ENCODING=UTF-8 functionsGrammar.jj
I tried it with all major javacc versions from 4.x to 7.0.10; they all have the same problem.
I also tried this with different Java versions (6, 7, 8, 11), but that did not make any difference either.
Below you can find the relevant parts of the grammar:
options
{
JDK_VERSION = "1.6";
LOOKAHEAD= 2;
FORCE_LA_CHECK = true;
static = false;
}
TOKEN:
{
...
|< STRING : < QUOTES > (~["\"", "\\"])* ("\\"~[] (~["\"", "\\"])*)* < QUOTES > >
...}
TOKEN:
{
...
| < LIST :
< LCURLY_BRACE > < SPACES >
( < STRING > | < DATE > | < PARAMETER_FIELD_ID > | < PARAMETER_ELEMENT > | < NULL > )
( < COMMA > < SPACES >
( < STRING > | < DATE > | < PARAMETER_FIELD_ID > | < PARAMETER_ELEMENT > | < NULL > )
)*
...}
It fails for the string: "美丽的树" but works when changed to "slkdfj" for example.
I wonder if there are any options for JavaCC that I am missing? Or other java / javacc version combinations that might work?
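
One thing worth checking (an assumption, since the question does not show how the parser is constructed or how the tests read their input) is whether the text reaches the generated parser as correctly decoded characters. As far as I know, -GRAMMAR_ENCODING only describes the encoding of the .jj file itself, not of the parsed input. The sketch below uses only the JDK; the class name CharsetCheck and its method are hypothetical and simply show how the same bytes turn into different characters depending on the charset used to decode them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetCheck {
    // Decode raw bytes with an explicit charset, as a Reader in front of a parser would.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The quoted string from the question, written with Unicode escapes.
        String original = "\"\u7F8E\u4E3D\u7684\u6811\"";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String good = decode(utf8, StandardCharsets.UTF_8);      // round-trips unchanged
        String bad = decode(utf8, StandardCharsets.ISO_8859_1);  // one char per byte

        System.out.println("UTF-8 decode equals original: " + good.equals(original));
        System.out.println("chars seen with ISO-8859-1: " + bad.length()
                + " instead of " + original.length());
    }
}
```

If the wrong charset is in play, the lexer sees a different character sequence than the test expects. Generated JavaCC parsers normally offer constructors taking a java.io.Reader or an InputStream plus an encoding name, so passing the encoding explicitly may be worth a try; whether that is the cause here cannot be told from the question.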

Related

Why does ":" at the end of a Groovy statement not throw any error?

I mistakenly wrote the following in the Groovy console. Afterwards I realized that it should throw an error, but it did not. Why does Groovy not throw an error for a colon at the end of a statement? Is it reserved for documentation or something like that?
a:
String a
println a
This threw no error when I tried executing this code in https://groovyconsole.appspot.com/

It's a label, just like it would be in Java. For example:
a:
for (int i = 0; i < 10; i++)
{
    String a = "hello"
    println a
    break a; // This refers to the label before the loop
}
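
The Java side of that comparison can be sketched as follows. Labels are most useful in Java for leaving nested loops; the class LabelDemo and the method findPair are made-up names for illustration only.

```java
class LabelDemo {
    // Returns the first pair (i, j) with i * j == target, or null if there is none.
    static int[] findPair(int target, int limit) {
        int[] found = null;
        outer: // label attached to the outer loop
        for (int i = 1; i <= limit; i++) {
            for (int j = 1; j <= limit; j++) {
                if (i * j == target) {
                    found = new int[] { i, j };
                    break outer; // leaves both loops, not just the inner one
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        int[] pair = findPair(12, 10);
        System.out.println(pair[0] + " x " + pair[1]);
    }
}
```

A label that is never referenced by break or continue, as in the question, is simply ignored.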

One good use of labels in Groovy I can think of is the Spock Framework, where they are used for clauses:
def 'test emailToNamespace'() {
given:
Partner.metaClass.'static'.countByNamespaceLike = { count }
expect:
Partner.emailToNamespace( email ) == res
where:
email | res | count
'aaa.com' | 'com.aaa' | 0
'aaa.com' | 'com.aaa1' | 1
}

Warning in Regular Expression - JavaCC

I've this code in my JavaCC parser:
< VARIABILE : "§" < LETTERA > ( < CIFRA > | < LETTERA > )* >
< TERMINE: ( < NUM_SEGNO > | < VARIABILE > | "(-" < VARIABILE > ")" ) >
I get this error when compiling
Regular Expression choice : VARIABILE can never be matched as : TERMINE
How can I fix this?

In your production for TERMINE, the second alternative is useless; you might as well write
< TERMINE: ( < NUM_SEGNO > | "(-" < VARIABILE > ")" ) >
That's what the error message is telling you. Why is it useless? JavaCC's regular expressions obey the three rules explained in FAQ 3.3. Go read about them, before you read further. ... Ok you're back. You now should understand that, if the longest prefix of the input that matches any rule matches the rule for < VARIABILE > (and therefore also your rule for <TERMINE>), then the rule for < VARIABILE > will beat the rule for < TERMINE > by virtue of being first in the .jj file.
What to do to fix this depends on what you want to achieve. My guess is that you should move the choice up to the parser level. I.e. delete the rule for TERMINE and replace it with a syntactic rule
void Termine() : {} {
    <NUM_SEGNO>
|
    <VARIABILE>
|
    "(-" <VARIABILE> ")"
}
For other possibilities, see FAQ 3.6 and FAQ 4.19.
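
The tie-breaking behaviour described above can be imitated in plain Java. This is only a rough simulation, not JavaCC's generated code: the rules are stand-in java.util.regex patterns, and the definitions assumed for LETTERA (a lowercase letter), CIFRA (a digit) and NUM_SEGNO (an optionally signed integer) are guesses, since the question does not show them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchDemo {
    // Token rules in declaration order (illustrative stand-ins for the .jj rules).
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        String variabile = "\u00A7[a-z][a-z0-9]*"; // "§" LETTERA (CIFRA | LETTERA)*
        RULES.put("VARIABILE", Pattern.compile(variabile));
        RULES.put("TERMINE", Pattern.compile(
                "[+-]?[0-9]+|" + variabile + "|\\(-" + variabile + "\\)"));
    }

    // Pick the rule with the longest match at the start of the input;
    // on a tie the rule declared first wins.
    static String winner(String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // Strictly longer only, so an equal-length later rule never replaces an earlier one.
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(winner("\u00A7x1"));     // both rules match 3 chars
        System.out.println(winner("(-\u00A7x1)"));  // only TERMINE matches
    }
}
```

Whenever the input is a plain variable, both rules match the same length and the earlier VARIABILE rule wins, which is why that alternative of TERMINE can never be chosen.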

How can I find similar matches of a word from a database in Java?

My current programming project is a sort of French dictionary in Java (using SQLite). I was wondering what would happen if someone wanted to find the present tense for "avoir" but typed in "avior", and how I would handle it. So I thought I could implement some sort of closest match / "did you mean" functionality. So my question is:
Is there a way to use the database to search for similar matches?
When I made the same program in Python a while back (using XML instead) I used the system below, but it wasn't very effective: it required a large error margin to be somewhat useful, and it then suggested words with no relevance. Something similar could still be useful nevertheless.
def getSimilar(self, word, Return = False):
    matches = list()
    for verb in self.data.getElementsByTagName("Verb"):
        for x in range(16):
            if x % 2 != 0 and x>0:
                if (x == 15 or x == 3 or x == 1):
                    part = Dict(self.data).removeBrackets(Dict(self.data).getAccents(verb.childNodes[x].childNodes[0].data))
                    diff = 0
                    for char in word:
                        if (not char in part):
                            diff += 1
                    if (diff < self.similarityValue) and (-self.errorAllowance <= len(part) - len(word) <= self.errorAllowance):
                        matches.append(part)
                else:
                    for y in range(14):
                        if (y % 2 != 0 and y>0):
                            part = Dict(self.data).getAccents(verb.childNodes[x].childNodes[y].childNodes[0].data)
                            diff = 0
                            for char in word:
                                if (not char in part):
                                    diff += 1
                            if (diff < self.similarityValue) and (-self.errorAllowance <= len(part) - len(word) <= self.errorAllowance):
                                matches.append(part)
    if not Return:
        for match in matches:
            print "Did you mean '" + match + "'?"
    if Return: return matches
Any help is welcomed!
Jamie

Try using the SQLite Levenshtein extension:
https://github.com/mateusza/SQLite-Levenshtein
It works quite well.
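
For comparison, the same idea can be applied on the Java side after fetching candidate words from the database. This is a minimal sketch of plain Levenshtein distance, not the code of the extension linked above; the names DidYouMean, levenshtein and suggest, and the distance threshold, are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DidYouMean {
    // Classic dynamic-programming edit distance (insert, delete, substitute), two rows at a time.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keep the words whose distance to the typed word is at most maxDistance.
    static List<String> suggest(String typed, List<String> words, int maxDistance) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (levenshtein(typed, word) <= maxDistance) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> verbs = Arrays.asList("avoir", "aller", "finir");
        System.out.println(suggest("avior", verbs, 2));
    }
}
```

Note that swapping two adjacent letters, as in "avior", counts as two edits under plain Levenshtein distance, so a threshold of 1 would miss it.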

Java simple line parser

I can see a bunch of Java parsers like OpenCSV, ANTLR, JSaPar etc., but I don't see any with the ability to specify both a custom line separator and a custom column separator. Are there any easy-to-use libraries for this? I don't want to write one using Scanner or StringTokenizer now!
E.g. A | B || C | D || E | F
I want to break the string above into something like {{A,B},{C,D},{E,F}}

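You can parse it yourself; it's quite simple to achieve. I haven't tested this code, so try it yourself. Note that String.split takes a regular expression and "|" is a regex metacharacter, so the delimiters have to be quoted.
// needs: import java.util.regex.Pattern;
// str holds the input, e.g. "A | B || C | D || E | F"
String lineDelimiter = "||";
String columnDelimiter = "|";
// Quote the delimiters so they are matched literally, not as regex alternation
String[] rows = str.split(Pattern.quote(lineDelimiter));
for (int i = 0; i < rows.length; i++) {
    String[] columns = rows[i].split(Pattern.quote(columnDelimiter));
    for (int j = 0; j < columns.length; j++) {
        String cell = columns[j].trim(); // drop the spaces around each value
        // Do something with your data here
    }
}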

Actually, with JSaPar you can have any character sequence for both line delimiter as well as cell delimiter. You specify which delimiter to use within your schema and it can be any number of characters.
The problem you will face by using the same character in both is that the parser will not know if you have a line break or if it is just an empty cell.
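
The ambiguity can be shown with a small sketch in plain Java. It does not use JSaPar; DelimiterDemo and parse are hypothetical names, and the splitting strategy (split on the line delimiter first, then on the cell delimiter) is just one simple choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class DelimiterDemo {
    // Split the input into rows, then each row into trimmed cells.
    static List<List<String>> parse(String input, String lineDelimiter, String cellDelimiter) {
        List<List<String>> rows = new ArrayList<>();
        // Pattern.quote makes split() treat the delimiter literally;
        // the limit -1 keeps trailing empty strings that split() would otherwise drop.
        for (String line : input.split(Pattern.quote(lineDelimiter), -1)) {
            List<String> cells = new ArrayList<>();
            for (String cell : line.split(Pattern.quote(cellDelimiter), -1)) {
                cells.add(cell.trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(parse("A | B || C | D || E | F", "||", "|"));
        // With "||" as the line delimiter, "A||B" can only be read as two rows...
        System.out.println(parse("A||B", "||", "|"));
        // ...while with a different line delimiter it is one row with an empty middle cell.
        System.out.println(parse("A||B", "\n", "|"));
    }
}
```

With "||" as the line delimiter there is no way to write a row containing an empty cell, which is the problem described above.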

Can this regex be further optimized?

I wrote this regex to parse entries from srt files.
(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$
I don't know if it matters, but this is done using Scala programming language (Java Engine, but literal strings so that I don't have to double the backslashes).
The \s{1,2} is used because some files have only line feeds (\n) as line breaks and others have carriage returns followed by line feeds (\r\n).
The first (?s) enables DOTALL mode so that the third capturing group can also match line breaks.
My program basically breaks an srt file using \n\r?\n as a delimiter and uses Scala's nice pattern matching feature to read each entry for further processing:
val EntryRegex = """(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$""".r
def apply(string: String): Entry = string match {
case EntryRegex(start, end, text) => Entry(0, timeFormat.parse(start),
timeFormat.parse(end), text);
}
Sample entries:
One line:
1073
01:46:43,024 --> 01:46:45,015
I am your father.
Two Lines:
160
00:20:16,400 --> 00:20:19,312
<i>Help me, Obi-Wan Kenobi.
You're my only hope.</i>
The thing is, the profiler shows me that this parsing method is by far the most time consuming operation in my application (which does intensive time math and can even reencode the file several times faster than what it takes to read and parse the entries).
Can any regex wizards help me optimize it? Or maybe I should sacrifice the succinctness of regex / pattern matching and try an old-school java.util.Scanner approach?
Cheers,

(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$
In Java, $ means the end of input or the beginning of a line-break immediately preceding the end of input. \z means unambiguously end of input, so if that is also the semantics in Scala, then \r?$ is redundant and $ would do just as well. If you really only want a CR at the end and not CRLF then \r?\z might be better.
The (?s) should also make the \r? after (.+) redundant: since the + is greedy and . matches everything in DOTALL mode, (.+) will always expand to include the \r. If you do not want the \r included in that third capturing group, then make the match lazy: (.+?) instead of (.+).
Maybe
(?s)^\d++\s\s?(.{12}) --> (.{12})\s\s?(.+?)\r?\z
Other fine high-performance alternatives to regular expressions that will run inside a JVM and/or the CLR include JavaCC and ANTLR. For a Scala-only solution, see http://jim-mcbeath.blogspot.com/2008/09/scala-parser-combinators.html
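
The difference between $ and \z described in this answer can be checked directly with java.util.regex, which is the engine Scala uses on the JVM. The class EndAnchorDemo and the helper found are made-up names for this sketch.

```java
import java.util.regex.Pattern;

class EndAnchorDemo {
    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String entry = "I am your father.\n";
        // $ also matches just before a final line terminator
        System.out.println(found("father\\.$", entry));
        // \z matches only at the very end of the input
        System.out.println(found("father\\.\\z", entry));
        // an optional trailing \r has to be allowed explicitly before \z
        System.out.println(found("father\\.\\r?\\z", "I am your father.\r"));
    }
}
```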

I'm not optimistic, but here are two things to try:
Move the (?s) to just before you need it.
Remove the \r? before the $ and use a possessive .++ for the text instead of the greedy .+
^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(?s)(.++)$
To really get good performance, I would refactor the code and regex to use findAllIn. The current code is doing a regex for every Entry in your file. I imagine the single findAllIn regex would perform better...But maybe not...

Check this out:
(?m)^\d++\r?+\n(.{12}) --> (.{12})\r?+\n(.++(?>\r?+\n.++)*+)$
This regex matches a complete .srt file entry in place. You don't have to split the contents up on line breaks first; that's a huge waste of resources.
The regex takes advantage of the fact that there's exactly one line separator (\n or \r\n) separating the lines within an entry (multiple line separators are used to separate entries from each other). Using \r?+\n instead of \s{1,2} means you can never accidentally match two line separators (\n\n) when you only wanted to match one.
This way, too, you don't have to rely on the . in (?s) mode. @Jacob was right about that: it's not really helping you, and it's killing your performance. But (?m) mode is helpful, for correctness as well as performance.
You mentioned java.util.Scanner; this regex would work very nicely with findWithinHorizon(0). But I'd be surprised if Scala doesn't offer a nice, idiomatic way to use it as well.
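
A minimal Java sketch of applying the regex above to the whole file content in one pass, using Matcher.find() in a loop. The class SrtScan and the method entries are hypothetical names; reading the file into a String and parsing the timestamps are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrtScan {
    // The regex from this answer, with backslashes doubled for a Java string literal.
    static final Pattern ENTRY = Pattern.compile(
            "(?m)^\\d++\\r?+\\n(.{12}) --> (.{12})\\r?+\\n(.++(?>\\r?+\\n.++)*+)$");

    // Returns one {start, end, text} triple per entry found in the content.
    static List<String[]> entries(String content) {
        List<String[]> result = new ArrayList<>();
        Matcher m = ENTRY.matcher(content);
        while (m.find()) { // each find() continues where the previous match ended
            result.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return result;
    }

    public static void main(String[] args) {
        String content =
                "1073\n01:46:43,024 --> 01:46:45,015\nI am your father.\n\n"
              + "160\n00:20:16,400 --> 00:20:19,312\n<i>Help me, Obi-Wan Kenobi.\n"
              + "You're my only hope.</i>\n";
        for (String[] e : entries(content)) {
            System.out.println(e[0] + " / " + e[1] + " / " + e[2].replace("\n", " | "));
        }
    }
}
```

Because the pattern is compiled once and the input is never split into substrings first, the per-entry overhead of the original approach is avoided.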

I wouldn't use java.util.Scanner or even strings. Everything you're doing will work perfectly on a byte stream as long as you can assume UTF-8 encoding of your files (or a lack of Unicode). You should be able to speed things up by at least 5x.
Edit: this is just a lot of low-level fiddling of bytes and indices. Here's something based loosely on things I've done before, which seems about 2x-5x faster, depending on file size, caching, etc. I'm not doing the date parsing here, just returning strings, and I'm assuming the files are small enough to fit in a single block of memory (i.e. <2G). This is being rather pedantically careful; if you know, for example, that the date string format is always okay, then the parsing can be faster yet (just count the characters after the first line of digits).
import java.io._
abstract class Entry {
  def isDefined: Boolean
  def date1: String
  def date2: String
  def text: String
}
case class ValidEntry(date1: String, date2: String, text: String) extends Entry {
  def isDefined = true
}
object NoEntry extends Entry {
  def isDefined = false
  def date1 = ""
  def date2 = ""
  def text = ""
}
final class Seeker(f: File) {
  private val buffer = {
    val buf = new Array[Byte](f.length.toInt)
    val fis = new FileInputStream(f)
    fis.read(buf)
    fis.close()
    buf
  }
  private var i = 0
  private var d1,d2 = 0
  private var txt,n = 0
  def isDig(b: Byte) = ('0':Byte) <= b && ('9':Byte) >= b
  def nextNL() {
    while (i < buffer.length && buffer(i) != '\n') i += 1
    i += 1
    if (i < buffer.length && buffer(i) == '\r') i += 1
  }
  def digits() = {
    val zero = i
    while (i < buffer.length && isDig(buffer(i))) i += 1
    if (i==zero || i >= buffer.length || buffer(i) != '\n') {
      nextNL()
      false
    }
    else {
      nextNL()
      true
    }
  }
  def dates(): Boolean = {
    if (i+30 >= buffer.length) {
      i = buffer.length
      false
    }
    else {
      d1 = i
      while (i < d1+12 && buffer(i) != '\n') i += 1
      if (i < d1+12 || buffer(i)!=' ' || buffer(i+1)!='-' || buffer(i+2)!='-' || buffer(i+3)!='>' || buffer(i+4)!=' ') {
        nextNL()
        false
      }
      else {
        i += 5
        d2 = i
        while (i < d2+12 && buffer(i) != '\n') i += 1
        if (i < d2+12 || buffer(i) != '\n') {
          nextNL()
          false
        }
        else {
          nextNL()
          true
        }
      }
    }
  }
  def gatherText() {
    txt = i
    while (i < buffer.length && buffer(i) != '\n') {
      i += 1
      nextNL()
    }
    n = i-txt
    nextNL()
  }
  def getNext: Entry = {
    while (i < buffer.length) {
      if (digits()) {
        if (dates()) {
          gatherText()
          return ValidEntry(new String(buffer,d1,12), new String(buffer,d2,12), new String(buffer,txt,n))
        }
      }
    }
    return NoEntry
  }
}
Now that you see that, aren't you glad that the regex solution was so quick to code?
