I was wondering how it could be possible to format, in a human-readable way, a ParseException thrown by JavaCC: it includes fields such as beginLine, beginColumn, endColumn, and endLine in the token reference of the exception, but no reference to the source that was parsed.
Thanks! :)
The problem is that, by default, JavaCC doesn't retain the raw source data. So unless you keep a reference to the tokens somehow, they're not held in memory. And even if you did hang onto all the regular tokens, you'd need to add special handling for any SKIP tokens that you'd defined - e.g., for discarding whitespace and comments. The reason JavaCC doesn't retain all this stuff is that it would use a lot more memory.
Keeping all the token images is definitely doable... just takes some semi-manual intervention.
I don't know if it is enough, but you can use the currentToken field of the caught ParseException object:
try {
    parser.Start();
} catch (ParseException e) {
    System.out.println("Problem with code!");
    System.out.println("Unknown symbol >> "
            + e.currentToken.image
            + " << Line: " + e.currentToken.beginLine
            + ", column: " + e.currentToken.beginColumn);
    // e.printStackTrace();
}
Just keep the filename before calling the parser. Then when you catch ParseException, reread the file and, using beginLine, skip to the right line. Alternatively, instead of the filename, keep the original source text yourself.
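For illustration, a minimal sketch of that approach (assuming Java 11+ for String.repeat; printErrorContext is a made-up helper name, and it relies on JavaCC's convention that currentToken is the last successfully consumed token, so currentToken.next is the offender):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: print the offending line and a caret under the bad token.
static void printErrorContext(String fileName, ParseException e) throws IOException {
    Token t = e.currentToken.next; // the token that triggered the error
    List<String> lines = Files.readAllLines(Paths.get(fileName));
    System.err.println(lines.get(t.beginLine - 1)); // beginLine is 1-based
    System.err.println(" ".repeat(t.beginColumn - 1) + "^"); // so is beginColumn
}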
I recently used JavaCC and did exactly that. I also had to handle include-like files recursively, so I had the parser build a stack of included source files. Upon catching a ParseException, it was a simple matter to walk the stack so the user could see the context (i.e., the line number in the parent) where each file was included.
I am trying to download a web page with all its resources. First I download the HTML, and to keep the file formatted I use the function below.
There is an issue: I found the characters '10' in the final file, and 10 turns out to be the decimal character code of LF (the line-feed escape). This breaks my JavaScript functions.
Example of the final result :
<!DOCTYPE html>10<html lang="fr">10 <head>10 <meta http-equiv="content-type" content="text/html; charset=UTF-8" />10
Can someone help me find the real issue?
public static String scanfile(File file) {
    StringBuilder sb = new StringBuilder();
    try {
        BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
        while (true) {
            String readLine = bufferedReader.readLine();
            if (readLine != null) {
                sb.append(readLine);
                sb.append(System.lineSeparator());
                Log.i(TAG, sb.toString());
            } else {
                bufferedReader.close();
                return sb.toString();
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}
There are multiple problems with your code.
Charset error
BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
This isn't going to work, and it fails in subtle ways.
Files (and, for that matter, data given to you by webservers) come in bytes: a stream of numbers, each number being between 0 and 255.
So, if you are a webserver and you want to send the character ö, what byte(s) do you send?
The answer is complicated. The mapping that explains how some character is rendered in byte(s)-form is called a character set encoding (shortened to 'charset').
Anytime bytes are turned into characters or vice versa, there is always a charset involved. Always.
So, you're reading a file (that'd be bytes), and turning it into a Reader (which is chars). Thus, charset is involved.
Which charset? The API of new FileReader(path) explains which one: "The system default". You do not want that.
Thus, this code is broken. You want one of two things:
Option 1 - write the data as is
When doing the job of querying the webserver for the data and relaying this information onto disk, you'd want to just store the bytes (after all, webserver gives bytes, and disks store bytes, that's easy), but the webserver also sends the encoding, in a header, and you need to save this separately. Because to read that 'sack of bytes', you need to know the charset to turn it into characters.
How would you do this? Well, it's up to you. You could, for example, decree that the data file starts with the name of a charset encoding (as sent via that header), then a 0 byte, and then the data, unmodified. I think you should go with option 2, however.
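For what it's worth, a sketch of that made-up layout (the saveRaw helper name and the layout itself are inventions of this answer, not a standard):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Store the charset name (from the Content-Type header), a 0 byte, then the raw body.
static void saveRaw(String path, String charsetName, byte[] body) throws IOException {
    try (OutputStream out = Files.newOutputStream(Paths.get(path))) {
        out.write(charsetName.getBytes(StandardCharsets.US_ASCII)); // charset names are ASCII
        out.write(0); // separator byte
        out.write(body); // the payload, byte-for-byte as the webserver sent it
    }
}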
Option 2
Another, better option for text-based documents (which HTML is), is this: When reading the data, convert it to characters, using the encoding as that header tells you. Then, to save it to disk, turn the chars back to bytes, using UTF-8, which is a great encoding and an industry standard. That way, when reading, you just know it's UTF-8, period.
To read a UTF-8 text file, you do:
Files.newBufferedReader(Paths.get(file));
The reason this works, is that the Files API, unlike most other APIs (and unlike FileReader, which you should never ever use), defaults to UTF_8 and not to platform-default. If you want, you can make it more readable:
Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8);
Same thing, but now it is clear in the code what's happening.
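The saving half of option 2 could then look like this sketch, assuming you already pulled the charset name out of the server's Content-Type header (saveAsUtf8 is a made-up helper name):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Decode the server's bytes with the charset it declared, re-encode as UTF-8 on disk.
static void saveAsUtf8(String path, byte[] body, String serverCharset) throws IOException {
    String text = new String(body, Charset.forName(serverCharset));
    Files.write(Paths.get(path), text.getBytes(StandardCharsets.UTF_8));
}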
Broken exception handling
} catch (IOException e) {
    e.printStackTrace();
    return null;
}
This is not okay: if you catch an exception, either [A] throw something else, or [B] handle the problem. 'Log it and keep going' is definitely not handling it. This strategy of exception handling means one error results in a thousand things going wrong with a thousand stack traces, all of them except the first undesired and irrelevant. That is why this is horrible code and you should never write it this way.
The easy solution is to just put throws IOException on your scanFile method. The method inherently interacts with files, it SHOULD be throwing that. Note that your psv main(String[] args) method can, and usually should, be declared to throws Exception.
It also makes your code simpler and shorter, yay!
Resource Management failure
A FileReader is a resource. You MUST close it, no matter what happens. You are not doing that: if .readLine() throws an exception, your code jumps to the catch handler and bufferedReader.close() is never executed.
The solution is to use the ARM (Automatic Resource Management) construct, also known as try-with-resources:
try (var br = Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8)) {
    // code goes here
}
This construct ensures that close() is invoked, regardless of how the 'code goes here' block exits. Even if it 'exits' via an exception or a return statement.
The problem
Aside from the three items above, your 'read a file and print it' code is mostly fine. The problem is that the HTML file on disk is corrupted; the error lies in your code that reads the data from the web server and saves it to disk. You did not paste that code.
Specifically, System.lineSeparator() returns the actual separator string ("\n" or "\r\n"), not the number 10. Thus, assuming the code you pasted really is the code you are running, if you are seeing an actual '10' show up, the HTML file on disk already has it in there. It's not the reading code.
Closing thoughts
More generally the job of 'just print a file on disk with a known encoding' can be done in far fewer lines of code:
public static String scanFile(String path) throws IOException {
    return Files.readString(Paths.get(path));
}
You should just use the above code instead. It's simple, short, doesn't have any bugs, cannot leak resources, has proper exception handling, and will use UTF-8.
Actually, there is no problem in this function; I was mistakenly adding the 10 in another function in my code.
I have a Java class that parses an XML file and writes its content to MySQL. Everything works fine, but the problem is that when the XML file contains invalid Unicode characters, an exception is thrown and the program stops parsing the file.
My provider sends this xml file on a daily basis with a list of products with its price, quantity etc. and I have no control over this, so invalid characters will always be there.
All I'm trying to do is to catch these errors, ignore them and continue parsing the rest of the xml file.
I've added a try-catch statements on the startElement, endElement and characters methods of the SAXHandler class, however, they don't catch any exception and the execution stops whenever the parser finds an invalid character.
It seems that I can only catch these exceptions from the function who calls the parser:
try {
    myIS = new FileInputStream(xmlFilePath);
    parser.parse(myIS, handler);
    retValue = true;
} catch (SAXParseException err) {
    System.out.println("SAXParseException " + err);
}
However, that's useless in my case: even if the exception tells me where the invalid character is, the execution stops, so the list of products is far from complete. The list has about 8,000 products and only a couple of invalid characters; however, if an invalid character appears within the first 100 products, then the remaining 7,900 products are not updated in the database. I've also noticed that the endDocument method is not called if an exception occurs.
Somebody asked the same question here some years ago, but didn't get any solution.
I'd really appreciate any ideas or workarounds for this.
Data Sample (as requested):
<Producto>
<Brand>
<Description>Epson</Description>
<ManufacturerId>eps</ManufacturerId>
<BrandId>eps</BrandId>
</Brand>
<New>false</New>
<OnSale>null</OnSale>
<Type>Physical</Type>
<Description>Epson TM T88V - Impresora de recibos - línea térmica - rollo 8 cm - hasta 300 mm/segundo - paralelo, USB</Description>
<Category>
<CategoryId>pos</CategoryId>
<Description>Puntos de Venta</Description>
<Subcategories>
<CategoryId>pos.printer</CategoryId>
<Description>Impresoras para Recibos</Description>
</Subcategories>
</Category>
<InStock>0</InStock>
<Price>
<UnitPrice>4865.6042</UnitPrice>
<CurrencyId>MXN</CurrencyId>
</Price>
<Manufacturer>
<Description>Epson</Description>
<ManufacturerId>eps</ManufacturerId>
</Manufacturer>
<Mpn>C31CA85814</Mpn>
<Sku>PT910EPS27</Sku>
<CompilationDate>2020-02-25T12:30:14.6607135Z</CompilationDate>
</Producto>
The XML philosophy is that you don't process bad data. If it's not well-formed XML, the parser is supposed to give up, and user applications are supposed to give up. Culturally, this is a reaction against the HTML culture, where it was found that if it's generally expected that data users will tolerate bad data, the consequence is that suppliers will produce bad data.
Standards deliver cost reduction because you can use readily available off-the-shelf tools both for creating valid data and for reading it at the other end. The benefits are totally neutralised if you decide you're going to interchange things that are almost XML but not quite. If you were downloading software you wouldn't put up with it if it didn't compile. So why are you prepared to put up with bad data? Send it back and demand a refund.
Having said that, if the problem is "invalid Unicode characters" then it's possible that it started out as good XML and got corrupted in transit. Find out what went wrong and get it fixed as close to the source of the problem as you can.
I solved it by removing the invalid characters from the XML file before processing it.
I couldn't do what I was originally trying to do (catch the error and continue), but this workaround worked.
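In case it helps someone else, here is a sketch of such a filtering step; it assumes XML 1.0's definition of valid characters, and stripInvalidXmlChars is a made-up helper name rather than the exact code I used:

// Remove characters that are not valid in XML 1.0 before handing the text to the parser.
static String stripInvalidXmlChars(String in) {
    StringBuilder out = new StringBuilder(in.length());
    for (int i = 0; i < in.length(); ) {
        int cp = in.codePointAt(i);
        boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
        if (valid) {
            out.appendCodePoint(cp);
        }
        i += Character.charCount(cp); // step over surrogate pairs correctly
    }
    return out.toString();
}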
I was writing code to study YAML files, and I'm trying to put a comment in the YAML file, but I just found out that it doesn't work the way I expected.
My questions are:
Is it possible to insert comments when writing a document?
Am I doing it right?
If it is not possible with the SnakeYAML API, what other approach is more plausible?
Java code:
try {
    text = "#Some random Comentary"    // note: no "\n" here, so the next line
            + "Something: Something\n" // ends up inside the comment
            + "RandoText: Goes Here\n"
            + "Number: true\n"
            + "sometext: Something Else";
    Object obj = writeYaml.load(text);
    FileWriter writer = new FileWriter(directoryPath);
    writeYaml.dump(obj, writer);
} catch (Exception e) {}
YAML that was created:
{RandoText: Goes Here, Number: true, sometext: Something Else}
YAML I want to create:
{
#Some random Comentary
RandoText: Goes Here,
Number: true,
sometext: Something Else
}
I found a solution to this problem; it is not the most elegant, but it gets the result.
I was reading the SnakeYAML documentation (I don't know if it was the official documentation), but it said it was out of date, so it wasn't much help.
So I decided to write the document by hand; my code ended up like this:
try {
    FileWriter fileWriter = new FileWriter("filename.yaml");
    String text = "#Some random Comentary\n"
            + "RandomText: Goes Here,\n"
            + "Number: 10,\n"
            + "isBoolean: true";
    fileWriter.write(text);
    fileWriter.close();
} catch (Exception e) {}
But I do not intend to abandon SnakeYAML for now, because reading YAML without wasting time on raw text handling is exactly what SnakeYAML already does; there is no reason to rewrite that.
However, if someone has a better method, let me know; it will always be welcome.
Ah, I forgot to say: I also tried making a clone of the document, but such documents do not go into the JAR file when you build the project.
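For reference, a possible middle ground, sketched under the assumption that the comment only needs to sit at the top of the file: write the comment line by hand, then let SnakeYAML dump the data right after it.

import java.io.FileWriter;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import org.yaml.snakeyaml.Yaml;

public class CommentedDump {
    public static void main(String[] args) throws IOException {
        Map<String, Object> data = new LinkedHashMap<>();
        data.put("RandomText", "Goes Here");
        data.put("Number", 10);
        data.put("isBoolean", true);

        try (FileWriter writer = new FileWriter("filename.yaml")) {
            writer.write("#Some random Commentary\n"); // the comment, written by hand
            new Yaml().dump(data, writer); // the data, still dumped by SnakeYAML
        }
    }
}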
I am currently saving an int[] from a HashMap to a file named after the key that maps to that int[]. This exact key must be reachable from another program, so I can't restrict the file names to English-only characters. But even though I use ISO_8859_1 as the charset for the filenames, the file names get all messed up in the file tree: the English letters are correct, but the special ones are not.
/**
 * Save array to file
 */
public void saveStatus() {
    try {
        for (String currentKey : hmap.keySet()) {
            byte[] currentKeyByteArray = currentKey.getBytes();
            String bytesString = new String(currentKeyByteArray, StandardCharsets.ISO_8859_1);
            String fileLocation = "/var/tmp/" + bytesString + ".dat";
            FileOutputStream saveFile = new FileOutputStream(fileLocation);
            ObjectOutputStream out = new ObjectOutputStream(saveFile);
            out.writeObject(hmap.get(currentKey));
            out.close();
            saveFile.close();
            System.out.println("Saved file at " + fileLocation);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Could it have to do with how Linux encodes characters, or is it more likely the Java code?
EDIT
I think the problem lies with the OS, because when looking at the text files with cat, for example, the problem is the same. However, vim is able to decode the letters correctly. In that case, would I perhaps have to change the language settings of the terminal?
You have to change the charset in the getBytes function as well.
currentKey.getBytes(StandardCharsets.ISO_8859_1);
Also, why are you using StandardCharsets.ISO_8859_1? To accept a wider range of characters, use StandardCharsets.UTF_8.
The valid characters of a filename or path vary depending on the file system used. While it should be possible to just use a java string as filename (as long as it does not contain characters invalid in the given file system), there might be interoperability issues and bugs.
In other words, leave out all the Charset magic, as @RealSkeptic recommends, and it should work. But changing the environment might result in unexpected behavior.
Depending on your requirements, you might therefore want to encode the key to make sure it only uses a reduced character set. One variant of Base64 might work (assuming your file system is case sensitive!). You might even find a library (Apache Commons?) offering a function to reduce a string to characters safe for use in a file name.
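As a sketch of that idea: the JDK's URL-safe Base64 variant already avoids characters that are problematic in file names (keyToFileName and fileNameToKey are made-up helper names):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Encode the key so the file name only contains A-Z, a-z, 0-9, '-' and '_'.
static String keyToFileName(String key) {
    byte[] utf8 = key.getBytes(StandardCharsets.UTF_8);
    return Base64.getUrlEncoder().withoutPadding().encodeToString(utf8) + ".dat";
}

// The other program can reverse the mapping the same way.
static String fileNameToKey(String fileName) {
    String base = fileName.substring(0, fileName.length() - ".dat".length());
    return new String(Base64.getUrlDecoder().decode(base), StandardCharsets.UTF_8);
}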
So, I'm parsing .mozeml files from Eudora and converting them into an mbox file (the mbox got corrupted and was deleted, but the .mozeml files were left over, and I'm unable to import them). There are over 200,000 e-mails, and I'm unsure what's a good way to handle this properly.
I am thinking of creating a Java program that will read the .mozeml files (they are XML, UTF-8 format), parse the data, and then write an mbox file in this format: http://en.wikipedia.org/wiki/Mbox#Family.
The problem is just that the xml file didn't separate the To line and the message; it's just one entire string. I'm not entirely sure how to properly handle that.
For example here is how the message looks
"Joe 1" <joe1#gmail.com>joe2#gmail.comHello this is an e-mail...
or
"Joe 1" <joe1#gmail.com>"Joe 2" <joe2#gmail.com>Hello this is an e-mail...
There are a lot of test cases to check whether it's a .com/.net/.com.hk/.co.jp/etc. for the first one. The second one is a bit easier because the end of the To line is >. So, I'm unsure about the first case and about ensuring that it's going to be accurate for the 200,000 emails.
Try the ANTLR library for parsing strings.
The first thought for this problem is to use a regexp and a Scanner to find the next email occurrence in a loop.
class EmailScanner {
    public static void main(String[] args) {
        try {
            Scanner s = new Scanner(new File(/* Your file name here. */));
            String token;
            do {
                token = s.findInLine(/* Put your email pattern here. */);
                /* Write your token where you need it. */
            } while (token != null);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Possible email patterns can be found easily. For example ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$ or ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.(?:[a-zA-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$; see http://www.regular-expressions.info/email.html.
Here's a standard email regex modified for your format:
Pattern pattern = Pattern.compile(";[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}");
String text = "\"Joe 1\" <joe1@gmail.com>joe2@gmail.com Hello this is an e-mail...";
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println(matcher.group().replaceFirst(";", ""));
}
It's not going to work if, as in your first example, the email runs directly into the message (joe2@gmail.comHello this), and it assumes your email addresses always begin with ;. You can put other delimiters in there, though.
If you know what all the domain suffixes are, you can do this with some regex-fu:
[a-zA-Z_\.0-9]+@[a-zA-Z_\.0-9]+\.(com|edu|org|net|us|tv|...)
You can find a list of top level domain names here: http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains
The full regex, I believe, should be this:
[a-zA-Z_\.0-9\-]+@[a-zA-Z_\.0-9\-]+\.(aero|asia|biz|cat|com|coop|info|int|jobs|mobi|museum|name|net|org|pro|tel|travel|xxx|edu|gov|mil|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
Of course, I'm not sure if that's a complete list of TLDs, and I know ICANN recently started allowing custom TLDs, but this should catch the vast majority of the email addresses.