I am using Apache Tika to extract content from PDF files.
When I run it I get the error below. I don't see this error documented anywhere, so it came as an unpleasant surprise.
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
at org.apache.tika.parser.pdf.PDF2XHTML.writeWordSeparator(PDF2XHTML.java:318)
at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1741)
at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:141)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
I just want to know how to get around this error and be able to parse files again, or how to make the limit unlimited.
You can use the writeLimit constructor argument to raise the limit or even disable it, using:
public BodyContentHandler(int writeLimit)
The docs say the following:
writeLimit - maximum number of characters to include in the string, or
-1 to disable the write limit
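For example, here is a minimal sketch of disabling the limit by passing -1 to BodyContentHandler and wiring it into an AutoDetectParser (the file name is just a placeholder):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaFullText {
    public static void main(String[] args) throws Exception {
        // -1 disables the default 100,000 character write limit
        BodyContentHandler handler = new BodyContentHandler(-1);
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get("sample.pdf"))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        System.out.println(handler.toString());
    }
}

Bear in mind that with the limit disabled, the entire extracted text is buffered in memory, so very large documents will need a correspondingly large heap.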
Related
I have a Java class that parses an XML file and writes its content to MySQL. Everything works fine, but the problem is that when the XML file contains invalid Unicode characters, an exception is thrown and the program stops parsing the file.
My provider sends this XML file on a daily basis with a list of products with their price, quantity, etc. I have no control over this, so invalid characters will always be there.
All I'm trying to do is catch these errors, ignore them and continue parsing the rest of the XML file.
I've added try-catch statements to the startElement, endElement and characters methods of the SAXHandler class; however, they don't catch any exception, and execution stops whenever the parser finds an invalid character.
It seems that I can only catch these exceptions from the method that calls the parser:
try {
    myIS = new FileInputStream(xmlFilePath);
    parser.parse(myIS, handler);
    retValue = true;
} catch (SAXParseException err) {
    System.out.println("SAXParseException " + err);
}
However, that's useless in my case: even if the exception tells me where the invalid character is, execution stops, so the list of products is far from complete. The list has about 8,000 products and only a couple of invalid characters, but if an invalid character appears within the first 100 products, the remaining 7,900 products are never updated in the database. I've also noticed that the endDocument method is not called if an exception occurs.
Somebody asked the same question here some years ago, but didn't get any solution.
I'd really appreciate any ideas or workarounds for this.
Data Sample (as requested):
<Producto>
<Brand>
<Description>Epson</Description>
<ManufacturerId>eps</ManufacturerId>
<BrandId>eps</BrandId>
</Brand>
<New>false</New>
<OnSale>null</OnSale>
<Type>Physical</Type>
<Description>Epson TM T88V - Impresora de recibos - línea térmica - rollo 8 cm - hasta 300 mm/segundo - paralelo, USB</Description>
<Category>
<CategoryId>pos</CategoryId>
<Description>Puntos de Venta</Description>
<Subcategories>
<CategoryId>pos.printer</CategoryId>
<Description>Impresoras para Recibos</Description>
</Subcategories>
</Category>
<InStock>0</InStock>
<Price>
<UnitPrice>4865.6042</UnitPrice>
<CurrencyId>MXN</CurrencyId>
</Price>
<Manufacturer>
<Description>Epson</Description>
<ManufacturerId>eps</ManufacturerId>
</Manufacturer>
<Mpn>C31CA85814</Mpn>
<Sku>PT910EPS27</Sku>
<CompilationDate>2020-02-25T12:30:14.6607135Z</CompilationDate>
</Producto>
The XML philosophy is that you don't process bad data. If it's not well-formed XML, the parser is supposed to give up, and user applications are supposed to give up. Culturally, this is a reaction against the HTML culture, where it was found that if it's generally expected that data users will tolerate bad data, the consequence is that suppliers will produce bad data.
Standards deliver cost reduction because you can use readily available off-the-shelf tools both for creating valid data and for reading it at the other end. The benefits are totally neutralised if you decide you're going to interchange things that are almost XML but not quite. If you were downloading software you wouldn't put up with it if it didn't compile. So why are you prepared to put up with bad data? Send it back and demand a refund.
Having said that, if the problem is "invalid Unicode characters" then it's possible that it started out as good XML and got corrupted in transit. Find out what went wrong and get it fixed as close to the source of the problem as you can.
I solved it by removing the invalid characters from the XML file before processing it.
I couldn't do what I was originally trying to do (catch the error and continue), but this workaround worked.
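As an illustration of that workaround, here is a rough sketch that filters out characters which are not legal in XML 1.0 before handing the file to the parser (the file names are placeholders, and this only helps if stray control characters are the sole problem; if the bytes don't decode cleanly as UTF-8 you would need to filter at the byte level instead):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class XmlCleaner {

    // Keep only characters allowed by the XML 1.0 spec:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (valid) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String raw = new String(Files.readAllBytes(Paths.get("products.xml")), StandardCharsets.UTF_8);
        Files.write(Paths.get("products-clean.xml"),
                stripInvalidXmlChars(raw).getBytes(StandardCharsets.UTF_8));
    }
}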
I am trying to pull in a CSV file to validate against expected values. However, there is an issue with reading the header row: whatever the first column header is, it remains quoted, which confuses the mappings.
Here is the method for reading in the file:
public boolean openCsv(File fileObject) {
    if (fileObject.exists()) {
        try {
            parser = CSVParser.parse(fileObject, StandardCharsets.UTF_8,
                    CSVFormat.RFC4180.withFirstRecordAsHeader().withIgnoreHeaderCase());
            headers = parser.getHeaderMap();
            records = parser.getRecords();
            return true;
        } catch (IOException e) {
            System.out.println("Cannot parse CSV file: " + fileObject.getName());
        }
    }
    return false;
}
The problem is, with the given header:
"Company ID","Company Name","Company Website","Company Phone", ...
The header map and records list will always leave the first value as quoted:
Error: IllegalArgumentException-Mapping for Company ID not found, expected one of [Company Name, Company Phone, Company Website, ..., "Company ID"]
I tried looping through the header and removing the quotes, but the quoted value is part of the record mappings too, which means I'd have to loop through and rebuild everything.
I have tried different options for CSVParser.parse but the problem remains.
Is there something I'm missing? I checked the Apache Commons JIRA board and no one else has reported this issue so I am inclined to think it's something I need to configure.
Since the columns vary from export to export, I cannot hardcode them and pass it to the parser. It needs to be dynamic.
I was able to replicate a similar issue: if there is a space before "Company ID" it gets quoted (but you would probably have noticed a space before the first column in the header, and the space would still be present in the mapping).
Then I noticed one more thing in your error message: "Company ID" is the last printed element of the mapping, even though it should come first in alphabetical or "in file" order.
Next I remembered there are some "invisible" characters in Unicode, for example the zero width space (see Wikipedia). I created a test file with a zero width space before "Company ID" and got exactly the same error message you show in your question:
Mapping for Company ID not found, expected one of [Company Name, Company Phone, Company Website, "Company ID"]
at org.apache.commons.csv.CSVRecord.get(CSVRecord.java:102)
The zero-width character is present in the message above, even though it is not visible.
By the way, after finding this I copied your error message and checked it for invisible characters. It seems there's a "zero width no break space" before "Company ID".
You will probably have to pre-process the file and remove such characters from it. I don't know why something like that would find its way into a CSV file.
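A rough sketch of that pre-processing step, assuming the culprit is a byte order mark (U+FEFF) or zero width space (U+200B) ahead of the first header (the method name is just an example):

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;

public class CsvBomStrip {
    public static CSVParser openCleanCsv(File fileObject) throws Exception {
        String content = new String(Files.readAllBytes(fileObject.toPath()), StandardCharsets.UTF_8);
        // Strip zero-width characters that would otherwise become part of the first header name
        content = content.replace("\uFEFF", "").replace("\u200B", "");
        return CSVParser.parse(content,
                CSVFormat.RFC4180.withFirstRecordAsHeader().withIgnoreHeaderCase());
    }
}

Commons IO's BOMInputStream is another option if you only need to handle a leading byte order mark.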
I have a project for school to parse web page code and use it like a database. When I tried to download data from (https://www.marathonbet.com/en/betting/Football/), I didn't get all of it.
Here is my code:
Document doc = Jsoup.connect("https://www.marathonbet.com/en/betting/Football/").get();
Elements newsHeadlines = doc.select("div#container_EVENTS");
for (Element e : newsHeadlines.select("[id^=container_]")) {
    System.out.println(e.select("[class^=block-events-head]").first().text());
    System.out.println(e.select("[class^=foot-market]").select("[class^=event]").text());
}
As a result you get (this is the last of the displayed leagues):
Football. Friendlies. Internationals All bets Main bets
1. USA 2. Mexico 16 Apr 01:30 +124 7/5 23/10 111/50 +124
Above it, all the leagues are displayed.
Why don't I get the full data? Thank you for your time!
Jsoup has a default body response limit of 2MB. You can change it to whatever you need with maxBodySize(int)
Set the maximum bytes to read from the (uncompressed) connection into
the body, before the connection is closed, and the input truncated.
The default maximum is 2MB. A max size of zero is treated as an
infinite amount (bounded only by your patience and the memory
available on your machine).
E.g.:
Document doc = Jsoup.connect(url).userAgent(ua).maxBodySize(0).get();
You might like to look at the other options in Connection, on how to set request timeouts, the user-agent, etc.
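For instance, a sketch of how those Connection options chain together (the user agent string and timeout value here are arbitrary examples):

Document doc = Jsoup.connect("https://www.marathonbet.com/en/betting/Football/")
        .userAgent("Mozilla/5.0")   // some sites serve different or reduced content to the default Java agent
        .timeout(30 * 1000)         // request timeout in milliseconds
        .maxBodySize(0)             // 0 = no limit on the response body size
        .get();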
We are currently working on importing huge JSON files (~100 MB) into MongoDB using the Java driver. Currently we split the files into smaller chunks, since we first encountered problems importing the whole file. Of course we are aware of MongoDB's limitation that the maximum document size is 16 MB; however, the chunks we are now importing are far smaller than that.
Strangely enough, the import procedure works when run on Linux (Eclipse), yet the same program throws an exception stating "can't say something" on Windows (Eclipse).
When observing the log from the database, the error message says:
> "Thu Sep 13 11:38:48 [conn1] recv(): message len 1835627538 is too
> large1835627538"
Rerunning the import on the same dataset always leads to the same error message regarding the message length. We investigated the size of the documents to import (using .toString().length()); the chunk that caused the error was only a few kB large.
It makes no difference which OS the Mongo database runs on; it only depends on where the import code is executed (using the same java-mongo-driver).
"we are currently working on importing huge JSON files (~100 MB) into
MongoDB using the java driver"
Are we talking about a JSON file containing thousands of JSON objects, or one JSON object that is ~100 MB in size? Because if I remember correctly, the 16 MB limit is per object, not per JSON file containing thousands of JSON objects.
Also!
"Thu Sep 13 11:38:48 [conn1] recv(): message len 1835627538 is too
large1835627538"
the chunk that caused the error was only some kB large.
If 1835627538 is indeed in kB, that is pretty big, because that's around ~1750 gigabytes!!
To get around a JSON file containing thousands of JSON objects, why don't you iterate through your data file line by line and do your inserts that way? With this method it doesn't matter how large your data file is; the iterator is just a pointer to a specific line. It doesn't load the WHOLE FILE into memory and then insert.
NOTE: This is assuming your data file contains 1 JSON object per line.
Using Apache Commons IO FileUtils, you can use its LineIterator to iterate through your file, for example (a sketch rather than fully working code; you need to import the correct libraries):
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

LineIterator line_iter = null;
try {
    line_iter = FileUtils.lineIterator(data_file);
    while (line_iter.hasNext()) {
        String line = line_iter.nextLine();
        try {
            if (line.charAt(0) == '{')
                this.mongodb.insert(line);
        } catch (IndexOutOfBoundsException e) {
            // skip blank or malformed lines
        }
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    LineIterator.closeQuietly(line_iter); // close the iterator
}
I'm using Digester to parse an XML file and I get the following error:
May 3, 2011 6:41:25 PM org.apache.commons.digester.Digester fatalError
SEVERE: Parse Fatal Error at line 2336608 column 3: The element type "user" must be terminated by the matching end-tag "</user>".
org.xml.sax.SAXParseException: The element type "user" must be terminated by the matching end-tag "</user>".
However, 2336608 is the last line of my file. I guess I'm opening a tag and never closing it. Do you know how I can find and fix it in big text files?
Thanks
Write another script which scans each line of the file and, whenever it finds an opening <user> tag, increments a counter and prints
line number 1234 <user> opened (1 open total)
and whenever it finds a closing </user> tag, decrements the counter and prints
line number 4546 </user> closed (0 open total)
Since you have one more opening tag than closing tags, the final output of this script will tell you that 1 tag was left open. Assuming that your XML model does not allow nested <user> tags, you can then conclude that the problematic declaration is wherever you see the output line number ... <user> opened (2 open total).
$ grep -Hin "</\?user>" Text.xml will print out every line containing either <user> or </user>. If they're not nested, then you should be able to inspect that output and find the missing close tag (where one <user> immediately follows another <user>). A script to do the same:
https://gist.github.com/953837
This assumes that the open and close tags are on different lines.
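For reference, a small sketch of that counting approach in Java, assuming at most one <user> or </user> tag per line, no nesting, and a placeholder file name:

import java.io.BufferedReader;
import java.io.FileReader;

public class UserTagCounter {
    public static void main(String[] args) throws Exception {
        int open = 0;
        int lineNo = 0;
        try (BufferedReader in = new BufferedReader(new FileReader("Text.xml"))) {
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (line.contains("<user>")) {
                    open++;
                    System.out.println("line number " + lineNo + " <user> opened (" + open + " open total)");
                }
                if (line.contains("</user>")) {
                    open--;
                    System.out.println("line number " + lineNo + " </user> closed (" + open + " open total)");
                }
            }
        }
        System.out.println("Unclosed <user> tags at end of file: " + open);
    }
}

Any line where the count reaches 2 points at the <user> element that was never closed.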
Use tidy -xml -e <your-xml-file>. http://tidy.sourceforge.net/
Tidy is a great little tool for validating HTML, and in XML mode (-xml above) it will validate XML as well.
It prints out line and column numbers for parse errors.
Most of the major package managers (apt, port, etc.) will have pre-built packages for it.
I think there is no need to start scripting to detect XML errors.
You can use the w3schools XML validator, for instance:
http://www.w3schools.com/xml/xml_validator.asp
I just pasted a 15 MB XML file in there and I managed to fix it quite easily. You can also supply the XML as a URL if you are able to upload it somewhere. Java reported the error in a place that seemed fine, but this tool located the actual error, and after correcting that, Java didn't complain anymore.
There are many types of XML errors, and they are not all related to the nesting structure, so it is best to just use a well-known tool for this. For instance, my error was an argument error (I was missing a ") but Java reported a nesting problem.