Morning,
I have to parse a huge XML file (2 GB) in Java. It has many tags, but I only need to write the content of two of them, <title> and <subtext>, to a common file each time, so I use a SAX parser.
So far I have managed to write 1M95 of text to the output file, at which point this exception occurs:
org.xml.sax.SAXParseException; systemId: filePath; lineNumber: x; columnNumber: y; JAXP00010004 : La taille cumulée des entités est "50 000 001" et dépasse la limite de "50 000 000" définie par "FEATURE_SECURE_PROCESSING".
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1465)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.checkEntityLimit(XMLScanner.java:1544)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.handleCharacter(XMLDocumentFragmentScannerImpl.java:1940)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(XMLDocumentFragmentScannerImpl.java:1866)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3058)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:504)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:328)
at Parsing.main(Class.java:38)
Translated into English, the exception reads:
The cumulative size of the entities is "50 000 001" which exceeds the boundary of "50 000 000" defined by "FEATURE_SECURE_PROCESSING".
This is the code I've written:
public class Parsing {
public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {
try {
File inputFile = new File(System.getProperty("user.dir") + "/input.xml");
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
UserHandler userhandler = new UserHandler();
saxParser.parse(inputFile, userhandler);
} catch (Exception e) {
e.printStackTrace();
}
}
public static void doThingOne(String text, String title) throws IOException {
// Write the text and the title on a file
}
public static void doThingTwo(String text, String title) throws IOException {
//Write the text and the title on another file
}
}
class UserHandler extends DefaultHandler {
boolean bText = false;
boolean bTitle = false;
StringBuffer tagTextBuffer;
StringBuffer tagTitleBuffer;
String text = null;
String title = null;
@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
if (qName.equals("title")) {
tagTitleBuffer = new StringBuffer();
bTitle = true;
} else if (qName.equalsIgnoreCase("text")) {
tagTextBuffer = new StringBuffer();
bText = true;
}
}
public void endElement(String uri, String localName, String qName) throws SAXException {
if (qName.equals("title")) {
bTitle = false;
title = tagTitleBuffer.toString();
} else if (qName.equals("text")) {
text = tagTextBuffer.toString();
bText = false;
if (text != null && "One".equals(title)) {
try {
Parsing.doThingOne(text, title);
} catch (IOException e) {
e.printStackTrace();
}
} else if (text != null) {
try {
Parsing.doThingTwo(text, title);
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
public void characters(char ch[], int start, int length) throws SAXException {
if (bTitle) {
tagTitleBuffer.append(new String(ch, start, length));
} else if (bText) {
tagTextBuffer.append(new String(ch, start, length));
}
}
}
Thank you for your time.
Switching off FEATURE_SECURE_PROCESSING has no effect (Java 8).
To increase the limit, use:
System.setProperty("jdk.xml.totalEntitySizeLimit", String.valueOf(Integer.MAX_VALUE));
before SAXParserFactory.newInstance();
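As a minimal sketch of the suggestion above, the property is set first and the factory is created afterwards. The class name, the helper method and the sample XML are made up for illustration; only the property name comes from the answer.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EntityLimitDemo {

    // Hypothetical helper: parses the given XML and returns all of its character data.
    static String parseWithRaisedLimit(String xml) {
        // Raise the cumulative entity size limit before the factory is created.
        System.setProperty("jdk.xml.totalEntitySizeLimit", String.valueOf(Integer.MAX_VALUE));
        StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Tiny in-memory document standing in for the real 2 GB file.
        System.out.println(parseWithRaisedLimit("<doc>fish &amp; chips</doc>"));
    }
}
```

For the real file, the same call order applies: set the property, then create the factory, then parse the File.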
The limit is there to prevent the "billion laughs" attack. If you trust the XML source you could switch off the SECURE_PROCESSING feature which imposes this limit.
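For reference, this is roughly how the feature is switched off on the factory. It is only a sketch: whether doing so actually lifts the entity size limit may depend on the JDK version, and the class and method names are invented for the example.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;

class SecureProcessingDemo {

    // Hypothetical helper: returns a factory with secure processing disabled.
    // Only use this for XML that comes from a trusted source.
    static SAXParserFactory newTrustingFactory() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, false);
            return factory;
        } catch (Exception e) {
            throw new IllegalStateException("Could not configure the parser factory", e);
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = newTrustingFactory();
        // Reports the current state of the feature on this factory.
        System.out.println(factory.getFeature(XMLConstants.FEATURE_SECURE_PROCESSING));
    }
}
```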
I would generally recommend using Apache Xerces in preference to the version bundled with the JDK.
Your code for the characters() method is wrong: both the text and the title element content can be delivered split into multiple calls so you need to accumulate a buffer for both cases.
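A small sketch of that accumulation pattern, assuming a handler that only cares about <title>: the buffer is reset when the element starts, appended to on every characters() call, and read when the element ends. The class name, the titlesOf helper and the sample document are illustrative, not taken from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AccumulatingHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> titles = new ArrayList<>();
    private boolean inTitle;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equals("title")) {
            inTitle = true;
            buffer.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one element's text in several chunks, so always append.
        if (inTitle) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("title")) {
            titles.add(buffer.toString()); // the text is complete only here
            inTitle = false;
        }
    }

    // Hypothetical helper that runs the handler over an in-memory document.
    static List<String> titlesOf(String xml) {
        AccumulatingHandler handler = new AccumulatingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return handler.titles;
    }

    public static void main(String[] args) {
        System.out.println(titlesOf("<doc><title>Tom &amp; Jerry</title><other>skip</other></doc>"));
    }
}
```

Entity references such as &amp; are a typical place where the text of one element arrives in more than one chunk, which is why the value is only read in endElement.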
It would be nice to know something about why the entity expansion limit is being hit. Does your document include lots of entity references to tiny entities, or a few references to big ones, or what? Do the entity references occur in the parts of the document you are interested in?
Related
I've got a Java SAX parser for XML (we set the date, make a URL request for this date and parse the XML file). Now I need to turn this code into a web app in Tomcat. I've imported all the necessary libraries and created the artifacts, but I don't know how to change the code itself.
Here is the initial code.
Handler:
public class UserHandler extends DefaultHandler {
boolean bName = false;
boolean bValue = false;
String result=" ";
@Override
public void startElement(String uri,
String localName, String qName, Attributes attributes) throws SAXException {
if (qName.equalsIgnoreCase("Valute")) {
String CharCode = attributes.getValue("CharCode");
} else if (qName.equalsIgnoreCase("Name")) {
bName = true;
} else if (qName.equalsIgnoreCase("Value")) {
bValue = true;
}
}
@Override
public void endElement(String uri,
String localName, String qName) throws SAXException {
if (qName.equalsIgnoreCase("Valute")) {
System.out.print(" ");
}
}
@Override
public void characters(char ch[], int start, int length) throws SAXException {
if (bName) {
result=(new String(ch, start, length)+" ");
bName = false;
} else if (bValue) {
result=result+(new String(ch, start, length));
bValue = false;
System.out.print(result);
}
}
}
Main:
public class Main {
public static void main(String[] args) throws MalformedURLException {
//Set the date dd.mm.yyyy
String date="12.08.2020";
String link ="http://www.cbr.ru/scripts/XML_daily.asp?date_req=";
URL url =new URL(link);
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
UserHandler userHandler = new UserHandler();
saxParser.parse(String.valueOf(url+date), userHandler);
} catch (Exception e) {
e.printStackTrace();
}
}
}
I have an XML file:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2019-11-22.dtd">
<dblp>
<phdthesis mdate="2016-05-04" key="phd/dk/Heine2010">
<author>Carmen Heine</author>
<title>Modell zur Produktion von Online-Hilfen.</title>
<year>2010</year>
<school>Aarhus University</school>
<pages>1-315</pages>
<isbn>978-3-86596-263-8</isbn>
<ee>http://d-nb.info/996064095</ee>
</phdthesis><phdthesis mdate="2020-02-12" key="phd/Hoff2002">
.
. (continues with the same tags for a lot of other books)
From that XML file I'm trying to extract the content of the "year" tag in order to count how many books were published each year. I have tried several implementations for that purpose, but none of them seems to work.
Code I've written so far:
public class Publications {
public static void main(String[] args) {
try
{
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
DefaultHandler handler = new DefaultHandler()
{
boolean year = false;
//parser starts parsing a specific element inside the document
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException
{
System.out.println("Start Element :" + qName);
if(qName.equalsIgnoreCase("Year"))
{
year=true;
}
}
//parser ends parsing the specific element inside the document
public void endElement(String uri, String localName, String qName) throws SAXException
{
System.out.println("End Element:" + qName);
}
//reads the text value of the currently parsed element
public void characters(char ch[], int start, int length) throws SAXException
{
if (year)
{
System.out.println("Year : " + new String(ch, start, length));
year = false;
}
}
};
saxParser.parse("dblp-2020-04-01.xml", handler);
}
catch (Exception e)
{
e.printStackTrace();
}
}
}
I am also adding the exception I get:
java.io.FileNotFoundException: C:\Users\Deray\DataAnalysis\dblp-2020-04-01.dtd
at java.base/java.io.FileInputStream.open0(Native Method)
at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
at java.base/java.io.FileInputStream.<init>(FileInputStream.java:112)
at java.base/sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:86)
at java.base/sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:184)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:654)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:150)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:860)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1216)
at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:635)
at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:324)
at java.xml/javax.xml.parsers.SAXParser.parse(SAXParser.java:276)
at Publications.main(Publications.java:44)
Do you have any other suggestions about the implementation?
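One possible shape for the counting step described above, as a rough sketch only: a handler collects the text of each <year> element and tallies it in a map. The class name, the countYears helper and the tiny in-memory document are assumptions for illustration. The sample deliberately has no DOCTYPE, so it says nothing about the missing-DTD exception in the stack trace, which is a separate matter from the counting logic.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class YearCounter extends DefaultHandler {
    private final Map<String, Integer> counts = new TreeMap<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean inYear;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equalsIgnoreCase("year")) {
            inYear = true;
            buffer.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inYear) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equalsIgnoreCase("year")) {
            // Add one to the tally for this year, starting at 1 if it is new.
            counts.merge(buffer.toString().trim(), 1, Integer::sum);
            inYear = false;
        }
    }

    // Hypothetical helper that counts the years in an in-memory document.
    static Map<String, Integer> countYears(String xml) {
        YearCounter handler = new YearCounter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return handler.counts;
    }

    public static void main(String[] args) {
        String xml = "<dblp><phdthesis><year>2010</year></phdthesis>"
                + "<phdthesis><year>2010</year></phdthesis>"
                + "<phdthesis><year>2002</year></phdthesis></dblp>";
        System.out.println(countYears(xml)); // prints {2002=1, 2010=2}
    }
}
```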
I am a novice in Java and have written some code in which I am struggling to fetch an attribute value from inside a tag. For example, in the XML below, id = bk001 doesn't appear in the output.
<book id="bk001">
<author>Hightower, Kim</author>
<title>The First Book</title>
<genre>Fiction</genre>
<price>44.95</price>
<pub_date>2000-10-01</pub_date>
<date>
<auth_date>
2000-10-01
</auth_date>
<auth_date>
2000-10-05
</auth_date>
</date>
<review>An amazing story of nothing.</review>
</book>
We can expect XML of any type, and we have to convert it into a flat structure, e.g. CSV.
Code written:
public class SAX
{
Map<String, String> list = new HashMap<String,String>();
public static void main(String[] args) throws IOException {
new SAX().printElementNames("input/books_1.xml");
}
public void printElementNames(String fileName) throws IOException
{
try {
SAXParserFactory parserFact = SAXParserFactory.newInstance();
SAXParser parser = parserFact.newSAXParser();
DefaultHandler handler = new DefaultHandler()
{
public void startElement(String uri, String lName, String ele, Attributes attributes) throws SAXException {
System.out.print(ele + " ");
if((attributes.getValue("TagValue"))==null)
{
return;
}
else
{
System.out.println(attributes.getValue("TagValue"));
}
}
public void characters(char ch[], int start, int length) throws SAXException {
String value = new String(ch, start, length).trim();
if(value.length() == 0) return;
System.out.println(value);
}
};
parser.parse(new File(fileName), handler);
}catch(Exception e){
e.printStackTrace();
}
}
}
Kindly help me with this. I have tried searching for it on Stack Overflow but couldn't find anything concrete.
The goal of the code is that it should work for any valid XML.
Note: we are not allowed to use external libraries such as Gson.
The only attribute that your code is attempting to read is "TagValue", so why would you expect your code to display the value of an "id" attribute?
Replace your startElement with:
public void startElement(String uri, String localName,String qName, Attributes attributes) throws SAXException {
System.out.print(qName + " ");
for(int i=0; i<attributes.getLength();i++) {
System.out.println(attributes.getQName(i) + " " + attributes.getValue(i));
}
}
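To see the attribute loop in a self-contained form, here is a minimal sketch that collects every attribute of every element from an in-memory document. The class name, the attributesOf helper, the "element@name=value" output format and the lang attribute in the sample are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AttributeCollector extends DefaultHandler {
    private final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // Walk over all attributes instead of asking for one fixed name.
        for (int i = 0; i < attributes.getLength(); i++) {
            seen.add(qName + "@" + attributes.getQName(i) + "=" + attributes.getValue(i));
        }
    }

    // Hypothetical helper that lists the attributes found in an in-memory document.
    static List<String> attributesOf(String xml) {
        AttributeCollector handler = new AttributeCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return handler.seen;
    }

    public static void main(String[] args) {
        String xml = "<book id=\"bk001\"><title lang=\"en\">The First Book</title></book>";
        System.out.println(attributesOf(xml)); // prints [book@id=bk001, title@lang=en]
    }
}
```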
I get an XML response to an HTTP request. I store it in a String variable:
String str = in.readLine();
The contents of str are:
<response>
<lastUpdate>2012-04-26 21:29:18</lastUpdate>
<state>tx</state>
<population>
<li>
<timeWindow>DAYS7</timeWindow>
<confidenceInterval>
<high>15</high>
<low>0</low>
</confidenceInterval>
<size>0</size>
</li>
</population>
</response>
I want to assign tx, DAYS7 to variables. How do I do that?
Thanks
Slightly modified code from http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
public class ReadXMLFile {
// Your variables
static String state;
static String timeWindow;
public static void main(String argv[]) {
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
// Http Response you get
String httpResponse = "<response><lastUpdate>2012-04-26 21:29:18</lastUpdate><state>tx</state><population><li><timeWindow>DAYS7</timeWindow><confidenceInterval><high>15</high><low>0</low></confidenceInterval><size>0</size></li></population></response>";
DefaultHandler handler = new DefaultHandler() {
boolean bstate = false;
boolean tw = false;
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
if (qName.equalsIgnoreCase("STATE")) {
bstate = true;
}
if (qName.equalsIgnoreCase("TIMEWINDOW")) {
tw = true;
}
}
public void characters(char ch[], int start, int length) throws SAXException {
if (bstate) {
state = new String(ch, start, length);
bstate = false;
}
if (tw) {
timeWindow = new String(ch, start, length);
tw = false;
}
}
};
saxParser.parse(new InputSource(new ByteArrayInputStream(httpResponse.getBytes("utf-8"))), handler);
} catch (Exception e) {
e.printStackTrace();
}
System.out.println("State is " + state);
System.out.println("Time windows is " + timeWindow);
}
}
If you're running this as part of a larger process, you might want to make ReadXMLFile extend DefaultHandler.
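A sketch of what such a handler class could look like, with the values kept in instance fields rather than static ones. The class name ResponseHandler and the parse helper are assumptions, not part of the original answer; the sample string is a shortened version of the response shown in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ResponseHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    String state;
    String timeWindow;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        buffer.setLength(0); // forget text that belonged to earlier elements
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equalsIgnoreCase("state")) {
            state = buffer.toString().trim();
        } else if (qName.equalsIgnoreCase("timeWindow")) {
            timeWindow = buffer.toString().trim();
        }
    }

    // Hypothetical helper: parses the response string and returns the filled handler.
    static ResponseHandler parse(String xml) {
        ResponseHandler handler = new ResponseHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        ResponseHandler result = parse("<response><state>tx</state><population><li>"
                + "<timeWindow>DAYS7</timeWindow></li></population></response>");
        System.out.println("State is " + result.state);
        System.out.println("Time window is " + result.timeWindow);
    }
}
```

Passing the string through a StringReader avoids having to pick a byte encoding for an XML document that is already held as a String.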
I am using the SAX parser for XML parsing. The problem is that if I only print the values, everything is fine. However, if I try to save anything, I get this error message (typos included):
"XML Pasing Excpetion = java.lang.NullPointerException"
My code is given below:
Parser code:
try {
/** Handling XML */
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser sp = spf.newSAXParser();
XMLReader xr = sp.getXMLReader();
/** Send URL to parse XML Tags */
URL sourceUrl = new URL(
"http://50.19.125.224/Demo/VeryGoodSex_and_the_City_S6E6.xml");
/** Create handler to handle XML Tags ( extends DefaultHandler ) */
MyXMLHandler myXMLHandler = new MyXMLHandler();
xr.setContentHandler((ContentHandler) myXMLHandler);
xr.parse(new InputSource(sourceUrl.openStream()));
} catch (Exception e) {
System.out.println("XML Pasing Excpetion = " + e);
}
Object to hold XML parsed Info:
public class ParserObject {
String name=null;
String description=null;
String bitly=null; //single
String productLink=null;//single
String productPrice=null;//single
Vector<String> price=null;
}
Handler class:
static ParserObject[] xmlDataObject = null;
public void endElement(String uri, String localName, String qName)
throws SAXException {
currentElement = false;
if (qName.equalsIgnoreCase("title"))
{
xmlDataObject[index].name=currentValue;
}
else if (qName.equalsIgnoreCase("artist"))
{
xmlDataObject[index].artist=currentValue;
}
}
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
currentElement = true;
if (qName.equalsIgnoreCase("allinfo"))
{
System.out.println("started");
}
else if (qName.equalsIgnoreCase("tags"))
{
insideTag=1;
}
}
public void characters(char[] ch, int start, int length)
throws SAXException {
if (currentElement) {
currentValue = new String(ch, start, length);
currentElement = false;
}
}
Your ParserObject array, xmlDataObject, is null, which is why you get a NullPointerException. This is my view and it might be wrong, but please check it.
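One way to avoid the null array, sketched under assumptions: keep the parsed objects in a list that is created up front, and create a new object each time a record starts. The element name "item", the Item class and the parseItems helper are hypothetical, since the real XML structure is not shown in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ItemHandler extends DefaultHandler {
    // Simplified stand-in for ParserObject.
    static class Item {
        String name;
        String artist;
    }

    // The list exists from the start, so storing into it cannot hit a null container.
    private final List<Item> items = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private Item current;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        buffer.setLength(0);
        if (qName.equalsIgnoreCase("item")) { // assumed record element
            current = new Item();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return; // text outside of a record is ignored
        }
        if (qName.equalsIgnoreCase("title")) {
            current.name = buffer.toString().trim();
        } else if (qName.equalsIgnoreCase("artist")) {
            current.artist = buffer.toString().trim();
        } else if (qName.equalsIgnoreCase("item")) {
            items.add(current);
            current = null;
        }
    }

    // Hypothetical helper that parses an in-memory document into a list of items.
    static List<Item> parseItems(String xml) {
        ItemHandler handler = new ItemHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return handler.items;
    }

    public static void main(String[] args) {
        List<Item> items = parseItems("<allinfo><item><title>A</title><artist>B</artist></item>"
                + "<item><title>C</title><artist>D</artist></item></allinfo>");
        System.out.println(items.size() + " items, first is " + items.get(0).name);
    }
}
```

If an array is required instead, it would need to be allocated, and each slot filled with a new object, before any field is assigned.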