Parsing XML in Java line by line

I'd like to parse an XML file in Java line by line because the format of the file I got is a bit different than usual. It is not nested; each tag is on its own line.
Part of XML file:
<sentence><flag>3</flag></sentence>
<word><text>Zdravo</text></word>
<phoneme><onephoneme>z</onephoneme></phoneme>
<phoneme><onephoneme>d</onephoneme></phoneme>
<phoneme><onephoneme>r</onephoneme></phoneme>
<phoneme><onephoneme>"a:</onephoneme></phoneme>
<phoneme><onephoneme>v</onephoneme></phoneme>
<phoneme><onephoneme>O</onephoneme></phoneme>
<sentence><flag>0</flag></sentence>
<word><text>moje</text></word>
...
I searched and found a lot of different ways to parse an XML file, but all of them scan the whole file and I don't want that, because my file is almost 100k lines and for now (and maybe even later) I only need the first 800 lines, so it would be much faster to parse line by line. I don't know in advance exactly how many lines I really need, but I'd like to count how many times I reach a certain tag and stop at a certain count (for now it's 17 - that's around line 800).
Tutorials that I found:
nested XML: http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
nested XML: http://www.javacodegeeks.com/2013/05/parsing-xml-using-dom-sax-and-stax-parser-in-java.html
single line XML with attributes: Read single XML line with Java
Each sentence is then separated into words and each word into phonemes, so in the end I'd have 3 ArrayLists: flags, words and phonemes.
I hope I gave you enough information.
Thank you.

Lines are not really relevant for XML; all 100K lines' worth of XML could just as well sit on one single line. What you need to do is count the elements/nodes you parse. Use a SAX parser: it is event based and will notify you when an element starts and when it ends. Whenever you get an element you are interested in, increment the counter (see the sketch after the list below). This assumes you know the elements you are interested in; from your example, those would be:
<sentence>
<word>
<phoneme>
etc.
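A minimal sketch of that counting approach, assuming the data lives in a file named sentences.xml and that you stop after 17 <sentence> elements (the file name and limit come from the question; note that SAX needs well-formed input, so the fragment above would need a single root element):

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SentenceCounter extends DefaultHandler {
    private int sentenceCount = 0;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) throws SAXException {
        if ("sentence".equals(qName)) {
            sentenceCount++;
            if (sentenceCount > 17) {
                // SAX has no "stop" call; aborting via an exception is the usual idiom
                throw new SAXException("reached sentence limit");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse("sentences.xml", new SentenceCounter());
        } catch (SAXException stopped) {
            // hit the limit; everything before it has already been handled
        }
    }
}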

Andrew Stubbs suggested SAX and StAX, but if your file will be really big I would use VTD-XML. It is at least 3 times faster than SAX and much more flexible; processing 2GB XMLs is not a problem at all.
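A rough sketch of what that could look like with VTD-XML (the file name and XPath expression are assumptions based on the question's snippet):

import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdExample {
    public static void main(String[] args) throws Exception {
        VTDGen vg = new VTDGen();
        if (!vg.parseFile("sentences.xml", false)) {   // false: no namespace awareness
            throw new IllegalStateException("parse failed");
        }
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        ap.selectXPath("//sentence/flag");
        while (ap.evalXPath() != -1) {
            int t = vn.getText();                      // token index of the text node, -1 if none
            if (t != -1) {
                System.out.println(vn.toNormalizedString(t));
            }
        }
    }
}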

If you want to read a file line by line, it has nothing to do with XML. Simply use a BufferedReader, as it provides a readLine method.
With a simple counter you can check how many lines you have already read and quit the loop after you hit the 800 mark.
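A minimal sketch, with sentences.xml as a placeholder file name and the 800-line cutoff taken from the question:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class FirstLines {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("sentences.xml"))) {
            String line;
            int count = 0;
            while ((line = reader.readLine()) != null && ++count <= 800) {
                // each line holds one complete element, so it can be handled on its own
                System.out.println(line);
            }
        }
    }
}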

As said by @Korashen, if you can guarantee that the files you will be dealing with follow a flat, line-by-line structure, then you are probably best off pretending that the files are not XML at all and using a normal BufferedReader.
However, if you need to parse it as XML, then a streaming XML reader should be able to do what you want. According to Java XML Parser for huge files, SAX or StAX are the standard choices.
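If you go the StAX route, here is a sketch of the pull-based variant: it counts <sentence> start elements and, being pull-based, can simply stop reading at 17 (file name and limit taken from the question):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxCounter {
    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("sentences.xml"));
        int sentences = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "sentence".equals(reader.getLocalName())) {
                if (++sentences > 17) {
                    break;   // unlike SAX, a pull parser can just stop
                }
            }
        }
        reader.close();
    }
}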

You can use a SAX parser. The XML is traversed and the appropriate events are triggered as each element is encountered. In addition you could use org.xml.sax.Locator to identify the current line number and throw an exception when you pass line 800 to stop parsing.
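A sketch of the Locator idea, layered on a SAX DefaultHandler (the 800-line cutoff comes from the question):

import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class LineLimitHandler extends DefaultHandler {
    private Locator locator;

    @Override
    public void setDocumentLocator(Locator locator) {
        // the parser hands over a Locator before any content events arrive
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) throws SAXException {
        if (locator != null && locator.getLineNumber() > 800) {
            throw new SAXException("passed line 800, stopping parse");
        }
        // ... handle the element as usual ...
    }
}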

Related

Defining a manual Split algorithm for File Input

I'm new to Spark and the Hadoop ecosystem and already fell in love with it.
Right now, I'm trying to port an existing Java application over to Spark.
This Java application is structured the following way:
Read the file(s) one by one with a BufferedReader and a custom Parser class that does some heavy computing on the input data. The input files are 1 GB to a maximum of 2.5 GB each.
Store data in memory (in a HashMap<String, TreeMap<DateTime, List<DataObjectInterface>>>)
Write out the in-memory datastore as JSON. These JSON files are smaller in size.
I wrote a Scala application that processes my files on one worker, but that is obviously not all the performance I can get out of Spark.
Now to my problem with porting this over to Spark:
The input files are line-based. I usually have one message per line. However, some messages depend on preceding lines to form an actual valid message in the Parser. For example it could happen that I get data in the following order in an input file:
{timestamp}#0x033#{data_bytes} \n
{timestamp}#0x034#{data_bytes} \n
{timestamp}#0x035#{data_bytes} \n
{timestamp}#0x0FE#{data_bytes}\n
{timestamp}#0x036#{data_bytes} \n
To form an actual message out of the "composition message" 0x036, the parser also needs the lines from messages 0x033, 0x034 and 0x035. Other messages can also arrive in between this set of needed messages. Most messages, though, can be parsed from a single line.
Now finally my question:
How do I get Spark to split my file correctly for my purposes? The files cannot be split "randomly"; they must be split in a way that ensures all my messages can be parsed and the Parser will not wait for input that it will never get. This means that each composition message (a message that depends on preceding lines) needs to be in one split.
I guess there are several ways to achieve a correct output but I'll throw some ideas that I had into this post as well:
Define a manual split algorithm for the file input? This would check that the last few lines of a split do not contain the start of a "big" message [0x033, 0x034, 0x035].
Split the file however Spark wants, but also add a fixed number of lines (let's say 50, which will do the job for sure) from the end of the previous split to the next split. Duplicated data would be handled correctly by the Parser class and would not introduce any issues.
The second way might be easier, but I have no clue how to implement it in Spark. Can someone point me in the right direction?
Thanks in advance!
I saw your comment on my blogpost on http://blog.ae.be/ingesting-data-spark-using-custom-hadoop-fileinputformat/ and decided to give my input here.
First of all, I'm not entirely sure what you're trying to do. Help me out here: your file contains lines with 0x033, 0x034, 0x035 and 0x036, so Spark will process them separately? While actually these lines need to be processed together?
If this is the case, you shouldn't interpret this as a "corrupt split". As you can read in the blogpost, Spark splits files into records that it can process separately. By default it does this by splitting records on newlines. In your case, however, your "record" is actually spread over multiple lines. So yes, you can use a custom FileInputFormat. I'm not sure this will be the easiest solution, however.
You can try to solve this using a custom FileInputFormat that does the following: instead of emitting the file line by line like the default FileInputFormat does, you parse the file and keep track of the records you encounter (0x033, 0x034 etc.). In the meantime you may filter out records like 0x0FE (not sure if you want to use them elsewhere). The result will be that Spark gets all of these physical records as one logical record.
On the other hand, it might be easier to read the file line by line and map the records using a functional key (e.g. [object 33, 0x033], [object 33, 0x034], ...). This way you can combine these lines using the key you chose; a sketch of that follows below.
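A minimal sketch of that keying idea using Spark's Java API; extractKey is a hypothetical helper that would have to map all the lines of one logical message (0x033 through 0x036 in the example) to the same key:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class KeyedMessages {
    // hypothetical: derive a grouping key from a line; the real logic would
    // have to tie the composition-message lines together
    static String extractKey(String line) {
        return line.split("#")[0];   // placeholder: keys on the first field
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("keyed-messages");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("messages.log");
            JavaPairRDD<String, Iterable<String>> grouped = lines
                    .mapToPair(line -> new Tuple2<>(extractKey(line), line))
                    .groupByKey();
            // each group now holds all physical lines of one logical message,
            // so the existing Parser can be applied per group
            grouped.foreach(group -> System.out.println(group._1()));
        }
    }
}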
There are certainly other options. Whichever you choose depends on your use case.

Parsing messy texts with Stanford Parser

I am running the Stanford Parser on a large chunk of text. The parser terminates when it hits a sentence it cannot parse, and gives the following runtime error. Is there a way to make the Stanford Parser ignore the error and move on to parsing the next sentence?
One way is to break the text down into a myriad of one-sentence documents, then parse each document and record the output. However, this involves loading the Stanford Parser many, many times: each time a document is parsed, the Stanford Parser has to be reloaded. Loading the parser takes a lot of time, while the parsing itself takes much less. It would be great to find a way to avoid reloading the parser for every sentence.
Another solution might be to reload the parser once it hits an error, pick up the text where it stopped, and continue parsing from there. Does anyone know of a good way to implement this solution?
Last but not least, does there exist any Java wrapper that ignores errors and keeps a Java program running until the program terminates naturally?
Thanks!
Exception in thread "main" java.lang.RuntimeException: CANNOT EVEN CREATE ARRAYS OF ORIGINAL SIZE!!
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.considerCreatingArrays(ExhaustivePCFGParser.java:2190)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.parse(ExhaustivePCFGParser.java:347)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parseInternal(LexicalizedParserQuery.java:258)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parse(LexicalizedParserQuery.java:536)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parseAndReport(LexicalizedParserQuery.java:585)
at edu.stanford.nlp.parser.lexparser.ParseFiles.parseFiles(ParseFiles.java:213)
at edu.stanford.nlp.parser.lexparser.ParseFiles.parseFiles(ParseFiles.java:73)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.main(LexicalizedParser.java:1535)
This error is basically an out-of-memory error. It likely occurs because there are long stretches of text with no sentence-terminating punctuation (periods, question marks), so the parser is trying to parse a huge list of words that it regards as a single sentence.
The parser in general tries to continue after a parse failure, but in this case it can't, because it both failed to create the data structures for parsing a longer sentence and then failed to recreate the data structures it was using previously. So you need to do something.
Choices are:
Indicate sentence/short document boundaries yourself. This does not require loading the parser many times (and you should avoid that). From the command line you can put each sentence in its own file and give the parser many documents to parse, asking it to save them in different files (see the -writeOutputFiles option).
Alternatively (and perhaps better) you can keep everything in one file by either putting one sentence per line, or using simple XML/SGML-style tags surrounding each sentence, and then using -sentences newline or -parseInside ELEMENT.
Or you can just avoid this problem by specifying a maximum sentence length; longer stretches that are not sentence-divided will be skipped. (This is great for runtime too!) You can do this with -maxLength 80.
If you are writing your own program, you could catch this Exception and try to resume. But this will only succeed if sufficient memory is available, unless you take the steps in the earlier bullet points.
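A rough sketch of that catch-and-resume idea, loading the parser once and reusing it (this assumes a CoreNLP version where LexicalizedParser.parse(String) is available; the model path is the stock English PCFG model):

import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;
import java.util.Arrays;
import java.util.List;

public class ResilientParsing {
    public static void main(String[] args) {
        // load the model once; reloading it per sentence is what makes things slow
        LexicalizedParser parser = LexicalizedParser.loadModel(
                "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
        List<String> sentences = Arrays.asList(
                "This sentence parses fine.", "So does this one.");
        for (String sentence : sentences) {
            try {
                Tree tree = parser.parse(sentence);
                System.out.println(tree);
            } catch (RuntimeException e) {
                // skip the unparseable sentence and move on; this only works
                // if enough memory is left afterwards, as noted above
                System.err.println("skipping sentence: " + e.getMessage());
            }
        }
    }
}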
Parsers are known to be slow. You can try using a shallow parser, which will be relatively faster than the full-blown version. If you just need POS tags, consider using a tagger instead. Create a static instance of the parser and use it over and over rather than reloading it.

Dealing with specific lines when parsing XML docs in Java

I have a huge XML file from Wiktionary that I need to parse for a class project. I only need to extract data from a set of 200 lines, which starts at line 395,000. How would I go about scanning only that small number of lines? Is there some sort of built-in property for line numbers?
If line boundaries are significant in your data then it's not true XML. Accept it for what it is, a line-oriented file, and start by processing it using line-oriented text tools. Use these to extract the XML (if you can), and then pass this XML to an XML parser.
There is no built-in property for line numbers.
If you want to look at all of the data from line 395,000 to 395,200 programmatically, you can do so by counting newline characters.
Each line in the file ends with a newline ("\n"), so you could count 394,999 of them to reach line 395,000, and then look at the data until you see 200 more.
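A minimal sketch of that, with wiktionary.xml as a placeholder file name:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineWindow {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("wiktionary.xml"))) {
            // skip the first 394,999 lines to land on line 395,000
            for (int i = 0; i < 394_999; i++) {
                if (reader.readLine() == null) return;   // file shorter than expected
            }
            for (int i = 0; i < 200; i++) {
                String line = reader.readLine();
                if (line == null) break;
                System.out.println(line);   // hand these 200 lines to an XML parser
            }
        }
    }
}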

Parsing XML file from the end of file

I want to use XML for storing some data, but I do not want to read the full file to get the last data that was inserted, and I do not want to rewrite the full file when adding new data. Is there a standard way in Java to parse an XML file not from the beginning but from the end, so that for example a SAX or StAX parser would first encounter the last closing root tag and then the last element? Or, if I want to do this, should I read and write everything as if it were a regular text file?
Fundamentally, XML is a poor representation choice for this. The format is inherently "contained" like this, and I haven't seen any APIs which encourage you to fight against that.
Options:
Choose a different format entirely (e.g. use a database)
Create lots of small XML files instead - each one self-contained. When you want the whole of the data, read all the files
Just swallow the hit and read/write the whole file each time.
I found a good topic on this with example solutions for what I want.
This link: http://www.oreillynet.com/xml/blog/2007/03/parsing_xml_backwards.html
It seems that XML is not a good file format to achieve what I want. There is no standard parser that can parse XML from the end instead of the beginning.
Probably the best solution for me will be to store everything in one file that is a composition of many XML documents: each line holds the complete contents of a separate XML document. The file itself is not well-formed XML, but each line contains well-formed XML that I can parse using a standard XML parser (StAX).
This way I will be able to read just lines from the end of the file and append new data to the end of the file. When I need all the data, or only part of it, I will read all the lines or some of them. I can probably also implement pagination from the end of the file, because the file can be big.
Why XML on each line? I think it is easier to use an API for parsing it, and storing the data as XML is more human-readable than separating values in a line with some symbol.
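A sketch of reading just the last line by seeking backwards from the end of the file and handing that line to StAX (data.log is a placeholder name; the code assumes Unix line endings and a single-byte text encoding, which RandomAccessFile.readLine expects):

import java.io.RandomAccessFile;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class LastRecord {
    public static void main(String[] args) throws Exception {
        String lastLine;
        try (RandomAccessFile raf = new RandomAccessFile("data.log", "r")) {
            long end = raf.length();
            if (end > 0) {                       // ignore a trailing newline, if any
                raf.seek(end - 1);
                if (raf.read() == '\n') end--;
            }
            long pos = end;
            while (pos > 0) {                    // walk back to the previous newline
                raf.seek(pos - 1);
                if (raf.read() == '\n') break;
                pos--;
            }
            raf.seek(pos);
            lastLine = raf.readLine();
        }
        // each line is a complete, well-formed XML document, so StAX can parse it alone
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(lastLine));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                System.out.println(reader.getLocalName());
            }
        }
        reader.close();
    }
}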
Why not use SAX/StAX and simply process only your last entry? Yes, it will need to open and go through the whole file, but at least that's fairly efficient, as opposed to loading the whole DOM tree.
Short of doing that, I don't think you can do what you're asking using XML as a source.
Another alternative, apart from the ones provided by Jon Skeet in his answer, would be to keep the same format but insert the latest entries first, and stop processing the file as soon as you've read your entry.

Conversion from one form of XML to another form of XML

I am trying to convert an XML file of one particular format, i.e. with one set of tags, to another set of tags, and when I print the data between the tags in the new file, it gets repeated hundreds of times. I am getting the new tags and everything, but the data is repeated many times. For example: if the sentence is "Hello", Hello gets repeated many times.
I am using SAX parser to parse the old XML file and Node class using appendChild to put the contents into the new file.
Kindly help me with this! I'll provide the code.
Thank you!
I recommend using XSLT rather than an XML parser to transform XML.
Having said that, there is most likely something wrong with your SAX callbacks to cause something like that to happen. I can't be more specific without seeing your actual code.
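For reference, applying a stylesheet from Java takes only a few lines with the built-in JAXP transformation API (old.xml, new.xml and transform.xsl are placeholder names):

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltTransform {
    public static void main(String[] args) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("transform.xsl")));
        // the stylesheet holds the old-tag-to-new-tag mapping rules
        transformer.transform(new StreamSource(new File("old.xml")),
                              new StreamResult(new File("new.xml")));
    }
}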
