I have a huge XML file from Wiktionary that I need to parse for a class project. I only need to extract data from a set of 200 lines, which start at line 395,000. How would I go about only scanning that small number of lines? Is there some sort of built-in property for line numbers?
If line boundaries are significant in your data then it's not true XML. Accept it for what it is, a line-oriented file, and start by processing it using line-oriented text tools. Use these to extract the XML (if you can), and then pass this XML to an XML parser.
There is no built-in property for line numbers.
If you want to look at all of the data from line 395,000 to line 395,199 programmatically, you can do so by counting newline characters.
Each line in the file ends with a newline ("\n"), so you could count 394,999 of them and then read the data until you see 200 more.
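For example, here is a minimal sketch of that approach in Java (the file name is an assumption, and it presumes the file is valid UTF-8):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ExtractLines {
    public static void main(String[] args) throws IOException {
        // Stream the file lazily: skip the first 394,999 lines, then keep the next 200.
        try (Stream<String> lines = Files.lines(Paths.get("enwiktionary.xml"))) {
            List<String> slice = lines.skip(394_999)
                                      .limit(200)
                                      .collect(Collectors.toList());
            slice.forEach(System.out::println);
        }
    }
}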
I am using the OpenPDF library (a fork of iText) to replace placeholders like #{Address_name_1} with real values. My PDF file is not simple, so I use a regular expression to find this placeholder: [{].*?[A].*?[d].*?[d].*?[r].*?[e].*?[s].*?[s].*?[L].*?[i].*?[n].*?[e].*?[1].*?[}]
and do something like
content = MY_REGEXP.replace(content, "Saint-P, Nevskiy pr.");
obj.setData(content.toByteArray(CHARSET)).
The problem happens when my replacement string is too long: it is unfortunately cut off at the right end. Can I somehow make it carry over to the next line? A naive \n does not work.
PDFs store strings in a different way: there is no "next line", there are only individually positioned lines.
So you will need to add several placeholders in your template for replacements that can get long, like:
#{Address_name_1_line1}
#{Address_name_1_line2}
#{Address_name_1_line3}
And place them on different lines in your template. The unused placeholders (when the replacement is not long enough) should be replaced with empty strings.
For longer replacements you will need to use several placeholders. The number of placeholders to use and the replacement splitting should be determined by code.
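A minimal sketch of that splitting logic (the placeholder names, the line width and the word-wrap rule here are assumptions for illustration):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PlaceholderSplitter {
    // Wrap the replacement into chunks that fit the template line width and
    // map them onto #{Address_name_1_lineN} placeholders. Unused placeholders
    // get the empty string; overflow beyond lineCount is silently dropped here.
    static Map<String, String> split(String value, int maxWidth, int lineCount) {
        List<String> chunks = new ArrayList<>();
        for (String word : value.split(" ")) {
            int last = chunks.size() - 1;
            if (chunks.isEmpty() || chunks.get(last).length() + word.length() + 1 > maxWidth) {
                chunks.add(word);                                  // start a new line
            } else {
                chunks.set(last, chunks.get(last) + " " + word);   // extend the current line
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < lineCount; i++) {
            result.put("#{Address_name_1_line" + (i + 1) + "}",
                       i < chunks.size() ? chunks.get(i) : "");
        }
        return result;
    }

    public static void main(String[] args) {
        split("Saint-Petersburg, Nevskiy prospekt, building 28, office 7", 25, 3)
                .forEach((k, v) -> System.out.println(k + " -> '" + v + "'"));
    }
}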
If your PDF is too complex to place different placeholders, then you will need to placeholder everything: all your text content should be injected via placeholders, at least if you want to use this approach.
PDF files are NOT text files. Each line is an object with an x/y offset. Placing something on the next line would require a new object at new x/y coordinates. You would need an advanced PDF editing toolkit.
I'm new to Spark and the Hadoop ecosystem and already fell in love with it.
Right now, I'm trying to port an existing Java application over to Spark.
This Java application is structured the following way:
Read the file(s) one by one with a BufferedReader and a custom Parser class that does some heavy computing on the input data. The input files are 1 to 2.5 GB each.
Store data in memory (in a HashMap<String, TreeMap<DateTime, List<DataObjectInterface>>>)
Write out the in-memory datastore as JSON. These JSON files are much smaller in size.
I wrote a Scala application that processes my files with a single worker, but that is obviously not the most performance I can get out of Spark.
Now to my problem with porting this over to Spark:
The input files are line-based. I usually have one message per line. However, some messages depend on preceding lines to form an actual valid message in the Parser. For example it could happen that I get data in the following order in an input file:
{timestamp}#0x033#{data_bytes} \n
{timestamp}#0x034#{data_bytes} \n
{timestamp}#0x035#{data_bytes} \n
{timestamp}#0x0FE#{data_bytes}\n
{timestamp}#0x036#{data_bytes} \n
To form an actual message out of the "composition message" 0x036, the parser also needs the lines from messages 0x033, 0x034 and 0x035. Other messages can also arrive in between this set of needed messages. Most messages can be parsed by reading a single line, though.
Now finally my question:
How do I get Spark to split my file correctly for my purposes? The files cannot be split "randomly"; they must be split in a way that ensures all my messages can be parsed and the Parser will not wait for input that it will never get. This means that each composition message (a message that depends on preceding lines) needs to be in one split.
I guess there are several ways to achieve a correct output, but I'll throw some ideas I had into this post as well:
Define a manual split algorithm for the file input? This would check that the last few lines of a split do not contain the start of a "big" message [0x033, 0x034, 0x035].
Split the file however Spark wants, but also add a fixed number of lines (let's say 50; that will do the job for sure) from the end of the previous split to the next split. The duplicated data would be handled correctly by the Parser class and would not introduce any issues.
The second way might be easier; however, I have no clue how to implement this in Spark. Can someone point me in the right direction?
Thanks in advance!
I saw your comment on my blogpost on http://blog.ae.be/ingesting-data-spark-using-custom-hadoop-fileinputformat/ and decided to give my input here.
First of all, I'm not entirely sure what you're trying to do. Help me out here: your file contains lines with the 0x033, 0x034, 0x035 and 0x036 messages, so Spark will process them separately, while actually these lines need to be processed together?
If this is the case, you shouldn't interpret this as a "corrupt split". As you can read in the blog post, Spark splits files into records that it can process separately. By default it does this by splitting records on newlines. In your case, however, your "record" is actually spread over multiple lines. So yes, you can use a custom FileInputFormat. I'm not sure this will be the easiest solution, however.
You can try to solve this using a custom FileInputFormat that does the following: instead of handing out lines one by one like the default FileInputFormat does, you parse the file and keep track of the records encountered (0x033, 0x034, etc.). In the meantime you may filter out records like 0x0FE (not sure whether you want to use them elsewhere). The result is that Spark gets all these physical records as one logical record.
On the other hand, it might be easier to read the file line by line and map the records using a functional key (e.g. [object 33, 0x033], [object 33, 0x034], ...). This way you can combine these lines using the key you chose.
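A rough sketch of that second approach with Spark's Java API (extractKey is a hypothetical stand-in for whatever field ties the composition messages together in your data):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class KeyedGrouping {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "keyed-grouping");
        JavaRDD<String> lines = sc.textFile("input.log");

        // Key each physical record by the logical message it belongs to, then
        // group so that all parts of one logical message land in the same task.
        JavaPairRDD<String, Iterable<String>> grouped = lines
                .mapToPair(line -> new Tuple2<>(extractKey(line), line))
                .groupByKey();

        grouped.foreach(entry -> parse(entry._2()));
        sc.stop();
    }

    // Hypothetical: derive the grouping key from the line; real code would use
    // whatever field links 0x033..0x036 into one "composition message".
    static String extractKey(String line) {
        String[] fields = line.split("#");
        return fields.length > 1 ? fields[1] : line;
    }

    static void parse(Iterable<String> physicalRecords) {
        // hand the grouped physical records to the existing Parser class
    }
}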
There are certainly other options. Whichever you choose depends on your use case.
I'd like to parse an XML file in Java line by line because the structure of the file that I got is a bit different than usual. It is not nested; each tag is on its own line.
Part of XML file:
<sentence><flag>3</flag></sentence>
<word><text>Zdravo</text></word>
<phoneme><onephoneme>z</onephoneme></phoneme>
<phoneme><onephoneme>d</onephoneme></phoneme>
<phoneme><onephoneme>r</onephoneme></phoneme>
<phoneme><onephoneme>"a:</onephoneme></phoneme>
<phoneme><onephoneme>v</onephoneme></phoneme>
<phoneme><onephoneme>O</onephoneme></phoneme>
<sentence><flag>0</flag></sentence>
<word><text>moje</text></word>
...
I searched and found a lot of different ways to parse an XML file, but all of them scan the whole file, and I don't want that: my file is almost 100k lines and for now (and maybe even later) I only need the first 800 lines, so it would be much faster to just parse line by line. I don't know in advance how many lines I really need, but I'd like to count how many times I reach a certain tag and stop at a given count (for now it's 17, which is around line 800).
Tutorials that I found:
nested XML: http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
nested XML: http://www.javacodegeeks.com/2013/05/parsing-xml-using-dom-sax-and-stax-parser-in-java.html
single line XML with attributes: Read single XML line with Java
Each sentence is then separated into word and each word into phonemes, so in the end I'd have 3 ArrayLists: flags, words and phonemes.
I hope I gave you enough information.
Thank you.
Lines are not really relevant for XML; all 100K lines' worth of your XML could sit on one single line. What you need to do is count the elements/nodes you parse. Use a SAX parser: it is event-based and will notify you when an element starts and when it ends. Whenever you get an element you are interested in, increment the counter. This assumes you know the elements you are interested in; from your example, those would be:
<sentence>
<word>
<phoneme>
etc.
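A minimal sketch of such a counting handler (it assumes the fragments are wrapped in a single root element so the file parses as XML, and that 17 <sentence> elements is the stop condition):

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SentenceCounter extends DefaultHandler {
    private int sentences = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs)
            throws SAXException {
        // Count every <sentence> start tag; abort the parse after the 17th.
        if ("sentence".equals(qName) && ++sentences > 17) {
            throw new SAXException("limit reached");
        }
        // collect <flag>, <text> and <onephoneme> content here as needed
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            parser.parse("phonemes.xml", new SentenceCounter());
        } catch (SAXException e) {
            // hitting the limit is the expected exit in this sketch
        }
    }
}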
Andrew Stubbs suggested SAX and StAX, but if your file is going to be really big I would use VTD-XML. It is at least 3 times faster than SAX and much more flexible; processing 2 GB XMLs is not a problem at all.
If you want to read a file line by line, it has nothing to do with XML. Simply use a BufferedReader, as it provides a readLine method.
With a simple counter you can check how many lines you have already read and quit the loop after you hit the 800 mark.
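A minimal sketch of that loop (the file name is an assumption):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class FirstLines {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("data.xml"))) {
            String line;
            int count = 0;
            while ((line = reader.readLine()) != null) {
                if (++count > 800) {
                    break;          // stop after the 800th line
                }
                handle(line);       // your own per-line handling
            }
        }
    }

    static void handle(String line) {
        System.out.println(line);
    }
}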
As said by @Korashen, if you can guarantee that the files you will be dealing with follow a flat, line-by-line structure, then you are probably best off pretending the files are not XML at all and using a normal BufferedReader.
However, if you need to parse it as XML, then a streaming XML reader should be able to do what you want. According to Java XML Parser for huge files, SAX or StAX are the standard choices.
You can use a SAX parser. The XML is traversed and the appropriate events are triggered as elements are encountered. In addition, you could use org.xml.sax.Locator to identify the current line number and throw an exception when you encounter line 800 to stop parsing.
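A minimal sketch of that Locator-based cutoff (the file name is an assumption; the 800-line limit is taken from the question):

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class LineLimitHandler extends DefaultHandler {
    private Locator locator;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;    // the parser hands us a Locator before parsing begins
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs)
            throws SAXException {
        if (locator.getLineNumber() > 800) {
            throw new SAXException("line 800 reached");   // aborts the parse
        }
        // normal element handling goes here
    }

    public static void main(String[] args) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse("phonemes.xml", new LineLimitHandler());
        } catch (SAXException e) {
            // reaching the cutoff is the expected exit in this sketch
        }
    }
}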
I want to use XML for storing some data, but I do not want to read the full file when I want to get the last data that was inserted, and I do not want to rewrite the full file when adding new data. Is there a standard way in Java to parse an XML file not from the beginning but from the end, so that, for example, a SAX or StAX parser would first encounter the last closing root tag and then the last tag? Or, if I want to do this, should I read and write everything as if it were a regular text file?
Fundamentally, XML is a poor representation choice for this. The format is inherently "contained" like this, and I haven't seen any APIs which encourage you to fight against that.
Options:
Choose a different format entirely (e.g. use a database)
Create lots of small XML files instead - each one self-contained. When you want the whole of the data, read all the files
Just swallow the hit and read/write the whole file each time.
I found a good topic on this with example solutions for what I want.
This link: http://www.oreillynet.com/xml/blog/2007/03/parsing_xml_backwards.html
It seems that XML is not a good file format for achieving what I want: there is no standard parser that can parse XML from the end instead of from the beginning.
Probably the best solution for me will be storing all the XML data in one file that is a composition of many XML documents' contents: each line stores the contents of a separate XML document. The file itself is not well-formed XML, but each line contains well-formed XML that I can parse using a standard XML parser (StAX).
This way I will be able to read just the lines from the end of the file and append new data to the end. When I need all of the data, or only part of it, I will read all lines or just some of them. I can probably also implement pagination from the end of the file, because the file can get big.
Why XML on each line? The parsing API is easy to use, and storing the data as XML is more human-readable than separating values in a line with some delimiter.
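A minimal sketch of that scheme (assuming ASCII-compatible content and Unix newlines; note that RandomAccessFile.readLine decodes bytes as Latin-1, so real code would decode UTF-8 properly):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class XmlPerLineStore {
    // Append one well-formed XML document as a single line.
    static void append(Path file, String xml) throws IOException {
        Files.write(file, (xml.replace("\n", "") + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Read the last line without scanning the whole file: seek to the end
    // and walk backwards until the previous newline.
    static String readLast(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long pos = raf.length() - 2;    // step back over the trailing '\n'
            while (pos > 0) {
                raf.seek(pos);
                if (raf.read() == '\n') {
                    break;                  // pointer now sits at the start of the last line
                }
                pos--;
            }
            if (pos <= 0) {
                raf.seek(0);                // the file holds a single line
            }
            return raf.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("store.xmllines");
        append(file, "<entry><value>42</value></entry>");
        System.out.println(readLast(file));   // parse this string with StAX
    }
}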
Why not use SAX/StAX and simply process only your last entry? Yes, it will need to open and go through the whole file, but at least that is fairly efficient compared to loading the whole DOM tree.
Short of doing that, I don't think you can do what you're asking using XML as a source.
Another alternative, apart from the ones provided by Jon Skeet in his answer, would be to keep the same format but insert the latest entries first, and stop processing the files as soon as you've read your entry.
I have a flat file that contains fixed-length data. Is there a good approach to parsing the data and splitting it into lines at a regular occurrence, i.e. every occurrence starting with "02" should begin a new line and be stored somewhere? I have gone through Flatworm, which uses an XML configuration; that becomes a lengthy process if a lot of fixed-length data has to be described.
Note: I have gone through some samples of flat-file parsing on Stack Overflow using Flatworm and FFP, but they can't serve as a standard approach, as I am trying to write this as a utility class.
The text file looks like this:
022010015450696611KR GERGIN MR vvvv 020100145078211PETRO EMILIAN MR vvv
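For what it's worth, a minimal sketch of splitting on the record marker with a lookahead (this assumes "02" never begins inside a record's payload; with truly fixed-length records it is safer to chop by the known record length instead):

public class RecordSplitter {
    public static void main(String[] args) {
        String raw = "022010015450696611KR GERGIN MR vvvv "
                   + "020100145078211PETRO EMILIAN MR vvv";

        // Zero-width lookahead: split before every "02" without consuming it.
        for (String record : raw.split("(?=02)")) {
            if (!record.trim().isEmpty()) {
                System.out.println(record.trim());   // store each record instead of printing
            }
        }
    }
}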