I'm using XPath to read XML files. The size of a file is unknown (between 700 KB and 2 MB) and I have to read around 100 files per second. So I want a fast way to load the files and read them with XPath.
I tried to use Java NIO file channels and memory-mapped files, but they were hard to use with XPath.
So can someone suggest a way to do it?
A lot depends on what the XPath expressions are doing. There are four costs here: basic I/O to read the files, XML parsing, tree building, and XPath evaluation. (Plus a possible fifth, generating the output, but you haven't mentioned what the output might be.) From your description we have no way of knowing which factor is dominant. The first step in performance improvement is always measurement, and my first step would be to try and measure the contribution of these four factors.
If you're on an environment with multiple processors (and who isn't?) then parallel execution would make sense. You may get this "for free" if you can organize the processing using the collection() function in Saxon-EE.
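If you want a quick way to get that breakdown without Saxon, here is a rough sketch using the plain JAXP APIs; the file name and the //record expression are placeholders for whatever you actually run:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.nio.file.Files;

public class XPathCostBreakdown {
    public static void main(String[] args) throws Exception {
        File f = new File("sample.xml");                      // placeholder file

        long t0 = System.nanoTime();
        byte[] bytes = Files.readAllBytes(f.toPath());        // cost 1: raw I/O
        long t1 = System.nanoTime();

        Document doc = DocumentBuilderFactory.newInstance()   // costs 2 + 3: parsing and tree building
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        long t2 = System.nanoTime();

        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//record", doc, XPathConstants.NODESET); // cost 4: placeholder expression
        long t3 = System.nanoTime();

        System.out.printf("io=%dus parse+build=%dus xpath=%dus matches=%d%n",
                (t1 - t0) / 1_000, (t2 - t1) / 1_000, (t3 - t2) / 1_000, hits.getLength());
    }
}

Run it over a representative sample of files; whichever number dominates tells you where tuning effort is actually worth spending.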
If I were you, I would probably drop Java for this case altogether, not because you can't do it in Java, but because using a shell script (if you are on Unix) is going to be faster; at least that is what my experience dealing with lots of files tells me.
On *nix there is a command-line utility called xpath for exactly that.
Since you are doing lots of I/O operations, having a decent SSD would help far more than doing it in separate threads. You should still use multiple threads, but not more than one per CPU.
If you want performance I would simply drop XPath altogether and use a SAX parser to read the files. You can search Stack Overflow for SAX vs XPath vs DOM questions to get more details. Here is one: Is XPath much more efficient as compared to DOM and SAX?
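For illustration only (your expressions and element names will differ), a SAX handler that pulls out the text of one element type without building any tree might look roughly like this:

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.io.File;

public class TitleExtractor extends DefaultHandler {
    private final StringBuilder current = new StringBuilder();
    private boolean inTitle;                                  // "title" is just an example element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) { inTitle = true; current.setLength(0); }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTitle) current.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) { inTitle = false; System.out.println(current); }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File("input.xml"), new TitleExtractor()); // placeholder file name
    }
}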
I have many XML files. Every XML file includes many lines and tags. I must parse them and write a .txt file with the XML file's name. This needs to be done quickly; the faster the better.
Example of an XML file:
<text>
<paragraph>
<line>
<character>g</character>
<character>o</character>
.....
</line>
<line>
<character>k</character>
.....
</line>
</paragraph>
</text>
<text>
<paragraph>
<line>
<character>c</character>
.....
</line>
</paragraph>
</text>
Example of the corresponding text file:
go..
k..
c..
How can I parse many XML files and write many text files using multiple threads in Java, as fast as possible?
Where should I start to solve the problem? Does the parsing method I use affect speed? If so, which method is faster than the others?
I have no experience with multithreading. How should I build a multi-threaded structure to be effective?
Any help is appreciated. Thanks in advance.
EDIT
I need some help. I used SAX for parsing. I did some research on thread pools, multithreading, and Java 8 features. I tried some code blocks but there was no change in the total time. How can I add a multi-threaded structure or Java 8 features (lambda expressions, parallelism, etc.) to my code?
Points to note in this situation.
In many cases, attempting to write to multiple files at once using multi-threading is utterly pointless. All this generally does is exercise the disk heads more than necessary.
Writing to disk while parsing is also likely a bottleneck. You would be better off parsing the XML into a buffer and then writing the whole buffer to disk in one hit.
The speed of your parser is unlikely to affect the overall time for the process significantly. Your system will almost certainly spend much more time reading and writing than parsing.
A quick check with some real test data would be invaluable. Try to get a good estimate of the amount of time you will not be able to affect.
Determine an approximate total read time by reading a few thousand sample files into memory because that time will still need to be taken however parallel you make the process.
Estimate an approximate total write time in a similar way.
Add the two together and compare that with your total execution time for reading, parsing and writing those same files. This should give you a good idea how much time you might save through parallelism.
Parallelism is not always an answer to slow-running processes. You can often significantly improve throughput just by using appropriate hardware.
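To illustrate the earlier point about parsing into a buffer and writing in one hit, here is a minimal sketch assuming the <character>/<line> structure shown in the question and that each input file is well-formed XML:

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class XmlToText extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private boolean inCharacter;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("character".equals(qName)) inCharacter = true;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCharacter) buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("character".equals(qName)) inCharacter = false;
        if ("line".equals(qName)) buffer.append('\n');         // one output line per <line>
    }

    public static void main(String[] args) throws Exception {
        XmlToText handler = new XmlToText();
        SAXParserFactory.newInstance().newSAXParser().parse(new File("input.xml"), handler);
        // Write the whole buffer in one hit instead of writing while parsing.
        Files.write(Paths.get("input.txt"),
                handler.buffer.toString().getBytes(StandardCharsets.UTF_8));
    }
}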
First, are you sure you need this to be faster or multithreaded? Premature optimization is the root of all evil. You can easily make your program much more complicated for unimportant gain if you aren't careful, and multithreading can for sure make things much more complicated.
However, toward the actual question:
Start out by solving this in a single-threaded way. Then think about how you want to split the problem across many threads (e.g. have a pool of XML files and a pool of threads, where each thread grabs an XML file whenever it's free, until the pool is empty). Report back wherever you get stuck in this process.
The method that you use to parse will affect speed, as different parsing libraries have different behavior characteristics. But again, are you sure you need the absolute fastest?
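For example, a minimal sketch of that pool idea with an ExecutorService; parseAndWrite stands in for whatever single-threaded solution you build first:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelRunner {
    public static void main(String[] args) throws Exception {
        File[] files = new File("/data/xml").listFiles();      // placeholder directory
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());   // no more threads than CPUs
        for (File f : files) {
            pool.submit(() -> parseAndWrite(f));               // each task handles one file
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void parseAndWrite(File f) {
        // placeholder: your existing single-threaded parse + write goes here
    }
}

Note that this only helps if parsing (CPU) rather than disk I/O is the bottleneck, which is why measuring first matters.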
If you write your code in XSLT (2.0 or later), using the collection() function to parse your source files, and the xsl:result-document instruction to write your result files, then you will be able to assess the effect of multi-threading simply by running the code under Saxon-EE, which applies multi-threading to these constructs automatically. Usually in my experience this gives a speed-up of around a factor of 3 for such programs.
This is one of the benefits of using functional declarative languages: because there is no mutable state, multi-threading is painless.
LATER
I'll add an answer to your supplementary question about using DOM or SAX. From what we can see, the output file is a concatenation of the <character> elements in the input, so if you wrote it in XSLT 3.0 it would be something like this:
<xsl:mode on-no-match="shallow-skip"/>
<xsl:template match="character">
<xsl:value-of select="."/>
</xsl:template>
If that's the case then there's certainly no need to build a tree representation of each input document, and coding it in SAX would be reasonably easy. Or if you follow my suggestion of using Saxon-EE, you could make the transformation streamable to avoid the tree building. Whether this is useful, however, really depends on how big the source documents are. You haven't given us any numbers to work with, so giving concrete advice on performance is almost impossible.
If you are going to use a tree-based representation, then DOM is the worst one you could choose. It's one of those cases where there are half-a-dozen better alternatives but because they are only 20% better, most of the world still uses DOM, perceiving it to be more "standard". I would choose XOM or JDOM2.
If you're prepared to spend an unlimited amount of time coding this in order to get the last ounce of execution speed, then SAX is the way to go. For most projects, however, programmers are expensive and computers are cheap, so this is the wrong trade-off.
I'm creating a task to parse two large XML files and find a 1-1 relation between elements. I am completely unable to keep the whole file in memory, and I have to "jump" around the file to check up to n^2 combinations.
I am wondering what approach I might take to navigate between nodes without killing my machine. I did some reading on StAX and I liked the idea, but the cursor moves one way only and I would have to go back to check different possibilities.
Could you suggest any other possibility? I need one that allows commercial use.
I'd probably consider reading the first file into some sort of structured cache and then reading the second XML document, referencing against this cache (the cache could actually be a DB; it doesn't need to be in memory).
Otherwise there's no real solution (that I know of) unless you can read the whole file into memory. This ought to perform better, too, compared with going back and forth across the DOM of an XML document.
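A rough sketch of that "structured cache" idea using StAX, assuming purely for illustration that both files carry the join key as an id attribute on <item> elements:

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;

public class StreamingJoin {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();

        // Pass 1: stream the first file and cache only the join key plus a small payload,
        // never the whole document.
        Map<String, String> cache = new HashMap<>();
        XMLStreamReader r1 = factory.createXMLStreamReader(new FileInputStream("first.xml"));
        while (r1.hasNext()) {
            if (r1.next() == XMLStreamConstants.START_ELEMENT && "item".equals(r1.getLocalName())) {
                cache.put(r1.getAttributeValue(null, "id"), r1.getAttributeValue(null, "value"));
            }
        }
        r1.close();

        // Pass 2: stream the second file and look each element up in the cache.
        XMLStreamReader r2 = factory.createXMLStreamReader(new FileInputStream("second.xml"));
        while (r2.hasNext()) {
            if (r2.next() == XMLStreamConstants.START_ELEMENT && "item".equals(r2.getLocalName())) {
                String id = r2.getAttributeValue(null, "id");
                if (cache.containsKey(id)) {
                    System.out.println("match: " + id + " -> " + cache.get(id));
                }
            }
        }
        r2.close();
    }
}

If the keys don't fit in memory, swap the HashMap for an embedded database and keep the rest of the structure the same.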
One solution would be an XML database. These usually have good join optimizers so as well as saving memory they may be able to avoid the O(n^2) elapsed time.
Another solution would be XSLT, using xsl:key to do "manual" optimization of the join logic.
If you explain the logic in more detail there may turn out to be other solutions using XSLT 3.0 streaming.
I'm processing hundreds of thousands of files, potentially millions later on down the road. A bad file will contain a text version of an Excel spreadsheet or other text that isn't binary but also isn't sentences. Such files cause CoreNLP to blow up (technically, these files take a long time to process, such as 15 seconds per kilobyte of text). I'd love to detect these files and discard them in sub-second time.
What I am considering is taking a few thousand files at random, examining the first, say, 200 characters, and looking at the distribution of characters to determine what is legit and what is an outlier: for example, if there are no punctuation marks, or too many of them. Does this seem like a good approach? Is there a better one that has been proven? I think this will work well enough, possibly throwing out some good files, but rarely.
Another idea is to simply run with the annotators tokenize and ssplit and do a word and sentence count. That seems to do a good job as well and returns quickly, though I can think of cases where it might fail too.
This kind of processing pipeline is always in a state of continuous improvement. To kick off that process, the first thing I would build is an instrument around the timing behavior of CoreNLP. If CoreNLP is taking too long, kick out the offending file into a separate queue. If this isn't good enough, you can write recognizers for the most common things in the takes-too-long queue and divert them before they hit CoreNLP. The main advantage of this approach is that it works with inputs that you don't expect in advance.
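One possible shape for that instrument is a Future with a per-file time budget; processWithCoreNLP and quarantine are placeholders for your actual pipeline call and your takes-too-long queue:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedNlpRunner {
    private static final ExecutorService pool = Executors.newFixedThreadPool(4);

    static void processWithBudget(File file, long budgetSeconds) throws Exception {
        Future<?> job = pool.submit(() -> processWithCoreNLP(file));  // placeholder call
        try {
            job.get(budgetSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            job.cancel(true);      // give up on this file; the worker may keep running if the
                                   // library ignores interrupts, so budget threads accordingly
            quarantine(file);      // divert it to the takes-too-long queue for later analysis
        }
    }

    static void processWithCoreNLP(File file) { /* placeholder: the expensive annotation */ }

    static void quarantine(File file)        { /* placeholder: log and move the offending file */ }
}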
There are two main approaches to this kind of problem.
The first is to take the approach you are considering in which you examine the contents of the file and decide whether it is acceptable text or not based on a statistical analysis of the data in the file.
The second approach is to use some kind of meta tag, such as a file extension, to at least eliminate those files that are almost certain to be a problem (.pdf, .jpg, etc.).
I would suggest a mixture of the two approaches so as to cut down on the amount of processing.
You might consider a pipeline approach in which you have a sequence of tests. The first test filters out files based on meta data such as the file extension, the second step then does a preliminary statistical check on the first few bytes of the file to filter out obvious problem files, a third step does a more involved statistical analysis of the text, and the fourth handles the CoreNLP rejection step.
You do not say where the files originate nor if there are any language considerations (English versus French versus Simplified Chinese text). For instance are the acceptable text files using UTF-8, UTF-16, or some other encoding for the text?
Also is it possible for the CoreNLP application to be more graceful about detecting and rejecting incompatible text files?
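To illustrate the preliminary statistical check, here is a crude sketch that looks at the first 200 bytes of a file; the thresholds are invented and would need tuning against a real sample of your data:

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class QuickTextCheck {
    // Returns true if the file looks enough like prose to be worth sending to the NLP pipeline.
    static boolean looksLikeProse(Path file) throws IOException {
        byte[] head = new byte[200];
        int n;
        try (InputStream in = Files.newInputStream(file)) {
            n = in.read(head);
        }
        if (n <= 0) return false;
        String sample = new String(head, 0, n, StandardCharsets.UTF_8);

        long letters = sample.chars().filter(Character::isLetter).count();
        long digits  = sample.chars().filter(Character::isDigit).count();
        long punct   = sample.chars().filter(c -> ".,;:!?'\"".indexOf(c) >= 0).count();

        // Invented thresholds: mostly letters, not dominated by numbers,
        // and at least some sentence punctuation.
        return (double) letters / n > 0.6
                && (double) digits / n < 0.2
                && punct > 0;
    }
}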
Could you not just train a Naive Bayes Classifier to recognize the bad files? For features use things like (binned) percentage of punctuation, percentage of numerical characters, and average sentence length.
Peter,
You are clearly dealing with files for ediscovery. Anything and everything is possible, and as you know, anything kicked out must be logged as an exception. I've faced this, and have heard the same from other analytics processors.
Some of the solutions above, pre-process and in-line, can help. In some ediscovery solutions it may be feasible to dump the text into a SQL field and truncate it one way or another, and still get what you need. In other apps, anything to do with semantic clustering or predictive coding, it may be better to use pre-filters based on metadata (e.g. file type), document-type classification libraries, and entity extraction based on prior examples, current sampling, or your best guess as to the nature of the "bad file" contents.
Good luck.
I need to parse (and transform and write) a large binary file (larger than memory) in Java. I also need to do so as efficiently as possible in a single thread. And, finally, the format being read is very structured, so it would be good to have some kind of parser library (so that the code is close to the complex specification).
The amount of lookahead needed for parsing should be small, if that matters.
So my questions are:
How important is NIO vs IO for a single-threaded, high-volume application?
Are there any good parser libraries for binary data?
How well do parsers support streaming transformations (I want to be able to stream the data being parsed to some output during parsing - I don't want to have to construct an entire parse tree in memory before writing things out)?
On the nio front my suspicion is that nio isn't going to help much, as I am likely disk limited (and since it's a single thread, there's no loss in simply blocking). Also, I suspect io-based parsers are more common.
Let me try to explain if and how Preon addresses all of the concerns you mention:
I need to parse (and transform and write) a large binary file (larger
than memory) in Java.
That's exactly why Preon was created. You want to be able to process the entire file without loading it into memory entirely. Still, the programming model gives you a pointer to a data structure that appears to be entirely in memory. However, Preon will try to load data as lazily as it can.
To explain what that means, imagine that somewhere in your data structure, you have a collection of things that are encoded in a binary representation with a constant size; say that every element will be encoded in 20 bytes. Then Preon will first of all not load that collection in memory at all, and if you're grabbing data beyond that collection, it will never touch that region of your encoded representation at all. However, if you would pick the 300th element of that collection, it would (instead of decoding all elements up to the 300th element), calculate the offset for that element, and jump there immediately.
From the outside, it is as though you have a reference to a list that is fully populated. From the inside, it only goes out to grab an element of the list if you ask for it. (And forget about it immediately afterward, unless you instruct Preon to do things differently.)
I also need to do so as efficiently as possible in a single thread.
I'm not sure what you mean by efficiently. It could mean efficiently in terms of memory consumption, or efficiently in terms of disk IO, or perhaps you mean it should be really fast. I think it's fair to say that Preon aims to strike a balance between an easy programming model, memory use and a number of other concerns. If you really need to traverse all data in a sequential way, then perhaps there are ways that are more efficient in terms of computational resources, but I think that would come at the cost of "ease of programming".
And, finally, the format being read is very structured, so it would be
good to have some kind of parser library (so that the code is close to
the complex specification).
The way I implemented support for Java byte code is to just read the byte code specification and then map all of the structures it mentions directly to Java classes with annotations. I think Preon comes pretty close to what you're looking for.
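As a rough, unverified sketch of that style (the package, the annotation names, and the size expression below are from memory and may differ between Preon versions, so treat them as assumptions to check against the library you pick up):

import org.codehaus.preon.Codec;
import org.codehaus.preon.Codecs;
import org.codehaus.preon.annotation.BoundList;
import org.codehaus.preon.annotation.BoundNumber;
import java.io.File;
import java.util.List;

public class Header {
    @BoundNumber(size = "32") int magic;            // a 32-bit field straight from the spec
    @BoundNumber(size = "16") int recordCount;      // number of records that follow

    @BoundList(size = "recordCount", type = Record.class)
    List<Record> records;                           // sized by the field decoded just before it

    public static class Record {
        @BoundNumber(size = "32") long offset;
        @BoundNumber(size = "32") long length;
    }

    public static void main(String[] args) throws Exception {
        Codec<Header> codec = Codecs.create(Header.class);
        Header header = Codecs.decode(codec, new File("data.bin"));  // placeholder file
        System.out.println(header.records.size() + " records");
    }
}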
You might also want to check out preon-emitter, since it allows you to generate annotated hexdumps (such as in this example of the hexdump of a Java class file) of your data, a capability that I haven't seen in any other library. (Hint: make sure you hover with your mouse over the hex numbers.)
The same goes for the documentation it generates. The aim has always been to make sure it creates documentation that could be posted to Wikipedia, just like that. It may not be perfect yet, but I'm not unhappy with what it's currently capable of doing. (For an example: this is the documentation generated for Java's class file specification.)
The amount of lookahead needed for parsing should be small, if that matters.
Okay, that's good. In fact, that's even vital for Preon. Preon doesn't support lookahead. It does support looking back, though. (That is, sometimes part of the encoding mechanism is driven by data that was read before. Preon allows you to declare dependencies that point back to data read before.)
Are there any good parser libraries for binary data?
Preon! ;-)
How well do parsers support streaming transformations (I want to be
able to stream the data being parsed to some output during parsing - I
don't want to have to construct an entire parse tree in memory before
writing things out)?
As I outlined above, Preon does not construct the entire data structure in memory before you can start processing it. So, in that sense, you're good. However, there is nothing in Preon supporting transformations as first-class citizens, and its support for encoding is limited.
On the nio front my suspicion is that nio isn't going to help much, as
I am likely disk limited (and since it's a single thread, there's no
loss in simply blocking). Also, I suspect io-based parsers are more
common.
Preon uses NIO, but only its support for memory-mapped files.
On NIO vs IO you are right; going with IO should be the right choice: less complexity, stream-oriented, etc.
For a binary parsing library, check out Preon.
Using a memory-mapped file you can read through it without worrying about memory, and it's fast.
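For example, a minimal sketch with NIO, assuming the file fits in a single mapping (a MappedByteBuffer is limited to 2 GB per map call):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRead {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r");  // placeholder file
             FileChannel channel = raf.getChannel()) {
            // Map the whole file read-only; the OS pages data in as you touch it.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get();          // read sequentially, or call buf.position(n) to jump
            }
            System.out.println("checksum: " + sum);
        }
    }
}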
I think you are correct re NIO vs IO, unless you have little-endian data, as NIO can read little-endian natively.
I am not aware of any fast binary parsers, generally you want to call the NIO or IO directly.
Memory mapped files can help with writing from a single thread as you don't have to flush it as you write. (But it can be more cumbersome to use)
You can stream the data however you like; I don't foresee any problems.
I am working with an incremental file of about 1 GB and I want to search it for a particular pattern.
Currently I am using Java regular expressions; do you have any idea how I can do this faster?
Sounds like a job for Apache Lucene.
You probably will have to rethink your searching strategy, but this library is made for doing things like this and adding indexes incrementally.
It works by building reverse indexes of your data (documents in Lucene parlance), and then quickly checking in the reverse indexes for which documents have parts of your pattern.
You can store metadata with the document indexes, so in the majority of use cases you might not have to consult the big file at all.
Basically what you need is a state machine that can process a stream, with the stream bound to the file. Each time the file grows, you read what has been appended to it (like the Linux tail command, which appends to standard output the lines added to the file).
If you need to stop/restart your analyser, you can either store the start position somewhere (which may depend on the window you need for your pattern matching) and restart from there, or you can restart from scratch.
That is for the "increasing file" part of the problem.
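A rough sketch of that tail-like approach, remembering the last offset between scans (matchLine is a placeholder for your regex check):

import java.io.RandomAccessFile;

public class IncrementalScanner {
    private long lastPosition = 0;                 // persist this if you need to stop and restart

    // Call this periodically; it only reads the bytes appended since the previous call.
    void scan(String path) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(lastPosition);
            String line;
            while ((line = file.readLine()) != null) {
                matchLine(line);                   // placeholder: apply your pattern here
            }
            // Note: if the writer is mid-line, the final line returned may be partial;
            // a more careful version would only advance past complete lines.
            lastPosition = file.getFilePointer();
        }
    }

    void matchLine(String line) { /* placeholder */ }
}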
For the best way to process the content, it depends on what you really need and on what kind of data and pattern you want to apply. Regular expressions may be the best solution: flexible, fast and relatively convenient.
From my understanding, Lucene would be good if you wanted to do document search matching for some natural-language content. It would be a poor choice for matching all dates or all lines with a specific property. Also, because Lucene first makes an index of the document, it would help only for really heavy processing, as building the index in the first place takes time.
You can try using the Pattern and Matcher classes to search with compiled expressions.
See http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html and http://download.oracle.com/javase/tutorial/essential/regex/
or use your favorite search engine to search on the terms:
java regular expression optimization
or
java regular expression performance
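Whatever you find, the key point is to compile the pattern once and reuse the Matcher across lines, roughly like this (the expression and file name are just examples):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompiledSearch {
    // Compile once, reuse for every line.
    private static final Pattern DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    public static void main(String[] args) throws Exception {
        try (BufferedReader reader = new BufferedReader(new FileReader("big.log"))) {  // placeholder
            Matcher m = DATE.matcher("");
            String line;
            while ((line = reader.readLine()) != null) {
                if (m.reset(line).find()) {
                    System.out.println(line);
                }
            }
        }
    }
}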
I think it depends on:
the structure of your data (line oriented?)
the complexity of the match
the speed at which the data file is growing
If your data is line oriented (or block oriented) and a match must occur within such a unit you can match until the last complete block, and store the file position of that endpoint. The next scan should start at that endpoint (possibly using RandomAccessFile.seek()).
This particularly helps if the data isn't growing all that fast.
If your match is highly complex but has a distinctive fixed text, and the pattern doesn't occur all that often, you may be faster with a String.contains() check, applying the regex only when that check succeeds. As compiled patterns tend to be highly optimized, this is definitely not guaranteed to be faster.
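A tiny sketch of that prefilter, assuming purely for illustration that the pattern always contains the literal ERROR:

import java.util.regex.Pattern;

public class PrefilteredMatch {
    private static final Pattern ERROR_LINE =
            Pattern.compile("ERROR\\s+\\[(\\w+)\\].*code=(\\d+)");  // example pattern only

    static boolean matches(String line) {
        // Cheap literal check first; only run the regex when it can possibly match.
        return line.contains("ERROR") && ERROR_LINE.matcher(line).find();
    }
}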
You may even think of replacing the regex by hand-writing a parser, possibly based on StringTokenizer or some such. That's definitely a lot of work to get it right, but it would allow you to pass some extra intelligence about the data into the parser, allowing it to fail fast. This would only be a good option if you really know a lot about the data that you can't encode in a pattern.