ANTLR parse tree node coordinates? - java

I use ANTLR 4.9.2
I have a requirement to perform multiple passes over the same parse tree at different stages of my analysis. Some of the files my application handles are very large, so I'd like to avoid keeping the parse tree in memory and instead regenerate a fresh parse tree instance each time. So far so good.
My challenge is that I need a way to (a) compare nodes and (b) quickly access nodes that works with different instances of equivalent parse trees.
For example, the following pseudo-code generates two separate instances of a parse tree that represent the same file (therefore the parse trees and their nodes are equivalent):
ParseTree parseTree1 = parse(myFile, myGrammar)
ParseTree parseTree2 = parse(myFile, myGrammar)
Since myFile and myGrammar are the same, parseTree1 and parseTree2 are equivalent; however, they are different instances and don't satisfy Objects.equals()
In ANTLR, how do I represent the coordinates C of a node in such a way that:
C(node1) = C(node2) if the nodes are equivalent
I can access C(parseTree1) or C(parseTree2) without having to visit the parse trees - so I can quickly position myself on the same node in any instance of the parse tree

You can use ANTLR4's XPath implementation to directly access nodes at a given parse tree path. Here's how I get all query expressions from MySQL code after parsing:
const expressions = XPath.findAll(tree, "/query/simpleStatement//queryExpression", this.parser);
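Since the question is tagged Java, the equivalent lookup in the Java runtime is org.antlr.v4.runtime.tree.xpath.XPath.findAll(tree, path, parser). For requirement (b), jumping straight to an equivalent node in a second tree instance, one stable coordinate is the path of child indices from the root. Below is a minimal self-contained sketch using a toy tree; the Node class and helper names are my own, but the same walk should transfer to ANTLR's ParseTree.getChild(int)/getChildCount():

```java
import java.util.ArrayList;
import java.util.List;

public class TreeCoordinates {
    // Toy tree node; ANTLR's ParseTree exposes the same getChild/getChildCount shape.
    static class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node parent;
        Node(String label) { this.label = label; }
        Node add(Node child) { child.parent = this; children.add(child); return child; }
    }

    // Coordinate = child-index path from the root: equal for equivalent
    // nodes in any two trees built from the same input.
    static List<Integer> coordinateOf(Node node) {
        List<Integer> path = new ArrayList<>();
        while (node.parent != null) {
            path.add(0, node.parent.children.indexOf(node));
            node = node.parent;
        }
        return path;
    }

    // Follow the same path in another tree instance; no full traversal needed.
    static Node resolve(Node root, List<Integer> path) {
        Node current = root;
        for (int index : path) current = current.children.get(index);
        return current;
    }

    static Node build() {
        Node root = new Node("root");
        root.add(new Node("a"));
        Node b = root.add(new Node("b"));
        b.add(new Node("c"));
        return root;
    }

    public static void main(String[] args) {
        Node root1 = build();
        Node root2 = build(); // a second, equivalent instance
        Node target = root1.children.get(1).children.get(0);
        List<Integer> coord = coordinateOf(target); // [1, 0]
        System.out.println(resolve(root2, coord).label); // prints "c"
    }
}
```

Because the path is derived purely from structure, equivalent nodes in equivalent trees get equal coordinates, satisfying C(node1) = C(node2), and resolving a coordinate costs only O(depth) rather than a full visit.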

Related

Renaming JCR nodes with custom names (in CQ/AEM)

Authors post comments once a month. The comments are stored in the JCR under "content", beneath a "remarks" node; each comment is stored in a child node named "remarks_xxxx", where xxxx is a random string of letters and digits.
I need to rename all the current nodes to "remarks_mmddyy" and also assign future names in a similar fashion.
Thanks
The best approach is to write the date of the remark into a property (of type Date) instead of encoding it in the node name. This eliminates the need to rename nodes and also improves your chances of leveraging JCR queries to your advantage.
To retrieve remarks for a certain date and time, use the JCR query API, which lets you search on properties (including Date properties, of course). Since AEM 6 and Jackrabbit Oak, you can define a custom index to make sure that a given property query is blazingly fast. Note that "order by" is supported as well, in case ordering is an issue.
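For illustration, assuming each comment node carries a Date property (I'll call it remarkDate; the property name and path are hypothetical), a JCR-SQL2 query along these lines retrieves remarks from a given date onward, newest first:

```sql
SELECT * FROM [nt:unstructured] AS remark
WHERE ISDESCENDANTNODE(remark, '/content/remarks')
  AND remark.[remarkDate] >= CAST('2024-01-01T00:00:00.000Z' AS DATE)
ORDER BY remark.[remarkDate] DESC
```

A custom Oak property index on remarkDate keeps this query fast even with many comment nodes.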
If you absolutely must stick with the detrimental data model of renaming nodes and encoding dates into node names, check out the following article on how to do it: How can you change the name of a JCR node?

How do I check if an XML node is a leaf node in Java?

I want to list all the leaf nodes present in an XML document. The XML is not fixed, so the code should work for any given XML file.
Find an XML parser. Those libraries will parse the XML string for you and build an object-oriented tree of the XML nodes (called a DOM, which stands for Document Object Model). There should definitely be a method like getChildCount(), getChildren() or isLeaf().
Take a look here: Best XML parser for Java
If you are using the DOM:
if (!myNode.hasChildNodes()) {
    // found a leaf node
}
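Building on the hasChildNodes() check, here is a self-contained sketch using the JDK's built-in DOM parser (no external library needed) that lists all leaf elements of an arbitrary document. The class and method names are my own, and I treat an element as a leaf when it has no child elements (text-only children still count as a leaf):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class LeafNodes {
    // Collect every element that has no child *elements* (document order).
    static void collectLeaves(Node node, List<String> leaves) {
        NodeList children = node.getChildNodes();
        boolean hasElementChild = false;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                collectLeaves(child, leaves);
            }
        }
        if (!hasElementChild && node.getNodeType() == Node.ELEMENT_NODE) {
            leaves.add(node.getNodeName());
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><a><b>1</b><c/></a><d>2</d></root>";
        Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        List<String> leaves = new ArrayList<>();
        collectLeaves(root, leaves);
        System.out.println(leaves); // prints [b, c, d]
    }
}
```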

Built-in libraries to perform efficient searching on 100GB files

Is there any built-in library in Java for searching strings in large files of about 100GB? I am currently using binary search, but it is not that efficient.
As far as I know, Java does not contain any file search engine, with or without an index. There is a very good reason for that, too: search engine implementations are intrinsically tied to both the input data set and the search pattern format. A minor variation in either could result in massive changes to the search engine.
For us to be able to provide a more concrete answer you need to:
Describe exactly the data set: the number, path structure and average size of files, the format of each entry and the format of each contained token.
Describe exactly your search patterns: are those fixed strings, glob patterns or, say, regular expressions? Do you expect the pattern to match a full line or a specific token in each line?
Describe exactly your desired search results: do you want exact or approximate matches? Do you want to get a position in a file, or extract specific tokens?
Describe exactly your requirements: are you able to build an index beforehand? Is the data set expected to be modified in real time?
Explain why you can't use third-party libraries such as Lucene that are designed exactly for this kind of work.
Explain why your current binary search, which should have a complexity of O(log n), is not efficient enough. The only thing that might be faster, with constant complexity, would involve the use of a hash table.
It might be best if you described your problem in broader terms. For example, one might assume from your sample data set that what you have is a set of words and associated offset or document-identifier lists. A simple way to approach searching such a set would be to store a word/file-position index in a hash table, so that each associated list can be accessed in constant time.
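To illustrate that last idea, below is a toy in-memory sketch of such a word/file-position index backed by a HashMap. All names are made up, and for an actual 100GB data set the index itself would need to be built once and persisted (e.g. in an embedded key-value store) rather than rebuilt per search:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordIndex {
    // word -> byte offsets of the lines that contain it.
    static Map<String, List<Long>> build(List<String> lines) {
        Map<String, List<Long>> index = new HashMap<>();
        long offset = 0;
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    index.computeIfAbsent(word, k -> new ArrayList<>()).add(offset);
                }
            }
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("alpha beta", "beta gamma", "alpha");
        Map<String, List<Long>> index = build(lines);
        // O(1) lookup instead of scanning or binary-searching the whole file:
        System.out.println(index.get("beta")); // prints [0, 11]
    }
}
```

Given an offset from the index, a RandomAccessFile.seek() can then jump straight to the matching line without reading the 100GB sequentially.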
If you don't want to use tools built for search, then store the data in a database and use SQL.

What is xml normalization? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
What does Java Node normalize method do?
What is XML normalization? I found the following in the Javadoc but I can't understand it. Can anyone help?
public void normalize()
Puts all Text nodes in the full depth of the sub-tree underneath this Node, including attribute nodes, into a "normal" form where only structure (e.g., elements, comments, processing instructions, CDATA sections, and entity references) separates Text nodes, i.e., there are neither adjacent Text nodes nor empty Text nodes. This can be used to ensure that the DOM view of a document is the same as if it were saved and re-loaded, and is useful when operations (such as XPointer [XPointer] lookups) that depend on a particular document tree structure are to be used. If the parameter "normalize-characters" of the DOMConfiguration object attached to the Node.ownerDocument is true, this method will also fully normalize the characters of the Text nodes.
Note: In cases where the document contains CDATASections, the normalize operation alone may not be sufficient, since XPointers do not differentiate between Text nodes and CDATASection nodes.
Since:
DOM Level 3
Parsers will often return "surprising" text nodes, where text is split up into multiple nodes, or, less commonly, empty text nodes. This is a side-effect of them being streamlined for maximum performance. It may happen when there's ignorable whitespace, buffer boundaries, or anywhere else that it was just convenient for the parser.
normalize() will get rid of all these surprises, merging adjacent text nodes and removing empty ones.
The API doc explains it in great detail; I'm not sure what more there is to explain. Basically the method converts the DOM subtree rooted at this node into a "standard form" by combining adjacent text nodes, eliminating empty text nodes, and optionally also normalizing characters that are Unicode composites.
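A small demonstration with the JDK's own DOM implementation shows the merging described above; the split text nodes simulate what a streaming parser may hand you:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NormalizeDemo {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element p = doc.createElement("p");
        doc.appendChild(p);
        // Adjacent text nodes, plus an empty one:
        p.appendChild(doc.createTextNode("Hello, "));
        p.appendChild(doc.createTextNode("world"));
        p.appendChild(doc.createTextNode(""));
        System.out.println(p.getChildNodes().getLength()); // prints 3
        doc.normalize();
        // Adjacent text nodes merged, empty one removed:
        System.out.println(p.getChildNodes().getLength()); // prints 1
        System.out.println(p.getTextContent());            // prints Hello, world
    }
}
```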

Is it possible to develop some criteria-based search on strings in C# or Java?

I have a List<string> in C#. It contains the paragraphs read from an MS Word file. For example,
list 0-> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Finally the image displayed in the header will be added to finalize the report.
list 1->The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Various other elements of WordprocessingML will also be handled. By moving the formatting information into styles a higher degree of re-use is made possible. The document will be marked using custom XML tags and the insertion of other advanced elements such as a table of contents is discussed. But before all the advanced features can be added, the base of the document needs to be built.
Something like that.
Now my search string is:
The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Before going over all the elements which make up the sample documents a basic document structure needs to be laid out. When you take a WordprocessingML document and use the Windows Explorer shell to rename the docx extension to zip you will find many different elements, especially in larger documents.
I want to check my search string against those list elements.
My criterion is: "if a list element contains an 85% match, or an exact match, of the search string, then I want to retrieve that list element."
In our case,
list 0 -> satisfies my search string best.
list 1 -> it also matches some text, but I think it falls short of my criterion...
How do I do this kind of criteria-based search on strings?
I'm also still somewhat confused about my own problem, so
your ideas and thoughts are welcome...
The keywords are DISTANCE or "string distance", and also "paragraph similarity".
You seek to implement a function which would express as a scalar, say a percentage as suggested in the question, how similar one string is to another.
Plain string distance functions such as Hamming or Levenshtein may not be appropriate, for they work at the character level rather than at the word level, but generally these algorithms convey the idea of what is needed.
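For reference, since Levenshtein distance keeps coming up, here is a standard two-row dynamic-programming implementation of the character-level edit distance (a sketch, not tuned for long paragraphs):

```java
public class Levenshtein {
    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn string a into string b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

The same dynamic-programming scheme can be lifted from characters to word tokens, which is closer to what paragraph comparison needs.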
Working at the word level, you'll probably also want to take into account some common NLP features, for example ignoring (or giving less weight to) very common words (such as 'the', 'in', 'of', etc.) and maybe allowing for some form of stemming. The order of the words, or at least their proximity, may also be of import.
One key factor to remember is that even with relatively short strings, many distance functions can be quite expensive, computationally speaking. Before selecting one particular algorithm you'll need to get an idea of the general parameters of the problem:
how many strings would have to be compared? (on average, maximum)
how many words/tokens do the strings contain? (on average, max)
is it possible to introduce a simple (quick) filter to reduce the number of strings to be compared?
how fancy do we need to get with linguistic features?
is it possible to pre-process the strings?
are all the records in a single language?
Comparing Methods for Single Paragraph Similarity Analysis, a scholarly paper, provides a survey of relevant techniques and considerations.
In a nutshell, the amount of design-time and run-time effort one can apply to this relatively open problem varies greatly, and is typically a compromise between the level of precision desired versus the run-time resources and the overall complexity of the solution which may be acceptable.
In its simplest form, when the order of the words matters little, computing a sum of factors based on the TF-IDF values of the matching words may be a very acceptable solution.
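A rough sketch of that simplest form (naive tokenization, smoothed IDF, all names invented for illustration): score a candidate paragraph by the IDF-weighted fraction of query words it contains.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TfIdfOverlap {
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    // Smoothed inverse document frequency over the candidate paragraphs:
    // rare words score high, words present everywhere score zero.
    static double idf(String word, List<Set<String>> docs) {
        long containing = docs.stream().filter(d -> d.contains(word)).count();
        return Math.log((double) (docs.size() + 1) / (containing + 1));
    }

    // Similarity of query to doc: IDF-weighted fraction of matched query words.
    static double score(Set<String> query, Set<String> doc, List<Set<String>> docs) {
        double matched = 0, total = 0;
        for (String w : query) {
            double weight = idf(w, docs);
            total += weight;
            if (doc.contains(w)) matched += weight;
        }
        return total == 0 ? 0 : matched / total;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
                tokens("the cat sat on the mat"),
                tokens("the dog chased the cat"));
        Set<String> query = tokens("a cat on a mat");
        System.out.printf("%.2f vs %.2f%n",
                score(query, docs.get(0), docs),
                score(query, docs.get(1), docs));
    }
}
```

With this weighting, matching "cat" (present in every candidate) contributes nothing, while matching rarer words dominates the score; the 85% threshold from the question then becomes a simple cutoff on the returned ratio.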
Fancier solutions may introduce a pipeline of processes borrowed from NLP, for example part-of-speech tagging (say, to avoid false positives such as "saw" as a noun (a tool to cut wood) versus "saw" as the past tense of the verb "to see", or, more likely, to filter out some of the words outright based on their grammatical function), stemming, and possibly semantic substitutions, concept extraction or latent semantic analysis.
You may want to look into Lucene for Java or Lucene.NET for C#. I don't think it'll handle the percentage requirement you want out of the box, but it's a great tool for text matching.
You could maybe run a separate query for each word, and then work out the percentage of matching words yourself.
Here's an idea (not a solution by any means, but something to get started with):
private IEnumerable<string> SearchList = GetAllItems(); // load your list

void Search(string searchPara)
{
    char[] delimiters = new char[] { ' ', '.', ',' };
    var wordsInSearchPara = searchPara
        .Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
        .Select(a => a.ToLower())
        .OrderBy(a => a);
    foreach (var item in SearchList)
    {
        var wordsInItem = item
            .Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
            .Select(a => a.ToLower())
            .OrderBy(a => a);
        var common = wordsInItem.Intersect(wordsInSearchPara);
        // now that you know the common items, you can get the differential
    }
}
}
