What is xml normalization? [duplicate]

Closed 11 years ago.
Possible Duplicate:
What does Java Node normalize method do?
What is XML normalization? I found the following in the Javadoc, but I can't understand it. Can anyone help?
public void normalize()
Puts all Text nodes in the full depth of the sub-tree underneath this Node, including attribute nodes, into a "normal" form where only structure (e.g., elements, comments, processing instructions, CDATA sections, and entity references) separates Text nodes, i.e., there are neither adjacent Text nodes nor empty Text nodes. This can be used to ensure that the DOM view of a document is the same as if it were saved and re-loaded, and is useful when operations (such as XPointer [XPointer] lookups) that depend on a particular document tree structure are to be used. If the parameter "normalize-characters" of the DOMConfiguration object attached to the Node.ownerDocument is true, this method will also fully normalize the characters of the Text nodes.
Note: In cases where the document contains CDATASections, the normalize operation alone may not be sufficient, since XPointers do not differentiate between Text nodes and CDATASection nodes.
Since:
DOM Level 3

Parsers will often return "surprising" text nodes, where text is split up into multiple nodes, or, less commonly, empty text nodes. This is a side-effect of them being streamlined for maximum performance. It may happen when there's ignorable whitespace, buffer boundaries, or anywhere else that it was just convenient for the parser.
normalize() will get rid of all these surprises, merging adjacent text nodes and removing empty ones.
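To make this concrete, here is a minimal sketch using the DOM implementation bundled with the JDK. It builds the "surprising" situation by hand (two adjacent text nodes with an empty one between them) rather than relying on a parser to produce it; the class and method names are made up for the illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NormalizeDemo {
    // Returns the number of child nodes of <greeting> before and after normalize().
    static int[] childCounts() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element greeting = doc.createElement("greeting");
            doc.appendChild(greeting);

            // Three Text children: two adjacent pieces of text and one empty node.
            greeting.appendChild(doc.createTextNode("Hello, "));
            greeting.appendChild(doc.createTextNode(""));
            greeting.appendChild(doc.createTextNode("world"));

            int before = greeting.getChildNodes().getLength();
            greeting.normalize(); // merge adjacent Text nodes, drop empty ones
            int after = greeting.getChildNodes().getLength();
            return new int[] { before, after };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = childCounts();
        System.out.println("before=" + counts[0] + ", after=" + counts[1]);
    }
}
```

The text content of the element is the same before and after; only the number of nodes carrying it changes.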

The API doc explains it in great detail, so I'm not sure what there is left to explain. Basically, the method converts the DOM subtree beginning at this node into a "standard format" by combining adjacent text nodes, eliminating empty text nodes and, optionally, normalizing characters that are Unicode composites.

Related

Jsoup - how to find out elements size

I am confused by the jsoup API. My code parses a table with 4 cells. But I found an occurrence where three cells are merged into a single one, and my code fails there because the child at position 3 does not exist.
String sMminutesLeft = row.child(3).text();
element.child(x) indexes into a filtered list of child elements, i.e. only tags, not text nodes. But element.childNodeSize() returns a count of all child nodes, including text nodes. I expected 4 but I receive 9 (lots of newlines are included).
I found element.getElementsByTag("TD") returning Elements object. This object acts like a container but it does not have any size() method.
How can I safely find out number of TDs under the current TR element? Implementing NodeVisitor seems like overkill to me.

I found a workaround, but as I feel the API is incomplete, I have created a pull request that adds a new method returning the number of filtered children, complementary to child(int). Here it is: https://github.com/jhy/jsoup/pull/1291

Formatting a list of strings to output to YML

Here is a piece of data I am working with:
snmp-server view DenyAll iso excluded
snmp-server view iso_view iso included
snmp-server view Cust_View interfaces included
snmp-server view Cust_View ifMIB included
I am attempting to get it into a YML format as seen below:
snmp-server:
  view:
    Cust_View:
      - "interfaces included"
      - "ifMIB included"
      - "etc etc etc"
    DenyAll: "iso excluded"
    iso_view: "iso included"
I've tried to iterate through the data set, split each line on spaces, use the first two elements of the resulting list as the "key" in the YML file, and use the remaining elements as the values.
However, this doesn't fit any other data set that I might want to format in the same way.
I am not looking for the code to be written for me. I am looking for ideas on how to go about this and produce the structure I'd like. I'm perfectly fine with writing to a YML file; the only part I'm struggling with is structuring the data.

You need a trie (prefix tree) for this task. Read each line, split it into words on spaces, and insert the resulting word sequence into the trie. Then start from the root of the trie and print the elements in a pre-order traversal, using spaces for indentation at each level (YAML does not allow tabs for indentation).
It also looks like you need the output sorted alphabetically. You can achieve this by keeping each node's children in sorted order in the trie.
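Since the question asks for ideas rather than finished code, take the following only as a rough sketch of the trie idea in Java. The class name ConfigTrie and its methods are invented for the example. It prints a plain indented tree, one word per line; collapsing the tail of each line into a quoted YAML value or list item, as in the desired output, is deliberately left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ConfigTrie {
    // Each node maps the next word to a child node; TreeMap keeps the keys sorted.
    private final Map<String, ConfigTrie> children = new TreeMap<>();

    // Walks down the trie word by word, creating missing nodes on the way.
    void insert(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return; // ignore blank lines
        }
        ConfigTrie node = this;
        for (String word : trimmed.split("\\s+")) {
            node = node.children.computeIfAbsent(word, w -> new ConfigTrie());
        }
    }

    // Pre-order traversal: emit a word, then its subtree one level deeper.
    void render(StringBuilder out, int depth) {
        for (Map.Entry<String, ConfigTrie> entry : children.entrySet()) {
            for (int i = 0; i < depth; i++) {
                out.append("  "); // two spaces per level
            }
            out.append(entry.getKey()).append('\n');
            entry.getValue().render(out, depth + 1);
        }
    }

    static String toIndentedText(List<String> lines) {
        ConfigTrie root = new ConfigTrie();
        for (String line : lines) {
            root.insert(line);
        }
        StringBuilder out = new StringBuilder();
        root.render(out, 0);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(toIndentedText(Arrays.asList(
                "snmp-server view DenyAll iso excluded",
                "snmp-server view iso_view iso included",
                "snmp-server view Cust_View interfaces included",
                "snmp-server view Cust_View ifMIB included")));
    }
}
```

Because the children live in a TreeMap, the shared prefix "snmp-server view" is printed once and Cust_View, DenyAll and iso_view come out in sorted order regardless of the order of the input lines.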

Evaluate many elements with XPathExpression and NODESET

I parse a very large XML file (from jpylyzer, a JP2 properties extractor). This XML contains the properties of many JP2 images, each one with the same elements, like:
//results/jpylyzer/fileInfo/fileName
//results/jpylyzer/properties/jp2HeaderBox/imageHeaderBox/height
//results/jpylyzer/properties/jp2HeaderBox/imageHeaderBox/width
//results/jpylyzer/properties/jp2HeaderBox/imageHeaderBox/bPCDepth
In order to reduce processing time, I'm using this method:
for (XPathExpression xPathExpression : listXPathExpression) {
    nodeList = (NodeList) xPathExpression.evaluate(document, XPathConstants.NODESET);
    // we use our list
}
It's very convenient and fast, but it only works if every property yields the expected number of elements.
As some properties are unique to some images, some XPath values won't be found for some images.
nodeList is filled ONLY with the values that were found, which is a problem: there's no way to match those values to the other ones, as the lists don't have the same size, depending on how many properties have been found.
Is there a way to fill in a "blank" when no value is found?

What you want is not possible with a single XPath expression, not even with version 2.0. In such a case, you have to reach for the higher-level language you embed XPath in.
As I'm not very familiar with Java, I cannot give you specific code, but I can explain what you have to do.
I assume an XML document similar to
<results>
  <jpylyzer>
    <fileInfo>
      <fileName>Name of file</fileName>
    </fileInfo>
    <properties>
      <jp2HeaderBox>
        <imageHeaderBox>
          <height>45</height>
          <width>66</width>
          <bPCDepth>386</bPCDepth>
        </imageHeaderBox>
        <imageHeaderBox>
          <width>32</width>
        </imageHeaderBox>
      </jp2HeaderBox>
    </properties>
  </jpylyzer>
</results>
As a starting point, find an element that really is present in all XML documents, in all situations. For the sake of an example, let us assume imageHeaderBox is present everywhere, but its children height, width and bPCDepth are not necessarily there.
Find an XPath expression for the imageHeaderBox element:
/results/jpylyzer/properties/jp2HeaderBox/imageHeaderBox
evaluate the expression and save the result to a nodeList. Next, process this list further. This only works if XPath expressions can be applied to the individual items in a nodeList, but it seems you are optimistic about that:
I can iterate over nodelist. I guess i can evaluate too
Iterate over the nodeList (the result of the imageHeaderBox expression) and apply another path expression to each item.
XPath 2.0
In XPath 2.0, you can use an if/then/else expression that checks for the presence of a node. Assuming the imageHeaderBox element node as the context item:
if(height) then height else 'e.g. text saying there is no height'
XPath 1.0
With XPath 1.0, it's slightly more complicated:
concat(height, substring('e.g. text saying there is no height', 1 div not(height)))
See Dimitre Novatchev's answer here for an explanation. The technique is known as the Becker method, probably introduced here.
Finally, the result list should look similar to
45
e.g. text saying there is no height
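For the Java side, the two-step approach described above might look roughly like the sketch below. It uses only the javax.xml.xpath API shipped with the JDK; the class name, the method name and the placeholder argument are invented for the illustration, and the outer path follows the sample document above. Instead of the XPath 1.0 string trick, the fallback is done in Java by checking whether the relative expression found a node.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class HeightsWithBlanks {
    // Returns one entry per imageHeaderBox: its height, or the placeholder if absent.
    static List<String> heights(String xml, String placeholder) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Step 1: select the element assumed to be present for every image.
            XPathExpression boxExpr = xpath.compile(
                    "/results/jpylyzer/properties/jp2HeaderBox/imageHeaderBox");
            // Step 2: a relative expression, evaluated once per selected element.
            XPathExpression heightExpr = xpath.compile("height");

            NodeList boxes = (NodeList) boxExpr.evaluate(doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < boxes.getLength(); i++) {
                // With XPathConstants.NODE the result is null when nothing matches.
                Node height = (Node) heightExpr.evaluate(boxes.item(i), XPathConstants.NODE);
                result.add(height != null ? height.getTextContent() : placeholder);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The resulting list always has one entry per imageHeaderBox, so lists built this way for height, width and bPCDepth line up by index.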

Meaning of #text in DOM parser

I'm relatively new to XML parsers and am trying to understand some Java code that uses the DOM API to parse an XML document.
I need to know what '#text' means in the following code, or even what this line of code does:
if (!ChildNode.getNodeName().equals("#text"))
{
    // do something
}

According to the JavaDoc, #text is the value of the nodeName attribute for nodes implementing the Text interface.
That is, if a node in the document is a text node (as opposed to, for example, an element), its nodeName will be #text.
The code in question appears to be checking whether the node referenced by ChildNode is a text node before performing some action. Presumably, the action is something that can't be performed upon a text node, like querying or adding to its children.
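A small sketch may help to see where these #text nodes come from. It parses a tiny, made-up document with the JDK's DOM parser and lists the node names of the root's children, optionally skipping text nodes. The class and method names are hypothetical, and the check uses getNodeType() instead of comparing the name with "#text", which should be equivalent for plain Text nodes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextNodeDemo {
    // Lists the node names of the root element's children.
    static List<String> childNames(String xml, boolean skipText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList kids = doc.getDocumentElement().getChildNodes();
            List<String> names = new ArrayList<>();
            for (int i = 0; i < kids.getLength(); i++) {
                Node child = kids.item(i);
                if (skipText && child.getNodeType() == Node.TEXT_NODE) {
                    continue; // same effect as testing getNodeName().equals("#text")
                }
                names.add(child.getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>\n  <a/>\n  <b/>\n</root>";
        System.out.println(childNames(xml, false)); // [#text, a, #text, b, #text]
        System.out.println(childNames(xml, true));  // [a, b]
    }
}
```

The line breaks and indentation between the elements are themselves text, so the parser reports them as Text nodes; that is usually why code like the snippet in the question filters them out.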

Is it possible to develop a criteria-based search on strings in C# or Java?

I have a List in C#. This list of strings contains the paragraphs read from an MS Word file. For example:
list 0-> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Finally the image displayed in the header will be added to finalize the report.
list 1->The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Various other elements of WordprocessingML will also be handled. By moving the formatting information into styles a higher degree of re-use is made possible. The document will be marked using custom XML tags and the insertion of other advanced elements such as a table of contents is discussed. But before all the advanced features can be added, the base of the document needs to be built.
Something like that.
Now my search string is:
The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Before going over all the elements which make up the sample documents a basic document structure needs to be laid out. When you take a WordprocessingML document and use the Windows Explorer shell to rename the docx extension to zip you will find many different elements, especially in larger documents.
I want to check my search string against those list elements.
My criterion is: if a list element matches the search string exactly, or by at least 85%, then I want to retrieve that list element.
In our case,
list 0 -> satisfies my search string better.
list 1 -> also matches some text, but I think it falls below my criterion...
How do I do this kind of criteria-based search on strings?
I'm also still somewhat unclear about the problem myself.
Your ideas and thoughts are welcome...

The keywords are "distance" or "string distance", and also "paragraph similarity".
You seek to implement a function that expresses as a scalar, say a percentage as suggested in the question, how similar one string is to another.
Plain string distance functions such as Hamming or Levenshtein may not be appropriate, for they work at character level rather than at word level, but generally these algorithms convey the idea of what is needed.
Working at word level you'll probably also want to take into account some common NLP features, for example ignoring (or giving less weight to) very common words (such as 'the', 'in', 'of' etc.) and maybe allowing for some forms of stemming. The order of the words, or at the very least their proximity, may also be of import.
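To illustrate what "working at word level" could mean, here is a sketch of the classic Levenshtein dynamic program applied to word tokens instead of characters. This is an adaptation made up for illustration, not a recommendation of this particular measure; the class name and the very naive tokenizer (lower-casing and splitting on whitespace) are assumptions, and none of the NLP refinements mentioned above are included.

```java
import java.util.Locale;

class WordEditDistance {
    // Naive tokenizer: lower-case, then split on runs of whitespace.
    static String[] tokens(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Minimum number of word insertions, deletions and substitutions
    // needed to turn the word sequence of a into that of b.
    static int distance(String a, String b) {
        String[] x = tokens(a);
        String[] y = tokens(b);
        int[] prev = new int[y.length + 1];
        for (int j = 0; j <= y.length; j++) {
            prev[j] = j; // empty prefix of x versus first j words of y
        }
        for (int i = 1; i <= x.length; i++) {
            int[] cur = new int[y.length + 1];
            cur[0] = i;
            for (int j = 1; j <= y.length; j++) {
                int cost = x[i - 1].equals(y[j - 1]) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[y.length];
    }
}
```

Dividing the distance by the word count of the longer string gives a rough dissimilarity between 0 and 1. Note that the running time grows with the product of the two word counts, which is exactly the cost concern raised in this answer.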
One key factor to remember is that even with relatively short strings, many distance functions can be quite expensive, computationally speaking. Before selecting one particular algorithm you'll need to get an idea of the general parameters of the problem:
How many strings would have to be compared? (on average, maximum)
How many words/tokens do the strings contain? (on average, maximum)
Is it possible to introduce a simple (quick) filter to reduce the number of strings to be compared?
How fancy do we need to get with linguistic features?
Is it possible to pre-process the strings?
Are all the records in a single language?
Comparing Methods for Single Paragraph Similarity Analysis, a scholarly paper, provides a survey of relevant techniques and considerations.
In a nutshell, the amount of design-time and run-time one can apply to this relatively open problem varies greatly, and is typically a compromise between the level of precision desired and the run-time resources and overall complexity of the solution that may be acceptable.
In its simplest form, when the order of the words matters little, computing the sum of factors based on the TF-IDF values of the words which match may be a very acceptable solution.
Fancier solutions may introduce a pipeline of processes borrowed from NLP, for example part-of-speech tagging (say, for the purpose of avoiding false positives such as "SAW" as a noun (a tool to cut wood) versus "SAW" as the past tense of the verb "to see", or more likely to filter out some of the words outright based on their grammatical function), stemming, and possibly semantic substitutions, concept extraction or latent semantic analysis.

You may want to look into Lucene for Java or Lucene.NET for C#. I don't think it'll do the percentage requirement you want out of the box, but it's a great tool for doing text matching.
You could perhaps run a separate query for each word, and then work out for yourself the percentage of them that matched.

Here's an idea (and not a solution by any means but something to get started with)
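// Needs: using System; using System.Collections.Generic; using System.Linq;
// The field is static because a field initializer cannot call an instance method;
// GetAllItems() is therefore assumed to be static as well.
private static readonly IEnumerable<string> SearchList = GetAllItems(); // load your list

void Search(string searchPara)
{
    char[] delimiters = new char[] { ' ', '.', ',' };
    // Lower-cased, sorted words of the search paragraph.
    var wordsInSearchPara = searchPara.Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
        .Select(a => a.ToLower()).OrderBy(a => a);
    foreach (var item in SearchList)
    {
        var wordsInItem = item.Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
            .Select(a => a.ToLower()).OrderBy(a => a);
        // Distinct words that occur in both the item and the search paragraph.
        var common = wordsInItem.Intersect(wordsInSearchPara);
        // now that you know the common items, you can get the differential
    }
}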
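For the Java half of the question, the same idea could be taken one step further to an actual percentage, roughly as in the sketch below. The class name, the method names and the exact definition of "percent match" (the share of an item's distinct words that also occur in the search paragraph) are assumptions made for the example; the 85% threshold from the question would then be a simple comparison on the result.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class WordOverlap {
    // Distinct lower-cased words, split on the same delimiters as above.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[ .,]+")) {
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    // Percentage of the item's distinct words that also occur in the search paragraph.
    static double percentMatch(String item, String searchPara) {
        Set<String> itemWords = words(item);
        if (itemWords.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(itemWords);
        common.retainAll(words(searchPara)); // set intersection
        return 100.0 * common.size() / itemWords.size();
    }

    public static void main(String[] args) {
        // Three of the four distinct item words occur in the search text.
        System.out.println(percentMatch("the quick brown fox", "The quick, brown dog."));
    }
}
```

This ignores word order and word frequency entirely, so it is only a starting point in the same spirit as the C# snippet.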
