Parsing XML files to get particular text content


I am parsing XML files that represent research papers / articles and follow the XML schema below, in order to store them in a MySQL database using Java.
<article>
<article-meta></article-meta>
<body>
<p>
Extensible Markup Language (XML) is a markup language that defines a set of
rules for encoding documents in a format that is both human-readable and
machine-readable <ref id="1"/>. It is defined in the XML 1.0 Specification produced by the
W3C, and several other related specifications.
</p>
<p>
Many application programming interfaces (APIs) have been developed to aid
software developers with processing XML <ref id="2"/> data, and several schema
systems exist to aid in the definition of XML-based languages.
</p>
</body>
<back>
<ref-list>
<ref id="1">Details about this reference </ref>
<ref id="2">Details about this reference </ref>
</ref-list>
</back>
</article>
I am parsing the files using a DOM parser. One of the requirements is that, for every ref id, I have to extract 150 characters to the left and to the right of the location where it is referenced inside the body tags. How can I do this? The expected output looks like this:
refId    leftText                   rightText
1        150 chars on left side     150 chars on right side

Assuming you have already obtained the <ref> element with id = 1 and its content ("Details about this reference") from the XML using DOM, store the content in a String variable. You can then use the substring method to get the leftmost and rightmost characters, like this:
String text = "Details about this reference";
String leftText = text.substring(0, 7); // first 7 chars from the left side
String rightText = text.substring(text.length() - 2); // last 2 chars from the right side; use 150 instead of 2 for your case
Result:
leftText: Details
rightText: ce
Note: check that the string is at least 150 characters long before extracting. If it is shorter, substring will throw a StringIndexOutOfBoundsException.
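The question asks for the text surrounding the place where each ref is cited in the body, rather than the reference details themselves. Here is a minimal sketch of that variant using DOM. It assumes the citations in the body are well-formed empty elements such as <ref id="1"/>; the class and method names (RefContext, contexts) are made up for illustration. It walks the body in document order, records the text offset of each ref, and clamps the window so that short texts do not throw.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class RefContext {
    // Walks the subtree in document order, appending all text to 'text' and
    // recording {id, offset} for every <ref> element that is encountered.
    static void collect(Node node, StringBuilder text, List<String[]> refs) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            short type = c.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                text.append(c.getNodeValue());
            } else if (type == Node.ELEMENT_NODE) {
                if ("ref".equals(c.getNodeName())) {
                    String id = ((Element) c).getAttribute("id");
                    refs.add(new String[] {id, String.valueOf(text.length())});
                }
                collect(c, text, refs);
            }
        }
    }

    // Returns one row {refId, leftText, rightText} per <ref> found inside <body>.
    static List<String[]> contexts(String xml, int width) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node body = doc.getElementsByTagName("body").item(0);
            StringBuilder text = new StringBuilder();
            List<String[]> refs = new ArrayList<>();
            collect(body, text, refs);

            List<String[]> rows = new ArrayList<>();
            for (String[] ref : refs) {
                int pos = Integer.parseInt(ref[1]);
                // Clamp both ends so fewer than 'width' characters is not an error.
                String left = text.substring(Math.max(0, pos - width), pos);
                String right = text.substring(pos, Math.min(text.length(), pos + width));
                rows.add(new String[] {ref[0], left, right});
            }
            return rows;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Calling RefContext.contexts(xml, 150) gives the rows for the refId / leftText / rightText table. Line breaks inside the paragraphs are kept as they appear in the file, so you may want to normalize whitespace before counting characters.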

Related

Sending XML data in a hidden field. Is it safe?

I have an html form:
<form>
<input type="hidden" id="hiddenField"/>
...Other form fields
</form>
In this form I want to set a hidden field with XML data.
Can anyone tell me whether it is fine to set the hidden field directly with XML data?
That is, in my JavaScript function, is it safe to set the hidden field directly with XML, like $('#hiddenField').val(xml); and then read the XML in my Java servlet? Please suggest.

No, you can't keep XML in the field without encoding it.
You can use
var stringValue = escape(xml);
var xmlValue = unescape(stringValue);
in JavaScript.
These functions have been deprecated, though, so prefer encodeURIComponent / decodeURIComponent, or a library function such as Underscore.js's escape (http://underscorejs.org/#escape).
Also, don't keep XML in a hidden field if it holds any sensitive information.

Hidden form fields are not meant for session tracking.
There are two mechanisms for session tracking: cookies and URL rewriting, the latter for people who don't have cookies enabled in their browsers. I could only understand sending a session id in a hidden field if you have your own session tracker and are not using the one that already comes with your server container (HttpSession and so on), but why reinvent the wheel?
Hidden fields are for passing information between pages; I sometimes use one when I clearly don't want that information displayed to the user.

Posting XML without JavaScript or browser plugins is impossible. You should not send it directly as a form parameter. See this answer for more info.
Use a library that encodes the data before it is sent to the server, and decode it on the server side.
Underscore.js provides such functionality. See the documentation:
escape    _.escape(string)
Escapes a string for insertion into HTML, replacing &, <, >, ", `, and ' characters.
_.escape('Curly, Larry & Moe');
=> "Curly, Larry &amp; Moe"
unescape    _.unescape(string)
The opposite of escape, replaces &amp;, &lt;, &gt;, &quot;, &#96; and &#x27; with their unescaped counterparts.
_.unescape('Curly, Larry &amp; Moe');
=> "Curly, Larry & Moe"
However, keep in mind that browsers and servers limit the length of the URL you can send in a GET request (Internet Explorer, for example, caps URLs at 2,083 characters). Hence it is always a better option to use POST instead of GET, even when sending encoded XML.
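On the Java side, the servlet container already decodes the form submission once. If the JavaScript additionally percent-encoded the XML before putting it into the hidden field (for example with encodeURIComponent, which is an assumption here, not something the question shows), the servlet has to undo that extra layer itself. A minimal sketch using only the standard library follows; HiddenFieldXml and decode are illustrative names, and the Charset overload of URLDecoder.decode requires Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class HiddenFieldXml {
    // Undoes one layer of percent-encoding, such as the output of
    // encodeURIComponent in the browser.
    static String decode(String fieldValue) {
        return URLDecoder.decode(fieldValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // In a servlet this value would come from request.getParameter("hiddenField"),
        // which assumes the input also carries name="hiddenField" (an id alone
        // is not submitted with the form).
        String fieldValue = "%3Cnote%3EHello%20%26%20welcome%3C%2Fnote%3E";
        System.out.println(decode(fieldValue)); // prints <note>Hello & welcome</note>
    }
}
```

Whatever encoding is used, treat the decoded XML as untrusted input: the user can edit a hidden field as easily as a visible one.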

Skipping Html Content in Tag attributes

I am using a SAX parser to parse the following piece of data, whose "Description" attribute contains HTML content. But I am getting the error "The value of attribute "Description" associated with an element type "null" must not contain the '<' character".
How can I make the SAX parser ignore this tag during XML processing?
<Thread ThreadID="22" Title="google"
Description="http://google.com/"
DisplayName="Sam" LoginID="hjaja" UserEmailID="abx#ers"
UserSapCode="12345"
IsAnonymous="Yes" CreatedDate="2015-04-29T21:56:04.943" ReplyCount="0"
ViewCount="0" PopularityPoints="0" LastUpdatedBy="" LastPostDate="" />
Thanks in advance.

I really think you should take a look at this post (HTML code inside XML) to see how other people recommend tackling this problem.

No XML parser can parse this data, because the data does not comply with the XML format. Please refer to the XML specification.
There are two ways you can solve this:
Change the source format
Change the source so that it produces proper XML. You can include HTML by escaping the special characters like this:
" &quot;
' &apos;
< &lt;
> &gt;
& &amp;
Change the target algorithm
The second way is to write your own parsing algorithm for your case.
Usually the answer is the first one.
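If the source is generated by Java code, the first option can be sketched as a small helper that replaces the five reserved characters with the predefined XML entities before the value is written into an attribute. AttributeEscaper and escapeXml are illustrative names, not part of any library.

```java
class AttributeEscaper {
    // Replaces the five characters reserved by XML with their predefined entities.
    static String escapeXml(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char ch = raw.charAt(i);
            switch (ch) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://google.com/\">google</a>";
        // The escaped value can be placed inside Description="..." safely.
        System.out.println("Description=\"" + escapeXml(html) + "\"");
    }
}
```

The SAX parser turns the entities back into the original characters, so the handler receives the HTML string unchanged through Attributes.getValue.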

Use spring tag in XSLT

I have an XSL/XML parser that produces JSP/HTML code.
Using the MVC model, I need to access the Spring tag library in order to perform i18n translation.
Thus, given the XML
<a>
...
<country>EN</country>
...
</a>
and the <spring:message code="table_country_code.EN"/> tag, I want to choose the translation (England, Inglaterra, etc.) based on the browser language.
However, XSL does not support the <spring:message> tag.
The idea is to have an XSLT with something like this
<spring:message code="table_country_code.<xsl:value-of select="country"/>"/>
so that the final code is <spring:message code="table_country_code.EN"/> and is recognized in the final JSP/HTML for i18n translation.
I also tried to create the Spring tag in Java when I build the XML, but I still get the same error:
The prefix "spring" for element "spring:message" is not bound.
[EDIT]
I saw some questions here, such as ones using bean:spring, but I still have the same problem.
Any pointers?

An XSLT stylesheet has to be namespace-well-formed XML, so you need to declare the namespace, and you cannot use < in attribute values.
The question "Spring 3 - Accessing messages.properties in jsp"
suggests the namespace should be
http://www.springframework.org/tags
so presumably you want XSLT code like
<spring:message
xmlns:spring="http://www.springframework.org/tags"
code="table_country_code.{country}"
/>
where {} is an attribute value template that evaluates the XPath expression country.
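To check what the attribute value template produces, here is a minimal sketch that runs such a stylesheet with the JDK's built-in javax.xml.transform API. The surrounding stylesheet, the match pattern /a, and the class name SpringTagXslt are assumptions made for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SpringTagXslt {
    // The spring prefix is declared on the stylesheet root, so the literal
    // result element <spring:message> is namespace-well-formed.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:spring='http://www.springframework.org/tags'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/a'>"
        + "<spring:message code='table_country_code.{country}'/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML and returns the serialized result.
    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<a><country>EN</country></a>"));
    }
}
```

The output is a spring:message element whose code attribute is table_country_code.EN, with the xmlns:spring declaration carried over onto the element.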

JAVA DOM: duplicated attributes

I'm using the DOM library for Java, and some XHTML entries trigger this problem:
[Fatal Error] tree.xml:238:185: Attribute "itemprop" was already specified for element "span".
This is the XHTML part with problems:
<span class='fn' itemprop='author' itemscope='itemscope' itemtype='http://schema.org/Person' itemprop='name'>Rodrigo</span>
Is there an option to allow duplicate attributes in DOM?
Thanks!

My understanding is that the Microdata specification only allows one itemprop per HTML element. Independently of that, XML well-formedness forbids the same attribute name from appearing twice on one element, so the DOM library you're using is correctly rejecting the markup. If you want to specify multiple values, they need to be space-separated, like this:
<span class='fn' itemprop='author name' itemscope='itemscope' itemtype='http://schema.org/Person'>Rodrigo</span>
Incidentally, the class attribute works the same way.
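A minimal sketch of reading the space-separated values back with the JDK's DOM parser; ItempropTokens and itemprops are illustrative names. The same method throws for the original markup with the duplicated attribute, which is the fatal error from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ItempropTokens {
    // Parses a single element and returns its itemprop values as separate tokens.
    static String[] itemprops(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            String value = doc.getDocumentElement().getAttribute("itemprop").trim();
            return value.isEmpty() ? new String[0] : value.split("\\s+");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Duplicate attributes end up here: the input is not well-formed XML.
            throw new IllegalArgumentException("Not well-formed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String span = "<span class='fn' itemprop='author name' itemscope='itemscope' "
                + "itemtype='http://schema.org/Person'>Rodrigo</span>";
        for (String prop : itemprops(span)) {
            System.out.println(prop); // prints author, then name
        }
    }
}
```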

Navigating an XML file while keeping track of order

I need to convert an XML file into the IOB format.
The XML file represents the structure of a paper written in LaTeX, i.e. with sections and subsections. In this representation, sections are encoded as BODY, then I have a HEADER and then paragraphs or subsections.
Example:
<DIV DEPTH="1">
<HEADER ID="H-8"> Practical Results </HEADER>
<P TYPE="TXT">
<S ID="S-56" TYPE="TXT"> To assess its performance , <REF REFID="R-12" ID="C-36">Grover et al. 1993</REF> tried various methods . </S>
<S ID="S-57" TYPE="TXT"> The grammar is defined in metagrammatical formalism which is compiled into a unification-based ` object grammar ' -- a syntactic variant of the Definite Clause Grammar formalism <REF REFID="R-21" ID="C-37">Pereira and Warren 1980</REF> -- containing 84 features and 782 phrase structure rules . </S>
<DIV DEPTH="2">
<HEADER ID="H-9"> Comparing the Parsers </HEADER>
<P TYPE="TXT">
<S ID="S-61" TYPE="TXT"> In the first experiment , the ANLT grammar was loaded and a set of sentences was input to each of the three parsers . </S>
</P>
<IMAGE ID="I-0"/>
</DIV>
What I want to do is to keep all the text, but convert it to a different format, i.e. I want to remove the BODY structure, and just tag the HEADERs and the text part like this:
Practical/B-Header Results/I-Header ./O
To/B-Text assess/I-Text its/I-Text performance/I-Text ,/I-Text Grover/I-Text et/I-Text al./I-Text tried/I-Text various/I-Text methods/I-Text ./O
The/B-Text grammar/I-Text ... ./O
And so on.
I know some DOM parsing in Java (for example, I have been using jdom2 for a little while) but I do not know how to keep the order of the text: for example, I want to fetch the content of the REF tag (which is inside S, look at the example), but the text from its parent extends before and after the REF tag.
Any pointers? Should be fairly simple, but searches like "strip XML tags after certain depth" did not help me :-(

I would use an event-based XML parser such as StAX or SAX. Then you can keep track of levels, order and other things as you process each tag.
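A minimal sketch of that idea with StAX (javax.xml.stream), under a few assumptions: only HEADER and S elements produce output, text from nested elements such as REF is appended in document order, tokens are split on whitespace, and a final "./O" closes every line as in the example output. Unlike the question's sample output, it keeps the year inside REF. IobConverter, tag and convert are illustrative names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class IobConverter {
    // First token gets B-label, the rest I-label; the closing "." is tagged O.
    static String tag(String text, String label) {
        List<String> tokens = new ArrayList<>(Arrays.asList(text.trim().split("\\s+")));
        if (tokens.get(tokens.size() - 1).equals(".")) {
            tokens.remove(tokens.size() - 1);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            sb.append(tokens.get(i)).append(i == 0 ? "/B-" : "/I-").append(label).append(' ');
        }
        return sb.append("./O").toString();
    }

    // Returns one IOB line per HEADER or S element, in document order.
    static List<String> convert(String xml) {
        List<String> lines = new ArrayList<>();
        StringBuilder buf = null; // non-null while inside a HEADER or S element
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("HEADER") || name.equals("S")) {
                        buf = new StringBuilder();
                    }
                } else if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    // Text of nested elements (e.g. REF) arrives here in reading order.
                    if (buf != null) {
                        buf.append(reader.getText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT && buf != null) {
                    String name = reader.getLocalName();
                    if (name.equals("HEADER") || name.equals("S")) {
                        if (!buf.toString().trim().isEmpty()) {
                            lines.add(tag(buf.toString(), name.equals("HEADER") ? "Header" : "Text"));
                        }
                        buf = null;
                    }
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return lines;
    }
}
```

Because the events arrive in the order the text appears in the file, the words before REF, inside REF and after REF end up in the right sequence without any extra bookkeeping.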
