Skipping Html Content in Tag attributes - java

I am using a SAX parser to parse the following piece of data, whose "Description" attribute contains HTML content. But I am getting the error "The value of attribute "Description" associated with an element type "null" must not contain the '<' character".
How can I make the SAX parser ignore this tag during XML processing?
<Thread ThreadID="22" Title="google"
Description="http://google.com/"
DisplayName="Sam" LoginID="hjaja" UserEmailID="abx#ers"
UserSapCode="12345"
IsAnonymous="Yes" CreatedDate="2015-04-29T21:56:04.943" ReplyCount="0"
ViewCount="0" PopularityPoints="0" LastUpdatedBy="" LastPostDate="" />
Thanks in advance.

I really think that you should take a look at this post (HTML code inside XML) to see how other people recommend tackling this problem.

No XML parser can parse this data, because the data does not comply with the XML format. Please refer to the XML specification.
There are two ways you can solve this:
Change the source format
Change the source so that it produces proper XML. You can include HTML by escaping the special characters with these entities:
" &quot;
' &apos;
< &lt;
> &gt;
& &amp;
Change the target algorithm
The second way is to write your own parsing algorithm for your case.
The first option is almost always the right answer.
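As a rough illustration of the first option, here is a minimal sketch of escaping a value before it is written into an attribute. The class name XmlEscaper and the sample HTML string are hypothetical and not taken from the question; in practice an XML writer API would normally do this escaping for you.

```java
final class XmlEscaper {
    private XmlEscaper() {}

    /** Replaces the five XML special characters with their predefined entities. */
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds a Thread element whose Description attribute holds the escaped value. */
    static String threadElement(String description) {
        return "<Thread Description=\"" + escape(description) + "\" />";
    }
}
```

Calling XmlEscaper.threadElement with a hypothetical value such as <a href="http://google.com/">google</a> yields an attribute that a SAX parser accepts, and the parser hands the original HTML back unescaped in the attribute value.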

Related

Get a list of all invalid XML tags using Regex?

I have XML contained in a string which has many invalid xml tags for an element, where a tag is "invalid" if it starts with a number. For example, the following are invalid:
<1>....</1>, <123abc>, etc.
In XML, we'd identify certain tags as invalid:
<tag1> ----> valid tag
<1tagname>....</1tagname> --->invalid tagname
<2tagname>....</2tagname> --->invalid tag name
</tag1> ----> valid tag
I want to fetch a list of the invalid XML tags and add a special string as a prefix, let's say "item", to turn each invalid tag name into a valid one.
I am using Java-compatible regex.
You can use this:
String result = yourstr.replaceAll("(?<=</?)(?=[0-9])", "item");
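To also get the list of invalid names the question asks for, a sketch along these lines may help. The class InvalidTagFixer, its method names, and the second pattern are assumptions made for illustration; only the replaceAll expression comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InvalidTagFixer {
    // Captures the name of an opening or closing tag that starts with a digit.
    private static final Pattern INVALID_TAG = Pattern.compile("</?([0-9][^\\s/>]*)");

    private InvalidTagFixer() {}

    /** Returns each distinct invalid tag name, in order of first appearance. */
    static List<String> findInvalidTagNames(String xml) {
        List<String> names = new ArrayList<>();
        Matcher m = INVALID_TAG.matcher(xml);
        while (m.find()) {
            String name = m.group(1);
            if (!names.contains(name)) {
                names.add(name);
            }
        }
        return names;
    }

    /** Inserts the prefix right after "<" or "</" wherever a digit follows. */
    static String addPrefix(String xml, String prefix) {
        return xml.replaceAll("(?<=</?)(?=[0-9])", Matcher.quoteReplacement(prefix));
    }
}
```

For the input <tag1><1tagname>a</1tagname></tag1>, findInvalidTagNames returns [1tagname] and addPrefix(xml, "item") returns <tag1><item1tagname>a</item1tagname></tag1>. Note that the pattern would also fire on a literal "<" followed by a digit in text content, which well-formed XML would have escaped.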
You can use a stack.
Explanation:
It is like checking whether a bracket expression is balanced.
Your code should work like this:
Read the XML.
For every opening tag, push it onto the stack.
For every closing tag, compare it with the top of the stack.
If they do not match, mark it as a problem and add the prefix.
If they match, pop the tag off the stack.
If elements remain on the stack when you finish reading the XML, add the prefix and close those tags.
This will solve the simple case.
There are some edge cases, such as an unmatched closing tag inside a legal tag, and maybe more.
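The matching part of the steps above can be sketched as follows. This is only an illustration under several assumptions: the class TagBalanceChecker and the tag-finding regex are made up here, the regex is a crude tokenizer that does not handle comments or CDATA containing ">", and the code only reports problems rather than adding a prefix.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagBalanceChecker {
    // group 1: "/" for a closing tag, group 2: tag name, group 3: "/" for a self-closing tag
    private static final Pattern TAG = Pattern.compile("<(/?)([^\\s/>]+)[^>]*?(/?)>");

    private TagBalanceChecker() {}

    /** Returns a description of every mismatched or unclosed tag. */
    static List<String> findProblems(String xml) {
        List<String> problems = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(xml);
        while (m.find()) {
            String name = m.group(2);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            // Skip declarations, comments, processing instructions and <x/> tags.
            if (name.startsWith("!") || name.startsWith("?") || selfClosing) {
                continue;
            }
            if (!closing) {
                open.push(name);                       // opening tag: push
            } else if (!open.isEmpty() && open.peek().equals(name)) {
                open.pop();                            // matching closing tag: pop
            } else {
                problems.add("unexpected </" + name + ">");
            }
        }
        // Anything still on the stack was never closed.
        for (String name : open) {
            problems.add("unclosed <" + name + ">");
        }
        return problems;
    }
}
```

For well-nested input such as <a><b></b></a> the returned list is empty; for <a><b></a> it reports the unexpected </a> and then the two tags left on the stack.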

how to use <,> tags in java when create xml file

I want to put an HTML tag in XML. Using CDATA works when I write the XML by hand, but when I create the XML file with Java, the < and > characters come out as "&lt;" and "&gt;". I don't understand this situation.
String returnUrl="<![CDATA[ac=S<br/>DNbZCQOijAl6HrAAyyGV]]>";
Node returnUrlNode = doc.createElement("returnurl");
returnUrlNode.setTextContent(returnUrl);
userNode.appendChild(returnUrlNode);
If for whatever reason you want the text to be in a CDATA section and not a simple text node, you'll need to create the CDATA yourself. I'm assuming you're using the DOM and not some API that looks similar, so it would be:
Node returnUrlNode = doc.createElement("returnurl");
returnUrlNode.appendChild(
doc.createCDATASection(
"Whatever text you wanted to go in here, including unescaped < and >."));
Note that like SLaks pointed out, when the DOM is serialized, all escaping will happen automatically. (In this case, that means that the <![CDATA[ and ]]> will be added automatically.) This is just how you'd create an actual CDATA section if you need the output to be a CDATA section and not a normal text node.
The Java XML APIs will automatically escape your content.
You can just write .setTextContent("ac=S<br/>DNbZCQOijAl6HrAAyyGV"), and Java will escape the < and > for you.
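To compare the two approaches side by side, here is a minimal sketch that builds the same element once with a text node and once with a CDATA section, then serializes it with the JDK's identity Transformer. The class name ReturnUrlDemo, the "user" root element, and the choice to omit the XML declaration are assumptions for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class ReturnUrlDemo {
    private ReturnUrlDemo() {}

    /** Serializes <user><returnurl>...</returnurl></user> with the text as CDATA or as plain text. */
    static String serialize(String text, boolean asCdata) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element user = doc.createElement("user");
            doc.appendChild(user);

            Element returnUrl = doc.createElement("returnurl");
            if (asCdata) {
                // Pass the raw text only; the serializer adds <![CDATA[ and ]]>.
                returnUrl.appendChild(doc.createCDATASection(text));
            } else {
                // Plain text node; the serializer escapes markup characters.
                returnUrl.setTextContent(text);
            }
            user.appendChild(returnUrl);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not build or serialize the document", e);
        }
    }
}
```

Both outputs parse back to the same text, so the choice between them only affects how the file looks, not what a reader of the XML receives.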
You need to use the XML escape characters:
& &amp;
< &lt;
> &gt;
" &quot;
' &apos;

How to parse XML tags with dots in JavaScript/extJS

Dot in XML tag
I have a problem with tags in an XML file.
I have a lot of tags with dots, for example <tag.state> example text </tag.state>
JavaScript (extJS) does not successfully parse tags with dots :\
The XML file was generated automatically, and I cannot influence the generated tags, so is it possible to avoid this issue?
I cannot read these tags in extJS.
I tried with single quotes ' and double quotes ", but that also fails...
fields: [ 'tag.state']
or
fields: [ "tag.state"]
I had a similar problem in java where my xml file could only have <string-array> and <item> and <name> etc. I just made a java file that ran through my xml (copied into a .txt) and rewrote everything with correct tags. I can share it with you if you would like.

Strip of tags from text extracted from XML

I am parsing XML documents. I do getTextContent() to get text from particular section that I want. The text that I get has tags like
<italic> </italic>
<sub> </sub>
..and some more. I want to strip out these tags and just keep the text, irrespective of what the tags are.
My document looks like this
<article>
<sec>Section 1</sec>
<sec>Section 2
<title>Title1</title>
<sec>
<title>Subtitle1</title>
<p>........<italic> </italic>...</p>
</sec>
<sec>
<title>Subtitle2</title>
<p>........<sub> </sub>...</p>
</sec>
</sec>
</article>
I need all the text in <p>...</p> without the tags in it.
How can I go about it? I was thinking of identifying all the tags and replacing it with "". But there has to be a better way.
Thanks
You could apply this regex to the results of getTextContent()
String noHTMLString = htmlString.replaceAll("\\<.*?\\>", "");
You could use a Perl script to go through the file and then use s/<[^>]*>//g; to get rid of all the tags.
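For comparison, here is a small sketch using the DOM directly. On a parsed document, getTextContent() on a <p> element already returns the concatenated text of its descendants without the child element tags, so the regex is only needed when the tags arrive as literal characters in a string. The class ParagraphText and its method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ParagraphText {
    private ParagraphText() {}

    /** Returns the text of every <p> element, with nested element tags omitted. */
    static List<String> paragraphs(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList ps = doc.getElementsByTagName("p");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < ps.getLength(); i++) {
                result.add(ps.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    /** Fallback for strings that contain tags as literal text. */
    static String stripTags(String text) {
        return text.replaceAll("<[^>]*>", "");
    }
}
```

With a paragraph such as <p>H<sub>2</sub>O is <italic>wet</italic></p>, paragraphs returns the single string "H2O is wet".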

XML Editor in java(jsp,sevlet)

I am developing an XML editor using JSP and servlets. I am using a DOM parser for this.
I have one problem in the XML editor:
how do I edit the following XML file without losing elements?
eg:
<book id="b1">
<bookbegin id="bb1">
<para id="p1">This is<b>first</b>line</para>
<para id="p2">This is<b>second</b>line</para>
<para id="p3">This is<b>third</b>line</para>
</bookbegin>
</book>
I am trying to edit the above XML file, which uses a DTD, with JSP and servlets. But when I read the text value from the XML, it returns only first, second, third. How do I read the 'This is' and 'line' parts? And how do I then store them back to the XML file using XPath?
Thanks in advance.
The <b> tag inside the <para> tag is another element, not a formatting tag (in XML). Therefore, you need to traverse down to it.
Like @JRL says, the <b> tags are considered well-formed XML and, as a consequence, are split out as separate nodes by your DOM processor.
I think you fail to read the other text nodes because you only read text when an XML node has no child nodes, which is not the case here.
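A minimal sketch of that traversal, assuming a single <para> element is parsed on its own: the text before and after <b> shows up as separate text-node children of <para>. The class MixedContentDemo and the "text:" and "element:" labels are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class MixedContentDemo {
    private MixedContentDemo() {}

    /** Lists the direct children of the root element, labelling text and element nodes. */
    static List<String> describeChildren(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> parts = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.TEXT_NODE) {
                    // Text that sits directly inside <para>, e.g. "This is" and "line".
                    parts.add("text:" + child.getNodeValue());
                } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                    // A nested element such as <b>; its own text is one level down.
                    parts.add("element:" + child.getNodeName() + "=" + child.getTextContent());
                }
            }
            return parts;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }
}
```

For <para id="p1">This is<b>first</b>line</para> this returns [text:This is, element:b=first, text:line]. To edit a piece, call setNodeValue on the corresponding text node rather than replacing the whole element, so the <b> child is preserved.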
