parsing links from given file using jsoup


I am using Jsoup to parse an XML file stored on the filesystem, but after parsing, the link element changes its scope: it becomes an empty tag and its text ends up outside it.
XML file:
<movies>
<movie>
<id>0</id>
<name>Aag - 1948</name>
<link>http://www.songspk.pk/indian/aag_1948.html</link>
</movie>
<movie>
<id>1</id>
<name></name>
<link>#</link>
</movie>
<movie>
<id>2</id>
<name>Aa Ab Laut Chalain</name>
<link>http://www.songspk.pk/aa_ab_laut_chalein.html</link>
</movie>
<movie>
<id>3</id>
<name>Aag - RGV Ki Aag</name>
<link>http://www.songspk.pk/aag.html</link>
</movie>
</movies>
Java implementation:
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class DownloadSongsList {
    private static Document document;

    public static void main(String... args) throws IOException {
        document = Jsoup.parse(new File("c:/movies.xml"), "UTF-8");
        Elements movies = document.getElementsByTag("movies");
        System.out.println(movies.html());
    }
}
Output:
<movie>
<id>
0
</id>
<name>
Aag - 1948
</name>
<link /> http://www.songspk.pk/indian/aag_1948.html
</movie>
<movie>
<id>
1
</id>
<name></name>
<link />#
</movie>
<movie>
<id>
2
</id>
<name>
Aa Ab Laut Chalain
</name>
<link />http://www.songspk.pk/aa_ab_laut_chalein.html
</movie>
<movie>
<id>
3
</id>
<name>
Aag - RGV Ki Aag
</name>
<link />http://www.songspk.pk/aag.html
</movie>
I want to parse the links but can't because of this problem.
I would like to stick with Jsoup because I use the same library to create these XML files.

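Have you tried using Parser.xmlParser()? By default Jsoup parses with its HTML parser, which treats <link> as an empty (void) element, so the text is pushed outside the tag.
Example:
Document doc = Jsoup.parse(new FileInputStream("c:/movies.xml"), "UTF-8", "", Parser.xmlParser());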
Elements movies = doc.getElementsByTag("movies");
System.out.println(movies.html());
Should output:
<movie>
<id>
0
</id>
<name>
Aag - 1948
</name>
<link>
http://www.songspk.pk/indian/aag_1948.html
</link>
</movie>
<movie>
<id>
1
</id>
<name></name>
<link>
#
</link>
</movie>
<movie>
<id>
2
</id>
<name>
Aa Ab Laut Chalain
</name>
<link>
http://www.songspk.pk/aa_ab_laut_chalein.html
</link>
</movie>
<movie>
<id>
3
</id>
<name>
Aag - RGV Ki Aag
</name>
<link>
http://www.songspk.pk/aag.html
</link>
</movie>
So then you can extract the <link> tags normally:
Elements links = doc.getElementsByTag("link");

Related

Sort nodes in XML using DOM parser

How can I sort XML nodes according to the order tag and append them to a new XML file using a DOM parser, or can this be done with a DOM parser at all? We've used the DOM parser extensively for appending nodes to a new file, but I am not able to sort the nodes.
Any help would be highly appreciated.
Input.xml
<rss version="2.0">
<Configs>
<Value>defaultValue</Value>
<Config name="test1">
<title>Title 1</title>
<author>Author1</author>
<value>5600</value>
<order>02</order>
</Config>
<Config name="test2">
<title>Title 2</title>
<author>Author2</author>
<Value>6100</Value>
<order>01</order>
</Config>
</Configs>
<Ratings>
<body>
<Items name="ac_object1">
<something1>something1</something1>
<value>someValue1</value>
<order>02</order>
</Items>
<Items name="op_object2">
<something1>something2</something1>
<value>someValue2</value>
<order>03</order>
</Items>
<Items name="vt_object3">
<something1>something3</something1>
<value>someValue3</value>
<order>01</order>
</Items>
</body>
</Ratings>
</rss>
Expected Output.xml
<rss version="2.0">
<Configs>
<Value>defaultValue</Value>
<Config name="test2">
<title>Title 2</title>
<author>Author2</author>
<Value>6100</Value>
<order>01</order>
</Config>
<Config name="test1">
<title>Title 1</title>
<author>Author1</author>
<value>5600</value>
<order>02</order>
</Config>
</Configs>
<Ratings>
<body>
<Items name="vt_object3">
<something1>something3</something1>
<value>someValue3</value>
<order>01</order>
</Items>
<Items name="ac_object1">
<something1>something1</something1>
<value>someValue1</value>
<order>02</order>
</Items>
<Items name="op_object2">
<something1>something2</something1>
<value>someValue2</value>
<order>03</order>
</Items>
</body>
</Ratings>
</rss>

You really don't want to do this using low-level DOM interfaces. Here's how to do it in XSLT 3.0 (which you can call from Java after installing Saxon-HE):
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:mode on-no-match="shallow-copy"/>
<xsl:strip-space elements="*"/>
<xsl:output method="xml" indent="yes"/>
<xsl:template match="*[*/order]">
<xsl:copy>
<xsl:apply-templates>
<xsl:sort select="number(order)"/>
</xsl:apply-templates>
</xsl:copy>
</xsl:template>
</xsl:transform>
With a few extra lines of code you could also do it using XSLT 1.0, which comes bundled with the JDK.
How it works:
The xsl:mode declaration says that the default action for elements is to copy the element and then process its children
xsl:strip-space says ignore whitespace in the input
xsl:output says add indentation in the output
The xsl:template rule says that when processing an element that has order elements among its grandchildren, copy the start and end tag, and process the children in sorted order of the numeric value of their order child element.
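The XSLT 1.0 variant mentioned above can be sketched roughly like this, run through the JDK's built-in javax.xml.transform API. The stylesheet is an assumed 1.0 rewrite, not the answerer's own: an identity template stands in for xsl:mode, which 1.0 lacks, and attributes of the sorted parent are copied explicitly. The class name SortByOrder and the sample input in main are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SortByOrder {
    // Assumed XSLT 1.0 equivalent of the 3.0 stylesheet (illustrative only).
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:strip-space elements='*'/>"
      + "<xsl:output method='xml' indent='yes'/>"
      // Identity template: copy every node and attribute unchanged.
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // Elements whose children contain <order>: copy, then emit children sorted numerically.
      + "<xsl:template match='*[*/order]'>"
      + "<xsl:copy>"
      + "<xsl:apply-templates select='@*'/>"
      + "<xsl:apply-templates select='node()'>"
      + "<xsl:sort select='order' data-type='number'/>"
      + "</xsl:apply-templates>"
      + "</xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the serialized result.
    static String sort(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Hypothetical cut-down input modelled on the question's <body> element.
        String xml = "<body>"
            + "<Items name='ac_object1'><order>02</order></Items>"
            + "<Items name='op_object2'><order>03</order></Items>"
            + "<Items name='vt_object3'><order>01</order></Items>"
            + "</body>";
        System.out.println(sort(xml));
    }
}
```

For real files, the StringReader sources can be swapped for StreamSource(new File(...)) and the result written with StreamResult(new File(...)).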

Xpath with Default Namespace

I am trying to get all values of a certain XML element. However, the namespace is not defined. I've tried using the local-name() function but am not having any luck.
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<entry xml:base="https://www.website.com" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns="http://www.w3.org/2005/Atom">
<id>A</id>
<title type="text"></title>
<updated>2015-07-21T02:40:30Z</updated>
<author>
<name />
</author>
<link rel="edit" title="Application" href="A(1347)" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/b" type="application/atom+xml;type=feed" title="B" href="A(1347)/B" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/C" type="application/atom+xml;type=feed" title="C" href="A(1347)/C" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/D" type="application/atom+xml;type=entry" title="D" href="A(1347)/D" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/E" type="application/atom+xml;type=feed" title="E" href="A(1347)/E">
<m:inline>
<feed>
<title type="text">E</title>
<id>1347</id>
<updated>2015-07-21T02:40:30Z</updated>
<link rel="self" title="E" href="A(1347)/E" />
<entry>
<id>www.website.com/</id>
<title type="text"></title>
<updated>2015-07-21T02:40:30Z</updated>
<author>
<name />
</author>
<link rel="edit" title="E" href="E(4294)" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/A" type="application/atom+xml;type=entry" title="Application" href="E(4294)/A" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/D" type="application/atom+xml;type=entry" title="D" href="E(4294)/D" />
<category term="APIModel.FileBaseDocuments" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
<content type="application/xml">
<m:properties>
<d:ID m:type="Edm.Int32">4294</d:ID>
<d:Type>123</d:Type>
</m:properties>
</content>
</entry>
<entry>
<id>www.website.com</id>
<title type="text"></title>
<updated>2015-07-21T02:40:30Z</updated>
<author>
<name />
</author>
<link rel="edit" title="E" href="E(4295)" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/A" type="application/atom+xml;type=entry" title="A" href="E(4295)/A" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/D" type="application/atom+xml;type=entry" title="D" href="E(4295)/D" />
<category term="APIModel.FileBaseDocuments" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
<content type="application/xml">
<m:properties>
<d:ID m:type="Edm.Int32">4295</d:ID>
<d:Type>456</d:Type>
</m:properties>
</content>
</entry>
</feed>
</m:inline>
</link>
</entry>
I want to retrieve the values of all d:ID elements with m:type="Edm.Int32" inside m:properties (in this case 4294 and 4295), but I am having no luck. For one file, there would be one "feed" tag filled with multiple "entry" tags. Inside each of these there would be one m:properties/d:ID element with m:type="Edm.Int32", which I need to retrieve. Any suggestion on what the correct XPath would be for this situation?

Instead of resorting to namespace-agnostic XPath with local-name(), why not register a prefix for the default namespace, e.g. an x prefix as in xmlns:x="http://www.w3.org/2005/Atom"? Your XPath would then be something like:
//x:feed/x:entry/x:content/m:properties/d:ID[@m:type='Edm.Int32']
The local-name() approach is more verbose:
//*[local-name()='feed']/*[local-name()='entry']/*[local-name()='content']
/*[local-name()='properties']/*[local-name()='ID' and @m:type='Edm.Int32']
Example of both approaches here
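In Java, "registering a namespace" means giving the XPath object a NamespaceContext. Here is a minimal sketch using only the JDK, assuming the document arrives as a string; the class name AtomIds, the prefix x, and the trimmed sample document in main are illustrative choices, not part of the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AtomIds {
    // Prefixes used in the XPath; "x" is an arbitrary name bound to the Atom default namespace.
    static final Map<String, String> NAMESPACES = Map.of(
        "x", "http://www.w3.org/2005/Atom",
        "m", "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
        "d", "http://schemas.microsoft.com/ado/2007/08/dataservices");

    static List<String> ids(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required, otherwise prefixed XPath steps match nothing
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                return NAMESPACES.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
            }
            // The two reverse lookups are not needed for XPath evaluation.
            @Override public String getPrefix(String uri) { return null; }
            @Override public Iterator<String> getPrefixes(String uri) {
                return Collections.emptyIterator();
            }
        });

        NodeList nodes = (NodeList) xpath.evaluate(
            "//x:feed/x:entry/x:content/m:properties/d:ID[@m:type='Edm.Int32']",
            doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, heavily trimmed version of the question's document.
        String xml = "<feed xmlns='http://www.w3.org/2005/Atom'"
            + " xmlns:d='http://schemas.microsoft.com/ado/2007/08/dataservices'"
            + " xmlns:m='http://schemas.microsoft.com/ado/2007/08/dataservices/metadata'>"
            + "<entry><content><m:properties>"
            + "<d:ID m:type='Edm.Int32'>4294</d:ID>"
            + "</m:properties></content></entry></feed>";
        System.out.println(ids(xml));
    }
}
```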

Use the exclude-result-prefixes attribute to filter out the extraneous namespaces in your output. You don't need to declare the http://www.w3.org/2005/Atom namespace in your transform unless you need it for other purposes, but even so, it is probably a good thing.
Just match on the expression m:properties/d:ID[@m:type='Edm.Int32'] to get your required data.
For example, this transform, when applied to your given input document ....
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
xmlns:atom="http://www.w3.org/2005/Atom"
version="2.0"
exclude-result-prefixes="xsl d m atom">
<xsl:output encoding="utf-8" omit-xml-declaration="yes" indent="yes" />
<xsl:template match="/">
<Edm.Int32s>
<xsl:apply-templates />
</Edm.Int32s>
</xsl:template>
<xsl:template match="*">
<xsl:apply-templates />
</xsl:template>
<xsl:template match="text()" />
<xsl:template match="m:properties/d:ID[@m:type='Edm.Int32']">
<value><xsl:value-of select="text()" /></value>
</xsl:template>
</xsl:stylesheet>
... yields output ....
<Edm.Int32s>
<value>4294</value>
<value>4295</value>
</Edm.Int32s>
Note
I have assumed from your question tags that you want an XSLT transform. If you just wanted a simple XPath expression to apply directly to the document, you could use ...
//m:properties/d:ID[@m:type='Edm.Int32']/text()
(passing, of course, the namespace declarations for m and d).

Java XPathExpression for Child Element Using Parent Element As XPath Parameter

For the following xml:
<books>
<book>
<author>Peter</author>
<title>Tales from Somewhere</title>
<data>
<version>1</version>
</data>
</book>
<book>
<author>Paul</author>
<title>Tales from Nowhere</title>
<data>
<version>2</version>
</data>
</book>
</books>
How can I get the <version> value of the book author 'Paul' above, using this type of notation for building a Java XPathExpression:
//*[local-name()='books']/*
?
I used the following question as a reference:
Get first child node in XSLT using local-name()
Thanks!

This XPath will get the version of every book that has at least one author element with the value "Paul":
//book[author="Paul"]/data/version
When run against this XML:
<books>
<book>
<author>Peter</author>
<title>Tales from Somewhere</title>
<data>
<version>1</version>
</data>
</book>
<book>
<author>Paul</author>
<title>Tales from Nowhere</title>
<data>
<version>2</version>
</data>
</book>
<book>
<author>Peter</author>
<author>Paul</author>
<title>How to write a book with a friend</title>
<data>
<version>7</version>
</data>
</book>
</books>
You get this result:
<version>2</version>
<version>7</version>
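To tie this back to the Java XPathExpression the question asks about, here is a small sketch using the JDK's javax.xml.xpath package. It evaluates both the plain expression and an equivalent written in the local-name() notation from the question; the class name BookVersions, the select helper, and the shortened sample in main are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BookVersions {
    // The expression from the answer above.
    static final String PLAIN = "//book[author='Paul']/data/version";

    // The same selection in the local-name() style used in the question.
    static final String LOCAL_NAME =
        "//*[local-name()='book'][*[local-name()='author']='Paul']"
      + "/*[local-name()='data']/*[local-name()='version']";

    // Compiles the expression and returns the text of every matching node.
    static List<String> select(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPathExpression compiled = XPathFactory.newInstance().newXPath().compile(expression);
        NodeList nodes = (NodeList) compiled.evaluate(doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical shortened input with the same shape as the question's XML.
        String xml = "<books>"
            + "<book><author>Peter</author><data><version>1</version></data></book>"
            + "<book><author>Paul</author><data><version>2</version></data></book>"
            + "</books>";
        System.out.println(select(xml, PLAIN));
        System.out.println(select(xml, LOCAL_NAME));
    }
}
```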

Trim all Strings in a Node (Java webservice)

I am getting the XML data that is sent as part of SOAP request using a handler in webservice before unmarshalling of the request is done.
Node root = soapMsgContext.getMessage().getSOAPBody().getFirstChild();
Is there a way to trim all the elements in the node? For instance, in the case below,
<Person>
<Name> J i m </Name>
<Details>
<age> 1 9 </age>
<Address>
<firstName> J i mm y </firstName>
<lastName> An der son </lastName>
</Address>
</Details>
</Person>
must be changed to
<Person>
<Name>Jim</Name>
<Details>
<age>19</age>
<Address>
<firstName>Jimmy</firstName>
<lastName>Anderson</lastName>
</Address>
</Details>
</Person>
EDIT: Webservices framework being used is JAX-RPC.
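One possible approach, sketched with the plain org.w3c.dom API and not tied to JAX-RPC: walk the node recursively and rewrite every text node. Because the expected output turns " J i m " into "Jim", this sketch removes all whitespace rather than only trimming the ends; that reading of the requirement is an assumption, and the class and method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class WhitespaceStripper {
    // Recursively removes every whitespace character from all text nodes under the given node.
    static void stripWhitespace(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            node.setNodeValue(node.getNodeValue().replaceAll("\\s+", ""));
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            stripWhitespace(children.item(i));
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the SOAP body's first child.
        String xml = "<Person><Name> J i m </Name><Details><age> 1 9 </age></Details></Person>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        stripWhitespace(doc.getDocumentElement());
        System.out.println(doc.getElementsByTagName("Name").item(0).getTextContent());
    }
}
```

In the handler from the question, the call would presumably be stripWhitespace(root). To trim only leading and trailing whitespace, trim() could replace the replaceAll call.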

Merging of xml files java

I have two different XML files, described below, and want to merge them to get the expected output, perhaps using XPath or DOM parsing but not XSLT, since the XML files are not always the same.
XML1.xml
<personinfo>
<person>
<name></name>
<age></age>
<address>
<street></street>
<city></city>
</address>
</person>
<person>
<name></name>
<age></age>
<address>
<street></street>
<city></city>
</address>
</person>
<person>
<name></name>
<age></age>
<address>
<street></street>
<city></city>
</address>
</person>
</personinfo>
XML2.xml
<personinfo>
<person>
<name>tom</name>
<age>26</age>
<address>
<street>main street</street>
<city>washington</city>
</address>
</person>
<person>
<name>mike</name>
<age>30</age>
<address>
<street>first street</street>
<city>dallas</city>
</address>
</person>
</personinfo>
Expected.xml
<personinfo>
<person>
<name>tom</name>
<age>26</age>
<address>
<street>main street</street>
<city>washington</city>
</address>
</person>
<person>
<name>mike</name>
<age>30</age>
<address>
<street>first street</street>
<city>dallas</city>
</address>
</person>
<person>
<name></name>
<age></age>
<address>
<street></street>
<city></city>
</address>
</person>
</personinfo>
Thanks in advance.

If you have the flexibility to create a new XML file, you can parse each of the inputs using any parser you are comfortable with. Store the tags in a LinkedList of String LinkedLists and the tag values in a map of the following type:
Map<String, String> data = new LinkedHashMap<>();
You can then read the tag names from the linked lists, append the tag values from the map, and write them out to a new XML file.
This is the procedure I used when I merged XML files.
Hope this helps.

// file, stkey, keylist, data and the getXMLData helper are declared elsewhere (not shown).
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(file);
doc.getDocumentElement().normalize();
NodeList nodeLst = doc.getElementsByTagName("employee");
for (int s = 0; s < nodeLst.getLength(); s++) {
    stkey = getXMLData(s, nodeLst, "id");
    keylist.add(stkey); // remember the keys, in document order, in a linked list
    data.put(stkey, stkey);
    // Each field is stored under the ID plus a suffix naming the tag.
    data.put(stkey + "first", getXMLData(s, nodeLst, "firstname"));
    data.put(stkey + "last", getXMLData(s, nodeLst, "lastname"));
    data.put(stkey + "loc", getXMLData(s, nodeLst, "location"));
    data.put(stkey + "occ", getXMLData(s, nodeLst, "occupation"));
}
This stores the tag values in the hash map and the keys in the linked list. To make your work easier, you can append the type of tag to the hash map key. For example, if the key is the employee ID (as in my case), I append "first" to it. Say someone has the ID 10001: their data would be stored under 10001, 10001first, 10001last, 10001loc and 10001occ. You can then look up each hash map key, get the element for the appended tag name, and concatenate it into your XML file.
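The snippet relies on a getXMLData helper that is not shown. A plausible reconstruction, assuming it returns the text of the named child of the element at the given index, could look like the sketch below. The helper body, the EmployeeData class, the load method, and the sample data in main are all guesses for illustration, not the answerer's actual code.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class EmployeeData {
    // Hypothetical helper: text of the first <tag> descendant of the index-th element, or "" if absent.
    static String getXMLData(int index, NodeList nodes, String tag) {
        Element element = (Element) nodes.item(index);
        NodeList matches = element.getElementsByTagName(tag);
        return matches.getLength() == 0 ? "" : matches.item(0).getTextContent().trim();
    }

    // Builds the ID-plus-suffix map described above, for two of the fields.
    static Map<String, String> load(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList employees = doc.getElementsByTagName("employee");
        Map<String, String> data = new LinkedHashMap<>();
        for (int s = 0; s < employees.getLength(); s++) {
            String key = getXMLData(s, employees, "id");
            data.put(key, key);
            data.put(key + "first", getXMLData(s, employees, "firstname"));
            data.put(key + "last", getXMLData(s, employees, "lastname"));
        }
        return data;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample record.
        String xml = "<employees><employee><id>10001</id>"
            + "<firstname>Jim</firstname><lastname>Anderson</lastname>"
            + "</employee></employees>";
        System.out.println(load(xml));
    }
}
```

Because LinkedHashMap keeps insertion order, iterating over the map returns the entries in the order the employees appeared in the file.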
Hope this helps.
