how to get the data from xml feeds


I have the following feed from my vendor:
http://scores.cricandcric.com/cricket/getFeed?key=4333433434343&format=xml&tagsformat=long&type=schedule
I want to read the data from that XML feed into Java objects, so that I can insert it into my database regularly.
The feed is just regular updates from the vendor, which I use to keep my website up to date.
Can you please suggest what options I have to get this working?
Should I use a web service, or just XStream, to get my final output? Please advise, as I am a newcomer to this concept.
The vendor says he can give me the data in three formats (RSS, XML or JSON). I am not sure which is the easiest and least resource-intensive to get working.

I would suggest just writing a program that parses the XML and inserts the data directly into your database.
Example
This Groovy script inserts the data into an H2 database.
//
// Dependencies
// ============
import groovy.sql.Sql
@Grapes([
    @Grab(group='com.h2database', module='h2', version='1.3.163'),
    @GrabConfig(systemClassLoader=true)
])
//
// Main program
// ============
def sql = Sql.newInstance("jdbc:h2:db/cricket", "user", "pass", "org.h2.Driver")
def dataUrl = new URL("http://scores.cricandcric.com/cricket/getFeed?key=4333433434343&format=xml&tagsformat=long&type=schedule")
dataUrl.withReader { reader ->
    def feeds = new XmlSlurper().parse(reader)
    feeds.matches.match.each {
        def data = [
            it.id,
            it.name,
            it.type,
            it.tournamentId,
            it.location,
            it.date,
            it.GMTTime,
            it.localTime,
            it.description,
            it.team1,
            it.team2,
            it.teamId1,
            it.teamId2,
            it.tournamentName,
            it.logo
        ].collect {
            it.text()
        }
        sql.execute("INSERT INTO matches (id,name,type,tournamentId,location,date,GMTTime,localTime,description,team1,team2,teamId1,teamId2,tournamentName,logo) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", data)
    }
}

Well... you could use an XML parser (stream or DOM), or a JSON parser (again stream or 'DOM'), and build the objects on the fly. But with this data, which seems to consist of records of cricket matches, why not go with a CSV format?
This seems to be your basic 'datum':
<id>1263</id>
<name>Australia v India 3rd Test at Perth - Jan 13-17, 2012</name>
<type>TestMatch</type>
<tournamentId>137</tournamentId>
<location>Perth</location>
<date>2012-01-14</date>
<GMTTime>02:30:00</GMTTime>
<localTime>10:30:00</localTime>
<description>3rd Test day 2</description>
<team1>Australia</team1>
<team2>India</team2>
<teamId1>7</teamId1>
<teamId2>1</teamId2>
<tournamentName>India tour of Australia 2011-12</tournamentName>
<logo>/cricket/137/tournament.png</logo>
Of course you would still have to parse the CSV and deal with delimiter escaping (such as when you have a ' or a " in a string), but it will likely reduce your network traffic quite substantially, and probably parse much faster on the client. Of course, this depends on what your client is.
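To illustrate the delimiter problem mentioned above, here is a minimal sketch of splitting one CSV line in Java. It assumes the common convention that fields may be wrapped in double quotes and that a doubled quote ("") inside a quoted field stands for a literal quote. The class name CsvLine is made up for this example, the vendor has not said it offers CSV at all, and fields containing line breaks are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class CsvLine {

    /** Splits a single CSV line into fields, honouring double-quoted fields. */
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote: keep one literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote of the field
                    }
                } else {
                    current.append(c);       // commas inside quotes are plain data
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote of a field
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        String line = "1263,\"Australia v India 3rd Test at Perth - Jan 13-17, 2012\",TestMatch";
        for (String field : split(line)) {
            System.out.println(field);
        }
    }
}
```

The match name is a good test case because it contains a comma itself, so a plain String.split(",") would cut it in two.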

What you actually have is a RESTful data source that can return data in several formats; you only need to read from it, and no further interaction is needed.
So you can use any XML parser to parse the XML data and put the extracted data into whatever data structure you want or already have.
I have not heard of XStream, but you can find more information about selecting the best parser for your situation at this StackOverflow question.
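As a rough sketch of the "any XML parser" route, the following uses the DOM parser that ships with the JDK to turn match elements into plain Java objects. The element names (match, id, name, date, team1, team2) are taken from the sample posted in this thread; the class names MatchFeedParser and Match and the feed root element in the sample string are assumptions made for illustration, since the full feed structure is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical parser class for illustration only.
final class MatchFeedParser {

    /** Hypothetical value class holding a few of the fields seen in the feed. */
    static final class Match {
        final String id, name, date, team1, team2;

        Match(String id, String name, String date, String team1, String team2) {
            this.id = id;
            this.name = name;
            this.date = date;
            this.team1 = team1;
            this.team2 = team2;
        }
    }

    /** Reads every match element from the stream and maps it to a Match object. */
    static List<Match> parse(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        NodeList nodes = doc.getElementsByTagName("match");
        List<Match> matches = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element m = (Element) nodes.item(i);
            matches.add(new Match(text(m, "id"), text(m, "name"), text(m, "date"),
                    text(m, "team1"), text(m, "team2")));
        }
        return matches;
    }

    /** Text of the first descendant element with the given tag, or "" if it is missing. */
    private static String text(Element parent, String tag) {
        NodeList children = parent.getElementsByTagName(tag);
        return children.getLength() == 0 ? "" : children.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Inline sample; the <feed> root element is an assumption.
        String sample = "<feed><matches><match><id>1263</id>"
                + "<name>Australia v India 3rd Test at Perth - Jan 13-17, 2012</name>"
                + "<date>2012-01-14</date><team1>Australia</team1><team2>India</team2>"
                + "</match></matches></feed>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        for (Match m : parse(in)) {
            System.out.println(m.id + ": " + m.team1 + " v " + m.team2 + " on " + m.date);
        }
    }
}
```

For the real feed you would pass the stream opened from the vendor URL to parse(...) instead of the inline sample, and then hand the resulting list to your database code.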

Related

MarkLogic Java API PlanBuilderBase.ExportablePlanBase

I use PlanBuilder.ModifyPlan to retrieve the contents, and the results come back in a StringHandle.
I see PlanBuilderBase.ExportablePlanBase, but there is no reference showing how to use its exportAs method.
The call should be something like:
ExportablePlan ep = plan.exportAs(String);

Typically, an application wouldn't call exportAs().
Instead, an application would pass the plan to methods of the RowManager class. Internally, the implementation of such methods exports the plan for sending to the server.
In particular, the following RowManager methods take a plan and get its result rows or an explanation of the query preparation:
http://docs.marklogic.com/javadoc/client/com/marklogic/client/expression/class-use/PlanBuilder.Plan.html#com.marklogic.client.row
Here is an example of getting result rows:
http://docs.marklogic.com/guide/java/OpticJava#id_93678
RowManager also provides methods for binding parameters of the plan to literal values before sending the plan to the server:
http://docs.marklogic.com/javadoc/client/com/marklogic/client/expression/class-use/PlanBuilder.Plan.html#com.marklogic.client.expression
Examples of edge cases where an application might want to export a plan include:
logging
inserting into a JSON document so an enode script could import a plan without receiving the plan from the client
The exported plan is a JSON document (represented as a String, if the exportAs() method is used). After exporting the plan, the application could process the JSON document in the same way as any other JSON document. For instance, the application could use JSONDocumentManager to write the plan as a document in the content database.
Hoping that helps,

SDMX-ML: SAS libname XML

Eurostat data can be downloaded via a REST API. The response format of the API is an XML file formatted according to the SDMX-ML standard. With SAS, very conveniently, one can access XML files with the libname statement and the XML or XMLv2 engine.
Currently, I am using the xmlv2 engine together with the automap= option to generate an xmlmap to access the data. It works. But the resulting SAS data sets are very unstructured, and for another data set to be downloaded the data structure might change. Also the request might depend on the DSD-file that Eurostat provides for each database item within a different XML file.
Here comes the code:
%let path = /your/working/directory/;
filename map "&path.map.txt";
filename resp "&path.resp.txt";
proc http
URL="http://ec.europa.eu/eurostat/SDMX/diss-web/rest/data/cdh_e_fos/..PC.FOS1.BE/?startperiod=2005&endPeriod=2011"
METHOD="GET"
OUT=resp;
run;quit;
libname resp XMLv2 automap=REPLACE xmlmap=map;
proc datasets;
copy out=WORK in=resp;
run;quit;
With the code above, you can view all the downloaded data in your WORK library. It's a mess.
To download another time series, change the parameters of the URL according to Eurostat's description.
So here is my question:
Is there a way to easily generate an xmlmap from a call to the DSD file, so that the data are stored in a well-structured way?
As the SDMX-ML standard is widely used in public institutions such as the ECB, Eurostat and the OECD, I am wondering if somebody has already implemented requests to these databases. I know about the tool from Banca d'Italia, which uses a javaObject. However, I was wondering if there might be a solution without the javaObject.

How to extract data from a lot of URLs?

I have about 3200 URLs pointing to small XML files which hold some data in the form of strings (obviously). The XML files are displayed (not downloaded) when I go to the URLs. I need to extract some data from all those XML files and save it in a single .txt file, XML file or whatever. How can I automate this process?
*Note: This is what the files look like. I need to copy the 'location' and 'title' from all of them and put them in one single file. How can this be achieved?
<?xml version="1.0"?>
<playlist xmlns="http://xspf.org/ns/0/" version="1">
  <tracklist>
    <location>http://radiotool.com/fransn.mp3</location>
    <title>France, Paris radio 104.5</title>
  </tracklist>
</playlist>
*edit: Fixed XML.

It's easy enough with XQuery or XSLT, though the details will depend on how the URLs are held. If they're in a Java List, then (with Saxon at least) you can supply this list as a parameter to the following query:
declare variable $urls as xs:string* external;
<data>{
  for $u in $urls return doc($u)//*:tracklist
}</data>
The Java code would be something like:
// Needs java.util.* and net.sf.saxon.s9api.*
// query is a String holding the XQuery above; inputUrls is your List<String> of URLs
Processor proc = new Processor(false);
XQueryCompiler c = proc.newXQueryCompiler();
XQueryEvaluator q = c.compile(query).load();
List<XdmItem> urls = new ArrayList<>();
for (String url : inputUrls) {
    urls.add(new XdmAtomicValue(url));
}
q.setExternalVariable(new QName("urls"), new XdmValue(urls));
q.setDestination(...);
q.run();
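If Saxon is not available, roughly the same extraction can be sketched with the DOM parser built into the JDK. This is a different technique from the XQuery above, not a translation of it. The sketch assumes the URLs sit one per line in a file called urls.txt and writes tab-separated title and location lines to playlists.txt; both file names and the class name PlaylistExtractor are made up for this example.

```java
import java.io.InputStream;
import java.io.PrintWriter;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical class for illustration only.
final class PlaylistExtractor {

    /** Returns "title<TAB>location" for the first title and location in one playlist. */
    static String extract(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        return first(doc, "title") + "\t" + first(doc, "location");
    }

    /** Text of the first element with the given tag name, or "" if there is none. */
    private static String first(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // urls.txt and playlists.txt are assumed file names.
        List<String> urls = Files.readAllLines(Paths.get("urls.txt"));
        try (PrintWriter out = new PrintWriter("playlists.txt", "UTF-8")) {
            for (String url : urls) {
                if (url.trim().isEmpty()) {
                    continue; // ignore blank lines in the URL list
                }
                try (InputStream in = new URL(url.trim()).openStream()) {
                    out.println(extract(in));
                } catch (Exception e) {
                    // One unreachable or malformed file should not stop the whole run.
                    System.err.println("Skipping " + url + ": " + e.getMessage());
                }
            }
        }
    }
}
```

With 3200 small files a simple sequential loop like this is probably enough; failures are logged to standard error so they can be retried afterwards.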

Have a look at the JSoup library: http://jsoup.org/
It has facilities for fetching and cleaning up the contents of a URL. It is intended for HTML, though, so I'm not sure it will work well for XML, but it is worth a look.

ANDROID usage of Jackson library: How to load object with indexes - range from to

I have a really big JSON file to parse and manage. It has a structure like this:
[
{"id": "11040548","key1":"keyValue1","key2":"keyValue2","key3":"keyValue3","key4":"keyValue4","key5":"keyValue5","key6":"keyValue6","key7":"keyValue7","key8":"keyValue8","key9":"keyValue9","key10":"keyValue10","key11":"keyValue11","key12":"keyValue12","key13":"keyValue13","key14":"keyValue14","key15":"keyValue15"
},
{"id": "11040549","key1":"keyValue1","key2":"keyValue2","key3":"keyValue3","key4":"keyValue4","key5":"keyValue5","key6":"keyValue6","key7":"keyValue7","key8":"keyValue8","key9":"keyValue9","key10":"keyValue10","key11":"keyValue11","key12":"keyValue12","key13":"keyValue13","key14":"keyValue14","key15":"keyValue15"
},
....
{"id": "11040548","key1":"keyValue1","key2":"keyValue2","key3":"keyValue3","key4":"keyValue4","key5":"keyValue5","key6":"keyValue6","key7":"keyValue7","key8":"keyValue8","key9":"keyValue9","key10":"keyValue10","key11":"keyValue11","key12":"keyValue12","key13":"keyValue13","key14":"keyValue14","key15":"keyValue15"
}
]
My JSON file contains data about topics from a news website, and it will grow dramatically practically every day.
To parse that file I use:
URL urlLinkSource = new URL(OUTBOX_URL);
urlLinkSourceReader = new BufferedReader(new InputStreamReader(
urlLinkSource.openStream(), "UTF-8"));
ObjectMapper mapper = new ObjectMapper();
List<DataContainerList> DataContainerListData = mapper.readValue(urlLinkSourceReader,new TypeReference<List<DataContainerList>>() { }); //DataContainerList contains id, key1, key2, key3..key15
My problem is that in this line I want to load
List<DataContainerList> DataContainerListData = mapper.readValue(urlLinkSourceReader,new TypeReference<List<DataContainerList>>() { });
only a range of JSON objects (just the first ten objects, or just the second ten), because my app needs to display just 10 news items at a time in paging mode (I always know the index of the 10 I need to display). It is totally stupid to load 10,000 objects and iterate over just the first 10 of them. So my question is how I can load,
in a similar way to this:
List<DataContainerList> DataContainerListData = mapper.readValue(urlLinkSourceReader,new TypeReference<List<DataContainerList>>() { });
only the objects with indexes FROM-TO (for example from 30 to 40), without loading all the objects in the entire JSON file?
Regards

It depends on what you mean by "load object with indexes from to". If you want to:
Read everything but bind only a sublist
The solution in that case is to read the full stream and bind only the values within those indexes.
You can use Jackson's streaming API and do it yourself: parse the stream, use a counter to keep track of the current index, and bind to POJOs only what you need.
However, this is not a good solution if your file is large and the parsing is done in real time.
Read only the data between those indexes
You should do that if your file is big and performance matters. Instead of having a single big file, paginate by splitting your JSON array into multiple files matching your ranges, and then just deserialize the specific file's content into your list.
Hope this helps...

Nutch Seed URLs

Is it possible to get URLs into Nutch directly from a database or a service, etc.? I'm not interested in approaches where the data is taken from the database or service and written to seed.txt.

No. This cannot be done directly with the default Nutch codebase. You need to modify Injector.java to achieve that.
EDIT:
Try using DBInputFormat, an InputFormat that reads input data from an SQL table. You need to modify the Inject code here (line 3 in the snippet below):
JobConf sortJob = new NutchJob(getConf());
sortJob.setJobName("inject " + urlDir);
FileInputFormat.addInputPath(sortJob, urlDir);
sortJob.setMapperClass(InjectMapper.class);
