How to extract data from a lot of URLs?

I have about 3,200 URLs that point to small XML files containing string data. The XML files are displayed in the browser (not downloaded) when I open the URLs. I need to extract some data from all of those files and save it in a single .txt file, XML file, or whatever. How can I automate this process?
*Note: This is what the files look like. I need to copy the 'location' and 'title' from each of them into one single file. How can this be achieved?
<?xml version="1.0"?>
<playlist xmlns="http://xspf.org/ns/0/" version="1">
<tracklist>
<location>http://radiotool.com/fransn.mp3</location>
<title>France, Paris radio 104.5</title>
</tracklist>
</playlist>
*edit: Fixed XML.

It's easy enough with XQuery or XSLT, though the details will depend on how the URLs are held. If they're in a Java List, then (with Saxon at least) you can supply this list as a parameter to the following query:
declare variable $urls as xs:string* external;
<data>{
for $u in $urls return doc($u)//*:tracklist
}</data>
The Java code would be something like:
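// Uses Saxon's s9api classes (net.sf.saxon.s9api.*) plus java.util.ArrayList and java.util.List
Processor proc = new Processor(false); // false = no licensed features required
XQueryCompiler c = proc.newXQueryCompiler();
XQueryEvaluator q = c.compile(query).load(); // query is a String holding the XQuery above
List<XdmItem> urls = new ArrayList<>();
for (String url : inputUrls) {
    urls.add(new XdmAtomicValue(url));
}
q.setExternalVariable(new QName("urls"), new XdmValue(urls));
q.setDestination(...);
q.run();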

Have a look at the jsoup library: http://jsoup.org/
It can fetch and clean up the contents of a URL. It is intended for HTML, though, so I'm not sure how well it handles XML, but it is worth a look.
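If you would rather avoid extra libraries, here is a minimal JDK-only sketch of the same idea. It assumes every file has the shape shown in the question (one 'location' and one 'title' element); the class name PlaylistExtractor and the tab-separated output are my own choices for illustration, not anything from the answers above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class PlaylistExtractor {

    /** Returns {location, title} from one playlist document; "" for a missing element. */
    public static String[] extract(InputStream xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // the sample declares the xspf namespace
            Document doc = factory.newDocumentBuilder().parse(xml);
            return new String[] { firstText(doc, "location"), firstText(doc, "title") };
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse playlist", e);
        }
    }

    /** Text of the first element with this local name, in any namespace. */
    private static String firstText(Document doc, String localName) {
        NodeList nodes = doc.getElementsByTagNameNS("*", localName);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Inline copy of the sample from the question, so the sketch runs offline.
        String sample = "<?xml version='1.0'?>"
                + "<playlist xmlns='http://xspf.org/ns/0/' version='1'><tracklist>"
                + "<location>http://radiotool.com/fransn.mp3</location>"
                + "<title>France, Paris radio 104.5</title>"
                + "</tracklist></playlist>";
        String[] row = extract(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)));
        System.out.println(row[0] + "\t" + row[1]);
        // For the real job, loop over the URLs and pass new java.net.URL(u).openStream()
        // to extract(), appending each row to the output file.
    }
}
```

With 3,200 URLs you will probably want to catch failures per URL and keep going rather than abort the whole run.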

Related

Java - How to extract information from many XML files in a directory and export it to an excel file

I am trying to get information from many XML files in a directory.
How can I extract specific information from each one and write it to an Excel file in Java?
file 1.xml
file 2.xml
file 3.xml
Desired output: file.csv or file.xls with the information from all n XML files

There are several Java libraries that can help you do this.
For instance, to get information out of XML you can use dom4j and select the specific values with the XPath query language, which the library supports. To read all the XML files from a directory, Java 8 has an easy way of achieving that:
Files.list(Paths.get("/path/to/xml/files"))
.map(YourXMLParser::parse)
.forEach(XLSExporter::export);
where the parse method would look something like this:
public static MyDataBean parse(Path path) {
    InputStream inputStream = Files.newInputStream(path); // throws IOException: catch or wrap it
    Document document = new SAXBuilder().build(inputStream); // JDOM's SAXBuilder, as one option
    ... <-- read the values you need and return them in a custom bean (MyDataBean)
}
Because Files.list() returns a Stream, you can use map and forEach on it.
Once you have the information from each XML file, you can export it to XLS using the most widely used Java library for that: Apache POI.
I hope it can help.
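To make the pipeline concrete, here is a rough JDK-only sketch that writes CSV (which Excel opens) instead of XLS, so it needs neither dom4j nor Apache POI. The class name XmlDirToCsv, the 'title' element, and the sample files are all made up for the demo; swap in your own element names and directory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NodeList;

public class XmlDirToCsv {

    /** Quotes a CSV field when it contains a comma, a quote, or a line break. */
    public static String csvField(String s) {
        if (s.contains(",") || s.contains("\"") || s.contains("\n") || s.contains("\r")) {
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }
        return s;
    }

    /** Text of the first element named tagName in the file, or "" if there is none. */
    public static String firstText(Path file, String tagName) {
        try (InputStream in = Files.newInputStream(file)) {
            NodeList nodes = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(in).getElementsByTagName(tagName);
            return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse " + file, e);
        }
    }

    /** One "fileName,value" row per .xml file in dir, sorted by file name. */
    public static List<String> collect(Path dir, String tagName) {
        try (Stream<Path> files = Files.list(dir)) {
            return files.filter(p -> p.getFileName().toString().endsWith(".xml"))
                    .sorted()
                    .map(p -> csvField(p.getFileName().toString()) + ","
                            + csvField(firstText(p, tagName)))
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo data only: a temporary directory holding two tiny XML files. */
    public static Path sampleDir() {
        try {
            Path dir = Files.createTempDirectory("xml-demo");
            Files.write(dir.resolve("1.xml"),
                    "<item><title>First</title></item>".getBytes(StandardCharsets.UTF_8));
            Files.write(dir.resolve("2.xml"),
                    "<item><title>Second, with comma</title></item>".getBytes(StandardCharsets.UTF_8));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = sampleDir();
        List<String> rows = collect(dir, "title");
        Files.write(dir.resolve("out.csv"), rows); // one row per line, UTF-8
        rows.forEach(System.out::println);
    }
}
```

Note that the stream from Files.list() is closed with try-with-resources here; it holds an open directory handle until it is closed.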

How to read/write ID3v2 tags with java?

I want to be able to read ID3v2 tags from MP3 files. I have found multiple frameworks that let me get the basic tags, but I would also like to be able to read those in this list. I don't care about backwards compatibility with ID3v1.
I have already looked at Jaudiotagger and mp3agic, but I couldn't find out how to use them for custom tags. Is this possible?

You have to use the string identifier "TXXX"; that's the main difference between "normal" and custom fields. I think the following code should be self-explanatory:
AudioFile audioFile = AudioFileIO.read(songFile); // songFile is a java.io.File; wrap in try/catch
MP3File mp3 = (MP3File) audioFile;
AbstractID3v2Tag v2Tag = mp3.getID3v2Tag();
// Since you've mentioned a list
List<TagField> tagList = v2Tag.getFields("TXXX");
That's one way to do it. The list now holds all tags with the TXXX identifier. You can simply call toString() on every element in the list and you should get something along these lines:
Description="Play Count"; Text="1.000.000"
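In case it helps to see where the Description and Text come from: as I understand the ID3v2 spec, a TXXX frame body is one encoding byte, then a null-terminated description, then the value. Below is a small standalone sketch that decodes such a body by hand. It is not Jaudiotagger's API; the class name TxxxFrame is made up, and it assumes you already have the frame body bytes (everything after the frame header).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TxxxFrame {
    public final String description;
    public final String text;

    private TxxxFrame(String description, String text) {
        this.description = description;
        this.text = text;
    }

    /** Decodes a TXXX frame body: [encoding byte][description][terminator][value]. */
    public static TxxxFrame decode(byte[] body) {
        Charset charset;
        int terminatorLength; // 1 zero byte for single-byte encodings, 2 for UTF-16
        switch (body[0]) {
            case 0: charset = StandardCharsets.ISO_8859_1; terminatorLength = 1; break;
            case 1: charset = StandardCharsets.UTF_16; terminatorLength = 2; break; // BOM picks byte order
            case 2: charset = StandardCharsets.UTF_16BE; terminatorLength = 2; break;
            case 3: charset = StandardCharsets.UTF_8; terminatorLength = 1; break;
            default: throw new IllegalArgumentException("Unknown text encoding: " + body[0]);
        }
        // Walk forward one code unit at a time until the terminator (or the end of the body).
        int end = 1;
        while (end + terminatorLength <= body.length && !isZero(body, end, terminatorLength)) {
            end += terminatorLength;
        }
        String description = new String(body, 1, end - 1, charset);
        int valueStart = Math.min(end + terminatorLength, body.length);
        String text = new String(body, valueStart, body.length - valueStart, charset);
        return new TxxxFrame(description, stripTrailingNulls(text));
    }

    private static boolean isZero(byte[] data, int offset, int length) {
        for (int i = 0; i < length; i++) {
            if (data[offset + i] != 0) {
                return false;
            }
        }
        return true;
    }

    private static String stripTrailingNulls(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == '\0') {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // Hand-built UTF-8 frame body carrying the values from the example output above.
        byte[] d = "Play Count".getBytes(StandardCharsets.UTF_8);
        byte[] t = "1.000.000".getBytes(StandardCharsets.UTF_8);
        byte[] body = new byte[1 + d.length + 1 + t.length];
        body[0] = 3; // UTF-8
        System.arraycopy(d, 0, body, 1, d.length);
        System.arraycopy(t, 0, body, d.length + 2, t.length); // byte between them stays 0
        TxxxFrame frame = decode(body);
        System.out.println("Description=\"" + frame.description + "\"; Text=\"" + frame.text + "\"");
    }
}
```

In practice you would let the tagging library do this for you; the sketch is only meant to show why each TXXX field carries both a description (your custom tag name) and a value.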

Merging two .odt files from code

How do you merge two .odt files? Doing it by hand, opening each file and copying the content, would work, but is infeasible.
I have tried the odftoolkit Simple API (simple-odf-0.8.1-incubating) to achieve that task, creating an empty TextDocument and merging everything into it:
private File masterFile = new File(...);
...
TextDocument t = TextDocument.newTextDocument();
t.save(masterFile);
...
for(File f : filesToMerge){
joinOdt(f);
}
...
void joinOdt(File joinee){
TextDocument master = (TextDocument) TextDocument.loadDocument(masterFile);
TextDocument slave = (TextDocument) TextDocument.loadDocument(joinee);
master.insertContentFromDocumentAfter(slave, master.getParagraphByReverseIndex(0, false), true);
master.save(masterFile);
}
And that works reasonably well; however, it loses information about fonts. The original files are a combination of Arial Narrow and Wingdings (for check boxes), while the output masterFile is all in Times New Roman. At first I suspected the last parameter of insertContentFromDocumentAfter, but changing it to false breaks (almost) all formatting. Am I doing something wrong? Is there any other way?

I think this "works as designed".
I tried this once with a global (master) document, which imports documents and displays them as they are, as long as the paragraph styles have different names.
Styles that share a name are overwritten with the values the "master" document has.
So I ended up cloning the standard styles under unique (per-document) names.
HTH

My case was a rather simple one: the files I wanted to merge were generated the same way and used the same basic formatting. Therefore, starting from one of my files instead of an empty document fixed my problem.
However, this question will remain open until someone comes up with a more general solution for retaining formatting (possibly based on ngulam's answer and comments?).

How to match JAXB elements in CIM/RDF?

While trying to load a model from a CIM/XML file according to IEC 61970 (Common Information Model, for power system models), I ran into a problem.
In JAXB, references between elements are expressed with @XmlIDREF and @XmlID, and the two values must be equal to match. But in CIM/RDF a reference to a resource by ID, e.g. rdf:resource="#_37C0E103000D40CD812C47572C31C0AD", contains the "#" character. Consequently, JAXB is unable to match "GeographicalRegion" with "SubGeographicalRegion.Region" when the "#" character is present in the rdf:resource attribute.
Here is an example:
<cim:GeographicalRegion rdf:ID="_37C0E103000D40CD812C47572C31C0AD">
<cim:IdentifiedObject.name>GeoRegion</cim:IdentifiedObject.name>
<cim:IdentifiedObject.localName>OpenCIM3bus</cim:IdentifiedObject.localName>
</cim:GeographicalRegion>
<cim:SubGeographicalRegion rdf:ID="_ID_SubGeographicalRegion">
<cim:IdentifiedObject.name>SubRegion</cim:IdentifiedObject.name>
<cim:IdentifiedObject.localName>SubRegion</cim:IdentifiedObject.localName>
<cim:SubGeographicalRegion.Region rdf:resource="#_37C0E103000D40CD812C47572C31C0AD"/>
</cim:SubGeographicalRegion>

I realize you're asking for a solution using JAXB, but I would urge you to consider an RDF-based solution, as it is more flexible and robust. You're basically trying to reinvent what RDF parsers already have built in. RDF/XML is a difficult format to parse, and it doesn't make much sense to try to hack your own parsing together - especially since files that have very different XML structures can express exactly the same information, which only becomes apparent when looking at the level of the RDF. You may find that your JAXB parser workaround works on one CIM/RDF file but completely fails on another.
So, here's an example of how to process your file using the Sesame RDF API. No inferencing is involved; this just parses the file and puts it in an in-memory RDF model, which you can then manipulate and query from any angle.
Assuming the root element of your CIM file looks something like this:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:cim="http://example.org/cim/">
(only a guess of course, but I need prefixes for a proper example)
Then you can do the following, using Sesame's Rio RDF/XML parser:
String baseURI = "http://example.org/my/file";
FileInputStream in = new FileInputStream("/path/to/my/cim.rdf");
Model model = Rio.parse(in, baseURI, RDFFormat.RDFXML);
This creates an in-memory RDF model of your document. You can then simply filter-query over that. For example, to print out the properties of all resources that have _37C0E103000D40CD812C47572C31C0AD as their SubGeographicalRegion.Region:
String CIM_NS = "http://example.org/cim/";
ValueFactory vf = ValueFactoryImpl.getInstance();
URI subRegion = vf.createURI(CIM_NS, "SubGeographicalRegion.Region");
URI res = vf.createURI("http://example.org/my/file#_37C0E103000D40CD812C47572C31C0AD");
Set<Resource> subs = model.filter(null, subRegion, res).subjects();
for (Resource sub: subs) {
System.out.println("resource: " + sub + " has the following properties: ");
for (URI prop: model.filter(sub, null, null).predicates()) {
System.out.println(prop + ": " + model.filter(sub, prop, null).objectValue());
}
}
Of course at this point you can also choose to convert the model to some other syntax format for further handling by your application - as you see fit. The point is that the difference between the identifiers with the leading # and without has been resolved for you by the RDF/XML parser.
This is of course personal opinion only, since I don't know the details of your use case, but I think you'll find that this is quite quick and flexible. I should also point out that although the above solution keeps the entire model in memory, you can easily adapt this to a more streaming (and therefore less memory-intensive) approach if you find your files are too big.
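To illustrate why the leading # stops being a problem once identifiers are treated as URIs, here is a tiny JDK-only sketch (no Sesame involved). It assumes the same made-up base URI as above; the class and method names are mine. An rdf:ID value names the resource base + "#" + id, while an rdf:resource value is a URI reference resolved against the base, so both end up as the same absolute URI.

```java
import java.net.URI;

public class RdfIdResolution {

    /** rdf:ID="x" identifies the resource <base#x>. */
    public static URI fromRdfId(URI base, String id) {
        return base.resolve("#" + id);
    }

    /** rdf:resource="ref" is a URI reference resolved against the base URI. */
    public static URI fromRdfResource(URI base, String reference) {
        return base.resolve(reference);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.org/my/file"); // assumed base URI, as above
        URI declared = fromRdfId(base, "_37C0E103000D40CD812C47572C31C0AD");
        URI referenced = fromRdfResource(base, "#_37C0E103000D40CD812C47572C31C0AD");
        System.out.println(declared);
        System.out.println(referenced);
        System.out.println("same resource: " + declared.equals(referenced));
    }
}
```

This is the normalization an RDF/XML parser performs for you, which is why the filter query above can use the full URI without caring how the file spelled the reference.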

how to parse 2 xml files in android

Hi, I am a new developer and I want to know how to parse two XML files in one project.
I have two XML files. The first one is as follows:
<?xml version="1.0"?>
<X>
<Y>
<Z>
A
</Z>
<packs>
<pack>
<packname>B</packname>
</pack>
</packs>
</Y>
</X>
The second XML file looks as follows:
<s>
<t>
<question>abc</question>
<question>def</question>
<question>ghi</question>
</t>
</s>
The first XML file works for me: when I touch A, it moves on to B. Now, when I touch B, I want to show only the first question, i.e. abc. Can this be done? It is not working for me.
Please tell me how to move from one XML file to the other.
Can anyone explain this with some sample code?
Where should I store the two files to be parsed? I have tried storing them in the raw folder in Resources.

Here is an overview of some XML parsers available for Android, including some examples.
Where you store your XML file depends on your application's needs (XML from a web service call may just remain in memory temporarily). Files like these should generally be stored in the raw folder.

You can use a SAX parser or a pull parser to parse the XML. Here are some links for help:
SAXParser
Example
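Since the links may not survive, here is a rough SAX sketch for the second file. It only uses javax.xml.parsers and org.xml.sax, which exist both on the JDK and on Android. The class name QuestionParser is made up, and how you wire it to the touch on B depends on your app; the idea is just to parse the second file when B is tapped and show the first entry of the returned list.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class QuestionParser {

    /** Returns the text of every <question> element, in document order. */
    public static List<String> parseQuestions(InputStream xml) {
        final List<String> questions = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside a <question>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("question".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length); // may be called several times per element
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("question".equals(qName)) {
                    questions.add(text.toString().trim());
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse questions", e);
        }
        return questions;
    }

    public static void main(String[] args) {
        String second = "<s><t><question>abc</question><question>def</question>"
                + "<question>ghi</question></t></s>";
        List<String> questions =
                parseQuestions(new ByteArrayInputStream(second.getBytes(StandardCharsets.UTF_8)));
        System.out.println(questions.get(0)); // the one to show when B is touched
        // On Android the stream could come from the raw folder instead, e.g.
        // getResources().openRawResource(R.raw.questions), where "questions" is a hypothetical file name.
    }
}
```

The first file can be handled the same way with its own handler (or the same one, parameterized by element name), so the two files never need to know about each other.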
