Extract JSON-LD from HTML using Apache Any23

My aim is to extract structured data from webpages. I'm using the code mentioned in this SO question, with the Apache Any23 CLI library as a dependency in my Spring project.
With this I can extract the HTML5 Microdata (Schema.org) from webpages, but I can't extract the JSON-LD that is also present in them. According to Apache Any23's documentation, the JSON-LD format is supported, but I didn't find any further documentation on it.

Usually, if you create a new Any23 extractor with new Any23(), it should work out of the box. If you use another constructor such as Any23(String... extractorNames), you have to make sure that the extractor for embedded JSON-LD, "html-embedded-jsonld", is included.
Now if there are any errors in the extraction process, Any23 drops them silently. (It's great, I know!)
I found it is possible to set a breakpoint in the notifyIssue method of org.apache.any23.extractor.ExtractionResultImpl. With this you may be able to find a more detailed reason for your problems.
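Before stepping through Any23 in a debugger, it can help to confirm that the raw HTML you feed it really contains an embedded JSON-LD block. The following is a minimal sketch of such a sanity check using only the JDK. It is not part of Any23's API; the class and method names (JsonLdProbe, findJsonLdBlocks) are made up for illustration, and a regex is only a rough probe, not a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a rough probe for <script type="application/ld+json"> blocks.
class JsonLdProbe {
    // Matches a script element whose type attribute is application/ld+json
    // and captures its content (group 1).
    private static final Pattern JSON_LD_SCRIPT = Pattern.compile(
            "<script\\b[^>]*\\btype\\s*=\\s*[\"']application/ld\\+json[\"'][^>]*>(.*?)</script>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed content of every JSON-LD script block found in the HTML.
    static List<String> findJsonLdBlocks(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher m = JSON_LD_SCRIPT.matcher(html);
        while (m.find()) {
            blocks.add(m.group(1).trim());
        }
        return blocks;
    }
}
```

If this returns an empty list for the HTML you downloaded, the JSON-LD is probably injected by client-side JavaScript and is simply not in the document any extractor gets to see.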

Related

Podio API, attaching files to items

I have a problem with attaching a file to a specific item using the Java API. I know it should be possible, as this functionality is described in the Podio documentation at https://developers.podio.com/doc/files/attach-file-22518, and examples for PHP and Ruby are given. However, I cannot find such a method in the Podio Java library. In FileAPI I could only find methods for uploading files, not for attaching them to specific objects as described in the documentation.
I use Podio API version 0.7.1.
Any ideas how it should be done in Java?

Podio uses a REST-style API. You send a standard HTTP request and get back JSON-formatted data, so you can do it all without a special library for your programming language.
If there is no predefined Java class for this, you can make the call yourself. In the end it is just an HTTP call.
From the Ruby implementation, I see that you attach the file as multipart/form-data, the same way a browser would. There should be HTTP-handling Java classes to help you.
You also need to add the information from the API page, such as the POST parameters and, of course, the URL. The most difficult part is probably the authentication headers, but you only need to solve that problem once.
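As a rough illustration of "doing the call yourself", here is a sketch that encodes a multipart/form-data body with plain JDK classes. The class name MultipartSketch and the field names used in the usage note are assumptions for illustration only; take the real parameter names, URL and authentication header from the Podio API page.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper: encodes text fields plus one file part as multipart/form-data.
class MultipartSketch {
    private static void writeText(ByteArrayOutputStream out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }

    static byte[] buildBody(String boundary, Map<String, String> fields,
                            String fileField, String fileName, byte[] fileBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // One part per plain text field.
        for (Map.Entry<String, String> field : fields.entrySet()) {
            writeText(out, "--" + boundary + "\r\n");
            writeText(out, "Content-Disposition: form-data; name=\"" + field.getKey() + "\"\r\n\r\n");
            writeText(out, field.getValue() + "\r\n");
        }
        // The file part carries a filename and its own content type.
        writeText(out, "--" + boundary + "\r\n");
        writeText(out, "Content-Disposition: form-data; name=\"" + fileField
                + "\"; filename=\"" + fileName + "\"\r\n");
        writeText(out, "Content-Type: application/octet-stream\r\n\r\n");
        out.write(fileBytes, 0, fileBytes.length);
        // The closing delimiter ends with two extra dashes.
        writeText(out, "\r\n--" + boundary + "--\r\n");
        return out.toByteArray();
    }
}
```

On Java 11 or later, the resulting bytes can be sent with java.net.http.HttpRequest.BodyPublishers.ofByteArray(body), together with a "Content-Type: multipart/form-data; boundary=..." header that names the same boundary and whatever authentication header the API requires.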

How do I connect to a WSDL service using Java?

I want to get end-of-day stock quotes using Java and was given a WSDL URL showing the XML.
All the places that I find on this topic want to show me how to create a service, and that is complicated. All I want to do is connect to the url and get the data.
This link seems close, but still wants to generate some xml code.
http://axis.apache.org/axis2/java/core/tools/eclipse/wsdl2java-plugin.html
Anyone have a simple java example where you get data from a WSDL url?
Thanks

I highly recommend using something like http://cxf.apache.org/docs/wsdl-to-java.html. Otherwise, you'll have all the pain of trying to deal with the SOAP protocol and all of its associated quirks and hoops.
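For a sense of what generated stubs do on your behalf, this is roughly what a single hand-written SOAP 1.1 call involves with the JDK's built-in HTTP client (Java 11+). It is only a sketch: the class name SoapRequestSketch is made up, and the operation element, namespace and SOAPAction value you pass in must come from the actual WSDL.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds (but does not send) a SOAP 1.1 request by hand.
class SoapRequestSketch {
    // Wraps an operation fragment in a SOAP 1.1 envelope.
    static String envelope(String bodyXml) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
                + "<soap:Body>" + bodyXml + "</soap:Body>"
                + "</soap:Envelope>";
    }

    // SOAP 1.1 is an HTTP POST of text/xml plus a quoted SOAPAction header.
    static HttpRequest buildRequest(String endpoint, String soapAction, String bodyXml) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "text/xml; charset=utf-8")
                .header("SOAPAction", "\"" + soapAction + "\"")
                .POST(HttpRequest.BodyPublishers.ofString(envelope(bodyXml)))
                .build();
    }
}
```

Sending it is then a matter of HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), after which you still have to parse the XML response yourself. That parsing and the fault handling are what the generated code takes care of.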

You can use wsimport.
Copy-paste solution
wsimport -keep http://localhost:9999/ws/hello?wsdl
Tutorial

Invoke HSSF Serializer Invocation

I have to write a very large XLS file. I have tried Apache POI, but it simply takes up too much memory for me to use.
I had a quick look through StackOverflow and I noticed some references to the Cocoon project and, specifically the HSSFSerializer. It seems that this is a more memory-efficient way to write XLS files to disk (from what I've read, please correct me if I'm wrong!).
I'm interested in the use case described here: http://cocoon.apache.org/2.1/userdocs/xls-serializer.html . I've already written the code to write out the file in the Gnumeric format, but I can't seem to find how to invoke the HSSFSerializer to convert it to XLS.
On further reading it seems like the Cocoon project is a web framework of sorts. I may very well be barking up the wrong tree, but:
Could you provide an example of reading in a file, running the HSSFSerializer on it and writing that output to another file? It's not clear how to do so from the documentation.

My friend, the HSSF serializer is part of POI. You are just setting certain attributes in the XML to be serialized (but you need a whole process to create it). Also, setting up a whole pipeline with this framework just to create an XLS file seems odd, as it changes the app's architecture. Is that your decision?
From the docs:
An alternate way of generating a spreadsheet is via the Cocoon
serializer (yet you'll still be using HSSF indirectly). With Cocoon
you can serialize any XML datasource (which might be a ESQL page
outputting in SQL for instance) by simply applying the stylesheet and
designating the serializer.
If memory is an issue, try SXSSF in POI, the streaming variant of XSSF. Note that both write .xlsx rather than .xls.

I don't know if by "XLS" you mean a specific, prior to Office 2007, version of this "Horrible SpreadSheet Format" (which is what HSSF stands for), or just anything you can open with a recent version of MS Office, OpenOffice, ...
So depending on your client requirements (i.e. those of whoever will open your Excel file), another option might be available: generating an .XLSX file.
It comes down to producing an XML file in the proper grammar, which seems to fit your situation, as you have apparently already done that with the Gnumeric XML-based file format without technical trouble and without hitting memory-efficiency issues.
Please note that other XML-based spreadsheet formats exist that Excel and other clients are able to use. You might want to dig into the OpenDocument file formats.
As to whether to use Apache Cocoon or something else:
Cocoon can certainly host the XSL processing. Batch processing (Cocoon CLI) is available if you need Cocoon but do not want it to run as a webapp (though, as far as I remember, the CLI feature was broken in the latest builds of the 2.1 series). Cocoon also comes with a load of features and technologies that could address further requirements.
Cocoon might be overkill if it just comes down to running an XSL transformation, for which there are plenty of well-known, lighter tools to pick from.
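One such lighter tool is the XSLT support built into the JDK (javax.xml.transform). The sketch below shows a transformation outside of Cocoon; the class name XsltSketch is made up for illustration, and the stylesheet itself (for example Gnumeric XML to another spreadsheet grammar) is something you would have to supply.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: runs an XSL transformation with the JDK's built-in processor.
class XsltSketch {
    // Applies the stylesheet to the input and writes the result to 'out'.
    static void transform(Source xml, Source xslt, Result out) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer(xslt);
            transformer.transform(xml, out);
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    // Convenience wrapper for in-memory strings, handy for small experiments.
    static String transformString(String xml, String xslt) {
        StringWriter out = new StringWriter();
        transform(new StreamSource(new StringReader(xml)),
                new StreamSource(new StringReader(xslt)),
                new StreamResult(out));
        return out.toString();
    }
}
```

For files, pass new StreamSource(new File(...)) and new StreamResult(new File(...)) to transform instead. Keep in mind that a typical XSLT 1.0 processor reads the whole input document into memory, so this may not help with very large inputs.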

Getting data using OAI-PMH from an institutional repository

I'm developing an application where I've requested data from an external institution's website. They have informed me that the data will be provided by OAI-PMH.
Could someone show me some sample code in Java for extracting data via OAI-PMH?
I wonder how different it is from reading and parsing XML data.
Thank you.
Warmest wishes,
Shoubhik

For a Java implementation, you could use an existing library such as XOAI, which has an easy-to-use API. There are some provided samples.
To extract metadata from each record, you could use an XML parser or an XML binding approach (JAXB). For other languages, like PHP and Perl, there are other alternatives.
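To show how close this is to ordinary XML handling, here is a sketch that uses only the JDK rather than XOAI. An OAI-PMH harvest is an HTTP GET with a verb parameter, followed by XML parsing. The class and method names are made up for illustration, and the base URL is something the institution has to give you; the verb, the oai_dc metadata prefix and the namespaces come from the OAI-PMH 2.0 and Dublin Core specifications.

```java
import java.io.StringReader;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: builds ListRecords URLs and reads values from the response.
class OaiPmhSketch {
    static final String OAI_NS = "http://www.openarchives.org/OAI/2.0/";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // First request names the metadata format; follow-up requests send only the token.
    static String listRecordsUrl(String baseUrl, String resumptionToken) {
        if (resumptionToken == null || resumptionToken.isEmpty()) {
            return baseUrl + "?verb=ListRecords&metadataPrefix=oai_dc";
        }
        return baseUrl + "?verb=ListRecords&resumptionToken="
                + URLEncoder.encode(resumptionToken, StandardCharsets.UTF_8);
    }

    // Parses a response body with a namespace-aware DOM parser.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse OAI-PMH response", e);
        }
    }

    // Collects the text of all elements with the given namespace and local name.
    static List<String> texts(Document doc, String namespace, String localName) {
        NodeList nodes = doc.getElementsByTagNameNS(namespace, localName);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values;
    }
}
```

In a harvesting loop you would fetch listRecordsUrl(baseUrl, null), read for example texts(doc, DC_NS, "title"), and then repeat with the value of texts(doc, OAI_NS, "resumptionToken") until the response no longer contains a non-empty token.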

Clientside Javascript --> Serverside Java --> user is served a .doc

I am helping someone out with a javascript-based web app (even though I know next to nothing about web development) and we are unsure about the best way to implement a feature we'd like to have.
Basically, the user will be using our tool to view all kinds of boring data in tables, columns, etc. via javascript. We want to implement a feature where the user can click a button or link that then allows the user to download the displayed data in a .doc file.
Our basic idea so far is something like:
call a Java function on the server with the desired data passed in as a String when the link is clicked
generate the .doc file on the server
automatically "open" a link to the file in the client's browser to initiate the download
Is this possible? If so, is it feasible? Or, can you recommend a better solution?
edit: the data does not reside on the server; rather, it is queried from a SQL database

Yep, it's possible. Your saviour is the Apache POI library. Its HWPF component will help you generate Microsoft Word files using Java. The rest is just clever use of HTTP.

Your basic idea sounds a bit Rube-Goldbergesque.
Is the data you want in the document present on the server? If so, then all you need to do is display a plain HTML link with GET parameters that describes the data (i.e. data for customer X from date A to date B). The link will be handled on the server by a Servlet that gets the data and produces the .DOC file as its output to be downloaded by the browser - a very simple one-step process that doesn't even involve any JavaScript.

Passing a large amount of data around as GET/POST parameters might not be the best idea. You could just pass in the same parameters you used to generate the HTML page earlier. You don't even need a third-party library to generate the DOC. You could just generate a plain old HTML file with a .doc extension, and Word will be happy to open it.
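A minimal sketch of the HTML-as-.doc idea follows. The class name, the file name report.doc and the table layout are assumptions for illustration; the rows would come from your SQL query.

```java
import java.util.List;

// Hypothetical helper: renders rows as an HTML table that Word can open as a .doc.
class HtmlDocSketch {
    // Response headers that make the browser download the page as a Word file.
    static final String CONTENT_TYPE = "application/msword";
    static final String CONTENT_DISPOSITION = "attachment; filename=\"report.doc\"";

    // Escapes the characters that would otherwise break the markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds one <tr> per row and one <td> per cell.
    static String toHtmlDocument(List<List<String>> rows) {
        StringBuilder html = new StringBuilder(
                "<html><head><meta charset=\"UTF-8\"></head><body><table border=\"1\">");
        for (List<String> row : rows) {
            html.append("<tr>");
            for (String cell : row) {
                html.append("<td>").append(escape(cell)).append("</td>");
            }
            html.append("</tr>");
        }
        return html.append("</table></body></html>").toString();
    }
}
```

In a servlet you would call response.setContentType(CONTENT_TYPE) and response.setHeader("Content-Disposition", CONTENT_DISPOSITION), then write the string to the response.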

Sounds like the Docmosis Java library could help. Check out the online demo, since it shows something similar to what you're asking: generating a real doc file from a web site based on selections in the web page. Docmosis can query from databases and run pretty much anywhere.
