Document processing in Liferay portal - Java

I've been using Liferay a lot for the past 2 years, but I have never needed any extensive document management.
Now I have a portlet where users upload documents (MS Office OLE2 documents, ODS documents, PDF, etc.) and I have to persist them with all available metadata.
I know how I would do that without Liferay: I'd probably use Apache Solr with Apache Tika (UpdateRichDocuments and ExtractingRequestHandler) or Apache Jackrabbit, which uses Apache Tika under the hood (org.apache.jackrabbit.extractor.*).
The problem is that if I look at the trunk of Liferay, there are some key classes:
Hooks (JCRHook, FileSystemHook, CMISHook, S3Hook) that are employed from within DLLocalServiceImpl more or less directly.
Another alternative is using DLAppLocalServiceImpl, which employs DLRepositoryLocalServiceImpl; the files are persisted into the repository also via hooks, but a lot of additional work is done in there.
There is no jackrabbit-text-extractors library in Liferay, so I suppose that if I wanted metadata to be extracted from PDF, DOC and ODS documents, I would have a very hard time, because the DL service layer doesn't accept additional properties.
I think I'd have to avoid using the DL services and the JCR hook and access Jackrabbit directly... but then I would lose compatibility and the possibility to migrate my repository, etc.
Could anybody please elaborate on this one? Thank you.

SOLR for indexing, Jackrabbit for document storage. Managing the Liferay Document Library in code is fairly easy; just look at the DL*LocalServiceUtil classes, namely DLFolderLocalServiceUtil and DLFileLocalServiceUtil. By default Liferay just creates a matching folder/file structure on the hard drive (with names changed), so you'd only need to write code or use Jackrabbit if you wanted more than this, since Liferay allows upload/download and viewing out of the box via the Control Panel and various portlets.
I haven't used Jackrabbit with Liferay, but once configured everything should be managed under the covers and you shouldn't need to worry about it on the front end.
When you say "with all metadata available" I'm not sure what is retained, but aside from renaming the file so that it can be tracked, there shouldn't be any other changes. It should be quick and easy to test by uploading a file of each type and checking the entries in the LIFERAY/data/document_library directory and its subdirectories. Again, this would be different if Jackrabbit is used.

Those two services, DLLocalServiceImpl and DLAppLocalServiceImpl, are both important and, I suppose, will remain so. The former is for direct access to the repository. Notice that when adding a file via this service you need to persist the corresponding DLFileEntry into the database and then reference it in addFile(..., fileEntryId, ...).
The latter service does additional work for you, mainly asset management and workflow.
Regarding your use case, I would avoid using the document library, because no metadata can go down into the JCR repository. Actually, the only metadata/custom properties that you could store would be custom properties, AKA the Expando feature of Liferay Portal.
The best way for you seems to be to implement your own Jackrabbit hook to store data in the repository and let the Liferay document library use that repository.
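For illustration, here is a minimal sketch of the Expando route mentioned above, assuming Liferay 6.x service-builder APIs; the attribute name "tika.author", its value, and the fileEntryId variable are hypothetical:
import com.liferay.portlet.documentlibrary.model.DLFileEntry;
import com.liferay.portlet.documentlibrary.service.DLFileEntryLocalServiceUtil;
import com.liferay.portlet.expando.model.ExpandoBridge;

public static void tagFileEntry(long fileEntryId) throws Exception {
    // Fetch the entry that the DL service layer persisted
    DLFileEntry fileEntry = DLFileEntryLocalServiceUtil.getDLFileEntry(fileEntryId);

    // Expando ("custom fields") is the only metadata channel the DL layer offers;
    // an attribute must be defined once before per-entry values can be set
    ExpandoBridge bridge = fileEntry.getExpandoBridge();
    if (!bridge.hasAttribute("tika.author")) {
        bridge.addAttribute("tika.author");
    }
    bridge.setAttribute("tika.author", "John Doe");
}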

I think Edgar is correct. If you check the current trunk via http://svn.liferay.com/repos/public/portal/trunk/portal-service/src/com/liferay/documentlibrary/service/DLLocalService.java (log in as guest with no password), you will no longer find the class DLFolderLocalServiceUtil. We are using the existing DLFolderLocalServiceUtil class as well. Thanks for the heads-up; we will refactor our code so that when 6.1 comes around we can still use the Document Library services.

You always need to use DLAppServiceUtil (as Liferay specifically instructs). Here is my working code that saves a file to the CMS:
public static void saveFileToCMS(ActionRequest aReq, long groupId, String fileName, File filenameWithPath) {
    try {
        ServiceContext serviceContext = ServiceContextFactory.getInstance(
            Group.class.getName(), aReq);

        // Append a random suffix to the title, since titles must be unique
        // within a folder; this prevents duplicate-entry errors
        Random rand = new Random();
        Integer suffix = Integer.valueOf(rand.nextInt(10000));

        // groupId doubles as the repositoryId; folderId 0 means the root folder
        DLAppServiceUtil.addFileEntry(groupId, 0, fileName, "application/vnd.ms-excel",
            fileName + suffix.toString(), "description goes here", "changelogname",
            filenameWithPath, serviceContext);
        //log.info("Successfully added the new file");
    } catch (PortalException pe) {
        log.error("Portal Exception occurred while saving file to CMS");
        pe.printStackTrace();
    } catch (SystemException e) {
        log.error("System Exception occurred while saving file to CMS");
        e.printStackTrace();
    }
}

Related

SDMX-ML: SAS libname XML

Eurostat data can be downloaded via a REST API. The response format of the API is an XML file formatted according to the SDMX-ML standard. With SAS, very conveniently, one can access XML files with the libname statement and the XML or XMLv2 engine.
Currently, I am using the XMLv2 engine together with the automap= option to generate an XML map to access the data. It works, but the resulting SAS data sets are very unstructured, and for another data set to be downloaded the data structure might change. Also, the request might depend on the DSD file that Eurostat provides for each database item within a different XML file.
Here comes the code:
%let path = /your/working/directory/;
filename map "&path.map.txt";
filename resp "&path.resp.txt";

/* Single quotes keep SAS from treating &endPeriod in the URL as a macro reference */
proc http
    URL='http://ec.europa.eu/eurostat/SDMX/diss-web/rest/data/cdh_e_fos/..PC.FOS1.BE/?startperiod=2005&endPeriod=2011'
    METHOD="GET"
    OUT=resp;
run; quit;

libname resp XMLv2 automap=REPLACE xmlmap=map;

proc datasets;
    copy out=WORK in=resp;
run; quit;
With the code above, you can view all downloaded data in your WORK library. It's a mess.
To download another time series, change the parameters of the URL according to Eurostat's description.
So here is my question:
Is there a way to easily generate an XML map from a call to the DSD file so that the data are stored in a well-structured way?
As the SDMX-ML standard is widely used in public institutions such as the ECB, Eurostat and the OECD, I am wondering if somebody has implemented requests to these databases already. I know about the tool from Banca d'Italia, which uses a javaObject. However, I was wondering if there might be a solution without the javaObject.

Recursively scan documents for indexing in a folder in SolrJ

I understand that in SimplePostTool (post.jar), there is a command to automatically detect content types in a folder and recursively scan it for documents to index into a collection:
bin/post -c gettingstarted afolder/
This has been useful for me to do mass indexing of all the files in the folder. Now I'm moving to production and plan to use SolrJ to do the indexing, as it can do more things like robustness checks and retries for indexing operations that fail.
However, I can't seem to find a way to do the same in SolrJ. Is it possible to do this in SolrJ? I'm using Solr 5.3.0.
Thank you.
Regards,
Edwin
If you're looking to submit content to an extracting request handler (for indexing PDFs and similar rich documents), you can use ContentStreamUpdateRequest as shown at Uploading data with SolrJ (note that in SolrJ 5.x addFile takes the content type as a second argument):
SolrClient server = new HttpSolrClient("http://localhost:8983/solr/my_collection");
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("my-file.pdf"), "application/pdf");
server.request(req);
To iterate through a directory structure recursively in Java, see Best way to iterate through a directory in Java.
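For a concrete starting point, here is a minimal sketch combining the two: java.nio.file.Files.walk handles the recursion, and each regular file is posted to the extracting handler. The collection URL, the literal.id parameter and the content-type fallback are assumptions you would adapt to your setup:
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class RecursiveIndexer {

    public static void main(String[] args) throws Exception {
        SolrClient server = new HttpSolrClient("http://localhost:8983/solr/my_collection");

        try (Stream<Path> paths = Files.walk(Paths.get("afolder"))) {
            paths.filter(Files::isRegularFile).forEach(path -> {
                try {
                    ContentStreamUpdateRequest req =
                        new ContentStreamUpdateRequest("/update/extract");
                    // Use the file path as the unique key (adapt to your schema)
                    req.setParam("literal.id", path.toString());
                    String type = Files.probeContentType(path);
                    req.addFile(path.toFile(),
                        type != null ? type : "application/octet-stream");
                    server.request(req);
                } catch (Exception e) {
                    // this is where retries for failed files could be added
                    e.printStackTrace();
                }
            });
        }

        server.commit();
        server.close();
    }
}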
If you're planning to index plain content (and not use the request handler), you can do that by creating the documents in SolrJ itself and then submitting the documents to the server - there's no need to write them to a temporary file in between.
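A sketch of that second route, creating a SolrInputDocument directly; the field names are assumptions that must exist in your schema:
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PlainContentIndexer {

    public static void main(String[] args) throws Exception {
        SolrClient server = new HttpSolrClient("http://localhost:8983/solr/my_collection");

        // Build the document in memory; no temporary file is needed
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("content", "Plain text content read from the source file");

        server.add(doc);
        server.commit();
        server.close();
    }
}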

Unidentified MAPI property returned by Apache POI

I was digging into the Apache POI API, trying out which properties it fetches out of an MSG file.
I parsed the MSG file using POIFSChunkParser.
Here is the code:
try
{
    InputStream is = new FileInputStream("C:\\path\\email.msg");
    POIFSFileSystem poifs = new POIFSFileSystem(is);
    POIFSChunkParser poifscprsr = new POIFSChunkParser();
    ChunkGroup[] chkgrps = poifscprsr.parse(poifs);
    for (ChunkGroup chunkgrp : chkgrps)
    {
        for (Chunk chunk : chunkgrp.getChunks())
        {
            System.out.println(chunk.getEntryName() + " ("
                + chunk.getChunkId() + ") " + chunk);
        }
    }
}
catch (FileNotFoundException fnfe)
{
    System.out.println(fnfe.getMessage());
}
catch (IOException ioe)
{
    System.out.println(ioe.getMessage());
}
In the output it listed all accessible properties of the MSG. One of them looked like this:
__substg1.0_800A001F (32778) 04
I tried to find out the significance of the property with hex ID 800A here. (The subnodes of this topic list the properties.)
Q1. However, I didn't find a property corresponding to hex 800A. So what should I infer?
Also, I have some other, somewhat related questions:
Q2. Does Apache POI expose all properties through MAPIMessage (I tried exploring all the methods of MAPIMessage too and started thinking it does not)?
Q3. If not, is there any other way to access all MAPI properties in Java, with or without Apache POI?
First up, be a little wary of using the very low-level HSMF classes if you're not following the Apache POI dev list. There have been some updates to HSMF fairly recently to start adding support for fixed-length properties, and more are needed. Generally the high-level classes will have a pretty stable API (even with the scratchpad warnings), while the lower-level ones can (and sometimes do) change as new support gets added. If you're not on the dev list, this might be a shock...
Next up - working out what stuff is. This is where the HSMF dev tools come in. The simple TypesLister will let you check all the types that POI knows about (slightly more than it supports), while HSMFDump will do its best to decode the file for you. If your chunk is of any kind of known type, between those two you can hopefully work out what it is and what it contains.
Finally - getting all properties. As alluded to above, Apache POI used to only support variable-length properties in .msg files. That has partly been corrected, with some fixed-length support in there too, but more work is needed. Volunteers are welcome on the dev list! MAPIMessage will give you all the common bits, but will also give you access to the different chunk groups. (A given message will be spread across a few different chunks, such as the main one, recipient ones, attachment ones, etc.) From there, you can get all the properties, along with the PropertiesChunk, which gives access to the fixed-length properties.
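As a rough sketch of that high-level route (assuming a recent POI HSMF build; the file path comes from the question, and accessors like getSubject() throw ChunkNotFoundException when the chunk is missing):
import org.apache.poi.hsmf.MAPIMessage;
import org.apache.poi.hsmf.datatypes.Chunk;

public class MsgProperties {

    public static void main(String[] args) throws Exception {
        MAPIMessage msg = new MAPIMessage("C:\\path\\email.msg");

        // The common bits via the high-level accessors
        System.out.println("Subject: " + msg.getSubject());
        System.out.println("From:    " + msg.getDisplayFrom());

        // Lower-level access to one of the chunk groups (here the main chunks)
        for (Chunk chunk : msg.getMainChunks().getChunks()) {
            System.out.println(chunk.getEntryName() + " (" + chunk.getChunkId() + ")");
        }
    }
}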

Writing to a PDF from inside a GAE app

I need to read several megabytes (raw text strings) out of my GAE Datastore and then write them all to a new PDF file, and then make the PDF file available for the user to download.
I am well aware of the sandbox restrictions that prevent you from writing to the file system. I am wondering if there is a crafty way of creating a PDF in-memory (or a combo of memory and the blobstore) and then storing it somehow so that the client-side (browser) can actually pull it down as a file and save it locally.
This is probably a huge stretch, but my only other option is to farm this task out to a non-GAE server, which I would like to avoid at all costs, even if it takes a lot of extra development on my end. Thanks in advance.
You can definitely achieve your use case using GAE itself. Here are the steps that you should follow, at a high level:
Download the excellent iText library, a Java library for working with PDFs. First build out your Java code to generate the PDF content. Check out various examples at http://itextpdf.com/book/toc.php
Since you cannot write to a file directly, you need to generate your PDF content in bytes and then write a servlet which will act as a download servlet. The servlet will use the response object to open a stream, set the MIME headers (filename, file type) and write the PDF contents to the stream. A browser will automatically present a download option when you do that.
Your download servlet will have high-level code that looks like this:
public class DownloadPDF extends HttpServlet {

    public void doGet(HttpServletRequest req, HttpServletResponse res)
            throws ServletException, IOException {

        // Extract some request parameters, fetch your data and generate your document
        String fileName = "<SomeFileName>.pdf";

        res.setContentType("application/pdf");
        res.setHeader("Content-Disposition", "attachment;filename=\"" + fileName + "\"");

        writePDF(<SomeObjectData>, res.getOutputStream());
    }
}
Remember, the writePDF method above is your own method, where you use the iText library's Document and other classes to generate the data and write it to the output stream that you passed in as the second parameter.
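For instance, a minimal sketch of such a writePDF method using the iText 5 API (the package names and the plain-paragraph content are assumptions; older iText 2.x lives under com.lowagie.text instead):
import java.io.IOException;
import java.io.OutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;

public static void writePDF(String someObjectData, OutputStream out) throws IOException {
    try {
        Document document = new Document();          // the in-memory PDF document
        PdfWriter.getInstance(document, out);        // bind it to the servlet's stream
        document.open();
        document.add(new Paragraph(someObjectData)); // e.g. text read from the Datastore
        document.close();                            // flushes the finished PDF
    } catch (DocumentException de) {
        throw new IOException(de);
    }
}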
While I'm not aware of the state of PDF generation on Google App Engine, especially in Java, once you have the PDF you can definitely store it and later serve it.
I suppose the generation of the PDF will take more than 30 seconds, so you will have to consider using the Task Queue Java API for this process.
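A minimal fragment of that deferral using the Task Queue API; the handler URL and parameter are assumptions:
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public static void enqueuePdfGeneration(String reportId) {
    // Runs the generation in a task, outside the interactive request deadline
    Queue queue = QueueFactory.getDefaultQueue();
    queue.add(TaskOptions.Builder.withUrl("/tasks/generate-pdf")
        .param("reportId", reportId));
}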
After you have the file in memory you can simply write it to the Blobstore and later serve it as a regular blob. In the overview you will find a fully functional example of how to upload, write and serve your binary data (blobs) on Google App Engine.
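A sketch of that write-then-serve flow, assuming the GAE Files API that was current when this answer was written (it has since been deprecated in favor of Google Cloud Storage); pdfBytes and the file name are placeholders:
import java.nio.ByteBuffer;

import javax.servlet.http.HttpServletResponse;

import com.google.appengine.api.blobstore.BlobKey;
import com.google.appengine.api.blobstore.BlobstoreService;
import com.google.appengine.api.blobstore.BlobstoreServiceFactory;
import com.google.appengine.api.files.AppEngineFile;
import com.google.appengine.api.files.FileService;
import com.google.appengine.api.files.FileServiceFactory;
import com.google.appengine.api.files.FileWriteChannel;

public static void storeAndServe(byte[] pdfBytes, HttpServletResponse response) throws Exception {
    // Write the in-memory PDF bytes to the Blobstore
    FileService fileService = FileServiceFactory.getFileService();
    AppEngineFile file = fileService.createNewBlobFile("application/pdf", "report.pdf");
    FileWriteChannel channel = fileService.openWriteChannel(file, true);
    channel.write(ByteBuffer.wrap(pdfBytes));
    channel.closeFinally();

    // Later, serve it as a regular blob
    BlobKey blobKey = fileService.getBlobKey(file);
    BlobstoreService blobstore = BlobstoreServiceFactory.getBlobstoreService();
    blobstore.serve(blobKey, response);
}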
I found a couple of solutions by googling. Please note that I have not actually tried these libraries, but hopefully they will be of help.
PDFJet (commercial)
Write a Google Drive document and export to PDF

How to create an ODT file programmatically with Java?

How can I create an ODT (LibreOffice/OpenOffice Writer) file with Java programmatically? A "hello world" example would be sufficient. I looked at the OpenOffice website but the documentation wasn't clear.
Take a look at ODFDOM - the OpenDocument API:
ODFDOM is a free OpenDocument Format (ODF) library. Its purpose is to provide an easy common way to create, access and manipulate ODF files, without requiring detailed knowledge of the ODF specification. It is designed to provide the ODF developer community with an easy lightwork programming API portable to any object-oriented language. The current reference implementation is written in Java.
// Create a text document from a standard template (empty documents within the JAR)
OdfTextDocument odt = OdfTextDocument.newTextDocument();
// Append text to the end of the document.
odt.addText("This is my very first ODF test");
// Save document
odt.save("MyFilename.odt");
Later:
As of this writing (2016-02), we are told that these classes are deprecated... big time, and the OdfTextDocument API documentation tells you:
As of release 0.8.8, replaced by org.odftoolkit.simple.TextDocument in the Simple API.
This means you still include the same active .jar file in your project, simple-odf-0.8.1-incubating-jar-with-dependencies.jar, but you want to be unpacking the following .jar to get the documentation: simple-odf-0.8.1-incubating-javadoc.jar, rather than odfdom-java-0.8.10-incubating-javadoc.jar.
Incidentally, the documentation link downloads a bunch of jar files inside a .zip which says "0.6.1"... but most of the stuff inside appears to be more like 0.8.1. I have no idea why they say "as of 0.8.8" in the documentation for the "deprecated" classes: just about everything is already marked deprecated.
The equivalent simple code to the above is then:
odt_doc = org.odftoolkit.simple.TextDocument.newTextDocument()
para = odt_doc.getParagraphByIndex( 0, False )
para.appendTextContent( 'stuff and nonsense' )
odt_doc.save( 'mySpankingNewFile.odt' )
PS: I am using Jython, but the Java equivalent is sketched below.
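For completeness, a sketch of the Java version of the same Simple API calls, assuming simple-odf 0.8.x on the classpath:
import org.odftoolkit.simple.TextDocument;
import org.odftoolkit.simple.text.Paragraph;

public class SimpleOdfHello {

    public static void main(String[] args) throws Exception {
        TextDocument doc = TextDocument.newTextDocument();

        // Reuse the document's first (empty) paragraph, as in the Jython snippet
        Paragraph para = doc.getParagraphByIndex(0, false);
        para.appendTextContent("stuff and nonsense");

        doc.save("mySpankingNewFile.odt");
    }
}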
I have not tried it, but using JOpenDocument may be an option. (It seems to be a pure Java library to generate OpenDocument files.)
A complement to the previously given solutions would be JODReports, which allows creating office documents and reports in ODT format (from templates composed using the LibreOffice/OpenOffice.org Writer word processor).
DocumentTemplateFactory templateFactory = new DocumentTemplateFactory();
DocumentTemplate template = templateFactory.getTemplate(new File("template.odt"));
Map data = new HashMap();
data.put("title", "Title of my doc");
data.put("picture", new RenderedImageSource(ImageIO.read(new File("/tmp/lena.png"))));
data.put("answer", "42");
//...
template.createDocument(data, new FileOutputStream("output.odt"));
Optionally the documents can then be converted to PDF, Word, RTF, etc. with JODConverter.
Edit/update
Here you can find a sample project using JODReports (with non-trivial formatting cases).
I have written a jruby DSL for programmatically manipulating ODF documents.
https://github.com/noah/ocelot
It's not strictly Java, but it aims to be much simpler to use than ODFDOM.
Creating a hello world document is as easy as:
% cat examples/hello.rb
include OCELOT
Text::create "hello" do
paragraph "Hello, world!"
end
There are a few more examples (including a spreadsheet example or two) here.
I have been searching for an answer to this question myself. I am working on a project for generating documents in different formats, and I was in bad need of a library to generate ODT files.
I can finally say that ODFToolkit, with the latest version of the simple-odf library, is the answer for generating text documents.
You can find the official page here:
Apache ODF Toolkit (Incubating) - Simple API
Here is a page to download version 0.8.1 (the latest version of the Simple API), as I didn't find the latest version on the official page, only version 0.6.1.
And here you can find the Apache ODF Toolkit (Incubating) cookbook.
You can try using JasperReports to generate your reports and then export them to ODS, as sketched below. The nice things about this approach are:
you get broad support for all JasperReports output formats, e.g. PDF, XLS, HTML, etc.
Jasper Studio makes it easy to design your reports
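A rough sketch of the fill-and-export step, assuming the JasperReports 5.x-era exporter API and a pre-compiled report.jasper file (newer releases replace setParameter with exporter input/output classes):
import java.util.HashMap;

import net.sf.jasperreports.engine.JREmptyDataSource;
import net.sf.jasperreports.engine.JRExporterParameter;
import net.sf.jasperreports.engine.JasperFillManager;
import net.sf.jasperreports.engine.JasperPrint;
import net.sf.jasperreports.engine.export.oasis.JROdsExporter;

public class OdsReportExport {

    public static void main(String[] args) throws Exception {
        // Fill the compiled report; here with no real data source
        JasperPrint print = JasperFillManager.fillReport(
            "report.jasper", new HashMap<String, Object>(), new JREmptyDataSource());

        // Export the filled report to ODS
        JROdsExporter exporter = new JROdsExporter();
        exporter.setParameter(JRExporterParameter.JASPER_PRINT, print);
        exporter.setParameter(JRExporterParameter.OUTPUT_FILE_NAME, "report.ods");
        exporter.exportReport();
    }
}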
The ODF Toolkit project (code hosted at GitHub) is the new home of the former ODFDOM project, which was until 2018-11-27 an Apache Incubator project.
Another solution may be the JODF Java API from Independentsoft.
For example, if we want to create an OpenDocument file using this Java API, we could do the following:
import com.independentsoft.office.odf.Paragraph;
import com.independentsoft.office.odf.TextDocument;

public class Example {

    public static void main(String[] args)
    {
        try
        {
            TextDocument doc = new TextDocument();
            Paragraph p1 = new Paragraph();
            p1.add("Hello World");
            doc.getBody().add(p1);
            doc.save("c:\\test\\output.odt", true);
        }
        catch (Exception e)
        {
            System.out.println(e.getMessage());
            e.printStackTrace();
        }
    }
}
There are also .NET solutions for this API.
