I understand that in SimplePostTool (post.jar), there is a command to automatically detect content types in a folder and recursively scan it for documents to index into a collection:
bin/post -c gettingstarted afolder/
This has been useful for me to do mass indexing of all the files that are in the folder. Now that I'm moving to production, I plan to use SolrJ to do the indexing, as it can do more things like robustness checks and retries for indexing operations that fail.
However, I can't seem to find a way to do the same in SolrJ. Is it possible? I'm using Solr 5.3.0.
Thank you.
Regards,
Edwin
If you're looking to submit content to an extracting request handler (for indexing PDFs and similar rich documents), you can use ContentStreamUpdateRequest as shown in Uploading data with SolrJ:
SolrClient server = new HttpSolrClient("http://localhost:8983/solr/my_collection");
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("my-file.pdf"), "application/pdf"); // the file plus its content type
server.request(req);
server.commit(); // make the document visible to searches
To iterate through a directory structure recursively in Java, see Best way to iterate through a directory in Java.
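As an illustration, here is a minimal sketch that combines a recursive walk (Java 8's java.nio.file API) with the extract handler, assuming SolrJ 5.x; the collection URL, the folder name, and the literal.id parameter are assumptions you would adapt to your own setup and schema:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class RecursiveExtractIndexer {

    public static void main(String[] args) throws IOException, SolrServerException {
        // Collection URL and folder name are placeholders - change them to match your setup
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/gettingstarted");

        // Collect every regular file below the root folder
        List<Path> files;
        try (Stream<Path> walk = Files.walk(Paths.get("afolder"))) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        for (Path path : files) {
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            // probeContentType may return null for unknown types; Tika will still try to detect it
            String contentType = Files.probeContentType(path);
            req.addFile(path.toFile(), contentType != null ? contentType : "application/octet-stream");
            req.setParam("literal.id", path.toString()); // use the file path as the unique key
            client.request(req);
        }

        client.commit();
        client.close();
    }
}

This is where you can add the robustness checks and retries you mentioned, for example by wrapping client.request(req) in a retry loop per file.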
If you're planning to index plain content (and not use the request handler), you can do that by creating the documents in SolrJ itself and then submitting the documents to the server - there's no need to write them to a temporary file in between.
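For that case, a sketch could look like the following (again assuming SolrJ 5.x; the field names are only examples and must exist in your schema):

import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PlainContentIndexer {

    public static void main(String[] args) throws IOException, SolrServerException {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/my_collection");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");                                  // unique key
        doc.addField("title", "Some title");                          // example fields
        doc.addField("content", "Body text read from your source");

        client.add(doc);    // no temporary file involved
        client.commit();
        client.close();
    }
}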
Eurostat data can be downloaded via a REST API. The response format of the API is an XML file formatted according to the SDMX-ML standard. With SAS, very conveniently, one can access XML files with the libname statement and the XML or XMLv2 engine.
Currently, I am using the xmlv2 engine together with the automap= option to generate an xmlmap to access the data. It works. But the resulting SAS data sets are very unstructured, and for another data set to be downloaded the data structure might change. Also the request might depend on the DSD-file that Eurostat provides for each database item within a different XML file.
Here comes the code:
%let path = /your/working/directory/;
filename map "&path.map.txt";
filename resp "&path.resp.txt";
proc http
    URL="http://ec.europa.eu/eurostat/SDMX/diss-web/rest/data/cdh_e_fos/..PC.FOS1.BE/?startperiod=2005&endPeriod=2011"
    METHOD="GET"
    OUT=resp;
run; quit;

libname resp XMLv2 automap=REPLACE xmlmap=map;

proc datasets;
    copy out=WORK in=resp;
run; quit;
With the code above, you can view all downloaded data in your WORK library. It's a mess.
To download another time series, change the parameters of the URL according to Eurostat's description.
So here is my question:
Is there a way to easily generate an xmlmap from a call to the DSD file so that the data are stored in a well-structured way?
As the SDMX-ML standard is widely used in public institutions such as the ECB, Eurostat, and the OECD, I am wondering if somebody has already implemented requests to these databases. I know about the tool from Banca Italia, which uses a javaObject. However, I was wondering if there might be a solution without the javaObject.
I'm trying to read data from Elasticsearch into Apache Spark using Python.
Below is the code, copied from the official documentation.
$ ./bin/pyspark --driver-class-path=/path/to/elasticsearch-hadoop.jar
conf = {"es.resource" : "index/type"}
rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
rdd.first()
The above reads data from the corresponding index, but it reads the whole index.
Can you tell me how to use a query to limit the read scope?
Also, I did not find much documentation regarding this. For example, it seems the conf dict controls the read scope, but the ES documentation just says it is a Hadoop configuration and nothing more. I went through the Hadoop configuration and did not find the corresponding keys and values for ES. Do you know of any better articles about this?
You can add an es.query setting to your configuration. Since the conf in your snippet is a plain dict passed to newAPIHadoopRDD, you can put the query right next to es.resource:
conf = {"es.resource": "index/type", "es.query": "?q=me*"}
Here's more detailed documentation on how to use it.
I'm writing a Java application and want to index an XML file with Lucene so I can search for a drug that has a given target. The file is 400 MB and contains over 8000 drug entries.
<drug type="biotech" created="2005-06-13" updated="2015-11-27">
<drugbank-id primary="true">DB00001</drugbank-id>
<drugbank-id>BIOD00024</drugbank-id>
<drugbank-id>BTD00024</drugbank-id>
<name>Lepirudin</name>
....
<targets>
<target position="1">
<id>BE0000767</id>
<name>Epidermal growth factor receptor</name>
....
</target>
....
</targets>
</drug>
<drug>
....
</drug>
How can I index this file so one drug-entry is one Document?
If someone has some useful links/resources or tips on how to index this XML, please let me know :)
The most flexible strategy is usually to just use SolrJ through a small Java application that reads the file and transforms it into a suitable format for indexing in Solr. That way you can easily preprocess certain fields before they're received by Solr (a sketch follows below).
Another option is to use XSL to transform the XML file into something that Solr understands. This can be done either server-side (as with the linked XSLTUpdateRequestHandler) or client-side (transform an XML document into an update request and submit it to the standard request handler).
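To give an idea of the first approach, here is a rough sketch using the standard StAX parser together with SolrJ, based only on the structure shown above; the collection URL and the field names (id, name, target_name) are assumptions, and target_name would need to be a multiValued field in your schema:

import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DrugIndexer {

    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/drugs");
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("drugbank.xml"));

        List<SolrInputDocument> batch = new ArrayList<>();
        SolrInputDocument doc = null;
        boolean inTargets = false;

        while (reader.hasNext()) {
            int event = reader.next();

            if (event == XMLStreamConstants.START_ELEMENT) {
                String tag = reader.getLocalName();
                if ("drug".equals(tag) && doc == null) {
                    // each top-level drug entry becomes one Solr document
                    doc = new SolrInputDocument();
                } else if ("targets".equals(tag)) {
                    inTargets = true;
                } else if (doc != null && "drugbank-id".equals(tag)
                        && "true".equals(reader.getAttributeValue(null, "primary"))) {
                    doc.addField("id", reader.getElementText());
                } else if (doc != null && "name".equals(tag)) {
                    if (inTargets) {
                        doc.addField("target_name", reader.getElementText()); // multiValued
                    } else if (!doc.containsKey("name")) {
                        doc.addField("name", reader.getElementText());
                    }
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                String tag = reader.getLocalName();
                if ("targets".equals(tag)) {
                    inTargets = false;
                } else if ("drug".equals(tag) && doc != null) {
                    batch.add(doc);
                    doc = null;
                    if (batch.size() >= 1000) { // send in chunks to keep memory usage low
                        client.add(batch);
                        batch.clear();
                    }
                }
            }
        }
        reader.close();

        if (!batch.isEmpty()) {
            client.add(batch);
        }
        client.commit();
        client.close();
    }
}

Streaming the file this way avoids loading the whole 400 MB document into memory, and a search for a target name then returns the matching drug documents.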
I need to read several megabytes (raw text strings) out of my GAE Datastore and then write them all to a new PDF file, and then make the PDF file available for the user to download.
I am well aware of the sandbox restrictions that prevent you from writing to the file system. I am wondering if there is a crafty way of creating a PDF in-memory (or a combo of memory and the blobstore) and then storing it somehow so that the client-side (browser) can actually pull it down as a file and save it locally.
This is probably a huge stretch, but my only other option is to farm this task out to a non-GAE server, which I would like to avoid at all costs, even if it takes a lot of extra development on my end. Thanks in advance.
You can definitely achieve your use case using GAE itself. Here are the steps that you should follow at a high level:
Download the excellent iText library, which is a Java library for working with PDFs. First, build out your Java code to generate the PDF content. Check out the various examples at http://itextpdf.com/book/toc.php
Since you cannot write to a file directly, you need to generate your PDF content in bytes and then write a Servlet that acts as a Download Servlet. The Servlet will use the Response object to open a stream, set the MIME headers (file name, content type) and write the PDF contents to the stream. A browser will automatically present a download option when you do that.
Your Download Servlet will have high level code that looks like this:
public class DownloadPDF extends HttpServlet {

    public void doGet(HttpServletRequest req, HttpServletResponse res)
            throws ServletException, IOException {
        // Extract some request parameters, fetch your data and generate your document
        String fileName = "<SomeFileName>.pdf";
        res.setContentType("application/pdf");
        res.setHeader("Content-Disposition", "attachment;filename=\"" + fileName + "\"");
        writePDF(<SomeObjectData>, res.getOutputStream());
    }
}
Remember, the writePDF method above is your own method, where you use the iText library's Document and other classes to generate the data and write it to the OutputStream that you passed in as the second parameter.
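As an illustration only, here is a minimal writePDF sketch, assuming iText 5 (the com.itextpdf.text packages) and a plain String standing in for <SomeObjectData>:

import java.io.OutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;

// ... inside the DownloadPDF servlet ...
private void writePDF(String textFromDatastore, OutputStream out) throws DocumentException {
    Document document = new Document();
    PdfWriter.getInstance(document, out);           // bind the PDF writer to the servlet's output stream
    document.open();
    document.add(new Paragraph(textFromDatastore)); // add your generated content here
    document.close();                               // closing flushes the finished PDF to the stream
}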
While I'm not familiar with PDF generation on Google App Engine, especially in Java, once you have the file you can definitely store it and serve it later.
I suppose the generation of the PDF will take more than 30 seconds, so you will have to consider using the Task Queue Java API for this process.
After you have the file in memory, you can simply write it to the Blobstore and later serve it as a regular blob. In the overview you will find a fully functional example of how to upload, write and serve your binary data (blobs) on Google App Engine.
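For reference, here is a rough sketch of that flow, assuming the App Engine Files API (com.google.appengine.api.files, later deprecated in favour of Google Cloud Storage) for writing and the BlobstoreService for serving; generatePdfInMemory() is a hypothetical method standing in for your PDF generation:

import java.nio.ByteBuffer;

import com.google.appengine.api.blobstore.BlobKey;
import com.google.appengine.api.blobstore.BlobstoreService;
import com.google.appengine.api.blobstore.BlobstoreServiceFactory;
import com.google.appengine.api.files.AppEngineFile;
import com.google.appengine.api.files.FileService;
import com.google.appengine.api.files.FileServiceFactory;
import com.google.appengine.api.files.FileWriteChannel;

// ... inside a task handler / servlet whose method declares throws IOException ...
byte[] pdfBytes = generatePdfInMemory();  // hypothetical: returns the finished PDF as bytes

// Write the bytes into the Blobstore via the Files API
FileService fileService = FileServiceFactory.getFileService();
AppEngineFile file = fileService.createNewBlobFile("application/pdf");
FileWriteChannel writeChannel = fileService.openWriteChannel(file, true); // lock so we can finalize
writeChannel.write(ByteBuffer.wrap(pdfBytes));
writeChannel.closeFinally();              // finalizes the blob

// Serve it back later as a regular blob
BlobKey blobKey = fileService.getBlobKey(file);
BlobstoreService blobstoreService = BlobstoreServiceFactory.getBlobstoreService();
blobstoreService.serve(blobKey, response); // response is your download servlet's HttpServletResponse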
I found a couple of solutions by googling. Please note that I have not actually tried these libraries, but hopefully they will be of help.
PDFJet (commercial)
Write a Google Drive document and export to PDF
I've been using Liferay a lot for the past 2 years, but I have never needed any extensive document management.
Now I have a portlet where users upload documents (MS Office OLE2 documents, ODS documents, PDFs, etc.) and I have to persist them with all metadata available.
I know how I would do that without using Liferay: I'd probably use Apache Solr with Apache Tika (UpdateRichDocuments and ExtractingRequestHandler) or Apache Jackrabbit, which uses Apache Tika under the hood (org.apache.jackrabbit.extractor.*).
The problem is that if I look at the trunk of Liferay, there are some key classes:
Hooks (JCRHook, FileSystemHook, CMISHook, s3Hook) that are employed more or less directly from within DLLocalServiceImpl
Another alternative is using DLAppLocalServiceImpl, which employs DLRepositoryLocalServiceImpl; the files are also persisted into the repository via Hooks, but a lot of additional stuff is done in there.
There is no jackrabbit-text-extractors library in Liferay, so I suppose if I wanted metadata to be extracted from PDF, DOC, and ODS documents, I would have a very hard time... because the DL service layer doesn't accept additional properties.
I think I'd have to avoid using the DL services and the JCR hook and access Jackrabbit directly... but I would lose compatibility and the possibility to migrate my repository, etc.
Could anybody please elaborate on this one? Thank you.
Solr for indexing, Jackrabbit for document storage. Managing the Liferay Document Library in code is fairly easy; just look at the DL*LocalServiceUtil classes, namely DLFolderLocalServiceUtil and DLFileLocalServiceUtil. By default, Liferay just creates a matching folder/file structure on the hard drive (with names changed), so you'd only need to write code or use Jackrabbit if you wanted more than this, since Liferay allows upload/download and viewing out of the box via the control panel and various portlets.
I haven't used Jackrabbit with Liferay, but once it is configured everything should be managed under the covers and you shouldn't need to worry about it on the front end.
When you say "with all metadata available" I'm not sure what is retained, but aside from renaming the file so that it can be tracked, there shouldn't be any other changes. It should be quick and easy to test by uploading a file of each type and checking the entries in the LIFERAY/data/document_library directory and its subdirectories. Again, this would be different if Jackrabbit is used.
Those two services, DLLocalServiceImpl and DLAppLocalServiceImpl, are both important and, I suppose, will remain so. The former is for direct access to the repository. Notice that when adding a file via this service, you need to persist the corresponding DLFileEntry into the database and then reference it in addFile(..., fileEntryId, ...).
The latter service does additional stuff for you, mainly asset management and workflow.
Regarding your use case, I would avoid using the Document Library, because no metadata can go down into the JCR repository. Actually, the only metadata/custom properties you could store would be custom properties, a.k.a. the Expando feature of Liferay Portal.
The best way for you seems to be to implement your own Jackrabbit hook to store data in the repository and let the Liferay Document Library use that repository.
I think Edgar is correct. If you check the current trunk via http://svn.liferay.com/repos/public/portal/trunk/portal-service/src/com/liferay/documentlibrary/service/DLLocalService.java (log in as guest with no password), you will no longer find the class DLFolderLocalServiceUtil. We are using the existing DLFolderLocalServiceUtil class as well. Thanks for the heads-up. We will refactor our code so that when 6.1 comes around we can still use the Document Library services.
You always need to use DLAppServiceUtil (as Liferay specifically instructs). Here is my working code that saves a file to the CMS:
public static void saveFileToCMS(ActionRequest aReq, long groupId, String fileName, File filenameWithPath) {
    try {
        ServiceContext serviceContext = ServiceContextFactory.getInstance(
                Group.class.getName(), aReq);

        // prevents duplicate entries based on unique title name
        Random rand = new Random();
        Integer suffix = new Integer(rand.nextInt(10000));

        // folder id 0 = the root folder (DLFolderConstants.DEFAULT_PARENT_FOLDER_ID)
        DLAppServiceUtil.addFileEntry(groupId, 0, fileName, "application/vnd.ms-excel",
                fileName + suffix.toString(), "description goes here", "changelogname",
                filenameWithPath, serviceContext);
        //log.info("Successfully added the new file");
    } catch (PortalException pe) {
        log.error("Portal Exception occurred while saving file to CMS");
        pe.printStackTrace();
    } catch (SystemException e) {
        log.error("System Exception occurred while saving file to CMS");
        e.printStackTrace();
    }
}