We are migrating from GSA to Solr and are looking to keep the existing GSA Connectors to scrape our ECM systems.
GSA Connectors construct XML documents as follows:
<gsafeed>
  <header>
    <datasource>source</datasource>
    <feedtype>incremental</feedtype>
  </header>
  <group>
    <record url="..." displayurl="http://url.com/a/b" action="add" ...>
      <metadata>
        <meta name="Author" content="author@company.com"/>
        <meta name="DocIcon" content="pdf"/>
        ... bunch of other meta fields ...
      </metadata>
      <content encoding="base64compressed">...</content>
    </record>
  </group>
</gsafeed>
The <content> element is not text but the raw document byte stream, compressed and then Base64-encoded.
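For reference, a minimal sketch of how such a payload is produced (assuming zlib/deflate compression, which matches the Inflater used in the solution below; the helper name encodeCompress is made up):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.zip.DeflaterOutputStream;

// Hypothetical helper: compress the raw document bytes with zlib/deflate,
// then Base64-encode them, mirroring the "base64compressed" encoding.
static String encodeCompress(byte[] raw) throws IOException {
  ByteArrayOutputStream buf = new ByteArrayOutputStream();
  try (DeflaterOutputStream dos = new DeflaterOutputStream(buf)) {
    dos.write(raw);
  }
  return Base64.getEncoder().encodeToString(buf.toByteArray());
}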
What I need is for Solr to ingest this XML, which obviously needs to be modified first.
So I've come up with this process:
Code a custom request handler which GSA will send that XML to. This looks like a decent place to start: https://stackoverflow.com/a/40568514/482261
The custom handler will modify the incoming request body: (a) decode and then decompress the <content> node data, (b) construct a Solr-ingestible XML.
Forward this modified SolrQueryRequest to the /update/extract handler (class="solr.extraction.ExtractingRequestHandler") for Tika extraction.
I am trying to build the custom handler. Doing CRUD on the request parameters is easy enough, but I am lost on how to deal with content streams.
Edit 1:
Solution posted.
Edit 2:
I now have a follow-up question. The posted solution works when the GSA feed contains only a single document. With multiple documents, each with its own metadata, things get a bit murky. I haven't decided on a way of dealing with that yet; once I do, the solution will be posted as a new question.
Here is what I have come up with to address the original question. Hopefully it helps someone. I have extracted the relevant bits from my working code, so please treat this as pseudo-code.
public class MyCustomRequestHandler extends ExtractingRequestHandler {

  @Override
  public void handleRequestBody(SolrQueryRequest originalReq, SolrQueryResponse rsp) throws Exception {
    Iterable<ContentStream> streams = originalReq.getContentStreams();
    ContentStream theStream = streams.iterator().next();
    InputStream is = theStream.getStream();

    // The stream is XML, so parse it into a Document. I used the XOM library for this.
    Builder parser = new Builder(); // nu.xom.Builder
    Document doc = parser.build(is);

    // Process accordingly:
    // 1. Convert the <meta> tags to a Map<String, String>
    SolrParams extractedSolrParams = new MapSolrParams(/* Map<String, String> of all <meta> fields in GSA feed */);

    // 2. Take <content> and pass it to decodeUncompress()
    byte[] decodedUncompressedContent = decodeUncompress(/* <content> from gsa feed */);

    // Once the parsing and processing is complete, construct a new Solr request
    LocalSolrQueryRequest localRequest = new LocalSolrQueryRequest(originalReq.getCore(), extractedSolrParams);
    List<ContentStream> newContentStreams = new ArrayList<ContentStream>();
    newContentStreams.add(new ContentStreamBase.ByteArrayStream(decodedUncompressedContent, "GSA Feed <content>"));
    localRequest.setContentStreams(newContentStreams);

    super.handleRequestBody(localRequest, rsp);
  }

  private byte[] decodeUncompress(byte[] data) throws IOException {
    // Decode
    byte[] decodedBytes = Base64.getDecoder().decode(data);

    // Uncompress
    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    Inflater decompresser = new Inflater(false);
    InflaterOutputStream inflaterOutputStream = new InflaterOutputStream(stream, decompresser);
    try {
      inflaterOutputStream.write(decodedBytes);
    } finally {
      try {
        inflaterOutputStream.close();
      } catch (IOException e) {
        // ignore failures on close
      }
    }
    return stream.toByteArray();
  }
}
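For the multi-record case mentioned in Edit 2, one possible direction (an untested sketch, not the posted solution) is to loop over the <record> elements inside handleRequestBody and forward one extraction request per record; toMetaMap is a made-up helper that collects that record's <meta> tags into a Map<String, String>:

// Untested sketch: would replace the single-record body of handleRequestBody.
Elements records = doc.getRootElement()
    .getFirstChildElement("group")
    .getChildElements("record");
for (int i = 0; i < records.size(); i++) {
  Element record = records.get(i);
  SolrParams params = new MapSolrParams(toMetaMap(record)); // hypothetical helper
  byte[] content = decodeUncompress(
      record.getFirstChildElement("content").getValue().getBytes());
  LocalSolrQueryRequest req = new LocalSolrQueryRequest(originalReq.getCore(), params);
  req.setContentStreams(Collections.<ContentStream>singletonList(
      new ContentStreamBase.ByteArrayStream(content, "record-" + i)));
  super.handleRequestBody(req, rsp);
}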
I have a Quarkus-based REST API project in which one endpoint is supposed to serve exported data as .csv files. Since I do not want to create temporary files, I write the CSV to an in-memory ByteArrayInputStream that is used in an octet-stream response from my web service.
However, although this works fine for Latin character content, we also have content that may be in Chinese. The downloaded .csv file does not display the characters properly; they only show up as question marks, even in plain-text view, e.g. with Notepad.
We have already checked that the source of the problem is not how the data is stored: for example, the encoding in the database is correct, and it works fine when we export it as .json (there we can set charset utf-8).
As far as I understand, a charset or encoding cannot be set for an octet stream.
So how can we export/stream this content as a file download without creating an actual file?
Some code examples below show how we do it currently. We use the Apache Commons CSV CSVPrinter to produce the CSV text in a custom CSV streamer class:
@ApplicationScoped
public class JobRunDataCsvStreamer implements DataFormatStreamer<JobData> {

  @Override
  public ByteArrayInputStream streamDataToFormat(List<JobData> dataList) {
    try {
      ByteArrayOutputStream out = getCsvOutputStreamFor(dataList);
      return new ByteArrayInputStream(out.toByteArray());
    } catch (IOException e) {
      throw new RuntimeException("Failed to convert job data: " + e.getMessage());
    }
  }

  private ByteArrayOutputStream getCsvOutputStreamFor(List<JobData> dataList) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    CSVPrinter csvPrinter = new CSVPrinter(new PrintWriter(out), getHeaderFormat());
    for (JobData jobData : dataList) {
      csvPrinter.printRecord(extractStringRowData(jobData));
    }
    csvPrinter.flush();
    csvPrinter.close();
    return out;
  }

  private CSVFormat getHeaderFormat() {
    return CSVFormat.EXCEL
        .builder()
        .setDelimiter(";")
        .setHeader("ID", "Source term", "Target term")
        .build();
  }

  private List<String> extractStringRowData(JobData jobData) {
    return Arrays.asList(
        String.valueOf(jobData.getId()),
        jobData.getSourceTerm(),
        jobData.getTargetTerm()
    );
  }
}
Here is the quarkus API endpoint for the download:
#Path("/jobs/data")
public class JobDataResource {
#Inject JobDataRepository jobDataRepository;
#Inject JobDataCsvStreamer jobDataCsvStreamer;
...
#GET
#Path("/export/csv")
#Produces(MediaType.APPLICATION_OCTET_STREAM)
public Response getAllAsCsvExport() {
List<JobData> jobData = jobDataRepository.getAll();
ByteArrayInputStream stream = jobDataCsvStreamer.streamDataToFormat(jobData);
return Response.ok(stream, MediaType.APPLICATION_OCTET_STREAM)
.header("content-disposition", "attachment; filename = job-data.csv")
.build();
}
}
Screenshot of the result in the downloaded file for Chinese characters in the second column:
We tried setting headers etc. for the encoding, but none of it worked. Is there a way to stream content that requires a specific encoding as a file download in Java web services? We tried using PrintWriter, which works, but requires creating a local file on the server.
Edit: We tried PrintWriter(out, false, StandardCharsets.UTF_8) for the PrintWriter that writes to the ByteArrayOutputStream for the response, which yields a different result, but the view is still broken in both Excel and plain text:
Screenshot:
Code for endpoint:
@GET
@Path("/export/csv")
@Produces(MediaType.APPLICATION_OCTET_STREAM)
public Response getAllAsCsvExport() {
  List<JobData> jobData = jobRunDataRepository.getAll();
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  try {
    PrintWriter pw = new PrintWriter(out, false, StandardCharsets.UTF_8);
    pw.println(String.format("%s, %s, %s", "ID", "Source", "Target"));
    for (JobData item : jobData) {
      pw.println(String.format("%s, %s, %s",
          String.valueOf(item.getId()),
          String.valueOf(item.getSourceTerm()),
          String.valueOf(item.getTargetTerm()))
      );
    }
    pw.flush();
    pw.close();
  } catch (Exception e) {
    throw new RuntimeException("Failed to convert job data: " + e.getMessage());
  }
  return Response.ok(out).build();
}
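For what it's worth, two things usually matter in this situation: the original getCsvOutputStreamFor wraps the stream in new PrintWriter(out), which encodes with the platform default charset, and Excel does not detect UTF-8 in a .csv file unless the bytes start with a byte-order mark. Below is a minimal sketch of producing UTF-8 CSV bytes with a BOM, still without any temporary file (the method name and sample rows are made up):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

// Untested sketch: UTF-8 CSV in memory, prefixed with a BOM so that
// Excel picks the right encoding when opening the downloaded file.
static ByteArrayOutputStream utf8CsvWithBom() throws IOException {
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  out.write(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF}); // UTF-8 BOM
  PrintWriter pw = new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
  pw.println("ID;Source term;Target term");
  pw.println("1;你好;hello"); // sample row with Chinese characters
  pw.flush();
  return out;
}

Serving the result as text/csv; charset=UTF-8 instead of application/octet-stream may also help clients that honor the Content-Type header.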
I'm working on a Spring Boot API that can receive very large objects and tries to save them in a MongoDB database. Because of this, the program sometimes throws the following error:
org.bson.BsonMaximumSizeExceededException: Payload document size is larger than maximum of 16793600.
I've read that MongoDB only permits documents below 16 MB, which is very inconvenient for my system because an object can easily exceed this limit. To solve this I read about GridFS, a technology that allows storing files beyond the 16 MB limit.
Now I'm trying to implement GridFS in my system, but I have only seen examples that save files to the database, something like this:
gridFsOperations.store(new FileInputStream("/Users/myuser/Desktop/text.txt"), "myText.txt", "text/plain", metaData);
But what I want is not to take the data from a file, but to have the API receive an object and save it, something like this:
@PostMapping
public String save(@RequestBody Object object) {
  DBObject metaData = new BasicDBObject();
  metaData.put("type", "data");
  gridFsOperations.store(object, metaData);
  return "Stored successfully...";
}
Is there a possible way of doing this?
Get an InputStream from the request and pass it to a GridFSBucket. Here's a rough example:
In your controller:
@PostMapping
public ResponseEntity<String> uploadFile(MultipartHttpServletRequest request) throws IOException {
  Iterator<String> iterator = request.getFilenames();
  String filename = iterator.next();
  MultipartFile mf = request.getFile(filename);
  // I always have a service layer between controller and repository, but for purposes of this example...
  myDao.uploadFile(filename, mf.getInputStream());
  return ResponseEntity.ok(filename);
}
In your DAO/repository:
private GridFSBucket bucket;

@Autowired
void setMongoDatabase(MongoDatabase db) {
  bucket = GridFSBuckets.create(db);
}

public ObjectId uploadFile(String filename, InputStream is) {
  Document metadata = new Document("type", "data");
  GridFSUploadOptions opts = new GridFSUploadOptions().metadata(metadata);
  ObjectId oid = bucket.uploadFromStream(filename, is, opts);
  try {
    is.close();
  } catch (IOException ioe) {
    throw new UncheckedIOException(ioe);
  }
  return oid;
}
I paraphrased this from existing code, so it may not be perfect, but it should be good enough to point you in the right direction.
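If the goal is specifically the @RequestBody shape from the question (an arbitrary object rather than an uploaded file), one untested sketch is to serialize the body to JSON bytes with Jackson and stream those into the same GridFSBucket; the filename and the use of a Map for the body are made up for illustration:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Map;

import org.bson.Document;
import org.bson.types.ObjectId;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.mongodb.client.gridfs.model.GridFSUploadOptions;

@PostMapping
public String save(@RequestBody Map<String, Object> body) throws IOException {
  // Serialize the incoming object to JSON bytes -- no temporary file needed.
  byte[] json = new ObjectMapper().writeValueAsBytes(body);
  GridFSUploadOptions opts = new GridFSUploadOptions()
      .metadata(new Document("type", "data"));
  ObjectId oid = bucket.uploadFromStream("payload.json",
      new ByteArrayInputStream(json), opts);
  return "Stored successfully as " + oid.toHexString();
}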
I'm trying to read the metadata from an IdP using OpenSAML 2. When I try to unmarshall the metadata, OpenSAML shows only the getter getUnknownAttributes() for attributes. It looks like I am missing some point, since the code works very well when reading the IdP's SAML response (there it shows getAssertions(), which returns a list of assertions).
I need to parse the metadata and find information about the IdP.
Here is the method:
public Metadata metadataReader() {
  ByteArrayInputStream bytesIn = new ByteArrayInputStream(ISSUER_METADATA_URL.getBytes());
  BasicParserPool ppMgr = new BasicParserPool();
  ppMgr.setNamespaceAware(true);
  // grab the xml file
  // File xmlFile = new File(this.file);
  Metadata metadata = null;
  try {
    Document document = ppMgr.parse(bytesIn);
    Element metadataRoot = document.getDocumentElement();
    QName qName = new QName(metadataRoot.getNamespaceURI(), metadataRoot.getLocalName(),
        metadataRoot.getPrefix());
    Unmarshaller unmarshaller = Configuration.getUnmarshallerFactory().getUnmarshaller(qName);
    metadata = (Metadata) unmarshaller.unmarshall(metadataRoot);
    return metadata;
  } catch (XMLParserException e) {
    e.printStackTrace();
  } catch (UnmarshallingException e) {
    e.printStackTrace();
  }
  return null;
}
I suggest using a metadata provider to do the heavy lifting for you. FilesystemMetadataProvider is often a good fit.
I have a blog post that explains how to use it.
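A minimal sketch of that approach (OpenSAML 2 API; the metadata path and entity ID are made up):

import java.io.File;

import org.opensaml.DefaultBootstrap;
import org.opensaml.saml2.metadata.EntityDescriptor;
import org.opensaml.saml2.metadata.provider.FilesystemMetadataProvider;
import org.opensaml.xml.parse.BasicParserPool;

// Untested sketch: let the provider parse and unmarshall the IdP
// metadata, then query it by entity ID.
DefaultBootstrap.bootstrap(); // initialize OpenSAML once per JVM

FilesystemMetadataProvider provider =
    new FilesystemMetadataProvider(new File("idp-metadata.xml"));
BasicParserPool pool = new BasicParserPool();
pool.setNamespaceAware(true);
provider.setParserPool(pool);
provider.initialize();

EntityDescriptor idp = provider.getEntityDescriptor("https://idp.example.com/metadata");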
I am using Jackson 2.5.1 (com.fasterxml.jackson.core.JsonGenerator) to write a JSON document to an output stream. It looks like the API allows us to write invalid JSON to the stream? I thought that if we try to write elements in the wrong context, it is supposed to throw JsonGenerationException or IOException. Here is the code snippet:
try {
  JsonFactory jfactory = new JsonFactory();
  ByteArrayOutputStream b = new ByteArrayOutputStream();
  JsonGenerator jsonGenerator = jfactory.createJsonGenerator(b, JsonEncoding.UTF8);
  jsonGenerator.writeStartObject();
  jsonGenerator.writeStringField("str1", "blahblah");
  jsonGenerator.writeNumber(1234);
  jsonGenerator.writeEndObject();
  jsonGenerator.close();
  System.out.println(b.toString());
} catch (Exception e) {
  e.printStackTrace();
}
The output is {"str1":"blahblah":1234}, which is not valid JSON. Is this expected behavior, or am I missing something? I thought the API itself tracks whether elements are written in the correct context. Does it need to be enforced by the application itself? It is not clear from the documentation:
http://fasterxml.github.io/jackson-core/javadoc/2.0.0/com/fasterxml/jackson/core/JsonGenerator.html
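For comparison, here is a call sequence that produces valid JSON: every value inside an object needs a field name, and writeNumberField combines writeFieldName and writeNumber (the field name num1 is made up):

jsonGenerator.writeStartObject();
jsonGenerator.writeStringField("str1", "blahblah");
jsonGenerator.writeNumberField("num1", 1234); // field name + value
jsonGenerator.writeEndObject();
// yields {"str1":"blahblah","num1":1234}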
I have a Java web application based on Spring MVC.
The task is to generate a PDF file. As everyone knows, Spring has built-in iText support, so generating a PDF file is really simple. The first thing we need to do is extend AbstractView to create a PdfView, and the second thing is to use that view in a controller. But in my application I also have to be able to store the generated PDF files on a local drive, or give my users a link to download the file. So the view alone is not suitable for me.
I want to create a universal PDF generator that creates a PDF file and returns the byte array, so I can use that array for storing the file (on the hard drive) or printing it directly in the browser. The question is: is there any way to use such an engine (one that returns only the byte array) in a PdfView solution? I am asking because the overloaded buildPdfDocument method (in PdfView) already has PdfWriter and Document parameters.
Thank you
tldr; you should be able to use a view and save it to a file.
Try using Flying Saucer and its ITextRenderer when you override AbstractView.
import java.io.ByteArrayOutputStream;
import java.util.Map;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.springframework.web.servlet.view.AbstractView;
import org.xhtmlrenderer.pdf.ITextRenderer;

public class MyAbstractView extends AbstractView {

  // Keep the rendered bytes so they can be reused: streamed to the
  // browser by the view, or written to a file elsewhere.
  private byte[] pdfBytes;

  public MyAbstractView() {
    setContentType("application/pdf");
  }

  @Override
  protected void renderMergedOutputModel(Map<String, Object> model,
      HttpServletRequest request, HttpServletResponse response) throws Exception {
    // process model params as needed, then render with Flying Saucer
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    ITextRenderer renderer = new ITextRenderer();
    String url = "http://www.mysite.com"; // set your sample url namespace here
    renderer.setDocument(url); // render the (X)HTML at that location
    renderer.layout();
    renderer.createPDF(os);
    os.close();
    pdfBytes = os.toByteArray();
    // stream the same bytes back to the browser
    response.setContentType(getContentType());
    response.getOutputStream().write(pdfBytes);
  }

  public byte[] getPDFAsBytes() {
    return pdfBytes;
  }
}
You'll probably have to tweak the sample implementation shown here, but that should provide a basic gist.
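For the file-on-disk half of the original question, a short usage sketch reusing the same bytes (the view instance and target path are made up):

import java.nio.file.Files;
import java.nio.file.Paths;

// After the view has rendered once, reuse the rendered bytes for a local copy.
byte[] pdf = myAbstractView.getPDFAsBytes();
Files.write(Paths.get("/tmp/report.pdf"), pdf); // hypothetical path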