I'm new to Java and I need to export data from NSF files to DXL XML files. I tried it with Python, but the OLE wrapper lacks some major functionality. I have two different formats of DXL files: with rawitemdata and with richtext.
When I switched to the Java Lotus Notes API, I started getting errors in rawitemdata fields. I call doc.convertToMIME(2) to get the mail body in HTML format, so that every type of document has the same structure. As a result, all fields are stored as rawitemdata, which is exactly what I want. But some rawitemdata fields have broken encoding.
It looks like this:
<rawitemdata type="19">
AgAPAAAAAgAmAj4AhQEAAAAAAAANCi0tMF9fPUNDQkIwRTFFREZBRjk4Mjg4ZjllOGE5M2RmOTM4NjkwOTE4Y0NDQkIwRTFFREZBRjk4
MjgNCkNvbnRlbnQtVHJhbnNmZXItRW5jb2Rpbmc6IGJpbmFyeQ0KQ29udGVudC10eXBlOiBhcHBsaWNhdGlvbi9vY3RldC1zdHJlYW07
IA0KCW5hbWU9Ij0/S09JOC1SP0I/OFBMdjlPL3I3K3d1Wkc5amVBPT0/PSINCkNvbnRlbnQtRGlzcG9zaXRpb246IGF0dGFjaG1lbnQ7
IGZpbGVuYW1lPSI9P0tPSTgtUj84UEx2OU8vcjcrd3VaRzlqZUE9PT89Ig0KQ29udGVudC1JRDogPDNfXz1DQ0JCMEUxRURGQUY5ODI4
OGY5ZThhOTNkZjkzODY5MDkxQGxvY2FsPg0KDQoFzwXQBc4F0gXOBcoFzgXLLmRvY3g=
</rawitemdata>
After decoding it from base64 I got this:
\x02\x00\x0f\x00\x00\x00\x02\x00&\x02>\x00\x85\x01\x00\x00\x00\x00\x00\x00\r\n-
-0__=CCBB0E1EDFAF98288f9e8a93df938690918cCCBB0E1EDFAF9828\r\nContent-Transfer-Encoding:
binary\r\nContent-type: application/octet-stream; \r\n\tname="=?KOI8-R?B?8PLv9O/r7+wuZG9jeA==?
="\r\nContent-Disposition: attachment; filename="=?KOI8-R?8PLv9O/r7+wuZG9jeA==?="\r\nContent-ID:
<3__=CCBB0E1EDFAF98288f9e8a93df93869091@local>\r\n\r\n\x05\xcf\x05\xd0\x05\xce\x05\xd2\x05\xce\x05\xca\x
05\xce\x05\xcb.docx
It's easy to explain:
The first 20 bytes are a header: \x02\x00\x0f\x00\x00\x00\x02\x00&\x02>\x00\x85\x01\x00\x00\x00\x00\x00\x00
To unpack it I use the following code: struct.unpack("<hhhh hhLL", data[:20])
It returns a tuple like this: (2, 15, 0, 2, 550, 62, 389, 0), where the 5th element is the length of the body. (I changed the body for this post and was too lazy to recalculate the header for the new body.)
Then comes the body with RFC 822 headers.
After decoding the Content-Disposition header from base64 and then from KOI8-R, we can see that the file name is ПРОТОКОЛ.docx. That is the correct content.
If I try to decode the last part of the body, which contains the bytes \x05\xcf\x05\xd0\x05\xce\x05\xd2\x05\xce\x05\xca\x05\xce\x05\xcb.docx, I get an exception. This part cannot be decoded correctly.
It looks like cp1251 encoding under Unicode, with every character prefixed by \x05, because ПРОТОКОЛ.docx in the cp1251 code page is the bytes '\xcf\xd0\xce\xd2\xce\xca\xce\xcb.docx'.
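Since the rest of the question is in Java, here is a minimal Java sketch of the two steps described above: unpacking the 20-byte header and decoding the trailing file name. It assumes that a \x05 byte simply marks the following byte as a cp1251 character. That would be consistent with LMBCS, the native Notes character set, which prefixes characters with a group byte, but this reading is an assumption and is not confirmed anywhere in the post. The class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;

public class RawItemSketch {

    // Mirrors struct.unpack("<hhhh hhLL", data[:20]): six little-endian
    // signed 16-bit values followed by two little-endian unsigned 32-bit values.
    static long[] unpackHeader(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data, 0, 20).order(ByteOrder.LITTLE_ENDIAN);
        long[] fields = new long[8];
        for (int i = 0; i < 6; i++) {
            fields[i] = buf.getShort();
        }
        for (int i = 6; i < 8; i++) {
            fields[i] = buf.getInt() & 0xFFFFFFFFL;
        }
        return fields;
    }

    // ASSUMPTION: 0x05 is a prefix that marks the next byte as a cp1251
    // character; all other bytes are kept as they are (ASCII).
    static String decodeTail(byte[] tail) {
        ByteArrayOutputStream cp1251 = new ByteArrayOutputStream();
        for (int i = 0; i < tail.length; i++) {
            if (tail[i] == 0x05 && i + 1 < tail.length) {
                i++; // skip the prefix byte, keep the character byte
            }
            cp1251.write(tail[i]);
        }
        return new String(cp1251.toByteArray(), Charset.forName("windows-1251"));
    }
}
```

With the header bytes from the post, the fifth field comes out as 550, matching the Python result. If the prefix assumption holds, decodeTail turns the trailing bytes into ПРОТОКОЛ.docx.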
My Java code is quite simple:
import lotus.domino.Database;
import lotus.domino.Session;
import lotus.domino.NotesFactory;
import java.io.File;
import java.io.FileWriter;
import java.io.BufferedWriter;
import lotus.domino.*;
public class ExportDXL {
public static void main(String[] args) throws Exception {
String server = "";
String dbPath = "C:\\share\\test.nsf";
NotesThread.sinitThread();
Session session = NotesFactory.createSession((String)null, (String)null, "test");
Database db = session.getDatabase(server, dbPath);
System.out.println("Db Title: " + db.getTitle());
DxlExporter exporter = session.createDxlExporter();
exporter.setConvertNotesBitmapsToGIF(true);
View view = db.getView("$ALL");
Document doc = view.getFirstDocument();
while(doc != null){
doc.convertToMIME(2); // or pass 1 to get plain text
String xmldoc = doc.generateXML();
FileWriter fw = null;
String id = doc.getUniversalID();
fw = new FileWriter("C:\\lotus_test\\"+ id +".xml");
fw.write(xmldoc);
fw.flush();
fw.close();
doc = view.getNextDocument(doc);
}
}
}
How can I get the correct encoding in this case? Or how do I set the code page for the Lotus Notes API?
Related
I have been really banging my head against the wall with this one, uploading text files is fine, but when I upload a zip archive into my blob store -> it gets corrupted, and cannot be opened once downloaded.
Doing a hex compare (image below) of the original versus file that has been through Azure shows some subtle replacements have happened, but I cannot find the source of the change/corruption.
I have tried forcing UTF-8, ASCII, and UTF-16 (UTF-8 is probably the correct one), but none of them resolved the issue.
I have also tried different HTTP libraries but got the same result.
The deployment environment forces me to use Unirest, so I cannot use the Microsoft API (which seems to work fine).
package blobQuickstart.blobAzureApp;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Base64;
import org.junit.Test;
import kong.unirest.HttpResponse;
import kong.unirest.Unirest;
public class StackOverflowExample {
@Test
public void uploadSmallZip() throws Exception {
File testFile = new File("src/test/resources/zip/simple.zip");
String blobStore = "secretstore";
UploadedFile testUploadedFile = new UploadedFile();
testUploadedFile.setName(testFile.getName());
testUploadedFile.setFile(testFile);
String contentType = "application/zip";
String body = readFileContent(testFile);
String url = "https://" + blobStore + ".blob.core.windows.net/naratest/" + testFile.getName() + "?sv=2020-02-10&ss=b&srt=o&sp=c&se=2021-09-07T20%3A10%3A50Z&st=2021-09-07T18%3A10%3A50Z&spr=https&sig=xvQTkCQcfMTwWSP5gXeTB5vHlCh2oZXvmvL3kaXRWQg%3D";
HttpResponse<String> response = Unirest.put(url)
.header("x-ms-blob-type", "BlockBlob").header("Content-Type", contentType)
.body(body).asString();
if (!response.isSuccess()) {
System.out.println(response.getBody());
throw new Exception("Failed to Upload File! Unexpected response code: " + response.getStatus());
}
}
private static String readFileContent(File file) throws Exception {
InputStream is = new FileInputStream(file);
ByteArrayOutputStream answer = new ByteArrayOutputStream();
byte[] byteBuffer = new byte[8192];
int nbByteRead;
while ((nbByteRead = is.read(byteBuffer)) != -1)
{
answer.write(byteBuffer, 0, nbByteRead);
}
is.close();
byte[] fileContents = answer.toByteArray();
String s = Base64.getEncoder().encodeToString(fileContents);
byte[] resultBytes = Base64.getDecoder().decode(s);
String encodedContents = new String(resultBytes);
return encodedContents;
}
}
Please help!
byte[] resultBytes = Base64.getDecoder().decode(s);
String encodedContents = new String(resultBytes);
You are creating a String from a byte array that contains binary data. A String is meant for text, so arbitrary bytes do not survive the conversion. The Base64 encode followed by an immediate decode is also pointless; it only uses more memory.
If the content is a ZIP file, it is binary, so just return the byte array. Alternatively, you can encode the content, but then you should return the encoded form. As a further weakness, you are doing everything in memory, which limits the potential size of the content.
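As a rough sketch of the "just return the byte array" advice, the helper from the question could keep its read loop and drop every String and Base64 step. The class name ReadBinarySketch and the choice to wrap IOException in an unchecked exception are assumptions made for illustration, not part of the original code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ReadBinarySketch {

    // Same read loop as in the question, but the bytes are returned as-is:
    // no String and no charset is involved, so binary content stays intact.
    static byte[] readAllBytes(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        try {
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

The returned array can then be used as the request body directly instead of a String. This still holds the whole file in memory, so the size limitation mentioned above remains.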
Unirest file handlers will by default force a multipart body - not supported by Azure.
A Byte Array can be provided directly as per this: https://github.com/Kong/unirest-java/issues/248
Unirest.put("http://somewhere")
.body("abc".getBytes())
.asString();
I am able to create a PDF file, but when I try to open the output file I get the error "the file is damaged".
Here is my code, please help me.
String encodedBytes= "QmFzZTY0IGVuY29kaW5nIHNjaGVtZXMgYXJlIHVzZWQgd2hlbiBiaW5hcnkgZGF0YSBuZWVkcyB0byBiZSBzdG9yZWQgb3IgdHJhbnNmZXJyZWQgYXMgdGV4dHVhbCBkYXRh";
BASE64Decoder decoder = new BASE64Decoder();
byte[] decodedBytes = decoder.decodeBuffer(encodedBytes);
File file = new File("C:/Users/istest/Documents/test.pdf");
FileOutputStream fos = new FileOutputStream(file);
fos.write(decodedBytes);
Your string is not a valid PDF file.
A PDF file should start with its proper magic number (please refer to the Format indicators section of this link).
PDF files start with "%PDF" (hex 25 50 44 46),
or in Base64: JVBERi
If you try your code with a valid Base64-encoded PDF string like this one, it might work.
But because you did not provide the code of your BASE64Decoder class, it is hard to be sure that it will work.
For that reason, here is a simple implementation using the java.util.Base64 class. (Warning: do not copy/paste this example and run it before replacing the given Base64 string with the correct one supplied in the previous link. As noted in the comment below, the correct string was replaced by a corrupted one to keep the example short.)
import java.io.File;
import java.io.FileOutputStream;
import java.util.Base64;
class Base64DecodePdf {
public static void main(String[] args) {
File file = new File("./test.pdf");
try ( FileOutputStream fos = new FileOutputStream(file); ) {
// To be short I use a corrupted PDF string, so make sure to use a valid one if you want to preview the PDF file
String b64 = "JVBERi0xLjUKJYCBgoMKMSAwIG9iago8PC9GaWx0ZXIvRmxhdGVEZWNvZGUvRmlyc3QgMTQxL04gMjAvTGVuZ3==";
byte[] decoder = Base64.getDecoder().decode(b64);
fos.write(decoder);
System.out.println("PDF File Saved");
} catch (Exception e) {
e.printStackTrace();
}
}
}
Credit : source.
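The magic-number check described in this answer can be sketched in a few lines. This is only an illustration; the class and method names are invented, and it checks nothing beyond the first four bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PdfMagicCheck {

    // Returns true when the decoded data starts with "%PDF" (hex 25 50 44 46).
    static boolean looksLikePdf(String base64) {
        byte[] decoded = Base64.getDecoder().decode(base64);
        byte[] magic = "%PDF".getBytes(StandardCharsets.US_ASCII);
        if (decoded.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (decoded[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The string from the question decodes to plain English text, not a PDF.
        String fromQuestion = "QmFzZTY0IGVuY29kaW5nIHNjaGVtZXMgYXJlIHVzZWQgd2hlbiBiaW5hcnkgZGF0YSBuZWVkcyB0byBiZSBzdG9yZWQgb3IgdHJhbnNmZXJyZWQgYXMgdGV4dHVhbCBkYXRh";
        System.out.println(looksLikePdf(fromQuestion));
        System.out.println(new String(Base64.getDecoder().decode(fromQuestion), StandardCharsets.US_ASCII));
    }
}
```

Running a check like this before writing the file makes it obvious when the input was never a PDF in the first place, as is the case with the string in the question.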
I'm trying to do something fairly simple and read an i9 PDF form from an incoming FlowFile, parse the first and last name out of it into a JSON, then output the JSON to the outgoing FlowFile.
I found no official documentation on how to do this, but someone has written up several cookbooks on doing things in several scripting languages in NiFi here. It seems pretty straightforward and I'm pretty sure I'm doing what is written there, but I'm not even sure the PDF is being read at all. It simply passes the PDF unmodified out to REL_SUCCESS every time.
Link to sample PDF
import java.nio.charset.StandardCharsets
import org.apache.pdfbox.io.IOUtils
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.util.PDFTextStripperByArea
import java.awt.Rectangle
import org.apache.pdfbox.pdmodel.PDPage
import com.google.gson.Gson
import java.nio.charset.StandardCharsets
def flowFile = session.get()
flowFile = session.write(flowFile, { inputStream, outputStream ->
try {
//Load Flowfile contents
PDDocument document = PDDocument.load(inputStream)
PDFTextStripperByArea stripper = new PDFTextStripperByArea()
//Get the first page
List<PDPage> allPages = document.getDocumentCatalog().getAllPages()
PDPage page = allPages.get(0)
//Define the areas to search and add them as search regions
stripper = new PDFTextStripperByArea()
Rectangle lname = new Rectangle(25, 226, 240, 15)
stripper.addRegion("lname", lname)
Rectangle fname = new Rectangle(276, 226, 240, 15)
stripper.addRegion("fname", fname)
//Load the results into a JSON
def boxMap = [:]
stripper.setSortByPosition(true)
stripper.extractRegions(page)
regions = stripper.getRegions()
for (String region : regions) {
String box = stripper.getTextForRegion(region)
boxMap.put(region, box)
}
Gson gson = new Gson()
//Remove random noise from the output
json = gson.toJson(boxMap, LinkedHashMap.class)
json = json.replace('\\n', '')
json = json.replace('\\r', '')
json = json.replace(',"', ',\n"')
//Overwrite flowfile contents with JSON
outputStream.write(json.getBytes(StandardCharsets.UTF_8))
} catch (Exception e){
System.out.println(e.getMessage())
session.transfer(flowFile, REL_FAILURE)
}
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)
EDIT:
Was able to confirm that the flowFile object is being read properly by subbing a txt file in. So the problem seems to be that the inputStream is never being handed off to the PDDocument or something is happening when it does. I edited the code to try reading it into a File object first but that resulted in an error:
FlowFileHandlingException: null is not known in this session
EDIT Edit:
Solved by moving my try/catch. I don't seem to understand how that works, my code above has been edited and works properly.
session.get can return null, so definitely add a line after that if(!flowFile) return. Also put the try/catch outside the session.write, that way you can put the session.transfer(flowFile, REL_SUCCESS) after the session.write (inside the try) and the catch can transfer to failure.
Also I can't tell from the code how the PDFTextStripperByArea works to get the info from the incoming document. It looks like all the document stuff is inside the try, so wouldn't be available to the PDFTextStripper (and isn't passed in).
None of these things explain why you're getting the original flow file on the success relationship, but maybe there's something I'm not seeing that would be magically fixed by the changes above :)
Also, if you use log.info() or log.error() rather than System.out.println, you will see the output in the NiFi logs (and for error it will post a bulletin to the processor, and you can see the message if you hover over the top right corner of the processor, where a red square appears if a bulletin is present).
I am trying to decode a pdf from a response and write it to a file.
The file gets created and appears to be the correct file size, but when I go to open it, I get an error that says, "There was an error opening this document. The file is damaged and could not be repaired."
I am using the code from this post to decode and create the file.
I set the base64 encoded file returned from the API as the variable vars.get("documentText")
Here is how my BeanShell PostProcessor code looks:
import org.apache.commons.io.FileUtils;
import org.apache.commons.codec.binary.Base64;
String Createresponse= vars.get("documentText");
vars.put("response",new String(Base64.decodeBase64(Createresponse.getBytes("UTF-8"))));
Output = vars.get("response");
f = new FileOutputStream("C:\\Users\\user\\Desktop\\Test.pdf");
p = new PrintStream(f);
this.interpreter.setOut(p);
print(Output);
f.close();
Am I doing something incorrectly?
I have also done the following, but get the same result:
byte[] data = Base64.decodeBase64(vars.get("documentText"));
FileOutputStream out = new FileOutputStream("C:\\Users\\user\\Desktop\\Test.pdf");
out.write(data);
out.close();
EDIT:
The entire PDF from the Response looks like the following: (these are just the first 5 lines (of approx. 7,548 lines), but they are all similar):
JVBERi0xLjQKMSAwIG9iago8PAovVGl0bGUgKP7/KQovQ3JlYXRvciAo/v8pCi9Qcm9kdWNlciAo
/v8AUQB0ACAANQAuADUALgAxKQovQ3JlYXRpb25EYXRlIChEOjIwMTcwMzI3MTgwNTEzKQo+Pgpl
bmRvYmoKMiAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMyAwIFIKPj4KZW5kb2JqCjQg
MCBvYmoKPDwKL1R5cGUgL0V4dEdTdGF0ZQovU0EgdHJ1ZQovU00gMC4wMgovY2EgMS4wCi9DQSAx
LjAKL0FJUyBmYWxzZQovU01hc2sgL05vbmU+PgplbmRvYmoKNSAwIG9iagpbL1BhdHRlcm4gL0Rl
I'm assuming this is what is causing an issue? Is there a way to convert the response to a single String that can be decoded?
EDIT 2:
So the &#xd; in the response is definitely my problem. I looked up the hex code character and it translates to a carriage return. If I manually copy the Response from within JMeter, paste it into Notepad++, remove &#xd; and then decode it manually, the PDF opens as it should.
I tried modifying my BeanShell script to remove the carriage return and then decode it, but it still isn't fully functional. The PDF now opens, however, it is just blank white pages. Here is my updated code:
String Createresponse= vars.get("documentText");
String b64 = Createresponse.replace("&#xd;","");
vars.put("response",new String(Base64.decodeBase64(b64)));
Output = vars.get("response");
f = new FileOutputStream("C:\\Users\\user\\Desktop\\Test.pdf");
p = new PrintStream(f);
this.interpreter.setOut(p);
print(Output);
f.close();
This works for me. Your input data is wrong.
package com.test;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Base64;
import org.junit.Test;
public class TestBase64 {
String data =
"JVBERi0xLjQKMSAwIG9iago8PAovVGl0bGUgKP7/KQovQ3JlYXRvciAo/v8pCi9Qcm9kdWNlciAo/v8AUQB0ACAANQAuADUALgAxKQovQ3JlYXRpb25EYXRlIChEOjIwMTcwMzI3MTgwNTEzKQo+Pgpl";
@Test
public void decodeBase64()
{
byte[] localData = Base64.getDecoder().decode(data);
try (FileOutputStream out = new FileOutputStream("/testout64.dat"))
{
out.write(localData);
out.close();
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
This results in
%PDF-1.4
1 0 obj
<<
/Title (þÿ)
/Creator (þÿ)
/Producer (þÿ Q t 5 . 5 . 1)
/CreationDate (D:20170327180513)
>>
e
and seems to be valid PDF.
What is the &#xd; part? It seems to be some custom format characters.
I basically had the answer to my question: the problem was that the base64-encoded Response I was trying to decode was multi-line and included the carriage return hex code.
My solution was to remove the carriage return hex code from the response, condense it to a single string of base64-encoded text, and then write the file out.
import org.apache.commons.io.FileUtils;
import org.apache.commons.codec.binary.Base64;
String response = vars.get("documentText");
String encodedFile = response.replace("&#xd;","").replaceAll("[\n]+","");
// Decode the response
vars.put("decodedFile",new String(Base64.decodeBase64(encodedFile)));
// Write out the decoded file
Output = vars.get("decodedFile");
file = new FileOutputStream("C:\\Users\\user\\Desktop\\decodedFile.pdf");
p = new PrintStream(file);
this.interpreter.setOut(p);
print(Output);
p.flush();
file.close();
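For comparison, here is a hedged plain-Java sketch of the same clean-up that keeps the decoded data as bytes instead of passing it through a String and a PrintStream. It assumes the response contains the literal text &#xd; at the line ends, as described above; the class and method names are hypothetical.

```java
import java.util.Base64;

public class MultiLineBase64Sketch {

    static byte[] decode(String response) {
        // Remove the literal "&#xd;" markers first: 'x' and 'd' are valid
        // base64 characters, so a lenient decoder would not skip them.
        String cleaned = response.replace("&#xd;", "");
        // The MIME decoder ignores line breaks and other characters that are
        // outside the base64 alphabet.
        return Base64.getMimeDecoder().decode(cleaned);
    }
}
```

The resulting array can be written with FileOutputStream.write(byte[]), as in the second snippet of the question, so no character conversion touches the PDF content.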
I am trying to get the base64 content of a MimePart in a MimeMultiPart, but I'm struggling with the Javamail package. I simply want the base64 encoded String of a certain inline image, there doesn't seem to be an easy way to do this though.
I wrote a method that will take the mime content (as a string) and an image name as a parameter, and searches for the part that contains the base64 content of that image name, and in the end returns this base64 string (as well as the content type but that is irrelevant for this question)
Here is the relevant code (including relevant imports):
import javax.activation.DataSource;
import javax.mail.MessagingException;
import javax.mail.Part;
import javax.mail.internet.MimeMultipart;
import javax.mail.internet.MimePart;
import javax.mail.util.ByteArrayDataSource;
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.io.IOUtils;
import com.sun.mail.util.BASE64DecoderStream;
private static String[] getBase64Content(String imageName, String mimeString) throws MessagingException, IOException
{
System.out.println("image name: " + imageName + "\n\n");
System.out.println("mime string: " + mimeString);
String[] base64Content = new String[2];
base64Content[0] = "";
base64Content[1] = "image/jpeg"; //some default value
DataSource source = new ByteArrayDataSource(new ByteArrayInputStream(mimeString.getBytes()), "multipart/mixed");
MimeMultipart mp = new MimeMultipart(source);
for (int i = 0; i < mp.getCount(); i++)
{
MimePart part = (MimePart) mp.getBodyPart(i);
String disposition = part.getDisposition();
if (disposition != null && disposition.equals(Part.INLINE))
{
if (part.getContentID() != null && part.getContentID().indexOf(imageName) > -1) //check if this is the right part
{
if (part.getContent() instanceof BASE64DecoderStream)
{
BASE64DecoderStream base64DecoderStream = (BASE64DecoderStream) part.getContent();
StringWriter writer = new StringWriter();
IOUtils.copy(base64DecoderStream, writer);
String base64decodedString = writer.toString();
byte[] encodedMimeByteArray = Base64.encodeBase64(base64decodedString.getBytes());
String encodedMimeString = new String(encodedMimeByteArray);
System.out.println("encoded mime string: " + encodedMimeString);
base64Content[0] = encodedMimeString;
base64Content[1] = getContentTypeString(part);
}
}
}
}
return base64Content;
}
I cannot paste all of the output as the post would be too long, but this is some of it:
image name: image001.gif@01CAD280.4D637150
This is a part of the mimeString input, it does find this (correct) part with the image name:
--_004_225726A14AF9134CB538EE7BD44373A04D9E3F3940menexch2007ex_
Content-Type: image/gif; name="image001.gif"
Content-Description: image001.gif
Content-Disposition: inline; filename="image001.gif"; size=1070;
creation-date="Fri, 02 Apr 2010 16:19:43 GMT";
modification-date="Fri, 02 Apr 2010 16:19:43 GMT"
Content-ID: <image001.gif@01CAD280.4D637150>
Content-Transfer-Encoding: base64
R0lGODlhEAAQAPcAABxuHJzSlDymHGy2XHTKbITCdNTu1FyqTHTCXJTKhLTarCSKHEy2JHy6bJza
lITKfFzCPEyWPHS+XHzCbJzSjFS+NLTirBx6HHzKdOz27GzCZJTOjCyWHKzWpHy2ZJTGhHS+VLzi
(more base64 string here that I'm not going to paste)
But when it finally prints the encoded mime string, this is a different string than I was expecting:
encoded mime string: R0lGODlhEAAQAO+/vQAAHG4c77+90pQ877+9HGzvv71cdO+/vWzvv73vv71077+977+977+9XO+/vUx077+9XO+/vcqE77+92qwk77+9HEzvv70kfO+/vWzvv73alO+
Clearly different from the one that has its output in the part above. I'm not even sure what I'm looking at here, but when I try to load this as an image in a html page, it won't work.
This is fairly frustrating for me, since all I want is a piece of the text that I'm already printing, but I'd rather not search through the mime string myself for the correct part and introduce all kinds of bugs. So I'd really prefer to use the Javamail library, but I could use some help on how to actually get that correct mime string.
Solved my issue, modified code to:
if (part.getContent() instanceof BASE64DecoderStream)
{
BASE64DecoderStream base64DecoderStream = (BASE64DecoderStream) part.getContent();
byte[] byteArray = IOUtils.toByteArray(base64DecoderStream);
byte[] encodedBase64 = Base64.encodeBase64(byteArray);
base64Content[0] = new String(encodedBase64, "UTF-8");
base64Content[1] = getContentTypeString(part);
}
And now it's displaying the image just fine.
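The difference between the two versions can be reproduced without JavaMail. The sketch below takes the first 15 bytes of the GIF from the question, re-encodes them once through a String and once directly, and is only meant as an illustration; the class and method names are invented, and UTF-8 is assumed as the charset in the broken path.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class StringRoundTripSketch {

    // Broken path: binary data is turned into text and back before encoding.
    // Bytes that are not valid UTF-8 become U+FFFD (EF BF BD).
    static String viaString(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Fixed path: the bytes go straight into the encoder.
    static String direct(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        byte[] raw = Base64.getDecoder().decode("R0lGODlhEAAQAPcAABxu");
        System.out.println(direct(raw));
        System.out.println(viaString(raw));
    }
}
```

The repeated 77+9 groups in the unexpected output are the Base64 form of EF BF BD, the UTF-8 replacement character, which is a sign that binary data went through a String somewhere along the way.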