I am working with the Apache Tika Java library from Lucee (an open-source CFML engine). When I run an online example to extract text from a PDF, I get an empty string.
What is the setup?
I've installed Lucee locally and have access to an empty index.cfm page. I've added the Tika jar file to the project and I see it's loaded correctly in the Lucee admin.
What is the code?
The following is the simplest code I could find to convert a PDF to text:
handler = createObject("java", "org.apache.tika.sax.BodyContentHandler");
metadata = createObject("java", "org.apache.tika.metadata.Metadata");
inputstream = createObject("java", "java.io.FileInputStream").init(createObject("java", "java.io.File").init('C:\lucee\tomcat\webapps\ROOT\test\dummy.pdf'));
pcontext = createObject("java", "org.apache.tika.parser.ParseContext");
pdfparser = createObject("java", "org.apache.tika.parser.AutoDetectParser");
pdfparser.parse(inputstream, handler, metadata, pcontext);
writeDump(handler.toString());
When I run this I get an empty string, but I expect the text inside the PDF. All the metadata is empty as well.
Conclusion
I think the library is perhaps not loaded correctly, but what can I do to see where this is going wrong? I don't get any error, just empty values. I have tried different PDFs and even other file types, and I have tried the AutoDetectParser and several different code variations. Is this maybe a Lucee problem? Or a Java problem?
Related
We are in the process of converting over to using the XSLT compiler for page generation. I have a Xalan Java extension to exploit the CSSDK and capture some metadata we have stored in the Extended Attributes for output to the page. There are no problems getting the EAs rendered to the output file.
The problem is that I don't know how to dynamically capture the file path and name of the output file.
So, just as a proof of concept, I have the CSVPath hard-coded to the output file in my Java extension. Here's a code sample:
CSSimpleFile sourceFile = (CSSimpleFile)client.getFile(new CSVPath("/some-path-to-the-output.jsp"));
Can someone point me in the CSSDK to where I could capture the output file?
I found the answer.
First, get or create your CSClient. You can use the examples provided in the cssdk/samples. I tweaked one so that I captured the CSClient in the method getClientForCurrentUser(). Watch out for SOAP vs Java connections. In development, I was using a SOAP connection and for the make_toolkit build, the Java connection was required for our purposes.
Check the following snippet. The request CSClient is captured in the static variable client.
CSSimpleFile sourceFile = (CSSimpleFile)client.getFile(new CSVPath(XSLTExtensionContext.getContext().getOutputDirectory().toString() + "/" + XSLTExtensionContext.getContext().getOutputFileName()));
I am generating a PDF in Java and trying both to attach it to an email and to download it from the browser. The browser download works fine, but the mail attachment is where I am facing an issue. The file is attached, and the attachment name and file size are intact. The problem is that when I open the PDF from the mail attachment it shows nothing: the correct number of pages, but with no content. When I hardcode the file downloaded from the browser as the attachment, it works fine, so I suppose the problem is not with the PDF generation. I tried opening both files (the one downloaded from the browser and the one downloaded from mail) in the comparison tool Beyond Compare. The one downloaded from mail shows a conversion error. When I open them in Notepad++, they show different encodings. I am not very familiar with encodings, but I suppose the problem has something to do with them.
I also observed that the content of the mail download is the same as the content at PDF generation, but the content of the browser download is different.
An excerpt of what I get from the browser download is below (the content is too large to paste in full):
%PDF-1.4
%âãÏÓ
4 0 obj <</Type/XObject/ColorSpace/DeviceRGB/Subtype/Image/BitsPerComponent 8/Width 193/Length 11222/Height 58/Filter/DCTDecode>>stream
ÿØÿà
An excerpt of what I get from the mail download is below:
%PDF-1.4
%????
4 0 obj <</Type/XObject/ColorSpace/DeviceRGB/Subtype/Image/BitsPerComponent 8/Width 193/Length 11222/Height 58/Filter/DCTDecode>>stream
????
I am using Spring's MimeMessageHelper to send the message, and the method below to add the attachment:
MimeMessageHelper.addAttachment(fileName, new ByteArrayResource(attachmentContent.getBytes()), "application/pdf");
I've also tried another way of attaching but in vain
DataSource dataSource = new ByteArrayDataSource(bytes, "application/pdf");
MimeBodyPart pdfBodyPart = new MimeBodyPart();
pdfBodyPart.addHeader("Content-Type", "application/pdf;charset=UTF-8");
pdfBodyPart.addHeader("Content-disposition", "attachment; filename="+fileName);
pdfBodyPart.setDataHandler(new DataHandler(dataSource));
pdfBodyPart.setFileName(fileName);
mimeMessageHelper.getMimeMultipart().addBodyPart(pdfBodyPart);
Any help would be greatly appreciated. Thanks in advance
I'm not sure if this has anything to do with it but I noticed you're not setting the actual charset in pdfBodyPart.addHeader("Content-Type", "application/pdf;charset");, nor are you calling attachmentContent.getBytes() with a charset as parameter. How is it supposed to know which one you want to use?
What Content-Transfer-Encoding is being used for the attachment in the message you receive? Normally JavaMail will choose an appropriate value, but if the document contains an unusual mix of plain text and binary, as yours seems to, JavaMail may not choose the best encoding. You can try adding pdfBodyPart.setHeader("Content-Transfer-Encoding", "base64");
I found out why it wasn't working. It is an encoding issue, but it has nothing to do with MimeMessageHelper. The problem was that I generated the PDF to an OutputStream, converted it to a String, and then converted that String into a byte array. Converting the binary content to a String changed the encoding, which corrupted the data. I fixed it by getting the byte array directly from the OutputStream :)
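To illustrate the fix described above, here is a minimal sketch of why the String round trip damages a PDF. It writes the four binary marker bytes from the PDF header excerpt into a ByteArrayOutputStream and compares the two ways of getting a byte array back. The class and method names are made up for the illustration, and UTF-8 is an assumption: the original code used the platform default charset, which may corrupt the bytes in a different way (for example into "?" characters).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PdfBytesRoundTrip {

    // The binary marker bytes from the second line of the PDF header
    // (they display as "%âãÏÓ" without the "%" when viewed as Latin-1).
    static final byte[] BINARY_MARKER = {(byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3};

    // Lossy path: decode arbitrary binary data as text, then encode it again.
    // Byte sequences that are not valid in the charset are replaced on decoding.
    static byte[] viaString(ByteArrayOutputStream out) {
        String text = new String(out.toByteArray(), StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Lossless path: take the bytes straight from the stream.
    static byte[] direct(ByteArrayOutputStream out) {
        return out.toByteArray();
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(BINARY_MARKER, 0, BINARY_MARKER.length);

        System.out.println("via String intact: " + Arrays.equals(BINARY_MARKER, viaString(out)));
        System.out.println("direct intact:     " + Arrays.equals(BINARY_MARKER, direct(out)));
    }
}
```

The first comparison reports false and the second true, which matches the symptom in the question: the page structure (plain ASCII) survives, but the binary streams do not.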
I have the path to my XML file on my computer, but how can I use Selenium (the web automation tool) to inject the XML file?
Usually, the manual way is to navigate to the URL and copy and paste the entire XML text into the provided text box.
Any ideas how to inject the file using automation? There is no way to "drag" the XML file to the text box, and I believe the approach I have in mind is very complicated.
I think this is actually what you want -
File xml = new File("xmlpath");
String url = xml.getAbsolutePath();
url = url.replace('\\', '/');
url = url.replace(" ", "%20");
String actual = "file:/" + url;
selenium.open(actual);
Then you should be able to get the xml using String theXML = selenium.getText("//rootxmlnode"); Then do what you will with it.
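As a side note on the path handling above, the manual replace calls can probably be swapped for File.toURI(), which makes the path absolute and percent-encodes characters such as spaces. This is only a sketch: the class name and the file name "my data.xml" are hypothetical, and the resulting string would be passed to selenium.open(...) as in the snippet above.

```java
import java.io.File;

class FileUrlSketch {

    // Converts a local path into a file: URL string.
    // File.toURI() resolves the path against the working directory
    // and escapes characters that are not legal in a URI (e.g. spaces).
    static String toFileUrl(String path) {
        return new File(path).toURI().toString();
    }

    public static void main(String[] args) {
        // "my data.xml" is a hypothetical file name used for illustration.
        System.out.println(toFileUrl("my data.xml"));
    }
}
```

The exact output depends on the working directory and the operating system, so none is shown here.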
Check out the topic of Data Driven Testing to get you started. Something like this should get you going.
The Selenium tool can generate Java test code for you automatically.
So, place any text in the provided text box and generate the Java test code.
The next step is to modify the generated test. Write some simple code by hand that reads your XML file and gets its contents. Then, in the generated Java test code, replace the placeholder text you typed with the contents read from the XML file.
The simplest way to read a file into a string is with the Apache commons-io library.
For example: FileUtils.readFileToString(File file, String encoding) gives you a string object with contents of the file.
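If adding commons-io is not an option, the same step can be sketched with the standard library alone. This is a stand-in for the FileUtils call, not the commons-io API itself. The class name, helper names, and sample XML are hypothetical, and UTF-8 is an assumed encoding.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadXmlToString {

    // Reads the whole file into a String, assuming it is UTF-8 encoded.
    static String readFile(Path path) {
        try {
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for the demo only: writes sample content to a temporary file.
    static Path writeTempFile(String content) {
        try {
            Path tmp = Files.createTempFile("sample", ".xml");
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path tmp = writeTempFile("<root><item>1</item></root>");
        String xml = readFile(tmp);
        System.out.println(xml);
        tmp.toFile().delete();
    }
}
```

The resulting string is what you would then put into the text box in place of the placeholder text, for example with the Selenium RC type command and whatever locator the generated test uses.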
I need to get the source code of a particular URL using Java. I was able to get the source for a UTF-8 encoded web page, but not for a page encoded in ISO-8859-1. Is it possible to get the source of an ISO-8859-1 website with a Java program? Please help.
If you are reading the page with the following method, you need to specify the character set explicitly:
URL url = new URL(URL_TO_READ);
BufferedReader in = new BufferedReader(
new InputStreamReader(url.openStream(),"ISO-8859-1" ));
However, if your requirement includes a little parsing, I would suggest using jsoup. It reads the character set from the server's response, and you can also set the charset explicitly.
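To show why the explicit charset matters, here is a small offline sketch. It replaces url.openStream() with an in-memory stream of ISO-8859-1 bytes and decodes it twice. The class name and the sample markup are hypothetical; a real page would come from the network.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDecodeSketch {

    // Reads the whole stream as text, decoding with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes a server would send for an ISO-8859-1 page.
        String original = "<p>caf\u00e9</p>";
        byte[] page = original.getBytes(StandardCharsets.ISO_8859_1);

        String right = readAll(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1);
        String wrong = readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 matches: " + right.equals(original));
        System.out.println("UTF-8 matches:      " + wrong.equals(original));
    }
}
```

Decoding with the matching charset reproduces the text, while decoding the same bytes as UTF-8 turns the accented character into a replacement character.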
I want to use PDFjet for a Google App Engine project.
I downloaded the Java jar from the PDFjet home page.
I followed an example given in a Stack Overflow example and the examples given on the home page.
All the examples use an empty constructor: PDF pdf=new PDF();. However, when I try to use it,
it says that the constructor PDF() is undefined. Furthermore, none of the methods shown work:
pdf.wrap(): is undefined
pdf.save("Example_03.pdf"): is undefined
It looks like the examples on their web page are out of date. Look at the examples in the zip download instead. This simple example works for me:
OutputStream out = new FileOutputStream("test.pdf");
PDF pdf = new PDF(out);
Page page = new Page(pdf, Letter.PORTRAIT);
pdf.flush();
out.close();
Ok, this is easy. Instead of writing to req.getOutputStream() directly, create an instance of ByteArrayOutputStream and use that.
To send it, just call out.toByteArray() and add the result to the attachment part.