JackRabbit Search PDF File

JackRabbit Search PDF File - java

I am using Jackrabbit for basic file operations such as add, delete, search and versioning. It worked well until I got stuck on searching inside PDF files. The code below works for every other format I tried (Word, Excel, plain text) but not for PDF. It does not throw any exception; it simply returns no results when the text is in a PDF file. Is it because my PDF file is not indexed? Please help.
Query query = queryManager.createQuery(
    "select * from [nt:resource] AS resource where contains(resource.*, '%sampletext%')",
    Query.JCR_SQL2);
QueryResult result = query.execute();
RowIterator ri = result.getRows();
while (ri.hasNext()) {
    Row row = ri.nextRow();
    System.out.println("Row: " + row.toString());
}
Thanks in advance

I can think of 3 possible root causes:
The PDF file may not be indexed yet at that time (full-text indexing is done in a background thread, AFAIK).
The PDF library (PDFBox) is not on the classpath.
The PDF could not be indexed for some reason, in which case you would see a warning in the log file.
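The second cause can be checked quickly from code. The following is only a minimal sketch: the class name ClasspathCheck and the helper isOnClasspath are made up for illustration, and org.apache.pdfbox.pdmodel.PDDocument is used purely as a probe for PDFBox. The check is only meaningful if it runs under the same class loader as the repository (for example inside the same web application).

```java
// Sketch: reports whether a class is visible to the class loader that loaded this helper.
final class ClasspathCheck {

    /** Returns true if the named class can be located (it is not initialized). */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but unusable because one of its dependencies is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        // PDDocument is PDFBox's document class; it serves here only as a probe.
        String probe = "org.apache.pdfbox.pdmodel.PDDocument";
        System.out.println("PDFBox on classpath: " + isOnClasspath(probe));
    }
}
```

If this prints false, the PDF text extractor most likely cannot run, which would match a search that silently returns nothing for PDF content.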

Related

Replacing text in XWPFParagraph without changing format of the docx file

I am developing a font converter app that converts Unicode text to Krutidev/Shree Lipi (Marathi/Hindi) font text. The original docx file contains formatted words (color, font, text size, hyperlinks, etc.).
I want the final docx to keep the same formatting as the original after the words are converted from Unicode to the other font.
PFA.
Here is my code:
try {
    fileInputStream = new FileInputStream("StartDoc.docx");
    document = new XWPFDocument(fileInputStream);
    XWPFWordExtractor extractor = new XWPFWordExtractor(document);
    List<XWPFParagraph> paragraph = document.getParagraphs();
    Converter data = new Converter();
    for (XWPFParagraph p : document.getParagraphs()) {
        for (XWPFRun r : p.getRuns()) {
            String string2 = r.getText(0);
            if (string2 == null) {
                continue; // a run may carry no text at position 0
            }
            // Assumes uniToShree returns the converted text (the original discarded the result)
            r.setText(data.uniToShree(string2), 0);
        }
    }
    // Write the document to the file system
    FileOutputStream out = new FileOutputStream(new File("Output.docx"));
    document.write(out);
    out.close();
    System.out.println("Output.docx written successfully");
}
catch (IOException e) {
    System.out.println("We had an error while reading the Word Doc");
}

Thank you for asking.
I worked with POI some years ago, though on Excel workbooks rather than Word documents, but I'll still try to help you reach the root cause of your error.
The exception object itself carries good debugging information!
A good first step to narrow down the error is to stop replacing the exception's own message with a generic one in the catch block.
Try printing the result of e.getLocalizedMessage() or e.getMessage() and see what you get.
Printing the stack trace with printStackTrace() is also often useful for pinpointing where the error lies.
Share what those calls print so that we can help you debug the issue further.
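As an illustration of that advice, here is a small sketch using only the JDK. The class ExceptionReport, the helper describe and the file name missing-file.docx are hypothetical; the point is simply that the catch block reports the exception's own type, message and stack trace instead of a fixed string.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

final class ExceptionReport {

    /** Collects the type, the message and the full stack trace of a throwable in one string. */
    static String describe(Throwable t) {
        StringWriter buffer = new StringWriter();
        buffer.write(t.getClass().getName() + ": " + t.getMessage() + System.lineSeparator());
        t.printStackTrace(new PrintWriter(buffer, true));
        return buffer.toString();
    }

    public static void main(String[] args) {
        // "missing-file.docx" is a made-up name, chosen so that opening it fails.
        try (FileInputStream in = new FileInputStream("missing-file.docx")) {
            in.read();
        } catch (IOException e) {
            // Report what actually went wrong rather than a generic message.
            System.err.println(describe(e));
        }
    }
}
```

In the question's code, the same idea would mean printing e.getMessage() or calling e.printStackTrace() inside the existing catch block.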
[EDIT 1:]
So it seems you are able to convert the font of the text correctly, but you are not able to reproduce the formatting of the original document in the converted file.
(So "We had an error while reading the Word Doc" is a lie when it gets printed ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data because you are working only on the content of your doc files.
To retain the formatting of the content, your solution also needs to be aware of the formatting stored in the doc files and take care of it.
MS Word, which defines the .docx format, follows a particular set of schemas that define the formatting rules. These schemas are defined in Microsoft's XML namespace packages [1].
You can obtain the XML (HTML) form of the doc file quite easily (see the steps in [1] or the code in [2]) and even apply different schemas, or possibly your own schema definitions based on those in MS's namespaces. You can do this either programmatically, for which you need to be versed in XML, XSL and XSLT concepts (w3schools [3] is a good starting point), although this route is hardly less complex than writing your own version of MS Word; or with MS Word's built-in tools, as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
My answer gives only a cursory overview of how to achieve what you want; depending on your inclination and available time, use your discretion before committing to one path rather than the other.
Hope it helps!

Add text in MS Word doc using apache POI

I have been trying to edit different types of documents using Apache POI. The script should handle both the .doc and .docx extensions. I could successfully edit the .docx file using the XWPF API, and the required text was added at the end of the docx file.
For editing .doc files (which include a header, a footer and a few paragraphs), the following script, which uses HWPFDocument, is used:
FileInputStream fis = new FileInputStream(args[0]);
POIFSFileSystem fs = new POIFSFileSystem(fis);
HWPFDocument doc = new HWPFDocument(fs);
Range range = doc.getRange();
CharacterRun run = range.insertAfter("FROM SEHWAGGG A FOUUURRRRRR");
run.setBold(true);
run.setItalic(true);
The script works fine with simple documents that do not have a header and footer, but the issue appears with complex documents: it inserts the text, but in between the paragraphs (or at the beginning when using insertBefore()). No text replacement is required; I just have to put the text at the end of the document. I searched for similar scripts, but most of them handle text replacement.
How can I add the text at the end, after all paragraphs?

I've tested it with the following document:
At first (with your original code) it completely destroyed the document:
By changing the following line, the insert works fine for me:
// Old
Range range = doc.getRange();
// New
Range range = doc.getEndnoteRange();

I'm afraid you are out of luck with HWPF with the current state of the project.
I created a custom HWPF library for one of our clients, but the changes are not public. The changes were huge, so you can't spend - say - a week and assume that things will be fixed. You might get away with the current public HWPF when only some text needs to be replaced without changing the string length ("abc" -> "123" or "a " -> "1234").

IO Issue - Byte Array Image into XHTML(FlyingSaucer)

I have a solution that inserts strings into an XHTML document and prints the results as Reports. My employer has asked if we could pull images off their SQL database (stored as byte arrays) to insert into the Reports.
I am using FlyingSaucer as the XHTML interpreter and I've been using Java DOM to modify pre-stored reports that I have stored in the Report Generator's package.
The only solution I can think of at the moment is to construct the images, save them as a file, link the file in an img tag (or background-image) in a constructed report, print the report and then delete the file. This seems really sloppy and I imagine it will be very time consuming.
I can't help but feel there must be a more elegant solution. Any suggestions for inserting a byte array into html?

Read the image and convert it into its Base64-encoded form:
InputStream image = getClass().getClassLoader().getResourceAsStream("image.png");
String encodedImage = BaseEncoding.base64().encode(ByteStreams.toByteArray(image));
I've used BaseEncoding and ByteStreams from Google Guava.
Change src attribute of img element within your Document object.
Document doc = ...; // get Document from XHTMLPanel.getDocument() or create
// new one using DocumentBuilderFactory
doc.getElementById("myImage").getAttributes().getNamedItem("src").setNodeValue("data:image/png;base64," + encodedImage);
Unfortunately FlyingSaucer does not support data URIs out of the box, so you'll have to create your own ReplacedElementFactory. Read the article "Using Data URLs for embedding images in Flying Saucer generated PDFs"; it contains a complete solution.
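Since the images in the question already arrive as byte arrays, the encoding step can also be done with the JDK alone. The sketch below swaps Guava's BaseEncoding for java.util.Base64 (available since Java 8); the class DataUri, the helper toDataUri and the three placeholder bytes are assumptions made for illustration, and the custom ReplacedElementFactory mentioned above is still needed for FlyingSaucer to render the result.

```java
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DataUri {

    /** Builds a data URI of the form "data:<mimeType>;base64,<payload>" from raw bytes. */
    static String toDataUri(String mimeType, byte[] content) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(content);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder bytes standing in for an image column read from the database.
        byte[] imageBytes = {1, 2, 3};

        // Create an empty DOM document and an <img> element whose src is the data URI.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element img = doc.createElement("img");
        img.setAttribute("src", toDataUri("image/png", imageBytes));

        System.out.println(img.getAttribute("src"));
    }
}
```

This avoids writing temporary image files: the bytes go straight from the database into the src attribute of the report's DOM.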

Reading form checkbox values from a Word document using java

I have a word document with checkboxes in it, and I want to determine whether these are ticked or not and use these results with java. I have tried using a WordExtractor with Apache POI but it didn't seem to include the result.
If I save the docx in txt format it replaces each checkbox with a corresponding 0 or 1, which is ideal, but I don't know how to do that programmatically.

It seems that you are looking for the FtCblsSubRecord class (I didn't try it):
http://poi.apache.org/apidocs/org/apache/poi/hssf/record/FtCblsSubRecord.html
http://poi.apache.org/apidocs/org/apache/poi/hssf/record/class-use/SubRecord.html
Results of search in google: checkbox site:poi.apache.org
=========================================
According to this post, it does not seem to be possible:
http://osdir.com/ml/user-poi.apache.org/2010-10/msg00068.html
Other posts talking about this:
What API can add checkbox to MS Word file using Java?
Insert a checkbox in an Excel sheet using Apache POI

Clean way to convert spreadsheet with many rows into pdf

I'm not looking for a library to convert Excel files to PDF; there are plenty of those available. I'm looking for a clean way to convert a spreadsheet with more columns than fit across the width of a page into a PDF.
Can this even be done? I don't consider making the text smaller a valid option because it could feasibly reach an upper limit (i.e. 1 pt font), and there may be enough columns in the spreadsheet to actually reach that limit (~30).
My only idea right now is to make the pages landscape, but is there a way to have the pdf show as "two-up" with both of the pages in landscape and have the proper page ordering underneath to look like a cohesive spreadsheet?
Any other ideas? or suggestions for the idea I have?

Assuming you can read the Excel file (for instance with Apache POI), consider writing to the PDF with Apache FOP using a custom paper size that you define. It may be difficult to print without a roll paper printer, but it will display on the screen just fine.

Have you looked at JasperReports? It has a pretty strong templating engine.
I've never used JasperReports the way you do, but their specialty is dynamic reports so I'd guess they know how to handle page overflows in a nice way.

Here's what I ended up doing. It uses the QuickLook feature on macOS to make an HTML file, then uses wkhtmltopdf to turn the HTML file into a PDF.
#!/usr/bin/env python3
#
# Convert an Excel workbook to a PDF on a Mac
#
import os
import os.path
import plistlib
import sys
from subprocess import call
# Both the input workbook and the output PDF name are required
if len(sys.argv) != 3:
    print("Usage: %s filename.xls output.pdf" % sys.argv[0])
    sys.exit(1)
if os.path.exists("xdir"):
    raise RuntimeError("xdir must not exist")
os.mkdir("xdir")
# Ask QuickLook to render the workbook as HTML into xdir
call(['qlmanage', '-o', 'xdir', '-p', sys.argv[1]])
# Now we need to find the sheets and sort them.
# This is done by reading the property list
qldir = sys.argv[1] + ".qlpreview"
with open("%s/%s/%s" % ('xdir', qldir, 'PreviewProperties.plist'), 'rb') as propfile:
    plist = plistlib.load(propfile)
attachments = plist['Attachments']
sheets = []
for k in attachments.keys():
    if k.endswith(".html"):
        basename = os.path.basename(k)
        fn = attachments[k]['DumpedAttachmentFileName']
        print("Found %s -> %s" % (basename, fn))
        sheets.append((basename, fn))
sheets.sort()
# Use wkhtmltopdf to generate the PDF output
os.chdir("%s/%s" % ('xdir', qldir))
cmd = ['wkhtmltopdf']
for (basename, fn) in sheets:
    cmd.append(fn)
cmd.append("../../" + sys.argv[2])
try:
    call(cmd)
except OSError:
    print("\n\nERROR: %s is not installed\n\n" % (cmd[0]))
    sys.exit(1)
os.chdir("../..")
# Clean up the temporary QuickLook output
call(['/bin/rm', '-rf', 'xdir'])
