How to read bookmarks in a PDF at multiple levels using iText? - java

I am using iText-Java to split PDFs at bookmark level.
Does anybody know of, or have any examples of, splitting a PDF at bookmarks that exist at level 2 or 3?
For example, I have bookmarks at the following levels:
Father
|-Son
|-Son
|-Daughter
|-|-Grand son
|-|-Grand daughter
Right now I have the code below, which reads only the top-level bookmark (Father). Basically, the SimpleBookmark.getBookmark(reader) line does all the work.
But I want to read the level 2 and level 3 bookmarks so I can split the content between those inner-level bookmarks.
public static void splitPDFByBookmarks(String pdf, String outputFolder){
    try {
        PdfReader reader = new PdfReader(pdf);
        // List of bookmarks: each bookmark is a map with values for title, page, etc.
        List<HashMap> bookmarks = SimpleBookmark.getBookmark(reader);
        for (int i = 0; i < bookmarks.size(); i++) {
            HashMap bm = bookmarks.get(i);
            HashMap nextBM = i == bookmarks.size() - 1 ? null : bookmarks.get(i + 1);
            // In my case I needed to split the title string
            String title = ((String) bm.get("Title")).split(" ")[2];
            log.debug("Title: " + title);
            String startPage = ((String) bm.get("Page")).split(" ")[0];
            String startPageNextBM = nextBM == null ? "" + (reader.getNumberOfPages() + 1) : ((String) nextBM.get("Page")).split(" ")[0];
            log.debug("Page: " + startPage);
            log.debug("------------------");
            extractBookmarkToPDF(reader, Integer.valueOf(startPage), Integer.valueOf(startPageNextBM), title + ".pdf", outputFolder);
        }
    } catch (IOException e) {
        log.error(e.getMessage());
    }
}
private static void extractBookmarkToPDF(PdfReader reader, int pageFrom, int pageTo, String outputName, String outputFolder){
    Document document = new Document();
    OutputStream os = null;
    try {
        os = new FileOutputStream(outputFolder + outputName);
        // Create a writer for the output stream
        PdfWriter writer = PdfWriter.getInstance(document, os);
        document.open();
        PdfContentByte cb = writer.getDirectContent(); // Holds the PDF data
        PdfImportedPage page;
        while (pageFrom < pageTo) {
            document.newPage();
            page = writer.getImportedPage(reader, pageFrom);
            cb.addTemplate(page, 0, 0);
            pageFrom++;
        }
        os.flush();
        document.close();
        os.close();
    } catch (Exception ex) {
        log.error(ex.getMessage());
    } finally {
        if (document.isOpen())
            document.close();
        try {
            if (os != null)
                os.close();
        } catch (IOException ioe) {
            log.error(ioe.getMessage());
        }
    }
}
Your help is much appreciated.
Thanks in advance! :)

You get an ArrayList<HashMap> when you call SimpleBookmark.getBookmark(reader) (do the cast if you need it). Try iterating through that list and inspecting its structure. If a bookmark has sons (as you call them), it contains another list with the same structure, stored under the "Kids" key.
A recursive method could be the solution.
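For illustration, here is a minimal recursive sketch (iText 5, using the same raw HashMap types as the question) that flattens every bookmark level into a single list; the children of each entry live under the "Kids" key of the bookmark map:
@SuppressWarnings("unchecked")
private static void flattenBookmarks(List<HashMap> bookmarks, List<HashMap> flat) {
    // Flatten all bookmark levels (Father, Son, Grand son, ...) into one list
    if (bookmarks == null) {
        return;
    }
    for (HashMap bm : bookmarks) {
        flat.add(bm);
        // Recurse into the child bookmarks stored under the "Kids" key
        flattenBookmarks((List<HashMap>) bm.get("Kids"), flat);
    }
}
In splitPDFByBookmarks you could then build the flat list once and iterate over it instead of the top-level list:
List<HashMap> flat = new ArrayList<HashMap>();
flattenBookmarks(SimpleBookmark.getBookmark(reader), flat);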

For reference, here is how to do the same with iText 7:
public void walkOutlines(PdfOutline outline, Map<String, PdfObject> names, PdfDocument pdfDocument, List<String> titles, List<Integer> pageNum) {
    // Loop traversing all outline entries
    for (PdfOutline child : outline.getAllChildren()) {
        if (child.getDestination() != null) {
            prepareIndexFile(child, names, pdfDocument, titles, pageNum);
        }
    }
}
//-----Getting pageNumbers from outlines
public void prepareIndexFile(PdfOutline outline, Map<String, PdfObject> names, PdfDocument pdfDocument, List<String> titles, List<Integer> pageNum) {
    String title = outline.getTitle();
    PdfDestination pdfDestination = outline.getDestination();
    String pdfStr = ((PdfString) pdfDestination.getPdfObject()).toUnicodeString();
    PdfArray array = (PdfArray) names.get(pdfStr);
    PdfObject pdfObj = array != null ? array.get(0) : null;
    Integer pageNumber = pdfDocument.getPageNumber((PdfDictionary) pdfObj);
    titles.add(title);
    pageNum.add(pageNumber);
    if (outline.getAllChildren().size() > 0) {
        for (PdfOutline child : outline.getAllChildren()) {
            prepareIndexFile(child, names, pdfDocument, titles, pageNum);
        }
    }
}
public boolean splitPdf(String inputFile, final String outputFolder) {
    boolean splitSuccess = true;
    PdfDocument pdfDoc = null;
    try {
        PdfReader pdfReaderNew = new PdfReader(inputFile);
        pdfDoc = new PdfDocument(pdfReaderNew);
        final List<String> titles = new ArrayList<String>();
        List<Integer> pageNum = new ArrayList<Integer>();
        PdfNameTree destsTree = pdfDoc.getCatalog().getNameTree(PdfName.Dests);
        Map<String, PdfObject> names = destsTree.getNames(); // Core logic for getting names
        PdfOutline root = pdfDoc.getOutlines(false); // Core logic for getting outlines
        walkOutlines(root, names, pdfDoc, titles, pageNum); // Collect bookmark titles and page numbers
        if (titles == null || titles.size() == 0) {
            splitSuccess = false;
        } else { // Proceed if it has bookmarks
            for (int i = 0; i < titles.size(); i++) {
                String title = titles.get(i);
                int startPage = pageNum.get(i);
                int endPage = startPage;
                if (i == titles.size() - 1) {
                    endPage = pdfDoc.getNumberOfPages();
                } else {
                    int nextPage = pageNum.get(i + 1);
                    if (nextPage > startPage) {
                        endPage = nextPage - 1;
                    } else {
                        endPage = nextPage;
                    }
                }
                String outFileName = outputFolder + File.separator + getFileName(title) + ".pdf";
                PdfWriter pdfWriter = new PdfWriter(outFileName);
                PdfDocument newDocument = new PdfDocument(pdfWriter, new DocumentProperties().setEventCountingMetaInfo(null));
                pdfDoc.copyPagesTo(startPage, endPage, newDocument);
                newDocument.close();
                pdfWriter.close();
            }
        }
    } catch (Exception e) {
        splitSuccess = false;
        // log the exception
    }
    return splitSuccess;
}
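Note: getFileName(title) is referenced above but its implementation is not shown in the answer. A minimal sketch of such a helper (hypothetical, not from the original post) that makes the bookmark title safe to use as a file name:
// Hypothetical helper: replace characters that are illegal in file names
private static String getFileName(String title) {
    return title.replaceAll("[\\\\/:*?\"<>|]", "_").trim();
}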

Related

replace a text using pdfbox for PDF file

I have 4 PDF files that came from one .doc file; I used 4 methods to convert the doc to PDF (Foxit Reader, Nitro, a web service, and Word).
Then I used PDFBox to search and replace some words. The problem is that, for some reason, it only works for the files from Foxit Reader and Word, but not for the files created by Nitro and the web service.
Does anyone have a clue?
This is the code I used:
public static void replace(String s) {
    PDDocument doc = null;
    int occurrences = 0;
    try {
        doc = PDDocument.load(s); // Input PDF file name
        System.out.println("+e" + doc);
        List pages = doc.getDocumentCatalog().getAllPages();
        for (int i = 0; i < pages.size(); i++) {
            PDPage page = (PDPage) pages.get(i);
            PDStream contents = page.getContents();
            PDFStreamParser parser = new PDFStreamParser(contents.getStream());
            parser.parse();
            List tokens = parser.getTokens();
            for (int j = 0; j < tokens.size(); j++) {
                Object next = tokens.get(j);
                if (next instanceof PDFOperator) {
                    PDFOperator op = (PDFOperator) next;
                    // Tj and TJ are the two operators that display strings in a PDF
                    if (op.getOperation().equals("Tj")) {
                        // Tj takes one operand, the string to display, so update that operand
                        COSString previous = (COSString) tokens.get(j - 1);
                        String string = previous.getString();
                        if (string.contains("#signature#")) {
                            // The word to change; here "#signature#" becomes "sam"
                            string = string.replace("#signature#", "sam");
                            occurrences++;
                        }
                        previous.reset();
                        previous.append(string.getBytes("ISO-8859-1"));
                    } else if (op.getOperation().equals("TJ")) {
                        // TJ takes an array of strings and positioning numbers;
                        // concatenate the strings, replace, then rebuild the array
                        COSArray previous = (COSArray) tokens.get(j - 1);
                        COSString temp = new COSString();
                        String tempString = "";
                        for (int t = 0; t < previous.size(); t++) {
                            if (previous.get(t) instanceof COSString) {
                                tempString += ((COSString) previous.get(t)).getString();
                            }
                        }
                        temp.append(tempString.getBytes("ISO-8859-1"));
                        tempString = temp.getString();
                        if (tempString.contains("#signature#")) {
                            tempString = tempString.replace("#signature#", "sam");
                            occurrences++;
                        }
                        previous.clear();
                        String[] stringArray = tempString.split(" ");
                        for (String string : stringArray) {
                            COSString cosString = new COSString();
                            string = string + " ";
                            cosString.append(string.getBytes("ISO-8859-1"));
                            previous.add(cosString);
                        }
                    }
                }
            }
            // Now that the tokens are updated, replace the page content stream
            PDStream updatedStream = new PDStream(doc);
            OutputStream out = updatedStream.createOutputStream();
            ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
            tokenWriter.writeTokens(tokens);
            page.setContents(updatedStream);
        }
        System.out.println("number of matches found: " + occurrences);
        doc.save(s + "_convert.pdf"); // Output file name
    } catch (Exception ex) {
        System.out.println("eee+" + ex.getMessage());
    } finally {
        if (doc != null) {
            try {
                doc.close();
            } catch (IOException ex) {
                ex.getStackTrace();
            }
        }
    }
}

Add padding to table cells using iTextPdf

The following is demo code that generates a PDF document from an HTML source:
public class SimpleAdhocReport
{
    public SimpleAdhocReport()
    {
        build();
    }

    private void build()
    {
        AdhocConfiguration configuration = new AdhocConfiguration();
        AdhocReport report = new AdhocReport();
        configuration.setReport(report);
        AdhocColumn column = new AdhocColumn();
        column.setName("item");
        report.addColumn(column);
        column = new AdhocColumn();
        column.setName("orderdate");
        report.addColumn(column);
        column = new AdhocColumn();
        column.setName("quantity");
        report.addColumn(column);
        column = new AdhocColumn();
        column.setName("unitprice");
        report.addColumn(column);
        try
        {
            AdhocManager.saveConfiguration(configuration, new FileOutputStream("d:/configuration.xml"));
            @SuppressWarnings("unused")
            AdhocConfiguration loadedConfiguration = AdhocManager.loadConfiguration(new FileInputStream("d:/configuration.xml"));
            JasperReportBuilder reportBuilder = AdhocManager.createReport(configuration.getReport());
            reportBuilder.setDataSource(createDataSource());
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            reportBuilder.toHtml(baos);
            String html = new String(baos.toByteArray(), "UTF-8");
            baos.close();
            Whitelist wl = Whitelist.simpleText();
            wl.addTags("table", "tr", "td");
            String clean = Jsoup.clean(html, wl);
            clean = clean.replace("<td></td>", "");
            clean = clean.replace("<td> </td>", "");
            clean = clean.replace("<td> ", "<td>");
            Document doc = Jsoup.parse(clean);
            for (Element element : doc.select("*"))
            {
                if (!element.hasText() && element.isBlock())
                {
                    element.remove();
                }
            }
            clean = doc.body().html();
            int startIndex = clean.indexOf("<table>", 6);
            int endIndex = clean.indexOf("</table>");
            clean = clean.substring(startIndex, endIndex + 8);
            BufferedWriter writer = new BufferedWriter(new FileWriter("d:/test.html"));
            writer.write(clean);
            writer.close();
            try
            {
                createPdf(clean);
            }
            catch (DocumentException e)
            {
                e.printStackTrace();
            }
        }
        catch (DRException e)
        {
            e.printStackTrace();
        }
        catch (FileNotFoundException e)
        {
            e.printStackTrace();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    private JRDataSource createDataSource()
    {
        DRDataSource dataSource = new DRDataSource("item", "orderdate", "quantity", "unitprice");
        for (int i = 0; i < 20; i++)
        {
            dataSource.add("Book", new Date(), (int) (Math.random() * 10) + 1,
                    new BigDecimal(Math.random() * 100 + 1).setScale(4, BigDecimal.ROUND_HALF_UP));
        }
        return dataSource;
    }

    public static void main(String[] args)
    {
        new SimpleAdhocReport();
    }

    public void createPdf(String html) throws IOException, DocumentException
    {
        com.itextpdf.text.Document document = new com.itextpdf.text.Document(PageSize.LETTER);
        document.setMargins(30, 30, 80, 30);
        PdfWriter.getInstance(document, new FileOutputStream("D:\\HTMLtoPDF.pdf"));
        document.open();
        PdfPTable table = null;
        ElementList list = com.itextpdf.tool.xml.XMLWorkerHelper.parseToElementList(html, null);
        for (com.itextpdf.text.Element element : list)
        {
            table = new PdfPTable((PdfPTable) element);
        }
        table.setWidthPercentage(100);
        ArrayList<PdfPRow> rows = table.getRows();
        for (PdfPRow rw : rows)
        {
            PdfPCell[] cells = rw.getCells();
            for (PdfPCell cl : cells)
            {
                cl.setVerticalAlignment(com.itextpdf.text.Element.ALIGN_MIDDLE);
                cl.setBorder(PdfPCell.NO_BORDER);
                cl.setNoWrap(true);
                cl.setPadding(10f);
                cl.setCellEvent(new MyCell());
            }
        }
        document.add(table);
        document.close();
    }
}

class MyCell implements PdfPCellEvent
{
    public void cellLayout(PdfPCell cell, Rectangle position, PdfContentByte[] canvases)
    {
        float x1 = position.getLeft() - 2;
        float x2 = position.getRight() + 2;
        float y1 = position.getTop() + 2;
        float y2 = position.getBottom() - 2;
        PdfContentByte canvas = canvases[PdfPTable.LINECANVAS];
        canvas.rectangle(x1, y1, x2 - x1, y2 - y1);
        canvas.stroke();
    }
}
I am working with JasperReports to create an ad hoc report and generate the HTML from there. I have to generate a PDF from this HTML.
A couple of issues I am facing, any help is appreciated:
I am setting
table.setWidthPercentage(100);
but for a page with a table it's not working.
I have to increase the spacing between columns. I tried what Bruno suggested here; it's not working. I have also tried a solution from here, with no luck. Ref. image below.
Also, setting the cell event on the default cell is not working, e.g.:
table.getDefaultCell().setCellEvent()
Any suggestions?
Update:
My Output
I was able to get the padding I want by parsing the HTML the following way:
public PdfPTable getTable(String cleanHTML) throws IOException
{
    // CSS
    String CSS = "tr { text-align: center; } td { padding: 5px; }";
    CSSResolver cssResolver = new StyleAttrCSSResolver();
    CssFile cssFile = XMLWorkerHelper.getCSS(new ByteArrayInputStream(CSS.getBytes()));
    cssResolver.addCss(cssFile);
    // HTML
    HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
    htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
    // Pipelines
    ElementList elements = new ElementList();
    ElementHandlerPipeline pdf = new ElementHandlerPipeline(elements, null);
    HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
    CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);
    // XML Worker
    XMLWorker worker = new XMLWorker(css, true);
    XMLParser p = new XMLParser(worker);
    p.parse(new ByteArrayInputStream(cleanHTML.getBytes()));
    return (PdfPTable) elements.get(0);
}
That fixes the issue mentioned in question 2. Q3 is no longer required.
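For completeness, a hypothetical way to wire the returned table into the createPdf method above, replacing the parseToElementList loop:
// Assumes 'clean' holds the cleaned HTML produced in build()
PdfPTable table = getTable(clean);
table.setWidthPercentage(100);
document.add(table);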

Replace data in a Word document in Alfresco using Java code, excluding junk characters

I am doing a bulk upload task in Alfresco.
Before this I created a custom action to call Java code. I also successfully read data from an Excel sheet, and I found the node references of the target document as well as the source document. Using those node references I am also able to create multiple new documents.
Now my requirement is: I want to replace the Excel data in those newly created documents. I tried to replace it, but it replaces the string only in the first line of the document, and it also deletes the rest of the existing content inside the newly created document. I have written the code below for this.
In the code below I am first simply trying to replace some hard-coded data in the document.
But my requirement is to replace the data inside the document with what I already read from the Excel file.
Java Code:
public class MoveReplacedActionExecuter extends ActionExecuterAbstractBase {

    InputStream is;
    Cell cell = null;
    public static final String NAME = "move-replaced";
    private FileFolderService fileFolderService;
    private NodeService nodeService;
    private ContentService contentService;
    private SearchService searchService;

    @Override
    protected void addParameterDefinitions(List<ParameterDefinition> paramList) {
    }

    public void executeImpl(Action ruleAction, NodeRef actionedUponNodeRef) {
        try {
            ContentReader contentReader = contentService.getReader(actionedUponNodeRef, ContentModel.PROP_CONTENT);
            is = contentReader.getContentInputStream();
        } catch (NullPointerException ne) {
            System.out.println("Null Pointer Exception" + ne);
        }
        try {
            Workbook workbook = new XSSFWorkbook(is);
            Sheet firstSheet = workbook.getSheetAt(0);
            Iterator<Row> iterator = firstSheet.rowIterator();
            while (iterator.hasNext()) {
                ArrayList<String> al = new ArrayList<>();
                System.out.println("");
                Row nextRow = iterator.next();
                Iterator<Cell> cellIterator = nextRow.cellIterator();
                while (cellIterator.hasNext()) {
                    cell = cellIterator.next();
                    switch (cell.getCellType()) {
                        case Cell.CELL_TYPE_STRING:
                            System.out.print("\t" + cell.getStringCellValue());
                            al.add(cell.getStringCellValue());
                            break;
                        case Cell.CELL_TYPE_BOOLEAN:
                            System.out.print("\t" + cell.getBooleanCellValue());
                            al.add(String.valueOf(cell.getBooleanCellValue()));
                            break;
                        case Cell.CELL_TYPE_NUMERIC:
                            System.out.print("\t" + cell.getNumericCellValue());
                            al.add(String.valueOf(cell.getNumericCellValue()));
                            break;
                    }
                }
            }
            is.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        String query = "PATH:\"/app:company_home/cm:Dipak/cm:OfferLetterTemplate.doc\"";
        SearchParameters sp = new SearchParameters();
        StoreRef storeRef = new StoreRef(StoreRef.PROTOCOL_WORKSPACE, "SpacesStore");
        sp.addStore(storeRef);
        sp.setLanguage(SearchService.LANGUAGE_LUCENE);
        sp.setQuery(query);
        ResultSet resultSet = searchService.query(sp);
        System.out.println("Result Set" + resultSet.length());
        NodeRef sourceNodeRef = null;
        for (ResultSetRow row : resultSet) {
            NodeRef currentNodeRef = row.getNodeRef();
            sourceNodeRef = currentNodeRef;
            System.out.println(currentNodeRef.toString());
        }
        NodeRef n = new NodeRef("workspace://SpacesStore/78342318-37b8-4b42-aadc-bb0ed5d413d9");
        try {
            org.alfresco.service.cmr.model.FileInfo fi = fileFolderService.copy(sourceNodeRef, n, "JustCreated" + Math.random() + ".doc");
            NodeRef newNode = fi.getNodeRef();
            QName TYPE_AUTHORTY = QName.createQName("sunpharma.hr.model", "hrdoctype");
            nodeService.setType(newNode, TYPE_AUTHORTY);
            ContentReader contentReader1 = contentService.getReader(newNode, ContentModel.PROP_CONTENT);
            InputStream is2 = contentReader1.getContentInputStream();
            POIFSFileSystem fs = new POIFSFileSystem(is2);
            HWPFDocument doc = new HWPFDocument(fs);
            doc = replaceText1(doc, "Company", "Datamatics");
            ContentWriter writerDoc = contentService.getWriter(newNode, ContentModel.PROP_CONTENT, true);
            // NOTE: this writes only the extracted plain text into the node,
            // which is why the rest of the document content is lost (see the answer below)
            writerDoc.putContent(doc.getDocumentText());
        } catch (FileExistsException | FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static HWPFDocument replaceText1(HWPFDocument doc, String findText, String replaceText) {
        System.out.println("In the method replacetext" + replaceText);
        Range r1 = doc.getRange();
        System.out.println("Range of Doc : " + r1);
        for (int i = 0; i < r1.numSections(); ++i) {
            Section s = r1.getSection(i);
            for (int x = 0; x < s.numParagraphs(); x++) {
                Paragraph p = s.getParagraph(x);
                for (int z = 0; z < p.numCharacterRuns(); z++) {
                    CharacterRun run = p.getCharacterRun(z);
                    String text = run.text();
                    if (text.contains(findText)) {
                        run.replaceText(findText, replaceText);
                    } else {
                        System.out.println("NO text found");
                    }
                }
            }
        }
        return doc;
    }

    public void setFileFolderService(FileFolderService fileFolderService) {
        this.fileFolderService = fileFolderService;
    }

    public void setNodeService(NodeService nodeService) {
        this.nodeService = nodeService;
    }

    public void setContentService(ContentService contentService) {
        this.contentService = contentService;
    }

    public void setSearchService(SearchService searchService) {
        this.searchService = searchService;
    }
}
It's not possible to take a direct file stream object in Alfresco.
So I created a file on a local drive and performed all the replacement operations on it in the background. After that I read all the data back with a file input stream, and later I used that stream with the node.
That gave me my desired output. :)
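A minimal sketch of that approach, with illustrative names based on the executeImpl code above: serialize the modified HWPFDocument to a temporary file, then hand that file to the Alfresco ContentWriter so the binary .doc content is preserved:
// Write the whole Word document (not just its text) to a temp file
File tmp = File.createTempFile("replaced", ".doc");
try (FileOutputStream fos = new FileOutputStream(tmp)) {
    doc.write(fos); // serializes the complete .doc, formatting included
}
// ...then stream that file into the node instead of calling putContent(doc.getDocumentText())
ContentWriter writerDoc = contentService.getWriter(newNode, ContentModel.PROP_CONTENT, true);
writerDoc.setMimetype("application/msword");
writerDoc.putContent(tmp);
tmp.delete();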

Splitting a large PDF file with PDFBox produces large result files

I am processing some large PDF files (up to 100 MB and about 2000 pages) with PDFBox. Some of the pages contain a QR code; I want to split those files into smaller ones with the pages from one QR code to the next.
I got this working, but the result file sizes are the same as the source file. I mean, if I cut a 100 MB PDF file into ten files, I get ten files of 100 MB each.
This is the code:
PDDocument documentoPdf =
        PDDocument.loadNonSeq(new File("myFile.pdf"),
                new RandomAccessFile(new File("./tmp/temp"), "rw"));
int numPages = documentoPdf.getNumberOfPages();
List pages = documentoPdf.getDocumentCatalog().getAllPages();
int previusQR = 0;
for (int i = 0; i < numPages; i++) {
    PDPage page = (PDPage) pages.get(i);
    BufferedImage firstPageImage =
            page.convertToImage(BufferedImage.TYPE_USHORT_565_RGB, 200);
    String qrText = readQRWithQRCodeMultiReader(firstPageImage, hintMap);
    if (qrText != null && i != 0) {
        PDDocument outputDocument = new PDDocument();
        for (int j = previusQR; j < i; j++) {
            outputDocument.importPage((PDPage) pages.get(j));
        }
        File f = new File("./splitting_files/" + previusQR + ".pdf");
        outputDocument.save(f);
        outputDocument.close();
        previusQR = i;
    }
}
documentoPdf.close();
I also tried the following code for storing the new file:
PDDocument outputDocument = new PDDocument();
for (int j = previusQR; j < i; j++) {
    PDStream src = ((PDPage) pages.get(j)).getContents();
    PDStream streamD = new PDStream(outputDocument);
    streamD.addCompression();
    PDPage newPage = new PDPage(new COSDictionary(((PDPage) pages.get(j)).getCOSDictionary()));
    newPage.setContents(streamD);
    byte[] buf = new byte[10240];
    int amountRead = 0;
    InputStream is = src.createInputStream();
    OutputStream os = streamD.createOutputStream();
    while ((amountRead = is.read(buf, 0, 10240)) > -1) {
        os.write(buf, 0, amountRead);
    }
    outputDocument.addPage(newPage);
}
File f = new File("./splitting_files/" + previusQR + ".pdf");
outputDocument.save(f);
outputDocument.close();
But this code creates files which lack some content and are also the same size as the original.
How can I create smaller PDF files from a larger one?
Is it possible with PDFBox? Is there any other library with which I can turn a single page into an image (for QR recognition) and which also allows me to split a big PDF file into smaller ones?
Thx!
Thx! Tilman, you are right: the PDFSplit command generates smaller files. I checked the PDFSplit code and found that it removes the page links to avoid carrying unneeded resources.
Code extracted from Splitter.class:
private void processAnnotations(PDPage imported) throws IOException
{
    List<PDAnnotation> annotations = imported.getAnnotations();
    for (PDAnnotation annotation : annotations)
    {
        if (annotation instanceof PDAnnotationLink)
        {
            PDAnnotationLink link = (PDAnnotationLink) annotation;
            PDDestination destination = link.getDestination();
            if (destination == null && link.getAction() != null)
            {
                PDAction action = link.getAction();
                if (action instanceof PDActionGoTo)
                {
                    destination = ((PDActionGoTo) action).getDestination();
                }
            }
            if (destination instanceof PDPageDestination)
            {
                // TODO preserve links to pages within the splitted result
                ((PDPageDestination) destination).setPage(null);
            }
        }
        else
        {
            // TODO preserve links to pages within the splitted result
            annotation.setPage(null);
        }
    }
}
So eventually my code looks like this:
PDDocument documentoPdf =
        PDDocument.loadNonSeq(new File("docs_compuestos/50.pdf"), new RandomAccessFile(new File("./tmp/t"), "rw"));
int numPages = documentoPdf.getNumberOfPages();
List pages = documentoPdf.getDocumentCatalog().getAllPages();
int previusQR = 0;
for (int i = 0; i < numPages; i++) {
    PDPage firstPage = (PDPage) pages.get(i);
    String qrText = "";
    BufferedImage firstPageImage = firstPage.convertToImage(BufferedImage.TYPE_USHORT_565_RGB, 200);
    firstPage = null;
    try {
        qrText = readQRWithQRCodeMultiReader(firstPageImage, hintMap);
    } catch (NotFoundException e) {
        e.printStackTrace();
    } finally {
        firstPageImage = null;
    }
    if (i != 0 && qrText != null) {
        PDDocument outputDocument = new PDDocument();
        outputDocument.setDocumentInformation(documentoPdf.getDocumentInformation());
        outputDocument.getDocumentCatalog().setViewerPreferences(
                documentoPdf.getDocumentCatalog().getViewerPreferences());
        for (int j = previusQR; j < i; j++) {
            PDPage importedPage = outputDocument.importPage((PDPage) pages.get(j));
            importedPage.setCropBox(((PDPage) pages.get(j)).findCropBox());
            importedPage.setMediaBox(((PDPage) pages.get(j)).findMediaBox());
            // only the resources of the page will be copied
            importedPage.setResources(((PDPage) pages.get(j)).getResources());
            importedPage.setRotation(((PDPage) pages.get(j)).findRotation());
            processAnnotations(importedPage);
        }
        File f = new File("./splitting_files/" + previusQR + ".pdf");
        previusQR = i;
        outputDocument.save(f);
        outputDocument.close();
    }
}
Thank you very much!!

Getting TF-IDF values from index

The code below is for getting tf-idf values from an index, but I get an error while running it, on the line with Correct_ME.
I am using Lucene 4.8.
DocIndexing.java
public class DocIndexing {
    private DocIndexing() {}

    /** Index all text files under a directory.
     * @param args
     * @throws java.io.IOException */
    public static void main(String[] args) throws IOException {
        String usage = "java org.apache.lucene.demo.IndexFiles"
                + " [-index INDEX_PATH] [-docs DOCS_PATH] [-update]\n\n"
                + "This indexes the documents in DOCS_PATH, creating a Lucene index "
                + "in INDEX_PATH that can be searched with Searching";
        String indexPath = "C:/Users/dell/Documents/NetBeansProjects/IndexingSearching/Index";
        String docsPath = "C:/Users/dell/Documents/NetBeansProjects/IndexingSearching/ToBeIndexed";
        boolean create = true;
        for (int i = 0; i < args.length; i++) {
            if (null != args[i]) switch (args[i]) {
                case "-index":
                    indexPath = args[i + 1];
                    i++;
                    break;
                case "-docs":
                    docsPath = args[i + 1];
                    i++;
                    break;
                case "-update":
                    create = false;
                    break;
            }
        }
        if (docsPath == null) {
            System.err.println("Usage: " + usage);
            System.exit(1);
        }
        final File docDir = new File(docsPath);
        if (!docDir.exists() || !docDir.canRead()) {
            System.out.println("Document directory '" + docDir.getAbsolutePath()
                    + "' does not exist or is not readable, please check the path");
            System.exit(1);
        }
        Date start = new Date();
        try {
            System.out.println("Indexing to directory '" + indexPath + "'...");
            Directory dir = FSDirectory.open(new File(indexPath));
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_48);
            //Filter filter = new PorterStemFilter();
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, analyzer);
            if (create) {
                iwc.setOpenMode(OpenMode.CREATE);
            } else {
                // Add new documents to an existing index:
                iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
            }
            try (IndexWriter writer = new IndexWriter(dir, iwc)) {
                indexDocs(writer, docDir);
            }
            Date end = new Date();
            System.out.println(end.getTime() - start.getTime() + " total milliseconds");
        } catch (IOException e) {
            System.out.println(" caught a " + e.getClass()
                    + "\n with message: " + e.getMessage());
        }
        Tf_Idf tfidf = new Tf_Idf();
        String field = null, term = null;
        tfidf.scoreCalculator(field, term); // error occurs here: field and term are still null (see answer below)
    }
    /**
     * @param writer Writer to the index where the given file/dir info will be stored
     * @param file The file to index, or the directory to recurse into to find files to index
     * @throws IOException If there is a low-level I/O error
     */
    static void indexDocs(IndexWriter writer, File file) throws IOException {
        // do not try to index files that cannot be read
        if (file.canRead()) {
            if (file.isDirectory()) {
                String[] files = file.list();
                // an IO error could occur
                if (files != null) {
                    for (int i = 0; i < files.length; i++) {
                        indexDocs(writer, new File(file, files[i]));
                    }
                }
            } else {
                FileInputStream fis;
                try {
                    fis = new FileInputStream(file);
                } catch (FileNotFoundException fnfe) {
                    return;
                }
                try {
                    // make a new, empty document
                    Document doc = new Document();
                    Field pathField = new StringField("path", file.getPath(), Field.Store.YES);
                    doc.add(pathField);
                    Field modifiedField = new LongField("modified", file.lastModified(), Field.Store.NO);
                    doc.add(modifiedField);
                    Field titleField = new TextField("title", file.getName(), Field.Store.YES);
                    doc.add(titleField);
                    Field contentsField = new TextField("contents", new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8)));
                    doc.add(contentsField);
                    if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
                        // New index, so we just add the document (no old document can be there):
                        System.out.println("adding " + file);
                        writer.addDocument(doc);
                    } else {
                        // Existing index (an old copy of this document may have been indexed) so
                        // we use updateDocument instead to replace the old one matching the exact
                        // path, if present:
                        System.out.println("updating " + file);
                        writer.updateDocument(new Term("path", file.getPath()), doc);
                    }
                } finally {
                    fis.close();
                }
            }
        }
    }
}
Tf-idf.java
public class Tf_Idf {
    static float tf = 1;
    static float idf = 0;
    private float tfidf_score;
    static float[] tfidf = null;
    IndexReader indexReader;

    public Tf_Idf() throws IOException {
        this.indexReader = DirectoryReader.open(FSDirectory.open(new File("C:/Users/dell/Documents/NetBeansProjects/IndexingSearching/Index")));
    }

    public void scoreCalculator(String field, String term) throws IOException {
        TFIDFSimilarity tfidfSIM = new DefaultSimilarity();
        Bits liveDocs = MultiFields.getLiveDocs(indexReader);
        TermsEnum termEnum = MultiFields.getTerms(indexReader, field).iterator(null);
        BytesRef bytesRef = null;
        while ((bytesRef = termEnum.next()) != null) {
            if (bytesRef.utf8ToString().trim().equals(term.trim())) {
                if (termEnum.seekExact(bytesRef)) {
                    idf = tfidfSIM.idf(termEnum.docFreq(), indexReader.numDocs());
                    DocsEnum docsEnum = termEnum.docs(liveDocs, null);
                    if (docsEnum != null) {
                        int doc = 0;
                        while ((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                            tf = tfidfSIM.tf(docsEnum.freq());
                            tfidf_score = tf * idf;
                            System.out.println(" -tfidf_score-" + tfidf_score);
                        }
                    }
                }
            }
        }
    }
}
It's obvious that you pass a null IndexReader to the MultiFields methods:
IndexReader reader = null;
tfidf.scoreCalculator(reader, field, term);
You need to write something like this:
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(PATH_TO_LUCENE_INDEX)));
tfidf.scoreCalculator(reader, field, term);
You need to replace PATH_TO_LUCENE_INDEX with a real path, of course.
Another problem that I see: you open an IndexReader in Tf_Idf, but don't use it anywhere. Maybe it's a good idea to remove it, or to use it inside the scoreCalculator method, e.g.
tfidf.scoreCalculator(field, term);
but inside the method use the field of the class, this.indexReader, instead of the indexReader you try to pass into scoreCalculator.
UPD
public Tf_Idf() throws IOException {
    this.reader = DirectoryReader.open(FSDirectory.open(new File("Index")));
}
In this code, you need to replace "Index" with the real path to your Lucene index, e.g. /home/user/index or C://index or wherever you have it.
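Putting it together, a hypothetical call with a non-null field and term (both were null in the question's main method, which is what triggers the error):
Tf_Idf tfidf = new Tf_Idf();
// "contents" is the field indexed by DocIndexing; "lucene" is just an example term
tfidf.scoreCalculator("contents", "lucene");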
