Unable to parse text and images from a PDF file - java

I have gone through the Tika documentation and found a solution to extract text, but it does not return the images.
.java file:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class Imageextractor3 {
public static void main(String[] args)
throws IOException, TikaException, SAXException {
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
TesseractOCRConfig config = new TesseractOCRConfig();
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
//need to add this to make sure recursive parsing happens!
parseContext.set(Parser.class, parser);
File file=new File("C://Users//Vaibhav Shukla//Desktop//8577.00.pdf");
System.out.println(file);
FileInputStream stream = new FileInputStream(new File("C://Users//Vaibhav Shukla//Desktop//pdfs//hh.pdf"));
Metadata metadata = new Metadata();
parser.parse(stream, handler, metadata, parseContext);
System.out.println(metadata);
String content = handler.toString();
FileOutputStream fos=new FileOutputStream("C://Users//Vaibhav Shukla//Desktop//pdfs//hd.doc");
fos.write(content.getBytes());
System.out.println("===============");
System.out.println(content);
System.out.println("Done");
}
}
I need suggestions on how to add functionality that can detect and extract the images in the PDF file.

A quick solution to extract the images that are embedded in the PDF, using PDFBox directly:
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public void extract(File file) throws IOException {
    // PDDocument.load is a static factory method
    PDDocument doc = PDDocument.load(file);
    Iterator<PDPage> itr = doc.getDocumentCatalog().getPages().iterator();
    while (itr.hasNext()) {
        PDResources res = itr.next().getResources();
        for (COSName cosName : res.getXObjectNames()) {
            String imageName = cosName.getName();
            System.out.println(imageName);
            PDXObject xobj = res.getXObject(cosName);
            // skip form XObjects; only image XObjects can be rendered to a BufferedImage
            if (xobj instanceof PDImageXObject) {
                BufferedImage bi = ((PDImageXObject) xobj).getImage();
                File ff = new File("C://Users//workspace//Desktop//pdfs//" + imageName + ".png");
                ImageIO.write(bi, "png", ff);
            }
        }
    }
    doc.close();
}
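Since the question uses Tika, another option is to hook into Tika's embedded-document mechanism: with setExtractInlineImages(true), the PDF parser hands each inline image to the EmbeddedDocumentExtractor registered in the ParseContext. A minimal sketch follows; the class name and output directory are illustrative, not from the original post.
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class TikaInlineImageSaver {
    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(Parser.class, parser); // recursive parsing
        // Every embedded resource (inline images included) is passed to parseEmbedded,
        // where we simply copy the bytes to disk.
        context.set(EmbeddedDocumentExtractor.class, new EmbeddedDocumentExtractor() {
            private int count = 0;

            @Override
            public boolean shouldParseEmbedded(Metadata metadata) {
                return true;
            }

            @Override
            public void parseEmbedded(InputStream stream, ContentHandler handler,
                                      Metadata metadata, boolean outputHtml) throws IOException {
                String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
                if (name == null) {
                    name = "image-" + (count++); // fallback when no resource name is present
                }
                Path target = Paths.get("C:/extracted-images", name); // illustrative output dir
                Files.createDirectories(target.getParent());
                Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
            }
        });

        try (InputStream in = Files.newInputStream(Paths.get("C:/Users/Vaibhav Shukla/Desktop/pdfs/hh.pdf"))) {
            parser.parse(in, new BodyContentHandler(-1), new Metadata(), context);
        }
    }
}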

Related

Lucene special character search

I am trying to move from database search to Lucene search. I have a few text files containing data; the sample data in one of the text files is:
N=Ethernet, L=null, IM=XX123, SN=286-054-754, HBF=null, BON=null,
VSR=null, DUID=null, MID=2, IP=10.21.122.136, MAC=60:C7:98:17:57:80,
SYNC=false, GN=null, CustParam3=null, CustParam2=null, VV=1.06.0007,
CustParam5=null, CustParam4=null, CustParam7=null, CustParam6=null,
BUNAME=null, PN=M132-409-01-R, CustParam8=null, CS=2015-09-30
19:49:25.0, CST=Inactive, BL=3.2, EE=Off, TID=190, PRL=VEM, PAV=null,
FAV=null, MON=2016-04-06 11:13:40.507, DON=null, LPDR=2015-09-30
19:50:23.85, SSID=null, PIP=null, DID=null, MDATE=null,
OV=rel-20120625-SC-3.1.2-B, CID=null, ICBI=false, TID=null,
LCR=2015-10-01 01:50:30.297, SS=No-Recent-Communication, CBU=null,
GMVR=, LID=store, FF=167340, HFP=RATNERCO >> blore, ISA=false,
TF=null, FAM=null, LDPDR=2015-09-30 19:50:39.113, STVER=True,
SID=null, LHB=2015-09-30 21:50:30.297, IDSS=false, FR=81796,
LMOS=2015-09-30 19:49:50.503, LCUS=null, MNAME=XX 123, BBUID=null,
CON=null, DBUN=null, ISDRA=false, POSV=null, UUID=2, TRAM=null,
SPOL=000000000, CustomField1=null, CustomField2=null,
CustomField3=null, MUID=2DE02CF3-0663-420A-8918-7A550E29F570,
CustomField4=null, CustomField5=null, HNAME=blore, customparam1=null,
HID=1048, LBDT=2015-07-06 12:03:45.0, DIC=null, AT=None, LID=null,
IDSA=false, LMPS=2015-09-30 15:49:50.457, MBUN=System, CNC=Ethernet,
LOC=null
I am creating the index and searching using StandardAnalyzer, but when I search for the string UUID=1 the results also include the file that does NOT contain UUID=1 (I have two files in total, and the contents of both files are returned). As the data has special characters, I also tried WhitespaceAnalyzer, but then it did not return any data. I created a custom analyzer with a whitespace tokenizer plus lowercase and standard token filters, but it did not help. I also extended StopwordAnalyzerBase to create my own analyzer and used NormalizeCharMap to replace the special characters; that helped, but then I could not do wildcard searches.
Could someone please help me out with this? I am very new to Lucene.
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
public class IndexCreator
{
public void createIndex(String inputFiles, String indexPath)
{
//Input Path Variable
final Path docDir = Paths.get(inputFiles);
try
{
//org.apache.lucene.store.Directory instance
Directory dir = FSDirectory.open( Paths.get(indexPath) );
//analyzer with the default stop words
//Analyzer analyzer = new NewStandardAnalyzer();
//Analyzer analyzer = buildAnalyzer();
//Analyzer analyzer = new WhitespaceAnalyzer();
Analyzer analyzer = new StandardAnalyzer();
//IndexWriter Configuration
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
//IndexWriter writes new index files to the directory
IndexWriter writer = new IndexWriter(dir, iwc);
//Its recursive method to iterate all files and directories
indexDocs(writer, docDir);
writer.commit();
}
catch (IOException e)
{
e.printStackTrace();
}
}
private void indexDocs(final IndexWriter writer, Path path) throws
IOException
{
//Directory?
if (Files.isDirectory(path))
{
//Iterate directory
Files.walkFileTree(path, new SimpleFileVisitor<Path>()
{
@Override
public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException
{
try
{
//Index this file
indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
}
catch (IOException ioe)
{
ioe.printStackTrace();
}
return FileVisitResult.CONTINUE;
}
});
}
else
{
//Index this file
indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis());
}
}
private void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException
{
try (InputStream stream = Files.newInputStream(file))
{
//Create lucene Document
Document doc = new Document();
String content = new String(Files.readAllBytes(file));
//content = content.replace("-", "\\-");
//content = content.replace(":", "\\:");
//content = content.replace("=", "\\=");
//content = content.replace(".", "\\.");
doc.add(new StringField("path", file.toString(), Field.Store.YES));
doc.add(new LongPoint("modified", lastModified));
doc.add(new TextField("contents", content, Store.YES));
//Updates a document by first deleting the document(s)
//containing <code>term</code> and then adding the new
//document. The delete and then add are atomic as seen
//by a reader on the same index
writer.updateDocument(new Term("path", file.toString()), doc);
}
}
public static Analyzer buildAnalyzer() throws IOException {
return CustomAnalyzer.builder()
.withTokenizer("whitespace")
.addTokenFilter("lowercase")
.addTokenFilter("standard")
.build();
}
public static void main(String[] args) {
IndexCreator indexCreator = new IndexCreator();
indexCreator.createIndex(
"C:\\Lucene\\LuceneLatest\\LuceneLatestModified\\Data",
"C:\\Lucene\\LuceneLatest\\LuceneLatestModified\\Index");
System.out.println("Done");
}
}
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
public class Searcher
{
//directory contains the lucene indexes
private static final String INDEX_DIR =
"C:\\Lucene\\LuceneLatest\\LuceneLatestModified\\Index";
public static void main(String[] args) throws Exception
{
//Create lucene searcher. It search over a single IndexReader.
Searcher searcher = new Searcher();
//Search indexed contents using search term
/*searcher.searchInContent("NETWORKCONFIGURATION=Ethernet AND MACADDRESS=60\\:C7\\:98\\:17\\:57\\:80", searcher.createSearcher());
searcher.searchInContent("NETWORKCONFIGURATION=Ethern*", searcher.createSearcher());*/
searcher.searchInContent("UUID=1", searcher.createSearcher());
}
private void searchInContent(String textToFind, IndexSearcher searcher) throws Exception
{
//Create search query
//QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
//textToFind = QueryParser.escape(textToFind).toLowerCase();
Query query = qp.parse(textToFind);
//search the index
TopDocs hits = searcher.search(query, 10);
System.out.println("Total Results :: " + hits.totalHits);
for (ScoreDoc sd : hits.scoreDocs)
{
Document d = searcher.doc(sd.doc);
System.out.println("Path : "+ d.get("path") + ", Score : " + sd.score + ", Content : "+d.get("contents"));
}
}
private IndexSearcher createSearcher() throws IOException
{
Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
//It is an interface for accessing a point-in-time view of a lucene index
IndexReader reader = DirectoryReader.open(dir);
//Index searcher
IndexSearcher searcher = new IndexSearcher(reader);
return searcher;
}
public static Analyzer buildAnalyzer() throws IOException {
return CustomAnalyzer.builder()
.withTokenizer("whitespace")
.addTokenFilter("lowercase")
.addTokenFilter("standard")
.build();}}
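As a side note on why the results above look the way they do: printing the tokens each analyzer actually emits usually makes the problem obvious. A minimal debugging sketch, assuming Lucene's TokenStream attribute API (class name is illustrative):
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenDebugger {
    static void printTokens(Analyzer analyzer, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream("contents", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }

    public static void main(String[] args) throws IOException {
        // StandardAnalyzer splits on '=', so "UUID=2" is indexed as the terms "uuid" and "2";
        // the query UUID=1 then becomes (uuid OR 1) and matches every document containing "uuid".
        printTokens(new StandardAnalyzer(), "UUID=2, MID=2, IP=10.21.122.136");
        // WhitespaceAnalyzer keeps the trailing comma ("UUID=2,"), which is one reason an exact
        // whitespace-tokenized search for a key=value pair can return nothing.
        printTokens(new WhitespaceAnalyzer(), "UUID=2, MID=2, IP=10.21.122.136");
    }
}
If the key=value pairs need exact matching, indexing each pair as its own field or StringField (or stripping the trailing commas before a whitespace-based analyzer) tends to work better than escaping the query string.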

How to reconstruct an org.archive.io.warc.WARCRecordInfo from an org.archive.io.ArchiveRecord?

Using Java, I need to read a WARC archive file, filter it depending on the content of the HTML page, and write a new archive file.
The following code reads the archive. How can I reconstruct an org.archive.io.warc.WARCRecordInfo from an org.archive.io.ArchiveRecord?
import org.apache.commons.io.IOUtils;
import org.archive.io.ArchiveRecord;
import org.archive.io.warc.*;
import org.archive.wayback.resourcestore.resourcefile.WarcResource;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicInteger;
public class Test126b {
public static void main(String[] args) throws Exception {
File out = new java.io.File("out.warc.gz");
OutputStream bos = new BufferedOutputStream(new FileOutputStream(out));
WARCWriterPoolSettings settings = ...
WARCWriter writer = new WARCWriter(new AtomicInteger(), bos, out, settings);
File in = new java.io.File("in.warc.gz");
WARCReader reader = WARCReaderFactory.get(in);
Iterator<ArchiveRecord> it = reader.iterator();
while (it.hasNext()) {
ArchiveRecord archiveRecord = it.next();
if ("response".equals(archiveRecord.getHeader().getHeaderValue("WARC-Type"))) {
WARCRecord warcRecord = (WARCRecord) archiveRecord;
WarcResource warcResource = new WarcResource(warcRecord, reader);
warcResource.parseHeaders();
String url = warcResource.getWarcHeaders().getUrl();
System.out.println("+++ url: " + url);
byte[] content = IOUtils.toByteArray(warcResource);
String htmlPage = new String(content);
if (htmlPage.contains("hello world")) {
writer.writeRecord(warcRecordInfo) // how to reconstruct the WARCRecordInfo
}
}
}
reader.close();
writer.close();
}
}
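One way to approach the reconstruction, offered only as a heavily hedged sketch: copy the fields you need from the old record's header into a new WARCRecordInfo and point its content stream at the bytes you already buffered. The setter names, the location of the WARCRecordType enum, and the helper class below are assumptions to verify against the webarchive-commons version on your classpath.
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.util.UUID;
import org.archive.format.warc.WARCConstants.WARCRecordType;
import org.archive.io.ArchiveRecordHeader;
import org.archive.io.warc.WARCRecordInfo;

public class WarcRecordRebuilder {
    // Hypothetical helper: rebuild a response record from an existing header plus
    // the already-buffered body bytes (the `content` array in the question's loop).
    static WARCRecordInfo toRecordInfo(ArchiveRecordHeader header, byte[] body) {
        WARCRecordInfo info = new WARCRecordInfo();
        info.setType(WARCRecordType.response);          // assumed enum constant
        info.setUrl(header.getUrl());
        info.setCreate14DigitDate(header.getDate());
        info.setMimetype(header.getMimetype());
        info.setRecordId(URI.create("urn:uuid:" + UUID.randomUUID())); // fresh record id
        info.setContentStream(new ByteArrayInputStream(body));
        info.setContentLength(body.length);
        return info;
    }
}
In the question's loop that would read roughly: writer.writeRecord(WarcRecordRebuilder.toRecordInfo(archiveRecord.getHeader(), content));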

How to parse input type value of html and convert it into pdf?

I am unable to parse the HTML input element's value and convert it into a PDF file. I am using PdfWriter to generate the PDF, together with xmlworker-5.5.4.jar and itext.jar. The input element's value from the HTML is not parsed and does not end up in the PDF file. This problem occurs whether I use HTMLWorker or XMLWorker.
Code:
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Element;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.html.simpleparser.HTMLWorker;
import com.itextpdf.text.pdf.PdfWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.StringReader;
public class ParseHtml {
public static final String DEST = "D:/html_1.pdf";
public static void main(String[] args) throws IOException, DocumentException {
File file = new File(DEST);
file.getParentFile().mkdirs();
ParseHtml p=new ParseHtml();
p.createPdf(DEST);
}
@SuppressWarnings("deprecation")
public void createPdf(String file) throws IOException, DocumentException {
StringBuilder sb = new StringBuilder();
sb.append("<input type=\'text\' value=\"43645643634\"/>");
System.out.println("String------"+sb);
FileOutputStream outStream = new FileOutputStream(file);
Document document = new Document(PageSize.A4.rotate());
PdfWriter pdfWriter = PdfWriter.getInstance(document,outStream);
document.open();
document.newPage();
HTMLWorker htmlWorker = new HTMLWorker(document);
htmlWorker.parse(new StringReader(sb.toString()));
document.close();
outStream.close();
System.out.println("Document is created...");
}
}
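As far as I know, neither HTMLWorker nor XMLWorker renders form controls, so the input element's value attribute is silently dropped. One workaround, sketched below under the assumption that jsoup is added to the project (it is not part of the original setup), is to read the value attribute yourself and add it to the document as plain text:
import java.io.FileOutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class InputValueToPdf {
    public static void main(String[] args) throws Exception {
        String html = "<input type='text' value=\"43645643634\"/>";
        Document document = new Document(PageSize.A4.rotate());
        PdfWriter.getInstance(document, new FileOutputStream("D:/html_input.pdf")); // illustrative path
        document.open();
        // Pull the value attribute out of each <input> and add it as ordinary text,
        // since the HTML-to-PDF workers ignore form fields.
        for (Element input : Jsoup.parse(html).select("input")) {
            document.add(new Paragraph(input.attr("value")));
        }
        document.close();
    }
}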

JAVA Read sample doc file, fill with data and generate PDF

I am trying to make an automation program in Java.
I have a sample doc file. I need to fill in the blank parts, or the parts marked with <>, with data from a database,
and then create PDF files.
I tried to read the Word document:
import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class ReadDocFile {
public static void main(String[] args) {
File file = null;
WordExtractor extractor = null ;
try {
file = new File("c:\\New.doc");
FileInputStream fis=new FileInputStream(file.getAbsolutePath());
HWPFDocument document=new HWPFDocument(fis);
extractor = new WordExtractor(document);
String [] fileData = extractor.getParagraphText();
for(int i=0;i<fileData.length;i++){
if(fileData[i] != null)
System.out.println(fileData[i]);
}
}
catch(Exception exep){}
}
}
But this attempt is bad in many ways, because I only need to write some of the parts, and this method just dumps the whole doc as a single block of text.
So can you advise me of an API that can write into a Word doc, e.g. after "Name :" or in the 5th row, write a given value?
And when it is finished with the Word document, it should generate a PDF and then do it again for the next record...
I am looking for a solution like XSSFWorkbook, which I have used before, but with some extra functionality (generating a PDF of the doc).
Or, alternatively, read a sample PDF, fill it with data, and save it to a new PDF.
Thanks
Use iText (http://sourceforge.net/projects/itext/)
and the Apache POI project (http://poi.apache.org/index.html).
Sample code:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
public class DocToPdf { // wrapper class added so the snippet compiles
public static void main(String[] args) {
String pdfPath = "C:/";
String pdfDocPath = null;
try {
InputStream is = new BufferedInputStream(new FileInputStream("C:/Test.doc"));
WordExtractor wd = new WordExtractor(is);
String text = wd.getText();
/* FOR DOCX
// IMPORT
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
// CODE
XWPFDocument hdoc = new XWPFDocument(is);
extractor = new XWPFWordExtractor(hdoc);
String text = extractor.getText();
*/
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream(pdfPath + "viewDoc.pdf"));
document.open();
document.add(new Paragraph(text));
document.close();
pdfDocPath = pdfPath + "viewDoc.pdf";
System.out.println("Pdf document path is" + pdfDocPath);
}
catch (FileNotFoundException e1) {
System.out.println("File does not exist.");
}
catch (IOException ioe) {
System.out.println("IO Exception");
}
catch (DocumentException e) {
e.printStackTrace();
}
}
}
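If the placeholders in the .doc are literal markers such as <NAME> (the marker names below are hypothetical), one low-tech variant of the answer above is to replace them in the extracted text before writing the PDF. Note that this only produces a plain-text PDF; the original Word formatting is lost.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.HashMap;
import java.util.Map;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import com.itextpdf.text.Document;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;

public class TemplateFiller {
    public static void fill(String docPath, String pdfPath, Map<String, String> values) throws Exception {
        // Extract the template text, then substitute each placeholder with its database value.
        HWPFDocument doc = new HWPFDocument(new FileInputStream(docPath));
        String text = new WordExtractor(doc).getText();
        for (Map.Entry<String, String> e : values.entrySet()) {
            text = text.replace(e.getKey(), e.getValue());
        }
        // Write the filled-in text to a new PDF.
        Document pdf = new Document();
        PdfWriter.getInstance(pdf, new FileOutputStream(pdfPath));
        pdf.open();
        pdf.add(new Paragraph(text));
        pdf.close();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> values = new HashMap<>();
        values.put("<NAME>", "John Doe"); // hypothetical placeholder and value
        fill("C:/Test.doc", "C:/filled.pdf", values);
    }
}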

Tika in Action book examples Lucene StandardAnalyzer does not work

First of all, I am a total noob when it comes to Tika and Lucene. I am working through the Tika in Action book, trying out the examples. In chapter 5 this example is given:
package tikatest01;
import java.io.File;
import org.apache.tika.Tika;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
public class LuceneIndexer {
private final Tika tika;
private final IndexWriter writer;
public LuceneIndexer(Tika tika, IndexWriter writer) {
this.tika = tika;
this.writer = writer;
}
public void indexDocument(File file) throws Exception {
Document document = new Document();
document.add(new Field(
"filename", file.getName(),
Store.YES, Index.ANALYZED));
document.add(new Field(
"fulltext", tika.parseToString(file),
Store.NO, Index.ANALYZED));
writer.addDocument(document);
}
}
And this main method:
package tikatest01;
import java.io.File;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.tika.Tika;
public class TikaTest01 {
public static void main(String[] args) throws Exception {
String filename = "C:\\testdoc.pdf";
File file = new File(filename);
IndexWriter writer = new IndexWriter(
new SimpleFSDirectory(file),
new StandardAnalyzer(Version.LUCENE_30),
MaxFieldLength.UNLIMITED);
try {
LuceneIndexer indexer = new LuceneIndexer(new Tika(), writer);
indexer.indexDocument(file);
}
finally {
writer.close();
}
}
}
I've added the libraries tika-app-1.5.jar, lucene-core-4.7.0.jar and lucene-analyzers-common-4.7.0.jar to the project.
Questions:
With the current version of Lucene, Field.Index is deprecated; what should I use instead?
MaxFieldLength is not found. Am I missing an import?
For Lucene 4.7, use this code for the indexer:
package tikatest01;
import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.tika.Tika;
public class LuceneIndexer {
private final Tika tika;
private final IndexWriter writer;
public LuceneIndexer(Tika tika, IndexWriter writer) {
this.tika = tika;
this.writer = writer;
}
public void indexDocument(File file) throws Exception {
Document document = new Document();
document.add(new TextField(
"filename", file.getName(), Store.YES));
document.add(new TextField(
"fulltext", tika.parseToString(file), Store.NO));
writer.addDocument(document);
}
}
And this code for the main class:
package tikatest01;
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.Tika;
public class TikaTest01 {
public static void main(String[] args) throws Exception {
String dirname = "C:\\MyTestDir\\";
File dir = new File(dirname);
IndexWriter writer = new IndexWriter(
new SimpleFSDirectory(dir),
new IndexWriterConfig(
Version.LUCENE_47,
new StandardAnalyzer(Version.LUCENE_47)));
try {
LuceneIndexer indexer = new LuceneIndexer(new Tika(), writer);
indexer.indexDocument(dir);
}
finally {
writer.close();
}
}
}
For Lucene 4.7 there isn't this kind of constructor for IndexWriter.
Take a look at the API: http://lucene.apache.org/core/4_7_0/core/org/apache/lucene/index/IndexWriter.html
It shows only a constructor with 2 parameters, so you need to adapt this example to the new Lucene API.
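For reference, a minimal sketch of that two-argument construction in Lucene 4.7 (the class name and directory path are illustrative; MaxFieldLength is gone, and a token-count limit would now be done with something like LimitTokenCountAnalyzer instead):
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class WriterFactory47 {
    // Lucene 4.7 exposes IndexWriter(Directory, IndexWriterConfig);
    // the analyzer is supplied through the IndexWriterConfig.
    public static IndexWriter open(File indexDir) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
        return new IndexWriter(FSDirectory.open(indexDir), config);
    }
}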
