Unable to parse text and images from a PDF file - java

I have gone through the Tika documentation and found a solution to extract text, but it does not return the images.
.java file:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class Imageextractor3 {
public static void main(String[] args)
throws IOException, TikaException, SAXException {
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
TesseractOCRConfig config = new TesseractOCRConfig();
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
//need to add this to make sure recursive parsing happens!
parseContext.set(Parser.class, parser);
File file=new File("C://Users//Vaibhav Shukla//Desktop//8577.00.pdf");
System.out.println(file);
FileInputStream stream = new FileInputStream(new File("C://Users//Vaibhav Shukla//Desktop//pdfs//hh.pdf"));
Metadata metadata = new Metadata();
parser.parse(stream, handler, metadata, parseContext);
System.out.println(metadata);
String content = handler.toString();
FileOutputStream fos=new FileOutputStream("C://Users//Vaibhav Shukla//Desktop//pdfs//hd.doc");
fos.write(content.getBytes());
System.out.println("===============");
System.out.println(content);
System.out.println("Done");
}
}
I need suggestions on how to add functionality that can detect and extract the images in the PDF file.

A quick solution to extract the images that are embedded in the PDF, using PDFBox directly:
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public void extract(File file) throws IOException {
    // PDDocument.load is a static factory method
    PDDocument doc = PDDocument.load(file);
    Iterator<PDPage> itr = doc.getDocumentCatalog().getPages().iterator();
    while (itr.hasNext()) {
        PDResources res = itr.next().getResources();
        for (COSName cosName : res.getXObjectNames()) {
            String imageName = cosName.getName();
            System.out.println(imageName);
            PDXObject xobj = res.getXObject(cosName);
            // skip form XObjects; only image XObjects can be rendered to a BufferedImage
            if (xobj instanceof PDImageXObject) {
                BufferedImage bi = ((PDImageXObject) xobj).getImage();
                File ff = new File("C://Users//workspace//Desktop//pdfs//" + imageName + ".png");
                ImageIO.write(bi, "png", ff);
            }
        }
    }
    doc.close();
}
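Since the question uses Tika, another option is to hook into Tika's embedded-document mechanism: with setExtractInlineImages(true), the PDF parser hands each inline image to the EmbeddedDocumentExtractor registered in the ParseContext. A minimal sketch follows; the class name and output directory are illustrative, not from the original post.
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class TikaInlineImageSaver {
    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(Parser.class, parser); // recursive parsing
        // Every embedded resource (inline images included) is passed to parseEmbedded,
        // where we simply copy the bytes to disk.
        context.set(EmbeddedDocumentExtractor.class, new EmbeddedDocumentExtractor() {
            private int count = 0;

            @Override
            public boolean shouldParseEmbedded(Metadata metadata) {
                return true;
            }

            @Override
            public void parseEmbedded(InputStream stream, ContentHandler handler,
                                      Metadata metadata, boolean outputHtml) throws IOException {
                String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
                if (name == null) {
                    name = "image-" + (count++); // fallback when no resource name is present
                }
                Path target = Paths.get("C:/extracted-images", name); // illustrative output dir
                Files.createDirectories(target.getParent());
                Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
            }
        });

        try (InputStream in = Files.newInputStream(Paths.get("C:/Users/Vaibhav Shukla/Desktop/pdfs/hh.pdf"))) {
            parser.parse(in, new BodyContentHandler(-1), new Metadata(), context);
        }
    }
}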

Related

Lucene special character search

I am trying to move from database search to Lucene search. I have a few text files containing data; the sample data in one of the text files is:
N=Ethernet, L=null, IM=XX123, SN=286-054-754, HBF=null, BON=null,
VSR=null, DUID=null, MID=2, IP=10.21.122.136, MAC=60:C7:98:17:57:80,
SYNC=false, GN=null, CustParam3=null, CustParam2=null, VV=1.06.0007,
CustParam5=null, CustParam4=null, CustParam7=null, CustParam6=null,
BUNAME=null, PN=M132-409-01-R, CustParam8=null, CS=2015-09-30
19:49:25.0, CST=Inactive, BL=3.2, EE=Off, TID=190, PRL=VEM, PAV=null,
FAV=null, MON=2016-04-06 11:13:40.507, DON=null, LPDR=2015-09-30
19:50:23.85, SSID=null, PIP=null, DID=null, MDATE=null,
OV=rel-20120625-SC-3.1.2-B, CID=null, ICBI=false, TID=null,
LCR=2015-10-01 01:50:30.297, SS=No-Recent-Communication, CBU=null,
GMVR=, LID=store, FF=167340, HFP=RATNERCO >> blore, ISA=false,
TF=null, FAM=null, LDPDR=2015-09-30 19:50:39.113, STVER=True,
SID=null, LHB=2015-09-30 21:50:30.297, IDSS=false, FR=81796,
LMOS=2015-09-30 19:49:50.503, LCUS=null, MNAME=XX 123, BBUID=null,
CON=null, DBUN=null, ISDRA=false, POSV=null, UUID=2, TRAM=null,
SPOL=000000000, CustomField1=null, CustomField2=null,
CustomField3=null, MUID=2DE02CF3-0663-420A-8918-7A550E29F570,
CustomField4=null, CustomField5=null, HNAME=blore, customparam1=null,
HID=1048, LBDT=2015-07-06 12:03:45.0, DIC=null, AT=None, LID=null,
IDSA=false, LMPS=2015-09-30 15:49:50.457, MBUN=System, CNC=Ethernet,
LOC=null
I am creating the index and searching using StandardAnalyzer, but when I search for the string UUID=1 the results also include the file that does NOT contain UUID=1 (I have two files in total, and the contents of both files are returned). As the data has special characters, I also tried WhitespaceAnalyzer, but then it did not return any data. I created a custom analyzer with a whitespace tokenizer plus lowercase and standard token filters, but it did not help. I also extended StopwordAnalyzerBase to create my own analyzer and used NormalizeCharMap to replace the special characters; that helped, but then I could not do wildcard searches.
Could someone please help me out with this? I am very new to Lucene.
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
public class IndexCreator
{
public void createIndex(String inputFiles, String indexPath)
{
//Input Path Variable
final Path docDir = Paths.get(inputFiles);
try
{
//org.apache.lucene.store.Directory instance
Directory dir = FSDirectory.open( Paths.get(indexPath) );
//analyzer with the default stop words
//Analyzer analyzer = new NewStandardAnalyzer();
//Analyzer analyzer = buildAnalyzer();
//Analyzer analyzer = new WhitespaceAnalyzer();
Analyzer analyzer = new StandardAnalyzer();
//IndexWriter Configuration
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
//IndexWriter writes new index files to the directory
IndexWriter writer = new IndexWriter(dir, iwc);
//Its recursive method to iterate all files and directories
indexDocs(writer, docDir);
writer.commit();
}
catch (IOException e)
{
e.printStackTrace();
}
}
private void indexDocs(final IndexWriter writer, Path path) throws
IOException
{
//Directory?
if (Files.isDirectory(path))
{
//Iterate directory
Files.walkFileTree(path, new SimpleFileVisitor<Path>()
{
@Override
public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException
{
try
{
//Index this file
indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
}
catch (IOException ioe)
{
ioe.printStackTrace();
}
return FileVisitResult.CONTINUE;
}
});
}
else
{
//Index this file
indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis());
}
}
private void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException
{
try (InputStream stream = Files.newInputStream(file))
{
//Create lucene Document
Document doc = new Document();
String content = new String(Files.readAllBytes(file));
//content = content.replace("-", "\\-");
//content = content.replace(":", "\\:");
//content = content.replace("=", "\\=");
//content = content.replace(".", "\\.");
doc.add(new StringField("path", file.toString(), Field.Store.YES));
doc.add(new LongPoint("modified", lastModified));
doc.add(new TextField("contents", content, Store.YES));
//Updates a document by first deleting the document(s)
//containing <code>term</code> and then adding the new
//document. The delete and then add are atomic as seen
//by a reader on the same index
writer.updateDocument(new Term("path", file.toString()), doc);
}
}
public static Analyzer buildAnalyzer() throws IOException {
return CustomAnalyzer.builder()
.withTokenizer("whitespace")
.addTokenFilter("lowercase")
.addTokenFilter("standard")
.build();
}
public static void main(String[] args) {
IndexCreator indexCreator = new IndexCreator();
indexCreator.createIndex(
"C:\\Lucene\\LuceneLatest\\LuceneLatestModified\\Data",
"C:\\Lucene\\LuceneLatest\\LuceneLatestModified\\Index");
System.out.println("Done");
}
}
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
public class Searcher
{
//directory contains the lucene indexes
private static final String INDEX_DIR =
"C:\\Lucene\\LuceneLatest\\LuceneLatestModified\\Index";
public static void main(String[] args) throws Exception
{
//Create lucene searcher. It search over a single IndexReader.
Searcher searcher = new Searcher();
//Search indexed contents using search term
/*searcher.searchInContent("NETWORKCONFIGURATION=Ethernet AND MACADDRESS=60\\:C7\\:98\\:17\\:57\\:80", searcher.createSearcher());
searcher.searchInContent("NETWORKCONFIGURATION=Ethern*", searcher.createSearcher());*/
searcher.searchInContent("UUID=1", searcher.createSearcher());
}
private void searchInContent(String textToFind, IndexSearcher searcher) throws Exception
{
//Create search query
//QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
//textToFind = QueryParser.escape(textToFind).toLowerCase();
Query query = qp.parse(textToFind);
//search the index
TopDocs hits = searcher.search(query, 10);
System.out.println("Total Results :: " + hits.totalHits);
for (ScoreDoc sd : hits.scoreDocs)
{
Document d = searcher.doc(sd.doc);
System.out.println("Path : "+ d.get("path") + ", Score : " + sd.score + ", Content : "+d.get("contents"));
}
}
private IndexSearcher createSearcher() throws IOException
{
Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
//It is an interface for accessing a point-in-time view of a lucene index
IndexReader reader = DirectoryReader.open(dir);
//Index searcher
IndexSearcher searcher = new IndexSearcher(reader);
return searcher;
}
public static Analyzer buildAnalyzer() throws IOException {
return CustomAnalyzer.builder()
.withTokenizer("whitespace")
.addTokenFilter("lowercase")
.addTokenFilter("standard")
.build();}}
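As a side note on why the results above look the way they do: printing the tokens each analyzer actually emits usually makes the problem obvious. A minimal debugging sketch, assuming Lucene's TokenStream attribute API (class name is illustrative):
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenDebugger {
    static void printTokens(Analyzer analyzer, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream("contents", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }

    public static void main(String[] args) throws IOException {
        // StandardAnalyzer splits on '=', so "UUID=2" is indexed as the terms "uuid" and "2";
        // the query UUID=1 then becomes (uuid OR 1) and matches every document containing "uuid".
        printTokens(new StandardAnalyzer(), "UUID=2, MID=2, IP=10.21.122.136");
        // WhitespaceAnalyzer keeps the trailing comma ("UUID=2,"), which is one reason an exact
        // whitespace-tokenized search for a key=value pair can return nothing.
        printTokens(new WhitespaceAnalyzer(), "UUID=2, MID=2, IP=10.21.122.136");
    }
}
If the key=value pairs need exact matching, indexing each pair as its own field or StringField (or stripping the trailing commas before a whitespace-based analyzer) tends to work better than escaping the query string.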

How to reconstruct an org.archive.io.warc.WARCRecordInfo from an org.archive.io.ArchiveRecord?

Using Java, I need to read a WARC archive file, filter it depending on the content of the HTML page, and write a new archive file.
The following code reads the archive. How can I reconstruct an org.archive.io.warc.WARCRecordInfo from an org.archive.io.ArchiveRecord?
import org.apache.commons.io.IOUtils;
import org.archive.io.ArchiveRecord;
import org.archive.io.warc.*;
import org.archive.wayback.resourcestore.resourcefile.WarcResource;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicInteger;
public class Test126b {
public static void main(String[] args) throws Exception {
File out = new java.io.File("out.warc.gz");
OutputStream bos = new BufferedOutputStream(new FileOutputStream(out));
WARCWriterPoolSettings settings = ...
WARCWriter writer = new WARCWriter(new AtomicInteger(), bos, out, settings);
File in = new java.io.File("in.warc.gz");
WARCReader reader = WARCReaderFactory.get(in);
Iterator<ArchiveRecord> it = reader.iterator();
while (it.hasNext()) {
ArchiveRecord archiveRecord = it.next();
if ("response".equals(archiveRecord.getHeader().getHeaderValue("WARC-Type"))) {
WARCRecord warcRecord = (WARCRecord) archiveRecord;
WarcResource warcResource = new WarcResource(warcRecord, reader);
warcResource.parseHeaders();
String url = warcResource.getWarcHeaders().getUrl();
System.out.println("+++ url: " + url);
byte[] content = IOUtils.toByteArray(warcResource);
String htmlPage = new String(content);
if (htmlPage.contains("hello world")) {
writer.writeRecord(warcRecordInfo) // how to reconstruct the WARCRecordInfo
}
}
}
reader.close();
writer.close();
}
}
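One way to approach the reconstruction, offered only as a heavily hedged sketch: copy the fields you need from the old record's header into a new WARCRecordInfo and point its content stream at the bytes you already buffered. The setter names, the location of the WARCRecordType enum, and the helper class below are assumptions to verify against the webarchive-commons version on your classpath.
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.util.UUID;
import org.archive.format.warc.WARCConstants.WARCRecordType;
import org.archive.io.ArchiveRecordHeader;
import org.archive.io.warc.WARCRecordInfo;

public class WarcRecordRebuilder {
    // Hypothetical helper: rebuild a response record from an existing header plus
    // the already-buffered body bytes (the `content` array in the question's loop).
    static WARCRecordInfo toRecordInfo(ArchiveRecordHeader header, byte[] body) {
        WARCRecordInfo info = new WARCRecordInfo();
        info.setType(WARCRecordType.response);          // assumed enum constant
        info.setUrl(header.getUrl());
        info.setCreate14DigitDate(header.getDate());
        info.setMimetype(header.getMimetype());
        info.setRecordId(URI.create("urn:uuid:" + UUID.randomUUID())); // fresh record id
        info.setContentStream(new ByteArrayInputStream(body));
        info.setContentLength(body.length);
        return info;
    }
}
In the question's loop that would read roughly: writer.writeRecord(WarcRecordRebuilder.toRecordInfo(archiveRecord.getHeader(), content));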

How to parse input type value of html and convert it into pdf?

I am unable to parse the HTML input element's value and convert it into a PDF file. I am using PdfWriter to generate the PDF, together with xmlworker-5.5.4.jar and itext.jar. The input element's value from the HTML is not parsed and does not end up in the PDF file. This problem occurs whether I use HTMLWorker or XMLWorker.
Code:
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Element;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.html.simpleparser.HTMLWorker;
import com.itextpdf.text.pdf.PdfWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.StringReader;
public class ParseHtml {
public static final String DEST = "D:/html_1.pdf";
public static void main(String[] args) throws IOException, DocumentException {
File file = new File(DEST);
file.getParentFile().mkdirs();
ParseHtml p=new ParseHtml();
p.createPdf(DEST);
}
@SuppressWarnings("deprecation")
public void createPdf(String file) throws IOException, DocumentException {
StringBuilder sb = new StringBuilder();
sb.append("<input type=\'text\' value=\"43645643634\"/>");
System.out.println("String------"+sb);
FileOutputStream outStream = new FileOutputStream(file);
Document document = new Document(PageSize.A4.rotate());
PdfWriter pdfWriter = PdfWriter.getInstance(document,outStream);
document.open();
document.newPage();
HTMLWorker htmlWorker = new HTMLWorker(document);
htmlWorker.parse(new StringReader(sb.toString()));
document.close();
outStream.close();
System.out.println("Document is created...");
}
}
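As far as I know, neither HTMLWorker nor XMLWorker renders form controls, so the input element's value attribute is silently dropped. One workaround, sketched below under the assumption that jsoup is added to the project (it is not part of the original setup), is to read the value attribute yourself and add it to the document as plain text:
import java.io.FileOutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class InputValueToPdf {
    public static void main(String[] args) throws Exception {
        String html = "<input type='text' value=\"43645643634\"/>";
        Document document = new Document(PageSize.A4.rotate());
        PdfWriter.getInstance(document, new FileOutputStream("D:/html_input.pdf")); // illustrative path
        document.open();
        // Pull the value attribute out of each <input> and add it as ordinary text,
        // since the HTML-to-PDF workers ignore form fields.
        for (Element input : Jsoup.parse(html).select("input")) {
            document.add(new Paragraph(input.attr("value")));
        }
        document.close();
    }
}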

JAVA Read sample doc file, fill with data and generate PDF

I am trying to make an automation program in Java.
I have a sample doc file. I need to fill in the blank parts, or the parts marked with <>, with data from a database,
and then create PDF files.
I tried to read the Word document:
import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class ReadDocFile {
public static void main(String[] args) {
File file = null;
WordExtractor extractor = null ;
try {
file = new File("c:\\New.doc");
FileInputStream fis=new FileInputStream(file.getAbsolutePath());
HWPFDocument document=new HWPFDocument(fis);
extractor = new WordExtractor(document);
String [] fileData = extractor.getParagraphText();
for(int i=0;i<fileData.length;i++){
if(fileData[i] != null)
System.out.println(fileData[i]);
}
}
catch(Exception exep){}
}
}
But this attempt is bad in many ways, because I only need to write some of the parts, and this method just dumps the whole doc as a single block of text.
So can you advise me of an API that can write into a Word doc, e.g. after "Name :" or in the 5th row, write a given value?
And when it is finished with the Word document, it should generate a PDF and then do it again for the next record...
I am looking for a solution like XSSFWorkbook, which I have used before, but with some extra functionality (generating a PDF of the doc).
Or, alternatively, read a sample PDF, fill it with data, and save it to a new PDF.
Thanks
Use iText (http://sourceforge.net/projects/itext/)
and the Apache POI project (http://poi.apache.org/index.html).
Sample code:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
public class DocToPdf { // wrapper class added so the snippet compiles
public static void main(String[] args) {
String pdfPath = "C:/";
String pdfDocPath = null;
try {
InputStream is = new BufferedInputStream(new FileInputStream("C:/Test.doc"));
WordExtractor wd = new WordExtractor(is);
String text = wd.getText();
/* FOR DOCX
// IMPORT
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
// CODE
XWPFDocument hdoc = new XWPFDocument(is);
extractor = new XWPFWordExtractor(hdoc);
String text = extractor.getText();
*/
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream(pdfPath + "viewDoc.pdf"));
document.open();
document.add(new Paragraph(text));
document.close();
pdfDocPath = pdfPath + "viewDoc.pdf";
System.out.println("Pdf document path is" + pdfDocPath);
}
catch (FileNotFoundException e1) {
System.out.println("File does not exist.");
}
catch (IOException ioe) {
System.out.println("IO Exception");
}
catch (DocumentException e) {
e.printStackTrace();
}
}
}
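If the placeholders in the .doc are literal markers such as <NAME> (the marker names below are hypothetical), one low-tech variant of the answer above is to replace them in the extracted text before writing the PDF. Note that this only produces a plain-text PDF; the original Word formatting is lost.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.HashMap;
import java.util.Map;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import com.itextpdf.text.Document;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;

public class TemplateFiller {
    public static void fill(String docPath, String pdfPath, Map<String, String> values) throws Exception {
        // Extract the template text, then substitute each placeholder with its database value.
        HWPFDocument doc = new HWPFDocument(new FileInputStream(docPath));
        String text = new WordExtractor(doc).getText();
        for (Map.Entry<String, String> e : values.entrySet()) {
            text = text.replace(e.getKey(), e.getValue());
        }
        // Write the filled-in text to a new PDF.
        Document pdf = new Document();
        PdfWriter.getInstance(pdf, new FileOutputStream(pdfPath));
        pdf.open();
        pdf.add(new Paragraph(text));
        pdf.close();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> values = new HashMap<>();
        values.put("<NAME>", "John Doe"); // hypothetical placeholder and value
        fill("C:/Test.doc", "C:/filled.pdf", values);
    }
}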

Tika in Action book examples Lucene StandardAnalyzer does not work

First of all, I am a total noob when it comes to Tika and Lucene. I am working through the Tika in Action book, trying out the examples. In chapter 5 this example is given:
package tikatest01;
import java.io.File;
import org.apache.tika.Tika;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
public class LuceneIndexer {
private final Tika tika;
private final IndexWriter writer;
public LuceneIndexer(Tika tika, IndexWriter writer) {
this.tika = tika;
this.writer = writer;
}
public void indexDocument(File file) throws Exception {
Document document = new Document();
document.add(new Field(
"filename", file.getName(),
Store.YES, Index.ANALYZED));
document.add(new Field(
"fulltext", tika.parseToString(file),
Store.NO, Index.ANALYZED));
writer.addDocument(document);
}
}
And this main method:
package tikatest01;
import java.io.File;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.tika.Tika;
public class TikaTest01 {
public static void main(String[] args) throws Exception {
String filename = "C:\\testdoc.pdf";
File file = new File(filename);
IndexWriter writer = new IndexWriter(
new SimpleFSDirectory(file),
new StandardAnalyzer(Version.LUCENE_30),
MaxFieldLength.UNLIMITED);
try {
LuceneIndexer indexer = new LuceneIndexer(new Tika(), writer);
indexer.indexDocument(file);
}
finally {
writer.close();
}
}
}
I've added the libraries tika-app-1.5.jar, lucene-core-4.7.0.jar and lucene-analyzers-common-4.7.0.jar to the project.
Questions:
With the current version of Lucene, Field.Index is deprecated; what should I use instead?
MaxFieldLength is not found. Am I missing an import?
For Lucene 4.7, use this code for the indexer:
package tikatest01;
import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.tika.Tika;
public class LuceneIndexer {
private final Tika tika;
private final IndexWriter writer;
public LuceneIndexer(Tika tika, IndexWriter writer) {
this.tika = tika;
this.writer = writer;
}
public void indexDocument(File file) throws Exception {
Document document = new Document();
document.add(new TextField(
"filename", file.getName(), Store.YES));
document.add(new TextField(
"fulltext", tika.parseToString(file), Store.NO));
writer.addDocument(document);
}
}
And this code for the main class:
package tikatest01;
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.Tika;
public class TikaTest01 {
public static void main(String[] args) throws Exception {
String dirname = "C:\\MyTestDir\\";
File dir = new File(dirname);
IndexWriter writer = new IndexWriter(
new SimpleFSDirectory(dir),
new IndexWriterConfig(
Version.LUCENE_47,
new StandardAnalyzer(Version.LUCENE_47)));
try {
LuceneIndexer indexer = new LuceneIndexer(new Tika(), writer);
indexer.indexDocument(dir);
}
finally {
writer.close();
}
}
}
For Lucene 4.7 there isn't this kind of constructor for IndexWriter.
Take a look at the API: http://lucene.apache.org/core/4_7_0/core/org/apache/lucene/index/IndexWriter.html
It shows only a constructor with 2 parameters, so you need to adapt this example to the new Lucene API.
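For reference, a minimal sketch of that two-argument construction in Lucene 4.7 (the class name and directory path are illustrative; MaxFieldLength is gone, and a token-count limit would now be done with something like LimitTokenCountAnalyzer instead):
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class WriterFactory47 {
    // Lucene 4.7 exposes IndexWriter(Directory, IndexWriterConfig);
    // the analyzer is supplied through the IndexWriterConfig.
    public static IndexWriter open(File indexDir) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
        return new IndexWriter(FSDirectory.open(indexDir), config);
    }
}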
