I'm trying to extract only the text from a PDF file using Tika, following the tutorial on their website, but I'm just getting back 25k lines of text that look like this:
%PDF-1.5
%µµµµ
1 0 obj
<>>>
endobj
2 0 obj
<>
endobj
3 0 obj
<>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 612 792] /Contents 4 0 R/Group<>/Tabs/S/StructParents 0>>
endobj
4 0 obj
<>
stream
xœ•Wßoâ8?~Gâ˜Çät?Û±óCª"QÚ®î´+?t¼öÁ‚#£Mi?ÙëíóÙ??mºœˆ=žùfæóÌ„Òô/º¹™~™ÿqGrúÙm7?UÛɧ۸,éönN·‹ñhú H)!
-Öã‘"ÉŠt¡D–Q&?‘æ´x??$mðõi<ú?-≉\l£}<ÑÑ÷ø?-þ??î?
ˆ?F¢´0ê?ãkDƒºFä?ºjHWK‘Øa\?é1?Ld†t.lnHŠ‚¿wÕx´þm<¢û/s¢3ŽÔ?GGÏ2?ÒRZ¤Bë?+}÷oõµ°¹?Ù{ ½AL®EL?‘˜k?ͯ?3-¤6”Z+ŠãýL’HÄiXÐßq?½Ä&ªø¹Œ'6ª!^ÇJ‡•—¡hÚXÉ zæÝvà–•É„ê;ü0?;\àú??ïò1š+#àßH©?¤ÊÒÒòR&R?³r’ÜHeg¥Ü±H†#©ýÚ ·?V0†ffË”?ê??àÀ¨ÌY4Ï?dvWNpka€Ó?§ ¥?þ?±R?b/ùîYi?±Z/.Ur?ß™YÂH>eD?îX÷”Bboùã½K™?ø=Y#c¾??u8>¡#Dï?¢ìÈ:û8øš?–?ç™dç‰??±d%ó–Ð?=e¿¦§?É;%h“Bäi¯??çcW®º#S?ÝGn4÷?ú¨Þr#m¸÷¨Åö5ιµ¸Ûè¥q±2ÑOH«Ýž0®?:rO¯Ü¸UÓ?šÑíƒ!?+Š³`ý»¶Ž•
Û-oiýÌ^väh_o7ŒÐT8÷~’Î
I also get the same sort of thing when trying it with the .docx format, but it works fine with .txt. Does anyone know what I am doing wrong?
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("fake.pdf"));
ParseContext pcontext = new ParseContext();
// Text document parser
TXTParser txtParser = new TXTParser();
txtParser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document: " + handler.toString());
The problem seems to be that you are trying to use a TXTParser to parse a PDF document. PDF (Portable Document Format) is a binary format, so the plain-text parser simply echoes the file's raw bytes.
Fortunately, Apache Tika comes with a wrapper that will automatically detect the format of the file you are trying to parse.
Try this example from the Tika documentation:
public String parseExample() throws IOException, SAXException, TikaException {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test.pdf")) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}
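Adapted to the code from the question, that looks roughly like this; a minimal sketch that keeps the original file name and the disabled write limit:
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
ParseContext pcontext = new ParseContext();
// AutoDetectParser sniffs the stream and delegates to the right parser
// (the PDF parser in this case) instead of treating the bytes as plain text.
AutoDetectParser parser = new AutoDetectParser();
try (InputStream inputstream = new FileInputStream(new File("fake.pdf"))) {
    parser.parse(inputstream, handler, metadata, pcontext);
}
System.out.println("Contents of the document: " + handler.toString());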
Just realised I was using TXTParser instead of AutoDetectParser. Can someone close or delete this question, please?
Related
I'm using Crawler4j to extract pages and PDF files. I already checked that the byte array I get is valid and can be written out to a PDF file.
With this byte array, I do the following:
//Tika specific types
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
InputStream inputstream;
ParseContext pcontext = new ParseContext();
PDFParser pdfparser = new PDFParser();
...
byte[] contentData = null;
contentData = page.getContentData(); //Crawler4j content, delivers valid PDF
//Path path = Paths.get("C:\\Test\\local.pdf"); //use this line to read from a local pdf
//Default fields:
String title = "pdf title";
String content = "";
String suggestions = "";
//
try {
    //contentData = Files.readAllBytes(path); // use this line to read from a local pdf
    inputstream = new ByteArrayInputStream(contentData);
    pdfparser.parse(inputstream, handler, metadata, pcontext); // THIS LINE CRASHES
    content = "pdf suggestions";
    suggestions = handler.toString();
} catch (Exception e) {
    LOGGER.warn("Error parsing with Tika.", e);
}
I marked the crashing line. The resulting Exception is the following:
WARN 2017-07-26 11:17:51,302 [Thread-5] de.searchadapter.crawler.solrparser.parser.file.PDFFileParser - Error parsing with Tika.
org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
at org.apache.tika.metadata.Metadata.add(Metadata.java:305)
at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:209)
at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:150)
at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:239)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:154)
at de.searchadapter.crawler.solrparser.parser.file.PDFFileParser.parse(PDFFileParser.java:82)
at de.searchadapter.crawler.solrparser.SolrParser.parse(SolrParser.java:36)
at de.searchadapter.crawler.SolrJAdapter.indexDocs(SolrJAdapter.java:58)
at de.searchadapter.crawler.WebCrawler.onBeforeExit(WebCrawler.java:63)
at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:309)
at java.lang.Thread.run(Thread.java:745)
The code above is from the PDFFileParser. I'm not setting any property, so I'm puzzled where this error comes from.
Additional info: The PDF file seems to use an unknown font; the following warning comes up:
11:17:50.963 [Thread-5] WARN o.a.pdfbox.pdmodel.font.PDSimpleFont - No Unicode mapping for f_i (30) in font GGOLOE+TheSansC5-Plain
EDIT: I edited the code so that it can read local PDF files. I tried another PDF file and didn't get the error. It seems to result from the failing font.
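To isolate cases like this, a minimal standalone reproduction that feeds the saved PDF straight to Tika, outside the crawler, can help; the class name and local path below are illustrative:
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaRepro {
    public static void main(String[] args) throws Exception {
        // Parse the locally saved PDF directly, with no Crawler4j involved.
        try (InputStream in = Files.newInputStream(Paths.get("C:\\Test\\local.pdf"))) {
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            new AutoDetectParser().parse(in, handler, metadata);
            System.out.println(handler.toString().length() + " characters extracted");
        }
    }
}
If the standalone run fails with the same PropertyTypeException, the file itself (or the Tika version) is the culprit rather than the crawl pipeline.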
I have thousands of PDF documents that are 11–15 MB each. My program says that my document contains more than 100k characters.
Error output:
Exception in thread "main" org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException:
Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit.
How can I increase the limit to 10–15 MB? I found a possible solution in the new Tika facade class, but I could not find a way to integrate it with my code:
Tika tika = new Tika();
tika.setMaxStringLength(10*1024*1024);
Here is my code:
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
String location = "C:\\Users\\Laptop\\Dropbox\\MainTextbookTrappe2ndEd.pdf";
FileInputStream inputstream = new FileInputStream(location);
ParseContext pcontext = new ParseContext();
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata, pcontext);
Output:
System.out.println("Content of the PDF :" + pcontext);
Use
BodyContentHandler handler = new BodyContentHandler(-1);
to disable the limit.
From the Javadoc:
The internal string buffer is bounded at the given number of characters. If this write limit is reached, then a SAXException is thrown.
Parameters: writeLimit - maximum number of characters to include in the string, or -1 to disable the write limit
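Alternatively, the Tika facade snippet from the question can replace the manual parser setup entirely; a sketch using the same file path, where setMaxStringLength raises the cap (or disables it with -1):
Tika tika = new Tika();
tika.setMaxStringLength(10 * 1024 * 1024);
// The facade auto-detects the format and returns the extracted text directly.
try (InputStream in = new FileInputStream("C:\\Users\\Laptop\\Dropbox\\MainTextbookTrappe2ndEd.pdf")) {
    System.out.println("Content of the PDF: " + tika.parseToString(in));
}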
I am using the Apache Tika parser to parse a file and Elasticsearch to index it.
Let's suppose I have a .doc file that needs to be parsed. Here is the code example:
public String parseToPlainText() throws IOException, SAXException, TikaException {
    BodyContentHandler handler = new BodyContentHandler();
    InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc");
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    } finally {
        stream.close();
    }
}
As you can see, the test.doc file is read in one stretch, and if the file is too large this may cause an OutOfMemoryError. I would like to read the file in small chunks of input streams so that parser.parse(stream, handler, metadata); could accept those stream chunks. Another issue is that the file can be of any type. So how can I split a file into chunks of streams, and how can the parser accept them?
Importantly, each file should still be indexed as a single document at the end, even if it is split into chunks during parsing.
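One note on the premise: parser.parse() already consumes the input stream incrementally; what holds the whole document in memory is the BodyContentHandler's internal string buffer, not the parsing itself. A sketch that avoids the buffer by handing the handler a Writer, so the extracted text is spooled to disk as it is produced (the output file name is illustrative):
public void parseToFile() throws IOException, SAXException, TikaException {
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc");
         Writer out = new BufferedWriter(new FileWriter("extracted.txt"))) {
        // BodyContentHandler(Writer) forwards characters as they arrive,
        // so the full text is never held in memory at once.
        parser.parse(stream, new BodyContentHandler(out), metadata);
    }
}
The file still ends up as one extracted document, so it can be indexed as a single unit afterwards.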
I extracted meta information through the Apache Tika library, made some changes, and now I want to write the changed information back out to the file.
The code snippet for the extraction is here:
InputStream input = new FileInputStream(new File(...));
ContentHandler handler = new DefaultHandler();
Metadata metadata = new Metadata();
Parser parser = new Mp3Parser();
ParseContext parseCtx = new ParseContext();
parser.parse(input, handler, metadata, parseCtx);
input.close();
String[] tags = metadata.names();
I then make some changes, like:
for (String tagName : tags) {
    metadata.remove(tagName);
}
And finally I want to write out the modified version. How can I do this?
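Tika's parsers only read metadata; there is no Tika API for writing the changed values back into the MP3. A separate tagging library is needed for that step. As one option, a sketch using the mp3agic library; the Mp3File/ID3v2 calls below are mp3agic's API, not Tika's, so treat them as an assumption to verify against the version you use:
// Sketch using mp3agic (a separate tag-writing library, not part of Tika).
Mp3File mp3 = new Mp3File("song.mp3");
if (mp3.hasId3v2Tag()) {
    ID3v2 tag = mp3.getId3v2Tag();
    tag.setTitle("New title"); // apply the changed values here
    mp3.save("song-modified.mp3"); // mp3agic writes the result to a new file
}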
I am trying to parse a plain text file using Tika but getting inconsistent behavior.
More specifically, I have defined a simple handler as follows:
public class MyHandler extends DefaultHandler {
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        System.out.println(new String(ch));
    }
}
Then, I parse the file ("myfile.txt") as follows:
Tika tika = new Tika();
InputStream is = new FileInputStream("myfile.txt");
Metadata metadata = new Metadata();
ContentHandler handler = new MyHandler();
Parser parser = new TXTParser();
ParseContext context = new ParseContext();
String mimeType = tika.detect(is);
metadata.set(HttpHeaders.CONTENT_TYPE, mimeType);
parser.parse(is, handler, metadata, context);
I would expect all the text in the file to be printed out on screen, but a small part at the end is not. More specifically, the characters() callback keeps reading 4,096 characters per callback, but in the end it apparently leaves out the last 5,083 characters of this particular file (which is a few MB long), so it even goes beyond missing the last callback.
Also, testing on another, small file, which is about 5,000 characters long, no callback seems to take place at all!
The MIME type is correctly detected as text/plain in both cases.
Any ideas?
Thanks!
What version of Tika are you using? Looking at the source code, it reads chunks of 4,096 characters, which can be seen on line 129 of TXTParser; at line 132 the characters(...) routine is invoked.
In short, the relevant code is:
char[] buffer = new char[4096];
int n = reader.read(buffer);
while (n != -1) {
    xhtml.characters(buffer, 0, n);
    n = reader.read(buffer);
}
where reader is a BufferedReader. I cannot see any flaw in this code, hence I'm thinking you might be working with an older version?
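Separately, one thing worth checking in the handler itself: new String(ch) converts the whole 4,096-char buffer and ignores the start and length arguments. Per the SAX contract, only that slice of the array is valid for the event, so the printed output can include stale buffer contents and misrepresent short final chunks. A corrected override:
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
    // Only ch[start .. start + length - 1] is valid for this event.
    System.out.print(new String(ch, start, length));
}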