I'm using Crawler4j to extract pages and PDF files. I already checked that the byte array I get is valid and can be written out to a PDF file.
With this byte array, I do the following:
//Tika specific types
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
InputStream inputstream;
ParseContext pcontext = new ParseContext();
PDFParser pdfparser = new PDFParser();
...
byte[] contentData = null;
contentData = page.getContentData(); // Crawler4j content, delivers a valid PDF
//Path path = Paths.get("C:\\Test\\local.pdf"); // use this line to read from a local PDF

// Default fields:
String title = "pdf title";
String content = "";
String suggestions = "";

try {
    //contentData = Files.readAllBytes(path); // use this line to read from a local PDF
    inputstream = new ByteArrayInputStream(contentData);
    pdfparser.parse(inputstream, handler, metadata, pcontext); // THIS LINE CRASHES
    content = "pdf suggestions";
    suggestions = handler.toString();
} catch (Exception e) {
    LOGGER.warn("Error parsing with Tika.", e);
}
I marked the crashing line. The resulting Exception is the following:
WARN 2017-07-26 11:17:51,302 [Thread-5] de.searchadapter.crawler.solrparser.parser.file.PDFFileParser - Error parsing with Tika.
org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
at org.apache.tika.metadata.Metadata.add(Metadata.java:305)
at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:209)
at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:150)
at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:239)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:154)
at de.searchadapter.crawler.solrparser.parser.file.PDFFileParser.parse(PDFFileParser.java:82)
at de.searchadapter.crawler.solrparser.SolrParser.parse(SolrParser.java:36)
at de.searchadapter.crawler.SolrJAdapter.indexDocs(SolrJAdapter.java:58)
at de.searchadapter.crawler.WebCrawler.onBeforeExit(WebCrawler.java:63)
at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:309)
at java.lang.Thread.run(Thread.java:745)
The code above is from the PDFFileParser. I'm not setting any property, so I'm puzzled where this error comes from.
Additional info: the PDF file seems to use an unknown font; the following warning comes up:
11:17:50.963 [Thread-5] WARN o.a.pdfbox.pdmodel.font.PDSimpleFont - No Unicode mapping for f_i (30) in font GGOLOE+TheSansC5-Plain
EDIT: I edited the code so that it can read local PDF files. I tried another PDF file and didn't get the error, so it seems to result from the failing font.
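One way to confirm that the failure comes from Tika's XMP metadata handling rather than from the text content is to extract the text with PDFBox directly, bypassing Tika's metadata step. A minimal sketch, assuming PDFBox 2.x on the classpath (Tika's PDF parser depends on it anyway); the helper name is hypothetical:

import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper: extracts text from the crawled bytes without
// touching the XMP metadata that triggers the PropertyTypeException.
static String extractTextDirectly(byte[] contentData) throws IOException {
    try (PDDocument doc = PDDocument.load(contentData)) {
        return new PDFTextStripper().getText(doc);
    }
}

If this succeeds on the same bytes, the parse failure is purely metadata-related: PropertyTypeException is what Metadata.add throws when a single-valued (SIMPLE) property such as xmpMM:DocumentID receives a second value, which can happen when the PDF carries duplicate XMP identifiers or when one Metadata instance is reused across documents. Creating a fresh Metadata and BodyContentHandler per parse is a cheap guard either way.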
Related
I am trying to get a file's MIME type by using its file extension. I created a list of accepted MIME types using the https://www.iana.org/assignments/media-types/media-types.xml page.
For example, I need to get the "audio/opus" value from a file like file.opus. However, I get "application/ogg" when I use the following approach:
public static void validateFileExtension(InputStream inputStream, String fileId) {
    try {
        BufferedInputStream bis = new BufferedInputStream(inputStream);
        AutoDetectParser parser = new AutoDetectParser();
        Detector detector = parser.getDetector();
        Metadata md = new Metadata();
        md.add(Metadata.RESOURCE_NAME_KEY, fileId);
        MediaType mediaType = detector.detect(bis, md);
        String fileTypeFromFile = mediaType.toString();
        // code omitted for brevity
    } catch (IOException e) {
        throw new UploadFailedException(fileId);
    }
}
I also searched for this issue and found some workarounds, but they don't completely fix the problem. So, how can I get the correct MIME type from a file using its extension, or some other way, in Java?
Another thing I am not sure about: is it safe to add application/octet-stream to the allowed file types/extensions? I ask because I also got the "application/octet-stream" value when using other approaches. Any ideas?
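As a side note, if you specifically want extension-based detection, the Tika facade can detect from the name alone, without reading any bytes; when a stream is present, the Ogg container's magic bytes typically win unless the name-based type is registered as a specialization of it, which is why you see application/ogg. A minimal sketch, assuming the bundled MIME database maps *.opus to the type you expect:

import org.apache.tika.Tika;

public class ExtensionDetect {
    public static void main(String[] args) {
        Tika tika = new Tika();
        // Name-only detection: no bytes are read, so the container's
        // magic bytes never override the extension mapping.
        System.out.println(tika.detect("file.opus")); // expected audio/opus (assumption)
    }
}

As for application/octet-stream: it is Tika's fallback for "unknown binary", so whitelisting it effectively disables the validation; most code treats it as a detection failure rather than an accepted type.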
I'm trying to extract only the text from a PDF file using Tika by following the tutorial on their website, but I'm just getting 25k lines of text back that look like this.
%PDF-1.5
%µµµµ
1 0 obj
<>>>
endobj
2 0 obj
<>
endobj
3 0 obj
<>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 612 792] /Contents 4 0 R/Group<>/Tabs/S/StructParents 0>>
endobj
4 0 obj
<>
stream
xœ•Wßoâ8?~Gâ˜Çät?Û±óCª"QÚ®î´+?t¼öÁ‚#£Mi?ÙëíóÙ??mºœˆ=žùfæóÌ„Òô/º¹™~™ÿqGrúÙm7?UÛɧ۸,éönN·‹ñhú H)!
-Öã‘"ÉŠt¡D–Q&?‘æ´x??$mðõi<ú?-≉\l£}<ÑÑ÷ø?-þ??î?
ˆ?F¢´0ê?ãkDƒºFä?ºjHWK‘Øa\?é1?Ld†t.lnHŠ‚¿wÕx´þm<¢û/s¢3ŽÔ?GGÏ2?ÒRZ¤Bë?+}÷oõµ°¹?Ù{ ½AL®EL?‘˜k?ͯ?3-¤6”Z+ŠãýL’HÄiXÐßq?½Ä&ªø¹Œ'6ª!^ÇJ‡•—¡hÚXÉ zæÝvà–•É„ê;ü0?;\àú??ïò1š+#àßH©?¤ÊÒÒòR&R?³r’ÜHeg¥Ü±H†#©ýÚ ·?V0†ffË”?ê??àÀ¨ÌY4Ï?dvWNpka€Ó?§ ¥?þ?±R?b/ùîYi?±Z/.Ur?ß™YÂH>eD?îX÷”Bboùã½K™?ø=Y#c¾??u8>¡#Dï?¢ìÈ:û8øš?–?ç™dç‰??±d%ó–Ð?=e¿¦§?É;%h“Bäi¯??çcW®º#S?ÝGn4÷?ú¨Þr#m¸÷¨Åö5ιµ¸Ûè¥q±2ÑOH«Ýž0®?:rO¯Ü¸UÓ?šÑíƒ!?+Š³`ý»¶Ž•
Û-oiýÌ^väh_o7ŒÐT8÷~’Î
I also get the same sort of thing when trying it with the .docx format, but it works fine with .txt. Does anyone know what I am doing wrong?
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("fake.pdf"));
ParseContext pcontext = new ParseContext();

//Text document parser
TXTParser TexTParser = new TXTParser();
TexTParser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
The problem seems to be that you are trying to use a TXTParser to parse a PDF document. PDF stands for Portable Document Format, which includes binary data in the file.
Fortunately, Apache Tika comes with a wrapper that will automatically detect the format of the file you are trying to parse.
Try this example from the Tika Documentation:
public String parseExample() throws IOException, SAXException, TikaException {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test.pdf")) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}
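If you want to verify which parser the wrapper actually dispatched to, the detected type ends up in the metadata after parsing. A small sketch along the lines of the example above (the class name is a placeholder):

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class DetectCheck {
    public static String parseAndReport(InputStream stream)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        parser.parse(stream, handler, metadata);
        // The detected type shows which parser handled the file,
        // e.g. "application/pdf" rather than "text/plain".
        System.out.println(metadata.get(Metadata.CONTENT_TYPE));
        return handler.toString();
    }
}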
Just realised I was using TXTParser instead of AutoDetectParser. Can someone close or delete this question please?
I have thousands of PDF documents that are 11-15 MB. My program says that my document contains more than 100k characters.
Error output:
Exception in thread "main"
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException:
Your document contained more than 100000 characters, and so your
requested limit has been reached. To receive the full text of the
document, increase your limit.
How can I increase the limit to 10-15 MB?
I found a solution, the new Tika facade class, but I could not find a way to integrate it with my code:
Tika tika = new Tika();
tika.setMaxStringLength(10*1024*1024);
Here is my code:
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
String location = "C:\\Users\\Laptop\\Dropbox\\MainTextbookTrappe2ndEd.pdf";
FileInputStream inputstream = new FileInputStream(location);
ParseContext pcontext = new ParseContext();
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata, pcontext);
Output:
System.out.println("Content of the PDF: " + handler.toString());
Use
BodyContentHandler handler = new BodyContentHandler(-1);
to disable the limit.
From the Javadoc:
The internal string buffer is bounded at the given number of characters. If this write limit is reached, then a SAXException is thrown.
Parameters: writeLimit - maximum number of characters to include in the string, or -1 to disable the write limit
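If you would rather cap the buffer than disable it, the same constructor takes a concrete character count, and the Tika facade from the question wires the limit with far less plumbing. A sketch of both, assuming the path from the question; note the limit counts characters of extracted text, not the 11-15 MB file size:

import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.sax.BodyContentHandler;

public class WriteLimitExample {
    public static void main(String[] args) throws IOException, TikaException {
        // Option 1: keep the handler-based code, just raise the limit
        // (in characters) instead of the default 100k.
        BodyContentHandler handler = new BodyContentHandler(15 * 1024 * 1024);

        // Option 2: the facade; parseToString honours setMaxStringLength
        // and returns the extracted text directly.
        Tika tika = new Tika();
        tika.setMaxStringLength(15 * 1024 * 1024);
        String content = tika.parseToString(
                new File("C:\\Users\\Laptop\\Dropbox\\MainTextbookTrappe2ndEd.pdf"));
        System.out.println("Content of the PDF: " + content);
    }
}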
I extracted meta information through the Apache Tika library, made some changes, and now I want to write the changed information out to the file.
The code snippet for extracting is here:
InputStream input = new FileInputStream(new File(...));
ContentHandler handler = new DefaultHandler();
Metadata metadata = new Metadata();
Parser parser = new Mp3Parser();
ParseContext parseCtx = new ParseContext();
parser.parse(input, handler, metadata, parseCtx);
input.close();
String[] tags = metadata.names();
I make some further changes, like:
for (String tagName : tags) {
    metadata.remove(tagName);
}
And finally I want to write the modified version. How can I do this?
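Tika itself cannot do it: the Metadata object is a snapshot produced by the parser, not a live handle on the file, so metadata.remove() only changes Tika's in-memory view. Writing ID3 tags back needs a tag-writing library. A minimal sketch with the third-party mp3agic library (an assumption on my part, any ID3 writer would do; file names are placeholders):

import com.mpatric.mp3agic.ID3v2;
import com.mpatric.mp3agic.Mp3File;

public class TagWriter {
    public static void main(String[] args) throws Exception {
        Mp3File mp3 = new Mp3File("song.mp3");   // placeholder path
        if (mp3.hasId3v2Tag()) {
            ID3v2 tag = mp3.getId3v2Tag();
            tag.setTitle("New title");           // apply your modified values
            tag.setArtist("New artist");
        }
        mp3.save("song-modified.mp3");           // mp3agic writes a new file
    }
}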
I have a set of PDF files that contain Central European characters such as č, Ď, Š, and so on. I want to convert them to text, and I have tried pdftotext and PDFBox through Apache Tika, but some characters are always converted incorrectly.
The strange thing is that the same character in the same text is converted correctly in some places and incorrectly in others! An example is this PDF.
In the case of pdftotext I am using these options:
pdftotext -nopgbrk -eol dos -enc UTF-8 070612.pdf
My Tika code looks like this:
String newname = f.getCanonicalPath().replace(".pdf", ".txt");
OutputStreamWriter print = new OutputStreamWriter(new FileOutputStream(newname), Charset.forName("UTF-16"));
String fileString = "path\\to\\myfiles\\"; // backslashes must be escaped in the literal
InputStream is = null;
try {
    is = new FileInputStream(f);
    ContentHandler contenthandler = new BodyContentHandler(10 * 1024 * 1024);
    Metadata metadata = new Metadata();
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(is, contenthandler, metadata, new ParseContext());
    String outputString = contenthandler.toString();
    outputString = outputString.replace("\n", "\r\n");
    System.err.println("Writing now file " + newname);
    print.write(outputString);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (is != null) is.close();
    print.close();
}
Edit: I forgot to mention that I am facing the same issue when converting to text with Acrobat Reader XI as well.
Well, aside from anything else, this code will use the platform default encoding:
PrintWriter print = new PrintWriter(newname);
print.print(outputString);
print.close();
I suggest you use an OutputStreamWriter wrapping a FileOutputStream instead, and specify UTF-8 as the encoding (it can encode all of Unicode and is generally well supported).
You should also close the writer in a finally block, and I'd probably separate the "reading" part from the "writing" part. (I'd avoid catching Exception too, but going into the details of exception handling is a bit beyond the point of this answer.)
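A sketch of that suggestion, with the writer in try-with-resources (which replaces the finally block) and the charset pinned explicitly; the class and helper names are mine:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8WriterSketch {
    // Hypothetical helper: writes the extracted text with an explicit
    // encoding instead of the platform default.
    static void writeText(String newname, String outputString) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(newname), StandardCharsets.UTF_8)) {
            out.write(outputString);
        }
    }
}

One caveat: when the same character extracts correctly in some places and not in others within one file, the culprit is usually the PDF's font encoding (missing ToUnicode maps in embedded subset fonts), and no output-encoding setting can recover glyphs the extractor could not map.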