I have a problem like the one mentioned above when extracting metadata from a TIFF file; it is over 450 MB in size. I was extracting with the http://commons.apache.org/sanselan/ library in its newest version (0.97). When I execute this code:
String xmpMeta = null;
try {
xmpMeta = Sanselan.getXmpXml(file);
} catch ...
I get the following stack trace:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.sanselan.common.byteSources.ByteSourceInputStream.readBlock(ByteSourceInputStream.java:65)
at org.apache.sanselan.common.byteSources.ByteSourceInputStream.access$000(ByteSourceInputStream.java:24)
at org.apache.sanselan.common.byteSources.ByteSourceInputStream$CacheBlock.getNext(ByteSourceInputStream.java:54)
at org.apache.sanselan.common.byteSources.ByteSourceInputStream$CacheReadingInputStream.read(ByteSourceInputStream.java:147)
...
I have to admit that I tried increasing the Xms and Xmx properties of my VM and it still failed, but in the end I am not interested in increasing these properties because I may get even larger pictures to parse. I would be grateful for help with this issue, or for a reference to another library that can parse XMP metadata from JPEG / TIFF files.
Well, you could run Java with more heap space by calling
java -Xmx512M FooProgramm
This will run Java with 512 MB of heap space. I know that this is not a good solution.
Maybe you could try something from these examples:
http://www.example-code.com/java/java-xmp-extract.asp
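If another library is an option, here is a minimal sketch using Apache Commons Imaging, the successor project to Sanselan; this assumes its Imaging.getXmpXml(File) entry point and does not guarantee lower memory use on a 450 MB TIFF, so treat it as something to verify:
import java.io.File;
import java.io.IOException;
import org.apache.commons.imaging.ImageReadException;
import org.apache.commons.imaging.Imaging;
public class XmpDump {
    public static void main(String[] args) throws ImageReadException, IOException {
        // Pass the File itself so the library can read from disk
        // rather than caching a whole InputStream.
        File file = new File(args[0]);
        String xmpXml = Imaging.getXmpXml(file);
        System.out.println(xmpXml);
    }
}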
Good morning,
We are executing the following code, and we hit an error message after loading a certain number of DLLs:
File file = new File("C:\\Users\\jevora\\Downloads\\dng_tests\\dllsCopies");
file.mkdirs();
for (int i = 1; i < 10000; i++) {
String filename = "heatedTankCvode" + i + ".dll";
Files.copy(new File("C:\\Users\\jevora\\Downloads\\dng_tests\\heatedTankCvode.dll").toPath(),
new File(file, filename).toPath(), StandardCopyOption.REPLACE_EXISTING);
NativeLibrary.getInstance(new File(file, filename).getAbsolutePath());
System.out.println("Loaded: " + filename);
}
As you can see, we want to load 10,000 DLLs using JNA. However, as the following log shows, the process stops when loading instance 1,051:
Loaded: heatedTankCvode1048.dll
Loaded: heatedTankCvode1049.dll
Loaded: heatedTankCvode1050.dll
Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'C:\Users\jevora\Downloads\dng_tests\dllsCopies\heatedTankCvode1051.dll': Native library (win32-x86-64/C:\Users\jevora\Downloads\dng_tests\dllsCopies\heatedTankCvode1051.dll)
About the code: first we copy the DLL to a new location with a different name, and then we try to load it. We wonder whether there is a limit on the number of DLLs that can be loaded. Is there a limit? Can we overcome it?
Thanks in advance
EDIT: I've tried several memory configurations and it always stops at instance 1,051.
I think that the cause might be explained by this old Microsoft Forum post:
DLL Limit?
It appears that each DLL that you are loading is consuming a TLS (thread local storage) slot. There is a per-process limit of 1088 on the number of TLS slots. From all that I have read, the limit is hard ... and there is no way to increase it.
From what I have read, a DLL doesn't have to use TLS, so you should investigate whether you can change the way your DLLs are built so that they don't do this.
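If you only need each copy briefly, a sketch worth trying (assuming you do not need all 10,000 libraries resident at once and that your JNA version exposes NativeLibrary.dispose()) is to unload each library before loading the next, so the DLL has a chance to release its TLS slot:
import java.io.File;
import com.sun.jna.NativeLibrary;
// dllDir is the directory from the question's code.
File dllDir = new File("C:\\Users\\jevora\\Downloads\\dng_tests\\dllsCopies");
for (int i = 1; i < 10000; i++) {
    File dll = new File(dllDir, "heatedTankCvode" + i + ".dll");
    NativeLibrary lib = NativeLibrary.getInstance(dll.getAbsolutePath());
    try {
        // ... call into the library here ...
    } finally {
        lib.dispose(); // unload before moving on to the next copy
    }
}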
There are many questions about this type of error, but I am not able to find a solution for mine. I checked on my different devices but still could not find exactly where this error occurs. My app is currently live and installed on 50k active devices; I got this error through Firebase and it occurs many times.
Exception java.lang.OutOfMemoryError: Failed to allocate a 65548 byte allocation with 55872 free bytes and 54KB until OOM
com.android.okhttp.okio.Segment.<init> (Segment.java:62)
com.android.okhttp.okio.SegmentPool.take (SegmentPool.java:46)
com.android.okhttp.okio.Buffer.writableSegment (Buffer.java:1114)
com.android.okhttp.okio.InflaterSource.read (InflaterSource.java:66)
com.android.okhttp.okio.GzipSource.read (GzipSource.java:80)
com.android.okhttp.okio.RealBufferedSource$1.read (RealBufferedSource.java:374)
bmr.a (:com.google.android.gms.DynamiteModulesC:95)
bmk.a (:com.google.android.gms.DynamiteModulesC:1055)
bmq.a (:com.google.android.gms.DynamiteModulesC:5055)
bmq.run (:com.google.android.gms.DynamiteModulesC:54)
java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1113)
java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:588)
java.lang.Thread.run (Thread.java:818)
It seems that you are trying to upload a file with OkHttp. If so, try using a file path instead of a File as the parameter.
Seems like you are trying to unzip a file so large that all of the device's memory is exhausted. Available memory differs from device to device.
If possible, split the file into smaller chunks and handle them individually. Otherwise, use a streaming solution where you unzip through a stream instead of loading the entire file into memory before starting to unzip.
Try this: unzip using a stream, or read the documentation here: GZIPInputStream
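A minimal sketch of streaming decompression with GZIPInputStream (the method and stream names here are placeholders, not part of your app):
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
// Decompress in small chunks instead of buffering the whole payload in memory.
static void unzipToStream(InputStream compressed, OutputStream out) throws IOException {
    try (GZIPInputStream gzip = new GZIPInputStream(compressed)) {
        byte[] chunk = new byte[8 * 1024];
        int read;
        while ((read = gzip.read(chunk)) != -1) {
            out.write(chunk, 0, read);
        }
    }
}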
Just remove HttpLoggingInterceptor.Level.BODY; at that level the logging interceptor buffers entire response bodies in memory.
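For example, a sketch of the client setup with the logging level turned down (assuming you are using okhttp3's HttpLoggingInterceptor; BASIC logs only request and response lines, not bodies):
import okhttp3.OkHttpClient;
import okhttp3.logging.HttpLoggingInterceptor;
HttpLoggingInterceptor logging = new HttpLoggingInterceptor();
// BASIC logs request/response lines only; BODY would buffer whole bodies for logging.
logging.setLevel(HttpLoggingInterceptor.Level.BASIC);
OkHttpClient client = new OkHttpClient.Builder()
        .addInterceptor(logging)
        .build();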
I'm working on an application that is supposed to read a large number of files (the test set is about 80,000 files). It then extracts the text from these files. The files can be anything from txt, pdf, docx, etc., and will be parsed using Apache Tika.
Once the text is extracted, it will be indexed in ElasticSearch to become searchable. Elastic has, thus far, not been a problem in this.
The server on which this application will run will have limited RAM (about 2 GB).
Current
The Tika implementation is as follows:
private static final int PARSE_STRING_LIMIT = 100000;
private static final AutoDetectParser PARSER_INSTANCE = new AutoDetectParser(PARSERS);
private static final Tika TIKA_INSTANCE = new Tika(PARSER_INSTANCE.getDetector(), PARSER_INSTANCE);

public String parseToString(InputStream inputStream) throws IOException, TikaException {
    try {
        return TIKA_INSTANCE.parseToString(inputStream, new Metadata(), PARSE_STRING_LIMIT);
    } finally {
        IOUtils.closeQuietly(inputStream); // should already be closed by parseToString
    }
}
For each file, a document object is created and given the appropriate values for the ElasticSearch mapping. Text extraction is done as follows:
String text = TIKA_INSTANCE.parseToString(new BufferedInputStream(new FileInputStream(file)));
attachmentDocumentNew.setText(text);
text = null;
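For completeness, the same extraction step written with try-with-resources (just a sketch using the names from above), so the stream is always closed even if parsing throws:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
    attachmentDocumentNew.setText(TIKA_INSTANCE.parseToString(in));
}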
One more caveat: this is a Spring Boot application which will eventually run as a server, so it can be called upon whenever indexing is necessary (and for some other things, such as statistics).
The jar is run with the following VM arguments:
java -Xms512m -Xmx1024m -XX:+UseG1GC -jar <jar>
The problem
Whenever I start indexing the files, I get an OutOfMemoryError. I tried profiling it using VisualVM, but the heap is mostly char[] and byte[], which doesn't tell me much. I am also not well versed in multithreading or profiling (I do neither at this point), since I only have two years of programming experience.
The question
How do I reduce the memory footprint of the application without crashing the indexing?
Perhaps a more general question if the above is too specific:
How do I reduce memory usage when reading a large number of files?
If you have experience building something like this, I'm also open to any suggestions :)
Edit
To clarify, I do not have to write much (any?) code for the Elasticsearch part of the application, since this is done using an existing library written by the people here.
I'm trying to convert the first page of a PDF file to an image using PDFBox.
When I load a large PDF file, I get an exception.
Code:
PDDocument doc;
try {
    InputStream input = new URL("http://www.jewishfederations.org/local_includes/downloads/39497.pdf").openStream();
    doc = PDDocument.load(input);
    PDPage firstPage = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
    BufferedImage image = firstPage.convertToImage();
    File outputfile = new File("image2.png");
    ImageIO.write(image, "png", outputfile);
    input.close();
    doc.close();
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
exception:
org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 72435 is wrong. Fall back to reading stream until 'endstream'.
org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 72435 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:554)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1219)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1186)
at Worker.main(Worker.java:27)
Caused by: java.io.IOException: Push back buffer is full
at java.io.PushbackInputStream.unread(Unknown Source)
at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:144)
at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:133)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:550)
... 5 more
An alternative solution for the 1.8.* PDFBox versions is to use the non-sequential parser. In that case, the code would not be
doc = PDDocument.load(input);
but
doc = PDDocument.loadNonSeq(input, null);
That parser (which will be the only one in the upcoming 2.0 version) does not depend on the size of the pushback buffer.
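Applied to the code from the question, a sketch for the 1.8.* versions could look like this (same URL and page-to-image steps, only the load call changes):
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.InputStream;
import java.net.URL;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
InputStream input = new URL("http://www.jewishfederations.org/local_includes/downloads/39497.pdf").openStream();
// The non-sequential parser does not rely on the pushback buffer.
PDDocument doc = PDDocument.loadNonSeq(input, null);
try {
    PDPage firstPage = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
    BufferedImage image = firstPage.convertToImage();
    ImageIO.write(image, "png", new File("image2.png"));
} finally {
    input.close();
    doc.close();
}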
First, find the current buffer size:
System.out.println(System.getProperty("org.apache.pdfbox.baseParser.pushBackSize"));
Now that you have a baseline, do exactly what it suggests. Increase the buffer size above what you just printed out using this:
System.setProperty("org.apache.pdfbox.baseParser.pushBackSize", "<buffer size>");
Keep increasing the buffer size until it works. Hopefully you won't run out of memory; if you do, increase the heap.
This is how you set system properties at runtime. You could also pass it as a command-line argument, but I find that setting it near the beginning of main does the trick and makes the project easier for future developers to maintain.
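As a sketch of that placement (the file name and the 1 MB value are only illustrative), the property just has to be set before PDFBox starts parsing:
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
public static void main(String[] args) throws IOException {
    // Set the property before any load/parse call; the parser reads it when it runs.
    System.setProperty("org.apache.pdfbox.baseParser.pushBackSize", Integer.toString(1024 * 1024));
    PDDocument doc = PDDocument.load(new File("large.pdf"));
    // ... render pages, extract text, etc. ...
    doc.close();
}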
For whatever reason, with large files you don't have a big enough buffer to load the page. Maybe the page is loaded into a buffer before or while it's rendered into an image. My guess is that the DPI in the PDF is very high and can't fit in the buffer.
I had a similar issue, which I thought was related to a large pdf file based on the error, however it turned out it was not. It turned out to be a corrupt pdf file.
For our use case, we had a pdf template file (which we populate its form values programmatically) as a resource in our project that is cooked into our war.
The exception I was seeing for reference: org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 480478 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize. We added the property and then ran things again and we got a different issue.
The next stack trace stated "Could not read embedded TTF for font TimesNewRoman,Bold". It took us a while, however after exploding the war and trying to open the pdf file in the war, we noticed that it was corrupt, but the pdf file that was in source was not corrupt and could be opened without issues.
The root cause of our issue was that we added "filtering" in our pom for our resource folder. We did this so that we could use some reflection to get some values in our health check page, but that corrupted the pdf file, which we figured out from the following reference: https://bitbucket.org/petermr/xhtml2stm/issues/12/pdf-files-are-being-corrupted-at-some
Below is an example of the filtering we set up that bit us:
<resources>
    <resource>
        <directory>src/main/resources</directory>
        <filtering>true</filtering>
    </resource>
</resources>
Our solution was to remove this from our pom and rework how we got the information for our health page.
In the 2.0.* versions, open the PDF like this:
PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());
This sets up memory buffering to use only temporary file(s) (no main memory), with no restriction on size.
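If you also need the first-page-to-image step from the question, a sketch for 2.0.* using PDFRenderer (which replaces the 1.8 convertToImage call; the file name here is a placeholder):
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;
try (PDDocument doc = PDDocument.load(new File("39497.pdf"), MemoryUsageSetting.setupTempFileOnly())) {
    PDFRenderer renderer = new PDFRenderer(doc);
    // Render page index 0 (the first page) at 150 DPI; adjust as needed.
    BufferedImage image = renderer.renderImageWithDPI(0, 150);
    ImageIO.write(image, "png", new File("image2.png"));
}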
Good Luck
I wrote code in Java which reads one Excel file and, after encrypting the data, writes it out again. The code works fine, but when I pass a 12 MB file to it, I get this error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at jxl.read.biff.File.next(File.java:181)
at jxl.read.biff.SheetReader.read(SheetReader.java:375)
at jxl.read.biff.SheetImpl.readSheet(SheetImpl.java:716)
at jxl.read.biff.WorkbookParser.getSheet(WorkbookParser.java:257)
at ReadExcel.read(ReadExcel.java:41)
at ReadExcel.main(ReadExcel.java:162)
How can I resolve it?