Java: reading and indexing a large number of files

I'm working on an application that is supposed to read a large number of files (the test set is about 80,000 files) and extract the text from them. The files can be anything from txt, pdf, docx, etc., and will be parsed using Apache Tika.
Once the text is extracted, it will be indexed in Elasticsearch to become searchable. Elasticsearch has, thus far, not been a problem.
The server on which this application will run will have limited RAM (about 2 GB).
Current
The Tika implementation is as follows:
private static final int PARSE_STRING_LIMIT = 100000;
private static final AutoDetectParser PARSER_INSTANCE = new AutoDetectParser(PARSERS);
private static final Tika TIKA_INSTANCE = new Tika(PARSER_INSTANCE.getDetector(), PARSER_INSTANCE);

public String parseToString(InputStream inputStream) throws IOException, TikaException {
    try {
        return TIKA_INSTANCE.parseToString(inputStream, new Metadata(), PARSE_STRING_LIMIT);
    } finally {
        IOUtils.closeQuietly(inputStream); // should already be closed by parseToString
    }
}
For each file, a document object is created and given the appropriate values for the Elasticsearch mapping. Text extraction is done as follows:
String text = TIKA_INSTANCE.parseToString(new BufferedInputStream(new FileInputStream(file)));
attachmentDocumentNew.setText(text);
text = null;
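For reference, the same extraction can also be written against Tika's SAX-based parse API with a write-limited BodyContentHandler, which caps how much extracted text is held in memory for any single file. This is only a sketch, not the code currently in use; the class and method names are made up, and it mirrors the PARSE_STRING_LIMIT above:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class StreamingTextExtractor {

    private static final int PARSE_STRING_LIMIT = 100000;
    private static final AutoDetectParser PARSER = new AutoDetectParser();

    /**
     * Extracts at most PARSE_STRING_LIMIT characters of body text from a file.
     * The write-limited BodyContentHandler bounds how much extracted text is
     * buffered in memory per file.
     */
    public String extractText(File file) throws IOException, TikaException {
        BodyContentHandler handler = new BodyContentHandler(PARSE_STRING_LIMIT);
        Metadata metadata = new Metadata();
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            PARSER.parse(in, handler, metadata, new ParseContext());
        } catch (SAXException e) {
            // BodyContentHandler signals that the write limit was reached by
            // throwing a SAXException; the text gathered so far is still usable.
        }
        return handler.toString();
    }
}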
One more caveat is that this is a Spring Boot application which will eventually run as a server, so it can be called upon whenever indexing is necessary (and for some other things, such as statistics).
The jar is run with the following VM arguments:
java -Xms512m -Xmx1024m -XX:+UseG1GC -jar <jar>
The problem
Whenever I start indexing the files, I get an OutOfMemoryError. I tried profiling it using VisualVM, but it mostly shows char[] and byte[] instances, which don't provide a lot of information. I am also not well versed in multithreading or profiling (I do neither at this point), since I only have two years of programming experience.
The question
How do I reduce the memory footprint of the application without crashing the indexing?
Perhaps a more general question if the above is too specific:
How do I reduce memory usage when reading a large number of files?
If you have experience building something like this, I'm also open to any suggestions :)
Edit
To clarify, I do not have to write much (any?) code for the Elasticsearch part of the application, since this is done using an existing library written by the people here.

Related

Is it possible to read a shapefile using geotools WITHOUT specifying the url of the file?

I am creating a web application which will allow the upload of shape files for use later on in the program. I want to be able to read an uploaded shapefile into memory and extract some information from it without doing any explicit writing to the disk. The framework I am using (play-framework) automatically writes a temporary file to the disk when a file is uploaded, but it nicely handles the creation and deletion of said file for me. This file does not have any extension, however, so the traditional means of reading a shapefile via Geotools, like this
public void readInShpAndDoStuff(File the_upload) throws IOException {
    Map<String, Serializable> map = new HashMap<>();
    map.put("url", the_upload.toURI().toURL());
    DataStore dataStore = DataStoreFinder.getDataStore(map);
}
fails with an exception which states
NAME_OF_TMP_FILE_HERE is not one of the files types that is known to be associated with a shapefile
After looking at the source of Geotools I see that the file type is checked by looking at the file extension, and since this is a tmp file it has none. (Running file FILENAME shows that the OS recognizes this file as a shapefile).
So at long last my question is: is there a way to read in the shapefile without specifying the URL? Some function or constructor which takes a File object as the argument and doesn't rely on a path? Or is it too much trouble and I should just save a copy on the disk? The latter option is not preferable, since this will likely be operating on a VM server at some point and I don't want to deal with filesystem-specific stuff.
Thanks in advance for any help!
I can't see how this is going to work for you: a shapefile (despite its name) is a group of three (or more) files which share a basename and have the extensions .shp, .dbf and .shx (and usually .prj, .sbn, .fix, .qix, etc.).
Is there some way to make Play write the extensions with the temp file name?
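If Play can't be convinced to keep the extensions, one workaround is to copy the uploaded temp files to a scratch directory under a shared basename with the extensions GeoTools expects, then point the data store at the copied .shp. This is only a sketch; it assumes the client supplies all three mandatory parts and that you can tell them apart, and the method and parameter names are hypothetical. It does touch the disk, which the question hoped to avoid, but only via small temporary copies:

import java.io.File;
import java.io.IOException;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

import org.geotools.data.DataStore;
import org.geotools.data.DataStoreFinder;

public class ShapefileUploadReader {

    /**
     * Copies the uploaded temp files to a scratch directory under a shared
     * basename with the expected extensions, then opens the data store.
     * The three File parameters stand for whatever temp files the framework
     * handed over for the .shp, .shx and .dbf parts.
     */
    public DataStore openUploadedShapefile(File shpUpload, File shxUpload, File dbfUpload)
            throws IOException {
        Path dir = Files.createTempDirectory("shapefile-upload");
        Path shp = dir.resolve("upload.shp");
        Files.copy(shpUpload.toPath(), shp, StandardCopyOption.REPLACE_EXISTING);
        Files.copy(shxUpload.toPath(), dir.resolve("upload.shx"), StandardCopyOption.REPLACE_EXISTING);
        Files.copy(dbfUpload.toPath(), dir.resolve("upload.dbf"), StandardCopyOption.REPLACE_EXISTING);

        Map<String, Serializable> params = new HashMap<>();
        params.put("url", shp.toUri().toURL());
        return DataStoreFinder.getDataStore(params);
    }
}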

Delete file after staring connection using FileInputStream

I have a temporary file which I want to send to the client from a controller in the Play Framework. Can I delete the file after opening a connection using FileInputStream? For example, can I do something like this:
File file = getFile();
InputStream is = new FileInputStream(file);
file.delete();
renderBinary(is, "name.txt");
What if the file is a large file? If I delete the file, will subsequent read()s on the InputStream give an error? I have tried with files of around 1 MB and I don't get an error.
Sorry if this is a very naive question, but I could not find anything related to this and I am pretty new to Java.
I just encountered this exact same scenario in some code I was asked to work on. The programmer was creating a temp file, getting an input stream on it, deleting the temp file and then calling renderBinary. It seems to work fine even for very large files, even into the gigabytes.
I was surprised by this and am still looking for some documentation that indicates why this works.
UPDATE: We did finally encounter a file that caused this thing to bomb. I think it was over 3 GB. At that point, it became necessary NOT to delete the file while the rendering was in progress. I actually ended up using the Amazon Queue service to queue up messages for these files. The messages are then retrieved by a scheduled deletion job. It works out nicely, even with clustered servers behind a load balancer.
It seems counter-intuitive that the FileInputStream can still read after the file is removed.
DiskLruCache, a popular library in the Android world originating from the libcore of the Android platform, even relies on this "feature", as follows:
// Open all streams eagerly to guarantee that we see a single published
// snapshot. If we opened streams lazily then the streams could come
// from different edits.
InputStream[] ins = new InputStream[valueCount];
try {
    for (int i = 0; i < valueCount; i++) {
        ins[i] = new FileInputStream(entry.getCleanFile(i));
    }
} catch (FileNotFoundException e) {
    ....
As @EJP pointed out in his comment on a similar question, "That's how Unix and Linux behave. Deleting a file is really deleting its name from the directory: the inode and the data persist while any processes have it open."
But I don't think it is a good idea to rely on it.
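For illustration, here is a minimal, self-contained sketch of the behavior being discussed. It assumes a Unix-like filesystem, where deleting a file only unlinks its name while open handles keep the data alive:

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadAfterDelete {

    public static void main(String[] args) throws IOException {
        // Create a temp file with some content.
        Path tmp = Files.createTempFile("read-after-delete", ".txt");
        Files.write(tmp, "still readable".getBytes(StandardCharsets.UTF_8));

        try (FileInputStream in = new FileInputStream(tmp.toFile())) {
            // Delete the file while the stream is still open.
            Files.delete(tmp);

            // On Unix-like systems the inode and data survive until the last
            // handle is closed, so this still prints the content. On Windows
            // the delete above would typically fail instead.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            System.out.println(out.toString(StandardCharsets.UTF_8.name()));
        }
    }
}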

Creating a .txt file from scratch

I'm working on a microcontroller and I'm trying to write some data from some sensors into a .txt file on the SD card, and later on place the SD card in a card reader and read the data on a PC.
Does anyone know how to write a .txt file from scratch for a FAT32 file system? I don't have any predefined code/methods/functions to call; I'll need to create the code from nothing.
It's not a question about a specific programming language, which is why I tagged more than one. I can later convert the code from C or Java to my programming language of choice. But I can't seem to find such low-level methods/functions in any language :)
Any ideas?
FatFs is quite good, and highly portable. It has support for FAT12, FAT16 and FAT32, long filenames, seeking, reading and writing (most of these things can be switched on and off to change the memory footprint).
If you're really tight on memory there's also Petit FatFs, but it doesn't have write support by default and adding it would take some work.
After mounting the drive you'd simply open a file to create it. For example:
FATFS fatFs;
FIL newFile;

// The drive number may differ
if (f_mount(0, &fatFs) != FR_OK) {
    // Something went wrong
}

if (f_open(&newFile, "/test.txt", FA_WRITE | FA_OPEN_ALWAYS) != FR_OK) {
    // Something went wrong
}
If you really need to create the file using only your own code, you'll have to traverse the FAT, looking for empty space, and then create new LFN entries (where you store the filename) and DIRENTs (which specify the clusters on the disk that will hold the file data). I can't see any reason for doing this unless it is some kind of homework / lab exercise. In any case, you should do some reading about the FAT structure first and return with some more specific questions once you've got started.
In Java you can do it like this:
String text = "This is a test message";
File file = new File("write.txt");
// try-with-resources ensures the writer is closed even if write() fails
try (Writer output = new BufferedWriter(new FileWriter(file))) {
    output.write(text);
}
System.out.println("Your file has been written");

Random-access Zip file without writing it to disk

I have a 1-2 GB zip file with 500-1000k entries. I need to get files by name in a fraction of a second, without fully unpacking the archive. If the file is stored on an HDD, this works fine:
public class ZipMapper {

    private HashMap<String, ZipEntry> map;
    private ZipFile zf;

    public ZipMapper(File file) throws IOException {
        map = new HashMap<>();
        zf = new ZipFile(file);
        Enumeration<? extends ZipEntry> en = zf.entries();
        while (en.hasMoreElements()) {
            ZipEntry ze = en.nextElement();
            map.put(ze.getName(), ze);
        }
    }

    public Node getNode(String key) throws IOException {
        return Node.loadFromStream(zf.getInputStream(map.get(key)));
    }
}
But what can I do if the program downloaded the zip file from Amazon S3 and only has its InputStream (or a byte array)? While downloading 1 GB takes ~1 second, writing it to the HDD may take some time, and it is slightly harder to handle multiple files since there is no garbage collector for the HDD.
ZipInputStream does not allow random access to entries.
It would be nice to create a virtual File in memory backed by a byte array, but I couldn't find a way to do that.
You could mark the file to be deleted on exit.
If you want to go for an in-memory approach: have a look at the new NIO.2 file API. Oracle provides a filesystem provider for zip/jar, and AFAIK ShrinkWrap provides an in-memory filesystem. You could try a combination of the two.
I've written some utility methods to copy directories and files to/from a Zip file using the NIO.2 File API (the library is Open Source):
Maven:
<dependency>
<groupId>org.softsmithy.lib</groupId>
<artifactId>softsmithy-lib-core</artifactId>
<version>0.3</version>
</dependency>
Tutorial:
http://softsmithy.sourceforge.net/lib/current/docs/tutorial/nio-file/index.html
API: CopyFileVisitor.copy
Especially PathUtils.resolve helps with resolving paths across filesystems.
You can use SecureBlackbox library, it allows ZIP operations on any seekable streams.
I think you should consider using your OS to create an "in-memory" file system (i.e. a RAM drive).
In addition, take a look at the FileSystems API.
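As a rough illustration of the zip filesystem provider mentioned above: the sketch below mounts an archive that already sits on disk (or on a RAM drive) and reads one entry by name. The archive path and entry name are made up:

import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;

public class ZipFileSystemExample {

    public static void main(String[] args) throws Exception {
        // Path to the zip archive; this could point at a RAM drive to stay off the HDD.
        Path zip = Paths.get("/tmp/archive.zip");
        URI uri = URI.create("jar:" + zip.toUri());

        // Mount the archive as a filesystem and access one entry by name.
        try (FileSystem zipFs = FileSystems.newFileSystem(uri, Collections.<String, String>emptyMap())) {
            Path entry = zipFs.getPath("/some/entry.txt"); // entry name is hypothetical
            byte[] data = Files.readAllBytes(entry);
            System.out.println(entry + ": " + data.length + " bytes");
        }
    }
}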
A completely different approach: if the server has the file on disk (and possibly cached in RAM already), make it give you the file(s) directly. In other words, submit which files you need and let the server extract and deliver them.
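If that route is taken, the server-side piece could be as simple as a servlet that opens the archive with java.util.zip.ZipFile and streams the requested entry. A rough sketch; the servlet name, zip path and request parameter are all made up:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ZipEntryServlet extends HttpServlet {

    // Hypothetical location of the archive on the server.
    private static final String ZIP_PATH = "/data/archive.zip";

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String entryName = request.getParameter("entry");
        try (ZipFile zip = new ZipFile(ZIP_PATH)) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                response.sendError(HttpServletResponse.SC_NOT_FOUND);
                return;
            }
            if (entry.getSize() >= 0 && entry.getSize() <= Integer.MAX_VALUE) {
                response.setContentLength((int) entry.getSize());
            }
            try (InputStream in = zip.getInputStream(entry);
                 OutputStream out = response.getOutputStream()) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
            }
        }
    }
}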
The Blackbox library only has an Extract(String name, String outputPath) method. It seems that it can indeed randomly access any file in a seekable zip stream, but it can't write the result to a byte array or return a stream.
I couldn't find any documentation for ShrinkWrap, and I couldn't find any suitable implementations of FileSystem/FileSystemProvider etc.
However, it turned out that the Amazon EC2 instance I'm running (Large) somehow writes a 1 GB file to disk in ~1 second. So I just write the file to disk and use ZipFile.
If the HDD were slow, I think a RAM disk would be the easiest solution.

Java Heap Space (CMS with huge files)

EDIT:
Got the directory working. Now there's another issue in sight:
The files in the storage are stored with their DB id as a prefix to their file names. Of course I don't want the users to see those. Is there a way to combine the response.redirect and the header setting for filename and size?
best,
A
Hi again,
new approach:
Is it possible to create an IIS-like virtual directory within Tomcat in order to avoid streaming and only make use of a header redirect? I played around with contexts but couldn't get it going...
any ideas?
thx
A
Hi %,
I'm facing a weird issue with the Java heap space which has me on the ropes.
The short version is:
I've written a ContentManagementSystem which needs to handle huge files (>600 MB) too. Tomcat heap settings:
-Xmx700m
-Xms400m
The issue is that uploading huge files works, even though it's slow. Downloading files results in a Java heap space exception.
Trying to download a 370 MB file makes Tomcat jump to a 500 MB heap (which should be OK) and end in a Java heap space exception.
I don't get it: why does upload work but download doesn't?
Here's my download code:
byte[] byt = new byte[1024 * 1024 * 2];
int read;
response.setHeader("Content-Disposition", "attachment;filename=\"" + fileName + "\"");
FileInputStream fis = new FileInputStream(new File(filePath));
OutputStream os = response.getOutputStream();
BufferedInputStream buffRead = new BufferedInputStream(fis);
while ((read = buffRead.read(byt)) > 0) {
    os.write(byt, 0, read);
    os.flush();
}
buffRead.close();
os.close();
If I'm getting it right, the buffered reader should take care of any memory issues, right?
Any help would be highly appreciated since I've run out of ideas.
Best regards,
W
If I'm getting it right, the buffered reader should take care of any memory issues, right?
No, that has nothing to do with memory issues, it's actually unnecessary since you're already using a buffer to read the file. Your problem is with writing, not with reading.
I can't see anything immediately wrong with your code. It looks as though Tomcat is buffering the entire response instead of streaming it. I'm not sure what could cause that.
What does response.getBufferSize() return? And you should try setting response.setContentLength() to the file's size; I vaguely remember that a web container under certain circumstances buffers the entire response in order to determine the content length, so maybe that's what's happening. It's good practice to do it anyway since it enables clients to display the download size and give an ETA for the download.
Try using the setBufferSize and flushBuffer methods of the ServletResponse.
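Taken together, those two suggestions might look something like the helper below. This is only a sketch (the class and method are made up); it would be called from the download code in the question before the copy loop starts:

import java.io.File;
import java.io.IOException;
import javax.servlet.http.HttpServletResponse;

public final class DownloadHeaders {

    /**
     * Sets size and disposition headers before the body is streamed.
     * Must be called before anything is written to the response.
     */
    static void prepare(HttpServletResponse response, File file, String fileName)
            throws IOException {
        // Declare the exact size up front so the container does not buffer the
        // whole response just to compute Content-Length (the int cast limits
        // this approach to files smaller than 2 GB).
        response.setContentLength((int) file.length());

        // A modest response buffer is enough when the body is streamed in chunks.
        response.setBufferSize(8 * 1024);

        response.setHeader("Content-Disposition",
                "attachment;filename=\"" + fileName + "\"");

        // Commit the headers to the client before streaming the body.
        response.flushBuffer();
    }
}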
You'd better use java.nio for this, so you can read resources partially and free resources that have already been streamed.
Otherwise, you'll end up with memory problems despite the settings you've given the JVM.
My suggestions:
The quick and easy option: use a smaller array! Yes, it loops more, but this will not be a problem. 5 kilobytes is just fine. You'll know within minutes whether this works adequately for you.
byte[] byt = new byte[1024*5];
A little bit harder: If you have access to sendfile (like in Tomcat with the Http11NioProtocol -- documentation here), then use it
A little bit harder, again: Switch your code to Java NIO's FileChannel. I have very, very similar code running on equally large files with hundreds of concurrent connections and similar memory settings with no problem. NIO is faster than plain old Java streams in these situations. It uses the magic of DMA (Direct Memory Access) allowing the data to go from disk to NIC without ever going through RAM or the CPU. Here is a code snippet for my own code base...I've ripped out much to show the basics. FileChannel.transferTo() is not guaranteed to send every byte, so it is in this loop.
WritableByteChannel destination = Channels.newChannel(response.getOutputStream());
FileChannel source = file.getFileInputStream().getChannel();
long start = 0;              // declared here for completeness: offset of the first byte to send
long length = source.size(); // declared here for completeness: number of bytes to send
long total = 0;
while (total < length) {
    long sent = source.transferTo(start + total, length - total, destination);
    total += sent;
}
The following code is able to stream data to the client, allocating only a small buffer (BUFFER_SIZE; this is a soft spot, since you may want to adjust it):
private static final int OUTPUT_SIZE = 1024 * 1024 * 50; // 50 MB
private static final int BUFFER_SIZE = 4096;

@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {
    String fileName = "42.txt";

    // build response headers
    response.setStatus(200);
    response.setContentLength(OUTPUT_SIZE);
    response.setContentType("text/plain");
    response.setHeader("Content-Disposition",
            "attachment;filename=\"" + fileName + "\"");
    response.flushBuffer(); // write HTTP headers to the client

    // streaming result
    InputStream fileInputStream = new InputStream() { // fake input stream
        int i = 0;

        @Override
        public int read() throws IOException {
            if (i++ < OUTPUT_SIZE) {
                return 42;
            } else {
                return -1;
            }
        }
    };

    ReadableByteChannel input = Channels.newChannel(fileInputStream);
    WritableByteChannel output = Channels.newChannel(response.getOutputStream());
    ByteBuffer buffer = ByteBuffer.allocate(BUFFER_SIZE);
    while (input.read(buffer) != -1) {
        buffer.flip();
        output.write(buffer);
        buffer.clear();
    }
    input.close();
    output.close();
}
Are you required to serve files using Tomcat? For this kind of task we have used a separate download mechanism. We chained Apache -> Tomcat -> storage and then added rewrite rules for downloads. That way you bypass Tomcat and Apache serves the file to the client (Apache -> storage). This only works if you have the files stored as files; if you read from a DB or another type of non-file storage, this solution cannot be used. The overall scenario is that you generate download links for files as e.g. domain/binaries/xyz... and write a redirect rule for domain/files using Apache mod_rewrite.
Do you have any filters in the application, or do you use the tcnative library? You could try to profile it with jvisualvm?
Edit: Small remark: note that you have an HTTP response splitting attack possibility in the setHeader call if you do not sanitize fileName.
Why don't you use Tomcat's own FileServlet?
It can surely serve files much better than you could possibly imagine.
A 2 MB buffer is way too large! A few KB should be ample. Megabyte-sized objects are a real issue for the garbage collector, since they often need to be treated separately from "normal" objects (normal == much smaller than a heap generation). To optimize I/O, your buffer only needs to be slightly larger than your I/O buffer size, i.e. at least as large as a disk block or network packet.
