java: how to search a string in a big file? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
exception while Read very large file > 300 MB
Now, I want to search for a string in a big file (>= 300 MB). Because the file is big, I can't load it into memory.
What kind of ways can be provided to handle this problem?
Thanks

There are a few options:
1. Depending on your target OS, you might be able to hand this task off to a system utility such as grep (which is already optimized for this sort of work) and simply parse its output.
2. Even if the file were small enough to fit in memory, you would have to read it from disk either way. So you can simply read it in, one line at a time, and compare your string to each line as it is read (see the sketch after this list). If your app only needs the first occurrence of the string, this has the added benefit that if the target appears early on, you avoid reading the rest of the file.
3. Unless you have a hard upper limit on your app's memory usage (i.e. it must absolutely fit within 128 MB of RAM, etc.), you can also increase the amount of RAM the JVM takes when you launch your app. But because of the inefficiency pointed out in #2 (in terms of time and disk I/O), this is unlikely to be the course you'll want to take, regardless of file size.
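A minimal sketch of option 2, using a BufferedReader so only one line is held in memory at a time (the class and method names are illustrative, and it assumes the target string does not span a line break):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineSearch {
    // Returns the 1-based line number of the first line containing target, or -1 if absent.
    static long searchFile(String fileName, String target) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                if (line.contains(target)) {
                    return lineNo;   // stop early: no need to read the rest of the file
                }
            }
        }
        return -1;
    }
}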

I would memory map the file. This doesn't use much heap (< 1 KB), regardless of the file size (up to 2 GB) and takes about 10 ms on most systems.
// requires: java.io.FileInputStream, java.nio.MappedByteBuffer, java.nio.channels.FileChannel
FileChannel ch = new FileInputStream(fileName).getChannel();
MappedByteBuffer mbb = ch.map(FileChannel.MapMode.READ_ONLY, 0L, ch.size());
This works provided you have a minimum of 4 KB free (and your file is less than 2 GB long).
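Once the file is mapped, a naive byte scan is enough for a simple search. This is only a sketch, under the assumption that the search term is plain ASCII/ISO-8859-1 (the class and method names are illustrative):

import java.nio.MappedByteBuffer;
import java.nio.charset.StandardCharsets;

public class MappedSearch {
    // Naive scan of the mapped region for a byte pattern; assumes a single-byte
    // charset so that the String-to-byte conversion is trivial.
    static int indexOf(MappedByteBuffer mbb, byte[] needle) {
        outer:
        for (int i = 0; i <= mbb.limit() - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (mbb.get(i + j) != needle[j]) {
                    continue outer;
                }
            }
            return i;   // byte offset of the first match
        }
        return -1;      // not found
    }

    // Example: indexOf(mbb, "needle".getBytes(StandardCharsets.ISO_8859_1))
}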

Related

Merging sub parts of multiple files into a single file in Java

I have n files each containing m blocks of data.
File 0 Contents:
file0.block1
file0.block2
file0.block3
file0.block4
..
file0.blockM
File 1 Contents:
file1.block1
file1.block2
file1.block3
file1.block4
..
file1.blockM
...
File n Contents:
fileN.block1
fileN.block2
fileN.block3
fileN.block4
..
fileN.blockM
The blocks are of variable size. Blocks having the same Id can have variable sizes across different files.
The merged file should look like this.
Merged File Contents:
file0.block1
file1.block1
...
fileN.block1
file0.block2
file1.block2
...
fileN.block2
..
file0.blockM
file1.blockM
...
fileN.blockM
Is N really so large that keeping the files open is not an option? At least on Linux, the hard limit on open files is quite large: ulimit -Hn gives me 1048576 on Xubuntu 20.04. The soft limit is much smaller, 1024 by default, but it can be raised using ulimit -n N. I'm not sure what sensible values for N are, but you can try using what you think is the maximum N you will encounter in your application. Note: I do not know if Java imposes limits beyond what the OS does, or if keeping a million files open costs a lot of memory (I would expect the memory cost of an InputStream to be on the order of a few KB). Also, no idea how this works on Windows.
The only middle ground I can think of between either opening/closing files all the time or keeping all files open all the time would be to process a number of files at a time and join them into temporary files, then join the temp files to form the final result. Clearly, that avoids the opening/closing scenario but comes at the cost of re-writing the data more often, which might be slow on spinning disks and wears down SSDs if the files are of any significant size.
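For what it's worth, here is a rough sketch of the keep-all-readers-open approach, under the assumption (not stated in the question) that each block is a single newline-terminated line; with real variable-size blocks you would swap readLine() for whatever framing the files actually use:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class BlockMerger {
    // Interleaves block i of every input file into the output: f0.b1, f1.b1, ..., f0.b2, f1.b2, ...
    static void merge(List<Path> inputs, Path output) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (Path p : inputs) {
                readers.add(Files.newBufferedReader(p));   // one open reader per input file
            }
            boolean moreBlocks = true;
            while (moreBlocks) {
                moreBlocks = false;
                for (BufferedReader r : readers) {
                    String block = r.readLine();           // assumption: one block == one line
                    if (block != null) {
                        out.write(block);
                        out.newLine();
                        moreBlocks = true;
                    }
                }
            }
        } finally {
            for (BufferedReader r : readers) {
                r.close();
            }
        }
    }
}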

Fastest way to create a trie (JSON) from a 4GB file, using only 1GB of ram?

Perhaps I'm doing this the wrong way:
I have a 4 GB file (33 million lines of text), where each line contains a string.
I'm trying to create a trie from it; the algorithm itself works.
The problem is that Node.js has a process memory limit of 1.4 GB, so the moment I process 5.5 million lines, it crashes.
To get around this, I tried the following:
Instead of 1 Trie, I create many Tries, each having a range of the alphabet.
For example:
aTrie ---> all words starting with a
bTrie ---> all words starting with b...
etc...
But the problem is, I still can't keep all the objects in memory while reading the file, so each time I read a line, I load / unload a trie from disk. When there is a change I delete the old file, and write the updated trie from memory to disk.
This is SUPER SLOW! Even on my macbook pro with SSD.
I've considered writing this in Java, but then the problem of converting JAVA objects to json comes up (same problem with using C++ etc).
Any suggestions ?
You can increase the memory limit of the Node process by specifying the option below;
PS: the size is in MB.
node --max_old_space_size=4096
for more options please see:
https://github.com/thlorenz/v8-flags/blob/master/flags-0.11.md
Instead of using 26 Tries, you could use a hash function to create an arbitrary number of sub-Tries. This way, the amount of data you have to read from disk is limited to the size of a sub-Trie, which you control. In addition, you could cache the recently used sub-Tries in memory and persist changes to disk asynchronously in the background if I/O is still a problem.
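A hedged sketch of the bucketing idea in Java terms (the bucket count, hash, and file-naming scheme are all arbitrary illustrations, not part of the original setup):

import java.nio.charset.StandardCharsets;

public class TrieBuckets {
    static final int NUM_BUCKETS = 256;   // arbitrary; tune so that each sub-trie fits in memory

    // Maps a word to a sub-trie file name such as "trie-137.json" (hypothetical naming scheme).
    static String bucketFileFor(String word) {
        int hash = 0;
        for (byte b : word.getBytes(StandardCharsets.UTF_8)) {
            hash = 31 * hash + (b & 0xff);
        }
        int bucket = Math.floorMod(hash, NUM_BUCKETS);
        return "trie-" + bucket + ".json";
    }
}

An LRU cache of recently used sub-tries (for example a LinkedHashMap created with access order and a bounded removeEldestEntry) would then cover the "cache the recently used sub-Tries" part.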

Processing a large (GB) file, quickly and multiple times (Java)

What options are there for processing large files quickly, multiple times?
I have a single file (min 1.5 GB, but can be upwards of 10-15 GB) that needs to be read multiple times - on the order of hundreds to thousands of times. The server has a large amount of RAM (64+ GB) and plenty of processors (24+).
The file will be sequential, read-only. Files are encrypted (sensitive data) on disk. I also use MessagePack to deserialize them into objects during the read process.
I cannot store the objects created from the file in memory; the expansion is too large (a 1.5 GB file turns into a 35 GB in-memory object array). The file can't be stored as a single byte array either (limited by Java's maximum array length of 2^31 - 1).
My initial thought is to use a memory mapped file, but that has its own set of limitations.
The idea is to get the file off the disk and into memory for processing.
The large volume of data is for a machine learning algorithm, that requires multiple reads. During the calculation of each file pass, there's a considerable amount of heap usage by the algorithm itself, which is unavoidable, hence the requirement to read it multiple times.
The problem you have here is that you cannot mmap() the way the system call of the same name does; the syscall can map up to 2^64 bytes, whereas FileChannel#map() cannot map more than Integer.MAX_VALUE (2^31 - 1) bytes at a time.
However, what you can do is wrap a FileChannel into a class and create several "map ranges" covering all the file.
I have done "nearly" such a thing except more complicated: largetext. More complicated because I have to do the decoding process to boot, and the text which is loaded must be so into memory, unlike you who reads bytes. Less complicated because I have a define JDK interface to implement and you don't.
You can however use nearly the same technique using Guava and a RangeMap<Long, MappedByteBuffer>.
I implement CharSequence in this project above; I suggest that you implement a LargeByteMapping interface instead, from which you can read whatever parts you want; or, well, whatever suits you. Your main problem will be to define that interface. I suspect what CharSequence does is not what you want.
Meh, I may even have a go at it some day, largetext is quite exciting a project and this looks like the same kind of thing; except less complicated, ultimately!
One could even imagine a LargeByteMapping implementation where a factory creates such mappings with only a small part of the file in memory and the rest left on disk; such an implementation would also exploit the principle of locality: the most recently queried part of the file would be kept in memory for faster access.
See also here.
EDIT I feel some more explanation is needed here... A MappedByteBuffer will NOT EAT HEAP SPACE!!
It will eat address space only; it is nearly the equivalent of a ByteBuffer.allocateDirect(), except it is backed by a file.
And a very important distinction needs to be made here; all of the text above supposes that you are reading bytes, not characters!
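To make the "several map ranges" idea concrete, here is a minimal sketch (the chunk size, class name, and byte-level accessor are illustrative; a real LargeByteMapping interface would expose whatever reads you need):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class LargeFileMapping {
    private static final long CHUNK = Integer.MAX_VALUE;   // upper bound per MappedByteBuffer
    private final List<MappedByteBuffer> chunks = new ArrayList<>();

    LargeFileMapping(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            for (long pos = 0; pos < size; pos += CHUNK) {
                long len = Math.min(CHUNK, size - pos);
                chunks.add(ch.map(FileChannel.MapMode.READ_ONLY, pos, len));
            }
        }   // the mappings stay valid after the channel is closed
    }

    // Reads one byte at an absolute file offset by picking the right chunk.
    byte get(long offset) {
        int chunkIndex = (int) (offset / CHUNK);
        int within = (int) (offset % CHUNK);
        return chunks.get(chunkIndex).get(within);
    }
}

Reads that straddle a chunk boundary need a little extra care (copying across two buffers); the sketch only shows single-byte access.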
Figure out how to structure the data. Get a good book about NoSQL and find the appropriate database (wide-column, graph, etc.) for your scenario. That's what I'd do. You'd not only have sophisticated query methods on your data, but you could also mangle the data using distributed map-reduce implementations that do whatever you want. Maybe that's what you want (you even dropped the big-data bomb).
How about creating "a dictionary" as the bridge between your program and the target file? Your program would call the dictionary, and the dictionary would refer you to the big fat file.
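One way to read that suggestion, assuming the file (or a decrypted copy of it) is seekable, line-oriented text: build an index of byte offsets once, then seek straight to the record you need. The key extraction below is a made-up convention for illustration:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class OffsetDictionary {
    // One pass over the file: remember where each line (record) starts, keyed by its first token.
    static Map<String, Long> buildIndex(String fileName) throws IOException {
        Map<String, Long> index = new HashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            long offset = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                String key = line.split("\\s+", 2)[0];   // assumed key = first whitespace-separated token
                index.putIfAbsent(key, offset);
                offset = raf.getFilePointer();
            }
        }
        return index;
    }

    // Later: seek straight to the record instead of re-reading the whole file.
    static String lookup(String fileName, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            raf.seek(offset);
            return raf.readLine();
        }
    }
}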

Update all occurence of String in File using Java [duplicate]

This question already has answers here:
Replace string in file
(2 answers)
Closed 8 years ago.
I have one file, which contains some Strings that need to be updated.
MY REPORT
REPORT RUN DATE : 27/08/2012 12:35:11 PAGE 1 of #TOTAL#
SUCCESSFUL AND UNSUCCESSFUL DAILY TRANSACTIONS REPORT
---record of data here----
MY REPORT
REPORT RUN DATE : 27/08/2012 12:35:11 PAGE 2 of #TOTAL#
SUCCESSFUL AND UNSUCCESSFUL DAILY TRANSACTIONS REPORT
---record of data here----
In case I just want to update all occurrences of #TOTAL# to some number, is there a quick and efficient way to do this?
I understand that I can also use BufferedReader + BufferedWriter to print to another file and call String.replace() along the way, but I wonder if there is a better and elegant way to solve this...
The file won't exceed 10 MB, so there is no need to worry about the file being too big (exceeding 1 GB, etc.).
If you don't care about the file being too large, and think calling replace() on every line is inelegant, I guess you can just read the entire file into a single String, call replace() once, then write it to the file.
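For a file this small, that is only a few lines with java.nio.file.Files; the file name, charset, and replacement value below are placeholders:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TotalUpdater {
    public static void main(String[] args) throws IOException {
        Path report = Paths.get("report.txt");                       // illustrative file name
        String content = new String(Files.readAllBytes(report), StandardCharsets.UTF_8);
        content = content.replace("#TOTAL#", "42");                  // whatever the real total is
        Files.write(report, content.getBytes(StandardCharsets.UTF_8));
    }
}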
... I wonder if there is a better and elegant way to solve this
It depends on what you mean by "better and elegant", but IMO the answer is no.
The file won't exceed 10 MB, so there is no need to worry about the file being too big (exceeding 1 GB, etc.).
You are unlikely to exceed 1 GB. However:
You probably cannot be 100% sure that the file won't be bigger than 10 MB. For any program that has a significant lifetime, you can rarely know that the requirements and usage patterns won't change over time.
In fact, a 10 MB text file may occupy up to 60 MB of memory if you load the entire lot into a StringBuilder. Firstly, the bytes are inflated into characters (two bytes per char, so roughly 20 MB). Secondly, the algorithm StringBuilder uses to manage its backing array allocates a new array of double the size of the original one when it grows, so during the final resize the old ~20 MB array and the new ~40 MB array coexist. Peak memory usage could therefore be up to six times the number of bytes in the file you are reading.
Note that 60 MB is greater than the default maximum heap size for some JVMs on some platforms.

Handling large datasets in Java/Clojure: littleBig data

I've been working on a graphing/data processing application (you can see a screenshot here) using Clojure (though, oftentimes, it feels like I'm using more Java than Clojure), and have started testing my application with bigger datasets. I have no problem with around 100k points, but when I start getting higher than that, I run into heap space problems.
Now, theoretically, about half a GB should be enough to hold around 70 million doubles. Granted, I'm doing many things that require some overhead, and I may in fact be holding 2-3 copies of the data in memory at the same time, but I haven't optimized much yet, and 500k or so points is still orders of magnitude less than what I should be able to load.
I understand that Java places artificial restrictions on the size of the heap, and that those can be changed, in part, with options you specify as the JVM starts. This leads me to my first questions:
Can I change the maximum allowed heap space the JVM has on startup if I am using Swank-Clojure (via Leiningen)?
If I package this application (like I plan to) as an Uberjar, would I be able to ensure my JVM has some kind of minimum heap space?
But I'm not content with just relying on the heap of the JVM to power my application. I don't know the size of the data I may eventually be working with, but it could reach millions of points, and perhaps the heap couldn't accommodate that. Therefore, I'm interested in finding alternatives to just piling the data on. Here are some ideas I had, and questions about them:
Would it be possible to read in only parts of a large (text) file at a time, so I could import and process the data in "chunks", e.g, n lines at a time? If so, how?
Is there some faster way of accessing the file I'd be reading from (potentially rapidly, depending on the implementation), other than simply reading from it a bit at a time? I guess I'm asking here for any tips/hacks that have worked for you in the past, if you've done a similar thing.
Can I "sample" from the file; e.g. read only every z lines, effectively downsampling my data?
Right now, assuming there are answers to the above (I'll keep searching!) or insights that lead to equivalent solutions, I plan to read in a chunk of data at a time, graph it to the timeline (see the screenshot; the timeline is green), and allow the user to interact with just that bit until she clicks "next chunk" (or something); then I'd save the changes made to a file, load the next "chunk" of data, and display it.
Alternatively, I'd display the whole timeline of all the data (downsampled, so I could load it), but only allow access to one "chunk" of it at a time in the main window (the part that is viewed above the green timeline, as outlined by the viewport rectangle in the timeline).
Most of all, though, is there a better way? Note that I cannot downsample the primary window's data, as I need to be able to process it and let the user interact with it (e.g, click a point or near one to add a "marker" to that point: that marker is drawn as a vertical rule over that point).
I'd appreciate any insight, answers, suggestions or corrections! I'm also willing to expound on my question in any way you'd like.
This will hopefully, at least in part, be open-sourced; I'd like a simple-to-use yet fast way to make xy-plots of lots of data in the Clojure world.
EDIT Downsampling is possible only when graphing, and not always then, depending on the parts being graphed. I need access to all the data to perform analysis on. (Just clearing that up!) Though I should definitely look into downsampling, I don't think that will solve my memory issues in the least, as all I'm doing to graph is drawing on a BufferedImage.
Can I change the maximum allowed heap space the JVM has on startup if I am using Swank-Clojure (via Leiningen)?
You can change the Java heap size by supplying the -Xms (min heap) and -Xmx (max heap) options at startup, see the docs.
So something like java -Xms256m -Xmx1024m ... would give 256MB initial heap with the option to grow to 1GB.
I don't use Leiningen/Swank, but I expect that it's possible to change it. If nothing else, there should be a startup script for Java somewhere where you can change the arguments.
If I package this application (like I plan to) as an Uberjar, would I be able to ensure my JVM has some kind of minimum heap space?
Memory isn't controlled from within a jar file, but from the startup script, normally a .sh or .bat file that calls java and supplies the arguments.
Can I "sample" from the file; e.g.
read only every z lines?
java.io.RandomAccessFile gives random file access by byte index, which you can build on to sample the contents.
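A rough sketch of sampling built on RandomAccessFile: seek to evenly spaced byte offsets, throw away the partial line you land in, and keep the next full line. Because lines vary in length, this is approximate sampling rather than exactly every z-th line:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class FileSampler {
    // Takes roughly k evenly spaced sample lines from the file.
    static List<String> sample(String fileName, int k) throws IOException {
        List<String> samples = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            long length = raf.length();
            for (int i = 0; i < k; i++) {
                raf.seek(i * (length / k));
                raf.readLine();                 // discard the (probably partial) line we landed in
                String line = raf.readLine();   // keep the next full line
                if (line != null) {
                    samples.add(line);
                }
            }
        }
        return samples;
    }
}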
Would it be possible to read in only parts of a large (text) file at a time, so I could import and process the data in "chunks", e.g. n lines at a time? If so, how?
line-seq returns a lazy sequence of each line in a file, so you can process as much at a time as you wish.
Alternatively, use the Java mechanisms in java.io - BufferedReader.readLine() or FileInputStream.read(byte[] buffer)
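In plain Java terms, "n lines at a time" is just a loop around readLine() that hands off each full batch; this sketch assumes a single chunk fits comfortably in memory:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ChunkedReader {
    // Calls processChunk once per batch of up to n lines.
    static void readInChunks(String fileName, int n, Consumer<List<String>> processChunk) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
            List<String> chunk = new ArrayList<>(n);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == n) {
                    processChunk.accept(chunk);
                    chunk = new ArrayList<>(n);
                }
            }
            if (!chunk.isEmpty()) {
                processChunk.accept(chunk);   // final partial chunk
            }
        }
    }
}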
Is there some faster way of accessing the file I'd be reading from (potentially rapidly, depending on the implementation), other than simply reading from it a bit at a time?
Within Java/Clojure there is BufferedReader, or you can maintain your own byte buffer and read larger chunks at a time.
To make the most out of the memory you have, keep the data as primitive as possible.
For some actual numbers, let's assume you want to graph the contents of a music CD:
A CD has two channels, each with 44,100 samples per second
60 min. of music is then ~300 million data points
Represented as 16 bits (2 bytes, a short) per datapoint: 600MB
Represented as primitive int array (4 bytes per datapoint): 1.2GB
Represented as Integer array (32 bytes per datapoint): 10GB
Using the numbers from this blog for object size (16 byte overhead per object, 4 bytes for primitive int, objects aligned to 8-byte boundaries, 8-byte pointers in the array = 32 bytes per Integer datapoint).
Even 600MB of data is a stretch to keep in memory all at once on a "normal" computer, since you will probably be using lots of memory elsewhere too. But the switch from primitive to boxed numbers will all by itself reduce the number of datapoints you can hold in memory by an order of magnitude.
If you were to graph the data from a 60 min CD on a 1900 pixel wide "overview" timeline, you would have one pixel to display two seconds of music (~180,000 datapoints). This is clearly way too little to show any level of detail, you would want some form of subsampling or summary data there.
So the solution you describe - process the full dataset one chunk at a time for a summary display in the 'overview' timeline, and keep only the small subset for the main "detail" window in memory - sounds perfectly reasonable.
Update:
On fast file reads: This article times the file reading speed for 13 different ways to read a 100MB file in Java - the results vary from 0.5 seconds to 10 minutes(!). In general, reading is fast with a decent buffer size (4k to 8k bytes) and (very) slow when reading one byte at a time.
The article also has a comparison to C in case anyone is interested. (Spoiler: The fastest Java reads are within a factor 2 of a memory-mapped file in C.)
Tossing out a couple ideas from left field...
You might find something useful in the Colt library... http://acs.lbl.gov/software/colt/
Or perhaps memory-mapped I/O.
A couple of thoughts:
The best way to handle large in-memory data sets in Java/Clojure is to use large primitive arrays. If you do this, you are basically using only a little more memory than the size of the underlying data. You can handle these arrays in Clojure just fine with the aget/aset functionality.
I'd be tempted to downsample, but maintain a way to lazily access the detailed points "on demand" if you need to, e.g. in the user interaction case. Kind of like the way that Google maps lets you see the whole world, and only loads the detail when you zoom in....
If you only care about the output image from the x-y plot, then you can construct it by loading in a few thousand points at a time (e.g. loading into your primitive arrays), plotting them, then discarding them. This way you won't need to hold the full data set in memory.
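A hedged sketch of that last idea: stream the points from disk and draw them straight onto a BufferedImage, so the full data set is never resident in memory. The "x y" text format, the linear scaling, and the class name are assumptions for illustration:

import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ChunkedPlotter {
    // Plots "x y" lines from a text file onto an image, one point at a time.
    static BufferedImage plot(String fileName, int width, int height,
                              double xMax, double yMax) throws IOException {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        g.setColor(Color.GREEN);
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length < 2) continue;                      // skip malformed lines
                double x = Double.parseDouble(parts[0]);
                double y = Double.parseDouble(parts[1]);
                int px = (int) (x / xMax * (width - 1));             // naive linear scaling
                int py = height - 1 - (int) (y / yMax * (height - 1));
                g.fillRect(px, py, 1, 1);                            // one pixel per point
            }
        } finally {
            g.dispose();
        }
        return img;
    }
}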
