Recognize wav files with silence in Java

I need a function in Java, something like this:
Input: .wav file (or byte[] fileBytes)
Output: true/false (the file consists of silence only)
What is the best way to do it?
Thank you.
UPDATE:
The command that I use for recording:
arecord --format=S16_LE --max-file-time=60 --rate=16000 --file-type=wav randomvalue_i.wav
Silent = no audio at all

Well, the short answer is you'll want to scan the .WAV data and do a min/max check on it. In a "silent" file, the values should essentially all be 0.
The longer answer is that you'll want to understand the .WAV format, which is described here (http://soundfile.sapp.org/doc/WaveFormat/). You can probably skip over the first 44 bytes (the RIFF and 'fmt ' headers) to get down to the data, then start looking at the bytes. The 'bits-per-sample' value from the header might be important, since 16-bit samples mean you'd need to combine 2 bytes to get a single sample. But even so, both bytes of a sample would be 0 in a truly silent 16-bit file. Ditto for NumChannels: in theory you should take it into account, but again, every byte should be 0 for true silence. If all the data is 0, the file is silent.
"Silent" is a bit ambiguous. Above, I was strict and assumed it meant true '0' only. However, in a silent room there would still be very low levels of background ambient noise. In that case, you'd need to be a bit more forgiving in the comparison, e.g. calculate a min/max for each sample and ensure that the range stays within some tolerance. It can still be determined; it just adds code.
For completeness:
public boolean isSilent(byte[] info) {
    // Skip the standard 44-byte WAV header, then check every data byte
    for (int idx = 44; idx < info.length; ++idx) {
        if (info[idx] != 0)
            return false;
    }
    return true;
}
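If you need the more forgiving check described above, a sketch for the S16_LE data produced by the arecord command in the question might look like this. The 44-byte header skip and the threshold value are assumptions to tune, not fixed rules:

// Treat the file as "silent" if every 16-bit little-endian sample stays within
// a small amplitude window around zero. threshold is in raw sample units,
// e.g. 200 out of 32767 is roughly -44 dBFS (an arbitrary starting point).
public boolean isEffectivelySilent(byte[] info, int threshold) {
    for (int idx = 44; idx + 1 < info.length; idx += 2) {
        // assemble the signed 16-bit sample: low byte first, then high byte
        int sample = (info[idx] & 0xFF) | (info[idx + 1] << 8);
        if (Math.abs(sample) > threshold) {
            return false;
        }
    }
    return true;
}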

You could keep a .wav file of what you consider "silence" and compare its audio content (for example its amplitude levels) against the other .wav file to see whether they match.

Related

Handling COMP-3 and EBCDIC conversion in Java to ASCII for large files

I am trying to convert COMP-3 and EBCDIC characters in my Java code, but I'm running into an out of memory exception as the amount of data handled is huge, about 5 GB. My code is currently as follows:
byte[] data = Files.readAllBytes(path);
This is resulting in an out of memory exception, which I can understand, but I can't use a file scanner either, since the data in the file won't be split into lines.
Can anyone point me in the correct direction on how to handle this?
Note: the file may contain records of different lengths, hence splitting it based on record length seems not possible.
As Bill said, you could (should) ask for the data to be converted to display characters on the mainframe, and if it is English only you can do an ASCII transfer.
Also, how are you deciding where the COMP-3 fields start?
You do not have to read the whole file into memory; you can still read the file in blocks. This method will fill an array of bytes:
protected final int readBuffer(InputStream in, final byte[] buf)
        throws IOException {
    int total = 0;
    int num = in.read(buf, total, buf.length);
    while (num >= 0 && total + num < buf.length) {
        total += num;
        num = in.read(buf, total, buf.length - total);
    }
    if (num > 0) {
        total += num;
    }
    // number of bytes actually read; less than buf.length only at end of file
    return total;
}
If all the records are the same length, create an array of the record length and the above method will read one record at a time.
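As a rough sketch of that idea (the record length, the field offsets and the Cp037 code page are assumptions for illustration; the real values come from your copybook and transfer settings):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical values, purely for illustration
static final int RECORD_LENGTH = 100;
static final Charset EBCDIC = Charset.forName("Cp037"); // assumes Cp037 is available in your JDK

void processFixedLengthFile(String path) throws IOException {
    byte[] record = new byte[RECORD_LENGTH];
    try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
        // readBuffer is the method shown above; it returns the bytes actually read
        while (readBuffer(in, record) == RECORD_LENGTH) {
            // character fields can be decoded with the EBCDIC charset...
            String textField = new String(record, 0, 20, EBCDIC);
            // ...while COMP-3 (packed decimal) fields need a packed-decimal
            // decoder such as the one in JRecord mentioned below.
        }
    }
}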
Finally, the JRecord project has classes to read fixed-length files etc. It can do COMP-3 conversion. Note: I am the author of JRecord.
I'm running into an out of memory exception as the amount of data handled is huge, about 5 GB.
You only need to read one record at a time.
My code is currently as follows:
byte[] data = Files.readAllBytes(path);
This is resulting in an out of memory exception which I can understand
Me too.
but I can't use a file scanner either since the data in the file won't be split into lines.
You mean you can't use the Scanner class? That's not the only way to read a record at a time.
In any case not all files have record delimiters. Some have fixed-length records, some have length words at the start of each record, and some have record type attributes at the start of each record, or at least in a fixed part of the record.
I'll have to split it based on an attribute record_id at a particular position (say at the beginning of each record) that will tell me the record length
So read that attribute, decode it if necessary, and read the rest of the record according to the record length you derive from the attribute. One at a time.
I direct your attention to the methods of DataInputStream, especially readFully(). You will also need a Java COMP-3 library. There are several available. Most of the rest can be done by built-in EBCDIC character set decoders.
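As a rough sketch of that (the 4-byte length prefix and the decodeLength helper are assumptions for illustration; your record_id attribute will have its own layout):

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

void readVariableLengthRecords(String path) throws IOException {
    try (DataInputStream in = new DataInputStream(
            new BufferedInputStream(new FileInputStream(path)))) {
        byte[] header = new byte[4];                 // hypothetical fixed prefix
        while (true) {
            try {
                in.readFully(header);                // read the length attribute
            } catch (EOFException eof) {
                break;                               // clean end of file
            }
            int recordLength = decodeLength(header);
            byte[] body = new byte[recordLength];
            in.readFully(body);                      // read the rest of this record
            // decode the EBCDIC / COMP-3 fields of 'body' here, one record at a time
        }
    }
}

// Placeholder: however your record_id attribute maps to a length; a plain
// big-endian binary integer is assumed here purely for illustration.
int decodeLength(byte[] header) {
    return ((header[0] & 0xFF) << 24) | ((header[1] & 0xFF) << 16)
         | ((header[2] & 0xFF) << 8) | (header[3] & 0xFF);
}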

Check for differences between two (large) files

I want to write a relatively simple program that can back up files from my computer to a remote location and encrypt them in the process, while also computing a diff between the local and the remote files (well, not really... I'm content with seeing if anything changed at all, not so much what has changed) to see which ones have changed and need to be updated.
I am aware that there are perfectly good programs out there that do this (rsync, or others based on duplicity). I'm not trying to reinvent the wheel; it's just supposed to be a learning experience for myself.
My question is regarding the diff part of the project. I have made some assumptions and written some sample code to test them out, but I would like to know if you see anything I might have missed, if the assumptions are just plain wrong, or if there's something that could go wrong in a particular constellation.
Assumption 1: If files are not of equal length, they cannot be the same (i.e. some modification must have taken place)
Assumption 2: If two files are the same (i.e. no modification has taken place), any byte sub-set of these two files will have the same hash
Assumption 3: If a byte sub-set of the two files is found which does not result in the same hash, the two files are not the same (i.e. have been modified)
The code is written in Java and the hashing algorithm used is BLAKE-512 using the java implementation from Marc Greim.
_File1 and _File2 are 2 files > 1.5GB of type java.io.File
public boolean compareStream() throws IOException {
    final int step = 4096;
    if (_File1.length() != _File2.length()) { // Assumption 1
        return false;
    }
    try (DataInputStream fi1 = new DataInputStream(new FileInputStream(_File1));
         DataInputStream fi2 = new DataInputStream(new FileInputStream(_File2))) {
        long remaining = _File1.length();
        while (remaining > 0) {
            // Read the same, fully filled chunk from both files (Assumption 2)
            int chunk = (int) Math.min(step, remaining);
            byte[] fi1Content = new byte[chunk];
            byte[] fi2Content = new byte[chunk];
            fi1.readFully(fi1Content);
            fi2.readFully(fi2Content);
            if (!BLAKE512.isEqual(fi1Content, fi2Content)) { // Assumption 3
                return false;
            }
            remaining -= chunk;
        }
    }
    return true;
}
The calculation for two equal 1.5 GB files takes around 4.2 seconds. Times are of course much shorter when the files differ, especially when they are of different length since it returns immediately.
Thank you for your suggestions :)
..I hope this isn't too broad
While the assumptions are correct, they won't protect you from rare false positives (when the method says the files are equal although they aren't):
Assumption 2: If two files are the same (ie. no modification has taken place) any byte sub-set will have the same hash
This is right, but because of hash collisions you can have the situation where the hashes of two chunks are the same even though the chunks themselves differ.
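If all you really need to know is whether the two local files are byte-for-byte identical, one way to rule out collisions entirely is to compare the chunks directly instead of hashing them. A minimal sketch (the method name and the 64 KB chunk size are mine):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Compare the two files chunk by chunk with no hashing at all,
// so a hash collision can never produce a false "equal".
static boolean sameBytes(File a, File b) throws IOException {
    if (a.length() != b.length()) {
        return false;                          // different lengths: not equal
    }
    try (DataInputStream in1 = new DataInputStream(new FileInputStream(a));
         DataInputStream in2 = new DataInputStream(new FileInputStream(b))) {
        byte[] buf1 = new byte[64 * 1024];
        byte[] buf2 = new byte[64 * 1024];
        long remaining = a.length();
        while (remaining > 0) {
            int chunk = (int) Math.min(buf1.length, remaining);
            in1.readFully(buf1, 0, chunk);
            in2.readFully(buf2, 0, chunk);
            for (int i = 0; i < chunk; i++) {
                if (buf1[i] != buf2[i]) {
                    return false;              // first differing byte found
                }
            }
            remaining -= chunk;
        }
    }
    return true;
}

Hashing only really becomes necessary when the two copies are not both locally readable, e.g. when one of them sits at the remote backup location.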

How to compare large text files?

I have a general question on your opinion about my "technique".
There are 2 text files (file_1 and file_2) that need to be compared to each other. Both are huge (3-4 gigabytes, with 30,000,000 to 45,000,000 lines each).
My idea is to read several lines (as many as possible) of file_1 into memory, then compare those to all lines of file_2. If there's a match, the lines from both files that match shall be written to a new file. Then go on with the next 1000 lines of file_1 and also compare those to all lines of file_2, until I have gone through file_1 completely.
But this sounds actually really, really time consuming and complicated to me.
Can you think of any other method to compare those two files?
How long do you think the comparison could take?
For my program, time does not matter that much. I have no experience in working with such huge files, therefore I have no idea how long this might take. It shouldn't take more than a day though. ;-) But I am afraid my technique could take forever...
Another question that just came to my mind: how many lines would you read into memory? As many as possible? Is there a way to determine the number of possible lines before actually trying it?
I want to read as many as possible (because I think that's faster), but I've run out of memory quite often.
Thanks in advance.
EDIT
I think I have to explain my problem a bit more.
The purpose is not to see if the two files in general are identical (they are not).
There are some lines in each file that share the same "characteristic".
Here's an example:
file_1 looks somewhat like this:
mat1 1000 2000 TEXT //this means the range is from 1000 - 2000
mat1 2040 2050 TEXT
mat3 10000 10010 TEXT
mat2 20 500 TEXT
file_2 looks like this:
mat3 10009 TEXT
mat3 200 TEXT
mat1 999 TEXT
TEXT refers to characters and digits that are of no interest to me, mat can go from mat1 to mat50 and they are in no particular order; also, there can be 1000x mat2 (but the numbers in the next column are different). I need to find the fitting lines such that: matX is the same in both compared lines and the number mentioned in file_2 fits into the range mentioned in file_1.
So in my example I would find one match: line 3 of file_1 and line 1 of file_2 (because both are mat3 and 10009 is between 10000 and 10010).
I hope this makes it clear to you!
So my question is: how would you search for the matching lines?
Yes, I use Java as my programming language.
EDIT
I have now divided the huge files first so that I have no problems with running out of memory. I also think it is faster to compare (many) smaller files to each other than those two huge files. After that I can compare them the way I mentioned above. It may not be the perfect way, but I am still learning ;-)
Nonetheless, all your approaches were very helpful to me, thank you for your replies!
I think your way is rather reasonable.
I can imagine different strategies -- for example, you could sort both files before comparing (there are efficient implementations of external file sort, and the unix sort utility can sort files of several GB in minutes), and, once sorted, you can compare the files sequentially, reading line by line.
But that is a rather complex way to go -- you would need to run an external program (sort), or write a comparably efficient implementation of file sort in Java yourself, which is by itself not an easy task. So, for the sake of simplicity, I think your way of chunked reads is very promising.
As for how to find a reasonable block size -- first of all, "the more, the better" may not be true. I think the total running time will approach some constant asymptotically as the block grows, so you may reach the point of diminishing returns sooner than you think -- you need to benchmark this.
Next -- you may read lines into a buffer like this:
final List<String> lines = new ArrayList<>();
try {
    final List<String> block = new ArrayList<>(BLOCK_SIZE);
    for (int i = 0; i < BLOCK_SIZE; i++) {
        final String line = ...; // read line from file
        block.add(line);
    }
    lines.addAll(block);
} catch (OutOfMemoryError ooe) {
    // break
}
So you read as many lines as you can, leaving the last BLOCK_SIZE worth of free memory. BLOCK_SIZE should be big enough for the rest of your program to run without an OOM.
In an ideal world, you would be able to read in every line of file_2 into memory (probably using a fast lookup object like a HashSet, depending on your needs), then read in each line from file_1 one at a time and compare it to your data structure holding the lines from file_2.
As you have said you run out of memory however, I think a divide-and-conquer type strategy would be best. You could use the same method as I mentioned above, but read in a half (or a third, a quarter... depending on how much memory you can use) of the lines from file_2 and store them, then compare all of the lines in file_1. Then read in the next half/third/quarter/whatever into memory (replacing the old lines) and go through file_1 again. It means you have to go through file_1 more, but you have to work with your memory constraints.
EDIT: In response to the added detail in your question, I would change my answer in part. Instead of reading in all of file_2 (or in chunks) and reading in file_1 a line at a time, reverse that, as file_1 holds the data to check against.
Also, with regard to searching for the matching lines, I think the best way would be to do some processing on file_1. Create a HashMap<String, List<Range>> that maps a String ("mat1" - "mat50") to a list of Ranges (just a wrapper for a startOfRange int and an endOfRange int) and populate it with the data from file_1 (a sketch of this follows below). Then write a function like (ignoring error checking)
boolean isInRange(String material, int value)
{
    List<Range> ranges = hashMapName.get(material);
    for (Range range : ranges)
    {
        if (value >= range.getStart() && value <= range.getEnd())
        {
            return true;
        }
    }
    return false;
}
and call it for each (parsed) line of file_2.
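A rough sketch of building that structure (class and method names are mine; the parsing assumes whitespace-separated columns like in the sample lines):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simple wrapper for one range from file_1
class Range {
    private final int start, end;
    Range(int start, int end) { this.start = start; this.end = end; }
    int getStart() { return start; }
    int getEnd()   { return end; }
}

// Build the material -> ranges index from file_1
static Map<String, List<Range>> buildIndex(String file1Path) throws IOException {
    Map<String, List<Range>> index = new HashMap<>();
    try (BufferedReader reader = new BufferedReader(new FileReader(file1Path))) {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] cols = line.split("\\s+");   // e.g. "mat1 1000 2000 TEXT"
            Range range = new Range(Integer.parseInt(cols[1]), Integer.parseInt(cols[2]));
            index.computeIfAbsent(cols[0], k -> new ArrayList<>()).add(range);
        }
    }
    return index;
}

Each parsed line of file_2 would then be checked with the isInRange function above against the map returned by buildIndex.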
Now that you've given us more specifics, the approach I would take relies upon pre-partitioning, and optionally, sorting before searching for matches.
This should eliminate a substantial number of comparisons that wouldn't otherwise match anyway under the naive, brute-force approach. For the sake of argument, let's peg both files at 40 million lines each.
Partitioning: Read through file_1 and send all lines starting with mat1 to file_1_mat1, and so on. Do the same for file_2. This is trivial with a little grep, or should you wish to do it programmatically in Java it's a beginner's exercise.
That's one pass through two files for a total of 80 million lines read, yielding two sets of 50 files of 800,000 lines each on average.
Sorting: For each partition, sort according to the numeric value in the second column only (the lower bound from file_1 and the actual number from file_2). Even if 800,000 lines can't fit into memory I suppose we can adapt 2-way external merge sort and perform this faster (fewer overall reads) than a sort of the entire unpartitioned space.
Comparison: Now you just have to iterate once through both pairs of file_1_mat1 and file_2_mat1, without need to keep anything in memory, outputting matches to your output file. Repeat for the rest of the partitions in turn. No need for a final 'merge' step (unless you're processing partitions in parallel).
Even without the sorting stage the naive comparison you're already doing should work faster across 50 pairs of files with 800,000 lines each rather than with two files with 40 million lines each.
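For the partitioning pass described above, a minimal Java sketch (the output naming scheme and the assumption that the key is the first whitespace-delimited token are mine):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// One pass over the input: append each line to a per-key partition file,
// e.g. file_1 -> file_1_mat1, file_1_mat2, ...
static void partition(String inputPath) throws IOException {
    Map<String, BufferedWriter> writers = new HashMap<>();
    try (BufferedReader reader = new BufferedReader(new FileReader(inputPath))) {
        String line;
        while ((line = reader.readLine()) != null) {
            String key = line.split("\\s+", 2)[0];          // "mat1" ... "mat50"
            BufferedWriter out = writers.get(key);
            if (out == null) {
                out = new BufferedWriter(new FileWriter(inputPath + "_" + key));
                writers.put(key, out);
            }
            out.write(line);
            out.newLine();
        }
    } finally {
        for (BufferedWriter out : writers.values()) {
            out.close();
        }
    }
}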
There is a tradeoff: if you read a big chunk of the file, you save disk seek time, but you may also read information you do not need, since the change might be encountered in the first lines.
You should probably run some experiments [benchmarks] with varying chunk sizes to find out what the optimal chunk to read is in the average case.
Not sure how good an answer this would be -- but have a look at this page: http://c2.com/cgi/wiki?DiffAlgorithm -- it summarises a few diff algorithms. The Hunt-McIlroy algorithm is probably the best implementation. From that page there's also a link to a Java implementation of GNU diff. However, I think an implementation in C/C++ compiled into native code will be much faster. If you're stuck with Java, you may want to consider JNI.
Indeed, that could take a while. You would have to make on the order of 1,200,000,000,000,000 (30,000,000 x 40,000,000) line comparisons.
There are several possibilities to speed that up by an order of magnitude:
One would be to sort file_2 and do a kind of binary search at the file level.
Another approach: compute a checksum of each line and search for that. Depending on the average line length, the file in question would be much smaller, and you really can do a binary search if you store the checksums in a fixed-size format (i.e. a long).
The number of lines you read at once from file_1 does not matter, however. This is micro-optimization in the face of great complexity.
If you want a simple approach: you can hash both of the files and compare the hashes. But it's probably faster (especially if the files differ) to use your approach. About the memory consumption: just make sure you use enough memory; using no buffer for this kind of thing is a bad idea.
And about all those answers regarding hashes and checksums: they are not faster. You have to read the whole file in both cases, and with hashes/checksums you even have to compute something on top.
What you can do is sort each individual file. e.g. the UNIX sort or similar in Java. You can read the sorted files one line at a time to perform a merge sort.
I have never worked with such huge files, but this is my idea and it should work.
You could look into hashing, e.g. using SHA-1 hashing.
Import the following
import java.io.FileInputStream;
import java.security.MessageDigest;
Once your text file has been loaded, have it loop through each line and at the end print out the hash. The example links below go into more depth.
MessageDigest md = MessageDigest.getInstance("SHA-1");
// ... feed the file's bytes to md.update(...) while reading it ...
byte[] mdbytes = md.digest();
StringBuffer myBuffer = new StringBuffer();
// Convert each digest byte into two hex characters
for (int i = 0; i < mdbytes.length; i++) {
    myBuffer.append(Integer.toString((mdbytes[i] & 0xff) + 0x100, 16).substring(1));
}
System.out.println("Computed Hash = " + myBuffer.toString());
SHA Code example focusing on Text File
SO Question about computing SHA in JAVA (Possibly helpful)
Another sample of hashing code.
Simply read each file separately; if the hash value for each file is the same at the end of the process, then the two files are identical. If not, then something differs.
Then, if you get different values, you can do the super time-consuming line-by-line check.
Overall, it seems that reading line by line and comparing everything would take forever. I would do that if you are trying to find each individual difference. But I think hashing would be quicker if you only need to see whether the files are the same.
SHA checksum
If you want to know exactly whether the files are different or not, then there isn't a better solution than yours -- comparing sequentially.
However, you can use some heuristics that can tell you with some probability whether the files are identical.
1) Check file size; that's the easiest.
2) Take a random file position and compare a block of bytes starting at this position in the two files.
3) Repeat step 2) to achieve the needed probability.
You should compute and test how many reads (and size of block) are useful for your program.
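A sketch of those three steps in Java (the number of probes and the block size are exactly the knobs the last paragraph says you should tune):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

// Heuristic check: compare a handful of randomly chosen blocks of the two files.
// A 'false' answer is definite; a 'true' answer is only probable.
static boolean probablyEqual(File a, File b, int probes, int blockSize) throws IOException {
    if (a.length() != b.length()) {
        return false;                                   // step 1: sizes differ
    }
    try (RandomAccessFile ra = new RandomAccessFile(a, "r");
         RandomAccessFile rb = new RandomAccessFile(b, "r")) {
        Random random = new Random();
        byte[] blockA = new byte[blockSize];
        byte[] blockB = new byte[blockSize];
        for (int i = 0; i < probes; i++) {              // steps 2 and 3
            long maxStart = Math.max(0, a.length() - blockSize);
            long pos = maxStart == 0 ? 0 : (long) (random.nextDouble() * maxStart);
            int len = (int) Math.min(blockSize, a.length() - pos);
            ra.seek(pos);
            rb.seek(pos);
            ra.readFully(blockA, 0, len);
            rb.readFully(blockB, 0, len);
            for (int j = 0; j < len; j++) {
                if (blockA[j] != blockB[j]) {
                    return false;                       // a sampled block differs
                }
            }
        }
    }
    return true;                                        // no sampled block differed
}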
My solution would be to produce an index of one file first, then use that to do the comparison. This is similar to some of the other answers in that it uses hashing.
You mention that the number of lines is up to about 45 million. This means that you could (potentially) store an index which uses 16 bytes per entry (128 bits) and it would use about 45,000,000*16 = ~685MB of RAM, which isn't unreasonable on a modern system. There are overheads in using the solution I describe below, so you might still find you need to use other techniques such as memory mapped files or disk based tables to create the index. See Hypertable or HBase for an example of how to store the index in a fast disk-based hash table.
So, in full, the algorithm would be something like:
Create a hash map which maps Long to a List of Longs (HashMap<Long, List<Long>>)
Get the hash of each line in the first file (Object.hashCode should be sufficient)
Get the offset in the file of the line so you can find it again later
Add the offset to the list of lines with matching hashCodes in the hash map
Compare each line of the second file to the set of line offsets in the index
Keep any lines which have matching entries
EDIT:
In response to your edited question, this wouldn't really help by itself. You could just hash the first part of the line, but that would only create 50 different entries. You could, however, create another level in the data structure, which would map the start of each range to the offset of the line it came from.
So something like index.get("mat32") would return a TreeMap of ranges. You could look for the range preceding the value you are looking for with lowerEntry(). Together this would give you a pretty fast check to see whether a given matX/number combination is in one of the ranges you are checking for.
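A sketch of that two-level lookup (names are illustrative; floorEntry is used here so a number equal to a range's lower bound still matches, and the ranges for one matX are assumed not to overlap):

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Two-level index: "mat1".."mat50" -> (range start -> range end), built from file_1
static Map<String, TreeMap<Integer, Integer>> index = new HashMap<>();

// Called while reading file_1, e.g. addRange("mat3", 10000, 10010)
static void addRange(String material, int start, int end) {
    index.computeIfAbsent(material, k -> new TreeMap<>()).put(start, end);
}

// Called for each parsed line of file_2, e.g. matches("mat3", 10009) -> true
static boolean matches(String material, int value) {
    TreeMap<Integer, Integer> ranges = index.get(material);
    if (ranges == null) {
        return false;
    }
    // the candidate is the range with the largest start <= value
    Map.Entry<Integer, Integer> candidate = ranges.floorEntry(value);
    return candidate != null && value <= candidate.getValue();
}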
Try to avoid consuming memory and consume disk instead.
I mean, divide each file into loadable-size parts and compare them part by part. This may take some extra time, but will keep you safely within the memory limits.
What about using source control like Mercurial? I don't know, maybe it isn't exactly what you want, but this is a tool that is designed to track changes between revisions. You can create a repository, commit the first file, then overwrite it with the other one and commit the second one:
hg init some_repo
cd some_repo
cp ~/huge_file1.txt .
hg ci -Am "Committing first huge file."
cp ~/huge_file2.txt huge_file1.txt
hg ci -m "Committing second huge file."
From here you can get a diff, telling you what lines differ. If you could somehow use that diff to determine what lines were the same, you would be all set.
That's just an idea, someone correct me if I'm wrong.
I would try the following: for each file that you are comparing, create temporary files (I refer to them as partial files later) on disk, one for each alphabetic letter and an additional file for all other characters. Then read the whole file line by line, and while doing so insert each line into the partial file that corresponds to the letter it starts with. Since you have done that for both files, you can now limit the comparison to loading two smaller files at a time: a line starting with A, for example, can appear only in one partial file, and no partial file needs to be compared more than once. If the resulting files are still very large, you can apply the same methodology to the partial files being compared, this time splitting according to their second letter. The trade-off here is the temporary use of a large amount of disk space until the process is finished. In this process, the approaches mentioned in other posts here can help in dealing with the partial files more efficiently.

How to read/write high-resolution (24-bit, 8 channel) .wav files in Java?

I'm trying to write a Java application that manipulates high resolution .wav files. I'm having trouble importing the audio data, i.e. converting the .wav file into an array of doubles.
When I use a standard approach an exception is thrown.
AudioFileFormat as = AudioSystem.getAudioFileFormat(new File("orig.wav"));
-->
javax.sound.sampled.UnsupportedAudioFileException: file is not a supported file type
Here's the file format info according to soxi:
dB$ soxi orig.wav
soxi WARN wav: wave header missing FmtExt chunk
Input File : 'orig.wav'
Channels : 8
Sample Rate : 96000
Precision : 24-bit
Duration : 00:00:03.16 = 303526 samples ~ 237.13 CDDA sectors
File Size : 9.71M
Bit Rate : 24.6M
Sample Encoding: 32-bit Floating Point PCM
Can anyone suggest the simplest method for getting this audio into Java?
I've tried using a few techniques. As stated above, I've experimented with the Java AudioSystem (on both Mac and Windows). I've also tried using Andrew Greensted's WavFile class, but this also fails (WavFileException: Compression Code 3 not supported). One workaround is to convert the audio to 16 bits using sox (with the -b 16 flag), but this is suboptimal since it increases the noise floor.
Incidentally, I've noticed that the file CAN be read by libsndfile. Is my best bet to write a jni wrapper around libsndfile, or can you suggest something quicker?
Note that I don't need to play the audio, I just need to analyze it, manipulate it, and then write it out to a new .wav file.
* UPDATE *
I solved this problem by modifying Andrew Greensted's WavFile class. His original version only read files encoded as integer values ("format code 1"); my files were encoded as floats ("format code 3"), and that's what was causing the problem.
I'll post the modified version of Greensted's code when I get a chance. In the meantime, if anyone wants it, send me a message.
Very late reply but others might benefit from this.
A libsndfile Java wrapper now exists, which should help with audio read/write/conversion issues like this. The wrapper (generated by SWIG) is true to the libsndfile API, so the structure is a little different from typical Java style, but it is entirely usable.
In your particular case, usage would look something like this:
String filePath = "path/to/some/audio_8_channels.wav";
// open the sound file
SF_INFO sndInfo = new SF_INFO();
SWIGTYPE_p_SNDFILE_tag sndFile = libsndfile.sf_open(filePath, libsndfile.SFM_READ, sndInfo);
// create a buffer array large enough to hold all channels (double per channel) of all frames
int arraySize = sndInfo.getFrames() * sndInfo.getChannels();
CArrayDouble darray = new CArrayDouble(arraySize);
// read every frame (includes all channels) into the buffer array
long count = libsndfile.sf_readf_double(sndFile, darray.cast(), sndInfo.getFrames());
// iterate over the doubles that make up the audio
for (int i = 0; i < arraySize; i++) {
    // every 8 doubles is a frame..
    double d = darray.getitem(i);
}
The google code project, including a compiled 'usable' version for JDK1.7 on Windows7 exists here:
http://code.google.com/p/libsndfile-java/
You could read the WAV data in yourself; it's actually not that hard to do. Just search for the WAV file format information.
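For this particular file (32-bit float PCM according to the soxi output above), a bare-bones sketch of doing that might look like the following. It walks the RIFF chunks instead of assuming a fixed 44-byte header, assumes the fmt chunk comes before the data chunk, and skips error handling and the WAVE_FORMAT_EXTENSIBLE details:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: read a 32-bit float PCM WAV (format code 3) into per-channel doubles.
public static double[][] readFloatWav(String path) throws Exception {
    ByteBuffer buf = ByteBuffer.wrap(Files.readAllBytes(Paths.get(path)))
                               .order(ByteOrder.LITTLE_ENDIAN);
    buf.position(12);                            // skip "RIFF", size, "WAVE"
    int channels = 0;
    double[][] samples = null;
    while (buf.remaining() >= 8) {
        byte[] id = new byte[4];
        buf.get(id);
        String chunkId = new String(id, StandardCharsets.US_ASCII);
        int chunkSize = buf.getInt();
        if (chunkId.equals("fmt ")) {
            int formatCode = buf.getShort() & 0xFFFF;
            if (formatCode != 3) {               // 3 = IEEE float
                throw new IllegalStateException("expected float data, got format " + formatCode);
            }
            channels = buf.getShort() & 0xFFFF;
            buf.position(buf.position() + chunkSize - 4);   // skip rest of fmt chunk
        } else if (chunkId.equals("data")) {
            int frames = chunkSize / (4 * channels);        // 4 bytes per float sample
            samples = new double[channels][frames];
            for (int f = 0; f < frames; f++) {
                for (int c = 0; c < channels; c++) {
                    samples[c][f] = buf.getFloat();         // one float per channel
                }
            }
            break;
        } else {
            buf.position(buf.position() + chunkSize + (chunkSize & 1)); // skip chunk + pad byte
        }
    }
    return samples;
}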

How can I set the file-read buffer size in Perl to optimize it for large files?

I understand that both Java and Perl try quite hard to find a one-size-fits-all default buffer size when reading in files, but I find their choices to be increasingly antiquated, and I am having a problem changing the default choice when it comes to Perl.
In the case of Perl, which I believe uses 8K buffers by default, similar to Java's choice, I can't find a reference using the perldoc website search engine (really Google) on how to increase the default file input buffer size to, say, 64K.
From the above link, to show how 8K buffers don't scale:
If lines typically have about 60 characters each, then the 10,000-line file has about 610,000 characters in it. Reading the file line-by-line with buffering only requires 75 system calls and 75 waits for the disk, instead of 10,001.
So for a 50,000,000 line file with 60 characters per line (including the newline at the end), with an 8K buffer, it's going to make 366,211 system calls to read a 2.8 GiB file. As an aside, you can confirm this behaviour by looking at the disk I/O read delta (in Windows at least; top in *nix surely shows something similar) in the task manager process list as your Perl program takes 10 minutes to read in a text file :)
Someone asked a question about increasing the Perl input buffer size on perlmonks, and someone replied there that you could increase the size of "$/", and thus increase the buffer size; however, from the perldoc:
Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer.
So I assume that this does not actually increase the buffer size that Perl uses to read ahead from the disk when using the typical:
while (<>) {
    # do something with $_ here
    ...
}
"line-by-line" idiom.
Now it could be that a different "read a record at a time and then parse it into lines" version of the above code would be faster in general, bypassing the underlying problem with the standard idiom of not being able to change the default buffer size (if that's indeed impossible): you could set the "record size" to anything you wanted, parse each record into individual lines, and hope that Perl does the right thing and ends up doing one system call per record. But that adds complexity, and all I really want to do is get an easy performance gain by increasing the buffer used in the above example to a reasonably large size, say 64K, or even tune that buffer size to the optimal size for long reads using a test script on my system, without needing extra hassle.
Things are much better in Java as far as straight-forward support for increasing the buffer size goes.
In Java, I believe the current default buffer size that java.io.BufferedReader uses is also 8192 bytes, although up-to-date references in the JDK docs are equivocal, e.g., the 1.5 docs say only:
The buffer size may be specified, or the default size may be accepted. The default is large enough for most purposes.
Luckily with Java you do not have to trust the JDK developers to have made the right decision for your application and can set your own buffer size (64K in this example):
import java.io.BufferedReader;
[...]
reader = new BufferedReader(new InputStreamReader(fileInputStream, "UTF-8"), 65536);
[...]
while (true) {
    String line = reader.readLine();
    if (line == null) {
        break;
    }
    /* do something with the line here */
    foo(line);
}
There's only so much performance you can squeeze out of parsing one line at a time, even with a huge buffer and modern hardware, and I'm sure there are ways to get every ounce of performance out of reading a file by reading big many-line records, breaking each into tokens, and then doing stuff with those tokens once per record, but that adds complexity and edge cases (although if there's an elegant solution in pure Java, using only the features present in JDK 1.5, that would be cool to know about). Increasing the buffer size in Perl would solve 80% of the performance problem for Perl, at least, while keeping things straightforward.
My question is:
Is there a way to adjust that buffer size in Perl for the above typical "line-by-line" idiom, similar how the buffer size was increased in the Java example?
You can affect the buffering if you're running on an OS that supports setvbuf; see the documentation for IO::Handle.
If you're using perl v5.10 or later then there is no need to explicitly create an IO::Handle object as described in the documentation, as all file handles are implicitly blessed into IO::Handle objects since that release.
use 5.010;
use strict;
use warnings;
use autodie;
use IO::Handle '_IOLBF';
open my $handle, '<:utf8', 'foo';
my $buffer;
$handle->setvbuf($buffer, _IOLBF, 0x10000);
while ( my $line = <$handle> ) {
...
}
No, there's not (short of recompiling a modified perl), but you can read the whole file into memory, then work line by line from that:
use File::Slurp;
my $buffer = read_file("filename");
open my $in_handle, "<", \$buffer;
while ( my $line = readline($in_handle) ) {
}
Note that perl before 5.10 defaulted to using stdio buffers in most places (but often cheating and accessing the buffers directly, not through the stdio library), but 5.10 and later defaults to its own perlio layer system. The latter seems to use a 4k buffer by default, but writing a layer that allows configuring this should be trivial (once you figure out how to write a layer: see perldoc perliol).
Warning: the following code has only been lightly tested. The code below is a first shot at a function that will let you process a file line by line (hence the function name) with a user-definable buffer size. It takes up to four arguments:
an open filehandle (default is STDIN)
a buffer size (default is 4k)
a reference to a variable to store the line in (default is $_)
an anonymous subroutine to call on the file (the default prints the line).
The arguments are positional with the exception that the last argument may always be the anonymous subroutine. Lines are auto-chomped.
Probable bugs:
may not work on systems where line feed is the end of line character
will likely fail when combined with a lexical $_ (introduced in Perl 5.10)
You can see from an strace that it reads the file with the specified buffer size. If I like how testing goes, you may see this on CPAN soon.
#!/usr/bin/perl

use strict;
use warnings;

use Scalar::Util qw/reftype/;
use Carp;

sub line_by_line {
    local $_;
    my @args = \(
        my $fh      = \*STDIN,
        my $bufsize = 4*1024,
        my $ref     = \$_,
        my $coderef = sub { print "$_\n" },
    );
    croak "bad number of arguments" if @_ > @args;

    for my $arg_val (@_) {
        if (reftype $arg_val eq "CODE") {
            ${$args[-1]} = $arg_val;
            last;
        }
        my $arg = shift @args;
        $$arg = $arg_val;
    }

    my $buf;
    my $overflow = '';
    OUTER:
    while (sysread $fh, $buf, $bufsize) {
        my @lines = split /(\n)/, $buf;
        while (@lines) {
            my $line = $overflow . shift @lines;
            unless (defined $lines[0]) {
                $overflow = $line;
                next OUTER;
            }
            $overflow = shift @lines;
            if ($overflow eq "\n") {
                $overflow = "";
            } else {
                next OUTER;
            }
            $$ref = $line;
            $coderef->();
        }
    }
    if (length $overflow) {
        $$ref = $overflow;
        $coderef->();
    }
}

my $bufsize = shift;

open my $fh, "<", $0
    or die "could not open $0: $!";

my $count;
line_by_line $fh, sub {
    $count++ if /lines/;
}, $bufsize;

print "$count\n";
I'm necroposting since this came up on this perlmonks thread
It's not possible to use setvbuf on perls using PerlIO, which has been the default since version 5.8.0. However, there is the PerlIO::buffersize module on CPAN that allows you to set the buffer size when opening a file:
open my $fh, '<:buffersize(65536)', $filename;
IIRC, you could also set the default for any new files by using this at the beginning of your script:
use open ':buffersize(65536)';
