Efficient pattern search from a buffer in Java?

I've been trying to figure out how to search a file for a byte pattern through a buffer without reading any byte from the file twice. I've chosen to implement Runnable so I can divide the task across concurrent threads. My code looks something like this:
// constructor: initializes local variables.
public BytePatternSearcher(RandomAccessFile raFile, byte[] pattern, int bufferSize, long startPos, long endPos);

public void run()
{
    while (amountToRead - raFile.read(buffer) > 0)
    {
        // search code
    }
}
Now, I've hit a snag: my algorithm works in simple cases, but not in complex ones. I assumed that no pattern starts inside another match being checked, that the pattern is shorter than the buffer, and so on, limiting myself to one scan at a time while iterating through the file. Naturally, this is not a robust solution. Suppose the pattern is 'xxxxx' (length 5), the file is 'xxxxxxyxxxxxx', and the buffer size is 2 (x and y stand for particular byte values). The pattern occurs 4 times, with overlaps, and every candidate match spans more than two buffer reads.
How do I go about working things out without checking the same byte more than once for all cases?

Wikipedia has an entry for the Boyer–Moore string search algorithm, which also contains some sample implementations.
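Independent of the matching algorithm, the buffer-boundary problem can be handled by carrying the last pattern.length - 1 bytes over to the next read, so no byte is ever read from the file twice and matches spanning buffer boundaries are still found. A minimal sketch (naive matching, not the poster's code; the countMatches name and counting behaviour are illustrative):

import java.io.IOException;
import java.io.RandomAccessFile;

public static int countMatches(RandomAccessFile raFile, byte[] pattern, int bufferSize) throws IOException
{
    byte[] window = new byte[pattern.length - 1 + bufferSize];
    int kept = 0; // bytes carried over from the previous read
    int count = 0;
    int read;
    while ((read = raFile.read(window, kept, bufferSize)) > 0)
    {
        int valid = kept + read;
        // check every position where the full pattern fits
        for (int i = 0; i + pattern.length <= valid; i++)
        {
            int j = 0;
            while (j < pattern.length && window[i + j] == pattern[j]) j++;
            if (j == pattern.length) count++;
        }
        // keep the tail; it is shorter than the pattern, so nothing is counted twice
        kept = Math.min(pattern.length - 1, valid);
        System.arraycopy(window, valid - kept, window, 0, kept);
    }
    return count;
}

With the example above ('xxxxx' in 'xxxxxxyxxxxxx', buffer size 2), this finds all 4 overlapping occurrences while reading each file byte exactly once.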

Related

How to implement a LIMITLESS String and StringBuilder

Java's String and StringBuilder are limited to a length of Integer.MAX_VALUE. In most use cases this is more than adequate, but I have just encountered a use case in which I need to handle and return a String of more than 2,684,354,560 characters.
This is required for capturing an incoming stream of characters, in which I do not have control over the size of the stream, nor do I have the option of re-architecting the solution. What I can do at most is replace a method in an existing module, or introduce a new class that replaces String and StringBuilder in that method.
As a temporary workaround, to prevent the OutOfMemoryError thrown when the StringBuilder length exceeds Integer.MAX_VALUE, I implemented the following safeAppend():
private void safeAppend(StringBuilder ret, String current) {
    if ((long) ret.length() + current.length() > Integer.MAX_VALUE) {
        String truncateLeadingPart;
        if (current.length() < ret.length()) {
            truncateLeadingPart = ret.substring(current.length());
        } else {
            int startIndex = (int) ((long) ret.length() + current.length() - Integer.MAX_VALUE);
            truncateLeadingPart = ret.substring(Math.min(ret.length(), startIndex));
        }
        ret.setLength(0);
        ret.append(truncateLeadingPart);
    }
    ret.append(current);
}
This method truncates the leading part and always keeps the trailing 2,147,483,647 characters. However, this workaround/safeguard proved inadequate for the task at hand, because we cannot afford to lose any data captured from the stream.
What is a recommended approach to implementing a String and StringBuilder that are NOT limited by an int max size?
A limit of Long.MAX_VALUE would be sufficient. Alternatively, a single LimitlessString class that can be appended to as efficiently as StringBuilder would also be adequate.
You won't be able to use String or StringBuffer, as the 32-bit length is baked into their interfaces. That's also true of arrays and NIO buffers, unfortunately (there have been proposals to fix this, but nothing exists at the time of writing).
Obviously streaming or using random file access would be a good solution if that is possible.
You are left with implementing something else. Ropes use a binary tree to represent the composition of string parts. More common is an array of arrays, or, for friendlier GC behaviour, an array of directly allocated (or memory-mapped file) NIO buffers. Someone remarked a few years ago that this area of computer science still has scope for more PhDs.
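A minimal sketch of the array-of-arrays idea, assuming only long-indexed append and read access are needed (the class and its methods are illustrative, not an existing API):

import java.util.ArrayList;
import java.util.List;

public class LimitlessString {
    private static final int CHUNK = 1 << 20; // 1M chars per chunk
    private final List<char[]> chunks = new ArrayList<>();
    private long length = 0;

    public void append(String s) {
        int copied = 0;
        while (copied < s.length()) {
            int offset = (int) (length % CHUNK);
            if (offset == 0) chunks.add(new char[CHUNK]); // grow by whole chunks
            int n = Math.min(CHUNK - offset, s.length() - copied);
            s.getChars(copied, copied + n, chunks.get(chunks.size() - 1), offset);
            copied += n;
            length += n;
        }
    }

    public char charAt(long index) {
        return chunks.get((int) (index / CHUNK))[(int) (index % CHUNK)];
    }

    public long length() { return length; }
}

Each chunk stays well under the array size limit, appends never copy existing data (unlike StringBuilder resizes), and cold chunks could be GZIP-compressed or spilled to memory-mapped files, as the other answers suggest.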
Well, if you really, really need to extend the String/StringBuilder classes in such a way, you have to either create a new class that doesn't extend String/StringBuilder, because they are marked final, or change the JRE binaries to make String/StringBuilder non-final. Either way, both solutions are poor, will lead to a huge support effort, and will generate a lot of WTFs in the future.
String and StringBuilder are final classes and cannot be extended. StringWriter would have been a better candidate.
Nice options would have been:
not using two-byte chars, but bytes (a CharBuffer on top of a ByteBuffer);
compressing (GZIPOutputStream);
(as you did) periodically removing a huge chunk to a file or the like.
As an aside, newer Java versions support single-byte string encodings internally; that would not allow more characters, but would use half the memory.
You'll still run into resizing on appending, so the system will slow down.

Most efficient way to create a string out of a list of characters then clear it

I'm trying to create a JSON-like format to load components from files, and while writing the parser I've run into an interesting performance question.
The parser reads the file character by character, so I have a LinkedList as a buffer. After reaching the end of a key (:) or a value (,), the buffer has to be emptied and a string constructed from it.
My question is: what is the most efficient way to do this?
My two best bets would be:
for (int i = 0, n = buff.size(); i < n; i++)
    value += buff.removeFirst().toString();
and
value = new String((char[]) buff.toArray(new char[buff.size()]));
Instead of guessing this, you should write a benchmark. Take a look at "How do I write a correct micro-benchmark in Java?" to understand how to write a benchmark with JMH.
Your for loop would be inefficient, as you are concatenating 1-letter Strings using the + operator. This creates and immediately throws away intermediate String objects. You should use StringBuilder if you plan to concatenate in a loop.
The second option should use a zero-length array, as per the Arrays of Wisdom of the Ancients article, which dives into the internal details of the JVM:
value = new String((char[]) buff.toArray(new char[0]));
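Note that neither toArray call compiles as written: Collection.toArray takes an object array, so a LinkedList<Character> can yield a Character[] but never a primitive char[]. A minimal sketch of the StringBuilder route suggested above, assuming buff is a LinkedList<Character>:

StringBuilder sb = new StringBuilder(buff.size()); // pre-size to avoid resizing
for (char c : buff)
    sb.append(c);
String value = sb.toString();
buff.clear(); // empty the buffer for the next key or value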

Reading large files for a simulation (Java crashes with out of heap space)

For a school assignment, I need to create a simulation of memory accesses. First I need to read one or more trace files, each containing memory addresses, one per access. Example:
0 F001CBAD
2 EEECA89F
0 EBC17910
...
Here the first integer indicates the access type (read/write, etc.) and the hex memory address follows. With this data I am supposed to run a simulation, so my idea was to parse the data into an ArrayList<Trace> (for now I am using Java), with Trace being a simple class containing the memory address and the access type (just a String and an integer). After that I plan to loop through these lists to process them.
The problem is that it runs out of heap space even during parsing. Each trace file is ~200 MB and I have up to 8 of them, meaning a minimum of ~1.6 GB of data I am trying to "cache". What baffles me is that I am only parsing 1 file and Java is already using 2 GB according to my task manager...
What is a better way of doing this?
A code snippet can be found at Code Review.
The answer I gave on Code Review is the same one you should use here.
But, because duplication appears to be OK, I'll duplicate the answer here.
The issue is almost certainly in the structure of your Trace class and its memory efficiency. You should ensure that instrType and hexAddress are stored in memory-efficient structures. instrType appears to be an int, which is good; just make sure it is declared as an int in the Trace class.
The more likely problem is the size of the hexAddress String. You may not realise it, but Strings are notorious for "leaking" memory. In this case, you have a line and you think you are just getting the hex string from it... but in reality, the hex string holds on to the entire line (this was the behaviour of String.substring before Java 7u6, when substrings shared the parent string's char[]). Yeah, really. For example, look at the following code:
import java.util.StringTokenizer;

public class SToken {
    public static void main(String[] args) {
        StringTokenizer tokenizer = new StringTokenizer("99 bottles of beer");
        int instrType = Integer.parseInt(tokenizer.nextToken());
        String hexAddr = tokenizer.nextToken();
        System.out.println(instrType + hexAddr);
    }
}
Now set a breakpoint in your IDE (I use Eclipse), run it, and you will see that hexAddr contains a char[] array for the entire line, with an offset of 3 and a count of 7.
Because of the way that String.substring and other constructs work, short strings can hold on to huge amounts of memory (in theory that memory is shared with other strings). As a consequence, you are essentially storing the entire file in memory!
At a minimum, you should change your code to:
hexAddr = new String(tokenizer.nextToken().toCharArray());
But even better would be:
long hexAddr = parseHexAddress(tokenizer.nextToken());
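The answer leaves parseHexAddress undefined; a plausible one-liner, assuming the addresses fit in a long:

static long parseHexAddress(String token) {
    return Long.parseLong(token, 16); // e.g. "F001CBAD" -> 4026518445L
}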
Like rolfl, I answered your question on Code Review. The biggest issue, to me, is reading everything into memory first and only then processing it. You need to read a fixed amount, process it, and repeat until finished.
Try using java.nio.ByteBuffer instead of java.util.ArrayList<Trace>. It should also reduce the memory usage.
import java.nio.ByteBuffer;

class TraceList {
    private static final int BYTES_PER_TRACE = 5; // 1-byte type + 4-byte address
    private final ByteBuffer buffer;

    public TraceList(int capacity) {
        // allocate the byte buffer up front
        buffer = ByteBuffer.allocateDirect(capacity * BYTES_PER_TRACE);
    }

    public void put(byte operationType, int address) {
        // put data into the byte buffer
        buffer.put(operationType);
        buffer.putInt(address);
    }

    public Trace get(int index) {
        // read data from the byte buffer by index
        byte type = buffer.get(index * BYTES_PER_TRACE);
        int address = buffer.getInt(index * BYTES_PER_TRACE + 1);
        return new Trace(type, address);
    }
}
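Packed this way, each access costs exactly 5 bytes, whereas a Trace object holding a String address carries an object header, a reference, and the String's backing char[], easily an order of magnitude more per entry, which is likely where much of the observed 2 GB goes.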

Existing solution to "smart" initial capacity for StringBuilder

I have a piece of logging- and tracing-related code which is called often throughout the code base, especially when tracing is switched on. StringBuilder is used to build the Strings. The strings have a reasonable maximum length, I suppose on the order of hundreds of chars.
Question: Is there an existing library to do something like this:
// in reality, StringBuilder is final,
// so a delegating version would be needed instead,
// which is quite a big class because of all the append() overloads
public class SmarterBuilder extends StringBuilder {
    private final AtomicInteger capRef;

    SmarterBuilder(AtomicInteger capRef) {
        // optionally save memory at the expense of worst-case resizes:
        // super(capRef.get() * 3 / 4);
        super(capRef.get());
        this.capRef = capRef;
    }

    public void syncCap() {
        // call when the string is fully built
        int cap;
        do {
            cap = capRef.get();
            if (cap >= length()) break;
        } while (!capRef.compareAndSet(cap, length()));
    }
}
To take advantage of this, my logging-related class would have a shared capRef variable with suitable scope.
(Bonus Question: I'm curious, is it possible to do syncCap() without looping?)
Motivation: I know the default capacity of StringBuilder is always too small here. I could (and currently do) throw in an ad-hoc initial capacity value of 100, which still results in a resize in some number of cases, but not always. However, I do not like magic numbers in the source code, and this feature is a case of "optimize once, use in every project".
Make sure you do the performance measurements to verify that you really are getting some benefit for the extra work.
As an alternative to a StringBuilder-like class, consider a StringBuilderFactory. It could provide two static methods: one to get a StringBuilder, and one to be called when you finish building a string. You would pass the StringBuilder as an argument, and it would record the length. The getStringBuilder method would use the statistics recorded by the other method to choose the initial size.
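A minimal sketch of that factory idea (the class and both method names are illustrative, not an existing library); here the recorded statistic is simply the largest length seen so far:

import java.util.concurrent.atomic.AtomicInteger;

final class StringBuilderFactory {
    private static final AtomicInteger maxLen = new AtomicInteger(16);

    static StringBuilder getStringBuilder() {
        return new StringBuilder(maxLen.get());
    }

    static void finished(StringBuilder sb) {
        int len = sb.length();
        int cap;
        do {
            cap = maxLen.get();
            if (cap >= len) return;
        } while (!maxLen.compareAndSet(cap, len));
    }
}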
There are two ways you could avoid looping in syncCap:
Synchronize.
Ignore failures.
The argument for ignoring failures in this situation is that you only need a random sampling of the actual lengths. If another thread updates at the same time, it is recording an equally up-to-date view of the string lengths anyway.
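For example, "ignore failures" collapses syncCap to a single attempt (a sketch against the SmarterBuilder above):

public void syncCap() {
    int cap = capRef.get();
    if (cap < length()) {
        capRef.compareAndSet(cap, length()); // losing a race is fine here
    }
}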
You could store the length of each built string in a statistics array, run your app, and at shutdown take the 90th percentile of your string lengths (sort all length values and take the value at array pos = sortedLengths.size() * 0.9).
That way you create an initial StringBuilder size into which 90% of your strings will fit.
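A minimal sketch of that computation (names illustrative):

import java.util.Collections;
import java.util.List;

static int percentile90(List<Integer> lengths) {
    Collections.sort(lengths);
    // index size * 0.9 is always in range for a non-empty list
    return lengths.get((int) (lengths.size() * 0.9));
}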
Update
The value could be hard-coded (like Java does for the value 10 in ArrayList), read from a config file, or calculated automatically in a test phase. But the percentile calculation is not free, so it is best to run your project for a while, measure the 90th percentile on the fly inside the SmartBuilder, output it from time to time, and later change the property file to use that value.
That way you get optimal results for each project.
Or, going one step further: let your smart builder update the value in the config file from time to time.
But all this is only worth the effort for data with some millions of entries, like digital road maps, etc.

Convert ASCII byte[] to String

I am trying to pass a byte[] containing ASCII characters to log4j, to be logged into a file using the obvious representation. When I simply pass in the byte[], it is of course treated as an object and the logs are pretty useless. When I convert it to a String using new String(byte[] data), the performance of my application is halved.
How can I pass them in efficiently, without incurring the approximately 30µs time penalty of converting them to Strings?
Also, why does it take so long to convert them?
Thanks.
Edit
I should add that I am optimising for latency here, and yes, 30µs does make a difference! Also, these arrays vary from ~100 bytes all the way up to a few thousand.
ASCII is one of the few encodings that can be converted to/from UTF-16 with no arithmetic or table lookups, so it's possible to convert manually:
String convert(byte[] data) {
    StringBuilder sb = new StringBuilder(data.length);
    for (int i = 0; i < data.length; ++i) {
        if (data[i] < 0) throw new IllegalArgumentException();
        sb.append((char) data[i]);
    }
    return sb.toString();
}
But make sure it really is ASCII, or you'll end up with garbage.
What you want to do is delay processing of the byte[] until log4j decides that it actually wants to log the message. That way you could log it at DEBUG level while testing, for example, and then disable it in production. For example, you could:
final byte[] myArray = ...;
Logger.getLogger(MyClass.class).debug(new Object() {
    @Override
    public String toString() {
        return new String(myArray);
    }
});
Now you don't pay the speed penalty unless you actually log the data, because the toString method isn't called until log4j decides it'll actually log the message!
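(On newer Log4j 2 releases the same lazy evaluation is available without the anonymous class, via the logger methods that accept lambda/Supplier arguments.)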
Now, I'm not sure what you mean by "the obvious representation", so I've assumed you mean converting to a String by reinterpreting the bytes in the platform default character encoding. If you are dealing with binary data, that is obviously worthless; in that case I'd suggest using Arrays.toString(byte[]) to create a formatted string along the lines of
[54, 23, 65, ...]
If your data is in fact ASCII (i.e. 7-bit data), then you should be using new String(data, "US-ASCII") instead of depending on the platform default encoding. That may be faster than interpreting it in your platform default encoding (which could be UTF-8, which requires more introspection).
You could also speed this up by avoiding the charset lookup on every call: cache the Charset instance and call new String(data, charset) instead.
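A minimal sketch of that caching (StandardCharsets requires Java 7+; on older JDKs, call Charset.forName("US-ASCII") once and reuse the result):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

private static final Charset ASCII = StandardCharsets.US_ASCII;

String decode(byte[] data) {
    return new String(data, ASCII); // no per-call charset lookup
}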
Having said that, it's been a very, very long time since I've seen real ASCII data in a production environment.
Halved performance? How large is this byte array? If it's, for example, 1 MB, then there are certainly more factors to take into account than just "converting" from bytes to chars (which should be fast enough by itself). Writing 1 MB of data instead of "just" 100 bytes (which byte[].toString() may generate) to a log file is obviously going to take some time. The disk file system is not as fast as RAM.
You'll need to change the string representation of the byte array, perhaps to something more sensible, e.g. the name associated with it (a filename?), its length, and so on. After all, what does that byte array actually represent?
Edit: I don't remember seeing the "approximately 30µs" phrase in your question; maybe you edited it in within 5 minutes of asking. But this is micro-optimization, and it should certainly not cause "halved performance" in general, unless you write the logs a million times per second (and even then: why would you want to do that? Aren't you overusing logging?).
Take a look here: Faster new String(bytes, cs/csn) and String.getBytes(cs/csn)
