Fastest way to find a subset of strings in another string? - java

I am decoding a byte file produced by Huffman encoding. I turn the bytes into a string and then look up the codes given by the Huffman tree. I have a hash table that maps each encoded value to the byte value from the original file. Here is my code.
for(int i = 0, j = 1; j <= encodedString.length(); j++){
    if(huffEncodeTable.get( encodedString.substring(i, j)) != null){
        decodedString.append(huffEncodeTable.get( encodedString.substring(i, j)));
        i = j;
    }
}
It's pretty simple: a loop that iterates over the whole string. The problem comes when the string is too large. With compressed files larger than 100KB it takes a really long time to process them, so I want to know if there is a faster way to do this, or if it would be better to store my encoded values in a structure other than the hashtable.
huffEncodeTable -> hashtable
encodedString -> String with the huffman values
decodedString -> The String that will represent the original bytes of the original file

A couple of suggestions:
Every time you append to a String, a new String is created. You should use StringBuilder instead. This is probably the main problem, as I see it.
Also, I'd use hashtable.containsKey instead of get to check for a key's existence. I doubt it impacts your performance much though.
You also might save a bit of time if you store the results of the call to substring, and so only call it once.
So, something like.
StringBuilder sb = new StringBuilder();
String currentString;
for(int i = 0, j = 1; j <= encodedString.length(); j++){
    currentString = encodedString.substring(i, j);
    if(huffEncodeTable.containsKey( currentString )){
        sb.append(huffEncodeTable.get( currentString ));
        i = j;
    }
}
return sb.toString(); //Or whatever you do with it.

Calling substring for many different lengths will really slow things down. Since Java 7 (update 6) each call copies the characters of the original string, creating two objects (the String and its backing array). You are much better off creating one substring and doing a search against a NavigableMap.
Using a NavigableMap will allow you to find the longest matching string in one operation and reduce the number of strings you need to store in the map.
Note: even so the size of the Map will be O(N^2) where N is the maximum string length you can look back, so you have to place a sensible limit on the size of N.
Note2: You will be lucky to get within a tenth of the speed of the built-in Huffman code (which is written for you, is standard and works). So if performance matters, use that.
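One possible reading of the NavigableMap suggestion is sketched below. It assumes the code table is prefix-free (as Huffman codes are), that the keys are non-empty strings of '0' and '1', and that the map uses natural String ordering (the TreeMap default). The class name FloorDecoder and the choice of Character values are hypothetical, picked only for illustration. Under those assumptions, one floorEntry call on a window of at most the longest code length finds the matching code, so each decoded symbol costs one substring instead of one per candidate length.

```java
import java.util.Map;
import java.util.NavigableMap;

class FloorDecoder {
    // Decodes a string of '0'/'1' characters using a prefix-free code table.
    static String decode(String bits, NavigableMap<String, Character> table) {
        // Longest code in the table: the lookup window never needs to be wider.
        int maxLen = 0;
        for (String code : table.keySet()) {
            maxLen = Math.max(maxLen, code.length());
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < bits.length()) {
            String window = bits.substring(i, Math.min(i + maxLen, bits.length()));
            // For a prefix-free table, the greatest key <= window is the only
            // key that can be a prefix of window.
            Map.Entry<String, Character> hit = table.floorEntry(window);
            if (hit == null || hit.getKey().isEmpty() || !window.startsWith(hit.getKey())) {
                break; // trailing padding bits or corrupt input (assumed policy: stop)
            }
            out.append(hit.getValue());
            i += hit.getKey().length();
        }
        return out.toString();
    }
}
```

With a TreeMap holding "0" -> 'a', "10" -> 'b' and "11" -> 'c', calling FloorDecoder.decode("010110", table) walks the input once and appends one symbol per lookup. Whether this beats the original loop on real data is something to measure, not assume.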

Related

What is more efficient? Storing a split string in an array, or calling the split method everytime you need it

What is going to be faster, storing a split string into an array and using this array within my program, or could I call the .split() method on the string whenever I needed an array to iterate through?
String main = "1,2,3,4,5,6";
String[] array = main.split(",");
vs
main.split(",");
whenever I need to use the input values?
I realise it will be far more readable if I store the result in an array. I'd just like to know whether calling .split() takes more computing time than reusing an array, since the split method returns a new array containing the split strings.
A simple example(?) loop to go with the question:
for (int i = array.length - 1; i >= 0; i--){}
vs
for (int i = main.split(",").length - 1; i >= 0; i--){}
It's a trade off, like most such things in programming. If you split just once and use the array directly from then on, you'll save processing time at the expense of memory. If you split every time, you'll save memory at the expense of processing time.
One is more time efficient, the other is more space efficient.
As you can see, split() returns an array, so behind the scenes main.split(",") scans the whole main String to extract the values every time you call it. So it's faster to call it only once and reuse the result.
I would prefer to split once and keep the tokens around regardless of the size of the array. If the resulting array is large, it will be more expensive to split each time. If it is small, the resultant storage is probably not going to be a factor.
If you're worried about storage for a large array, then the last time you split should also be a concern. To mitigate that, simply assign null to the array when you're done and let the garbage collector do its thing.
If I were going to iterate through an array of tokens, I would probably do it like this.
for (String token : main.split(",")) {
// do some stuff.
}
which creates the array once.
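To make the "creates the array once" point concrete, here is a small sketch that counts how often the split actually runs. The class SplitOnce and its call counter are hypothetical and exist only for this illustration. The for-each loop evaluates its expression a single time, while an indexed loop that calls split in its condition repeats the split on every check.

```java
class SplitOnce {
    static int calls = 0;

    // Wrapper around split that records how many times it was invoked.
    static String[] tokens(String s) {
        calls++;
        return s.split(",");
    }

    // for-each: the array expression is evaluated once, before iteration starts.
    static int callsInForEach(String s) {
        calls = 0;
        for (String token : tokens(s)) {
            // do some stuff with token
        }
        return calls;
    }

    // Indexed loop: the condition, and so the split, runs on every check.
    static int callsInIndexedLoop(String s) {
        calls = 0;
        for (int i = 0; i < tokens(s).length; i++) {
            // do some stuff
        }
        return calls;
    }
}
```

For "1,2,3,4,5,6" the first method reports one split, and the second reports one per loop check (six passing checks plus the final failing one).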

Most efficient way to create a string out of a list of characters then clear it

I'm trying to create a JSON-like format to load components from files and while writing the parser I've run into an interesting performance question.
The parser reads the file character by character, so I have a LinkedList as a buffer. After reaching the end of a key (:) or a value (,) the buffer has to be emptied and a string constructed of it.
My question is what is the most efficient way to do this.
My two best bets would be:
for (int i = 0, n = buff.size(); i < n; i++) // capture the size first: removeFirst() shrinks the list
    value += buff.removeFirst().toString();
and
value = new String((char[]) buff.toArray(new char[buff.size()]));
Instead of guessing this you should write a benchmark. Take a look at How do I write a correct micro-benchmark in Java to understand how to write a benchmark with JMH.
Your for loop would be inefficient, as you are concatenating 1-letter Strings using the + operator. This creates intermediate String objects and immediately throws them away. You should use StringBuilder if you plan to concatenate in a loop.
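The second option does not compile as written. Assuming buff is a LinkedList<Character>, toArray(T[]) only accepts arrays of reference types, so the list can be copied into a Character[] but not a char[], and String has no constructor that takes a Character[]. If you do call toArray, pass a zero-length array, as per the Arrays of Wisdom of the Ancients article, which dives into internal details of the JVM:
Character[] chars = buff.toArray(new Character[0]);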
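Since the parser only appends characters and then drains the whole buffer, another option worth benchmarking is to let a StringBuilder be the buffer itself. The sketch below is only an illustration: TokenBuffer, push, flush and the tokens driver are hypothetical names, and the driver assumes tokens end at ':' or ',' as described in the question.

```java
import java.util.ArrayList;
import java.util.List;

class TokenBuffer {
    private final StringBuilder buf = new StringBuilder();

    // Appends one character read from the file.
    void push(char c) {
        buf.append(c);
    }

    // Returns the buffered characters as a String and empties the buffer.
    String flush() {
        String token = buf.toString();
        buf.setLength(0); // keeps the allocated capacity for the next token
        return token;
    }

    // Hypothetical driver: cuts a token at every ':' or ','.
    static List<String> tokens(String input) {
        TokenBuffer buffer = new TokenBuffer();
        List<String> result = new ArrayList<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == ':' || c == ',') {
                result.add(buffer.flush());
            } else {
                buffer.push(c);
            }
        }
        return result;
    }
}
```

This avoids boxing every char into a Character and needs no intermediate array, but as with the other options, a JMH benchmark is the way to confirm it helps.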

JAVA string how can i implement the length method

My roommate's teacher gave them an assignment to implement the string length method in Java.
We have thought of two ways:
1. Check the element,and when get the out of bounds exception,it means the end of string,we catch this exception,then we can get the length.
2. Every time a string is pass to calculate the length,we add the special character to the end of it,it can be '\0',or "A",etc..
We both think these two ways could complete the assignment, but they are bad (or at least using an exception this way is a bad habit). It's not elegant.
We have googled it, but didn't find what we wanted.
Something like this?
int i = 0;
for (char ch : string.toCharArray()) {
i++;
}
The pseudo-code you probably want is:
counter = 0
for(Character c in string) {
counter = counter + 1
}
This requires you to find a way to turn a Java String into an array of characters.
Likely the teacher is trying to make his or her students think, and will be satisfied with creative solutions that solve the problem.
None of these solutions would be used in the real world, because we have the String.length() method. But the creative, problem-solving process you're learning would be used in real development.
"1. Check the element,and when get the out of bounds exception,it means the end of string,we catch this exception,then we can get the length."
Here, you're causing an exception to be thrown in the normal case. A common style guideline is for exceptions to be thrown only in exceptional cases. Compared to normal flow of control, throwing an exception can be more expensive and more difficult for humans to follow.
That said, this one of your ideas has a potential advantage for very long strings. All of the posted answers so far run in linear time and space. The time and/or additional space they take to execute is proportional to the length of the string. With this approach, you could implement an O(log n) search for the length of the string.
Linear or not, it's possible that the teacher would find this approach acceptable for its creativity. Avoid it if the teacher has communicated the idea that exceptions are only for exceptional cases.
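The O(log n) idea mentioned above could look roughly like the following sketch. It is an illustration of the exception-probing approach, not a recommendation: LengthProbe and inBounds are hypothetical names, and the code relies only on charAt throwing an IndexOutOfBoundsException for an invalid index. It first doubles a probe index until it falls off the end, then binary-searches between the last valid index and the first invalid one.

```java
class LengthProbe {
    // True if index i is a valid position in s, detected via the exception.
    static boolean inBounds(String s, int i) {
        try {
            s.charAt(i);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    static int length(String s) {
        if (!inBounds(s, 0)) {
            return 0; // empty string
        }
        int lo = 0; // always a valid index
        int hi = 1; // candidate for the first invalid index
        // Phase 1: double hi until it is out of bounds (guarding against overflow).
        while (inBounds(s, hi)) {
            lo = hi;
            hi = (hi > Integer.MAX_VALUE / 2) ? Integer.MAX_VALUE : hi * 2;
        }
        // Phase 2: binary search; lo stays valid, hi stays invalid.
        while (hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            if (inBounds(s, mid)) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        return hi; // the first invalid index equals the length
    }
}
```

Both phases take a logarithmic number of probes, but each failed probe throws and catches an exception, so for short strings a plain loop may well be faster.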
"2. Every time a string is pass to calculate the length,we add the special character to the end of it,it can be '\0',or "A",etc.."
This idea has a flaw. What happens if the string contains your special character?
EDIT
A simple implementation would be to get a copy of the underlying char array with String.toCharArray(), then simply take its length. Unlike your ideas, this is not an in-place approach - making the copy requires additional space in memory.
String s = "foo";
int length = s.toCharArray().length;
Try this
public static int Length(String str) {
str = str + '\0';
int count = 0;
for (int i = 0; str.charAt(i) != '\0'; i++) {
count++;
}
return count;
}
What about:
"your string".toCharArray().length

Split an array of common English words into separate lists/arrays based on word length in Java

I'm trying to search an array of common English words to see if a specific word is contained in it, based on a text file. Since this array has >700,000 words and around 1000 words each need to be checked against it multiple times, I thought it would be more efficient to separate the words into separate arrays or lists based on length. Is there an easy way to do this without using a switch or lots of if statements? Like so:
for(int i = 0; i < commonWordArray.length; i++) {
    if(commonWordArray[i].length() == 2) {
        twoLetterList.add(commonWordArray[i]);
    } else if(commonWordArray[i].length() == 3) {
        threeLetterList.add(commonWordArray[i]);
    } else if(commonWordArray[i].length() == 4) {
        fourLetterList.add(commonWordArray[i]);
    }
    ...etc
}
Then doing the same thing when checking the words:
for(int i = 0; i < checkWords.length; i++) {
    if(checkWords[i].length() == 2) {
        if(twoLetterList.contains(checkWords[i])) {
            ...etc
}
Step 1
Create word buckets.
ArrayList<ArrayList<String>> buckets = new ArrayList<>();
// One bucket per length 0..maxWordLength, so buckets.get(word.length()) is always in range.
for(int i = 0; i <= maxWordLength; i++) {
    buckets.add(new ArrayList<String>());
}
Step 2
Add words to your buckets.
buckets.get(word.length()).add(word);
This approach has the downside that some of your buckets may go unused. This is not an issue if you are only filtering common English words, as they do not exceed 30 characters in length. Creating 10-15 extra lists is a trivial overhead for a computer. The longest uncommon but non-technical word is 183 characters. Technical words exceed 180,000 characters, by which point this approach is clearly not practical.
The upside of this approach is that ArrayList.get() and ArrayList.add() both run in constant (O(1)) time.
Use a List<Set<String>> sets. That is, given a String word, first find the proper set (Set<String> set = sets.get(word.length())), creating the set and extending the list if needed. Then just do a set.add(word). Done!
Edit/Hint: a (good) programmer should be lazy - if you need to do/write the same thing twice, you're doing something wrong.
Assuming you've got memory for it (which your current approach relies on), why not just a single Set<String>? Simpler, faster.
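A minimal sketch of the single-set approach, assuming the words are already loaded into arrays as in the question. CommonWordFilter and keepCommon are hypothetical names used only here. A HashSet lookup already hashes the whole word, so bucketing by length first adds little.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CommonWordFilter {
    // Returns the entries of checkWords that appear in commonWords, in order.
    static List<String> keepCommon(String[] commonWords, String[] checkWords) {
        // Built once: one pass over the ~700k words.
        Set<String> common = new HashSet<>(Arrays.asList(commonWords));
        List<String> found = new ArrayList<>();
        for (String word : checkWords) {
            if (common.contains(word)) { // expected constant-time lookup
                found.add(word);
            }
        }
        return found;
    }
}
```

If the same set is queried repeatedly, build it once and keep it around instead of rebuilding it per call.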
If you want to use multiple strings to search, you may want to try something like the Aho Corasick algorithm.
Alternatively, you may want to turn the problem around and check whether each string from the 700k array is in the 1k array. For this you won't have memory issues (imho), and you can do it with a simple dictionary (balanced tree), so you'd need about 700k × log2(1000) comparisons.
Use a Trie, which is a memory-efficient storage mechanism which excels at storing words and checking for whether they exist or not.
Implementing one on your own is a fun exercise, or look at existing implementations.
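For anyone trying the exercise, a bare-bones trie might start out like the sketch below. WordTrie is a hypothetical name, and using a HashMap per node is a simplification chosen for brevity: it is not the most memory-efficient layout, so measure before assuming it beats a plain HashSet.

```java
import java.util.HashMap;
import java.util.Map;

class WordTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isWord; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    // Inserts a word, creating child nodes along its path as needed.
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.next.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    // True only if the exact word was added (a mere prefix does not count).
    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.next.get(word.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }
}
```

Lookup cost depends on the length of the word being checked, not on how many words are stored.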

Can I optimize this code?

I am trying to retrieve the data from the table and convert each row into CSV format like
s12, james, 24, 1232, Salaried
The code below does the job, but takes a long time for tables with more than 100,000 rows.
Please advise on optimizing technique:
while(rset1.next()!=false) {
sr=sr+"\n";
for(int j=1;j<=rsMetaData.getColumnCount();j++)
{
if(j< 5)
{
sr=sr+rset1.getString(j).toString()+",";
}
else
sr=sr+rset1.getString(j).toString();
}
}
Two approaches, in order of preference:
Stream the output
PrintWriter csvOut = ... // Construct a writer from an OutputStream, say to a file
while (rs.next())
csvOut.println(...) // Write a single line
(note that you should ensure that your Writer / OutputStream is buffered, although many are by default)
Use a StringBuilder
StringBuilder sb = new StringBuilder();
while (rs.next())
sb.append(...) // Write a single line
The idea here is that appending Strings in a loop is a bad idea. Imagine that you have a string. In Java, Strings are immutable. That means that to append to a string you have to copy the entire string and then write more to the end. Since you are appending things a little bit at a time, you will have many many copies of the string which aren't really useful.
If you're writing to a File, it's most efficient just to write directly out with a stream or a Writer. Otherwise you can use the StringBuilder which is tuned to be much more efficient for appending many small strings together.
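The streaming approach could be fleshed out along these lines. This is a sketch under assumptions: CsvExport, escape and write are hypothetical names, the quoting rule (quote a field only when it contains a comma, a double quote or a line break) is one common CSV convention, not something the question specifies, and SQL NULL is written as an empty field. Only standard java.sql and java.io calls are used.

```java
import java.io.IOException;
import java.io.Writer;
import java.sql.ResultSet;
import java.sql.SQLException;

class CsvExport {
    // Quotes a field only when it contains a comma, a double quote or a line break.
    static String escape(String field) {
        if (field == null) {
            return ""; // getString returns null for SQL NULL
        }
        boolean needsQuotes = field.indexOf(',') >= 0 || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0 || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Writes every row straight to the writer; nothing accumulates in memory.
    static long write(ResultSet rs, Writer out) throws SQLException, IOException {
        int columns = rs.getMetaData().getColumnCount(); // read once, not per row
        long rows = 0;
        while (rs.next()) {
            for (int j = 1; j <= columns; j++) {
                if (j > 1) {
                    out.write(',');
                }
                out.write(escape(rs.getString(j)));
            }
            out.write('\n');
            rows++;
        }
        return rows;
    }
}
```

A caller would typically wrap the target in a BufferedWriter (for example one obtained from java.nio.file.Files.newBufferedWriter) and close it in a try-with-resources block.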
I'm no Java expert, but I think it's always bad practice to use something like getColumnCount() in a conditional check. This is because after each loop, it runs that function to see what the column count is, instead of just referencing a static number. Instead, set a variable equal to that number and use the variable to compare against j.
You might want to use a StringBuilder to build the string, that's much more efficient when you're doing a lot of concatenation. Also if you have that much data, you might want to consider writing it directly to wherever you're going to put it instead of building it in memory at first, if that's a file or a socket, for example.
StringBuilder sr = new StringBuilder();
int columnCount =rsMetaData.getColumnCount();
while (rset1.next()) {
sr.append('\n');
for (int j = 1; j <= columnCount; j++) {
sr.append(rset1.getString(j));
if (j < 5) {
sr.append(',');
}
}
}
As a completely different, but undoubtedly the most optimal alternative, use the DB-provided export facilities. It's unclear which DB you're using, but as per your question history you seem to be doing a lot with Oracle. In this case, you can export a table into a CSV file using UTL_FILE.
See also:
Generating CSV files using Oracle
Stored procedure example on Ask Tom
As the other answers say, stop appending to a String. In Java, String objects are immutable, so each append must do a full copy of the string, turning this into an O(n^2) operation.
The other big slowdown is fetch size. By default, the driver may fetch only a few rows per round trip (possibly just one; Oracle's JDBC driver defaults to 10). Even if a round trip takes 1ms, fetching one row at a time limits you to a thousand rows per second. A remote database, even on the same network, will be much worse. Try calling setFetchSize(1000) on the Statement. Beware that setting the fetch size too big can cause out of memory errors with some database drivers.
I don't believe minor code changes are going to make a substantive difference. I'd surely use a StringBuffer however.
He's going to be reading a million rows over a wire, assuming his database is on a separate machine. First, if performance is unacceptable, I'd run that code on the database server and clip the network out of the equation. If it's the sort of code that gets run once a week as a batch job that may be ok.
Now, what are you going to do with the StringBuffer or String once it is fully loaded from the database? We're looking at a String that could be 50 Mbyte long.
This should be an iota faster since it removes the unneeded (j<5) check.
StringBuilder sr = new StringBuilder();
int columnCount = rsMetaData.getColumnCount();
while (rset1.next()) {
    for (int j = 1; j < columnCount; j++) {
        sr.append(rset1.getString(j)).append(",");
    }
    // I suspect the 'if (j<5)' really meant, "if we aren't on the last
    // column then tack on a comma." So we always tack it on above and
    // write the last column and a newline now.
    sr.append(rset1.getString(columnCount)).append("\n");
}
Another answer is to change the select so it returns a comma-separated string. Then we read the single-column result and append it to the StringBuilder.
I forget the syntax now, but something like:
select column1 || ',' || column2 || ',' ... from table;
Now we don't need the inner loop or the comma concatenation business.
StringBuilder sr = new StringBuilder();
while (rset1.next()) {
    sr.append(rset1.getString(1)).append("\n");
}
