ByteArray - string comparison, without using bytearray.tostring() - java

I am using Java for MapReduce programming.
I have a byte array with 10 MB of data in it. I want to compare each byte to see whether it is a space. My basic purpose is to get each word in this byte array by splitting on spaces (that is my idea; any other suggestion is welcome). I can certainly do it with a String, i.e. first converting the whole byte array to a String, then comparing and calling substring to get each word, but this duplicates the data. I don't want anything that creates a duplicate, like StringBuilder, StringTokenizer, or substring.
I want each word in the byte array, but without any duplicates, since I am doing in-memory computing and duplicates make me run out of resources. Any suggestion or idea on how to proceed would be appreciated.

If you just want to avoid creating a String for the whole array (and strings for the words are OK), you could do
HashSet<String> words = new HashSet<String>();
int pos = 0;
int len = byteArray.length;
for (int i = 0; i <= len; i++) {
    // A word ends at a space or at the end of the array
    if (i == len || byteArray[i] == ' ') {
        // i > pos skips empty runs but still keeps one-letter words
        if (i > pos) {
            String word = new String(byteArray, pos, i - pos, "UTF-8");
            words.add(word);
        }
        pos = i + 1;
    }
}
p.s. Your comment seems to suggest that you read the byte array from a file. Why not avoid that and read the words from the file directly? If you can use a newline (\n) as the delimiter (instead of a space), you could just do something like this:
HashSet<String> words = new HashSet<String>();
// Assumes the file path is the first command-line argument
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(args[0]), "UTF-8"));
while (true) {
    String word = reader.readLine();
    if (word == null) {
        break;
    }
    words.add(word);
}
reader.close();
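If even per-word String objects are too costly, one option is to record only where each word starts and ends and leave the bytes where they are. The following is a minimal sketch under stated assumptions, not part of the answer above: the class name WordSpans and its methods are hypothetical, and it assumes words are separated by the single-byte ASCII space, which holds for ASCII and UTF-8 data.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: locates words in a byte array without copying any bytes.
class WordSpans {

    // Returns {start, endExclusive} index pairs, one per word.
    // Assumes the delimiter is the single ASCII space byte (0x20).
    static List<int[]> findWords(byte[] data) {
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            if (i == data.length || data[i] == ' ') {
                if (i > start) {             // skip empty runs between spaces
                    spans.add(new int[] {start, i});
                }
                start = i + 1;
            }
        }
        return spans;
    }

    // Compares one word against an expected byte sequence, in place.
    static boolean spanEquals(byte[] data, int[] span, byte[] expected) {
        int len = span[1] - span[0];
        if (len != expected.length) {
            return false;
        }
        for (int k = 0; k < len; k++) {
            if (data[span[0] + k] != expected[k]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = "cad bed  abed".getBytes(StandardCharsets.UTF_8);
        byte[] target = "bed".getBytes(StandardCharsets.UTF_8);
        for (int[] span : findWords(data)) {
            System.out.println(span[0] + ".." + span[1]
                    + " matches bed: " + spanEquals(data, span, target));
        }
    }
}
```

The list of index pairs still costs some memory per word. If that matters, the same loop can hand each span to the processing step as soon as it is found instead of collecting them.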

Related

How to generate a matrix of frequency of consecutive characters from txt file in java?

I have a large txt file(2GB). I read the whole txt file character by character to find out the frequency of each character in the whole txt file using the following code snippet.
BufferedReader reader = new BufferedReader(
new InputStreamReader(
new FileInputStream(file),
Charset.forName("UTF-8")));
int c;
while ((c = reader.read()) != -1) {
char ch = (char) c;
// rest of the code
}
Now I need to generate a matrix with the frequency of consecutive characters.
For example, how many times a character 'b' exists after character 'a'(consecutive,immediate character) and vice versa.
Suppose, I have a input string(from the file) : cad bed abed dada
The frequency matrix would look like this:
[Image: frequency matrix of consecutive characters for the example input]
How to do this? Will appreciate any help and suggestion.
Thank you.
Keep track of the last character read; if there is no previous character yet, just record the current one and continue. Use a Map to store the counts. You can then loop over the combinations and pull each value from the map, or you could address a 2D array directly by subtracting the int value of 'a' from each character of the pair.
Map<String, Integer> table = new HashMap<>();
String last = "";
for (char c : input.toCharArray()) {
if (last.isEmpty()) {
last = String.format("%c", c);
continue;
}
String thing = last + c;
Integer count = table.getOrDefault(thing, 0);
table.put(thing, count + 1);
last = String.format("%c", c);
}
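The 2D-array variant mentioned above could look roughly like the sketch below. It rests on assumptions that the question does not confirm: only the lowercase letters a to z are counted, and any other character (such as a space) breaks the pair. The class name PairMatrix is hypothetical. It reads from a Reader so that a large file never has to be held in memory.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper: counts how often one letter immediately follows another.
class PairMatrix {

    // counts[x][y] = number of times letter y immediately follows letter x.
    static long[][] count(Reader reader) throws IOException {
        long[][] counts = new long[26][26];
        int prev = -1;                       // -1 means "no previous letter"
        int c;
        while ((c = reader.read()) != -1) {
            // Assumption: anything outside a-z is treated as a separator
            int cur = (c >= 'a' && c <= 'z') ? c - 'a' : -1;
            if (prev >= 0 && cur >= 0) {
                counts[prev][cur]++;
            }
            prev = cur;
        }
        return counts;
    }

    // Convenience overload for in-memory text.
    static long[][] count(String text) {
        try {
            return count(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[][] m = count("cad bed abed dada");
        System.out.println("d after a: " + m['a' - 'a']['d' - 'a']);
        System.out.println("d after e: " + m['e' - 'a']['d' - 'a']);
    }
}
```

For the 2 GB file, the Reader passed in would be the BufferedReader from the question; the array approach avoids creating a String key for every pair.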

Efficient way to replace all special characters and numbers in a large text file in Java

I'm currently working on a program that creates a pie chart based on the frequencies of letters in a text file. My test file is relatively large, and although my program works great on smaller files, it is very slow for large ones. I want to cut down the time it takes by finding a more efficient way to search through the text file and remove special characters and numbers. This is the code I have right now for this portion:
public class readFile extends JPanel {
protected static String stringOfChar = "";
public static String openFile(){
String s = "";
try {
BufferedReader reader = new BufferedReader(new FileReader("xWords.txt"));
while((s = reader.readLine()) != null){
String newstr = s.replaceAll("[^a-z A-Z]"," ");
stringOfChar+=newstr;
}
reader.close();
return stringOfChar;
}
catch (Exception e) {
System.out.println("File not found.");
}
return stringOfChar;
}
The code reads through the text file character by character, replacing all special characters with a space, after this is done I sort the string into a hashmap for characters and frequencies.
I know from testing that this portion of the code is what is causing the bulk of extra time to process the file, but I'm not sure how I could replace all the characters in an efficient manner.
Your code has two inefficiencies:
It constructs throw-away strings with special characters replaced by space in s.replaceAll
It builds large strings by concatenating String objects with +=
Both of these operations create a lot of unnecessary objects. On top of this, the final String object is thrown away as well as soon as the final result, the map of counts, is constructed.
You should be able to fix both these deficiencies by constructing the map as you read through the file, avoiding both the replacements and concatenations:
public static Map<Character,Integer> openFileAndCount() throws IOException {
    Map<Character,Integer> res = new HashMap<Character,Integer>();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(new FileReader("xWords.txt"))) {
        String s;
        while ((s = reader.readLine()) != null) {
            for (int i = 0; i != s.length(); i++) {
                char c = s.charAt(i);
                // The check below lets through all letters, not only Latin ones.
                // Use a different check to get rid of accented letters
                // e.g. è, à, ì and other characters that you do not want.
                if (!Character.isLetter(c)) {
                    c = ' ';
                }
                res.put(c, res.containsKey(c) ? res.get(c).intValue() + 1 : 1);
            }
        }
    }
    return res;
}
Instead of using the + operator, use the StringBuilder class to concatenate strings:
A mutable sequence of characters.
It is a lot more efficient.
Concatenating strings generates a new string for each concatenation. So if you need to do that many times, you create a lot of intermediate strings that are never used, because you only need the final result.
A StringBuilder uses a different internal representation, so it is not necessary to create a new object for every concatenation.
Also, replaceAll is very inefficient, creating a new String every time.
Here is more efficient code using StringBuilder:
...
StringBuilder build = new StringBuilder();
while((s = reader.readLine()) != null){
for (char ch : s.toCharArray()) {
if (!(ch >= 'a' && ch <= 'z')
&& !(ch >= 'A' && ch <= 'Z')
&& ch != ' ') {
build.append(" ");
} else {
build.append(ch);
}
}
}
...
return build.toString();
...

Array Index out of Bound Exception for returning Char Array

I am new to Java programming and I was writing code to replace spaces in Strings with %20 and return the final String. Here is the code for the problem. Since I am new to programming please tell me what I did wrong. Sorry for my bad English.
package Chapter1;
import java.io.BufferedReader;
import java.io.InputStreamReader;
public class Problem4 {
public char[] replaceSpaces(char[] str_array, int length)
{
int noOfSpaces=0,i,newLength;
for(i=0;i<length;i++)
{
if(str_array[i]==' ')
{
noOfSpaces++;
}
newLength = length + noOfSpaces * 2;
str_array[newLength]='\0';
for(i=0;i<length-1;i++)
{
if(str_array[i]==' ')
{
str_array[newLength-1]='0';
str_array[newLength-2]='2';
str_array[newLength-3]='%';
newLength = newLength-3;
}
str_array[newLength-1]=str_array[i];
newLength = newLength - 1;
}
}
return str_array;
}
public static void main(String args[])throws Exception
{
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
System.out.println("Please enter the string:");
String str = reader.readLine();
char[] str_array = str.toCharArray();
int length = str.length();
Problem4 obj = new Problem4();
char[] result = obj.replaceSpaces(str_array, length);
System.out.println(result);
}
}
But I get the following error:
Please enter the string:
hello world
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 11
at Chapter1.Problem4.replaceSpaces(Problem4.java:19)
at Chapter1.Problem4.main(Problem4.java:46)
How about using String.replaceAll():
String str = reader.readLine();
str = str.replaceAll(" ", "%20");
EDIT:
The problem is at line 19:
str_array[newLength]='\0';//<-- newLength exceeds the char array size
Here the array is static, i.e. its size is fixed. You can use StringBuilder, StringBuffer, etc. to build the new String without worrying about the size for such small operations.
Assuming that you want to see what mistakes you made when implementing your approach, instead of looking for a totally different approach:
(1) As has been pointed out, once an array has been allocated, its size cannot be changed. Your method takes str_array as a parameter, but the resulting array will likely be larger than str_array. Therefore, since str_array's length cannot be changed, you'll need to allocate a new array to hold the result, rather than using str_array. You've computed newLength correctly; allocate a new array of that size:
char[] resultArray = new char[newLength];
(2) As Elliott pointed out, Java strings don't need \0 terminators. If, for some reason, you really want to create an array that has a \0 character at the end, then you have to add 1 to your computed newLength to account for the extra character.
(3) You're actually creating the resulting array backward. I don't know if that is intentional.
if(str_array[i]==' ')
{
str_array[newLength-1]='0';
str_array[newLength-2]='2';
str_array[newLength-3]='%';
newLength = newLength-3;
}
str_array[newLength-1]=str_array[i];
newLength = newLength - 1;
i starts with the first character of the string and goes upward; you're filling in characters starting with the last character of the string (newLength) and going backward. If that's what you intended to do, it wasn't clear from your question. Did you want the output to be "dlrow%20olleh"?
(4) If you did intend to go backward, then what the above code does with a space is to put %20 in the string (backwards), but then it also puts the space into the result. If the input character is a space, you want to make sure you don't execute the two lines that copy the input character to the result. So you'll need to add an else. (Note that this problem will lead to an out-of-bounds error, because you're trying to put more characters into the result than you computed.) You'll need to have an else in there even if you really meant to build the string forwards and need to change the logic to make it go forward.
Java arrays are not dynamic (they are Object instances, and they have a length field that does not change). Because they store the length as a field, they're not '\0' terminated (your attempt to add such a terminator is causing your index-out-of-bounds exception). Your method doesn't appear to access any instance fields or methods, so I'd make it static. Then you could use a StringBuilder and a for-each loop. Something like
public static char[] replaceSpaces(char[] str_array) {
StringBuilder sb = new StringBuilder();
for (char ch : str_array) {
sb.append((ch != ' ') ? ch : "%20");
}
return sb.toString().toCharArray();
}
Then call it like
char[] result = replaceSpaces(str_array);
Finally, you might use String str = reader.readLine().replace(" ", "+"); or replaceAll(" ", "%20") as suggested by @Arvind.
P.S. When you finally get your result you'll need to fix your call to print it.
System.out.println(Arrays.toString(result));
or
System.out.println(new String(result));
A char[] is not a String and Java arrays (disappointingly) don't override toString() so you'll get the one from Object.
please tell me what I did wrong
You tried to replace a single character with three characters %20. That's not possible because arrays are fixed length.
Therefore you must allocate a new char[] and copy the characters from str_array into the new array.
for (i = 0; i < length; i++) {
if (str_array[i] == ' ') {
noOfSpaces++;
}
}
newLength = length + noOfSpaces * 2;
char[] newArray = new char[newLength];
// copy characters from str_array into newArray
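The copy step left as a comment above could be filled in along these lines. This is a sketch rather than the asker's intended design: it builds the result forward into a newly allocated array, and the class name SpaceEncoder is hypothetical.

```java
// Hypothetical helper: replaces each space with "%20" in a new array.
class SpaceEncoder {

    // Returns a new array; the input array is left unchanged.
    static char[] replaceSpaces(char[] input) {
        int spaces = 0;
        for (char ch : input) {
            if (ch == ' ') {
                spaces++;
            }
        }
        // Each space grows from one character to three: two extra per space.
        char[] result = new char[input.length + spaces * 2];
        int out = 0;
        for (char ch : input) {
            if (ch == ' ') {
                result[out++] = '%';
                result[out++] = '2';
                result[out++] = '0';
            } else {
                // The else matters: a space itself must not be copied
                result[out++] = ch;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints hello%20world
        System.out.println(new String(replaceSpaces("hello world".toCharArray())));
    }
}
```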
The exception is raised on this line, str_array[newLength]='\0';, because the value of newLength is greater than the length of str_array.
An array's size cannot be increased once it has been created. So try this alternative:
char[] str_array1 = Arrays.copyOf(str_array, newLength + 1);
str_array1[newLength] = '\0';
Don't forget to add import java.util.Arrays;

Code optimization by choosing another data structure

I have a piece of code that needs to be optimized.
for (int i = 0; i < wordLength; i++) {
for (int c = 0; c < alphabetLength; c++) {
if (alphabet[c] != x.word.charAt(i)) {
String res = WordList.Contains(x.word.substring(0,i) +
alphabet[c] +
x.word.substring(i+1));
if (res != null && WordList.MarkAsUsedIfUnused(res)) {
WordRec wr = new WordRec(res, x);
if (IsGoal(res)) return wr;
q.Put(wr);
}
}
}
Words are represented by strings. The problem is that the code on lines 4-6 creates too many String objects, because strings are immutable.
Which data structure should I change my word representation to if I want faster code? I have tried changing it to char[], but then I have a problem getting the following code to work:
x.word.substring(0,i)
How do I get a subarray from a char[]? And how do I concatenate the char and char[] on lines 4-6?
Is there any other suitable, mutable data structure that I can use? I have thought of StringBuffer but can't find suitable operations on it.
This function generates, given a specific word, all the words that differ from it by one character.
WordRec is just a class with a string representing a word and a pointer to the "father" of that word.
Thanks in advance
You can reduce number of objects by using this approach:
StringBuilder tmp = new StringBuilder(wordLength);
tmp.append(x.word);
for (int i=...) {
for (int c=...) {
if (...) {
char old = tmp.charAt(i);
tmp.setCharAt(i, alphabet[c]);
String res = tmp.toString();
tmp.setCharAt(i, old);
...
}
}
}
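A self-contained version of the same idea is sketched below, with the elided loop bounds and condition filled in by assumption. It uses a char[] working copy in place of the StringBuilder, since the asker mentioned trying char[]. WordList and WordRec are not shown in the question, so the candidates are simply collected into a list. The names Neighbors and oneLetterVariants are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: generates every word that differs in exactly one position.
class Neighbors {

    static List<String> oneLetterVariants(String word, char[] alphabet) {
        List<String> variants = new ArrayList<>();
        char[] buf = word.toCharArray();     // single mutable working copy
        for (int i = 0; i < buf.length; i++) {
            char old = buf[i];
            for (char c : alphabet) {
                if (c != old) {
                    buf[i] = c;
                    // One String per candidate, no substring or concatenation
                    variants.add(new String(buf));
                }
            }
            buf[i] = old;                    // restore before the next position
        }
        return variants;
    }

    public static void main(String[] args) {
        // prints [bb, cb, aa, ac]
        System.out.println(oneLetterVariants("ab", new char[] {'a', 'b', 'c'}));
    }
}
```

Each candidate still needs one String if the word list is keyed by String, but the two substring calls and the intermediate concatenation results are gone.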

How to detect end of string in byte array to string conversion?

I receive from a socket a string in a byte array which looks like:
[128,5,6,3,45,0,0,0,0,0]
The size given by the network protocol is the total length of the string (including zeros), so in my example it is 10.
If I simply do:
String myString = new String(myBuffer);
I get 5 incorrect characters at the end of the string. The conversion doesn't seem to detect the end-of-string character (0).
To get the correct size and the correct string I do this:
int sizeLabelTmp = 0;
//Iterate over the 10 bytes to get the real size of the string
for(int j = 0; j<(sizeLabel); j++) {
byte charac = datasRec[j];
if(charac == 0)
break;
sizeLabelTmp ++;
}
// Create a temp byte array to make a correct conversion
byte[] label = new byte[sizeLabelTmp];
for(int j = 0; j<(sizeLabelTmp); j++) {
label[j] = datasRec[j];
}
String myString = new String(label);
Is there a better way to handle the problem ?
Thanks
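Maybe it's too late, but this may help others. The simplest thing you can do is new String(myBuffer).trim(), which strips the trailing zero bytes. Be aware that trim() also removes any other leading or trailing whitespace and control characters, and that new String(byte[]) uses the platform default encoding.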
0 isn't an "end of string character". It's just a byte. Whether or not it only comes at the end of the string depends on what encoding you're using (and what the text can be). For example, if you used UTF-16, every other byte would be 0 for ASCII characters.
If you're sure that the first 0 indicates the end of the string, you can use something like the code you've given, but I'd rewrite it as:
int size = 0;
while (size < data.length)
{
if (data[size] == 0)
{
break;
}
size++;
}
// Specify the appropriate encoding as the last argument
String myString = new String(data, 0, size, "UTF-8");
I strongly recommend that you don't just use the platform default encoding - it's not portable, and may well not allow for all Unicode characters. However, you can't just decide arbitrarily - you need to make sure that everything producing and consuming this data agrees on the encoding.
If you're in control of the protocol, it would be much better if you could introduce a length prefix before the string, to indicate how many bytes are in the encoded form. That way you'd be able to read exactly the right amount of data (without "over-reading") and you'd be able to tell if the data was truncated for some reason.
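A length-prefixed exchange could be sketched as follows. The framing here is an assumption chosen for illustration (a 4-byte big-endian length followed by UTF-8 bytes), not the asker's actual protocol, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical framing: 4-byte length prefix, then that many UTF-8 bytes.
class LengthPrefixed {

    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(encoded.length);        // number of bytes, not chars
        out.write(encoded);
    }

    static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("negative length: " + length);
        }
        // A real protocol should also cap the length before allocating.
        byte[] encoded = new byte[length];
        in.readFully(encoded);               // throws EOFException if truncated
        return new String(encoded, StandardCharsets.UTF_8);
    }

    // Writes a string to an in-memory stream and reads it back.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            writeString(new DataOutputStream(bytes), s);
            return readString(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("label"));
    }
}
```

With a socket, the same two methods would wrap the socket's output and input streams; readFully is what makes truncation visible instead of silently leaving zero bytes in the buffer.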
You can always start at the end of the byte array and go backwards until you hit the first non-zero byte. Then just copy that range into a new byte array and convert it to a String. Hope this helps:
byte[] foo = {28,6,3,45,0,0,0,0};
int i = foo.length - 1;
// The i >= 0 check prevents running off the front of an all-zero array
while (i >= 0 && foo[i] == 0)
{
    i--;
}
byte[] bar = Arrays.copyOf(foo, i+1);
String myString = new String(bar, "UTF-8");
System.out.println(myString.length());
Will give you a result of 4.
Strings in Java aren't ended with a 0, like in some other languages. 0 will get turned into the so-called null character, which is allowed to appear in a String. I suggest you use some trimming scheme that either detects the first index of the array that's a 0 and uses a sub-array to construct the String (assuming all the rest will be 0 after that), or just construct the String and call trim(). That'll remove leading and trailing whitespace, which is any character with ASCII code 32 or lower.
The latter won't work if you have leading whitespace you must preserve. Using a StringBuilder and deleting characters at the end as long as they're the null character would work better in that case.
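The StringBuilder variant described above might look like the sketch below. It assumes the bytes are UTF-8, which the question does not state, and the class name TrailingNulls is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strips trailing null characters but keeps leading whitespace.
class TrailingNulls {

    static String decode(byte[] data) {
        // Assumption: the payload is UTF-8 encoded
        StringBuilder sb = new StringBuilder(new String(data, StandardCharsets.UTF_8));
        while (sb.length() > 0 && sb.charAt(sb.length() - 1) == '\0') {
            sb.setLength(sb.length() - 1);   // drop the last character
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = {' ', 'h', 'i', 0, 0, 0};
        // prints [ hi], with the leading space preserved
        System.out.println("[" + decode(data) + "]");
    }
}
```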
It appears to me that you are ignoring the read-count returned by the read() method. The trailing null bytes probably weren't sent, they are probably still left over from the initial state of the buffer.
int count = in.read(buffer);
if (count < 0) {
    // EOS: close the socket etc
} else {
    // Only the first count bytes were actually received
    String s = new String(buffer, 0, count);
}
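Without diving into the protocol considerations that the OP mentioned, how about this for dropping the zero bytes? Note that it skips every zero byte, not only the trailing ones, and treats each byte as a single Latin-1 character.
public static String bytesToString(byte[] data) {
    StringBuilder dataOut = new StringBuilder(data.length);
    for (int i = 0; i < data.length; i++) {
        if (data[i] != 0x00) {
            // Mask to 0..255 so bytes above 0x7F are not sign-extended
            dataOut.append((char) (data[i] & 0xFF));
        }
    }
    return dataOut.toString();
}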
