Find the position of the last English character in a stream - Java

I have a reader that receives message packets as a stream (ByteArrayInputStream).
Each packet contains English characters followed by binary digits:
adghfjiyromn1000101010100......
What is the most efficient way to copy (not strip) the characters out of this stream as a sequence?
So the expected output for the packet above, without modifying the original stream, would be:
adghfjiyromn
I am concerned not only with the logic but also with the exact stream manipulation routines to use, given that the reader would hypothetically read about 3-4 packets every second.
It would also help to justify why a particular data type (byte[], char[] or String) is preferable for this.

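I think the best way is to read the ByteArrayInputStream byte by byte:
ByteArrayInputStream msg = ...
int c;
// A StringBuilder avoids creating a new String for every appended character
StringBuilder s = new StringBuilder();
while ((c = msg.read()) != -1) {
    char x = (char) c;
    if (x == '1' || x == '0') break; // first binary digit: the text part is over
    s.append(x);
}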

I think this is the best way:
1- Convert your ByteArrayInputStream to a String (or StringBuffer)
2- Find the first index of 0 or 1
3- Use substring(0, FIRST_INDEX)
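The three steps above could be sketched roughly like this. It assumes the packet bytes are ASCII and relies on the fact that ByteArrayInputStream supports mark() and reset(), so the stream can be rewound after the copy. The class and method names are made up for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

class PrefixExtractor {
    // Returns the characters before the first '0' or '1', then rewinds
    // the stream so the caller still sees the complete packet.
    static String letterPrefix(ByteArrayInputStream in) {
        in.mark(0);                           // ByteArrayInputStream ignores the read limit
        byte[] all = new byte[in.available()];
        int n = in.read(all, 0, all.length);  // step 1: pull the bytes out
        in.reset();                           // back to the marked position
        if (n <= 0) return "";
        String text = new String(all, 0, n, StandardCharsets.US_ASCII);
        int end = 0;                          // step 2: index of the first '0' or '1'
        while (end < text.length() && text.charAt(end) != '0' && text.charAt(end) != '1') {
            end++;
        }
        return text.substring(0, end);        // step 3
    }

    public static void main(String[] args) {
        byte[] packet = "adghfjiyromn1000101010100".getBytes(StandardCharsets.US_ASCII);
        ByteArrayInputStream in = new ByteArrayInputStream(packet);
        System.out.println(letterPrefix(in)); // prints "adghfjiyromn"
        System.out.println(in.available());   // still the full packet length
    }
}
```

At 3-4 packets per second either data type is fast enough; a String is simply the most convenient result type here.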

You: Each packet contains data consisting of English characters followed by binary digits.
Me: The data is in a ByteArrayInputStream, hence everything is binary.
Are your 1000101010100...... the characters '1' and '0'?
If yes:
ByteArrayInputStream msg = //whatever
int totalBytes = msg.available();
int c;
while ((c = msg.read()) != -1) {
    char x = (char) c;
    if (x == '1' || x == '0') break;
}
// "Unread" the first 0 or 1 that the loop consumed (assumes at least one digit is present)
int currentPos = msg.available() + 1;
// Index of the first digit, which equals the number of leading characters
System.out.println("Position = " + (totalBytes - currentPos));

Related

How to generate a matrix of frequency of consecutive characters from txt file in java?

I have a large text file (2 GB). I read the whole file character by character to find the frequency of each character, using the following code snippet.
BufferedReader reader = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(file),
                Charset.forName("UTF-8")));
int c;
while ((c = reader.read()) != -1) {
    char ch = (char) c;
    // rest of the code
}
Now I need to generate a matrix with the frequency of consecutive characters.
For example, how many times the character 'b' occurs immediately after the character 'a', and vice versa.
Suppose I have an input string (from the file): cad bed abed dada
The frequency matrix would look like the one in the linked image (not included here).
How can I do this? I would appreciate any help and suggestions.
Thank you.
Keep track of the last character read; if there is none yet (the very first character), just record it and continue. Use a Map to store the counts. You can then loop over the combinations and pull each value from the map, or you could address a 2D array directly by subtracting the int value of the char 'a' from each character of the current pair.
Map<String, Integer> table = new HashMap<>();
String last = "";
for (char c : input.toCharArray()) {
    if (last.isEmpty()) {
        // First character: nothing to pair it with yet
        last = String.valueOf(c);
        continue;
    }
    String thing = last + c; // the two-character pair, e.g. "ab"
    Integer count = table.getOrDefault(thing, 0);
    table.put(thing, count + 1);
    last = String.valueOf(c);
}
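The 2D-array variant mentioned above might look like the sketch below. It reads from a Reader so that a 2 GB file never has to be held in memory as one String. The restriction to lowercase 'a'..'z' and the rule that any other character breaks a pair are assumptions made for the example, as are the class and method names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class BigramCounter {
    // counts[a][b] = how often character b immediately follows character a.
    // Assumption: only 'a'..'z' are tracked; anything else (space, digit,
    // punctuation) resets the "previous character".
    static long[][] count(Reader reader) throws IOException {
        long[][] counts = new long[26][26];
        int prev = -1; // -1 means there is no usable previous character
        int c;
        while ((c = reader.read()) != -1) {
            int cur = (c >= 'a' && c <= 'z') ? c - 'a' : -1;
            if (prev >= 0 && cur >= 0) {
                counts[prev][cur]++;
            }
            prev = cur;
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        long[][] m = count(new StringReader("cad bed abed dada"));
        System.out.println("a->d: " + m['a' - 'a']['d' - 'a']);
        System.out.println("b->e: " + m['b' - 'a']['e' - 'a']);
    }
}
```

For the real file, pass in the BufferedReader from the question instead of the StringReader.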

Hex string to ByteBuffer conversion with Java 8 streams

I am looking for a way to read hex strings from a file line by line and append them as converted bytes to some ByteBuffer.
ByteBuffer byteBuffer = ByteBuffer.allocate(1024);
Files.lines(filePath).forEach( l ->
    byteBuffer.put(
        // first of all strip newlines and normalize string
        l.replaceAll("[\\r\\n]", "").toUpperCase()
        // but what to do here?
        // is there something like
        // take next 2 characters (-> Consumer)
        // and replace them with the converted byte?
        // E.g. "C8" -> 0xC8
        // until the end of the string is reached
    )
);
This has been answered a million times, but I wondered whether there is a solution using streams like the one returned by Files.lines().
Generally I like this answer. Can anybody help me translate it into a Java 8 stream-based solution, or complete my example above?
Thank you!
You can use a utility method to parse the line as a hex string to a byte array:
public static byte[] hexStringToByteArray(String str) {
    if (str.startsWith("0x")) { // Get rid of potential prefix
        str = str.substring(2);
    }
    if (str.length() % 2 != 0) { // If string is not of even length
        str = '0' + str; // Assume leading zeroes were left out
    }
    byte[] result = new byte[str.length() / 2];
    for (int i = 0; i < str.length(); i += 2) {
        String nextByte = str.charAt(i) + "" + str.charAt(i + 1);
        // To avoid overflow, parse as int and truncate:
        result[i / 2] = (byte) Integer.parseInt(nextByte, 16);
    }
    return result;
}
ByteBuffer byteBuffer = ByteBuffer.allocate(1024);
// Files.lines already strips the line terminators; trim() drops stray whitespace.
// No toUpperCase() here: it would turn a "0x" prefix into "0X" and defeat the check above.
Files.lines(filePath).forEach( l ->
    byteBuffer.put(
        hexStringToByteArray(l.trim())
    )
);
This looks a bit like an XY problem, as reading the file “line by line” is already part of your attempted solution, while your actual task does not include any requirement to read the file “line by line”.
Actually, you want to process all hexadecimal numbers of the source, regardless of the line terminators, which is a job for java.util.Scanner. It also allows you to process the items using the Stream API, though this specific task does not benefit much from it compared to a loop:
ByteBuffer bb = ByteBuffer.allocate(1024);
try (Scanner s = new Scanner(yourFile)) {
    s.findAll("[0-9A-Fa-f]{2}")
     .mapToInt(m -> Integer.parseInt(m.group(), 16))
     .forEachOrdered(i -> { if (bb.hasRemaining()) bb.put((byte) i); });
}
// The same with a plain loop, which can also grow the buffer when it is full
try (Scanner s = new Scanner(yourFile)) {
    Pattern p = Pattern.compile("[0-9A-Fa-f]{2}");
    for (;;) {
        String next = s.findWithinHorizon(p, 0);
        if (next == null) break;
        if (!bb.hasRemaining()) // the thing hard to do with Stream API
            bb = ByteBuffer.allocate(bb.capacity() * 2).put(bb.flip());
        bb.put((byte) Integer.parseInt(next, 16));
    }
}
Note that these examples use Java 9. In Java 8, the Buffer returned by Buffer.flip() needs a type cast back to ByteBuffer and Scanner.findAll is not available but has to be replaced by a back-port like the one in this answer.
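If you have to stay on Java 8 and still want a Stream-based version, one possible sketch is to turn each line into an IntStream of byte values and flatten those. This is not the Scanner approach from above. It assumes every line holds an even number of hex digits with no prefix, and the class and method names are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class HexLines {
    // Maps one line to its byte values, two hex digits at a time.
    // Assumption: even number of hex digits, no "0x" prefix.
    static IntStream bytesOf(String line) {
        String hex = line.trim();
        return IntStream.range(0, hex.length() / 2)
                .map(i -> Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16));
    }

    // Collects the bytes of all lines in order; the output grows as needed.
    static byte[] decode(Stream<String> lines) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        lines.flatMapToInt(HexLines::bytesOf).forEachOrdered(b -> out.write(b));
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = decode(Stream.of("C8ff", "0a"));
        System.out.println(bytes.length); // 3 bytes: 0xC8, 0xFF, 0x0A
    }
}
```

With a file you would call it as decode(Files.lines(filePath)) inside a try-with-resources block and wrap the result with ByteBuffer.wrap(...) if a ByteBuffer is needed.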

ByteArray - string comparison, without using bytearray.tostring()

I am using Java for MapReduce programming.
I have a byte array with 10 MB of data in it. I want to compare each byte to see whether it is a space; my basic purpose is to get each word in this byte array by separating the words on spaces (that is my idea; any other suggestion is welcome). I can certainly do it using a String, i.e. first converting the whole byte array to a String, then comparing and taking substrings to get each word, but this duplicates the data. I don't want anything that creates a duplicate, like StringBuilder, StringTokenizer or substring.
I want each word in the byte array, but without any duplicates, since I am doing in-memory computing and duplicates make me run out of resources. Any suggestion or idea on how to proceed would be appreciated.
If you just want to avoid creating a String for the whole array (and strings for the words are OK), you could do
HashSet<String> words = new HashSet<String>();
int pos = 0; // start index of the current word
int len = byteArray.length;
for (int i = 0; i <= len; i++) {
    if (i == len || byteArray[i] == ' ') {
        if (i > pos) { // skip empty words (consecutive spaces) but keep one-letter words
            String word = new String(byteArray, pos, i - pos, "UTF-8");
            words.add(word);
        }
        pos = i + 1;
    }
}
P.S. Your comment seems to suggest that you read the byte array from a file. Why not avoid that and read the words from the file directly? If you can use a newline (\n) as the delimiter (instead of a space), you could just do something like this:
HashSet<String> words = new HashSet<String>();
// args[0] is assumed to hold the file name
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(args[0]), "UTF-8"));
while (true) {
    String word = reader.readLine();
    if (word == null) {
        break;
    }
    words.add(word);
}
reader.close();

Efficient ByteArrayInputStream manipulation

I am working with a ByteArrayInputStream that contains an XML document consisting of one element with a large base 64 encoded string as the content of the element. I need to remove the surrounding tags so I can decode the text and output it as a pdf document.
What is the most efficient way to do this?
My knee-jerk reaction is to read the stream into a byte array, find the end of the start tag, find the beginning of the end tag and then copy the middle part into another byte array; but this seems rather inefficient and the text I am working with can be large at times (128KB). I would like a way to do this without the extra byte arrays.
Base 64 does not use the characters < or > so I'm assuming you are using a web-safe base64 variant meaning you do not need to worry about HTML entities or comments inside the content.
If you are really sure that the content has this form, then do the following:
Scan from the right looking for a '<'. This will be the beginning of the close tag.
Scan left from that position looking for a '>'. This will be the end of the start tag.
The base 64 content is between those two positions, exclusive.
You can presize your second array by using
((end - start + 3) / 4) * 3
as an upper bound on the decoded content length, and then base64-decode into it. This works because every 4 base64 digits encode 3 bytes.
If you want to get really fancy, since you know the first few bytes of the array contain ignorable tag data and the encoded data is smaller than the input, you could destructively decode the data over your current byte buffer.
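The scan described above could be sketched like this with java.util.Base64. It assumes the whole document is already in a byte array, that there is exactly one element, and that the content contains no line breaks (the basic decoder rejects them). The class and method names are hypothetical. Wrapping the middle of the array in a ByteBuffer avoids copying the encoded text; the decoder still allocates one array for the decoded output.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class TagStripper {
    // Decodes the base64 text between the start tag and the close tag.
    static ByteBuffer decodeContent(byte[] xml) {
        int end = xml.length - 1;
        while (end >= 0 && xml[end] != '<') end--;        // beginning of the close tag
        int start = end - 1;
        while (start >= 0 && xml[start] != '>') start--;  // end of the start tag
        if (end < 0 || start < 0) {
            throw new IllegalArgumentException("tags not found");
        }
        // View of the encoded content, exclusive of both positions; no copy is made
        ByteBuffer encoded = ByteBuffer.wrap(xml, start + 1, end - start - 1);
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        byte[] xml = "<tag>SGVsbG8=</tag>".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer decoded = decodeContent(xml);
        byte[] out = new byte[decoded.remaining()];
        decoded.get(out);
        System.out.println(new String(out, StandardCharsets.US_ASCII)); // prints "Hello"
    }
}
```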
Do your search and conversion while you are reading the stream.
// find the start tag
byte[] startTag = new byte[]{'<', 't', 'a', 'g', '>'};
int fnd = 0;
int tmp = 0;
while ((tmp = stream.read()) != -1) {
    if (tmp == startTag[fnd])
        fnd++;
    else
        fnd = (tmp == startTag[0]) ? 1 : 0; // the mismatching byte may itself start the tag
    if (fnd == startTag.length) break;
}
// get base64 bytes
while (true) {
    int a = stream.read();
    int b = stream.read();
    int c = stream.read();
    int d = stream.read();
    byte o1, o2, o3; // output bytes
    if (a == -1 || a == '<') break;
    //
    ...
    outputStream.write(o1);
    outputStream.write(o2);
    outputStream.write(o3);
}
Note: the above was written in my web browser, so syntax errors may exist.

How to detect end of string in byte array to string conversion?

I receive a string from a socket as a byte array that looks like:
[128,5,6,3,45,0,0,0,0,0]
The size given by the network protocol is the total length of the string (including zeros), so 10 in my example.
If I simply do:
String myString = new String(myBuffer);
I get 5 incorrect characters at the end of the string. The conversion doesn't seem to detect the end-of-string character (0).
To get the correct size and the correct string I do this:
int sizeLabelTmp = 0;
// Iterate over the 10 bytes to get the real size of the string
for (int j = 0; j < sizeLabel; j++) {
    byte charac = datasRec[j];
    if (charac == 0)
        break;
    sizeLabelTmp++;
}
// Create a temporary byte array to make a correct conversion
byte[] label = new byte[sizeLabelTmp];
for (int j = 0; j < sizeLabelTmp; j++) {
    label[j] = datasRec[j];
}
String myString = new String(label);
Is there a better way to handle the problem?
Thanks
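Maybe it's too late, but it may help others. The simplest thing you can do is new String(myBuffer).trim(), which gives you exactly what you want: trim() strips every leading and trailing character with a code of 32 or lower, and that includes the zero bytes.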
0 isn't an "end of string character". It's just a byte. Whether or not it only comes at the end of the string depends on what encoding you're using (and what the text can be). For example, if you used UTF-16, every other byte would be 0 for ASCII characters.
If you're sure that the first 0 indicates the end of the string, you can use something like the code you've given, but I'd rewrite it as:
int size = 0;
while (size < data.length)
{
    if (data[size] == 0)
    {
        break;
    }
    size++;
}
// Specify the appropriate encoding as the last argument
String myString = new String(data, 0, size, "UTF-8");
I strongly recommend that you don't just use the platform default encoding - it's not portable, and may well not allow for all Unicode characters. However, you can't just decide arbitrarily - you need to make sure that everything producing and consuming this data agrees on the encoding.
If you're in control of the protocol, it would be much better if you could introduce a length prefix before the string, to indicate how many bytes are in the encoded form. That way you'd be able to read exactly the right amount of data (without "over-reading") and you'd be able to tell if the data was truncated for some reason.
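A length prefix of that kind might be sketched as follows. The 4-byte big-endian int prefix and UTF-8 are choices made for this example, not something the protocol in the question prescribes, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class LengthPrefixed {
    // Writes the encoded length first, then the encoded bytes.
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(encoded.length); // 4-byte big-endian length prefix
        out.write(encoded);
    }

    // Reads exactly as many bytes as the prefix announces.
    static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("negative length prefix: " + length);
        }
        byte[] encoded = new byte[length];
        in.readFully(encoded); // throws EOFException if the data is truncated
        return new String(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        writeString(new DataOutputStream(wire), "label");
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire.toByteArray()));
        System.out.println(readString(in)); // prints "label"
    }
}
```

A real protocol would also want an upper limit on the length before allocating the array.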
You can always start at the end of the byte array and go backwards until you hit the first non-zero byte. Then just copy that range into a new byte array and turn it into a String. Hope this helps:
byte[] foo = {28,6,3,45,0,0,0,0};
int i = foo.length - 1;
while (i >= 0 && foo[i] == 0) // the bounds check guards against an all-zero array
{
    i--;
}
byte[] bar = Arrays.copyOf(foo, i+1);
String myString = new String(bar, "UTF-8");
System.out.println(myString.length());
This will give you a result of 4.
Strings in Java aren't ended with a 0, like in some other languages. 0 will get turned into the so-called null character, which is allowed to appear in a String. I suggest you use some trimming scheme that either detects the first index of the array that's a 0 and uses a sub-array to construct the String (assuming all the rest will be 0 after that), or just construct the String and call trim(). That'll remove leading and trailing whitespace, which is any character with ASCII code 32 or lower.
The latter won't work if you have leading whitespace you must preserve. Using a StringBuilder and deleting characters at the end as long as they're the null character would work better in that case.
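A small sketch of that StringBuilder variant, assuming the bytes are UTF-8 (the charset, like the class and method names, is an assumption for the example). Leading whitespace survives because only trailing null characters are removed.

```java
import java.nio.charset.StandardCharsets;

class NulTrimmer {
    // Decodes the bytes, then cuts off trailing '\0' characters only.
    static String decodeDroppingTrailingNuls(byte[] data) {
        StringBuilder sb = new StringBuilder(new String(data, StandardCharsets.UTF_8));
        int end = sb.length();
        while (end > 0 && sb.charAt(end - 1) == '\0') {
            end--;
        }
        sb.setLength(end); // truncates in place
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = {' ', 'a', 'b', 0, 0, 0};
        System.out.println("[" + decodeDroppingTrailingNuls(data) + "]"); // prints "[ ab]"
    }
}
```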
It appears to me that you are ignoring the read count returned by the read() method. The trailing null bytes probably weren't sent; they are probably left over from the initial state of the buffer.
int count = in.read(buffer);
if (count < 0) {
    // EOS: close the socket etc
} else {
    String s = new String(buffer, 0, count);
}
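Without diving into the protocol considerations that the OP mentioned, how about this for trimming the zeroes? Note that it drops every zero byte, not only the trailing ones, and maps each byte directly to a char.
public static String bytesToString(byte[] data) {
    // A StringBuilder avoids creating a new String for every appended character
    StringBuilder dataOut = new StringBuilder();
    for (int i = 0; i < data.length; i++) {
        if (data[i] != 0x00)
            dataOut.append((char) data[i]);
    }
    return dataOut.toString();
}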
