I am looking for a way to read hex strings from a file line by line and append them as converted bytes to some ByteBuffer.
ByteBuffer byteBuffer = ByteBuffer.allocate(1024);
Files.lines(filePath).forEach( l ->
byteBuffer.put(
// first of all strip newlines and normalize string
l.replaceAll("\\r|\\n", "").toUpperCase()
// but what to do here?
// is there something like
// take next 2 characters (-> Consumer)
// and replace them with the converted byte?
// E.g. "C8" -> 0xC8
// until the end of the string is reached
)
);
This has been answered a million times. But I wondered if there is a solution using streams like the one returned by Files.lines().
Generally I like this answer. Can anybody help me translate that into a Java 8 stream-based solution, or complete my example from above?
Thank you!
You can use a utility method to parse the line as a hex string to a byte array:
public static byte[] hexStringToByteArray(String str) {
    if (str.startsWith("0x")) { // Get rid of potential prefix
        str = str.substring(2);
    }
    if (str.length() % 2 != 0) { // If string is not of even length
        str = '0' + str;         // Assume a leading zero was left out
    }
    byte[] result = new byte[str.length() / 2];
    for (int i = 0; i < str.length(); i += 2) {
        String nextByte = str.charAt(i) + "" + str.charAt(i + 1);
        // To avoid overflow, parse as int and truncate:
        result[i / 2] = (byte) Integer.parseInt(nextByte, 16);
    }
    return result;
}
ByteBuffer byteBuffer = ByteBuffer.allocate(1024);
Files.lines(filePath).forEach(l ->
    byteBuffer.put(
        // Files.lines already strips line terminators, so trim() suffices here
        hexStringToByteArray(l.trim().toUpperCase())
    )
);
This looks a bit like an xy problem, as reading the file “line by line” is already part of your attempted solution, while your actual task does not include any requirement to read the file “line by line”.
Actually, you want to process all hexadecimal numbers of the source, regardless of the line terminators, which is a job for java.util.Scanner. It also allows processing the items using the Stream API, though this specific task does not benefit much from it, compared to a loop:
ByteBuffer bb = ByteBuffer.allocate(1024);
try (Scanner s = new Scanner(yourFile)) {
    s.findAll("[0-9A-Fa-f]{2}")
     .mapToInt(m -> Integer.parseInt(m.group(), 16))
     .forEachOrdered(i -> { if (bb.hasRemaining()) bb.put((byte) i); });
}
try (Scanner s = new Scanner(yourFile)) {
    Pattern p = Pattern.compile("[0-9A-Fa-f]{2}");
    for (;;) {
        String next = s.findWithinHorizon(p, 0);
        if (next == null) break;
        if (!bb.hasRemaining()) // the thing hard to do with the Stream API
            bb = ByteBuffer.allocate(bb.capacity() * 2).put(bb.flip());
        bb.put((byte) Integer.parseInt(next, 16));
    }
}
Note that these examples use Java 9. In Java 8, the Buffer returned by Buffer.flip() needs a type cast back to ByteBuffer, and Scanner.findAll is not available but has to be replaced by a back-port like the one in this answer.
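To make the Java 8 variant concrete, here is a runnable sketch of the same growing-buffer technique, including the cast back to ByteBuffer. It scans an in-memory string as a stand-in for yourFile; the class and method names (HexScanDemo, parseHex) are invented for this demo:

```java
import java.nio.ByteBuffer;
import java.util.Scanner;
import java.util.regex.Pattern;

public class HexScanDemo {
    static ByteBuffer parseHex(String source, int initialCapacity) {
        ByteBuffer bb = ByteBuffer.allocate(initialCapacity);
        Pattern p = Pattern.compile("[0-9A-Fa-f]{2}");
        try (Scanner s = new Scanner(source)) { // a Scanner over a String stands in for the file
            for (;;) {
                String next = s.findWithinHorizon(p, 0);
                if (next == null) break;
                if (!bb.hasRemaining()) {
                    // Java 8: flip() is declared on Buffer, hence the cast back to ByteBuffer
                    bb = ByteBuffer.allocate(bb.capacity() * 2).put((ByteBuffer) bb.flip());
                }
                bb.put((byte) Integer.parseInt(next, 16));
            }
        }
        return bb;
    }

    public static void main(String[] args) {
        // Deliberately tiny initial capacity, to exercise the resize path
        ByteBuffer bb = parseHex("C8 0A FF\n01", 2);
        System.out.println(bb.position()); // 4 bytes parsed
    }
}
```

The cast is redundant but harmless on Java 9+, so this compiles on both versions.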
I am using Java for a map reduce programming.
I have a byte array with 10 MB of data in it. I want to compare each byte to see if it is a space or not; my basic purpose is to get each word in this byte array by separating the words on spaces (that is my idea, any other suggestion is welcome). I can for sure do it using String, i.e. first converting the whole byte array to a String, then comparing and then doing a substring to get each word, but this duplicates the data. I don't want anything that creates a duplicate, like StringBuilder, StringTokenizer or substring.
I want each word in the byte array, but without any duplicates, since I am doing in-memory computing and duplicates make me run out of resources. Any suggestion/idea on how to proceed would be appreciated.
If you just want to avoid creating a String for the whole array (and strings for the words are OK), you could do
HashSet<String> words = new HashSet<String>();
int pos = 0;
int len = byteArray.length;
for (int i = 0; i <= len; i++) {
    if (i == len || byteArray[i] == ' ') {
        if (i > pos) { // skips empty tokens produced by consecutive spaces
            String word = new String(byteArray, pos, i - pos, StandardCharsets.UTF_8);
            words.add(word);
        }
        pos = i + 1;
    }
}
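As a self-contained illustration, the loop above applied to a small sample (the class name and sample data are invented for the demo):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;

public class WordSplitDemo {
    // Collect the distinct space-separated words of a byte array,
    // creating a String only per word, never for the whole array.
    static HashSet<String> splitWords(byte[] byteArray) {
        HashSet<String> words = new HashSet<String>();
        int pos = 0;
        int len = byteArray.length;
        for (int i = 0; i <= len; i++) {
            if (i == len || byteArray[i] == ' ') {
                if (i > pos) { // skip empty tokens from repeated spaces
                    words.add(new String(byteArray, pos, i - pos, StandardCharsets.UTF_8));
                }
                pos = i + 1;
            }
        }
        return words;
    }

    public static void main(String[] args) {
        byte[] data = "to be or not to be".getBytes(StandardCharsets.UTF_8);
        System.out.println(splitWords(data).size()); // 4 distinct words: to, be, or, not
    }
}
```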
P.S. Your comment seems to suggest that you read the byte array from a file. Why not avoid that and read the words from the file directly? If you can use a newline (\n) as the delimiter (instead of a space), you could just do something like this:
HashSet<String> words = new HashSet<String>();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(args[0]), StandardCharsets.UTF_8))) {
    String word;
    while ((word = reader.readLine()) != null) {
        words.add(word);
    }
}
This is my first post on here, so apologies for the problems it likely has.
I've been working with a custom input stream recently, that uses a byte array to store data in (similar to a ByteArrayInputStream) but with more control over the pointer. The problem is for some reason my implementation of read() starts returning negative numbers after the values get past 127, which causes DataInputStream to assume it's the EOF.
I've condensed things into a small program to demonstrate the problem:
(broken up into pieces, because I can't seem to figure out how to fit it all into a single code block)
The custom input stream class:
class TestByteArrayInputStream extends InputStream {
byte[] data;
{
// fill with some data
ByteArrayOutputStream out = new ByteArrayOutputStream();
DataOutputStream dout = new DataOutputStream(out);
try {
for (int i = 0; i < 256; i++) { // fill array with shorts valued 0-255
dout.writeShort(i);
}
} catch (Throwable t) {
t.printStackTrace();
}
data = out.toByteArray();
}
int pointer = 0;
@Override
public int read() throws IOException {
if (pointer >= data.length) {
pointer = 0;
}
return data[pointer++]; // I've tried casting this to a char to remove signing, and using Integer.valueOf, but neither solve the problem.
}
}
And here's the main method:
public class Bugdemo {
public static void main(String[] args) {
TestByteArrayInputStream tin = new TestByteArrayInputStream();
DataInputStream din = new DataInputStream(tin);
try { // read through normally
for (int i = 0; i < 256; i++) {
System.out.println(din.readShort());
}
} catch (Throwable t) {
System.out.println(t.toString()); // avoid logging issues
}
tin.pointer = 0; // reset to beginning of data
try {
for (int i = 0; i < 256; i++) {
// readShort code with added debugging
int ch1 = tin.read();
int ch2 = tin.read();
if ((ch1 | ch2) < 0) {
System.out.print("readshort \"eof\": ");
System.out.printf("data in array is %02X ", tin.data[tin.pointer - 2]);
System.out.printf("%02X ", tin.data[tin.pointer - 1]);
System.out.printf(" but got %02X ", ch1);
System.out.printf("%02X from read()", ch2);
System.out.println();
//throw new EOFException(); // this is in DataInputStream.readShort after if((ch1 | ch2) < 0)
} else {
System.out.println((short) ((ch1 << 8) + (ch2 << 0)));
}
}
} catch (Throwable t) {
t.printStackTrace();
}
}
}
And here's the output (pasted so this isn't too long): http://paste.ubuntu.com/6642589/
(is there a better way of doing this on here?)
The important bit:
readshort "eof": data in array is 00 80 but got 00 FFFFFF80 from read()
From my debugging I'm pretty sure it's a casting issue from the byte in the array to an int for returning in read(), but shouldn't it cast properly naturally? If not, what's the proper way of doing this?
readShort works as expected, and so does the library's read — the problem is in your override. Integer data types in Java are signed, including byte. Your read() returns data[pointer++], so the byte is sign-extended to an int and values above 127 come out negative, which DataInputStream then treats as end of stream. The InputStream.read() contract requires a value in the range 0 to 255 (or -1 for end of stream), so you have to strip the sign extension before returning, with data[pointer++] & 0xff.
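A minimal runnable sketch of that fix (the class name and sample data are invented for the demo; unlike the question's stream, this one returns -1 at the real end instead of wrapping around):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Same idea as TestByteArrayInputStream, but read() masks the byte
// so the returned value is always in [0, 255] as the contract requires.
class MaskedByteArrayInputStream extends InputStream {
    final byte[] data;
    int pointer = 0;

    MaskedByteArrayInputStream(byte[] data) {
        this.data = data;
    }

    @Override
    public int read() {
        if (pointer >= data.length) {
            return -1; // real end of stream
        }
        return data[pointer++] & 0xFF; // mask: -128..-1 becomes 128..255
    }
}

public class MaskDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = {0x00, (byte) 0x80}; // the short 128, big-endian
        DataInputStream din = new DataInputStream(new MaskedByteArrayInputStream(data));
        System.out.println(din.readShort()); // prints 128, no spurious EOF
    }
}
```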
No surprises here. The max value for byte is (2 to the power 7) - 1:
http://docs.oracle.com/javase/7/docs/api/java/lang/Byte.html#MAX_VALUE
All integer types in Java are signed (char is the only unsigned one), so a byte can hold values between -128 and 127. You are putting two bytes by writing a short:
for (int i = 0; i < 256; i++) { // fill array with shorts valued 0-255
dout.writeShort(i);
but the code reads just one byte:
return data[pointer++];
It should be done this way:
DataInputStream din = new DataInputStream(new ByteArrayInputStream(out.toByteArray()));
...
return din.readShort();
It isn't always easy. Just a quick summary for those who are still lost:
simple input stream
The read() of an input stream returns a value in the range [0;255].
However, at the end of the stream it returns -1.
int value = inputStream.read(); // -1 at end of stream
If you just cast this to a byte, the narrowing conversion wraps the values around, and you actually convert the range [0;255] to the range [-128;127].
byte signedValue = (byte) value;
data input stream
Now, if you wrap your InputStream in a DataInputStream, additional methods become available, such as readByte(). This method returns a value in the range [-128;127], because that is the range of Java's byte type. Often you may want to convert it to a positive value.
When the end of the stream is reached, a DataInputStream cannot return -1, so instead it throws an EOFException.
byte value = dataInputStream.readByte(); // throws EOFException
int positiveValue = value & 0xFF;
char character = (char) positiveValue;
PS: The DataInputStream offers some convenient methods that help you to read the values immediately in the correct value range.
int positiveValue = dataInputStream.readUnsignedByte();
int positiveValue = dataInputStream.readUnsignedShort();
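For instance, fed the bytes 0x80 0xFF 0xFF, those convenience methods return the positive values directly (a self-contained sketch; the class name is invented):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class UnsignedReadDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0x80, (byte) 0xFF, (byte) 0xFF};
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int b = in.readUnsignedByte();  // 128, not -128
        int s = in.readUnsignedShort(); // 65535, not -1
        System.out.println(b + " " + s); // prints: 128 65535
    }
}
```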
socket input stream
But it can be more complex. If your initial input stream is actually a SocketInputStream with a read timeout set, then when no data arrives in time you will not receive -1 or an EOFException, no matter which method you use or how you wrap it. Instead you will receive a SocketTimeoutException.
socket.setSoTimeout(1000);
int value = socketInputStream.read(); // throws SocketTimeoutException
byte signedValue = (byte) value;
char character = (char) value;
There is just a little shortcoming in that last statement: very rarely, the read() method of a SocketInputStream will not throw a SocketTimeoutException in case of a timeout. It can actually return -1 if the input stream is not correctly bound. In that case the connection is broken, and you need to close everything down and reconnect.
I have a reader which receives message packets as stream(ByteArrayInputStream).
Each packet contains data consisting of English characters followed by binary digits.
adghfjiyromn1000101010100......
What is the most efficient way to copy (not strip) the characters out of this stream as a sequence?
So, the expected output for the above packet would be (without modifying the original stream):
adghfjiyromn
I am not only concerned about the logic, but also about the exact stream-manipulation routines to use, considering that the reader would read about 3-4 packets every second.
It would also help to provide justification for why we would prefer a particular data type (byte[], char[] or String) for tackling this.
I think the best way is to read the ByteArrayInputStream byte by byte:
ByteArrayInputStream msg = ...
int c;
StringBuilder s = new StringBuilder(); // String += in a loop would copy on every step
while ((c = msg.read()) != -1) {
    char x = (char) c;
    if (x == '1' || x == '0') break;
    s.append(x);
}
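Since the question asks for the prefix without consuming the stream, it may be worth noting that ByteArrayInputStream supports mark()/reset(), which restores the position after peeking. A sketch under that assumption (class and method names invented; assumes the prefix characters are plain ASCII):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class PrefixPeekDemo {
    // Read the leading letters, then rewind so the caller sees an untouched stream.
    static String peekLetters(ByteArrayInputStream msg) {
        msg.mark(0); // the read limit is ignored by ByteArrayInputStream
        StringBuilder prefix = new StringBuilder();
        int c;
        while ((c = msg.read()) != -1) {
            if (c == '0' || c == '1') {
                break; // the binary digits have started
            }
            prefix.append((char) c);
        }
        msg.reset(); // position restored: nothing consumed
        return prefix.toString();
    }

    public static void main(String[] args) {
        ByteArrayInputStream msg = new ByteArrayInputStream(
                "adghfjiyromn1000101010100".getBytes(StandardCharsets.US_ASCII));
        System.out.println(peekLetters(msg)); // adghfjiyromn
        System.out.println(msg.available());  // still the full packet length: 25
    }
}
```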
I think this is the best way:
1. Convert your ByteArrayInputStream to a String (or StringBuffer)
2. Find the first index of '0' or '1'
3. Use substring(0, FIRST_INDEX)
You: Each packet contains data consisting of English characters followed by binary digits.
Me: The data is in a ByteArrayInputStream, hence everything is binary.
Are your 1000101010100...... the characters '1' and '0'?
If yes
ByteArrayInputStream msg = //whatever
int totalBytes = msg.available();
int c;
while ((c = msg.read()) != -1) {
    char x = (char) c;
    if (x == '1' || x == '0') break;
}
int currentPos = msg.available() + 1; // you need to unread the first 0 or 1
System.out.println("Position = " + (totalBytes - currentPos));
The default implementation of RandomAccessFile is 'broken', in the sense that you can't specify which encoding your file is in.
I'm looking for an alternative which matches the following criteria:
Encoding-aware
Random access! (dealing with very big files, need to be able to position the cursor using a byte offset without streaming the whole thing).
I had a poke around in Commons IO, but there's nothing there. I'd rather not have to implement this myself, because there are entirely too many places it could go wrong.
RandomAccessFile is intended for accessing binary data. It is not possible to create an efficient random-access view of an encoded file that is appropriate in all situations.
Even if you find such a solution I would check it carefully to ensure it suits your needs.
If you were to write it, I would suggest considering random access by row and column rather than by character offset from the start of the file.
This has the advantage that you only have to remember where each line starts and can scan within the line to reach a character. If you indexed the position of every character instead, the index would need about 4 bytes per character (assuming the file is < 4 GB).
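A hedged sketch of that line-index idea, operating on an in-memory byte array for simplicity (names invented; assumes a single-byte encoding such as ASCII — multi-byte encodings need more care):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineIndexDemo {
    // One scan over the file bytes, remembering the byte offset at which
    // each line starts. seek(starts.get(row)) then jumps straight to a row.
    static List<Long> indexLineStarts(byte[] file) {
        List<Long> starts = new ArrayList<>();
        starts.add(0L); // the first line starts at offset 0
        for (int i = 0; i < file.length; i++) {
            if (file[i] == '\n' && i + 1 < file.length) {
                starts.add((long) (i + 1)); // next line begins after the newline
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        byte[] file = "first\nsecond\nthird\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(indexLineStarts(file)); // [0, 6, 13]
    }
}
```

The memory cost is then one long per line rather than 4 bytes per character.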
The answer turned out to be less painful than I assumed:
// This gives me access to buffering and charset magic
BufferedReader reader = new BufferedReader(
        new InputStreamReader(Channels.newInputStream(randomAccessFile.getChannel()), encoding));
....
I can then implement a readLine() method which reads character by character. Using String.getBytes(encoding) I can keep track of the offset in the file. Calling seek() on the underlying RandomAccessFile allows me to reposition the cursor at will. There are probably some bugs lurking in there, but the basic tests seem to work.
public String readLine() throws IOException {
    eol = "";
    lastLineByteCount = 0;
    StringBuilder builder = new StringBuilder();
    char[] characters = new char[1];
    int status = reader.read(characters, 0, 1);
    if (status == -1) {
        return null;
    }
    char c = characters[0];
    while (status != -1) {
        if (c == '\n') {
            eol += c;
            break;
        }
        if (c == '\r') {
            eol += c;
        } else {
            builder.append(c);
        }
        status = reader.read(characters, 0, 1);
        c = characters[0];
    }
    String line = builder.toString();
    lastLineByteCount = line.getBytes(encoding).length + eol.getBytes(encoding).length;
    return line;
}
I receive a string from a socket in a byte array, which looks like:
[128,5,6,3,45,0,0,0,0,0]
The size given by the network protocol is the total length of the string (including zeros), so in my example 10.
If I simply do:
String myString = new String(myBuffer);
I have 5 incorrect characters at the end of the string. The conversion doesn't seem to detect the end-of-string character (0).
To get the correct size and the correct string i do this :
int sizeLabelTmp = 0;
// Iterate over the 10 bytes to get the real size of the string
for (int j = 0; j < sizeLabel; j++) {
    byte charac = datasRec[j];
    if (charac == 0)
        break;
    sizeLabelTmp++;
}
// Create a temp byte array to make a correct conversion
byte[] label = new byte[sizeLabelTmp];
for (int j = 0; j < sizeLabelTmp; j++) {
    label[j] = datasRec[j];
}
String myString = new String(label);
Is there a better way to handle the problem ?
Thanks
Maybe it's too late, but it may help others. The simplest thing you can do is new String(myBuffer).trim(), which gives you exactly what you want.
0 isn't an "end of string character". It's just a byte. Whether or not it only comes at the end of the string depends on what encoding you're using (and what the text can be). For example, if you used UTF-16, every other byte would be 0 for ASCII characters.
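To see that claim concretely: under UTF-16, even plain ASCII text is full of 0 bytes (a tiny demo, class name invented):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16ZeroDemo {
    public static void main(String[] args) {
        // Big-endian UTF-16 puts a 0 high byte before every ASCII character
        byte[] bytes = "Hi".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(Arrays.toString(bytes)); // [0, 72, 0, 105]
    }
}
```

So a 0 byte marks the end of the string only if the chosen encoding and content guarantee it.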
If you're sure that the first 0 indicates the end of the string, you can use something like the code you've given, but I'd rewrite it as:
int size = 0;
while (size < data.length) {
    if (data[size] == 0) {
        break;
    }
    size++;
}
// Specify the appropriate encoding as the last argument
String myString = new String(data, 0, size, "UTF-8");
I strongly recommend that you don't just use the platform default encoding - it's not portable, and may well not allow for all Unicode characters. However, you can't just decide arbitrarily - you need to make sure that everything producing and consuming this data agrees on the encoding.
If you're in control of the protocol, it would be much better if you could introduce a length prefix before the string, to indicate how many bytes are in the encoded form. That way you'd be able to read exactly the right amount of data (without "over-reading") and you'd be able to tell if the data was truncated for some reason.
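A minimal sketch of such a length-prefixed exchange, using in-memory streams as a stand-in for the socket (all names invented for the demo):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LengthPrefixDemo {
    // Writer side: prefix the encoded bytes with their exact length.
    static byte[] encode(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        out.writeInt(encoded.length);
        out.write(encoded);
        return buffer.toByteArray();
    }

    // Reader side: read exactly the announced number of bytes, no over-reading.
    static String decode(byte[] packet) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(packet));
        byte[] data = new byte[in.readInt()];
        in.readFully(data); // throws EOFException if the data was truncated
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(decode(encode("hello"))); // hello
    }
}
```

Both sides agree on the encoding and the prefix size, so no terminator byte is needed and truncation is detectable.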
You can always start at the end of the byte array and go backwards until you hit the first non-zero byte. Then just copy that part into a new byte array and String it. Hope this helps:
byte[] foo = {28, 6, 3, 45, 0, 0, 0, 0};
int i = foo.length - 1;
while (i >= 0 && foo[i] == 0) { // guard against an all-zero buffer
    i--;
}
byte[] bar = Arrays.copyOf(foo, i + 1);
String myString = new String(bar, "UTF-8");
System.out.println(myString.length());
Will give you a result of 4.
Strings in Java aren't terminated with a 0, as in some other languages. A 0 byte gets turned into the so-called null character, which is allowed to appear in a String. I suggest a trimming scheme that either detects the index of the first 0 in the array and uses a sub-array to construct the String (assuming all the rest will be 0 after that), or just constructs the String and calls trim(). That'll remove leading and trailing whitespace, which is any character with ASCII code 32 or lower.
The latter won't work if you have leading whitespace you must preserve. Using a StringBuilder and deleting characters at the end as long as they're the null character would work better in that case.
It appears to me that you are ignoring the read-count returned by the read() method. The trailing null bytes probably weren't sent, they are probably still left over from the initial state of the buffer.
int count = in.read(buffer);
if (count < 0) {
    // EOS: close the socket etc.
} else {
    String s = new String(buffer, 0, count);
}
Not diving into the protocol considerations the original OP mentioned, how about this for trimming the trailing zeroes?
public static String bytesToString(byte[] data) {
    StringBuilder dataOut = new StringBuilder();
    for (int i = 0; i < data.length; i++) {
        if (data[i] != 0x00) {
            dataOut.append((char) data[i]); // note: this drops all zero bytes, not just trailing ones
        }
    }
    return dataOut.toString();
}