I am working with a ByteArrayInputStream that contains an XML document consisting of one element with a large base 64 encoded string as the content of the element. I need to remove the surrounding tags so I can decode the text and output it as a pdf document.
What is the most efficient way to do this?
My knee-jerk reaction is to read the stream into a byte array, find the end of the start tag, find the beginning of the end tag and then copy the middle part into another byte array; but this seems rather inefficient and the text I am working with can be large at times (128KB). I would like a way to do this without the extra byte arrays.
Base64 does not use the characters < or > (nor &), in either the standard or the web-safe variant, so I'm assuming you do not need to worry about XML entities or comments inside the content.
If you are really sure that the content has this form, then do the following:
Scan from the right looking for a '<'. This will be the beginning of the close tag.
Scan left from that position looking for a '>'. This will be the end of the start tag.
The base 64 content is between those two positions, exclusive.
You can presize your second array by using
((end - start + 3) / 4) * 3
as an upper bound on the decoded content length, and then b64decode into it. This works because each 4 base64 digits encodes 3 bytes.
If you want to get really fancy, since you know the first few bytes of the array contain ignorable tag data and the encoded data is smaller than the input, you could destructively decode the data over your current byte buffer.
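The scan described above could be sketched as follows, using the java.util.Base64 class from Java 8 and later. The class and method names are made up for illustration, and the sketch assumes the element holds nothing but Base64 text. It wraps the region between the tags in a ByteBuffer instead of copying it, although the decoder still allocates the output.

```java
import java.nio.ByteBuffer;
import java.util.Base64;

final class Base64ElementExtractor {

    // Hypothetical helper: decodes the Base64 text between the end of the
    // start tag and the beginning of the end tag.
    static byte[] decodeElementContent(byte[] xml) {
        // Scan from the right for '<': the beginning of the close tag.
        int end = xml.length - 1;
        while (end >= 0 && xml[end] != '<') {
            end--;
        }
        // Scan left from that position for '>': the end of the start tag.
        int start = end - 1;
        while (start >= 0 && xml[start] != '>') {
            start--;
        }
        if (start < 0) {
            throw new IllegalArgumentException("no start/end tag pair found");
        }
        // Wrap the region between the two positions (exclusive) without copying it.
        ByteBuffer encoded = ByteBuffer.wrap(xml, start + 1, end - start - 1);
        // The MIME decoder ignores line breaks that encoders often insert.
        ByteBuffer decoded = Base64.getMimeDecoder().decode(encoded);
        byte[] result = new byte[decoded.remaining()];
        decoded.get(result);
        return result;
    }
}
```

The returned bytes can then be written straight to the PDF file or response stream.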
Do your search and conversion while you are reading the stream.
// find the start tag
byte[] startTag = new byte[]{'<', 't', 'a', 'g', '>'};
int fnd = 0;
int tmp = 0;
while((tmp = stream.read()) != -1) {
    if(tmp == startTag[fnd])
        fnd++;
    else
        fnd = (tmp == startTag[0]) ? 1 : 0; // the mismatching byte may itself start the tag
    if(fnd == startTag.length) break;
}
// get base64 bytes
while(true) {
int a = stream.read();
int b = stream.read();
int c = stream.read();
int d = stream.read();
byte o1,o2,o3; // output bytes
if(a == -1 || a == '<') break;
//
...
outputStream.write(o1);
outputStream.write(o2);
outputStream.write(o3);
}
Note: the above was written in my web browser, so syntax errors may exist.
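For comparison, here is a sketch of the same streaming idea that leaves the Base64 arithmetic to java.util.Base64 (Java 8+) instead of decoding by hand. The names are hypothetical, and it assumes the stream starts directly with the element's start tag (no XML declaration or comment before it) and that the content contains no '<'.

```java
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Base64;

final class StreamingBase64Element {

    // Hypothetical helper: copies the decoded element content from in to out.
    static void decodeElement(InputStream in, OutputStream out) throws IOException {
        int b;
        while ((b = in.read()) != -1 && b != '>') {
            // skip the start tag
        }
        if (b == -1) {
            throw new EOFException("no start tag found");
        }
        // Pretend the stream ends at the '<' that opens the end tag.
        InputStream untilEndTag = new FilterInputStream(in) {
            private boolean done;

            @Override
            public int read() throws IOException {
                if (done) {
                    return -1;
                }
                int c = super.read();
                if (c == -1 || c == '<') {
                    done = true;
                    return -1;
                }
                return c;
            }

            @Override
            public int read(byte[] buf, int off, int len) throws IOException {
                if (len == 0) {
                    return 0;
                }
                int n = 0;
                while (n < len) {
                    int c = read();
                    if (c == -1) {
                        break;
                    }
                    buf[off + n++] = (byte) c;
                }
                return n == 0 ? -1 : n;
            }
        };
        // The MIME decoder tolerates line breaks inside the Base64 text.
        InputStream decoded = Base64.getMimeDecoder().wrap(untilEndTag);
        byte[] chunk = new byte[4096];
        int n;
        while ((n = decoded.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
    }
}
```

Only the 4 KB chunk buffer is allocated, whatever the size of the document.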
I want to split a String into a String[] array whose elements meet the following conditions.
s.getBytes(encoding).length should not exceed maxsize (an int).
If I join the split strings with StringBuilder or the + operator, the result should be exactly the original string.
The input string may have unicode characters which can have multiple bytes when encoded in e.g. UTF-8.
The desired prototype is shown below.
public static String[] SplitStringByByteLength(String src,String encoding, int maxsize)
And the testing code:
public boolean isNice(String str, String encoding, int max) throws UnsupportedEncodingException
{
    //boolean success=true;
    StringBuilder b=new StringBuilder();
    String[] splitted= SplitStringByByteLength(str,encoding,max);
    for(String s: splitted)
    {
        if(s.getBytes(encoding).length>max)
            return false;
        b.append(s);
    }
    if(str.compareTo(b.toString())!=0)
        return false;
    return true;
}
Though it seems easy when the input string has only ASCII characters, the fact that it could contain multibyte characters confuses me.
Thank you in advance.
Edit: I added my code implementation. (Inefficient)
public static String[] SplitStringByByteLength(String src,String encoding, int maxsize) throws UnsupportedEncodingException
{
    ArrayList<String> splitted=new ArrayList<String>();
    StringBuilder builder=new StringBuilder();
    for(int i=0; i<src.length(); ++i)
    {
        String tmp=builder.toString();
        char c=src.charAt(i);
        builder.append(c);
        if(builder.toString().getBytes(encoding).length>maxsize)
        {
            // the chunk without c still fits: store it and start the next chunk with c
            splitted.add(tmp);
            builder=new StringBuilder();
            builder.append(c);
        }
    }
    if(builder.length()>0)
        splitted.add(builder.toString()); // last, possibly shorter, chunk
    return splitted.toArray(new String[splitted.size()]);
}
Is this the only way to solve this problem?
The class CharsetEncoder has provision for your requirement. Extract from the Javadoc of the encode method:
public final CoderResult encode(CharBuffer in,
ByteBuffer out,
boolean endOfInput)
Encodes as many characters as possible from the given input buffer, writing the results to the given output buffer...
In addition to reading characters from the input buffer and writing bytes to the output buffer, this method returns a CoderResult object to describe its reason for termination:
...
CoderResult.OVERFLOW indicates that there is insufficient space in the output buffer to encode any more characters. This method should be invoked again with an output buffer that has more remaining bytes. This is typically done by draining any encoded bytes from the output buffer.
A possible implementation:
public static String[] SplitStringByByteLength(String src,String encoding, int maxsize) {
Charset cs = Charset.forName(encoding);
CharsetEncoder coder = cs.newEncoder();
ByteBuffer out = ByteBuffer.allocate(maxsize); // output buffer of required size
CharBuffer in = CharBuffer.wrap(src);
List<String> ss = new ArrayList<>(); // a list to store the chunks
int pos = 0;
while(true) {
CoderResult cr = coder.encode(in, out, true); // try to encode as much as possible
int newpos = src.length() - in.length();
String s = src.substring(pos, newpos);
ss.add(s); // add what has been encoded to the list
pos = newpos; // store new input position
out.rewind(); // and rewind output buffer
if (! cr.isOverflow()) {
break; // everything has been encoded
}
}
return ss.toArray(new String[0]);
}
This will split the original string in chunks that when encoded in bytes fit as much as possible in byte arrays of the given size (assuming of course that maxsize is not ridiculously small).
The problem lies in the existence of Unicode "supplementary characters" (see Javadoc of the Character class), that take up two "character places" (a surrogate pair) in a String, and you shouldn't split your String in the middle of such a pair.
An easy approach to splitting is to assume the worst case, in which a single Unicode code point takes four bytes in UTF-8, and to split the string after a fixed number of code points: for a limit of 400 bytes, for example, after every 99 code points (using string.offsetByCodePoints(pos, 99)). In most cases you won't fill the 400 bytes, but you'll be on the safe side.
Some words about code points and characters
When Java started, Unicode had fewer than 65536 characters, so Java decided that 16 bits were enough for a character. Later the Unicode standard exceeded the 16-bit limit, and Java had a problem: a single Unicode element (now called a "code point") no longer fit into a single Java character.
They decided to go for an encoding into 16-bit entities, being 1:1 for most usual code points, and occupying two "characters" for the exotic code points beyond the 16-bit limit (the pair built from so-called "surrogate characters" from a spare code range below 65535). So now it can happen that e.g. string.charAt(5) and string.charAt(6) must be seen in combination, as a "surrogate pair", together encoding one Unicode code point.
That's the reason why you shouldn't split a string at an arbitrary index.
To help the application programmer, the String class then got a new set of methods, working in code point units, and e.g. string.offsetByCodePoints(pos, 99) means: from the index pos, advance by 99 code points forward, giving an index that will often be pos+99 (in case the string doesn't contain anything exotic), but might be up to pos+198, if all the following string elements happen to be surrogate pairs.
Using the code-point methods, you are safe not to land in the middle of a surrogate pair.
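A minimal sketch of the worst-case approach, assuming UTF-8 (at most four bytes per code point). The class and method names are hypothetical. It advances by code point with Character.charCount, which does what offsetByCodePoints does but stops at the end of the string instead of throwing when fewer code points remain.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointSplitter {

    // Hypothetical helper: splits src so that each chunk needs at most
    // maxBytes bytes in UTF-8, never cutting through a surrogate pair.
    static List<String> splitByCodePoints(String src, int maxBytes) {
        int perChunk = maxBytes / 4; // worst case: 4 bytes per code point
        if (perChunk < 1) {
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        List<String> chunks = new ArrayList<>();
        int pos = 0;
        while (pos < src.length()) {
            int end = pos;
            int count = 0;
            // Advance one whole code point (1 or 2 chars) at a time.
            while (end < src.length() && count < perChunk) {
                end += Character.charCount(src.codePointAt(end));
                count++;
            }
            chunks.add(src.substring(pos, end));
            pos = end;
        }
        return chunks;
    }
}
```

The chunks are usually smaller than they need to be, which is the price of not encoding anything while splitting.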
So I just wrote a program that reads a specific file and returns the frequency of each character used. This was done using a singly linked list (not Java's LinkedList, but very similar). What I want to know is why this:
while(txtFile.read() != -1){
Character letter = (char) txtFile.read();
freqBag.add(Character.toLowerCase(letter));
}
doesn't work (it doesn't return the correct frequency of the given character), and why this:
int c;
while((c = txtFile.read()) != -1){
Character letter = (char) c;
freqBag.add(Character.toLowerCase(letter));
}
works. I wrote the first one, and a friend helped me fix it.
It doesn't work because you're discarding characters. Each read() call returns the next character (as an int, or -1 at the end of the stream), so your code is dropping every other character (those at positions 0, 2, 4...).
while(txtFile.read() != -1){ // Read and discard a character
Character letter = (char) txtFile.read(); // Read a character into letter
freqBag.add(Character.toLowerCase(letter)); // Store this letter
}
Your friend's code shouldn't be working either:
int c; // variable outside the loop
while((c = txtFile.read()) != -1){ // Read a character into c, compare to -1
Character letter = (char) txtFile.read(); // Read another character
freqBag.add(Character.toLowerCase(letter)); // Store this letter
}
The correct method would be to read just once:
int c;
while((c = txtFile.read()) != -1) {
freqBag.add(Character.toLowerCase((char)c));
}
I suspect either you have a typo, or you used a different file and didn't realize that letters were still being dropped.
First of all, keep in mind that every call to the read method consumes one byte from the file, so if you call it inside your while condition and ignore the result, you lose that byte.
Second, as far as I can tell (considering operator precedence), these two pieces of code do exactly the same thing, so the problem might be in another part of the code.
I have a reader which receives message packets as a stream (ByteArrayInputStream).
Each packet contains data consisting of English characters followed by binary digits.
adghfjiyromn1000101010100......
What is the most efficient way to copy (not strip) the characters out of this stream as a sequence?
So the expected output for the above packet would be (without modifying the original stream):
adghfjiyromn
I am concerned not only with the logic but also with the exact stream manipulation routines to use, considering that the reader would hypothetically read about 3-4 packets every second.
It would also help to explain why a particular data type (byte[], char[] or String) would be preferable for this.
I think the best way is to read the ByteArrayInputStream byte by byte:
ByteArrayInputStream msg = ...
int c;
StringBuilder s = new StringBuilder();
while ((c = msg.read()) != -1) {
    char x = (char) c;
    if (x == '1' || x == '0') break;
    s.append(x);
}
I think this is the best way:
1- Convert your ByteArrayInputStream to a String (or StringBuffer).
2- Find the first index of 0 or 1.
3- Use substring(0, FIRST_INDEX).
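Those three steps might look like the sketch below. The names are hypothetical, and it assumes the packet bytes are single-byte characters (ISO-8859-1 is used so every byte maps to one char). Since the question asks not to modify the original stream, the sketch uses mark() and reset(), which ByteArrayInputStream supports, to restore the read position afterwards.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

final class PacketPrefix {

    // Hypothetical helper: returns the characters before the first '0' or '1'
    // and leaves the stream positioned where it was.
    static String leadingLetters(ByteArrayInputStream msg) {
        msg.mark(0); // the read limit is ignored by ByteArrayInputStream
        byte[] all = new byte[msg.available()];
        int n = Math.max(msg.read(all, 0, all.length), 0); // -1 if empty
        msg.reset(); // rewind so the caller still sees the whole packet

        // Step 1: convert to a String.
        String text = new String(all, 0, n, StandardCharsets.ISO_8859_1);
        // Step 2: find the first index of '0' or '1'.
        int cut = 0;
        while (cut < text.length() && text.charAt(cut) != '0' && text.charAt(cut) != '1') {
            cut++;
        }
        // Step 3: take the substring before it.
        return text.substring(0, cut);
    }
}
```

At 3-4 packets per second the extra copy is unlikely to matter; reading byte by byte and stopping at the first digit avoids it if it ever does.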
You : Each packet contains data consisting of English characters followed by binary digits.
Me : Data is in bytearrayinputstream, hence everything is in binary.
Are your 1000101010100...... the characters '1' and '0'?
If yes:
ByteArrayInputStream msg = //whatever
int totalBytes = msg.available();
int c;
while ((c = msg.read())!= -1) {
char x = (char) c;
if (x=='1' || x=='0') break;
}
int currentPos = msg.available() + 1; //you need to unread the 1st 0 or 1
System.out.println("Position = "+(totalBytes-currentPos));
I want to convert bytes into a String.
I have an Android application that uses a flat file for data storage.
Suppose I have lots of records in my flat file.
In this flat-file database the record size is fixed at 10 characters, and I store lots of String records in sequence.
So when I read one record from the flat file, I get a fixed number of bytes, because I wrote 10 bytes for every record.
If my string is S="abc123";
then it is stored in the flat file as the ASCII values of abc123, and the rest of the record is 0.
That means the byte array is [97, 98, 99, 49, 50, 51, 0, 0, 0, 0].
To get my actual string back from the byte array, I use the code below, and it works fine.
But when I give inputString = "1234567890", it causes a problem.
public class MainActivity extends Activity {
public static short messageNumb = 0;
public static short appID = 16;
@Override
protected void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_main);
// record with size 10 and its in bytes.
byte[] recordBytes = new byte[10];
// fill record by 0's
Arrays.fill(recordBytes, (byte) 0);
// input string
String inputString = "abc123";
int length = 0;
int SECTOR_LENGTH = 10;
// convert in bytes
byte[] inputBytes = inputString.getBytes();
// set how many bytes we have to write.
length = SECTOR_LENGTH < inputBytes.length ? SECTOR_LENGTH
: inputBytes.length;
// copy bytes in record size.
System.arraycopy(inputBytes, 0, recordBytes, 0, length);
// Here i write this record in the file.
// Now time to read record from the file.
// Suppose i read one record from the file successfully.
// convert this read bytes to string which we wrote.
Log.d("TAG", "String is = " + getStringFromBytes(recordBytes));
}
public String getStringFromBytes(byte[] inputBytes) {
String s;
s = new String(inputBytes);
return s = s.substring(0, s.indexOf(0));
}
}
But I get a problem when my string has the full 10 characters. In that case I have no 0's in my byte array, so in this line
s = s.substring(0, s.indexOf(0));
I am getting the below exception:
java.lang.StringIndexOutOfBoundsException: length=10; regionStart=0; regionLength=-1
at java.lang.String.startEndAndLength(String.java:593)
at java.lang.String.substring(String.java:1474)
So what can I do when my string length is 10?
I have two solutions: I can check whether inputBytes.length == 10 and skip the substring call in that case, or I can check whether the byte array contains a 0.
But I don't want to use this solution, because I use this code in lots of places in my application. So is there any other way to achieve this?
Please suggest a good solution that works in every condition. I think the second solution would be best in the end (check whether the byte array contains 0's and only then apply the substring function).
public String getStringFromBytes(byte[] inputBytes) {
String s;
s = new String(inputBytes);
int zeroIndex = s.indexOf(0);
return zeroIndex < 0 ? s : s.substring(0, zeroIndex);
}
I think this line causes the error:
s = s.substring(0, s.indexOf(0));
s.indexOf(0)
returns -1; perhaps you should specify the ASCII code
for zero, which is 48,
so this will work: s = s.substring(0, s.indexOf(48));
Check the documentation for indexOf(int):
public int indexOf (int c)
Since: API Level 1
Searches in this string for the first index of the specified character. The search for the character starts at the beginning and moves towards the end of this string.
Parameters: c, the character to find.
Returns: the index in this string of the specified character, -1 if the character isn't found.
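To make the documented -1 return value concrete, here is a small sketch (the class and method names are made up for illustration). Code 0 is the NUL character used as padding in the record, whereas code 48 is the digit '0', so searching for 48 in a record such as "1234567890" finds the last digit and not the padding. Checking for -1 before calling substring handles a full-length record.

```java
final class IndexOfDemo {

    // Hypothetical helper: cuts the string at the first NUL character, if any.
    static String untilNul(String s) {
        int zeroIndex = s.indexOf(0); // code 0 is NUL, not the digit '0'
        return zeroIndex < 0 ? s : s.substring(0, zeroIndex);
    }

    public static void main(String[] args) {
        System.out.println("1234567890".indexOf(0));  // prints -1: no NUL present
        System.out.println("1234567890".indexOf(48)); // prints 9: index of the digit '0'
        System.out.println(untilNul("abc123\0\0\0\0")); // prints abc123
        System.out.println(untilNul("1234567890"));     // prints 1234567890
    }
}
```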
I receive from a socket a string in a byte array which looks like:
[128,5,6,3,45,0,0,0,0,0]
The size given by the network protocol is the total length of the string (including zeros), so 10 in my example.
If I simply do:
String myString = new String(myBuffer);
I get 5 incorrect characters at the end of the string. The conversion doesn't seem to detect the end-of-string character (0).
To get the correct size and the correct string I do this:
int sizeLabelTmp = 0;
//Iterate over the 10 bytes to get the real size of the string
for(int j = 0; j<(sizeLabel); j++) {
byte charac = datasRec[j];
if(charac == 0)
break;
sizeLabelTmp ++;
}
// Create a temp byte array to make a correct conversion
byte[] label = new byte[sizeLabelTmp];
for(int j = 0; j<(sizeLabelTmp); j++) {
label[j] = datasRec[j];
}
String myString = new String(label);
Is there a better way to handle the problem?
Thanks
It may be too late, but this may help others. The simplest thing you can do is new String(myBuffer).trim(), which strips the trailing zero bytes along with any leading or trailing whitespace.
0 isn't an "end of string character". It's just a byte. Whether or not it only comes at the end of the string depends on what encoding you're using (and what the text can be). For example, if you used UTF-16, every other byte would be 0 for ASCII characters.
If you're sure that the first 0 indicates the end of the string, you can use something like the code you've given, but I'd rewrite it as:
int size = 0;
while (size < data.length)
{
if (data[size] == 0)
{
break;
}
size++;
}
// Specify the appropriate encoding as the last argument
String myString = new String(data, 0, size, "UTF-8");
I strongly recommend that you don't just use the platform default encoding - it's not portable, and may well not allow for all Unicode characters. However, you can't just decide arbitrarily - you need to make sure that everything producing and consuming this data agrees on the encoding.
If you're in control of the protocol, it would be much better if you could introduce a length prefix before the string, to indicate how many bytes are in the encoded form. That way you'd be able to read exactly the right amount of data (without "over-reading") and you'd be able to tell if the data was truncated for some reason.
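A length prefix could be sketched like this, assuming a 4-byte big-endian length (as written by DataOutputStream.writeInt) followed by the UTF-8 bytes. The class and method names are hypothetical, and both sides of the protocol would have to agree on this framing.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class LengthPrefixedStrings {

    // Writes the number of encoded bytes, then the bytes themselves.
    static void write(DataOutputStream out, String s) throws IOException {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(encoded.length);
        out.write(encoded);
    }

    // Reads exactly as many bytes as the prefix announces.
    static String read(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("negative length prefix: " + length);
        }
        byte[] encoded = new byte[length];
        in.readFully(encoded); // throws EOFException if the data was truncated
        return new String(encoded, StandardCharsets.UTF_8);
    }
}
```

A real protocol would probably also cap the accepted length so that a corrupt prefix cannot trigger a huge allocation.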
You can always start at the end of the byte array and go backwards until you hit the first non-zero byte. Then just copy that range into a new byte array and build a String from it. Hope this helps:
byte[] foo = {28,6,3,45,0,0,0,0};
int i = foo.length - 1;
while (i >= 0 && foo[i] == 0) // the bounds check covers an all-zero array
{
i--;
}
byte[] bar = Arrays.copyOf(foo, i+1);
String myString = new String(bar, "UTF-8");
System.out.println(myString.length());
Will give you a result of 4.
Strings in Java aren't ended with a 0, like in some other languages. 0 will get turned into the so-called null character, which is allowed to appear in a String. I suggest you use some trimming scheme that either detects the first index of the array that's a 0 and uses a sub-array to construct the String (assuming all the rest will be 0 after that), or just construct the String and call trim(). That'll remove leading and trailing whitespace, which is any character with ASCII code 32 or lower.
The latter won't work if you have leading whitespace you must preserve. Using a StringBuilder and deleting characters at the end as long as they're the null character would work better in that case.
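The StringBuilder variant could look like this minimal sketch (names hypothetical). It assumes the padding decodes to NUL characters, which holds for UTF-8 and other ASCII-compatible encodings, and it leaves leading whitespace untouched.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class NulTrimmer {

    // Hypothetical helper: decodes data and removes only trailing NUL characters.
    static String stripTrailingNuls(byte[] data, Charset charset) {
        StringBuilder sb = new StringBuilder(new String(data, charset));
        while (sb.length() > 0 && sb.charAt(sb.length() - 1) == '\0') {
            sb.setLength(sb.length() - 1); // drop the last character
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = {' ', 'a', 'b', 0, 0};
        String s = stripTrailingNuls(data, StandardCharsets.UTF_8);
        System.out.println("[" + s + "]"); // prints [ ab]
    }
}
```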
It appears to me that you are ignoring the read-count returned by the read() method. The trailing null bytes probably weren't sent, they are probably still left over from the initial state of the buffer.
int count = in.read(buffer);
if (count < 0) {
    // EOS: close the socket etc
} else {
    String s = new String(buffer, 0, count);
}
Leaving aside the protocol considerations that the OP mentioned, how about this for trimming the trailing zeroes?
public static String bytesToString(byte[] data) {
    StringBuilder dataOut = new StringBuilder();
    for (int i = 0; i < data.length; i++) {
        // note: this skips every zero byte, not only the trailing ones
        if (data[i] != 0x00)
            dataOut.append((char) data[i]);
    }
    return dataOut.toString();
}