I searched for an example of how to compress a string in Java.
I have a function to compress then uncompress. The compress seems to work fine:
public static String encStage1(String str)
{
String format1 = "ISO-8859-1";
String format2 = "UTF-8";
if (str == null || str.length() == 0)
{
return str;
}
System.out.println("String length : " + str.length());
ByteArrayOutputStream out = new ByteArrayOutputStream();
String outStr = null;
try
{
GZIPOutputStream gzip = new GZIPOutputStream(out);
gzip.write(str.getBytes());
gzip.close();
outStr = out.toString(format2);
System.out.println("Output String lenght : " + outStr.length());
} catch (Exception e)
{
e.printStackTrace();
}
return outStr;
}
But the reverse is complaining about the string not being in GZIP format, even when I pass the return from encStage1 straight back into the decStage3:
public static String decStage3(String str)
{
if (str == null || str.length() == 0)
{
return str;
}
System.out.println("Input String length : " + str.length());
String outStr = "";
try
{
String format1 = "ISO-8859-1";
String format2 = "UTF-8";
GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(str.getBytes(format2)));
BufferedReader bf = new BufferedReader(new InputStreamReader(gis, format2));
String line;
while ((line = bf.readLine()) != null)
{
outStr += line;
}
System.out.println("Output String lenght : " + outStr.length());
} catch (Exception e)
{
e.printStackTrace();
}
return outStr;
}
I get this error when I call it with a string returned from encStage1:
public String encIDData(String idData)
{
String tst = "A simple test string";
System.out.println("Enc 0: " + tst);
String stg1 = encStage1(tst);
System.out.println("Enc 1: " + toHex(stg1));
String dec1 = decStage3(stg1);
System.out.println("unzip: " + toHex(dec1));
}
Output/Error:
Enc 0: A simple test string
String length : 20
Output String lenght : 40
Enc 1: 1fefbfbd0800000000000000735428efbfbdefbfbd2defbfbd495528492d2e51282e29efbfbdefbfbd4b07005aefbfbd21efbfbd14000000
Input String length : 40
java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
A small error is:
gzip.write(str.getBytes());
takes the default platform encoding, which on Windows will never be ISO-8859-1. Better:
gzip.write(str.getBytes(format1));
You could consider using "Cp1252", Windows Latin-1 (for some European languages), instead of "ISO-8859-1", Latin-1. It adds curly (comma-like) quotes and similar characters.
The major error is converting the compressed bytes to a String. Java separates binary data (byte[], InputStream, OutputStream) from text (String, char, Reader, Writer), which internally is always kept in Unicode. An arbitrary byte sequence is not necessarily valid UTF-8, so the conversion is lossy. You might get away with converting the bytes using a single-byte encoding (ISO-8859-1, for instance).
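To see why the round trip through a String fails, here is a minimal sketch (the class and method names are illustrative and not from the question). It pushes the first three bytes of a GZIP header through UTF-8 and through ISO-8859-1. With UTF-8, the lone byte 0x8b is malformed, gets replaced by U+FFFD, and comes back as EF BF BD, which matches the `1fefbfbd08` at the start of the hex dump in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {

    // Decode the bytes to a String with the given charset, then encode back.
    public static byte[] roundTrip(byte[] data, Charset cs) {
        return new String(data, cs).getBytes(cs);
    }

    // Lower-case hex dump, two digits per byte.
    public static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] gzipHeader = {(byte) 0x1f, (byte) 0x8b, (byte) 0x08};
        // UTF-8 is lossy for arbitrary bytes: prints 1fefbfbd08
        System.out.println(toHex(roundTrip(gzipHeader, StandardCharsets.UTF_8)));
        // ISO-8859-1 maps every byte to one char and back: prints 1f8b08
        System.out.println(toHex(roundTrip(gzipHeader, StandardCharsets.ISO_8859_1)));
    }
}
```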
The best way would be
gzip.write(str.getBytes(StandardCharsets.UTF_8));
So you have full Unicode, every script may be combined.
Then uncompress to a ByteArrayOutputStream and build the text with new String(baos.toByteArray(), StandardCharsets.UTF_8).
Using a BufferedReader on an InputStreamReader with UTF-8 is okay too, but readLine throws away the newline characters, so they have to be added back:
outStr += line + "\r\n"; // Or so.
Clean answer:
public static byte[] encStage1(String str) throws IOException
{
try (ByteArrayOutputStream out = new ByteArrayOutputStream())
{
try (GZIPOutputStream gzip = new GZIPOutputStream(out))
{
gzip.write(str.getBytes(StandardCharsets.UTF_8));
}
return out.toByteArray();
//return out.toString(StandardCharsets.ISO_8859_1);
// Some single byte encoding
}
}
public static String decStage3(byte[] str) throws IOException
{
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try (GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(str)))
{
int b;
while ((b = gis.read()) != -1) {
baos.write((byte) b);
}
}
return new String(baos.toByteArray(), StandardCharsets.UTF_8);
}
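A round trip through this kind of byte[]-based API could be exercised as in the sketch below. It rests on some assumptions: the class name GzipRoundTrip and the method names are made up for illustration, the stream is read in 4 KiB chunks rather than byte by byte, and IOException is wrapped in UncheckedIOException to keep the calls short. Because no readLine is involved, line breaks survive unchanged.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {

    // Text -> UTF-8 bytes -> gzip bytes. The result stays a byte[].
    public static byte[] compress(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed here, so the trailer has been written.
        return out.toByteArray();
    }

    // gzip bytes -> UTF-8 bytes -> text, read in chunks.
    public static String decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gis =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gis.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "A simple test string\r\nwith a second line";
        String restored = decompress(compress(original));
        System.out.println(original.equals(restored));
    }
}
```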
Using toString/getBytes to encode or decode binary data is the wrong approach. Try something like Base64 encoding for this purpose (java.util.Base64 in JDK 1.8).
As proof, try this simple test, which fails because the byte 0xff does not survive the round trip through a UTF-8 String:
import org.testng.annotations.Test;
import java.io.ByteArrayOutputStream;
import static org.testng.Assert.assertEquals;
public class SimpleTest {
@Test
public void test() throws Exception {
final String CS = "utf-8";
byte[] b0 = {(byte) 0xff};
ByteArrayOutputStream out = new ByteArrayOutputStream();
out.write(b0);
out.close();
byte[] b1 = out.toString(CS).getBytes(CS);
assertEquals(b0, b1);
}
}
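If the compressed data really has to travel as a String, the Base64 suggestion above could look like the following sketch. The class and method names are hypothetical, and only java.util.Base64 and the java.util.zip streams are used. Base64 output is plain ASCII, so it can be stored or transmitted as text without corrupting the underlying bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBase64 {

    // Compress the text and return the gzip bytes as a Base64 String.
    public static String compressToBase64(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    // Reverse: Base64 String -> gzip bytes -> original text.
    public static String decompressFromBase64(String encoded) {
        byte[] compressed = Base64.getDecoder().decode(encoded);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gis =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gis.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = compressToBase64("A simple test string");
        System.out.println(encoded);
        System.out.println(decompressFromBase64(encoded));
    }
}
```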
Related
I am receiving a file in Shift_JIS encoding. It has Japanese characters with shift-in (SI) and shift-out (SO) characters at the beginning and end of each multi-byte string.
I have to convert this file to UTF-8 and remove the SI and SO characters from the UTF-8 output. What is the best way to do this? Should I remove them before or after the UTF-8 conversion, and how do I remove them? Thanks in advance.
My Java code is below:
public static void main(String[] args) throws Exception {
// TODO Auto-generated method stub
String inFilePath = "src\\encoding\\input\\dfd02.PGP_dec";
String filePath = "src\\encoding\\output\\";
String utf8FileNm = "utf8-out.txt";
String charsetName = "x-SJIS_0213";
InputStream in;
try {
in = new FileInputStream(inFilePath);
Reader reader = new InputStreamReader(in, charsetName);
StringBuilder sb = new StringBuilder();
int read;
while ((read = reader.read()) != -1){
sb.append((char)read);
}
reader.close();
String string = sb.toString();
OutputStream out = new FileOutputStream(filePath + charsetName + "-" + utf8FileNm);
Writer writer = new OutputStreamWriter(out, "UTF-8");
writer.write(string);
writer.close();
System.out.println("Finished writing the input file in UTF-8 format");
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
I have some code to gunzip a compressed string, as below:
public static String decompress(String compressedString) throws IOException {
byte[] byteCompressed = compressedString.getBytes(StandardCharsets.UTF_8);
final StringBuilder outStr = new StringBuilder();
if ((byteCompressed == null) || (byteCompressed.length == 0)) {
return "";
}
if (isCompressed(byteCompressed)) {
final GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(byteCompressed));
final BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(gis, "UTF-8"));
String line;
while ((line = bufferedReader.readLine()) != null) {
outStr.append(line);
}
} else {
outStr.append(byteCompressed);
}
return outStr.toString();
}
public static boolean isCompressed(final byte[] compressed) {
return (compressed[0] == (byte) (GZIPInputStream.GZIP_MAGIC)) && (compressed[1] == (byte) (GZIPInputStream.GZIP_MAGIC >> 8));
}
I use this code to uncompress a String such as:
H4sIAAAAAAAAAHNJLQtJLS4BALwLiloHAAAA
But the code returns an unexpected String, although the same input uncompresses fine with online tools.
Can anyone help me with the right decompression code? Thanks.
Your string is base64 encoded gzip data, so you need to base64 decode it, instead of trying to encode it as UTF-8 bytes.
String input = "H4sIAAAAAAAAAHNJLQtJLS4BALwLiloHAAAA";
byte[] byteCompressed = Base64.getDecoder().decode(input);
// ... rest of your code
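Put together, the decode-then-gunzip steps could look like this sketch. The class name and the chunked read loop are my own choices and not part of the original code; the input string is the one from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;

public class Base64GzipDecode {

    public static String decompress(String base64) {
        // Step 1: Base64 text -> raw gzip bytes (not getBytes on the text).
        byte[] compressed = Base64.getDecoder().decode(base64);
        // Step 2: gunzip into a byte buffer.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gis =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gis.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Step 3: interpret the uncompressed bytes as UTF-8 text.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints "DevTest"
        System.out.println(decompress("H4sIAAAAAAAAAHNJLQtJLS4BALwLiloHAAAA"));
    }
}
```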
I want to split data based on character values: two right parentheses )) mark the start of a substring and a carriage return (CR) marks its end. The data comes in the form of bytes, and I am stuck on how to split it. This is what I have come up with so far.
public class ByteDecoder {
public static void main(String[] args) throws IOException {
InputStream is = null;
DataInputStream dis = null;
try{
is = new FileInputStream("byte.log");
dis = new DataInputStream(is);
int count = is.available();
byte[] bs = new byte[count];
dis.read(bs);
for (byte b:bs)
{
char c = (char)b;
System.out.println(c);
//convert bytes to hex string
// String c = DatatypeConverter.printHexBinary( bs);
}
}catch(Exception e){
e.printStackTrace();
}finally{
if(is!=null)
is.close();
if(dis!=null)
dis.close();
}
}
}
Using CR (unlucky 13) as the end marker of binary data might be a bit dangerous, since a data byte can have the same value. More dangerous still is how the text and bytes were written: the text must have been written as bytes in some encoding.
But considering that, one could wrap the FileInputStream in your own ByteLogInputStream, and there hold the reading state:
/**
* An InputStream converting bytes between ASCII "))" and CR to hexadecimal.
* Typically wrapped as:
* <pre>
* try (BufferedReader in = new BufferedReader(
* new InputStreamReader(
* new ByteLogInputStream(
* new FileInputStream(file)), "UTF-8"))) {
* ...
* }
* </pre>
*/
public class ByteLogInputStream extends InputStream {
private enum State {
TEXT,
AFTER_RIGHT_PARENT,
BINARY
}
private final InputStream in;
private State state = State.TEXT;
private int nextHexDigit = 0;
public ByteLogInputStream(InputStream in) {
this.in = in;
}
@Override
public int read() throws IOException {
if (nextHexDigit != 0) {
int hex = nextHexDigit;
nextHexDigit = 0;
return hex;
}
int ch = in.read();
if (ch != -1) {
switch (state) {
case TEXT:
if (ch == ')') {
state = State.AFTER_RIGHT_PARENT;
}
break;
case AFTER_RIGHT_PARENT:
if (ch == ')') {
state = State.BINARY;
} else {
state = State.TEXT; // a single ')' does not start binary data
}
break;
case BINARY:
if (ch == '\r') {
state = State.TEXT;
} else {
String hex2 = String.format("%02X", ch);
ch = hex2.charAt(0);
nextHexDigit = hex2.charAt(1);
}
break;
}
}
return ch;
}
}
As one binary byte results in two hexadecimal digits, you need to buffer a nextHexDigit for the next digit.
I did not override available (to account for a possible nextHexDigit).
If you want to check whether \r\n follows, you could use a PushbackReader (or a PushbackInputStream at the byte level). I used an InputStream, as you did not specify the encoding.
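For comparison, when the whole log fits in memory, the same "))" to CR extraction can be sketched without a custom stream. This is an alternative and not the approach of the class above; the class and method names are hypothetical, and a segment that is still open at the end of the data is assumed to run to the end.

```java
import java.util.ArrayList;
import java.util.List;

public class ByteLogSplitter {

    // Returns each run of bytes between "))" and the next CR as a hex string.
    public static List<String> extractHexSegments(byte[] data) {
        List<String> segments = new ArrayList<>();
        int i = 0;
        while (i + 1 < data.length) {
            if (data[i] == ')' && data[i + 1] == ')') {
                int start = i + 2;
                int end = start;
                // Scan to the terminating CR, or to the end of the data.
                while (end < data.length && data[end] != '\r') {
                    end++;
                }
                StringBuilder hex = new StringBuilder();
                for (int k = start; k < end; k++) {
                    hex.append(String.format("%02X", data[k] & 0xff));
                }
                segments.add(hex.toString());
                i = end + 1; // continue after the CR
            } else {
                i++;
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        byte[] sample = {'x', ')', ')', 0x01, (byte) 0xAB, '\r',
                         'y', ')', ')', 'Z', '\r'};
        System.out.println(extractHexSegments(sample)); // prints [01AB, 5A]
    }
}
```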
I'm trying in Java to decode URL containing % encoded characters
I've tried using java.net.URI class to do the job, but it's not always working correctly.
String test = "https://fr.wikipedia.org/wiki/Fondation_Alliance_fran%C3%A7aise";
URI uri = new URI(test);
System.out.println(uri.getPath());
For the test String "https://fr.wikipedia.org/wiki/Fondation_Alliance_fran%C3%A7aise", the result is correct "/wiki/Fondation_Alliance_française" (%C3%A7 is correctly replaced by ç).
But for some other test strings, like "http://sv.wikipedia.org/wiki/Anv%E4ndare:Lsjbot/Statistik#Drosophilidae", it gives an incorrect result "/wiki/Anv�ndare:Lsjbot/Statistik" (%E4 is replaced by � instead of ä).
I did some testing with getRawPath() and URLDecoder class.
System.out.println(URLDecoder.decode(uri.getRawPath(), "UTF8"));
System.out.println(URLDecoder.decode(uri.getRawPath(), "ISO-8859-1"));
System.out.println(URLDecoder.decode(uri.getRawPath(), "WINDOWS-1252"));
Depending on the test String, I get correct results with different encodings:
For %C3%A7, I get a correct result with "UTF-8" encoding as expected, and incorrect results with "ISO-8859-1" or "WINDOWS-1252" encoding
For %E4, it's the opposite.
For both test URLs, I get the correct page if I put them in the Chrome address bar.
How can I correctly decode the URL in all situations?
Thanks for any help
==== Answer ====
Thanks to the suggestions in McDowell's answer below, it now seems to work. Here is the code I now have:
private static void appendBytes(ByteArrayOutputStream buf, String data) throws UnsupportedEncodingException {
byte[] b = data.getBytes("UTF8");
buf.write(b, 0, b.length);
}
private static byte[] parseEncodedString(String segment) throws UnsupportedEncodingException {
ByteArrayOutputStream buf = new ByteArrayOutputStream(segment.length());
int last = 0;
int index = 0;
while (index < segment.length()) {
if (segment.charAt(index) == '%') {
appendBytes(buf, segment.substring(last, index));
if ((index + 2 < segment.length()) &&
("ABCDEFabcdef0123456789".indexOf(segment.charAt(index + 1)) >= 0) &&
("ABCDEFabcdef0123456789".indexOf(segment.charAt(index + 2)) >= 0)) {
buf.write((byte) Integer.parseInt(segment.substring(index + 1, index + 3), 16));
index += 3;
} else if ((index + 1 < segment.length()) &&
(segment.charAt(index + 1) == '%')) {
buf.write((byte) '%');
index += 2;
} else {
buf.write((byte) '%');
index++;
}
last = index;
} else {
index++;
}
}
appendBytes(buf, segment.substring(last));
return buf.toByteArray();
}
private static String parseEncodedString(String segment, Charset... encodings) {
if ((segment == null) || (segment.indexOf('%') < 0)) {
return segment;
}
try {
byte[] data = parseEncodedString(segment);
for (Charset encoding : encodings) {
try {
if (encoding != null) {
return encoding.newDecoder().
onMalformedInput(CodingErrorAction.REPORT).
decode(ByteBuffer.wrap(data)).toString();
}
} catch (CharacterCodingException e) {
// Incorrect encoding, try next one
}
}
} catch (UnsupportedEncodingException e) {
// Nothing to do
}
return segment;
}
Anv%E4ndare
As PopoFibo says this is not a valid UTF-8 encoded sequence.
You can do some tolerant best-guess decoding:
public static String parse(String segment, Charset... encodings) {
byte[] data = parse(segment);
for (Charset encoding : encodings) {
try {
return encoding.newDecoder()
.onMalformedInput(CodingErrorAction.REPORT)
.decode(ByteBuffer.wrap(data))
.toString();
} catch (CharacterCodingException notThisCharset_ignore) {}
}
return segment;
}
private static byte[] parse(String segment) {
ByteArrayOutputStream buf = new ByteArrayOutputStream();
Matcher matcher = Pattern.compile("%([A-Fa-f0-9][A-Fa-f0-9])")
.matcher(segment);
int last = 0;
while (matcher.find()) {
appendAscii(buf, segment.substring(last, matcher.start()));
byte hex = (byte) Integer.parseInt(matcher.group(1), 16);
buf.write(hex);
last = matcher.end();
}
appendAscii(buf, segment.substring(last));
return buf.toByteArray();
}
private static void appendAscii(ByteArrayOutputStream buf, String data) {
byte[] b = data.getBytes(StandardCharsets.US_ASCII);
buf.write(b, 0, b.length);
}
This code will successfully decode the given strings:
for (String test : Arrays.asList("Fondation_Alliance_fran%C3%A7aise",
"Anv%E4ndare")) {
String result = parse(test, StandardCharsets.UTF_8,
StandardCharsets.ISO_8859_1);
System.out.println(result);
}
Note that this isn't some foolproof system that allows you to ignore correct URL encoding. It works here because v%E4n - the byte sequence 76 E4 6E - is not a valid sequence as per the UTF-8 scheme and the decoder can detect this.
If you reverse the order of the encodings the first string can happily (but incorrectly) be decoded as ISO-8859-1.
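The effect of the ordering can be checked with a small sketch. The class and method names are illustrative; the decoder calls are the same ones used in the answer, with unmappable characters also set to REPORT. Strict UTF-8 decoding rejects the lone E4 byte, while ISO-8859-1 accepts any byte sequence, including the UTF-8 bytes C3 A7, which it turns into two wrong characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {

    // Returns the decoded text, or null if the bytes are invalid for the charset.
    public static String strictDecode(byte[] data, Charset cs) {
        try {
            return cs.newDecoder()
                     .onMalformedInput(CodingErrorAction.REPORT)
                     .onUnmappableCharacter(CodingErrorAction.REPORT)
                     .decode(ByteBuffer.wrap(data))
                     .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Bytes of "fran%C3%A7aise" and "Anv%E4ndare" after percent-decoding.
        byte[] utf8Bytes = {'f', 'r', 'a', 'n', (byte) 0xC3, (byte) 0xA7,
                            'a', 'i', 's', 'e'};
        byte[] latin1Bytes = {'A', 'n', 'v', (byte) 0xE4, 'n', 'd', 'a', 'r', 'e'};

        // Correct reading of the first string.
        System.out.println(strictDecode(utf8Bytes, StandardCharsets.UTF_8));
        // Accepted, but wrong: C3 A7 becomes two Latin-1 characters.
        System.out.println(strictDecode(utf8Bytes, StandardCharsets.ISO_8859_1));
        // Rejected: E4 followed by 'n' is malformed UTF-8, so this prints null.
        System.out.println(strictDecode(latin1Bytes, StandardCharsets.UTF_8));
        // Falls through to the intended reading.
        System.out.println(strictDecode(latin1Bytes, StandardCharsets.ISO_8859_1));
    }
}
```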
Note: HTTP doesn't care about percent-encoding and you can write a web server that accepts http://foo/%%%%% as a valid form. The URI spec mandates UTF-8 but this was done retroactively. It is really up to the server to describe what form its URIs should be and if you have to handle arbitrary URIs you need to be aware of this legacy.
I've written a bit more about URLs and Java here.
How can I decompress a String that was compressed by PHP's gzcompress() function?
Any full examples?
thx
I tried it now like this:
public static String unzipString(String zippedText) throws Exception
{
ByteArrayInputStream bais = new ByteArrayInputStream(zippedText.getBytes("UTF-8"));
GZIPInputStream gzis = new GZIPInputStream(bais);
InputStreamReader reader = new InputStreamReader(gzis);
BufferedReader in = new BufferedReader(reader);
String unzipped = "";
while ((unzipped = in.readLine()) != null)
unzipped+=unzipped;
return unzipped;
}
But it does not work when I try to unzip a string produced by PHP's gzcompress().
PHP's gzcompress() uses the zlib format, not GZIP:
public static String unzipString(String zippedText) {
String unzipped = null;
try {
byte[] zbytes = zippedText.getBytes("ISO-8859-1");
// Add extra byte to array when Inflater is set to true
byte[] input = new byte[zbytes.length + 1];
System.arraycopy(zbytes, 0, input, 0, zbytes.length);
input[zbytes.length] = 0;
ByteArrayInputStream bin = new ByteArrayInputStream(input);
InflaterInputStream in = new InflaterInputStream(bin);
ByteArrayOutputStream bout = new ByteArrayOutputStream(512);
int b;
while ((b = in.read()) != -1) {
bout.write(b); }
bout.close();
unzipped = bout.toString();
}
catch (IOException io) { printIoError(io); }
return unzipped;
}
private static void printIoError(IOException io)
{
System.out.println("IO Exception: " + io.getMessage());
}
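The zlib-versus-gzip distinction can be illustrated with a byte[]-based sketch. It assumes, in line with the answer above, that gzcompress() output is a zlib (RFC 1950) stream; since no PHP output is at hand, DeflaterOutputStream with its default settings is used as a stand-in to produce such a stream. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class ZlibVsGzip {

    // Produces a zlib-wrapped stream (stand-in for gzcompress() output).
    public static byte[] zlibCompress(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Inflates a zlib-wrapped stream back to the original bytes.
    public static byte[] zlibDecompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // gzip streams start with the magic bytes 1f 8b; zlib streams do not.
    public static boolean looksLikeGzip(byte[] data) {
        return data.length >= 2
            && (data[0] & 0xff) == 0x1f
            && (data[1] & 0xff) == 0x8b;
    }

    public static void main(String[] args) {
        byte[] z = zlibCompress("hello zlib".getBytes(StandardCharsets.UTF_8));
        System.out.println(looksLikeGzip(z)); // false, GZIPInputStream would reject it
        System.out.println(new String(zlibDecompress(z), StandardCharsets.UTF_8));
    }
}
```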
Try a GZIPInputStream. See this example and this SO question.
See
http://developer.android.com/reference/java/util/zip/InflaterInputStream.html
since gzip is built on the DEFLATE algorithm.