Crawl web page encoding issue - negative value in byte - java

I use the following code to crawl a web page:
CloseableHttpClient httpclient = HttpClients.createDefault();
HttpGet httpget = new HttpGet(url);
CloseableHttpResponse response = httpclient.execute(httpget);
HttpEntity entity = response.getEntity();
System.out.println(entity.getContentType());
//output: Content-Type: text/html; charset=ISO-8859-1
I found that the character "’" has the byte value -110, which cannot be mapped to a valid character in either ISO-8859-1 or UTF-8.
I tried manually opening the page, copying the character, and saving it as a text file; there I saw that the byte value is actually 39. I think the OS did the conversion when the character went through the clipboard.
What I want is just to save the web page to local disk exactly as it was received.
I wrote some simple code to save the content to disk. I read bytes and write bytes directly. When I open the saved file with a hex editor, I can see that the value of the byte is 146 (-110).
InputStream in = entity.getContent();
FileOutputStream fos = new FileOutputStream(new File("D:/test.html"));
byte[] buffer = new byte[1024];
int len;
// copy the raw bytes straight to disk; the same buffer can be reused between reads
while ((len = in.read(buffer)) > 0) {
    fos.write(buffer, 0, len);
}
in.close();
fos.close();
So now the issue becomes how to reconstruct the character from the byte 146 (-110). I will keep trying and will update if I find anything.

Maybe you could show some code for how you save the page to disk? And did you check the byte value for ’?
It looks like the character ’ is 3 bytes long (in UTF-8), unless my pasting or your copying failed. Check this out:
public static void main(String[] args) {
    char c = '’';
    System.out.println("character: " + c);
    System.out.println("int: " + (int) c); // prints 8217, the UTF-16 code unit for ’
    String s = "’";
    // getBytes() with no argument uses the platform default charset
    // (here UTF-8, which encodes ’ as three bytes); other charsets give different values
    byte[] bytes = s.getBytes();
    System.out.println("bytes: " + Arrays.toString(bytes)); // needs java.util.Arrays
}
Edit: I've found the following suggested approach to charset handling, give it a try:
ContentType contentType = ContentType.getOrDefault(entity);
Charset charset = contentType.getCharset();
Reader reader = new InputStreamReader(entity.getContent(), charset);
Source: https://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
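For example, here is a minimal sketch building on that snippet (assuming HttpClient 4.x plus java.io and java.nio.charset). The declared charset can be null when the server omits it, and as a heuristic (an assumption, but it is what browsers do) a page labelled ISO-8859-1 that contains bytes like 0x92 (’) is usually really windows-1252:
ContentType contentType = ContentType.getOrDefault(entity);
Charset charset = contentType.getCharset();
// fall back / remap: assumption that a missing or ISO-8859-1 label really means windows-1252
if (charset == null || charset.equals(StandardCharsets.ISO_8859_1)) {
    charset = Charset.forName("windows-1252");
}
try (Reader reader = new InputStreamReader(entity.getContent(), charset);
     Writer writer = new OutputStreamWriter(new FileOutputStream("D:/test.html"), StandardCharsets.UTF_8)) {
    char[] buf = new char[4096];
    int n;
    while ((n = reader.read(buf)) != -1) {
        writer.write(buf, 0, n); // re-encode the page as UTF-8 on disk
    }
}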

A byte in Java is a signed type with a value of -128 to 127. The most significant bit is used to indicate the sign. For example, 0111 1111 == 127, and 1000 0000 == -128.
I looked up your character (’) in an ANSI (windows-1252) table and found that it has the value 146 (which is of course greater than 127). Its binary representation is 1001 0010, so interpreting it as a signed value yields -110.
To reproduce what you are seeing:
String s = "’"; // ’ is ANSI (windows-1252) character 146
byte[] bytes = s.getBytes(Charset.forName("windows-1252")); // be explicit; the default charset varies by platform
System.out.println( (int)bytes[0] ); // prints -110
To convert the byte value to an unsigned representation:
char c = (char)(bytes[0] & 0xFF);
System.out.println( (int)c ); // prints 146
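And to actually reconstruct the character from that byte, a short sketch, assuming the page content really is windows-1252 rather than ISO-8859-1:
byte[] raw = { (byte) 146 }; // 0x92, the byte observed in the page
String decoded = new String(raw, Charset.forName("windows-1252"));
System.out.println(decoded); // prints ’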

Related

Base64 Encoded to Decoded File Conversion Problem

I am processing very large files (> 2 GB). Each input file is Base64 encoded, and I am outputting to new files after decoding. Depending on the buffer size (LARGE_BUF) and for a given input file, my input-to-output conversion either works fine, is missing one or more bytes, or throws an exception at the outputStream.write line (IllegalArgumentException: Last unit does not have enough bits). Here is the code snippet (I could not cut and paste, so it may not be perfect):
...
final int LARGE_BUF = 1024;
byte[] inBuf = new byte[LARGE_BUF];
try (InputStream inputStream = new FileInputStream(inFile);
     OutputStream outStream = new FileOutputStream(outFile)) {
    for (int len; (len = inputStream.read(inBuf)) > 0; ) {
        String out = new String(inBuf, 0, len);
        outStream.write(Base64.getMimeDecoder().decode(out.getBytes()));
    }
}
For instance, for my sample input file, if LARGE_BUF is 1024 the output file is 4 bytes too small, if 2*1024 I get the exception mentioned above, and if 7*1024 it works correctly. Grateful for any ideas. Thank you.
First, you are converting bytes into a String, then immediately back into bytes. So, remove the use of String entirely.
Second, base64 encoding turns each sequence of three bytes into four bytes, so when decoding, you need four bytes to properly decode three bytes of original data. It is not safe to create a new decoder for each arbitrarily read sequence of bytes, which may or may not have a length which is an exact multiple of four.
Finally, Base64.Decoder has a wrap(InputStream) method which makes this considerably easier:
try (InputStream inputStream = Base64.getDecoder().wrap(
new BufferedInputStream(
Files.newInputStream(Paths.get(inFile))))) {
Files.copy(inputStream, Paths.get(outFile));
}
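If the input really is MIME-wrapped Base64 with line separators (which the getMimeDecoder() call in the question suggests), here is a variant sketch: the MIME decoder's wrap skips the line breaks that the basic decoder would reject.
try (InputStream inputStream = Base64.getMimeDecoder().wrap( // tolerates \r\n line breaks in the input
        new BufferedInputStream(
                Files.newInputStream(Paths.get(inFile))))) {
    Files.copy(inputStream, Paths.get(outFile));
}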

How to Inflate the same data in Python

I have code in Java which works fine, and I need to inflate the same data in Python.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.binary.StringUtils;

public static byte[] inflate(byte[] compressedContent) throws IOException {
    ByteArrayOutputStream s = new ByteArrayOutputStream();
    // Inflater(true) means raw deflate data, without a zlib header or trailer
    InflaterInputStream iis = new InflaterInputStream(
            new ByteArrayInputStream(compressedContent), new Inflater(true));
    byte[] buffer = new byte[4096];
    int len;
    while ((len = iis.read(buffer)) != -1) {
        s.write(buffer, 0, len);
    }
    iis.close();
    s.flush();
    s.close();
    return s.toByteArray();
}
Using
StringUtils.newStringUtf8(inflate(Base64.decodeBase64("PZLHrptQAET_xevHE8VgnB1gyqVjig0bRLkUg-k9yr_HiZTsZo5mZjU_T1GSwHEMp7aCzenH6fR1-ivDae_gx7MwGuDwoWX6PwN3uYjFpDRK2XZRfnJQQXA5MIK3N_s7oEDFb9qruFmVNtmCtuuOX6qcTEVP5k-Hv7t-mVnfo-XgDa4LBkIt9lMmtKBz4kful_eDNBUONYQ95CXHBRY3dSlEYcC063oXC8hMkKLXRof6Re3vS8U1w-A0oRQt0spqnGifob-1orDhK-bMYflYVOR8KQC_YxVjjekaHuUxvQOZXBgdI4ubvl6z-p0BF-AjY2qNca48qY6j80Wa6Wxjvl8c31AG5V6vto8FG3vZ2c1jvt28MuvIdyjTx1otQPLMC71iOHjqtpFihNLmQVhPdSzbuM8rJ_eocJ4z12DzvFDZGwyeC109TGV2xjsQ32kv5VGB2NH1XFiGVd8xkE9PRI1oDHFwRck_25y3KlxMWKmlDrw7Br75nrunSsrNJbZwzq5rTRivAuhmBZz12RRacuxyeSz5ZIcMqFk8Il8U7nYEsLHHqLRP92oEGfvQZgfqLuuNWf-qlXqc56TiLpdjlfvAU-LwGG599wrdKST41sHeiKCbCZckNLW-aT8V0_tC7FzPh1pZWO6uykgGHtpOp0J9KzxKlPdXvwy9FTV0geUAmjERfR_mgwDciiqlr0qahOlKSMrW524DzAY4Fv8-18x1_XWCW1d-aFh-CE2dUfTXbw")))
The Java code works well, but I cannot get the Python conversion below to work.
import base64
import zlib

def Base64UrlDecode(data):
    """Decode base64, padding being optional.

    :param data: Base64 data as an ASCII byte string
    :returns: The decoded byte string.
    """
    if isinstance(data, unicode):
        data = data.encode('utf-8')
    missing_padding = len(data) % 4
    if missing_padding != 0:
        data += b'=' * (4 - missing_padding)
    return base64.decodestring(data)

url_decode = Base64UrlDecode(token)  # The token is the same string as the above one.

# https://docs.python.org/2/library/zlib.html#zlib.compressobj
for i in range(-15, 32):  # try all possible window sizes, but none works.
    try:
        decode = zlib.decompress(url_decode, i)
    except:
        pass
The true in Inflater(true) in Java means inflation of raw deflate data with no zlib header or trailer. To get the same operation in Python, the second argument to zlib.decompress() must be -15, so you don't need to try different values there.
The next thing to check is your Base64 decoding. Its result must be different in the two cases, so look at where the decoded bytes differ to find your bug.
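One hedged observation on that point: the token above contains '-' and '_', i.e. the URL-safe Base64 alphabet, and commons-codec's decodeBase64 accepts both the standard and URL-safe alphabets, so the Java side decodes it transparently. Printing the decoded length on the Java side gives something to compare the Python output against:
byte[] javaDecoded = Base64.decodeBase64(token); // token = the Base64 string from the question
System.out.println("decoded length: " + javaDecoded.length);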

How to unpack a binary file in java?

Can somebody help me figure out how to do in Java what I do in Ruby with the code below?
The Ruby code below uses unpack('H*')[0] to store the complete binary file content in the variable "var" in ASCII (hex) format.
IO.foreach(ARGV[0]){ |l|
  var = l.unpack('H*')[0]
} if File.exists?(ARGV[0])
Update:
Hi Aru, I've tested the approach you suggest in the form below:
byte[] bytes = Files.readAllBytes(testFile.toPath());
str = new String(bytes, StandardCharsets.UTF_8);
System.out.println(str);
But when I print the content of variable "str", the printout shows only little squares, as if it is not decoding the content. I'd like to store the content of the binary file in "str" in ASCII format.
Update #2:
Hello Aru, I'm trying to store all of the binary file's content in an array of bytes, but I don't know how to do it. It worked with "FileUtils.readFileToByteArray(myFile);", but that is an external library; is there a built-in option to do it?
File myFile = new File("./Binaryfile");
byte[] binary = FileUtils.readFileToByteArray(myFile); // I have issues here storing all the binary content in an array of bytes
String hexString = DatatypeConverter.printHexBinary(binary);
System.out.println(hexString);
Update #3:
Hello ursa and Aru, thanks for your help. I've tried both of your solutions and they work fine, but looking at the Files.readAllBytes() documentation, it says it is not intended for big files, and the binary file I want to analyse is more than 2 GB :(. I see an option with your solutions: read chunk by chunk. The chunks inside the binary are separated by the sequence FF65, so is there a way to tweak your code to process only one chunk at a time based on the chunk separator? If not, maybe with some external library.
Update #4:
Hello, I'm trying to modify your code since I'd like to read variable-size chunks based on the value of "Var".
How can I set an offset to read the next chunk in your code?
I mean:
- in the 1st iteration, read the first 1024 bytes
- in this step Var = 500
- in the 2nd iteration, read the next 1024 bytes, beginning from 1024 - Var = 1024 - 500 = 524
- in this step Var = 712
- in the 3rd iteration, read the next 1024 bytes, beginning from 1548 - Var = 1548 - 712 = 836
- and so on
Is there a method something like read(number of bytes, offset)?
You can use the commons-codec Hex class + the commons-io FileUtils class:
byte[] binary = FileUtils.readFileToByteArray(new File("/Users/user/file.bin"));
String hexEncoded = Hex.encodeHexString(binary);
But if you just want to read the content of a TEXT file you can use:
String content = FileUtils.readFileToString(new File("/Users/user/file.txt"), "ISO-8859-1");
With JRE 7 you can use standard classes:
public static void main(String[] args) throws Exception {
    Path path = Paths.get("path/to/file");
    byte[] data = Files.readAllBytes(path);

    char[] hexArray = "0123456789ABCDEF".toCharArray();
    char[] hexChars = new char[data.length * 2];
    for (int j = 0; j < data.length; j++) {
        int v = data[j] & 0xFF;
        hexChars[j * 2] = hexArray[v >>> 4];
        hexChars[j * 2 + 1] = hexArray[v & 0x0F];
    }
    System.out.println(new String(hexChars));
}
This should do what you want:
try {
    File inputFile = new File("someFile");
    byte inputBytes[] = Files.readAllBytes(inputFile.toPath());
    String hexCode = DatatypeConverter.printHexBinary(inputBytes);
    System.out.println(hexCode);
} catch (IOException e) {
    System.err.println("Couldn't read file: " + e);
}
If you don't want to read the entire file at once, you can read it in chunks as well. You'll need an InputStream of some sort.
File inputFile = new File("C:\\Windows\\explorer.exe");
try (InputStream input = new FileInputStream(inputFile)) {
    byte inputBytes[] = new byte[1024];
    int readBytes;
    // Read until all bytes were read
    while ((readBytes = input.read(inputBytes)) != -1) {
        System.out.printf("%4d bytes were read.\n", readBytes);
        // only print the bytes actually read; the tail of the buffer may hold stale
        // data from the previous pass (needs java.util.Arrays)
        System.out.println(DatatypeConverter.printHexBinary(Arrays.copyOf(inputBytes, readBytes)));
    }
} catch (FileNotFoundException ex) {
    System.err.println("Couldn't read file: " + ex);
} catch (IOException ex) {
    System.err.println("Error while reading file: " + ex);
}
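Regarding Update #4: InputStream.read(buf, off, len) takes an offset into the buffer, not into the file, so it won't do that by itself. Here is a minimal sketch of the overlapping reads described in the update, assuming java.io.RandomAccessFile and a hypothetical Var value parsed out of each chunk:
try (RandomAccessFile raf = new RandomAccessFile("./Binaryfile", "r")) {
    byte[] chunk = new byte[1024];
    long position = 0; // file offset where the next read starts
    int read;
    while (true) {
        raf.seek(position); // jump to the computed offset
        read = raf.read(chunk, 0, chunk.length);
        if (read <= 0) {
            break; // end of file
        }
        // ... process chunk[0 .. read) and determine Var from its content ...
        long var = 500; // hypothetical value parsed from the chunk
        long advance = read - var; // e.g. 1024 - 500 = 524, as in the update
        if (advance <= 0) {
            advance = read; // guard so the position always moves forward
        }
        position += advance;
    }
}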

Stream decoding of Base64 data

I have some large base64 encoded data (stored in snappy files in the hadoop filesystem).
This data was originally gzipped text data.
I need to be able to read chunks of this encoded data, decode it, and then flush it to a GZIPOutputStream.
Any ideas on how I could do this instead of loading the whole base64 data into an array and calling Base64.decodeBase64(byte[])?
Am I right if I read the characters till the '\r\n' delimiter and decode it line by line?
e.g.:
for (int i = 0; i < byteData.length; i++) {
    if (byteData[i] == CARRIAGE_RETURN || byteData[i] == NEWLINE) {
        if (i < byteData.length - 1 && byteData[i + 1] == NEWLINE)
            i += 2;
        else
            i += 1;
        byteBuffer.put(Base64.decodeBase64(record));
        byteCounter = 0;
        record = new byte[8192];
    } else {
        record[byteCounter++] = byteData[i];
    }
}
Sadly, this approach doesn't give any human readable output.
Ideally, I would like to stream read, decode, and stream out the data.
Right now, I'm trying to put it into an InputStream and then copy it to a gzip output:
byteBuffer.get(bufferBytes);
InputStream inputStream = new ByteArrayInputStream(bufferBytes);
inputStream = new GZIPInputStream(inputStream);
IOUtils.copy(inputStream , gzipOutputStream);
And it gives me a
java.io.IOException: Corrupt GZIP trailer
Let's go step by step:
You need a GZIPInputStream to read zipped data (not a GZIPOutputStream; the output stream is used to compress data). Having this stream you will be able to read the uncompressed, original binary data. This requires an InputStream in the constructor.
You need an input stream capable of reading the Base64 encoded data. I suggest the handy Base64InputStream from apache-commons-codec. With the constructor you can set the line length and the line separator, and set doEncode=false to decode data. This in turn requires another input stream - the raw, Base64 encoded data.
This stream depends on how you get your data; ideally the data should be available as an InputStream - problem solved. If not, you may have to use a ByteArrayInputStream (if binary), StringBufferInputStream (if string), etc.
Roughly this logic is:
InputStream fromHadoop = ...; // 3rd paragraph
Base64InputStream b64is = // 2nd paragraph
new Base64InputStream(fromHadoop, false, 80, "\n".getBytes("UTF-8"));
GZIPInputStream zis = new GZIPInputStream(b64is); // 1st paragraph
Please pay attention to the arguments of Base64InputStream (line length and end-of-line byte array); you may need to tweak them.
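From there, a short usage sketch (assuming commons-io IOUtils, as already used in the question): copying zis somewhere, e.g. to a file, yields the original uncompressed text.
try (OutputStream out = new FileOutputStream("decoded.txt")) { // hypothetical target file
    IOUtils.copy(zis, out); // zis is the GZIPInputStream from the snippet above
}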
Thanks to Nikos for pointing me in the right direction.
Specifically this is what I did:
private static final byte NEWLINE = (byte) '\n';
private static final byte CARRIAGE_RETURN = (byte) '\r';
byte[] lineSeparators = new byte[] {CARRIAGE_RETURN, NEWLINE};
Base64InputStream b64is = new Base64InputStream(inputStream, false, 76, lineSeparators);
GZIPInputStream zis = new GZIPInputStream(b64is);
Isn't 76 the length of the Base64 line? I didn't try with 80, though.

random characters in byte to string conversion java

I am converting a byte[] into a String. Every time I convert the byte array to a string, it has a strange prefix character before it. I have tried different characters, uppercase, etc. It still has the prefix.
When I write the bytes to system output, it still has the character.
System.out.write(theByteArray);
System.out.println(new String(theByteArray, "UTF-8"));
When I write the text to a file, it seems like the byte array is written flawlessly, but then when I read the file back in I end up with the weird prefix symbol...
Text to be encrypted >
"aaaa"
Text when decrypted and converted to a string >
"aaaa"
The character seems to disappear when pasted here as text; here is an image of it.
I want to compare the given string to another string, kind of like decrypting a password and comparing it to a database. If one matches, then it gives access.
Here is the code that is generating this byte array.
Keep in mind, the bytes I am looking at are in decData, and this is NOT my code.
byte[] encData;
byte[] decData;
File inFile = new File(fileName + ".encrypted");

// Generate the cipher using pass:
Cipher cipher = FileEncryptor.makeCipher(pass, false);

// Read in the file:
FileInputStream inStream = new FileInputStream(inFile);
encData = new byte[(int) inFile.length()];
inStream.read(encData); // note: read() is not guaranteed to fill the whole array in one call
inStream.close();

// Decrypt the file data:
decData = cipher.doFinal(encData);

// Figure out how much padding to remove
int padCount = (int) decData[decData.length - 1];

// Naive check, will fail if plaintext file actually contained
// this at the end
// For robust check, check that padCount bytes at the end have same value
if (padCount >= 1 && padCount <= 8) {
    decData = Arrays.copyOfRange(decData, 0, decData.length - padCount);
}

FileOutputStream target = new FileOutputStream(new File(fileName + ".decrypted.txt"));
target.write(decData);
target.close();
It looks like encData contains a BOM, and I think Java, when reading a stream with a BOM, will just treat the BOM as a UTF-8 character, which causes the "prefix". You can try the solution suggested here: Reading UTF-8 - BOM marker.
On the other hand, a byte order mark is optional and not recommended for UTF-8 encoding. So two questions to ask are:
Is the original data encoded using UTF-8?
If it is, it might be worthwhile to find out how the BOM got into the original data in the first place.
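Here is a minimal sketch of that idea, assuming the data is UTF-8 with an optional leading BOM (EF BB BF) and using java.util.Arrays and java.nio.charset.StandardCharsets: strip the BOM from decData before building the String used for the comparison.
// strip a leading UTF-8 BOM, if present, before decoding
if (decData.length >= 3
        && (decData[0] & 0xFF) == 0xEF
        && (decData[1] & 0xFF) == 0xBB
        && (decData[2] & 0xFF) == 0xBF) {
    decData = Arrays.copyOfRange(decData, 3, decData.length);
}
String text = new String(decData, StandardCharsets.UTF_8);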
