Decode multiple times encoded String - java

I have written Java code to decode a string that was URL-encoded with "UTF-8" three times. I am using this code in an ETL, so I could run the ETL step three times in a row, but that would be inefficient. I searched the internet but didn't find anything promising. Is there any way in Java to decode a String that has been encoded multiple times?
Here's my input string "uri":
file:///C:/Users/nikhil.karkare/dev/pentaho/data/ba-repo-content-original/public/Development+Activity/Defects+Unresolved+%252528by+Non-Developer%252529.xanalyzer
Here's my code which is decoding this string:
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.io.*;
String decodedValue;
public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
// First, get a row from the default input hop
//
Object[] r = getRow();
// If the row object is null, we are done processing.
//
if (r == null) {
setOutputDone();
return false;
}
// It is always safest to call createOutputRow() to ensure that your output row's Object[] is large
// enough to handle any new fields you are creating in this step.
//
Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());
String newFileName = get(Fields.In, "uri").getString(r);
try{
decodedValue = URLDecoder.decode(newFileName, "UTF-8");
}
catch (UnsupportedEncodingException e) {
throw new AssertionError("UTF-8 is unknown");
}
// Set the value in the output field
//
get(Fields.Out, "decodedValue").setValue(outputRow, decodedValue);
// putRow will send the row on to the default output hop.
//
putRow(data.outputRowMeta, outputRow);
return true;}
Output of this code is following:
file:///C:/Users/nikhil.karkare/dev/pentaho/data/ba-repo-content-original/public/Development Activity/Defects Unresolved %2528by Non-Developer%2529.xanalyzer
When I run this code in the ETL three times, I get the output I want, which is this:
file:///C:/Users/nikhil.karkare/dev/pentaho/data/ba-repo-content-original/public/Development Activity/Defects Unresolved (by Non-Developer).xanalyzer

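URL encoding replaces %, ( and ) with %25, %28 and %29 respectively. Each extra round of encoding turns every % into %25, so ( becomes %28, then %2528, then %252528. Decoding three times reverses this: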
String s = "file:///C:/Users/nikhil.karkare/dev/pentaho/data/"
+ "ba-repo-content-original/public/Development+Activity/"
+ "Defects+Unresolved+%252528by+Non-Developer%252529.xanalyzer";
// %252528 ... %252529
s = URLDecoder.decode(s, "UTF-8");
// %2528 ... %2529
s = URLDecoder.decode(s, "UTF-8");
// %28 .. %29
s = URLDecoder.decode(s, "UTF-8");
// ( ... )

Just a for loop did the job:
String newFileName = get(Fields.In, "uri").getString(r);
decodedValue = newFileName;
for (int i = 0; i < 3; i++) {
try{
decodedValue = URLDecoder.decode(decodedValue, "UTF-8");
}
catch (UnsupportedEncodingException e) {
throw new AssertionError("UTF-8 is unknown");
}
}
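If the number of encoding rounds is not known in advance, one possible variation is to keep decoding until the string stops changing. The following is only a sketch under stated assumptions: the class name RepeatedUrlDecoder, the method decodeFully and the maxRounds cap are hypothetical, and it assumes the real file name contains no literal % or + characters, since those would be decoded one round too many.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class RepeatedUrlDecoder {
    /** Decodes repeatedly until the value stops changing or maxRounds is reached. */
    static String decodeFully(String value, int maxRounds) {
        String current = value;
        for (int round = 0; round < maxRounds; round++) {
            String next;
            try {
                next = URLDecoder.decode(current, "UTF-8");
            } catch (UnsupportedEncodingException e) {
                // Every Java runtime is required to support UTF-8.
                throw new IllegalStateException("UTF-8 is unknown", e);
            }
            if (next.equals(current)) {
                break; // nothing left to decode
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        String uri = "Defects+Unresolved+%252528by+Non-Developer%252529.xanalyzer";
        // prints: Defects Unresolved (by Non-Developer).xanalyzer
        System.out.println(decodeFully(uri, 10));
    }
}
```

Inside the ETL step this would replace the fixed-count loop with a single call such as decodeFully(newFileName, 10).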

Related

Read and Write to File Chunk By Chunk

I am developing a file encryption program. I was using the function below to encrypt files until I realized that it is not suitable for big ones, because it reads the whole file content into memory. Now I need a function that reads and writes file content in chunks. How can I do this?
private fun encryptFile(file: File) {
val originalData = file.readBytes()
val encryptData = encrypt(originalData)
encryptData?.run {
file.writeBytes(this)
}
}
Your encrypt function obviously can't stay that way. It'll have to become a thing that wraps an InputStream or OutputStream, and then it's fairly trivial.
Note that handrolling encryption is a near 100% guarantee you'll mess it up, and crypto streams already exist. Any reason you're reinventing a wheel and signing up to mess up security by reinventing things you shouldn't?
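As a rough illustration of wrapping a stream with the JDK's existing crypto streams, here is a sketch in Java (these classes are callable from Kotlin as well). The class name ChunkedCrypto, the buffer size, the choice of AES/CBC/PKCS5Padding and the convention of writing the IV in front of the ciphertext are all assumptions made for the example, not something from the question. CBC on its own gives no integrity protection, so treat this as a starting point only.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

class ChunkedCrypto {
    private static final String TRANSFORMATION = "AES/CBC/PKCS5Padding";
    private static final int BUFFER_SIZE = 8192;

    /** Writes a random 16-byte IV, then the ciphertext, reading the input in small chunks. */
    static void encrypt(SecretKey key, InputStream in, OutputStream out)
            throws IOException, GeneralSecurityException {
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        out.write(iv);
        Cipher cipher = Cipher.getInstance(TRANSFORMATION);
        cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        // Closing the cipher stream writes the final padded block and closes 'out'.
        try (CipherOutputStream cipherOut = new CipherOutputStream(out, cipher)) {
            copy(in, cipherOut);
        }
    }

    /** Reads the IV written by encrypt, then decrypts the rest chunk by chunk. */
    static void decrypt(SecretKey key, InputStream in, OutputStream out)
            throws IOException, GeneralSecurityException {
        byte[] iv = new byte[16];
        new DataInputStream(in).readFully(iv);
        Cipher cipher = Cipher.getInstance(TRANSFORMATION);
        cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        try (CipherInputStream cipherIn = new CipherInputStream(in, cipher)) {
            copy(cipherIn, out);
        }
    }

    /** Copies with a fixed-size buffer so memory use does not depend on file size. */
    private static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
    }
}
```

For files, the streams would typically be a FileInputStream on the source and a FileOutputStream on a separate target file, because overwriting the file while it is still being read is not safe.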
Have a look at this code, which collects the input into fixed-size chunks:
// ...
StringBuilder sb = new StringBuilder();
String line;
while ((line = inputStream.readLine()) != null) {
sb.append(line);
// if enough content is read, extract the chunk
while (sb.length() >= chunkSize) {
String c = sb.substring(0, chunkSize);
// do something with the string
// add the remaining content to the next chunk
sb = new StringBuilder(sb.substring(chunkSize));
}
}
// that's the last chunk
String c = sb.toString();
// do something with the string
EDIT: What about using the Chilkat library?
Code example for encrypting a file in chunks:
import com.chilkatsoft.*;
public class ChilkatExample {
static {
try {
System.loadLibrary("chilkat");
} catch (UnsatisfiedLinkError e) {
System.err.println("Native code library failed to load.\n" + e);
System.exit(1);
}
}
public static void main(String argv[])
{
CkCrypt2 crypt = new CkCrypt2();
crypt.put_CryptAlgorithm("aes");
crypt.put_CipherMode("cbc");
crypt.put_KeyLength(128);
crypt.SetEncodedKey("000102030405060708090A0B0C0D0E0F","hex");
crypt.SetEncodedIV("000102030405060708090A0B0C0D0E0F","hex");
String fileToEncrypt = "qa_data/hamlet.xml";
CkFileAccess facIn = new CkFileAccess();
boolean success = facIn.OpenForRead(fileToEncrypt);
if (success != true) {
System.out.println("Failed to open file that is to be encrypted.");
return;
}
String outputEncryptedFile = "qa_output/hamlet.enc";
CkFileAccess facOutEnc = new CkFileAccess();
success = facOutEnc.OpenForWrite(outputEncryptedFile);
if (success != true) {
System.out.println("Failed to open encrypted output file.");
return;
}
// Let's encrypt in 10000 byte chunks.
int chunkSize = 10000;
int numChunks = facIn.GetNumBlocks(chunkSize);
crypt.put_FirstChunk(true);
crypt.put_LastChunk(false);
CkBinData bd = new CkBinData();
int i = 0;
while (i < numChunks) {
i = i+1;
if (i == numChunks) {
crypt.put_LastChunk(true);
}
// Read the next chunk from the file.
// The last chunk will be whatever amount remains in the file..
bd.Clear();
facIn.FileReadBd(chunkSize,bd);
// Encrypt.
crypt.EncryptBd(bd);
// Write the encrypted chunk to the output file.
facOutEnc.FileWriteBd(bd,0,0);
crypt.put_FirstChunk(false);
}
// Make sure both FirstChunk and LastChunk are restored to true after
// encrypting or decrypting in chunks. Otherwise subsequent encryptions/decryptions
// will produce unexpected results.
crypt.put_FirstChunk(true);
crypt.put_LastChunk(true);
facIn.FileClose();
facOutEnc.FileClose();
// Decrypt the encrypted output file in a single call using CBC mode:
String decryptedFile = "qa_output/hamlet_dec.xml";
success = crypt.CkDecryptFile(outputEncryptedFile,decryptedFile);
// Assume success for the example..
// Compare the contents of the decrypted file with the original file:
boolean bSame = facIn.FileContentsEqual(fileToEncrypt,decryptedFile);
System.out.println("bSame = " + bSame);
}
}

Java - Printing unicode from text file doesn't output corresponding UTF-8 character

I have a text file containing numerous Unicode escape sequences and I am trying to print the corresponding characters to the console, but all it prints is the escape text itself. If I copy any of the values and paste them into a System.out call it works fine, but not when reading them from the text file.
The following is my code for reading the file, which contains lines of values like \u00C0, \u00C1, \u00C2, \u00C3. These are printed to the console literally instead of the characters I want.
private void printFileContents() throws IOException {
Path encoding = Paths.get("unicode.txt");
try (Stream<String> stream = Files.lines(encoding)) {
stream.forEach(v -> { System.out.println(v); });
} catch (IOException e) {
e.printStackTrace();
}
}
This is the method I used to parse the HTML that contained the Unicode values in the first place.
private void parseGermanEncoding() {
try
{
File encoding = new File("encoding.html");
Document document = Jsoup.parse(encoding, "UTF-8", "http://example.com/");
Element table = document.getElementsByClass("codetable").first();
Path f = Paths.get("unicode.txt");
try (BufferedWriter wr = new BufferedWriter(new FileWriter(f.toFile())))
{
for (Element row : table.select("tr"))
{
Elements tds = row.select("td");
String unicode = tds.get(0).text();
if (unicode.startsWith("U+"))
{
unicode = unicode.substring(2);
}
wr.write("\\u" + unicode);
wr.newLine();
}
wr.flush();
wr.close();
}
} catch (IOException e)
{
e.printStackTrace();
}
}
You will need to convert the string from a Unicode-escaped string to a UTF-8 encoded string. You could follow these steps: 1. convert the string to a byte array using myString.getBytes("UTF-8"), and 2. get the UTF-8 encoded string using new String(byteArray, "UTF-8"). The code block needs to be surrounded with try/catch for UnsupportedEncodingException.
Thanks to OTM's comment above I was able to get a working solution for this. You take the Unicode string, parse its hex digits with Integer.parseInt() and finally cast the result to char to get the actual character. This solution is based on the post OTM pointed to: How to convert a string with Unicode encoding to a string of letters
private void printFileContents() throws IOException {
Path encoding = Paths.get("unicode.txt");
try (Stream<String> stream = Files.lines(encoding)) {
stream.forEach(v ->
{
String output = "";
// Strip the leading backslash-u prefix and parse the remaining hex digits
int parse = Integer.parseInt(v.substring(2), 16);
// Cast the parsed code unit to the character it represents
output += (char) parse;
System.out.println(output);
});
} catch (IOException e) {
e.printStackTrace();
}
}
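A slightly more general way to express the same idea is to replace every \uXXXX sequence inside a line, wherever it occurs. The sketch below is only an illustration; the class name UnicodeEscapes and the method unescape are hypothetical, and it assumes each escape has exactly four hex digits, as in the file written by parseGermanEncoding().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapes {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    /** Replaces each escape sequence in the text with the character it denotes. */
    static String unescape(String text) {
        Matcher matcher = ESCAPE.matcher(text);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            // Parse the four hex digits and cast the value to a char.
            char decoded = (char) Integer.parseInt(matcher.group(1), 16);
            matcher.appendReplacement(result, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

In printFileContents() the lambda body could then become System.out.println(UnicodeEscapes.unescape(v)). Whether the console shows the characters correctly still depends on the console encoding.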

Java parseInteger throwing error

I have the following code snippet from my tester class.
FileReader freader=new FileReader(filename);
BufferedReader inputFile=new BufferedReader(freader);
int numScores = 0;
String playerType = "";
String nameHome = "";
String playerName = "";
String home = "";
String location = "";
int score = 0;
String date = "";
double courseRating = 0;
int courseSlope = 0;
ArrayList<Player> players = new ArrayList<Player>();
while (inputFile.read()!= -1) {
numScores = Integer.parseInt(inputFile.readLine());
playerType = inputFile.readLine();
nameHome = inputFile.readLine();
StringTokenizer st = new StringTokenizer(nameHome,",");
playerName = st.nextToken();
home = st.nextToken();
The program compiles, however when the tester is run, I get the following output error.
Exception in thread "main" java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:592)
at java.lang.Integer.parseInt(Integer.java:615)
at PlayerTest.main(PlayerTest.java:34)
I've tried researching this, and what I found was that there is possibly a space in the String that is read from the data file when it is converted to an int. I tried reading directly into a string, trimming the string, then converting it to an int, but I got the same error.
This was the code that replaced numScores = Integer.parseInt(inputFile.readLine());
tempScores = inputFile.readLine();
tempScores.trim();
System.out.println(tempScores);
numScores = Integer.parseInt(tempScores);
Any help is appreciated.
*edited to show sample data
Sample data from file
3
B
James Smith, Strikers
FWB Bowling, 112,09/22/2012
White Sands, 142,09/24/2012
Cordova Lanes,203,09/24/2012
Possibly, your File contains empty lines. These are read as "" and therefore cannot be converted to int.
Furthermore, the read() call in the header of the while loop consumes the first character of the line, so the following readLine() call never sees it. A number of length 1 (like "3") then becomes an empty string.
In any case, the construction of your loop is a bug.
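One common way to structure such a loop is to let readLine() itself signal the end of input, and to skip blank lines before parsing. The sketch below assumes the file layout shown in the question; the class name ScoreHeaderReader and both method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;

class ScoreHeaderReader {
    /** Returns the next non-blank line, trimmed, or null at end of input. */
    static String nextNonBlankLine(BufferedReader reader) throws IOException {
        String line;
        // readLine() returns null at end of input, so no separate read() call is needed.
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                return trimmed;
            }
        }
        return null;
    }

    /** Reads the score count that starts a player record, or -1 at end of input. */
    static int readNumScores(BufferedReader reader) throws IOException {
        String line = nextNonBlankLine(reader);
        return line == null ? -1 : Integer.parseInt(line);
    }
}
```

The outer loop could then look like while ((numScores = ScoreHeaderReader.readNumScores(inputFile)) != -1) { ... }, with the remaining fields of each record read inside the body.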
You can put it all in an if statement:
if(!tempScores.equalsIgnoreCase(""))
{
I ran into a similar issue today. I was reading a response from a REST endpoint and trying to parse the JSON response. Bam! Hit an error. Later on I realized the file had a BOM.
My suggestion is to create a variable:
String var = inputFile.readLine();
int numScores = Integer.parseInt(var);
Add a breakpoint and inspect what var contains. In my case the response had a BOM, an invisible Unicode character with code 65279 / 0xFEFF. In any debugger worth its salt you should be able to see each character.
If that is the case, you need to strip that value from the string.
I used this library to detect the issue: org.yaml:snakeyaml:1.16
import org.yaml.snakeyaml.reader.UnicodeReader;
//more code
private String readStream(InputStream inputStream) throws IOException {
UnicodeReader unicodeReader = new UnicodeReader(inputStream);
char[] charBuffer = new char[BUFFER_SIZE];
int read;
StringBuilder buffer = new StringBuilder(BUFFER_SIZE);
while ((read = unicodeReader.read(charBuffer,0,BUFFER_SIZE)) != -1) {
buffer.append(charBuffer, 0, read);
}
return buffer.toString();
}
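If pulling in a library is not an option, a BOM at the start of a string can also be removed by hand. This is a minimal sketch, assuming the BOM has already been decoded into the single character U+FEFF; the class name BomStripper and the method stripBom are made up for the example.

```java
class BomStripper {
    // The byte order mark as it appears after decoding (code 65279).
    private static final char BOM = '\uFEFF';

    /** Returns the text without a leading BOM; other input is returned unchanged. */
    static String stripBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }
}
```

Applied to the question's code this would be Integer.parseInt(BomStripper.stripBom(inputFile.readLine()).trim()), and only the first line of the file needs the treatment.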
You need to understand exception handling; please look into it.
The basic structure is:
try {
//Something that can throw an exception.
} catch (Exception e) {
// To do whatever when the exception is caught.
}
There is also a finally block, which always executes whether or not there is an error. It is used like this:
try {
//Something that can throw an exception.
} catch (Exception e) {
// To do whatever when the exception is caught.
} finally {
// This will always execute if there is an exception or no exception.
}
In your particular case the following exceptions can be thrown:
InputMismatchException - if the next token does not match the Integer regular expression, or is out of range
NoSuchElementException - if input is exhausted
IllegalStateException - if this scanner is closed
So you would need to catch exceptions like this:
try {
rows=scan.nextInt();
} catch (InputMismatchException e) {
// When the InputMismatchException is caught.
System.out.println("The next token does not match the Integer regular expression, or is out of range");
} catch (NoSuchElementException e) {
// When the NoSuchElementException is caught.
System.out.println("Input is exhausted");
} catch (IllegalStateException e) {
// When the IllegalStateException is caught.
System.out.println("Scanner is closed");
}

How to handle UNICODE URL?

If I want to parse the following URLs in Java:
... how should I handle the String?
So far I have been unable to handle those Strings; all I get are ???? characters.
Thanks.
Modified in 2012.09.09:
package pruebas;
import java.io.UnsupportedEncodingException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.Vector;
public class Prueba03
{
public static void main(String argumentos[])
{
Vector<String> listaURLs = new Vector<String>();
listaURLs.add("http://президент.рф/");
listaURLs.add("http://www.中国政府.政务.cn");
listaURLs.add("http://www.原來我不帥.cn/");
listaURLs.add("http://وزارة-الأتصالات.مصر/");
URL currentURL;
URLConnection currentConnection;
int currentSize;
for(int i=0; i<listaURLs.size(); i++)
{
try
{
System.out.println(URLDecoder.decode(listaURLs.get(i), URLEncoder.encode(listaURLs.get(i), "UTF-8")));
} // End of the try.
catch(UnsupportedEncodingException uee)
{
uee.printStackTrace();
} // End of the catch.
catch(Exception e)
{
e.printStackTrace();
} // End of the catch.
try
{
currentURL = new URL(listaURLs.get(i));
System.out.println("currentURL" + " = " + currentURL);
currentConnection = currentURL.openConnection();
System.out.println("currentConnection" + " = " + currentConnection);
currentSize = currentConnection.getContentLength();
System.out.println("currentSize" + " = " + currentSize);
} // End of the try.
catch(Exception e)
{
e.printStackTrace();
} // End of the catch.
} // End of the for.
} // End of the main method.
} // End of the Prueba03 class.
For the domain name, you should convert the Unicode host name using Punycode.
Punycode is a way to represent a Unicode string as an ASCII string.
The following link shows a Java method that converts a Unicode domain name to an internationalized domain name in ASCII form.
https://docs.oracle.com/javase/6/docs/api/java/net/IDN.html#toASCII(java.lang.String)
URL u = new URL(url);
String host = u.getHost();
String[] labels = host.split("\\.");
for (int i = 0; i < labels.length; i++) {
labels[i] = java.net.IDN.toUnicode(labels[i]);
}
host = StringUtils.join(labels, ".");
System.out.println(host);
Also, you can test some Unicode URLs using an online Punycode converter.
https://www.punycoder.com/
For example, "http://www.中国政府.政务.cn" is converted into "http://www.xn--fiqs8sirgfmh.xn--zfr164b.cn/".
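The conversion to the ASCII form can be sketched with java.net.IDN.toASCII, which accepts a whole host name and handles the individual labels itself. The class name IdnHosts and its two methods are hypothetical helpers written for this illustration, and the sketch assumes the URL has no user-info part.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

class IdnHosts {
    /** Converts a Unicode host name to its ASCII (Punycode) form. */
    static String toAsciiHost(String unicodeHost) {
        return IDN.toASCII(unicodeHost);
    }

    /** Rebuilds the URL with an ASCII host, keeping protocol, port and file part. */
    static String toAsciiUrl(String unicodeUrl) throws MalformedURLException {
        URL url = new URL(unicodeUrl);
        String asciiHost = toAsciiHost(url.getHost());
        // A port of -1 means "no explicit port" and is omitted from the result.
        return new URL(url.getProtocol(), asciiHost, url.getPort(), url.getFile()).toString();
    }
}
```

The resulting ASCII URL is the one to pass to new URL(...).openConnection(). IDN.toUnicode performs the reverse conversion for display purposes.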
Based on @hyunjong's answer: using toUnicode does not work, use toASCII instead. And if you prefer Kotlin, you can use this code:
val a = "http://www.中国政府.政务.cn"
val u = URL(a)
val labels = u.host.split(".")
val result = labels.joinToString(separator = ".") { s ->
java.net.IDN.toASCII(s)
}
print(result) //www.xn--fiqs8sirgfmh.xn--zfr164b.cn
You can try the following code:
import java.net.URLDecoder;
import java.net.URLEncoder;
public class Test7 {
public static void main(String[] args) throws Exception {
String str = "http://www.中国政府.政务.cn";
System.out.println(URLDecoder.decode(str, URLEncoder.encode(str,
"UTF-8")));
}
}
Not sure what you mean by "parse": what do you intend to do with these parts?
Arabic and Russian, as far as I know, are supported by UTF-8.
Not sure what your source of data is (some sort of stream, perhaps?), but String has a constructor that accepts the desired encoding.
You should be able to obtain a string that does not contain ??? for Arabic and Russian if you use this constructor (with the "UTF-8" argument).
You may try using the following:
String pageUrl = "http://www.中国政府.政务.cn";
try
{
URL url = new URL(pageUrl);
System.out.println(url.toURI().toASCIIString());
}
catch (MalformedURLException e1)
{
// TODO Auto-generated catch block
e1.printStackTrace();
}
catch (URISyntaxException e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
The result is as expected:
http://www.%E4%B8%AD%E5%9B%BD%E6%94%BF%E5%BA%9C.%E6%94%BF%E5%8A%A1.cn
But converting to a URI has its own disadvantage: you have to manually replace special characters such as '|', '"' and '#' with their URL-encoded forms.

What's the difference between JavaScript deflate and java.util.zip.Deflater

I wrote some JavaScript code to compress a string with deflate and then encode it with Base64:
function base64 (str) {
return new Buffer(str).toString("base64");
}
function deflate (str) {
return RawDeflate.deflate(str);
}
function encode (str) {
return base64(deflate(str));
}
var str = "hello, world";
console.log("Test Encode");
console.log(encode(str));
I converted "hello, world" to 2f8d48710d6e4229b032397b2492f0c2,
and I want to decompress this string (2f8d48710d6e4229b032397b2492f0c2) in Java.
I put the string in a file, then:
public static String decompress1951(final String theFilePath) {
byte[] buffer = null;
try {
String ret = "";
System.out.println("can come to ret");
InputStream in = new InflaterInputStream(new Base64InputStream(new FileInputStream(theFilePath)), new Inflater(true));
System.out.println("can come to in");
while (in.available() != 0) {
buffer = new byte[20480];
int len = in.read(buffer, 0, 20480); // line 64: the exception happens here
if (len <=0) {
break;
}
ret = ret + new String(buffer, 0, len);
}
in.close();
return ret;
} catch (IOException e) {
System.out.println("Has IOException");
System.out.println(e.getMessage());
e.printStackTrace();
}
return "";
}
But I have an exception:
java.util.zip.ZipException: invalid stored block lengths
at java.util.zip.InflaterInputStream.read(Unknown Source)
at com.cnzz.mobile.datacollector.DecompressDeflate.decompress1951(DecompressDeflate.java:64)
at com.cnzz.mobile.datacollector.DecompressDeflate.main(DecompressDeflate.java:128)
The Java code above works perfectly. As noted in the comment, you somehow got the encoded value wrong. The encoded value I got using the JavaScript code is y0jNycnXUSjPL8pJAQA=
Then, when you copy this value to a file and call decompress1951, you do in fact get back hello, world as required. I don't know what to say about the JavaScript part, as the code you use seems to match the examples on the distribution web pages. I notice there is the original and a fork, so maybe there is some confusion there? Anyhow, there is this jsfiddle, which I think can be seen as a working version if you want to take a look at it.
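To check an encoded value without going through a file, the same Base64 plus raw deflate round trip can be sketched in memory with java.util.Base64 and java.util.zip. The class name RawDeflateCodec and its methods are hypothetical, and the sketch assumes the text is UTF-8 and that the JavaScript side emits standard Base64 over a raw deflate stream (no zlib header), which is what new Inflater(true) expects.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class RawDeflateCodec {
    /** Compresses with raw deflate (no zlib header) and encodes the result as Base64. */
    static String encode(String text) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    /** Decodes Base64 and inflates the raw deflate stream back into text. */
    static String decode(String base64) {
        Inflater inflater = new Inflater(true);
        inflater.setInput(Base64.getDecoder().decode(base64));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int count = inflater.inflate(buffer);
                if (count == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported deflate data");
                }
                out.write(buffer, 0, count);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not a raw deflate stream", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The value reported in the answer above; prints: hello, world
        System.out.println(decode("y0jNycnXUSjPL8pJAQA="));
    }
}
```

A string that is not valid Base64 over deflate data, such as the 32-character hex value in the question, makes decode throw instead of returning text, which helps narrow the problem down to the encoding side.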
