JAVA to Perl - port XOR encryptor class

I have the following Java class that implements XOR "encryption":
import java.io.PrintStream;
public class Encryptor
{
private static final String m_strPrivateKey = "4p0L#r1$";
public Encryptor()
{
}
public static String encrypt(String pass)
{
String strTarget = XORString(pass);
strTarget = StringToHex(strTarget);
return strTarget;
}
public static String decrypt(String pass)
{
String strTarget = HexToString(pass);
strTarget = XORString(strTarget);
return strTarget;
}
private static String GetKeyForLength(int nLength)
{
int nKeyLen = "4p0L#r1$".length();
int nRepeats = nLength / nKeyLen + 1;
String strResult = "";
for(int i = 0; i < nRepeats; i++)
{
strResult = strResult + "4p0L#r1$";
}
return strResult.substring(0, nLength);
}
private static String HexToString(String str)
{
StringBuffer sb = new StringBuffer();
char buffDigit[] = new char[4];
buffDigit[0] = '0';
buffDigit[1] = 'x';
int length = str.length() / 2;
byte bytes[] = new byte[length];
for(int i = 0; i < length; i++)
{
buffDigit[2] = str.charAt(i * 2);
buffDigit[3] = str.charAt(i * 2 + 1);
Integer b = Integer.decode(new String(buffDigit));
bytes[i] = (byte)b.intValue();
}
return new String(bytes);
}
private static String XORString(String strTarget)
{
int nTargetLen = strTarget.length();
String strPaddedKey = GetKeyForLength(nTargetLen);
String strResult = "";
byte bytes[] = new byte[nTargetLen];
for(int i = 0; i < nTargetLen; i++)
{
int b = strTarget.charAt(i) ^ strPaddedKey.charAt(i);
bytes[i] = (byte)b;
}
String result = new String(bytes);
return result;
}
private static String StringToHex(String strInput)
{
StringBuffer hex = new StringBuffer();
int nLen = strInput.length();
for(int i = 0; i < nLen; i++)
{
char ch = strInput.charAt(i);
int b = ch;
String hexStr = Integer.toHexString(b);
if(hexStr.length() == 1)
{
hex.append("0");
}
hex.append(Integer.toHexString(b));
}
return hex.toString();
}
public static void main(String args[])
{
if(args.length < 1)
{
System.err.println("Missing password!");
System.exit(-1);
}
String pass = args[0];
String pass2 = encrypt(pass);
System.out.println("Encrypted: " + pass2);
pass2 = decrypt(pass2);
System.out.println("Decrypted: " + pass2);
if(!pass.equals(pass2))
{
System.out.println("Test Failed!");
System.exit(-1);
}
}
}
I tried to port it to Perl like this:
#!/usr/bin/perl
use strict;
use warnings;
my $pass = shift || die "Missing password!\n";
my $pass2 = encrypt($pass);
print "Encrypted: $pass2\n";
$pass2 = decrypt($pass2);
print "Decrypted: $pass2\n";
if ($pass ne $pass2) {
print "Test Failed!\n";
exit(-1);
}
sub encrypt {
my $pass = shift;
my $strTarget = XORString($pass);
$strTarget = StringToHex($strTarget);
return $strTarget;
}
sub decrypt {
my $pass = shift;
my $strTarget = HexToString($pass);
$strTarget = XORString($strTarget);
return $strTarget;
}
sub GetKeyForLength {
my $nLength = shift;
my $nKeyLen = length '4p0L#r1$';
my $nRepeats = $nLength / $nKeyLen + 1;
my $strResult = '4p0L#r1$' x $nRepeats;
return substr $strResult, 0, $nLength;
}
sub HexToString {
my $str = shift;
my @bytes;
while ($str =~ s/^(..)//) {
my $b = eval("0x$1");
push @bytes, chr sprintf("%d", $b);
}
return join "", @bytes;
}
sub XORString {
my $strTarget = shift;
my $nTargetLen = length $strTarget;
my $strPaddedKey = GetKeyForLength($nTargetLen);
my @bytes;
while ($strTarget) {
my $b = (chop $strTarget) ^ (chop $strPaddedKey);
unshift @bytes, $b;
}
return join "", @bytes;
}
sub StringToHex {
my $strInput = shift;
my $hex = "";
for my $ch (split //, $strInput) {
$hex .= sprintf("%02x", ord $ch);
}
return $hex;
}
The code seems OK, but the Java class outputs different results than the Perl code.
In Java I have the plain-text password
mentos
and it is encoded as
&4\=80CHB'
What should I change in my Perl script to get the same result? Where am I going wrong?
Another two examples: plain-text
07ch4ssw3bby
is encoded as:
,#(0\=DM.'# '8WQ2T
(note the space after #)
Last example, plain-text:
conf75
encoded as:
&7]P0G-#!
Thanks for help!
Ended up with this, thanks to Joni Salonen:
#!/usr/bin/perl
# XOR password decoder
# Greets: Joni Salonen @ stackoverflow.com
$key = pack("H*","3cb37efae7f4f376ebbd76cd");
print "Enter string to decode: ";
$str=<STDIN>;chomp $str; $str =~ s/\\//g;
$dec = decode($str);
print "Decoded string value: $dec\n";
sub decode{ #Sub to decode
@subvar=@_;
my $sqlstr = $subvar[0];
$cipher = unpack("u", $sqlstr);
$plain = $cipher^$key;
return substr($plain, 0, length($cipher));
}
My only remaining problem is that when a "\" is found (actually "\\", where one backslash escapes the real character), the decryption goes wrong.
Example encoded string:
"(4\\4XB\:7"G#, "
(I wrapped it in double quotes; the last character of the string is a space.) It should decode to
"ovFsB6mu"
Update: thanks to Joni Salonen, I have 100% working final version:
#!/usr/bin/perl
# XOR password decoder
# Greets: Joni Salonen @ stackoverflow.com
$key = pack("H*","3cb37efae7f4f376ebbd76cd");
print "Enter string to decode: ";
$str=<STDIN>;chomp $str; $str =~s/\\(.)/$1/g;
$dec = decode($str);
print "Decoded string value: $dec\n";
sub decode{ #Sub to decode
@subvar=@_;
my $sqlstr = $subvar[0];
$cipher = unpack("u", $sqlstr);
$plain = $cipher^$key;
return substr($plain, 0, length($cipher));
}

Your encryption loop skips the first character of $strTarget if it happens to be '0'. You could compare it against an empty string instead of checking if it's "true":
while ($strTarget ne '') {
my $b = (chop $strTarget) ^ (chop $strPaddedKey);
unshift @bytes, $b;
}
Update: This program decrypts your strings:
use feature ':5.10';
$key = pack("H*","3cb37efae7f4f376ebbd76cd");
say decrypt("&4\=80CHB'"); # mentos
say decrypt(",#(0\=DM.'# '8WQ2T"); # 07ch4ssw3bby
say decrypt("&7]P0G-#!"); # conf75
sub decrypt {
$in = shift;
$cipher = unpack("u", $in);
$plain = $cipher^$key;
return substr($plain, 0, length($cipher));
}
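For readers on the Java side, the same decoding can be sketched as follows. This is a minimal sketch, not the original application's code: the class and method names (XorUuDecoder, uudecodeLine, decode) are made up for illustration, it handles a single uuencoded line only, and it expects the string with the backslash escapes already removed. One assumption differs from the Perl version: for input longer than the 12-byte key the sketch repeats the key, whereas Perl's string XOR leaves the extra bytes unchanged.

```java
import java.nio.charset.StandardCharsets;

class XorUuDecoder {
    // Key bytes taken from the Perl program above.
    private static final byte[] KEY = hexToBytes("3cb37efae7f4f376ebbd76cd");

    static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Decodes one uuencoded line: the first character carries the byte count,
    // every following character carries 6 bits as (char - 32) & 0x3f.
    static byte[] uudecodeLine(String line) {
        int n = (line.charAt(0) - 32) & 0x3f;
        byte[] out = new byte[n];
        int pos = 1;
        int written = 0;
        while (written < n) {
            int triple = 0;
            for (int k = 0; k < 4; k++) {
                int v = pos < line.length() ? (line.charAt(pos) - 32) & 0x3f : 0;
                triple = (triple << 6) | v;
                pos++;
            }
            for (int shift = 16; shift >= 0 && written < n; shift -= 8) {
                out[written++] = (byte) (triple >> shift);
            }
        }
        return out;
    }

    static String decode(String encoded) {
        byte[] cipher = uudecodeLine(encoded);
        byte[] plain = new byte[cipher.length];
        for (int i = 0; i < cipher.length; i++) {
            // Assumption: the key is applied cyclically for long inputs.
            plain[i] = (byte) (cipher[i] ^ KEY[i % KEY.length]);
        }
        return new String(plain, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(decode("&4=80CHB'")); // mentos
        System.out.println(decode("&7]P0G-#!")); // conf75
    }
}
```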

Related

Jaunt Java getText() returning correct text but with lots of "?"

The title explains all, also, I have tried removing them
(because the text is there, but instead of "aldo" there is "al?do", also it seems to have a random pattern)
with (String).replace("?", ""), but with no success.
I have also used this, with a combination of UTF_8,UTF_16 and ISO-8859, with no success.
byte[] ptext = tempName.getBytes(UTF_8);
String tempName1 = new String(ptext, UTF_16);
An example of what I am getting:
Studded Regular Sweatshirt // Instead of this
S?tudde?d R?eg?ular? Sw?eats?h?irt // I get this
Could it be the website that notices the headless browser and tries to "spoof" its content? How can I overcome this?

It looks very likely that the site you are scraping intentionally mixes the 3f and 64 characters (0x3f is "?") into your result.
So you either have to present yourself as a normal browser when scraping, or filter the characters out by replacing them.
Sample text:
Sca???rfa???ce??? E???mbr???oi�d???ered L�e???athe
After filtering:
Scarface Embroidered Leather
//Sca???rfa???ce??? E???mbr???oi�d???ered L�e???athe
//Scarface Embroidered Leathe
String hex="5363613f3f3f7266613f3f3f63653f3f3f20453f3f3f6d62723f3f3f6f69643f3f3f65726564204c653f3f3f61746865";
byte[] bytes= hexStringToBytes(hex);
//the only line you need
String res = new String(bytes,"UTF-8").replace("?","").replace("\uFFFD","");
private static byte charToByte(char c) {
return (byte) "0123456789ABCDEF".indexOf(c);
}
public static byte[] hexStringToBytes(String hexString) {
if (hexString == null || hexString.equals("")) {
return null;
}
hexString = hexString.toUpperCase();
int length = hexString.length() / 2;
char[] hexChars = hexString.toCharArray();
byte[] d = new byte[length];
for (int i = 0; i < length; i++) {
int pos = i * 2;
d[i] = (byte) (charToByte(hexChars[pos]) << 4 | charToByte(hexChars[pos + 1]));
}
return d;
}
public static String bytesToHexString(byte[] src){
StringBuilder stringBuilder = new StringBuilder("");
if (src == null || src.length <= 0) {
return null;
}
for (int i = 0; i < src.length; i++) {
int v = src[i] & 0xFF;
String hv = Integer.toHexString(v);
if (hv.length() < 2) {
stringBuilder.append(0);
}
stringBuilder.append(hv);
}
return stringBuilder.toString();
}
public String printHexString( byte[] b) {
String a = "";
for (int i = 0; i < b.length; i++) {
String hex = Integer.toHexString(b[i] & 0xFF);
if (hex.length() == 1) {
hex = '0' + hex;
}
a = a+hex;
}
return a;
}
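The filtering step on its own can be sketched without the hex helpers. This is only an illustration under stated assumptions: the class name ScrapedTextCleaner is hypothetical, and the set of characters treated as noise ("?", U+FFFD and the zero-width characters U+200B and U+200C) is a guess based on the samples in this thread. It also removes genuine question marks, so it only suits fields such as product names.

```java
class ScrapedTextCleaner {
    // Drops every character assumed to be injected noise and keeps the rest.
    static String clean(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean noise = c == '?' || c == '\uFFFD' || c == '\u200B' || c == '\u200C';
            if (!noise) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample taken from the question above.
        System.out.println(clean("S?tudde?d R?eg?ular? Sw?eats?h?irt")); // Studded Regular Sweatshirt
    }
}
```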

Switch from using a class to using a method

So I've been working on this bit of code for awhile now. It's about encrypting and decrypting a message and producing the two keys used alternatively to encrypt a message using the Caesar Cipher method of changing letters with corresponding letters from a shifted alphabet. For example "Fruit" would be "Hwwnv" according to the shifted alphabets found by implementing +2 to every other letter starting with the first letter, and implementing +5 to every other letter starting with the second letter. I've been using an instance of another class called BreakCaesarThree to find these two keys, dkey_0 and dkey_1 and a decrypted message. I would rather use my method breakCaesarTwo instead, because of the ease of having all my necessary code in one class. How would I go about doing this? How do I change it so that I'm using breakCaesarTwo method instead of BreakCaesarThree class, and still be able to print out the dkey_0 and dkey_1 and a decrypted message? I am hoping that changing to using the breakCaesarTwo method will yield the right results.
Note: Right now calling BreakCaesarThree doesn't yield a decrypted message or give the right keys (I get 0s).
Here's my TestCaesarCipherTwo code which includes the breakCaesarTwo method:
import edu.duke.*;
public class TestCaesarCipherTwo {
private String alphabetLower;
private String alphabetUpper;
private String shiftedAlphabetLower1;
private String shiftedAlphabetUpper1;
private String shiftedAlphabetLower2;
private String shiftedAlphabetUpper2;
private int mainKey1;
private int mainKey2;
private int dkey_0;
private int dkey_1;
/**
*
*/
public void simplebreaker()
{
FileResource fr = new FileResource();
String encrypted = fr.asString();
BreakCaesarThree bct = new BreakCaesarThree();
String broken = bct.decrypt(encrypted);
System.out.println("Keys found: " + bct.dkey_0 + ", " + bct.dkey_1 + "\n" + broken);
}
public String halfOfString(String message, int start) {
StringBuilder halfString = new StringBuilder();
for (int index=start;index < message.length();index += 2) {
halfString.append(message.charAt(index));
}
return halfString.toString();
}
public String decrypt(String input) {
CaesarCipherTwoKeys cctk= new CaesarCipherTwoKeys(26 - mainKey1, 26 - mainKey2);
String decrypted = cctk.encrypt(input);
return decrypted;
}
public int[] countOccurrencesOfLetters(String message) {
//snippet from lecture
String alph = "abcdefghijklmnopqrstuvwxyz";
int[] counts = new int[26];
for (int k=0; k < message.length(); k++) {
char ch = Character.toLowerCase(message.charAt(k));
int dex = alph.indexOf(ch);
if (dex != -1) {
counts[dex] += 1;
}
}
return counts;
}
public int maxIndex(int[] values) {
int maxDex = 0;
for (int k=0; k < values.length; k++) {
if (values[k] > values[maxDex]) {
maxDex = k;
}
}
return maxDex;
}
public void simpleTests()
{
int key1 = 17;
int key2 = 3;
FileResource fr = new FileResource();
String message = fr.asString();
CaesarCipherTwoKeys cctk = new CaesarCipherTwoKeys(key1, key2);
String encrypted = cctk.encrypt(message);
System.out.println(encrypted);
String decrypted = cctk.decrypt(encrypted);
System.out.println(decrypted);
BreakCaesarThree bct = new BreakCaesarThree();
String broken = bct.decrypt(encrypted);
System.out.println("Keys found: " + bct.dkey_0 + ", " + bct.dkey_1 + "\n" + broken);
}
public String breakCaesarTwo(String input) {
String in_0 = halfOfString(input, 0);
String in_1 = halfOfString(input, 1);
// Find first key
// Determine character frequencies in ciphertext
int[] freqs_0 = countOccurrencesOfLetters(in_0);
// Get the most common character
int freqDex_0 = maxIndex(freqs_0);
// Calculate key such that 'E' would be mapped to the most common ciphertext character
// since 'E' is expected to be the most common plaintext character
int dkey_0 = freqDex_0 - 4;
// Make sure our key is non-negative
if (dkey_0 < 0) {
dkey_0 = dkey_0+26;
}
// Find second key
int[] freqs_1 = countOccurrencesOfLetters(in_1);
int freqDex_1 = maxIndex(freqs_1);
int dkey_1 = freqDex_1 - 4;
if (freqDex_1 < 4) {
dkey_1 = dkey_1+26;
}
CaesarCipherTwoKeys cctk = new CaesarCipherTwoKeys(dkey_0, dkey_1);
return cctk.decrypt(input);
}
}
I'd like to implement the changes here:
public void simplebreaker()
{
FileResource fr = new FileResource();
String encrypted = fr.asString();
BreakCaesarThree bct = new BreakCaesarThree();
String broken = bct.decrypt(encrypted);
System.out.println("Keys found: " + bct.dkey_0 + ", " + bct.dkey_1 + "\n" + broken);
}
and here:
public void simpleTests()
{
int key1 = 17;
int key2 = 3;
FileResource fr = new FileResource();
String message = fr.asString();
CaesarCipherTwoKeys cctk = new CaesarCipherTwoKeys(key1, key2);
String encrypted = cctk.encrypt(message);
System.out.println(encrypted);
String decrypted = cctk.decrypt(encrypted);
System.out.println(decrypted);
BreakCaesarThree bct = new BreakCaesarThree();
String broken = bct.decrypt(encrypted);
System.out.println("Keys found: " + bct.dkey_0 + ", " + bct.dkey_1 + "\n" + broken);
}
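The two-key scheme described at the top of this question (one key for the letters at even positions, another for the letters at odd positions) can be sketched on its own. This is a minimal sketch, not the course's CaesarCipherTwoKeys class: the class name TwoKeyCaesar and its methods are hypothetical, and only the ASCII letters A-Z and a-z are shifted.

```java
class TwoKeyCaesar {
    // Shifts one ASCII letter by key positions, wrapping around the alphabet.
    static char shift(char ch, int key) {
        int k = Math.floorMod(key, 26);
        if (ch >= 'A' && ch <= 'Z') {
            return (char) ('A' + (ch - 'A' + k) % 26);
        }
        if (ch >= 'a' && ch <= 'z') {
            return (char) ('a' + (ch - 'a' + k) % 26);
        }
        return ch; // anything that is not a letter is left alone
    }

    // key1 is used for indexes 0, 2, 4, ... and key2 for indexes 1, 3, 5, ...
    static String encrypt(String message, int key1, int key2) {
        StringBuilder out = new StringBuilder(message.length());
        for (int i = 0; i < message.length(); i++) {
            out.append(shift(message.charAt(i), i % 2 == 0 ? key1 : key2));
        }
        return out.toString();
    }

    // Decrypting is encrypting with the complementary keys, as in decrypt() above.
    static String decrypt(String message, int key1, int key2) {
        return encrypt(message, 26 - key1, 26 - key2);
    }

    public static void main(String[] args) {
        System.out.println(encrypt("Fruit", 2, 5)); // Hwwnv
        System.out.println(decrypt("Hwwnv", 2, 5)); // Fruit
    }
}
```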

base64 decoding to UTF-8, one character not displaying correctly

I am trying to decode a string from base64 to UTF-8 for an assignment.
Not having programmed Java for a while I am probably not using the most efficient method, however I managed to implement a function working 99% correctly.
Decoding the example string in Base64: VGhpcyBpcyBhbiBBcnhhbiBzYW1wbGUgc3RyaW5nIHRoYXQgc2hvdWxkIGJlIGVhc2lseSBkZWNvZGVkIGZyb20gYmFzZTY0LiAgSXQgaW5jbHVkZXMgYSBudW1iZXIgb2YgVVRGOCBjaGFyYWN0ZXJzIHN1Y2ggYXMgdGhlIPEsIOksIOgsIOcgYW5kICYjOTYwOyBjaGFyYWN0ZXJzLg==
Results in:
This is an Arxan sample string that should be easily decoded from base64. It includes a number of UTF8 characters such as the ñ, é, è, ç and &#960 characters.
However, in the place of the &#960 should be the π symbol being outputted.
Note that I removed the ; after &#960 in here as it seems Stackoverflow automatically corrected it to π
I have tried many things such as creating a byte array and printing that, but still not working.
I am using Eclipse, can it be that just the output there displays incorrectly?
Does somebody have a suggestion to get this to work?
Thanks,
Vincent
Here is my code:
package base64;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
public class Base64 {
public static void main(String[] args) {
//Input strings
String base64 = "VGhpcyBpcyBhbiBBcnhhbiBzYW1wbGUgc3RyaW5nIHRoYXQgc2hvdWxkIGJlIGVhc2lseSBkZWNvZGVkIGZyb20gYmFzZTY0LiAgSXQgaW5jbHVkZXMgYSBudW1iZXIgb2YgVVRGOCBjaGFyYWN0ZXJzIHN1Y2ggYXMgdGhlIPEsIOksIOgsIOcgYW5kICYjOTYwOyBjaGFyYWN0ZXJzLg==";
//String base64 = "YW55IGNhcm5hbCBwbGVhc3U=";
String utf8 = "any carnal pleas";
//Base64 to UTF8
System.out.println("Base64 conversion to UTF8");
System.out.println("-------------------------");
System.out.println("Input base64-string: " + base64);
System.out.println("Output UTF8-string: " + stringFromBase64(base64));
System.out.println();
//UTF8 to Base64
System.out.println("UTF8 conversion to base64");
System.out.println("-------------------------");
System.out.println("Input UTF8-string: " + utf8);
System.out.println("Output base64-string: " + stringToBase64(utf8));
System.out.println();
System.out.println("Pi is π");
}
public static String stringFromBase64(String base64) {
StringBuilder binary = new StringBuilder();
int countPadding = countPadding(base64); //count number of padding symbols in source string
//System.out.println("No of *=* in the input is : " + countPadding);
//System.out.println(base64);
for(int i=0; i<(base64.length()-countPadding); i++)
{
int base64Value = fromBase64(String.valueOf(base64.charAt(i))); //convert Base64 character to Int
String base64Binary = Integer.toBinaryString(base64Value); //convert Int to Binary string
StringBuilder base64BinaryCopy = new StringBuilder(); //debugging
if (base64Binary.length()<6) //adds required zeros to make 6 bit string
{
for (int j=base64Binary.length();j<6;j++){
binary.append("0");
base64BinaryCopy.append("0"); //debugging
}
base64BinaryCopy.append(base64Binary); // debugging
} else // debugging
{
base64BinaryCopy.append(base64Binary); //debugging
} // debugging
//System.out.println(base64.charAt(i) + " = " + base64Value + " = " + base64BinaryCopy); //debugging
binary.append(base64Binary);
}
//System.out.println(binary);
//System.out.println(binary.length());
StringBuilder utf8String = new StringBuilder();
for (int bytenum=0;bytenum<(binary.length()/8);bytenum++) //parse string Byte-by-Byte
{
StringBuilder utf8Bit = new StringBuilder();
for (int bitnum=0;bitnum<8;bitnum++){
utf8Bit.append(binary.charAt(bitnum+(bytenum*8)));
}
char utf8Char = (char) Integer.parseInt(utf8Bit.toString(), 2); //Byte to utf8 char
utf8String.append(String.valueOf(utf8Char)); //utf8 char to string and append to final utf8-string
//System.out.println(utf8Bit + " = " + Integer.parseInt(utf8Bit.toString(), 2) + " = " + utf8Char + " = " + utf8String); //debugging
}
return utf8String.toString();
}
public static String stringToBase64(String utf8) {
StringBuilder binary = new StringBuilder();
String paddingString = "";
String paddingSymbols = "";
for(int i=0; i<(utf8.length()); i++)
{
int utf8Value = utf8.charAt(i); //convert utf8 character to Int
String utf8Binary = Integer.toBinaryString(utf8Value); //convert Int to Binary string
StringBuilder utf8BinaryCopy = new StringBuilder(); //debugging
if (utf8Binary.length()<8) //adds required zeros to make 8 bit string
{
for (int j=utf8Binary.length();j<8;j++){
binary.append("0");
utf8BinaryCopy.append("0"); //debugging
}
utf8BinaryCopy.append(utf8Binary); // debugging
} else // debugging
{
utf8BinaryCopy.append(utf8Binary); //debugging
} // debugging
//System.out.println(utf8.charAt(i) + " = " + utf8Value + " = " + utf8BinaryCopy);
binary.append(utf8Binary);
}
if ((binary.length() % 6) == 2) {
paddingString = "0000"; //add 4 padding zeroes
paddingSymbols = "==";
} else if ((binary.length() % 6) == 4) {
paddingString = "00"; //add 2 padding zeroes
paddingSymbols = "=";
}
binary.append(paddingString); //add padding zeroes
//System.out.println(binary);
//System.out.println(binary.length());
StringBuilder base64String = new StringBuilder();
for (int bytenum=0;bytenum<(binary.length()/6);bytenum++) //parse string Byte-by-Byte per 6 bits
{
StringBuilder base64Bit = new StringBuilder();
for (int bitnum=0;bitnum<6;bitnum++){
base64Bit.append(binary.charAt(bitnum+(bytenum*6)));
}
int base64Int = Integer.parseInt(base64Bit.toString(), 2); //Byte to Int
char base64Char = toBase64(base64Int); //Int to Base64 char
base64String.append(String.valueOf(base64Char)); //base64 char to string and append to final Base64-string
//System.out.println(base64Bit + " = " + base64Int + " = " + base64Char + " = " + base64String); //debugging
}
base64String.append(paddingSymbols); //add padding ==
return base64String.toString();
}
public static char toBase64(int a) { //converts integer to corresponding base64 char
String charBase64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
//charBase64 = new char[]{'A','B','C','D','E','F','G','H','I','J','K','L','M','N'};
return charBase64.charAt(a);
}
public static int fromBase64(String x) { //converts base64 string to corresponding integer
String charBase64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
return charBase64.indexOf(x);
}
public static int countPadding(String countPadding) { //counts the number of padding symbols in base64 input string
int index = countPadding.indexOf("=");
int count = 0;
while (index != -1) {
count++;
countPadding = countPadding.substring(index + 1);
index = countPadding.indexOf("=");
}
return count;
}
}

UTF8 is a character encoding that transforms a given char to 1, 2 or more bytes. Your code assumes that each byte should be transformed to a character. That works fine for ASCII characters such as a, b, c that are indeed transformed to a single byte by UTF8, but it doesn't work for characters like PI, which are transformed to a multi-byte sequence.
Your algorithm is awfully inefficient, and I would just ditch it and use a ready-to-use encoder/decoder. JDK 8 comes with one (java.util.Base64). Guava and commons-codec also do. Your code should be as simple as
String base64EncodedByteArray = "....";
byte[] decodedByteArray = Base64.getDecoder().decode(base64EncodedByteArray);
String asString = new String(decodedByteArray, StandardCharsets.UTF_8);
or, for the other direction:
String someString = "VGhpcyBpcyBhb...";
byte[] asByteArray = someString.getBytes(StandardCharsets.UTF_8);
String base64EncodedByteArray = Base64.getEncoder().encodeToString(asByteArray);
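Put together as a complete program, the JDK 8 route might look like the sketch below. The class name Base64RoundTrip is made up for illustration; java.util.Base64 and StandardCharsets are the standard JDK classes. The point it shows is that the bytes are decoded as a whole, so a multi-byte character such as π (two bytes in UTF-8) survives the round trip.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64RoundTrip {
    // Text -> UTF-8 bytes -> Base64 string.
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Base64 string -> bytes -> text, interpreting the bytes as UTF-8.
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode("any carnal pleas")); // YW55IGNhcm5hbCBwbGVhcw==
        System.out.println(decode(encode("Pi is \u03C0")).equals("Pi is \u03C0")); // true
    }
}
```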

Tokenize devnagari words into letters

I have something like
a = "बिक्रम मेरो नाम हो"
I want to achieve something like in Java
a[0] = बि
a[1] = क्र
a[2] = म

Java internally stores the characters of any language as UTF-16 (2 bytes per code unit), so you can access them individually by index.

Try This:
String a = "बिक्रम मेरो नाम हो";
int strLen = a.length();
char array[] = new char[strLen];
String strArray1[] = new String[strLen];
for (int i=0 ; i< strLen ; i++)
{
array[i] = a.charAt(i);
strArray1[i] = Character.toString(a.charAt(i));
System.out.println ("Index = " + i + "* Char = " +array[i] + "** String =" +strArray1[i] );
}
Output:
Index = 0* Char = ब** String =ब
Index = 1* Char = ि** String =ि
Index = 2* Char = क** String =क
Index = 3* Char = ्** String =्
Index = 4* Char = र** String =र
Index = 5* Char = म** String =म
Index = 6* Char = ** String =
Index = 7* Char = म** String =म
Index = 8* Char = े** String =े
Index = 9* Char = र** String =र
Index = 10* Char = ो** String =ो
Index = 11* Char = ** String =
Index = 12* Char = न** String =न
Index = 13* Char = ा** String =ा
Index = 14* Char = म** String =म
Index = 15* Char = ** String =
Index = 16* Char = ह** String =ह
Index = 17* Char = ो** String =ो
Note:
In order to allow eclipse to allow you to save your java program with foreign characters(Hindi alphabets), do the following:
Go to:
"Windows > Preferences > General > Content Types > Text > {Choose file type}
{Selected file type} > Default encoding > UTF-8" and click Update.

Did you try ICU4J?
Its BreakIterator character instance can split a String into characters.

My code is not at all optimized, sorry about that but it works!
Just change the path of the file in which you are going to enter the devnagri sentence and it should work.
public static void main(String[] args) throws IOException
{
BufferedReader br = new BufferedReader(new FileReader("/home/ubuntu/Documents/trainforjava.txt")); //PLEASE ENTER PATH HERE
String[] devFull = new String[]{
"अ","आ", "इ", "ई", "उ", "ऊ", "ऋ"
, "ऌ" ,"ऍ", "ए", "ऐ", "ऑ", "ओ", "औ",
"क", "ख", "ग", "घ" ,"ङ",
"च" ,"छ" ,"ज"," झ"," ञ",
"ट","ठ", "ड"," ढ"," ण",
"त", "थ", "द", "ध", "न",
"प", "फ", "ब"," भ","म",
"य", "र", "ल", "ळ",
"व", "श" ,"ष","स" ,"ह"
};
String[] uniDev = new String[]
{
"905","906","907","908","909","90a","90b",
"90c","90d","90f","910","911","913","914",
"915","916","917","918","919",
"91a","91b","91c","91d","91e",
"91f","920","921","922","923",
"924","925","926","927","928",
"92a","92b","92c","92d","92e",
"92f","930","932","933",
"935","936","937","938","939"
};
String[] devHalf = new String[]
{
"$़","ऽ","$ा","$ि" ,
"$ी", "$ ु","$ू","$ृ","$ॄ","$ॅ",
"$े","$ै","$ॉ",
"$ो","$ौ"
};
String[] gujHalf = new String[]
{
"$઼","ઽ","$ા","$િ" ,
"$ી","$ુ","$ૂ","$ૃ","$ૄ","$ૅ",
"$ે","$ૈ","$ૉ",
"$ો","$ૌ"
};
try
{
StringBuilder sb = new StringBuilder();
String line; // do not read here, otherwise the first line of the file is skipped
while( (line = br.readLine() ) != null)
{
line=line.replaceAll(" ", ""); //remove white spaces if any
System.out.println();
//System.out.println(line);
int strLength = line.length();
// String a = "बिक्रम मेरो नाम हो";
int strLen = line.length();
char array[] = new char[strLen];
String strArray1[] = new String[strLen];
int mark[] = new int[strLen+1];
String unis[]=new String[strLen];
int cnt=0;
String newCharD[]=new String [strLen];
String newCharG[]=new String [strLen];
String tempD=null;
String tempG=null;
String arr = null;
String next =null;
String temp=null;
String uniNext=null;
int hold=0;
int j=0;
for (int i=0 ; i< strLen ; i++)
{
j=i+1;
array[i] = line.charAt(i);
strArray1[i] = Character.toString(line.charAt(i));
if(i<(strLen-1))
{
char nbit = line.charAt(j);
next=Character.toString(line.charAt(j));
uniNext=Integer.toHexString(nbit);
//System.out.print("\nUninext:\t"+uniNext);
}
unis[i]=Integer.toHexString(array[i]);
mark[strLen]=1;
if((Arrays.asList(devFull).contains(Character.toString(array[i]))) && (!uniNext.equalsIgnoreCase("94d")) )
{
mark[i]=1;
}
else
{
mark[i]=0;
}
//
//System.out.println();
//System.out.println ("Index = " + i + "* Char = " +array[i] + "** String =" +strArray1[i]+ "Unicode="+unis[i]+"Mark="+mark[i]);
//System.out.print(unis[i].toString());
}
int start=0;
start=0;
for(int l1=0;l1<=strLen;l1++)
{
//start=0;
if(l1==0)
{
temp=Character.toString(array[l1]);
}
else
{
if(mark[l1]==0)
{
temp=temp+Character.toString(array[l1]);
}
else
{
System.out.print(" "+temp);
newCharD[start]=temp;
start++;
temp=null;
if(l1!=strLen)
{
temp=Character.toString(array[l1]);
}
}
}
}
/* for(int s=0;s<start;s++)
{
System.out.print(" "+newCharD[s]);
}*/
for(int s=0;s<start;s++)
{
}
}
}
finally {
br.close();
}
//PrintStream out = new PrintStream(new //FileOutputStream("/home/ubuntu/Documents/trainforjavaoutput.txt"));
//System.setOut(out);
}

Try this for Hindi :-
import java.io.*;
import java.text.BreakIterator;
import java.util.Locale;
public class Test {
public static void main(String[] args) throws IOException
{
String text = "बिक्रम मेरो नाम हो";
Locale hindi = new Locale("hi", "IN");
BreakIterator breaker = BreakIterator.getCharacterInstance(hindi);
breaker.setText(text);
int start = breaker.first();
for (int end = breaker.next();
end != BreakIterator.DONE;
start = end, end = breaker.next()) {
System.out.println(text.substring(start,end));
}
}
}
OUTPUT:-
बि
क्र
म
मे
रो
ना
म
हो
BreakIterator Java Documentation: https://docs.oracle.com/javase/tutorial/i18n/text/about.html
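To get the pieces into a collection, as the question asks with a[0], a[1] and so on, the same loop can fill a list. This is a small sketch under assumptions: the class name GraphemeSplitter is hypothetical, it uses the default locale, and how conjuncts such as क्र are grouped depends on the JDK's break rules, so no Devanagari output is claimed here. Spaces come back as elements of their own.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class GraphemeSplitter {
    // Splits text at the character boundaries reported by BreakIterator.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> letters = split("बिक्रम मेरो नाम हो");
        for (int i = 0; i < letters.size(); i++) {
            System.out.println("a[" + i + "] = " + letters.get(i));
        }
    }
}
```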

In order to split the string by letters rather than characters, going by dvasanth's suggestion, you can try below:
String x = "बिक्रम मेरो नाम हो";
x=x.replaceAll(" ", ""); // Remove all spaces
int strLength = x.length();
String [] letterArray = new String [(strLength + 1) / 2]; // one slot per pair, rounded up
String [] strArray1 = new String [strLength];
String combined = "";
for (int i=0, j=0; i < strLength ; i=i+2,j++)
{
strArray1[i] = Character.toString(x.charAt(i));
if (i+1 < strLength)
{
strArray1[i+1] = Character.toString(x.charAt(i+1));
combined = strArray1[i]+strArray1[i+1]; // This line provides the letters.
// Assumption is that each letter is 2 unicode characters long.
}
else
{
combined = strArray1[i];
}
letterArray [j] = combined;
System.out.println("Split string by letters is : "+combined);
System.out.println("Split string by letters in array is : "+letterArray [j]);
}
Output is:
Split string by letters is : बि
Split string by letters is : क्
Split string by letters is : रम
Split string by letters is : मे
Split string by letters is : रो
Split string by letters is : ना
Split string by letters is : मह
Split string by letters is : ो
Note:
In order to allow eclipse to allow you to save your java program with foreign characters(Hindi alphabets), do the following:
Go to:
"Windows > Preferences > General > Content Types > Text > {Choose file type}
{Selected file type} > Default encoding > UTF-8" and click Update.

How to refactor, fix and optimize this character replacement function in java

While tuning the application, I found this routine that strips CDATA tags from an XML string and replaces certain characters with character references so that they can be displayed in an HTML page.
The routine is less than perfect; it leaves trailing space and fails with a StringIndexOutOfBoundsException if there is something wrong with the XML.
I created a few unit tests when I started working on the routine, but the present functionality can be improved, so they serve more as a reference.
The routine needs refactoring for sanity reasons. But the real reason I need to fix it is to improve performance: it has become a serious performance bottleneck in the application.
package engine;
import junit.framework.Assert;
import junit.framework.TestCase;
public class StringFunctionsTest extends TestCase {
public void testEscapeXMLSimple(){
final String simple = "<xml><SvcRsData>a<![CDATA[<sender>John & Smith</sender>]]></SvcRsData></xml> ";
final String expected = "<xml><SvcRsData>a<sender>John & Smith</sender></SvcRsData></xml> ";
String result = StringFunctions.escapeXML(simple);
Assert.assertTrue(result.equals(expected));
}
public void testEscapeXMLCDATAInsideCDATA(){
final String stringWithCDATAInsideCDATA = "<xml><SvcRsData>a<![CDATA[<sender>John <![CDATA[Inner & CD ]]>& Smith</sender>]]></SvcRsData></xml> ";
final String expected = "<xml><SvcRsData>a<sender>John <![CDATA[Inner & CD & Smith</sender>]]></SvcRsData></xml> ";
String result = StringFunctions.escapeXML(stringWithCDATAInsideCDATA);
Assert.assertTrue(result.equals(expected));
}
public void testEscapeXMLCDATAWithoutClosingTag(){
final String stringWithCDATAWithoutClosingTag = "<xml><SvcRsData>a<![CDATA[<sender>John & Smith</sender></SvcRsData></xml> ";
try{
String result = StringFunctions.escapeXML(stringWithCDATAWithoutClosingTag);
}catch(StringIndexOutOfBoundsException exception){
Assert.assertNotNull(exception);
}
}
public void testEscapeXMLCDATAWithTwoCDATAClosingTags(){
final String stringWithCDATAWithTwoClosingTags = "<xml><SvcRsData>a<![CDATA[<sender>John Inner & CD ]]>& Smith</sender>]]>bcd & efg</SvcRsData></xml> ";
final String expectedAfterSecondClosingTagNotEscaped = "<xml><SvcRsData>a<sender>John Inner & CD & Smith</sender>]]>bcd & efg</SvcRsData></xml> ";
String result = StringFunctions.escapeXML(stringWithCDATAWithTwoClosingTags);
Assert.assertTrue(result.equals(expectedAfterSecondClosingTagNotEscaped));
}
public void testEscapeXMLSimpleTwoCDATA(){
final String stringWithTwoCDATA = "<xml><SvcRsData>a<![CDATA[<sender>John & Smith</sender>]]>abc<sometag>xyz</sometag><sometag2><![CDATA[<recipient>Gorge & Doe</recipient>]]></sometag2></SvcRsData></xml> ";
final String expected = "<xml><SvcRsData>a<sender>John & Smith</sender>abc<sometag>xyz</sometag><sometag2><recipient>Gorge & Doe</recipient></sometag2></SvcRsData></xml> ";
String result = StringFunctions.escapeXML(stringWithTwoCDATA);
Assert.assertTrue(result.equals(expected));
}
public void testEscapeXMLOverlappingCDATA(){
final String stringWithTwoCDATA = "<xml><SvcRsData>a<![CDATA[<sender>John & <![CDATA[Smith</sender>]]>abc<sometag>xyz</sometag><sometag2><recipient>Gorge & Doe</recipient>]]></sometag2></SvcRsData></xml> ";
final String expectedMess = "<xml><SvcRsData>a<sender>John & <![CDATA[Smith</sender>abc<sometag>xyz</sometag><sometag2><recipient>Gorge & Doe</recipient>]]></sometag2></SvcRsData></xml> ";
String result = StringFunctions.escapeXML(stringWithTwoCDATA);
Assert.assertTrue(result.equals(expectedMess));
}
}
This is the function:
package engine;
public class StringFunctions {
public static String escapeXML(String s) {
StringBuffer result = new StringBuffer();
int stringSize = 0;
int posIniData = 0, posFinData = 0, posIniCData = 0, posFinCData = 0;
String stringPreData = "", stringRsData = "", stringPosData = "", stringCData = "", stringPreCData = "", stringTempRsData = "";
String stringNewRsData = "", stringPosCData = "", stringNewCData = "";
short caracter;
stringSize = s.length();
posIniData = s.indexOf("<SvcRsData>");
if (posIniData > 0) {
posIniData = posIniData + 11;
posFinData = s.indexOf("</SvcRsData>");
stringPreData = s.substring(0, posIniData);
stringRsData = s.substring(posIniData, posFinData);
stringPosData = s.substring(posFinData, stringSize);
stringTempRsData = stringRsData;
posIniCData = stringRsData.indexOf("<![CDATA[");
if (posIniCData > 0) {
while (posIniCData > 0) {
posIniCData = posIniCData + 9;
posFinCData = stringTempRsData.indexOf("]]>");
stringPreCData = stringTempRsData.substring(0,
posIniCData - 9);
stringCData = stringTempRsData.substring(posIniCData,
posFinCData);
stringPosCData = stringTempRsData.substring(
posFinCData + 3, stringTempRsData.length());
stringNewCData = replaceCharacter(stringCData);
stringTempRsData = stringTempRsData.substring(
posFinCData + 3, stringTempRsData.length());
stringNewRsData = stringNewRsData + stringPreCData
+ stringNewCData;
posIniCData = stringTempRsData.indexOf("<![CDATA[");
}
} else {
stringNewRsData = stringRsData;
}
stringNewRsData = stringNewRsData + stringPosCData;
s = stringPreData + stringNewRsData + stringPosData;
stringSize = s.length();
}
for (int i = 0; i < stringSize; i++) {
caracter = (short) s.charAt(i);
if (caracter > 128) {
result.append("&#");
result.append(caracter);
result.append(';');
} else {
result.append((char) caracter);
}
}
return result.toString();
}
private static String replaceCharacter(String s) {
StringBuffer result = new StringBuffer();
int stringSize = s.length();
short caracter;
for (int i = 0; i < stringSize; i++) {
caracter = (short) s.charAt(i);
if (caracter > 128 || caracter == 34 || caracter == 38
|| caracter == 60 || caracter == 62) {
result.append("&#");
result.append(caracter);
result.append(';');
} else {
result.append((char) caracter);
}
}
return result.toString();
}
}

Have a look at the StringEscapeUtils class from Apache Commons. It contains an escapeXml method.

It looks to me like you are doing something that has already been done before, probably in Apache Commons.
Your function is so convoluted that I'm not sure whether you are really just escaping XML or doing something more. If all you are doing is escaping XML, then you should look for a better implementation.

Take a look at StringEscapeUtils from Apache Commons. This has functions to escape/unescape XML/HTML reliably.
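If pulling in a library is not an option, the character-replacement part can be done in one pass with a StringBuilder. This is a minimal sketch, not the Apache Commons API and not a drop-in replacement for escapeXML above: the class name XmlCharEscaper is hypothetical, the CDATA handling is left out, and it assumes every code point above 127 should be escaped (the original compares against 128 after a cast to short).

```java
class XmlCharEscaper {
    // Replaces ", &, <, > and all non-ASCII code points with numeric character references.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 127 || cp == '"' || cp == '&' || cp == '<' || cp == '>') {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp); // step over surrogate pairs as one unit
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("<sender>John & Smith</sender>"));
    }
}
```

StringBuilder avoids the synchronization overhead of StringBuffer, and walking the string once avoids the repeated substring and concatenation work in the original loop.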
