hadoop mapper input deal with hex values - java

I have a list of tweets as input in HDFS, and I am trying to run a map-reduce task over it. This is my mapper implementation:
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
try {
String[] fields = value.toString().split("\t");
StringBuilder sb = new StringBuilder();
for (int i = 1; i < fields.length; i++) {
if (i > 1) {
sb.append("\t");
}
sb.append(fields[i]);
}
tid.set(fields[0]);
content.set(sb.toString());
context.write(tid, content);
} catch(DecoderException e) {
e.printStackTrace();
}
}
As you can see, I tried to split the input by "\t", but the input (value.toString()) looks like this when I print it out:
2014\x091880284777\x09argento_un\x090\x090\x09RT #topmusic619: #RETWEET THIS!!!!!\x5CnFOLLOW ME &amp
; EVERYONE ELSE THAT RETWEETS THIS FOR 35+ FOLLOWERS\x5Cn#TeamFollowBack #Follow2BeFollowed #TajF\xE2\x80\xA6
here is another example:
2014\x0934447260\x09RBEKP\x090\x090\x09\xE2\x80\x9C#LENEsipper: Wild lmfaooo RT #Yerrp08: L**o some
n***a nutt up while gettin twerked
I noticed that \x09 should be a tab character (ASCII 09 is tab), so I tried to use the Hex class from Apache Commons Codec:
String tmp = value.toString();
byte[] bytes = Hex.decodeHex(tmp.toCharArray());
But the decodeHex function returns null.
This is weird, since some of the characters are in hex while others are not. How can I decode them?
Edit:
Also note that besides tab, emojis are also encoded as hex values.
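One possible approach, offered as a minimal sketch rather than a definitive fix: assuming each record really contains the literal four-character sequence \xHH (backslash, x, two hex digits) mixed with ordinary text, Hex.decodeHex cannot help, because it expects the whole input to consist of hex digits. Instead, you can walk the string, turn each \xHH into one raw byte, copy everything else through, and decode the resulting bytes as UTF-8 so that multi-byte sequences such as \xE2\x80\xA6 become a single character. The class HexEscapeDecoder and its methods are hypothetical names, not part of Hadoop or Commons Codec.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; assumes escapes have exactly the form \xHH.
final class HexEscapeDecoder {

    // Turns literal \xHH escapes into raw bytes, then decodes all bytes as UTF-8.
    static String decode(String input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < input.length()) {
            if (isEscape(input, i)) {
                // Two hex digits after "\x" form one byte.
                out.write(Integer.parseInt(input.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                // Ordinary text: copy the code point through as UTF-8 bytes.
                int cp = input.codePointAt(i);
                byte[] raw = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(raw, 0, raw.length);
                i += Character.charCount(cp);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // True if the four characters starting at index i look like \xHH.
    private static boolean isEscape(String s, int i) {
        return i + 3 < s.length()
                && s.charAt(i) == '\\'
                && s.charAt(i + 1) == 'x'
                && Character.digit(s.charAt(i + 2), 16) >= 0
                && Character.digit(s.charAt(i + 3), 16) >= 0;
    }

    public static void main(String[] args) {
        String raw = "2014\\x091880284777\\x09argento_un\\x09TajF\\xE2\\x80\\xA6";
        String[] fields = decode(raw).split("\t");
        System.out.println(fields.length); // 4
        System.out.println(fields[3]);
    }
}
```

In the mapper, this would mean calling decode(value.toString()) before split("\t"). Note that \x5Cn decodes to a backslash followed by n, so the tweet text keeps its own two-character \n marker rather than gaining a real newline.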

Related

Convert inputstream byte to character

I have text with contents
12 13 14
The text has 8 spaces between values 12 and 13 and 13 and 14
My Java method receives the text as an InputStream argument, stores the contents in a byte array, and then converts each byte to a character:
public class FileUpload implements RequestStreamHandler{
String fileObjKeyName = "sample1.txt";
String bucketName="";
/**
* @param args
*/
@Override
public void handleRequest(InputStream inputStream, OutputStream outputStream, Context context) throws IOException {
LambdaLogger logger = context.getLogger();
byte[] bytes = IOUtils.toByteArray(inputStream);
StringBuilder sb = new StringBuilder();
StringBuilder sb1 = new StringBuilder();
sb.append("[ ");
sb1.append("[ ");
for (byte b : bytes) {
sb.append(b);
char ch = (char) b;
sb1.append(ch);
}
sb.append("]");
sb1.append("] ");
logger.log(sb.toString());
logger.log(sb1.toString());
}
}
The decimal representation of each byte is printed correctly, as shown below:
[ 4950323232323232323249513232323232323232324952]
However, when the bytes are converted to characters, only one space (decimal value 32) appears between the values; the remaining spaces in between are skipped.
[ 12 13 14]
Can anyone suggest the reason for this?
How do you convert the bytes to a string? The result will be the same either way; see the code below:
public static void main(String[] args) {
byte[] bytes = "12 13 14".getBytes();
System.out.println(Arrays.toString(bytes));
String str = new String(bytes,StandardCharsets.UTF_8);
System.out.println(str);
}
Your example shows that you're using AWS, for which you will often check the results and the produced logs online, with a tool that supports HTML.
And in HTML, when you write several consecutive spaces, they are displayed as only one.
Your String object, within Java, does contain the 8 spaces. But when you give it to a logger to be eventually displayed in a webpage, the spaces are collapsed and displayed as only one.
This is easy to prove: just add the following code at the end of your method:
String s = sb1.toString();
logger.log("s length: " + s.length());
for(int i = 0; i < s.length(); i++) {
logger.log("s[" + i + "]: " + s.charAt(i));
}
It demonstrates the length and exact content of the String. If you're not seeing that exact content when displaying the String, it is the fault of the tool that displays it.

Convert string from file to ASCII and binary

Say I open a text file like this:
public static void main(String[] args) throws IOException {
String file_name = "file.txt";
try {
ReadFile file = new ReadFile(file_name);
String[] Lines = file.openFile();
for (int i = 0; i < Lines.length; i++) {
System.out.println(Lines[i]);
}
} catch (IOException e) {
System.out.println(e.getMessage());
}
}
Now I want to convert the result to binary (for further conversion into AMI coding). I suppose I should first turn it into ASCII (though I'm not 100% certain that's necessary), but I'm not sure whether I should convert it to chars first, or whether there is an easier way.
Please, mind that I'm just a beginner.
Do you happen to know for sure that the files will be ASCII encoded? Assuming they are, you can just use the getBytes() method of String:
byte[] lineDefault = line.getBytes();
There is also an overload of getBytes() that takes a charset name, in case you don't want to use the platform default encoding. I often use:
byte[] lineUtf8 = line.getBytes("UTF-8");
which gives byte sequences which are equivalent to ASCII for characters whose hex values are less than 0x80.
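To get from those bytes to the binary form the question asks about, here is a minimal sketch, assuming "binary" means each byte written as eight 0/1 digits. BinaryDump and toBinary are hypothetical names chosen for illustration; the AMI coding step itself is not covered.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: renders each byte of a line as an 8-digit binary string.
final class BinaryDump {

    static String toBinary(String line) {
        StringBuilder sb = new StringBuilder();
        for (byte b : line.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so negative bytes are treated as unsigned values.
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad to eight digits.
            sb.append("00000000".substring(bits.length())).append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary("Hi")); // 01001000 01101001
    }
}
```

Passing StandardCharsets.UTF_8 instead of the name "UTF-8" avoids the checked UnsupportedEncodingException; for pure ASCII input the bytes are the same either way.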

byte array to Hindi Unicode Value

Hi, I have a small function that prints a byte array as Hindi text, which is stored as Unicode. My function looks like this:
public static void byteArrayToPrintableHindi(byte[] iData) {
String value = "";
String unicode = "\\u";
StringBuilder sb = new StringBuilder();
for (int i = 0; i < iData.length; i++) {
if (i % 2 == 0) {
value = value.concat(unicode.concat(String.format("%02X", iData[i])));
sb.append(String.format("%02X", iData[i]));
} else {
value = value.concat(String.format("%02X", iData[i]));
}
}
System.out.println("value = "+value);
System.out.println("\u091A\u0941\u0921\u093C\u093E\u092E\u0923\u093F");
}
and the output is
value = \u091A\u0941\u0921\u093C\u093E\u092E\u0923\u093F
चुड़ामणि
I am expecting the value to print
चुड़ामणि
I don't know why it is not printing the desired output.
You're misunderstanding how \uXXXX escape codes work. When the Java compiler reads your source code, it interprets those escape codes and translates them to Unicode characters. You cannot at runtime build a string that consists of \uXXXX codes and expect Java to automatically translate that into Unicode characters - that's not how it works. It only works with literal \uXXXX codes in your source code.
You can simply do this:
public static void byteArrayToPrintableHindi(byte[] iData) throws UnsupportedEncodingException {
String value = new String(iData, "UTF-16");
System.out.println("value = "+value);
System.out.println("\u091A\u0941\u0921\u093C\u093E\u092E\u0923\u093F");
}
assuming that the data is UTF-16-encoded.
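As a small check of that assumption, here is a sketch that decodes two big-endian byte pairs taken from the expected output above (09 1A and 09 41). It uses StandardCharsets.UTF_16BE to make the byte order explicit; Java's plain "UTF-16" charset also assumes big-endian when the data carries no byte order mark. Utf16Demo is a hypothetical class name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: decodes raw UTF-16 (big-endian) bytes into a String.
final class Utf16Demo {

    static String decode(byte[] data) {
        // No checked exception here, unlike new String(data, "UTF-16").
        return new String(data, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] data = {0x09, 0x1A, 0x09, 0x41};
        String value = decode(data);
        System.out.println(value.length());             // 2
        System.out.println(value.equals("\u091A\u0941")); // true
    }
}
```

If the bytes turn out to be little-endian, StandardCharsets.UTF_16LE would be the one to try instead.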

how to fix wrong encoding in translation application?

I implemented a translation application and ran into a problem: incorrect output.
For example:
Input:
"Three predominant stories interweave: a dynastic war among several
families for control of Westeros; the rising threat of the dormant
cold supernatural Others dwelling beyond an immense wall of ice on
Westeros' northern border; and the am"
Output:
"%0D%0A%0D%0AThe+история+ -
+A+Песня+из+Лед+и+Fire+принимает++++вымышленный+континентах+Вестероса+и+Essos%2C+with+a+история++тысяч++лет.++Точка+++++главе+в+в+история+
- +a+ограниченной+перспектива+++ассортимент++символы+,+растет+from+девяти+в+в+первое++тридцать
один+++пятый+of+the+романов.+Три+преобладающим+рассказы+переплетаются%3A+a+династические+war+среди+несколько+семей+for+control++Вестероса%3B++рост+угрозу+of+the+спящие+cold+сверхъестественное+Другие+жилье+за+an+огромный+wall++лед+on+Вестероса%27+сев.
границы%3B+и++am"
I know that URLEncoder is the reason for the wrong output (all these "+" and "%"), but I don't know how to fix it.
Here is some code:
// This method should take an original text that should be
// translated and encode it to use as URL parameter.
private String encodeText(String text) throws IOException {
return URLEncoder.encode(text, "UTF-8");
}
// It should “extract” the translated text from the Yandex Translator response.
// More details about response format you can find at
// http://api.yandex.ru/translate/doc/dg/reference/translate.xml,
// we need to use XML interface.
private String parseContent(String content)
throws UnsupportedEncodingException {
String begin = "<text>";
String end = "</text>";
String result = "";
int i, j;
i = content.indexOf(begin);
j = content.indexOf(end);
if ((i != -1) && (j != -1)) {
result = content.substring((i + begin.length()), j);
}
return new String(result.getBytes(), "UTF-8");
}
// method translate() should return translation of original text.
// urlSourceProvider loads translated text
public String translate(String original) throws IOException {
return parseContent(urlSourceProvider
.load(prepareURL(encodeText(original))));
}
Try:
String result = URLDecoder.decode(variable, "UTF-8");
it should decode your text.
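A minimal sketch of the round trip, to show what the decoding step undoes. It assumes the decode would be applied to the text extracted from the response (for example, the result in parseContent), which is a guess about where the encoded form survives. UrlRoundTrip and its methods are hypothetical names.

```java
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo of URLEncoder/URLDecoder as inverse operations.
final class UrlRoundTrip {

    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available, so this should not happen in practice.
            throw new UncheckedIOException(e);
        }
    }

    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "+" becomes a space, %3A a colon, %27 an apostrophe.
        System.out.println(decode("a+war%3A+%27b%27")); // a war: 'b'
        String original = "wall of ice; Westeros' border";
        System.out.println(decode(encode(original)).equals(original)); // true
    }
}
```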

How to get encoded version of string (e.g. \u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f)

How to get encoded version of string (e.g. \u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f) using Java?
EDIT:
I guess the question is not very clear... Basically what I want is this:
Given string s="blalbla" I want to get string "\uXXXX\uYYYY"
You will need to extract each code point/unit from the String and encode it yourself. The following works for all Strings even if the individual linguistic characters within the String are composed of digraphs or ligatures.
public String getUnicodeEscapes(String aString)
{
if (aString != null && aString.length() > 0)
{
int length = aString.length();
StringBuilder buffer = new StringBuilder(length);
for (int ctr = 0; ctr < length; ctr++)
{
char codeUnit = aString.charAt(ctr);
String hexString = Integer.toHexString(codeUnit);
String padAmount = "0000".substring(hexString.length());
buffer.append("\\u");
buffer.append(padAmount);
buffer.append(hexString);
}
return buffer.toString();
}
else
{
return null;
}
}
The above produces output as dictated by the Java Language Specification on Unicode escapes, i.e. it produces output of the form \uxxxx for each UTF-16 code unit. It addresses supplementary characters by producing a pair of code units represented as \uxxxx\uyyyy.
The originally posted code has been modified to produce Unicode codepoints in the format U+FFFFF:
public String getUnicodeCodepoints(String aString)
{
if (aString != null && aString.length() > 0)
{
int length = aString.length();
StringBuilder buffer = new StringBuilder(length);
for (int ctr = 0; ctr < length; ctr++)
{
char ch = aString.charAt(ctr);
if (Character.isLowSurrogate(ch))
{
continue;
}
else
{
int codePoint = aString.codePointAt(ctr);
// %04x pads to at least four hex digits and never truncates, so
// six-digit code points (U+100000 and above) do not break the
// padding step with a StringIndexOutOfBoundsException
buffer.append(String.format(" U+%04x", codePoint));
}
}
return buffer.toString();
}
else
{
return null;
}
}
The grunt work is done by the String.codePointAt() method, which returns the Unicode code point at a particular index. For a String composed of combining characters, String.length() is not the number of visible characters but the number of UTF-16 code units, which equals the number of code points as long as no supplementary characters are involved. For example, क and ् combine to form क् in Devanagari, and the above function will rightfully return U+0915 U+094d without any fuss, as String.length() returns 2 for the combined character. Strings with supplementary characters yield a single code point per character, even though each one occupies two code units: 𝒥𝒶𝓋𝒶𝓈𝒸𝓇𝒾𝓅𝓉 (the page may not display this String literal correctly, but you can copy it just fine; it should read Javascript, written using the supplementary Mathematical Alphanumeric Symbols) will return U+1d4a5 U+1d4b6 U+1d4cb U+1d4b6 U+1d4c8 U+1d4b8 U+1d4c7 U+1d4be U+1d4c5 U+1d4c9.
public static void main(String[] args) {
Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();
try {
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f"));
CharBuffer cbuf = decoder.decode(bbuf);
String s = cbuf.toString();
System.out.println(s);
} catch (CharacterCodingException e) {
e.printStackTrace();
}
}
I'm not aware of a built-in solution, so:
StringBuilder builder = new StringBuilder();
for(int i=0; i<yourString.length(); i++) {
builder.append(String.format("\\u%04x", yourString.charAt(i)));
}
String encoded = builder.toString();
Edit: sry, I thought you wanted to get the String encoded to \uXXXX expressions ...
You didn't say which encoding you are after, but based on the tag I'm assuming you want UTF-8. Here's how:
byte[] utf8 =
"\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f".getBytes("UTF-8");
You can then write a simple loop to output the bytes in utf8 in hexadecimal or decimal ... or do something else with them.
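The loop mentioned above could look something like this sketch, which prints each UTF-8 byte as two lowercase hex digits. Utf8HexDump and toHex are hypothetical names; StandardCharsets.UTF_8 is used in place of the "UTF-8" string so no checked exception needs handling.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: lists the UTF-8 bytes of a string in hexadecimal.
final class Utf8HexDump {

    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so bytes >= 0x80 are not printed as negative values.
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0421 (Cyrillic capital letter Es) takes two bytes in UTF-8.
        System.out.println(toHex("\u0421")); // d0 a1
    }
}
```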
System.out.println ("\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f");
works like a charm for me:
Служебная
