Java Regex match UTF-8 String (Without Copy)

Java Regex match UTF-8 String (Without Copy) - java

I'm loading large UTF-8 text from a SocketChannel, and need to extract some values. Pattern matching with java.util.regex is great for this, but decoding to Java's UTF-16 with CharBuffer cb = UTF_8.decode(buffer); copies this buffer, using double the space.
Is there a way to create a CharBuffer 'view' in UTF-8, or otherwise pattern match with a charset?

You can create lightweight CharSequence wrapping ByteBuffer which does simple byte to char conversion without proper UTF8 handling.
As long as your regex is Latin1 characters only, it would work event on "naively" converted string.
Only ranges matched by reg ex needs to be properly decodec from UTF8.
Below in code illustrating this approach.
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.junit.Test;
import junit.framework.Assert;
public class RegExSnippet {
private static Charset UTF8 = Charset.forName("UTF8");
#Test
public void testByteBufferRegEx() throws UnsupportedEncodingException {
// this UTF8 byte encoding of test string
byte[] bytes = ("lkfmd;wmf;qmfqv amwfqwmf;c "
+ "<tag>This is some non ASCII text 'кирилицеский текст'</tag>"
+ "kjnfdlwncdlka-lksnflanvf ").getBytes(UTF8);
ByteBuffer bb = ByteBuffer.wrap(bytes);
ByteSeqWrapper bsw = new ByteSeqWrapper(bb);
// pattern should contain only LATIN1 characters
Matcher m = Pattern.compile("<tag>(.*)</tag>").matcher(bsw);
Assert.assertTrue(m.find());
String body = m.group(1);
// extracted part is properly decoded as UTF8
Assert.assertEquals("This is some non ASCII text 'кирилицеский текст'", body);
}
public static class ByteSeqWrapper implements CharSequence {
final ByteBuffer buffer;
public ByteSeqWrapper(ByteBuffer buf) {
this.buffer = buf;
}
#Override
public int length() {
return buffer.remaining();
}
#Override
public char charAt(int index) {
return (char) (0xFF & buffer.get(index));
}
#Override
public CharSequence subSequence(int start, int end) {
ByteBuffer bb = buffer.duplicate();
bb.position(bb.position() + start);
bb.limit(bb.position() + (end - start));
return new ByteSeqWrapper(bb);
}
#Override
public String toString() {
// a little hack to apply proper encoding
// to a parts extracted by matcher
CharBuffer cb = UTF8.decode(buffer);
return cb.toString();
}
}
}

Related

Need help in converting EBCDIC to Hexadecimal

I am writing an hive UDF to convert the EBCDIC character to Hexadecimal.
Ebcdic characters are present in hive table.Currently I am able to convert it, bit it is ignoring few characters while conversion.
Example:
This is the EBCDIC value stored in table:
AGNSAÃ±AÂ¦Ã»ÃÃÂÃÂµÂjÂqÂÂÂÂ Â Ã ()
Converted hexadecimal:
c1c7d5e2000a5cd4f6ef99187d07067203a0200258dd9736009f000000800017112400000000001000084008403c000000000000000080
What I want as output:
c1c7d5e200010a5cd4f6ef99187d0706720103a0200258dd9736009f000000800017112400000000001000084008403c000000000000000080
It is ignoring to convert the below EBCDIC characters:
01 - It is start of heading
10 - It is a escape
15 - New line.
Below is the code I have tried so far:
public class EbcdicToHex extends UDF {
public String evaluate(String edata) throws UnsupportedEncodingException {
byte[] ebcdiResult = getEBCDICRawData(edata);
String hexResult = getHexData(ebcdiResult);
return hexResult;
}
public byte[] getEBCDICRawData (String edata) throws UnsupportedEncodingException {
byte[] result = null;
String ebcdic_encoding = "IBM-037";
result = edata.getBytes(ebcdic_encoding);
return result;
}
public String getHexData(byte[] result){
String output = asHex(result);
return output;
}
public static String asHex(byte[] buf) {
char[] HEX_CHARS = "0123456789abcdef".toCharArray();
char[] chars = new char[2 * buf.length];
for (int i = 0; i < buf.length; ++i) {
chars[2 * i] = HEX_CHARS[(buf[i] & 0xF0) >>> 4];
chars[2 * i + 1] = HEX_CHARS[buf[i] & 0x0F];
}
return new String(chars);
}
}
While converting, its ignoring few EBCDIC characters. How to make them also converted to hexadecimal?

I think the problem lies elsewhere, I created a small testcase where I create a String based on those 3 bytes you claim to be ignored, but in my output they do seem to be converted correctly:
private void run(String[] args) throws Exception {
byte[] bytes = new byte[] {0x01, 0x10, 0x15};
String str = new String(bytes, "IBM-037");
byte[] result = getEBCDICRawData(str);
for(byte b : result) {
System.out.print(Integer.toString(( b & 0xff ) + 0x100, 16).substring(1) + " ");
}
System.out.println();
System.out.println(evaluate(str));
}
Output:
01 10 15
011015
Based on this it seems both your getEBCDICRawData and evaluate method seem to be working correctly and makes me believe your String value may already be incorrect to start with. Could it be the String is already missing those characters? Or perhaps a long shot, but maybe the charset is incorrect? There are different EBCDIC charsets, so maybe the String is composed using a different one? Although I doubt this would make much difference for the 01, 10 and 15 bytes.
As a final remark, but probably unrelated to your problem, I usually prefer to use the encode/decode functions on the charset object to do such conversions:
String charset = "IBM-037";
Charset cs = Charset.forName(charset);
ByteBuffer bb = cs.encode(str);
CharBuffer cb = cs.decode(bb);

Create StringBuilder from byte[]

Is there a way to create a StringBuilder from a byte[]?
I want to improve memory usage using StringBuilder but what I have first is a byte[], so I have to create a String from the byte[] and then create the StringBuilder from the String and I don't see this solution as optimal.
Thanks

Basically, your best option seems to be using CharsetDecoder directly.
Here's how:
byte[] srcBytes = getYourSrcBytes();
//Whatever charset your bytes are endoded in
Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
//ByteBuffer.wrap simply wraps the byte array, it does not allocate new memory for it
ByteBuffer srcBuffer = ByteBuffer.wrap(srcBytes);
//Now, we decode our srcBuffer into a new CharBuffer (yes, new memory allocated here, no can do)
CharBuffer resBuffer = decoder.decode(srcBuffer);
//CharBuffer implements CharSequence interface, which StringBuilder fully support in it's methods
StringBuilder yourStringBuilder = new StringBuilder(resBuffer);
ADDED:
After some tests it seems that the simple new String(bytes) is much faster and it seems there is no simple way to make it faster than that. Here is the test I ran:
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.text.ParseException;
public class ConsoleMain {
public static void main(String[] args) throws IOException, ParseException {
StringBuilder sb1 = new StringBuilder("abcdefghijklmnopqrstuvwxyz");
for (int i=0;i<19;i++) {
sb1.append(sb1);
}
System.out.println("Size of buffer: "+sb1.length());
byte[] src = sb1.toString().getBytes("UTF-8");
StringBuilder res;
long startTime = System.currentTimeMillis();
res = testStringConvert(src);
System.out.println("Conversion using String time (msec): "+(System.currentTimeMillis()-startTime));
if (!res.toString().equals(sb1.toString())) {
System.err.println("Conversion error");
}
startTime = System.currentTimeMillis();
res = testCBConvert(src);
System.out.println("Conversion using CharBuffer time (msec): "+(System.currentTimeMillis()-startTime));
if (!res.toString().equals(sb1.toString())) {
System.err.println("Conversion error");
}
}
private static StringBuilder testStringConvert(byte[] src) throws UnsupportedEncodingException {
String s = new String(src, "UTF-8");
StringBuilder b = new StringBuilder(s);
return b;
}
private static StringBuilder testCBConvert(byte[] src) throws CharacterCodingException {
Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
ByteBuffer srcBuffer = ByteBuffer.wrap(src);
CharBuffer resBuffer = decoder.decode(srcBuffer);
StringBuilder b = new StringBuilder(resBuffer);
return b;
}
}
Results:
Size of buffer: 13631488
Conversion using String time (msec): 91
Conversion using CharBuffer time (msec): 252
And a modified (less memory-consuming) version on IDEONE: Here.

If it is short statements you want, then there is no way around the String step in between. The String constructor mixes conversion and object construction for convenience in a very common case, but there is no such convenience constructor for a StringBuilder.
If it is performance you are interested in, then you might avoid the intermediate String object by using something like this:
new StringBuilder(Charset.forName(charsetName).decode(ByteBuffer.wrap(inBytes)))
If you want to be able to fine-tune performance, you can control the decode process yourself. For example, you might want to avoid using too much memory, by using averageCharsPerByte as an estimate of how much memory will be needed. Instead of resizing the buffer if that estimate was too short, you could use the resulting StringBuilder to accumulate all the parts.
CharsetDecoder cd = Charset.forName(charsetName).newDecoder();
cd.onMalformedInput(CodingErrorAction.REPLACE);
cd.onUnmappableCharacter(CodingErrorAction.REPLACE);
int lengthEstimate = Math.ceil(cd.averageCharsPerByte()*inBytes.length) + 1;
ByteBuffer inBuf = ByteBuffer.wrap(inBytes);
CharBuffer outBuf = CharBuffer.allocate(lengthEstimate);
StringBuilder out = new StringBuilder(lengthEstimate);
CoderResult cr;
while (true) {
cr = cd.decode(inBuf, outBuf, true);
out.append(outBuf);
outBuf.clear();
if (cr.isUnderflow()) break;
if (!cr.isOverflow()) cr.throwException();
}
cr = cd.flush(outBuf);
if (!cr.isUnderflow()) cr.throwException();
out.append(outBuf);
I doubt that the above code will be worth the effort in most applications, though. If an application is that interested in performance, it probably shouldn't be dealing with StringBuilder either, but handle everything at the buffer level.

How to parse UTF-8 representation to String in Java?

Given the following code:
String tmp = new String("\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a");
String result = convertToEffectiveString(tmp); // result contain now "hello\n"
Does the JDK already provide some classes for doing this ?
Is there a libray that does this ? (preferably under maven)
I have tried with ByteArrayOutputStream with no success.

This works, but only with ASCII. If you use unicode characters outside of the ASCCI range, then you will have problems (as each character is being stuffed into a byte, instead of a full word that is allowed by UTF-8). You can do the typecast below because you know that the UTF-8 will not overflow one byte if you guaranteed that the input is basically ASCII (as you mention in your comments).
package sample;
import java.io.UnsupportedEncodingException;
public class UnicodeSample {
public static final int HEXADECIMAL = 16;
public static void main(String[] args) {
try {
String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
String arr[] = str.replaceAll("\\\\u"," ").trim().split(" ");
byte[] utf8 = new byte[arr.length];
int index=0;
for (String ch : arr) {
utf8[index++] = (byte)Integer.parseInt(ch,HEXADECIMAL);
}
String newStr = new String(utf8, "UTF-8");
System.out.println(newStr);
}
catch (UnsupportedEncodingException e) {
// handle the UTF-8 conversion exception
}
}
}
Here is another solution that fixes the issue of only working with ASCII characters. This will work with any unicode characters in the UTF-8 range instead of ASCII only in the first 8-bits of the range. Thanks to deceze for the questions. You made me think more about the problem and solution.
package sample;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
public class UnicodeSample {
public static final int HEXADECIMAL = 16;
public static void main(String[] args) {
try {
String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a\\u3fff\\uf34c";
ArrayList<Byte> arrList = new ArrayList<Byte>();
String codes[] = str.replaceAll("\\\\u"," ").trim().split(" ");
for (String c : codes) {
int code = Integer.parseInt(c,HEXADECIMAL);
byte[] bytes = intToByteArray(code);
for (byte b : bytes) {
if (b != 0) arrList.add(b);
}
}
byte[] utf8 = new byte[arrList.size()];
for (int i=0; i<arrList.size(); i++) utf8[i] = arrList.get(i);
str = new String(utf8, "UTF-8");
System.out.println(str);
}
catch (UnsupportedEncodingException e) {
// handle the exception when
}
}
// Takes a 4 byte integer and and extracts each byte
public static final byte[] intToByteArray(int value) {
return new byte[] {
(byte) (value >>> 24),
(byte) (value >>> 16),
(byte) (value >>> 8),
(byte) (value)
};
}
}

Firstly, are you just trying to parse a string literal, or is tmp going to be some user-entered data?
If this is going to be a string literal (i.e. hard-coded string), it can be encoded using Unicode escapes. In your case, this just means using single backslashes instead of double backslashes:
String result = "\u0068\u0065\u006c\u006c\u006f\u000a";
If, however, you need to use Java's string parsing rules to parse user input, a good starting point might be Apache Commons Lang's StringEscapeUtils.unescapeJava() method.

I'm sure there must be a better way, but using just the JDK:
public static String handleEscapes(final String s)
{
final java.util.Properties props = new java.util.Properties();
props.setProperty("foo", s);
final java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
try
{
props.store(baos, null);
final String tmp = baos.toString().replace("\\\\", "\\");
props.load(new java.io.StringReader(tmp));
}
catch(final java.io.IOException ioe) // shouldn't happen
{ throw new RuntimeException(ioe); }
return props.getProperty("foo");
}
uses java.util.Properties.load(java.io.Reader) to process the backslash-escapes (after first using java.util.Properties.store(java.io.OutputStream, java.lang.String) to backslash-escape anything that would cause problems in a properties-file, and then using replace("\\\\", "\\") to reverse the backslash-escaping of the original backslashes).
(Disclaimer: even though I tested all the cases I could think of, there are still probably some that I didn't think of.)

Read a file line by line in reverse order

I have a java ee application where I use a servlet to print a log file created with log4j. When reading log files you are usually looking for the last log line and therefore the servlet would be much more useful if it printed the log file in reverse order. My actual code is:
response.setContentType("text");
PrintWriter out = response.getWriter();
try {
FileReader logReader = new FileReader("logfile.log");
try {
BufferedReader buffer = new BufferedReader(logReader);
for (String line = buffer.readLine(); line != null; line = buffer.readLine()) {
out.println(line);
}
} finally {
logReader.close();
}
} finally {
out.close();
}
The implementations I've found in the internet involve using a StringBuffer and loading all the file before printing, isn't there a code light way of seeking to the end of the file and reading the content till the start of the file?

[EDIT]
By request, I am prepending this answer with the sentiment of a later comment: If you need this behavior frequently, a "more appropriate" solution is probably to move your logs from text files to database tables with DBAppender (part of log4j 2). Then you could simply query for latest entries.
[/EDIT]
I would probably approach this slightly differently than the answers listed.
(1) Create a subclass of Writer that writes the encoded bytes of each character in reverse order:
public class ReverseOutputStreamWriter extends Writer {
private OutputStream out;
private Charset encoding;
public ReverseOutputStreamWriter(OutputStream out, Charset encoding) {
this.out = out;
this.encoding = encoding;
}
public void write(int ch) throws IOException {
byte[] buffer = this.encoding.encode(String.valueOf(ch)).array();
// write the bytes in reverse order to this.out
}
// other overloaded methods
}
(2) Create a subclass of log4j WriterAppender whose createWriter method would be overridden to create an instance of ReverseOutputStreamWriter.
(3) Create a subclass of log4j Layout whose format method returns the log string in reverse character order:
public class ReversePatternLayout extends PatternLayout {
// constructors
public String format(LoggingEvent event) {
return new StringBuilder(super.format(event)).reverse().toString();
}
}
(4) Modify my logging configuration file to send log messages to both the "normal" log file and a "reverse" log file. The "reverse" log file would contain the same log messages as the "normal" log file, but each message would be written backwards. (Note that the encoding of the "reverse" log file would not necessarily conform to UTF-8, or even any character encoding.)
(5) Create a subclass of InputStream that wraps an instance of RandomAccessFile in order to read the bytes of a file in reverse order:
public class ReverseFileInputStream extends InputStream {
private RandomAccessFile in;
private byte[] buffer;
// The index of the next byte to read.
private int bufferIndex;
public ReverseFileInputStream(File file) {
this.in = new RandomAccessFile(File, "r");
this.buffer = new byte[4096];
this.bufferIndex = this.buffer.length;
this.in.seek(file.length());
}
public void populateBuffer() throws IOException {
// record the old position
// seek to a new, previous position
// read from the new position to the old position into the buffer
// reverse the buffer
}
public int read() throws IOException {
if (this.bufferIndex == this.buffer.length) {
populateBuffer();
if (this.bufferIndex == this.buffer.length) {
return -1;
}
}
return this.buffer[this.bufferIndex++];
}
// other overridden methods
}
Now if I want to read the entries of the "normal" log file in reverse order, I just need to create an instance of ReverseFileInputStream, giving it the "revere" log file.

This is a old question. I also wanted to do the same thing and after some searching found there is a class in apache commons-io to achieve this:
org.apache.commons.io.input.ReversedLinesFileReader

I think a good choice for this would be using RandomFileAccess class. There is some sample code for back-reading using this class on this page. Reading bytes this way is easy, however reading strings might be a bit more challenging.

If you are in a hurry and want the simplest solution without worrying too much about performance, I would give a try to use an external process to do the dirty job (given that you are running your app in a Un*x server, as any decent person would do XD)
new BufferedReader(new InputStreamReader(Runtime.getRuntime().exec("tail yourlogfile.txt -n 50 | rev").getProcess().getInputStream()))

A simpler alternative, because you say that you're creating a servlet to do this, is to use a LinkedList to hold the last N lines (where N might be a servlet parameter). When the list size exceeds N, you call removeFirst().
From a user experience perspective, this is probably the best solution. As you note, the most recent lines are the most important. Not being overwhelmed with information is also very important.

Good question. I'm not aware of any common implementations of this. It's not trivial to do properly either, so be careful what you choose. It should deal with character set encoding and detection of different line break methods. Here's the implementation I have so far that works with ASCII and UTF-8 encoded files, including a test case for UTF-8. It does not work with UTF-16LE or UTF-16BE encoded files.
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import junit.framework.TestCase;
public class ReverseLineReader {
private static final int BUFFER_SIZE = 8192;
private final FileChannel channel;
private final String encoding;
private long filePos;
private ByteBuffer buf;
private int bufPos;
private byte lastLineBreak = '\n';
private ByteArrayOutputStream baos = new ByteArrayOutputStream();
public ReverseLineReader(File file, String encoding) throws IOException {
RandomAccessFile raf = new RandomAccessFile(file, "r");
channel = raf.getChannel();
filePos = raf.length();
this.encoding = encoding;
}
public String readLine() throws IOException {
while (true) {
if (bufPos < 0) {
if (filePos == 0) {
if (baos == null) {
return null;
}
String line = bufToString();
baos = null;
return line;
}
long start = Math.max(filePos - BUFFER_SIZE, 0);
long end = filePos;
long len = end - start;
buf = channel.map(FileChannel.MapMode.READ_ONLY, start, len);
bufPos = (int) len;
filePos = start;
}
while (bufPos-- > 0) {
byte c = buf.get(bufPos);
if (c == '\r' || c == '\n') {
if (c != lastLineBreak) {
lastLineBreak = c;
continue;
}
lastLineBreak = c;
return bufToString();
}
baos.write(c);
}
}
}
private String bufToString() throws UnsupportedEncodingException {
if (baos.size() == 0) {
return "";
}
byte[] bytes = baos.toByteArray();
for (int i = 0; i < bytes.length / 2; i++) {
byte t = bytes[i];
bytes[i] = bytes[bytes.length - i - 1];
bytes[bytes.length - i - 1] = t;
}
baos.reset();
return new String(bytes, encoding);
}
public static void main(String[] args) throws IOException {
File file = new File("my.log");
ReverseLineReader reader = new ReverseLineReader(file, "UTF-8");
String line;
while ((line = reader.readLine()) != null) {
System.out.println(line);
}
}
public static class ReverseLineReaderTest extends TestCase {
public void test() throws IOException {
File file = new File("utf8test.log");
String encoding = "UTF-8";
FileInputStream fileIn = new FileInputStream(file);
Reader fileReader = new InputStreamReader(fileIn, encoding);
BufferedReader bufReader = new BufferedReader(fileReader);
List<String> lines = new ArrayList<String>();
String line;
while ((line = bufReader.readLine()) != null) {
lines.add(line);
}
Collections.reverse(lines);
ReverseLineReader reader = new ReverseLineReader(file, encoding);
int pos = 0;
while ((line = reader.readLine()) != null) {
assertEquals(lines.get(pos++), line);
}
assertEquals(lines.size(), pos);
}
}
}

you can use RandomAccessFile implements this function,such as:
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import com.google.common.io.LineProcessor;
public class FileUtils {
/**
* 反向读取文本文件（UTF8）,文本文件分行是通过\r\n
*
* #param <T>
* #param file
* #param step 反向寻找的步长
* #param lineprocessor
* #throws IOException
*/
public static <T> T backWardsRead(File file, int step,
LineProcessor<T> lineprocessor) throws IOException {
RandomAccessFile rf = new RandomAccessFile(file, "r");
long fileLen = rf.length();
long pos = fileLen - step;
// 寻找倒序的第一行:\r
while (true) {
if (pos < 0) {
// 处理第一行
rf.seek(0);
lineprocessor.processLine(rf.readLine());
return lineprocessor.getResult();
}
rf.seek(pos);
char c = (char) rf.readByte();
while (c != '\r') {
c = (char) rf.readByte();
}
rf.readByte();//read '\n'
pos = rf.getFilePointer();
if (!lineprocessor.processLine(rf.readLine())) {
return lineprocessor.getResult();
}
pos -= step;
}
}
use:
FileUtils.backWardsRead(new File("H:/usersfavs.csv"), 40,
new LineProcessor<Void>() {
//TODO implements method
.......
});

The simplest solution is to read through the file in forward order, using an ArrayList<Long> to hold the byte offset of each log record. You'll need to use something like Jakarta Commons CountingInputStream to retrieve the position of each record, and will need to carefully organize your buffers to ensure that it returns the proper values:
FileInputStream fis = // .. logfile
BufferedInputStream bis = new BufferedInputStream(fis);
CountingInputStream cis = new CountingInputSteam(bis);
InputStreamReader isr = new InputStreamReader(cis, "UTF-8");
And you probably won't be able to use a BufferedReader, because it will attempt to read-ahead and throw off the count (but reading a character at a time won't be a performance problem, because you're buffering lower in the stack).
To write the file, you iterate the list backwards and use a RandomAccessFile. There is a bit of a trick: to properly decode the bytes (assuming a multi-byte encoding), you will need to read the bytes corresponding to an entry, and then apply a decoding to it. The list, however, will give you the start and end position of the bytes.
One big benefit to this approach, versus simply printing the lines in reverse order, is that you won't damage multi-line log messages (such as exceptions).

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
/**
* Inside of C:\\temp\\vaquar.txt we have following content
* vaquar khan is working into Citi He is good good programmer programmer trust me
* #author vaquar.khan#gmail.com
*
*/
public class ReadFileAndDisplayResultsinReverse {
public static void main(String[] args) {
try {
// read data from file
Object[] wordList = ReadFile();
System.out.println("File data=" + wordList);
//
Set<String> uniquWordList = null;
for (Object text : wordList) {
System.out.println((String) text);
List<String> tokens = Arrays.asList(text.toString().split("\\s+"));
System.out.println("tokens" + tokens);
uniquWordList = new HashSet<String>(tokens);
// If multiple line then code into same loop
}
System.out.println("uniquWordList" + uniquWordList);
Comparator<String> wordComp= new Comparator<String>() {
#Override
public int compare(String o1, String o2) {
if(o1==null && o2 ==null) return 0;
if(o1==null ) return o2.length()-0;
if(o2 ==null) return o1.length()-0;
//
return o2.length()-o1.length();
}
};
List<String> fs=new ArrayList<String>(uniquWordList);
Collections.sort(fs,wordComp);
System.out.println("uniquWordList" + fs);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
static Object[] ReadFile() throws IOException {
List<String> list = Files.readAllLines(new File("C:\\temp\\vaquar.txt").toPath(), Charset.defaultCharset());
return list.toArray();
}
}
Output:
[Vaquar khan is working into Citi He is good good programmer programmer trust me
tokens[vaquar, khan, is, working, into, Citi, He, is, good, good, programmer, programmer, trust, me]
uniquWordList[trust, vaquar, programmer, is, good, into, khan, me, working, Citi, He]
uniquWordList[programmer, working, vaquar, trust, good, into, khan, Citi, is, me, He]
If you want to Sort A to Z then write one more comparater

Concise solution using Java 7 Autoclosables and Java 8 Streams :
try (Stream<String> logStream = Files.lines(Paths.get("C:\\logfile.log"))) {
logStream
.sorted(Comparator.reverseOrder())
.limit(10) // last 10 lines
.forEach(System.out::println);
}
Big drawback: only works when lines are strictly in natural order, like log files prefixed with timestamps but without exceptions

Decode base64Url in Java

https://web.archive.org/web/20110422225659/https://en.wikipedia.org/wiki/Base64#URL_applications
talks about base64Url - Decode
a modified Base64 for URL variant exists, where no padding '=' will be used, and the '+' and '/' characters of standard Base64 are respectively replaced by '-' and '_'
I created the following function:
public static String base64UrlDecode(String input) {
String result = null;
BASE64Decoder decoder = new BASE64Decoder();
try {
result = decoder.decodeBuffer(input.replace('-','+').replace('/','_')).toString();
}
catch (IOException e) {
System.out.println(e.getMessage());
}
return result;
}
it returns a very small set of characters that don't even resemble to the expected results.
any ideas?

Java8+
import java.util.Base64;
return Base64.getUrlEncoder().encodeToString(bytes);

Base64 encoding is part of the JDK since Java 8. URL safe encoding is also supported with java.util.Base64.getUrlEncoder(), and the "=" padding can be skipped by additionally using the java.util.Base64.Encoder.withoutPadding() method:
import java.nio.charset.StandardCharsets;
import java.util.Base64;
public String encode(String raw) {
return Base64.getUrlEncoder()
.withoutPadding()
.encodeToString(raw.getBytes(StandardCharsets.UTF_8));
}

With the usage of Base64 from Apache Commons, who can be configured to URL safe, I created the following function:
import org.apache.commons.codec.binary.Base64;
public static String base64UrlDecode(String input) {
String result = null;
Base64 decoder = new Base64(true);
byte[] decodedBytes = decoder.decode(input);
result = new String(decodedBytes);
return result;
}
The constructor Base64(true) makes the decoding URL-safe.

In the Android SDK, there's a dedicated flag in the Base64 class: Base64.URL_SAFE, use it like so to decode to a String:
import android.util.Base64;
byte[] byteData = Base64.decode(body, Base64.URL_SAFE);
str = new String(byteData, "UTF-8");

Guava now has Base64 decoding built in.
https://google.github.io/guava/releases/17.0/api/docs/com/google/common/io/BaseEncoding.html

public static byte[] encodeUrlSafe(byte[] data) {
byte[] encode = Base64.encode(data);
for (int i = 0; i < encode.length; i++) {
if (encode[i] == '+') {
encode[i] = '-';
} else if (encode[i] == '/') {
encode[i] = '_';
}
}
return encode;
}
public static byte[] decodeUrlSafe(byte[] data) {
byte[] encode = Arrays.copyOf(data, data.length);
for (int i = 0; i < encode.length; i++) {
if (encode[i] == '-') {
encode[i] = '+';
} else if (encode[i] == '_') {
encode[i] = '/';
}
}
return Base64.decode(encode);
}

Right off the bat, it looks like your replace() is backwards; that method replaces the occurrences of the first character with the second, not the other way around.

#ufk's answer works, but you don't actually need to set the urlSafe flag when you're just decoding.
urlSafe is only applied to encode operations. Decoding seamlessly
handles both modes.
Also, there are some static helpers to make it shorter and more explicit:
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.binary.StringUtils;
public static String base64UrlDecode(String input) {
StringUtils.newStringUtf8(Base64.decodeBase64(input));
}
Docs
newStringUtf8()
decodeBase64()

This class can help:
import android.util.Base64;
public class Encryptor {
public static String encode(String input) {
return Base64.encodeToString(input.getBytes(), Base64.URL_SAFE);
}
public static String decode(String encoded) {
return new String(Base64.decode(encoded.getBytes(), Base64.URL_SAFE));
}
}

I know the answer is already there, but still, if someone wants...
import java.util.Base64; public
class Base64BasicEncryptionExample {
publicstaticvoid main(String[] args) {
// Getting encoder
Base64.Encoder encoder = Base64.getUrlEncoder();
// Encoding URL
String eStr = encoder.encodeToString
("http://www.javatpoint.com/javatutorial/".getBytes());
System.out.println("Encoded URL: "+eStr);
// Getting decoder
Base64.Decoder decoder = Base64.getUrlDecoder();
// Decoding URl
String dStr = new String(decoder.decode(eStr));
System.out.println("Decoded URL: "+dStr);
}
}
Took help from: https://www.javatpoint.com/java-base64-encode-decode

In Java try the method Base64.encodeBase64URLSafeString() from Commons Codec library for encoding.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java Regex match UTF-8 String (Without Copy) - java

Related

Need help in converting EBCDIC to Hexadecimal

Create StringBuilder from byte[]

How to parse UTF-8 representation to String in Java?

Read a file line by line in reverse order

Decode base64Url in Java

Categories

Resources