Cast arbitrary escaped character to int - java

I have a method that, at the end, takes a character array (with one element), and returns the cast of that character:
char[] first = {'a'};
return (int)first[0];
However, sometimes I have character arrays with two elements, where the first is always a "\" (i.e. it is a character array that "contains" an escaped character):
char[] second = {'\\', 'n'};
I would like to return (int)'\n', but I do not know how to convert that array into a single escaped character. I am okay checking whether or not the array is of length 1 or 2, but I really don't want to have a long switch or if/else block to go through every possible escaped character.

How about making a HashMap from the character after the backslash to the code of the escaped character? Like:
Map<Character, Integer> escapeMap = new HashMap<>();
escapeMap.put('n', 10);
Then do something like:
if (second[0] == '\\') {
return escapeMap.get(second[1]);
}
else
{
return (int)second[0];
}
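A slightly fuller sketch of the same map idea, in case it helps. The class name EscapeLookup, the method name toCode, and the exact set of escapes are my own assumptions, not something from the question; it also rejects unknown escapes instead of returning null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the names and the supported escapes are assumptions.
final class EscapeLookup {
    // Maps the character after the backslash to the character it stands for.
    private static final Map<Character, Character> ESCAPES = new HashMap<>();
    static {
        ESCAPES.put('n', '\n');
        ESCAPES.put('t', '\t');
        ESCAPES.put('r', '\r');
        ESCAPES.put('b', '\b');
        ESCAPES.put('f', '\f');
        ESCAPES.put('\\', '\\');
        ESCAPES.put('\'', '\'');
        ESCAPES.put('"', '"');
    }

    /** Returns the code of a single char, or of the escape a two-char array spells out. */
    static int toCode(char[] chars) {
        if (chars.length == 1) {
            return chars[0];
        }
        if (chars.length == 2 && chars[0] == '\\') {
            Character unescaped = ESCAPES.get(chars[1]);
            if (unescaped != null) {
                return unescaped.charValue();
            }
        }
        throw new IllegalArgumentException("Unsupported input: " + new String(chars));
    }

    public static void main(String[] args) {
        System.out.println(toCode(new char[] {'a'}));       // 97
        System.out.println(toCode(new char[] {'\\', 'n'})); // 10
    }
}
```

Mapping to the character rather than to a hard-coded number keeps the table readable, since the compiler works out the codes.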

You could use a map to store the mappings from escape sequences to their corresponding characters. If you assume that the escape sequence will always be just one character with a code below 128, you could simplify the mappings to something like this:
char[] escaped = {..., '\n', ...'\t', ...}
where the character '\n' is at the (int)'n'-th position of the array.
Then you would find the escaped character with escaped[(int)second[1]]. You just need to check the array bounds in case an invalid escape sequence is found.
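One way the array lookup could be written out, as a rough sketch. EscapeTable and unescape are names I made up, and using 0 as the "no escape defined" marker is an assumption that only works as long as you never need an escape that maps to '\0'.

```java
// Hypothetical lookup table; the names and the supported escapes are assumptions.
final class EscapeTable {
    // Indexed by the character after the backslash; 0 means "no escape defined".
    private static final char[] ESCAPED = new char[128];
    static {
        ESCAPED['n'] = '\n';
        ESCAPED['t'] = '\t';
        ESCAPED['r'] = '\r';
        ESCAPED['b'] = '\b';
        ESCAPED['f'] = '\f';
        ESCAPED['\\'] = '\\';
        ESCAPED['\''] = '\'';
        ESCAPED['"'] = '"';
    }

    /** Returns the code of the escape spelled by a backslash plus the given character. */
    static int unescape(char afterBackslash) {
        // Bounds check first, then reject slots that were never filled in.
        if (afterBackslash >= ESCAPED.length || ESCAPED[afterBackslash] == 0) {
            throw new IllegalArgumentException("Unknown escape: \\" + afterBackslash);
        }
        return ESCAPED[afterBackslash];
    }

    public static void main(String[] args) {
        System.out.println(unescape('n')); // 10
    }
}
```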

Here is an ugly hack. It works for me, but it appears to be unreliable, buggy and slow, so don't use it in critical parts of your code.
char[] second = {'\\','n'};
String s = new String(second);
//write the String to an OutputStream
ByteArrayOutputStream baos = new ByteArrayOutputStream();
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(baos));
writer.write("key=" + s);
writer.close();
//load the String using Properties
ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
Properties prop = new Properties();
prop.load(bais);
baos.close();
bais.close();
//now get the character
char c = prop.getProperty("key").charAt(0);
System.out.println((int)c);
output: 10 (which is the output of System.out.println((int)'\n');)

When you print (int)'a', the decimal value of that character from the Unicode table is printed. In the case of a, that is 97. http://unicode-table.com/en/
\n is identified as the (Uni)code LF, which means Line Feed or New Line. In the Unicode table that is decimal 10, which indeed gets printed if you write: System.out.println((int)'\n');
The same goes for the escape sequences \b, \f, \r, \t, \', \" and \\, which have special meaning for the compiler and their own character codes, like LF for \n. Look them up if you want to know the details.
In that light the simplest solution would be:
char[] second = {'\\', 'n'};
if (second.length > 1) {
System.out.println((int)'\n');
} else {
System.out.println((int)second[0]);
}
and that's only if \n is the only escape sequence you encounter.

Related

Java : Skip Unicode characters while reading a file

I am reading a text file using the below code,
try (BufferedReader br = new BufferedReader(new FileReader(<file.txt>))) {
for (String line; (line = br.readLine()) != null;) {
//I want to skip a line with unicode character and continue next line
if(line.toLowerCase().startsWith("\\u")){
continue;
//This is not working because i get the character itself and not the text
}
}
}
The text file:
How to skip all the unicode characters while reading a file ?
You can skip all lines that contain non-ASCII characters:
if(!Charset.forName("US-ASCII").newEncoder().canEncode(line)){
continue;
}
All characters in a String are Unicode. A String is a counted sequence of UTF-16 code units. By "Unicode", you presumably mean characters that are not in some other, unspecified character set. For the sake of argument, let's say ASCII.
A regular expression can sometimes be the simplest expression of a pattern requirement:
if (!line.matches("\\p{ASCII}*")) continue;
That is, if the string does not consist entirely of zero or more (that's what * means) "ASCII" characters, then continue.
(String.matches looks for a match on the whole string, so the actual regular expression pattern is ^\p{ASCII}*$. )
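To make the regex check concrete, here is a small sketch that filters an in-memory list instead of a file. AsciiLineFilter and keepAsciiLines are made-up names, and treating "Unicode" as "non-ASCII" is the same assumption as above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the names are assumptions for illustration.
final class AsciiLineFilter {
    /** Returns only the lines that consist entirely of ASCII characters. */
    static List<String> keepAsciiLines(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // Skip any line containing at least one non-ASCII character.
            if (!line.matches("\\p{ASCII}*")) continue;
            kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("plain text", "caf\u00e9", "");
        System.out.println(keepAsciiLines(lines).size()); // 2
    }
}
```

An empty line counts as ASCII here because * also matches zero characters.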
Something like this might get you going:
for (char c : line.toCharArray()) {
if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN) {
// do something with this character
}
}
You could use that as a starting point to either discard each non-basic character, or discard the entire line if it contains a single non-basic character.

Unescaping special characters using Java

I have been given the following value (escaped using Windows-1252):
ABC &#145 ; &#146 ; &#147 ; &#148 ; &#226 ;, &#234 ;, &#238 ;, &#244 ;, &#251 ;
(I had to add a space to display the exact value; in the actual value there is no space between the number and the ;)
The actual value, which I want to get back, is:
ABC ‘ ’ “ ” â, ê, î, ô, û
I have tried HtmlUtils.htmlUnescape(decodedString); but it did not work.
I am getting output like
ABC â, ê, î, ô, û
‘ ’ “ ” are removed.
Can you please show how to do this in Java?
The quote characters are probably still in the string, they are just invisible when displayed. That's because in Unicode or ISO 8859-1, the code point 145 is not assigned to a visible character.
The best solution (if possible) is to pass the encoding to the unescapeHtml method.
An alternative is to call htmlUnescape first and then map the cp1252 codepoints to the corresponding Unicode code points, using the following code:
String unescapeHtmlCp1252(String input) {
String nohtml = HtmlUtils.htmlUnescape(input);
byte[] bytes = nohtml.getBytes(StandardCharsets.ISO_8859_1);
String result = new String(bytes, Charset.forName("cp1252"));
return result;
}
When you step through this code with a debugger and inspect the nohtml string, you will probably see characters with the value 145, 146, and so on. This means that the characters are still there at this point.
Later, when the characters are converted into pixels by using a font, these characters do not have a definition and are therefore just ignored. But until this step, they are still there.
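If you want to see the remapping step on its own, without Spring's HtmlUtils, this sketch feeds in the code points 145 to 148 directly. Cp1252Remap and remap are names I invented. It assumes the windows-1252 charset is available in your JRE, and that the input contains only characters up to 255 (anything else would turn into ? in the getBytes step).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are assumptions for illustration.
final class Cp1252Remap {
    /** Reinterprets chars 0..255 as Windows-1252 bytes and decodes them again. */
    static String remap(String unescaped) {
        // ISO 8859-1 maps each char 0..255 to the byte with the same value.
        byte[] bytes = unescaped.getBytes(StandardCharsets.ISO_8859_1);
        // Decoding those bytes as cp1252 turns 145..148 into the curly quotes.
        return new String(bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String fixed = remap("\u0091\u0092\u0093\u0094\u00e2");
        for (char c : fixed.toCharArray()) {
            // Print the code points rather than the glyphs to avoid console encoding issues.
            System.out.println(Integer.toHexString(c));
        }
    }
}
```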
You can use a regular expression for that.
Pattern p = Pattern.compile("&#(\\d+);");
StringBuffer out = new StringBuffer();
String s = "ABC&#145;&#146;&#226;D";
Matcher m = p.matcher(s);
int startIdx = 0;
byte[] bytes = new byte[]{0};
while(startIdx < s.length() && m.find(startIdx)) {
if (m.start() > startIdx) {
out.append(s.substring(startIdx, m.start()));
}
// fetch the numeric value from the encoding and put it into a byte array
bytes[0] = (byte)Short.parseShort(m.group(1));
// convert the windows 1252 encoded byte array into a java string
out.append(new String(bytes,"Windows-1252"));
startIdx = m.end();
}
if (startIdx < s.length()) {
out.append(s.substring(startIdx));
}
The output / result will be something like
ABC‘’âD

Reading from InputStream until double quotation marks

Need help reading from InputStream to a list of bytes until quotation marks.
The problem is, InputStream reads bytes and I'm not sure how to stop it reading when it reaches quotation marks ... I thought about something like this:
public static List<Byte> getQuoted(InputStream in) throws IOException {
int c;
LinkedList<Byte> myList = new LinkedList<>();
try {
while ((in.read()) != "\"") { ?????
list.add(c)
.....
The while condition is a problem, of course the quotation marks are String while int is expected.
"\"" is a String. If you want just the character representation of ", use '"' instead.
Note that your code will not work as you expect if your file is not ASCII-encoded, and the behaviour will differ between character sets (it does of course depend on what you expect).
If in ASCII, each character will take up a single byte in the file and InputStream::read() reads a single byte (thus a single ASCII character) so everything will work fine.
If the file is in an encoding that uses more than one byte per character (e.g. UTF-16), each read will return less than a whole character and your code will probably not work as expected.
Reader::read() (and using Character rather than Byte) is advised, since it reads a character, not just a byte.
Also, you're missing an assignment:
while ((in.read()) != '"')
should be
while ((c = in.read()) != '"')
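Putting the pieces together, a complete version might look roughly like this. QuotedReader and readUntilQuote are names I chose; the check for -1 is my addition, since read() returns -1 at end of stream and the loop would otherwise never stop if there is no quotation mark. It still assumes a single-byte encoding such as ASCII.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the names are assumptions for illustration.
final class QuotedReader {
    /** Collects bytes until the first double quote or the end of the stream. */
    static List<Byte> readUntilQuote(InputStream in) throws IOException {
        List<Byte> bytes = new ArrayList<>();
        int c;
        // read() returns -1 at end of stream, so stop there as well.
        while ((c = in.read()) != -1 && c != '"') {
            bytes.add((byte) c);
        }
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("ab\"cd".getBytes(StandardCharsets.US_ASCII));
        System.out.println(readUntilQuote(in)); // [97, 98]
    }
}
```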

java StreamTokenizer wordChars() and nextToken()

This might be a dumb question, but I am having a hard time understanding how StreamTokenizer delimits input streams. Is it delimited by space and newline? I am also confused about the use of wordChars(). For example:
public static int getSet(String workingDirectory, String filename, List<String> set) {
int cardinality = 0;
File file = new File(workingDirectory,filename);
try {
BufferedReader in = new BufferedReader(new FileReader(file));
StreamTokenizer text = new StreamTokenizer(in);
text.wordChars('_','_');
text.nextToken();
while (text.ttype != StreamTokenizer.TT_EOF) {
set.add(text.sval);
cardinality++;
// System.out.println(cardinality + " " + text.sval);
text.nextToken();
}
in.close();
} catch (IOException ex) {
ex.printStackTrace();
}
return cardinality;
}
If the text file includes a string such as: A_B_C D_E_F.
Does text.wordChars('_','_') mean only underscore will be considered as valid words?
And what will the tokens be in this case?
Thank you very much.
how StreamTokenizer delimits input streams. Is it delimited by space and newline?
Short answer: yes.
The parsing process is controlled by a table and a number of flags that can be set to various states. The stream tokenizer can recognize identifiers, numbers, quoted strings, and various comment styles. In addition, an instance has four flags. One of the flags indicates whether line terminators are to be returned as tokens or treated as white space that merely separates tokens.
Does text.wordChars('_','_') mean only underscore will be considered as valid words?
Short answer: no. It adds underscore to the characters that are already treated as word characters.
wordChars takes two arguments: low is the lower end of the character range and high is the upper end. If low is less than 0 it is set to 0, and if high is greater than 255 it is set to 255. Since you are passing _ (95) for both, the range contains exactly one character, _ itself. That character is marked as a word character in addition to the defaults set up by the constructor (a-z, A-Z and the characters 160 to 255), so A_B_C is read as a single word token instead of being split at the underscores.
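A quick way to check what wordChars('_','_') does to input like the one in the question is to run the tokenizer on a string. This is only a sketch; UnderscoreTokens and tokenize are made-up names, and I left out the trailing period from the sample input to keep the result simple. As far as I can tell the default table already treats letters as word characters, so the call just adds the underscore to them.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the names are assumptions for illustration.
final class UnderscoreTokens {
    /** Returns the word tokens of the input, with '_' added to the word characters. */
    static List<String> tokenize(String input) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        // Adds '_' to the word characters; the default letters stay word characters.
        st.wordChars('_', '_');
        List<String> tokens = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            // sval is only set for word (and quoted string) tokens.
            if (st.ttype == StreamTokenizer.TT_WORD) {
                tokens.add(st.sval);
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize("A_B_C D_E_F")); // [A_B_C, D_E_F]
    }
}
```

Checking ttype before using sval also avoids adding null to the list for number or punctuation tokens, which the loop in the question would do.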
Please check this
Pattern splitRegex = Pattern.compile("_");
String[] tokens = splitRegex.split(stringtobesplitedbydelimeter);
or you can also use
String[] tokens = stringtobesplitedbydelimeter.split("_");

How to represent empty char in Java Character class

I want to represent an empty character in Java as "" in String...
Like that char ch = an empty character;
Actually I want to replace a character without leaving space.
I think it might be sufficient to understand what this means: no character not even space.
You may assign '\u0000' (or 0).
For this purpose, use Character.MIN_VALUE.
Character ch = Character.MIN_VALUE;
char means exactly one character. You can't assign zero characters to this type.
That means that there is no char value for which String.replace(char, char) would return a string with a different length.
As Character is a class deriving from Object, you can assign null as "instance":
Character myChar = null;
Problem solved ;)
An empty String is a wrapper on a char[] with no elements. You can have an empty char[]. But you cannot have an "empty" char. Like other primitives, a char has to have a value.
You say you want to "replace a character without leaving a space".
If you are dealing with a char[], then you would create a new char[] with that element removed.
If you are dealing with a String, then you would create a new String (String is immutable) with the character removed.
Here are some samples of how you could remove a char:
public static void main(String[] args) throws Exception {
String s = "abcdefg";
int index = s.indexOf('d');
// delete a char from a char[]
char[] array = s.toCharArray();
char[] tmp = new char[array.length-1];
System.arraycopy(array, 0, tmp, 0, index);
System.arraycopy(array, index+1, tmp, index, tmp.length-index);
System.err.println(new String(tmp));
// delete a char from a String using replace
String s1 = s.replace("d", "");
System.err.println(s1);
// delete a char from a String using StringBuilder
StringBuilder sb = new StringBuilder(s);
sb.deleteCharAt(index);
s1 = sb.toString();
System.err.println(s1);
}
As chars can be represented as integers (ASCII codes), you can simply write:
char c = 0;
The 0 in ASCII is NUL, the null character.
If you want to remove a character from a String without leaving any empty space, you can achieve this by using StringBuilder. String is an immutable object in Java; you cannot modify it.
String str = "Hello";
StringBuilder sb = new StringBuilder(str);
sb.deleteCharAt(1); // removes the 'e' character
I was looking for this. Simply set char c = 0; and it works perfectly. Try it.
For example, if you are trying to remove duplicate characters from a String, one way would be to convert the string to a char array and store the chars in a HashSet of characters, which automatically prevents duplicates.
Another way, however, is to convert the string to a char array, use two for-loops to compare each character with the rest of the char array (an O(N^2) operation), and set each duplicate found to 0.
Then use new String(char[]) to convert the resulting char array to a string and print it with System.out.println (this is all Java, by the way). You will observe that the chars set to zero do not show up in the output, so the duplicates appear to be gone. Long post, but I just wanted to give you an example.
So yes, set char c = 0; or, for a char array, set cArray[i] = 0 for that specific duplicate character and it will no longer show up.
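One caveat that is easy to verify: a char set to 0 is usually invisible when printed, but it is still in the string and still counts toward its length. The sketch below compares zeroing with actually removing the character. NulCharDemo, zeroOut and remove are names I made up for the comparison.

```java
// Hypothetical demo class; the names are assumptions for illustration.
final class NulCharDemo {
    /** Overwrites every occurrence of target with the NUL character. */
    static String zeroOut(String s, char target) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] == target) {
                chars[i] = 0;
            }
        }
        return new String(chars);
    }

    /** Actually removes every occurrence of target. */
    static String remove(String s, char target) {
        return s.replace(String.valueOf(target), "");
    }

    public static void main(String[] args) {
        System.out.println(zeroOut("hello", 'l').length()); // 5
        System.out.println(remove("hello", 'l').length());  // 3
    }
}
```

So if the result is compared with equals, written to a file, or measured with length(), the zeroed version behaves differently from a string where the characters were really removed.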
You can't. "" is the literal for a string, which contains no characters. It does not contain the "empty character" (whatever you mean by that).
In Java there is no such thing as an empty character literal; in other words, '' has no meaning, unlike "", which is the empty String literal.
The closest you can get to representing an empty character literal is a zero-length char[], something like:
char[] cArr = {}; // cArr is a zero length array
char[] cArr = new char[0]; // this does the same
If you look at the String class, its default constructor creates an empty character sequence using new char[0].
Also, using Character.MIN_VALUE is not correct, because it is not really an empty character but rather the smallest value of type char.
I also don't like Character c = null; as a solution, mainly because the JVM will throw an NPE if it tries to unbox it. Secondly, null is basically a reference to nothing for a reference type, and here we are dealing with a primitive type, which doesn't accept null as a possible value.
Assuming that in the string, say str, OP wants to replace all occurrences of a character, say 'x', with empty character '', then try using:
str.replace("x", "");
char ch = Character.MIN_VALUE;
The code above will initialize the variable ch with the minimum value that a char can have (i.e. \u0000).
this is how I do it.
char[] myEmptyCharArray = "".toCharArray();
You can do something like this:
mystring.replace(""+ch, "");
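String before = EMPTY_SPACE+TAB+"word"+TAB+EMPTY_SPACE;
Where
EMPTY_SPACE = " " (this is a String)
TAB = '\t' (this is a Character)
String after = before.replaceAll(" ", "").replace('\t', '\0');
means
after is displayed as "word" (the tabs are replaced by '\0' characters, which are still in the string but are usually not visible)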
You can only re-use an existing character, e.g. \0. If you put this in a String, you will have a String with one character in it.
Say you want a char such that when you do
String s =
char ch = ?
String s2 = s + ch; // there is no char which does this.
assert s.equals(s2);
what you have to do instead is
String s =
char ch = MY_NULL_CHAR;
String s2 = ch == MY_NULL_CHAR ? s : s + ch;
assert s.equals(s2);
Use the \b escape (the backspace character) as the second parameter
String test= "Anna Banana";
System.out.println(test); //prints Anna Banana
System.out.println(test.replaceAll(" ","\b")); //displayed as AnnaBanana, although the string still contains a \b character in place of the space
