Reading from InputStream until double quotation marks - java

Need help reading from InputStream to a list of bytes until quotation marks.
The problem is, InputStream reads bytes and I'm not sure how to stop it reading when it reaches quotation marks ... I thought about something like this:
public static List<Byte> getQuoted(InputStream in) throws IOException {
int c;
LinkedList<Byte> myList = new LinkedList<>();
try {
while ((in.read()) != "\"") { ?????
list.add(c)
.....
The while condition is a problem, of course the quotation marks are String while int is expected.

"\"" is a String. If you want just the character representation of ", use '"' instead.
Note that your code will only behave as you expect if the file is ASCII-encoded; with other encodings the result depends on the character set.
In ASCII, each character occupies a single byte, and InputStream::read() reads a single byte (and thus a single ASCII character), so everything will work fine.
In an encoding that can use more than one byte per character (e.g. UTF-16, or UTF-8 for non-ASCII characters), each read returns only part of a character and your code will probably not work as expected.
Reader::read() (with Character rather than Byte) is advisable, since it reads a character, not just a byte.
Also, you're missing an assignment:
while ((in.read()) != '"')
should be
while ((c = in.read()) != '"')
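Putting the two fixes together, a minimal sketch of the method could look like the following. The class name QuotedReader is made up for illustration, and the end-of-stream check is an addition that the question did not ask for but that the loop needs to terminate on input without a quote.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original question.
final class QuotedReader {
    // Reads bytes until the first '"' (which is consumed but not stored)
    // or until the end of the stream, whichever comes first.
    static List<Byte> getQuoted(InputStream in) throws IOException {
        List<Byte> bytes = new ArrayList<>();
        int c;
        // read() returns -1 at end of stream, so test for that as well
        while ((c = in.read()) != -1 && c != '"') {
            bytes.add((byte) c);
        }
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "abc\"def".getBytes(StandardCharsets.US_ASCII));
        System.out.println(getQuoted(in)); // prints [97, 98, 99]
    }
}
```

This only makes sense for single-byte encodings such as ASCII; for anything else a Reader is the safer choice, as noted above.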

Related

How to check if on the end of line is \n or \r or \r\n in JAVA

I need to check every character in a file and cast it to a byte. Unfortunately, Scanner gives me no way to keep the last character(s) of each line, i.e. the line terminator...
I tried something like this:
Scanner in = new Scanner(new File(path));
List<Byte> byteList = new ArrayList<>();
while (in.hasNextLine()) {
String a = in.nextLine();
if (in.hasNextLine()) {
a = a + (char) (13);
}
for (char c : a.toCharArray()) {
byteList.add((byte) c);
}
}
byte[] bytes = new byte[byteList.size()];
for (int i = 0; i < byteList.size(); i++) {
bytes[i] = byteList.get(i);
}
return bytes;
}
Do you have any idea how to solve this problem?
I'll be grateful for your help.
You cannot do this with Scanner.nextLine() or BufferedReader.readLine() because both of these APIs consume the line separators.
You could conceivably do it using Scanner.next() with a custom separator regex that causes the line separators to be included in the tokens. (Hint: using a look-behind.)
However for what you are actually doing in the code, either a FileInputStream or a FileReader would be better.
This brings me to another thing.
What is this code supposed to do?
What it actually does is to convert Unicode code units into bytes by throwing away the top bits. That might make sense if the input charset was ASCII or (maybe) LATIN-1. But for anything else, it is probably going to mangle the text.
If you are trying to read the file as (raw) bytes, simply use FileInputStream + BufferedInputStream. Then read / process the bytes directly. The line terminators won't require any special handling.
If you are trying to read the file as encoded characters in some charset and transliterate it to another one (e.g. ASCII), you should be writing to a FileWriter + BufferedWriter. Once again, line separator / terminator characters will be preserved ... and you can "normalize" them if you want to.
If you are doing something else ... well this is probably not the right way to do it. A List<Byte> is going to be inefficient and difficult to convert to something that other Java APIs can deal with directly.
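As a rough sketch of the raw-bytes route, the code below copies a stream byte by byte and then counts which kinds of line terminator it contains. The class and method names are invented for this example, and a ByteArrayInputStream stands in for the FileInputStream you would open on the real file.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
final class RawBytes {
    // Copies every byte of the stream, line terminators included.
    static byte[] readAll(InputStream source) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new BufferedInputStream(source)) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    // Returns the counts as {CRLF, lone CR, lone LF}.
    static int[] countLineEndings(byte[] data) {
        int crlf = 0, cr = 0, lf = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\r') {
                if (i + 1 < data.length && data[i + 1] == '\n') {
                    crlf++;
                    i++; // skip the LF that belongs to this CRLF
                } else {
                    cr++;
                }
            } else if (data[i] == '\n') {
                lf++;
            }
        }
        return new int[] {crlf, cr, lf};
    }

    public static void main(String[] args) throws IOException {
        // For a real file, pass new FileInputStream(path) instead.
        byte[] data = readAll(new ByteArrayInputStream(
                "a\r\nb\nc\rd".getBytes(StandardCharsets.US_ASCII)));
        int[] counts = countLineEndings(data);
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]); // prints 1 1 1
    }
}
```

Because the result is already a byte[], no List<Byte> or per-character cast is needed.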
Read the whole file, including all line endings, in as a single string:
String fileStr = in.useDelimiter("\\A").next();
The regex \A matches only at the very start of the input, so no delimiter is ever found after that point and next() returns the entire input as a single token.
If your situation requires all line endings to be corrected to a specific line ending, despite whatever the file contains, do this:
fileStr = fileStr.replaceAll("\\R", "\n");
The regex \R matches all types of line endings.
Of course this can all be done as 1 line:
String fileStr = in.useDelimiter("\\A").next().replaceAll("\\R", "\n");

Java : Skip Unicode characters while reading a file

I am reading a text file using the below code,
try (BufferedReader br = new BufferedReader(new FileReader(<file.txt>))) {
for (String line; (line = br.readLine()) != null;) {
//I want to skip a line with unicode character and continue next line
if(line.toLowerCase().startsWith("\\u")){
continue;
//This is not working because i get the character itself and not the text
}
}
}
The text file:
How to skip all the unicode characters while reading a file ?
You can skip all lines that contain non-ASCII characters:
if (!Charset.forName("US-ASCII").newEncoder().canEncode(line)) {
continue;
}
All characters in a String are Unicode. A String is a counted sequence of UTF-16 code units. By "Unicode", you presumably mean characters that fall outside some other, unspecified character set. For the sake of argument, let's say ASCII.
A regular expression can sometimes be the simplest expression of a pattern requirement:
if (!line.matches("\\p{ASCII}*")) continue;
That is, if the string does not consist only of any number, including 0, (that's what * means) of "ASCII" characters, then continue.
(String.matches looks for a match on the whole string, so the actual regular expression pattern is ^\p{ASCII}*$. )
Something like this might get you going:
for (char c : line.toCharArray()) {
if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN) {
// do something with this character
}
}
You could use that as a starting point to either discard each non-basic character, or discard the entire line if it contains a single non-basic character.
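Both variants can be sketched roughly as follows, using the same BASIC_LATIN check. The class and method names are made up for this example, and the input is a plain list of strings rather than a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration only.
final class AsciiFilter {
    // True if the char lies in the Basic Latin block (U+0000 to U+007F).
    static boolean isBasicLatin(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN;
    }

    // Variant 1: drop every line that contains a non-basic character.
    static List<String> keepAsciiLines(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            boolean allBasic = true;
            for (char c : line.toCharArray()) {
                if (!isBasicLatin(c)) {
                    allBasic = false;
                    break;
                }
            }
            if (allBasic) {
                kept.add(line);
            }
        }
        return kept;
    }

    // Variant 2: keep the line but discard its non-basic characters.
    static String stripNonAscii(String line) {
        StringBuilder sb = new StringBuilder();
        for (char c : line.toCharArray()) {
            if (isBasicLatin(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepAsciiLines(Arrays.asList("plain", "caf\u00e9"))); // prints [plain]
        System.out.println(stripNonAscii("caf\u00e9")); // prints caf
    }
}
```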

Cast arbitrary escaped character to int

I have a method that, at the end, takes a character array (with one element), and returns the cast of that character:
char[] first = {'a'};
return (int)first[0];
However, sometimes I have character arrays with two elements, where the first is always a "\" (i.e. it is a character array that "contains" an escaped character):
char[] second = {'\\', 'n'};
I would like to return (int)'\n', but I do not know how to convert that array into a single escaped character. I am okay checking whether or not the array is of length 1 or 2, but I really don't want to have a long switch or if/else block to go through every possible escaped character.
How about making a HashMap from the character after the backslash to its code? Like:
Map<Character, Integer> escapeMap = new HashMap<>();
escapeMap.put('n', 10);
Then do something like:
if (second[0] == '\\') {
return escapeMap.get(second[1]);
}
else
{
return (int)first[0];
}
You could use a map to store the escape sequences mappings to their corresponding characters. If you assume that the escape sequence will always be just one character with the code below 128, you could simplify the mappings to something like this:
char[] escaped = {..., '\n', ...'\t', ...}
where the character '\n' is on the (int)'n'-th position of the array.
Then you would find the escaped character just by escaped[(int)second[1]]. You just need to check the array bounds, in case an invalid escape sequence is found.
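A small sketch of that lookup-table idea follows. The class name, the method name and the choice to throw on an unknown escape are all assumptions made for the example; the table covers only the single-character Java escapes.

```java
// Hypothetical helper class for illustration only.
final class EscapeTable {
    // Index = character after the backslash, value = the escaped character.
    // Unused slots stay 0, which no supported escape maps to.
    private static final char[] ESCAPED = new char[128];

    static {
        ESCAPED['b'] = '\b';
        ESCAPED['t'] = '\t';
        ESCAPED['n'] = '\n';
        ESCAPED['f'] = '\f';
        ESCAPED['r'] = '\r';
        ESCAPED['\''] = '\'';
        ESCAPED['"'] = '"';
        ESCAPED['\\'] = '\\';
    }

    // Accepts either {c} or {'\\', c} and returns the character code.
    static int toInt(char[] chars) {
        if (chars.length == 1) {
            return chars[0];
        }
        if (chars.length == 2 && chars[0] == '\\') {
            char key = chars[1];
            // bounds check first, then make sure the slot is populated
            if (key < ESCAPED.length && ESCAPED[key] != 0) {
                return ESCAPED[key];
            }
        }
        throw new IllegalArgumentException("Not a supported escape sequence");
    }

    public static void main(String[] args) {
        System.out.println(toInt(new char[] {'a'}));       // prints 97
        System.out.println(toInt(new char[] {'\\', 'n'})); // prints 10
    }
}
```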
Here is an ugly hack. This works for me, but appears to be unreliable, buggy and time-consuming (and hence don't use this in critical parts).
char[] second = {'\\','n'};
String s = new String(second);
//write the String to an OutputStream
ByteArrayOutputStream baos = new ByteArrayOutputStream();
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(baos));
writer.write("key=" + s);
writer.close();
//load the String using Properties
ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
Properties prop = new Properties();
prop.load(bais);
baos.close();
bais.close();
//now get the character
char c = prop.getProperty("key").charAt(0);
System.out.println((int)c);
output: 10 (which is the output of System.out.println((int)'\n');)
When printing (int)'a', the decimal value of that character from the Unicode table is printed. In the case of a that is 97. http://unicode-table.com/en/
\n is identified as the code LF, which means Line Feed or New Line. In the Unicode table that is decimal 10, which is indeed what gets printed if you write: System.out.println((int)'\n');
The same goes for the escape sequences \b, \f, \r, \t, \', \" and \\, which have special meaning for the compiler and map to specific character codes, like LF for \n. Look them up if you want to know the details.
In that light, the simplest solution would be:
char[] second = {'\\', 'n'};
if (second.length > 1) {
System.out.println((int)'\n');
} else {
System.out.println((int)second[0]);
}
and that's only if \n is the only escape sequence you encounter.

unwrapping String within String

I received a message from a queuing service, which I thought would be a UTF-8 encoded String. It turned out to be a quoted and escaped String within a String. That is, the first and last characters of the String itself are ", each newline is two characters \n, quotation marks (numerous because this is XML) are \", and single UTF-8 characters in foreign languages are represented as six characters (e.g., \uABCD). I know I can unwrap all this by rolling my own, but I thought there must be a combination of methods that can do this already. What might that incantation be?
After feedback from @JonSkeet and @njzk2, I came up with this, which worked:
// gradle: 'org.apache.commons:commons-lang3:3.3.2'
import org.apache.commons.lang3.StringEscapeUtils;
String s = serviceThatSometimesReturnsQuotedStringWithinString();
String usable = null;
if (s.length() > 0 && s.charAt(0) == '"' && s.charAt(s.length()-1) == '"') {
usable = StringEscapeUtils.unescapeEcmaScript(s.substring(1, s.length()-1));
} else {
usable = s;
}

java StreamTokenizer wordChars() and nextToken()

This might be a dumb question, but I am having a hard time understanding how StreamTokenizer delimits input streams. Is it delimited by spaces and newlines? I am also confused about the use of wordChars(). For example:
public static int getSet(String workingDirectory, String filename, List<String> set) {
int cardinality = 0;
File file = new File(workingDirectory,filename);
try {
BufferedReader in = new BufferedReader(new FileReader(file));
StreamTokenizer text = new StreamTokenizer(in);
text.wordChars('_','_');
text.nextToken();
while (text.ttype != StreamTokenizer.TT_EOF) {
set.add(text.sval);
cardinality++;
// System.out.println(cardinality + " " + text.sval);
text.nextToken();
}
in.close();
} catch (IOException ex) {
ex.printStackTrace();
}
return cardinality;
}
If the text file includes a string such as: A_B_C D_E_F.
Does text.wordChars('_','_') mean only underscore will be considered as valid words?
And what will the tokens be in this case?
Thank you very much.
how StreamTokenizer delimit input streams. Is it delimited by space and nextline?
Short Answer is Yes
The parsing process is controlled by a table and a number of flags that can be set to various states. The stream tokenizer can recognize identifiers, numbers, quoted strings, and various comment styles. In addition, an instance has four flags. One of the flags indicates whether line terminators are to be returned as tokens or treated as white space that merely separates tokens.
Does text.wordChars('_','_') mean only underscore will be considered as valid words?
Short answer: no.
wordChars takes two arguments: low is the lower end of a character range and high is the upper end. If low is less than 0 it is treated as 0, and if high is 256 or more it is clamped to 255. Since you are passing _ (95) for both, the range contains exactly one character, _ itself. The call adds that range to the set of word characters; it does not replace the set. A newly constructed StreamTokenizer already treats A-Z, a-z and the characters 160-255 as word characters, so after the call the underscore is a word character in addition to the letters. For the input A_B_C D_E_F the word tokens are therefore A_B_C and D_E_F.
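One way to check what wordChars('_','_') does is to run the tokenizer on the sample input with and without the call. In the sketch below (class and method names are made up, and a StringReader replaces the file), the default word characters stay in effect, so the call only decides whether the underscore joins or splits the letters.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class for illustration only.
final class TokenizerDemo {
    // Collects the word tokens of the input, optionally treating '_' as a word character.
    static List<String> wordTokens(String input, boolean underscoreIsWordChar) throws IOException {
        StreamTokenizer text = new StreamTokenizer(new StringReader(input));
        if (underscoreIsWordChar) {
            text.wordChars('_', '_'); // adds '_' to the existing word characters
        }
        List<String> tokens = new ArrayList<>();
        while (text.nextToken() != StreamTokenizer.TT_EOF) {
            // ordinary characters come back with sval == null, so keep words only
            if (text.ttype == StreamTokenizer.TT_WORD) {
                tokens.add(text.sval);
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(wordTokens("A_B_C D_E_F", true));  // prints [A_B_C, D_E_F]
        System.out.println(wordTokens("A_B_C D_E_F", false)); // prints [A, B, C, D, E, F]
    }
}
```

Note that the loop in the question adds text.sval for every token type, so without the wordChars call it would also add a null entry for each underscore.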
If you only need to split a string on underscores, you can use a regular expression:
Pattern splitRegex = Pattern.compile("_");
String[] tokens = splitRegex.split(stringtobesplitedbydelimeter);
or you can also use
String[] tokens = stringtobesplitedbydelimeter.split("_");
