java StreamTokenizer wordChars() and nextToken() - java

This might be a dumb question, but I am having a hard time understanding how StreamTokenizer delimits input streams. Is the input delimited by spaces and newlines? I am also confused about the use of wordChars(). For example:
public static int getSet(String workingDirectory, String filename, List<String> set) {
    int cardinality = 0;
    File file = new File(workingDirectory, filename);
    try {
        BufferedReader in = new BufferedReader(new FileReader(file));
        StreamTokenizer text = new StreamTokenizer(in);
        text.wordChars('_', '_');
        text.nextToken();
        while (text.ttype != StreamTokenizer.TT_EOF) {
            set.add(text.sval);
            cardinality++;
            // System.out.println(cardinality + " " + text.sval);
            text.nextToken();
        }
        in.close();
    } catch (IOException ex) {
        ex.printStackTrace();
    }
    return cardinality;
}
If the text file includes a string such as: A_B_C D_E_F.
Does text.wordChars('_','_') mean only underscore will be considered as valid words?
And what will the tokens be in this case?
Thank you very much.

How does StreamTokenizer delimit input streams? Is it delimited by spaces and newlines?
Short answer: yes, by default.
The parsing process is controlled by a table and a number of flags that can be set to various states. The stream tokenizer can recognize identifiers, numbers, quoted strings, and various comment styles. In addition, an instance has four flags. One of the flags indicates whether line terminators are to be returned as tokens or treated as white space that merely separates tokens. By default, line terminators are treated as white space; calling eolIsSignificant(true) makes the tokenizer return them as TT_EOL tokens instead.
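To make the line-terminator flag concrete, here is a minimal sketch that tokenizes the same input with eolIsSignificant switched off and on. The class name EolFlagDemo and the "<EOL>" marker string are made up for illustration; only word tokens and end-of-line tokens are collected.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original question.
class EolFlagDemo {
    // Returns the word tokens of the input; line ends appear as "<EOL>"
    // only when eolIsSignificant is true.
    static List<String> tokens(String input, boolean eolIsSignificant) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.eolIsSignificant(eolIsSignificant);
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_EOL) {
                    out.add("<EOL>");
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a b\nc", false)); // line end acts as whitespace
        System.out.println(tokens("a b\nc", true));  // line end is returned as a token
    }
}
```

With the flag off, spaces and newlines both just separate tokens, which matches the default behaviour described above.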
Does text.wordChars('_','_') mean only underscore will be considered as valid words?
Short answer: no. wordChars() adds characters to the set of word characters; it does not replace that set.
wordChars takes two inputs. The first (low) is the lower end of the character range and the second (high) is the upper end. If low is less than 0, it is set to 0. Since you are passing _ = 95, the lower end is accepted as 95. If high is greater than 255, it is clamped to 255. Since you are passing high as _ = 95, this is also accepted as is. The range low-to-high therefore contains only one character, _ itself, and that character is marked as a word constituent in addition to the defaults (A-Z, a-z, and the characters \u00A0 through \u00FF). For the input A_B_C D_E_F. the word tokens are A_B_C and D_E_F. (the trailing period stays attached to the second token, because number parsing is enabled by default and number constituents such as '.' may continue a word).
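The effect of wordChars('_','_') on the sample input can be checked with a small sketch like the one below. The class name WordCharsDemo and the boolean parameter are assumptions made for the example; the tokenizer reads from a StringReader instead of a file to keep it self-contained.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original question.
class WordCharsDemo {
    // Collects only TT_WORD tokens; ordinary characters such as a bare '_' are ignored.
    static List<String> words(String input, boolean underscoreIsWordChar) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        if (underscoreIsWordChar) {
            st.wordChars('_', '_'); // adds '_' to the existing word characters
        }
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("A_B_C D_E_F.", true));  // [A_B_C, D_E_F.]
        System.out.println(words("A_B_C D_E_F.", false)); // [A, B, C, D, E, F.]
    }
}
```

Without the wordChars call, each underscore is returned as an ordinary single-character token (ttype '_', sval null), which is why code that blindly adds text.sval to a list can end up storing null entries.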

If you only need to split on underscores, you can use a regular expression:
Pattern splitRegex = Pattern.compile("_");
String[] tokens = splitRegex.split(stringtobesplitedbydelimeter);
or you can also use
String[] tokens = stringtobesplitedbydelimeter.split("_");

Related

Java : Skip Unicode characters while reading a file

I am reading a text file using the below code,
try (BufferedReader br = new BufferedReader(new FileReader(<file.txt>))) {
    for (String line; (line = br.readLine()) != null;) {
        // I want to skip a line with a unicode character and continue with the next line
        if (line.toLowerCase().startsWith("\\u")) {
            continue;
            // This is not working because I get the character itself and not the text
        }
    }
}
The text file:
How to skip all the unicode characters while reading a file ?
You can skip all lines that contain non-ASCII characters:
if (!Charset.forName("US-ASCII").newEncoder().canEncode(line)) {
    continue;
}
All characters in a String are Unicode. A String is a counted sequence of UTF-16 code units. By "Unicode", you presumably mean characters that are not also in some other, unspecified character set. For the sake of argument, let's say ASCII.
A regular expression can sometimes be the simplest expression of a pattern requirement:
if (!line.matches("\\p{ASCII}*")) continue;
That is, if the string does not consist solely of zero or more (that's what * means) ASCII characters, then continue.
(String.matches looks for a match on the whole string, so the effective regular expression is ^\p{ASCII}*$.)
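As a rough sketch of the two checks above side by side, the following helper keeps only the lines that are pure ASCII. The class and method names are invented for the example, and the input is a list of strings rather than a file so the snippet stays self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, assuming "Unicode character" means "non-ASCII character".
class AsciiLineFilter {
    // Encoder-based check: true when every character can be encoded as US-ASCII.
    static boolean isAscii(String line) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(line);
    }

    // Regex-based check; matches() always tests the whole string.
    static boolean isAsciiRegex(String line) {
        return line.matches("\\p{ASCII}*");
    }

    // Returns the lines that contain only ASCII characters, in their original order.
    static List<String> keepAsciiLines(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!isAscii(line)) {
                continue; // skip lines with any non-ASCII character
            }
            kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(keepAsciiLines(List.of("plain text", "caf\u00e9", "")));
    }
}
```

Both checks treat an empty line as ASCII, so empty lines are kept.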
Something like this might get you going:
for (char c : line.toCharArray()) {
if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN) {
// do something with this character
}
}
You could use that as a starting point to either discard each non-basic character, or discard the entire line if it contains a single non-basic character.
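For the first option (discarding individual characters rather than whole lines), a minimal sketch could look like this. The class name BasicLatinFilter is made up, and "basic" is taken to mean the BASIC_LATIN Unicode block, which covers U+0000 to U+007F.

```java
// Hypothetical helper built on the Character.UnicodeBlock check shown above.
class BasicLatinFilter {
    // Returns a copy of the line with every character outside BASIC_LATIN removed.
    static String keepBasicLatin(String line) {
        StringBuilder sb = new StringBuilder(line.length());
        for (char c : line.toCharArray()) {
            if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepBasicLatin("caf\u00e9 ok")); // caf ok
    }
}
```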

reading text file java

I'm trying to read a text file(.txt) in java. I need to eventually put the text I extract word by word in a binary tree's nodes . If for example, I have the text: "Hi, I'm doing a test!", I would like to split it into "Hi" "I" "m" "doing" "a" "test", basically skipping all punctuation and empty spaces and considering a word to be a sequence of contiguous alphabet letters. I am so far able to extract the words and put them in an array for testing. However, if I have a completely empty line in my .txt file, the code will consider it as a word and return an empty space. Also, punctuation at the end of a line works but if there's a comma for example and then text, I will get an empty space as well ! Here is what I tried so far:
public static void main(String[] args) throws Exception
{
FileReader file = new FileReader("File.txt");
BufferedReader reader = new BufferedReader(file);
String text = "";
String line = reader.readLine();
while (line != null)
{
text += line;
line = reader.readLine();
}
System.out.println(text);
String textnospaces=text.replaceAll("\\s+", " ");
System.out.println(textnospaces);
String [] tokens = textnospaces.split("[\\W+]");
for(int i=0;i<=tokens.length-1;i++)
{
tokens[i]=tokens[i].toLowerCase();
System.out.println(tokens[i]);
}
}
Using the following text:
I can't, come see you. Today my friend is hard
s
I get the following output:
i
can
t
(extra space between "t" and "come")
come
see
you
(extra space again)
today
my
friend
is
hards
Any help would be appreciated ! Thanks
Use the trim() method of String. From the documentation (http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#trim%28%29):
"Returns a copy of the string, with leading and trailing whitespace omitted.
If this String object represents an empty character sequence, or the first and last characters of character sequence represented by this String object both have codes greater than '\u0020' (the space character), then a reference to this String object is returned.
Otherwise, if there is no character with a code greater than '\u0020' in the string, then a new String object representing an empty string is created and returned.
Otherwise, let k be the index of the first character in the string whose code is greater than '\u0020', and let m be the index of the last character in the string whose code is greater than '\u0020'. A new String object is created, representing the substring of this string that begins with the character at index k and ends with the character at index m-that is, the result of this.substring(k, m+1).
This method may be used to trim whitespace (as defined above) from the beginning and end of a string.
Returns:
A copy of this string with leading and trailing white space removed, or this string if it has no leading or trailing white space."
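If you really are just looking for each contiguous sequence of letters, you can accomplish this quite simply with regex matching instead of splitting. Matching never produces empty tokens, so empty lines and consecutive punctuation are not a problem.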
String patternString1 = "([a-zA-Z]+)";
String text = "I can't, come see you. Today my friend is hard";
Pattern pattern = Pattern.compile(patternString1);
Matcher matcher = pattern.matcher(text);
while(matcher.find()) {
System.out.println("found: " + matcher.group(1));
}
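Wrapped into a method, the matching approach might look like the sketch below. The class name WordExtractor is an assumption; lowercasing is included because the question lowercases its tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts runs of ASCII letters as lowercase words.
class WordExtractor {
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group().toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hi, I'm doing a test!")); // [hi, i, m, doing, a, test]
    }
}
```

Because the pattern only ever matches one or more letters, an empty line or a comma followed by a space simply produces no match, rather than the empty strings that split("[\\W+]") returns between adjacent delimiters.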

Cast arbitrary escaped character to int

I have a method that, at the end, takes a character array (with one element), and returns the cast of that character:
char[] first = {'a'};
return (int)first[0];
However, sometimes I have character arrays with two elements, where the first is always a "\" (i.e. it is a character array that "contains" an escaped character):
char[] second = {'\\', 'n'};
I would like to return (int)'\n', but I do not know how to convert that array into a single escaped character. I am okay checking whether or not the array is of length 1 or 2, but I really don't want to have a long switch or if/else block to go through every possible escaped character.
How about making a HashMap from the letter after the backslash to the code of the escaped character? Like:
Map<Character, Integer> escapeMap = new HashMap<>();
escapeMap.put('n', 10);
Then do something like:
if (second[0] == '\\') {
    return escapeMap.get(second[1]);
} else {
    return (int) second[0];
}
You could use a map to store the escape sequences mappings to their corresponding characters. If you assume that the escape sequence will always be just one character with the code below 128, you could simplify the mappings to something like this:
char[] escaped = {..., '\n', ...'\t', ...}
where the character '\n' is on the (int)'n'-th position of the array.
Then you would find the escaped character just by escaped[(int)second[1]]. You just need to check the array bounds in case an invalid escape sequence is found.
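A minimal sketch of that lookup-table idea follows. The class name EscapeLookup, the method codeOf, and the choice of supported escapes (\b, \t, \n, \f, \r, \', \", \\) are assumptions for the example; unknown sequences are rejected with an exception rather than guessed.

```java
import java.util.Arrays;

// Hypothetical helper: maps the letter after a backslash to the escaped character's code.
class EscapeLookup {
    // ESCAPED[x] holds the code of the character written as \x, or -1 if \x is not supported.
    private static final int[] ESCAPED = new int[128];

    static {
        Arrays.fill(ESCAPED, -1);
        ESCAPED['b'] = '\b';
        ESCAPED['t'] = '\t';
        ESCAPED['n'] = '\n';
        ESCAPED['f'] = '\f';
        ESCAPED['r'] = '\r';
        ESCAPED['\''] = '\'';
        ESCAPED['"'] = '"';
        ESCAPED['\\'] = '\\';
    }

    // Accepts either a single character or a backslash followed by one character.
    static int codeOf(char[] chars) {
        if (chars.length == 1) {
            return chars[0];
        }
        if (chars.length == 2 && chars[0] == '\\'
                && chars[1] < ESCAPED.length && ESCAPED[chars[1]] >= 0) {
            return ESCAPED[chars[1]];
        }
        throw new IllegalArgumentException("Unsupported escape sequence: " + new String(chars));
    }

    public static void main(String[] args) {
        System.out.println(codeOf(new char[] {'a'}));       // 97
        System.out.println(codeOf(new char[] {'\\', 'n'})); // 10
    }
}
```

The bounds check on chars[1] is what keeps a character with a code of 128 or above from indexing past the end of the table.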
Here is an ugly hack. This works for me, but appears to be unreliable, buggy and time-consuming (and hence don't use this in critical parts).
char[] second = {'\\','n'};
String s = new String(second);
//write the String to an OutputStream
ByteArrayOutputStream baos = new ByteArrayOutputStream();
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(baos));
writer.write("key=" + s);
writer.close();
//load the String using Properties
ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
Properties prop = new Properties();
prop.load(bais);
baos.close();
bais.close();
//now get the character
char c = prop.getProperty("key").charAt(0);
System.out.println((int)c);
output: 10 (which is the output of System.out.println((int)'\n');)
When printing (int)'a', the decimal value of that character in the Unicode table is printed. In the case of a, that would be 97. http://unicode-table.com/en/
\n is identified as the Unicode character LF, which means Line Feed or New Line. In the Unicode table that is decimal 10, which is indeed printed if you write: System.out.println((int)'\n');
The same goes for the escape sequences \b, \f, \r, \t, \', \" and \\, which have special meaning for the compiler and map to a specific character code, like LF for \n. Look them up if you want to know the details.
In that light, the simplest solution would be:
char[] second = {'\\', 'n'};
if (second.length > 1) {
System.out.println((int)'\n');
} else {
System.out.println((int)second[0]);
}
and that's only if \n is the only escape sequence you encounter.

split a string in java into equal length substrings while maintaining word boundaries

How to split a string into equal parts of maximum character length while maintaining word boundaries?
Say, for example, if I want to split a string "hello world" into equal substrings of maximum 7 characters it should return me
"hello "
and
"world"
But my current implementation returns
"hello w"
and
"orld "
I am using the following code taken from Split string to equal length substrings in Java to split the input string into equal parts
public static List<String> splitEqually(String text, int size) {
    // Give the list the right capacity to start with. You could use an array
    // instead if you wanted.
    List<String> ret = new ArrayList<String>((text.length() + size - 1) / size);
    for (int start = 0; start < text.length(); start += size) {
        ret.add(text.substring(start, Math.min(text.length(), start + size)));
    }
    return ret;
}
Will it be possible to maintain word boundaries while splitting the string into substring?
To be more specific I need the string splitting algorithm to take into account the word boundary provided by spaces and not solely rely on character length while splitting the string although that also needs to be taken into account but more like a max range of characters rather than a hardcoded length of characters.
If I understand your problem correctly, then this code should do what you need (but it assumes that maxLenght is equal to or greater than the length of the longest word):
String data = "Hello there, my name is not importnant right now."
+ " I am just simple sentecne used to test few things.";
int maxLenght = 10;
Pattern p = Pattern.compile("\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)", Pattern.DOTALL);
Matcher m = p.matcher(data);
while (m.find())
System.out.println(m.group(1));
Output:
Hello
there, my
name is
not
importnant
right now.
I am just
simple
sentecne
used to
test few
things.
Short (or not) explanation of the "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:
(Remember that in Java \ is special not only in regex but also in String literals, so to use a predefined character class like \d we need to write it as "\\d", escaping the \ in the string literal as well.)
\G - an anchor representing the end of the previous match or, if there is no match yet (when we have just started searching), the beginning of the string (same as ^)
\s* - zero or more whitespace characters (\s represents whitespace, * is the "zero-or-more" quantifier)
(.{1,"+maxLenght+"}) - let's split it into parts (at runtime maxLenght will hold some numeric value like 10, so the regex will see it as .{1,10})
. represents any character (by default any character except line separators like \n or \r, but thanks to the Pattern.DOTALL flag it can match those too; you may drop this argument if you want each line to be split separately, since its start will be printed on a new line anyway)
{1,10} - a quantifier that lets the preceding element appear 1 to 10 times (by default it tries to match the maximal number of repetitions)
.{1,10} - so, based on what we just said, it simply represents "1 to 10 of any characters"
( ) - parentheses create groups, which let us capture specific parts of a match (here we placed the parentheses after \\s* because we only want the part after the whitespace)
(?=\\s|$) - a look-ahead which makes sure that the text matched by .{1,10} is followed by:
whitespace (\\s)
OR (written as |)
the end of the string ($).
So thanks to .{1,10} we can match up to 10 characters. But with (?=\\s|$) after it we require that the last character matched by .{1,10} is not part of an unfinished word (there must be whitespace or the end of the string after it).
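Applied to the question's own example, the same pattern can be wrapped in a small method. The class name WordBoundarySplitter and the method name split are made up for this sketch; it keeps the answer's assumption that no word is longer than the limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex explained above.
class WordBoundarySplitter {
    // Splits text into chunks of at most maxLength characters, breaking only at whitespace.
    // Assumes every word fits within maxLength; matching stops at the first word that does not.
    static List<String> split(String text, int maxLength) {
        Pattern p = Pattern.compile("\\G\\s*(.{1," + maxLength + "})(?=\\s|$)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> parts = new ArrayList<>();
        while (m.find()) {
            parts.add(m.group(1));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("hello world", 7)); // [hello, world]
    }
}
```

Note that the whitespace between chunks is consumed by \s* outside the capturing group, so the first chunk is "hello" rather than "hello " with a trailing space.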
Non-regex solution, just in case someone is more comfortable (?) not using regular expressions:
private String justify(String s, int limit) {
    StringBuilder justifiedText = new StringBuilder();
    StringBuilder justifiedLine = new StringBuilder();
    String[] words = s.split(" ");
    for (int i = 0; i < words.length; i++) {
        justifiedLine.append(words[i]).append(" ");
        if (i+1 == words.length || justifiedLine.length() + words[i+1].length() > limit) {
            justifiedLine.deleteCharAt(justifiedLine.length() - 1);
            justifiedText.append(justifiedLine.toString()).append(System.lineSeparator());
            justifiedLine = new StringBuilder();
        }
    }
    return justifiedText.toString();
}
Test:
String text = "Long sentence with spaces, and punctuation too. And supercalifragilisticexpialidocious words. No carriage returns, tho -- since it would seem weird to count the words in a new line as part of the previous paragraph's length.";
System.out.println(justify(text, 15));
Output:
Long sentence
with spaces,
and punctuation
too. And
supercalifragilisticexpialidocious
words. No
carriage
returns, tho --
since it would
seem weird to
count the words
in a new line
as part of the
previous
paragraph's
length.
It takes into account words that are longer than the set limit, so it doesn't skip them (unlike the regex version, which just stops processing when it finds supercalifragilisticexpialidocious).
PS: The comment about all input words being expected to be shorter than the set limit, was made after I came up with this solution ;)

Java simple sentence parser

is there any simple way to create sentence parser in plain Java
without adding any libs and jars.
Parser should not just take care about blanks between words,
but be more smart and parse: . ! ?,
recognize when sentence is ended etc.
After parsing, only real words could be all stored in db or file, not any special chars.
thank you very much all in advance :)
You might want to start by looking at the BreakIterator class.
From the JavaDoc.
The BreakIterator class implements
methods for finding the location of
boundaries in text. Instances of
BreakIterator maintain a current
position and scan over text returning
the index of characters where
boundaries occur. Internally,
BreakIterator scans text using a
CharacterIterator, and is thus able to
scan text held by any object
implementing that protocol. A
StringCharacterIterator is used to
scan String objects passed to setText.
You use the factory methods provided
by this class to create instances of
various types of break iterators. In
particular, use getWordInstance,
getLineInstance, getSentenceInstance,
and getCharacterInstance to create
BreakIterators that perform word,
line, sentence, and character boundary
analysis respectively. A single
BreakIterator can work only on one
unit (word, line, sentence, and so
on). You must use a different iterator
for each unit boundary analysis you
wish to perform.
Line boundary analysis determines
where a text string can be broken when
line-wrapping. The mechanism correctly
handles punctuation and hyphenated
words.
Sentence boundary analysis allows
selection with correct interpretation
of periods within numbers and
abbreviations, and trailing
punctuation marks such as quotation
marks and parentheses.
Word boundary analysis is used by
search and replace functions, as well
as within text editing applications
that allow the user to select words
with a double click. Word selection
provides correct interpretation of
punctuation marks within and following
words. Characters that are not part of
a word, such as symbols or punctuation
marks, have word-breaks on both sides.
Character boundary analysis allows
users to interact with characters as
they expect to, for example, when
moving the cursor through a text
string. Character boundary analysis
provides correct navigation through
character strings, regardless of how
the character is stored. For example,
an accented character might be stored
as a base character and a diacritical
mark. What users consider to be a
character can differ between
languages.
BreakIterator is intended for use with
natural languages only. Do not use
this class to tokenize a programming
language.
See demo: BreakIteratorDemo.java
Based on @Jarrod Roberson's answer, I have created a utility method that uses BreakIterator and returns the list of sentences.
public static List<String> tokenize(String text, String language, String country){
    List<String> sentences = new ArrayList<String>();
    Locale currentLocale = new Locale(language, country);
    BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(currentLocale);
    sentenceIterator.setText(text);
    int boundary = sentenceIterator.first();
    int lastBoundary = 0;
    while (boundary != BreakIterator.DONE) {
        boundary = sentenceIterator.next();
        if(boundary != BreakIterator.DONE){
            sentences.add(text.substring(lastBoundary, boundary));
        }
        lastBoundary = boundary;
    }
    return sentences;
}
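Since the question also asks for only real words to be stored, a companion sketch using BreakIterator.getWordInstance might look like this. The class name WordBreaks is invented, and the rule "a piece counts as a word if it starts with a letter or digit" is an assumption of this example, not something BreakIterator itself defines.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: walks word boundaries and keeps only pieces that look like words.
class WordBreaks {
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> words = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Spaces and punctuation come back as their own pieces; skip them.
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                words.add(piece);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello world. Bye!", Locale.ENGLISH));
    }
}
```

How contractions and hyphenated words are grouped depends on the iterator's rules for the given locale, so it is worth testing with representative text before relying on it.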
Just use a regular expression (\s+, which matches one or more whitespace characters such as spaces and tabs) to split the String into an array.
Then you can iterate over that array and check whether a word ends with ., ? or ! (using String.endsWith()) to find the ends of sentences.
Before saving any word, use a regular expression once more to remove every non-alphanumeric character.
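The three steps above can be sketched roughly as follows. The class name SimpleSentenceParser, the nested-list return type, and the ASCII-only definition of "alphanumeric" are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser: splits on whitespace, detects sentence ends, strips non-alphanumerics.
class SimpleSentenceParser {
    // Returns one list of cleaned words per sentence.
    static List<List<String>> parse(String text) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // happens only when the input is blank
            }
            boolean endsSentence = raw.endsWith(".") || raw.endsWith("!") || raw.endsWith("?");
            String word = raw.replaceAll("[^A-Za-z0-9]", "");
            if (!word.isEmpty()) {
                current.add(word);
            }
            if (endsSentence && !current.isEmpty()) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing words without closing punctuation
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(parse("Hello there! How are you? Fine."));
    }
}
```

This simple rule treats every word ending in a period as the end of a sentence, so abbreviations such as "e.g." will split sentences early; BreakIterator handles those cases better.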
Of course, use StringTokenizer
import java.util.StringTokenizer;
public class Token {
    public static void main(String[] args) {
        String sentence = "Java! simple ?sentence parser.";
        String separator = "!?.";
        StringTokenizer st = new StringTokenizer( sentence, separator, true );
        while ( st.hasMoreTokens() ) {
            String token = st.nextToken();
            if ( token.length() == 1 && separator.indexOf( token.charAt( 0 ) ) >= 0 ) {
                System.out.println( "special char:" + token );
            }
            else {
                System.out.println( "word :" + token );
            }
        }
    }
}
You can use either StringTokenizer or Scanner.
For example:
StringTokenizer tokenizer = new StringTokenizer(input, " !?.");
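For the Scanner alternative, a minimal sketch might be the following. The class name ScannerWords and the delimiter pattern (whitespace plus ! ? .) are assumptions chosen to mirror the StringTokenizer example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper: uses Scanner with a custom delimiter to pull out words.
class ScannerWords {
    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            // Any run of whitespace or sentence punctuation separates two words.
            sc.useDelimiter("[\\s!?.]+");
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Java! simple ?sentence parser.")); // [Java, simple, sentence, parser]
    }
}
```

Unlike the StringTokenizer constructor with returnDelims set to true, this version discards the punctuation, so it does not tell you where a sentence ended.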
