Java StringTokenizer.nextToken() skips over empty fields - java

I am using a tab (\t) as delimiter and I know there are some empty fields in my data, e.g.:
one->two->->three
Where -> equals the tab. As you can see an empty field is still correctly surrounded by tabs.
Data is collected using a loop :
while ((strLine = br.readLine()) != null) {
StringTokenizer st = new StringTokenizer(strLine, "\t");
String test = st.nextToken();
...
}
Yet Java ignores this "empty string" and skips the field.
Is there a way to circumvent this behaviour and force Java to read in empty fields anyway?

There is an RFE in Sun's bug database about this StringTokenizer issue, with the status "Will not fix".
The evaluation of this RFE states, and I quote:
With the addition of the java.util.regex package in 1.4.0, we have
basically obsoleted the need for StringTokenizer. We won't remove the
class for compatibility reasons. But regex gives you simply what you need.
It then suggests using the String#split(String) method instead.

Thank you all. Thanks to the reference in the first comment I was able to find a solution:
Scanner s = new Scanner(new File("data.txt"));
while (s.hasNextLine()) {
String line = s.nextLine();
String[] items= line.split("\t", -1);
System.out.println(items[5]);
//System.out.println(Arrays.toString(items));
}
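To see why the -1 limit in the snippet above matters, here is a minimal, self-contained sketch. The class name SplitLimitDemo, the helper withTokenizer and the sample line are made up for illustration; only StringTokenizer and String.split come from the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class SplitLimitDemo {
    // Collects every token that StringTokenizer yields for a tab-delimited line.
    static List<String> withTokenizer(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, "\t");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample: one empty field in the middle, one at the end.
        String line = "one\ttwo\t\tthree\t";

        // StringTokenizer treats adjacent delimiters as one, so empty fields vanish.
        System.out.println(withTokenizer(line));                    // [one, two, three]

        // split without a limit keeps inner empty fields but drops trailing ones.
        System.out.println(Arrays.toString(line.split("\t")));      // [one, two, , three]

        // A negative limit keeps every field, including trailing empty ones.
        System.out.println(Arrays.toString(line.split("\t", -1)));  // [one, two, , three, ]
    }
}
```

So if the column count matters (as with items[5] above), the negative limit is the safer choice.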

You can use Apache Commons StringUtils.splitPreserveAllTokens(). It does exactly what you need.

I would use Guava's Splitter, which doesn't need the full regex machinery and is better behaved than String's split() method:
Iterable<String> parts = Splitter.on('\t').split(string);

As you can see in the Javadoc (http://docs.oracle.com/javase/6/docs/api/java/util/StringTokenizer.html), you can use the constructor public StringTokenizer(String str, String delim, boolean returnDelims) with returnDelims set to true.
It then returns each delimiter as a separate token, so two delimiters in a row mark an empty field.
Edit:
DON'T do it this way. As @npe already pointed out, StringTokenizer shouldn't be used any more! See the Javadoc:
StringTokenizer is a legacy class that is retained for compatibility
reasons although its use is discouraged in new code. It is recommended
that anyone seeking this functionality use the split method of String
or the java.util.regex package instead.

public class TestStringTokenStrict {
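/**
* Strict implementation of StringTokenizer
*
* @param str the string to tokenize
* @param delim the single-character delimiter
* @param strict
* true = include empty tokens (each is returned as a single space)
* @return a tokenizer over the re-joined string
*/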
static StringTokenizer getStringTokenizerStrict(String str, String delim, boolean strict) {
StringTokenizer st = new StringTokenizer(str, delim, strict);
StringBuffer sb = new StringBuffer();
while (st.hasMoreTokens()) {
String s = st.nextToken();
if (s.equals(delim)) {
sb.append(" ").append(delim);
} else {
sb.append(s).append(delim);
if (st.hasMoreTokens())
st.nextToken();
}
}
return (new StringTokenizer(sb.toString(), delim));
}
static void altStringTokenizer(StringTokenizer st) {
while (st.hasMoreTokens()) {
String type = st.nextToken();
String one = st.nextToken();
String two = st.nextToken();
String three = st.nextToken();
String four = st.nextToken();
String five = st.nextToken();
System.out.println(
"[" + type + "] [" + one + "] [" + two + "] [" + three + "] [" + four + "] [" + five + "]");
}
}
public static void main(String[] args) {
String input = "Record|One||Three||Five";
altStringTokenizer(getStringTokenizerStrict(input, "|", true));
}}

Related

Replace the words in String without using String replace

Is there a way to replace words in a string without using String replace?
As you can see, the code below is hard-coded. Is there a way to do this dynamically? I heard that there is a library that can do it, but I am not sure.
Can any expert out there give me some solutions? Thank you so much and have a nice day.
for (int i = 0; i < results.size(); ++i) {
// To remove the unwanted words in the query
test = results.toString();
String testresults = test.replace("numFound=2,start=0,docs=[","");
testresults = testresults.replace("numFound=1,start=0,docs=[","");
testresults = testresults.replace("{","");
testresults = testresults.replace("SolrDocument","");
testresults = testresults.replace("numFound=4,start=0,docs=[","");
testresults = testresults.replace("SolrDocument{", "");
testresults = testresults.replace("content=[", "");
testresults = testresults.replace("id=", "");
testresults = testresults.replace("]}]}", "");
testresults = testresults.replace("]}", "");
testresults = testresults.replace("}", "");
In this case you will need to learn regular expressions and the built-in String.replaceAll() method, which lets you capture all the unwanted words in one pattern. Strings are immutable, so remember to assign the result.
For example:
String testresults = test.replaceAll("SolrDocument|id=|content=\\[", "");
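As a rough sketch of how a single regex can stand in for the whole chain of replace() calls from the question: the pattern below matches numFound=..., start=... with any number, so it no longer matters whether 1, 2 or 4 documents were found. The class name SolrCleanup, the pattern and the sample string are my own assumptions, shaped like the fragments in the question; they are not taken from any Solr API.

```java
import java.util.regex.Pattern;

public class SolrCleanup {
    // Assumed noise pattern: the result header with any numbers, the
    // SolrDocument marker, the field labels, and any leftover brackets/braces.
    private static final Pattern NOISE = Pattern.compile(
            "numFound=\\d+,start=\\d+,docs=\\[|SolrDocument|content=\\[|id=|[\\[\\]{}]");

    // Removes every match of the noise pattern from the raw text.
    static String clean(String raw) {
        return NOISE.matcher(raw).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up sample that resembles the text being cleaned in the question.
        String raw = "{numFound=2,start=0,docs=[SolrDocument{id=1, content=[foo]}, "
                + "SolrDocument{id=2, content=[bar]}]}";
        System.out.println(clean(raw)); // 1, foo, 2, bar
    }
}
```

Note that this blindly strips "id=" and brackets wherever they occur, including inside the document content, so reading the fields from the result objects directly is probably more robust if that is an option.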
Simply create and use a custom String.replace() method which happens to use the String.replace() method within it:
public static String customReplace(String inputString, String replaceWith, String... stringsToReplace) {
if (inputString.equals("")) { return replaceWith; }
if (stringsToReplace.length == 0) { return inputString; }
for (int i = 0; i < stringsToReplace.length; i++) {
inputString = inputString.replace(stringsToReplace[i], replaceWith);
}
return inputString;
}
In the example method above you can supply as many strings as you like to be replaced within the stringsToReplace parameter as long as they are delimited with a comma (,). They will all be replaced with what you supply for the replaceWith parameter.
Here is an example of how it can be used:
String test = "This is a string which contains numFound=2,start=0,docs=[ crap and it may also "
+ "have numFound=1,start=0,docs=[ junk in it along with open curly bracket { and "
+ "the SolrDocument word which might also have ]}]} other crap in there too.";
String testResult = customReplace(test, "", "numFound=2,start=0,docs=[ ", "numFound=1,start=0,docs=[ ",
"{ ", "SolrDocument ", "]}]} ");
System.out.println(testResult);
You can also pass a single String Array which contains all your unwanted strings within its elements and pass that array to the stringsToReplace parameter, for example:
String test = "This is a string which contains numFound=2,start=0,docs=[ crap and it may also "
+ "have numFound=1,start=0,docs=[ junk in it along with open curly bracket { and "
+ "the SolrDocument word which might also have ]}]} other crap in there too.";
String[] unwantedStrings = {"numFound=2,start=0,docs=[ ", "numFound=1,start=0,docs=[ ",
"{ ", "SolrDocument ", "]}]} "};
String testResult = customReplace(test, "", unwantedStrings);
System.out.println(testResult);

String Tokenizer missing 2 values off array

I am creating a StringTokenizer like so and populating an ArrayList using the tokens:
LogUtils.log("saved emails: " + savedString);
StringTokenizer st = new StringTokenizer(savedString, ",");
mListEmailAddresses = new ArrayList<String>();
for (int i = 0; i < st.countTokens(); i++) {
String strEmail = st.nextToken().toString();
mListEmailAddresses.add(strEmail);
}
LogUtils.log("mListEmailAddresses: emails: " + mListEmailAddresses.toString());
11-20 09:56:59.518: I/test(6794): saved emails: hdhdjdjdjd,rrfed,ggggt,tfcg,
11-20 09:56:59.518: I/test(6794): mListEmailAddresses: emails: [hdhdjdjdjd, rrfed]
As you can see, mListEmailAddresses is missing 2 values off the end of the array. What should I do to fix this? To my eyes the code looks correct, but maybe I am misunderstanding something.
Thanks.
Using hasMoreTokens is the solution:
while(st.hasMoreTokens()){
String strEmail = st.nextToken().toString();
mListEmailAddresses.add(strEmail);
}
Use the following while loop
StringTokenizer st = new StringTokenizer(savedString, ",");
mListEmailAddresses = new ArrayList<String>();
while (st.hasMoreTokens()) {
String strEmail = st.nextToken();
mListEmailAddresses.add(strEmail);
}
Note, you don't need to call toString, nextToken will return the string.
Alternatively, you could use the split method
String[] tokens = savedString.split(",");
mListEmailAddresses = new ArrayList<String>();
mListEmailAddresses.addAll(Arrays.asList(tokens));
Note, the API docs for StringTokenizer state:
StringTokenizer is a legacy class that is retained for compatibility
reasons although its use is discouraged in new code. It is recommended
that anyone seeking this functionality use the split method of String
or the java.util.regex package instead.
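The st.countTokens() method returns the number of times this tokenizer's nextToken() method can still be called before it throws an exception; it does not advance the current position. Because that number shrinks by one with every nextToken() call while i grows by one, the for loop stops after only half of the tokens.
To get all elements into the ArrayList, use the following code: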
while(st.hasMoreTokens()) {
String strEmail = st.nextToken().toString();
mListEmailAddresses.add(strEmail);
}
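To make the countTokens() behaviour concrete, here is a small sketch that runs the loop from the question next to the hasMoreTokens() version. The class and method names (CountTokensDemo, buggy, fixed) are invented for this example; the input is the string from the question's log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class CountTokensDemo {
    // Loop from the question: countTokens() is re-evaluated on every pass
    // and shrinks as tokens are consumed, so the loop ends too early.
    static List<String> buggy(String s) {
        StringTokenizer st = new StringTokenizer(s, ",");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < st.countTokens(); i++) {
            out.add(st.nextToken());
        }
        return out;
    }

    // Loop from the answers: keep going until no tokens are left.
    static List<String> fixed(String s) {
        StringTokenizer st = new StringTokenizer(s, ",");
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String saved = "hdhdjdjdjd,rrfed,ggggt,tfcg,";
        // i=0, count=4 -> add; i=1, count=3 -> add; i=2, count=2 -> stop.
        System.out.println(buggy(saved)); // [hdhdjdjdjd, rrfed]
        System.out.println(fixed(saved)); // [hdhdjdjdjd, rrfed, ggggt, tfcg]
    }
}
```

Storing st.countTokens() in a local variable before the loop would also work, since the value is then read only once.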

Cut ':' && " " from a String with a tokenizer

Right now I am a little bit confused. I want to manipulate this string with a tokenizer:
Bob:23456:12345 Carl:09876:54321
I am using a StringTokenizer, but when I try:
String signature1 = tok.nextToken(":");
tok.nextToken(" ");
I get:
12345 Carl
However, I want to store the first int and the second int in variables.
Any ideas?
You have two different delimiters, so you should handle them separately.
First, split the space-separated values using String's split(" "), which returns a String[].
Then split each of those strings on the colon.
I believe this will work.
Code:
String input = "Bob:23456:12345 Carl:09876:54321";
String[] words = input.split(" ");
for (String word : words) {
String[] token = word.split(":");
String name = token[0];
int value0 = Integer.parseInt(token[1]);
int value1 = Integer.parseInt(token[2]);
}
The following code should do it. Every character in the delimiter string ": " counts as a separate delimiter:
String input = "Bob:23456:12345 Carl:09876:54321";
StringTokenizer st = new StringTokenizer(input, ": ");
while(st.hasMoreTokens())
{
String name = st.nextToken();
String val1 = st.nextToken();
String val2 = st.nextToken();
}
Seeing as you have two different delimiters, a tokenizer set up for a single delimiter cannot handle them.
You need to first split on whitespace, then split on the colon.
Something like this should help:
String[] s = "Bob:23456:12345 Carl:09876:54321".split(" ");
System.out.println(Arrays.toString(s));
String[] so = s[0].split(":", 2);
System.out.println(Arrays.toString(so));
And you'd get this:
[Bob:23456:12345, Carl:09876:54321]
[Bob, 23456:12345]
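Taking the two-step split one step further, here is a sketch that ends up with both numbers per name as ints. The class name TwoStepSplit, the parse helper and the choice of a map are my own, not something from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TwoStepSplit {
    // Parses "name:int:int name:int:int ..." into name -> {first, second}.
    static Map<String, int[]> parse(String input) {
        Map<String, int[]> result = new LinkedHashMap<>();
        // Step 1: split the records on whitespace.
        for (String record : input.trim().split("\\s+")) {
            // Step 2: split each record on the colon.
            String[] parts = record.split(":");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected name:int:int but got: " + record);
            }
            result.put(parts[0],
                    new int[] { Integer.parseInt(parts[1]), Integer.parseInt(parts[2]) });
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, int[]> parsed = parse("Bob:23456:12345 Carl:09876:54321");
        for (Map.Entry<String, int[]> e : parsed.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue()[0] + ", " + e.getValue()[1]);
        }
        // Bob -> 23456, 12345
        // Carl -> 9876, 54321
    }
}
```

Integer.parseInt drops the leading zero of 09876; if that zero is significant, keep the values as Strings instead.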
If you must use a tokenizer, then I think you need to use it twice:
String str = "Bob:23456:12345 Carl:09876:54321";
StringTokenizer spaceTokenizer = new StringTokenizer(str, " ");
while (spaceTokenizer.hasMoreTokens()) {
StringTokenizer colonTokenizer = new StringTokenizer(spaceTokenizer.nextToken(), ":");
colonTokenizer.nextToken(); // skip the name (Bob, Carl)
while (colonTokenizer.hasMoreTokens()) {
System.out.println(colonTokenizer.nextToken());
}
}
outputs
23456
12345
09876
54321
Personally, though, I would not use a tokenizer here. I would use Claudio's answer instead, which splits the strings.

using tokenizer to read a line

public void GrabData() throws IOException
{
try {
BufferedReader br = new BufferedReader(new FileReader("data/500.txt"));
String line = "";
int lineCounter = 0;
int TokenCounter = 1;
arrayList = new ArrayList < String > ();
while ((line = br.readLine()) != null) {
//lineCounter++;
StringTokenizer tk = new StringTokenizer(line, ",");
System.out.println(line);
while (tk.hasMoreTokens()) {
arrayList.add(tk.nextToken());
System.out.println("check");
TokenCounter++;
if (TokenCounter > 12) {
er = new DataRecord(arrayList);
DR.add(er);
arrayList.clear();
System.out.println("check2");
TokenCounter = 1;
}
}
}
} catch (FileNotFoundException ex) {
Logger.getLogger(Driver.class.getName()).log(Level.SEVERE, null, ex);
}
}
Hello, I am using a tokenizer to read the contents of a line and store them in an ArrayList. The GrabData method above does that job.
The only problem is that the company name (the third column in every line) is in quotes and has a comma in it. I have included one line as an example. The tokenizer relies on the comma to separate the line into tokens, but the comma inside the company name throws it off. Without that comma, everything works as expected.
Example:-
Essie,Vaill,"Litronic , Industries",14225 Hancock Dr,Anchorage,Anchorage,AK,99515,907-345-0962,907-345-1215,essie@vaill.com,http://www.essievaill.com
Any ideas?
First of all, StringTokenizer is considered legacy code. From the Javadoc:
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
Using the split() method you get an array of strings. While iterating through the array, check whether the current string starts with a quote and, if so, whether the next one ends with a quote. If both conditions hold, you know the split happened inside a quoted field, so you can merge the two strings (putting the comma back between them), process the result as you like, and continue iterating normally. In that pass you advance the index by two (i += 2) instead of the usual i++, and the rest of the loop is unaffected.
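The split-then-merge idea described above could look roughly like this. It is only a sketch: the names QuotedFieldMerge, parseLine and endsQuoted are hypothetical, the sample is a shortened version of the line from the question, and unlike the two-piece description it keeps merging until the closing quote appears, so a field with several commas also works. Escaped quotes ("") are not handled; for real CSV files a dedicated CSV parser is usually the safer choice.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedFieldMerge {
    // Splits on every comma, then glues back the pieces of a quoted field.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        String[] pieces = line.split(",", -1); // -1 keeps empty fields
        for (int i = 0; i < pieces.length; i++) {
            String piece = pieces[i];
            if (piece.startsWith("\"")) {
                StringBuilder merged = new StringBuilder(piece);
                // Append following pieces (with the comma) until the quote closes.
                while (!endsQuoted(merged) && i + 1 < pieces.length) {
                    merged.append(',').append(pieces[++i]);
                }
                if (!endsQuoted(merged)) {
                    throw new IllegalArgumentException("Unclosed quote in: " + line);
                }
                // Strip the surrounding quotes.
                piece = merged.substring(1, merged.length() - 1);
            }
            fields.add(piece);
        }
        return fields;
    }

    // True when the text has at least two characters and ends with a quote.
    private static boolean endsQuoted(CharSequence text) {
        return text.length() > 1 && text.charAt(text.length() - 1) == '"';
    }

    public static void main(String[] args) {
        String line = "Essie,Vaill,\"Litronic , Industries\",14225 Hancock Dr,Anchorage";
        for (String field : parseLine(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

With the sample line this yields five fields, the third being Litronic , Industries without the quotes.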
You can accomplish this using Regular Expressions. The following code:
String s = "asd,asdasd,asd, \"asdasdasd,asdasdasd\", asdasd, asd";
System.out.println(s);
s = s.replaceAll("(?<=\")([^\"]+?),([^\"]+?)(?=\")", "$1 $2");
s = s.replaceAll("\"", "");
System.out.println(s);
yields
asd,asdasd,asd, "asdasdasd,asdasdasd", asdasd, asd
asd,asdasd,asd, asdasdasd asdasdasd, asdasd, asd
which, from my understanding, is the preprocessing you require for your tokenizer-code to work. Hope this helps.
While StringTokenizer might not natively handle this for you, a couple lines of code will do it... probably not the most efficient, but should get the idea across...
while (tk.hasMoreTokens()) {
    String token = tk.nextToken();
    /* If the item starts with a quote, keep consuming tokens until we
     * find the one that carries the closing quote.
     */
    if (token.startsWith("\"")) {
        while (tk.hasMoreTokens() && !token.endsWith("\"")) {
            // Append the next token. Don't forget to retain the comma!
            token += "," + tk.nextToken();
        }
        if (!token.endsWith("\"")) {
            // Open quote found but no close quote. Error out.
            throw new IllegalArgumentException("Incomplete string: " + token);
        }
        // Remove leading and trailing quotes.
        token = token.substring(1, token.length() - 1);
    }
}
As you can see in the class description, Oracle discourages the use of StringTokenizer.
Instead of a tokenizer I would use the String split() method,
which takes a regular expression as its argument and can significantly reduce your code.
String str = "Essie,Vaill,\"Litronic , Industries\",14225 Hancock Dr,Anchorage,Anchorage,AK,99515,907-345-0962,907-345-1215,essie@vaill.com,http://www.essievaill.com";
String[] strs = str.split("(?<! ),(?! )");
List<String> list = new ArrayList<String>(Arrays.asList(strs));
Just pay attention to your regex: this one splits only on commas that have no space on either side, so it assumes that a comma inside a quoted field is always next to a space and that a separating comma never is. The quotes themselves stay in the resulting field.

tokenizer null character

inputValue = "111,GOOG,20,475.0"
StringTokenizer tempToken = new StringTokenizer(inputValue, ",");
while(tempToken.hasMoreTokens() == true)
{
test = token.nextToken();
counterTest++;
}
It's giving me some invalid NULL character.
I have just started learning StringTokenizer and I'm not sure what went wrong. Logically I think it should work, but am I forgetting something?
I see a typo in your code: the tokenizer is declared as tempToken, but the loop calls token.nextToken().
However,using StringTokenizer is discouraged in new code. From the javadocs:
StringTokenizer is a legacy class that is retained for compatibility
reasons although its use is discouraged in new code. It is recommended
that anyone seeking this functionality use the split method of String
or the java.util.regex package instead.
The recommended way is to use String#split.
Something like:
private void customSplit(String source) {
String[] tokens = source.split(",");
for (int i = 0; i < tokens.length; i++) {
System.out.println("Token " + i + " is: " + tokens[i]);
}
}
Your code snippet works with some minor adjustments. Maybe you're missing something simple, so check the rewritten full example below:
public static void main(String[] args) throws Exception {
String inputValue = "111,GOOG,20,475.0";
StringTokenizer tempToken = new StringTokenizer(inputValue, ",");
int counterTest = 0;
while (tempToken.hasMoreTokens()) {
String test = tempToken.nextToken();
System.out.println(test);
counterTest++;
}
System.out.println("-------------------");
System.out.println("counterTest = " + counterTest);
}
Output:
111
GOOG
20
475.0
-------------------
counterTest = 4
