How to add delimiters from a StringTokenizer to a separate string? - java

I am inputting a string, and I want to add the delimiters found in that string to a different string. How would I do that? This is the code I have at the moment:
StringTokenizer tokenizer = new StringTokenizer(input, "'.,><-=[]{}+!##$%^&*()~`;/?");
while (tokenizer.hasMoreTokens()) {
    // add delimiters to string here
}
Any help would be greatly appreciated.

If you want StringTokenizer to return the delimiters it parses, you need to pass true as the third constructor argument (returnDelims), as shown here:
StringTokenizer tokenizer = new StringTokenizer(input, "'.,><-=[]{}+!##$%^&*()~`;/?", true);
But if you are only looking for the delimiters, I don't think this is the right approach.

I don't think StringTokenizer is good for this task. Try:
StringBuilder sb = new StringBuilder();
for (char c : input.toCharArray()) {
    if ("'.,><-=[]{}+!##$%^&*()~`;/?".indexOf(c) >= 0) {
        sb.append(c);
    }
}

I'm guessing you want to extract all the delimiters from the string and process them:
String allTokens = "'.,><-=[]{}+!##$%^&*()~`;/?";
StringTokenizer tokenizer = new StringTokenizer(input, allTokens, true);
while (tokenizer.hasMoreTokens()) {
    String nextToken = tokenizer.nextToken();
    if (nextToken.length() == 1 && allTokens.contains(nextToken)) {
        // this token is a delimiter
        // append to string or whatever you want to do with the delimiter
        processDelimiter(nextToken);
    }
}
Create a processDelimiter method in which you add the delimiter to a different string or perform whatever action you want.
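As a rough sketch of what that could look like, here the processing step is folded into a helper that appends each delimiter to a StringBuilder. The class and method names (DelimiterCollector, collectDelimiters) are made up for illustration, and the delimiter set is simply the one from the snippets above.

```java
import java.util.StringTokenizer;

public class DelimiterCollector {
    // Same delimiter set as in the snippets above.
    private static final String DELIMS = "'.,><-=[]{}+!##$%^&*()~`;/?";

    // Hypothetical helper: returns every delimiter found in input, in order of appearance.
    public static String collectDelimiters(String input) {
        StringBuilder found = new StringBuilder();
        // The third argument (true) makes the tokenizer return delimiters as tokens too.
        StringTokenizer tokenizer = new StringTokenizer(input, DELIMS, true);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            // Delimiters come back as one-character tokens.
            if (token.length() == 1 && DELIMS.contains(token)) {
                found.append(token);
            }
        }
        return found.toString();
    }

    public static void main(String[] args) {
        System.out.println(collectDelimiters("adfhkla.asijdf.';.akjsdhfkjsda"));
    }
}
```

Unlike a "which delimiters were used" check, this keeps repeated delimiters, so the result reflects every occurrence in the input.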

This also takes care of delimiters that are used more than once:
String input = "adfhkla.asijdf.';.akjsdhfkjsda";
String compDelims = "'.,><-=[]{}+!##$%^&*()~`;/?";
String delimsUsed = "";
for (char a : compDelims.toCharArray()) {
    // >= 0 (not > 0), so a delimiter at the very start of the input is found too
    if (input.indexOf(a) >= 0 && delimsUsed.indexOf(a) == -1) {
        delimsUsed += a;
    }
}
System.out.println("The delims used are " + delimsUsed);

Related

Split command on a nextElement

I am making a java servlet and am trying to make it display a preview of 3 different articles. I want it to preview the first sentence of each article, but can't seem to get split to work properly since I am reading the articles in with tokenizer. So I have something like:
while ((s = br.readLine()) != null) {
out.println("<tr>");
StringTokenizer s2 = new StringTokenizer(s, "|");
while (s2.hasMoreElements()) {
if (index == 0) {
out.println("<td class='first'>" + s2.nextElement() + "</td>");
}
out.println("</tr>");
}
index = 0;
}
How do I make s2.nextElement print out only the first sentence instead of the whole article? I imagine I could do split with a delimiter of ".", but can't get the code to work right. Thanks.
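Try
String firstSentence = s2.nextToken().split("\\.")[0];
to get the first sentence in the paragraph. Use nextToken() rather than nextElement() here: nextElement() returns an Object, which has no split method.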
It would be better to use a Scanner:
Scanner scanner = new Scanner(new File("articles.txt"));
while (scanner.hasNextLine()) {
    String article = scanner.nextLine();
    String[] parts = article.split("\\s*\\|\\s*");
    String title = parts[0];
    String text = parts[1];
    String date = parts[2];
    String image = parts[3];
    String firstSentence = text.replaceAll("\\..*", ".");
    // Output what you like about the article using the extracted parts
}
Scanner.nextLine() reads in the whole line. (Scanner.next() would return only the next whitespace-delimited token, because the default delimiter is whitespace.)
split("\\s*\\|\\s*") splits the line on pipe characters (which have to be escaped because the pipe has a special meaning in regex), and the \s* consumes any whitespace that may surround them.
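To make the split and the first-sentence step concrete, here is a small self-contained sketch. The sample line and its field order (title, text, date, image) are assumptions carried over from the answer above, and firstSentence is a hypothetical helper, not part of any library.

```java
public class ArticlePreview {
    // Hypothetical helper: returns the text up to and including the first period,
    // or the whole text when it contains no period.
    public static String firstSentence(String text) {
        int end = text.indexOf('.');
        return end < 0 ? text : text.substring(0, end + 1);
    }

    public static void main(String[] args) {
        // Made-up sample line in the pipe-separated layout assumed above.
        String line = "Title | First sentence. Second sentence. | 2013-01-01 | pic.png";
        // Split on pipes, swallowing any whitespace around them.
        String[] parts = line.split("\\s*\\|\\s*");
        System.out.println(parts[0]);
        System.out.println(firstSentence(parts[1]));
    }
}
```

Using indexOf instead of a regex avoids an exception or an odd result when an article has no period at all.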
What I did was change hasMoreElements() to hasMoreTokens(). I then found the index of the first occurrence of "." and stored it in an int, and printed the substring up to that index. Here is what my code looked like:
while((s = br.readLine()) != null){
out.println("<tr>");
StringTokenizer s2 = new StringTokenizer(s, "|");
while (s2.hasMoreTokens()){
if (index == 0){
String one = s2.nextToken();
int i = one.indexOf(".");
out.println("<td>"+one.substring(0 , i)+"."+"</td>");
}

How to replace all special characters with another character in java?

I want to replace all 'special characters' with one particular character in Java.
For example, 'cash&carry' should become 'cash+carry', and 'cash$carry' should also become 'cash+carry'.
I have a sample CSV file, shown below. The CSV headers are 'What' and 'Where'.
What,Where
salon,new+york+metro
pizza,los+angeles+metro
crate&barrel,los+angeles+metro
restaurants,los+angeles+metro
gas+station,los+angeles+metro
persian+restaurant,los+angeles+metro
car+wash,los+angeles+metro
book store,los+angeles+metro
garment,los+angeles+metro
"cash,carry",los+angeles+metro
cash&carry,los+angeles+metro
cash carry,los+angeles+metro
The expected output
What,Where
salon,new+york+metro
pizza,los+angeles+metro
crate+barrel,los+angeles+metro
restaurants,los+angeles+metro
gas+station,los+angeles+metro
persian+restaurant,los+angeles+metro
car+wash,los+angeles+metro
book+store,los+angeles+metro
garment,los+angeles+metro
cash+carry,los+angeles+metro
cash+carry,los+angeles+metro
cash+carry,los+angeles+metro
The sample code is as follows
String csvfile="BidAPI.csv";
try{
// create the 'Array List'
ArrayList<String> What=new ArrayList<String>();
ArrayList<String> Where=new ArrayList<String>();
BufferedReader br=new BufferedReader(new FileReader(csvfile));
StringTokenizer st=null;
String line="";
int linenumber=0;
int columnnumber;
int free=0;
int free1=0;
while((line=br.readLine())!=null){
linenumber++;
columnnumber=0;
st=new StringTokenizer(line,",");
while(st.hasMoreTokens()){
columnnumber++;
String token=st.nextToken();
if("What".equals(token)){
free=columnnumber;
System.out.println("the value of free :"+free);
} else if("Where".equals(token)){
free1=columnnumber;
System.out.println("the value of free1 :"+free1);
}
if(linenumber>1){
if (columnnumber==free){
What.add(token);
} else if(columnnumber==free1){
Where.add(token);
}
}
}
}
// converting the 'What' Array List to array
String[] what=What.toArray(new String[What.size()]);
// converting the 'Where' Array List to array
String[] where = Where.toArray(new String[Where.size()]);
for(int i=0;i<what.length;i++){
String data = what[i].replaceAll("[^A-Za-z0-9\",]| (?!([^\"]*\"){2}[^\"]*$)", "+").replace("\"", "");
System.out.println(data);
System.out.println(where[i]);
String finaldata = data+where[i];
String json = readUrl(desturl);
br.close();
}catch(Exception e){
System.out.println("There is an error :"+e);
}
All special characters and spaces should be replaced with +, and the double quotes removed, as in the expected output.
I am using value.replaceAll("[^A-Za-z0-9 ]", "+"), but it is not working.
Erroneous output:
cash
carry"
Any help is appreciated. I am new to regex.
You need to:
replace all commas within quotes with +
replace all non-whitelisted characters with + (and you need to add commas to your whitelist)
remove double quotes
Try this:
line = line.replaceAll("[^A-Za-z0-9\",]|,(?!(([^\"]*\"){2})*[^\"]*$)", "+").replace("\"", "");
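To see what that expression does on a few of the sample rows, here is a minimal runnable sketch. The class and method names (SpecialCharReplacer, normalize) are invented for the example; the regex itself is the one from the line above.

```java
public class SpecialCharReplacer {
    // Regex from the answer above: any character outside the whitelist, or a comma
    // that is NOT followed by an even number of quotes (i.e. a comma inside quotes).
    private static final String REGEX =
            "[^A-Za-z0-9\",]|,(?!(([^\"]*\"){2})*[^\"]*$)";

    // Replaces the matched characters with + and then drops the double quotes.
    public static String normalize(String line) {
        return line.replaceAll(REGEX, "+").replace("\"", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("crate&barrel,los+angeles+metro"));
        System.out.println(normalize("book store,los+angeles+metro"));
        System.out.println(normalize("\"cash,carry\",los+angeles+metro"));
    }
}
```

The negative lookahead is what tells a field-separating comma apart from a comma inside a quoted field, so it assumes the quotes on each line are balanced.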
I think your regex is pretty close. Add an exception for commas as well, get rid of the space, and you are good.
BufferedReader r = new BufferedReader(new InputStreamReader(System.in));
String line;
while ((line = r.readLine()) != null)
{
String replaced = line.replace("\"", "");
replaced = replaced.replaceAll("[^A-Za-z0-9,]", "+");
System.out.println(replaced);
}
Of course, Strings are immutable in Java. Keep that in mind. replaceAll() returns a new String and does not modify the original instance.
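A tiny illustration of that point, as a sketch only (the class and method names are hypothetical): the original String is untouched, and only the returned value carries the replacement.

```java
public class ImmutableStringDemo {
    // Returns a NEW string; the argument itself is never modified.
    public static String replaceSpecial(String value) {
        return value.replaceAll("[^A-Za-z0-9,]", "+");
    }

    public static void main(String[] args) {
        String original = "cash&carry";
        replaceSpecial(original);                     // result discarded: no visible effect
        System.out.println(original);                 // still cash&carry
        String replaced = replaceSpecial(original);   // keep the returned value instead
        System.out.println(replaced);                 // cash+carry
    }
}
```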
You first need to find each quoted section and replace every , inside it with +. After that you can simply use replaceAll("[^A-Za-z0-9,]", "+") to replace every character that is neither alphanumeric nor a comma with +. To locate the quoted sections, your code can use the pattern
Pattern p = Pattern.compile("\"([^\"]*)\"");
together with appendReplacement and appendTail from the Matcher class, which replace each quoted section found with its new version.
So, in short, your code can look something like this:
Scanner scanner = new Scanner(new File(csvfile));
Pattern p = Pattern.compile("\"([^\"]*)\"");
StringBuffer sb = new StringBuffer();
while (scanner.hasNextLine()) {
    String line = scanner.nextLine();
    Matcher m = p.matcher(line);
    while (m.find()) { // find quoted sections
        // replace each one with its content, with `,` replaced by `+`;
        // group(1) holds the quoted part without the `"` marks.
        // quoteReplacement keeps `$` and `\` in the content from being
        // treated as group references
        m.appendReplacement(sb, Matcher.quoteReplacement(m.group(1).replace(',', '+')));
    }
    m.appendTail(sb); // also add the rest of the unmatched data to the buffer
    // now we can simply replace the special characters with +
    String result = sb.toString().replaceAll("[^A-Za-z0-9,]", "+");
    // the job is done, so let's print the result
    System.out.println(result);
    // let's not forget to reset the buffer for the next line
    sb.delete(0, sb.length());
}
Answer to the question
String csvfile="BidAPI.csv";
try{
// create the 'Array List'
ArrayList<String> What=new ArrayList<String>();
ArrayList<String> Where=new ArrayList<String>();
BufferedReader br=new BufferedReader(new FileReader(csvfile));
StringTokenizer st=null;
String line="";
int linenumber=0;
int columnnumber;
int free=0;
int free1=0;
while((line=br.readLine())!=null){
line =line.replaceAll("[^A-Za-z0-9\",]|,(?!(([^\"]*\"){2})*[^\"]*$)", "+").replace("\"", "");
linenumber++;
columnnumber=0;
st=new StringTokenizer(line,",");
while(st.hasMoreTokens()){
columnnumber++;
String token=st.nextToken();
if("What".equals(token)){
free=columnnumber;
System.out.println("the value of free :"+free);
} else if("Where".equals(token)){
free1=columnnumber;
System.out.println("the value of free1 :"+free1);
}
if(linenumber>1){
if (columnnumber==free){
What.add(token);
} else if(columnnumber==free1){
Where.add(token);
}
}
}
}
// converting the 'What' Array List to array
String[] what=What.toArray(new String[What.size()]);
// converting the 'Where' Array List to array
String[] where = Where.toArray(new String[Where.size()]);
for(int i=0;i<what.length;i++){
String data = what[i].replaceAll("[^A-Za-z0-9\",]| (?!([^\"]*\"){2}[^\"]*$)", "+").replace("\"", "");
System.out.println(data);
System.out.println(where[i]);
String finaldata = data+where[i];
String json = readUrl(desturl);
br.close();
}catch(Exception e){
System.out.println("There is an error :"+e);
}

Tokenize words ignoring hashtags with Open nlp

I'm trying to tokenize some sentences, for example:
String sentence = "The sky is blue. A cat is #blue.";
I use the following code with OpenNLP:
SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
String[] result = tokenizer.tokenize(sentence);
But I want OpenNLP to treat '#' as a letter of a word, so that '#blue' is a single token.
How can I do this?
You just have to create a new Tokenizer object (an implementation of the Tokenizer interface):
Tokenizer t = new Tokenizer() {
    @Override
    public Span[] tokenizePos(String arg0) {
        // paste the body of SimpleTokenizer.tokenizePos here
    }
    @Override
    public String[] tokenize(String arg0) {
        // paste the body of SimpleTokenizer.tokenize here
    }
};
Then copy and paste the SimpleTokenizer code into those two methods,
and treat '#' like the other alphabetic characters:
if (StringUtil.isWhitespace(c)) {
    charType = CharacterEnum.WHITESPACE;
} else if (Character.isLetter(c) || c == '#') {
    charType = CharacterEnum.ALPHABETIC;
} else if (Character.isDigit(c)) {
    charType = CharacterEnum.NUMERIC;
} else {
    charType = CharacterEnum.OTHER;
}
Maybe you are just being unlucky, try this:
public static void tokenize() throws InvalidFormatException, IOException {
InputStream is = new FileInputStream("models/en-token.bin");
TokenizerModel model = new TokenizerModel(is);
Tokenizer tokenizer = new TokenizerME(model);
String tokens[] = tokenizer.tokenize("The sky is blue. A cat is #blue. ");
for (String a : tokens)
System.out.println(a);
is.close();
}
As you can see, "#blue" is tokenized as a single token, and the intelligence of the tokenizer remains.
For this to work, you will need the "en-token.bin" model.
You could just try String[] tokens = sentence.split(" ");
split() is a method of String in Java. Passing it a space (i.e. " ") gives you all the tokens in the string that are delimited by a space.
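One thing to watch with a plain split is that sentence punctuation stays glued to the word, so the last token would be "#blue." rather than "#blue". Below is a minimal sketch that splits on whitespace and strips trailing punctuation; HashtagSplit and tokenize are hypothetical names, and the set of characters being stripped is an assumption you may want to adjust.

```java
import java.util.ArrayList;
import java.util.List;

public class HashtagSplit {
    // Hypothetical helper: splits on whitespace and strips trailing sentence
    // punctuation, so "#blue." becomes "#blue". The '#' is left alone.
    public static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String raw : sentence.trim().split("\\s+")) {
            String token = raw.replaceAll("[.,!?;:]+$", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The sky is blue. A cat is #blue."));
    }
}
```

Note that, unlike the OpenNLP tokenizers, this simply discards the punctuation instead of returning it as separate tokens.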

I want to search for a string using StringTokenizer but the string I'm looking for has a delimiter in it - Java

I have an external file named quotes.txt and I'll show you some contents of the file:
1 Everybody's always telling me one thing and out the other.
2 I love criticism just so long as it's unqualified praise.
3 The difference between 'involvement' and 'commitment' is like an eggs-and-ham
breakfast: the chicken was 'involved' - the pig was 'committed'.
I used this: StringTokenizer str = new StringTokenizer(line, " .'");
This is the code for the searching:
String line = "";
boolean wordFound = false;
while((line = bufRead.readLine()) != null) {
while(str.hasMoreTokens()) {
String next = str.nextToken();
if(next.equalsIgnoreCase(targetWord)) {
wordFound = true;
output = line;
break;
}
}
if(wordFound) break;
else output = "Quote not found";
}
Now, I want to search for the strings "Everybody's" and "it's" in lines 1 and 2, but it won't work since the apostrophe is one of the delimiters. If I remove that delimiter, then I won't be able to search for "involvement", "commitment", "involved" and "committed" in line 3.
What code would be suitable for this problem? Please help, and thanks.
I would suggest using regular expressions (the Pattern class) rather than StringTokenizer for this. For example:
final Pattern targetWordPattern =
    Pattern.compile("\\b" + Pattern.quote(targetWord) + "\\b",
                    Pattern.CASE_INSENSITIVE);
String line = "";
boolean wordFound = false;
while ((line = bufRead.readLine()) != null) {
    if (targetWordPattern.matcher(line).find()) {
        wordFound = true;
        output = line; // keep the matching line, as the question's code does
        break;
    }
    else
        output = "Quote not found";
}
Tokenize by whitespace, then trim by the ' character.
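A minimal sketch of that idea follows. The names QuoteSearch and containsWord are invented for the example, and as an assumption it trims every non-alphanumeric character from both ends of each token (not only the apostrophe), so that "committed'." also matches.

```java
public class QuoteSearch {
    // Hypothetical helper: true if line contains targetWord as a whitespace-separated
    // token, ignoring case and any punctuation at the start or end of the token.
    public static boolean containsWord(String line, String targetWord) {
        for (String raw : line.trim().split("\\s+")) {
            // Strip non-alphanumeric characters from both ends only;
            // apostrophes inside the word (as in "it's") are kept.
            String word = raw.replaceAll("^[^A-Za-z0-9]+|[^A-Za-z0-9]+$", "");
            if (word.equalsIgnoreCase(targetWord)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String line3 = "3 The difference between 'involvement' and 'commitment' is like an eggs-and-ham";
        System.out.println(containsWord(line3, "involvement"));
        System.out.println(containsWord("2 I love criticism just so long as it's unqualified praise.", "it's"));
    }
}
```

Because only the ends of each token are trimmed, the apostrophe no longer has to be a delimiter, which is what was breaking the search for "Everybody's" and "it's".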

Regarding Java String Manipulation

After the split command, the string "MO""RET" is stored in items[1]. I then call replaceAll on this string, which removes all the double quotes.
But I want it to be stored as MO"RET. How do I do that? In the CSV file that I process with the split command, double quotes within the contents of a text field are doubled (example: "This account is a ""large"" one"). So I want to keep one of the two quotes when they are repeated in the middle of the string, and drop the enclosing quotes at the ends if present. How can I do it?
String items[] = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
items[1] has "MO""RET"
String recordType = items[1].replaceAll("\"","");
After this, recordType is MORET, but I want it to be MO"RET.
Don't use regex to split a CSV line. This is asking for trouble ;) Just parse it character-by-character. Here's an example:
public static List<List<String>> parseCsv(InputStream input, char separator) throws IOException {
    BufferedReader reader = null;
    List<List<String>> csv = new ArrayList<List<String>>();
    try {
        reader = new BufferedReader(new InputStreamReader(input, "UTF-8"));
        for (String record; (record = reader.readLine()) != null;) {
            boolean quoted = false;
            StringBuilder fieldBuilder = new StringBuilder();
            List<String> fields = new ArrayList<String>();
            for (int i = 0; i < record.length(); i++) {
                char c = record.charAt(i);
                fieldBuilder.append(c);
                if (c == '"') {
                    quoted = !quoted;
                }
                if ((!quoted && c == separator) || i + 1 == record.length()) {
                    fields.add(fieldBuilder.toString().replaceAll(separator + "$", "")
                        .replaceAll("^\"|\"$", "").replace("\"\"", "\"").trim());
                    fieldBuilder = new StringBuilder();
                }
                if (c == separator && i + 1 == record.length()) {
                    fields.add("");
                }
            }
            csv.add(fields);
        }
    } finally {
        if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
    }
    return csv;
}
Yes, there is a little regex involved, but it only trims off the trailing separator and the surrounding quotes of a single field.
However, you can also grab any third-party Java CSV API.
How about:
String recordType = items[1].replaceAll( "\"\"", "\"" );
I would suggest using replace instead of replaceAll.
replaceAll treats its first argument as a regex.
The requirement is to replace two consecutive quotes with one quote:
String recordType = items[1].replace( "\"\"", "\"" );
To see the difference between replace and replaceAll, execute the code below:
recordType = items[1].replace( "$$", "$" );
recordType = items[1].replaceAll( "$$", "$" );
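For anyone who wants to run that comparison, here is a self-contained sketch with a made-up input ("a$$b") and hypothetical method names. replace treats both arguments literally, while replaceAll reads "$$" as a regex (two end-of-input anchors) and "$" as the start of a group reference in the replacement, which fails at runtime.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceVsReplaceAll {
    // Literal replacement: "$$" and "$" are plain text here.
    public static String literal(String s) {
        return s.replace("$$", "$");
    }

    // Returns true if the regex-based call fails. Recent JDKs throw
    // IllegalArgumentException for the dangling "$" in the replacement;
    // RuntimeException is caught to stay on the safe side.
    public static boolean regexVersionThrows(String s) {
        try {
            s.replaceAll("$$", "$");
            return false;
        } catch (RuntimeException e) {
            return true;
        }
    }

    // replaceAll can be made literal by quoting both arguments.
    public static String quoted(String s) {
        return s.replaceAll(Pattern.quote("$$"), Matcher.quoteReplacement("$"));
    }

    public static void main(String[] args) {
        System.out.println(literal("a$$b"));
        System.out.println(regexVersionThrows("a$$b"));
        System.out.println(quoted("a$$b"));
    }
}
```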
Alternatively, you can use regular expressions:
recordType = items[1].replaceAll( "\\B\"", "" );
recordType = recordType.replaceAll( "\"\\B", "" );
The first statement removes the quotes at the beginning of a word.
The second statement removes the quotes at the end of a word.
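Here is a small sketch that applies those two statements to the value from the question; QuoteTrimmer and unquote are hypothetical names. It also shows a limitation worth knowing about: \B only keeps a quote that directly touches a word character on the relevant side, so doubled quotes that sit next to a space are removed entirely.

```java
public class QuoteTrimmer {
    // Applies the two \B-based replacements from the answer above, in order.
    public static String unquote(String field) {
        // Remove quotes that are NOT directly preceded by a word character.
        String result = field.replaceAll("\\B\"", "");
        // Remove quotes that are NOT directly followed by a word character.
        return result.replaceAll("\"\\B", "");
    }

    public static void main(String[] args) {
        // "MO""RET" keeps the single inner quote.
        System.out.println(unquote("\"MO\"\"RET\""));
        // Doubled quotes next to a space are lost completely in this case.
        System.out.println(unquote("\"This account is a \"\"large\"\" one\""));
    }
}
```

If fields like the second one matter, stripping the enclosing quotes and then calling replace("\"\"", "\"") is the more predictable route.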
