Finding the nth occurrence of a character in a String using IndexOf() - java

I have a question regarding indexOf(). I am trying to program an EmailExtractor (Yes, this is a homework but I am not looking for code) which extracts the entire email address from a sentence that is input by a user.
For example -
User Input: Mail us at abc#def.ghi.jk with your queries.
The program will then display abc#def.ghi.jk from the above String. I understand indexOf() and substring() are required.
The idea I have now is to use indexOf() to locate the '#', and then search for the empty space just before the email address input by the user (nth).
My code is as follows:
System.out.println("This is an Email Address Extractor.\n");
System.out.print("Enter a line of text with email address: ");
String emailInput = scn.nextLine();
int spaceAt = emailInput.indexOf(" ");
for (int i = 1; i <= emailInput.indexOf("#"); i++){
if (spaceAt < emailInput.indexOf("#")) {
spaceAt = emailInput.indexOf(" ", spaceAt + 1);
}
}
I understand and am aware of the problem in my code.
"Mail us at abc#def.ghi.jk with your queries".indexOf(" ") is 4, but I am trying to get 10. However, the if condition I wrote makes the loop skip ahead to the next space, at index 25 (because 10 < 14, the index of the '#').
How do I avoid this?
Once again, I am not looking purely for the answer; I am trying to work my way towards a solution. Thanks in advance!

How about finding the space before and after the #?
int at = emailInput.indexOf('#');
int start = emailInput.lastIndexOf(' ', at) + 1;
int end = emailInput.indexOf(' ', at);
if (end == -1) end = emailInput.length();
String email = emailInput.substring(start, end);
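A minimal runnable sketch of the same idea, with a guard added for a line that contains no '#' at all. The class and method names (EmailExtractor, extract) are made up for illustration, and it assumes the address is delimited by plain spaces only.

```java
final class EmailExtractor {
    // Returns the space-delimited token around the first '#', or null if there is no '#'.
    static String extract(String line) {
        int at = line.indexOf('#');
        if (at == -1) {
            return null; // no '#', so there is nothing to extract
        }
        // lastIndexOf returns -1 when no space precedes the '#', so start becomes 0.
        int start = line.lastIndexOf(' ', at) + 1;
        int end = line.indexOf(' ', at);
        if (end == -1) {
            end = line.length(); // the address runs to the end of the line
        }
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        // prints abc#def.ghi.jk
        System.out.println(extract("Mail us at abc#def.ghi.jk with your queries."));
    }
}
```

Returning null for a missing '#' is just one choice; throwing an exception or returning an empty string would work equally well.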

You can use regular expressions instead of indexOf() and substring():
"^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$"
Use Pattern and Matcher to validate the email address in your string.
This may give you a clear view on it
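A rough sketch of how Pattern and Matcher could be used here. Note one assumption that goes beyond the answer: the ^ and $ anchors are dropped so that Matcher.find() can locate the address inside a longer sentence instead of validating the whole string. The class and method names (EmailRegex, find) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailRegex {
    // Same character classes as the expression above, but without ^ and $,
    // so the pattern can match in the middle of a sentence.
    private static final Pattern EMAIL = Pattern.compile(
            "[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})");

    // Returns the first address-like substring, or null if none is found.
    static String find(String line) {
        Matcher matcher = EMAIL.matcher(line);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(find("Mail us at abc#def.ghi.jk with your queries."));
    }
}
```

With the anchors kept, the same pattern would only be useful with matches(), that is, for checking that an already extracted token is a well-formed address.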

One thing that you could do is call indexOf and test the result in the same iteration of the loop, instead of saving the check for the next iteration. Something like this:
int indexOfAtSign = emailInput.indexOf('#');
int spaceAt = -1; // index of the last space before the '#', or -1 if there is none
while (true) { // you never really used i anyway
    int temp = emailInput.indexOf(' ', spaceAt + 1); // index of the next space
    if (temp == -1 || temp > indexOfAtSign) break; // no more spaces, or already past the '#'
    spaceAt = temp;
}
(There'd need to be a small bit of corner-case testing, but only for if there are no spaces either before the # sign or after the # sign)
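Since the title asks about the nth occurrence in general, here is a small sketch of a helper that repeatedly calls indexOf, each time starting just after the previous hit. The name nthIndexOf and the choice to return -1 for n < 1 are assumptions made for this example.

```java
final class NthIndex {
    // Returns the index of the n-th occurrence (1-based) of ch in text, or -1 if there are fewer.
    static int nthIndexOf(String text, char ch, int n) {
        int index = -1;
        for (int i = 0; i < n; i++) {
            index = text.indexOf(ch, index + 1); // continue after the previous hit
            if (index == -1) {
                return -1; // ran out of occurrences
            }
        }
        return index; // stays -1 when n < 1, because the loop never runs
    }

    public static void main(String[] args) {
        String line = "Mail us at abc#def.ghi.jk with your queries";
        System.out.println(nthIndexOf(line, ' ', 3)); // prints 10
    }
}
```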


Reading a file -- pairing a String and int value -- with multiple split lines

I am working on an exercise with the following criteria:
"The input consists of pairs of tokens where each pair begins with the type of ticket that the person bought ("coach", "firstclass", or "discount", case-sensitively) and is followed by the number of miles of the flight."
The list can be paired -- coach 1500 firstclass 2000 discount 900 coach 3500 -- and this currently works great. However, when the String and int value are split like so:
firstclass 5000 coach 1500 coach
100 firstclass
2000 discount 300
it breaks entirely. I am almost certain that it has something to do with me using this format (not full)
while(fileScanner.hasNextLine())
{
StringTokenizer token = new StringTokenizer(fileScanner.nextLine(), " ");
while(token.hasMoreTokens())
{
String ticketClass = token.nextToken().toLowerCase();
int count = Integer.parseInt(token.nextToken());
...
}
}
because it will always read the first value as a String and the second value as an integer. I am very lost on how to keep track of one or the other while going to read the next line. Any help is truly appreciated.
Similar (I think) problems:
Efficient reading/writing of key/value pairs to file in Java
Java-Read pairs of large numbers from file and represent them with linked list, get the sum and product of each pair
Reading multiple values in multiple lines from file (Java)
If you can afford to read the text file in all at once as a very long String, simply use the built-in String.split() with the regex \\s+, like so
String[] tokens = fileAsString.split("\\s+");
This will split the input file into tokens, assuming the tokens are separated by one or more whitespace characters (a whitespace character covers newline, space, tab, and carriage return). Even and odd tokens are ticket types and mile counts, respectively.
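As a sketch of what that could look like end to end, the following pairs each even-indexed token with the token after it and sums the miles per ticket type. The class and method names (TicketPairs, milesByType) are invented for the example, and the input is trimmed first because leading whitespace would otherwise produce an empty first token.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TicketPairs {
    // Sums the miles per ticket type from whitespace-separated "type miles" pairs.
    static Map<String, Integer> milesByType(String fileAsString) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        String trimmed = fileAsString.trim();
        if (trimmed.isEmpty()) {
            return totals;
        }
        String[] tokens = trimmed.split("\\s+");
        // Even indexes hold ticket types; the following odd index holds the mile count.
        for (int i = 0; i + 1 < tokens.length; i += 2) {
            totals.merge(tokens[i], Integer.parseInt(tokens[i + 1]), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String input = "firstclass 5000 coach 1500 coach\n100 firstclass\n2000 discount 300";
        // prints {firstclass=7000, coach=1600, discount=300}
        System.out.println(milesByType(input));
    }
}
```

Because the line breaks are just more whitespace to split(), a pair that is split across two lines needs no special handling.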
If you absolutely have to read line by line and use StringTokenizer, one solution is to count the number of tokens in the previous line. If that number is odd, the first token in the current line is of a different type from the first token in the previous line. Once you know the starting type of the current line, simply alternate types from there.
int tokenCount = 0;
boolean startingType = true; // true for String, false for integer
boolean currentType;
while(fileScanner.hasNextLine())
{
StringTokenizer token = new StringTokenizer(fileScanner.nextLine(), " ");
startingType = startingType ^ (tokenCount % 2 == 1); // if tokenCount is odd, the XOR ^ operator will flip the starting type of this line
tokenCount = 0;
while(token.hasMoreTokens())
{
tokenCount++;
currentType = startingType ^ (tokenCount % 2 == 0); // alternating between types in current line
if (currentType) {
String ticketClass = token.nextToken().toLowerCase();
// do something with ticketClass here
} else {
int mileCount = Integer.parseInt(token.nextToken());
// do something with mileCount here
}
...
}
}
I found another way to do this problem without using either the StringTokenizer or the regex...admittedly I had trouble with the regular expressions haha.
I declare these outside of the try-catch block because I want to use them in both my finally statement and return the points:
int points = 0;
ArrayList<String> classNames = new ArrayList<>();
ArrayList<Integer> classTickets = new ArrayList<>();
Then inside my try block, I declare the index variable, because I won't need it outside of this block. That variable increases each time a new element is read. Elements at even indexes (0, 2, ...) are read as ticket classes and elements at odd indexes are read as mile counts:
try
{
int index = 0;
// read till the file is empty
while(fileScanner.hasNext())
{
// first entry is the ticket type
if(index % 2 == 0)
classNames.add(fileScanner.next());
// second entry is the number of miles
else
classTickets.add(Integer.parseInt(fileScanner.next()));
index++;
}
}
You can either catch it here like this or declare throws NoSuchElementException on your method, as long as you catch it where the method is called:
catch(NoSuchElementException noElement)
{
System.out.println("<###-NoSuchElementException-###>");
}
Then down here, loop through the number of elements. See which flight class it is and multiply the ticket count respectively and return the points outside of the block:
finally
{
for(int i = 0; i < classNames.size(); i++)
{
switch(classNames.get(i).toLowerCase())
{
case "firstclass": // 2 points for first
points += 2 * classTickets.get(i);
break;
case "coach": // 1 point for coach
points += classTickets.get(i);
break;
default:
// discount gets nothing
}
}
}
return points;
The regex seems like the most convenient way, but this was more intuitive to me for some reason. Either way, I hope the variety will help out.
simply use the built-in String.split() - #bui
I was finally able to wrap my head around regular expressions, but \s+ was not being recognized for some reason. It kept giving me this error message:
Invalid escape sequence (valid ones are \b \t \n \f \r \" \' \\ ) Java(1610612990)
So when I went through with those characters instead, I was able to write this:
int points = 0, multiplier = 0, tracker = 0;
while(fileScanner.hasNext())
{
String read = fileScanner.next().split(
"[\b \t \n \f \r \" \' \\ ]")[0];
if(tracker % 2 == 0)
{
if(read.toLowerCase().equals("firstclass"))
multiplier = 2;
else if(read.toLowerCase().equals("coach"))
multiplier = 1;
else
multiplier = 0;
}else
{
points += multiplier * Integer.parseInt(read);
}
tracker++;
}
This code goes one entry at a time instead of reading a whole array void of whitespace as a work-around for that error message I was getting. If you could show me what the code would look like with String[] tokens = fileAsString.split("\s+"); instead I would really appreciate it :)
you need to add another "\" before "\s" to escape the slash before "s" itself – #bui

Why I cannot get the string without tokens with the program I have written?

Scanner scan = new Scanner(System.in);
String s = scan.nextLine();
Queue q=new LinkedList();
for(int i=0;i<s.length();i++){
int x=(int)s.charAt(i);
if(x<65 || (x>90 && x<97) || x>122) {
q.add(s.charAt(i));
}
}
System.out.println(q.peek());
String redex="";
while(!q.isEmpty()) {
redex+=q.remove();
}
String[] x=s.split(redex,-1);
for(String y:x) {
if(y!=null)
System.out.println(y);
}
scan.close();
I am trying to print the string "my name is NLP and I, so, works:fine;"yes"." without tokens such as {[]}+-_)*&%$, but it just prints the whole String as it is, and I don't understand why.
This is 3 answers in one:
For your initial problem
For a solution without regex
For a correct use of Scanner (this is up to you).
First
When you build a regex from arbitrary characters, you should quote it:
String[] x=s.split(Pattern.quote(redex),-1);
That would be the usual problem, but the second problem is that you are building a regexp character class while omitting the [] that makes it one, so it cannot work as is:
String[] x=s.split("[" + Pattern.quote(redex) + "]",-1);
This one may work, but may fail if Pattern.quote doesn't quote - and a - is found between two characters, making a range such as $-!.
That would mean: any character in the range from $ to !. It may fail if the range is invalid, and my example may be invalid ($ may come after !).
Finally, you may use:
// assumes q is declared as Queue<Character> rather than as a raw Queue
String redex = q.stream()
    .map(c -> Pattern.quote(String.valueOf(c)))
    .collect(Collectors.joining("|"));
This regexp should match the unwanted character.
Second:
For the rest, the other answer points out another problem: you are not using the Character.isXXX methods to check for valid characters.
Firstly, be wary that some of these methods take code points rather than char. For example, isAlphabetic takes a code point. A code point is simply the numeric value of a Unicode character; some Unicode characters take two char values (a surrogate pair) in a Java String.
Secondly, I think your problem lies in the fact that you are not using the right tool to split your words.
In pseudo code, this should be:
List<String> words = new ArrayList<>();
int offset = 0;
for (int i = 0, n = line.length(); i < n; ++i) {
// if the character fails to match, then we switched from word to non-word
if (!Character.isLetterOrDigit(line.charAt(i))) {
if (offset != i) {
words.add(line.substring(offset, i));
}
offset = i + 1; // next char
}
}
if (offset != line.length()) {
words.add(line.substring(offset));
}
This would:
- Find transition from word to non word and change offset (where we started)
- Add word to the list
- Add the last token as ending word.
Last
Alternatively, you may also play with Scanner class since it allows you to input a custom delimiter for its hasNext(): https://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html
I quote the class javadoc:
The scanner can also use delimiters other than whitespace. This
example reads several items in from a string:
String input = "1 fish 2 fish red fish blue fish";
Scanner s = new Scanner(input).useDelimiter("\\s*fish\\s*");
System.out.println(s.nextInt());
System.out.println(s.nextInt());
System.out.println(s.next());
System.out.println(s.next());
s.close();
As you guessed, you may pass on any delimiter and then use hasNext() and next() to get only valid words.
For example, using [^a-zA-Z0-9] would split on each non alpha/digit transition.
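A short sketch of that Scanner approach applied to the sentence from the question. The class and method names (WordScanner, words) are made up, and the + after the character class is an addition of mine: it makes a run of several non-alphanumeric characters count as one delimiter, so no empty tokens come back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class WordScanner {
    // Collects the runs of letters and digits in line, skipping everything else.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        // Any run of non-alphanumeric characters acts as a single delimiter.
        try (Scanner scanner = new Scanner(line).useDelimiter("[^a-zA-Z0-9]+")) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [my, name, is, NLP, and, I, so, works, fine, yes]
        System.out.println(words("my name is NLP and I, so, works:fine;\"yes\"."));
    }
}
```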
As noted in the comment, the condition x<65 will catch all sorts of special characters you're not interested in. Using Character's built-in methods will help you write this condition in a clearer, bug-free way:
char x = s.charAt(i);
if (Character.isLetter(x) || Character.isWhitespace(x)) {
q.add(x);
}

Java: Removing duplicate words & substrings of words in java

Recently I have come up against a question in school which I am not able to tackle.
I need to remove duplicate words in an input string which consists of words. The main issue here is that the requirement states that I cannot use arrays or regular expressions.
E.g.
userInput = "this is a test testing is fun really fun"
the first "is" is a duplicate of "this" as it is a substring
the second "is" is a duplicate of the first "is"
"testing" is not a duplicate of "test" as it is not an exact match
therefore the output comes out as - "this a test testing fun really"
How would one actually achieve this without using arrays or regular expressions? As far as I can tell, it is impossible to split the words up by white space and dynamically create a String in Java without them.
I didn't compile this code, but I think it should work.
Let me know if it helps you solve your problem.
public String solve(String input) {
String ret = "";
int pos = 0;
while(pos<input.length()) {
// find next position of space
int next = input.indexOf(' ',pos);
// space not exists, skip next to end of string
if(next==-1) next = input.length();
// take 1 word from input
String word = input.substring(pos,next);
// check if word exists in previous result
if(ret.indexOf(word)==-1) {
if(ret.length() > 0) ret += " ";
// append word to ret
ret += word;
}
pos = next + 1;
}
return ret;
}

Regular expression for validating an answer to a question

Hey everyone,
I'm having a minor difficulty setting up a regular expression that checks a sentence entered by a user in a textbox against one or more keywords. Essentially, the keywords have to appear in order, one after the other, and can have any number of characters or spaces before, between, and after them (i.e. if the keywords are "crow" and "feet", crow must be somewhere in the sentence before feet, so this statement should be valid: "blah blah sccui crow dsj feet "). The extra characters and, to some extent, the spaces are completely optional (I would like the keywords to have at least one space as a buffer at the beginning and end); the main concern is whether the keywords were entered in their proper order.
So far, my regular expression works when the answer is inside a sentence, but fails when only the answer itself is entered.
I have the regular expression used in the function below:
// Comparing an answer with the right solution
protected boolean checkAnswer(String a, String s) {
boolean result = false;
//Used to determine if the solution is more than one word
String temp[] = s.split(" ");
//If only one word or letter
if(temp.length == 1)
{
if (s.length() == 1) {
// check multiple choice questions
if (a.equalsIgnoreCase(s)) result = true;
else result = false;
}
else {
// check short answer questions
if ((a.toLowerCase()).matches(".*?\\s*?" + s.toLowerCase() + "\\s*?.*?")) result = true;
else result = false;
}
}
else
{
int count = temp.length;
//Regular expression used to
String regex=".*?\\s*?";
for(int i = 0; i<count;i++)
regex+=temp[i].toLowerCase()+"\\s*?.*?";
//regex+=".*?";
System.out.println(regex);
if ((a.toLowerCase()).matches(regex)) result = true;
else result = false;
}
return result;
}
Any help would greatly be appreciated.
Thanks.
I would go about this in a different way. Instead of trying to use one regular expression, why not use something similar to:
String answer = ... // get the user's answer
if( answer.indexOf("crow") < answer.indexOf("feet") ) {
// "correct" answer
}
You'll still need to tokenize the words in the correct answer, then check in a loop to see if the index of each word is less than the index of the following word.
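One caveat with the bare comparison: indexOf returns -1 for a missing word, so a sentence that contains "feet" but not "crow" would also pass. A hedged sketch of the loop described above, which rejects missing keywords and searches for each keyword only after the end of the previous one; the names KeywordOrder and inOrder are hypothetical.

```java
final class KeywordOrder {
    // True if every keyword occurs in answer, each one starting after the previous one ends.
    static boolean inOrder(String answer, String... keywords) {
        String lower = answer.toLowerCase();
        int from = 0;
        for (String keyword : keywords) {
            int found = lower.indexOf(keyword.toLowerCase(), from);
            if (found == -1) {
                return false; // keyword missing, or only present before the previous keyword
            }
            from = found + keyword.length();
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(inOrder("blah blah sccui crow dsj feet ", "crow", "feet")); // prints true
        System.out.println(inOrder("only feet here", "crow", "feet"));                // prints false
    }
}
```

This is a plain substring check, so it does not enforce the space buffer around the keywords that the question mentions.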
I don't think you need to split the result on " ".
If I understand correctly, you should be able to do something like
String regex="^.*crow.*\\s+.*feet.*";
The problem with the above is that it will match "feetcrow feetcrow".
Maybe something like
String regex="^.*\\s+crow.*\\s+feet\\s+.*";
That will enforce that the word is there as opposed to just in a random block of characters.
Depending on the complexity Bill's answer might be the fastest solution. If you'd prefer a regular expression, I wouldn't look for any spaces, but word boundaries instead. That way you won't have to handle commas, dots, etc. as well:
String regex = "\\bcrow(?:\\b.*\\b)?feet\\b";
This should match "crow bla feet" as well as "crowfeet" and "crow, feet".
Having to match multiple words in a specific order you could just join them together using '(?:\b.*\b)?' without requiring any additional sorting or checking.
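A possible sketch of that joining step for an arbitrary list of keywords. The helper names (KeywordRegex, build, containsInOrder) are invented here; Pattern.quote and the case-insensitive flag are extra precautions of mine, and the pattern is applied with find() rather than matches(), since it is not written to cover the whole sentence.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class KeywordRegex {
    // Builds \bkey1(?:\b.*\b)?key2...\b from the given keywords.
    static Pattern build(String... keywords) {
        String body = Arrays.stream(keywords)
                .map(Pattern::quote) // treat each keyword as literal text
                .collect(Collectors.joining("(?:\\b.*\\b)?"));
        return Pattern.compile("\\b" + body + "\\b", Pattern.CASE_INSENSITIVE);
    }

    // True if the keywords appear in answer in the given order.
    static boolean containsInOrder(String answer, String... keywords) {
        return build(keywords).matcher(answer).find();
    }

    public static void main(String[] args) {
        System.out.println(containsInOrder("crow, feet", "crow", "feet"));        // prints true
        System.out.println(containsInOrder("feetcrow feetcrow", "crow", "feet")); // prints false
    }
}
```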
Following Bill's answer, I'd try this:
// requires java.util.Arrays and java.util.List
String input = // get user input
String key1 = "crow";
String key2 = "feet";
String[] tokens = input.split(" ");
List<String> list = Arrays.asList(tokens);
int first = list.indexOf(key1);
// indexOf returns -1 for a missing word, so make sure key1 was actually found
return first != -1 && first < list.indexOf(key2);

Algorithm to detect how many words typed, also multi sentence support (Java)

Problem:
I have to design an algorithm, which does the following for me:
Say that I have a line (e.g.)
alert tcp 192.168.1.1 (caret is currently here)
The algorithm should process this line, and return a value of 4.
I coded something for it, I know it's sloppy, but it works, partly.
private int counter = 0;
public void determineRuleActionRegion(String str, int index) {
if (str.length() == 0 || str.indexOf(" ") == -1) {
triggerSuggestionList(1);
return;
}
//remove duplicate space, spaces in front and back before searching
int num = str.trim().replaceAll(" +", " ").indexOf(" ", index);
//Check for occurrences of spaces, recursively
if (num == -1) { //if there is no space
//no need to check if it's 0 times it will assign to 1
triggerSuggestionList(counter + 1);
counter = 0;
return; //set to rule action
} else { //there is a space
counter++;
determineRuleActionRegion(str, num + 1);
}
} //end of determineactionRegion()
So basically I look for the spaces and determine the region (number of words typed). However, I want the result to update when the user presses the space bar.
How can I get there from the current code?
Or better yet, how would one suggest I do it the correct way? I'm looking into BreakIterator for this case...
In addition, I believe my algorithm won't work for multiple sentences. How should I address this problem as well?
--
The source of String str is acquired from textPane.getText(0, pos + 1);, the JTextPane.
Thanks in advance. Do let me know if my question is still not specific enough.
--
More examples:
alert tcp $EXTERNAL_NET any -> $HOME_NET 22 <caret>
return -1 (maximum of the typed text is 7 words)
alert tcp 192.168.1.1 any<caret>
return 4 (as it is still at 4th arg)
alert tcp<caret>
return 2 (as it is still at 2nd arg)
alert tcp <caret>
return 3
alert tcp $EXTERNAL_NET any -> <caret>
return 6
It is something like shell commands. As above. Though I think it does not differ much I believe, I just want to know how many arguments are typed. Thanks.
--
Pseudocode
Get whole paragraph from textpane
if more than 1 line -> process the last line
count how many arguments typed and return appropriate number
else
process current line
count how many arguments typed and return appropriate number
End
This uses String.split; I think this is what you want.
String[] texts = {
"alert tcp $EXTERNAL_NET any -> $HOME_NET 22 ",
"alert tcp 192.168.1.1 any",
"alert tcp",
"alert tcp ",
"alert tcp $EXTERNAL_NET any -> ",
"multine\ntest\ntest 1 2 3",
};
for (String text : texts) {
String[] lines = text.split("\r?\n|\r");
String lastLine = lines[lines.length - 1];
String[] tokens = lastLine.split("\\s+", -1);
for (String token : tokens) {
System.out.print("[" + token + "]");
}
int pos = (tokens.length <= 7) ? tokens.length : -1;
System.out.println(" = " + pos);
}
This produces the following output:
[alert][tcp][$EXTERNAL_NET][any][->][$HOME_NET][22][] = -1
[alert][tcp][192.168.1.1][any] = 4
[alert][tcp] = 2
[alert][tcp][] = 3
[alert][tcp][$EXTERNAL_NET][any][->][] = 6
[test][1][2][3] = 4
The code provided by polygenelubricants and by helios works, to a certain extent. Both address the problem I stated, but not with multiple lines. helios's code is more straightforward.
However, neither handles the case where you press enter in the JTextPane: it still returns the old count instead of 1, as split() treats the text as one sentence instead of two.
E.g. alert tcp <enter is pressed>
By rights it should return 1, since it is a new sentence. It returned 2 for both algorithms.
Also, if I select everything and delete it, both algorithms throw a NullPointerException, as there is no string to be split.
I added one line, and it solved the problems mentioned above:
public void determineRuleActionRegion(String str) {
//remove repetitive spaces and concat $ for new line indicator
str = str.trim().replaceAll(" +", " ") + "$";
String[] lines = str.split("\r?\n|\r");
String lastLine = lines[lines.length - 1];
String[] tokens = lastLine.split("\\s+", -1);
int pos = (tokens.length <= 7) ? tokens.length : -1;
triggerSuggestionList(pos);
System.out.println("Current pos: " + pos);
return;
} //end of determineactionRegion()
With that, when split() parses the str, the "$" will create another line, which will be the last line regardless, and the count now will return to one. Also, there will not be NullPointerException as the "$" is always there.
However, without the help of polygenelubricants and helios, I don't think I will be able to figure it out so soon. Thanks guys!
EDIT: Okay... apparently split("\r?\n|\r",-1) works the same. Question is should I accept polygenelubricants or my own? Hmm.
2nd EDIT: One bad thing about concatenating '$' to the end of str: lastLine.endsWith(" ") will always return false. So you have to use split("\r?\n|\r",-1) together with lastLine.endsWith(" ") for the complete solution.
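A sketch of how the split("\r?\n|\r", -1) variant from the edits could look on its own, without the appended "$". The names ArgumentPosition and position are hypothetical, the limit of 7 is taken from the code above, and the last line is assumed to have no leading whitespace (which would add an empty first token).

```java
final class ArgumentPosition {
    // Returns which argument the caret is on in the last line, or -1 past the 7th.
    static int position(String text) {
        // The -1 limit keeps trailing empty strings, so a trailing newline yields an empty last line.
        String[] lines = text.split("\r?\n|\r", -1);
        String lastLine = lines[lines.length - 1];
        // Again -1: a trailing space yields an empty last token, meaning a new argument has started.
        String[] tokens = lastLine.split("\\s+", -1);
        return tokens.length <= 7 ? tokens.length : -1;
    }

    public static void main(String[] args) {
        System.out.println(position("alert tcp "));  // prints 3
        System.out.println(position("alert tcp\n")); // prints 1
    }
}
```

An empty input gives 1 as well, because splitting an empty string returns an array containing one empty string.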
What about this: get last line, count what's between spaces...
String text = ...
String[] lines = text.split("\n"); // or \r\n depending on how you get the string
String lastLine = lines[lines.length-1];
StringTokenizer tokenizer = new StringTokenizer(lastLine, " ");
// note that StringTokenizer ignores empty tokens, that is, whatever lies between two consecutive spaces
int count = 0;
while (tokenizer.hasMoreTokens()) {
tokenizer.nextToken();
count++;
}
return count;
Edit: you could check whether there is a trailing space (lastLine.endsWith(" ")), meaning a new word is being started, or whatever. It's a basic approach for you to build on :)
Is the sample line representative? An editor for some rule based language (ACLs)?
How about going for a full Information Extraction/named entity recognition solution, the one that will be able to recognize entities (keywords, ip addresses, etc)? You don't have to write everything from scratch, there're existing tools and libraries.
UPDATE: Here's a piece of Snort code that I believe does the parsing:
Function ParseRule()
if (*args == '(') {
// "Preprocessor Rule detected"
} else {
/* proto ip port dir ip port r*/
toks = mSplit(args, " \t", 7, &num_toks, '\\');
/* A rule might not have rule options */
if (num_toks < 6) {
ParseError("Bad rule in rules file: %s", args);
}
..
}
otn = ParseRuleOptions(sc, rtn, roptions, rule_type, protocol);
..
mSplit is defined in mstring.c, a function to split a string into tokens.
In your case, ParseRuleOptions should return one for the whole string inside brackets I guess.
UPDATE 2: btw, is your first example correct, since in snort, you can add options to rules? For example this is a valid rule being written (options section not completed):
alert tcp any any -> 192.168.1.0/24 111 (content:"|00 01 86 a5|"; <caret>
In some cases you can have either 6 or 7 'words', so your algorithm should have a bit more knowledge, right?
