Java: Checking each space in a String

Java: Checking each space in a String - java

I'm sure this is fairly simple, however I've tried googling the question but can't find an answer that fits my problem.
I'm playing around with string manipulation and one of the things I'm trying to do is get the first letter of each word. (And then place them all into a string)
I'm having trouble with registering each 'space' so that my If statement will be triggered. Here's what I have so far.
while (scanText.hasNext()) {
boolean isSpace = false;
if (scanText.hasNext(" ")) {isSpace = true;}
String s = scanText.next();
if (isSpace) {firstLetters += s + " ";}
}
Also, if there is a much better way to do this then please let me know

You can also split the original text by spaces, and collect the words.
String input = " Hello world aaa ";
String[] split = input.trim().split("\\s+"); // all types of whitespace; " +" to pick spaces only
// operate on "split" array containing words now: [Hello, world, aaa]
However using regexps here might be overkill.

Assuming that scanText is a Scanner object, you could use something like stated on the documentation:
Scanner s = new Scanner(input).useDelimiter("\\s+"); //regex for spaces
https://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html

Related

How can i extract specific terms from string lines in Java?

I have a serious problem with extracting terms from each string line. To be more specific, I have one csv formatted file which is actually not csv format (it saves all terms into line[0] only)
So, here's just example string line among thousands of string lines:
(split() doesn't work.!!! )
test.csv
"31451 CID005319044 　　15939353　　 C8H14O3S2 　　　beta-lipoic acid　　 C1C[S#](=O)S[C##H]1CCCCC(=O)O "
"12232 COD05374044 23439353　　C924O3S2 　　　saponin　　 CCCC(=O)O "
"9048 　 CTD042032　23241　　C3HO4O3S2　Berberine　 [C##H]1CCCCC(=O)O "
I want to extract "beta-lipoic acid" ,"saponin" and "Berberine" only which is located in 5th position.
You can see there are big spaces between terms, so that's why I said 5th position.
In this case, how can I extract terms located in 5th position for each line?
One more thing: the length of whitespace between each of the six terms is not always equal. the length could be one, two, three, four, or five, or something like that.
Because the length of whitespace is random, I can not use the .split() function.
For example, in the first line I would get "beta-lipoic" instead "beta-lipoic acid.**

Here is a solution for your problem using the string split and index of,
import java.util.ArrayList;
public class StringSplit {
public static void main(String[] args) {
String[] seperatedStr = null;
int fourthStrIndex = 0;
String modifiedStr = null, finalStr = null;
ArrayList<String> strList = new ArrayList<String>();
strList.add("31451 CID005319044 　　15939353　　 C8H14O3S2 beta-lipoic acid C1C[S#](=O)S[C##H]1CCCCC(=O)O ");
strList.add("12232 COD05374044 23439353 C924O3S2 saponin CCCC(=O)O ");
strList.add("9048 CTD042032 23241 C3HO4O3S2 Berberine [C##H]1CCCCC(=O)O ");
for (String item: strList) {
seperatedStr = item.split("\\s+");
fourthStrIndex = item.indexOf(seperatedStr[3]) + seperatedStr[3].length();
modifiedStr = item.substring(fourthStrIndex, item.length());
finalStr = modifiedStr.substring(0, modifiedStr.indexOf(seperatedStr[seperatedStr.length - 1]));
System.out.println(finalStr.trim());
}
}
}
Output:
beta-lipoic acid
saponin
Berberine

Option 1 : Use spring.split and check for multiple consecutive spaces. Like the code below:
String s[] = str.split("\\s\\s+");
for (String string : s) {
System.out.println(string);
}
Option 2 : Implement your own string split logic by browsing through all the characters. Sample code below (This code is just to give an idea. I didnot test this code.)
public static List<String> getData(String str) {
List<String> list = new ArrayList<>();
String s="";
int count=0;
for(char c : str.toCharArray()){
System.out.println(c);
if (c==' '){
count++;
}else {
s = s+c;
}
if(count>1&&!s.equalsIgnoreCase("")){
list.add(s);
count=0;
s="";
}
}
return list;
}

This would be a relatively easy fix if it weren't for beta-lipoic acid...
Assuming that only spaces/tabs/other whitespace separate terms, you could split on whitespace.
Pattern whitespace = Pattern.compile("\\s+");
String[] terms = whitespace.split(line); // Not 100% sure of syntax here...
// Your desired term should be index 4 of the terms array
While this would work for the majority of your terms, this would also result in you losing the "acid" in "beta-lipoic acid"...
Another hacky solution would be to add in a check for the 6th spot in the array produced by the above code and see if it matches English letters. If so, you can be reasonably confident that the 6th spot is actually part of the same term as the 5th spot, so you can then concatenate those together. This falls apart pretty quickly though if you have terms with >= 3 words. So something like
Pattern possibleEnglishWord = Pattern.compile([[a-zA-Z]*); // Can add dashes and such as needed
if (possibleEnglishWord.matches(line[5])) {
// return line[4].append(line[5]) or something like that
}
Another thing you can try is to replace all groups of spaces with a single space, and then remove everything that isn't made up of just english letters/dashes
line = whitespace.matcher(line).replaceAll("");
Pattern notEnglishWord = Pattern.compile("^[a-zA-Z]*"); // The syntax on this is almost certainly wrong
notEnglishWord.matcher(line).replaceAll("");
Then hopefully the only thing that is left would be the term you're looking for.
Hopefully this helps, but I do admit it's rather convoluted. One of the issues is that it appears that non-term words may have only one space between them, which would fool Option 1 as presented by Hirak... If that weren't the case that option should work.
Oh by the way, if you do end up doing this, put the Pattern declarations outside of any loops. They only need to be created once.

Issue in Combining splitted String

I have extracted text from "web 2.0 wikipedia" article, and splitted it into "sentences". After that, I am going to create "Strings" which each string containing 5 sentences.
When extracted, the text looks like below, in EditText
Below is my code
finalText = textField.getText().toString();
String[] textArrayWithFullStop = finalText.split("\\. ");
String colelctionOfFiveSentences = "";
List<String>textCollection = new ArrayList<String>();
for(int i=0;i<textArrayWithFullStop.length;i++)
{
colelctionOfFiveSentences = colelctionOfFiveSentences + textArrayWithFullStop[i];
if( (i%5==0) )
{
textCollection.add(colelctionOfFiveSentences);
colelctionOfFiveSentences = "";
}
}
But, when I use the Toast to display the text, here what is gives
Toast.makeText(Talk.this, textCollection.get(0), Toast.LENGTH_LONG).show();
As you can see, this is only one sentence! But I expected it to have 5 sentences!
And the other thing is, the second sentence is starting from somewhere else. Here how I have extracted it into Toast
Toast.makeText(Talk.this, textCollection.get(1), Toast.LENGTH_LONG).show();
This make no sense to me! How can I properly split the text into sentences and, create Strings containing 5 sentences each?

The problem is that for the first sentence, 0 % 5 = 0, so it is being added to the array list immediately. You should use another counter instead of mod.
finalText = textField.getText().toString();
String[] textArrayWithFullStop = finalText.split("\\. ");
String colelctionOfFiveSentences = "";
int sentenceAdded = 0;
List<String>textCollection = new ArrayList<String>();
for(int i=0;i<textArrayWithFullStop.length;i++)
{
colelctionOfFiveSentences += textArrayWithFullStop[i] + ". ";
sentenceAdded++;
if(sentenceAdded == 5)
{
textCollection.add(colelctionOfFiveSentences);
colelctionOfFiveSentences = "";
sentenceAdded = 0;
}
}

add ". " to textArrayWithFullStop[i]
colelctionOfFiveSentences = colelctionOfFiveSentences + textArrayWithFullStop[i]+". ";

I believe that if you modify the mod line to this:
if(i%5==4)
you will have what you need.
You probably realize this, but there are other reasons why someone might use a ". ", that doesn't actually end a sentence, for instance
I spoke to John and he said... "I went to the store.
Then I went to the Tennis courts.",
and I don't believe he was telling the truth because
1. Why would someone go to play tennis after going to the store and
2. John has no legs!
I had to ask, am I going to let him get away with these lies?
That's two sentences that don't end with a period and would mislead your code into thinking it's 5 sentences broken up at entirely the wrong places, so this approach is really fraught with problems. However, as an exercise in splitting strings, I guess it's as good as any other.

As a side problem(splitting sentences) solution I would suggest to start with this regexp
string.split(".(\\[[0-9\\[\\]]+\\])? ")
And for main problem may be you could use copyOfRange()

closest thing to NSScanner in Java

I'm moving some code from objective-c to java. The project is an XML/HTML Parser. In objective c I pretty much only use the scanUpToString("mystring"); method.
I looked at the Java Scanner class, but it breaks everything into tokens. I don't want that. I just want to be able to scan up to occurrences of substrings and keep track of the scanners current location in the overall string.
Any help would be great thanks!
EDIT
to be more specific. I don't want Scanner to tokenize.
String test = "<title balh> blah <title> blah>";
Scanner feedScanner = new Scanner(test);
String title = "<title";
String a = feedScanner.next(title);
String b = feedScanner.next(title);
In the above code I'd like feedScanner.next(title); to scan up to the end of the next occurrence of "<title"
What actually happens is the first time feeScanner.next is called it works since the default delimiter is whitespace, however, the second time it is called it fails (for my purposes).

You can achieve this with String class (Java.lang.String).
First get the first index of your substring.
int first_occurence= string.indexOf(substring);
Then iterate over entire string and get the next value of substrings
int next_index=indexOf( str,fromIndex);
If you want to save the values, add them to the wrapper class and the add to a arraylist object.

This really is easier by just using String's methodsdirectly:
String test = "<title balh> blah <title> blah>";
String target = "<title";
int index = 0;
index = test.indexOf( target, index ) + target.length();
// Index is now 6 (the space b/w "<title" and "blah"
index = test.indexOf( target, index ) + target.length();
// Index is now at the ">" in "<title> blah"
Depending on what you want to actually do besides walk through the string, different approaches might be better/worse. E.g. if you want to get the blah> blah string between the <title's, a Scanner is convenient:
String test = "<title balh> blah <title> blah>";
Scanner scan = new Scanner(test);
scan.useDelimiter("<title");
String stuff = scan.next(); // gets " blah> blah ";

Maybe String.split is something for you?
s = "The almighty String is mystring is your String is our mystring-object - isn't it?";
parts = s.split ("mystring");
Result:
Array("The almighty String is ", " is your String is our ", -object - isn't it?)
You know that in between your "mystring" must be. I'm not sure for start and end, so maybe you need some s.startsWith ("mystring") / s.endsWith.

Regular expression for validating an answer to a question

Hey everyone,
I'm having a minor difficulty setting up a regular expression that evaluates a sentence entered by a user in a textbox to keyword(s). Essentially, the keywords have to be entered consecutive from one to the other and can have any number of characters or spaces before, between, and after (ie. if the keywords are "crow" and "feet", crow must be somewhere in the sentence before feet. So with that in mind, this statement should be valid "blah blah sccui crow dsj feet "). The characters and to some extent, the spaces (i would like the keywords to have at least one space buffer in the beginning and end) are completely optional, the main concern is whether the keywords were entered in their proper order.
So far, I was able to have my regular expression work in a sentence but failed to work if the answer itself was entered only.
I have the regular expression used in the function below:
// Comparing an answer with the right solution
protected boolean checkAnswer(String a, String s) {
boolean result = false;
//Used to determine if the solution is more than one word
String temp[] = s.split(" ");
//If only one word or letter
if(temp.length == 1)
{
if (s.length() == 1) {
// check multiple choice questions
if (a.equalsIgnoreCase(s)) result = true;
else result = false;
}
else {
// check short answer questions
if ((a.toLowerCase()).matches(".*?\\s*?" + s.toLowerCase() + "\\s*?.*?")) result = true;
else result = false;
}
}
else
{
int count = temp.length;
//Regular expression used to
String regex=".*?\\s*?";
for(int i = 0; i<count;i++)
regex+=temp[i].toLowerCase()+"\\s*?.*?";
//regex+=".*?";
System.out.println(regex);
if ((a.toLowerCase()).matches(regex)) result = true;
else result = false;
}
return result;
Any help would greatly be appreciated.
Thanks.

I would go about this in a different way. Instead of trying to use one regular expression, why not use something similar to:
String answer = ... // get the user's answer
if( answer.indexOf("crow") < answer.indexOf("feet") ) {
// "correct" answer
}
You'll still need to tokenize the words in the correct answer, then check in a loop to see if the index of each word is less than the index of the following word.

I don't think you need to split the result on " ".
If I understand correctly, you should be able to do something like
String regex="^.*crow.*\\s+.*feet.*"
The problem with the above is that it will match "feetcrow feetcrow".
Maybe something like
String regex="^.*\\s+crow.*\\s+feet\\s+.*"
That will enforce that the word is there as opposed to just in a random block of characters.

Depending on the complexity Bill's answer might be the fastest solution. If you'd prefer a regular expression, I wouldn't look for any spaces, but word boundaries instead. That way you won't have to handle commas, dots, etc. as well:
String regex = "\\bcrow(?:\\b.*\\b)?feet\\b"
This should match "crow bla feet" as well as "crowfeet" and "crow, feet".
Having to match multiple words in a specific order you could just join them together using '(?:\b.*\b)?' without requiring any additional sorting or checking.

Following Bill answer, I'd try this:
String input = // get user input
String[] tokens = input.split(" ");
String key1 = "crow";
String key2 = "feet";
String[] tokens = input.split(" ");
List<String> list = Arrays.asList(tokens);
return list.indexOf(key1) < list.indexOf(key2)

Need help parsing strings in Java

I am reading in a csv file in Java and, depending on the format of the string on a given line, I have to do something different with it. The three different formats contained in the csv file are (using random numbers):
833
"79, 869"
"56-57, 568"
If it is just a single number (833), I want to add it to my ArrayList. If it is two numbers separated by a comma and surrounded by quotations ("79, 869)", I want to parse out the first of the two numbers (79) and add it to the ArrayList. If it is three numbers surrounded by quotations (where the first two numbers are separated by a dash, and the third by a comma ["56-57, 568"], then I want to parse out the third number (568) and add it to the ArrayList.
I am having trouble using str.contains() to determine if the string on a given line contains a dash or not. Can anyone offer me some help? Here is what I have so far:
private static void getFile(String filePath) throws java.io.IOException {
BufferedReader reader = new BufferedReader(new FileReader(filePath));
String str;
while ((str = reader.readLine()) != null) {
if(str.endsWith("\"")){
if (str.contains(charDash)){
System.out.println(str);
}
}
}
}
Thanks!

I recommend using the version of indexOf that actually takes a char rather than a string, since this method is much faster. (It is a simple loop, without a nested loop.)
I.e.
if (str.indexOf('-')!=-1) {
System.out.println(str);
}
(Note the single quotes, so this is a char, rather than a string.)
But then you have to split the line and parse the individual values. At present, you are testing if the whole line ends with a quote, which is probably not what you want.

The following code works for me (note: I wrote it with no optimization in mind - it's just for testing purposes):
public static void main(String args[]) {
ArrayList<String> numbers = GetNumbers();
}
private static ArrayList<String> GetNumbers() {
String str1 = "833";
String str2 = "79, 869";
String str3 = "56-57, 568";
ArrayList<String> lines = new ArrayList<String>();
lines.add(str1);
lines.add(str2);
lines.add(str3);
ArrayList<String> numbers = new ArrayList<String>();
for (Iterator<String> s = lines.iterator(); s.hasNext();) {
String thisString = s.next();
if (thisString.contains("-")) {
numbers.add(thisString.substring(thisString.indexOf(",") + 2));
} else if (thisString.contains(",")) {
numbers.add(thisString.substring(0, thisString.indexOf(",")));
} else {
numbers.add(thisString);
}
}
return numbers;
}
Output:
833
79
568

Although it gets a lot of hate these days, I still really like the StringTokenizer for this kind of stuff. You can set it up to return the tokens and, at least to me, it makes the processing trivial without interacting with regexes
you'd have to create it using ",- as your tokens, then just kick it off in a loop.
st=new StringTokenizer(line, "\",-", true);
Then you set up a loop:
while(st.hasNextToken()) {
String token=st.nextToken();
Each case becomes it's own little part of the loop:
// Use punctuation to set flags that tell you how to interpret the numbers.
if(token == "\"") {
isQuoted = !isQuoted;
} else if(token == ",") {
...
} else if(...) {
...
} else { // The punctuation has been dealt with, must be a number group
// Apply flags to determine how to parse this number.
}
I realize that StringTokenizer is outdated now, but I'm not really sure why. Parsing regular expressions can't be faster and the syntax is--well split is a pretty sweet syntax I gotta admit.
I guess if you and everyone you work with is really comfortable with Regular Expressions you could replace that with split and just iterate over the resultant array but I'm not sure how to get split to return the punctuation--probably that "+" thing from other answers but I never trust that some character I'm passing to a regular expression won't do something utterly unexpected.

will
if (str.indexOf(charDash.toString()) > -1){
System.out.println(str);
}
do the trick?
which by the way is fastest than contains... because it implements indexOf

Will this work?
if(str.contains("-")) {
System.out.println(str);
}
I wonder if the charDash variable is not what you are expecting it to be.

I think three regexes would be your best bet - because with a match, you also get the bit you're interested in. I suck at regex, but something along the lines of:
.*\-.*, (.+)
.*, (.+)
and
(.+)
ought to do the trick (in order, because the final pattern matches anything including the first two).

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java: Checking each space in a String - java

Assuming that scanText is a Scanner object, you could use something like stated on the documentation: Scanner s = new Scanner(input).useDelimiter("\\s+"); //regex for spaces https://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html

Related

How can i extract specific terms from string lines in Java?

Issue in Combining splitted String

closest thing to NSScanner in Java

Regular expression for validating an answer to a question

Need help parsing strings in Java

Categories

Resources