Issue in combining split Strings - Java

I have extracted the text of the "Web 2.0" Wikipedia article and split it into sentences. Next, I want to build Strings that each contain 5 sentences.
When extracted, the text is shown in an EditText.
Below is my code:
finalText = textField.getText().toString();
String[] textArrayWithFullStop = finalText.split("\\. ");
String colelctionOfFiveSentences = "";
List<String>textCollection = new ArrayList<String>();
for(int i=0;i<textArrayWithFullStop.length;i++)
{
colelctionOfFiveSentences = colelctionOfFiveSentences + textArrayWithFullStop[i];
if( (i%5==0) )
{
textCollection.add(colelctionOfFiveSentences);
colelctionOfFiveSentences = "";
}
}
But when I use a Toast to display the text, here is what it gives:
Toast.makeText(Talk.this, textCollection.get(0), Toast.LENGTH_LONG).show();
As you can see, this is only one sentence, but I expected it to have 5 sentences!
The other thing is that the second String starts from somewhere else. Here is how I displayed it in a Toast:
Toast.makeText(Talk.this, textCollection.get(1), Toast.LENGTH_LONG).show();
This makes no sense to me! How can I properly split the text into sentences and create Strings containing 5 sentences each?

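The problem is that for the first sentence i is 0 and 0 % 5 == 0, so that sentence is added to the list on its own immediately. You should use a separate counter instead of the modulo test. Note that split() also removes the ". " delimiter, so it is appended again here: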
finalText = textField.getText().toString();
String[] textArrayWithFullStop = finalText.split("\\. ");
String colelctionOfFiveSentences = "";
int sentenceAdded = 0;
List<String>textCollection = new ArrayList<String>();
for(int i=0;i<textArrayWithFullStop.length;i++)
{
colelctionOfFiveSentences += textArrayWithFullStop[i] + ". ";
sentenceAdded++;
if(sentenceAdded == 5)
{
textCollection.add(colelctionOfFiveSentences);
colelctionOfFiveSentences = "";
sentenceAdded = 0;
}
}

Append ". " to textArrayWithFullStop[i], because split() removes the delimiter:
colelctionOfFiveSentences = colelctionOfFiveSentences + textArrayWithFullStop[i]+". ";

I believe that if you modify the mod line to this:
if(i%5==4)
you will have what you need.
You probably realize this, but a ". " does not always end a sentence, and not every sentence ends with one. For instance:
I spoke to John and he said... "I went to the store.
Then I went to the Tennis courts.",
and I don't believe he was telling the truth because
1. Why would someone go to play tennis after going to the store and
2. John has no legs!
I had to ask, am I going to let him get away with these lies?
That's two sentences that don't end with a period and would mislead your code into thinking it's 5 sentences broken up at entirely the wrong places, so this approach is really fraught with problems. However, as an exercise in splitting strings, I guess it's as good as any other.
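If the goal is real sentence boundaries rather than a split exercise, the JDK ships java.text.BreakIterator, which applies locale-aware rules instead of looking for a literal ". ". The sketch below is only an illustration: the class and method names (SentenceSplitter, sentences) are made up for this example, and BreakIterator can still misjudge abbreviations and unusual punctuation.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the original code.
class SentenceSplitter {

    // Splits text into sentences using the JDK's sentence boundary rules.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : sentences("Hello world. How are you? Fine!")) {
            System.out.println(s);
        }
    }
}
```

Sentences ending in "?" or "!" are handled as well, which the split on "\\. " cannot do.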

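For the side problem (splitting into sentences), I would suggest starting with this regexp, which splits on a period that is optionally followed by a bracketed reference number such as [1], and then a space:
string.split("\\.(\\[[0-9\\[\\]]+\\])? ")
For the main problem, maybe you could use Arrays.copyOfRange().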
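The copyOfRange() idea could look roughly like the sketch below. It assumes Java 8 or later (for String.join) and that the sentences were produced by a split that removed the ". " delimiter; the names SentenceGrouper and group are invented for this example. Unlike the counter-based loop, it also keeps a final group that has fewer than five sentences.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original code.
class SentenceGrouper {

    // Joins consecutive sentences into groups of at most groupSize sentences.
    static List<String> group(String[] sentences, int groupSize) {
        List<String> groups = new ArrayList<>();
        for (int from = 0; from < sentences.length; from += groupSize) {
            int to = Math.min(from + groupSize, sentences.length);
            String[] slice = Arrays.copyOfRange(sentences, from, to);
            // Put back the ". " that split() removed, plus a final period.
            groups.add(String.join(". ", slice) + ".");
        }
        return groups;
    }

    public static void main(String[] args) {
        String text = "One. Two. Three. Four. Five. Six. Seven.";
        // Drop the last period first so it is not doubled when re-joining.
        String[] sentences = text.trim().replaceAll("\\.$", "").split("\\. ");
        for (String g : group(sentences, 5)) {
            System.out.println(g);
        }
    }
}
```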


Java: Checking each space in a String

I'm sure this is fairly simple, however I've tried googling the question but can't find an answer that fits my problem.
I'm playing around with string manipulation and one of the things I'm trying to do is get the first letter of each word. (And then place them all into a string)
I'm having trouble with registering each 'space' so that my If statement will be triggered. Here's what I have so far.
while (scanText.hasNext()) {
boolean isSpace = false;
if (scanText.hasNext(" ")) {isSpace = true;}
String s = scanText.next();
if (isSpace) {firstLetters += s + " ";}
}
Also, if there is a much better way to do this, please let me know.
You can also split the original text by spaces, and collect the words.
String input = " Hello world aaa ";
String[] split = input.trim().split("\\s+"); // all types of whitespace; " +" to pick spaces only
// operate on "split" array containing words now: [Hello, world, aaa]
However, using regexps here might be overkill.
Assuming that scanText is a Scanner object, you could set its delimiter as described in the documentation:
Scanner s = new Scanner(input).useDelimiter("\\s+"); //regex for spaces
https://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html
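Putting the split approach together for the original goal (first letter of each word), a minimal sketch might look like this. The class and method names are made up for the example, and it assumes words are separated by whitespace only.

```java
// Hypothetical helper, not part of the original code.
class FirstLetters {

    // Returns the first character of every whitespace-separated word.
    static String firstLetters(String input) {
        StringBuilder letters = new StringBuilder();
        for (String word : input.trim().split("\\s+")) {
            if (!word.isEmpty()) { // an empty input yields one empty token
                letters.append(word.charAt(0));
            }
        }
        return letters.toString();
    }

    public static void main(String[] args) {
        System.out.println(firstLetters(" Hello world aaa ")); // prints "Hwa"
    }
}
```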

Splitting String according to multiple String in java

I am just beginning to learn Java, so please bear with me.
I have this string:
String test="John Software_Engineer Kartika QA Xing Project_Manager Mark CEO Celina Assistant_Developer";
I want to split it based on the positions listed in Company={"Software_Engineer", "QA","Project_Manager","CEO ","Assistant_Developer"};
EDITED:
If the above is difficult, is it possible to split based on {AND, OR} instead?
String value="NA_USA >= 15 AND NA_USA=< 30 OR NA_USA!=80"
String value1="EUROPE_SPAIN >= 5 OR EUROPE_SPAIN < = 30 "
How do I split these and put the parts in a Hashtable in Java, and finally how do I access them from the end? That part is not essential; my main concern is how to split.
Next EDIT:
I got the following solution. Is it the best idea or not?
String to="USA AND JAPAN OR SPAIN AND CHINA";
String [] ind= new String[]{"AND", "OR"};
for (int hj = 0; hj < ind.length; hj++){
to=to.replaceAll(ind[hj].toString(), "*");
}
System.out.println(" (=to=) "+to);
String[] partsparts = to.split("\\*");
for (int hj1 = 0; hj1 < partsparts.length; hj1++){
System.out.println(" (=partsparts=) "+partsparts[hj1].toString());
}
and
List<String> test1=split(to, '*', 1);
System.out.println("-str333->"+test1);
New EDIT:
If I have this type of String, how can I split it?
final String PLAYER = "IF John END IF Football(soccer) END IF Abdul-Jabbar tennis player END IF Karim -1996 * 1974 END IF";
How can I get this: String[] data = [John, Football(soccer), Abdul-Jabbar tennis player, Karim -1996 * 1974]
Do you have any idea?
This will split your string and store the pieces in a String array (max size 50). Note that it assumes a "-" has been inserted before each position in the input.
private static String[]split = new String[50];
public static void main(String[] args) {
String test="John -Software_Engineer Kartika -QA Xing -Project_Manager Mark -CEO Celina -Assistant_Developer";
int i = 0; // declared outside the loop; inside it, every token would overwrite split[0]
for (String retval: test.split("-")){
split[i]=retval;
System.out.println(split[i]);
i++;
}
}
You can build the string as Name:post pairs separated by spaces; then it is easy to get the desired value.
String test="John:Software_Engineer Kartika:QA Xing:Project_Manager";
I am unable to comment because my reputation is too low, so I am writing here.
Your first question could be generalized as positional word splitting. If it is guaranteed that you need every even-positioned word, you could first split the string on spaces and then pick the words at even positions.
For your second question, on the AND and OR split, you could replace every " AND " and " OR " with a single separator and then split on that separator. A space only works as the separator when the operands themselves contain no spaces; otherwise use a character that cannot occur in the data.
For your third question, replace "IF " and " END" with a separator in the same way. I am not sure whether the last IF always occurs in your string; if so, you could replace it with the empty string "" before splitting.
First classify your input strings by pattern, and devise an algorithm before you write the Java.
I would also suggest using StringBuffer or StringBuilder instead of String for repeated concatenation, as String operations are costly compared to those two.
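The replace-then-split steps described above can also be collapsed into a single regular-expression split. The sketch below is one possible way to do it, not the only one: the class and method names are invented, and it assumes the keywords AND, OR, IF and END always appear as separate, upper-case words surrounded by whitespace.

```java
// Hypothetical helper, not part of the original code.
class KeywordSplit {

    // "USA AND JAPAN OR SPAIN" -> [USA, JAPAN, SPAIN]
    static String[] splitOnAndOr(String s) {
        return s.trim().split("\\s+(?:AND|OR)\\s+");
    }

    // "IF a END IF b c END IF" -> [a, b c]
    static String[] splitIfBlocks(String s) {
        String body = s.trim()
                .replaceFirst("^IF\\s+", "")             // leading IF
                .replaceFirst("\\s+END(\\s+IF)?$", "");  // trailing END or END IF
        return body.split("\\s+END\\s+IF\\s+");
    }

    public static void main(String[] args) {
        for (String part : splitOnAndOr("NA_USA >= 15 AND NA_USA=< 30 OR NA_USA!=80")) {
            System.out.println("[" + part + "]");
        }
        String player = "IF John END IF Football(soccer) END IF Abdul-Jabbar tennis player END IF Karim -1996 * 1974 END IF";
        for (String part : splitIfBlocks(player)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

Requiring whitespace on both sides of the keyword keeps operands that merely contain "OR" or "AND" as a substring intact, which a plain replaceAll("AND", "*") would not.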
Try this:
String[] a = test.replaceAll("\\w+ (\\w+)", "$1").split(" ");
Here we first replace each word pair with its second word, then split on spaces.
You can keep a Set that holds all the positions, like this:
Set<String> positions = new HashSet<String>();
positions.add("Software_Engineer");
positions.add("QA");
String test="John Software_Engineer Kartika QA Xing Project_Manager Mark CEO Celina Assistant_Developer";
List<String> positionsInString = new ArrayList<String>();
Iterator<String> iterator = positions.iterator();
while (iterator.hasNext()) {
String position = iterator.next();
if(test.contains(position)){
positionsInString.add(position); // no break here, so every matching position is collected
}
}

How can i extract specific terms from string lines in Java?

I have a serious problem with extracting terms from each line of a file. To be more specific, I have a file with a .csv extension that is not actually in CSV format (each whole line ends up in line[0]).
Here are a few example lines out of thousands
(a plain split() doesn't work):
test.csv
"31451 CID005319044   15939353   C8H14O3S2    beta-lipoic acid   C1C[S#](=O)S[C##H]1CCCCC(=O)O "
"12232 COD05374044 23439353  C924O3S2    saponin   CCCC(=O)O "
"9048   CTD042032 23241  C3HO4O3S2 Berberine  [C##H]1CCCCC(=O)O "
I want to extract only "beta-lipoic acid", "saponin" and "Berberine", which are located in the 5th position.
You can see there are big gaps between terms, which is why I say 5th position.
In this case, how can I extract the term located in the 5th position of each line?
One more thing: the amount of whitespace between the six terms is not always the same. It could be one, two, three, four, or five characters, or so.
Because the amount of whitespace varies, I cannot simply use .split().
For example, in the first line I would get "beta-lipoic" instead of "beta-lipoic acid".
Here is a solution for your problem using String.split() and indexOf():
import java.util.ArrayList;
public class StringSplit {
public static void main(String[] args) {
String[] seperatedStr = null;
int fourthStrIndex = 0;
String modifiedStr = null, finalStr = null;
ArrayList<String> strList = new ArrayList<String>();
strList.add("31451 CID005319044   15939353   C8H14O3S2 beta-lipoic acid C1C[S#](=O)S[C##H]1CCCCC(=O)O ");
strList.add("12232 COD05374044 23439353 C924O3S2 saponin CCCC(=O)O ");
strList.add("9048 CTD042032 23241 C3HO4O3S2 Berberine [C##H]1CCCCC(=O)O ");
for (String item: strList) {
seperatedStr = item.split("\\s+");
fourthStrIndex = item.indexOf(seperatedStr[3]) + seperatedStr[3].length();
modifiedStr = item.substring(fourthStrIndex, item.length());
finalStr = modifiedStr.substring(0, modifiedStr.indexOf(seperatedStr[seperatedStr.length - 1]));
System.out.println(finalStr.trim());
}
}
}
Output:
beta-lipoic acid
saponin
Berberine
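The same idea can be written without indexOf(), under the same assumption that the first four fields and the last field never contain whitespace: everything between them is the name. This is a sketch only; FifthField and name are hypothetical names, and the input line is assumed to have no surrounding quotes.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the original code.
class FifthField {

    // Returns the 5th field, which may itself consist of several words.
    static String name(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < 6) {
            throw new IllegalArgumentException("Expected at least 6 fields: " + line);
        }
        // Tokens 0..3 are the leading fields and the last token is the
        // structure string; whatever lies in between is the name.
        return String.join(" ", Arrays.copyOfRange(tokens, 4, tokens.length - 1));
    }

    public static void main(String[] args) {
        System.out.println(name("31451 CID005319044   15939353   C8H14O3S2    beta-lipoic acid   C1C[S#](=O)S[C##H]1CCCCC(=O)O "));
        System.out.println(name("12232 COD05374044 23439353  C924O3S2    saponin   CCCC(=O)O "));
    }
}
```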
Option 1: Use String.split() and split on two or more consecutive whitespace characters, as in the code below:
String s[] = str.split("\\s\\s+");
for (String string : s) {
System.out.println(string);
}
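Option 2: Implement your own split logic by walking through all the characters. The sample below treats a single space as part of a term and a run of two or more spaces as a separator. (This code is just to give an idea.)
public static List<String> getData(String str) {
    List<String> list = new ArrayList<>();
    StringBuilder term = new StringBuilder();
    int spaces = 0; // consecutive spaces seen since the last non-space character
    for (char c : str.toCharArray()) {
        if (c == ' ') {
            spaces++;
            continue;
        }
        if (spaces > 1 && term.length() > 0) {
            // two or more spaces end the current term
            list.add(term.toString());
            term.setLength(0);
        } else if (spaces == 1 && term.length() > 0) {
            // a single space belongs inside the term
            term.append(' ');
        }
        spaces = 0;
        term.append(c);
    }
    if (term.length() > 0) {
        list.add(term.toString()); // flush the last term
    }
    return list;
}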
This would be a relatively easy fix if it weren't for beta-lipoic acid...
Assuming that only spaces/tabs/other whitespace separate terms, you could split on whitespace.
Pattern whitespace = Pattern.compile("\\s+");
String[] terms = whitespace.split(line); // Pattern.split(CharSequence) returns the tokens
// Your desired term should be index 4 of the terms array
While this would work for the majority of your terms, this would also result in you losing the "acid" in "beta-lipoic acid"...
Another hacky solution would be to add in a check for the 6th spot in the array produced by the above code and see if it matches English letters. If so, you can be reasonably confident that the 6th spot is actually part of the same term as the 5th spot, so you can then concatenate those together. This falls apart pretty quickly though if you have terms with >= 3 words. So something like
Pattern possibleEnglishWord = Pattern.compile("[a-zA-Z]*"); // Can add dashes and such as needed
if (possibleEnglishWord.matcher(terms[5]).matches()) {
// return terms[4] + " " + terms[5] or something like that
}
Another thing you can try is to replace all groups of whitespace with a single space, and then remove every token that isn't made up of just English letters/dashes:
line = whitespace.matcher(line).replaceAll(" ");
Pattern notEnglishWord = Pattern.compile("\\S*[^a-zA-Z\\s-]\\S*"); // any token containing a character other than a letter or dash
line = notEnglishWord.matcher(line).replaceAll("").trim();
Then hopefully the only thing that is left would be the term you're looking for.
Hopefully this helps, but I do admit it's rather convoluted. One of the issues is that it appears that non-term words may have only one space between them, which would fool Option 1 as presented by Hirak... If that weren't the case that option should work.
Oh by the way, if you do end up doing this, put the Pattern declarations outside of any loops. They only need to be created once.

making a String from a loop(Arraylist) and several individual signs to mysql commandline

I might have overlooked some factors influencing the process, which is why I seek help here. It is my first post here, and I have read the guidelines for asking a good question. I hope you will understand (otherwise please comment with further questions).
The case is that I have created an ArrayList:
ArrayList<String> liste = new ArrayList<String>();
I gather several names, quantities, and dates:
if(shepherd == 0) {
} else if(shepherd <= 0) {
System.out.println(shepherd);
String s = "('shepherd'," + "'" + shepherd + "'," +"'" + ft.format(date) + "'" + ")";
liste.add(s);
}
I have defined shepherd as follows:
double shepherd = 0;
Next, I wish to add these entries to my MySql database.
I construct a query, and print it out so that I can verify that it is of the correct format:
System.out.println("INSERT INTO kennel VALUES");
for(int i = 0; i < liste.size(); i++) {
System.out.println(liste.get(i));
if(i != liste.size()-1) {
System.out.println(",");
}
}
This shows the correct command, with the proper syntax, but it's only output to the console at this point.
I have to send this through JSch or Ganymed, most likely as a String. So I am wondering how I could take all the different parts (the doubles, the strings, the loop) and build up a String identical to the line printed in the console.
I sensed it would look like this:
String command = (mysql -e "use kennel;insert into department3 values ('shepherd','1','2013-03-04');";
I believe that I am having some trouble with the " and ( and '.
I hope I made it clear what the trouble is. Thank you in advance.
Your string needs to be enclosed in quotation marks. Because these would clash with the quotation marks inside your String, you need to escape the inner ones by placing a backslash in front of each. The stray opening parenthesis also has to go. :)
String command = "mysql -e \"use kennel;insert into department3 values ('shepherd','1','2013-03-04');\"";
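To build that command from the list instead of typing it out, the entries could be joined with commas and wrapped in the escaped quotes. This is a rough sketch assuming Java 8+ (String.join) and that every list entry already has the ('name','quantity','date') form shown in the question; InsertCommandBuilder and buildCommand are made-up names, and the values are assumed to be trusted, since nothing here escapes quotes inside them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original code.
class InsertCommandBuilder {

    // Builds: mysql -e "use <db>;insert into <table> values (...),(...);"
    static String buildCommand(String database, String table, List<String> rows) {
        String sql = "use " + database + ";insert into " + table
                + " values " + String.join(",", rows) + ";";
        return "mysql -e \"" + sql + "\"";
    }

    public static void main(String[] args) {
        List<String> liste = new ArrayList<>();
        liste.add("('shepherd','1','2013-03-04')");
        liste.add("('collie','2','2013-03-04')");
        System.out.println(buildCommand("kennel", "department3", liste));
    }
}
```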

Algorithm to detect how many words typed, also multi sentence support (Java)

Problem:
I have to design an algorithm, which does the following for me:
Say that I have a line (e.g.)
alert tcp 192.168.1.1 (caret is currently here)
The algorithm should process this line, and return a value of 4.
I coded something for it, I know it's sloppy, but it works, partly.
private int counter = 0;
public void determineRuleActionRegion(String str, int index) {
if (str.length() == 0 || str.indexOf(" ") == -1) {
triggerSuggestionList(1);
return;
}
//remove duplicate space, spaces in front and back before searching
int num = str.trim().replaceAll(" +", " ").indexOf(" ", index);
//Check for occurances of spaces, recursively
if (num == -1) { //if there is no space
//no need to check if it's 0 times it will assign to 1
triggerSuggestionList(counter + 1);
counter = 0;
return; //set to rule action
} else { //there is a space
counter++;
determineRuleActionRegion(str, num + 1);
}
} //end of determineactionRegion()
So basically I search for the spaces and determine the region (number of words typed). However, I want it to update when the user presses the space bar.
How should I adapt the current code?
Or better yet, what would be the correct way to do it? I'm looking into BreakIterator for this case...
To add to that, I believe my algorithm won't work for multiple sentences. How should I address this problem as well?
--
The source of String str is acquired from textPane.getText(0, pos + 1);, the JTextPane.
Thanks in advance. Do let me know if my question is still not specific enough.
--
More examples:
alert tcp $EXTERNAL_NET any -> $HOME_NET 22 <caret>
return -1 (maximum of the typed text is 7 words)
alert tcp 192.168.1.1 any<caret>
return 4 (as it is still at 2nd arg)
alert tcp<caret>
return 2 (as it is still at 2nd arg)
alert tcp <caret>
return 3
alert tcp $EXTERNAL_NET any -> <caret>
return 6
It is something like shell commands, as above. I don't think it differs much; I just want to know how many arguments have been typed. Thanks.
--
Pseudocode
Get whole paragraph from textpane
if more than 1 line -> process the last line
count how many arguments typed and return appropriate number
else
process current line
count how many arguments typed and return appropriate number
End
This uses String.split; I think this is what you want.
String[] texts = {
"alert tcp $EXTERNAL_NET any -> $HOME_NET 22 ",
"alert tcp 192.168.1.1 any",
"alert tcp",
"alert tcp ",
"alert tcp $EXTERNAL_NET any -> ",
"multine\ntest\ntest 1 2 3",
};
for (String text : texts) {
String[] lines = text.split("\r?\n|\r");
String lastLine = lines[lines.length - 1];
String[] tokens = lastLine.split("\\s+", -1);
for (String token : tokens) {
System.out.print("[" + token + "]");
}
int pos = (tokens.length <= 7) ? tokens.length : -1;
System.out.println(" = " + pos);
}
This produces the following output:
[alert][tcp][$EXTERNAL_NET][any][->][$HOME_NET][22][] = -1
[alert][tcp][192.168.1.1][any] = 4
[alert][tcp] = 2
[alert][tcp][] = 3
[alert][tcp][$EXTERNAL_NET][any][->][] = 6
[test][1][2][3] = 4
The code provided by polygenelubricants and helios works, to a certain extent. It addresses the problem I stated, but not with multiple lines. helios's code is more straightforward.
However, neither addresses what happens when you press Enter in the JTextPane: it still returns the old count instead of 1, because split() drops the trailing empty line and treats the text as one sentence instead of two.
E.g. alert tcp <enter is pressed>
It should return 1 since it is a new sentence. It returned 2 for both algorithms.
Also, if I select everything and delete it, both algorithms will throw NullPointerException as there is no string to be split.
I added one line, and it solved the problems mentioned above:
public void determineRuleActionRegion(String str) {
//remove repetitive spaces and concat $ for new line indicator
str = str.trim().replaceAll(" +", " ") + "$";
String[] lines = str.split("\r?\n|\r");
String lastLine = lines[lines.length - 1];
String[] tokens = lastLine.split("\\s+", -1);
int pos = (tokens.length <= 7) ? tokens.length : -1;
triggerSuggestionList(pos);
System.out.println("Current pos: " + pos);
return;
} //end of determineactionRegion()
With that, when split() parses the str, the "$" will create another line, which will be the last line regardless, and the count now will return to one. Also, there will not be NullPointerException as the "$" is always there.
However, without the help of polygenelubricants and helios, I don't think I will be able to figure it out so soon. Thanks guys!
EDIT: Okay... apparently split("\r?\n|\r",-1) works the same. Question is, should I accept polygenelubricants' answer or my own? Hmm.
2nd EDIT: One bad thing about concatenating "$" to the end of str: lastLine.endsWith(" ") will then return false. So use split("\r?\n|\r",-1) together with lastLine.endsWith(" ") for the complete solution.
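A compact sketch of that final variant, without the "$" marker, could look like the following. The names ArgumentPosition, position and MAX_ARGS are invented for the example. Here a trailing space is handled by the -1 limit on the token split, which keeps the trailing empty token, rather than by an explicit endsWith(" ") check, and leading whitespace on the last line is ignored.

```java
// Hypothetical helper, not part of the original code.
class ArgumentPosition {

    static final int MAX_ARGS = 7;

    // Returns the 1-based index of the argument being typed on the last
    // line, or -1 once more than MAX_ARGS arguments are present.
    static int position(String text) {
        // Limit -1 keeps a trailing empty line (Enter was just pressed).
        String[] lines = text.split("\r?\n|\r", -1);
        String lastLine = lines[lines.length - 1].replaceFirst("^\\s+", "");
        // Limit -1 keeps a trailing empty token (Space was just pressed).
        String[] tokens = lastLine.split("\\s+", -1);
        return tokens.length <= MAX_ARGS ? tokens.length : -1;
    }

    public static void main(String[] args) {
        System.out.println(position("alert tcp 192.168.1.1 any")); // 4
        System.out.println(position("alert tcp "));                // 3
        System.out.println(position("alert tcp \n"));              // 1
    }
}
```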
What about this: get last line, count what's between spaces...
String text = ...
String[] lines = text.split("\n"); // or \r\n depending on how you get the string
String lastLine = lines[lines.length-1];
StringTokenizer tokenizer = new StringTokenizer(lastLine, " ");
// note that StringTokenizer ignores empty tokens, that is, whatever lies between two consecutive spaces
int count = 0;
while (tokenizer.hasMoreTokens()) {
tokenizer.nextToken();
count++;
}
return count;
Edit: you could check whether the line ends with a space (lastLine.endsWith(" ")), which means a new word is being started. It's a basic approach for you to build on :)
Is the sample line representative? An editor for some rule based language (ACLs)?
How about going for a full Information Extraction/named entity recognition solution, the one that will be able to recognize entities (keywords, ip addresses, etc)? You don't have to write everything from scratch, there're existing tools and libraries.
UPDATE: Here's a piece of Snort code that I believe does the parsing:
Function ParseRule()
if (*args == '(') {
// "Preprocessor Rule detected"
} else {
/* proto ip port dir ip port r*/
toks = mSplit(args, " \t", 7, &num_toks, '\\');
/* A rule might not have rule options */
if (num_toks < 6) {
ParseError("Bad rule in rules file: %s", args);
}
..
}
otn = ParseRuleOptions(sc, rtn, roptions, rule_type, protocol);
..
mSplit is defined in mstring.c, a function to split a string into tokens.
In your case, ParseRuleOptions should return one for the whole string inside brackets I guess.
UPDATE 2: btw, is your first example correct, since in snort, you can add options to rules? For example this is a valid rule being written (options section not completed):
alert tcp any any -> 192.168.1.0/24 111 (content:"|00 01 86 a5|"; <caret>
In some cases you can have either 6 or 7 'words', so your algorithm should have a bit more knowledge, right?
