Java text splitting algorithm - java

I have a large string (with text).
I need to split it into a few pieces (according to a max char limit), run some operations on them independently, and in the end merge the results.
A pretty simple task.
I'm just looking for an algorithm that will split text naturally. So it doesn't split it on fixed sized substrings, and doesn't cut the words in half.
For example (* is the 100th char, max char limit is set to 100):
....split me aro*und here...
the 1st fragment should contain: ...split me
the 2nd fragment should be: around here...
Working in Java btw.

The Wikipedia article on word wrapping discusses this. It also links to an algorithm by Knuth.

You could use lastIndexOf(String find, int index).
public static List<String> splitByText(String text, String sep, int maxLength) {
List<String> ret = new ArrayList<String>();
int start = 0;
while (start + maxLength < text.length()) {
int index = text.lastIndexOf(sep, start + maxLength);
if (index < start)
throw new IllegalArgumentException("Unable to break into strings of " +
"no more than " + maxLength);
ret.add(text.substring(start, index));
start = index + sep.length();
}
ret.add(text.substring(start));
return ret;
}
And
System.out.println(splitByText("....split me around here...", " ", 14));
Prints
[....split me, around here...]
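The method above throws when a single word is longer than maxLength. As a rough sketch only, here is a variant that falls back to a hard cut in that case instead of failing. The class name TextSplitter, the method name split, and the choice of a single space as the only break character are assumptions made for illustration, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;

class TextSplitter {
    /** Splits text into pieces of at most maxLength chars, preferring to break at spaces. */
    static List<String> split(String text, int maxLength) {
        if (maxLength <= 0) {
            throw new IllegalArgumentException("maxLength must be positive");
        }
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (text.length() - start > maxLength) {
            // Last space at or before the limit of the current piece
            int cut = text.lastIndexOf(' ', start + maxLength);
            if (cut <= start) {
                // No usable space in range: cut the word at the limit (assumed fallback)
                parts.add(text.substring(start, start + maxLength));
                start += maxLength;
            } else {
                parts.add(text.substring(start, cut));
                start = cut + 1; // skip the space itself
            }
        }
        parts.add(text.substring(start));
        return parts;
    }
}
```

Calling TextSplitter.split("....split me around here...", 14) gives the same two fragments as the example above, while a text with no spaces at all is cut into fixed-size pieces.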

Jakarta commons-lang WordUtils.wrap() is close:
It only breaks on spaces
It doesn't return a list, but you can choose a "line separator" that's unlikely to occur in the text & then split on that

If you're using Swing for your chat, then you can handle it like this:
//textarea is JTextArea instance
textarea.setLineWrap(true);
textarea.setWrapStyleWord(true);

Related

How many times the word is used on the html page

I have a method that should return an integer which is the number of uses of the searchWord in the text of an HTML document:
public int searchForWord(String searchWord) {
int count = 0;
if(this.htmlDocument == null){
System.out.println("ERROR! Call crawl() before performing analysis on the document");
}
System.out.println("Searching for the word " + searchWord + "...");
String bodyText = this.htmlDocument.body().text();
if (bodyText.toLowerCase().contains(searchWord.toLowerCase())){
count++;
}
return count;
}
But my method always returns count=1, even if the word is used several times. I understand that the error should be obvious, but I’m stuck and I don’t see it.
You are currently only checking once that the text contains the search word, so the count will always be either 0 or 1. To find the total count, keep calling String#indexOf(str, fromIndex) in a loop, passing the index just after the previous match as the second argument, until it returns -1.
public int searchForWord(String searchWord) {
int count = 0;
if(this.htmlDocument == null){
System.out.println("ERROR! Call crawl() before performing analysis on the document");
return 0; // nothing to search; avoids a NullPointerException below
}
System.out.println("Searching for the word " + searchWord + "...");
String bodyText = this.htmlDocument.body().text();
for(int idx = -1; (idx = bodyText.indexOf(searchWord, idx + 1)) != -1; count++);
return count;
}
According to the Java docs for String#contains:
Returns true if and only if this string contains the specified sequence of char values.
You're asking if the word you're looking for is contained in the document, which it is.
You could:
Split the text on words (splitting it by spaces) and then count how many times it appears
Iterate the String using String#indexOf starting on index 0 and then from last index you found until the end of the String.
Iterate the String using contains but starting from a certain index (doing this logic yourself).
I'd go for the 2nd approach as it seems like the easiest one.
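A minimal sketch of that second approach, kept independent of the HTML document so it can be tried on any string. The names OccurrenceCounter and countOccurrences are made up for this example, and lowercasing both strings (to keep the case-insensitive behaviour of the original method) and counting non-overlapping matches are assumptions.

```java
import java.util.Locale;

class OccurrenceCounter {
    /** Counts non-overlapping, case-insensitive occurrences of word in text. */
    static int countOccurrences(String text, String word) {
        if (word.isEmpty()) {
            return 0; // indexOf("") always succeeds, which would loop forever
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = word.toLowerCase(Locale.ROOT);
        int count = 0;
        int idx = haystack.indexOf(needle);
        while (idx != -1) {
            count++;
            // Continue searching just after the match that was found
            idx = haystack.indexOf(needle, idx + needle.length());
        }
        return count;
    }
}
```

Note that this counts substrings, so searching for "a" in "banana" yields 3; it does not check word boundaries.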
These are only conditional statements; you aren't looping through the HTML text. Therefore, if it finds an instance of searchWord in bodyText, it increments count once and then exits the method with a value of 1. I suggest looping through every word in the HTML, adding it to an array, and counting it that way using something like this:
char[] bodyTextA = bodyText.toCharArray();
Or keep it in a string array and split it by a space, or new line, or whatever criteria you have. Example of space:
// puts "Hello", "I'm", "your" and "String" into their own slots in the array
String str = "Hello I'm your String";
String[] split = str.split("\\s+");
Your issue here is that the if statement checks whether the text contains the word and then increments your count variable. So even if the text contains the word multiple times, your logic is basically: if it contains it at all, increase count by one. You will have to rewrite your code to check for multiple occurrences of the word. There are many ways to go about this: you could loop through the entire body text, you could split the body text into an array of words and check that, or you could remove the search word from the text each time you find it and keep checking until the text no longer contains it.
You can use indexOf(String, int), passing the index just after the last match:
public int searchForWord(String searchWord) {
int count = 0;
if(this.htmlDocument == null){
System.out.println("ERROR! Call crawl() before performing analysis on the document");
return 0; // nothing to search; avoids a NullPointerException below
}
System.out.println("Searching for the word " + searchWord + "...");
String bodyText = this.htmlDocument.body().text();
int index = -1; // start at -1 so a match at position 0 is not skipped
while ((index = bodyText.indexOf(searchWord, index + 1)) != -1) {
count++;
}
return count;
}

Check if string contains word (not substring!)

Is there a way to check if a string contains an entire WORD, and not just a substring?
Envision the following scenario:
public class Test {
public static void main(String[] args) {
String[] text = {"this is a", "banana"};
String search = "a";
int counter = 0;
for(int i = 0; i < text.length; i++) {
if(text[i].toLowerCase().contains(search)) {
counter++;
}
}
System.out.println("Counter was " + counter);
}
}
This evaluates to
Counter was 2
Which is not what I'm looking for, as there is only one instance of the word 'a' in the array.
The way I read it is as follows:
The if-test finds an 'a' in text[0], the 'a' corresponding to "this is [a]". However, it also finds occurrences of 'a' in "banana", and thus increments the counter.
How can I solve this to only include the WORD 'a', and not substrings containing a?
Thanks!
You could use a regex, using Pattern.quote to escape out any special characters.
String regex = ".*\\b" + Pattern.quote(search) + "\\b.*"; // \b is a word boundary
int counter = 0;
for(int i = 0; i < text.length; i++) {
if(text[i].toLowerCase().matches(regex)) {
counter++;
}
}
Note this will also find "a" in "this is a; pause" or "Looking for an a?" where a doesn't have a space after it.
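As a hedged alternative sketch, the same word-boundary idea can be written with Matcher.find() and the CASE_INSENSITIVE flag, which avoids the surrounding .* and the toLowerCase() call. The class and method names here (WholeWordMatcher, containsWord, countLinesContaining) are invented for the example.

```java
import java.util.regex.Pattern;

class WholeWordMatcher {
    /** Returns true if word occurs in text delimited by word boundaries on both sides. */
    static boolean containsWord(String text, String word) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        // find() looks for the pattern anywhere in the text, so no ".*" is needed
        return pattern.matcher(text).find();
    }

    /** Counts how many array elements contain word as a whole word. */
    static int countLinesContaining(String[] lines, String word) {
        int counter = 0;
        for (String line : lines) {
            if (containsWord(line, word)) {
                counter++;
            }
        }
        return counter;
    }
}
```

With the array from the question, countLinesContaining(new String[] {"this is a", "banana"}, "a") counts only the first element.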
Could try this way:
for(int i = 0; i < text.length; i++) {
String[] words = text[i].split("\\s+");
for (String word : words)
if(word.equalsIgnoreCase(search)) {
counter++;
break;
}
}
If the words are separated by a space, then you can do:
if((" "+text[i].toLowerCase()+" ").contains(" "+search+" "))
{
...
}
This adds two spaces to the original String.
eg: "this is a" becomes " this is a ".
Then it searches for the word, with the flanking spaces.
eg: It searches for " a " when search is "a"
Arrays.asList("this is a banana".split(" ")).stream().filter((s) -> s.equals("a")).count();
Of course, as others have written, you can start playing around with all kinds of pattern to match "words" out of "text".
But the thing is: depending on the underlying problem you have to solve, this might (by far) not be good enough. Meaning: are you facing the problem of finding some pattern in some string, or is it really that you want to interpret that text in the "human language" sense? When somebody writes down text, there might be subtle typos, strange characters, all kinds of things that make it hard to really "find" a certain word in that text, unless you dive into the "language processing" aspect of things.
Long story short: if your job is "locate certain patterns in strings"; then all the other answers will do. But if your requirement goes beyond that, like "some human will be using your application to 'search' huge data sets"; then you better stop now; and consider turning to full-text enabled search engines like ElasticSearch or Solr.

Java characters count in an array

Another problem I try to solve (NOTE this is not a homework but what popped into my head), I'm trying to improve my problem-solving skills in Java. I want to display this:
Students        ID    #
Carol McKane    920   11
James Eriol     154   10
Elainee Black   462   12
What I want to do is on the 3rd column, display the number of characters without counting the spaces. Give me some tips to do this. Or point me to Java's robust APIs, cause I'm not yet that familiar with Java's string APIs. Thanks.
It sounds like you just want something like:
public static int countNonSpaces(String text) {
int count = 0;
for (int i = 0; i < text.length(); i++) {
if (text.charAt(i) != ' ') {
count++;
}
}
return count;
}
You may want to modify this to use Character.isWhitespace instead of only checking for ' '. Also note that this will count each character outside the Basic Multilingual Plane (stored as a surrogate pair) as two characters. Whether that will be a problem for you or not depends on your use case...
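A small sketch of both suggestions combined, assuming Java 8 or later: it iterates over code points rather than chars, so a surrogate pair counts once, and it uses Character.isWhitespace. The names NonSpaceCounter and countNonWhitespace are illustrative only.

```java
class NonSpaceCounter {
    /** Counts code points in text that are not whitespace. */
    static long countNonWhitespace(String text) {
        return text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .count();
    }
}
```

For the names in the question this gives 11, 10 and 12, matching the third column.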
Think of solving a problem and presenting the answer as two very different steps. I won't help you with the presentation in a table, but to count the number of characters in a String (without spaces) you can use this:
String name = "Carol McKane";
int numberOfCharacters = name.replaceAll("\\s", "").length();
The regular expression \\s matches every whitespace character in the name string, and replaceAll replaces each match with "", i.e. nothing.
Probably the shortest and easiest way:
String[][] students = { { "Carol McKane", "James Eriol", "Elainee Black" }, { "920", "154", "462" } };
for (int i = 0 ; i < students[0].length; i++) {
System.out.println(students[0][i] + "\t" + students[1][i] + "\t" + students[0][i].replace( " ", "" ).length() );
}
replace() returns a new string in which every occurrence of " " has been removed; calling length() on that temporary string gives you the length without spaces.
The String name will remain unchanged.
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html
cheers
To learn more about it you should read the API documentation for String and Character.
Here are some examples of how to do it:
// variation 1
int count1 = 0;
for (char character : text.toCharArray()) {
if (Character.isLetter(character)) {
count1++;
}
}
This uses a special short form of the "for" statement. Here's the long form for better understanding:
// variation 2
int count2 = 0;
for (int i = 0; i < text.length(); i++) {
char character = text.charAt(i);
if (Character.isLetter(character)) {
count2++;
}
}
BTW, removing whitespaces via replace method is not a good coding style to me and not quite helpful for understanding how string class works.

Java: How To Grab Each nth Lines From a String

I'm wondering how I could grab every nth line from a String, say every 100th, with the lines in the String being separated by '\n'.
This is probably a simple thing to do but I really can't think of how to do it, so does anybody have a solution?
Thanks much,
Alex.
UPDATE:
Sorry I didn't explain my question very well.
Basically, imagine there's a 350 line file. I want to grab the start and end of each 100 line chunk. Pretending each line is 10 characters long, I'd finish with two separate arrays (containing start and end indexes) like this:
(Lines 0-100) 0-1000
(Lines 100-200) 1000-2000
(Lines 200-300) 2000-3000
(Lines 300-350) 3000-3500
So then if I wanted to mess around with say the second set of 100 lines (100-200) I have the regions for them.
You can split the string into an array using split() and then just get the indexes you want, like so:
String[] strings = myString.split("\n");
int nth = 100;
for(int i = nth; i < strings.length; i += nth) {
System.out.println(strings[i]);
}
String newLine = System.getProperty("line.separator");
String lines[] = text.split(newLine);
Where text is string with your whole text.
Now to get nth line, do e.g.:
System.out.println(lines[nth - 1]); // Minus one, because arrays in Java are zero-indexed
One approach is to create a StringReader from the string, wrap it in a BufferedReader and use that to read lines. Alternatively, you could just split on \n to get the lines, of course...
String[] allLines = text.split("\n");
List<String> selectedLines = new ArrayList<String>();
for (int i = 0; i < allLines.length; i += 100)
{
selectedLines.add(allLines[i]);
}
This is simpler code than using a BufferedReader, but it does mean having the complete split string in memory (as well as the original, at least temporarily, of course). It's also less flexible in terms of being adapted to reading lines from other sources such as a file. But if it's all you need, it's pretty straightforward :)
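For comparison, the BufferedReader approach mentioned above might look roughly like this. NthLineReader and everyNthLine are hypothetical names, and the method selects lines 0, n, 2n, ... to mirror the split-based loop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class NthLineReader {
    /** Returns lines 0, n, 2n, ... of text, reading one line at a time. */
    static List<String> everyNthLine(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> selected = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            int lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                if (lineNumber % n == 0) {
                    selected.add(line);
                }
                lineNumber++;
            }
        } catch (IOException e) {
            // A StringReader should not fail here; rethrow unchecked just in case
            throw new UncheckedIOException(e);
        }
        return selected;
    }
}
```

Swapping the StringReader for a FileReader would let the same loop read from a file without holding all lines in memory.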
EDIT: If the start indexes are needed too, it becomes slightly more complicated... but not too bad. You probably want to encapsulate the "start and line" in a single class, but for the sake of brevity:
String[] allLines = text.split("\n");
List<String> selectedLines = new ArrayList<String>();
List<Integer> selectedIndexes = new ArrayList<Integer>();
int index = 0;
for (int i = 0; i < allLines.length; i++)
{
if (i % 100 == 0)
{
selectedLines.add(allLines[i]);
selectedIndexes.add(index);
}
index += allLines[i].length() + 1; // Add 1 for the trailing "\n"
}
Of course given the start index and the line, you can get the end index just by adding the line length :)
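If what is wanted is the start and end offset of each 100-line chunk, as in the question's update, one possible sketch scans for '\n' directly instead of splitting. The names LineChunks and chunkRanges are assumptions, as is the convention that each range is {start, end} with end exclusive and the trailing '\n' included in the chunk.

```java
import java.util.ArrayList;
import java.util.List;

class LineChunks {
    /** Returns {start, end} character offsets (end exclusive) per chunk of linesPerChunk lines. */
    static List<int[]> chunkRanges(String text, int linesPerChunk) {
        if (linesPerChunk <= 0) {
            throw new IllegalArgumentException("linesPerChunk must be positive");
        }
        List<int[]> ranges = new ArrayList<>();
        int chunkStart = 0;
        int linesInChunk = 0;
        int pos = 0;
        while (pos < text.length()) {
            int newline = text.indexOf('\n', pos);
            // A final line without a trailing '\n' ends at the end of the text
            pos = (newline == -1) ? text.length() : newline + 1;
            linesInChunk++;
            if (linesInChunk == linesPerChunk) {
                ranges.add(new int[] {chunkStart, pos});
                chunkStart = pos;
                linesInChunk = 0;
            }
        }
        if (linesInChunk > 0) {
            ranges.add(new int[] {chunkStart, pos}); // last, partial chunk
        }
        return ranges;
    }
}
```

For 350 lines of 10 characters each (newline included), this yields the four ranges listed in the question: 0-1000, 1000-2000, 2000-3000 and 3000-3500.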

Algorithm to detect how many words typed, also multi sentence support (Java)

Problem:
I have to design an algorithm, which does the following for me:
Say that I have a line (e.g.)
alert tcp 192.168.1.1 (caret is currently here)
The algorithm should process this line, and return a value of 4.
I coded something for it, I know it's sloppy, but it works, partly.
private int counter = 0;
public void determineRuleActionRegion(String str, int index) {
if (str.length() == 0 || str.indexOf(" ") == -1) {
triggerSuggestionList(1);
return;
}
//remove duplicate space, spaces in front and back before searching
int num = str.trim().replaceAll(" +", " ").indexOf(" ", index);
//Check for occurrences of spaces, recursively
if (num == -1) { //if there is no space
//no need to check if it's 0 times it will assign to 1
triggerSuggestionList(counter + 1);
counter = 0;
return; //set to rule action
} else { //there is a space
counter++;
determineRuleActionRegion(str, num + 1);
}
} //end of determineactionRegion()
So basically I look for spaces and determine the region (number of words typed). However, I want it to change when the user presses the space bar <space character>.
How can I achieve that with the current code?
Or better yet, what would be the correct way to do it? I'm looking into BreakIterator for this case...
In addition, I believe my algorithm won't work for multiple sentences. How should I address this problem as well?
--
The source of String str is acquired from textPane.getText(0, pos + 1);, the JTextPane.
Thanks in advance. Do let me know if my question is still not specific enough.
--
More examples:
alert tcp $EXTERNAL_NET any -> $HOME_NET 22 <caret>
return -1 (maximum of the typed text is 7 words)
alert tcp 192.168.1.1 any<caret>
return 4 (as it is still at 2nd arg)
alert tcp<caret>
return 2 (as it is still at 2nd arg)
alert tcp <caret>
return 3
alert tcp $EXTERNAL_NET any -> <caret>
return 6
It is something like shell commands. As above. Though I think it does not differ much I believe, I just want to know how many arguments are typed. Thanks.
--
Pseudocode
Get whole paragraph from textpane
if more than 1 line -> process the last line
count how many arguments typed and return appropriate number
else
process current line
count how many arguments typed and return appropriate number
End
This uses String.split; I think this is what you want.
String[] texts = {
"alert tcp $EXTERNAL_NET any -> $HOME_NET 22 ",
"alert tcp 192.168.1.1 any",
"alert tcp",
"alert tcp ",
"alert tcp $EXTERNAL_NET any -> ",
"multine\ntest\ntest 1 2 3",
};
for (String text : texts) {
String[] lines = text.split("\r?\n|\r");
String lastLine = lines[lines.length - 1];
String[] tokens = lastLine.split("\\s+", -1);
for (String token : tokens) {
System.out.print("[" + token + "]");
}
int pos = (tokens.length <= 7) ? tokens.length : -1;
System.out.println(" = " + pos);
}
This produces the following output:
[alert][tcp][$EXTERNAL_NET][any][->][$HOME_NET][22][] = -1
[alert][tcp][192.168.1.1][any] = 4
[alert][tcp] = 2
[alert][tcp][] = 3
[alert][tcp][$EXTERNAL_NET][any][->][] = 6
[test][1][2][3] = 4
The codes provided by polygenelubricants and helios work, to a certain extent. It addresses the aforementioned problem I'd stated, but not with multi-lines. helios's code is more straightforward.
However both codes did not address the problem when you press enter in the JTextPane, it will still return back the old count instead of 1 as the split() returns it as one sentence instead of two.
E.g. alert tcp <enter is pressed>
By right it should return 1 since it is a new sentence. It returned 2 for both algorithms.
Also, if I highlight all and delete both algorithms will throw NullPointerException as there is no string to be split.
I added one line, and it solved the problems mentioned above:
public void determineRuleActionRegion(String str) {
//remove repetitive spaces and concat $ for new line indicator
str = str.trim().replaceAll(" +", " ") + "$";
String[] lines = str.split("\r?\n|\r");
String lastLine = lines[lines.length - 1];
String[] tokens = lastLine.split("\\s+", -1);
int pos = (tokens.length <= 7) ? tokens.length : -1;
triggerSuggestionList(pos);
System.out.println("Current pos: " + pos);
return;
} //end of determineactionRegion()
With that, when split() parses the str, the "$" will create another line, which will be the last line regardless, and the count now will return to one. Also, there will not be NullPointerException as the "$" is always there.
However, without the help of polygenelubricants and helios, I don't think I will be able to figure it out so soon. Thanks guys!
EDIT: Okay... apparently split("\r?\n|\r",-1) works the same. Question is should I accept polygenelubricants or my own? Hmm.
2nd EDIT: One downside of appending "$" to the end of str is that lastLine.endsWith(" ") will then always return false. So the complete solution has to use split("\r?\n|\r", -1) together with lastLine.endsWith(" ").
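Putting the pieces from this thread together, a hedged sketch of the final approach could look like the following. ArgumentPosition, currentArgument and MAX_ARGS are illustrative names; the limit of 7 comes from the question, and stripping leading whitespace from the last line is an extra assumption so that indentation does not count as an argument.

```java
class ArgumentPosition {
    static final int MAX_ARGS = 7; // maximum number of words, per the question

    /** Returns the 1-based argument position of the caret on the last line, or -1 past the limit. */
    static int currentArgument(String text) {
        // Limit -1 keeps a trailing empty line, so pressing Enter starts again at 1
        String[] lines = text.split("\r?\n|\r", -1);
        String lastLine = lines[lines.length - 1].replaceFirst("^\\s+", "");
        // Limit -1 keeps a trailing empty token, so a trailing space starts the next argument
        String[] tokens = lastLine.split("\\s+", -1);
        return (tokens.length <= MAX_ARGS) ? tokens.length : -1;
    }
}
```

An empty string is handled as well: split returns a single empty element, so the result is 1 rather than an exception.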
What about this: get last line, count what's between spaces...
String text = ...
String[] lines = text.split("\n"); // or \r\n depending on how you get the string
String lastLine = lines[lines.length-1];
StringTokenizer tokenizer = new StringTokenizer(lastLine, " ");
// note that StringTokenizer ignores empty tokens, i.e. whatever lies between two consecutive spaces
int count = 0;
while (tokenizer.hasMoreTokens()) {
tokenizer.nextToken();
count++;
}
return count;
Edit: you could check whether there is a final space (lastLine.endsWith(" ")), meaning a new word is being started. This is just a basic approach for you to build on :)
Is the sample line representative? An editor for some rule based language (ACLs)?
How about going for a full Information Extraction/named entity recognition solution, the one that will be able to recognize entities (keywords, ip addresses, etc)? You don't have to write everything from scratch, there're existing tools and libraries.
UPDATE: Here's a piece of Snort code that I believe does the parsing:
Function ParseRule()
if (*args == '(') {
// "Preprocessor Rule detected"
} else {
/* proto ip port dir ip port r*/
toks = mSplit(args, " \t", 7, &num_toks, '\\');
/* A rule might not have rule options */
if (num_toks < 6) {
ParseError("Bad rule in rules file: %s", args);
}
..
}
otn = ParseRuleOptions(sc, rtn, roptions, rule_type, protocol);
..
mSplit is defined in mstring.c, a function to split a string into tokens.
In your case, ParseRuleOptions should return one for the whole string inside brackets I guess.
UPDATE 2: btw, is your first example correct, since in snort, you can add options to rules? For example this is a valid rule being written (options section not completed):
alert tcp any any -> 192.168.1.0/24 111 (content:"|00 01 86 a5|"; <caret>
In some cases you can have either 6 or 7 'words', so your algorithm should have a bit more knowledge, right?
