Extract data inside nested braces - java

I want to extract the content of the first and the second nested pair of braces separately, and I am completely stuck. Can anyone help? My file read.txt contains the data below; I read each line into a string "s".
BufferedReader br=new BufferedReader(new FileReader("read.txt"));
while(br.ready())
{
String s=br.readLine();
System.out.println(s);
}
Output
{ { "John", "ran" }, { "NOUN", "VERB" } },
{ { "The", "dog", "jumped"}, { "DET", "NOUN", "VERB" } },
{ { "Mike","lives","in","Poland"}, {"NOUN","VERB","DET","NOUN"} },
i.e., my output should look like this:
"John", "ran"
"NOUN", "VERB"
"The", "dog", "jumped"
"DET", "NOUN", "VERB"
"Mike","lives","in","Poland"
"NOUN","VERB","DET","NOUN"

Use this regex:
(?<=\{)(?!\s*\{)[^{}]+
See the matches in the Regex Demo.
In Java:
Pattern regex = Pattern.compile("(?<=\\{)(?!\\s*\\{)[^{}]+");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// matched text: regexMatcher.group()
}
Explanation
The lookbehind (?<=\{) asserts that what precedes the current position is a {
The negative lookahead (?!\s*\{) asserts that what follows is not optional whitespace then {
[^{}]+ matches any chars that are not curlies
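As a rough, self-contained sketch of the pattern above: the class name BraceExtractor and the extract helper are made up for illustration, and each match is trimmed because the pattern also picks up the spaces just inside the braces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class BraceExtractor {
    // Same pattern as above: text right after a '{' that is not followed by another '{'.
    private static final Pattern INNER = Pattern.compile("(?<=\\{)(?!\\s*\\{)[^{}]+");

    static List<String> extract(String line) {
        List<String> parts = new ArrayList<>();
        Matcher m = INNER.matcher(line);
        while (m.find()) {
            parts.add(m.group().trim()); // drop the padding spaces inside the braces
        }
        return parts;
    }

    public static void main(String[] args) {
        String line = "{ { \"John\", \"ran\" }, { \"NOUN\", \"VERB\" } },";
        extract(line).forEach(System.out::println);
    }
}
```

Calling extract once per line read from the file should give the two inner groups of that line, in order.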

If you split on "}," you get each set of words as a single string; after that it is just a matter of removing the curly braces.
Based on your code:
BufferedReader br=new BufferedReader(new FileReader("read.txt"));
while(br.ready())
{
String s=br.readLine();
String [] words = s.split ("},");
for (int x = 0; x < words.length; x++) {
String printme = words[x].replace("{", "").replace("}", "").trim();
System.out.println(printme);
}
}

You could always remove the opening braces, then split by '},' which would leave you with the list of strings you've asked for. (If that is all one string, of course.)
String s = input.replace("{","");
String[] splitString = s.split("},");
This would first remove the opening braces:
"John", "ran" }, "NOUN", "VERB" } },
"The", "dog", "jumped"}, "DET", "NOUN", "VERB" } },
"Mike","lives","in","Poland"},"NOUN","VERB","DET","NOUN"} },
Then would split by },
"John", "ran"
"NOUN", "VERB" }
"The", "dog", "jumped"
"DET", "NOUN", "VERB" }
"Mike","lives","in","Poland"
"NOUN","VERB","DET","NOUN"}
Then you just need to tidy them up with another replace!
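Put together, the three steps might look like the sketch below. The class name SplitTidy and the groups helper are invented for this example; Pattern.quote is used only because split() interprets its argument as a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class illustrating the replace-then-split idea.
class SplitTidy {
    static List<String> groups(String line) {
        List<String> result = new ArrayList<>();
        // 1. remove every opening brace
        String noOpen = line.replace("{", "");
        // 2. split on the literal "}," (quoted, because split() takes a regex)
        for (String piece : noOpen.split(Pattern.quote("},"))) {
            // 3. tidy up: drop leftover closing braces and surrounding spaces
            String tidy = piece.replace("}", "").trim();
            if (!tidy.isEmpty()) {
                result.add(tidy);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        groups("{ { \"John\", \"ran\" }, { \"NOUN\", \"VERB\" } },")
                .forEach(System.out::println);
    }
}
```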

Another approach is to search for a {...} substring that contains no inner { or } characters and take only its inner part, without the { and }.
A regex describing such a substring could look like this:
\\{(?<content>[^{}]+)\\}
Explanation:
\\{ is an escaped {, so it represents a literal { (unescaped, { starts a quantifier such as {x,y}, which is why it needs to be escaped)
(?<content>...) is a named capturing group; it stores only the part between { and }, so that we can use this part later (instead of the entire match, which would also include the { and })
[^{}]+ represents one or more characters that are neither { nor }
\\} is an escaped }, which means it represents a literal }
DEMO:
String input = "{ { \"John\", \"ran\" }, { \"NOUN\", \"VERB\" } },\r\n" +
"{ { \"The\", \"dog\", \"jumped\"}, { \"DET\", \"NOUN\", \"VERB\" } },\r\n" +
"{ { \"Mike\",\"lives\",\"in\",\"Poland\"}, {\"NOUN\",\"VERB\",\"DET\",\"NOUN\"} },";
Pattern p = Pattern.compile("\\{(?<content>[^{}]+)\\}");
Matcher m = p.matcher(input);
while(m.find()){
System.out.println(m.group("content").trim());
}
Output:
"John", "ran"
"NOUN", "VERB"
"The", "dog", "jumped"
"DET", "NOUN", "VERB"
"Mike","lives","in","Poland"
"NOUN","VERB","DET","NOUN"

Related

finding the most popular word in a person's tweets

In a project, I'm trying to query the tweets of a particular user's handle and find the most common word in the user's tweets and also return the frequency of that most common word.
Below is my code:
public String mostPopularWord()
{
this.removeCommonEnglishWords();
this.sortAndRemoveEmpties();
Map<String, Integer> termsCount = new HashMap<>();
for(String term : terms)
{
Integer c = termsCount.get(term);
if(c==null)
c = new Integer(0);
c++;
termsCount.put(term, c);
}
Map.Entry<String,Integer> mostRepeated = null;
for(Map.Entry<String, Integer> curr: termsCount.entrySet())
{
if(mostRepeated == null || mostRepeated.getValue()<curr.getValue())
mostRepeated = curr;
}
//frequencyMax = termsCount.get(mostRepeated.getKey());
try
{
frequencyMax = termsCount.get(mostRepeated.getKey());
return mostRepeated.getKey();
}
catch (NullPointerException e)
{
System.out.println("Cannot find most popular word from the tweets.");
}
return "";
}
I also think it would help to show the codes for the first two methods I call in the method above, as shown below. They are all in the same class, with the following defined:
private Twitter twitter;
private PrintStream consolePrint;
private List<Status> statuses;
private List<String> terms;
private String popularWord;
private int frequencyMax;
@SuppressWarnings("unchecked")
public void sortAndRemoveEmpties()
{
Collections.sort(terms);
terms.removeAll(Arrays.asList("", null));
}
private void removeCommonEnglishWords()
{
Scanner sc = null;
try
{
sc = new Scanner(new File("commonWords.txt"));
}
catch(Exception e)
{
System.out.println("The file is not found");
}
List<String> commonWords = new ArrayList<String>();
int count = 0;
while(sc.hasNextLine())
{
count++;
commonWords.add(sc.nextLine());
}
Iterator<String> termIt = terms.iterator();
while(termIt.hasNext())
{
String term = termIt.next();
for(String word : commonWords)
if(term.equalsIgnoreCase(word))
termIt.remove();
}
}
I apologise for the rather long code snippets. One frustrating thing is that even though my removeCommonEnglishWords() method is apparently right (discussed in another post), when I run mostPopularWord() it returns "the", which is clearly part of my common English words list and is meant to be eliminated from the List terms. What might I be doing wrong?
UPDATE 1:
Here is the link to the commonWords file:
https://drive.google.com/file/d/1VKNI-b883uQhfKLVg-L8QHgPTLNb22uS/view?usp=sharing
UPDATE 2: One thing I've noticed while debugging is that the
while(sc.hasNext())
in removeCommonEnglishWords() is entirely skipped. I don't understand why, though.
It can be simpler if you use a stream, like so:
String mostPopularWord() {
return terms.stream()
.collect(Collectors.groupingBy(s -> s, Collectors.counting()))
.entrySet().stream()
.sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
.findFirst()
.map(Map.Entry::getKey)
.orElse("");
}
I tried your code. Here is what you will have to do. Replace the following part in removeCommonEnglishWords()
Iterator<String> termIt = terms.iterator();
while(termIt.hasNext())
{
String term = termIt.next();
for(String word : commonWords)
if(term.equalsIgnoreCase(word))
termIt.remove();
}
with this:
List<String> reducedTerms = new ArrayList<>();
for( String term : this.terms ) {
if( !commonWords.contains( term ) ) reducedTerms.add( term );
}
this.terms = reducedTerms;
Since you hadn't provided the class, I created one with some assumptions, but I think this code will go through.
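One thing to keep in mind with the replacement above: List.contains() compares with equals(), so unlike the original equalsIgnoreCase() check it is case-sensitive, and "The" would not be removed by a lowercase "the" entry. The sketch below is one way to keep the comparison case-insensitive; the class and method names are made up, and whether case or stray whitespace is what lets "the" through in the original program is only a guess.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not taken from the question's class.
class CommonWordFilter {
    // Returns a new list without the common words; the comparison
    // ignores case and surrounding whitespace.
    static List<String> removeCommonWords(List<String> terms, List<String> commonWords) {
        Set<String> lookup = new HashSet<>();
        for (String word : commonWords) {
            lookup.add(word.trim().toLowerCase(Locale.ROOT));
        }
        List<String> reduced = new ArrayList<>();
        for (String term : terms) {
            if (term != null && !lookup.contains(term.trim().toLowerCase(Locale.ROOT))) {
                reduced.add(term);
            }
        }
        return reduced;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("The", "dog", "the", "ran");
        System.out.println(removeCommonWords(terms, List.of("the", "a")));
    }
}
```

Using a HashSet for the lookup also avoids scanning the whole common-words list once per term.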
A slightly different approach using streams.
This uses the relatively common frequency-count idiom with streams and stores the counts in a map.
It then does a simple scan to find the largest count obtained and returns either that word or the string "No words found".
It also filters out the words in a Set<String> called ignore, so you need to create that too.
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import java.util.stream.Collectors;
Set<String> ignore = Set.of("the", "of", "and", "a",
"to", "in", "is", "that", "it", "he", "was",
"you", "for", "on", "are", "as", "with",
"his", "they", "at", "be", "this", "have",
"via", "from", "or", "one", "had", "by",
"but", "not", "what", "all", "were", "we",
"RT", "I", "&", "when", "your", "can",
"said", "there", "use", "an", "each",
"which", "she", "do", "how", "their", "if",
"will", "up", "about", "out", "many",
"then", "them", "these", "so", "some",
"her", "would", "make", "him", "into",
"has", "two", "go", "see", "no", "way",
"could", "my", "than", "been", "who", "its",
"did", "get", "may", "…", "#", "??", "I'm",
"me", "u", "just", "our", "like");
Map.Entry<String, Long> entry = terms.stream()
.filter(wd->!ignore.contains(wd)).map(String::trim)
.collect(Collectors.groupingBy(a -> a,
Collectors.counting()))
.entrySet().stream()
.collect(Collectors.maxBy(Comparator
.comparing(Entry::getValue)))
.orElse(Map.entry("No words found", 0L));
System.out.println(entry.getKey() + " " + entry.getValue());

How to read multi-line content between two words from a PDF file using java?

I have a requirement to get data from a PDF file: the text that comes after the word "IN:" and before the word "OUT:". There are many such occurrences across the file.
The problem is that this text can span multiple lines, and its format is not defined.
I tried adding conditions such as lines starting or ending with specific characters, but that way I would have to write too many conditions, and the same formats also occur after the "OUT:" word, so that text was being fetched as well.
Kindly let me know how I can solve this problem.
Below is sample data formats:
Format 1:
IN: {
"abc": "valueabc",
"def": "valuedef",
"ghi":
[
{"jkl": valuejkl, "mno": valuemno, "pqr":
"valuepqr"},
{"jkl": valuejkl, "mno": valuemno, "stu": "valuestu", "pqr":
"valuepqr"},
{"jkl": valuejkl, "mno": valuemno, "stu": "valuestu", "pqr":
"valuepqr"}
],
"id": "1"
}
OUT: {"abc": "valueabc", "id": "1", "def": {}}
Format 2 :
IN: {"abc": "valueabc", "def": "valuedef", "id": "1"}
OUT: {"abc": "valueabc", "id": "1", "ghi": "valueghi"}
Format 3 :
IN: {"abc": "valueabc", "def": "valuedef", "jkl":
["valuejkl"], "id": "1"}
OUT: {"abc": "valueabc", "id": "1", "ghi": {}}
Below is the core logic of the solution I have tried. The first if statement fetches separate data that I also need; the branches after it are the logic for fetching the data after "IN:" and before "OUT:".
for(String line:lines)
{
String pattern = "^[0-9]+[\\.][0-9]+[\\.][0-9]+[\\.].*";
boolean matches = Pattern.matches(pattern, line);
if(matches)
{
String subString1 = line.split("\\.")[3].trim();
String subString2 = line.split("\\.")[4].trim();
String finalString = subString1+"."+subString2+",";
System.out.println();
System.out.print(finalString);
}
else if(line.startsWith("IN:"))
{
String finalString = line.substring(3).trim();
System.out.print(finalString);
}
else if(!(line.startsWith("IN:")||line.startsWith("OUT:"))&&((line.trim().length()>1)&&(line.endsWith("}"))))
{
String finalString = line.trim();
System.out.print(finalString);
}
else if(!(line.startsWith("IN:")||line.startsWith("OUT:"))&&((line.trim().length()>1)&&(line.startsWith("\""))))
{
String finalString = line.trim();
System.out.print(finalString);
}
else
{
continue;
}
}
How about this? If you want the value between IN: and OUT:, could you try this code?
StringBuilder sb = new StringBuilder();
boolean targetFound = false;
for (String line : lines) {
if (line.startsWith("IN:")) {
line = line.replace("IN:", "");
targetFound = false;
} else if (line.startsWith("OUT:")) {
targetFound = true;
}
if (targetFound && !line.equals("OUT:")) {
// Print
System.out.println(sb.toString());
sb.setLength(0);
} else {
sb.append(line.trim());
}
}
INPUT TEXT:
IN: {
"abc": "valueabc",
"def": "valuedef",
"ghi":
[
"valuepqr"},
{"jkl": valuejkl, "mno": valuemno, "stu": "valuestu", "pqr":
"valuepqr"}
],
"id": "1"
}
OUT: {"abc": "valueabc", "~"}
RESULT:
{"abc": "valueabc","def": "valuedef","ghi":["valuepqr"},{"jkl": valuejkl, "mno": valuemno, "stu": "valuestu", "pqr":"valuepqr"}],"id": "1"}
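An alternative sketch that works on the whole extracted text instead of line by line: a single regex that captures everything from an "IN:" at the start of a line up to the next line starting with "OUT:". It assumes the PDF text has already been extracted into one String (the question does not say which PDF library is used), and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes the PDF text is already available as a String.
class InOutExtractor {
    // "IN:" at the start of a line, then as little as possible
    // up to the next line that starts with "OUT:".
    private static final Pattern IN_BLOCK =
            Pattern.compile("^IN:(.*?)^OUT:", Pattern.MULTILINE | Pattern.DOTALL);

    static List<String> extract(String text) {
        List<String> blocks = new ArrayList<>();
        Matcher m = IN_BLOCK.matcher(text);
        while (m.find()) {
            // join the lines of the block, dropping the whitespace around each line break
            blocks.add(m.group(1).replaceAll("\\s*\\R\\s*", "").trim());
        }
        return blocks;
    }

    public static void main(String[] args) {
        String text = "IN: {\"abc\": \"valueabc\", \"jkl\":\n[\"valuejkl\"], \"id\": \"1\"}\n"
                + "OUT: {\"abc\": \"valueabc\"}\n";
        extract(text).forEach(System.out::println);
    }
}
```

Because the match is lazy and anchored on line starts, text after "OUT:" is never captured, even when it has the same shape as the "IN:" payload.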

Pattern matching code in java [duplicate]

This question already has answers here:
Regex to pick characters outside of pair of quotes
(6 answers)
Closed 10 years ago.
I need to parse several rows of JSON of the type shown below. I need to remove all the commas (,) that are inside square brackets. That is, ["Cheesesteaks","Sandwiches", "Restaurants"] becomes ["Cheesesteaks""Sandwiches""Restaurants"]. I need to preserve all the other commas as they are.
Another example - ["Massachusetts Institute of Technology", "Harvard University"] would become ["Massachusetts Institute of Technology""Harvard University"] keeping all other commas intact.
{"business_id": "EjgQxDOUS-GFLsNxoEFJJg", "full_address": "Liberty Place\n1625 Chestnut St\nMantua\nPhiladelphia, PA 19103", "schools": ["Massachusetts Institute of Technology", "Harvard University"], "open": true, "categories": ["Cheesesteaks", "Sandwiches", "Restaurants"], "photo_url": "http://s3-media4.ak.yelpcdn.com/bphoto/SxGxfJGy9pXRgCNHTRDeBA/ms.jpg", "city": "Philadelphia", "review_count": 43, "name": "Rick's Steaks", "neighborhoods": ["Mantua"], "url": "http://www.yelp.com/biz/ricks-steaks-philadelphia", "longitude": -75.199929999999995, "state": "PA", "stars": 3.5, "latitude": 39.962440000000001, "type": "business"}
Can someone please help me find the regular expression to match this pattern?
Try this:
Pattern outer = Pattern.compile("\\[.*?\\]");
Pattern inner = Pattern.compile("\"\\s*,\\s*\"");
Matcher mOuter = null;
Matcher mInner = null;
mOuter = outer.matcher(jsonString);
StringBuffer sb = new StringBuffer();
while (mOuter.find()) {
mOuter.appendReplacement(sb, "");
mInner = inner.matcher(mOuter.group());
while (mInner.find()) {
mInner.appendReplacement(sb, "\"\"");
}
mInner.appendTail(sb);
}
mOuter.appendTail(sb);
System.out.println(sb.toString());
And replace jsonString with your input.
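On Java 9 or later, the same two-pattern idea can be written more compactly with the Matcher.replaceAll overload that takes a function. This is only a sketch with an invented class name; like the code above, it assumes that square brackets never appear inside the string values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; requires Java 9+ for Matcher.replaceAll(Function).
class BracketCommas {
    private static final Pattern ARRAY = Pattern.compile("\\[.*?\\]");
    private static final Pattern SEPARATOR = Pattern.compile("\"\\s*,\\s*\"");

    static String stripArrayCommas(String json) {
        // For every [...] section, remove the commas that separate quoted items.
        return ARRAY.matcher(json).replaceAll(match ->
                Matcher.quoteReplacement(
                        SEPARATOR.matcher(match.group()).replaceAll("\"\"")));
    }

    public static void main(String[] args) {
        String json = "{\"city\": \"Philadelphia, PA\", \"categories\": "
                + "[\"Cheesesteaks\", \"Sandwiches\", \"Restaurants\"]}";
        System.out.println(stripArrayCommas(json));
    }
}
```

Matcher.quoteReplacement keeps any $ or \ inside the array from being read as a group reference in the replacement.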
That should be a very simple replace.
String in = "[\"Cheesesteaks\",\"Sandwiches\", \"Restaurants\"]";
String out = in.replaceAll(", ?", "");
System.out.println(out);
Gives
["Cheesesteaks""Sandwiches""Restaurants"]

Picking apart a string and replacing it

I have been picking my brain lately and can't seem to figure out how to pull the "text" from this string and replace the found pattern with those word(s).
Pattern searchPattern = Pattern.compile("\\[\\{(.+?)\\}\\]");
Matcher matcher = searchPattern.matcher(sb);
sb is the string that contains a few occurrences of these patterns, which start with [{ and end with }].
[{ md : {o : "set", et : _LU.et.v.v }, d : {t : _LU.el.searchtype, l : _LU[_LU.el.searchtype].nfts.l, v : _LU[_LU.el.searchtype].nfts.v}}, { md : {o : "set", et : _LU.et.v.v }, d : {t : _LU.el.topicgroup, l : "Books", v : "ETBO"}}]
gets returned as
md : {o : "set", et : _LU.et.v.v }, d : {t : _LU.el.searchtype, l : _LU[_LU.el.searchtype].nfts.l, v : _LU[_LU.el.searchtype].nfts.v}}, { md : {o : "set", et : _LU.et.v.v }, d : {t : _LU.el.topicgroup, l : "Books", v : "ETBO"}
Notice the lack of [{ and }]. I manage to find the above pattern, but how would I find the words set and Books and then replace the original found pattern with only those words? I can check whether the match contains a " via
while (matcher.find()) {
matcher.group(1).contains("\"");
but I really just need some ideas about how to go about doing this.
Is this what you are looking for (answer based on your first comment)?
its actually fairly large.. but goes along the lines of "hello my name is, etc, etc, etc, [{ md : {o : "set", et : _LU.et.v.v }, d : {t : _LU.el.searchtype, l : _LU[_LU.el.searchtype].nfts.l, v : _LU[_LU.el.searchtype].nfts.v}}, { md : {o : "set", et : _LU.et.v.v }, d : {t : _LU.el.topicgroup, l : "Books", v : "ETBO"}}] , some more text here, and some more" -> the [{ }] parts should be replaced with the text inside of them in this case set, books, etbo... resulting in a final string of "hello my name is, etc, etc, etc, set set Books ETBO , some more text here, and some more"
// text from your comment
String sb = "hello my name is, etc, etc, etc, [{ md : "
+ "{o : \"set\", et : _LU.et.v.v }, d : {t : "
+ "_LU.el.searchtype, l : _LU[_LU.el.searchtype].nfts.l, "
+ "v : _LU[_LU.el.searchtype].nfts.v}}, { md : {o : "
+ "\"set\", et : _LU.et.v.v }, d : {t : _LU.el.topicgroup, "
+ "l : \"Books\", v : \"ETBO\"}}] , "
+ "some more text here, and some more";
Pattern searchPattern = Pattern.compile("\\[\\{(.+?)\\}\\]");
Matcher matcher = searchPattern.matcher(sb);
// pattern that finds words between quotes
Pattern searchWordsInQuotes = Pattern.compile("\"(.+?)\"");
// here I will collect words in quotes placed in [{ and }] and separate
// them with one space
StringBuilder words = new StringBuilder();
// buffer used while replacing [{ xxx }] part with words found in xxx
StringBuffer output = new StringBuffer();
while (matcher.find()) {// looking for [{ xxx }]
words.delete(0, words.length());
//now I search for words in quotes from [{ xxx }]
Matcher m = searchWordsInQuotes.matcher(matcher.group());
while (m.find())
words.append(m.group(1)).append(" ");
// quoteReplacement keeps any $ or \ in the words from being read as group references
matcher.appendReplacement(output, Matcher.quoteReplacement(words.toString().trim()));
//trim was used to remove last space
}
//we also need to append last part of String that wasn't used in matcher
matcher.appendTail(output);
System.out.println(output);
Output:
hello my name is, etc, etc, etc, set set Books ETBO , some more text here, and some more
OK, I think you need to do this in three passes, first time matching the section between the [{ }], and the second time going through the match doing the replace, and the third time replacing that match with the string you got from the second pass.
You already have a pattern for the first match, and you'd just use it again for the third match, when you replace it with the result of the second pass.
For the second pass, you're going to need to replaceAll on the first match. Something like this:
Pattern searchPattern = Pattern.compile("\\[\\{(.+?)\\}\\]");
Matcher matcher = searchPattern.matcher(sb);
StringBuffer result = new StringBuffer();
while ( matcher.find() )
{
// second pass: keep only the quoted words, separated by spaces
String words = matcher.group(1).replaceAll("[^\"]*\"([^\"]*)\"[^\"]*", "$1 ").trim();
// third pass: swap the whole [{ }] match for those words
matcher.appendReplacement(result, Matcher.quoteReplacement(words));
}
matcher.appendTail(result);
The first pass is done by matcher.find(). The next one is done by matcher.group(1).replaceAll(), whose result is passed into matcher.appendReplacement() for the third pass. appendReplacement() copies the text between the previous match and the current one into result and then appends the replacement, so each [{ }] block is replaced exactly once; appendTail() copies whatever follows the last match. Calling matcher.replaceFirst() inside the loop would not work here: it resets the matcher and returns a new string instead of modifying sb, so the result would be lost and the loop could find the same match again.
I would point out that this is not particularly efficient. I think that you would be better off doing more of this manually rather than with regular expressions.
LATEST REVISION
An example of how to loop over a string with multiple boundaries, replacing at each level:
public static String replace(CharSequence rawText, String oldWord, String newWord, String regex) {
Pattern patt = Pattern.compile(regex);
Matcher m = patt.matcher(rawText);
StringBuffer sb = new StringBuffer(rawText.length());
while (m.find()) {
String text = m.group(1);
if(oldWord == null || oldWord.isEmpty()) {
m.appendReplacement(sb, Matcher.quoteReplacement(newWord));
} else {
if(text.matches(oldWord)) {
m.appendReplacement(sb, Matcher.quoteReplacement(newWord));
}
}
}
m.appendTail(sb);
return sb.toString();
}
public static void main(String[] args) throws Exception {
String rawText = "[{MY NAME IS \"NAME\"}]";
rawText += " bla bla bla [{I LIVE IN \"SOME RANDOM CITY\" WHERE THE PIZZA IS GREAT!}]";
rawText += " bla bla etc etc [{I LOVE \"A HOBBY\"}]";
System.out.println(rawText);
Pattern searchPattern = Pattern.compile("\\[\\{(.+?)\\}\\]");
Matcher matcherBoundary = searchPattern.matcher(rawText);
List<String> replacement = new ArrayList<String>();
replacement.add("BOB");
replacement.add("LOS ANGELES");
replacement.add("PUPPIES");
int counter = 0;
while (matcherBoundary.find()) {
String result = Test.replace(matcherBoundary.group(1), null, replacement.get(counter), "\"([^\"]*)\"");
System.out.println(result);
counter++;
}
}
The output I get is:
**Raw Text**
[{MY NAME IS "NAME"}] bla bla bla [{I LIVE IN "SOME RANDOM CITY" WHERE THE PIZZA IS GREAT!}] bla bla etc etc [{I LOVE "A HOBBY"}]
**In Every Loop**
MY NAME IS BOB
I LIVE IN LOS ANGELES WHERE THE PIZZA IS GREAT!
I LOVE PUPPIES

Removing strings from another string in java

Let's say I have this list of words:
String[] stopWords = new String[]{"i","a","and","about","an","are","as","at","be","by","com","for","from","how","in","is","it","not","of","on","or","that","the","this","to","was","what","when","where","who","will","with","the","www"};
Then I have this text:
String text = "I would like to do a nice novel about nature AND people";
Is there a method somewhere out there that matches the stopWords and removes them while ignoring case, like this?
String noStopWordsText = remove(text, stopWords);
Result:
" would like do nice novel nature people"
If you know of a regex that would work, great, but I would really prefer something like a Commons solution that is a bit more performance oriented.
BTW, right now I'm using this Commons method, which lacks proper case-insensitive handling:
private static final String[] stopWords = new String[]{"i", "a", "and", "about", "an", "are", "as", "at", "be", "by", "com", "for", "from", "how", "in", "is", "it", "not", "of", "on", "or", "that", "the", "this", "to", "was", "what", "when", "where", "who", "will", "with", "the", "www", "I", "A", "AND", "ABOUT", "AN", "ARE", "AS", "AT", "BE", "BY", "COM", "FOR", "FROM", "HOW", "IN", "IS", "IT", "NOT", "OF", "ON", "OR", "THAT", "THE", "THIS", "TO", "WAS", "WHAT", "WHEN", "WHERE", "WHO", "WILL", "WITH", "THE", "WWW"};
private static final String[] blanksForStopWords = new String[]{"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""};
noStopWordsText = StringUtils.replaceEach(text, stopWords, blanksForStopWords);
Create a regular expression with your stop words, make it case insensitive, and then use the matcher's replaceAll method to replace all matches with an empty string:
import java.util.regex.*;
Pattern stopWords = Pattern.compile("\\b(?:i|a|and|about|an|are|...)\\b\\s*", Pattern.CASE_INSENSITIVE);
Matcher matcher = stopWords.matcher("I would like to do a nice novel about nature AND people");
String clean = matcher.replaceAll("");
The ... in the pattern is just me being lazy; continue the list with the rest of your stop words.
Another method is to loop over all the stop words and use String's replaceAll method. The problem with that approach is that replaceAll will compile a new regular expression for each call, so it's not very efficient to use in loops. Also, you can't pass the flag that makes the regular expression case insensitive when you use String's replaceAll.
Edit: I added \b around the pattern to make it match whole words only. I also added \s* to make it glob up any spaces after, that's maybe not necessary.
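Rather than typing the alternation by hand, the pattern could also be built from the existing stopWords array. The following is a sketch with invented class and method names; Pattern.quote is there so that a stop word containing regex metacharacters is still matched literally.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper that builds the stop-word regex from an array.
class StopWordRegex {
    static Pattern compile(String[] stopWords) {
        // quote each word and join them into one alternation: w1|w2|w3...
        String alternatives = Arrays.stream(stopWords)
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternatives + ")\\b\\s*",
                Pattern.CASE_INSENSITIVE);
    }

    static String remove(String text, String[] stopWords) {
        return compile(stopWords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String[] stopWords = {"i", "a", "and", "about", "an", "to"};
        System.out.println(remove(
                "I would like to do a nice novel about nature AND people", stopWords));
    }
}
```

If the same stop-word list is used for many texts, compile the pattern once and reuse it instead of calling remove() with the array each time.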
You can build a regular expression that matches all the stop words [for example " a ", note the spaces here] and end up with
str.replaceAll(regexpression,"");
OR
String[] stopWords = new String[]{" i ", " a ", " and ", " about ", " an ", " are ", " as ", " at ", " be ", " by ", " com ", " for ", " from ", " how ", " in ", " is ", " it ", " not ", " of ", " on ", " or ", " that ", " the ", " this ", " to ", " was ", " what ", " when ", " where ", " who ", " will ", " with ", " the ", " www "};
String text = " I would like to do a nice novel about nature AND people ";
for (String stopword : stopWords) {
text = text.replaceAll("(?i)"+stopword, " ");
}
System.out.println(text);
output:
would like do nice novel nature people
There might be a better way.
This is a solution that does not use regular expressions. I think it's inferior to my other answer because it is much longer and less clear, but if performance is really, really important then this is O(n) where n is the length of the text.
Set<String> stopWords = new HashSet<String>();
stopWords.add("a");
stopWords.add("and");
// and so on ...
String sampleText = "I would like to do a nice novel about nature AND people";
StringBuffer clean = new StringBuffer();
int index = 0;
while (index < sampleText.length()) {
// the only word delimiter supported is space, if you want other
// delimiters you have to do a series of indexOf calls and see which
// one gives the smallest index, or use regex
int nextIndex = sampleText.indexOf(" ", index);
if (nextIndex == -1) {
// no more spaces: the last word runs to the end of the text
nextIndex = sampleText.length();
}
String word = sampleText.substring(index, nextIndex);
if (!stopWords.contains(word.toLowerCase())) {
clean.append(word);
if (nextIndex < sampleText.length()) {
// this adds the word delimiter, e.g. the following space
clean.append(sampleText.substring(nextIndex, nextIndex + 1));
}
}
index = nextIndex + 1;
}
System.out.println("Stop words removed: " + clean.toString());
Split the text on whitespace. Then loop through the array and append each word to a StringBuilder only if it is not one of the stop words.
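A minimal sketch of that idea, assuming the stop words are kept in lowercase in a Set (the class and method names are made up for this example):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper illustrating the split-and-filter approach.
class StopWordSplit {
    static String remove(String text, Set<String> stopWords) {
        StringBuilder clean = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            // stopWords is assumed to hold lowercase entries only
            if (stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (clean.length() > 0) {
                clean.append(' '); // single space between kept words
            }
            clean.append(word);
        }
        return clean.toString();
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("i", "a", "and", "about", "to"));
        System.out.println(remove(
                "I would like to do a nice novel about nature AND people", stopWords));
    }
}
```

Note that this only treats whitespace as a delimiter, so a stop word followed by punctuation (for example "and,") would be kept.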
