Let's say I have this list of words:
String[] stopWords = new String[]{"i","a","and","about","an","are","as","at","be","by","com","for","from","how","in","is","it","not","of","on","or","that","the","this","to","was","what","when","where","who","will","with","the","www"};
Then I have this text:
String text = "I would like to do a nice novel about nature AND people";
Is there a method out there that matches the stop words and removes them while ignoring case, something like this?
String noStopWordsText = remove(text, stopWords);
Result:
" would like do nice novel nature people"
If you know of a regex that would work, great, but I would really prefer something like a Commons solution that is a bit more performance oriented.
BTW, right now I'm using this Commons method, which lacks proper case-insensitive handling:
private static final String[] stopWords = new String[]{"i", "a", "and", "about", "an", "are", "as", "at", "be", "by", "com", "for", "from", "how", "in", "is", "it", "not", "of", "on", "or", "that", "the", "this", "to", "was", "what", "when", "where", "who", "will", "with", "the", "www", "I", "A", "AND", "ABOUT", "AN", "ARE", "AS", "AT", "BE", "BY", "COM", "FOR", "FROM", "HOW", "IN", "IS", "IT", "NOT", "OF", "ON", "OR", "THAT", "THE", "THIS", "TO", "WAS", "WHAT", "WHEN", "WHERE", "WHO", "WILL", "WITH", "THE", "WWW"};
private static final String[] blanksForStopWords = new String[]{"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""};
noStopWordsText = StringUtils.replaceEach(text, stopWords, blanksForStopWords);
Create a regular expression from your stop words, make it case insensitive, and then use the matcher's replaceAll method to replace all matches with an empty string:
import java.util.regex.*;
Pattern stopWords = Pattern.compile("\\b(?:i|a|and|about|an|are|...)\\b\\s*", Pattern.CASE_INSENSITIVE);
Matcher matcher = stopWords.matcher("I would like to do a nice novel about nature AND people");
String clean = matcher.replaceAll("");
The ... in the pattern is just me being lazy; continue the list of stop words there.
Another method is to loop over all the stop words and use String's replaceAll method. The problem with that approach is that replaceAll compiles a new regular expression on every call, so it's not very efficient in loops. Also, String's replaceAll does not accept the Pattern.CASE_INSENSITIVE flag as an argument; you would have to embed (?i) in the pattern instead.
Edit: I added \b around the pattern to make it match whole words only. I also added \s* so that it swallows any spaces after the word; that may not be necessary.
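Rather than typing the alternation by hand, the pattern can be built from the stop word array. This is a minimal sketch of that idea, not part of the original answer: the class name StopWordRegex and the helper methods compile and remove are made up for illustration, and only a shortened stop word list is used.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StopWordRegex {

    // Builds one case-insensitive pattern of the form \b(?:w1|w2|...)\b\s*
    // Pattern.quote keeps any regex metacharacters in a stop word literal.
    static Pattern compile(String[] stopWords) {
        String alternation = Arrays.stream(stopWords)
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s*",
                Pattern.CASE_INSENSITIVE);
    }

    // Removes every whole-word match together with the whitespace after it.
    static String remove(String text, Pattern stopWords) {
        return stopWords.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Shortened list for the example; use the full array in practice.
        String[] words = {"i", "a", "and", "about", "to"};
        Pattern pattern = compile(words); // compile once, reuse for every text
        String text = "I would like to do a nice novel about nature AND people";
        System.out.println(remove(text, pattern));
        // prints: would like do nice novel nature people
    }
}
```

Compiling the pattern once and reusing it avoids the per-call compilation cost mentioned above.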
You can build a regular expression that matches all the stop words (for example "a ", note the trailing space) and end up with
str.replaceAll(regexpression,"");
OR
String[] stopWords = new String[]{" i ", " a ", " and ", " about ", " an ", " are ", " as ", " at ", " be ", " by ", " com ", " for ", " from ", " how ", " in ", " is ", " it ", " not ", " of ", " on ", " or ", " that ", " the ", " this ", " to ", " was ", " what ", " when ", " where ", " who ", " will ", " with ", " the ", " www "};
String text = " I would like to do a nice novel about nature AND people ";
for (String stopword : stopWords) {
text = text.replaceAll("(?i)"+stopword, " ");
}
System.out.println(text);
output:
would like do nice novel nature people
There might be a better way.
This is a solution that does not use regular expressions. I think it's inferior to my other answer because it is much longer and less clear, but if performance is really, really important then this is O(n) where n is the length of the text.
Set<String> stopWords = new HashSet<String>();
stopWords.add("a");
stopWords.add("and");
// and so on ...
String sampleText = "I would like to do a nice novel about nature AND people";
StringBuilder clean = new StringBuilder();
int index = 0;
while (index < sampleText.length()) {
    // the only word delimiter supported is space, if you want other
    // delimiters you have to do a series of indexOf calls and see which
    // one gives the smallest index, or use regex
    int nextIndex = sampleText.indexOf(" ", index);
    if (nextIndex == -1) {
        // no more spaces: the last word runs to the end of the text
        nextIndex = sampleText.length();
    }
    String word = sampleText.substring(index, nextIndex);
    if (!stopWords.contains(word.toLowerCase())) {
        clean.append(word);
        if (nextIndex < sampleText.length()) {
            // this adds the word delimiter, e.g. the following space
            clean.append(sampleText.substring(nextIndex, nextIndex + 1));
        }
    }
    index = nextIndex + 1;
}
System.out.println("Stop words removed: " + clean.toString());
Split the text on whitespace. Then loop through the resulting array and append each word to a StringBuilder only if it is not one of the stop words.
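A minimal sketch of the split-and-append idea, assuming words are separated by whitespace only and that the stop words are stored in lower case. The class name StopWordSplit and the method remove are illustrative, not from any library.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopWordSplit {

    // Keeps every word that is not a stop word and rejoins them with single spaces.
    static String remove(String text, Set<String> stopWords) {
        StringBuilder clean = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            // Stop words are stored in lower case, so compare in lower case.
            if (word.isEmpty() || stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (clean.length() > 0) {
                clean.append(' ');
            }
            clean.append(word);
        }
        return clean.toString();
    }

    public static void main(String[] args) {
        // Shortened list for the example.
        Set<String> stopWords = new HashSet<>(Arrays.asList("i", "a", "and", "about", "to"));
        String text = "I would like to do a nice novel about nature AND people";
        System.out.println(remove(text, stopWords));
        // prints: would like do nice novel nature people
    }
}
```

Note that this collapses runs of whitespace into single spaces and treats punctuation attached to a word (such as "and,") as part of the word.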
Related
I am trying to split a sentence into a group of strings. I want to keep all words, punctuation and whitespace in an array.
For example:
"Hello! My name is John Doe."
Would be split into:
["Hello", "!", " ", "My", " ", "name", " ", "is", " ", "John", " ", "Doe"]
I currently have the following line of code breaking my sentence:
String[] fragments = sentence.split("(?<!^)\\b");
However, this runs into a problem: it treats a punctuation mark followed by whitespace as a single string. How do I modify my regex to account for this?
You can try the following regular expression:
(?<=\b|[^\p{L}])
"Hello! My name is John Doe.".split("(?<=\\b|[^\\p{L}])", 0)
// ⇒ ["Hello", "!", " ", "My", " ", "name", " ", "is", " ", "John", " ", "Doe", "."]
How can I remove the middle border of a JTable?
From this ..
To This ..
Here is my code:
Object rowData[][] = { { "", "", "", "", "", "", "" }, //4 empty row
{ "", "", "", "", "", "", "" },
{ "", "", "", "", "", "", "" },
{ "", "", "", "", "", "", "" }
};
Object columnNames[] = { "File Type", "Total File", "Size(GB)", " ",
"File Type", "Total File", "Size(GB)" };
JTable table = new JTable(rowData, columnNames);
JScrollPane scrollPane = new JScrollPane(table);
One way to do something similar is to draw two tables with spacing between them.
Note that it will take some work to keep the two tables in sync so that they behave as one when sorting and deleting.
I want to extract the content between the first nested braces and the second nested braces separately. I am totally stuck on this; can anyone help me? My file read.txt contains the data below. I read each line into a string s.
BufferedReader br=new BufferedReader(new FileReader("read.txt"));
while(br.ready())
{
String s=br.readLine();
System.out.println(s);
}
Output
{ { "John", "ran" }, { "NOUN", "VERB" } },
{ { "The", "dog", "jumped"}, { "DET", "NOUN", "VERB" } },
{ { "Mike","lives","in","Poland"}, {"NOUN","VERB","DET","NOUN"} },
i.e. my output should look like
"John", "ran"
"NOUN", "VERB"
"The", "dog", "jumped"
"DET", "NOUN", "VERB"
"Mike","lives","in","Poland"
"NOUN","VERB","DET","NOUN"
Use this regex:
(?<=\{)(?!\s*\{)[^{}]+
See the matches in the Regex Demo.
In Java:
Pattern regex = Pattern.compile("(?<=\\{)(?!\\s*\\{)[^{}]+");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// matched text: regexMatcher.group()
}
Explanation
The lookbehind (?<=\{) asserts that what precedes the current position is a {
The negative lookahead (?!\s*\{) asserts that what follows is not optional whitespace then {
[^{}]+ matches any chars that are not curlies
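For a runnable illustration, the sketch below applies that same regex to one line of the sample data. The class name InnerBraces and the method extract are made up for this example; trim() is added only to drop the spaces inside the braces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InnerBraces {

    // Text that follows a '{' but does not itself start another '{' group.
    private static final Pattern INNER =
            Pattern.compile("(?<=\\{)(?!\\s*\\{)[^{}]+");

    // Returns the content of every innermost {...} group found in the line.
    static List<String> extract(String line) {
        List<String> groups = new ArrayList<>();
        Matcher m = INNER.matcher(line);
        while (m.find()) {
            groups.add(m.group().trim());
        }
        return groups;
    }

    public static void main(String[] args) {
        String line = "{ { \"John\", \"ran\" }, { \"NOUN\", \"VERB\" } },";
        for (String group : extract(line)) {
            System.out.println(group);
        }
        // prints:
        // "John", "ran"
        // "NOUN", "VERB"
    }
}
```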
If you split on "}," then you get each set of words in a single string; after that it is just a matter of removing the curly braces.
As per your code:
BufferedReader br = new BufferedReader(new FileReader("read.txt"));
while (br.ready())
{
    String s = br.readLine();
    String[] words = s.split("},");
    for (int x = 0; x < words.length; x++) {
        // strip the remaining braces and surrounding spaces, then print the group
        String printme = words[x].replace("{", "").replace("}", "").trim();
        System.out.println(printme);
    }
}
You could always remove the opening brackets, then split by '},' which would leave you with the list of strings you've asked for. (If that is all one string, of course)
String s = input.replace("{","");
String[] splitString = s.split("},");
This would first remove the opening brackets:
"John", "ran" }, "NOUN", "VERB" } },
"The", "dog", "jumped"}, "DET", "NOUN", "VERB" } },
"Mike","lives","in","Poland"},"NOUN","VERB","DET","NOUN"} },
Then it would split by },
"John", "ran"
"NOUN", "VERB" }
"The", "dog", "jumped"
"DET", "NOUN", "VERB" }
"Mike","lives","in","Poland"
"NOUN","VERB","DET","NOUN"}
Then you just need to tidy them up with another replace!
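Putting the three steps together (remove the opening brackets, split on "},", tidy up with another replace) might look like the sketch below. It assumes the input is processed one line at a time; the class name SplitOnBraces and the method extract are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitOnBraces {

    // Applies the three steps to one line of the input.
    static List<String> extract(String line) {
        List<String> groups = new ArrayList<>();
        // Step 1: remove every opening brace.
        String noOpen = line.replace("{", "");
        // Step 2: split on "}," (the brace is escaped because split takes a regex).
        for (String part : noOpen.split("\\},")) {
            // Step 3: tidy up leftover closing braces and surrounding spaces.
            String tidy = part.replace("}", "").trim();
            if (!tidy.isEmpty()) {
                groups.add(tidy);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        String line = "{ { \"The\", \"dog\", \"jumped\"}, { \"DET\", \"NOUN\", \"VERB\" } },";
        for (String group : extract(line)) {
            System.out.println(group);
        }
        // prints:
        // "The", "dog", "jumped"
        // "DET", "NOUN", "VERB"
    }
}
```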
Another approach could be to search for a {...} substring with no inner { or } characters and take only its inner part, without the { and }.
A regex describing such a substring can look like
\\{(?<content>[^{}]+)\\}
Explanation:
\\{ is an escaped {, so it represents a literal { (unescaped, { can start a quantifier such as {x,y}, which is why it needs to be escaped)
(?<content>...) is a named capturing group; it stores only the part between { and }, so that we can use this part later (instead of the entire match, which would also include the { })
[^{}]+ represents one or more characters other than { and }
\\} is an escaped }, which means it represents a literal }
DEMO:
String input = "{ { \"John\", \"ran\" }, { \"NOUN\", \"VERB\" } },\r\n" +
"{ { \"The\", \"dog\", \"jumped\"}, { \"DET\", \"NOUN\", \"VERB\" } },\r\n" +
"{ { \"Mike\",\"lives\",\"in\",\"Poland\"}, {\"NOUN\",\"VERB\",\"DET\",\"NOUN\"} },";
Pattern p = Pattern.compile("\\{(?<content>[^{}]+)\\}");
Matcher m = p.matcher(input);
while(m.find()){
System.out.println(m.group("content").trim());
}
Output:
"John", "ran"
"NOUN", "VERB"
"The", "dog", "jumped"
"DET", "NOUN", "VERB"
"Mike","lives","in","Poland"
"NOUN","VERB","DET","NOUN"
For some reasons I have to use a specific string in my project. This is the text file (it's a JSON File):
{"algorithm":
[
{ "key": "onGapLeft", "value" : "moveLeft" },
{ "key": "onGapFront", "value" : "moveForward" },
{ "key": "onGapRight", "value" : "moveRight" },
{ "key": "default", "value" : "moveBackward" }
]
}
I've defined it in Java like this:
static String input = "{\"algorithm\": \n"+
"[ \n" +
"{ \"key\": \"onGapLeft\", \"value\" : \"moveLeft\" }, \n" +
"{ \"key\": \"onGapFront\", \"value\" : \"moveForward\" }, \n" +
"{ \"key\": \"onGapRight\", \"value\" : \"moveRight\" }, \n" +
"{ \"key\": \"default\", \"value\" : \"moveBackward\" } \n" +
"] \n" +
"}";
Now I have to isolate the keys and values in an array:
key[0] = onGapLeft; value[0] = moveLeft;
key[1] = onGapFront; value[1] = moveForward;
key[2] = onGapRight; value[2] = moveRight;
key[3] = default; value[3] = moveBackward;
I'm new to Java and don't understand the String class very well. Is there an easy way to get that result? That would really help me!
Thanks!
UPDATE:
I didn't explain it well enough, sorry. This program will run on a LEGO NXT robot. JSON won't work there the way I want it to, so I have to interpret this JSON file as a normal String! Hope that explains what I want :)
I propose a solution in several steps.
1) Let's get the different parts of your ~JSON String. We will use a pattern to get the different {.*} parts:
public static void main(String[] args) throws Exception{
List<String> lines = new ArrayList<String>();
Pattern p = Pattern.compile("\\{.*\\}");
Matcher matcher = p.matcher(input);
while (matcher.find()) {
lines.add(matcher.group());
}
}
(you should take a look at Pattern and Matcher)
Now, lines contains 4 Strings:
{ "key": "onGapLeft", "value" : "moveLeft" }
{ "key": "onGapFront", "value" : "moveForward" }
{ "key": "onGapRight", "value" : "moveRight" }
{ "key": "default", "value" : "moveBackward" }
Given a String like one of those, you can remove the curly brackets with a call to String#replaceAll():
List<String> cleanLines = new ArrayList<String>();
for(String line : lines) {
//replace curly brackets with... nothing.
//added a call to trim() in order to remove whitespace characters.
cleanLines.add(line.replaceAll("[{}]","").trim());
}
(You should take a look at String String#replaceAll(String regex))
Now, cleanLines contains :
"key": "onGapLeft", "value" : "moveLeft"
"key": "onGapFront", "value" : "moveForward"
"key": "onGapRight", "value" : "moveRight"
"key": "default", "value" : "moveBackward"
2) Let's parse one of those lines:
Given a line like:
"key": "onGapLeft", "value" : "moveLeft"
You can split it on the , character using String#split(). It will give you a String[] containing 2 elements:
//parts[0] = "key": "onGapLeft"
//parts[1] = "value" : "moveLeft"
String[] parts = line.split(",");
(You should take a look at String[] String#split(String regex))
Let's clean those parts (remove the double quotes) and assign them to some variables:
String keyStr = parts[0].replaceAll("\"","").trim(); //Now, keyStr = key: onGapLeft
String valueStr = parts[1].replaceAll("\"","").trim();//Now, valueStr = value : moveLeft
//Then, you split `key: onGapLeft` on the character `:`
String key = keyStr.split(":")[1].trim();
//And the same for `value : moveLeft`:
String value = valueStr.split(":")[1].trim();
That's it!
You should also take a look at Oracle's tutorial on regular expressions (This one is really important and you should invest time on it).
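The steps above can be combined into one small program that fills a key array and a value array. This is only a sketch under a few assumptions: every {...} entry sits on its own line, keys and values contain no commas or colons, and java.util.regex is available on the target runtime (the question does not say whether the NXT environment provides it). The class name KeyValueExtractor and the method parse are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValueExtractor {

    static final String INPUT = "{\"algorithm\": \n"
            + "[ \n"
            + "{ \"key\": \"onGapLeft\", \"value\" : \"moveLeft\" }, \n"
            + "{ \"key\": \"onGapFront\", \"value\" : \"moveForward\" }, \n"
            + "{ \"key\": \"onGapRight\", \"value\" : \"moveRight\" }, \n"
            + "{ \"key\": \"default\", \"value\" : \"moveBackward\" } \n"
            + "] \n"
            + "}";

    // Returns one {key, value} pair per {...} entry found on a single line.
    static List<String[]> parse(String input) {
        List<String[]> pairs = new ArrayList<>();
        // '.' does not match line breaks, so each match stays within one line.
        Matcher matcher = Pattern.compile("\\{.*\\}").matcher(input);
        while (matcher.find()) {
            // Step 1: drop the curly brackets and surrounding whitespace.
            String line = matcher.group().replaceAll("[{}]", "").trim();
            // Step 2: split into the "key" part and the "value" part.
            String[] parts = line.split(",");
            // Step 3: drop the quotes and keep what follows the colon.
            String key = parts[0].replace("\"", "").split(":")[1].trim();
            String value = parts[1].replace("\"", "").split(":")[1].trim();
            pairs.add(new String[] {key, value});
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String[]> pairs = parse(INPUT);
        String[] key = new String[pairs.size()];
        String[] value = new String[pairs.size()];
        for (int i = 0; i < pairs.size(); i++) {
            key[i] = pairs.get(i)[0];
            value[i] = pairs.get(i)[1];
            System.out.println("key[" + i + "] = " + key[i]
                    + "; value[" + i + "] = " + value[i]);
        }
    }
}
```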
You need to use a JSON parser library here. For example, with org.json you could parse it as
String input = "{\"algorithm\": \n"+
"[ \n" +
"{ \"key\": \"onGapLeft\", \"value\" : \"moveLeft\" }, \n" +
"{ \"key\": \"onGapFront\", \"value\" : \"moveForward\" }, \n" +
"{ \"key\": \"onGapRight\", \"value\" : \"moveRight\" }, \n" +
"{ \"key\": \"default\", \"value\" : \"moveBackward\" } \n" +
"] \n" +
"}";
JSONObject root = new JSONObject(input);
JSONArray map = root.getJSONArray("algorithm");
for (int i = 0; i < map.length(); i++) {
JSONObject entry = map.getJSONObject(i);
System.out.println(entry.getString("key") + ": "
+ entry.getString("value"));
}
Output :
onGapLeft: moveLeft
onGapFront: moveForward
onGapRight: moveRight
default: moveBackward
I have the following result obtained from a custom search.
{
"kind": "customsearch#search",
"url": {
"type": "application/json",
"template": "https://www.googleapis.com/customsearch/v1?q={searchTerms}& ={count?}& start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&nsc={nsc?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json"
},
"queries": {
"nextPage": [
{
"title": "Google Custom Search - flowers",
"totalResults": 10300000,
"searchTerms": "flowers",
"count": 10,
"startIndex": 11,
"inputEncoding": "utf8",
"outputEncoding": "utf8",
"cx": "013036536707430787589:_pqjad5hr1a"
}
],
"request": [
{
"title": "Google Custom Search - flowers",
"totalResults": 10300000,
"searchTerms": "flowers",
"count": 10,
"startIndex": 1,
"inputEncoding": "utf8",
"outputEncoding": "utf8",
"cx": "013036536707430787589:_pqjad5hr1a"
}
]
},
"context": {
"title": "Custom Search"
},
"items": [
{
"kind": "customsearch#result",
"title": "Flower - Wikipedia, the free encyclopedia",
"htmlTitle": "<b>Flower</b> - Wikipedia, the free encyclopedia",
"link": "http://en.wikipedia.org/wiki/Flower",
"displayLink": "en.wikipedia.org",
"snippet": "A flower, sometimes known as a bloom or blossom, is the reproductive structure found in flowering plants (plants of the division Magnoliophyta, ...",
"htmlSnippet": "A <b>flower</b>, sometimes known as a bloom or blossom, is the reproductive structure <br> found in flowering plants (plants of the division Magnoliophyta, <b>... </b>",
"pagemap": {
"RTO": [
{
"format": "image",
"group_impression_tag": "prbx_kr_rto_term_enc",
"Opt::max_rank_top": "0",
"Opt::threshold_override": "3",
"Opt::disallow_same_domain": "1",
"Output::title": "<b>Flower</b>",
"Output::want_title_on_right": "true",
"Output::num_lines1": "3",
"Output::text1": "꽃은 식물 에서 씨 를 만들어 번식 기능을 수행하는 생식 기관 을 말한다. 꽃을 형태학적으로 관찰하여 최초로 총괄한 사람은 식물계를 24강으로 분류한 린네 였다. 그 후 꽃은 식물분류학상중요한 기준이 되었다.",
"Output::gray1b": "- 위키백과",
"Output::no_clip1b": "true",
"UrlOutput::url2": "http://en.wikipedia.org/wiki/Flower",
"Output::link2": "위키백과 (영문)",
"Output::text2b": " ",
"UrlOutput::url2c": "http://ko.wikipedia.org/wiki/꽃",
"Output::link2c": "위키백과",
"result_group_header": "백과사전",
"Output::image_url": "http://www.gstatic.com/richsnippets/b/fcb6ee50e488743f.jpg",
"image_size": "80x80",
"Output::inline_image_width": "80",
"Output::inline_image_height": "80",
"Output::image_border": "1"
}
]
}
}
]
}
How can I extract all the https links from the above result using Java?
You could be lazy and ignore parsing JSON, treat the entire result as a String and just use a regular expression to match URLs.
String httpLinkPattern = "https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
Pattern p = Pattern.compile(httpLinkPattern);
Matcher m = p.matcher(jsonResult);
while (m.find())
System.out.println("Found http link: "+m.group());
If you would rather convert your response to a String and extract the URLs from it, as opposed to using a JSON library, then the code below should do.
public List<String> extractUrls(String input)
{
List<String> result = new ArrayList<String>();
Pattern pattern =
Pattern.compile("\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" + "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov"
+ "|mil|biz|info|mobi|name|aero|jobs|museum" + "|travel|[a-z]{2}))(:[\\d]{1,5})?"
+ "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" + "((\\?([-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?"
+ "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" + "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" + "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*"
+ "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b");
Matcher matcher = pattern.matcher(input);
while (matcher.find())
{
result.add(matcher.group());
}
return result;
}
Usage:
List<String> links = extractUrls(jsonResponseString);
for (String link : links)
{
System.out.println(link);
}
Please use a JSON parser to do this; I think that would be the best approach. See the linked example, "Java code to parse JSON".
https?://.*?\.(org|com|net|gov)/.*?(?=")
This regex will work for your purposes. http://regexr.com?30nm2
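As a sketch of how that regex could be applied in Java (the class name HttpLinkRegex, the method extract and the shortened sample string are made up for this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HttpLinkRegex {

    // The regex from above, escaped for a Java string literal.
    private static final Pattern LINK =
            Pattern.compile("https?://.*?\\.(org|com|net|gov)/.*?(?=\")");

    // Collects every match; each match ends right before the closing quote.
    static List<String> extract(String json) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(json);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // Shortened sample in the shape of the search result above.
        String json = "{ \"link\": \"http://en.wikipedia.org/wiki/Flower\", "
                + "\"displayLink\": \"en.wikipedia.org\" }";
        for (String link : extract(json)) {
            System.out.println(link);
        }
        // prints: http://en.wikipedia.org/wiki/Flower
    }
}
```

Keep in mind that this pattern only finds URLs whose host ends in one of the four listed top-level domains followed by a slash, so it is a quick filter rather than a general URL matcher.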