I'm trying to figure out a way to use regular expressions to find duplicate words on a webpage. I'm completely clueless, and I apologise in advance if I'm using incorrect terminology.
So far I've found the following regular expressions, which work well but only on consecutive words (e.g. hello hello), not on words that appear in different parts of the webpage or are separated by another word (e.g. hello food hello):
\b(\w+)(\s+\1\b)*
\b(\w+(?:\s*\w*))\s+\1\b
I would be super grateful to anyone that can help, I realise I might not be in the right place since I'm basically a noob.
Capture the first word (surrounded by word boundaries) in a group, and then backreference it later in a lookahead, after repeating optional characters in between:
\b(\w+)\b(?=.*\b\1\b)
https://regex101.com/r/TcS1UW/3
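A minimal Java sketch of this lookahead approach, assuming the page text has already been extracted into a String. The names DuplicateWordFinder and findDuplicates are made up for illustration. Pattern.DOTALL is added here because without it `.` stops at line breaks, so a repeat on a later line would be missed. The match is case-sensitive, and each candidate word rescans the rest of the text, so this may be slow on large pages.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateWordFinder {
    // A word, followed (anywhere later in the text) by the same whole word.
    private static final Pattern REPEATED =
            Pattern.compile("\\b(\\w+)\\b(?=.*\\b\\1\\b)", Pattern.DOTALL);

    // Returns each word that occurs again later in the text, in order of first appearance.
    static Set<String> findDuplicates(String text) {
        Set<String> duplicates = new LinkedHashSet<>();
        Matcher m = REPEATED.matcher(text);
        while (m.find()) {
            duplicates.add(m.group(1));
        }
        return duplicates;
    }

    public static void main(String[] args) {
        System.out.println(findDuplicates("hello food hello world food"));
    }
}
```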
I would use Jsoup to get the text from the webpage. Then you could keep track of the counts using a HashMap, and then search the map for any number of occurrences you want:
String url = "https://en.wikipedia.org/wiki/Jsoup";
String body = Jsoup.connect(url).get().body().text();
Map<String,Integer> counts = new HashMap<>();
for ( String word : body.split(" ") )
{
    counts.merge(word, 1, Integer::sum);
}
for ( String key : counts.keySet() )
{
    if ( counts.get(key) >= 2 )
    {
        System.out.println(key + " occurs " + counts.get(key) + " times.");
    }
}
You may need to clean up the map to get rid of some entries that aren't words, but this will get you most of the way.
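One possible way to do that cleanup, sketched under the assumption that a "word" is any run of letters or digits and that case should be ignored. WordCounter and countWords are illustrative names, and Jsoup is left out so the snippet runs on a plain String.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordCounter {
    // Counts words after lower-casing and splitting on anything that is not a letter or digit.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            // split() can yield an empty first token when the text starts with punctuation
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        countWords("Hello, hello! Food (food) and drink.").forEach((word, n) -> {
            if (n >= 2) {
                System.out.println(word + " occurs " + n + " times.");
            }
        });
    }
}
```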
I'm new to using regex in Java and am having trouble getting my regular expression to work.
I want to keep only words of at least 3 characters in a string; any word with fewer than 3 characters should be deleted.
Here's my string:
It might be more sensible for real users if I also included a lower limit on the number of letters.
The output I want:
might more sensible for real users also included lower limit the number letters.
I did some googling, but nothing I tried works.
Here's the complete code (1-5 are the regexes I've tried):
String input = "It might be more sensible for real users if I also included a lower limit on the number of letters.";
//1. /^[a-zA-Z]{3,}$/
//2. /^[a-zA-Z]{3,30}$/
//3. \\b[a-zA-Z]{4,30}\\b
//4. ^\\W*(?:\\w+\\b\\W*){3,30}$
//5. [+]?(?:[a-zA-Z]\\s*){3,30}
String output = input.replaceAll("/^[a-zA-Z]{3,}$/", "");
System.out.println(output);
You can try this:
package com.stackoverflow.answer;
public class RegexTest {
    public static void main(String[] args) {
        String input = "It might be more sensible for real users if I also included a lower limit on the number of letters.";
        System.out.println("BEFORE: " + input);
        input = input.replaceAll("\\b[\\w']{1,2}\\b", "").replaceAll("\\s{2,}", " ");
        System.out.println("AFTER: " + input);
    }
}
You can use \\w{1,2} to match any 1-2 word characters. You then need to make sure they are not adjacent to other word characters before removing them, so you check for non-word characters (\\W) or the beginning or end of the line (^ and $), like so:
String output = input.replaceAll("(^|\\W)\\w{1,2}($|\\W)", " ");
Note that the replacement is a single space: it stands in for the delimiters on either side of the removed word, which the match consumes along with it.
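One caveat with a pattern of this shape: because each match consumes the delimiter on both sides, two short words in a row (such as "if I") share a single space, and the second one can be skipped. A hedged alternative, not the answerer's code, is to check the neighbours with lookarounds so nothing but the word and its trailing whitespace is consumed. ShortWordRemover and removeShortWords are hypothetical names.

```java
class ShortWordRemover {
    // Removes words of 1-2 word characters, plus the whitespace that follows them.
    // (?<!\w) and (?!\w) only inspect the neighbours; they do not consume them.
    static String removeShortWords(String input) {
        return input.replaceAll("(?<!\\w)\\w{1,2}(?!\\w)\\s*", "").trim();
    }

    public static void main(String[] args) {
        String input = "It might be more sensible for real users if I also included "
                + "a lower limit on the number of letters.";
        System.out.println(removeShortWords(input));
    }
}
```

Contractions such as "I'm" are split at the apostrophe by \w, so they would need separate handling.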
I was trying to answer a question recently and while attempting to solve it, I ran into a question of my own.
Given the following code
private void regexample(){
    String x = "a3ab4b5";
    Pattern p = Pattern.compile("(\\D+(\\d+)\\D+){2}");
    Matcher m = p.matcher(x);
    while(m.find()){
        for(int i=0;i<=m.groupCount();i++){
            System.out.println("Group " + i + " = " + m.group(i));
        }
    }
}
And the output
Group 0 = a3ab4b
Group 1 = b4b
Group 2 = 4
Is there any straight-forward way I'm missing to get the value 3? The pattern should look for two occurrences of (\\D+(\\d+)\\D+) back-to-back, and a3a is part of the match. I realize I can change expression to (\\D+(\\d+)\\D+) and then look for all matches, but that isn't technically the same thing. Is the only way to do a double search? ie: Use the given pattern to match the string and then search again for each count of the outer group?
I guessed that the first values were overwritten with the second, but as I'm not that great with regex, I was hoping there was something I was missing.
A repeated capturing group keeps only the text from its last repetition, so with Java's regex engine (and most others) you cannot recover every occurrence from a single quantified group. You could write the group out twice instead:
Pattern.compile("(\\D+(\\d+)\\D+)(\\D+(\\d+)\\D+)");
Now there are four groups instead of two, so groups 2 and 4 hold the values you expected (3 and 4).
This question deals with a similar problem.
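A small runnable sketch of the unrolled pattern, using the sample string from the question. RepeatedGroupDemo and captureBoth are illustrative names only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // The group is written out twice, so each repetition gets its own group numbers.
    private static final Pattern TWO_RUNS =
            Pattern.compile("(\\D+(\\d+)\\D+)(\\D+(\\d+)\\D+)");

    // Returns the digits captured by the first and the second repetition.
    static String[] captureBoth(String input) {
        Matcher m = TWO_RUNS.matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(2), m.group(4) };
    }

    public static void main(String[] args) {
        for (String digits : captureBoth("a3ab4b5")) {
            System.out.println(digits);
        }
    }
}
```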
I am processing tweets using a map reduce job. One of the things I want to do is censor abusive words. When I test my code locally it works as desired, but when I run it on the whole data set it censors some abusive words and misses others. The data is 1TB in total (800 files), so I am not able to find that particular tweet in its raw form (JSON) to test it locally. However, I do have the tweet text (not the whole JSON) that came out of my map reduce program uncensored. To test, I put that text in the tweet text field of some other tweet's JSON, and the program correctly censored the abusive word. Can you suggest a strategy for finding the bug? Or if you can spot a bug in my code just by looking at it, that would be great.
This function loops through all the words of a tweet (the tweet is split on non-alphanumeric characters):
public static String censorText(String text, String textWords[], Set banned) {
    StringBuilder builder = new StringBuilder(text);
    textWords = getTextArray(text);
    for (int i = 0; i < textWords.length; i++) {
        if (banned.contains(textWords[i].toLowerCase())) {
            String cleanedWord = cencor(textWords[i]);
            // compile a pattern with banned word
            List<Integer> indexList = getIndexes(builder, textWords[i]);
            replaceWithCleanWord(builder, indexList, cleanedWord);
        }
    }
    return builder.toString();
}
// function to find the position of abuse word in the tweet text so that
// it can be replaced by censored word
private static List<Integer> getIndexes(StringBuilder builder, String string) {
    List<Integer> indexes = new ArrayList<Integer>();
    String word = "(" + string.charAt(0) + ")" + string.substring(1);
    System.out.println("word to match" +word);
    Pattern p = Pattern.compile("(?<=^|[^a-zA-Z\\d])" + word + "(?=$|[^a-zA-Z\\d])");
    Matcher m = p.matcher(builder.toString());
    while (m.find()) {
        indexes.add(m.start());
    }
    return indexes;
}
Sample text I want to censor:
"text":"Gracias a todos los seguidores de cuantoporno y http://t.co/, #sex #sexo #porn #porno #pussy #xxx;"
Censor a word only if it is surrounded by special characters or spaces:
"text":"Gracias a todos los seguidores de cuantoporno y http://t.co/ , #s*x #sexo #porn #porno #p***y #xxx;"
The first text is the output of my map reduce job, but the expected output is the second text. When I feed the same text to the same Java file on my local machine, I get the expected result. What could be the problem?
You do not use any regex feature other than lookahead/lookbehind. Lookahead and lookbehind are not optimized in Java regexp search. You could just as well search for the string and then verify that the characters before and after it are acceptable.
This would save a lot of performance:
compilation of regular expressions is expensive (compared to string search compilation)
search with regular expressions is even more expensive (compared to string search)
So if you want to solve the problem: use a string search algorithm (such as Boyer-Moore-Horspool).
And it gets even more efficient if you use a multi-string search algorithm, like Set-Horspool or Wu-Manber. Such an algorithm will deliver all indexes of all words with a performance of nearly O(n) (n is the length of the text).
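A rough sketch of the "search, then check the neighbours" idea. String.indexOf stands in here for a dedicated algorithm such as Boyer-Moore-Horspool, so this shows the structure rather than the performance gain. The boundary rule mirrors the question's [^a-zA-Z\d] check; LiteralWordSearch and findWholeWord are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class LiteralWordSearch {
    // Returns the start index of every occurrence of word that is not
    // directly preceded or followed by an ASCII letter or digit.
    static List<Integer> findWholeWord(String text, String word) {
        List<Integer> indexes = new ArrayList<>();
        if (word.isEmpty()) {
            return indexes;
        }
        int start = text.indexOf(word);
        while (start >= 0) {
            int end = start + word.length();
            boolean leftOk = start == 0 || !isAsciiAlnum(text.charAt(start - 1));
            boolean rightOk = end == text.length() || !isAsciiAlnum(text.charAt(end));
            if (leftOk && rightOk) {
                indexes.add(start);
            }
            start = text.indexOf(word, start + 1);
        }
        return indexes;
    }

    private static boolean isAsciiAlnum(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        System.out.println(findWholeWord("#sex #sexo sex", "sex"));
    }
}
```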
From a server, I get strings of the following form:
String x = "fixedWord1:var1 data[[fixedWord2:var2 fixedWord3:var3 data[[fixedWord4] [fixedWord5=var5 fixedWord6=var6 fixedWord7=var7]]] , [fixedWord2:var2 fixedWord3:var3 data[[fixedWord4][fixedWord5=var5 fixedWord6=var6 fixedWord7=var7]]]] fixedWord8:fixedWord8";
(only spaces divide groups of word-var pairs)
Later, I want to store them in a Hashmap, like myHashMap.put(fixedWord1, var1); and so on.
Problem:
Inside the first "data[......]"-tag, the number of other "data[..........]"-tags is variable, and I don't know the length of the string in advance.
I don't know how to process such Strings without resorting to String.split(), which is discouraged by our assignment task givers (university).
I have searched the internet and couldn't find appropriate websites explaining such things.
It would be of great help, if experienced people could give me some links to websites or something like a "diagrammatic plan" so that I could code something.
EDIT:
I made a mistake in the string (off-topic-begin "please don't lynch" off-topic-end). The right string is (changed fixedWord7=var7 ---to---> fixedWord7=[var7]):
String x = "fixedWord1:var1 data[[fixedWord2:var2 fixedWord3:var3 data[[fixedWord4] [fixedWord5=var5 fixedWord6=var6 fixedWord7=[var7]]]] , [fixedWord2:var2 fixedWord3:var3 data[[fixedWord4][fixedWord5=var5 fixedWord6=var6 fixedWord7=[var7]]]]] fixedWord8:fixedWord8";
I assume your string always follows the same pattern, with "data", "[" and "]" in it, and that the variable names/values never contain these strings.
Remove the strings "data[", "[", "]", and "," from the original string. Use replace() rather than replaceAll() here: replaceAll() treats its first argument as a regex, where an unescaped "[" is a syntax error.
replace("data[", "")
replace("[", "")
etc
Separate the string on spaces (" ") using StringTokenizer, or loop through the String char by char.
Then you will get an array of strings like
fixedWord1:var1
fixedWord2:var2
......
fixedWord4
fixedWord5=var5
......
Then separate each substring again on ":" or "=" and put the name/value into the Map.
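The steps above might be sketched as follows. Two details are my own assumptions rather than part of the answer: "=[" is collapsed to "=" first so that fixedWord7=[var7] stays in one token, and the remaining brackets become spaces (not empty strings) so that "[fixedWord4][fixedWord5=var5" does not fuse into one token. Tokens without ":" or "=" are simply skipped here. BracketStripParser and parse are made-up names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

class BracketStripParser {
    static Map<String, String> parse(String input) {
        // String.replace works on literal text, so "[" needs no escaping.
        String flat = input.replace("data[", " ")
                           .replace("=[", "=")
                           .replace("[", " ")
                           .replace("]", " ")
                           .replace(",", " ");
        Map<String, String> result = new LinkedHashMap<>();
        StringTokenizer tokens = new StringTokenizer(flat, " ");
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            int sep = token.indexOf(':');
            if (sep < 0) {
                sep = token.indexOf('=');
            }
            if (sep > 0) {
                // A repeated key overwrites the earlier value.
                result.put(token.substring(0, sep), token.substring(sep + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("fixedWord1:var1 data[[fixedWord4][fixedWord5=[var5]]] fixedWord8:fixedWord8"));
    }
}
```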
The problem is not entirely clear, but maybe something like this will work for you:
Pattern p = Pattern.compile("\\b(\\w+)[:=]\\[?(\\w+)");
Matcher m = p.matcher( x );
while( m.find() ) {
    System.out.println( "matched: " + m.group(1) + " - " + m.group(2) );
    hashMap.put ( m.group(1), m.group(2) );
}
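Note that the sample string repeats keys such as fixedWord2, and HashMap.put keeps only the last value per key. If every occurrence matters, one option is to collect the values in a list per key. This is only a sketch with the same regex; PairExtractor and extractPairs are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PairExtractor {
    // name, then ':' or '=', then an optional '[', then the value
    private static final Pattern PAIR = Pattern.compile("\\b(\\w+)[:=]\\[?(\\w+)");

    // Maps each name to all values seen for it, in order of appearance.
    static Map<String, List<String>> extractPairs(String input) {
        Map<String, List<String>> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            pairs.computeIfAbsent(m.group(1), key -> new ArrayList<>()).add(m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(extractPairs("a:1 data[[b=2] , [b=[3]]] a:4"));
    }
}
```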
At my job today, I was made aware of a little error in our pages' titles. Our site is built using .jsp pages and for the titles of our product pages we use
In our admin (where we can set up the titles for each of the products), we would normally add in * anyone ever run into this issue before, and if so, does anyone know of a way to fix the double pipes issue I have encountered?
The problem is that the first argument of replaceAll is a regular expression. The "|" is a reserved symbol in regular expressions, and you must escape it if you want to use it as a string literal. You can work around this, for example, this way:
String[] words = str.split(" ");
for (int i = 0; i < words.length; i++) {
    if (words[i].length() > 0) {
        if (!(words[i].substring(0, 1).equals("|"))) {
            sb.append(words[i].replaceFirst(words[i].substring(0, 1), words[i].substring(0, 1).toUpperCase()) + " ");
        } else {
            sb.append(words[i] + " ");
        }
    }
}
Try using the html escape code for the pipe character ¦.
Your title would be:
"Monkey Thank You ¦ Monkey Thank You Cards"
I think the issue is in the fact that replaceFirst() takes a regex as parameter and a replacement string. Because you push in the first character as is for the regex parameter, what happens with the vertical bar is (omitting adding to the StringBuffer) equivalent to:
String addedToBuffer = "|".replaceFirst("|", "|".toUpperCase());
What happens then, is that we have a regex which matches the empty string or the empty string. Well, any string matches the empty string regex. So the match gets replaced by "|" (to upper case). So "|".replaceFirst("|", "|".toUpperCase()) expands to "||". So the append() call is given the parameter of "|| ".
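The effect is easy to reproduce in isolation. A small sketch, where PipeRegexDemo and naiveCapitalize are illustrative names and the method mimics the replaceFirst() call described above:

```java
class PipeRegexDemo {
    // Uses the first character as a regex, exactly as the problematic code does.
    static String naiveCapitalize(String word) {
        String first = word.substring(0, 1);
        return word.replaceFirst(first, first.toUpperCase());
    }

    public static void main(String[] args) {
        System.out.println(naiveCapitalize("monkey")); // ordinary letter: works as intended
        System.out.println(naiveCapitalize("|"));      // "|" is an empty alternation: the pipe is doubled
    }
}
```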
You can fix your algorithm in two ways:
Fix the regex automatically, use literal notation in between \Q and \E. So your regex to pass to replaceFirst() becomes something like "\\Q"+ literal + "\\E".
Realise that you do not need regexes in the first place. Instead use two append() operations. One to append() the case converted first character of the item to add, the other to append the rest. This looks like this:
for(String s: items) {
    if(s.equals("")) {
        sb.append(" ");
    }
    else {
        sb.append(Character.toUpperCase(s.charAt(0)));
        if(s.length() > 1) {
            sb.append(s.substring(1));
        }
        sb.append(" ");
    }
}
The second approach is probably much easier to follow as well.
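For completeness, the first approach (quoting the literal) could look roughly like this. Pattern.quote() wraps its argument in \Q...\E, and Matcher.quoteReplacement() does the matching job for the replacement string, where "$" and "\" are special. QuotedCapitalizer and capitalizeFirst are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedCapitalizer {
    // Upper-cases the first character while treating it as literal text, not as a regex.
    static String capitalizeFirst(String word) {
        if (word.isEmpty()) {
            return word;
        }
        String first = word.substring(0, 1);
        return word.replaceFirst(Pattern.quote(first),
                Matcher.quoteReplacement(first.toUpperCase()));
    }

    public static void main(String[] args) {
        System.out.println(capitalizeFirst("monkey"));
        System.out.println(capitalizeFirst("|"));
    }
}
```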