How to find the missing edge case in my code - java

I am processing tweets using a MapReduce job. One of the things I want to do is censor abusive words. When I test my code locally it works as desired, but when I run it on the whole data set it censors most abusive words and misses some. The data is 1 TB in total (800 files), so I cannot find the particular tweet in its raw JSON form to test it locally. However, I do have the tweet text (not the whole JSON) that came out of my MapReduce program uncensored. To test, I put that text in the tweet text field of another tweet's JSON, and the program correctly censored the abusive word. Can you suggest a strategy for finding the bug? Or, if you can spot the bug just by looking at my code, that would be great.
Function which loops through all the words of the tweet (the tweet is split on non-alphanumeric characters):
public static String censorText(String text, String textWords[], Set banned) {
StringBuilder builder = new StringBuilder(text);
textWords = getTextArray(text);
for (int i = 0; i < textWords.length; i++) {
if (banned.contains(textWords[i].toLowerCase())) {
String cleanedWord = cencor(textWords[i]);
// compile a pattern with banned word
List<Integer> indexList = getIndexes(builder, textWords[i]);
replaceWithCleanWord(builder, indexList, cleanedWord);
}
}
return builder.toString();
}
// Function to find the positions of an abusive word in the tweet text so that it can be replaced by the censored word
private static List<Integer> getIndexes(StringBuilder builder, String string) {
List<Integer> indexes = new ArrayList<Integer>();
String word = "(" + string.charAt(0) + ")" + string.substring(1);
System.out.println("word to match" +word);
Pattern p = Pattern.compile("(?<=^|[^a-zA-Z\\d])" + word + "(?=$|[^a-zA-Z\\d])");
Matcher m = p.matcher(builder.toString());
while (m.find()) {
indexes.add(m.start());
}
return indexes;
}
Sample text I want to censor:
"text":"Gracias a todos los seguidores de cuantoporno y http://t.co/, #sex #sexo #porn #porno #pussy #xxx;"
only if the word is surrounded by special characters or space then censor it
"text":"Gracias a todos los seguidores de cuantoporno y http://t.co/ , #s*x #sexo #porn #porno #p***y #xxx;"
The first text is the output of my map reduce but expected output is second text. When I input the same text on my local machine for the same java file I get the expected result. What could be the problem?

You do not use any regex feature other than lookahead/lookbehind. Lookahead and lookbehind are not optimized in Java regexp search. You could just as well search for the string and then check whether the characters before and after the match are acceptable.
This would save a lot of performance:
compilation of regular expressions is expensive (compared to string search compilation)
search with regular expressions is even more expensive (compared to string search)
So if you want to solve the problem: use a string search algorithm (such as Boyer-Moore-Horspool).
And it gets even more efficient if you use a multi-string search algorithm, like Set-Horspool or Wu-Manber. Such an algorithm will deliver all indexes of all words with a performance of nearly O(n) (n is the length of the text).
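The "search for the string, then verify the neighbours" idea could look roughly like the sketch below. It uses plain String.indexOf rather than Boyer-Moore-Horspool, and the class and method names (BoundarySearch, wholeWordIndexes) are made up for illustration. It treats only ASCII letters and digits as word characters, mirroring the [a-zA-Z\d] class in the question.

```java
import java.util.ArrayList;
import java.util.List;

final class BoundarySearch {
    // Returns every index at which `word` occurs in `text` as a whole word,
    // i.e. not directly preceded or followed by an ASCII letter or digit.
    static List<Integer> wholeWordIndexes(CharSequence text, String word) {
        List<Integer> indexes = new ArrayList<>();
        if (word.isEmpty()) {
            return indexes;
        }
        String haystack = text.toString();
        int from = 0;
        while (true) {
            int start = haystack.indexOf(word, from);
            if (start < 0) {
                break;
            }
            int end = start + word.length();
            boolean leftOk = start == 0 || !isWordChar(haystack.charAt(start - 1));
            boolean rightOk = end == haystack.length() || !isWordChar(haystack.charAt(end));
            if (leftOk && rightOk) {
                indexes.add(start);
            }
            from = start + 1; // continue searching after this occurrence's first character
        }
        return indexes;
    }

    // ASCII letters and digits only (an assumption taken from the question's regex).
    private static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }
}
```

Unlike the regex version, the word is matched literally here, so a token containing regex metacharacters cannot change the meaning of the search.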

Related

How to add a space after certain characters using regex Java

I have a string consisting of 18 characters, e.g. 'abcdefghijklmnopqr'. I need to add a blank space after the 5th character, then after the 9th character and after the 15th character, making it look like 'abcde fghi jklmno pqr'. Can I achieve this using a regular expression?
As regular expressions are not my cup of tea hence need help from regex gurus out here. Any help is appreciated.
Thanks in advance
A regex finds a match in a string; it does not perform a replacement by itself. You could however use a regex to find a certain matching substring and replace that, but you would still need a separate method for the replacement (making it a two-step algorithm).
Since you're not looking for a pattern in your string, but rather just the n-th char, a regex wouldn't be of much use; it would make things unnecessarily complex.
Here are some ideas on how you could implement a solution:
Use an array of characters to avoid creating redundant strings: create a character array and copy characters from the string before the given position, put the character at the position, copy the rest of the characters from the String,... continue until you reach the end of the string. After that construct the final string from that array.
Use the substring() method: concatenate the substring of the string before the position, the new character, the substring of the string after the position and before the next position,... and so on, until reaching the end of the original string.
Use a StringBuilder and its insert() method.
Note that:
The first idea might not be a suitable solution for very large strings. It needs an auxiliary array, which uses additional space.
The second idea creates redundant strings. Strings are immutable in Java, so every substring and concatenation allocates a new object. Creating temporary strings should be avoided.
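The third idea (StringBuilder.insert()) could be sketched like this. The class and method names are invented for the example, and the positions are assumed to be 0-based indexes into the original string that do not exceed its length.

```java
import java.util.Arrays;

final class SpaceInserter {
    // Inserts a space before each given 0-based index of the original string.
    // Positions are processed from right to left so that an insert never
    // shifts a position that still has to be handled.
    static String insertSpaces(String input, int... positions) {
        int[] sorted = positions.clone();
        Arrays.sort(sorted);
        StringBuilder builder = new StringBuilder(input);
        for (int i = sorted.length - 1; i >= 0; i--) {
            builder.insert(sorted[i], ' ');
        }
        return builder.toString();
    }
}
```

Calling insertSpaces("abcdefghijklmnopqr", 5, 9, 15) gives "abcde fghi jklmno pqr". A position beyond the end of the string makes insert() throw a StringIndexOutOfBoundsException, so real input would need a length check first.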
Yes, you can use regex groups to achieve that. Something like this:
final Pattern pattern = Pattern.compile("([a-z]{5})([a-z]{4})([a-z]{6})([a-z]{3})");
final Matcher matcher = pattern.matcher("abcdefghijklmnopqr");
if (matcher.matches()) {
// group(0) is the whole match; capturing groups are numbered from 1
String first = matcher.group(1);
String second = matcher.group(2);
String third = matcher.group(3);
String fourth = matcher.group(4);
return first + " " + second + " " + third + " " + fourth;
} else {
throw new SomeException();
}
Note that pattern should be a constant, I used a local variable here to make it easier to read.
Compared to substrings, which would also work to achieve the desired result, a regex also allows you to validate the format of your input data. In the provided example you check that it's an 18-character string composed only of lowercase letters.
If you had a more interesting example, with for instance a mix of letters and digits, you could check that each group contains the correct type of data with the regex.
You can also do a simpler version where you just replace with:
"abcdefghijklmnopqr".replaceAll("([a-z]{5})([a-z]{4})([a-z]{6})([a-z]{3})", "$1 $2 $3 $4")
But you lose the benefit of validation, because if the string doesn't match the format it is simply left unchanged, and this is less efficient than substrings.
Here is an example solution that copies the characters into a StringBuilder, which would be more efficient if you don't care about validation:
final Set<Integer> breaks = Set.of(5, 9, 15);
final String str = "abcdefghijklmnopqr";
final StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
if (breaks.contains(i)) {
stringBuilder.append(' ');
}
stringBuilder.append(str.charAt(i));
}
return stringBuilder.toString();

Check the number of occurrences of word(s) stored in an ArrayList

I have a big text such as:
really!!! Oh Oh! You read about them in a book and they told you to wear clothes? buahahahaham Did they also tell you how they were able to sew the leaves that they used to cover up? You amu
I also have an ArrayList of some words and expressions such as really or oh oh!
Now I want to count the number of occurrences of the phrases (which are in the ArrayList) in the text above or any similar text.
For that I first split the text into words and loop over them as follows:
String[] word=content.split("\\s+");
for(int j=0;j<word.length;j++){
if(sexuality.contains(word[j])){
swCount=sw+1;
}
}
But this does not work, since phrases like oh oh! or really!!! cannot be picked up by the above method. Can anyone help?
This counts the occurrences of any searchString in your input.
String input = "....";
List<String> searchStrings = Arrays.asList("oh oh!", "really");
int count = 0;
for (String searchString : searchStrings) {
int indexOf = input.indexOf(searchString);
while (indexOf > -1) {
count++;
indexOf = input.indexOf(searchString, indexOf+1);
}
}
If you want case insensitive search, convert both the input and the search words to lowercase. If you don't want to count words twice, replace the indexOf and the while loop with a simple contains:
int count = 0;
for (String searchString : searchStrings) {
if (input.contains(searchString)) {
count++;
}
}
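If you have something like god in your blacklist and don't want to match goddamn in the input (for whatever reason), you need to make sure there are word boundaries around your search word. Note that \b only matches between a word character and a non-word character, so a phrase that ends in punctuation, such as oh oh!, will not match with a trailing \b when it is followed by a space or the end of the input. Have a look at this code: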
int count = 0;
for (String searchString : searchStrings) {
Pattern pattern = Pattern.compile("\\b" + Pattern.quote(searchString) + "\\b");
Matcher matcher = pattern.matcher(input);
if (matcher.find()) {
count++;
}
}
I also don't understand exactly: is the problem that "oh oh!" should be one word? or is "!" the problem? Anyway, consider overriding "Equals" in ArrayList (I assume "sexuality" is your arraylist) to fit your needs. Check out this post:
ArrayList's custom Contains method
The brute-force approach is to insert all strings of the sexuality list into a HashMap and then, for each substring of content, look it up in the map. You can limit the length of the substring to the maximum length of the words in the sexuality list. However, this could be really expensive; it depends on the length of content and the length of the longest word contained in sexuality.
For a smarter approach you should have a look at another data structure, the trie.
An implementation is available in the Apache Commons Collections 4 lib. This approach is much faster because it lets you stop scanning the substring as soon as you find a prefix that doesn't exist in your dictionary (in your case the sexuality list).
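To show the idea without pulling in a library, here is a small hand-rolled trie. It is only a sketch, not the Apache Commons implementation, and the names (TrieCounter, add, countOccurrences) are invented for this example. From each start position the scan stops as soon as the current prefix is not in the dictionary.

```java
import java.util.HashMap;
import java.util.Map;

final class TrieCounter {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a dictionary phrase ends at this node
    }

    private final Node root = new Node();

    // Adds one phrase (may contain spaces and punctuation) to the dictionary.
    void add(String phrase) {
        Node node = root;
        for (char c : phrase.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.terminal = true;
    }

    // Counts every occurrence of every dictionary phrase in the text.
    int countOccurrences(String text) {
        int count = 0;
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break; // no dictionary phrase has this prefix
                }
                if (node.terminal) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

The matching is case sensitive and also counts phrases inside longer words, so you would lowercase both sides and add a boundary check if that matters for your data.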
If your "sentence" is not too big and your List doesn't contain too many items, I would go the easy way and do it like this:
String sentence = "Here is my my sentence";
List<String> searchList = new ArrayList<>();
searchList.add("is");
searchList.add("my");
int occurences[] = new int[searchList.size()];
for (int i = 0; i < searchList.size(); i++) {
int searchFromPos = 0;
String wordToSearch = searchList.get(i);
while ((searchFromPos = sentence.indexOf(wordToSearch, searchFromPos)) != -1) {
occurences[i]++;
searchFromPos += wordToSearch.length();
}
}
NOTE, however, that this will also detect word parts.
E.g. when your sentence is "This is sneaky" and you search for "is", there will be two results, because "This" also contains an "is".

How do I check if a String contains restricted words using Regex?

These are the strings that I should not allow in my address:
"PO BOX","P0 DRAWER","POSTOFFICE", " PO ", " BOX ",
"C/O","C.O."," ICO "," C/O "," C\0 ","C/0","P O BOX",
"P 0 BOX","P 0 B0X","P0 B0X","P0 BOX","P0BOX","P0B0X",
"POBX","P0BX","POBOX","P.0.","P.O","P O "," P 0 ",
"P.O.BOX","P.O.B","POB ","P0B","P 0 B","P O B",
" CARE ","IN CARE"," APO "," CPO "," UPO ", "GENDEL",
"GEN DEL", "GENDELIVERY","GEN DELIVERY","GENERALDEL",
"GENERAL DEL","GENERALDELIVERY","GENERAL DELIVERY"
I created the regular expression below, but it validates only the PO Box part. Please help me correct it so that none of the above strings is allowed in my address field:
"([\\w\\s*\\W]*((P(O|OST)?.?\\s*((O(FF(ICE)?)?)?.?\\s*(B(IN|OX|.?))|B(IN|OX))+))[\\w\\s*\\W]*)+
|([\\w\\s*\\W]* (IN \s*(CARE)?\\s*)|\s*[\\w\\s*\\W]*((.?(APO)?|.?(cPO)?|.?(uPO))?.?\s*) [\\w\\s*\\W]*|([\\w\\s*\\W]*(GEN(ERAL)?)?.?\s*(DEL(IVERY)?)?.?\s* [\\w\\s*\\W]*))";
I'm guessing you're trying to see if an address string contains any restricted phrases.
Please do not do this in one single regex.
Doing one single massive regex matching query means it's hard to understand what you did to create the regex, hard to extend if more restrictions pop up, and generally not good code practice.
Here's a (hopefully) more sane approach:
public static final String RESTRICTIONS[] = { " P[0O] ", " B[0O]X ", "etc, etc" };
public static boolean containsRestrictions(String testString) {
for (String expression : RESTRICTIONS) {
Matcher restriction = Pattern.compile(expression).matcher(testString);
if (restriction.find())
return true;
}
return false;
}
You're still doing regex matching, so you can put your fancy schmancy regex into your restrictions list, but it works on just plain old strings too. Now you only need to verify that each of the individual regexes work instead of verifying a giant regex against all possible cases. If you wanna add a new restriction, just add it to the list. If you're real fancy you can load the restrictions from a configuration file or inject it using spring so your pesky product people can add address restrictions without touching une ligne de code.
Edit: To make this even easier to read, and to do what you really want (restricting strings separated from other strings using whitespace), you can remove regexes altogether from the restrictions and do some basic matching work in your method.
// No regexes here, just words you wanna restrict
public static final String RESTRICTIONS[] = { "PO", "PO BOX", "etc, etc" };
public static boolean containsRestrictions(String testString) {
for (String word : RESTRICTIONS) {
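// Pattern.quote keeps characters such as '.' in the words from being read as regex syntax
String expression = "(^|\\s)" + Pattern.quote(word) + "(\\s|$)";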
Matcher restriction = Pattern.compile(expression).matcher(testString);
if (restriction.find())
return true;
}
return false;
}
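Since both versions above compile a pattern for every restriction on every call, one possible refinement is to compile the patterns once and reuse them. The sketch below is only an illustration: the class name RestrictionChecker and the choice of case-insensitive matching are my own assumptions, not something from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class RestrictionChecker {
    private final List<Pattern> patterns = new ArrayList<>();

    // Compiles one pattern per restricted word, once, at construction time.
    RestrictionChecker(List<String> restrictedWords) {
        for (String word : restrictedWords) {
            // The word must be delimited by whitespace or the start/end of the input.
            String expression = "(^|\\s)" + Pattern.quote(word) + "(\\s|$)";
            patterns.add(Pattern.compile(expression, Pattern.CASE_INSENSITIVE));
        }
    }

    // Returns true if any restricted word occurs in the address.
    boolean containsRestrictions(String address) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(address).find()) {
                return true;
            }
        }
        return false;
    }
}
```

Create the checker once (for example as a static field) and call containsRestrictions for each address.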
So, you want to search substrings like a pro? I'd suggest using the Aho Corasick algorithm which solves the kind of problems you have.
Selling point:
It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all patterns simultaneously.
Luckily, a Java implementation exists. You can get it here.
Here's how to use it:
// this is the part you have to do only once
AhoCorasick tree = new AhoCorasick();
String[] terms = {"PO BOX","P0 DRAWER",...};
for (int i = 0; i < terms.length; i++) {
tree.add(terms[i].getBytes(), terms[i]);
}
tree.prepare();
// here comes the part you use for every address you want to check
String text = "The ga3 mutant of Arabidopsis is a gibberellin-responsive. In UPO, that is...";
boolean restrictedWordFound = false;
@SuppressWarnings("unchecked")
Iterator<SearchResult> search = (Iterator<SearchResult>)tree.search(text.getBytes());
if(search.hasNext()) {
restrictedWordFound = true;
}
If a match has been found, restrictedWordFound will be true.
Note: this search is case sensitive. Since your strings are all in upper case, I'd suggest you first convert the address to a temporary upper-case copy and run the matching on that. That way, you will cover all possible combinations.
From my tests, Aho Corasick is faster than regex based search and in most cases faster than naive string searching using contains and other String based methods. You can add even more filter words; Aho Corasick is the way to go.
Instead of using such a complicated regular expression, you can simply list the strings as alternatives in one regex:
"PO BOX|P0 DRAWER|POSTOFFICE| PO | BOX |C/O|C.O.| ICO | C/O | C\0 |C/0|P O BOX|P 0 BOX|P 0 B0X|P0 B0X|P0 BOX|P0BOX|P0B0X|POBX|P0BX|POBOX|P.0.|P.O|P O | P 0 |P.O.BOX|P.O.B|POB |P0B|P 0 B|P O B| CARE |IN CARE| APO | CPO | UPO |GENDEL|GEN DEL|GENDELIVERY|GEN DELIVERY|GENERALDEL|GENERAL DEL|GENERALDELIVERY|GENERAL DELIVERY"
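And negate the answer. Keep in mind that an unescaped . matches any character, so entries such as C.O. should be escaped (or quoted with Pattern.quote) if you mean them literally.
Compile the regex once (in Java) and reuse the Pattern object, which avoids paying the compilation cost on every check. Note that java.util.regex is a backtracking engine and does not build a minimised DFA, so the alternatives are tried one after another.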
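Rather than typing the alternation by hand, it can be built from the list of phrases so that every entry is quoted. This is a minimal sketch; AlternationBuilder and build are names I made up, and the list is assumed to be non-empty.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class AlternationBuilder {
    // Builds a single pattern that matches any of the given phrases literally.
    // Assumes a non-empty list: an empty alternation would match everything.
    static Pattern build(List<String> phrases) {
        String alternation = phrases.stream()
                .map(Pattern::quote)              // treat each phrase as plain text
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }
}
```

An address is then rejected when build(phrases).matcher(address).find() returns true. Keep the compiled Pattern in a constant so it is built only once.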

Java simple sentence parser

Is there any simple way to create a sentence parser in plain Java,
without adding any libs and jars?
The parser should not just handle the blanks between words,
but be smarter and parse . ! ? as well,
recognize when a sentence has ended, etc.
After parsing, only the real words should be stored in a db or file, without any special chars.
Thank you all very much in advance :)
You might want to start by looking at the BreakIterator class.
From the JavaDoc.
The BreakIterator class implements methods for finding the location of boundaries in text. Instances of BreakIterator maintain a current position and scan over text returning the index of characters where boundaries occur. Internally, BreakIterator scans text using a CharacterIterator, and is thus able to scan text held by any object implementing that protocol. A StringCharacterIterator is used to scan String objects passed to setText.
You use the factory methods provided by this class to create instances of various types of break iterators. In particular, use getWordInstance, getLineInstance, getSentenceInstance, and getCharacterInstance to create BreakIterators that perform word, line, sentence, and character boundary analysis respectively. A single BreakIterator can work only on one unit (word, line, sentence, and so on). You must use a different iterator for each unit boundary analysis you wish to perform.
Line boundary analysis determines where a text string can be broken when line-wrapping. The mechanism correctly handles punctuation and hyphenated words.
Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses.
Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word-breaks on both sides.
Character boundary analysis allows users to interact with characters as they expect to, for example, when moving the cursor through a text string. Character boundary analysis provides correct navigation through character strings, regardless of how the character is stored. For example, an accented character might be stored as a base character and a diacritical mark. What users consider to be a character can differ between languages.
BreakIterator is intended for use with natural languages only. Do not use this class to tokenize a programming language.
See demo: BreakIteratorDemo.java
Based on @Jarrod Roberson's answer, I have created a util method that uses BreakIterator and returns the list of sentences.
public static List<String> tokenize(String text, String language, String country){
List<String> sentences = new ArrayList<String>();
Locale currentLocale = new Locale(language, country);
BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(currentLocale);
sentenceIterator.setText(text);
int boundary = sentenceIterator.first();
int lastBoundary = 0;
while (boundary != BreakIterator.DONE) {
boundary = sentenceIterator.next();
if(boundary != BreakIterator.DONE){
sentences.add(text.substring(lastBoundary, boundary));
}
lastBoundary = boundary;
}
return sentences;
}
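The same approach should work for the "only real words" part of the question by switching to a word iterator. This is a sketch under the assumption that a token counts as a word when it starts with a letter or digit; WordExtractor and words are names chosen for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordExtractor {
    // Returns the words of the text, skipping whitespace and punctuation tokens.
    static List<String> words(String text, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(text);
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            String token = text.substring(start, end);
            // Punctuation and spaces come back as separate tokens; keep only real words.
            if (Character.isLetterOrDigit(token.charAt(0))) {
                words.add(token);
            }
        }
        return words;
    }
}
```

Combined with tokenize above, you could first split the text into sentences and then call words on each sentence.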
Just use a regular expression (\s+ matches one or more whitespace characters such as spaces, tabs, etc.) to split the String into an array.
Then you can iterate over that array and check whether a word ends with ., ? or ! (using String.endsWith()) to find the end of a sentence.
And before saving any word, use a regular expression once again to remove every non-alphanumeric character.
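Put together, those three steps might look like the sketch below. The class name SimpleSentenceParser is made up, and "non-alphanumeric" is interpreted here as anything that is not a Unicode letter or digit.

```java
import java.util.ArrayList;
import java.util.List;

final class SimpleSentenceParser {
    // Splits the text into sentences; each sentence is a list of cleaned words.
    static List<List<String>> parse(String text) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            boolean endsSentence = token.endsWith(".") || token.endsWith("!") || token.endsWith("?");
            // Remove everything that is not a letter or a digit.
            String word = token.replaceAll("[^\\p{L}\\p{N}]", "");
            if (!word.isEmpty()) {
                current.add(word);
            }
            if (endsSentence && !current.isEmpty()) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // text that does not end with . ! or ?
        }
        return sentences;
    }
}
```

This is deliberately naive: abbreviations such as "e.g." end a sentence too early, and "it's" is stored as "its".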
Of course, use StringTokenizer
import java.util.StringTokenizer;
public class Token {
public static void main(String[] args) {
String sentence = "Java! simple ?sentence parser.";
String separator = "!?.";
StringTokenizer st = new StringTokenizer( sentence, separator, true );
while ( st.hasMoreTokens() ) {
String token = st.nextToken();
if ( token.length() == 1 && separator.indexOf( token.charAt( 0 ) ) >= 0 ) {
System.out.println( "special char:" + token );
}
else {
System.out.println( "word :" + token );
}
}
}
}
You can use StringTokenizer or Scanner.
Example:
StringTokenizer tokenizer = new StringTokenizer(input, " !?.");

Extract words out of a text file

Let's say you have a text file like this one:
http://www.gutenberg.org/files/17921/17921-8.txt
Does anyone have a good algorithm, or open-source code, to extract words from a text file?
How can I get all the words, while avoiding special characters and keeping things like "it's", etc.?
I'm working in Java.
Thanks
This sounds like the right job for regular expressions. Here is some Java code to give you an idea, in case you don't know how to start:
String input = "Input text, with words, punctuation, etc. Well, it's rather short.";
Pattern p = Pattern.compile("[\\w']+");
Matcher m = p.matcher(input);
while ( m.find() ) {
System.out.println(input.substring(m.start(), m.end()));
}
The pattern [\w']+ matches all word characters, and the apostrophe, multiple times. The example string would be printed word-by-word. Have a look at the Java Pattern class documentation to read more.
Here's a good approach to your problem:
This function receives your text as input and returns a list of all the words inside the given text
private ArrayList<String> get_Words(String SInput){
StringBuilder stringBuffer = new StringBuilder(SInput);
ArrayList<String> all_Words_List = new ArrayList<String>();
String SWord = "";
for(int i=0; i<stringBuffer.length(); i++){
char charAt = stringBuffer.charAt(i);
if(Character.isAlphabetic(charAt) || Character.isDigit(charAt)){
// Still inside a word: keep collecting characters
SWord = SWord + charAt;
}
else{
// Any other character ends the current word
if(!SWord.isEmpty()) all_Words_List.add(SWord);
SWord = "";
}
}
// Flush the last word when the text ends with a letter or digit
if(!SWord.isEmpty()) all_Words_List.add(SWord);
return all_Words_List;
}
Pseudocode would look like this:
create words, a list of words, by splitting the input by whitespace
for every word, strip out whitespace and punctuation on the left and the right
The python code would be something like this:
words = input.split()
words = [word.strip(PUNCTUATION) for word in words]
where
PUNCTUATION = ",. \n\t\\\"'][#*:"
or any other characters you want to remove.
I believe Java has equivalent functions in the String class: String.split() .
Output of running this code on the text you provided in your link:
>>> print words[:100]
['Project', "Gutenberg's", 'Manual', 'of', 'Surgery', 'by', 'Alexis',
'Thomson', 'and', 'Alexander', 'Miles', 'This', 'eBook', 'is', 'for',
'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and',
'with', 'almost', 'no', 'restrictions', 'whatsoever', 'You', 'may',
'copy', 'it', 'give', 'it', 'away', 'or', 're-use', 'it', 'under',
... etc etc.
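In Java, String.split() covers the first step, but as far as I know there is no built-in counterpart to Python's strip(chars), so the trimming has to be written by hand. A rough sketch, where WordStripper is a made-up name and PUNCTUATION is copied from the Python snippet:

```java
import java.util.ArrayList;
import java.util.List;

final class WordStripper {
    // Same characters as the PUNCTUATION string in the Python version.
    static final String PUNCTUATION = ",. \n\t\\\"'][#*:";

    // Removes punctuation characters from both ends of a word.
    static String strip(String word) {
        int start = 0;
        int end = word.length();
        while (start < end && PUNCTUATION.indexOf(word.charAt(start)) >= 0) {
            start++;
        }
        while (end > start && PUNCTUATION.indexOf(word.charAt(end - 1)) >= 0) {
            end--;
        }
        return word.substring(start, end);
    }

    // Splits on whitespace, then strips punctuation from each token.
    static List<String> extractWords(String input) {
        List<String> words = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            String stripped = strip(token);
            if (!stripped.isEmpty()) {
                words.add(stripped);
            }
        }
        return words;
    }
}
```

Unlike the Python list comprehension, this version drops tokens that become empty after stripping. Apostrophes inside a word, as in "Gutenberg's", are kept because only the ends are trimmed.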
Basically, you want to match
([A-Za-z])+('([A-Za-z])*)?
right?
You could try a regex, using a pattern you've made, and count the number of times that pattern is found.
