Splitting into sentences Java - java

I want to split a text into sentences. My text contains \n characters in between. I want the splitting to be done at \n and . (dot). I cannot use BreakIterator, because its splitting condition is a period followed by a space (which isn't guaranteed in the text I want to split).
Example:
i am a java programmer.i like coding in java. pi is 3.14\n regex not working
Should output:
['i am a java programmer', 'i like coding in java', 'pi is 3.14', 'regex not working']
I tried a simple regex which splits on either \n or .:
[\\\\n\\.]
This isn't working, although specifying each one separately works:
\\\\n
\\.
So can anyone give a regex that will split on either \n or . ?
Another problem is I don't want splitting to be done in case of decimals like 5.6.

This Java regex should do it:
"\n|((?<!\\d)\\.(?!\\d))"
Points here:
you don't need to escape \n, ever
those weird looking things around the dot are negative lookarounds, and mean "the previous/next character must not be a digit"
This regex says: "either a newline, or a literal dot that is not preceded or followed by a digit"
FYI, most metacharacters (such as .) need no escaping inside a character class (between []); the exceptions are the brackets themselves, the backslash, a leading ^, and a - placed between two characters.
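A minimal sketch of the regex above applied to the example from the question. The class and method names are made up for illustration, and the trimming step is an extra assumption, added because the split itself leaves the space after "java." and after the newline in place.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SentenceSplitter {
    // A newline, or a dot with no digit directly before or after it.
    static final String BOUNDARY = "\n|(?<!\\d)\\.(?!\\d)";

    static List<String> split(String text) {
        return Arrays.stream(text.split(BOUNDARY))
                .map(String::trim)            // drop the whitespace left around each piece
                .filter(s -> !s.isEmpty())    // ignore empty pieces, e.g. from ".\n"
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "i am a java programmer.i like coding in java. pi is 3.14\n regex not working";
        // prints [i am a java programmer, i like coding in java, pi is 3.14, regex not working]
        System.out.println(split(text));
    }
}
```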

Use string.split("[\n.]") to split at \n or .
Inside character class, . has no special meaning. So there is no need for escaping .
Edit: string.split("\n|(?<!\\d)[.](?!\\d)") avoids splitting decimal numbers.
Here, a lookbehind and a lookahead check each . for a digit on either side. The split is applied only if neither neighbouring character is a digit.
\n|\\.(?!\\d)|(?<!\\d)\\. avoids split for . with digits on both sides.
\n|(?<!\\d)[.](?!\\d) avoids split if any side has a digit
So what you require might be
string.split("\n|\\.(?!\\d)|(?<!\\d)\\.")
which splits something.4 but not 3.14
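To make the difference between the two lookaround variants concrete, here is a small sketch that runs both on a few inputs. The class, constant and method names are only illustrative.

```java
import java.util.Arrays;

class DecimalDotDemo {
    // No split if a digit sits on either side of the dot.
    static final String EITHER_SIDE = "\n|(?<!\\d)[.](?!\\d)";
    // No split only if digits sit on both sides of the dot.
    static final String BOTH_SIDES = "\n|\\.(?!\\d)|(?<!\\d)\\.";

    static String[] split(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"something.4", "3.14", "v2.beta"}) {
            System.out.println(s
                    + " either=" + Arrays.toString(split(s, EITHER_SIDE))
                    + " both=" + Arrays.toString(split(s, BOTH_SIDES)));
        }
    }
}
```

With EITHER_SIDE none of the three inputs is split; with BOTH_SIDES, something.4 and v2.beta are split at the dot and only 3.14 stays whole.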

You need not double-escape stuff in a Java regex in the [] block:
[.\n]
should work.

Related

How to use split for CSV while escaping \

I am trying to split a csv list.
csvList=hello there, how are you, what is your name\, again
I have to use the Java split function to get the three components:
hello there
how are you
what is your name, again
I want to escape the comma that is preceded by the '\'.
Can anyone please help?
Thanks.
You can use a lookbehind regex:
String[] tok="hello there, how are you, what is your name\\, again".split(" *(?<!\\\\), *");
You can use a negative look behind like this:
input.split("\\W*(?<!\\\\),\\W*")
Really the key here is the (?<!\\\\),. This says, "Find me commas that don't have a backslash behind them."
You need four backslashes because Java first processes the string literal, where \\ denotes a single backslash (just as \t denotes a tab). That leaves two backslashes for the regex engine, where the backslash is itself the escape character, so \\ is what matches one literal backslash.
The \\W* says, "match 0 or more non-word characters", which includes whitespace (\\s* would match whitespace only). The point of that is simply to trim your results so they don't have spaces before or after them.
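Here is a short sketch of the lookbehind split on the list from the question. It uses \\s* rather than \\W* so that only whitespace is trimmed, and it adds one step that is not part of the answers above: replacing the remaining \, with a plain comma, since the split leaves the escape in place. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class EscapedCommaSplit {
    static List<String> split(String csv) {
        List<String> fields = new ArrayList<>();
        // Split on commas that are not preceded by a backslash, trimming whitespace around them.
        for (String part : csv.split("\\s*(?<!\\\\),\\s*")) {
            // Assumption: the caller wants "\," turned back into a plain ",".
            fields.add(part.replace("\\,", ","));
        }
        return fields;
    }

    public static void main(String[] args) {
        // prints [hello there, how are you, what is your name, again] (three fields)
        System.out.println(split("hello there, how are you, what is your name\\, again"));
    }
}
```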

How to match ^(d+) in a particular text using regex

For example I have text like below :
case1:
(1) Hello, how are you?
case2:
Hi. (1) How're you doing?
Now I want to match the text which starts with (\d+).
I have tried the following regex but nothing is working.
^[\(\d+\)], ^\(\d+\).
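[] defines a character class: it matches a single character out of the ones listed inside the brackets, and may be followed by a quantifier. So [\(\d+\)] matches just one character that is either (, a digit, + or ).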
The second regexp will work: ^\(\d+\), so check your code.
Check also so there's no space in front of the first parenthesis, or add \s* in front.
EDIT: Also, java can be tricky with escapes depending on if the regexp you type is directly translated to a regexp or is first a string literal. You may need to double escape your escapes.
In Java you have to escape parentheses, so "\\(\\d+\\)" should match (1) in both case1 and case2. Adding ^ as you did, "^\\(\\d+\\)", will match only case1.
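A quick way to check this is to run both patterns against the two cases from the question. This is only a sketch; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class LeadingNumberDemo {
    // "(digits)" anywhere in the text.
    static final Pattern ANYWHERE = Pattern.compile("\\(\\d+\\)");
    // "(digits)" only at the very start of the text.
    static final Pattern AT_START = Pattern.compile("^\\(\\d+\\)");

    static boolean containsNumber(String s) {
        return ANYWHERE.matcher(s).find();
    }

    static boolean startsWithNumber(String s) {
        return AT_START.matcher(s).find();
    }

    public static void main(String[] args) {
        String case1 = "(1) Hello, how are you?";
        String case2 = "Hi. (1) How're you doing?";
        System.out.println(containsNumber(case1) + " " + startsWithNumber(case1));
        System.out.println(containsNumber(case2) + " " + startsWithNumber(case2));
    }
}
```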
You have to use double backslashes within a Java string literal. Consider this:
"\n" gives you [line break]
"\\n" gives you [backslash][n]
If you are going to downvote my post, at least comment to tell me WHY it's not useful.
I believe Java's Regex Engine supports Positive Lookbehind, in which case you can use the following regex:
(?<=[(][0-9]{1,9999}[)]\s?)\b.*$
Which matches:
The literal text (
Any digit [0-9], between 1 and 9999 times {1,9999}
The literal text )
A space, between 0 and 1 times \s?
A word boundary \b
Any character, between 0 and unlimited times .*
The end of a string $

Regex to accept only alphabets and spaces and disallowing spaces at the beginning and the end of the string

I have the following requirements for validating an input field:
It should only contain alphabets and spaces between the alphabets.
It cannot contain spaces at the beginning or end of the string.
It cannot contain any other special character.
I am using following regex for this:
^(?!\s*$)[-a-zA-Z ]*$
But this is allowing spaces at the beginning. Any help is appreciated.
For me the only logical way to do this is:
^\p{L}+(?: \p{L}+)*$
At the start of the string there must be at least one letter. (I replaced your [a-zA-Z] with the Unicode property for letters, \p{L}.) Then there can be a space followed by at least one letter, and this part can be repeated.
\p{L}: any kind of letter from any language. See regular-expressions.info
The problem with ^(?!\s*$) in your expression is that the lookahead only fails if there is nothing but whitespace up to the end of the string. If you want to disallow leading whitespace, just remove the end-of-string anchor inside the lookahead: ^(?!\s)[-a-zA-Z ]*$. But this still allows the string to end with whitespace. To avoid this, add a lookbehind at the end of the string: ^(?!\s)[-a-zA-Z ]*(?<!\s)$. But I think lookarounds are not needed for this task.
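As a rough check of the \p{L} pattern, here is a small sketch using String.matches, which already requires the whole input to match, so the ^ and $ anchors are left out. The class and method names are hypothetical, and the accented sample is written with Unicode escapes only to keep the source ASCII.

```java
class NameValidator {
    // One or more letters, then any number of " letters" groups:
    // no leading or trailing space, and exactly one space between words.
    static boolean isValid(String input) {
        return input.matches("\\p{L}+(?: \\p{L}+)*");
    }

    public static void main(String[] args) {
        String[] samples = {"John Smith", " John", "John ", "John  Smith", "John3", "Jos\u00e9 Mar\u00eda"};
        for (String s : samples) {
            System.out.println("[" + s + "] -> " + isValid(s));
        }
    }
}
```

Only "John Smith" and the accented name pass; the inputs with a leading, trailing or doubled space and the one with a digit are rejected.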
This should work if you use it with String.matches method. I assume you want English alphabet.
"[a-zA-Z]+(\\s+[a-zA-Z]+)*"
Note that \s will allow all kinds of whitespace characters. In Java, it would be equivalent to
[ \t\n\x0B\f\r]
Which includes space (32), horizontal tab (09), line feed (10), vertical tab (11), form feed (12) and carriage return (13).
If you want to specifically allow only space (32):
"[a-zA-Z]+( +[a-zA-Z]+)*"
You can further optimize the regex above by making the capturing group ( +[a-zA-Z]+) non-capturing (with String.matches you are not going to be able to get the words individually anyway). It is also possible to change the quantifiers to make them possessive, since there is no point in backtracking here.
"[a-zA-Z]++(?: ++[a-zA-Z]++)*+"
Try this:
^(((?<!^)\s(?!$)|[-a-zA-Z])*)$
This expression uses negative lookahead and negative lookbehind to disallow spaces at the beginning or at the end of the string, and requiring the match of the entire string.
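I think the problem is that the lookahead (?!\s*$) only rejects input that is entirely whitespace; the ? in it is part of the lookahead syntax, not an optional quantifier, so leading spaces still get through.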
This should work:
[a-zA-Z]{1}([a-zA-Z\s]*[a-zA-Z]{1})?
at least one sequence of letters, then optional string with spaces but always ends with letters
I don't know if words in your accepted string can be separated by more than one space. If they can:
^[a-zA-Z]+(( )+[a-zA-Z]+)*$
If they can't:
^[a-zA-Z]+( [a-zA-Z]+)*$
The string must start with a letter (or several letters), not a space.
The string can contain several words, but every word after the first must have a space before it.
Hope I helped.

Need a regular expression for field which should allow special characters, alphanumeric characters, and spaces

I am using the following regex:
[a-zA-Z0-9-#.()/%&\\s]{0,19}.
The requirement for the field is it should allow any thing and the field size should be 19.
Let me know if any corrections are needed. Any help is appreciated.
You simply need to escape the special characters. Try:
[a-zA-Z0-9\-#\.\(\)\/%&\s]{0,19}
You can test your regular expressions on http://rubular.com/
Your regex is incorrect in at least one way - if you're considering a hyphen to be a "special character", then you should put it at the beginning or end of the character class. So: [a-zA-Z0-9#.()/%&\s-]{0,19}.
Characters that are "special" within the context of the regex itself usually lose their special meaning inside a character class. So you're fine with ., ( and ). But check your parser to make sure that it understands what \s means. It might be simpler just to put a space.
Also, if your regex parser tends to delimit the regex with slashes, then you may have to escape the slash in the middle of the character class: [a-zA-Z0-9#.()\/%&\s-]{0,19}.
Just escape the dash - or put it at the beginning or at the end of the character class:
[a-zA-Z0-9\\-#.()/%&\\s]{0,19}
or
[-a-zA-Z0-9#.()/%&\\s]{0,19}
or
[a-zA-Z0-9#.()/%&\\s-]{0,19}
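For illustration, here is a sketch that uses the last variant (hyphen at the end of the class) with Matcher.matches, so the whole field has to fit the pattern. It assumes that "field size 19" means at most 19 characters; the class and method names are made up.

```java
import java.util.regex.Pattern;

class FieldValidator {
    // Letters, digits, # . ( ) / % & whitespace and a literal hyphen, 0 to 19 characters.
    static final Pattern ALLOWED = Pattern.compile("[a-zA-Z0-9#.()/%&\\s-]{0,19}");

    static boolean isValid(String field) {
        return ALLOWED.matcher(field).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("ABC-123 #4 (5%)"));      // allowed characters, 15 long
        System.out.println(isValid("abc@def"));              // '@' is not in the class
        System.out.println(isValid("12345678901234567890")); // 20 characters, too long
    }
}
```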

Java regex mix two patterns

How can i get this pattern to work:
Pattern pattern = Pattern.compile("[\\p{P}\\p{Z}]");
Basically, this will split my String[] sentence by any kind of punctuation character (\p{P}) or any kind of whitespace (\p{Z}). But I want to exclude the following case:
(?<![A-Za-z-])[A-Za-z]+(?:-[A-Za-z]+){1,}(?![A-Za-z-])
pattern explained here: Java regex patterns
which covers hyphenated words like these: "aaa-bb", "aaa-bb-cc", "aaa-bb-c-dd". So, how can I do that?
Unfortunately it seems like you can't merge both expressions, at least as far as I know.
However, maybe you can reformulate your problem.
If, for example, you want to split between words (which can contain hyphens), try this expression:
(?>[^\p{L}-]+|-[^\p{L}]+|^-|-$)
This should match any sequence of non-letter characters that are not a minus, or any minus that is followed by a non-letter character or that is the first or last character in the input.
Using this expression for a split should result in this:
input="aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo"
output={"aaa-bb","aaa-bb-cc","aaa-bb-c-dd","no","match","","foo"}
The regex might need some additional optimization but it is a start.
Edit: This expression should get rid of the empty string in the split:
(?>[^\p{L}-][^\p{L}]*|-[^\p{L}]+|^-|-$)
The first part would now read as "any non-letter which is not a minus, followed by any number of non-letter characters" and should match .-- as well.
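A small sketch of this second split expression applied to the sample input from above. The class and method names are hypothetical, and the regex is copied as given, with the backslashes doubled for the Java string literal.

```java
import java.util.Arrays;

class HyphenWordSplit {
    // Separators: a non-letter, non-minus character followed by any non-letters,
    // or a minus followed by non-letters, or a minus at the very start or end.
    static final String SEPARATOR = "(?>[^\\p{L}-][^\\p{L}]*|-[^\\p{L}]+|^-|-$)";

    static String[] split(String input) {
        return input.split(SEPARATOR);
    }

    public static void main(String[] args) {
        String input = "aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo";
        // prints [aaa-bb, aaa-bb-cc, aaa-bb-c-dd, no, match, foo]
        System.out.println(Arrays.toString(split(input)));
    }
}
```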
Edit: in case you want to match words that could potentially contain hyphens, try this expression:
(?>(?<=[^-\p{L}])|^)\p{L}+(?:-\p{L}+)*(?>(?=[^-\p{L}])|$)
This means "any sequence of letters (\p{L}+) followed by any number of sequences consisting of one minus and at least one more letter ((?:-\p{L}+)*). That sequence must be preceded by either the start of the input or anything that is not a letter or minus ((?>(?<=[^-\p{L}])|^)) and be followed by anything that is not a letter or minus, or by the end of the input ((?>(?=[^-\p{L}])|$))".
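For the matching approach, here is a sketch that collects the words with Matcher.find instead of splitting. The class and method names are invented for the example. Note that on the sample input used earlier, "no", "match" and "foo" are not returned, because each of them touches a minus that is not part of a well-formed hyphenated word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenWordFinder {
    // Letters, optionally joined by single minus signs, not adjacent to a letter or a minus.
    static final Pattern WORD = Pattern.compile(
            "(?>(?<=[^-\\p{L}])|^)\\p{L}+(?:-\\p{L}+)*(?>(?=[^-\\p{L}])|$)");

    static List<String> find(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [aaa-bb, aaa-bb-cc, aaa-bb-c-dd]
        System.out.println(find("aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo"));
    }
}
```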
