Regex is eating too much stuff

So I recently opened a question and ended up solving it by using a regex. The regex I used essentially ate ALL my non-English characters.
Let me retry this:
I want to remove all non-keyboard characters that may exist in a string.
The regex that I'm using is:
[^\\p{L}\\p{N}]
However, this turns input like
10/10/2012 10:51:25 AM
into
10102012105125AM
Is there some way to easily strip all alt-code characters from a string with replaceAll and leave keyboard characters like % / \ : and others intact?
Thanks!

You probably want to save only the ASCII characters. The character range [ -~] will achieve that. If you also want whitespace chars, you can add them in: [ -~\s].
System.out.println(input.replaceAll("[^ -~\\s]+", ""));

To remove all non-ASCII characters (strings are immutable, so assign the result back):
String mystring = <your_input_string>;
mystring = mystring.replaceAll("[^ -~\\s]+", "");

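What about \p{Print}? It matches all printable characters, which sounds like exactly what you need. Note that in Java it covers only printable ASCII by default, so tabs and newlines are not included.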
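To compare the patterns suggested above, here is a minimal sketch that runs the question's regex and the two proposed replacements on the same input. The class and method names are made up for illustration, and the heart symbol and trailing newline in the sample input are assumptions added to show the differences.

```java
public class KeepAsciiDemo {
    // Pattern from the question: drops everything that is not a letter or a digit,
    // which is why the slashes, colons and spaces disappear.
    static String lettersAndDigitsOnly(String s) {
        return s.replaceAll("[^\\p{L}\\p{N}]", "");
    }

    // Keeps printable ASCII (space through tilde) plus whitespace such as tab and newline.
    static String asciiAndWhitespace(String s) {
        return s.replaceAll("[^ -~\\s]+", "");
    }

    // Keeps only what \p{Print} matches. By default that is printable ASCII,
    // so tabs and newlines are removed too.
    static String printableOnly(String s) {
        return s.replaceAll("[^\\p{Print}]+", "");
    }

    public static void main(String[] args) {
        String input = "10/10/2012 10:51:25 AM\u2665\n";
        System.out.println("[" + lettersAndDigitsOnly(input) + "]");
        System.out.println("[" + asciiAndWhitespace(input) + "]");
        System.out.println("[" + printableOnly(input) + "]");
    }
}
```

The second and third variants both keep the date separators. They differ only in whether whitespace other than the plain space survives.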

Related

Remove escaped unicode string in java with regex

I have a string like the one below:
"them coming \nLove it \ud83d\ude00"
I want to remove the character "\ud83d\ude00", so that the result is
"them coming \nLove it "
How can I achieve this in Java? I have tried code like the one below, but it doesn't work:
payload.toString().replaceAll("\\\\u\\b{4}.", "")
Thanks :)

I think \\\\u\\b{4}. will not work, because by the time the regex runs, \ud83d\ude00 in a Java string literal is already a single symbol, not the literal text backslash-u-d83d. To match this kind of unwanted Unicode character, it is easier to list the characters you accept (the ones you don't want to replace), for example all ASCII characters, and match everything else (what you want to replace). Try:
[^\x00-\x7F]+
The range \x00-\x7F covers the Unicode Basic Latin block (ASCII).
String str = "them coming \nLove it \ud83d\ude00";
System.out.println(str.replaceAll("[^\\x00-\\x7F]+", ""));
This prints:
them coming
Love it
However, you will have a problem if you use national characters or any other non-ASCII symbols (ś, ą, ♉, ☹, etc.), because they are removed as well.
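The trade-off described above can be sketched as follows. This is only an illustration under assumptions: the class and method names are hypothetical, the second method narrows the match to supplementary code points (U+10000 and above, where this emoji lives) as one possible alternative, and the third method covers the case where the payload contains the escape as plain text rather than as a decoded character.

```java
public class EmojiStripDemo {
    // Removes every non-ASCII character, so accented letters are lost too.
    static String asciiOnly(String s) {
        return s.replaceAll("[^\\x00-\\x7F]+", "");
    }

    // Removes only supplementary code points (U+10000 and above).
    // Characters in the Basic Multilingual Plane are kept.
    static String withoutSupplementary(String s) {
        return s.replaceAll("[\\x{10000}-\\x{10FFFF}]+", "");
    }

    // For input where the escape was never decoded: a backslash, the letter u
    // and four hex digits, all as ordinary text.
    static String withoutEscapedText(String s) {
        return s.replaceAll("\\\\u[0-9a-fA-F]{4}", "");
    }

    public static void main(String[] args) {
        String decoded = "Love it \ud83d\ude00 za\u015b";
        String escapedText = "Love it \\ud83d\\ude00";
        System.out.println("[" + asciiOnly(decoded) + "]");
        System.out.println("[" + withoutSupplementary(decoded) + "]");
        System.out.println("[" + withoutEscapedText(escapedText) + "]");
    }
}
```

Which variant fits depends on whether the payload has already been decoded (for example by a JSON parser) before the replacement runs.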

Java regex : What is the best way to filter out the keyboard characters?

Here is the regex that I am using to remove all non-keyboard characters from a string, i.e. to leave all regular characters that could be typed on a regular keyboard:
String test = "\u2665\n\t\r whatever";
String myregex = "[^\\p{L}\\p{Nd}\\,\\[\\]\\{\\}\\\\|\"\' `~!##$%^&*()_+-=,./<>?\n\r\t]+";
System.out.println(test.replaceAll(myregex, ""));
Is there a better way to do that? Is there a more compact or more efficient regex?
I am asking because initially I did not have the \n\r\t part of the regex, and then I realized that a user may hit Enter, so that part was missing. Maybe something else is missing as well?
Basically what I am asking is: instead of listing all numbers and letters, we can use \\p{L}\\p{Nd}. Is there a similar shortcut for keyboard characters like !##$% ... ?

It seems you can modify your regular expression as follows. This removes every character that is not in the range from SPACE to TILDE in the ASCII table, with the exception of CR, LF and TAB, which are kept.
String myregex = "[^ -~\r\n\t]+";
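As a quick check of the shorter pattern, here is a small sketch using the test string from the question. The class and method names are hypothetical. Note one behavioral difference from the original regex: non-ASCII letters, which \p{L} accepted, are removed by this version.

```java
public class KeyboardCharsDemo {
    // Keeps printable ASCII (space through tilde) plus CR, LF and TAB.
    static String keepKeyboardChars(String s) {
        return s.replaceAll("[^ -~\r\n\t]+", "");
    }

    public static void main(String[] args) {
        String test = "\u2665\n\t\r whatever";
        // The heart symbol is dropped; the line breaks and the tab survive.
        System.out.println("[" + keepKeyboardChars(test) + "]");
        // Accented letters are outside the ASCII range and are dropped as well.
        System.out.println("[" + keepKeyboardChars("caf\u00e9 100%") + "]");
    }
}
```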

Using Regexp in Java to remove some text

This is maybe a simple question, but I have tried a lot of regexp combinations and it is still not working. My problem is: I have words like Test=move or Testing=move.
I would like to remove the text 'Test=' or 'Testing='. In other words, I need only the 'move' text after the '='. What is the best way to do that in Java? Thanks.

I think that for this problem, the split(string regex) is better suited:
String str = "Test=move";
System.out.println(str.split("=")[1]);

I would replace \w+= with "" - this will get rid of any word preceding an equals sign.
myString = myString.replaceAll("\\w+=", "");
If the string before the equals sign has more than just word characters, you can add them to the character class:
myString = myString.replaceAll("[\\w.-]+=", "");
This will remove any prefix made of letters, digits, underscores, hyphens and periods.
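The approaches from the answers above can be put side by side in a short sketch. The class and method names are invented for illustration, and the substring variant is an extra option that is not part of the original answers.

```java
public class AfterEqualsDemo {
    // Split on the first '=' only, so further '=' characters stay in the value.
    // Throws ArrayIndexOutOfBoundsException if the input contains no '='.
    static String viaSplit(String s) {
        return s.split("=", 2)[1];
    }

    // Remove a run of word characters, periods or hyphens followed by '='.
    static String viaReplace(String s) {
        return s.replaceAll("[\\w.-]+=", "");
    }

    // No regex at all: take everything after the first '='.
    // If there is no '=', indexOf returns -1 and the whole string is returned.
    static String viaSubstring(String s) {
        return s.substring(s.indexOf('=') + 1);
    }

    public static void main(String[] args) {
        System.out.println(viaSplit("Test=move"));          // move
        System.out.println(viaReplace("Testing=move"));     // move
        System.out.println(viaReplace("my-test.1=move"));   // move
        System.out.println(viaSubstring("Test=move"));      // move
    }
}
```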

Splitting into sentences Java

I want to split a text into sentences. My text contains \n character in between. I want the splitting to be done at \n and .(dot). I cannot use BreakIterator as splitting condition for it is a space followed by a period (In the text I want to split, that isn't necessary).
Example:
i am a java programmer.i like coding in java. pi is 3.14\n regex not working
Should output:
['i am a java programmer', 'i like coding in java', 'pi is 3.14', 'regex not working']
I tried a simple regex which splits on either \n or .:
[\\\\n\\.]
This isn't working, although specifying each one separately works:
\\\\n
\\.
So can anyone give a regex that will split on either \n or . ?
Another problem is I don't want splitting to be done in case of decimals like 5.6.

This Java regex should do it:
"\n|((?<!\\d)\\.(?!\\d))"
Points here:
you don't need to escape \n, ever
those weird looking things around the dot are negative lookarounds, and mean "the previous/next character must not be a digit"
This regex says: "either a newline, or a literal dot that is not preceded or followed by a digit"
FYI, inside a character class (between []) most metacharacters, including the dot, lose their special meaning and need no escaping. The exceptions are the brackets themselves, the backslash, ^ at the start and - between two characters.

Use string.split("[\n.]") to split at \n or .
Inside a character class, . has no special meaning, so there is no need to escape it.
Edit: string.split("\n|(?<!\\d)[.](?!\\d)") avoids splitting decimal numbers.
Here, the lookbehind before the . and the lookahead after it check for a digit on each side. The split is applied only if neither neighbor is a digit.
\n|\\.(?!\\d)|(?<!\\d)\\. avoids the split only for a . with digits on both sides.
\n|(?<!\\d)[.](?!\\d) avoids the split if either side has a digit.
So what you require might be
string.split("\n|\\.(?!\\d)|(?<!\\d)\\.")
which splits something.4 but not 3.14

You need not double-escape stuff in a Java regex in the [] block:
[.\n]
should work.
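Putting the lookaround pattern from the answers to work on the example from the question might look like the sketch below. The class and method names are hypothetical, the \n in the sample is assumed to be a real newline character, and the trimming and empty-string filtering are additions to get the exact list the question asks for.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SentenceSplitDemo {
    // Split on a newline, or on a dot that has no digit directly before or after it.
    static List<String> splitSentences(String text) {
        return Arrays.stream(text.split("\n|(?<!\\d)\\.(?!\\d)"))
                .map(String::trim)            // drop the space left after ". "
                .filter(s -> !s.isEmpty())    // drop empty pieces, e.g. from ".\n"
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "i am a java programmer.i like coding in java. pi is 3.14\n regex not working";
        // Prints each sentence on its own line; "pi is 3.14" stays in one piece.
        splitSentences(text).forEach(System.out::println);
    }
}
```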

Java - Unknown characters passing as [a-zA-z0-9]*?

I'm no expert in regex but I need to parse some input I have no control over, and make sure I filter away any strings that don't have A-z and/or 0-9.
When I run this,
Pattern p = Pattern.compile("^[a-zA-Z0-9]*$"); //fixed typo
if(!p.matcher(gottenData).matches())
    System.out.println(someData); //someData contains gottenData
certain spaces + an unknown symbol somehow slip through the filter (gottenData is the red rectangle):
In case you're wondering, it DOES also display Text, it's not all like that.
For now, I don't mind the [?] as long as it also contains some string along with it.
Please help.
[EDIT] As far as I can tell from the (very large) input, the [?]'s are either whitespace or nothing at all; maybe there's some sort of encoding issue, or perhaps it has something to do with #text nodes (the input is XML).

The * quantifier matches "zero or more", which means it will also match a string that contains none of the characters in your class, namely the empty string. Try the + quantifier, which means "one or more": ^[a-zA-Z0-9]+$ will match strings made up of alphanumeric characters only. ^.*[a-zA-Z0-9]+.*$ will match any string containing one or more alphanumeric characters, although the leading .* will make it much slower. If you use Matcher.find() instead of Matcher.matches(), a full string match is not required and you can use the regex [a-zA-Z0-9]+ (Matcher.lookingAt() also skips the full match, but it only matches at the start of the input).

You have an error in your regex: instead of [a-zA-z0-9]* it should be [a-zA-Z0-9]*.
You don't need ^ and $ around the regex.
Matcher.matches() always matches the complete string.
String gottenData = "a ";
Pattern p = Pattern.compile("[a-zA-Z0-9]*");
if (!p.matcher(gottenData).matches())
    System.out.println("doesn't match.");
This prints "doesn't match."

The correct answer is a combination of the above answers. First, I imagine your intended character class is [a-zA-Z0-9]. Note that A-z isn't as bad as you might think: it includes all characters in the ASCII range between A and z, which is the letters plus a few extras (specifically [, \, ], ^, _ and `).
A second potential problem, as Martin mentioned, is that you may need to add the start and end anchors if you want the string to consist only of letters and numbers.
Finally, you use the * quantifier, which means 0 or more, so the pattern can match 0 characters: matches() returns true for an empty string, and find() succeeds on any input. What you need is the + quantifier. So the pattern you are most likely looking for is:
^[a-zA-Z0-9]+$
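The points raised in these answers (the A-z typo, * versus +, and matches() versus find()) can be checked with a short sketch. The class, field and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class AlnumCheckDemo {
    // Intended pattern: one or more ASCII letters or digits.
    private static final Pattern ALNUM = Pattern.compile("[a-zA-Z0-9]+");
    // Pattern with the A-z typo and the * quantifier, as in the question title.
    private static final Pattern TYPO = Pattern.compile("[a-zA-z0-9]*");

    // matches() requires the whole input to match, so no ^ and $ are needed.
    static boolean isAlphanumeric(String s) {
        return ALNUM.matcher(s).matches();
    }

    static boolean passesTypoPattern(String s) {
        return TYPO.matcher(s).matches();
    }

    // find() succeeds if any part of the input matches.
    static boolean containsAlphanumeric(String s) {
        return ALNUM.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isAlphanumeric("riz99"));       // true
        System.out.println(isAlphanumeric(""));            // false: + needs at least one character
        System.out.println(passesTypoPattern(""));         // true: * accepts the empty string
        System.out.println(passesTypoPattern("a_b"));      // true: '_' lies between 'Z' and 'a'
        System.out.println(containsAlphanumeric("  x  ")); // true
    }
}
```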

You have to change the regexp to "^[a-zA-Z0-9]*$" to ensure that you are matching the entire string.

Looks like it should be "a-zA-Z0-9", not "a-zA-z0-9", try correcting that...

Did anyone consider adding a space to the regex, as in [a-zA-Z0-9 ]*? This should match any normal text with letters, numbers and spaces. If you want quotes and other special characters, add them to the regex too.
You can quickly test your regex at http://www.regexplanet.com/simple/

You can check whether the input contains only letters and digits by using the regex ^[a-zA-Z0-9]*$
If the value contains only letters and digits, it matches, e.g. riz99, riz99z
Otherwise it does not match, e.g. 99z., riz99.z, riz99.9
Example code (JavaScript):
if (e.target.value.match('^[a-zA-Z0-9]*$')) {
    console.log('match')
} else {
    console.log('not match')
}