My application is reading a file that contains the following data:
MENS HEALTH^\^# P
while the actual text should be
MENS HEALTH P
I have already replaced '\u0000', but "^\" still remains in the string. I am not sure what the code for these characters is, so I cannot replace them.
When I open the file in the IntelliJ editor, it is displayed as an FS symbol.
Please suggest how I can eliminate this.
Thanks,
Rather than worry about what characters the junk consists of, remove everything that isn't what you want to keep:
str = str.replaceAll("[^\\w ]+", "");
This deletes any characters that are not word characters or spaces.
You can use a regular expression with String.replaceAll() to replace these characters.
Note that the backslash has a special meaning and needs to be escaped (with another backslash).
"my\\^#String".replaceAll("[\\\\^#]", "");
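Since the question says IntelliJ shows the junk as an FS symbol, the leftover bytes may be control characters (FS is U+001C) rather than a literal caret and backslash. A minimal sketch under that assumption follows; the class name, method name and sample string are made up for illustration.

```java
final class ControlCharCleaner {
    // Removes every ASCII control character (U+0000 to U+001F, plus U+007F).
    static String stripControlChars(String s) {
        return s.replaceAll("\\p{Cntrl}", "");
    }

    public static void main(String[] args) {
        // Hypothetical sample: FS (U+001C) and NUL (U+0000) embedded in the text.
        String raw = "MENS HEALTH\u001c\u0000 P";
        System.out.println(stripControlChars(raw)); // MENS HEALTH P
    }
}
```

Note that \p{Cntrl} also covers tabs and line breaks, so this is only suitable when those should be dropped as well.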
To remove quotation marks in Java, I understand I can use
replaceAll("\"", "");
Ex: "Hello World" becomes Hello World.
However, it only removes this type of quotation mark: "". Is there a way to remove quotes like these: “Hello World”?
If you simply want to remove those 3 kinds of double-quotes, irrespective of the context:
replaceAll("[\"“”]", "");
If there are other kinds of quote characters that you want to remove, just add them before the ].
These pages list some of the other quote characters that you might encounter:
https://unicode-table.com/en/sets/quotation-marks/
https://en.wikipedia.org/wiki/Quotation_mark
And also see:
Is there a regex to grab all quotation marks?
which talks about the difficulty in creating a regex to match all of them in a future-proof fashion.
Note that since we are including some "funky" characters (non-ASCII) in the source code (above), it is important that the Java compiler is aware of the character encoding that the source code uses. We could avoid that by using Unicode escapes instead. For example:
replaceAll("[\"\u201c\u201d]", "");
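Put together as a runnable sketch, the Unicode-escape variant might look like this. The class and method names are invented for the example, and only the three double-quote characters discussed here are handled.

```java
final class QuoteStripper {
    // Removes the ASCII double quote plus the left (U+201C) and right (U+201D) curly quotes.
    static String stripDoubleQuotes(String s) {
        return s.replaceAll("[\"\u201c\u201d]", "");
    }

    public static void main(String[] args) {
        String input = "\u201cHello World\u201d";
        System.out.println(stripDoubleQuotes(input)); // Hello World
    }
}
```

Because the curly quotes are written as escapes, the source file stays pure ASCII and compiles the same way regardless of the encoding the compiler assumes.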
You may try a regex replacement here, e.g.
String input = "“Hello World”";
System.out.println(input.replaceAll("“(.*?)”", "$1")); // prints Hello World
I want to edit REGEX_PATTERN2 in this code so that it works with the matches() method for Arabic punctuation marks.
String REGEX_PATTERN = "[\\.|,|:|;|!|_|\\?]+";
String s1 = "My life :is happy, stable";
String[] result = s1.split(REGEX_PATTERN);
for (String myString : result) {
System.out.println(myString);
}
String REGEX_PATTERN2 = "[\\.|,|:|;|!|_|،|؛|؟\\?]+";
String s2 = " حياتي ؛ سعيدة، مستقر";
String[] result2 = s2.split(REGEX_PATTERN2);
for (String myString : result2) {
System.out.println(myString);
}
The output I wanted
My life
is happy
stable
حياتي
سعيدة
مستقر
How can I edit this code to use the matches() method instead of split() and get the same output with Arabic punctuation marks?
There are a few problems here. First this example:
if (word.matches("[\\.|,|:|;|!|\\?]+"))
That is mildly[1] incorrect for the following reasons:
A . does not need to be escaped in a character class.
A | does not mean alternation in a character class.
A ? does not need to be escaped in a character class.
(For more details, read the javadoc or a tutorial on Java regexes.)
So you can rewrite the above as:
if (word.matches("[.,:;!?]+"))
... assuming that you don't want to classify the pipe character as punctuation.
Now this:
if (word.matches("[\.|,|:|;|!|،|؛|..|...|؟|\?]+"))
You have the same problems as above. In addition, you seem to have used two and three full-stop / period characters instead of (presumably) some Unicode character. I suspect they might be \ufbb7, \u061e or \u06db, but I'm no linguist. (Certainly 2 or 3 full stops is incorrect.)
So what are the punctuation characters in Arabic?
To be honest, I think that the answer depends on what source you look at, but Wikipedia states:
Only the Arabic question mark ⟨؟⟩ and the Arabic comma ⟨،⟩ are used in regular Arabic script typing and the comma is often substituted for the Latin script comma (,).
[1] - By mildly incorrect, I mean that the mistakes in this example are mostly harmless. However, your inclusion of (multiple instances of) the | character in the class does mean that you will incorrectly classify a "pipe" as punctuation.
[] denotes a regex character class, which means it only matches single characters. ... is 3 characters, so it cannot be used in a character class.
In a character class, you don't separate characters with |, and you don't need to escape . and ?.
You probably meant this, which is a list of alternate character sequences:
"(?:\\.|,|:|;|!|\\?|،|؛|؟|\\.\\.|\\.\\.\\.)+"
You might get better performance if you do use a character class where you can:
"(?:\\.{1,3}|[,:;!?،؛؟])+"
Of course, with the + at the end, matching 1-3 periods in each iteration is rather redundant, so this will do:
"[.,:;!?،؛؟]+"
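As a rough illustration of that final character class used with matches(), here is a small sketch. The class and method names are hypothetical, and the Arabic marks are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

final class PunctuationCheck {
    // ASCII punctuation plus the Arabic comma (U+060C), semicolon (U+061B)
    // and question mark (U+061F).
    private static final Pattern PUNCT =
            Pattern.compile("[.,:;!?\u060C\u061B\u061F]+");

    // True only when the whole token consists of one or more punctuation marks.
    static boolean isPunctuation(String token) {
        return PUNCT.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPunctuation("\u061F\u060C")); // true
        System.out.println(isPunctuation("..."));          // true
        System.out.println(isPunctuation("a,"));           // false
    }
}
```

Since matches() tests the entire input, this answers "is this token punctuation?"; breaking a sentence into pieces is still a job for split().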
Here's a different approach, that uses Unicode properties instead of specific characters (In case you care about more Arabic marks than just the question mark and comma mentioned in another answer):
"(?=^[\\p{InArabic}.,:;!?]+$)^\\p{IsPunctuation}+$"
It matches an entire string of characters that have a punctuation category, that also are either in the Arabic block or are one of the other punctuation characters you listed in your efforts.
It'll match strings like "؟،" or "؟،:", but not "؟،ؠ" or "؟،a".
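The four cases just listed can be checked with a short sketch like the following. The class and method names are made up, and the Arabic characters are given as escapes (U+061F question mark, U+060C comma, U+0620 as the sample letter).

```java
import java.util.regex.Pattern;

final class ArabicPunctuationCheck {
    // Lookahead: every character is in the Arabic block or is one of . , : ; ! ?
    // Main part: every character has a Unicode punctuation category.
    private static final Pattern ONLY_PUNCT = Pattern.compile(
            "(?=^[\\p{InArabic}.,:;!?]+$)^\\p{IsPunctuation}+$");

    static boolean isOnlyPunctuation(String s) {
        return ONLY_PUNCT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isOnlyPunctuation("\u061F\u060C"));       // true
        System.out.println(isOnlyPunctuation("\u061F\u060C:"));      // true
        System.out.println(isOnlyPunctuation("\u061F\u060C\u0620")); // false
        System.out.println(isOnlyPunctuation("\u061F\u060Ca"));      // false
    }
}
```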
I'm trying to remove a lot of Unicode characters from a string but am having issues with regex in Java.
Example text:
\u2605 StatTrak\u2122 Shadow Daggers
Example Desired Result:
StatTrak Shadow Daggers
The current regex code I have that will not work:
list.replaceAll("\\\\u[0-9]+","");
The code will execute but the text will not be replaced. From looking at other solutions people seem to use only two "\\" but anything less than 4 throws me the typical error:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 2
\u[0-9]+
I've tried the current regex solution in online test environments like RegexPlanet and FreeFormatter and both give the correct result.
Any help would be appreciated.
I assume that you would like to replace these "special" characters with an empty String. As I see it, \u2605 and \u2122 fall outside the POSIX character class \p{Print} (printable ASCII characters), so we can replace every character that is not printable with "". The result then matches your expectation, apart from a leading space that trim() can remove.
Sample would be:
list = list.replaceAll("\\P{Print}", "");
Hope this helps.
In Java, something like your \u2605 is not a literal sequence of six characters; it represents a single Unicode character. Therefore your pattern "\\\\u[0-9]{4}" will not match it.
Your pattern describes a literal character \ followed by the character u followed by exactly four numeric characters 0 through 9, but what is in your string is the single character at Unicode code point U+2605, the "Black Star" character.
This is the same as with other escape sequences: in the string "some\tmore" there is no character \ and there is no character t; there is only the single character 0x09, a tab character. Because it is an escape sequence known to Java (and other languages), it gets replaced by the character that it represents, and the literal \ and t are no longer characters in the string.
Kenny Tai Huynh's answer, replacing non-printables, may be the easiest way to go, depending on what sorts of things you want removed, or you could list the characters you want (if that is a very limited set) and remove the complement of those, such as mystring.replaceAll("[^A-Za-z0-9]", "");
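To make the point about escapes concrete, here is a small sketch, assuming the string comes from a Java literal so that the compiler has already turned each escape into one character. The class and method names are invented for the example.

```java
final class EscapeDemo {
    // Drops everything outside printable ASCII, then trims leftover edge whitespace.
    static String keepPrintableAscii(String s) {
        return s.replaceAll("\\P{Print}", "").trim();
    }

    public static void main(String[] args) {
        // Each Unicode escape in this literal becomes a single character.
        String name = "\u2605 StatTrak\u2122 Shadow Daggers";
        System.out.println(name.charAt(0) == '\u2605'); // true
        System.out.println(name.contains("\\u"));       // false: no backslash in the string
        System.out.println(keepPrintableAscii(name));   // StatTrak Shadow Daggers
    }
}
```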
I'm an idiot. I was calling the replaceAll on the string but not assigning it as I thought it altered the string anyway.
What I had previously:
list.replaceAll("\\\\u[0-9]+","");
What I needed:
list = list.replaceAll("\\\\u[0-9]+","");
Result works fine now, thanks for the help.
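For the case where the text really does contain a literal backslash followed by u (for example, escaped data read from a file or a web response), a sketch along these lines may help. It uses hex digits rather than [0-9]+ so that escapes containing a-f are also covered; that change is an assumption about the data, and the names are hypothetical.

```java
final class LiteralEscapeCleaner {
    // Removes literal six-character sequences: backslash, 'u', four hex digits.
    static String stripLiteralEscapes(String s) {
        return s.replaceAll("\\\\u[0-9a-fA-F]{4}", "").trim();
    }

    public static void main(String[] args) {
        // The doubled backslashes keep these as literal text, not compiler escapes.
        String raw = "\\u2605 StatTrak\\u2122 Shadow Daggers";

        stripLiteralEscapes(raw);  // result discarded: Strings are immutable
        System.out.println(raw);   // still contains the escape text

        raw = stripLiteralEscapes(raw); // reassign to keep the cleaned value
        System.out.println(raw);        // StatTrak Shadow Daggers
    }
}
```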
I have a string like the one below:
"them coming \nLove it \ud83d\ude00"
I want to remove the character "\ud83d\ude00", so that it becomes
"them coming \nLove it "
How can I achieve this in Java? I have tried code like the one below, but it doesn't work:
payload.toString().replaceAll("\\\\u\\b{4}.", "")
Thanks :)
I think \\\\u\\b{4}. will not work, because the regex engine treats \ud83d as a symbol, not as a literal string. So to match this kind of unwanted Unicode character, it is better to exclude the characters you accept (those you don't want to replace), for example all ASCII characters, and match everything else (what you want to replace). Try:
[^\x00-\x7F]+
The range \x00-\x7F covers the Unicode Basic Latin block.
String str = "them coming \nLove it \ud83d\ude00";
System.out.println(str.replaceAll("[^\\x00-\\x7F]+", ""));
will result in:
them coming
Love it
However, you will have a problem if you use national characters or any other non-ASCII symbols (ś, ą, ♉, ☹, etc.), because they are removed as well.
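If national characters need to survive, one possible alternative is to remove only code points above U+FFFF, which is where \ud83d\ude00 lives. This is a sketch under that assumption, not a complete emoji filter: symbols in the Basic Multilingual Plane are left alone. The class and method names are made up.

```java
final class SupplementaryStripper {
    // Removes only supplementary-plane code points (U+10000 to U+10FFFF),
    // which Java stores as surrogate pairs.
    static String stripSupplementary(String s) {
        return s.replaceAll("[\\x{10000}-\\x{10FFFF}]", "");
    }

    public static void main(String[] args) {
        String str = "them coming \nLove it \ud83d\ude00";
        System.out.println(stripSupplementary(str));

        // Accented letters such as U+015B and U+0105 are kept.
        System.out.println(stripSupplementary("\u015b\u0105 \ud83d\ude00"));
    }
}
```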
I'm no expert in regex but I need to parse some input I have no control over, and make sure I filter away any strings that don't have A-z and/or 0-9.
When I run this,
Pattern p = Pattern.compile("^[a-zA-Z0-9]*$"); //fixed typo
if(!p.matcher(gottenData).matches())
System.out.println(someData); //someData contains gottenData
certain spaces + an unknown symbol somehow slip through the filter (gottenData is the red rectangle):
In case you're wondering, it DOES also display Text, it's not all like that.
For now, I don't mind the [?] as long as it also contains some string along with it.
Please help.
[EDIT] As far as I can tell from the (very large) input, the [?]'s are either whitespace or nothing at all; maybe there's some sort of encoding issue, or perhaps it has something to do with #text nodes (the input is XML).
The * quantifier matches "zero or more", which means it will match a string that does not contain any of the characters in your class. Try the + quantifier, which means "one or more": ^[a-zA-Z0-9]+$ will match strings made up of alphanumeric characters only. ^.*[a-zA-Z0-9]+.*$ will match any string containing one or more alphanumeric characters, although the leading .* will make it much slower. If you use Matcher.lookingAt() instead of Matcher.matches(), it will not require a full string match and you can use the regex [a-zA-Z0-9]+; note that lookingAt() still anchors the match at the start of the input, whereas Matcher.find() looks for a match anywhere in it.
You have an error in your regex: instead of [a-zA-z0-9]* it should be [a-zA-Z0-9]*.
You don't need ^ and $ around the regex.
Matcher.matches() always matches the complete string.
String gottenData = "a ";
Pattern p = Pattern.compile("[a-zA-Z0-9]*");
if (!p.matcher(gottenData).matches())
    System.out.println("doesn't match.");
This prints "doesn't match."
The correct answer is a combination of the above answers. First, I imagine your intended character match is [a-zA-Z0-9]. Note that A-z isn't as bad as you might think: it includes all characters in the ASCII range between A and z, which is the letters plus a few extras (specifically [, \, ], ^, _ and `).
A second potential problem, as Martin mentioned, is that you may need to put in the start and end anchors if you want the string to consist only of letters and numbers.
Finally, you use the * operator, which means 0 or more, so the pattern can match 0 characters: matches() returns true for an empty string, and an unanchored search would succeed on any input. What you need is the + quantifier. So I submit that the pattern you are most likely looking for is:
^[a-zA-Z0-9]+$
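The difference between the quantifiers and between the two character ranges can be seen in a short sketch. The class and method names here are illustrative only.

```java
import java.util.regex.Pattern;

final class AlnumCheck {
    // matches() already requires the whole input to match, so no anchors are needed.
    private static final Pattern ALNUM = Pattern.compile("[a-zA-Z0-9]+");

    static boolean isAlphanumeric(String s) {
        return ALNUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAlphanumeric("riz99")); // true
        System.out.println(isAlphanumeric(""));      // false: + needs at least one character
        System.out.println(isAlphanumeric("a "));    // false: space is not in the class

        // The A-z typo lets extra characters such as '_' through.
        System.out.println(Pattern.matches("[a-zA-z0-9]+", "a_b")); // true
        // With *, the empty string is accepted.
        System.out.println(Pattern.matches("[a-zA-Z0-9]*", ""));    // true
    }
}
```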
You have to change the regexp to "^[a-zA-Z0-9]*$" to ensure that you are matching the entire string.
Looks like it should be "a-zA-Z0-9", not "a-zA-z0-9", try correcting that...
Did anyone consider adding a space to the regex: [a-zA-Z0-9 ]*? This should match any normal text with letters, numbers and spaces. If you want quotes and other special characters, add them to the regex too.
You can quickly test your regex at http://www.regexplanet.com/simple/
You can check whether an input value contains only letters and digits by using the regex ^[a-zA-Z0-9]*$
If the value contains only letters and digits, it shows "match", e.g. riz99, riz99z
Otherwise it shows "not match", e.g. 99z., riz99.z, riz99.9
Example code:
if (e.target.value.match('^[a-zA-Z0-9]*$')) {
    console.log('match')
} else {
    console.log('not match')
}