Take string "asdxyz\n" and replace \n with the ascii value - java

I don't want to replace it with /u000A; I do NOT want it to look like "asdxyz/u000A".
I want to replace it with the actual newline CHARACTER.

Based on your response to Natso's answer, it seems that you have a fundamental misunderstanding of what's going on. The two-character sequence \n isn't the new-line character. But it's the way we represent that character in code because the actual character is hard to see. The compiler knows that when it encounters that two-character sequence in the context of a string literal, it should interpret them as the real new-line character, which has the ASCII value 10.
If you print that two-character sequence to the console, you won't see them. Instead, you'll see the cursor advance to the next line. That's because the compiler has already replaced those two characters with the new-line character, so it's really the new-line character that got sent to the console.
If you have input to your program that contains backslashes and lowercase N's, and you want to convert them to new-line characters, then Zach's answer might be sufficient. But if you want your program to allow real backslashes in the input, then you'll need some way for the input to indicate that a backslash followed by a lowercase N is really supposed to be those two characters. The usual way to do that is to prefix the backslash with another backslash, escaping it. If you use Zach's code in that situation, you may end up turning the three-character sequence \\n into the two-character sequence consisting of a backslash followed by a new-line character.
The sure-fire way to read strings that use backslash escaping is to parse them one character at a time, starting from the beginning of the input. Copy characters from the input to the output, except when you encounter a backslash. In that case, check what the next character is, too. If it's another backslash, copy a single backslash to the output. If it's a lowercase N, then write a new-line character to the output. If it's any other character, do whatever you define to be the right thing. (Examples include rejecting the whole input as erroneous, pretending the backslash wasn't there, and omitting both the backslash and the following character.)
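The character-by-character approach described above can be sketched roughly as follows. The class and method names are made up for illustration, and rejecting unknown escapes with an exception is just one of the possible policies mentioned above.

```java
class BackslashUnescaper {
    // Hypothetical helper: turns backslash-n into a new-line character and
    // a doubled backslash into a single backslash.
    static String unescape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c != '\\') {
                out.append(c);            // ordinary character: copy it
                continue;
            }
            if (i + 1 >= input.length()) {
                throw new IllegalArgumentException("Dangling backslash at end of input");
            }
            char next = input.charAt(++i); // consume the character after the backslash
            switch (next) {
                case '\\': out.append('\\'); break;
                case 'n':  out.append('\n'); break;
                default:
                    // Assumed policy: reject anything we do not recognize.
                    throw new IllegalArgumentException("Unknown escape: \\" + next);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal "asdxyz\\n" holds eight characters: asdxyz, a backslash, and n.
        System.out.print(unescape("asdxyz\\n"));
    }
}
```

Because the input is consumed strictly left to right, the three-character sequence \\n comes out as a backslash followed by the letter n, which is the case a plain search-and-replace gets wrong.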
If you're trying to observe the contents of a variable in the debugger, it's possible that the debugger may detect the new-line character and convert it back to the two-character sequence \n. So, if you're stepping through your code trying to figure out what's in it, you may fall victim to the debugger's helpfulness. And in most situations, it really is being helpful. Programmers usually want to know exactly what characters are in a string; they're less concerned about how those characters will appear on the screen.

Just use String.replace():
newtext = text.replace("\\n", "\n");

Use the "(char)(10)" code to generate the true ascii value.
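// replace() takes literal text; "\\n" is a backslash followed by n
newstr = oldstr.replace("\\n", String.valueOf((char) 10));
// -or- with a regex, where the backslash must be escaped twice
newstr = oldstr.replaceAll("\\\\n", "" + ((char) 10));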
//(been a while)
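A small sketch of why the choice between String.replace() and String.replaceAll() matters here: replace() treats its first argument as literal text, while replaceAll() treats it as a regular expression, in which \n already means the new-line character. The class and method names are illustrative only.

```java
class ReplaceVsReplaceAll {
    // Literal search: the two characters backslash and n.
    static String viaReplace(String s) {
        return s.replace("\\n", "\n");
    }

    // Regex search: the regex needs an escaped backslash, hence four in the literal.
    static String viaReplaceAll(String s) {
        return s.replaceAll("\\\\n", "\n");
    }

    // Regex whose text is backslash-n: it matches a real new-line character,
    // so an input holding backslash and n comes back unchanged.
    static String viaWrongRegex(String s) {
        return s.replaceAll("\\n", "\n");
    }

    public static void main(String[] args) {
        String oldstr = "asdxyz\\n"; // ends in backslash, n
        System.out.println(viaReplace(oldstr).equals(viaReplaceAll(oldstr)));
        System.out.println(viaWrongRegex(oldstr).equals(oldstr));
    }
}
```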

Related

Java - Regex Replace All will not replace matched text

Trying to remove a lot of unicodes from a string but having issues with regex in java.
Example text:
\u2605 StatTrak\u2122 Shadow Daggers
Example Desired Result:
StatTrak Shadow Daggers
The current regex code I have that will not work:
list.replaceAll("\\\\u[0-9]+","");
The code will execute but the text will not be replaced. From looking at other solutions people seem to use only two "\\" but anything less than 4 throws me the typical error:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 2
\u[0-9]+
I've tried the current regex solution in online test environments like RegexPlanet and FreeFormatter and both give the correct result.
Any help would be appreciated.
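Assuming that you would like to replace these "special" characters with an empty String: \u2605 and \u2122 fall outside the POSIX character class \p{Print}, which in Java matches only printable ASCII by default. That's why we can replace everything that is not in that class (\P{Print}) with "". The result is then what you expect, apart from a leading space that trim() removes.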
Sample would be:
list = list.replaceAll("\\P{Print}", "");
Hope this helps.
In Java, something like your \u2605 is not a literal sequence of six characters; it represents a single Unicode character. Therefore your pattern "\\\\u[0-9]{4}" will not match it.
Your pattern describes a literal character \ followed by the character u followed by exactly four numeric characters 0 through 9, but what is in your string is the single character at Unicode code point U+2605, the "Black Star" character.
This works just like other escape sequences: in the string "some\tmore" there is no character \ and there is no character t; there is only the single character 0x09, a tab character. Because \t is an escape sequence known to Java (and other languages), it gets replaced by the character that it represents, and the literal \ and t are no longer characters in the string.
Kenny Tai Huynh's answer, replacing non-printables, may be the easiest way to go, depending on what sorts of things you want removed, or you could list the characters you want (if that is a very limited set) and remove the complement of those, such as mystring.replaceAll("[^A-Za-z0-9]", "");
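To make the distinction concrete, here is a minimal sketch contrasting a string that holds the real characters with one that holds the six-character text. The class, constant, and method names are invented for the example.

```java
class EscapeVsCharacter {
    // The compiler turns each Unicode escape in this literal into one character.
    static final String REAL = "\u2605 StatTrak\u2122";

    // Doubling the backslash keeps backslash, 'u' and four digits as plain text.
    static final String LITERAL = "\\u2605 StatTrak\\u2122";

    // Removes the six-character text form only.
    static String stripLiteralEscapes(String s) {
        return s.replaceAll("\\\\u[0-9]{4}", "");
    }

    // Removes every character outside the printable ASCII class.
    static String stripNonPrintable(String s) {
        return s.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        System.out.println(REAL.length());
        System.out.println(LITERAL.length());
        System.out.println(stripLiteralEscapes(LITERAL));
        System.out.println(stripNonPrintable(REAL));
    }
}
```

The backslash-u pattern leaves REAL untouched because that string contains no backslash at all, while \P{Print} leaves LITERAL untouched because every character in it is printable ASCII.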
I'm an idiot. I was calling the replaceAll on the string but not assigning it as I thought it altered the string anyway.
What I had previously:
list.replaceAll("\\\\u[0-9]+","");
What I needed:
list = list.replaceAll("\\\\u[0-9]+","");
Result works fine now, thanks for the help.
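For anyone hitting the same thing: Java strings are immutable, so replaceAll() returns a new String and leaves the original alone. A minimal sketch, assuming the input really contains the literal backslash-u text (the class and method names are hypothetical):

```java
class ReplaceAllAssignment {
    // Regex: a literal backslash, the letter u, then one or more digits.
    static final String PATTERN = "\\\\u[0-9]+";

    static String withoutAssignment(String list) {
        list.replaceAll(PATTERN, "");        // result is discarded
        return list;                         // still the original text
    }

    static String withAssignment(String list) {
        list = list.replaceAll(PATTERN, ""); // keep the returned String
        return list;
    }

    public static void main(String[] args) {
        String list = "\\u2605 StatTrak\\u2122 Shadow Daggers";
        System.out.println(withoutAssignment(list));
        System.out.println(withAssignment(list).trim());
    }
}
```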

Removing all html markup

I have a string that holds a complete XML get request.
In the request, there is a lot of HTML and some custom commands which I would like to remove.
The only way of doing so I know is by using jSoup.
For example like so.
Now, because the website the request came from also features custom commands, I was not able to completely remove all code.
For example here is a string I would like to 'clean':
\u0027s normal text here\u003c/b\u003e http://a_random_link_here.com\r\n\r\nSome more text here
As you can see, the custom commands all have backslashes in front of them.
How would I go about removing these commands with Java?
If I use regex, how can I program it such that it only removes the command, not anything after the command?
(because if I softcode: I don't know the size of the command beforehand and I don't want to hardcode all the commands).
See http://regex101.com/r/gJ2yN2
The regex (\\.\d{3,}.*?\s|(\\r|\\n)+) works to remove the things you were pointing out.
Result (replacing the match with a single space):
normal text here http://a_random_link_here.com Some more text here
If this was not the result you were looking for, please edit your question with the expected result.
EDIT regex explained:
() - match everything inside the parentheses (later, the "match" gets replaced with "space")
\\ - an 'escaped' backslash (i.e. an actual backslash; the first one "protects" the second so it is not interpreted as a special character)
. - any character (I saw 'u', but there might be others)
\d - a digit
{3,} - "at least three"
.*? - any characters, "lazy" (stop as soon as possible)
\s - until you hit a white space
| - or
() - one of these things
\\r - backslash - r (again, with escaped '\')
\\n - backslash - n
The "custom commands" you're showing us appear to be standard character escapes. \r is carriage return, ASCII 13 (decimal). \n is new line, ASCII 10 (decimal). \uxxxx is generally an escape for the Unicode character with that hex value -- for example, \u0027 is ASCII character 39, the apostrophe character ('). You don't want to discard these; they're part of the text content you're trying to retrieve.
So the best answer is to make sure you know which escapes to accept in this dataset and then either find or write code which does a quick linear scan through the code looking for \ and, when found, using the next character to determine which kind of escape it is (and how many subsequent characters belong to that kind of escape), replace the escape sequence with the single character it represents, and continue until you reach the end of the string/buffer/file/whatever.
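A rough sketch of such a linear scan, assuming the only escapes in the data are \r, \n and the four-hex-digit Unicode form. The class name, the method name, and the choice to copy unrecognized backslashes through unchanged are all assumptions made for the example.

```java
class EscapeDecoder {
    // Hypothetical decoder for backslash-r, backslash-n and four-hex-digit Unicode escapes.
    static String decode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 >= s.length()) {
                out.append(c);               // ordinary character, or a trailing backslash
                i++;
                continue;
            }
            char kind = s.charAt(i + 1);     // decides which escape this is
            if (kind == 'r') {
                out.append('\r');
                i += 2;
            } else if (kind == 'n') {
                out.append('\n');
                i += 2;
            } else if (kind == 'u' && i + 6 <= s.length()) {
                // Four hex digits follow; parseInt throws NumberFormatException if they are malformed.
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(c);               // assumed policy: keep unknown escapes as-is
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("\\u0027s normal text here\\u003c/b\\u003e"));
    }
}
```

Decoding first and handing the result to an HTML parser such as jsoup afterwards keeps the apostrophes and line breaks that belong to the text.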

Can newlines be replaced with spaces? (lexer)

I'm currently in the process of developing a parser for a subset of Java, and I was wondering:
Are there any cases in which newlines are more than token separators?
That is, where they couldn't just be replaced by a space.
Should I ignore newlines, in the same way that I ignore white-space?
That is, just use them to detect token separation.
Yes, newline characters in Java source code can be replaced by a space, with one caveat: a newline also ends a // comment, so strip such comments first. However, do not remove \n (backslash n), because that is the escape for a newline character inside a String literal.
And yes, for the parser newlines are the same as spaces, as long as you are outside String literals. If you are inside a String literal and you remove a newline, then you suppress a syntax error, because Java does not allow newline characters in a String literal. So, this is wrong:
String str = "first line
same line";
So, it depends on whether you want your parser to detect syntax errors or not. Do you only parse valid code or not? That is the question you should ask yourself.
The only situation I can think of where it makes a difference is within String-literals.
If there is a linebreak between two "s it would cause a syntax error while a space would not.
Note that \n can also appear inside a string. And of course, if you make this replacement, you still have to increase the line number by 1 for each newline, because you will need line numbers in the next phases of your project.
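As a very rough illustration of treating newlines as plain separators while still counting lines, here is a toy scanner. It is only a sketch: it assumes tokens are delimited purely by white space and ignores comments and string literals, which a real Java lexer would have to handle. All names are invented.

```java
import java.util.ArrayList;
import java.util.List;

class LineTrackingLexer {
    // Returns tokens tagged as "line:text". A newline separates tokens exactly
    // like a space does, but it also bumps the line counter.
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int line = 1;
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\n') {
                line++;                       // remember the line for later phases
                i++;
            } else if (Character.isWhitespace(c)) {
                i++;                          // any other white space: just skip it
            } else {
                int start = i;
                while (i < src.length() && !Character.isWhitespace(src.charAt(i))) {
                    i++;
                }
                tokens.add(line + ":" + src.substring(start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("int x\n= 1;"));
    }
}
```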

Splitting into sentences Java

I want to split a text into sentences. My text contains \n character in between. I want the splitting to be done at \n and .(dot). I cannot use BreakIterator as splitting condition for it is a space followed by a period (In the text I want to split, that isn't necessary).
Example:
i am a java programmer.i like coding in java. pi is 3.14\n regex not working
Should output:
['i am a java programmer', 'i like coding in java', 'pi is 3.14', 'regex not working']
I tried a simple regex which splits on either \n or .:
[\\\\n\\.]
This isn't working although, specifying separately works.
\\\\n
\\.
So can anyone give a regex that will split on either \n or . ?
Another problem is I don't want splitting to be done in case of decimals like 5.6.
This Java regex should do it:
"\n|((?<!\\d)\\.(?!\\d))"
Points here:
you don't need to escape \n, ever
those weird-looking things around the dot are negative lookarounds, and mean "the previous/next character must not be a digit"
This regex says: "either a newline, or a literal dot that is not preceded or followed by a digit"
FYI, you don't need to escape characters in a character class (between []) except for the brackets themselves.
Use string.split("[\n.]") to split at \n or .
Inside character class, . has no special meaning. So there is no need for escaping .
Edit: string.split("\n|(?<!\\d)[.](?!\\d)") avoids splitting of decimal numbers.
Here, for each . a lookbehind before the dot and a lookahead after it check whether there is a digit on either side. The split is applied only if neither neighbor is a digit.
\n|\\.(?!\\d)|(?<!\\d)\\. avoids split for . with digits on both sides.
\n|(?<!\\d)[.](?!\\d) avoids split if any side has a digit
So what you require might be
string.split("\n|\\.(?!\\d)|(?<!\\d)\\.")
which splits something.4 but not 3.14
You need not double-escape stuff in a Java regex in the [] block:
[.\n]
should work.
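Putting the lookaround pattern from the answers to work on the example text, a minimal sketch might look like this. It assumes the text contains a real newline character rather than the two characters backslash and n, and the trimming of each piece is an addition of mine, not part of the regex. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class SentenceSplitter {
    // Split at a newline, or at a dot that has no digit directly before or after it.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String part : text.split("\n|(?<!\\d)\\.(?!\\d)")) {
            String trimmed = part.trim();     // drop the spaces left around the split points
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "i am a java programmer.i like coding in java. pi is 3.14\n regex not working";
        System.out.println(splitSentences(text));
    }
}
```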

Regex to accept only alphabets and spaces and disallowing spaces at the beginning and the end of the string

I have the following requirements for validating an input field:
It should only contain alphabets and spaces between the alphabets.
It cannot contain spaces at the beginning or end of the string.
It cannot contain any other special character.
I am using following regex for this:
^(?!\s*$)[-a-zA-Z ]*$
But this is allowing spaces at the beginning. Any help is appreciated.
For me the only logical way to do this is:
^\p{L}+(?: \p{L}+)*$
At the start of the string there must be at least one letter. (I replaced your [a-zA-Z] with the Unicode property for letters, \p{L}.) Then there can be a space followed by at least one letter; this part can be repeated.
\p{L}: any kind of letter from any language. See regular-expressions.info
The problem in your expression ^(?!\s*$) is, that lookahead will fail, if there is only whitespace till the end of the string. If you want to disallow leading whitespace, just remove the end of string anchor inside the lookahead ==> ^(?!\s)[-a-zA-Z ]*$. But this still allows the string to end with whitespace. To avoid this look back at the end of the string ^(?!\s)[-a-zA-Z ]*(?<!\s)$. But I think for this task a look around is not needed.
This should work if you use it with String.matches method. I assume you want English alphabet.
"[a-zA-Z]+(\\s+[a-zA-Z]+)*"
Note that \s will allow all kinds of whitespace characters. In Java, it would be equivalent to
[ \t\n\x0B\f\r]
Which includes horizontal tab (09), line feed (10), vertical tab (11), form feed (12), carriage return (13), and space (32).
If you want to specifically allow only space (32):
"[a-zA-Z]+( +[a-zA-Z]+)*"
You can further optimize the regex above by making the capturing group ( +[a-zA-Z]+) non-capturing (with String.matches you are not going to be able to get the words individually anyway). It is also possible to change the quantifiers to make them possessive, since there is no point in backtracking here.
"[a-zA-Z]++(?: ++[a-zA-Z]++)*+"
Try this:
^(((?<!^)\s(?!$)|[-a-zA-Z])*)$
This expression uses negative lookahead and negative lookbehind to disallow spaces at the beginning or at the end of the string, and requiring the match of the entire string.
I think the problem is there's a ? before the negation of white spaces, which means it is optional
This should work:
[a-zA-Z]{1}([a-zA-Z\s]*[a-zA-Z]{1})?
at least one sequence of letters, then optional string with spaces but always ends with letters
I don't know if words in your accepted string can be separated by more than one space. If they can:
^[a-zA-Z]+(( )+[a-zA-Z]+)*$
If they can't:
^[a-zA-Z]+( [a-zA-Z]+)*$
String must start with letter (or few letters), not space.
String can contain few words, but every word beside first must have space before it.
Hope I helped.
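For a quick way to try these patterns out, here is a small sketch built on the \p{L} variant from one of the answers above. It assumes exactly one space between words is what you want, and the class and method names are made up.

```java
import java.util.regex.Pattern;

class NameValidator {
    // One or more letters, then any number of "single space + letters" groups.
    private static final Pattern LETTERS_AND_SPACES =
            Pattern.compile("\\p{L}+(?: \\p{L}+)*");

    // matches() requires the whole input to match, so no ^ or $ anchors are needed.
    static boolean isValid(String input) {
        return LETTERS_AND_SPACES.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("John Smith"));
        System.out.println(isValid(" John"));
        System.out.println(isValid("John "));
    }
}
```

Swapping \p{L} for [a-zA-Z] restricts the check to the English alphabet, and changing the single space to " +" allows runs of spaces between words.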
