Base64 encoding Allowed Characters - java

I'm using Base64 encoding for encoding user id field in Java.
String abc = new String(Base64.encodeBase64("Actualuseridfield".getBytes()));
Will the Base64 encoded string returned in abc ever contain any of the characters below?
" <double quote>
, <comma>
: <colon>

You will not see commas, colons, or double quotes in a Base64 encoded string. You will see equals signs, since they are used to pad the end of the encoded content.

A proper Base64 encoder will not emit any special characters except:
[A-Z][a-z][0-9][+/] and the padding character '=' at the end, which indicates the number of zero-fill bytes
There is another Base64 character set available which replaces [+/] with [-_], making the encoding URL-safe.
Nevertheless, the specification allows other characters to appear in the encoded data; for example, MIME-style Base64 inserts a line break every 76 characters. Any character outside the alphabet above has to be removed or ignored during decoding. The padding characters indicate how many zero bytes were appended to the input so that the output comes out as a multiple of four characters.
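To see this in practice, a small sketch using the JDK's built-in java.util.Base64 (Java 8+; the sample inputs are assumptions, not from the question) encodes a couple of strings and checks the result against the standard alphabet:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64AlphabetCheck {
    public static void main(String[] args) {
        String[] samples = { "Actualuseridfield", "user:42,\"quoted\"" };
        for (String sample : samples) {
            String encoded = Base64.getEncoder()
                    .encodeToString(sample.getBytes(StandardCharsets.UTF_8));
            // The standard alphabet is A-Z, a-z, 0-9, '+', '/' plus '=' for padding.
            boolean onlyBase64Chars = encoded.matches("[A-Za-z0-9+/]+=*");
            System.out.println(encoded + " -> only Base64 chars: " + onlyBase64Chars);
        }
    }
}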

transforming "weird" and non printable characters is kind of the whole point of base64, so no, you wont see those. more info here http://email.about.com/cs/standards/a/base64_encoding.htm

Related

Escaping non-Latin characters in Java

I have a Java program that takes in a string and escapes it so that it can be safely passed to a program in bash. The strategy is basically to escape any of the special characters mentioned here and wrap the result in double quotes.
The algorithm is pretty simple -- just loop over the input string and use input.charAt(i) to check whether the current character needs to be escaped.
This strategy works quite well for characters that aren't represented by surrogate pairs, but I have some concerns if non-latin characters or something like an emoji is embedded in the string. In that case, if we assumed that an emoji was the first character in my input string, input.charAt(0) would give me the first code unit while input.charAt(1) would return the second code unit. My concern is that some of these code units might be interpreted as one of the special characters that need to be escaped. If that happened, I'd try to escape one of the code units which would irrevocably garble the input.
Is such a thing possible? Or is it safe to use input.charAt(i) for something like this?
From the Java docs:
The Java 2 platform uses the UTF-16 representation in char arrays and
in the String and StringBuffer classes. In this representation,
supplementary characters are represented as a pair of char values, the
first from the high-surrogates range, (\uD800-\uDBFF), the second from
the low-surrogates range (\uDC00-\uDFFF).
From the UTF-16 Wikipedia page:
U+D800 to U+DFFF: The Unicode standard permanently reserves these code point values for
UTF-16 encoding of the high and low surrogates, and they will never be
assigned a character, so there should be no reason to encode them. The
official Unicode standard says that no UTF forms, including UTF-16,
can encode these code points.
From the charAt javadoc:
Returns the char value at the specified index. An index ranges from 0
to length() - 1. The first char value of the sequence is at index 0,
the next at index 1, and so on, as for array indexing.
If the char value specified by the index is a surrogate, the surrogate
value is returned.
There is no overlap between the surrogate code unit range (\uD800-\uDFFF) and the range where my special characters ($, `, \ etc.) live, since those are all plain ASCII characters (i.e. they're mapped between 0 and 127).
Therefore, if I scan through a string that contains, say, an emoji (which definitely is a supplementary character, outside the Basic Multilingual Plane), I won't mistake either half of the surrogate pair for a special character. Here's a simple test program:
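The test program itself is not included in this excerpt; a minimal sketch of what it might look like, with an illustrative emoji and an assumed set of shell special characters:

public class SurrogateCheck {
    public static void main(String[] args) {
        // U+1F600 (grinning face) is a supplementary character, stored as a surrogate pair.
        String input = "\uD83D\uDE00 echo $HOME `date` \\";
        String specials = "$`\\\"!";
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean surrogate = Character.isSurrogate(c);
            boolean special = specials.indexOf(c) >= 0;
            // A surrogate code unit (0xD800-0xDFFF) can never equal an ASCII special char.
            System.out.printf("index %d: U+%04X surrogate=%b special=%b%n",
                    i, (int) c, surrogate, special);
        }
    }
}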

Remove special Unicode characters from a Java String

I am converting HTML data (data with bullet styling) to a Java String, but I am getting junk values (��, the Unicode replacement character) in the String. I tried to remove these values using replaceAll() but it's not working.
Any suggestions about how to remove these Unicode characters from the String?
You can remove all non-ASCII characters with:
s.replaceAll("[^\\p{ASCII}]", "")
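For example (the sample string here is an assumption), stripping the replacement characters along with anything else outside ASCII:

String s = "bullet \uFFFD item \uFFFD text";
String cleaned = s.replaceAll("[^\\p{ASCII}]", "");
System.out.println(cleaned); // "bullet  item  text" - the replacement characters are gone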

Remove non printable utf8 characters except controlchars from String

I've got a String containing text, control characters, digits, German umlauts and other UTF-8 characters.
I want to strip all UTF-8 characters which are not "part of the language". Special characters like (incomplete list) ":/\ßä,;\n \t" should all be preserved.
Sadly Stack Overflow removes all those characters, so I have to attach a picture (link).
Any ideas? Any help is much appreciated!
PS: If anybody knows a pasting service that does not kill those special characters, I would happily upload the strings; I just wasn't able to find one.
[Edit]: I THINK the regex "\P{Cc}" matches all the characters I want to PRESERVE. Could this regex be inverted, so that it matches everything it currently does not and I can strip those characters?
You have already found Unicode character properties.
You can invert a character property by changing the case of the leading "p",
e.g.
\p{L} matches all letters
\P{L} matches all characters that do not have the property "letter".
So if you think \P{Cc} is what you need, then \p{Cc} would match the opposite.
More details on regular-expressions.info
I am quite sure \p{Cc} is close to what you want, but be careful: it also includes, e.g., the tab (0x09), the line feed (0x0A) and the carriage return (0x0D).
But you can create your own character class, like this:
[^\P{Cc}\t\r\n]
The [^...] construct is a negated character class, so this matches everything that is not a "non-control character" (a double negation, so it does match control characters) and is not a tab, CR or LF.
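In Java source the backslashes have to be doubled; a minimal sketch with an assumed sample string:

String input = "Hällo\u0000 wörld\u0007,\tnext\nline";
// Remove everything that is a control character but not tab, CR or LF.
String cleaned = input.replaceAll("[^\\P{Cc}\\t\\r\\n]", "");
System.out.println(cleaned); // control chars U+0000 and U+0007 are gone; the tab and newline survive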
You can also use:
your_string.replaceAll("\\p{C}", "");
Note that \p{C} (the Unicode "Other" category) also matches tab, CR and LF, which the question wants to keep.

Removing all non-word characters in a Cyrillic UTF-8 encoded String

Normally, in order to remove non-word characters from a String the replaceAll method can be used:
String cleanWords = "some string with non-words such as ';'".replaceAll("\\W", "");
The above returns a cleaned string "somestringwithnonwordssuchas".
However, if the string contains Cyrillic characters they get recognised as non-word, and get removed from the string. It is expected that Cyrillic characters would remain. Hence the question.
What is a proper way to deal with the task of removing non-word characters regardless of the language, assuming that string has UTF-8 encoding?
Try [^\\p{L}]. That should match every Unicode codepoint except for letters.
The Pattern class has a pretty thorough description of the possible character classes. Note that the POSIX character classes are ASCII-only by default and won't help you a lot, you'll need to use the Unicode-specific classes.
Note that there's the UNICODE_CHARACTER_CLASS flag that changes the behavior of the POSIX classes to conform to this section of the Unicode Standard (basically making them equivalent to their closest Unicode-aware equivalents).
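A small sketch showing both the \p{L} approach and the UNICODE_CHARACTER_CLASS flag (the sample string is an assumption):

import java.util.regex.Pattern;

public class CyrillicCleanup {
    public static void main(String[] args) {
        String input = "привет, world; 123!";

        // Keep only letters of any script: punctuation, digits and spaces are removed.
        String lettersOnly = input.replaceAll("[^\\p{L}]", "");
        System.out.println(lettersOnly); // приветworld

        // Alternatively, make \W Unicode-aware so Cyrillic letters count as word characters.
        Pattern nonWord = Pattern.compile("\\W", Pattern.UNICODE_CHARACTER_CLASS);
        String wordCharsOnly = nonWord.matcher(input).replaceAll("");
        System.out.println(wordCharsOnly); // приветworld123
    }
}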

How to encode a string to replace all special characters

I have a string which contains special characters, but I have to convert it into a string without any special characters, so I used Base64. However, Base64 uses the equals symbol (=), which is itself a special character. I want to convert the string into one that contains only alphanumeric characters. I also can't simply remove the special characters; I have to replace them in a way that keeps two different input strings distinct. How can I achieve this, and which encoding will help me do it?
The simplest option would be to encode the text to binary using UTF-8, and then convert the binary back to text as hex (two characters per byte). It won't be terribly efficient, but it will just be alphanumeric.
You could use base32 instead to be a bit more efficient, but that's likely to be significantly more work, unless you can find a library which supports it out of the box. (Libraries to perform hex encoding are very common.)
There are a number of variations of base64, some of which don't use padding. (You still have a couple of non-alphanumeric characters for characters 62 and 63.)
The Wikipedia page on base64 goes into the details, including the "standard" variations used for a number of common use-cases. (Does yours match one of those?)
If your strings have to be strictly alphanumeric, then you'll need to use hex encoding (one byte becomes 2 hex digits), or roll your own encoding scheme. Your stated requirements are rather unusual ...
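A minimal sketch of the hex-encoding approach (the class and method names are illustrative):

import java.nio.charset.StandardCharsets;

public class HexEncodeExample {
    // Encode a string as lowercase hex: two alphanumeric characters per UTF-8 byte.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // '=' becomes 3d, 'ä' becomes c3a4 (two bytes in UTF-8); the result is purely alphanumeric.
        System.out.println(toHex("user=42;täst"));
    }
}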
Commons Codec has a URL-safe version of Base64, which emits - and _ instead of the + and / characters:
http://commons.apache.org/codec/apidocs/org/apache/commons/codec/binary/Base64.html#encodeBase64URLSafe(byte[])
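This still isn't strictly alphanumeric, since - and _ remain. If pulling in Commons Codec is not an option, the JDK's java.util.Base64 (Java 8+) offers the same URL-safe alphabet and can drop the '=' padding; a quick sketch with an assumed sample input:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class UrlSafeBase64Example {
    public static void main(String[] args) {
        String encoded = Base64.getUrlEncoder().withoutPadding()
                .encodeToString("some?user/id+1".getBytes(StandardCharsets.UTF_8));
        // The output uses only A-Z, a-z, 0-9, '-' and '_', with no '=' padding.
        System.out.println(encoded);
    }
}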
The easiest way would be to use a regular expression to match all nonalphanumeric characters and replace them with an empty string.
// This will remove all special characters except space.
var cleaned = stringToReplace.replace(/[^\w\s]/gm, '')
Adding a character inside the character class above will preserve that character instead of removing it.
// This will remove all special characters except space and period.
var cleaned = stringToReplace.replace(/[^\w\s.]/gm, '')
A working example.
const regex = /[^\w\s]/gm;
const str = `This is a text with many special characters.
Hello, user, your password is 543#!\$32=!`;
const subst = ``;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
Regex explained.
/[^\w\s]/gm
Match a single character not present in the list below [^\w\s]
\w matches any word character (equivalent to [a-zA-Z0-9_])
\s matches any whitespace character (equivalent to [\r\n\t\f\v \u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff])
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
If you truly can only use alphanumeric characters, you will have to come up with an escaping scheme that uses one of those characters as the escape. For example, use 0 as the escape character and encode each special character as 0 followed by the two-digit hex code of its ASCII value; use 000 to mean a literal 0.
e.g.
This is my special sentence with a 0.
encodes to:
This020is020my020special020sentence020with020a02000002e
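A minimal Java sketch of that scheme, assuming ASCII input and 0 as the escape character (the class and method names are illustrative):

public class ZeroEscape {
    // Keep A-Z, a-z and 1-9 as-is; '0' itself becomes "000",
    // every other character becomes '0' plus its two-digit hex code.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '0') {
                sb.append("000");
            } else if (Character.isLetterOrDigit(c) && c < 128) {
                sb.append(c);
            } else {
                sb.append('0').append(String.format("%02x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: This020is020my020special020sentence020with020a02000002e
        System.out.println(escape("This is my special sentence with a 0."));
    }
}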
