How to tokenize Chinese into individual characters in Java? [duplicate]

I need to split a String into an array of single character Strings.
Eg, splitting "cat" would give the array "c", "a", "t"

"cat".split("(?!^)")
This will produce
array ["c", "a", "t"]

"cat".toCharArray()
But if you need strings
"cat".split("")
Edit: which will return an empty first value.

String str = "cat";
char[] cArray = str.toCharArray();

If characters beyond the Basic Multilingual Plane are expected in the input (some CJK characters, new emoji...), approaches such as "a💫b".split("(?!^)") cannot be used, because they split such characters into their surrogate halves (the result is the array ["a", "?", "?", "b"]), and something safer has to be used:
"a💫b".codePoints()
.mapToObj(cp -> new String(Character.toChars(cp)))
.toArray(size -> new String[size]);

split("(?!^)") does not work correctly if the string contains surrogate pairs. You should use split("(?<=.)").
String[] splitted = "花ab🌹🌺🌷".split("(?<=.)");
System.out.println(Arrays.toString(splitted));
output:
[花, a, b, 🌹, 🌺, 🌷]

To sum up the other answers...
This works on all Java versions:
"cat".split("(?!^)")
This only works on Java 8 and up:
"cat".split("")

An efficient way of turning a String into an array of one-character Strings would be to do this:
String[] res = new String[str.length()];
for (int i = 0; i < str.length(); i++) {
res[i] = Character.toString(str.charAt(i));
}
However, this does not take account of the fact that a char in a String could actually represent half of a Unicode code-point. (If the code-point is not in the BMP.) To deal with that you need to iterate through the code points ... which is more complicated.
This approach will be faster than using String.split(/* clever regex*/), and it will probably be faster than using Java 8+ streams. It is probably faster than this:
String[] res = new String[str.length()];
int i = 0;
for (char ch: str.toCharArray()) {
res[i++] = Character.toString(ch);
}
because toCharArray has to copy the characters to a new array.

for(int i=0;i<str.length();i++)
{
System.out.println(str.charAt(i));
}

Maybe you can use a for loop that goes through the String content and extracts the characters one by one using the charAt method.
Combined with an ArrayList<String>, for example, you can get your array of individual characters.
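The loop-and-list idea above could be sketched roughly as follows. The class and method names (CharAtSplitter, splitIntoChars) are made up for illustration. Note that this splits by char, so a character outside the BMP would come out as two separate surrogate halves.

```java
import java.util.ArrayList;
import java.util.List;

final class CharAtSplitter {
    // Hypothetical helper: collects every char of the input as a one-character String.
    static String[] splitIntoChars(String str) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < str.length(); i++) {
            parts.add(String.valueOf(str.charAt(i)));
        }
        return parts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // prints c,a,t
        System.out.println(String.join(",", splitIntoChars("cat")));
    }
}
```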

If the original string contains supplementary Unicode characters, then split() would not work, as it splits these characters into surrogate pairs. To handle these special characters correctly, code like this works:
String[] chars = new String[stringToSplit.codePointCount(0, stringToSplit.length())];
for (int i = 0, j = 0; i < stringToSplit.length(); j++) {
int cp = stringToSplit.codePointAt(i);
char c[] = Character.toChars(cp);
chars[j] = new String(c);
i += Character.charCount(cp);
}

In my previous answer I mixed up with JavaScript. Here goes an analysis of performance in Java.
I agree with the need for attention on the Unicode Surrogate Pairs in Java String. This breaks the meaning of methods like String.length() or even the functional meaning of Character because it's ultimately a technical object which may not represent one character in human language.
I implemented 4 methods that split a string into list of character-representing strings (Strings corresponding to human meaning of characters). And here's the result of comparison:
A line is a String consisting of 1000 arbitrarily chosen emojis and 1000 ASCII characters (1000 times <emoji><ascii>, a total of 2000 "characters" in the human sense).
(discarding 256 and 512 measures)
Implementations:
codePoints (java 11 and above)
public static List<String> toCharacterStringListWithCodePoints(String str) {
if (str == null) {
return Collections.emptyList();
}
return str.codePoints()
.mapToObj(Character::toString)
.collect(Collectors.toList());
}
classic
public static List<String> toCharacterStringListWithIfBlock(String str) {
if (str == null) {
return Collections.emptyList();
}
List<String> strings = new ArrayList<>();
char[] charArray = str.toCharArray();
int delta = 1;
for (int i = 0; i < charArray.length; i += delta) {
delta = 1;
if (i < charArray.length - 1 && Character.isSurrogatePair(charArray[i], charArray[i + 1])) {
delta = 2;
strings.add(String.valueOf(new char[]{ charArray[i], charArray[i + 1] }));
} else {
strings.add(Character.toString(charArray[i]));
}
}
return strings;
}
regex
static final Pattern p = Pattern.compile("(?<=.)");
public static List<String> toCharacterStringListWithRegex(String str) {
if (str == null) {
return Collections.emptyList();
}
return Arrays.asList(p.split(str));
}
Annex (RAW DATA):
codePoints;classic;regex;lines
45;44;84;256
14;20;98;512
29;42;91;1024
52;56;99;2048
87;121;174;4096
175;221;375;8192
345;411;839;16384
667;826;1285;32768
1277;1536;2440;65536
2426;2938;4238;131072

We can do this simply by
const string = 'hello';
console.log([...string]); // -> ['h','e','l','l','o']
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Spread_syntax says
Spread syntax (...) allows an iterable such as an array expression or string to be expanded...
So, strings can be quite simply spread into arrays of characters.

Related

how to compare two strings to find common substring

I get a termination due to timeout error when I compile. Please help me.
Given two strings, determine if they share a common substring. A substring may be as small as one character.
For example, the words "a", "and", "art" share the common substring "a" . The words "be" and "cat" do not share a substring.
Input Format
The first line contains a single integer , the number of test cases.
The following pairs of lines are as follows:
The first line contains string s1 .
The second line contains string s2 .
Output Format
For each pair of strings, return YES or NO.
my code in java
public static void main(String args[])
{
String s1,s2;
int n;
Scanner s= new Scanner(System.in);
n=s.nextInt();
while(n>0)
{
int flag = 0;
s1=s.next();
s2=s.next();
for(int i=0;i<s1.length();i++)
{
for(int j=i;j<s2.length();j++)
{
if(s1.charAt(i)==s2.charAt(j))
{
flag=1;
}
}
}
if(flag==1)
{
System.out.println("YES");
}
else
{
System.out.println("NO");
}
n--;
}
}
}
any tips?

Below is my approach to get through the same HackerRank challenge described above
static String twoStrings(String s1, String s2) {
String result="NO";
Set<Character> set1 = new HashSet<Character>();
for (char s : s1.toCharArray()){
set1.add(s);
}
for(int i=0;i<s2.length();i++){
if(set1.contains(s2.charAt(i))){
result = "YES";
break;
}
}
return result;
}
It passed all the Test cases without a time out issue.

The reason for the timeout is probably: to compare two strings that each are 1.000.000 characters long, your code needs 1.000.000 * 1.000.000 comparisons, always.
There is a faster algorithm that only needs 2 * 1.000.000 comparisons. You should use the faster algorithm instead. Its basic idea is:
for each character in s1: add the character to a set (this is the first million)
for each character in s2: test whether the set from step 1 contains the character, and if so, return "yes" immediately (this is the second million)
Java already provides a BitSet data type that does all you need. It is used like this:
BitSet seenInS1 = new BitSet();
seenInS1.set('x');
seenInS1.get('x');
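Putting the two steps and the BitSet together, a minimal sketch might look like this. The class name CommonCharCheck is an assumption, and twoStrings simply mirrors the method name used in the challenge; BitSet.set and BitSet.get are the standard java.util.BitSet methods shown above.

```java
import java.util.BitSet;

final class CommonCharCheck {
    // Returns "YES" if s1 and s2 share at least one char, otherwise "NO".
    static String twoStrings(String s1, String s2) {
        BitSet seenInS1 = new BitSet();
        // Step 1: remember every char that occurs in s1.
        for (int i = 0; i < s1.length(); i++) {
            seenInS1.set(s1.charAt(i));
        }
        // Step 2: stop at the first char of s2 that was seen in s1.
        for (int i = 0; i < s2.length(); i++) {
            if (seenInS1.get(s2.charAt(i))) {
                return "YES";
            }
        }
        return "NO";
    }

    public static void main(String[] args) {
        System.out.println(twoStrings("hello", "world")); // prints YES
        System.out.println(twoStrings("hi", "world"));    // prints NO
    }
}
```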

Since you're worried about execution time, if they give you an expected range of characters (for example 'a' to 'z'), you can solve it very efficiently like this:
import java.util.Arrays;
import java.util.Scanner;
public class Whatever {
final static char HIGHEST_CHAR = 'z'; // Use Character.MAX_VALUE if unsure.
public static void main(final String[] args) {
final Scanner scanner = new Scanner(System.in);
final boolean[] characterSeen = new boolean[HIGHEST_CHAR + 1];
mainloop:
for (int word = Integer.parseInt(scanner.nextLine()); word > 0; word--) {
Arrays.fill(characterSeen, false);
final String word1 = scanner.nextLine();
for (int i = 0; i < word1.length(); i++) {
characterSeen[word1.charAt(i)] = true;
}
final String word2 = scanner.nextLine();
for (int i = 0; i < word2.length(); i++) {
if (characterSeen[word2.charAt(i)]) {
System.out.println("YES");
continue mainloop;
}
}
System.out.println("NO");
}
}
}
The code was tested to work with a few inputs.
This uses a fast array rather than slower sets, and it only creates one non-String object (other than the Scanner) for the entire run of the program. It also runs in O(n) time rather than O(n²) time.
The only thing faster than an array might be the BitSet Roland Illig mentioned.
If you wanted to go completely overboard, you could also potentially speed it up by:
skipping the creation of a Scanner and all those String objects by using System.in.read(buffer) directly with a reusable byte[] buffer
skipping the standard process of having to spend time checking for and properly handling negative numbers and invalid inputs on the first line by making your own very fast int parser that just assumes it's getting the digits of a valid nonnegative int followed by a newline
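The second idea, a stripped-down int parser, might look something like the sketch below. It assumes exactly what is stated above: the bytes are the ASCII digits of a valid nonnegative int followed by a newline, with no validation at all. FastIntParser and parseNonNegativeInt are hypothetical names, and the buffer is assumed to have been filled elsewhere, for example by System.in.read(buffer).

```java
import java.nio.charset.StandardCharsets;

final class FastIntParser {
    // Parses ASCII digits from buffer[start..end) until the first non-digit byte.
    // No sign handling, no overflow check, no error reporting (by assumption).
    static int parseNonNegativeInt(byte[] buffer, int start, int end) {
        int value = 0;
        for (int i = start; i < end && buffer[i] >= '0' && buffer[i] <= '9'; i++) {
            value = value * 10 + (buffer[i] - '0');
        }
        return value;
    }

    public static void main(String[] args) {
        // Stand-in for a buffer filled from System.in.
        byte[] buffer = "42\nhello\nworld\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(parseNonNegativeInt(buffer, 0, buffer.length)); // prints 42
    }
}
```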

There are different approaches to solve this problem but solving this problem in linear time is a bit tricky.
Still, this problem can be solved in linear time. Just apply KMP algorithm in a trickier way.
Let's say you have 2 strings. Find the length of both strings first. Say length of string 1 is bigger than string 2. Make string 1 as your text and string 2 as your pattern. If the length of the string is n and length of the pattern is m then time complexity of the above problem would be O(m+n) which is way faster than O(n^2).
In this problem, you need to modify the KMP algorithm to get the desired result.
Just need to modify the KMP
public static void KMPsearch(char[] text,char[] pattern)
{
int[] cache = buildPrefix(pattern);
int i=0,j=0;
while(i<text.length && j<pattern.length)
{
if(text[i]==pattern[j])
{System.out.println("Yes");
return;}
else{
if(j>0)
j = cache[j-1];
else
i++;
}
}
System.out.println("No");
return;
}
Understanding Knuth-Morris-Pratt Algorithm

There are two concepts involved in solving this question.
-Understanding that a single character is a valid substring.
-Deducing that we only need to know that the two strings have a common substring — we don’t need to know what that substring is.
Thus, the key to solving this question is determining whether or not the two strings share a common character.
To do this, we create two sets, a and b, where each set contains the unique characters that appear in the string it’s named after.
Because sets don’t store duplicate values, we know that the size of our sets will never exceed the 26 letters of the English alphabet.
In addition, the small size of these sets makes finding the intersection very quick.
If the intersection of the two sets is empty, we print NO on a new line; if the intersection of the two sets is not empty, then we know that the two strings share one or more common characters and we print YES on a new line.
In code, it may look something like this
import java.util.*;
public class Solution {
static Set<Character> a;
static Set<Character> b;
public static void main(String[] args) {
Scanner scan = new Scanner(System.in);
int n = scan.nextInt();
for(int i = 0; i < n; i++) {
a = new HashSet<Character>();
b = new HashSet<Character>();
for(char c : scan.next().toCharArray()) {
a.add(c);
}
for(char c : scan.next().toCharArray()) {
b.add(c);
}
// store the set intersection in set 'a'
a.retainAll(b);
System.out.println( (a.isEmpty()) ? "NO" : "YES" );
}
scan.close();
}
}

public String twoStrings(String sOne, String sTwo) {
if (sOne.equals(sTwo)) {
return "YES";
}
Set<Character> charSetOne = new HashSet<Character>();
for (Character c : sOne.toCharArray())
charSetOne.add(c);
Set<Character> charSetTwo = new HashSet<Character>();
for (Character c : sTwo.toCharArray())
charSetTwo.add(c);
charSetOne.retainAll(charSetTwo);
if (charSetOne.size() > 0) {
return "YES";
}
return "NO";
}
This must work. Tested with some large inputs.

Python3
def twoStrings(s1, s2):
    flag = False
    for x in s1:
        if x in s2:
            flag = True
    if flag == True:
        return "YES"
    else:
        return "NO"

if __name__ == '__main__':
    q = 2
    text = [("hello","world"), ("hi","world")]
    for q_itr in range(q):
        s1 = text[q_itr][0]
        s2 = text[q_itr][1]
        result = twoStrings(s1, s2)
        print(result)

static String twoStrings(String s1, String s2) {
for (Character ch : s1.toCharArray()) {
if (s2.indexOf(ch) > -1)
return "YES";
}
return "NO";
}

Convert String to a sequence of ints separated with a delimiter using streams

So, I have a List of words that consist of autogenerated symbols. For example: hqst. I convert each symbol of this word to unicode and concatenate it dividing by dot . like this: 104.113.115.116.
I write the next lambda:
.map(word -> {
char[] symbols = word.toCharArray();
StringBuilder newWord = new StringBuilder();
for (int i = 0; i < symbols.length; i++) {
newWord.append((int) symbols[i]).append(".");
if (i == symbols.length - 1) {
newWord = new StringBuilder(newWord.substring(0, i));
}
}
return newWord.toString();
})
Is it possible rewrite this anonymous method using stream API?

Yep. You can use String::chars to get an IntStream from the word, then map each int to a String and collect with a joining collector:
.map(word -> word.chars()
.mapToObj(Integer::toString)
.collect(Collectors.joining("."))
)
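For reference, here is the same chars-and-joining pipeline as a small self-contained sketch. The class and method names (CodeUnitJoiner, toDottedCodes) and the sample list are assumptions for illustration; the stream calls are the ones from the snippet above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class CodeUnitJoiner {
    // Maps each char of the word to its numeric value and joins the values with dots.
    static String toDottedCodes(String word) {
        return word.chars()
                .mapToObj(Integer::toString)
                .collect(Collectors.joining("."));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hqst", "ab");
        List<String> converted = words.stream()
                .map(CodeUnitJoiner::toDottedCodes)
                .collect(Collectors.toList());
        System.out.println(converted); // prints [104.113.115.116, 97.98]
    }
}
```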

Assuming the words are stored in a list as integer values, based on the sample value provided (since you didn't give a full code snippet), I believe this helps.
public static void main(String[] args) {
List<Integer> list = new ArrayList<>();
list.add(104);
list.add(113);
list.add(115);
list.add(116);
String str = list.stream().map(word->word.toString()).collect(Collectors.joining( "." ));
System.out.println(str);
}

Java match string with x allowed mismatches.

What is the fastest / clearest way to see if a string matches another string of the same length with X allowed mismatches? Is there a library that can do this? It's not in Apache StringUtils (there is only one that also uses insertions / deletions).
So let's say I have 2 strings of length four and I want to know if they match with 1 mismatch allowed. Insertions and deletions are not allowed.
So:
ABCD <-> ABCD = Match
ABCC <-> ABCD = Match with 1 mismatch
ACCC <-> ABCD = no match, 2 mismatches is too much.

String str1, str2;
Assuming the lengths of the strings are equal:
int i = 0;
int counter = 0; // number of mismatched positions
for(char c : str1.toCharArray())
{
if(c != str2.charAt(i++))
counter++;
}
if(counter > 1)
// mismatch

Compare the strings one character at a time. Keep a counter to count the mismatches. When the counter exceeds the limit, return false. If you reach the end of the string, return true.
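A minimal sketch of that description, assuming strings of different lengths should simply count as no match (the names MismatchMatcher and matchesWithMismatches are made up for this example):

```java
final class MismatchMatcher {
    // Returns true if a and b have equal length and differ in at most maxMismatches positions.
    static boolean matchesWithMismatches(String a, String b, int maxMismatches) {
        if (a.length() != b.length()) {
            return false; // assumption: insertions/deletions are not allowed
        }
        int mismatches = 0;
        for (int i = 0; i < a.length(); i++) {
            // Bail out as soon as the limit is exceeded.
            if (a.charAt(i) != b.charAt(i) && ++mismatches > maxMismatches) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesWithMismatches("ABCD", "ABCD", 1)); // prints true
        System.out.println(matchesWithMismatches("ABCC", "ABCD", 1)); // prints true
        System.out.println(matchesWithMismatches("ACCC", "ABCD", 1)); // prints false
    }
}
```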

Try this to store the strings in a char array (char[] charArray = String.toCharArray()).
char[] stringA = firstString.toCharArray();
char[] stringB = secondString.toCharArray();
int ctr = 0;
if(stringA.length == stringB.length){
for(int i = 0; i<stringA.length; i++){
if(stringA[i] == stringB[i]){
ctr++;
}
}
}
//do the if-else here using the ctr

If you want the FASTEST way, you should code it from an existing algorithm like "Approximate Boyer-Moore String Matching" or Suffix Tree method...
Look at here: https://codereview.stackexchange.com/questions/13383/approximate-string-matching-interview-question
They did the math, you do the code...
Other interesting SO posts are:
Getting the closest string match
Can java.util.regex.Pattern do partial matches?
Generating all permutations of a given string
Similarity Score - Levenshtein

Remove all non alphabetic characters from a String array in java

I'm trying to write a method that removes all non alphabetic characters from a Java String[] and then converts each String to a lower case string. I've tried using a regular expression to replace the occurrence of all non alphabetic characters with "". However, the output that I am getting does not reflect this. Here is the code
static String[] inputValidator(String[] line) {
for(int i = 0; i < line.length; i++) {
line[i].replaceAll("[^a-zA-Z]", "");
line[i].toLowerCase();
}
return line;
}
However if I try to supply an input that has non alphabets (say - or .) the output also consists of them, as they are not removed.
Example Input
A dog is an animal. Animals are not people.
Output that I'm getting
A
dog
is
an
animal.
Animals
are
not
people.
Output that is expected
a
dog
is
an
animal
animals
are
not
people

The problem is your changes are not being stored because Strings are immutable. Each of the method calls is returning a new String representing the change, with the current String staying the same. You just need to store the returned String back into the array.
line[i] = line[i].replaceAll("[^a-zA-Z]", "");
line[i] = line[i].toLowerCase();
Because each method returns a String, you can chain your method calls together. This will perform the second method call on the result of the first, allowing you to do both actions in one line.
line[i] = line[i].replaceAll("[^a-zA-Z]", "").toLowerCase();

You need to assign the result of your regex back to line[i].
for ( int i = 0; i < line.length; i++) {
line[i] = line[i].replaceAll("[^a-zA-Z]", "").toLowerCase();
}

It doesn't work because strings are immutable, you need to set a value
e.g.
line[i] = line[i].toLowerCase();

You must reassign the result of toLowerCase() and replaceAll() back to line[i], since Java String is immutable (its internal value never changes, and the methods in String class will return a new String object instead of modifying the String object).

As this has already been answered, I just thought of sharing one more way that was not mentioned here (note that \P{Alnum} also keeps digits; use \P{Alpha} to keep letters only):
str = str.replaceAll("\\P{Alnum}", "").toLowerCase();

A cool (but slightly cumbersome, if you don't like casting) way of doing what you want to do is to go through the entire string, index by index, casting each result from String.charAt(index) to (byte), and then checking to see if that byte is either a) in the numeric range of lower-case alphabetic characters (a = 97 to z = 122), in which case cast it back to char and add it to a String, array, or what-have-you, or b) in the numeric range of upper-case alphabetic characters (A = 65 to Z = 90), in which case add 32 (A + 32 = 65 + 32 = 97 = a), cast that to char, and add it in. If it is in neither of those ranges, simply discard it.
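A rough sketch of this range-check idea follows. It deviates from the description in one respect: it compares the char values directly instead of casting to byte, because a byte cast would fold some non-ASCII characters into the ASCII letter range. The names AsciiLetterFilter and lettersOnlyLowerCase are hypothetical.

```java
final class AsciiLetterFilter {
    // Keeps only ASCII letters and lower-cases them by arithmetic on the char value.
    static String lettersOnlyLowerCase(String input) {
        StringBuilder kept = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= 'a' && c <= 'z') {          // 97..122: already lower case
                kept.append(c);
            } else if (c >= 'A' && c <= 'Z') {   // 65..90: shift by 32 to reach lower case
                kept.append((char) (c + 32));
            }
            // anything else is discarded
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(lettersOnlyLowerCase("Animals.")); // prints animals
    }
}
```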

You can also use Arrays.setAll for this:
Arrays.setAll(array, i -> array[i].replaceAll("[^a-zA-Z]", "").toLowerCase());

Here is a working method
String name = "Joy.78#,+~'{/>";
String[] stringArray = name.split("\\W+");
StringBuilder result = new StringBuilder();
for (int i = 0; i < stringArray.length; i++) {
result.append(stringArray[i]);
}
String nameNew = result.toString();
nameNew = nameNew.toLowerCase(); // Strings are immutable, so keep the returned value

public static void solve(String line){
// trim to remove unwanted spaces
line= line.trim();
String[] split = line.split("\\W+");
// print using for-each
for (String s : split) {
System.out.println(s);
}
}

Replace substring with a regex combination

Since I'm not that familiar with java, I don't know if there's a library somewhere that can do this thing. If not, does anybody have any ideas how can this be accomplished?
For instance I have a string "foo" and I want to change the letter f with "f" and "a" so that the function returns a list of strings with values "foo" and "aoo".
How to deal with it when there's more of the same letters? "ffoo" into "ffoo", "afoo", "faoo", "aaoo".
A better explanation:
(("a",("a","b")),("c",("c","d")))
Above is a group of characters that need to be replaced with a character from the other element. "a" is to be replaced with "a" and with "b". "c" is to be replaced with "c" and "d".
If I have a string "ac", the resulting combinations I need are:
"ac"
"bc"
"ad"
"bd"
If the string is "IaJaKc", the resulting combinations are:
"IaJaKc"
"IbJaKc"
"IaJbKc"
"IbJbKc"
"IaJaKd"
"IbJaKd"
"IaJbKd"
"IbJbKd"
The number of combinations can be calculated like this:
(replacements_of_a^letter_amount_a)*(replacements_of_c^letter_amount_c)
first case: 2^1*2^1 = 4
second case: 2^2*2^1 = 8
If, say, the group is (("a",("a","b")),("c",("c","d","e"))), and the string is "aac", the number of combinations is:
2^2*3^1 = 12

Here is the code for your example with foo and aoo
public List<String> doSmthTricky (String str) {
return Arrays.asList(str.replaceAll("(^.)(.*)", "$1$2 a$2").split(" "));
}
For the input "foo" this method returns a list with 2 strings "foo" and "aoo".
It works only if there is no whitespaces in your input string ("foo" in your example). Otherwise it's a bit more complicated.
How to deal with it when there's more of the same letters? "ffoo" into "ffoo", "afoo", "faoo", "aaoo".
I doubt that regular expressions could help here, you want to generate strings based on initial string, it's not a task for regexp.
UPD: I've created a recursive function (actually it's half-recursive half-iterative) which generates strings based on the template string by replacing its first characters with characters from a specified set:
public static List<String> generatePermutations (String template, String chars, int depth, List<String> result) {
if (depth <= 0) {
result.add (template);
return result;
}
for (int i = 0; i < chars.length(); i++) {
String newTemplate = template.substring(0, depth - 1) + chars.charAt(i) + template.substring(depth);
generatePermutations(newTemplate, chars, depth - 1, result);
}
generatePermutations(template, chars, depth - 1, result);
return result;
}
The depth parameter says how many characters from the beginning of the string should be replaced. The number of generated strings is (chars.length() + 1) ^ depth.
Tests:
System.out.println(generatePermutations("ffoo", "a", 2, new LinkedList<String>()));
Output: [aaoo, faoo, afoo, ffoo]
--
System.out.println(generatePermutations("ffoo", "ab", 3, new LinkedList<String>()));
Output: [aaao, baao, faao, abao, bbao, fbao, afao, bfao, ffao, aabo, babo, fabo, abbo, bbbo, fbbo, afbo, bfbo, ffbo, aaoo, baoo, faoo, aboo, bboo, fboo, afoo, bfoo, ffoo]

I'm not sure what you need. Please specify the source and the result you expect. Anyway, you should use the standard Java classes for this purpose: java.util.regex.Pattern and java.util.regex.Matcher. If you need to deal with repeating letters at the beginning, there are two ways: use the "^" symbol, which matches the beginning of the line, or use the "\b" shortcut, which matches a word boundary (such as the beginning of a word). For more sophisticated cases, take a look at "lookbehind" expressions. You can find complete descriptions of these techniques in the Javadoc for java.util.regex, and if that's not enough, look at www.regular-expressions.info. Good luck.

Here it is:
public static void returnVariants(String input){
List<String> output = new ArrayList<String>();
StringBuffer word = new StringBuffer(input);
output.add(input);
String letters = "ac";
int lettersLength = letters.length();
int wordLength = word.length();
String replacement = "";
for (int i = 0; i < lettersLength; i++) {
for (int j = 0; j < wordLength; j++) {
if(word.charAt(j)==letters.charAt(i)){
if (word.charAt(j)=='a'){
replacement = "ab";
}else if (word.charAt(j)=='c'){
replacement = "cd";
}
List<String> tempList = new ArrayList<String>();
for (int k = 0; k < replacement.length(); k++) {
for (String variant : output){
StringBuffer tempBuffer = new StringBuffer(variant);
String combination = tempBuffer.replace(j, j+1, replacement.substring(k, k+1)).toString();
tempList.add(combination);
}
}
output.addAll(tempList);
if (j==0){
output.remove(0);
}
}
}
}
Set<String> uniqueCombinations = new HashSet<String>(output);
System.out.println(uniqueCombinations);
}
If input is "ac", the combinations returned are "ac", "bc", "ad", "bd". If it can be optimized further, any additional help is welcome and appreciated.
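One possible generalisation, offered only as a sketch: build the variants position by position, with the replacement groups passed in as a map instead of being hard-coded. The names VariantGenerator and variants, and the use of a Map from each character to a String of its allowed replacements, are assumptions made for this example and are not part of the answer above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class VariantGenerator {
    // For each position, extends every prefix built so far with each allowed replacement.
    // Characters without an entry in the map are kept unchanged.
    static List<String> variants(String input, Map<Character, String> replacements) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String options = replacements.getOrDefault(c, String.valueOf(c));
            List<String> next = new ArrayList<>(result.size() * options.length());
            for (int k = 0; k < options.length(); k++) {
                for (String prefix : result) {
                    next.add(prefix + options.charAt(k));
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Character, String> groups = new HashMap<>();
        groups.put('a', "ab"); // "a" may become "a" or "b"
        groups.put('c', "cd"); // "c" may become "c" or "d"
        System.out.println(variants("ac", groups)); // prints [ac, bc, ad, bd]
    }
}
```

Because every position multiplies the number of prefixes by the number of options for that character, the result size matches the formula from the question, for example 2^2 * 2^1 = 8 for "IaJaKc".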
