Regex: Find first occurrence and map to canonical value


I have some input data like this:
1996 caterpiller d6 dozer for sale (john deere and komatsu too!)
I want to match the first brand name found and map it to its canonical value.
Here's the map:
canonical regex
KOMATSU \bkomatsu\b
CAT \bcat(erpill[ae]r)?\b
DEERE \b(john )?deere?\b
I can easily test that a brand is in the string:
/\b(cat(erpill[ae]r)?|(john )?deere?|komatsu)\b/i.exec(...) != null
or what the first match was:
/\b(cat(erpill[ae]r)?|(john )?deere?|komatsu)\b/i.exec(...)[0]; //caterpiller
But is there a fast or convenient way to map the first match to the real value that I want?
caterpiller => CAT
Do I need to find the first match, then test against all patterns in the map?
I need to do 10,000+ inputs against 10,000+ brands :D
I could loop over the map, testing each pattern against the input value, but that would find the first brand that appears in the map, not the first one that appears in the input.

One idea is to associate the number of each capture group with an index in the array of canonical names, so each brand must have its own capture group:
var can = ['', 'KOMATSU', 'CAT', 'DEERE'];
// can[1] = KOMATSU, can[2] = CAT, can[3] = DEERE (index 0 is unused)
var re = /\b(?:(komatsu)|(cat(?:erpill[ae]r)?)|((?:john )?deere))\b/ig;
// group 1 = komatsu, group 2 = cat/caterpillar, group 3 = deere
var text = '1996 caterpiller d6 dozer for sale (john deere and komatsu too!)';
var res;
while ((res = re.exec(text)) !== null) {
    for (var i = 1; i < can.length; i++) { // test each group until one is defined
        if (res[i] !== undefined) {
            console.log(can[i] + "\t" + res[0]);
            break;
        }
    }
}
// result:
// CAT caterpiller
// DEERE john deere
// KOMATSU komatsu
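Since the question is tagged java, here is a minimal Java sketch of the same group-index idea. The class name BrandMapper and the method firstBrand are hypothetical, and the three-brand pattern is only the sample from the question. Unlike the loop above, it stops after the first match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BrandMapper {
    // Index i holds the canonical name for capture group i (index 0 is unused).
    private static final String[] CANONICAL = {"", "KOMATSU", "CAT", "DEERE"};

    // One capturing group per brand; inner groups are non-capturing,
    // so the capturing groups are numbered 1..3 in the same order as CANONICAL.
    private static final Pattern BRANDS = Pattern.compile(
            "\\b(?:(komatsu)|(cat(?:erpill[ae]r)?)|((?:john )?deere))\\b",
            Pattern.CASE_INSENSITIVE);

    /** Returns the canonical name of the first brand found in text, or null if none matches. */
    static String firstBrand(String text) {
        Matcher m = BRANDS.matcher(text);
        if (!m.find()) {
            return null;
        }
        // Exactly one of the brand groups participated in the match; find which one.
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.group(i) != null) {
                return CANONICAL[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstBrand(
                "1996 caterpiller d6 dozer for sale (john deere and komatsu too!)"));
    }
}
```

For the sample input, firstBrand should return CAT, because caterpiller is the leftmost match in the text regardless of where its alternative sits in the pattern.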

Related

Getting integers out of a string containing words in java

Right now I have a string input along the lines of "Stern Brenda 90 86 45". I'm trying to find a way to get 90 86 and 45 out of that and assign them as ints to tests 3, 2, and 1 respectively to compute an average of them.
while ((line = reader.readLine()) != null) {
test3 = line.indexOf(-2, -1);
test2 = line.indexOf(-5, -4);
test1 = line.indexOf(-8, -7);
This returns -1 for each test. (I also tried using a regular expression to start from index -2 and go until another integer is found.) Getting a two-digit integer, as opposed to a single digit like 5 or 6, is really what's throwing me off. Is the .indexOf method the best way to get these numbers out of the string? If so, how am I using it incorrectly?
edit: I found a solution that was relatively simple.
while ((line = reader.readLine()) != null) {
String nums = line.replaceAll("[\\D]", "");
test1 = Integer.parseInt(nums.substring(0,2));
test2 = Integer.parseInt(nums.substring(2,4));
test3 = Integer.parseInt(nums.substring(4,6));
For the input "Stern Brenda 90 86 45", this returns 90 for test1, 86 for test2, and 45 for test3 (all as integers).

CheshireMoe almost has it right, but he's accessing a List like an array, which probably won't work. In his example:
Instead of:
test3 = Integer.parseInt(tokens[tokens.length-1]);
test2 = Integer.parseInt(tokens[tokens.length-2]);
test1 = Integer.parseInt(tokens[tokens.length-3]);
Should be:
test3 = Integer.parseInt(tokens.get(tokens.size()-1));
test2 = Integer.parseInt(tokens.get(tokens.size()-2));
test1 = Integer.parseInt(tokens.get(tokens.size()-3));
An easier solution might be to just split the line on spaces:
while ((line = reader.readLine()) != null) {
String [] tokens = line.split(" ");
if (tokens.length != 5) { // catch errors in your data!
throw new Exception(); // <-- use this if you want to stop on bad data
// continue; <-- use this if you just want to skip the record, instead
}
test3 = Integer.parseInt(tokens[4]);
test2 = Integer.parseInt(tokens[3]);
test1 = Integer.parseInt(tokens[2]);
}
Based on your data, you might also consider putting in some validation like I've shown, to catch things like:
a value is missing (student didn't take one of the tests)
not all the grades were entered as numbers (i.e. bad characters)
first and last name both exist
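The validation ideas above can be sketched as a small helper. The names ScoreLine and parseScores are hypothetical, and this version assumes at least five whitespace-separated tokens (so a middle name is tolerated) with the three scores at the end of the line; tighten the length check to exactly five if your data demands it.

```java
final class ScoreLine {
    /** Parses the last three whitespace-separated tokens as test1, test2, test3. */
    static int[] parseScores(String line) {
        String[] tokens = line.trim().split("\\s+");
        // Assumption: at least first name, last name, and three scores.
        if (tokens.length < 5) {
            throw new IllegalArgumentException("Expected names and three scores: " + line);
        }
        int[] scores = new int[3];
        for (int i = 0; i < 3; i++) {
            String token = tokens[tokens.length - 3 + i];
            try {
                scores[i] = Integer.parseInt(token);
            } catch (NumberFormatException e) {
                // Report bad characters in a grade instead of failing with a bare stack trace.
                throw new IllegalArgumentException("Not a number: " + token, e);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        int[] scores = parseScores("Stern Brenda 90 86 45");
        System.out.println(scores[0] + " " + scores[1] + " " + scores[2]);
    }
}
```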

You could use a regular expression to parse the string. This just computes the average. You can also assign the individual values as you deem appropriate.
String s = "Stern Brenda 90 86 45";
double sum = 0;
// \\b  - a word boundary
// \\d+ - one or more digits
// ()   - a capture group
// matching on the string s
Matcher m = Pattern.compile("\\b(\\d+)\\b").matcher(s);
int count = 0;
// As long as find() returns true, you have a match,
// so convert group(1) to a double, add it to sum, and increment the count.
while (m.find()) {
    sum += Double.parseDouble(m.group(1));
    count++;
}
// When done, compute the average.
System.out.println(sum + " " + count); // just for demo purposes
if (count > 0) { // just in case
    double avg = sum / count;
    System.out.println("Avg = " + avg);
}
prints
221.0 3
Avg = 73.66666666666667
Check out the Pattern class for more details.
Formatting the final answer may be desirable. See System.out.printf
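As a small illustration of that last remark, here is a sketch that rounds the average to two decimal places. The class and method names are hypothetical; String.format takes the same format string as System.out.printf, and Locale.ROOT is used only so the decimal separator is always a dot.

```java
import java.util.Locale;

final class AverageFormat {
    /** Formats sum / count with two decimal places. */
    static String formatAverage(double sum, int count) {
        return String.format(Locale.ROOT, "Avg = %.2f", sum / count);
    }

    public static void main(String[] args) {
        System.out.println(formatAverage(221.0, 3)); // Avg = 73.67
    }
}
```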

StringTokenizer is a very useful way to work with data strings that come from files or streams. If your data is separated by a specific character (a space in this case), StringTokenizer is an easy way to break a large string into parts and iterate through the data.
Since you don't seem to care about the other 'words' in the line & have not specified if it will be a constant number (middle name?) my example puts all the tokens in an array to get the last three like in the question. I have also added the parseInt() method to convert from strings to int. Here is how you would tokenize your lines.
while ((line = reader.readLine()) != null) {
List<String> tokens = new ArrayList<>();
StringTokenizer tokenizer = new StringTokenizer(line, " ");
while (tokenizer.hasMoreElements()) {
tokens.add(tokenizer.nextToken());
}
test3 = Integer.parseInt(tokens.get(tokens.size()-1));
test2 = Integer.parseInt(tokens.get(tokens.size()-2));
test1 = Integer.parseInt(tokens.get(tokens.size()-3));
}

How to join string in kotlin and add postfix only if more than 1 item

I have a list of some data class which I want to join into a string in kotlin EFFICIENTLY (least amount of code).
the data class is:
data class Animal(val name: String, val description: String)
and I am getting a List<Animal> in some other class where I want to turn the list into string to display as per follows:
if the list has only one item (e.g. [Animal(name = "Dog", description = "Good dog, age 2 years")]) then display the name only, on one line:
Dog
if there is more than one item in the list (e.g. [Dog, Cat, Mouse]) then display the names with one blank line after each animal name, like:
Dog
Cat
Mouse
I have done this with the following statements, but they are ugly and hard to read, so I want to ask how I can do the same thing in a neater way.
solution A:
animals.joinToString("\n\n") { it.name } + if (animals.size > 1) "\n" else ""
solution B:
animals.joinToString(separator = "\n\n", postfix = if (animals.size > 1) "\n" else "") { it.name }
please suggest how to improve this..

Since you have 2 cases it's difficult to compact this logic much further but I found this to be as "neat" as I can produce:
animals.takeIf { it.size == 1 }?.get(0)?.name ?: animals.joinToString(separator = "\n\n", postfix = "\n") { it.name }

java parsing array input control

Thanks for checking out my question.
Starting off, the program has the following goal; the user inputs currency formatted as "xD xC xP xH"; the program checks the input is correct and then prints back the 'long' version: "x Dollars, x Cents, x Penny's, x half penny's"
Here I have some code that takes input from user as String currencyIn, splits the string into array tokens, then replaces the D's with Dollars etc and prints the output.
import java.util.Scanner;

public class parseArray
{
    public parseArray()
    {
        System.out.print('\u000c');
        String CurrencyFormat = "xD xC xP xH";
        System.out.println("Please enter currency in the following format: \""+CurrencyFormat+"\" where x is any integer");
        System.out.println("\nPlease take care to use the correct spacing enter the exact integer plus type of coin\n\n");
        Scanner input = new Scanner(System.in);
        String currencyIn = input.nextLine();
        // Strings are immutable, so the upper-cased result must be assigned back.
        currencyIn = currencyIn.toUpperCase();
        System.out.println("This is the currency you entered: "+currencyIn);
        String[] tokens = currencyIn.split(" ");
        for (String t : tokens)
        {
            System.out.println(t);
        }
        String dollars = tokens[0].replaceAll("D", " Dollars ");
        String cents = tokens[1].replaceAll("C", " cents");
        String penny = tokens[2].replaceAll("P", " Penny's");
        String hPenny = tokens[3].replaceAll("H", " Half penny's");
        System.out.println(" "+dollars+ " " +cents+ " " +penny+ " " +hPenny);
        input.close();
    }
}
Question 1: At the moment the program prints out pretty much anything you put in. How do I establish some input control? I've seen this done in textbooks with a switch statement and a series of if statements, but those were too complicated for me. Would it parse characters using charAt() for each element of the array?
Question 2: Is there a 'better' way to print the output? My friend suggested converting my 4 strings (dollars, cents, penny's, hpenny's) into elements 0, 1, 2, 3 of a new array (called newArray) and printing it like this:
System.out.println(Arrays.toString(newArray));
Many thanks in advance.

There is a neat solution involving regular expressions, streams, and some lambdas. The core concept is that we define the input format through a regular expression: a sequence of digits, followed by a 'D' or a 'd', followed by a space, followed by a sequence of digits, followed by a 'C' or a 'c', and so on. I will skip the derivation of this pattern; it is explained in the regular expression tutorial I linked above. We will find that
final String regex = "([0-9]+)[Dd]\\ ([0-9]+)[Cc]\\ ([0-9]+)[Pp]\\ ([0-9]+)[Hh]";
satisfies our needs. (Inside a character class, | is a literal character, so the class is written [Dd] rather than [D|d].) With this regular expression we can now determine whether our input String has the right format (input.matches(regex)), as well as extract the bits of information we are actually interested in (input.replaceAll(regex, "$1 $2 $3 $4")). Sadly, replaceAll yields another String, but it will contain the four digit sequences we are interested in, separated by a " ". We will use some stream magic to transform this String into a long[] (where the first cell holds the D-value, the second holds the C-value, and so on). The final program looks like this:
import java.util.Arrays;
public class Test {
    public static void main(String... args) {
        final String input = args[0];
        // [Dd] rather than [D|d]: inside a character class, | is a literal character.
        final String regex =
            "([0-9]+)[Dd]\\ ([0-9]+)[Cc]\\ ([0-9]+)[Pp]\\ ([0-9]+)[Hh]";
        if (!input.matches(regex)) {
            throw new IllegalArgumentException("Input is malformed.");
        }
        long[] values = Arrays.stream(input.replaceAll(regex, "$1 $2 $3 $4").split(" "))
            .mapToLong(Long::parseLong)
            .toArray();
        System.out.println(Arrays.toString(values));
    }
}
If you want a List<Long> instead of a long[] (or a List<Integer> instead of an int[]), you would use the following, which also needs imports for java.util.List and java.util.stream.Collectors:
List<Long> values = Arrays.stream(input.replaceAll(regex, "$1 $2 $3 $4").split(" "))
.map(Long::parseLong)
.collect(Collectors.toList());
It is necessary to change mapToLong to map to receive a Stream<Long> instead of a LongStream. I am sure that one could somehow write a custom Collector for LongStream to transform it into a List<Long>, but I found this solution more readable and reliable (after all, the Collector used comes from Oracle, I trust they test their code extensively).
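As an alternative to switching from mapToLong to map, the standard LongStream.boxed() method converts a LongStream into a Stream<Long>, so no custom Collector is needed. The following is only a sketch; the class name BoxedValues and the method parse are hypothetical, and the input is assumed to be the space-separated digit string produced by replaceAll.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class BoxedValues {
    /** Parses space-separated integers into a List<Long> by boxing a LongStream. */
    static List<Long> parse(String spaceSeparated) {
        return Arrays.stream(spaceSeparated.split(" "))
                .mapToLong(Long::parseLong) // LongStream of primitives
                .boxed()                    // Stream<Long>
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(parse("10 9 8 7")); // [10, 9, 8, 7]
    }
}
```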
Here are some example calls:
$> java Test "10D 9c 8p 7H"
[10, 9, 8, 7]
$> java Test "10E 9C 8P 7H"
Exception in thread "main" java.lang.IllegalArgumentException: Input is malformed.
at Test.main(Test.java:10)
$> java Test "10D 9C 8P 7H 10D 9C 8P 7H"
Exception in thread "main" java.lang.IllegalArgumentException: Input is malformed.
at Test.main(Test.java:10)

Question1
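You can check whether the input is what it's supposed to be with simple checks. For example, you can check that the first element ends with a D like this (charAt returns a char, so it is compared with == against a char literal):
if (tokens[0].charAt(tokens[0].length() - 1) == 'D')
    return true;
else
    return false;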
Another way to check if the input is correct is by using Regular Expressions, but I assume you are a beginner and this is too much trouble for you, although it is the better way. So I leave it to you to look through it later.
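For reference, a regular-expression check does not have to be complicated. The sketch below validates each token separately with String.matches; the class name CurrencyCheck, the method isValid, and the suffix order D, C, P, H are assumptions taken from the format in the question.

```java
final class CurrencyCheck {
    // Expected suffix of each token, in order: Dollars, Cents, Pennies, Half pennies.
    private static final String SUFFIXES = "DCPH";

    /** Returns true if the input looks like "xD xC xP xH" where each x is an integer. */
    static boolean isValid(String currencyIn) {
        String[] tokens = currencyIn.trim().toUpperCase().split(" ");
        if (tokens.length != SUFFIXES.length()) {
            return false;
        }
        for (int i = 0; i < tokens.length; i++) {
            // One or more digits followed by the suffix expected at this position.
            if (!tokens[i].matches("\\d+" + SUFFIXES.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("10D 9c 8p 7H")); // true
        System.out.println(isValid("10E 9C 8P 7H")); // false
    }
}
```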
Question2
You can actually listen to your friend and do as they said. You can write it as follows:
for (int i = 0; i < 4; i++)
    System.out.print(" " + tokens[i]);
System.out.println();
Or you may use
System.out.println(Arrays.toString(newArray));
And you have saved newArray like this:
newArray[0] = " " + tokens[0];

You could use the .equals() method to see whether what the user typed matches what you have. Note that this only succeeds when the input is exactly equal to the string you compare against:
if (currencyIn.equals(CurrencyFormat))
{
...
}
This is probably the simplest way I can think of!

Regex - Find javascript methods and its variables in text

The best solution I have come up with so far: given a text block, it finds the methods that have parameters, but also a function with a parameter key like this: "get: function(key)".
public class JavaScriptMethodFinder
{
static readonly string pattern = @"(?<=\s(?<Begin>[a-zA-Z_][a-zA-Z0-9_]*?)\(|\G)\s*((['""]).+?(?<!\\)\2|\{[^}]+\}|[^,;'""(){}\)]+)\s*(?:,|(?<IsEnd>\)))";
private static readonly Regex RegEx = new Regex(pattern, RegexOptions.Compiled);
public IEnumerable<dynamic> Find(string text)
{
var t = RegEx.Matches(text);
dynamic current = null;
bool isBegin;
foreach (Match item in t)
{
if (isBegin = (item.Groups["Begin"].Value != string.Empty))
{
current = new ExpandoObject();
current.MethodName = item.Groups["Begin"].Value;
current.Parameters = new List<string>();
current.Parameters.Add(item.Groups[1].Value);
}else
current.Parameters.Add(item.Groups[1].Value);
if (item.Groups["IsEnd"].Value != string.Empty)
{
isBegin = false;
if(!(item.Groups["Begin"].Value != string.Empty))
current.Parameters.Add(item.Groups[1].Value);
yield return current;
}
}
}
}
I want to find methods and their parameters. Here are two examples.
First Example
function loadMarkers(markers)
{
markers.push(
new Marker(
"Hdsf",
40.261330438503,
10.4877055287361,
"some text"
)
);
}
Second Example
var block = new AnotherMethod('literal', 'literal', {"key":0,"key":14962,"key":false,"key":2});
So far i have, tested here: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
(?<=Marker\(|\G)\s*((?<name>['""]).+?(?<!\\)\2|\{[^}]+\}|[^,;'""(){}\)]+)\s*(?:,|\))
Found 5 matches:
"Hdsf", has 2 groups:
"Hdsf"
"
40.261330438503, has 2 groups:
40.261330438503
10.4877055287361, has 2 groups:
10.4877055287361
"some text" ) has 2 groups:
"some text"
"
) has 2 groups:
(?<=AnotherMethod\(|\G)\s*((?<name>['""]).+?(?<!\\)\2|\{[^}]+\}|[^,;'""(){}\)]+)\s*(?:,|\))
Found 3 matches:
'literal', has 2 groups:
'literal'
' (name)
'literal', has 2 groups:
'literal'
' (name)
{"key":0,"key":14962,"key":false,"key":2}) has 2 groups:
{"key":0,"key":14962,"key":false,"key":2}
(name)
I would like to combine it such that i have one expression
Match<(methodname)>
Group : parameter
Group : parameter
Group : parameter
Match<(methodname)>
Group : parameter
Group : parameter
Group : parameter
so when I scan a page which contains both cases, I will get two matches, with the first capture being the method name and the following ones being the parameters.
I have been trying to modify what I already have, but the lookbehind makes it too complex for me to understand.

Regexes are a very problematic approach for this type of project. Have you looked at using a genuine JavaScript parser/compiler like Rhino? That will give you full awareness of JavaScript syntax "for free" and the ability to walk your source code meaningfully.

program to determine number of duplicates in a sentence

Code:
public class duplicate
{
public static void main(String[] args)throws IOException
{
System.out.println("Enter words separated by spaces ('.' to quit):");
Set<String> s = new HashSet<String>();
Scanner input = new Scanner(System.in);
while (true)
{
String token = input.next();
if (".".equals(token))
break;
if (!s.add(token))
System.out.println("Duplicate detected: " + token);
}
System.out.println(s.size() + " distinct words:\n" + s);
Set<String> duplicatesnum = new HashSet<String>();
String token = input.next();
if (!s.add(token))
{
duplicatesnum.add(token);
System.out.println("Duplicate detected: " + token);
}
System.out.println(duplicatesnum.size());
}
}
the output is:
Enter words separated by spaces ('.' to quit):
one two one two .
Duplicate detected: one
Duplicate detected: two
2 distinct words:
[two, one]

I assume you want to know the number of different duplicate words. You can use another HashSet<String> for the duplicates.
//Outside the loop
Set<String> duplicates = new HashSet<String>();
//Inside the loop
if (!s.add(token))
{
duplicates.add(token);
System.out.println("Duplicate detected: " + token);
}
//Outside the loop
System.out.println(duplicates.size());
Also, if you care about the number of occurrences of each word, declare a HashMap<String, Integer> as mentioned in other posts.
But if you want the total number of duplicate words (not just the distinct ones), declare a counter:
//Outside the loop
int duplicates = 0;
//Inside the loop
if (!s.add(token))
{
duplicates++;
System.out.println("Duplicate detected: " + token);
}
//Outside the loop
System.out.println(duplicates);

Instead of a HashSet, use a HashMap. A HashSet only stores the values. A HashMap maps a value to another value (see http://www.geekinterview.com/question_details/47545 for an explanation)
In your case, the key of the HashMap is your string (just as the key of the HashSet is the string). The value in the HashMap is the number of times you encountered this string.
When you find a new string, add it to the HashMap, and set the value of the entry to one.
When you encounter the same string later, increment the value in the HashMap.
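A minimal sketch of this counting approach, using Map.merge to handle both the first sighting and later increments. The names WordCounter, countWords, and duplicateCount are hypothetical. This sketch stores occurrence counts, so the first sighting stores 1 and a word is a duplicate when its count exceeds 1.

```java
import java.util.HashMap;
import java.util.Map;

final class WordCounter {
    /** Maps each whitespace-separated word to the number of times it occurs. */
    static Map<String, Integer> countWords(String sentence) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : sentence.trim().split("\\s+")) {
            // Stores 1 for a new word, otherwise adds 1 to the existing count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    /** Number of distinct words that occur more than once. */
    static long duplicateCount(Map<String, Integer> counts) {
        return counts.values().stream().filter(n -> n > 1).count();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("one two one two");
        System.out.println(counts + " duplicates: " + duplicateCount(counts));
    }
}
```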

Because you are using a HashSet, you will not know how many duplicates you have. If you went with a HashMap<String, Integer>, you could increment the count whenever you found that the map already held a value for the key (that is, get returned something other than null).

In the if (!s.add(token)) branch, you can increment a counter and then display its value at the end.

Your question is a bit misleading. Some people understand that you want:
Input: hello man, hello woman, say good by to your man.
Output:
Found duplicate: Hello
Found duplicate: Man
Duplicate count: 2
Others understood you wanted:
Input: hello man, hello woman, say hello to your man.
Output:
Found duplicate: Hello - 3 appearances
Found duplicate: Man - 2 appearances
Assuming you want the 1st option - go with Petar Minchev's solution
Assuming you want the 2nd option - go with Patrick's solution. Don't forget that when you use an Integer in a Map, you can get/put an int as well, and Java will automatically box/unbox it for you. But if you rely on this, you can get NPEs when you ask the map for a key that does not exist and assign the result to an int:
Map<String,Integer> myMap = new HashMap<String,Integer>();
int count = myMap.get("key that does not exist"); // NPE here <---
The NPE occurs because the return value of get is null, and unboxing it into an int invokes intValue() on that null reference.
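One way to sidestep that NPE is Map.getOrDefault (available since Java 8), which returns a fallback value when the key is absent. The class and method names below are hypothetical; this is only a sketch of the lookup, not of the whole counting program.

```java
import java.util.HashMap;
import java.util.Map;

final class SafeLookup {
    /** Returns the stored count for word, or 0 if the word was never seen. */
    static int countOf(Map<String, Integer> counts, String word) {
        // getOrDefault never returns null here, so unboxing to int is safe.
        return counts.getOrDefault(word, 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("man", 2);
        System.out.println(countOf(counts, "man"));   // 2
        System.out.println(countOf(counts, "woman")); // 0
    }
}
```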

You can use Google collections library:
Multiset<String> words = HashMultiset.create();
while (true) {
String token = input.next();
if (".".equals(token))
break;
if (!words.add(token))
System.out.println("Duplicate detected: " + token);
}
System.out.println(words.elementSet().size() + " distinct words:\n" + words.elementSet());
Collection<Entry<String>> duplicateWords = Collections2.filter(words.entrySet(), new Predicate<Entry<String>>() {
public boolean apply(Entry<String> entry) {
return entry.getCount() > 1;
}
});
System.out.println("There are " + duplicateWords.size() + " duplicate words.");
System.out.println("The duplicate words are: " + Joiner.on(", ").join(duplicateWords));
Example of output:
Enter words separated by spaces ('.' to quit):
aaa bbb aaa ccc aaa bbb .
3 distinct words:
[aaa, ccc, bbb]
There are 2 duplicate words.
The duplicate words are: aaa x 3, bbb x 2
