Sub string detection performance?

I need to match a substring, and I wonder which of these two regex approaches is faster:
if ( str.matches(".*hello.*") ) {
...
}
Pattern p = Pattern.compile( ".*hello.*" );
Matcher m = p.matcher( str );
if ( m.find() ) {
...
}
And if I don't need a regex, should I use contains()?
if ( str.contains("hello") ) {
...
}
Thanks.

Although matches() and using a Matcher are identical (matches() uses a Matcher in its implementation), using a Matcher can be faster if you cache and reuse the compiled Pattern. I did some rough testing and it improved performance (in my case) by 400%. The size of the improvement depends on the regex, but there will always be some improvement.
Although I haven't tested it, I would expect contains() to outperform any regex approach, because the algorithm is far simpler and you don't need regex for this situation.
Here are the results of 6 ways to test for a String containing a substring, with the target ("http") located at various places within a standard 60 character input:
+----------------------------------------+--------------+------------+---------------+
| Code tested                            | start (µsec) | mid (µsec) | absent (µsec) |
+----------------------------------------+--------------+------------+---------------+
| input.startsWith("http")               |            6 |          6 |             6 |
+----------------------------------------+--------------+------------+---------------+
| input.contains("http")                 |            2 |         22 |            49 |
+----------------------------------------+--------------+------------+---------------+
| Pattern p = Pattern.compile("^http.*") |              |            |               |
| p.matcher(input).find()                |           90 |         88 |            86 |
+----------------------------------------+--------------+------------+---------------+
| Pattern p = Pattern.compile("http.*")  |              |            |               |
| p.matcher(input).find()                |           84 |        145 |           181 |
+----------------------------------------+--------------+------------+---------------+
| input.matches("^http.*")               |          745 |        346 |           340 |
+----------------------------------------+--------------+------------+---------------+
| input.matches("http.*")                |         1663 |       1229 |          1034 |
+----------------------------------------+--------------+------------+---------------+
The two-line options are where a static pattern was compiled then reused.
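As a rough illustration of the three styles measured above, here is a minimal sketch. The class name SubstringCheck and its method names are made up for this example, and the inputs are assumed to be single-line strings (by default `.` does not match line terminators, so the `.*hello.*` form behaves differently on multi-line input).

```java
import java.util.regex.Pattern;

public class SubstringCheck {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern HELLO = Pattern.compile("hello");

    // Plain substring test, no regex involved.
    static boolean viaContains(String s) {
        return s.contains("hello");
    }

    // Regex test with the cached pattern; find() searches anywhere in s.
    static boolean viaCachedPattern(String s) {
        return HELLO.matcher(s).find();
    }

    // Convenience form: compiles the regex again on every call.
    static boolean viaStringMatches(String s) {
        return s.matches(".*hello.*");
    }

    public static void main(String[] args) {
        String input = "say hello world";
        System.out.println(viaContains(input));
        System.out.println(viaCachedPattern(input));
        System.out.println(viaStringMatches(input));
    }
}
```

All three return the same answer for single-line input; they differ only in how much work is done per call.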

They are more or less equivalent if you use m.matches() in the second code snippet. The String.matches() documentation specifies:
An invocation of this method of the form str.matches(regex) yields exactly the same result as the expression Pattern.matches(regex, str)
this in turn specifies:
An invocation of this convenience method of the form
Pattern.matches(regex, input);
behaves in exactly the same way as the expression
Pattern.compile(regex).matcher(input).matches()
If a pattern is to be used multiple times, compiling it once and
reusing it will be more efficient than invoking this method each time.
So calling String.matches(String) in itself will not bring performance benefits, but storing a pattern (e.g. as a constant) and reusing it does.
If you use find then matches could be more efficient if the terms don't match early, as find may keep looking. But find and matches don't perform the same function, so comparison of performance is moot.
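To make the difference in function concrete, here is a small sketch (the class and method names are hypothetical). matches() succeeds only if the whole input matches the pattern, while find() succeeds if any part of the input does.

```java
import java.util.regex.Pattern;

public class FindVsMatches {
    private static final Pattern HELLO = Pattern.compile("hello");

    // True only when the entire input is exactly "hello".
    static boolean wholeInputMatches(String s) {
        return HELLO.matcher(s).matches();
    }

    // True when "hello" occurs anywhere in the input.
    static boolean containsMatch(String s) {
        return HELLO.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeInputMatches("hello"));     // true
        System.out.println(wholeInputMatches("say hello")); // false
        System.out.println(containsMatch("say hello"));     // true
    }
}
```

This is why the question's matches() call needs the surrounding `.*` while the find() call would work with plain "hello".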

Related

How do I run a spark sql aggregator cumulatively?

I am currently working on a project with spark datasets (in Java) where I have to create a new column derived from an accumulator run over all the previous rows.
I have been implementing this using a custom UserDefinedAggregationFunction over a Window from unboundedPreceding to currentRow.
This goes something like this:
df.withColumn("newColumn", customAccumulator
.apply(columnInputSeq)
.over(customWindowSpec));
However, I would really prefer to use a typed Dataset for type safety reasons and generally cleaner code. i.e: perform the same operation with an org.apache.spark.sql.expressions.Aggregator over a Dataset<CustomType>. The problem here is I have looked through all the documentation and can't work out how to make it behave in the same way as above (i.e. I can only get a final aggregate over the whole column rather than a cumulative state at each row).
Is what I am trying to do possible and if so, how?
Example added for clarity:
Initial table:
+-------+------+------+
| Index | Col1 | Col2 |
+-------+------+------+
| 1 | abc | def |
| 2 | ghi | jkl |
| 3 | mno | pqr |
| 4 | stu | vwx |
+-------+------+------+
Then with example aggregation operation:
First reverse the accumulator, prepend Col1 append Col2 and return this value, also setting it as the accumulator.
+-------+------+------+--------------------------+
| Index | Col1 | Col2 | Accumulator |
+-------+------+------+--------------------------+
| 1 | abc | def | abcdef |
| 2 | ghi | jkl | ghifedcbajkl |
| 3 | mno | pqr | mnolkjabcdefihgpqr |
| 4 | stu | vwx | sturpqghifedcbajklonmvwx |
+-------+------+------+--------------------------+
Using a UserDefinedAggregateFunction I have been able to produce this but with an Aggregator I can only get the last row.

You don't
My source for this is a friend who has been working on an identical problem to this and has now concluded it's impossible

Why is the Java class file format missing constant pool tag 2?

The JVM specification for Java 1.0.2 lists the following constant pool entry types:
+-----------------------------+-------+
| Constant Type | Value |
+-----------------------------+-------+
| CONSTANT_Class | 7 |
| CONSTANT_Fieldref | 9 |
| CONSTANT_Methodref | 10 |
| CONSTANT_InterfaceMethodref | 11 |
| CONSTANT_String | 8 |
| CONSTANT_Integer | 3 |
| CONSTANT_Float | 4 |
| CONSTANT_Long | 5 |
| CONSTANT_Double | 6 |
| CONSTANT_NameAndType | 12 |
| CONSTANT_Utf8 | 1 |
+-----------------------------+-------+
Subsequent JVM specs have added more constant pool entry types but haven't ever filled the "2" spot. Why is there a gap there?

I did some research and found a clue. Constant pool tag 2 seems to have been held open for CONSTANT_Unicode but was never used, because CONSTANT_Utf8 was already there and UTF-8 is widely adopted: any constant written in Unicode can be handled by UTF-8, which has a number of advantages over other encoding schemes. I guess this historical fact explains why 2 is missing, and I guess it could be reused for other purposes if necessary.
Some statements supporting this can be found here:
https://bugs.openjdk.java.net/browse/JDK-8161256
Tags 13 and 14 presumably have their own specific reasons for being reserved but never used.

Suggest framework for external rule storage

There is a situation:
I've got 2 .xlsx files:
1. With business data
for example:
-----------------------------------------
| Column_A | Column_B| Column_C | Result |
-----------------------------------------
| test | 562.03 | test2 | |
------------------------------------------
2. With business rules
for example:
-------------------------------------------------------------------------
| Column_A | Column_B | Column_C | Result |
-------------------------------------------------------------------------
| EQUALS:test | GREATER:100 | EQUALS:test2 & NOTEQUALS:test | A |
--------------------------------------------------------------------------
| EQUALS:test11 | GREATER:500 | EQUALS:test11 & NOTEQUALS:test | B |
--------------------------------------------------------------------------
With condition in each cell.
One row contains list of these conditions and composes one rule.
All rules will be processed iteratively. But of course, I think, it would be better to construct some 'decision tree' or 'classification flow-chart'.
So my task is to store the functionality of these conditions (methods like EQUALS, GREATER, NOTEQUALS) in some external file or other resource, so that it can be changed without recompiling to Java bytecode. I want a dynamic solution, not one hard-coded in Java methods.
I found Drools (http://drools.jboss.org/) as a way to handle such cases. But maybe there are other frameworks that can work with such issues?
JavaScript, dynamic SQL, and database solutions are not suitable.

Regex expression to capture hyphenated word between lines, and non hyphenated words

I am trying to write a regular expression, in java, that matches words and hyphenated words. So far I have:
Pattern p1 = Pattern.compile("\\w+(?:-\\w+)",Pattern.CASE_INSENSITIVE);
Pattern p2 = Pattern.compile("[a-zA-Z0-9]+",Pattern.CASE_INSENSITIVE);
Pattern p3 = Pattern.compile("(?<=\\s)[\\w]+-$",Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
This is my test case:
Programs
Dsfasdf. Programs Programs Dsfasdf. Dsfasdf. as is wow woah! woah. woah? okay.
he said, "hi." aasdfa. wsdfalsdjf. go-to go-
to
asdfasdf.. , : ; " ' ( ) ? ! - / \ # # $ % & ^ ~ ` * [ ] { } + _ 123
Any help would be awesome
My expected result would be to match all the words ie.
Programs Dsfasdf Programs Programs Dsfasdf Dsfasdf
as is wow woah woah woah okay he said hi aasdfa
wsdfalsdjf go-to go-to asdfasdf
the part I'm struggling with is matching the words that are split up between lines as one word.
ie.
go-
to

\p{L}+(?:-\n?\p{L}+)*
Reading the pattern from left to right:
\p{L}   Match any letter (upper/lower case)
+       Match one or more of previous (any letter)
(?:     Start first non-capture group
-       Literal '-'
\n?     Match a single new-line (optional because of `?`)
\p{L}   Match any letter (upper/lower case)
+       Match one or more of previous (any letter)
)       End first non-capture group
*       Previous can repeat 0 or more times (group of literal '-', optional new-line and one or more of any letter)
Is this OK?
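A minimal sketch of how the pattern above could be used from Java. The class name WordExtractor and the words() helper are invented for this example, and it assumes line breaks are a bare "\n" (Windows "\r\n" endings would need an extra optional "\r" in the pattern). Matches that span a line break are rejoined by dropping the new-line after the hyphen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordExtractor {
    // Letters, optionally followed by hyphenated parts that may continue on the next line.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:-\\n?\\p{L}+)*");

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // Rejoin words split across lines: "go-\nto" becomes "go-to".
            result.add(m.group().replace("-\n", "-"));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "he said, \"hi.\" go-to go-\nto 123";
        System.out.println(words(text)); // [he, said, hi, go-to, go-to]
    }
}
```

Digits such as "123" are skipped because \p{L} covers letters only.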

I would go with regex:
\p{L}+(?:\-\p{L}+)*
Such a regex should also match words like "fiancé" and "À-la-carte", and other words containing characters from the "letter" category. \p{L} matches a single code point in the category "letter".

Ant path style patterns

What are the rules for Ant path style patterns?
The Ant site itself is surprisingly uninformative.

Ant-style path patterns matching in spring-framework:
The mapping matches URLs using the following rules:
? matches one character
* matches zero or more characters
** matches zero or more 'directories' in a path
{spring:[a-z]+} matches the regexp [a-z]+ as a path variable named "spring"
Some examples:
com/t?st.jsp - matches com/test.jsp but also com/tast.jsp or com/txst.jsp
com/*.jsp - matches all .jsp files in the com directory
com/**/test.jsp - matches all test.jsp files underneath the com path
org/springframework/**/*.jsp - matches all .jsp files underneath the org/springframework path
org/**/servlet/bla.jsp - matches org/springframework/servlet/bla.jsp but also org/springframework/testing/servlet/bla.jsp and org/servlet/bla.jsp
com/{filename:\\w+}.jsp will match com/test.jsp and assign the value test to the filename variable
http://docs.spring.io/spring/docs/current/javadoc-api/org/springframework/util/AntPathMatcher.html

I suppose you mean how to use path patterns
If it is about whether to use slashes or backslashes these will be translated to path-separators on the platform used during execution-time.

Most upvoted answer by @user11153, using tables for a more readable format.
The mapping matches URLs using the following rules:
+-----------------+---------------------------------------------------------+
| Wildcard | Description |
+-----------------+---------------------------------------------------------+
| ? | Matches exactly one character. |
| * | Matches zero or more characters. |
| ** | Matches zero or more 'directories' in a path |
| {spring:[a-z]+} | Matches regExp [a-z]+ as a path variable named "spring" |
+-----------------+---------------------------------------------------------+
Some examples:
+------------------------------+--------------------------------------------------------+
| Example | Matches: |
+------------------------------+--------------------------------------------------------+
| com/t?st.jsp | com/test.jsp but also com/tast.jsp or com/txst.jsp |
| com/*.jsp | All .jsp files in the com directory |
| com/**/test.jsp | All test.jsp files underneath the com path |
| org/springframework/**/*.jsp | All .jsp files underneath the org/springframework path |
| org/**/servlet/bla.jsp | org/springframework/servlet/bla.jsp |
| also: | org/springframework/testing/servlet/bla.jsp |
| also: | org/servlet/bla.jsp |
| com/{filename:\\w+}.jsp | com/test.jsp & assign value test to filename variable |
+------------------------------+--------------------------------------------------------+

ANT Style Pattern Matcher
Wildcards
The utility uses three different wildcards.
+----------+-----------------------------------+
| Wildcard | Description |
+----------+-----------------------------------+
| * | Matches zero or more characters. |
| ? | Matches exactly one character. |
| ** | Matches zero or more directories. |
+----------+-----------------------------------+

As @user11153 mentioned, Spring's AntPathMatcher implements and documents the basics of Ant-style path pattern matching.
In addition, Java 7's NIO APIs added some built-in support for basic pattern matching via FileSystem.getPathMatcher.
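A short sketch of the FileSystem.getPathMatcher route, using its "glob" syntax. The class name GlobDemo and the globMatches() helper are made up for this example, and the paths assume a Unix-like default file system. Note that glob syntax is similar to Ant patterns but not identical: here `*` stays within one directory level and `**` crosses directory boundaries.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
    // Checks a path string against a glob pattern on the default file system.
    static boolean globMatches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        System.out.println(globMatches("com/t?st.jsp", "com/test.jsp"));   // ? is one character
        System.out.println(globMatches("com/*.jsp", "com/test.jsp"));      // * within a directory
        System.out.println(globMatches("com/*.jsp", "com/sub/test.jsp"));  // * does not cross '/'
        System.out.println(globMatches("com/**.jsp", "com/sub/test.jsp")); // ** crosses '/'
    }
}
```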
