Alternative of Java String.split for better performance

When importing data from a CSV or tab-separated file, my code takes a long time to upload the data. Is there a faster way to do this? This is the code I use to split the fields into an array.
//Here - lineString = fileReader.readLine()
public static String [] splitAndGetFieldNames(String lineString ,String fileType)
{
if(lineString==null || lineString.trim().equals("")){
return null;
}
System.out.print("LINEEEE " + lineString);
String pattern = "(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))";
if(fileType.equals("tab"))
pattern = "\t" + pattern;
else
pattern = "," + pattern;
String fieldNames[] = lineString.split(pattern);
for(int i=0 ; i < fieldNames.length ; i++){
//logger.info("Split Fields::"+fieldNames[i]);
if (fieldNames[i].startsWith("\""))
fieldNames[i] = fieldNames[i].substring(1);
if (fieldNames[i].endsWith("\""))
fieldNames[i] = fieldNames[i].substring(0, fieldNames[i].length()-1);
fieldNames[i] = fieldNames[i].replaceAll("\"\"","\"").trim();
//logger.info("Split Fields after manipulation::"+fieldNames[i]);
}
return fieldNames;
}

Use a CSV parser like super-csv.
Univocity provides a benchmark of CSV parsers. It says that univocity-parsers
is fast, which is no surprise. You could give it a try.

I would recommend taking a look at the opencsv library, or trying CSVParser from Apache Commons CSV.
Anyway, reinventing the wheel is not the best idea. Using a third-party library is less of a headache than writing it yourself :)
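If adding a library is not possible, a single pass over the characters avoids the cost of the lookahead regex in the question, which rescans the rest of the line at every delimiter. The following is only a minimal sketch, not the API of any of the libraries mentioned above; the class and method names are made up for illustration, and it assumes fields never contain line breaks.

```java
import java.util.ArrayList;
import java.util.List;

public class DelimitedLineSplitter {

    /** Splits one line on the delimiter, honoring double-quoted fields and "" escapes. */
    public static List<String> split(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote inside a quoted field
                        i++;                 // skip the second quote of the pair
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // delimiters inside quotes are plain text
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == delimiter) {
                fields.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString().trim()); // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical sample line: a quoted comma and a doubled quote.
        for (String field : split("a,\"b, c\",\"say \"\"hi\"\"\",d", ',')) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Pass '\t' as the delimiter for tab-separated files. Like the code in the question, this trims every field, including quoted ones.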

Related

In java trying to extract XMLNS using a Regexpression

I have been trying for a few hours to get this right, and I really can't seem to do it...
Given a string
"xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\""
what is the correct expression to "save" the http://www.openarchives.org/OAI/2.0/oai-identifier bit?
Thanks in advance, really having trouble getting this right.
String validXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
+ "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
+ "xmlns:mingo-identifier=\"http://www.google.com\" "
+ "xmlns:abeve-identifier=\"http://www.news.ycombinator.org/OAI/2.0/oai-identifier\">"
+ "</feed>";
Pattern p = Pattern.compile(".*\\\"(.*)\\\".*");
Matcher m = p.matcher(validXML);
System.out.println(m.group(1));
This does not print anything. Note that this attempt was only meant to get the string inside the quotes; I was going to worry about the rest once I had that working. Too bad I never got it working. Thanks.

Regular expressions are expensive; don't use them when you don't need to. There are plenty of other ways to parse a string.
String validXml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
+ "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
+ "xmlns:mingo-identifier=\"http://www.google.com\" "
+ "xmlns:abeve-identifier=\"http://www.news.ycombinator.org/OAI/2.0/oai-identifier\">"
+ "</feed>";
String start = "xmlns:oai-identifier=\"";
String end = "\" ";
int location = validXml.indexOf(start);
String result;
if (location >= 0) { // indexOf returns -1 when not found; 0 is a valid position
result = validXml.substring(location + start.length(), validXml.length());
int endIndex = result.indexOf(end);
if (endIndex >= 0) {
result = result.substring(0, endIndex);
}
else {
throw new Exception("Could not find end!");
}
}
else {
throw new Exception("Could not find start!");
}
System.out.println(result);

I think the problem might be that the first .* in your regular expression is too greedy and matches more characters than you'd like.
Try changing ".*\\\"(.*)\\\".*" to "xmlns.*=\"(.*)\".*" and see whether that works.
If it doesn't work at first, you can also try reinstating the quote escaping. Off the top of my head, I don't think the quotes need escaping, but I'm not 100% sure.
Note also that this will only match a single namespace declaration, not each one in the validXML variable in your example. You'll have to split the string in order to use this on an arbitrary number of xmlns:.*= attributes.
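One more point worth checking in the question's code: Matcher.group() throws an IllegalStateException unless matches(), lookingAt() or find() has been called and succeeded first. A minimal sketch that calls find() in a loop and collects every declaration might look like the following. The class name, method name and the exact pattern are illustrative choices, not something taken from the answers above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlnsRegexSketch {

    // Group 1 is the prefix, group 2 the URI. [^"]* cannot run past the closing quote,
    // which avoids the greediness problem of .* in the original pattern.
    private static final Pattern XMLNS = Pattern.compile("xmlns:([\\w.-]+)=\"([^\"]*)\"");

    /** Returns the URI of every xmlns:prefix="..." declaration, in document order. */
    public static List<String> namespaceUris(String xml) {
        List<String> uris = new ArrayList<>();
        Matcher m = XMLNS.matcher(xml);
        while (m.find()) {          // find() must succeed before group() may be called
            uris.add(m.group(2));
        }
        return uris;
    }

    public static void main(String[] args) {
        String validXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
                + "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
                + "xmlns:mingo-identifier=\"http://www.google.com\" "
                + "xmlns:abeve-identifier=\"http://www.news.ycombinator.org/OAI/2.0/oai-identifier\">"
                + "</feed>";
        for (String uri : namespaceUris(validXML)) {
            System.out.println(uri);
        }
    }
}
```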

Since you are reading XML, you might be using DOM, so you can extract the namespace from the prefix name using lookupNamespaceURI() once you parse the document with the setNamespaceAware() option set to true:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(validXML)));
String namespace = doc.lookupNamespaceURI("oai-identifier");
It's simpler and you don't have to do any string parsing.

Replace String in Java with regex and replaceAll

Is there a simple solution to parse a String by using regex in Java?
I have to adapt an HTML page. Therefore I have to parse several strings, e.g.:
href="/browse/PJBUGS-911"
=>
href="PJBUGS-911.html"
The strings differ only in the ID (e.g. 911). My first idea looks like this:
String input = "";
String output = input.replaceAll("href=\"/browse/PJBUGS\\-[0-9]*\"", "href=\"PJBUGS-???.html\"");
I want to replace everything except the ID. How can I do this?
Would be nice if someone can help me :)

You can capture substrings that were matched by your pattern, using parentheses. And then you can use the captured things in the replacement with $n where n is the number of the set of parentheses (counting opening parentheses from left to right). For your example:
String output = input.replaceAll("href=\"/browse/PJBUGS-([0-9]*)\"", "href=\"PJBUGS-$1.html\"");
Or if you want:
String output = input.replaceAll("href=\"/browse/(PJBUGS-[0-9]*)\"", "href=\"$1.html\"");
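As a quick check of the second variant, here is a small self-contained sketch. The class name, method name and sample HTML are made up for illustration, and it uses [0-9]+ rather than [0-9]* so that an empty ID is not matched.

```java
public class HrefRewriteSketch {

    /** Rewrites href="/browse/PJBUGS-N" to href="PJBUGS-N.html" for every occurrence. */
    public static String rewrite(String html) {
        // $1 in the replacement refers to the text captured by the first pair of parentheses.
        return html.replaceAll("href=\"/browse/(PJBUGS-[0-9]+)\"", "href=\"$1.html\"");
    }

    public static void main(String[] args) {
        String input = "<a href=\"/browse/PJBUGS-911\">x</a> <a href=\"/browse/PJBUGS-12\">y</a>";
        System.out.println(rewrite(input));
        // prints <a href="PJBUGS-911.html">x</a> <a href="PJBUGS-12.html">y</a>
    }
}
```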

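This does not use a regexp, but it may still solve your problem, assuming input holds a single href attribute such as href="/browse/PJBUGS-911". It takes the text between the last slash and the closing quote.
output = "href=\"" + input.substring(input.lastIndexOf("/") + 1, input.lastIndexOf("\"")) + ".html\"";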

This is how I would do it:
public static void main(String[] args)
{
String text = "href=\"/browse/PJBUGS-911\" blahblah href=\"/browse/PJBUGS-111\" " +
"blahblah href=\"/browse/PJBUGS-34234\"";
Pattern ptrn = Pattern.compile("href=\"/browse/(PJBUGS-[0-9]+?)\"");
Matcher mtchr = ptrn.matcher(text);
while(mtchr.find())
{
String match = mtchr.group(0);
String insMatch = mtchr.group(1);
String repl = match.replaceFirst(match, "href=\"" + insMatch + ".html\"");
System.out.println("orig = <" + match + "> repl = <" + repl + ">");
}
}
This just shows the regex and replacements, not the final formatted text, which you can get by using Matcher.replaceAll:
String allRepl = mtchr.replaceAll("href=\"$1.html\"");
If you are only interested in replacing all matches, you don't need the loop; I used it just for debugging and to show how the regex works.

using tokenizer to read a line

public void GrabData() throws IOException
{
try {
BufferedReader br = new BufferedReader(new FileReader("data/500.txt"));
String line = "";
int lineCounter = 0;
int TokenCounter = 1;
arrayList = new ArrayList < String > ();
while ((line = br.readLine()) != null) {
//lineCounter++;
StringTokenizer tk = new StringTokenizer(line, ",");
System.out.println(line);
while (tk.hasMoreTokens()) {
arrayList.add(tk.nextToken());
System.out.println("check");
TokenCounter++;
if (TokenCounter > 12) {
er = new DataRecord(arrayList);
DR.add(er);
arrayList.clear();
System.out.println("check2");
TokenCounter = 1;
}
}
}
} catch (FileNotFoundException ex) {
Logger.getLogger(Driver.class.getName()).log(Level.SEVERE, null, ex);
}
}
Hello, I am using a tokenizer to read the contents of a line and store them in an ArrayList. The GrabData method does that job.
The only problem is that the company name (the third column in every line) is in quotes and has a comma in it. I have included one line as an example. The tokenizer relies on the comma to separate the line into tokens, but the comma in the company name throws it off, I guess. If it weren't for that comma, everything would work as normal.
Example:-
Essie,Vaill,"Litronic , Industries",14225 Hancock Dr,Anchorage,Anchorage,AK,99515,907-345-0962,907-345-1215,essie@vaill.com,http://www.essievaill.com
Any ideas?

First of all, StringTokenizer is considered legacy code. From the Javadoc:
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
Using the split() method you get an array of strings. While iterating through the array, you can check whether the current string starts with a quote and, if so, whether the next one ends with a quote. If both conditions hold, you know the split happened inside a quoted field, so you can merge the two pieces (restoring the comma between them), process the result as you like, and continue iterating normally. On that pass you advance with i += 2 instead of the usual i++.
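The split-then-merge idea described above can be sketched roughly as follows. This is an illustration only: the class and method names are invented, it keeps merging until a piece ends with a quote (so a field with more than one comma is also handled), and it does not deal with doubled quotes inside a field.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitAndMergeSketch {

    /** Splits on every comma, then glues back together the pieces of a quoted field. */
    public static List<String> splitAndMerge(String line) {
        String[] parts = line.split(",", -1); // -1 keeps trailing empty fields
        List<String> fields = new ArrayList<>();
        for (int i = 0; i < parts.length; i++) {
            String field = parts[i];
            if (field.startsWith("\"")) {
                // Keep appending pieces (restoring the comma) until the closing quote shows up.
                while (!isClosed(field) && i + 1 < parts.length) {
                    field = field + "," + parts[++i];
                }
                if (isClosed(field)) {
                    field = field.substring(1, field.length() - 1); // drop surrounding quotes
                }
            }
            fields.add(field);
        }
        return fields;
    }

    // A lone quote character is an opening quote, not a closed field.
    private static boolean isClosed(String field) {
        return field.length() > 1 && field.endsWith("\"");
    }

    public static void main(String[] args) {
        String line = "Essie,Vaill,\"Litronic , Industries\",14225 Hancock Dr,Anchorage";
        for (String field : splitAndMerge(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```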

You can accomplish this using Regular Expressions. The following code:
String s = "asd,asdasd,asd, \"asdasdasd,asdasdasd\", asdasd, asd";
System.out.println(s);
s = s.replaceAll("(?<=\")([^\"]+?),([^\"]+?)(?=\")", "$1 $2");
s = s.replaceAll("\"", "");
System.out.println(s);
yields
asd,asdasd,asd, "asdasdasd,asdasdasd", asdasd, asd
asd,asdasd,asd, asdasdasd asdasdasd, asdasd, asd
which, from my understanding, is the preprocessing you require for your tokenizer-code to work. Hope this helps.

While StringTokenizer might not natively handle this for you, a couple lines of code will do it... probably not the most efficient, but should get the idea across...
while(tk.hasMoreTokens()) {
String token = tk.nextToken();
/* If the item is encapsulated in quotes, loop through all tokens to
* find closing quote
*/
if( token.startsWith("\"") ){
while( tk.hasMoreTokens() && ! token.endsWith("\"") ) {
// append our token with the next one. Don't forget to retain commas!
token += "," + tk.nextToken();
}
if( !token.endsWith("\"") ) {
// open quote found but no close quote. Error out.
// (BadFormatException is not a JDK class; define it or substitute your own.)
throw new BadFormatException("Incomplete string:" + token);
}
// remove leading and trailing quotes
token = token.substring(1, token.length()-1);
}
}

As you can see in the class description, Oracle discourages the use of StringTokenizer.
Instead of a tokenizer I would use the String split() method,
which accepts a regular expression as its argument and can significantly reduce your code.
String str = "Essie,Vaill,\"Litronic , Industries\",14225 Hancock Dr,Anchorage,Anchorage,AK,99515,907-345-0962,907-345-1215,essie@vaill.com,http://www.essievaill.com";
String[] strs = str.split("(?<! ),(?! )");
List<String> list = new ArrayList<String>(strs.length);
for(int i = 0; i < strs.length; i++) list.add(strs[i]);
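Just pay attention to your regex: this one splits only on commas that have no space on either side, so it assumes that a comma inside a quoted field is always next to a space and that a separator comma never is.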

Replace every word with tag

JAVASCRIPT or JAVA solution needed
The solution I am looking for could use java or javascript. I have the html code in a string so I could manipulate it before using it with java or afterwards with javascript.
problem
Anyway, I have to wrap each word with a tag. For example:
<html> ... >
Hello every one, cheers
< ... </html>
should be changed to
<html> ... >
<word>Hello</word> <word>every</word> <word>one</word>, <word>cheers</word>
< ... </html>
Why?
This will help me use javascript to select/highlight a word. It seems the only way to do it is to use the function highlightElementAtPoint which I added in the JAVASCRIPT hint: It simply finds the element of a certain x,y coordinate and highlights it. I figured that if every word is an element, it will be doable.
The idea is to use this approach to allow us to detect highlighted text in an android WebView even if that would mean to use a twisted highlighting method. Think a bit more and you will find many other applications for this.
JAVASCRIPT hint
I am using the following code to highlight a word; however, this will highlight the whole text belonging to a certain tag. When each word is a tag, this will work to some extent. If there is a substitute that will allow me to highlight a word at a certain position, it would also be a solution.
function highlightElementAtPoint(xOrdinate, yOrdinate) {
var theElement = document.elementFromPoint(xOrdinate, yOrdinate);
selectedElement = theElement;
theElement.style.backgroundColor = "yellow";
var theName = theElement.nodeName;
var theArray = document.getElementsByTagName(theName);
var theIndex = -1;
for (i = 0; i < theArray.length; i++) {
if (theArray[i] == theElement) {
theIndex = i;
}
}
window.androidselection.selected(theElement.innerHTML);
return theName + " " + theIndex;
}

Try something like
yourStringHere = yourStringHere.replace(" ", "</word> <word>");
yourStringHere = yourStringHere.replace("<html></word>", "<html>"); // remove first closing word-tag
This should work, though you may have to adjust it.

var tags = document.body.innerText.match(/\w+/g);
for(var i=0;i<tags.length;i++){
tags[i] = '<word>' + tags[i] + '</word>';
}
Or as @ThomasK said:
var tags = document.body.innerText;
tags = '<word>' + tags + '</word>';
tags = tags.replace(/\s/g,'</word><word>');
But you have to keep in mind: .replace(" ",foo) only replaces the space once. For multiple replaces you have to use .replace(/\s+/g,foo)
And as @ajax333221 said, the second way will include commas, dots and other symbols, so the first is the better solution.
JSFiddle example: http://jsfiddle.net/c6ftq/4/

inputStr = inputStr.replaceAll("(?<!</?)\\w++(?!\\s*>)","<word>$0</word>");
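To see what this one-liner does, here is a small runnable sketch around it; the class name, method name and sample input are invented for illustration. The lookbehind skips a word that directly follows < or </, and the lookahead skips a word that is followed by >, so tag names are left alone. It appears to assume tags without attributes: an attribute name such as href is neither directly after < nor directly before >, so it would be wrapped as well.

```java
public class WordWrapSketch {

    /** Wraps each run of word characters that is not a tag name in <word>...</word>. */
    public static String wrapWords(String html) {
        // \w++ is possessive, so the engine cannot shorten a tag name to dodge the lookahead.
        return html.replaceAll("(?<!</?)\\w++(?!\\s*>)", "<word>$0</word>");
    }

    public static void main(String[] args) {
        System.out.println(wrapWords("<p>Hello every one, cheers</p>"));
        // prints <p><word>Hello</word> <word>every</word> <word>one</word>, <word>cheers</word></p>
    }
}
```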

You can try the following code:
import java.util.StringTokenizer;
public class myTag
{
static String startWordTag = "<Word>";
static String endWordTag = "</Word>";
static String space = " ";
static String myText = "Hello how are you ";
public static void main ( String args[] )
{
StringTokenizer st = new StringTokenizer (myText," ");
StringBuffer sb = new StringBuffer();
while ( st.hasMoreTokens() )
{
sb.append(startWordTag);
sb.append(st.nextToken());
sb.append(endWordTag);
sb.append(space);
}
System.out.println ( "Result:" + sb.toString() );
}
}

Extracting contents from HTML represented as a String

I have a big HTML document in a String variable and I want to get the contents of a div. I cannot rely on regular expressions because it can have nested divs. So, let's suppose I have the following String -
String test = "<div><div id=\"mainContent\">foo bar<div>good best better</div> <div>test test</div></div><div>foo bar</div></div>";
Then how can I get this with a simple Java program -
<div id="mainContent">foo bar<div>good best better</div> <div>test test</div></div>
Well, my approach is something like this (it might be horrible; I am still fighting to correct it) -
public static void main(String[] args) {
int count = 1;
int fl = 0;
String s = "<div><div id=\"mainContent\">foo bar<div>good best better</div> <div>test test</div></div><div>foo bar</div></div>";
String tmp = s;
int len = s.length();
for (int i=0; i<len; i++){
int st = s.indexOf("div>");
if(st > -1) {
char c = s.charAt(st-1);
if(c == '/') {
count--;
} else {
count++;
}
s = s.substring(st+4);
System.out.println(s);
i = i + st;
System.out.println(c + " -- " + st + " -- " + count + " -- " + i);
if (count == 0) {
fl = i;
break;
}
}
}
System.out.println("final ind - " + fl);
s = tmp.substring(0, fl + 4);
System.out.println("final String - " + s);
}

I would recommend using JSoup to parse the HTML and find what you are looking for.
It fulfills the simple requirement for sure. You can do what you want in just a couple of lines of code!
jsoup is a Java library for working with real-world HTML. It provides
a very convenient API for extracting and manipulating data, using the
best of DOM, CSS, and jquery-like methods.
jsoup implements the WHATWG HTML5 specification, and parses HTML to
the same DOM as modern browsers do.
scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
jsoup is designed to deal with all varieties of HTML found in the
wild; from pristine and validating, to invalid tag-soup; jsoup will
create a sensible parse tree.
Using the selector syntax makes finding and extracting data extremely simple.
public static void main(final String[] args)
{
final String s = "<div><div id=\"mainContent\">foo bar<div>good best better</div> <div>test test</div></div><div>foo bar</div></div>";
final Document d = Jsoup.parse(s);
final Elements e = d.select("#mainContent");
System.out.println(e.get(0));
}
outputs
<div id="mainContent">
foo bar
<div>
good best better
</div>
<div>
test test
</div>
</div>
Doesn't get much more simple than that!

I'm afraid the answer is: You don't. At least not with a "simple" program...
But there is hope: you can use an HTML parser library (like NekoHTML or HTMLParser, although the latter project seems to be dead) to parse the string and retrieve the part you need.
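For completeness, if a library really cannot be used and the input is known to be well-formed, the depth-counting idea from the question can be made to work on the sample string. The sketch below is only that: the class and method names are invented, it assumes the id attribute comes first in the tag exactly as in the example, and it would be fooled by comments, scripts, or any other tag whose name starts with div. A real parser remains the safer choice.

```java
public class NestedDivSketch {

    /** Returns the <div id="..."> element including its nested divs, or null if not found. */
    public static String extractDiv(String html, String id) {
        int start = html.indexOf("<div id=\"" + id + "\"");
        if (start < 0) {
            return null;
        }
        int depth = 0;
        int pos = start;
        while (pos < html.length()) {
            int open = html.indexOf("<div", pos);
            int close = html.indexOf("</div>", pos);
            if (close < 0) {
                return null; // unbalanced input
            }
            if (open >= 0 && open < close) {
                depth++;                      // an opening tag comes first: go one level deeper
                pos = open + "<div".length();
            } else {
                depth--;                      // a closing tag comes first: go one level up
                pos = close + "</div>".length();
                if (depth == 0) {
                    return html.substring(start, pos);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String s = "<div><div id=\"mainContent\">foo bar<div>good best better</div> "
                + "<div>test test</div></div><div>foo bar</div></div>";
        System.out.println(extractDiv(s, "mainContent"));
    }
}
```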
