Deal with PatternSyntaxException and scanning texts - java

I want to find names in a collection of text documents from a huge list of about 1 million names. I'm making a Pattern from the names of the list first:
BufferedReader TSVFile = new BufferedReader(new FileReader("names.tsv"));
String dataRow = TSVFile.readLine();
dataRow = TSVFile.readLine();// skip first line (header)
String combined = "";
while (dataRow != null) {
String[] dataArray = dataRow.split("\t");
String name = dataArray[1];
combined += name.replace("\"", "") + "|";
dataRow = TSVFile.readLine(); // Read next line of data.
}
TSVFile.close();
Pattern all = Pattern.compile(combined);
After doing so I got a PatternSyntaxException, because some names contain a '+' or other regex metacharacters. I first tried to work around this by skipping the few affected names:
if (name.contains("\"")) {
// ignore this name
}
That didn't work properly, and it is also messy: you have to escape everything by hand and rerun the program many times.
Then I tried using the quote method:
Pattern all = Pattern.compile(Pattern.quote(combined));
However, now I don't find any matches in the text documents anymore, even when I also use quote on them. How can I solve this issue?

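I agree with the comment from @dragon66: you should not quote the pipe "|". Pattern.quote(combined) turns the whole string, pipes included, into one literal, so it can only match the entire concatenated list. Quote each name individually with Pattern.quote() instead: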
BufferedReader TSVFile = new BufferedReader(new FileReader("names.tsv"));
String dataRow = TSVFile.readLine();
dataRow = TSVFile.readLine();// skip first line (header)
String combined = "";
while (dataRow != null) {
String[] dataArray = dataRow.split("\t");
String name = dataArray[1];
combined += Pattern.quote(name.replace("\"", "")) + "|"; //line changed
dataRow = TSVFile.readLine(); // Read next line of data.
}
TSVFile.close();
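// drop the trailing "|", which would otherwise add an empty alternative that matches everywhere
if (combined.endsWith("|")) combined = combined.substring(0, combined.length() - 1);
Pattern all = Pattern.compile(combined);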
I also suggest checking whether your problem calls for optimization: replacing String combined = ""; with a mutable StringBuilder avoids creating a new String on every loop iteration.
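As a minimal sketch of both suggestions together (quoting each name on its own and avoiding repeated String concatenation), the pattern could be assembled with a StringJoiner. The class name QuotedAlternation, the method buildPattern and the sample names are assumptions made for illustration, not part of the original code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedAlternation {
    // Builds one pattern that matches any of the given names literally.
    // StringJoiner places "|" only between entries, so there is no trailing pipe.
    static Pattern buildPattern(List<String> names) {
        StringJoiner joiner = new StringJoiner("|");
        for (String name : names) {
            joiner.add(Pattern.quote(name)); // quote each name, never the separator
        }
        return Pattern.compile(joiner.toString());
    }

    public static void main(String[] args) {
        // hypothetical sample names containing regex metacharacters
        Pattern all = buildPattern(Arrays.asList("C++", "A.B", "Smith"));
        Matcher m = all.matcher("Smith wrote C++ code with AxB");
        while (m.find()) {
            System.out.println(m.group()); // prints "Smith", then "C++"; "AxB" is not a match
        }
    }
}
```

Note that an empty list of names would produce an empty pattern, which matches at every position, so that case may need separate handling.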

guilhermerama presented the bugfix to your code.
I will add some performance improvements. As I pointed out, Java's regex library does not scale well and is even slower when used for searching.
But one can do better with multi-string search algorithms, for example by using StringsAndChars String Search:
//setting up a test file
Iterable<String> lines = createLines();
Files.write(Paths.get("names.tsv"), lines , CREATE, WRITE, TRUNCATE_EXISTING);
// read the pattern from the file
BufferedReader TSVFile = new BufferedReader(new FileReader("names.tsv"));
Set<String> combined = new LinkedHashSet<>();
String dataRow = TSVFile.readLine();
dataRow = TSVFile.readLine();// skip first line (header)
while (dataRow != null) {
String[] dataArray = dataRow.split("\t");
String name = dataArray[1];
combined.add(name);
dataRow = TSVFile.readLine(); // Read next line of data.
}
TSVFile.close();
// search the pattern in a small text
StringSearchAlgorithm stringSearch = new AhoCorasick(new ArrayList<>(combined));
StringFinder finder = stringSearch.createFinder(new StringCharProvider("test " + name(38) + "\n or " + name(799) + " : " + name(99999), 0));
System.out.println(finder.findAll());
The result will be
[5:10(00038), 15:20(00799), 23:28(99999)]
The search (finder.findAll()) takes less than 1 millisecond on my computer. Doing the same with java.util.regex took around 20 milliseconds.
You may tune this performance by using other algorithms provided by RexLex.
The setup needs the following code:
private static Iterable<String> createLines() {
List<String> list = new ArrayList<>();
for (int i = 0; i < 100000; i++) {
list.add(i + "\t" + name(i));
}
return list;
}
private static String name(int i) {
String s = String.valueOf(i);
while (s.length() < 5) {
s = '0' + s;
}
return s;
}

Related

Replace quotes in String

I have to replace all the commas that are between double quotes with a dot.
I'm trying to do that with Java's replace and replaceAll methods, but I still haven't worked out a solution.
Can someone help me?
EDIT:
I have to manually parse a CSV file into objects. So I'm trying to split each input line, but one number has a comma inside, so the split gives me more fields than I need.
Example: I have to split this string.
"""LASER MEDIA SOCIETA' COOPERATIVA""",CNF146010,FM (S),PIAZZA UMBERTO I - PISTICCI,MT,40N2323,16E3328,383,,"99,1",CITY RADIO,"H: --V: 32 dBW",0.0
Notice that I have "99,1" and that the ,, before it are causing me trouble.
Scanner var = new Scanner(new BufferedReader(new FileReader ("t1.csv")));
ArrayList<Catasto> obj = new ArrayList();
String data = var.nextLine();
String data2 = null;
String full = null;
int j = 0;
while (var.hasNextLine()) {
data = var.nextLine();
data2 = var.nextLine();
full = data + data2;
//full = full.replaceAll("\"*[,]*\"", "."); attempt 1
System.out.println(full);
ArrayList<String> parts = new ArrayList();
String[] parti = full.split(",");
//for (int i = 0; i<parti.length; i++) { this is because I'm trying to change empty string with a null
//if (parti[i] == " ") in order to solve this error: java.lang.NumberFormatException: For input string: ""
// parti[i] = null;
//}
for (int i = 0; i<12; i++) {
parts.add(parti[i]);
}
Catasto foo = new Catasto(parts);
obj.add(foo);
}
var.close();
EDIT 2:
I have solved the problem of the comma between the double quotes, but I don't know why I get the error: java.lang.NumberFormatException: For input string: ""
You're going to struggle to do this with a single replaceAll or replace, as you need to identify pairs of quotes. Your best bet is to match each quoted section and then use replaceAll on the matched group to change the commas to full stops.
String input = "\"One,Two,There\",\"Four,Five,Six\"";
Matcher m = Pattern.compile("\"[^\"]*\"").matcher(input);
StringBuffer sb = new StringBuffer();
while(m.find()) {
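// quoteReplacement keeps any '$' or '\' in the matched text from being read as a group reference
m.appendReplacement(sb, Matcher.quoteReplacement(m.group().replaceAll(",", ".")));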
}
m.appendTail(sb);
String output = sb.toString(); // "One.Two.There","Four.Five.Six"
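The NumberFormatException for input "" mentioned in EDIT 2 is what Integer.parseInt and Double.parseDouble throw for an empty field, such as the one between the two adjacent commas in the sample line. The following is only a sketch of one way to handle both issues without rewriting the line first; the class name QuotedCsvSplit and the helpers splitCsvLine and parseOrNull are hypothetical, and the regex assumes quotes are balanced on each line:

```java
public class QuotedCsvSplit {
    // Splits on commas that are followed by an even number of quotes,
    // i.e. commas that are not inside a quoted field.
    // The limit -1 keeps empty fields, including trailing ones.
    static String[] splitCsvLine(String line) {
        return line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
    }

    // Returns null for an empty field instead of throwing NumberFormatException.
    // Quotes are stripped and a decimal comma is turned into a dot before parsing.
    static Double parseOrNull(String field) {
        String cleaned = field.replace("\"", "").replace(',', '.').trim();
        return cleaned.isEmpty() ? null : Double.valueOf(cleaned);
    }

    public static void main(String[] args) {
        String[] fields = splitCsvLine("383,,\"99,1\",CITY RADIO");
        System.out.println(fields.length);          // 4
        System.out.println(parseOrNull(fields[1])); // null
        System.out.println(parseOrNull(fields[2])); // 99.1
    }
}
```

For real-world CSV with multi-line fields or unbalanced quotes, a dedicated CSV parser is likely the safer choice.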

Java using \034 as delimiter in a string

I am trying to use the '\034' field separator character as a delimiter in a string.
The issue is that when I hardcode "\034"+opField and write it to a file it works, but if the "\034" character is read from a file, it writes the output as the string "col1\034col2".
I tried using StringBuilder but it escapes the \034 to "\\034".
I am using the following code to read the character from the file:
try (BufferedReader br = new BufferedReader(new FileReader(fConfig))){
int lc = 1;
for(String line;(line = br.readLine())!=null;){
String[] rowList = line.split(delim);
int row_len = rowList.length;
if (row_len<2){
System.out.println("Incorrect dictionary file row:"+fConfig.getAbsolutePath()+"\nNot enough values found at row:"+line);
}else{
String key = rowList[0];
String value = rowList[1];
dictKV.put(key, value);
}
lc++;
}
}catch(Exception e){
throw e;
}
Any help is welcome...
[update]: The same thing is happening with the '\t' character: if hardcoded it works fine, but if read from a file it gets appended as literal characters, "col0\tcol1".
if(colAl.toLowerCase().contains(" as ")){
String temp = colAl.replaceAll("[ ]+as[ ]+"," | ");
ArrayList<String> tempA = this.brittle_delim(temp,'|');
colAl = tempA.get(tempA.size()-1);
colAl = colAl.trim();
}else {
ArrayList<String> tempA = this.brittle_delim(colAl,' ');
colAl = tempA.get(tempA.size()-1);
colAl = colAl.trim();
}
if(i==0){
sb.append(colAl);
headerCols+=colAl.trim();
}else{
headerCols+= this.output_field_delim + colAl;
sb.append(this.output_field_delim);
sb.append(colAl);
}
}
}
System.out.println("SB Header Cols:"+sb.toString());
System.out.println("Header Cols:"+headerCols);
Output:
SB Header Cols:
SPRN_CO_ID\034FISC_YR_MTH_DSPLY_CD\034CST_OBJ_CD\034PRFT_CTR_CD\034LEGL_CO_CD\034HEAD_CT_TYPE_ID\034FIN_OWN_CD\034FUNC_AREA_CD\034HEAD_CT_NR
Header Cols:
SPRN_CO_ID\034FISC_YR_MTH_DSPLY_CD\034CST_OBJ_CD\034PRFT_CTR_CD\034LEGL_CO_CD\034HEAD_CT_TYPE_ID\034FIN_OWN_CD\034FUNC_AREA_CD\034HEAD_CT_NR
In the above code if I do the following I am getting correct results:
headerCols+= "\034"+ colAl;
output:
SPRN_CO_IDFISC_YR_MTH_DSPLY_CDCST_OBJ_CDPRFT_CTR_CDLEGL_CO_CDHEAD_CT_TYPE_IDFIN_OWN_CDFUNC_AREA_CDHEAD_CT_NR
The FS characters are there, even if they are getting removed here.
You should provide an example demonstrating your problem, not just incomplete code snippets.
The following runnable snippet does what you described.
// create a file one line
byte[] bytes = "foo bar".getBytes(StandardCharsets.ISO_8859_1);
String fileName = "/tmp/foobar";
Files.write(Paths.get(fileName), bytes);
String headerCols = "";
String outputFieldDelim = "\034";
try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
// read the line from the file and split by blank character
String[] cols = br.readLine().split(" ");
// concatenate the values with "\034"
// but ... for your code ...
// don't concatenate String objects in a loop like below
// use a StringBuilder or StringJoiner instead
headerCols += outputFieldDelim + cols[0];
headerCols += outputFieldDelim + cols[1];
}
// output with the "\034" character
System.out.println(headerCols);
I guess this is where I found my solution and the actual words for my Question.
How to unescape string literals in java
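The output above suggests that the value read from the file consists of the four characters backslash, 0, 3, 4 rather than the single FS control character: the Java compiler only translates escape sequences inside string literals, never in data read at run time. A minimal sketch of unescaping such a configured delimiter by hand follows; the class DelimiterUnescape and its unescape method are hypothetical, and only octal escapes of up to three digits plus \t, \n and \\ are handled:

```java
public class DelimiterUnescape {
    // Converts textual escapes such as \034 or \t into the characters they stand for.
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c != '\\' || i + 1 == raw.length()) {
                out.append(c); // ordinary character, or a lone trailing backslash
                i++;
                continue;
            }
            char next = raw.charAt(i + 1);
            if (next >= '0' && next <= '7') {
                // read up to three octal digits after the backslash
                int end = i + 1;
                int value = 0;
                while (end < raw.length() && end < i + 4
                        && raw.charAt(end) >= '0' && raw.charAt(end) <= '7') {
                    value = value * 8 + (raw.charAt(end) - '0');
                    end++;
                }
                out.append((char) value);
                i = end;
            } else if (next == 't') {
                out.append('\t');
                i += 2;
            } else if (next == 'n') {
                out.append('\n');
                i += 2;
            } else if (next == '\\') {
                out.append('\\');
                i += 2;
            } else {
                out.append(c); // unknown escape: keep the backslash as it is
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fromFile = "col1\\034col2"; // what readLine() returns for the text col1\034col2
        String fixed = unescape(fromFile);
        System.out.println(fixed.indexOf((char) 28)); // 4: the FS character now sits at index 4
    }
}
```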

How can i read the same file two times in Java?

I want to count the lines of the file, and in a second pass I want to take every single line and manipulate it. There is no compilation error, but execution never enters the second while ((line = br.readLine()) != null) loop.
Is there a different way to get the lines (movies) of the file and store them in an array?
BufferedReader br = null;
try { // try to read the file
br = new BufferedReader(new FileReader("movies.txt"));
String line;
int numberOfMovies = 0;
while ((line = br.readLine()) != null) {
numberOfMovies++;
}
Movie[] movies = new Movie[numberOfMovies]; // store in a Movie
// array every movie of
// the file
String title = "";
int id = 0;
int likes = 0;
int icounter = 0; // count to create new movie for each line
while ((line = br.readLine()) != null) {
line = line.trim();
line = line.replaceAll("/t", "");
line = line.toLowerCase();
String[] tokens = line.split(" "); // store every token in a
// string array
id = Integer.parseInt(tokens[0]);
likes = Integer.parseInt(tokens[tokens.length]);
for (int i = 1; i < tokens.length; i++) {
title = title + " " + tokens[i];
}
movies[icounter] = new Movie(id, title, likes);
icounter++;
}
} catch (IOException e) {
e.printStackTrace();
}
Simplest way would be to reset br again.
try { // try to read the file
br = new BufferedReader(new FileReader("movies.txt"));
String line;
int numberOfMovies = 0;
while (br.readLine() != null) { // first pass: only count the lines
numberOfMovies++;
}
br.close(); // this reader is exhausted
Movie[] movies = new Movie[numberOfMovies]; // one slot for every movie of the file
int icounter = 0; // index of the next movie to create
br = new BufferedReader(new FileReader("movies.txt")); // fresh reader, starts at the first line again
while ((line = br.readLine()) != null) {
line = line.trim();
line = line.replaceAll("/t", "");
line = line.toLowerCase();
String[] tokens = line.split(" "); // store every token in a string array
int id = Integer.parseInt(tokens[0]);
int likes = Integer.parseInt(tokens[tokens.length - 1]); // last token; tokens[tokens.length] is out of bounds
String title = ""; // reset for every movie
for (int i = 1; i < tokens.length - 1; i++) {
title = title + " " + tokens[i];
}
movies[icounter] = new Movie(id, title, likes);
icounter++;
}
br.close();
} catch (IOException e) { e.printStackTrace(); }
The loops keep the br.readLine() != null idiom: BufferedReader has no hasNextLine() method (that one belongs to Scanner), and readLine() returning null is how it signals the end of the file.
There are two things here:
InputStreams and Readers are one-shot structures: once you've read them to the end, you either need to explicitly rewind them (if they support rewinding), or you need to close them (always close your streams and readers!) and open a new one.
However in this case the two passes are completely unnecessary, just use a dynamically growing structure to collect your Movie objects instead of arrays: an ArrayList for example.
Firstly, there is no need to read the file twice.
Secondly, why don't you use the java.nio.file.Files class to read your file?
It has a method readAllLines(Path path, Charset cs) that gives you back a List<String>.
Then if you want to know how many lines just call the size() method on the list and you can use the list to construct the Movie objects.
List<Movie> movieList = new ArrayList<>();
for (String line : Files.readAllLines(Paths.get("movies.txt"), Charset.defaultCharset())) {
// Construct your Movie object from each individual line and add to the list of Movies
movieList.add(new Movie(id, title, likes));
}
Using the Files class also reduces your boilerplate code: it closes the resource once it has finished reading, so you do not need a finally block to close anything.
If you use the same Reader, everything is already read once you reach the second loop.
Close the first Reader, then create another one to read a second time.
You are running through the file with the BufferedReader until readLine() returns null. The BufferedReader itself is not null, but it has reached the end of the file, so the second while ((line = br.readLine()) != null) is never entered: the first line it reads is already null.
Try getting a new BufferedReader, something like this:
...
int id = 0;
int likes = 0;
int icounter = 0;
br = new BufferedReader(new FileReader("movies.txt")); // re-initialize br to point
// at the first line again
while ((line = br.readLine()) != null)
...
EDIT:
Close the first reader before creating the new one.
This is a combination of a couple of other answers already on this post, but this is how I would rewrite your code to populate a List. It solves two problems at once: 1) it removes the need to read the file twice, and 2) it removes the boilerplate around BufferedReader, using Java 8 streams to make initializing your List as concise as possible:
private static class Movie {
private Movie(int id, String title, int likes) {
//TODO: set your instance state here
}
}
private static Movie movieFromFileLine(String line) {
line = line.trim();
line = line.replaceAll("/t", "");
line = line.toLowerCase();
String[] tokens = line.split(" "); // store every token in a string array
String title = "";
int id = Integer.parseInt(tokens[0]);
int likes = Integer.parseInt(tokens[tokens.length - 1]); // last token; tokens[tokens.length] would be out of bounds
for (int i = 1; i < tokens.length - 1; i++) { // the tokens between id and likes form the title
title = title + " " + tokens[i];
}
return new Movie(id, title, likes);
}
public static void main(String[] args) throws IOException {
// App is the class that encloses movieFromFileLine
List<Movie> movies = Files.readAllLines(Paths.get("movies.txt"), Charset.defaultCharset()).stream()
.map(App::movieFromFileLine).collect(Collectors.toList());
//TODO: Make some magic with your list of Movies
}
For cases where you absolutely need to read a source (file, URL, or other) twice, then you need to be aware that it is quite possible for the contents to change between the first and second readings and be prepared to handle those differences.
If you can reasonably assume that the content of the source will fit into memory, and your code expects to work on multiple instances of Readers/InputStreams, you may first consider using an appropriate IOUtils.copy method from commons-io to read the contents of the source and copy it to a ByteArrayOutputStream, creating a byte[] that can be re-read over and over again.
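A minimal sketch of that idea, using the JDK's Files.readAllBytes in place of the commons-io IOUtils.copy call mentioned above (a deliberate substitution to keep the example free of external dependencies). The class name RereadFromMemory, the helper countLines and the temporary sample file are assumptions for illustration:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class RereadFromMemory {
    // Counts the lines seen by a fresh reader over the buffered bytes.
    // Every call starts from the beginning, because a new stream is created each time.
    static int countLines(byte[] content) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(content), StandardCharsets.UTF_8))) {
            int count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("movies", ".txt"); // hypothetical sample file
        Files.write(file, "1 alien 10\n2 brazil 7\n".getBytes(StandardCharsets.UTF_8));
        byte[] content = Files.readAllBytes(file); // the only read of the real source
        System.out.println(countLines(content));   // first pass: 2
        System.out.println(countLines(content));   // second pass sees the same snapshot: 2
        Files.delete(file);
    }
}
```

Because both passes work on the same in-memory snapshot, a change to the file between the passes cannot make them disagree.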

How Do I Split A String By Line Break? [duplicate]

This question already has answers here:
Split Java String by New Line
(21 answers)
Closed 6 years ago.
I'm a noob to Android development and I am trying to split a string at each of its line breaks. The string I'm trying to split is pulled from a database query and is constructed like this:
public String getCoin() {
// TODO Auto-generated method stub
String[] columns = new String[]{ KEY_ROWID, KEY_NAME, KEY_QUANTITY, KEY_OUNCES, KEY_VALUE };
Cursor c = ourDatabase.query(DATABASE_TABLE, columns, null, null, null, null, null);
String result = "";
int iRow = c.getColumnIndex(KEY_ROWID);
int iName = c.getColumnIndex(KEY_NAME);
int iQuantity = c.getColumnIndex(KEY_QUANTITY);
int iOunces = c.getColumnIndex(KEY_OUNCES);
int iValue = c.getColumnIndex(KEY_VALUE);
for (c.moveToFirst(); !c.isAfterLast(); c.moveToNext()){
result = result + /*c.getString(iRow) + " " +*/ c.getString(iName).substring(0, Math.min(18, c.getString(iName).length())) + "\n";
}
c.close();
return result;
}
The string returned by getCoin reads like this:
alpha
bravo
charlie
I want to split the string at the line break and place each substring into a String Array. This is my current code:
String[] separated = result.split("\n");
for (int i = 0; i < separated.length; i++) {
chartnames.add("$." + separated[i] + " some text" );
}
This gives me an output of:
"$.alpha
bravo
charlie some text"
instead of my desired output of:
"$.alpha some text, $.bravo some text, $.charlie some text"
Any help is greatly appreciated
You can split a string at line breaks by using the following statement:
String textStr[] = yourString.split("\\r\\n|\\n|\\r");
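On Java 8 or later, the regex linebreak matcher \R can stand in for the explicit alternation. A small sketch, with the class name SplitOnAnyLineBreak and the sample string chosen purely for illustration:

```java
import java.util.Arrays;

public class SplitOnAnyLineBreak {
    static String[] splitLines(String text) {
        // \R matches \r\n as a single unit, as well as a lone \n or \r
        // (and a few other Unicode line terminators)
        return text.split("\\R");
    }

    public static void main(String[] args) {
        String result = "alpha\r\nbravo\ncharlie\rdelta";
        System.out.println(Arrays.toString(splitLines(result)));
        // prints [alpha, bravo, charlie, delta]
    }
}
```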
It's a little overkill, but you can use the standard I/O classes:
BufferedReader rdr = new BufferedReader(new StringReader(result));
List<String> lines = new ArrayList<String>();
for (String line = rdr.readLine(); line != null; line = rdr.readLine()) {
lines.add(line);
}
rdr.close(); // good form to close streams, but unnecessary for StringReader
// lines now contains all the strings between line breaks of any type
The advantage of this is that BufferedReader.readLine() has all the logic worked out for detecting all sorts of line terminators.
As of Java 8, BufferedReader has a lines() method, so there's an easier way (thanks, @jaco0646):
List<String> lines = new BufferedReader(new StringReader(result))
.lines()
.collect(Collectors.toList());
or, if an array is needed instead:
String[] lines = new BufferedReader(new StringReader(result))
.lines()
.toArray(String[]::new);
You can use the Apache Commons helper class StringUtils.
The platform independent way:
String[] lines = StringUtils.split(string, "\r\n");
The platform-dependent way. It may be a few CPU cycles faster, but I wouldn't expect it to matter.
String[] lines = StringUtils.split(string, System.lineSeparator());
If possible, I would suggest using the Guava Splitter and Joiner classes in preference to String.split. Even then, it's important to make sure you escape your regular expressions properly when declaring them. For line feeds either form works: with "\n" the Java compiler puts a literal line-feed character into the pattern, and with "\\n" the regex engine interprets the escape itself; both match a line feed.
Covering all possible line endings is tricky, since multiple consecutive EOL markers can mess up your matching. I would suggest
String [] separated = result.replaceAll("\\r", "").split("\\n");
Matcher m = Pattern.compile("(^.+$)+", Pattern.MULTILINE).matcher(fieldContents);
while (m.find()) {
System.out.println("whole group " + m.group());
}
I propose the following snippet, which is compatible with both PC and Mac line-ending styles.
// the replacement must be "\n": in a replacement string, "\\n" would insert the letter n
String[] separated = result.replaceAll("\\r", "\n")
.replaceAll("\\n{2,}", "\n")
.split("\\n");

How to separate a CSV file using commas if it has null values and we need data corresponding to the upper column?

My code is
BufferedReader fileReader = new BufferedReader(new FileReader(strFileName));
while ((strLine = fileReader.readLine()) != null) {
// [2011.06.28] based on OAP's mail, now ignore "quotations"
strLine = strLine.replace("\"", "");
int j = 0;
String componentID = null;
String analyteResult = null;
lineNumber++;
//break comma separated line using ","
st = new StringTokenizer(strLine, ",");
while (st.hasMoreTokens()) {
//st.nextToken();
String value=st.nextToken();
if(tokenNumber==1)
{
if(lineNumber==1){
batchNo=value;
}
else if(lineNumber==2){
instrumentNo=value;
}
}
else if(lineNumber==4)
{
if(value.contains("_")){
String temp[] = value.split("_");
analyteCode = temp[0].trim();
a[j] = tokenNumber;
j++;
}
}
else if(lineNumber>4)
{
if(tokenNumber==0){
componentID=value;
}
}
tokenNumber++;
//System.out.println("Line Number : "+lineNumber+" Token Number: "+tokenNumber+"Value: "+st.nextToken());
}
//reset token number
tokenNumber = 0;
}
But I need the analyte result corresponding to the previous column's analyte code. A result can also be null (an empty field), and in that case the consecutive commas vanish, so the result is no longer parsed against the analyte code of the matching column.
Mudit,
I am not sure whether you are prevented from using open source libraries; if not, I strongly recommend one of the following widely used open source CSV parsers. They will save you a lot of effort in terms of coding and performance. Please check them out for your specific need.
Apache Commons CSV parser.
OpenCSV parser
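If a library is not an option, the vanishing commas in the question can also be traced to StringTokenizer itself, which treats a run of delimiters as one and never returns empty tokens. String.split with a negative limit keeps empty fields in place, so column positions stay aligned. A small sketch comparing the two; the class name EmptyCsvFields and the sample line are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class EmptyCsvFields {
    // StringTokenizer skips empty fields, so later values shift to the left.
    static List<String> withTokenizer(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // split with limit -1 keeps every field, including empty and trailing ones.
    static List<String> withSplit(String line) {
        return Arrays.asList(line.split(",", -1));
    }

    public static void main(String[] args) {
        String line = "S1,0.5,,1.2,"; // third and fifth fields are empty
        System.out.println(withTokenizer(line).size()); // 3
        System.out.println(withSplit(line).size());     // 5
    }
}
```

This plain split does not understand quoted fields that contain commas, which is one more reason the parsers listed above are worth a look.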
