Hey guys, I have written this code for searching for a string in a .txt file.
Is it possible to optimize the code so that it searches for the string as fast as possible?
Assume the text file is large (500 MB - 1 GB).
I don't want to use pattern matchers.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

public class StringFinder {
    public static void main(String[] args) {
        int count = 0, countBuffer = 0, countLine = 0;
        String lineNumber = "";
        String filePath = "C:\\Users\\allen\\Desktop\\TestText.txt";
        BufferedReader br;
        String inputSearch = "are";
        String line = "";

        try {
            br = new BufferedReader(new FileReader(filePath));
            try {
                while ((line = br.readLine()) != null) {
                    countLine++;
                    String[] words = line.split(" ");
                    for (String word : words) {
                        if (word.equals(inputSearch)) {
                            count++;
                            countBuffer++;
                        }
                    }
                    if (countBuffer > 0) {
                        countBuffer = 0;
                        lineNumber += countLine + ",";
                    }
                }
                br.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        System.out.println("Times found at--" + count);
        System.out.println("Word found at--" + lineNumber);
    }
}
There are fast string search algorithms, but a big part of the time will go into reading the file from external storage. If you can index the file ahead of time, you can save reading and scanning the entire file. If you can't, perhaps you can at least avoid reading the file from external storage, e.g. if the file came in from the network, then search it before or instead of writing it to storage.
Try Matcher.find; splitting is slow since it creates a lot of objects.
If you don't want to use Matcher.find for some reason, then at least use indexOf.
You can check the whole line without breaking it up into a lot of String objects that then need iterating over:
int index = line.indexOf(inputSearch);
while (index != -1) {
    count++;
    countBuffer++;
    index = line.indexOf(inputSearch, index + 1);
}
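One caveat: unlike your split-based version, indexOf matches substrings, so searching for "are" also counts it inside "bare" or "cares". If you need whole-word matches only, here is a sketch of a boundary check, reusing the variables above:
int index = line.indexOf(inputSearch);
while (index != -1) {
    // Count the hit only if it is not embedded in a longer word.
    boolean startOk = index == 0 || !Character.isLetter(line.charAt(index - 1));
    int end = index + inputSearch.length();
    boolean endOk = end == line.length() || !Character.isLetter(line.charAt(end));
    if (startOk && endOk) {
        count++;
        countBuffer++;
    }
    index = line.indexOf(inputSearch, index + 1);
}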
For a plain string, i.e., not a regex, and if you can't index the file first using some sophisticated engine (Lucene or Solr come to mind for such a large file) or database (?), you should check out the Rabin-Karp algorithm. It's a very clever algorithm that finds a simple string match in O(n+m) where n is the length of the text and m the length of the search string.
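For illustration, here is a minimal Rabin-Karp sketch for counting occurrences of a plain string in a piece of text. The BASE and MOD constants are arbitrary choices of mine, not part of the algorithm's definition, and hash collisions are resolved by the explicit regionMatches check:
static int countOccurrences(String text, String pattern) {
    int n = text.length(), m = pattern.length();
    if (m == 0 || n < m) return 0;
    final long BASE = 256, MOD = 1_000_000_007L;
    long patHash = 0, winHash = 0, pow = 1; // pow = BASE^(m-1) % MOD
    for (int i = 0; i < m - 1; i++) pow = (pow * BASE) % MOD;
    for (int i = 0; i < m; i++) {
        patHash = (patHash * BASE + pattern.charAt(i)) % MOD;
        winHash = (winHash * BASE + text.charAt(i)) % MOD;
    }
    int count = 0;
    for (int i = 0; ; i++) {
        // Only verify character-by-character when the rolling hashes agree.
        if (patHash == winHash && text.regionMatches(i, pattern, 0, m)) count++;
        if (i + m >= n) break;
        // Roll the window: drop text[i], shift, add text[i + m].
        winHash = (winHash - text.charAt(i) * pow % MOD + MOD) % MOD;
        winHash = (winHash * BASE + text.charAt(i + m)) % MOD;
    }
    return count;
}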
Your bottleneck may not be the time it takes to parse each line at all, but reading the actual file. Disk IO is at least an order of magnitude slower than iterating thru a char array. But you really won't know until you profile your code. Fire up VisualVM and use it to figure out where you are spending the most time. If you don't, you're just guessing.
I am writing an external sort to sort a large 2 GB file on disk.
I first split the file into chunks that fit into memory, sort each one individually, and rewrite them back to disk. However, during this process I am getting a GC overhead limit exceeded exception in String.split, in the method getModel. Below is my code.
private static List<Model> getModel(String file, long lineCount, final long readSize) {
    List<Model> modelList = new ArrayList<Model>();
    long read = 0L;
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
        // Skip lineCount lines
        for (long i = 0; i < lineCount; i++)
            br.readLine();
        String line = "";
        while ((line = br.readLine()) != null) {
            read += line.length();
            if (read > readSize)
                break;
            String[] split = line.split("\t");
            String curvature = (split.length >= 7) ? split[6] : "";
            String heading = (split.length >= 8) ? split[7] : "";
            String slope = (split.length >= 9) ? split[8] : "";
            modelList.add(new Model(split[0], split[1], split[2], split[3], split[4], split[5], curvature, heading, slope));
        }
        return modelList;
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
private static void split(String inputDir, String inputFile, String outputDir, final long readSize) throws IOException {
    long lineCount = 0L;
    int count = 0;
    int writeSize = 100000;
    System.out.println("Reading...");
    List<Model> curModel = getModel(inputDir + inputFile, lineCount, readSize);
    System.out.println("Reading Complete");
    while (curModel.size() > 0) {
        lineCount += curModel.size();
        System.out.println("Sorting...");
        curModel.sort(new Comparator<Model>() {
            @Override
            public int compare(Model arg0, Model arg1) {
                return arg0.compareTo(arg1);
            }
        });
        System.out.println("Sorting Complete");
        System.out.println("Writing...");
        writeFile(curModel, outputDir + inputFile + count, writeSize);
        System.out.println("Writing Complete");
        count++;
        System.out.println("Reading...");
        curModel = getModel(inputDir + inputFile, lineCount, readSize);
        System.out.println("Reading Complete");
    }
}
It makes it through one pass and sorts ~250 MB of data from the file. However, on the second pass it throws a GC overhead limit exceeded exception in String.split. I do not want to use external libraries; I want to learn this on my own. The sorting and splitting work, but I cannot understand why the GC is throwing the overhead exception in String.split.
I'm not sure just what is causing the exception; manipulating large strings, in particular cutting and splicing them, is a huge memory/GC issue. StringBuilder can help, but in general you may have to take more direct control over the process.
To figure out more you probably want to run a profiler with your app. There is one built into the JDK (VisualVM) that is functional. It will show you what objects Java is holding on to... because of the nature of strings it's possible that you are holding onto a lot of redundant character array data.
Personally, I'd try a completely different approach. For instance, what if you sorted the entire file in memory by loading the first 10(?) sortable characters of each line into an array along with the file location they were read from, sorted that array, and resolved any ties by loading more (the rest?) of those lines that were identical?
If you did something like that then you should be able to seek to each line and copy it to the destination file without ever caching more than one line in memory and only reading through the source file twice.
I suppose you could manufacture a file that would defeat this if all the strings were identical until the last couple of characters, so if that ever became an issue you might have to be able to flush the full strings you've cached (Java's reference objects, e.g. SoftReference, are made to do this for you automatically; it's not particularly hard).
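To make the idea concrete, here is a minimal sketch, under loud assumptions: a single-byte encoding (ASCII), '\n'-only line endings, and ties on the 10-character prefix left in file order rather than resolved by re-reading. The class and its names are mine, not from the question:
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class PrefixOffsetSort {
    static final int KEY_LEN = 10;

    static class Entry {
        final String key;  // first KEY_LEN chars of the line
        final long offset; // byte offset of the line start in the source file
        Entry(String key, long offset) { this.key = key; this.offset = offset; }
    }

    public static void sort(File src, File dest) throws IOException {
        List<Entry> index = new ArrayList<>();
        // Pass 1: record a short key and the offset of every line.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream(src), StandardCharsets.US_ASCII))) {
            long offset = 0;
            String line;
            while ((line = br.readLine()) != null) {
                String key = line.length() > KEY_LEN ? line.substring(0, KEY_LEN) : line;
                index.add(new Entry(key, offset));
                offset += line.length() + 1; // +1 for the assumed '\n'
            }
        }
        index.sort(Comparator.comparing(e -> e.key)); // stable sort keeps file order on ties

        // Pass 2: seek to each line in sorted order and copy it to the destination.
        try (RandomAccessFile raf = new RandomAccessFile(src, "r");
             BufferedWriter bw = new BufferedWriter(new FileWriter(dest))) {
            for (Entry e : index) {
                raf.seek(e.offset);
                bw.write(raf.readLine()); // RandomAccessFile.readLine is byte-oriented
                bw.newLine();
            }
        }
    }
}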
Based on how I read your implementation, readSize only makes sure that you get the first block of size X. You are not reading the 2nd or 3rd block, hence it is not actually a complete external sort.
read += line.length();
if (read > readSize)
break;
String[] split = line.split("\t");
Even though you are splitting each line, you seem to be using only the first nine fields, and then checking the number of fields in each line; this means your data is not uniform.
I am fairly new to Java, and I am trying to make it read a file.
I have a data text file that needs to be imported into Java as arrays.
The first row of the data file is the names row. All the variables have their names in the first row, and the data are clustered in the columns. I just want to import this into Java and be able to export all the variables as I want, just as if they were vectors in MATLAB. So basically we acquire the data and tag each vector. I need the code to be as generic as possible, so it should read a variable number of columns and rows. I was able to create the array using an inefficient method, I believe. Now I need to divide the array into multiple arrays and then convert them to numbers, but I need to group the numbers according to the vector they belong to.
The text file is created from an Excel spreadsheet, so it basically has columns for different measurements, which will create the vectors. Each column is another vector which contains the data in the rows.
I searched a lot of code trying to implement this, but it came to a point where I cannot proceed without help. Can someone tell me how to proceed? Maybe even improve the reading part, because I know this is not the best way to do it in Java. Here is what I have in hand:
package Testing;

import java.io.*;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.*;

public class Read1 {
    public static void main(String[] args) {
        try {
            FileReader fin = new FileReader("C:/jade/test/Winter_Full_clean.txt");
            BufferedReader in = new BufferedReader(fin);
            String str = "";
            int count = 0;
            String line;
            while ((line = in.readLine()) != null) {
                if (count == 0) {
                    str = line;
                } else {
                    str = str + line;
                }
                count++;
            }
            in.close();
            System.out.print(str);
        } catch (Exception e) {
            System.out.println("error crap" + e.getClass());
        }
    }

    // An earlier attempt using java.nio:
    //{
    //    Path yourFile = Paths.get("C:/jade/test/Winter_Full_clean.txt");
    //    Charset charset = Charset.forName("UTF-8");
    //    List<String> lines = null;
    //    try {
    //        lines = Files.readAllLines(yourFile, charset);
    //    } catch (IOException e) {
    //        e.printStackTrace();
    //    }
    //    String[] arr = lines.toArray(new String[lines.size()]);
    //    System.out.println(arr);
    //}
}
I have some code that is functional; it reads a text file into an array and splits on tabs (it's a TSV file). You should be able to adapt this as a starting point to read in your initial data, and then, based upon the data contained within your array, alter your logic to suit. A sketch of one way to build named columns follows the snippet:
try (BufferedReader reader = new BufferedReader(new FileReader(path))) { // path is a String pointing to the file
    int lineNum = 0;
    String readLine;
    while ((readLine = reader.readLine()) != null) { // read until end of stream
        if (lineNum == 0) { // skip headers, i.e. start at line 2 of your Excel sheet
            lineNum++;
            continue;
        }
        String[] nextLine = readLine.split("\t"); // split on tab
        for (int x = 0; x < nextLine.length; x++) { // do something with ALL the fields
            nextLine[x] = nextLine[x].replace("\'", "");
            nextLine[x] = nextLine[x].replace("/'", "");
            nextLine[x] = nextLine[x].replace("\"", " ");
            nextLine[x] = nextLine[x].replace("\\xFFFD", "");
        }
        // To access data for a certain column, call nextLine[0] for col 1,
        // nextLine[1] for col 2, etc.
    } // close the loop and go to the next line in the file
} // close your resources
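Since you want MATLAB-style named vectors, here is a hedged sketch of one way to adapt the loop above: map each header name from the first row to a growable column of doubles. The tab delimiter is carried over from the code above; the assumptions that the file is non-empty and that every data cell parses as a number are mine:
import java.io.*;
import java.util.*;

public class ColumnReader {
    public static Map<String, List<Double>> readColumns(String path) throws IOException {
        Map<String, List<Double>> columns = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String[] headers = reader.readLine().split("\t"); // first row holds the names
            for (String h : headers) columns.put(h, new ArrayList<>());
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                for (int i = 0; i < headers.length && i < fields.length; i++) {
                    columns.get(headers[i]).add(Double.valueOf(fields[i]));
                }
            }
        }
        return columns; // columns.get("SomeName") now acts like a named vector
    }
}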
I have a program written in Java that reads a file that is simply a list of strings into a LinkedHashMap. Then it takes a second file, which consists of two columns, and for each row sees if the right-hand term matches one of the terms from the HashMap. The problem is it's running very slowly.
Here's a code snippet, this is where it compares the second file to the HashMap terms:
String output = "";
infile = new File("2columns.txt");
try {
in = new BufferedReader(new FileReader(infile));
} catch (FileNotFoundException e2) {
System.out.println("2columns.txt" + " not found");
}
try {
fw = new FileWriter("newfile.txt");
out = new PrintWriter(fw);
try {
String str = in.readLine();
while (str != null) {
StringTokenizer strtok = new StringTokenizer(str);
strtok.nextToken();
String strDest = strtok.nextToken();
System.out.println("Term = " + strDest);
//if (uniqList.contains(strDest)) {
if (uniqMap.get(strDest) != null) {
output += str + "\r\n";
System.out.println("Matched! Added: " + str);
}
str = in.readLine();
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
out.print(output);
I got a performance boost from switching from an ArrayList initially to the LinkedHashMap, but it's still taking a long time. What can I do to speed this up?
Your major bottleneck may be that you are recreating a StringTokenizer for every iteration of the while loop. Moving this outside the loop could help considerably. Minor speed-ups can be obtained by moving the String definition outside the while loop.
The biggest speedup will probably come from using a StreamTokenizer. See below for an example.
Oh, and use a HashMap instead of a LinkedHashMap, as @Doug Ayers says in the comments above. :)
And @MДΓΓ БДLL's suggestion of profiling your code is bang on. Check out this Eclipse Profiling Example.
Reader r = new BufferedReader(new FileReader(infile));
StreamTokenizer strtok = new StreamTokenizer(r);
String strDest = "";
while (strtok.nextToken() != StreamTokenizer.TT_EOF) {
    strDest = strtok.sval; // strtok.toString() might be safer, but slower
    strtok.nextToken();
    System.out.println("Term = " + strtok.sval);
    //if (uniqList.contains(strDest)) {
    if (uniqMap.get(strtok.sval) != null) {
        // The tokenizer drives the loop now, so append the two tokens
        // rather than the old line variable.
        output += strDest + " " + strtok.sval + "\r\n";
        System.out.println("Matched! Added: " + strDest + " " + strtok.sval);
    }
}
One final thought (and I'm not confident on this one) is that writing to a file may also be faster if you do it in one go at the end, i.e. store all your matches in a buffer of some sort and do the writing in one hit.
StringTokenizer is a legacy class. The recommended replacement is the string "split" method.
Some of the trys might be consolidated. You can have multiple catches for a single try.
The suggestion to use HashMap instead of LinkedHashMap is a good one. Performance for gets and puts is a smidgen faster since there is no need to maintain a list structure.
The "output" string should be a StringBuilder rather than a String. That could help a lot.
I'm reading numbers from a .txt file using BufferedReader for analysis. The way I'm going about this now is reading a line using readLine(), then splitting this string into an array of strings using split().
public InputFile() {
    fileIn = null;
    // stuff here
    fileIn = new FileReader(filename + ".txt");
    buffIn = new BufferedReader(fileIn);
    return;
    // stuff here
}

public String ReadBigStringIn() {
    String line = null;
    try {
        line = buffIn.readLine();
    } catch (IOException e) {
    }
    return line;
}

public ProcessMain() {
    initComponents();
    String[] stringArray;
    String line;
    try {
        InputFile stringIn = new InputFile();
        line = stringIn.ReadBigStringIn();
        stringArray = line.split("[^0-9.+Ee-]+");
        // analysis etc.
    } catch (Exception e) {
        e.printStackTrace();
    }
}
This works fine, but what if the .txt file has multiple lines of text? Is there a way to output a single long string, or perhaps another way of doing it? Maybe use while (buffIn.readLine() != null) {}? I'm not sure how to implement this.
Ideas appreciated,
thanks.
You are right, a loop would be needed here.
The usual idiom (using only plain Java) is something like this:
public String ReadBigStringIn(BufferedReader buffIn) throws IOException {
    StringBuilder everything = new StringBuilder();
    String line;
    while ((line = buffIn.readLine()) != null) {
        everything.append(line);
    }
    return everything.toString();
}
This removes the line breaks; if you want to retain them, don't use the readLine() method, but simply read into a char[] instead and append that to your StringBuilder, as sketched below.
Please note that this loop will run until the stream ends (and will block if it doesn't end), so if you need a different condition to finish the loop, implement it in there.
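For completeness, a minimal sketch of the char[] variant just mentioned; the 8 KB buffer size is an arbitrary choice:
public String readAll(BufferedReader buffIn) throws IOException {
    StringBuilder everything = new StringBuilder();
    char[] buf = new char[8192]; // arbitrary buffer size
    int n;
    while ((n = buffIn.read(buf)) != -1) {
        everything.append(buf, 0, n); // line breaks are preserved
    }
    return everything.toString();
}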
I would strongly advise using a library here, but since Java 8 you can also do this using streams:
try (InputStreamReader in = new InputStreamReader(System.in);
     BufferedReader buffer = new BufferedReader(in)) {
    final String fileAsText = buffer.lines().collect(Collectors.joining());
    System.out.println(fileAsText);
} catch (Exception e) {
    e.printStackTrace();
}
You can notice also that it is pretty efficient, as joining uses a StringBuilder internally; pass a separator (e.g. Collectors.joining("\n")) if you want to keep the line breaks.
If you just want to read the entirety of a file into a string, I suggest you use Guava's Files class:
String text = Files.toString(new File("filename.txt"), Charsets.UTF_8);
Of course, that's assuming you want to maintain the linebreaks. If you want to remove the linebreaks, you could either load it that way and then use String.replace, or you could use Guava again:
List<String> lines = Files.readLines(new File("filename.txt"), Charsets.UTF_8);
String joined = Joiner.on("").join(lines);
Sounds like you want Apache Commons IO's FileUtils:
String text = FileUtils.readFileToString(new File(filename + ".txt"));
String[] stringArray = text.split("[^0-9.+Ee-]+");
If you create a StringBuilder, then you can append every line to it, and return the String using toString() at the end.
You can replace your ReadBigStringIn() with
public String ReadBigStringIn() {
    StringBuilder b = new StringBuilder();
    try {
        String line = buffIn.readLine();
        while (line != null) {
            b.append(line);
            line = buffIn.readLine();
        }
    } catch (IOException e) {
    }
    return b.toString();
}
You have a file containing doubles. It looks like you have more than one number per line, and you may have multiple lines.
Simplest thing to do is read lines in a while loop.
You could return null from your ReadBigStringIn method when last line is reached and terminate your loop there.
But more normal would be to create and use the reader in one method. Perhaps you could change to a method which reads the file and returns an array or list of doubles.
BTW, could you simply split your strings by whitespace?
Reading a whole file into a single String may suit your particular case, but be aware that it could cause a memory explosion if your file was very large. Streaming approach is generally safer for such i/o.
This creates one long string; every line is separated from the previous one by a single space:
public String ReadBigStringIn() {
    StringBuffer line = new StringBuffer();
    try {
        while (buffIn.ready()) {
            line.append(" " + buffIn.readLine());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return line.toString();
}
I want to read a text file containing space-separated values. The values are integers.
How can I read it and put it in an array list?
Here is an example of contents of the text file:
1 62 4 55 5 6 77
I want to have it in an arraylist as [1, 62, 4, 55, 5, 6, 77]. How can I do it in Java?
You can use Files#readAllLines() to get all lines of a text file into a List<String>.
for (String line : Files.readAllLines(Paths.get("/path/to/file.txt"))) {
// ...
}
Tutorial: Basic I/O > File I/O > Reading, Writing and Creating text files
You can use String#split() to split a String in parts based on a regular expression.
for (String part : line.split("\\s+")) {
// ...
}
Tutorial: Numbers and Strings > Strings > Manipulating Characters in a String
You can use Integer#valueOf() to convert a String into an Integer.
Integer i = Integer.valueOf(part);
Tutorial: Numbers and Strings > Strings > Converting between Numbers and Strings
You can use List#add() to add an element to a List.
numbers.add(i);
Tutorial: Interfaces > The List Interface
So, in a nutshell (assuming that the file doesn't have empty lines nor trailing/leading whitespace).
List<Integer> numbers = new ArrayList<>();
for (String line : Files.readAllLines(Paths.get("/path/to/file.txt"))) {
for (String part : line.split("\\s+")) {
Integer i = Integer.valueOf(part);
numbers.add(i);
}
}
If you happen to be at Java 8 already, then you can even use Stream API for this, starting with Files#lines().
List<Integer> numbers = Files.lines(Paths.get("/path/to/test.txt"))
.map(line -> line.split("\\s+")).flatMap(Arrays::stream)
.map(Integer::valueOf)
.collect(Collectors.toList());
Tutorial: Processing data with Java 8 streams
Java 1.5 introduced the Scanner class for handling input from files and streams.
It is used for getting integers from a file and would look something like this:
List<Integer> integers = new ArrayList<Integer>();
Scanner fileScanner = new Scanner(new File("c:\\file.txt"));
while (fileScanner.hasNextInt()) {
    integers.add(fileScanner.nextInt());
}
Check the API though. There are many more options for dealing with different types of input sources, differing delimiters, and differing data types.
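For instance, a sketch with a custom delimiter, assuming a comma-separated file (the file name and delimiter pattern here are my assumptions, not from the question):
// Integers separated by commas and optional whitespace.
Scanner csvScanner = new Scanner(new File("c:\\file.csv")).useDelimiter("\\s*,\\s*");
while (csvScanner.hasNextInt()) {
    integers.add(csvScanner.nextInt());
}
csvScanner.close();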
This example code shows you how to read a file in Java.
import java.io.*;

/**
 * This example code shows you how to read a file in Java.
 *
 * In this case RAILWAY.txt is the text file to display; change it to your own.
 */
public class ReadFileExample {
    public static void main(String[] args) {
        System.out.println("Reading File from Java code");
        // Name of the file
        String fileName = "RAILWAY.txt";
        try {
            // Create object of FileReader
            FileReader inputFile = new FileReader(fileName);
            // Instantiate the BufferedReader class
            BufferedReader bufferReader = new BufferedReader(inputFile);
            // Variable to hold one line of data
            String line;
            // Read the file line by line and print to the console
            while ((line = bufferReader.readLine()) != null) {
                System.out.println(line);
            }
            // Close the buffered reader
            bufferReader.close();
        } catch (Exception e) {
            System.out.println("Error while reading file line by line: " + e.getMessage());
        }
    }
}
Look at this example, and try to do your own:
import java.io.*;

public class ReadFile {
    public static void main(String[] args) {
        String string = "";
        String file = "textFile.txt";

        // Reading
        try {
            InputStream ips = new FileInputStream(file);
            InputStreamReader ipsr = new InputStreamReader(ips);
            BufferedReader br = new BufferedReader(ipsr);
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
                string += line + "\n";
            }
            br.close();
        } catch (Exception e) {
            System.out.println(e.toString());
        }

        // Writing
        try {
            FileWriter fw = new FileWriter(file);
            BufferedWriter bw = new BufferedWriter(fw);
            PrintWriter fileOut = new PrintWriter(bw);
            fileOut.println(string + "\n test of read and write !!");
            fileOut.close();
            System.out.println("the file " + file + " is created!");
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }
}
Just for fun, here's what I'd probably do in a real project, where I'm already using all my favourite libraries (in this case Guava, formerly known as Google Collections).
String text = Files.toString(new File("textfile.txt"), Charsets.UTF_8);
List<Integer> list = Lists.newArrayList();
for (String s : text.split("\\s")) {
list.add(Integer.valueOf(s));
}
Benefit: Not much own code to maintain (contrast with e.g. this). Edit: Although it is worth noting that in this case tschaible's Scanner solution doesn't have any more code!
Drawback: you obviously may not want to add new library dependencies just for this. (Then again, you'd be silly not to make use of Guava in your projects. ;-)
Use Apache Commons IO for simple/common things like this.
Imports:
import org.apache.commons.io.FileUtils;
Code:
String contents = FileUtils.readFileToString(new File("path/to/your/file.txt"));
String[] array = contents.split(" "); // split already returns a String[], no ArrayUtils needed
Done.
Using Java 7 to read files with NIO.2
Import these packages:
import java.io.BufferedReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.StringTokenizer;
This is the process to read a file:
Path file = Paths.get("C:\\Java\\file.txt");
if (Files.exists(file) && Files.isReadable(file)) {
    try {
        // File reader
        BufferedReader reader = Files.newBufferedReader(file, Charset.defaultCharset());
        String line;
        // Read each line
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
            // Tokenize each number
            StringTokenizer tokenizer = new StringTokenizer(line, " ");
            while (tokenizer.hasMoreElements()) {
                // Parse each integer in the file
                int element = Integer.parseInt(tokenizer.nextToken());
            }
        }
        reader.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
To read all lines of a file at once:
Path file = Paths.get("C:\\Java\\file.txt");
List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
All the answers given so far involve reading the file line by line, taking each line in as a String, and then processing the String.
There is no question that this is the easiest approach to understand, and if the file is fairly short (say, tens of thousands of lines), it'll also be acceptable in terms of efficiency. But if the file is long, it's a very inefficient way to do it, for two reasons:
Every character gets processed twice, once in constructing the String, and once in processing it.
The garbage collector will not be your friend if there are lots of lines in the file. You're constructing a new String for each line, and then throwing it away when you move to the next line. The garbage collector will eventually have to dispose of all these String objects that you don't want any more. Someone's got to clean up after you.
If you care about speed, you are much better off reading a block of data and then processing it byte by byte rather than line by line. Every time you come to the end of a number, you add it to the List you're building.
It will come out something like this:
private List<Integer> readIntegers(File file) throws IOException {
    List<Integer> result = new ArrayList<>();
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    byte[] buf = new byte[16 * 1024];
    final FileChannel ch = raf.getChannel();
    int fileLength = (int) ch.size();
    final MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0, fileLength);
    int acc = 0;
    boolean inNumber = false; // avoids emitting spurious zeros for runs of delimiters
    while (mb.hasRemaining()) {
        int len = Math.min(mb.remaining(), buf.length);
        mb.get(buf, 0, len);
        for (int i = 0; i < len; i++) {
            if ((buf[i] >= '0') && (buf[i] <= '9')) {
                acc = acc * 10 + buf[i] - '0';
                inNumber = true;
            } else if (inNumber) {
                result.add(acc);
                acc = 0;
                inNumber = false;
            }
        }
    }
    ch.close();
    raf.close();
    return result;
}
The code above assumes that this is ASCII (though it could be easily tweaked for other encodings), and that anything that isn't a digit (in particular, a space or a newline) represents a boundary between digits. It also assumes that the file ends with a non-digit (in practice, that the last line ends with a newline), though, again, it could be tweaked to deal with the case where it doesn't.
It's much, much faster than any of the String-based approaches also given as answers to this question. There is a detailed investigation of a very similar issue in this question. You'll see there that there's the possibility of improving it still further if you want to go down the multi-threaded line.
Read the file and then do whatever you want with it. Since Java 8 (note that Files.lines keeps a file handle open, so it is best closed, e.g. with try-with-resources):
try (Stream<String> lines = Files.lines(Paths.get("c://lines.txt"))) {
    List<String> list = lines.collect(Collectors.toList());
}