How can I speed up a Java text file search that uses binarySearch? - java

I am trying to write a small app that searches text files and recognises the language used in them (initially just English versus Turkish). For this purpose I count the occurrences of the byte for the letter "k". According to some research, this letter is widely used in Turkish and much less common in English, and it has the same byte value in both. The problem is that my code takes around 20 seconds (maybe a little more, on an i7-7700HQ machine) to count the occurrences of "k" in a text of 110k letters. That is a big problem because I plan to run this code over 1k text files. Should I search with a different Java method, or is this as fast as it can be?
Thanks in advance
My code is:
package deneme;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.stream.IntStream;
public class deneme {
public static int howmany =0;
public static double ratio;
public static void main(String args[]) throws IOException{
File file = new File("c:\\tr1.srt");
byte[] bytesArray = new byte[(int) file.length()];
FileInputStream fis = new FileInputStream(file);
fis.read(bytesArray); //read file into bytes[]
fis.close();
byte searchVal = 107; // 'k' letter in byte code
for(byte textbytes:bytesArray){
Arrays.sort(bytesArray);
int retVal = Arrays.binarySearch(bytesArray,0,bytesArray.length,searchVal);
if(retVal >-1){
bytesArray[retVal]=0;
howmany++;
}
}
System.out.println("Character \"k\" appears " + howmany +" times in the text");
ratio = (double)howmany/(double)bytesArray.length;
System.out.println("How many: "+howmany);
System.out.println("Length: "+bytesArray.length);
System.out.println("Ratio: "+ratio);
if(ratio<0.01){
System.out.println("Text file is probably not turkish");
}else{
System.out.println("Text file is probably turkish");
}
}
}

Sorting will visit every byte already, so you shouldn't need to sort but just visit every byte once.
You can actually count the frequency of every byte value in one pass:
int[] freqs = new int[256];
for (byte b : bytesArray)
freqs[b & 0xff]++;
Then just look up the byte you want, as in freqs['k'] + freqs['K'].
Also, you could wrap the FileInputStream in a BufferedInputStream and avoid the huge byte[]: just iterate over BufferedInputStream.read() (which returns an int in 0..255) and stop when it returns -1.
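The streaming variant described above could look roughly like the sketch below. The class name ByteFrequency and the sample text are made up for illustration, and an in-memory stream stands in for the file so the example runs anywhere.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class ByteFrequency {
    // Counts how often each byte value (0..255) occurs in the stream.
    // The stream is read once, byte by byte, and closed afterwards.
    static int[] count(InputStream in) throws IOException {
        int[] freqs = new int[256];
        try (InputStream buffered = new BufferedInputStream(in)) {
            int b;
            while ((b = buffered.read()) != -1) {
                freqs[b]++; // read() already returns 0..255, so no masking is needed
            }
        }
        return freqs;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data; replace with new FileInputStream(path) for a real file.
        byte[] sample = "Kek ve kitap".getBytes(StandardCharsets.US_ASCII);
        int[] freqs = count(new ByteArrayInputStream(sample));
        int k = freqs['k'] + freqs['K'];
        System.out.println("k/K count: " + k + " of " + sample.length + " bytes");
    }
}
```

For the file from the question you would pass new FileInputStream("c:\\tr1.srt") to count() instead of the in-memory stream.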

Sorting is a costly operation, and you are sorting the whole array once for every byte in it, which is very inefficient. Instead, you could go through all the bytes sequentially, once, and increment the counter whenever the byte is 'k'. Here is some sample code:
for(byte textBytes: bytesArray) {
if(textBytes == searchVal) {
howmany++;
}
}
Use this for loop instead of yours. You should get the results much faster.

First, if you work with letters, use a Reader, not InputStream:
Reader reader = new BufferedReader(new FileReader(file));
Next, the way you have implemented counting the letter 'k' is... how should I put it... very creative. You sort the array and binary-search for 'k' over and over, for as long as it is found. While this works, it is very far from optimal: because the array is re-sorted on every iteration, the running time grows at least quadratically with the file size, whereas the problem is easily solvable in O(n) with one pass through the characters. Something along these lines:
private static final char CHAR_k = 'k';
// ...
int count_k = 0;
int r;
while ((r = reader.read()) != -1) {
char ch = (char) r;
if (ch == CHAR_k) {
count_k++;
}
}
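Put together as a runnable sketch, the Reader-based count might look like this. The class name LetterRatio, the case-insensitive comparison and the sample string are illustrative assumptions, not part of the original code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class LetterRatio {
    // Returns the fraction of characters equal to the target letter, ignoring case.
    // The reader is consumed in a single pass; the caller is responsible for closing it.
    static double ratio(Reader reader, char target) throws IOException {
        long total = 0;
        long hits = 0;
        char wanted = Character.toLowerCase(target);
        int r;
        while ((r = reader.read()) != -1) {
            total++;
            if (Character.toLowerCase((char) r) == wanted) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample text standing in for a file.
        try (Reader reader = new StringReader("kek ve kitap")) {
            System.out.println("Ratio of k: " + ratio(reader, 'k'));
        }
    }
}
```

For a real file, Files.newBufferedReader(path, charset) gives a buffered Reader with an explicit charset. Which charset the subtitle files use is something you have to check, since FileReader silently falls back to the platform default.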

Related

Any mechanism in Java 8/NIO for replacing the lines of a big file without loading it in memory?

I am basically looking for a solution that allows me to stream the lines and replace them IN THE SAME FILE, a la Files.lines.
Any mechanism in Java 8/NIO for replacing the lines of a big file without loading it in memory?
Basically, no.
Any change to a file that involves changing the number of bytes between offsets A and B can only be done by rewriting the file, or creating a new one. In either case, everything after B has to be loaded / read into memory.
This is not a Java-specific restriction. It is a consequence of the way that modern operating systems represent files, and the low-level (i.e. syscall) APIs that they provide to applications.
In the specific case where you replace one line (or sequence of lines) with a line (or sequence of lines) of exactly the same length, you can do the replacement using either RandomAccessFile, or by mapping the file into memory. Note that the latter approach won't cause the entire file to be read into memory.
It is also possible to replace or delete lines while updating the file "in place" (changing the file length ...). See @Sergio Montoro's answer for an example. However, with an in-place update, there is a risk that the file will be corrupted if the application is interrupted. And this does involve reading and rewriting all bytes in the file after the insertion / deletion point. And that entails loading them into memory.
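For the same-length special case, a minimal sketch with RandomAccessFile could look like this. The class name SameLengthReplace, the ISO-8859-1 encoding and the assumption that the caller already knows the byte offset of the line are all illustrative choices.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SameLengthReplace {
    // Overwrites bytes starting at the given offset; the file length never changes.
    // Assumes a single-byte encoding (ISO-8859-1) so that characters and bytes line up.
    static void overwrite(Path file, long offset, String replacement) throws IOException {
        byte[] bytes = replacement.getBytes(StandardCharsets.ISO_8859_1);
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            if (offset < 0 || offset + bytes.length > raf.length()) {
                throw new IllegalArgumentException("replacement would change the file length");
            }
            raf.seek(offset);
            raf.write(bytes);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical demo file created in the temp directory.
        Path file = Files.createTempFile("demo", ".txt");
        Files.write(file, "one\ntwo\nsix\n".getBytes(StandardCharsets.ISO_8859_1));
        overwrite(file, 4, "TWO"); // the line "two" starts at byte offset 4
        System.out.print(new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1));
        Files.delete(file);
    }
}
```

Only the overwritten bytes are touched, so nothing after the replaced line has to be read or rewritten.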
There was a mechanism in Java 1: RandomAccessFile; but any such in-place mechanism requires that you know the start offset of the line, and that the new line is the same length as the old one.
Otherwise you have to copy the file up to that line, substitute the new line in the output, and then continue the copy.
You certainly don't have to load the entire file into memory.
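A sketch of that copy-and-substitute approach, holding only one line in memory at a time, might look as follows. The class name RewriteLines, the UTF-8 encoding, the temporary-file naming and the final move over the original are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.UnaryOperator;

class RewriteLines {
    // Streams the file line by line into a temporary file in the same directory,
    // applying the transform to each line, then moves the result over the original.
    // Note: every output line is terminated with '\n', whatever the input used.
    static void rewrite(Path file, UnaryOperator<String> transform) throws IOException {
        Path dir = file.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "rewrite", ".tmp");
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(transform.apply(line));
                out.write('\n');
            }
        }
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical demo file created in the temp directory.
        Path file = Files.createTempFile("demo", ".txt");
        Files.write(file, "a<b\nplain\n".getBytes(StandardCharsets.UTF_8));
        rewrite(file, line -> line.replace("<", "&lt;"));
        System.out.print(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        Files.delete(file);
    }
}
```

The price is temporary disk space for a second copy of the file. In exchange, the original stays intact until the move.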
Yes.
A FileChannel allows random read/write to any position of a file. Therefore, if you have a read ahead buffer which is long enough you can replace lines even if the new line is longer than the former one.
The following example is a toy implementation which makes two assumptions: 1st) the input file is ISO-8859-1 Unix LF encoded and 2nd) each new line is never going to be longer than the next line (one line read ahead buffer).
Unless you definitely cannot create a temporary file, you should benchmark this approach against the more natural stream in -> stream out, because I do not know what performance a spinning drive will give you for an algorithm that constantly moves forward and backward in a file.
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import static java.nio.file.StandardOpenOption.*;
import java.io.IOException;
public class ReplaceInFile {
public static void main(String args[]) throws IOException {
Path file = Paths.get(args[0]);
ByteBuffer writeBuffer;
long readPos = 0l;
long writePos;
String line_m;
String line_n;
String line_t;
FileChannel channel = FileChannel.open(file, READ, WRITE);
channel.position(0);
writePos = readPos;
line_m = readLine(channel);
do {
readPos += line_m.length() + 1;
channel.position(readPos);
line_n = readLine(channel);
line_t = transformLine(line_m)+"\n";
writeBuffer = ByteBuffer.allocate(line_t.length()+1);
writeBuffer.put(line_t.getBytes("ISO8859_1"));
System.out.print("replaced line "+line_m+" with "+line_t);
channel.position(writePos);
writeBuffer.rewind();
while (writeBuffer.hasRemaining()) {
channel.write(writeBuffer);
}
writePos += line_t.length();
line_m = line_n;
assert writePos > readPos;
} while (line_m.length() > 0);
channel.close();
System.out.println("Done!");
}
public static String transformLine(String input) throws IOException {
return input.replace("<", "&lt;").replace(">", "&gt;");
}
public static String readLine(FileChannel channel) throws IOException {
ByteBuffer readBuffer = ByteBuffer.allocate(1);
StringBuffer line = new StringBuffer();
do {
int read = channel.read(readBuffer);
if (read<1) break;
readBuffer.rewind();
char c = (char) readBuffer.get();
readBuffer.rewind();
if (c=='\n') break;
line.append(c);
} while (true);
return line.toString();
}
}

How to import .dat file into multiple arrays

Alright so I'm working on a program that reads a periodic table and you can search elements based on number or abbreviation.
Anyway, I'm a bit stuck trying to read the periodic table file into 4 different arrays: Atomic Number, Abbreviation, Element Name, and Atomic Weight.
I dunno how to write a single method to import all that info into each array in one go. I want to have a class that holds all these arrays and that I can call to later when I need each one.
Here is what I got so far, I'm a bit rusty by the way... I thought working on this program would refamiliarize me with the basics.
class PeriodicTable{
private String fileName = "periodictable.dat";
private int[] atomicNumTable = new int[200];
private String[] abbreviationTable = new String[200];
private String[] nameTable = new String[200];
private double[] atomicWeightTable = new double[200];
PeriodicTable(String fileName){
readTable(fileName);
}
public int[] readTable(String fileName){
Scanner inFile = null;
try{
inFile = new Scanner(new File(fileName));
}catch(FileNotFoundException nf){
System.out.println(fileName + " not found");
System.exit(0);
}
atomicNumTable = new int[200];
int i = 0;
while(inFile.hasNext() && i < atomicNumTable.length){
int number = inFile.nextInt();
atomicNumTable[i] = number;
i++;
}
inFile.close();
return atomicNumTable;
}
}
Here is what each line of the table looks like:
1 H Hydrogen 1.00794
Simply use java.lang.String.split(" ") (assuming that your columns are separated by single spaces; if they are separated by something else, you just need to adapt that regular expression parameter!)
That will return an array of Strings, and you basically know: the first column should be an int, then you get two Strings, and then a double value. Or, to be precise: you get strings that stand for something else, so you have to look into methods like Integer.valueOf() and the equivalent for Double.
Shouldn't be too hard to work your way from there.
But I recommend some changes to your logic: having 4 different tables doesn't make sense at all. Good OO programming is about creating helpful abstractions. Without abstractions, your program becomes abstract itself.
Meaning: you should introduce a class like
public class Element {
private final int id;
private final String abbreviation;
private final String fullName;
private final double atomicWeight;
// ... with one constructor that takes all 4 parameters
// ... with getter methods for the fields of this class
// ... and meaningful overrides for equals() and hashCode()
}
And then, instead of creating 4 arrays; you create one array, or even better an ArrayList<Element>. And instead of pushing your 4 values into 4 different arrays, you create one new Element object in each loop iteration; and you add that new object to your list.
The major difference to your solution would be: you can deal with Elements as a whole; whereas in your solution, a single "Element" is basically an index that points into 4 different tables.
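As a rough sketch of that design, the class and a small parsing helper could look like this. The names parse, readAll and PeriodicTableDemo are invented for the example, equals() and hashCode() are left out for brevity, and the input is assumed to have exactly four whitespace-separated columns per line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class Element {
    private final int atomicNumber;
    private final String abbreviation;
    private final String name;
    private final double atomicWeight;

    Element(int atomicNumber, String abbreviation, String name, double atomicWeight) {
        this.atomicNumber = atomicNumber;
        this.abbreviation = abbreviation;
        this.name = name;
        this.atomicWeight = atomicWeight;
    }

    int getAtomicNumber() { return atomicNumber; }
    String getAbbreviation() { return abbreviation; }
    String getName() { return name; }
    double getAtomicWeight() { return atomicWeight; }

    // Parses one whitespace-separated line such as "1 H Hydrogen 1.00794".
    static Element parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected 4 columns: " + line);
        }
        return new Element(Integer.parseInt(parts[0]), parts[1], parts[2],
                Double.parseDouble(parts[3]));
    }
}

class PeriodicTableDemo {
    // Reads every non-blank line from the scanner into one list of Element objects.
    static List<Element> readAll(Scanner in) {
        List<Element> elements = new ArrayList<>();
        while (in.hasNextLine()) {
            String line = in.nextLine();
            if (!line.trim().isEmpty()) {
                elements.add(Element.parse(line));
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        // Hypothetical two-line sample standing in for periodictable.dat.
        Scanner in = new Scanner("1 H Hydrogen 1.00794\n2 He Helium 4.002602\n");
        for (Element e : readAll(in)) {
            System.out.println(e.getAbbreviation() + " = " + e.getName());
        }
    }
}
```

Searching by number or abbreviation then becomes a loop over one list, comparing a single getter.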
You can simplify this code a lot. Try something like this.
1) Read the file line by line, split lines as you go,
add values to some ArrayList containing String[]
2) Close your file
3) Turn the ArrayList into a String[][]
4) Print the result
Also, note that arrays in Java are indexed starting at 0 not at 1.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Arrays;
public class Test {
static public void main(String[] args) throws Exception {
File file = new File("periodictable.dat");
FileReader reader = new FileReader(file);
BufferedReader buffReader = new BufferedReader(reader);
String s = null;
ArrayList<String[]> lst = new ArrayList<String[]>();
String[][] res = null;
while((s = buffReader.readLine()) != null){
String[] arr = s.split("[\\s]+");
lst.add(arr);
}
buffReader.close();
res = new String[lst.size()][lst.get(0).length];
res = lst.toArray(res);
System.out.println();
// System.out.println(res);
// String result = Arrays.deepToString(res);
// System.out.println(result);
System.out.println();
for (int i=0; i<res.length; i++){
for (int j=0; j<res[i].length; j++){
System.out.println("res[" + (i+1) + "][" + (j+1) + "]=" + res[i][j]);
}
}
System.out.println();
}
}
OUTPUT:
res[1][1]=1
res[1][2]=H
res[1][3]=Hydrogen
res[1][4]=1.00794
In your loop, i counts every value that is read, not every line.
You can distinguish four cases in the loop:
i%4 == 0
i%4 == 1
i%4 == 2
i%4 == 3
Depending on the case, you know what kind of value comes next. So you can read an integer, a string, or a floating-point number and put the value in the right place.
I support the recommendation of GhostCat to only have one array and a class that contains all four values of a line instead of having four arrays.
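If you do keep the four arrays, the i%4 idea could be sketched like this. The class name TokenColumns and the rows counter are made up for the example, and tokens are parsed with Integer.parseInt and Double.parseDouble rather than nextInt()/nextDouble() so that the decimal point does not depend on the default locale.

```java
import java.util.Scanner;

class TokenColumns {
    final int[] numbers = new int[200];
    final String[] abbreviations = new String[200];
    final String[] names = new String[200];
    final double[] weights = new double[200];
    int rows = 0; // number of complete rows read so far

    // Reads whitespace-separated tokens; i counts tokens, so i / 4 is the row
    // and i % 4 is the column the current token belongs to.
    void read(Scanner in) {
        int i = 0;
        while (in.hasNext() && i / 4 < numbers.length) {
            String token = in.next();
            int row = i / 4;
            switch (i % 4) {
                case 0: numbers[row] = Integer.parseInt(token); break;
                case 1: abbreviations[row] = token; break;
                case 2: names[row] = token; break;
                default:
                    weights[row] = Double.parseDouble(token);
                    rows = row + 1;
                    break;
            }
            i++;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for periodictable.dat.
        TokenColumns table = new TokenColumns();
        table.read(new Scanner("1 H Hydrogen 1.00794\n2 He Helium 4.002602\n"));
        for (int r = 0; r < table.rows; r++) {
            System.out.println(table.numbers[r] + " " + table.names[r]);
        }
    }
}
```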

If the read() method of FileInputStream returns 1 byte and a char in Java occupies 2 bytes, how does the program below work?

If the read() method of FileInputStream returns one byte and a char in Java occupies 2 bytes, how does casting the int returned by read() to char produce the right character? Below is the program:
import java.io.File;
import java.io.FileInputStream;
public class ReadFile {
public static void main(String[] args) throws Exception {
File file = new File("J:\\Java\\Programs\\xanadu.txt");
FileInputStream stream = new FileInputStream(file);
int i, iteration = 0;
while ((i = stream.read()) != -1) {
System.out.print((char) i);
iteration++;
}
System.out.println("\nNo of Iteration :" + iteration);
}
}
Content of file is : StackOverflow
Output is :
StackOverflow
No of Iteration :13
So the file contains 13 characters, which should mean 26 bytes. How can the number of iterations be 13?
If there is a link where this behaviour is explained, please share it.
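The file contains 13 ASCII characters, and in an ASCII (or UTF-8) file each of them is stored as 1 byte, so the file is 13 bytes long and read() returns 13 values. The 2 bytes per char apply only to how Java holds characters in memory (as UTF-16 code units), not to how the file is encoded. Casting the int returned by read() to char simply widens the byte value to a 16-bit char with the same numeric value, which for ASCII is the correct character. Characters outside ASCII take more than one byte in UTF-8, and characters outside the Basic Multilingual Plane (for example from the Supplementary Multilingual Plane) need two Java chars, so for those this byte-by-byte cast no longer works.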
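The difference between counting bytes and counting decoded characters can be seen in a small sketch. The class name BytesVersusChars and the sample strings are illustrative, and in-memory byte arrays stand in for files.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class BytesVersusChars {
    // Counts how many times InputStream.read() returns a value: one per byte.
    static int countBytes(byte[] data) throws IOException {
        int n = 0;
        try (InputStream in = new ByteArrayInputStream(data)) {
            while (in.read() != -1) {
                n++;
            }
        }
        return n;
    }

    // Counts how many chars a Reader produces after decoding the bytes as UTF-8.
    static int countChars(byte[] data) throws IOException {
        int n = 0;
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            while (reader.read() != -1) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        byte[] ascii = "StackOverflow".getBytes(StandardCharsets.UTF_8);
        byte[] turkish = "\u00e7ay".getBytes(StandardCharsets.UTF_8); // "cay" with a c-cedilla
        System.out.println(countBytes(ascii) + " bytes, " + countChars(ascii) + " chars");
        System.out.println(countBytes(turkish) + " bytes, " + countChars(turkish) + " chars");
    }
}
```

For pure ASCII the two counts agree, which is why the program in the question appears to work. As soon as a character needs two bytes in UTF-8, the byte count is larger than the character count.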

Array Index out of Bound Exception for returning Char Array

I am new to Java programming and I was writing code to replace spaces in Strings with %20 and return the final String. Here is the code for the problem. Since I am new to programming please tell me what I did wrong. Sorry for my bad English.
package Chapter1;
import java.io.BufferedReader;
import java.io.InputStreamReader;
public class Problem4 {
public char[] replaceSpaces(char[] str_array, int length)
{
int noOfSpaces=0,i,newLength;
for(i=0;i<length;i++)
{
if(str_array[i]==' ')
{
noOfSpaces++;
}
newLength = length + noOfSpaces * 2;
str_array[newLength]='\0';
for(i=0;i<length-1;i++)
{
if(str_array[i]==' ')
{
str_array[newLength-1]='0';
str_array[newLength-2]='2';
str_array[newLength-3]='%';
newLength = newLength-3;
}
str_array[newLength-1]=str_array[i];
newLength = newLength - 1;
}
}
return str_array;
}
public static void main(String args[])throws Exception
{
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
System.out.println("Please enter the string:");
String str = reader.readLine();
char[] str_array = str.toCharArray();
int length = str.length();
Problem4 obj = new Problem4();
char[] result = obj.replaceSpaces(str_array, length);
System.out.println(result);
}
}
But I get the following error:
Please enter the string:
hello world
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 11
at Chapter1.Problem4.replaceSpaces(Problem4.java:19)
at Chapter1.Problem4.main(Problem4.java:46)
How about using String.replaceAll():
String str = reader.readLine();
str = str.replaceAll(" ", "%20");
EDIT:
The problem is at line 19:
str_array[newLength]='\0';//<-- newLength exceeds the char array size
Here the array has a fixed size. You can use StringBuilder, StringBuffer, etc. to build the new String without worrying about the size for such small operations.
Assuming that you want to see what mistakes you made when implementing your approach, instead of looking for a totally different approach:
(1) As has been pointed out, once an array has been allocated, its size cannot be changed. Your method takes str_array as a parameter, but the resulting array will likely be larger than str_array. Therefore, since str_array's length cannot be changed, you'll need to allocate a new array to hold the result, rather than using str_array. You've computed newLength correctly; allocate a new array of that size:
char[] resultArray = new char[newLength];
(2) As Elliott pointed out, Java strings don't need \0 terminators. If, for some reason, you really want to create an array that has a \0 character at the end, then you have to add 1 to your computed newLength to account for the extra character.
(3) You're actually creating the resulting array backward. I don't know if that is intentional.
if(str_array[i]==' ')
{
str_array[newLength-1]='0';
str_array[newLength-2]='2';
str_array[newLength-3]='%';
newLength = newLength-3;
}
str_array[newLength-1]=str_array[i];
newLength = newLength - 1;
i starts with the first character of the string and goes upward; you're filling in characters starting with the last character of the string (newLength) and going backward. If that's what you intended to do, it wasn't clear from your question. Did you want the output to be "dlrow%20olleh"?
(4) If you did intend to go backward, then what the above code does with a space is to put %20 in the string (backwards), but then it also puts the space into the result. If the input character is a space, you want to make sure you don't execute the two lines that copy the input character to the result. So you'll need to add an else. (Note that this problem will lead to an out-of-bounds error, because you're trying to put more characters into the result than you computed.) You'll need to have an else in there even if you really meant to build the string forwards and need to change the logic to make it go forward.
Java arrays are not dynamic (they are Object instances, and they have a field length property that does not change). Because they store the length as a field, it is important to know that they're not '\0' terminated (your attempt to add such a terminator is causing your index out of bounds Exception). Your method doesn't appear to access any instance fields or methods, so I'd make it static. Then you could use a StringBuilder and a for-each loop. Something like
public static char[] replaceSpaces(char[] str_array) {
StringBuilder sb = new StringBuilder();
for (char ch : str_array) {
sb.append((ch != ' ') ? ch : "%20");
}
return sb.toString().toCharArray();
}
Then call it like
char[] result = replaceSpaces(str_array);
Finally, you might use String str = reader.readLine().replace(" ", "+"); or replaceAll(" ", "%20") as suggested by @Arvind.
P.S. System.out.println(result) does print the characters, because PrintStream has a println(char[]) overload. If you concatenate the array with a String, or pass it anywhere that calls toString(), convert it first:
System.out.println(Arrays.toString(result));
or
System.out.println(new String(result));
A char[] is not a String, and Java arrays (disappointingly) don't override toString(), so otherwise you'll get the one from Object.
please tell me what I did wrong
You tried to replace a single character with three characters %20. That's not possible because arrays are fixed length.
Therefore you must allocate a new char[] and copy the characters from str_array into the new array.
for (i = 0; i < length; i++) {
if (str_array[i] == ' ') {
noOfSpaces++;
}
}
newLength = length + noOfSpaces * 2;
char[] newArray = new char[newLength];
// copy characters from str_array into newArray
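The copy step that the comment leaves open could be completed along these lines. The class name ReplaceSpaces is illustrative, and the method builds the result forwards into a freshly allocated array rather than in place.

```java
class ReplaceSpaces {
    // Returns a new array in which every space is replaced by the three chars %20.
    static char[] replaceSpaces(char[] input) {
        // First pass: count the spaces to know how large the result must be.
        int spaces = 0;
        for (char c : input) {
            if (c == ' ') {
                spaces++;
            }
        }
        // Each space grows from 1 char to 3, i.e. 2 extra chars per space.
        char[] result = new char[input.length + 2 * spaces];
        // Second pass: copy forwards, expanding spaces as we go.
        int out = 0;
        for (char c : input) {
            if (c == ' ') {
                result[out++] = '%';
                result[out++] = '2';
                result[out++] = '0';
            } else {
                result[out++] = c;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(new String(replaceSpaces("hello world".toCharArray())));
    }
}
```

The else branch is what keeps the space itself from being copied after %20 has been written.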
The exception is raised in the line str_array[newLength]='\0'; because newLength is not a valid index of str_array.
The size of an array cannot be increased once it has been created, so copy it into a larger array instead:
char[] str_array1=Arrays.copyOf(str_array, newLength+1);
str_array1[newLength]='\0';
Don't forget to add the import: import java.util.Arrays;

Character Matching DNA Program

I am supposed to write a program that uses command line arguments to take in 3 different files: a human DNA sequence, a mouse DNA sequence, and an unknown sequence. Without using arrays, I have to compare each character and give the percent match, as well as which one the unknown sequence most closely matches. Here is what I have so far:
import java.io.File;
import java.io.FileInputStream;
import java.io.DataInputStream;
import java.io.*;
public class Lucas_Tilak_Hw8_DNA
{
public static void main (String args[]) throws IOException
{
//First let's take in each file
File MouseFile = new File(args[0]);
File HumanFile = new File(args[1]);
File UnknownFile = new File(args[2]);
//This allows us to view individual characters
FileInputStream m = new FileInputStream(MouseFile);
FileInputStream h = new FileInputStream(HumanFile);
FileInputStream u = new FileInputStream(UnknownFile);
//This allows us to read each character one by one.
DataInputStream mouse = new DataInputStream(m);
DataInputStream human = new DataInputStream(h);
DataInputStream unk = new DataInputStream(u);
//We initialize our future numerators
int humRight = 0;
int mouRight = 0;
//Now we set the counting variable
int countChar = 0;
for( countChar = 0; countChar < UnknownFile.length(); countChar++);
{
//initialize
char unkChar = unk.readChar();
char mouChar = mouse.readChar();
char humChar = human.readChar();
//add to numerator if they match
if (unkChar == humChar)
{
humRight++;
}
if (unkChar == mouChar)
{
mouRight++;
}
//add to denominator
countChar++;
}
//convert to fraction
long mouPercent = (mouRight/countChar);
long humPercent = (humRight/countChar);
//print fractions
System.out.println("Mouse Compare: " + mouPercent);
System.out.println("Human Compare: " + humPercent);
if (mouPercent > humPercent)
{
System.out.println("mouse");
}
else if (mouPercent < humPercent)
{
System.out.println("human");
}
else
{
System.out.println("identity cannot be determined");
}
}
}
If I put in random code {G, T, C, A} for each file I use, it doesn't seem to compare characters, so I get 0 = mouPercent and 0 = humPercent. Please help!
Several errors in your code are to blame.
Remove the ; from the end of your for() statement. Basically, you are only reading a single character from each file, and your comparison is strictly limited to that first set of characters. It's unlikely they will have any overlap.
Second error: don't use the file length as the loop bound. File.length() counts bytes, not characters, and depending on the encoding a character may take more than one byte, so the count can be wrong. It is more robust to read until the stream reports the end of input: read() returns -1 at the end, and most streams and readers also offer an available() or ready() method.
Third error: DataInputStream is not going to do what you expect. Read the docs: readChar() always consumes 2 bytes and combines them into one char, which only makes sense for data written by the corresponding DataOutput classes (for example with writeChar()). That is why you get strange characters. Use a BufferedReader instead, which decodes the bytes with a character encoding such as UTF-8, most likely the encoding of the files you are reading.
TL;DR? Your loop is broken, file length is a bad idea for loop terminating condition, and DataInputStream is a special unicorn, so use BufferedReader instead when dealing with characters in normal files.
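Also, mouRight/countChar is an integer division, so it yields 0 whenever mouRight is smaller than countChar; storing the result in a long does not help. Use floating-point arithmetic for the percentages, for example (double) mouRight / countChar, and remove the extra countChar++ inside the loop body, since the for statement already increments it.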
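Taken together, the fixes above could be sketched like this. The class name DnaMatch and the sample sequences are made up, StringReader stands in for the three files, and the comparison stops at the end of the shorter sequence, which is an assumption about what the assignment wants.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class DnaMatch {
    // Returns the percentage of positions at which the two sequences agree,
    // comparing character by character up to the end of the shorter input.
    // No arrays are used; each reader is consumed one char at a time.
    static double percentMatch(Reader unknown, Reader reference) throws IOException {
        int matches = 0;
        int compared = 0;
        int u;
        int r;
        while ((u = unknown.read()) != -1 && (r = reference.read()) != -1) {
            if (u == r) {
                matches++;
            }
            compared++;
        }
        // Floating-point arithmetic avoids the integer division that always gave 0.
        return compared == 0 ? 0.0 : 100.0 * matches / compared;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sequences; for files, wrap a FileReader in a BufferedReader instead.
        String unknown = "GATTACA";
        double mouse = percentMatch(new StringReader(unknown), new StringReader("GATTTCA"));
        double human = percentMatch(new StringReader(unknown), new StringReader("CATTAGA"));
        System.out.println("Mouse Compare: " + mouse);
        System.out.println("Human Compare: " + human);
        System.out.println(mouse > human ? "mouse"
                : mouse < human ? "human" : "identity cannot be determined");
    }
}
```

The unknown sequence has to be read once per comparison, so each call gets its own reader. Line breaks in the files are compared like any other character.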
