Character Matching DNA Program - java

I am supposed to write a program that takes 3 different files as command line arguments: a human DNA sequence, a mouse DNA sequence, and an unknown sequence. Without using arrays, I have to compare each character and give the percent match as well as which one it most closely matches. Here is what I have so far:
import java.io.File;
import java.io.FileInputStream;
import java.io.DataInputStream;
import java.io.*;
public class Lucas_Tilak_Hw8_DNA
{
public static void main (String args[]) throws IOException
{
//First let's take in each file
File MouseFile = new File(args[0]);
File HumanFile = new File(args[1]);
File UnknownFile = new File(args[2]);
//This allows us to view individual characters
FileInputStream m = new FileInputStream(MouseFile);
FileInputStream h = new FileInputStream(HumanFile);
FileInputStream u = new FileInputStream(UnknownFile);
//This allows us to read each character one by one.
DataInputStream mouse = new DataInputStream(m);
DataInputStream human = new DataInputStream(h);
DataInputStream unk = new DataInputStream(u);
//We initialize our future numerators
int humRight = 0;
int mouRight = 0;
//Now we set the counting variable
int countChar = 0;
for( countChar = 0; countChar < UnknownFile.length(); countChar++);
{
//initialize
char unkChar = unk.readChar();
char mouChar = mouse.readChar();
char humChar = human.readChar();
//add to numerator if they match
if (unkChar == humChar)
{
humRight++;
}
if (unkChar == mouChar)
{
mouRight++;
}
//add to denominator
countChar++;
}
//convert to fraction
long mouPercent = (mouRight/countChar);
long humPercent = (humRight/countChar);
//print fractions
System.out.println("Mouse Compare: " + mouPercent);
System.out.println("Human Compare: " + humPercent);
if (mouPercent > humPercent)
{
System.out.println("mouse");
}
else if (mouPercent < humPercent)
{
System.out.println("human");
}
else
{
System.out.println("identity cannot be determined");
}
}
}
If I put in random code {G, T, C, A} for each file I use, it doesn't seem to compare characters, so I get 0 = mouPercent and 0 = humPercent. Please help!

Several errors in your code are to blame.
First error: remove the ; from the end of your for() statement. With the semicolon there, the loop body is empty and the block that follows runs exactly once, so you are only reading a single character from each file and your comparison is limited to that first set of characters. It's unlikely they will match. Once the loop is fixed, also drop the extra countChar++ inside the body, since the for header already increments it.
Second error: don't use the file length. File.length() counts bytes, not characters, and a character may be encoded as more than one byte, so the two can disagree. It is better to read until the stream reports the end of input: read() returns -1 when nothing is left, and most streams and readers also have an available() or ready() method that tells you whether there is more to read.
Third error: DataInputStream is not going to do what you expect. Read the docs: readChar() always pulls 2 bytes and combines them into a single 16-bit char, which only makes sense for data written by the corresponding DataOutput implementing classes (writeChar). That is why you're getting strange characters. Modify your code to use BufferedReader instead, which decodes text using a character encoding such as UTF-8, most likely the encoding of the files you are reading.
TL;DR? Your loop is broken, file length is a bad idea for loop terminating condition, and DataInputStream is a special unicorn, so use BufferedReader instead when dealing with characters in normal files.

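Try using floats or doubles instead of longs for your percentage variables. mouRight/countChar is integer division, so any result below 1 is truncated to 0; cast one operand first, e.g. (double) mouRight / countChar.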
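Putting the points from the answers above together, a minimal sketch might look like the following. The class name DnaMatch and the helpers matchPercent and percentFor are made up for illustration. It assumes the files are plain text in UTF-8, compares only up to the end of the shorter input, and counts every character including line breaks.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DnaMatch {
    // Hypothetical helper: percentage of positions at which the two character streams agree.
    static double matchPercent(Reader unknown, Reader known) throws IOException {
        int matches = 0;
        int total = 0;
        while (true) {
            int u = unknown.read();
            int k = known.read();
            // read() returns -1 at end of input, so no file length is needed
            if (u == -1 || k == -1) {
                break;
            }
            if (u == k) {
                matches++;
            }
            total++;
        }
        // 100.0 forces floating-point division; int / int would truncate to 0
        return total == 0 ? 0.0 : 100.0 * matches / total;
    }

    // Hypothetical helper: opens both files and closes them again via try-with-resources.
    static double percentFor(String unknownPath, String knownPath) throws IOException {
        try (BufferedReader u = Files.newBufferedReader(Paths.get(unknownPath), StandardCharsets.UTF_8);
             BufferedReader k = Files.newBufferedReader(Paths.get(knownPath), StandardCharsets.UTF_8)) {
            return matchPercent(u, k);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.out.println("usage: java DnaMatch <mouseFile> <humanFile> <unknownFile>");
            return;
        }
        double mouPercent = percentFor(args[2], args[0]);
        double humPercent = percentFor(args[2], args[1]);
        System.out.println("Mouse Compare: " + mouPercent);
        System.out.println("Human Compare: " + humPercent);
        if (mouPercent > humPercent) {
            System.out.println("mouse");
        } else if (mouPercent < humPercent) {
            System.out.println("human");
        } else {
            System.out.println("identity cannot be determined");
        }
    }
}
```

The unknown file is opened twice here to keep the helper simple; a single loop reading all three readers in step would work just as well.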

Related

FileWriter doesn't write integers into file

I am trying to read 2 input files containing integers(even duplicates are considered) and trying to find common integers and write them to the output file.
input1.txt
01
21
14
27
31
20
31
input2.txt
14
21
27
08
09
14
Following is the code I tried:
public static void main(String[] args) throws NumberFormatException {
try {
BufferedReader inputFile1 = new BufferedReader(new FileReader(new File("src/input1.txt")));
BufferedReader inputFile2 = new BufferedReader(new FileReader(new File("src/input2.txt")));
FileWriter fileCommon = new FileWriter("src/common.txt");
String lineInput1;
String lineInput2;
int inputArray1[] = new int[10];
int inputArray2[] = new int[10];
int index = 0;
while ((lineInput1 = inputFile1.readLine()) != null) {
inputArray1[index] = Integer.parseInt(lineInput1);
index++;
}
index = 0;
while((lineInput2 = inputFile2.readLine()) != null) {
inputArray2[index] = Integer.parseInt(lineInput2);
index++;
}
for (int a = 0; a < inputArray1.length; a++) {
for (int b = 0;b < inputArray2.length; b++) {
if(inputArray1[a] == inputArray2[b]) {
fileCommon.write(inputArray1[a]);
}
}
}
inputFile1.close();
inputFile2.close();
fileCommon.close();
} catch (IOException e) {
e.printStackTrace();
}
}
I don't understand where I am making a mistake. I am not getting any errors, and the output file that is generated is empty.
The expected output is the common integers in both files:
14
21
27
Remember that FileWriter's write(int c) accepts an integer representing a character code, which is then encoded using either a specified charset or the platform's default charset, which is usually an extension of ASCII (for example, on Windows the default charset is often Windows-1252, which is an extension of ASCII).
This means that you don't actually have any (semantic or syntactic) problem per se, and you are writing into the file successfully, but you are writing control characters which you can't see afterwards.
If you invoke write(..) with an integer representing a Latin letter (or symbol) in the ASCII table, you'll see that it writes the actual letter (or symbol) into your file.
For instance:
fileCommon.write(37); //will write `%` into your file.
fileCommon.write(66); //will write `B` into your file.
In your code, you're only writing 21, 14 and 27 into your file, and as you can see from the ASCII table:
Decimal 21 represents Negative Acknowledgment
Decimal 14 represents Shift-out
Decimal 27 represents Escape
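The difference is easy to see without touching the file system. The sketch below uses a StringWriter, which shares the write(int) and write(String) methods with FileWriter through the Writer base class; the class name WriteIntDemo and the helper demo are made up for illustration.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

public class WriteIntDemo {
    // Writes the same number twice: once as a character code, once as text.
    static String demo(int value) throws IOException {
        Writer out = new StringWriter();
        out.write(value);                 // one character whose code is `value`
        out.write(String.valueOf(value)); // the decimal digits of `value`
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo(66)); // prints "B66"
    }
}
```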
FileWriter.write(int) writes a single character; in your case 14, 21, and 27 are all control characters that would not be visible in a text file.
fileCommon.write("" + inputArray1[a]);
should write the string representation. You'll find some other problems though, such as missing line endings and repeated values, but this should get you started.
Here's the thing.
The write(int c) method of FileWriter does not actually write an int value; it writes the single character with that character code. For example, write(53) will write a "5" to a file.
In your code, you are actually writing some control characters. You can use the write(String str) method of FileWriter, or just use the BufferedWriter class, to achieve your goal.
With the values written as strings, your code would produce "21141427" (followed by a run of zeros from the unused array slots, which are all 0 and match each other), so you also have to remove the repeated values and add a line feed after each value.
Sorry for the poor English.
You can read Strings from the original input files, instead of ints, and use the String.equals(Object):boolean function to compare Strings.
Then you won't need to parse from String to int and convert the int back to a String when writing to the file.
Also note that writing an int will write the unicode char value to the file, not the number as a string.
The problem is the fileCommon.write line. It should be as follows.
fileCommon.write(String.valueOf(inputArray1[a])+"\n");
Additionally, this would perform much better if you put all of the data from the first file into a Map (or a Set) instead of an array; then, when reading the second file, just check whether the value is present and, if it is, write it to the output file.
If you are dead set on using an array, you can sort the first array and use a binary search. This would also perform much better than looping through everything over and over.
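As a rough sketch of the set-based approach suggested above, the following reads each file into a set, keeps the values present in both, and writes them as text, one per line. The class name CommonIntegers and its helper methods are hypothetical, and it assumes one integer per line with no blank lines.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class CommonIntegers {
    // Reads one integer per line; duplicates collapse because a Set is used.
    static Set<Integer> readInts(Path file) throws IOException {
        Set<Integer> values = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            values.add(Integer.parseInt(line.trim()));
        }
        return values;
    }

    // Returns the values present in both sets, in ascending order.
    static Set<Integer> common(Set<Integer> first, Set<Integer> second) {
        Set<Integer> result = new TreeSet<>(first);
        result.retainAll(second);
        return result;
    }

    // Writes each value as decimal text followed by a line separator.
    static void writeInts(Path file, Set<Integer> values) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (int value : values) {
                out.write(String.valueOf(value));
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.out.println("usage: java CommonIntegers <input1> <input2> <output>");
            return;
        }
        Set<Integer> shared = common(readInts(Paths.get(args[0])), readInts(Paths.get(args[1])));
        writeInts(Paths.get(args[2]), shared);
    }
}
```

Using sets also removes the fixed array size of 10, so unused slots can no longer produce spurious matches on 0.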

How can I speed up a Java text file search that uses binarySearch?

I am trying to make a small app which searches text files and recognises the language used in them (at first, just between English and Turkish). For this purpose I am counting the occurrences of the byte for the letter "k". According to some research, this letter is widely used in Turkish and much less in English, and it has the same byte value in both. However, the problem is that it takes around 20 seconds (or maybe a little more, on an i7 7700HQ computer) to count the occurrences of the letter k in a text of 110k letters with my code, which is a big problem as I am planning to run this code over 1k text files. Should I do the search with another Java method, or is this the fastest it can be?
Thanks in advance
My code is:
package deneme;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.stream.IntStream;
public class deneme {
public static int howmany =0;
public static double ratio;
public static void main(String args[]) throws IOException{
File file = new File("c:\\tr1.srt");
byte[] bytesArray = new byte[(int) file.length()];
FileInputStream fis = new FileInputStream(file);
fis.read(bytesArray); //read file into bytes[]
fis.close();
byte searchVal = 107; // 'k' letter in byte code
for(byte textbytes:bytesArray){
Arrays.sort(bytesArray);
int retVal = Arrays.binarySearch(bytesArray,0,bytesArray.length,searchVal);
if(retVal >-1){
bytesArray[retVal]=0;
howmany++;
}
}
System.out.println("Character \"k\" appears " + howmany +" times in the text");
ratio = (double)howmany/(double)bytesArray.length;
System.out.println("How many: "+howmany);
System.out.println("Length: "+bytesArray.length);
System.out.println("Ratio: "+ratio);
if(ratio<0.01){
System.out.println("Text file is probably not turkish");
}else{
System.out.println("Text file is probably turkish");
}
}
}
Sorting will visit every byte already, so you shouldn't need to sort but just visit every byte once.
You can actually count all bytes' frequencies if you do:
int[] freqs = new int[256];
for(byte b: bytesArray)
freqs[b&0x0ff]++;
then just look up the byte you like, as in freqs['k']+freqs['K'].
Also, you could just open a BufferedInputStream over the FileInputStream and avoid the huge byte[]: iterate over BufferedInputStream.read() (which returns an int 0..255) and stop when it returns -1.
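The streaming variant described above could be sketched like this. The class name ByteFrequency and the helper countBytes are invented for the example; the file path is taken from the first command line argument rather than hard-coded.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ByteFrequency {
    // Counts how often each byte value (0..255) occurs, in a single pass.
    static int[] countBytes(InputStream in) throws IOException {
        int[] freqs = new int[256];
        int b;
        // read() yields an int in 0..255, or -1 at end of stream
        while ((b = in.read()) != -1) {
            freqs[b]++;
        }
        return freqs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: java ByteFrequency <file>");
            return;
        }
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            int[] freqs = countBytes(in);
            long total = 0;
            for (int f : freqs) {
                total += f;
            }
            int howMany = freqs['k'] + freqs['K'];
            double ratio = total == 0 ? 0.0 : (double) howMany / total;
            System.out.println("How many: " + howMany);
            System.out.println("Length: " + total);
            System.out.println("Ratio: " + ratio);
        }
    }
}
```

Because each byte is visited exactly once, the running time grows linearly with the file size.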
Sorting is a costly operation, and you are sorting your array once for every byte in it, which is inefficient. Instead, you could just go sequentially through all the bytes once and, if a particular byte is 'k', increment the counter. Here is some sample code:
for(byte textBytes: bytesArray) {
if(textBytes == searchVal) {
howmany++;
}
}
Use this for loop instead of yours. You should get the results much faster.
First, if you work with letters, use a Reader, not InputStream:
Reader reader = new BufferedReader(new FileReader(file));
Next, the way you have implemented counting the letter 'k' is... how should I put it... very creative. You sort the array and binary-search for 'k' over and over, for as long as it is found. While this works, it is very far from optimal: because the array is re-sorted on every iteration, the cost grows at least quadratically with the file size, whereas the problem is easily solvable in O(n) with one pass through the characters read. Something along the lines of:
private static final char CHAR_k = 'k';
// ...
int count_k = 0;
int r;
while ((r = reader.read()) != -1) {
char ch = (char) r;
if (ch == CHAR_k) {
count_k++;
}
}

Iterate through a dictionary array

I have a String array containing a poem which has deliberate spelling mistakes. I am trying to iterate through the String array to identify the spelling mistakes by comparing the String array to a String array containing a dictionary. If possible I would like a suggestion that allows me to continue using nested for loops
for (int i = 0; i < poem2.length; i++) {
boolean found = false;
for (int j = 0; j < dictionary3.length; j++) {
if (poem2[i].equals(dictionary3[j])) {
found = true;
break;
}
}
if (found==false) {
System.out.println(poem2[i]);
}
}
The output is printing out the correctly spelt words as well as the incorrectly spelt ones and I am aiming to only print out the incorrectly spelt ones. Here is how I populate the 'dictionary3' and 'poem2' arrays:
char[] buffer = null;
try {
BufferedReader br1 = new BufferedReader(new
java.io.FileReader(poem));
int bufferLength = (int) (new File(poem).length());
buffer = new char[bufferLength];
br1.read(buffer, 0, bufferLength);
br1.close();
} catch (IOException e) {
System.out.println(e.toString());
}
String text = new String(buffer);
String[] poem2 = text.split("\\s+");
char[] buffer2 = null;
try {
BufferedReader br2 = new BufferedReader(new java.io.FileReader(dictionary));
int bufferLength = (int) (new File(dictionary).length());
buffer2 = new char[bufferLength];
br2.read(buffer2, 0, bufferLength);
br2.close();
} catch (IOException e) {
System.out.println(e.toString());
}
String dictionary2 = new String(buffer);
String[] dictionary3 = dictionary2.split("\n");
Your basic problem is in line
String dictionary2 = new String(buffer);
where you were trying to convert the characters representing the dictionary, which are stored in buffer2, but you used buffer (without the 2 suffix). This style of variable naming suggests that you need either a loop or, in this case, a separate method which returns the array of words held in a given file (you can also pass the delimiter on which the string should be split as a method parameter).
So your dictionary2 held the characters from buffer, which represented the poem, not the dictionary data.
Another problem is
String[] dictionary3 = dictionary2.split("\n");
because you are splitting only on \n, but some OSes like Windows use \r\n as the line separator sequence. So your dictionary array may contain words like foo\r instead of foo, which will cause poem2[i].equals(dictionary3[j]) to always fail.
To avoid this problem you can split on \\R (available since Java 8) or \r?\n|\r.
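A small sketch of the difference, using a made-up class LineSplitDemo with a helper splitLines (Java 8 or later is assumed for \R):

```java
public class LineSplitDemo {
    // Splits on any line break sequence: \r\n, \n, or \r.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        String windowsText = "foo\r\nbar\r\nbaz";

        // Splitting on \n alone leaves a trailing \r on each word but the last
        String[] naive = windowsText.split("\n");
        System.out.println(naive[0].equals("foo"));   // prints "false"
        System.out.println(naive[0].length());        // prints "4"

        // Splitting on \R removes the whole line break
        String[] clean = splitLines(windowsText);
        System.out.println(clean[0].equals("foo"));   // prints "true"
    }
}
```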
There are other problems in your code, like closing the resource within the try section. If any exception is thrown before that point, close() will never be invoked, leaving the resource unclosed. To solve this, close resources in a finally section (which is always executed after try, regardless of whether an exception is thrown), or better, use try-with-resources.
BTW you can simplify/clarify your code responsible for reading words from files
List<String> poem2 = new ArrayList<>();
Scanner scanner = new Scanner(new File(yourFileLocation));
while(scanner.hasNext()){//has more words
poem2.add(scanner.next());
}
For dictionary instead of List you should use Set/HashSet to avoid duplicates (usually sets also have better performance when checking if they contain some elements or not). Such collections already provide methods like contains(element) so you wouldn't need that inner loop.
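The Scanner and HashSet suggestions above might be combined roughly as follows. The class name SpellCheck and the helpers readWords and misspelled are hypothetical, the files are assumed to be UTF-8, and the comparison is case-sensitive just like equals in the original loop.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;

public class SpellCheck {
    // Reads whitespace-separated words; try-with-resources closes the Scanner
    // even if an exception is thrown while reading.
    static List<String> readWords(Path file) throws IOException {
        List<String> words = new ArrayList<>();
        try (Scanner scanner = new Scanner(file, "UTF-8")) {
            while (scanner.hasNext()) {
                words.add(scanner.next());
            }
        }
        return words;
    }

    // Returns the poem words that are not in the dictionary, in poem order.
    static List<String> misspelled(List<String> poem, Set<String> dictionary) {
        List<String> result = new ArrayList<>();
        for (String word : poem) {
            if (!dictionary.contains(word)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java SpellCheck <poemFile> <dictionaryFile>");
            return;
        }
        List<String> poem = readWords(Paths.get(args[0]));
        Set<String> dictionary = new HashSet<>(readWords(Paths.get(args[1])));
        for (String word : misspelled(poem, dictionary)) {
            System.out.println(word);
        }
    }
}
```

Since both files go through the same readWords helper, the poem and the dictionary can no longer be mixed up by a buffer/buffer2 slip, and whitespace splitting sidesteps the \r issue.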
I copied your code and ran it, and I noticed two issues. Good news is, both are very quick fixes.
#1
When I printed out everything in dictionary3 after it is read in, it is the exact same as everything in poem2. This line in your code for reading in the dictionary is the problem:
String dictionary2 = new String(buffer);
You're using buffer, which was the variable you used to read in the poem. Therefore, buffer contains the poem and your poem and dictionary end up the same. I think you want to use buffer2 instead, which is what you used to read in the dictionary:
String dictionary2 = new String(buffer2);
When I changed that, the dictionary and poem appear to have the proper entries.
#2
The other problem, as Pshemo pointed out in their answer (which is completely correct, and a very good answer!) is that you are splitting on \n for the dictionary. The only thing I would say differently from Pshemo here is that you should probably split on \\s+ just like you did for the poem, to stay consistent. In fact, when I debugged, I noticed that the dictionary words all have "\r" appended to the end, probably because you were splitting on \n. To fix this, change this line:
String[] dictionary3 = dictionary2.split("\n");
To this:
String[] dictionary3 = dictionary2.split("\\s+");
Try changing those two lines, and let us know if that resolves your issue. Best of luck!
Convert your dictionary to an ArrayList and use contains instead.
Something like this should work:
if(dictionary3.contains(poem2[i]))
found = true;
else
found = false;
With this method you can also get rid of that nested loop, as the contains method handles that for you.
You can convert your dictionary array to an ArrayList with the following expression:
new ArrayList<>(Arrays.asList(array))

Any mechanism in Java 8/NIO for replacing the lines of a big file without loading it in memory?

I am basically looking for a solution that allows me to stream the lines and replace them IN THE SAME FILE, a la Files.lines
Any mechanism in Java 8/NIO for replacing the lines of a big file without loading it in memory?
Basically, no.
Any change to a file that involves changing the number of bytes between offsets A and B can only be done by rewriting the file, or creating a new one. In either case, everything after B has to be loaded / read into memory.
This is not a Java-specific restriction. It is a consequence of the way that modern operating systems represent files, and the low-level (i.e. syscall) APIs that they provide to applications.
In the specific case where you replace one line (or sequence of lines) with a line (or sequence of lines) of exactly the same length, then you can do the replacement using either RandomAccessFile, or by mapping the file into memory. Note that the latter approach won't cause the entire file to be read into memory.
It is also possible to replace or delete lines while updating the file "in place" (changing the file length ...). See @Sergio Montoro's answer for an example. However, with an in-place update, there is a risk that the file will be corrupted if the application is interrupted. And this does involve reading and rewriting all bytes in the file after the insertion / deletion point. And that entails loading them into memory.
There was a mechanism in Java 1: RandomAccessFile; but any such in-place mechanism requires that you know the start offset of the line, and that the new line is the same length as the old one.
Otherwise you have to copy the file up to that line, substitute the new line in the output, and then continue the copy.
You certainly don't have to load the entire file into memory.
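The copy-and-substitute approach described above could be sketched like this. It is not an in-place edit: it streams the file line by line into a temporary file in the same directory and then moves that file over the original. The class name RewriteLines and the helper rewrite are hypothetical, UTF-8 is an assumption, and line endings are normalised to the platform default.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.function.UnaryOperator;

public class RewriteLines {
    // Applies `transform` to every line while holding only one line in memory at a time.
    static void rewrite(Path file, UnaryOperator<String> transform) throws IOException {
        Path dir = file.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "rewrite", ".tmp");
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(transform.apply(line));
                out.newLine();
            }
        }
        // Swap the finished copy in only after it has been written completely
        Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: java RewriteLines <file>");
            return;
        }
        rewrite(Paths.get(args[0]), line -> line.replace("<", "&lt;").replace(">", "&gt;"));
    }
}
```

The trade-off is temporary disk space for a second copy of the file, in exchange for never leaving the original half-rewritten if the program is interrupted during the copy.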
Yes.
A FileChannel allows random read/write to any position of a file. Therefore, if you have a read ahead buffer which is long enough you can replace lines even if the new line is longer than the former one.
The following example is a toy implementation which makes two assumptions: 1st) the input file is ISO-8859-1 Unix LF encoded and 2nd) each new line is never going to be longer than the next line (one line read ahead buffer).
Unless you definitely cannot create a temporary file, you should benchmark this approach against the more natural stream in -> stream out, because I do not know what performance a spinning drive will give you for an algorithm that constantly moves forward and backward in a file.
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import static java.nio.file.StandardOpenOption.*;
import java.io.IOException;
public class ReplaceInFile {
public static void main(String args[]) throws IOException {
Path file = Paths.get(args[0]);
ByteBuffer writeBuffer;
long readPos = 0L;
long writePos;
String line_m;
String line_n;
String line_t;
FileChannel channel = FileChannel.open(file, READ, WRITE);
channel.position(0);
writePos = readPos;
line_m = readLine(channel);
do {
readPos += line_m.length() + 1;
channel.position(readPos);
line_n = readLine(channel);
line_t = transformLine(line_m)+"\n";
// wrap() sizes the buffer to exactly the encoded bytes, so no stray padding byte is written
writeBuffer = ByteBuffer.wrap(line_t.getBytes("ISO8859_1"));
System.out.print("replaced line "+line_m+" with "+line_t);
channel.position(writePos);
writeBuffer.rewind();
while (writeBuffer.hasRemaining()) {
channel.write(writeBuffer);
}
writePos += line_t.length();
line_m = line_n;
assert writePos > readPos;
} while (line_m.length() > 0);
channel.close();
System.out.println("Done!");
}
public static String transformLine(String input) throws IOException {
return input.replace("<", "&lt;").replace(">", "&gt;");
}
public static String readLine(FileChannel channel) throws IOException {
ByteBuffer readBuffer = ByteBuffer.allocate(1);
StringBuffer line = new StringBuffer();
do {
int read = channel.read(readBuffer);
if (read<1) break;
readBuffer.rewind();
char c = (char) readBuffer.get();
readBuffer.rewind();
if (c=='\n') break;
line.append(c);
} while (true);
return line.toString();
}
}

Reading characters from a file written with .net

I'm trying to use Java to read a string from a file that was written with a .NET BinaryWriter.
I think the problem is that the .NET BinaryWriter uses some 7-bit format for its strings. By researching online, I came across this code that is supposed to function like the BinaryReader's ReadString() method. This is in my CSDataInputStream class that extends DataInputStream.
public String readStringCS() throws IOException {
int stringLength = 0;
boolean stringLengthParsed = false;
int step = 0;
while(!stringLengthParsed) {
byte part = readByte();
stringLengthParsed = (((int)part >> 7) == 0);
int partCutter = part & 127;
part = (byte)partCutter;
int toAdd = (int)part << (step*7);
stringLength += toAdd;
step++;
}
char[] chars = new char[stringLength];
for(int i = 0; i < stringLength; i++) {
chars[i] = readChar();
}
return new String(chars);
}
The first part seems to be working as it is returning the correct amount of characters (7). But when it reads the characters they are all Chinese! I'm pretty sure the problem is with DataInputStream.readChar() but I have no idea why it isn't working... I have even tried using
Character.reverseBytes(readChar());
to read the char to see if that would work, but it would just return different Chinese characters.
Maybe I need to emulate .net's way of reading chars? How would I go about doing that?
Is there something else I'm missing?
Thanks.
Okay, so you've parsed the length correctly by the sounds of it - but you're then treating it as the length in characters. As far as I can tell from the documentation it's the length in bytes.
So you should read the data into a byte[] of the right length, and then use:
return new String(bytes, encoding);
where encoding is the appropriate encoding based on whatever was written from .NET... it will default to UTF-8, but it can be specified as something else.
As an aside, I personally wouldn't extend DataInputStream - I would compose it instead, i.e. make your type or method take a DataInputStream (or perhaps just take InputStream and wrap that in a DataInputStream). In general, if you favour composition over inheritance it can make code clearer and easier to maintain, in my experience.
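A minimal sketch of that suggestion, using composition rather than inheritance: the length prefix is decoded as a 7-bit encoded integer, that many bytes are read, and the bytes are decoded as a string. The class name DotNetStringReader and its methods are made up for illustration, and UTF-8 is assumed because it is the BinaryWriter default; a writer constructed with another encoding would need a different charset here.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class DotNetStringReader {
    // Decodes a 7-bit encoded int: each byte carries 7 payload bits (low bits first),
    // and the high bit signals that another byte follows.
    static int read7BitEncodedInt(DataInputStream in) throws IOException {
        int result = 0;
        int shift = 0;
        while (true) {
            int b = in.readUnsignedByte();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
            if (shift > 28) {
                throw new IOException("Malformed 7-bit encoded length");
            }
        }
    }

    // Reads a length-prefixed string; the prefix counts bytes, not characters.
    static String readString(DataInputStream in) throws IOException {
        int byteLength = read7BitEncodedInt(in);
        byte[] bytes = new byte[byteLength];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8); // assumed encoding
    }

    public static void main(String[] args) throws IOException {
        // Length prefix 5, followed by the UTF-8 bytes of "hello"
        byte[] data = {5, 'h', 'e', 'l', 'l', 'o'};
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        System.out.println(readString(in)); // prints "hello"
    }
}
```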
