Tree in Java to store words from a text

I have a text file where each line is a path of word strings, word1/word2/.../wordn, and I want to query the file. I need to build a tree which stores the words and each line of the file as a path, so that any time I search for a word, I get the word's node and all the paths that word belongs to. I was wondering if there is a built-in tree/graph library in Java, or a particular tree structure that would suit this problem. My basic idea is to construct a tree by reading the file line by line and adding the nodes and line-paths to it. Any ideas or suggestions?

I'd investigate storing the file in an XML Document and using XPath to search it. Xerces is a good place to start. Each part of a line (word1) would become a node, with the subsequent words (word2, ...) as its children.
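A rough sketch of that idea, assuming each word is a legal XML element name (the file name and element layout here are illustrative, not from the original post); note it does not merge lines sharing a prefix:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import java.io.*;

public class PathIndex {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("paths");
        doc.appendChild(root);
        try (BufferedReader br = new BufferedReader(new FileReader("paths.txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                Element parent = root;
                for (String word : line.split("/")) {
                    // assumes each word is a valid XML element name
                    Element child = doc.createElement(word);
                    parent.appendChild(child);
                    parent = child;
                }
            }
        }
        // Find every node named "word2", anywhere in the tree.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate("//word2", doc, XPathConstants.NODESET);
        System.out.println(hits.getLength() + " occurrence(s) of word2");
    }
}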

I would build a class that holds a word and the set of lines that contain that word.
While traversing the file's lines, keep a map (java.util.HashMap or java.util.TreeMap, depending on how you need to use it later) with the words (Strings) as keys and instances of the class above as values. For each word on a line, look it up in the map and add the line to its entry (or add a new entry if the word isn't there yet).
Looking up which lines a word occurs in is then a simple map lookup once you have scanned the file.
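A compact sketch of the class and map described (all names here are illustrative):

import java.util.*;

class WordEntry {
    final String word;
    final Set<String> lines = new LinkedHashSet<>(); // every line/path containing the word

    WordEntry(String word) { this.word = word; }
}

class WordIndex {
    private final Map<String, WordEntry> index = new HashMap<>();

    void addLine(String line) {
        for (String word : line.split("/")) {
            index.computeIfAbsent(word, WordEntry::new).lines.add(line);
        }
    }

    Set<String> pathsContaining(String word) {
        WordEntry e = index.get(word);
        return e == null ? Collections.emptySet() : e.lines;
    }
}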

What you have is not really a tree at all. I would use a Map<String, List<String>> to store the list of lines that contain each word. This uses O(n) memory and has fast lookup. Example code:
import java.util.*;
import java.io.*;

public class WordNodes {
    Map<String, List<String>> map = new HashMap<String, List<String>>();

    void readInputFile(String filename) throws IOException {
        BufferedReader bufferedReader = new BufferedReader(new FileReader(filename));
        try {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                for (String word : line.split("/")) {
                    List<String> list = map.get(word);
                    if (list == null) {
                        list = new ArrayList<String>();
                        map.put(word, list);
                    }
                    list.add(line);
                }
            }
        } finally {
            bufferedReader.close();
        }
    }

    void run() throws IOException {
        readInputFile("file.txt");
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in));
        try {
            String word;
            while ((word = bufferedReader.readLine()) != null) { // stop at end of input
                List<String> lines = map.get(word);
                if (lines == null) {
                    System.out.println("Word not found.");
                } else {
                    for (String line : lines) {
                        System.out.println(line);
                    }
                }
            }
        } finally {
            bufferedReader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        new WordNodes().run();
    }
}

My first thought is similar to Liedman's, but slightly different: rather than creating a new class for the lines, just use a Set<String> (HashSet<String>) or List<String> (ArrayList<String>) as the map's value type.
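For example, a minimal sketch of the same loop without the wrapper class (assuming the line/split handling from the code above):

Map<String, Set<String>> index = new HashMap<>();
for (String word : line.split("/")) {
    index.computeIfAbsent(word, k -> new HashSet<>()).add(line);
}
// index.get(word) now yields every line/path containing the word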

Related

How to read a big text file and work with it in Java

I have a large text file that I want to read. When I read it without doing anything else, reading takes at most a minute; but when I also add each line to an ArrayList so I can do some operations afterwards, it becomes far too slow. Do you know how I can read this data and still use it?
This is my code:
public class ReaderTEst {
    public static void main(String[] args) throws IOException {
        List<String> graphList = new ArrayList<>();
        List<String> edgeList = new ArrayList<>();
        FileInputStream inputStream = null;
        Scanner sc = null;
        try {
            inputStream = new FileInputStream("myText.txt");
            sc = new Scanner(inputStream, "UTF-8");
            while (sc.hasNextLine()) {
                String line = sc.nextLine();
                line = line.replace("\uFEFF", ""); // the file is UTF-8, so strip the BOM character
                if (Character.isWhitespace(line.charAt(0))) {
                    edgeList.add(line.trim());
                } else {
                    graphList.add(line);
                }
            }
            if (sc.ioException() != null) {
                throw sc.ioException();
            }
        } finally {
            if (inputStream != null) {
                inputStream.close();
            }
            if (sc != null) {
                sc.close();
            }
        }
    }
}
It takes too much time. Do you know how it could be faster? The text file is 600 MB.
When I change it to:

    List<Integer> graphList = new ArrayList<>(1);
    int i = 0;
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        line = line.replace("\uFEFF", ""); // the file is UTF-8, so strip the BOM character
        graphList.add(i++);
    }

it works, but when I put the Strings themselves into the list it takes too long.
You should use BufferedReader.readLine(). You can read millions of lines per second with that; Scanner is overkill for what you're doing.
BUT \uFEFF is not text. Is this really a text file? Is that a BOM marker? In that case it will only be at the beginning of the first line: there is no need to scan for it on every line.
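A minimal sketch of that approach (the file name and the list handling are placeholders, not from the original post): read with a BufferedReader and strip the BOM from the first line only.

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class FastLineReader {
    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream("myText.txt"), StandardCharsets.UTF_8))) {
            String line = br.readLine();
            if (line != null && line.startsWith("\uFEFF")) {
                line = line.substring(1); // a BOM can only appear at the very start of the file
            }
            while (line != null) {
                lines.add(line);
                line = br.readLine();
            }
        }
        System.out.println("Read " + lines.size() + " lines");
    }
}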
Your main issues are the following:

    List<String> graphList = new ArrayList<>();
    List<String> edgeList = new ArrayList<>();

You should initialize each List with an initial capacity so that the JVM does not need to repeatedly grow the backing array.

    line = line.replace("\uFEFF", "");

This will also slow down your program. How often does \uFEFF actually occur in a line? I would check whether the line contains \uFEFF before attempting to replace it.
Other than that, there's not much else to optimize; maybe you could use a FileChannel to read the file, but that's about it.
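A sketch of the loop with both changes applied (the initial capacities are guesses for a 600 MB file, not measured values):

// Pre-size the lists so the backing arrays are not repeatedly reallocated.
List<String> graphList = new ArrayList<>(8_000_000); // assumed capacity, tune to your data
List<String> edgeList = new ArrayList<>(8_000_000);
while (sc.hasNextLine()) {
    String line = sc.nextLine();
    if (line.contains("\uFEFF")) { // only pay for replace() when the BOM is present
        line = line.replace("\uFEFF", "");
    }
    if (!line.isEmpty() && Character.isWhitespace(line.charAt(0))) {
        edgeList.add(line.trim());
    } else {
        graphList.add(line);
    }
}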
First of all, I advise using the LinkedList implementation of List because of its architecture: while ArrayList is built on an array, LinkedList consists of nodes, and an ArrayList must allocate a new, bigger array and copy the old one into it whenever it reaches its capacity. Oracle has good documentation about this; I recommend the LinkedList and ArrayList pages.

Java: reading in a text file and outputting it to a new file with duplicates removed

I have a text file with an integer on each line, ordered from least to greatest, and I want to write them to a new text file with any duplicate numbers removed.
I've managed to read in the text file and print the numbers to the screen, but I'm unsure how to actually write them to a new file with the duplicates removed.
public static void main(String[] args) {
    try {
        FileReader fr = new FileReader("sample.txt");
        BufferedReader br = new BufferedReader(fr);
        String str;
        while ((str = br.readLine()) != null) {
            System.out.println(str + "\n");
        }
        br.close();
    } catch (IOException e) {
        System.out.println("File not found");
    }
}
When reading the file, you could add the numbers to a Set, which is a data structure that doesn't allow duplicate values (just Google "java collections" for more details).
Then you iterate through this Set, writing the numbers to a FileOutputStream (Google "java io" for more details).
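A minimal sketch of that idea ("sample.txt" is taken from the question; the output name "sample-out.txt" is an assumption):

import java.io.*;
import java.util.*;

public class RemoveDuplicates {
    public static void main(String[] args) throws IOException {
        // LinkedHashSet drops duplicates while keeping the (already sorted) file order.
        Set<String> numbers = new LinkedHashSet<>();
        try (BufferedReader br = new BufferedReader(new FileReader("sample.txt"))) {
            String str;
            while ((str = br.readLine()) != null) {
                numbers.add(str.trim());
            }
        }
        try (PrintWriter out = new PrintWriter(new FileOutputStream("sample-out.txt"))) {
            for (String n : numbers) {
                out.println(n);
            }
        }
    }
}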
Instead of printing each of the numbers, add them to an array. After you've added all the integers, you can cycle through the array and remove the duplicates (sample code for this can be found fairly easily).
Once you have the array, use a BufferedWriter to write it to an output file. Example code for how to do this can be found here: https://www.mkyong.com/java/how-to-write-to-file-in-java-bufferedwriter-example/
Alternatively, use a Set; BufferedWriter will still work in the same way.
Assuming the input file is already ordered:

public class Question42475459 {
    public static void main(final String[] args) throws IOException {
        final String inFile = "sample.txt";
        try (final Scanner scanner = new Scanner(
                new BufferedInputStream(new FileInputStream(inFile)), "UTF-8");
             BufferedWriter writer = new BufferedWriter(new FileWriter(inFile + ".out", false))) {
            String lastLine = null;
            while (scanner.hasNext()) {
                final String line = scanner.next();
                if (!line.equals(lastLine)) { // input is sorted, so duplicates are adjacent
                    writer.write(line);
                    writer.newLine();
                    lastLine = line;
                }
            }
        }
    }
}

Compare values in two files

I have two files which should contain the same values between substring 0 and 10, though not in the same order. I have managed to print out the values in each file, but I need to know how to report when a value is in the first file and not in the second file, and vice versa. The files are in these formats.
First file:
6436346346....Other details
9348734873....Other details
9349839829....Other details
Second file:
8484545487....Other details
9348734873....Other details
9349839829....Other details
The first record in the first file does not appear in the second file, and the first record in the second file does not appear in the first file. I need to be able to report this mismatch in this format:
Record 6436346346 is in the firstfile and not in the secondfile.
Record 8484545487 is in the secondfile and not in the firstfile.
Here is the code I currently have; it gives me the output from the two files to compare.
package compare.numbers;

import java.io.*;

/**
 * @author implvcb
 */
public class CompareNumbers {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // TODO code application logic here
        File f = new File("C:/Analysis/");
        String line;
        String line1;
        try {
            String firstfile = "C:/Analysis/RL001.TXT";
            FileInputStream fs = new FileInputStream(firstfile);
            BufferedReader br = new BufferedReader(new InputStreamReader(fs));
            while ((line = br.readLine()) != null) {
                String account = line.substring(0, 10);
                System.out.println(account);
            }
            String secondfile = "C:/Analysis/RL003.TXT";
            FileInputStream fs1 = new FileInputStream(secondfile);
            BufferedReader br1 = new BufferedReader(new InputStreamReader(fs1));
            while ((line1 = br1.readLine()) != null) {
                String account1 = line1.substring(0, 10);
                System.out.println(account1);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Please help with how I can effectively achieve this.
I should probably add that I am new to Java and may not grasp the ideas that easily, but I am trying.
Here is some sample code to do that:

public static void eliminateCommon(String file1, String file2) throws IOException {
    List<String> lines1 = readLines(file1);
    List<String> lines2 = readLines(file2);
    Iterator<String> linesItr = lines1.iterator();
    while (linesItr.hasNext()) {
        String checkLine = linesItr.next();
        if (lines2.contains(checkLine)) {
            linesItr.remove();
            lines2.remove(checkLine);
        }
    }
    // now lines1 contains the strings that are not present in lines2
    // and lines2 contains the strings that are not present in lines1
    System.out.println(lines1);
    System.out.println(lines2);
}

public static List<String> readLines(String fileName) throws IOException {
    List<String> lines = new ArrayList<String>();
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(fileName)));
    try {
        String line = null;
        while ((line = br.readLine()) != null) {
            String account = line.substring(0, 10);
            lines.add(account);
        }
    } finally {
        br.close();
    }
    return lines;
}
Perhaps you are looking for something like this (FileUtils is from Apache Commons IO):
Set<String> set1 = new HashSet<>(FileUtils.readLines(new File("C:/Analysis/RL001.TXT")));
Set<String> set2 = new HashSet<>(FileUtils.readLines(new File("C:/Analysis/RL003.TXT")));
Set<String> onlyInSet1 = new HashSet<>(set1);
onlyInSet1.removeAll(set2);
Set<String> onlyInSet2 = new HashSet<>(set2);
onlyInSet2.removeAll(set1);
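To produce the report format asked for in the question, you could then iterate over the two difference sets (a sketch; it assumes the sets hold full lines, as in the snippet above, so the account is the first 10 characters):

for (String record : onlyInSet1) {
    System.out.println("Record " + record.substring(0, 10)
            + " is in the firstfile and not in the secondfile.");
}
for (String record : onlyInSet2) {
    System.out.println("Record " + record.substring(0, 10)
            + " is in the secondfile and not in the firstfile.");
}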
If you can guarantee that the files will always be in the same format, and each readLine() call is going to return a different number, why not use an array of strings rather than a single string? You can then compare the outcomes with greater ease.
OK, first I would save the two sets of strings into collections:

    Set<String> s1 = new HashSet<String>(), s2 = new HashSet<String>();
    //...
    while ((line = br.readLine()) != null) {
        //...
        s1.add(line);
    }

Then you can compare those sets and find elements that do not appear in both sets. You can find some ideas on how to do that here.
If you need to know the line number as well, you could just create a String wrapper:

    class Element {
        public String str;
        public int lineNr;

        @Override
        public boolean equals(Object other) { // must override equals(Object), not overload it
            return other instanceof Element && ((Element) other).str.equals(str);
        }

        @Override
        public int hashCode() { // required for hash-based sets to find equal elements
            return str.hashCode();
        }
    }

Then you can just use Set<Element> instead.
Open two Scanners, and:

    // Long, because 10-digit account numbers can exceed Integer.MAX_VALUE
    final TreeSet<Long> ts1 = new TreeSet<Long>();
    final TreeSet<Long> ts2 = new TreeSet<Long>();
    while (scan1.hasNextLine() && scan2.hasNextLine()) {
        ts1.add(Long.valueOf(scan1.nextLine().substring(0, 10)));
        ts2.add(Long.valueOf(scan2.nextLine().substring(0, 10)));
    }

You can now compare the ordered results of the two trees.
EDIT
Modified to use TreeSet.
Put the values from each file into two separate HashSets accordingly.
Iterate over one of the HashSets and check whether each value exists in the other HashSet; report it if not.
Then iterate over the other HashSet and do the same thing.

Read file and get key=value without using java.util.Properties

I'm building an RMI game, and the client loads a file that has some keys and values which are going to be used by several different objects. It is a save-game file, but I can't use java.util.Properties for this (the specification rules it out). I have to read the entire file, ignoring commented lines and the keys that are not relevant in some classes. The properties are unique, but they may appear in any order. My current file looks like this:
# Bio
playerOrigin=Newlands
playerClass=Warlock
# Armor
playerHelmet=empty
playerUpperArmor=armor900
playerBottomArmor=armor457
playerBoots=boot109
etc
These properties are written and placed according to the player's progress, and the file reader has to read to the end of the file and pick up only the matching keys. I've tried different approaches, but so far nothing has come close to the results I would get using java.util.Properties. Any ideas?
This will read your "properties" file line by line, parse each input line, and place the values in a key/value map. Each key in the map is unique (duplicate keys are not allowed).
package samples;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.TreeMap;

public class ReadProperties {

    public static void main(String[] args) {
        try {
            TreeMap<String, String> map = getProperties("./sample.properties");
            System.out.println(map);
        } catch (IOException e) {
            // error using the file
        }
    }

    public static TreeMap<String, String> getProperties(String infile) throws IOException {
        final int lhs = 0;
        final int rhs = 1;
        TreeMap<String, String> map = new TreeMap<String, String>();
        BufferedReader bfr = new BufferedReader(new FileReader(new File(infile)));
        String line;
        while ((line = bfr.readLine()) != null) {
            if (!line.startsWith("#") && !line.isEmpty()) {
                String[] pair = line.trim().split("=");
                map.put(pair[lhs].trim(), pair[rhs].trim());
            }
        }
        bfr.close();
        return map;
    }
}
The output looks like:
{playerBoots=boot109, playerBottomArmor=armor457, playerClass=Warlock, playerHelmet=empty, playerOrigin=Newlands, playerUpperArmor=armor900}
You access each element of the map with map.get("key string");.
EDIT: this code doesn't check for a malformed or missing "=" string. You could add that yourself on the return from split by checking the size of the pair array.
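For example, a sketch of that check (using a split limit of 2, an assumption that also preserves any '=' inside the value):

String[] pair = line.trim().split("=", 2);
if (pair.length == 2 && !pair[0].trim().isEmpty()) {
    map.put(pair[0].trim(), pair[1].trim());
} // otherwise the line is malformed: no '=' or an empty key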
I'm currently unable to come up with a framework that would just provide that (I'm sure there are plenty, though), but you should be able to do it yourself.
Basically, you read the file line by line and check whether the first non-whitespace character is a hash (#) or whether the line is whitespace only. You ignore those lines and try to split the others on =. If such a split doesn't yield an array of exactly 2 strings, you have a malformed entry and handle it accordingly. Otherwise, the first array element is your key and the second is your value.
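A minimal sketch of that approach (the class and file names are placeholders):

import java.io.*;
import java.util.*;

public class SaveGameReader {
    public static Map<String, String> read(String filename) throws IOException {
        Map<String, String> props = new HashMap<>();
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                    continue; // skip blank lines and comments
                }
                String[] parts = trimmed.split("=", 2);
                if (parts.length != 2) {
                    throw new IOException("Malformed entry: " + line);
                }
                props.put(parts[0].trim(), parts[1].trim());
            }
        }
        return props;
    }
}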
Alternately, you could use a regular expression to get the key/value pairs.

    (?m)^(?!#)([\w]+)=([\w]+)$

will return capture groups for each key and its value, and will ignore comment lines (the (?!#) lookahead rejects comment lines without consuming the key's first character).
EDIT:
This can be made a bit simpler: since # is not a word character, a comment line can never match the key group anyway:

    (?m)^([\w]+)=([\w]+)$
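A quick sketch of applying that pattern with java.util.regex (the sample input is assumed from the question's file):

import java.util.regex.*;

public class RegexProps {
    public static void main(String[] args) {
        String input = "# Bio\nplayerOrigin=Newlands\nplayerClass=Warlock\n";
        Matcher m = Pattern.compile("(?m)^(?!#)([\\w]+)=([\\w]+)$").matcher(input);
        while (m.find()) {
            System.out.println(m.group(1) + " -> " + m.group(2)); // key -> value
        }
    }
}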
After some study I came up with this solution:

public static String[] getUserIdentification(File file) throws IOException {
    String[] key = new String[3];
    BufferedReader br = new BufferedReader(new FileReader(file));
    String lines;
    try {
        while ((lines = br.readLine()) != null) {
            String[] value = lines.split("=");
            if (lines.startsWith("domain=") && key[0] == null) {
                if (value.length <= 1) {
                    throw new IOException("Missing domain information");
                } else {
                    key[0] = value[1];
                }
            }
            if (lines.startsWith("user=") && key[1] == null) {
                if (value.length <= 1) {
                    throw new IOException("Missing user information");
                } else {
                    key[1] = value[1];
                }
            }
            if (lines.startsWith("password=") && key[2] == null) {
                if (value.length <= 1) {
                    throw new IOException("Missing password information");
                } else {
                    key[2] = value[1];
                }
            }
        }
        br.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return key;
}
I'm using this piece of code to check the properties. Of course it would be wiser to use the Properties library, but unfortunately I can't.
The shorter way to do that:

Properties properties = new Properties();
String confPath = "src/main/resources/.env";
try {
    properties.load(new FileInputStream(confPath));
} catch (IOException e) {
    e.printStackTrace();
}
String specificValueByKey = properties.getProperty("KEY");
Set<Object> allKeys = properties.keySet();
Collection<Object> values = properties.values();

Java + readLine with BufferedReader

I'm trying to read lines of text from a text file and put each line into a Map so that I can delete duplicate words (e.g. "test test") and print out the lines without the duplicate words. I must be doing something wrong, though, because I basically get just one line as my key, instead of each line being read one at a time. Any thoughts? Thanks.
public DeleteDup(File f) throws IOException {
    line = new HashMap<String, Integer>();
    try {
        BufferedReader in = new BufferedReader(new FileReader(f));
        Integer lineCount = 0;
        for (String s = null; (s = in.readLine()) != null;) {
            line.put(s, lineCount);
            lineCount++;
            System.out.println("s: " + s);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    this.deleteDuplicates(line);
}

private Map<String, Integer> line;
To be honest, your question isn't particularly clear - it's not obvious why you've got the lineCount, or what deleteDuplicates will do, or why you've named the line variable that way when it's not actually a line - it's a map from lines to the last line number on which that line appeared.
Unless you need the line numbers, I'd use a Set<String> instead.
However, all that aside, if you look at the keySet of line afterwards, it will be all the lines. That's assuming that the text file is genuinely in the default encoding for your system (which is what FileReader uses, unfortunately - I generally use InputStreamReader and specify the encoding explicitly).
If you could give us a short but complete program, the text file you're using as input, the expected output and the actual output, that would be helpful.
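A small sketch of the two suggestions above, a Set<String> in place of the map and an explicit encoding (the charset choice and method shape are assumptions):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class UniqueLines {
    public static Set<String> readUniqueLines(File f) throws IOException {
        Set<String> lines = new LinkedHashSet<>(); // keeps first-seen order, drops duplicates
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8))) {
            for (String s; (s = in.readLine()) != null;) {
                lines.add(s);
            }
        }
        return lines;
    }
}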
What I understood from your question is that you want to print the lines which do not have duplicate words in them.
Maybe you could try the following snippet for that.
public void deleteDup(File f) {
    try {
        BufferedReader in = new BufferedReader(new FileReader(f));
        Integer wordCount = 0;
        boolean isDuplicate = false;
        String[] arr = null;
        for (String line = null; (line = in.readLine()) != null;) {
            isDuplicate = false;
            wordCount = 0;
            wordMap.clear();
            arr = line.split("\\s+");
            for (String word : arr) {
                wordCount = wordMap.get(word);
                if (null == wordCount) {
                    wordCount = 1;
                } else {
                    wordCount++;
                    isDuplicate = true;
                    break;
                }
                wordMap.put(word, wordCount);
            }
            if (!isDuplicate) {
                lines.add(line);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

private Map<String, Integer> wordMap = new HashMap<String, Integer>();
private List<String> lines = new ArrayList<String>();
In this snippet, lines will contain the lines which do not have duplicate words in them.
It would have been easier to find your problem if we knew what

    this.deleteDuplicates(line);

tries to do. Maybe it is not clearing one of the data structures it uses; in that case, words seen on previous lines would be flagged on later lines even though they are not present there.
Your question is not very clear, but going through your code snippet, I think you are trying to remove duplicate words in each line.
The following code snippet might be helpful.
public class StackOverflow {
    public static void main(String[] args) throws IOException {
        List<Set<String>> unique = new ArrayList<Set<String>>();
        BufferedReader reader = new BufferedReader(
                new FileReader("C:\\temp\\testfile.txt"));
        String line = null;
        while ((line = reader.readLine()) != null) {
            String[] stringArr = line.split("\\s+");
            Set<String> strSet = new HashSet<String>();
            for (String tmpStr : stringArr) {
                strSet.add(tmpStr);
            }
            unique.add(strSet);
        }
    }
}
The only problem I see with your code is that DeleteDup doesn't have a return type specified.
Otherwise the code looks fine and reads from the file properly.
Please post the deleteDuplicates method code and the file used.
You are printing out every line read, not just the unique lines.
Your deleteDuplicates() method won't do anything, as there will never be any duplicate keys in the HashMap.
So it isn't at all clear what your actual problem is.
