I have a CSV file with nearly 200,000 rows containing two columns: name and job. The user inputs a name, say user_name, and I have to search the entire CSV to find the names that contain the pattern user_name and finally print the output to the screen. I have implemented this using an ArrayList in Java: I load all the names from the CSV into the ArrayList and then search for the pattern in it. But in that case the overall time complexity of the search is O(n). Is there any other data structure in Java that I can use to perform the search in O(log n), or anything more efficient than an ArrayList? I can't use any database approach, by the way. Also, if there is a good data structure in any other language that I can use to accomplish my goal, kindly suggest it.
Edit: The output should be the names in the CSV that contain the pattern user_name as the last part. E.g. if my input is "son", it should return "jackson", etc. What I have done so far is read the name column of the CSV into a String ArrayList, then go through each element and use a regular expression (Java's Pattern/Matcher) to check whether the element has user_name as its last part; if yes, I print it. If I implement this in a multi-threaded environment, will it increase the scalability and performance of my program?
You can use a TreeMap; it is a sorted map backed by a red-black tree.
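For the suffix matching described in the question, one way to apply a sorted red-black-tree structure is to store each name reversed, so the suffix lookup becomes a prefix range query answered in O(log n + k). The reversal trick and the class below are only a sketch of that idea, not a drop-in solution:

import java.util.NavigableSet;
import java.util.TreeSet;

public class SuffixLookup {
    private final NavigableSet<String> reversed = new TreeSet<>(); // red-black tree

    public void add(String name) {
        reversed.add(new StringBuilder(name).reverse().toString());
    }

    /** Prints every stored name that ends with the given suffix. */
    public void printNamesEndingWith(String suffix) {
        String key = new StringBuilder(suffix).reverse().toString();
        // All reversed names starting with `key` correspond to names ending with `suffix`,
        // and they are contiguous in the sorted set.
        for (String r : reversed.tailSet(key, true)) {
            if (!r.startsWith(key)) break;   // left the matching range
            System.out.println(new StringBuilder(r).reverse().toString());
        }
    }

    public static void main(String[] args) {
        SuffixLookup lookup = new SuffixLookup();
        lookup.add("jackson");
        lookup.add("jack");
        lookup.add("mason");
        lookup.printNamesEndingWith("son");   // prints jackson and mason
    }
}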
If you are unable to use a commercial database then you are going to have to write code to mimic some of a database's functionality.
To search the entire dataset sequentially in O(n) time you just read it and search each line. If you write a program that loads the data into an in-memory Map, you could search the Map in amortized O(1) time but you'd still be loading it into memory each time, which is an O(n) operation, gaining you nothing.
So the next approach is to build a disk-based index of some kind that you can search efficiently without reading the entire file, and then use the index to tell you where the record you want is located. This would be O(log n), but now you are at significant complexity, building, maintaining and managing the disk-based index. This is what database systems are optimized to do.
If you had 200 MILLION rows, then the only feasible solution would be to use a database. For 200 THOUSAND rows, my recommendation is to just scan the file each time (i.e. use grep or if that's not available write a simple program to do something similar).
BTW, if your allusion to finding a "pattern" means you need to search for a regular expression, then you MUST scan the entire file every time since without knowing the pattern you cannot build an index.
In summary: use grep
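For reference, the "simple program" alternative to grep might look like the sketch below; the file name names.csv and the two-column name,job layout are assumptions taken from the question:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CsvSuffixScan {
    public static void main(String[] args) throws IOException {
        String userName = args[0];                       // pattern to match at the end of the name
        try (BufferedReader in = Files.newBufferedReader(Paths.get("names.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {     // single O(n) pass over the file
                String name = line.split(",", 2)[0];     // first column is the name
                if (name.endsWith(userName)) {
                    System.out.println(name);
                }
            }
        }
    }
}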
Related
I have a database and am trying to find a safe way to read this data into an XML file. Reading the vertices is not the problem, but with the edges the app runs into a HeapSpace exception after a while.
g.V().has(Timestamp)
.order()
.by(Timestamp, Order.asc)
.range(startIndex, endIndex)
.outE()
Data to be taken: e.g. label, id, properties.
I use the timestamp of the vertices as a reference for sorting, to avoid capturing duplicates.
But because some vertices account for the bulk of all outEdges, the program runs out of heap. My question is: can I use the alphanumeric ID (UUID) of the edges to sort them safely, or is there another or better way to reach the goal?
Something like this:
g.E().order().by(T.id, Order.asc).range(startIndex, endIndex)
Ordering the edges by T.id would be fine; however, the problem is not related to the property chosen, it is instead related to the sheer number of edges being exported. Neptune, like other databases, has to retrieve all the edges in order to sort them, and retrieving all these edges is why you are running out of heap. To solve this problem you can either increase the instance size to get additional memory for the query, or export the data differently. If you take a look at this utility https://github.com/awslabs/amazon-neptune-tools/tree/master/neptune-export you can see the current recommended best practice. This utility will export the entire graph as a CSV file. In your use case you may be able to either use this utility and then transform the CSV into an XML document, or modify the code to export as an XML document.
So I have a .txt file storing a genome (a long sequence of repeats made up of A, C, G, T), e.g.
TCGTGTTGAGAGGTATGAGACCTCTGGCAAGTACTTTGCCTACAAGATGGAGGAGAA.... (it contains millions of these repeating characters, stored in a separate file)
Now I want to write code to count the number of occurrences of the "ACGT" sequence motif in the complete genome. Can someone please help with this?
This is a simplification of the sequence alignment problem. There are multiple alignment tools already in existence that perform this kind of function, using data structures designed to reduce the time required to search for multiple sequences. If you want to run this kind of search for more than one query string, you should use one of these tools rather than performing a linear search in Java.
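That said, for the single-motif count the question actually asks about, a plain linear scan in Java is enough; here is a minimal sketch (the file name genome.txt is an assumption):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MotifCount {
    public static void main(String[] args) throws IOException {
        String genome = new String(Files.readAllBytes(Paths.get("genome.txt"))).trim();
        String motif = "ACGT";

        int count = 0;
        int from = 0;
        while ((from = genome.indexOf(motif, from)) != -1) {
            count++;
            from++;   // advance by one so overlapping occurrences would also be counted
        }
        System.out.println("Occurrences of " + motif + ": " + count);
    }
}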
I want to achieve something like this:
Given a document, say a txt file with an id, I need to process it, do stemming on the words, and generate an index table out of it. But this index table is distributed over 3 systems, probably on the criterion that words beginning with letters [a-h] are indexed on the 1st system, the next third on the 2nd, and the last third on the 3rd. But I have no idea what technology I should use to achieve this. The index table data structure ought to be in RAM so that search queries can be answered quickly (supposing we are able to index it this way and a user searches for a word or sentence from a different system). Can this purpose be fulfilled by using Java sockets?
Actually we (a group of 5) are trying to make a small but distributed search engine. Supposing the crawling has been done and the page (the document I was talking about) is saved somewhere, I extract it, do the processing, stemming, etc., and would finally like to build a distributed index data structure based on the scheme mentioned above. Would it be possible? I just want to know what technology to use to achieve something like this, i.e. modifying a data structure inside some program running on some other machine (but in the same network).
Secondly, since we don't actually know whether this approach is feasible, if it is not I would be keen to know the correct way to look at a distributed index table.
Have the index information saved as you crawl the documents. Have a head node which presents the search user interface. The head node then distributes the search to the index nodes, and collects the results to present to the user.
There are a number of available frameworks, such as MapReduce, which will help you solve this problem.
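As a rough illustration of the routing step, the head node could pick an index node from the first letter of each query term, following the partitioning scheme described in the question; the exact letter ranges below are assumptions:

public class ShardRouter {
    // Terms starting with a-h go to index node 0, i-p to node 1, everything else to node 2.
    static int shardFor(String term) {
        char c = Character.toLowerCase(term.charAt(0));
        if (c >= 'a' && c <= 'h') return 0;
        if (c >= 'i' && c <= 'p') return 1;
        return 2;
    }

    public static void main(String[] args) {
        for (String term : new String[]{"apple", "market", "zebra"}) {
            System.out.println(term + " -> index node " + shardFor(term));
        }
    }
}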
I have a 10-page txt file, in which I have to search for a series of 10,000 words to check how many of them exist in the file content.
I am using the Aho-Corasick algorithm for the search. However, to check whether each word exists, I have to add the 10,000 terms to the list; that is, I have to iterate 10,000 times to know whether each word exists. (10,000 can grow to n.)
The problems with the above approach are that CPU usage spikes because it has to loop 10,000 times, and the time it takes to complete the task is high.
I am looking for an alternative approach where I can give all 10,000 words at once (to avoid looping) and get the result for each word.
Is there a way to implement this, or is there any other alternative to Aho-Corasick search to achieve the above scenario?
Invert the search: create a Set of the words in the text, then looking up whether a term is in the source material is O(1). Still, as others have suggested, if you need to do more complex matching than simple term existence, I'd also recommend using Lucene.
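A minimal sketch of that inversion, assuming whitespace-separated words and hypothetical file names document.txt and terms.txt:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TermLookup {
    public static void main(String[] args) throws IOException {
        // Build a Set of every word that appears in the text (one-time O(text) cost).
        String text = new String(Files.readAllBytes(Paths.get("document.txt")));
        Set<String> wordsInText = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));

        // Each of the 10,000 terms is now an O(1) membership check.
        List<String> terms = Files.readAllLines(Paths.get("terms.txt"));
        long found = terms.stream()
                          .filter(t -> wordsInText.contains(t.toLowerCase()))
                          .count();
        System.out.println(found + " of " + terms.size() + " terms occur in the document");
    }
}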
I need to sort a 20 GB file (which consists of random numbers) in ascending order, but I don't understand what technique I should use. I tried to use an ArrayList in my Java program, but it runs out of memory. Increasing the heap size didn't work either; I guess 20 GB is too big. Can anybody guide me on how I should proceed?
You should use an external sorting algorithm; do not try to fit this in memory.
http://en.wikipedia.org/wiki/External_sorting
If you think it is too complex, try this (a rough code sketch follows these steps):
include H2 database in your project
create a new on-disk database (will be created automatically on first connection)
create some simple table where the numbers will be stored
read the data number-by-number and insert it into the database (do not forget to commit every 1000 numbers or so)
select the numbers with an ORDER BY clause :)
use the JDBC ResultSet to fetch the results on the fly and write them to an output file
H2 database is simple, works very well with Java and can be embedded in your JAR (does not need any kind of installation or setup).
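A rough sketch of those steps, assuming one number per line in the input file and a hypothetical database file name:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class H2Sort {
    public static void main(String[] args) throws Exception {
        // On-disk database; the file is created automatically on first connection.
        try (Connection con = DriverManager.getConnection("jdbc:h2:./numbers_db")) {
            con.setAutoCommit(false);
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS numbers(n BIGINT)");
            }

            // Insert number-by-number, committing every 1000 rows.
            try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
                 PreparedStatement ins = con.prepareStatement("INSERT INTO numbers(n) VALUES (?)")) {
                String line;
                int count = 0;
                while ((line = in.readLine()) != null) {
                    ins.setLong(1, Long.parseLong(line.trim()));
                    ins.executeUpdate();
                    if (++count % 1000 == 0) con.commit();
                }
                con.commit();
            }

            // Let the database do the sorting and stream the results to the output file.
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT n FROM numbers ORDER BY n");
                 PrintWriter out = new PrintWriter("sorted.txt")) {
                while (rs.next()) {
                    out.println(rs.getLong(1));
                }
            }
        }
    }
}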
You don't need any special tools for this, really. This is a textbook case for external merge sort: you read in parts of the large file at a time (say 100M), sort them, and write the sorted results to a temporary file; read in another part, sort it, write it out, until there's nothing left to sort. Then you read the sorted chunks back in, a smaller piece of each at a time (say 10M), and merge them; the tricky part is merging those sorted pieces together in the right way. Read the external sorting page on Wikipedia as well, as already mentioned. There are also existing Java implementations of this kind of external merge sort.
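Here is a minimal sketch of that two-phase approach; the chunk size, file names, and the assumption of one number per line are mine:

import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalMergeSort {
    static final int CHUNK_SIZE = 1_000_000;   // numbers per in-memory chunk (tune to available heap)

    public static void main(String[] args) throws IOException {
        List<Path> chunks = splitIntoSortedChunks(Paths.get("input.txt"));
        mergeChunks(chunks, Paths.get("sorted.txt"));
    }

    // Phase 1: read a chunk at a time, sort it in memory, write it to a temporary file.
    static List<Path> splitIntoSortedChunks(Path input) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<Long> buffer = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(Long.parseLong(line.trim()));
                if (buffer.size() == CHUNK_SIZE) chunks.add(writeSortedChunk(buffer));
            }
            if (!buffer.isEmpty()) chunks.add(writeSortedChunk(buffer));
        }
        return chunks;
    }

    static Path writeSortedChunk(List<Long> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("chunk", ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(chunk)) {
            for (long n : buffer) { out.write(Long.toString(n)); out.newLine(); }
        }
        buffer.clear();   // reuse the same buffer for the next chunk
        return chunk;
    }

    // Phase 2: k-way merge of the sorted chunks using a priority queue of the smallest unread values.
    static void mergeChunks(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<long[]> heap = new PriorityQueue<>(Comparator.comparingLong(a -> a[0])); // [value, readerIndex]
        List<BufferedReader> readers = new ArrayList<>();
        for (Path chunk : chunks) readers.add(Files.newBufferedReader(chunk));
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (int i = 0; i < readers.size(); i++) {
                String line = readers.get(i).readLine();
                if (line != null) heap.add(new long[]{Long.parseLong(line.trim()), i});
            }
            while (!heap.isEmpty()) {
                long[] smallest = heap.poll();
                out.write(Long.toString(smallest[0]));
                out.newLine();
                String next = readers.get((int) smallest[1]).readLine();
                if (next != null) heap.add(new long[]{Long.parseLong(next.trim()), smallest[1]});
            }
        } finally {
            for (BufferedReader r : readers) r.close();
        }
    }
}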