Alphanumeric sorting in Gremlin - java

I have a database and am trying to find a safe way to read this data into an XML file. Reading the vertices is not the problem, but with the edges the app eventually runs into a HeapSpace exception.
g.V().has(Timestamp)
.order()
.by(Timestamp, Order.asc)
.range(startIndex, endIndex)
.outE()
Data to be taken, e.g. label, id, properties.
I use the timestamp of the vertices as a reference for sorting to avoid capturing duplicates.
But because some vertices account for by far the largest share of all outEdges, the program runs out of heap. My question is: can I use the alphanumeric ID (UUID) of the edges to sort them safely, or is there another, better way to reach the goal?
Something like this:
g.E().order().by(T.id, Order.asc).range(startIndex, endIndex)

Ordering the edges by T.id would be a fine choice of property; however, the problem is not related to the property chosen, it is related to the sheer number of edges being exported. Neptune, like other databases, has to retrieve all the edges before it can order them, and retrieving all of those edges is why you are running out of heap. To solve this you can either increase the instance size to get additional memory for the query, or export the data differently. If you take a look at this utility https://github.com/awslabs/amazon-neptune-tools/tree/master/neptune-export you can see the current recommended best practice. It exports the entire graph as a CSV file. In your use case you may be able to use this utility and then transform the CSV into an XML document, or modify its code to export as an XML document directly.
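For illustration, here is a minimal sketch of the vertex-paged export loop described in the question, written against the TinkerPop Java driver. The endpoint, the "Timestamp" property key, and the page size are assumptions, and note that a vertex with a very large number of outgoing edges will still pull all of those edges into a single page, which is exactly the heap problem described above.
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.Order;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Edge;
import java.util.List;
import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal;

public class EdgeExportSketch {
    public static void main(String[] args) throws Exception {
        GraphTraversalSource g = traversal().withRemote(
                DriverRemoteConnection.using("localhost", 8182, "g")); // assumed endpoint
        int pageSize = 1000;                                           // tune to available memory
        for (long start = 0; ; start += pageSize) {
            // Page through the vertices by timestamp, as in the question.
            long verticesInPage = g.V().has("Timestamp")
                    .order().by("Timestamp", Order.asc)
                    .range(start, start + pageSize)
                    .count().next();
            if (verticesInPage == 0) break;
            List<Edge> edges = g.V().has("Timestamp")
                    .order().by("Timestamp", Order.asc)
                    .range(start, start + pageSize)
                    .outE()
                    .toList();
            for (Edge e : edges) {
                // Write label and id to XML here; over a remote connection, edge
                // properties need an explicit elementMap()/valueMap() step.
                System.out.println(e.label() + " " + e.id());
            }
        }
        g.close();
    }
}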

Related

How do I achieve the task of distributing my index table over 3 systems?

I want to achieve something like this
Given a document, say a txt file with an id, I need to process it, do stemming on the words, and generate an index table out of it. But this index table is distributed over 3 systems, probably on the basis that words beginning with letters [a-h] are indexed on the 1st system, the next third on the 2nd, and the last third on the 3rd. But I have no idea what technology I should use to achieve this. The index table data structure ought to be in RAM so that search queries can be answered quickly (supposing we are able to index it this way and have a user searching for a word or sentence from a different system). Can this purpose be fulfilled by the use of Java sockets?
Actually we (a group of 5) are trying to make a small but distributed search engine. Supposing the crawling has been done and the page (the document I was talking about) is saved somewhere, I extract it, do the processing, stemming, etc., and would finally like to build a distributed index data structure based on the scheme mentioned above. Would it be possible? I just want to know what technology to use to achieve something like this, like modifying a data structure inside some program running on another machine (but in the same network).
Secondly, since we don't actually know whether this approach is feasible, if that's the case I would be keen to know the correct way to look at a distributed index table.
Have the index information saved as you crawl the documents. Have a head node which presents the search user interface. The head node then distributes the search to the index nodes, and collects the results to present to the user.
There are a number of available frameworks, such as MapReduce, which will help you solve this problem.
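As a rough illustration of the letter-based partitioning described in the question, a word-to-node router could look like the sketch below. The node names and letter ranges are assumptions; in a real system the head node would route entries over sockets or an RPC framework rather than print them.
import java.util.Locale;

public class IndexRouter {
    // Hypothetical index-node names; in practice these would be socket addresses or RPC stubs.
    private static final String[] NODES = {"node-1", "node-2", "node-3"};

    // Words starting with a-h go to node 0, i-p to node 1, everything else to node 2.
    static int nodeFor(String word) {
        char c = word.toLowerCase(Locale.ROOT).charAt(0);
        if (c >= 'a' && c <= 'h') return 0;
        if (c >= 'i' && c <= 'p') return 1;
        return 2;
    }

    public static void main(String[] args) {
        for (String w : new String[]{"apple", "jaguar", "zebra"}) {
            System.out.println(w + " -> " + NODES[nodeFor(w)]);
        }
    }
}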

Which data structure should I use to search a string from CSV?

I have a csv file with nearly 200000 rows containing two columns: name and job. The user then inputs a name, say user_name, and I have to search the entire csv to find the names that contain the pattern user_name and finally print the output to screen. I have implemented this using an ArrayList in Java, where I put all the names from the csv into the ArrayList and then searched for the pattern in it. But in that case the overall time complexity for searching is O(n). Is there any other data structure in Java that I can use to perform the search in O(log n), or anything more efficient than an ArrayList? I can't use any database approach, by the way. Also, if there is a good data structure in any other language that I can use to accomplish my goal, then kindly suggest it.
Edit: The output should be the names in the csv that contain the pattern user_name as the last part. E.g. if my input is "son", then it should return "jackson", etc. What I have done so far is read the name column of the csv into a String ArrayList, then read each element of the ArrayList and use a regular expression (Java's pattern matcher) to see if the element has user_name as the last part. If yes, then print it. If I implement this in a multi-threaded environment, will it increase the scalability and performance of my program?
You can use a TreeMap; it is a sorted red-black tree.
If you are unable to use a commercial database then you are going to have to write code to mimic some of a database's functionality.
To search the entire dataset sequentially in O(n) time you just read it and search each line. If you write a program that loads the data into an in-memory Map, you could search the Map in amortized O(1) time but you'd still be loading it into memory each time, which is an O(n) operation, gaining you nothing.
So the next approach is to build a disk-based index of some kind that you can search efficiently without reading the entire file, and then use the index to tell you where the record you want is located. This would be O(log n), but now you are at significant complexity, building, maintaining and managing the disk-based index. This is what database systems are optimized to do.
If you had 200 MILLION rows, then the only feasible solution would be to use a database. For 200 THOUSAND rows, my recommendation is to just scan the file each time (i.e. use grep or if that's not available write a simple program to do something similar).
BTW, if your allusion to finding a "pattern" means you need to search for a regular expression, then you MUST scan the entire file every time since without knowing the pattern you cannot build an index.
In summary: use grep
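For reference, a minimal sketch of that scan-each-time approach; the file name, the column layout, and the endsWith rule (taken from the edit above) are assumptions:
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class NameScan {
    public static void main(String[] args) throws IOException {
        String userName = args.length > 0 ? args[0] : "son";    // pattern to match
        try (BufferedReader in = Files.newBufferedReader(Paths.get("people.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String name = line.split(",", 2)[0];            // first column is the name
                if (name.endsWith(userName)) {                  // "son" matches "jackson"
                    System.out.println(line);
                }
            }
        }
    }
}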

Parsing 20 GB input file to an ArrayList

I need to sort a 20 GB file (which consists of random numbers) in ascending order, but I am not sure which technique I should use. I tried to use an ArrayList in my Java program, but it runs out of memory. Increasing the heap size didn't work either; I guess 20 GB is just too big. Can anybody guide me on how I should proceed?
You should use an external sorting algorithm; do not try to fit this in memory.
http://en.wikipedia.org/wiki/External_sorting
If you think it is too complex, try this:
include H2 database in your project
create a new on-disk database (will be created automatically on first connection)
create some simple table where the numbers will be stored
read the data number by number and insert it into the database (do not forget to commit every 1000 numbers or so)
select the numbers with an ORDER BY clause :)
use a JDBC ResultSet to fetch the results on the fly and write them to an output file
The H2 database is simple, works very well with Java, and can be embedded in your JAR (it does not need any kind of installation or setup).
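A minimal sketch of those steps, assuming the input has one number per line and using placeholder file names input.txt and sorted.txt:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class H2SortSketch {
    public static void main(String[] args) throws Exception {
        // On-disk database; the file is created automatically on first connection.
        try (Connection c = DriverManager.getConnection("jdbc:h2:./numbers", "sa", "")) {
            c.setAutoCommit(false);
            try (Statement s = c.createStatement()) {
                s.execute("CREATE TABLE IF NOT EXISTS nums(n BIGINT)");
            }
            // Load the numbers, committing every 1000 rows.
            try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"));
                 PreparedStatement ins = c.prepareStatement("INSERT INTO nums VALUES (?)")) {
                String line;
                long count = 0;
                while ((line = in.readLine()) != null) {
                    ins.setLong(1, Long.parseLong(line.trim()));
                    ins.executeUpdate();
                    if (++count % 1000 == 0) c.commit();
                }
                c.commit();
            }
            // Let the database do the sorting and stream the result to the output file.
            try (Statement s = c.createStatement()) {
                s.setFetchSize(10_000);
                try (ResultSet rs = s.executeQuery("SELECT n FROM nums ORDER BY n");
                     BufferedWriter out = Files.newBufferedWriter(Paths.get("sorted.txt"))) {
                    while (rs.next()) {
                        out.write(Long.toString(rs.getLong(1)));
                        out.newLine();
                    }
                }
            }
        }
    }
}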
You don't need any special tools for this, really. This is a textbook case for external merge sort: you read in parts of the large file at a time (say 100M), sort them, and write the sorted results to external files. Read in another part, sort it, spit it back out, until there's nothing left to sort. Then you read the sorted chunks back in, a smaller piece of each at a time (say 10M), and merge them in memory. The tricky point is merging those sorted bits together in the right way. Read the external sorting page on Wikipedia as well, as already mentioned. Also, here's an implementation in Java that does this kind of external merge sorting.
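If you want to skip the database entirely, a bare-bones sketch of that external merge sort might look like this; one number per line is assumed, and the chunk size and file names are placeholders:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalMergeSort {
    static final int CHUNK_SIZE = 1_000_000;   // numbers per chunk; tune to available heap

    public static void main(String[] args) throws IOException {
        List<Path> chunks = splitAndSort(Paths.get("input.txt"));
        merge(chunks, Paths.get("sorted.txt"));
    }

    // Phase 1: read the input in chunks that fit in memory, sort each, write to temp files.
    static List<Path> splitAndSort(Path input) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<Long> buffer = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(Long.parseLong(line.trim()));
                if (buffer.size() == CHUNK_SIZE) chunks.add(writeSorted(buffer));
            }
            if (!buffer.isEmpty()) chunks.add(writeSorted(buffer));
        }
        return chunks;
    }

    static Path writeSorted(List<Long> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("chunk", ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(chunk)) {
            for (long n : buffer) { out.write(Long.toString(n)); out.newLine(); }
        }
        buffer.clear();
        return chunk;
    }

    // Phase 2: k-way merge of the sorted chunks using a priority queue of {value, chunkIndex}.
    static void merge(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<long[]> heap = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        List<BufferedReader> readers = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            BufferedReader r = Files.newBufferedReader(chunks.get(i));
            readers.add(r);
            String line = r.readLine();
            if (line != null) heap.add(new long[]{Long.parseLong(line.trim()), i});
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                long[] smallest = heap.poll();
                out.write(Long.toString(smallest[0]));
                out.newLine();
                String next = readers.get((int) smallest[1]).readLine();
                if (next != null) heap.add(new long[]{Long.parseLong(next.trim()), smallest[1]});
            }
        }
        for (BufferedReader r : readers) r.close();
    }
}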

Java big list approach to entry coupling

FILE:
I'm working with a refined csv version of a search-log file which contains 3.3 million lines of data, with each line representing a single query and containing various data about that query.
The entries in the file are sorted ascending by the session / userid.
GOAL:
Coupling entries that were submitted with the same query term and belong to the same userid.
APPROACH:
I'm reading the csv file line by line, saving the data in a self-made 'Entry' object and adding these objects to an ArrayList. When this is done, I sort the list by two criteria with a custom Comparator.
PROBLEM:
While reading the lines and adding the Entry objects to the list (which takes very long), the program terminates with an OutOfMemoryError ("Java heap space").
So it seems that my approach is too hard on memory (and runtime).
Any ideas for a better approach?
Your approach itself may be valid, and perhaps the simplest solution is to simply boost the memory available to the JVM.
The JVM will only allocate itself a maximum amount of system memory, and you can increase this value via the -Xmx command-line option. See here for more details.
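For example (the 4g value and the class name are placeholders, not part of the original question):
java -Xmx4g -cp . SearchLogCoupler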
Obviously this solution doesn't scale, and if (in the future) you want to read much bigger files, then you'll likely need a better solution to reading these files.
Instead of sorting the lines in memory, you could insert the parsed lines into a database with an index on the columns that define a duplicate.
Another approach would be to dispatch the lines into many files, each file being named, for example, after the first 2 chars of a SHA-1 of the concatenated columns that define a duplicate. That way you'd never have to read more than one file for your final operation, because all duplicates would be together.
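A minimal sketch of that dispatching step; the input file name, the column positions for userid and query term, and the buckets directory are assumptions, and a real version would keep the per-bucket writers open instead of reopening one for every line:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Bucketizer {
    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        Path outDir = Files.createDirectories(Paths.get("buckets"));
        try (BufferedReader in = Files.newBufferedReader(Paths.get("searchlog.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",");
                String key = cols[0] + "|" + cols[1];                    // userid + query term (assumed columns)
                byte[] hash = sha1.digest(key.getBytes(StandardCharsets.UTF_8));
                String bucket = String.format("%02x", hash[0] & 0xff);   // first 2 hex chars -> 256 buckets
                Path file = outDir.resolve(bucket + ".csv");
                try (BufferedWriter out = Files.newBufferedWriter(file,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}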

Java: Most efficient way to store/retrieve workout information from a file?

I'm working on a Java project for class that stores workout information in a flat file. Each file will have the information for one exercise (BenchPress.data) that holds the time (milliseconds since epoch), weight and repetitions.
Example:
1258355921365:245:12
1258355921365:245:10
1258355921365:245:8
What's the most efficient way to store and retrieve this data? It will be graphed and searched through to limit exercises to specific dates or date ranges.
One idea I had was to write the most recent information at the top of the file instead of appending it at the end. That way, when I start reading from the top, I'll have the most recent information, which will match most of the searches (an assumption).
There's no guarantee on the order of the dates, though. A user could enter exercises for today and then go in and enter last week's exercises, for whatever reason. Should I take the hit when saving and order all of the information by date?
Should I go in a completely different direction? I know a database would be ideal, but this is a group project, and managing a database installation and data sync amongst us all would not be ideal. The others have no experience with databases, and it'll make grading difficult.
So thanks for any advice or suggestions.
-John
Don't overcomplicate things. Unless you are dealing with millions of records, you can just read the whole thing into memory and sort it any way you like. And always add records at the end; this way you are less likely to damage your file.
For simple projects, using an embedded database like JavaDB / Apache Derby may be a good idea. Configuration for the DB is absolutely minimal, and in your case you may need a maximum of just 2 tables (User and Workout). Exporting data to a file is also fairly simple for syncing between team members.
As yu_sha pointed out though, unless you expect to have a large dataset (for something running on a PC, > 50000 records or so), you can just use the file and read everything into memory.
Read in every line via BufferedReader and parse with StringTokenizer. Looking at the data, I'd likely store an array of fields in a List that can be iterated and sorted according to your preference.
If you must store the file in this format, you're likely best off just reading the entire thing into memory at startup and storing it in a TreeMap or some other sorted, searchable map. Then you can use TreeMap's convenience methods such as ceilingKey or the similar floorKey to find matches near certain dates/times.
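A minimal sketch of that TreeMap approach, assuming the timestamp:weight:reps format from the question; the file name and the example range are placeholders:
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WorkoutLog {
    public static void main(String[] args) throws IOException {
        // timestamp -> all "weight:reps" sets recorded at that time (timestamps can repeat)
        TreeMap<Long, List<String>> log = new TreeMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("BenchPress.data"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(":");               // millis : weight : reps
                long when = Long.parseLong(parts[0]);
                log.computeIfAbsent(when, k -> new ArrayList<>())
                   .add(parts[1] + ":" + parts[2]);
            }
        }
        // All entries in a date range, regardless of the order they were written in the file.
        long from = 1258300000000L, to = 1258400000000L;        // example range (assumed)
        for (Map.Entry<Long, List<String>> e : log.subMap(from, true, to, true).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}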
Use flatworm, a Java library for parsing and creating flat files. Describe the format with a simple XML definition file, and there you go.
