I have a list of records containing country, city, district and building name information (more than 50,000 records), where the building name is unique for every record.
I want to search by building, district and city. But I also want to get a list of cities if I pass the country to a method, e.g. get(String country), or a list of districts if I pass country and city, e.g. get(String country, String city).
Is there an existing collection/library/data structure to do something like this? I am thinking of a tree-like structure / Map. I tried MultiKeyMap, but it does not return a list of values and it is not thread-safe. Also, I don't want to use a database for this.
Thanks in advance for your help.
Solr might do the job you are after:
Solr is the popular, blazing fast open source enterprise search
platform from the Apache Lucene project. Its major features include
powerful full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
distributed search and index replication, and it powers the search and
navigation features of many of the world's largest internet sites...
It should allow you to create queries which will in turn allow you to search through your records.
You can also interact with Solr through SolrJ:
Solrj is a java client to access solr. It offers a java interface to
add, update, and query the solr index.
You can use nested HashMaps, keyed by country, then city, then district, then building, e.g.
HashMap<Country, HashMap<City, HashMap<District, HashMap<Building, Value>>>>
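A minimal sketch of that idea, using ConcurrentHashMap since thread safety was a concern; the value type is just a String here for illustration:

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: country -> city -> district -> (building -> record).
public class LocationIndex {

    private final Map<String, Map<String, Map<String, Map<String, String>>>> index =
            new ConcurrentHashMap<>();

    public void add(String country, String city, String district, String building, String record) {
        index.computeIfAbsent(country, k -> new ConcurrentHashMap<>())
             .computeIfAbsent(city, k -> new ConcurrentHashMap<>())
             .computeIfAbsent(district, k -> new ConcurrentHashMap<>())
             .put(building, record);
    }

    // get("SomeCountry") -> list of cities in that country
    public List<String> get(String country) {
        return new ArrayList<>(index.getOrDefault(country, Collections.emptyMap()).keySet());
    }

    // get("SomeCountry", "SomeCity") -> list of districts in that city
    public List<String> get(String country, String city) {
        return new ArrayList<>(index.getOrDefault(country, Collections.emptyMap())
                                    .getOrDefault(city, Collections.emptyMap()).keySet());
    }
}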
You could take a look at Apache Commons Collections' CollectionUtils. It has a select method that does what you want.
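For illustration, a sketch of that filtering with commons-collections4 (the Record class and its getters are made up here; the older 3.x API is similar but not generic):

import java.util.Collection;
import org.apache.commons.collections4.CollectionUtils;

class RecordFilter {

    // Hypothetical record type, purely for illustration.
    static class Record {
        private final String country;
        private final String city;
        Record(String country, String city) { this.country = country; this.city = city; }
        String getCountry() { return country; }
        String getCity() { return city; }
    }

    // CollectionUtils.select copies the matching elements into a new collection.
    static Collection<Record> byCountry(Collection<Record> records, String country) {
        return CollectionUtils.select(records, r -> country.equals(r.getCountry()));
    }
}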
An off-beat approach might be to use .properties files: one per country that refers to a subset of locality .properties files, each of which in turn refers to city .properties files that point to the files containing the buildings.
Another might be a class hierarchy with an instantiated base class, e.g. GeographicLocation, whose constructor is fed an index to load an abstract Region layer (or a list of regions if none is indicated, via one of two overloaded methods), which in turn automatically loads the next abstract layer (city) on top of it.
Inside the GeographicLocation class, something like:
// pseudocode: each cast narrows the location to the next layer down
CountryMap country = (CountryMap) this;
RegionMap region = (RegionMap) country;
CityMap city = (CityMap) region;
// ...etc.
Why not simply use three hash maps (e.g. of type HashMap<String, List<Record>>): one keyed by building, one by city and one by district? Sure, you'll be using about three times as much memory, but 50,000 records really isn't that much. Furthermore, lookups will be really fast and simple. I'd recommend trying this and seeing how it performs.
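A rough sketch of that layout (the Record class and its fields here are assumptions, not from the original post):

import java.util.*;

// Three independent indexes over the same records; the overhead is roughly
// three extra references per record, which is cheap for 50,000 entries.
class RecordIndexes {

    // Hypothetical record type, only for illustration.
    static class Record {
        String building, city, district;
    }

    final Map<String, List<Record>> byBuilding = new HashMap<>();
    final Map<String, List<Record>> byCity = new HashMap<>();
    final Map<String, List<Record>> byDistrict = new HashMap<>();

    void add(Record r) {
        byBuilding.computeIfAbsent(r.building, k -> new ArrayList<>()).add(r);
        byCity.computeIfAbsent(r.city, k -> new ArrayList<>()).add(r);
        byDistrict.computeIfAbsent(r.district, k -> new ArrayList<>()).add(r);
    }

    List<Record> findByCity(String city) {
        return byCity.getOrDefault(city, Collections.emptyList());
    }
}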
Background
I'm using a NoSQL database supporting graphs for the first time. It is a huge medical application handling thousands of patients. It is a greenfield project, and as a team we are struggling with our persistence layer. We don't know how relationships should be represented, or whether we should use triples to handle queries involving large amounts of data. We are using the Java API.
Data structure
Imagine that there are 3 types of JSON documents in our MarkLogic database: Patient, Event, File Evidence.
There are thousands of patients in the application
One patient can have multiple events associated with it (admitted, discharged, transferred, prescribed medications, added note, changed internal status, etc.)
Each event can have multiple files attached to it as evidence
Assume there are hundreds of thousands of patients, events and files.
Question
Is it possible to query patients with events and files at once? Is using semantics (possible triples: 'patient has event', 'event has file') recommended in our case?
Our approach
We are trying to use triples to express the relationships between our documents, adding them to one graph and using a combination query to fetch the IRIs first, then fetching the documents by IRI in a second call. We tried the self-paced trainings and explored https://github.com/marklogic/marklogic-samplestack, but with no luck. Help from someone who has done this in the past and is willing to share their experience would be great.
In your situation, keep in mind that you can also store the triples in each of the documents themselves (with the inferred subject being the document itself). In your example, you could then combine cts:triple-range-query with a standard cts:search.
Example:
If I had events and embedded in each event a triple such as <this event> ownedByPatient <iri/for/patients#12345>
Then I could query:
search for events filtered by fragments where the cts:triple-range-query states that the events are owned by patient 12345
This approach is a combination of semantics and MarkLogic search - using triples to link the appropriate types.
As for different types of documents, triples do not care what they are pointing at: an IRI of a person, an event, etc. It's just about how you model your data and the ontology used to describe the relationships. So you could also approach this as managed triples (not embedded) and treat it all as a graph database pointing at your content (like the approach you are describing).
Once you get further along, you may also decide to force restrictions on the types of relationships using RDF rules.
You've given us very little information to work with to answer such broad questions. Nevertheless, I'll do my best with what you gave.
One option is organize the data however is most intuitive to you, and use server-side Javascript (SJS) to combine the documents at query time into whatever you need for a particular query. That SJS could be in the form of a resource extension or search response transform. A resource extension has the advantage that it could do multiple queries across different document types and piece them together to form an answer. A search response transform, on the other hand will be given the results of only one query but could do additional queries as needed to bring in more data. Since you only have hundreds of thousands of records, you may not need to stress too much about raw speed.
If you plan to scale to millions of documents and want raw speed, you could keep everything you want to query about one patient in the patient record. That would allow you to find a patient by full-text search through all their records plus field-match on patient-specific data.
That assumes the only search results you ever want are patients. If you want something else, you'll need to let us know what other search results you might want.
When you say "attachment" I think of binary documents with scanned images, no metadata, and no full-text to search. Those would obviously be stored as separate binary documents. If they have metadata or full-text, you'll have to decide whether any of that should be in the big patient record for fast queries or in separate documents. All "attachment" documents that are separate JSON files could have a field that points to the patient by id.
I'd avoid triples at first. As David Ennis pointed out, you can combine triples and search, but it's a bit of a ninja move. One big JSON document per patient is much easier for most developers to understand.
I am currently developing a Google AppEngine (GAE) application and I am struggling a bit with the GAE DataStore best practices. I would like to use the DataStore in the most efficient way. I am using the Objectify framework, but am flexible to use something else if there is a better alternative.
My application uses three objects/tables:
- Items (id, description)
- List (id, listId, listDescription)
- SecurityProfile (id, listId, username, accessType)
In a relational world, my Items and SecurityProfiles tables would have a foreign key linking them to a list (listId), and I would then use joins in my queries.
The typical Queries I need to make:
- Get all lists accessible to a particular user (need an index on "username" to filter by username and need to get the description from the List table)
- Get all items in list for a particular user (get the Items linked to the Lists retrieved in the query above)
I am struggling a bit to come up with a way to link the different objects in an efficient way (minimizing the DataStore queries and indexes).
I have seen in other posts that joins should be avoided and that I should de-normalize the model as much as possible.
So kind of creating one object only:
- Data (id, description, listId, listDescription, username, accessType)
I can see how that works from a read point of view, but if I update a listDescription, an accessType, or add a new username, I could potentially have to update a massive number of records. Is this really the way to go?
I'm only familiar with the Python NDB API, but things are similar in Java.
In Python NDB, I would recommend creating a Model for each of
User,
List,
List item
Then, you can reference them with repeated KeyProperties, e.g.
class SecurityProfiles(ndb.Model):
    accessibleLists = ndb.KeyProperty(repeated=True)

class List(ndb.Model):
    listItems = ndb.KeyProperty(repeated=True)
Like this, you can pull a user's profile from the DataStore, and with the keys stored in accessibleLists you can get the lists accessible to the user.
Alternatively, you could do it the other way around:
class List(ndb.Model):
    usersWithAccess = ndb.KeyProperty(repeated=True)
and then you could immediately query for lists that are accessible to a given user.
I am currently working with a Java based web application (JSF) backed by Hibernate that has a variety of different search pages for different areas.
A search page contains a search fields section, where a user can customize the search fields they are interested in. There is a range of different search field types that can be added (exact text, starts with, contains, multi-select list boxes, comma-separated values, and many more). Search fields that are not filled in are ignored, whereas some search fields require another search field to have a value in order to work.
We currently use a custom search object per area, specific to that area, with hard-coded getters and setters for its search fields.
public interface Search {
SearchFieldType getSearchPropertyOne();
void setSearchPropertyOne(SearchFieldType searchPropertyOne);
AnotherSearchFieldType getSearchPropertyTwo();
void setSearchPropertyTwo(AnotherSearchFieldType searchPropertyTwo);
...
}
In this example, SearchFieldType and AnotherSearchFieldType represent different search field types, like a TextSearchField or a NumericSearchField, which have a search type (Starts With, Contains, etc., or Greater Than, Equals, Less Than, etc. respectively) and a search value that the user can enter or leave empty (to ignore the search field).
We use this search object to prepare a Criteria object.
The search results section is a table that can also be customized by the user to contain only columns of the result object that they are interested in. Most columns can be ordered ascending or descending.
We back each result with a Result object, which also hard-codes the columns that can be displayed. This table is backed by Hibernate annotations, but we are trying to use flat data instead of allowing other Hibernate-backed objects, to minimize lazily joined data.
@Entity
@Table(name = "result_view")
public class Result {

    @Column(name = "result_field_one")
    private Long resultFieldOne;

    @Column(name = "result_field_two")
    private String resultFieldTwo;

    // getters, setters and the remaining columns...
}
The search page is backed by a view in our database that handles the joins to all the tables needed for every possible outcome. This view has gotten pretty massive, and we take a huge performance hit for every search, even when a user only wants to search on one field and display a few columns, because we have upwards of thirty search field options and thirty different columns they can display, all backed by that one view.
On top of this, users request new search fields and columns all the time that they would like added to the page. We end up having to alter the search and result objects as well as the backing view to make these changes.
We are trying to look into this and find alternatives. One approach mentioned was to create different views and dynamically choose one based on the fields searched on or displayed in the results table. The different views might join different columns, and we would pick whichever view we need for any given search.
I'm trying to think about the problem a different way. I think it might be better not to use a view and instead dynamically join the tables we need based on which search fields and result columns are requested. I also feel that the search and result objects should not contain hard-coded getters/setters and should instead be a collection of search fields and a collection (or map) of result columns. I have yet to completely flesh out my idea.
Would Hibernate still be a valid solution to this issue? I wouldn't want to have to create a Result object used in a Hibernate Criteria, since the result columns can be different. Both search fields and/or result columns might require joining tables.
Is there a framework I could use that might help solve the problem? I've been trying to look for something, and the closest thing I have found is SqlBuilder.
Has anyone else solved a similar problem dynamically?
I would prefer not to reinvent the wheel if a solution already exists.
I apologize that this ended up as a wall of text. This is my first stackoverflow post, and I wanted to make sure I thoroughly defined my problem.
Thanks in advance for your answers!
I don't fully understand the problem, but the JPA Criteria API seems very flexible and can be used to build a query from user-submitted filtering conditions.
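For illustration only, a minimal sketch of building predicates dynamically from a map of user-supplied filters, assuming the Result entity above (or any mapped entity) and String-valued properties; other operators would map to cb.equal, cb.greaterThan, and so on:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.persistence.EntityManager;
import javax.persistence.criteria.*;

public class DynamicSearch {

    // Adds one LIKE predicate per user-supplied (property name -> search text) entry.
    public List<Result> search(EntityManager em, Map<String, String> filters) {
        CriteriaBuilder cb = em.getCriteriaBuilder();
        CriteriaQuery<Result> cq = cb.createQuery(Result.class);
        Root<Result> root = cq.from(Result.class);

        List<Predicate> predicates = new ArrayList<>();
        for (Map.Entry<String, String> filter : filters.entrySet()) {
            // "contains" semantics for this sketch.
            predicates.add(cb.like(root.get(filter.getKey()), "%" + filter.getValue() + "%"));
        }
        cq.select(root).where(predicates.toArray(new Predicate[0]));
        return em.createQuery(cq).getResultList();
    }
}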
I'm working on a web app for a class. It's basically a project management system, similar to a watered down version of Bugzilla, but specifically tailored for an academic environment. One of the requirements is that for a number of settings (such as project type which could be master's project, PhD thesis, etc.) the lists of possible values be configurable. So there would be a configuration or settings page where you could change the values in each list, but then in the rest of the app (like when creating a project or task) the values in the list will be the only options to choose from. Also if you change one of the values (say from master's paper to master's thesis) all the records which use that value should have it changed, too. So all projects marked as master's paper would now be marked as master's thesis.
I'm using an HSQLDB to store data and the app is written all in Java (JDBC, JavaServlets, JSP).
I'm having a hard time figuring out how to deal with this requirement from a design perspective. First, how do I store these lists in the database? Would each list be its own table? Having each list be a column in one table seems wrong (wouldn't that violate normalization rules?). I'm not super familiar with database design, but googling hasn't revealed a good solution to this.
Second, how do I treat these lists in my code? I've been thinking of using static variables (Collections of some sort) in the associated classes, because these settings are meant to be global, not specific to one user or project. That's generally not considered good design though.
Any recommendations would be greatly appreciated. I want to get the design correct not only because this is a software engineering class so design is important, but also because I may end up expanding this project into a master's project.
This is standard normalization.
Create a list table:
mylist
---------
option_id
option_name
Then associate it with the other table as appropriate:
my_other_table
--------------
attributes...
option_id
The UI for setting values in my_other_table queries mylist for the values that should go into the combo box (or whatever UI component you choose).
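A rough JDBC sketch of that lookup, using the table and column names above and assuming a DataSource configured elsewhere:

import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;

public class ListOptionDao {

    private final DataSource dataSource;

    public ListOptionDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Loads (option_id, option_name) pairs to populate the combo box.
    public List<String[]> loadOptions() throws SQLException {
        List<String[]> options = new ArrayList<>();
        String sql = "SELECT option_id, option_name FROM mylist ORDER BY option_name";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                options.add(new String[] { rs.getString("option_id"), rs.getString("option_name") });
            }
        }
        return options;
    }
}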
Each "enum" should be stored in its own table, so that you can have foreign keys to this table.
You could store all the possible values of each "enum" in a cache, to avoid going to the database each time you need the list of options, but be careful not to serve stale data. Since the number of entries should be very small, I wouldn't worry much about performance until you have a real problem.
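A rough sketch of such a cache, assuming a loader function that reads the options from the database (all names here are illustrative):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Caches the option list per "enum" name. invalidate() should be called whenever
// the settings page changes a list, so readers do not see stale data.
public class EnumCache {

    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> loader; // e.g. a DAO lookup

    public EnumCache(Function<String, List<String>> loader) {
        this.loader = loader;
    }

    public List<String> options(String enumName) {
        return cache.computeIfAbsent(enumName, loader);
    }

    public void invalidate(String enumName) {
        cache.remove(enumName);
    }
}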
In my company we have a table Dictionary(class, field, value, description), and for each class and field we have as many rows as there are allowed values; it works quite well.
I have a requirement to implement a contact database. This contact database is special in that the user should be able to add, at runtime, the properties he/she wants to track about a contact. Some of these properties are strings, others numbers and dates. Some of the properties have pre-defined values, others are free fields, etc. The user also wants to be able to query such a structure quickly and easily. The database needs to handle 500,000 contacts easily, each having around 10 properties.
This leads to a dynamic property model with a Contact class holding dynamic properties.
class Contact {
    private Map<DynamicProperty, Collection<DynamicValue>> propertiesAndValues;
    // other useful methods
}
The question is how I can store such a structure in "some database" (it does not have to be an RDBMS) so that I can easily express queries such as:
Get all contacts whose name starts with Martin, who are from a company of size 5000 or less, ordered by the time the contact was inserted into the database, first 100 results only (provide pagination), where each of these segments corresponds to a dynamic property.
I need:
filtering - equal, partial match (and greater than / less than for integers and dates), and maybe aggregation - but that is not necessary at this point
sorting
pagination
I was considering an RDBMS, but that leads more or less to the following structure, which is quite hard to query and tends to be slow for this amount of data:
contact(id serial pk,....);
dynamic_property(dp_id serial pk, ...);
--only one of the values is not empty
dynamic_property_value(dpv_id serial pk, dynamic_property_fk int, value_integer int, date_value timestamp, text_value text);
contact_properties(pav_id serial pk, contact_id_fk int, dynamic_propert_fk int);
property_and_its_value(pav_id_fk int, dpv_id int);
I am considering the following options:
Store contacts in an RDBMS and use Lucene for querying - is there anything that would help with this?
Store the dynamic properties as XML in the RDBMS and use its XPath support - unfortunately this seems to be pretty slow for 500,000 contacts
Use another database - MongoDB or Jackrabbit - to store this information
Which way would you go and why?
Wikipedia has a great entry on Entity-Attribute-Value modeling, a data modeling technique for representing entities with arbitrary properties. It's typically used for clinical data, but might apply to your situation as well.
Have you considered using Lucene for your querying needs? You could probably get away with using just Lucene and storing all your data in the index, although I wouldn't recommend using Lucene as your only persistence store.
Alternatively, you could use Lucene along with an RDBMS and take advantage of something like Compass.
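A hedged sketch of what the Lucene route might look like: store the RDBMS id plus each dynamic property as a field, then query the index and fetch the full contacts from the database by id. The field names are illustrative, and the exact field classes vary between Lucene versions:

import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class ContactIndexer {

    // Index one contact: each dynamic property becomes a field named after it,
    // so "name", "companySize", etc. are searchable without schema changes.
    public void index(IndexWriter writer, String contactId, Map<String, String> properties)
            throws Exception {
        Document doc = new Document();
        doc.add(new StringField("id", contactId, Field.Store.YES));
        for (Map.Entry<String, String> p : properties.entrySet()) {
            doc.add(new TextField(p.getKey(), p.getValue(), Field.Store.NO));
        }
        writer.addDocument(doc);
    }

    // Example query: contacts whose "name" starts with Martin, first 100 hits
    // (assumes terms were lowercased by the analyzer at index time).
    public TopDocs search(IndexSearcher searcher) throws Exception {
        Query q = new PrefixQuery(new Term("name", "martin"));
        return searcher.search(q, 100);
    }
}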
You could try other kinds of databases, such as CouchDB, which is a document-oriented DB and is distributed.
If you want a dumb solution: for your contacts table you could add some 50 columns like STRING_COLUMN1, STRING_COLUMN2 ... up to STRING_COLUMN10, and DATE_COLUMN1 .. DATE_COLUMN10. You have another DESCRIPTION column. So if a row has a name, which is a string, then STRING_COLUMN1 stores the value of the name and the DESCRIPTION column value would be "STRING_COLUMN1-NAME". In this case querying can be a bit tricky. I know many purists laugh at this, but I have seen a similar requirement solved this way in one of the apps :)