Count the number of columns in MongoDB using Java

I have installed MongoDB on my system and inserted 10 documents into the username collection. Each document contains name, roll no, and city fields. I need to count the number of fields in the username collection.
I expect 3. How can I get the 3 from a Java program?

If you are using Java in your application, you are manipulating DBObject ( http://api.mongodb.org/java/2.6/com/mongodb/DBObject.html ), and you can get the key set from it; its size is the number of attributes. (In your case it will be 4, since you also have the _id attribute.)
But this is PER DOCUMENT. Remember that in a collection, each document can have its own "structure": in your case one user could have 4 attributes, another could have 10, and some of them could have subdocuments with their own structure. MongoDB does not have any "catalog".
Some systems "sample" the data to analyze the global structure of the documents and provide a catalog, but this will not be exact.
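The per-document counting described above can be sketched as follows. This is a minimal, hedged illustration: a plain Map stands in for a DBObject (whose keySet() behaves the same way), so the sketch runs without the MongoDB driver on the classpath; the field values are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldCounter {
    // Count the attributes of a single document. With the legacy driver the
    // document would be a DBObject; its keySet() works like a Map's, so a
    // plain Map stands in for it here.
    static int countFields(Map<String, Object> doc) {
        return doc.keySet().size();
    }

    public static void main(String[] args) {
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("_id", "507f1f77bcf86cd799439011"); // added automatically by MongoDB
        user.put("name", "alice");
        user.put("rollno", 42);
        user.put("city", "Chennai");
        // 4 fields: the three the user inserted plus _id
        System.out.println(countFields(user));
    }
}
```

With the real driver the equivalent call would be `dbObject.keySet().size()` on each document fetched from the collection; remember the count can differ from document to document.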

Try this:
System.out.println(coll.getCount()); // returns the number of documents in the collection, not the number of fields
More detail here: http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-java-driver/

Related

Which datastore to use when you have an unbounded (dynamic) number of fields/attributes for an entity?

I am designing a system where I have a fixed set of attributes (an entity) and then some dynamic attributes per client.
E.g. customer_name, customer_id etc. are common attributes,
whereas order_id, patient_number, date_of_joining etc. are dynamic attributes.
I have read about EAV being an anti-pattern. I wish to use a combination of MySQL and a NoSQL datastore for complex queries. I already use Elasticsearch.
I cannot let the mapping explode with an unlimited number of fields, so I have devised the following model:
MySQL:
customer, custom_attribute, custom_attribute_mapping, custom_attribute_value
Array of nested documents in Elasticsearch:
[{
  "field_id": 123,
  "field_type": "date",
  "value": "01/01/2020" // mapping type date - referred from MySQL table at time of inserting data
}, ...]
I cannot use flattened mappings in ES, as I wish to use range queries on the custom fields as well.
Is there a better way to do it? Or an obvious choice of another database that I am too naive to see?
If I need to modify the question to add more info, I'd welcome the feedback.
P.S.: I will have large data (on the order of tens of millions of records).
Why not use something like MongoDB as a pure NoSQL database?
Or, as a less popular solution, I would recommend triple stores such as Virtuoso or any other similar ones. You can then use SPARQL as a query language over them, and there are many drivers for such stores, e.g. Jena for Java.
Triple stores allow you to store data in the format <subject predicate object>,
where in your case the subject is the customer id, the predicates are the attributes, and the object is the value. All standard and dynamic attributes will be in the same table.
A triple store can be modeled as a 3-column table in any database management system.
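The 3-column modeling idea can be sketched in Java as follows. This is only an in-memory illustration under stated assumptions: the Triple record, the customer ids, and the attribute names are all hypothetical stand-ins for what would be rows in a triple store or a 3-column SQL table.

```java
import java.util.List;

public class TripleStoreSketch {
    // One row of the 3-column table: <subject predicate object>.
    record Triple(String subject, String predicate, String object) {}

    // Fetch all attribute/value pairs of one customer, fixed and dynamic alike.
    static List<Triple> attributesOf(List<Triple> store, String customerId) {
        return store.stream()
                .filter(t -> t.subject().equals(customerId))
                .toList();
    }

    public static void main(String[] args) {
        List<Triple> store = List.of(
                new Triple("cust-1", "customer_name", "Alice"), // common attribute
                new Triple("cust-1", "order_id", "A-99"),       // dynamic attribute
                new Triple("cust-2", "customer_name", "Bob"));
        System.out.println(attributesOf(store, "cust-1").size()); // rows for cust-1
    }
}
```

The point of the model is visible even in this toy version: adding a brand-new dynamic attribute is just another row, with no schema change.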

Elasticsearch index on all attributes?

I am new to Elasticsearch (ES) and have gone through basic tutorials like
this Mkyong tutorial.
I have a question on the create part for any document.
CREATE Operation Example
To insert a new document at /mkyong/posts/1001 with the following request data:
{
  "title": "Java 8 Optional In Depth",
  "category": "Java",
  "published_date": "23-FEB-2017",
  "author": "Rambabu Posa"
}
Question 1: Will ES create the inverted index on all attributes of the above document, i.e. title/category/published_date/author, by default and provide full-text search, or do I need to specify it explicitly?
Question 2: In the above example we already have a unique id, i.e. 1001. That's fine if I am already storing the document in a DB and generating the ID there. What if I need to generate the ID through the ES engine and do not have any DB?
Update:
Got the answer for question 1 from Specify which fields are indexed in ElasticSearch
Question 1: Yes, by default ES will index your field twice, as two separate types: once as "text" and once as "keyword" in a sub-field like "title.keyword". The "text" type runs through language analyzers to support the standard search case (remove stop words, stem words, etc.). The "keyword" type makes no changes and indexes the data exactly as it is, to support exact match and aggregations. You can explicitly give ES a mapping for any field, but if you don't, this is the default behavior.
Here is some information on the text vs keyword behavior:
https://www.elastic.co/blog/strings-are-dead-long-live-strings
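For reference, the default text-plus-keyword behavior described above is equivalent to declaring the field explicitly. A hedged sketch of such a mapping, assuming a recent Elasticsearch version without mapping types (the index name and field come from the example above):

```json
PUT /mkyong
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}
```

Queries can then target `title` for analyzed full-text search and `title.keyword` for exact match and aggregations.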
Question 2: ES will automatically create its own internal ID for every document you index, in a field called "_id". You can technically replace this with your own ID, but typically you don't want to do that, because it can make the hashing algorithm ES uses to spread out the data perform poorly. It is usually better to just add any IDs you would like as new fields in the document and let ES index them for you, ideally as the keyword type.
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-id-field.html

Parsing HTML and storing in a Java collection

I have a requirement where I have to parse an HTML page that contains multiple scorecard tables. The table structure stays the same, but since each table holds data for a different match, the tables contain different data, though the column names are the same.
Now I need to search by table column name and the data contained in it, passed as an argument pair. E.g. I have a column called playername, and multiple tables contain many player names. If I search for a particular player name by passing 2 arguments, playername (column name) and Jason, it should fetch all rows where the playername column has Jason as its data. I can pass another pair of arguments as an AND, matchesplayed (column name) and 15, and it should fetch all rows from the above result set where Jason played 15 matches.
Can you assist with how I can achieve this? The logic I tried is:
get the data for all columns into different ArrayLists, then create a map with the column names as keys and the ArrayLists containing each column's data as values. Is my approach correct, or do I need to solve it with a different approach?
Thanks for your help.
Let's bring some order; I'll use your example.
1) The first thing you have to do is search for rows where playername == Jason. Using jsoup or another HTML parser you can easily get access to the td that contains Jason. From there, you can easily access the parent tr and the table.
2) Using the table, you can access the first tr or th to identify the column names to use as keys. Then, using positional logic (first with first, second with second), you can work out which column name corresponds to which content (inside the td).
3) How to collect the data is up to you. A Map<String, String> can probably be a solution. Or, if the data is static, you can create a Player POJO and use the reflection API to fill it.
Give us more details and snippets of code and we can help you more.
You can use Jsoup to get the HTML document and then write a method that takes the player name values as input. This method should parse the <table> elements in the HTML document to get you what is needed. Parsing will be easy if you understand jQuery/CSS selectors.
Check this link for Jsoup selectors:
http://jsoup.org/apidocs/org/jsoup/select/Selector.html
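The column-map approach from the question, with a two-criteria AND filter on top, can be sketched like this. It is only an illustration: the table data is hardcoded to stand in for what Jsoup would extract from the tr/td elements, and the method and column names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ScorecardFilter {
    // table maps each column name to that column's values, one entry per row;
    // in the real program these lists would be filled while walking the
    // <tr>/<td> elements with Jsoup.
    static List<Integer> matchingRows(Map<String, List<String>> table,
                                      String col1, String val1,
                                      String col2, String val2) {
        List<Integer> hits = new ArrayList<>();
        int rowCount = table.get(col1).size();
        for (int r = 0; r < rowCount; r++) {
            if (table.get(col1).get(r).equals(val1)
                    && table.get(col2).get(r).equals(val2)) {
                hits.add(r); // row satisfies both column/value criteria
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = Map.of(
                "playername", List.of("Jason", "Mark", "Jason"),
                "matchesplayed", List.of("15", "15", "9"));
        // only the first row has playername=Jason AND matchesplayed=15
        System.out.println(matchingRows(table, "playername", "Jason",
                                        "matchesplayed", "15"));
    }
}
```

To search across several tables you would simply build one such map per parsed table and run the filter on each.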

How can I generate an item code value from an auto-increment integer plus an identification integer?

I'm creating an item code for my inventory system. I want a numbering system of integer values like this, using Java.
For example:
for group 1 the code would be 001:
0010001,
0010002
for group 2 the code would be 002:
0020003,
0020004
for group 3 the code would be 003:
0030005,
0030006
The items are encoded individually, so when I add a new entry it will detect which group the item belongs to and generate the desired item code. The first 3 digits are the identification value of the group it belongs to; the next 4 digits are just the increment value. The code would be stored as one integer in a MySQL database.
You need to decide:
Are the item codes to be represented as one integer, a pair of integers (group & item), a string, or something else?
Is the numbering scheme exactly as in your example? (You seem to have settled on one scheme now ...)
How are you going to populate the items and codes? Do you read the codes in? Do you generate them all in one go while loading items from a file? Do you create items and item ids one at a time (e.g. interactively)?
How is this information going to be "stored"? In memory only? In a flat file? In a database? (MySQL ... ?)
These decisions will largely dictate how you implement the item id "generation".
Basically, your problem here is that >>you<< need to figure out what the requirements are. Once you have done that, the set of possible solutions will shrink to a manageable size, and you can then either work it out for yourself or ask a focused question.
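One of the possible schemes, the one the question's example shows (a single running counter shared across groups, with a 3-digit zero-padded group prefix and a 4-digit counter), can be sketched like this. The class and method names are made up for the illustration, and in practice the counter would come from a MySQL AUTO_INCREMENT column rather than an in-memory field.

```java
public class ItemCodeGenerator {
    private int nextItem = 1; // stand-in for a MySQL AUTO_INCREMENT value

    // Build a 7-digit code: 3-digit group id followed by a 4-digit running number.
    String nextCode(int groupId) {
        return String.format("%03d%04d", groupId, nextItem++);
    }

    public static void main(String[] args) {
        ItemCodeGenerator gen = new ItemCodeGenerator();
        System.out.println(gen.nextCode(1)); // 0010001
        System.out.println(gen.nextCode(1)); // 0010002
        System.out.println(gen.nextCode(2)); // 0020003
    }
}
```

Note that the code is naturally a string here: if you store it as one integer in MySQL, the leading zeros of the group prefix are lost, which is exactly the representation decision raised above.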

Store documents based on sort order in a Lucene index

I have two fields (name, modifiedDate) in my index. I want to store new documents based on modifiedDate and keep the index sorted on modifiedDate:
doc #1 is the oldest document and its modifiedDate is the oldest too
doc #n is the most recent document and its modifiedDate is close to now
1) How can I create this index structure, where documents are physically stored based on modifiedDate, and keep that structure even after any change to the index (optimize, delete, update)?
2) The following structure lets me search for documents in a specific date range,
but I don't want to search the entire index and then filter. I want to use the following structure to skip all remaining documents once we go beyond the date range.
Current Lucene behavior:
for (1 to docCount)
    if (modifiedDate is in date range filter)
        calculate the score based on query
Accepted behavior:
for (1 to docCount)
    if (modifiedDate is greater than upper bound of date range)
        break
    else
        calculate the score based on query
If I have 3,000,000 documents and my date range only matches the top 20 documents, with the current Lucene behavior I need to check all of the documents, but with the accepted behavior I am only scoring the top 20 documents, and you can guess the huge performance gain.
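The control flow being asked for can be sketched with a plain loop over timestamps assumed to be stored in ascending modifiedDate order. This is only an illustration of the early-termination idea, not of Lucene's internals; the method and values are made up for the example.

```java
import java.util.List;

public class EarlyTermination {
    // Scan timestamps sorted ascending; return {documents examined, documents
    // matched}. The break makes the scan stop as soon as the upper bound of
    // the range is passed, instead of visiting every document.
    static int[] scan(List<Long> sortedModifiedDates, long lower, long upper) {
        int examined = 0, matched = 0;
        for (long date : sortedModifiedDates) {
            examined++;
            if (date > upper) break;      // everything after this is out of range too
            if (date >= lower) matched++; // in range: would be scored
        }
        return new int[] {examined, matched};
    }

    public static void main(String[] args) {
        List<Long> dates = List.of(10L, 20L, 30L, 40L, 50L, 60L);
        int[] result = scan(dates, 15L, 35L);
        System.out.println(result[0] + " examined, " + result[1] + " matched");
    }
}
```

The gain comes entirely from the sort invariant: the break is only correct because no later document can re-enter the range.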
The existing answers are fine, but Lucene 4.3.0 came out this year with a new SortingMergePolicy that allows advanced Lucene users to apply the algorithm suggested by the original poster and terminate a search early. See the javadocs.
Lucene will index and query efficiently on numeric fields; see NumericRangeQuery. The javadoc I linked to above has notes about the TrieRangeQuery implementation.
You can store modifiedDate as a NumericField containing the modified date as a long in ms. Then use a QueryWrapperFilter around a NumericRangeFilter to limit your search to the appropriate date range.
This should be very efficient.
