I've written a Java application that users install on their desktop. It crawls websites, storing the data about each page in a LinkedList. The application allows users to view all the pages crawled in a JTable.
This works great for small sites, but doesn't scale very well. Currently users have to allocate more memory (which translates to a -Xmx when starting Java) for larger crawls.
My current thinking is to move to storing all the data in a database, possibly using something like HSQLDB.
Are there any other approaches I should be considering?
A relational DB is not a good place to store web page data. Could you save the pages on disk? If you want to search the crawl results, try the Apache Lucene search engine. Loading all the results into memory at once is not reasonable. You can paginate the JTable model and use soft references to cache some results while paginating.
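A minimal sketch of the paginated model with a soft-reference cache, assuming rows come from some on-demand source. The `PageLoader` interface, the `PagedTableModel` class and the two-column row layout are hypothetical names invented for this illustration, not part of Swing or of the asker's application.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.swing.table.AbstractTableModel;

/** Hypothetical source of rows, e.g. files on disk or an embedded database. */
interface PageLoader {
    int totalRows();

    /** Returns up to 'count' rows starting at 'offset'; each row is {url, title}. */
    List<String[]> load(int offset, int count);
}

/** Table model that fetches rows one page at a time and caches pages softly. */
class PagedTableModel extends AbstractTableModel {
    private static final String[] COLUMNS = {"URL", "Title"};
    private final PageLoader loader;
    private final int pageSize;
    // Soft references let the garbage collector reclaim cached pages under memory pressure.
    private final Map<Integer, SoftReference<List<String[]>>> cache = new HashMap<>();

    PagedTableModel(PageLoader loader, int pageSize) {
        this.loader = loader;
        this.pageSize = pageSize;
    }

    @Override public int getRowCount() { return loader.totalRows(); }
    @Override public int getColumnCount() { return COLUMNS.length; }
    @Override public String getColumnName(int column) { return COLUMNS[column]; }

    @Override
    public Object getValueAt(int row, int column) {
        int pageIndex = row / pageSize;
        SoftReference<List<String[]>> ref = cache.get(pageIndex);
        List<String[]> page = (ref == null) ? null : ref.get();
        if (page == null) {
            // Cache miss, or the collector cleared the page: load it again.
            page = loader.load(pageIndex * pageSize, pageSize);
            cache.put(pageIndex, new SoftReference<>(page));
        }
        return page.get(row % pageSize)[column];
    }
}
```

A JTable built with `new JTable(new PagedTableModel(loader, 100))` only asks for the cells it paints, so scrolling pulls in pages as they become visible rather than the whole crawl.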
A relational database is probably the right approach for this case. Reasons:
It'll enable you to handle larger-than-memory crawls.
If you keep the link data in separate tables from the considerably larger volumes of page data, you may still be able to fit all your links in memory, which will be pretty important from a performance and searching perspective
It will give you an easy way of persisting crawled data (in case this is needed in the future)
It's pretty well known / standard technology
There are good open source database implementations available (H2 or JavaDB would probably be my first choices as they are embeddable and written in pure Java)
The relational features could turn out to be useful, for example queries on link data
It doesn't sound like you have the data volumes or availability requirements that might push you towards a NoSQL-type solution
You have basically 4 options:
Store the data in flat files
Store the data in a database
Somehow transmit the data to "the cloud" (I have no idea how)
Somehow "pare" the data down to the essentials, knowing that you can re-extract the full info when needed
You can also do a variant of 4 to gain some space -- rather than a "rich" object structure, compress each distinct datum into a single String or byte[] or some such that you keep in an array or arraylist vs a linked list. This can reduce your storage requirements by 2X or more. Less "object oriented", but sometimes reality intervenes.
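A rough sketch of that variant, assuming each page record has just three fields (URL, HTTP status, title). The `PackedPageStore` class and its field layout are made up for this example; the actual saving depends on what the real page objects hold.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Keeps crawled pages as packed byte[] records instead of rich objects. */
class PackedPageStore {
    // An ArrayList holds one reference per element; a LinkedList allocates a node object per element.
    private final List<byte[]> records = new ArrayList<>();

    /** Packs the (hypothetical) fields url, status and title into a single byte[]. */
    void add(String url, int status, String title) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(url);      // writeUTF is limited to 65535 encoded bytes per string
            out.writeShort(status);
            out.writeUTF(title);
            out.flush();
            records.add(buffer.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int size() {
        return records.size();
    }

    /** Unpacks record 'index' into {url, status, title} only when it is needed. */
    String[] get(int index) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(records.get(index)))) {
            String url = in.readUTF();
            int status = in.readUnsignedShort();
            String title = in.readUTF();
            return new String[] {url, Integer.toString(status), title};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off is CPU time on every access: each `get` decodes the record again, so this suits data that is written once and read occasionally.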
Try storing the page data in db4o http://community.versant.com , an object database. Object databases handle complex objects (e.g. with lots of siblings) better than relational databases.
Related
I'm currently working with a somewhat larger database, and though I have no specific issues, I would like some recommendations, if anyone has any.
The database is 2.2 gigabytes (after recreation/compacting). It contains about 50 tables. One of those tables contains a blob plus some metadata. It currently has about 22000 rows. If I remove the blobs from the table (UPDATE table SET blob = null), the database size is reduced to about 200 megabytes (after recreation/compacting). The metadata is accessed a lot; the blobs, however, are not needed that often.
The database URL I currently use is:
jdbc:h2:D:/data;AUTO_SERVER=true;MVCC=true;CACHE_SIZE=524288
It runs in our Java VM which has 4GB max heap.
Some things I was wondering:
Would running H2 in a separate process have any impact on performance (for better or for worse)?
Would it help to have the blobs in a separate table with a 1-1 relation to the metadata? I could imagine it would help with the caching, not having the blobs in the way?
The internet seems divided on whether to include blobs in a database or write them to files on a filesystem with a link in the DB. Any H2-specific advice here?
The answer for you depends on the growth rate of your blob data. If, for example, your data set is going to grow at 10% per week, then there is little point in trying to extend the use of H2 to store blob data (as it will quickly outpace the available heap memory). If instead the blob data is the biggest it will ever be, then attempting to use H2 might make sense.
To answer your questions about H2:
1) Running H2 in a separate process will allow H2 to claim the majority of the heap space, which makes controlling the heap available to H2 much more manageable. However, you'll also be adding the maintenance overhead of having a separate process to maintain and monitor. So the answer is "it depends on your operating environment and goals". If you have the people and time, running H2 in a separate process might make sense. But if that's true - then you should probably consider just running an appropriate blob storage platform instead.
2) Yes, you're correct that storing the blobs in a separate table would help with caching - in the case that you don't often need the blobs. It should also help with retrieval times, as H2 won't have to read past the blobs to find the metadata.
3) Note that "the internet" represents many thousands of people with almost as many different specific use cases. You'll need to filter down your use case into requirements, and then apply the logic you glean from others.
4) My personal advice is, if you're trying to make a scalable and maintainable platform - use the right tools. H2, or any other relational database, is most often not the right tool for storing many large blobs. I'd recommend that you investigate using a key/value store.
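As a low-tech illustration of the key/value idea, the sketch below keeps blobs as files on disk and hands back a key that could be stored in the metadata row. `FileBlobStore` and the choice of a SHA-256 hash as the key are assumptions made for this example; nothing here is an H2 feature, and a dedicated key/value store would add durability and concurrency guarantees this sketch lacks.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Minimal file-backed key/value store for blobs; the key goes into the metadata row. */
class FileBlobStore {
    private final Path root;

    FileBlobStore(Path root) {
        try {
            this.root = Files.createDirectories(root);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the blob to a file named after its SHA-256 hash and returns that key. */
    String put(byte[] blob) {
        String key = sha256Hex(blob);
        Path target = root.resolve(key);
        try {
            if (!Files.exists(target)) {
                Files.write(target, blob);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return key;
    }

    /** Reads a blob back only when it is actually needed. */
    byte[] get(String key) {
        try {
            return Files.readAllBytes(root.resolve(key));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }
}
```

With this split, the database only holds the metadata plus a 64-character key per row, so its cache is spent on the data that is accessed a lot.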
Extending this thread - I would just like to know why it's faster to retrieve files from a file system, rather than a MySQL database. If one were to benchmark the two to see which would retrieve the most data (multiple types of data) over 10 minutes - which one would win?
If a file system is truly faster, then why not just store everything in a file system and replace a database with csv or xml?
EDIT 1:
I found a good resource for alternate storage options for java
EDIT 2:
I'm looking for a Java API/Jar that has the functionality of a SQL Database Server Engine (or at least some of it) that uses XML for data storage (preferably). If you know of something, please leave a comment below.
At the end of the day the database does just store the data in the file system. It's all the useful stuff on top of just the raw data that makes you decide to use a database.
If you can replicate the functionality, scalability, robustness, integrity, etc, etc of a database system using CSV and still make it perform faster than a relational database then yes I'd suggest doing it your way.
It'd take you a few years to get there though.
Of course, relational systems are not the only way to store data. There are object-oriented database systems (db4o, InterSystems Cache) and document-based systems (RavenDB).
Performance is also relative to the style and volume of data you are working with and what you intend to do with it - I'm not going to even try and discuss that, it's too open ended.
I will also not start the follow on discussion: if memory is truly faster than the file system, why not just store everything in memory? :-)
This also seems similar to another question I answered a long while ago:
Is C# really slower than say C++?
Basically stuff isn't always done just for performance.
MySQL uses the file system the same as everything else on a computer. To retrieve a single piece of data, or a table of data, there is no faster way than directly from the file system. MySQL would just be a small bit of overhead added to that file system pull.
If you need to do some intelligent selecting, match some rows, or filter that data, MySQL is going to do that faster than most other options. The database server provides you calculation and data manipulation power that a filesystem can't.
When you have mixed/structured data, a DBMS is the only solution. For example, try to get the name, surname and country of all the customers stored in your DB, but only those born in 1981 and living in Rome. If you keep this data in files on the filesystem, how do you easily get only the required data without scanning all your files, and how do you join the returned data?
A DBMS gives you much more than that.
Many DBMS store data into files.
This abstraction layer lets you retrieve data in a very easy, standard and structured way.
The difference is in how the desired data is located.
In a file system, locating the desired data means searching through all existing data until you find it.
Databases provide indexing which results in locating the desired data almost immediately (within ~12 comparisons) regardless of the amount of data.
What we want is an indexed file system - lucky for us, we have them. They are called databases.
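To see where a figure of roughly a dozen comparisons can come from, the sketch below counts probes for a full scan versus a binary search over sorted keys. This is only a stand-in: real database indexes are usually B-trees or similar structures, and the `LookupCost` class is invented for this illustration.

```java
/** Counts key probes for a full scan versus a binary search over sorted keys. */
class LookupCost {
    /** Unindexed lookup: check each key in turn until the target is found. */
    static int linearComparisons(int[] keys, int target) {
        int comparisons = 0;
        for (int key : keys) {
            comparisons++;
            if (key == target) {
                break;
            }
        }
        return comparisons;
    }

    /** Index-style lookup: halve the sorted search range at every probe. */
    static int binaryComparisons(int[] sortedKeys, int target) {
        int comparisons = 0;
        int low = 0;
        int high = sortedKeys.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            comparisons++;
            if (sortedKeys[mid] == target) {
                break;
            } else if (sortedKeys[mid] < target) {
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return comparisons;
    }

    /** Builds the sorted keys 0, 1, ..., n - 1. */
    static int[] sortedKeys(int n) {
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) {
            keys[i] = i;
        }
        return keys;
    }
}
```

The probe count for the binary search grows with log2(n), so 4,096 keys need at most 13 probes and a million keys at most 20, while the scan can touch every key. The count does depend on the amount of data, but only very weakly.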
My use case is as follows --
I have a database table with 1000+ entries. This table is updated/edited infrequently, but I expect this to change in the future. Some of the columns in the table contain strings of considerable length.
Now I am in the process of writing a UI application that will have some mouseover events that will display texts derived from the aforementioned database table.
I have, for my use case, decided to write a backend 'server' that will host an in-memory database that will have all the data that was present in the aforementioned table. The UI app will now, on startup, cache the required data from the in-memory database present or hosted by the backend server.
Does my use case justify using an in-memory database ? If not, what are the alternatives I should consider ?
EDIT 1 --
My use case also involves running multiple searches of varying complexity on the database very frequently.
Thanks
p1ng
Seems like an excellent use-case for an in-memory database. Writing it yourself, on the other hand, is probably not the way to go.
There are plenty of existing options for just about any imaginable scenario: http://en.wikipedia.org/wiki/In-memory_database
If you're doing complex searches on text data, Lucene is quite excellent. It has special in-memory storage backends, but really, it doesn't matter for such a tiny dataset - it will always be quickly cached anyway.
I'm developing a Java application that has performance at its core.
I have a list of some 40,000 "final" objects,
i.e., I have an initialization input data of 40,000 vectors.
This data is unchanged throughout the program's run.
I am always performing lookups against a single ID property to retrieve the proper vectors.
Currently I am using a HashMap over a sub-sample of 1,000 vectors, but I'm not sure it will scale to production.
When is big actually big enough to justify using a DB?
One more thing, an SQLite DB is a viable option as no concurrency is involved,
so I guess the "threshold" for db use, is perhaps lower.
I think you're asking whether a HashMap with 40,000 entries in will be okay. The answer is yes - unless you really don't have enough memory, that should be absolutely fine. If you're writing a performance-sensitive app, then putting a large amount of fast memory in the machine running the app is likely to be an efficient way of boosting performance anyway.
There won't be very much overhead for each HashMap entry, so if you've got enough space to store the objects themselves in memory, it's unlikely that the overhead of the map would cause a problem.
Is there any reason why you can't just test this with a reasonable amount of data?
If you really have no more requirements than:
Read data at start-up
Put data in a map by a single ID (no need for joins, queries against different fields, substring matches etc)
Fetch data from map
... then using a full-blown database would be a huge amount of overkill, IMO.
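Those three steps fit in a few lines of plain Java. The sketch below assumes the vectors are `double[]` arrays keyed by an `int` ID; the `VectorTable` class and the parallel-array input are placeholders for whatever the real initialization data looks like.

```java
import java.util.HashMap;
import java.util.Map;

/** Read-only lookup table: built once at start-up, then queried by ID. */
class VectorTable {
    private final Map<Integer, double[]> byId;

    /** 'ids' and 'vectors' are parallel arrays standing in for the real input data. */
    VectorTable(int[] ids, double[][] vectors) {
        // Pre-size the map so it does not have to rehash while loading.
        Map<Integer, double[]> map = new HashMap<>((int) (ids.length / 0.75f) + 1);
        for (int i = 0; i < ids.length; i++) {
            map.put(ids[i], vectors[i]);
        }
        this.byId = map;
    }

    /** Returns the vector for 'id', or null if the ID is unknown. */
    double[] lookup(int id) {
        return byId.get(id);
    }

    int size() {
        return byId.size();
    }
}
```

Filling it with 40,000 generated entries and timing a few million lookups is a quick way to run the test suggested above before worrying about a database.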
As long as you're loading the data set into memory at the beginning of the program and keeping it there, and you don't have any complex queries, some sort of serialization/deserialization seems more feasible than a full-blown database.
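One possible shape for that, using the JDK's built-in object serialization. The `SnapshotFile` class and the `HashMap<Integer, double[]>` layout are assumptions for this sketch; any other serialization format would serve the same purpose.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

/** Saves and restores the whole data set with built-in Java serialization. */
class SnapshotFile {
    /** Writes the complete map to 'file' in one go. */
    static void save(HashMap<Integer, double[]> data, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the map back at start-up; only use this on files the application wrote itself. */
    @SuppressWarnings("unchecked")
    static HashMap<Integer, double[]> load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (HashMap<Integer, double[]>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("snapshot refers to an unknown class", e);
        }
    }
}
```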
You could start a DB with as few as 100 entries (or fewer). There is no general rule for when the amount of data is large enough to store in a database. It's more a question of whether you believe storing this data in a database will give you any benefit (performance boost, easier programming, more flexible options for your users).
When the benefits are greater than the cost of implementation put it in a database.
There is no set size for a Collection vs a Database. It highly depends on what you want to do with the data. Size is less important.
You can have a Map with a billion entries.
There's no such thing as 'big enough for a database'. The question is whether there are enough advantages in using a database to overcome the costs.
Having said that, 40,000 isn't 'big' ;-) Unless the objects are huge or you have complex query requirements I would start with an in-memory implementation. But if you expect to scale this number up over time it might be better to use the database from the beginning.
One option that you might want to consider is the Oracle Berkeley DB Java Edition library. It's a simple JAR file that can read/write data to persistent storage. Because of its small footprint and ease of use, it's used for applications running on small to very large data sets. It's designed to be linked into the application, so that it's embedded and doesn't require complex client/server installation or protocol stacks.
What's even better is that it's extremely scalable (which works well if you end up with larger data sets than you expect), is very fast, and supports both a Java Collections API and a Direct Persistence Layer API (POJO-like). So you can use it seamlessly with Java Collections.
Berkeley DB Java Edition was designed specifically with Java application developers in mind. It's designed to be simple to use, light weight in terms of resources required, but very fast, scalable and reliable.
You can find more information about Oracle Berkeley DB Java Edition here
Regards,
Dave
I'm working on a Java project that will allow users to parse multiple files with potentially thousands of lines. The information parsed will be stored in different objects, which will then be added to a collection.
Since the GUI won't need to load ALL these objects at once and keep them in memory, I'm looking for an efficient way to load/unload data from files, so that data is only loaded into the collection when a user requests it.
I'm just evaluating options right now. I've also thought about the case where, after loading a subset of the data into the collection and presenting it on the GUI, I need to reload previously viewed data. What is the best way to do that? Re-run the parser/populate collection/populate GUI? Or find a way to keep the collection in memory, or serialize/deserialize the collection itself?
I know that loading/unloading subsets of data can get tricky if some sort of data filtering is performed. Let's say that I filter on ID, so my new subset will contain data from two previously analyzed subsets. This would be no problem if I kept a master copy of the whole data in memory.
I've read that google-collections are good and efficient when handling big amounts of data, and offer methods that simplify lots of things so this might offer an alternative to allow me to keep the collection in memory. This is just general talking. The question on what collection to use is a separate and complex thing.
Do you know what's the general recommendation on this type of task? I'd like to hear what you've done with similar scenarios.
I can provide more specifics if needed.
You can embed a database into the application, like HSQLDB. That way you parse the files the first time and then use SQL to do simple and complex queries.
HSQLDB (HyperSQL DataBase) is the leading SQL relational database engine written in Java. It has a JDBC driver and supports nearly full ANSI-92 SQL (BNF tree format) plus many SQL:2008 enhancements. It offers a small, fast database engine which offers in-memory and disk-based tables and supports embedded and server modes. Additionally, it includes tools such as a command line SQL tool and GUI query tools.
If you have tons of data, lots of files, and you are short on memory, you can do an initial scan of the file to index it. If the file is divided into records by line feeds, and you know how to read the record, you could index your records by byte locations. Later, if you wanted to read a certain set of indices, you would do a fast lookup to find which byte ranges you need to read, and read those from the File's InputStream. When you don't need those items anymore, they will be GCed. You will never hold more items in the heap than you need.
This would be a simple solution. I'm sure you can find a library to provide you with more features.
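A minimal sketch of the byte-offset index, assuming UTF-8 text with one record per line. The `LineOffsetIndex` class is hypothetical, and it seeks with `RandomAccessFile` rather than a plain InputStream because that makes jumping to an offset straightforward.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Indexes newline-delimited records by byte offset and reads them on demand. */
class LineOffsetIndex {
    private final Path file;
    private final List<Long> offsets = new ArrayList<>(); // byte position where each record starts
    private final long fileLength;

    LineOffsetIndex(Path file) {
        this.file = file;
        // Initial scan: remember where each record starts, but keep no record data.
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long position = 0;
            boolean atRecordStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atRecordStart) {
                    offsets.add(position);
                    atRecordStart = false;
                }
                if (b == '\n') {
                    atRecordStart = true;
                }
                position++;
            }
            this.fileLength = position;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int recordCount() {
        return offsets.size();
    }

    /** Seeks to record 'index' and reads only that record's bytes. */
    String read(int index) {
        long start = offsets.get(index);
        long end = (index + 1 < offsets.size()) ? offsets.get(index + 1) : fileLength;
        byte[] bytes = new byte[(int) (end - start)];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(start);
            raf.readFully(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Drop the trailing line terminator, if any.
        int length = bytes.length;
        while (length > 0 && (bytes[length - 1] == '\n' || bytes[length - 1] == '\r')) {
            length--;
        }
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }
}
```

The index costs one `Long` per record, and each `read` returns a fresh String that becomes garbage as soon as the GUI drops it. Reopening the file on every read keeps the sketch short; a long-lived application would probably keep one handle open.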