Currently I'm using client.getAsync(folder + "/files", liveListener) to check the contents of a folder for a specific file. At the moment it works well but if the folder has many files in it (which it is likely to have) then the result returned will be rather large so I was wondering if there was any way to limit this?
Using the Google Drive api I can query for files of a certain mimetype which means the results returned is greatly reduced.
Is there anything like this for the Windows Live api?
The documentation doesn't suggest so..
From here:
http://msdn.microsoft.com/en-us/library/live/hh826531.aspx
Looks like a variety of filtering and selection mechanisms. Not one explicitly for mime-types but it does let you filter on some notion of 'type':
Get only certain types of items by using the filter parameter in the
preceding code and specifying the item type: all (default), photos,
videos, audio, folders, or albums. For example, to get only photos,
use FOLDER_ID/files?filter=photos.
Also looks like you can sort by a variety of criteria and then select by offset, which would seemingly be very useful for what you discuss above.
Introduce a nontrivial folder structure in your file storage. Folders are a natural filtering mechanism for files.
Related
Scenario:
I'm doing some HTML information extraction using a crawler. Right now, most the rules for extraction are hardcoded (not the tags or things like that, but loops, nested elements, etc.)
For instance, one common task is as follows:
Obtain table with ID X. If it doesn't exists there may be additional mechanisms so find the info which are triggered
Find a row which contains some info. Usually the match is a regexp against an specific column.
Retrieve the data in a different column (usually marked in the td, or previously detected in the header)
The way I'm currently doing so is:
Query to get the body of first table with id X (X is in config file). Some websites of my list are buggy and duplicate that id on elements different than table -.-
Iterate over interesting cells, executing regexp on cell.text() (regexp is in config file)
Get the parent row of the matching cells, and obtain the cell I need from the row (identifier of the row is in config file)
Having all this hardcoded for the most part (except column names, table ids, etc) gives me the benefit or being easy to implement and more efficiency than a generic parser, however, it is less configurable, and some changes in the target websites force me to deal with code, which makes it harder to delegate the task.
Question
Is there any language (preferably with a java implementation available) which allows to consistently define rules for extractions like those? I'm using css-style selectors for some tasks, but others are not so simple, so my best guess is that there must be something extending that that a non-programmer maintainer to add/modify rules on demand.
I would accept a Nutch-based answer, if there's one, as we're studying migrating our crawlers to nutch, although, I'd prefer a generic java solution.
I was thinking about writing a Parser generator and create my own set of rules to allow users/maintainers to generate parsers, but it really feels like reinventing the wheel for no reason.
I'm doing something somewhat similar - not exactly what you're searching for, but maybe you can get some ideas.
First the crawling part:
I'm using Scrapy on Python 3.7.
For my project, that brought the advantage, that it's very flexible and an easy crawling framework to build upon. Things like delays between requests, HTTP header language etc. can mostly be configured.
For the information extraction part and rules:
In my last generation of crawler (I'm now working on the 3rd gen, the 2nd one is still running but not as scalable) I've used JSON files to enter the XPath / CSS rules for every page. So on starting my crawler, I've loaded the JSON file for one specific page that is currently being crawled and a generic crawler, knew what to extract based on the loaded JSON file.
This approach isn't easily scalable since one config file per domain has to be created.
Currently, I'm still using Scrapy, with a starting list of 700 Domains to crawl and the crawler is now only responsible for downloading the whole website as HTML files.
These are being stored in tar archives by a shell script.
Afterward, a Python script is going through all members of the shell script and analyzing the content for the information I'm looking to extract.
Here, as you said, it's a bit like re-inventing the wheel or writing a wrapper around an existing library.
In Python, one can use BeautifulSoup for removing all tags like script and style etc.
Then you can extract for instance all text.
Or you'd focus first on tables only, extract all tables into dicts and can then analyze with regex or similar.
There are libraries like DragNet for boilerplate removal.
And there are some specific approaches on how to extract table structured information.
I want to store my blobs outside of the database in files, however they are just random blobs of data and aren't directly linked to a file.
So for example I have a table called Data with the following columns:
id
name
comments
...
I can't just include a column called fileLink or something like that because the blob is just raw data. I do however want to store it outside of the database. I would love to create a file called 3.dat where 3 is the id number for that row entry. The only thing with this setup is that the main folder will quickly start to have a large number of files as the id is a flat folder structure and there will be OS file issues. And no the data is not grouped or structured, it's one massive list.
Is there a Java framework or library that will allow me to store and manage the blobs so that I can just do something like MyBlobAPI.saveBlob(id, data); and then do MyBlobAPI.getBlob(id) and so on? In other words something where all the File IO is handled for me?
Simply use an appropriate database which implements blobs as you described, and use JDBC. You really are not looking for another API but a specific implementation. It's up to the DB to take care of effective storing of blobs.
I think a home rolled solution will include something like a fileLink column in your table and your api will create files on the first save and then write that file on update.
I don't know of any code base that will do this for you. There are a bunch that provide an in memory file system for java. But it's only a few lines of code to write something that writes and reads java objects to a file.
You'll have to handle any file system limitations yourself. Though I doubt you'll ever burn through the limitations of modern file systems like btrfs or zfs. FAT32 is limited to 65K files per directory. But even last generation file systems support something on the order of 4 billion files per directory.
So by all means, write a class with two functions. One to serialize an object to a file; given it a unique key as a name. And another to deserialize the object by that key. If you are using a modern file system, you'll never run out of resources.
As far as I can tell there is no framework for this. The closest I could find was Hadoop's HDFS.
That being said the advice of just putting the BLOB's into the database as per the answers below is not always advisable. Sometimes it's good and sometimes it's not, it really depends on your situation. Here are a few links to such discussions:
Storing Images in DB - Yea or Nay?
https://softwareengineering.stackexchange.com/questions/150669/is-it-a-bad-practice-to-store-large-files-10-mb-in-a-database
I did find some addition really good links but I can't remember them offhand. There was one in particular on StackOverFlow but I can't find it. If you believe you know the link please add it in the comments so that I can confirm it's the right one.
I am stuck on a project at work that I do not think is really possible and I am wondering if someone can confirm my belief that it isn't possible or at least give me new options to look at.
We are doing a project for a client that involved a mass download of files from a server (easily did with ftp4j and document name list), but now we need to sort through the data from the server. The client is doing work in Contracts and wants us to pull out relevant information such as: Licensor, Licensee, Product, Agreement date, termination date, royalties, restrictions.
Since the documents are completely unstandardized, is that even possible to do? I can imagine loading in the files and searching it but I would have no idea how to pull out information from a paragraph such as the licensor and restrictions on the agreement. These are not hashes but instead are just long contracts. Even if I were to search for 'Licensor' it will come up in the document multiple times. The documents aren't even in a consistent file format. Some are PDF, some are text, some are html, and I've even seen some that were as bad as being a scanned image in a pdf.
My boss keeps pushing for me to work on this project but I feel as if I am out of options. I primarily do web and mobile so big data is really not my strong area. Does this sound possible to do in a reasonable amount of time? (We're talking about at the very minimum 1000 documents). I have been working on this in Java.
I'll do my best to give you some information, as this is not my area of expertise. I would highly consider writing a script that identifies the type of file you are dealing with, and then calls the appropriate parsing methods to handle what you are looking for.
Since you are dealing with big data, python could be pretty useful. Javascript would be my next choice.
If your overall code is written in Java, it should be very portable and flexible no matter which one you choose. Using a regex or a specific string search would be a good way to approach this;
If you are concerned only with Licensor followed by a name, you could identify the format of that particular instance and search for something similar using the regex you create. This can be extrapolated to other instances of searching.
For getting text from an image, try using the API's on this page:
How to read images using Java API?
Scanned Image to Readable Text
For text from a PDF:
https://www.idrsolutions.com/how-to-search-a-pdf-file-for-text/
Also, PDF is just text, so you should be able to search through it using a regex most likely. That would be my method of attack, or possibly using string.split() and make a string buffer that you can append to.
For text from HTML doc:
Here is a cool HTML parser library: http://jericho.htmlparser.net/docs/index.html
A resource that teaches how to remove HTML tags and get the good stuff: http://www.rgagnon.com/javadetails/java-0424.html
If you need anything else, let me know. I'll do my best to find it!
Apache tika can extract plain text from almost any commonly used file format.
But with the situation you describe, you would still need to analyze the text as in "natural language recognition". Thats a field where; despite some advances have been made (by dedicated research teams, spending many person years!); computers still fail pretty bad (heck even humans fail at it, sometimes).
With the number of documents you mentioned (1000's), hire a temp worker and have them sorted/tagged by human brain power. It will be cheaper and you will have less misclassifications.
You can use tika for text extraction. If there is a fixed pattern, you can extract information using regex or xpath queries. Other solution is to use Solr as shown in this video.You don't need solr but watch the video to get idea.
I'm in the early stages of a note-taking application for android and I'm hoping that somebody can point me to a nice solution for storing the note data.
Ideally, I'm looking to have a solution where:
Each note document is a separate file (for dropbox syncing)
A note can be composed of multiple pages
Note pages can have binary data (such as images)
A single page can be loaded without having to parse the entire document into memory
Thread-safety: Multiple reads/writes can occur at the same time.
XML is out (at least for the entire file), since I don't have a good way to extract a single page at a time. I considered using zip files, but (especially when compressed) I think they'd be stuck loading the entire file as well.
It seems like there should be a Java library out there that does this, but my google-fu is failing me. The only other alternative I can think of is to make a separate sqlite database for every note.
Does anybody know of a good solution to this problem? Thanks!
Seems like a relational database would work here. You just need to play around with the schema a little.
Maybe make a Pages table with each page including, say, a field for the document it belongs to and a field for its order in the document. Pages could also have a field for binary data, which might be contained in another table. If the document itself has additional data, maybe you have a table for documents too.
I haven't used SQLite transactions on an Android device, but it seems like that would be a good way to address thread safety.
I would recommend using SQLite to store the documents. Ultimately, it'll be easier than trying to deal with file I/O every time you access the note. Then, when somebody wants to upload to dropbox, you generate the file on the fly and upload it. It would make sense to have a Notes table and a pages table, at least. That way you can load each page individually and a note is just a collection of pages anyway. Additionally, you can store images as BLOBS in the database for a particular page. Basically, if you only want one type of content per page, then you would have, in the pages table, something like an id column and a content column. Alternatively, if you wanted to support something that is more complex such as multiple types of content then you would need to make your pages a collection of something else, like "entities."
IMO, a relational database is going to be the easiest way to accomplish your requirement of reading from particular pages without having to load the entire file.
I am making a java program that has a collection of flash-card like objects. I store the objects in a jtree composed of defaultmutabletreenodes. Each node has a user object attached to it with has a few string/native data type parameters. However, i also want each of these objects to have an image (typical formats, jpg, png etc).
I would like to be able to store all of this information, including the images and the tree data to the disk in a single file so the file can be transferred between users and the entire tree, including the images and parameters for each object, can be reconstructed.
I had not approached a problem like this before so I was not sure what the best practices were. I found XLMEncoder (http://java.sun.com/j2se/1.4.2/docs/api/java/beans/XMLEncoder.html) to be a very effective way of storing my tree and the native data type information. However I couldn't figure out how to save the image data itself inside of the XML file, and I'm not sure it is possible since the data is binary (so restricted characters would be invalid). My next thought was to associate a hash string instead of an image within each user object, and then gzip together all of the images, with the hash strings as the names and the XMLencoded tree in the same compmressed file. That seemed really contrived though.
Does anyone know a good approach for this type of issue?
THanks!
Thanks!
Assuming this isn't just a serializable graph, consider bundling the files together in Jar format. If you already have your data structures working with XMLEncoder, you can reuse this code by saving the data as a jar entry.
If memory serves, the jar library has better support for Unicode name entries than the zip package, which is why I would favour it.
You might consider using an MS JET database (.mdb file) and storing all the stuff in there. That'll also make it easy to examine and edit the data in (for example) MS Access.
You can employ some virtual file system, which stores it's data in a single container. We develop and offer one of such files sytems, SolFS, however right now there's no Java binding for it. We will release Java JNI interface for SolFS within a month.