We use a clone of Amazon S3 called GreenQloud Storage. But GreenQloud is shutting down, so we have to take all the files hosted there and migrate them to S3.
Our app is a little like a CMS in that our DB has fields that contain HTML fragments, and these fragments may reference GreenQloud URLs, so we have to replace all those URLs with S3 URLs.
The files are already migrated. Here is a sample file in both storage providers:
https://s.greenqloud.com/com.stample.s3/stample-1420827843028-spotlight.png
https://stample-files.s3.amazonaws.com/stample-1420827843028-spotlight.png
I'm thinking of using an HTML parser to extract tags like a and img, plus http URLs found in text nodes, but I'm afraid of missing some URLs that way. Do you see any problem with this approach?
I'm also considering regexes, but some people advise against using regexes to parse HTML. Then again, in my case I'm not sure this really counts as "parsing HTML", since I just want to replace one pattern with another.
So, I'd like to know which solution is the best / safest for this migration. I'm not so concerned about migration throughput/performance, but rather about migrating all the links accurately.
We use Java/Scala, and all the fields to migrate are in MongoDB, so any Java/MongoDB-based snippets are welcome.
Also note that some old HTML fragments in our DB may not be well-formed, but a Java parser can generally cope with that.
Thanks
Edit
A typical MongoDB document might look like:
{
  _id: ObjectId(xxx),
  title: "yyy",
  content: "HTML FRAGMENT CONTAINING GREENQLOUD URLS",
  mainPictureUrl: "GREENQLOUD URL"
}
I can't really give a representative example of the HTML fragments, as they come in many different shapes.
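To make this concrete, here is roughly the kind of transformation I have in mind (an untested sketch; the prefixes come from the sample URLs above, and the parser-vs-plain-replace trade-off in the comments is exactly what I'm asking about):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class UrlMigration {
    static final String OLD_PREFIX = "https://s.greenqloud.com/com.stample.s3/";
    static final String NEW_PREFIX = "https://stample-files.s3.amazonaws.com/";

    // For plain URL fields like mainPictureUrl: a simple prefix swap.
    static String migrateUrl(String url) {
        return url.startsWith(OLD_PREFIX)
                ? NEW_PREFIX + url.substring(OLD_PREFIX.length())
                : url;
    }

    // For HTML fields like content: rewrite href/src attributes.
    // Since the file names are identical on both providers, a plain
    // html.replace(OLD_PREFIX, NEW_PREFIX) would also work, and would even
    // catch URLs sitting in text nodes, without Jsoup normalizing the markup.
    static String migrateFragment(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        for (Element e : doc.select("[href]")) e.attr("href", migrateUrl(e.attr("href")));
        for (Element e : doc.select("[src]")) e.attr("src", migrateUrl(e.attr("src")));
        return doc.body().html();
    }
}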
Related
I am stuck on a project at work that I do not think is really possible, and I am wondering if someone can confirm my belief that it isn't, or at least give me new options to look at.
We are doing a project for a client that involved a mass download of files from a server (easily done with ftp4j and a document name list), but now we need to sort through the data from the server. The client is doing work in contracts and wants us to pull out relevant information such as: licensor, licensee, product, agreement date, termination date, royalties, restrictions.
Since the documents are completely unstandardized, is that even possible? I can imagine loading in the files and searching them, but I would have no idea how to pull information such as the licensor and the restrictions on the agreement out of a paragraph. These are not hashes, just long contracts. Even if I were to search for 'Licensor', it would come up in the document multiple times. The documents aren't even in a consistent file format: some are PDF, some are text, some are HTML, and I've even seen some that were as bad as a scanned image inside a PDF.
My boss keeps pushing for me to work on this project but I feel as if I am out of options. I primarily do web and mobile so big data is really not my strong area. Does this sound possible to do in a reasonable amount of time? (We're talking about at the very minimum 1000 documents). I have been working on this in Java.
I'll do my best to give you some information, as this is not my area of expertise. I would highly consider writing a script that identifies the type of file you are dealing with, and then calls the appropriate parsing methods to handle what you are looking for.
Since you are dealing with big data, Python could be pretty useful; JavaScript would be my next choice.
If your overall code is written in Java, it should be quite portable and flexible no matter which one you choose. Using a regex or a specific string search would be a good way to approach this:
If you are concerned only with "Licensor" followed by a name, you could identify the format of that particular instance and search for something similar using a regex you create. This can be extrapolated to other kinds of searches.
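For instance, if some contracts happen to label it as "Licensor: Some Name", a toy sketch like this (the pattern is purely illustrative and would need tuning per document format) could pull the name out:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LicensorFinder {
    // Assumes a "Licensor:" or "Licensor -" label, which real contracts may not have.
    private static final Pattern LICENSOR =
            Pattern.compile("Licensor\\s*[:\\-]\\s*([A-Z][\\w.,&' ]+)");

    static String findLicensor(String text) {
        Matcher m = LICENSOR.matcher(text);
        return m.find() ? m.group(1).trim() : null;
    }
}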
For getting text from an image, try using the APIs on these pages:
How to read images using Java API?
Scanned Image to Readable Text
For text from a PDF:
https://www.idrsolutions.com/how-to-search-a-pdf-file-for-text/
Also, once the text is extracted, a PDF is just text, so you should most likely be able to search through it using a regex. That would be my method of attack, or possibly using String.split() and building a StringBuffer you can append to.
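If you do go the regex route, you need the text out of the PDF first; a library like Apache PDFBox (just one option, not the only one) makes that a few lines (minimal sketch):
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfText {
    // Extracts the plain text of a PDF so it can be searched with a regex.
    static String extract(File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            return new PDFTextStripper().getText(doc);
        }
    }
}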
For text from HTML doc:
Here is a cool HTML parser library: http://jericho.htmlparser.net/docs/index.html
A resource that teaches how to remove HTML tags and get the good stuff: http://www.rgagnon.com/javadetails/java-0424.html
If you need anything else, let me know. I'll do my best to find it!
Apache Tika can extract plain text from almost any commonly used file format.
But with the situation you describe, you would still need to analyze the text, as in "natural language recognition". That's a field where, despite some advances (made by dedicated research teams spending many person-years!), computers still fail pretty badly (heck, even humans fail at it sometimes).
With the number of documents you mentioned (1000s), hire a temp worker and have the documents sorted/tagged by human brain power. It will be cheaper and you will have fewer misclassifications.
You can use Tika for text extraction. If there is a fixed pattern, you can extract information using regex or XPath queries. Another solution is to use Solr, as shown in this video. You don't need Solr, but watch the video to get the idea.
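For what it's worth, the Tika extraction step is nearly a one-liner (sketch; the file name is a placeholder):
import java.io.File;
import org.apache.tika.Tika;

public class TikaExtract {
    public static void main(String[] args) throws Exception {
        // Tika auto-detects the format (PDF, HTML, Word, ...) and returns plain text.
        String text = new Tika().parseToString(new File("contract.pdf"));
        System.out.println(text);
    }
}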
I want to collect domain names (crawling). I have written a simple Java application that reads an HTML page and saves the code in a text file. Now I want to parse this text in order to collect all domain names without duplicates. But I need the domain names without "http://www.", just domainname.topleveldomain, or possibly domainname.subdomain.topleveldomain, or whatever number of subdomains (then the collected links need to be extracted the same way, collecting the links inside them, until I reach a certain number of links, say 100).
I have asked about this in a previous post https://stackoverflow.com/questions/11113568/simple-efficient-java-web-crawler-to-extract-hostnames , and searched around. jsoup seems like a good solution, but I have not worked with jsoup before, so before going deeply into it I just want to ask: does it achieve what I want to do? Any other suggestions for achieving my simple crawling in a simple way are welcome.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
So yes, you can connect to a website, extract its HTML, and parse it with jsoup.
The logic for extracting the top-level domain is "your part": you will need to write that code yourself (a rough starting point is sketched after the links below).
Take a look at the docs for more options...
Use selector-syntax to find elements
Use DOM methods to navigate a document
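As that rough, untested starting point (the seed URL is a placeholder, and the "follow links until 100" loop is still up to you):
import java.net.URI;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HostCollector {
    public static void main(String[] args) throws Exception {
        Set<String> hosts = new HashSet<>(); // a Set drops duplicates for you
        Document doc = Jsoup.connect("http://example.com/").get();
        for (Element link : doc.select("a[href]")) {
            try {
                String host = URI.create(link.absUrl("href")).getHost();
                if (host == null) continue;
                // strip a leading "www." to keep just domainname.topleveldomain
                hosts.add(host.startsWith("www.") ? host.substring(4) : host);
            } catch (IllegalArgumentException ignored) {
                // malformed href -- skip it
            }
        }
        System.out.println(hosts);
    }
}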
Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, only links with short descriptions pointing to the full texts (this is common in news portals), and I don't want to discard those short texts.
So if an API can do this, i.e. return the different textual parts/blocks, splitting each one up in some manner rather than lumping everything into a single text (a single text alone is not useful), please report it.
The Question
I download some pages from random sites, and now I want to analyze the textual content of each page.
The problem is that a web page has a lot of content like menus, advertising, banners, etc.
I want to try to exclude everything that is not related to the actual content of the page.
Taking this page as an example, I don't want the menus at the top nor the links in the footer.
Important: all pages are HTML and come from various different sites. I need suggestions on how to exclude this clutter.
At the moment, I'm thinking of excluding content inside "menu" and "banner" classes in the HTML, plus consecutive words that look like a proper name (first letter capitalized).
The solutions can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).
Edit: I want to do this inside my Java code, not with an external application (if possible).
I tried an approach parsing the HTML content, described in this question: https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
There are a few ways to feed HTML into Boilerpipe and extract the text.
You can use a URL:
ArticleExtractor.INSTANCE.getText(url);
You can use a String:
ArticleExtractor.INSTANCE.getText(myHtml);
There are also options to use a Reader, which opens up a large number of options.
You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).
Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

Reader reader = ...
InputSource is = new InputSource(reader);
// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);
// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
    // block.isContent() tells you if it's likely to be content or not
    // block.getText() gives you the block's text
}
TextBlock has some more exciting methods, feel free to play around!
There appears to be a possible problem with Boilerpipe. Why?
Well, it appears that it is suited to certain kinds of web pages, such as web pages that have a single body of content.
So one can crudely classify web pages into three kinds in respect to Boilerpipe:
a web page with a single article in it (Boilerpipe worthy!)
a web page with multiple articles in it, such as the front page of the New York Times
a web page that really doesn't have any article in it, but has some content in the form of links, and may also have some degree of clutter.
Boilerpipe works on case #1. But if one is doing a lot of automated text processing, how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied to case #1. Case #2 is a problem, and case #3 is a problem as well: it might require an aggregate of related web pages to determine what is clutter and what isn't.
You can use libs like goose. It works best on articles/news.
You can also check the JavaScript code that does similar extraction to goose in the readability bookmarklet.
My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (i.e. Elements posts = doc.select("p");) and not have to worry about the other elements containing random content.
On the matter of your other post, was the issue of false positives your only reason for straying away from Jsoup? If so, couldn't you just tweak the MIN_WORDS_SEQUENCE value or be more selective with your selectors (i.e. not retrieve div elements)?
http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php
Proprietary software, but it makes it very easy to extract data from webpages and it integrates well with Java.
You use a provided application to design XML files that the RoboServer API reads to parse webpages. You build the XML files by analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can set up the scheduling, running, and DB integration using the provided Java API.
If you're against using software and want to do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate tags and then build per-site rules.
You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:
Tag Soup
HTML Unit
You can filter out the HTML junk and then parse the required details, or use the APIs of the existing site.
Refer to the link below to filter the HTML; I hope it helps.
http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/
You could use the textracto API: it extracts the main 'article' text, and there is also the option to extract all other textual content. By 'subtracting' these texts you could separate the navigation texts, preview texts, etc. from the main textual content.
I'm in the early stages of a note-taking application for android and I'm hoping that somebody can point me to a nice solution for storing the note data.
Ideally, I'm looking to have a solution where:
Each note document is a separate file (for dropbox syncing)
A note can be composed of multiple pages
Note pages can have binary data (such as images)
A single page can be loaded without having to parse the entire document into memory
Thread-safety: Multiple reads/writes can occur at the same time.
XML is out (at least for the entire file), since I don't have a good way to extract a single page at a time. I considered zip files, but (especially when compressed) I think they'd end up loading the entire file as well.
It seems like there should be a Java library out there that does this, but my Google-fu is failing me. The only other alternative I can think of is a separate SQLite database for every note.
Does anybody know of a good solution to this problem? Thanks!
Seems like a relational database would work here. You just need to play around with the schema a little.
Maybe make a Pages table with each page including, say, a field for the document it belongs to and a field for its order in the document. Pages could also have a field for binary data, which might be contained in another table. If the document itself has additional data, maybe you have a table for documents too.
I haven't used SQLite transactions on an Android device, but it seems like that would be a good way to address thread safety.
I would recommend using SQLite to store the documents. Ultimately, it'll be easier than trying to deal with file I/O every time you access a note. Then, when somebody wants to upload to Dropbox, you generate the file on the fly and upload it.
It would make sense to have a Notes table and a Pages table, at least. That way you can load each page individually, and a note is just a collection of pages anyway. Additionally, you can store images as BLOBs in the database for a particular page.
Basically, if you only want one type of content per page, then the Pages table would have something like an id column and a content column. Alternatively, if you wanted to support something more complex, such as multiple types of content per page, then you would need to make your pages a collection of something else, like "entities".
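For illustration, a sketch of what that schema might look like with Android's SQLiteOpenHelper (table and column names are placeholders):
import android.content.Context;
import android.database.sqlite.SQLiteDatabase;
import android.database.sqlite.SQLiteOpenHelper;

public class NotesDbHelper extends SQLiteOpenHelper {
    public NotesDbHelper(Context ctx) {
        super(ctx, "notes.db", null, 1);
    }

    @Override
    public void onCreate(SQLiteDatabase db) {
        db.execSQL("CREATE TABLE notes (_id INTEGER PRIMARY KEY, title TEXT)");
        db.execSQL("CREATE TABLE pages (" +
                "_id INTEGER PRIMARY KEY, " +
                "note_id INTEGER REFERENCES notes(_id), " + // which note owns the page
                "position INTEGER, " +                      // page order within the note
                "content TEXT, " +                          // the page's text
                "image BLOB)");                             // optional binary data
    }

    @Override
    public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) {
        // handle schema migrations here
    }
}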
IMO, a relational database is going to be the easiest way to accomplish your requirement of reading from particular pages without having to load the entire file.
I have written a couple of live wallpapers in recent weeks using local resources. Now a potential client wants me to make one that loads and displays the photos (usually between 3 and 10) from his daily news report posted online. The report file has a URL along the lines of http://example.com/dailytext/report.html which loads images along the lines of http://example.com/dailymedia/obama.jpg The references in report.html look like
img src="../dailymedia/obama.jpg" ...
Am I supposed to use a WebView for this? That doesn't seem quite right, because I don't want to display the HTML. I would think I want to pull down the raw HTML, parse it looking for the instances of "img src...", reconstruct the full URLs, and then load the bitmaps. I'm getting the impression this is more of a pure Java task than anything involving Android's specialized classes, but I don't know. Any suggestions about best practice?
Unless I have misunderstood, this really isn't hard. You need to do the following:
Fetch the HTML, using either the native Java networking APIs or something like HttpClient
Use a parser like Jericho or Dom4j to extract the image links
Construct the absolute URLs, which can be done with just java.net.URL
Fetch the images
You could also use the Jsoup HTML parser instead.
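With Jsoup the whole pipeline could look roughly like this (an untested sketch; remember to run it off the main thread, since it performs network I/O):
import android.graphics.Bitmap;
import android.graphics.BitmapFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class ReportImages {
    static List<Bitmap> fetch(String reportUrl) throws Exception {
        // Jsoup fetches and parses the HTML in one step
        Document doc = Jsoup.connect(reportUrl).get();
        List<Bitmap> bitmaps = new ArrayList<>();
        for (Element img : doc.select("img[src]")) {
            // absUrl() resolves relative paths like "../dailymedia/obama.jpg"
            String abs = img.absUrl("src");
            try (InputStream in = new URL(abs).openStream()) {
                bitmaps.add(BitmapFactory.decodeStream(in));
            }
        }
        return bitmaps;
    }
}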