Download list of pages from some domain with URL constraint - java

I need to download a list of all the pages on some domain that have specific URL endings.
For example, I have a website like http://brnensky.denik.cz/, a Czech news site. Every article has a URL ending with the post date, like http://brnensky.denik.cz/zpravy_region/ruzova-kola-usnadni-presun-po-brne-20140418.html.
So I would like to find the list of all URLs that begin with http://brnensky.denik.cz/, then anything, and then, for example, -20140418.html. Is this possible?
I'm trying to solve this in Java, but any other way would help too.

Regex would be
^http://brnensky\.denik\.cz.*[0-9]{8}\.html
Logic: the URL begins with the site address and ends with date.html, where the date is always an 8-digit string.
You may have to escape '/' depending on the tool or language used to implement this expression.
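As a minimal sketch of how that expression could be applied in Java: assuming you already have a list of candidate URLs (for instance from a sitemap or a crawler, neither of which is shown here), the filter below keeps only the article URLs. The class and method names are made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ArticleUrlFilter {
    // Same expression as above, with backslashes doubled for a Java string literal.
    private static final Pattern ARTICLE =
            Pattern.compile("^http://brnensky\\.denik\\.cz.*[0-9]{8}\\.html");

    // Keeps only the URLs that match the whole pattern.
    static List<String> filter(List<String> urls) {
        return urls.stream()
                .filter(url -> ARTICLE.matcher(url).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> candidates = List.of(
                "http://brnensky.denik.cz/",
                "http://brnensky.denik.cz/zpravy_region/ruzova-kola-usnadni-presun-po-brne-20140418.html");
        filter(candidates).forEach(System.out::println);
    }
}
```

In Java the '/' needs no escaping, and matches() already requires the whole URL to match, so no trailing $ is needed.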

Related

Cleaning a String from html code and accents with java

I need to clean an HTML string of accents and HTML accent codes. I have found a lot of code that does this, but none of it seems to work with the file I need to clean.
This file contains words like Postulación Ayudantías and also Gestión or Árbol
I found a lot of code that uses text.normalize and regex to clean the String. It works well with short strings, but I'm using very long strings, and it doesn't work with those.
I am really lost here and I need help please!
These are the approaches I tried that didn't work:
Easy way to remove UTF-8 accents from a string? (returns "?" for every accent in the String)
I also used regular expressions to remove the HTML accent codes, but neither is working:
string=string.replaceAll("&aacute;","a");
string=string.replaceAll("&eacute;","e");
string=string.replaceAll("&iacute;","i");
string=string.replaceAll("&oacute;","o");
string=string.replaceAll("&uacute;","u");
string=string.replaceAll("&ntilde;","n");
Edit: nvm the replaceAll is working I wrote it wrong ("/&aacute; instead of "&aacute;)
Any help or ideas?
I think there are several options that would work. I would suggest that you first
use StringEscapeUtils.unescapeHtml4(String) to unescape your HTML entities (that is, convert them to ordinary Unicode characters in a Java String).
Then you could use an ASCIIFoldingFilter (from Apache Lucene) to fold them to their "ASCII" equivalents.
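For the second step, here is a rough sketch that swaps in the JDK's own java.text.Normalizer instead of ASCIIFoldingFilter, so it needs no extra library. It assumes the entities have already been unescaped, and the class name is hypothetical. The accented characters are written as Unicode escapes only to keep the source file encoding out of the picture.

```java
import java.text.Normalizer;

class AccentStripper {
    static String stripAccents(String input) {
        // NFD splits an accented letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches those combining marks, which are then dropped.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // The input is "Postulación Ayudantías" written with escapes.
        System.out.println(stripAccents("Postulaci\u00f3n Ayudant\u00edas"));
        // prints Postulacion Ayudantias
    }
}
```

Letters that have no decomposition (for example ø or ß) pass through unchanged, which is one place where a dedicated folding filter does more.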
You need to differentiate whether you're talking about a whole HTML document containing tags and so forth or just a string containing HTML encoded data.
If you're working with an entire HTML document, say, something returned by fetching a web page, then the solution is really more than could fit into a Stack Overflow answer, since you basically need an HTML parser to navigate the data.
However, if you're just dealing with a string that's HTML encoded, then you first need to decode it. There are lots of utilities to do so, such as the Apache Commons Lang library StringEscapeUtils class. See this question for an example.
Once you've decoded the string, you need to iterate over it character by character and replace anything that's unwanted. Your current method won't work for hex encoded items, and you're going to end up having to build a huge table to cover all the possible HTML entities.
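To illustrate why a hand-rolled decoder grows quickly, here is a small sketch (Java 9+ for Map.of) that handles decimal and hex references plus a deliberately tiny table of named entities. The class name and the table contents are assumptions for the example. A library such as the one mentioned above is the more realistic choice.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityDecoder {
    // Deliberately tiny, hypothetical table; full coverage needs every named entity.
    private static final Map<String, String> NAMED = Map.of(
            "aacute", "\u00e1", "eacute", "\u00e9", "iacute", "\u00ed",
            "oacute", "\u00f3", "uacute", "\u00fa", "ntilde", "\u00f1",
            "amp", "&");

    // Matches named, decimal (&#225;) and hex (&#xE1;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]{1,6}|#[0-9]{1,7}|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement stops '$' and backslashes in the result from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(resolve(m.group(1), m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or the original reference if it is unknown or invalid.
    private static String resolve(String body, String original) {
        if (body.charAt(0) != '#') {
            return NAMED.getOrDefault(body, original);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        int codePoint = hex ? Integer.parseInt(body.substring(2), 16)
                            : Integer.parseInt(body.substring(1));
        return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint)) : original;
    }
}
```

Unknown names are left untouched rather than guessed, so the gaps in the table stay visible in the output.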

Undoing automatic linkification using Java and Regex

I am working with a database whose entries contain automatically generated HTML links: each URL was converted to
<a href="URL">URL</a>
I want to undo these links: the new software will generate the links on the fly. Is there a way in Java to use .replaceAll or a Regex method that will replace the fragments with just the URL (only for those cases where the URLs match)?
To clarify, based on the questions below: the existing entries will contain one or more instances of linkified URLs. Showing an example of just one:
I visited <a href="http://www.amazon.com/">http://www.amazon.com/</a> to buy a book.
should be replaced with
I visited http://www.amazon.com/ to buy a book.
If the URL in the href differs in any way from the link text, the replacement should not occur.
You can use this pattern with replaceAll method:
<a (?>[^h>]++|\Bh|h(?!ref\b))*href\s*=\s*["']?(http://)?([^\s"']++)["']?[^>]*>\s*+(?:http://)?\2\s*+<\/a\s*+>
replacement: $1$2
I wrote the pattern as a raw regex, so don't forget to escape the double quotes and double the backslashes before using it in a Java string literal.
The main interest of this pattern is that URLs are compared without the leading http://, so that more cases match.
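For reference, this is roughly what the escaped form looks like in Java. The class and method names are hypothetical; the pattern itself is the one given above, only split across two string literals.

```java
class LinkUndoer {
    // The pattern above, with double quotes escaped and backslashes doubled.
    private static final String LINK =
            "<a (?>[^h>]++|\\Bh|h(?!ref\\b))*href\\s*=\\s*[\"']?(http://)?([^\\s\"']++)[\"']?"
            + "[^>]*>\\s*+(?:http://)?\\2\\s*+<\\/a\\s*+>";

    // Replaces each link whose text equals its href with the bare URL.
    static String undo(String text) {
        return text.replaceAll(LINK, "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(undo(
                "I visited <a href=\"http://www.amazon.com/\">http://www.amazon.com/</a> to buy a book."));
    }
}
```

Links whose text differs from the href are left as they are, because the \2 back-reference fails to match.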
First, a reminder that regular expressions are not great for parsing XML/HTML: this HTML should parse out the same as what you've got, but it's really hard to write a regex for it:
<
a
foo="bar"
href="URL">
<nothing/>URL
</a
>
That's why we say "don't use regular expressions to parse XML!"
But it's often a great shortcut. What you're looking for is a back-reference:
\1
This will match when the quoted string and the contents of the a-element are the same. The \1 matches whatever was captured in group 1. You can also use named capturing groups if you like a little more documentation in your regular expressions. See Pattern for more options.
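A small sketch of the back-reference idea with a named group, assuming the simplest link shape (a double-quoted href as the only attribute). The pattern and the names here are illustrative and are not the exact expression from this answer.

```java
class BackReferenceDemo {
    // (?<url>...) captures the href value; \k<url> requires the link text to repeat it.
    private static final String SAME_URL_LINK =
            "<a\\s+href=\"(?<url>[^\"]*)\"\\s*>\\k<url></a>";

    static String unlink(String text) {
        // ${url} in the replacement refers to the named group.
        return text.replaceAll(SAME_URL_LINK, "${url}");
    }

    public static void main(String[] args) {
        System.out.println(unlink(
                "I visited <a href=\"http://www.amazon.com/\">http://www.amazon.com/</a> to buy a book."));
        // prints I visited http://www.amazon.com/ to buy a book.
    }
}
```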

Regex href parsing

A regex question in Java.
I'm scraping ID numbers from the href attribute of <a> elements. I have a bunch of links like these in a string:
Whatever
After the 'pdf' and slash comes an ID number, which I'm interested in.
So I must get all the IDs from multiple occurrences of this kind of URL in the string. What would be the best regex for it?
Thanks in advance.
If you know that the url will be exactly this, your regex can just be:
someplacelol\\.com/pdf/([0-9]+)/
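A minimal sketch of collecting every ID with that expression. The sample links in main are invented for the example (the real link format is not shown in the question), and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PdfIdExtractor {
    // Group 1 captures the digits between "/pdf/" and the next slash.
    private static final Pattern ID = Pattern.compile("someplacelol\\.com/pdf/([0-9]+)/");

    static List<String> extractIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(html);
        // find() walks through every occurrence in the string, not just the first.
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical input; only the domain and the /pdf/<id>/ part come from the answer.
        String html = "<a href=\"http://someplacelol.com/pdf/123/a\">Whatever</a> "
                + "<a href=\"http://someplacelol.com/pdf/456/b\">Whatever</a>";
        System.out.println(extractIds(html));
    }
}
```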
I'm no regex artist but you should be able to get the url out of the element with:
\<a\s.*?href=(?:\"([\w\.:/?=&#%_\-]*)\"|([^\"][\w\.:/?=&#%_\-]*[^\"\>])).*?\>
The URL will be in group 1 when the href value is quoted, and in group 2 when it is unquoted.
From there you should be able to extract the number without too much difficulty. I tested it on the source of this page and it was able to correctly identify all of the hrefs in all of the <a> elements.
Please don't comment and say it breaks for <a id="<<<>><><<>>href=" href="<a href="> because OP has stated in his description of the problem that ridiculous abuses of the HTML standard such as this one will not be present in his test cases.
Also, if for some weird reason an element has 2 hrefs, only the first will be grabbed. You could probably address that if you cared.
Edit: added whitespace requirement after <a so it won't match things like <asdffsdfsfg href="lol">.

Regex for university emails

I am looking to validate email addresses by making sure they have a specific university subdomain, e.g. if the user says they attend Oxford University, I want to check that their email ends in .ox.ac.uk
If I have the '.ox.ac.uk' part stored as a variable, how can I incorporate this with a regex to check the whole email is valid and ends in that variable suffix?
Many thanks!
We are using this email pattern (derived from this regular-expressions.info article):
^[\w!#$%&'*+/=?^`{|}~-]+(?:\.[\w!#$%&'*+/=?^`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?$
You should be able to extend it with your needed suffix:
^[\w!#$%&'*+/=?^`{|}~-]+(?:\.[\w!#$%&'*+/=?^`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+(?:ox\.ac\.uk)$
Note that I replaced the TLD part [a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])? with your required suffix (?:ox\.ac\.uk). The suffix is written without its leading dot because the preceding group already ends in \. (which matches a literal dot only).
Edit: one additional note: if you use String#matches(...) or Matcher#matches() there's no need for the leading ^ and the trailing $, since the entire string would have to match anyways.
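Since the question has the suffix in a variable, here is one way this might look in Java: a sketch that reuses the local-part and label fragments from the pattern above and appends the suffix through Pattern.quote, so its dots are taken literally. The class name, method name and the decision to match case-insensitively are assumptions for the example.

```java
import java.util.regex.Pattern;

class UniversityEmail {
    // Local part and a single domain label, taken from the pattern quoted above.
    private static final String LOCAL =
            "[\\w!#$%&'*+/=?^`{|}~-]+(?:\\.[\\w!#$%&'*+/=?^`{|}~-]+)*";
    private static final String LABEL = "[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?";

    // suffix is expected to include its leading dot, e.g. ".ox.ac.uk".
    static boolean hasSuffix(String email, String suffix) {
        String regex = LOCAL + "@" + LABEL + "(?:\\." + LABEL + ")*" + Pattern.quote(suffix);
        // matches() requires the whole address to match, so no ^ or $ is needed.
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(email).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasSuffix("jane.doe@cs.ox.ac.uk", ".ox.ac.uk")); // true
        System.out.println(hasSuffix("jane.doe@example.com", ".ox.ac.uk")); // false
    }
}
```

Because the suffix starts with a dot, at least one label has to come before it, so an address directly at ox.ac.uk would be rejected by this sketch.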
Assuming you are using php.
$ending = '.ox.ac.uk';
if (preg_match('/' . preg_quote($ending, '/') . '$/i', $email_address)) //... your code
Further info: preg_quote() is necessary so that characters with a special meaning in a regex get escaped. In your case it's the dots. Its second argument also escapes the / delimiter, should the suffix ever contain one.
edit: To check if the whole email is valid, see other questions, it is asked a lot. Just wanted to help with your special case.

How to catch URLs given by user in text

I would like to get URLs given by a user in his/her text (I assume that a URL must start with http://). This is my first attempt:
Pattern pattern = Pattern.compile("http://[^ ]+");
but if user types something like this:
"look at somepage (http://somepage.net)"
"look at http://somepage1.net, http://somepage2.net and sth else"
"Please visit our page http://somepage.net."
the URL is captured with an incorrect(?) character at the end. How can I avoid this?
I could require that a URL can't end with [,.)] etc. and may end only with [A-Za-z] or /, but this would break URLs with a specific ending such as http://site.com/read.php?key=F#$.)
The answer is that you cannot do this with 100% accuracy.
A URL like "http://somepage1.net," is technically legal, and there is no way of knowing for sure whether the "," is part of the URL or just punctuation.
A URL like "http://somepage1.net or something" is technically illegal, but typical end users don't know this. (They are used to browsers that do all sorts of funky things to what they type at their browser.)
Probably the best you can do is use a regex to extract legal URLs, and then trim text punctuation characters from the right end of the URL, on the assumption that they are not intended to be part of the URL.
You could also treat matching quotes or left / right brackets as denoting URL boundaries; e.g.
The secret URL is "http://example.com/?" ... don't leave off the "?"
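The extract-then-trim approach could be sketched like this. The set of punctuation characters to trim is a guess rather than a rule, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlExtractor {
    // Candidate URL: "http://" followed by any run of non-whitespace characters.
    private static final Pattern URL = Pattern.compile("http://\\S+");
    // Punctuation assumed to belong to the sentence, not the URL (a guess, not a rule).
    private static final Pattern TRAILING = Pattern.compile("[.,;:!?)\\]\"']+$");

    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            // Strip the trailing punctuation run from each candidate.
            urls.add(TRAILING.matcher(m.group()).replaceFirst(""));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(extract("look at http://somepage1.net, http://somepage2.net and sth else"));
        // prints [http://somepage1.net, http://somepage2.net]
    }
}
```

This sketch does not implement the quote or bracket boundary idea, so in the quoted example above it would still drop the final "?". That is exactly the kind of case that cannot be resolved with certainty.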
