How to extract links from a web content? - java

I have downloaded a web page and I want to extract all the links in that file. These links include both absolute and relative ones. For example, we have:
<script type="text/javascript" src="/assets/jquery-1.8.0.min.js"></script>
or
<a href="http://stackoverflow.com/" />
so after reading the file, what should I do?

This isn't that complicated to do, if you want to use the builtin regex system from Java. The hard bit is finding the right regex to match URLs[1][2]. For the sake of the answer, I'm gonna just assume you've done that, and stored that as a Pattern with syntax along the lines of this:
Pattern url = Pattern.compile("your regex here");
and some way of iterating through each line. What you'll want to do is define an ArrayList<String>:
ArrayList<String> urlsFound = new ArrayList<>();
From there, you'll have some loop to iterate through your file (assuming each line is a <? extends CharSequence> line), and inside you'll put this:
Matcher urlMatch = url.matcher(line);
while (urlMatch.find()) urlsFound.add(urlMatch.group());
What this does is create a Matcher for your line and the URL-matching Pattern from before. Then, it loops until #find() returns false (i.e., there are no more matches) and adds the match (with #group()) to the list, urlsFound.
At the end of your loop, urlsFound will contain all the matches for all of the URLs on the page. Note that this can get quite memory-intensive if you've got a lot of text, as urlsFound will get quite big, and you'll be creating and ditching a lot of Matchers.
1: I found a few good sites with a quick Google search; the cream of the crop seem to be here and here, as far as I can tell. Your needs may vary.
2: You'll need to make sure that the entire URL is captured with a single group, or this won't work at all. It can be tweaked to work if there are multiple parts, though.
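Putting the pieces above together, here is a minimal sketch. The pattern is only an assumption for illustration: instead of a general URL regex it captures the quoted value of any href or src attribute in group 1, which covers both the absolute and the relative examples from the question. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Assumed pattern (not from the answer): the quoted value of an
    // href or src attribute is captured in group 1.
    private static final Pattern URL = Pattern.compile(
            "\\b(?:href|src)\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Collects every captured attribute value, in document order.
    public static List<String> extractLinks(CharSequence text) {
        List<String> urlsFound = new ArrayList<>();
        Matcher urlMatch = URL.matcher(text);
        while (urlMatch.find()) {
            urlsFound.add(urlMatch.group(1));
        }
        return urlsFound;
    }

    public static void main(String[] args) {
        String page =
                "<script type=\"text/javascript\" src=\"/assets/jquery-1.8.0.min.js\"></script>\n"
                + "<a href=\"http://stackoverflow.com/\" />";
        for (String link : extractLinks(page)) {
            System.out.println(link);
        }
    }
}
```

Because the whole page is passed as one CharSequence, only a single Matcher is created. Like any regex approach, this will also pick up attribute-looking text inside comments or scripts.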

Related

Best way to store words for given scenario

I am working on Java project [Maven].
I am confused on one point. I don't know what is logically correct.
The problem is as follows:
A sentence is given, and from it I have to extract some particular words.
Solution that I found
I made one regex and put it in a Constants class. Whenever I have to add more words, I simply append them to the regex.
This solves the problem.
I am confused here
I am thinking of putting a number of text files in the resources folder, where each text file holds one regex.
REGEX = (?:A|B|C|D)
A, B, C, D = Word(String)
Is it a good idea ? If not please suggest any other.
Why would you save regex's in a text file? The fact that you're using a regex seems like an implementation detail that you would want to encapsulate (unless you want the significantly greater functionality but also overhead of supporting regexes).
Also, why do you need new files for each word? That seems like you could just have one file with a word per line that is all of the words you're interested in. This would be much more simple for a user to understand than 100 files with one regex per file.
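A minimal sketch of the one-word-per-line idea, with the regex kept as an internal detail. The class name, the method names, and the file location are assumptions for illustration; the words are quoted with Pattern.quote so that a keyword containing regex metacharacters cannot break the pattern.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class KeywordMatcher {
    private final Pattern pattern;

    // Builds \b(?:A|B|C)\b from the given words; blank lines are skipped.
    public KeywordMatcher(List<String> words) {
        String alternation = words.stream()
                .map(String::trim)
                .filter(w -> !w.isEmpty())
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        if (alternation.isEmpty()) {
            throw new IllegalArgumentException("no keywords given");
        }
        this.pattern = Pattern.compile("\\b(?:" + alternation + ")\\b");
    }

    // Reads one keyword per line (UTF-8) from a hypothetical keyword file.
    public static KeywordMatcher fromFile(Path file) throws IOException {
        return new KeywordMatcher(Files.readAllLines(file));
    }

    // Returns every keyword occurrence in the sentence, in order.
    public List<String> findIn(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(sentence);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

Adding a keyword then means adding a line to the file, with no change to the source code.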
As I understand it, you want to find some keywords in the input string, and those keywords could be extended according to your requirements.
Your current solution is to put this regex (?:A|B|C|D) in your Constants class; whenever it's required, you add more keywords to this regex.
If my understanding is correct, one suggestion is to put this regex in your properties file, like this:
REGEX = (?:city|Animal|plant|student)
If it gets too long, it could be split like this:
REGEX = (?:city|Animal|plant|student|car|computer|clothes|\
furnature|others)
Your second idea, if I understand it correctly, is to use the keywords as file names and put those files in one resource folder, so that you can read the file names to compose the final regex. If your regex always has the fixed (?:A|B|C|D) format, then this solution is good and convenient: every time you add a new keyword file, you don't need to modify any source code or properties file.

Regular Expressions to match an <a> tag

I am writing a small java program for a class, and I can't quite figure out why my regex isn't working properly. In the special case of having 2 tags on the same line that is read in, it only matches the second one.
Here is a link that has the regex included, along with a simple set of test data:
Regex Test Link.
In my java program I have the following code:
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
String[] results;
System.out.println(p.toString());
Matcher m = null;
while ((line = input.readLine()) != null) {
    m = p.matcher(line);
    while (m.find()) {
        System.out.println("Matches: " + m.group(1));
    }
}
The goal is to extract the href value, as long as it starts with http:// and the URL either names no page (like http://www.google.com) or ends in index.htm or index.html (like http://www.google.com/index.html).
My regex works for every case above, but doesn't match in the special case of two tags on the same line.
Any help is appreciated.
Just use a proper HTML parsing library, such as HTML cleaner. It is theoretically impossible to properly parse HTML with a regex - there are so many constructs that will confound it. For example:
<![CDATA[ > bar ]]>
This is not a link. This is literal text in XHTML.
baz
This is only one link.
<a rel="next" href="bar?2">Next</a>
This is a realistic example of a link with a relation attribute and a relative URI.
<a name="foo">The href="http://example.com" part is the link destination...</a>
This is a named anchor, not a link. However your regex would parse out the literal text here as a link.
Foo
Does your regex handle line-spanning links properly?
There are all kinds of other Fun edge cases that can occur. Save yourself time and headaches. These problems have already been solved and wrapped up in nice neat libraries for you to use. Take advantage of this.
Regexes may be a powerful tool, but as they say - when all you have is a hammer, everything looks like a nail. You are currently trying to hammer in a screw.
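As a rough illustration of the parser route, here is a sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html) rather than the HTML cleaner library named above, only so that the example needs no extra dependency. That parser is old and lenient, so treat this as a demonstration of the idea, not a recommendation of that particular parser. The class and method names are made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ParserLinkExtractor {
    // Returns the href of every <a> element that actually has one.
    public static List<String> extractHrefs(String html) {
        final List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a rel=\"next\" href=\"bar?2\">Next</a>\n"
                + "<a name=\"foo\">The href=\"http://example.com\" part is the link destination...</a>";
        System.out.println(extractHrefs(html));
    }
}
```

The named anchor contributes nothing here, because the href text inside it is element content, not an attribute.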
This worked for me in that regex tester page
<a[^>]*>[^<]*</a>
Regex Solution
So I was playing around and realized my issue. I adjusted my regex a bit. My main problem was at the beginning my .* was causing everything to match up until the last tag, and therefore it was really only matching once instead of twice. I made that .* lazy and it matched twice instead of once. That was the only issue. Once that regex was added to java, my loop code worked fine.
Thanks everyone that responded. While you may not have provided the answer, your comments got me thinking in the right direction!
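The greedy versus lazy difference can be reproduced with a small sketch. The two patterns below are hypothetical stand-ins, since the actual regex from the question is not shown; they differ only in .* versus .*? after the tag name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Hypothetical patterns: identical except for the quantifier after "<a".
    public static final Pattern GREEDY = Pattern.compile(
            "<a.*href=\"(http://[^\"]*)\"", Pattern.CASE_INSENSITIVE);
    public static final Pattern LAZY = Pattern.compile(
            "<a.*?href=\"(http://[^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Collects group 1 of every match on the given line.
    public static List<String> hrefs(Pattern p, String line) {
        List<String> result = new ArrayList<>();
        Matcher m = p.matcher(line);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "<a href=\"http://a.com\">A</a> <a href=\"http://b.com\">B</a>";
        // Greedy: .* runs to the end of the line and backtracks to the LAST href.
        System.out.println(hrefs(GREEDY, line)); // [http://b.com]
        // Lazy: .*? stops at the FIRST href, so both tags are found.
        System.out.println(hrefs(LAZY, line));   // [http://a.com, http://b.com]
    }
}
```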
You would have to look through all the matches you got per line and find which one looks like a url (like with some more regex ;))

sample editor - same as stackOverflow

I want to create an editor and store formatted text in a database. I just want a simple editor with functions like the Stack Overflow editor:
_sfdfdgdfgfg_ : for underlined text
/sfdfdgdfgfg/ : for bolded text
I will use a function to replace the first _ with <b> and the second with </b> (and likewise for /).
So my question is: how can I detect the opening and the closing _ and / if they are nested?
For example :
/dsdfdfffddf _ dsdsdssd_/ ffdfddsgds /dfdfehgiuyhbf/ ....
I will use this editor in Java Application.
So what you want is a Java Version of markdown.
Here's what Google finds:
http://code.google.com/p/markdownj/
It will not make you happy, but you should probably take the time to learn to write parsers (dragon book is nice for that). The thing with parser tasks is that they are easy if you know how to do it and nearly impossible if you don't.
I would write a tokenizer that can recognize tokens like <start_underline, "_"> and <end_underline, "_"> for the format indicators you want to use in your editor and one for everything else. Results could look like this:
Text: Hello _world_, /how are you?/
Tokens: <text, "Hello ">,<start_underline, "_">,<text, "world">,<end_underline, "_">,<text, ", ">,<start_bold, "/">,<text, "how are you?">,<end_bold, "/">,
Start and End can be tracked fairly easy with boolean variables, because it makes no sense to nest them. That's why I would do that tracking in the tokenizer already.
After that I would write a parser class that takes these tokens and configures the output to a textarea accordingly.
You see, that this is actually just an application of the principle divide and conquer. The task of How do I do everything I want with my text? is split up into 3 parts:
According to a useful structure, what is this string about? (Answer from Tokenizer)
How do I handle specific textpart x for all possible x? (Answer from Parser)
How do I represent the parser's interpretation of this string? (Answer from JTextPane or alike)
Both Tokenizer and Parser don't need to be extra classes. Because the context is not complicated they can just be methods in an extension class of the Textarea type you prefer.
Giving more detailed advice is not helpful I think, the next best step would be an implementation that you probably better want to do by yourself. Don't hesitate to ask if you fail to find a good solution to one specific part, though.
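For orientation only, the tokenizer idea described above could be sketched roughly as follows. Everything here is an assumption for illustration: the token type names follow the example tokens, the two boolean flags track open and closed state, and the HTML rendering to <u> and <b> merely stands in for whatever output component is used. No escaping of the text is attempted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class MarkupTokenizer {
    public enum Type { TEXT, START_UNDERLINE, END_UNDERLINE, START_BOLD, END_BOLD }

    public static final class Token {
        public final Type type;
        public final String value;

        Token(Type type, String value) {
            this.type = type;
            this.value = value;
        }

        @Override
        public String toString() {
            return "<" + type.name().toLowerCase(Locale.ROOT) + ", \"" + value + "\">";
        }
    }

    // Splits the input into text and start/end format tokens.
    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        boolean underline = false; // true while an underline run is open
        boolean bold = false;      // true while a bold run is open
        for (char c : input.toCharArray()) {
            if (c != '_' && c != '/') {
                text.append(c);
                continue;
            }
            if (text.length() > 0) { // flush pending plain text
                tokens.add(new Token(Type.TEXT, text.toString()));
                text.setLength(0);
            }
            if (c == '_') {
                tokens.add(new Token(underline ? Type.END_UNDERLINE : Type.START_UNDERLINE, "_"));
                underline = !underline;
            } else {
                tokens.add(new Token(bold ? Type.END_BOLD : Type.START_BOLD, "/"));
                bold = !bold;
            }
        }
        if (text.length() > 0) {
            tokens.add(new Token(Type.TEXT, text.toString()));
        }
        return tokens;
    }

    // The "parser" step: turns the tokens into simple HTML.
    public static String toHtml(List<Token> tokens) {
        StringBuilder out = new StringBuilder();
        for (Token t : tokens) {
            switch (t.type) {
                case START_UNDERLINE: out.append("<u>"); break;
                case END_UNDERLINE:   out.append("</u>"); break;
                case START_BOLD:      out.append("<b>"); break;
                case END_BOLD:        out.append("</b>"); break;
                default:              out.append(t.value);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Token> tokens = tokenize("Hello _world_, /how are you?/");
        System.out.println(tokens);
        System.out.println(toHtml(tokens));
    }
}
```

Because underline and bold are tracked by separate flags, an underlined run inside a bold run still produces matching start and end tokens.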
You can see stackoverflow.com Page Source and try to integrate... I guess it should work...
https://blog.stackoverflow.com/2008/12/reverse-engineering-the-wmd-editor/
This example shows how to use MarkdownJ:
First, make sure the MarkdownJ library is on your Java application's classpath.
Second, use this code to invoke MarkdownJ:
MarkdownProcessor m = new MarkdownProcessor();
String html = m.markdown("this is *a sample text*");
System.out.print("<html>"+html+"</html>");

Extracting everything but tags from a web page without a parser - using scanner and regex?

Working on Android SDK, it's Java minus some things.
I have a solution that pulls out two regex patterns from web pages. The problems I'm having is that it's finding things inside HTML tags. I tried jTidy, but it was just too slow on the Android. Not sure why but my Scanner regex match solution whips it many times over.
Currently, I grab the page source into an InputStream
is = uconn.getInputStream();
and then match and extract like this:
Scanner scanner = new Scanner(is, "UTF-8");
String match = "";
while (match != null) {
    match = scanner.findWithinHorizon(extractPattern, 0);
    if (match != null) {
        String matchit = scanner.match().group(grp);
        // ... use matchit ...
    }
}
it works very nicely and is fast.
My regex pattern is already kinda crazy, actually two patterns in an or like this (p1|p2)
Any ideas on how I do that "but not inside HTML tags" or exclude HTML tags at the start?
If I can exclude HTML tags from my source that will likely speed up my interface significantly as I have a few other things I need to do with the raw data.
One thing you can do is add a lookahead for the closing angle bracket:
(p1|p2)(?![^<>]*+>)
The idea is, after you find a match you scan forward a bit; if you find a closing bracket without first seeing an opening bracket, the match must have occurred inside a tag, so reject it. But be aware that even in well-formed HTML there are many things that can mess you up, like SGML comments, CDATA sections, or even angle brackets in attribute values.
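A small sketch of the lookahead in action. The words foo and bar are placeholders for p1 and p2, and the class and method names are made up. It uses a plain Matcher on a string for brevity; the same pattern could be passed to Scanner.findWithinHorizon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutsideTagMatcher {
    // (p1|p2) followed by a negative lookahead: reject the match if a '>'
    // is reached before any other angle bracket, i.e. we are inside a tag.
    private static final Pattern OUTSIDE_TAGS =
            Pattern.compile("(foo|bar)(?![^<>]*+>)");

    public static List<String> findOutsideTags(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = OUTSIDE_TAGS.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<div class=\"foo\">foo and <b id=\"bar\">bar</b></div>";
        // Only the occurrences in the text content are reported,
        // not the ones in the class and id attribute values.
        System.out.println(findOutsideTags(html));
    }
}
```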
Another approach would be to match the tags and ignore those matches:
((?:<[^<>]++>)++)|(p1|p2)
Then you test whether it was group #1 that matched:
MatchResult match = scanner.match();
if (match.start(1) != -1) {
// keep searching
}
But again, as a general solution this is way too fragile, for the reasons I cited above. You should only use one of these solutions (or any regex solution) if you're sure it's compatible with the particular pages you're working on.
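The second approach can be sketched in the same way, assuming the tag group and the search patterns are joined as alternatives with |, so that a match is either a run of tags (group 1) or one of the wanted patterns (group 2). The words foo and bar again stand in for p1 and p2, and the names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SkipTagsMatcher {
    // Group 1: one or more complete tags. Group 2: the text we actually want.
    private static final Pattern TAGS_OR_TARGET =
            Pattern.compile("((?:<[^<>]++>)++)|(foo|bar)");

    public static List<String> findOutsideTags(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = TAGS_OR_TARGET.matcher(html);
        while (m.find()) {
            if (m.start(1) != -1) {
                continue; // group 1 matched: this was a run of tags, keep searching
            }
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<div class=\"foo\">foo and <b id=\"bar\">bar</b></div>";
        System.out.println(findOutsideTags(html));
    }
}
```

Each tag is consumed whole by the first alternative, so text inside it is never offered to the second one.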
Why don't you use javax.xml.parsers to parse HTML (ergo xml)

Regex exclusion behavior

Ok, so I know this question has been asked in different forms several times, but I am having trouble with specific syntax. I have a large string which contains html snippets. I need to find every link tag that does not already have a target= attribute (so that I can add one as needed).
^((?!target).)* will give me text leading up to 'target', and <a.+?>[\w\W]+?</a> will give me a link, but that's where I'm stuck. An example:
<a href="http://www.someSite.com">Link</a> (This should be a match)
Link (this should not be a match).
Any suggestions? Using DOM or XPATH are not really options since this snippet is not well-formed html.
You are being wilfully evil by trying to parse HTML with Regexes. Don't.
That said, you are being extra evil by trying to do everything in one regexp. There is no need for that; it makes your code regex-engine-dependent, unreadable, and quite possibly slow. Instead, simply match tags and then check your first-stage hits again with the trivial regex /target=/. Of course, that character string might occur elsewhere in an HTML tag, but see (1)... you have already thrown good practice out of the window, so why not at least make things un-obfuscated so everyone can see what you're doing?
If you insist on doing it with Regex a pattern such as this should help...
<a(?![^>]*target=) [^>]*>.*?</a>
It's by no means 100% perfect. Technically speaking, a tag can contain a > in places other than the end, so it won't work for all HTML tags.
NB. I work with PHP, you may have to make slight syntax adjustments for Java.
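A possible Java adaptation, as a sketch only. It narrows the pattern to the opening tag and captures its attributes so the target can be appended with replaceAll; that capture group, the word boundary, and the class and method names are additions for this example, not part of the pattern above. The same caveat about > inside attribute values applies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TargetAdder {
    // Opening <a ...> tag whose attributes (group 1) contain no target=.
    private static final Pattern OPEN_A_WITHOUT_TARGET = Pattern.compile(
            "<a(?![^>]*\\btarget\\s*=)(\\s[^>]*)>", Pattern.CASE_INSENSITIVE);

    // Appends target="..." to every link that does not already have one.
    public static String addTarget(String html, String target) {
        Matcher m = OPEN_A_WITHOUT_TARGET.matcher(html);
        return m.replaceAll(
                "<a$1 target=\"" + Matcher.quoteReplacement(target) + "\">");
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.someSite.com\">Link</a> "
                + "<a href=\"x\" target=\"_top\">Other</a>";
        System.out.println(addTarget(html, "_blank"));
    }
}
```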
You could try a negative lookahead like this:
<a(?!.*?target.*?).*?>[\w\W]+?</a>
I didn't test this and spent about a minute writing it, but for your specific example if you can do it on the client-side, try this via the DOM:
var links = document.getElementsByTagName("a");
for (var linkIndex = 0; linkIndex < links.length; linkIndex++) {
    var link = links[linkIndex];
    if (link.href && !link.target) {
        link.target = "someTarget";
        // or link.setAttribute("target", "someTarget");
    }
}
