Undoing automatic linkification using Java and Regex

I am working with a database whose entries contain automatically generated HTML links: each URL was converted to
<a href="URL">URL</a>
I want to undo these links, because the new software will generate the links on the fly. Is there a way in Java to use .replaceAll or another regex method to replace these fragments with just the URL (only in those cases where the two URLs match)?
To clarify, based on the questions below: the existing entries contain one or more instances of linkified URLs. An example with just one:
I visited <a href="http://www.amazon.com/">http://www.amazon.com/</a> to buy a book.
should be replaced with
I visited http://www.amazon.com/ to buy a book.
If the URL in the href differs in any way from the link text, the replacement should not occur.

You can use this pattern with the replaceAll method:
<a (?>[^h>]++|\Bh|h(?!ref\b))*href\s*=\s*["']?(http://)?([^\s"']++)["']?[^>]*>\s*+(?:http://)?\2\s*+<\/a\s*+>
replacement: $1$2
I wrote the pattern in raw form, so don't forget to escape the double quotes and double the backslashes before using it in a Java string literal.
The main advantage of this pattern is that the URLs are compared without the leading http://, so it matches more cases.
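As a rough sketch of what the escaping looks like in practice, here is the raw pattern above written as a Java string literal and applied with replaceAll. The class and method names (LinkUndoer, undoLinks) are made up for illustration and are not part of the original answer; the \/ from the raw pattern is written as a plain / because Java does not need it escaped.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class LinkUndoer {
    // The raw pattern from the answer, escaped for a Java string literal:
    // every backslash is doubled and every double quote gets a backslash.
    private static final Pattern LINK = Pattern.compile(
        "<a (?>[^h>]++|\\Bh|h(?!ref\\b))*href\\s*=\\s*[\"']?(http://)?([^\\s\"']++)[\"']?[^>]*>"
        + "\\s*+(?:http://)?\\2\\s*+</a\\s*+>");

    static String undoLinks(String text) {
        // $1 is the optional "http://" found in the href, $2 the rest of the URL.
        return LINK.matcher(text).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(undoLinks(
            "I visited <a href=\"http://www.amazon.com/\">http://www.amazon.com/</a> to buy a book."));
    }
}
```

Links whose text differs from the href are left untouched, because the back-reference \2 fails and no replacement happens.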

First, a reminder that regular expressions are not great for parsing XML/HTML: this HTML should parse out the same as what you've got, but it's really hard to write a regex for it:
<
a
foo="bar"
href="URL">
<nothing/>URL
</a
>
That's why we say "don't use regular expressions to parse XML!"
But it's often a great shortcut. What you're looking for is a back-reference:
\1
This will match when the quoted string and the contents of the a-element are the same. The \1 matches whatever was captured in group 1. You can also use named capturing groups if you want a little more documentation in your regular expressions. See the java.util.regex.Pattern documentation for more options.
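A minimal sketch of the back-reference idea with a named capturing group, assuming the simple case where href is the only attribute and is double-quoted. The pattern here is my own illustration, not the one from the answer, and the names BackReferenceDemo and unlink are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper class illustrating a named back-reference.
class BackReferenceDemo {
    // (?<url>...) captures the quoted href value; \k<url> then requires the
    // link text to be exactly the same string.
    private static final Pattern SAME_URL_LINK =
        Pattern.compile("<a\\s+href=\"(?<url>[^\"]*)\"\\s*>\\k<url></a>");

    static String unlink(String text) {
        // ${url} inserts the text captured by the named group.
        return SAME_URL_LINK.matcher(text).replaceAll("${url}");
    }
}
```

This version is stricter than a hand-tuned pattern: extra attributes, single quotes, or whitespace inside the element will prevent a match.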

Related

extract information from xml using regular expression

I need to extract the author from the text using a regex. I also need the index of every tag and author. I tried a few parsers, but none of them preserves the indexes correctly, so the only solution seems to be a regex. I have the following regex, and it has a problem in the "[^]" part.
How could I fix this regex:
<post\\s*author=\"([^\"]+)\"[^>]+>[^</post>]*</post>
in order to extract the author from the following text:
<post author="luckylindyslocale" datetime="2012-03-03T04:52:00" id="p7">
<img src="http://img.photobucket.com/albums/v303/lucky196/siggies/ls1.png"/>
Grams thank you, for this wonderful tag and starting this thread. I needed something to encourage me to start making some new tags.
<img src="http://img.photobucket.com/albums/v303/lucky196/holidays/stpatlucky.jpg"/>
Cruelty is one fashion statement we can all do without. ~Rue McClanahan
</post>

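Why can't the regex
<post\\s*author=\"([^\"]+)\"[^>]+>[^</post>]*</post>
extract the author from that text?
Because
[^</post>]*
is a negated character class: it matches any character except <, /, p, o, s, t, and >, zero or more times.
Your post body contains those characters (for example in the <img> tags and in ordinary words), so the match fails before it reaches </post>. To fix it, consider the following regex:
<post\s*author=\"([^\"]+?)\"[^>]+>(.|\s)*?<\/post>
// obviously, escape appropriate characters in Java String literals
The (.|\s) alternation already matches line breaks, so no flag is needed. If you write .*? instead, compile the pattern with Pattern.DOTALL (Pattern.MULTILINE only changes how ^ and $ behave).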
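Since the question also asks for the index of each author, here is a small sketch that uses a DOTALL variant of the fix above together with Matcher.start. The class name PostAuthors and the "author@index" output format are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names and output format are illustrative only.
class PostAuthors {
    // [^>]* skips the remaining attributes of the opening tag; with DOTALL,
    // .*? lazily spans line breaks up to the first closing </post>.
    private static final Pattern POST = Pattern.compile(
        "<post\\s+author=\"([^\"]+)\"[^>]*>.*?</post>", Pattern.DOTALL);

    // Returns one "author@index" entry per post, where index is the offset
    // of the author value within the input text.
    static List<String> authorsWithIndex(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = POST.matcher(text);
        while (m.find()) {
            found.add(m.group(1) + "@" + m.start(1));
        }
        return found;
    }
}
```

m.start() and m.end() without an argument give the offsets of the whole post element, which may be what you need for the tag indexes.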

You can just do it like this:
/<post author="(.*?)"/
The comments are correct, though: regex is not the best tool for parsing HTML. But this should do what you are looking for.

Need to write a java regex expression which matches urls with http or https but does not contains specific file extensions

I need to create a Java regex that matches URLs with http or https but does not match URLs with specific file extensions.
I can match the http or https URLs using the following expression, but I am unable to complete the second part, which is excluding URLs with certain extensions (e.g. js|css|jpg etc.).
I guess I need to use negation, but I am not sure how to do it.
String regex = "\\s*(?i)(http|https)\\s*://\\s*((\"[^\"]*\"|'[^']*'|([^'\">\\s]+)))";
Please help me to modify this regex to meet this requirement.

An easy way to implement this in Java is to use the Pattern class (from java.util.regex). To accomplish what you're describing, you could use two separate Pattern objects to check the two conditions on the URL. For example (using the string regex from your question):
import java.util.Scanner;
import java.util.regex.Pattern;
Scanner in = new Scanner(System.in);
String input = in.nextLine();
Pattern one = Pattern.compile(regex);
// Backslashes must be doubled inside a Java string literal.
Pattern two = Pattern.compile("([^\\s]+(\\.(?i)(js|css|jpg|etc))$)");
if (one.matcher(input).matches() && !two.matcher(input).matches())
    System.out.println("It matches!");
else
    System.out.println("Nope!");
In short, using two Pattern objects makes your code more readable and easier to manage, since you're checking multiple aspects of the input URL.
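If you would rather keep everything in one expression, the negation the question asks about can be done with a negative lookahead. The following is only a sketch: the pattern is a simplified URL check of my own (not the regex from the question), the extension list is just js, css and jpg, and the names UrlFilter and isAllowed are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the pattern is a simplified illustration.
class UrlFilter {
    // (?i) makes the match case-insensitive. After the scheme, the negative
    // lookahead rejects any URL whose last characters are .js, .css or .jpg.
    private static final Pattern ALLOWED_URL = Pattern.compile(
        "(?i)^https?://(?!\\S+\\.(?:js|css|jpg)$)\\S+$");

    static boolean isAllowed(String url) {
        return ALLOWED_URL.matcher(url).matches();
    }
}
```

Note that this only looks at the very end of the string, so a URL such as http://example.com/app.js?v=2 would still be accepted. Handling query strings would need a more careful pattern.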

You need an anchor to look behind. See "regex to match url", which should give you the expression you need. The regex you have currently will match malformed URLs with disallowed characters.
Here's a good site to check your expressions: http://www.regexplanet.com/advanced/java/index.html

How to extract Substring from a String in Java

I have a String like below:
<script language="JavaScript" type="text/javascript" src="http://dns.net/adj/myhost.com/index;size=5x10;zipc=12345;myzon=north_west;|en;tile=10;ord=7jkllk456?"></script>
I want to extract whatever is between src=" and ">. I have written code like the following:
int i=str.indexOf("src=\"");
str=str.substring(i+5);
i=str.indexOf("\">");
str=str.substring(0,i);
System.out.println(str);
Do you know if this is the right way? My only worry is that sometimes there could be a space between src and =, or a space between " and >, and in that case my code will not work, so I was thinking of using a regex. But I am not able to come up with a regular expression. Do you have any suggestions?

This will work, but you should look into regular expressions; they provide a powerful way to spot patterns and extract text accordingly.

If you don't want to bother with regex, you can do this:
testString.split("src\\=")[1].split(">")[0];
Note that the result still includes the surrounding double quotes. Of course it still doesn't solve your other concerns with different formats, but you can still use an applicable regex (like RanRag's answer) with String.split() instead of the 5 lines of code you were using.

You can also try this regex: src\s*[=]\s*"(.*)"\s*>
Let's break it down:
src matches the literal text src
\s* allows zero or more whitespace characters
[=] matches the equals sign
"(.*)" captures the text between the opening double quote and the closing double quote
\s*> allows optional whitespace before the closing >

Perhaps this is overkill for your situation, but you might want to consider using an HTML parser. This would take care of all the document formatting issues and let you get at the tags and attributes in a standard way. While Regex may work for simple HTML, once things become more complicated you could run into trouble (false matches or missed matches).
Here is a list of available open source parsers for Java: http://java-source.net/open-source/html-parsers

If there can't be any escaped double quotes in the string you want, try this expression: src="([^"]*)". This will match src=" and then anything up to the first " that follows, capturing the text between the double quotes into group 1 (group 0 is always the entire matched string).
Since whitespace around = is allowed, you might extend the expression to src\s*=\s*"([^"]*)".
Just a word of warning: HTML isn't a regular language, and thus it can't be parsed in general using regular expressions. For simple cases like this it is OK, but don't fall into the trap of thinking you can parse more complex HTML structures.
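For reference, a short sketch of how the src\s*=\s*"([^"]*)" expression could be used from Java. The class and method names (SrcExtractor, extractSrc) are made up for this example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class SrcExtractor {
    // Optional whitespace around '=', then everything up to the next double quote.
    private static final Pattern SRC = Pattern.compile("src\\s*=\\s*\"([^\"]*)\"");

    static Optional<String> extractSrc(String html) {
        Matcher m = SRC.matcher(html);
        // find() scans for the first occurrence; group(1) is the quoted value.
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Returning an Optional avoids the exception you would get from the indexOf/substring version when the attribute is missing.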

Java regex to retain specific closing tags

I'm trying to write a regex to remove all but a handful of closing xml tags.
The code seems simple enough:
String stringToParse = "<body><xml>some stuff</xml></body>";
Pattern pattern = Pattern.compile("</[^(a|em|li)]*?>");
Matcher matcher = pattern.matcher(stringToParse);
stringToParse = matcher.replaceAll("");
However, when this runs, it skips the "xml" closing tag. It seems to skip any tag where there is a matching character in the compiled group (a|em|li), i.e. if I remove the "l" from "li", it works.
I would expect this to return the following string: "<body><xml>some stuff" (I am doing additional parsing to remove the opening tags but keeping it simple for the example).

You probably shouldn't use regex for this task, but let's see what happens...
Your problem is that you are using a negative character class, and inside character classes you can't write complex expressions - only characters. You could try a negative lookahead instead:
"</(?!a|em|li).*?>"
But this won't handle a number of cases correctly:
Comments containing things that look like tags.
Tags as strings in attributes.
Tags that start with a, em, or li but are actually other tags.
Capital letters.
etc...
You can probably fix these problems, but you need to consider whether or not it is worth it, or if it would be better to look for a solution based on a proper HTML parser.
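As an illustration of fixing two of those cases (tags that merely start with a, em, or li, and capital letters), here is one possible variant of the lookahead. It is only a sketch under the assumption that closing tags contain nothing but the name and optional whitespace; the class name ClosingTagStripper is hypothetical. Comments and attribute strings are still not handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class ClosingTagStripper {
    // The lookahead rejects only the exact names a, em and li (followed by
    // optional whitespace and '>'), so </article> or </link> are still removed.
    private static final Pattern UNWANTED_CLOSING_TAG =
        Pattern.compile("</(?!(?:a|em|li)\\s*>)[^>]*>", Pattern.CASE_INSENSITIVE);

    static String stripClosingTags(String markup) {
        return UNWANTED_CLOSING_TAG.matcher(markup).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripClosingTags("<body><xml>some stuff</xml></body>"));
    }
}
```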

I would really use a proper parser for this (e.g. JTidy). You can't parse XML/HTML using regular expressions, as it's not regular, and edge cases abound. I would rather use the XML parsing available in the standard JDK (JAXP) or a suitable third-party library (see above) and configure your output accordingly.
See this answer for more passionate info re. parsing XML/HTML via regexps.

You cannot use an alternation inside a character class. A character class always matches a single character.
You likely want to use a negative lookahead or lookbehind instead:
"</(?!a|em|li).*?>"

How to change this regex to properly extract tag attributes - should be simple

I need to "grab" an attribute of a custom HTML tag. I know this sort of question has been asked many times before, but regex really messes with my head, and I can't seem to get it working.
A sample of XML that I need to work with is
<!-- <editable name="nameValue"> --> - content goes here - <!-- </editable> -->
I want to be able to grab the value of the name attribute, which in this case is nameValue. What I have is shown below but this returns a null value.
My regex string (for a Java app, hence the \ to escape the ") is:
"(.)?<!-- <editable name=(\".*\")?> -->.*<!-- </editable> -->(.)?"
I am trying to grab the attribute together with its quotation marks, since I figure this is the easiest and most general pattern to match. It just doesn't work; any help will help me keep my hair.

I don't think you need the (.)?s at the beginning and end of your regex. And you need to put in a capturing group for getting only the content-goes-here bit:
This worked for me:
String xml = "RANDOM STUFF<!-- <editable name=\"nameValue\"> --> - content goes here - <!-- </editable> -->RANDOM STUFF";
Pattern p = Pattern.compile("<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->");
Matcher m = p.matcher(xml);
if (m.find()) {
    System.out.println(m.group(2));
} else {
    System.out.println("no match found");
}
This prints:
- content goes here -

Your search is greedy. Use "\<\!-- \<editable name=\"(.*?)\"\> --\>.*?\<\!-- \<\/editable\> --\>" (added ?). Please note that this one will not work correctly with nested <editable> elements.
If you don't want to perform syntax checking, you could also simply go with: "\<\!-- \<editable name=\"(.*?)\"\> --\>" or even "\<editable name=\"(.*?)\"\>" for better simplicity and performance.
Edit: should be
Pattern re = Pattern.compile( "\\<editable name=\"(.*?)\"\\>" );

I use JavaScript, but it should help to make the expression non-greedy where possible and to use negated character classes instead of the match-anything dot. I'm not sure how similar Java's regexes are, but instead of the expression \".*\" try \"[^\"]*\". That matches only characters within the attribute value that aren't a quote, so the expression can't match beyond the attribute value.
Hope that helps
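The same idea works in Java. Below is a small sketch that collects every name attribute using the negated character class; with the greedy \".*\" the first match would run on to the last quote on the line instead. The class and method names (EditableNames, names) are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class EditableNames {
    // [^"]* cannot cross a double quote, so group 1 is exactly the attribute value.
    private static final Pattern EDITABLE =
        Pattern.compile("<editable name=\"([^\"]*)\">");

    static List<String> names(String xml) {
        List<String> result = new ArrayList<>();
        Matcher m = EDITABLE.matcher(xml);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }
}
```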

Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
You may find the answer using TagSoup helpful.
