Ticket #3715 (closed defect: fixed)

Opened 7 months ago

Last modified 2 months ago

URL regex matching fails at closing brackets

Reported by: erlehmann
Owned by: asterix
Priority: normal
Milestone: 0.12
Component: chat
Version: 0.11.2
Severity: normal
Keywords: URL, regex
Cc:
OS: All

Description

with "http://en.wikipedia.org/wiki/Mornington_Crescent_(game)" in the chat window the last bracket is not considered a part of the URL; the link fails badly.

Attachments

sample.jpg (42.7 kB) - added by nk 7 months ago.

Change History

  Changed 7 months ago by erlehmann

and trac fails it, too.

  Changed 7 months ago by steve-e

It is not possible to match a string like this with regular expressions.

Regular expressions are only capable of describing a regular language (Chomsky hierarchy type-3). What you request here is to correctly parse a context-free language (type-2). This would require something like a non-deterministic pushdown automaton.

Let me describe it that way: You cannot 'count' opening and corresponding closing brackets. Therefore I propose something between CANTFIX and WONTFIX.
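
For illustration only, a minimal sketch of how the counting could be done outside the regex: match a candidate greedily, then drop trailing closing brackets that have no opening partner. The pattern `CANDIDATE` and the helper `find_urls` are hypothetical names invented for this sketch; this is not the code in chat-control.py and not the fix that was eventually committed.

```python
import re

# Hypothetical candidate pattern (an assumption for this sketch): a scheme,
# "://", then any run of characters that are not whitespace, angle brackets,
# or quote marks.
CANDIDATE = re.compile(r"[a-z][a-z0-9+.-]*://[^\s<>\"']+")


def find_urls(text):
    """Return URL candidates, trimming trailing ')' that have no matching '('."""
    urls = []
    for match in CANDIDATE.finditer(text):
        url = match.group(0)
        # The "counting" a plain regex cannot do: while the candidate ends in
        # ')' and holds more ')' than '(', the last bracket belongs to the
        # surrounding text, so strip it.
        while url.endswith(")") and url.count(")") > url.count("("):
            url = url[:-1]
        urls.append(url)
    return urls


print(find_urls("http://en.wikipedia.org/wiki/Mornington_Crescent_(game)"))
print(find_urls("Hey see cool stuff FooBar (http://myFoobar.org) you really should..."))
```

With this approach the Wikipedia link keeps its closing bracket because the brackets inside the candidate are balanced, while the bracket that wraps the second link is dropped.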

  Changed 7 months ago by erlehmann

why not make the regex as greedy as possible so that it only terminates when there is a genuine space character ?

  Changed 7 months ago by steve-e

So we would fail for all links like this?

Hey see cool stuff FooBar (http://myFoobar.org) you really should...

follow-up: ↓ 6   Changed 7 months ago by erlehmann

basically, yes.

if you look at RFC 1738, brackets are not unsafe characters; in fact, the special characters $-_.+!*'(), may be used unencoded in URLs. the RFC describes the characters < and > as "unsafe because they are used as the delimiters around URLs in free text".

i deduce from that that a URL can be safely terminated only when there is a space or a > character.

in reply to: ↑ 5   Changed 7 months ago by erlehmann

i was wrong, of course the quote mark (") would also terminate the string (according to RFC 1738).
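
A rough sketch of the "as greedy as possible" idea from the comments above, assuming a URL runs from its scheme until whitespace, <, > or a quote mark. The name `GREEDY` and the exact pattern are assumptions made for this example, not a pattern taken from the ticket.

```python
import re

# Assumption for this sketch: scheme, colon, then everything up to the first
# whitespace character, '<', '>' or '"'.
GREEDY = re.compile(r'[a-z][a-z0-9+.-]*:[^\s<>"]+')

wiki = "see http://en.wikipedia.org/wiki/Mornington_Crescent_(game) for the rules"
wrapped = "Hey see cool stuff FooBar (http://myFoobar.org) you really should..."

# The closing bracket of the Wikipedia link is kept ...
print(GREEDY.findall(wiki))
# ... but the bracket that merely wraps the second link is swallowed too.
print(GREEDY.findall(wrapped))
```

This shows the trade-off under discussion: the Wikipedia link is matched in full, and the wrapped link comes back as "http://myFoobar.org)" with the stray bracket attached.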

  Changed 7 months ago by nk

ok trac fails, but Gmail chat doesn't. see img I attach

Changed 7 months ago by nk

  Changed 4 months ago by erlehmann

OH GOD HOW DID THIS PATCH CAME INTO EXISTENCE I AM NOT GOOD WITH REGEXES.

(apply to chat-control.py)

155c155
<               self.urlfinder = re.compile(r"(www\.(?!\.)|[a-z][a-z0-9+.-]*://)[^\s<>'\"]+[^!,\.\s<>\)'\"\]]")
---
>               self.urlfinder = re.compile(r"[a-z][a-z0-9+.-]+:[%/;\?\:\@\=\&a-zA-Z0-9\$\-\_\.\+!*'\(\)\,]+")

  Changed 4 months ago by erlehmann

New version that should not only match arbitrary URIs but also stuff starting with "www.":

(www\.|[a-z][a-z0-9+.-]+:)[%/;\?\:\@\=\&a-zA-Z0-9\$\-\_\.\+!*'\(\)\,]+
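
A small sketch for trying the pattern above against the examples from this ticket. The pattern is copied verbatim from the comment; the helper `matches` and the host www.example.org are made up for illustration, and the pattern actually committed in [9845] may differ.

```python
import re

# Pattern copied from the comment above (not necessarily what was committed).
URLFINDER = re.compile(
    r"(www\.|[a-z][a-z0-9+.-]+:)[%/;\?\:\@\=\&a-zA-Z0-9\$\-\_\.\+!*'\(\)\,]+"
)


def matches(text):
    """Return the full text of every match (group 0), not just the prefix group."""
    return [m.group(0) for m in URLFINDER.finditer(text)]


print(matches("http://en.wikipedia.org/wiki/Mornington_Crescent_(game)"))
print(matches("Hey see cool stuff FooBar (http://myFoobar.org) you really should..."))
print(matches("go to www.example.org now"))
```

Because brackets are in the allowed character class, this version keeps the closing bracket of the Wikipedia link but also keeps the bracket after http://myFoobar.org, so the wrapped case would still need separate handling.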

  Changed 4 months ago by erlehmann

Update: Correct RFC is <http://www.ietf.org/rfc/rfc3986.txt>. I am looking into it.

  Changed 2 months ago by steve-e

  • status changed from new to closed
  • resolution set to fixed
  • milestone set to 0.12

(In [9845]) [erlehmann] Improved regular URL matching expressions. Fixes #3715.

URLs like (http://myFoobar.org) and http://en.wikipedia.org/wiki/Mornington_Crescent_(game) are now correctly detected.
