Ticket #3715 (closed defect: fixed)

Opened 17 months ago

Last modified 8 months ago

URL regex matching fails at closing brackets

Reported by: erlehmann
Owned by: asterix
Priority: normal
Milestone: 0.12
Component: chat
Version: 0.11.2
Severity: normal
Keywords: URL, regex
Cc:
Blocked By:
OS: All
Blocking:

Description

With "http://en.wikipedia.org/wiki/Mornington_Crescent_(game)" in the chat window, the closing bracket is not treated as part of the URL, so the resulting link points to the wrong page.

Attachments

sample.jpg (42.7 kB) - added by nk 17 months ago.

Change History

  Changed 17 months ago by erlehmann

and trac fails on it, too.

  Changed 17 months ago by steve-e

It is not possible to match a string like this with regular expressions.

Regular expressions are only capable of describing a regular language (Chomsky hierarchy type 3). What you request here is to correctly parse a context-free language (type 2). This would require something like a non-deterministic pushdown automaton.

Let me describe it that way: You cannot 'count' opening and corresponding closing brackets. Therefore I propose something between CANTFIX and WONTFIX.
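The counting that a regular expression cannot do can still be done in ordinary code after a permissive match. The following is only a sketch of that idea, not the fix adopted in this ticket; the helper name `trim_unbalanced` is hypothetical.

```python
def trim_unbalanced(candidate):
    """Drop trailing ')' characters that have no matching '(' in candidate."""
    # Keep stripping while the string ends in ')' and closers outnumber openers.
    while candidate.endswith(")") and candidate.count(")") > candidate.count("("):
        candidate = candidate[:-1]
    return candidate


# A balanced URL is left alone; a stray closing bracket is removed.
print(trim_unbalanced("http://en.wikipedia.org/wiki/Mornington_Crescent_(game)"))
print(trim_unbalanced("http://myFoobar.org)"))
```

This only handles round brackets at the very end of the candidate, which is the case discussed here.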

  Changed 17 months ago by erlehmann

why not make the regex as greedy as possible, so that it only terminates at a genuine space character?

  Changed 17 months ago by steve-e

So we would fail for all links like this?

Hey see cool stuff FooBar (http://myFoobar.org) you really should...
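To make the objection concrete, here is a small sketch with a match-until-whitespace pattern. The pattern `GREEDY` is an assumption written for illustration; it is not the expression used in the chat code.

```python
import re

# Hypothetical "as greedy as possible" pattern: scheme, "://", then everything up to whitespace.
GREEDY = re.compile(r"[a-z][a-z0-9+.-]*://\S+")

TEXT = "Hey see cool stuff FooBar (http://myFoobar.org) you really should..."

# The closing bracket of the surrounding parentheses ends up inside the match.
matched = GREEDY.search(TEXT).group(0)
print(matched)
```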

follow-up: ↓ 6   Changed 17 months ago by erlehmann

basically, yes.

if you look at RFC 1738, brackets are not unsafe characters; in fact, the special characters $-_.+!*'(), may be used unencoded in URLs. the RFC describes the characters < and > as "unsafe because they are used as the delimiters around URLs in free text".

i deduce from this that a URL can be safely terminated only when there is a space or a > character.

in reply to: ↑ 5   Changed 17 months ago by erlehmann

i was wrong, of course the quote mark (") would also terminate the string (according to RFC 1738).
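A minimal sketch of that reading of RFC 1738, assuming a URL ends only at whitespace, `<`, `>` or `"`. The pattern, the helper `first_url` and the example.org URLs are illustrative assumptions.

```python
import re

# Assumption: scheme, "://", then any run of characters that are not
# whitespace, '<', '>' or '"'.
DELIMITED = re.compile(r"[a-z][a-z0-9+.-]*://[^\s<>\"]+")


def first_url(text):
    """Return the first URL-like match in text, or None if there is none."""
    match = DELIMITED.search(text)
    return match.group(0) if match else None


print(first_url("<http://example.org/a_(b)>"))
print(first_url('say "http://example.org/x" ok'))
```

With explicit delimiters the brackets survive, but a link wrapped in plain parentheses would still pick up the trailing bracket.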

  Changed 17 months ago by nk

ok, trac fails, but Gmail chat doesn't. see the image I attached (sample.jpg).

Changed 17 months ago by nk

  Changed 14 months ago by erlehmann

OH GOD HOW DID THIS PATCH CAME INTO EXISTENCE I AM NOT GOOD WITH REGEXES.

(apply to chat-control.py)

155c155
<               self.urlfinder = re.compile(r"(www\.(?!\.)|[a-z][a-z0-9+.-]*://)[^\s<>'\"]+[^!,\.\s<>\)'\"\]]")
---
>               self.urlfinder = re.compile(r"[a-z][a-z0-9+.-]+:[%/;\?\:\@\=\&a-zA-Z0-9\$\-\_\.\+!*'\(\)\,]+")
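The effect of the patch on the URL from the description can be checked with a short script. Both patterns are copied from the diff above; the variable names are only for illustration.

```python
import re

# Pattern removed by the patch (the "<" line of the diff).
OLD = re.compile(r"(www\.(?!\.)|[a-z][a-z0-9+.-]*://)[^\s<>'\"]+[^!,\.\s<>\)'\"\]]")
# Pattern added by the patch (the ">" line of the diff).
NEW = re.compile(r"[a-z][a-z0-9+.-]+:[%/;\?\:\@\=\&a-zA-Z0-9\$\-\_\.\+!*'\(\)\,]+")

URL = "http://en.wikipedia.org/wiki/Mornington_Crescent_(game)"

old_match = OLD.search(URL).group(0)
new_match = NEW.search(URL).group(0)

# The old pattern forbids ')' as the last character, so it stops one character early.
print(old_match)
print(new_match)
```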

  Changed 14 months ago by erlehmann

New version that should not only match arbitrary URIs but also stuff starting with "www.":

(www\.|[a-z][a-z0-9+.-]+:)[%/;\?\:\@\=\&a-zA-Z0-9\$\-\_\.\+!*'\(\)\,]+
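A quick sketch of how this version behaves, using the expression exactly as quoted above. The helper `find_links` and the www.example.org sample are assumptions made for the example.

```python
import re

# Expression as posted in this comment.
PATTERN = re.compile(r"(www\.|[a-z][a-z0-9+.-]+:)[%/;\?\:\@\=\&a-zA-Z0-9\$\-\_\.\+!*'\(\)\,]+")


def find_links(text):
    """Return every substring of text matched by PATTERN."""
    return [m.group(0) for m in PATTERN.finditer(text)]


print(find_links("see www.example.org/page now"))
print(find_links("Hey see cool stuff FooBar (http://myFoobar.org) you really should..."))
```

As quoted, this expression accepts `)` anywhere, so for a link wrapped in parentheses the closing bracket is still part of the match.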

  Changed 14 months ago by erlehmann

Update: Correct RFC is <http://www.ietf.org/rfc/rfc3986.txt>. I am looking into it.

  Changed 12 months ago by steve-e

  • status changed from new to closed
  • resolution set to fixed
  • milestone set to 0.12

(In [29a67c9e0fe547c875e052f0f48445b8dfb0ed0c]) [erlehmann] Improved regular URL matching expressions. Fixes #3715.

URLs like (http://myFoobar.org) and http://en.wikipedia.org/wiki/Mornington_Crescent_(game) are now correctly detected.
