Ticket #4098 (closed defect: fixed)

Opened 4 months ago

Last modified 4 months ago

[Win32] Hanging sockets

Reported by: js Owned by:
Priority: highest Milestone: 0.12
Component: None Version: svn
Severity: blocker Keywords:
Cc: OS: Windows

Description

It seems that on Windows a lot of things cause hanging sockets. This is *very* evil because, to the user, everything looks fine - he is connected, he is online. It's just that no message he sends will ever arrive at the other end, and he'll never receive a message. It would even be better if Gajim just crashed. This is especially a BIG issue because every time Gajim isn't shut down properly, it will hang the next time it's started. As an example, one user didn't notice for 2 days that the socket was hanging and wondered why nobody talked to him.

It seems that nearly everything can cause this hanging socket - every traceback seems to cause it. The traceback apparently isn't shown, or at least not before you quit, and then the socket hangs. The most trivial case I've seen so far: when a Windows user received a file and couldn't resolve my host specified in ft_add_hosts, the socket just hung, and when he quit Gajim, a .log file was created (I'll create a separate bug for that TB).

We *REALLY* need to find a way to prevent the socket from hanging, even if it's only by quitting Gajim so the user at least notices it!

Attachments

Change History

Changed 4 months ago by js

Two interesting tracebacks, both occurred when the connection hung:

Traceback (most recent call last):
  File "gajim.py", line 2734, in process_connections
  File "common\xmpp\idlequeue.pyc", line 217, in process
AttributeError: 'NoneType' object has no attribute 'pollend'
Traceback (most recent call last):
  File "gajim.py", line 2734, in process_connections
  File "common\xmpp\idlequeue.pyc", line 211, in process
  File "common\xmpp\transports_nb.pyc", line 351, in pollin
  File "common\xmpp\transports_nb.pyc", line 495, in _do_receive
  File "common\xmpp\dispatcher_nb.pyc", line 355, in dispatch
  File "common\connection_handlers.pyc", line 553, in _siResultCB
  File "common\connection_handlers.pyc", line 195, in send_socks5_info
  File "common\socks5.pyc", line 85, in start_listener
  File "common\socks5.pyc", line 803, in __init__
socket.gaierror: (10022, 'getaddrinfo failed')

Changed 4 months ago by js

Having a further look at this, it might be possible that not only Windows is affected. It seems that EVERY exception in the idlequeue causes the connection to "hang". No more stanzas can be sent or received then.

I will investigate further and change OS if I can reproduce that on Linux.
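
One plausible mechanism for this, sketched here under assumptions that this ticket does not confirm: if the idle queue is driven by a periodic callback that is only rescheduled when it returns True (the convention GLib-style timeouts use), then an exception escaping the callback means it never returns True and is silently dropped. `TinyScheduler`, `tick` and `make_poller` are hypothetical names for illustration, not Gajim code.

```python
import traceback


class TinyScheduler:
    """Hypothetical stand-in for a main loop with periodic callbacks."""

    def __init__(self):
        self.callbacks = []

    def add(self, callback):
        self.callbacks.append(callback)

    def tick(self):
        """Run every callback once; keep only those that return True."""
        survivors = []
        for callback in self.callbacks:
            try:
                keep = callback()
            except Exception:
                # The loop reports the error but does not reschedule.
                traceback.print_exc()
                keep = False
            if keep:
                survivors.append(callback)
        self.callbacks = survivors


def make_poller(fail_on_call):
    """Build a fake process_connections that raises on one chosen call."""
    state = {"calls": 0}

    def process_connections():
        state["calls"] += 1
        if state["calls"] == fail_on_call:
            raise ValueError("simulated handler bug")
        return True

    return process_connections, state


scheduler = TinyScheduler()
poller, poller_state = make_poller(fail_on_call=2)
scheduler.add(poller)
for _ in range(5):
    scheduler.tick()

# The poller raised on its second call and was never run again.
print(poller_state["calls"], len(scheduler.callbacks))
```

If this is what happens, the sockets stay open but nothing reads from or writes to them anymore, which would match the "connected but dead" symptom.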

Changed 4 months ago by js

This kinda fixes it:

Index: idlequeue.py
===================================================================
--- idlequeue.py	(revision 9939)
+++ idlequeue.py	(working copy)
@@ -214,6 +214,8 @@
 			if q:
 				q.pollout()
 		for fd in waiting_descriptors[2]:
-			self.queue.get(fd).pollend()
+			q = self.queue.get(fd)
+			if q:
+				q.pollend()
 		self.check_time_events()
 		return True

However, to me this is more of a workaround, as there shouldn't be None in waiting_descriptors[2]. Direct file transfers work with this diff, but I should investigate why None was added there. I'll commit it anyway, as it at least fixes the very ugly symptoms of the actual bug.
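
The guard from the diff can be tried in isolation. This is only a sketch: it assumes the queue is a plain dict mapping file descriptors to handler objects, so that `get()` returns None for a descriptor that is reported by the poll but no longer registered. `Handler` and `dispatch_errors` are hypothetical names, not the actual idlequeue.py code.

```python
class Handler:
    """Hypothetical handler that records whether pollend() was called."""

    def __init__(self):
        self.ended = False

    def pollend(self):
        self.ended = True


def dispatch_errors(queue, error_fds):
    """Call pollend() on each registered handler; skip unknown descriptors."""
    skipped = []
    for fd in error_fds:
        handler = queue.get(fd)  # None if fd is not (or no longer) registered
        if handler:
            handler.pollend()
        else:
            skipped.append(fd)
    return skipped


queue = {3: Handler()}
# fd 7 was reported by the poll but has no handler in the queue.
skipped = dispatch_errors(queue, [3, 7])
print(queue[3].ended, skipped)
```

Returning the skipped descriptors instead of ignoring them silently would make it easier to log them and find out where the stale entries come from.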

Changed 4 months ago by js

  • status changed from new to closed
  • resolution set to fixed

(In [9941]) Partially fix #4098.

This is, however, only half a fix: it stops the failure when there's None in waiting_descriptors[2], but there should never be None there, so I have to investigate why there was.

This patch is, however, correct and needed and the other queues also have that check. It's just that I also need to fix the reason for the None in the queue.

Changed 4 months ago by js

  • status changed from closed to reopened
  • resolution fixed deleted

Reopened as this is only a half fix.

Changed 4 months ago by js

Yet another traceback that hung the socket:

Traceback (most recent call last):
  File "gajim.py", line 2741, in process_connections
  File "common\xmpp\idlequeue.pyc", line 211, in process
  File "common\xmpp\transports_nb.pyc", line 351, in pollin
  File "common\xmpp\transports_nb.pyc", line 495, in _do_receive
  File "common\xmpp\dispatcher_nb.pyc", line 355, in dispatch
  File "common\connection_handlers.pyc", line 1679, in _messageCB
  File "session.pyc", line 413, in handle_negotiation
  File "common\stanza_session.pyc", line 712, in accept_e2e_alice
  File "secrets.pyc", line 190, in secrets
  File "secrets.pyc", line 166, in load_secrets
ValueError: Input strings must be a multiple of 16 in length
Traceback (most recent call last):
  File "gajim.py", line 2741, in process_connections
  File "common\xmpp\idlequeue.pyc", line 211, in process
  File "common\xmpp\transports_nb.pyc", line 351, in pollin
  File "common\xmpp\transports_nb.pyc", line 495, in _do_receive
  File "common\xmpp\dispatcher_nb.pyc", line 355, in dispatch
  File "common\connection_handlers.pyc", line 1679, in _messageCB
  File "session.pyc", line 413, in handle_negotiation
  File "common\stanza_session.pyc", line 712, in accept_e2e_alice
  File "secrets.pyc", line 190, in secrets
  File "secrets.pyc", line 166, in load_secrets
ValueError: Input strings must be a multiple of 16 in length

It really seems that *EVERYTHING* that throws an exception in idlequeue.py hangs the connection.

Changed 4 months ago by js

  • status changed from reopened to closed
  • resolution set to fixed

(In [9943]) This should fix #4098. However, I'll leave the bug open until I'm very, very sure and have given it a few days of testing.

Changed 4 months ago by js

  • status changed from closed to reopened
  • resolution fixed deleted

This restarts the idle queue when a traceback happens. This should fix it, but I want to give it some more testing before I finally close this bug.

Regarding the tracebacks above: These are all bugs of their own which just caused this bug. They should be handled in separate tickets.
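
For readers who want to see the general idea of keeping the idle queue alive after a traceback, here is a minimal sketch. It is not the code from r9943: it assumes a periodic callback that must return True to be called again, and `make_resilient`, `flaky_process` and `count_error` are hypothetical names.

```python
import logging

log = logging.getLogger("idlequeue_sketch")


def make_resilient(process, on_error):
    """Wrap a periodic callback so an exception cannot unschedule it."""

    def wrapper():
        try:
            return process()
        except Exception:
            # Log the traceback instead of letting it escape the callback.
            log.exception("idle queue iteration failed; continuing")
            on_error()  # hook for e.g. notifying the user
            return True  # ask the main loop to call us again

    return wrapper


calls = {"process": 0, "errors": 0}


def flaky_process():
    """Simulated queue iteration that fails on its second call."""
    calls["process"] += 1
    if calls["process"] == 2:
        raise ValueError("simulated handler bug")
    return True


def count_error():
    calls["errors"] += 1


resilient = make_resilient(flaky_process, count_error)
results = [resilient() for _ in range(4)]
print(results, calls)
```

The wrapper keeps the loop running, but it does not fix the underlying bugs; the logged tracebacks still need their own tickets.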

Changed 4 months ago by js

  • status changed from reopened to closed
  • resolution set to fixed

It seems this didn't happen again - I haven't seen any Windows user hang since r9943. I guess that really fixed it. Closed.
