
I'm not much of a blogger, but here's the short version. If anyone happened to be on #freepbx yesterday morning they might already have seen this.

I had just upgraded and migrated one of my clients from an on-premises FreePBX system, a few years out of date and running on a repurposed desktop computer with a failing fan, to a brand new instance running on a VPS. Basic phone functionality was working fine, but their main ring group was taking a few seconds to stop ringing when answered. Calls would ring in to all phones effectively simultaneously as expected, but when someone answered, certain phones kept ringing for almost four full seconds afterward.

In the past I had seen similar behaviors on AT&T DSL caused by their mandatory modem/router device having an anti-flood filter enabled by default which saw a bunch of nearly identical UDP packets hitting at once and dropped them after the first few. This site has cable internet through a dumb modem so I knew it wasn't that, but they had recently had their IT side taken over by a new company who put in a new firewall so that was a plausible answer.

Their IT, however, had been taken over from us, so I wasn't about to go accusing them of getting it wrong without strong evidence. I'm also just that kind of person: I hate when someone blames me or my gear for problems we're not causing, so I do my best to never be that guy either. I'll waste an extra few hours of mine any day of the week to be sure I'm not accusing someone else of getting it wrong without a reason.

I fired up sngrep on the server, waited for a call to come in, and saved all the SIP sessions that resulted. I downloaded that file, loaded it up in Wireshark, and saw that while the INVITE messages to start ringing all went out more or less simultaneously (27 phones in ~5ms), the CANCEL messages that stop them from ringing once one answered were sent out sequentially, with the PBX waiting for each phone to respond and confirm it had stopped ringing before sending the next. Clearly this wasn't right, and it obviously wasn't a problem with the firewall either.
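For anyone who would rather measure this than eyeball it in Wireshark, here is a minimal sketch of the comparison. It assumes you have already exported (timestamp, SIP method) pairs from the capture by whatever means you prefer; the function name and the timestamps below are made up for illustration and are not taken from my actual capture.

```python
from collections import defaultdict


def method_spread_ms(events):
    """Summarize how tightly each SIP method was sent.

    events: iterable of (timestamp_seconds, method) pairs.
    Returns {method: (count, total_spread_ms, largest_gap_ms)}.
    """
    by_method = defaultdict(list)
    for ts, method in events:
        by_method[method].append(ts)

    summary = {}
    for method, stamps in by_method.items():
        stamps.sort()
        # Gaps between consecutive messages of the same method.
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        summary[method] = (
            len(stamps),
            (stamps[-1] - stamps[0]) * 1000.0,
            max(gaps, default=0.0) * 1000.0,
        )
    return summary


# Hypothetical data: 27 INVITEs sent as a burst, 27 CANCELs sent one at a time.
invites = [(10.0 + i * 0.0002, "INVITE") for i in range(27)]
cancels = [(15.0 + i * 0.150, "CANCEL") for i in range(27)]

summary = method_spread_ms(invites + cancels)
for method, (count, spread, gap) in summary.items():
    print(f"{method}: {count} messages over {spread:.1f} ms, largest gap {gap:.1f} ms")
```

A burst shows up as a small total spread with tiny gaps; sequential sending shows up as a spread of seconds with roughly even gaps between messages.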

At that point I started looking at the Asterisk logs and saw that an AGI script was being run for each line that was ringing which wasn't there previously. That script was associated with a new FreePBX module for missed call notifications which was installed but unconfigured on the new server. It didn't indicate it was doing anything in the UI, but it sure seemed to be doing something in the logs.

I uninstalled that module and on the next call all the CANCEL messages went out in ~5ms, just like the INVITEs. I then filed a bug with FreePBX documenting what happened, because I'm pretty sure it's not expected or desired for simply having that module installed to cause massive delays in ring groups.

---

In this case the packet captures demonstrated conclusively that the problem was on the server itself and not in the network. If the capture at the server had looked reasonable my next step would have been to have the IT vendor capture traffic on their firewall at the same time as I was capturing at the server so we could compare and see if it's getting messed with along the way, but here it was not necessary.
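Had it come to that, the comparison itself is simple. Here is a rough sketch, assuming each capture has been reduced to (key, timestamp) pairs where the key identifies one message (for example Call-ID, CSeq, and destination), and assuming the two capture clocks are reasonably in sync. The function name, keys, and timestamps are all hypothetical.

```python
def compare_captures(server_events, firewall_events):
    """Compare messages seen at the server with those seen at the firewall.

    Each argument is a list of (key, timestamp_seconds) pairs.
    Returns (missing_keys, delays_ms), where missing_keys were captured at
    the server but never at the firewall, and delays_ms maps each matched
    key to the time difference between the two capture points.
    """
    seen_at_firewall = dict(firewall_events)
    missing = [key for key, _ in server_events if key not in seen_at_firewall]
    delays_ms = {
        key: (seen_at_firewall[key] - ts) * 1000.0
        for key, ts in server_events
        if key in seen_at_firewall
    }
    return missing, delays_ms


# Hypothetical example: three CANCELs leave the server, two reach the firewall.
server = [
    (("call-1", "102 CANCEL", "10.0.0.11"), 15.0000),
    (("call-1", "102 CANCEL", "10.0.0.12"), 15.0002),
    (("call-1", "102 CANCEL", "10.0.0.13"), 15.0004),
]
firewall = [
    (("call-1", "102 CANCEL", "10.0.0.11"), 15.0012),
    (("call-1", "102 CANCEL", "10.0.0.13"), 15.0017),
]

missing, delays_ms = compare_captures(server, firewall)
print("never reached the firewall:", missing)
print("delay per message (ms):", delays_ms)
```

Messages that show up at the server but not at the firewall, or that arrive much later than they left, point at the network; if everything matches, the problem is at one end or the other.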

Like toast0 mentioned, captures help you narrow down where the problem is.


