#vlan

felix@pod.pc-tiede.de

Here's another #post-mortem for you.

Tl;dr: Be sure your diagnostic tools show everything and do not hide vital information even though running in verbose or verbose-verbose mode.

First of all: My home #network is not as simple as it could be and certainly not as simple as it would be with most private users around the globe.
That brings with it its very own bunch of challenges, none of which is of any particular concern here. Suffice it to say, a regular consumer-grade internet access device (colloquially "WiFi router") would not do it. I have heavier requirements.

That in turn means my #ISP did not supply me with a real router – in fact I decline any ISP-supplied device whenever possible. Instead I bought my own device which does fulfill my requirements, in this case a semi-professional router with #PPPoE #pass-through capabilities, essentially a #DSL‌-modem.

The latter is the important part, because that is where my weekend worries began.

The modem did work for years on end with no reason for me to ever regret my decision to buy it in the first place.
Until last friday morning when it rebooted from the latest firmware upgrade.

At that very moment my internet connection dropped with a very dreaded error message from the #PPP daemon: "Timeout waiting for PADO packets."
That's tech-talk for "the ISP does no longer reply to my connection requests."

The message was already known to me and experience had me on the side of stupid coincidence which I have seen more often than I could count. So, quickly I was on the phone with the ISP requesting a port reset hoping for instant recovery.
Which didn't happen.

I'll skip the boring part of booting, downgrading, recabling and testing (even with a new OS install) for hours while other people expected me to work, but it went on until monday morning. Needless to say, I had a lot of time to think during the weekend.

I will just mention that most of the time I got beyond that PADO step, yet #pppd failed to get any replies for its #LCP configuration requests (that way I can add some more hashtags to the post, yay!).

On monday morning I finally figured I should do a real trace of the conversation appearing on the wire, for as far as #tcpdump told me not only was I sending requests, I even got answers from the ISP. Only, they were ignored by my endpoint software.
tcpdump's console output did not reveal why this should happen, a different network card, updated endpoint software - nothing actually helped. Requests were sent, replies came in and were promptly ignored.

Until I was fed up enough and decided to have a deep look into that very conversation on the wire, using #wireshark. Wireshark is a very powerful tool and its most notable feature is it doesn't hide any information by default.
And it was this now unhidden information which finally pointed me in the right direction: The oh-so-dutifully ignored replies from my ISP carried with them a #VLAN tag.

Short excursion: I am using a #VDSL landline on which the provider offers "triple-play", that is, Internet, Telephone and TV all over the same cable. For that to work technically, the ISP requires its customers to tag their "regular" outgoing internet traffic with a certain VLAN id before sending it out to the ISP. Likewise, the ISP tags all incoming internet traffic with same id.

Usually, the modem should do the tag/untag of the traffic before handing it over to the LAN, in my case to my PPP endpoint.
And it was this very tag/untag which actually stopped working.

From this point on the solution was pretty simple: If the tagging inside the modem is unreliable, the feature can be disabled (in fact, it can be configured to suit the ISP's requirements). So I did that. Now that caused the situation to deteriorate in the first step, because now nothing was coming back at all anymore. Where at least I got ignored replies from the ISP before now I got nothing.

However, #linux is capable of a lot of things, one of which is making network cards do "the right thing". In this case I reconfigured the system's network to add the required VLAN tag to everything related to the much needed PPP connection. A side effect is that the kernel now also removes the tag from any incoming frame.
At this point my PPP endpoint finally got the replies again and within seconds the connection was back up as it was designed in the first place.

Needless to say I think it's a bug in the modem's firmware, yet since downgrading did not improve the situation, it seems the bug was introduced much earlier but never manifested itself.
Also, to reference the "tl;dr" from above: Had I done the wireshark trick right from the start I probably would have stumbled over the wrong VLAN tag much earlier, saving me a lot of testing hours as well as a lot of worries on how to get the system back to its intended state.
The lesson I learned is that even though diagnostic tools can be made to be very verbose, sometimes they still skip some information vital to the issue in question and thus can not necessarily be fully trusted. It is not only important to have those tools at hand, it is quite as important to know their limitations and how to handle or overcome them.