Bug Description

[Impact]

* Certain WiFi captive portals do not handle EDNS0 queries as the RFC requires.
* Instead of responding with the captive portal IP address, they respond with domain not found (NXDOMAIN).
* This prevents the user from reaching the captive portal login page, authenticating, and gaining access to the internet.

[The Fix]

* As per tcpdump captures, the problem arises from receiving NXDOMAIN when querying with EDNS0
* And receiving the right response without EDNS0
* The solution is to downgrade the transaction: when an EDNS0 query returns NXDOMAIN, retry it without EDNS0 in the hope of getting the right answer.
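
The downgrade-and-retry idea above can be sketched in a few lines of Python. This is only an illustration of the logic, not the actual systemd-resolved change (which is written in C); the names `resolve_with_downgrade`, `QueryFn` and `fake_portal`, and the 192.0.2.1 address, are hypothetical and exist only for this example.

```python
from typing import Callable, List, Tuple

# Standard DNS response codes.
NOERROR = 0
NXDOMAIN = 3

# Hypothetical query callable: (name, use_edns0) -> (rcode, answers).
QueryFn = Callable[[str, bool], Tuple[int, List[str]]]


def resolve_with_downgrade(name: str, query: QueryFn) -> Tuple[int, List[str], bool]:
    """Query with EDNS0 first; on NXDOMAIN, retry once without EDNS0.

    Returns (rcode, answers, used_edns0) for the attempt whose result is kept.
    """
    rcode, answers = query(name, True)
    if rcode != NXDOMAIN:
        # Anything other than NXDOMAIN is taken at face value.
        return rcode, answers, True
    # NXDOMAIN with EDNS0 may just be a portal that mishandles EDNS0,
    # so ask again with a plain query before giving up.
    rcode, answers = query(name, False)
    return rcode, answers, False


def fake_portal(name: str, use_edns0: bool) -> Tuple[int, List[str]]:
    """Stand-in for a misbehaving portal: NXDOMAIN whenever EDNS0 is used."""
    if use_edns0:
        return NXDOMAIN, []
    return NOERROR, ["192.0.2.1"]  # documentation address, assumed for the example


result = resolve_with_downgrade("securelogin.arubanetworks.com", fake_portal)
print(result)
```

A name that really does not exist still ends up as NXDOMAIN; it just costs one extra query.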

I have an odd network situation that I have so far managed to narrow down to the inability to resolve a domain via systemd-resolved which is resolvable with nslookup. If I use nslookup against the two nameservers on this network I get answers for the domain, but ping says it is unable to resolve the same domain (as do browsers and crucially the captive portal mechanism).

Yes, this is a captive portal situation on up-to-date 17.10. The captive portal popup fails with a DNS error looking for securelogin.arubanetworks.com and then hilarity ensues. Manually editing /etc/resolv.conf to use one of these DNS servers makes it all work. So the problem is systemd-resolved which is the default resolv.conf:

~$ cat /etc/resolv.conf
# This file is managed by man:systemd-resolved(8). Do not edit.
#
# 127.0.0.53 is the systemd-resolved stub resolver.
# run "systemd-resolve --status" to see details about the actual nameservers.
nameserver 127.0.0.53
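
For anyone wanting to check this on their own machine, here is a small Python sketch that reads resolv.conf-style text and reports whether the only nameserver is the systemd-resolved stub. The helper names `parse_nameservers` and `uses_resolved_stub` are made up for this example, and the parsing is deliberately minimal.

```python
from typing import List

STUB_ADDRESS = "127.0.0.53"  # the systemd-resolved stub resolver address


def parse_nameservers(resolv_conf_text: str) -> List[str]:
    """Return the addresses from 'nameserver' lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def uses_resolved_stub(resolv_conf_text: str) -> bool:
    """True when the stub resolver is the only configured nameserver."""
    return parse_nameservers(resolv_conf_text) == [STUB_ADDRESS]


sample = """\
# This file is managed by man:systemd-resolved(8). Do not edit.
#
# 127.0.0.53 is the systemd-resolved stub resolver.
nameserver 127.0.0.53
"""
print(parse_nameservers(sample), uses_resolved_stub(sample))
```

On a real system the text would come from reading /etc/resolv.conf instead of the inline sample.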

I also experienced this problem on Ubuntu 17.10. I tried to connect to a wifi network that utilized a captive portal and failed miserably. The DNS redirect to the captive portal would not occur. Manually overwriting /etc/resolv.conf with nameservers other than "127.0.0.53" got me to the captive portal login page. I spent a few hours on vacation trying to work around this. Really unfortunate problem.

What are the contents of /etc/resolv.conf?
# This file is managed by man:systemd-resolved(8). Do not edit.
#
# 127.0.0.53 is the systemd-resolved stub resolver.
# run "systemd-resolve --status" to see details about the actual nameservers.
nameserver 127.0.0.53

Where does the symlink of /etc/resolv.conf point to? (if it is a symlink)
../run/systemd/resolve/stub-resolv.conf

What are the contents of /etc/systemd/resolved.conf?
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.
#
# Entries in this file show the compile time defaults.
# You can change settings by editing this file.
# Defaults can be restored by simply deleting this file.
#
# See resolved.conf(5) for details

All captive portal wifi connections are failing with Ubuntu 17.10. This was confirmed on two different captive portals which both worked just fine with other devices. The solution was to go back to Ubuntu 17.04.

You get to the portal page and put in whatever info, but afterwards the redirect wouldn't work properly. This failed in both Chromium and Firefox, and with both "use system proxy" and "automatic proxy" settings. Manual proxy configuration settings weren't tried, so I can't confirm whether this might be a problem relating to the proxy pac file.

Either way this is a pretty critical issue since it seems all captive portals don't work, making Ubuntu 17.10 on laptops pretty broken at the moment.

- com to arubanetworks.com: Authoritative AAAA records exist for dns5.arubanetworks.com, but there are no corresponding AAAA glue records.
- com to arubanetworks.com: The following NS name(s) were found in the authoritative NS RRset, but not in the delegation NS RRset (i.e., in the com zone): dns6.arubanetworks.com
- com to arubanetworks.com: The following NS name(s) were found in the delegation NS RRset (i.e., in the com zone), but not in the authoritative NS RRset: dns4.arubanetworks.com

Checking for the state of the domain from outside a captive portal won't get much; "securelogin.arubanetworks.com" only exists while you're behind the captive portal, in unauthenticated mode.

I think the next steps will be to do some testing with various captive portals and see why systemd-resolved is unhappy with them. As far as I can tell from the provided answers, everything is in place (/etc/resolv.conf has the right values, systemd-resolved knows about the right nameservers), so some part of resolved is failing to send/receive the DNS messages in a meaningful way: this has all the hallmarks of a systemd-resolved bug.

The next steps for debugging this will be to stop systemd-resolved and restart it, then attempt to resolve the domain normally (via ping, for example):

So I managed to reproduce this in a way that looks correct. Starbucks WiFi here uses Datavalet, but it fails in a way that looks the same: it thinks you're logged in once you have clicked the "Login" button on the captive portal page, then updates DNS, but still attempts to look up secure.datavalet.io (which no longer resolves because we're on the public side now). Then, once you try to hit another site, you never got to the "landing" page on the public side, so it thinks you're still unauthenticated.

On one hand, this looks like just really terrible behavior of the captive portal, and it "worked" only because we were pretty slow to deal with the changing settings; or because we were caching the DNS responses for just long enough.

I got logs from my reproducer as well as packet captures, and I will have to comb through them to figure out if there's anything really obscure and wrong, but my initial guess is that this is an issue related to DNS caching. Probably the cache is invalidated when the IP changes as we get to the public side, but ought to retain the resolution address for the portal.

It needs a little more investigation and testing, but I think this qualifies as "Triaged" now, and there should be some fix or workaround to deal with Aruba and Datavalet, both of which are reasonably common hotspot infrastructure.

I believe caching is enabled by default; this might be a regression in behaviour since switching to the 127.0.0.53 caching resolver. It does drop caches upon every new connection / re-connection.

It is freezing and snowing out there. But I guess I'll have to make the trek to Starbucks with my laptop tomorrow. I will also watch out for this bug happening during my travels next week through different airports.

So, after looking at it more, it seems the issue with Datavalet is due to EDNS0-enabled queries failing to be captured and rewritten by the captive portal. It might not in fact be the same issue as for securelogin.arubanetworks.com, though the wireless hardware comes from the same manufacturer.
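
To make the difference concrete, here is a rough Python sketch that builds the same A query with and without the EDNS0 OPT pseudo-record, so the two packets can be compared byte for byte. It follows the wire layout from RFC 1035 and RFC 6891; the function names, the fixed query ID and the 4096-byte advertised payload size are assumptions for this example, not what systemd-resolved actually sends.

```python
import struct


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_opt_record(udp_payload: int = 4096) -> bytes:
    """EDNS0 OPT pseudo-RR: root name, TYPE 41, CLASS = UDP payload size,
    TTL field (extended RCODE, version, flags) all zero, empty RDATA."""
    return b"\x00" + struct.pack("!HHIH", 41, udp_payload, 0, 0)


def build_query(name: str, use_edns0: bool, query_id: int = 0x1234) -> bytes:
    """Build a recursive A/IN query, optionally carrying an OPT record."""
    arcount = 1 if use_edns0 else 0
    # Header: ID, flags (RD bit set), QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    packet = header + question
    if use_edns0:
        packet += build_opt_record()
    return packet


plain = build_query("secure.datavalet.io", use_edns0=False)
edns = build_query("secure.datavalet.io", use_edns0=True)
print(len(plain), len(edns))
```

The only differences are the additional-record count in the header and the 11 trailing OPT bytes. Sending each packet over UDP to port 53 of the portal's nameserver and comparing the response codes is one way to check whether a given portal shows this behaviour.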

I have a tentative patch ready for secure.datavalet.io; I will adapt it to work the same way for securelogin.arubanetworks.com, and I'll put all this in a PPA for testing.

Unfortunately, I haven't found a location that uses securelogin.arubanetworks.com, so we will need others (if you remember where you encountered such a network) to make sure it really works correctly before uploading to release.

I updated the importance and status as well; this bug was not in fact Fix Committed (I checked with Dimitri). Fix Committed was set because of the proposed change to caching, but it doesn't look like it helps.

I also have been experiencing this. Connecting to a WiFi network that uses any captive portal fails.

I dual boot with Windows, and everything is fine in Windows.

System:
Ubuntu 17.10 (Clean install)

Behavior:
After authenticating with the portal and connecting I cannot resolve DNS.

My workaround (after a lot of digging) was to edit "/etc/resolv.conf"

And add "nameserver 8.8.8.8" (Google Public DNS)

So my nameservers looked like:

nameserver 127.0.0.53
nameserver 8.8.8.8

I had to do this after every reboot obviously.

Also very peculiar:
I have been using this workaround every day since installing 17.10 one week ago. However, I connected to a regular home WiFi network (no captive portal) fine last night. Then, after returning to my work captive-portal WiFi network, it's now connecting and resolving DNS successfully, without my having to update /etc/resolv.conf. I don't understand why...

Wanted to mention that while searching forums and bug reports I came across many reported bugs that are likely duplicates of this issue, where users report spotty or intermittent WiFi issues etc. Many think it's a kernel or driver issue, but that isn't the case.

@ #19: Mathieu,
I installed '234-2ubuntu12.3~mtrudel1' from the PPA you mention on my computer.
Unfortunately there is no change in behavior. I am still not forwarded to the captive portal after connecting to this particular wifi.
Do you have a suggestion what I could try?

@ #20: Does this: 'but it doesn't look like it helps.' mean that the new systemd does not fix the bug and therefore it does not work on my side?

I can readily reproduce this using my company's guest network with current systemd in Bionic. We use "securelogin.networks.dell.com" for our redirector. If I can provide something useful, happy to do so.