How we find Missing VLANs

Transcript

Missing VLANs are a two-decade-old networking problem. It sounds so simple. But in a large enterprise, it can become the ghost in the machine: users complain that their calls always drop in a certain area, and the conventional wisdom is, well, there must be interference or Wi-Fi issues over there.

In many cases, when Mist support helped troubleshoot, we found that a user VLAN was indeed not provisioned on the network switch. Hence, the user had no place to roam, and the call dropped. For customers with tens of thousands of APs, this truly becomes a needle-in-a-haystack problem. At Mist, we wanted to use AI to solve this problem. But first, let’s take a look at how you might start out today.

You can manually take a look, but I only have two VLANs. Or you can programmatically take a look, but this makes my brain hurt. If an AP is connected to a switch, but the user can’t get an IP address or pass any traffic, then the VLAN probably isn’t configured on the port or it’s black-holed. The traditional way to detect a missing VLAN is to monitor traffic on the VLAN.

And if one VLAN continuously lacks traffic, then there is a high chance that the VLAN is missing on the switch port. The problem with this approach is false positives. Here, you can see that during a 24-hour window, we flagged more than 33,000 APs as missing one or more VLANs because they had little or no traffic. But this was not accurate, as we learned that not every VLAN is created equal.
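The traditional traffic check can be sketched in a few lines of Python. This is a minimal illustration, not Mist's implementation: the sample format, the `find_silent_vlans` name, and the `min_bytes` threshold are all assumptions made for the example.

```python
from collections import defaultdict

def find_silent_vlans(samples, min_bytes=1):
    """Return {ap: sorted VLAN IDs} whose total traffic stayed below min_bytes.

    samples: iterable of (ap_name, vlan_id, byte_count) tuples collected
    over the observation window (hypothetical format).
    """
    totals = defaultdict(int)
    for ap, vlan, byte_count in samples:
        totals[(ap, vlan)] += byte_count

    silent = defaultdict(list)
    for (ap, vlan), total in totals.items():
        if total < min_bytes:
            silent[ap].append(vlan)
    return {ap: sorted(vlans) for ap, vlans in silent.items()}

# Made-up counters: VLAN 20 is silent only on ap-1, VLAN 999 is silent everywhere.
samples = [
    ("ap-1", 10, 5200), ("ap-1", 20, 0), ("ap-1", 999, 0),
    ("ap-2", 10, 4100), ("ap-2", 20, 880), ("ap-2", 999, 0),
]
silent = find_silent_vlans(samples)
print(silent)  # {'ap-1': [20, 999], 'ap-2': [999]}
```

In this toy data, VLAN 999 is flagged on every AP. A threshold alone cannot tell whether that VLAN is missing or is simply meant to be quiet, which is where the false positives come from.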

There are at least two types of special-purpose VLANs that can cause detection problems. One is the black hole VLAN. Folks can assign a black hole VLAN to all unconfigured ports, or use it as a quarantine VLAN for users until they are fully authorized. This VLAN is supposed to be provisioned on the switch in case a quarantined user shows up on the AP.

The second example is the overprovisioned VLAN. Larger customers use special VLANs for special sites. For example, legacy devices might only be present at certain sites, so a special VLAN should only be applicable to those sites. But because people use automation and want to keep their configurations consistent, they provision that VLAN across all the sites.

In this case, you would expect low traffic or no traffic. Those VLANs shouldn’t be flagged as missing because they were intentionally overprovisioned. So the key to reducing false positives is to really identify the purpose of each VLAN. We could ask the customer for their own internal list, perhaps in the form of a spreadsheet, but that’s very error prone.
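However the purpose labels are obtained, the filtering step itself is simple. The sketch below assumes a hypothetical mapping from VLAN ID to a purpose label; the label names and the `filter_candidates` helper are illustrative only.

```python
# Purposes for which silence is expected (label names are assumptions).
EXPECTED_SILENT = {"black_hole", "overprovisioned"}

def filter_candidates(candidates, vlan_purpose):
    """Keep only (ap, vlan) pairs whose VLAN is expected to carry traffic.

    candidates: list of (ap_name, vlan_id) pairs flagged for low traffic.
    vlan_purpose: dict mapping vlan_id -> purpose label. VLANs without a
    label are kept, since nothing says their silence is intentional.
    """
    return [
        (ap, vlan)
        for ap, vlan in candidates
        if vlan_purpose.get(vlan) not in EXPECTED_SILENT
    ]

candidates = [("ap-1", 20), ("ap-1", 999), ("ap-2", 999), ("ap-3", 70)]
vlan_purpose = {20: "user", 999: "black_hole", 70: "overprovisioned"}
likely_missing = filter_candidates(candidates, vlan_purpose)
print(likely_missing)  # [('ap-1', 20)]
```

The hard part is not this filter but producing a trustworthy `vlan_purpose` mapping in the first place.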

Mist developed an unsupervised machine learning model to automatically discover the purpose of each VLAN by learning from the traffic patterns on the VLANs. In this graph, each dot represents one VLAN, and the dots together cover all of the VLANs across the Mist customer base. For each VLAN, we collect several features. How many APs lack traffic on that VLAN? How many sites lack traffic? How busy is that VLAN, minute by minute, across all the APs?
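As a rough illustration of what such per-VLAN features could look like, the sketch below computes three numbers from made-up per-minute byte counts. The record format, the use of fractions rather than raw counts, and the "busy minute" definition are assumptions for the example, not the features Mist actually uses.

```python
from collections import defaultdict

def vlan_features(records):
    """Summarise traffic per VLAN.

    records: iterable of (site, ap, vlan_id, per_minute_bytes) tuples, one
    per AP and VLAN, where per_minute_bytes is a list of byte counts
    (hypothetical format).
    Returns {vlan_id: (frac_aps_silent, frac_sites_silent, busy_minute_ratio)}.
    """
    per_vlan = defaultdict(list)
    for site, ap, vlan, minutes in records:
        per_vlan[vlan].append((site, ap, minutes))

    features = {}
    for vlan, rows in per_vlan.items():
        # APs that saw no traffic at all on this VLAN.
        aps_silent = sum(1 for _, _, m in rows if sum(m) == 0)
        # Sites where no AP saw traffic on this VLAN.
        site_bytes = defaultdict(int)
        for site, _, m in rows:
            site_bytes[site] += sum(m)
        sites_silent = sum(1 for total in site_bytes.values() if total == 0)
        # Share of AP-minutes with any traffic.
        all_minutes = [b for _, _, m in rows for b in m]
        busy = sum(1 for b in all_minutes if b > 0) / len(all_minutes) if all_minutes else 0.0
        features[vlan] = (aps_silent / len(rows), sites_silent / len(site_bytes), busy)
    return features

records = [
    ("hq", "ap-1", 10, [500, 0, 300, 200]),
    ("hq", "ap-2", 10, [0, 0, 0, 0]),
    ("branch", "ap-3", 10, [100, 100, 0, 0]),
    ("hq", "ap-1", 999, [0, 0, 0, 0]),
    ("hq", "ap-2", 999, [0, 0, 0, 0]),
    ("branch", "ap-3", 999, [0, 0, 0, 0]),
]
features = vlan_features(records)
print(features[999])  # (1.0, 1.0, 0.0)
```

Here VLAN 10 is silent on one AP out of three but active at every site, while VLAN 999 is silent everywhere, so the two end up with very different feature vectors.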

Then we use another technique called principal component analysis to combine all of these features and map them into this two-dimensional space. The interesting thing here is that the different VLAN types (high traffic, low traffic, black hole, and overprovisioned) are separated really well, even across different customers, because it turns out that VLAN behavior is very similar from one customer to the next.
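To make the projection step concrete, here is a small standard-library sketch of principal component analysis using power iteration with deflation. The feature rows and the group labels in the comments are invented for illustration, and a real pipeline would more likely use a numerical library; nothing here reflects the actual Mist model.

```python
import math

def pca_2d(rows, iterations=200):
    """Project feature rows onto their first two principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered features.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    components = []
    for k in range(2):
        # Deterministic unit start vector that differs per component.
        vec = [1.0 / (i + 1 + k) for i in range(d)]
        start_norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / start_norm for x in vec]
        for _ in range(iterations):
            nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break  # no variance left in any direction
            vec = [x / norm for x in nxt]
        eigenvalue = sum(
            vec[i] * sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)
        )
        components.append(vec)
        # Deflate: remove the variance explained by this component.
        cov = [
            [cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(d)]
            for i in range(d)
        ]

    return [
        tuple(sum(c[j] * comp[j] for j in range(d)) for comp in components)
        for c in centered
    ]

# Columns: fraction of APs silent, fraction of sites silent, busy-minute ratio.
rows = [
    (0.02, 0.00, 0.95), (0.05, 0.00, 0.90),  # high traffic (illustrative)
    (0.10, 0.05, 0.40), (0.15, 0.10, 0.35),  # low traffic
    (1.00, 1.00, 0.00), (0.98, 1.00, 0.00),  # black hole
    (0.90, 0.80, 0.02), (0.85, 0.75, 0.03),  # overprovisioned
]
points = pca_2d(rows)
for point in points:
    print(f"{point[0]:+.3f} {point[1]:+.3f}")
```

The sign of a principal component is arbitrary, so only the relative positions of the points matter. In this toy data the busy VLANs and the silent VLANs land on opposite sides of the first axis.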

The beauty of this is that instead of developing per-customer anomaly detection tools, we actually built one model for everybody. So for any new customers, we don’t have to ask them anything. We can determine the purpose of their VLANs very quickly after they deploy. This is really the power of this multi-tenant infrastructure design.

Every customer can benefit from the knowledge learned from our extended customer base. By precisely identifying each VLAN’s purpose, we reduced our initial detections from more than 33,000 to exactly 607 VLANs that we believed were actually missing from the AP switch ports. For Mist, this was the moment of truth. When we were confident in the model, we contacted the customers with these 607 detected missing VLANs. And when we finally heard back, we had an astonishing 100% hit rate. No false positives.

For Mist, this was simply awesome, as there are so many mundane problems we can apply this technique to going forward. Right now, this is shown in Marvis Actions. And with support for Juniper switches, we can give the user specific CLI commands that we suggest they add to their config to get these missing VLANs going. The goal is to do this automatically from the cloud as we gain their trust.
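The transcript does not show the commands Marvis suggests, so the following is only a guess at the general shape of such a suggestion. It builds standard Junos `set` statements for a switch that uses `family ethernet-switching`, assuming the port is already a trunk; the function name, VLAN name, and port are hypothetical.

```python
def suggest_junos_commands(vlan_name, vlan_id, port):
    """Build Junos `set` commands that define a VLAN and add it to a port.

    Assumes the port is already configured as a trunk under
    `family ethernet-switching`; this is not actual Marvis output.
    """
    if not 1 <= vlan_id <= 4094:
        raise ValueError(f"VLAN ID out of range: {vlan_id}")
    return [
        f"set vlans {vlan_name} vlan-id {vlan_id}",
        f"set interfaces {port} unit 0 family ethernet-switching vlan members {vlan_name}",
    ]

commands = suggest_junos_commands("voice", 20, "ge-0/0/5")
for line in commands:
    print(line)
```

Keeping the output as plain text lines matches the idea of suggesting a change for a human to review before anything is pushed to the switch.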

And for non-Juniper switches, we have detailed info, like which switch, which port, and which VLAN ID, to guide them on how to solve a problem that they probably didn’t even know they had. This is all built on open protocols like OpenConfig and NETCONF. A lesson learned by the Mist data science team: AI solutions should start by solving real problems, rather than deploying models and hoping for the best.

Some AI vendors treat AI as a hammer in search of a nail. And this isn’t going to work. The Marvis AI engine was designed starting with human expertise and then learning over time. At Mist, each support ticket is first run through Marvis to both measure its efficacy and continue to train the model to solve the most important customer issues.