Juniper Mist Service Level Expectations (SLEs) show the client experience on your network at any given point in time. Use the SLE page to measure and manage your network proactively by identifying pain points before they become too big of an issue. Do you want to know how your network is doing today? SLEs can help you identify which metric is running into problems by showing the percentages of success and failure in a given time range.
In the following video, Jacob Thomas explains how to troubleshoot issues with the entire SLE framework, considering different elements that can provide direct or correlated information about the root cause.
We understand the audio is not very good, but the content is excellent and we thought hearing the interaction between our professional team was worth it. The transcription is accurate and should be helpful.
– I’m assuming people might not know what the SLE framework is. But the reason we say this SLE is the most popular and the most actionable is that it is the one SLE that finally answers: are you able to connect, or are you not able to connect?
So the conclusions are pretty straightforward, and there’s not much thinking required, right? You just look at the failure and then get to the scope. What we have done over time is– like when we introduced the new classification two years ago for Roaming.
So in the beginning, in the early days of Mist, the SLE only used to track the initial association. Once the user is associated to an AP, it doesn’t track anything until the next roaming event or disconnection event.
What we did almost 18 months ago was to actually continuously track the users– to see if there is a DHCP timeout, an ARP timeout, or keep-alives coming from the clients– as a method to see if the user is connected to the network. So the new definition of Successful Connect is: are you able to connect to the network successfully, or are you failing to connect to the network? And once they’re actually on the network, are they able to stay connected successfully from a data-path perspective? And we use DNS, DHCP, and ARP as our indicators.
Longer term, once we look at integrating other sources and maybe additional feedback from the AP, we could expand. But for a lot of deployments– and given the fact that these devices do tend to check internet connectivity all the time, and if they don’t have it they do re-IP– it’s a very good indication when we see timeouts.
So let’s just go through how we talk to the customer about this. Because from a customer perspective– they are [INAUDIBLE], a new customer– they’re not used to this SLE framework. They don’t really understand association and authentication. Very rarely will you see vendors mixing that with, you know, DHCP, DNS, and ARP, and the reasons why. So just explaining the concept– pre-connection, post-connection, and once they’re connected, making sure the connectivity is still there. So explaining the concept actually helps.
And then you walk through the flow, and then you’re able to generalize.
– So the question is, why do you have some of the same classifiers under successful connect and time to connect?
– Yeah.
– Because they tell you two different things. So Time to Connect measures how long it took you to connect. And if it took you longer than your threshold, what was the reason? It may be slow association. So Time to Connect doesn’t mean that you didn’t connect to the network. Time to Connect means that you did connect to the network, but it took you a while. So you may have a slow association or a slow response from RADIUS, so the authorization classifier would trigger.
Or DHCP, right? Like, we had a customer bring up a new site. And they noticed that DHCP time to connect was flagging.
– Yeah.
– The users were actually connecting. They were getting DHCP. But it was taking them like 5 seconds to get a DHCP address.
– So then in that case, Time to Connect is going to show a degraded SLE. But Successful Connect will still show 100%, because I still got DHCP– it just took longer.
– That’s why the concept is here. Successful Connect is a binary thing: are you able to connect? Are you able to stay connected? It is yes or no. If you are not able to connect, that goes into Successful Connect. If you are connecting, Time to Connect asks: was the connection experience optimal?
– So let’s say I get no DHCP address at all– is that going to impact the Time to Connect SLE?
– It won’t, because you did not get through. That’s why you have to actually explain to the customer that being able to connect to the network is not good enough, right? Are you able to connect in an optimal fashion, within a stipulated timeline?
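The binary-versus-threshold distinction described above can be sketched in a few lines. This is only an illustration, not Mist’s implementation– the field names and the 5-second threshold are assumed values:

```python
# Sketch of Successful Connect (binary) vs. Time to Connect
# (threshold-based). The 5-second threshold is an assumed value.

TIME_TO_CONNECT_THRESHOLD_S = 5.0  # hypothetical per-org threshold

def grade_attempt(connected, connect_time_s=None):
    """Return (successful_connect, time_to_connect) grades."""
    if not connected:
        # Never getting through hits Successful Connect; Time to
        # Connect is not graded for attempts that never finish.
        return ("fail", None)
    # The client did connect, so Successful Connect passes regardless
    # of how long it took; only Time to Connect can still fail.
    if connect_time_s > TIME_TO_CONNECT_THRESHOLD_S:
        return ("pass", "fail")  # e.g. DHCP took 5+ seconds
    return ("pass", "pass")

print(grade_attempt(False))       # ('fail', None) -- no DHCP at all
print(grade_attempt(True, 6.2))   # ('pass', 'fail') -- slow DHCP
print(grade_attempt(True, 0.8))   # ('pass', 'pass')
```

This mirrors the exchange above: a 5-second DHCP still passes Successful Connect but fails Time to Connect, while no address at all fails Successful Connect only.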
– So these are the classifiers of Successful Connect, right? Association is the 802.11 part. Authorization is the network authorization, whether it’s RADIUS, or PSK, or guest portal. DHCP is your IP address. And then DNS. So you have to complete all of these to connect.
And so I think this part is generally pretty straightforward. But the question that comes up usually is, OK, what is an acceptable success number, success percentage for the SLE? And the answer is it depends. It really does.
So what I usually say is figure out what’s normal, what is the baseline? And then monitor deviations. So a site could be at 85% success. And usually, that would be low. But if that is the normal site, then it’s normal.
– One example is [a customer]– devices fall off the AD domain, and they continuously hit and fail authentication. So it actually takes the SLE down.
So if you look at the composites, they’re always in the 80s. And you’ll be surprised– why? It’s because you have quite a lot of devices sitting at the IT desk. They’re off the domain, but they continuously retry.
– So for each deployment– once it’s an aged deployment, correct? For a given organization, you will have a certain baseline. And you look towards that baseline and try to stay within 5%.
– So the Time to Connect and Successful Connect SLEs, with the enhancement that we did recently, are now session-aware. They’re aware if your client tried to get DHCP, failed, then eventually got it; or if ARP for your gateway timed out, but then a few seconds later you were able to successfully ARP the gateway.
So that’s why in the SLEs, you may notice client events where– or even sometimes the Marvis Actions ARP anomaly will trigger, because that is looking at the failing points and the number of failures, whereas the SLE is looking at failures and successes in a 5-minute window.
So if you fail DHCP and then get DHCP, that will count as a success, right? There’s some session logic that we do. So that’s good to understand– it’s why you may see successes and failures in the client events that don’t show up in the SLE.
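A rough sketch of that session logic, assuming simplified event tuples and the 5-minute window mentioned above:

```python
# Session-aware SLE accounting: within a window, a failure followed
# by a success for the same client/service counts as a success; only
# unrecovered failures count against the SLE. The event shape here
# is an assumption for illustration.

WINDOW_S = 300  # the 5-minute SLE window

def sle_outcomes(events):
    """events: iterable of (timestamp_s, client, service, ok)."""
    outcome = {}
    for ts, client, service, ok in events:
        key = (int(ts // WINDOW_S), client, service)
        if ok:
            outcome[key] = "success"   # later success overrides failure
        else:
            outcome.setdefault(key, "failure")
    return outcome

events = [
    (10, "aa:bb", "dhcp", False),  # DHCP timeout (visible in client events)
    (14, "aa:bb", "dhcp", True),   # ...then got an address -> SLE success
    (20, "cc:dd", "dhcp", False),  # never recovered -> SLE failure
]
print(sle_outcomes(events))
```

So both the timeout and the recovery show up in client events, but only the unrecovered client counts against the 5-minute SLE window.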
I guess the last thing I will point out on Successful Connect– the most common issue that we see is with DHCP. DHCP has sub-classifiers. So we’re looking at DHCP Discover Unresponsive– the initial Discover gets no response. Then we look at Incomplete, which means the server sent an offer back, but the client didn’t do anything after that.
Then there is Renew Unresponsive, which is: the client has the address, it’s doing a renew, but the renewal failed. And usually, if Discover succeeds but Renew fails, it means– the Discover is broadcast, so if there’s DHCP relay, that functionality is working. But Renew is often unicast. So the unicast path to the server is failing.
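As a rough sketch, the three sub-classifiers could be derived from the observed exchange like this. The boolean flags are an assumption; real classification watches the actual Discover/Offer/Request/Ack and Renew frames:

```python
# Map a simplified DHCP transaction to the sub-classifiers discussed
# above: Discover Unresponsive, Incomplete, Renew Unresponsive.

def classify_dhcp(renew_attempted, renew_acked,
                  offer_received, client_requested):
    if renew_attempted:
        # Renew is typically unicast, so a failing renew often means
        # the unicast path to the server is broken even though the
        # broadcast Discover / relay path works.
        return None if renew_acked else "Renew Unresponsive"
    if not offer_received:
        return "Discover Unresponsive"  # broadcast Discover got no offer
    if not client_requested:
        return "Incomplete"  # server offered, client went silent
    return None  # completed normally

print(classify_dhcp(True, False, True, True))     # Renew Unresponsive
print(classify_dhcp(False, False, False, False))  # Discover Unresponsive
print(classify_dhcp(False, False, True, False))   # Incomplete
```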
– Yeah, as [INAUDIBLE] can attest. Just like Wes said, after the very first POC deployment at [INAUDIBLE], we actually showed them they were having a DHCP forwarding problem for renews because of the way the policy was set. They had been chasing this with Meraki and Extreme for so long; they just couldn’t figure it out. We just clearly showed it: there were DHCP issues– and not just renews in the connectivity sequence, but anytime.
So 18 months ago, we couldn’t catch this unless the renew was following an association. Now, we can actually catch it anytime a renew happens. So if you have a forwarding issue on your WAN side or with your helpers, we will catch it. This is why the SLE is even more significant: your upstream problems are actually caught by this SLE. DHCP will actually catch any kind of forwarding issues, or DHCP service issues.
And again, it’ll probably catch upstream WAN issues or forwarding– something with your core that is not stable, or if you’re having asymmetric paths and issues like that. There’s some kind of [INAUDIBLE]. So there are a lot of customer examples where upstream issues are caught today.
– Hey, Jacob.
Quick question. This is cool by the way, with the renew and stuff– we have all seen that. When the client is doing unicast, with the T1, T2 timers and stuff, if it fails, it eventually goes to broadcast. Does it keep the address during that time and just succeed, so the client doesn’t know? Or is there a time where the client loses the address, and then it goes broadcast and comes back?
– Simply, within the same transaction ID, you should know. And if it is keeping the address, as long as that [INAUDIBLE].
– So what we see– from the client’s perspective, are they keeping their old address and then switching back to broadcast?
– So it’s on the state machine, right? Most of the time, in the switch to broadcast.
– One more thing– IP pool exhaustion. If you actually end up filling up your IP pool, or if there’s some kind of conflict with overlapping pools, you’ll actually see that. It’s also one of those scenarios where we POC very well against our peers.
– I see that all the time.
– For sure, yeah. So what you’ll see next is, without Mist Edge, your clients are going to re-IP– either going from one vendor to another with different subnets, or going between buildings where there are different subnets. So you’ll see the client NAK, and so that [AUDIO OUT]. And we did an analysis for them just showing how the clients figure this out. They figure it out within a couple of seconds, and they’re back and happy.
– So if you have a VLAN pool implementation with Aruba and you do a full implementation with us, our logic is different from Aruba’s: with us, you will always attach to the same VLAN. So when you cross-roam, you will see NAKs. So again, it can show why they should be quickly replacing Aruba and putting in Mist across the board.
[LAUGHTER]
The thing is, the failure mode doesn’t generate a distinct failure. It’s just a normal event– the same way that when you walk away from this room, maybe you’ll see a disconnect or a de-auth. A normal type of activity. It’s the same thing that we would see. So the machine doesn’t know where the correlation is, and it can continue when you have many clients doing it, right? So yes, that is something that we could do. But Wes explored with higher rates; we don’t see that with a client, so we just must [INAUDIBLE].
– So if they have a client that’s experiencing an issue, go look at their client events and figure out a signature. I haven’t found the signature, so I can’t tell you what it is. I suspect you’ll see client de-auths, in which case you could rank your clients by client event count, with event type client de-auth.
– So if there is a lot of this happening, you will see it showing up in the ranking, and then you know it’s your client. So you can troubleshoot reactively. Proactively, without a known signature, I don’t think there’s a clean way to get it right.
– Can you dive deeper into the DNS failures– what constitutes one, and how many does it need? Is it one? Is it three? Is it five? What constitutes that?
– So in the DNS failure client event, you’ll see a failure count. And this count here is one. Basically, it has to be two to count as a failure in the SLE.
– For the same URL that it is trying to resolve?
– Yeah. So this is a perfect example. We’re doing a per-server, per-host correlation. So if there’s one failure– a failure count of one– we ignore it for the SLE. It’ll still show up in client events; client events are what actually happened. So we have that session logic: we’ll only count it when the count is two or more. And so–
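The counting rule described here can be sketched as follows. The event shape is an assumption for illustration; the two-or-more threshold is the one stated above:

```python
# Per-server, per-hostname DNS correlation: one timeout is ignored by
# the SLE (but still visible in client events); two or more for the
# same (client, server, host) tuple count as an SLE failure.

from collections import Counter

def dns_sle_failures(timeouts):
    """timeouts: list of (client, dns_server, hostname) timeout events."""
    counts = Counter(timeouts)
    return {key for key, n in counts.items() if n >= 2}

events = [
    ("aa:bb", "10.0.0.53", "intranet.example"),
    ("aa:bb", "10.0.0.53", "intranet.example"),  # 2nd timeout -> counts
    ("aa:bb", "10.0.0.53", "blocked.example"),   # single timeout -> ignored
]
print(dns_sle_failures(events))
```

A single stray timeout for a hostname stays in the client-event timeline only; a repeat for the same server/host pair is what the SLE counts.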
– We correlate by the server, and that’s when it starts to move.
– That’s the goal. And could we change that behavior?
– DNS wasn’t there in the SLE originally. When we introduced ARP and DNS, we brought it in as well.
– So the question: for some of the DNS failures, does it tell us that the DNS server is down or overloaded, and how do we correlate to this particular DNS server?
– DNS is inherently noisy, so you’ll see DNS failures. You kind of have to take a holistic approach. It may be that the organization is blocking certain domains, and so you’re going to see DNS timeouts. It could be a DNS server that’s overloaded. But you have to do a further analysis of what’s actually failing. Just seeing a string of failures, to me, is not concerning. But if there are failures without successes, then maybe you want to look at it– especially if it’s happening to multiple clients.
– I think if you see a big block of continuous failures, that’s less common. Usually, look at the day before and the current day– that’s a good comparison. If you see the site is typically trending that way, that could be the norm. But if it has been broken from day one, then again, you won’t catch it. So DNS, just inherently, is very noisy; you have to go to the next level to see whether something is actually wrong. Now, in Marvis Actions is where the DNS failure really gets picked up. There we use an LSTM model– not when the site is first launched, but over time, we predict what the DNS failure rate should be. Is it new devices, and things like that? So there, you have higher efficacy.
– Does URL filtering come into that as well? Say people are using Umbrella for their DNS queries and such?
– That’s actually one of the things worth saying. If the organization’s policy blocks at the DNS level– if they just block DNS– we will catch it.
– We did share the statistic with [a customer]– a lot of the problems they were having with slow response to the clients were because of DNS. The clients were trying to go to places they didn’t need to go to. So it’s really powerful to show them it’s their problem.
– I mean, literally, what you’re seeing there is actual– this is literally what the client is experiencing. So from a customer perspective, they can look at it and ask: are the devices supposed to be going to all these URLs?
– So I’d say, for ARP: if you see ARP failures, extremely high efficacy– that’s a problem you need to deal with immediately. If you see DHCP problems, it’s probably an issue you need to deal with immediately. DNS can go either way, based on the policies of the organization.
– Is there a way to– for example, with HPC, they used to tell the DNS servers not to resolve external addresses. So Office applications, Google applications, et cetera, are chatting to their native backends, and we constantly see failures. Is there a way to reclassify or suppress certain failures on a WLAN?
– You can exclude the WLAN, or you can look at the servers that are failing.
– It’s actually breaking it down, once they’re actually catering to a certain [INAUDIBLE].
– Obviously it’s the corporate WLAN, so I didn’t want to turn that off from the SLEs. You just want to suppress the DNS.
– You don’t want to do that, because– this is something that they’re doing on the network, and the system is telling them that, right?
– Yeah.
– They just need to accept that that’s normal. So again, it goes back to: what is the baseline for that particular WLAN?
– That’s kind of where I’ve landed with this.
– The thing is, the SLE always presents the actual facts. The customer can consume it, and then they can alert based on the ones that they want to act on. So they can do the filtering.
– I think the problem we have is, if a DNS server does go down– this is probably just [INAUDIBLE]– there’s already so much noise on that event. We’re not going to see that problem.
– It will go from 60% failure to 100% failure. They just have to know what is normal and what is not normal. In theory, DNS Marvis Actions will catch that as well. All right then, we’re going to move on.
– So Marvis Actions– how do the customers consume them? They consume them well with webhooks, right? And then they can cut tickets based on the information that’s available. We can apply a filter then.
– You can apply the filter there, because they know their internal servers. So as long as the alert doesn’t correspond to any of the internal servers, they don’t have to cut the ticket. The webhook will be there, and then they put the filter on. So the programming, the customization– you do it all on the consumption side. We generate, and then the customer can decide: do they want to alert for certain organizations, or certain Marvis Actions, or all Marvis Actions with some additional filters? That business logic belongs outside our system.
So we want our system to generate it, and then at their consumption level– ServiceNow, or whatever they use to ingest the webhook– they can put rules in. ServiceNow is a platform that actually gives you flexible options to program it the way you want. If we filtered for just one use case, we’d just end up complicating our models. That’s why we don’t filter at the source; we filter on the destination side. At the end of the day, if the customer wants to take action, they need to consume it, and then it will generate the alert.
And they have the flexibility, if they program it. I mean, they can even start from example code on how to filter. And they’re getting platforms that give you that flexibility.
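The destination-side filtering described above might look something like this. The payload fields and the internal-server list are assumptions for illustration, not a real Mist webhook schema:

```python
# Destination-side webhook filtering: Mist generates a webhook for
# every Marvis action, and the receiving system (e.g. ServiceNow)
# applies the business logic before cutting a ticket.

INTERNAL_DNS_SERVERS = {"10.0.0.53", "10.0.1.53"}  # hypothetical list

def should_cut_ticket(payload):
    """Decide at the consumer whether an alert becomes a ticket."""
    if payload.get("action") != "dns_failure":
        return True  # pass other Marvis actions through unchanged
    # Only ticket DNS anomalies that involve the internal servers;
    # noise about external resolvers is dropped here, not at the source.
    return payload.get("server") in INTERNAL_DNS_SERVERS

print(should_cut_ticket({"action": "dns_failure", "server": "10.0.0.53"}))  # True
print(should_cut_ticket({"action": "dns_failure", "server": "8.8.8.8"}))    # False
```

Keeping this rule in the ticketing platform, rather than in the SLE engine, matches the point above: the source stays simple and factual, and each customer encodes their own policy.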
– OK, let’s move along– sorry, we’re short on time. Let’s talk about the Capacity SLE. The Capacity SLE has a lot of power in it that I think people don’t realize. The number one question we get is: can I see my spectrum? This gives you that. And when you have capacity issues– something that, to my knowledge, no other vendor provides– it tells you what is causing that capacity issue. Is it you? Is it somebody else? Is it non-WiFi interference? That’s what the Capacity SLE shows.
So if it is client count or client usage, it is your clients using the network. If it is WiFi interference, it is either somebody else, or your channel width is too wide.
– Yeah– either the deployment is extremely dense with wide channels– some customers are enabling wide channels and then putting in a lot of APs, or they do 40 MHz with a lot of APs– which effectively takes the usable channel count down toward a 2.4 GHz situation. They don’t realize it, but that’s what they’ve literally done. So it’s your own, or external, and then the non-WiFi interference. But the key is it’s continuous– think of it as a continuous assessment of spectrum availability. Are you pushing the channel utilization to levels where the users don’t have any headroom? That’s what the SLE is actually doing.
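The attribution described above can be pictured as a simple airtime split. The numbers and inputs are purely illustrative, not real AP telemetry fields:

```python
# Split observed channel utilization into the three buckets the
# Capacity SLE attributes: your clients' usage, other WiFi (neighbors
# or an over-wide channel), and non-WiFi interference. All values
# are fractions of airtime on the AP's channel.

def capacity_breakdown(total_util, own_tx_rx, other_wifi):
    non_wifi = max(0.0, total_util - own_tx_rx - other_wifi)
    return {
        "client_usage": own_tx_rx,        # it's you: add an AP or capacity
        "wifi_interference": other_wifi,  # somebody else, or width too wide
        "non_wifi_interference": non_wifi,
    }

# An AP at 80% utilization where most of the airtime is other WiFi
# suggests co-channel trouble, e.g. 40 MHz channels in a dense design.
print(capacity_breakdown(0.80, 0.25, 0.45))
```

The point of the SLE is that it does this split continuously, per AP, instead of leaving you to assemble the three buckets by hand after a user complaint.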
So if you look at systems out there– there are systems that give you a spectrum view. They will report channel utilization, and they also report usage. So you have three buckets, and you have to figure out how to put it all together and see if there is impact. The only time you’ll do that is when a user complains, and you actually have to go dig. And if you’re lucky, the user will tell you where and when it happened.
Otherwise, you have to look at the whole site and the list of APs, put these three pieces together, and then figure out when the user complaint was.
– So this is Juniper, right? And the WiFi interference– Juniper has 40 MHz configured on this campus. Look at the APs that show up: these are the dome APs. So this tells me 40 MHz is an incorrect configuration for these APs in that environment. Now, we’re going to bring this into RRM and other parts, but the information is there– the SLEs often have the information first. Now we have this and can do more interesting things: we’re going to do auto channel width at a site level, where the other vendors have done it at the AP level.
And we’re going to do a site-level determination of what the appropriate bandwidth is for your site using this type of information. You can use this with your customers today. So if they have 40 MHz or another channel width, and you see WiFi interference on certain APs, that very likely points to an incorrect channel width.
– And your hypothesis is that this is co-channel interference from our neighboring APs? I mean, with RRM, they should be able to understand–
– This is interference from the secondary channels of the APs. The APs are configured for 40 MHz, and we’re picking up interference on that secondary 20.
– So my question was that you’re basically deducing that there is interference from the higher channel of the 40 MHz pair. I’m saying in RRM, you can actually see the proof in the pudding and look at your site AP interference, right? On that particular channel, the–
– Yes– so while Wes is pulling it up, there are a few things that we’re doing today to try to mitigate adjacent-channel issues. We try to switch the channel ordering. Usually, when you see your channels always in the forward direction, that means there’s good isolation. But if you see channels in both directions, that means RRM is trying to stay away. At the end of the day, though, it’s interference you just cannot avoid. So if your deployment is dense enough, you do not want to use 40 MHz. You just want to re-test. When you do a speed test, or an isolated test, you will see the burst.
But continuously, you will start to see it– and [a customer] was an eye-opening experience. We looked at the SLE information for them. All was good, because the site didn’t have a lot of usage. But there are power users– when usage hits, every AP is going to hit hard. And if you look at [INAUDIBLE], there’s a lot of data collisions contributing to the problem.
It wasn’t affecting normal usage; it showed up when someone was on a Zoom call. So what literally happened is: somebody is on a Zoom call, and they’re OK– until usage goes up on an adjacent AP. Then the Zoom call metrics take a hit, when that adjacent AP has downloads or another event.
So it took us a while to actually make that correlation. And the answer was they were running 40 MHz; we put them on 20 MHz, and it is working now. What we want now is to not be reactive– we want to be proactive. So we’re working on a mechanism in RRM itself, working with some very large sites that did 40, to determine: can the site do 40?
– So just again, the basic determination: is it you, or is it somebody else? At [a customer], they had some rooms where they had WiFi complaints. Go look at the Capacity SLE for that AP– it was all client usage. So the answer is: add a second AP. And they had some areas where they actually turned on 40 MHz because they had specific cloud use cases. But this year, they had so many of those rooms close together that they ran out of channels. So, based on the Capacity SLE, they made the determination to go back to 20 MHz.
OK, so next, let’s talk about the Roaming SLE. Roaming is an SLE that has changed over the past 12 months. It used to just be time-based roams, but now it has roam quality as well. So we look at both, and we now actually grade every single client roam. Every single roam that a client does gets graded.
So that falls under signal quality. For signal quality, the classifiers are sticky, interband, and suboptimal. Sticky means a client did not roam, but we think there was a better candidate. Suboptimal means that the client did roam, but it picked the wrong candidate. And interband means that it’s going between bands– 2.4 to 5, or 5 to 2.4, whatever. Generally, if you see these, they are things that you don’t want to happen.
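A naive sketch of how those three classifiers might be distinguished. The real grading is much richer; the RSSI margin here is a made-up number:

```python
# Toy roam-quality grader for the classifiers described above:
# sticky (didn't roam, better candidate existed), suboptimal (roamed
# to the wrong candidate), interband (changed bands).

def grade_roam(roamed, old_rssi, new_rssi, best_candidate_rssi,
               old_band, new_band, margin_db=6):
    if not roamed:
        # Client stayed put although a clearly better AP was audible.
        if best_candidate_rssi >= old_rssi + margin_db:
            return "sticky"
        return "ok"
    if old_band != new_band:
        return "interband"  # e.g. dropped from 5 GHz to 2.4 GHz
    if best_candidate_rssi >= new_rssi + margin_db:
        return "suboptimal"  # roamed, but picked the wrong candidate
    return "ok"

print(grade_roam(False, -75, None, -62, "5", "5"))   # sticky
print(grade_roam(True, -75, -70, -60, "5", "5"))     # suboptimal
print(grade_roam(True, -75, -68, -68, "5", "2.4"))   # interband
```

As noted below, the real interband classifier has extra tolerance for fringe cases rather than flagging every band change.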
– For sticky clients, are we looking at probes that are heard by another AP, from which we know it should have roamed?
– Yes, because we’re looking at whether there is a better candidate than this AP. It doesn’t matter what the roaming algorithm is on the client; the fact is that they’re on the wrong AP. There are interesting ways you can look once you get into the SLE. You have the location view, which obviously allows you to visualize– because honestly, the Roaming SLE can be a bit obtuse; it can be tough to consume. Ultimately, this will come into Marvis and other places and feed into RRM. But again, the SLEs are the tip of the spear, so to speak.
You have the location view, so you can visualize where your failures are. But then if you go into a classifier, and you pick out a client– basically, what I want to show here is: if you click on a client and look at the client SLE, it’ll actually give you the correlation. This is information that feeds into Marvis.
It will give you the correlation: is this a client issue, or is it an infrastructure issue? So if you have sticky clients, is it a client-specific issue, or is it because an AP is in an inappropriate spot? That would show up in the correlation– you would see a correlation tab.
– They do.
On sticky, are we saying that before the client moved, it was already on a low signal?
– Sticky means the client did not roam. Suboptimal means they roamed, but they picked the bad candidate.
– You can also see roaming with a query in our CLI or MQL. So you will actually get a pretty good picture of–
– So you use the Roaming SLE in conjunction with that roaming query, and you get a very good idea. The roaming latency and stability classifiers are useful too– these measure the time. So at [a customer], the failed-to-fast-roam classifier picked it up, because clients weren’t fast roaming; they were failing fast roam. And Time to Connect also went down with the slow associations.
– On the interband roam– what you see in this classifier is based specifically on current signal quality. If you switch from 5 to 2.4, we don’t flag it just based on that. There is intelligence built in: if the signal drops significantly as part of it, it actually does some assessment. So it does take into account the fringes. If it is a fringe case– that obviously won’t expose 5 GHz coverage holes, but there is some tolerance built in. So it’s not an automatic thing where every time you switch bands it counts here. There’s some logic.
Let me just give maybe a couple of examples. The Coverage SLE is one SLE where– yeah, that’s probably the only SLE that, in a lot of deployments, you will see is not clean, and people tend to get concerned. Think of the Coverage SLE as– as I was mentioning yesterday– a continuous site survey. We continuously survey, survey, survey; your coverage can go up and down, and all of that kicks in.
So if you really don’t have a ubiquitous deployment with limited mobility, it is perfectly OK for your Coverage SLE to be in the 70s– you don’t necessarily need it to be in the 90s. In a retail environment, 70s is not bad. There are customers who took it upon themselves to improve it: they lowered the threshold of the SLE and augmented the design to actually take it back to the 80s. But typically in retail you will see the values in the 60s and 70s, and normally it is because of guest WiFi and clients actually roaming around the store.
Even in warehousing– because of height and because of the frequent roaming– you will see lower values. But what you really need to look for is: is there a specific AP? The distribution is the key. Are there certain APs that contribute to the bulk of the coverage anomalies– do you see most of the hits on a single AP? We went to a store, and he made the comment that it was the only Mist store where he had a coverage problem– and he had been to thousands of Mist stores. Actually, [INAUDIBLE] it up.
And right in the middle of the sales floor, you have an AP showing weak coverage. So depending on where the AP is, the location view can be extremely powerful with the Coverage SLE. It is actually OK to have a low Coverage SLE on your fringe APs, because clients are going to walk away from them– it is perfectly normal. That’s why in Marvis Actions, if you were not doing any machine learning, you would have every outdoor AP in crisis. So in this case, look at the distribution. Look at the impact and the failure rate.
So on outdoor APs and fringe APs, you will see a very high failure rate, but the impact is actually low– in which case, it is OK. So looking at the distribution and the location correlation is key. Don’t think of a slightly lower Coverage SLE as a problem; it’s just telling you about your environment, and how clients have walked out of coverage.
– If my site has been flagged as poor coverage, should we be seeing the APs’ transmit power at the maximum level?
– Not necessarily.
– So we can transmit at 17 [dBm]. Again, it’s a question people always ask. There’s no point in increasing the AP’s power simply because the client is at a lower power. It’s one mistake that people make: they go and raise the power thinking it will help. It won’t help.