Troubleshooting NetScaler
Raghu Varma Tirumalaraju
Load balancing
The NetScaler started off as a high performance load balancer, and load balancing is still its most prominent use case. In this chapter, we will look at a range of issues and questions that you will come across when setting up or managing a load balanced environment with the NetScaler.
Considerations
First, let's look at some considerations around the general settings of load balancing.
Let's consider a scenario where you've created a new load balancing (LB) vServer or bound a service to an existing vServer that already has a bunch of services. You will notice that even though you have the LB method set to the default (which also happens to be the recommended) method of least connections, the NetScaler starts to send requests to the backend in a round robin fashion. This is deliberate behavior to ensure that the new service you've just added doesn't get inundated with requests; after all, being a new server, it will have the least connections. This behavior is controlled by adjusting the tunable Startup RR factor.
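Adjusting it is a single global setting; as a minimal sketch, where the factor value of 100 is purely illustrative:
set lb parameter -startupRRFactor 100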
By default, when you create a load balancing VIP (we will shorten this to LB VIP for conciseness), the NetScaler uses one of its own IPs, usually a SNIP, as the source IP when sending packets to the servers. This is controlled by the USNIP global mode setting. USIP (use the client IP as the source IP), on the other hand, is needed only for specific scenarios.
Some scenarios where USIP really is required are as follows:
- With Direct Server Return, where the return traffic bypasses the NetScaler
- With some applications that need to use the actual Client IP to function
Occasionally, USIP gets deployed purely with the goal of getting visibility into the Client IP. There are definitely better ways to achieve this requirement:
- By using Weblogging for HTTP/S, which is both high performance and logs a lot of useful info
- By using Client IP Insertion (-cip) on the service, which then presents the Client IP in a header that can be extracted on the server
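As a sketch of the second option, Client IP insertion is enabled per service and the header name is your choice; the service and header names below are made up for illustration:
set service svc_web1 -cip ENABLED X-Forwarded-For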
Why should you avoid USIP?
- The first reason is routing. When you present the Client IP address as the source IP to the Server, without any special configuration in place the Server will try to reach out to the Client directly; this reply bypasses the NetScaler, the Client does not recognize it, and the packet will be dropped. So you will need either PBR or to set the NetScaler as the default gateway for the Servers.
- Next is performance. When USIP is enabled, the reuse pool, which is how the NetScaler maintains optimized connections to the backend, is not used efficiently, since it is now fragmented per client IP; that is, a connection optimized for one User cannot be used for another. This means more connections are opened on the Server.
- Because of the preceding reason, if you now enable surge protection (not surge queue, they are completely different), you will see very aggressive throttling, and that will mean users can't get to their applications.
To get the most performance out of the NetScaler, you should, as much as possible, choose one of the native VIP types, such as HTTP, SSL, DNS, or MYSQL. Apart from performance, this also gives you granular control, such as your choice of rewrite policies. For applications that don't have a native protocol, or that use a mix of sub-protocols, a layer 4 protocol such as TCP, UDP, or SSL_TCP is the right choice. Some applications might need you to forward traffic with only minimal handling; this is when the SSL_Bridge or ANY type of VIPs come into play, where the NetScaler is essentially just flinging packets it receives on the VIP to the services as fast as it can.
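For illustration, the choice surfaces when you create the vServer; the names, IPs, and ports below are made up:
add lb vserver vs_web HTTP 10.0.0.10 80
add lb vserver vs_app_l4 TCP 10.0.0.11 8443
add lb vserver vs_passthrough SSL_BRIDGE 10.0.0.12 443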
Special considerations for load balancing Firewalls or CloudBridge appliances
When load balancing Firewalls and CloudBridge devices, there are a couple of options that are not very evident. Let's take a look at what these are, because they are the only way to make such scenarios work.
If you are setting up Firewall load balancing, this will require you to have a vServer of type ANY with IP and Port set to * (wildcard) and MBF enabled so that you are not introducing asymmetry in routing. This all works great except when you also have L3 mode set to ON (the default) and have more specific static routes, or when one of these destination IPs is available on the same NetScaler as a VIP:

The Prefer Direct Route option will route traffic to such a destination directly, without passing it through the Firewall first. If you are using FW LB and relying on routing, or have a corresponding VIP, disable this option; if you are not, leave it at its default.
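As a sketch, a wildcard Firewall LB vServer with MAC-based forwarding and the global Prefer Direct Route toggle would look something like the following; the vServer name is made up:
add lb vserver vs_fwlb ANY * *
set lb vserver vs_fwlb -m MAC
set lb parameter -preferDirectRoute NO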
When you want traffic to pass through different sets of firewalls, the limitation you will run into is the NetScaler's default behavior of intercepting packets for a * (wildcard) VIP only once, for VIPs that have the forwarding mode set to MAC instead of IP. This behavior exists to avoid issues resulting from packets running in a loop between two wildcard VIPs. To enable interception more than once, enable the -vServerSpecificMac option:
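A minimal sketch of the toggle:
set lb parameter -vServerSpecificMac ENABLED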

The httpOnlyCookieFlag setting, when enabled, inserts a flag called httponly when forwarding the response to the client, for example: Set-Cookie: NSC_iuuq_wjq=ffffffffc3a01f2445a4a423660; expires=Sun, 03-May-2015 15:14:35 GMT; path=/; httponly.
The significance is that the cookie is not available to applications outside of the browser. This is a recommended approach from a security perspective, as it means that even if a server affected by Cross Site Scripting (XSS) is accessed by the User, the cookie can't be stolen. However, you need to watch out for applications that require out-of-browser handling, a classic example being Java Applets or client-side scripts that need access to this cookie. The problem you may run into is that the requests generated outside of the browser will arrive at the NetScaler without a cookie and potentially end up on a different backend during load balancing, thus breaking the application.
You should also note that this flag exists as a tunable parameter on the AAA vServer as well, which is covered in a later chapter of this book.
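Turning the flag on is a single global setting; a minimal sketch:
set lb parameter -httpOnlyCookieFlag ENABLED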
A related flag is the Secure flag, which tells the browser to return the cookie only over secure (HTTPS) connections. CTX138055 shows a way of setting this using rewrite. It follows that you should only set this for SSL-based vServers; otherwise you will break the application, since the cookies will never be returned.
Note that the articles I mention throughout the book can be found on the Citrix support site. The easiest way to get to them is https://www.citrix.com/support.
Services or ServiceGroups
This choice usually comes down to how big the Server pool you manage is. A couple of servers are easy to manage using the services approach, but as you start adding several of them, you should consider using ServiceGroups. ServiceGroups present the following benefits (a configuration sketch follows the list):
- All settings are at the ServiceGroup level, so adding new servers or removing them is faster, since you only need to provide Server and port details without repeating the parameters each time
- The resulting simplicity also means that you avoid any human errors that might lead to inconsistencies between the different services
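As a sketch, assuming two backend web servers (the names and IPs are made up for illustration):
add serviceGroup sg_web HTTP
bind serviceGroup sg_web 192.168.1.61 80
bind serviceGroup sg_web 192.168.1.63 80
bind lb vserver vs_web sg_web
Adding a third server later is just one more bind command, with all other parameters inherited from the ServiceGroup.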
Common LB issues
Now that you've learned how to choose the key options for our LB deployment, let's take a look at troubleshooting some of the common issues.
You've set up load balancing for the first time and tried to access the web page, but your browser appears to hang. Here's how you go about troubleshooting such issues. Start by checking whether the VIP and services are up. If the service is down, running show service <servicename> will show you why that service is down.
Some examples of what you might see are:
- No appropriate MIP or SNIP:
Resolution: Add a MIP or SNIP. Also make sure that the IP you add is from the right subnet, using a subnet calculator if you have to.
- The Server can't be reached:
Resolution: This will take involvement from you as well as your server teams. Start by looking at a trace. Tracing on the NetScaler is introduced in Chapter 9, Troubleshooting Tools. If the server itself is running perfectly, a blocking Firewall rule might be the problem here.
Also, be sure that the monitor bound is of the right type; the port might be UP but you might need a monitor that runs specific queries to report an unavailable service accurately, especially for multi-tiered services. ECV (Extended Content Verification) monitors serve well here.
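As a sketch of the ECV approach, a hypothetical monitor that checks an application health page rather than just the port (the URL and expected string are made up):
add lb monitor mon_app_ecv HTTP-ECV -send "GET /health.aspx" -recv "OK"
bind service svc_web1 -monitorName mon_app_ecv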
Note
It's not uncommon to land in a situation as a NetScaler Administrator, where the NetScaler shows a monitor time out, but the Server logs will not show any problems. This is one reason why the recommended way of approaching such issues is to get simultaneous traces—on the Client, on the NetScaler, and on the Server.
Troubleshooting application failures where VIP is UP
We've looked at troubleshooting a VIP being down. However, the vServer could be UP but you might see other issues when accessing it; this section talks about a troubleshooting approach for such issues:
- Persistence issues: If users report unusual behavior, such as no longer seeing items in their shopping baskets or seeing application errors, persistence could be a problem. If the situation allows, unbind all services except one and see if the issue still exists.
- Application complexity: This is also something to bear in mind; some applications might appear to be HTTP-based, but might have components/calls that use a different protocol. To rule out this being the problem, start with a TCP VIP or an ANY VIP first and observe all ports in use between the client and server. The application's documentation is a great way to get that information too. You can start going up a layer once this characteristic of the application is understood.
- httponly cookie: This is another area to watch out for; as we discussed while introducing it, any out-of-browser applications or programs will not have access to it if the cookie is marked httponly.
- AppFirewall or rewrite/responder policies: Check whether any AppFirewall or rewrite/responder policies are getting hit. AppFirewall can cause specific objects or buttons to fail if the necessary learnt rules are not deployed, or if the request matches a configured signature.
- NetScaler HTTP protection features: Lastly, it's also useful to know how access to the VIP is being tested. Some enterprises may choose to use automated tools. This helps greatly with testing for scale, but also introduces the possibility that one of NetScaler's TCP/HTTP protection mechanisms is being triggered.
One example I've encountered is when a customer was using a traffic load generator to test how a newly set up VIP would hold up. They noticed that the servers weren't getting much of the traffic. On deeper analysis, we realized that the traffic generator was pumping a lot of requests with a very small advertised window size (hence the _SW_ in the counter tcp_err_SW_init_pktdrop in the following screenshot). This kind of behavior closely resembles a Slow Read attack, which is what this protection was put in for. Once this was understood, the tool was tweaked to instead resemble regular User traffic and the throughput issue was corrected.
Performance issues usually manifest as pages taking a long time to load, or as uploads to or downloads from a site timing out. To troubleshoot these:
- First, check what the direct access experience looks like and confirm that the issue is only seen when going via the NetScaler.
- Obtain a trace. The key here again is simultaneous traces. Taking them simultaneously will save you time in the long run as you start questioning where in the path the bottleneck is.
Once you have the trace, look for the following:
- MSS issues: MSS (Maximum Segment Size) is the maximum size for a TCP segment that a receiver advertising the value can receive. The NetScaler can be configured to advertise different values but by default it advertises 1,460. If you are seeing that one of the entities is advertising a much smaller number here, it is something to consider as a cause for performance issues:
MSS will be shown on the SYN from Client and SYN/ACK from server right in the Info section, but you can also look this up from the Options section of the TCP Frame:
The Options part in the TCP handshake packets will also show you vital information, such as whether Window scaling is enabled and what the scale factor (multiplier) is. If the application experiencing delays involves large transfers, you can try increasing the scale factor so that the receive windows are expanded to accept more data per acknowledgement.
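Both the advertised MSS and window scaling live in the TCP profile used by the vServer or service; as a sketch, assuming you create a custom profile rather than editing the built-in default (the name and values are illustrative):
add ns tcpProfile tcp_largexfer -WS ENABLED -WSVal 8 -mss 1460
set lb vserver vs_web -tcpProfileName tcp_largexfer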
- Networking issues: If you are plugging more than one network interface into the same broadcast domain, ensure that you are not introducing a loop. This can very easily bring performance to its knees. Common issues are misconfigured or missing VLANs, NIC flaps, or MBF-related issues. These are covered in greater detail in Chapter 5, Networking.
- TCP Window issues: Is it possible that the client, the server, or the NetScaler is running out of receive window? Usually, the Window size that each of the parties can receive will start at 64 K and decreases as that party accumulates data it still needs to pass on, either onwards if it's the NetScaler, or to the application if it's the client or the server. If one of the parties is slow in consuming what has already been sent to it, you could see Zero Window situations creeping in:
zero windows
Occasional zero windows are not a serious problem, as long as the receiver is able to quickly empty the receive buffer and send out a notification that it has free buffers to accept more data. The problem is when the zero window situation persists long enough that the sender has to give up, or if timeouts are getting hit. Take the following screenshot for example:
Here, the SNIP has advertised a zero window to the server; it cannot accept any further data, and the server is obliged to wait. If it thinks it has waited long enough, it will even send a probe to see whether the NetScaler is ready to accept more data (you can find such probes using the Wireshark filter tcp.analysis.zero_window_probe). The NetScaler, on its part, waits for a packet from the client indicating that the client is ready to accept more data or that it has processed the data it previously received. That confirmation arrives in the form of an ACK. Following this ACK, the NetScaler SNIP sends out a TCP Window update, telling the server that it is ready to accept more data. The key is whether this recovery happens fast enough; if it doesn't, performance will drop. Also, a high number of zero windows from the client can cause the NetScaler to reset the connection in order to protect its memory from saturation, as that kind of TCP pattern is characteristic of a known TCP attack (sockstress). This protection, by the way, is toggled using the command set ns tcpparam -limitedPersist ENABLED/DISABLED.
- Intermediate single packet drops: A related issue is the situation where an intermediate device (such as a firewall) keeps dropping a particular packet after seeing something suspicious about it. The difference compared to the earlier firewall issue we talked about is that this wouldn't be a simple 100% drop of packets, which actually is easier to spot. Instead, a large packet or simply an ACK from client or server is dropped continuously, which causes a retransmission loop until the connection fails.
These issues are best diagnosed by a trace and manually calculating SEQ and ACK numbers to find out whether each receiver is receiving and ACKing what the sender sends and whether that ACK is reaching the sender. Some amount of retransmissions or DupAcks are inevitable on any busy production network; however, if you are seeing a high number of them in the same TCP stream, that is a cause for concern.
Also, if you are seeing ICMP messages indicating that the packets are too large, enable PMTUD in the list of NetScaler modes to avoid fragmentation or drops due to unable-to-fragment issues. We discussed PMTUD in the first chapter.
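A minimal sketch of enabling the mode:
enable ns mode PMTUD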
- Surge queue building up on the service: In the output that follows, we can see that the requests are ending up in the surge queue. In this scenario, which I have set up to demonstrate such a situation, I have set the MaxClient parameter to 1, telling the NetScaler not to send more requests to a service that is already processing one (a configuration sketch follows this list). "Is the solution to immediately remove the MaxClient setting?" That depends. The value you configure here is there to protect the servers from getting saturated, preventing extremely degraded performance or, worse, the server crashing under the load. So a deeper understanding of what the server can handle is needed (working with the Server vendor if needed) to choose an appropriate value.
- NetScaler resource issues: Check how the CPU and memory are doing. The NetScaler is a hardened appliance with a very well-tuned TCP stack and, as such, it can handle millions of connections before it starts becoming a bottleneck. Nevertheless, you can certainly hit situations where the resources on the NetScaler are saturated. These can be in the form of memory leaks or CPU spikes. Please check out Chapter 8, Troubleshooting the NetScaler System, later in the book, where I cover these in detail.
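The sketch referenced in the surge queue item above; the service name and the value of 100 are illustrative and should come from that capacity analysis:
set service svc_web1 -maxClient 100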
When you are bringing a new service or set of services into an existing load balancing environment, you would want to verify that the NetScaler is distributing load fairly evenly. There are multiple ways of doing this, including looking at the service hits. I find it helpful, especially for troubleshooting, to look at nsconmsg outputs to see how this is happening.
Note
nsconmsg is covered in more detail in Chapter 9, Troubleshooting Tools.
In the following example, I have used the -j option to list the vServer name to help me narrow the output, and, using the -s ConLB option to set the debug level and the distrconmsg display (-d) option, I am able to see how the distribution evolves every 7 seconds.
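Put together, the invocation looks something along these lines; the exact debug level (2) and the vServer name are my assumptions here:
nsconmsg -s ConLB=2 -j 205_vip -d distrconmsg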
What I can see in the following screenshot is that the VIP 205_vip is using the least connections method for load balancing and SourceIP persistence, and that all the hits are persistent. Also, it has two services, 192.168.1.61:80 and 192.168.1.63:80, and their respective hits. I started the second client with a delay, which is why 192.168.1.61 starts out as the only one taking load, given the persistence implication, before 192.168.1.63 starts to get some hits. The goal is to get this to be as even as possible:

Persistence is the most common reason why you might see uneven distribution to the servers. The disparity coming from persistence can sometimes be greatly exaggerated when all clients come from behind a NAT device and consequently share a single IP. This is why cookie insertion makes an excellent persistence method.
Another reason is using a load/response based method that changes distribution based on server capacity. Here's how to see those persistence entries:
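If you are working at the CLI, the persistence table behind that output can be listed with the following command:
show persistentSessions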

This helps when you are trying to determine which server a particular client is being served from.
If using cookie-based persistence, the client comes back with the LB cookie it was provided each time it places a request, so a persistence table is not necessary. Instead, use the show lb vserver output to identify which server the request will persist to:

Now, if you look at the header of a response and do a match, you can see that this response was served from 63_svc:
HTTP/1.1 200 OK
Content-Type: image/png
Last-Modified: Sat, 18 Apr 2015 11:05:01 GMT
Accept-Ranges: bytes
ETag: "b07dcc83c779d01:0"
Server: Microsoft-IIS/7.5
Date: Fri, 08 May 2015 13:28:01 GMT
Content-Length: 184946
Set-Cookie: NSC_205_wjq=ffffffffc3a01f2e45525d5f4f58455e445a4a423660
Note that the cookie name can be changed from the default NSC_vipname format. This ability was specifically added for applications that require the cookie to have a specific name; for example, Lync 2013 needed the cookie to be called MS-WSMAN:
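A sketch of renaming the cookie on an existing cookie-insert vServer; the vServer name here is made up:
set lb vserver vs_lync -persistenceType COOKIEINSERT -cookieName MS-WSMAN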

You can also dive a little deeper into the LB VIP's performance using the -d oldconmsg option. This will give you a ton of information to work with:
nsconmsg output with -d oldconmsg

In the preceding screenshot, you can see the packets/sec, the weight of each service, the hits that each service is getting, the number of active transactions (Atr), the traffic handled in Mbps, and, critically in this particular test case, the Response time (RspTime), which is 492.66 ms. We can see that the server is struggling a little, considering it's taking nearly half a second to respond to each request. Another, perhaps bigger, indicator of trouble is the surge queue that is building up: SQ (893).
There are also some very useful connection level details you can learn from this screenshot:
- CSvr(8154, 2774/sec): This is the number of connections to the Service
- MCSvr(894): This is the maximum number of connections it has handled at one time
- OE(1112): This is the number of open established connections at one time
- E(303): This is the number of established connections
- RP(0): This is the number of connections in the reuse pool
Established connections, as the name would imply, are the ones that are active and on which traffic is being received. Open established connections are the ones that are in the TIME_WAIT state awaiting closure. There is a NetScaler zombie cleanup process that runs periodically and closes them to make room for newer connections. Any resets resulting from this type of closure will contain a window size of 9300.
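The cleanup interval itself is a global timeout; as a sketch, with the value shown here being an assumption rather than a recommendation:
set ns timeout -zombie 120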
If your users or application team reports that a VIP was unavailable for a brief period before automatically getting restored, here are a few possibilities that you need to look at:
- Was the traffic actually getting to the NetScaler? SNMP or the Reporting tab (covered in detail in Chapter 2, Troubleshooting Core NetScaler Features) of NetScaler are a great way to look this up. If you find that it never got to the NetScaler, consider:
- Possible VLAN mismatches
- Switch/firewall failures
- DNS failures
- Did the service flap? The first place to look will be the events, to see whether the services or the VIP itself flapped; you can do this by using the IP if it's unique, or by using the IP and Port combination:
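A command along these lines produces that kind of event listing; the IP:port and the grep pattern are illustrative:
nsconmsg -K /var/nslog/newnslog -d event | grep "192.168.1.63:80"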
In the preceding snippet, I can see the monitor failure that brought the service down. Notice that unlike with the earlier examples of nsconmsg, I am using the -K option and a newnslog file location. This is because we are doing a postmortem and want to look at historical data. If you leave it out, you will see only live data. You can also get a more detailed look at monitor states using the following:
nsconmsg -K /var/nslog/newnslog -s ConMon=3 -d oldconmsg
- If the issue is still occurring, a trace would be even better (nstrace is covered in Chapter 9, Troubleshooting Tools) as it provides more insight than logs. As always, try and get simultaneous ones on the NetScaler and the server to be able to effectively narrow this down.
- It is also worthwhile knowing if server maintenance was scheduled around that time. One such example would be scheduled snapshots of the server virtual machines, which will cause them to be unavailable over the network briefly.
- Surge queue can play a role here as well; if the server is loaded beyond its configured capacity, the NetScaler will stop forwarding traffic to the servers until the number of established connections on that service comes down.
- Check whether there was a crash; you will find numbered folders under /var/core/ with core dumps matching the timestamp of the issue, which you should report to tech support. A crash can look like a brief outage, either because of the failover that follows (in an HA pair) or, in the case of a standalone unit, because the appliance reboots and starts serving traffic again. HA Failovers are covered in Chapter 5, High Availability and Networking.