Blackboard Perspectives and thoughts from Zhenhua Yao

How Do We Deal With Flaky Tests

Today I read a blog article from the Google Testing Blog, “Flaky Tests at Google and How We Mitigate Them”, and would like to share my thoughts.

Flaky tests are not unheard of in a large software project, particularly when test cases are owned by developers with varying levels of experience. People hate flaky tests as much as failed tests: reruns consume more resources, false alarms waste precious dev time, and oftentimes people end up ignoring them or disabling them entirely. Personally I do not agree with the approach in the blog; it is simply not a quality-driven culture.

My opinion has always been that the Heisenberg uncertainty principle plays no role in software development: any “nondeterministic” behavior can be traced back to a quality issue, and the number of flaky tests should be driven down to zero.

In my past observation, much flakiness is caused by test code issues. There is no test code for the test code, and people may not bring the same level of quality awareness to it as they do to product code. Besides unsafe threading, race conditions, lack of synchronization, etc., there are common anti-patterns that cause flakiness (only what I can think of at the moment):

  • Checking driven by timeout instead of by event: for instance, click a button on the UI, wait for 5 seconds, then click the next button (see the sketch after this list for an event-driven alternative).
  • Unawareness of async events: for instance, load a web page and wait for it to finish by checking whether a certain tag is present, then proceed to submit the form. But the page actually has an iframe that also has to finish loading.
  • Incorrect assumptions about the runtime condition. There are too many such examples. In one case, a P1 test case owned by my team failed over the primary replica of the controller, then waited for it to come back by checking the primary IP address reported by the storage cluster management (SCM). Unfortunately the check was incorrect, because only the layer above SCM can reliably tell whether the new primary is up.
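
To make the first anti-pattern concrete, here is a minimal C# sketch of condition-based waiting with an explicit timeout; the helper name and the page object in the usage comment are hypothetical, but the pattern replaces a fixed Thread.Sleep with polling for the actual event:

    using System;
    using System.Diagnostics;
    using System.Threading;

    static class TestWait // hypothetical helper, not from the original post
    {
        // Poll the condition until it holds or the timeout expires.
        public static void Until(Func<bool> condition, TimeSpan timeout, TimeSpan pollInterval)
        {
            var sw = Stopwatch.StartNew();
            while (!condition())
            {
                if (sw.Elapsed > timeout)
                    throw new TimeoutException($"Condition not met within {timeout}.");
                Thread.Sleep(pollInterval);
            }
        }
    }

    // Usage: instead of "click, Thread.Sleep(5000), click", wait for the real signal:
    // TestWait.Until(() => page.IsNextButtonEnabled(),
    //                TimeSpan.FromSeconds(30), TimeSpan.FromMilliseconds(200));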

Besides test code bugs, real product issues may also cause flakiness in test execution. This is particularly dangerous in a cloud environment, since massive scale magnifies the probability of hitting the real issue in production. We sometimes say that if some bad thing has even a small chance of happening, it will happen after deployment.

Most of the time, driving down flaky tests requires the right mindset and the right prioritization from the leadership team. As long as flaky tests and failed tests are treated as rigorously as product bugs and live-site incidents in terms of priority and resource assignment, nothing is unfixable. As the tests become more reliable and more proven patterns are adopted, the improved CI experience will benefit everyone from ICs to leadership. One should not underestimate the ROI of driving for better quality.

Creating a Process Dump File

In a cloud environment we mostly rely on log traces to understand what is happening inside a program and to investigate abnormal behaviors when things do not work as expected. However, sometimes we need deep insight into the process internals via debugging. When live debugging is not possible, we have to capture a user-mode process dump and conduct post-mortem analysis. For instance, a memory leak or an unusual increase in memory usage is such a case, where objects on the heap need to be examined closely.

The most common way to create a dump file is to use Windows Task Manager. One can open Task Manager, click the “Processes” tab on Windows 7 or the “Details” tab on Windows 8/10, then right-click the name of the process and click “Create Dump File”. The dump file is saved in the %TEMP% directory, which is usually \Users\UserName\AppData\Local\Temp on the system drive. The size of the dump file is roughly the virtual bytes of the process.

The downside of this method is that the location of the dump file cannot be specified. In addition, one cannot choose whether to create a minidump (thread and handle information only) or a full dump (all process memory).

In some cases this can be a severe issue. In an Azure data center, free space on the system drive is extremely limited (oftentimes less than 15 GB). Personally I have seen (via post-mortem analysis) dump file creation exhaust the disk space and render the OS unusable while someone was responding to a live-site incident, which made the situation worse by creating a second incident.

A better way to create a dump file is to use ProcDump or AdPlus (part of WinDbg). An example of creating a full dump is:

procdump -ma MyProcess c:\temp\myprocess.dmp
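
ProcDump can also write a dump when a trigger fires, which is handy for intermittent problems; for example, per the ProcDump documentation, the following captures a full dump when the process hits an unhandled exception:

procdump -e -ma MyProcess c:\temp\myprocess_crash.dmp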

ProcDump is written by Mark Russinovich, a Microsoft Technical Fellow. It is very small in size; one can visit its TechNet page to download it. If a GUI is preferred, I strongly recommend a Task Manager replacement, Process Explorer, by the same author.

How Does a VM Boot Up in Azure

One of the frequent complaints in Azure support is that customers cannot RDP / SSH into their VMs. In many cases there are very few options for the users other than retrying or filing a support ticket. After all, many components may go wrong, trace logs are internal, and no one can put hands on the physical machine to check what is happening. I hope a brief explanation can give you some insight and make your life slightly easier. :-)

WARNING: some technical details are omitted, and accuracy is not guaranteed. Maybe more importantly, some design decisions may sound suboptimal or even controversial. In public, many companies like to brag about how great their culture is and how much they care about engineering quality. The reality is that in a work-intensive environment, engineers often just “make stuff work and ship it”; fortunately people learn and grow over time, so products improve gradually.

When you deploy a cloud service or start an IaaS VM via the web portal or automation, a Deployment is created. Inside the fabric controller (FC), a deployment is called a Tenant. The service package (cspkg) and service configuration file (cscfg) are translated into a service description file (svd), a tenant description file (trd), and images. In each cluster, the compute manager and the network manager are the two important roles that manage the resource inventory and the tenant life cycle. At the regional level, some networking resources such as VIPs and virtual networks (VNETs) are managed by a regional controller. Each tenant consists of multiple roles, each role has one or more role instances, and each role instance lives in a VM, which is called a container. VMs are hosted by Hyper-V servers, which are managed by the fabric controller running in the cluster. The fabric controller is itself a tenant like all the customer tenants (imagine clusters managed by clusters).

Create / update tenant: the svd/trd as well as the regional resources are received by the compute manager. This component allocates the resources and finds physical nodes with sufficient capacity to host the required containers. After compute resource allocation is completed, it passes the information to the network manager (NM). NM then allocates networking-related resources and sends the relevant information to the agent program running on the nodes where the containers will be running. After NM sends back confirmation that the network allocation is done, the tenant is successfully updated to the required service description. The communication between the compute and network managers is implemented by single-direction WCF with net.tcp binding. Many APIs are synchronous with a certain configurable timeout.
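
To make the last point concrete, here is a minimal C# sketch of a synchronous net.tcp WCF call with a configurable timeout; the contract, operation, and endpoint are hypothetical and only illustrate the communication style, not the actual manager interfaces:

    using System;
    using System.ServiceModel;

    [ServiceContract]
    interface INetworkManager // hypothetical contract, for illustration only
    {
        [OperationContract]
        bool AllocateTenantNetwork(string tenantId); // hypothetical API
    }

    class ComputeManagerClient
    {
        static void Main()
        {
            var binding = new NetTcpBinding();
            binding.SendTimeout = TimeSpan.FromSeconds(30); // the configurable timeout
            var factory = new ChannelFactory<INetworkManager>(
                binding, new EndpointAddress("net.tcp://nm-host:9000/nm")); // hypothetical endpoint
            INetworkManager nm = factory.CreateChannel();
            // Synchronous call: blocks until NM replies or the timeout expires.
            bool done = nm.AllocateTenantNetwork("tenant-123");
            Console.WriteLine($"Network allocation confirmed: {done}");
        }
    }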

Start tenant: upon receiving this command from the frontend, the compute manager drives the start-container workflow by sending the goal state to the compute agent running on the node. At the same time, it also notifies NM that certain role instances have to be connected (or in service). NM sends the goal state to its agent to perform the network programming. Note that this work happens on multiple nodes in parallel.

Once the compute agent receives the command to start a container, it downloads whatever is needed from Azure storage (e.g. the VHD file for IaaS, the service package for PaaS), then creates a VM via Hyper-V WMI. By default the VM NIC port is blocked, so no network traffic can go through. The NM agent program periodically gets the list of all the VM NICs on the host, checks whether the goal state is achieved, and programs the interfaces if needed. Once the programming is completed, the port is unblocked and the VM can talk to the outside. Also note that the compute and network agents are two different programs driven by different manager components, and they work somewhat independently.
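
The periodic check-and-program behavior is a classic goal-state reconciliation loop. Here is a hypothetical C# sketch of the pattern; the helper functions stand in for Hyper-V / host queries and are not the real agent code:

    using System;
    using System.Threading;

    class NicReconciler // hypothetical; illustrates the pattern only
    {
        static string[] GetVmNicsOnHost() => new[] { "nic-1", "nic-2" }; // stand-in for a Hyper-V query
        static bool MatchesGoalState(string nic) => false;               // stand-in for the comparison
        static void ProgramInterface(string nic) => Console.WriteLine($"programming {nic}");
        static void UnblockPort(string nic) => Console.WriteLine($"unblocking {nic}");

        static void Main()
        {
            while (true) // reconcile forever: desired state vs. actual state
            {
                foreach (var nic in GetVmNicsOnHost())
                {
                    if (!MatchesGoalState(nic))
                    {
                        ProgramInterface(nic); // apply the goal state first
                        UnblockPort(nic);      // only then let traffic through
                    }
                }
                Thread.Sleep(TimeSpan.FromSeconds(10)); // hypothetical polling period
            }
        }
    }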

Assuming everything works properly, the VM is created successfully, the port is unblocked, and the guest OS boots up; the DHCP client then sends a DHCP discover to get its IP configuration and other information. DHCP packets are intercepted and encapsulated if needed, then forwarded to WDS in the cluster. WDS receives all DHCP traffic in the cluster, from both physical hosts and virtual machines. A plugin in WDS then forwards the request to the cluster hardware management component, which asks NM whether the request is for a physical host or for a tenant. If the former, this component handles it by itself; if the latter, it forwards the request to NM.

After NM receives the DHCP request, it finds the corresponding tenant based on the MAC address, then returns the IP address and other relevant information. After that, the response goes back via the same path all the way to the VM. Now the VM has an IP configuration and continues to boot.

Start role instance: this is mainly for PaaS tenants but is also relevant for IaaS. The guest agent (GA) in the VM is configured to start automatically. In the DHCP response there is a special DHCP custom option, 245, carrying the wire server address. It is the IP address of a web server running on the physical host to which the guest agent (or anyone running inside the VM) can talk. GA retrieves this address from the DHCP response and does the following (see the sketch after this list):

  • asks the wire server which version of the protocol is supported.
  • periodically asks for the machine goal state, i.e. whether the VM should continue to run or shut down.
  • retrieves the role configuration, which contains the IP and other useful information about the runtime environment.
  • keeps the heartbeat with the wire server, reporting the health state and current status.
  • upon request, starts the role instance by loading the program on the app VHD mounted in the OS.
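
For the curious, the wire server is reachable from inside an Azure VM at the published address 168.63.129.16 (the value carried in option 245). Here is a minimal C# sketch of the first step, asking which protocol versions are supported; the exact response format is an implementation detail and may change:

    using System;
    using System.Net.Http;

    class WireServerProbe
    {
        static void Main()
        {
            // Only works from inside an Azure VM; 168.63.129.16 is the published wire server address.
            using (var http = new HttpClient())
            {
                string versions = http.GetStringAsync("http://168.63.129.16/?comp=versions").Result;
                Console.WriteLine(versions); // XML listing the protocol versions the wire server supports
            }
        }
    }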

The wire server is a low-privilege web app designed to be the middleman between the fabric and the containers. If there is any state request, for instance that the VM should shut down, the request is forwarded by the compute agent to the wire server, and the guest agent is supposed to poll it. The state of the role instance, e.g. starting / running / busy / unresponsive, is reported by the GA to the wire server and then to the compute agent. The wire server and GA work together to keep the VM in a healthy and responsive state.

For PaaS containers, thanks to the Hyper-V integration services (primarily the heartbeat service) and GA, the compute agent knows whether the guest OS received a proper DHCP offer, when the container is ready, whether the container picked up the desired machine goal state, whether the role instance started properly, and so on. If something unexpected happens, the compute stack decides whether to retry, to move the container to another host (assuming the problem is caused by a malfunctioning physical host), or to give up and wait for investigation by the engineering team.

Once the wire server / compute agent is notified of the role instance state, the information is propagated to the upper layers and finally reflected in the web portal. Assuming the RDP / SSH endpoint is configured properly, the incoming request is routed by the software load balancer, and then it is up to the OS how to handle it. Many other factors determine whether RDP / SSH works, including but not limited to firewall rules, service state, etc.

For IaaS VMs, the OS VHD may be uploaded by customers, so GA may not be installed and the Hyper-V integration services may not exist. In this case the compute stack has no way to know what is running inside the guest OS, so it simply starts the VM via Hyper-V WMI. As long as the VM is healthy from the hypervisor's perspective, even if the guest OS is stuck somewhere in the boot process, the compute agent will not do anything, since it has no knowledge of it.

As you can see, many components participate in the VM / container boot process. Most of the time things work amazingly well. However, sometimes the fabric controller or other parts of the platform may have glitches, and sometimes even the Windows OS may hit issues, consequently causing access problems for customers. Personally I have seen a variety of incidents due to platform quality issues, such as:

  • General platform unavailability due to a cluster outage. In this case a large number of VMs are offline.
  • A storage incident slows down OS VHD retrieval and, oddly enough, breaks the Base Filtering Engine service during boot.
  • A physical networking issue causes packet drops.
  • The networking agent does not program the interface properly, so the DHCP discover is blocked and no IP address is available.
  • A missing notification from the compute manager or the regional network controller, so the network manager does not send the correct goal state to the network agent.
  • After a container in a VNET tenant moves to a different host, the DHCP response is formed incorrectly, so the guest OS DHCP client fails to take the new lease.

Many more issues are caused by the guest OS, such as:

  • The RDP endpoint is misconfigured.
  • DHCP is disabled.
  • Incorrect firewall configuration.
  • A mistake in sysprep, so the Windows image stops working after moving to an Azure Hyper-V server.
  • A corrupted VHD, so the OS does not even boot to the login screen.

There are also many Windows OS issues that engineers in Azure have to escalate to the Windows team. The number of options to investigate on the customer side is limited; before contacting customer service, it is worth double-checking the guest-side items listed above.

Oftentimes a support ticket has to be filed so our engineers can collect detailed information and figure out whether it is caused by a platform issue. If you suspect a platform issue is the culprit, for instance a deployment stuck in the starting or provisioning state for a long time, contact customer support. We will check all the way from the frontend down to the Hyper-V server, including both the control path and the data path, to mitigate and resolve the incident. Security and privacy are taken very seriously in Azure: without consent from customers, no one is able to check the internal state of a guest VM and/or an attached VHD. In some cases, cooperation from customers speeds up the investigation significantly, because it allows engineers to read Windows events and logs and to perform kernel-mode debugging to understand what exactly prevents RDP from working.

This is a brief, high-level description of how a VM boots up. Later I may discuss a few components in more detail to give you more insight. Thanks for reading!

Time Span Measurement in Managed Code

In performance analysis, the first step is often the time span measurement of various operations in the application. It is important to capture correct timestamps and to know accurately how much time each step takes. Without the right data, one may make incorrect decisions and spend resources inappropriately. This just happened in my team while a DHCP slow-response issue was being investigated.

On computers, time is measured by some variant of a system clock, which can be a hardware device that maintains a simple count of ticks elapsed since a known starting date (a.k.a. the epoch), or a relative measurement provided by performance counters in the CPU. In applications, sometimes we want to know the calendar time, or wall clock time, so we can correlate what happens inside the program with what happens in the real world (e.g. a customer places an order); sometimes we want to know the time span, i.e. how fast or slow an operation is. We often use a timestamp retrieved from the system time for the former, and calculate the difference of two timestamps for the latter. Conceptually it works, but the relative error of the measurement matters.

On the PC motherboard there is a real-time clock (RTC) chip; some people call it the CMOS clock since the clock is part of the CMOS. It keeps the date and time, as well as a tiny CMOS RAM that held the BIOS settings in the old days. Even when the PC is powered off, it keeps running on a built-in battery. Via I/O ports 0x70 and 0x71, one can read or update the current date / time. Because the cheap oscillator often does not run at exactly the designed frequency, the clock may drift over time. The OS compensates for this by periodically synchronizing with a time service using NTP; e.g. time.nist.gov is considered one of the most authoritative time sources.

In the OS, the initial date / time is retrieved from the RTC, then it is updated periodically, typically 50 - 2000 times per second; the duration between updates is called the clock interval. The clock interval can be adjusted by applications; for instance, multimedia applications and the Chrome browser often set the clock interval to 1 ms. Smaller intervals have a negative impact on battery usage, so they are avoided whenever possible. On servers the interval is normally kept at the default.

One can query the clock using the Sysinternals tool clockres.exe or the Windows built-in program powercfg. On my desktop the clock interval is 15.625 milliseconds (ms) (64 Hz frequency), and the minimum supported interval is 0.5 ms. 15 ms resolution is sufficient for most real-world events, e.g. when the next TV show starts, but it is insufficient for time measurements on computers in many cases, in particular for events lasting tens of milliseconds or less. For instance, if an action starts at some point in clock interval 1 and stops at another point in clock interval 3, the system clock will show a time span of 31 ms, but the actual duration can be anywhere from 15.625 to 46.875 ms. On the other hand, if an action starts and stops within the same clock interval, the system clock will tell you the duration is 0, but it can be as long as 15 ms. My coworker once said “the request is received at xxx, at the same time it gets processed …”; sorry, within a single thread two different things do not happen at the same time.

In .NET, the system clock is read using DateTime.UtcNow (see the MSDN doc). It is implemented by calling the Win32 API GetSystemTimeAsFileTime. Note that although the unit is the tick (100 ns), the resolution is really the clock interval.
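
One can observe this granularity directly. A small C# sketch: spin until DateTime.UtcNow changes and print the increment. On the .NET Framework of the time this should come out near the clock interval (15.625 ms by default, or 1 ms if something on the machine has raised the timer resolution); newer runtimes use the precise system time and will show a much smaller step:

    using System;

    class ClockGranularity
    {
        static void Main()
        {
            DateTime start = DateTime.UtcNow;
            DateTime next = start;
            while (next == start)       // busy-wait until the system clock ticks over
                next = DateTime.UtcNow;
            Console.WriteLine($"Observed increment: {(next - start).TotalMilliseconds} ms");
        }
    }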

In the old days (Windows XP, Linux many years ago) people used to read the processor timestamp counter (rdtsc) to acquire high-resolution timestamps. It is tricky to get right in virtual environments, on multi-core systems, and on special hardware. Nowadays on Windows the solution is the Win32 API QueryPerformanceCounter (QPC). On modern hardware (i.e. almost all PCs nowadays), the resolution is less than 1 microsecond (µs). On my “cost effective” home PC, the resolution is about 300 ns, or 3 ticks. For more information on QPC read the MSDN article here.

In .NET, QPC is exposed via System.Diagnostics.Stopwatch (see the reference source here). For any time span measurement this should be considered the default choice.
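
A minimal usage sketch; the DoWork body is just a stand-in for whatever operation is being measured:

    using System;
    using System.Diagnostics;
    using System.Threading;

    class TimingDemo
    {
        static void DoWork() => Thread.Sleep(10); // stand-in for the operation under measurement

        static void Main()
        {
            var sw = Stopwatch.StartNew(); // wraps QueryPerformanceCounter
            DoWork();
            sw.Stop();
            Console.WriteLine($"Elapsed: {sw.Elapsed.TotalMilliseconds:F3} ms");
            Console.WriteLine($"High resolution: {Stopwatch.IsHighResolution}, " +
                              $"{Stopwatch.Frequency} ticks per second");
        }
    }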

Another thing to remember is that Stopwatch / QPC is not synchronized to UTC or any wall clock time. This means that if the computer adjusts its clock after synchronizing with the time server, the measurement is not affected: no forward or backward jump. In fact I once saw a stress test failure caused by a forward clock adjustment: the time sync happened just as the timeout was being evaluated, so the calculated time span was several minutes greater than the actual value. This kind of bug is hard to notice and investigate, but trivial to fix. Avoid wall clock time if possible.

In terms of overhead, Stopwatch is more expensive than DateTime.UtcNow, but both take very little CPU time. On my home PC, Stopwatch takes about 6 ns versus 3 ns for DateTime.UtcNow. Normally that is much shorter than the duration being measured.

The last question is: if we do need absolute time correlation across multiple computers, is there anything better than System.DateTime.UtcNow? The answer is yes: set up all computers to use the same time source, then use the GetSystemTimePreciseAsFileTime API. It is supported on Windows 8 / Server 2012 or later. In .NET one needs P/Invoke to use it.
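
A minimal P/Invoke sketch; GetSystemTimePreciseAsFileTime writes a 64-bit FILETIME, which maps cleanly onto a long:

    using System;
    using System.Runtime.InteropServices;

    static class PreciseClock
    {
        [DllImport("kernel32.dll", ExactSpelling = true)]
        static extern void GetSystemTimePreciseAsFileTime(out long fileTime);

        // High-precision UTC wall clock (Windows 8 / Server 2012 or later).
        public static DateTime UtcNow()
        {
            GetSystemTimePreciseAsFileTime(out long ft);
            return DateTime.FromFileTimeUtc(ft);
        }
    }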

How Does Bandwidth Throttling Work in Azure?

Over the weekend I saw a few posts on StackOverflow asking how network traffic is throttled in Azure, how much bandwidth a VM can use, etc. There are some answers based on measurements or MSDN docs. I hope this post gives a more complete picture from an engineering perspective.

When we talk about bandwidth throttling, we refer to the network bandwidth cap at the VM / vNIC level for outbound traffic (the transmit path). Inbound traffic from the public internet goes through the Software Load Balancer, but no throttling is applied at the VM / host. All network traffic going out of a VM, including both billed and unbilled traffic, is throttled so the bandwidth is limited to a certain value. For instance, a “Small” VM is capped at 500 Mbps. Different VM sizes have different caps, some of them very high.

Then the question is: if a VM has more than one interface, is the cap shared by all interfaces or divided equally among them? And if the value of the bandwidth cap is updated, will the VM be rebooted? The answer is, it depends. Some time ago, the network bandwidth was managed by the tenant management component in FC. Technically, the agent running on the host set a bandwidth cap on the VM switch port using Hyper-V WMI when creating the VM. If there were multiple interfaces, the cap was divided by the number of interfaces. If we wanted to change the bandwidth of an individual VM, or of all VMs of the same size, the fabric policy file in the cluster had to be updated and the VMs had to be re-created to apply the new values. Recently we changed the design to let the network management component in FC handle this work. The network programming agent communicates with a filter driver (VFP) on the host to create a QoS queue and then associates all interfaces with the queue, so all interfaces share the same cap. For instance, if a Small VM has two NICs and the first NIC is idling, the second NIC can use up to 500 Mbps. Basically, the cap now applies to the entire VM. Some clusters may temporarily not have this feature enabled, but that should be rare.
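
For a feel of what the older per-vNIC mechanism looks like, on a standalone Hyper-V host one can set a cap on a VM network adapter with the Hyper-V PowerShell module; this is only an illustration of the concept, not the fabric controller's actual code path (the VM name and value are made up):

Set-VMNetworkAdapter -VMName MyVm -MaximumBandwidth 500000000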

Another question is: since you call it a “cap”, does it mean my VM will get that amount of bandwidth in the best case, and may get less if neighbors are noisy? The answer is that noisy neighbors do not affect the bandwidth throttling. The allocation algorithm in FC knows how much resource exists on each host, including the total network bandwidth, and container allocation is designed to allow each individual container to use its full capacity. If you truly believe the bandwidth is a lot less than advertised (note that a Linux VM needs the Hyper-V IC deployed), you may open a support ticket, and ultimately the engineering team will figure it out. From our side, we can see the values in the SDN controller as well as the value set on the QoS queue.

In terms of latency and throughput, performance measurement shows no statistically significant difference between QoS-queue-based throttling and VM-switch-based throttling. Both perform equally well.

The new design allows seamless and fast updates of bandwidth in a cluster: the entire process takes less than half a minute, with no visible impact on the running VMs. It also leaves room for further enhancement should the upper layers support it, for instance adjustable bandwidth for the same container size based upon customer requirements. I hope all customers are satisfied with networking in Azure. :-)