29 May 2016
Today I read a post on the Google Testing Blog, “Flaky Tests at Google and How We Mitigate Them”, and would like to
share my thoughts.
Flaky tests are not unheard of in a large software project, particularly when test cases are owned by developers with
varying levels of experience. People hate flaky tests as much as failing tests: reruns consume more resources, false
alarms waste precious dev time, and often people simply ignore them or disable them entirely. Personally I do not agree
with the approach in that blog post; it is simply not a quality-driven culture.
My opinion has always been that the Heisenberg uncertainty principle plays no role in software development: any
“nondeterministic” behavior can be traced back to a quality issue, and the number of flaky tests should be driven down
to zero.
In my observation, much flakiness is caused by test code issues. There are no tests for the test code, and people may
not apply the same level of quality awareness to it as they do to product code. Besides unsafe threading, race
conditions, lack of synchronization, etc., there are common anti-patterns that cause flakiness (just the ones I can
think of at the moment):
- Checking driven by timeout instead of by event: for instance, click a button in the UI, wait for 5 seconds, then
click the next button (see the sketch after this list for an event-driven alternative).
- Unawareness of async events: for instance, load a web page, wait for it to finish by checking whether a certain tag
is present, then proceed to submit the form. But the page actually has an iframe that must finish loading first.
- Incorrect assumptions about runtime conditions. There are too many such examples. In one case, a P1 test case owned
by my team fails over the primary replica of the controller, then waits for it to come back by checking the primary IP
address reported by the storage cluster management (SCM). Unfortunately that check is wrong, because only the layer
above SCM can reliably tell whether the new primary is up.
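The first anti-pattern can usually be fixed by polling for the actual condition, with a generous timeout only as a
safety net. Below is a minimal C# sketch of such a helper; the UI helper names in the usage comment (clickButton,
page.Contains) are hypothetical and only illustrate the shape of the fix.

    using System;
    using System.Diagnostics;
    using System.Threading;

    static class TestWait
    {
        // Poll the condition at a short interval instead of sleeping a fixed 5 seconds.
        // The timeout is only a safety net; progress is driven by the actual event.
        public static void WaitUntil(Func<bool> condition, TimeSpan timeout)
        {
            var sw = Stopwatch.StartNew();
            while (!condition())
            {
                if (sw.Elapsed > timeout)
                    throw new TimeoutException("Condition not met within " + timeout);
                Thread.Sleep(100);
            }
        }
    }

    // Hypothetical usage in a UI test:
    //   clickButton("Submit");
    //   TestWait.WaitUntil(() => page.Contains("Order confirmed"), TimeSpan.FromSeconds(30));
    //   clickButton("Next");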
Besides test code bugs, real product issues may also cause flakiness in test execution. This is particularly dangerous
in a cloud environment, since massive scale magnifies the probability of hitting the real issue in production. We
sometimes say that if something bad has even a small chance of happening, it will happen after deployment.
Most of the time, driving down flaky tests requires the right mindset and the right prioritization from the leadership
team. As long as flaky and failing tests are treated as rigorously as product bugs and live-site incidents in terms of
priority and resource assignment, nothing cannot be fixed. As tests become more reliable and more proven patterns are
adopted, the improved CI experience will benefit everyone from ICs to leadership. One should not underestimate the ROI
of driving for better quality.
28 May 2016
In a cloud environment we mostly rely on log traces to understand what is happening inside a program and to investigate
abnormal behaviors when things do not work as expected. However, sometimes we need deep insight into the process
internals via debugging. When live debugging is not possible, we have to capture a user-mode process dump and conduct
post-mortem analysis. A memory leak or unusual memory growth is such a case, where objects on the heap need to be
examined closely.
The most common way to create a dump file is to use Windows Task Manager. Open Task Manager, go to the “Processes” tab
on Windows 7, or the “Details” tab on Windows 8/10, then right-click the name of the process and click “Create Dump
File”. The dump file is saved in the %TEMP% directory, which is usually \Users\UserName\AppData\Local\Temp on the
system drive. The size of the dump file is roughly the virtual bytes of the process.
The downside of this method is that the location of the dump file cannot be specified. In addition, one cannot choose
whether to create a minidump (only thread and handle information) or a full dump (all process memory).
In some cases this can be a severe issue. In an Azure data center, free space on the system drive is extremely limited
(often less than 15 GB). Personally I have seen (via post-mortem analysis) dump file creation exhaust the disk and make
the OS unusable while responding to a live-site incident, which made the situation worse by adding a second incident.
A better way to create a dump file is to use ProcDump or AdPlus (part of WinDbg). An example of creating a full dump:
procdump -ma MyProcess c:\temp\myprocess.dmp
ProcDump is written by Mark Russinovich, a Microsoft Technical Fellow. It is very small; one can download it from the
TechNet page. If a GUI is preferred, I strongly recommend Process Explorer, a Task Manager replacement by the same
author.
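If a dump needs to be captured from your own tooling rather than from Task Manager or ProcDump, the Win32 function
MiniDumpWriteDump in dbghelp.dll can be P/Invoked. Below is a minimal, unhardened C# sketch (dump-type flags and error
handling reduced to the essentials); it is not how ProcDump itself is implemented.

    using System;
    using System.ComponentModel;
    using System.Diagnostics;
    using System.IO;
    using System.Runtime.InteropServices;

    static class DumpWriter
    {
        // Subset of MINIDUMP_TYPE: 0x2 = MiniDumpWithFullMemory (all process memory).
        const uint MiniDumpWithFullMemory = 0x00000002;

        [DllImport("dbghelp.dll", SetLastError = true)]
        static extern bool MiniDumpWriteDump(
            IntPtr hProcess, uint processId, SafeHandle hFile, uint dumpType,
            IntPtr exceptionParam, IntPtr userStreamParam, IntPtr callbackParam);

        // Writes a full dump of the given process to the given path.
        public static void WriteFullDump(int pid, string path)
        {
            using (var process = Process.GetProcessById(pid))
            using (var file = File.Create(path))
            {
                if (!MiniDumpWriteDump(process.Handle, (uint)pid, file.SafeFileHandle,
                                       MiniDumpWithFullMemory, IntPtr.Zero, IntPtr.Zero, IntPtr.Zero))
                    throw new Win32Exception(Marshal.GetLastWin32Error());
            }
        }
    }

As with Task Manager, make sure the target path has enough free space for a full dump before writing it.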
15 May 2016
One of the frequent complaints in Azure support is that customers cannot RDP / SSH into their VMs. In many cases there
are very few options for users other than retrying or filing a support ticket. After all, many components may go wrong,
trace logs are internal, and no one can put hands on the physical machine to check what is happening. I hope a brief
explanation can give you some insight and make your life slightly easier. :-)
WARNING: some technical details are omitted and accuracy is not guaranteed. Perhaps more importantly, some design
decisions may sound suboptimal or even controversial. In public, many companies like to brag about how great their
culture is and how much they care about engineering quality. The reality is that in a work-intensive environment
engineers often just “make stuff work and ship it”; fortunately people learn and grow over time, so products improve
gradually.
When you deploy a cloud service or start an IaaS VM via the web portal or automation, a Deployment is created. Inside
the fabric controller, the deployment is called a Tenant. The service package (cspkg) and service configuration file
(cscfg) are translated into a service description file (svd), a tenant description file (trd), and images. In each
cluster, the compute manager and the network manager are the two important roles that manage the resource inventory and
the tenant life cycle. At the regional level, some networking resources, such as VIPs and virtual networks (VNETs), are
managed by the regional controller.
Each tenant consists of multiple roles, each role has one or more role instances, and each role instance lives in a VM,
which is called a container. VMs are hosted on Hyper-V servers, which are managed by the fabric controller running in
the cluster. The fabric controller is itself a tenant, just like all the customer tenants (imagine clusters managed by
clusters).
Create / update tenant: the svd/trd, as well as regional resources, are received by the compute manager. This component
allocates the resources and finds physical nodes with sufficient capacity to host the required containers. After
compute resource allocation is complete, it passes the information to the network manager (NM). NM then allocates
networking-related resources and sends the relevant information to the agent program running on the nodes where the
containers will run. After NM sends back confirmation that the network allocation is done, the tenant is successfully
updated to the required service description. The communication between the compute and network managers is implemented
as single-direction WCF with net.tcp binding; many APIs are synchronous with a configurable timeout.
Start tenant: Upon receiving this command from the frontend, the compute manager drives the start-container workflow by
sending the goal state to the compute agent running on the node. At the same time, it notifies NM that certain role
instances have to be connected (or in service). NM sends the goal state to its agent to perform the network
programming. Note that this work happens on multiple nodes in parallel.
Once the compute agent receives the command to start the container, it downloads whatever is needed from Azure storage
(e.g. the VHD file for IaaS, the service package for PaaS), then creates a VM via Hyper-V WMI. By default the VM NIC
port is blocked and no network traffic can go through. The NM agent periodically gets the list of all VM NICs on the
host, checks whether the goal state has been achieved, and programs the interface if needed. Once the programming is
completed, the port is unblocked and the VM can talk to the outside. Also note that the compute and network agents are
two different programs driven by different manager components, and they work somewhat independently.
Assuming everything works properly: the VM is created successfully, the port is unblocked, the guest OS boots, and the
DHCP client sends a DHCP discover to get its IP configuration and other information. DHCP packets are intercepted and
encapsulated if needed, then forwarded to WDS in the cluster. WDS receives all DHCP traffic in the cluster, from both
physical hosts and virtual machines. A plugin in WDS then forwards the request to the cluster hardware management
component, which asks NM whether the request is for a physical host or for a tenant. If the former, this component
handles it by itself; if the latter, it forwards the request to NM.
After NM receives the DHCP request, it finds the corresponding tenant based on the MAC address and returns the IP
address and other relevant information. The response then goes back along the same path all the way to the VM. Now the
VM has an IP configuration and continues to boot.
Start role instance: This is mainly for PaaS tenants but is also relevant for IaaS. The guest agent (GA) in the VM is
configured to start automatically. In the DHCP response there is a special DHCP custom option, 245, named the wire
server address. It is the IP address of a web server running on the physical host to which the guest agent (or anything
running inside the VM) can talk. The GA retrieves this address from the DHCP response and does the following (a sketch
of the first step follows this list):
- asks the wire server which versions of the protocol are supported.
- periodically asks for the machine goal state, i.e. whether the VM should continue to run or shut down.
- retrieves the role configuration, which contains the IP and other useful information about the runtime environment.
- keeps a heartbeat with the wire server, reporting the health state and current status.
- upon request, starts the role instance by loading the program on the app VHD mounted in the OS.
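To make the first step concrete, here is a small C# sketch of what a probe of the wire server could look like. The
address 168.63.129.16 and the ?comp=versions query are taken from the publicly available Linux guest agent sources;
treat them as illustrative assumptions rather than a protocol reference.

    using System;
    using System.Net.Http;

    class WireServerProbe
    {
        static void Main()
        {
            // In the real GA the address comes from DHCP option 245; 168.63.129.16 is
            // the value commonly observed in Azure (assumption for this sketch).
            const string wireServer = "168.63.129.16";

            using (var http = new HttpClient())
            {
                // Ask the wire server which protocol versions it supports,
                // as the guest agent does when it starts.
                string versions = http.GetStringAsync(
                    "http://" + wireServer + "/?comp=versions").Result;
                Console.WriteLine(versions);
            }
        }
    }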
The wire server is a low-privilege web app designed to be the middleman between the fabric and the containers. If there
is any state request, for instance that the VM should shut down, the request is forwarded by the compute agent to the
wire server, and the guest agent is supposed to poll for it. The state of the role instance, e.g. starting / running /
busy / unresponsive, is reported by the GA to the wire server and then to the compute agent. The wire server and the GA
work together to keep the VM in a healthy and responsive state.
For PaaS containers, thanks to the Hyper-V integration services (primarily the heartbeat service) and the GA, the
compute agent knows whether the guest OS received a proper DHCP offer, when the container is ready, whether the
container picked up the desired machine goal state, whether the role instance started properly, and so on. If something
unexpected happens, the compute stack decides whether to retry, move the container to another host (assuming the
problem is caused by a malfunctioning physical host), or give up and wait for the engineering team to investigate.
Once the wire server / compute agent is notified of the role instance state, the information is propagated to the upper
layers and finally reflected in the web portal. Assuming the RDP / SSH endpoint is configured properly, the incoming
request is routed by the software load balancer, and then it is up to the OS how to handle it. Many other factors
determine whether RDP / SSH works, including but not limited to firewall rules, service state, etc.
For IaaS VMs, the OS VHD may be uploaded by the customer, so the GA may not be installed and the Hyper-V integration
services may not exist. In this case the compute manager has no way to know what is running inside the guest OS, so it
simply starts the VM using Hyper-V WMI. As long as the VM is healthy from the hypervisor's perspective, even if the
guest OS is stuck somewhere in the boot process, the compute agent will not do anything, since it has no knowledge of
it.
As you can see, many components participate in the VM / container boot process. Most of the time things work amazingly
well. However, sometimes the fabric controller or other parts of the platform have glitches, and sometimes even the
Windows OS hits issues, causing access problems for customers. Personally I have seen a variety of incidents due to
platform quality issues, such as:
- General platform unavailability due to a cluster outage. In this case many VMs are offline.
- A storage incident slows down OS VHD retrieval and, oddly enough, breaks the Base Filtering Engine service during
boot.
- A physical networking issue causes packet drops.
- The networking agent does not program the interface properly, so DHCP discover is blocked and no IP address is
available.
- A missing notification from the compute manager or the regional network controller, so the network manager does not
send the correct goal state to the network agent.
- After a container in a VNET tenant moves to a different host, the DHCP response is incorrectly formed, so the guest
OS DHCP client fails to take the new lease.
Many more issues are caused by the guest OS, such as:
- RDP endpoint misconfigured.
- DHCP disabled.
- Incorrect firewall configuration.
- A mistake in sysprep, so the Windows image stops working after moving to the Azure Hyper-V server.
- A corrupted VHD, so the OS does not even boot to the login screen.
There are also many Windows OS issues that engineers in Azure have to escalate to the Windows team. The number of
options for investigation on the customer side is limited; before contacting customer service, it is worth
double-checking the guest OS items above (RDP endpoint, firewall, DHCP, and so on).
Often a support ticket has to be filed so our engineers can collect detailed information and figure out whether it is
caused by a platform issue. If you suspect a platform issue is the culprit, for instance a deployment stuck in the
starting or provisioning state for a long time, contact customer support. We will check all the way from the frontend
down to the Hyper-V server, covering both the control path and the data path, to mitigate and resolve the incident.
Security and privacy are taken very seriously in Azure: without consent from the customer, no one is able to check the
internal state of the guest VM and/or the attached VHD. In some cases cooperation from customers speeds up the
investigation significantly, because it allows engineers to read Windows events and logs and to perform kernel-mode
debugging to understand exactly what prevents RDP from working.
This is a brief, high-level description of how a VM boots. Later I may discuss a few components in more detail to give
you more insight. Thanks for reading!
14 May 2016
In performance analysis, the first step is often measuring the time span of various operations in the application. It
is important to know the correct timestamps and to understand accurately how much time each step takes. Without the
right data, one may make incorrect decisions and spend resources inappropriately. This just happened in my team while a
slow DHCP response issue was being investigated.
On computers, time is measured by some variant of a system clock, which can be a hardware device that maintains a
simple count of ticks elapsed since a known starting date (a.k.a. the epoch), or a relative time measurement provided
by performance counters in the CPU. In applications, sometimes we want to know the calendar time, or wall clock time,
so we can correlate what happens inside the program with what happens in the real world (e.g. a customer places an
order); sometimes we want to know the time span, i.e. how fast or slow an operation is. We often use a timestamp
retrieved from the system time for the former, and calculate the difference of two timestamps for the latter.
Conceptually this works, but the relative error of the measurement matters.
On the PC motherboard there is a real-time clock (RTC) chip; some people call it the CMOS clock since it is part of the
CMOS. It keeps the date and time, as well as a tiny CMOS RAM that held the BIOS settings in the old days. Even when the
PC is powered off, it keeps running on a built-in battery. Via I/O ports 0x70 and 0x71, one can read or update the
current date / time. Because the cheap oscillator often does not run at exactly the designed frequency, the clock may
drift over time. The OS compensates for this by periodically synchronizing with a time service using NTP; e.g.
time.nist.gov is considered one of the most authoritative time sources.
In the OS, the initial date / time is retrieved from the RTC, and then it is updated periodically, typically 50 - 2000
times per second. The period is called the clock interval. The clock interval can be adjusted by applications; for
instance, multimedia applications and the Chrome browser often set it to 1 ms. A smaller interval has a negative impact
on battery usage, so it is avoided whenever possible. On servers it is normally kept at the default.
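For illustration, this is roughly how an application lowers the clock interval, via the multimedia timer API in
winmm.dll; a minimal sketch, and real applications should keep the request window as short as possible:

    using System;
    using System.Runtime.InteropServices;
    using System.Threading;

    class TimerResolutionDemo
    {
        [DllImport("winmm.dll")] static extern uint timeBeginPeriod(uint uMilliseconds);
        [DllImport("winmm.dll")] static extern uint timeEndPeriod(uint uMilliseconds);

        static void Main()
        {
            // Request a 1 ms clock interval; clockres.exe should now report ~1 ms.
            timeBeginPeriod(1);
            Thread.Sleep(10000);   // observe the effect while the request is active
            timeEndPeriod(1);      // always undo the request: it costs power/battery
        }
    }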
One can query the clock interval using the Sysinternals tool clockres.exe or the Windows built-in program powercfg. On
my desktop the clock interval is 15.625 milliseconds (ms) (a 64 Hz frequency), and the minimum supported interval is
0.5 ms. A 15 ms resolution is sufficient for most real-world events, e.g. when the next TV show starts, but it is
insufficient for time measurements on computers in many cases, in particular for events lasting tens of milliseconds or
less. For instance, if an action starts at some point in clock interval 1 and stops at some point in clock interval 3,
using the system clock you will see a time span of 31 ms, but in reality it can be anywhere from 15.625 to 46.875 ms.
On the other hand, if an action starts and stops within the same clock interval, the system clock will tell you the
duration is 0, but it can be as long as 15 ms. My coworker once said “the request is received at xxx, at the same time
it gets processed …”; sorry, within a single thread two different things do not happen at the same time.
In .NET, the system clock is read using DateTime.UtcNow (see the MSDN doc). It is implemented by calling the Win32 API
GetSystemTimeAsFileTime. Note that although the unit is the tick (100 ns), the actual resolution is the clock interval.
In the old days (Windows XP, or Linux many years ago) people used to read the processor timestamp counter (rdtsc) to
acquire high-resolution timestamps. It is tricky to get right in virtual environments, on multi-core systems, and on
special hardware. Nowadays on Windows the solution is the Win32 API QueryPerformanceCounter (QPC). On modern hardware
(i.e. almost all PCs nowadays), the resolution is less than 1 microsecond (us). On my “cost effective” home PC, the
resolution is about 300 ns, or 3 ticks. For more information on QPC, read the MSDN article here.
In .NET, QPC is wrapped by System.Diagnostics.Stopwatch (see the reference source here). For any time span measurement
this should be considered the default choice.
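A quick way to see the difference in resolution is to measure the same short operation with both approaches. On a
default 15.6 ms clock interval, the DateTime-based result is typically 0 or ~15.6 ms, while Stopwatch reports something
close to the true duration. A minimal sketch:

    using System;
    using System.Diagnostics;

    class ResolutionComparison
    {
        static void Main()
        {
            DateTime wallStart = DateTime.UtcNow;
            Stopwatch sw = Stopwatch.StartNew();

            // Busy-wait roughly 3 ms so the work usually fits inside one clock interval.
            while (sw.Elapsed.TotalMilliseconds < 3) { }

            sw.Stop();
            double wallMs = (DateTime.UtcNow - wallStart).TotalMilliseconds;
            Console.WriteLine("DateTime.UtcNow diff: {0:F3} ms", wallMs);   // often 0 or ~15.625
            Console.WriteLine("Stopwatch elapsed:    {0:F3} ms", sw.Elapsed.TotalMilliseconds);
        }
    }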
Another thing to remember is that Stopwatch / QPC is not synchronized to UTC or to any wall clock time. This means that
if the computer adjusts its clock after synchronizing with a time server, the measurement is not affected – no forward
or backward jump. In fact I once saw a stress test failure caused by a forward clock adjustment: the time sync happened
right when the timeout was evaluated, so the calculated time span was several minutes greater than the actual value.
This kind of bug is hard to notice and investigate, but trivial to fix. Avoid wall clock time for duration measurements
if possible.
In terms of overhead, Stopwatch is more expensive than DateTime.UtcNow, but both take very little CPU time. On my home
PC, a Stopwatch reading takes about 6 ns vs 3 ns for DateTime.UtcNow. Normally this is much shorter than the duration
being measured.
The last question is: if we do need absolute time correlation across multiple computers, is there anything better than
System.DateTime.UtcNow? The answer is yes: point all the computers to the same time source, then use the
GetSystemTimePreciseAsFileTime API. It is supported on Windows 8 / Server 2012 or later. In .NET one needs to use
P/Invoke to call it; a minimal sketch follows.
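This is roughly what that P/Invoke looks like; the 64-bit FILETIME value converts directly to DateTime via
FromFileTimeUtc:

    using System;
    using System.Runtime.InteropServices;

    static class PreciseTime
    {
        // Available on Windows 8 / Server 2012 and later.
        [DllImport("kernel32.dll")]
        static extern void GetSystemTimePreciseAsFileTime(out long fileTime);

        // Returns the current UTC time with sub-microsecond precision.
        public static DateTime UtcNow()
        {
            long fileTime;
            GetSystemTimePreciseAsFileTime(out fileTime);
            return DateTime.FromFileTimeUtc(fileTime);
        }
    }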
30 Apr 2016
Over the weekend I saw a few posts on StackOverflow asking how network traffic is throttled in Azure, how much
bandwidth a VM can use, etc. There are some answers based on measurements or MSDN docs. I hope this post gives a more
complete picture from an engineering perspective.
When we talk about bandwidth throttling, we refer to the network bandwidth cap at the VM / vNIC level for outbound
traffic (the transmit path). Inbound traffic from the public internet goes through the Software Load Balancer, but no
throttling is applied at the VM / host. All network traffic going out of a VM, both billed and unbilled, is throttled
so the bandwidth is limited to a certain number. For instance, a “Small” VM is capped at 500 Mbps. Different VM sizes
have different caps, and some of them are very high.
Then the questions are: if a VM has more than one interface, is the cap shared by all interfaces or divided equally
among them? If the value of the bandwidth cap is updated, will the VM be rebooted? The answer is: it depends. Some time
ago, the network bandwidth was managed by the tenant management component in the FC. Technically, the agent running on
the host sets a bandwidth cap on the VM switch using Hyper-V WMI when creating the VM. If there are multiple
interfaces, the cap is divided by the number of interfaces. If we want to change the bandwidth of an individual VM, or
of all VMs of the same size, the fabric policy file in the cluster has to be updated and the VMs have to be re-created
to apply the new values. Recently we changed the design to let the network management component in the FC handle this
work. The network programming agent communicates with a filter driver (VFP) on the host to create a QoS queue and then
associates all interfaces with that queue, so all interfaces share the same cap. For instance, if a Small VM has two
NICs and the first NIC is idling, the second NIC can use up to 500 Mbps. Basically, the cap now applies to the entire
VM. Some clusters may temporarily not have this feature enabled, but that should be rare.
Another question: since you call it a “cap”, does that mean my VM will get that amount of bandwidth in the best case,
and may get less if neighbors are noisy? The answer is that noisy neighbors do not affect the bandwidth throttling. The
allocation algorithm in the FC knows how much resource exists on each host, including the total network bandwidth, and
container allocation is designed to allow each individual container to use its full capacity. If you absolutely believe
the bandwidth is much less than advertised (note that Linux VMs need the Hyper-V integration components deployed), you
may open a support ticket, and ultimately the engineering team will figure it out. From our side, we can see the values
in the SDN controller as well as the value set on the QoS queue.
In terms of latency and throughput, performance measurements show no statistically significant difference between
QoS-queue-based throttling and VM-switch-based throttling; both perform equally well.
The new design allows seamless and fast bandwidth updates in a cluster – the entire process takes less than half a
minute, with no visible impact on the running VMs. It also leaves room for further enhancements should the upper layers
support them, for instance adjustable bandwidth for the same container size based on customer requirements. I hope all
customers are satisfied with networking in Azure. :-)