Blackboard: Perspectives and thoughts from Zhenhua Yao

Notes on AWS Kinesis Event RCA

Long time no post! I have written many technical analyses internally, but I should probably share more with people who have no access to them. :-)

As a senior engineer working in Azure Core, alerts and incidents are more common than you might think. We (my team and I) strive to build the most reliable cloud in the world and deliver the best customer experience. In reality, like everyone on the battlefield, we make mistakes, we learn from them, and we try not to make the same mistake twice. During this Thanksgiving, we tried everything possible not to disturb the platform, including suspending non-critical changes, and it turned out to be another uneventful weekend, thank goodness. When I heard about the AWS outage in US East, my heart was with the frontline engineers. Later I enjoyed reading the RCA of the outage. The following are my personal takeaways.

The first lesson is: do not make changes during or right before a high-priority event. Most of the time, things do not break if you don't touch them. If your projection says more capacity may be required, carry out the expansion a few days before the important period. Furthermore, even if something does not look quite right, be conservative and be sure not to make it worse while making a “small” fix.

The second lesson is about the monitoring gap; in other words, why the team did not get a high-severity alert indicating the actual source of the problem. Regarding the maximum number of threads being exceeded, or the high-thread-count problem in general, in my observation this isn't a rare event. A few days ago, I was invited to check why a backend service replica did not make any progress. Once I loaded the crash dump in the debugger, it was quite obvious where the problem was: several thousand threads were spin-waiting on a shared resource, which was held by a poorly implemented logging library. The difference in that case was that the team noticed the situation through accurate monitoring and mitigated the problem by failing over the process. If we know the number of threads should not exceed N, we absolutely need to configure the monitor so we know immediately when it goes out of the expected range, and a troubleshooting guide should be linked to the alert so that even a half-awake junior engineer can follow the standard operating procedure to mitigate the issue or escalate (in case the outcome is unexpected). I am glad to read the repair item for this:

We are adding fine-grained alarming for thread consumption in the service…
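In the spirit of that repair item, here is a toy sketch of such a check (not a real production monitor; "MyService" and the threshold N are hypothetical placeholders):

Set-StrictMode -Version Latest

$threshold = 1000   # hypothetical expected upper bound N
# "MyService" stands in for the monitored process name
Get-Process -Name "MyService" | ForEach-Object {
    if ($_.Threads.Count -gt $threshold) {
        # a real monitor would fire a high-severity alert with a TSG link here
        Write-Warning "$($_.ProcessName) (pid $($_.Id)) has $($_.Threads.Count) threads, above $threshold"
    }
}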

In many services here, threadpool worker thread growth can be unbounded in certain rare cases until we receive an alert and fix it. For instance, theoretically the number of threads can go up to 32767, although I've never seen that many (the maximum was about 5000-6000 in the past). In some services, the upper bound is set to a much more conservative number. So I think the following is something we can learn from:

We will also finish testing an increase in thread count limits in our operating system configuration, which we believe will give us significantly more threads per server and give us significant additional safety margin there as well.

In addition, the following caught my attention:

This information is obtained through calls to a microservice vending the membership information, retrieval of configuration information from DynamoDB, and continuous processing of messages from other Kinesis front-end servers … It takes up to an hour for any existing front-end fleet member to learn of new participants.

Maybe the design philosophy in AWS (or at an upper layer) is different from the practice in my org. Service initialization is an important aspect to check for performance and reliability. Usually, we try not to take dependencies on layers above us, or to assume certain components must be operating properly in order to bootstrap the service replica in question. The duration of service initialization is measured and tracked over time. If the initialization takes too long to complete, we will be called. The majority of control plane services take seconds to be ready; unfortunately, the outliers hit high-severity incidents. A few days ago a service in a busy datacenter took about 9 minutes to get online (from process creation to ready to process incoming requests), which was such a pain during an outage. In my opinion, a fundamental improvement has to be made to fix this, and the following is on the right track:

…to radically improve the cold-start time for the front-end fleet…Cellularization is an approach we use to isolate the effects of failure within a service, and to keep the components of the service (in this case, the shard-map cache) operating within a previously tested and operated range…
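Back to the initialization-time point above: measuring and tracking startup duration can be as simple as wrapping the bootstrap path with a stopwatch. A minimal sketch, assuming a hypothetical Initialize-Service entry point and a 30-second budget:

$sw = [System.Diagnostics.Stopwatch]::StartNew()
Initialize-Service   # hypothetical bootstrap entry point of the replica
$sw.Stop()
Write-Output ("Initialization took {0:n1}s" -f $sw.Elapsed.TotalSeconds)
if ($sw.Elapsed.TotalSeconds -gt 30) {   # hypothetical budget
    # a real system would emit a metric here and page the owning team on regression
    Write-Warning "Initialization exceeded the expected budget"
}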

Usually we favor scaling out a service instead of scaling up; however, the following is spot on in this context:

we will be moving to larger CPU and memory servers, reducing the total number of servers and, hence, threads required by each server to communicate across the fleet. This will provide significant headroom in thread count used as the total threads each server must maintain is directly proportional to the number of servers in the fleet…

Finally, kudos to the Kinesis team for mitigating the issue and finding the root cause. I greatly appreciate the detailed RCA report, which will benefit everyone working in cloud computing!

How to Retrieve Internet Cookies Programmatically

Using the cookies stored by a website in a script is a nice trick to reuse existing authentication to access a web service, etc. There are several ways to retrieve the cookies from IE / Edge; the most convenient way is to directly read the files on the local disk. Basically, we can use the Shell.Application COM object to locate the cookies folder, then parse all text files for the needed information. In each file there are several records delimited by a line containing the single character *; in each record, the first line is the name, the second line the value, and the third line the host name of the website that set the cookie. Here is a simple PowerShell program to retrieve and print all cookies:

Set-StrictMode -version latest

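# Locate the cookies special folder via the Shell.Application COM object (0x21 is ssfCOOKIES)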
$shellApp = New-Object -ComObject Shell.Application
$cookieFolder = $shellApp.NameSpace(0x21)
if ($cookieFolder.Title -ne "INetCookies") {
    throw "Failed to find INetCookies folder"
}

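# Cookie files appear as plain text documents in this folder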
$allCookies = $cookieFolder.Items() | ? { $_.Type -eq "Text Document" } | % { $_.Path }
foreach ($cookie in $allCookies) {
    Write-Output "Cookie $cookie"
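    # Records are delimited by a line containing the single character '*'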
    $items = (Get-Content -Raw $cookie) -Split "\*`n"
    foreach ($item in $items) {
        if ([string]::IsNullOrEmpty($item.Trim())) {
            continue
        }
        $c = $item -Split "\s+"
        Write-Output "  Host $($c[2])"
        Write-Output "  $($c[0]) = $($c[1])"
    }
}

Note that files in %LOCALAPPDATA%\Microsoft\Windows\INetCookies\Low do not show up in the $cookieFolder.Items() list. An alternative approach is to browse the file system directly, e.g.

    gci -File -Force -r ((New-Object -ComObject Shell.Application).Namespace(0x21).Self.Path)

Primary Tracker

I believe in the KISS principle. Albert Einstein said:

Make everything as simple as possible, but not simpler.

This is one of the guiding principles that I follow in all designs and implementations. Needless to say, this principle is particularly important in cloud computing. Simplicity makes it easier to reason about system behavior and drive code defects down to zero. For critical components, decent performance and reliability are the two attributes that let you sleep well at night. The Primary Tracker in the networking control plane is a good example to explain this topic.

Basic layering of fabric controller

The Fabric Controller (FC) is the operational center of the Azure platform. FC gets customer orders from the Red Dog Front End (RDFE) and/or its modern replacement, Azure Resource Manager (ARM), and then performs all the heavy-lifting work such as hardware management, resource inventory management, provisioning and commanding tenants / virtual machines (VMs), monitoring, etc. It is a “distributed stateful application distributed across data center nodes and fault domains”.

The three most important roles of FC are the data center manager (DCM), tenant manager (TM), and network manager (NM). They manage the three key aspects of the platform, i.e., data center hardware, compute, and networking. In production, FC role instances run with 5 update domains (UDs).

FC layering

The number of UDs is different in test clusters. Among all UDs, one is elected as the primary controller; all others are considered backups. The election of the primary is based on the Paxos algorithm. If the primary role instance fails, the remaining backup replicas vote for a new primary, which resumes the operation. As long as there are 3 or more replicas, a quorum can be reached and FC operates normally; for example, with 5 replicas a quorum needs 3 votes, so up to 2 replicas can fail.

In the above diagram, different nodes communicate with each other and form a ring via the bottom layer, RSL. On top of it is a layer of cluster framework, libraries, and utilities, collectively called CFX. Via CFX and RSL, a storage cluster management service is provided, where the In-Memory Object Store (IMOS) is served. Each FC role defines several data models living in IMOS, which is used to persist the state of the role.

Note that the eventual consistency model is not used in FC as far as a single role is concerned. In fact, the strong consistency model is used to add a safety guarantee (read the Data Consistency Primer for more information on consistency models). Whether this model is best for FC is debatable; I may explain more in a separate post later.

Primary tracker

Clients from outside a cluster communicate with FC via a Virtual IP address (VIP), and the software load balancer (SLB) routes the request to the node where the primary replica is located. In the event of a primary failover, SLB ensures the traffic to the VIP always (or eventually) reaches the new primary. For performance reasons, communication among FC roles does not go through the VIP but through Dynamic IP addresses (DIPs) directly. Note that the primary of one role is often different from the primary of another role, although sometimes they can be the same. Then the question is: where is the primary? A wrong answer to this question has the same effect as service unavailability.

This is why we have the Primary Tracker. Basically, the primary tracker keeps track of the IP address of the primary replica and maintains a WCF channel factory to ensure requests to the role can be made reliably. The job is as simple as finding the primary, and re-finding the primary if the old one fails over.

The storage cluster management service provides an interface such that, once connected to any replica, it can tell where the primary is, as long as the replica serving the request is not disconnected from the ring. Obviously this is a basic operation of any leader election algorithm, nothing mysterious. So the primary tracker sounds trivial.

In the Azure environment there are a few more factors to consider. The primary tracker object can be shared by multiple threads when many requests are processed concurrently. A WCF client channel cannot be shared among multiple threads reliably, and re-creating the channel factory is too expensive. Having too many concurrent requests may be a concern for the health of the target service. So it is necessary to maintain a shared channel factory and perform request throttling (again, this is debatable).
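As a toy illustration of the throttling part (the in-flight limit and the function name are hypothetical, not the actual implementation), a semaphore is enough to cap concurrent requests:

# allow at most 64 in-flight requests to the primary (hypothetical limit)
$throttle = [System.Threading.SemaphoreSlim]::new(64, 64)

function Invoke-WithThrottle([scriptblock] $Request) {
    if (-not $throttle.Wait(5000)) {   # wait up to 5s for a slot
        throw "Too many concurrent requests to the primary"
    }
    try     { & $Request }
    finally { [void]$throttle.Release() }
}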

Still, this does not sound complicated. In fact, with proper componentization and decoupling, many problems can be modeled in a simple way. Therefore, we had an almost-working implementation, and it had been in operation for a while.

Use cases

From the perspective of the networking control plane, two important use cases of the primary tracker are:

  • Serving tenant command and control requests from TM to NM.
  • Serving VM DHCP requests from DCM to NM.

The load in both cases depends on how busy a cluster is, for instance whether customers are starting many new deployments or stopping existing ones.

Problems

Although the old primary tracker worked, it often gave us headaches. Sometimes customers complained that starting VMs took a long time or even got stuck, and we root-caused the issue to unresponsive DHCP requests. Occasionally a whole cluster was unhealthy because DHCP stopped, and no new deployment could start because the start container failed repeatedly and pushed physical blades into the Human Investigate (HI) state. Eventually the problem happened more often, to a frequency of more than once per week, and the DRI on rotation got nervous since they did not know when the phone would wake them up after going to bed.

Then we improved monitoring and alerting in this area to collect more data and, more importantly, to get notified as soon as a failure occurred. This gave us the right assessment of the severity but did not solve the problem itself. With careful inspection of the log traces, we found that a failover of the primary replica would cause the primary tracker to lose contact with any primary for an indefinite amount of time, anywhere from minutes to hours.

Analysis

During one Sev-2 incident investigation, a live dump of the host process of the primary tracker was taken. The state of the object as well as all threads were analyzed, and the conclusion was astonishingly simple: there was a prolonged race condition triggered by the channel factory disposal upon the primary failover, after which all the threads accessing the shared object started an endless fight with each other. I will not repeat the tedious process of the analysis here; basically, it was backtracking from the snapshot of 21 threads to the failure point with the help of log traces, nothing really exciting.

Once we had the conclusion, the evidence in the source code became obvious. The ironic part is that the first line of the comment said:

This class is not designed to be thread safe.

But in reality the primary use case is in a multi-threaded environment. And the red flag is that the shared state is mutated by multiple threads without proper synchronization.

Fix

Strictly speaking, the bugfix is a rewrite of the class with the existing behavior preserved. As one can imagine, it is not a complicated component: the core design uses a reader-writer lock, specifically the ReaderWriterLockSlim class (see the reference source here). In addition, a concept of generation is introduced for the shared channel factory in order to prevent different threads from finding the new primary multiple times after a failover.
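To illustrate the idea (a sketch under my own naming, not the actual class), the generation number makes the re-resolve step idempotent: every caller remembers which generation it was using, and only the first caller that reports a failure against the current generation pays the cost of finding the new primary. Since PowerShell can use .NET types directly, the sketch builds on ReaderWriterLockSlim:

class PrimaryEndpointCache {
    hidden [System.Threading.ReaderWriterLockSlim] $lock
    hidden [long] $generation       # bumped each time a new primary is resolved
    hidden [string] $primary        # stands in for the shared channel factory

    PrimaryEndpointCache() {
        $this.lock = [System.Threading.ReaderWriterLockSlim]::new()
        $this.generation = 0
        $this.primary = $null
    }

    # Many threads can read the cached primary concurrently.
    [object] GetPrimary() {
        $this.lock.EnterReadLock()
        try {
            return [pscustomobject]@{ Generation = $this.generation; Address = $this.primary }
        } finally {
            $this.lock.ExitReadLock()
        }
    }

    # A thread that hit a failure passes in the generation it observed.
    # Only the first such thread re-resolves; late arrivals see that the
    # generation has already moved on and simply retry with the new primary.
    [void] RefreshPrimary([long] $observedGeneration, [scriptblock] $resolvePrimary) {
        $this.lock.EnterWriteLock()
        try {
            if ($this.generation -eq $observedGeneration) {
                $this.primary = & $resolvePrimary
                $this.generation++
            }
        } finally {
            $this.lock.ExitWriteLock()
        }
    }
}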

Stress test

The best way to check reliability is to run a stress test with as much load as possible. Since the new implementation is backward compatible with the old one, it is straightforward to conduct a comparative study. The simulated stress environment has many threads sending requests continuously, the artificial primary failover occurs much more often than in any production cluster, and furthermore the communication channel is injected with random faults and delays. It is a harsh environment for this component.

It turns out the old implementation breaks down within 8 minutes, with exactly the failure pattern observed in production clusters. In contrast, the new implementation has not failed so far.

Performance measurement

Although the component is perf sensitive, it has no regular perf testing. A one-time perf measurement conducted in the past showed that the maximum load it was able to handle was around 150 to 200 requests/sec in a test cluster. This number is more than twice the peak traffic in a production cluster under normal operational conditions, according to live instrumentation data. Is it good enough? Different people have different opinions. My principle is to design for the worst scenario and ensure the extreme case is covered.

As a part of the bugfix work, a new perf test program was added to measure both the throughput and the latency of the system. The result shows that the new component is able to process about ten times the load, and the per-request overhead is less than one millisecond. After tuning a few parameters (which are a bit different from the production setup), the throughput increased further by about 30-40%.
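For reference, the shape of such a measurement is simple. A minimal sketch (the request body is a stand-in for a call through the tracker; numbers are illustrative):

$n = 1000
$request = { Start-Sleep -Milliseconds 1 }   # stand-in for a tracked request
$latencies = New-Object 'System.Collections.Generic.List[double]'
$sw = [System.Diagnostics.Stopwatch]::new()
$total = [System.Diagnostics.Stopwatch]::StartNew()
for ($i = 0; $i -lt $n; $i++) {
    $sw.Restart()
    & $request
    $sw.Stop()
    $latencies.Add($sw.Elapsed.TotalMilliseconds)
}
$total.Stop()
$sorted = @($latencies | Sort-Object)
"{0:n0} req/s, p50 {1:n2} ms, p99 {2:n2} ms" -f ($n / $total.Elapsed.TotalSeconds), $sorted[[int](0.50 * ($n - 1))], $sorted[[int](0.99 * ($n - 1))]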

Rollout

Despite the fear of a severe incident caused by changing a critical component, with the proof of functional / perf / stress test data, the newly designed primary tracker has been rolled out to all production clusters. Finally, the repeated incidents caused by primary tracking failures no longer wake up DRIs during the night. From the customers' perspective, this means fewer VM start failures and shorter VM boot times.

How Do We Deal With Flaky Tests

Today I read an article on the Google Testing Blog, “Flaky Tests at Google and How We Mitigate Them”, and would like to share my thoughts.

Flaky tests are not unheard of in a large software project, particularly if test cases are owned by developers with varying levels of experience. People hate flaky tests as much as failed tests: reruns take more resources, false alarms waste precious dev resources, and oftentimes people tend to ignore them or disable them entirely. Personally, I do not agree with the approach in the blog; it is simply not a quality-driven culture.

My opinion has always been that the Heisenberg uncertainty principle plays no role in software development: any “indeterministic” behavior can be traced back to a quality issue, and the number of flaky tests should be driven down to zero.

In my past observation, much of the flakiness is caused by test code issues. There is no test code for the test code, and people may not apply the same level of quality awareness as they do to the product code. Besides unsafe threading, race conditions, lack of synchronization, etc., there are common anti-patterns causing flakiness (only what I can think of at the moment):

  • Checking driven by timeout instead of event: for instance, click a button on the UI, wait for 5 seconds, then click the next button (see the sketch after this list).
  • Unawareness of async events: for instance, load a web page and wait for it to finish by checking if a certain tag is found, then proceed to submit the form. But the page actually has an iframe which has to complete loading first.
  • Incorrect assumptions about the runtime condition. There are too many such examples. In one case, a P1 test case owned by my team fails over the primary replica of the controller, then waits for it to come back by checking the primary IP address reported by the storage cluster management (SCM). Unfortunately the check is incorrect, because only the layer above SCM can reliably tell if the new primary is up.
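For the first anti-pattern, the fix is to wait on the actual signal with a deadline rather than sleeping a fixed time. A minimal sketch (Wait-Condition is a hypothetical helper, not an existing cmdlet):

function Wait-Condition {
    param(
        [Parameter(Mandatory)] [scriptblock] $Condition,
        [int] $TimeoutSeconds = 30,
        [int] $PollMilliseconds = 200
    )
    $deadline = [DateTime]::UtcNow.AddSeconds($TimeoutSeconds)
    while ([DateTime]::UtcNow -lt $deadline) {
        if (& $Condition) { return }
        Start-Sleep -Milliseconds $PollMilliseconds
    }
    throw "Condition not met within $TimeoutSeconds seconds"
}

# Flaky:  click, sleep 5 seconds, hope the UI is ready, click again.
# Better: click, then wait on the actual readiness signal, e.g.
#         Wait-Condition { $page.SubmitButton.Enabled }   # $page is hypothetical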

Besides test code bugs, real product issues may also cause flakiness in test execution. This is particularly dangerous in a cloud environment, since massive scale magnifies the probability of hitting the real issue in production. We sometimes say that if something bad has even a small chance of happening, it will happen after deployment.

Most of the time, driving down flaky tests requires the right mindset and the right prioritization from the leadership team. As long as flaky tests and failed tests are treated as rigorously as product bugs and live-site incidents in terms of priority and resource assignment, nothing cannot be fixed. As the tests become more reliable and more proven patterns are adopted, the improved CI experience will benefit everyone from ICs to leadership. One should not underestimate the ROI of driving for better quality.

Create Process Dump File

In a cloud environment we mostly rely on log traces to understand what is happening inside a program and to investigate abnormal behaviors when things do not work as expected. However, sometimes we need deep insight into the process internals via debugging. When live debugging is not possible, we have to capture a user-mode process dump and conduct post-mortem analysis. For instance, a memory leak or an unusual increase in memory usage is such a case, where objects on the heap need to be examined closely.

The most common way to create a dump file is to use Windows Task Manager. One can open Task Manager, click the “Processes” tab on Windows 7 or the “Details” tab on Windows 8/10, then right-click the name of the process and click “Create Dump File”. Once the dump file is created, it is saved in the %TEMP% directory, which is usually the \Users\UserName\AppData\Local\Temp directory on the system drive. The size of the dump file is roughly the number of virtual bytes of the process.

The downside of this method is that the location of the dump file cannot be specified. In addition, one cannot choose whether to create a minidump (only thread and handle information) or a full dump (all process memory).

In some cases, this can be a severe issue. In Azure data centers, free space on the system drive is extremely limited (oftentimes less than 15 GB). Personally, I have seen (via post-mortem analysis) dump file creation exhaust the disk space and make the OS unusable while responding to a live-site incident, which made the situation worse by adding a second incident.
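Given how tight the system drive is, a simple guard before capturing a full dump avoids trading one incident for two. A sketch (the 20 GB margin is an arbitrary example):

$drive = Get-PSDrive -Name C
if ($drive.Free -lt 20GB) {   # hypothetical safety margin
    throw "Only $([math]::Round($drive.Free / 1GB, 1)) GB free; not enough for a full dump"
}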

A better way to create a dump file is to use ProcDump or AdPlus (a part of WinDbg). An example of creating a full dump is:

procdump -ma MyProcess c:\temp\myprocess.dmp
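ProcDump can also wait for a trigger instead of dumping immediately; for example (per my reading of the ProcDump documentation: -e dumps on an unhandled exception, -c/-s on sustained CPU, -n caps the number of dumps):

procdump -ma -e MyProcess c:\temp\crash.dmp
procdump -ma -c 90 -s 10 -n 3 MyProcess c:\temp\cpu.dmp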

ProcDump is written by Mark Russinovich, a Microsoft Technical Fellow. It is very small in size; one can visit the TechNet page to download it. If a GUI is preferred, I strongly recommend Process Explorer, a Task Manager replacement by the same author.