Blackboard Perspectives and thoughts from Zhenhua Yao

Persistence in Stateful Applications

In cloud computing, we build highly available applications on commodity hardware. The software SLA is typically higher thn the underlying hardware for an order or more. This is achieved by distributed application based on state machine replication. If strong consistency is required, state persistence based on Paxos algorithm is often used. Depending on the requirements on layering, latency, availability, failure model, and other factors, there are several solutions available.

Cosmos DB or Azure SQL Database

Most apps build on top of core Azure platform can take dependency of Cosmos DB or Azure SQL Database. Both are easier to use and integrate with existing apps. This is often the most viable path with the least resistence, particularly Cosmos DB with excellent availability, scalability, and performance.

If you are looking for lowest latency possible, the state is better to be persisted locally and cache inside the process. In this case, remote persistence such as Cosmos DB may not be desirable. For services within the platform below Cosmos DB, this approach may not be viable.


Although not many people have noticed it, Replicated State Library is one of the greatest contributions to OSS from Microsoft. It is a verified and well tested Paxos implmentation, which has been in production for many years. RSL has been the core layer to power the Azure core control plane since the beginning. The version released on GitHub is the one used in the product as of now. Personally I am not aware of other implementation with greater scale, performance, and reliability (in term of bugs) on Windows platform. If you have to store 100 GBs of data with strong consistency in a single ring, RSL is well capable of doing the job.

Note that it is for Windows platforms only, both native and managed code is supported. I guess it is possible to port it to Linux, however no one has looked into it and no plan to do so.

In-Memory Object Store (IMOS) is a proprietary managed code on top of RSL to provide transaction semantics, strong-typed object, object collections, relationships, and code-generation from UML class diagrams. Although the performance and scale are sacrificed somewhat, it is widely used because of convenience and productivity.

Service Fabric Reliable Collections

RSL and IMOS are often used by “monolithic” distributed applications before Service Fabric is widely adopted. SF is a great platform to build scalable and reliable microservices, in particular stateful services. Hosting RSL on SF isn’t impossible but it is far from straightforward. At least, the primary election in RSL is totally independent of SF, you’d better ensure both are consistent via some trick. In addition, SF may move the replicas around any time, and this must be coordinated with RSL dynamic replica set reconfiguration. Therefore, the most common approach is to use SF reliable collections in the stateful application as recommended. Over time, this approach will be the mainstream in the foundational layer.

Ring Master

If you need distributed synchorinization and are not satisfied with ZooKeeper because of its scale, or you want native SF integration, then you should consider adopting Ring Master which is released to open source. Essentially Ring Master provides a superset of ZooKeeper semantics. This is the core component supporting the goal state delivery in several mission-critical foundational services in the platform. The persistence layer can be replaced, the released source code supports SF reliable collections for production use and in-memory for testing. If you want absolute best performance and scale, considering persist to RSL.

If you have any question or comments, please leave a message in the discussion. Thanks!

Build Systems Through My Eyes

Before joining Microsoft, I worked on Linux almost all the time for years. Similar as most other projects, we used shell scripts to automate the build process, and GNU automake / autoconf were the main toolset. Occasionally CMake was used to handle some components where necessary. In Windows team, I witnessed how to build enormouse amount of code consistently and reliably using sophicated in-house software. In this note, a few build systems that I used in the past are discussed to share some learnings.

Why do we need a “build system”?

A simple hello world program or school project doesn’t need a build system. Load it to your favorite IDE and run it. If it works, congrats. If not, it is your responsibility to fix it. Obiviously this logic won’t fly for any projects shared by multiple people. Windows SDK and Visual Studio don’t really tell us how to deal with large number of projects in an automated and reliable manner. NMake is the counterpart of Makefile and able to do the job to some extend. However, honestly I haven’t seen anyone using it directly because of the complexity level at large scale. We need a layer on top of SDK and VS toolset to automate the entire build process for both developers and lab builds, and the process must be reliable and repeatable. For Windows, reproducibility is critical. Imagine you have to fix a customer reported issue on a version released long time back, it would be unthinkable if you could not produce the same set of binaries as the build machines did previously in order to debug. By the way, all build systems are command line based since no one will glare at their monitor for hours, no fancy UI is needed.

Razzle and Source Depot

Razzle is the first production-quality build system I used. Essentially it is a collection of command line tools and environment variables to run build.exe for regular dev build and timebuild for lab builld. At the start of day, a command prompt is opened, razzle.cmd is invoked, which performs some validation to the host environment, sets up environment variables and presents a command prompt for conducting subsequent work for the day.

In Razzle, everything is checked into the source repository. Here “everything” is literally everything including source code, compilers, SDK, all external dependencies, libraries, and all binaries needed for the build process. Outside of build servers, no one checks out everything on their dev machine which could be near or at TB. Working enlistment is a partial checkout at tens of GB level. Because of the outrageous requirement on the scale, an in-house source repository called Source Depot (rumor said it was based off Perforce with needed improvement, not sure the accuracy though) is used, and a federation of SD servers is used to support the Windows code base. On top of sd.exe, there is a batch script called sdx.cmd to coordinate the common operations across multiple SD servers. For instance, instead of using “sd sync”, we used to run “sdx sync” to pull down the latest checkine. Some years later, in order to modernize the dev environment git replaced SD, which I have no hands-on experience.

Razzle deeply influenced other build systems down the line. Even now, people used to type “build” or even “bcz” even if the latter is not really meaningful in the contemporary build systems. One of the great advantages of Razzle is its reproducibility and total independence. Because everything is stored in SD, if you want to reproduce an old build, you just check out the version at the required changeset, type “build” and eventually you will get the precise build required by the work, other than timestamp, etc. In practicality, with a clean installed OS, run the enlistment script on a network share which in turn calls sd to download the source code (equivalent to “git clone”), then you have the fully working enlistment, nothing else is needed (assuming you are fine with editing C++ code with notepad).

Instead of Makefile or MSBuild project files, dirs files are used for directory traversal, sources files are used to build the individual project. An imaginary sources file is like the following (illustration purpose only):

UMTYPE      = console




SOURCES     = \
    hello.cpp \

    $(SDK_LIB_PATH)\kernel32.lib \
    $(PROJROOT)\world\win32\$(O)\world.lib \

Invocation of underlying ntbuild will carry out several build passes to run various tasks, such as preprocessing, midl, compile, linking, etc. There are also postbuild tasks to handle code signing, instrumentation, code analysis, localization, etc. Publish/consume mechanism is used to handle the dependencies among projects, so it is possible to enlist a small subset of projects and build without missing dependencies.

Coming from Linux world, I didn’t find it too troublesome using another set of command line tools, other than missing cygwin and VIM. However, for people who loved Visual Studio and GUI tools, this seemed to be a unproductive environment. Additionally, you cannot easily use Razzle for projects outside Windows.


After moving out of Windows, I came to know CoreXT in an enterprise software project. Initially as a Razzle clone, it is believed to be a community project maintained by passionate build engineers inside Microsoft (by the way I have never been a build engineer). It is widely used in Office, SQL, Azure, and many organizations even today. Six years ago, Azure projects were based on CoreXT and followed similar approach as Windows on Razzle: everything stored in SD, dirs/sources on top of ntbuild, timebuild to produce nightly build, etc. The main difference was each service had its own enlistment project, just like a miniature of Windows code base. Inter-service dependencies were handled by copying files around. For instance, if project B had to use some libraries generated by project A, project A would export those files, and project B would import them by adding them to SD. For projects based on managed code (most are), msbuild instead of NTBuild was used for convenience.

At the time, the dev experience on CoreXT was not too bad. It inherited all the goodness of Razzle. But it was still a bit heavyweight. Even you only had tens of MB in source code, the build environment and external dependencies would still be north of ten GBs in size. Young engineers considered it as dinosour environment, which was hard to argue if comparing with open source toolset. The supportibility of Visual Studio IDE was via csproj files (used by both build and IDE) and sln files (used by IDE only).

Five years ago, people started to modernize the dev environment. The first thing was move from SD to git. Without LFS, it is impractical to store much data in git. At least, 1 GB was considered as acceptable upbound at the time. So we had to forget about the practice of checking in everything and started to reduce the repo size dramatically. But Windows SDK alone was already well over 1 GB, how to handle the storage issue without sacrifising reproducibility? The solution was to leverage NuGet. Essentially, besides corext bootstrapper (very small) and source code everything was wrapped into NuGet packages. This solution has been lasted until today.

Most projects have its own git repository. Under the root directory, init.cmd is the replacement of Razzle.cmd, it invokes corext bootstrapper to setup the enlistment environment. Similarly as Razzle, it is still a command prompt with environment variables and command aliases. corext.config under .corext is similar as nuget.config, which contains the list of NuGet feeds (on-premises network shares in the past, ADO nowadays) and list of packages. All packages are downloaded and extracted into CoreXT cache directory. MSBuild project files are modified to use the toolset in the cache directory, such as:

<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="15.0" DefaultTargets="Build" xmlns="">
  <Import Project="$(EnvironmentConfig)" />
  <Import Project="$(ExtendedTargetsPath)\Microsoft.CSharp.targets" />

Here the trick is EnvironmentConfig is a environment variable pointing to a MSBuild props file in CoreXT cache, this file bootstraps everything after that. With that, when build alias is invoked, MSBuild program is called, then compilers and build tools in CoreXT cache are used, instead of the one installed on the host machine.

In theory, the entire process relies on nothing but the files in CoreXT cache. One does not need to install Visual Studio or any developer tools on their computers. In practice, occasionally some packages reference files outside of the cache and assume certain software to be installed. However, that is rather an exception than a norm.

For developers, we use Visual Studio or VS Code to browse code, write code, build and debug. A tool is provided to generate solution file from a set of MSBuild project files (csproj and dirs.proj). Then the solution is loaded in IDE.

Dependencies among projects are handled by NuGet packages. During official build, we can choose whether or not to publish packages into feeds on ADO. Other projects simply add the <package .../> in corext.config file should they want to consume any packages.

So far most projects and most people in my org are still using CoreXT in this form. It is used by engineers during daily development, by build machines in the lab, by distributed build in the cloud, and everywhere we want it to be. Other than compiling source code and building product binaries, it also carries out various other tasks, including but not limited to static code analysis, policy check, VHD generation, NuGet package creation, making app package suitable for deployment, publishing symbols, etc.

CBT and Retail MSBuild

Again, CoreXT is considered to be modern day relic. People use it because they have to. In particular, it is highly desireable to have seamless integration with Visual Studio and ability to consume latest technology from dotnet core. Before MSBuild becomes more capable, Common Build Toolset (CBT) was developed as a GitHub project to fulfill this requirement. This is a lightweight framework to provide consistent “git clone + msbuild” experience to codebase using it. One of additional advantage is it is open source, for the internal projects that need to sync to github periodically, no more duplicate build systems (one for internal build, one for public) is needed.

Using CBT is extremely simple from dev perspective. No internal machinary whatsoever. Just clone the repo and open it using Visual Studio. Adding new project is also straightforward, no more need to perform brain surgery to csproj files like CoreXT. The downside is obvious, you must install essential build tools such as VS. Reproducibility isn’t strictly guaranteed as far as I can. After all, the VS developer command prompt is used. For most Azure teams, this may not be a concern since things move so fast, I haven’t met anyone who complains they cannot reproduce the build one year ago for serving old version of their service.

CBT is somewhat short-lived. For some people, by the time they come to know the migration path from CoreXT to CBT, it is already deprecated. The latest shiny framework on the street is Retail MSBuild. :-) It works similarly as CBT but even more lightweight. With this framework, engineering teams are able to use Visual Studio and retail Microsoft toolset in their most natural way. In CoreXT, people have to spend a lot of time for any new technology because the framework intentionally works differently. Personally I’ve spent many hours to make dotnet core working in my team, some other components might be worse. With retail MSBuild, everything just works with plain simple SDK style project files with PackageReference. Precious resource can be spent on real work, we are not rewarded for reinventing the wheel (and possibly a worse one) anyway.


Other than the most popular ones aforementioned, some teams write their own framework for meeting their unique requirement. For instance, several years ago a team needed a high-performance build integrating with VSTS build defitions with minimal overhead, so a thin wrapper was built on top of collection of project files and batch scripts. In RingMaster I had to write my own version of build framework because internal proprietary build system could not be released because of approval process, project would not build without one similar to CoreXT, and no other alternative was available (CBT did not exist at the time). At the end, the projects were migrated to SDK-style to make this work easier.

In the future, I look forward to retail MSBuild being adopted more widely and internal build systems going away eventually. I love open source down to my heart. :-)

Notes on AWS Kinesis Event RCA

Long time no post! I have written many technical analysis internally but probably I should share more with people who have no access to that. :-)

As a senior engineer working in Azure Core, alerts and incidents are more common than what you might think. We (me and my team) strive to build the most reliable cloud in the world and deliver the best customer experience. In reality, like everyone in the battle field we make mistakes, we learn from them, and try not to make the same mistake twice. During this Thanksgiving, we tried everything possible to not disturb the platform including suspending non-critical changes, and it turned out be another non-eventful weekend thank goodness. When I heard the AWS outage in US east, my heart was with the frontline engineers. Later I enjoyed reading the RCA of the outage. The following is my personal takeaways.

The first lesson is do not make changes during or right before high-priority event. Most of time stuffs do not break if you don’t touch them. If your projection is more capacity may be required, carry out the expansion a few days prior to the important period of time. Furthermore, even if something does not look quite right, be conservative and be sure to not make it worse while making a “small” fix.

The second lesson is monitoring gap, in other words why did the team not get the high-severity alert to indicate the actual source of the problem. Regarding the maximum number of threads being exceeded, or the high number of threads problem, actually in my observation this isn’t a rare event. A few days ago, I was invited to check why a backend service replica did not make any progress. Once loading the crash dump in the debugger, it was quite obivious where the problem is – several thousands of threads were spin-waiting a shared resource, which was held by a poorly implemented logging library. The difference in this case is the team did notice the situation by accurate monitoring and mitigated the problem by failing over the process. If we know the number of threads should not exceed N, we absolutely need to configure the monitor to know it immediately if it goes out of the expected range, and a troubleshooting guide should be linked with the alert so even the half-waken junior engineers are able to follow the standard operation procedure to mitigate the issue or escalate (in case the outcome is unexpected). I am glad to read the repair item for this:

We are adding fine-grained alarming for thread consumption in the service…

In many services here, the threadpool worker thread growth can be unbounded under certain rare cases until we receive alert to fix it. For instances, theorectically the number of threads can go up to 32767 although I’ve never seen that many (maximum is about 5000-6000 in the past). In some services, the upbound is set to a much conservative number. So I think the following is something we can learn:

We will also finish testing an increase in thread count limits in our operating system configuration, which we believe will give us significantly more threads per server and give us significant additional safety margin there as well.

In addition, the following caught my attention:

This information is obtained through calls to a microservice vending the membership information, retrieval of configuration information from DynamoDB, and continuous processing of messages from other Kinesis front-end servers … It takes up to an hour for any existing front-end fleet member to learn of new participants.

Maybe the design philosophy in AWS (or upper layer) is different from the practice in my org. The service initialization is an important aspect to check the performance and reliability. Usually, we try not take dependency from layers above us, or assume certain components must be operating properly in order to bootstrap the service replica in question. The duration of service initialization will be measured and tracked over time. If the initialization takes too long to complete, we will be called. The majority of services in the control plane services takes seconds to be ready, outliers hit high-severity incidents unfortunately. A few days ago a service in a busy datacenter took about 9 minutes to get online (from process creation to ready to process incoming requests), that was such a pain during outage. In my opinion, fundamental improvement has to be performed to fix this, and the following is on the right track:

…to radically improve the cold-start time for the front-end fleet…Cellularization is an approach we use to isolate the effects of failure within a service, and to keep the components of the service (in this case, the shard-map cache) operating within a previously tested and operated range…

Usually we favor scale-out the service insteading scale-up, however the following is right on spot in this context:

we will be moving to larger CPU and memory servers, reducing the total number of servers and, hence, threads required by each server to communicate across the fleet. This will provide significant headroom in thread count used as the total threads each server must maintain is directly proportional to the number of servers in the fleet…

Finally, kudos to Kinesis team on mitigating the issue and finding the root cause. I greatly appreciate the detailed RCA report which will benefit everyone working in cloud computing!

How to Retrive Internet Cookies Programmatically

Using the cookies stored by the website in the script is a nice trick to use the existing authentication to access the web service, etc. There are several ways to retrieve the cookies from IE / Edge, the most convenient way is to directly read the files on the local disk. Basically we can use the Shell.Application COM object to locate the cookies folder, then parse all text files for the needed information. In each file, there are several records delimited by a line of single character *, in each record the first line is the name, second line the value, third line the host name of website that sets the cookie. Here is a simple PowerShell program to retrieve and print all cookies:

Set-StrictMode -version latest

$shellApp = New-Object -ComObject Shell.Application
$cookieFolder = $shellApp.NameSpace(0x21)
if ($cookieFolder.Title -ne "INetCookies") {
    throw "Failed to find INetCookies folder"

$allCookies = $cookieFolder.Items() | ? { $_.Type -eq "Text Document" } | % { $_.Path }
foreach ($cookie in $allCookies) {
    Write-Output "Cookie $cookie"
    $items = (Get-Content -Raw $cookie) -Split "\*`n"
    foreach ($item in $items) {
        if ([string]::IsNullOrEmpty($item.Trim())) {
        $c = $item -Split "\s+"
        Write-Output "  Host $($c[2])"
        Write-Output "  $($c[0]) = $($c[1])"

Note that files in %LOCALAPPDATA%\Microsoft\Windows\INetCookies\Low do not show up in $cookieFolder.Items() list. An alternative approach is to browse the file system directly, e.g.

    gci -File -Force -r ((New-Object -ComObject Shell.Application).Namespace(0x21).Self.Path)

Primary Tracker

I believe in KISS principle. Albert Einstein said:

Make everything as simple as possible, but not simpler.

This is one of guiding prinicples that I follow in every designs and implementations. Needless to say, this principle is particularly important in cloud computing. Simplicity makes it easier to reason about the system behaviors and drive code defects down to zero. For critical components, decent performance and reliability are two attributes to let you sleep well in the night. Primary Tracker in networking control plane is a good example to explain this topic.

Basic layering of fabric controller

Fabric Conrtoller (FC) is the operational center of Azure platform. FC gets customer orders from the Red Dog Front End (RDFE) and/or modern replacement Azure Resource Manager (ARM) and then performs all the heavy-lifting work such as hardware management, resource inventory management, provisioning and commanding tenants / virtual machines (VMs), monitoring, etc. It is a “distributed stateful application distributed across data center nodes and fault domains”.

Three most important roles of FC are data center manager (DCM), tenant manager (TM), and network manager (NM). They manage three key aspects of the platform, i.e. data center hardware, compute, networking. In production, FC roles instances are running with 5 update domains (UDs).

FC layering

The number of UDs is different in test clusters. Among all UDs, one of them is elected to the primary controller, all others are considered as backup. The election of primary is based on Paxos algorithm. If primary role instance fails, all the remaining backup replicas will vote a new primary which will resume the operation. As long as there are 3 or more replicas, a quorum can be made and FC will operate normally.

In the above diagram, different nodes communicate with each other and form a ring via the bottom layer RSL. On top of it is a layer of cluster framework, libraries, utilities, collectively we call it CFX. Via CFX and RSL a storage cluster management service is provided where In-Memory Object Store (IMOS) is served. Each FC role defines several data models living in IMOS which is used to persis the state of the role.

Note that eventual consistency model is not used in FC as far as one role is concerned. In fact, strong consistency model is used to add the safety guarantee (read Data Consistency Primer for more information on consistency models). Whether this model is best for FC is debatable, I may explain more in a separate post later.

Primary tracker

Clients from outside of a cluster communicate with FC via Virtual IP address (VIP), and the software load balancer (SLB) routes the request to the right node at where primary replica is located. In the event of primary fail-over, SLB ensures the traffic to the VIP always (or eventually) reaches the new primary. For performance consideration, communication among FC roles does not go through VIP but Dynamic IP address (DIP) directly. Note that primary of one role is often different from the primary of another role, although sometimes they can be the same. Then the question is, where is the primary? The wrong answer of this question has the same effect of service unavailability.

This is why we have Primary Tracker. Basically primary tracker keeps track of IP address of primary replica and maintains a WCF channel factory so ensure the request to the role can be made reliably. The job is as simple as finding a primary, and re-finding the primary if the old one fails over.

Storage cluster management service provides an interface that, once connecting to any replica, it can tell where the primary is as long as the replica serving the request is not disconnected from the ring. Obviously this is a basic operation of any leader election algorithm, nothing mysterious. So primary tracker sounds trivial.

In Azure environment there are a few more factors to consider. Primary tracker object can be shared by multiple threads when many requests are processed concurrently. WCF client channel cannot be shared among multiple threads reliably, re-creating channel factory is too expensive. Having too many concurrent requests may be a concern to the healthy of the target service. So it is necessary to maintain a shared channel factory and perform request throttling (again, this is debatable).

Still this does not sound complicated. In fact, with proper compoentization and decoupling, many problems can be modeled in a simple way. Therefore, we had a almost-working implementation, and it has been in operation for a while.

Use cases

From the perspective of networking control plane, two important use cases of the primary tracker are:

  • Serving tenant command and control requests from TM to NM.
  • Serving VM DHCP requests from DCM to NM.

Load of both cases depends on how busy a cluster is, for instance if customers are starting many new deployments or stopping existing ones.


Although the old primary tracker worked, it often gave us some headache. Sometimes customers complained that starting VMs took a long time or even got stuck, and we root caused the issue to unresponsiveness of DHCP requests. Occasionally a whole cluster was unhealthy because DHCP stopped, and no new deployment could start because the start container failed repeatedly and pushed physical blades to Human Investigate (HI) state. Eventually the problem happened more often to the frequency of more than once per week, DRI on rotation got nervous since they did not know when the phone would ring them up after going to bed.

Then we improved monitoring and alerting in this area to collect more data, and more importantly got notified as soon as failure occured. This gave us right assessment of the severity but did not solve the problem itself. With careful inspection of the log traces, we found that failover of primary replica would cause the primary track losing contact to any primary for indefinite amount of time, anywhere from minutes to hours.


During one of Sev-2 incident investigation, a live dump of the host processs of the primary tracker was taken. The state of object as well as all threads were analyzed, and the conclusion was astonishingly simple – there was a prolonged race condition triggered by channel factory disposal upon the primary failover, then all the threads accessing the shared object just started an endless fight with each other. I will not repeat the tedius process of the analysis here, basically it is backtracking from the snapshot of 21 threads to the failure point with the help of log traces, nothing really exciting.

Once having the conclusion, the evidence in the source code became obivious. The irony part is that the first line of the comment said:

This class is not designed to be thread safe.

But in reality the primary use case is in a multi-thread environment. And the red flag is that the shared state is mutable by multiple thread without proper synchronization.


Strictly speaking the bugfix is a rewrite of the class with existing behavior preserved. As one can imagine it is not a complicated component, the core design is using reader-writer lock, specifically ReaderWriterLockSlim class (see the reference source here). In addition, a concept of generation is introduced to the shared channel factory in order to prevent the problem of different threads finding new primary multiple times after failover.

Stress test

The best way to check the reliability is to run a stress test with as much load as possible. Since the new implementation is backward compatible with the old one, it is straightforward to conduct the comparative study. The simulated stress environment has many threads sending requests continuously, and the artificial primary failover occurs much more often than any production cluster, furthermore the communication channel is injected with random faults and delay. It is a harsh environment for this component.

It turns out the old implementation breaks down within 8 minutes. The exact failure pattern is observed as the ones happening in production clusters. On the contrary, the new implementation has not failed so far.

Performance measurement

Although the component is perf sensitive, it has no regular perf testing. A one-time perf measurement conducted in the past shows that the maximum load it is able to handle is around 150 to 200 request/sec in a test cluster. This number is more than twice of the peak traffic in a production cluster under normal operational condition, according to live instrumentation data. Is it good enough? Different people have different opinions. My principle is to design for the worst scenario and ensure the extreme case is covered.

As a part of bugfix work, a new perf test program is added to measure both the throughput and latency of the system. The result shows that the new component is able to process about ten times of load, and the per-request overhead is less than one millisecond. After tuning a few parameters (which is a bit different than production setup), the throughput is increased further by about 30-40%.


Despite the fear of severe incident caused by the change in critical component, with the proof of functional / perf / stress test data, the newly designed primary tracker has been rolled out to all production clusters. Finally the repeated incidents caused by primary tracking failure no longer wake up DRIs during the night. From customers perspective, this means less number of VM starting failure and shorter VM bootup time.