18 Dec 2020
In cloud computing, we build highly available applications on commodity hardware. The software SLA is typically higher
than that of the underlying hardware by an order of magnitude or more. This is achieved by distributed applications based on state machine
replication. If strong consistency is required, state persistence based on
the Paxos algorithm is often used. Depending on the requirements
for layering, latency, availability, failure model, and other factors, there are several solutions available.
Cosmos DB or Azure SQL Database
Most apps built on top of the core Azure platform can take a dependency on Cosmos DB or Azure SQL
Database. Both are easy to use and integrate with existing
apps. This is often the path of least resistance, particularly Cosmos DB with its excellent availability,
scalability, and performance.
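As a rough illustration of how little code is needed, persisting a small piece of state with the Cosmos DB .NET SDK (Microsoft.Azure.Cosmos) looks roughly like the sketch below; the endpoint, key, and database/container names are placeholders, not anything from a real deployment:
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

class StateStore
{
    // Placeholder endpoint/key; in practice these come from configuration.
    private readonly Container container =
        new CosmosClient("https://myaccount.documents.azure.com:443/", "<key>")
            .GetContainer("appdb", "state");

    // Upsert a small state document keyed by id (also used as the partition key here).
    public Task SaveAsync(string id, string status) =>
        container.UpsertItemAsync(new { id, status }, new PartitionKey(id));
}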
If you are looking for the lowest latency possible, the state is better persisted locally and cached inside the
process. In that case, remote persistence such as Cosmos DB may not be desirable. For services within the platform that sit below
Cosmos DB, this approach may not be viable at all.
RSL
Although not many people have noticed it, the Replicated State Library is one of the
greatest contributions to OSS from Microsoft. It is a verified and well-tested Paxos implementation which has been in
production for many years. RSL has been the core layer powering the Azure core control plane since the beginning. The
version released on GitHub is the one used in the product as of now. Personally, I am not aware of another implementation
with greater scale, performance, and reliability (in terms of bugs) on the Windows platform. If you have to store 100 GB of
data with strong consistency in a single ring, RSL is well capable of doing the job.
Note that it is for Windows platforms only; both native and managed code are supported. I guess it is possible to port it
to Linux; however, no one has looked into it and there is no plan to do so.
In-Memory Object Store (IMOS) is a proprietary managed-code layer on top of RSL that provides transaction semantics, strongly typed
objects, object collections, relationships, and code generation from UML class diagrams. Although performance and
scale are sacrificed somewhat, it is widely used because of its convenience and productivity.
Service Fabric Reliable Collections
RSL and IMOS were often used by “monolithic” distributed applications before Service
Fabric was widely adopted. SF is a great
platform for building scalable and reliable microservices, in particular stateful services. Hosting RSL on SF isn’t
impossible, but it is far from straightforward. For one thing, primary election in RSL is completely independent of SF, so you had
better ensure the two are consistent via some trick. In addition, SF may move the replicas around at any time, and this must
be coordinated with RSL dynamic replica set reconfiguration. Therefore, the most common approach is to use SF reliable
collections
in the stateful application, as recommended. Over time, this approach will become the mainstream in the foundational layer.
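For a flavor of the programming model, here is a minimal sketch of a stateful service that keeps a replicated counter in a reliable dictionary; the collection name and counter key are made up for illustration:
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Data.Collections;
using Microsoft.ServiceFabric.Services.Runtime;

internal sealed class CounterService : StatefulService
{
    public CounterService(StatefulServiceContext context) : base(context) { }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        // State is replicated and persisted by the platform; no external store is needed.
        var counters = await this.StateManager
            .GetOrAddAsync<IReliableDictionary<string, long>>("counters");

        while (!cancellationToken.IsCancellationRequested)
        {
            using (ITransaction tx = this.StateManager.CreateTransaction())
            {
                await counters.AddOrUpdateAsync(tx, "heartbeat", 1, (key, value) => value + 1);
                await tx.CommitAsync();   // replicated to a quorum before returning
            }
            await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);
        }
    }
}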
Ring Master
If you need distributed synchronization and are not satisfied with ZooKeeper because
of its scale, or you want native SF integration, then you should consider adopting Ring
Master, which has been released as open source. Essentially, Ring Master provides a
superset of ZooKeeper semantics. It is the core component supporting goal state delivery in several
mission-critical foundational services in the platform. The persistence layer can be replaced; the released source code
supports SF reliable collections for production use and in-memory persistence for testing. If you want the absolute best performance and
scale, consider persisting to RSL.
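I will not reproduce the actual RingMaster client API here (see the repository for that); the hypothetical interface below merely illustrates the ZooKeeper-style, hierarchical-namespace semantics it builds upon:
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical interface, for illustration only; the real RingMaster client differs.
public interface IHierarchicalStore
{
    Task CreateAsync(string path, byte[] data, bool ephemeral);  // znode-style create
    Task<byte[]> GetDataAsync(string path, bool watch);          // read with an optional watch
    Task SetDataAsync(string path, byte[] data, int version);    // conditional update
    Task<IReadOnlyList<string>> GetChildrenAsync(string path);   // enumerate children
    Task DeleteAsync(string path, int version);                  // conditional delete
}
With semantics like these, goal state delivery typically boils down to writing the desired state under a path and letting interested services watch it.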
If you have any questions or comments, please leave a message in the discussion. Thanks!
17 Dec 2020
Before joining Microsoft, I worked on Linux almost all the time for years. As in most other projects, we used shell
scripts to automate the build process, and GNU automake / autoconf were the main toolset. Occasionally
CMake was used to handle some components where necessary. In the Windows team, I witnessed how an
enormous amount of code is built consistently and reliably using sophisticated in-house software. In this note, I discuss a few build
systems that I used in the past and share some learnings.
Why do we need a “build system”?
A simple hello world program or school project doesn’t need a build system. Load it into your favorite IDE and run it. If
it works, congrats. If not, it is your responsibility to fix it. Obviously this logic won’t fly for any project shared
by multiple people. The Windows SDK and Visual Studio don’t really tell us how to deal with a large number of projects in an
automated and reliable manner.
NMake is the counterpart of
Makefile and is able to do the job to some extent. However, honestly I haven’t seen anyone using it directly because of the
complexity at large scale. We need a layer on top of the SDK and VS toolset to automate the entire build process for
both developers and lab builds, and the process must be reliable and repeatable. For Windows, reproducibility is
critical. Imagine you have to fix a customer-reported issue on a version released a long time back; it would be
unthinkable if you could not produce the same set of binaries as the build machines did previously in order to debug. By
the way, all build systems are command line based; since no one will glare at their monitor for hours, no fancy UI is
needed.
Razzle and Source Depot
Razzle is the first production-quality build system I used. Essentially it is a collection of command line tools and
environment variables to run build.exe for regular dev builds and timebuild for lab builds. At the start of the day, a
command prompt is opened and razzle.cmd is invoked, which performs some validation of the host environment, sets up
environment variables, and presents a command prompt for conducting the subsequent work of the day.
In Razzle, everything is checked into the source repository. Here “everything” is literally everything, including
source code, compilers, the SDK, all external dependencies, libraries, and all binaries needed for the build process.
Outside of build servers, no one checks out everything on their dev machine, which could be close to a terabyte. A working
enlistment is a partial checkout at the tens-of-GB level. Because of the outrageous scale requirement, an in-house
source repository called Source Depot is used (rumor has it that it was based on Perforce
with necessary improvements, though I am not sure of the accuracy), and a federation of SD servers supports the
Windows code base. On top of sd.exe, there is a batch script called sdx.cmd to coordinate common operations across
multiple SD servers. For instance, instead of using “sd sync”, we used to run “sdx sync” to pull down the latest
check-ins. Some years later, in order to modernize the dev environment, git replaced
SD, with which I have no hands-on experience.
Razzle deeply influenced other build systems down the line. Even now, people are used to typing “build” or even “bcz”, even though
the latter is not really meaningful in contemporary build systems. One of the great advantages of Razzle is its
reproducibility and total independence. Because everything is stored in SD, if you want to reproduce an old build, you
just check out the version at the required changeset, type “build”, and eventually you will get precisely the build
required by the work, other than timestamps, etc. In practice, with a cleanly installed OS, you run the enlistment script
on a network share, which in turn calls sd to download the source code (the equivalent of “git clone”), and then you have a
fully working enlistment; nothing else is needed (assuming you are fine with editing C++ code in Notepad).
Instead of Makefile or MSBuild project files, dirs
files are used for directory traversal and sources
files are used to
build individual projects. An imaginary sources file looks like the following (for illustration purposes only):
TARGETNAME = Hello
TARGETTYPE = DYNLINK
UMTYPE = console
C_DEFINES = $(C_DEFINES) -DWIN32 -DMYPROJECT
LINKER_FLAGS = $(LINKER_FLAGS) -MAP
INCLUDES = $(INCLUDES);\
    $(PROJROOT)\shared;
SOURCES = \
    hello.cpp
TARGETLIBS = $(TARGETLIBS) \
    $(SDK_LIB_PATH)\kernel32.lib \
    $(PROJROOT)\world\win32\$(O)\world.lib
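A dirs file is even simpler; it merely lists the subdirectories for the traversal (again, imaginary):
DIRS = \
    hello \
    world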
Invocation of the underlying ntbuild carries out several build passes to run various tasks, such as preprocessing, midl,
compiling, linking, etc. There are also postbuild tasks to handle code signing, instrumentation, code analysis,
localization, etc. A publish/consume mechanism is used to handle the dependencies among projects, so it is possible to
enlist a small subset of projects and build without missing dependencies.
Coming from the Linux world, I didn’t find it too troublesome to use another set of command line tools, other than missing
cygwin and VIM. However, for people who loved Visual Studio and GUI tools, this seemed to be an unproductive
environment. Additionally, you cannot easily use Razzle for projects outside Windows.
CoreXT
After moving out of Windows, I came to know CoreXT in an enterprise software project. Initially a Razzle clone, it is
believed to be a community project maintained by passionate build engineers inside Microsoft (by the way, I have never
been a build engineer). It is widely used in Office, SQL, Azure, and many other organizations even today. Six years ago,
Azure projects were based on CoreXT and followed a similar approach to Windows on Razzle: everything stored in SD,
dirs/sources on top of ntbuild, timebuild to produce the nightly build, etc. The main difference was that each service had its
own enlistment project, just like a miniature of the Windows code base. Inter-service dependencies were handled by copying
files around. For instance, if project B had to use some libraries generated by project A, project A would export
those files, and project B would import them by adding them to SD. For projects based on managed code (most are),
msbuild instead of ntbuild was used for convenience.
At the time, the dev experience on CoreXT was not too bad. It inherited all the goodness of Razzle. But it was still a
bit heavyweight. Even if you only had tens of MB of source code, the build environment and external dependencies would
still be north of ten GB in size. Young engineers considered it a dinosaur environment, which was hard to argue against when
comparing with the open source toolset. Support for the Visual Studio IDE was via csproj files (used by both the build and the
IDE) and sln files (used by the IDE only).
Five years ago, people started to modernize the dev environment. The first step was moving from SD to git. Without LFS,
it is impractical to store much data in git; 1 GB was considered an acceptable upper bound at the time. So we had
to forget about the practice of checking in everything and start reducing the repo size dramatically. But the Windows SDK
alone was already well over 1 GB, so how could the storage issue be handled without sacrificing reproducibility? The solution was
to leverage NuGet. Essentially, besides the corext bootstrapper
(very small) and the source code, everything was wrapped into NuGet packages. This solution has lasted until today.
Most projects have their own git repository. Under the root directory, init.cmd is the replacement for razzle.cmd; it
invokes the corext bootstrapper to set up the enlistment environment. As with Razzle, it is still a command prompt with
environment variables and command aliases. corext.config
under .corext
is similar to nuget.config; it contains the
list of NuGet feeds (on-premises network shares in the past, ADO nowadays) and the list of packages. All packages are
downloaded and extracted into the CoreXT cache directory. MSBuild project files are modified to use the toolset in the
cache directory, such as:
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="15.0" DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<Import Project="$(EnvironmentConfig)" />
...
<Import Project="$(ExtendedTargetsPath)\Microsoft.CSharp.targets" />
</Project>
The trick here is that EnvironmentConfig
is an environment variable pointing to an MSBuild props file in the CoreXT cache; this
file bootstraps everything after that. With that, when the build alias is invoked, the MSBuild program is called, and the compilers
and build tools in the CoreXT cache are used instead of the ones installed on the host machine.
In theory, the entire process relies on nothing but the files in the CoreXT cache. One does not need to install Visual
Studio or any developer tools on their computer. In practice, occasionally some packages reference files outside of the
cache and assume certain software is installed. However, that is the exception rather than the norm.
For developers, we use Visual Studio or VS Code to browse code, write code, build, and debug. A tool is provided to
generate a solution file from a set of MSBuild project files (csproj and dirs.proj). Then the solution is loaded in the IDE.
Dependencies among projects are handled via NuGet packages. During the official build, we can choose whether or not to
publish packages into feeds on ADO. Other projects simply add the <package .../>
entry to their corext.config file should they
want to consume any packages.
So far most projects and most people in my org are still using CoreXT in this form. It is used by engineers during
daily development, by build machines in the lab, by distributed builds in the cloud, and everywhere else we want it to be.
Other than compiling source code and building product binaries, it also carries out various other tasks, including but
not limited to static code analysis, policy checks, VHD generation, NuGet package creation, making app packages suitable
for deployment, publishing symbols, etc.
CBT and Retail MSBuild
Again, CoreXT is considered a modern-day relic. People use it because they have to. In particular, it is highly
desirable to have seamless integration with Visual Studio and the ability to consume the latest technology from .NET Core.
Before MSBuild became more capable, the Common Build Toolset (CBT) was developed
as a GitHub project to fulfill this requirement. It is a lightweight framework that provides a consistent “git clone +
msbuild” experience to codebases using it. An additional advantage is that it is open source; for internal projects
that need to sync to GitHub periodically, duplicate build systems (one for internal builds, one for public) are no longer
needed.
Using CBT is extremely simple from the dev perspective. No internal machinery whatsoever. Just clone the repo and open it
in Visual Studio. Adding a new project is also straightforward; there is no more need to perform brain surgery on csproj files
as in CoreXT. The downside is obvious: you must install essential build tools such as VS. Reproducibility isn’t strictly
guaranteed as far as I can tell. After all, the VS developer command prompt is used. For most Azure teams, this may not be a
concern since things move so fast; I haven’t met anyone who complains that they cannot reproduce a build from one year ago to
service an old version of their service.
CBT was somewhat short-lived. For some people, by the time they came to know the migration path from CoreXT to CBT, it was
already deprecated. The latest shiny framework on the street is Retail MSBuild. :-) It works similarly to CBT but is even
more lightweight. With this framework, engineering teams are able to use Visual Studio and the retail Microsoft toolset in
the most natural way. In CoreXT, people have to spend a lot of time on any new technology because the framework
intentionally works differently. Personally, I’ve spent many hours making .NET Core work in my team; some other
components might be worse. With Retail MSBuild, everything just works with plain, simple SDK-style project files and
PackageReference. Precious resources can be spent on real work; we are not rewarded for reinventing the wheel (and
possibly a worse one) anyway.
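For comparison, an SDK-style project with PackageReference is about as small as it gets; the package name, version, and target framework below are just placeholders:
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>netcoreapp3.1</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Newtonsoft.Json" Version="12.0.3" />
  </ItemGroup>
</Project>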
Snowflakes
Other than the most popular ones mentioned above, some teams write their own framework to meet their unique
requirements. For instance, several years ago a team needed a high-performance build integrating with VSTS build
definitions with minimal overhead, so a thin wrapper was built on top of a collection of project files and batch scripts. In
RingMaster I had to write my own build framework because the internal
proprietary build system could not be released due to the approval process, the project would not build without something similar
to CoreXT, and no other alternative was available (CBT did not exist at the time). In the end, the projects were
migrated to SDK-style project files to make this work easier.
In the future, I look forward to Retail MSBuild being adopted more widely and internal build systems going away
eventually. I love open source from the bottom of my heart. :-)
15 Dec 2020
Long time no post! I have written many technical analyses internally, but I should probably share more with people who
have no access to them. :-)
As a senior engineer working in Azure Core, alerts and incidents are more common than you might think. We (my team and I)
strive to build the most reliable cloud in the world and deliver the best customer experience. In reality, like
everyone on the battlefield, we make mistakes, we learn from them, and we try not to make the same mistake twice. During
this Thanksgiving, we tried everything possible not to disturb the platform, including suspending non-critical changes,
and it turned out to be another uneventful weekend, thank goodness. When I heard about the AWS outage in US East, my heart was
with the frontline engineers. Later I enjoyed reading the RCA of the outage.
The following are my personal takeaways.
The first lesson is: do not make changes during or right before a high-priority event. Most of the time, things do not break
if you don’t touch them. If your projection is that more capacity may be required, carry out the expansion a few days prior
to the important period of time. Furthermore, even if something does not look quite right, be conservative and be sure
not to make it worse while making a “small” fix.
The second lesson is about the monitoring gap; in other words, why did the team not get a high-severity alert indicating the
actual source of the problem? Regarding the maximum number of threads being exceeded, or the high thread count
problem in general, in my observation this isn’t a rare event. A few days ago, I was invited to check why a backend
service replica did not make any progress. Once the crash dump was loaded in the debugger, it was quite obvious where the
problem was – several thousand threads were spin-waiting on a shared resource, which was held by a poorly implemented
logging library. The difference in this case is that the team did notice the situation through accurate monitoring and mitigated
the problem by failing over the process. If we know the number of threads should not exceed N, we absolutely need to
configure the monitor so we know immediately when it goes out of the expected range, and a troubleshooting guide should be
linked to the alert so that even a half-awake junior engineer is able to follow the standard operating procedure to
mitigate the issue or escalate (in case the outcome is unexpected). I am glad to read the repair item for this:
We are adding fine-grained alarming for thread consumption in the service…
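As a trivial sketch of this kind of guardrail (the threshold, sampling interval, and alert sink below are all placeholders; a real service would emit a metric to its monitoring pipeline rather than write to the console):
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

static class ThreadCountWatchdog
{
    public static async Task RunAsync(int threshold, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            // Sample the current process thread count and alarm when it drifts out of range.
            int threads = Process.GetCurrentProcess().Threads.Count;
            if (threads > threshold)
            {
                Console.Error.WriteLine($"ALERT: thread count {threads} exceeds {threshold}");
            }
            await Task.Delay(TimeSpan.FromSeconds(30), token);
        }
    }
}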
In many services here, threadpool worker thread growth can be unbounded in certain rare cases until we receive an
alert and fix it. For instance, theoretically the number of threads can go up to 32767, although I’ve never seen that
many (the maximum was about 5000-6000 in the past). In some services, the upper bound is set to a much more conservative number. So I
think the following is something we can learn from:
We will also finish testing an increase in thread count limits in our operating system configuration, which we believe
will give us significantly more threads per server and give us significant additional safety margin there as well.
In addition, the following caught my attention:
This information is obtained through calls to a microservice vending the membership information, retrieval of
configuration information from DynamoDB, and continuous processing of messages from other Kinesis front-end servers
… It takes up to an hour for any existing front-end fleet member to learn of new participants.
Maybe the design philosophy in AWS (or that upper layer) is different from the practice in my org. Service initialization
is an important aspect when examining performance and reliability. Usually, we try not to take dependencies on layers
above us, or to assume certain components must be operating properly in order to bootstrap the service replica in question.
The duration of service initialization is measured and tracked over time. If the initialization takes too long to
complete, we will be called. The majority of services in the control plane take seconds to be ready; outliers
unfortunately hit high-severity incidents. A few days ago a service in a busy datacenter took about 9 minutes to get
online (from process creation to being ready to process incoming requests), and that was such a pain during an outage. In my opinion,
a fundamental improvement has to be made to fix this, and the following is on the right track:
…to radically improve the cold-start time for the front-end fleet…Cellularization is an approach we use to isolate
the effects of failure within a service, and to keep the components of the service (in this case, the shard-map cache)
operating within a previously tested and operated range…
Usually we favor scaling out the service instead of scaling up; however, the following is right on the spot in this context:
we will be moving to larger CPU and memory servers, reducing the total number of servers and, hence, threads required
by each server to communicate across the fleet. This will provide significant headroom in thread count used as the
total threads each server must maintain is directly proportional to the number of servers in the fleet…
Finally, kudos to the Kinesis team for mitigating the issue and finding the root cause. I greatly appreciate the detailed RCA
report, which will benefit everyone working in cloud computing!
09 Jul 2016
Using the cookies stored by a website in a script is a nice trick to reuse existing authentication to access a
web service, etc. There are several ways to retrieve the cookies from IE / Edge; the most convenient is to directly
read the files on the local disk. Basically, we can use the Shell.Application
COM object to locate the cookies folder
and then parse all text files for the needed information. In each file, there are several records delimited by a line with the
single character *; in each record, the first line is the name, the second line the value, and the third line the host name of the
website that set the cookie. Here is a simple PowerShell program to retrieve and print all cookies:
Set-StrictMode -version latest
$shellApp = New-Object -ComObject Shell.Application
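# 0x21 is the shell special folder constant for the cookies folder (ssfCOOKIES)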
$cookieFolder = $shellApp.NameSpace(0x21)
if ($cookieFolder.Title -ne "INetCookies") {
    throw "Failed to find INetCookies folder"
}
$allCookies = $cookieFolder.Items() | ? { $_.Type -eq "Text Document" } | % { $_.Path }
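# Each cookie file holds records separated by a line with a single '*': name, value, host, followed by other fields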
foreach ($cookie in $allCookies) {
    Write-Output "Cookie $cookie"
    $items = (Get-Content -Raw $cookie) -Split "\*`n"
    foreach ($item in $items) {
        if ([string]::IsNullOrEmpty($item.Trim())) {
            continue
        }
        $c = $item -Split "\s+"
        Write-Output " Host $($c[2])"
        Write-Output " $($c[0]) = $($c[1])"
    }
}
Note that files in %LOCALAPPDATA%\Microsoft\Windows\INetCookies\Low
do not show up in the $cookieFolder.Items()
list.
An alternative approach is to browse the file system directly, e.g.
gci -File -Force -r ((New-Object -ComObject Shell.Application).Namespace(0x21).Self.Path)
01 Jun 2016
I believe in the KISS principle. Albert Einstein said:
Make everything as simple as possible, but not simpler.
This is one of the guiding principles that I follow in every design and implementation. Needless to say, this principle
is particularly important in cloud computing. Simplicity makes it easier to reason about system behaviors and drive
code defects down to zero. For critical components, decent performance and reliability are the two attributes that let you
sleep well at night. The Primary Tracker in the networking control plane is a good example to explain this topic.
Basic layering of fabric controller
Fabric Controller (FC) is the operational center of the Azure platform. FC gets customer orders from the Red Dog Front End
(RDFE) and/or its modern replacement, Azure Resource Manager (ARM), and then performs all the heavy-lifting work such as
hardware management, resource inventory management, provisioning and commanding tenants / virtual machines (VMs),
monitoring, etc. It is a “distributed stateful application distributed across data center nodes and fault domains”.
The three most important roles of FC are the data center manager (DCM), the tenant manager (TM), and the network manager (NM). They
manage three key aspects of the platform, i.e. data center hardware, compute, and networking. In production, FC role
instances run with 5 update domains (UDs).
The number of UDs is different in test clusters. Among all UDs, one is elected as the primary controller; all
others are considered backups. The election of the primary is based on the Paxos
algorithm. If the primary role instance
fails, the remaining backup replicas vote for a new primary, which resumes the operation. As long as there are
3 or more replicas, a quorum can be reached and FC operates normally.
In the above diagram, different nodes communicate with each other and form a ring via the bottom layer, RSL. On top of
it is a layer of cluster framework, libraries, and utilities, collectively called CFX. Via CFX and RSL, a storage
cluster management service is provided, where the In-Memory Object Store (IMOS) is served. Each FC role defines several data
models living in IMOS, which are used to persist the state of the role.
Note that the eventual consistency model is not used in FC as far as
a single role is concerned. In fact, the strong consistency model is used to add a safety guarantee (read the Data Consistency
Primer for more information on consistency models). Whether
this model is best for FC is debatable; I may explain more in a separate post later.
Primary tracker
Clients from outside of a cluster communicate with FC via a Virtual IP
address (VIP), and the software load balancer (SLB) routes the
request to the node where the primary replica is located. In the event of a primary failover, SLB ensures that the
traffic to the VIP always (or eventually) reaches the new primary. For performance reasons, communication among
FC roles does not go through the VIP but uses Dynamic IP addresses (DIPs) directly. Note that the primary of one role is often
different from the primary of another role, although sometimes they can be the same. Then the question is: where is the
primary? A wrong answer to this question has the same effect as service unavailability.
This is why we have the Primary Tracker. Basically, the primary tracker keeps track of the IP address of the primary replica and
maintains a WCF channel factory to ensure that requests to the role can be made reliably. The job is as simple as finding
the primary, and re-finding the primary if the old one fails over.
The storage cluster management service provides an interface such that, once connected to any replica, it can tell where the
primary is, as long as the replica serving the request is not disconnected from the ring. Obviously this is a basic
operation of any leader election algorithm, nothing mysterious. So the primary tracker sounds trivial.
In the Azure environment there are a few more factors to consider. The primary tracker object can be shared by multiple threads
when many requests are processed concurrently. A WCF client channel cannot be shared among multiple threads reliably, and
re-creating the channel factory is too expensive. Having too many concurrent requests may be a concern for the health of the
target service. So it is necessary to maintain a shared channel factory and perform request throttling (again, this is
debatable).
Still, this does not sound complicated. In fact, with proper componentization and decoupling, many problems can be
modeled in a simple way. Therefore, we had an almost-working implementation, and it had been in operation for a while.
Use cases
From the perspective of networking control plane, two important use cases of the primary tracker are:
- Serving tenant command and control requests from TM to NM.
- Serving VM DHCP requests from DCM to NM.
The load in both cases depends on how busy a cluster is, for instance whether customers are starting many new deployments or
stopping existing ones.
Problems
Although the old primary tracker worked, it often gave us headaches. Sometimes customers complained that starting
VMs took a long time or even got stuck, and we root-caused the issue to unresponsiveness of DHCP requests. Occasionally
a whole cluster was unhealthy because DHCP stopped, and no new deployment could start because the start container failed
repeatedly and pushed physical blades into the Human Investigate (HI) state. Eventually the problem happened more often, to
a frequency of more than once per week, and the DRI on rotation got nervous since they did not know when the phone would ring
them up after going to bed.
Then we improved monitoring and alerting in this area to collect more data and, more importantly, to get notified as soon as a
failure occurred. This gave us the right assessment of the severity but did not solve the problem itself. With careful
inspection of the log traces, we found that a failover of the primary replica could cause the primary tracker to lose contact with
any primary for an indefinite amount of time, anywhere from minutes to hours.
Analysis
During one Sev-2 incident investigation, a live dump of the host process of the primary tracker was taken. The
state of the object as well as all threads were analyzed, and the conclusion was astonishingly simple – there was a
prolonged race condition triggered by channel factory disposal upon the primary failover, and then all the threads accessing
the shared object started an endless fight with each other. I will not repeat the tedious process of the analysis
here; basically it was backtracking from the snapshot of 21 threads to the failure point with the help of log traces,
nothing really exciting.
Once we had the conclusion, the evidence in the source code became obvious. The ironic part is that the first line of
the comment said:
This class is not designed to be thread safe.
But in reality the primary use case is a multi-threaded environment. And the red flag is that the shared state is
mutable by multiple threads without proper synchronization.
Fix
Strictly speaking, the bugfix is a rewrite of the class with the existing behavior preserved. As one can imagine, it is not a
complicated component; the core of the design is a reader-writer
lock, specifically the ReaderWriterLockSlim
class (see the
reference source
here).
In addition, a concept of generation is introduced for the shared channel factory in order to prevent
different threads from re-finding the new primary multiple times after a failover.
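The following is a minimal sketch of the idea, not the actual internal implementation; the channel type and the resolver delegate are stand-ins for the WCF channel factory and the primary lookup:
using System;
using System.Threading;

// Sketch of a generation-stamped cache guarded by a reader-writer lock.
public sealed class PrimaryTracker<TChannel>
{
    private readonly ReaderWriterLockSlim rwLock = new ReaderWriterLockSlim();
    private readonly Func<TChannel> resolvePrimary;   // finds the current primary (stand-in)
    private TChannel channel;
    private long generation;

    public PrimaryTracker(Func<TChannel> resolvePrimary)
    {
        this.resolvePrimary = resolvePrimary;
        this.channel = resolvePrimary();
    }

    // Readers share the cached channel and remember the generation they saw.
    public TChannel GetChannel(out long observedGeneration)
    {
        rwLock.EnterReadLock();
        try
        {
            observedGeneration = generation;
            return channel;
        }
        finally { rwLock.ExitReadLock(); }
    }

    // On failure, only the first caller of a given generation re-resolves the primary;
    // late callers see a newer generation and simply reuse the refreshed channel.
    public TChannel Refresh(long observedGeneration)
    {
        rwLock.EnterWriteLock();
        try
        {
            if (observedGeneration == generation)
            {
                channel = resolvePrimary();
                generation++;
            }
            return channel;
        }
        finally { rwLock.ExitWriteLock(); }
    }
}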
Stress test
The best way to check reliability is to run a stress test with as much load as possible. Since the new
implementation is backward compatible with the old one, it is straightforward to conduct a comparative study. The
simulated stress environment has many threads sending requests continuously, the artificial primary failover occurs
much more often than in any production cluster, and furthermore the communication channel is injected with random faults and
delays. It is a harsh environment for this component.
It turns out the old implementation breaks down within 8 minutes. The exact failure pattern observed is the same as the ones
happening in production clusters. In contrast, the new implementation has not failed so far.
Although the component is perf sensitive, it has no regular perf testing. A one-time perf measurement conducted in the
past showed that the maximum load it is able to handle is around 150 to 200 requests/sec in a test cluster. This number
is more than twice the peak traffic in a production cluster under normal operating conditions, according to live
instrumentation data. Is it good enough? Different people have different opinions. My principle is to design for the
worst scenario and ensure the extreme case is covered.
As a part of the bugfix work, a new perf test program was added to measure both the throughput and latency of the system.
The results show that the new component is able to process about ten times the load, and the per-request overhead is less
than one millisecond. After tuning a few parameters (which are a bit different from the production setup), the throughput
increased further by about 30-40%.
Rollout
Despite the fear of a severe incident caused by a change in a critical component, with the proof of functional / perf /
stress test data, the newly designed primary tracker has been rolled out to all production clusters. Finally, the
repeated incidents caused by primary tracking failures no longer wake up DRIs during the night. From the customers’
perspective, this means fewer VM start failures and shorter VM boot times.