25 Dec 2020
The release of .NET 5.0 is exciting news for us. No more arguments about whether to migrate to .NET Core or upgrade to a newer
version of .NET Framework: .NET 5, and later .NET 6 (the LTS version), will be the north star for control plane services, with better
performance and a faster pace of innovation. Additionally, it is a relief to know that multiple platforms (not just different
editions of Windows) can be unified with one set of source code.
Today I found a cute little Raspberry Pi 3B lying
unused. It was a toy for my son, but he has a real computer already. I wondered whether it was able to run .NET 5 apps,
so I decided to give it a try. The process turned out to be quite straightforward, although I don't think it's particularly useful to do
so. Anyway, here is what I've done.
Upgrade Raspberry Pi OS
Although no endpoint is exposed, it is still good practice to keep the OS up to date:
sudo apt update
sudo apt upgrade
sudo apt full-upgrade
Install Docker
My work doesn't actually use Docker, but I was curious whether it would run in such a resource-constrained environment.
Firstly, run the script from the official website:
curl -sSL https://get.docker.com | sh
To avoid prefixing almost every command with
sudo, I added the default user account to the docker group:
sudo usermod -aG docker pi
Then I ran a smoke test with the hello-world image:
docker run hello-world
It was encouraging to see that everything just worked.
Install .NET SDK 5.0.1
Initially I thought the package manager might take care of this, but I had to do it manually as follows:
sudo mkdir /var/dotnet
sudo tar zxvf dotnet-sdk-5.0.101-linux-arm.tar.gz -C /var/dotnet
Then I created a sample console app to confirm it indeed worked. Lastly, I updated
$HOME/.bashrc with the required environment
variable changes.
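The variables involved are presumably along these lines (a sketch; the /var/dotnet path comes from the tar step above, and the exact lines are my reconstruction, not copied from the original setup):

```shell
# Point the CLI at the manually extracted SDK (path from the tar step above)
export DOTNET_ROOT=/var/dotnet
# Make the dotnet launcher available without typing the full path
export PATH=$PATH:$DOTNET_ROOT
```

After re-sourcing .bashrc, `dotnet --info` should then find the SDK.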
Visual Studio Code
VI is preinstalled on Raspberry Pi, just like on every other Linux distribution. However, VS Code is so popular that I
had to give it a try. After downloading the Debian package from the VS Code download site, I installed it with the following:
sudo apt install ./code_1.52.1-1608136275_armhf.deb
Now the “development” environment is ready. Understandably nothing is as responsive as my desktop, but it isn't
unbearably slow. In fact, writing simple code was just fine.
Since .NET already provides a sample Docker image, why not give it a try:
pi@raspberrypi:~ $ docker run --rm mcr.microsoft.com/dotnet/samples
Unable to find image 'mcr.microsoft.com/dotnet/samples:latest' locally
latest: Pulling from dotnet/samples
c06905228d4f: Pull complete
6938b34386db: Pull complete
46700bb56218: Pull complete
7cb1c911c6f7: Pull complete
a42bcb20c9b3: Pull complete
08b374690670: Pull complete
Status: Downloaded newer image for mcr.microsoft.com/dotnet/samples:latest
Hello from .NET!
(ASCII art of the .NET mascot omitted)
Linux 4.19.66-v7+ #1253 SMP Thu Aug 15 11:49:46 BST 2019
The following is a screenshot of VS Code:
Create a Docker Image
It was not tricky to create a Docker image using the official .NET 5.0 image. First, pull the runtime image:
docker pull mcr.microsoft.com/dotnet/runtime:5.0
After copying the published directory to the image, it ran smoothly. However, I found the image was quite large;
the above was 153 MB. After some trial and error, I found a way to make it smaller.
Firstly, change the csproj file to enable self-contained deployment (SCD) with trimming, and also turn off globalization since
I almost never need to deal with it in the control plane:
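The property group I used looked roughly like this (a sketch; PublishTrimmed and InvariantGlobalization are the standard MSBuild property names for trimming and disabling globalization, though the exact file contents are reconstructed):

```xml
<PropertyGroup>
  <OutputType>Exe</OutputType>
  <TargetFramework>net5.0</TargetFramework>
  <!-- Trim unused assemblies during self-contained publish -->
  <PublishTrimmed>true</PublishTrimmed>
  <!-- Skip ICU and use the invariant culture only -->
  <InvariantGlobalization>true</InvariantGlobalization>
</PropertyGroup>
```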
Then an SCD package was published to the out directory:
dotnet publish -c release -r ubuntu.18.04-arm --self-contained -o out
Note that it is specifically targeted to Ubuntu 18.04 LTS. The package size seems reasonable given the runtime is included:
pi@raspberrypi:~/tmp $ du -h out
A Dockerfile is written to build the image on top of the Ubuntu 18.04 minimal image:
FROM ubuntu:18.04
RUN mkdir /app
COPY out /app
ENTRYPOINT ["/app/tmp"]
Build the image:
docker build --pull -t dotnetapp.ubuntu -f Dockerfile.ubuntu .
Give the image a try and compare with execution outside of the container:
pi@raspberrypi:~/tmp $ docker run --rm dotnetapp.ubuntu
Duration to schedule 100000 async tasks: 00:00:00.2684313
pi@raspberrypi:~/tmp $ out/tmp
Duration to schedule 100000 async tasks: 00:00:00.2351646
Besides Ubuntu 18.04, I tried other images as well. Here is what I found:
- The Debian 10 Slim image works similarly to Ubuntu 18.04; the size is about 3 MB larger.
- The default Alpine image doesn't have glibc, which is required by the bootstrapper. The packaging works, but the image
doesn't run even when the runtime identifier is set to Alpine specifically.
- The Google image
gcr.io/distroless/dotnet works, but the base image is already 134 MB since it ships the entire runtime.
- The base image
gcr.io/distroless/base has glibc, and the base image is only 13 MB (Ubuntu is 45.8 MB). However, I
didn't figure out how to fix the image build problem; the missing
/bin/sh seems to be problematic.
- The busybox base image with glibc is only 2.68 MB. It seems promising, but it doesn't have the required
arm-linux-gnueabihf libs (at both /lib and /usr/lib). I guess this could be resolved by copying some files over, but in real work
this would be unmaintainable.
By the way, other than new apps, many things haven't changed much on Linux; for instance, font
rendering is still miserable and requires heavy modification. Practically, WSL
seems to be more productive from a development perspective.
20 Dec 2020
In distributed computing, we rely on traces and metrics to understand the runtime behavior of programs. However, in some
cases we still need assistance from debuggers for live-site issues. For instance, if the service crashes all of a sudden
and no trace offers any clue, we need to load the crash dump into a debugger. Or, if some exception is raised but traces are
insufficient to understand the nature of the problem, we may need to capture the full state of the process.
In the old days, at least starting in Windows 3.1, there was a Dr. Watson
to collect the error information following a process crash, mainly the crash dump file. Every time I saw it, something
bad had happened. Nowadays it operates under the new name of Windows Error
Reporting, or WER. Inside the platform,
there is still a “watson” service to collect all the crash dumps created by the platform code, process them, assign them to the
right owner, and send alerts as configured. Sometimes during a live-site investigation, we can also request a dump file
collection using “Node Diagnostics”; the file will then be taken over by Watson (assuming your hand isn't fast enough
to move the file somewhere else).
Like it or not, to look at the dump file you have to use
windbg. You can choose cdb
or windbgx, but they are not really different. If you are too busy to learn
managed-code debugging in windbg using
SOS, then you may use this
quick guide to save some time.
Download sosex from Steve’s TechSpot and save the DLL in the extension directory.
Download mex from Microsoft download and save the
DLL in the extension directory.
To find the extension directory, locate the directory where windbg.exe resides using Task Manager, then go to the winext subdirectory.
Exit windbg: enter
qd, or simply Alt-F4.
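Before any of the commands below work, the extensions need to be loaded. A typical session start might look like this (assuming a .NET Framework process, where SOS loads alongside clr):

```
.loadby sos clr    $$ load SOS from the directory where the CLR was loaded
.load sosex        $$ load sosex from the extension directory
.load mex          $$ load mex
```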
Display the process environment block:
!peb
You will see where the execution image is, and all the environment variables, which contain the machine name, processor ID, etc.
To check which threads have consumed how much CPU time:
!runaway
To check CPU utilization, thread pool worker thread and completion port thread usage:
!threadpool
List of threads: check how many threads there are, whether any threads are terminated or hitting some exception, etc.:
!threads
If you click the blue underlined link you can switch to that thread, then use the following to see the native stack:
k
or see the managed stack trace:
!clrstack
To check the objects on the stack, run the following:
!dso
To check the local variables of a specific frame (use the frame number in the “k” output):
!mdv [FrameNumber]
Object count: to get the statistics of objects in the managed heap:
!dumpheap -stat
If you want to get the live objects (the objects that cannot be garbage collected), add the
-live parameter. If you want
to get the dead objects, add the -dead parameter.
Find object by type name: firstly find the list of types with statistics by the type name (either the full name or a partial name):
!dumpheap -stat -type MyClassName
Then click the method table link, which is essentially:
!dumpheap /d -mt [MethodTableAddress]
You can click the address link to dump the object, or run:
!do [ObjectAddress]
A better way to browse the object properties is to use sosex:
!mdt [ObjectAddress]
To know why it's live, i.e. the GC root:
!gcroot [ObjectAddress]
or use sosex:
!mroot [ObjectAddress]
To check the current symbol path, use the menu or:
.sympath
To add a directory where PDB files (symbols) are located, use the menu or:
.sympath+ [Directory]
Find all the class names and properties with a particular string (use your own wildcard string):
!mx *MyClass*
List all loaded modules:
lm
To get the details about a module, click the link in the above output or:
lmvm [ModuleName]
Here you can see the file version, product version string, timestamp, etc. For files from many repos, you can see
the branch name and commit hash. If you are interested in the module info:
!lmi [ModuleName]
To show the disassembled IL code, firstly switch to a managed frame, then run mu (sosex):
!mu
Find unique stack traces: this goes through the stack traces of all threads, groups identical ones, and
shows how many times each stack has shown up (mex):
!us
Oftentimes you can see lock contention or slow transaction issues, etc.
Find all exceptions (mex):
!dae
Dump all async task objects:
!dumpasync
If you have to debug memory related issue, refer to my previous post.
Many debugging topics are not covered, for instance finalization, deadlock, locking, etc. If this quick guidance is
insufficient, please spend some time starting from Getting Started with Windows
Debugging and the book Advanced .NET
Debugging.
19 Dec 2020
In cloud computing, a prevailing design pattern is multiple loosely coupled
microservices working in synergy to build the app, and
RPC is used for inter-service communication. The platform itself
is no exception. If you are interested in how we (mainly the services I worked on) use RPC, keep reading.
External and Internal Interface
Some services are exposed to the public internet using a published API contract, for instance xRP (resource providers).
Usually the API is defined in a consistent and platform-neutral manner, such as REST with JSON payload. Typically the
underlying framework is some form of ASP.NET. In this note, customer-facing services are not discussed.
For internal services that are not exposed to external customers, we have a lot of freedom to choose what works best
for the context from a technical perspective. In theory, one can choose any protocol one feels appropriate. In
practice, because of conformity and familiarity, most of the time the design choice converges to the few options
discussed in this note.
Before starting further discussion, it is helpful to understand a little bit about service-to-service authentication, which
always narrows down the number of options facing us. In the past, when we chose the communication protocol, we looked at whether the
two services were within the same trust boundary. If an insecure
protocol is used for talking to a service outside of your trust boundary, the design will be shot down before anyone has
a chance to use it, in either the internal review or the compliance review with the security team. The trust boundary of services
can be the fabric tenant boundary at the deployment unit level, or within the same Service Fabric cluster. The most common
arrangement: within the trust boundary an unencrypted protocol may be used; outside of the trust boundary a secure protocol must be used.
The most common authentication is based on RBAC. No one has
persistent privileged access to the service; engineers request JIT access before conducting privileged operations, and a source
service has to request a security token in order to talk to a destination service. Foundational services typically use
claims-based identity associated with the
X.509 certificate provisioned with the service. For people who are familiar with
HTTP authentication: the authentication is
orthogonal to, and usually separated from, the data contract for the service communication. This means we need some way to
carry the OOB payload for the authentication headers.
Some services choose not to use RBAC for various reasons, for instance the service must be able to survive when all other
services are down, or to resolve a circular dependency in the buildout stage. In this case, certificate-based
authentication is used with stringent validation. Because the certificate exchange occurs at the transport level, it is
simpler to understand and more flexible to implement, although I personally don't like it because of the security implications.
WCF
WCF, or Windows Communication Foundation, is a framework for
implementing Service-Oriented Architecture on the .NET
platform. Based on SOAP, WCF supports interoperability with standard web services
built on non-Windows platforms as well. It is extremely flexible, powerful, and customizable, and the adoption barrier is
low for developers working on the .NET platform. Naturally, it has been the default option for internal RPC. As of today,
many services are still using it.
The common pattern is that unencrypted communication uses the NetTcp
binding; if cert-based authentication is
required, the HTTP binding is used; if RBAC
is needed, the federation HTTP
binding is used.
For years WCF supported the cloud well without much criticism. However, it is not without downsides;
in particular, people feel it offers so much flexibility and complexity that we often use it incorrectly. The fact is
most people follow existing code patterns and do not learn the technology deeply before using it. After
enough mistakes are made, the blame moves from the people to the technology itself: we need to make things easy to use,
otherwise it won't be sustainable. The following are the common problems at this point.
Timeout and retries
When using WCF, it is important to configure timeouts
correctly. Unfortunately, not everyone knows this, and the price is a live-site incident. Consider the following scenario:
- The client sends a request to the server and waits for the response. The receive timeout is one minute.
- The operation is time consuming. It is completed at the server side at 1.5 minutes.
- No response is received at the client side after 1 minute, so the client considers the request to have failed.
- Now the state at the client and server sides is inconsistent.
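As a sketch, the relevant knobs live on the binding configuration. The attribute names below are the standard WCF ones; the values are made up for illustration:

```xml
<netTcpBinding>
  <!-- sendTimeout bounds the whole request/reply on the client side;
       receiveTimeout governs how long an idle connection is kept -->
  <binding name="backendBinding"
           openTimeout="00:00:30"
           sendTimeout="00:02:00"
           receiveTimeout="00:10:00"
           maxReceivedMessageSize="1048576" />
</netTcpBinding>
```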
The issue must be considered in the implementation. Oftentimes, the solution is to handle the failures at the transport
layer with retries. Different kinds of back-off logic and give-up thresholds may be used, but usually retry logic is
required to deal with intermittent failures: for instance, catch the exception, and if it is a communication exception, tear down
the channel and establish a new one. In the testing or simulation environment this works well. In the real world, when a
customer sends a request to the front-end, several hops are needed to reach the backend responsible for the
processing, and each hop has its own retry logic. Sometimes uniform backoff is used at a certain hop to ensure
responsiveness as a local optimization. When unexpected downtime occurs, a cascading effect is caused: the failure is
propagated to the upper layers, multi-layer retry is triggered, and then we see an avalanche of requests. Now a small
availability issue becomes a performance problem, and it lasts much longer than necessary.
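The per-hop retry logic described above can be sketched as follows. Here call_service is a hypothetical stand-in for the real RPC (simulated to fail twice), and the sleep is commented out to keep the sketch fast:

```shell
attempts=0
call_service() {
  # Simulated RPC: fails on the first two attempts, succeeds on the third
  attempts=$((attempts+1))
  [ "$attempts" -ge 3 ]
}

delay=1
i=1
while [ $i -le 5 ]; do
  if call_service; then
    echo "succeeded on attempt $i"
    break
  fi
  # sleep $((delay + RANDOM % delay))   # exponential backoff with jitter
  delay=$((delay * 2))                  # double the backoff before the next retry
  i=$((i+1))
done
```

Note how several layers of services, each running a loop like this independently, multiply into the avalanche of requests described above when an outage hits the bottom layer.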
The problem is well known and has been dealt with. However, it never goes away completely.
Message size
For every WCF binding we must configure the message size and various quota parameters correctly. The default values don't work
in all cases. For transferring large data,
streaming can be used;
however, in reality, often only the buffered mode is an option. As the workload increases continuously, the quota is
exceeded occasionally. This has caused live-site incidents several times. Some libraries (e.g. the WCF utility in SF) simply
increase those parameters to the maximum, and that causes a different set of problems.
Load balancer friendly
In many cases, service-to-service communication goes through a virtualized IP which is handled by a load balancer.
Unsurprisingly, not many people understand the complication of an LB in the middle and how to tune WCF parameters to work
around it. Consequently, trouble
happens after the service goes online, and it becomes difficult to tune the parameters without making a breaking change.
Sync service contracts
This is more a coding issue than a WCF framework problem – service contracts are often defined as sync APIs, which is
what people feel more comfortable using. When the server receives a short burst of requests and the processing gets
stuck, the number of I/O completion port threads increases sharply, and oftentimes the server can no longer receive more
requests. To be fair, this is a configuration problem of the service;
uninformed engineers mistakenly treat it as a WCF issue.
Support on .NET Core
There is no supported way to host a WCF service in a .NET Core program; the replacement is ASP.NET Core
gRPC. Forward-looking projects rightfully move
away from WCF.
Performance
The general impression is that WCF is slow and its scalability is underwhelming. In some cases it is true. For instance, when
using WS federation HTTP, the SOAP XML serialization performance isn't satisfying, and the payload on the wire is relatively large
compared with JSON or protobuf; now add over 10 kB of authentication header (correct, it is that large) to every
request, and you won't expect great performance out of that. On the other hand, NetTcp can be very fast when authentication
isn't a factor – it is slower than gRPC but much faster than what control plane services demand. Much of the XML
serialization can be tuned to be fast. Unfortunately, few people know how to do it and leave most parameters at factory
defaults.
Easy mistakes in data contract
With too much power, it is easy to get hurt. I have seen people use various options and flags in unintended ways and get
surprised later. The latest one is a misconfiguration of the
data contract and its
data members. Human error it is, but people wish they didn't have to deal with this.
RPC inside transaction
Making WCF calls gives the inaccurate impression that the statement is no different from calling a regular method on another
object (maybe for novices), so it is casually used everywhere, including inside IMOS transactions. It works most of the
time, until a connection issue arises; then we see mysterious performance issues. Over time, people become experienced enough to steer
away from anti-patterns like this.
As we can see, some of the problems are caused by WCF, but many are incorrect usage patterns. However, the complexity is
indisputable, and the perception is imprinted in people's minds. We have to move forward.
By the way, I must point out that WCF use does not directly correlate with low availability or poor performance. For
instance, the SLA of a foundational control plane service hovers around four to five 9's most of the time, while it is
still using WCF as both server and client (i.e. communicating with other WCF services).
REST using ASP.NET
There is no doubt that ASP.NET is superior in many aspects. The performance, customizability, and supportability are
unparalleled. Many services moved to this framework before the current recommendation became mainstream. However, it
does have more boilerplate than WCF, and it is not as convenient in some aspects.
Some projects use custom solutions for highly specialized scenarios, for instance exchanging
Bond messages over TCP or HTTP connections, or even customized
serialization. This is hardly “RPC” and is painful to maintain. Over time this approach is being deprecated.
Protobuf over gRPC
As many .NET developers can see, gRPC has more or less become the “north star” as far as RPC is concerned. Once the green light
was given, prototyping and migration started. Initially it was Google gRPC; later, ASP.NET
Core gRPC became more popular because of its integration with ASP.NET,
customizability, and, to some extent, security. The journey isn't entirely smooth; for instance, people coming from a WCF
background have encountered several issues such as:
- Inheritance support in protobuf.
- Reference object serialization, cycling in large object graph.
- Managed type support, such as Guid, etc.
- Use certificate object from certificate store instead of PEM files.
- Tuning parameters to increase the max header size to handle oversized authentication headers (solved already).
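For example, the managed-type gap shows up immediately with Guid: protobuf has no such scalar, so a common workaround (this is a sketch, not our actual contract) is to carry it as a string and convert at the boundary:

```proto
syntax = "proto3";

message ResourceRef {
  // No native Guid scalar in protobuf; carry it as a string and
  // call Guid.Parse / Guid.ToString on the .NET side.
  string resource_id = 1;
}
```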
Usually people find a solution after some hard work, and sometimes a workaround or a new design paradigm is adopted. In a
few cases, the team backed off to ASP.NET instead. The overall trend of gRPC use is going up across the board. Personally I
think this will be beneficial for building more resilient and highly available services with better performance.
18 Dec 2020
In cloud computing, we build highly available applications on commodity hardware. The software SLA is typically higher
than that of the underlying hardware by an order of magnitude or more. This is achieved by distributed applications based on state machine
replication. If strong consistency is required, state persistence based on the
Paxos algorithm is often used. Depending on the requirements
on layering, latency, availability, failure model, and other factors, there are several solutions available.
Cosmos DB or Azure SQL Database
Most apps built on top of the core Azure platform can take a dependency on Cosmos
DB or Azure SQL
Database. Both are easy to use and integrate with existing
apps. This is often the most viable path with the least resistance, particularly Cosmos DB with its excellent availability,
scalability, and performance.
If you are looking for the lowest latency possible, the state is best persisted locally and cached inside the
process. In this case, remote persistence such as Cosmos DB may not be desirable. For services within the platform below
Cosmos DB, this approach may not be viable.
Replicated State Library
Although not many people have noticed it, the Replicated State Library is one of the
greatest contributions to OSS from Microsoft. It is a verified and well-tested Paxos implementation, which has been in
production for many years. RSL has been the core layer powering the Azure core control plane since the beginning. The
version released on GitHub is the one used in the product as of now. Personally I am not aware of another implementation
with greater scale, performance, and reliability (in terms of bugs) on the Windows platform. If you have to store 100 GBs of
data with strong consistency in a single ring, RSL is well capable of doing the job.
Note that it is for Windows platforms only; both native and managed code are supported. I guess it is possible to port it
to Linux; however, no one has looked into it and there is no plan to do so.
In-Memory Object Store
In-Memory Object Store (IMOS) is a proprietary managed-code layer on top of RSL that provides transaction semantics, strongly-typed
objects, object collections, relationships, and code generation from UML class diagrams. Although performance and
scale are sacrificed somewhat, it is widely used because of its convenience and productivity.
Service Fabric Reliable Collections
RSL and IMOS were often used by “monolithic” distributed applications before Service
Fabric was widely adopted. SF is a great
platform to build scalable and reliable microservices, in particular stateful services. Hosting RSL on SF isn't
impossible, but it is far from straightforward. For one, the primary election in RSL is totally independent of SF, so you'd
better ensure both are consistent via some trick. In addition, SF may move the replicas around at any time, and this must
be coordinated with RSL dynamic replica set reconfiguration. Therefore, the most common approach is to use SF reliable
collections in the stateful application as recommended. Over time, this approach will be the mainstream in the foundational layer.
Ring Master
If you need distributed synchronization and are not satisfied with ZooKeeper because
of its scale, or you want native SF integration, then you should consider adopting Ring
Master, which has been released to open source. Essentially, Ring Master provides a
superset of ZooKeeper semantics. It is the core component supporting goal state delivery in several
mission-critical foundational services in the platform. The persistence layer can be replaced; the released source code
supports SF reliable collections for production use and in-memory persistence for testing. If you want the absolute best performance and
scale, consider persisting to RSL.
If you have any question or comments, please leave a message in the discussion. Thanks!
17 Dec 2020
Before joining Microsoft, I worked on Linux almost all the time for years. Similar to most other projects, we used shell
scripts to automate the build process, and GNU automake / autoconf were the main toolset. Occasionally
CMake was used to handle some components where necessary. In the Windows team, I witnessed how to
build an enormous amount of code consistently and reliably using sophisticated in-house software. In this note, a few build
systems that I used in the past are discussed to share some learnings.
Why do we need a “build system”?
A simple hello world program or school project doesn't need a build system. Load it into your favorite IDE and run it. If
it works, congrats. If not, it is your responsibility to fix it. Obviously this logic won't fly for any project shared
by multiple people. The Windows SDK and Visual Studio don't really tell us how to deal with a large number of projects in an
automated and reliable manner.
NMake is the counterpart of
Makefile and is able to do the job to some extent. However, honestly I haven't seen anyone using it directly because of the
complexity at large scale. We need a layer on top of the SDK and VS toolset to automate the entire build process for
both developers and lab builds, and the process must be reliable and repeatable. For Windows, reproducibility is
critical. Imagine you have to fix a customer-reported issue on a version released a long time back; it would be
unthinkable if you could not produce the same set of binaries as the build machines did previously in order to debug. By
the way, all build systems are command line based; since no one will glare at their monitor for hours, no fancy UI is
needed.
Razzle and Source Depot
Razzle is the first production-quality build system I used. Essentially it is a collection of command line tools and
environment variables to run build.exe for regular dev builds and timebuild for lab builds. At the start of the day, a
command prompt is opened and razzle.cmd is invoked, which performs some validation of the host environment, sets up
environment variables, and presents a command prompt for conducting the subsequent work of the day.
In Razzle, everything is checked into the source repository. Here “everything” is literally everything, including
source code, compilers, SDKs, all external dependencies, libraries, and all binaries needed for the build process.
Outside of build servers, no one checks out everything on their dev machine, which could be near or at a TB. A working
enlistment is a partial checkout at the tens-of-GB level. Because of the outrageous requirement on scale, an in-house
source repository called Source Depot (rumor said it was based off Perforce
with needed improvements, though I'm not sure of the accuracy) is used, and a federation of SD servers supports the
Windows code base. On top of sd.exe, there is a batch script called sdx.cmd to coordinate common operations across
multiple SD servers. For instance, instead of using “sd sync”, we used to run “sdx sync” to pull down the latest
check-ins. Some years later, in order to modernize the dev environment, git replaced
SD, with which I have no hands-on experience.
Razzle deeply influenced other build systems down the line. Even now, people are used to typing “build” or even “bcz”, even though
the latter is not really meaningful in contemporary build systems. One of the great advantages of Razzle is its
reproducibility and total independence. Because everything is stored in SD, if you want to reproduce an old build, you
just check out the version at the required changeset, type “build”, and eventually you will get the precise build
required by the work, other than timestamps, etc. In practice, with a cleanly installed OS, run the enlistment script
on a network share, which in turn calls sd to download the source code (equivalent to “git clone”), and then you have a
fully working enlistment; nothing else is needed (assuming you are fine with editing C++ code with notepad).
Instead of Makefile or MSBuild project files,
dirs files are used for directory traversal, and
sources files are used to
build the individual projects. An imaginary sources file is like the following (illustration purposes only, with made-up file names):
TARGETNAME = Hello
TARGETTYPE = DYNLINK
UMTYPE = console
C_DEFINES = $(C_DEFINES) -DWIN32 -DMYPROJECT
LINKER_FLAGS = $(LINKER_FLAGS) -MAP
INCLUDES = $(INCLUDES);\
    ..\inc
SOURCES = \
    main.cpp \
    utils.cpp
TARGETLIBS = $(TARGETLIBS) \
    $(SDK_LIB_PATH)\kernel32.lib
Invocation of the underlying ntbuild carries out several build passes to run various tasks, such as preprocessing, midl,
compiling, linking, etc. There are also postbuild tasks to handle code signing, instrumentation, code analysis,
localization, etc. A publish/consume mechanism is used to handle dependencies among projects, so it is possible to
enlist a small subset of projects and build without missing dependencies.
Coming from the Linux world, I didn't find it too troublesome to use another set of command line tools, other than missing
cygwin and VIM. However, for people who loved Visual Studio and GUI tools, this seemed to be an unproductive
environment. Additionally, you cannot easily use Razzle for projects outside Windows.
CoreXT
After moving out of Windows, I came to know CoreXT in an enterprise software project. Initially a Razzle clone, it is
believed to be a community project maintained by passionate build engineers inside Microsoft (by the way, I have never
been a build engineer). It is widely used in Office, SQL, Azure, and many other organizations even today. Six years ago,
Azure projects were based on CoreXT and followed a similar approach as Windows on Razzle: everything stored in SD,
dirs/sources on top of ntbuild, timebuild to produce the nightly build, etc. The main difference was that each service had its
own enlistment project, just like a miniature of the Windows code base. Inter-service dependencies were handled by copying
files around. For instance, if project B had to use some libraries generated by project A, project A would export
those files, and project B would import them by adding them to SD. For projects based on managed code (most are),
msbuild instead of ntbuild was used for convenience.
At the time, the dev experience on CoreXT was not too bad. It inherited all the goodness of Razzle, but it was still a
bit heavyweight. Even if you only had tens of MB of source code, the build environment and external dependencies would
still be north of ten GB in size. Young engineers considered it a dinosaur environment, which was hard to argue against
when comparing it with open source toolsets. Support for the Visual Studio IDE was via csproj files (used by both the build and
the IDE) and sln files (used by the IDE only).
Five years ago, people started to modernize the dev environment. The first step was moving from SD to git. Without LFS,
it is impractical to store much data in git; 1 GB was considered an acceptable upper bound at the time. So we had
to forget about the practice of checking in everything and start reducing the repo size dramatically. But the Windows SDK
alone was already well over 1 GB, so how to handle the storage issue without sacrificing reproducibility? The solution was
to leverage NuGet. Essentially, besides the corext bootstrapper
(very small) and the source code, everything was wrapped into NuGet packages. This solution has lasted until today.
Most projects have their own Git repository. Under the root directory, init.cmd is the replacement for Razzle.cmd; it
invokes the CoreXT bootstrapper to set up the enlistment environment. As with Razzle, it is still a command prompt with
environment variables and command aliases.
.corext is similar to nuget.config: it contains the
list of NuGet feeds (on-premises network shares in the past, ADO nowadays) and the list of packages. All packages are
downloaded and extracted into the CoreXT cache directory. MSBuild project files are modified to use the toolset in the
cache directory, such as:
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="15.0" DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <Import Project="$(EnvironmentConfig)" />
  <!-- project items elided -->
  <Import Project="$(ExtendedTargetsPath)\Microsoft.CSharp.targets" />
</Project>
The trick here is that
EnvironmentConfig is an environment variable pointing to an MSBuild props file in the CoreXT cache; this
file bootstraps everything after that. With that, when the build alias is invoked, the MSBuild program is called, and
the compilers and build tools in the CoreXT cache are used instead of the ones installed on the host machine.
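To make this concrete, the props file that EnvironmentConfig points to might look roughly like the sketch below. This is a hypothetical illustration, not the actual CoreXT contract; the property names (NugetMachineInstallRoot, the package versions, ExtendedTargetsPath) are assumptions chosen to show how toolset paths get redirected into the package cache.

```xml
<!-- Hypothetical sketch of the props file behind $(EnvironmentConfig).
     Its job is to redirect every tool lookup into the CoreXT cache. -->
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <PropertyGroup>
    <!-- Root of the extracted NuGet packages (illustrative name) -->
    <CoreXTCacheRoot>$(NugetMachineInstallRoot)</CoreXTCacheRoot>
    <!-- Use the C# compiler from a cached package, not the machine-wide install -->
    <CscToolPath>$(CoreXTCacheRoot)\Microsoft.Net.Compilers.2.10.0\tools</CscToolPath>
    <!-- Targets shipped in a cached package; consumed by the Import shown above -->
    <ExtendedTargetsPath>$(CoreXTCacheRoot)\Microsoft.Build.Targets.1.0.0\targets</ExtendedTargetsPath>
  </PropertyGroup>
</Project>
```

Because every path is anchored at the cache root, the same sources plus the same packages reproduce the same build on any machine.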
In theory, the entire process relies on nothing but the files in the CoreXT cache. One does not need to install Visual
Studio or any developer tools on their computer. In practice, occasionally some packages reference files outside of the
cache and assume certain software to be installed. However, that is the exception rather than the norm.
For developers, we use Visual Studio or VS Code to browse code, write code, build, and debug. A tool is provided to
generate a solution file from a set of MSBuild project files (csproj and dirs.proj); the solution is then loaded in the
IDE. Dependencies among projects are handled by NuGet packages. During an official build, we can choose whether or not
to publish packages to feeds on ADO. Other projects simply add a
<package .../> entry to the corext.config file should they
want to consume any packages.
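For instance, a config consuming a package produced by another repo might look like the following. The exact schema is internal, so treat this as a sketch; the feed URL and package names are made up to illustrate the <package .../> entries just mentioned.

```xml
<!-- Hypothetical corext.config: feeds plus the flat list of consumed packages -->
<corext>
  <repositories>
    <!-- Illustrative ADO feed URL -->
    <repo name="Official" uri="https://pkgs.dev.azure.com/contoso/_packaging/Official/nuget/v3/index.json" />
  </repositories>
  <packages>
    <!-- Toolset packages and cross-project dependencies side by side -->
    <package id="Microsoft.Net.Compilers" version="2.10.0" />
    <package id="ProjectA.Libraries" version="1.3.0" />
  </packages>
</corext>
```

On init, the bootstrapper downloads and extracts each listed package into the cache, so project B picks up project A's libraries without any file copying.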
So far most projects and most people in my org are still using CoreXT in this form. It is used by engineers during
daily development, by build machines in the lab, by distributed build in the cloud, and everywhere we want it to be.
Other than compiling source code and building product binaries, it also carries out various other tasks, including but
not limited to static code analysis, policy check, VHD generation, NuGet package creation, making app package suitable
for deployment, publishing symbols, etc.
CBT and Retail MSBuild
Again, CoreXT is considered a modern-day relic. People use it because they have to. In particular, it is highly
desirable to have seamless integration with Visual Studio and the ability to consume the latest technology from .NET
Core. Before MSBuild became more capable, Common Build Toolset (CBT) was developed
as a GitHub project to fulfill this requirement. It is a lightweight framework that provides a consistent "git clone +
msbuild" experience to codebases using it. An additional advantage is that it is open source: for internal projects
that need to sync to GitHub periodically, no duplicate build systems (one for internal builds, one for public) are
needed.
Using CBT is extremely simple from a dev perspective. No internal machinery whatsoever. Just clone the repo and open it
in Visual Studio. Adding a new project is also straightforward; there is no need to perform brain surgery on csproj
files as in CoreXT. The downside is obvious: you must install essential build tools such as VS. Reproducibility isn't
strictly guaranteed as far as I can tell, since the VS developer command prompt is used. For most Azure teams this may
not be a concern; things move so fast that I haven't met anyone who complains they cannot reproduce a build from one
year ago to service an old version of their product.
CBT is somewhat short-lived. For some people, by the time they come to know the migration path from CoreXT to CBT, it
is already deprecated. The latest shiny framework on the street is Retail MSBuild. :-) It works similarly to CBT but is
even more lightweight. With this framework, engineering teams are able to use Visual Studio and the retail Microsoft
toolset in their most natural way. In CoreXT, people have to spend a lot of time on any new technology because the
framework intentionally works differently; personally I've spent many hours making .NET Core work in my team, and some
other components might be worse. With Retail MSBuild, everything just works with plain, simple SDK-style project files
with PackageReference. Precious resources can be spent on real work; we are not rewarded for reinventing the wheel (and
possibly a worse one) anyway.
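For contrast with the CoreXT project file shown earlier, an SDK-style project is just plain NuGet; the package name and version below are arbitrary examples.

```xml
<!-- Minimal SDK-style project: no bootstrapper, no environment imports -->
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net5.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <!-- Restored from the configured feeds by `dotnet restore` or Visual Studio -->
    <PackageReference Include="Newtonsoft.Json" Version="12.0.3" />
  </ItemGroup>
</Project>
```

A plain `dotnet build` (or opening the project in Visual Studio) is all it takes; the retail toolset handles restore, compile, and pack with no custom framework in between.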
Other than the most popular ones mentioned above, some teams write their own framework to meet their unique
requirements. For instance, several years ago a team needed a high-performance build integrating with VSTS build
definitions with minimal overhead, so a thin wrapper was built on top of a collection of project files and batch
scripts. In RingMaster I had to write my own build framework because the internal
proprietary build system could not be released due to the approval process, the project would not build without one
similar to CoreXT, and no other alternative was available (CBT did not exist at the time). In the end, the projects
were migrated to SDK-style to make this work easier.
In the future, I look forward to Retail MSBuild being adopted more widely and internal build systems going away
eventually. I love open source from the bottom of my heart. :-)