Blackboard: Perspectives and thoughts from Zhenhua Yao

Containerized .NET 5.0 App Running on Raspberry Pi 3B

The release of .NET 5.0 is exciting news for us. No more arguing about whether to migrate to .NET Core or upgrade to a newer version of the .NET Framework: .NET 5, and later 6 (the LTS version), will be the north star for control plane services, with better performance and a faster pace of innovation. Additionally, it is a relief to know that multiple platforms (not just different editions of Windows) can be unified with one set of source code.

Today I found a cute little Raspberry Pi 3B lying unused. It was a toy for my son, but he has a real computer now. I wondered whether it could run .NET 5 apps, so I decided to give it a try. The process turned out to be quite straightforward, although I don’t think it is particularly useful to do so. Anyway, here is what I did.

Upgrade Raspberry Pi OS

Although no endpoint is opened, it is still a good practice to keep the OS up to date:

sudo apt update
sudo apt upgrade
sudo apt full-upgrade
sudo apt autoremove

Setup Docker

My work doesn’t actually use Docker, but I was curious whether it would run in such a resource-constrained environment. Firstly, run the script from the official website:

curl -sSL https://get.docker.com | sh

In order to avoid prefixing almost every command with sudo, I added the default user account to the docker user group:

sudo usermod -aG docker pi

Then I ran a smoke test:

docker version
docker info
docker run hello-world

It was encouraging to see everything just worked.

Install .NET SDK 5.0.101

Initially I thought the package manager might take care of this, but I had to do it manually as follows:

wget https://download.visualstudio.microsoft.com/download/pr/567a64a8-810b-4c3f-85e3-bc9f9e06311b/02664afe4f3992a4d558ed066d906745/dotnet-sdk-5.0.101-linux-arm.tar.gz
sudo mkdir /var/dotnet
sudo tar zxvf dotnet-sdk-5.0.101-linux-arm.tar.gz -C /var/dotnet

Then I created a sample console app to confirm it indeed worked. Lastly, I updated $HOME/.bashrc with the required environment variable changes.
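
The additions amount to a couple of lines (a sketch assuming the /var/dotnet location used above; DOTNET_ROOT plus an updated PATH is what a manual tarball install normally needs):

export DOTNET_ROOT=/var/dotnet
export PATH=$PATH:/var/dotnet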

Visual Studio Code

VI is preinstalled on Raspberry Pi OS, just like on every other Linux distribution. However, VS Code is so popular that I had to give it a try. After downloading the Debian package from the VS Code download site, I installed it with the following:

sudo apt install ./code_1.52.1-1608136275_armhf.deb

Now the “development” environment is ready. Understandably, nothing is as responsive as on my desktop, but it is not unbearably slow. In fact, writing simple code was just fine.

Since .NET already provides a sample Docker image, why not give it a try:

pi@raspberrypi:~ $ docker run --rm mcr.microsoft.com/dotnet/samples
Unable to find image 'mcr.microsoft.com/dotnet/samples:latest' locally
latest: Pulling from dotnet/samples
c06905228d4f: Pull complete 
6938b34386db: Pull complete 
46700bb56218: Pull complete 
7cb1c911c6f7: Pull complete 
a42bcb20c9b3: Pull complete 
08b374690670: Pull complete 
Digest: sha256:9e90c17b3bdccd6a089b92d36dd4164a201b64a5bf2ba8f58c45faa68bc538d6
Status: Downloaded newer image for mcr.microsoft.com/dotnet/samples:latest

      Hello from .NET!
      __________________
                        \
                        \
                            ....
                            ....'
                            ....
                          ..........
                      .............'..'..
                  ................'..'.....
                .......'..........'..'..'....
                ........'..........'..'..'.....
              .'....'..'..........'..'.......'.
              .'..................'...   ......
              .  ......'.........         .....
              .                           ......
              ..    .            ..        ......
            ....       .                 .......
            ......  .......          ............
              ................  ......................
              ........................'................
            ......................'..'......    .......
          .........................'..'.....       .......
      ........    ..'.............'..'....      ..........
    ..'..'...      ...............'.......      ..........
    ...'......     ...... ..........  ......         .......
  ...........   .......              ........        ......
  .......        '...'.'.              '.'.'.'         ....
  .......       .....'..               ..'.....
    ..       ..........               ..'........
            ............               ..............
          .............               '..............
          ...........'..              .'.'............
        ...............              .'.'.............
        .............'..               ..'..'...........
        ...............                 .'..............
        .........                        ..............
          .....
  
Environment:
.NET 5.0.1-servicing.20575.16
Linux 4.19.66-v7+ #1253 SMP Thu Aug 15 11:49:46 BST 2019

The following is a screenshot of VS Code:

[Screenshot: VS Code running on Raspberry Pi OS]

Create a Docker Image

It was not tricky to create a Docker image using the official .NET 5.0 runtime image. The following command pulls the image:

docker pull mcr.microsoft.com/dotnet/runtime:5.0

After copying the published directory into the image, it ran smoothly. However, the image was quite large: the one above was 153 MB. After some trial and error, I found a way to make it smaller.

Firstly, change the csproj file to enable self-contained deployment (SCD) with trimming, and also turn off globalization since I almost never need to deal with it in the control plane:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net5.0</TargetFramework>
    <InvariantGlobalization>true</InvariantGlobalization>
    <PublishTrimmed>true</PublishTrimmed>
    <TrimMode>link</TrimMode>
    <TrimmerRemoveSymbols>true</TrimmerRemoveSymbols>
  </PropertyGroup>

</Project>

Then an SCD package was published to the out directory:

dotnet publish -c release -r ubuntu.18.04-arm --self-contained -o out 

Note that it specifically targets Ubuntu 18.04 LTS. The package size seems reasonable given that the runtime is included:

pi@raspberrypi:~/tmp $ du -h out
16M	out

A Dockerfile is written to build an image on top of the Ubuntu 18.04 minimal image:

FROM ubuntu:18.04
RUN mkdir /app
WORKDIR /app
COPY out /app
ENTRYPOINT ["./hello"]

Build the image:

docker build --pull -t dotnetapp.ubuntu -f Dockerfile.ubuntu .

Give the image a try and compare with execution outside of the container:

pi@raspberrypi:~/tmp $ docker run --rm dotnetapp.ubuntu
Hello World!
Duration to schedule 100000 async tasks: 00:00:00.2684313

pi@raspberrypi:~/tmp $ out/tmp
Hello World!
Duration to schedule 100000 async tasks: 00:00:00.2351646
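
The test app is a trivial console program; a sketch that reproduces the same output looks like the following (the exact implementation is not important):

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        Console.WriteLine("Hello World!");

        // Time how long it takes to schedule a batch of trivial async tasks.
        const int count = 100000;
        var sw = Stopwatch.StartNew();
        var tasks = Enumerable.Range(0, count).Select(_ => Task.Run(() => { })).ToArray();
        await Task.WhenAll(tasks);
        sw.Stop();
        Console.WriteLine($"Duration to schedule {count} async tasks: {sw.Elapsed}");
    }
}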

Besides Ubuntu 18.04, I tried other images as well. Here is what I found:

  • Debian 10 Slim works similarly to Ubuntu 18.04; the image is about 3 MB larger.
  • The default Alpine image doesn’t have glibc, which is required by the bootstrapper. Packaging works, but the image doesn’t run even when the runtime identifier is set specifically to Alpine.
  • The Google image gcr.io/distroless/dotnet works, but the base image is already 134 MB since it ships the entire runtime.
  • The base image gcr.io/distroless/base has glibc and is only 13 MB (Ubuntu is 45.8 MB). However, I didn’t figure out how to fix the image build problem; the missing /bin/sh seems to be the culprit.
  • The busybox base image with glibc is only 2.68 MB. It seems promising, but it lacks the required arm-linux-gnueabihf libraries (under both /lib and /usr/lib). I guess this could be resolved by copying some files over, but in real work that would be unmaintainable.

By the way, apart from new apps, many things have not changed much on Linux; for instance, font rendering is still miserable and requires heavy tweaking. Practically speaking, WSL seems more productive from a development perspective.

Layman's Quick Guide on Crashdump Debugging

In distributed computing, we rely on traces and metrics to understand the runtime behavior of programs. However, in some cases we still need help from debuggers for live-site issues. For instance, if a service crashes all of a sudden and no trace offers any clue, we need to load a crash dump into the debugger. Or, if some exception is raised but the traces are insufficient to understand the nature of the problem, we may need to capture the full state of the process.

In the old days, at least as far back as Windows 3.1, there was Dr. Watson to collect error information after a process crash, mainly the crash dump file. Every time I saw it, something bad had happened. Nowadays it lives on under the new name Windows Error Reporting, or WER. Inside the platform, there is still a “watson” service to collect all the crash dumps created by the platform code, process them, assign them to the right owner, and send alerts as configured. Sometimes during a live-site investigation, we can also request dump file collection using “Node Diagnostics”; the file will then be taken over by Watson (assuming your hand isn’t fast enough to move the file somewhere else).

Like it or not, to look at the dump file you have to use windbg. You can choose cdb or windbgx, but they are not really different. If you are too busy to learn how windbg works, particularly managed-code debugging using SOS, this quick guide may save you some time.

Debugger extensions

Download sosex from Steve’s TechSpot and save the DLL in the extension directory.

Download mex from the Microsoft Download Center and save the DLL in the extension directory.

To find the extension directory, locate the directory where windbg.exe resides (Task Manager can tell you), then go to the winext subdirectory.

Basic commands

Exit windbg: enter qd or simply Alt-F4.

Display process environment block

!peb

You will see where the executable image is, plus all the environment variables, which contain the machine name, processor ID, processor count, etc.

CPU usage

To check which threads have consumed how much CPU time:

!runaway

To check CPU utilization, thread pool worker thread and completion port thread usage:

!threadpool

List of threads: check how many threads there are, whether any threads have terminated or hit an exception, etc.

!threads

If you click the blue underlined link you can switch to that thread, then use the following to see the native stack trace:

k

or see the managed stack trace

!clrstack

To check the objects on the stack, run the following:

!dso

To check the local variables of a specific frame (use the frame number in “k” output):

!mdv [FrameNumber]

Object count: to get the statistics of objects in the managed heap.

!dumpheap -stat

If you want to get the live objects (objects that cannot be garbage collected), add the -live parameter. If you want the dead objects, add the -dead parameter.

Find objects by type name: first find the list of types with statistics by type name (either full or partial):

!dumpheap -stat -type MyClassName

Then click the method table link, which essentially runs:

!dumpheap /d -mt [MethodTableAddress]

You can click the address link to dump the object, or

!do [ObjectAddress]

A better way to browse the object properties is to use sosex:

!sosex.mdt [ObjectAddress]

To know why it’s live, or the GC root:

!gcroot [ObjectAddress]

or use sosex

!sosex.mroot [ObjectAddress]

Symbols

Check the current symbol path: you can use the menu or

.sympath

To add a directory where PDB files (symbols) are located, use the menu or

.sympath+ \\mynetworkshare\directory\symbols

Find all the class names and properties with a particular string (use your own wildcard string):

!sosex.mx *NetworkManager

List of all modules loaded:

lm

To get the details of a module, click the link in the above output or:

lmDvm MyNameSpace_MyModule

Here you can see the file version, product version string, timestamp, etc. For files from many repos, you can also see the branch name and commit hash. If you are interested in the module info:

!lmi MyNameSpace_MyModule

To show disassembled IL code, first switch to a managed frame, then run mu:

!sosex.mframe 70
!sosex.mu

Advanced

Find unique stack traces: this goes through the stack traces of all threads, groups identical ones, and shows how many times each stack appears:

!mex.us

Oftentimes you can see lock contention, slow transaction issues, etc.

Find all exceptions:

!mex.dae

Dump all async task objects:

!mex.tasks

If you have to debug a memory-related issue, refer to my previous post.

Further reading

Many debugging topics are not covered here, for instance finalization, deadlocks, locking, etc. If this quick guide is insufficient, please spend some time starting from Getting Started With Windows Debugging or the book Advanced .NET Debugging.

Internal RPC at Cloud Scale

In cloud computing, a prevailing design pattern is multiple loosely coupled microservices working in synergy to build the app, with RPC used for inter-service communication. The platform itself is no exception. If you are interested in how we (mainly the services I worked on) use RPC, keep reading.

External and Internal Interface

Some services are exposed to the public internet with published API contracts, for instance xRP (resource providers). Usually the API is defined in a consistent and platform-neutral manner, such as REST with JSON payloads. Typically the underlying framework is some form of ASP.NET. Customer-facing services are not discussed in this note.

For internal services that are not exposed to external customers, we have a lot of freedom to choose what works best for the context from a technical perspective. In theory, one can choose any protocol one feels is appropriate. In practice, because of conformity and familiarity, the design choice converges to a few options most of the time, as discussed in this note.

Authentication

Before going further, it is helpful to understand a little bit about service-to-service authentication, which always narrows down the options facing us. In the past, when we chose the communication protocol, we looked at whether the two services were within the same trust boundary. If an insecure protocol is used to talk to a service outside your trust boundary, the design will be shot down in internal review or compliance review with the security team before anyone has a chance to use it. The trust boundary of services can be the fabric tenant boundary at the deployment unit level, or the same Service Fabric cluster. The most common arrangement: within the trust boundary an unencrypted protocol may be used; outside the trust boundary a secure protocol must be used.

The most common authentication is based on RBAC. No one has persistent privileged access to the service; engineers request JIT access before conducting privileged operations, and a source service has to request a security token in order to talk to a destination service. Foundational services typically use claims-based identity associated with the X.509 certificate provisioned with the service. For people familiar with HTTP authentication: the authentication is orthogonal to, and usually separated from, the data contract of the service communication. This means we need some way to carry the out-of-band payload for the authentication headers.

Some services choose not to use RBAC for various reasons, for instance because they must survive when all other services are down, or to resolve a circular dependency during the buildout stage. In that case, certificate-based authentication is used with stringent validation. Because the certificate exchange occurs at the transport level, it is simpler to understand and more flexible to implement, although I personally don’t like it because of the security implications.

WCF

WCF, or Windows Communication Foundation, is a framework for implementing service-oriented architecture on the .NET platform. Being based on SOAP, WCF also supports interoperability with standard web services built on non-Windows platforms. It is extremely flexible, powerful, and customizable, and the adoption barrier is low for developers working on the .NET platform. Naturally, it has been the default option for internal RPC. As of today, many services are still using it.

The common pattern: unencrypted communication uses the NetTcp binding; if cert-based authentication is required, the HTTP binding is used; if RBAC is needed, the federation HTTP binding is used.

For years WCF supported the cloud well without much criticism. However, it is not without downsides; in particular, people feel it offers so much flexibility and complexity that we often use it incorrectly. The truth is that most people follow existing code patterns and do not learn the technology at a deep level before using it. After enough mistakes are made, the blame moves from the people to the technology itself: we need to make things easy to use, otherwise it won’t be sustainable. The following are the common problems at this point.

Timeout and retries

When using WCF, it is important to configure the timeout values correctly. Unfortunately, not everyone knows this, and the price is a live-site incident. Consider the following scenario:

  1. The client sends a request to the server, then waits for the response. The receive timeout is one minute.
  2. The operation is time-consuming; it completes on the server side after 1.5 minutes.
  3. No response arrives at the client within one minute, so the client considers the request to have failed.
  4. Now the state on the client and server sides is inconsistent.

This issue must be considered in the implementation. Oftentimes the solution is to handle the failures at the transport layer with retries. Different kinds of back-off logic and give-up thresholds may be used, but usually retry logic is required to deal with intermittent failures: for instance, catch the exception, and if it is a communication exception, tear down the channel and establish a new one. In a testing or simulation environment this works well. In the real world, when a customer sends a request to the front end, several hops are needed to reach the backend responsible for the processing, and each hop has its own retry logic. Sometimes uniform backoff is used at a certain hop to ensure responsiveness as a local optimization. When unexpected downtime occurs, a cascading effect follows: the failure propagates to the upper layers, multi-layer retries are triggered, and we see an avalanche of requests. Now a small availability issue becomes a performance problem that lasts much longer than necessary.
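
To make this concrete, here is a minimal client-side sketch; the contract, endpoint address, timeout values, and back-off policy are all made up for illustration, not recommendations:

using System;
using System.ServiceModel;
using System.Threading.Tasks;

// Hypothetical contract and endpoint, for illustration only.
[ServiceContract]
public interface IMyService
{
    [OperationContract]
    string DoWork(string request);
}

public static class ResilientClient
{
    // Keep client timeouts aligned with the server-side operation budget, and recycle
    // the channel on communication failures with a bounded, backed-off retry loop.
    public static async Task<string> CallAsync(string request)
    {
        var binding = new NetTcpBinding(SecurityMode.None)
        {
            SendTimeout    = TimeSpan.FromMinutes(2),   // how long to wait for the reply
            ReceiveTimeout = TimeSpan.FromMinutes(10),  // how long the connection may sit idle
            OpenTimeout    = TimeSpan.FromSeconds(30),
            CloseTimeout   = TimeSpan.FromSeconds(30),
        };
        var factory = new ChannelFactory<IMyService>(
            binding, new EndpointAddress("net.tcp://backend:9000/svc"));

        const int maxAttempts = 3;
        IMyService channel = factory.CreateChannel();
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return channel.DoWork(request);
            }
            catch (CommunicationException) when (attempt < maxAttempts)
            {
                ((ICommunicationObject)channel).Abort();              // tear down the faulted channel
                await Task.Delay(TimeSpan.FromSeconds(attempt * 2));  // simple back-off before retrying
                channel = factory.CreateChannel();                    // establish a new one
            }
        }
    }
}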

The problem is well known and has been dealt with. However, it never goes away completely.

Message size

For every WCF binding, we must configure the message size and various other parameters correctly. The default values don’t work in all cases. For transferring large data, streaming can be used; in reality, however, often only buffered mode is an option. As the workload increases continuously, the quota is occasionally exceeded, and this has caused live-site incidents several times. Some libraries (e.g. the WCF utility in SF) simply increase those parameters to the maximum, and that has caused a different set of problems.
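
For illustration, bounding the quotas explicitly on a NetTcp binding looks roughly like this; the numbers are placeholders, and the right values depend on the actual payload:

using System.ServiceModel;

// Sketch only: bound the quotas to the expected payload instead of int.MaxValue.
var binding = new NetTcpBinding(SecurityMode.None)
{
    TransferMode           = TransferMode.Buffered,
    MaxReceivedMessageSize = 4 * 1024 * 1024,   // 4 MB per message
    MaxBufferSize          = 4 * 1024 * 1024,   // must match MaxReceivedMessageSize in buffered mode
};
binding.ReaderQuotas.MaxArrayLength         = 1 * 1024 * 1024;
binding.ReaderQuotas.MaxStringContentLength = 1 * 1024 * 1024;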

Load balancer friendly

In many cases, service-to-service communication goes through a virtualized IP handled by a load balancer. Unsurprisingly, not many people understand the complications of having an LB in the middle and how to tune WCF parameters to work around them. Consequently, MessageSecurityException shows up after the service goes online, and it becomes difficult to tune the parameters without making a breaking change.

Threading

This is more of a coding issue than a WCF framework problem: service contracts are often defined as synchronous APIs, which is what people feel more comfortable with. When the server receives a short burst of requests and the processing gets stuck, the number of I/O completion port threads increases sharply, and often the server can no longer receive more requests. To be fair, this is a service throttling configuration problem, but uninformed engineers mistakenly treat it as a WCF issue.
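
WCF has supported task-based asynchronous operations for a long time; defining the contract along these lines (a hypothetical service) at least gives the implementation a chance not to tie up a thread per in-flight request:

using System.ServiceModel;
using System.Threading.Tasks;

// Hypothetical contract. A task-based operation lets an asynchronous implementation
// return the thread to the pool while the work is in flight, instead of blocking a
// worker or I/O completion port thread for the duration of the call.
[ServiceContract]
public interface IOrderService
{
    [OperationContract]
    Task<string> SubmitOrderAsync(string orderId);
}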

Support on .NET Core

There is no supported way to host a WCF service in a .NET Core program, and the replacement is ASP.NET Core gRPC. Forward-looking projects rightfully move away from WCF.

Performance (perceived)

The general impression is that WCF is slow and its scalability is underwhelming. In some cases this is true. For instance, when using the WS-Federation HTTP binding, SOAP XML serialization performance isn’t satisfying and the payload on the wire is relatively large compared with JSON or protobuf; now add over 10 kB of authentication headers (yes, they are that large) to every request, and you can’t expect great performance out of that. On the other hand, NetTcp can be very fast when authentication isn’t a factor: it is slower than gRPC but much faster than what control plane services demand. Much of the XML serialization can be tuned to be fast. Unfortunately, few people know how to do it and leave most parameters at the factory defaults.

Easy mistakes in data contract

With too much power, it is easy to get hurt. I have seen people use various options and flags in unintended ways and get surprised later. The latest example is a misconfiguration of IsReference on data contracts and IsRequired on data members. Human error it is, but people wish they didn’t have to deal with this at all.
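
For context, these are the two knobs in question, shown on a hypothetical contract; both affect wire compatibility, so neither should be flipped casually:

using System.Runtime.Serialization;

// Hypothetical contract showing the two options mentioned above.
// IsReference preserves object identity and cycles but changes the wire format,
// so both sides must agree on it. IsRequired makes the member mandatory, which is
// a breaking change for senders that do not populate it.
[DataContract(IsReference = true)]
public class NetworkNode
{
    [DataMember(IsRequired = true)]
    public string Name { get; set; }

    [DataMember]   // optional member: may be absent in payloads from older clients
    public NetworkNode Parent { get; set; }
}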

RPC inside transaction

Making a WCF call gives the inaccurate impression that the statement is no different from calling a regular method on another object (maybe only to novices), so it is casually used everywhere, including inside IMOS transactions. It works most of the time, until a connection issue arises and we see mysterious performance problems. Over time, people learn from experience to steer away from anti-patterns like this.

As we can see, some of the problems are caused by WCF, but many are incorrect usage patterns. However, the complexity is indisputable, and the perception is imprinted in people’s minds. We have to move forward.

By the way, I must point out that using WCF does not directly correlate with low availability or poor performance. For instance, the SLA of one foundational control plane service hovers around four to five 9’s most of the time, yet it still uses WCF as both server and client (i.e. communicating with other WCF services).

REST using ASP.NET

There is no doubt that ASP.NET is superior in many aspects; the performance, customizability, and supportability are unparalleled. Many services moved to this framework before the current recommendation became mainstream. However, it does have more boilerplate than WCF and is not as convenient in some aspects.

Message exchange

Some projects use custom solutions for highly specialized scenarios, for instance exchanging Bond messages over a TCP or HTTP connection, or even customizing the serialization. This is hardly “RPC” and is painful to maintain. Over time this approach is being deprecated.

Protobuf over gRPC

As many .NET developers can see, gRPC has more or less become the “north star” as far as RPC is concerned. Once the green light was given, prototyping and migration started. Initially it was Google gRPC; later ASP.NET Core gRPC became more popular because of its integration with ASP.NET, customizability, and, to some extent, security. The journey isn’t entirely smooth; for instance, people coming from a WCF background have encountered several issues such as:

  • Inheritance support in protobuf.
  • Reference object serialization and cycles in large object graphs.
  • Support for managed types such as Guid.
  • Using a certificate object from the certificate store instead of PEM files (see the sketch after this list).
  • Tuning parameters to increase the maximum header size in order to handle the oversized authentication header (solved already).
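
The certificate store item, for example, ends up looking roughly like the following in ASP.NET Core; this is a sketch, with the port and thumbprint as placeholders and the gRPC service registration omitted:

using System.Security.Cryptography.X509Certificates;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Hosting;

// Sketch only: bind Kestrel to a certificate pulled from the Windows certificate
// store (instead of PEM files) so ASP.NET Core gRPC can serve HTTP/2 over TLS.
public static class GrpcHostSketch
{
    public static IHost Build(string[] args, string thumbprint) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(webBuilder =>
            {
                webBuilder.ConfigureKestrel(kestrel =>
                {
                    kestrel.ListenAnyIP(5001, listen =>
                    {
                        using var store = new X509Store(StoreName.My, StoreLocation.LocalMachine);
                        store.Open(OpenFlags.ReadOnly);
                        var cert = store.Certificates
                            .Find(X509FindType.FindByThumbprint, thumbprint, validOnly: true)[0];
                        listen.UseHttps(cert);   // HTTP/2 + TLS endpoint for gRPC
                    });
                });
                // webBuilder.UseStartup<Startup>();  // Startup calls AddGrpc()/MapGrpcService<T>()
            })
            .Build();
}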

Usually people find a solution after some hard work, sometimes a workaround, and sometimes by adopting a new design paradigm. In a few cases, the team fell back to ASP.NET instead. The overall trend of gRPC usage is going up across the board. Personally, I think this will be beneficial for building more resilient and highly available services with better performance.

Persistence in Stateful Applications

In cloud computing, we build highly available applications on commodity hardware. The software SLA is typically higher than that of the underlying hardware by an order of magnitude or more. This is achieved by distributed applications based on state machine replication. If strong consistency is required, state persistence based on the Paxos algorithm is often used. Depending on the requirements around layering, latency, availability, failure model, and other factors, there are several solutions available.

Cosmos DB or Azure SQL Database

Most apps built on top of the core Azure platform can take a dependency on Cosmos DB or Azure SQL Database. Both are easy to use and integrate with existing apps. This is often the most viable path with the least resistance, particularly Cosmos DB with its excellent availability, scalability, and performance.
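
As a rough illustration of how little code the managed option takes, a state write with the Cosmos DB SDK is about this much (the database, container, and document shape are made up, and the container is assumed to be partitioned on /id):

using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Minimal sketch using the Microsoft.Azure.Cosmos SDK.
public static class StateStore
{
    public static async Task SaveAsync(string connectionString)
    {
        using var client = new CosmosClient(connectionString);
        Container container = client.GetContainer("controlplane", "state");
        await container.UpsertItemAsync(
            new { id = "node-42", status = "Healthy" },
            new PartitionKey("node-42"));
    }
}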

If you are looking for the lowest latency possible, the state is better persisted locally and cached inside the process. In this case, remote persistence such as Cosmos DB may not be desirable. For services within the platform that sit below Cosmos DB, this approach may not be viable either.

RSL

Although not many people have noticed it, the Replicated State Library is one of the greatest contributions to OSS from Microsoft. It is a verified and well-tested Paxos implementation which has been in production for many years. RSL has been the core layer powering the Azure core control plane since the beginning. The version released on GitHub is the one used in the product as of now. Personally, I am not aware of another implementation with greater scale, performance, and reliability (in terms of bugs) on the Windows platform. If you have to store 100 GBs of data with strong consistency in a single ring, RSL is well capable of doing the job.

Note that it is for Windows platforms only; both native and managed code are supported. I guess it is possible to port it to Linux, but no one has looked into it and there is no plan to do so.

In-Memory Object Store (IMOS) is a proprietary managed-code layer on top of RSL that provides transaction semantics, strongly typed objects, object collections, relationships, and code generation from UML class diagrams. Although performance and scale are sacrificed somewhat, it is widely used because of its convenience and productivity.

Service Fabric Reliable Collections

RSL and IMOS were often used by “monolithic” distributed applications before Service Fabric was widely adopted. SF is a great platform for building scalable and reliable microservices, in particular stateful services. Hosting RSL on SF isn’t impossible, but it is far from straightforward. For one thing, the primary election in RSL is totally independent of SF, so you’d better ensure both are consistent via some trick. In addition, SF may move the replicas around at any time, and this must be coordinated with RSL dynamic replica set reconfiguration. Therefore, the most common approach is to use SF reliable collections in the stateful application, as recommended. Over time, this approach will become the mainstream in the foundational layer.
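
For reference, the recommended pattern looks roughly like this inside a stateful service; the service and collection names are made up, and error handling is omitted:

using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data.Collections;
using Microsoft.ServiceFabric.Services.Runtime;

// Sketch: replicated state lives in a reliable collection and is mutated under a
// transaction, so the platform handles replication, quorum commit, and failover
// instead of a hand-rolled RSL integration.
public class CounterService : StatefulService
{
    public CounterService(System.Fabric.StatefulServiceContext context) : base(context) { }

    public async Task<long> IncrementAsync(string key)
    {
        var counters = await StateManager.GetOrAddAsync<IReliableDictionary<string, long>>("counters");
        using (var tx = StateManager.CreateTransaction())
        {
            long value = await counters.AddOrUpdateAsync(tx, key, 1, (k, v) => v + 1);
            await tx.CommitAsync();   // committed across the replica set
            return value;
        }
    }
}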

Ring Master

If you need distributed synchronization and are not satisfied with ZooKeeper because of its scale, or you want native SF integration, then you should consider adopting Ring Master, which has been released as open source. Essentially, Ring Master provides a superset of the ZooKeeper semantics. It is the core component supporting goal state delivery in several mission-critical foundational services in the platform. The persistence layer is replaceable: the released source code supports SF reliable collections for production use and in-memory persistence for testing. If you want the absolute best performance and scale, consider persisting to RSL.

If you have any questions or comments, please leave a message in the discussion. Thanks!

Build Systems Through My Eyes

Before joining Microsoft, I worked on Linux almost all the time for years. As in most other projects, we used shell scripts to automate the build process, and GNU automake / autoconf were the main toolset. Occasionally CMake was used to handle some components where necessary. In the Windows team, I witnessed how an enormous amount of code is built consistently and reliably using sophisticated in-house software. In this note, a few build systems that I have used in the past are discussed to share some learnings.

Why do we need a “build system”?

A simple hello world program or school project doesn’t need a build system: load it into your favorite IDE and run it. If it works, congrats; if not, it is your responsibility to fix it. Obviously this logic won’t fly for any project shared by multiple people. The Windows SDK and Visual Studio don’t really tell us how to deal with a large number of projects in an automated and reliable manner. NMake is the counterpart of make and is able to do the job to some extent; honestly, however, I haven’t seen anyone use it directly because of the complexity at large scale. We need a layer on top of the SDK and VS toolset to automate the entire build process for both developers and lab builds, and the process must be reliable and repeatable. For Windows, reproducibility is critical: imagine you have to fix a customer-reported issue on a version released long ago; it would be unthinkable if you could not produce the same set of binaries as the build machines did previously in order to debug. By the way, all of these build systems are command line based; since no one will glare at their monitor for hours watching a build, no fancy UI is needed.

Razzle and Source Depot

Razzle is the first production-quality build system I used. Essentially it is a collection of command line tools and environment variables to run build.exe for regular dev builds and timebuild for lab builds. At the start of the day, a command prompt is opened and razzle.cmd is invoked, which performs some validation of the host environment, sets up environment variables, and presents a command prompt for conducting the subsequent work of the day.

In Razzle, everything is checked into the source repository. Here “everything” is literally everything, including source code, compilers, the SDK, all external dependencies, libraries, and all binaries needed for the build process. Outside of build servers, no one checks out everything on their dev machine, which could be near or at a TB; a working enlistment is a partial checkout at the tens-of-GB level. Because of the outrageous requirement on scale, an in-house source repository called Source Depot is used (rumor has it that it was based off Perforce with needed improvements, though I am not sure of the accuracy), and a federation of SD servers supports the Windows code base. On top of sd.exe there is a batch script called sdx.cmd to coordinate common operations across multiple SD servers. For instance, instead of using “sd sync”, we used to run “sdx sync” to pull down the latest check-ins. Some years later, in order to modernize the dev environment, git replaced SD, with which I have no hands-on experience.

Razzle deeply influenced other build systems down the line. Even now, people are used to typing “build” or even “bcz”, even if the latter is not really meaningful in contemporary build systems. One of the great advantages of Razzle is its reproducibility and total independence. Because everything is stored in SD, if you want to reproduce an old build, you just check out the version at the required changeset, type “build”, and eventually you get the precise build required by the work, apart from timestamps, etc. In practice, with a cleanly installed OS, you run the enlistment script on a network share, which in turn calls sd to download the source code (the equivalent of “git clone”), and then you have a fully working enlistment; nothing else is needed (assuming you are fine with editing C++ code in notepad).

Instead of Makefiles or MSBuild project files, dirs files are used for directory traversal and sources files are used to build individual projects. An imaginary sources file looks like the following (for illustration purposes only):

TARGETNAME  = Hello
TARGETTYPE  = DYNLINK
UMTYPE      = console

C_DEFINES   = $(C_DEFINES) -DWIN32 -DMYPROJECT

LINKER_FLAGS = $(LINKER_FLAGS) -MAP

INCLUDES    = $(INCLUDES);\
    $(PROJROOT)\shared;

SOURCES     = \
    hello.cpp \

TARGETLIBS  = $(TARGETLIBS) \
    $(SDK_LIB_PATH)\kernel32.lib \
    $(PROJROOT)\world\win32\$(O)\world.lib \

Invocation of the underlying ntbuild carries out several build passes to run various tasks, such as preprocessing, MIDL, compilation, linking, etc. There are also postbuild tasks to handle code signing, instrumentation, code analysis, localization, etc. A publish/consume mechanism is used to handle the dependencies among projects, so it is possible to enlist in a small subset of projects and build without missing dependencies.

Coming from the Linux world, I didn’t find it too troublesome to use another set of command line tools, other than missing cygwin and VIM. However, for people who loved Visual Studio and GUI tools, this seemed to be an unproductive environment. Additionally, you cannot easily use Razzle for projects outside Windows.

CoreXT

After moving out of Windows, I came to know CoreXT in an enterprise software project. Initially a Razzle clone, it is believed to be a community project maintained by passionate build engineers inside Microsoft (by the way, I have never been a build engineer). It is widely used in Office, SQL, Azure, and many other organizations even today. Six years ago, Azure projects were based on CoreXT and followed a similar approach as Windows on Razzle: everything stored in SD, dirs/sources on top of ntbuild, timebuild to produce the nightly build, etc. The main difference was that each service had its own enlistment project, just like a miniature of the Windows code base. Inter-service dependencies were handled by copying files around. For instance, if project B had to use some libraries generated by project A, project A would export those files and project B would import them by adding them to SD. For projects based on managed code (most are), MSBuild was used instead of ntbuild for convenience.

At the time, the dev experience on CoreXT was not too bad; it inherited all the goodness of Razzle. But it was still a bit heavyweight: even if you only had tens of MB of source code, the build environment and external dependencies would still be north of ten GB in size. Young engineers considered it a dinosaur environment, which was hard to argue against when comparing with the open source toolset. Visual Studio IDE support was provided via csproj files (used by both the build and the IDE) and sln files (used by the IDE only).

Five years ago, people started to modernize the dev environment. The first step was the move from SD to git. Without LFS, it is impractical to store much data in git; 1 GB was considered an acceptable upper bound at the time. So we had to forget about the practice of checking in everything and start reducing the repo size dramatically. But the Windows SDK alone was already well over 1 GB, so how do you handle the storage issue without sacrificing reproducibility? The solution was to leverage NuGet. Essentially, besides the CoreXT bootstrapper (very small) and the source code, everything is wrapped into NuGet packages. This solution has lasted until today.

Most projects have their own git repository. Under the root directory, init.cmd is the replacement for razzle.cmd; it invokes the CoreXT bootstrapper to set up the enlistment environment. As with Razzle, you still get a command prompt with environment variables and command aliases. corext.config under .corext is similar to nuget.config; it contains the list of NuGet feeds (on-premises network shares in the past, ADO nowadays) and the list of packages. All packages are downloaded and extracted into the CoreXT cache directory. MSBuild project files are modified to use the toolset in the cache directory, such as:

<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="15.0" DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <Import Project="$(EnvironmentConfig)" />
  ...
  <Import Project="$(ExtendedTargetsPath)\Microsoft.CSharp.targets" />
</Project>

The trick here is that EnvironmentConfig is an environment variable pointing to an MSBuild props file in the CoreXT cache, and this file bootstraps everything after that. With that, when the build alias is invoked, the MSBuild program is called, and the compilers and build tools in the CoreXT cache are used instead of the ones installed on the host machine.

In theory, the entire process relies on nothing but the files in the CoreXT cache. One does not need to install Visual Studio or any developer tools on their computer. In practice, occasionally some packages reference files outside the cache and assume certain software is installed. However, that is the exception rather than the norm.

For developers, we use Visual Studio or VS Code to browse code, write code, build, and debug. A tool is provided to generate a solution file from a set of MSBuild project files (csproj and dirs.proj). The solution is then loaded in the IDE.

Dependencies among projects are handled by NuGet packages. During the official build, we can choose whether or not to publish packages to feeds on ADO. Other projects simply add a <package .../> entry to their corext.config file should they want to consume any package.

So far, most projects and most people in my org are still using CoreXT in this form. It is used by engineers during daily development, by build machines in the lab, by distributed build in the cloud, and everywhere else we want it to be. Other than compiling source code and building product binaries, it also carries out various other tasks, including but not limited to static code analysis, policy checks, VHD generation, NuGet package creation, making app packages suitable for deployment, publishing symbols, etc.

CBT and Retail MSBuild

Again, CoreXT is considered a modern-day relic; people use it because they have to. In particular, it is highly desirable to have seamless integration with Visual Studio and the ability to consume the latest technology from .NET Core. Before MSBuild became more capable, the Common Build Toolset (CBT) was developed as a GitHub project to fulfill this requirement. It is a lightweight framework that provides a consistent “git clone + msbuild” experience to codebases using it. An additional advantage is that it is open source: for internal projects that need to sync to GitHub periodically, no duplicate build systems (one for the internal build, one for the public one) are needed anymore.

Using CBT is extremely simple from the dev perspective. No internal machinery whatsoever: just clone the repo and open it in Visual Studio. Adding a new project is also straightforward; there is no need to perform brain surgery on csproj files like in CoreXT. The downside is obvious: you must install the essential build tools such as VS, and reproducibility isn’t strictly guaranteed as far as I can tell, since the VS developer command prompt is used. For most Azure teams this may not be a concern; things move so fast that I haven’t met anyone who complains they cannot reproduce a build from a year ago in order to service an old version of their service.

CBT was somewhat short-lived. For some people, by the time they came to know the migration path from CoreXT to CBT, it was already deprecated. The latest shiny framework on the street is Retail MSBuild. :-) It works similarly to CBT but is even more lightweight. With this framework, engineering teams are able to use Visual Studio and the retail Microsoft toolset in the most natural way. In CoreXT, people have to spend a lot of time on any new technology because the framework intentionally works differently; personally, I have spent many hours making .NET Core work in my team, and some other components might be worse. With Retail MSBuild, everything just works with plain, simple SDK-style project files and PackageReference. Precious resources can be spent on real work; we are not rewarded for reinventing the wheel (and possibly a worse one) anyway.
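
For comparison with the CoreXT project file shown earlier, a plain SDK-style project is just this (the package and version are arbitrary examples):

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net5.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <!-- Restored directly from the configured NuGet feeds; no CoreXT cache or import tricks. -->
    <PackageReference Include="Newtonsoft.Json" Version="12.0.3" />
  </ItemGroup>
</Project>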

Snowflakes

Other than the most popular ones mentioned above, some teams write their own framework to meet their unique requirements. For instance, several years ago a team needed a high-performance build integrating with VSTS build definitions with minimal overhead, so a thin wrapper was built on top of a collection of project files and batch scripts. In RingMaster, I had to write my own build framework because the internal proprietary build system could not be released due to the approval process, the project would not build without something similar to CoreXT, and no other alternative was available (CBT did not exist at the time). In the end, the projects were migrated to SDK-style projects to make this work easier.

In the future, I look forward to Retail MSBuild being adopted more widely and the internal build systems going away eventually. I love open source down to my heart. :-)