In performance analysis, the first step is often the time span measurement of various operations in the applications.
It is important to know the correct timestamp and understand which step takes how much time accurately. Without right
data, one may make incorrect decision and spend resource inappropriately. This just happened in my team when a DHCP
slow response issue was investigated.
On computers the time is measured by some variant of system clock, which can be a hardware device that maintains a
simple count of ticks elapsed since a known starting date, a.k.a. the epoach, or the relative time measurement device by
performance counters in CPU. In applications sometimes we want to know the calendar time, or the wall clock time so
we can correlate what happens inside the program with what happens in real world (e.g. customer places an order),
sometimes we want to know the time span, or how fast / slow an operation is. We often use timestamp retrieved from
the system time for the former, and calculate the difference of two timestamps to get the latter. Conceptually it
works, but the relative error of the measurement matters.
On the PC motherboard there is a real-time clock chip, some people call it CMOS clock since the clock is a part of CMOS.
It keeps the date and time, as well as a tiny CMOS RAM which contains the BIOS settings in old days. Even when PC is
powered off, it is still running on a built-in battery. Via I/O ports 0x70 and 0x71, one can read or update the current
date / time. Because the cheap oscillator often not works at designed frequency, the clock may drift over time. The OS
compensates this by periodically synchronizing with the time service using NTP, e.g. time.nist.gov is considered as one
of most authortiative time source.
In the OS the initial date / time clock is retrieved from RTC, then it is updated periodically, typically 50 - 2000
times per second. The duration is called clock interval. The clock interval can be adjusted by applications, for
instance multimedia applications and Chrome browser often set the clock interval to 1 ms. Smaller interval has negative
impact on the battery usage, it is avoided whenever possible. On servers this is normally kept as default.
One can query the clock using sysinternal app clockres.exe or Windows built-in program powercfg. On my desktop the
clock interval is 15.625 milliseconds (ms) (64 Hz frequency), and the minimum supported interval is 0.5 ms. 15 ms
resolution is sufficient for most real-world events, e.g. when the next TV show starts, but it is insufficient for time
measurements on computers in many cases, in particular the events lasting tens of milliseconds or less. For instance,
if an action starts at some point in clock interval 1 and stops at another point in clock interval, using the system
clock you will see a time span of 31 ms, but actually it can be anywhere from 15.625 to 46.875 ms. On the other hand,
if an action starts and stops within the same clock interval, the system clock will tell you the duration is 0 but it
can be as long as 15 ms. My coworker once said “the request is received at xxx, at the same time it gets processed
…”, sorry within a single thread two different things do not happen at the same time.
In .NET, system clock is retrieved using DateTime.UtcNow (see MSDN
It is implemented by calling Win32 API GetSystemTimeAsFileTime
function. Note that although the unit
is in tick (100 ns), the resolution is really clock interval.
In old days (Windows XP, Linux many years ago) people used to read processor timestamp counter (rdtsc) to acquire high
resolution timestamps. It is tricky to get it right in virtual environment, multiple-core system, and special
hardwares. Nowadays on Windows the solution is Win32 API QueryPerformanceCounter
function. On modern hardware (i.e.
almost all PCs nowadays), the resolution is less than 1 microseconds (us). On my “cost effective” home PC, the
resolution is about 300 ns, or 3 ticks. For more information on QPC read MSDN article
In .NET, QPC is implemented by System.Diagnostics.Stopwatch (see reference source
For any time span measurement this should be considered as the default choice.
Another thing to remember is that Stopwatch or QPC is not synchronized to UTC or any wall clock time. This means that
if the computer adjusts the clock after synchronizing to the time server, this will not be affected – no forward or
backward jump. In fact I saw a stress test failure caused by clock forward adjustment, when the timeout was evaluated
the time sync happened so the calculated time span was several minutes greater than the actual value. This kind of bug
is hard to notice and investigate, but trivial to fix. Avoid wall clock time if possible.
In term of overhead, Stopwatch is more expensive than DateTime.UtcNow. However both take very little CPU time. On my
home PC, Stopwatch takes about 6 ns vs 3 ns for DateTime.UtcNow. Normally it is much shorter than the duration being
The last question is that, if we do need the absolute time correlation on multiple computers, is there anything better
than System.DateTime.UtcNow? The answer is yes, setup all computers to the same time source, then use
It is supported in Win8 / Server 2012 or later. In .NET one needs to use P/Invoke to use it.
In the weekend I saw a few posts on StackOverflow asking how the network traffic is
throttled in Azure, how much bandwidth a VM can use, etc. There are some answers based on measurements or MSDN docs. I
hope this post may give a more complete picture from engineering perspective.
When we talk about bandwidth throttling, we refer to the network bandwidth cap at the VM / vNIC level for the outbound
traffic (or the transmit path). Inbound traffic from public internet goes through Software Load Balancer but no
throttling is applied at the VM / host. All network traffic going out of a VM, including both billed and unbilled
traffic, are throttled so the bandwidth is limited to a certain number. For instance, a “Small” VM is capped at 500
Mbps. Different sizes of VM have different caps, some of them are very high.
Then the question is, if a VM has more than one interface, is the cap shared by all interfaces or divided equally among
them? If the value of the bandwidth cap is updated, will VM be rebooted? The answer is it depends. Some time ago, the
network bandwidth is managed by tenant management component in FC. Technically the agent running on the host sets a
bandwidth cap on VM switch using Hyper-V WMI when creating VM. If there are multiple interfaces, the cap is divided by
the number of interfaces. If we want to change the bandwidth of individual VM or all VMs with the same size, fabric
policy file in the cluster has to be updated and VMs have to be created to apply the new values. Recently we changed
the design to let network management component in FC to handle this work. Network programming agent program
communicates with a filter driver (VFP) on the host to create a QoS queue and then associates all interfaces with the
queue. So all interfaces share the same cap. For instance, if a small VM has two NICs, if the first NIC is idling the
second NIC can use up to 500 Mbps. Basically now the cap should apply to the entire VM. Some cluster may not have this
feature enabled temporarily, but this should be rare.
Another question is, since you call it “cap”, does it mean my VM will get that amount of bandwidth in the best case, and
it may get less bandwidth if neighbors are noisy? The answer is noisy neighbors do not affect the bandwidth throttling.
The allocation algorithm in FC knows how much resource exits on each host, including total network bandwidth, and the
container allocation is designed to allow each individual container uses its full capacity. If you absolutely believes
the bandwidth is a lot less than advertised (note Linux VM needs Hyper-V IC being deployed), you may open a support
ticket, ultimately the engineering team will figure it out. From our side, we can see the values in SDN controller as
well as the value set on QoS queue.
In term of latency and throughput, performance measurement shows no statistically significant difference between QoS
queue based throttling and VM switch based throttling. Both are equally high performance.
The new design allows the seamless/fast update of bandwidth in a cluster – the entire process takes less than a half
minute and no visible impact to the running VMs. It also leaves room for further enhancement should upper layer
supports, for instance adjustable bandwidth for same container size based upon customer requirement. Hope all customers
are satisfied with networking in Azure. :-)
Recently the host OS in Azure is being upgraded to a new version. As a part of this, a DLL of networking agent called
WVCDLL has to be upgraded too. My team member tried and got a werid error. Attaching debugger to the program, the IP
stopped at a logging entry, which did not make much sense. Then I looked closely at the stack trace:
From the stack trace the problem occurred at 00007ffe 27321171, the call instruction. The symbol name indicates that
the code in WVCDLL is attempting to call a function defined in OsLogger via import table. Therefore the error is caused
by either the dependency does not exist or it is a wrong version.
In native code, a dependency DLL can be set to delay loaded for two reasons:
* The main program may not call any function in the dependency at all.
* Even if it does, the function may be called late in the execution.
Delay-loading can improve the program load speed for obvious reason. The downside is we may not see the dependency load
failure when the program is loaded, as previous failure shows.
In this case, a closer look turned out that the problem was caused by wrong version was used during the build. This is
a typical scenario of diamond dependencies:
* Program depends on DLL A and B.
* DLL A and B are developed and distributed independently.
* Both A and B depend on DLL C.
Unlike managed code, native code does not check the version of DLL being loaded. Suppose DLL A version 1 uses DLL C
version 1.1, DLL B version 2 uses DLL C version 1.2 which has a different import table, then we link A and B into the
same program, although the build process is successful, at runtime it will fail to resolve function entries.
Most teams in Microsoft have compoentized the code base and migrated the source repository from centralized SD to
distributed Git – I may write another post to talk about how the engineering system looks like. Since the release
cadence of different components are completely different, this problem may not be rare to see.
Lastly, in case anyone does not know one can check the version of loaded module in windbg using lmvm:
0:019> lmvm wvcdll
Browse full module list
start end module name
00007ffe`27310000 00007ffe`273ad000 wvcdll (deferred)
Image path: xxxxxxxxxxxxxxxxxxxxxxxxxxxx\wvcdll.dll
Image name: wvcdll.dll
Browse all global symbols functions data
Timestamp: Tue Mar 8 11:36:39 2016 (56DF29C7)
File version: 188.8.131.52
Product version: 184.108.40.206
PrivateBuild: (by corecet on CY1AZRBLD15)
FileDescription: xxxxxxxxxxxxxxxxxxxxxxxx r_mar_2016 (6f6baea) Microsoft Azure®
Based on the version and other info, one can load matching PDB for ease of debugging.
Our projects are mostly C# and C++ code, and MSBuild is the engine to drive the entire build process. If the code base
is large, the log file can be very large. For instance, our build log is over 600 MB and counts for over 5 million
lines of text. Although the logging is designed to be human readable, it is a challenge to read and extract useful
information in such a huge file. I would like to share a few tips and hope this info makes your work easier.
Firstly we need a decent editor that is capable to handling large files. I use VIM all the time, it works well in this
scenario. Some people like Sublime Text, which I have no experience, but it seems to be equally powerful. Notepad is a
bad choice, it can handle large file but the performance is so bad and I always wonder why it still exists in Windows.
Notepad++ and Visual Studio Code refuse to load large files, so avoid them. Finding a nice editor is what every
developer should do.
Normally MSBuild builds multiple projects in paralle. Different loggers can be used, we only use the default console
logger, which is multiple-processor aware. In the log file, each chunk of log, which can be one or multiple lines, is
prefixed with the build node number, for instance:
Build started 4/28/2016 4:05:25 PM.
1>Project "X:\bt\713442\repo\src\dirs.proj" on node 1 (default targets).
1>Building with tools version "12.0".
1>Project "X:\bt\713442\repo\src\dirs.proj" (1) is building "X:\bt\713442\repo\src\deploy\dirs.proj" (2) on node 1 (VerifyAlteredTargetsUsed;Build target(s)).
2>Building with tools version "12.0".
2>Project "X:\bt\713442\repo\src\deploy\dirs.proj" (2) is building "X:\bt\713442\repo\src\deploy\WABO\dirs.proj" (6) on node 1 (VerifyAlteredTargetsUsed;Build target(s)).
6>Building with tools version "12.0".
Suppose you are inspecting a particular project, try to determine the node, then follow all the lines with the right
node as prefix.
Start and finish
A project is started by loading the specific version of tool:
6>Building with tools version "12.0".
When the build is finished, the last line looks like:
6>Done Building Project "X:\bt\713442\repo\src\deploy\WABO\dirs.proj" (VerifyAlteredTargetsUsed;Build target(s)).
Each target also has the similar pattern. If you want to search the start and end, you may use the following regex
pattern (applies to VIM):
project .* on node\|done building project
Sometimes the project output could have missing files or wrong version of dependencies. Typically one can check
CopyFileToOutputDirectory to see how the files are copied, for instance:
Copying file from "X:\bt\713442\repo\src\Tools\Hugin\obj\amd64\Hugin.exe" to "X:\bt\713442\repo\out\retail-amd64\Hugin\Hugin.exe".
The following regex can be used:
copying file from .*obj.* to
Once you are familiar with the structure of build log and know a little bit into MSBuild targets, it is quite
straightforward to search the log and investigate the issue. If you have any question drop a comment and I will try to
Given an interface, it is straightforward to generate implementations of all members (or missing ones of abstract class)
using either Visual Studio or ReSharper. Sometimes, code generation in IDE may not be enough for handling boilerplate
code. Recently I have to deal with a WCF service proxy, where retry and logic mechanism needs to be put in every
method. Basically the scenario is the following:
There is a WCF service running on multiple-replica setup, and the interface is changing constantly.
The client side wants to call the service and has to deal with unreliable channel with reconnect. However, the
client code simply takes an interface object and uses it just like all other libraries, i.e. without any special
logic of retry / reconnect / timeout.
It is preferable to not make any change to the client code. Finding all references of the interface and fixing the
use pattern is not an option, and it may be huge amount of work anyway.
My solution is providing a service proxy, which implements the service interface and forwards all calls to the actual
service endpoint with proper logic. As one can imagine, boilerplate code has to be put in every method of the proxy
class, something like:
Thanks to reflection, it is fairly easy to inspect the interface type to discover the full details, and then generate
the boilaterplate code automatically. This can be performed in both C# and PowerShell. However, to integrate this
process with the build, it is simpler to use PowerShell. The basic process is the following:
Load the interface assembly using System.Reflection.Assembly.LoadFrom
Load the interface type using Assembly.GetType
Discover all methods using Type.GetMethods
For each method, generate the code based on ReturnType and the list of parameters using GetParameters
For the parameter type, we need to check if it is input, output, or ref parameter and handle the code generation
accordingly. Another thing to consider is handling the generic types, which Name or ToString() will generate a
The last question is where and how to generate the code in the right location, i.e. $(IntermediateOutputPath). Since
the code generation is needed before C# compiler is invoked and it requires the interface assembly being accessible (as
well as its dependencies), the best approach is to define a target AfterResolveReference. This target is empty by
default, it is invoked after all project references are resolved and handled properly and before Compile target is
started. One can also use BeforeCompile target, which is invoked a moment after “AfterResolveReference” and right
before CoreCompile. To get more details on this, read Microsoft.Common.targets in .NET framework directory on the
With this change in csproj file, the only human-written code is a partial class with CallWithRetry method which is
referenced by all interface methods. No matter how interface changes, the maitainence cost is limited to the partial