Blackboard - Perspectives and thoughts from Zhenhua Yao

How to Retrieve Internet Cookies Programmatically

Reusing the cookies stored by a website is a handy trick that lets a script piggyback on existing authentication to access a web service. There are several ways to retrieve the cookies saved by IE / Edge; the most convenient is to read the files on the local disk directly. Basically, we can use the Shell.Application COM object to locate the cookies folder, then parse all text files for the needed information. In each file there are several records delimited by a line containing the single character *; within each record the first line is the cookie name, the second line is the value, and the third line is the host name of the website that set the cookie. Here is a simple PowerShell program to retrieve and print all cookies:

Set-StrictMode -version latest

$shellApp = New-Object -ComObject Shell.Application
$cookieFolder = $shellApp.NameSpace(0x21)
if ($cookieFolder.Title -ne "INetCookies") {
    throw "Failed to find INetCookies folder"
}

# Enumerate the text files in the cookies folder.
$allCookies = $cookieFolder.Items() | ? { $_.Type -eq "Text Document" } | % { $_.Path }
foreach ($cookie in $allCookies) {
    Write-Output "Cookie $cookie"
    # Each file holds multiple records separated by a line with a single '*'.
    $items = (Get-Content -Raw $cookie) -Split "\*`n"
    foreach ($item in $items) {
        if ([string]::IsNullOrEmpty($item.Trim())) {
            continue
        }
        # The first three fields of each record are the name, value, and host.
        $c = $item -Split "\s+"
        Write-Output "  Host $($c[2])"
        Write-Output "  $($c[0]) = $($c[1])"
    }
}

Note that files in %LOCALAPPDATA%\Microsoft\Windows\INetCookies\Low do not show up in the $cookieFolder.Items() list. An alternative approach is to browse the file system directly, e.g.:

gci -File -Force -r ((New-Object -ComObject Shell.Application).Namespace(0x21).Self.Path)

Primary Tracker

I believe in the KISS principle. Albert Einstein said:

Make everything as simple as possible, but not simpler.

This is one of the guiding principles that I follow in every design and implementation. Needless to say, this principle is particularly important in cloud computing. Simplicity makes it easier to reason about system behaviors and drive code defects down to zero. For critical components, decent performance and reliability are the two attributes that let you sleep well at night. The Primary Tracker in the networking control plane is a good example to explain this topic.

Basic layering of fabric controller

Fabric Controller (FC) is the operational center of the Azure platform. FC gets customer orders from the Red Dog Front End (RDFE) and/or its modern replacement, Azure Resource Manager (ARM), and then performs all the heavy-lifting work such as hardware management, resource inventory management, provisioning and commanding tenants / virtual machines (VMs), monitoring, etc. It is a “distributed stateful application distributed across data center nodes and fault domains”.

The three most important roles of FC are the data center manager (DCM), tenant manager (TM), and network manager (NM). They manage three key aspects of the platform: data center hardware, compute, and networking. In production, FC role instances run with 5 update domains (UDs).

FC layering

The number of UDs is different in test clusters. Among all UDs, one is elected as the primary controller and all others are considered backups. The election of the primary is based on the Paxos algorithm. If the primary role instance fails, the remaining backup replicas vote for a new primary, which then resumes operation. As long as there are 3 or more replicas (a majority of the 5), a quorum can be formed and FC operates normally.

In the above diagram, different nodes communicate with each other and form a ring via the bottom layer, RSL. On top of it is a layer of cluster framework, libraries, and utilities, collectively called CFX. Via CFX and RSL, a storage cluster management service is provided, where the In-Memory Object Store (IMOS) is served. Each FC role defines several data models living in IMOS, which are used to persist the state of the role.

Note that the eventual consistency model is not used in FC as far as a single role is concerned. In fact, the strong consistency model is used to add a safety guarantee (read the Data Consistency Primer for more information on consistency models). Whether this model is best for FC is debatable; I may explain more in a separate post later.

Primary tracker

Clients from outside a cluster communicate with FC via a Virtual IP address (VIP), and the software load balancer (SLB) routes each request to the node where the primary replica is located. In the event of a primary failover, SLB ensures that traffic to the VIP always (or eventually) reaches the new primary. For performance reasons, communication among FC roles does not go through the VIP but directly via Dynamic IP addresses (DIPs). Note that the primary of one role is often different from the primary of another role, although sometimes they can be the same. Then the question is: where is the primary? A wrong answer to this question has the same effect as service unavailability.

This is why we have the Primary Tracker. Basically, the primary tracker keeps track of the IP address of the primary replica and maintains a WCF channel factory to ensure that requests to the role can be made reliably. The job is as simple as finding the primary, and re-finding the primary if the old one fails over.

The storage cluster management service provides an interface such that, once connected to any replica, it can tell where the primary is, as long as the replica serving the request is not disconnected from the ring. Obviously this is a basic operation of any leader election algorithm, nothing mysterious. So the primary tracker sounds trivial.

In the Azure environment there are a few more factors to consider. The primary tracker object can be shared by multiple threads when many requests are processed concurrently. A WCF client channel cannot be shared among multiple threads reliably, and re-creating the channel factory is too expensive. Having too many concurrent requests may be a concern for the health of the target service. So it is necessary to maintain a shared channel factory and perform request throttling (again, this is debatable).
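For illustration, the throttling part can be as simple as a counting semaphore around the outgoing call. The sketch below is not the actual implementation; it only shows the idea in PowerShell, and the limit of 16 and the request scriptblock are hypothetical placeholders.

# Illustrative only: cap the number of in-flight requests with a shared semaphore.
$throttle = [System.Threading.SemaphoreSlim]::new(16, 16)   # hypothetical concurrency limit

function Invoke-ThrottledRequest([scriptblock] $request) {
    $throttle.Wait()                    # block while 16 requests are already in flight
    try     { & $request }              # e.g. a call through the shared channel factory
    finally { $null = $throttle.Release() }
}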

Still, this does not sound complicated. In fact, with proper componentization and decoupling, many problems can be modeled in a simple way. Therefore, we had an almost-working implementation, and it had been in operation for a while.

Use cases

From the perspective of the networking control plane, two important use cases of the primary tracker are:

  • Serving tenant command and control requests from TM to NM.
  • Serving VM DHCP requests from DCM to NM.

The load in both cases depends on how busy a cluster is, for instance whether customers are starting many new deployments or stopping existing ones.

Problems

Although the old primary tracker worked, it often gave us headaches. Sometimes customers complained that starting VMs took a long time or even got stuck, and we root-caused the issue to unresponsiveness to DHCP requests. Occasionally a whole cluster became unhealthy because DHCP stopped, and no new deployment could start because the start-container step failed repeatedly and pushed physical blades into the Human Investigate (HI) state. Eventually the problem happened more often, reaching a frequency of more than once per week, and the DRI on rotation got nervous since they did not know when the phone would ring them up after they went to bed.

Then we improved monitoring and alerting in this area to collect more data and, more importantly, to get notified as soon as a failure occurred. This gave us the right assessment of the severity but did not solve the problem itself. With careful inspection of the log traces, we found that a failover of the primary replica could cause the primary tracker to lose contact with any primary for an indefinite amount of time, anywhere from minutes to hours.

Analysis

During one Sev-2 incident investigation, a live dump of the host process of the primary tracker was taken. The state of the object as well as all threads was analyzed, and the conclusion was astonishingly simple: there was a prolonged race condition triggered by channel factory disposal upon the primary failover, after which all the threads accessing the shared object started an endless fight with each other. I will not repeat the tedious process of the analysis here; basically it was backtracking from the snapshot of 21 threads to the failure point with the help of log traces, nothing really exciting.

Once we had the conclusion, the evidence in the source code became obvious. The ironic part is that the first line of the comment said:

This class is not designed to be thread safe.

But in reality the primary use case is in a multi-threaded environment. And the red flag is that the shared state is mutated by multiple threads without proper synchronization.

Fix

Strictly speaking, the bugfix is a rewrite of the class with the existing behavior preserved. As one can imagine, it is not a complicated component; the core of the design is a reader-writer lock, specifically the ReaderWriterLockSlim class (see the reference source here). In addition, a concept of generation is introduced for the shared channel factory in order to prevent different threads from re-finding the new primary multiple times after a failover.
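The pattern is small enough to sketch. The following PowerShell illustration is not the actual FC code; it only shows the idea of a reader-writer lock combined with a generation counter, and the PrimaryTracker class, Find-PrimaryReplica, and the endpoint shape are hypothetical names.

# Illustrative sketch: readers share the cached primary endpoint; a failed caller
# reports the generation it was using, and only the first report after a failover
# triggers a re-discovery of the primary.
class PrimaryTracker {
    hidden [System.Threading.ReaderWriterLockSlim] $Lock = [System.Threading.ReaderWriterLockSlim]::new()
    hidden [int] $Generation = 0
    hidden [string] $Primary = $null
    hidden [scriptblock] $FindPrimary

    PrimaryTracker([scriptblock] $findPrimary) {
        $this.FindPrimary = $findPrimary
        $this.Primary = & $findPrimary
    }

    # Return the cached primary endpoint together with its generation.
    [object] Get() {
        $this.Lock.EnterReadLock()
        try {
            return [pscustomobject]@{ Endpoint = $this.Primary; Generation = $this.Generation }
        } finally {
            $this.Lock.ExitReadLock()
        }
    }

    # Called by a thread whose request failed; re-find the primary only if no other
    # thread has already refreshed past the failed generation.
    [void] ReportFailure([int] $failedGeneration) {
        $this.Lock.EnterWriteLock()
        try {
            if ($this.Generation -eq $failedGeneration) {
                $this.Primary = & $this.FindPrimary    # e.g. ask any replica where the primary is
                $this.Generation = $this.Generation + 1
            }
        } finally {
            $this.Lock.ExitWriteLock()
        }
    }
}

# Usage sketch: $tracker = [PrimaryTracker]::new({ Find-PrimaryReplica })   # Find-PrimaryReplica is hypothetical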

Stress test

The best way to check reliability is to run a stress test with as much load as possible. Since the new implementation is backward compatible with the old one, it is straightforward to conduct a comparative study. The simulated stress environment has many threads sending requests continuously, the artificial primary failover occurs much more often than in any production cluster, and furthermore the communication channel is injected with random faults and delays. It is a harsh environment for this component.

It turns out the old implementation breaks down within 8 minutes, with exactly the same failure pattern as observed in production clusters. In contrast, the new implementation has not failed so far.

Performance measurement

Although the component is perf sensitive, it had no regular perf testing. A one-time perf measurement conducted in the past showed that the maximum load it was able to handle was around 150 to 200 requests/sec in a test cluster. This number is more than twice the peak traffic in a production cluster under normal operational conditions, according to live instrumentation data. Is it good enough? Different people have different opinions. My principle is to design for the worst scenario and ensure the extreme case is covered.

As a part of the bugfix work, a new perf test program was added to measure both the throughput and latency of the system. The result shows that the new component is able to process about ten times the load, and the per-request overhead is less than one millisecond. After tuning a few parameters (a setup slightly different from production), the throughput increased further by about 30-40%.
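For reference, the skeleton of such a measurement is simple. The sketch below is illustrative only, a single-threaded baseline rather than the real test program, and Invoke-TrackedRequest is a hypothetical stand-in for the call under test.

# Illustrative only: measure throughput and per-request latency of a request function.
$iterations = 1000
$latencies  = New-Object System.Collections.Generic.List[double]
$total      = [System.Diagnostics.Stopwatch]::StartNew()

for ($i = 0; $i -lt $iterations; $i++) {
    $sw = [System.Diagnostics.Stopwatch]::StartNew()
    Invoke-TrackedRequest               # hypothetical call under test
    $sw.Stop()
    $latencies.Add($sw.Elapsed.TotalMilliseconds)
}

$total.Stop()
$throughput = $iterations / $total.Elapsed.TotalSeconds
$avgLatency = ($latencies | Measure-Object -Average).Average
"Throughput: {0:N1} req/s, average latency: {1:N2} ms" -f $throughput, $avgLatency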

Rollout

Despite the fear of a severe incident caused by changing a critical component, with the proof of functional / perf / stress test data, the newly designed primary tracker has been rolled out to all production clusters. Finally, the repeated incidents caused by primary tracking failures no longer wake up DRIs during the night. From the customers' perspective, this means fewer VM start failures and shorter VM boot times.

How Do We Deal With Flaky Tests

Today I read an article from the Google Testing Blog, “Flaky Tests at Google and How We Mitigate Them”, and would like to share my thoughts.

Flaky tests are not unheard of in a large software project, particularly if test cases are owned by developers with varying levels of experience. People hate flaky tests as much as failed tests: reruns take more resources, false alarms waste precious dev time, and oftentimes people tend to ignore them or disable them entirely. Personally I do not agree with the approach in the blog; it is simply not a quality-driven culture.

My opinion has always been that the Heisenberg uncertainty principle plays no role in software development: any “indeterministic” behavior can be traced back to a quality issue, and the number of flaky tests should be driven down to zero.

In my past observation, much of the flakiness is caused by test code issues. There is no test code for the test code, and people may not apply the same level of quality awareness as they do to product code. Besides unsafe threading, race conditions, lack of synchronization, etc., there are common anti-patterns causing flakiness (only what I can think of at the moment):

  • Checking driven by timeout instead of event: for instance, click a button on the UI, wait for 5 seconds, then click the next button (see the sketch after this list).
  • Unawareness of async events: for instance, load a web page and wait for it to finish by checking whether a certain tag is found, then proceed to submit the form. But the page actually has an iframe which has not yet completed loading.
  • Incorrect assumptions about the runtime condition. There are too many such examples. In one case, a P1 test case owned by my team fails over the primary replica of the controller, then waits for it to come back by checking the primary IP address reported by the storage cluster management (SCM). Unfortunately the check is incorrect, because only the layer above SCM can tell reliably whether the new primary is up.
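As a contrast to the first anti-pattern, here is a minimal sketch (not taken from any real test suite) of waiting for a condition with a deadline instead of sleeping for a fixed 5 seconds; Wait-ForCondition and the commented example condition are hypothetical.

# Illustrative only: poll for a condition with a deadline instead of a fixed sleep.
function Wait-ForCondition {
    param(
        [Parameter(Mandatory)] [scriptblock] $Condition,    # returns $true when ready
        [timespan] $Timeout      = [timespan]::FromSeconds(30),
        [timespan] $PollInterval = [timespan]::FromMilliseconds(250)
    )
    $deadline = [datetime]::UtcNow + $Timeout
    while ([datetime]::UtcNow -lt $deadline) {
        if (& $Condition) { return }
        Start-Sleep -Milliseconds $PollInterval.TotalMilliseconds
    }
    throw "Condition not met within $($Timeout.TotalSeconds) seconds"
}

# Example: wait until the submit button is actually enabled before clicking it.
# Wait-ForCondition { $driver.FindElementById('submit').Enabled }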

Besides test code bugs, real product issues may also cause flakiness in test execution. This is particularly dangerous in a cloud environment, since massive scale magnifies the probability of hitting the real issue in production. We sometimes say that if some bad thing has a small chance of happening, then it will happen after deployment.

Most of the time, driving down flaky tests requires the right mindset and the right prioritization from the leadership team. As long as flaky tests and failed tests are treated as rigorously as product bugs and live-site incidents in terms of priority and resource assignment, nothing cannot be fixed. As the tests become more reliable and more proven patterns are adopted, the improved CI experience will benefit everyone from ICs to leadership. One should not underestimate the ROI of driving for better quality.

Create Process Dump File

In a cloud environment we mostly rely on log traces to understand what is happening inside a program and to investigate abnormal behaviors when things do not work as expected. However, sometimes it is necessary to get deep insight into the process internals via debugging. When live debugging is not possible, we have to capture a user-mode process dump and conduct post-mortem analysis. For instance, a memory leak or an unusual increase in memory usage is such a case, where objects on the heap need to be examined closely.

The most common way to create a dump file is to use Windows Task Manager. Open Task Manager, click the “Processes” tab on Windows 7 or the “Details” tab on Windows 8/10, then right-click the name of the process and click “Create Dump File”. Once the dump file is created, it is saved in the %TEMP% directory, which is usually the \Users\UserName\AppData\Local\Temp directory on the system drive. The size of the dump file is roughly the number of virtual bytes of the process.

The downside of this method is that the location of the dump file cannot be specified. In addition, one cannot choose whether to create a minidump (only thread and handle information) or a full dump (all process memory).

In some cases this can be a severe issue. In an Azure data center, free space on the system drive is extremely limited (oftentimes less than 15 GB). Personally I have seen (via post-mortem analysis) dump file creation exhaust the disk space and make the OS unusable while responding to a live-site incident, which made the situation worse by creating a second incident.

A better way to create a dump file is to use ProcDump or AdPlus (a part of WinDbg). An example of creating a full dump is:

procdump -ma MyProcess c:\temp\myprocess.dmp
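ProcDump can also write a dump on a trigger instead of immediately. For example (switches recalled from the documented usage, so verify with procdump -? before relying on them), the first command below writes a full dump when the process hits an unhandled exception, and the second when CPU usage stays above 80% for 10 seconds:

procdump -ma -e MyProcess c:\temp\myprocess.dmp
procdump -ma -c 80 -s 10 MyProcess c:\temp\myprocess.dmp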

ProcDump is written by Mark Russinovich, a Microsoft Technical Fellow. It is very small in size; one can visit the TechNet page to download it. If a GUI is preferred, I strongly recommend a Task Manager replacement, Process Explorer, by the same author.

How Does VM Boot Up in Azure

One of the frequent complaints in Azure support is that customers cannot RDP / SSH into their VMs. In many cases there are very few options for the users other than retrying or filing a support ticket. After all, many components may go wrong, trace logs are internal, and no one can put their hands on the physical machine to check what is happening. I hope a brief explanation can give you some insight and make your life slightly easier. :-)

WARNING: some technical details are omitted, and accuracy is not guaranteed. Maybe more importantly, some design aspects may sound suboptimal or even controversial. In public, many companies like to brag about how great their culture is and how much they care about engineering quality. The reality is that in a work-intensive environment engineers often just “make stuff work and ship it”; fortunately people learn and grow over time, so products improve gradually.

When you deploy a cloud service or start an IaaS VM via the web portal or automation, a Deployment is created. Inside the fabric controller, the deployment is called a Tenant. The service package (cspkg) and service configuration file (cscfg) are translated into a service description file (svd), a tenant description file (trd), and images. In each cluster, the compute manager and network manager are the two important roles that manage the resource inventory and the tenant life cycle. At the regional level, some networking resources such as VIPs and virtual networks (VNETs) are managed by the regional controller. Each tenant consists of multiple roles, each role has one or more role instances, and each role instance lives in a VM, which is called a container. VMs are hosted by Hyper-V servers, which are managed by the fabric controller running in the cluster. The fabric controller is itself a tenant like all other customer tenants (imagine clusters managed by clusters).

Create / update tenant: the svd/trd as well as the regional resources are received by the compute manager. This component allocates the resources and finds physical nodes which have sufficient capacity to host the required containers. After compute resource allocation is completed, it passes the information to the network manager (NM). NM then allocates networking-related resources and sends the relevant information to the agent program running on the nodes where the containers will be running. After NM sends back confirmation that the network allocation is done, the tenant is successfully updated to the required service description. The communication between the compute and network managers is implemented with single-direction WCF using net.tcp binding; many APIs are synchronous with a configurable timeout.

Start tenant: upon receiving this command from the frontend, the compute manager drives the start-container workflow by sending the goal state to the compute agent running on the node. At the same time, it also notifies NM that certain role instances have to be connected (or in service). NM sends the goal state to its agent to perform the network programming. Note that this work happens on multiple nodes in parallel.

Once the compute agent receives the command to start a container, it downloads whatever is needed from Azure storage (e.g. the VHD file for IaaS, the service package for PaaS), then creates a VM via Hyper-V WMI. By default, the VM NIC port is blocked, so no network traffic can go through. The NM agent program periodically gets the list of all the VM NICs on the host, checks whether the goal state is achieved, and programs the interface if needed. Once the programming is completed, the port is unblocked and the VM can talk to the outside. Also note that the compute and network agents are two different programs driven by different manager components, and they work somewhat independently.

Assuming everything works properly, the VM is created successfully, the port is unblocked, and the guest OS boots up; the DHCP client then sends a DHCP discover in order to get its IP configuration and other information. DHCP packets are intercepted and encapsulated if needed, then forwarded to WDS in the cluster. WDS receives all DHCP traffic in the cluster, from both physical hosts and virtual machines. A plugin in WDS then forwards the request to the cluster hardware management component, which asks NM whether the request is for a physical host or for a tenant. If the former, this component handles it by itself; if the latter, it forwards the request to NM.

After NM receives the DHCP request, it finds the corresponding tenant based on the MAC address and returns the IP address and other relevant information. The response then goes back via the same path all the way to the VM. Now the VM has an IP configuration and continues to boot.

Start role instance: this is mainly for PaaS tenants but also relevant for IaaS. The guest agent (GA) in the VM is configured to start automatically. In the DHCP response there is a special DHCP custom option, 245, named the wire server address. It is the IP address of a web server running on the physical host, to which the guest agent (or anything running inside the VM) can talk. The GA retrieves this address from the DHCP response and does the following:

  • ask the wire server which version of the protocol is supported.
  • periodically ask for the machine goal state, i.e. whether the VM should continue to run or shut down.
  • retrieve the role configuration, which contains the IP and other useful information about the runtime environment.
  • keep a heartbeat with the wire server, reporting the health state and current status.
  • upon request, start the role instance by loading the program on the app VHD mounted in the OS.

The wire server is a low-privilege web app designed to be the middleman between the fabric and the containers. If there is any state request, for instance that the VM should be shut down, the request is forwarded by the compute agent to the wire server, and the guest agent is supposed to poll for it. The state of the role instance, e.g. starting / running / busy / unresponsive, is reported by the GA to the wire server and then to the compute agent. The wire server and GA work together to keep the VM in a healthy and responsive state.
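For the curious, this exchange can be observed from inside a VM. The snippet below is illustrative only: 168.63.129.16 is the published wire server address, and the endpoints and version header follow the publicly available guest agent implementations, so treat them as assumptions rather than a guaranteed contract.

# Illustrative only: probe the wire server from inside an Azure VM.
$wireServer = '168.63.129.16'    # published wire server / host node address

# Ask which protocol versions are supported.
Invoke-RestMethod -Uri "http://$wireServer/?comp=versions"

# Ask for the machine goal state (e.g. whether the VM should keep running).
Invoke-RestMethod -Uri "http://$wireServer/machine/?comp=goalstate" -Headers @{ 'x-ms-version' = '2012-11-30' }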

For PaaS containers, thanks to the Hyper-V integration services (primarily the heartbeat service) and the GA, the compute agent knows whether the guest OS received a proper DHCP offer, when the container is ready, whether the container picked up the desired machine goal state, whether the role instance started properly, and so on. If something unexpected happens, the compute stack decides whether to retry, move the container to another host (assuming the problem is caused by a malfunctioning physical host), or give up and wait for engineering team investigation.

Once the wire server / compute agent is notified of the role instance state, the information is propagated to the upper layers and finally reflected in the web portal. Assuming the RDP / SSH endpoint is configured properly, the incoming request is routed by the software load balancer, and then it is up to the OS how to handle it. Many other factors determine whether RDP / SSH works or not, including but not limited to firewall rules, service state, etc.

For IaaS VMs, the OS VHD may be uploaded by customers, so the GA may not be installed and the Hyper-V integration services may not exist. In that case the compute manager has no way to know what is running inside the guest OS, so it simply starts the VM using Hyper-V WMI. As long as the VM is healthy from the hypervisor's perspective, even if the guest OS is stuck somewhere in the boot process, the compute agent will not do anything, since it has no knowledge of it.

As you can see, many components participate in the VM / container boot process. Most of the time things work amazingly well. However, sometimes the fabric controller or other parts of the platform may have glitches, and sometimes even the Windows OS may hit issues, consequently causing access problems for customers. Personally I have seen a variety of incidents due to platform quality issues, such as:

  • General platform unavailability due to a cluster outage. In this case a large number of VMs are offline.
  • A storage incident slows down OS VHD retrieval, and oddly enough breaks the Base Filtering Engine service during boot.
  • A physical networking issue causes packet drops.
  • The networking agent does not program the interface properly, so the DHCP discover is blocked and no IP address is available.
  • A missing notification from the compute manager or regional network controller, so the network manager does not send the correct goal state to the network agent.
  • After a container in a VNET tenant moves to a different host, the DHCP response is incorrectly formed, so the guest OS DHCP client fails to take the new lease.

Many more issues are caused by the guest OS, such as:

  • RDP endpoint misconfigured.
  • DHCP disabled.
  • Incorrect firewall configuration.
  • Mistakes in sysprep, so the Windows image stops working after moving to an Azure Hyper-V server.
  • A corrupted VHD, so the OS does not even boot to the login screen.

There are also many Windows OS issues that engineers in Azure have to escalate to the Windows team. The number of options to investigate on the customer side is limited; before contacting customer service, check the basics listed above, such as the RDP endpoint, DHCP, and firewall configuration.

Oftentimes a support ticket has to be filed so that our engineers can collect detailed information and figure out whether it is caused by a platform issue or not. If you suspect a platform issue is the culprit, for instance a deployment stuck in the starting or provisioning state for a long time, contact customer support. We will check all the way from the frontend down to the Hyper-V server, including both the control path and the data path, to mitigate and resolve the incident. Security and privacy are treated very seriously in Azure: without consent from customers, no one is able to check the internal state of the guest VM and/or the attached VHDs. In some cases, cooperation from customers speeds up the investigation significantly, because it allows engineers to read Windows events and logs and perform kernel-mode debugging to understand what exactly prevents RDP from working.

This is a brief, high-level description of how a VM boots up in Azure. Later I may discuss a few components in more detail to give you more insight. Thanks for reading!