Async/Await and Closures

The best way to understand what async/await, anonymous delegates, and closures do is to look at the IL the C# compiler generates and measure the impact they have on performance in microbenchmarks.

Let’s start by defining an async method:

private static async Task<string> GetValue0(string param1)
{
    await Task.Delay(0);
    return "Thanks";
}

Using ildasm we can see what this very simple method turns into. First, let’s start with the async method’s IL:

.method private hidebysig static class [mscorlib]System.Threading.Tasks.Task`1<string> 
        GetValue0(string param1) cil managed
{
  .custom instance void [mscorlib]System.Diagnostics.DebuggerStepThroughAttribute::.ctor() = ( 01 00 00 00 ) 
  .custom instance void [mscorlib]System.Runtime.CompilerServices.AsyncStateMachineAttribute::.ctor(class [mscorlib]System.Type) = ( 01 00 1B 4D 65 61 73 75 72 65 49 74 2B 3C 47 65   // ...MeasureIt+<Ge
                                                                                                                                     74 56 61 6C 75 65 30 3E 64 5F 5F 31 30 32 00 00 ) // tValue0>d__102..
  // Code size       50 (0x32)
  .maxstack  2
  .locals init ([0] valuetype MeasureIt/'<GetValue0>d__102' V_0,
           [1] valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string> V_1)
  IL_0000:  ldloca.s   V_0
  IL_0002:  call       valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<!0> valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string>::Create()
  IL_0007:  stfld      valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string> MeasureIt/'<GetValue0>d__102'::'<>t__builder'
  IL_000c:  ldloca.s   V_0
  IL_000e:  ldc.i4.m1
  IL_000f:  stfld      int32 MeasureIt/'<GetValue0>d__102'::'<>1__state'
  IL_0014:  ldloca.s   V_0
  IL_0016:  ldfld      valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string> MeasureIt/'<GetValue0>d__102'::'<>t__builder'
  IL_001b:  stloc.1
  IL_001c:  ldloca.s   V_1
  IL_001e:  ldloca.s   V_0
  IL_0020:  call       instance void valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string>::Start<valuetype MeasureIt/'<GetValue0>d__102'>(!!0&)
  IL_0025:  ldloca.s   V_0
  IL_0027:  ldflda     valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string> MeasureIt/'<GetValue0>d__102'::'<>t__builder'
  IL_002c:  call       instance class [mscorlib]System.Threading.Tasks.Task`1<!0> valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string>::get_Task()
  IL_0031:  ret
} // end of method MeasureIt::GetValue0

There are several references to a type called <GetValue0>d__102. This compiler-generated type is the state machine that handles the await. The IL declares it as a valuetype, so in this build it is a struct rather than a class.

The interesting method on the state machine is MoveNext:

.method private hidebysig newslot virtual final 
        instance void  MoveNext() cil managed
{
  .override [mscorlib]System.Runtime.CompilerServices.IAsyncStateMachine::MoveNext
  // Code size       166 (0xa6)
  .maxstack  3
  .locals init ([0] bool '<>t__doFinallyBodies',
           [1] string '<>t__result',
           [2] class [mscorlib]System.Exception '<>t__ex',
           [3] int32 CS$0$0000,
           [4] valuetype [mscorlib]System.Runtime.CompilerServices.TaskAwaiter CS$0$0001,
           [5] valuetype [mscorlib]System.Runtime.CompilerServices.TaskAwaiter CS$0$0002)
  .try
  {
    IL_0000:  ldc.i4.1
    IL_0001:  stloc.0
    IL_0002:  ldarg.0
    IL_0003:  ldfld      int32 MeasureIt/'<GetValue0>d__102'::'<>1__state'
    IL_0008:  stloc.3
    IL_0009:  ldloc.3
    IL_000a:  ldc.i4.0
    IL_000b:  beq.s      IL_0044
    IL_000d:  ldc.i4.0
    IL_000e:  call       class [mscorlib]System.Threading.Tasks.Task [mscorlib]System.Threading.Tasks.Task::Delay(int32)
    IL_0013:  callvirt   instance valuetype [mscorlib]System.Runtime.CompilerServices.TaskAwaiter [mscorlib]System.Threading.Tasks.Task::GetAwaiter()
    IL_0018:  stloc.s    CS$0$0001
    IL_001a:  ldloca.s   CS$0$0001
    IL_001c:  call       instance bool [mscorlib]System.Runtime.CompilerServices.TaskAwaiter::get_IsCompleted()
    IL_0021:  brtrue.s   IL_0063
    IL_0023:  ldarg.0
    IL_0024:  ldc.i4.0
    IL_0025:  stfld      int32 MeasureIt/'<GetValue0>d__102'::'<>1__state'
    IL_002a:  ldarg.0
    IL_002b:  ldloc.s    CS$0$0001
    IL_002d:  stfld      valuetype [mscorlib]System.Runtime.CompilerServices.TaskAwaiter MeasureIt/'<GetValue0>d__102'::'<>u__$awaiter103'
    IL_0032:  ldarg.0
    IL_0033:  ldflda     valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string> MeasureIt/'<GetValue0>d__102'::'<>t__builder'
    IL_0038:  ldloca.s   CS$0$0001
    IL_003a:  ldarg.0
    IL_003b:  call       instance void valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string>::AwaitUnsafeOnCompleted<valuetype [mscorlib]System.Runtime.CompilerServices.TaskAwaiter,valuetype MeasureIt/'<GetValue0>d__102'>(!!0&,
                                                                                                                                                                                                                                                             !!1&)
    IL_0040:  ldc.i4.0
    IL_0041:  stloc.0
    IL_0042:  leave.s    IL_00a5
    IL_0044:  ldarg.0
    IL_0045:  ldfld      valuetype [mscorlib]System.Runtime.CompilerServices.TaskAwaiter MeasureIt/'<GetValue0>d__102'::'<>u__$awaiter103'
    IL_004a:  stloc.s    CS$0$0001
    IL_004c:  ldarg.0
    IL_004d:  ldloca.s   CS$0$0002
    IL_004f:  initobj    [mscorlib]System.Runtime.CompilerServices.TaskAwaiter
    IL_0055:  ldloc.s    CS$0$0002
    IL_0057:  stfld      valuetype [mscorlib]System.Runtime.CompilerServices.TaskAwaiter MeasureIt/'<GetValue0>d__102'::'<>u__$awaiter103'
    IL_005c:  ldarg.0
    IL_005d:  ldc.i4.m1
    IL_005e:  stfld      int32 MeasureIt/'<GetValue0>d__102'::'<>1__state'
    IL_0063:  ldloca.s   CS$0$0001
    IL_0065:  call       instance void [mscorlib]System.Runtime.CompilerServices.TaskAwaiter::GetResult()
    IL_006a:  ldloca.s   CS$0$0001
    IL_006c:  initobj    [mscorlib]System.Runtime.CompilerServices.TaskAwaiter
    IL_0072:  ldstr      "Thanks"
    IL_0077:  stloc.1
    IL_0078:  leave.s    IL_0091
  }  // end .try
  catch [mscorlib]System.Exception 
  {
    IL_007a:  stloc.2
    IL_007b:  ldarg.0
    IL_007c:  ldc.i4.s   -2
    IL_007e:  stfld      int32 MeasureIt/'<GetValue0>d__102'::'<>1__state'
    IL_0083:  ldarg.0
    IL_0084:  ldflda     valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string> MeasureIt/'<GetValue0>d__102'::'<>t__builder'
    IL_0089:  ldloc.2
    IL_008a:  call       instance void valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string>::SetException(class [mscorlib]System.Exception)
    IL_008f:  leave.s    IL_00a5
  }  // end handler
  IL_0091:  ldarg.0
  IL_0092:  ldc.i4.s   -2
  IL_0094:  stfld      int32 MeasureIt/'<GetValue0>d__102'::'<>1__state'
  IL_0099:  ldarg.0
  IL_009a:  ldflda     valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string> MeasureIt/'<GetValue0>d__102'::'<>t__builder'
  IL_009f:  ldloc.1
  IL_00a0:  call       instance void valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string>::SetResult(!0)
  IL_00a5:  ret
} // end of method '<GetValue0>d__102'::MoveNext

This is where the call to Delay is actually made (IL_000e). The catch block hands any exception to the builder through SetException, and the normal exit path publishes the return value through SetResult. The async/await state machine adds overhead on top of the tasks themselves. This is why tasks are recommended for non-trivial operations like I/O, where that overhead is insignificant compared with the cost of blocking a thread.
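The IL is easier to follow next to a hand-written C# equivalent. The following is only a rough sketch of what the compiler produces for GetValue0, not its literal output: the names StateMachineSketch, GetValue0StateMachine, State, Builder, and _awaiter are made up for readability, while the builder and awaiter calls are the same ones that appear in the IL above.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

static class StateMachineSketch
{
    // Hypothetical stand-in for the generated '<GetValue0>d__102' type.
    private struct GetValue0StateMachine : IAsyncStateMachine
    {
        public int State;                               // '<>1__state'
        public AsyncTaskMethodBuilder<string> Builder;  // '<>t__builder'
        private TaskAwaiter _awaiter;                   // '<>u__$awaiter103'

        public void MoveNext()
        {
            string result;
            try
            {
                TaskAwaiter awaiter;
                if (State != 0)
                {
                    // First entry: start the operation and check for fast completion.
                    awaiter = Task.Delay(0).GetAwaiter();
                    if (!awaiter.IsCompleted)
                    {
                        // Not done yet: save state and ask to be resumed later.
                        State = 0;
                        _awaiter = awaiter;
                        Builder.AwaitUnsafeOnCompleted(ref awaiter, ref this);
                        return;
                    }
                }
                else
                {
                    // Resumed after the await: restore the saved awaiter.
                    awaiter = _awaiter;
                    _awaiter = default(TaskAwaiter);
                    State = -1;
                }

                awaiter.GetResult(); // rethrows if the awaited task faulted
                result = "Thanks";
            }
            catch (Exception ex)
            {
                State = -2;
                Builder.SetException(ex);
                return;
            }

            State = -2;
            Builder.SetResult(result);
        }

        public void SetStateMachine(IAsyncStateMachine stateMachine)
        {
            Builder.SetStateMachine(stateMachine);
        }
    }

    // Mirrors the stub method: create the builder, set state to -1, start, return the task.
    public static Task<string> GetValue0(string param1)
    {
        var stateMachine = new GetValue0StateMachine();
        stateMachine.Builder = AsyncTaskMethodBuilder<string>.Create();
        stateMachine.State = -1;
        stateMachine.Builder.Start(ref stateMachine);
        return stateMachine.Builder.Task;
    }
}
```

Calling StateMachineSketch.GetValue0("hello").Result returns "Thanks", the same as the original async method. Because Task.Delay(0) is already complete when it is returned, this particular call takes the fast path and never reaches AwaitUnsafeOnCompleted.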

A good microbenchmarking tool is Vance Morrison’s MeasureIt, available at http://measureitdotnet.codeplex.com
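For readers without MeasureIt at hand, the general idea behind such a harness can be sketched with Stopwatch. This is not MeasureIt's implementation; MiniBench and MedianMicroseconds are hypothetical names, and the sketch assumes that "count" means iterations per sample and that each sample is the total time for those iterations.

```csharp
using System;
using System.Diagnostics;

static class MiniBench
{
    // Runs `body` `count` times per sample and returns the median
    // sample time in microseconds across `samples` samples.
    public static double MedianMicroseconds(Action body, int count = 1000, int samples = 10)
    {
        body(); // warm-up call so JIT compilation is not part of the measurement

        var times = new double[samples];
        for (int s = 0; s < samples; s++)
        {
            var stopwatch = Stopwatch.StartNew();
            for (int i = 0; i < count; i++)
            {
                body();
            }
            stopwatch.Stop();
            times[s] = stopwatch.Elapsed.TotalMilliseconds * 1000.0;
        }

        // The median is less sensitive to outliers than the mean.
        Array.Sort(times);
        int mid = samples / 2;
        return samples % 2 == 1 ? times[mid] : (times[mid - 1] + times[mid]) / 2.0;
    }
}
```

A call such as MiniBench.MedianMicroseconds(() => System.Threading.Tasks.Task.Delay(0).Wait()) gives a single number to compare between variants. Absolute values depend heavily on the machine and runtime.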

Let’s look at some different ways of calling the GetValue0 async method by making a benchmark in MeasureIt:

timer1000.Measure("async delay 0 plain", delegate
{
    string ret;
    Task<string> task = GetValue0("hello");
    task.Wait();
    ret = task.Result;
});
timer1000.Measure("async delay 0 anonymous delegate", delegate
{
    Task.Run(async () => { string ret = await GetValue0("hello"); }).Wait();
});
timer1000.Measure("async delay 0 with closure", delegate
{
    string ret;
    Task.Run(async () => { ret = await GetValue0("hello"); }).Wait();
});

The first measurement creates a task, waits for it to finish, and assigns the result to a local variable. Its IL looks like this:

.method private hidebysig static void  '<MeasureAsyncDelay0>b__ac'() cil managed
{
  .custom instance void [mscorlib]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = ( 01 00 00 00 ) 
  // Code size       25 (0x19)
  .maxstack  1
  .locals init ([0] class [mscorlib]System.Threading.Tasks.Task`1<string> task)
  IL_0000:  ldstr      "hello"
  IL_0005:  call       class [mscorlib]System.Threading.Tasks.Task`1<string> MeasureIt::GetValue0(string)
  IL_000a:  stloc.0
  IL_000b:  ldloc.0
  IL_000c:  callvirt   instance void [mscorlib]System.Threading.Tasks.Task::Wait()
  IL_0011:  ldloc.0
  IL_0012:  callvirt   instance !0 class [mscorlib]System.Threading.Tasks.Task`1<string>::get_Result()
  IL_0017:  pop
  IL_0018:  ret
} // end of method MeasureIt::'<MeasureAsyncDelay0>b__ac'

We can see the call to GetValue0, the Wait on the task, and the retrieval of the result, which is then discarded with pop because ret is never read.

The second measurement creates an anonymous delegate to run the task. The IL looks like this:

.method private hidebysig static void  '<MeasureAsyncDelay0>b__ad'() cil managed
{
  .custom instance void [mscorlib]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = ( 01 00 00 00 ) 
  // Code size       40 (0x28)
  .maxstack  8
  IL_0000:  ldsfld     class [mscorlib]System.Func`1<class [mscorlib]System.Threading.Tasks.Task> MeasureIt::'CS$<>9__CachedAnonymousMethodDelegatebd'
  IL_0005:  brtrue.s   IL_0018
  IL_0007:  ldnull
  IL_0008:  ldftn      class [mscorlib]System.Threading.Tasks.Task MeasureIt::'<MeasureAsyncDelay0>b__ae'()
  IL_000e:  newobj     instance void class [mscorlib]System.Func`1<class [mscorlib]System.Threading.Tasks.Task>::.ctor(object,
                                                                                                                       native int)
  IL_0013:  stsfld     class [mscorlib]System.Func`1<class [mscorlib]System.Threading.Tasks.Task> MeasureIt::'CS$<>9__CachedAnonymousMethodDelegatebd'
  IL_0018:  ldsfld     class [mscorlib]System.Func`1<class [mscorlib]System.Threading.Tasks.Task> MeasureIt::'CS$<>9__CachedAnonymousMethodDelegatebd'
  IL_001d:  call       class [mscorlib]System.Threading.Tasks.Task [mscorlib]System.Threading.Tasks.Task::Run(class [mscorlib]System.Func`1<class [mscorlib]System.Threading.Tasks.Task>)
  IL_0022:  callvirt   instance void [mscorlib]System.Threading.Tasks.Task::Wait()
  IL_0027:  ret
} // end of method MeasureIt::'<MeasureAsyncDelay0>b__ad'

This refers to another method called b__ae:

.method private hidebysig static class [mscorlib]System.Threading.Tasks.Task 
        '<MeasureAsyncDelay0>b__ae'() cil managed
{
  .custom instance void [mscorlib]System.Diagnostics.DebuggerStepThroughAttribute::.ctor() = ( 01 00 00 00 ) 
  .custom instance void [mscorlib]System.Runtime.CompilerServices.AsyncStateMachineAttribute::.ctor(class [mscorlib]System.Type) = ( 01 00 2A 4D 65 61 73 75 72 65 49 74 2B 3C 3C 4D   // ..*MeasureIt+<<M
                                                                                                                                     65 61 73 75 72 65 41 73 79 6E 63 44 65 6C 61 79   // easureAsyncDelay
                                                                                                                                     30 3E 62 5F 5F 61 65 3E 64 5F 5F 63 32 00 00 )    // 0>b__ae>d__c2..
  .custom instance void [mscorlib]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = ( 01 00 00 00 ) 
  // Code size       50 (0x32)
  .maxstack  2
  .locals init ([0] valuetype MeasureIt/'<<MeasureAsyncDelay0>b__ae>d__c2' V_0,
           [1] valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder V_1)
  IL_0000:  ldloca.s   V_0
  IL_0002:  call       valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder::Create()
  IL_0007:  stfld      valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder MeasureIt/'<<MeasureAsyncDelay0>b__ae>d__c2'::'<>t__builder'
  IL_000c:  ldloca.s   V_0
  IL_000e:  ldc.i4.m1
  IL_000f:  stfld      int32 MeasureIt/'<<MeasureAsyncDelay0>b__ae>d__c2'::'<>1__state'
  IL_0014:  ldloca.s   V_0
  IL_0016:  ldfld      valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder MeasureIt/'<<MeasureAsyncDelay0>b__ae>d__c2'::'<>t__builder'
  IL_001b:  stloc.1
  IL_001c:  ldloca.s   V_1
  IL_001e:  ldloca.s   V_0
  IL_0020:  call       instance void [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder::Start<valuetype MeasureIt/'<<MeasureAsyncDelay0>b__ae>d__c2'>(!!0&)
  IL_0025:  ldloca.s   V_0
  IL_0027:  ldflda     valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder MeasureIt/'<<MeasureAsyncDelay0>b__ae>d__c2'::'<>t__builder'
  IL_002c:  call       instance class [mscorlib]System.Threading.Tasks.Task [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder::get_Task()
  IL_0031:  ret
} // end of method MeasureIt::'<MeasureAsyncDelay0>b__ae'

The layout of this method should look very familiar when compared with the GetValue0 method from earlier. A new state machine type, <<MeasureAsyncDelay0>b__ae>d__c2, was generated for the async anonymous delegate.

Finally, I’ll remind you that the third measurement looks like this:

timer1000.Measure("async delay 0 with closure", delegate
{
    string ret;
    Task.Run(async () => { ret = await GetValue0("hello"); }).Wait();
});

This uses both an async anonymous delegate and a closure on the “ret” variable. The IL generated is similar, except the state machine type is now nested inside another class that facilitates the closure.

The variable ret becomes a field of the closure class. This means that closures increase memory usage: a new closure object is allocated each time the code that creates the async anonymous delegate runs, and every captured variable is stored as a field of that object. Since async methods are generally long running, some of these objects may be promoted to Gen1 or Gen2 during garbage collection.
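To make the closure concrete, here is a rough C# sketch of the shape the compiler gives the third measurement. It is an illustration, not the generated code: ClosureSketch, RetClosure, Body, and Run are hypothetical names (the real class gets a compiler-chosen name along the lines of <>c__DisplayClass...).

```csharp
using System;
using System.Threading.Tasks;

static class ClosureSketch
{
    private static async Task<string> GetValue0(string param1)
    {
        await Task.Delay(0);
        return "Thanks";
    }

    // Hypothetical stand-in for the compiler-generated closure class.
    private sealed class RetClosure
    {
        // The captured local is no longer on the stack; it is a heap field.
        public string ret;

        // Body of the async anonymous delegate. It is an instance method
        // so that it can reach `ret`, and it gets its own state machine.
        public async Task Body()
        {
            ret = await GetValue0("hello");
        }
    }

    public static string Run()
    {
        // One heap allocation each time the enclosing code runs.
        var closure = new RetClosure();
        Task.Run(new Func<Task>(closure.Body)).Wait();
        return closure.ret;
    }
}
```

ClosureSketch.Run() returns "Thanks". Compared with the second measurement, the delegate can no longer be cached in a static field, because each one is bound to its own closure instance.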

Let’s look at the results of these benchmarks. First, here are the specs of the machine running the tests:

Attribute                      Value
Number of Processors           1
Processor Name                 Intel(R) Core(TM) i7-3667U CPU @ 2.00GHz
Processor Mhz                  2001
Memory MBytes                  8010
L1 Cache KBytes                64
L2 Cache KBytes                256
Operating System               Microsoft Windows 8.1 Enterprise
Operating System Version       6.3.9600
Stopwatch resolution (nsec)    410.530
CompileType                    JIT
CodeSharing                    AppDomainSpecific
CodeOptimization               Optimized

And the results:

Name                                                          Median     Mean       StdDev    Min        Max        Samples
AsyncDelay0: async delay 0 plain [count=1000]                 49.793     50.052     1.066     49.378     53.212     10
AsyncDelay0: async delay 0 anonymous delegate [count=1000]    2492.746   2595.311   343.519   2276.218   3239.793   10
AsyncDelay0: async delay 0 with closure [count=1000]          2346.606   2503.969   355.098   2250.466   3477.617   10

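The median cost of wrapping the call in Task.Run with an async anonymous delegate, which adds a second state machine and queues the work to the thread pool, is significant in this case: roughly 50 times the plain call. Adding a closure does not make much difference in CPU usage, but it can have an impact on memory and GC.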

Another example is to wait on multiple simultaneous tasks:

timer1000.Measure("async delay 0 scenario 1", delegate
{
    var foo = new Foo();
    Task<string>[] tasks = new Task<string>[] {
        GetValue0("1"),
        GetValue0("2"),
        GetValue0("3"),
        GetValue0("4"),
        GetValue0("5"),
    };
    Task.WaitAll(tasks);
    foo.Property1 = tasks[0].Result;
    foo.Property2 = tasks[1].Result;
    foo.Property3 = tasks[2].Result;
    foo.Property4 = tasks[3].Result;
    foo.Property5 = tasks[4].Result;
});
timer1000.Measure("async delay 0 scenario 2", delegate
{
    var foo = new Foo();
    Task.WaitAll(
        Task.Run(async () => foo.Property1 = await GetValue0("1")),
        Task.Run(async () => foo.Property2 = await GetValue0("2")),
        Task.Run(async () => foo.Property3 = await GetValue0("3")),
        Task.Run(async () => foo.Property4 = await GetValue0("4")),
        Task.Run(async () => foo.Property5 = await GetValue0("5"))
        );
});

The benchmark results:

Name                                                  Median     Mean       StdDev    Min        Max        Samples
AsyncDelay0: async delay 0 scenario 1 [count=1000]    322.383    347.218    50.595    319.067    465.440    10
AsyncDelay0: async delay 0 scenario 2 [count=1000]    3745.622   3799.228   119.631   3679.274   4105.078   10

The gap between the two approaches has narrowed, but scenario 2 is still roughly 12 times slower at the median, about an order of magnitude.

But this isn’t the entire story. The GetValue0 async method delays for 0 ms, whereas a typical I/O operation is long running. So let’s change the delay to 1 ms for every call to GetValue0 and see the results:

Name                                              Median        Mean          StdDev       Min           Max           Samples
AsyncDelay1: async delay 1 plain                  8107306.000   8101311.000   20085.670    8061917.000   8133161.000   10
AsyncDelay1: async delay 1 anonymous delegate     8086347.000   8090461.000   32531.830    8032332.000   8160777.000   10
AsyncDelay1: async delay 1 with closure           8090829.000   8024352.000   215411.100   7384197.000   8145078.000   10
AsyncDelay1: async delay 1 scenario 1             8098472.000   8093254.000   120500.100   7818342.000   8348860.000   10
AsyncDelay1: async delay 1 scenario 2             8096244.000   8093088.000   75052.620    7916373.000   8230570.000   10

A delay of 1 ms per call to GetValue0 has completely wiped out the differences between the approaches.

The morals of this story are:
