Async/Await and Closures
The best way to understand what async/await, anonymous delegates, and closures do is to look at the IL the C# compiler generates and measure the impact they have on performance in microbenchmarks.
Let’s start by defining an async method:
    private static async Task<string> GetValue0(string param1)
    {
        await Task.Delay(0);
        return "Thanks";
    }
Using ildasm we can see what this very simple method turns into. First, let’s start with the async method’s IL:
    .method private hidebysig static class [mscorlib]System.Threading.Tasks.Task`1<string>
            GetValue0(string param1) cil managed
    {
      .custom instance void [mscorlib]System.Diagnostics.DebuggerStepThroughAttribute::.ctor() = ( 01 00 00 00 )
      .custom instance void [mscorlib]System.Runtime.CompilerServices.AsyncStateMachineAttribute::.ctor(class [mscorlib]System.Type)
              = ( 01 00 1B 4D 65 61 73 75 72 65 49 74 2B 3C 47 65   // ...MeasureIt+<Ge
                  74 56 61 6C 75 65 30 3E 64 5F 5F 31 30 32 00 00 ) // tValue0>d__102..
      // Code size 50 (0x32)
      .maxstack 2
      .locals init ([0] valuetype MeasureIt/'<GetValue0>d__102' V_0,
                    [1] valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string> V_1)
      IL_0000: ldloca.s V_0
      IL_0002: call valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<!0> valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string>::Create()
      IL_0007: stfld valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string> MeasureIt/'<GetValue0>d__102'::'<>t__builder'
      IL_000c: ldloca.s V_0
      IL_000e: ldc.i4.m1
      IL_000f: stfld int32 MeasureIt/'<GetValue0>d__102'::'<>1__state'
      IL_0014: ldloca.s V_0
      IL_0016: ldfld valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string> MeasureIt/'<GetValue0>d__102'::'<>t__builder'
      IL_001b: stloc.1
      IL_001c: ldloca.s V_1
      IL_001e: ldloca.s V_0
      IL_0020: call instance void valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string>::Start<valuetype MeasureIt/'<GetValue0>d__102'>(!!0&)
      IL_0025: ldloca.s V_0
      IL_0027: ldflda valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string> MeasureIt/'<GetValue0>d__102'::'<>t__builder'
      IL_002c: call instance class [mscorlib]System.Threading.Tasks.Task`1<!0> valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string>::get_Task()
      IL_0031: ret
    } // end of method MeasureIt::GetValue0
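There are several references to a type called '<GetValue0>d__102' (d__102 for short). This type is the state machine that the compiler generates to handle the await. The valuetype keyword in the IL shows that, in this optimized build, it is a struct rather than a class.
The interesting method in the state machine is the MoveNext method: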
    .method private hidebysig newslot virtual final instance void MoveNext() cil managed
    {
      .override [mscorlib]System.Runtime.CompilerServices.IAsyncStateMachine::MoveNext
      // Code size 166 (0xa6)
      .maxstack 3
      .locals init ([0] bool '<>t__doFinallyBodies',
                    [1] string '<>t__result',
                    [2] class [mscorlib]System.Exception '<>t__ex',
                    [3] int32 CS$0$0000,
                    [4] valuetype [mscorlib]System.Runtime.CompilerServices.TaskAwaiter CS$0$0001,
                    [5] valuetype [mscorlib]System.Runtime.CompilerServices.TaskAwaiter CS$0$0002)
      .try
      {
        IL_0000: ldc.i4.1
        IL_0001: stloc.0
        IL_0002: ldarg.0
        IL_0003: ldfld int32 MeasureIt/'<GetValue0>d__102'::'<>1__state'
        IL_0008: stloc.3
        IL_0009: ldloc.3
        IL_000a: ldc.i4.0
        IL_000b: beq.s IL_0044
        IL_000d: ldc.i4.0
        IL_000e: call class [mscorlib]System.Threading.Tasks.Task [mscorlib]System.Threading.Tasks.Task::Delay(int32)
        IL_0013: callvirt instance valuetype [mscorlib]System.Runtime.CompilerServices.TaskAwaiter [mscorlib]System.Threading.Tasks.Task::GetAwaiter()
        IL_0018: stloc.s CS$0$0001
        IL_001a: ldloca.s CS$0$0001
        IL_001c: call instance bool [mscorlib]System.Runtime.CompilerServices.TaskAwaiter::get_IsCompleted()
        IL_0021: brtrue.s IL_0063
        IL_0023: ldarg.0
        IL_0024: ldc.i4.0
        IL_0025: stfld int32 MeasureIt/'<GetValue0>d__102'::'<>1__state'
        IL_002a: ldarg.0
        IL_002b: ldloc.s CS$0$0001
        IL_002d: stfld valuetype [mscorlib]System.Runtime.CompilerServices.TaskAwaiter MeasureIt/'<GetValue0>d__102'::'<>u__$awaiter103'
        IL_0032: ldarg.0
        IL_0033: ldflda valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string> MeasureIt/'<GetValue0>d__102'::'<>t__builder'
        IL_0038: ldloca.s CS$0$0001
        IL_003a: ldarg.0
        IL_003b: call instance void valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string>::AwaitUnsafeOnCompleted<valuetype [mscorlib]System.Runtime.CompilerServices.TaskAwaiter,valuetype MeasureIt/'<GetValue0>d__102'>(!!0&, !!1&)
        IL_0040: ldc.i4.0
        IL_0041: stloc.0
        IL_0042: leave.s IL_00a5
        IL_0044: ldarg.0
        IL_0045: ldfld valuetype [mscorlib]System.Runtime.CompilerServices.TaskAwaiter MeasureIt/'<GetValue0>d__102'::'<>u__$awaiter103'
        IL_004a: stloc.s CS$0$0001
        IL_004c: ldarg.0
        IL_004d: ldloca.s CS$0$0002
        IL_004f: initobj [mscorlib]System.Runtime.CompilerServices.TaskAwaiter
        IL_0055: ldloc.s CS$0$0002
        IL_0057: stfld valuetype [mscorlib]System.Runtime.CompilerServices.TaskAwaiter MeasureIt/'<GetValue0>d__102'::'<>u__$awaiter103'
        IL_005c: ldarg.0
        IL_005d: ldc.i4.m1
        IL_005e: stfld int32 MeasureIt/'<GetValue0>d__102'::'<>1__state'
        IL_0063: ldloca.s CS$0$0001
        IL_0065: call instance void [mscorlib]System.Runtime.CompilerServices.TaskAwaiter::GetResult()
        IL_006a: ldloca.s CS$0$0001
        IL_006c: initobj [mscorlib]System.Runtime.CompilerServices.TaskAwaiter
        IL_0072: ldstr "Thanks"
        IL_0077: stloc.1
        IL_0078: leave.s IL_0091
      } // end .try
      catch [mscorlib]System.Exception
      {
        IL_007a: stloc.2
        IL_007b: ldarg.0
        IL_007c: ldc.i4.s -2
        IL_007e: stfld int32 MeasureIt/'<GetValue0>d__102'::'<>1__state'
        IL_0083: ldarg.0
        IL_0084: ldflda valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string> MeasureIt/'<GetValue0>d__102'::'<>t__builder'
        IL_0089: ldloc.2
        IL_008a: call instance void valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string>::SetException(class [mscorlib]System.Exception)
        IL_008f: leave.s IL_00a5
      } // end handler
      IL_0091: ldarg.0
      IL_0092: ldc.i4.s -2
      IL_0094: stfld int32 MeasureIt/'<GetValue0>d__102'::'<>1__state'
      IL_0099: ldarg.0
      IL_009a: ldflda valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string> MeasureIt/'<GetValue0>d__102'::'<>t__builder'
      IL_009f: ldloc.1
      IL_00a0: call instance void valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1<string>::SetResult(!0)
      IL_00a5: ret
    } // end of method '<GetValue0>d__102'::MoveNext
This is where the call to Delay is actually made. The same method also handles exceptions (the catch block that calls SetException) and publishes the method’s return value (the call to SetResult). The async/await state machine adds overhead on top of the task itself. This is why tasks are recommended for non-trivial operations like I/O, where that overhead is insignificant compared with the cost of blocking a thread.
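For readers who find C# easier to follow than IL, here is a hand-written approximation of the two listings above. It is only a sketch: the names GetValue0StateMachine, State, Builder and _awaiter are made up for readability (the compiler uses names like '<>1__state' and '<>t__builder'), and the real generated code differs in details such as the '<>t__doFinallyBodies' flag.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

public static class StateMachineSketch
{
    // Hypothetical, readable stand-in for the compiler-generated '<GetValue0>d__102' struct.
    private struct GetValue0StateMachine : IAsyncStateMachine
    {
        public int State;                              // -1 = running, 0 = suspended at the await, -2 = finished
        public AsyncTaskMethodBuilder<string> Builder; // produces and completes the Task<string>
        private TaskAwaiter _awaiter;                  // saved across a suspension

        public void MoveNext()
        {
            string result;
            try
            {
                TaskAwaiter awaiter;
                if (State == 0)
                {
                    // Resuming after the await: restore the awaiter saved earlier.
                    awaiter = _awaiter;
                    _awaiter = default(TaskAwaiter);
                    State = -1;
                }
                else
                {
                    // First entry: this is where Task.Delay is actually called.
                    awaiter = Task.Delay(0).GetAwaiter();
                    if (!awaiter.IsCompleted)
                    {
                        // Not done yet: record where we are and register a continuation.
                        State = 0;
                        _awaiter = awaiter;
                        Builder.AwaitUnsafeOnCompleted(ref awaiter, ref this);
                        return;
                    }
                }
                awaiter.GetResult();
                result = "Thanks";
            }
            catch (Exception ex)
            {
                State = -2;
                Builder.SetException(ex);
                return;
            }
            State = -2;
            Builder.SetResult(result);
        }

        public void SetStateMachine(IAsyncStateMachine stateMachine)
        {
            Builder.SetStateMachine(stateMachine);
        }
    }

    // Equivalent of the GetValue0 stub: create the builder, start the machine, return its task.
    public static Task<string> GetValue0(string param1)
    {
        var machine = new GetValue0StateMachine();
        machine.Builder = AsyncTaskMethodBuilder<string>.Create();
        machine.State = -1;
        machine.Builder.Start(ref machine);
        return machine.Builder.Task;
    }

    public static void Main()
    {
        Console.WriteLine(GetValue0("hello").Result);
    }
}
```

Even in the case where the awaited task is already complete, every call still pays for creating the builder, running MoveNext inside a try/catch, and completing the resulting task.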
A good microbenchmarking tool is Vance Morrison’s MeasureIt, available at http://measureitdotnet.codeplex.com
Let’s look at some different ways of calling the GetValue0 async method by making a benchmark in MeasureIt:
    timer1000.Measure("async delay 0 plain", delegate
    {
        string ret;
        Task<string> task = GetValue0("hello");
        task.Wait();
        ret = task.Result;
    });

    timer1000.Measure("async delay 0 anonymous delegate", delegate
    {
        Task.Run(async () =>
        {
            string ret = await GetValue0("hello");
        }).Wait();
    });

    timer1000.Measure("async delay 0 with closure", delegate
    {
        string ret;
        Task.Run(async () =>
        {
            ret = await GetValue0("hello");
        }).Wait();
    });
The first measurement creates a task, waits for it to finish, and assigns the result to a local variable. Its IL looks like this:
    .method private hidebysig static void '<MeasureAsyncDelay0>b__ac'() cil managed
    {
      .custom instance void [mscorlib]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = ( 01 00 00 00 )
      // Code size 25 (0x19)
      .maxstack 1
      .locals init ([0] class [mscorlib]System.Threading.Tasks.Task`1<string> task)
      IL_0000: ldstr "hello"
      IL_0005: call class [mscorlib]System.Threading.Tasks.Task`1<string> MeasureIt::GetValue0(string)
      IL_000a: stloc.0
      IL_000b: ldloc.0
      IL_000c: callvirt instance void [mscorlib]System.Threading.Tasks.Task::Wait()
      IL_0011: ldloc.0
      IL_0012: callvirt instance !0 class [mscorlib]System.Threading.Tasks.Task`1<string>::get_Result()
      IL_0017: pop
      IL_0018: ret
    } // end of method MeasureIt::'<MeasureAsyncDelay0>b__ac'
We can see the call to GetValue0, the Wait on the task, and the retrieval of the result.
The second measurement wraps the call in an async anonymous delegate and passes it to Task.Run. The IL looks like this:
    .method private hidebysig static void '<MeasureAsyncDelay0>b__ad'() cil managed
    {
      .custom instance void [mscorlib]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = ( 01 00 00 00 )
      // Code size 40 (0x28)
      .maxstack 8
      IL_0000: ldsfld class [mscorlib]System.Func`1<class [mscorlib]System.Threading.Tasks.Task> MeasureIt::'CS$<>9__CachedAnonymousMethodDelegatebd'
      IL_0005: brtrue.s IL_0018
      IL_0007: ldnull
      IL_0008: ldftn class [mscorlib]System.Threading.Tasks.Task MeasureIt::'<MeasureAsyncDelay0>b__ae'()
      IL_000e: newobj instance void class [mscorlib]System.Func`1<class [mscorlib]System.Threading.Tasks.Task>::.ctor(object, native int)
      IL_0013: stsfld class [mscorlib]System.Func`1<class [mscorlib]System.Threading.Tasks.Task> MeasureIt::'CS$<>9__CachedAnonymousMethodDelegatebd'
      IL_0018: ldsfld class [mscorlib]System.Func`1<class [mscorlib]System.Threading.Tasks.Task> MeasureIt::'CS$<>9__CachedAnonymousMethodDelegatebd'
      IL_001d: call class [mscorlib]System.Threading.Tasks.Task [mscorlib]System.Threading.Tasks.Task::Run(class [mscorlib]System.Func`1<class [mscorlib]System.Threading.Tasks.Task>)
      IL_0022: callvirt instance void [mscorlib]System.Threading.Tasks.Task::Wait()
      IL_0027: ret
    } // end of method MeasureIt::'<MeasureAsyncDelay0>b__ad'
This refers to another method called b__ae:
    .method private hidebysig static class [mscorlib]System.Threading.Tasks.Task
            '<MeasureAsyncDelay0>b__ae'() cil managed
    {
      .custom instance void [mscorlib]System.Diagnostics.DebuggerStepThroughAttribute::.ctor() = ( 01 00 00 00 )
      .custom instance void [mscorlib]System.Runtime.CompilerServices.AsyncStateMachineAttribute::.ctor(class [mscorlib]System.Type)
              = ( 01 00 2A 4D 65 61 73 75 72 65 49 74 2B 3C 3C 4D   // ..*MeasureIt+<<M
                  65 61 73 75 72 65 41 73 79 6E 63 44 65 6C 61 79   // easureAsyncDelay
                  30 3E 62 5F 5F 61 65 3E 64 5F 5F 63 32 00 00 )    // 0>b__ae>d__c2..
      .custom instance void [mscorlib]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = ( 01 00 00 00 )
      // Code size 50 (0x32)
      .maxstack 2
      .locals init ([0] valuetype MeasureIt/'<<MeasureAsyncDelay0>b__ae>d__c2' V_0,
                    [1] valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder V_1)
      IL_0000: ldloca.s V_0
      IL_0002: call valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder::Create()
      IL_0007: stfld valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder MeasureIt/'<<MeasureAsyncDelay0>b__ae>d__c2'::'<>t__builder'
      IL_000c: ldloca.s V_0
      IL_000e: ldc.i4.m1
      IL_000f: stfld int32 MeasureIt/'<<MeasureAsyncDelay0>b__ae>d__c2'::'<>1__state'
      IL_0014: ldloca.s V_0
      IL_0016: ldfld valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder MeasureIt/'<<MeasureAsyncDelay0>b__ae>d__c2'::'<>t__builder'
      IL_001b: stloc.1
      IL_001c: ldloca.s V_1
      IL_001e: ldloca.s V_0
      IL_0020: call instance void [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder::Start<valuetype MeasureIt/'<<MeasureAsyncDelay0>b__ae>d__c2'>(!!0&)
      IL_0025: ldloca.s V_0
      IL_0027: ldflda valuetype [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder MeasureIt/'<<MeasureAsyncDelay0>b__ae>d__c2'::'<>t__builder'
      IL_002c: call instance class [mscorlib]System.Threading.Tasks.Task [mscorlib]System.Runtime.CompilerServices.AsyncTaskMethodBuilder::get_Task()
      IL_0031: ret
    } // end of method MeasureIt::'<MeasureAsyncDelay0>b__ae'
The layout of this method should look very familiar when compared with the GetValue0 method from earlier. A new type called d__c2 (again a valuetype, so a struct in this build) was generated to handle the state machine for the async anonymous delegate.
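The b__ad listing also shows that the delegate itself is not allocated on every call: because the lambda captures nothing, the compiler stores it in a static field ('CS$<>9__CachedAnonymousMethodDelegatebd') and creates it only once. A rough C# equivalent, with made-up names (s_cachedDelegate, Body, Allocations) and a stand-in lambda body, might look like this:

```csharp
using System;
using System.Threading.Tasks;

public static class CachedDelegateSketch
{
    // Hypothetical stand-in for the compiler's cached-delegate static field.
    private static Func<Task> s_cachedDelegate;

    // Illustration only: counts how many delegate objects were created.
    public static int Allocations;

    // Stand-in for the '<MeasureAsyncDelay0>b__ae' method holding the lambda body.
    private static async Task Body()
    {
        await Task.Delay(0);
    }

    public static void RunOnce()
    {
        // Mirrors the ldsfld / brtrue.s / newobj / stsfld sequence in the IL:
        // allocate the delegate only if the static field is still null.
        if (s_cachedDelegate == null)
        {
            s_cachedDelegate = new Func<Task>(Body);
            Allocations++;
        }
        Task.Run(s_cachedDelegate).Wait();
    }

    public static void Main()
    {
        for (int i = 0; i < 3; i++)
        {
            RunOnce();
        }
        Console.WriteLine(Allocations);
    }
}
```

The delegate is allocated once, but each call still goes through Task.Run and a second state machine.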
Finally, I’ll remind you that the third measurement looks like this:
    timer1000.Measure("async delay 0 with closure", delegate
    {
        string ret;
        Task.Run(async () =>
        {
            ret = await GetValue0("hello");
        }).Wait();
    });
This uses both an async anonymous delegate and a closure on the “ret” variable. The IL generated is similar, except that the state machine type is now nested inside another class that facilitates the closure.
The closure class holds the variable ret as a field. This means that closures introduce more memory usage: a new closure object is allocated each time the enclosing code runs, and every captured variable becomes a field of that object, which keeps whatever it references alive. Since async methods are generally long running, there is a potential for some of these objects to be promoted to Gen1 or Gen2 during garbage collection.
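To make the closure cost concrete, the sketch below puts a lambda that captures a local next to a hand-written version of roughly what the compiler does with it. The names DisplayClass and Lambda are invented for the example (the compiler generates its own names), and the async part is left out to keep the focus on the capture.

```csharp
using System;

public static class ClosureSketch
{
    // Hypothetical stand-in for the compiler-generated closure class.
    private sealed class DisplayClass
    {
        public string ret;            // the captured local becomes a field

        public void Lambda()          // the lambda body becomes an instance method
        {
            ret = "Thanks";
        }
    }

    // What you write: the lambda captures the local variable 'ret'.
    public static string WithClosure()
    {
        string ret = null;
        Action assign = () => { ret = "Thanks"; };
        assign();
        return ret;
    }

    // Roughly what runs: one heap allocation for the closure object,
    // plus a delegate bound to that object, on every call.
    public static string HandWritten()
    {
        var closure = new DisplayClass();
        Action assign = closure.Lambda;
        assign();
        return closure.ret;
    }

    public static void Main()
    {
        Console.WriteLine(WithClosure() == HandWritten());
    }
}
```

Unlike the non-capturing lambda, the delegate here cannot be cached in a static field, because it is bound to a fresh closure object each time.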
Let’s look at the results of these benchmarks. First, here are the specs of the machine running the tests:
Attribute | Value |
---|---|
Number of Processors | 1 |
Processor Name | Intel(R) Core(TM) i7-3667U CPU @ 2.00GHz |
Processor Mhz | 2001 |
Memory MBytes | 8010 |
L1 Cache KBytes | 64 |
L2 Cache KBytes | 256 |
Operating System | Microsoft Windows 8.1 Enterprise |
Operating System Version | 6.3.9600 |
Stopwatch resolution (nsec) | 410.530 |
CompileType | JIT |
CodeSharing | AppDomainSpecific |
CodeOptimization | Optimized |
And the results:
Name | Median | Mean | StdDev | Min | Max | Samples |
---|---|---|---|---|---|---|
AsyncDelay0: async delay 0 plain [count=1000] | 49.793 | 50.052 | 1.066 | 49.378 | 53.212 | 10 |
AsyncDelay0: async delay 0 anonymous delegate [count=1000] | 2492.746 | 2595.311 | 343.519 | 2276.218 | 3239.793 | 10 |
AsyncDelay0: async delay 0 with closure [count=1000] | 2346.606 | 2503.969 | 355.098 | 2250.466 | 3477.617 | 10 |
Going through Task.Run with an async anonymous delegate, which adds another state machine, raises the median cost roughly 50-fold in this case (2492.746 versus 49.793). Adding a closure does not make much difference in CPU usage, but it can have an impact on memory and GC.
Another example is to wait on multiple simultaneous tasks:
    timer1000.Measure("async delay 0 scenario 1", delegate
    {
        var foo = new Foo();
        Task<string>[] tasks = new Task<string>[]
        {
            GetValue0("1"),
            GetValue0("2"),
            GetValue0("3"),
            GetValue0("4"),
            GetValue0("5"),
        };
        Task.WaitAll(tasks);
        foo.Property1 = tasks[0].Result;
        foo.Property2 = tasks[1].Result;
        foo.Property3 = tasks[2].Result;
        foo.Property4 = tasks[3].Result;
        foo.Property5 = tasks[4].Result;
    });

    timer1000.Measure("async delay 0 scenario 2", delegate
    {
        var foo = new Foo();
        Task.WaitAll(
            Task.Run(async () => foo.Property1 = await GetValue0("1")),
            Task.Run(async () => foo.Property2 = await GetValue0("2")),
            Task.Run(async () => foo.Property3 = await GetValue0("3")),
            Task.Run(async () => foo.Property4 = await GetValue0("4")),
            Task.Run(async () => foo.Property5 = await GetValue0("5"))
        );
    });
The benchmark results:
Name | Median | Mean | StdDev | Min | Max | Samples |
---|---|---|---|---|---|---|
AsyncDelay0: async delay 0 scenario 1 [count=1000] | 322.383 | 347.218 | 50.595 | 319.067 | 465.440 | 10 |
AsyncDelay0: async delay 0 scenario 2 [count=1000] | 3745.622 | 3799.228 | 119.631 | 3679.274 | 4105.078 | 10 |
The gap between the two approaches has narrowed, from roughly 50x to roughly 12x in median time, but it is still about an order of magnitude.
But this isn’t the entire story. The GetValue0 async method delays for 0 ms, whereas a typical I/O operation takes much longer. So let’s change the delay to 1 ms for every call and see the results:
Name | Median | Mean | StdDev | Min | Max | Samples |
---|---|---|---|---|---|---|
AsyncDelay1: async delay 1 plain | 8107306.000 | 8101311.000 | 20085.670 | 8061917.000 | 8133161.000 | 10 |
AsyncDelay1: async delay 1 anonymous delegate | 8086347.000 | 8090461.000 | 32531.830 | 8032332.000 | 8160777.000 | 10 |
AsyncDelay1: async delay 1 with closure | 8090829.000 | 8024352.000 | 215411.100 | 7384197.000 | 8145078.000 | 10 |
AsyncDelay1: async delay 1 scenario 1 | 8098472.000 | 8093254.000 | 120500.100 | 7818342.000 | 8348860.000 | 10 |
AsyncDelay1: async delay 1 scenario 2 | 8096244.000 | 8093088.000 | 75052.620 | 7916373.000 | 8230570.000 | 10 |
A delay of 1 ms per call has completely wiped out the differences between the approaches.
The morals of this story are:
- Use tasks where they make sense, especially for I/O since that is slow enough to be measured in milliseconds
- Unless you’re writing a framework like SignalR, you won’t see much improvement in performance by optimizing away closures or unnecessary async state machines
- Focus instead on other areas:
  - Avoid blocking threads
  - Parallelize I/O operations as much as possible
  - Only use locks when necessary
  - Write async code that’s easy to understand and maintain
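As one illustration of the "avoid blocking threads" and "parallelize I/O" points, the sketch below starts five operations together and awaits them with Task.WhenAll instead of blocking in Task.WaitAll. It is not one of the benchmarks above: GetValue here is a stand-in with a 1 ms delay, and GetAllAsync is a name chosen for the example.

```csharp
using System;
using System.Threading.Tasks;

public static class WhenAllSketch
{
    // Stand-in for an I/O-bound call; the 1 ms delay simulates the I/O.
    private static async Task<string> GetValue(string param1)
    {
        await Task.Delay(1);
        return "Thanks " + param1;
    }

    // Start all five operations first, then await them together.
    // No thread is blocked while the operations are in flight.
    public static async Task<string[]> GetAllAsync()
    {
        Task<string>[] tasks =
        {
            GetValue("1"),
            GetValue("2"),
            GetValue("3"),
            GetValue("4"),
            GetValue("5"),
        };
        return await Task.WhenAll(tasks);
    }

    public static void Main()
    {
        // Blocking here is only acceptable because this is a console entry point.
        string[] results = GetAllAsync().GetAwaiter().GetResult();
        Console.WriteLine(string.Join(", ", results));
    }
}
```

Task.WhenAll returns the results in the same order as the input tasks, so they can be assigned to the corresponding properties by index, as in scenario 1.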