BlockLeftTop, PRELOAD BlockLeftBottom, PRELOAD BlockLeftStretch, PRELOAD BlockTop, PRELOAD BlockBottom, PRELOAD BlockStretch, PRELOAD BlockRightTop, PRELOAD BlockRightBottom, PRELOAD BlockRightStretch, PRELOAD
DeltaEngine

Interesting Stats of the Gaming Business: Bigpoint and Flight Control (iPhone)

by Karsten Wysk 30. April 2009 02:16

We havent updated for quite a while - time to give some inputs over the Browser Games Market:

 

Link: ds-Videointerview: Heiko Hubertz von Bigpoint

and there is also a great article about Sales in the AppStore of the game "Flight Control": http://www.techcrunch.com/2009/04/29/flight-control-sales-stats-offer-fascinating-look-at-inner-workings-of-the-app-store/

FlightControl-SalesNumbers

For vs Foreach Performance

by Benjamin Nitschke 22. April 2009 03:31
Today I had a discussion with a few other programmers (Hi ViperDK and Timo ^^) about foreach and for loop performance. Everyone had his own ideas and experiences, but we talked mostly about XNA on the Xbox 360 (using the compact framework, that does not like foreach, just use for all the time in tight render loops for your XNA games, foreach will create too much IEnumerator garbage that will not be collected) or other special cases (huge arrays or lists).

Anyway, just guessing around is obviously not as good as actually writing a performance test application and see whats going on (see end of this article for download link). Again, we need a disclaimer: You will probably never write loops with 1 billion iterations, these numbers are similar for less iterations, but you should always try to write clean code and ONLY optimize when it is required after finding performance issues! Lets go directly to the results for 1 billion for or foreach iterations in release mode (will loop through a 10k array or list for 100k times, see below for code):



As you can see there are some major differences between using foreach with an array or using a generic list (boxing/unboxing is not the issue here, you would not be able to archive this performance with an ArrayList). Another thing you can notice right away is that using simple arrays is always faster than using dynamic lists, even converting them from a list to an array via .ToArray() and then executing the foreach is much faster.

It also makes a HUGE difference executing this in debug mode (I execute my code almost 99% in debug mode), everything is 4 times slower for this special performance test. Except for Foreach+List, which is in comparison not that bad in debug mode. What is really bad is For+List, which almost takes 16 seconds to execute 1 billion times (again using my Core 2 Duo 6600 CPU with 4 GB Ram).



Thanks to some helper methods the code for this performance test is pretty short and easy (check end of this article for the source code download link):
 Stopwatch timer = new Stopwatch();
 // Run through all tests
 for (int test = 0; test < 8; test++)
 {
     // Find out which test we wanna make
     bool rememberLength = (test / 4) % 2 == 1;
     bool useForeach = (test / 2) % 2 == 1;
     bool useList = test % 2 == 1;

     // Start timer for test 
     timer.Start();

     // Execute
     int result = RunForOrForeachLoops(rememberLength, useForeach, useList);

     // Stop timer
     timer.Stop();

     // Skip this test if we got no result!
     if (result == 0)
         continue;

     // Report results!
     resultText = ...
     Console.WriteLine(resultText + ": " + timer.ElapsedMilliseconds + "ms");

     // Reset timer for next run
     timer.Reset();
 } // for
The interesting code obviously happens in RunForOrForeachLoops. It basically executes the 1 billion iterations in several different ways (8 ways, as you can guess by the for main test for loop). All of those produce the same result, half of the tests use int[] intArray, the other half uses List<int> intList. And again half of the tests uses just for loops, the other half uses foreach for the iterations. Lets examine the useForeach=true and rememberLength=false cases (btw: Iterations is 100000 and the intList and intArrays have a size of 10000 elements):

if (useForeach)
{
    if (useList)
    {
        for (int iter = 0; iter < Iterations; iter++)
        {
            result = 0;
            foreach (int val in intList)
                result += val;
        } // for
    } // if
    else
    {
        for (int iter = 0; iter < Iterations; iter++)
        {
            result = 0;
            foreach (int val in intArray)
                result += val;
        } // for
    } // else
} // if
else // useForeach == false
...

To figure out the differences between those few lines of code in RunForOrForeachLoops, that almost look the same, we have go to IL code once again. First lets check out what the fastest code block Foreach+Array is being compiled to in IL:
result = 0;
foreach (int val in intArray)
    result += val;
This becomes in IL:
 L_0007: ldsfld int32[] TestSimpleForeachApp.Program::intArray
 L_000c: stloc.2 
 L_000d: ldc.i4.0 
 L_000e: stloc.3 
 L_000f: br.s L_001d
 L_0011: ldloc.2 
 L_0012: ldloc.3 
 L_0013: ldelem.i4 
 L_0014: stloc.1 
 L_0015: ldloc.0 
 L_0016: ldloc.1 
 L_0017: add 
 L_0018: stloc.0 
 L_0019: ldloc.3 
 L_001a: ldc.i4.1 
 L_001b: add 
 L_001c: stloc.3 
 L_001d: ldloc.3 
 L_001e: ldloc.2 
 L_001f: ldlen 
 L_0020: conv.i4 
 L_0021: blt.s L_0011
In L_0021 we jump up to L_0011 as long as we are not done with the loop yet. Inside all the result += val happens, but we do not see any slow code, there are no calls to any methods, there is no unboxing and this all can probably be executed quickly after being complied to assembler by the JIT (just-in-time) compiler.

On the other hand, if we look at pretty much the same code, just with intList instead of intArray, the whole story changes:
result = 0;
foreach (int val in intList)
    result += val;
becomes much more complex IL code with lots of function calls, the creation of an Enumerator and there is even a dispose at the end (hello GC):
  L_0007: ldsfld class [mscorlib]System.Collections.Generic.List`1<int32> TestSimpleForeachApp.Program::intList
  L_000c: callvirt instance valuetype [mscorlib]System.Collections.Generic.List`1/Enumerator<!0>
[mscorlib]System.Collections.Generic.List`1<int32>::GetEnumerator() L_0011: stloc.2 L_0012: br.s L_0020 L_0014: ldloca.s CS$5$0000 L_0016: call instance !0 [mscorlib]System.Collections.Generic.List`1/Enumerator<int32>::get_Current() L_001b: stloc.1 L_001c: ldloc.0 L_001d: ldloc.1 L_001e: add L_001f: stloc.0 L_0020: ldloca.s CS$5$0000 L_0022: call instance bool [mscorlib]System.Collections.Generic.List`1/Enumerator<int32>::MoveNext() L_0027: brtrue.s L_0014 L_0029: leave.s L_0039 L_002b: ldloca.s CS$5$0000 L_002d: constrained [mscorlib]System.Collections.Generic.List`1/Enumerator<int32> L_0033: callvirt instance void [mscorlib]System.IDisposable::Dispose() L_0038: endfinally
What even looks worst too me is the code that is executed in MoveNext (get_Current just returns this.current, no extra instructions here). Since the IL code is very long and confusing (31 IL instructions), here is the C# code of MoveNext:
public bool MoveNext()
{
  List<T> list = this.list;
  if ((this.version == list._version) && (this.index < list._size))
  {
    this.current = list._items[this.index];
    this.index++;
    return true;
  }
  return this.MoveNextRare();
}

And MoveNextRare could also be called at the end of MoveNext if something changes for the list (not the case in our example). But we have to keep in mind that this is just IL code, not actual code that will be executed on the CPU. The JIT compiler can still optimize things out, which we won't see here. But something must be going on since Foreach+List is definiately slower than using all the other approaches to get through all the values in intList. I recommend Foreach+Array converted from List because the IL for Foreach+Array is really fast and does not use any extra methods or Enumerators, especially if you know that the array won't change much and if you do not have to call .ToArray every time when you do a foreach.

And here is the sample project for all this performance testing fun:

DLR Performance Comparisons

by Benjamin Nitschke 20. April 2009 05:28
The last few days I was working a little bit with the DLR and got to a point today where I was wondering about the performance of a very simple while loop in the DLR, specifically in the ToyScript DLR sample project and of course IronPython. I am totally aware that the following comparison DOES NOT really provide accurate information about the actual performance of the languages and techniques used. Please keep this in mind when reading this article, it only covers a very certain aspect of all used languages. More specially how they perform when adding numbers in a while loop a lot of times! Some of it reflects that compiled languages are faster than dynamic languages, but you should not compare the absolute values, Python or Lua is not really 100 times slower, in many cases you can reach almost similar performance as long as you implement or call performance-critical code in a clever way (e.g. calling C++ code, using frameworks, etc.).

I was only interested about very ruff comparisons to figure out which ways are fast and which languages or techniques are kinda slow (click image on the right for a quick overview). Some code written in Assembler or direct IL code execution was really fast (duh, no wonder), but all execution times of C++, C#, Assembler or IL are way faster than all other dynamic languages or DLR approaches. The important thing for me was to figure out why dynamic languages like IronPython or ToyScript on top of the DLR were like 100 times slower for this specific code. At the end of writing this article I was able to provide pretty good performance with the DLR, which is just 2-3 times slower than the fastest execution time. That's pretty good and could probably be even improved more.

Basically I used the following code sample, which adds 1.0 to counter 10 million times while decreasing the loops variable until it reaches 0. The following code block is not valid syntax in any of the presented languages, but it is very similar to Python. In languages that use compile time types, counter would be a declared as a double (64bit floating point number), most dynamic languages use double as the run-time type for floating point numbers also.
loops = 10000000
counter = 0.0
while loops > 0
loops = loops - 1
counter = counter + 1.0

print counter

Languages and techniques used and tested (all with detailed code examples and explanations):
Before boring you with all the different implementation details, here is the fancy Code Generation Performance comparison. This article became very long; I just wanted to post a few sentences about the DLR and this is what happened. The following graphs were created with Excel with some help from my good old friend ViperDK. The second graph (the first one was already shown above) shows just the execution time of the while-loop code block in each of the used language or technique.


And this graph also shows the other times like compile & build time where it makes sense or script or DLR start-up times to see how much this could affect the overall time from coding to seeing the result on the screen. This is important for my approach to my own language since I want a extremely low time for the coding to see result loop. Start-up time can be ignored in those cases, but it is still important since it is still kinda slow with the DLR (but will hopefully get better in the future).



Here are the values used to generate those graphs. Tested on my almost 3 years old 2.4 Ghz 6600 CPU; this is probably way faster on my i7@3.6 Ghz at work:
Language or Technique   Code lines   Compile&Build   Script Startup   Parsing&AST   Execution  
C# 3.0 10 ~500ms - - ~37ms
Native C++ 8 1500ms - - ~31ms
Assembler 15+ ~1100ms - - ~15ms
IronPython 2.6 6 - ~300ms DLR 160ms ~9000ms
(~1250ms for files)
ToyScript 8 - ~430ms DLR 103ms ~8400ms
Python 2.6 6 - ~50ms ~100ms ~3500ms
Python 3.1 6 - ~50ms ~100ms ~4500ms
Lua 5.1.4 7 - ~25ms ~150ms ~1400ms
Script.NET 8 - ~100ms - ~39752ms
IL Emit 30+ - ~100ms ~1ms ~16ms
Irony + DLR (dummy) 30 + 6 - 265ms Irony 87ms ~138ms (dummy)
DLR AST with objects 68+ - 433ms DLR 4ms ~459ms
DLR AST with doubles 60+ - 434ms DLR 5ms ~98ms

Implementing and testing the following code was not too bad for me since I was learning how to generate AST statements in the DLR, generating System.Reflection.Emit IL code, doing stuff with the ToyScript DLR sample, etc. anyway. I was of course interested in how to get the best performance for my code generation, but it is useful to understand why each approach is slower or faster. Sorry if the comparison presents your project or technology you might use in a bad light, but for example Script.NET is slow in comparison (this is explained below in more detail), but still fast enough for probably 99% of its uses. It still took much longer than I expected, writing this article did take some time too, but most of the time was spend writing and testing code in each language and technique and then gathering the execution time results. But performance evaluation is always fun :) Again: This is not a representative performance comparison of the languages and techniques involved, it only applies to my very short and stupid test code and should only give you a general idea.

C# 3.0

First of all the very simple implementation of the while loop:

public
static double DoLoop(int loops) {     double counter = 0.0;     while (loops > 0)     {         loops = loops - 1;         counter = counter + 1.0;     }     return counter; } // DoLoop(loops)

And now some code to measure the execution time and to print out the result to see if the loop was executed as many times as we expected:

    long startTime = WindowsHelper.GetPerformanceCounter();
    
    // Do simple performance test in C#
    const int NumberOfLoops = 10000000;
    double result = DoLoop(NumberOfLoops);
    // Note: We did printing in other tests too (does not matter anyway)!
    Console.WriteLine("DoLoop result=" + result);
    
    long endTime = WindowsHelper.GetPerformanceCounter();
    Console.WriteLine("Executing the while loop " + NumberOfLoops +
        " times took: " + WindowsHelper.ConvertToTime(endTime - startTime));

Note that the WindowsHelper class is just providing a little bit more accurate information than just using a StopWatch (which is used in most other samples here), which is fine too. With System.Diagnostics.StopWatch this code looks like the following (actually results in the same results on Windows 7 and Vista):

    Stopwatch timer = new Stopwatch();
    timer.Start();

    // Do simple performance test in C#
    const int NumberOfLoops = 10000000;
    double result = DoLoop(NumberOfLoops);
    //Note: We did printing in other tests too (does not matter much anyway)!
    Console.WriteLine("DoLoop result=" + result);

    timer.Stop();
    Console.WriteLine("Executing the while loop " + NumberOfLoops +
        " times took: " + timer.ElapsedMilliseconds + "ms");

The following output is generated:
DoLoop result=10000000
Executing the while loop 10000000 times took: 37ms

Native C++

Here is the whole program used for testing. This has nothing to do with the DLR or .NET anymore, I just wanted to know how my little while loop would perform in native C++ and Assembler. Sometimes when using for loops the C++ compiler will optimize the code away and just replace it with something static that results in the same value as if the for loop would have been executed. Please also note that it did not make any difference to use a 64bit platform and doubles or just 32bit floats, but since most script languages and especially .NET compile to x64 too, we should not use 64bit numbers when compiling a 32bit application here in C++. Switching to x64 is also not too hard, you just need the correct 64bit libraries and set the linker target platform to X64.

#include <stdio.h>
#include <windows.h>

int main(int argc, char* argv[])
{
    long int before = GetTickCount();

    // C++ version
    int loops = 10000000;
    float counter = 0.0f;
    //64bit code: double counter = 0.0;
    while (loops > 0)
    {
        loops = loops - 1;
        counter = counter + 1.0f;
    }
    
    printf("counter=%f\n", counter);

    long int after = GetTickCount();
    printf("This took %dms\n", (after-before));

    // Wait for user input (not required for Debug.StartWithoutDebugging)
    //char input[128];
    //scanf(input);

    return 0;
}

And after compiling (usually a lot faster than 1s when just incrementally compiling) and running this code, it generates this output:
counter=10000000.000000
This took 31ms
The result is not very accurate, I either get 31ms or 47ms, but not values in between, the accuracy of GetTickCount pretty much sucks and I should use similar code as the GetPerformanceCounter() stuff above from C#, but I do not really care about more accurate values here, C++ is more than fast enough and we do not need to discuss its runtime performance!

Assembler

Based on the C++ solution I was able to write some Assembler instructions. It has been a very long time since I used assembler code at all, probably in 2002 when optimizing some path finding algorithms for ArenaWars, which were called from C++ code, which was called from C# code. Before that in the C++ days I used it now and then, but never for more than some performance critical inner loops. Please also note that sometimes you make non-optimal assembler code (as I probably did in this example) and just having your C++ compiler generate assembler instructions can be better, so always profile your code and only optimize to Assembler if you have no other option left and if you are too lazy to rethink the problem :D

#include <stdio.h>
#include <windows.h>

int main(int argc, char* argv[])
{
    long int before = GetTickCount();
    
    int loops = 10000000;
    float counter = 0;
    float counterInc = 1.0f;
    __asm
    {
startLoop:
        // counter += 1.0f
        fld counter
        fadd counterInc
        fstp counter

        // loops--
        dec loops

        // loops > 0? 
        cmp loops, 0
        // Then goto startLoop (else we are done)
        jg startLoop
    }

    printf("counter=%f\n", counter);

    long int after = GetTickCount();
    printf("This took %dms\n", (after-before));

    return 0;
}

I don't have much to say about this ugly assembler code, as you can see it is a lot more lines of code and this was the first idea not even using registers as you should in assembler. After running this code and getting 31-47ms too (like the C++ version), I added some minor optimization by using registers more explicitly and this is the resulting code, it ran about 2-3 faster than the above code:

#include <stdio.h>
#include <windows.h>

int main(int argc, char* argv[])
{
    long int before = GetTickCount();

    int loops = 10000000;
    float counter = 0;
    float counterInc = 1.0f;
    __asm
    {
        // Copy loop counter to eax register (32bit)
        mov eax, loops
        // Load both float values
        fld counterInc
        fld counter
        
startLoop:
        // counter += 1.0f
        fadd st(0), st(1)
        //st(0) has still the counter value! no need to store: fst st(0)

        // loops--
        dec eax

        // loops > 0? 
        cmp eax, 0
        // Then goto startLoop (else we are done)
        jg startLoop

        // We are done, copy counter value back!
        fstp counter
    }

    printf("counter=%f\n", counter);

    long int after = GetTickCount();
    printf("This took %dms\n", (after-before));

    return 0;
}

This program does produce the following output:
counter=10000000.000000
This took 15ms
This is pretty good and I did not want to optimize it further, especially not when the timer is so inaccurate. Running the code 10 times resulted in 140ms. Again, fast enough, no reason to argue the superior runtime performance. But this comes at a high price since the code is in no way easy to understand, its hard to write and read.

IronPython 2.6

Speaking of code, I would say the implementation in Python is the shortest and very beautiful since it is so easy to read and write:

loops = 10000000
counter = 0
while loops > 0:
    loops = loops - 1
    counter = counter + 1.0
print counter

I just used my ScriptManager from my .NET libraries, which can handle IronPython (and other DLR languages) as well as Lua and wrote this short unit test to see how IronPython performs:

    public void TestPythonDoLoop()
    {
        // Create script manager without any output redirection or special directories
        ScriptManager manager = new ScriptManager(null, null);
        
        Stopwatch timer = new Stopwatch();
        timer.Start();

        // Simply execute line by line as we would in a console window
        manager.ExecuteCode("loops = 10000000", ScriptType.Python);
        manager.ExecuteCode("counter = 0", ScriptType.Python);
        manager.ExecuteCode(@"while loops > 0:
loops = loops - 1
counter = counter + 1", ScriptType.Python);
        manager.ExecuteCode("print counter", ScriptType.Python);

        timer.Stop();
        Console.WriteLine(
            "Executing the while loop 10000000 times in python took: " +
            timer.ElapsedMilliseconds + "ms");
    } // TestPythonDoLoop()

And this results in the following output:
Executing the while loop 10000000 times in python took: 9151ms

Update 2009-04-21: As Simon Davy correctly states in the comments of this post, ipy.exe (IronPython) is actually a lot faster when executing a file with the code above (e.g. use import time). This results in an execution time of just 1250ms, so a lot faster! I have tested it too with ipy, but I entered the code line by line (same way as I called my ScriptManager code in the unit test above) and it still takes 9 seconds. But Simon is right, once I execute everything at once (with ipy test.py) it only takes 1.25s (can also be done with my ScriptManager, but I have not thought of that). So it is probably just an issue with REPL (entering statements one by one .. less optimizations that way). While this certainly does make IronPython a much faster dynamic language than all the other ones, it still is slower than the compiled languages or the DLR code at the end of this article.

Thanks for noticing!This is certainly not good. The same thing happend after I tried using the IronPython 2.6 console or IronPython 2.0, it always took about 9 seconds to complete the big while loop on my PC. I wanted to figure out why this is so slow compared to C#, C++ or Assembler runtime, but then again we have to remember that Python is a dynamic language. The most important reason for the long runtime here is probably just all the boxing and unboxing of values, if you can say that for dynamic languages. What I mean is figuring out which type we got, and then performing the add/sub operation, which can be slow if you do it 10 million times. Please note again that this is probably not affecting any real Python program and performance for more complex programs will be way better. More on the type-performance and this issue in general at the end of this article!

ToyScript

To learn more about the DLR I played around with the DLR ToyScript sample quite a bit the last days. Since a lot of its code is based on the same ideas as IronPython and IronRuby, you can't expect much different performance results. This is the ToyScript code for the while loop:

loops = 10000000
counter = 0
while (loops > 0)
{
    loops = loops - 1
    counter = counter + 1
}
print counter

Similar to the ScriptManager above I wrote some helper classes for the ToyScript language allowing me just to execute code, viewing the AST (abstract syntax tree) and write a lot of unit tests. The following code is such a unit test for the ToyScript class to check the performance for these 10 million while loop iterations:

    public void TestBigLoop()
    {
        const int NumberOfLoops = 10000000;

        Stopwatch timer = new Stopwatch();
        timer.Start();

        // Initialize the language first (takes its own time, around 400ms)
        ToyLanguage language = new ToyLanguage();
        // Execute something else (takes maybe 100ms)
        Assert.Equal(3.0, language.Execute("print 1+2"));

        timer.Stop();
        Console.WriteLine("Startup time: " + timer.ElapsedMilliseconds + "ms");
        timer.Reset();
        timer.Start();

        Assert.Equal((double)NumberOfLoops,
            language.Execute(
@"loops = " + NumberOfLoops + @"
counter = 0
while (loops > 0)
{
loops = loops - 1
counter = counter + 1
}
print counter
"));

        timer.Stop();
        // Show total execution time
        Console.WriteLine("Executing the while loop " + NumberOfLoops +
            " times took: " + timer.ElapsedMilliseconds+"ms");
    } // TestBigLoop()

This code runs for about the same time as IronPython, maybe a little faster since ToyScript has a lot less features than IronPython, but in general it uses the same approach. This is the output from the unit test:
Startup time: 650ms
Executing the while loop 10000000 times took: 8914ms
Investigating the Lambda Expression AST for the code block shows that quite a lot of statements were created in order to make this ToyScript code run on the DLR:

 // Expression: Expression`1
 
 .lambda (<toyblock>$2 Microsoft.Func`1[System.Object] #1)
 
 // LambdaExpression: <toyblock>$2(1)
 .lambda System.Object <toyblock>$2 ()(
 ) {
   .label 0x03d11d12:
.comma {
.block {
.extension AssignmentExtensionExpression ( .extension GlobalVariableExpression ( )
(Object)10000000)
,
.extension AssignmentExtensionExpression ( .extension GlobalVariableExpression ( )
(Object)0)
,
.loop break:.label 0x03fe2d21 {
.block {
.if (
.site (Boolean) CallSiteBinder(ConvertTo to System.Boolean) (
.extension CodeContextExpression ( )
,
.site (Object) CallSiteBinder(DoOperation GreaterThan) (
.extension CodeContextExpression ( )
,
.extension GlobalVariableExpression ( )
,
0
)
)
) {
/*empty*/
} .else {
.block {
/*empty*/,
.break .label 0x03fe2d21
}
}
,
.block {
.extension AssignmentExtensionExpression ( .extension GlobalVariableExpression ( )
.site (Object) CallSiteBinder(DoOperation Subtract) (
.extension CodeContextExpression ( )
,
.extension GlobalVariableExpression ( )
,
1
))
,
.extension AssignmentExtensionExpression ( .extension GlobalVariableExpression ( )
'ToyHelpers.Add'(
(Object).extension GlobalVariableExpression ( )
,
(Object)1
))
,
/*empty*/
},
/*empty*/,
/*empty*/
}
},
'ToyHelpers.Print'(
.extension CodeContextExpression ( )
,
.extension GlobalVariableExpression ( )
),
/*empty*/
},
.null
}
}

The IL code was even more confusing so I only tried to figure out the simple "print 1+2" expression IL code that can be produced by calling ScriptCode.SaveToAssembly(..), more DLR IL code analysis will happen at the end of this article:

  .method public specialname static object <toyblock>$1$1(
class [Microsoft.Scripting]Microsoft.Scripting.Runtime.Scope $scope,
 class [Microsoft.Scripting]Microsoft.Scripting.Runtime.LanguageContext $language) cil managed { .maxstack 3 .locals init ( [0] class [Microsoft.Scripting]Microsoft.Scripting.Runtime.CodeContext context) L_0000: ldarg.0 L_0001: ldarg.1 L_0002: call class [Microsoft.Scripting]Microsoft.Scripting.Runtime.CodeContext
[Microsoft.Scripting]Microsoft.Scripting.Runtime.ScriptingRuntimeHelpers::CreateTopLevelCodeContext(
class
[Microsoft.Scripting]Microsoft.Scripting.Runtime.Scope,
class [Microsoft.Scripting]Microsoft.Scripting.Runtime.LanguageContext) L_0007: stloc.0 L_0008: ldloc.0 L_0009: ldc.r8 1 L_0012: box float64 L_0017: ldc.r8 2 L_0020: box float64 L_0025: call object ToyScript.ToyHelpers::Add(object, object) L_002a: call void ToyScript.ToyHelpers::Print(class [Microsoft.Scripting]Microsoft.Scripting.Runtime.CodeContext, object) L_002f: ldnull L_0030: ret }

There is a lot of extra stuff in there, but you can clearly see the 1+2 in L_0009 to L_00020 with all its boxing fun. Then those two are added with help of a method Add that is defined in the ToyHelpers class (unboxing the doubles again, adding and boxing them to an object again, which is returned). Then the result is passed along to the ToyHelpers.Print method, which just calls Console.WriteLine in my case (usually it is used by the ConsoleHost class for the normal REPL console interaction).

I am not a DLR expert and there is probably a lot of other things going on, but to me all this boxing, unboxing and extra code just to add some numbers is clearly too much for the big loop. At the end of this article I will try to write some DLR code that does less boxing/unboxing.

Python 2.6

After testing the DLR it was time to check out the native Python implementations to see if this is releated to the dynamic nature of dynamic languages like Python or Lua. Well it obviously is related, but more specifically I wanted to see if certain implementaions are maybe more clever or optimized.

The Python code is the same as above for IronPython:

loops = 10000000
counter = 0.0
while loops > 0:
    loops = loops - 1
    counter = counter + 1.0
print counter

Executing this takes about 3.5 seconds, we could cheat a little and use the following code to execute this loop a little faster:

counter = 0.0 for loops in range(0, 10000000):     counter = counter + 1.0 print counter

This runs about 2 times faster (less than 2 seconds) because Python is clever and optimizes for loops. The same actually applies to IronPython, a for loop is much faster here too (5.5s instead of 9.1s). But this defeats the whole point of this exercise, I wanted to know what is slow about the DLR and implementations like IronPython or ToyScript and why. It is nice that Python executes for loops quicker, but the whole counter = counter + 1 thing is still very slow compared to C#, C++, etc.

Python 3.1

Executing Python 3.1 is pretty much the same as Python 2.6. It is slighly slower, but that can also be because I use a alpha build (3.1 alpha 2 from 2009-04-04). Again, executing the for-loop version with range was twice as fast (2.1s instead of 4.3s). In general is is quite known that Python is not one of the fastest languages, there are many dynamic languages that can be faster (for example Ruby or Lua), but there are also much slower ones like PHP. If you are more interested about comparing IronPython with Python read the performance comparisons at the IronPython project at codeplex: http://ironpython.codeplex.com/Wiki/View.aspx?title=IronPython%20Performance

Lua 5.1.4

The lua code is pretty similar to Python, but it executes much faster:

loops = 10000000
counter = 0.0
while (loops > 0) do
    loops = loops-1
    counter = counter+1
end
print(counter)

This executes in less than 1.5 seconds and returns the expected 10000000 (not 10000000.0 as Python however since Lua treats everything as a 64bit double anyway). The for-loop trick works here too and reduces the time in half (0.7s instead of 1.4s):

counter = 0.0
for i=1,10000000 do
    counter = counter+1.0
end
print(counter)

The reason why Lua is much faster than all the other dynamic languages I have tested is probably because Lua has much simpler data types and is optimized a lot for math, for loops and stuff like that. I use Lua more than Python these days and the performance is quite good I have to say (using it for AI, some physics and some game logic). As you can see here in many benchmarks Lua is several times faster than Python (but Python is still a lot more powerful with the huge amount of libraries available for it).

Script.NET

Script.NET is a great project that uses the Irony compiler toolkit to parse and evaluate simple C# code. It does not generate any IL or immediate code, it just evaluates the AST directly. This is a very nice approach and easy to understand because all the code is C#, but there is a slight problem with it: This approach is it VERY slow. It was optimized many times already and runs much faster than a year ago, but I would not recommend it for anything other than some scripting in C# (guess what, that's the whole name of the project). If you do not need high performance Script.NET is just fine for executing a couple lines of code, even for games as long as you do not do any physics or rendering code with it ^^

loops = 10000000;
counter = 0;
while (loops > 0)
{
    loops = loops-1;
    counter = counter+1;
}
Console.WriteLine(counter);

The output appears after almost 40 seconds and displays correctly:
10000000
At least I now know why generating some IL code is very important, with or without the DLR. Next I will investigate IL code generation with Reflection.Emit, but I can still remember the pain from the last time I tried to do something with it.

IL Emit

So let's get right to it, the following code will generate the required IL for us:

    // Create a dynamic method called DoLoop where we put all our IL code
    DynamicMethod DoLoop = new DynamicMethod("DoLoop",
        typeof(double), null, typeof(string).Module);

    // Start a timer once again to measure creation time and execution time!
    Stopwatch timer = new Stopwatch();
    timer.Start();

    // Get an ILGenerator and emit a body for our dynamic method
    ILGenerator il = DoLoop.GetILGenerator();
    
    // Create counter (double)
    LocalBuilder counter = il.DeclareLocal(typeof(double));
    il.Emit(OpCodes.Ldc_R8, 0.0);
    il.Emit(OpCodes.Stloc, counter);
    // Assign loops variable to 10000000 (10 million) for the loop!
    LocalBuilder loops = il.DeclareLocal(typeof(int));
    il.Emit(OpCodes.Ldc_I4, 10000000);
    il.Emit(OpCodes.Stloc, loops);
    
    // Start loop label
    Label startLabel = il.DefineLabel();
    il.MarkLabel(startLabel);
    
    // Do this 10000000 times:    
    // Increase counter by 1.0
    il.Emit(OpCodes.Ldloc, counter);
    il.Emit(OpCodes.Ldc_R8, 1.0);
    il.Emit(OpCodes.Add);
    il.Emit(OpCodes.Stloc, counter);

    // Decrease loops by 1
    il.Emit(OpCodes.Ldloc, loops);
    il.Emit(OpCodes.Ldc_I4, 1);
    il.Emit(OpCodes.Sub);
    il.Emit(OpCodes.Stloc, loops);

    // Check if we still have to loop (as long as loops > 0)
    il.Emit(OpCodes.Ldloc, loops);
    il.Emit(OpCodes.Ldc_I4, 0);
    il.Emit(OpCodes.Bgt, startLabel);

    // End loop and return counter!
    il.Emit(OpCodes.Ldloc, counter);
    il.Emit(OpCodes.Ret);

    // Create a delegate that represents the dynamic method. This
    // action completes the method, and any further attempts to
    // change the method will cause an exception.
    DoLoopDelegate loop = (DoLoopDelegate)
        DoLoop.CreateDelegate(typeof(DoLoopDelegate));

    timer.Stop();
    Console.WriteLine("DoLoop IL creation time: " + timer.ElapsedMilliseconds + "ms");
    timer.Reset();

This creates the dynamic DoLoop method, which we can call with loop() from here on, which will also return the result (counter) once we call it. This all takes almost no time at all:
DoLoop IL creation time: 0ms
Next, we need to invoke this dynamic method, which is rather easy:

    timer.Start();
    Console.WriteLine("Invoking DoLoop() returned: " + loop());
    timer.Stop();
    Console.WriteLine("DoLoop IL execution time: " + timer.ElapsedMilliseconds + "ms");

And this only takes about 16ms and also returns the correct value:
Invoking DoLoop() returned: 10000000
DoLoop IL execution time: 16ms
So IL code is rather fast (faster than everything else here except the fine-tuned assembler, which is similar to this), but don't ask in how many problems I ran into just writing these few lines of code. Let's just say a lot of times I used wrong opcodes, did not know how to implement stuff and finally when stuff compiled I got syntax errors when executing, not really telling me what I did wrong, so I had to comment out code until it worked again and then slowly added code until I could figure out what had to be changed. Let's just say not a pleasant thing to write a compiler this way.

Irony + DLR (dummy)

As mentioned before Script.NET actually uses the Irony compiler toolkit for parsing and building the AST (abstract syntax tree). Irony itself is rather fast (can parse 15000 lines of code or more per second), only the execution part of Script.NET wasn't so good because no code is generated, everything is evaluated on the fly.

While I have started using Irony just a week ago after playing around with ANTLR first (see earlier blog post here), but I still like it very much and I hope Roman Ivantsov (the guy behind Irony) will soon release a long promised update with some cool new features I can sink my teeth into. As you will see in my sample below I have added some own functionality to Irony already to make it easier to test code and I will probably extend it even more when I write more of my language grammar.

If you don't know how Irony works and you are interested in building compiler grammars directly in C# you should check out the project and the Lang.NET talks Roman did this and last year about it, plus the great Irony article on CodeProject of course. Basically you define some non-terminals (variables, numbers) and terminals (logical blocks of your code) in the constructor of a Irony.Grammar derived class and then define some rules for it (in BNF = Backus-Naur Form). Then you set the root node and you are ready to parse some code. Please note that this grammar is very simple and will only handle this sample code and very similar code blocks. I'm still just playing around with grammars, there is no concrete implementation of my language yet, just a lot of ideas .. Writing some grammars is a good training, I have written one in ANTLR and 4 other more complex ones with Irony yet.

    public TestDoLoopGrammar()
    {
        // 1. Terminals
        var variable = new IdentifierTerminal("variable");
        var number = new NumberLiteral("number");

        // 2. Non-terminals
        var program = new NonTerminal("program");
        //var commandList = new NonTerminal("commandList");
        var command = new NonTerminal("command");
        var assignmentOperator = new NonTerminal("assignmentOperator");
        var assignment = new NonTerminal("assignment");
        var whileLoop = new NonTerminal("whileLoop");
        var compareOperator = new NonTerminal("compareOperator");
        var condition = new NonTerminal("condition");
        var binaryOperator = new NonTerminal("binaryOperator");
        var operation = new NonTerminal("operation");

        // 3. BNF rules
        // Please note that this program grammar is rather limited and not really
        // useful, it is only used for testing and learning Irony ..
        // We can have many commands in case we want to go crazy
        program.Rule = MakePlusRule(program, null, command);
        assignmentOperator.Rule = Symbol("=");
        assignment.Rule = variable + assignmentOperator + number;
        whileLoop.Rule = Symbol("while") + condition + ":" +
            program + "end";
        // We only allow some simple compare conditions
        compareOperator.Rule = Symbol("<") | "==" | ">";
        condition.Rule = variable + compareOperator + number;
        // We can either use + or -
        binaryOperator.Rule = Symbol("+") | "-";
        operation.Rule =
            variable + assignmentOperator + variable + binaryOperator + number;
        // Commands can be assignments, while loops or operations
        command.Rule = assignment | whileLoop | operation;

        // 4. Operators precedence and punctuation
        RegisterPunctuation(":", "end");

        // 5. Global settings
        this.CaseSensitive = false;
        this.ThrowGrammarExceptionsOnError = true;

        // 6. Set grammar root
        this.Root = program;
    } // TestDoLoopGrammar()

Note that ThrowGrammarExceptionsOnError is not a property of Irony, I added it to make sure all Grammar errors immediatly cause an exception to be thrown, which makes it much easier for unit tests and figuring out what when wrong. Not really sure why Irony does not do that, I can understand not throwing exceptions for syntax errors when parsing some source code (because that might be slower than just reporting errors in a list form and more helpful because its easier to continue after the error). But grammar errors mean usually that you cannot use the grammar for anything anyway. Some things also just produce a warning, but I throw Exceptions then too because I don't want hidden fixes of my grammar, which would make it hard to fix real problems later.

Anyway, with the TestDoLoopGrammar class we can now test and parse some code. Irony usually expects you to use its Grammar Explorer program to do this job, which is a pretty nice application, but it is not unit test and we all know I'm all about unit testing (don't we?). I still use the Irony Grammar Explorer from time to time and I recommend it very much if you want to test out grammars and visualize the AST a little better than just throwing some text into a console window. The following simple unit test will parse our while-loop program source code and produce a nice AST for us, which it kindly prints out:

    public void GenerateDoLoopAST()
    {
        AstNode program = IronyHelpers.ParseWithErrorHandling(
            new TestDoLoopGrammar(),
@"loops = 10000000
counter = 0.0
while loops > 0:
    loops = loops - 1
    counter = counter + 1.0
end");

        IronyHelpers.PrintAst(program);
    }

This will quickly print out the AST in the following way, which might seem a little complex and verbose at first, but you have to remember that we do not have to generate code for every single node. Often several nodes together will just be one single instruction. Other nodes like program are just used to hold code blocks together and allow several child nodes.

 program
   assignment
     loops [variable]
     assignmentOperator
       = [Symbol]
     10000000 [number]
   assignment
     counter [variable]
     assignmentOperator
       = [Symbol]
     0 [number]
   whileLoop
     while [variable]
     condition
       loops [variable]
       compareOperator
         > [Symbol]
       0 [number]
     program
       operation
         loops [variable]
         assignmentOperator
           = [Symbol]
         loops [variable]
         binaryOperator
           - [Symbol]
         1 [number]
       operation
         counter [variable]
         assignmentOperator
           = [Symbol]
         counter [variable]
         binaryOperator
           + [Symbol]
         1 [number]
     end [variable]

Now this tree has to be translated to some code. We could use Reflection.Emit for this, but we just saw how much work it was for this very simple sample. Thats the reason I'm using the DLR instead, which also gives my language some cool dynamic capabilities for free (plus the other benefits like interacting with other DLR languages). But since the DLR is not a piece of cake and I'm not done exploring it yet (just started last two week with it, gimme more time), I will not present a full project with this language here yet (totally out of the scope of this article). I used a modified version of ToyScript and some other sample projects (following next) to generate some DLR code here, but it is still hacked up very much. I will probably write another long article when I'm done with that and have some cool results to show.

DLR AST with objects

Okay, so let's dig deeper into the DLR. While I got some parts of the above Irony AST working in the DLR with help of the ToyScript sample, the whole DLR framework, the ToyScript sample and especially IronPython is still very hard to understand and its also kinda annoying that recent code drops do not work at all with ToyScript. They compile, but running ToyScript will just produce funny errors like "node not reducible", which is not very helpful if you just want to learn about the DLR. A lot of stuff in the DLR still constantly changes (just see the huge amount of code drops at dlr.codeplex.com). I totally like that the DLR team releases all this wonderful code and does it so often, but without any working documentation (almost everything you find on the internet is based on DLR versions from a year ago, when stuff worked in a totally different way) and not even the included samples working all the time is kinda painful. So what does a programmer do in such a case? Yes, he writes tests and figures it out on his own. If there were not this many classes that all do almost nothing in functionality (what the hell, in my projects each class has many hundred lines of code, but in the DLR it seems like 50% of all classes are just 10-30 lines long) it would a much easier job. Debugging is also much harder because stack-traces become 2 miles long after just calling one of two actual functions, that do not just call other functions. Okay, enough ranting, let's just hope these things get better in the final 1.0 release since Microsoft is not allowing any contributions from the community to the DLR and I certainly do not want to write my own. It is pretty amazing what the DLR can do and how well it performs, it just has to become easier to use. Someone had to say it :)

So the bare minimum you need for the DLR to compile some AST for you is a LanguageContext and a DefaultBinder, define one class of each and derive it accordingly. Then define a constructor for your language context, set the binder and while you are at it add the default System (mscorlib) assembly to the script domain manager. Note this code uses DLR change set 21259 (from Apr 5 2009), but I could also use the more recent 22459 version (from today), but I reverted to make the ToyScript sample work again.

    public YourLanguageContext(ScriptDomainManager manager,
        IDictionary<string, object> options)
        : base(manager)
    {
        Binder = new YourBinder(manager);
        manager.LoadAssembly(typeof(string).Assembly);
    } // YourLanguageContext(manager, options)

Next you need to provide an implementation of ParseSourceCode to make the compiler happy and to actually allow your program to parse some source code in your language:

    protected override ScriptCode CompileSourceCode(SourceUnit sourceUnit,
        CompilerOptions options, ErrorSink errorSink)
    {
        // Create lambda expression
        LambdaBuilder program = AstUtils.Lambda(typeof(object), "YourLanguage");
        program.Body = Expression.Block(GenerateStatements(program, sourceUnit.GetCode()));
        LambdaExpression ast = program.MakeLambda();
    
        Expression<DlrMainCallTarget> globalAst =
            new GlobalLookupRewriter().RewriteLambda(ast);
    
        return new LegacyScriptCode(globalAst, sourceUnit);
} // ParseSourceCode(context)

The interesting thing here is how we build an DLR AST from statements (Expression.Block) with help of the LambdaBuilder. The statements do not exist yet, that's what the GenerateStatements method will be used for. We pass the source code to this method and let its do its magic (this is a reduced version, the actual unit test uses a more complex version with some extra code paths and statements to make sure we can initialize the DLR first and to generate code for the next technique too):

    private List<Expression> GenerateStatements(LambdaBuilder program, string code)
    {
        List<Expression> statements = new List<Expression>();

        // Try simple while loop for performance testing!
        // counter = 0.0
        Expression counter = program.Variable(typeof(object), "counter");
        statements.Add(
            Expression.Assign(
                counter,
                Expression.Convert(
                    Expression.Constant(0.0), typeof(object))
            )
        );
        Expression loops = program.Variable(typeof(object), "loops");
        statements.Add(
            Expression.Assign(
                loops,
                Expression.Convert(
                    Expression.Constant((int)10000000), typeof(object))
            )
        );
        
        // while (loops > 0) { counter = counter + 1; loops = loops - 1; }
        statements.Add(
            AstUtils.While(
                Expression.GreaterThan(
                    Expression.Convert(loops, typeof(int)),
                    Expression.Constant(0)
                ),
                Expression.Block(
                    Expression.Assign(counter,
                        Expression.Convert(
                            Expression.Add(
                                Expression.Convert(counter, typeof(double)),
                                Expression.Constant(1.0)
                            ), typeof(object))),
                    Expression.Assign(loops,
                        Expression.Convert(
                            Expression.Subtract(
                                Expression.Convert(loops, typeof(int)),
                                AstUtils.Constant(1)
                            ), typeof(object)))
                ),
                AstUtils.Empty()
            )
        );
        
        // Print out results of counter to the console!
        MethodInfo consoleWriteLine = typeof(Console).GetMethod("WriteLine",
            new Type[] { typeof(object) });
        statements.Add(
            Expression.Call(
                consoleWriteLine,
                counter
            )
        );

        return statements;
    } // GenerateStatements(program, code)

This is obviously not dynamic code, it just adds a bunch of expression statements to a list and returns it. Please also note that I tried to use object for counter and loops to demonstrate this in a similar way ToyScript or IronPython will generate code (thats how I figured out some of these expressions). But since .NET or the DLR does not know how to add two objects for example and does not even allow to assign a number to an object, we have to do a lot of casting, which is in essense just boxing and unboxing.

This code is then executed by a unit test, which does basically create a ScriptRuntime for us, makes sure it uses this language implementation and then calls engine.CreateScriptSourceFromString with the code to be executed. Please note that GenerateStatements completely ignores the source code, it will always generate the same statements right now.

    Stopwatch timer = new Stopwatch();
    timer.Start();

    // Create runtime
    ScriptRuntimeSetup setup = new ScriptRuntimeSetup();
    Type provider = typeof(YourNamespace.YourLanguageContext);
    setup.LanguageSetups.Add(
        new LanguageSetup(provider.AssemblyQualifiedName, provider.Name));
    ScriptRuntime runtime = new ScriptRuntime(setup);

    // Load Engine
    ScriptEngine engine =
        runtime.GetEngineByTypeName(provider.AssemblyQualifiedName);
    timer.Stop();
    Console.WriteLine("Create DLR runtime: " + timer.ElapsedMilliseconds + "ms");
    timer.Reset();
    timer.Start();

    // Execute command
    ScriptSource code = engine.CreateScriptSourceFromString("Objects");
    code.Execute();

    timer.Stop();
    Console.WriteLine("Execute code: " + timer.ElapsedMilliseconds + "ms");

This test will spend about 430ms in the DLR runtime creation (which is kinda slow and will hopefully be faster in the future) and that is the reason it is checked separately. Then it will spend a few ms for the statement creation (4ms as mentioned in the table above) and finally it will execute the code in about 459ms, which is a lot faster than the other DLR languages. But then again, we cheated quite a bit by providing the exact statements we need. The only thing we can optimize is to get rid of all those boxing and unboxing actions. While it is nice to be under half a second for these 10 million while loop iterations with one add to counter and one subtract from loops statement each, it is still a far cry from the performance we have seen in the compiled languages. Let's check out the IL code generated from the DLR AST to see whats going on here:
{
  .maxstack 5
  .locals init (
    [0] class [Microsoft.Scripting]Microsoft.Scripting.Runtime.CodeContext context,
    [1] object obj2,
    [2] object obj3)
  L_0000: ldarg.0 
  L_0001: ldarg.1 
  L_0002: call class [Microsoft.Scripting]Microsoft.Scripting.Runtime.CodeContext
[Microsoft.Scripting]Microsoft.Scripting.Runtime.ScriptingRuntimeHelpers::CreateTopLevelCodeContext(
class [Microsoft.Scripting]Microsoft.Scripting.Runtime.Scope,
class [Microsoft.Scripting]Microsoft.Scripting.Runtime.LanguageContext) L_0007: stloc.0 L_0008: ldc.r8 0 L_0011: box float64 L_0016: stloc.1 L_0017: ldc.i4 0x989680 L_001c: box int32 L_0021: stloc.2 L_0022: ldloc.2 L_0023: unbox.any int32 L_0028: ldc.i4.0 L_0029: cgt L_002b: brfalse L_0035 L_0030: br L_003a L_0035: br L_0063 L_003a: ldloc.1 L_003b: unbox.any float64 L_0040: ldc.r8 1 L_0049: add L_004a: box float64 L_004f: stloc.1 L_0050: ldloc.2 L_0051: unbox.any int32 L_0056: ldc.i4.1 L_0057: sub L_0058: box int32 L_005d: stloc.2 L_005e: br L_0022 L_0063: ldloc.1 L_0064: call void [mscorlib]System.Console::WriteLine(object) L_0069: ldnull L_006a: ret }

It might be a little hard to figure out which line does what, but once you found the constants like 0x989680 (which is just 10000000) and the add and sub operations, you can get an idea how the statements were translated. The code between L_0035 and L_0063 is the inner loop and the performance critical code. There are 2 unboxing and 2 boxing operations in that code and probably some other unnecessary assignments.

DLR AST with doubles

While playing around with the code even more and getting to a version like the one that will be shown in this section, I tried to understand why the DLR did certain things in the way that it does. A lot of dynamic behavior is only possible when you use very generic types, but I still think for my language implementation it should be possible to use a much more strict rule set and only allow dynamic code where necessary or wanted, not in thinks like loops or math, where we deal with numbers anyway. Lua is a great example of that, numbers are just numbers. They are always doubles, so there is no need to box or unbox them all the time, which is probably one of the reasons it can handle math better than other dynamic languages.

I also found this useful explanation of what IL the DLR emits for dynamic code, it could explain why stuff is slower than IL code, but still a lot faster than evaluating everything ever time. The DLR does a great job of caching and optimizing dynamic code, but as you can see the IL generated is quite long and verbose too, which is ok for dynamic code I guess. I'm just interested in ways to optimize this further to at least provide proof that the DLR is in fact just as good as a compiler for more static languages.

After the last section it is quite obvious what we have to do do generate statements that can be executed quicker: Get rid of all that boxing and unboxing, use specific data types and make sure all the assignments use the correct format too (else convert constants or other variables). This should not be too hard to do in an actual language once we figure out that we are only dealing with numbers (similar to Lua we could only allow one kind of number, like doubles). This is the updated code for GenerateStatements:

    private List<Expression> GenerateStatements(LambdaBuilder program, string code)
    {
        List<Expression> statements = new List<Expression>();

        // Try simple while loop for performance testing!
        // counter = 0.0
        Expression counter = program.Variable(typeof(double), "counter");
        statements.Add(
            Expression.Assign(
                counter,
                Expression.Constant(0.0)
            )
        );
        Expression loops = program.Variable(typeof(double), "loops");
        statements.Add(
            Expression.Assign(
                loops,
                Expression.Constant(10000000.0)
            )
        );

        // while (loops > 0) { counter = counter + 1; loops = loops - 1; }
        statements.Add(
            AstUtils.While(
                Expression.GreaterThan(
                    loops,
                    Expression.Constant(0.0)
                ),
                Expression.Block(
                    Expression.Assign(counter,
                        Expression.Add(counter,
                            Expression.Constant(1.0))),
                    Expression.Assign(loops,
                        Expression.Subtract(loops,
                            AstUtils.Constant(1.0)))
                ),
                AstUtils.Empty()
            )
        );

        // Print out results of counter to the console!
        MethodInfo consoleWriteLine = typeof(Console).GetMethod("WriteLine",
            new Type[] { typeof(double) });
        statements.Add(
            Expression.Call(
                consoleWriteLine,
                counter
            )
        );

        return statements;
    } // GenerateStatements(program, code)

BTW: To analyze the IL code I use the ScriptCode.SaveToAssembly method and use Reflector to show the IL code. As you can see the code has become shorter and we can't see any signs of boxing and unboxing anymore. The code is not even optimal because we use doubles for everything (remember that the C#, C++ and Assembler samples all used a simple integer for the while loop), but as the execution time with less than 100ms shows us, the performance is quite good with this approach.
{
  .maxstack 5
  .locals init (
    [0] class [Microsoft.Scripting]Microsoft.Scripting.Runtime.CodeContext context,
    [1] float64 num,
    [2] float64 num2)
  L_0000: ldarg.0 
  L_0001: ldarg.1 
  L_0002: call class [Microsoft.Scripting]Microsoft.Scripting.Runtime.CodeContext
[Microsoft.Scripting]Microsoft.Scripting.Runtime.ScriptingRuntimeHelpers::CreateTopLevelCodeContext(
class
[Microsoft.Scripting]Microsoft.Scripting.Runtime.Scope,
class [Microsoft.Scripting]Microsoft.Scripting.Runtime.LanguageContext) L_0007: stloc.0 L_0008: ldc.r8 0 L_0011: stloc.1 L_0012: ldc.r8 10000000 L_001b: stloc.2 L_001c: ldloc.2 L_001d: ldc.r8 0 L_0026: cgt L_0028: brfalse L_0032 L_002d: br L_0037 L_0032: br L_0054 L_0037: ldloc.1 L_0038: ldc.r8 1 L_0041: add L_0042: stloc.1 L_0043: ldloc.2 L_0044: ldc.r8 1 L_004d: sub L_004e: stloc.2 L_004f: br L_001c L_0054: ldloc.1 L_0055: call void [mscorlib]System.Console::WriteLine(float64) L_005a: ldnull L_005b: ret }

This is a more complex version of the above unit test from last section, the 10000000 is printed by the DoLoop DLR code:
runtime creation: 336ms
get engine: 16ms
create script: 4ms
Init code and first script startup time: 232ms
10000000
execute again (DoLoop code): 98ms
and shutdown: 0ms
Okay, I would say this blog post is long enough now and I wrote all day on it and the code samples presented here. Could even be the longest blog post on my blog (html is 150kb). Who knows .. it is probably a good idea to go to bed now.

Good fight, good night, till next time :)

Lang.NET Symposium 2009 Talks online now!

by Benjamin Nitschke 16. April 2009 19:43
I've been waiting for the latest Irony release the last few days, but luck was not on my side because the good old Roman Ivantsov, who is fearlessly leading the Irony project, was too busy and not able to drop a build yet. But he had a talk at the Lang.NET Symposium 2009 that was going on the last 3 days.

Since I'm starting to write my own DLR language the last few days. I already posted about the DLR and Antlr here, but I switched using Irony together with DLR now, in case you are interested in more links, check out DlrCalculator from the Script.NET guy and the Irony articles on CodeProject.

I will write more about this project when it matures a little bit more, right now I just have a couple of sample projects, lots of unit tests to try stuff out and a very long TODO list and document about what I want to implement. The project itself could go on for a very long time (>1 year). I already attempted to write a language service few years back and got pretty far with the VSSDK, but writing the language itself, the parser, AST generator, etc. was too much work back then. Thankfully with the new cool tools like Irony and the DLR these tasks have become much easier.

Anyway, time to watch some talks from the Lang.NET Symposium 2009, the first batch of videos was just released today, more talks will hopefully follow soon. Thank you very much Microsoft making these available :)


I will start watching the Keynote by Jason Zander - Microsoft,
  • then watch C# 4.0 Dynamic by Mads Torgersen - Microsoft,
  • Intro to a Visual Studio Language Service by Ted Neward,
  • Irony by Roman Ivantsov and
  • IronPython Scripting in MS Dynamics by Roman Ivantsov,
  • The Visual Studio 2010 Editor by Jack Tilford,
  • The Second Life Cloud by Jim Purbrick - Linden Lab and
  • then check out the rest of the talks :)


Update: I just heard in the Keynote by Jason Zander that a VS2010 beta release is imminent, very exciting stuff. Finally I can use a beta VS again, VS2008 is getting boring :D

Intel X25-M Firmware Update fixes write speed

by Benjamin Nitschke 15. April 2009 16:56
Intel just released a Drive Firmware Update Tool for its Intel X18-M and X25-M Solid State Drives (SSDs). You can read more about it here (info) and here (pdf) or just download the firmware update.

I have been using my Intel SSD for less than a month now and I did not really notice any major slowdowns, but thanks to this firmware I'm confident the drive will stay fast and very usable for our big projects. At home I'm pretty happy with my old crappy OCZ Core SSD (crappy because the J-Micron controller just sucks and freezes the whole PC after 50+ open files) because Windows 7 handles it quite nicely and my projects are usually small enough. The only annoyance is with ASP.NET web projects because the IIS is quite slow and because of other issues like the slow HttpWebRequest.

Maybe in 2 Months when the new i7 950 CPUs (replacement of i7 940) come out I will think about some new PC stuff for my almost 3 year old Dual-Core PC at home, but for now I'm pretty happy. The only real improvement besides the CPU I could make is probably more ram (which is pretty cheap anyway these days) and a much faster SSD like the Intel X25-M, which I use at work, but thats a little expensive .. maybe Intel will think about reducing the prices of their SSDs after many competitors (especially OCZ) are catching up and getting better and better.

HttpWebRequest is very slow and taking forever on Windows 7

by Benjamin Nitschke 7. April 2009 17:23
Update 2009-08-06: This issue was fixed in Windows 7 RC and Windows 7 RTM :)

I was going nuts yesterday night after I wrote a couple of unit tests that involved the C# HttpWebRequest class. For some strange reason it always takes 20-30 seconds to create a HttpWebRequest with an url and calling the GetResponse method on Windows 7. I was thinking maybe the server I'm contacting is kinda slow, but it was really fast in the browser. So maybe my SSD drive is messed up? Ok, tried on a normal HDD, same thing. Okay, then maybe TestDriven.net is doing something strange, lets write a simple console application. Wtf, still the same issue! Then I tried it on a different Windows 7 PC, same issue.

I gave up (was very late anyway) and went to work to test it again in the morning on my work PC (also Windows 7). Still the same freaking issue, it takes forever to get anything connected with HttpWebRequest and HttpWebResponse. Getting the actual data is like 30 times faster (and should not matter anyway), still slower than I remembered, but the issue was just in the following two lines (first line took 2-5s, second line 20-25s):

    HttpWebRequest httpRequest = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse httpResponse = (HttpWebResponse)httpRequest.GetResponse();

Maybe this isn't the right comic, but its always good to make fun of not-perfectly-working operating systems:




Back to the Problem: I rewrote my console test application again (from last night) and compiled it on my work PC:

using System;
using System.Net;

namespace TestHttpRequest
{
    class Program
    {
        static void Main(string[] args)
        {
            //WTF is going on here? Takes 30s on Windows 7, <100ms on Windows Vista
            string url = "http://www.google.com";

            Console.WriteLine("Connecting: " + url);
            HttpWebRequest httpRequest = (HttpWebRequest)WebRequest.Create(url);
            HttpWebResponse httpResponse = (HttpWebResponse)httpRequest.GetResponse();

            Console.WriteLine("Connected ..");
            if (httpResponse.StatusCode == HttpStatusCode.OK)
            {
                // Dummy
                Console.WriteLine("Getting data ..");
                httpResponse.GetResponseStream();
            } // if
            else
                Console.WriteLine("Failed to get any data ..");

            Console.WriteLine("Done ..");
        }
    }
}


Guess what, still the same issue, takes around 30 seconds to do this very simple job in Windows 7. Could not really figure out the real cause of this (I'm not really that interested, I just wanted to write a couple of unit tests to make some things easier), but it seems like the Conhost.exe file (Console Window Host) of Windows 7 is doing a lot while this is going on (I even have different Windows 7 versions installed, it is the same issue on all of them). Could not see any networking going on (maybe 0.01 kb every 5 seconds) either. If I completely disable networking the GetResponse immediately crashes (it does not take 2-3s for the WebRequest.Create to do its work, and the exception occurs instantly), but that is not really helpful because I wanted to test the response ..

Instead of guessing what this could be about I just wanted to make sure this is a problem with my Windows 7, so I gave my test console application to my colleague that is running Windows Vista and the thing ran in about 20ms, there is nothing wrong with the code, it runs perfectly fine on a non-Windows 7 machine!

So I tried the same stuff in different ways, like using the WebClient class, same result, still 30 seconds for the request.

            // Same stuff with webclient
            WebClient webClient = new WebClient();
            Console.WriteLine("Downloading: " + url);
            byte[] data = webClient.DownloadData(url);
            Console.WriteLine("Done ..");
            
Next I implemented the same stuff with sockets (more specifically TcpClient, which is based on Sockets, but a little easier to use), which is kinda ugly and more work, but at least I finally got the response time I was looking for (<100ms). Microsoft should maybe look into this Windows 7 issue .. for my unit tests I can now use my own HttpWebResponse class and all problems are gone :)

            // Another try, this time with TcpClient ..
TcpClient client = new TcpClient("www.google.com", 80);

// Get the stream from the tcp client (allows us to read and write)
NetworkStream stream = client.GetStream();

// Write the http request ourselfs (Note: Needs 2 empty lines at the end)
byte[] requestBytes = Encoding.ASCII.GetBytes(
@"GET / HTTP/1.1
Host: www.google.com
Connection: Close

"
);
// Send this ound
stream.Write(requestBytes, 0, requestBytes.Length);
stream.Flush();

// We can't use DataAvailable here because we don't know when the data
// will arrive. So just call ReadByte until it returns -1 and then
// we know we are done (Note: This will block our application,
// if you really care about performance write it asyncronly).
Console.WriteLine("Getting data ..");
string result = "";
int aByte = 0;
do
{
aByte = stream.ReadByte();
if (aByte >= 0)
result += (char)aByte;
} while (aByte >= 0);
Console.WriteLine("Result=" + result);
client.Close();

English should be the most important programmer language

by Benjamin Nitschke 3. April 2009 13:32
Pretty good blog post from Jeff Atwood at CodingHorror.com:
http://www.codinghorror.com/blog/archives/001248.html

Lots of interesting comments too and a nice comic:

Best April 1st Joke

by Benjamin Nitschke 1. April 2009 16:05
is YouTube, everything is inverted (just click on any video), this is so hilarious!

Opera's Face Gestures feature is also pretty funny:

Most other things were pretty obvious, but still a lot of work to put up, like Blizzards Star-Craft 2 Terra-Tron or A new character for Diablo 3 (the Shush spell is pretty good) or the WoW DanceBattle-System or WoW Pimp My Mount. World of Quake-Live was also nicely made, but overall this year the jokes were all too obvious (or maybe I'm getting old ^^).
Disclaimer: The opinions expressed in this blog are own personal opinions and do not represent the companies view.
© 2000-2011 exDream GmbH & MobileBits GmbH. All rights reserved. Legal/Impressum

Poll

Which platform should Soulcraft be released on next?











Show Results Poll Archive

Recent Games

Soulcraft

Fireburst

Jobs @ exDream

Calendar

<<  April 2014  >>
MoTuWeThFrSaSu
31123456
78910111213
14151617181920
21222324252627
2829301234
567891011