Jacob Carpenter’s Weblog

October 20, 2011

Hello Roslyn

Filed under: csharp, Roslyn — Jacob @ 6:55 am
using System;
using Roslyn.Compilers.CSharp;

namespace HelloRoslyn
{
  class Program
  {
    static void Main()
    {
      string program = Syntax.CompilationUnit(
        usings: Syntax.List(Syntax.UsingDirective(name: Syntax.ParseName("System"))),
        members: Syntax.List<MemberDeclarationSyntax>(
          Syntax.NamespaceDeclaration(
            name: Syntax.ParseName("HelloRoslyn"),
            members: Syntax.List<MemberDeclarationSyntax>(
              Syntax.ClassDeclaration(
                identifier: Syntax.Identifier("Program"),
                members: Syntax.List<MemberDeclarationSyntax>(
                  Syntax.MethodDeclaration(
                    returnType: Syntax.PredefinedType(Syntax.Token(SyntaxKind.VoidKeyword)),
                    modifiers: Syntax.TokenList(Syntax.Token(SyntaxKind.StaticKeyword)),
                    identifier: Syntax.ParseToken("Main"),
                    parameterList: Syntax.ParameterList(),
                    bodyOpt: Syntax.Block(
                      statements: Syntax.List<StatementSyntax>(
                        Syntax.ExpressionStatement(
                          Syntax.InvocationExpression(
                            Syntax.MemberAccessExpression(
                              kind: SyntaxKind.MemberAccessExpression,
                              expression: Syntax.IdentifierName("Console"),
                              name: Syntax.IdentifierName("WriteLine"),
                              operatorToken: Syntax.Token(SyntaxKind.DotToken)),
                            Syntax.ArgumentList(
                              arguments: Syntax.SeparatedList(
                                Syntax.Argument(
                                  expression: Syntax.LiteralExpression(
                                    kind: SyntaxKind.StringLiteralExpression,
                                    token: Syntax.Literal("\"Hello world\"", "Hello world")
                                  )
                                )
                              )
                            )
                          )
                        )
                      )
                    )
                  )
                )
              )
            )
          )
        )).Format().GetFullText();

      Console.WriteLine(program);
    }
  }
}

January 7, 2010

Reading large xml files

Filed under: csharp, extension methods — Jacob @ 12:16 am

I’m a huge fan of System.Xml.Linq or “LINQ to XML”. However, some documents really are just too large to efficiently process with an in-memory representation like XDocument. For such documents, we need to consume the xml with a streaming XmlReader instead.

As much as I love System.Xml.Linq, that’s how much I hate XmlReader. I don’t know why it is, but every time I have to use an XmlReader, I have to go back to the documentation. And working with an XmlReader rarely feels fun.

At work (by the way, we’re hiring all kinds of developers), we’ve written some really nice code to make reading xml easier. But I’m not at work, and I wanted to process a large set of xml data—namely, the Project Gutenberg catalog in RDF/XML format. So I came up with a simple, efficient solution that I want to share.

The Project Gutenberg catalog data looks something like this:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:cc="http://web.resource.org/cc/"
         xmlns:pgterms="http://www.gutenberg.org/rdfterms/">

    <cc:Work rdf:about="">
        <cc:license rdf:resource="http://creativecommons.org/licenses/GPL/2.0/" />
    </cc:Work>

    <cc:License rdf:about="http://creativecommons.org/licenses/GPL/2.0/">
        <!-- cc:license children omitted -->
    </cc:License>

    <rdf:Description rdf:about="">
        <dc:created>
            <dcterms:W3CDTF>
                <rdf:value>2010-01-05</rdf:value>
            </dcterms:W3CDTF>
        </dc:created>
    </rdf:Description>

    <pgterms:etext rdf:ID="etext14624">
        <dc:publisher>&pg;</dc:publisher>
        <dc:title rdf:parseType="Literal">Santa Claus's Partner</dc:title>
        <dc:creator rdf:parseType="Literal">Page, Thomas Nelson, 1853-1922</dc:creator>
        <pgterms:friendlytitle rdf:parseType="Literal">Santa Claus's Partner by Thomas Nelson Page</pgterms:friendlytitle>
        <dc:language><dcterms:ISO639-2><rdf:value>en</rdf:value></dcterms:ISO639-2></dc:language>
        <dc:subject><dcterms:LCSH><rdf:value>Christmas stories</rdf:value></dcterms:LCSH></dc:subject>
        <dc:subject><dcterms:LCC><rdf:value>PZ</rdf:value></dcterms:LCC></dc:subject>
        <dc:created><dcterms:W3CDTF><rdf:value>2005-01-06</rdf:value></dcterms:W3CDTF></dc:created>
        <dc:rights rdf:resource="&lic;" />
    </pgterms:etext>

    <!-- etc. -->

</rdf:RDF>

Let’s first look at the wrong way to read this data:

static void Main()
{
    XNamespace nsGutenbergTerms = "http://www.gutenberg.org/rdfterms/";
    XNamespace nsRdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    XDocument doc = XDocument.Load("catalog.rdf");
    foreach (XElement etext in doc.Root.Elements(nsGutenbergTerms + "etext"))
    {
        string id = (string) etext.Attribute(nsRdf + "ID");
        string title = (string) etext.Element(nsGutenbergTerms + "friendlytitle");

        Console.WriteLine("{0}: {1}", id, title);
    }
}

A couple of problems:

  1. speed—the program sits around for 5 seconds or so before outputting anything, while it loads the 128MB xml file into memory.
  2. memory usage—loading the 128MB file pushes the memory usage from 10,328K to 731,832K (as reported in task manager). I don’t want to read too much into that value, but we can certainly agree that loading the whole file into memory at once isn’t optimal.

This is the worst of both worlds: the program is slower than it needs to be, and it uses more memory than it should.

… but did I mention that I love LINQ to XML? Processing each etext element as an XElement instance is really convenient.

Ideally, we would combine the efficiency of streaming the large xml file through an XmlReader with the convenience of handling each etext element as an XElement instance.

Cue Patrick Stewart saying, “Make it so”:

static void Main()
{
    XNamespace nsGutenbergTerms = "http://www.gutenberg.org/rdfterms/";
    XNamespace nsRdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

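    // the catalog uses entities such as &pg; and &lic;, so DTD processing has to be allowed
    // (on .NET 4 and later, ProhibitDtd is obsolete; set DtdProcessing = DtdProcessing.Parse instead)
    using (XmlReader reader = XmlReader.Create("catalog.rdf",
        new XmlReaderSettings { ProhibitDtd = false }))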
    {
        // move the reader to the start of the content and read the root element's start tag
        //   that is, the reader is positioned at the first child of the root element
        reader.MoveToContent();
        reader.ReadStartElement("RDF", nsRdf.NamespaceName);

        foreach (XElement etext in reader.ReadElements(nsGutenbergTerms + "etext"))
        {
            string id = (string) etext.Attribute(nsRdf + "ID");
            string title = (string) etext.Element(nsGutenbergTerms + "friendlytitle");

            Console.WriteLine("{0}: {1}", id, title);
        }
    }
}

Apart from its similarity to the previous code block, the most interesting part of this code is the ReadElements extension method.

Before calling ReadElements, the code positions the reader on the first child of the root element. Then, ReadElements is called with an XName referring to the etext element. All of the etext elements are returned as a sequence.

This is exactly what I want: the program starts processing etext elements nearly instantly, and the memory utilization is barely noticeable.

Let’s look at the implementation of ReadElements:

/// <summary>
/// Returns a sequence of <see cref="XElement">XElements</see> corresponding to the currently
/// positioned element and all following sibling elements which match the specified name.
/// </summary>
/// <param name="reader">The xml reader positioned at the desired hierarchy level.</param>
/// <param name="elementName">An <see cref="XName"/> representing the name of the desired element.</param>
/// <returns>A sequence of <see cref="XElement">XElements</see>.</returns>
/// <remarks>At the end of the sequence, the reader will be positioned on the end tag of the parent element.</remarks>
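public static IEnumerable<XElement> ReadElements(this XmlReader reader, XName elementName)
{
    // XElement.ReadFrom leaves the reader on the node *after* the element it consumed, so test the
    // current node before advancing; otherwise a matching element that immediately follows is skipped.
    while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
    {
        // compare LocalName, not Name: Name includes the prefix (e.g. "pgterms:etext")
        if (reader.NodeType == XmlNodeType.Element &&
            reader.LocalName == elementName.LocalName && reader.NamespaceURI == elementName.NamespaceName)
            yield return (XElement) XElement.ReadFrom(reader);
        else if (!reader.ReadToNextSibling(elementName.LocalName, elementName.NamespaceName))
            yield break;
    }
}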

The documentation comments should be pretty self-explanatory, but it’s probably important to call attention to the side effects: ReadElements expects an intentionally positioned xml reader. Once ReadElements is done returning XElements, the reader will be positioned at the end element of the initially positioned element’s parent.

I should also point out it would be trivial to add an overload of ReadElements that didn’t take an XName and simply returned a sequence of the initially positioned element and all of its following siblings. But I don’t need that method yet, so I didn’t write it.
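
For the curious, here is a minimal sketch of what that overload might look like. I haven't needed it in anger, so treat it as an assumption rather than tested production code; the class name XmlReaderExtensions is just a placeholder. It follows the same contract: the reader must already be positioned at the desired hierarchy level, and it stops on the parent's end tag.

```csharp
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

public static class XmlReaderExtensions
{
    // Hypothetical overload: yields the currently positioned element (if any)
    // and every following sibling element, regardless of name.
    public static IEnumerable<XElement> ReadElements(this XmlReader reader)
    {
        while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
        {
            if (reader.NodeType == XmlNodeType.Element)
                yield return (XElement) XElement.ReadFrom(reader); // moves the reader past the element
            else
                reader.Read(); // skip whitespace, comments, text, and so on
        }
    }
}
```

Because XElement.ReadFrom already advances the reader past the element it returns, the loop only calls Read when the current node is not an element.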

ReadElements will certainly allow me to process this large xml file more efficiently and easily than exclusively using either an XDocument or an XmlReader. Hopefully this method will be helpful to some of you, too.

October 6, 2008

C# compiler eccentricity of the day: throwing lambda

Filed under: csharp — Jacob @ 4:21 pm

Here at work (gratuitous link; oh yeah, and we’re hiring), we have a Verify helper class. Verify lets you succinctly validate (or verify, if you will) various method invariants. For instance, non-nullness:

Verify.IsNotNull(name);

The problem with these helper methods is that when an invariant is violated, an exception is thrown far from the actual affected bit of code. Always moving one frame up the call stack to see the real code that’s failing quickly gets annoying.

I started thinking about how to mitigate this problem, and realized that with Visual Studio’s excellent debugger support for delegates, the call site could include an Action<Exception> that just threw (throw-ed?):

Verify.IsNotNull(name, ex => throw ex);

Except that doesn’t compile.

Wrap the lambda body in curly braces (don’t forget the extra semi-colon) and everything works as expected (including the debugger breaking within the lambda!):

Verify.IsNotNull(name, ex => { throw ex; });

But unfortunately that adds so much syntax that the “solution” is more annoying than the problem I was initially trying to solve.

Has anyone run into anything like this before? Does it make any sense why that statement wouldn’t be valid as a lambda expression?

July 21, 2008

C# reminder of the day

Filed under: csharp — Jacob @ 4:24 pm

Static data is not shared among constructed generic types.

That is, the final line of output from the following program:

using System;

class Program
{
    static void Main()
    {
        NonGeneric.PrintCount(); // "Called 1 time."
        NonGeneric.PrintCount(); // "Called 2 times."

        Generic<int>.PrintCount(); // "Called 1 time."
        Generic<string>.PrintCount(); // ?
    }

    public static void DoPrintCount(int count)
    {
        Console.WriteLine("Called {0} time{1}.",
            count, count > 1 ? "s" : "");
    }
}

class NonGeneric
{
    public static void PrintCount() { Program.DoPrintCount(++count); }
    static int count;
}

class Generic<T>
{
    public static void PrintCount() { Program.DoPrintCount(++count); }
    static int count;
}

Is “Called 1 time.”

July 16, 2008

Strange Framework design decision of the day…

Filed under: csharp — Jacob @ 3:19 pm

Today I encountered the strangest .NET Framework design decision I’ve seen in recent times:

HashSet<T>’s GetEnumerator method returns a public struct HashSet<T>.Enumerator.

Let’s count how many Framework Design Guidelines this violates:

1. Avoid publicly exposed nested types.

  • violation: duh.

Do not define a structure [instead of a class] unless the type has all of the following characteristics [including]:

2. It is immutable.

  • violation: calling MoveNext mutates the enumerator object.

3. It will not have to be boxed frequently.

  • violation: passing a HashSet<T> as a parameter to a method that accepts IEnumerable<T> (Linq, anyone?) will hide the class’ GetEnumerator method. Therefore, any calls to GetEnumerator call the interface method which requires boxing the HashSet<T>.Enumerator to return an IEnumerator<T>.

4. [Any others you see? Leave a comment.]

 

I really want to hear the arguments in favor of the shipping design.
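
To make the boxing point concrete, here is a small sketch (the class and method names are mine, purely for illustration). It checks what you get back from each GetEnumerator on a non-empty set: the class method hands you the struct directly, while the interface method hands you that same struct boxed behind IEnumerator<T>.

```csharp
using System.Collections.Generic;

public static class EnumeratorBoxingDemo
{
    // Calling GetEnumerator on the class returns the nested struct itself.
    public static bool DirectEnumeratorIsStruct()
    {
        var set = new HashSet<int> { 1, 2, 3 };
        HashSet<int>.Enumerator enumerator = set.GetEnumerator();
        return enumerator.GetType().IsValueType;
    }

    // Calling GetEnumerator through IEnumerable<T> returns an interface reference,
    // so the struct has to be boxed; its runtime type is still HashSet<int>.Enumerator.
    public static bool InterfaceEnumeratorIsBoxedStruct()
    {
        IEnumerable<int> sequence = new HashSet<int> { 1, 2, 3 };
        IEnumerator<int> enumerator = sequence.GetEnumerator();
        return enumerator.GetType() == typeof(HashSet<int>.Enumerator);
    }
}
```

Every LINQ operator goes through the second path, which is why guideline 3 is the one that bothers me most.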

April 23, 2008

C# abuse of the day: SwitchOnType

Filed under: csharp, extension methods — Jacob @ 5:30 pm

Today I encountered a situation where I wanted to switch based on a type. Maybe I stayed up a little too late reading Foundations of F# last night.

While this is certainly no pattern matching, it didn’t seem like terrible C#:

DefinitionBase definitionBase = /*...*/;

var targetProperty = definitionBase.SwitchOnType(
        (ColumnDefinition col) => ColumnDefinition.WidthProperty,
        (RowDefinition row) => RowDefinition.HeightProperty);

Note that the lambdas require type decoration (you really don’t want to explicitly declare the generic parameters on this method).

Here’s the implementation (taking two Func projections—feel free to overload to your heart’s content):

public static TResult SwitchOnType<T, T1, T2, TResult>(this T source,
    Func<T1, TResult> act1, Func<T2, TResult> act2)
{
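    // T and T1/T2 are unrelated type parameters, so the cast has to go through object
    if (source is T1)
        return act1((T1) (object) source);

    if (source is T2)
        return act2((T2) (object) source);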

    throw new InvalidOperationException("No matching delegate found");
}

As you can see from the implementation, the method returns the result of the first delegate for which source can be converted into a parameter.

For a default case, add a final delegate that takes object.
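
Here is a rough sketch of that default case, assuming a three-delegate overload written in the same pattern as the two-delegate version. The overload, the SwitchOnTypeExtensions class, and the Describe helper are all hypothetical names for illustration.

```csharp
using System;

public static class SwitchOnTypeExtensions
{
    // Hypothetical three-delegate overload; same shape as the two-delegate version.
    public static TResult SwitchOnType<T, T1, T2, T3, TResult>(this T source,
        Func<T1, TResult> act1, Func<T2, TResult> act2, Func<T3, TResult> act3)
    {
        // casts go through object because T and T1..T3 are unrelated type parameters
        if (source is T1)
            return act1((T1) (object) source);

        if (source is T2)
            return act2((T2) (object) source);

        if (source is T3)
            return act3((T3) (object) source);

        throw new InvalidOperationException("No matching delegate found");
    }

    public static string Describe(object value)
    {
        return value.SwitchOnType(
            (int i) => "int",
            (string s) => "string",
            (object o) => "something else"); // default case: matches any non-null value
    }
}
```

Order matters: the object delegate has to come last, since the first matching delegate wins. A null source still falls through to the exception, because null is never "is" anything.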

April 16, 2008

PC#1: A solution

Filed under: challenge, csharp, extension methods, LINQ — Jacob @ 12:21 pm

So, when I initially posed the programming challenge #1 I stated:

… since I intended to output HTML, ASP.NET seemed a logical choice. But I was amazed at the amount of code required for such a seemingly simple task (not to mention how ugly code containing <% and %> is!).

Well, it turns out, using plain old C# with a little LINQ to XML functional construction made my solution a lot nicer.

Prerequisites

I created a few DateTimeExtensions to enhance readability, though I could have easily inlined the implementation of each of those methods without any LOC impact.

public static class DateTimeExtensions
{
    public static DateTime ToFirstDayOfMonth(this DateTime dt)
    {
        return new DateTime(dt.Year, dt.Month, 1);
    }
    public static DateTime ToLastDayOfMonth(this DateTime dt)
    {
        return new DateTime(dt.Year, dt.Month, DateTime.DaysInMonth(dt.Year, dt.Month));
    }
    public static DateTime ToFirstDayOfWeek(this DateTime dt)
    {
        return dt.AddDays(-((int) dt.DayOfWeek));
    }
    public static DateTime ToLastDayOfWeek(this DateTime dt)
    {
        return dt.AddDays(6 - ((int) dt.DayOfWeek));
    }
}

I also relied on the Slice extension method I’ve previously blogged about.
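
In case you don't have that post handy, here is a stand-in sketch. It is not necessarily the implementation from the earlier post; it just assumes Slice(n) splits a sequence into consecutive groups of at most n items, which is all the solution below needs from calendar.Slice(7). The SliceExtensions class name is a placeholder.

```csharp
using System;
using System.Collections.Generic;

public static class SliceExtensions
{
    // Assumed behavior: split a sequence into consecutive chunks of at most `size` items.
    public static IEnumerable<IEnumerable<T>> Slice<T>(this IEnumerable<T> source, int size)
    {
        if (size < 1)
            throw new ArgumentOutOfRangeException("size");

        var chunk = new List<T>(size);
        foreach (T item in source)
        {
            chunk.Add(item);
            if (chunk.Count == size)
            {
                yield return chunk;
                chunk = new List<T>(size);
            }
        }

        if (chunk.Count > 0)
            yield return chunk; // trailing partial chunk, if any
    }
}
```

Since this is an iterator, the argument check only fires once the result is enumerated.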

Solution

static void Main(string[] args)
{
    DateTime today = DateTime.Today;
    DateTime firstDayOfMonth = today.ToFirstDayOfMonth();
    DateTime startCalendar = firstDayOfMonth.ToFirstDayOfWeek();
    DateTime lastDayOfMonth = today.ToLastDayOfMonth();
    DateTime endCalendar = lastDayOfMonth.ToLastDayOfWeek();

    var calendarPrefix =
        from day in Enumerable.Range(startCalendar.Day, (firstDayOfMonth - startCalendar).Days)
        select new XElement("td", new XAttribute("class", "prevMonth"), day);
    var calendarMonth =
        from day in Enumerable.Range(1, lastDayOfMonth.Day)
        select new XElement("td", day == today.Day ? new XAttribute("class", "today") : null, day);
    var calendarSuffix =
        from day in Enumerable.Range(1, (endCalendar - lastDayOfMonth).Days)
        select new XElement("td", new XAttribute("class", "nextMonth"), day);

    var calendar = calendarPrefix.Concat(calendarMonth).Concat(calendarSuffix);

    var table = new XElement("table",
        new XElement("thead",
            new XElement("tr",
                from offset in Enumerable.Range(0, 7)
                select new XElement("th", startCalendar.AddDays(offset).ToString("ddd")))),
        new XElement("tbody",
            from week in calendar.Slice(7)
            select new XElement("tr", week)));

    Console.WriteLine(table);
}

I’d love to see more ways to solve this. If you’ve got a simpler or more beautiful implementation in your favorite programming language/web application framework, let me know in the comments of the original post.

April 4, 2008

Euler 14

Filed under: csharp, Euler, extension methods, LINQ, Ruby — Jacob @ 12:41 pm

When I read Dustin Campbell’s latest post, I couldn’t help but feel a bit like Steve Carell in this clip from The Office. While his solution is an admirably close port of the original F# solution, it makes me feel a little bit yucky.

Of course, it’s completely hypocritical of me to say so, since I’ve abused C# to make it exhibit F#-like behavior in the past.

But Project Euler invites elegantly simple solutions (like the original F#). Different languages have different idioms, and a literal port typically doesn’t exhibit the same beauty as the original.

If I were solving Project Euler 14 in C# (with “elegance and brevity in mind”), my code would look more like:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace Euler14
{
    class Program
    {
        static void Main(string[] args)
        {
            var iterativeSequences = from start in 1.To(1000000L)
                select new
                {
                    Start = start,
                    Length = SequenceUtility.Generate(start,
                        n => n % 2 == 0 ? n / 2 : 3 * n + 1,
                        n => n == 1).Count()
                };

            Stopwatch sw = Stopwatch.StartNew();

            var longestSequence = iterativeSequences.Aggregate(
                (longest, current) => current.Length > longest.Length ? current : longest
            );

            sw.Stop();

            Console.WriteLine("Longest sequence starts with {0:#,#} (found in {1:#,#.000} seconds)",
                longestSequence.Start, (float) sw.ElapsedTicks / (float) Stopwatch.Frequency);
        }
    }

    public static class SequenceUtility
    {
        // Can also overload To by changing the end value's type;
        // example: "int excludedEnd" returns "IEnumerable<int>"
        public static IEnumerable<long> To(this int start, long excludedEnd)
        {
            for (long i = start; i < excludedEnd; i++)
                yield return i;
        }

        public static IEnumerable<T> Generate<T>(T first, Func<T, T> getNext, Func<T, bool> isLast)
        {
            T value = first;
            yield return value;

            while (!isLast(value))
            {
                value = getNext(value);
                yield return value;
            }
        }
    }
}

Which runs in an acceptable ~5 seconds on my machine.

[Okay. You caught me: I stole the idea for that To extension method from Ruby’s upto. I’m a huge hypocrite and take back everything I said before.

Do invest time learning the idioms of other programming languages, and try applying them to your native language. You may discover something beautiful, after all.]

March 26, 2008

LINQ to prime numbers

Filed under: csharp, Euler, LINQ — Jacob @ 4:41 pm

Having last Friday off, and finding myself in want of something to do, I decided to learn F#. Once I installed F#, though, I learned that desire and motivation are different things.

So I started killing time by solving Project Euler problems. In LINQPad.

(Which you really should download, if you haven’t already.)

People more eloquent than me can explain how embracing constraints helps creativity flourish. I’m not going to try.

Instead, I’ll share a prime number generator inspired by Euler problem 10 and implemented with LINQ:

var odds =
	from n in Enumerable.Range(0, int.MaxValue)
	select 3 + (long) n * 2;

var primes = (new[] { 2L }).Concat(
	from p in odds
	where ! odds.TakeWhile(odd => odd * odd <= p).Any(odd => p % odd == 0)
	select p);

This certainly isn’t the most efficient prime number generator in the world. But the full query to solve the problem (left as an exercise to the reader) runs in under six seconds on my machine, which is perfectly acceptable. And it uses no intermediate storage for the primes!
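
Without spoiling the exercise, here is a quick way to sanity-check the generator outside LINQPad. The PrimeDemo wrapper class and its method names are placeholders of mine; the query itself is the one above.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class PrimeDemo
{
    public static IEnumerable<long> Primes()
    {
        var odds =
            from n in Enumerable.Range(0, int.MaxValue)
            select 3 + (long) n * 2;

        // an odd number is prime if no odd number up to its square root divides it
        return (new[] { 2L }).Concat(
            from p in odds
            where !odds.TakeWhile(odd => odd * odd <= p).Any(odd => p % odd == 0)
            select p);
    }

    public static long[] FirstTen()
    {
        return Primes().Take(10).ToArray(); // 2, 3, 5, 7, 11, 13, 17, 19, 23, 29
    }
}
```

Everything is lazily evaluated, so Take(10) only does the work needed for the first ten values.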

Now that you’ve downloaded LINQPad—you have downloaded LINQPad, haven’t you?—you can start solving Project Euler problems in a blissfully constrained environment, too!

I’ve got Problem 1 down to 54 characters. 🙂

March 13, 2008

Dictionary To Anonymous Type

Filed under: csharp, extension methods, LINQ — Jacob @ 5:34 pm

There’s some buzz about how cool it is to initialize a Dictionary from an anonymous type instance. Roy Osherove recently wrote about it, though he attributes the technique to the ASP.NET MVC framework. Alex Henderson (whose blog I just subscribed to) also came up with an inspiring use of Lambda expressions to initialize Dictionaries (don’t miss the related posts at the bottom).

But I haven’t seen anyone do the reverse: initialize an anonymous type instance from a Dictionary.

Until now.

Prerequisites

public static class DictionaryUtility
{
    public static TValue GetValueOrDefault<TKey, TValue>(this IDictionary<TKey, TValue> dict, TKey key)
    {
        TValue result;
        dict.TryGetValue(key, out result);
        return result;
    }
}

Show me the code!

public static class AnonymousTypeUtility
{
    public static T ToAnonymousType<T, TValue>(this IDictionary<string, TValue> dict, T anonymousPrototype)
    {
        // get the sole constructor
        var ctor = anonymousPrototype.GetType().GetConstructors().Single();

        // conveniently named constructor parameters make this all possible...
        var args = from p in ctor.GetParameters()
            let val = dict.GetValueOrDefault(p.Name)
            select val != null && p.ParameterType.IsAssignableFrom(val.GetType()) ? (object) val : null;

        return (T) ctor.Invoke(args.ToArray());
    }
}

Notice anonymousPrototype. This is a technique called casting by example, coined by Mads Torgersen (of the C# team) in the comments of this post.

Since you can’t ever explicitly refer to the type of an anonymous type, you have to provide an example instance. Using an undocumented feature of the default keyword, we can strongly type the properties of our prototype object without a bunch of null casting.

Here’s some sample code to get you going:

class Program
{
    static void Main(string[] args)
    {
        var dict = new Dictionary<string, object> {
            { "Name", "Jacob" },
            { "Age", 26 },
            { "FavoriteColors", new[] { ConsoleColor.Blue, ConsoleColor.Green } },
        };

        var person = dict.ToAnonymousType(
            new
            {
                Name = default(string),
                Age = default(int),
                FavoriteColors = default(IEnumerable<ConsoleColor>),
                Birthday = default(DateTime?),
            });

        Console.WriteLine(person);
        foreach (var color in person.FavoriteColors)
            Console.WriteLine(color);
    }
}

And thanks to anonymous types overriding ToString(), our program reasonably outputs:

{ Name = Jacob, Age = 26, FavoriteColors = System.ConsoleColor[], Birthday =  }
Blue
Green

Notice that the types don’t even need to exactly match! The dictionary’s “FavoriteColors” value is a ConsoleColor[]. But the anonymous type has an IEnumerable<ConsoleColor> property.

Enjoy!
