I'm a huge fan of System.Xml.Linq or "LINQ to XML". However, some documents really are just too large to efficiently process with an in-memory representation like XDocument. For such documents, we need to consume the xml with a streaming XmlReader instead.
As much as I love System.Xml.Linq, that's how much I hate XmlReader. I don't know why it is, but every time I have to use an XmlReader, I have to go back to the documentation. And working with an XmlReader rarely feels fun.
At work (by the way, we're hiring all kinds of developers), we've written some really nice code to make reading xml easier. But I'm not at work, and I wanted to process a large set of xml data: namely, the Project Gutenberg catalog in RDF/XML format. So I came up with a simple, efficient solution that I want to share.
The Project Gutenberg catalog data looks something like this:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:cc="http://web.resource.org/cc/"
         xmlns:pgterms="http://www.gutenberg.org/rdfterms/">
  <cc:Work rdf:about="">
    <cc:license rdf:resource="http://creativecommons.org/licenses/GPL/2.0/" />
  </cc:Work>
  <cc:License rdf:about="http://creativecommons.org/licenses/GPL/2.0/">
    <!-- cc:license children omitted -->
  </cc:License>
  <rdf:Description rdf:about="">
    <dc:created>
      <dcterms:W3CDTF>
        <rdf:value>2010-01-05</rdf:value>
      </dcterms:W3CDTF>
    </dc:created>
  </rdf:Description>
  <pgterms:etext rdf:ID="etext14624">
    <dc:publisher>&pg;</dc:publisher>
    <dc:title rdf:parseType="Literal">Santa Claus's Partner</dc:title>
    <dc:creator rdf:parseType="Literal">Page, Thomas Nelson, 1853-1922</dc:creator>
    <pgterms:friendlytitle rdf:parseType="Literal">Santa Claus's Partner by Thomas Nelson Page</pgterms:friendlytitle>
    <dc:language><dcterms:ISO639-2><rdf:value>en</rdf:value></dcterms:ISO639-2></dc:language>
    <dc:subject><dcterms:LCSH><rdf:value>Christmas stories</rdf:value></dcterms:LCSH></dc:subject>
    <dc:subject><dcterms:LCC><rdf:value>PZ</rdf:value></dcterms:LCC></dc:subject>
    <dc:created><dcterms:W3CDTF><rdf:value>2005-01-06</rdf:value></dcterms:W3CDTF></dc:created>
    <dc:rights rdf:resource="&lic;" />
  </pgterms:etext>
  <!-- etc. -->
</rdf:RDF>
Let's first look at the wrong way to read this data:
static void Main()
{
    XNamespace nsGutenbergTerms = "http://www.gutenberg.org/rdfterms/";
    XNamespace nsRdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    // loads the entire document into memory before the loop can start
    XDocument doc = XDocument.Load("catalog.rdf");
    foreach (XElement etext in doc.Root.Elements(nsGutenbergTerms + "etext"))
    {
        string id = (string) etext.Attribute(nsRdf + "ID");
        string title = (string) etext.Element(nsGutenbergTerms + "friendlytitle");
        Console.WriteLine("{0}: {1}", id, title);
    }
}
A couple of problems:
- speed: the program sits around for 5 seconds or so before outputting anything, while it loads the 128MB xml file into memory.
- memory usage: loading the 128MB file pushes the memory usage from 10,328K to 731,832K (as reported in Task Manager). I don't want to read too much into that value, but we can certainly agree that loading the whole file into memory at once isn't optimal.
This is the worst of both worlds: the program is slower than it needs to be, and it uses more memory than it should.
... but did I mention that I love LINQ to XML? Processing each etext element as an XElement instance is really convenient.
Ideally, we would combine the efficiency of streaming the large xml file through an XmlReader with the convenience of handling each etext element as an XElement instance.
Cue Patrick Stewart saying, "Make it so":
static void Main()
{
    XNamespace nsGutenbergTerms = "http://www.gutenberg.org/rdfterms/";
    XNamespace nsRdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    // the catalog uses entities such as &pg; and &lic;, so DTD processing must be allowed;
    // on .NET 4.0 and later ProhibitDtd is obsolete: use DtdProcessing = DtdProcessing.Parse instead
    using (XmlReader reader = XmlReader.Create("catalog.rdf",
        new XmlReaderSettings { ProhibitDtd = false }))
    {
        // move the reader to the start of the content and read the root element's start tag
        // that is, the reader is positioned at the first child node of the root element
        reader.MoveToContent();
        reader.ReadStartElement("RDF", nsRdf.NamespaceName);
        foreach (XElement etext in reader.ReadElements(nsGutenbergTerms + "etext"))
        {
            string id = (string) etext.Attribute(nsRdf + "ID");
            string title = (string) etext.Element(nsGutenbergTerms + "friendlytitle");
            Console.WriteLine("{0}: {1}", id, title);
        }
    }
}
Apart from the similarity between this and the previous code block, the most interesting part of this code is the ReadElements extension method.
Before calling ReadElements, the code positions the reader on the first child of the root element. Then, ReadElements is called with an XName referring to the etext element. All of the etext elements are returned as a lazily evaluated sequence, so each XElement is built only when the loop asks for it.
This is exactly what I want: the program starts processing etext elements nearly instantly, and the memory utilization is barely noticeable.
Let's look at the implementation of ReadElements:
// requires: using System.Collections.Generic; using System.Xml; using System.Xml.Linq;
// (as an extension method, this must be declared inside a static class)

/// <summary>
/// Returns a sequence of <see cref="XElement">XElements</see> corresponding to the currently
/// positioned element and all following sibling elements which match the specified name.
/// </summary>
/// <param name="reader">The xml reader positioned at the desired hierarchy level.</param>
/// <param name="elementName">An <see cref="XName"/> representing the name of the desired element.</param>
/// <returns>A sequence of <see cref="XElement">XElements</see>.</returns>
/// <remarks>At the end of the sequence, the reader will be positioned on the end tag of the parent element.</remarks>
public static IEnumerable<XElement> ReadElements(this XmlReader reader, XName elementName)
{
    // compare LocalName, not Name: Name includes the prefix (e.g. "pgterms:etext")
    if (reader.LocalName == elementName.LocalName && reader.NamespaceURI == elementName.NamespaceName)
        yield return (XElement) XElement.ReadFrom(reader);
    while (reader.ReadToNextSibling(elementName.LocalName, elementName.NamespaceName))
        yield return (XElement) XElement.ReadFrom(reader);
}
The documentation comments should be pretty self-explanatory, but it's probably important to call attention to the side effects: ReadElements expects an intentionally positioned xml reader. Once ReadElements is done returning XElements, the reader will be positioned at the end element of the initially positioned element's parent.
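To make that side effect concrete, here is a small sketch that runs ReadElements over an in-memory document and then reports where the reader ended up. The class names XmlReaderExtensions and PositionDemo, the Run method, and the sample xml are all made up for illustration; the ReadElements body is a copy of the method from this post (comparing LocalName so prefixed names match too), repeated only so the sketch compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class XmlReaderExtensions // hypothetical container class
{
    // Copy of the ReadElements method from this post.
    public static IEnumerable<XElement> ReadElements(this XmlReader reader, XName elementName)
    {
        if (reader.LocalName == elementName.LocalName && reader.NamespaceURI == elementName.NamespaceName)
            yield return (XElement) XNode.ReadFrom(reader);
        while (reader.ReadToNextSibling(elementName.LocalName, elementName.NamespaceName))
            yield return (XElement) XNode.ReadFrom(reader);
    }
}

public static class PositionDemo // hypothetical demo class
{
    public static string Run()
    {
        // Made-up sample document; the elements are separated by whitespace,
        // as they are in the catalog file.
        const string xml = "<root>\n  <item id=\"1\" />\n  <other />\n  <item id=\"2\" />\n</root>";
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            // Position the reader on the first child node of <root>.
            reader.MoveToContent();
            reader.ReadStartElement("root");

            // ToList forces the lazy sequence to run to completion.
            List<string> ids = reader.ReadElements("item")
                .Select(e => (string) e.Attribute("id"))
                .ToList();

            // Report the collected ids and the reader's final position.
            return string.Join(",", ids.ToArray()) + " | " + reader.NodeType + " " + reader.LocalName;
        }
    }
}
```

With this sample input, Run should report both item ids followed by an EndElement node named root, which is the parent's end tag described above. Because the sequence is lazy, the reader only reaches that position once the sequence has been enumerated to the end.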
I should also point out it would be trivial to add an overload of ReadElements that didn't take an XName and simply returned a sequence of the initially positioned element and all of its following siblings. But I don't need that method yet, so I didn't write it.
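For the curious, one possible shape for that overload is sketched below. This is not code from my project: the class name XmlReaderSiblingExtensions is hypothetical, and the loop is just one way to do it. It checks the node type after each XNode.ReadFrom call instead of calling ReadToNextSibling, because ReadFrom already leaves the reader on the node that follows the element it consumed.

```csharp
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

public static class XmlReaderSiblingExtensions // hypothetical class name
{
    // Returns the currently positioned element and all of its following sibling
    // elements, whatever their names. Non-element nodes (whitespace, comments,
    // text) are skipped. Stops on the parent's end tag or at the end of the input.
    public static IEnumerable<XElement> ReadElements(this XmlReader reader)
    {
        while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                // ReadFrom consumes the whole element and advances past its end tag.
                yield return (XElement) XNode.ReadFrom(reader);
            }
            else if (!reader.Read())
            {
                // Nothing left to read.
                yield break;
            }
        }
    }
}
```

Usage would mirror the named version: position the reader on the first child of the parent element, then iterate with foreach (XElement child in reader.ReadElements()).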
ReadElements will certainly allow me to process this large xml file more efficiently and easily than exclusively using either an XDocument or an XmlReader. Hopefully this method will be helpful to some of you, too.