This document presents a new, graph-based, way for expressing and executing Microsoft ADO.Net Entity Framework queries. Using an extensive performance comparison, we show that Graph-based Querying (GBQ) easily outperforms traditional querying with LINQ in terms of expressiveness and performance.

Inheritance has probably the largest (negative) impact on performance in
EF. In case of non-trivial inheritance hierarchies (using TPT), the
performance of EF queries drops significantly. This is well-known and
Microsoft is working on it (see the June 2011 CTP release of EF). TPT
leads to very complex SQL queries that take a lot of resources to
construct and to execute. We've seen examples where EF was not even able
to construct a query at all.

Another issue is that specifying queries spanning graphs of entities is
difficult in LINQ. Consider for example the following structure:

A query spanning a graph of associated entities consists of the minimum
set of paths from a given root ('E00' in this example) that defines that
graph. This is a rather verbose and error prone way to define what
should be included in a query: it includes redundant information and
paths components are untyped.

Things get worse if a query includes inheritance and associations,
because LINQ is not expressive enough. For instance, the following
figure shows an entity model with inheritance.

Inheritance is depicted in this
figure as dashed edges, associations as solid edges.

Given such a graph, it is no longer possible to fetch the data using a
single query. The reason is that we cannot specify a path from 'E00' to
'B00'. That is, the query 'ObjectContext.OSet.Include("E00.A00.B00")'
will raise an exception because there is no association between 'A00'
and 'B00' (only from 'A10' to 'B00'). The query
'ObjectContext.OSet.Include("E00.A00.A10.B00")' also doesn't work
because the inheritance relation between 'A00' and 'A10' cannot be
expressed in LINQ query.

These limitations in combination with complex data models easily lead
to badly-performing queries that are hard to understand and maintain.
In this document we briefly present an alternative query mechanism that
is designed for querying complex structures. We present an in-depth
performance comparison that shows that this mechanism quickly pays-off
in terms of performance and complexity.

Graph-Based Querying

Graph-based querying (GBQ) is a new approach for querying complex
structures from a relational database. If your data model uses TPT
inheritance and/or if you are using multiple SQL queries to fetch a
graph of related data (such as in Figure 2), GBQ might be an option for
you. Figure 3 shows an example of the tables, associations, and
inheritance relations that are involved in a typical query that GBQ is
designed for.

Figure 3. Obfuscated graph representation of a complex query. GBQ is designed for queries as complex as this.

Instead of writing a complex SQL query, or worse, a sequence of SQL
queries in order to get all the data from your database, with GBQ you
define the shape of the graph of data you are interested in (i.e., the
entity types and their relations), and then you call the 'Load()' method
on this shape. For example:

This example shows how you can define a strongly-typed entity graph
shape that covers the Depth3Assoc1 data model. You can now fetch all
data described in this shape for all owner entities as follows:

shape.Load<O>();

If you would rather load the data only for a single owner object, then, given a primary key, you could issue:

var owner = new O { Id = <my owner id> };
shape.Load(owner);

Edges in a shape definition may start anywhere in an inheritance
hierarchy. This way you can easily express associations from some sub
type to another entity. For example, to express that 'A10' has an
association with 'B00' (see Figure 2), we define the following edge:

.Edge<A10, B00>(x => x.B00Set)

As you can see, entity graph shapes are concise and declarative,
making it easy to define (and maintain) your queries. Moreover, shapes
are objects that you can use anywhere in your program (checkout http://riaservicescontrib.codeplex.com/wikipage?title=EntityGraph
to learn more about what you can do with entity graphs). Last but not
least, GBQ uses the information in the shape definition to synthesize
very efficient SQL queries. Consequently, fetching data from an SQL
database according to a shape definition is efficient. This document
demonstrates how efficient it is by showing that it easily outperforms
traditional LINQ queries in EF.

GBQ is an extension to EF. For querying it completely bypasses EF by
generating T-SQL and materializing result sets itself. The resulting
entities are attached to an EF object context, so that they are further
managed by EF. This means that GBQ does not replace Entity Framework. On
the contrary, it is designed to work together with EF and is to be used
only where EF queries do not satisfy. GBQ uses the Meta model of an EF
model to analyze types, store mappings and so on for generating T-SQL
and materializing the resulting data.

The Test

To benchmark the performance of GBQ we measure its performance for
different data models of increasing complexity and for different
population sizes. We compare the results with corresponding EF queries.

The Data Models

GBQ is designed to improve performance for TPT inheritance models with
complex association structures. To test the GBQ performance for such
models, we created 5 EF models of increasing complexity. Each data model
defines an inheritance hierarchy rooted at the entity type 'E00'. The
inheritance hierarchy is structured as a binary tree (except for the
data model 'Depth5Wide', which has more than 2^(depth-1) nodes at its
leaves). The models are of increasing depth, ranging from a depth of 3
(consisting of 4 concrete and 3 abstract base types) to 6 (consisting 32
concrete and 31 abstract base types). The models have an owning entity type 'O' that defines an association with 'E00'. The simplest model (Depth3) is depicted below:

Figure 4. Depth3 model.

To also benchmark the performance for data models with associations, we created two flavors for each model: a model with and a model without
associations. For two models with associations, we even created
variants with 1, 2, and 3 associations. So in total we have 14 different
data models. The following figure shows the simplest inheritance model
with 3 associations (Depth3Assoc3).

Figure 5. Depth3Assoc3 model.

Associations are depicted in this figure as solid edges, while dashed
edges denote inheritance. These associations are pretty simple. In
practice, such simple association structures are not very realistic. So,
the performance gain of GBQ that we will show will be even greater in
more realistic scenarios.

There are two important aspects of the associations:

They form an inheritance hierarchy by themselves, rooted at the entity type 'A00'.

Sub types of 'A00' have associations themselves.

This is important because data structures like this cannot be covered
in a single EF query using the Include mechanism. That is, you can
define a query like ObjectContext.E00Set.Include("A00Set"), but you
cannot express to include 'B00Set' for the elements of type 'A10'.

Queries

We will compare GBQ queries with two forms EF queries.

EFSingleQuery

The first form is a single query and has the following form:

ObjectContext.E00Set.Where(x => x.OId == selectedEntityId)

Tis query obtains the elements of type 'E00' for a given owner with
key 'selectedEntityId'. We populate the database in such a way that
there is exactly one E00 object for each owner object. The result of
this query is a single element of a type derived from 'E00'. EF takes
care of joining the corresponding tables (in TPT each type has its own
data base table), and instantiating a proper subtype of 'E00'. The
queries for the models with associations have the form:

This will not only fetch and materialize an instance of a subtype of
'E00', but also the associated entities of type 'A00'. As, we already
indicated, this include mechanism is not expressive enough to indicate
that also the 'B00' entities associated with 'A10' entities should be
fetched. Consequently, this query is incomplete for the data models with
associations.

EFMultiQuery

The second form of EF query is a multi-query.

One of the problems with the single EF query is that it performs not
very well for complex inheritance hierarchies. This is because EF
generates a very complex SQL query spanning all tables of the TPT
hierarchy. An alternative approach is to separate this single query in
multiple queries, one for each concrete type. Each query has the
following form:

Up till EF 4.0 this was a much faster query than the single query,
although performance dropped for larger inheritance hierarchies. Since
the June 2011 CTP release of EF, the single query is much faster
(although the performance drops again for larger inheritance
hierarchies). For the models with associations we append '.Include("A00Set")' to each query. To also fetch the 'B00' entities, we execute a join between 'E00', 'A10' and 'B00' entities:

Observe the complexity of this query for this simple data model. Also
observe that expressing such queries really becomes a nightmare if more
associated data should be fetched (e.g., for the data models
Depth3Assoc2 and Depth3Assoc3). We've experimented with alternative
queries but these perform worse and are even more verbose. According to
the posts here and here, there doesn't seem to be a good solution for this. I welcome alternative queries that form an improvement over the ones above.

GraphBasedQuery

The GBQ query consists of defining the shape of the entity graph that
should be fetched from the database, and then calling the 'Load()'
method on this shape. For the model without associations, this looks
like:

Observe how we express that all 'B00's should be fetched for entities of type 'A10'.

Populations

We test the queries for 10 different populations with different
numbers of 'E00' entities, ranging from 100 to 1,000. For each concrete
subtype of 'E00' we create an equal number of instances. For each 'E00'
entity we create 75 instances of 'A00' (25 instances of 'A10' and 25
instances of 'A11', and 25 instances of 'A12'). For each instance of
'A10' we create 5 instances of 'B00'. Likewise, we create 5 instances of
'C00' and 5 instances of 'D00'. For each combination of a data model
and a population, we generate a separate data base. So, our tests span
140 different data bases. The number of objects in the database range
from 400 in Depth3 model (see Figure 4) to more than 540,000 in
Depth6Assoc3 model (see Figure 6). The number of tables range from 8 to
71.

Figure 6. Depth6Assoc3 model.

Test Execution

For each of the 14 data models we have the three different queries.
These queries are executed for each population. This means we have 42
queries that we execute over 10 populations. This gives 420 different
query results.

We run a separate benchmark for each of the three query types. This
gives 14 different queries per run over 10 different populations. Each
combination of a query for a particular population forms a benchmark
test. This gives 140 different benchmark tests per run. The outcome of a
benchmark test forms a sample. Each run consists executing 14,000
benchmark tests, resulting in an equal number of samples. On average,
each benchmark test is executed 100 times.

Each benchmark test is randomly selected from the set of 140. Each
test consists of querying data for an 'owner' object. To minimize
effects of result caching, we randomly select an owner object for each
benchmark test run.

For each benchmark test we measure the time to complete. After
completion, we record the average execution time for each benchmark
test. From these average numbers we create the performance graphs that
we will discuss later on.

EF requires quite some resources for the first execution of a query.
To minimize the impact of this startup time in the benchmark results, we
ignore the timings of each first run of a benchmark test. GBQ doesn't
have any specific initialization. To prevent auto loading of data, we
set the 'LazyLoadingEnabled' flag to 'false'.

Both EF and GBQ have a query plan caching mechanism which improves
performance by caching compiled queries. If this mechanism is switched
off, there is an immediate performance degrade of EF compared to GBQ. We
therefore leave query plan caching switched on to make the performance
comparison more challenging for GBQ.

We use EF 4.1 June 2011 CTP release for benchmarking. This release
includes a significant performance improvement of TPT queries.

We run the benchmarks on a Dell Latitude E6520 Essential laptop,
which has an Intel Core I7-2720QM (2.2 GHz) CPU and 8 Gb memory. On
this computer we installed the 64 bits version Windows 7, Visual Studio
2010 SP1, and the June 2011 CTP of Entity framework. For the
benchmarking we are not so much interested in the absolute time that is
needed for the test runs to complete, but for the relative time
difference between the different queries, for the different population
and data models. In the end what we want to study is for what complexity
of data model and/or population size GBQ starts to pay off.

Test Results

Data Models with Inheritance Only

In this section we present the results of querying the
inheritance-only data models. The figures show that GBQ easily
outperforms the other two forms of querying. Since the EF 4.1 June 2011
CTP release, the EF single query is much faster than the EF multi query
for all tested data models. This used to be the other way around in
previous EF versions. The performance of both query types significantly
degrades when the inheritance hierarchy grows. Remarkably, the
performance of the EFSingleQuery queries significantly drops for larger
inheritance hierarchies. So much, that it becomes comparable with the
performance of the EFMultiQuery queries. GBQ performance is much less
dependent on this form of data model complexity.

Depth 3 Model

Depth4 Model

Depth 5 Model

Depth5Wide Model

Depth6 Model

Data Models with Inheritance and Associations

As indicated, it is not possible in EF to express in a single query
that all association should be fetched. Therefore, the EFSingleQuery
queries below return incomplete results. These queries are included in
the figures below just for illustration. Their timings cannot be used
for a valid comparison with the other two query types.

Below is the EFMultiQuery for the Depth3Assoc1 model. Observe the
join query that is needed to fetch the 'B00' entities associated with
'A10' entities. If more associations should be fetched as well, the
complexity of the query increases significantly. Also observe that we
have separate queries for each concrete sub type, 'E20', 'E21', 'E22',
and 'E23'. Each is a form of code duplication. When the number of
concrete types increases, the number of queries increases equally.

In this query you just define the shape of a graph of associated
entities by enumerating its edges. Observe that edges can be defined on
sub types (e.g., from 'A10' to 'B00'). Furthermore, in contrast to the
EF query, the GBQ query remains the same for the different data models
used in this section. The corresponding EF queries need to be adapted to
add additional queries for sub types and to add new joins for
additional associations.

Depth3Assoc1 Model

Depth3Assoc2 Model

Depth3Assoc3 Model

Depth4Assoc1 Model

Depth5Assoc1 Model

Depth5WideAssoc1 Model

Depth6Assoc1 Model

Depth6Assoc2 Model

Depth6Assoc3 Model

Summary

GBQ is an alternative approach for database querying that bypasses
EF's querying mechanism. Since the resulting entities are attached to an
EF context, GBQ can be used together with EF. GBQ is typically used
when EF query performance is insufficient and/or because EF queries
become too complex.

The performance figures in this report clearly show that GBQ queries
quickly win in terms of performance. As stated before, the performance
of the EFSingleQuery queries in the association models cannot be
compared with the EFMultiQuery and EFGraphQuery queries because they
yield incomplete results.

The data models and corresponding queries are still relatively simple.
GBQ becomes particular interesting for more complex data models where
the queries include many associations (e.g., see Figure 3). In these
situations query performance of EF really becomes problematic as is the
case with expressing and maintaining such queries.

The resulting entities of QBG queries are attached to an EF object
context. Since we cannot access any internal APIs of the object context,
we are forced to use the 'AttachTo' method. It is expected that a
slight performance improvement could be gained if GBQ could access the
object context at a lower level.

This report compared the performance of GBQ queries with EF queries. I
welcome any suggestion for improving the EF queries that I used.

[1]
The data models in this document use the following naming scheme:
Depth<x> denotes the depth of the inheritance hierarchy of 'E00'.
Assoc<y> denotes that the data model has an association from 'E00'
to 'A00'. The number 'y' denotes the number of associations (e.g., 1
denotes a single association from 'A10' to 'B00', while 2 indicates a
second association, from 'A11' to 'C00'.

This sounds like prefetch paths in LLBLGen Pro or the graph shape / spans features in other o/r mappers: 1 query per entity node in the graph (which is 1 edge), and sets are merged in-memory over pk-fk hashsets.

For the edges in the graph I generate SQL joins. There is no in-memory merge and no hash sets involved. Also, no sub queries or IN queries. I'm not so much an SQL expert so the T-SQL that I generate is pretty straight-forward (at least that's what I think).

I welcome any discussion on this topic, because I'm open for improvements and because I'm not claiming that what I'm doing is unique. Although I do believe that the combination of expressiveness and performance is at least new to EF. That is the main reason for posting this article.

Yes, it is pretty new. We're using it for a product under development to express and execute queries like the one depicted in Figure 3. Our team is very happy with GBQ because of the improved performance and maintainability.

I've tested it pretty extensively but I can't give any guarantees. I have some ideas for further improvement/enhancement (this will probably not lead to any performance improvements), but I'm also very interested in feedback from users. Of course, I also welcome community contributions.