Retrofitting Position IDs to Addepar Non-Intrusively

Every day, Addepar’s data pipeline consumes portfolio data from hundreds of different custodians and imports it to millions of nodes and edges in our Financial Graph. To ensure that the data can be trusted and used to correctly calculate performance, it’s crucial that we not only import it in a timely fashion, but also run verification checks with what already exists on our platform. The data is complex, and data integrity check failures can be caused by a lot of different issues — for example, missing data, incorrect raw data from custodian, incorrect transaction mapping, problematic validation logic, and incorrect security mapping.

As our platform grows, we’re continually improving and extending our methods. The following is an example of how we addressed one type of verification problem — node matching — by non-intrusively introducing a core change to a critical pipeline.

Node Matching

The Financial Graph is used to represent the ownership structure of portfolio data, where a node represents an entity (such as a brokerage account, an asset owner, a legal entity, a stock, or a bond), andan edge (or a position in financial terms) represents the ownership relationship between two nodes. The from node is often an account, and the to node refers to the portfolio’s holding — the actual security.

When importing data, we need to do a few things:

Identify the account node.

Identify the security node via node matching. We search for the security node in our system using the security attribute information the custodian provided. The search criteria could be the name of the security or security identifiers. Each security type may have different preferred strong and weak identifiers that drive the matching. Our matching logic may evolve over time as we learn more about the data. This is similar to the entity resolution problem, where we have to identify the same node across different data sources.

Find the edge. Once both the account node and security node are identified, the edge between them is where daily transactions and position snapshots are stored.

Ideally, the security nodes are unique and shared between the account nodes. After all, the edge is what references the client’s specific holding.

Financial Graph representation of a portfolio.

The Node Jumping Problem

Addepar’s platform imports daily account portfolio data from hundreds of different data sources. Each source may represent each security slightly differently. When there isn’t enough information in the feed to identify the corresponding node for the security, the system treats it as a new security and automatically creates a new node. Clients may also edit the security attributes directly, which can also cause the system to create a new node. Over time, multiple nodes may represent the same security.

Accounts referencing duplicate security nodes.

Having duplicate nodes isn’t great, but it’s not the end of world. It has no impact on the accurate representation of the client’s portfolio. What’s important here is the edge, where the position snapshots and transactions are stored. We need to be able to import data to the right edge all the time. The real problem with duplicate nodes is that it increases the probability of the system matching the account’s holding to the wrong security node, hence breaking the time series. This in turn causes unit verification failures and performance calculation errors.

A Node Jump for Account I on Day 4

Above is an example showing that Account I’s Facebook holding has time series data for three days. However, on Day 4 after the import, it only has one day of position data because the wrong security node was identified, which created a new edge. This causes the time series to drop off in what we refer to internally as the Node Jumping problem.

Finding the Right Edge

Instead of searching the from and to nodes first and identifying the edge, what if the system could just match with the edge directly? The benefit here is that matching becomes very straightforward, and node matching is bypassed for the most part, minimizing the chance of node jumping. The challenge is that we’re introducing a new primary key that bridges two large internal systems. It’s a fundamental change to the data flow of Addepar’s daily portfolio data imports. The adoption risk is very high. We took this challenge into account when deciding how to design and deliver this change.

At a high level, the new workflow leveraging the new Position ID, goes like this:

Every position from each feed has a unique Position ID. Ideally, this ID is the same over time.

The system uses this ID to find the corresponding edge.

If found, that’s where the data is imported.

If not found, the system falls back to the existing workflow of node matching and updates the edge with a reference to this Position ID, so future imports can leverage it.

Position ID

The Position ID is the unique identifier for each holding in a portfolio. Some portfolio data partners have this concept for us to leverage. Others require examining patterns in the feed data to come up with the best derived key. While we may know what combinations make up the best Position ID, the assumption still needs to be validated for the entire time series. The goal is to find the most stable combination in the time series. Position ID loses its meaning when it mutates too often because the process will just fall back to node matching. The combination that mutates the least over time is considered the strongest ID.

Temporal Design

Ideally, Position ID is a constant value like a true primary key. In reality, Position ID mutates over time as security attribute changes. This could happen for a variety of reasons, the primary driver being data coming from the custodians or corporate actions. To accommodate this requirement, Position ID persistence utilizes a temporal model where each ID has a start and end date to track its validity. This is essential for historical imports and replaying imports: as IDs change over time, the ability to hit the right edge each time is crucial. Each Position ID should only be assigned to one edge only. There should never be a case where a Position ID points to more than one edge. To ensure data integrity, any conflicting ID assignment is invalidated with every insert.

The temporal model also provides a complete history of the Position ID (PID) assignment of an edge. This is particularly useful when debugging any mapping abnormalities.

Time series of a Position ID.

Position ID Release

There is a big difference between building a system from scratch and strategically evolving a running system. With the former, you have fewer constraints on the design, no concerns about backward compatibility, and, most importantly, no possibility of impacting a working solution. Often times, we have to deal with the latter, making core changes to a critical pipeline, whether that’s migrating a database, adopting a new messaging infrastructure, or refactoring a large interface. When we do, we need to constantly think about the risk with each release and how we can leave the production system better than we found it.

For the Position ID release, the known and unknown risks were very high because the change could directly impact the integrity of the client’s portfolio. Before we could roll out this feature, we first needed to understand the worst case scenarios:

Unexpected exceptions in the new flow causing imports to fail.

Weak Position IDs causing node jumping to appear again. If a Position ID changes frequently, matching can’t depend on the Position ID reliably, so it falls back to the existing node matching. The worst case scenario here is node jumping, which we know how to deal with already.

Buggy Position ID matching, which could cause massive data integrity check failures and post data clean up effort. This is a high-risk unknown. We would want a way to revert the change quickly and keep the impact to a minimum.

Other Unknowns. We need a way to revert quickly in case of an unknown failure that’s blocking data import.

Thorough unit test coverage and feature flagging goes a long way toward ensuring the quality of a release. Another important aspect here is test coverage. How we do ensure the test data set is meaningful and covers all the edge cases? The following captures the steps we took for this feature rollout:

Instead of a binary feature flag, we used multifaceted feature flags to give us fine-grained controls on when and how widely we release and to allow us to rollback with minimal impact.

We upgraded our staging environment to mirror production. In addition, we could replay production traffic in staging. This allowed us to test in full scale and have production quality test coverage.

We rolled out persisting Position IDs without using them for matching. This gave us a critical view on the quality of the Position ID chosen and allowed us to learn from this data. We’ve learned what Position ID combination worked well and what needed adjustment.

We rolled out Position ID matching by client. We started out with smaller clients or clients with less volatile instrument attributes. This greatly reduced the risk of the unknowns.

We gradually increased the roll out size with each release, and learned from any bugs that surfaced. This allowed us to keep the impact to a minimum.

We worked closely with our Data Operations team throughout the process. They work with this data day in and day out. Their knowledge of the data and feedback throughout this process drove the quality of the migration.

Releasing this feature was a lengthy process and it took a few iterations but with these careful measures, the release went out without any issues and we have been using Position IDs successfully to improve our data quality.

What’s Next

Our data footprint is growing exponentially. Continually investing in, retrofitting, and upgrading our platform is a constant theme at Addepar. We balance these challenges with careful planning around known and unknown risks, always look to learn from our mistakes, and collaborate across teams of passionate engineers who enjoy solving tough problems. There will be more blogs in this space around large scale data migration, import and performance challenges. Stay tuned!