If you have spent any time in the trenches of enterprise development, then it’s practically guaranteed you have been exposed to the porter-like duties that come with carrying data to and from your database’s door. If you’ve spent some time in this endeavor, it’s more than likely that you’ve parsed and loaded your fair share of files into a schema’s tables. From plain flat files to structured XML files to the more esoteric ones (like ISO 2709), developers and administrators have been shuffling these files and ingesting their data for decades.

There are both advocates and naysayers regarding the time-honored practice of ingesting data files. Critics point out that data files are not real-time sources of information, and depending on the chosen format, they may require a certain amount of coordination and finesse to be handled properly. Advocates, on the other hand, argue that data files have been used for decades, and as a result, the accrued cornucopia of libraries and commands for handling them can empower even the untrained novice.

These proven tools can make the parsing and loading of data files as easy as the flick of a wrist, and for loading large volumes of data, they are usually also the fastest route. These same advocates would also point out that though files are not real-time, they serve as recorded snapshots of data that can be archived for later inspection by auditors who ensure compliance with the law. For example, under U.S. law, the retention requirements of the Sarbanes-Oxley Act dictate that certain data must be archived for at least five years [1]. If you currently own systems that handle files according to these specifications, the flow of your data probably resembles the following:

[Figure: file-based ingestion and archival workflow]

However, despite the various arguments made for the use of data files, the world’s hunger for real-time information only grows stronger with each passing day, and this hunger is increasingly being met by retrieving data via an API. Legacy programs and systems will be loading data from files for many years to come, but eventually, those systems will likely be updated by stakeholders who also desire up-to-date information. Of course, new systems and new programs will be needed to make this switch a successful transition. In addition, since there will no longer be any files to compress and archive into a designated folder, we must now create our own retention solution.

In general, though, we would likely want a workflow that was akin to our file-ingestion solution:

[Figure: API-based ingestion workflow]

There are a number of ways to solve this problem, including ad hoc solutions that could be implemented quickly. However, if you have read my previous article [2], you’ll know that I’m fond of solving these problems in a more systematic way. In fact, we can use metadata-driven design here to create a robust architecture that can assume and execute these expected responsibilities. So, what exactly is metadata-driven design? For the sake of brevity, it can be summarized as an approach to software design and implementation in which metadata constitutes and integrates both phases of development. In other words, it’s a way for developers to employ Agile iteration over the entire software lifecycle [3]. By using metadata derived from domain-driven design, you can then proceed to the next step of metadata-driven design and create an impressively flexible architecture.

As Mike Amundsen recommended that builders of APIs be thorough in their designs [4], we should keep the same mentality when we construct systems that consume them. To make any complex problem more approachable, it’s good practice to disassemble it into its constituent parts. (Since it’s especially helpful to address each segment of a problem with its own curtailed solution, the modular nature of metadata-driven design is especially appropriate here.) For example, since an API typically limits the maximum amount of data returned in one request, we will need to design our engine with the expectation that it will enumerate through a large data set via repeated calls to the API.

However, before we deal with such minor idiosyncrasies, we need to anticipate the higher-level issues, namely the different styles that are possible with API data retrieval. Some vendors provide a less versatile mechanism to access their data, offering few or no parameters to assist with the query; the user must pull and handle the full set of contents each time. Some users might wish to simply retrieve all available records from a vendor repeatedly, but many users (like myself) want the option to pull a subset of them, specifically the records that have changed since the last retrieval (i.e., the delta records). In a more straightforward implementation, a vendor might require just an argument in the URL query string in order to retrieve these delta records, but in more abstract cases, the auditing information and the actual data are obtained through two separate APIs. In that scenario, the auditing API would be called first, providing a manifest that details which records and fields have changed since the last retrieval date; this manifest would then be used to target and retrieve the delta records through the data API. We must take this type of scenario into account when we design the architecture.

Even though we could be coordinating the simultaneous use of both a data API and its corresponding auditing API, our preliminary concern is how we will methodically approach calling an API in general. In both cases, we will need to take several factors into account: generating the appropriate URL, parsing the format (JSON, XML, etc.), extracting specific properties’ values, filtering unwanted records, etc. In order to create an appropriate set of metadata for this architecture, the final result will need to contain a comprehensive range of values that can address each of these required steps, whether calling the auditing API or the data API. In order to illustrate this point, here is a side-by-side comparison of a delta manifest and the corresponding delta record:
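These factors can be collected into a single metadata record per API. As a rough, platform-neutral sketch (written in Python for brevity, even though the article targets .NET; every field name here is a hypothetical stand-in for the engine's actual metadata columns):

```python
from dataclasses import dataclass

@dataclass
class ApiCallConfig:
    """Hypothetical metadata record describing one API call style.

    The same shape can describe either the auditing API or the data API;
    only the values differ.
    """
    base_url: str            # endpoint to call
    payload_format: str      # "XML" or "JSON"
    record_xpath: str        # XPath that isolates one record body
    key_xpath: str           # XPath to the record's unique key
    filter_xpath: str = ""   # optional XPath predicate for unwanted records
    since_param: str = ""    # query-string argument for delta pulls

# Example: a data API that supports delta retrieval via a URL parameter
products_api = ApiCallConfig(
    base_url="https://example.com/api/products",
    payload_format="XML",
    record_xpath=".//product",
    key_xpath="./id",
    since_param="changedSince",
)
```

Because both the auditing API and the data API are described by the same record shape, the engine's call logic can stay generic and simply be fed a different configuration row.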

[Figure: side-by-side comparison of a delta manifest and the corresponding delta record]

Even though it may not always be the case, it is common for records to be encased within a collection body (e.g., “<products>”) and for each record to exist within a discrete body (e.g., “<product>”). It’s also common for an API to help us address the issue of making repeated calls for a large data set (i.e., pagination) by providing values that link to the next batch of data (e.g., “<hasMore>” and “<next>”). Such patterns can help us build generic functionality within our engine. To enumerate through these records, we can use an underlying mechanism available within the given environment; for the remainder of this article, we will focus on .NET as our chosen platform. In that case, the .NET XPath library and the W3C XPath notation [5] provide us with the functionality needed to traverse these records.
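The pagination pattern can be sketched as a simple enumerator. The following Python sketch stands in for the .NET XPath approach described here; the tag names mirror the hypothetical payload above, and `fetch_next` is an assumed stand-in for the actual HTTP call:

```python
import xml.etree.ElementTree as ET

def iter_records(payload_xml, fetch_next):
    """Yield each <product> body from a payload, following <next> links.

    `fetch_next` is a caller-supplied function (URL -> payload string),
    so this enumerator stays independent of the HTTP layer.
    """
    while payload_xml is not None:
        root = ET.fromstring(payload_xml)
        for record in root.findall(".//product"):
            yield record
        # If the API signals more data, fetch the next batch; otherwise stop.
        if root.findtext("hasMore") == "true" and root.findtext("next"):
            payload_xml = fetch_next(root.findtext("next"))
        else:
            payload_xml = None

page1 = """<products>
  <product><id>1</id></product>
  <product><id>2</id></product>
  <hasMore>true</hasMore>
  <next>https://example.com/api/products?page=2</next>
</products>"""
page2 = """<products>
  <product><id>3</id></product>
  <hasMore>false</hasMore>
</products>"""

ids = [r.findtext("id") for r in iter_records(page1, lambda url: page2)]
# ids == ["1", "2", "3"]
```

The caller sees one continuous stream of records, with the repeated API calls hidden inside the enumeration.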

Before we propose a candidate for the ingestion engine’s metadata, and since we are already discussing the raw payloads, we should take this opportunity to address our legal concerns. To satisfy the auditors of our systems, we should persist these raw payloads so that the integrity of our current data can be validated later. We will then create a table that describes our specific goal and another table that records the API payloads given in response to our queries:
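As a rough illustration of such retention tables, here is a minimal sketch using SQLite from Python; the table and column names are assumptions, not the article's actual schema:

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical retention tables: one row per retrieval goal,
# one row per raw payload received.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE api_retrieval_job (
    job_id      INTEGER PRIMARY KEY,
    description TEXT NOT NULL
);
CREATE TABLE api_raw_payload (
    payload_id  INTEGER PRIMARY KEY,
    job_id      INTEGER NOT NULL REFERENCES api_retrieval_job(job_id),
    request_url TEXT NOT NULL,
    body        TEXT NOT NULL,
    received_at TEXT NOT NULL
);
""")

def persist_payload(conn, job_id, url, body):
    """Record the raw payload so auditors can later verify the data."""
    conn.execute(
        "INSERT INTO api_raw_payload (job_id, request_url, body, received_at) "
        "VALUES (?, ?, ?, ?)",
        (job_id, url, body, datetime.now(timezone.utc).isoformat()),
    )

conn.execute(
    "INSERT INTO api_retrieval_job (job_id, description) "
    "VALUES (1, 'Nightly product delta pull')"
)
persist_payload(conn, 1, "https://example.com/api/products", "<products/>")
```

Every response body lands in `api_raw_payload` before any parsing occurs, which is what makes the later "sanity check" possible.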

[Figure: schema for the retrieval goal and payload retention tables]

In addition to satisfying our need for a retention policy, we also now have a potential sanity check as the first step in our pipeline of data processing. If there is any question about our engine’s performance in subsequent steps, we can always check these payload ‘snapshots’ in order to ensure that we are correctly parsing and loading the values contained within them.

Now that we have a proper baseline, we can finally start to design the metadata that will serve as the configuration for retrieving data through our designated APIs. In order to correctly execute the hypothetical process mentioned above, the first step in our process will be to obtain the change manifest through the auditing API. Then, with the change manifest in hand, we can pull the relevant delta records mentioned, either singly or in batches:

This schema prototype and its example rows serve as the first iteration of metadata-driven design for our engine. It attempts to capture the abstract concepts that will drive our software. (As the father of domain-driven design said, “Success comes in an emerging set of abstract concepts that makes sense of all the detail.” [6]) In this example, the row of Type ‘A’ denotes the API configuration needed to retrieve the change manifest, while the other row’s values constitute the configuration for our actual data retrieval. We mentioned earlier the need to pull the contents of a large data set through repeated calls. (Even though the XML payloads shown earlier refer to a set with only one delta record, most cases will involve much larger record sets.) Now we have the metadata that can be used to construct our initial calls to the URLs, and we have the ‘Anchor’ columns that can help us form those repeated calls for enumerating through a sizeable collection:
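The role of the anchor values can be sketched as a small URL-building routine. This Python sketch is an assumption about how such metadata might be applied, not the article's actual implementation:

```python
def build_url(base_url, since_param=None, last_anchor=None, next_link=None):
    """Form the next request URL from metadata values.

    `since_param` and `last_anchor` stand in for the article's 'Anchor'
    columns; the precedence rules here are simplifying assumptions.
    """
    if next_link:                      # pagination link supplied by the API
        return next_link
    if since_param and last_anchor:    # delta pull anchored at last retrieval
        return f"{base_url}?{since_param}={last_anchor}"
    return base_url                    # full pull

# The first call uses the stored anchor; subsequent calls follow <next> links.
first = build_url("https://example.com/api/products",
                  since_param="changedSince", last_anchor="2015-06-01")
# first == "https://example.com/api/products?changedSince=2015-06-01"
```

After each successful run, the engine would write the new anchor value back to the metadata table, so the next execution automatically resumes where the last one left off.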

Along with providing the building blocks for URL generation, we now have the ability to parse and filter the payloads from the API calls. As a side note, I would recommend implementing an interface (e.g., IEnumerable) that seamlessly enumerates through the payloads’ data and makes any subsequent API calls as part of the enumeration. Since the .NET platform serves as our underlying environment in this example, the .NET XPath library comes into play at this point. By using XPath notation with the data from certain metadata columns (e.g., “Target_Child_Tag” and “Target_Child_Key_Tag”), we can easily traverse the records inside the payload and extract the bodies of data that we intend to process. With only a few lines, the .NET LINQ functionality gives us the ability to both obtain data and ignore the portions of the payload that are not of interest to us:
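The LINQ extract-and-filter step has a close analogue on most platforms. As a hedged illustration, here is the equivalent selection in Python, with the XPath value playing the role of a metadata column and the tag names mirroring the earlier example payload:

```python
import xml.etree.ElementTree as ET

payload = """<products>
  <product><id>10</id><status>changed</status></product>
  <product><id>11</id><status>unchanged</status></product>
  <product><id>12</id><status>changed</status></product>
</products>"""

root = ET.fromstring(payload)

# Equivalent of the LINQ step: select the record bodies of interest
# and skip the rest. The XPath would come from a metadata column
# (cf. "Target_Child_Tag") rather than being hard-coded.
target_child_tag = ".//product"
changed = [
    rec.findtext("id")
    for rec in root.findall(target_child_tag)
    if rec.findtext("status") == "changed"
]
# changed == ["10", "12"]
```

Because the XPath expressions come from metadata rather than code, pointing the same routine at a differently shaped payload is a configuration change, not a rebuild.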

This generic functionality allows us to populate a container with records from a given payload. In the case of our auditing API, this LINQ execution would populate a list with discrete portions of our delta manifest. In the case of our data API, the list would contain bodies of actual product records. We already have a purpose for the container of our change manifest: it serves as an itinerary of which product records should be retrieved through the data API. In the case of the list of data records, however, further direction is required from us. What do we do with this data? How should we build the mechanism for persisting it? With more metadata-driven design, of course!

This penultimate piece of our architectural puzzle must direct these delta records toward their final destination. Even though we could choose to persist the delta records to the filesystem (along with the change manifest), it should come as no surprise that most enterprise systems use databases as their primary repositories. If you simply needed to display this data, you could use a NoSQL database like MongoDB to store the delta records (or even the raw payloads as a whole), but due to common needs like querying and processing, most enterprise data systems will choose to distribute these attributes across various tables within a relational database. In that case, we will need to create a set of metadata that helps us redirect our delta records into an appropriate staging area:
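Such staging metadata might look like the following sketch, in which each row maps a source XPath to a staging table and column (all names here are hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical staging metadata: source XPath -> staging table/column.
# Capturing a new data point means adding a row here, not changing code.
STAGING_MAP = [
    {"source_xpath": "./id",    "table": "stg_product", "column": "product_id"},
    {"source_xpath": "./price", "table": "stg_product", "column": "unit_price"},
]

def to_staging_rows(record):
    """Translate one delta record into (table, column, value) triples
    according to the staging metadata."""
    rows = []
    for entry in STAGING_MAP:
        value = record.findtext(entry["source_xpath"])
        if value is not None:
            rows.append((entry["table"], entry["column"], value))
    return rows

record = ET.fromstring("<product><id>10</id><price>19.99</price></product>")
rows = to_staging_rows(record)
# rows == [("stg_product", "product_id", "10"), ("stg_product", "unit_price", "19.99")]
```

A generic persistence step can then turn these triples into INSERT statements against the staging area, leaving the table-specific knowledge entirely in the metadata.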

If the engine’s software is implemented correctly, we can also create an intelligent auditing subsystem that knows how to detect and record changes when data is supplied to this staging area. Best of all, we can easily include more rows in order to capture and persist additional data points from the records, all with minimal or no alteration to the code.

Finally, we should address the archival of these payload records. Depending on your database configuration and the number of expected changes, it could be reasonable to simply allow those records to remain in the payload table(s) for the time required by law and proprietary policy. However, it’s very possible that the data will exceed your database’s space limitations and/or will perform poorly when queried by interested parties. In that case, there are a couple of options to consider. One option would be to coordinate with a DBA and create an archive policy for certain sections of your payload table. In that case, though, the solution would be outside the domain of your engine and outside the realm of your control. For those of us who don’t like the sound of that, there is another alternative. Over the last few years, various providers have started to offer data archival services at a reasonable price (like Microsoft Azure Backup). Though some of them require the installation of client software, this option can provide you with a number of services aside from simply archiving the data (like data recovery, for example). Of course, the API ingestion engine will need to work with your archive solution; with some of these services, the engine might need to move the payload rows into a filesystem volume that will be automatically replicated to a remote site. However, with only a few rows of metadata and a few lines of code, you could utilize such a service in order to construct a flexible archive step in your engine’s workflow.

Before we get too comfortable, though, you might ask about the lack of any protection mechanisms built into such a design. For example, we are currently assuming that our system will behave in a certain fashion, namely that the API server will provide change manifests that point to a few delta records. But what if our current expectations suddenly become incompatible with the actual number of changes in the future? What if there is suddenly a deluge of impending changes that could overwhelm our system? Or what if the API server simply stops functioning properly and begins to loop infinitely, sending us the same change manifest repeatedly? In this last case, we would also potentially cause harm to the API server by inadvertently spawning a DoS attack against its owner, and the situation would be exacerbated further if the actual cause of this inadvertent attack were found to be within our own engine. That could be embarrassing (to say the least), and we should strive to avoid such situations, for both technical and political reasons. What can we do to avoid these scenarios? Again, we can create a robust metadata-driven solution to help address such potential issues:

Of course, actions like ‘Email’ and ‘Throttle’ will need to be implemented within your engine, but they should be fairly straightforward. Since we know the domain model, we have the business knowledge to establish parameter values that can quantify our problematic scenarios and that can trigger the appropriate responses to deal with them. More importantly, we can also easily alter these values in order to accommodate an evolving set of requirements, as your given environment and expectations will probably change with the passage of time.
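A guard-rail check driven by such metadata could be sketched as follows; the scenario names, limits, and actions are illustrative assumptions:

```python
# Hypothetical guard-rail metadata: scenario thresholds and their responses.
# Tuning a threshold is a metadata change, not a code change.
GUARDS = [
    {"name": "manifest_too_large", "limit": 10000, "action": "Email"},
    {"name": "repeat_manifest",    "limit": 3,     "action": "Throttle"},
]

def check_guards(observed):
    """Return the actions triggered by observed counts,
    e.g. {"manifest_too_large": 250, "repeat_manifest": 5}."""
    triggered = []
    for guard in GUARDS:
        if observed.get(guard["name"], 0) > guard["limit"]:
            triggered.append(guard["action"])
    return triggered

# A manifest repeated five times in a row trips the Throttle guard,
# slowing our calls before we hammer the API server.
actions = check_guards({"manifest_too_large": 250, "repeat_manifest": 5})
# actions == ["Throttle"]
```

The engine would evaluate these guards at each step of its workflow and dispatch the resulting actions, keeping the protective policy visible and editable in one place.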

By employing this model, we can obtain an architecture that serves as a comfortable transition for established IT departments. By keeping recorded snapshots of the data retrieved through an API, we can provide legal and psychological satisfaction to management. Also, as always, we can employ Agile here to further refine this ingestion engine. If you intend to use only data APIs (with no corresponding call to retrieve a change manifest), then you can easily remove the change manifest portion of the design and its respective code. If you want to add policies that dictate the automatic removal of old records, you could iterate over the design to include another set of metadata (and its accompanying code); this new functionality could remove records based on values like the source identifier and the maximum allowable age of saved records. In any case, this configuration can provide a flexible blueprint for the future of data ingestion.

About the Author

Aaron Kendall is a software engineer in New York City, with nearly 20 years of experience in the design and implementation of enterprise data systems. After beginning as a developer of device drivers and professional software, he became passionate about software design and architecture. He has created innovative business solutions using a variety of platforms and languages, as well as numerous freelance software projects that range from open source packages to game design and mobile apps. If you would like to read more about his work, you are encouraged to visit LinkedIn and his blog.