5 tips for building a scalable technology stack

For more insights into creating a culture and business model that makes the most of the emerging field of data science, check out our O'Reilly Learning Path video training "Creating a Data Science Culture."

In a constantly evolving world of technology and data, keeping abreast of the latest technologies means building an architecture and a culture that keep your data pipelines up to date; in short, it means future-proofing your business. At The Data Incubator, our team has worked with hundreds of companies, including Capital One, eBay, The New York Times, AIG, and Palantir, that have either hired our Ph.D. data science fellows or enrolled their employees in our big data corporate training programs. One consistent theme in the feedback we hear from our partners is the challenge of future-proofing their technology stack.

Future-proofing means having a flexible, adaptable, and scalable technology stack that can leverage data science effectively. Here are five tips to keep in mind when building out your functionality:

1. Consider the whole pipeline, not just individual technologies

It’s important to realize that what you’re able to do with individual pieces of technology doesn’t necessarily reflect your overall capability. Most production-level workflows require a large degree of interconnectivity between data sources, ingestion and storage, cleaning and processing, computation, and delivery, and these connections need to exist between different technologies and different people. What you’re able to do with data is often limited more by the efficiency of these connections than by the technologies themselves.

When it comes time to “scale up,” it’s important not to get sucked into the hype without fully recognizing the implications across the entire data pipeline. Adding functionality in one stage can place new demands and stresses on others.

For example, it’s great that your collection team has come up with a thousand new features to incorporate, but have you asked your data engineers how that will affect the performance of backend queries? It’s easy to switch to a stateful text vectorizer that saves word frequencies, but how will that change the memory requirements for everything downstream? Or maybe your team is eager to upgrade to Spark Streaming to run their clustering algorithms in real time, but your frontend will lose responsiveness if you try to display the results as fast as they come in. These are the types of considerations to take into account before you scale up.
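To make the vectorizer example concrete, here is a minimal sketch in plain Python. The names `hashed_vector` and `VocabularyVectorizer` and the sample documents are hypothetical illustrations, not taken from any particular library. The stateless version hashes tokens into a fixed number of buckets and stores nothing; the stateful version keeps a vocabulary and word frequencies, so its memory footprint and output width grow with every new token it sees.

```python
import zlib
from collections import Counter


def hashed_vector(text, n_buckets=16):
    """Stateless: hash each token into one of n_buckets; nothing is stored."""
    vec = [0] * n_buckets
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % n_buckets] += 1
    return vec


class VocabularyVectorizer:
    """Stateful: remembers every distinct token and its corpus frequency."""

    def __init__(self):
        self.index = {}               # token -> column position
        self.frequencies = Counter()  # token -> count across the corpus

    def fit(self, documents):
        for doc in documents:
            for token in doc.lower().split():
                self.index.setdefault(token, len(self.index))
                self.frequencies[token] += 1
        return self

    def transform(self, text):
        vec = [0] * len(self.index)
        for token in text.lower().split():
            if token in self.index:
                vec[self.index[token]] += 1
        return vec


# Illustrative documents only.
docs = [
    "spark streams data",
    "data pipelines move data",
    "new features need new storage",
]
stateful = VocabularyVectorizer().fit(docs)

print("hashed width:", len(hashed_vector(docs[0])))   # fixed by n_buckets
print("vocabulary width:", len(stateful.index))       # grows with the corpus
```

The stateful vectorizer has to be stored, versioned, and shipped to every downstream consumer, which is the kind of knock-on cost worth raising with the rest of the pipeline before making the switch.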

2. Ask yourself what you actually need, before you build or buy it

While it’s easy to get googly-eyed over fancy new technology and immediately start throwing projects at it to see how it performs, there’s a smarter path. First, think carefully about the types of projects you expect to have, and then consider the technologies that are capable of implementing them. Don't discount older technology—there is usually an established knowledge base surrounding it, and your team may already be familiar with many of the important elements related to it, such as error messages.

This isn't said to discourage investing in your data pipeline. Even if you’re not planning on huge changes in the future, the gradual accumulation of data over time will eventually require that you build out more capacity. The point is to be aware of the growing pains associated with moving to cutting-edge technology.

You should also expect that new technology will not mesh seamlessly with existing processes. Even if the performance metrics you care about are demonstrably better, there’s still a disruptive effect. Developers have to learn new routines, and often have to work out solutions themselves rather than rely on institutional knowledge.

For instance, adding a neural network component to a predictive model may improve its accuracy, but modelers will then have to master new, unfamiliar optimization parameters. There may also be further-reaching ramifications: it may become harder to determine or explain the predictive power of input features.
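One model-agnostic way teams sometimes recover a rough picture of feature influence is permutation importance: shuffle one input column at a time and see how much the error grows. The sketch below is an assumption-laden illustration in plain Python, not a method described in this article; `black_box` is a hypothetical stand-in for an opaque model, and the data is randomly generated.

```python
import random


def mean_squared_error(model, rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_features, seed=0):
    """Error increase when each feature column is shuffled independently."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled values.
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mean_squared_error(model, shuffled, targets) - baseline)
    return importances


def black_box(row):
    """Hypothetical opaque model: depends on feature 0 and ignores feature 1."""
    return 3.0 * row[0] + 0.0 * row[1]


data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(50)]
targets = [black_box(r) for r in rows]

scores = permutation_importance(black_box, rows, targets, n_features=2)
print(scores)  # feature 0 shows a positive score, feature 1 shows zero
```

This only measures how much the model relies on each input, not why, so it complements rather than replaces the interpretability that a simpler model offers.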

There will inevitably be tangible productivity losses associated with doing things in a new way. If you’ve done your research ahead of time, though, and can point to a positive cost-benefit analysis in the long run, these types of short-term losses will be easier to address and justify to stakeholders.

3. Test your integrations whenever you upgrade

It’s common today for all of your technology systems to be interconnected. While this brings many conveniences, it also creates the potential for strange and unforeseen interactions when you upgrade your systems. It’s crucial to have systems in place that test your integrations, especially when dealing with code that touches core business functionality or sensitive personally identifiable information (PII).

Not only does this testing ensure that the underlying systems are working properly, it also creates a documented set of test cases that future system upgrades must pass. Writing tests documents the developer's thought process and intended use cases, and highlights important considerations. This kind of built-in knowledge transfer is hugely important down the road, when other developers need to build on earlier work.
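As a rough illustration of tests that double as documentation, the sketch below uses Python's standard `unittest` module. The pipeline stages (`clean`, `deliver`, `run_pipeline`) and the email-masking rule are hypothetical placeholders for whatever your own stages do; the point is that each test name records an intended behavior that any future upgrade has to preserve.

```python
import re
import unittest

# Simplified email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def clean(record):
    """Hypothetical cleaning stage: trim whitespace and mask email addresses."""
    text = record["text"].strip()
    return {"id": record["id"], "text": EMAIL.sub("[REDACTED]", text)}


def deliver(records):
    """Hypothetical delivery stage: index cleaned records by id."""
    return {r["id"]: r["text"] for r in records}


def run_pipeline(raw_records):
    """Run cleaning and delivery end to end."""
    return deliver(clean(r) for r in raw_records)


class PipelineIntegrationTest(unittest.TestCase):
    def test_pii_never_reaches_delivery(self):
        out = run_pipeline([{"id": 1, "text": " contact jane@example.com "}])
        self.assertEqual(out[1], "contact [REDACTED]")

    def test_records_without_pii_pass_through(self):
        out = run_pipeline([{"id": 2, "text": "no personal data"}])
        self.assertEqual(out, {2: "no personal data"})
```

Saved in a test module, this can be run with `python -m unittest` before and after an upgrade to confirm that the stages still work together as intended.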

4. Establish a feedback loop to monitor and deal with issues

When investing in a new technology, it’s critical to measure how closely it meets expectations, and how well it actually achieves its goals. This process supports team members who are directly involved in defining new workflows and improves higher-level decision-making, particularly when it’s time to identify areas to improve. A thorough feedback and monitoring system can pay dividends in the future in terms of understanding the true ROI of your current investment and what you should be looking for in your next investment.
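A feedback loop can start as something very small: write down what you expected from the new technology, record what you measured, and flag the gaps. The sketch below assumes every metric is "lower is better"; the metric names, the numbers, and the 10% tolerance are invented for illustration.

```python
def flag_missed_expectations(expected, measured, tolerance=0.10):
    """Return {metric: relative overshoot} for metrics worse than expected.

    Assumes lower values are better for every metric. A metric is flagged
    when its measured value exceeds the expectation by more than `tolerance`.
    """
    flagged = {}
    for name, target in expected.items():
        actual = measured[name]
        if actual > target * (1 + tolerance):
            flagged[name] = (actual - target) / target
    return flagged


# Hypothetical expectations set before adopting the new system.
expected = {"p95_latency_ms": 200.0, "error_rate": 0.01, "monthly_cost_usd": 5000.0}
# Hypothetical measurements taken after rollout.
measured = {"p95_latency_ms": 260.0, "error_rate": 0.008, "monthly_cost_usd": 5200.0}

for metric, overshoot in flag_missed_expectations(expected, measured).items():
    print(f"{metric} is {overshoot:.0%} worse than expected")
```

Keeping the expectations in a versioned file alongside the code makes it easier to revisit them when you evaluate the next investment.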

Speaking of ROI, take care not to over-invest up front. Remember the 80-20 rule: in many situations, 80% of the effects are produced by 20% of the causes. Achieving the core of what you want is relatively easy, but building it to perfection requires ever-increasing effort. It usually makes sense to do the most you can at the least cost and then decide whether it’s worth investing more. Closely monitoring your efforts is a great way to avoid both over- and under-investing.
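The 80-20 rule can be checked against your own backlog with a few lines of arithmetic. The sketch below ranks tasks by estimated impact and stops once the chosen tasks cover a target share of the total; the task names and impact scores are made up for illustration.

```python
def pareto_cutoff(impact_by_task, target_share=0.8):
    """Pick the highest-impact tasks until they cover target_share of the total."""
    total = sum(impact_by_task.values())
    ranked = sorted(impact_by_task.items(), key=lambda kv: kv[1], reverse=True)
    chosen, running = [], 0
    for name, impact in ranked:
        if running / total >= target_share:
            break
        chosen.append(name)
        running += impact
    return chosen, running / total


# Hypothetical impact estimates (arbitrary units) for ten candidate tasks.
impacts = {
    "deduplicate records": 50, "validate schemas": 30, "cache queries": 5,
    "retry failed jobs": 4, "tune indexes": 3, "polish dashboard": 2,
    "archive old logs": 2, "rename fields": 2, "add tooltips": 1,
    "restyle reports": 1,
}

tasks, share = pareto_cutoff(impacts)
print(f"{len(tasks)} of {len(impacts)} tasks cover {share:.0%} of the estimated impact")
```

Real backlogs rarely split this cleanly, so treat the output as a prompt for discussion rather than a rule.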

5. Train, hire, and build a data-literate workforce

Ultimately, the best way to future-proof your company is to make sure employees feel valued and heard. An ideal work environment starts with open communication across all roles and departments; effective communication influences productivity just as much as server hardware does.

Take time to teach team members new skills, both formally and informally. Informal training is a great way to get everyone on the same page, achieve buy-in from those involved, and understand related nuances. It also allows everyone to contribute to the discussion and to feel that updates and changes within the company are not just a one-way information dump. If data science is a part of your company’s future, it’s important to make sure everyone can participate in that future.

Michael Li is the founder and CEO of The Data Incubator, a company focused on training and hiring for data scientists. A data scientist himself, Michael has worked at Google, Foursquare, and Andreessen Horowitz. He is a regular contributor to Harvard Business Review, The Wall Street Journal, FastCompany, and TechCrunch. He has a Ph.D. from Princeton and a Master’s degree from Cambridge.

Ariel M'ndange-Pfupfu is a data scientist in residence at The Data Incubator, a big data education and placement company that runs customized, vendor-neutral, corporate training, and a selective eight-week fellowship for Ph.D.s transitioning into industry. He has worked on a variety of data science, software engineering, and curriculum development roles, and is also a current Bleeker Fellow. He earned his Master’s degree at Stanford and his Ph.D. from Northwestern.