August 11th, 2024

Why Your Data Stack Won't Last – and How to Build Data Infrastructure That Will

The article highlights challenges in data infrastructure, emphasizing poor design, technical debt, and key person dependency. It advocates for thorough documentation, cross-training, and stakeholder engagement to ensure sustainable systems.

The article discusses the challenges of maintaining effective data infrastructure and offers strategies for building sustainable systems. Many data projects fail due to unclear design decisions, technical debt, and a lack of alignment with business goals. Consultants often encounter abandoned or poorly designed systems that require complete replacement rather than simple fixes. Key issues include resume-driven development, where choices are influenced by vendor recommendations or personal career advancement rather than the specific needs of the organization. Additionally, reliance on a small number of key personnel can lead to dependency issues, making it difficult for future teams to manage or understand the infrastructure. To avoid these pitfalls, the article emphasizes the importance of thorough documentation, cross-training team members, and maintaining open communication with business stakeholders throughout the development process. By focusing on business outcomes and ensuring that infrastructure is built with input from relevant parties, organizations can create data systems that are robust, maintainable, and aligned with their strategic goals.

- Many data infrastructure projects fail due to poor design and lack of planning.

- Resume-driven development can lead to unsuitable tool choices and technical debt.

- Key person dependency can jeopardize the sustainability of data systems.

- Effective documentation and cross-training are essential for long-term maintenance.

- Engaging business stakeholders throughout the development process is crucial for success.

19 comments
By @Terr_ - 2 months
I've come to believe the opposite, promoting it as "Design for Deletion."

I used to think I could make a wonderful work of art which everyone will appreciate for the ages, crafted so that every contingency is planned for, every need met... But nobody predicts future needs that well. Someday whatever I make is going to be That Stupid Thing to somebody, and they're going to be justified demolishing the whole mess, no matter how proud I may feel about it now.

So instead, put effort into making it easy to remove. This often ends up reducing coupling, but--crucially--it's not the same as some enthusiastic young developer trying to decouple all the things through a meta-configurable framework. Sometimes a tight coupling is better when it's easier to reason about.

The question isn't whether You Ain't Gonna Need It; the question is whether, by the time you do need it, so much will have changed that the other design aspects won't be valid anymore. It also means a level of trust towards (or helpless acceptance of) a future steward of your code.
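
To make that concrete, here is a minimal sketch of what designing for deletion can look like (all names are hypothetical): the vendor dependency lives behind one small seam, so removing it later means deleting one adapter rather than unpicking every call site.

```python
from typing import Protocol

class EventSink(Protocol):
    """The one seam the rest of the codebase is allowed to depend on."""
    def emit(self, name: str, payload: dict) -> None: ...

class VendorEventSink:
    """The only place that knows about the vendor. Deleting the vendor
    means deleting this class and writing another ~10-line adapter."""
    def __init__(self, api_key: str) -> None:
        self._api_key = api_key  # hypothetical vendor credential

    def emit(self, name: str, payload: dict) -> None:
        # vendor_sdk.track(self._api_key, name, payload) would live here, and only here
        print(f"emit {name}: {payload}")

def checkout(sink: EventSink) -> None:
    # Business code depends on the Protocol, never on the vendor.
    sink.emit("checkout_completed", {"amount_cents": 1999})

checkout(VendorEventSink(api_key="test"))
```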

By @weego - 2 months
How do you ensure the data infrastructure you’re building doesn’t get replaced as soon as you leave in the future?

If this is a core conceit of the thinking, then my answer is: who cares?

Why do you want to try and influence a situation you're not even involved in?

Taking it back to the best lesson I was ever given in software engineering "don't code for every future".

Do what you're asked to do, and don't get caught up in projecting your own biases into building a "solid base" for a future whose concerns you can't know.

By @mritchie712 - 2 months
(disclaimer: I'm a founder in this space)

> the project is either so incomplete or so lacking in a central design that the best thing to do is replace the old system

I put a lot of blame here on "the modern data stack". Hundreds[0] of point-solution data tools and very few of them achieve "business outcomes" on their own. You need to stitch together 5 of them to get dashboards, 7 of them to get real-time analytics, etc.

We're going to see more products that achieve an outcome end-to-end. A lot of companies just want a few dashboards that give a 360-degree view of their data. They want all their data in one spot and an easy way to access it, and they don't want to spend a fortune on it. That's what we're focused on at Definite[1].

We're built on the best open source data projects (e.g. DuckDB, Iceberg, Cube). If you decide to self-host, you can use the same components, but it's generally cheaper to use us than to manage all this stuff yourself.

0 - https://mattturck.com/landscape/mad2024.pdf

1 - https://www.definite.app/

2 - https://youtu.be/7FAJLc3k2Fo
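
As a rough illustration of the "fewer moving parts" point (this is not Definite's actual implementation; bucket, path, and columns are hypothetical), DuckDB alone can already serve simple analytics straight off S3:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET s3_region = 'us-east-1';")  # credentials come from the usual AWS env vars

# One engine, one query; no chain of point solutions in between.
rows = con.execute("""
    SELECT date_trunc('day', event_time) AS day, count(*) AS events
    FROM read_parquet('s3://my-bucket/events/*.parquet')
    GROUP BY 1
    ORDER BY 1
""").fetchall()
print(rows[:5])
```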

By @cletus - 2 months
All technical problems are organizational problems. Put another way: any technical problem is a symptom of an organizational problem.

Even at Google, which has some truly amazing homegrown technical infrastructure, you see what I called Promotion-Driven Development ("PDD"). I didn't see this when it came to core technical infrastructure (e.g. storage, networking), but at a higher level I saw many examples of something being replaced solely (ultimately) because somebody wanted to get promoted, and you don't get promoted for maintaining the thing. You get promoted for replacing the thing.

The most egregious example was someone getting promoted to Principal Engineer (T8) for being the TL of something that was meant to replace existing core infrastructure, before it had even shipped. In the end it didn't ship. The original thing is still there. But wait, "we learned a lot".

So this happens because the organization rewards the new thing.

So why is your data infrastructure being replaced? Probably because of an organizational failure and it'll have almost nothing to do with technical aspects of that infrastructure. This is true at least 90% of the time (IME).

Data infrastructure is particularly bad for this because in any sufficiently large organization you will completely underestimate the impact of changing data dependencies for metrics, dashboards, monitoring, ML training and so on. Those things can be hard to find and map out and generally you only find them when they break. Sometimes they can break for years before anyone notices even when the thing is used by live production systems.

By @kerkeslager - 2 months
I work in a different domain (full stack development), but I think the principle here applies broadly.

I tend to favor tools that have been around for a long time. A lot of the sites I have built have deployment scripts written in bash with dependencies on apt packages and repos, git to pull application code, and rsync to copy over files. It would probably be okay to update to zsh at this point. ;) I'm constantly shocked by the complexity of deployment infrastructures when I get into new projects: I've spent plenty of time working with Docker and Kubernetes and I have yet to see a case where these simplified things. As a rule, I don't throw out existing infrastructure, but if I'm doing greenfield development I never introduce containers--they simply don't do anything that can't be done more explicitly in a few lines of Bash.

One of the sites I still maintain has been running for 15 years. I ported it from Fedora (not my choice) to Debian about 8 years ago, and the only thing that changed in the deployment scripts was the package manager. I switched to DigitalOcean 5 years ago and the deployment script didn't change, period. The deployment script is 81 lines of bash. git blame shows 64 of those lines are from the original commit of the file. The changes are primarily to add new packages and to change the firewall to UFW.

And critically: I didn't write this script. This was written by a guy before me who just happened to have a similar deploy philosophy to me. That's a much better maintainability story than having to hire someone with your specific deploy tools on their resume.
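
For flavor, a sketch of that kind of deliberately boring deploy (shown in Python to match the other sketches here; hosts, paths, and the service name are hypothetical):

```python
#!/usr/bin/env python3
"""Explicit deploy: pull code, sync static files, restart the service."""
import subprocess

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)  # fail loudly, like `set -e` in a shell script

run("git", "-C", "/srv/app", "pull", "--ff-only")                     # fetch application code
run("rsync", "-az", "--delete", "/srv/app/static/", "/var/www/app/")  # copy files over
run("systemctl", "restart", "app.service")                            # pick up the new code
```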

By @moltar - 2 months
An easy to maintain stack from my experience that almost anyone can do:

- S3 for storage

- Glue catalog to describe / define source data shapes

- Athena to query the above

- dbt for business data modelling (has Athena and glue adapter)

The only difficult part I always struggle with is getting partitioning right.
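
One hedged way to sidestep some of that pain (table, database, and bucket names are hypothetical): Athena's partition projection declares the partitions up front in TBLPROPERTIES, so there is no MSCK REPAIR or Glue crawler to keep in sync.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Date-partitioned table whose partitions are computed, not registered.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events (
    user_id    string,
    event_name string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/'
TBLPROPERTIES (
    'projection.enabled'        = 'true',
    'projection.dt.type'        = 'date',
    'projection.dt.range'       = '2023-01-01,NOW',
    'projection.dt.format'      = 'yyyy-MM-dd',
    'storage.location.template' = 's3://my-bucket/events/dt=${dt}/'
)
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```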

By @alexpotato - 2 months
Here is the Big Investment Bank version of "resume-driven development":

- "Legacy" trading system in place (old but battle tested)

- New head of technology for the business comes in

- They get the bright idea to retire the old system and roll out a new system

- They promise ridiculous deadlines

- Roll out a half-baked new system and retire AT MOST 60% of the old system

- They bounce to the next role at another firm b/c they have "retired legacy trading system / rolled out new system" on their resume

Meanwhile, the new system ALWAYS has some giant outage or near miss due to the rushed deadlines.

By @mkl95 - 2 months
> The problem I find is that many data teams get thrown into having to design data infrastructure with little experience actually setting up one in the past. Don’t get me wrong, we all start with no experience. But it can be difficult to assess what all the nuances of different tooling and designs can be.

I've been there. Companies can be cheap about training and I was given none before building some sophisticated data stuff that surprisingly worked, but probably could have been simpler.

I got a much better job soon after, and hopefully my replacement got some training.

By @fifilura - 2 months
Parquet + Iceberg stored on S3 as a base. That is solid enough.

After that come various kinds of caches, maybe Postgres for the frontend? Or something streaming?

But once everything is stored as files, you gain the freedom to experiment or refactor.
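
A minimal sketch of that files-first base layer with pyarrow (bucket and schema are hypothetical; the Iceberg catalog on top is left out here):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# Land data as plain Parquet on S3; engines and caches can come and go on top.
table = pa.table({
    "user_id": [1, 2, 3],
    "amount":  [9.99, 4.50, 12.00],
})

s3 = fs.S3FileSystem(region="us-east-1")
pq.write_table(table, "my-bucket/raw/payments/part-00000.parquet", filesystem=s3)

# Reading back is just as engine-agnostic.
print(pq.read_table("my-bucket/raw/payments/part-00000.parquet", filesystem=s3).num_rows)
```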

By @4WIW - about 2 months
I did designs that overstayed their intended lifespan by 15 years. I did designs that were cancelled even before being fully implemented. Most, however, had a predictable lifespan of 3-6 years.

It seems to me that the key is to make useful product based on sound technical decisions; entropy (a thing you can't control) will handle the rest.

By @cgio - 2 months
This is missing the most important part, in my opinion: store data using open standards on an accessible platform, to enable rather than anticipate evolution. Options would be e.g. Postgres, Parquet, Avro, JSON, even CSV. Storage is the foundational, absolutely infrastructural layer of data infrastructure. No one cares if data pipeline infrastructure changes; but if it cannot change because your data is locked into a vendor's platform, then that is the infrastructure failure you did not want on your conscience.

By @pphysch - 2 months
> Our really smart engineer working over time built amazing custom infrastructure

> They quit and no one knows how it works

Either the infrastructure wasn't "amazing" in the first place or clueless management is looking for a scapegoat.

"Amazing" is an interesting word choice because a non-technical manager will be amazed by any blob of code. "Amazing" doesn't mean a straightforward, robust solution to a difficult problem.

By @kkfx - 2 months
Ahem... Sorry, but... it seems more like propaganda for proprietary cloud solutions than a personal statement, and the conclusion "do not do it yourself" tends to be regularly contradicted by the facts...

Choosing third-party, well-known FLOSS infra and open formats is one thing; not developing your own infra in-house with such tools is another.

By @yobbo - 2 months
Yes, but this article seems to be talking to a business that has no competence in "data" outside of a handful of random engineers who have no stake in the business. The advice given amounts to "avoid bad things".

Resume-driven engineering etc. is the result of engineers having no stake in the business's future. Any solution must involve incentives against "bad things" and in favour of "good things".

By @pmx - 2 months
> In one example I came in and found a data vault project that wasn’t being used. It actually had pretty good documentation. However, the team had taken so long and hadn’t fully completed the project which led to their dismissal.

I feel like this is a major reason things don't get documentation. We don't get judged on it, nobody cares how good the docs are, but they DO care that we're shipping features.

By @mannyv - 2 months
Design so you can replace the design when needed.

By @fsndz - 2 months
The problem is that consultants selling bullshit as expertise are more prevalent than honest consultants. And for these bullshit consultants, selling the most unnecessarily complex solution with all the trendy keywords and making beautiful slides is all that counts. And what is funny is that customers believe all those lies.

By @xiaodai - 2 months
Whoever thought Spark based on Scala was a good idea should get shot.