August 16th, 2024

Good programmers worry about data structures and their relationships

Good programmers prioritize data structures over code, as they enhance maintainability and reliability. Starting with data design simplifies complexity, aligning with Unix philosophy and aiding senior engineers in system documentation.


Good programmers prioritize data structures and their relationships over mere code, as emphasized by Linus Torvalds, the creator of Git and Linux. He argues that effective data structures lead to simpler, more maintainable code and enhance software reliability. By focusing on the data model during software design, developers can avoid complications later on. The article's author illustrates this with a past project in which restructuring the data replaced a 500-line function with a 50-line function and a well-designed data structure, making the code both faster and easier to maintain. The article also references the Unix programming philosophy, which advocates for embedding knowledge into data to simplify program logic. It suggests that programmers should start with data design, ensuring a clear understanding of data flow and component interactions before delving into code specifics. This approach is particularly relevant for senior engineers in tech companies, who are often required to create high-level design documents for complex systems. Overall, the emphasis is on the importance of data structures in software engineering, advocating for a shift in focus from code to data.

- Good programmers focus on data structures rather than just code.

- Well-designed data structures lead to easier maintenance and improved software reliability.

- Starting with data design can simplify code complexity and enhance performance.

- The Unix programming philosophy supports the idea of embedding knowledge into data.

- Senior engineers are expected to create high-level design documents for complex systems.
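
As a rough illustration of "embedding knowledge into data", here is a minimal Python sketch. The shipping-rate scenario and every name in it (`SHIPPING_RATES`, `shipping_cost_branching`, `shipping_cost_table`) are hypothetical and do not come from the article. It only shows the same rule written once as branching code and once as a lookup table.

```python
# Hypothetical example: the regions and rates are invented for illustration.

def shipping_cost_branching(region: str, weight_kg: float) -> float:
    """Knowledge lives in control flow: every new region adds a branch."""
    if region == "domestic":
        return 5.0 + 1.0 * weight_kg
    elif region == "europe":
        return 12.0 + 2.5 * weight_kg
    elif region == "world":
        return 20.0 + 4.0 * weight_kg
    raise ValueError(f"unknown region: {region}")


# The same knowledge as data: region -> (base fee, fee per kg).
SHIPPING_RATES: dict[str, tuple[float, float]] = {
    "domestic": (5.0, 1.0),
    "europe": (12.0, 2.5),
    "world": (20.0, 4.0),
}


def shipping_cost_table(region: str, weight_kg: float) -> float:
    """Knowledge lives in the table: the logic stays one lookup and one formula."""
    try:
        base, per_kg = SHIPPING_RATES[region]
    except KeyError:
        raise ValueError(f"unknown region: {region}") from None
    return base + per_kg * weight_kg
```

In the table version, adding a region means adding one row rather than another branch, and the table itself can be inspected, validated, or loaded from configuration.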

50 comments
By @et-al - 5 months
Looks like that substack just copied a bunch of quotes from this Stack Exchange post:

https://softwareengineering.stackexchange.com/questions/1631...

By @dswilkerson - 5 months
"Show me your flowcharts [code], and conceal your tables [schema], and I shall continue to be mystified; show me your tables [schema] and I won't usually need your flowcharts [code]: they'll be obvious." -- Fred Brooks, "The Mythical Man Month", ch 9.
By @alphazard - 5 months
Data structures are not the same thing as types. Data structures are bit patterns and references to other bit patterns (pointers or relationships). Types (as they are used in programming languages) place some constraints on those bit patterns, but can also encode many other language features.

Creating an elaborate type hierarchy with unnecessary abstractions is not what is meant by "worrying about data structures", and that tendency is one of the most common failure modes for otherwise smart engineers.

By @krooj - 5 months
Linus always has a great way of summarizing what others might be thinking (nebulously). What's being said in the article is really mirrored in the lost art of DDD, and when I say "lost" I mean that most developers I encounter these days are far more concerned with algorithms and shuttling JSON around than figuring out the domain they're working within and modelling entities and interactions. In modern, AWS-based, designs, this looks like a bunch of poorly reasoned GSIs in DDB, anemic objects, and script-like "service" layers that end up being hack upon hack. Maybe there was an implicit acknowledgement that the domain's context would be well defined enough within the boundaries of a service? A poor assumption, if you ask me.

I don't know where our industry lost design rigor, but it happened; was it in the schools, the interviewing pipeline, lowering of the bar, or all of the above?

By @AndrewKemendo - 5 months
It’s so interesting because I started doing professional engineering AFTER doing day to day data and statistical analysis in statistical systems like matlab, R and early Python.

So my view of engineering has always been based on managing two things: functional state and data workflows

After doing software engineering professionally for a decade now I can tell you that:

1. Most “scientific” engineers back to Minsky, Shannon etc… describe the world of computing in terms of state management, data transformation and computing overhead management. All of the big figures and pioneers in software cared A LOT about data and state; basically, that’s all computing was at the beginning, and it was expected to be the pattern moving forward

2. There’s absolutely no consistency in what are the foundationally important assumptions in engineering system design that are always true such that everyone does them - and the ones that do are fads at best

3. Business timelines dictate engineering priorities and structures much more than robustness, antifragility, state management etc… in the vast majority of production software

4. Professional organizations like guilds, unions, etc… are almost universally rejected by software engineers. Nobody actually takes IEEE seriously because there’s no downside if you don’t. This ensures there’s no enforcement or self-regulation in engineering practices the same way there are in eg Civil and biomedical engineering. Even then those are barely utilized.

Overall the state of software development is totally divorced from its exceptionally high minded and philosophical roots, and is effectively led by corporations that are prioritizing systems that make money for people with money.

So what is “good” has very little to do with what is incentivized

By @cpeterso - 5 months
“Show me your flowcharts [code] and conceal your tables [data structures], and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.”

-- Fred Brooks

By @ants_everywhere - 5 months
This is essentially the point of view of functional programming and category theory.

You have some data object whose structure provides constraints on how it can be transformed. And then the program logic is all about the structure-preserving transformations.

The transformations become simpler and easier to reason about, and you're basically left with a graph where the transformations are edges and the structures are nodes. And that's generally easier to reason about than an arbitrary imperative program.

By @swyx - 5 months
A conclusion I reached a while ago: all the work we do in code is far more likely to be shorter lived than a single good decision that we have in data.

https://www.swyx.io/data-outlasts-code-but

By @fungiblecog - 5 months
This principle also applies at the business level. I’m constantly dealing with business analysts who obsess with processes (code) but don’t first take time to understand the entities and their relationships (data). The result is that when it comes time to build something they cannot communicate with the developers about what the data model should look like. The processes get implemented and the data model is put together on-the-fly rather than being carefully designed.
By @breadchris - 5 months
I think of code vs types analogous to the function vs form argument in design. If a website needs to be shipped ASAP, I should prioritize types less. If hundreds of engineers rely on some code, I care about types.

Language also influences how important types are, regardless of function. Haskell is strict, LISP is less so. Python, being closer to LISP in syntax, but surfacing powerful C (closer to Haskell) primitives has proven valuing function over form can be empowering.

Premature modeling of a domain in verbose types (ex. struct vs any) can slow down rapid iteration in comprehending what is valuable from data or how users may actually use code. Someone might need not just one, but infinite cat pictures in their file upload, but the code _and the types_ treat this as a single value. Another example is using JSONB columns in their RDS initially and normalizing fields into columns when needed. A more flexible type system saves time in early iteration cycles.

By @constantcrying - 5 months
Correct, nothing improves code quality and performance more than having the right data structures.

This is also something I learned far too late; my programming education focused very much on algorithmic thinking. That is important, but only helpful if you have already chosen the right data structures. Many times I have had the situation that the code I was writing was confusing and only a small part of it had to do with solving the actual problem. If this ever happens to you, you should rethink your data structures and consider whether they were chosen correctly.

Also, when reading code for the first time you should be looking at the data structures before anything else.

By @austin-cheney - 5 months
There are some grave problems with this article. I agree with the basic premise 100% but the article oversimplifies the idea to focus on data. It isn't just about data, data structures, or even relationships. It is about organization in general, and most people cannot perform at that level.

To be clear: Good programmers worry about the organization and cleanliness of their code. They worry that their code is reduced to the smallest of forms, consistent in expression, and exceptional in measure.

The limitation here is personality and not intelligence and there is a lot of data on this.

The personality metric of concern is conscientiousness, which is how a person perceives the world outside themselves. This one thing is responsible for self-discipline, concepts of organization, initiative, half of empathy, and much more. People at the extreme high end of this lean more towards things like authoritarianism, obligation, duty, healthy living, and social alignment. These people find joy in putting things into order and discerning relational structures.

People on the low end tend to be free spirits, are more likely to experiment with drug use, can't clean their rooms or pick up trash even if you put a gun to their heads. Concepts of work effort and self-reliance are almost entirely unimaginable. These people cannot organize anything and they require absurd rewards to accomplish the smallest tasks, and even still the output of their efforts is fleeting and temporary. They simply cannot see abstract relational concepts and cannot be compelled so.

Strangely, low scoring people struggle to discern value from a thing as they cannot perceive separations of vanity from functionality. Yet, they have no problem selling things in full awareness that if they cannot perceive value then neither can most other people. High scoring people don't do this and thus tend to make less effective merchandisers.

High scoring people tend to perceive low scoring people as slobs, sloths, and an anchor on social progress. Low scoring people tend to perceive high scoring people as perfectionists, prudes, and unnecessarily distracted on trivialities far outside their imagination.

The common assumption is that people who are brilliant at abstract organization and industriousness must be more intelligent. This makes sense because these people tend to be more successful in all aspects of life other than careers in entertainment. That assumption is completely wrong, though. Conscientiousness is negatively correlated to intelligence at -0.27, according to various studies.

By @kyledrake - 5 months
When I make a web application, the first step in that process for me is designing the relational database model with a pencil, eraser and piece of paper. It makes the code a lot less messy when you have all the data sorted out before you get into it. I also find that it really helps me to understand what I'm building and how I need to build it. And it's a hell of a lot easier to change code than how data is being stored, so it's something I really try to get right and properly normalized the first time.

I don't even attempt to do types at this point. It's really just about how the structure is going to look.

By @hipadev23 - 5 months
Argued this for a long time yet so many devs insist on MongoDB and other similar schemaless data stores
By @ashton314 - 5 months
This is the principle behind “How to Design Programs” [1]: Build your data structures, then the form of your functions on those structures should correspond more or less exactly to them.

[1]: https://htdp.org/2023-8-14/Book/index.html

By @pmarreck - 5 months
> When I read this quote, I actually was able to recognize countless examples in the past of this. I once worked on a project where we spent quite a while optimizing complex algorithms, only to realize that by restructuring our data, we could eliminate entire classes of problems. We replaced a 500-line function with a 50-line function and a well-designed data structure. Not only was the new code faster, but it was also much easier to understand and maintain. (Of course, then the problem also shifted “down the stack” to where the majority of toil was in restructuring existing data.)

This is really a preference, then. I encountered almost this exact sort of problem in my last project. I wanted a simpler database design and more complex querying/code, they wanted a significantly more complex database design that was harder to understand (for everyone but the guy who spent all of one weekend designing it) but simpler querying/code (that was also more plentiful as a result). The question really is, where do you prefer your complexity to go? Do you want to lean on the database, or your code?

Simple example, you have a portfolio of stock that constantly changes in composition and value over time. Do you: 1) only store the current model of the portfolio in a "portfolios" table and the current prices of stocks in a "stock_prices" table and use a separate history table for both (with stored procedure triggers to automatically copy all changes to it) to store all previous versions that can then be queried separately if needed, OR 2) store each change in both quantity and price across multiple tables, no separation of what is "current" vs. what is "historical" other than the relationships that are (properly, hypothetically) set up via an "intent_versions" table at the top level, requiring a bunch of joins to actually determine the state of the portfolio both now and at any point in the past?

I opted for the former because I have no fear of complex queries, the center of thought-mass of the team leaned towards the latter. WWYD?
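
Option 1 above (a "current" table plus a trigger-fed history table) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the commenter's actual schema: it uses SQLite through Python's `sqlite3` module as a stand-in for whatever database the project used, the column names and the `portfolios_history` table name are invented, the `stock_prices` table is omitted for brevity, and only updates (not deletes) are recorded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Current state only: one row per holding.
CREATE TABLE portfolios (
    portfolio_id INTEGER NOT NULL,
    symbol       TEXT    NOT NULL,
    quantity     INTEGER NOT NULL,
    PRIMARY KEY (portfolio_id, symbol)
);

-- Hypothetical history table: every replaced version lands here.
CREATE TABLE portfolios_history (
    portfolio_id INTEGER NOT NULL,
    symbol       TEXT    NOT NULL,
    quantity     INTEGER NOT NULL,
    replaced_at  TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Copy the old row into history whenever a holding is updated.
CREATE TRIGGER portfolios_keep_history
AFTER UPDATE ON portfolios
BEGIN
    INSERT INTO portfolios_history (portfolio_id, symbol, quantity)
    VALUES (OLD.portfolio_id, OLD.symbol, OLD.quantity);
END;
""")

conn.execute("INSERT INTO portfolios VALUES (1, 'ACME', 10)")
conn.execute(
    "UPDATE portfolios SET quantity = 25 WHERE portfolio_id = 1 AND symbol = 'ACME'"
)

# The current state is a plain single-table read; history is queried separately.
current = conn.execute(
    "SELECT symbol, quantity FROM portfolios WHERE portfolio_id = 1"
).fetchall()
history = conn.execute(
    "SELECT symbol, quantity FROM portfolios_history WHERE portfolio_id = 1"
).fetchall()
```

The trade-off the comment describes shows up here: reading the current portfolio needs no joins, while reconstructing the state at an arbitrary past moment means combining both tables by timestamp.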

By @taeric - 5 months
A major caveat for folks in this line of thinking, though, is to avoid falling into the "one true schema" trap. Data can and will be duplicated in your system. A large part of the "consistency" battle people should be having is how long before a lot of that duplication is as expected. Not making sure it is never inconsistent.

That is, it is easy to see many junior efforts stall out during schema design thinking that you can solve all issues with a fancy method of storing the data. It isn't the schema that is important about your data, so much, but where different updates to it are known first and what they will need to go with it.

By @aryehof - 5 months
Such advice is dangerous in the assumption that there is only one type of problem in programming. One type of domain - applications related to the computer and data sciences and computing infrastructure.

While Git might be particularly about data structures at its core, might I suggest you don't try to model into code your next complex payroll, insurance quotation, supply-chain or billing system as a composable set of lists, stacks, queues and trees, modified by code that grows over time to increasingly look like a big ball of mud.

By @packetlost - 5 months
There are two types of applications: one where you know your data model from the beginning and one where you don't. Static types work exceptionally well when you're modeling something you understand pretty well; especially to the point where it is not expected to change significantly. On the other hand, a lot of programs find their data model while being made. This is fine too, what is expected from a program can change, sometimes a lot. I've built both types of applications in both statically and dynamically typed languages.

What your team knows matters more than either of these.

By @spratzt - 5 months
For those interested in escaping the Hacker Prison where ‘Weeks of coding can save hours of thinking’ I strongly recommend William Kent’s book “Data and Reality”.
By @mcny - 5 months
Speaking of data structures, I am curious what you guys think of the Entity Attribute Value model.

I worked on an e-commerce "platform" that used EAV and I always struggled to write queries to find anything I needed.

https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80...
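
To make the querying difficulty concrete, here is a minimal sketch of an EAV table in SQLite via Python's `sqlite3` module. The `product_eav` table and its rows are invented for illustration and are not the schema of the platform mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical EAV table: one row per (entity, attribute) pair, all values as text.
CREATE TABLE product_eav (
    entity_id INTEGER NOT NULL,
    attribute TEXT    NOT NULL,
    value     TEXT    NOT NULL,
    PRIMARY KEY (entity_id, attribute)
);
INSERT INTO product_eav VALUES
    (1, 'color', 'red'),  (1, 'size', 'M'), (1, 'price', '19.99'),
    (2, 'color', 'red'),  (2, 'size', 'L'), (2, 'price', '24.50'),
    (3, 'color', 'blue'), (3, 'size', 'M'), (3, 'price', '19.99');
""")

# "Red products in size M, with their price": each attribute involved
# needs its own self-join, and the price has to be cast back from text.
red_medium = conn.execute("""
    SELECT c.entity_id, CAST(p.value AS REAL) AS price
    FROM product_eav AS c
    JOIN product_eav AS s ON s.entity_id = c.entity_id AND s.attribute = 'size'
    JOIN product_eav AS p ON p.entity_id = c.entity_id AND p.attribute = 'price'
    WHERE c.attribute = 'color' AND c.value = 'red' AND s.value = 'M'
""").fetchall()
```

With a conventional `products(color, size, price)` table the same question would be a single `WHERE color = 'red' AND size = 'M'`, which is one plausible reason EAV queries feel hard to write.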

By @giovannibonetti - 5 months
That's why modern languages that encourage type-driven development like Rust and Gleam are a godsend. Just look at the kinds of things you can encode in the type system. It can prevent issues like mixing up numbers in different units:

https://blog.hayleigh.dev/phantom-types-in-gleam
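
For readers without Gleam or Rust at hand, the unit mix-up idea can be loosely approximated in Python. This is only a rough analogue, not phantom types: the check happens at runtime rather than at compile time, and the `Meters` and `Feet` classes are hypothetical names made up for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meters:
    value: float

    def __add__(self, other: "Meters") -> "Meters":
        # Refuse to add anything that is not also a length in meters.
        if not isinstance(other, Meters):
            return NotImplemented
        return Meters(self.value + other.value)


@dataclass(frozen=True)
class Feet:
    value: float

    def to_meters(self) -> Meters:
        # One international foot is defined as exactly 0.3048 meters.
        return Meters(self.value * 0.3048)


total = Meters(1.0) + Meters(2.0)          # fine: same unit
converted = Meters(1.0) + Feet(10.0).to_meters()  # fine: explicit conversion
# Meters(1.0) + Feet(10.0) raises TypeError instead of silently adding numbers.
```

A static checker such as mypy would also flag the mixed addition before the program runs, which is closer in spirit to what the linked post describes.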

By @quantum_state - 5 months
I would clarify this to: information + the relationships among its components, including how they get transformed by computation.
By @slowhadoken - 5 months
RDBMS bored me to tears when I first studied it but it’s an invaluable way to look at data and structures.
By @makeramen - 5 months
So basically [engineering] design is more important than implementation details.

I would say the "engineering" part of the design is also optional, as product design is also another lever of higher influence than code optimization.

By @OutOfHere - 5 months
I'd argue that when your entire approach is experimental, there is no need to fret over structures. If you are convinced that your approach can work, that's the time to design well.
By @antipaul - 5 months
Can a "class", eg in Python or Java, be considered an example of the "data structure" Linus and others are talking about here?

Or are they only talking about tables in databases and such?

By @Avshalom - 5 months
"you should actively seek ways to shift complexity from code to data."

but also somehow we're supposed to write everything to read and write flat text...

Thanks UNIX!

By @divbzero - 5 months
Linus Torvalds’ git is the perfect case in point: wonderful data structure wrapped with adequate tooling.
By @ramesh31 - 5 months
The best code is no code. Which is why the best programming language (Lisp) expresses code as data.
By @skywhopper - 5 months
This is how I think about most of my coding, so it must be true.
By @jdeaton - 5 months
I don't understand what it means to “move complexity into data”
By @kipple - 5 months
> The actionable tip here is to start with the data. Try to reduce code complexity through stricter types on your interfaces or databases. Spend extra time thinking through the data structures ahead of time.

This is why I love TS over JS. At first it feels like more work up front, more hurdles to jump through. But over time it changed how I approached code: define the data (& their types) first, then write the logic. Type Driven Development!

Coming into TS from JS, it might feel like an unnecessary burden. But years into the codebase, it's so nice to have clear structures being passed around, instead of mystery objects mutated with random props through long processing chains.

Once the mindset changes, to seeing data definition as a new first step, the pains of getting-started friction are replaced by the joys of easy future additions and refactors.