Don't let dicts spoil your code
Roman Imankulov critiques Python dictionaries for causing technical debt and complicating maintenance. He advocates for domain models, dataclasses, and Pydantic to enhance clarity and structure in evolving codebases.
Roman Imankulov discusses the pitfalls of using dictionaries (dicts) in Python programming, particularly in the context of evolving codebases. He argues that dicts tend to become opaque, mutable data structures that complicate maintenance and extensibility. As applications grow, heavy reliance on dicts often signals technical debt, making it difficult to track changes and manage data integrity. Imankulov suggests treating dicts as a "wire format" for data deserialization and advocates using domain models to encapsulate data and behavior. He highlights the benefits of Python's dataclasses and Pydantic for creating structured models, which provide clearer semantics and reduce boilerplate code. For legacy codebases, he recommends annotating dicts with TypedDict to improve type safety. Additionally, he advises annotating key-value stores as mappings to enforce immutability and clarity in data handling. Ultimately, Imankulov emphasizes the importance of controlling dict usage to prevent dicts from undermining application architecture.
- Dicts can lead to technical debt and complicate code maintenance.
- Using domain models instead of dicts enhances clarity and structure.
- Python's dataclasses and Pydantic offer better alternatives for data modeling.
- TypedDict can improve type safety in legacy codebases.
- Annotating dicts as Mapping can enforce immutability and clarity (see the sketch below).
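To make the last two recommendations concrete, here is a minimal sketch (the type and field names are illustrative, not taken from the article):

from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)
class Repository:
    name: str
    stargazers_count: int

# Annotating the parameter as Mapping rather than dict tells readers and
# type checkers that the function only reads from the store, never mutates it.
def total_stars(repos: Mapping[str, Repository]) -> int:
    return sum(r.stargazers_count for r in repos.values())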
Related
Python Modern Practices
Python development best practices include using tools like mise or pyenv to manage multiple versions, running the latest Python release, and using pipx to run applications. Project tips cover the src layout, pyproject.toml, virtual environments, Black, flake8, pytest, wheel, type hinting, f-strings, datetime, enum, named tuples, data classes, breakpoint(), logging, and TOML config for efficiency and maintainability.
The Rise of the Analytics Pretendgineer
The article examines dbt's role in data modeling, highlighting its accessibility for analytics professionals while noting the need for a structured framework to prevent disorganized and fragile projects.
Approximating sum types in Python with Pydantic
Pydantic enables robust data models in Python, supporting sum types and discriminated unions for clear, type-safe definitions. It enhances maintainability and reliability by preventing invalid states in applications.
Lesser known parts of Python standard library – Trickster Dev
The article highlights lesser-known Python standard library features, including advanced data structures in `collections`, precise arithmetic in `decimal` and `fractions`, and tools for resource management, debugging, and packaging.
Know your Python container types
The article discusses Python's container types: lists, tuples, named tuples, sets, dictionaries, and dataclasses, highlighting their uses, differences, and recommendations for appropriate applications in programming.
Another way to look at it is the functional core, imperative shell pattern.
Wrapping up your dict in a value object (a dataclass or whatever that is in your language) early on means you handle the ugly stuff first. Parse, don't validate. Resist the temptation of optional fields. Is there really anything you can do if the field is null? No? Then don't make it optional. Let it crash early on. Clearly define your data.
If you have put your data in neat value objects, you know what is in it. You know the types. You know all required fields are there. You will be so much happier. No checking for null throughout the code, no checking for empty strings. You can just focus on the business logic.
Seriously so much suffering can be avoided by just following this pattern.
dict is an implementation of a hash table. Hash tables are designed for O(1) lookup of items. As such, they are arrays much bigger than the number of items they store, so that items can be hashed to integer slots while sidestepping collisions. They're meant to act like an index that contains many records, not a single record.
A single record is more like a tuple, except you want named access instead of title = movie[0], release_year = movie[1], etc. And Python had that, in NamedTuple, but it was kinda magical and no one used it (shoutout Raymond Hettinger).
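For illustration, a minimal typing.NamedTuple sketch of that named access (field names hypothetical):

from typing import NamedTuple

class Movie(NamedTuple):
    title: str
    release_year: int

movie = Movie("Alien", 1979)
movie.title         # named access instead of movie[0]
movie.release_year  # instead of movie[1]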
Granted, this rant is pretty much the meme with the guy explaining something to a brick wall, in that dicts are so firmly entrenched as the "record" type of choice in Python (but not so in other languages: struct, case class, etc., where JSON doesn't just deserialize to a weak type, but I digress).
>"solution : use dataclasses"
Damn, it's almost like using an untyped language for large projects is not a great idea.
I know how to deal with missing values or variability in maps, and so do a lot of people... what am I missing here?
There are useful ideas in this post, but I'd be careful not to throw the baby out with the bathwater. Dicts are right there. There are dict literals and dict comprehensions. Reach for more specific dict-likes when it really matters.
External API <--dict--> Ser/De <--model--> Business Logic
Life's all great until "External API" adds a field that your model doesn't know about, it gets dropped when you deserialize it, and then when you send it back (or around somewhere else) it's missing a field. There's config for this in Pydantic, but it's not the default, and it isn't for most ser/de frameworks (TypeScript is a notable exception here).
Closed enums have a similar tradeoff.
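For reference, the Pydantic (v2) setting alluded to above is extra="allow"; a minimal sketch with hypothetical fields:

from pydantic import BaseModel, ConfigDict

class Repo(BaseModel):
    model_config = ConfigDict(extra="allow")  # keep unknown fields instead of dropping them
    name: str

repo = Repo.model_validate({"name": "x", "stars": 5})
print(repo.model_dump())  # {'name': 'x', 'stars': 5} -- the extra field survives the round trip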
In TypeScript, using plain JS objects is very straightforward. Of course you have to validate the schema at your system boundaries, but you'll have to do this either way.
So: if this works very well in TS, the problem can't be dicts themselves; it must be the way they integrate into, and are handled in, Python.
This leads me to the conclusion that the arguments presented in the article might be the wrong ones.
(But I still think the conclusion the article arrives at is okay. I just don't think the article makes a strong case about whether to prefer dataclasses or typed dicts.)
https://www.youtube.com/watch?v=aSEQfqNYNAc
But ok, it's less bad in Python since objects are dicts anyway and you don't need getters.
class GetThingResult
  def initialize(json)
    @json = json
  end

  # single thing
  def thing_id
    @json.dig('wrapper', 'metadata', 'id')
  end

  # multiple things
  def history
    @json['history'].map { |h| ThingHistory.new(h) }
  end

  # ... two dozen more things
end
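For comparison, a rough Python counterpart under the same assumed JSON shape: no getter methods, just dict access (payload and ThingHistory are assumptions carried over from the snippet above):

thing_id = payload["wrapper"]["metadata"]["id"]
history = [ThingHistory(h) for h in payload["history"]]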
Most developers will carry their previous language paradigms into their new ones. But if types, DDD (Domain-Driven Design), and classes are what you're looking for, then Python isn't the best fit: Python doesn't have compiler features that work well with those paradigms, such as dead code removal/tree shaking. However, starting out with dictionaries and then moving over to dataclasses is a great strategy.[1] As a small note, it's kind of ironic that the statically typed language Go took inferred typing with its := operator, while there is now a movement in Python to write foo: str = "bar".
You lose the algebra of dicts, and it's a rich algebra to lose, since in Python it's not just all the basic obvious stuff but also powerful things like dict comprehensions and ordering guarantees (3.7+ only).
You tightly couple to a definition. In the simple GitHubRepository example this is unlikely to be problematic; in the real world, coupling like this[1] to objects trying to capture domain data with dynamic structures is regularly the stuff of nightmares.
The overarching problem with the approach given is that it puts code above data. You take what could be a schema, inert data about inert data, and instead use code. But it might also be an interesting case to consider as a slippery slope: if you can put code concerns above data concerns, then maybe soon you will see cases where code concerns rank higher than the users of your software?
[1] - By coupling like this I mean the "parse, don't validate" school of thought, which says that as soon as you get a blob of data from an external source, be it a file, a database or, in this case, a remote service, you immediately tie yourself to a rocket ship whose journey can see you explosively grow the number of types needed to accurately capture the information for every use case of the data. You could move this parsing operation to be local to the use case of the data (much better) rather than have it at the entry point of the data into the system, but often (although not always) we can arrive at a simpler solution if we are clever enough to express it in a style that can easily be understood by a newbie to programming. That often means relying on the common algebra of core types rather than introducing your own types.
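For concreteness, a small sketch of the dict algebra the first point refers to (all standard Python):

scores = {"alice": 3, "bob": 5}
bonus = {"bob": 1, "carol": 2}

merged = scores | bonus                          # 3.9+ merge operator; right side wins on clashes
doubled = {k: v * 2 for k, v in scores.items()}  # dict comprehension
first_key = next(iter(scores))                   # "alice": insertion order is guaranteed since 3.7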
I use it to parse and validate incoming webhook data in my Python AWS Lambda functions, then re-use the protobuf types when I later ship the webhook data to our Flutter-based frontend. Adding extensions to the protobuf fields gives me a nice, structured way to add flags and metadata to different fields in the webhook message. For example, I can add table & column names to the protobuf message fields, and have them automatically be populated from the DB with some simple helper functions. Avoids me needing to write many lines of code that look like:
MyProtoClass.field1 = DB.table.column1.val
MyProtoClass.field2 = DB.table.column2.val
If you're programming correctly and take encapsulation seriously, then whatever shape the incoming data in a dict has isn't something you should take issue with; you just need to check whether what you care about is in it (or not) and handle that appropriately within your own context.
Rich Hickey once gave a talk about something like this, discussing maps in Clojure, and I think he made the analogy of the DHL truck stopping at your door. You don't care what every package in the truck is, you just care whether your package is in there. If some other data changes, which data always does, that's not your concern; you should be decoupled from it. It's just equivalent to how we program networked applications: there are no global semantics or guarantees on the state of data, there can't be, because the world isn't in sync or static; there is no global state. There's actually another Hickey-ism along the lines of "program on the inside the same way you program on the outside". Dicts are cool, just make sure that you're always responsible for what you do with one.
This is great if you know what you need from the start. If you only find out what you need after passing your data through multiple layers and modules of your system then you need to backtrack through all your code to the place of creation.
If you have immutable data structures, then you have to backtrack through every place where your data is built from previous structures, creating new ones so the additional data can be passed through all of that.
So if your data travels through, let's say, 3 immutable types to reach the place you are working on, then even if you know exactly where the new field you need originates, you have to alter 3 types and 3 places where data is read from one type and crammed into another.
If you have a dict that you fill with everything you got from the API, there's zero work involved in getting the new piece of information that you thought you didn't need but actually do. It's just there.
- The only way to figure out which parameters were even possible was to search through the code for the uses of the dict.
- Default values were decided on the spot all over the place (input.getOrDefault(..)).
- Parameter names had to be typed out each time, so better be careful with correct spelling.
- Having a concise overview how the input is handled (sanitized) was practically impossible.
0/10 design decision, would not recommend.
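For what it's worth, a hedged sketch of how the TypedDict annotation the article recommends would address most of those points (field names are hypothetical):

from typing import TypedDict

class JobParams(TypedDict, total=False):
    name: str      # every possible key is now listed in one place
    retries: int
    dry_run: bool

def run(params: JobParams) -> None:
    # defaults can still be chosen locally, but key spelling is checked statically
    retries = params.get("retries", 3)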
Python seems to have many different kinds of "better classes": the article mentions `dataclass` and `TypedDict`, and AFAIK there are also two different kinds of named tuple (`collections.namedtuple` and `typing.NamedTuple`).
What are the advantages of these "better classes" over traditional classes? How would you choose which of the four (or more?) kinds to use?
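For reference, a minimal sketch of the four declarations side by side (names hypothetical):

from collections import namedtuple
from dataclasses import dataclass
from typing import NamedTuple, TypedDict

PointA = namedtuple("PointA", ["x", "y"])  # runtime-only; no type information

class PointB(NamedTuple):  # typed, immutable, still a real tuple
    x: int
    y: int

@dataclass
class PointC:  # typed, a real class, mutable unless frozen=True
    x: int
    y: int

class PointD(TypedDict):  # type-checker construct; a plain dict at runtime
    x: int
    y: int

A rough rule of thumb, not from the article: TypedDict when the data genuinely stays a dict (wire format), NamedTuple for small immutable records, and dataclasses for records that will grow behavior.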
Plus all this 1995-era OOP and domain-driven-design crap, "business logic" and data layers and all this other architectural rigidity and usually-needless complexity, layers of boilerplate (and then tools to automate the generation of that), etc.
If your function takes a dict and is called from many different places, document the dict format in the function comment. Or yes, create a dataclass if it saves more trouble than its additional boilerplate, code, and maintenance cost. But take it case by case and aim for simplicity. Most of the time I call out to an API in Python, I process its JSON/dict response right after the call, using maybe 10% of the data returned. That's so much cleaner and simpler than writing a whole Data Object Layer, to be used by my API Interface Layer, to talk to my Business Logic Layer, etc.
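As a sketch of that style (the URL and field names are made up):

import requests

resp = requests.get("https://api.example.com/repos/octocat").json()
# pull out only the handful of fields this code path actually needs
name = resp["name"]
stars = resp["stargazers_count"]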
For these kinds of people, no amount of rational evidence or argument is going to convince them this is bad. They practically make an identity out of eschewing anything that seems too orderly or too designed.
(Luckily, at work, most of us on our team like `Pydantic` and also (some of us more than others) type-checking, so these people are dragged along)
Un-annotated tuples and too many func params are cancer.
If you want an immutable mapping, why not use an enum?
BigBag<Dict>