Don't let dicts spoil your code
Roman Imankulov critiques Python dictionaries for causing technical debt and complicating maintenance. He advocates for domain models, dataclasses, and Pydantic to enhance clarity and structure in evolving codebases.
Roman Imankulov discusses the pitfalls of using dictionaries (dicts) in Python programming, particularly in the context of evolving codebases. He argues that dicts tend to become opaque, mutable data structures that complicate maintenance and extensibility. As applications grow, heavy reliance on dicts often signals technical debt, making it difficult to track changes and manage data integrity. Imankulov suggests treating dicts as a "wire format" for data deserialization and advocates using domain models to encapsulate data and behavior. He highlights the benefits of Python's dataclasses and Pydantic for creating structured models, which provide clearer semantics and reduce boilerplate code. For legacy codebases, he recommends annotating dicts with TypedDict to improve type safety. Additionally, he advises annotating key-value stores as mappings to enforce immutability and clarity in data handling. Ultimately, Imankulov emphasizes the importance of controlling dict usage to prevent dicts from undermining application architecture.
- Dicts can lead to technical debt and complicate code maintenance.
- Using domain models instead of dicts enhances clarity and structure.
- Python's dataclasses and Pydantic offer better alternatives for data modeling.
- TypedDict can improve type safety in legacy codebases.
- Annotating dicts as Mapping can enforce immutability and clarity (see the sketch below).
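To make the last two recommendations concrete, here is a minimal sketch (the type and field names are illustrative, not taken from the article):

from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)
class Repository:
    name: str
    stargazers_count: int

# Annotating the parameter as Mapping rather than dict tells readers and
# type checkers that the function only reads from the store, never mutates it.
def total_stars(repos: Mapping[str, Repository]) -> int:
    return sum(r.stargazers_count for r in repos.values())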
Related
Python Modern Practices
Python development best practices include using tools like mise or pyenv to manage multiple versions, running the latest Python release, and using pipx to run applications. Project tips cover the src layout, pyproject.toml, virtual environments, Black, flake8, pytest, wheel, type hinting, f-strings, datetime, enum, named tuples, data classes, breakpoint(), logging, and TOML config for efficiency and maintainability.
The Rise of the Analytics Pretendgineer
The article examines dbt's role in data modeling, highlighting its accessibility for analytics professionals while noting the need for a structured framework to prevent disorganized and fragile projects.
Approximating sum types in Python with Pydantic
Pydantic enables robust data models in Python, supporting sum types and discriminated unions for clear, type-safe definitions. It enhances maintainability and reliability by preventing invalid states in applications.
Lesser known parts of Python standard library – Trickster Dev
The article highlights lesser-known Python standard library features, including advanced data structures in `collections`, precise arithmetic in `decimal` and `fractions`, and tools for resource management, debugging, and packaging.
Know your Python container types
The article discusses Python's container types: lists, tuples, named tuples, sets, dictionaries, and dataclasses, highlighting their uses, differences, and recommendations for appropriate applications in programming.
Another way to look at it is the functional core, imperative shell pattern.
Wrapping up your dict in a value object (a dataclass or whatever that is in your language) early on means you handle the ugly stuff first. Parse, don't validate. Resist the temptation of optional fields. Is there really anything you can do if the field is null? No? Then don't make it optional. Let it crash early on. Clearly define your data.
If you have put your data in neat value objects, you know what is in it. You know the types. You know all required fields are there. You will be so much happier. No checking for null throughout the code, no checking for empty strings. You can just focus on the business logic.
Seriously so much suffering can be avoided by just following this pattern.
dict is an implementation of a hash table. Hash tables are designed for O(1) lookup of items. As such, they are arrays much bigger than the number of items they store, so that items can be hashed to integer slots while sidestepping collisions. They're meant to act like an index that contains many records, not a single record.
A single record is more like a tuple, except you want named access instead of title = movie[0], release_year = movie[1], etc. And Python had that, in NamedTuple, but it was kinda magical and no one used it (shoutout Raymond Hettinger).
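For illustration, a minimal typing.NamedTuple sketch of that named access (field names hypothetical):

from typing import NamedTuple

class Movie(NamedTuple):
    title: str
    release_year: int

movie = Movie("Alien", 1979)
movie.title         # named access instead of movie[0]
movie.release_year  # instead of movie[1]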
Granted, this rant is pretty much the meme with the guy explaining something to a brick wall, in that dicts are so firmly entrenched as the "record" type of choice in Python (but not so in other languages: struct, case class, etc., where JSON doesn't just deserialize to a weak type, but I digress).
>"solution : use dataclasses"
Damn, it's almost like using an untyped language for large projects is not a great idea.
I know how to deal with missing values or variability in maps, and so do a lot of people... what am I missing here?
There are useful ideas in this post, but I'd be careful not to throw the baby out with the bathwater. Dicts are right there. There are dict literals and dict comprehensions. Reach for more specific dict-likes when it really matters.
External API <--dict--> Ser/De <--model--> Business Logic
Life's all great until "External API" adds a field that your model doesn't know about, it gets dropped when you deserialize it, and then when you send it back (or around somewhere else) it's missing a field. There's config for this in Pydantic, but it's not the default, and it isn't for most ser/de frameworks (TypeScript is a notable exception here).
Closed enums have a similar tradeoff.
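For reference, the Pydantic (v2) setting alluded to above is extra="allow"; a minimal sketch with hypothetical fields:

from pydantic import BaseModel, ConfigDict

class Repo(BaseModel):
    model_config = ConfigDict(extra="allow")  # keep unknown fields instead of dropping them
    name: str

repo = Repo.model_validate({"name": "x", "stars": 5})
print(repo.model_dump())  # {'name': 'x', 'stars': 5} -- the extra field survives the round trip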
In TypeScript, using plain JS objects is very straightforward. Of course you have to validate the schema at your system boundaries, but you'll have to do this either way.
So: if this works very well in TS, the problem can't be dicts themselves; it must be the way they integrate into, and are handled in, Python.
This leads me to the conclusion that the arguments presented in the article might be the wrong ones.
(But I still think the conclusion the article arrives at is okay. I just don't think the article makes a strong case about whether to prefer dataclasses or typed dicts.)
https://www.youtube.com/watch?v=aSEQfqNYNAc
But ok, it's less bad in Python since objects are dicts anyway and you don't need getters.
class GetThingResult
  def initialize(json)
    @json = json
  end

  # single thing
  def thing_id
    @json.dig('wrapper', 'metadata', 'id')
  end

  # multiple things
  def history
    @json['history'].map { |h| ThingHistory.new(h) }
  end

  # ... two dozen more things
end
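For comparison, a rough Python counterpart under the same assumed JSON shape: no getter methods, just dict access (payload and ThingHistory are assumptions carried over from the snippet above):

thing_id = payload["wrapper"]["metadata"]["id"]
history = [ThingHistory(h) for h in payload["history"]]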
Most developers will carry their previous language paradigms into their new ones. But if types, DDD (Domain-Driven Design), and classes are what you're looking for, then Python isn't the best fit: Python doesn't have compiler features that work well with those paradigms, such as dead code removal/tree shaking. However, starting out with dictionaries and then moving over to dataclasses is a great strategy.[1] As a small note, it's kind of ironic that the statically typed language Go took inferred typing with its := operator, while there is now a movement in Python to write foo: str = "bar".
You lose the algebra of dicts, and it's a rich algebra to lose, since in Python it's not just all the basic obvious stuff but also powerful things like dict comprehensions and ordering guarantees (3.7+ only).
You tightly couple to a definition. In the simple GitHubRepository example this is unlikely to be problematic; in the real world, coupling like this[1] to objects trying to capture domain data with dynamic structures is regularly the stuff of nightmares.
The overarching problem with the approach given is that it puts code above data. You take what could be a schema, inert data about inert data, and instead use code. But it might also be an interesting case to consider as a slippery slope: if you can put code concerns above data concerns, then maybe soon you will see cases where code concerns rank higher than the users of your software?
[1] - By coupling like this I mean the "parse, don't validate" school of thought, which says that as soon as you get a blob of data from an external source, be it a file, a database or, in this case, a remote service, you immediately tie yourself to a rocket ship whose journey can see you explosively grow the number of types needed to accurately capture the information for every use case of the data. You could move this parsing operation to be local to the use case of the data (much better) rather than have it at the entry point of the data into the system, but often (although not always) we can arrive at a simpler solution if we are clever enough to express it in a style that can easily be understood by a newbie to programming. That often means relying on the common algebra of core types rather than introducing your own types.
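For concreteness, a small sketch of the dict algebra the first point refers to (all standard Python):

scores = {"alice": 3, "bob": 5}
bonus = {"bob": 1, "carol": 2}

merged = scores | bonus                          # 3.9+ merge operator; right side wins on clashes
doubled = {k: v * 2 for k, v in scores.items()}  # dict comprehension
first_key = next(iter(scores))                   # "alice": insertion order is guaranteed since 3.7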
I use it to parse and validate incoming webhook data in my Python AWS Lambda functions, then re-use the protobuf types when I later ship the webhook data to our Flutter-based frontend. Adding extensions to the protobuf fields gives me a nice, structured way to add flags and metadata to different fields in the webhook message. For example, I can add table & column names to the protobuf message fields, and have them automatically be populated from the DB with some simple helper functions. Avoids me needing to write many lines of code that look like:
MyProtoClass.field1 = DB.table.column1.val
MyProtoClass.field2 = DB.table.column2.val
If you're programming correctly and take encapsulation seriously, then whatever shape the incoming data in a dict has isn't something you should take issue with; you just need to check whether what you care about is in it (or not) and handle that appropriately within your own context.
Rich Hickey once gave a talk about something like this, discussing maps in Clojure, and I think he made the analogy of the DHL truck stopping at your door. You don't care what every package in the truck is, you just care whether your package is in there. If some other data changes, which data always does, that's not your concern; you should be decoupled from it. It's just equivalent to how we program networked applications: there are no global semantics or guarantees on the state of data, there can't be, because the world isn't in sync or static; there is no global state. There's actually another Hickey-ism along the lines of "program on the inside the same way you program on the outside". Dicts are cool, just make sure that you're always responsible for what you do with one.
This is great if you know what you need from the start. If you only find out what you need after passing your data through multiple layers and modules of your system then you need to backtrack through all your code to the place of creation.
If you have immutable data structures, then you have to backtrack through every place where your data is built from previous structures, creating new ones so the additional data can be passed through all of that.
So if your data travels through, let's say, 3 immutable types to reach the place you are working on, then even if you know exactly where the new field you need originates, you have to alter 3 types and 3 places where data is read from one type and crammed into another.
If you have a dict that you fill with everything you got from the API, there's zero work involved in getting the new piece of information that you thought you didn't need but actually do. It's just there.
- The only way to figure out which parameters were even possible was to search through the code for the uses of the dict.
- Default values were decided on the spot all over the place (input.getOrDefault(..)).
- Parameter names had to be typed out each time, so better be careful with correct spelling.
- Having a concise overview how the input is handled (sanitized) was practically impossible.
0/10 design decision, would not recommend.
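For what it's worth, a hedged sketch of how the TypedDict annotation the article recommends would address most of those points (field names are hypothetical):

from typing import TypedDict

class JobParams(TypedDict, total=False):
    name: str      # every possible key is now listed in one place
    retries: int
    dry_run: bool

def run(params: JobParams) -> None:
    # defaults can still be chosen locally, but key spelling is checked statically
    retries = params.get("retries", 3)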
Python seems to have many different kinds of "better classes": the article mentions `dataclass` and `TypedDict`, and AFAIK there are also two different kinds of named tuple (`collections.namedtuple` and `typing.NamedTuple`).
What are the advantages of these "better classes" over traditional classes? How would you choose which of the four (or more?) kinds to use?
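For reference, a minimal sketch of the four declarations side by side (names hypothetical):

from collections import namedtuple
from dataclasses import dataclass
from typing import NamedTuple, TypedDict

PointA = namedtuple("PointA", ["x", "y"])  # runtime-only; no type information

class PointB(NamedTuple):  # typed, immutable, still a real tuple
    x: int
    y: int

@dataclass
class PointC:  # typed, a real class, mutable unless frozen=True
    x: int
    y: int

class PointD(TypedDict):  # type-checker construct; a plain dict at runtime
    x: int
    y: int

A rough rule of thumb, not from the article: TypedDict when the data genuinely stays a dict (wire format), NamedTuple for small immutable records, and dataclasses for records that will grow behavior.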
Plus all this 1995-era OOP and domain-driven-design crap, "business logic" and data layers and all this other architectural rigidity and usually-needless complexity, layers of boilerplate (and then tools to automate the generation of that), etc.
If your function takes a dict and is called from many different places, document the dict format in the function comment. Or yes, create a dataclass if it saves more trouble than its additional boilerplate, code, and maintenance cost. But take it case by case and aim for simplicity. Most of the time I call out to an API in Python, I process its JSON/dict response right after the call, using maybe 10% of the data returned. That's so much cleaner and simpler than writing a whole Data Object Layer, to be used by my API Interface Layer, to talk to my Business Logic Layer, etc.
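As a sketch of that style (the URL and field names are made up):

import requests

resp = requests.get("https://api.example.com/repos/octocat").json()
# pull out only the handful of fields this code path actually needs
name = resp["name"]
stars = resp["stargazers_count"]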
For these kinds of people, no amount of rational evidence or argument is going to convince them this is bad. They practically make an identity out of eschewing anything that seems too orderly or too designed.
(Luckily, at work, most of us on our team like `Pydantic` and also (some of us more than others) type-checking, so these people are dragged along)
Un-annotated tuples and too many func params are cancer.
If you want an immutable mapping, why not use an enum?
BigBag<Dict>