> group related concepts together
> The hardest part of this process is deciding what “related concepts” mean.
The article talks about "readability", but arguably the unnamed hard problem it is dancing around is how to structure an application or system by decomposing it into modules.
I'd argue the baseline reasonable approach to structuring applications or systems is the one given in Parnas' 1972 paper "On the Criteria To Be Used in Decomposing Systems into Modules":
> We propose instead that one begins with a list of difficult design decisions or design decisions which are likely to change. Each module is then designed to hide such a decision from the others.
http://sunnyday.mit.edu/16.355/parnas-criteria.html
Parnas' criterion embeds the understanding that code and systems are not static but need to evolve over time as requirements change or decisions are made, and that different decompositions can be inferior or superior at accommodating that change.
"Don't repeat yourself" refactoring rules of thumb can give poor results if blindly applied. Suppose two sections of application logic just so happen to look similar at this moment in time and get refactored to "remove the duplication" coupling them together, when the two sections of code are subject to different constraints and reasons for change, and will need to evolve separately.
People forget that readability isn't a function of a specific program - there is no one optimal readability. On the contrary, it's a function of the program and the goals of the reader. So after fixing and DRYing all the generally bad/inefficient decisions, what counts as readable code becomes solely a question of why you're reading it - trying to debug or add an entirely new feature will have opposite readability criteria to extending some high-level feature.
Even in the best case, readability just becomes a Pareto frontier[0], given by expressive limits of the dominant programming paradigm - the same single plaintext source code for all. There's only so much complexity, so many cross-cutting concerns, we can cram into the same piece of plaintext before something gives - before the same code is beautiful to you one week and incomprehensible the next, when the only thing that changed is the type of work you're doing on it.
So, beyond evolving over time, I'd also consider the orthogonal aspect of different decompositions being good for different purposes, and that you can't have it all and work on the same, single, high-level plaintext code.
EDIT: And I believe the solution to this, the step forward beyond the Pareto frontier, is what 'valty described here: https://news.ycombinator.com/item?id=39426895 - not coding directly in the same plaintext, but treating the single-source-of-truth code as a database, which you query and update through views/lenses that best fit whatever work you're doing at the moment.
If it's not too much trouble, could you create a minimal demonstration of a simple piece of code, structured for various goals - easy to extend, easy to debug, etc.? I can't defend my code from the best-practice people with a Pareto front Wikipedia article.
> Even in the best case, readability just becomes a Pareto frontier[0], given by expressive limits of the dominant programming paradigm
> People forget that readability isn't a function of a specific program - there is no one optimal readability. On the contrary, it's a function of the program and the goals of the reader.
The original Wiki at c2 has a great example [0] comparing the expressive capabilities of functional vs. object-oriented programming, and their suitability towards certain goals - namely, either extending the set of operations the system supports, or expanding the number of data (sub-)types the system models.
In spite of Turing equivalence, some paradigms (even if only in terms of readability, dev ergonomics etc.) are better suited for expressing certain classes of problems than others, which may introduce unnecessary friction trying to solve problems that are mismatched with the paradigm's approach to structuring and decomposing problems (e.g. into compositions of functions, or compositions of objects).
Think "lots of small functions" vs. "few fat functions" - one of those infamous code style holy wars. Look at the arguments people make for and against either.
There is no single answer there, because which style is better depends on what you're doing. For example, "lots of small functions" makes things easier to understand when you're working horizontally, trying to e.g. understand a module at a certain level of abstraction. However, in the same code, if you're trying to understand a single piece of functionality, e.g. to debug it, it's much easier when you have a vertical view - ideally a single fat function with all helper calls inlined. In the first case, all the little functions form a higher-level language that aids your understanding; in the latter, they're just noise that kills your working memory with all the jumping-to-definition around the codebase.
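To make the horizontal/vertical thing concrete, here's a toy sketch of the same logic written both ways (contrived example, hypothetical names):

    class InvoiceExample {
        // "Horizontal" style: skimmable at one level of abstraction.
        static double invoiceTotal(double[] prices, double taxRate, double discount) {
            double subtotal = sum(prices);
            return addTax(applyDiscount(subtotal, discount), taxRate);
        }

        static double sum(double[] prices) {
            double s = 0;
            for (double p : prices) s += p;
            return s;
        }

        static double applyDiscount(double amount, double discount) { return amount * (1.0 - discount); }

        static double addTax(double amount, double taxRate) { return amount * (1.0 + taxRate); }

        // "Vertical" style: everything inlined. Noisier to skim the module,
        // but the whole story fits on one screen when you're debugging this one path.
        static double invoiceTotalInlined(double[] prices, double taxRate, double discount) {
            double subtotal = 0;
            for (double p : prices) subtotal += p;
            return subtotal * (1.0 - discount) * (1.0 + taxRate);
        }
    }

Only one of those two shapes gets committed to the repo.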
Our current programming paradigm forces you to make this style choice ahead of time. You can't have it both ways. This is the Pareto frontier - as you make your codebase easy for one type of work, it becomes hard to do other kinds of work in it. And this is a stupid state to be in, because you will be doing both horizontal and vertical tasks, and many others that benefit in yet different kinds of slicing through code, and you will be switching gears every few days or weeks.
Another concrete example: exceptions vs. algebraic return types (Expect/Result/Maybe/etc.). People love the latter for code locality, and mostly ignore the ridiculous amount of noise this method adds to all code, and/or invent ever more advanced math to paper over it. Exceptions were much better in this regard, if worse in others, but again, I posit that having to make that choice is dumb in the first place. Personally, I'm fine with Result return types. It's just that, 90% of the time, I don't give a damn about them, because I'm working on the success case/golden path, and they're just pure visual noise. It's something I'd like to just not see. But then, the remaining 10% of the time, I'd like to make everything other than Result types and control flow disappear, because when working with error handling, the success case becomes visual noise.
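A sketch of that noise, with a hand-rolled Result type (purely illustrative, Java 17-ish):

    sealed interface Result<T> permits Ok, Err {}
    record Ok<T>(T value) implements Result<T> {}
    record Err<T>(String message) implements Result<T> {}

    class ResultNoiseDemo {
        static Result<Integer> parse(String s) {
            try { return new Ok<>(Integer.parseInt(s)); }
            catch (NumberFormatException e) { return new Err<>("not a number: " + s); }
        }

        // Result style: every step gets unwrapped and re-wrapped, even though
        // most of the time I'm only thinking about the Ok branch.
        static Result<Integer> sum(String a, String b) {
            Result<Integer> x = parse(a);
            if (x instanceof Err) return x;
            Result<Integer> y = parse(b);
            if (y instanceof Err) return y;
            return new Ok<>(((Ok<Integer>) x).value() + ((Ok<Integer>) y).value());
        }

        // Exception style: the success path reads straight through; the error
        // flow is invisible here and handled (or not) somewhere up the stack.
        static int sumWithExceptions(String a, String b) {
            return Integer.parseInt(a) + Integer.parseInt(b);
        }
    }

Neither version is the one I want all the time - I want to toggle between them depending on whether I'm working on the golden path or on the error handling.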
Inventing new species of monads or syntax keywords to cram this, and async, and other cross-cutting concerns, isn't the solution. The solution is to stop working on raw plaintext source code, and instead work on representations tailored to specific needs, while treating the underlying source code like we treat assembly today.
I wish there was another Gang of Four with a book on codebase design decisions, some of which you had outlined. I need the alternatives laid out clearly and, more importantly, given catchy names that I can refer to when real, or self-appointed code reviewers show up.
How do I explain what my preferred "level of abstraction" is and why it is superior to all the others? How can I be convincing with a fuzzy and subjective term "visual noise"? Etc.
> How do I explain what my preferred "level of abstraction" is and why it is superior to all the others?
That's my point: you shouldn't, because there isn't. Once you hit the Pareto front, there's no superior choice. There are just choices, each of which is better in different situations, and once taken, very expensive to back out of. The problem is being forced to make the choice in advance.
> How can I be convincing with a fuzzy and subjective term "visual noise"?
That's another part of the problem. Different choices may be better for different people. There's no one-size-fits-all here. Which is why, again, it's dumb that we have to make this choice once and for everyone (part of what I mean by "working on single-source-of-truth code").
The way I see it, to stop running in circles on the Pareto front, to move past it to more powerful ways of dealing with complexity, we need to make things like "lots of small functions vs. few fat ones" or "exceptions vs. expected" into subjective, personal, local preferences, not meaningfully different from your editor's color scheme or syntax highlighting scheme; inlining needs to be as easy as code folding, etc.
Your arguments still seem to lead me to the entirely opposite conclusion - that different choices would be better for different organizations and different projects, and that it would make sense for things like "lots of small functions vs. few fat ones" or "exceptions vs. expected" to be well-known tradeoffs where the choice is made in an objective manner according to the type of project and overrides personal preference: for this codebase we explicitly optimize for this type of reader, because we believe it has XYZ properties and will be maintained at ABC level by DEF kind of people.
My argument is that those are fundamentally not project-level tradeoffs, those are "whatever your current ticket is about"-level tradeoffs. Your task may be such that you'll want opposite choices to have been made within a 5-minute span - e.g. "lots of small functions" to get the gist of what the module is doing, followed by inlining everything into a single fat function along a vertical, after you've found what piece of logic you want to debug. Both might also benefit from temporarily pretending error handling is done by unchecked exceptions, to remove visual clutter.
I.e. you're talking strategy and architecture, I'm talking not even tactics, but individual minute-to-minute execution.
IMHO at the ticket-to-ticket level you're not really making these tradeoffs but rather experiencing the consequences of them - it doesn't matter if for minute-to-minute execution right now lots of small functions would be better or worse, you either have them or not; and whatever tradeoffs you make for the code you write during this one ticket, it should take into account the potential future readers.
> it doesn't matter if for minute-to-minute execution right now lots of small functions would be better or worse, you either have them or not; and whatever tradeoffs you make for the code you write during this one ticket, it should take into account the potential future readers.
This is precisely the problem.
The way it should look is, these trade-offs should be purely local, minute-to-minute editor preferences. Inlining some helpers or removing error types from the code you see should be no harder than folding a code block or pinning a peeked function definition. Nor should those "changes" have any effect visible to anyone else. You should not take into account any "future readers", because there won't be any - rather, you should pick the representation you need right this second, and switch to a different one the moment you need that, etc.
Ah ok, I was assuming that having this be a fully transparent & reversible issue of representation isn't really possible - but if it is, then sure, then I'd agree, then it can be treated just like indentation or syntax highlighting color schemes.
But for the major issues, like the split between "lots of small functions" vs. "one big function": even if the IDE could e.g. inline the functions for readability, that feels a bit risky, as it makes it difficult to talk about the code or document a function's interface if another person is seeing a different semantic structure of the code - it's as if they had different names for the same things.
Until future tech allows for reformatting code to a developer's preference (which might not arrive any time soon, or might introduce subtle new bugs, who knows), we could use some framework to make transparent decisions around that Pareto front, no?
(Of course there is a best way to do it that will optimize the codebase for a function of productivity, reliability and pleasure. The likelihood and the cost of changes are part of that function. Finding and advocating for it is the task of the most experienced, wisest team member. Less experienced team members may (or may not) understand the wisdom of the better decision in time. But this is just my struggle with postmodernism and relativism and besides the point)
> Another concrete example: exceptions vs. algebraic return types (Expect/Result/Maybe/etc.). People love the latter for code locality, and mostly ignore the ridiculous amount of noise this method adds to all code, and/or invent ever more advanced math to paper over it. Exceptions were much better in this regard, if worse in others, but again, I posit that having to make that choice is dumb in the first place.
There's actually at least one programming language that allows you to choose as needed, and it's Visual Basic of all things.
The "On Error Goto" directive says on error you want to jump to an exception handler. "On Error Resume Next" means just keep forging ahead -- which is a terrible idea done badly, but if check the Err object, you can access the status.
It's not quite as good as algebraic return types, but it was an interesting idea in that you can write each bit of code in whichever way works out best.
I like VB's idiosyncrasies for other reasons, but here my point is different. This is not about making choices at runtime, but about making them in your IDE, for yourself. It's about the lens you view your code through. Hiding or emphasizing error handling should be a visual operation, with no impact on semantics - just like e.g. folding or unfolding blocks of code in your editor.
I'm not sure this is even possible - all the value comes when the underlying application is no longer simple. If you try to make a minimal example it just looks over-engineered, like some sort of "enterprise hello world" joke.
    public class HelloWorld {
        public static void main(String[] args) {
            // Prints "Hello, World" in the terminal window.
            System.out.println("Hello, World");
        }
    }
This is the full code. This is what I care about on the success path:
    System.out.println("Hello, World");
    // Success
Or maybe even just:
    println("Hello, World");
Or maybe, because I forgot what is what in Java:
    java.lang.System - The System class contains several useful
    |                  class fields and methods. It cannot be instantiated.
    |
    |      System.out : java.io.PrintStream - The "standard" output stream.
    |      |
    |      |   PrintStream::println(String) -> void - Prints a String and then terminate the line.
    v      v   v
    System.out.println("Hello, World");
This is what I care about when I'm looking at the module view:
    HelloWorld (public, class)
        main(String[] args) -> void (public, static, entry point)
            -- Prints "Hello, World" in the terminal window.
This is what I care about when I'm looking at the error propagation view:
    HelloWorld::main() - entry point
        Program end, return code 0 [default]
Or maybe, given sufficient metadata in the standard library:
    HelloWorld::main() - entry point
        System.out.println("Hello, World")
            -> failed if System.out.checkError() == true
        Program end, return code 0 [default]
Or I want a dependency view:
    HelloWorld
        java.lang.System
            [java.io.PrintStream]
Etc.
Now imagine being able to open the Hello World code, and instantly switch between those views and many others with a single shortcut. Not tooltips above the code, but views that replace code. That is what I'm talking about.
Oh, and views are editable where it makes sense. Even read-only views would improve the development experience a lot, but the magic is in mutable views, so that you don't ever have to touch the full, original form of the source.
EDIT: look at the first vs. third code-block in this comment. People built whole programming languages on the premise that Hello World should look like the third block instead of the first one. That's an extreme case of making a style choice ahead of time, and also a source of endless, pointless debates.
Interesting, and somewhat doable for IDEs in current languages, but I think the real question is: is it possible for what you can't see to hurt you?
That is, if typing in one of the restricted views, is it possible to create a bug because of something that's currently hidden from you? I feel as soon as that happens to someone, they are going to switch to the "show everything" view and never switch back.
> look at the first vs. third code-block in this comment. People built whole programming languages on the premise that Hello World should look like the third block instead of the first one.
> is it possible to create a bug because of something that's currently hidden from you?
Isn't this always the case?
I don't think there is a "full view" like you describe. Even if you're writing raw assembly, you're still working on top of the abstractions provided by the instruction set. There is no such thing as a full view where you can see every detail of everything that is going to happen.
I think the solution to this problem is tests and doing your due diligence in understanding what you're doing.
Having more control over what abstractions you're seeing should help with the understanding part, which should in turn reduce bugs resulting from lack of understanding.
Edit: related thought: it's about improving the tooling. Which I suppose is another thing for someone to learn and possibly misuse. But I don't think it's correct to say that firefighters shouldn't carry axes because they could accidentally kill someone with them.
You can project your programs into different views and add lots of metadata about the program and use that data in various contexts. Also extends to source control and other use cases. It looked really neat.
With enough work I think it is definitely possible, but your comment made me realise it would be pretty large, almost a research project.
One could start from a real case in a real company, document it thoroughly, including the different stakeholders. Then rewrite the code to fit this organisation at this point in time.
Then, make two or three fictional changes to the organisation and circumstances, snapshot these (or follow an organisation longitudinally for long enough that this occurs naturally) and, for each snapshot, rewrite and redocument the code to fit those circumstances.
From that, one could "dumb down" the whole thing until it stops making sense and see how simple one could make it.
Probably there are people smart and seasoned enough to write an entirely fictional account of all of this, and still have it make sense - that's not me though.
In my experience, most programmers go way too hard with anticipating future changes and end up creating systems with entirely too much abstraction. Most of the time those changes never occur and the result is a codebase which has been obfuscated with excessive abstraction that bogs down anybody trying to maintain it. Future changes end up needing different abstractions than the ones which were preemptively created, and as the programmer is adding new abstractions to cover their present need they also create new premature abstractions, thinking they're saving themselves future trouble. The cycle then continues.
Better to KISS and leave abstraction for the future when it actually becomes necessary. If you start out with code that is only as complex as it needs to be in that moment, then it will generally be much easier to change it in the future.
> "Don't repeat yourself" refactoring rules of thumb can give poor results if blindly applied. Suppose two sections of application logic just so happen to look similar at this moment in time and get refactored to "remove the duplication" coupling them together, when the two sections of code are subject to different constraints and reasons for change, and will need to evolve separately.
I'm fairly sure I remember reading somewhere that that piece of advice was originally meant for data/values/configuration, not code, and that applying it to code is itself a mistake that keeps getting repeated.
Regardless of who said it and what they meant, I don’t want more code to write tests for, more pages to read through when stuff breaks, more material for new engineers to learn. You can always start copy-pasting and make a mess later - less true in the reverse.
Like almost anything it can be taken too far or misapplied.
In the quoted example, when I have multiple occurrences of related business logic, I build a vocabulary of reusable sub elements - find the joints and carve, don’t build a giant mutant.
Wasn't familiar with Parnas' criteria, thanks for sharing.
I do something similar in a different way, which I call "IKEA-oriented development". IME, semi-disposable code is very easy to change over time as mental models and product goals evolve:
Thank you for the post, and the link to this second one as well.
Re: "IKEA-oriented development", you make a very good point about the cost of change. I think the semi-disposable code idea overlaps comments from folks elsewhere in this discussion thread talking about the horror of codebases that introduced premature abstractions to cope with expected future changes that then never actually appeared ("YAGNI" is indeed a good rule of thumb).
Your point about "make experimentation effortless" is a good one. The highest productivity environment I worked with that supported rapid experimentation was a small business' monorepo codebase with good test coverage and rapid feedback from CI, where the library code was only used internally by the company's software products (i.e. all the abstractions were implementation details, not part of any external interface). Over time we'd learn that some of our early ideas for abstractions in the internal libraries were flawed, but because these abstractions were internal, and we had confidence in the automated test coverage, it was possible to make quite large scale improvements to abstractions rapidly with confidence as we learned more.
The kind of environment that really bogs down experimentation and impedes change and improvements to abstractions is where an initial idea for an abstraction is resourced with its own development team and turned into a production service, and then another half a dozen internal company services start depending on it. Then it's very easy to end up in situations where everyone becomes aware that the abstraction is flawed, but improving it is less "one developer goes dark for a week or two and emerges with a 50-patch PR that atomically replaces the flawed v1 abstraction with v2 while passing all test suites in all projects that depend upon it" and more "project managers, product owners and enterprise architects compare roadmaps for the next few quarters to figure out how many years it might be until a prototype of the v2 abstraction can be ready for manual testing in the integrated test environment".
Maybe in the worst case there's some initial decomposition of the system that is flawed, then an org chart is spun up defining teams that own components matching the flawed system decomposition, so refactoring to improve the decomposition would also require refactoring the org chart to change people's teams. Then instead of having colleagues indifferent to or supporting a purely technical refactor, people will resist it to avoid change to their roles!
I haven't worked on a project where we've known all our problems up front, and most of the time the complexity is added to cater for "flexibility" that rarely ends up being a useful implementation for what we actually needed. It's great to hide this from other areas, but you will need to work on it, and it will impact how the software is architected.
That's interesting. Knowing when to decompose systems into modules indeed seems to be key. This is a complex problem because, I think, the choice of the optimal model depends on the uncertainty you have about the reality behind the data, about what you know and don't know about the domain you are modeling.
But there might be optimal solutions rooted in information theory and Bayesian probabilities that you can strive to approach while programming. This is about avoiding over-fitting or under-fitting your domain knowledge.
Theoretically speaking, finding the right Bayesian fit optimizes for future evolution of the code and how it generalizes into the unknown, how correct your software will be when faced with things you haven't specifically designed for. More here: https://benoitessiambre.com/abstract.html
If I were to add something to the abstract.html blog post, it would be something about Dependency Length Minimization ( https://www.pnas.org/doi/full/10.1073/pnas.1502134112#:~:tex... ), which has important information-theoretic ramifications (for example, files with shortened dependencies tend to compress better, and LLMs became much better when they solved for managing dependencies with their "attention" mechanism). When an abstraction breaks out a piece of code to enable reuse, the reduction in redundancy should be weighed against the stretching of dependencies to decide whether the abstraction is warranted.
The original article acknowledges this by mentioning "locality".
Your linked blog post "Abstraction and the Bayesian Occam's Razor" is very interesting. I'll play back my understanding to you, to see if I'm approximately following and summarising your thesis.
Context:
When programming we attempt to design an effective abstraction that models some domain. When designing this abstraction, there are trade-offs between reducing the amount of code required, enabling reuse, reducing coupling, flexibility to accommodate future use cases.
Key Problem:
How do we design an abstraction for our domain model?
Claim 1:
Apply the "Minimum Description Length" (MDL) model selection principle: prefer a domain model embedded in the shortest program able to recreate the dataset of domain knowledge.
Applying MDL model selection will result in an abstraction for the domain model that is both smaller -- giving less code to maintain -- and more likely to generalize to future unknown use cases.
Complication:
Applying the MDL model selection principle relies on having access to a dataset of domain knowledge. We can think of this dataset of domain knowledge as a list of (situation, expected behaviour) pairs -- c.f. a labelled supervised learning dataset, or a gigantic list of requirements. Unfortunately, in typical software projects, no such explicit dataset cataloguing the requirements or expected behaviour in each situation exists.
Claim 2:
We can use the automated test suite as a proxy for the dataset of domain knowledge. When designing our abstraction we should prefer a domain model where the combined size of the logic for the domain model and the size of the corresponding test suite* is minimal.
* with the important caveat that "just cutting out the tests, or removing other safeties like strict types doesn't give you a lower MDL, in that case, you're missing the descriptions of important parts of your data or knowledge".
Thanks for engaging with my wild ideas. I do think that's a good description yes. There are corollaries like Dependency Length Minimization I mentioned, which I think helps lower the MDL. Putting related things together allows using smaller symbols to describe things because more can be inferred from the adjacent context.
In my mind this is also related to concepts like scientific notation. Minimum description is not just about minimizing the number of operations, the number of branches, the degrees of freedom of your program. Taking into account variable width also makes a difference. Using a timestamp YYMMDDHHMMSS when you only need YYMMDD creates unintended edge cases the same way using too many digits in scientific notation creates noise, and both can result in poor calibration with the uncertainty of the model.
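A trivial illustration of that last point (Java, just to make it concrete): keying on more precision than the domain needs silently splits "the same thing" into different things.

    import java.time.LocalDate;
    import java.time.LocalDateTime;

    class PrecisionDemo {
        public static void main(String[] args) {
            // Two events on the same business day.
            LocalDateTime a = LocalDateTime.of(2024, 3, 1, 9, 15, 0);
            LocalDateTime b = LocalDateTime.of(2024, 3, 1, 17, 40, 0);

            // Modelling with full timestamps when the domain only cares about the day:
            System.out.println(a.equals(b));                              // false
            // Modelling only what you need:
            System.out.println(a.toLocalDate().equals(b.toLocalDate()));  // true
        }
    }

The extra digits are degrees of freedom the domain never asked for, and each one is a place for an edge case to hide.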
If you ever get the time & mental bandwidth to do a second pass on your blog post, it could benefit from a worked example or two where you lay out sample code for the domain model & application logic & sample code for a test suite, along with their description lengths (measured in some MDL-appropriate sense), then illustrate how various changes to the design that plausibly make it better or worse at accommodating future change influence the description length.
One way to define a test suite that would be very easy for readers to understand could be to rig a test suite of "table-driven tests" for some toy problem.
I'm not sure what a good toy problem would be - something small enough that it's relatively easy for readers to understand, but large enough to show MDL being applicable to a more general kind of programming problem and not merely to a statistical modelling exercise (one person's decision tree fit via supervised learning is another person's handwritten function containing a morass of if-else chains - but we already expect MDL to give reasonable results when applied to decision trees; it's less obvious it gives good results when applied to more general programming problems).
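By "table-driven tests" I mean something like this minimal sketch (toy leap-year rule, no test framework, purely illustrative):

    class LeapYearSpec {
        // The "dataset of domain knowledge": (situation, expected behaviour) pairs.
        static final Object[][] CASES = {
            {1999, false}, {2000, true}, {2023, false}, {2024, true}, {2100, false},
        };

        static boolean isLeap(int year) {
            return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
        }

        public static void main(String[] args) {
            for (Object[] c : CASES) {
                int year = (int) c[0];
                boolean expected = (boolean) c[1];
                if (isLeap(year) != expected)
                    throw new AssertionError("isLeap(" + year + ") != " + expected);
            }
            System.out.println("all cases pass");
        }
    }

The MDL question would then be whether a candidate design shrinks the combined size of the implementation plus the table of cases needed to pin its behaviour down - which is easy to eyeball when the "dataset" is literally a table.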
> In my experience, the key to maintaining readability is developing a healthy respect for locality
I think this pursuit of "locality" is what actually causes more complexity. And I think it's mainly down to our obsession with representing our code as text files in folder hierarchies.
> coarsely structure codebases around CPU timelines and dataflow
This is why I would prefer code to be in a database, instead of files and folders, so that structure doesn't matter, and the tree view UI can be organized based on runtime code paths, and data flow - via value tracing.
> don’t pollute your namespace – use blocks to restrict variables/functions to the smallest possible scope
Everyone likes to be all modular and develop in tiny little pieces that they assemble together. Relying on modularization means that when stuff changes upstream in the call stack, we just hack around it by adding some conditionals instead of resorting to larger refactors. People like this because things can keep moving instead of everything breaking.
Instead, what we need to do is make it easier to trace all the data dependencies in our programs so that when we make a change to anything, we can instantly see what depends on it and needs updating.
I have actually started to think that, against conventional wisdom, everything should be a global variable. All the issues with global variables can be solved with better tracing tooling.
Instead we end up with all these little mini-databases spread all over our code, when what we should have is one central one from which we can clearly see all the data dependencies.
> group related concepts together
Instead, we should query a database of code as needed...just like we do with our normalized data.
I was thinking about code along the same lines: we are modeling, not writing text.
This just happens to be the best way to express our models in a way a computer can be made to understand, formal enough and yet still understandable by others.
What current languages are bad about is expressing architecture, and the problem of having one way to structure our models (domain models) vs. the actions/transformations that run on them (flow of execution).
I strongly disagree on the global variable side though...
> I strongly disagree on the global variable side though...
My thinking is that software has been terrible (over-complex) for such a long time that it's time to start questioning our most dogmatic principles, such as "global variables are bad".
Imagine you can instantly see all the dependencies to/from every global variable whenever you select it. This mitigates most of the traditional complaints.
I would argue that adequate tooling that allows for this would dramatically simplify all development. It's the only thing that matters, and it's so absent from every development platform/language/workflow.
If we could only see what was going on in our programs, we would see the complexity, and we could avoid it.
Another related bit of dogma is _static scoping_. Why does a function have to explicitly state all its arguments? Why aren't we allowed to access variables from anywhere higher up in a call stack?
What you realize is that all of these rules exist so you can look at plain text code and (kind of) see what is going on. This is a holdover from low-powered computers without GUIs, like most of programming. Even if an argument is explicit, if it's passed down via 10 layers, you still have to go look.
Even with excellent introspection and debugging tools, it's hard to prove anything about the state of a mutable global variable (since it's hard to reason about the effects of many interleaved instructions), so it's hard to prove anything about your program that depends on that state (like whether the program is correct enough), and code accessing that global must be more complicated to account for the fact that it doesn't know much about it.
Suppose you have something that's effectively a global anyway (logging configuration), isn't mutable (math constants), or for some reason you can actually only have one of (memory page table). Sure, you probably gain a lot by making the fact that it's global explicit instead of splattering that information across a deep, uncommented call chain.
Could other use cases exist? Sure. Just be aware that there's a cost.
> dynamic scoping not bad
It's not just a matter of visibility (though I agree with something I think you believe -- that more visual tools and more introspectability are extremely helpful in programming). No matter whether you use text or interactive holograms to view the program, dynamic scoping moves things you know at compile-time to things you know only at run-time. It makes your programs slower, it makes them eat more RAM, and it greatly complicates the implementation of any feature with heap-spilled call stacks (like general iterators or coroutines).
Implementation details vary, but dynamic scoping also greatly increases the rate of shadowing bugs -- stomping on loop iterators, mistyping a variable and accidentally using an outer scope instead of an inner scope, deleting a variable in an outer scope without doing a full text search across your whole codebase to ensure a wonky configuration won't accidentally turn that into a missing variable error at runtime 1% of the time, ....
Modern effect systems actually look a lot like a constrained form of dynamic scoping, and some people seem to like those. Dynamic scoping isn't "bad"; it just has costs, and you want to make sure you buy something meaningful.
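For a concrete taste of that constrained form in a mainstream language: Java's ThreadLocal lets a callee read a value set higher up the call stack without it appearing in any signature (sketch, hypothetical names):

    class RequestContext {
        static final ThreadLocal<String> CURRENT_USER =
            ThreadLocal.withInitial(() -> "anonymous");

        static void handleRequest() {
            CURRENT_USER.set("alice");
            try {
                deepHelper();           // no parameter threading...
            } finally {
                CURRENT_USER.remove();  // ...but you pay in lifecycle bookkeeping
            }
        }

        static void deepHelper() {
            // ...and in invisibility: nothing in this signature says it depends on the user.
            System.out.println("acting as " + CURRENT_USER.get());
        }

        public static void main(String[] args) {
            handleRequest();
        }
    }

You get the convenience of not threading the parameter through ten layers, and you pay in exactly the lifetime and reasoning costs described above.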
> Another related bit of dogma is _static scoping_. Why does a function have to explicitly state all its arguments? Why aren't we allowed to access variables from anywhere higher up in a call stack?
E.g. dynamic vs lexical scoping. Dynamic scoping used to be more popular and you can still use it in some languages like elisp. In some situations it's a natural fit for the problem, but I think in most cases lexical scoping is simply easier to use.
With plain text editors for sure. You really need a mandatory re-imagined IDE to make it work.
You need to be able to see exactly where the variable is coming from...which I think would be a good feature anyway.
And for this you really need a live programming environment...which I think would be good too...but they are very rare. Everyone is obsessed with static typing these days, but runtime value tracing is more useful imho.
I think the main problem is that we think of code as text. So the only way to determine if code is related is by parsing all of the text again. I'm not sure if a database representation is really the correct path to take, but I think we need some other way to represent code and give parts of code meaning.
> Tight integration of the environment with the storage format brings some of the nicer features of database normalization to source code. Redundancy is eliminated by giving each definition a unique identity, and storing the name of variables and operators in exactly one place.
They're both old and completely ignored. People occasionally reinvent them when they e.g. store code in DBs, or add scripting languages to their programs, or build a new programming language because Hello World in Java is too verbose.
Unison plays with these ideas (I tried it, it's taking things in the right direction, though I still can't figure out how to write anything more complex than sorting numbers in the REPL with it; the examples are too Haskelly, IMHO.) Smalltalk language is, I believe, the original - built around the assumption that code is in the database, and coming with a built-in IDE for this. Glamorous Toolkit is trying to push this further, to give programmers better ability to create ad-hoc problem-specific views into their programs.
I've seen a few other articles written about this over the years, but I don't have any link handy.
I've worked them up from questioning the things about programming that seem most rigid and dogmatic over many years. But there is a lot of literature I have found along the way.
Intentional Programming is an interesting read as someone has already mentioned...from the guy who brought us `strHungarianNotation`. Storing code in a database but retaining the joy of the plain text cut/copy/paste experience is the key challenge, as well as all the unix file goodness.
It's quite fun to talk to ChatGPT about these topics and just question everything and delve back into the history of programming.
this is sort of what modern ides (e.g. jetbrains stuff) already do in the bg. when im working on stuff, i almost never navigate via text or the file explorer, i use things like "goto usages or definition" and navigate via what is essentially data tracing. this only works well with statically typed languages ime, though.
the indexing step is basically building this db in the background, it's just kept out of view / hidden unless you're building ide plugins or whatever.
Value tracing is at runtime. JetBrains cannot trace how values flow through your code.
To do this, you need to instrument all your code, and track all the transformations that occur for each value. It's really difficult to do if the language is not designed for it and there are a lot of performance implications.
If your code is written in a functional paradigm it becomes much easier to trace...such as with rxjs.
At this point I'd settle for making the background DB a read/write interface to the codebase. Value tracing is nice, but there are many, many lower-hanging fruits that would improve the coding experience by orders of magnitude, and that can all be handled statically, or without global code instrumentation.
> All the issues with global variables can be solved with better tracing tooling
I would argue this problem is solved in most current languages with strict types.
Stop making all the things strings or abstract base classes.
Easy example I've worked on recently: an IPv4 address is an IPv4Address in code. I don't care if it is just represented as a uint32 or string in memory; in your code it should be an IPv4 address, and if a function expects an IPv4 address and you pass it a string, that is a compilation error.
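A minimal sketch of what that looks like (Java-flavoured, hypothetical names, validation kept deliberately small):

    // Still just 32 bits in memory, but the type system now refuses a bare
    // int or String wherever an address is expected.
    record IPv4Address(int bits) {
        static IPv4Address parse(String dotted) {
            String[] parts = dotted.split("\\.");
            if (parts.length != 4) throw new IllegalArgumentException("not an IPv4 address: " + dotted);
            int bits = 0;
            for (String p : parts) {
                int octet = Integer.parseInt(p);
                if (octet < 0 || octet > 255) throw new IllegalArgumentException("bad octet: " + p);
                bits = (bits << 8) | octet;
            }
            return new IPv4Address(bits);
        }
    }

    class Firewall {
        // Passing a raw String here is now a compile-time error, not a latent bug.
        static boolean isAllowed(IPv4Address addr) {
            return addr.bits() != 0;  // placeholder rule
        }
    }

The runtime representation hasn't changed; only the set of programs the compiler will accept has.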
The "problem that needs solving" as you put it, is I believe fundamentally not solvable. Not at the human level, not at the computer level, not even at the we-made-a-Dyson-sphere post-human utopia level. Because the speed of light.
No matter how efficient we are overall, what we can access the fastest is fundamentally limited, because what we can keep closest is limited. If you want to access information in less than 3 nanoseconds, a copy of that information must be stored less than a meter away.
More prosaically, the reason our screens offer only a small window is because our eyes only offer a small window. The problem would not really be solved even with infinite-resolution VR goggles.
> Code is a database of functions. This approach is like trying to design a database in denormalized form.
> I have actually started to think that, against conventional wisdom, everything should be a global variable. All the issues with global variables can be solved with better tracing tooling.
This is an interesting premise, and actually I think we have quite a lot of examples both of successfully applying the idea up to a point and of where it starts to break down in practice.
Modern distributed applications mostly have a back end with some kind of database and front end UIs that depend on that back end via an API. Those databases are often global data stores, accessible from anywhere in the back end implementation. A lot of work is done to design and manage them, probably modelling some real world system that our application is concerned with, and there are varying degrees of abstraction/isolation used to preserve that design intent.
If the data model is simple then this works OK, particularly if you have a SQL database that can enforce some basic constraints and relationships to make illegal states unrepresentable.
What we usually see as the data models and the actions that update them become more complicated is the introduction of some business logic layer. The rest of the system isn’t allowed free access to update the state any more; it’s required to go through some defined interface that provides specific actions that guarantee the state remains valid.
That’s the writing side. On the reading side, aside from security/privacy issues, we generally don’t have the same concerns with allowing free access to the whole database from anywhere. However, often we need some form of derived data that isn’t directly stored in the database itself but instead can be constructed from other state that is. So again we end up with some kind of abstraction/isolation layer between the rest of our system and the database.
In each of these cases, there is probably data that we’re working with that is not the state that ultimately persists within the database. So the question immediately arises, if we only have global data in our programs, where does all of this transient, intermediary data go? If we put it into our database as well then all the usual problems with concurrency and integrity immediately appear, so we are back to needing something that is local to our immediate logic and can’t conflict with any other logic or indeed any other instance of the same logic that happens to be running in some other context at the same time.
We see analogous issues in the front end UI code for those distributed systems. If there is a relatively simple model then maybe the front end can effectively just fetch/cache the state from the back end API. As things get more complicated, maybe you end up with a front end data store analogous to the back end database that becomes the central, authoritative store of your front end state. And again maybe this provides some defined interface for accepting valid updates to the state and/or for accessing derived data. And again the questions arise about where the intermediate data generated by all of that logic should go if we have only our global store to hold state, and the answer is likely to be some form of more local data.
On top of the persistent state and anything acting upon or derived from it, we also have other kinds of information we work with in front end code. Many UIs will have state that is used purely to control the user’s view into the inner world: the sorting criteria and current page of a table, the current position and zoom level over a map, the last item we’re currently showing in an infinitely scrolling list, look and feel settings like whether we’re using a dark mode theme. Some of this data might apply across the whole UI while other aspects might only apply to, say, a specific table, with each table needing its own instance of that “UI state” data. So again, if everything were global, that would mean we’d need to include every possible piece of UI state for the whole application in the global store.
This comment is already far too long so I’ll just quickly note that there are other recurring themes. One is how to synchronise “global” stores in a distributed system where you might have multiple front ends running with their own copies of the state, or perhaps multiple microservices on a back end that have duplication in their databases because everything is supposed to be independent and denormalised. A related issue is how to represent temporary clones of significant parts of the data during user interactions, like building up a transaction with several changes before atomically committing it or rolling it back (think dialog box on the UI side or an internally consistent batch of changes sent to the back end), or supporting an undo facility that needs to reconstruct a previous version of the persistent state one way or another.
I do believe there’s a lot more we have to learn about different types of state and transient data and how we can model those cleanly in our systems. There are certainly common patterns we touch on in a lot of different contexts. And I think both extremes of having too much data trapped too locally and having too much lifted to global storage have their own difficulties and probably there is some sort of structure in between that would be better than what we typically write today. But it’s not an easy problem or we’d all be solving it by now…
The comment was mainly related to in-memory variables within an application process...focusing on scoping/syntax...but the thinking was definitely inspired by the fact that most apps center around an external database without realizing its essentially a global variable.
In application code, when I talk of global vars, I mean that every function has access to all data...as opposed to access being abstracted and modularized into various services which are exposed via being passed through a chain of function args, or some kind of dependency injection system.
But this global variable could actually be an abstraction (a store) allowing data integrity checks on writes.
> there is probably data that we’re working with that is not the state that ultimately persists within the database
If you think about all your data in one big graph, this transient data still has a relation to the final persisted state. There is a data flow of transient values into the persisted data values. And separately, your intermediate data structures might also contain relations to your persisted data structures.
Most dev tools don't track these relationships, and you have this tangled ad-hoc mess where data is dumped from one structure to the next.
> “UI state” data...we’d need to include every possible piece of UI state for the whole application in the global store.
Yep, this is what you should do.
If I am sorting something on a webapp and I refresh the browser, I probably want to see the same things as they were sorted. This might vary between use cases, but adding the functionality should be easy to do if necessary. So it is good practice to allow all local ui state to be persisted by default.
The ui state is being persisted anyway, inside a component or inside the HTML document. Somewhere in the heap this data is stored. And if we think about our one big graph again, this data is related to other things - it's just that we lose these relations.
> A related issue is how to represent temporary clones of significant parts of the data
All rendered UI values should be in their own _ui_ models (view model is similar), separate from the source of truth models.
These ui models basically allow all rendered UI to be editable without immediately committing changes to the database. This allows for optimistic UI updates. They get notified of any incoming changes from the source of truth, and can decide what to do with them.
If you want to batch them up, you just create a Batch entity, and add a relation to these ui models. The main thing is to treat the ui models like any other models. Whether they are persisted or not should simply be flipping a flag in your code.
For UI, everything should be in one big graph. Code is data.
I find with modern programming, all of the popular programming languages, frameworks, libraries, databases, platforms, really get in the way of being able to do things simply.
Again, I like a lot of the thinking behind this. I’m personally a big fan of having clear data models and data flows, and I think there is a lot of potential in ideas like reactivity that clearly define sources of truth and relationships in single places so derived state gets updated automatically when its underlying source(s) of truth change. I also believe realising the full potential of some of these ideas will, as you’ve mentioned yourself, need better tools than we currently have. However, we can already get pretty far with these ideas, and when they work well, it certainly does make for a nice, clear, easy-to-follow design compared to having state and relationships scattered all over the code base.
Where I suspect we might disagree is the idea that everything should fit into this model. Sometimes the data models and relationships get more complicated and the one-way flows implied by the basic idea of reactivity are no longer sufficient. Sometimes the shape of the data flow graph changes on the fly, because we need multiple instances of certain parts of the state. Then we need a way to distinguish between those instances and often some way to connect the repeated parts of the graph back to the wider graph so that we can make changes to the duplicated parts but still enforce relationships with other parts of the data model.
I have never so far seen a reactivity-based approach to modelling state that handles more demanding scenarios like these gracefully, nor a convincing theory for doing so without introducing more than just a general global data store and simple reactivity. This is where I believe there is still a lot of potential for finding better ways to model the data and then better tools to implement those models. The trick, as again you said yourself, is to find the essential aspects and make it simple to describe them without the accidental complexity from the tools becoming excessive.
In this context when people say "global variable" it is effectively shorthand for "arbitrary accessible shared mutable state" and the associated risks of mutating something which other code expects to stay the same, and also often unintentionally coupling things without it being clear that they are coupled.
I'd argue that an external database is not necessarily such a thing - it can be and sometimes is, but in many (most?) applications there generally is a somewhat well-defined layer that manages that state mutation and enforces any constraints needed, isolating the "global variable" from the rest of the program, just as a singleton class with constrained setter functions could do for an OOP approach to shared state, which arguably fixes most issues with "naked" global variables; and standard data flow analysis in specifications can make explicit any coupling between components that happens "through the DB".
To my mind the conclusion is backwards. A file with a high compressed size might be doing something useful; a file with a low compressed size but a high uncompressed size is a file that's full of repetitive junk, and those are the files that should be a target for refactoring.
Exactly. As ways to differentiate between "essential complexity" and "accidental complexity" go, the idea of looking at what compresses well sounds quite good - but it's the accidental complexity that will compress the best, and the essential the worst. And the latter is not the problem; the former is.
That is basically the entire concept of complexity theory.
The reals are compressed into the computable reals. The real numbers are uncomputable 'almost everywhere'.
Semi-decidability is just recursive enumerability; that gets you to finite time with unlimited resources.
NP-hard is brute-forceable in exponential time, with most likely no approximate polynomial reductions.
P has exact polynomial time reductions...
Most code is made of IF and FOR loops, because that produces primitive recursive functions, which are most of the intuitive computable functions that always halt.
The problem comes with complex systems, where you need to balance coupling with cohesion along with free variables, WHILE and GOTO.
Note that the above compression was lossy.
If you consider that information loss as setting constraints through heuristics (educated guesses), those constraints may or may not work for a particular use case.
The problem with how we often try to set coding style is that we want simple universal rules.
Unless you have a system that fits in with those ideals all the time that is problematic.
Gödel demonstrated that isn't possible for complex systems. Either our rules will be inconsistent or incomplete.
This is why I think the real solution is to sell conventions as ideals that need to yield to less-preferred cohesion models when appropriate.
Unfortunately that requires a lot more thought and vigilance.
Functional cohesion with loose coupling is what we shoot for when it is appropriate, but not as a hard-and-fast rule.
The article is somewhat silly, but there's a kernel of good advice here -
To estimate the "complexity" of a codebase:
1. Remove all comments
2. Replace all spans of whitespace with a single space
3. Concatenate all source together into a single file
4. Compress the resulting text file using gzip -9 (or your favorite compression engine)
The size of the resulting file is a good proxy for overall complexity. It's not heavily affected by naming conventions, and a refactoring that reduces that number is probably good for overall complexity.
It's not a perfect metric as it doesn't include any notion of cyclomatic complexity, but it's a good start and useful to track over time.
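A rough sketch of that pipeline in Java, using only the JDK (the comment stripping is deliberately naive, and the built-in GZIP stream's default level stands in for gzip -9):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.file.*;
    import java.util.stream.*;
    import java.util.zip.GZIPOutputStream;

    class ComplexityProxy {
        public static void main(String[] args) throws IOException {
            // Steps 1-3: strip comments (crudely), collapse whitespace, concatenate.
            String all;
            try (Stream<Path> files = Files.walk(Path.of(args[0]))) {
                all = files.filter(p -> p.toString().endsWith(".java"))
                           .map(ComplexityProxy::readQuietly)
                           .map(s -> s.replaceAll("//.*|(?s)/\\*.*?\\*/", " "))
                           .map(s -> s.replaceAll("\\s+", " "))
                           .collect(Collectors.joining(" "));
            }
            // Step 4: compress and report the size as the complexity proxy.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(all.getBytes());
            }
            System.out.println("compressed size: " + buf.size() + " bytes");
        }

        static String readQuietly(Path p) {
            try { return Files.readString(p); }
            catch (IOException e) { return ""; }
        }
    }

Run it with the source root as the first argument and track the number over time; the absolute value matters less than the trend.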
I think the value of metrics like this is limited, since code base size only very roughly corresponds to implementation complexity.
Here are some examples where you would increase the compressed code size while not making the project more complex:
1. Adding unit tests to code that was previously untested. Unit tests add little complexity because they don't introduce new interfaces.
2. Splitting a God class up into multiple independent classes. Usually this improves readability thanks to separation of concerns, but it often increases raw code size because each new class adds some boilerplate.
> I think the value of metrics like this is limited, since code base size only very roughly corresponds to implementation complexity.
This sounds a lot like the "your model is wrong because nuance X" argument. I want to remind you that all models are wrong, but some of them are useful anyway. In particular, I have found the size of source code to be a highly useful predictor of complexity. It has helped me predict where bugs are, where changes are made, where developers point out areas of large technical debt, and many other variables associated with complexity.
The test of a model is not whether it accounts for all theoretical nuances, but rather whether it's empirically useful – and critically, has higher return-on-investment than alternative models. What model do you suggest for implementation complexity that you have verified to be better than code size? Genuinely interested!
(Additionally, I have also successfully used the compressed size of input data to predict the resource requirements of processing that data, without actually having to process it first. This is useful because the compressed size can be approximated on-line rather cheaply.)
> The GP’s point isn’t that the model is wrong “because nuance X”, it’s that the model directly contradicts good practice.
"Doesn't matter; had predictive value" is what comes to mind. "Good practise" isn't a defense against empirical evidence.
That said, you're right that I missed the compressed part. I haven't tried compressing the source code before analysis, but I do suspect it would improve accuracy rather than decrease it. That's not a rigorous argument though, and I'm willing to accept that uncompressed code size might be a better model than compressed code size.
> Usually this improves readability thanks to separation of concerns, but it often increases raw code size because each new class adds some boilerplate.
That is why compression is mentioned. Boilerplate is something that disappears under good enough compression. It's literally why we call it boilerplate and generally dislike it - because once we spot the pattern, we can mentally compress it away, and then are annoyed that we have to do that mental compression whenever reading or modifying that code. Feels like pointless work, which it is.
Sometimes I've scanned code bases of my own for all user definable variable names and just levenshtein distanced them. It's kind of useful, but the hurdle for me at least is that I need to run something in a terminal to get the results. Maybe I'd use it more if it was a plugin in my ide of choice.
Something else you could maybe do is to simplify the code and compare sequences of statements and expressions to each other.
i.e. the two statements "foo = bar; foo += 20" are identical to "zoo = war; zoo += 20".
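A crude sketch of that normalization (regex-level rather than AST-level, so it's easily fooled, but it shows the idea):

    import java.util.*;
    import java.util.regex.*;

    class CloneSketch {
        // Replace each distinct identifier with a positional placeholder, so that
        // "foo = bar; foo += 20" and "zoo = war; zoo += 20" normalize identically.
        static String normalize(String statements) {
            Map<String, String> names = new LinkedHashMap<>();
            Matcher m = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*").matcher(statements);
            StringBuilder out = new StringBuilder();
            while (m.find()) {
                names.putIfAbsent(m.group(), "v" + names.size());
                m.appendReplacement(out, names.get(m.group()));
            }
            m.appendTail(out);
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(normalize("foo = bar; foo += 20"));  // v0 = v1; v0 += 20
            System.out.println(normalize("zoo = war; zoo += 20"));  // v0 = v1; v0 += 20
        }
    }

Hash or compare the normalized forms and the two snippets collapse into the same bucket.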
This is what a minifier does, and those go even further to rename variables.
Another thing that should be pruned away entirely are data files, including all constant strings within the code, since humans should avoid those when focusing on algorithms
At that point you pretty much have a highly compressed version of what you'd find in CLRS or any other algorithmic text.
Many of the issues that come up with applying information theory to practical code are of the form "oh, but it won't be fast if we do it that way". The end of the article links Alan Kay discussing STEPS and how it solves many computing fundamentals needed for a desktop in minuscule amounts of code. One of the comments to that video, made five years ago, dismisses it as unrealistic ivory tower nonsense that can't run fast enough. (Notwithstanding that the presentation was given on a running system demonstrating the proof of concept.)
But there is a similar sentiment to Kay's from the bottom-up viewpoint. The Forth community, who have made livings on implementing this kind of succinct design in commercial settings, tend to point to hardware manufacturers themselves as the primary difficulty. Their business is to sell you more hardware than you need, and that leads them towards doing nothing to help with the software crisis, but rather, to encourage processing and I/O to be complex things to reason about, with complex protocols and mystery-meat drivers. If you have to use USB, Bluetooth, TCP/IP...you're stuck. Nobody wants to deal with those hot potatoes. You can't address it properly by running up the abstraction stack and doing "everything in the browser". That's playing nicely with the standards instead of attacking them. When software companies play along and say "well, it's the standard so we have to use it," their problem gets deeper.
Some room could be conceded to say that some of that complexity is essential, but one of the ways in which we describe progress in science and technology is to find solutions that are lighter and simpler to understand, e.g. instead of astronomical tables describing "Earth at the center of the universe" epicycles, smaller equations describing orbits around the Sun.
Bit of a roundabout way to say "DRY". Information isn't a universal context-free quantity; it depends on the models involved. In this case the target is removing repeated words/code symbols and being concise.
I've also thought of the idea of using some LZ-based compression on source code files to determine which ones have the most redundancy (the ones that have the best ratios) and could be simplified by refactoring, which is not too different from the entropy-based approach described here. It's worth noting that this also identifies languages that trend towards boilerplate and egregious verbosity --- for example, I've noticed that the average C# or Java codebase will compress much better than C, while (much) denser stuff like APL-family languages don't compress as much.
I can half-agree with this, but I would measure it differently.
It's not the words themselves that are surprising, but the range of language features used.
I have a friend who produces particularly readable (TypeScript) code - at least in my opinion. For a long time I couldn't figure out what was so special about it, then it dawned on me that he simply doesn't use the more fancy and recent features of the language - by recent I mean stuff that came out in the last four years or so.
I wouldn't be so radical in my approach, but I believe feature usage should obey a power law, with more advanced parts occurring less frequently. I'm sure there's a balance to be struck between terseness and using as minimal a subset of the language as possible.
Also it doesn't hurt to have whitespace here and there. We use paragraphs in written speech for a reason.
> This is similar to the identification of measurable properties of matter (like the volume, mass, and pressure of a gas) and the relationships between them (analogous to the gas equation). Thus his metrics are actually not just complexity metrics.
I thought I was familiar with Halstead, but this perspective was new to me. Interesting.
The combination of the Elm reference and "coarsely structure codebases around CPU timelines and dataflow" statement reminded me of this great talk by Evan Czaplicki (Elm creator): https://www.youtube.com/watch?v=XpDsk374LDE
I feel the approach could have some merit, but the article unfortunately stops short of demonstrating it. Also, the problem of analytically finding similar code sub-trees seems highly non-trivial. As a baseline you want to parse the code, but really you'd also want to somehow "normalize" identifiers, as usually it is the common structure rather than the specific identifiers that matters.
Cyclomatic complexity is much better for code analysis than pure entropy analyses.
Using linters and automated code formatting deals with entropy at the level of words/characters rather well. So it is not a useful metric for the current state of the art in development.