Friday, May 27, 2022

English is Purple and One Dimensional

and why "Source Code" is a terrible place to put Software

Everyone agrees that a Picture is Worth a Thousand Words, but have you really stopped to think about why? 

An obvious reason is that pictures are 2 dimensional, and can make use of all the colors of the rainbow, while if I were to assign English a color and a shape it would have to be strictly Purple and quite Linear in nature.  Allow me to explain... 

It is purple because while we think of English as a Primary Color, English is in fact composed of two completely separate and distinct things, just like purple is actually a mixture of blue and red. 

What I mean is that English can be disambiguated into two completely separate and distinct parts.  A Blue Abstract Model representing the underlying facts and data about the idea being expressed, along with a Red Algorithm, that can interpret the meaning embedded in the underlying facts and generate the purple English that we human beings want to read, when given the Blue Abstract Model as an input.  We'll dig into exactly how this works below.

Additionally, the shape of an English document is very linear and one dimensional in its nature.  In other words, when we read a book we read all the words in a big long line, starting with the first word, and then the next word, and the next, sequentially until "The End." 

We can read faster or slower.  We can skip over parts and move forward and backwards along the line - but in general, the meaning of language is conferred through its vocabulary, syntax and grammar, rather than through its physical location on the page or within a book.

These two facts about English, and about language in general are what make it such a terrible place to put software, technical systems, protocols, and complex ideas in general.

Instead, the English technical specification that human beings need and expect for any given project should actually be a report, written against an abstract model of the idea - which can be shared amongst all languages, including English, French, German, Spanish, C#, Python, SQL, etc.  All of these languages are purple, and one dimensional, and they are not a good format to assign as the "source" for complex ideas.

What's the Alternative?

At this point, you're probably asking... "Well, that's great EJ - but what's the alternative?", and that's a completely reasonable question.

While we might start with a Purple English description of an idea, literally the next step should be to extract a Blue, Data-Based, Abstract Model of the idea from the purple language (as described below) that includes enough fidelity to unambiguously represent the strict facts about the underlying concepts.

Step 3 is then to validate that we have successfully accomplished this by creating a Red Algorithm (basically a "report") that reassembles the Blue Model back into the original Purple English.  The very existence of such a report proves that the model has enough internal detail to accurately represent the original idea.

The key is that by pulling apart the Purple into these two separate parts, a different "report", i.e. a different Red Algorithm, can take the same underlying facts from the model - but construct a longer, possibly more detailed report about the idea... possibly in French, or German.  It would use the same underlying facts.  The same underlying elements and details.  The same information being communicated - but arranged into a different syntax, grammar - and possibly using a different vocabulary, i.e. possibly a completely different language.

Each of these different reports, however, hopefully represents exactly the same idea - the same underlying facts - which should be language independent.  Truth should be language independent.

What is "Truth"?

Answering this question may seem academic, but having a shared understanding of what makes something "true" is essential for meaningfully differentiating between a Linguistic Description of a system and a Blue Model or "Digital Twin" of the system as described below.

The best description I've heard for the definition of truth is that...

Something is True if it Comports with "Reality".

The problem with this definition is that of course everyone creates their own version of reality. 

So, while trying to communicate an idea with Bob, Alice will say some words, which hopefully have the same meaning to Bob as they do to her.  And then, based on Bob's understanding of Alice's purple words, he will say words back to her to convey what he thinks she meant. 

And if Bob seems to be thinking of the same thing as Alice, then she may agree.  And if not, she might disagree and say more purple words back to Bob, in order to attempt to update his understanding to more accurately match her understanding of the idea that they're trying to communicate about. 

But this is all clearly an exercise in futility, even when both parties are speaking the same language, because "reality" is completely Subjective.  And all of this becomes dramatically more difficult when the parties are speaking different languages, like English and Python or C#, for example.

By contrast, the blue model described above can actually serve as a "digital twin" (more below) of the idea - and then all we each have to do is agree that it accurately represents our understanding of the idea in question. 

This opens the door to simply defining this "digital twin" as being "Reality" - at which point the truthiness of literally any linguistic statement can be Objectively Tested by simply checking if it "comports with reality" - where reality is defined as our Digital Twin. 

Creating a "Digital Twin" for the underlying idea

Language's Purple and One Dimensional nature makes it an undesirable candidate to be the "Source" encoding of complex ideas.  It is so inefficient at communicating complexity because all it can do is dance around an idea, relying on the parties consuming the idea to share the same understanding of every word and inference of the language being used.

The Blue Model described above, by contrast, is not a linguistic description of the idea.  It is not a language at all, in fact.  Instead, it is a digital instantiation of the idea.  A digital example of the idea that really serves as the platonic ideal of the idea that we are ultimately trying to capture with Language, and as a result, provides us with the opportunity to agree on a shared "reality". 

This "digital twin" can be created in virtually any no-code tool, from databases, to spreadsheets, to no-code services like Airtable, Tray.io, Bubble.io or others.  The only requirement is that it is not "code" - i.e., that it contains just the decisions about how a system should behave, and is generally exportable to JSON, XML, CSV or a similar data-based, non-linguistic format.

This multi-dimensional data structure literally forms a physical picture in space.  Not a description, but a digital instantiation of the idea being discussed.

With this digital twin in hand, all the project stakeholders can agree that it is an accurate representation of the idea.  No words are needed.  They can simply look at the Digital Twin - and if it looks, and acts, and behaves as expected - everyone can give a simple thumbs up or a thumbs down. 

You could literally have 10 people approve the Digital Twin - and they could each literally speak a different language - and never actually communicate with each other directly in any way.  Instead, they all simply look at the model and give a thumbs up or a thumbs down.

Once everyone involved agrees, at an abstract level, that the digital twin accurately represents the system, protocol, or software that we are trying to actually build - everything else gets dramatically easier.

Mail Merge

Think of a simple mail merge, where we want to send the following email to Mary, a Graphic Designer, along with Bob and Juanita, who are applying to work in Sales.  

Dear Mary,
Thank you for your recent application to work with us as a Graphic Designer.  We will review your application and be in touch shortly.  Sincerely, HR Manager Ellen.

So while we want the same basic content within each of the emails, we actually need 3 different versions of the email, each one including candidate-specific details like their name and job.  And this email is just one of many different things that we need to do with this list of candidates.

We could do this by creating an email to Mary, and then copying and pasting it twice more, replacing Mary with Bob and Graphic Designer with Salesperson - and this is largely how software gets written, even in 2022.

Instead, however, we could also take the lessons learned in the 1980's, and split those three purple emails into their two constituent parts. 

A list of candidates, along with the job they are applying for (a blue model):

Applicant  Phone     Job               Address...
Mary       555-1234  Graphic Designer  123 Main st
Bob        555-2345  Salesperson       321 Park ave
Juanita    555-3456  Salesperson       515 Linden ct

And a separate copy of the email, as a template (a red algorithm):

Dear {Applicant},
Thank you for your recent application to work with us as a {Job}.  We will review your application and be in touch shortly.  Sincerely, HR Manager Ellen.

Now, any time we need new emails, we can simply mix the blue, current list of applicants together with the current red template and get the merged purple emails that we actually need to send to each applicant.

If we also wanted to send them a letter in the mail, however, we could use the same list of candidates, and merge it with a different red Mailing Label template like this.
{Applicant}
{Address},
{City}, {State} {ZipCode}

This is only possible because the "blue" list of candidates is kept separate, in a machine-readable format, from all of the places that this information needs to appear within various purple documents.
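The whole mail-merge separation can be sketched in a few lines of Python. The names, phone numbers and template text are just the example values from the tables above, and `str.format` stands in for the merge engine:

```python
# The blue model: pure data, exportable as JSON/CSV, no language in it.
candidates = [
    {"Applicant": "Mary", "Phone": "555-1234", "Job": "Graphic Designer", "Address": "123 Main st"},
    {"Applicant": "Bob", "Phone": "555-2345", "Job": "Salesperson", "Address": "321 Park ave"},
    {"Applicant": "Juanita", "Phone": "555-3456", "Job": "Salesperson", "Address": "515 Linden ct"},
]

# A red algorithm: a template that interprets the facts into purple English.
EMAIL = ("Dear {Applicant},\n"
         "Thank you for your recent application to work with us as a {Job}.  "
         "We will review your application and be in touch shortly.  "
         "Sincerely, HR Manager Ellen.")

# Merging blue and red yields the purple emails.
emails = [EMAIL.format(**c) for c in candidates]
print(emails[0].splitlines()[0])  # → Dear Mary,

# A different red algorithm over the same blue facts: a mailing label.
LABEL = "{Applicant}\n{Address}"
labels = [LABEL.format(**c) for c in candidates]
```

Same blue data, two different red templates, two different sets of purple output - and neither template had to change when the candidate list did.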
 
This exact same idea can be applied to complex technical ideas - swapping out the simple candidate spreadsheet with a more sophisticated Specification Database, which can unambiguously encode the moving parts of even complex, enterprise-level software systems, and swapping out the simple mail merge with more complex Low-Code tools such as those found at https://explore.ssot.me.

The Blue Model

Let's take a closer look at the Blue Model that we keep talking about. These days I almost always start with a no-code modeling tool of some kind that lets me very quickly sketch out the rules for a given system.  Once those rules have been captured though, a simple export from that system generates a json file which, almost by definition, has to include all of the key decisions that were made about the project.  

In reality, it's just a database.  So regardless of where it starts - it always just literally ends up as a database, in the most abstract way that this word can be interpreted.  In other words, it could literally be a SQL Server or Postgres Relational Database Management System, or it could just be a spreadsheet, or a JSON or XML file, or even just a simple csv file.

The format of the data is dramatically less important than that it is data, rather than language.  In other words, it is multi-dimensional, structured data - where the 3 dimensional location of the data in space literally confers meaningful information about the underlying idea. 

The content of the database, however, is literally constructed using exactly the same words as the Purple English that the Specification Database is intended to capture, with the words re-arranged into a physical, multi-dimensional structure. 

Even just the size and shape of that database starts to give us details and insights into the underlying idea being represented.  Every word in the Purple English will either end up as the name of a Table, or Column in a Table, or as a value in a row somewhere in one of those tables.  And deep meaning can be embedded in this structure, and can then be inferred based on where each word ends up in the specification database.

More specifically, the nouns in the English version of the information, for example, will tend to become tables within the database.  So if it's a very small idea, that only deals with one or two "things" - then the database will likely only have 1 or 2 tables in it.  If it is a complex idea, dealing with dozens of subordinate concepts and related ideas, then the specification database will likely have many tables in it - making it very deep (along the Z axis in cartesian space).

The specific "instances" of those things, within the content, will tend to become rows within the tables.  So if there are one or two things described, there will only be one or two rows in a given table.  However, if we are discussing an idea that has many moving parts, there will potentially be many, many rows, and the database will get very tall (along the Y axis in cartesian space).

And if we only know one or two things about each "thing", then there will tend to only be one or two columns within each row, whereas if we know lots and lots of details about one or more things, then there will tend to be lots of columns, and the database might get very very wide (along the X axis in cartesian space).

This will define the bounds of a data cube, and within that 3 dimensional space, every word in the equivalent, purple, English document will have a specific spatial location.  This data-point cloud essentially defines a 3 dimensional "fingerprint" for the underlying idea.

In this way - we can simply glance at the size/shape of the specification database, and get a good sense of the "scope" of the idea.  As a complete aside, in software development, even just the height-width-depth of a given database tends to be a much more reasonable starting point for cost estimation than the number of pages of text that were written describing the same system.
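As a rough sketch of how this height-width-depth "fingerprint" could be measured - assuming, hypothetically, that the specification database has been exported to a simple dict-of-tables structure (table name mapped to a list of rows) - the shape of the data cube falls out directly:

```python
# A toy specification database: table name -> list of rows (dicts).
spec_db = {
    "Candidate": [
        {"Applicant": "Mary", "Phone": "555-1234", "Job": "Graphic Designer"},
        {"Applicant": "Bob", "Phone": "555-2345", "Job": "Salesperson"},
    ],
    "Job": [
        {"Title": "Graphic Designer"},
        {"Title": "Salesperson"},
    ],
}

def cube_shape(db):
    """Width (X) = widest row, height (Y) = tallest table, depth (Z) = table count."""
    depth = len(db)
    height = max((len(rows) for rows in db.values()), default=0)
    width = max((len(row) for rows in db.values() for row in rows), default=0)
    return (width, height, depth)

print(cube_shape(spec_db))  # → (3, 2, 2)
```

Even this crude (width, height, depth) triple says more about the scope of the idea than a page count ever could.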

By contrast, if we look at a 200 page technical specification, we will tend to get virtually no information about the underlying idea, beyond the fact that the line of words used to describe the idea is very, very long. 

The problem is that we don't know if it is a detailed technical specification, or a 200 page novel with just:  "All work and no play makes Jack a dull boy!  All work and no play makes Jack a dull boy! ..."  

The line of words may keep getting longer, but the structure of the underlying idea remains unchanged. The blue model is a representation of the underlying idea - and adding more words, much of the time, does nothing to change the actual idea being discussed. The issue is that we don't know which words are used along the line until we read them, one by one, by one, by one.

The Red Algorithm

We compared the red algorithm to a Mail Merge above, but to be effective it obviously needs to be dramatically more complex than most Mail Merge operations would allow for.

Instead, a more accurate representation of the red algorithm is that it is just a Report, written against the specification database, which can reassemble the underlying facts about the idea, into English, for example.

If we describe an idea, in English, and then perform this operation of pulling the purple apart into the red and the blue - most of the specific details about the idea itself end up in the blue model, and the meaning or understanding of those facts come from the red algorithm.

In other words, neither part is complete on its own. We can take the same underlying facts, and interpret them in completely different ways, or use the same interpretation to process different facts. Just like we can make 10 different meals using different preparations of the same 3 ingredients, we can also make 10 different meals following the same preparation steps with 10 different sets of ingredients.

The Blue Model should be just the un-arguable facts - specific details about the idea, which imply, by their very construction, the meaning that is intended by the individual data-points - and the Red Algorithm is essential for interpreting and inferring the meaning that is communicated in the final English document.

So while the Red Algorithm is responsible for interpreting the facts, i.e. interpreting the meaning, the Blue Model is also essential for constraining the Red Algorithm.

If we think about a Recipe again, the ingredients are the Blue model, the Preparation Instructions are the Red Algorithm, and the prepared meal is the Purple result.

Because recipes are separated out in this way, ingredients can be easily manipulated or swapped out. For example, the final meal size can be increased by simply multiplying all of the quantities on the ingredients by two. Alternatively, meat products can easily be swapped out for Vegan alternatives.

And still, the instructions can be written completely decoupled from these kinds of changes, with easily understood sentences like:
"Assemble the ingredients in a frying pan."

And, whether those ingredients are the original ingredients, or have been manipulated or swapped out - the instructions don't care. The instructions end up being quite decoupled from the very specific ingredient list. This is also true of most of the algorithms that are brought to bear on the Single Source of Truth.
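Here is a minimal sketch of that decoupling, with hypothetical ingredients: scaling and substitution touch only the blue ingredient data, while the red preparation step is written against the list as a whole and doesn't care what's in it.

```python
# Blue model: ingredients as data (name -> (quantity, unit)).
ingredients = {
    "ground beef": (1.0, "lb"),
    "onion": (1.0, "whole"),
    "olive oil": (2.0, "tbsp"),
}

def scale(ingredients, factor):
    """Doubling the meal is a pure data operation on the blue model."""
    return {name: (qty * factor, unit) for name, (qty, unit) in ingredients.items()}

def substitute(ingredients, old, new):
    """Swapping meat for a vegan alternative is also just a data edit."""
    return {(new if name == old else name): qu for name, qu in ingredients.items()}

# Red algorithm: written against the list, not the specific ingredients.
def prepare(ingredients):
    return "Assemble the ingredients in a frying pan: " + ", ".join(ingredients)

doubled = scale(ingredients, 2)
vegan = substitute(ingredients, "ground beef", "plant-based crumbles")
print(prepare(vegan))
```

Notice that `prepare` never had to change, no matter what we did to the ingredient list.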

However, if this recipe was just a paragraph of purple text, which described how to make the meal very precisely, with instructions and ingredients all mixed together, we'd end up with exactly the same meal at the end - but it would be much more difficult to work with.

Simple questions, like which ingredients include meat, would necessarily require reading literally the entire recipe, because a meat ingredient could easily be added in the last sentence of the recipe, and there's just no way to know until we read the whole long line of words.

We learned this lesson with cooking hundreds of years ago, and yet in 2022, we make little to no attempt in the process of creating software to separate out the ingredients from the instructions. We mix them together and call it "Source Code". I'm sorry, but it is simply a travesty!

Benefits

One of the most important benefits of separating out the Digital Twin from the purple English is that it serves as a Single Source of Truth for Reality.

In other words, with an SSoT, any English statement, or line of code, can be deemed to be Correct or Incorrect - by simply comparing it to the Digital Twin or Single Source of Truth.  Without an SSoT, the Spec will say one thing, the docs will say something else, the sales material will claim a 3rd thing, and yet the "Source Code" is the only thing which says what actually happens.

Reusability

The abstract blue model can be re-used in a report to create the English technical specification that human beings need in order to understand the idea, but the same blue model can also drive a report which creates a C# or Python language representation of the same underlying rules or ideas.  And in this case, the English description and the Python Code will tend to match each other, because neither of them is the "Source."  Instead, they are both derived from the same SSoT, and so each of their respective size and shape is directly mirrored from the same digital twin.

Right vs. Wrong

In a "traditional" development environment, the specification says what is meant to happen, the documentation (if it even exists) says what the system was designed or intended to do, but only the "Source Code" says what the system actually does.  And to read that code, you have to either understand the language, or compile it, run the system and see if you can test if it appears to do what it's meant to.

By contrast, if we have a digital twin which unambiguously encodes the rules of the system, anything which lines up with the digital twin can be said to be "right" and anything that does not comport with the reality defined by the digital twin can be said to be "wrong".  In fact, automated tests can often be generated to simply check if the production system behaves in the same way as the twin, without any actual "understanding" of the underlying system.
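As a hedged sketch of what such a generated check might look like - assuming, purely for illustration, that the twin exposes its expected outcomes as rows of data and that the production system is callable (both are hypothetical stand-ins, not a real API):

```python
# Hypothetical digital twin: expected behavior expressed as plain data rows.
twin_rules = [
    {"input": 0, "expected": "closed"},
    {"input": 1, "expected": "open"},
]

def production_system(value):
    # Stand-in for the real system under test.
    return "open" if value else "closed"

def comports_with_reality(system, rules):
    """True iff the system behaves exactly as the twin says it should, row by row."""
    return all(system(rule["input"]) == rule["expected"] for rule in rules)

print(comports_with_reality(production_system, twin_rules))  # → True
```

The checker never needs to "understand" the system; it only needs to compare behavior against the rows of the twin.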

Without any training in biology or science, a 5 year old can look at a lion at the zoo and say Yes - that looks just like the stuffed lion toy that I'm holding in my hands.

Software should be similarly trivial.

Universal Cross-Platform Consistency

By introducing a single source of truth, we end up with a dramatically more consistent result, even across platforms, because most of the moving parts comport with the same reality - because they are literally defined relative to the SSoT - just like the instructions of a recipe are written relative to the ingredients.

A Concrete Example

The most important part of this process, which is still usually overlooked in software development is the creation of the blue model, the digital twin, the Single Source of Truth.  As mentioned above, the reason this piece is so critical is that it serves as a picture in place of 1000 words. 

So, let's explore some words, and what a picture might be that represents those words.

Mary is wearing red shoes.

[Picture: Mary wearing a pair of red shoes]

At first, all we know is that there is someone named Mary, and that she is wearing red shoes.  Even at this point though, this first picture of Mary is enough that even a 3 year old who can't read could answer a question like: 
  - What color are Mary's shoes?  Red.


Mary also has another pair of shoes.

[Picture: Mary with a second pair of shoes]

Now we learn that she has another pair of shoes.  At this point, again with just this 2nd picture in hand, a 5 year old who can't read could answer a question like:
  - How many shoes does Mary have? 4.

Her other pair is green.

[Picture: Mary's other pair of shoes, now shown as green]

Then we learn that her other shoes are actually green.  With the 3rd picture, once again, even a 3 year old who can't read could answer the question:
  - What color are Mary's other shoes?  Green.

Every time a new piece of information is added, we just have to make sure that it is incorporated into the picture.  And then that same, single picture, if provided as a definition of reality, can serve as the basis for an English, French, German and Spanish description, because it's not a linguistic description of Mary that has to be interpreted.  We can just glance at it and learn everything we need to know in a single frame - because a Picture is worth a thousand words.

All we have to do is agree that the picture accurately reflects "reality", i.e. the underlying idea, and then everything else gets simpler.
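The picture of Mary can itself be sketched as a tiny blue model; the questions above then become simple lookups against the data, rather than interpretations of sentences (the field names here are just illustrative):

```python
# Blue model of Mary: facts as data, no sentences.
mary = {
    "name": "Mary",
    "shoes": [
        {"pair": "worn", "color": "red"},
        {"pair": "other", "color": "green"},
    ],
}

# "What color are Mary's shoes?"
worn_color = next(s["color"] for s in mary["shoes"] if s["pair"] == "worn")
print(worn_color)  # → red

# "How many shoes does Mary have?"  (two pairs, two shoes per pair)
shoe_count = 2 * len(mary["shoes"])
print(shoe_count)  # → 4
```

Each new fact about Mary is one more row or field in the model; every report, in any language, reads from the same place.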

A more poetic example

Name            Is a Rose     Language   Smells as Sweet as a Rose
Rose            Yes           English    Yes
Trandafir       Yes           Romanian   Yes
Garbage Truck   No            English    No
Rosa            Yes           Spanish    Yes
Dxk kuhlāb      Yes           Thai       Yes

With this simple data model as a starting point, we not only have enough information to converse about various aspects of the topic, but also to objectively decide the truthiness of the famous Shakespeare quote:
A rose by any other name smells as sweet

Deciding whether this statement is true or false without a shared blue model like the list above, however, is an entirely Subjective decision, starting with what exactly it even means.

Let's take a hypothetical scenario, for example, where someone claims that while the Molineux flower is not a "classic" red rose, it is technically in the rose family and actually smells like garbage.  In this hypothetical scenario, we'd have to update our list, because the facts on the ground would have changed.

Name            Is a Rose     Language      Smells as Sweet as a Rose
Rose            Yes           English       Yes
Trandafir       Yes           Romanian      Yes
Molineux        Yes           English       No
Garbage Truck   No            English       No
...

So, at least in this hypothetical example, the Molineux flower is technically a "rose", but does not appear to smell as sweet as a rose, and so, as written at least, Shakespeare's words appear to be wrong.  
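That objective decision can be sketched directly against the list. The rows below mirror the updated table above (abridged), and the check is simply the "comports with reality" test from earlier:

```python
# The blue model rows from the updated list above (abridged).
flowers = [
    {"name": "Rose", "is_rose": True, "sweet_as_rose": True},
    {"name": "Trandafir", "is_rose": True, "sweet_as_rose": True},
    {"name": "Molineux", "is_rose": True, "sweet_as_rose": False},
    {"name": "Garbage Truck", "is_rose": False, "sweet_as_rose": False},
]

def quote_is_true(rows):
    """'A rose by any other name smells as sweet': every rose in the model smells as sweet as a rose."""
    return all(r["sweet_as_rose"] for r in rows if r["is_rose"])

# Checked against this "reality", the quote as written comes out False.
print(quote_is_true(flowers))  # → False
```

No debate required: given the agreed model, the statement either comports with it or it doesn't.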

But this introduces a problem, because what Shakespeare clearly meant is that "A thing is a thing, regardless of what you call it!" and this is still a True idea, despite the existence of an apparently contradictory fact. 

But both can't be true at the same time, right?   

So, what can be done?

As is so frequently the case, we seem to have two competing, but equally "true" ideas, and usually we're now just left to hash it out between these two competing "truths" using more and more purple words.  The lines just keep getting longer, and the advocates for each position in this discussion would be 100% justified, each in their opposing positions, each based on the same apparent facts on the ground.

This is a problem without a Blue Abstraction that both parties agree to, because the nuances of everything we're talking about are difficult to describe with words.  Without a shared model, each person is left to interpret the words, each using their own understanding - and while they might be using the same words (smell, rose, sweet, etc) - each is bringing a completely different meaning with their use of these words.  

With a blue abstraction, however, something is possible that is dramatically more difficult to negotiate with purple, one dimensional "language".  Specifically in this case, let's update the Blue Model to more accurately reflect what Shakespeare actually meant, even while simultaneously acknowledging the existence of the Molineux flower, which in our hypothetical example smells like garbage.

Name      Is a Rose    Is 'Classic' Red Rose    Smells as Sweet as a Classic Red Rose
Rose      Yes          Yes                      Yes
Molineux  Yes          No                       No
Rosa      Yes          Yes                      Yes
...       Yes          Yes                      Yes

Our list is once again consistent with a now slightly more precise notion that "A 'classic' red rose, by any other name smells as sweet".  This model can highlight nuances such as the fact that while the Molineux is technically a rose, and therefore appears to contradict the original quote, it is not a "Classic Red Rose" - which is what Shakespeare was obviously referring to.  And as soon as we add that attribute to our model, it can now accurately, and unambiguously represent a more complex reality than the original data.

These are the kinds of things which are often hammered out in legal contracts and negotiations, where the primary tools available are a large and always growing collections of purple, spoken and written words.  And at the end of the day, this results in pages and pages of words, each with their own interpretations, grammar and syntax, arranged in big long lines of words which must all be consumed serially, before meaning can be inferred.  Hopefully the scale of the problem begins to become clear.

Just the existence of the list above (by contrast) - all on its own - literally changes our ability to evaluate, understand, and communicate about an idea from a mostly Subjective one, to an almost entirely Objective one, specifically based on the objective facts agreed to in the model - i.e. the Single Source of Truth. 

With the original list in hand, we can identify a deep problem in the precision of the meaning that we are attempting to capture.  Then, we can update our model to unambiguously incorporate a deeper, more nuanced definition of the meaning that we were trying to capture, even while taking into account entirely new pieces of information which initially seemed to conflict with our understanding.

So we were able to update the model, to unambiguously encode a new notion - with mathematical precision - and without the need for purple, one dimensional linguistic uncertainty and ambiguity.

Conclusion

Ultimately, the process for how to pull the purple apart into its constituent elements, and then, more importantly, how to put the two parts back together to create not just the original documents but also matching, derivative works, will need to be explored further in another post.  But hopefully this article shines a light on the underlying problem, and helps to expose the massive flaw with the general color and shape of language.
