Normalizing Data

Normalizing data is when you make it less redundant and easier and safer to maintain by separating into more than one table.

Here’s a simple example. We’ll start with some “un-normalized” data, which I call dirty data sometimes.

Say you are a teacher and need to make a simple spreadsheet to keep track of all your class attendance for different courses that all meet only once a week, some on different days, some on the same day, only only once per week. Initially you just want something easy to use and mark quickly because you don’t have a lot of time taking attendance every week.

💎 The most important design decision when creating any software is how well does it meet the needs of the person using it, not how easy was it to make or maintain. Developers constantly forget this.

Sample Combined Class Attendance
Class Student W1 W2 W3 W4
Math Jill L
Coding Jill
Coding John
Coding Jane X
Sculpt John
Paint Jill x L
Rowing Jane

That data might not be “normal” but it is practical. With a little color background they can quickly see who attended all the different classes they teach and perhaps even draw some conclusions by analyzing the data just by looking at it. Having this of the data is useful. In fact, in database design that is exactly what this is called, a view.

One day you realize you really need to email students so you add a column.

Notice Anything Wrong?
Class Student Email W1 W2 W3 W4
Math Jill jill@pm.me L
Coding Jill jill@pm.me
Coding John john@pm.me
Coding Jane jane@pm.me X
Sculpt John jon@pm.me
Paint Jill jill@pm.me x L
Rowing Jane jane@pm.me

This table has a problem. John’s second email was entered incorrectly. That’s not so critical in our little example, but imagine if this contain data for 20 students per class. That would be 100 records. It would be much easier to miss.

Fixing this problem is easy in this case. We can see that John is spelled incorrectly and guess that john@pm.me just needs to be corrected. But what about when we add a birthday?

Getting Dangerous
Class Student Email Bday W1 W2 W3 W4
Math Jill jill@pm.me 5/10 L
Coding Jill jill@pm.me 5/10
Coding John john@pm.me 3/29
Coding Jane jane@pm.me 12/3 X
Sculpt John john@pm.me 3/29
Paint Jill jill@pm.me 5/10 x L
Rowing Jane jane@pm.me 1/23

At this point stuff starts to feel wrong but we keep using it because we either don’t know how to separate the data or don’t have time for the work involved to copy everything over to another table.

But the really dangerous part is we now have bad data in the birthday column for Jane and no way to correct it without asking Jane personally. Which is it, 12/3 or 1/23?

Obviously someone put the / in the wrong place, but this is an easy mistake to make when typing in and maintaining that data by hand. That would have been caught if we had validated the data somehow before accepting it.

⚠️ The danger such mistakes pose to software, users, and society is very significant. No one will die with this mistake, but imagine making such errors in databases with millions of records, say at your bank, or the government, or your medical records. The danger is compounded by the extremely high expense of correcting these errors, which can only be done “by hand” negating much of the benefit of having the database in the first place.

Now we decide to shorten the class name from Coding to Code. It’s a small change so you don’t think anything of it. You start up find-and-replace and end up with this.

A Simple Change
Class Student Email Bday W1 W2 W3 W4
Math Jill jill@pm.me 5/10 L
Coding Jill jill@pm.me 5/10
Code John john@pm.me 3/29
Code Jane jane@pm.me 12/3 X
Sculpt John john@pm.me 3/29
Paint Jill jill@pm.me 5/10 x L
Rowing Jane jane@pm.me 1/23

Then we get a transfer student in the middle who is also named John. We have no choice but to add an initial to his name.

Class Student Email Bday W1 W2 W3 W4
Math Jill jill@pm.me 5/10 L
Math JohnP johnp@pm.me 4/3
Coding Jill jill@pm.me 5/10
Code John john@pm.me 3/29
Code Jane jane@pm.me 12/3 X
Sculpt John john@pm.me 3/29
Paint Jill jill@pm.me 5/10 x L
Rowing Jane jane@pm.me 1/23

Did you notice we missed one of the class renames? I know it would not be that hard to miss in that small a list, but imagine it being larger and forgetting to set your find to “wrap around”.

There are actual a lot of small risks that add up to real danger for our simple little data set — especially if someday we would like to make an app to help us out with all of this, like a QR code scanner or something:

You get the point. If you haven’t yet, imagine having a substitute teacher take over for two weeks and needing them to understand all of this. You haven’t even written any of it down for yourself and forget. (You are always typing Coding instead of Code since you made that change.)

After getting a full dose of reality from having to go to relatively great lengths to correct these data entry errors and make these risky updates in several places we are now ready to consider ways to avoid that wasted time in the future, but how?

It’s time to normalize.

When we step back a bit we see that we actually have three tables:

  1. students,
  2. classes, and
  3. attendance.

In order to get these, however, we are going to have to setup some rules for ourselves and get clarification on what some of that data can actually be. Such rules are often called constraints, which is a word that has a bad reputation but is exactly what we need right now. Here are some examples:

Those are actually just the beginning.

You will find that the more time you invest in defining constraints up front, the more time you will save in the long run, you just won’t realize it because you will take it for granted.

Most people learn this lesson the hard way after losing valuable data or time to correct massive data problems. Don’t be them, learn from history and start with good data organization. Your data will grow and evolve naturally. That’s fine. But it will be easier if you get started right.

💎 Organizing your data isn’t difficult, but it does take time, like cleaning your room or lab or main home screen. It’s always the right thing to do, and for all the right reasons. A clean lab means you can find stuff faster and spend less time looking around. The same goes for data.

[As I look up from writing that last paragraph I see all the wires, circuit boards, books, and random notes cluttering my desk. (Wait, is that my guitar capo?) I’m certainly not the strictest adherent to this principle even though I should be. Knowing what’s best — and doing it — are two different things. My production computer labs were immaculate, however. My boss used to joke about my obsession with zip-ties and cable management all the time. If he could actually read my code he would have seen the same.]

Alright, enough ranting about keeping your desk data clean let’s actually clean up this contrived example.

Let’s separate what we have into the three tables and clean up the data a bit.

We’ll use a better birthday format and spend the time to ask students for the year. It just so happens these dates also sort nicely with no fancy sort algorithm help.

Let’s also create an arbitrary unique student ID.

💎 Hexidecimal numbers provide an ID system that is easy to remember and allows up to 16,777,216 unique IDs, which also happen to be valid CSS colors. It also avoids the confusion of wondering if that 1 is an l or 0 is O.

Students
ID Student Email Bday
6d0b8b Jill jill@pm.me 2000-05-10
1fa376 JohnP johnp@pm.me 2000-04-03
223225 John john@pm.me 2000-03-29
954837 Jane jane@pm.me 1999-12-23

Doesn’t Class seem like it is begging for more data? Why not.

Classes
ID Class Day Time Duration
4fb73e Math 1 15:10 50
b29725 Code 1 9:00 90
66409a Sculpt 2 12:00 120
3a433a Paint 2 14:00 120
2aae7a Rowing 2 16:30 120

Obviously we just incurred a whole bunch more constraints:

💎 Stop for a moment and imagine what our original “dirty” table would look like if we added all that Class data. Imagine changing times and such. It would be an absolute nightmare.

Which leaves us finally with our Attendance table.

Attendance
Class Student W1 W2 W3 W4
4fb73e 6d0b8b L
4fb73e 1fa376
b29725 6d0b8b
b29725 223225
b29725 954837 X
66409a 223225
3a433a 6d0b8b x L
2aae7a 954837

💢 Wait a second. I have no idea what any of that means just by looking at it? There is no way I could quickly mark attendance in such a table. I’m don’t have the brain nor the desire to memorize hundreds of unique identifiers. I think this just became more work to maintain, not less!

This is the first pain we feel from normalization. The first casualty of organized data is actually — ironically — clarity and simplicity. Really the only practical way to mark attendance is to create an app that does it for us. That’s right …

🛑 We’re now required to make an app just to mark attendance.

⚠️ Up to now you might have agreed with all the reasons for normalization. This is where most people abandon normalization in favor of other methods of data organization. Normalizing data has taken a serious kick in the teeth over the last decade. The argument is that this added complexity — which we will get even worse — is not sustainable and does not justify the extra time and energy to maintain. This thinking is at the core of the NoSQL movement.

Notice that Classes and Students are different than Attendance. This is because a Class and a Student are entities and Attendance is a relation. Entities and relations are very different.

Entities are self-contained. They are the things in our domain. They have characteristics, properties, and attributes.

📖 These are sometimes called fields but I prefer attributes which has the same meaning in other areas of tech. For example, having brown eyes is an attribute not a property. The term property works better in some cases, however. For example, one property of salt is that is dissolves in water. To me property is best for stuff that could be represented with true or false while attributes are more descriptive. So maybe saying “properties and attributes” is the best way to keep everyone happy.

Relations are how these things relate to one another. A student attends a class.

🤪 Now for something totally mind-melting and the reason people hate working with normalized data without a really good database of some kind — and even then often prefer to take the risks of not normalizing at all.

Consider the following rational questions for a moment:

If we dig deeper we see that we actually discover a fourth table lurking (and more to come). We didn’t see it because we really have not strictly defined what a Class is.

Each mark of attendance is a property not of one specific time in a given week that a student attended, was late, or didn’t attend but we don’t have a name for it. How about Session.

💎 That little thought process we had right there is called domain modeling and is critical to good software design and development.

You have to know what things are called before you can code them into software and systems. The Business Analyst and Systems Analyst and Systems Architecture professions spend a of time doing good domain modeling with the people using the software or system before they ever build it and while they are building it — if they are actually good at their jobs, that is. This is where Agile Development comes in.

So, should we create a new table for Sessions?

This is where full normalization starts to really get complicated and begins being more of a burden than a help. After all, we are just trying to make keeping track of everything easier, not harder.

Here’s the kicker though, we already have no idea how to maintain the new Attendance table with it’s impossible to read unique IDs. So would it be that much worse to maintain to add Sessions?

📖 If we did go to the trouble of making Sessions into its own table we would be creating data in “fifth normal form,” which is some term invented in the 50s for all the different levels of “normalization.” Most data because very impractical at that level and is only done in huge databases with billions of records that would seriously impact performance. For example, consider if I wanted to search for every attendance record specifically for one session in a given week for available class at that time for every student in a university. We could search for all sessions taught for that specific day and time and get what was taught and what students where in attendance. That’s a level of detail we just don’t need.

Just because we don’t make a table for Sessions doesn’t mean we shouldn’t remember that name, though. It will come in handy later when writing code about attendance and we need to name things in our code.

While we are talking about names though, what is a Class really?

I love this example, because it shows how easily miscommunication can happen. Such things can destroy a software project.

Permit me to role play a bit here to make the point. Imagine bumping into your friend on campus.

"Hey what class you headed to?"

"Math"

"Oh cool, which one?"

"The 4pm one."

"No, I meant, like, which Math?"

"Oh, Calculus."

"No way, I'm in Calculus also. Which one?"

"Which Calculus?"

"Yeah, 101, 150?"

"Oh, just 101. My AP test got me credit, but clearly didn't prepare me for *this* so I decided to take the first level again"

"That makes sense."

"So which is your class?"

"You mean that I'm going to now? Or which Calculus course? Or which time is my Calculus class?"

"Um, 101, 150, whatever?"

"Oh, I'm in 300. I did a special STEM boarding school that class really prepared me."

"So which class?"

"I graduated class of 2019. Amazing people. So glad I got to be with that group. We still follow each other a lot."

"No, which, like math class prepared you?"

"Well they had a statistics class that was amazing. Plus I met my girlfriend in it, Doris."

"No way! Is that the Doris Katie's friend? I think I know her. Love that name. Katie's a good friend. Which class was that?"

"Dude, statistics, I already told you."

"No man, which *class*?"

"You mean, like, which *time* we met?"

"Yeah."

"I think it was around 11:00 am. I remember because it was right before lunch. We'd always go eat together after."

"Nice. Maybe we could double sometime after class this week?"

"Um, sure! Which class?"

That is the problem with language without specificity. Our friends in that example all came through it without any hard feelings or lost money or blood, but when you don’t have such specific definitions in your domain your software and systems will practically implode very early on — or at the very least — later when can’t remember what something is.

💎 Jim Coplien explains how totally and completely object-oriented programming has failed because it lost sight of its primary goal, to map the mental models of those using the software into the terms and organization of the software itself.

TODO