Structured Data Objects vs. Tables and Sheets
How we organize data is perhaps the most important topic for any IT professional. It certainly isn’t just for “data scientists” and database administrators. It is a critical topic that touches all aspects of IT and the organization involved. There is simply no better way to reduce technical debt and increase efficiency and productivity than having a solid data discovery, management, and migration strategy.
There are two major approaches to data organization and management:
- Structured Data Objects
- Relational Tables and Sheets with Records and Fields
These approaches have core differences and provide different solutions to a wide range of problems.
However, the reality of modern IT systems and software development favors defaulting to structured data objects rather than tables and sheets.
In other words, get off that spreadsheet as soon as possible and use JSON files instead. The core justification for this actually comes from the traditional argument for normalization itself.
When you remove the constraint of rows and columns, you free your mind to discover the actual structure and relationships of all the things in your domain. Structure, type safety, and validation can be added easily and reliably with the simplest of code without requiring the heavy installation of a full database system.
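To make "the simplest of code" concrete, here is a minimal sketch in Go using only the standard library. The `System` type, its fields, and the `ParseSystem` helper are hypothetical names invented for illustration; the point is only that a struct with JSON tags plus a short `Validate` method can carry the constraints a table schema would otherwise hold.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// System is a hypothetical record for one machine. The field names are
// illustrative assumptions, not taken from any real schema.
type System struct {
	Hostname string   `json:"hostname"`
	CPUs     int      `json:"cpus"`
	Tags     []string `json:"tags,omitempty"`
}

// Validate enforces the value constraints that the types alone cannot.
func (s System) Validate() error {
	if s.Hostname == "" {
		return errors.New("hostname is required")
	}
	if s.CPUs < 1 {
		return fmt.Errorf("cpus must be at least 1, got %d", s.CPUs)
	}
	return nil
}

// ParseSystem decodes JSON into a System, rejecting unknown fields and
// type mismatches, and then applies Validate.
func ParseSystem(data string) (System, error) {
	var s System
	dec := json.NewDecoder(strings.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&s); err != nil {
		return System{}, err
	}
	if err := s.Validate(); err != nil {
		return System{}, err
	}
	return s, nil
}

func main() {
	// A well-formed record parses cleanly.
	s, err := ParseSystem(`{"hostname":"web01","cpus":4,"tags":["prod"]}`)
	fmt.Println(s, err)

	// A string where an integer belongs is caught at decode time.
	_, err = ParseSystem(`{"hostname":"web02","cpus":"four"}`)
	fmt.Println(err)
}
```

The nested `tags` list is the part a flat row cannot express without a second table; here it is just another field.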
Besides, Web APIs have replaced the traditional SQL views of decades past. These APIs provide the security of enforcing data constraints and validation while joining data in ways that the application need not worry about. This destroys most justification for traditional relational database management systems. In fact, the only remaining reason I can think of is performance, and that gap is narrowing.
In fact, much of database usage is being replaced on a macro scale by machine learning. There will always be a need for querying and crunching large data sets, but machine learning is separating out the crunching (roughly equivalent to the indexing of old) and replacing it with easy-to-run models in TensorFlow that can be applied to any incoming or existing data. This has fundamentally changed how everyone approaches data in general, which massively disrupts the foundational needs that all databases have filled for their users in the past.
For example, the auditing system I helped create at IBM was based on the traditional collect, crunch, query, and report approach. But another approach is now possible: create sample data for healthy systems and then generate an event for each system that is not in compliance as its data is received. Such an architecture requires a more flexible structured data approach from the beginning, because the traditional approach requires the data to go into a database first, which drags all the latency and complexity with it.
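The event-on-arrival idea can be sketched without any database at all. The Go example below is an assumption-heavy illustration, not the IBM system: `Profile`, `Event`, `Check`, and `CheckJSON` are hypothetical names, and a plain equality check against a hand-written baseline stands in for whatever model or sample data would define a healthy system in practice.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Profile is a hypothetical set of settings reported by one system.
type Profile struct {
	System   string            `json:"system"`
	Settings map[string]string `json:"settings"`
}

// Event describes one setting that differs from the healthy baseline.
type Event struct {
	System string `json:"system"`
	Key    string `json:"key"`
	Want   string `json:"want"`
	Got    string `json:"got"`
}

// Check compares a profile to the baseline and returns one event per
// mismatched or missing setting, sorted by key for stable output.
func Check(baseline map[string]string, p Profile) []Event {
	var events []Event
	for key, want := range baseline {
		if got := p.Settings[key]; got != want {
			events = append(events, Event{System: p.System, Key: key, Want: want, Got: got})
		}
	}
	sort.Slice(events, func(i, j int) bool { return events[i].Key < events[j].Key })
	return events
}

// CheckJSON decodes an incoming JSON profile and checks it immediately,
// with no storage step in between.
func CheckJSON(baseline map[string]string, raw []byte) ([]Event, error) {
	var p Profile
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, err
	}
	return Check(baseline, p), nil
}

func main() {
	// Assumed baseline for a healthy system.
	baseline := map[string]string{"firewall": "on", "ssh_root_login": "no"}

	raw := []byte(`{"system":"db07","settings":{"firewall":"on","ssh_root_login":"yes"}}`)
	events, err := CheckJSON(baseline, raw)
	if err != nil {
		fmt.Println("bad profile:", err)
		return
	}
	for _, e := range events {
		out, _ := json.Marshal(e)
		fmt.Println(string(out))
	}
}
```

Each profile is judged the moment it is decoded, so the latency of loading it into tables first never enters the picture.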
💬 NoSQL options did not exist when I designed and deployed a massive data storage system for IBM that contained the profile data for tens of thousands of systems to ensure audit compliance. We were constantly plagued by brittle tables that would easily take a new column but could almost never be migrated to a new one. Requests to keep track of new data required not only a database change and an outage, but also rewriting all the SQL to create the database, query it, and more. Then the software data structures had to be altered to match as well. This was massively prohibitive. Had I had Go’s JSON structure integration and modern NoSQL databases, it would have been a dream to maintain.