Relying on the database to validate your data
One of my pet peeves is using the database schema to validate data.
There are several ways in which this normally happens:
- Specifying a column as "required", e.g. email VARCHAR(255) NOT NULL
- Adding an index to force column values to be unique, e.g. CREATE UNIQUE INDEX email_idx ON users(email)
- Adding a foreign key constraint to enforce referential integrity, including cascading deletes, etc.
Yes, I want data integrity too. No, I don't want to rely on the database for that.
I find it surprising that some years ago we decided that we shouldn't write application logic in our database (e.g. stored procedures) because:
- They are not written in the same language as the rest of the project.
- They are not version-controlled (unless you jump through some extra hoops).
- They are not testable in isolation; you need an actual database to run them.
- They are "magic" because they are triggered implicitly.
- The code is vendor-specific.
Well, anyway, it's clear that we don't want them.
Yet many of these concerns apply to validation at the database level as well.
Except, with stored procedures we actually delegate some work to the database.
With validation, we usually duplicate the work.
We first validate in the code that a value has been provided and show a form error if it hasn't.
Then we use an assertion in our model object to verify that the value is not null.
Then we save our object to the database, which again verifies that the value is not null.
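To make the duplication concrete, here is a minimal sketch of the first two layers. The function validateRegistrationForm(), the error message and the shape of the User class are my own assumptions for illustration, not code from a real project.

```php
<?php
declare(strict_types=1);

// Layer 1: form validation, producing a user-facing error.
// validateRegistrationForm() is a hypothetical helper.
function validateRegistrationForm(array $formData): array
{
    $errors = [];
    $email = trim((string) ($formData['email'] ?? ''));
    if ($email === '') {
        $errors['email'] = 'Please provide an email address.';
    }

    return $errors;
}

// Layer 2: the model object asserts the same rule again.
final class User
{
    private string $emailAddress;

    public function __construct(?string $emailAddress)
    {
        if ($emailAddress === null) {
            throw new InvalidArgumentException('Email address must not be null.');
        }
        $this->emailAddress = $emailAddress;
    }

    public function emailAddress(): string
    {
        return $this->emailAddress;
    }
}

// Layer 3 would be the NOT NULL constraint on the email column,
// which repeats the check a third time when the row is inserted.
```

The same rule ("an email address is required") is expressed in three places, and all three have to stay in sync.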
Why do we do this? Maybe because we want symmetry? The model property is not nullable, so the column it's mapped to should also not be nullable. Maybe because we are more confident about the database than about our code? In the code a validation might be skipped somehow, but we'll always have that extra validation once the data ends up in the database.
Non-nullability
I think we need neither the symmetry nor the safeguard.
Instead, we should put more trust in our application code and make sure everything is handled there.
No need for the "double bookkeeping" where you try to keep the nullability of your model's properties in sync with the nullability of the database columns.
In my experience, a model property is often nullable while the database column isn't, or vice versa.
This leads to the application blowing up because of a nullability discrepancy between the code and the database.
We can reduce this risk by stopping the double bookkeeping.
Instead of defining non-nullability on database columns, let's only define it in the code.
We always have to deal with non-nullability in the code anyway, since we want validation errors instead of SQL errors.
So, let's just remove NOT NULL everywhere, hooray!
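A rough sketch of what "only in the code" could look like, assuming PHP 8 and a nullable email column: the non-nullable type on the constructor is the single place where the rule lives. The fromRow() method and the row layout are hypothetical.

```php
<?php
declare(strict_types=1);

final class User
{
    // The non-nullable string type is the only definition of the rule.
    public function __construct(private string $emailAddress)
    {
    }

    public function emailAddress(): string
    {
        return $this->emailAddress;
    }

    // Hypothetical mapping from a fetched row (associative array).
    // The email column itself is assumed to be nullable.
    public static function fromRow(array $row): self
    {
        // Passing null here raises a TypeError, so a bad row is
        // rejected by the model rather than by the schema.
        return new self($row['email']);
    }
}
```

With this in place the schema no longer needs to know about the rule: a row with a null email cannot be turned into a User object at all.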
Unique indexes
Another interesting database-level validation technique is ensuring uniqueness, like in the case of the unique email address in the users table.
Apparently, we don't trust the application here either, and leave the verification of an important rule to the database (we'll talk later about how important it really is).
Given that the database supports a uniqueness check by means of an index, we jump in and add one to every column we feel should be unique.
We then realize that we don't want SQL errors when a user registers with a known email address.
We want form validation errors instead.
Then we introduce some more double bookkeeping so we can be user-friendly and protect our data integrity.
What if we only validate the uniqueness of the email address in the application code and not in the database?
We could end up with two records in the users table that have the same email address.
It's unlikely that this will happen though, because the application always runs the uniqueness validation itself, by querying the table for an existing record with the provided email address:
if ($this->userRepository->containsUserWithEmailAddress($emailAddress)) {
    // show form error in the response
} else {
    // the data is valid
    $user = new User($emailAddress);
    $this->userRepository->save($user);
}
The only way we could end up with duplicate records is when two registration requests providing the same email address are being processed at the same time.
In that case, containsUserWithEmailAddress() may return false during both of these requests, and the two calls to save() would result in two records with the same email address.
Again, it's not very likely, but it could happen.
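The interleaving can be simulated in a single script. This is only a sketch: InMemoryUserRepository is a hypothetical stand-in for a users table without a unique index, save() takes the plain address instead of a User object to keep things short, and the two "requests" are simply run step by step in the unlucky order.

```php
<?php
declare(strict_types=1);

// In-memory stand-in for the users table WITHOUT a unique index.
final class InMemoryUserRepository
{
    /** @var string[] */
    private array $emailAddresses = [];

    public function containsUserWithEmailAddress(string $emailAddress): bool
    {
        return in_array($emailAddress, $this->emailAddresses, true);
    }

    // Simplified: stores the address itself rather than a User object.
    public function save(string $emailAddress): void
    {
        $this->emailAddresses[] = $emailAddress;
    }

    public function countUsersWithEmailAddress(string $emailAddress): int
    {
        return count(array_keys($this->emailAddresses, $emailAddress, true));
    }
}

$repository = new InMemoryUserRepository();
$emailAddress = 'jane@example.com';

// Both requests run their uniqueness check before either one has saved.
$requestAFoundExisting = $repository->containsUserWithEmailAddress($emailAddress);
$requestBFoundExisting = $repository->containsUserWithEmailAddress($emailAddress);

// Each request then acts on its own, by now outdated, answer.
if (!$requestAFoundExisting) {
    $repository->save($emailAddress);
}
if (!$requestBFoundExisting) {
    $repository->save($emailAddress);
}

$duplicates = $repository->countUsersWithEmailAddress($emailAddress);
```

Both checks answer false, both saves go through, and $duplicates ends up as 2. In a real application the window between the check and the save is much smaller, which is why this rarely happens.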
But what if it happens?
We just have to make sure it has no significant impact.
I think the biggest fear when having duplicate email addresses in the database is that someone might b