How did the client get there?
Before diving into the details, let's try to understand how an app might end up in this state. We start with a simple `users` table. After a few weeks, we need to be able to determine the last sign-in time, so we add `users.last_sign_in_at`. Then we need to know the user's name. We add `first_name` and `last_name`. Twitter handle? Another column. GitHub profile? Phone number? After a few months the table becomes mind-boggling.
What's wrong with this?
A table that large indicates several problems:

- `User` has multiple unrelated responsibilities. This makes it more difficult to understand, change and test.
- Exchanging data between the app and the database requires extra bandwidth.
- The app needs more memory to store the bulky model.

The app fetched `User` on every request for authentication and authorisation purposes but usually used only a handful of columns. Fixing the problem would improve both the design and performance.
Extracting a table
We can solve the problem by extracting seldom-used columns to a new table (or tables). For example, we can extract profile information (`first_name`, etc.) into `profiles` with the following steps:
1. Create `profiles` with columns duplicating the profile-related columns in `users`.
2. Add `profile_id` to `users`. Set it to `NULL` for now.
3. For each row in `users`, insert a row into `profiles` that duplicates the profile-related columns.
4. Point `profile_id` of the corresponding row in `users` to the row inserted in step 3.
5. Do not make `users.profile_id` non-`NULL`. The app isn't aware of its existence yet, so it'd break.
We need to replace references to `users.first_name` with `profiles.first_name` and so on. If we're extracting just a few columns with a handful of references, then I recommend we do this manually. But as soon as we catch ourselves thinking "Oh, no. This is the worst job ever!" we should look for an alternative.
Don't neglect the problem. A part of the code that everyone avoids will deteriorate further and suffer from even greater inattention. The easiest way to break the vicious circle is to start small.
Read on if you're curious how my client solved the problem.
Fixing the code one line at a time
The most incremental approach is fixing one reference to the old column at a time. Let's focus on moving `first_name` from `users` to `profiles`.
First, create `Profile` with:

```shell
rails generate model Profile first_name:string
```
Then add a reference from `users` to `profiles` and copy `users.first_name` to `profiles`:
```ruby
class ExtractUsersFirstNameToProfiles < ActiveRecord::Migration
  # Redefine the models to break dependency on production code. We need
  # vanilla models without callbacks, etc. Also, removing a model in the future
  # might break the migration.
  class User < ActiveRecord::Base; end
  class Profile < ActiveRecord::Base; end

  def up
    add_reference :users, :profile, index: { unique: true }, foreign_key: true

    User.find_each do |user|
      profile = Profile.create!(first_name: user.first_name)
      user.update!(profile_id: profile.id)
    end

    change_column_null :users, :profile_id, false
  end

  def down
    remove_reference :users, :profile
  end
end
```
Because it forces each user to have exactly one profile, a reference from `users` to `profiles` is preferable to the opposite reference.
With the database structure in place, we can delegate `first_name` from `User` to `Profile`. My client had several requirements:

- Accessors should use the associated `Profile`. They should also log where the deprecated accessor was called from.
- Saving `User` should automatically save `Profile` in order to avoid breaking code using the deprecated accessors.
- `User#first_name_changed?` and other `ActiveModel::Dirty` methods should still work.
This means `User` should look like this:
```ruby
class User < ActiveRecord::Base
  # We need autosave as the client code might be unaware of
  # Profile#first_name and still reference User#first_name.
  belongs_to :profile, autosave: true

  def first_name
    log_backtrace(:first_name)
    profile.first_name
  end

  def first_name=(new_first_name)
    log_backtrace(:first_name)

    # Call super so that User#first_name_changed? and similar still work as
    # expected.
    super

    profile.first_name = new_first_name
  end

  private

  def log_backtrace(name)
    filtered_backtrace = caller.select do |item|
      item.start_with?(Rails.root.to_s)
    end
    Rails.logger.warn(<<-END)
A reference to an obsolete attribute #{name} at:
#{filtered_backtrace.join("\n")}
END
  end
end
```
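The filtering in `log_backtrace` can be tried out without Rails. Here is a minimal, framework-free sketch: `APP_ROOT` stands in for `Rails.root` and the frames are made up, but the `select` over `start_with?` is the same idea — only frames from the app itself survive, so the log points at our code rather than at framework internals.

```ruby
# Framework-free sketch of the backtrace filtering in log_backtrace.
# APP_ROOT stands in for Rails.root; the frames below are hypothetical.
APP_ROOT = "/srv/app".freeze

def filter_backtrace(backtrace, root: APP_ROOT)
  backtrace.select { |frame| frame.start_with?(root) }
end

frames = [
  "/srv/app/app/controllers/users_controller.rb:10:in `show'",
  "/usr/lib/ruby/gems/activesupport/lib/callbacks.rb:42:in `run'",
  "/srv/app/app/views/users/show.html.erb:3",
]

# Only the two application frames remain after filtering.
puts filter_backtrace(frames)
```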
After these changes, the app works the same but may be a bit slower because of the extra references to `Profile` (if performance becomes an issue, just use a tool like AppSignal). The code logs all references to the legacy attributes, even ungreppable ones (e.g. `user[attr] = ...` or `user.send("#{attr}=", ...)`), so we'll be able to locate all of them even when `grep` is unhelpful.
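To see why even dynamic writes get caught, consider this toy sketch (a hypothetical class, not the real `User` model): `send` dispatches to the same setter definition as a literal assignment, so the override logs the call even though no `first_name=` appears anywhere in the caller's source.

```ruby
# Toy demonstration: dynamic `send` calls still hit an overridden setter.
class User
  attr_reader :log

  def initialize
    @log = []
  end

  def first_name=(value)
    @log << :first_name_write # stands in for log_backtrace
    @first_name = value
  end
end

user = User.new
attr = "first_name"
user.send("#{attr}=", "Alice") # ungreppable, but still logged
```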
With this infrastructure in place, we can commit to fixing one reference to `users.first_name` on a regular schedule, e.g. every morning (to start the day with a quick win) or around noon (to work on something easier after a focused morning). This commitment is essential because our goal is to lower the mental barriers to fixing the issue. Leaving the code above in place without taking action will degrade the app even further.
After removing all deprecated references (and confirming with `grep` and the logs), we can finally drop `users.first_name`:
```ruby
class RemoveUsersFirstName < ActiveRecord::Migration
  def change
    remove_column :users, :first_name, :string
  end
end
```
We should also get rid of the code added to `User`, as it's no longer necessary.
Limitations
The method might apply to your case, but keep in mind some of its limitations:

- It doesn't handle bulk queries like `User.update_all`.
- It doesn't handle raw SQL queries.
- It may break monkey-patches (remember that dependencies might introduce them too).
- `User` and `Profile` may get out of sync if `profiles.first_name` is updated but `users.first_name` is not.
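The first limitation deserves a quick illustration. `update_all` writes straight to the database without instantiating models, so the deprecated Ruby setter never runs. The toy sketch below simulates that with a direct instance-variable write standing in for the raw SQL write (hypothetical class and names):

```ruby
# Toy demonstration: writes that bypass the setter are never logged,
# just like User.update_all bypasses the deprecated accessor.
class User
  attr_reader :writes_logged

  def initialize
    @writes_logged = 0
  end

  def first_name=(value)
    @writes_logged += 1 # stands in for log_backtrace
    @first_name = value
  end
end

user = User.new
user.first_name = "Alice"                       # goes through the setter: logged
user.instance_variable_set(:@first_name, "Bob") # direct write: not logged
```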
You might be able to overcome some of them. For example, you might keep the models in sync with a service object or a callback on `Profile`. Or, if you use PostgreSQL, you might consider using a materialized view in the interim.
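As a sketch of the sync idea, here is a framework-free service object that writes both the new and the legacy location together. All names are hypothetical; in Rails this logic could live in an `after_save` callback on `Profile` instead.

```ruby
# Framework-free sketch: keep the legacy users.first_name column in sync
# with its new home on profiles by funnelling writes through one object.
UserRow    = Struct.new(:first_name, :profile) # stands in for a users row
ProfileRow = Struct.new(:first_name)           # stands in for a profiles row

class SyncFirstName
  def self.call(user, new_first_name)
    user.profile.first_name = new_first_name # new home of the data
    user.first_name = new_first_name         # legacy column, kept in sync
    user
  end
end

user = UserRow.new("Alice", ProfileRow.new("Alice"))
SyncFirstName.call(user, "Bob")
```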
That's it!
The most important lesson of the article is: do not avoid code that smells; tackle it head-on instead. If the task is overwhelming, work iteratively on a regular schedule. The article presented a method to consider when extracting a table is difficult. If you can't apply it, look for something else. If you have no idea how, just drop me a line. I'll try to help. Don't let your bits rot.