Why dummy data matters – and how to generate it

Part of creating a successful Staging and Test environment for your application is generating dummy data as a step in your deployment process.

This dummy data is used during development, the QA process, and even for feature demos. Making it easy to reach particular application states with minimal manual “setup” (which tends to be tedious, data-entry-style work) increases the productivity of the whole team.

In this article, we will explore two strategies for generating dummy data: seeding dummy data and scrambling production data.

Dummy Data Seeding

I once wrote an article entitled “The Perfect Staging Environment”, which explores this strategy in more detail.

The strategy

Write a single seeder class (the “DummyDataSeeder” mentioned below) that runs whenever you deploy to a staging or test environment, creating a fixed set of records: known users, plus the handful of application states your team needs for development, QA and demos.

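As a sketch, such a seeder could look like the following (hypothetical Python; the model names, fields and states are made up for illustration — a Laravel seeder would follow the same shape):

```python
# Hypothetical sketch of a dummy data seeder: each method puts the
# application into one well-known state that QA and demos rely on.
# The User/Order shapes and the state names are assumptions.

class DummyDataSeeder:
    def run(self):
        self.seed_users()
        self.seed_orders()

    def seed_users(self):
        # A predictable admin account plus a handful of regular users.
        self.users = [
            {"id": i, "email": f"user{i}@example.test", "role": "member"}
            for i in range(1, 6)
        ]
        self.users[0]["role"] = "admin"

    def seed_orders(self):
        # One order per user, covering each of the main order states,
        # so every state is reachable without manual data entry.
        states = ["pending", "paid", "shipped", "cancelled", "refunded"]
        self.orders = [
            {"user_id": u["id"], "state": s}
            for u, s in zip(self.users, states)
        ]
```

Because the records are fixed, everyone on the team knows exactly which accounts and states exist after every deploy.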
When to use this strategy

  • When the different states your data can be in are limited.
  • When it’s easy to recreate states by doing a few manual steps.
  • When it’s easy for whoever is doing the QA to update the seeder (ideal for when the developers are taking on the QA responsibility).

When not to use this strategy

  • When there are many different states your data can be in.
  • When it’s complicated and/or tedious to get the application into a particular state.
  • When it’s common to want to recreate a user’s exact state in a test environment in order to better understand an issue.

Lessons learned

  • When the application was “small”, this strategy was an extremely quick way for developers to get into a productive, working state.
  • The “DummyDataSeeder” file can quickly become one very long, hard-to-follow class.
  • We tried to improve on this strategy by creating a UI + configuration-driven way of seeding different states. Avoid doing this: the configuration file ended up being more complex than what it replaced, rarely worked and wasn’t used.

Scrambling Production Data

This strategy leaves you with dummy data that is almost identical to your production data, except anonymised. For medium to large-scale applications, this is well worth the effort.

The strategy

  • Replicate your production database elsewhere.
  • Iterate over all of your records, applying a reasonable strategy to anonymise the data.
  • Use this database as your staging/test database.
  • Automate this through a command, or as part of your post-deployment script.
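The anonymisation step could be sketched as follows (hypothetical Python; the table rows, field names and salt are assumptions — in practice this would run over your replicated database):

```python
import hashlib

def anonymise_value(value: str, salt: str = "staging") -> str:
    # Deterministic scrambling: the same input always maps to the same
    # output, so relationships between records survive anonymisation.
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def scramble_users(rows):
    # 'rows' stands in for an iterator over the replicated production
    # users table; the field names here are assumptions.
    for row in rows:
        row["name"] = "User " + anonymise_value(row["name"])
        row["email"] = anonymise_value(row["email"]) + "@example.test"
        row["phone"] = None  # drop fields that have no value in testing
    return rows
```

Keeping the scrambling deterministic means two records that referenced the same name before the run still match after it, which preserves the “almost identical to production” property.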

When to use this strategy

  • When the data you want to mock is quite complex
  • When you want to reproduce, in a test environment, the exact state a production user is in
  • When you expand the team to include a dedicated QA person
  • When you’d like to test using production data, without any of the risks (and in a GDPR-friendly way)

When not to use this strategy

  • When the application is small
  • When the person doing the testing is a developer, and can easily create different states

Lessons learned

  • Automate this process so it runs at the click of a button: it’s a task that needs to happen often (at least once per release, and at most every time a feature branch is deployed to a test or staging environment).
  • Depending on the 3rd-party integrations your application has, you may want to disconnect any production accounts as part of the scrambling process.
  • It’s useful to scramble user emails to match their ID, and to change everyone’s password to something easy to remember.
  • Test this solution in a real-world environment to avoid surprises (if you’d like to set up an export across servers, you’re going to want to see what the performance of that looks like!)
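The last two lessons about credentials and integrations could be sketched like this (hypothetical Python; the field names, key suffixes and hashing scheme are assumptions — use whatever password hasher your application already uses):

```python
import hashlib

SHARED_PASSWORD = "password123"  # hypothetical shared staging password

def scramble_credentials(user):
    # Deriving the email from the ID makes it trivial to log in as any
    # specific user when reproducing their state.
    user["email"] = f"user{user['id']}@example.test"
    # One shared, easy-to-remember password for everyone.
    user["password_hash"] = hashlib.sha256(SHARED_PASSWORD.encode()).hexdigest()
    return user

def disconnect_integrations(settings):
    # Null out 3rd-party credentials so the staging copy can't talk to
    # production accounts (the key suffixes here are assumptions).
    for key in list(settings):
        if key.endswith(("_api_key", "_token", "_secret")):
            settings[key] = None
    return settings
```

With this convention, “log in as user 42” means typing user42@example.test and the shared password — no lookup required.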

Update: @PovilasKorop let me know about a couple of packages that help implement this strategy.

Thanks Povilas!

End

What do you think about these strategies? Let me know if you’ve got any different ones working for your team + application setup!

I lead the tech team at Mindbeat, follow me on Twitter as I continue to document lessons learned on this journey.