A number of people have asked me what it takes to make it in data analytics. This is a tough question because the answer often depends on where you’re going to be working and what will be expected of you in your role. Some companies believe that a Data Analyst should be wholly responsible for an analytics project from soup to nuts (or in this case, from need identification to detailed strategy). Others split the work: some people gather the requirements, others pull the queries, others use the results of those queries for analysis, and still others review the analysis to arrive at a strategy.
Regardless of where on the expectation spectrum your situation falls, here are some rules that I’ve mapped out along with my team to lay the groundwork for a good analyst.
Rule #0: Know Thy Data
I mean intimately know your data. This is the foundation for any good analyst. If you don’t know what you’re looking at, you’ll never be able to find what you’re looking for. Maybe more importantly, you have to know the data so you know where to avoid the pitfalls that you might otherwise stumble into.
Rule #1: No Dying
This applies to you, your queries, your servers, your computer, your pet… take your pick. Dying is unhelpful when you’re trying to do analysis and tends to distract others. It is therefore forbidden.
Rule #2: In Case Rule #1 is Violated, Double Tap
Straight out of Zombieland, ensure that anyone who has violated Rule #1 is no longer a danger. They deserve it anyway.
Rule #3: Know Thy Code
Hand-in-hand with Rule #0, this is absolutely crucial to being a good analyst. You have to know what your code is doing to the data in order to appropriately identify any possible SNAFUs along the way. This includes:
- Knowing at least a good chunk of the code that got the data into your hands
- Knowing the code you’re using to query the data
- Knowing the code used for any data transformations
- Knowing the code and algorithms behind any analysis you intend to perform
Violations of this rule are pretty egregious and very difficult to recover from.
Rule #4: We Do Not Talk about Rule #4
While this is indeed a reference to Fight Club, it also stands in for whatever project will haunt you to the end of your days. This is the project that you thought you’d finished… except the fix didn’t work… and the system blew up (violating Rule #1)… and the data had to be adjusted manually… and… and… and…
Rule #5: Know Thy Data Types
This one started life as a much shorter statement: Timestamp != Date. Temporal data types seem to cause more trouble than any others. I once did an enormous amount of work on a statistical analysis with a date field improperly cast as text. Hundreds of hours of work went down the drain because I wasn’t aware that the data was stored in the wrong data type. The same holds true of trying to aggregate numbers stored as text, or casting integers as dates… the list goes on. Knowing what data type you have and what it should be (and, ideally, how to cast it appropriately from one to the other as needed) is an invaluable skill.
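To make this concrete, here is a minimal Python sketch of how the wrong data type quietly changes your results. The sample values and variable names are made up for illustration, and the month/day/year text format is an assumption rather than anything from a real dataset.

```python
from datetime import date, datetime

# Hypothetical sample: dates that arrived as text in month/day/year form.
raw_dates = ["12/01/2019", "01/15/2020", "03/02/2020"]

# Sorting text compares character by character, not chronologically,
# so the December 2019 date ends up last.
text_sorted = sorted(raw_dates)

# Parsing into real date objects restores chronological order.
parsed = [datetime.strptime(s, "%m/%d/%Y").date() for s in raw_dates]
date_sorted = sorted(parsed)

# Timestamp != Date: a timestamp carries a time of day, so it does not
# compare equal to the calendar date it falls on.
ts = datetime(2020, 3, 2, 14, 30)
naive_equal = ts == date(2020, 3, 2)       # datetime vs. date
same_day = ts.date() == date(2020, 3, 2)   # compared at date granularity

# Numbers stored as text are also compared character by character.
amounts_as_text = ["100", "25", "9"]
text_max = max(amounts_as_text)
numeric_max = max(int(a) for a in amounts_as_text)

print(text_sorted, date_sorted[0], naive_equal, same_day, text_max, numeric_max)
```

The point is not these particular casts. Each "wrong type" line runs without any error, which is exactly why it is worth checking the types before the analysis rather than after.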
Rule #6: Apples + Oranges = Pear-Shaped
The dangers of comparing apples to oranges are numerous (obviously, since this rule comes with its own saying). Mixing data types, time periods, member groups, or other kinds of data or summaries can create results that *should* make no sense… except that people will assume you’ve followed this rule and try to read meaning into them. This tends to lead to erroneous conclusions and faulty strategies that can cost the client or company quite a bit.
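As a small illustration of the time-period case, the sketch below compares a full month against a partial month. The totals, the dates, and the `daily_rate` helper are all hypothetical. Putting both periods on a per-day basis is just one possible way to make them comparable, not the only one.

```python
from datetime import date

def daily_rate(total: float, start: date, end: date) -> float:
    """Average per-day value over an inclusive date range."""
    days = (end - start).days + 1
    if days <= 0:
        raise ValueError("end must not be before start")
    return total / days

# Hypothetical figures: a full month versus the first ten days of the next.
march_total = 3100.0   # 1 March to 31 March (31 days)
april_total = 1500.0   # 1 April to 10 April (10 days)

# Apples to oranges: the raw totals cover periods of different lengths,
# so this difference looks like a collapse.
raw_change = april_total - march_total

# Apples to apples: put both periods on the same per-day basis first.
march_rate = daily_rate(march_total, date(2021, 3, 1), date(2021, 3, 31))
april_rate = daily_rate(april_total, date(2021, 4, 1), date(2021, 4, 10))

print(raw_change, march_rate, april_rate)
```

With these made-up numbers, the raw totals suggest a steep drop while the per-day rates show an increase. That is the kind of backwards conclusion this rule is meant to prevent.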
Rule #7: Tables are Fragile. Do Not Drop!
OK, this isn’t really a hard and fast rule. There are many times when you want to drop a table (or, more precisely, when you should be working with temporary tables instead of cluttering up the storage drives with tables that then need to be dropped). Most of the time, though, it is imprudent to drop a table, especially if you are in violation of Rule #0. I’ve seen plenty of tables dropped that then required hours or even days’ worth of rework to recover. If you’re smart about it, you ought to have backups of tables this important (and this difficult to re-create) so that you can avoid the extra work of rebuilding them, but I’ve seen backups fail as well. Best to follow the rules.
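For the times when a table really does have to go, here is one possible "back up, verify, then drop" sketch. It uses Python’s built-in sqlite3 module purely for illustration. The `drop_with_backup` helper, the `_backup` suffix, and the `members` table are all hypothetical, and a real warehouse will have its own backup tooling.

```python
import sqlite3

def drop_with_backup(conn: sqlite3.Connection, table: str) -> str:
    """Copy `table` to `<table>_backup`, check the row counts, then drop it."""
    # Table names cannot be bound as query parameters, so accept only
    # plain identifiers before building the SQL strings.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    backup = f"{table}_backup"

    conn.execute(f'CREATE TABLE "{backup}" AS SELECT * FROM "{table}"')
    original = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    copied = conn.execute(f'SELECT COUNT(*) FROM "{backup}"').fetchone()[0]

    # Refuse to drop anything unless the copy has the same number of rows.
    if original != copied:
        raise RuntimeError("backup row count mismatch; table not dropped")

    conn.execute(f'DROP TABLE "{table}"')
    conn.commit()
    return backup

# Usage with a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO members VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
backup_name = drop_with_backup(conn, "members")
rows = conn.execute(f'SELECT * FROM "{backup_name}" ORDER BY id').fetchall()
print(backup_name, rows)
```

Note that `CREATE TABLE ... AS SELECT` copies the rows but not the indexes or constraints, and a matching row count is only a basic sanity check. Treat this as a safety net, not a replacement for proper backups.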