It’s been more than five months since I started working as a data scientist. When I decided to leave academia, I spent quite a bit of time asking for advice and wondering exactly which skills I was missing, and which skills I might already have without being aware of them, in order to “become” a data scientist. This post is general advice for people thinking about taking the route I’ve taken, and I will try to write similar posts in the future with updates on further things I’ve learned.
UPDATE #1: I wrote this post after one year.
- Python. There is nothing like the power of Python (and particularly the pandas package) for quickly hacking together something you need. Is there some data transformation you find yourself doing fifteen times a day? Write a Python script. Need to augment your own data with some pulls from APIs? Use Python to both scrape and join the data. Python also gives you access to most standard machine learning algorithms through its scikit-learn package. I knew a little bit about these things before I started: enough to play around with some personal projects, but definitely not a lot. I feel much more fluent in Python and its data libraries now.
- SQL. Like it or not, SQL is still the heart of most data infrastructures. Whether you need to obtain raw data sets for testing or to provide basic reports and explanations for other people, SQL is the main way to access data. Before I started I knew SQL syntax and understood how relational databases work (this is fairly easy for someone with a mathematical background), but I had very little experience writing, and in particular optimising, queries.
- Skepticism. “Data science”, like any science, requires adherence to the scientific principles of falsifiability and replicability. Anything you do should be a) able to be disproved and b) able to be repeated by someone else. This means always checking your assumptions and always being willing to change your hypothesis, and even fully reverse your line of inquiry, based on new evidence you find in the data. Again, this comes fairly naturally to someone with an academic background in the sciences.
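To make the Python point above a little more concrete, here is a minimal sketch of augmenting your own data with data pulled from an API and joining the two with pandas. Everything in it is made up for illustration: the `orders` table, the `country_code` key, and the `api_payload` list, which stands in for the JSON a real API call would return.

```python
import pandas as pd

# Hypothetical internal data: one row per order.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "country_code": ["DE", "FR", "XX"],
    "amount": [120.0, 80.0, 45.0],
})

# Stand-in for the JSON payload of an external API (assumed shape).
api_payload = [
    {"country_code": "DE", "currency": "EUR"},
    {"country_code": "FR", "currency": "EUR"},
]
countries = pd.DataFrame(api_payload)

# A left join keeps every order; indicator=True adds a "_merge" column
# that records whether each row found a match in the API data.
enriched = orders.merge(countries, on="country_code", how="left", indicator=True)

# Rows that found no match are worth inspecting before trusting the result.
unmatched = enriched[enriched["_merge"] == "left_only"]
print(unmatched[["order_id", "country_code"]])
```

Checking the unmatched rows is a small instance of the skepticism described above: the join runs without complaint either way, so you have to look for the gaps yourself.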
Skills I’ve learned
- Hands-on experience with data. There’s no real substitute for actual hands-on experience with a real-world database. The way that big data is structured and handled is not something you can accurately replicate with a home project. Working with it requires knowledge of the full data pipeline, which depends entirely on the company and its software stack.
- SQL. The more SQL you learn the better you can understand the data you are playing with. SQL is a very intuitive and powerful way to “touch” the data, and learning how to optimise SQL queries forces you to understand the specific structure of your data and your database.
- Communication. In industry, as opposed to academia, most of the people to whom you present the insights you found in your data are not particularly interested in the method you used, but rather in the bottom-line number and the confidence you have in it. Being able to confidently and accurately communicate complicated and insightful conclusions you draw from data is difficult to learn without hands-on experience.
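As a small illustration of the point about optimising SQL queries, the sketch below uses Python’s built-in `sqlite3` module to compare the query plan for the same query before and after adding an index. The `events` table, its columns, and the index name are invented for the example, and a production database will have its own tooling and plan format, so treat this as a toy rather than a recipe.

```python
import sqlite3

# In-memory database with a hypothetical events table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO events (user_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

query = "SELECT SUM(amount) FROM events WHERE user_id = 42"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in rows)


plan_before = plan(query)  # no index on user_id yet: expect a full table scan
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
plan_after = plan(query)   # the planner can now search via the index

total = conn.execute(query).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
print("total: ", total)
```

The query returns the same total either way; only the plan changes. Reading plans like this is one way of being forced to understand how your data is actually laid out.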