I started work as a data scientist one year ago today. As I mentioned in the first post I wrote on the topic, I get questions from time to time about the transition from academia to data science, and I hope these posts can help people get a better idea about the day-to-day life of someone working with data.
Projects from the last year
I’ll start with a short list of projects I’ve done in my job over the last year (not going into too much detail). Some of these are outside the traditional domain of a data scientist (I mentioned this in a previous post) but I still enjoyed them and learnt a lot (and am starting to believe that the title of “data scientist” is a bit misleading anyway; see below).
- building a statistical model for user lifetime value
- writing a basic anomaly detector for website traffic (to detect successful content before it has “hit peak”)
- designing client-side event logic for a native monetisation platform
- writing an ETL pipeline for a new data product
- building a regression model to predict popular content
- using random forests to train a model for user engagement behaviour
- designing and helping write a content-based recommendations engine in Spark with Scala
Things I’m pretty good at now
My problem solving skills have adapted well from mathematical problems to real-world problems involving data. I usually feel confident in diagnosing an issue, detecting where in the pipeline things are going wrong, and knowing the steps I or someone else will need to take to fix it.
My Python programming has improved dramatically: the code I write now is much more modular and efficient, as well as clean, well commented and documented. I also learnt the basics of programming in Scala and using the Spark framework for writing distributed algorithms (including its SparkML framework for implementing machine learning at scale).
Things I still want to learn
There’s quite a lot I still don’t know about data science. My mathematical background is good but there are still a lot of gaps in my understanding of statistics (this wasn’t something I ever really studied at university). I’m making my way slowly through the classic textbook on statistical learning and getting a lot out of it. I would also love to learn more about practical implementations of machine learning models: I feel quite confident with the theory but have only had the opportunity to implement a few basic models, and not at huge scale. Over the next year I’m very keen to learn about large-scale deployment and optimisation of machine learning algorithms in production environments.
Finally, data visualisation. I know very little about the design theory behind data visualisation: this is something that I feel really separates good data scientists from ordinary ones, and gives them the edge in being able to communicate their findings effectively. Over the next year I’m going to be spending quite a bit of time practising data visualisation with a variety of interesting software.
What I learnt about data science and data scientists
You can find a hundred posts online about what data science is, and is not, and what a data scientist should know and do professionally. I have had a somewhat unusual initiation into the field, as my job specifically is a lot more hands-on and operational than that of many data scientists. I’ve enjoyed this though, and learnt about the fluid boundaries between parts of the “data” industry: over the last year I’ve done bits and pieces of jobs which might be referred to in different workplaces as data scientist, data engineer, data architect, database administrator, business analyst, software engineer, product manager and other things. It’s an exciting field and there’s lots to learn, and I take these messy boundaries as a good thing (so long as someone else is ultimately deciding who is responsible for what).