I spend a lot of my time at work turning CSV files generated by SQL queries I’ve written into meaningful insights. The first step in this process is always to clean up the data: processing it, aggregating it further, and combining columns. The second step is to perform some kind of statistical analysis to get insight into the user behaviour behind the data.
Historically a lot of this work was done in Microsoft Excel. Excel is a powerful and underrated piece of software which people have been using for more than twenty years to get these kinds of things done. If you’ve never seen an Excel power user manipulate a spreadsheet then perhaps you aren’t aware of just what can be done in Excel if you know what you’re doing.
I appreciate Excel and wish I were an expert at data-wrangling in it. However, my preferred tool for all of these tasks is pandas, a scientific computing package for Python which is purpose-built for managing and working with data. The power of pandas is the power of all programming languages, indeed, the power of computers, versus human-executed tasks. Sure, a CSV file can be cleaned up in Excel: delete and combine columns, write some basic functions and VLOOKUPs to make new columns, and build pivot tables to see how different variables behave statistically and how they influence one another. The problem is that if you need the same analysis on a different file, you have to do all those mouse clicks again.
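Each of those Excel operations has a direct pandas counterpart. As a rough sketch (the tables and column names here are invented sample data, not a real export): a VLOOKUP becomes a `merge`, combining columns becomes plain column arithmetic, and a pivot table becomes `pivot_table`:

```python
import pandas as pd

# Hypothetical sample data standing in for a CSV export.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "country": ["UK", "UK", "US", "US"],
    "sessions": [5, 3, 8, 2],
})
plans = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "plan": ["free", "paid", "paid", "free"],
})

# VLOOKUP -> merge: pull each user's plan into the main table.
df = users.merge(plans, on="user_id", how="left")

# Combining columns -> simple column assignment.
df["label"] = df["country"] + "-" + df["plan"]

# Pivot table -> pivot_table: mean sessions by country and plan.
pivot = df.pivot_table(values="sessions", index="country",
                       columns="plan", aggfunc="mean")
print(pivot)
```

None of this is clicked together by hand; it is all code, which is exactly what makes it repeatable.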
Wrangling the data in pandas, however, means the process is modular and reproducible. If I have a fresh CSV with the same kind of data, I can load it in and perform the analysis again instantly. Not to mention that once a dataset is in a pandas dataframe, clean and ready to go, it is accessible to all the machine learning tools of scikit-learn and the statistical analysis tools in numpy and scipy (or transferable to R).