Data Analysis Tools at Cogo Labs
When creating the next web businesses at Cogo Labs, Analysts are often the project pioneers. When we have new ideas we want to explore, we need to get a working version up as quickly as possible so that it can be evaluated more effectively.
This means Analysts have significantly more technical responsibility than they might at other big-data companies. Because it’s our responsibility to see our projects come to fruition, we rely on a much broader set of technologies than your typical data analyst does.
Here’s an overview of the tools we currently use when tackling our day-to-day tasks. Keep in mind this is just a snapshot in time - we always seek to use the best tool for the job, so our toolset continually evolves as new technologies for data analysis and data management emerge.
The first questions for any data analysis problem are “Where does the data live?” and “How is it stored?”
The most common way for us to store and organize data is to use some type of relational database management system, or RDBMS. Most often we use MySQL, Postgres, or Amazon Redshift. Occasionally we will just use Amazon’s S3 service if the data is very large and doesn’t need to be as structured and query-able.
How do we decide which one of these to use? For smaller projects, the decision is between MySQL and Postgres. Each one has its own advantages and disadvantages, so the decision is mostly based on how familiar a given team is with each variant.
Redshift usually enters the picture once the dataset is quite large (approaching a billion rows), as this is the point where Redshift begins to outperform Postgres (or MySQL) for many types of queries. Since Redshift data management is more technically involved, our engineering team helps with the ETL and cluster management of our Redshift data.
Once our data has been uploaded, nicely formatted, and readied for use, what tools do our Analysts use to do their number crunching?
If they don’t already know it, every Analyst who comes to Cogo learns SQL. SQL is the language needed to extract data from our databases, so it is the most common language written by all Analysts.
Although you can perform some pretty complicated statistics in SQL alone, it’s often more trouble than it’s worth. The much more common approach is to extract the data in the format you want, then pass it to the next tool in the pipeline.
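As a minimal sketch of that pipeline, here's a SQL extraction handed off to Python. The table and data are made up for illustration, and SQLite stands in for the MySQL, Postgres, or Redshift databases the post actually describes:

```python
import sqlite3

# Build a tiny in-memory table standing in for a production database
# (real queries would run against MySQL, Postgres, or Redshift).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "click"), (1, "view"), (2, "click"), (3, "view"), (3, "view")],
)

# Do the aggregation in SQL, extracting the data in exactly the
# format we want, then hand the rows to the next tool in the pipeline.
rows = conn.execute(
    "SELECT action, COUNT(*) AS n FROM events GROUP BY action ORDER BY action"
).fetchall()

print(rows)  # [('click', 2), ('view', 3)]
```

The point of the handoff: let the database do what it's good at (filtering and grouping), and let Python take over where SQL gets unwieldy.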
Python, the powerhouse
By far the most common next tool is Python. This is another default tool all Analysts need to learn once they arrive at Cogo, for three reasons:
- Ease of learning;
- Quick turnaround for ideas;
- Ability to scale to most of the problems we need.
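The "quick turnaround" point deserves emphasis: a first answer is often just a few lines of standard-library Python away. The signup numbers below are hypothetical:

```python
import statistics

# Hypothetical daily signup counts pulled from a SQL query.
signups = [120, 135, 128, 150, 110, 160, 142]

# A few lines takes us from raw numbers to a first answer.
print("mean:", statistics.mean(signups))      # 135
print("median:", statistics.median(signups))  # 135
print("stdev:", round(statistics.stdev(signups), 1))
```

From here, scaling up usually means swapping the list for a Pandas DataFrame rather than rewriting anything.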
Additionally, learning Python sets our Analysts up for interacting with the bigger engineering systems within Cogo, many of which are also written in Python.
Another data analysis tool used less frequently around the building is R. Since it is the other standard tool for data analysis, many people, especially those coming from academia, arrive with a fair amount of experience in R.
R is useful to the data scientists within Cogo because of the huge library of statistical functions available, as well as the more mature plotting that comes from ggplot.
Because Python is a much more general tool and goes far beyond just Pandas and SciPy, we have Analysts learn Python over R. However, we never discourage people from deriving clear, presentable results with whatever tool works best for them.
For the quickest analysis work, many Analysts also use Excel. It's hard to beat the speed of opening your CSV file, making a quick manipulation in a pivot table, and having a presentable graph in seconds. Especially for the Analysts who come to Cogo already very familiar with Excel, there will almost always be a place to use the nice graphical interface for moving around your data.
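When an Excel-style pivot table needs to become repeatable, the same rollup is a short Python script. This sketch uses only the standard library, and the CSV data is invented for illustration:

```python
import csv
import io
from collections import defaultdict

# A small CSV like the ones Analysts open in Excel (hypothetical data).
raw = """region,channel,revenue
east,email,100
east,search,250
west,email,80
west,search,300
east,email,50
"""

# Pivot: total revenue by (region, channel), the same rollup a
# pivot table would produce.
pivot = defaultdict(float)
for row in csv.DictReader(io.StringIO(raw)):
    pivot[(row["region"], row["channel"])] += float(row["revenue"])

print(dict(pivot))
# {('east', 'email'): 150.0, ('east', 'search'): 250.0,
#  ('west', 'email'): 80.0, ('west', 'search'): 300.0}
```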
Data, but bigger
Once in a while, an Analyst might need to research a problem using many months' worth of historical data. If this becomes too much for traditional relational databases to handle in one query, they may use tools designed to handle larger quantities of data at one time. Most commonly, we will write either a MapReduce or, more recently, a Spark job.
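The shape of a MapReduce job is easy to sketch in plain Python: a map phase emits key–value pairs from each record, then a shuffle/reduce phase groups by key and combines. This toy version runs on a handful of strings standing in for months of log data; a real job would distribute the same two phases across a Hadoop or Spark cluster:

```python
from collections import defaultdict
from itertools import chain

# Toy records standing in for months of log lines (hypothetical data).
records = ["click view", "view view", "click"]

# Map phase: emit (key, 1) pairs from each record.
mapped = chain.from_iterable(
    ((word, 1) for word in record.split()) for record in records
)

# Shuffle + reduce phase: group by key and sum the counts.
counts = defaultdict(int)
for key, value in mapped:
    counts[key] += value

print(dict(counts))  # {'click': 2, 'view': 3}
```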
While our Engineering team has set up some infrastructure to run Hadoop jobs on our own machines, we have been moving more and more to Amazon’s Elastic MapReduce web service. This pairs nicely with our S3 and Redshift usage, allowing minimal overhead for transferring data from one tool to another.
Some of our Engineers have also recently begun working with Presto for a few of our big databases. This has been especially nice for Analysts who would like to access our largest databases, but don’t want to write an entire MapReduce job for every answer.
In addition to all the open source technology we use, our Engineering team has created two incredible tools for sharing data analyses. The first is Quake, which stands for Query Utility And Knowledge Engine. Quake provides a website where you can create, run, and share SQL executions on any of our dozens of databases.
The other tool, called à la chart, is a service that provides a drop-down menu for creating a visually appealing, easy-to-understand chart from any SQL execution that ran on Quake. The underlying library behind à la chart is C3.js, which offers a variety of ways to visualize data.
Having these tools in our arsenal means that when we discuss and present analyses, we aren’t just glossing over our methodologies. We are sharing the SQL queries, graphs, and code with all who want to poke at, play with, or extend the analysis work. It transforms what used to be a black box of results into a fully collaborative environment, where Analysts work together to solve complex big-data problems.