
At Custora (YC W11), analyzing consumer behavior and presenting the insights through a web-based interface.

We have an in-house system for modeling DAGs of statistics, perhaps like Airflow (I've only read about Airflow, never used it, so can't comment more there). Computed values at different nodes in the DAG can be cached by their arguments. So for example, a fitted model would be a very logical node to cache, as you'll have many other potential statistical requests that would depend on the model.

Refitting the model would entail either clearing that node or changing the inputs to the fit; in the latter case a separate stat would be cached, and dependent stats would know the difference and point to the right stat based on the arguments you passed.
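A minimal sketch of the argument-keyed caching idea (illustrative only — the node names, the dict-based cache, and the trivial "model" here are my assumptions, not Custora's actual code). The point is that calling a dependent stat with new arguments creates a separate cache entry rather than clobbering the old fit:

```python
# Illustrative sketch: stats in a DAG, cached by (name, arguments).
# "fitted_model" is the expensive parent node; "churn_prediction" depends on it.

fit_calls = 0
cache = {}

def get_stat(name, **args):
    # Every node resolves through the same argument-keyed cache.
    key = (name, tuple(sorted(args.items())))
    if key not in cache:
        cache[key] = STATS[name](**args)
    return cache[key]

def fitted_model(training_window):
    # Expensive node: fit a model (stubbed as a dict here).
    global fit_calls
    fit_calls += 1
    return {"window": training_window, "coef": 0.5}

def churn_prediction(training_window):
    # Dependent node: looks up its parent through the cache, so it
    # automatically points at the right fit for these arguments.
    model = get_stat("fitted_model", training_window=training_window)
    return model["coef"] * 2

STATS = {"fitted_model": fitted_model, "churn_prediction": churn_prediction}

get_stat("churn_prediction", training_window=90)
get_stat("churn_prediction", training_window=90)   # cache hit: no refit
get_stat("churn_prediction", training_window=180)  # new args -> separate entry
```

After these three calls only two fits have actually run, and both fitted models coexist in the cache under their own keys.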

A lot of our models are Bayesian in nature, so generating predictions is typically a two-part process: training parameters, which is slow, happens infrequently, and need not include all of the latest data; and applying the posterior update, which is fast and needs to be redone on every data update. (We import data in batch.) Retraining is somewhat ad-hoc right now, although we've got an active project on the docket to systematize this and produce streamlined before-and-after comparisons.
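The two-phase split can be sketched with a toy conjugate Beta-Bernoulli model (my choice for illustration — it is not Custora's actual model, and the prior-strength constant is an assumption): the slow phase picks prior hyperparameters from historical data, while the cheap conjugate update folds in each new batch on import.

```python
# Illustrative two-phase Bayesian workflow: slow, infrequent "training"
# vs. a fast posterior update on every batch import.

def train_prior(historical_conversions):
    # Slow phase (stubbed): choose Beta(alpha, beta) hyperparameters
    # from historical data. strength=10 is an assumed prior weight.
    rate = sum(historical_conversions) / len(historical_conversions)
    strength = 10.0
    return rate * strength, (1 - rate) * strength

def posterior_update(alpha, beta, new_batch):
    # Fast phase: conjugate Beta-Bernoulli update, rerun per batch.
    successes = sum(new_batch)
    return alpha + successes, beta + (len(new_batch) - successes)

alpha, beta = train_prior([1, 0, 0, 1, 0])          # historical rate 0.4
alpha, beta = posterior_update(alpha, beta, [1, 1, 0])  # new batch arrives
posterior_mean = alpha / (alpha + beta)
```

Retraining (rerunning `train_prior`) can then happen on its own slower cadence without blocking the per-import updates.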

Computation work is dispatched to EC2 instances by a Redis-based job manager (qless, developed by Moz, formerly SEOmoz). We do the stats work in R, in-memory. For things like order and transactional data this is feasible even for relatively large retailers, but we're gradually looking at involving Spark more so that we have the capacity for larger analyses. (We already do use Spark for some non-ML tasks, like customer data import.)
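The dispatch pattern looks roughly like the in-process sketch below (deliberately not qless's real API — qless sits on Redis and runs across EC2 workers, so this stands in with a plain thread and queue, and the stat name and payload are made up):

```python
import json
import queue
import threading

# In-process stand-in for a Redis-backed job queue: a producer enqueues
# serialized job descriptions, a worker pops and executes them.
job_queue = queue.Queue()
results = {}

def submit(stat_name, args):
    # Producer side: serialize the job, as you would before pushing to Redis.
    job_queue.put(json.dumps({"stat": stat_name, "args": args}))

def worker():
    # Worker side: pop jobs until a None sentinel arrives.
    while True:
        raw = job_queue.get()
        if raw is None:
            break
        job = json.loads(raw)
        # In production this step would run the actual stats work (in R).
        results[job["stat"]] = sum(job["args"])
        job_queue.task_done()

t = threading.Thread(target=worker)
t.start()
submit("total_revenue", [10, 20, 30])
job_queue.join()     # wait for the worker to drain the queue
job_queue.put(None)  # shut the worker down
t.join()
```

The same shape scales out once the queue is Redis instead of in-process: many EC2 workers pop from shared queues, and job payloads carry the stat name plus its arguments.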


