Jake Timothy

thinking in models

Modeling Humans

The people with the most insight into people’s decisions and motivations tend not to be the same people who design models and systems for data.  My wife, for example, has a prophetic intuition about people.  She “just knows” what someone is up to and why, and what’s going on in different relationships.  She would never let on how much she sees, even if asked directly, because it would make the person uncomfortable.  She also couldn’t read a statistical graphic to save her life.  Over the decades, much of artificial intelligence research has been shaped by efforts to understand human intelligence on metaphysical and biological levels.  With the growth of AI and data science in recent years, technology is poised to take immediate advantage of any new insight that can be captured and instrumented into a practical model.  Just as physics has classical and quantum mechanical models, human behavior has economic, sociological, and psychological models.  Collaboration on human insight between data thinkers and relational thinkers is a great opportunity to push data science forward.

Graphical Excellence

Edward Tufte, in his much-loved The Visual Display of Quantitative Information, lays out a clean approach to data graphics and visualization.  When a designer thinks carefully about representing the data and its context with an efficient use of “ink”, the graphic has a better chance of telling the real story.
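A minimal sketch of that “data-ink” idea using matplotlib (my own illustration of the principle, not a recipe from Tufte): strip the box spines, tick marks, and grid so only the data and its labels remain.

```python
# Prune non-data ink from a matplotlib plot (illustrative, not exhaustive).
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def prune_ink(ax):
    """Remove box spines, tick marks, and grid; keep labels and data."""
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)
    ax.tick_params(length=0)  # keep tick labels, drop the marks
    ax.grid(False)
    return ax

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [1, 3, 2, 4], color="black", linewidth=1)
prune_ink(ax)
fig.savefig("sparkline.png")
```

The same pruning can usually be set once in a style sheet rather than per figure.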

An example that caught my eye a while back is Norse’s live cyber attack map.  Notice the heat map that builds up as time progresses, a visual representation of the aggregating tables beneath.  I find it much more informative than the dramatic exploding circles that emanate from both attackers and targets.

JavaScript libraries such as D3.js have enabled graphics with more design dimensions and options while maintaining clarity and efficiency.  Graphics have been dynamic for some time, able to depict three spatial dimensions, change over time, and respond to interaction, encompassing multiple views in one, and the design space continues to grow with the advent of augmented and virtual reality technologies.  Integrated systems for analysis and visualization, like Tableau, have made and will continue to make generating graphics easy.  However, when designing for publication on a website or in an app, taking the time to refine the graphics is vital.  As communication media evolve, avoiding distracting or distorting elements will remain essential if the audience is to have the best chance of correctly understanding the information.


“In the way of this style, it is correct for even the beginner to hold a sword and short sword in either hand and train in the Way.  When you put your life on the line, you want all your weapons to be of use.  Your real intent should not be to die with weapons […]

Big year for Spark

Congrats to all the Apache Spark contributors out there!  I just read Databricks’ Spark 2015 Year In Review.  What a change Spark has gone through this last year.  I started learning Spark at the beginning of the year, and about a year on the APIs look quite different.  The biggest change from my perspective is the Machine Learning Pipelines and the creation of the ml package over mllib.
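The appeal of ml over mllib is the Pipeline abstraction: estimators and transformers chained so a whole workflow fits and applies as one unit.  Here is a rough sketch of that idea in plain Python, so it runs without a cluster; the class names are my own illustration, not the Spark API.

```python
# Toy sketch of the fit/transform chaining behind ML Pipelines.
# Scaler, Shift, and Pipeline are illustrative names, not Spark classes.

class Scaler:
    """An 'estimator': fit() learns state used later by transform()."""
    def fit(self, data):
        self.max = max(data)
        return self
    def transform(self, data):
        return [x / self.max for x in data]

class Shift:
    """A stateless 'transformer'."""
    def __init__(self, offset):
        self.offset = offset
    def fit(self, data):
        return self
    def transform(self, data):
        return [x + self.offset for x in data]

class Pipeline:
    """Fit each stage in order, feeding it the previous stage's output."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return self
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

pipe = Pipeline([Scaler(), Shift(1.0)]).fit([2.0, 4.0, 8.0])
print(pipe.transform([4.0, 8.0]))  # scaled by the fitted max, then shifted
```

The win is that the whole chain, including the learned state, travels as one object from training to scoring.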

Optimizing real-time big data

In my efforts to design and implement a Kalman filter using Apache Spark, I’ve had to explore some of the design limits of basic distributed computing.  At first, without thinking, I attempted to convert all the operations on collections to the equivalent “parallelized” operation.  It quickly became apparent to me that the differential equation solver needed for time series filtering would not translate in this way.  The simple numerical integration of a data set or function is easily distributed.  However, the propagation of an initial state through a series of transformations is not.  Each step could be processed by a separate node, but the steps must all be completed in sequence, taking the result of the previous computation as input.
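The distinction shows up even in plain Python: a sum is associative, so chunks can be reduced separately and combined, while state propagation is a fold where step n needs step n-1’s output.  A toy one-dimensional Kalman update makes this concrete (my own simplification for illustration, not the filter I built on Spark):

```python
# Toy 1-D Kalman filter: each step consumes the previous state estimate,
# so the loop is inherently sequential -- a fold, not a map.
def kalman_filter(measurements, q=1e-3, r=0.1):
    x, p = 0.0, 1.0             # state estimate and its variance
    estimates = []
    for z in measurements:
        p += q                  # predict: variance grows by process noise q
        k = p / (p + r)         # update: Kalman gain
        x += k * (z - x)        # correct estimate toward the measurement
        p *= (1 - k)
        estimates.append(x)
    return estimates

print(kalman_filter([1.2, 0.8, 1.1, 0.9]))

# By contrast, a reduction over an associative operation splits cleanly:
# each node sums its chunk, and the partial sums combine in any order.
data = [1, 2, 3, 4]
assert sum(data[:2]) + sum(data[2:]) == sum(data)
```

It is the data dependency between iterations, not the arithmetic itself, that resists a naive parallelize-everything translation.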

After running into this wall for the first time, I can’t stop thinking about designing across the boundary between distributed and local computing.  As the velocity and veracity of time series data fluctuate, tension mounts between the potential gains of farming out computation and the need to integrate and filter data for use locally.  Distribution and networking offer an arbitrarily large brain that is often just out of reach given an aggressive local need for real-time processed data and decision making.  Decision makers in a complex data flow must integrate system-level information: not just latency, available resources, and priorities, but also the characteristics of the work to be done.