Smart Data: Part I



Thanks to steady technical improvements in computational power, memory, storage, and software, for instance in the application of machine learning, we are able to leverage data as never before. Now that smartphones, common internet-of-things devices, and of course end users themselves are generating data at an unprecedented speed [1], everybody wants to analyze that „big data“ in the hope of gaining insights never known before. Often, however, this huge amount of data is not needed at all to support human decision making.

There are many historical examples of statistical data analysis in health care. As early as 1954, Paul Meehl showed [2] that psychoanalysts' predictions about the future of their patients compared poorly against those of simple data-driven algorithms. Other significant examples are a cancer detection model by Lewis R. Goldberg and a heart attack detection algorithm developed by Lee Goldman in the 1970s. Lee Goldman’s heart attack algorithm was received by the community with skepticism and was therefore evaluated only in 1996, by Brendan Reilly at Cook County Hospital’s Department of Medicine. It gave the right result in 95% of potential cases, compared to 75-89% for doctors. The evaluation came about because cost pressure forced the hospital to care for many people with constrained resources [3]. The cancer detection model by Lewis R. Goldberg was created rather by chance, since Lew’s initial intention was to show how much the diagnoses of various doctors differed. By building a statistical model of the various cues doctors use to detect cancer, in order to compare the doctors with each other, he inadvertently created an algorithm that detects cancer better than any individual doctor [4]. These examples used data sets that were necessarily very small, rather puny by today’s standards, and yet the results are profound. The interesting thing is that most applications today still use data sets whose sizes are far from big.


How can small data sets be distinguished from big ones? Unfortunately there is no general definition, and the term Big Data has evolved into a marketing buzzword, since size is relative to the technology of its time: today’s big data is tomorrow’s normal data. We regard data as big when its structure and size cannot be processed without a supercomputer [5]. Examples would be: a) the massive amount of data the NSA collects, which by mid-2012 had „over twenty billion communication events from around the world each day“ [6], b) the experiments at the Large Hadron Collider, which are expected to generate 50 petabytes in 2018 after filtering out 99% of the data [7], or c) the recent largest simulation of our cosmos, which ran on 24 thousand processors and produced over 500 terabytes of simulation data [8].

Ultimately, it does not matter how big your data set is, as long as you use it to its fullest effect. Peter Thiel wrote:

„Today’s companies have an insatiable appetite for data, mistakenly believing that more data always creates more value. But big data is usually dumb. Computers can find patterns that elude humans, but they don’t know how to compare patterns from different sources or how to interpret complex behaviors. Actionable insights can only come from a human analyst (or the kind of generalized artificial intelligence that exists only in science fiction)“ [9].

Putting this view into practice has led to the terms smart data, which expresses that data needs to be understood first [10], and thick data, which describes the application of „qualitative, ethnographic research methods to uncover people’s emotions, stories, and models of their world“ [11].

Because of statistical aberrations and human mistakes in interpreting data, understanding the data and its context becomes more important the bigger the data sets are. A well-known example is Simpson’s paradox, where every distinct subgroup shows the same trend, but the aggregate shows the opposite trend [12].
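
To make the paradox concrete, here is a minimal Python sketch. The counts are illustrative only and do not come from the sources cited in this post; they are modelled on a widely used textbook example with two treatments (A and B) and two subgroups (small and large cases). All names in the code are made up for this illustration.

```python
# Illustrative counts only: (successes, trials) per treatment and subgroup.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def subgroup_rates(groups):
    """Success rate of each subgroup on its own."""
    return {name: s / n for name, (s, n) in groups.items()}


def aggregate_rate(groups):
    """Success rate after pooling all subgroups together."""
    total_successes = sum(s for s, _ in groups.values())
    total_trials = sum(n for _, n in groups.values())
    return total_successes / total_trials


rates = {t: subgroup_rates(g) for t, g in data.items()}
overall = {t: aggregate_rate(g) for t, g in data.items()}

for treatment in data:
    parts = ", ".join(f"{k}: {v:.1%}" for k, v in rates[treatment].items())
    print(f"{treatment}: {parts}, overall: {overall[treatment]:.1%}")

# A is better in every subgroup, yet B looks better once the groups are pooled.
a_wins_every_subgroup = all(rates["A"][k] > rates["B"][k] for k in rates["A"])
b_wins_overall = overall["B"] > overall["A"]
print(a_wins_every_subgroup, b_wins_overall)
```

The reversal appears because the subgroups are very unevenly sized: treatment A is applied mostly to the hard cases and B mostly to the easy ones, so pooling mixes the effect of the treatment with the effect of the case mix. Without knowing that context, the aggregate alone would suggest the wrong conclusion.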


Helpful applications that benefit their users through data sets that are small locally but huge overall, i.e. through the number of users, can be found in every aspect of human life. The local sets are usually small enough to be processed even by a commodity device, while leveraging knowledge from bigger interconnected sets.

Prominent examples in health care are picture recognition and statistical analysis, which help an increasing number of people to get exactly the help they need. Mindfulness and meditation apps are popular self-help tools for reducing stress and enjoying more of life. Weight control apps provide eating plans and record calories. Sleep tracking can result in deeper and more restful nights. Ovulation calendars may help a couple fulfill their wish for a child naturally. Fitness trackers help to optimize workouts and increase the athlete’s performance. And the list goes on and on.

However, all of these applications have several things in common, namely:

  • understanding of the main application subject, the necessary data, and the data size needed for results, based on existing research
  • data-tracking ability, including knowledge of which data has to be tracked and the ability to track it across devices by smartphone or IoT gear
  • users: an audience that provides data and feedback
  • feedback: feedback loops short enough for incremental improvements

Based on these aspects, the next steps would be incremental refinements of the algorithms and identifying the most helpful tactics for different groups of people, for instance by A/B testing.
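
As a rough idea of what such an A/B test could look like on a small data set, the following Python sketch compares two variants with a pooled two-proportion z-test. This is one common choice, not necessarily what any of the apps above use. The function name and the user counts are hypothetical and exist only for this illustration.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants share one success rate."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that A and B do not differ.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; the rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: users who reached their goal with tactic A vs. tactic B.
z, p_value = two_proportion_z_test(success_a=40, n_a=200, success_b=60, n_b=200)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A few hundred users per variant are enough for this kind of comparison, which fits the point that the data sets involved can be far from big. The normal approximation used here becomes unreliable for very small groups or rates close to 0% or 100%.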

„Data is the new gold“. Unfortunately, this is the reason there are basically no concrete figures on how big the data sets of these applications really are. Fortunately, most essential research is freely available and applicable. Improving your health by tracking appropriate metrics is not that hard; pen and paper or spreadsheets are viable tools if there is no app that meets your needs.

I myself was in need of something to help me control my weight, but there were no tools that satisfied my demands. After a short stint with spreadsheets I programmed something for myself, which has now been in private use for over two years with great success.
In my next post I will describe the use of existing research, data-tracking ability, and short feedback loops in a practical application: the science of tracking food (calories) and controlling weight. We will see how a tiny data set not only offered big opportunities but also generated massive change. Stay tuned!

  1. Simon Kemp. Digital in 2018: World’s internet users pass the 4 billion mark. Source: https://wearesocial.com/uk/blog/2018/01/global-digital-report-2018, visited on 11.03.2018

  2. Paul Meehl. Clinical versus Statistical Prediction

  3. Malcolm Gladwell. Blink. Penguin Books. 2006, page 134 ff

  4. Michael Lewis. The Undoing Project. Allen Lane. 2017, page 356 ff

  5. Top 500 List. Source: https://www.top500.org/, visited on 11.03.2018

  6. Glenn Greenwald. No Place to Hide. Metropolitan Books. 2014, page 98

  7. Worldwide LHC Computing Grid. Source: http://wlcg-public.web.cern.ch/about, visited on 11.03.2018

  8. Christianna Reedy. Here’s Your First Look at the Most Detailed Simulation of the Cosmos Ever Made. Source: https://futurism.com/illustris-cosmos-simulation/, visited on 11.03.2018

  9. Peter Thiel. Zero to One. Crown Publishing Group. 2014, page 149 ff

  10. Dr. Wolfgang Heuring. Why Big Data Has to Become Smart Data! Source: https://www.siemens.com/innovation/en/home/pictures-of-the-future/digitalization-and-software/from-big-data-to-smart-data-why-big-data-has-to-become-smart-data.html, visited on 11.03.2018

  11. Tricia Wang. Why Big Data Needs Thick Data. Source: https://medium.com/ethnography-matters/why-big-data-needs-thick-data-b4b3e75e3d7, visited on 11.03.2018

  12. Adrian Colyer. Can you trust the trend? Discovering Simpson’s paradoxes in social data. Source: https://blog.acolyer.org/2018/02/21/can-you-trust-the-trend-discovering-simpsons-paradoxes-in-social-data/, visited on 11.03.2018

    Christian Seyda

    Software engineer at incontext.technology working on backends for mass data. He studied computer science with a focus on data mining and cares about implementing applications that support healthy living.
