icm2re (I Changed My Mind Reviewing Everything) is an ongoing web column by Brunella Longo

This column deals with aspects of the change management processes experienced in almost any industry impacted by the digital revolution: how to select, create, gather, manage, interpret and share data and information, either because of internal and usually incremental scope - such as learning, educational and re-engineering processes - or because of external forces, like mergers and acquisitions, restructuring goals, new regulations or disruptive technologies.

The title - I Changed My Mind Reviewing Everything - is a tribute to authors and scientists from different disciplinary fields who have illuminated my understanding of intentional change and decision making processes during the last thirty years, explaining how we think - or how we think about the way we think. The logo is a bit of a divertissement, from the Latin divertere, which means to turn in separate ways.



Where there is dirt there is a system

About connecting invisible parts for data (systems) engineering

How to cite this article?
Longo, Brunella (2018). Where there is dirt there is a system. About connecting invisible parts for data (systems) engineering. icm2re [I Changed my Mind Reviewing Everything ISSN 2059-688X (Print)], 7.7 (July).


When thinking about systems, the first thing to note is that systems are complex.
Edgar Morin

London, 6 October 2018 - Behavioural economists do not like to confront their theories of heuristics and biases with the arguments provided by cultural studies. On the other hand, anthropologists, social scientists and psychologists often deal with a multitude of fragmented theories and ideas at war with each other. From my point of view - admittedly an unqualified one when it comes to joining the academic arguments of either side - there are good reasons why we should integrate the whole of the social sciences and cultural studies into the body of knowledge of systems engineers. In saying so I am afraid I have also dropped a bombshell in the place where I belong, as engineers, technicians and computer gurus do not like to be lectured with philosophical stances.

But, quoting the prolific writer and Catholic anthropologist Dame Mary Douglas, one of the sponsors of what was called the cultural theory of institutions in the 1990s, the fact is that even where there is just dirt, there is a system, as it has been decided at some point, somewhere, that something must be discarded. There can be rules without meaning - she irritatingly went on arguing for three decades.

As a matter of fact, to improve the design, reliability and predictability of innumerable data-driven applications we do have to include, connect and integrate parts that are often invisible and not easily communicated - and that nonetheless do exist and have a role in the codification, acquisition, transmission or application of knowledge.

Edgar Morin has convincingly (to me) and plainly explained in recent years that a system is, at the same time, more and less than the sum of its parts. Building on the foundation of Aristotle’s millenary assertion (that a system is more than the sum of its parts), Morin explained that a system has certain qualities that come from the organisation of the system - so that they do not actually belong to any part - but at the same time qualities and properties of single parts get lost when they are connected. Confused? Think of your privacy when you connect your smartphone to a wi-fi network, or when you use your credit card: unless you work for MI5 under cover, potentially everybody can guess where you are, what you are doing, who you are with and so on.

So what? Here comes the challenge of thinking, modelling and project managing data complexity.

Beyond and above the engineering walls

Early machine learning investigations in healthcare have shown that the relationships between our genes and the conditions we experience interacting with our surrounding environment are very complicated, to say the least. I have considered this in the context of diagnosis and treatment of chronic diseases but I think the same is true for other medical conditions and other fields.

But it seems to me there is possibly a case for being extra cautious with investments in experimental machine learning applications for the prevention and cure of diseases. This is especially true when the scope of the innovations sought consists of tackling diseases classifiable as epidemic (such as flu), socially endemic (such as obesity) or chronic (autoimmune diseases, diabetes, etc.), for reasons that go largely beyond any possibility of forecasting the combinations of two different types of information: genetic and phenotypic data.

Common seasonal infections that spread quickly in hospitals and schools can be successfully prevented by the most educated segments of the population, no matter how weak their immune system’s defences against cold and cough; the same is true for diseases heavily linked to smoking, alcoholism, malnutrition or drug addictions. And yet, if we change continent and socio-economic context, it looks like the opposite turns out to be true.

The same extreme variance exists in the potential information power of the human genome in giving scientists some safe clues about what is going on within the human body: no more than 3% of the 3 billion base pairs that constitute the human genome is made up of genes. Researchers agree that it will probably take some time before we truly understand what part the remaining 97% of our genotype plays in determining and directing our performance against diseases. But it looks like it is not a secondary one.

It is exactly because we would like to factor in these great uncertainties and achieve a better understanding of those complicated relationships between genotypes and phenotypes that a number of new, massively large databases have been created in Europe, China and the USA. Their fundamental assumption is that with a huge volume of data (impossible to process manually, whereas it can be successfully calculated and elaborated by algorithms) we can eventually cure cancer and prevent terrible diseases. Eventually.

Sound familiar? The fascination with big data or volumetric data projects continues, with all sorts of arguments favouring inference as the common methodological ground for machine learning developments. I will not return here to aspects of such tendencies I have already considered in several articles, expressing my critical, doubtful or ambivalent views. But, in short, my preference goes to developments and innovations in which change is managed, which means that pilots and prototypes on one side and production or implementation on the other are always kept separate - a configuration choice that I believe is, most of the time, safer than experimenting with people's data in real time.
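
To make the point concrete, here is a minimal sketch of what that separation could look like in practice - an explicit environment gate, with invented names and settings (it is an illustration of the principle, not a description of any particular system), that prevents an experimental model from ever touching live people's data by accident:

    import os

    # Hypothetical environment switch: "pilot" by default, "production"
    # only when explicitly set in the deployment configuration.
    ENVIRONMENT = os.environ.get("APP_ENV", "pilot")

    def run_experimental_model(dataset_name: str) -> None:
        # Run an experimental model, but only ever in the pilot environment.
        if ENVIRONMENT == "production":
            raise RuntimeError(
                "Experimental models must not run in production: "
                "use the pilot environment and a de-identified copy of the data."
            )
        print(f"Running experimental model on pilot dataset: {dataset_name}")

    run_experimental_model("synthetic_patient_records")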

A machine’s dream? Factoring in culture and emotions

While biomedical computing projects and other extraordinary machine learning ventures continue to proliferate and to make the case for new businesses not only in healthcare but also in the food and catering industry, I remain fundamentally skeptical about the idea that processing large volumes of data can - per se - reduce the uncertainties embedded in any potentially significant data connection. Indeed, the magic is always readily available to the human brain: we often imagine breakthrough discoveries for human health or wellbeing behind the curtains of any scientific investigation only because … we have imagined them.

It may happen, but is it worth the costs and the risks for human health, trust, life and human rights, volition, potential? I would not gamble on it.

The classification hint

Any data collection or database shows traces of one of two major types of classification that we apply even when we have little awareness of it. In the first type, data properties are pretty much definable formally in similar ways no matter the contents or the discipline (dates, sizes, durations, for instance), so that they can be quite easily aggregated and mashed up to some extent without totally compromising the quality of the outcome, as long as there is some level of treatment of the similarities and the differences between the data and their formats. Big data enthusiasts believe that new knowledge is created exactly because we can leverage the discovery of such previously unknown similarities or correlations. There are innumerable successful examples of this type of collection in the sector of aggregated databases for research or popular meta search engines.
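
As a minimal illustration of this first type (a sketch, with field names and records invented for the purpose), two collections that describe their contents differently can be mashed up as long as the formally similar property - a date, in this case - is normalised into a shared format:

    from datetime import datetime

    # Hypothetical records from two different sources: field names and
    # date formats differ, but the underlying property (a publication
    # date) is formally the same kind of thing.
    source_a = [{"title": "Report on X", "published": "2018-07-06"}]
    source_b = [{"name": "Study on Y", "date": "06/07/2018"}]

    def normalise(record, title_key, date_key, date_format):
        # Map a record onto a shared, minimal schema.
        return {
            "title": record[title_key],
            "date": datetime.strptime(record[date_key], date_format).date(),
        }

    # Aggregate the two collections into one list, sorted by date.
    aggregated = sorted(
        [normalise(r, "title", "published", "%Y-%m-%d") for r in source_a]
        + [normalise(r, "name", "date", "%d/%m/%Y") for r in source_b],
        key=lambda r: r["date"],
    )

    print(aggregated)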

Then there is another type of classification, in which the data are not at all homogeneous. It may therefore not be at all easy for a machine (nor is it easy for a human being) to recognise that the contents consist of objects that share some hidden characteristic, often according to a specific domain knowledge or structure, or to invisible and undefinable criteria.

Many years ago, when I was asked to teach librarians and information specialists how to search databases, I used to show the difference between these two very distant types of classification using the example of homonyms. Homonyms were a true nightmare for information retrieval systems only twenty years ago. They still puzzle modern generalist search engines these days, but on a much smaller scale. There was a way to sort out the confusion: looking at the structure and syntax of the phrases, the grammar and the punctuation. Match, for example, is a noun with different meanings but it is also a verb. Knowing or understanding that such linguistic differences create hugely different possible aggregations of meanings is what allows us to make logical inferences that are substantially correct, culturally acceptable and productive for the purpose of investigating full-text databases. My nephew, 8 years old, jumped to the same conclusions doing his science homework about amber when Google returned pages about the former Home Secretary Amber Rudd, a person instead of a material!
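
In the spirit of that old retrieval trick, here is a deliberately naive sketch (the rule and the example sentences are invented for illustration) of how looking at the surrounding syntax can separate the two senses of the homonym:

    # Deliberately naive rule: if "match" follows a determiner, treat it
    # as a noun (an object); otherwise treat it as a verb (to correspond).
    DETERMINERS = {"a", "an", "the", "this", "that", "every", "no", "my"}

    def guess_match_sense(sentence: str) -> str:
        words = sentence.lower().replace(".", "").replace(",", "").split()
        if "match" not in words:
            return "no occurrence"
        i = words.index("match")
        if i > 0 and words[i - 1] in DETERMINERS:
            return "noun (a match to light, a sporting match...)"
        return "verb (to correspond, to pair with...)"

    print(guess_match_sense("She lit a match to start the fire."))    # noun
    print(guess_match_sense("These results match our predictions."))  # verb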

Any special language that defines its objects, classes or clusters according to its own internal structural, conceptual or operational criteria puzzles or surprises the non-specialist in unpredictable ways. There are classification criteria that are not immediately evident, that may not make any sense at all in other contexts, sectors or problem domains, but that we seem equipped to understand in the blink of an eye without any training.

Logical inference starts from what we already know

Most of the time, unknown matters pose categorisation problems that can be automated only at a later stage, when numerous objects can eventually be named or counted according to some widespread logic that, at first, was spotted only as a possible common characteristic among a few items. Logical inferences always start with classes already formed (again, I am quoting here Mary Douglas, even if I am quite disappointed to have to agree with her more than with Satya Nadella!).

There are of course areas of scientific investigation that prove the value of serendipity, subjects in which the pure exercise of aggregating enormous amounts of data in some way is not only useful for scientific advances but also has an inner methodological value for the entire world of knowledge: we frequently discover new meanings, new phenomena, new ways of processing data, or we invent new tools while we are busy doing something else.

A sector in which big data seem very promising in this respect is astronomy - see for instance the amazing case of the citizen science group called Planet Hunters, which, simply by aggregating the way in which 300,000 volunteers classified data from NASA’s Kepler spacecraft, helped make very interesting discoveries.
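
The aggregation logic behind such projects can be surprisingly simple. The following is only a toy sketch (the labels and star identifiers are invented here; it is not the actual Planet Hunters pipeline): each object receives many independent volunteer judgements, and a plain majority vote turns them into a single candidate classification.

    from collections import Counter

    # Hypothetical volunteer classifications of light curves.
    classifications = {
        "star_001": ["transit", "transit", "noise", "transit"],
        "star_002": ["noise", "noise", "variable", "noise"],
    }

    for star, votes in classifications.items():
        label, count = Counter(votes).most_common(1)[0]
        agreement = count / len(votes)
        print(f"{star}: {label} ({agreement:.0%} agreement)")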

But most of the time we need predictable and reliable classifications that can be acted upon either by human beings or by machines in order to make constructive steps forward. We need an acceptable, relatively high level of precision, one that satisfies some power (distribution) rule, before an innovation can really take off in the real world.

This is the fundamental reason why I am often not at all impressed by the argument that massive volumes of data, gathered and crawled involving thousands or millions of individuals, should produce new knowledge or insight, particularly for the cure of chronic diseases or cancer, just because we are able to connect everything to everything else.

Medical evidence from patients' data will confirm what we already know in respect of plant-based diets, for instance. Aggregating thousands or millions of data points about the family life of people with dementia seems very unlikely to lead to a cure just because we are digging into massive volumes of personal stories.

Conversely, what fascinates me is the possibility of engineering small datasets with a very precise, specific and limited scope and a deep degree of detail, up to the point that operations on the data could be safely automated and possibly originate a model or some procedural insight. This could in turn be tested, reused, applied or replicated across different sectors to solve similar problems, abandoning aspirations of large scale applications but maintaining very ambitious purposes.

Factoring in culture and emotions could be the key challenge that data engineering has to face in a number of areas where opportunism, policies and politics, human computer interactions and market pressures get in the way and dictate rules on how to shape and manage data modelling and data processing. But just because we can, it does not mean it makes sense.

Understanding how cultural structures and groups, or societal pressures and individual emotions like fear of a deadly disease or compassion for a pet, can change our minds and directions in innumerable areas, and particularly promote healthier behaviours, may become not only the mission of charitable organisations and political campaigners but also the operational goal of data (systems) engineers.