icm2:re (I Changed My Mind Reviewing Everything) is an ongoing web column by Brunella Longo

This column deals with aspects of the change management processes experienced in almost any industry impacted by the digital revolution: how to select, create, gather, manage, interpret and share data and information, whether driven by internal and usually incremental goals - such as learning, educational and re-engineering processes - or by external forces, like mergers and acquisitions, restructuring goals, new regulations or disruptive technologies.

The title - I Changed My Mind Reviewing Everything - is a tribute to authors and scientists from different disciplinary fields who have illuminated my understanding of intentional change and decision making processes over the last thirty years, explaining how we think - or how we think about the way we think. The logo is a bit of a divertissement, from the Latin divertere, which means to turn in separate ways.


Decluttering machine learning through accuracy

About changing perceptions and achieving global consensus on data quality

How to cite this article?
Longo, Brunella (2017). Decluttering machine learning through accuracy. About changing perceptions and achieving global consensus on data quality. icm2re [I Changed my Mind Reviewing Everything ISSN 2059-688X (Print)], 6.4 (April).


London, 15 November 2017 - This year (2017) has seen a considerable increase in studies, projects and conferences calling for more quality in data production and management across all business sectors.

There has also been a considerable rise in data errors spread widely in the public domain, due to mismanagement of either privacy or calculations, causing political or diplomatic turmoil - for instance, the UK Home Office had to apologise during the summer for wrong “removal” letters sent to about a hundred EU nationals.

The academic world was shaken as well by researchers from the University of Chicago asking for a review of the conventional p-value threshold for statistical significance: since in a number of disciplines the reproducibility of results claimed in published research has dropped dramatically, the threshold, they argue, should be tightened accordingly.
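To make the point concrete, here is a minimal sketch - with invented, purely illustrative p-values - of what tightening the conventional threshold would mean in practice for a batch of published results:

```python
# Hypothetical p-values from a batch of experiments (illustrative only).
p_values = [0.049, 0.012, 0.004, 0.0009, 0.031]

CONVENTIONAL = 0.05   # the long-standing default threshold
PROPOSED = 0.005      # the stricter threshold advocated in the 2017 debate

significant_old = [p for p in p_values if p < CONVENTIONAL]
significant_new = [p for p in p_values if p < PROPOSED]

print(f"significant at {CONVENTIONAL}: {len(significant_old)} of {len(p_values)}")
print(f"significant at {PROPOSED}: {len(significant_new)} of {len(p_values)}")
```

Under the stricter threshold, three of the five hypothetical results would no longer count as significant - which is precisely why the proposal unsettled so many fields.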

The list of management, academic, political and judicial errors caused by inaccurate or wrong data grows even longer if we turn to the cybercrime scene, where in 2017 public opinion had a taste of the systemic disruption that can be caused by ransomware and other socially engineered attacks (in which techniques like phishing, impersonation and falsification are used with the explicit purpose of inducing errors of judgement and distortion of resources).

Does it matter?

I have already written about accuracy many times - most recently in I am not Karen. Every time, one first has to cross the Rubicon of a certain confusion between accuracy and quality. The former is something that, in a given domain, can be applied, assured and measured by human beings and machines (software). The latter can only be valued by human beings and, with a certain degree of uncertainty, only inferred by machines.

I do believe that accuracy is one of the essential features of data and information we should all look after in the digital economy, although it is true that some individuals have a much more developed aptitude for it than others - up to the point that it constitutes a measurable kind of intelligence.

The technicalities of this amazing job of data mining and knowledge construction and organisation are still under development, while at the same time being subject to an increasing level of automation that is, per se, challenge enough for business plans. Concerns raised for decades - if not centuries - about the need for accuracy by several professional categories (including librarians, software programmers, indexers, actuaries, editors and publishers, to name a few) are easily dismissed by those who believe that the quality available at their fingertips is “good enough”, only to immediately move the goalposts when something unexpectedly affects the perceived quality of their own products, services, performances or results.

In fact, all the operations we perform on all sorts of data can be carried out without paying any attention to the context in which (and the purpose for which) the data were produced in the first place and are then further treated, manipulated, edited, distributed and so on. This is why, for instance, in spite of having robust mathematical models in place and exceptional key performance indicators on their analytics dashboards, utility companies still fail dramatically to prevent environmental disasters and predict incidents affecting their assets. And yet predictive maintenance could save enormous amounts of money.

The sort of context blindness I am talking about is perhaps at the heart of the accuracy problem, and it can even coexist with high levels of comprehensiveness, huge volumes of data and great depth of detail - in sum, with situations in which we have the impression of absolute reliability and trustworthiness. The point is that reliability and trustworthiness are in our mind, and it is extremely rare to establish what they consist of universally, on a globally recognised scale, whereas accuracy in its minimal terms always consists of truth-values (true / false) that can be measured according to basic logical, mathematical or technical rules. For instance, in information retrieval, as long as we define the linguistic perimeter of our investigations, accuracy can be measured by retrieval (and it can affect retrieval as well), whereas in a search engine we tend to appreciate the perceived quality of the relevance ranking (more on this in a forthcoming issue of icm2re). These are the sorts of arguments, pro accuracy or pro quality, that we can discuss if we ask ourselves ‘does it matter?’.
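As a toy illustration of accuracy as truth-values in information retrieval: once we have decided, document by document, whether each retrieved item is relevant (true) or not (false), the standard precision and recall measures follow mechanically. The document identifiers below are invented for the example:

```python
def retrieval_metrics(retrieved, relevant):
    """Truth-value measures for a single query: each retrieved
    document is either relevant (true) or not (false)."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = retrieved & relevant
    precision = len(true_positives) / len(retrieved)  # how much of what we got is right
    recall = len(true_positives) / len(relevant)      # how much of what is right we got
    return precision, recall

# Hypothetical query: the system returns 4 documents, 3 of which
# are among the 5 actually relevant ones.
p, r = retrieval_metrics(
    retrieved={"d1", "d2", "d3", "d7"},
    relevant={"d1", "d2", "d3", "d5", "d9"},
)
print(f"precision={p:.2f} recall={r:.2f}")
```

The point of the sketch is that these numbers rest on nothing more than true/false judgements within a defined perimeter - unlike the perceived quality of a relevance ranking, which has no such universal yardstick.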

If machine learning experts and software engineers want to make their efforts more productive and focus on more reliable ways to automate data processing, they have to deal with their own perceptual change and start defining and measuring the accuracy of their own artefacts. That means taking into account the extra-text: what does not necessarily belong to or fit into (or what third parties can infer from) the datasets at their disposal.

Can they say whether the accuracy issue is considered a matter of professional deontology and core responsibility for anybody working with data at any level? I have made a change request in this precise direction for the development of SFIA version 7 (SFIA is the Skills Framework for the Information Age, adopted by information professionals in dozens of countries and managed internationally by a non-profit foundation that keeps the framework up to date through consultation processes). At present (within SFIA 5), there is no explicit mention of accuracy as a basic or “core” competence, not even in respect of skills for which it is traditionally considered a very critical success factor, such as requirements definition and management, quality assurance audit, or information and content publishing and authoring. Where does the accuracy issue stand within professional bodies of knowledge, best practices and guidelines inspired by the SFIA framework, or within computer science or engineering curricula all over the world? We often assume that accuracy comes with, or is embedded in, a certain level of literacy and numeracy, but the reality is that there are exceptional minds and Nobel prize winners who do not have this type of intelligence at all and, vice versa, individuals with fantastic attention to detail who are not necessarily brilliant mathematicians!

At the global or industry level we still have to cross the Rubicon and recognise that accuracy and quality are two different things. Let’s add accuracy as a core competence for our information age, alongside autonomy, business skills, influence and complexity.

Awareness first, responsibility follows

I asked an academic colleague who has recently investigated applications of machine learning in the automotive sector (driverless cars and the like) why the information technology industry seems incapable of moving forward on its own learning curve about the disadvantages of being extremely short-term oriented, superficial and opportunistic when organising data to extract knowledge and insight.

We tend to be content with “good enough” data, “now”, even when we know that at a later time, with another scope or a greater level of automation, either the properties or the granularity of that data can cause confusion, misunderstandings and errors of judgement, up to business disruptions and personal injuries. We tend to voraciously embrace disruptive ideas without engineering them for more sustainable, medium or long term benefits - and, again, in the IT skills landscape internationally agreed through the SFIA framework, the attention to sustainability assessment and management does not extend to what should be the primary object of that assessment and management.

Indeed, it may be the same question that stirs debates about the end of capitalism among economists, politicians and social scientists, but I definitely wanted to focus on an ideology-free approach to the essence of our data management capabilities.

The conversation that followed underlined the fact that awareness is often not enough. Simply that - I have already touched on this point in icm2re 2.2 (No, cyber security training does not solve the problem. What is really missing in the fight against cyber crime?). We have not yet matured the will and the political stance to deal with digital markets in all their complexity. Digital markets have been dominated by financial forces and by a very superficial appraisal of what new technologies can bring to the whole of the economy. They have shown an anxious and wild tendency to monetise knowledge that does not actually exist, and an urgency to confirm assumptions and easy discovery routes that please, most of the time, project sponsors, here and now, more than unknown potential future customers, users and employees.

We do need to invent, spread and learn data engineering practices that blend technical capability with a wide understanding of the social, cultural and ethical dimensions. We need a mindset able to approach the design and decluttering of algorithms which, unfortunately, have been produced for over three decades with a terribly childish and short-term oriented approach (they tend to embed and mingle all of our clichés and biases with no consideration of consequences, others' interests or potential abuses).

Craig Venter discovered new species of bacteria simply by applying the idea of a concept challenge to statistical models in gene sequencing - in very simple terms, he applied the idea of “what… if” to large amounts of previously unanalysed data. Accuracy was not at all among his concerns. And yet it would be very wrong to say that such serendipitous luck is what usually, normally leads to new knowledge discoveries. Most of the time, discoveries in science and new knowledge achievements are the result of the dull honing of a certain skill and the repetition of the same procedures, over and over again, for ages. Serendipity is the exception, not the rule, in the history of science: it occurs in tales of accidental discoveries that do not sufficiently emphasise the underlying, complex and systematic journey of intentional research, where accuracy in handling data - I would say care in dealing with data - is indeed the critical success factor, and it is planned and managed as such.

We concluded the conversation considering that alternative facts, fake evidence, data falsification and the whole world of media trends show that a certain level of systematic lack of accuracy has become endemic! The financial and communications markets often seem to profit from such systematic lack of accuracy as well - which contradicts any formal call or demand for it, except when complaining about “data quality” once things go wrong or disappointing results must be addressed. Perhaps we are ahead of the hype cycle.

What’s next

How can we possibly incentivise, nudge and direct people towards a change of perception, and increase their will to see the other side of the coin - the half-full or half-empty part of the glass - in respect of data accuracy?

One possibility consists in asking people to focus not yet on generic and abstract matters of quality - or on wider, non-ambiguous and clean datasets, or more open or more shared data - but “just” on some simple and measurable properties that, in a particular context, do constitute accurate data that we can all perceive, agree on and measure as such.

Let’s forget the serendipity myth for a while.

We have known since the 1990s that placing certain products together on supermarket shelves increases consumption. But having spotted such an immensely useful pattern does not mean we understand why it really happens, nor that we will be able to explain why, at a certain point in time, that association no longer works. The same goes for innumerable scientific articles or policy stances that heavily relied on inferences or “good enough” data in the first place.
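The pattern itself is easy to quantify. Here is a minimal sketch, with invented baskets, of the standard support, confidence and lift measures used in market basket analysis - numbers that capture the association precisely without explaining it at all:

```python
# Hypothetical supermarket baskets (illustrative data only).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of baskets containing every item in the set."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, how many also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """How much more often the two co-occur than if they were independent."""
    return confidence(antecedent, consequent) / support(consequent)

a, c = {"bread"}, {"butter"}
print(f"support={support(a | c):.2f}")
print(f"confidence={confidence(a, c):.2f}")
print(f"lift={lift(a, c):.2f}")
```

The measures are accurate in the minimal, truth-value sense discussed above - each basket either contains the items or it does not - yet none of them says a word about causation, which is exactly the column’s point.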

It is indeed through a new way of raising the bar for understanding and demonstrating causation, via simple measurements and truth-values, that people of all abilities and from all walks of life can change perceptions, judgements and behavioural patterns where data accuracy is absolutely critical for sense making.

Let’s make a connection between our aptitudes and data accuracy.

Perhaps machine learning developers should aim at this level of maturity in scoping, designing and then assessing the accuracy of algorithms too.