No Magic AI: Humans still very much in the loop at PyData Cambridge

Last month, FeedStock sponsored the first ever PyData convention in the UK outside of London. The two-day conference was a gathering of some of the industry’s top data science practitioners, exploring the complex realities, broad skillsets and domain expertise behind the latest ‘AI Wizardry’.

What is PyData?

PyData is an international community of developers and users of open source data tools. Their goal is to create an environment to collaborate, share ideas and promote open source ideals through a series of local meetups and larger conferences.

The international network is organised with support from NumFOCUS, a non-profit group based in Austin, Texas (founded 2012). Conferences in Silicon Valley, Boston, NYC and London are staples on the organisation’s calendar. A new entry this November was PyData Cambridge, held in a city rich in academic history that many claim to be the birthplace of modern computing.

Day 1 – Amateurs talk algorithms, professionals talk data

PyData is a broad church with practitioners coming from all sorts of backgrounds. There was no theme for the event and the line-up was clearly curated to provide something for everyone. However, despite the breadth of topics, recurring issues kept bubbling to the surface until the unofficial theme was shouting loud and clear – Data is King. While algorithms take the headlines, it’s the other 80%, the grim and dirty work of preparation, interpretation, and analysis of the data they feed on that creates a competitive advantage.

First up in the lecture room was Ekaterina Kochmar, Senior Research Associate at Cambridge University and CSO at Korbit AI, presenting her keynote on recent advances in Natural Language Processing (NLP).

The morning progressed with talks by data scientists at leading businesses and other institutions, covering Data Interpretability (Raphael Meudec of Sicara), Machine Learning Workflow (Philip Goddard of Kindred Group) and Transaction Data (Matthew Sattler, Global Head of Data Science at HSBC). Jacob Montiel won the prize for furthest travelled, coming from the University of Waikato, New Zealand, to share his work on Infinite Data Streams.

The importance of…

The morning’s sessions all emphasised the necessity of human judgement when managing data sets; but just as important is the need to stay humble and recognise just how wrong we can all be. Building an environment to acknowledge and catch these inevitable human lapses is key.

The first session of the afternoon started with the PyData London veteran Ian Ozsvald, author of the renowned book “High Performance Python: Practical Performant Programming for Humans”, presenting his Higher Performance Python insights.

The next talk was a huge surprise. I’d already seen a few Cambridge Spark alumni, but I didn’t think I’d be hearing from one at the podium. Former classmate Davide Sarra and his colleague Kishan Manani, from luxury online retailer Farfetch, shared gritty and practical tips from the trenches of machine learning. Well done, Davide!

Finally, Kirstie Whitaker of The Alan Turing Institute promoted the importance of reproducible, inclusive and collaborative Data Science. Though aimed more at the academic community, the lessons were a guide to best practice for the entire industry.

The audience was left in no doubt that data science, when conducted with responsibility and care, has a tremendous capacity for good in the world.

Day 2 – Transfer learning

Despite the previous evening’s festivities, which involved a black-tie college dinner at Corpus Christi, Sunday’s first talk was surprisingly well attended. I’m not sure how many operational drillers were in the audience, but David Fraser Halliday of Schlumberger (the oilfield services company) addressed strategies such as orchestrating data science workflows with Google Cloud Storage and BigQuery, approaches that can be transferred to other industries.

Another cool application discussed by Jaymin Mistry, a Data Scientist at PA Consulting, was using NLP to improve Theatre Utilisation in hospitals. It’s fascinating finding out about the nuances involved in collecting data from doctors’ notes.

Fun fact…

Of the doctors’ notes analysed, there were 11 different words used to represent ‘knee’, ranging from misspellings to Latin terminology.
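To give a feel for what taming that kind of vocabulary involves, here is a minimal sketch of my own, not the approach shown in the talk. It assumes a small hand-built synonym table (the entries in `SYNONYMS` are hypothetical) and uses fuzzy string matching from Python’s standard library to catch misspellings.

```python
from difflib import get_close_matches

# Hypothetical lookup table: every known variant maps to one canonical label.
# A real table would be built with clinicians, not guessed by an analyst.
SYNONYMS = {"knee": "knee", "genu": "knee"}


def normalise_term(token, cutoff=0.75):
    """Return the canonical label for a raw token, or None if nothing matches."""
    word = token.strip().lower()
    if word in SYNONYMS:
        return SYNONYMS[word]
    # Fall back to fuzzy matching so that misspellings are still caught.
    close = get_close_matches(word, list(SYNONYMS), n=1, cutoff=cutoff)
    return SYNONYMS[close[0]] if close else None
```

With this table, `normalise_term("Knnee")` and `normalise_term("genu")` both resolve to the same label, while an unrelated word such as `"elbow"` returns `None`. The cutoff is a judgement call: set it too low and different body parts start to merge.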

In a study to predict which heart patients to admit and which to send home, a history of asthma was interpreted as a sign that a patient complaining of heart pain would eventually be fine and could be sent home. So historically, asthma almost guaranteed you’d be OK. What the data did not explicitly say is that patients with asthma had (very sensibly) always been rushed into the best possible care straight away, giving those patients fantastic survival odds.
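The trap is easier to see with numbers. The sketch below is my own illustration, and every count in it is invented rather than taken from the study. It assumes we also know which level of care each patient received, which is exactly the information the naive analysis ignores.

```python
# (has_asthma, care_level) -> (patients, survivors).
# All counts are invented purely for illustration.
COUNTS = {
    (True, "intensive"): (90, 81),
    (True, "standard"): (10, 5),
    (False, "intensive"): (20, 19),
    (False, "standard"): (80, 56),
}


def survival_rate(has_asthma, care_level=None):
    """Survival rate for a group, optionally restricted to one care level."""
    patients = survivors = 0
    for (asthma, care), (n, s) in COUNTS.items():
        if asthma == has_asthma and care_level in (None, care):
            patients += n
            survivors += s
    return survivors / patients
```

Ignoring care level, `survival_rate(True)` comes out higher than `survival_rate(False)`, so asthma looks protective. Compare like with like, for example `survival_rate(True, "intensive")` against `survival_rate(False, "intensive")`, and the asthma group does worse at both levels of care. The apparent benefit came from who got rushed into intensive care, not from the asthma.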

I’d encountered similar concepts before in the study of behavioural finance, which looks at how humans tend to make less than optimal financial decisions in the real world. The same identification of biases and human blind spots is just as relevant to the Data Scientist. For example, it’s a known phenomenon that Hedge Fund indexes are heavily biased upwards, as the worst-performing Hedge Funds blow up, drop out or otherwise disappear, leaving only the track records of the winners.

Clean, engineered and complete data sets alone are not enough. Intelligent use of data, sensibly related back to a rational use case, is key. Common sense, I was once told, is not all that common.
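A toy example shows how large the distortion can be. This is a sketch under my own assumptions: the returns are made up, and a real index would weight funds and handle closure dates far more carefully.

```python
from statistics import mean

# (annual return in %, still reporting at year end); all figures are invented.
FUNDS = [
    (12.0, True),
    (8.0, True),
    (5.0, True),
    (-30.0, False),  # blew up during the year
    (-45.0, False),  # closed and stopped reporting
]


def index_return(funds, survivors_only):
    """Equal-weighted average return, optionally counting survivors only."""
    returns = [r for r, alive in funds if alive or not survivors_only]
    return mean(returns)
```

Here `index_return(FUNDS, survivors_only=True)` is comfortably positive, while `index_return(FUNDS, survivors_only=False)` is negative. Same funds, same year; the only difference is whether the failures are allowed to stay in the data.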

All too human

The final talk, by Prof Kenneth Harris of UCL and entitled “How Does the Brain Work”, was an epic wrap-up. One of the reasons attending conferences in person is so important is that the enthusiasm of the speaker makes an impression in a way that’s hard to match on YouTube. It was a fast-moving jump from observing a single brain cell firing at the mention of the “Simpsons” to a video of 100 neurons being wired together in a paraplegic patient’s brain to move a robotic arm at will. Although it was late in the day (and followed a stretch of mental gymnastics which was a little beyond me), the message was clear: not only is the brain by far the most complex computer we know of, but current forms of what we refer to as AI or computer intelligence are not in direct competition with it; they are on entirely different evolutionary paths. Although machines outperform humans in a rapidly expanding number of tasks, humans are, and will remain for a long time to come, the critical component of effective data science solutions.

Was it worth it all?

Without question, PyData was a great success. Feedback from those I spoke to and my own experience confirmed that the knowledge and insights shared will be extremely useful. From a data analyst’s perspective, just the information about overcoming psychological biases in data collection could have a direct impact on our growing data factory at FeedStock.

From a company perspective, two very enthusiastic staff members, ready to share their experience, turned up to work on Monday. FeedStock has a generous programme of professional development for every employee, including time and funds to attend specialist conferences and workshops. For a team who are passionate about what they do, this has an immediate benefit for the whole enterprise.

Exciting times ahead

Both Cambridge Spark and FeedStock have grown exponentially in the last couple of years and it’s been exciting to witness part of the journey. I’m delighted to be part of a company that supports the ideals and philosophies of the organisations, charities and educational foundations which put on this event. Although there’s clearly a lot of hype surrounding data science and ‘AI’ in general, PyData Cambridge demonstrated the breadth and quality of talent applying data science on a massive scale to create real business value.

To every future PyData attendee out there – expect more to come.

Author: Mike Smith, Data Analyst at FeedStock

FeedStock’s Data Science team meets PyData Cambridge