There’s an ‘us and them’ mentality between software engineering and data science.
The landscape has changed in recent years: data scientists are no longer expected simply to communicate insights via a monthly PowerPoint presentation; they need to deploy those insights to live web applications. As a result, the distinction between the two disciplines is breaking down.
Recently, I was the data science liaison on a project with one of the Big Four banks. The bank had assigned a small team of data scientists to assist us with the project. As a data scientist with plenty of experience deploying solutions, I’m a competent software engineer. Even so, there were plenty of lessons learned about how to manage the interfacing of two teams to create a final, beautiful product.
The project in an ideal world
Generally, on a big software project there are software engineers, data scientists, and user experience specialists who do not know how the solution gets onto the internet and hooks together. This is where the project lead and the infrastructure (deployment) lead come in: they understand how deployment of the solution will work.
In an ideal world they’d all have access to huge sets of clean, verified, raw data ready to drop into Amazon Web Services (AWS). They’d have an expert in the company’s domain available to help them understand the nuances of the problem and refine the data into highly correlated features.
They’d spin up a bunch of enormous machines to model the petabytes of data in parallel. They’d use AWS SageMaker containers from the very beginning, so the final model artifacts would deploy continuously to the same automatically scaling containers without a hitch. The business requirement would involve taking small data payloads and returning even smaller responses. There would be no row-level heuristics that couldn’t be expressed as a feature.
Aaaaand life would be grand: the butterflies would flutter lightheartedly around Bambi as he pranced through the forest.
The project in the real world
The reality is that the majority of the ‘data science’ community has a limited understanding of software development methodologies. Usually, they are producing one-off analyses for strategic, decision-making PowerPoint presentations rather than the real-time, high-throughput models required to serve an online user.
Many have a set of tools that works for them, so it becomes a struggle to understand why another set of tools may be more deployable. Because deployment is such an abstract concept in data science, it’s hard to justify why time which could be spent understanding a problem should be spent learning a tool that replaces an existing one.
The result is that they’ve put a lot of hard work into something that only works on their machine. This is not practical for anyone or anything.
How to reconcile the two worlds
Projects that are novel innovations (rather than reboots, rewrites or refreshment projects) are great because there are no legacy systems to drag forward. You have the independence to design things right from the get-go. But while, as a consultant, you can sometimes avoid working with legacy systems, there will always be some level of legacy thinking that requires careful team augmentation.
When you’re a software engineer working with a team of data scientists, there are a few steps you can take to ensure the data scientists don’t feel alienated and bogged down by a bunch of processes and requirements they have never needed before, while still ensuring repeatability and reliability in a production deployment.
1. Educate and be educated
Having a consultant blow-in and start preaching does not work for everyone. The fact is, there is a lot to learn from most clients about the domain they work in. There’s also a lot to teach.
Most of these old-school researchers really want to see their work being used at scale. It’s much more satisfying than a passing mention in a PowerPoint presentation.
So discuss the mechanics of the cloud, its limitations, and its features. Learn about their tool-set preferences and find a compromise. Be clear about the ramifications of the decisions you settle on together. Those decisions will almost certainly rule out the seamless scalability and deployment that come with SageMaker, so you’ll need a team of people to cover that ground instead. Make it clear what those choices will cost.
2. Have a clear division of responsibilities
Spend time talking. Always have patience for the basics.
Modelling isn’t like regular software development, but it is software and it needs to be developed. You can’t take a design document and write tests that enforce deterministic behaviour. Often you don’t see incremental progress but rather long lulls followed by breakthroughs. Many software development managers don’t have the skills or tools to deal with this.
Define the relationship between the data science team and those deploying their solutions the same way you’d define an API. Agree on a contract that allows you to work together. This starts with a commitment from the modelling team about what they can produce: whether something can be produced from the data, and what format it can be produced in, should be the key focus.
Once that commitment is established, use automated build processes to hold the modelling team to account (we’ll come back to this).
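As a sketch of what that contract might look like in practice (the field names and payload shape here are hypothetical, not from any real project), the deployment team can pin the agreement down as a schema check that runs on every build, so a model that drifts from the contract fails loudly rather than silently:

```python
# A hypothetical "contract" between the modelling and deployment teams:
# the model must accept a small payload and return a response in an agreed
# shape. All field names here are illustrative assumptions.

AGREED_INPUT_FIELDS = {"customer_id", "features"}
AGREED_OUTPUT_FIELDS = {"customer_id", "score"}

def validate_response(payload: dict, response: dict) -> None:
    """Fail the build if the model breaks the agreed interface."""
    missing = AGREED_OUTPUT_FIELDS - response.keys()
    assert not missing, f"response is missing agreed fields: {missing}"
    assert response["customer_id"] == payload["customer_id"]
    assert 0.0 <= response["score"] <= 1.0, "score must be a probability"

# A stand-in model that honours the contract, used to exercise the check.
def dummy_model(payload: dict) -> dict:
    return {"customer_id": payload["customer_id"], "score": 0.5}

payload = {"customer_id": 42, "features": [1, 2, 3]}
validate_response(payload, dummy_model(payload))
```

The point is not the check itself but where it lives: in the shared build, where both teams see it break.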
3. Accountability and management
Depending on your background and who you work with, this may sound crazy, but you have to teach your data scientists version control. Even if they say they already use it. Get everyone in a room. Open an issue, get them to create a branch and open a pull request (PR). Finally, get them to check the status of the build that ran on the PR’s commit.
Be very careful at this point: it’s easy to lose people. Build the lesson on a dummy example so that everything works; under no circumstances use an existing code-base. They have to see the process work. They have to feel the liberation that comes from being able to share functional code across the team. It’s worth it.
The alternative is that you get an email once a week with an ‘update to the model’ and an ambiguous description of how good it is, and you spend the next week trying to make it work. The modelling team gets more and more disconnected from the production environment, so more incompatibilities arise. You spend all your time hacking together terrible solutions that perform badly and are unmaintainable. And the best one: you’re responsible when a model doesn’t behave the same way it did on the author’s Windows 7 desktop.
4. Don’t be too fussy
Once you have the modelling team enjoying the version control workflow, relax. Help them get the builds to pass, and explain why they broke. At this point the work is easy, and everyone is enjoying the chance to understand how the production hardware fits together.
Start adding tests and encourage them to do the same. Show them how the tests are run and why a breaking build or test is a great thing.
Only test end-to-end functionality; don’t try to be too thorough with your tests. They’re already struggling with the rules, so just make sure the code does what you need it to do. Be generous during code reviews. They will enjoy doing them, but make reviews about functionality, not style or maintainability. Never do anything that risks losing their motivation for the process.
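A minimal end-to-end test in this spirit might look like the following. The `predict` function and its payload are hypothetical stand-ins for the team’s actual model entry point; the test calls the model the way production will and checks only the behaviour production depends on, never the internals:

```python
# A sketch of an end-to-end test: exercise the model through its public
# entry point and assert only on the output production needs.
# `predict` and its payload shape are hypothetical stand-ins.

def predict(payload: dict) -> dict:
    """Stand-in for the modelling team's entry point."""
    return {"score": min(1.0, 0.1 * len(payload["features"]))}

def test_predict_end_to_end():
    response = predict({"features": [0.2, 0.4, 0.6]})
    # We care that it returns a usable score, not how it was computed.
    assert "score" in response
    assert 0.0 <= response["score"] <= 1.0

test_predict_end_to_end()
```

Notice what the test does not do: it never inspects coefficients, intermediate features, or random seeds. That keeps the modelling team free to change their approach without breaking the build, as long as the contract holds.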
We must find ways of achieving high standards in scalable solution deployment while also fostering the growth of software development skills in those of us with varied backgrounds. At Kablamo, the majority of our projects are novel innovations for our clients rather than reboots, rewrites or refreshment projects.
We’ve made a name for ourselves as being the team that delivers insight and vision for aspirational projects where the pathway to success is murky. It is the focused development of skills within a multidisciplinary team that puts us in this position.