Guest blog - Data science and data governance: collaboration for trustworthy data insights

Data scientists are storytellers. They gather data from a variety of sources, clean and combine that data, and use their programming, math, statistical and analytical skills to interpret it. Data scientists can help businesses understand customers and market trends, forecast sales, improve processes, and make better, smarter, faster decisions. But data-driven results depend on good, reliable data. Advanced data models and algorithms are only as good as the data they are applied to. Without good quality data, even the best models and algorithms fail.

Strong Data Governance policies and practices ensure valid, quality data and thus ensure that data analytics and data science methods can arrive at meaningful and trustworthy conclusions. Collaboration is key for the best results.

As a data scientist, you can expect to spend up to 80% of your time cleaning, transforming, and checking your data. A discovery and understanding stage is important, but it should not be so prolonged. If the data scientist is confident that the data has been verified by the business and that it is consistent and compliant with regulations, then they can focus on bringing out the stories within the data rather than double-checking its content. Data governance plays the key role of validity checking, preventing the confusion or misunderstanding of data that leads to meaningless data science results.

What can a data scientist do for your business? Visualisations can be produced to understand and extract insights from data about the business and its customers. Machine learning (ML) models can be developed to continuously capture insights to help make more informed business decisions. Past performance can be studied and predictions made.

One example of the use of ML is the work I carried out for a research hospital in Rome to understand public opinion. Vaccine hesitancy was identified as one of the top threats to global health by the World Health Organization (WHO) in 2019, with the growth of online communications and misinformation about vaccines an increasing area of disquiet. The hospital wanted to understand people’s stance on the subject of vaccination. Concerns had been raised about the low maternal vaccine uptake in Italy. Social media is increasingly being used to express opinions and attitudes, so we decided to use Twitter as our source of data. I trained and fine-tuned a natural language processing machine learning model to classify the vaccine stance (promotional, neutral, or discouraging towards vaccines) of Italian tweets. This is now used on a web platform for medical professionals and policy makers to monitor vaccine stance in almost real-time.

Another example of applying data science to a business is the work I have done for a local gin distillery, analysing sales data to predict future sales and profits. Data visualisations and predictions were an important part of the business plan for explaining the business and its potential to investors.

Data governance can deliver high-quality, trusted, and compliant data. Data science can deliver insights into that data. Collaboration between data governance and data science professionals increases the level of certainty in the results, models and predictions so you can make the data-driven decisions for your business with confidence.


The author: Susan Cheatham is an independent consultant in data science and data mentoring. She gained her technical data skills and physics PhD from the European Organization for Nuclear Research (CERN). She enjoys communicating and sharing knowledge.


What is Data Wrangling?

If you are a regular follower of my videos and blogs, you will know that one of my key aims is to help explain the vast - and sometimes confusing – amount of terminology that is found within Data Governance.

Often things have different meanings depending on the organisation you work within or can even vary from person-to-person, which is why I want to say first and foremost: there is no such thing as a stupid question! The person who sent me today's question actually apologised for asking it but I'm a great believer that there should be no such thing as a stupid question when it comes to Data Governance.

If you feel that you need to ask the question, then that means that somebody hasn't explained it well enough to you. So, the question we’re dealing with in this blog is not a stupid one.

What is Data Wrangling?

Now, the person who sent me this e-mail felt stupid because they felt that perhaps it was something they should be doing, but they didn't understand what it was, and they didn't want to look stupid by asking.

The short answer is this: yes, the chances are you probably do have to do Data Wrangling in your job, whatever your job is, but whether you should be doing it is a different matter entirely.

I've actually heard the term Data Wrangling quite a lot over the past year or so, and I think people are using it to describe the situation where data isn't perhaps where you would like it to be, or it isn't good enough quality for you.

So, what they tend to mean by the term is gathering together data from various sources and doing something to it so that you can use it.

What could that be? Well, it might be amalgamating it into a spreadsheet; it could be cleansing and fixing the data; it could even be running around various people asking them to fill in the gaps that you've got on your spreadsheet.
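To make those activities concrete, here is a minimal Python sketch of what this kind of wrangling often looks like in practice: amalgamating rows from two sources, doing some light cleansing, and listing the gaps that somebody would still have to chase. The source names, field names and records are all hypothetical and exist only for illustration; real wrangling is usually messier.

```python
# Hypothetical records from two sources; field names are illustrative only.
crm_rows = [
    {"customer_id": "C001", "name": " Alice Smith ", "email": "ALICE@EXAMPLE.COM"},
    {"customer_id": "C002", "name": "Bob Jones", "email": ""},
]
billing_rows = [
    {"customer_id": "C001", "postcode": "AB1 2CD"},
    {"customer_id": "C003", "postcode": ""},
]

FIELDS = ("name", "email", "postcode")


def clean(value):
    """Trim whitespace and treat empty strings as missing."""
    value = (value or "").strip()
    return value or None


def wrangle(*sources):
    """Merge rows by customer_id and list the gaps still to be filled."""
    combined = {}
    for rows in sources:
        for row in rows:
            record = combined.setdefault(
                row["customer_id"], {field: None for field in FIELDS}
            )
            for field in FIELDS:
                value = clean(row.get(field))
                if value is not None:
                    # Lower-case emails so the same address compares equal.
                    record[field] = value.lower() if field == "email" else value
    # Every customer with at least one missing field, and which fields.
    gaps = {
        cid: [f for f in FIELDS if rec[f] is None]
        for cid, rec in combined.items()
        if any(rec[f] is None for f in FIELDS)
    }
    return combined, gaps


combined, gaps = wrangle(crm_rows, billing_rows)
```

The `gaps` dictionary is the part that corresponds to running around asking people to fill in the blanks: the code can find the holes, but it cannot fix why they are there.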

Unfortunately, all of that means Data Wrangling is a necessary thing if you have poor quality or missing data, and it is very common in organisations that haven't yet got a proper Data Governance initiative in place or are very early on in their journey.

It’s part of the problem – not the solution

Data Wrangling also tends to be used to describe the frustration that you have of doing these activities, of bringing together data from disparate systems or spreadsheets, or fixing data before you can do what you should do with it.

Therefore, I don't think Data Wrangling is necessarily a good thing. It’s definitely not a skill you should aspire to have – what you should be aspiring to have is complete and accurate business data with a proper Data Governance initiative in place. Data Wrangling is not the solution – it’s a temporary fix for a much wider problem within your organisation, especially if you find yourself having to do it regularly. At that point you should really stop and ask yourself: ‘Why am I having to do this so often? What data quality issues is my organisation facing, and how can we find long-term solutions to address them?’

Data Wrangling is just something that unfortunately we have to do a lot of in our jobs at the moment, but it is one of the things we should be looking to eradicate by having Data Governance in place.

Get in touch

Don't forget, if you have any questions you’d like covered in future videos or blogs, please email me - questions@nicolaaskham.com.

Or if you’d like to know more about how I can help you and your organisation, then please book a call using the button below.


Who Owns the Data that Appears in Your Reports or Dashboards?

If you’ve ever read any of my previous blogs on data ownership, you’ll probably know that I feel quite categorically, from my many years of experience, that you really cannot have more than one Data Owner per data set. It really doesn't work, and I don’t recommend you try it.

There’s no exception to this rule - believe me, I've been there, done it, still have the scars... What I believe you need to do is find one senior person within your organisation who is going to take overall accountability for that data, wherever it is within your organisation.

So hopefully, from that you're getting an inkling of the answer to today’s question… ‘Who Owns the Data that Appears in Your Reports or Dashboards?’

Well, I believe quite strongly that if the data showing in that report is the same data as it has always been - it has not changed and is therefore still the original data - then it is owned by the same data owner who has always owned it!

For example, if you have a data owner that owns customer data, and you have a report, or more likely a whole suite of reports that contain customer data, then the owner of that data is still the customer data owner.

Now of course when you have reports, there are going to be multiple different data sources in them. And you might have many different data owners per report. This is why quite often when I'm rolling out a data governance framework, I don't make it an official Data Governance role, but I work with the BI or MI analytics team - whatever you call yours - to determine a role called Report Owner.

So, the report owner is whoever first asked for that report, gave you the requirements and then signed off on them. They are the person who knows why that set of data was brought together in that report or dashboard and why it is useful – but, crucially, they don't own the data in it.

If there is a problem with the quality of any of the data in that report, then you would follow the normal data quality issue resolution process and you would go back to the original data owner or owners to get it fixed.

Now, there is sometimes a slightly different alternative, and that is the case where the data has been changed. I see that a lot where organisations are creating models, or where the report in some way aggregates data, transforms it or performs a calculation on it.

So according to the data ownership principles I have already laid out, this means the data has changed. It's no longer the data that it was originally. It is new data. If you have performed any calculation or transformations to the data and created something else as part of producing that report or dashboard, then this is now new data, and the resulting data should have a new data owner.

If it was closely related to the original data, it may be the same data owner, but it may be somebody else. In those instances, it's often the consumer of that data - the person who has given you the requirements for what that calculation or aggregation is – they would be the data owner of the new data.
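As a rough illustration of that rule, the sketch below records, for each field on a report, whether it is original or derived data, and works out who to go back to when there is a quality issue. It is only a sketch under stated assumptions: the class names, the `owner_to_contact` function, and every data set and role title are hypothetical, not part of any standard framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    name: str
    owner: str  # exactly one Data Owner per data set


@dataclass(frozen=True)
class ReportField:
    label: str
    source: DataSet
    derived: bool = False  # True if calculated, aggregated or transformed


def owner_to_contact(field, derived_owners):
    """Return who should resolve a quality issue with this report field."""
    if not field.derived:
        # Unchanged data is still owned by its original Data Owner.
        return field.source.owner
    # Derived data is new data, so it needs its own agreed owner.
    owner = derived_owners.get(field.label)
    if owner is None:
        raise LookupError(f"No Data Owner agreed yet for derived data: {field.label}")
    return owner


# Hypothetical data sets, owners and report fields.
customer = DataSet("Customer data", "Head of Customer Services")
sales = DataSet("Sales data", "Deputy Finance Director")
fields = [
    ReportField("Customer name", customer),
    ReportField("Average order value", sales, derived=True),
]
derived_owners = {"Average order value": "Head of Commercial Analytics"}

contacts = {f.label: owner_to_contact(f, derived_owners) for f in fields}
```

Raising an error when a derived field has no agreed owner is a deliberate choice in this sketch: it surfaces the gap in ownership rather than silently falling back to the owner of the source data.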

Now, if you have two or more stakeholders interested in the same data set, what you must do is get them together and draw a conclusion as to who is the most appropriate person to own it, with the others being key stakeholders.

Another even better option is to consider splitting that data set into subsets until you find a way of splitting it so that everybody's happy that they are owning and responsible for the data that they really should be. Doing it any other way, I can guarantee you, is not going to work.

It's going to cause you loads of pain and is going to result in people telling you that this Data Governance doesn't work or doesn't help them. So, I really cannot stress this enough - you should only have one data owner per data set – and that includes any data that you may use and/or change to form part of a report or dashboard!




How to get your business stakeholders to want and use a Data Glossary

Getting a data glossary in place can take a lot of hard work and effort, so it can be particularly frustrating if/when your business users don’t truly appreciate the value it brings and either don’t want to help you build it or don’t use it when it has been built.

Why are you creating a Data Glossary?

A big reason why such a scenario happens is that we often ignore why we are creating a Data Glossary in the first place. By this, I don’t mean you should have one because you are implementing Data Governance. I mean answering the question: why should your business users want and use one?

This doesn’t mean rattling off a predefined list of benefits, but rather taking that deep breath, stepping back (figuratively speaking) and working out why exactly you’re building a data glossary in the first place and for whom.

Why does your organisation need a Data Glossary?

I’m often asked how to engage business stakeholders (which you should be doing right from the start of your Data Governance initiative) with your data glossary. To do this you need to understand the value a data glossary would bring to them.

And this message needs to be tailored to each group of individuals you’re speaking to, as a single reason won’t work universally across different stakeholders. These messages need to be specific to each of your groups of stakeholders; however, here are a couple of benefits to give you some examples when seeking to communicate the value of your data glossary.

For instance, the faster development of reports is a common theme as a lot of time and effort often can be wasted creating reports without agreed definitions. This can result in ongoing disputes, wasteful meetings and, ultimately, poor decision-making with damaging consequences for an organisation.

Another potential benefit of a data glossary can be identified in the quicker implementation and deployment of new systems. Whether building a system from scratch or implementing a bought package, decisions need to be made as to the data which will reside in the system, and this will result in lengthy debates on the exact definitions of certain terms like ‘customer’. Wouldn’t it be nice (and much quicker) if those debates happened just once and the agreed definitions were logged in the data glossary to be referred to in the future, instead of repeating this for each new system? And of course repeating the debate for each system inevitably results in different systems having slightly different definitions of the same thing! That is not going to help data integration and analysis...

A data glossary is invaluable for streamlining definitions across an organisation and ensuring a common understanding over data and how it can or should be utilised.

A data glossary can act as a cornerstone of proper, consistent communication – the value of this speaks for itself.

Ah, you say, but we already have a business glossary but our business users are not engaged with it. How are we supposed to communicate value?

Simply put, don't let the fact that you already have a data glossary stop you from taking the approach detailed above. You simply have to work out what value it will bring and communicate it. This means identifying what data challenges your business users are facing and using these examples to demonstrate how a data glossary can solve those challenges.

When having conversations with your business stakeholders, it is common for them to get confused between a data glossary and a data dictionary. If that is a challenge you are facing this blog will help you explain with confidence. Read it here.

And you may want to share this video with the people you want to write definitions for your data glossary to make sure you get good quality useful definitions to put in your data glossary. Click here.

If you are struggling with engaging your business users, please book a call using the button below to find out how I can help you:





At what level should Data Owners and Data Stewards be?

In previous blogs we’ve discussed what Data Owners and Data Stewards are and during those discussions, I’ve given you hints as to what level of seniority within your organisation these people should be, but in this article, we’re going to address it directly.

First of all, to refresh your memory, data owners are a small number of people within your organisation (maybe between 15 and 20) who own all the data in your organisation and are accountable for the quality of that data.

Data stewards on the other hand are chosen by the data owners, who delegate the day-to-day responsibility of the data to the data stewards. In my experience data stewards often tend to be the subject matter experts.

So, where do these two roles sit within the structure of an organisation?

Data Owners

I always say that data owners have to be suitably senior people. That generally means that they have to have the appropriate authority and budget to be able to make the decisions and fund any changes needed.

Let’s look at an example of a finance department because it doesn't matter what sector you work in; your organisation will have a finance department. And most finance departments are headed up by a finance director. If you follow my logic of having a really senior person being a data owner, then you might come to the conclusion that the finance director is going to be the best person for the role. And they may well be, but what I would encourage you to do is think very practically about this.

Is this going to work in practice? In my experience, for the vast majority of my clients, the finance director has just been a bit too senior.

They might well understand data governance and support it if you could find the time to talk to them, but I think that's going to be your problem, they're just not going to have the time to support you and to take on this role.

I have seen it work very successfully at that level, but only in a very small number of organisations which have either been very flat in hierarchical terms, or very small organisations.

So, if I'm saying maybe not the finance director, we need to find somebody still who is suitably senior, and I've seen it work very well with the Deputy Finance Director or maybe even the level below that, but they've got to be somebody that's got that overarching view across all the finance data but who also has the authority to make decisions about that data.

Data Stewards

It’s not quite as easy to identify what seniority your data stewards should have, but actually this can also make it easier to identify your data stewards because you don't have to decide what level they should be. This is down to the data owner.

Your job when implementing Data Governance is to identify who the data owners are and engage them and get them to sign up to be the data owners. Once you've done that, it is their job to nominate their data stewards. And to make their own role ultimately easier, they're going to want to nominate people who actually do understand the data and who they trust.

If we look again at the finance example, I would expect the Deputy Finance Director to choose a number of data stewards. Now, in my experience, most finance departments have multiple teams, each working in their own area of specialisation, and what usually happens is the data owner will appoint the heads of each of those teams to be the data stewards for each of those subject matter areas or those subsets of finance data. And that works really well.

Some advice

I would say to leave it to the data owner to appoint the data stewards because if you've explained well enough what they're accountable for and their responsibilities, they will nominate and choose the right people who have the right knowledge and authority to be able to do that on their behalf.

Also, don’t be too worried if your data stewards are not all at the same level or grade within your organisation. This isn't a problem. What's more important is that the right person is chosen so you might find that the data owner chooses four people that all head up their own teams and then they choose one other person who's a little bit more junior.

This could be because they are the subject matter expert and the only subject matter expert in some very specific data. So, always consider that and don't argue back immediately. Try and find out why they have chosen the different levels and you will usually find that there's a very practical reason that they are the right person for the role.



What is a Data Domain?

This is a very short and succinct question, and I thought the answer was likely to be the same. However, when I was sent this question, I was very surprised to see that it was a former colleague from my very first consultancy who submitted it!

I was taken aback at first because I thought to myself ‘surely, they would know the answer to this?!’ but on reflection I realised this is yet another example of how we data professionals are actually very bad at defining things. As data governance professionals, that's even worse, because we spend our time helping others and asking others to write definitions for their data and yet we so often don't define the terms we use well enough for others to understand.

So, I decided that this was actually an excellent question and was definitely one that I should answer.

The first thing I did was look at one of my most commonly used reference books, the DAMA Dictionary of Data Management. This is an excellent reference book for anyone in data governance and as a DAMA member I would highly recommend it. However, on this occasion, it did take me aback.

I opened the page at Data Domain and was quite surprised at the definition it gave. It states that ‘a data domain is a set of allowable values for a data attribute’. However, that is not how I use the term and I think that is the perfect example of what happens as data professionals. We start using a term and people we work with start using it, it proliferates, but we're not necessarily using it for its original intention.
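For anyone who hasn't met the term in that sense, a minimal Python sketch of ‘a set of allowable values for a data attribute’ might look like the following. The attribute (an order status), its allowable values and the function name are all hypothetical, picked only to illustrate the dictionary definition quoted above.

```python
# Hypothetical attribute and values, chosen only for illustration.
ORDER_STATUS_DOMAIN = frozenset({"NEW", "PAID", "SHIPPED", "CANCELLED"})


def invalid_values(values, domain):
    """Return the values that fall outside the attribute's allowable set."""
    return sorted({v for v in values if v not in domain})


observed = ["NEW", "PAID", "paid", "SHIPPED", "REFUNDED", "NEW"]
outside = invalid_values(observed, ORDER_STATUS_DOMAIN)
```

Here `outside` holds the two values that are not in the allowable set, which is the sort of check a database constraint or a data quality rule would make.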

While a data domain is perhaps terminology more commonly used in data modelling and in databases, we use it a lot in the data governance world, but in my view, with a slightly different meaning. We clearly don't mean ‘a set of allowable values for a data attribute’ - that's very techy and data geeky and not at all the type of language we would want to use when we're trying to talk to business users. So, what do we mean?

Well, when I use the term, I mean a logical grouping of data - something where we can tell where it starts and where it ends. From my point of view, I'm normally trying to identify data domains so that I can identify data owners.

For example, you might call customer data a data domain. Or finance data, HR data, product data, supplier data. These are all ideas of logical groupings of data that all relate together.

It's then the work of data governance to work out the details of what is actually included in those domains… but that is what I, and many professionals that I work with, mean when we say, ‘data domain’.

Sometimes I don't use the word data domain. I talk in terms of ‘data set’, which is any logical grouping of data and I think you can use the words ‘data set’ and ‘data domain’ interchangeably. Just make sure that you actually understand what you mean when you use the term and explain to the business users that you're talking to what it means to avoid all confusion.

So, there you have it. That’s my definition of a data domain. I hope you find it useful. If you do, please help me on my mission to help as many people as possible be successful with data governance by sharing it on your choice of social media.

I really appreciate your help in getting the message out.



Data Governance Interview With Andy Lunt

My name is Andy Lunt and I’ve been working in the field of data for the last 10+ years, previously for Adecco Group and more recently for Carruthers & Jackson.

How long have you been working in Data Governance?

I’ve been working specifically in data governance for almost 3 years now.

Some people view Data Governance as an unusual career choice, would you mind sharing how you got into this area of work?

I worked as an MI/BI manager for many years and got to see the results of poor data management in the many hours spent trying to make sense of the data coming into my team – lots of troubleshooting! An opportunity came up to work with a newly formed data science team as a data governance manager, and a chance to fix the causes of many of the data problems we had was one I couldn’t pass up, so I took it!

What characteristics do you have that make you successful at Data Governance and why?

I would say empathy, resilience, persistence, and self-motivation are all characteristics you need.

Empathy, because if you can’t learn to walk a mile in someone’s shoes, you won’t know what is causing them pain when it comes to data. Resilience in your approach is key: there are a lot of ‘dead ends’ in data governance, so being resilient allows you to keep changing until you get it right. Persistence is needed for your message about the ‘why’: people will understand the need, but it takes time and persistence to really drive home the message of why to bother with data governance. Self-motivation, because when the chips are down and you’ve had enough doors slammed in your face that your nose appears shorter, you need to find ways to keep knocking – this is where your passion for the subject, your team, and your knowledge of the ‘why’ being greater than the ‘how’ all come into play.

Are there any particular books or resources that you would recommend as useful support for those starting out in Data Governance?

I would firstly look at getting a mentor, this helped me hugely at the start and in fact helps me to this day. We don’t expect professional athletes to stay at the top of their game or even get to the top without one and in my opinion nor should we.

Some books I read:

The Jelly Effect by Andy Bounds

Verbal Judo by George J. Thompson & Jerry B. Jenkins

Telling Your Data Story by Scott Taylor

What is the biggest challenge you have ever faced in a Data Governance implementation?

Business culture/people.

Is there a company or industry you would particularly like to help implement Data Governance for and why?

Child social care – there is just too much critical data lost, not captured, not accessible, or not understood due to poor data management practices. It breaks my heart to think children suffer as a consequence especially as it’s within our power to change.

What single piece of advice would you give someone just starting out in Data Governance?

Stay positive and be the change.

Finally, I wondered if you could share a memorable data governance experience?

I once spent 30 minutes trying to work out why a full stop had gotten into a field where the data quality rule didn’t allow it… turned out to be a speck of dirt on my screen - I took a long hard look at myself in the mirror that night!

I purchased a cheap keyboard from Amazon which added an extra space in between words randomly – it was a true data governance hater! The words ‘you buy cheap, you buy twice’ were ringing in my head when the new (much better quality) keyboard turned up!


The Rocky Horror Data Show: Disastrous duplicates…

Data shouldn’t be a wild and untamed thing, but sometimes it is just that - wild… and untamed. And unfortunately for our friend Tim, he’s about to find out just how wild and untamed data can be. This is ‘The Rocky Horror Data Show’… where the data is not what it seems.

If you missed the previous episodes, you can read them here:

The Rocky Horror Data Show - where the data is not what it seems

The Rocky Horror Data Show: Did you get what you asked for?

The Rocky Horror Data Show: Not everyone’s on board…

The Rocky Horror Data Show: Disastrous data definitions…


Fresh from developing their data glossary, there’s been a worrying development in The Magical Wish Factory. And it has left Tim and Janet quite perplexed. Over the last couple of weeks, it has become apparent that some subscribers of The Magical Wish Factory have been getting more than their monthly quota of wishes.

Tim and Janet have been investigating the issue – which has cost the MWF hundreds of wishes and a lot of profit so far – and have discovered that a data input error has resulted in several customers having duplicate records on the system, and therefore getting double or sometimes three times their quota of wishes.

“This is a disaster” wailed Janet, “This has cost us so much money and there’s so many wasted wishes!”

“It’s not great” conceded Tim, “but that’s why you brought me in all these months ago, because the Data Governance here isn’t where it needs to be… and I’m sure we can get it all sorted out!”

“We need to get the data cleaned as soon as possible and these duplicate records deleted!” stated Janet, quite firmly.

“Yes, the data does need cleaned, and the duplicate records need condensed and removed… but if we clean the data without addressing the source of the issue, then we’re going to end up in this mess again in a few months – and that’s quite counterproductive, Janet,” said Tim.

“Okay – so what do you suggest then?” asked Janet sceptically.

Tim explained to Janet that the Magical Wish Factory needs to get out of the habit of constantly fighting fires and instead get to the source of ignition and deal with that. Yes, the data could just be cleaned and fixed at the point of use, but it is fundamentally the wrong place to start.

Tim said: “If we clean and fix the data but don’t address why it is wrong and figure out why it is happening in the first place, it will be clean for a day or two and then immediately start to deteriorate again.”

“I found a person in my old job who used to spend three weeks manually fixing role codes in their HR system every six months. It became a regular process to fix it because they never fixed the source of the issue. Think about all that time wasted!”

Tim went on to explain further that the Magical Wish Factory needs to continue to press ahead with its Data Governance initiative first as that will ensure that you look at things strategically instead of a mentality of fixing things tactically all the time.

Tim also explained that they needed to establish a master of all wish makers and that Data Governance would help them identify and fix the source of the issue, not just the resulting data at the time you use it!

“Ah, that makes a lot of sense…” said Janet.

“Yes, Data Governance first and foremost always… that’s my wish!” replied Tim.

Stay tuned for episode six of The Data Governance Coach’s new series ‘The Rocky Horror Data Show’ and follow the adventures of Tim and Janet as they try to implement a successful data governance initiative at the Magical Wish Factory.

And if you want to chat about your Data Governance Training requirements, why not book a call by using the button below?
