Do you need Data Governance over a Data Lake?

There continues to be a lot of excitement about data lakes and the possibilities that they offer, particularly around analytics, data visualisations, AI and machine learning. As such, I’m increasingly being asked whether you really need Data Governance over a data lake.  After all, a data lake is a centralised repository that allows you to store all your structured and unstructured data on a scalable basis.

Unlike a data warehouse, in a data lake you can store your data as-is without having to structure it first.  This has resulted in many organisations “dumping” lots of data into data lakes in an uncontrolled and thoughtless manner.  The result is what many people are calling “Data Swamps” which have not provided the amazing insights they hoped for.

So the simple answer to the question is yes – you do need Data Governance over data lakes to prevent them from becoming data swamps that users don’t access because they don’t know what data is there, they can’t find it, or they just don’t trust it.  If you have Data Governance in place over your data lake, then you and your users can be confident that it contains clean data which can be found and used appropriately.

But I don’t expect you to just take my word for it; let’s have a look at some of the reasons why you want to implement Data Governance on data being ingested into your data lake:

Data Owners Are Agreed

Data Owners should be approving whether the data they own is appropriate to be loaded into the Data Lake, e.g. is it sensitive data, and should it be anonymised before loading?

In addition, users of the data lake need to know who to contact if they have any questions about the data and what it can or can’t be used for.

Data Definitions

Whilst data definitions are desirable in all situations, they are even more necessary for data lakes.  In the absence of definitions, users of data in more structured databases can use the context of that data to glean some idea of what the data may be.  As a data lake is by its nature unstructured, there is no such context.

A lack of data definitions means that users may not be able to find or understand the data, or alternatively use the wrong data for their analysis.  A data lake could provide a ready source of data, but a lack of understanding about it means that it cannot be used quickly and easily. This means that opportunities are missed and use of the data lake ends up confined to a small number of expert users.

Data Quality Standards

Data Quality Standards enable you to monitor and report on the quality of the data held in the data lake.  While you do not always need perfect data when analysing high volumes, users do need to be aware of the quality of the data. Without standards (and the ability to monitor against them) it will be impossible for users to know whether the data is good enough for their analysis.
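To make that concrete, below is a minimal, hypothetical sketch in Python (using pandas) of what monitoring ingested data against agreed quality standards might look like. The column names and completeness thresholds are illustrative assumptions, not a prescribed set of standards.

```python
import pandas as pd

# Illustrative quality standards, assumed to have been agreed with the Data Owners.
STANDARDS = {
    "customer_id": 1.00,     # every record must have a customer ID
    "email": 0.95,           # at least 95% of records must have an email address
    "date_of_birth": 0.90,   # at least 90% of records must have a date of birth
}

def measure_completeness(df: pd.DataFrame) -> dict:
    """Measure completeness of each governed column and compare it to the agreed standard."""
    results = {}
    for column, threshold in STANDARDS.items():
        completeness = df[column].notna().mean() if column in df.columns else 0.0
        results[column] = {
            "completeness": round(float(completeness), 3),
            "meets_standard": completeness >= threshold,
        }
    return results

if __name__ == "__main__":
    sample = pd.DataFrame({
        "customer_id": [1, 2, 3, 4],
        "email": ["a@example.com", None, "c@example.com", "d@example.com"],
        "date_of_birth": ["1980-01-01", "1975-06-30", None, None],
    })
    for column, result in measure_completeness(sample).items():
        print(column, result)
```

Publishing results like these alongside the data lets users judge for themselves whether the data is good enough for their particular analysis.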

Data Cleansing

Any data cleansing done in an automated manner inside the data lake needs to be agreed with Data Owners and Data Consumers. This is to ensure that all such actions comply with the agreed definitions and standards and that they do not render the data unusable for certain analysis purposes, e.g. defaulting missing dates of birth to an agreed date could skew an analysis that involved looking at the ages of customers.
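As a purely illustrative example of that last point (the names, dates and defaulting rule below are all made up), this short Python sketch shows how defaulting missing dates of birth to a single agreed date can badly skew an average-age calculation for anyone unaware of the cleansing rule:

```python
from datetime import date

# Made-up customer dates of birth; None represents missing values.
raw_dob = [date(1980, 5, 1), date(1992, 9, 12), None, None, date(1975, 1, 20)]

# An assumed cleansing rule: default missing dates of birth to 1 January 1900.
DEFAULT_DOB = date(1900, 1, 1)
cleansed_dob = [dob if dob is not None else DEFAULT_DOB for dob in raw_dob]

def average_age(dobs, today=date(2024, 1, 1)):
    """Average age in years for a list of dates of birth."""
    ages = [(today - dob).days / 365.25 for dob in dobs]
    return sum(ages) / len(ages)

# Average age using only the genuinely known dates of birth...
print(round(average_age([d for d in raw_dob if d is not None]), 1))  # roughly 41
# ...versus the same calculation after defaulting, which drags the average up sharply.
print(round(average_age(cleansed_dob), 1))                           # roughly 74
```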

Data Quality Issue Resolution

While there may be some cases where automated data cleansing inside the data lake may be appropriate, all identified data quality issues in the data lake should be managed through the existing process to ensure that the most appropriate solution is agreed by the Data Owner and the Data Consumers.

Data Lineage

Having data flows documented is always valuable, but in order to meet certain regulatory requirements (including the EU GDPR), organisations need to prove that they know where data is and how it flows throughout their company.

Data lineage diagrams are one of the key data governance deliverables. Critical or sensitive data being ingested into the data lake should be documented on data flow diagrams.  This will add to the understanding of the Data Consumers by highlighting the source of that data.  Such documentation also helps prevent duplicate data being loaded into the data lake in the future.

I hope I have convinced you that if you want a data lake to support your business decisions, then Data Governance is absolutely critical.  While it may not need to be as granular as the definitions and documentation that you would put in place for a data warehouse, it is needed to ensure that you create and maintain a data lake and not a data swamp!

Ingesting data into data lakes without first understanding that data is just one of many data governance mistakes that are often made. You can find out the most common mistakes and, more importantly, how to avoid them by downloading my free report here.

Does it have to be called Data Governance?

This is a question that I get asked fairly regularly. After all, it is not an exciting title and in no way conveys the benefits that an organisation can achieve by implementing Data Governance. Sadly, however, there is no easy yes or no answer. There are a number of reasons for this:

  1. Data governance is a misunderstood and misused data management term

Naturally I am biased, but in my view, data governance is the foundation of all other data management disciplines (and therefore, of course, the most important). But despite an increasing focus on the topic, it remains a largely misunderstood discipline.

On top of this, it is a term which is frequently misused. A few years ago, a number of Data Security software vendors were using the term to describe their products. More recently, the focus on meeting the EU GDPR requirements has led to a lot of confusion as to whether Data Protection and Data Governance are the same thing, and I find that the terms are being used interchangeably. (For the record, having Data Governance in place does help you meet a chunk of the GDPR requirements, but they are not the same thing.)

Having more people talking about Data Governance is definitely a good thing, but unless they all mean the same thing by it, it leads to much confusion over what data governance really is.

I explored this topic in a bit more detail in this blog: Why are there so many Data Governance Definitions?

To work out whether Data Governance is the right title for your organisation, I would start by looking at how you define data governance. And this step leads nicely to the next item for consideration.

  2. Sometimes it is right to include things which are not pure data governance in the scope of your data governance initiative.

This is a topic that I covered in my last blog which you can read here.

To summarize that article, it is just not possible to have one or more people focus purely on Data Governance in smaller organisations. It’s a luxury of large organizations to be able to have separate teams responsible for each different data management discipline (e.g. Data Architecture, Data Modelling or Data Security).  Going back to my point above, if data governance is the foundation for all other data management disciplines, it is only natural that the line between them can sometimes get a little blurred. As a result of this, the responsibilities of the Data Governance Team can get expanded.

So consider what is included within the scope of your data governance initiative and decide whether it would be more appropriate to give the initiative, your team, or both a name that is more closely aligned to that wider scope and the activities of the team.

Is the name going to make cultural change harder to achieve?

Achieving a sustainable cultural change is one of the biggest challenges in implementing data governance, and insisting on calling it “data governance” could make achieving that cultural change more difficult if the term doesn’t resonate within your organization. This is related to a topic that I explored in another old blog: Do we have to call them Data Owners?

Whether we’re talking about the roles, the team, or even the initiative the same principles are true. It is better to choose a name that works for the culture in your organization than to waste considerable effort trying to convince people that the “correct” terminology is the only one to use.

It would be my preference to explain that the initiative is to design and implement a Data Governance Framework, but if the primary reason for implementing data governance is to improve the quality of your data, perhaps calling it the “Data Quality Team” and “Data Quality Initiative” would fit better? After all, that very much focuses on the outcome of what you’re doing.  It also addresses the question that everybody asks (or should ask) when approached to get involved in data governance: “why are we doing this?”, which is usually followed by “what’s in it for me?”

When having these conversations, I explain the initiative in terms of its outcomes (e.g. better quality data which will lead to more efficient ways of working, reduced costs and better customer service). That is a far easier concept to sell rather than implementing a governance structure, which can sound dull and boring.

Is the name causing confusion?

In the early days of a data governance initiative, the talk is all about designing and implementing a data governance framework. Once this work has been achieved, you start designing and implementing processes which have “Data Quality” in their titles:

  • Data Quality Issue Resolution

  • Data Quality Reporting

I have been fortunate enough to work with organizations in the past who have had both a Data Governance Team (supporting the Data Owners and Data Stewards) and a Data Quality Team (responsible for the processes mentioned above), but that is fairly unusual in my experience. It is more common for the Data Governance Team to support the above processes. So it is worth considering whether it would confuse people if they had to report data quality issues to the Data Governance Team.

In summary, I would not want to miss the opportunity to educate more people on what Data Governance really is. But the banner under which it is delivered can be altered to make your data governance implementation both more successful and more sustainable. So if, having considered all the points above in respect of your organization, you want to call it something else, then that is fine with me.

Deciding what to call your initiative is only the start of many things you need to do to make your Data Governance initiative successful.  You can download a free checklist of the things you need to do here. (Don't forget this is a high-level summary view, but everyone who attends either my face-to-face or online training gets a copy of the complete detailed checklist which I use when working with my clients.)

What do you include in Data Quality Issue Log?


Whenever I am helping clients implement a Data Governance Framework, a Data Quality Issue Resolution process is top of my list of the processes to implement. After all, if you are implementing Data Governance because you want to improve the quality of your data, it makes sense to have a central process to enable people to flag known issues, and to have a consistent approach for investigating and resolving them.

At the heart of such a process is the log you keep of the issues.  The log is what the Data Governance Team will be using while they help investigate and resolve data quality issues, as well as for monitoring and reporting on progress.  So, it is no surprise that I am often asked what should be included in this log.

For each client, I design a Data Quality Issue Resolution process that is as simple as possible while still meeting their needs (why create an overly complex process which only adds bureaucracy?). Then, I create a Data Quality Issue Log to support that process.  Each log I design is, therefore, unique to that client.  That said, there are some column headings that I typically include on all logs.

Let’s have a look at each of these and consider why you might want to include them in your Data Quality Issue Log:

ID

Typically, I just use sequential numbers for an identifier (001, 002, 003 etc.).  This has the advantage of being simple and giving you an instant answer to how many issues have been identified since the process was introduced (a question that your senior stakeholders will ask you sooner or later).

If you are creating your log in an Excel spreadsheet, then it is up to you to decide how you record ID numbers or letters.  If, however, you are recording your issues on an existing system (e.g. an Operational Risk System or Helpdesk System), you will need to follow its existing protocols.

Date Raised

Now this is important for tracking how long an issue has been open and monitoring average resolution times.  Just one small reminder: be sure to decide on and stick to a standard date format – it doesn’t look good for dates to have inconsistent formats in your Data Quality Issue log!

Raised By (Name and Department)

This is a good way to start to identify your key data consumers (it is usually the people using the data who notify you when there are issues with it) for each data set.  This is something you should also log in your Data Glossary for future reference (if you have one). More importantly, you need to know who to report progress to and agree on remedial action plans with.

Short Name of Issue

This is not essential and some of my clients prefer not to have it, but I do like to include this one. It makes referring to the Data Quality Issue easy and understandable.

If you are presenting a report to your Data Governance Committee or chasing Data Owners for a progress update, everyone will know what you mean if you refer to the “Duplicate Customer Issue”. They may not remember what “Data Quality Issue 067” is about, and “System x has an issue whereby duplicate customers are created if a field on a record is changed after the initial creation date of a record” is a bit wordy (this is the detail that can be supplied when it is needed).

Detailed Description

As I mentioned above, I don’t want to use the detailed description as the label for an issue, but the detailed description is needed. This is the full detail of the issue as supplied by the person who raised it and drives the investigation and remedial activities.

Impact

Again, this is supplied by the person who identified the issue. This field is useful in prioritizing your efforts when investigating and resolving issues. It is unlikely that your team will have unlimited resources and be able to action every single issue as soon as you are aware of it. Therefore, you need a way to prioritize which issues you investigate first. Understanding the impact of an issue means that you focus on resolving those issues that have the biggest impact on your organization.

I like to have defined classifications for this field. Something simple like High, Medium and Low is fine; just make sure that you define what these mean in business terms. I was once told about a ‘High’ impact issue and spent a fair amount of time on it before I discovered that in fact just a handful of records had the wrong geocode. The small percentage of incorrect records made it seem more likely that human error was to blame, rather than there being some major systemic issue that needed to be fixed! This small percentage of incorrect codes was indeed causing a problem for the team who reported them, as they had to stop time-critical month-end processes to fix them, but the impact category they chose had more to do with their level of frustration at the time they reported it than the true impact of the issue.

Data Owner

With all things (not just data), I find that activities don’t tend to happen unless it is very clear who is responsible for doing them. One of the first things I do after being notified of a data quality issue is to find out who the Data Owner for the affected data is and agree with them that they are responsible for investigating and fixing the issue (with support from the Data Governance Team of course).

Status

Status is another good field to use when monitoring and reporting on data quality issues. You may want to consider using more than just the obvious “open” and “closed” statuses.

From time to time, you will come across issues that you either cannot fix, or that would be too costly to fix. In these situations, a business decision has to be made to accept the situation. You do not want to lose sight of these, but neither do you want to skew your numbers of ‘open’ issues by leaving them open indefinitely. I like to use ‘accepted’ as a status for these and have a regular review to see if solutions are possible at a later date. For example, the replacement of an old system can provide the answer to some outstanding issues.

Update

This is where you keep notes on progress to date and details of the next steps to be taken (and by whom).

Target Resolution Date

Finally, I like to keep a note of when we expect (and/or wish) the issue to be fixed by. This is a useful field for reporting and monitoring purposes. It also means that you don’t waste effort chasing for updates when issues won’t be fixed until a project delivers next year.
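Pulling those columns together, here is a minimal, hypothetical Python sketch of how a single Data Quality Issue Log entry could be represented if you ever move beyond a spreadsheet. The field names and status values simply mirror the columns described above and are assumptions rather than a prescribed design:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed status values, mirroring the statuses discussed above.
STATUSES = {"Open", "Closed", "Accepted"}

@dataclass
class DataQualityIssue:
    """One row of a Data Quality Issue Log, using the columns described above."""
    issue_id: str                     # e.g. "001", "002", "003"
    date_raised: date                 # use one consistent date format
    raised_by: str                    # name and department of the person raising it
    short_name: str                   # e.g. "Duplicate Customer Issue"
    detailed_description: str         # full detail supplied by the person who raised it
    impact: str                       # e.g. "High", "Medium", "Low" (defined in business terms)
    data_owner: str                   # who is responsible for investigating and fixing it
    status: str = "Open"
    update: str = ""                  # progress notes and next steps
    target_resolution_date: Optional[date] = None

    def __post_init__(self):
        if self.status not in STATUSES:
            raise ValueError(f"Unknown status: {self.status}")

# Example entry with made-up values.
issue = DataQualityIssue(
    issue_id="067",
    date_raised=date(2024, 3, 1),
    raised_by="Jane Smith, Finance",
    short_name="Duplicate Customer Issue",
    detailed_description="System X creates duplicate customers when a field is changed after record creation.",
    impact="High",
    data_owner="Head of Customer Operations",
    target_resolution_date=date(2024, 6, 30),
)
print(issue.short_name, "-", issue.status)
```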

I hope this has given you a useful insight into the items you might want to include in your Data Quality Issue Log. You can download a template with these fields for free by clicking here.

Running and managing a Data Quality Log using Excel and email is an easy place to start, but it can get time-consuming once volumes increase – especially when it comes to chasing those responsible!  That’s why I was delighted to be involved recently with helping Atticus Associates create their latest product in this space, DQLog.  The Atticus team are launching their beta version in Spring this year and they are keen to hear from anyone interested in trying it and providing feedback.  If you are interested in testing the beta, please email me and I can put you in touch.

Make Sure you Follow These Practical Steps for Creating a Business Glossary

I’ve recently launched a new course: An Introduction to Data Governance Using Collibra. To ensure that attendees on this course have access to the best combination of business (my focus) and technical skills, I have teamed up with a leading Collibra expert and Implementation Partner, Carl White. As you know, I like to use this blog to share practical advice to help you with your Data Governance initiatives, and I thought that this new collaboration gave me an opportunity to ask Carl for his views on the best way to approach a typical activity for organisations embracing data governance: creating a Business Glossary.

Firstly, what is a business glossary?

In a nutshell, it’s the place where important business terms are clearly owned, articulated, contextualised and linked to other information assets (e.g. reports).  For example, you will have a list of terms, what each term means in business terms, who owns that data, and then information such as which systems and processes it is used in.
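As a purely illustrative sketch (the field names below are assumptions, not the model used by Collibra or any particular tool), a single glossary entry might capture something like the following in Python:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GlossaryTerm:
    """A single Business Glossary entry: the term, its business definition, its owner,
    and the systems, processes and reports it is linked to."""
    term: str
    business_definition: str
    data_owner: str
    used_in_systems: List[str] = field(default_factory=list)
    used_in_processes: List[str] = field(default_factory=list)
    related_reports: List[str] = field(default_factory=list)

# Example entry (illustrative values only).
customer = GlossaryTerm(
    term="Customer",
    business_definition="An individual or organisation that has purchased at least one product from us.",
    data_owner="Head of Sales Operations",
    used_in_systems=["CRM", "Billing"],
    used_in_processes=["Customer Onboarding"],
    related_reports=["Monthly Sales Report"],
)
print(customer.term, "-", customer.data_owner)
```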

A Business Glossary seems a fairly straightforward deliverable; surely it’s very easy to create one?

It seems straightforward but there will inevitably be many stakeholders, all of whom differ in their understanding, expectations, requirements and commitment. Enthusiastic stakeholders will expect the Business Glossary to store everything and solve all problems related to business semantics. Uncommitted stakeholders might see it as a valueless exercise. If it is not carefully positioned, the glossary can quickly become an unstructured dumping ground, ironically reflecting the reason the organisation needed one in the first place.

So what do you recommend that anyone creating a Business Glossary does first?

It’s critical to identify a focus area within the organisation where sponsorship is strong but a lack of clarity has caused problems. Canny sponsors will usually be aware of a particular domain or business area where terms are problematic, for instance, a certain set of management reports where Finance and Sales teams don’t even realise they define terms differently.

Once you have agreed a focus for your pilot, what should you do next?

Starting with the sponsor, engage key stakeholders within the focus area to define a limited scope with clear and measurable outcomes that all stakeholders see as valuable to them.

Who do you consider a ‘key stakeholder’? Do you mean the really senior people in that area, or the more junior people who really do the work?

Both senior and junior people have a part to play. Senior people will be accountable for terms and will want to review and approve definitions. Junior people will tend to be more involved on a day-to-day basis, so they often know more about the issues. There’s a collaboration to set up through the glossary in which the junior people begin articulating terms and the senior people review and approve. The collaboration is as important as the final definitions, in my opinion, as it leads on to generally better practice, like clear accountability for data.

Once you have your area for your pilot identified and stakeholders engaged, what’s next?

Collect a small volume of the most problematic terms, perhaps in an Excel workbook. Identify stakeholders who are willing to act as owners of the term and others who are willing to articulate the term. Encourage stakeholders to be rigorous with their definitions and the information they keep on the terms. I’ve seen so many definitions along the lines of ‘Customer Type: the type of customer’, but this tells me nothing about the possible values, who uses the term, why it matters, who wrote the definition, who approved the definition, when it might no longer apply and so on.

And once you’ve got them working, do you move on to another area?

Not quite: creating data glossaries is very much an iterative process. Once your stakeholders become involved, they are likely to think of more information that they would like to add to the glossary. So after the pilot stage it is important that you review the pilot to determine whether all the required information has been collected and whether changes are required before rolling the process out across the rest of your organisation.

And can all of this be done in Microsoft Excel?

You can get a fair way along the journey with Microsoft Excel, but the collaboration we talked about earlier includes an element of workflow: terms need to be very easily accessible to all users, and changes to the glossary need to be tracked and understood. However, an organisation can start the process using Excel in order to begin their journey and really understand what they need. I would recommend starting small to understand the benefits. Once these are clear and there’s a head of steam, I’d strongly recommend making an investment in a tool.

I hope you have found this advice from Carl useful. If you want to learn where a Business Glossary fits in a data governance framework, and even have an attempt at creating your own in Collibra, why not come along and join us both on An Introduction to Data Governance Using Collibra on 7 September in Central London.

 

My free report reveals why companies struggle to successfully implement data governance. Discover how to quickly get your data governance initiative on track by downloading this free report.

Building Relationships and Rapport


I spent last week at two amazing data conferences in the US. Firstly, I was at Enterprise Dataversity in Chicago, and from there I flew to Richmond, Virginia to join the International Data Quality Summit.  Both were excellent events and gave me the opportunity to meet in person some data friends who until then had only been “virtual” friends via the wonders of social media.  Of course, I also got to meet up with others whom I had been lucky enough to meet previously, and finally there are my new data friends whom I would never have come across if I had not met them at the conferences.

Going back to my virtual data friends, I have always considered that I had good relationships with these people, but it’s amazing how much better I know them now that I have actually met them and spent some time with them.  It really doesn’t take long to build rapport with people: just a chat over coffee or a meal makes a huge difference in building relationships.

It bears out something that I taught in one of my tutorials this week: the importance of building relationships and rapport with your stakeholders (especially your Data Owners and Data Stewards).  Sending an email asking someone to be a Data Owner is unlikely to be successful, but meeting them face to face and explaining what data governance is and why you think they should be a data owner will be much more successful, especially when you take the time to get to know them and the challenges that they are facing, so that you can articulate what being a Data Owner will mean to them.

Sometimes it just isn’t possible to meet up face to face, and in those circumstances you will need to work hard to make the most of the communication options that you have available.  But as I experienced on numerous occasions last week, good long-distance relationships can very quickly become so much stronger when you meet up in person.

So if it is possible to meet your senior stakeholders when implementing data governance, make sure that you do, and make the most of that opportunity to build relationships and rapport.  And of course, if you get the opportunity to attend a data conference, make sure that you take it. It really is an excellent environment for learning from others’ experiences and for meeting and networking with your peers.

 

My free report reveals why companies struggle to successfully implement data governance. Discover how to quickly get your data governance initiative on track by downloading this free report.

Data Governance Interview - Jim Harris


I'm very pleased that Jim Harris agreed to be interviewed for the first blog on my new website...

Jim Harris is a recognized industry thought leader with more than 20 years of enterprise data management experience, specializing in data quality, data integration, data warehousing, business intelligence, master data management, data governance, and big data analytics.

As Blogger-in-Chief at Obsessive Compulsive Data Quality, Jim Harris offers an independent, vendor-neutral perspective and hosts the popular audio podcast OCDQ Radio. Jim Harris is an independent consultant and freelance writer for hire.
