Automated document cleansing / scrubbing / redacting tools? #content-management #tools


Douglas Kalish
 

Does anyone have knowledge of third-party automation tools that can be used to take a document and replace client names and financial information with generic data? I work at a professional services firm and we're looking at ways to make it easier for our employees to share more client engagement documents with the larger organization, so wanted to explore automation tools that can help cleanse sensitive documents before publishing. 


(I looked in our SIKMLeaders Conversations archive and didn't see this topic anywhere, so hopefully this is a new one for discussion!)


Thanks in advance for any suggestions or leads!


Doug Kalish

Executive Director, Knowledge Management

Grant Thornton LLP



Kenneth Myers
 

I’d be concerned that such a tool would not remove all the sensitive information. While the financials of a business case are certainly part of it, a lot more could be considered sensitive: for example, product plans, new services, methods, etc. that could be proprietary to a particular customer. The financials are just one piece of what might be covered under an NDA. I’m afraid you need a smart person to review the document with the relevant NDA in hand to really cleanse it. I’m speaking from the perspective of working with the telecommunications industry, with operators who were developing new services that they did not necessarily want replicated by their competitors. Whether something was protected would depend on the NDA. Some aspects of their operations could also be very sensitive and would need to be protected.

 

-Ken Myers


Stephen Bounds
 

Hi Doug,

What you're probably looking for is a data masking product. From some cursory research it seems that most of the products out there focus on database content but you may be able to find something that also works on documents.

Try this list and see if it sparks off any useful lines of inquiry:

https://www.information-management.com/slideshow/19-top-products-for-data-masking

Cheers,
Stephen.

====================================
Stephen Bounds
Executive, Information Management
Cordelta
E: stephen.bounds@...
M: 0401 829 096
====================================
On 25/09/2018 5:49 AM, dougkalish@... [sikmleaders] wrote:




Douglas Kalish
 

Thanks Ken. We're definitely trying to find that balance between how much we can automate and how much still requires some human curation and review! Building on your ideas, I think there's probably a portfolio management way of looking at this too...content from some industries may require more manual intervention than others. Appreciate your insights!

Doug


Douglas Kalish
 

Stephen - Thanks for sharing this report! Looks incredibly helpful and our team will definitely go through this to see if some of these 19 tools are applicable to our situation. Really appreciate the knowledge sharing. :)

Doug


vs_shenoy@...
 

Doug,
Long time, hope you are well. I haven't done this for a while but the challenges are multiple when it comes to cleansing:
1. Different rules (levels of cleansing) for different clients, complicated by evolving confidentiality agreements.
2. Different kinds of document formats (Word, Excel, PowerPoint, PDF) that have varying degrees of editability.
3. Images can contain logos or proprietary information that is not easily caught even by the human eye
4. Text relating to products, services, or PII
5. Differing formats for financial numbers based on the author of the document (USD vs. $, including decimals vs not)
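To make point 5 concrete, here is a minimal Python sketch of rule-based scrubbing. It is only an illustration, not a tool anyone in this thread has used: the client list, the placeholder labels, and the `scrub` function are all hypothetical, and the pattern covers just a few common US-dollar styles ("$1,200.50", "USD 3500", "2.5 million dollars"). Anything outside those patterns, such as a bare number in a table cell, would slip through, which is the difficulty described in the list above.

```python
import re

# Hypothetical client list; a real one would have to come from an engagement system.
CLIENT_NAMES = ["Acme Corp", "Globex"]

# Two assumed styles: a currency prefix ("$1,200.50", "USD 3500", "$3.5 million")
# or a currency suffix ("3500 USD", "2.5 million dollars").
MONEY = re.compile(
    r"""
    (?:USD|US\$|\$)\s?\d+(?:,\d{3})*(?:\.\d+)?(?:\s?(?:million|billion|[kmb])\b)?
    |
    \d+(?:,\d{3})*(?:\.\d+)?\s?(?:million|billion)?\s?(?:USD|dollars)\b
    """,
    re.VERBOSE | re.IGNORECASE,
)


def scrub(text, client_names=CLIENT_NAMES):
    """Replace known client names and dollar amounts with generic placeholders."""
    for i, name in enumerate(client_names, start=1):
        # re.escape keeps punctuation in a client name from being read as regex syntax.
        text = re.sub(re.escape(name), f"[CLIENT {i}]", text, flags=re.IGNORECASE)
    return MONEY.sub("[AMOUNT]", text)


print(scrub("Acme Corp paid $1,200.50 in fees and USD 3500 in expenses."))
# [CLIENT 1] paid [AMOUNT] in fees and [AMOUNT] in expenses.
```

Every new author convention (a different currency, "1.2M", amounts spelled out in words) means another rule, so the rule set grows with the variety of the documents.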

Also, most importantly, it is possible to end up with content that is so cleansed that it ceases to be useful. From all my reading, data masking works well with large data sets or databases. Unless the documents are very standardized (unlikely in Consulting), it would be very hard to automate. But with well-documented instructions and simpler rules around what to cleanse, it is possible to reduce the effort and cost associated with cleansing content. Sorry to be a Bob Bummer on the automation. This is likely a great opportunity to create such a tool.

- Vinod


Douglas Kalish
 

Hey Vinod! Great to chat with you! Appreciate the counsel based on your experiences with cleansing...there are definitely some strengths to automation, but there is also still a need for human intervention!


Matt Moore <innotecture@...>
 

Hi,

A large professional services firm of my acquaintance used an off-shore content team in a low cost country to manually scrub deliverables judged to be "high value" before sharing via databases. They were mostly good but I did occasionally find things they had missed (logos as image files, text embedded in formatting).

This reminds me of my recent experience with Microsoft's DLP (data loss prevention) tools. Out of the box, these are used to look for PII-type data (e.g. credit cards) within documents and record, alert or prevent sharing. The issues they have are:
- false positives (a lot as the business rules are fairly crude)
- false negatives (some formats - e.g. credit cards - are global but most formats are specific to a jurisdiction - and these tools tend to default to American formats - so may miss those that do not conform to expectations)
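The trade-off between false positives and false negatives can be seen in a small Python sketch. This is an assumption-laden illustration, not a description of how Microsoft's DLP works: the pattern and the function names are hypothetical. A crude rule ("any run of 13 to 19 digits") flags many harmless reference numbers, while adding the Luhn checksum, which payment card numbers are designed to satisfy, discards most of them.

```python
import re

# Crude rule: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit runs that look like card numbers and pass the checksum."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(match.group())
    return hits


sample = "Card 4111 1111 1111 1111 on file, ref 1234 5678 9012 3456."
print(find_card_numbers(sample))
# ['4111 1111 1111 1111']
```

The checksum only helps with formats that carry one. Identifiers that are specific to a jurisdiction each need their own pattern and validation rule, which is where the false negatives come from.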

If you are going to automate, you will need to reengineer how your staff create documents in the first place - i.e. make them highly standardised. As noted by others, I can see there being a lot of push back around this (as clients may insist that staff use their formats).

An alternative to all this is to control access to sensitive documents but publish overviews that allow people to find out they exist and then approach the appropriate custodian. As others have noted, sometimes a document can be so scrubbed that it is almost meaningless. And the really juicy documents often require someone involved in their creation to act as a tour-guide to point out the sights for you.

Regards,

Matt