Revealing Some Fundamentals of Data Masking
This article introduces the concept of data masking and provides some key considerations for undertaking a data masking initiative. Most organizations recognize that their current practice of using production data in non-production environments introduces security risks, but they are hesitant to change that practice in the belief that creating reliable and comprehensive test data for developers and testers is impractical or too great a challenge. It is not.
Commercial data masking products are available that allow the use and development of masking policies to convert sensitive production data into databases with non-sensitive, fictitious yet realistic data. These tools are highly capable of altering sensitive data but, like any IT solution, require a solid understanding of the requirements and careful planning for deployment.
Here are seven key elements to consider when planning a data masking project.
1. Masking Identifiers and Descriptors
Your data masking solution may need to accommodate different approaches for different types of data.
Descriptors refer to the data that is associated with, or about, a subject, such as salary or details from a customer application form. Descriptor attributes may be left in their unmasked state or may be masked using out-of-the-box capabilities such as generating random numbers or shuffling the values within a group. Your approach to data masking should depend on the actual planned use of the masked database, as there may be cases where some of the descriptor attributes need to be masked and other cases where it is preferable to leave them unmasked. When descriptor attributes need to be masked, the requirement is commonly to generate some sort of random value, perhaps within specific constraints or in context with other attributes that may or may not be masked. An example of this would be masking salary but within the expected bounds of the subject’s job classification code.
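The salary example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any product's built-in function: the job classification codes, the salary bands and the mask_salary name are all hypothetical.

```python
import random

# Hypothetical salary bands per job classification code (illustrative values only).
SALARY_BANDS = {
    "CLERK-1": (35_000, 50_000),
    "ANALYST-2": (60_000, 85_000),
}


def mask_salary(job_class: str, rng: random.Random) -> int:
    """Return a random salary that stays inside the band for the job class."""
    low, high = SALARY_BANDS[job_class]
    return rng.randint(low, high)  # randint includes both end points


# A descriptor mask does not have to repeat, so an unseeded generator is enough.
rng = random.Random()
masked_salary = mask_salary("ANALYST-2", rng)
```

The masked figure stays plausible for the job classification while bearing no relation to the subject's real salary.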
Identifiers refer to the data that describes who the actual subject is, such as name, address, employee number or SIN. Identifier attributes generally need to be anonymized so that it is not possible to determine the actual identity of the subject in the masked database. Masking identifiers is typically more complicated and requires a good understanding of the planned use of the masked database and often needs more innovative and secure approaches. In general, it is necessary to examine the identifier attributes more holistically to ensure privacy but also to create a masked database that appears realistic enough to satisfy the development or testing requirements. Consider for example, first and last name of a subject. There may be situations where the masked values of the name attributes are not important and can therefore be generated with a mix of random or fixed strings and numbers (e.g. FIRSTNAME-000001, LASTNAME-999999). There may be other situations where it is necessary to select more realistic names from a pre-defined list of names or to ensure consistency of name masking across databases that include the same subject.
It is also possible for an attribute that would otherwise be considered an identifier, such as SIN, to be treated as a descriptor, thereby reducing the complexity of its masking policy. Although SIN may have other constraints that need to be satisfied, such as uniqueness, masking requirements for SIN in some situations may be satisfied with a random value each time. Other situations may require SIN to be treated as an identifier, such as when it is the key identifier, and to be masked in a way that is repeatable for the subject in all occurrences found in the database.
It is necessary to analyze the requirements and identify the appropriate approach to masking identifiers and descriptors of the subject. Ideally the data masking architecture implemented is flexible enough to handle both types of data.
2. Deterministic Masking Intra-Database and Cross-Database
Deterministic masking is the concept of being able to randomly select or generate a new value to use as the masked value but to ensure that that same random value is used in every subsequent data masking operation. In most cases, the deterministic mask is unique to each subject. For example, a common requirement is to deterministically mask employee number, meaning that the solution must randomly generate a masked employee number for each subject (employee) but then ensure that the same masked employee number is always applied to the correct employee every time thereafter. There are several approaches to deterministic masking, such as non-reversible calculations that use the original value to always generate the same but unique masked value. Another approach to deterministic masking may involve using a pre-defined list of acceptable values, from which the same random value is selected each time for the same employee. The former approach may be more applicable to attributes like employee numbers, while the latter approach may be more applicable for attributes like names.
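Both approaches can be sketched in Python. The sketch assumes a keyed one-way hash (HMAC-SHA256) as the non-reversible calculation; the key, the name list and the function names are hypothetical. Note that truncating the hash to a fixed number of digits can produce collisions, so uniqueness would still have to be checked separately.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be protected, not stored in code.
SECRET_KEY = b"replace-with-a-protected-secret"

# Hypothetical pre-defined list of acceptable masked names.
MASKED_FIRST_NAMES = ["Alex", "Blake", "Casey", "Devon", "Emery"]


def _keyed_number(value: str) -> int:
    """Turn a value into a large integer with a keyed one-way hash."""
    mac = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return int.from_bytes(mac.digest(), "big")


def mask_employee_number(original: str, digits: int = 8) -> str:
    """Calculation-based: the same input always yields the same masked number."""
    return str(_keyed_number(original) % 10**digits).zfill(digits)


def mask_first_name(employee_number: str) -> str:
    """List-based: the same employee always receives the same list entry."""
    index = _keyed_number("first-name:" + employee_number) % len(MASKED_FIRST_NAMES)
    return MASKED_FIRST_NAMES[index]
```

Calling either function twice with the same employee number returns the same masked value, which is what keeps every table that carries that employee number consistent.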
An organization needs to assess the need for deterministic masking across databases as well as within databases. In general, deterministic masking is required within a single database to ensure referential integrity throughout the database (e.g. all instances of employee number in each table of the database must be masked to the same value). However, it is necessary to identify if and when there is a requirement to ensure the use of the same masked values in other organizational databases (e.g. ensure that the same masked employee number or names are used for the same subject in all other databases).
Virtually all implementations of data masking involve deterministic masking within a database. The need for cross-database deterministic masking, however, depends on whether multiple databases need to be linked (logically or physically), for example, when data for the same subject from multiple databases needs to be compared during testing. Imagine the potential difficulties for testers if, for example, the masked name and other key identifier attributes of an employee are completely different between two databases that need to be compared.
It is necessary to analyze the requirements for deterministic masking and when to apply it across databases. Ideally the data masking architecture easily supports both approaches, so that either can be applied based on the specific requirements. Where possible, avoiding deterministic masking across databases will simplify the approach as well as help to reduce risk.
3. Security, Seeding and Version Controlling
In all data masking deployments, there will be a need for some form of deterministic masking on one or more attributes. The method used to calculate or select the recurring masked value to use must be protected from disclosure to mitigate the risk of being able to reverse engineer the masked identity of a subject back to the original. One common approach is to use one-way hashing algorithms which, by design, protect against reversibility. For example, masking an employee number using one-way hashing means that it is not possible to directly determine the original value from the hashed (masked) value.
However, not all masking requirements are satisfied with only hashing an original value to calculate the masked value. There may be a need to use pre-defined lists of values from which to consistently select a randomly determined entry. Because a fresh random selection each time may not meet the requirements, the entry has to be selected by a consistent calculation, and the details of that calculation must be protected as well to avoid reverse engineering.
Seeding the calculation for selecting entries from pre-defined lists is one way to protect the masking operations. This means having the ability to apply a key (unique) value into the calculation to affect its outcome at run time. One possible approach is to enable the data masking administrator to manually enter a random value (i.e. a password) that directly impacts the algorithm used to calculate the entry to select. Using this approach helps to mitigate the possibility of an attacker figuring out how the selection was made and reversing it because the actual calculation was controlled in some manner by the value (password) entered by the administrator when the masking operation was performed, where providing a different value yields a different selection. It is important to select a data masking product that supports the ability to seed these types of masking operations.
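A minimal sketch of seeded selection, assuming the administrator's value is used as the key of an HMAC-SHA256 calculation. The name list and the select_masked_name function are hypothetical and do not represent any product's API.

```python
import hashlib
import hmac

# Hypothetical pre-defined list of masked last names.
MASKED_LAST_NAMES = [
    "Archer", "Brooks", "Carver", "Dalton",
    "Ellis", "Foster", "Garner", "Hayes",
]


def select_masked_name(original_key: str, seed: str) -> str:
    """Pick a list entry; the seed entered at run time drives the outcome."""
    mac = hmac.new(seed.encode("utf-8"), original_key.encode("utf-8"), hashlib.sha256)
    index = int.from_bytes(mac.digest(), "big") % len(MASKED_LAST_NAMES)
    return MASKED_LAST_NAMES[index]


# The same seed reproduces the same selections on every run.
first_run = [select_masked_name(f"EMP-{i}", "seed-entered-by-admin") for i in range(5)]
second_run = [select_masked_name(f"EMP-{i}", "seed-entered-by-admin") for i in range(5)]
```

Someone who knows the list and the algorithm but not the seed cannot reproduce the mapping, and a run with a different seed produces a different overall mapping.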
Version controlling also needs to be considered. Over time it may be necessary to make changes to the data masking methods, or to the inputs used in these methods. Consider for example the use of a pre-defined list of masked names from which to select, and the possible requirement to add, change, delete or re-order the list in the future. The data masking solution should ideally account for this, particularly in the event of cross-database deterministic masking.
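One way to picture the versioning concern is the sketch below, which assumes list entries are selected by a hash taken modulo the list length. The list contents, version labels and function names are hypothetical. Under that assumption, adding a single name changes the modulus and silently reassigns many subjects, so each published list is kept immutable and every masking run records the version it used.

```python
import hashlib
import hmac

# Hypothetical versioned name lists; a published version is never edited in place.
NAME_LISTS = {
    "v1": ("Archer", "Brooks", "Carver"),
    "v2": ("Archer", "Brooks", "Carver", "Dalton"),  # one entry added later
}


def select_name(subject_key: str, seed: str, list_version: str) -> str:
    """Select a name from the pinned version of the list."""
    names = NAME_LISTS[list_version]
    mac = hmac.new(seed.encode("utf-8"), subject_key.encode("utf-8"), hashlib.sha256)
    return names[int.from_bytes(mac.digest(), "big") % len(names)]


def mask_run(subject_keys, seed: str, list_version: str) -> dict:
    """Return the masked names together with the list version that produced them."""
    return {
        "list_version": list_version,
        "names": {key: select_name(key, seed, list_version) for key in subject_keys},
    }
```

Re-running with the recorded version reproduces the earlier masked values, which matters most when several databases must stay consistent with each other.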
From a security perspective when multiple masked databases involving the same subjects are available, it may be necessary to consider masking descriptor attributes to help protect the actual identity. A robust data masking solution may need to consider the risk of using one masked database to help identify actual data for a subject in another masked database. The solution needs to protect against an attacker using known information about a subject in one database (i.e. specific values of key descriptors) to locate that subject and a deterministic masked key that is used to extract sensitive data about the subject from another database.
4. Support for Custom Mask Formats and Custom Functions
Most data masking product vendors will promote the value of their product on the capabilities of their out-of-the-box masking functions. These functions can offer significant value and time savings when deploying a solution. The out-of-the-box capabilities are particularly useful for masking descriptor attributes about a subject, such as random generation of a salary figure or a credit card number. Each product has its own method for applying these capabilities in a masking policy, usually in a graphical interface that simplifies the operations. Having the product randomly generate a credit card number might require the data masking administrator to do no more than select the function to use; the result at run time would be a random credit card number that passes the validation (e.g. checksum) for that credit card type.
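For readers curious about what such a function does behind the scenes, here is a rough Python sketch of generating a random card number that passes the Luhn checksum used for payment card numbers. It is not any vendor's implementation; the function names are hypothetical, and the default prefix "4" with a length of 16 is an assumption chosen to resemble a Visa-style number.

```python
import random
from typing import Optional


def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit for a digit string that lacks one."""
    total = 0
    # Walk right to left; the digit next to the check digit is doubled first.
    for position, char in enumerate(reversed(partial)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def is_luhn_valid(number: str) -> bool:
    """Check that the last digit matches the Luhn check digit of the rest."""
    return luhn_check_digit(number[:-1]) == number[-1]


def random_card_number(prefix: str = "4", length: int = 16,
                       rng: Optional[random.Random] = None) -> str:
    """Generate random digits after the prefix, then append a valid check digit."""
    rng = rng or random.Random()
    body_length = length - len(prefix) - 1
    body = prefix + "".join(str(rng.randint(0, 9)) for _ in range(body_length))
    return body + luhn_check_digit(body)
```

The generated number has the right shape and passes checksum validation in the application under test, yet corresponds to no real cardholder record in the source data.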
Without a doubt, these are useful, but most deployments will have some unique requirements that can only be handled with custom masking and/or custom functions. Most products allow the creation of simple custom masking formats, such as a complex set of strings and numbers to match an internal identifier, but this may not satisfy all the requirements.
It is most likely that you will need the ability to incorporate custom programming to satisfy some requirements. For example, consider the requirement for masking SINs. Although a product may be able to generate a valid SIN, will it be able to satisfy a requirement to ensure uniqueness or validation against other tables in the database? These types of unique requirements may require custom development that, ideally, can leverage some of the out-of-the-box capabilities while allowing them to be extended or modified as required. For a solution to have a chance of meeting all your requirements, it should also have the ability to call custom functions at various points in the masking process, as it is fairly certain you will run into challenges that can only be addressed this way.
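As an illustration of the kind of custom function involved, the Python sketch below wraps an existing generator with a uniqueness check. Everything here is hypothetical: random_nine_digits merely stands in for a product's built-in generator (it performs no checksum validation), and the set of used values would in practice be loaded from whichever tables the masked value must be validated against.

```python
import random
from typing import Callable, Set


def random_nine_digits(rng: random.Random) -> str:
    """Stand-in for a built-in generator; no checksum validation is done here."""
    return "".join(str(rng.randint(0, 9)) for _ in range(9))


def unique_mask(generate: Callable[[], str], used: Set[str],
                max_attempts: int = 1000) -> str:
    """Call the generator until it yields a value that has not been used yet."""
    for _ in range(max_attempts):
        candidate = generate()
        if candidate not in used:
            used.add(candidate)  # reserve the value for this subject
            return candidate
    raise RuntimeError("could not generate a unique masked value")


rng = random.Random()
used_values: Set[str] = set()  # in practice, pre-loaded with existing values
masked_values = [unique_mask(lambda: random_nine_digits(rng), used_values)
                 for _ in range(5)]
```

The wrapper leaves the built-in generator untouched and adds only the missing requirement, which is the extension pattern described above.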
5. Masking on Export vs. Masking in Staging
The basic architecture of a data masking solution needs to consider how and where the production data will actually be masked. This goes beyond the masking algorithm applied to an attribute in the database and extends to the process of extracting the sensitive data from the production environment, passing it through a masking engine and writing it out to a repository for the applications used by developers and testers.
One approach is to clone the production database into a data masking staging area on a regular or on-demand basis, and then apply a pre-tested data masking policy against the data to produce a masked database. The major challenge with this approach is that the staging area will contain sensitive production data and will therefore need to be secured appropriately. There are multiple approaches to consider when masking from a cloned database in a staging area.
Another approach is to mask data while exporting it from production. The advantage of this is that the masked output does not need to be stored in a secured location; however, there are implications for the production environment to consider.
It is recommended that you understand your needs and adopt the approach that best serves the organization. However, selecting a solution and implementing an architecture that supports both approaches may be the best overall way to satisfy the current and future needs of the entire organization.
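Conceptually, masking on export means each row is masked as it streams out of production, so an unmasked copy never lands outside it. A minimal Python sketch, assuming rows arrive as dictionaries from some production query; the column names, the policy and the masked_export function are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, str]
MaskingPolicy = Dict[str, Callable[[str], str]]


def masked_export(rows: Iterable[Row], policy: MaskingPolicy) -> Iterator[Row]:
    """Yield masked rows one at a time; columns without a rule pass through."""
    for row in rows:
        yield {
            column: (policy[column](value) if column in policy else value)
            for column, value in row.items()
        }


# Hypothetical policy: one masking function per sensitive column.
policy: MaskingPolicy = {
    "name": lambda _value: "NAME-REDACTED",
    "sin": lambda value: "*" * len(value),
}

production_rows = [{"name": "Jane Doe", "sin": "123456789", "dept": "Finance"}]
exported = list(masked_export(production_rows, policy))
```

Because the masking work happens while reading from production, the extra load falls on the production environment, which is the trade-off noted above.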
Adopting a technical solution that can integrate data subsetting can also prove valuable as it provides the ability to mask smaller amounts of data where possible.
6. The Business Case for Data Masking
Data masking solutions can be implemented to address the needs of a single database; however, there is more value in a solution that can scale to address the needs of more business units. Commercial data masking products tend to be overlooked when the objective is a single database, because of the belief that the needs of a single database can be handled with a targeted point solution. Point solutions often involve custom scripts that are not easily re-used, are not validated and become difficult to maintain. Taking the approach of a broader solution with proper governance and controls is important when building the business case for a commercial solution.
Data masking is commonly viewed as a tool for creating databases for development and testing purposes; however, these products can serve other purposes, such as data sharing with partners, building databases for analytics and as a fundamental component of a cloud strategy. Masking permanently removes sensitive data from the databases, which enables distribution of these databases with much lower, if not eliminated, risk of unauthorized data exposure.
Data masking projects are often considered following a failed audit or a security breach. Implementing an architecture that can scale broadly across the enterprise can offer important capabilities for implementing stronger controls over sensitive data and demonstrate evidence of such control in future audits.
A data masking infrastructure can begin to eliminate custom developed point solutions that typically only apply to a single database and are difficult to maintain, in favour of an enterprise solution that encourages consistency and re-use and that supports improved enterprise-wide data governance.
7. Getting Started
How to begin a data masking initiative is an important consideration. A successful solution needs to be architected in such a way that it can satisfy the broader needs of the enterprise; however, it is also important that the initiative start small and grow to where it can be applied to more databases.
Understanding the intended use of the masked databases is essential. It is easy to overcomplicate the architecture by overthinking the requirements. At the same time, it is important to thoroughly analyze the requirements to ensure that poorly masked databases are not released and inadvertently create new vulnerabilities.
Begin with a pilot that involves 2 or 3 databases to ensure a thorough understanding of the data. Selecting databases that may have a requirement to be linked (logically or physically) is a good idea as it will help to understand the need for deterministic masking and the best approach to implement it.
Although it is always a good idea to establish governance and data masking design principles, it is strongly recommended that experience first be gained with the technology selected. As with any IT solution, every product will have its strengths and weaknesses, but it is the weaknesses and the limitations that they impose that need to be well understood. It will likely be necessary to identify innovative solutions to unique requirements and/or identify any significant shortcomings. It is critical to gain experience with the selected product and its limitations as it will affect the design and the path forward.
Most data masking deployments are undertaken to eliminate the risk of unauthorized exposure of sensitive information from databases that are stored in environments with lower security controls than production. The goal is typically to create masked databases that are no longer sensitive. Data masking solutions provide their best return on investment as they grow into broader enterprise solutions, meaning that they should be architected to grow into masking-as-a-service and that their masking policies, algorithms and supporting masking content and controls should evolve to address additional sensitive data and unforeseen complexities. The net result would be a myriad of masked databases. It is therefore important that security and privacy of the data be viewed holistically to ensure that access to multiple masked databases does not introduce new vulnerabilities (e.g. gleaning information from one database that helps to identify a subject in another database). Seeking input from privacy and security analysts as part of the governance and architecture is recommended.
Consider the risks with a scenario where an attacker obtains access to multiple databases and is knowledgeable of a specific descriptor attribute associated with a subject of interest (e.g. an income value), then searches one database for that known value for the purposes of obtaining the masked identifier of the subject (e.g. the masked SIN), which is then used to locate the subject in another database. Privacy analysts can help identify these types of risks. In situations where the data and risk warrant higher levels of protection, selecting a product or an architecture that can support layers of abstraction may need to be considered.
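The scenario can be made concrete with a small Python sketch that a privacy analyst might use to test for this weakness. The two databases, their columns and all values are invented for illustration; the only assumption carried over from the text is that both databases share the same deterministically masked SIN.

```python
# Two hypothetical masked databases that share a deterministic masked SIN.
payroll_db = [
    {"masked_sin": "M-7731", "income": 83_250},
    {"masked_sin": "M-1048", "income": 61_900},
]
benefits_db = [
    {"masked_sin": "M-7731", "claim_code": "C-17"},
    {"masked_sin": "M-1048", "claim_code": "C-02"},
]


def link_subject(known_income: int, source: list, target: list) -> list:
    """Use a known descriptor value to find a subject's rows in another database."""
    keys = [row["masked_sin"] for row in source if row["income"] == known_income]
    if len(keys) != 1:
        return []  # the descriptor value does not single out one subject
    return [row for row in target if row["masked_sin"] == keys[0]]


# Knowing only the income, the subject's record in the second database is exposed.
linked_rows = link_subject(83_250, payroll_db, benefits_db)
```

Masking the income descriptor, or avoiding a shared masked key across the two databases, would break this link.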
Plan to involve a developer who can take on the needs for custom development. It is highly likely that the shortcomings in any data masking product, and the uniqueness of your requirements, will require some custom development. Don’t be surprised if, in the end, there is more custom development than anticipated, so understanding the real capabilities of the product’s support for custom masking formats and its ability to support user defined functions is essential.
One aspect in particular to understand with respect to creating masking policies is the ability to mask an attribute based on the value of other prerequisite attributes, before or after those prerequisite attributes are masked themselves. For example, a requirement to generate a masked email address based on the user’s first name and last name can only be satisfied by a product that ensures that the first name and last name used are masked first, and available to the email masking function. Similarly, you may have a requirement to base masking decisions on a prerequisite attribute’s original value despite needing to also mask that prerequisite attribute.
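The ordering requirement for the email example can be sketched as follows in Python. The name lists, the example.com domain and the function names are hypothetical, and the unkeyed SHA-256 selection is only a placeholder for whatever protected, seeded method is actually used.

```python
import hashlib

# Hypothetical replacement name lists.
FIRST_NAMES = ("Avery", "Blake", "Casey", "Devon")
LAST_NAMES = ("Archer", "Brooks", "Carver", "Dalton")


def pick(values: tuple, original: str) -> str:
    """Placeholder deterministic selection (unkeyed, so not secure on its own)."""
    digest = hashlib.sha256(original.encode("utf-8")).digest()
    return values[int.from_bytes(digest, "big") % len(values)]


def mask_row(original: dict) -> dict:
    """Mask the prerequisite attributes first, then derive the dependent one."""
    masked = dict(original)
    masked["first_name"] = pick(FIRST_NAMES, original["first_name"])
    masked["last_name"] = pick(LAST_NAMES, original["last_name"])
    # The email is built from the masked names, never from the original ones.
    masked["email"] = "{}.{}@example.com".format(
        masked["first_name"], masked["last_name"]
    ).lower()
    return masked


row = mask_row({"first_name": "Jane", "last_name": "Doe",
                "email": "jane.doe@corp.example"})
```

Because the function keeps both the original row and the masked row in scope, a rule that must consult a prerequisite attribute's original value can read it from the original row even though that attribute is also masked.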
As discussed earlier, the design ultimately has to consider the ability to secure, seed and control versioning for it to pass the scrutiny of privacy and security analysts and to exist as a long-term, lower risk solution. The means of implementing these aspects may not be native to the product, or particularly intuitive, but could be fundamental to an effective design.
Finally, it is suggested to assemble a small and nimble team of data analysts, DBAs, security and privacy analysts, developers and someone with deeper knowledge of and experience with the selected commercial data masking product. This will help to identify the real data masking requirements, avoid unnecessary and overly complicated solutions and overcome hurdles more quickly, while ensuring that the data masking solution has the proper foundation for growth and expansion.
Ultimately the owner or custodian of sensitive data is responsible for ensuring that sensitive personal information is not disclosed in an unauthorized manner. Focusing data protection attention primarily on production environments can leave significant risk outside production. Data masking, when well planned and executed, can significantly reduce the risks from non-production environments but can also provide the most realistic, no-risk test data.
Watch for the next blog in this series which will discuss data masking architecture and ideas for tackling some of the common masking policy challenges.