Enterprise Security Options
Security vendors will tell you that both attacks on corporate IT systems and data breaches are prevalent, so with gobs of data under management, Hadoop provides a tempting target for ‘hackers’. All of which is true, but we have not yet seen Hadoop blamed in a major data breach, so this sort of FUD carries little weight with IT operations. But security is still a requirement! As sensitive information (customer data, medical histories, intellectual property, and just about every other type of data used in enterprise computing) is now commonly stored in Hadoop clusters, the ‘C’ word (Compliance) has become part of the vocabulary. One of the major changes we have seen over the last couple of years has been Hadoop becoming business-critical infrastructure. Another, which flows directly from the first, is IT being tasked with bringing existing clusters in line with enterprise compliance requirements.

This is challenging: a fresh install of Hadoop suffers all the same weak points as traditional IT systems, so it takes work to establish security, and more work to create policies and reports for the compliance team. For clusters already up and running, you need to choose technologies and a deployment roadmap that do not disturb ongoing operations. Additionally, the in-house tools you use to secure things like SAP, or the SIEM infrastructure you use for compliance reporting, may be inadequate or unsuitable for NoSQL platforms.
The number of security solutions compatible with, or even designed for, Hadoop has been the biggest change since 2012. All the major security pillars (authentication, authorization, encryption, key management, and configuration management) are available and viable; in many cases suitable open source options exist alongside commercial ones. The biggest advances came from firms providing enterprise distributions of Hadoop. They have purchased, built, and in many cases contributed back to the open source community, security tools that provide the basics of cluster security. Reviewing the threat-response models discussed previously, there is a compensating security control for each threat vector. Better still, the commercial Hadoop vendors have done much of the integration legwork for services like Kerberos, taking most of the pain out of deployments.
Here are some components and functions that were not available — or not truly viable — in 2012.
• LDAP/AD integration — AD and LDAP integration existed in 2012, but both options have been greatly improved and are now easier to deploy. This area has received perhaps the most attention, and some commercial platforms make integration as simple as filling in a setup wizard. The benefits are obvious: firms can leverage existing access and authorization schemes, and defer user and role management to external sources.
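As a concrete illustration, Hadoop's stock LDAP group mapping can point group resolution at an existing directory through a few core-site.xml properties. This is a minimal sketch: the hostname, bind account, and search base below are hypothetical placeholders, and production deployments would also configure TLS and search filters.

```xml
<!-- core-site.xml fragment: defer group lookups to AD/LDAP.
     Hostnames and credentials below are placeholders. -->
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ad.example.com</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.bind.user</name>
  <value>cn=hadoop-svc,ou=services,dc=example,dc=com</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>dc=example,dc=com</value>
</property>
```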
• Apache Ranger — Ranger is a policy administration tool for Hadoop clusters. It includes a broad set of management functions, including auditing, key management, and fine-grained data access policies across HDFS, Hive, YARN, Solr, Kafka, and other modules. Ranger is one of the few tools to offer a single, central management view for security policies. Better still, policies are context aware: Ranger knows to set file and directory policies in HDFS, SQL policies in Hive, and so on. This helps with data governance and compliance because administrators can now define how data may be accessed and how certain modules should function.
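To make "fine-grained data access policies" concrete, here is a sketch of a Ranger-style policy granting a group read access to one Hive table. The field names follow the shape of Ranger's public REST API, but the service name, database, and endpoint shown in the comments are hypothetical; check them against your own Ranger instance.

```python
import json

# Ranger-style access policy: the "analysts" group may SELECT from
# sales.customers. Service name "hivedev" is a hypothetical example.
policy = {
    "service": "hivedev",
    "name": "analysts-read-customers",
    "resources": {
        "database": {"values": ["sales"]},
        "table":    {"values": ["customers"]},
        "column":   {"values": ["*"]},
    },
    "policyItems": [{
        "groups": ["analysts"],  # resolved via LDAP/AD group mapping
        "accesses": [{"type": "select", "isAllowed": True}],
    }],
}

body = json.dumps(policy)
# An admin tool would POST `body` (with basic-auth credentials) to something
# like http://<ranger-host>:6080/service/public/v2/api/policy; the request is
# omitted here so the sketch stays self-contained.
print(body)
```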
• HDFS Encryption — HDFS offers ‘transparent’ encryption embedded within the Hadoop file system. Data is encrypted as it is written to the file system, transparently, without modification to the applications that use the cluster. HDFS encryption supports the concept of encryption zones: essentially these zones are directories in HDFS where all content (every file and subdirectory) is encrypted. Each zone can use a different key if desired, an important feature for tenant data privacy in multi-tenant clusters. HDFS encryption can be used with Hadoop’s Key Management Service (KMS), or integrated with third-party key management services.
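The administrative flow for an encryption zone is short enough to show end to end. The sketch below assembles the stock `hadoop key` and `hdfs crypto` commands as strings so the sequence is easy to review; the key name and path are hypothetical placeholders.

```python
# Creating an HDFS encryption zone, step by step. The key name and zone
# path below are hypothetical examples.
key_name = "tenant_a_key"
zone_path = "/data/tenant_a"

steps = [
    # 1. Create a key in the KMS (or a third-party provider behind it).
    f"hadoop key create {key_name}",
    # 2. Create the directory that will become the zone (it must be empty).
    f"hdfs dfs -mkdir -p {zone_path}",
    # 3. Mark the directory as an encryption zone tied to that key.
    f"hdfs crypto -createZone -keyName {key_name} -path {zone_path}",
]

for step in steps:
    print(step)
```

From then on, every file written under the zone is encrypted and decrypted transparently for authorized clients.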
• Apache Knox — You can think of Knox as a Hadoop firewall; more precisely, it is an API gateway. It handles HTTP and RESTful requests, enforcing authentication and usage policies on inbound requests and blocking everything else. Knox can be used as a virtual ‘moat’ around a cluster, or combined with network segmentation to further reduce the network attack surface.
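The gateway model is easiest to see in a request URL: clients never address cluster nodes directly, only Knox. This sketch builds a WebHDFS directory listing as it would be routed through a Knox topology; the hostname, port, and topology name ("default") are hypothetical, while the /gateway/&lt;topology&gt;/webhdfs/v1 path shape follows Knox's convention.

```python
from urllib.parse import urlencode

# A WebHDFS LISTSTATUS call proxied through Knox. Host and topology name
# are hypothetical placeholders.
knox_host = "knox.example.com"
topology = "default"
path = "/data/tenant_a"

url = (f"https://{knox_host}:8443/gateway/{topology}"
       f"/webhdfs/v1{path}?" + urlencode({"op": "LISTSTATUS"}))
# A client would issue an authenticated GET against `url`; Knox validates
# credentials, applies its policies, and only then forwards the request to
# the cluster behind it.
print(url)
```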
• Apache Atlas — Atlas is a proposed open source governance framework for Hadoop. It enables annotation of files and tables, establishes relationships between data sets, and can even import metadata from other sources. From a compliance perspective, these features are helpful for reporting, data discovery, and access control. Atlas is new, and we expect to see significant maturation in the coming years, but it already offers valuable tools for basic data governance and reporting.
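As a small illustration of the annotation idea, here is the shape of a classification ("tag") payload in the style of Atlas's REST API. The "PII" tag and the endpoint in the comment are hypothetical; verify both against the Atlas version you run before relying on them.

```python
import json

# Tagging a data set with a "PII" classification; the tag name and target
# entity are hypothetical examples.
classifications = [{"typeName": "PII", "attributes": {}}]
body = json.dumps(classifications)
# A governance tool would POST `body` to something like
#   http://<atlas-host>:21000/api/atlas/v2/entity/guid/<guid>/classifications
print(body)
```

Once tagged, downstream access-control and reporting tools can key off the classification rather than individual file paths.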
• Apache Ambari — Ambari is a facility for provisioning and managing Hadoop clusters. It helps administrators set configurations and propagate changes across the entire cluster. During interviews we only spoke to two firms using this capability, but we received positive feedback from both. We also spoke with a handful of companies who had written their own configuration and launch scripts, with pre-deployment validation checks, mostly for cloud and virtual machine deployments. Homegrown scripts are more time-consuming but offer broader and deeper capabilities, with each function orchestrated within IT operational processes (e.g., continuous deployment, failure recovery, and DevOps). For most organizations Ambari’s ability to get up and running quickly, with consistent cluster management, is a big win and makes it a good choice.
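Ambari exposes cluster state over a REST API, which is also what homegrown scripts typically wrap. The sketch below assembles (but does not send) a read-only call listing a cluster's services; the host and cluster name are hypothetical, while the /api/v1/clusters/&lt;name&gt;/services path and the X-Requested-By header follow Ambari's REST conventions.

```python
# A read-only Ambari REST call, assembled but not sent. Host and cluster
# name are hypothetical placeholders.
ambari_host = "ambari.example.com"
cluster = "prod_hadoop"

url = f"http://{ambari_host}:8080/api/v1/clusters/{cluster}/services"
# Ambari requires this header on state-changing (PUT/POST/DELETE) calls;
# including it on GETs is harmless and keeps scripts uniform.
headers = {"X-Requested-By": "ambari"}
print(url, headers)
```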
• Monitoring — Monitoring takes the concept of logging two steps further: performing real-time analysis on events, and alerting when misuse is detected. Hive, PIQL, Impala, Spark SQL, and similar modules offer SQL or pseudo-SQL syntax, which enables you to leverage activity monitoring, dynamic masking, redaction, and tokenization technologies originally developed for relational platforms. The result is that we can both alert and block on misuse, or provide fine-grained authorization (beyond role-based access control) by altering queries or query result sets based on user metadata. And because these technologies examine queries, they offer an application-centric view of events which log files may not capture.

Your first step in addressing these compliance concerns is mapping your existing governance requirements to a Hadoop cluster, then deciding on suitable technologies to meet data and IT security requirements. Next you will deploy technologies that provide security and reporting functions, and set up the policies to enforce usage controls or detect misuse. Since 2012 many technologies have become available to address common threats without killing scalability and performance, so there is no need to reinvent the wheel. But you will need to assemble these technologies into a coherent system.
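To ground the dynamic masking idea mentioned under Monitoring, here is a toy sketch of rewriting a query based on user metadata before it reaches the engine. Real products hook the SQL layer of Hive, Impala, or similar modules; the column list, role model, and rewrite rules below are entirely hypothetical.

```python
import re

# Toy dynamic masking: users without the (hypothetical) "privileged" role
# see a literal placeholder in place of sensitive columns.
MASKED_COLUMNS = {"ssn", "credit_card"}

def rewrite_select(query: str, user_roles: set) -> str:
    """Rewrite 'SELECT cols FROM ...' to mask sensitive columns."""
    if "privileged" in user_roles:
        return query  # trusted users see raw data

    def mask(match):
        col = match.group(0)
        return f"'***' AS {col}" if col.lower() in MASKED_COLUMNS else col

    head, _, tail = query.partition(" FROM ")
    cols = re.sub(r"\w+", mask, head.removeprefix("SELECT "))
    return f"SELECT {cols} FROM {tail}"

print(rewrite_select("SELECT name, ssn FROM customers", {"analyst"}))
# -> SELECT name, '***' AS ssn FROM customers
```

The same interception point supports alerting and blocking: because the monitor sees the full query and the requesting user, it can log, deny, or rewrite per policy.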