Simplifying Mainframe Data Access
An overview of the IBM® InfoSphere® System z Connector for Hadoop™
Did you know?
For many organizations, IBM z/OS® mainframes form the backbone of mission-critical business applications, securely and reliably storing and processing massive volumes of data day after day. Faced with spiraling data volumes and new business demands, organizations are looking for new and cost-efficient ways to get more value out of critical mainframe data while ensuring that security and data integrity are maintained.
Several specific challenges are compelling organizations to revisit how they handle mainframe data. Among these are:
Increased need for data access – With an increased focus on analytics in many businesses, mainframe-resident data is increasingly needed to support downstream analytic models
Efficient allocation of skills – Organizations need to deploy scarce talent efficiently and are looking for ways to answer needs from the business without the need for specialized expertise or custom programming
Speeding the delivery of new applications – Businesses are under pressure to deliver new applications involving mainframe data while containing application maintenance costs
New application types – Business units are increasingly interested in combining mainframe transactional data with data from other sources to improve customer service, gain efficiencies and support the creation of new service offerings
The need for access to mainframe data is not new. Reporting systems and data access tools have existed for decades. What is new, however, are several industry trends affecting how organizations think about data. Among these are:
The plummeting cost of storage – at a commodity level, the cost of 1 TB of storage is in the range of $50 USD
The relative ease of data capture, and increasing availability of data in electronic form
Emerging toolsets able to deal efficiently with non-traditional data types such as log files, images, documents and unstructured text
A rapid and accelerating growth in the volumes and variety of data; itself a function of the three trends above.
As a result of these challenges and industry trends, organizations face a complex set of challenges around data architecture. They need mainframe data access strategies that satisfy the operational challenges described above while ensuring that systems are “future proof” and able to take advantage of new innovations.
The business value of new approaches to mainframe data access comes as a result of enabling a platform for secure and more cost efficient analysis of data, while simultaneously supporting a broader set of applications in support of new business initiatives.
While no technology solves every problem, Hadoop gets a lot of attention for new workloads. While some associate Hadoop with new data types like video, text or social media graphs, the reality is different. In a survey published in 2013, Gartner Research found that 70% of respondents cited transaction data (followed by log file data) as the most common targets of initiatives involving Hadoop – the very data types common in mainframe environments.
A discussion of Hadoop is beyond the scope of this paper, but it is worth reviewing some of the properties that make it interesting to IT professionals.
Capacity and Scalability – As storage costs plummet, and organizations see increased value in retaining more data for longer periods, organizations need to think about managing and processing data at a scale that would have been unimaginable just a decade ago. As data sets grow into the petabytes, Hadoop often emerges as the only game in town.
Cost efficiency – In addition to capacity and scalability, Hadoop has the desirable property that it can reliably store data at a very low cost per terabyte, particularly as environments grow large.
Open standards-based – While there are several commercial distributions with proprietary extensions, Hadoop is essentially open at its core with all vendors building on core Apache Hadoop components. This makes Hadoop a low risk technology. It can be sourced in a competitive environment with little risk of proprietary lock-in to a particular vendor.
Hadoop is now “good enough” for many use cases – This is perhaps the most important point. While taking on a Hadoop project required deep skills in Java a few years ago, the tools in modern Hadoop distributions are much more accessible. While Hadoop is not going to replace transactional systems or optimized columnar databases any time soon, for many applications it is now up to the task, and the economic benefits are compelling.
As Hadoop becomes a more important technology in the modern data center, quality solutions for efficient, high-speed connectivity between the mainframe and Hadoop clusters become essential.
The IBM System z Connector for Hadoop
The IBM InfoSphere System z Connector for Hadoop is a software offering from IBM that provides fast and seamless data connectivity between a variety of mainframe data sources and a variety of destinations, including Hadoop environments, and automates the transfer of data between them.
Among the supported data sources on z/OS are:
System log files and operator log files
Customers can easily extract data from z/OS sources without the need for mainframe-based SQL queries, custom programming, or specialized skills. Once data is in Hadoop, clients can use tools in the Hadoop platform to process and analyze data. Hadoop processing can take place on an external cluster connected to the zEnterprise® mainframe, or directly on mainframe Linux partitions using the System z Integrated Facility for Linux (IFL) for added security.
The following benefits can be realized by mainframe customers transferring data to Hadoop when using the System z Connector for Hadoop.
No need for mainframe system programming
Fast access to mainframe-resident data
Streaming interface with no need for intermediate data staging
Minimal consumption of mainframe MIPS
In-flight data format conversion with no intermediate data staging
Minimal load on the z/OS environment ensures that data transfers will not interfere with other mainframe workloads
In later sections we will discuss some of the capabilities of the System z Connector in more detail.
The IBM InfoSphere System z Connector consists of a set of services that run mainly on a Linux node along with a few components installed on z/OS. The architecture is flexible, and the Linux node may be a discrete Intel® or Power based system or a zLinux virtual machine running on the System z mainframe. An architectural diagram is provided in Figure 1 and the key components are described below.
Figure 1: The System z Connector Architecture
vHub is the component that is at the heart of the System z Connector. vHub defines interfaces that connect the various components of the System z Connector platform. vHub uses the vStorm Connect facility to access the source data on z/OS via agents deployed on the z/OS environment.
Data is copied from a defined source to a target. On the mainframe side, data is extracted in raw binary form without creating temporary copies of data. Data is then streamed to the target Linux platform. This binary extraction technology is one of the key reasons that the System z Connector is efficient and consumes minimal mainframe resources. To explain the distinction: when data is extracted from DB2 on z/OS using an SQL query, considerable mainframe processing occurs. Queries need to be interpreted, processing occurs, data is read, and query results are written to DASD. The System z Connector skips all of these steps. It simply reads from the binary DB2 data source directly and streams the binary data off the mainframe to a Linux node, where the binary data stream is processed and interpreted in memory without the need to stage the data to an intermediate file. This approach is the key reason that the System z Connector consumes minimal z/OS MIPS and no mainframe DASD.
On the Linux platform, vHub works with the binary data and converts it to the target format, handling issues such as EBCDIC to ASCII conversion in various code pages and packed decimal data types, and then stores the data in the target big data platform.
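To make the conversion step more concrete, the following is a minimal sketch of the two kinds of decoding mentioned above: translating an EBCDIC text field and unpacking a packed decimal (COMP-3) field. It is an illustration only and not the connector's actual code. The function names, the choice of code page 037, and the 13-byte record layout are assumptions made for this example.

```python
from decimal import Decimal


def decode_ebcdic(raw: bytes, codepage: str = "cp037") -> str:
    """Decode an EBCDIC text field and strip trailing pad spaces."""
    return raw.decode(codepage).rstrip()


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed decimal (COMP-3) field.

    Every byte holds two 4-bit digits, except that the low nibble of the
    last byte is the sign (0xB or 0xD means negative, anything else is
    treated as positive here). `scale` is the number of implied decimal
    places, which normally comes from the record layout.
    """
    if not raw:
        raise ValueError("empty packed decimal field")
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    *digits, sign = nibbles
    if any(d > 9 for d in digits):
        raise ValueError("invalid packed decimal digit")
    value = int("".join(str(d) for d in digits))
    if sign in (0x0B, 0x0D):
        value = -value
    return Decimal(value).scaleb(-scale)


def record_to_csv_fields(record: bytes) -> list:
    """Convert one hypothetical 13-byte record to CSV-ready strings.

    Assumed layout: bytes 0-9 are an EBCDIC name, bytes 10-12 are a
    COMP-3 amount with two implied decimal places.
    """
    name = decode_ebcdic(record[:10])
    amount = unpack_comp3(record[10:13], scale=2)
    return [name, str(amount)]
```

In this sketch the whole record is decoded in memory, which mirrors the idea of interpreting the binary stream on the Linux side rather than on z/OS.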
vConnect is a term used to describe a set of connectors that work with vHub described above. The following data sources are supported by the System z Connector:
DB2 for z/OS Versions 9, 10, and 11
QSAM (Sequential)
System Log (SYSLOG)
Operator Log (OPERLOG)
System Management Facilities (SMF) Record Types 30 and 80
Resource Measurement Facility (RMF)
JDBC (For other Relational Database Management Systems)
On the other side of the vHub components are the targets that the System z Connector can move data to. These different targets are:
HDFS - The Hadoop distributed file system. The System z Connector will interface to the HDFS NameNode and move mainframe data as comma separated values (CSV) or Avro files directly into the Hadoop distributed file system.
Hive - Metadata is written to the Hive server reflecting the source data schema, and file data is moved to HDFS, thus making data available on Hadoop for HiveQL queries. Hive is worth explaining briefly since its concepts may not be familiar to some readers. Hive is not a “data format” in the sense that a relational database would have its own on-disk storage format. Hive is rather a facility that enables a schema to be imposed on existing data residing in HDFS to make it queryable using HiveQL, an SQL-like language. Data that exists in Hive is already stored in HDFS. InfoSphere BigInsights users can access this data directly using Big SQL, an ANSI compliant SQL implementation that is able to directly read and process data stored in Hive.
Linux File System - In addition to landing data in HDFS, the System z Connector can also land data directly into a Linux file system. The file system can reside within the mainframe in the zLinux environment or on nodes external to the mainframe. This provides additional flexibility. Data written to a local file system can be used by downstream ETL tools/applications as well as analytics tools/applications. This flexibility is important as clients may want to move data not only to Hadoop, but to other environments as well.
Network End point - The System z Connector can also send data to a network end-point listening on a specific port. The data is made available to the listener as streaming bytes in CSV format. Organizations can build their own applications to parse and process data originating from the mainframe on the fly. As long as the receiving software can open and read data from a TCP/IP socket connection, it can receive data being streamed by the System z Connector.
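As an illustration of the network end-point target, the sketch below shows what a simple receiving application might look like: it listens on a TCP port, accepts one connection, and parses the incoming bytes as CSV rows as they arrive. The function names, the UTF-8 encoding, and the one-connection-per-transfer behavior are assumptions made for this example and do not describe how the connector itself opens or closes connections.

```python
import csv
import socket
from typing import Callable, Iterator, List


def rows_from_connection(conn: socket.socket) -> Iterator[List[str]]:
    """Yield CSV rows from an open connection until the sender closes it."""
    # makefile() wraps the socket in a text stream so csv.reader can
    # consume it line by line without buffering the whole transfer.
    with conn, conn.makefile("r", encoding="utf-8", newline="") as stream:
        yield from csv.reader(stream)


def serve_once(host: str, port: int,
               handle_row: Callable[[List[str]], None]) -> int:
    """Accept a single connection and pass every row to handle_row.

    Returns the number of rows received.
    """
    count = 0
    with socket.create_server((host, port)) as server:
        conn, _addr = server.accept()
        for row in rows_from_connection(conn):
            handle_row(row)
            count += 1
    return count
```

A receiver could be started with a call such as serve_once("0.0.0.0", 9099, print), where the port number is arbitrary and must match whatever is configured for the transfer target.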
The System z Connector Management Console - The System z Connector Management Console is the graphical user interface that manages user interactions with the System z Connector for Hadoop system. It is a multi-tenant UI based on an applet/servlet interface. The Java applet executing in the browser is served by the J2EE server (Tomcat).
While not all of the capabilities of the System z Connector are described here, some highlights are provided to give readers a sense of the connector and how it is used.
Specifying data sources
When users define a new connection to a data source they are guided through a step-by-step process. Data sources for a transfer include data on DASD, Unix System Services (USS), DB2 on z/OS or various log files.
Online help is provided through the browser interface so that authorized users can configure data sources for mainframe transfer.
While System z Connector for Hadoop users will need to consult with mainframe administrators initially to understand details and gather mainframe credentials, once these credentials are provided, users of the System z Connector for Hadoop can become self-sufficient in setting up data transfers.
Specifying Data Targets
Just as the process of selecting sources is guided by an interactive wizard, the same is true of targets. Targets for mainframe data may be the HDFS file system on Hadoop clusters, Hive tables, a Linux file system, or a network end-point where a service is listening on a TCP/IP port.
A nice feature of the System z Connector for Hadoop is that there can be multiple concurrent connection definitions. It is therefore possible to move data on the System z environment not to just one target, but to multiple targets, including multiple Hadoop clusters.
For example, a user may elect to move sensitive data to a Hadoop cluster configured on zLinux partitions within the System z environment and other less sensitive data to an external Hadoop cluster.
Filtering transferred data
Often when transferring data from the mainframe we are only interested in a subset of the rows or columns from a given data source. Rather than transfer the whole data set and filter it on the receiving Hadoop cluster, the System z Connector allows us to filter data on the fly.
As shown in figure 3, data transfers can be configured either to select only individual data columns to be transferred, or to filter rows based on configurable criteria. This improves flexibility by ensuring that we are transferring only the data required.
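The idea of filtering on the fly can be sketched in a few lines. The example below is a conceptual illustration only, not the connector's implementation: it applies a row criterion and a column selection to records as they stream past, so unwanted data is never materialized at the target. The function name, the dictionary-based row representation, and the sample column names are assumptions made for this example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, str]


def filter_stream(rows: Iterable[Row],
                  columns: List[str],
                  keep: Callable[[Row], bool]) -> Iterator[Row]:
    """Yield only the selected columns of the rows that match `keep`.

    Rows are processed one at a time, so the full data set is never
    held in memory or written to an intermediate file.
    """
    for row in rows:
        if keep(row):
            yield {name: row[name] for name in columns}


# Hypothetical source records, standing in for a mainframe data stream.
source = [
    {"cust_id": "1001", "region": "EU", "ssn": "xxx", "balance": "250.00"},
    {"cust_id": "1002", "region": "US", "ssn": "yyy", "balance": "90.10"},
    {"cust_id": "1003", "region": "EU", "ssn": "zzz", "balance": "12.75"},
]

# Transfer only EU customers, and only two of the four columns.
eu_rows = list(filter_stream(source,
                             columns=["cust_id", "balance"],
                             keep=lambda r: r["region"] == "EU"))
```

Because the filter is a generator, it composes naturally with a streaming reader and writer on either side.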
Automated scheduling of transfers
Often it is desirable to move mainframe data to particular destinations automatically on a periodic basis. For example, we may want to keep a daily view of Mainframe logs on the Hadoop cluster for downstream processing or move customer transaction data nightly.
With a built-in cron-like scheduling facility, the System z Connector for Hadoop allows users to configure automated transfers that repeat based on a configurable recurring pattern. By automating transfers we reduce workloads, and also reduce the opportunity for human error with manual processes.
Furthermore, the graphical interface may be used to define and test a particular type of data transfer. Once defined, however, these transfers can be invoked outside of the graphical interface. Transfers may be scripted, or may run under the control of a mainframe job scheduling system. This is an important capability for sites with complex requirements that may need to transfer thousands of files daily.
The capabilities above to configure data sources, targets and to specify how the data is configured help make it easy to transfer data from various Mainframe sources to multiple destinations.
Understanding the business need
To illustrate a typical scenario where it may be necessary to combine mainframe data in Hadoop with data from other sources, let’s spend a moment focusing on a specific example. While we base this example on a hypothetical retailer, it is applicable to a broad range of similar use cases found in telecommunications, healthcare, insurance, banking and other industries.
An International retailer
Consider the case of a major international retailer. The retail business has always been complex, but has become especially so in the last decade with increasingly connected consumers, intense competition and fast moving product cycles and trends.
Major retailers frequently operate in many countries and distribute products from a broad set of suppliers and manufacturers through multiple sales channels. For example, sales channels may include retail locations, catalog stores, and country-specific e-commerce websites.
For many retailers, key operational data exists on a corporate mainframe. This includes operational and transactional data related to customers, suppliers, stockable items, inventory levels and more.
Among the many challenges that retailers face is maintaining optimal levels of inventory across warehouses and distribution centers in various geographies. If a retailer finds themselves with inadequate supply prior to seasonal retail events like Christmas or Black Friday there can be a significant opportunity cost. On the other hand, retailers that find themselves with excess inventory on hand face increased carrying costs, and potential restocking charges if goods are not moved quickly.
Mainframe systems have long been augmented by data warehouses and decision support systems (DSS) to help address this challenge. Data warehouse environments are frequently subject to many complex queries. Organizations that can operate from a larger set of data, and who have good predictive models, have a significant advantage over competitors.
An example of a query against a data warehouse might be to assess the degree to which a change in a pricing strategy will affect inventory for various items:
For all items whose price was changed on a given date, compute the percentage change in inventory between the 30-day period BEFORE the price change and the 30-day period AFTER the change. Group this information by warehouse.
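To show the shape of this computation, here is a minimal sketch in Python over a handful of in-memory records. It is illustrative only; in practice such a query would run in the data warehouse or in a SQL-on-Hadoop engine. The function name, the tuple layout of the inventory records, the choice to sum quantity on hand over each window, and the decision to count the day of the price change as part of the AFTER window are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import date, timedelta


def inventory_change_by_warehouse(inventory, changed_items, change_date,
                                  window_days=30):
    """Percentage change in inventory per warehouse around a price change.

    inventory: iterable of (warehouse, item, day, quantity_on_hand)
    changed_items: set of items whose price changed on change_date
    """
    window = timedelta(days=window_days)
    before = defaultdict(int)
    after = defaultdict(int)
    for warehouse, item, day, quantity in inventory:
        if item not in changed_items:
            continue
        if change_date - window <= day < change_date:
            before[warehouse] += quantity
        elif change_date <= day < change_date + window:
            after[warehouse] += quantity
    # Warehouses with no inventory in the BEFORE window are skipped
    # because a percentage change is undefined for them.
    return {
        w: 100.0 * (after[w] - before[w]) / before[w]
        for w in before if before[w]
    }


# Hypothetical sample data.
sample = [
    ("W1", "A", date(2014, 5, 20), 100),
    ("W1", "A", date(2014, 6, 10), 80),
    ("W2", "A", date(2014, 5, 25), 50),
    ("W2", "A", date(2014, 6, 5), 75),
    ("W1", "B", date(2014, 5, 20), 999),  # price unchanged, ignored
]
result = inventory_change_by_warehouse(sample, {"A"}, date(2014, 6, 1))
```

With this sample data, warehouse W1 shows a 20 percent decrease and W2 a 50 percent increase.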
Other queries of interest might be to explore item color and size preferences by demographic and geography with various constraints, or to consider items most frequently returned by sales outlet, product class or manufacturer to estimate the effect of stocking these frequently returned items on profit margins. Needless to say, the business of optimizing inventory levels is complex.
Managing inventory has become more challenging than ever
While a data warehouse is essential to making good decisions, a challenge for most organizations is that the analysis described above is based on history. Decision support systems are only able to query and analyze data that they have on hand in the data warehouse.
In the age of big data, they may be missing “forward looking” data points that if incorporated into their analysis can help forecast required inventory with more precision.
As some examples:
Web browsing behaviors and search terms parsed from server logs can provide visibility to customer interest and help predict demand for specific product categories in specific geographies.
Analysis of customer service channels like chat, e-mail support and recorded call center conversations can provide an early warning about products or manufacturers with quality problems that are likely to lead to future returns.
Analyzing publicly available data from social media can be a gold mine of information about customer sentiment and trending topics.
Using web-crawling techniques can help retailers understand what competitors are offering in geographies where they do business, helping them price products more effectively and predict sales volume.
We’ve focused on examples related to inventory levels, but this is really just the tip of the iceberg. The real opportunity that arises from having a more complete view of the customer is to provide tailored offers to customers or communities of customers at the time they are most receptive to the offers.
Retailers may be interested in using social media data to assess the effectiveness of advertising campaigns, or may use dynamic pricing algorithms based on a customer’s pattern of access to a website or their purchase history.
A better approach is required
The example above illustrates the need to combine information from existing sources with information from new and non-traditional sources. Because of its utility in handling large and diverse data types cost-efficiently, Hadoop is often the platform where the diverse data sources converge:
Customer demographic & transactional data from the mainframe
Historical data from operational warehouses
Supplier and store level data from various sources
Social media data gathered from Twitter and public data aggregators
E-mail and chat data from customer service centers
Web-crawlers that track competitive promotions and prices
Recorded call center conversations converted to text
Consolidated log files from the various services that comprise the e-commerce infrastructure supporting various geographies
Streamlined data transfer from the mainframe becomes essential
To support the new types of applications described above, timely access to information from all data sources including the mainframe becomes essential. As Hadoop based tools are increasingly used to view and analyze data in new and interesting ways, there is increased pressure to quickly provide continuous access to data in required formats. It is no longer practical to address requirements individually and engage a system programmer to help with every request.
Mainframe data transfers to Hadoop need to be:
Self-serve, and not require the engagement of people with mainframe expertise
Secure – to ensure that truly sensitive data is not inadvertently exposed in less secure environments like Hadoop
Automated – Downstream models will require continuous access to the latest and greatest data, so data transfers should be able to run in an automated fashion without operator intervention
Flexible – transfers need to be able to draw data from multiple z/OS data sources and land data to multiple systems both Hadoop and non-Hadoop
Efficient – data transfers must be fast and efficient and must avoid the need for intermediate data storage and processing in order to keep costs down and avoid additional unnecessary copies of data
Other common use cases
In section 4 we provided an example of where the System z Connector might be used to solve a specific business problem. In that case, a retailer was seeking to incorporate new data sources into its analysis to better predict customer purchasing trends and allocate inventory more efficiently across warehouse locations. There are numerous other examples as well.
Some of the more common use cases for the System z Connector for Hadoop are discussed below.
ETL processing off-load
ETL processing is a common use case for Hadoop. While offerings such as IBM InfoSphere DataStage are well suited for ETL requirements involving structured data on the mainframe, Hadoop may offer better ETL for other data types. Often ETL is described as “ELT” in Hadoop environments, reflecting the fact that transformation operations are performed after data is loaded into Hadoop.
Performing ETL in Hadoop becomes more viable as datasets grow very large or when they can benefit from the parallel processing facilities inherent in Hadoop to reduce total processing times. Hadoop is well suited to process semi-structured or unstructured data, because Hadoop provides specific facilities for dealing with these types of data.
Mainframe log-file analysis
Another important usage scenario for Hadoop and the mainframe is the analysis of various mainframe log file formats. IBM System Management Facilities (SMF) is a component of IBM’s z/OS for mainframe computers that provides a standardized method for writing out activity to a dataset. SMF provides complete instrumentation of baseline activities on the mainframe including I/O, network activity, software usage, processor utilization and more. Add-on components including DB2, CICS, WebSphere MQ and WebSphere Application Server provide their own log-file-type reporting using SMF.
Hadoop, and BigInsights in particular, provide rich facilities for parsing, analyzing and reporting on log files of all types. By analyzing logs using tools in BigInsights, clients can realize a number of benefits including:
Understanding usage patterns by user, application and group
Identifying issues prior to them affecting production applications
Gathering trending information useful for capacity planning
Finding intrusion attempts, other security issues or evidence of fraudulent activity
Because Hadoop is designed to support large data sets, clients can retain raw log data for longer periods than would otherwise be feasible. More data helps clients discover longer term trends related to utilization and variance in activities.
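As a simple illustration of the first benefit listed above, understanding usage patterns by user, the sketch below totals CPU time per user from log records that have already been converted to CSV. It is a toy example and not a BigInsights tool. The column names job, user and cpu_seconds and the sample values are hypothetical and do not reflect the actual layout of SMF records.

```python
import csv
import io
from collections import Counter


def cpu_seconds_by_user(csv_text: str) -> Counter:
    """Sum the cpu_seconds column per user from CSV text with a header row."""
    totals = Counter()
    for record in csv.DictReader(io.StringIO(csv_text)):
        totals[record["user"]] += float(record["cpu_seconds"])
    return totals


# Hypothetical job-level records, as they might look after transfer.
sample_log = """job,user,cpu_seconds
PAYROLL1,ALICE,1.5
BILLING7,BOB,0.25
PAYROLL2,ALICE,2.5
"""

usage = cpu_seconds_by_user(sample_log)
# most_common() ranks users by total CPU time, heaviest first.
heaviest_user, heaviest_total = usage.most_common(1)[0]
```

The same grouping pattern extends to application or group by changing the key column. On a Hadoop cluster the equivalent aggregation would typically be expressed in a query language or a parallel job rather than in a single process.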
The System z Connector for Hadoop supports a number of different deployment models. The appropriate model will depend on a client’s environment and may be affected by a number of considerations.
The source for the connector is generally data on the z/OS mainframe (although the connector can be used to source data from other sources as well via JDBC). There are a number of options however related to target environments. These include:
InfoSphere BigInsights (Hadoop) running on the Mainframe
InfoSphere BigInsights running on local infrastructure
InfoSphere BigInsights running on IBM or third-party cloud services
Third party Hadoop distributions (such as Cloudera or Hortonworks) running on local infrastructure
Third party Hadoop distributions running on IBM or other cloud services
We’ll consider the pros (and cons) of these different approaches briefly.
InfoSphere BigInsights on the Mainframe
IBM InfoSphere BigInsights is IBM’s enterprise-grade Hadoop offering. It is a complete Hadoop distribution that provides the same standard Hadoop components available in other Hadoop distributions along with additional features. In addition to running on standard Intel hardware, InfoSphere BigInsights can be run on zLinux guests running in the virtualized mainframe environment. In this environment each virtual machine configured in the z/VM environment will correspond to a Hadoop node. The System z Connector can connect directly to the BigInsights cluster running in the zLinux environment and use HiperSockets for a high-performance secure connection between z/OS and the Hadoop environment.
Mainframe customers may find this approach attractive when:
They are dealing with sensitive data and want to keep all processing within the security perimeter of the mainframe
Most of the data being processed originates on the mainframe
Datasets are large but small enough to be handled economically on the mainframe (tens of terabytes as opposed to petabytes)
The customer wants to take advantage of integration features between InfoSphere BigInsights and mainframe tools such as DB2.
InfoSphere BigInsights on a separate local cluster
IBM InfoSphere BigInsights can also be deployed on commodity Intel-based clusters and connected to the mainframe via one or more 10 GbE connections. This approach may be advantageous when:
The customer has the capacity in-house to manage a distributed cluster environment discrete from the mainframe
The client is comfortable with moving copies of mainframe data off the mainframe onto a local cluster
Most of the data volumes are originating off the mainframe
The environment is expected to grow very large (hundreds of terabytes or petabytes) and the client wants to take advantage of commodity components
InfoSphere BigInsights in the Cloud
In addition to being deployable on premises, IBM InfoSphere BigInsights can optionally be deployed in the cloud. IBM provides an “Enterprise Hadoop-as-a-service offering” called IBM BigInsights on Cloud that enables customers to get started quickly on high-performance bare-metal infrastructure and avoids the cost and complexity of managing Hadoop clusters on premises. As a part of IBM’s corporate agreement with Twitter, select configurations of IBM’s cloud services include Twitter’s decahose service along with an application tool that makes it easy for customers to incorporate up-to-date Twitter data into analytic applications on the Hadoop cluster.
Customers will find this approach attractive when:
Much of the data originates in the cloud or from external to the organization (as examples, social data, data from external aggregation services or data feeds)
The client does not want to be bothered with managing local infrastructure
The client wishes to have a variable cost model where they can adjust capacity up (or down) rapidly as business requirements change
The client is comfortable moving corporate data up to the dedicated infrastructure on the cloud service, or their analytic requirements are such that they can avoid the need to do this.
Third-party Hadoop distributions on premise
Many customers have made corporate decisions around Hadoop and may already have Hadoop clusters deployed on premises. The System z Connector for Hadoop uses standard Hadoop interfaces, so from a technical standpoint it should be straightforward to connect to open-source or commercial Hadoop clusters. Popular third-party Hadoop environments including Cloudera and Hortonworks are supported by IBM. It will be important to check that the specific third-party Hadoop environment is supported because support may vary depending on the version of the Hadoop distribution. This approach will be attractive when customers:
Have already standardized on a third-party Hadoop environment
See no need to run Hadoop components on the mainframe
Do not require the value-added capabilities in IBM’s BigInsights Hadoop distribution
Third-party Hadoop distributions in the cloud
Just as there are a variety of Hadoop solutions that can be deployed on premises, there is an even wider variety of cloud providers offering Hadoop services. The decision about what provider to select is complex, and there are many factors to consider, but assuming that the third-party provider exposes standard Hadoop services, technically the IBM InfoSphere System z Connector should work. It will be important to check with IBM that the System z Connector is supported with the chosen third-party Hadoop-as-a-service offering. This approach will be appropriate when:
The customer has already selected a preferred Hadoop-in-the-cloud provider
The customer is comfortable with moving mainframe data to an external cloud provider, or the application requirements are such that there is no need to do so.
The data sets originate primarily in the cloud, or the dataset sizes are small enough to make network transfer from the mainframe to the cloud provider practical
It is worth spending a few minutes considering hybrid environments. Real data center environments are often complex supporting multiple applications and user communities, so hybrid environments may emerge as the most practical solution.
Organizations may elect to have separate Hadoop clusters deployed on the mainframe, on local infrastructure and with one or more external cloud service providers. Clusters may be supporting different applications or different lines of business with different business and technical requirements.
Consider the following example:
A business has most of its customer and transaction data on the mainframe. The data is sensitive and includes financial information. The business wishes to exploit Hadoop-based tools to parse, transform and cleanse the data to remove identifiable information before it is comfortable moving the data outside the mainframe security perimeter.
Additionally, the business is deploying a new sentiment analysis application that will incorporate Twitter data. Since the Twitter data is “born in the cloud”, it will be faster and more cost-efficient to process the raw Twitter data on a cloud-based cluster and move summary data back to on-premises systems.
The client also has additional data types including web-server logs, chat logs and e-mail data that it would like to analyze together with summary social data and cleansed data coming from the mainframe. To meet this requirement and archive the data over time, it makes sense for the client to deploy a third cluster to combine and retain the summarized data coming from the two other clusters: the mainframe cluster used for processing and cleansing customer transactions and the cloud-based cluster processing social media data.
In this example it is perfectly reasonable to have three separate clusters supporting different business requirements. An attraction of using BigInsights as a standard Hadoop deployment is that clients can use the software across all three tiers to simplify the environment and improve administrative efficiency.
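The cleansing step in the first part of this example, removing identifiable information before data leaves the mainframe security perimeter, can be sketched as follows. This is one possible approach shown for illustration, not a description of any BigInsights feature: direct identifiers are dropped, and the customer key is replaced with a keyed hash so that records can still be joined downstream. The field names, the function names, and the use of HMAC-SHA-256 are assumptions made for this example.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Replace an identifier with a keyed hash (same input, same output)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def cleanse(record: dict, key: bytes,
            id_fields=("customer_id",),
            drop_fields=("name", "card_number")) -> dict:
    """Drop direct identifiers and pseudonymize join keys in one record."""
    cleaned = {}
    for field, value in record.items():
        if field in drop_fields:
            continue  # never copy direct identifiers to the output
        if field in id_fields:
            cleaned[field] = pseudonymize(value, key)
        else:
            cleaned[field] = value
    return cleaned


# Hypothetical transaction record and secret key.
secret = b"keep-this-key-inside-the-perimeter"
raw = {"customer_id": "1001", "name": "J. Smith",
       "card_number": "4111111111111111", "amount": "123.45"}
safe = cleanse(raw, secret)
```

In this sketch the secret key stays with the cluster inside the security perimeter, so the hashed identifiers cannot be recomputed on the external clusters.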
As mentioned above, analysis of social data may be one of the applications best suited to deployment in the public cloud since the data is in essence “born in the cloud” and there are fewer privacy concerns since the information being analyzed originates outside of the corporate firewall. With this in mind, IBM is developing solutions that make it easier than ever for clients to start incorporating social data into their analytic applications.
While social media data from various sources, including Twitter, can be accessed and analyzed on any major Hadoop distribution, IBM is helping customers get productive quickly by including Twitter data as a part of the service.
The IBM InfoSphere System z Connector for Hadoop runs on Linux hosts on or off the System z environment. The following environments are supported.
Supported Hadoop environments that may be targets for the System z Connector for Hadoop include:
IBM InfoSphere BigInsights on System z (version 2.1.2)
IBM InfoSphere BigInsights on Intel Distributed Clusters
IBM InfoSphere BigInsights on Power Systems
IBM BigInsights on Cloud
On-premises Cloudera CDH clusters
On-premises Hortonworks HDP clusters
Veristorm zDoop (open source Hadoop offering for System z Linux)
The System z Connector components on the mainframe will normally be deployed on Integrated Facility for Linux (IFL) processors on System z environments.
Information about how Linux is deployed on IBM z/VM environments can be found in the Redbook publication titled “The Virtualization Cookbook for IBM z/VM 6.3, RHEL 6.4, and SLES 11 SP3”.