Cask Data Application Platform
CDAP is an open source framework to build and deploy data applications on Apache™ Hadoop®. CDAP is an abstraction layer on top of Hadoop and other open source infrastructure such as HBase, Hive, Tephra, and Tigon that enables developers to rapidly build, and operations to easily manage, real-time and batch data applications.
CDAP is oriented around the concepts of Datasets, Applications, and Services, and is supported by Tools, Packs, and Sample Apps.
CDAP Datasets are logical representations of data stored in HDFS and HBase. Datasets provide the layer for writing data from applications, agnostic to the underlying storage engine. They allow you to encapsulate your application's data access patterns in reusable libraries.
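As a rough illustration of that idea (not CDAP's actual Dataset API), the sketch below separates a storage interface from a reusable access pattern built on top of it. The names `StorageEngine`, `InMemoryEngine`, and `CounterDataset` are hypothetical and exist only for this example.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class StorageEngine(ABC):
    """Hypothetical storage interface; stands in for an engine such as HBase."""

    @abstractmethod
    def get(self, key: str) -> Optional[int]:
        ...

    @abstractmethod
    def put(self, key: str, value: int) -> None:
        ...


class InMemoryEngine(StorageEngine):
    """Toy backend used only for illustration."""

    def __init__(self) -> None:
        self._rows: Dict[str, int] = {}

    def get(self, key: str) -> Optional[int]:
        return self._rows.get(key)

    def put(self, key: str, value: int) -> None:
        self._rows[key] = value


class CounterDataset:
    """Reusable access pattern: named counters on top of any engine."""

    def __init__(self, engine: StorageEngine) -> None:
        self._engine = engine

    def increment(self, name: str, delta: int = 1) -> int:
        # Read-modify-write; a real system would need this to be atomic.
        current = self._engine.get(name) or 0
        self._engine.put(name, current + delta)
        return current + delta
```

Application code would only talk to `CounterDataset`, so swapping `InMemoryEngine` for another engine would not change the calling code.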
CDAP Applications consist of programs that use different open-source processing frameworks such as MapReduce, Spark and realtime Flow. CDAP comes with program containers to integrate each processing framework and provide a standardized way to develop, deploy, and manage programs.
CDAP Services are system-level services that are commonly required to support data and applications in development and production environments. These include application management, metadata management, streams, and security.
CDAP Tools include developer tools such as Maven archetypes, SDKs, debugging tools, a testing framework, and an operational UI.
CDAP supports different runtimes for various environments. You can run your entire application on a single computer, or as a distributed runtime with execution in the YARN containers of a Hadoop cluster.
CDAP sub-projects include additional SDKs & tools for interaction with CDAP, applications, and reusable big data components called packs.
- CDAP Ingest is a repository of client SDKs, daemons and tools for ingesting data into CDAP.
- CDAP Apps is a repository of data applications built using CDAP. Try out the apps or, if you are interested in contributing applications, we welcome contributions.
- CDAP Packs is a repository of useful and reusable building blocks for your data applications. They consist of libraries for common data patterns and programs useful for building Big Data applications.
Why CDAP?
Any application developer building a Big Data application is primarily concerned with five areas:
Data Collection: A method of getting data into the system, so that it can be processed. CDAP distinguishes these ways of collecting data:
- A system or application service may poll an external source for available data and then retrieve it (“pull”), or external clients may send data to a public endpoint of the platform (“push”).
- Data can come steadily, one event at a time (“realtime”) or in bulk, many events at once (“batch”).
- Data can be acquired with a fixed schedule (“periodic”) or whenever new data is available (“on-demand”).
CDAP provides Streams as a means to push events into the platform in real-time or in batch. It also provides tools that pull data in batch, be it periodic or on-demand, from external sources.
Streams are a special type of dataset within CDAP that are exposed as a push endpoint for external clients. They support ingesting events in realtime at massive scale or in batch. Events in the stream can then be consumed by applications in real-time or batch.
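The push-and-consume model described above can be pictured with a toy append-only log. This is only a sketch under simplifying assumptions (single process, in memory); `ToyStream` and its methods are hypothetical and are not CDAP's Stream implementation or client API.

```python
from typing import List, Tuple


class ToyStream:
    """Hypothetical append-only event log that clients push into."""

    def __init__(self) -> None:
        self._events: List[str] = []

    def push(self, event: str) -> None:
        """Realtime ingestion: one event at a time."""
        self._events.append(event)

    def push_batch(self, events: List[str]) -> None:
        """Batch ingestion: many events at once."""
        self._events.extend(events)

    def read(self, offset: int, limit: int) -> Tuple[List[str], int]:
        """Return up to `limit` events from `offset`, plus the next offset."""
        chunk = self._events[offset:offset + limit]
        return chunk, offset + len(chunk)
```

A realtime consumer would call `read` with a small `limit` and keep track of its offset, while a batch consumer would read a large range in one call.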
Data Exploration: One of the most powerful paradigms of Big Data is the ability to collect and store data without knowing details about its structure. These details are only needed at processing time. An important step—between collecting the data and processing it—is exploration; that is, examining data with ad-hoc queries to learn about its structure and nature.
CDAP provides the ability to expose queryable datasets. Currently, the Hive query language is used to interact with datasets; other systems such as Impala may be supported in the future.
Data Processing: After data is collected, we need to process it in various ways.
- Raw events are filtered and transformed into a canonical form, to ensure quality of input data for down-stream processing.
- Events (or certain dimensions of the events) are counted or aggregated in other ways.
- Events are annotated and used by an iterative algorithm to train a machine-learned model.
- Events from different sources are joined to find associations, correlations, or other views across multiple sources.
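Three of the processing steps listed above (filtering and transforming, counting, and joining) can be sketched in a few lines. This is a generic illustration, not CDAP code; the `user,page` input format and the function names are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


def canonicalize(raw: str) -> Optional[Dict[str, str]]:
    """Filter and transform: parse 'user,page' lines, dropping malformed ones."""
    parts = [p.strip().lower() for p in raw.split(",")]
    if len(parts) != 2 or not all(parts):
        return None
    return {"user": parts[0], "page": parts[1]}


def count_by(events: Iterable[Dict[str, str]], dimension: str) -> Counter:
    """Aggregate: count events along one dimension."""
    return Counter(e[dimension] for e in events)


def join_on_user(
    events: Iterable[Dict[str, str]], profiles: Dict[str, str]
) -> List[Dict[str, str]]:
    """Join: attach a country from a second source, keyed by user."""
    return [
        {**e, "country": profiles[e["user"]]}
        for e in events
        if e["user"] in profiles
    ]
```

Malformed input is dropped in `canonicalize`, so the later steps can assume every event has both fields.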
Processing can happen in realtime, where a stream processor consumes events immediately after they are collected. Such processing has less expressive power than other processing paradigms, but provides insights into the data in a very timely manner. CDAP offers Flows as the realtime processing framework.
Processing can also happen in batch, where many events are processed at the same time to analyze an entire data corpus at once. Batch processing is more powerful than realtime processing, but due to its very nature is always time-lagging and is often performed over historical data. In CDAP, batch processing can be done via MapReduce or Spark, and it can also be scheduled on a periodic basis as part of a workflow. CDAP also supports running real-time and batch processing on the same input Stream and storing the resulting data into a dataset with the required degree of isolation and consistency.
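The difference between the two modes can be seen in a small, generic sketch (not CDAP's Flow, MapReduce, or Spark APIs; the class and function names are hypothetical). The realtime counter has an up-to-date partial answer after every event, while the batch function only answers once it has seen the whole corpus.

```python
from collections import Counter
from typing import Iterable


class RealtimeCounter:
    """Updates its result one event at a time, as events arrive."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def on_event(self, page: str) -> None:
        self.counts[page] += 1


def batch_count(corpus: Iterable[str]) -> Counter:
    """Processes the entire corpus in a single pass, after the fact."""
    return Counter(corpus)
```

Run over the same input, both approaches converge on the same totals; they differ in when the result becomes available.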
Data Storage: The results of processing data must be stored in a persistent and durable way that allows other programs or applications to further process or analyze the data. In CDAP, data is stored in datasets using the abstraction layer provided by CDAP, and domain APIs provided by datasets. This allows different data processing paradigms to interact with the dataset in their own way; in turn, this provides the flexibility in processing that a developer is looking for.
Data Serving: The ultimate purpose of processing data is not to store the results, but to make these results available to people and other applications. For example, a web analytics application may find ways to optimize the traffic on a website. However, these insights are worthless without a way to feed them back to the actual web application. CDAP allows serving datasets to external clients through procedures and services.
You can download the latest CDAP Standalone ZIP or the CDAP Standalone VM from the sidebar. If you are downloading the ZIP archive, you have three simple prerequisites:
- JDK 7 (required to run CDAP; note that $JAVA_HOME should be set)
- Node.js (v0.10.* and higher; required to run the CDAP UI)
- Apache Maven 3.1+ (required to build the example applications bundled with Standalone)
If you are interested in picking up the latest code, please follow these steps:
After the build completes, you will have a distribution of the CDAP Standalone under the
For more build options, please refer here.
Getting Started with Installing CDAP and Running an Application
Visit our web site for a Getting Started that will guide you through installing CDAP and running an example web analytics application that provides insights about web usage through the analysis of web traffic.
Now that you’ve had a look at the CDAP SDK, take a look at:
- Examples, located in the /examples directory of the CDAP SDK;
- Selected Examples (demonstrating basic features of the CDAP) are located on-line; and
- Developers’ Manual, located in the source distribution in
Released on: August 22, 2016
This major release of CDAP introduces the following new capabilities:
- Authorization – Fine-grained, role-based authorization of CDAP entities and an integration with Apache Sentry.
- Impersonation and Encryption – Abilities to run CDAP and CDAP applications as different users, and capabilities to store sensitive configurations in a secure keystore.
- Cask Hydrator:
  - Join & Actions – Capabilities to join multiple data sources in data pipelines. Capabilities to configure actions that run binaries on designated nodes.
  - Spark Streaming – Capabilities to build and run Hydrator pipelines using Spark Streaming.
  - Plugins – Support for XML, Mainframe (COBOL Copybook), Encryption/Decryption
- Cask Tracker:
  - Data usage analytics – Ability to report application usage of datasets
  - Metadata Taxonomy – Support for annotating business metadata using a business-specified taxonomy
- Dataset and logging performance improvements
- Support for CDH 5.8
More information can be found at:
Released on: July 1, 2016
This is a bug fix release that includes critical fixes to program run states, bug fixes for the MapR distribution, and improved performance in Hydrator Studio.
Released on: June 7, 2016
This release is a bug fix release that includes critical fixes for Spark in CDAP and Cask Hydrator.
Released on: May 12, 2016
This release is a bug fix release that includes fixes for an issue that prevents canceling of YARN delegation tokens while running Hive on Spark, and additional bug fixes to improve the startup behavior of standalone and distributed CDAP.
Released on: April 29, 2016
This release introduces a fresh new look to Cask Hydrator, and improvements to it that extend beyond just data ingestion use cases. CDAP 3.4 also introduces Cask Tracker, a new CDAP extension that provides visibility into how data is being utilized in a Data Lake. You will also see significant improvements in support for Spark and Spark Streaming in CDAP, as well as a number of platform improvements to provide better usability to CDAP users.
Released on: August 21, 2016
This is a bug fix release for CDAP that includes improvements to the Log Saver in CDAP and improved program launch performance to avoid large CPU spikes.
Released on: July 21, 2016
This is a bug fix release for CDAP that includes improvements to the Log Saver in CDAP and reduces a ZooKeeper watch leak in CDAP Master.
Released on: July 1, 2016
This is a bug fix release for CDAP that includes a critical fix for program run states.
Released on: May 19, 2016
This is a bug fix release for CDAP that fixes issues with the HDFS delegation token in HA mode, and fixes Explore jobs so that they properly use the latest delegation tokens.
Released on: April 15, 2016
This is a bug fix release for CDAP that fixes an issue that prevented MapReduce programs from running on clusters with encryption.
Released on: March 8, 2016
This is a bug fix release for CDAP that fixes issues with running Spark programs in CDAP on Spark 1.4+ enabled Cloudera Manager clusters and other minor platform issues. This release also adds support for CDH 5.6.
Released on: Feb 19, 2016
This is a bug fix release for CDAP which includes major bug fixes for Metadata, UI, and Hydrator. This also includes critical fixes for CDH 5.5 support.
Released on: Jan 20, 2016
This release of CDAP includes new functionality and improvements to CDAP Metadata and Cask Hydrator, as well as an improved overall installation experience. It also adds support for CDH 5.5.
- Metadata entities in CDAP are now automatically annotated with certain properties and tags, which makes it easier to discover these components.
- The newer Cask Hydrator supports DAGs in pipelines, has improved schema validation and also experimental support to run ETL pipelines in Spark instead of MapReduce.
- CDAP 3.3.0 also delivers an improved installation experience by providing capabilities in CDAP Master service to check for prerequisites.
Released on: Dec 16, 2015
This is a bug-fix release for CDAP with fixes for CDAP to work on Hadoop High Availability clusters.
Released on: Oct 21, 2015
This is a bug fix release which includes fixes and improvements related to Hydrator, and the CDAP SDK.
Released on: Sept 23, 2015
In this release we have added Cask Hydrator: a highly-functional, redesigned framework and UI to support self-service ingestion and ETL for Hadoop Data Lakes. Hydrator provides CDAP users a code-free way to configure, deploy, and operationalize ingestion pipelines from different types of data sources.
We have also made several enhancements to the CDAP platform:
- Significant Metadata enhancements: Business Metadata, Data Discovery, Audit/Lineage
- Stream Views
- Major improvements to Datasets and MapReduce programs
- Support for the latest versions of HBase (1.1) and Hortonworks Data platform (2.3)
Released on: Sept 4, 2015
This release is a bug fix release that addressed bugs related to CDAP SDK, Explore, Readless increments and logback-container.xml.
Released on: Aug 18, 2015
This is a bug fix release which includes fixes related to the CDAP UI, CLI and Spark.
Released on: Aug 2, 2015
This release also includes many improvements and enhancements in Workflows, Datasets and Metrics.
Released on: Nov 6, 2015
This is a bug fix release which includes some critical fixes to the CDAP SDK.
Released on: Sept 4, 2015
This is a bug fix release that prevents stream events that have already been processed from being reprocessed in flows.
Released on: Aug 25, 2015
This release is a bug fix release that addressed bugs related to the Readless increments in CDAP, HBaseQueueDebugger and CDAP’s logback-container.xml.
Released on: July 17, 2015
This release is a bug fix release that addresses bugs with the dataset upgrade tool, and CDAP UI.
Released on: June 23, 2015
This release is a bug fix release for release-3.0.0 that contains fixes for UI, CDAP-SDK and CDAP-VM.
Released on: May 5, 2015
The core new feature we are introducing with 3.0 is called Application Templates. Application Templates are implementations of Hadoop use cases that are reusable through configuration and extensible through plugins; they can easily be managed and run in CDAP.
Other major features that are part of this release include a slick new UI, enhanced metrics and workflow support, an OLAP Cube dataset to perform complex data aggregations, fine-grained views of logs by run-id of CDAP programs, support for core Table datasets queryable from Hive, and the ability to attach a schema to streams to understand several data formats: syslog, Apache Common Log Format, and any custom format.
Released on: January 25, 2016
This is a bug fix release that fixes a connection leak in TransactionClient, and adds a timeout for idle connections in the CDAP Router.
Released on: July 17, 2015
This is a bug fix release that addresses bugs in transactions and CDAP UI.
Released on: March 23, 2015
This release has many new features including Namespaces (provides application and data isolation that enables multi-tenancy), improved Queue Performance in the Flow system, Fork and join capabilities in Workflow system, a new Metrics storage layer and APIs, richer Time-partitioned File datasets, a Notification system that asynchronously triggers Workflows based on data availability, and more operability improvements.
Released on: February 5, 2015
This release is a major update that supports integration with Cloudera Manager via their Custom Service Descriptor framework. It also contains a new, experimental dataset type to support time-partitioned file sets that can be queried with Impala on CDH distributions.
Released on: May 20, 2015
This is a bugfix release which addresses issues with the transaction snapshot codec interactions with CDAP and Tephra.
Released on: March 23, 2015
This release is a minor update that contains some operability improvements.
Released on: January 29, 2015
This release is a minor update that contains a few bug fixes.
Released on: January 9, 2015
This release is a major update that contains changes to the programmatic API for configuring Services and MapReduce Jobs, health checks for system services, a new FileSet dataset for working with files, various Spark, Metrics, and Service improvements, and various bug fixes. Procedures have been deprecated in this release.
Released on: November 14, 2014
This release is a minor update that contains a reorganization of the documentation and bug fixes for security, classloading, and running in a secure Hadoop cluster.
Released on: October 15, 2014
This release is a minor update that fixes bugs with the CDAP Command Line Interface, removes dependencies on SNAPSHOT artifacts for netty-http and auth-clients, and corrects problems in both the CDAP Authentication and Stream Clients.
Released on: September 25, 2014
This is the first open source release of CDAP. We are carrying over the release version of this technology from being proprietary into OSS. Hence, the starting version of the first open source release is 2.5.0.
We would love to get contributions from you. You don’t need to be a Hadoop expert to contribute to CDAP projects. If you are a Hadoop developer, we welcome your contributions too. We have projects of all sizes and flavors for everyone to contribute to.
- If you like to solve complex distributed system problems, love developers, love APIs, love simplicity—project cdap is for you and is built by people like you.
- If you are a geek who savors joy in building reusable libraries and see them being used by others—cdap-packs is an opportunity for you to build such libraries in the big data space.
- If you are someone who likes to build domain-specific applications that can solve real-world big data problems, we have a seat for you in cdap-apps.
- We love every developer on this planet and value their contributions. If you feel like you want to be part of this changing world of Hadoop, take a look at our issues and consider submitting a patch. Review our roadmap and provide feedback.
We have a simple pull-based development model with a consensus-building phase, similar to Apache’s voting process. If you’d like to help make cdap and other sub-projects of CDAP better by adding new features, enhancing existing features, or fixing bugs, here’s how to do it:
- If you are planning a large change or contribution, discuss your plans on the cdap-dev mailing list first. This will help us understand your needs and better guide your solution in a way that fits the project.
- Fork cdap or cdap subprojects into your own GitHub repository.
- Create a topic branch with an appropriate name.
- Work on the code to your heart’s content.
- Once you’re satisfied, create a pull request from your GitHub repo (it’s helpful if you fill in all of the description fields).
- After we review and accept your request, we’ll commit your code to the repository.
Thanks for helping to improve CDAP and CDAP subprojects!