CDAP is highly extensible and supports plugins that allow users to extend its capabilities. This page lists all the plugins available in CDAP. Refer to the community page to learn about contributing your own plugin.

  • ADLSBatchSink
    Sink
    ADLSBatchSink
    Azure Data Lake Store Batch Sink writes data to an Azure Data Lake Store directory in Avro, ORC, or text format.
  • ADLSDelete
    Action
    ADLSDelete
    Deletes a file or files within the ADLS file system.
  • AddField
    Transform
    AddField
    Adds a new field to each record. The field value can either be a new UUID, or it can be set directly through configuration. This transform is used when you want to add a unique id field to each record, or when you want to tag each record with some constant value. For example, you may want to add the logical start time as a field to each record.
  • AmazonS3Client
    Action
    AmazonS3Client
    The Amazon S3 Client Action is used to work with S3 buckets and objects before or after the execution of a pipeline.
  • HTTP Argument Setter
    Action
    HTTP Argument Setter
    Performs an HTTP request to an endpoint to get a driver specification. Based on the spec, it will make another call to a nebula endpoint to get data about the dataset, which it will use to set 'input.path', 'input.properties', 'directives', and 'output.schema' arguments that can be used later on in the pipeline through macros.
  • Avro Dynamic Partitioned Dataset
    Sink
    Avro Dynamic Partitioned Dataset
  • AzureBlobStore
    Source
    AzureBlobStore
    Batch source to use Microsoft Azure Blob Storage as a source.
  • AzureDataLakeStore
    Source
    AzureDataLakeStore
    Azure Data Lake Store Batch Source reads data from Azure Data Lake Store files and converts it into StructuredRecord.
  • AzureDecompress
    Action
    AzureDecompress
    The Azure Decompress Action plugin decompresses gz files from one container on the Azure Blob Storage service into another container.
  • AzureDelete
    Action
    AzureDelete
    The Azure Delete Action plugin deletes a container on the Azure Blob Storage service.
  • AzureEventHub
    Source
    AzureEventHub
    Azure Event Hub streaming source. Emits a record with the schema specified by the user. If no schema is specified, it will emit a record with 'message'(bytes).
  • AzureFaceExtractor
    Transform
    AzureFaceExtractor
  • Google Big Query Table
    Sink
    Google Big Query Table
    This plugin exports a BigQuery table as a sink to be ingested into the processing pipeline. The plugin requires a service account to access the BigQuery table. To configure the service account, visit https://cloud.google.com. Make sure you give the service account the right permissions for accessing the BigQuery API.
  • CDC Google Cloud Bigtable Sink
    Sink
    CDC Google Cloud Bigtable Sink
    This plugin takes input from a CDC source and writes the changes to Cloud Bigtable.
  • CDCDatabase
    Source
    CDCDatabase
    This plugin reads Change Data Capture (CDC) events from a Golden Gate Kafka topic.
  • CDCHBase
    Sink
    CDCHBase
    This plugin takes input from a CDC source and writes the changes to HBase. It will write to the HBase instance running on the cluster.
  • CDCKudu
    Sink
    CDCKudu
    This plugin takes input from a CDC source and writes the changes to Kudu.
  • CSVFormatter
    Transform
    CSVFormatter
  • CSVParser
    Transform
    CSVParser
  • Record Duplicator
    Transform
    Record Duplicator
    Copies every input record to the output a configured number of times. This transform does not change any record fields or types; it is an identity transform.
  • Compressor
    Transform
    Compressor
    Compresses configured fields. Multiple fields can be compressed using different compression algorithms. The plugin supports SNAPPY, ZIP, and GZIP compression.
  • Conditional
    Condition
    Conditional
    A control flow plugin that allows conditional execution within pipelines. Conditions are specified as expressions whose variables can include values specified as runtime arguments of the pipeline, tokens from plugins prior to the condition, and globals that provide information about the pipeline such as the stage, pipeline, logical start time, and plugin.
  • Cube
    Sink
    Cube
    Batch sink that writes data to a Cube dataset.
  • DataFactoryDriver
    Action
    DataFactoryDriver
    Performs an HTTP request to an endpoint to get a driver specification. Based on the spec, it will make another call to a nebula endpoint to get data about the dataset, which it will use to set 'input.path', 'input.properties', 'directives', and 'output.schema' arguments that can be used later on in the pipeline through macros.
  • DataProfiler
    Analytics
    DataProfiler
    Calculates statistics for each input field. For every field, a total count and null count will be calculated. For numeric fields, min, max, mean, stddev, zero count, positive count, and negative count will be calculated. For string fields, min length, max length, mean length, and empty count will be calculated. For boolean fields, true and false counts will be calculated. When calculating means, only non-null values are considered.
  • Database
    Source
    Database
    Writes records to a database table. Each record will be written to a row in the table.
  • DatabaseQuery
    Action
    DatabaseQuery
    Runs a database query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
  • DateTransform
    Transform
    DateTransform
    This transform takes a date in either a unix timestamp or a string, and converts it to a formatted string. (Macro-enabled)
  • DecisionTreePredictor
    Analytics
    DecisionTreePredictor
    Loads a Decision Tree Regression model from a FileSet and uses it to label the records based on the predicted values.
  • DecisionTreeTrainer
    Sink
    DecisionTreeTrainer
    Trains a regression model based upon a particular label and features of a record. Saves this model to a FileSet.
  • Decoder
    Transform
  • Decompress
    Action
    Decompress
  • Field Decompressor
    Transform
    Field Decompressor
  • Field Decrypter
    Transform
    Field Decrypter
    Decrypts one or more fields in input records using a keystore that must be present on all nodes of the cluster.
  • Deduplicate
    Analytics
    Deduplicate
  • Distinct
    Analytics
    Distinct
    De-duplicates input records so that all output records are distinct. Can optionally take a list of fields, which will project out all other fields and perform a distinct on just those fields.
  • ADLS Batch Sink
    Sink
    ADLS Batch Sink
  • DynamicMultiFileset
    Sink
    DynamicMultiFileset
    This plugin is normally used in conjunction with the MultiTableDatabase batch source to write records from multiple databases into multiple filesets in text format. Each fileset it writes to will contain a single 'ingesttime' partition, which will contain the logical start time of the pipeline run. The plugin expects that the filesets it needs to write to will be set as pipeline arguments, where the key is 'multisink.[fileset]' and the value is the fileset schema. Normally, you rely on the MultiTableDatabase source to set those pipeline arguments, but they can also be manually set or set by an Action plugin in your pipeline. The sink will expect each record to contain a special split field that will be used to determine which records are written to each fileset. For example, suppose the split field is 'tablename'. A record whose 'tablename' field is set to 'activity' will be written to the 'activity' fileset.
  • DynamoDB
    Source
    DynamoDB
    DynamoDB Batch Source that reads data items from an AWS DynamoDB table and converts each item into a StructuredRecord according to the schema specified by the user, so it can be further processed downstream in the pipeline. The user can provide a query to read items from the DynamoDB table.
  • Elasticsearch
    Sink
    Elasticsearch
    Takes the Structured Record from the input source and converts it to a JSON string, then indexes it in Elasticsearch using the index, type, and idField specified by the user. The Elasticsearch server should be running prior to creating the application.
  • Email
    Action
    Email
    Sends an email at the end of a pipeline run.
  • Field Encoder
    Transform
    Field Encoder
  • Encryptor
    Transform
    Encryptor
    Encrypts one or more fields in input records using a java keystore that must be present on all nodes of the cluster.
  • ErrorCollector
    Error Handler
    ErrorCollector
    The ErrorCollector plugin takes errors emitted from the previous stage and flattens them by adding the error message, code, and stage to the record and outputting the result.
  • Excel
    Source
    Excel
    The Excel plugin provides the ability to read data from one or more Excel files.
  • FTP
    Source
    FTP
    Batch source for an FTP or SFTP source. Prefix of the path ('ftp://...' or 'sftp://...') determines the source server type, either FTP or SFTP.
  • FTPCopy
    Action
    FTPCopy
    Copies files from an FTP server to the specified destination.
  • Fail Pipeline
    Sink
    Fail Pipeline
    This batch sink fails the running pipeline as soon as the first record flows into it.
  • FastFilter
    Transform
    FastFilter
    Filters out messages based on specified criteria.
  • File
    Source
    File
    File streaming source. Watches a directory and streams file contents of any new files added to the directory. Files must be atomically moved or renamed.
  • File Appender
    Sink
    File Appender
    Writes to a CDAP FileSet in text format. HDFS append must be enabled for this to work. One line is written for each record sent to the sink. All record fields are joined using a configurable separator. Each time a batch is written, the sink will examine all existing files in the output directory. If there are any files that are smaller in size than the size threshold, or more recent than the age threshold, new data will be appended to those files instead of written to new files.
  • File Contents Checker
    Action
    File Contents Checker
    This action plugin can be used to check if a file is empty or if the contents of a file match a given pattern.
  • FileCopySink
    Sink
    FileCopySink
    The File Copy plugin is a sink plugin that takes file metadata records as inputs and copies the files into the local HDFS or the local filesystem.
  • FileDelete
    Action
    FileDelete
    Deletes a file or files.
  • FileMetadataSource
    Source
    FileMetadataSource
    The File Metadata plugin is a source plugin that allows users to read file metadata from a local HDFS or a local filesystem.
  • FileMove
    Action
    FileMove
    Moves a file or files.
  • GCS
    Sink
    GCS
    This plugin writes records to one or more files in a directory on Google Cloud Storage. Files can be written in various formats such as csv, avro, parquet, and json.
  • GCSAvro
    Sink
    GCSAvro
    A GCS sink to write records as AVRO records.
  • Google Cloud Storage Path Create
    Action
    Google Cloud Storage Path Create
    This plugin is used for creating directories on Google Cloud Storage (GCS).
  • Google Cloud Storage Path Delete
    Action
    Google Cloud Storage Path Delete
    This plugin is used for deleting directories on Google Cloud Storage (GCS).
  • Google Cloud Storage File
    Source
    Google Cloud Storage File
  • Google Cloud Storage File Blob
    Source
    Google Cloud Storage File Blob
  • GCSParquet
    Sink
    GCSParquet
    A GCS sink to write records as AVRO records into Parquet files.
  • GCSText
    Sink
    GCSText
    A GCS sink to write records as comma-, tab-, pipe-, or CTRL+A-separated text files, or as JSON text files.
  • GoldenGateNormalizer
    Transform
    GoldenGateNormalizer
  • GooglePublisher
    Sink
    GooglePublisher
  • GoogleSubscriber
    Source
    GoogleSubscriber
  • Group By
    Analytics
    Group By
  • HBase
    Source
    HBase
    Batch source that reads from a column family in an HBase table. This source differs from the Table source in that it does not use a CDAP dataset, but reads directly from HBase.
  • HDFS
    Sink
  • HDFSDelete
    Action
    HDFSDelete
    Deletes a file or files within an HDFS cluster.
  • HDFS File Merge
    Action
    HDFS File Merge
    Merges small files within an HDFS cluster.
  • HDFSMove
    Action
    HDFSMove
    Moves a file or files within an HDFS cluster.
  • HTTP
    Sink
    HTTP
    Sink plugin that sends messages from the pipeline to an external HTTP endpoint.
  • HTTPCallback
    Action
    HTTPCallback
    Performs an HTTP request at the end of a pipeline run.
  • HTTPPoller
    Source
    HTTPPoller
  • HTTPToHDFS
    Action
    HTTPToHDFS
    Action to fetch data from an external HTTP endpoint and create a file in HDFS.
  • Hasher
    Transform
  • Hive Bulk Export
    Action
    Hive Bulk Export
  • Hive Bulk Import
    Action
    Hive Bulk Import
  • JSONFormatter
    Transform
    JSONFormatter
  • JSONParser
    Transform
    JSONParser
    Parses an input JSON event into a record. The input JSON event could be either a map of string fields to values or it could be a complex nested JSON structure. The plugin allows you to express JSON paths for extracting fields from complex nested input JSON.
  • JavaScript
    Transform
    JavaScript
    Executes user-provided JavaScript that transforms one record into zero or more records. Input records are converted into JSON objects which can be directly accessed in JavaScript. The transform expects to receive a JSON object as input, which it can process and emit zero or more records or emit error using the provided emitter object.
  • Joiner
    Analytics
  • KVTable
    Source
    KVTable
    Reads the entire contents of a KeyValueTable, outputting records with a 'key' field and a 'value' field. Both fields are of type bytes.
  • Kafka
    Alert Publisher
  • Kafka Alert Publisher
    Alert Publisher
    Kafka Alert Publisher
    Publishes alerts to Kafka as JSON objects. The plugin internally uses the Kafka producer APIs to publish alerts, and allows you to specify the Kafka topic to publish to as well as additional Kafka producer properties. This plugin uses the Kafka 0.10.2 Java APIs.
  • KinesisSink
    Sink
    KinesisSink
    Kinesis sink that outputs to a specified Amazon Kinesis Stream.
  • KinesisSource
    Source
    KinesisSource
    Spark streaming source that reads from AWS Kinesis streams.
  • Kudu
    Source
    Kudu
    CDAP plugin for reading data from an Apache Kudu table.
  • LoadToSnowflake
    Action
    LoadToSnowflake
  • LogParser
    Transform
    LogParser
    Parses logs from any input source for relevant information such as URI, IP, browser, device, HTTP status code, and timestamp.
  • LogisticRegressionClassifier
    Analytics
    LogisticRegressionClassifier
    Loads a Logistic Regression model from a file of a FileSet dataset and uses it to classify records.
  • LogisticRegressionTrainer
    Sink
    LogisticRegressionTrainer
    Trains a classification model based upon a particular label and features of a record. Saves this model to a FileSet.
  • MLPredictor
    Analytics
    MLPredictor
    Uses a model trained by the ModelTrainer plugin to add a prediction field to incoming records. The same features used to train the model must be present in each input record, but input records can also contain additional non-feature fields. If the trained model uses categorical features, and if the record being predicted contains new categories, that record will be dropped. For example, suppose categorical feature 'city' was used to train a model that predicts housing prices. If an incoming record has 'New York' as the city, but 'New York' was not in the training set, that record will be dropped.
  • MQTT
    Source
    MQTT
    The MQTT Streaming Source allows you to subscribe to an MQTT broker in a streaming context. You specify the topic to subscribe to as an MQTT client.
  • MainframeReader
    Source
    MainframeReader
    This is a source plugin that allows users to read and process mainframe files.
  • MapR-DB JSON Table
    Sink
    MapR-DB JSON Table
    The MapR-DB JSON Table sink writes JSON documents to a MapR-DB table.
  • MapR Stream Consumer
    Source
    MapR Stream Consumer
    MapR streaming source. Reads events from a MapR stream.
  • MultiFieldAdder
    Transform
    MultiFieldAdder
    The Multi Field Adder transform allows you to add one or more fields to the output. Each field specified has a name and a value. Values are currently always of type string.
  • Multiple Database Tables
    Source
    Multiple Database Tables
    Reads from multiple tables within a database using JDBC. Often used in conjunction with the DynamicMultiFileset sink to perform dumps from multiple tables to HDFS files in a single pipeline. The source will output a record for each row in the tables it reads, with each record containing an additional field that holds the name of the table the record came from. In addition, for each table that will be read, this plugin will set pipeline arguments where the key is 'multisink.[tablename]' and the value is the schema of the table. This is to make it work with the DynamicMultiFileset.
  • NaiveBayesClassifier
    Analytics
    NaiveBayesClassifier
    Loads a Naive Bayes model from a file of a FileSet dataset and uses it to classify records.
  • NaiveBayesTrainer
    Sink
    NaiveBayesTrainer
    Using a Naive Bayes algorithm, trains a model based upon a particular label and text field of a record. Saves this model to a file in a FileSet dataset.
  • Normalize
    Transform
    Normalize
    Normalize is a transform plugin that breaks one source row into multiple target rows. Attributes stored in the columns of a table or a file may need to be broken into multiple records: for example, one record per column attribute. In general, the plugin allows the conversion of columns to rows.
  • NullFieldSplitter
    Transform
    NullFieldSplitter
  • OracleExport
    Action
    OracleExport
    A Hydrator Action plugin to efficiently export data from Oracle to HDFS or local file system. The plugin uses Oracle's command line tools to export data. The data exported from this tool can then be used in Hydrator pipelines.
  • OrientDB
    Sink
    OrientDB
    Writes data to an OrientDB database.
  • PDFExtractor
    Transform
    PDFExtractor
  • Parquet Dynamic Partitioned Dataset
    Sink
    Parquet Dynamic Partitioned Dataset
  • Projection
    Transform
    Projection
    The Projection transform lets you drop, keep, rename, and cast fields to a different type. Fields are first dropped based on the drop or keep field, then cast, then renamed.
  • PySpark Program
    Action
    PySpark Program
    Executes user-provided Spark code in Python.
  • PythonEvaluator
    Transform
    PythonEvaluator
    Executes user-provided python code that transforms one record into zero or more records. Each input record is converted into a dictionary which can be directly accessed in python. The transform expects to receive a dictionary as input, which it can process and emit zero or more transformed dictionaries, or emit an error dictionary using the provided emitter object.
  • RecordSplitter
    Transform
    RecordSplitter
    Given a field and a delimiter, emits one record for each split of the field.
  • RedshiftToS3
    Action
    RedshiftToS3
  • Repartitioner
    Analytics
    Repartitioner
    This plugin repartitions a Spark RDD.
  • RowDenormalizer
    Analytics
    RowDenormalizer
    Converts raw data into denormalized data based on a key column. The user can specify the list of fields that should be used in the denormalized record, with an option to use an alias for the output field name. For example, 'ADDRESS' in the input can be mapped to 'addr' in the output schema.
  • Run
    Transform
    Run
    Runs an executable binary that is installed and available on the local filesystem of the Hadoop nodes. The Run transform plugin reads a structured record as input and returns an output record, which can be further processed downstream in the pipeline.
  • S3
    Source
    S3
    Batch source to use Amazon S3 as a Source.
  • S3Avro
    Sink
    S3Avro
    A batch sink for writing to Amazon S3 in Avro format.
  • Amazon S3 Whole File Copier
    Sink
    Amazon S3 Whole File Copier
    The S3 File Copy plugin is a sink plugin that takes file metadata records as inputs and copies the files into an Amazon S3 filesystem.
  • S3FileMetadataSource
    Source
    S3FileMetadataSource
    The S3 File Metadata plugin is a source plugin that allows users to read file metadata from an S3 Filesystem.
  • S3Parquet
    Sink
    S3Parquet
    A batch sink to write to S3 in Parquet format.
  • S3ToRedshift
    Action
    S3ToRedshift
    S3ToRedshift Action that loads data from an AWS S3 bucket into an AWS Redshift table.
  • SFTPCopy
    Action
    SFTPCopy
  • SFTPDelete
    Action
    SFTPDelete
  • SFTPPut
    Action
  • Remote Program Executor
    Action
    Remote Program Executor
    Establishes an SSH connection with a remote machine to execute a command on that machine.
  • Sampling
    Analytics
    Sampling
    Samples a large dataset flowing through this plugin to pull random records. Supports two types of sampling: systematic sampling and reservoir sampling.
  • ScalaSparkCompute
    Analytics
    ScalaSparkCompute
    Executes user-provided Spark code in Scala that transforms RDD to RDD with full access to all Spark features.
  • ScalaSparkProgram
    Action
    ScalaSparkProgram
    Executes user-provided Spark code in Scala.
  • ScalaSparkSink
    Sink
    ScalaSparkSink
    Executes user-provided Spark code in Scala that operates on an input RDD or Dataframe with full access to all Spark features.
  • SnapshotAvro
    Sink
    SnapshotAvro
    A batch sink for a PartitionedFileSet that writes snapshots of data as a new partition. Data is written in Avro format. A corresponding SnapshotAvro source can be used to read only the most recently written snapshot.
  • SnapshotParquet
    Source
    SnapshotParquet
    A batch source that reads from a corresponding SnapshotParquet sink. The source will only read the most recent snapshot written to the sink.
  • SnapshotText
    Sink
    SnapshotText
    A batch sink for a PartitionedFileSet that writes snapshots of data as a new partition. Data is written in Text format.
  • Spanner
    Source
    Spanner
    This source reads from a Google Cloud Spanner table. Cloud Spanner is a fully managed, mission-critical, relational database service that offers transactional consistency at global scale, schemas, SQL (ANSI 2011 with extensions), and automatic, synchronous replication for high availability.
  • Google Cloud Speech-to-Text
    Transform
    Google Cloud Speech-to-Text
    This plugin converts audio files to text by using Google Cloud Speech-to-Text.
  • SpeechTranslator
    Transform
    SpeechTranslator
  • StateRestore
    Action
    StateRestore
  • CDAP Stream
    Source
    CDAP Stream
  • StructuredRecordToGenericRecord
    Transform
    StructuredRecordToGenericRecord
    Transforms a StructuredRecord into an Avro GenericRecord.
  • TMS
    Alert Publisher
    TMS
    Publishes alerts to the CDAP Transactional Messaging System (TMS) as json objects. The plugin allows you to specify the topic and namespace to publish to, as well as a rate limit for the maximum number of alerts to publish per second.
  • Avro Time Partitioned Dataset
    Sink
    Avro Time Partitioned Dataset
  • TPFSOrc
    Sink
  • TPFSParquet
    Sink
    TPFSParquet
  • Table
    Source
    Table
    Outputs the entire contents of a CDAP Table each batch interval. The Table contents will be refreshed at configurable intervals.
  • ToUTF8
    Action
  • TopN
    Analytics
    TopN
    Top-N returns the top "n" records from the input set, based on the criteria specified in the plugin configuration.
  • Trash
    Sink
    Trash
    Trash consumes all input records and discards them; no output is generated or stored anywhere.
  • Twitter
    Source
    Twitter
    Samples tweets in real-time through Spark streaming. Output records will have this schema:
  • UnionSplitter
    Transform
    UnionSplitter
    The union splitter is used to split data by a union schema, so that type-specific logic can be done downstream.
  • Validator
    Transform
    Validator
    Validates a record, writing to an error dataset if the record is invalid. Otherwise it passes the record on to the next stage.
  • ValueMapper
    Transform
    ValueMapper
    Value Mapper is a transform plugin that maps string values of a field in the input record to a mapping value using a mapping dataset.
  • VerticaBulkExportAction
    Action
    VerticaBulkExportAction
    Bulk exports data from a Vertica table into a file.
  • VerticaBulkImportAction
    Action
    VerticaBulkImportAction
    The Vertica Bulk Import Action plugin executes after a successful MapReduce or Spark job. It reads all the files in a given directory and bulk imports the contents of those files into a Vertica table.
  • Window
    Analytics
    Window
    The Window plugin is used to window a part of a streaming pipeline.
  • WindowsShareCopy
    Action
    WindowsShareCopy
    Copies a file or files on a Microsoft Windows share to an HDFS directory.
  • Wrangler
    Transform
    Wrangler
    This plugin applies data transformation directives on your data records. The directives are generated either through an interactive user interface or by manual entry into the plugin.
  • XMLMultiParser
    Transform
    XMLMultiParser
    The XML Multi Parser Transform uses XPath to extract fields from an XML document. It will generate records from the children of the element specified by the XPath. If there is some error parsing the document or building the record, the problematic input record will be dropped.
  • XMLParser
    Transform
    XMLParser
    The XML Parser Transform uses XPath to extract fields from a complex XML event. This plugin should generally be used in conjunction with the XML Reader Batch Source. The XML Reader will provide individual events to the XML Parser, which will be responsible for extracting fields from the events and mapping them to the output schema.
  • XMLReader
    Source
    XMLReader
    The XML Reader plugin is a source plugin that allows users to read XML files stored on HDFS.
  • XML to Json String
    Transform
    XML to Json String
    Accepts a field that contains a properly formatted XML string and outputs a properly formatted JSON string version of the data. This is meant to be used with the JavaScript transform for parsing complex XML documents into parts. Once the XML is a JSON string, you can convert it into a JavaScript object in the JavaScript transform, as sketched below:
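    A minimal sketch, assuming the JSON string produced by this plugin lands in a hypothetical field named 'body' and that the JavaScript transform's usual transform(input, emitter, context) entry point is used; JSON.parse is the standard call for turning the string into an object:

      function transform(input, emitter, context) {
        // 'body' is a hypothetical field holding the JSON string produced by this plugin
        var obj = JSON.parse(input.body);
        // use the parsed object as needed, then emit the (possibly modified) record
        input.parsedOk = (obj !== null);
        emitter.emit(input);
      }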