Plugins: Usage & Development

Plugins: Registry

The plugin registry stores plugins intended for pre-processing and post-processing of data. This feature comes in handy:
    During ETL: cleaning up data, adding columns, enforcing data types, removing or encrypting PHI ...
    In a collaborative environment (Jupyter-x; Zeppelin; AWS Service Workbench)

Plugins: Architecture & Design

Plugins follow a plugin architecture built on the Iterator design pattern. They function as a pipeline, i.e. they are executed sequentially in the order in which they are listed in the "plugins" parameter. Effectively, the output of one function becomes the input to the next.
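
The sequential behaviour described above can be sketched in a few lines. This is only an illustration of the chaining idea, not data-transport's internal implementation: the functions strip_names, add_id and run_pipeline are hypothetical names, and plain lists of dictionaries stand in for the pandas.DataFrame that real plugins receive.

```python
from functools import reduce


def strip_names(rows):
    """Cleanup step: trim whitespace around the 'name' field."""
    return [{**row, "name": row["name"].strip()} for row in rows]


def add_id(rows):
    """Enrichment step: add a sequential '_id' field to every row."""
    return [{**row, "_id": i} for i, row in enumerate(rows)]


def run_pipeline(data, steps):
    """Apply each step in order; the output of one is the input to the next."""
    return reduce(lambda acc, step: step(acc), steps, data)


# Hypothetical input; real plugins would receive a pandas.DataFrame instead.
rows = [{"name": " alice "}, {"name": "bob "}]
result = run_pipeline(rows, [strip_names, add_id])
print(result)
```

Because each step only sees what the previous step returned, the order in which plugins are listed matters.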

[Figure: Data Transport UML Plugin Component View]

Quick Start

    The code here shows a function that will be registered as "autoincrement".
    The data will always be a pandas.DataFrame.
    For the sake of this example, the file will be named my-plugin.py.
import transport
import numpy as np


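# Running counter shared across calls
_index = 0
@transport.Plugin(name='autoincrement')
def _incr(_data):
    global _index
    # Assign ids that continue from where the previous call stopped
    _data['_id'] = _index + np.arange(_data.shape[0])
    # Advance the counter by the number of rows just processed
    _index += _data.shape[0]
    return _data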

data-transport comes with a built-in command-line interface (CLI) that allows plugins to be registered and reused.
    Registered functions are stored in $HOME/.data-transport/plugins/code
    Any update to my-plugin.py requires re-registering the file
    Additional plugin registry functions (list, test) are available
$ transport plugin-add demo ./my-plugin.py

The following command reports what data-transport knows about the function, i.e. its real name and the name to be used in code.

$ transport plugin-test demo.autoincrement

Once registered, plugins are ready for use within code or a configuration file (auth-file).

import transport
from transport import providers
_args = {
    "provider": providers.HTTP,
    "url": "https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/addresses.csv",
    "plugins": ["demo@autoincrement"]
}
reader = transport.get.reader(**_args)
_data = reader.read()
print(_data.head())