Compare commits

36 Commits
v2.0 ... master

Author SHA1 Message Date
Steve L. Nyemba 492dc8f374 Merge pull request 'new provider console and bug fixes with applied commands' (#25) from v2.2.0 into master
1 month ago
Steve Nyemba 2df926da12 new provider console and bug fixes with applied commands
1 month ago
Steve L. Nyemba e848367378 Merge pull request 'bug fix, duckdb in-memory handling' (#24) from v2.2.0 into master
1 month ago
Steve Nyemba e9aab3b034 bug fix, duckdb in-memory handling
1 month ago
Steve L. Nyemba c872ba8cc2 Merge pull request 'v2.2.0 - Bug fixes with mongodb, console' (#23) from v2.2.0 into master
1 month ago
Steve Nyemba 34db729ad4 bug fixes: mongodb console
1 month ago
Steve Nyemba a7c72391e8 s3 notebook - code as documentation
3 months ago
Steve L. Nyemba baa8164f16 Merge pull request 'aws s3 notebook, brief example' (#22) from v2.2.0 into master
3 months ago
Steve Nyemba 955369fdd8 aws s3 notebook, brief example
3 months ago
Steve L. Nyemba 31556ebd32 Merge pull request 'v2.2.0 bug fix - AWS-S3' (#21) from v2.2.0 into master
3 months ago
Steve Nyemba 63666e95ce bug fix, TODO: figure out how to parse types
3 months ago
Steve Nyemba 9dba5daecd bug fix, TODO: figure out how to parse types
3 months ago
Steve Nyemba 40f9c3930a bug fixes, using boto3 instead of boto for s3 support
3 months ago
Steve L. Nyemba 1e7839198a Merge pull request 'v2.2.0 - shared environment support and duckdb support' (#20) from v2.2.0 into master
3 months ago
Steve Nyemba 3faee02fa2 documentation ...
3 months ago
Steve Nyemba 6f6fd48982 bug fixes: environment variable usage
4 months ago
Steve Nyemba 808378afdb bug fix: delegate (new feature)
4 months ago
Steve Nyemba 2edce85aed documentation duckdb support
4 months ago
Steve Nyemba 235a44be66 bug fix: registry and parameter handling
4 months ago
Steve Nyemba 037019c1d7 bug fix
4 months ago
Steve Nyemba c443c6c953 duckdb support
4 months ago
Steve Nyemba dde4767e37 new version
4 months ago
Steve Nyemba 8aa6f2c93d bug fix: improve handling in registry
4 months ago
Steve Nyemba 24cdd9f8fe bug fix: print statement
4 months ago
Steve Nyemba b9bc898161 bug fix: registry (more usable) and added to factory method
4 months ago
Steve Nyemba 8edb764d11 documentation typo
4 months ago
Steve Nyemba 6544bf852a feature: registry for security and enterprise use
4 months ago
Steve L. Nyemba dce50a967e Merge pull request 'documentation ...' (#19) from v2.0.4 into master
4 months ago
Steve Nyemba 2b5c038610 documentation ...
4 months ago
Steve L. Nyemba 5ccb073865 Merge pull request 'refactor: etl,better reusability & streamlined and threaded' (#18) from v2.0.4 into master
4 months ago
Steve Nyemba d0472ccee5 documentation added (notebooks)
4 months ago
Steve Nyemba 870c1caed3 bug fix: use plugins to refer to plugins
4 months ago
Steve Nyemba f5187790ce refactor: etl,better reusability & streamlined and threaded
4 months ago
Steve L. Nyemba 3081fb98e7 Merge pull request 'version 2.0 - Refactored, Plugins support' (#17) from v2.0 into master
6 months ago
Steve L. Nyemba 58959359ad Merge pull request 'bug fix: psycopg2 with numpy' (#14) from dev into master
8 months ago
Steve L. Nyemba 68b8f6af5f Merge pull request 'fixes 2024 pandas-gbq and sqlalchemy' (#10) from dev into master
8 months ago

@ -8,7 +8,7 @@ Mostly data scientists that don't really care about the underlying database and
1. Familiarity with **pandas data-frames** 1. Familiarity with **pandas data-frames**
2. Connectivity **drivers** are included 2. Connectivity **drivers** are included
3. Mining data from various sources 3. Reading/Writing data from various sources
4. Useful for data migrations or **ETL** 4. Useful for data migrations or **ETL**
@ -18,6 +18,20 @@ Within the virtual environment perform the following :
pip install git+https://github.com/lnyemba/data-transport.git pip install git+https://github.com/lnyemba/data-transport.git
## Features
- read/write from over a dozen databases
- run ETL jobs seamlessly
- scales and integrates into shared environments like apache zeppelin; jupyterhub; SageMaker; ...
## What's new
Unlike older versions 2.0 and under, we focus on collaborative environments like jupyter-x servers; apache zeppelin:
1. Simpler syntax to create reader or writer
2. auth-file registry that can be referenced using a label
3. duckdb support
## Learn More ## Learn More
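
The "auth-file registry that can be referenced using a label" listed under "What's new" can be pictured roughly as below. This is a minimal sketch under stated assumptions, not the library's implementation: the JSON layout and the helper names `register_label` and `lookup_label` are hypothetical; only the idea of mapping a label to an auth-file (with an optional default) comes from the changes in this comparison.

```python
import json
import os


def register_label(registry_file, label, auth_file, default=False):
    """Hypothetical helper: record label -> auth-file path in a JSON registry."""
    registry = {"default": None, "labels": {}}
    if os.path.exists(registry_file):
        with open(registry_file) as f:
            registry = json.load(f)
    registry["labels"][label] = auth_file
    if default:
        registry["default"] = label
    with open(registry_file, "w") as f:
        json.dump(registry, f)


def lookup_label(registry_file, label=None):
    """Hypothetical helper: return the parsed auth-file for a label (or the default)."""
    with open(registry_file) as f:
        registry = json.load(f)
    label = label or registry["default"]
    if label not in registry["labels"]:
        raise KeyError(f"label {label!r} is not registered")
    with open(registry["labels"][label]) as f:
        return json.load(f)
```

Calling code then only needs to know the label, so connection parameters stay out of notebooks that are shared between users.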

@ -13,29 +13,6 @@ The above copyright notice and this permission notice shall be included in all c
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Usage :
transport help -- will print this page
transport move <path> [index]
<path> path to the configuration file
<index> optional index within the configuration file
e.g: configuration file (JSON formatted)
- single source to a single target
{"source":{"provider":"http","url":"https://cdn.wsform.com/wp-content/uploads/2020/06/agreement.csv"}
"target":{"provider":"sqlite3","path":"transport-demo.sqlite","table":"agreement"}
}
- single source to multiple targets
{
"source":{"provider":"http","url":"https://cdn.wsform.com/wp-content/uploads/2020/06/agreement.csv"},
"target":[
{"provider":"sqlite3","path":"transport-demo.sqlite","table":"agreement},
{"provider":"mongodb","db":"transport-demo","collection":"agreement"}
]
}
""" """
import pandas as pd import pandas as pd
import numpy as np import numpy as np
@ -44,15 +21,22 @@ import sys
import transport import transport
import time import time
from multiprocessing import Process from multiprocessing import Process
import typer
import os import os
import transport import transport
from transport import etl from transport import etl
# from transport import providers # from transport import providers
import typer
from typing_extensions import Annotated
from typing import Optional
import time
from termcolor import colored
app = typer.Typer() app = typer.Typer()
REGISTRY_PATH=os.sep.join([os.environ['HOME'],'.data-transport'])
REGISTRY_FILE= 'transport-registry.json'
CHECK_MARK = ' '.join(['[',colored(u'\u2713', 'green'),']'])
TIMES_MARK= ' '.join(['[',colored(u'\u2717','red'),']'])
# @app.command() # @app.command()
def help() : def help() :
print (__doc__) print (__doc__)
@ -62,28 +46,33 @@ def wait(jobs):
time.sleep(1) time.sleep(1)
@app.command(name="apply") @app.command(name="apply")
def apply (path,index=None): def apply (path:Annotated[str,typer.Argument(help="path of the configuration file")],
index:int = typer.Option(default= None, help="index of the item of interest, otherwise everything in the file will be processed")):
""" """
This function applies data transport from one source to one or several others This function applies data transport ETL feature to read data from one source to write it one or several others
:path path of the configuration file
:index index of the _item of interest (otherwise everything will be processed)
""" """
_proxy = lambda _object: _object.write(_object.read()) # _proxy = lambda _object: _object.write(_object.read())
if os.path.exists(path): if os.path.exists(path):
file = open(path) file = open(path)
_config = json.loads (file.read() ) _config = json.loads (file.read() )
file.close() file.close()
if index : if index :
_config = _config[ int(index)] _config = [_config[ int(index)]]
etl.instance(**_config) jobs = []
else: for _args in _config :
etl.instance(config=_config) pthread = etl.instance(**_args) #-- automatically starts the process
jobs.append(pthread)
#
# @TODO: Log the number of processes started and estimated time
while jobs :
jobs = [pthread for pthread in jobs if pthread.is_alive()]
time.sleep(1)
#
# @TODO: Log the job termination here ...
@app.command(name="providers") @app.command(name="providers")
def supported (format:str="table") : def supported (format:Annotated[str,typer.Argument(help="format of the output, supported formats are (list,table,json)")]="table") :
""" """
This function will print supported providers and their associated classifications This function will print supported providers/vendors and their associated classifications
""" """
_df = (transport.supported()) _df = (transport.supported())
if format in ['list','json'] : if format in ['list','json'] :
@ -94,54 +83,69 @@ def supported (format:str="table") :
@app.command() @app.command()
def version(): def version():
print (transport.version.__version__) """
This function will display version and license information
"""
print (transport.__app_name__,'version ',transport.__version__)
print (transport.__license__)
@app.command() @app.command()
def generate (path:str): def generate (path:Annotated[str,typer.Argument(help="path of the ETL configuration file template (name included)")]):
""" """
This function will generate a configuration template to give a sense of how to create one This function will generate a configuration template to give a sense of how to create one
""" """
_config = [ _config = [
{ {
"source":{"provider":"http","url":"https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/addresses.csv"}, "source":{"provider":"http","url":"https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/addresses.csv"},
"target": "target":
[{"provider":"file","path":"addresses.csv","delimiter":"csv"},{"provider":"sqlite","database":"sample.db3","table":"addresses"}] [{"provider":"files","path":"addresses.csv","delimiter":","},{"provider":"sqlite","database":"sample.db3","table":"addresses"}]
} }
] ]
file = open(path,'w') file = open(path,'w')
file.write(json.dumps(_config)) file.write(json.dumps(_config))
file.close() file.close()
@app.command() print (f"""{CHECK_MARK} Successfully generated a template ETL file at {path}""" )
def usage(): print ("""NOTE: Each line (source or target) is the content of an auth-file""")
print (__doc__)
@app.command(name="init")
def initregistry (email:Annotated[str,typer.Argument(help="email")],
path:str=typer.Option(default=REGISTRY_PATH,help="path or location of the configuration file"),
override:bool=typer.Option(default=False,help="override existing configuration or not")):
"""
This function will initialize the registry so that both the application and calling code can load database parameters by label
"""
try:
transport.registry.init(email=email, path=path, override=override)
_msg = f"""{CHECK_MARK} Successfully wrote configuration to {path} from {email}"""
except Exception as e:
_msg = f"{TIMES_MARK} {e}"
print (_msg)
print ()
@app.command(name="register")
def register (label:Annotated[str,typer.Argument(help="unique label that will be used to load the parameters of the database")],
auth_file:Annotated[str,typer.Argument(help="path of the auth_file")],
default:bool=typer.Option(default=False,help="set the auth_file as default"),
path:str=typer.Option(default=REGISTRY_PATH,help="path of the data-transport registry file")):
"""
This function will register an auth-file i.e database connection and assign it a label,
Learn more about auth-file at https://healthcareio.the-phi.com/data-transport
"""
try:
if transport.registry.exists(path) :
transport.registry.set(label=label,auth_file=auth_file, default=default, path=path)
_msg = f"""{CHECK_MARK} Successfully added label "{label}" to data-transport registry"""
else:
_msg = f"""{TIMES_MARK} Registry is not initialized, please initialize the registry (check help)"""
except Exception as e:
_msg = f"""{TIMES_MARK} {e}"""
print (_msg)
pass
if __name__ == '__main__' : if __name__ == '__main__' :
app() app()
# #
# # Load information from the file ...
# if 'help' in SYS_ARGS :
# print (__doc__)
# else:
# try:
# _info = json.loads(open(SYS_ARGS['config']).read())
# if 'index' in SYS_ARGS :
# _index = int(SYS_ARGS['index'])
# _info = [_item for _item in _info if _info.index(_item) == _index]
# pass
# elif 'id' in SYS_ARGS :
# _info = [_item for _item in _info if 'id' in _item and _item['id'] == SYS_ARGS['id']]
# procs = 1 if 'procs' not in SYS_ARGS else int(SYS_ARGS['procs'])
# jobs = transport.factory.instance(provider='etl',info=_info,procs=procs)
# print ([len(jobs),' Jobs are running'])
# N = len(jobs)
# while jobs :
# x = len(jobs)
# jobs = [_job for _job in jobs if _job.is_alive()]
# if x != len(jobs) :
# print ([len(jobs),'... jobs still running'])
# time.sleep(1)
# print ([N,' Finished running'])
# except Exception as e:
# print (e)
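
Review note on the new `apply` command: `if index :` is false when `index` is 0, so the first entry of the configuration file cannot be selected on its own; comparing against `None` avoids that. The sketch below illustrates the select-then-poll pattern under stated assumptions: `select_jobs` and `run_all` are hypothetical names, and a plain `threading.Thread` stands in for whatever `etl.instance` returns.

```python
import threading
import time


def select_jobs(config, index=None):
    """Return the entries to run; `is not None` keeps index 0 selectable."""
    if index is not None:
        return [config[int(index)]]
    return list(config)


def run_all(entries, worker, poll_seconds=0.01):
    """Start one thread per entry, then poll until none of them is alive."""
    jobs = [threading.Thread(target=worker, args=(entry,)) for entry in entries]
    for job in jobs:
        job.start()
    while jobs:
        # keep only the jobs that are still running
        jobs = [job for job in jobs if job.is_alive()]
        time.sleep(poll_seconds)
    return len(entries)
```

The polling loop mirrors the one in the diff; joining each thread would also work, but polling leaves room for the progress logging mentioned in the `@TODO` comments.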

@ -1,8 +1,8 @@
__app_name__ = 'data-transport'
__author__ = 'The Phi Technology' __author__ = 'The Phi Technology'
__version__= '2.0.2' __version__= '2.2.6'
__license__=""" __email__ = "info@the-phi.com"
__license__=f"""
Copyright 2010 - 2024, Steve L. Nyemba Copyright 2010 - 2024, Steve L. Nyemba
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the Software), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the Software), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
@ -12,3 +12,10 @@ The above copyright notice and this permission notice shall be included in all c
THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
""" """
__whatsnew__=f"""version {__version__}, focuses on collaborative environments like jupyter-base servers (apache zeppelin; jupyter notebook, jupyterlab, jupyterhub)
1. simpler syntax to create readers/writers
2. auth-file registry that can be referenced using a label
3. duckdb support
"""

@ -15,21 +15,21 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 1, "execution_count": 3,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
"name": "stderr", "name": "stderr",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"100%|██████████| 1/1 [00:00<00:00, 5440.08it/s]\n" "100%|██████████| 1/1 [00:00<00:00, 10106.76it/s]\n"
] ]
}, },
{ {
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"['data transport version ', '2.0.0']\n" "['data transport version ', '2.0.4']\n"
] ]
} }
], ],
@ -45,7 +45,7 @@
"PRIVATE_KEY = os.environ['BQ_KEY'] #-- location of the service key\n", "PRIVATE_KEY = os.environ['BQ_KEY'] #-- location of the service key\n",
"DATASET = 'demo'\n", "DATASET = 'demo'\n",
"_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n", "_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n",
"bqw = transport.factory.instance(provider=providers.BIGQUERY,dataset=DATASET,table='friends',context='write',private_key=PRIVATE_KEY)\n", "bqw = transport.get.writer(provider=providers.BIGQUERY,dataset=DATASET,table='friends',private_key=PRIVATE_KEY)\n",
"bqw.write(_data,if_exists='replace') #-- default is append\n", "bqw.write(_data,if_exists='replace') #-- default is append\n",
"print (['data transport version ', transport.__version__])\n" "print (['data transport version ', transport.__version__])\n"
] ]
@ -63,7 +63,8 @@
"\n", "\n",
"**NOTE**\n", "**NOTE**\n",
"\n", "\n",
"It is possible to use **transport.factory.instance** or **transport.instance** they are the same. It allows the maintainers to know that we used a factory design pattern." "By design **read** object are separated from **write** objects in order to avoid accidental writes to the database.\n",
"Read objects are created with **transport.get.reader** whereas write objects are created with **transport.get.writer**"
] ]
}, },
{ {
@ -93,7 +94,7 @@
"from transport import providers\n", "from transport import providers\n",
"import os\n", "import os\n",
"PRIVATE_KEY=os.environ['BQ_KEY']\n", "PRIVATE_KEY=os.environ['BQ_KEY']\n",
"pgr = transport.instance(provider=providers.BIGQUERY,dataset='demo',table='friends',private_key=PRIVATE_KEY)\n", "pgr = transport.get.reader(provider=providers.BIGQUERY,dataset='demo',table='friends',private_key=PRIVATE_KEY)\n",
"_df = pgr.read()\n", "_df = pgr.read()\n",
"_query = 'SELECT COUNT(*) _counts, AVG(age) from demo.friends'\n", "_query = 'SELECT COUNT(*) _counts, AVG(age) from demo.friends'\n",
"_sdf = pgr.read(sql=_query)\n", "_sdf = pgr.read(sql=_query)\n",
@ -106,35 +107,13 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The cell bellow show the content of an auth_file, in this case if the dataset/table in question is not to be shared then you can use auth_file with information associated with the parameters.\n", "An **auth-file** is a file that contains database parameters used to access the database. \n",
"\n", "For code in shared environments, we recommend \n",
"**NOTE**:\n",
"\n", "\n",
"The auth_file is intended to be **JSON** formatted" "1. Having the **auth-file** stored on disk \n",
] "2. and the location of the file is set to an environment variable.\n",
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'dataset': 'demo', 'table': 'friends'}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n", "\n",
"\n", "To generate a template of the **auth-file** open the **file generator wizard** found at https://healthcareio.the-phi.com/data-transport"
" \n",
" \"dataset\":\"demo\",\"table\":\"friends\"\n",
"}"
] ]
}, },
{ {
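
The recommendation in the notebook text above (keep the auth-file on disk and expose its location through an environment variable) could be sketched as follows. The variable name `DT_AUTH_FILE` and the helper `load_auth_file` are assumptions made for illustration; the notebooks in this comparison use `BQ_KEY` and `DT_AUTH_FOLDER` instead.

```python
import json
import os


def load_auth_file(env_var="DT_AUTH_FILE"):
    """Hypothetical helper: resolve the auth-file path from the environment and parse it."""
    path = os.environ.get(env_var)
    if not path:
        raise RuntimeError(f"environment variable {env_var} is not set")
    with open(path) as f:
        return path, json.load(f)
```

The returned path is what the notebooks pass as `auth_file=...` to `transport.get.reader` or `transport.get.writer`; the parsed dictionary is only there to inspect which parameters the file carries.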

@ -0,0 +1,188 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Extract Transform Load (ETL) from Code\n",
"\n",
"The example below reads data from an http source (github) and will copy the data to a csv file and to a database. This example illustrates the one-to-many ETL features.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>location_id</th>\n",
" <th>address_1</th>\n",
" <th>address_2</th>\n",
" <th>city</th>\n",
" <th>state_province</th>\n",
" <th>postal_code</th>\n",
" <th>country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2600 Middlefield Road</td>\n",
" <td>NaN</td>\n",
" <td>Redwood City</td>\n",
" <td>CA</td>\n",
" <td>94063</td>\n",
" <td>US</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>24 Second Avenue</td>\n",
" <td>NaN</td>\n",
" <td>San Mateo</td>\n",
" <td>CA</td>\n",
" <td>94401</td>\n",
" <td>US</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>24 Second Avenue</td>\n",
" <td>NaN</td>\n",
" <td>San Mateo</td>\n",
" <td>CA</td>\n",
" <td>94403</td>\n",
" <td>US</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>24 Second Avenue</td>\n",
" <td>NaN</td>\n",
" <td>San Mateo</td>\n",
" <td>CA</td>\n",
" <td>94401</td>\n",
" <td>US</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>24 Second Avenue</td>\n",
" <td>NaN</td>\n",
" <td>San Mateo</td>\n",
" <td>CA</td>\n",
" <td>94401</td>\n",
" <td>US</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id location_id address_1 address_2 city \\\n",
"0 1 1 2600 Middlefield Road NaN Redwood City \n",
"1 2 2 24 Second Avenue NaN San Mateo \n",
"2 3 3 24 Second Avenue NaN San Mateo \n",
"3 4 4 24 Second Avenue NaN San Mateo \n",
"4 5 5 24 Second Avenue NaN San Mateo \n",
"\n",
" state_province postal_code country \n",
"0 CA 94063 US \n",
"1 CA 94401 US \n",
"2 CA 94403 US \n",
"3 CA 94401 US \n",
"4 CA 94401 US "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#\n",
"# One-to-many ETL: http source to a csv file and a sqlite database\n",
"#\n",
"import transport\n",
"from transport import providers\n",
"import pandas as pd\n",
"import os\n",
"\n",
"#\n",
"#\n",
"source = {\"provider\": \"http\", \"url\": \"https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/addresses.csv\"}\n",
"target = [{\"provider\": \"files\", \"path\": \"addresses.csv\", \"delimiter\": \",\"}, {\"provider\": \"sqlite\", \"database\": \"sample.db3\", \"table\": \"addresses\"}]\n",
"\n",
"_handler = transport.get.etl (source=source,target=target)\n",
"_data = _handler.read() #-- all etl begins with data being read\n",
"_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Extract Transform Load (ETL) from CLI\n",
"\n",
"The documentation for this is available at https://healthcareio.the-phi.com/data-transport \"Docs\" -> \"Terminal CLI\"\n",
"\n",
"The entire process is documented including how to generate an ETL configuration file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
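
To make the one-to-many idea in the ETL notebook concrete without depending on the library, here is a standard-library sketch: rows are read once and handed to every target. It is an illustration under stated assumptions, not how `transport.get.etl` is implemented; all function names are hypothetical and the source is an in-memory CSV string rather than an http URL.

```python
import csv
import io
import sqlite3


def read_source(text):
    """Parse CSV text into a list of dict rows (stand-in for the http source)."""
    return list(csv.DictReader(io.StringIO(text)))


def write_csv(rows, path, delimiter=","):
    """Write rows to a delimited file (stand-in for the 'files' target)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()), delimiter=delimiter)
        writer.writeheader()
        writer.writerows(rows)


def write_sqlite(rows, database, table):
    """Write rows to a sqlite table (stand-in for the 'sqlite' target)."""
    columns = list(rows[0].keys())
    col_sql = ", ".join(f'"{c}"' for c in columns)
    marks = ", ".join("?" for _ in columns)
    con = sqlite3.connect(database)
    try:
        con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_sql})')
        con.executemany(
            f'INSERT INTO "{table}" ({col_sql}) VALUES ({marks})',
            [tuple(row[c] for c in columns) for row in rows],
        )
        con.commit()
    finally:
        con.close()


def one_to_many(text, targets):
    """Read the source once, then pass the same rows to each target writer."""
    rows = read_source(text)
    for write in targets:
        write(rows)
    return rows
```

Reading once and fanning out keeps the targets consistent with each other, which is the point of declaring `target` as a list in the configuration.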

@ -11,14 +11,14 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 1, "execution_count": 4,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"2.0.0\n" "2.0.4\n"
] ]
} }
], ],
@ -30,7 +30,7 @@
"from transport import providers\n", "from transport import providers\n",
"import pandas as pd\n", "import pandas as pd\n",
"_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n", "_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n",
"mgw = transport.factory.instance(provider=providers.MONGODB,db='demo',collection='friends',context='write')\n", "mgw = transport.get.writer(provider=providers.MONGODB,db='demo',collection='friends')\n",
"mgw.write(_data)\n", "mgw.write(_data)\n",
"print (transport.__version__)" "print (transport.__version__)"
] ]
@ -48,12 +48,13 @@
"\n", "\n",
"**NOTE**\n", "**NOTE**\n",
"\n", "\n",
"It is possible to use **transport.factory.instance** or **transport.instance** they are the same. It allows the maintainers to know that we used a factory design pattern." "By design **read** object are separated from **write** objects in order to avoid accidental writes to the database.\n",
"Read objects are created with **transport.get.reader** whereas write objects are created with **transport.get.writer**"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 4, "execution_count": 2,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -73,7 +74,7 @@
"\n", "\n",
"import transport\n", "import transport\n",
"from transport import providers\n", "from transport import providers\n",
"mgr = transport.instance(provider=providers.MONGODB,db='foo',collection='friends')\n", "mgr = transport.get.reader(provider=providers.MONGODB,db='foo',collection='friends')\n",
"_df = mgr.read()\n", "_df = mgr.read()\n",
"PIPELINE = [{\"$group\":{\"_id\":0,\"_counts\":{\"$sum\":1}, \"_mean\":{\"$avg\":\"$age\"}}}]\n", "PIPELINE = [{\"$group\":{\"_id\":0,\"_counts\":{\"$sum\":1}, \"_mean\":{\"$avg\":\"$age\"}}}]\n",
"_sdf = mgr.read(aggregate='friends',pipeline=PIPELINE)\n", "_sdf = mgr.read(aggregate='friends',pipeline=PIPELINE)\n",
@ -86,41 +87,13 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The cell bellow show the content of an auth_file, in this case if the dataset/table in question is not to be shared then you can use auth_file with information associated with the parameters.\n", "An **auth-file** is a file that contains database parameters used to access the database. \n",
"For code in shared environments, we recommend \n",
"\n", "\n",
"**NOTE**:\n", "1. Having the **auth-file** stored on disk \n",
"2. and the location of the file is set to an environment variable.\n",
"\n", "\n",
"The auth_file is intended to be **JSON** formatted" "To generate a template of the **auth-file** open the **file generator wizard** found at https://healthcareio.the-phi.com/data-transport"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'host': 'klingon.io',\n",
" 'port': 27017,\n",
" 'username': 'me',\n",
" 'password': 'foobar',\n",
" 'db': 'foo',\n",
" 'collection': 'friends',\n",
" 'authSource': '<authdb>',\n",
" 'mechamism': '<SCRAM-SHA-256|MONGODB-CR|SCRAM-SHA-1>'}"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{\n",
" \"host\":\"klingon.io\",\"port\":27017,\"username\":\"me\",\"password\":\"foobar\",\"db\":\"foo\",\"collection\":\"friends\",\n",
" \"authSource\":\"<authdb>\",\"mechamism\":\"<SCRAM-SHA-256|MONGODB-CR|SCRAM-SHA-1>\"\n",
"}"
] ]
}, },
{ {

@ -17,17 +17,9 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 1, "execution_count": null,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [],
{
"name": "stdout",
"output_type": "stream",
"text": [
"['data transport version ', '2.0.0']\n"
]
}
],
"source": [ "source": [
"#\n", "#\n",
"# Writing to Google Bigquery database\n", "# Writing to Google Bigquery database\n",
@ -41,7 +33,7 @@
"MSSQL_AUTH_FILE= os.sep.join([AUTH_FOLDER,'mssql.json'])\n", "MSSQL_AUTH_FILE= os.sep.join([AUTH_FOLDER,'mssql.json'])\n",
"\n", "\n",
"_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n", "_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n",
"msw = transport.factory.instance(provider=providers.MSSQL,table='friends',context='write',auth_file=MSSQL_AUTH_FILE)\n", "msw = transport.get.writer(provider=providers.MSSQL,table='friends',auth_file=MSSQL_AUTH_FILE)\n",
"msw.write(_data,if_exists='replace') #-- default is append\n", "msw.write(_data,if_exists='replace') #-- default is append\n",
"print (['data transport version ', transport.__version__])\n" "print (['data transport version ', transport.__version__])\n"
] ]
@ -59,30 +51,15 @@
"\n", "\n",
"**NOTE**\n", "**NOTE**\n",
"\n", "\n",
"It is possible to use **transport.factory.instance** or **transport.instance** they are the same. It allows the maintainers to know that we used a factory design pattern." "By design **read** object are separated from **write** objects in order to avoid accidental writes to the database.\n",
"Read objects are created with **transport.get.reader** whereas write objects are created with **transport.get.writer**"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 5, "execution_count": null,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [],
{
"name": "stdout",
"output_type": "stream",
"text": [
" name age\n",
"0 James Bond 55\n",
"1 Steve Rogers 150\n",
"2 Steve Nyemba 44\n",
"\n",
"--------- STATISTICS ------------\n",
"\n",
" _counts \n",
"0 3 83\n"
]
}
],
"source": [ "source": [
"\n", "\n",
"import transport\n", "import transport\n",
@ -91,7 +68,7 @@
"AUTH_FOLDER = os.environ['DT_AUTH_FOLDER'] #-- location of the service key\n", "AUTH_FOLDER = os.environ['DT_AUTH_FOLDER'] #-- location of the service key\n",
"MSSQL_AUTH_FILE= os.sep.join([AUTH_FOLDER,'mssql.json'])\n", "MSSQL_AUTH_FILE= os.sep.join([AUTH_FOLDER,'mssql.json'])\n",
"\n", "\n",
"msr = transport.instance(provider=providers.MSSQL,table='friends',auth_file=MSSQL_AUTH_FILE)\n", "msr = transport.get.reader(provider=providers.MSSQL,table='friends',auth_file=MSSQL_AUTH_FILE)\n",
"_df = msr.read()\n", "_df = msr.read()\n",
"_query = 'SELECT COUNT(*) _counts, AVG(age) from friends'\n", "_query = 'SELECT COUNT(*) _counts, AVG(age) from friends'\n",
"_sdf = msr.read(sql=_query)\n", "_sdf = msr.read(sql=_query)\n",
@ -104,25 +81,31 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The cell bellow show the content of an auth_file, in this case if the dataset/table in question is not to be shared then you can use auth_file with information associated with the parameters.\n", "An **auth-file** is a file that contains database parameters used to access the database. \n",
"For code in shared environments, we recommend \n",
"\n", "\n",
"**NOTE**:\n", "1. Having the **auth-file** stored on disk \n",
"2. and the location of the file is set to an environment variable.\n",
"\n", "\n",
"The auth_file is intended to be **JSON** formatted" "To generate a template of the **auth-file** open the **file generator wizard** found at https://healthcareio.the-phi.com/data-transport"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 3, "execution_count": 1,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
"data": { "data": {
"text/plain": [ "text/plain": [
"{'dataset': 'demo', 'table': 'friends'}" "{'provider': 'sqlserver',\n",
" 'dataset': 'demo',\n",
" 'table': 'friends',\n",
" 'username': '<username>',\n",
" 'password': '<password>'}"
] ]
}, },
"execution_count": 3, "execution_count": 1,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
@ -130,10 +113,17 @@
"source": [ "source": [
"\n", "\n",
"{\n", "{\n",
" \n", " \"provider\":\"sqlserver\",\n",
" \"dataset\":\"demo\",\"table\":\"friends\",\"username\":\"<username>\",\"password\":\"<password>\"\n", " \"dataset\":\"demo\",\"table\":\"friends\",\"username\":\"<username>\",\"password\":\"<password>\"\n",
"}" "}"
] ]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
} }
], ],
"metadata": { "metadata": {

@ -14,14 +14,14 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 8, "execution_count": 3,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"2.0.0\n" "2.0.4\n"
] ]
} }
], ],
@ -33,7 +33,7 @@
"from transport import providers\n", "from transport import providers\n",
"import pandas as pd\n", "import pandas as pd\n",
"_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n", "_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n",
"myw = transport.factory.instance(provider=providers.MYSQL,database='demo',table='friends',context='write',auth_file=\"/home/steve/auth-mysql.json\")\n", "myw = transport.get.writer(provider=providers.MYSQL,database='demo',table='friends',auth_file=\"/home/steve/auth-mysql.json\")\n",
"myw.write(_data,if_exists='replace') #-- default is append\n", "myw.write(_data,if_exists='replace') #-- default is append\n",
"print (transport.__version__)" "print (transport.__version__)"
] ]
@ -51,12 +51,13 @@
"\n", "\n",
"**NOTE**\n", "**NOTE**\n",
"\n", "\n",
"It is possible to use **transport.factory.instance** or **transport.instance** they are the same. It allows the maintainers to know that we used a factory design pattern." "By design **read** objects are separated from **write** objects in order to avoid accidental writes to the database.\n",
"Read objects are created with **transport.get.reader** whereas write objects are created with **transport.get.writer**"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 9, "execution_count": 4,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -68,8 +69,8 @@
"1 Steve Rogers 150\n", "1 Steve Rogers 150\n",
"2 Steve Nyemba 44\n", "2 Steve Nyemba 44\n",
"--------- STATISTICS ------------\n", "--------- STATISTICS ------------\n",
" _counts avg\n", " _counts AVG(age)\n",
"0 3 83.0\n" "0 3 83.0\n"
] ]
} }
], ],
@ -77,7 +78,7 @@
"\n", "\n",
"import transport\n", "import transport\n",
"from transport import providers\n", "from transport import providers\n",
"myr = transport.instance(provider=providers.POSTGRESQL,database='demo',table='friends',auth_file='/home/steve/auth-mysql.json')\n", "myr = transport.get.reader(provider=providers.MYSQL,database='demo',table='friends',auth_file='/home/steve/auth-mysql.json')\n",
"_df = myr.read()\n", "_df = myr.read()\n",
"_query = 'SELECT COUNT(*) _counts, AVG(age) from friends'\n", "_query = 'SELECT COUNT(*) _counts, AVG(age) from friends'\n",
"_sdf = myr.read(sql=_query)\n", "_sdf = myr.read(sql=_query)\n",
@ -90,16 +91,18 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The cell bellow show the content of an auth_file, in this case if the dataset/table in question is not to be shared then you can use auth_file with information associated with the parameters.\n", "An **auth-file** is a file that contains database parameters used to access the database. \n",
"For code in shared environments, we recommend: \n",
"\n", "\n",
"**NOTE**:\n", "1. Storing the **auth-file** on disk \n",
"2. Setting the location of the file in an environment variable.\n",
"\n", "\n",
"The auth_file is intended to be **JSON** formatted" "To generate a template of the **auth-file**, open the **file generator wizard** found at https://healthcareio.the-phi.com/data-transport"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 1, "execution_count": 5,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -109,21 +112,29 @@
" 'port': 3306,\n", " 'port': 3306,\n",
" 'username': 'me',\n", " 'username': 'me',\n",
" 'password': 'foobar',\n", " 'password': 'foobar',\n",
" 'provider': 'mysql',\n",
" 'database': 'demo',\n", " 'database': 'demo',\n",
" 'table': 'friends'}" " 'table': 'friends'}"
] ]
}, },
"execution_count": 1, "execution_count": 5,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
], ],
"source": [ "source": [
"{\n", "{\n",
" \"host\":\"klingon.io\",\"port\":3306,\"username\":\"me\",\"password\":\"foobar\",\n", " \"host\":\"klingon.io\",\"port\":3306,\"username\":\"me\",\"password\":\"foobar\", \"provider\":\"mysql\",\n",
" \"database\":\"demo\",\"table\":\"friends\"\n", " \"database\":\"demo\",\"table\":\"friends\"\n",
"}" "}"
] ]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
} }
], ],
"metadata": { "metadata": {

@ -0,0 +1,149 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Writing data-transport plugins\n",
"\n",
"The data-transport plugins are designed to automate pre/post processing, i.e.\n",
"\n",
" - Read -> Post processing\n",
" - Write-> Pre processing\n",
" \n",
"In this example we will read and write data, applying pre/post processing, against any supported infrastructure. We will also show how to specify the plugins within a configuration file"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"#\n",
"# Writing to a SQLite database\n",
"#\n",
"import transport\n",
"from transport import providers\n",
"import pandas as pd\n",
"import os\n",
"import shutil\n",
"#\n",
"#\n",
"\n",
"DATABASE = '/home/steve/tmp/demo.db3'\n",
"if os.path.exists(DATABASE) :\n",
" os.remove(DATABASE)\n",
"#\n",
"# \n",
"_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n",
"litew = transport.get.writer(provider=providers.SQLITE,database=DATABASE)\n",
"litew.write(_data,table='friends')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Reading from SQLite\n",
"\n",
"The cell below reads the data that has been written by the cell above and post-processes it with plugin functions we will write. \n",
"\n",
"- Basic read of the designated table (friends) created above\n",
"- Read with pipeline functions defined in code\n",
"\n",
"**NOTE**\n",
"\n",
"It is possible to use **transport.factory.instance**, **transport.instance** or **transport.get.<[reader|writer]>**; they are equivalent. The naming lets maintainers know that a factory design pattern is used."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" name age\n",
"0 James Bond 55\n",
"1 Steve Rogers 150\n",
"2 Steve Nyemba 44\n",
"\n",
"\n",
" name age autoinc\n",
"0 James Bond 5.5 0\n",
"1 Steve Rogers 15.0 1\n",
"2 Steve Nyemba 4.4 2\n"
]
}
],
"source": [
"\n",
"import transport\n",
"from transport import providers\n",
"import os\n",
"import numpy as np\n",
"def _autoincrement (_data,**kwargs) :\n",
" \"\"\"\n",
" This function will add an autoincrement field to the table\n",
" \"\"\"\n",
" _data['autoinc'] = np.arange(_data.shape[0])\n",
" \n",
" return _data\n",
"def reduce(_data,**_args) :\n",
" \"\"\"\n",
" This function will divide the age column of the data frame by 10\n",
" \"\"\"\n",
" _data.age /= 10\n",
" return _data\n",
"reader = transport.get.reader(provider=providers.SQLITE,database=DATABASE,table='friends')\n",
"#\n",
"# basic read of the data created in the first cell\n",
"_df = reader.read()\n",
"print (_df)\n",
"print ()\n",
"print()\n",
"#\n",
"# read of the data with pipeline functions provided to alter the data frame\n",
"print (reader.read(pipeline=[_autoincrement,reduce]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parameters for instantiating a transport object (reader or writer) can be found at [data-transport home](https://healthcareio.the-phi.com/data-transport)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -14,14 +14,14 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 8, "execution_count": 1,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"2.0.0\n" "2.0.4\n"
] ]
} }
], ],
@ -33,7 +33,7 @@
"from transport import providers\n", "from transport import providers\n",
"import pandas as pd\n", "import pandas as pd\n",
"_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n", "_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n",
"pgw = transport.factory.instance(provider=providers.POSTGRESQL,database='demo',table='friends',context='write')\n", "pgw = transport.get.writer(provider=providers.POSTGRESQL,database='demo',table='friends')\n",
"pgw.write(_data,if_exists='replace') #-- default is append\n", "pgw.write(_data,if_exists='replace') #-- default is append\n",
"print (transport.__version__)" "print (transport.__version__)"
] ]
@ -49,14 +49,16 @@
"- Basic read of the designated table (friends) created above\n", "- Basic read of the designated table (friends) created above\n",
"- Execute an aggregate SQL against the table\n", "- Execute an aggregate SQL against the table\n",
"\n", "\n",
"\n",
"**NOTE**\n", "**NOTE**\n",
"\n", "\n",
"It is possible to use **transport.factory.instance** or **transport.instance** they are the same. It allows the maintainers to know that we used a factory design pattern." "By design **read** objects are separated from **write** objects in order to avoid accidental writes to the database.\n",
"Read objects are created with **transport.get.reader** whereas write objects are created with **transport.get.writer**"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 6, "execution_count": 2,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -77,7 +79,7 @@
"\n", "\n",
"import transport\n", "import transport\n",
"from transport import providers\n", "from transport import providers\n",
"pgr = transport.instance(provider=providers.POSTGRESQL,database='demo',table='friends')\n", "pgr = transport.get.reader(provider=providers.POSTGRESQL,database='demo',table='friends')\n",
"_df = pgr.read()\n", "_df = pgr.read()\n",
"_query = 'SELECT COUNT(*) _counts, AVG(age) from friends'\n", "_query = 'SELECT COUNT(*) _counts, AVG(age) from friends'\n",
"_sdf = pgr.read(sql=_query)\n", "_sdf = pgr.read(sql=_query)\n",
@ -90,16 +92,18 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The cell bellow show the content of an auth_file, in this case if the dataset/table in question is not to be shared then you can use auth_file with information associated with the parameters.\n", "An **auth-file** is a file that contains database parameters used to access the database. \n",
"For code in shared environments, we recommend: \n",
"\n", "\n",
"**NOTE**:\n", "1. Storing the **auth-file** on disk \n",
"2. Setting the location of the file in an environment variable.\n",
"\n", "\n",
"The auth_file is intended to be **JSON** formatted" "To generate a template of the **auth-file**, open the **file generator wizard** found at https://healthcareio.the-phi.com/data-transport"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 1, "execution_count": 4,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -109,18 +113,19 @@
" 'port': 5432,\n", " 'port': 5432,\n",
" 'username': 'me',\n", " 'username': 'me',\n",
" 'password': 'foobar',\n", " 'password': 'foobar',\n",
" 'provider': 'postgresql',\n",
" 'database': 'demo',\n", " 'database': 'demo',\n",
" 'table': 'friends'}" " 'table': 'friends'}"
] ]
}, },
"execution_count": 1, "execution_count": 4,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
], ],
"source": [ "source": [
"{\n", "{\n",
" \"host\":\"klingon.io\",\"port\":5432,\"username\":\"me\",\"password\":\"foobar\",\n", " \"host\":\"klingon.io\",\"port\":5432,\"username\":\"me\",\"password\":\"foobar\", \"provider\":\"postgresql\",\n",
" \"database\":\"demo\",\"table\":\"friends\"\n", " \"database\":\"demo\",\"table\":\"friends\"\n",
"}" "}"
] ]

@ -0,0 +1,131 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Writing to AWS S3\n",
"\n",
"We have set up our demo environment with the label **aws**, which references our s3 access_key, secret_key and file (called friends.csv). In the cell below we will write the data to our aws s3 bucket named **com.phi.demo**"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.2.1\n"
]
}
],
"source": [
"#\n",
"# Writing to an AWS S3 bucket\n",
"#\n",
"import transport\n",
"from transport import providers\n",
"import pandas as pd\n",
"_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n",
"mgw = transport.get.writer(label='aws')\n",
"mgw.write(_data)\n",
"print (transport.__version__)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Reading from AWS S3\n",
"\n",
"The cell below reads the data that has been written by the cell above and computes the average age with pandas. The code performs the following:\n",
"\n",
"- Basic read of the designated file **friends.csv**\n",
"- Compute average age using standard pandas functions\n",
"\n",
"**NOTE**\n",
"\n",
"By design **read** objects are separated from **write** objects in order to avoid accidental writes to the database.\n",
"Read objects are created with **transport.get.reader** whereas write objects are created with **transport.get.writer**"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" bname age\n",
"0 James Bond 55\n",
"1 Steve Rogers 150\n",
"2 Steve Nyemba 44\n",
"--------- STATISTICS ------------\n",
"83.0\n"
]
}
],
"source": [
"\n",
"import transport\n",
"from transport import providers\n",
"import pandas as pd\n",
"\n",
"def cast(stream) :\n",
" print (stream)\n",
" return pd.DataFrame(str(stream))\n",
"mgr = transport.get.reader(label='aws')\n",
"_df = mgr.read()\n",
"print (_df)\n",
"print ('--------- STATISTICS ------------')\n",
"print (_df.age.mean())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An **auth-file** is a file that contains database parameters used to access the database. \n",
"For code in shared environments, we recommend: \n",
"\n",
"1. Storing the **auth-file** on disk \n",
"2. Setting the location of the file in an environment variable.\n",
"\n",
"To generate a template of the **auth-file**, open the **file generator wizard** found at https://healthcareio.the-phi.com/data-transport"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -18,7 +18,7 @@
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"2.0.0\n" "2.0.4\n"
] ]
} }
], ],
@ -30,7 +30,7 @@
"from transport import providers\n", "from transport import providers\n",
"import pandas as pd\n", "import pandas as pd\n",
"_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n", "_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n",
"sqw = transport.factory.instance(provider=providers.SQLITE,database='/home/steve/demo.db3',table='friends',context='write')\n", "sqw = transport.get.writer(provider=providers.SQLITE,database='/home/steve/demo.db3',table='friends')\n",
"sqw.write(_data,if_exists='replace') #-- default is append\n", "sqw.write(_data,if_exists='replace') #-- default is append\n",
"print (transport.__version__)" "print (transport.__version__)"
] ]
@ -46,9 +46,11 @@
"- Basic read of the designated table (friends) created above\n", "- Basic read of the designated table (friends) created above\n",
"- Execute an aggregate SQL against the table\n", "- Execute an aggregate SQL against the table\n",
"\n", "\n",
"\n",
"**NOTE**\n", "**NOTE**\n",
"\n", "\n",
"It is possible to use **transport.factory.instance** or **transport.instance** they are the same. It allows the maintainers to know that we used a factory design pattern." "By design **read** objects are separated from **write** objects in order to avoid accidental writes to the database.\n",
"Read objects are created with **transport.get.reader** whereas write objects are created with **transport.get.writer**"
] ]
}, },
{ {
@ -74,10 +76,10 @@
"\n", "\n",
"import transport\n", "import transport\n",
"from transport import providers\n", "from transport import providers\n",
"pgr = transport.instance(provider=providers.SQLITE,database='/home/steve/demo.db3',table='friends')\n", "sqr = transport.get.reader(provider=providers.SQLITE,database='/home/steve/demo.db3',table='friends')\n",
"_df = pgr.read()\n", "_df = sqr.read()\n",
"_query = 'SELECT COUNT(*) _counts, AVG(age) from friends'\n", "_query = 'SELECT COUNT(*) _counts, AVG(age) from friends'\n",
"_sdf = pgr.read(sql=_query)\n", "_sdf = sqr.read(sql=_query)\n",
"print (_df)\n", "print (_df)\n",
"print ('--------- STATISTICS ------------')\n", "print ('--------- STATISTICS ------------')\n",
"print (_sdf)" "print (_sdf)"
@ -87,11 +89,13 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The cell bellow show the content of an auth_file, in this case if the dataset/table in question is not to be shared then you can use auth_file with information associated with the parameters.\n", "An **auth-file** is a file that contains database parameters used to access the database. \n",
"For code in shared environments, we recommend: \n",
"\n", "\n",
"**NOTE**:\n", "1. Storing the **auth-file** on disk \n",
"2. Setting the location of the file in an environment variable.\n",
"\n", "\n",
"The auth_file is intended to be **JSON** formatted. This is an overkill for SQLite ;-)" "To generate a template of the **auth-file**, open the **file generator wizard** found at https://healthcareio.the-phi.com/data-transport"
] ]
}, },
{ {

@ -5,24 +5,21 @@ from setuptools import setup, find_packages
import os import os
import sys import sys
# from version import __version__,__author__ # from version import __version__,__author__
from info import __version__, __author__ from info import __version__, __author__,__app_name__,__license__
# __author__ = 'The Phi Technology'
# __version__= '1.8.0'
def read(fname): def read(fname):
return open(os.path.join(os.path.dirname(__file__), fname)).read() return open(os.path.join(os.path.dirname(__file__), fname)).read()
args = { args = {
"name":"data-transport", "name":__app_name__,
"version":__version__, "version":__version__,
"author":__author__,"author_email":"info@the-phi.com", "author":__author__,"author_email":"info@the-phi.com",
"license":"MIT", "license":__license__,
# "packages":["transport","info","transport/sql"]}, # "packages":["transport","info","transport/sql"]},
"packages": find_packages(include=['info','transport', 'transport.*'])} "packages": find_packages(include=['info','transport', 'transport.*'])}
args["keywords"]=['mongodb','couchdb','rabbitmq','file','read','write','s3','sqlite'] args["keywords"]=['mongodb','duckdb','couchdb','rabbitmq','file','read','write','s3','sqlite']
args["install_requires"] = ['pyncclient','pymongo','sqlalchemy','pandas','typer','pandas-gbq','numpy','cloudant','pika','nzpy','boto3','boto','pyarrow','google-cloud-bigquery','google-cloud-bigquery-storage','flask-session','smart_open','botocore','psycopg2-binary','mysql-connector-python','numpy','pymssql'] args["install_requires"] = ['pyncclient','duckdb-engine','pymongo','sqlalchemy','pandas','typer','pandas-gbq','numpy','cloudant','pika','nzpy','termcolor','boto3','boto','pyarrow','google-cloud-bigquery','google-cloud-bigquery-storage','flask-session','smart_open','botocore','psycopg2-binary','mysql-connector-python','numpy','pymssql']
args["url"] = "https://healthcareio.the-phi.com/git/code/transport.git" args["url"] = "https://healthcareio.the-phi.com/git/code/transport.git"
args['scripts'] = ['bin/transport'] args['scripts'] = ['bin/transport']
# if sys.version_info[0] == 2 : # if sys.version_info[0] == 2 :

@ -22,36 +22,76 @@ from transport import sql, nosql, cloud, other
import pandas as pd import pandas as pd
import json import json
import os import os
from info import __version__,__author__ from info import __version__,__author__,__email__,__license__,__app_name__,__whatsnew__
from transport.iowrapper import IWriter, IReader from transport.iowrapper import IWriter, IReader, IETL
from transport.plugins import PluginLoader from transport.plugins import PluginLoader
from transport import providers from transport import providers
import copy
from transport import registry
PROVIDERS = {} PROVIDERS = {}
def init(): def init():
global PROVIDERS global PROVIDERS
for _module in [cloud,sql,nosql,other] : for _module in [cloud,sql,nosql,other] :
for _provider_name in dir(_module) : for _provider_name in dir(_module) :
if _provider_name.startswith('__') : if _provider_name.startswith('__') or _provider_name == 'common':
continue continue
PROVIDERS[_provider_name] = {'module':getattr(_module,_provider_name),'type':_module.__name__} PROVIDERS[_provider_name] = {'module':getattr(_module,_provider_name),'type':_module.__name__}
def _getauthfile (path) :
f = open(path)
_object = json.loads(f.read())
f.close()
return _object
def instance (**_args): def instance (**_args):
""" """
type: This function returns an object to read from or write to a supported database provider/vendor
read: true|false (default true) @provider provider
auth_file @context read/write (default is read)
@auth_file: Optional if the database information provided is in a file. Useful for not sharing passwords
kwargs These are arguments that are provider/vendor specific
""" """
global PROVIDERS global PROVIDERS
# if not registry.isloaded () :
# if ('path' in _args and registry.exists(_args['path'] )) or registry.exists():
# registry.load() if 'path' not in _args else registry.load(_args['path'])
# print ([' GOT IT'])
# if 'label' in _args and registry.isloaded():
# _info = registry.get(_args['label'])
# if _info :
# #
# _args = dict(_args,**_info)
if 'auth_file' in _args: if 'auth_file' in _args:
if os.path.exists(_args['auth_file']) : if os.path.exists(_args['auth_file']) :
#
# @TODO: add encryption module and decryption to enable this to be secure
#
f = open(_args['auth_file']) f = open(_args['auth_file'])
_args = dict (_args,** json.loads(f.read()) ) #_args = dict (_args,** json.loads(f.read()) )
#
# we override file parameters with arguments passed
_args = dict (json.loads(f.read()),**_args )
f.close() f.close()
else: else:
filename = _args['auth_file'] filename = _args['auth_file']
raise Exception(f" {filename} was not found or is invalid") raise Exception(f" {filename} was not found or is invalid")
if _args['provider'] in PROVIDERS : if 'provider' not in _args and 'auth_file' not in _args :
if not registry.isloaded () :
if ('path' in _args and registry.exists(_args['path'] )) or registry.exists():
registry.load() if 'path' not in _args else registry.load(_args['path'])
_info = {}
if 'label' in _args and registry.isloaded():
_info = registry.get(_args['label'])
else:
_info = registry.get()
if _info :
#
# _args = dict(_args,**_info)
_args = dict(_info,**_args) #-- we can override the registry parameters with our own arguments
if 'provider' in _args and _args['provider'] in PROVIDERS :
_info = PROVIDERS[_args['provider']] _info = PROVIDERS[_args['provider']]
_module = _info['module'] _module = _info['module']
if 'context' in _args : if 'context' in _args :
@ -62,22 +102,58 @@ def instance (**_args):
_agent = _pointer (**_args) _agent = _pointer (**_args)
# #
loader = None loader = None
if 'plugins' in _args :
_params = _args['plugins']
if 'path' in _params and 'names' in _params : #
loader = PluginLoader(**_params) # @TODO:
elif type(_params) == list: # define a logger object here that will be used by the wrapper
loader = PluginLoader() # this would allow us to know what the data-transport is doing and where/how it fails
for _delegate in _params : #
loader.set(_delegate)
# if 'plugins' in _args :
# _params = _args['plugins']
# if 'path' in _params and 'names' in _params :
# loader = PluginLoader(**_params)
# elif type(_params) == list:
# loader = PluginLoader()
# for _delegate in _params :
# loader.set(_delegate)
loader = None if 'plugins' not in _args else _args['plugins']
return IReader(_agent,loader) if _context == 'read' else IWriter(_agent,loader) return IReader(_agent,loader) if _context == 'read' else IWriter(_agent,loader)
else: else:
#
# We can handle the case for an ETL object
#
raise Exception ("Missing or Unknown provider") raise Exception ("Missing or Unknown provider")
pass pass
class get :
"""
This class is just a wrapper to make the interface (API) more conversational and easy to understand
"""
@staticmethod
def reader (**_args):
if not _args or ('provider' not in _args and 'label' not in _args):
_args['label'] = 'default'
_args['context'] = 'read'
return instance(**_args)
@staticmethod
def writer(**_args):
"""
This function is a wrapper that will return a writer to a database. It disambiguates the interface
"""
if not _args or ('provider' not in _args and 'label' not in _args):
_args['label'] = 'default'
_args['context'] = 'write'
return instance(**_args)
@staticmethod
def etl (**_args):
if 'source' in _args and 'target' in _args :
return IETL(**_args)
else:
raise Exception ("Malformed input found, object must have both 'source' and 'target' attributes")
def supported (): def supported ():
_info = {} _info = {}
for _provider in PROVIDERS : for _provider in PROVIDERS :

@ -3,10 +3,13 @@ Data Transport - 1.0
Steve L. Nyemba, The Phi Technology LLC Steve L. Nyemba, The Phi Technology LLC
This file is a wrapper around s3 bucket provided by AWS for reading and writing content This file is a wrapper around s3 bucket provided by AWS for reading and writing content
TODO:
- Address limitations that will properly read csv if it is stored with content type text/csv
""" """
from datetime import datetime from datetime import datetime
import boto import boto3
from boto.s3.connection import S3Connection, OrdinaryCallingFormat # from boto.s3.connection import S3Connection, OrdinaryCallingFormat
import numpy as np import numpy as np
import botocore import botocore
from smart_open import smart_open from smart_open import smart_open
@ -14,6 +17,7 @@ import sys
import json import json
from io import StringIO from io import StringIO
import pandas as pd
import json import json
class s3 : class s3 :
@ -29,46 +33,37 @@ class s3 :
@param filter filename or filtering elements @param filter filename or filtering elements
""" """
try: try:
self.s3 = S3Connection(args['access_key'],args['secret_key'],calling_format=OrdinaryCallingFormat()) self._client = boto3.client('s3',aws_access_key_id=args['access_key'],aws_secret_access_key=args['secret_key'],region_name=args['region'])
self.bucket = self.s3.get_bucket(args['bucket'].strip(),validate=False) if 'bucket' in args else None self._bucket_name = args['bucket']
# self.path = args['path'] self._file_name = args['file']
self.filter = args['filter'] if 'filter' in args else None self._region = args['region']
self.filename = args['file'] if 'file' in args else None
self.bucket_name = args['bucket'] if 'bucket' in args else None
except Exception as e : except Exception as e :
self.s3 = None
self.bucket = None
print (e) print (e)
pass
def has(self,**_args):
_found = None
try:
if 'file' in _args and 'bucket' in _args:
_found = self.meta(**_args)
elif 'bucket' in _args and not 'file' in _args:
_found = self._client.list_objects(Bucket=_args['bucket'])
elif 'file' in _args and not 'bucket' in _args :
_found = self.meta(bucket=self._bucket_name,file = _args['file'])
except Exception as e:
_found = None
pass
return type(_found) == dict
def meta(self,**args): def meta(self,**args):
""" """
This function will return information about the file in a given bucket
:name name of the bucket :name name of the bucket
""" """
info = self.list(**args) _bucket = self._bucket_name if 'bucket' not in args else args['bucket']
[item.open() for item in info] _file = self._file_name if 'file' not in args else args['file']
return [{"name":item.name,"size":item.size} for item in info] _data = self._client.get_object(Bucket=_bucket,Key=_file)
def list(self,**args): return _data['ResponseMetadata']
""" def close(self):
This function will list the content of a bucket, the bucket must be provided by the name self._client.close()
:name name of the bucket
"""
return list(self.s3.get_bucket(args['name']).list())
def buckets(self):
#
# This function will return all buckets, not sure why but it should be used cautiously
# based on why the s3 infrastructure is used
#
return [item.name for item in self.s3.get_all_buckets()]
# def buckets(self):
pass
# """
# This function is a wrapper around the bucket list of buckets for s3
# """
# return self.s3.get_all_buckets()
class Reader(s3) : class Reader(s3) :
""" """
@ -77,51 +72,66 @@ class Reader(s3) :
- stream content if file is Not None - stream content if file is Not None
@TODO: support read from all buckets, think about it @TODO: support read from all buckets, think about it
""" """
def __init__(self,**args) : def __init__(self,**_args) :
s3.__init__(self,**args) super().__init__(**_args)
def files(self):
r = [] def _stream(self,**_args):
try:
return [item.name for item in self.bucket if item.size > 0]
except Exception as e:
pass
return r
def stream(self,limit=-1):
""" """
At this point we should stream a file from a given bucket At this point we should stream a file from a given bucket
""" """
key = self.bucket.get_key(self.filename.strip()) _object = self._client.get_object(Bucket=_args['bucket'],Key=_args['file'])
if key is None : _stream = None
yield None try:
_stream = _object['Body'].read()
except Exception as e:
pass
if not _stream :
return None
if _object['ContentType'] in ['text/csv'] :
return pd.read_csv(StringIO(_stream.decode('utf-8'))) # decode the bytes instead of str(), which keeps the b'...' prefix
else: else:
count = 0 return _stream
with smart_open(key) as remote_file:
for line in remote_file:
if count == limit and limit > 0 :
break
yield line
count += 1
def read(self,**args) : def read(self,**args) :
if self.filename is None :
# _name = self._file_name if 'file' not in args else args['file']
# returning the list of files because no one file was specified. _bucket = args['bucket'] if 'bucket' in args else self._bucket_name
return self.files() return self._stream(bucket=_bucket,file=_name)
else:
limit = args['size'] if 'size' in args else -1
return self.stream(limit)
class Writer(s3) : class Writer(s3) :
"""
def __init__(self,**args) : """
s3.__init__(self,**args) def __init__(self,**_args) :
def mkdir(self,name): super().__init__(**_args)
#
#
if not self.has(bucket=self._bucket_name) :
self.make_bucket(self._bucket_name)
def make_bucket(self,bucket_name):
""" """
This function will create a folder in a bucket This function will create a folder in a bucket. It is best that the bucket is organized as a namespace
:name name of the folder :name name of the folder
""" """
self.s3.put_object(Bucket=self.bucket_name,key=(name+'/'))
def write(self,content): self._client.create_bucket(Bucket=bucket_name,CreateBucketConfiguration={'LocationConstraint': self._region})
file = StringIO(content.decode("utf8")) def write(self,_data,**_args):
self.s3.upload_fileobj(file,self.bucket_name,self.filename) """
This function will write the data to the s3 bucket, files can be either csv, or json formatted files
"""
content = 'text/plain'
if type(_data) == pd.DataFrame :
_stream = _data.to_csv(index=False)
content = 'text/csv'
elif type(_data) == dict :
_stream = json.dumps(_data)
content = 'application/json'
else:
_stream = _data
file = StringIO(_stream)
bucket = self._bucket_name if 'bucket' not in _args else _args['bucket']
file_name = self._file_name if 'file' not in _args else _args['file']
self._client.put_object(Bucket=bucket, Key = file_name, Body=_stream,ContentType=content)
pass pass
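The new `Writer.write` picks a content type from the kind of object it receives, and `Reader._stream` branches on that content type when reading back. A minimal sketch of that round trip follows, using only the standard library. The helper names `serialize` and `deserialize` are hypothetical, the data-frame branch is duck-typed on `to_csv` rather than tied to pandas, and parsing JSON on the way back is an assumption (the code above returns the raw stream for anything other than `text/csv`).

```python
import csv
import json
from io import StringIO

def serialize(_data):
    """Return (body, content_type) for an object about to be uploaded."""
    if hasattr(_data, 'to_csv'):
        # Duck-typed stand-in for the pandas.DataFrame branch
        return _data.to_csv(index=False), 'text/csv'
    if isinstance(_data, dict):
        return json.dumps(_data), 'application/json'
    # Anything else is passed through as plain text
    return _data, 'text/plain'

def deserialize(_body, _content_type):
    """Decode a downloaded body (bytes or str) according to its content type."""
    _text = _body.decode('utf-8') if isinstance(_body, bytes) else _body
    if _content_type == 'text/csv':
        # One dictionary per row, keyed by the header line
        return list(csv.DictReader(StringIO(_text)))
    if _content_type == 'application/json':
        return json.loads(_text)
    return _text

# Usage: serialize a dictionary, then read it back as if it came from a bucket
_body, _type = serialize({'name': 'transport', 'version': '2.2.0'})
print(_type)
print(deserialize(_body.encode('utf-8'), _type))
```

Decoding the bytes explicitly avoids having to clean up the textual representation of a `bytes` object before parsing.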

@ -0,0 +1,19 @@
"""
This file is intended to handle duckdb databases
"""
import duckdb
from transport.common import Reader,Writer
class Duck(Reader):
def __init__(self,**_args):
super().__init__(**_args)
self._path = None if 'path' not in _args else _args['path']
self._handler = duckdb.connect() if not self._path else duckdb.connect(self._path)
class DuckReader(Duck) :
def __init__(self,**_args):
super().__init__(**_args)
def read(self,**_args) :
pass

@ -39,22 +39,22 @@ import os
from multiprocessing import Process from multiprocessing import Process
SYS_ARGS = {} # SYS_ARGS = {}
if len(sys.argv) > 1: # if len(sys.argv) > 1:
N = len(sys.argv) # N = len(sys.argv)
for i in range(1,N): # for i in range(1,N):
value = None # value = None
if sys.argv[i].startswith('--'): # if sys.argv[i].startswith('--'):
key = sys.argv[i][2:] #.replace('-','') # key = sys.argv[i][2:] #.replace('-','')
SYS_ARGS[key] = 1 # SYS_ARGS[key] = 1
if i + 1 < N: # if i + 1 < N:
value = sys.argv[i + 1] = sys.argv[i+1].strip() # value = sys.argv[i + 1] = sys.argv[i+1].strip()
if key and value and not value.startswith('--'): # if key and value and not value.startswith('--'):
SYS_ARGS[key] = value # SYS_ARGS[key] = value
i += 2 # i += 2
class Transporter(Process): class Transporter(Process):
""" """
The transporter (Jason Stathem) moves data from one persistent store to another The transporter (Jason Stathem) moves data from one persistent store to another
@ -74,81 +74,72 @@ class Transporter(Process):
# #
# Let's ensure we can support multiple targets # Let's ensure we can support multiple targets
self._target = [self._target] if type(self._target) != list else self._target self._target = [self._target] if type(self._target) != list else self._target
pass pass
def read(self,**_args): def run(self):
"""
This function _reader = transport.get.etl(source=self._source,target=self._target)
"""
_reader = transport.factory.instance(**self._source)
# #
# If arguments are provided then a query is to be executed (not just a table dump)
if 'cmd' in self._source or 'query' in self._source : if 'cmd' in self._source or 'query' in self._source :
_query = self._source['cmd'] if 'cmd' in self._source else self._source['query'] _query = self._source['cmd'] if 'cmd' in self._source else self._source['query']
return _reader.read(**_query) return _reader.read(**_query)
else: else:
return _reader.read() return _reader.read()
# return _reader.read() if 'query' not in self._source else _reader.read(**self._source['query'])
def _delegate_write(self,_data,**_args):
"""
This function will write a data-frame to a designated data-store, The function is built around a delegation design pattern
:data data-frame or object to be written
"""
if _data.shape[0] > 0 :
for _target in self._target :
if 'write' not in _target :
_target['context'] = 'write'
# _target['lock'] = True
else:
# _target['write']['lock'] = True
pass
_writer = transport.factory.instance(**_target)
_writer.write(_data,**_args)
if hasattr(_writer,'close') :
_writer.close()
def write(self,_df,**_args):
"""
"""
SEGMENT_COUNT = 6
MAX_ROWS = 1000000
# _df = self.read()
_segments = np.array_split(np.arange(_df.shape[0]),SEGMENT_COUNT) if _df.shape[0] > MAX_ROWS else np.array( [np.arange(_df.shape[0])])
# _index = 0
for _indexes in _segments :
_fwd_args = {} if not _args else _args
self._delegate_write(_df.iloc[_indexes],**_fwd_args)
time.sleep(1)
#
# @TODO: Perhaps consider writing up each segment in a thread/process (speeds things up?)
pass
def instance(**_args): # def _read(self,**_args):
_proxy = lambda _agent: _agent.write(_agent.read()) # """
if 'source' in _args and 'target' in _args : # This function
# """
_agent = Transporter(**_args) # _reader = transport.factory.instance(**self._source)
_proxy(_agent) # #
# # If arguments are provided then a query is to be executed (not just a table dump)
else: # if 'cmd' in self._source or 'query' in self._source :
_config = _args['config'] # _query = self._source['cmd'] if 'cmd' in self._source else self._source['query']
_items = [Transporter(**_item) for _item in _config ] # return _reader.read(**_query)
_MAX_JOBS = 5 # else:
_items = np.array_split(_items,_MAX_JOBS) # return _reader.read()
for _batch in _items : # # return _reader.read() if 'query' not in self._source else _reader.read(**self._source['query'])
jobs = []
for _item in _batch : # def _delegate_write(self,_data,**_args):
thread = Process(target=_proxy,args = (_item,)) # """
thread.start() # This function will write a data-frame to a designated data-store, The function is built around a delegation design pattern
jobs.append(thread) # :data data-frame or object to be written
while jobs : # """
jobs = [thread for thread in jobs if thread.is_alive()] # if _data.shape[0] > 0 :
time.sleep(1) # for _target in self._target :
# if 'write' not in _target :
# _target['context'] = 'write'
# # _target['lock'] = True
# else:
# # _target['write']['lock'] = True
# pass
# _writer = transport.factory.instance(**_target)
# _writer.write(_data,**_args)
# if hasattr(_writer,'close') :
# _writer.close()
# def write(self,_df,**_args):
# """
# """
# SEGMENT_COUNT = 6
# MAX_ROWS = 1000000
# # _df = self.read()
# _segments = np.array_split(np.arange(_df.shape[0]),SEGMENT_COUNT) if _df.shape[0] > MAX_ROWS else np.array( [np.arange(_df.shape[0])])
# # _index = 0
# for _indexes in _segments :
# _fwd_args = {} if not _args else _args
# self._delegate_write(_df.iloc[_indexes],**_fwd_args)
# time.sleep(1)
# #
# # @TODO: Perhaps consider writing up each segment in a thread/process (speeds things up?)
# pass
def instance(**_args):
pthread = Transporter (**_args)
pthread.start()
return pthread
pass pass
# class Post(Process): # class Post(Process):
# def __init__(self,**args): # def __init__(self,**args):

@ -1,14 +1,39 @@
""" """
This class is a wrapper around read/write classes of cloud,sql,nosql,other packages This class is a wrapper around read/write classes of cloud,sql,nosql,other packages
The wrapper allows for application of plugins as pre-post conditions The wrapper allows for application of plugins as pre-post conditions.
NOTE: Plugins are converted to a pipeline, so we apply a pipeline when reading or writing:
- upon initialization we will load plugins
- on read/write we apply a pipeline (if passed as an argument)
""" """
from transport.plugins import plugin, PluginLoader
import transport
from transport import providers
from multiprocessing import Process
import time
class IO: class IO:
""" """
Base wrapper class for read/write Base wrapper class for read/write and support for logs
""" """
def __init__(self,_agent,plugins): def __init__(self,_agent,plugins):
self._agent = _agent self._agent = _agent
self._plugins = plugins if plugins :
self._init_plugins(plugins)
else:
self._plugins = None
def _init_plugins(self,_args):
"""
This function will load pipelined functions as a plugin loader
"""
if 'path' in _args and 'names' in _args :
self._plugins = PluginLoader(**_args)
else:
self._plugins = PluginLoader()
[self._plugins.set(_pointer) for _pointer in _args]
#
# @TODO: We should have a way to log what plugins are loaded and ready to use
def meta (self,**_args): def meta (self,**_args):
if hasattr(self._agent,'meta') : if hasattr(self._agent,'meta') :
return self._agent.meta(**_args) return self._agent.meta(**_args)
@ -27,10 +52,22 @@ class IO:
if hasattr(self._agent,'apply') : if hasattr(self._agent,'apply') :
return self._agent.apply(_query) return self._agent.apply(_query)
return None return None
def submit(self,_query):
return self.delegate('submit',_query)
def delegate(self,_name,_query):
if hasattr(self._agent,_name) :
pointer = getattr(self._agent,_name)
return pointer(_query)
return None
class IReader(IO): class IReader(IO):
"""
This is a wrapper for read functionalities
"""
def __init__(self,_agent,pipeline=None): def __init__(self,_agent,pipeline=None):
super().__init__(_agent,pipeline) super().__init__(_agent,pipeline)
def read(self,**_args): def read(self,**_args):
if 'plugins' in _args :
self._init_plugins(_args['plugins'])
_data = self._agent.read(**_args) _data = self._agent.read(**_args)
if self._plugins and self._plugins.ratio() > 0 : if self._plugins and self._plugins.ratio() > 0 :
_data = self._plugins.apply(_data) _data = self._plugins.apply(_data)
@ -41,7 +78,43 @@ class IWriter(IO):
def __init__(self,_agent,pipeline=None): def __init__(self,_agent,pipeline=None):
super().__init__(_agent,pipeline) super().__init__(_agent,pipeline)
def write(self,_data,**_args): def write(self,_data,**_args):
if 'plugins' in _args :
self._init_plugins(_args['plugins'])
if self._plugins and self._plugins.ratio() > 0 : if self._plugins and self._plugins.ratio() > 0 :
_data = self._plugins.apply(_data) _data = self._plugins.apply(_data)
self._agent.write(_data,**_args) self._agent.write(_data,**_args)
#
# The ETL object in its simplest form is an aggregation of read/write objects
# @TODO: ETL can/should aggregate a writer as a plugin and apply it as a process
class IETL(IReader) :
"""
This class performs an ETL operation by inheriting a reader and adding writes as pipeline functions
"""
def __init__(self,**_args):
super().__init__(transport.get.reader(**_args['source']))
if 'target' in _args:
self._targets = _args['target'] if type(_args['target']) == list else [_args['target']]
else:
self._targets = []
self.jobs = []
#
# If the parent is already multiprocessing
self._hasParentProcess = False if 'hasParentProcess' not in _args else _args['hasParentProcess']
def read(self,**_args):
_data = super().read(**_args)
for _kwargs in self._targets :
self.post(_data,**_kwargs)
return _data
def post (self,_data,**_args) :
"""
This function creates a writer from the given parameters, writes the data and closes the writer
:_args parameters associated with writer object
"""
writer = transport.get.writer(**_args)
writer.write(_data)
writer.close()
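The wrapper classes above combine two ideas: delegating a call to the underlying agent by name, and fanning a single read out to every configured target. The sketch below illustrates both with in-memory stand-ins. `ListReader` and `ListWriter` are hypothetical agents, and the constructor takes ready-made objects instead of calling `transport.get.reader` and `transport.get.writer`.

```python
class IO:
    def __init__(self, _agent):
        self._agent = _agent
    def delegate(self, _name, _query):
        # Forward the call only if the agent implements the named method
        if hasattr(self._agent, _name):
            return getattr(self._agent, _name)(_query)
        return None

class IETL(IO):
    def __init__(self, _reader, _writers):
        super().__init__(_reader)
        self._writers = _writers
    def read(self, **_args):
        # Read once, then post the same data to every target
        _data = self._agent.read(**_args)
        for _writer in self._writers:
            _writer.write(_data)
        return _data

class ListReader:
    """Hypothetical agent that reads from a list held in memory."""
    def __init__(self, _rows):
        self._rows = _rows
    def read(self, **_args):
        return list(self._rows)
    def submit(self, _query):
        return f"submitted: {_query}"

class ListWriter:
    """Hypothetical agent that appends whatever it is given to a list."""
    def __init__(self):
        self.rows = []
    def write(self, _data):
        self.rows.extend(_data)

# Usage: one source, two targets
_first, _second = ListWriter(), ListWriter()
_etl = IETL(ListReader([1, 2, 3]), [_first, _second])
_etl.read()
print(_first.rows, _second.rows)
print(_etl.delegate('submit', 'vacuum'))
```

Returning `None` for an unsupported method keeps callers free of provider-specific checks, at the cost of not distinguishing "unsupported" from "returned nothing".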

@ -33,6 +33,8 @@ class Mongo :
:password password for current user :password password for current user
""" """
self.host = 'localhost' if 'host' not in args else args['host'] self.host = 'localhost' if 'host' not in args else args['host']
if ':' not in self.host and 'port' in args :
self.host = ':'.join([self.host,str(args['port'])])
self.mechanism= 'SCRAM-SHA-256' if 'mechanism' not in args else args['mechanism'] self.mechanism= 'SCRAM-SHA-256' if 'mechanism' not in args else args['mechanism']
# authSource=(args['authSource'] if 'authSource' in args else self.dbname) # authSource=(args['authSource'] if 'authSource' in args else self.dbname)
self._lock = False if 'lock' not in args else args['lock'] self._lock = False if 'lock' not in args else args['lock']
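The two added lines append the port to the host only when the host does not already carry one. A small sketch of that rule in isolation, assuming a hypothetical helper named `resolve_host`:

```python
def resolve_host(**args):
    """Mirror the host/port handling added to the Mongo constructor."""
    host = 'localhost' if 'host' not in args else args['host']
    if ':' not in host and 'port' in args:
        # The port may be given as an integer, so cast it before joining
        host = ':'.join([host, str(args['port'])])
    return host

print(resolve_host(port=27017))
print(resolve_host(host='db.local:27018', port=27017))
```

A host that already contains a colon is left untouched, so an explicit `host:port` string takes precedence over a separate `port` argument.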

@ -1 +1 @@
from . import files, http, rabbitmq, callback, files from . import files, http, rabbitmq, callback, console

@ -1,3 +1,7 @@
"""
This module uses a callback architectural style as a writer to enable user-defined code to handle the output of a reader.
The intent is to give users control over the output of data, to handle things like logging, encryption/decryption and other processing
"""
import queue import queue
from threading import Thread, Lock from threading import Thread, Lock
# from transport.common import Reader,Writer # from transport.common import Reader,Writer

@ -1,3 +1,6 @@
"""
This module uses the callback pattern to allow output to be printed to the console (debugging)
"""
from . import callback from . import callback

@ -53,8 +53,8 @@ class Writer (File):
""" """
try: try:
_delim = self._delimiter if 'delimiter' not in _args else _args['delimiter'] _delim = self.delimiter if 'delimiter' not in _args else _args['delimiter']
_path = self._path if 'path' not in _args else _args['path'] _path = self.path if 'path' not in _args else _args['path']
_mode = self._mode if 'mode' not in _args else _args['mode'] _mode = self._mode if 'mode' not in _args else _args['mode']
info.to_csv(_path,index=False,sep=_delim) info.to_csv(_path,index=False,sep=_delim)
@ -62,6 +62,7 @@ class Writer (File):
except Exception as e: except Exception as e:
# #
# Not sure what should be done here ... # Not sure what should be done here ...
print (e)
pass pass
finally: finally:
# DiskWriter.THREAD_LOCK.release() # DiskWriter.THREAD_LOCK.release()

@ -25,9 +25,9 @@ class plugin :
self._name = _args['name'] self._name = _args['name']
self._about = _args['about'] self._about = _args['about']
self._mode = _args['mode'] if 'mode' in _args else 'rw' self._mode = _args['mode'] if 'mode' in _args else 'rw'
def __call__(self,pointer): def __call__(self,pointer,**kwargs):
def wrapper(_args): def wrapper(_args,**kwargs):
return pointer(_args) return pointer(_args,**kwargs)
# #
# @TODO: # @TODO:
# add attributes to the wrapper object # add attributes to the wrapper object
@ -55,6 +55,7 @@ class PluginLoader :
self._names = [] self._names = []
if path and os.path.exists(path) and _names: if path and os.path.exists(path) and _names:
for _name in self._names : for _name in self._names :
spec = importlib.util.spec_from_file_location('private', path) spec = importlib.util.spec_from_file_location('private', path)
module = importlib.util.module_from_spec(spec) module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module) #--loads it into sys.modules spec.loader.exec_module(module) #--loads it into sys.modules
@ -101,7 +102,7 @@ class PluginLoader :
return _name in self._modules return _name in self._modules
def ratio (self): def ratio (self):
""" """
how many modules loaded vs unloaded given the list of names This function determines how many modules loaded vs unloaded given the list of names
""" """
_n = len(self._names) _n = len(self._names)
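The change to `plugin.__call__` lets a decorated function receive keyword arguments through the wrapper. The sketch below is a reduced version of the decorator, assuming only the `name` and `mode` arguments and leaving out the attributes the source marks as a TODO. The `scale` function is a hypothetical plugin.

```python
class plugin:
    def __init__(self, **_args):
        self._name = _args['name']
        self._mode = _args['mode'] if 'mode' in _args else 'rw'
    def __call__(self, pointer):
        def wrapper(_args, **kwargs):
            # Keyword arguments are forwarded untouched to the wrapped function
            return pointer(_args, **kwargs)
        return wrapper

@plugin(name='scale', mode='rw')
def scale(_rows, factor=1):
    """Hypothetical plugin: multiply every value by a factor."""
    return [_value * factor for _value in _rows]

print(scale([1, 2, 3], factor=2))
```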

@ -10,8 +10,11 @@ HTTP='http'
BIGQUERY ='bigquery' BIGQUERY ='bigquery'
FILE = 'file' FILE = 'file'
ETL = 'etl' ETL = 'etl'
SQLITE = 'sqlite' SQLITE = 'sqlite'
SQLITE3= 'sqlite3' SQLITE3= 'sqlite3'
DUCKDB = 'duckdb'
REDSHIFT = 'redshift' REDSHIFT = 'redshift'
NETEZZA = 'netezza' NETEZZA = 'netezza'
MYSQL = 'mysql' MYSQL = 'mysql'
@ -42,5 +45,6 @@ PGSQL = POSTGRESQL
AWS_S3 = 's3' AWS_S3 = 's3'
RABBIT = RABBITMQ RABBIT = RABBITMQ
# QLISTENER = 'qlistener' # QLISTENER = 'qlistener'

@ -0,0 +1,102 @@
import os
import json
from info import __version__
import copy
import transport
"""
This module manages data from the registry and allows read-only access to it
@TODO: add property to the DATA attribute
"""
REGISTRY_PATH=os.sep.join([os.environ['HOME'],'.data-transport'])
#
# This path can be overridden by an environment variable ...
#
if 'DATA_TRANSPORT_REGISTRY_PATH' in os.environ :
REGISTRY_PATH = os.environ['DATA_TRANSPORT_REGISTRY_PATH']
REGISTRY_FILE= 'transport-registry.json'
DATA = {}
def isloaded ():
return DATA not in [{},None]
def exists (path=REGISTRY_PATH) :
"""
This function determines if there is a registry at all
"""
p = os.path.exists(path)
q = os.path.exists( os.sep.join([path,REGISTRY_FILE]))
return p and q
def load (_path=REGISTRY_PATH):
global DATA
if exists(_path) :
path = os.sep.join([_path,REGISTRY_FILE])
f = open(path)
DATA = json.loads(f.read())
f.close()
def init (email,path=REGISTRY_PATH,override=False):
"""
Initializes the registry and raises an exception in the event of an issue
"""
p = '@' in email
q = False if '.' not in email else email.split('.')[-1] in ['edu','com','io','ai','org']
if p and q :
_config = {"email":email,'version':__version__}
if not os.path.exists(path):
os.makedirs(path)
filename = os.sep.join([path,REGISTRY_FILE])
if not os.path.exists(filename) or override == True :
f = open(filename,'w')
f.write( json.dumps(_config))
f.close()
# _msg = f"""{CHECK_MARK} Successfully wrote configuration to {path} from {email}"""
else:
raise Exception (f"""Unable to write configuration, Please check parameters (or help) and try again""")
else:
raise Exception (f"""Invalid Input, {email} is not well formatted, provide an email with adequate format""")
def lookup (label):
global DATA
return label in DATA
def get (label='default') :
global DATA
return copy.copy(DATA[label]) if label in DATA else {}
def set (label, auth_file, default=False,path=REGISTRY_PATH) :
"""
This function will add a label (auth-file data) into the registry and can set it as the default
"""
if label == 'default' :
raise Exception ("""Invalid label name provided, please change the label name and use the switch""")
reg_file = os.sep.join([path,REGISTRY_FILE])
if os.path.exists (auth_file) and os.path.exists(path) and os.path.exists(reg_file):
f = open(auth_file)
_info = json.loads(f.read())
f.close()
f = open(reg_file)
_config = json.loads(f.read())
f.close()
#
# set the proposed label
_object = transport.factory.instance(**_info)
if _object :
_config[label] = _info
if default :
_config['default'] = _info
#
# now we need to write this to the location
f = open(reg_file,'w')
f.write(json.dumps(_config))
f.close()
else:
raise Exception( f"""Unable to load file located at {path},\nLearn how to generate auth-file with wizard found at https://healthcareio.the-phi.com/data-transport""")
pass
else:
pass
pass
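The registry module resolves its folder from `DATA_TRANSPORT_REGISTRY_PATH` when that variable is set, and stores its content as JSON in `transport-registry.json`. The following sketch reproduces that flow with the standard library. It is an approximation under stated assumptions: the path is resolved at call time rather than at import time, `os.path.expanduser` replaces the direct lookup of `HOME`, the email validation is omitted, and the `version` default is a placeholder.

```python
import json
import os
import tempfile

REGISTRY_FILE = 'transport-registry.json'

def registry_path():
    """Return the registry folder, honoring the environment variable override."""
    _default = os.path.join(os.path.expanduser('~'), '.data-transport')
    return os.environ.get('DATA_TRANSPORT_REGISTRY_PATH', _default)

def init(email, path, version='0.0.0'):
    """Create the registry folder and write the initial configuration."""
    os.makedirs(path, exist_ok=True)
    _filename = os.path.join(path, REGISTRY_FILE)
    with open(_filename, 'w') as f:
        json.dump({'email': email, 'version': version}, f)
    return _filename

def load(path):
    """Return the registry content, or an empty dictionary if none exists."""
    _filename = os.path.join(path, REGISTRY_FILE)
    if not os.path.exists(_filename):
        return {}
    with open(_filename) as f:
        return json.load(f)

# Usage: point the registry at a temporary folder, then restore the environment
with tempfile.TemporaryDirectory() as _folder:
    os.environ['DATA_TRANSPORT_REGISTRY_PATH'] = _folder
    try:
        init('user@example.com', registry_path())
        print(load(registry_path()))
    finally:
        del os.environ['DATA_TRANSPORT_REGISTRY_PATH']
```

Using `with open(...)` closes the file even when parsing fails, which the explicit `open`/`close` pairs in the module do not guarantee.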

@ -3,7 +3,7 @@ This namespace/package wrap the sql functionalities for a certain data-stores
- netezza, postgresql, mysql and sqlite - netezza, postgresql, mysql and sqlite
- mariadb, redshift (also included) - mariadb, redshift (also included)
""" """
from . import postgresql, mysql, netezza, sqlite, sqlserver from . import postgresql, mysql, netezza, sqlite, sqlserver, duckdb
# #

@ -3,6 +3,8 @@ This file encapsulates common operations associated with SQL databases via SQLAl
""" """
import sqlalchemy as sqa import sqlalchemy as sqa
from sqlalchemy import text
import pandas as pd import pandas as pd
class Base: class Base:
@ -56,7 +58,15 @@ class Base:
@TODO: Execution of stored procedures @TODO: Execution of stored procedures
""" """
return pd.read_sql(sql,self._engine) if sql.lower().startswith('select') or sql.lower().startswith('with') else None if sql.lower().startswith('select') or sql.lower().startswith('with') :
return pd.read_sql(sql,self._engine)
else:
_handler = self._engine.connect()
_handler.execute(text(sql))
_handler.commit ()
_handler.close()
return None
class SQLBase(Base): class SQLBase(Base):
def __init__(self,**_args): def __init__(self,**_args):
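The rewritten `apply` now distinguishes statements that return rows (`select`, `with`) from statements that only need to be executed and committed. The sketch below shows the same branching with the standard library's `sqlite3` module standing in for the SQLAlchemy engine, and a list of dictionaries standing in for the pandas data-frame. Both substitutions are assumptions made to keep the example self-contained, and stripping leading whitespace before the check is an addition not present in the code above.

```python
import sqlite3

def apply(_conn, sql):
    """Return rows for select/with statements, otherwise execute and commit."""
    _head = sql.strip().lower()
    if _head.startswith('select') or _head.startswith('with'):
        _cursor = _conn.execute(sql)
        _columns = [_column[0] for _column in _cursor.description]
        return [dict(zip(_columns, _row)) for _row in _cursor.fetchall()]
    # Anything else (DDL, insert, update, delete) is executed and committed
    _conn.execute(sql)
    _conn.commit()
    return None

# Usage against an in-memory database
_conn = sqlite3.connect(':memory:')
apply(_conn, 'CREATE TABLE logs (id INTEGER, message TEXT)')
apply(_conn, "INSERT INTO logs VALUES (1, 'started')")
print(apply(_conn, 'SELECT * FROM logs'))
```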

@ -0,0 +1,24 @@
"""
This module implements the handler for duckdb (in memory or not)
"""
from transport.sql.common import Base, BaseReader, BaseWriter
class Duck :
def __init__(self,**_args):
#
# duckdb with none as database will operate as an in-memory database
#
self.database = _args['database'] if 'database' in _args else ''
def get_provider(self):
return "duckdb"
def _get_uri(self,**_args):
return f"""duckdb:///{self.database}"""
class Reader(Duck,BaseReader) :
def __init__(self,**_args):
Duck.__init__(self,**_args)
BaseReader.__init__(self,**_args)
class Writer(Duck,BaseWriter):
def __init__(self,**_args):
Duck.__init__(self,**_args)
BaseWriter.__init__(self,**_args)
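`Reader` and `Writer` call `Duck.__init__` before the base initializer. One plausible reason, sketched below, is that the base class needs the URI while it is being constructed, so `self.database` must already be set. The `Base` class here is a hypothetical stand-in that only records the URI; it is not the actual `transport.sql.common.Base`.

```python
class Base:
    """Hypothetical stand-in for the SQL base class: it only stores the URI."""
    def __init__(self, **_args):
        self._uri = self._get_uri(**_args)

class Duck:
    def __init__(self, **_args):
        # An empty database name is what the source comment describes as in-memory
        self.database = _args['database'] if 'database' in _args else ''
    def get_provider(self):
        return "duckdb"
    def _get_uri(self, **_args):
        return f"duckdb:///{self.database}"

class Reader(Duck, Base):
    def __init__(self, **_args):
        Duck.__init__(self, **_args)   # sets self.database first
        Base.__init__(self, **_args)   # can now resolve the URI

print(Reader(database='demo.duckdb')._uri)
print(Reader()._uri)
```

Swapping the two initializer calls in this sketch would raise an `AttributeError`, because `_get_uri` would run before `self.database` exists.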