version 2.0 - Refactored, Plugins support #17

Merged
steve merged 23 commits from v2.0 into master 6 months ago

@ -1,34 +1,16 @@
# Introduction
This project implements an abstraction of objects that can have access to a variety of data stores, implementing read/write with a simple and expressive interface. This abstraction works with **NoSQL** and **SQL** data stores and leverages **pandas**.
The supported data store providers :
| Provider | Underlying Drivers | Description |
| :---- | :----: | ----: |
| sqlite| Native SQLite|SQLite3|
| postgresql| psycopg2 | PostgreSQL
| redshift| psycopg2 | Amazon Redshift
| s3| boto3 | Amazon Simple Storage Service
| netezza| nzpsql | IBM Netezza
| Files: CSV, TSV| pandas| pandas data-frame
| Couchdb| cloudant | Couchbase/Couchdb
| mongodb| pymongo | Mongodb
| mysql| mysql| Mysql
| bigquery| google-bigquery| Google BigQuery
| mariadb| mysql| Mariadb
| rabbitmq|pika| RabbitMQ Publish/Subscribe
This project implements an abstraction of objects that can have access to a variety of data stores, implementing read/write with a simple and expressive interface. This abstraction works with **NoSQL**, **SQL** and **Cloud** data stores and leverages **pandas**.
# Why Use Data-Transport ?
Mostly for data scientists who don't really care about the underlying database and would like to manipulate data transparently.
Data scientists who don't really care about the underlying database and would like a simple and consistent way to read, write and move data are well served. Additionally, we implemented a lightweight Extract-Transform-Load (ETL) API and a command line (CLI) tool. Finally, it is possible to add pre/post processing pipeline functions to read/write.
1. Familiarity with **pandas data-frames**
2. Connectivity **drivers** are included
3. Mining data from various sources
4. Useful for data migrations or ETL
4. Useful for data migrations or **ETL**
# Usage
## Installation
@ -36,169 +18,7 @@ Within the virtual environment perform the following :
pip install git+https://github.com/lnyemba/data-transport.git
Once installed, **data-transport** can be used as a library in code or as a command line interface (CLI). As a CLI it is used for ETL and requires a configuration file.
## Data Transport as a Library (in code)
---
The data-transport can be used within code as a library, and offers the following capabilities:
* Read/Write against [mongodb](https://github.com/lnyemba/data-transport/wiki/mongodb)
* Read/Write against traditional [RDBMS](https://github.com/lnyemba/data-transport/wiki/rdbms)
* Read/Write against [bigquery](https://github.com/lnyemba/data-transport/wiki/bigquery)
* ETL CLI/Code [ETL](https://github.com/lnyemba/data-transport/wiki/etl)
* Support for pre/post conditions, i.e. it is possible to specify queries to run before or after a read or write
The read/write functions make data-transport a great candidate for **data-science**, **data-engineering** or all things pertaining to data. It enables operations across multiple data stores (relational or not).
## ETL
**Embedded in Code**
It is possible to perform ETL within custom code as follows :
```
import transport
import time
_info = [{'source':{'provider':'sqlite','path':'/home/me/foo.csv','table':'me',"pipeline":{"pre":[],"post":[]}},'target':{'provider':'bigquery','private_key':'/home/me/key.json','table':'me','dataset':'mydataset'}}, ...]
procs = transport.factory.instance(provider='etl',info=_info)
#
# wait until every ETL process has finished
#
while procs:
    procs = [pthread for pthread in procs if pthread.is_alive()]
    time.sleep(1)
```
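The polling loop above can be tried in isolation with the standard library. This is a minimal sketch, assuming the objects returned for an ETL job behave like `threading.Thread` (that is, they expose `is_alive()`); the `work` function and the `wait` helper are hypothetical stand-ins and not part of data-transport.

```python
import threading
import time

def work(results, index):
    # Hypothetical stand-in for an ETL job: pretend to move data, then record completion
    time.sleep(0.05)
    results.append(index)

def wait(jobs, delay=0.01):
    # Poll until every job has finished, dropping completed jobs on each pass
    while jobs:
        jobs = [job for job in jobs if job.is_alive()]
        time.sleep(delay)

results = []
jobs = [threading.Thread(target=work, args=(results, i)) for i in range(3)]
for job in jobs:
    job.start()
wait(jobs)
print(sorted(results))
```

Polling keeps the calling code simple; calling `join()` on each job would block in the same way without the sleep interval.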
**Command Line Interface (CLI):**
---
The CLI program is called **transport** and it requires a configuration file. The program is intended to move data from one location to another. Supported data stores are in the above paragraphs.
```
[
{
"id":"logs",
"source":{
"provider":"postgresql","context":"read","database":"mydb",
"cmd":{"sql":"SELECT * FROM logs limit 10"}
},
"target":{
"provider":"bigquery","private_key":"/bgqdrive/account/bq-service-account-key.json",
"dataset":"mydataset"
}
}
]
```
Assuming the above content is stored in a file called **etl-config.json**, we would perform the following in a terminal window:
```
[steve@data-transport]$ transport --config ./etl-config.json [--index <value>]
```
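Before handing a configuration file to the CLI it can help to check its shape. The sketch below assumes that each entry needs a `source` and a `target` and that both name a `provider`, as in the example above; `check_config` is a hypothetical helper and not part of data-transport.

```python
import json

def check_config(text):
    """Return a list of problems found in an ETL configuration (hypothetical helper)."""
    entries = json.loads(text)
    if isinstance(entries, dict):
        entries = [entries]   # a single job may be given as one object
    problems = []
    for index, entry in enumerate(entries):
        for side in ("source", "target"):
            section = entry.get(side)
            if not isinstance(section, dict):
                problems.append(f"entry {index}: missing '{side}'")
            elif "provider" not in section:
                problems.append(f"entry {index}: '{side}' has no provider")
    return problems

sample = """
[
  {"id": "logs",
   "source": {"provider": "postgresql", "database": "mydb",
              "cmd": {"sql": "SELECT * FROM logs limit 10"}},
   "target": {"provider": "bigquery", "dataset": "mydataset"}},
  {"id": "broken", "source": {"database": "mydb"}}
]
"""
print(check_config(sample))
```

An empty list means every entry has the two sections this sketch looks for; it says nothing about whether the credentials or queries are valid.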
**Reading/Writing Mongodb**
For this example we assume we are tunneling through port 27018 and there is no access control:
```
import transport
reader = transport.factory.instance(provider='mongodb',context='read',host='localhost',port='27018',db='example',doc='logs')
df = reader.read() #-- reads the entire collection
print (df.head())
#
#-- Applying mongodb command
PIPELINE = [{"$group":{"_id":None,"count":{"$sum":1}}}]
_command={"cursor":{},"allowDiskUse":True,"aggregate":"logs","pipeline":PIPELINE}
df = reader.read(mongo=_command)
print (df.head())
reader.close()
```
**Read/Writing to Mongodb**
---
Scenario 1: Mongodb with security in place
1. Define an authentication file on disk
The semantics of the attributes are defined by mongodb; please visit the [mongodb documentation](https://mongodb.org/docs). In this example the file is located at _/transport/mongo.json_
<div style="display:grid; grid-template-columns:60% auto; gap:4px">
<div>
<b>configuration file</b>
```
{
"username":"me","password":"changeme",
"mechanism":"SCRAM-SHA-1",
"authSource":"admin"
}
```
<b>Connecting to Mongodb </b>
```
import transport
PIPELINE = ... #-- do this yourself
MONGO_KEY = '/transport/mongo.json'
mreader = transport.factory.instance(provider=transport.providers.MONGODB,auth_file=MONGO_KEY,context='read',db='mydb',doc='logs')
_aggregateDF = mreader.read(mongo=PIPELINE) #--results of an aggregate pipeline
_collectionDF= mreader.read()
```
In order to enable write, change the **context** attribute to **'write'**.
</div>
<div>
- The configuration file is in JSON format
- The commands passed to mongodb are the same as those you would pass to runCommand in mongodb
- The output is a pandas data-frame
- By default the transport reads, to enable write operations use **context='write'**
|parameters|description |
| --- | --- |
|db| Name of the database|
|port| Port number to connect to|
|doc| Name of the collection of documents|
|username|Username |
|password|password|
|authSource|user database that has authentication info|
|mechanism|Mechanism used for authentication|
**NOTE**
Arguments like **db** or **doc** can be placed in the authentication file
</div>
</div>
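The note above says that arguments such as **db** or **doc** may live in the authentication file. One way to picture this is a merge of the file's contents with the keyword arguments. The sketch below is an assumption about that behaviour, not data-transport's actual code: `resolve_args` is a hypothetical helper, and letting explicit arguments override the file is a choice made for this illustration.

```python
import json
import os
import tempfile

def resolve_args(auth_file=None, **kwargs):
    # Hypothetical helper: start from the JSON auth file, then apply explicit arguments
    merged = {}
    if auth_file:
        with open(auth_file) as handle:
            merged.update(json.load(handle))
    merged.update(kwargs)
    return merged

# Write a throwaway auth file for the demonstration
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as handle:
    json.dump({"username": "me", "password": "changeme", "db": "mydb"}, handle)
    path = handle.name

args = resolve_args(auth_file=path, context="read", doc="logs")
os.remove(path)
print(args["db"], args["doc"], args["context"])
```

Keeping credentials in the file and passing only **context** and **doc** in code keeps secrets out of notebooks and scripts.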
**Limitations**
Reads and writes aren't encapsulated in the same object. This allows the calling code to perform actions deliberately and hopefully minimizes accidents associated with data wrangling.
```
import transport
import pandas as pd
writer = transport.factory.instance(provider=transport.providers.MONGODB,context='write',host='localhost',port='27018',db='example',doc='logs')
df = pd.DataFrame({"names":["steve","nico"],"age":[40,30]})
writer.write(df)
writer.close()
```
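The separation described above can be pictured as a factory that hands back either a read-only or a write-only object depending on **context**. This is an illustrative sketch with hypothetical `Reader`, `Writer` and `instance` names backed by an in-memory list; it is not how data-transport is implemented.

```python
class Reader:
    def __init__(self, store):
        self._store = store

    def read(self):
        # Return a copy so callers cannot modify the store through a reader
        return list(self._store)

class Writer:
    def __init__(self, store):
        self._store = store

    def write(self, rows):
        self._store.extend(rows)

def instance(store, context="read"):
    # Reading is the default; writing must be requested explicitly
    return Writer(store) if context == "write" else Reader(store)

store = []
writer = instance(store, context="write")
writer.write([{"names": "steve", "age": 40}, {"names": "nico", "age": 30}])
reader = instance(store)
print(len(reader.read()), hasattr(reader, "write"))
```

Because the reader has no `write` method, an accidental write through it fails immediately instead of changing data.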
#
# reading from postgresql
pgreader = factory.instance(type='postgresql',database=<database>,table=<table_name>)
pgreader.read() #-- will read the table by executing a SELECT
pgreader.read(sql=<sql query>)
#
# Reading a document and executing a view
#
document = dreader.read()
result = couchdb.view(id='<design_doc_id>',view_name='<view_name>',<key=value|keys=values>)
## Learn More
Notebooks with sample code to read/write against mongodb, couchdb, Netezza, PostgreSQL, Google Bigquery, Databricks, Microsoft SQL Server, MySQL ... are available. Visit the [data-transport homepage](https://healthcareio.the-phi.com/data-transport)

@ -48,24 +48,8 @@ import typer
import os
import transport
from transport import etl
from transport import providers
# from transport import providers
# SYS_ARGS = {}
# if len(sys.argv) > 1:
# N = len(sys.argv)
# for i in range(1,N):
# value = None
# if sys.argv[i].startswith('--'):
# key = sys.argv[i][2:] #.replace('-','')
# SYS_ARGS[key] = 1
# if i + 1 < N:
# value = sys.argv[i + 1] = sys.argv[i+1].strip()
# if key and value and not value.startswith('--'):
# SYS_ARGS[key] = value
# i += 2
app = typer.Typer()
@ -77,9 +61,15 @@ def wait(jobs):
jobs = [thread for thread in jobs if thread.is_alive()]
time.sleep(1)
@app.command()
def move (path,index=None):
@app.command(name="apply")
def apply (path,index=None):
"""
This function applies data transport from one source to one or several others
:path path of the configuration file
:index index of the _item of interest (otherwise everything will be processed)
"""
_proxy = lambda _object: _object.write(_object.read())
if os.path.exists(path):
file = open(path)
@ -90,27 +80,18 @@ def move (path,index=None):
etl.instance(**_config)
else:
etl.instance(config=_config)
@app.command(name="providers")
def supported (format:str="table") :
"""
This function will print supported providers and their associated classifications
"""
_df = (transport.supported())
if format in ['list','json'] :
print (json.dumps(_df.to_dict(orient="list")))
else:
print (_df)
print ()
#
# if type(_config) == dict :
# _object = transport.etl.instance(**_config)
# _proxy(_object)
# else:
# #
# # here we are dealing with a list of objects (long ass etl job)
# jobs = []
# failed = []
# for _args in _config :
# if index and _config.index(_args) != index :
# continue
# _object=transport.etl.instance(**_args)
# thread = Process(target=_proxy,args=(_object,))
# thread.start()
# jobs.append(thread())
# if _config.index(_args) == 0 :
# thread.join()
# wait(jobs)
@app.command()
def version():
print (transport.version.__version__)

@ -1,5 +1,5 @@
__author__ = 'The Phi Technology'
__version__= '1.9.8.20'
__version__= '2.0.2'
__license__="""

@ -0,0 +1,169 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Writing to Google Bigquery\n",
"\n",
"1. Insure you have a Google Bigquery service account key on disk\n",
"2. The service key location is set as an environment variable **BQ_KEY**\n",
"3. The dataset will be automatically created within the project associated with the service key\n",
"\n",
"The cell below creates a dataframe that will be stored within Google Bigquery"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 1/1 [00:00<00:00, 5440.08it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"['data transport version ', '2.0.0']\n"
]
}
],
"source": [
"#\n",
"# Writing to Google Bigquery database\n",
"#\n",
"import transport\n",
"from transport import providers\n",
"import pandas as pd\n",
"import os\n",
"\n",
"PRIVATE_KEY = os.environ['BQ_KEY'] #-- location of the service key\n",
"DATASET = 'demo'\n",
"_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n",
"bqw = transport.factory.instance(provider=providers.BIGQUERY,dataset=DATASET,table='friends',context='write',private_key=PRIVATE_KEY)\n",
"bqw.write(_data,if_exists='replace') #-- default is append\n",
"print (['data transport version ', transport.__version__])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Reading from Google Bigquery\n",
"\n",
"The cell below reads the data that has been written by the cell above and computes the average age within a Google Bigquery (simple query). \n",
"\n",
"- Basic read of the designated table (friends) created above\n",
"- Execute an aggregate SQL against the table\n",
"\n",
"**NOTE**\n",
"\n",
"It is possible to use **transport.factory.instance** or **transport.instance** they are the same. It allows the maintainers to know that we used a factory design pattern."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading: 100%|\u001b[32m██████████\u001b[0m|\n",
"Downloading: 100%|\u001b[32m██████████\u001b[0m|\n",
" name age\n",
"0 James Bond 55\n",
"1 Steve Rogers 150\n",
"2 Steve Nyemba 44\n",
"--------- STATISTICS ------------\n",
" _counts f0_\n",
"0 3 83.0\n"
]
}
],
"source": [
"\n",
"import transport\n",
"from transport import providers\n",
"import os\n",
"PRIVATE_KEY=os.environ['BQ_KEY']\n",
"pgr = transport.instance(provider=providers.BIGQUERY,dataset='demo',table='friends',private_key=PRIVATE_KEY)\n",
"_df = pgr.read()\n",
"_query = 'SELECT COUNT(*) _counts, AVG(age) from demo.friends'\n",
"_sdf = pgr.read(sql=_query)\n",
"print (_df)\n",
"print ('--------- STATISTICS ------------')\n",
"print (_sdf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell bellow show the content of an auth_file, in this case if the dataset/table in question is not to be shared then you can use auth_file with information associated with the parameters.\n",
"\n",
"**NOTE**:\n",
"\n",
"The auth_file is intended to be **JSON** formatted"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'dataset': 'demo', 'table': 'friends'}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"{\n",
" \n",
" \"dataset\":\"demo\",\"table\":\"friends\"\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -0,0 +1,155 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Writing to mongodb\n",
"\n",
"Insure mongodb is actually installed on the system, The cell below creates a dataframe that will be stored within mongodb"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.0.0\n"
]
}
],
"source": [
"#\n",
"# Writing to mongodb database\n",
"#\n",
"import transport\n",
"from transport import providers\n",
"import pandas as pd\n",
"_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n",
"mgw = transport.factory.instance(provider=providers.MONGODB,db='demo',collection='friends',context='write')\n",
"mgw.write(_data)\n",
"print (transport.__version__)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Reading from mongodb\n",
"\n",
"The cell below reads the data that has been written by the cell above and computes the average age within a mongodb pipeline. The code in the background executes an aggregation using **db.runCommand**\n",
"\n",
"- Basic read of the designated collection **find=\\<collection>**\n",
"- Executing an aggregate pipeline against a collection **aggreate=\\<collection>**\n",
"\n",
"**NOTE**\n",
"\n",
"It is possible to use **transport.factory.instance** or **transport.instance** they are the same. It allows the maintainers to know that we used a factory design pattern."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" name age\n",
"0 James Bond 55\n",
"1 Steve Rogers 150\n",
"--------- STATISTICS ------------\n",
" _id _counts _mean\n",
"0 0 2 102.5\n"
]
}
],
"source": [
"\n",
"import transport\n",
"from transport import providers\n",
"mgr = transport.instance(provider=providers.MONGODB,db='foo',collection='friends')\n",
"_df = mgr.read()\n",
"PIPELINE = [{\"$group\":{\"_id\":0,\"_counts\":{\"$sum\":1}, \"_mean\":{\"$avg\":\"$age\"}}}]\n",
"_sdf = mgr.read(aggregate='friends',pipeline=PIPELINE)\n",
"print (_df)\n",
"print ('--------- STATISTICS ------------')\n",
"print (_sdf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell bellow show the content of an auth_file, in this case if the dataset/table in question is not to be shared then you can use auth_file with information associated with the parameters.\n",
"\n",
"**NOTE**:\n",
"\n",
"The auth_file is intended to be **JSON** formatted"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'host': 'klingon.io',\n",
" 'port': 27017,\n",
" 'username': 'me',\n",
" 'password': 'foobar',\n",
" 'db': 'foo',\n",
" 'collection': 'friends',\n",
" 'authSource': '<authdb>',\n",
" 'mechamism': '<SCRAM-SHA-256|MONGODB-CR|SCRAM-SHA-1>'}"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{\n",
" \"host\":\"klingon.io\",\"port\":27017,\"username\":\"me\",\"password\":\"foobar\",\"db\":\"foo\",\"collection\":\"friends\",\n",
" \"authSource\":\"<authdb>\",\"mechamism\":\"<SCRAM-SHA-256|MONGODB-CR|SCRAM-SHA-1>\"\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -0,0 +1,160 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Writing to Microsoft SQLServer\n",
"\n",
"1. Insure the Microsoft SQL Server is installed and you have access i.e account information\n",
"2. The target database must be created before hand.\n",
"3. We created an authentication file that will contain user account and location of the database\n",
"\n",
"The cell below creates a dataframe that will be stored in a Microsoft SQL Server database.\n",
"\n",
"**NOTE** This was not tested with a cloud instance"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['data transport version ', '2.0.0']\n"
]
}
],
"source": [
"#\n",
"# Writing to Google Bigquery database\n",
"#\n",
"import transport\n",
"from transport import providers\n",
"import pandas as pd\n",
"import os\n",
"\n",
"AUTH_FOLDER = os.environ['DT_AUTH_FOLDER'] #-- location of the service key\n",
"MSSQL_AUTH_FILE= os.sep.join([AUTH_FOLDER,'mssql.json'])\n",
"\n",
"_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n",
"msw = transport.factory.instance(provider=providers.MSSQL,table='friends',context='write',auth_file=MSSQL_AUTH_FILE)\n",
"msw.write(_data,if_exists='replace') #-- default is append\n",
"print (['data transport version ', transport.__version__])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Reading from Microsoft SQL Server database\n",
"\n",
"The cell below reads the data that has been written by the cell above and computes the average age within an MS SQL Server (simple query). \n",
"\n",
"- Basic read of the designated table (friends) created above\n",
"- Execute an aggregate SQL against the table\n",
"\n",
"**NOTE**\n",
"\n",
"It is possible to use **transport.factory.instance** or **transport.instance** they are the same. It allows the maintainers to know that we used a factory design pattern."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" name age\n",
"0 James Bond 55\n",
"1 Steve Rogers 150\n",
"2 Steve Nyemba 44\n",
"\n",
"--------- STATISTICS ------------\n",
"\n",
" _counts \n",
"0 3 83\n"
]
}
],
"source": [
"\n",
"import transport\n",
"from transport import providers\n",
"import os\n",
"AUTH_FOLDER = os.environ['DT_AUTH_FOLDER'] #-- location of the service key\n",
"MSSQL_AUTH_FILE= os.sep.join([AUTH_FOLDER,'mssql.json'])\n",
"\n",
"msr = transport.instance(provider=providers.MSSQL,table='friends',auth_file=MSSQL_AUTH_FILE)\n",
"_df = msr.read()\n",
"_query = 'SELECT COUNT(*) _counts, AVG(age) from friends'\n",
"_sdf = msr.read(sql=_query)\n",
"print (_df)\n",
"print ('\\n--------- STATISTICS ------------\\n')\n",
"print (_sdf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell bellow show the content of an auth_file, in this case if the dataset/table in question is not to be shared then you can use auth_file with information associated with the parameters.\n",
"\n",
"**NOTE**:\n",
"\n",
"The auth_file is intended to be **JSON** formatted"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'dataset': 'demo', 'table': 'friends'}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"{\n",
" \n",
" \"dataset\":\"demo\",\"table\":\"friends\",\"username\":\"<username>\",\"password\":\"<password>\"\n",
"}"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -0,0 +1,150 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Writing to MySQL\n",
"\n",
"1. Insure MySQL is actually installed on the system, \n",
"2. There is a database called demo created on the said system\n",
"\n",
"The cell below creates a dataframe that will be stored within postgreSQL"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.0.0\n"
]
}
],
"source": [
"#\n",
"# Writing to PostgreSQL database\n",
"#\n",
"import transport\n",
"from transport import providers\n",
"import pandas as pd\n",
"_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n",
"myw = transport.factory.instance(provider=providers.MYSQL,database='demo',table='friends',context='write',auth_file=\"/home/steve/auth-mysql.json\")\n",
"myw.write(_data,if_exists='replace') #-- default is append\n",
"print (transport.__version__)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Reading from MySQL\n",
"\n",
"The cell below reads the data that has been written by the cell above and computes the average age within a MySQL (simple query). \n",
"\n",
"- Basic read of the designated table (friends) created above\n",
"- Execute an aggregate SQL against the table\n",
"\n",
"**NOTE**\n",
"\n",
"It is possible to use **transport.factory.instance** or **transport.instance** they are the same. It allows the maintainers to know that we used a factory design pattern."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" name age\n",
"0 James Bond 55\n",
"1 Steve Rogers 150\n",
"2 Steve Nyemba 44\n",
"--------- STATISTICS ------------\n",
" _counts avg\n",
"0 3 83.0\n"
]
}
],
"source": [
"\n",
"import transport\n",
"from transport import providers\n",
"myr = transport.instance(provider=providers.POSTGRESQL,database='demo',table='friends',auth_file='/home/steve/auth-mysql.json')\n",
"_df = myr.read()\n",
"_query = 'SELECT COUNT(*) _counts, AVG(age) from friends'\n",
"_sdf = myr.read(sql=_query)\n",
"print (_df)\n",
"print ('--------- STATISTICS ------------')\n",
"print (_sdf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell bellow show the content of an auth_file, in this case if the dataset/table in question is not to be shared then you can use auth_file with information associated with the parameters.\n",
"\n",
"**NOTE**:\n",
"\n",
"The auth_file is intended to be **JSON** formatted"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'host': 'klingon.io',\n",
" 'port': 3306,\n",
" 'username': 'me',\n",
" 'password': 'foobar',\n",
" 'database': 'demo',\n",
" 'table': 'friends'}"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{\n",
" \"host\":\"klingon.io\",\"port\":3306,\"username\":\"me\",\"password\":\"foobar\",\n",
" \"database\":\"demo\",\"table\":\"friends\"\n",
"}"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -0,0 +1,157 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Writing to PostgreSQL\n",
"\n",
"1. Insure PostgreSQL is actually installed on the system, \n",
"2. There is a database called demo created on the said system\n",
"\n",
"The cell below creates a dataframe that will be stored within postgreSQL"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.0.0\n"
]
}
],
"source": [
"#\n",
"# Writing to PostgreSQL database\n",
"#\n",
"import transport\n",
"from transport import providers\n",
"import pandas as pd\n",
"_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n",
"pgw = transport.factory.instance(provider=providers.POSTGRESQL,database='demo',table='friends',context='write')\n",
"pgw.write(_data,if_exists='replace') #-- default is append\n",
"print (transport.__version__)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Reading from PostgreSQL\n",
"\n",
"The cell below reads the data that has been written by the cell above and computes the average age within a PostreSQL (simple query). \n",
"\n",
"- Basic read of the designated table (friends) created above\n",
"- Execute an aggregate SQL against the table\n",
"\n",
"**NOTE**\n",
"\n",
"It is possible to use **transport.factory.instance** or **transport.instance** they are the same. It allows the maintainers to know that we used a factory design pattern."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" name age\n",
"0 James Bond 55\n",
"1 Steve Rogers 150\n",
"2 Steve Nyemba 44\n",
"--------- STATISTICS ------------\n",
" _counts avg\n",
"0 3 83.0\n"
]
}
],
"source": [
"\n",
"import transport\n",
"from transport import providers\n",
"pgr = transport.instance(provider=providers.POSTGRESQL,database='demo',table='friends')\n",
"_df = pgr.read()\n",
"_query = 'SELECT COUNT(*) _counts, AVG(age) from friends'\n",
"_sdf = pgr.read(sql=_query)\n",
"print (_df)\n",
"print ('--------- STATISTICS ------------')\n",
"print (_sdf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell bellow show the content of an auth_file, in this case if the dataset/table in question is not to be shared then you can use auth_file with information associated with the parameters.\n",
"\n",
"**NOTE**:\n",
"\n",
"The auth_file is intended to be **JSON** formatted"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'host': 'klingon.io',\n",
" 'port': 5432,\n",
" 'username': 'me',\n",
" 'password': 'foobar',\n",
" 'database': 'demo',\n",
" 'table': 'friends'}"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{\n",
" \"host\":\"klingon.io\",\"port\":5432,\"username\":\"me\",\"password\":\"foobar\",\n",
" \"database\":\"demo\",\"table\":\"friends\"\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -0,0 +1,139 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Writing to SQLite3+\n",
"\n",
"The requirements to get started are minimal (actually none). The cell below creates a dataframe that will be stored within SQLite 3+"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.0.0\n"
]
}
],
"source": [
"#\n",
"# Writing to PostgreSQL database\n",
"#\n",
"import transport\n",
"from transport import providers\n",
"import pandas as pd\n",
"_data = pd.DataFrame({\"name\":['James Bond','Steve Rogers','Steve Nyemba'],'age':[55,150,44]})\n",
"sqw = transport.factory.instance(provider=providers.SQLITE,database='/home/steve/demo.db3',table='friends',context='write')\n",
"sqw.write(_data,if_exists='replace') #-- default is append\n",
"print (transport.__version__)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Reading from SQLite3+\n",
"\n",
"The cell below reads the data that has been written by the cell above and computes the average age within a PostreSQL (simple query). \n",
"\n",
"- Basic read of the designated table (friends) created above\n",
"- Execute an aggregate SQL against the table\n",
"\n",
"**NOTE**\n",
"\n",
"It is possible to use **transport.factory.instance** or **transport.instance** they are the same. It allows the maintainers to know that we used a factory design pattern."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" name age\n",
"0 James Bond 55\n",
"1 Steve Rogers 150\n",
"2 Steve Nyemba 44\n",
"--------- STATISTICS ------------\n",
" _counts AVG(age)\n",
"0 3 83.0\n"
]
}
],
"source": [
"\n",
"import transport\n",
"from transport import providers\n",
"pgr = transport.instance(provider=providers.SQLITE,database='/home/steve/demo.db3',table='friends')\n",
"_df = pgr.read()\n",
"_query = 'SELECT COUNT(*) _counts, AVG(age) from friends'\n",
"_sdf = pgr.read(sql=_query)\n",
"print (_df)\n",
"print ('--------- STATISTICS ------------')\n",
"print (_sdf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell bellow show the content of an auth_file, in this case if the dataset/table in question is not to be shared then you can use auth_file with information associated with the parameters.\n",
"\n",
"**NOTE**:\n",
"\n",
"The auth_file is intended to be **JSON** formatted. This is an overkill for SQLite ;-)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"\n",
"{\n",
" \"provider\":\"sqlite\",\n",
" \"database\":\"/home/steve/demo.db3\",\"table\":\"friends\"\n",
"}\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -18,12 +18,14 @@ args = {
"version":__version__,
"author":__author__,"author_email":"info@the-phi.com",
"license":"MIT",
"packages":["transport","info"]}
# "packages":["transport","info","transport/sql"]},
"packages": find_packages(include=['info','transport', 'transport.*'])}
args["keywords"]=['mongodb','couchdb','rabbitmq','file','read','write','s3','sqlite']
args["install_requires"] = ['pyncclient','pymongo','sqlalchemy','pandas','typer','pandas-gbq','numpy','cloudant','pika','nzpy','boto3','boto','pyarrow','google-cloud-bigquery','google-cloud-bigquery-storage','flask-session','smart_open','botocore','psycopg2-binary','mysql-connector-python','numpy']
args["install_requires"] = ['pyncclient','pymongo','sqlalchemy','pandas','typer','pandas-gbq','numpy','cloudant','pika','nzpy','boto3','boto','pyarrow','google-cloud-bigquery','google-cloud-bigquery-storage','flask-session','smart_open','botocore','psycopg2-binary','mysql-connector-python','numpy','pymssql']
args["url"] = "https://healthcareio.the-phi.com/git/code/transport.git"
args['scripts'] = ['bin/transport']
if sys.version_info[0] == 2 :
args['use_2to3'] = True
args['use_2to3_exclude_fixers']=['lib2to3.fixes.fix_import']
# if sys.version_info[0] == 2 :
# args['use_2to3'] = True
# args['use_2to3_exclude_fixers']=['lib2to3.fixes.fix_import']
setup(**args)

@ -11,360 +11,88 @@ This library is designed to serve as a wrapper to a set of supported data stores
- s3
- sqlite
The supported operations are read/write and providing meta data to the calling code
Requirements :
pymongo
boto
cloudant
The configuration for the data-store is as follows :
e.g:
mongodb
provider:'mongodb',[port:27017],[host:localhost],db:<name>,doc:<_name>,context:<read|write>
We separated reads from writes to mitigate accidents associated with writes.
Source Code is available under MIT License:
https://healthcareio.the-phi.com/data-transport
https://hiplab.mc.vanderbilt.edu/git/hiplab/data-transport
"""
# import pandas as pd
# import numpy as np
import json
import importlib
import sys
import sqlalchemy
from datetime import datetime
if sys.version_info[0] > 2 :
# from transport.common import Reader, Writer,Console #, factory
from transport import disk
from transport import s3 as s3
from transport import rabbitmq as queue
from transport import couch as couch
from transport import mongo as mongo
from transport import sql as sql
from transport import etl as etl
# from transport.version import __version__
from info import __version__,__author__
from transport import providers
else:
from common import Reader, Writer,Console #, factory
import disk
import queue
import couch
import mongo
import s3
import sql
import etl
from info import __version__,__author__
import providers
import numpy as np
from psycopg2.extensions import register_adapter, AsIs
register_adapter(np.int64, AsIs)
# import psycopg2 as pg
# import mysql.connector as my
# from google.cloud import bigquery as bq
# import nzpy as nz #--- netezza drivers
from transport import sql, nosql, cloud, other
import pandas as pd
import json
import os
# class providers :
# POSTGRESQL = 'postgresql'
# MONGODB = 'mongodb'
# BIGQUERY ='bigquery'
# FILE = 'file'
# ETL = 'etl'
# SQLITE = 'sqlite'
# SQLITE3= 'sqlite'
# REDSHIFT = 'redshift'
# NETEZZA = 'netezza'
# MYSQL = 'mysql'
# RABBITMQ = 'rabbitmq'
# MARIADB = 'mariadb'
# COUCHDB = 'couch'
# CONSOLE = 'console'
# ETL = 'etl'
# #
# # synonyms of the above
# BQ = BIGQUERY
# MONGO = MONGODB
# FERRETDB= MONGODB
# PG = POSTGRESQL
# PSQL = POSTGRESQL
# PGSQL = POSTGRESQL
# import providers
# class IEncoder (json.JSONEncoder):
# def IEncoder (self,object):
# if type(object) == np.integer :
# return int(object)
# elif type(object) == np.floating:
# return float(object)
# elif type(object) == np.ndarray :
# return object.tolist()
# elif type(object) == datetime :
# return o.isoformat()
# else:
# return super(IEncoder,self).default(object)
from info import __version__,__author__
from transport.iowrapper import IWriter, IReader
from transport.plugins import PluginLoader
from transport import providers
PROVIDERS = {}
def init():
global PROVIDERS
for _module in [cloud,sql,nosql,other] :
for _provider_name in dir(_module) :
if _provider_name.startswith('__') :
continue
PROVIDERS[_provider_name] = {'module':getattr(_module,_provider_name),'type':_module.__name__}
def instance (**_args):
"""
type:
read: true|false (default true)
auth_file
"""
global PROVIDERS
if 'auth_file' in _args:
if os.path.exists(_args['auth_file']) :
f = open(_args['auth_file'])
_args = dict (_args,** json.loads(f.read()) )
f.close()
else:
filename = _args['auth_file']
raise Exception(f" {filename} was not found or is invalid")
if _args.get('provider') in PROVIDERS :
_info = PROVIDERS[_args['provider']]
_module = _info['module']
if 'context' in _args :
_context = _args['context']
else:
_context = 'read'
_pointer = getattr(_module,'Reader') if _context == 'read' else getattr(_module,'Writer')
_agent = _pointer (**_args)
#
loader = None
if 'plugins' in _args :
_params = _args['plugins']
if 'path' in _params and 'names' in _params :
loader = PluginLoader(**_params)
elif type(_params) == list:
loader = PluginLoader()
for _delegate in _params :
loader.set(_delegate)
return IReader(_agent,loader) if _context == 'read' else IWriter(_agent,loader)
else:
raise Exception ("Missing or Unknown provider")
pass
def supported ():
_info = {}
for _provider in PROVIDERS :
_item = PROVIDERS[_provider]
if _item['type'] not in _info :
_info[_item['type']] = []
_info[_item['type']].append(_provider)
_df = pd.DataFrame()
for _id in _info :
if not _df.shape[0] :
_df = pd.DataFrame(_info[_id],columns=[_id.replace('transport.','')])
else:
_df = pd.DataFrame(_info[_id],columns=[_id.replace('transport.','')]).join(_df, how='outer')
return _df.fillna('')
class factory :
# TYPE = {"sql":{"providers":["postgresql","mysql","neteeza","bigquery","mariadb","redshift"]}}
# PROVIDERS = {
# "etl":{"class":{"read":etl.instance,"write":etl.instance}},
# # "console":{"class":{"write":Console,"read":Console}},
# "file":{"class":{"read":disk.DiskReader,"write":disk.DiskWriter}},
# "sqlite":{"class":{"read":disk.SQLiteReader,"write":disk.SQLiteWriter}},
# "postgresql":{"port":5432,"host":"localhost","database":None,"driver":pg,"default":{"type":"VARCHAR"},"class":{"read":sql.SQLReader,"write":sql.SQLWriter}},
# "redshift":{"port":5432,"host":"localhost","database":None,"driver":pg,"default":{"type":"VARCHAR"},"class":{"read":sql.SQLReader,"write":sql.SQLWriter}},
# "bigquery":{"class":{"read":sql.BQReader,"write":sql.BQWriter}},
# "mysql":{"port":3306,"host":"localhost","default":{"type":"VARCHAR(256)"},"driver":my,"class":{"read":sql.SQLReader,"write":sql.SQLWriter}},
# "mariadb":{"port":3306,"host":"localhost","default":{"type":"VARCHAR(256)"},"driver":my,"class":{"read":sql.SQLReader,"write":sql.SQLWriter}},
# "mongo":{"port":27017,"host":"localhost","class":{"read":mongo.MongoReader,"write":mongo.MongoWriter}},
# "couch":{"port":5984,"host":"localhost","class":{"read":couch.CouchReader,"write":couch.CouchWriter}},
# "netezza":{"port":5480,"driver":nz,"default":{"type":"VARCHAR(256)"},"class":{"read":sql.SQLReader,"write":sql.SQLWriter}},
# "rabbitmq":{"port":5672,"host":"localhost","class":{"read":queue.QueueReader,"write":queue.QueueWriter,"listen":queue.QueueListener,"listener":queue.QueueListener},"default":{"type":"application/json"}}}
# #
# # creating synonyms
# PROVIDERS['mongodb'] = PROVIDERS['mongo']
# PROVIDERS['couchdb'] = PROVIDERS['couch']
# PROVIDERS['bq'] = PROVIDERS['bigquery']
# PROVIDERS['sqlite3'] = PROVIDERS['sqlite']
# PROVIDERS['rabbit'] = PROVIDERS['rabbitmq']
# PROVIDERS['rabbitmq-server'] = PROVIDERS['rabbitmq']
@staticmethod
def instance(**_args):
if 'type' in _args :
#
# Legacy code being returned
return factory._instance(**_args);
else:
return instance(**_args)
@staticmethod
def _instance(**args):
"""
This class will create an instance of a transport when providing
:type name of the type we are trying to create
:args The arguments needed to create the instance
"""
source = args['type']
params = args['args']
anObject = None
if source in ['HttpRequestReader','HttpSessionWriter']:
#
# @TODO: Make sure objects are serializable, be smart about them !!
#
aClassName = ''.join([source,'(**params)'])
else:
stream = json.dumps(params)
aClassName = ''.join([source,'(**',stream,')'])
try:
anObject = eval( aClassName)
#setattr(anObject,'name',source)
except Exception as e:
print(['Error ',e])
return anObject
import time
def instance(**_pargs):
"""
creating an instance given the provider, we should have an idea of :class, :driver
:provider
:read|write = {connection to the database}
"""
#
# @TODO: provide authentication file that will hold all the parameters, that will later on be used
#
_args = dict(_pargs,**{})
if 'auth_file' in _args :
path = _args['auth_file']
file = open(path)
_config = json.loads( file.read())
_args = dict(_args,**_config)
file.close()
_provider = _args['provider']
_context = list( set(['read','write','listen']) & set(_args.keys()) )
if _context :
_context = _context[0]
else:
_context = _args['context'] if 'context' in _args else 'read'
# _group = None
# for _id in providers.CATEGORIES :
# if _provider in providers.CATEGORIES[_id] :
# _group = _id
# break
# if _group :
if _provider in providers.PROVIDERS and _context in providers.PROVIDERS[_provider]:
# _classPointer = _getClassInstance(_group,**_args)
_classPointer = providers.PROVIDERS[_provider][_context]
#
# Let us reformat the arguments
# if 'read' in _args or 'write' in _args :
# _args = _args['read'] if 'read' in _args else _args['write']
# _args['provider'] = _provider
# if _group == 'sql' :
if _provider in providers.CATEGORIES['sql'] :
_info = _get_alchemyEngine(**_args)
_args = dict(_args,**_info)
_args['driver'] = providers.DRIVERS[_provider]
else:
if _provider in providers.DEFAULT :
_default = providers.DEFAULT[_provider]
_defkeys = list(set(_default.keys()) - set(_args.keys()))
if _defkeys :
for key in _defkeys :
_args[key] = _default[key]
pass
#
# get default values from
return _classPointer(**_args)
#
# Let us determine the category of the provider that has been given
def _get_alchemyEngine(**_args):
"""
This function returns the SQLAlchemy engine associated with the parameters. It is only applicable to SQL providers
:_args arguments passed to the factory {provider and other}
"""
_provider = _args['provider']
_pargs = {}
if _provider == providers.SQLITE3 :
_path = _args['database'] if 'database' in _args else _args['path']
uri = ''.join([_provider,':///',_path])
else:
#@TODO: Enable authentication files (private_key)
_username = _args['username'] if 'username' in _args else ''
_password = _args['password'] if 'password' in _args else ''
_account = _args['account'] if 'account' in _args else ''
_database = _args['database'] if 'database' in _args else _args['path']
if _username != '':
_account = _username + ':'+_password+'@'
_host = _args['host'] if 'host' in _args else ''
_port = _args['port'] if 'port' in _args else ''
if _provider in providers.DEFAULT :
_default = providers.DEFAULT[_provider]
_host = _host if _host != '' else (_default['host'] if 'host' in _default else '')
_port = _port if _port != '' else (_default['port'] if 'port' in _default else '')
if _port == '':
_port = providers.DEFAULT['port'] if 'port' in providers.DEFAULT else ''
#
if _host != '' and _port != '' :
_fhost = _host+":"+str(_port) #--formatted hostname
else:
_fhost = _host
# Let us update the parameters we have thus far
#
uri = ''.join([_provider,"://",_account,_fhost,'/',_database])
_pargs = {'host':_host,'port':_port,'username':_username,'password':_password}
_engine = sqlalchemy.create_engine (uri,future=True)
_out = {'sqlalchemy':_engine}
for key in _pargs :
if _pargs[key] != '' :
_out[key] = _pargs[key]
return _out
@DeprecationWarning
def _getClassInstance(_group,**_args):
"""
This function returns the class instance we are attempting to instantiate
:_group items in providers.CATEGORIES.keys()
:_args arguments passed to the factory class
"""
# if 'read' in _args or 'write' in _args :
# _context = 'read' if 'read' in _args else _args['write']
# _info = _args[_context]
# else:
# _context = _args['context'] if 'context' in _args else 'read'
# _class = providers.READ[_group] if _context == 'read' else providers.WRITE[_group]
# if type(_class) == dict and _args['provider'] in _class:
# _class = _class[_args['provider']]
# return _class
@DeprecationWarning
def __instance(**_args):
"""
@param provider {file,sqlite,postgresql,redshift,bigquery,netezza,mongo,couch ...}
@param context read|write|rw
@param _args arguments that go with the datastore (username,password,host,port ...)
"""
provider = _args['provider']
context = _args['context']if 'context' in _args else None
_id = context if context in list(factory.PROVIDERS[provider]['class'].keys()) else 'read'
if _id :
args = {'provider':_id}
for key in factory.PROVIDERS[provider] :
if key == 'class' :
continue
value = factory.PROVIDERS[provider][key]
args[key] = value
#
#
args = dict(args,**_args)
# print (provider in factory.PROVIDERS)
if 'class' in factory.PROVIDERS[provider]:
pointer = factory.PROVIDERS[provider]['class'][_id]
else:
pointer = sql.SQLReader if _id == 'read' else sql.SQLWriter
#
# Let us try to establish an sqlalchemy wrapper
try:
account = ''
host = ''
if provider not in [providers.BIGQUERY,providers.MONGODB, providers.COUCHDB, providers.SQLITE, providers.CONSOLE,providers.ETL, providers.FILE, providers.RABBITMQ] :
# if provider not in ['bigquery','mongodb','mongo','couchdb','sqlite','console','etl','file','rabbitmq'] :
#
# In these cases we are assuming RDBMS and thus would exclude NoSQL and BigQuery
username = args['username'] if 'username' in args else ''
password = args['password'] if 'password' in args else ''
if username == '' :
account = ''
else:
account = username + ':'+password+'@'
host = args['host']
if 'port' in args :
host = host+":"+str(args['port'])
database = args['database']
elif provider in [providers.SQLITE,providers.FILE]:
account = ''
host = ''
database = args['path'] if 'path' in args else args['database']
if provider not in [providers.MONGODB, providers.COUCHDB, providers.BIGQUERY, providers.CONSOLE, providers.ETL,providers.FILE,providers.RABBITMQ] :
# if provider not in ['mongodb','mongo','couchdb','bigquery','console','etl','file','rabbitmq'] :
uri = ''.join([provider,"://",account,host,'/',database])
e = sqlalchemy.create_engine (uri,future=True)
args['sqlalchemy'] = e
#
# @TODO: Include handling of bigquery with SQLAlchemy
except Exception as e:
print (_args)
print (e)
return pointer(**args)
return None
pass
factory.instance = instance
init()
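
The new `init()` builds the provider registry by introspection: every public attribute of the `cloud`, `sql`, `nosql` and `other` packages becomes a provider name. A minimal sketch of that pattern, using hypothetical stand-in modules built with `types.ModuleType` instead of the real packages; `build_registry` is an illustrative name, not part of the library:

```python
import types

def build_registry(modules):
    """Map every non-dunder attribute of each module to its object and parent module name."""
    registry = {}
    for _module in modules:
        for _name in dir(_module):
            if _name.startswith('__'):
                continue
            registry[_name] = {'module': getattr(_module, _name), 'type': _module.__name__}
    return registry

# Hypothetical stand-ins for transport.sql and transport.cloud
sql = types.ModuleType('transport.sql')
sql.sqlite = types.ModuleType('transport.sql.sqlite')
sql.postgresql = types.ModuleType('transport.sql.postgresql')
cloud = types.ModuleType('transport.cloud')
cloud.s3 = types.ModuleType('transport.cloud.s3')

registry = build_registry([cloud, sql])

# Group provider names by package, similar in spirit to supported()
grouped = {}
for _name, _item in registry.items():
    grouped.setdefault(_item['type'], []).append(_name)

print(sorted(registry))
print(grouped)
```

One consequence of this design is that any other public name living in a package's namespace (a helper, an imported library) would also be registered as a provider, so the package `__init__` files are best kept to provider imports only.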

@ -0,0 +1,6 @@
"""
Steve L. Nyemba, nyemba@gmail.com
This namespace implements support for cloud data stores: databricks, bigquery, nextcloud, s3
"""
from . import bigquery, databricks, nextcloud, s3

@ -0,0 +1,159 @@
"""
Implementing support for google's bigquery
- cloud.bigquery.Reader
- cloud.bigquery.Writer
"""
import json
from google.oauth2 import service_account
from google.cloud import bigquery as bq
from multiprocessing import Lock, RLock
import pandas as pd
import pandas_gbq as pd_gbq
import numpy as np
import time
MAX_CHUNK = 2000000
class BigQuery:
def __init__(self,**_args):
path = _args['service_key'] if 'service_key' in _args else _args['private_key']
self.credentials = service_account.Credentials.from_service_account_file(path)
self.dataset = _args['dataset'] if 'dataset' in _args else None
self.path = path
self.dtypes = _args['dtypes'] if 'dtypes' in _args else None
self.table = _args['table'] if 'table' in _args else None
self.client = bq.Client.from_service_account_json(self.path)
def meta(self,**_args):
"""
This function returns meta data for a given table or query with dataset/table properly formatted
:param table name of the table WITHOUT the dataset
:param sql sql query to be pulled
"""
table = _args['table'] if 'table' in _args else self.table
try:
if table :
_dataset = self.dataset if 'dataset' not in _args else _args['dataset']
sql = f"""SELECT column_name as name, data_type as type FROM {_dataset}.INFORMATION_SCHEMA.COLUMNS WHERE table_name = '{table}' """
_info = {'credentials':self.credentials,'dialect':'standard'}
return pd_gbq.read_gbq(sql,**_info).to_dict(orient='records')
# return self.read(sql=sql).to_dict(orient='records')
# ref = self.client.dataset(self.dataset).table(table)
# _schema = self.client.get_table(ref).schema
# return [{"name":_item.name,"type":_item.field_type,"description":( "" if not hasattr(_item,"description") else _item.description )} for _item in _schema]
else :
return []
except Exception as e:
return []
def has(self,**_args):
found = False
try:
_has = self.meta(**_args)
found = _has is not None and len(_has) > 0
except Exception as e:
pass
return found
class Reader (BigQuery):
"""
Implements support for reading from bigquery. This class acts as a wrapper around google's API
"""
def __init__(self,**_args):
super().__init__(**_args)
def apply(self,sql):
return self.read(sql=sql)
def read(self,**_args):
SQL = None
table = self.table if 'table' not in _args else _args['table']
if 'sql' in _args :
SQL = _args['sql']
elif table:
table = "".join(["`",table,"`"]) if '.' in table else "".join(["`:dataset.",table,"`"])
SQL = "SELECT * FROM :table ".replace(":table",table)
if not SQL :
return None
if SQL and 'limit' in _args:
SQL += " LIMIT "+str(_args['limit'])
if (':dataset' in SQL or ':DATASET' in SQL) and self.dataset:
SQL = SQL.replace(':dataset',self.dataset).replace(':DATASET',self.dataset)
_info = {'credentials':self.credentials,'dialect':'standard'}
return pd_gbq.read_gbq(SQL,**_info) if SQL else None
# return self.client.query(SQL).to_dataframe() if SQL else None
class Writer (BigQuery):
"""
This class implements support for writing against bigquery
"""
lock = RLock()
def __init__(self,**_args):
super().__init__(**_args)
self.parallel = False if 'lock' not in _args else _args['lock']
self.table = _args['table'] if 'table' in _args else None
self.mode = {'if_exists':'append','chunksize':900000,'destination_table':self.table,'credentials':self.credentials}
self._chunks = 1 if 'chunks' not in _args else int(_args['chunks'])
self._location = 'US' if 'location' not in _args else _args['location']
def write(self,_data,**_args) :
"""
This function will perform a write to bigquery
:_data data-frame to be written to bigquery
"""
try:
if self.parallel or 'lock' in _args :
Writer.lock.acquire()
_args['table'] = self.table if 'table' not in _args else _args['table']
self._write(_data,**_args)
finally:
if self.parallel or 'lock' in _args :
Writer.lock.release()
def submit(self,_sql):
"""
Write the output of a massive query to a given table, bigquery will handle this as a job
This function will return the job identifier
"""
_config = bq.QueryJobConfig()
_config.destination = self.client.dataset(self.dataset).table(self.table)
_config.allow_large_results = True
# _config.write_disposition = bq.bq_consts.WRITE_APPEND
_config.dry_run = False
# _config.priority = 'BATCH'
_resp = self.client.query(_sql,location=self._location,job_config=_config)
return _resp.job_id
def status (self,_id):
return self.client.get_job(_id,location=self._location)
def _write(self,_info,**_args) :
_df = None
if type(_info) in [list,pd.DataFrame] :
if type(_info) == list :
_df = pd.DataFrame(_info)
elif type(_info) == pd.DataFrame :
_df = _info
if '.' not in _args['table'] :
self.mode['destination_table'] = '.'.join([self.dataset,_args['table']])
else:
self.mode['destination_table'] = _args['table'].strip()
if 'schema' in _args :
self.mode['table_schema'] = _args['schema']
#
# Let us ensure that the types are somewhat compatible ...
# _map = {'INTEGER':np.int64,'DATETIME':'datetime64[ns]','TIMESTAMP':'datetime64[ns]','FLOAT':np.float64,'DOUBLE':np.float64,'STRING':str}
# _mode = copy.deepcopy(self.mode)
# _mode = self.mode
# _df.to_gbq(**self.mode) #if_exists='append',destination_table=partial,credentials=credentials,chunksize=90000)
#
# Let us adjust the chunking here
if 'if_exists' in _args :
self.mode['if_exists'] = _args['if_exists']
self._chunks = 10 if _df.shape[0] > MAX_CHUNK and self._chunks == 1 else self._chunks
_indexes = np.array_split(np.arange(_df.shape[0]),self._chunks)
for i in _indexes :
# _df.iloc[i].to_gbq(**self.mode)
pd_gbq.to_gbq(_df.iloc[i],**self.mode)
time.sleep(1)
pass
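
The chunking in `Writer._write` can be read in isolation: the frame is written in one piece unless it exceeds `MAX_CHUNK` rows, in which case it is split into 10 near-equal slices (or into the caller's `chunks` value if one was given). A minimal sketch of that rule in plain Python, where `chunk_count` and `split_indexes` are hypothetical names and `split_indexes` mirrors how `np.array_split` distributes the remainder over the leading chunks:

```python
MAX_CHUNK = 2000000  # same threshold as in the module above

def chunk_count(n_rows, requested=1):
    """Default to 10 chunks for very large frames unless the caller asked for a specific count."""
    return 10 if n_rows > MAX_CHUNK and requested == 1 else requested

def split_indexes(n_rows, n_chunks):
    """Split range(n_rows) into n_chunks consecutive ranges; the first n_rows % n_chunks get one extra row."""
    size, extra = divmod(n_rows, n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        stop = start + size + (1 if i < extra else 0)
        chunks.append(range(start, stop))
        start = stop
    return chunks

print(chunk_count(500))                        # small frame: a single chunk
print(chunk_count(MAX_CHUNK + 1))              # large frame, no explicit request
print([len(c) for c in split_indexes(10, 3)])  # → [4, 3, 3]
```

In the writer each slice would then be passed to `pd_gbq.to_gbq` with a short pause in between, as the loop over `_indexes` does.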

@ -14,7 +14,7 @@ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLI
"""
import os
import sqlalchemy
from transport.common import Reader,Writer
# from transport.common import Reader,Writer
import pandas as pd
@ -39,7 +39,7 @@ class Bricks:
# Sometimes when the cluster isn't up and running it takes a while, the user should be alerted of this
#
_uri = f'''databricks://token:{_token}@{_host}?http_path={_cluster_path}&catalog={_catalog}&schema={self._schema}'''
_uri = f'''databricks+connector://token:{_token}@{_host}?http_path={_cluster_path}&catalog={_catalog}&schema={self._schema}'''
self._engine = sqlalchemy.create_engine (_uri)
pass
def meta(self,**_args):
@ -67,7 +67,7 @@ class Bricks:
except Exception as e:
pass
class BricksReader(Bricks,Reader):
class Reader(Bricks):
"""
This class is designed for reads and will execute reads against a table name or a select SQL statement
"""
@ -89,7 +89,7 @@ class BricksReader(Bricks,Reader):
else:
return pd.DataFrame()
pass
class BricksWriter(Bricks,Writer):
class Writer(Bricks):
def __init__(self,**_args):
super().__init__(**_args)
def write(self,_data,**_args):

@ -3,7 +3,7 @@ We are implementing transport to and from nextcloud (just like s3)
"""
import os
import sys
from transport.common import Reader,Writer, IEncoder
from transport.common import IEncoder
import pandas as pd
from io import StringIO
import json
@ -28,7 +28,7 @@ class Nextcloud :
pass
class NextcloudReader(Nextcloud,Reader):
class Reader(Nextcloud):
def __init__(self,**_args):
# self._file = [] if 'file' not in _args else _args['file']
super().__init__(**_args)
@ -54,7 +54,7 @@ class NextcloudReader(Nextcloud,Reader):
# if it is neither a structured document like csv, we will return the content as is
return _content
return None
class NextcloudWriter (Nextcloud,Writer):
class Writer (Nextcloud):
"""
This class will write data to an instance of nextcloud
"""

@ -11,10 +11,7 @@ import numpy as np
import botocore
from smart_open import smart_open
import sys
if sys.version_info[0] > 2 :
from transport.common import Reader, Writer
else:
from common import Reader, Writer
import json
from io import StringIO
import json
@ -73,7 +70,7 @@ class s3 :
# return self.s3.get_all_buckets()
class s3Reader(s3,Reader) :
class Reader(s3) :
"""
Because s3 contains buckets and files, reading becomes a tricky proposition :
- list files if file is None
@ -113,7 +110,7 @@ class s3Reader(s3,Reader) :
limit = args['size'] if 'size' in args else -1
return self.stream(limit)
class s3Writer(s3,Writer) :
class Writer(s3) :
def __init__(self,**args) :
s3.__init__(self,**args)

@ -1,44 +1,7 @@
"""
Data Transport - 1.0
Steve L. Nyemba, The Phi Technology LLC
This module is designed to serve as a wrapper to a set of supported data stores :
- couchdb
- mongodb
- Files (character delimited)
- Queues (RabbitMQ)
- Session (Flask)
- s3
The supported operations are read/write and providing meta data to the calling code
Requirements :
pymongo
boto
cloudant
@TODO:
Enable read/writing to multiple reads/writes
"""
__author__ = 'The Phi Technology'
import numpy as np
import json
import importlib
from multiprocessing import RLock
import queue
# import couch
# import mongo
import numpy as np
from datetime import datetime
class IO:
def init(self,**args):
"""
This function enables attributes to be changed at runtime. Only the attributes defined in the class can be changed
Adding attributes will require sub-classing otherwise we may have an unpredictable class ...
"""
allowed = list(vars(self).keys())
for field in args :
if field not in allowed :
continue
value = args[field]
setattr(self,field,value)
class IEncoder (json.JSONEncoder):
def default (self,object):
if type(object) == np.integer :
@ -52,100 +15,4 @@ class IEncoder (json.JSONEncoder):
else:
return super(IEncoder,self).default(object)
class Reader (IO):
"""
This class is an abstraction of the read functionality of a data store
"""
def __init__(self):
pass
def meta(self,**_args):
"""
This function is intended to return meta-data associated with what has just been read
@return object of meta data information associated with the content of the store
"""
raise Exception ("meta function needs to be implemented")
def read(self,**args):
"""
This function is intended to read the content of a store provided parameters to be used at the discretion of the subclass
"""
raise Exception ("read function needs to be implemented")
class Writer(IO):
def __init__(self):
self.cache = {"default":[]}
def log(self,**args):
self.cache[id] = args
def meta (self,id="default",**args):
raise Exception ("meta function needs to be implemented")
def format(self,row,xchar):
if xchar is not None and isinstance(row,list):
return xchar.join(row)+'\n'
elif xchar is None and isinstance(row,dict):
row = json.dumps(row)
return row
def write(self,**args):
"""
This function will write content to a store given parameters to be used at the discretion of the sub-class
"""
raise Exception ("write function needs to be implemented")
def archive(self):
"""
It is important to be able to archive data so as to ensure that growth is controlled
Nothing in nature grows indefinitely neither should data being handled.
"""
raise Exception ("archive function needs to be implemented")
def close(self):
"""
This function will close the persistent storage connection/handler
"""
pass
class ReadWriter(Reader,Writer) :
"""
This class implements the read/write functions aggregated
"""
pass
# class Console(Writer):
# lock = RLock()
# def __init__(self,**_args):
# self.lock = _args['lock'] if 'lock' in _args else False
# self.info = self.write
# self.debug = self.write
# self.log = self.write
# pass
# def write (self,logs=None,**_args):
# if self.lock :
# Console.lock.acquire()
# try:
# _params = _args if logs is None and _args else logs
# if type(_params) == list:
# for row in _params :
# print (row)
# else:
# print (_params)
# except Exception as e :
# print (e)
# finally:
# if self.lock :
# Console.lock.release()
"""
@NOTE : Experimental !!
"""
class Proxy :
"""
This class will forward a call to a function that is provided by the user code
"""
def __init__(self,**_args):
self.callback = _args['callback']
def read(self,**_args) :
try:
return self.callback(**_args)
except Exception as e:
return self.callback()
pass
def write(self,data,**_args):
self.callback(data,**_args)

@ -1,269 +0,0 @@
import os
import sys
if sys.version_info[0] > 2 :
from transport.common import Reader, Writer #, factory
else:
from common import Reader,Writer
# import nujson as json
import json
# from threading import Lock
import sqlite3
import pandas as pd
from multiprocessing import Lock
from transport.common import Reader, Writer, IEncoder
import sqlalchemy
from sqlalchemy import create_engine
class DiskReader(Reader) :
"""
This class is designed to read data from disk (location on hard drive)
@pre : isready() == True
"""
def __init__(self,**params):
"""
@param path absolute path of the file to be read
"""
Reader.__init__(self)
self.path = params['path'] if 'path' in params else None
self.delimiter = params['delimiter'] if 'delimiter' in params else ','
def isready(self):
return os.path.exists(self.path)
def meta(self,**_args):
return []
def read(self,**args):
_path = self.path if 'path' not in args else args['path']
_delimiter = self.delimiter if 'delimiter' not in args else args['delimiter']
return pd.read_csv(_path,delimiter=self.delimiter)
def stream(self,**args):
"""
This function reads the rows from a designated location on disk
@param size number of rows to be read, -1 suggests all rows
"""
size = -1 if 'size' not in args else int(args['size'])
f = open(self.path,'rU')
i = 1
for row in f:
i += 1
if size == i:
break
if self.delimiter :
yield row.split(self.delimiter)
yield row
f.close()
class DiskWriter(Writer):
"""
This function writes output to disk in a designated location. The function will write a text to a text file
- If a delimiter is provided it will use that to generate a xchar-delimited file
- If not then the object will be dumped as is
"""
THREAD_LOCK = Lock()
def __init__(self,**params):
super().__init__()
self._path = params['path']
self._delimiter = params['delimiter'] if 'delimiter' in params else None
self._mode = 'w' if 'mode' not in params else params['mode']
# def meta(self):
# return self.cache['meta']
# def isready(self):
# """
# This function determines if the class is ready for execution or not
# i.e it determines if the preconditions of met prior execution
# """
# return True
# # p = self.path is not None and os.path.exists(self.path)
# # q = self.name is not None
# # return p and q
# def format (self,row):
# self.cache['meta']['cols'] += len(row) if isinstance(row,list) else len(row.keys())
# self.cache['meta']['rows'] += 1
# return (self.delimiter.join(row) if self.delimiter else json.dumps(row))+"\n"
def write(self,info,**_args):
"""
This function writes a record to a designated file
@param label <passed|broken|fixed|stats>
@param row row to be written
"""
try:
DiskWriter.THREAD_LOCK.acquire()
_delim = self._delimiter if 'delimiter' not in _args else _args['delimiter']
_path = self._path if 'path' not in _args else _args['path']
_mode = self._mode if 'mode' not in _args else _args['mode']
info.to_csv(_path,index=False,sep=_delim)
pass
except Exception as e:
#
# Not sure what should be done here ...
pass
finally:
DiskWriter.THREAD_LOCK.release()
class SQLite :
def __init__(self,**_args) :
self.path = _args['database'] if 'database' in _args else _args['path']
self.conn = sqlite3.connect(self.path,isolation_level="IMMEDIATE")
self.conn.row_factory = sqlite3.Row
self.fields = _args['fields'] if 'fields' in _args else []
def has (self,**_args):
found = False
try:
if 'table' in _args :
table = _args['table']
sql = "SELECT * FROM :table limit 1".replace(":table",table)
_df = pd.read_sql(sql,self.conn)
found = _df.columns.size > 0
except Exception as e:
pass
return found
def close(self):
try:
self.conn.close()
except Exception as e :
print(e)
def apply(self,sql):
try:
if not sql.lower().startswith('select'):
cursor = self.conn.cursor()
cursor.execute(sql)
cursor.close()
self.conn.commit()
else:
return pd.read_sql(sql,self.conn)
except Exception as e:
print (e)
class SQLiteReader (SQLite,DiskReader):
def __init__(self,**args):
super().__init__(**args)
# DiskReader.__init__(self,**args)
# self.path = args['database'] if 'database' in args else args['path']
# self.conn = sqlite3.connect(self.path,isolation_level=None)
# self.conn.row_factory = sqlite3.Row
self.table = args['table'] if 'table' in args else None
def read(self,**args):
if 'sql' in args :
sql = args['sql']
elif 'filter' in args :
sql = "SELECT :fields FROM ",self.table, "WHERE (:filter)".replace(":filter",args['filter'])
sql = sql.replace(":fields",args['fields']) if 'fields' in args else sql.replace(":fields","*")
else:
sql = ' '.join(['SELECT * FROM ',self.table])
if 'limit' in args :
sql = sql + " LIMIT "+args['limit']
return pd.read_sql(sql,self.conn)
def close(self):
try:
self.conn.close()
except Exception as e :
pass
class SQLiteWriter(SQLite,DiskWriter) :
connection = None
LOCK = Lock()
def __init__(self,**args):
"""
:path
:fields json|csv
"""
# DiskWriter.__init__(self,**args)
super().__init__(**args)
self.table = args['table'] if 'table' in args else None
path = self.path
self._engine = create_engine(f'sqlite:///{path}')
# self.conn = sqlite3.connect(self.path,isolation_level="IMMEDIATE")
# self.conn.row_factory = sqlite3.Row
# self.fields = args['fields'] if 'fields' in args else []
if self.fields and not self.isready() and self.table:
self.init(self.fields)
SQLiteWriter.connection = self.conn
def init(self,fields):
self.fields = fields;
sql = " ".join(["CREATE TABLE IF NOT EXISTS ",self.table," (", ",".join(self.fields),")"])
cursor = self.conn.cursor()
cursor.execute(sql)
cursor.close()
self.conn.commit()
def isready(self):
try:
sql = "SELECT count(*) FROM sqlite_master where name=':table'"
sql = sql.replace(":table",self.table)
cursor = self.conn.cursor()
r = cursor.execute(sql)
r = r.fetchall()
cursor.close()
return r[0][0] != 0
except Exception as e:
pass
return 0
#
# If the table doesn't exist we should create it
#
# def write(self,_data,**_args):
# SQLiteWriter.LOCK.acquire()
# try:
# if type(_data) == dict :
# _data = [_data]
# _table = self.table if 'table' not in _args else _args['table']
# _df = pd.DataFrame(_data)
# _df.to_sql(_table,self._engine.connect(),if_exists='append',index=False)
# except Exception as e:
# print (e)
# SQLiteWriter.LOCK.release()
def write(self,info,**_args):
"""
"""
#if not self.fields :
# #if type(info) == pd.DataFrame :
# # _columns = list(info.columns)
# #self.init(list(info.keys()))
if type(info) == dict :
info = [info]
elif type(info) == pd.DataFrame :
info = info.fillna('')
info = info.to_dict(orient='records')
if not self.fields :
_rec = info[0]
self.init(list(_rec.keys()))
SQLiteWriter.LOCK.acquire()
try:
cursor = self.conn.cursor()
sql = " " .join(["INSERT INTO ",self.table,"(", ",".join(self.fields) ,")", "values(:values)"])
for row in info :
values = [ str(row[field]) if type(row[field]) not in [list,dict] else json.dumps(row[field],cls=IEncoder) for field in self.fields]
values = ["".join(["'",value,"'"]) for value in values]
# stream =["".join(["",value,""]) if type(value) == str else value for value in row.values()]
# stream = json.dumps(stream,cls=IEncoder)
# stream = stream.replace("[","").replace("]","")
# print (sql.replace(":values",stream))
# self.conn.execute(sql.replace(":values",stream) )
self.conn.execute(sql.replace(":values", ",".join(values)) )
# cursor.commit()
self.conn.commit()
# print (sql)
except Exception as e :
print ()
print (e)
pass
SQLiteWriter.LOCK.release()
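Review note: the `write` method above quotes every value by hand before splicing it into the INSERT statement, which breaks on values containing a single quote. A minimal sketch of the same write using bound parameters follows; `write_rows` and the `logs` table are hypothetical names used for illustration only, and table/column names are assumed to come from trusted configuration since they cannot be bound.

```python
import json
import sqlite3

def write_rows(conn, table, fields, rows):
    """Insert dict rows with bound parameters instead of hand-quoted strings."""
    # Identifiers cannot be bound, so table and column names are interpolated.
    columns = ",".join(fields)
    marks = ",".join(["?"] * len(fields))
    sql = f"INSERT INTO {table} ({columns}) VALUES ({marks})"
    values = [
        tuple(
            # Nested structures are serialized to JSON text, as in the writer above
            json.dumps(row[f]) if isinstance(row[f], (list, dict)) else row[f]
            for f in fields
        )
        for row in rows
    ]
    conn.executemany(sql, values)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS logs (name, tags)")
write_rows(conn, "logs", ["name", "tags"],
           [{"name": "O'Brien", "tags": ["a", "b"]}, {"name": "x", "tags": []}])
stored = conn.execute("SELECT name, tags FROM logs ORDER BY rowid").fetchall()
```

The value `O'Brien` is stored unchanged here, whereas the string-built statement would be malformed.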

@ -83,7 +83,12 @@ class Transporter(Process):
_reader = transport.factory.instance(**self._source)
#
# If arguments are provided then a query is to be executed (not just a table dump)
return _reader.read() if 'args' not in self._source else _reader.read(**self._source['args'])
if 'cmd' in self._source or 'query' in self._source :
_query = self._source['cmd'] if 'cmd' in self._source else self._source['query']
return _reader.read(**_query)
else:
return _reader.read()
# return _reader.read() if 'query' not in self._source else _reader.read(**self._source['query'])
def _delegate_write(self,_data,**_args):
"""

@ -0,0 +1,47 @@
"""
This class is a wrapper around read/write classes of cloud,sql,nosql,other packages
The wrapper allows for application of plugins as pre-post conditions
"""
class IO:
"""
Base wrapper class for read/write
"""
def __init__(self,_agent,plugins):
self._agent = _agent
self._plugins = plugins
def meta (self,**_args):
if hasattr(self._agent,'meta') :
return self._agent.meta(**_args)
return []
def close(self):
if hasattr(self._agent,'close') :
self._agent.close()
def apply(self):
"""
applying pre/post conditions given a pipeline expression
"""
for _pointer in self._plugins :
_data = _pointer(_data)
def apply(self,_query):
if hasattr(self._agent,'apply') :
return self._agent.apply(_query)
return None
class IReader(IO):
def __init__(self,_agent,pipeline=None):
super().__init__(_agent,pipeline)
def read(self,**_args):
_data = self._agent.read(**_args)
if self._plugins and self._plugins.ratio() > 0 :
_data = self._plugins.apply(_data)
#
# output data
return _data
class IWriter(IO):
def __init__(self,_agent,pipeline=None):
super().__init__(_agent,pipeline)
def write(self,_data,**_args):
if self._plugins and self._plugins.ratio() > 0 :
_data = self._plugins.apply(_data)
self._agent.write(_data,**_args)
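Review note: the wrapper classes above apply plugins after a read and before a write. The sketch below illustrates that ordering under simplifying assumptions; `FakeAgent` and `Pipeline` are hypothetical stand-ins for a concrete reader/writer and for the plugin loader, and only the `ratio()`/`apply()` pair is taken from the diff.

```python
class FakeAgent:
    """Hypothetical stand-in for a concrete reader/writer."""
    def __init__(self):
        self.written = []
    def read(self, **_args):
        return [1, 2, 3]
    def write(self, _data, **_args):
        self.written.append(_data)

class Pipeline:
    """Hypothetical plugin container exposing ratio() and apply()."""
    def __init__(self, *functions):
        self._functions = list(functions)
    def ratio(self):
        return 1.0 if self._functions else 0.0
    def apply(self, _data):
        for _pointer in self._functions:
            _data = _pointer(_data)
        return _data

class IReader:
    def __init__(self, _agent, pipeline=None):
        self._agent = _agent
        self._plugins = pipeline
    def read(self, **_args):
        _data = self._agent.read(**_args)
        # Plugins run after the read (post-processing)
        if self._plugins and self._plugins.ratio() > 0:
            _data = self._plugins.apply(_data)
        return _data

class IWriter:
    def __init__(self, _agent, pipeline=None):
        self._agent = _agent
        self._plugins = pipeline
    def write(self, _data, **_args):
        # Plugins run before the write (pre-processing)
        if self._plugins and self._plugins.ratio() > 0:
            _data = self._plugins.apply(_data)
        self._agent.write(_data, **_args)

double = lambda rows: [x * 2 for x in rows]
agent = FakeAgent()
read_back = IReader(agent, Pipeline(double)).read()
IWriter(agent, Pipeline(double)).write([1, 2])
plain = IReader(agent).read()
```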

@ -0,0 +1,12 @@
"""
Steve L. Nyemba, nyemba@gmail.com
This namespace implements support for cloud databases couchdb,mongodb, cloudant ...
"""
# from transport.nosql import couchdb
# from transport.nosql import mongodb
from . import mongodb
from . import couchdb
# import mongodb
# import couchdb
cloudant = couchdb

@ -8,10 +8,10 @@ This file is a wrapper around couchdb using IBM Cloudant SDK that has an interfa
import cloudant
import json
import sys
if sys.version_info[0] > 2 :
from transport.common import Reader, Writer
else:
from common import Reader, Writer
# from transport.common import Reader, Writer
from datetime import datetime
class Couch:
"""
This class is a wrapper for read/write against couchdb. The class captures common operations for read/write.
@ -77,7 +77,7 @@ class Couch:
class CouchReader(Couch,Reader):
class Reader(Couch):
"""
This function will read an attachment from couchdb and return it to calling code. The attachment must have been placed before hand (otherwise oops)
@T: Account for security & access control
@ -94,28 +94,7 @@ class CouchReader(Couch,Reader):
else:
self.filename = None
# def isready(self):
# #
# # Is the basic information about the database valid
# #
# p = Couchdb.isready(self)
# if p == False:
# return False
# #
# # The database name is set and correct at this point
# # We insure the document of the given user has the requested attachment.
# #
# doc = self.dbase.get(self._id)
# if '_attachments' in doc:
# r = self.filename in doc['_attachments'].keys()
# else:
# r = False
# return r
def stream(self):
#
# @TODO Need to get this working ...
@ -143,7 +122,7 @@ class CouchReader(Couch,Reader):
document = {}
return document
class CouchWriter(Couch,Writer):
class Writer(Couch):
"""
This class will write on a couchdb document provided a scope
The scope is the attribute that will be on the couchdb document
@ -156,16 +135,16 @@ class CouchWriter(Couch,Writer):
@param dbname database name (target)
"""
Couch.__init__(self,**args)
super().__init__(**args)
def set (self,info):
document = cloudand.document.Document(self.dbase,self._id)
document = cloudant.document.Document(self.dbase,self._id)
if document.exists() :
keys = list(set(document.keys()) - set(['_id','_rev','_attachments']))
for id in keys :
document.field_set(document,id,None)
for id in args :
value = args[id]
document.field_set(document,id,value)
for id in info :
value = info[id]
document.field_set(document,id,value)
document.save()
pass

@ -5,22 +5,21 @@ Steve L. Nyemba, The Phi Technology LLC
This file is a wrapper around mongodb for reading/writing content against a mongodb server and executing views (mapreduce)
"""
from pymongo import MongoClient
import bson
from bson.objectid import ObjectId
from bson.binary import Binary
# import nujson as json
from datetime import datetime
import pandas as pd
import numpy as np
import gridfs
# from transport import Reader,Writer
# import gridfs
from gridfs import GridFS
import sys
if sys.version_info[0] > 2 :
from transport.common import Reader, Writer, IEncoder
else:
from common import Reader, Writer
import json
import re
from multiprocessing import Lock, RLock
from transport.common import IEncoder
class Mongo :
lock = RLock()
"""
@ -33,7 +32,7 @@ class Mongo :
:username username for authentication
:password password for current user
"""
self.host = 'localhost' if 'host' not in args else args['host']
self.mechanism= 'SCRAM-SHA-256' if 'mechanism' not in args else args['mechanism']
# authSource=(args['authSource'] if 'authSource' in args else self.dbname)
self._lock = False if 'lock' not in args else args['lock']
@ -61,7 +60,7 @@ class Mongo :
# Let us perform aliasing in order to remain backwards compatible
self.dbname = self.db if hasattr(self,'db')else self.dbname
self.uid = _args['table'] if 'table' in _args else (_args['doc'] if 'doc' in _args else (_args['collection'] if 'collection' in _args else None))
self.collection = _args['table'] if 'table' in _args else (_args['doc'] if 'doc' in _args else (_args['collection'] if 'collection' in _args else None))
if username and password :
self.client = MongoClient(self.host,
username=username,
@ -76,7 +75,7 @@ class Mongo :
def isready(self):
p = self.dbname in self.client.list_database_names()
q = self.uid in self.client[self.dbname].list_collection_names()
q = self.collection in self.client[self.dbname].list_collection_names()
return p and q
def setattr(self,key,value):
_allowed = ['host','port','db','doc','collection','authSource','mechanism']
@ -87,7 +86,7 @@ class Mongo :
self.client.close()
def meta(self,**_args):
return []
class MongoReader(Mongo,Reader):
class Reader(Mongo):
"""
This class will read from a mongodb data store and return the content of a document (not a collection)
"""
@ -100,7 +99,7 @@ class MongoReader(Mongo,Reader):
# @TODO:
cmd = {}
if 'aggregate' not in cmd and 'aggregate' not in args:
cmd['aggregate'] = self.uid
cmd['aggregate'] = self.collection
elif 'aggregate' in args :
cmd['aggregate'] = args['aggregate']
if 'pipeline' in args :
@ -144,9 +143,9 @@ class MongoReader(Mongo,Reader):
elif 'collection' in args :
_uid = args['collection']
else:
_uid = self.uid
_uid = self.collection
else:
_uid = self.uid
_uid = self.collection
collection = self.db[_uid]
_filter = args['filter'] if 'filter' in args else {}
_df = pd.DataFrame(collection.find(_filter))
@ -157,7 +156,7 @@ class MongoReader(Mongo,Reader):
This function is designed to execute a view (map/reduce) operation
"""
pass
class MongoWriter(Mongo,Writer):
class Writer(Mongo):
"""
This class is designed to write to a mongodb collection within a database
"""
@ -180,7 +179,7 @@ class MongoWriter(Mongo,Writer):
"""
This function will archive documents to the
"""
collection = self.db[self.uid]
collection = self.db[self.collection]
rows = list(collection.find())
for row in rows :
if type(row['_id']) == ObjectId :
@ -188,8 +187,8 @@ class MongoWriter(Mongo,Writer):
stream = Binary(json.dumps(collection,cls=IEncoder).encode())
collection.delete_many({})
now = "-".join([str(datetime.now().year),str(datetime.now().month), str(datetime.now().day)])
name = ".".join([self.uid,'archive',now])+".json"
description = " ".join([self.uid,'archive',str(len(rows))])
name = ".".join([self.collection,'archive',now])+".json"
description = " ".join([self.collection,'archive',str(len(rows))])
self.upload(filename=name,data=stream,description=description,content_type='application/json')
# gfs = GridFS(self.db)
# gfs.put(filename=name,description=description,data=stream,encoding='utf-8')
@ -197,27 +196,44 @@ class MongoWriter(Mongo,Writer):
pass
def write(self,info,**_args):
"""
This function will write to a given collection i.e add a record to a collection (no updates)
@param info new record in the collection to be added
"""
# document = self.db[self.uid].find()
#collection = self.db[self.uid]
# document = self.db[self.collection].find()
#collection = self.db[self.collection]
# if type(info) == list :
# self.db[self.uid].insert_many(info)
# self.db[self.collection].insert_many(info)
# else:
try:
if 'table' in _args or 'collection' in _args :
_uid = _args['table'] if 'table' in _args else _args['collection']
else:
_uid = self.uid if 'doc' not in _args else _args['doc']
_uid = self.collection if 'doc' not in _args else _args['doc']
if self._lock :
Mongo.lock.acquire()
if type(info) == list or type(info) == pd.DataFrame :
self.db[_uid].insert_many(info if type(info) == list else info.to_dict(orient='records'))
if type(info) == pd.DataFrame :
info = info.to_dict(orient='records')
# info if type(info) == list else info.to_dict(orient='records')
info = json.loads(json.dumps(info,cls=IEncoder))
self.db[_uid].insert_many(info)
else:
self.db[_uid].insert_one(info)
#
# sometimes a dictionary can have keys with arrays (odd shaped)
#
_keycount = len(info.keys())
_arraycount = [len(info[key]) for key in info if type(info[key]) in (list,np.array,np.ndarray)]
if _arraycount and len(_arraycount) == _keycount and np.max(_arraycount) == np.min(_arraycount) :
#
# In case an object with consistent structure is passed, we store it accordingly
#
self.write(pd.DataFrame(info),**_args)
else:
self.db[_uid].insert_one(json.loads(json.dumps(info,cls=IEncoder)))
finally:
if self._lock :
Mongo.lock.release()
@ -227,15 +243,19 @@ class MongoWriter(Mongo,Writer):
Please use this function with great care (archive the content first before using it... for safety)
"""
collection = self.db[self.uid]
if collection.count_document() > 0 and '_id' in document:
collection = self.db[self.collection]
if collection.count_documents({}) > 0 and '_id' in document:
id = document['_id']
del document['_id']
collection.find_one_and_replace({'_id':id},document)
else:
collection.delete_many({})
self.write(info)
#
# Nothing to be done if we did not find anything
#
pass
# collection.delete_many({})
# self.write(info)
def close(self):
Mongo.close(self)
# collecton.update_one({"_id":self.uid},document,True)
# collecton.update_one({"_id":self.collection},document,True)
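Review note: the new `write` branch treats a dictionary whose values are all arrays of the same length as a table and stores it row by row. The diff does this with numpy and pandas; the sketch below shows the same check with the standard library only. `to_records` is a hypothetical helper, not part of the code under review.

```python
def to_records(info):
    """Return a list of row dicts if every value is a same-length list, else None."""
    if not info or not all(isinstance(v, (list, tuple)) for v in info.values()):
        return None
    lengths = {len(v) for v in info.values()}
    if len(lengths) != 1:
        # Arrays of different sizes cannot be aligned into rows
        return None
    keys = list(info)
    return [dict(zip(keys, row)) for row in zip(*info.values())]

table_like = to_records({"a": [1, 2], "b": [3, 4]})
mixed = to_records({"a": [1, 2], "b": 5})
ragged = to_records({"a": [1], "b": [1, 2]})
```

When the helper returns `None`, the caller would fall back to inserting the dictionary as a single document.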

@ -0,0 +1 @@
from . import files, http, rabbitmq, callback

@ -1,22 +1,22 @@
import queue
from threading import Thread, Lock
from transport.common import Reader,Writer
# from transport.common import Reader,Writer
import numpy as np
import pandas as pd
class qListener :
class Writer :
lock = Lock()
_queue = {'default':queue.Queue()}
def __init__(self,**_args):
self._cache = {}
self._callback = _args['callback'] if 'callback' in _args else None
self._id = _args['id'] if 'id' in _args else 'default'
if self._id not in qListener._queue :
qListener._queue[self._id] = queue.Queue()
if self._id not in Writer._queue :
Writer._queue[self._id] = queue.Queue()
thread = Thread(target=self._forward)
thread.start()
def _forward(self):
_q = qListener._queue[self._id]
_q = Writer._queue[self._id]
_data = _q.get()
_q.task_done()
self._callback(_data)
@ -29,7 +29,7 @@ class qListener :
"""
This will empty the queue and have it ready for another operation
"""
_q = qListener._queue[self._id]
_q = Writer._queue[self._id]
with _q.mutex:
_q.queue.clear()
_q.all_tasks_done.notify_all()
@ -37,11 +37,9 @@ class qListener :
def write(self,_data,**_args):
_id = _args['id'] if 'id' in _args else self._id
_q = qListener._queue[_id]
_q = Writer._queue[_id]
_q.put(_data)
_q.join()
class Console (qListener):
def __init__(self,**_args):
super().__init__(callback=print)
# self.callback = print

@ -0,0 +1,7 @@
from . import callback
class Writer (callback.Writer):
def __init__(self,**_args):
super().__init__(callback=print)
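Review note: the callback writer hands data to a worker thread through a queue and blocks in `write` until the item is marked done. The sketch below is a simplified variant, not the class in the diff: `CallbackWriter` is a hypothetical name, the worker loops instead of handling a single item, and `task_done()` is called only after the callback has run so that `write` returns once the data has been handled.

```python
import queue
from threading import Thread

class CallbackWriter:
    """Hypothetical, simplified variant of the queue-backed writer."""
    def __init__(self, callback):
        self._callback = callback
        self._queue = queue.Queue()
        # daemon=True so an idle worker never blocks interpreter exit
        self._thread = Thread(target=self._forward, daemon=True)
        self._thread.start()
    def _forward(self):
        while True:
            _data = self._queue.get()
            try:
                self._callback(_data)
            finally:
                # Signal completion only once the callback has finished
                self._queue.task_done()
    def write(self, _data):
        self._queue.put(_data)
        self._queue.join()

received = []
writer = CallbackWriter(received.append)
writer.write({"id": 1})
writer.write({"id": 2})
```

Passing `print` as the callback gives the console behaviour of the `Writer` subclass above.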

@ -0,0 +1,68 @@
"""
This file is a wrapper around pandas built-in functionalities to handle character delimited files
"""
import pandas as pd
import numpy as np
import os
class File :
def __init__(self,**params):
"""
@param path absolute path of the file to be read
"""
self.path = params['path'] if 'path' in params else None
self.delimiter = params['delimiter'] if 'delimiter' in params else ','
def isready(self):
return os.path.exists(self.path)
def meta(self,**_args):
return []
class Reader (File):
"""
This class is designed to read data from disk (location on hard drive)
@pre : isready() == True
"""
def __init__(self,**_args):
super().__init__(**_args)
def read(self,**args):
_path = self.path if 'path' not in args else args['path']
_delimiter = self.delimiter if 'delimiter' not in args else args['delimiter']
return pd.read_csv(_path,delimiter=_delimiter)
def stream(self,**args):
raise Exception ("streaming needs to be implemented")
class Writer (File):
"""
This class writes output to disk in a designated location, as a text file
- If a delimiter is provided it is used to generate a character-delimited file
- If not then the object will be dumped as is
"""
# THREAD_LOCK = RLock()
def __init__(self,**_args):
super().__init__(**_args)
self._mode = 'w' if 'mode' not in _args else _args['mode']
def write(self,info,**_args):
"""
This function writes a record to a designated file
@param label <passed|broken|fixed|stats>
@param row row to be written
"""
try:
_delim = self.delimiter if 'delimiter' not in _args else _args['delimiter']
_path = self.path if 'path' not in _args else _args['path']
_mode = self._mode if 'mode' not in _args else _args['mode']
info.to_csv(_path,index=False,sep=_delim,mode=_mode)
pass
except Exception as e:
#
# Not sure what should be done here ...
pass
finally:
# DiskWriter.THREAD_LOCK.release()
pass
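Review note: both `Reader.read` and `Writer.write` let per-call arguments override the values given at construction. The sketch below shows that precedence with the standard `csv` module rather than pandas, so it is an illustration of the rule and not the implementation in the diff; `DelimitedWriter` and the file name are hypothetical.

```python
import csv
import os
import tempfile

class DelimitedWriter:
    """Hypothetical stdlib-only stand-in for the pandas-based Writer."""
    def __init__(self, **_args):
        self.path = _args.get('path')
        self.delimiter = _args.get('delimiter', ',')
    def write(self, rows, **_args):
        # Per-call arguments take precedence over the values set at construction
        _path = _args.get('path', self.path)
        _delim = _args.get('delimiter', self.delimiter)
        with open(_path, 'w', newline='') as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0]), delimiter=_delim)
            writer.writeheader()
            writer.writerows(rows)

folder = tempfile.mkdtemp()
target = os.path.join(folder, 'out.tsv')
# The instance defaults to ',' but this call asks for tabs
DelimitedWriter(path=target).write([{'a': 1, 'b': 2}], delimiter='\t')
with open(target, newline='') as handle:
    written = list(csv.reader(handle, delimiter='\t'))
```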

@ -1,14 +1,14 @@
from flask import request, session
from datetime import datetime
import re
from transport.common import Reader, Writer
# from transport.common import Reader, Writer
import json
import requests
from io import StringIO
import pandas as pd
class HttpReader(Reader):
class Reader:
"""
This class is designed to read data from an Http request file handler provided to us by flask
The file will be held in memory and processed accordingly
@ -38,7 +38,7 @@ class HttpReader(Reader):
r = requests.get(self._url,headers = self._headers)
return self.format(r)
class HttpWriter(Writer):
class Writer:
"""
This class is designed to submit data to an endpoint (url)
"""

@ -11,10 +11,10 @@ import re
import json
import os
import sys
if sys.version_info[0] > 2 :
from transport.common import Reader, Writer
else:
from common import Reader, Writer
# if sys.version_info[0] > 2 :
# from transport.common import Reader, Writer
# else:
# from common import Reader, Writer
import json
from multiprocessing import RLock
class MessageQueue:
@ -80,7 +80,7 @@ class MessageQueue:
self.channel.close()
self.connection.close()
class QueueWriter(MessageQueue,Writer):
class Writer(MessageQueue):
"""
This class is designed to publish content to an AMQP (Rabbitmq)
The class will rely on pika to implement this functionality
@ -93,13 +93,6 @@ class QueueWriter(MessageQueue,Writer):
#self.queue = params['queue']
MessageQueue.__init__(self,**params);
self.init()
def write(self,data,_type='text/plain'):
"""
This function writes a stream of data to a given queue
@ -122,7 +115,7 @@ class QueueWriter(MessageQueue,Writer):
self.channel.queue_delete( queue=self.queue);
self.close()
class QueueReader(MessageQueue,Reader):
class Reader(MessageQueue):
"""
This class will read from a queue provided an exchange, queue and host
@TODO: Account for security and virtualhosts

@ -0,0 +1,128 @@
"""
The functions within are designed to load external files and apply functions against the data
The plugins are applied as
- post-processing if we are reading data
- and pre-processing if we are writing data
The plugin will use a decorator to identify meaningful functions
@TODO: This should work in tandem with logging (otherwise we don't have visibility into what is going on)
"""
import importlib as IL
import importlib.util
import sys
import os
class plugin :
"""
Implementing function decorator for data-transport plugins (post-pre)-processing
"""
def __init__(self,**_args):
"""
:name name of the plugin
:mode restrict to reader/writer
:about tell what the function is about
"""
self._name = _args['name']
self._about = _args['about']
self._mode = _args['mode'] if 'mode' in _args else 'rw'
def __call__(self,pointer):
def wrapper(_args):
return pointer(_args)
#
# @TODO:
# add attributes to the wrapper object
#
setattr(wrapper,'transport',True)
setattr(wrapper,'name',self._name)
setattr(wrapper,'mode',self._mode)
setattr(wrapper,'about',self._about)
return wrapper
class PluginLoader :
"""
This class is intended to load a plugin and make it available and assess the quality of the developed plugin
"""
def __init__(self,**_args):
"""
:path location of the plugin (should be a single file)
:_names of functions to load
"""
_names = _args['names'] if 'names' in _args else None
path = _args['path'] if 'path' in _args else None
self._names = _names if type(_names) == list else [_names]
self._modules = {}
if path and os.path.exists(path) and _names:
for _name in self._names :
spec = importlib.util.spec_from_file_location('private', path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module) #--executes the module (it is not registered in sys.modules)
if hasattr(module,_name) :
if self.isplugin(module,_name) :
self._modules[_name] = getattr(module,_name)
else:
print ([f'Found {_name}', 'not plugin'])
else:
#
# @TODO: We should log this somewhere some how
print (['skipping ',_name, hasattr(module,_name)])
pass
else:
#
# Initialization is empty
self._names = []
pass
def set(self,_pointer) :
"""
This function will set a pointer to the list of modules to be called
This should be used within the context of using the framework as a library
"""
_name = _pointer.__name__
self._modules[_name] = _pointer
self._names.append(_name)
def isplugin(self,module,name):
"""
This function determines if a module is a recognized plugin
:module module object loaded from importlib
:name name of the function of interest
"""
p = type(getattr(module,name)).__name__ =='function'
q = hasattr(getattr(module,name),'transport')
#
# @TODO: add a generated key, and more indepth validation
return p and q
def has(self,_name):
"""
This will determine if the module name is loaded or not
"""
return _name in self._modules
def ratio (self):
"""
how many modules loaded vs unloaded given the list of names
"""
_n = len(self._names)
return len(set(self._modules.keys()) & set (self._names)) / _n if _n > 0 else 0
def apply(self,_data):
for _name in self._modules :
_pointer = self._modules[_name]
#
# @TODO: add exception handling
_data = _pointer(_data)
return _data
# def apply(self,_data,_name):
# """
# This function applies an external module function against the data.
# The responsibility is on the plugin to properly return data, thus responsibility is offloaded
# """
# try:
# _pointer = self._modules[_name]
# _data = _pointer(_data)
# except Exception as e:
# pass
# return _data
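Review note: the loader imports a single file with `importlib.util` and keeps only the named functions that carry the `transport` marker set by the decorator. The sketch below reproduces that flow end to end under stated assumptions: `load_plugins`, the plugin source and the file name are hypothetical, and the marker is set by hand instead of through the `plugin` decorator.

```python
import importlib.util
import os
import tempfile

PLUGIN_SOURCE = '''
def marked(_data):
    return [x * 2 for x in _data]
marked.transport = True   # the attribute the decorator would set

def unmarked(_data):
    return _data
'''

def load_plugins(path, names):
    """Load the named functions from a file, keeping only those flagged as plugins."""
    spec = importlib.util.spec_from_file_location('private', path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    found = {}
    for name in names:
        pointer = getattr(module, name, None)
        if callable(pointer) and hasattr(pointer, 'transport'):
            found[name] = pointer
    return found

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'my_plugins.py')
    with open(path, 'w') as handle:
        handle.write(PLUGIN_SOURCE)
    names = ['marked', 'unmarked', 'missing']
    plugins = load_plugins(path, names)

# Share of requested names that were actually loaded, guarded against an empty list
ratio = len(plugins) / len(names) if names else 0
result = plugins['marked']([1, 2, 3])
```

Only `marked` is loaded: `unmarked` lacks the marker and `missing` does not exist in the file.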

@ -1,105 +0,0 @@
# from transport.common import Reader, Writer,Console #, factory
from transport import disk
import sqlite3
from transport import s3 as s3
from transport import rabbitmq as queue
from transport import couch as couch
from transport import mongo as mongo
from transport import sql as sql
from transport import etl as etl
from transport import qlistener
from transport import bricks
from transport import session
from transport import nextcloud
import psycopg2 as pg
import mysql.connector as my
from google.cloud import bigquery as bq
import nzpy as nz #--- netezza drivers
import os
from info import __version__
POSTGRESQL = 'postgresql'
MONGODB = 'mongodb'
HTTP='http'
BIGQUERY ='bigquery'
FILE = 'file'
ETL = 'etl'
SQLITE = 'sqlite'
SQLITE3= 'sqlite'
REDSHIFT = 'redshift'
NETEZZA = 'netezza'
MYSQL = 'mysql+mysqlconnector'
RABBITMQ = 'rabbitmq'
MARIADB = 'mariadb'
COUCHDB = 'couch'
CONSOLE = 'console'
ETL = 'etl'
TRANSPORT = ETL
NEXTCLOUD = 'nextcloud'
#
# synonyms of the above
BQ = BIGQUERY
MONGO = MONGODB
FERRETDB= MONGODB
PG = POSTGRESQL
PSQL = POSTGRESQL
PGSQL = POSTGRESQL
S3 = 's3'
AWS_S3 = 's3'
RABBIT = RABBITMQ
QLISTENER = 'qlistener'
QUEUE = QLISTENER
CALLBACK = QLISTENER
DATABRICKS= 'databricks+connector'
DRIVERS = {PG:pg,REDSHIFT:pg,MYSQL:my,MARIADB:my,NETEZZA:nz,SQLITE:sqlite3}
CATEGORIES ={'sql':[NETEZZA,PG,MYSQL,REDSHIFT,SQLITE,MARIADB],'nosql':[MONGODB,COUCHDB],'cloud':[NEXTCLOUD,S3,BIGQUERY,DATABRICKS],'file':[FILE],
'queue':[RABBIT,QLISTENER],'memory':[CONSOLE,QUEUE],'http':[HTTP]}
READ = {'sql':sql.SQLReader,'nosql':{MONGODB:mongo.MongoReader,COUCHDB:couch.CouchReader},
'cloud':{BIGQUERY:sql.BigQueryReader,DATABRICKS:bricks.BricksReader,NEXTCLOUD:nextcloud.NextcloudReader},
'file':disk.DiskReader,'queue':{RABBIT:queue.QueueReader,QLISTENER:qlistener.qListener},
# 'cli':{CONSOLE:Console},'memory':{CONSOLE:Console},'http':session.HttpReader
}
WRITE = {'sql':sql.SQLWriter,'nosql':{MONGODB:mongo.MongoWriter,COUCHDB:couch.CouchWriter},
'cloud':{BIGQUERY:sql.BigQueryWriter,DATABRICKS:bricks.BricksWriter,NEXTCLOUD:nextcloud.NextcloudWriter},
'file':disk.DiskWriter,'queue':{RABBIT:queue.QueueWriter,QLISTENER:qlistener.qListener},
# 'cli':{CONSOLE:Console},
# 'memory':{CONSOLE:Console}, 'http':session.HttpReader
}
# SQL_PROVIDERS = [POSTGRESQL,MYSQL,NETEZZA,MARIADB,SQLITE]
PROVIDERS = {
FILE:{'read':disk.DiskReader,'write':disk.DiskWriter},
SQLITE:{'read':disk.SQLiteReader,'write':disk.SQLiteWriter,'driver':sqlite3},
'sqlite3':{'read':disk.SQLiteReader,'write':disk.SQLiteWriter,'driver':sqlite3},
POSTGRESQL:{'read':sql.SQLReader,'write':sql.SQLWriter,'driver':pg,'default':{'host':'localhost','port':5432}},
NETEZZA:{'read':sql.SQLReader,'write':sql.SQLWriter,'driver':nz,'default':{'port':5480}},
REDSHIFT:{'read':sql.SQLReader,'write':sql.SQLWriter,'driver':pg,'default':{'host':'localhost','port':5432}},
RABBITMQ:{'read':queue.QueueReader,'write':queue.QueueWriter,'context':queue.QueueListener,'default':{'host':'localhost','port':5672}},
MYSQL:{'read':sql.SQLReader,'write':sql.SQLWriter,'driver':my,'default':{'host':'localhost','port':3306}},
MARIADB:{'read':sql.SQLReader,'write':sql.SQLWriter,'driver':my,'default':{'host':'localhost','port':3306}},
S3:{'read':s3.s3Reader,'write':s3.s3Writer},
BIGQUERY:{'read':sql.BigQueryReader,'write':sql.BigQueryWriter},
DATABRICKS:{'read':bricks.BricksReader,'write':bricks.BricksWriter},
NEXTCLOUD:{'read':nextcloud.NextcloudReader,'write':nextcloud.NextcloudWriter},
QLISTENER:{'read':qlistener.qListener,'write':qlistener.qListener,'default':{'host':'localhost','port':5672}},
CONSOLE:{'read':qlistener.Console,"write":qlistener.Console},
HTTP:{'read':session.HttpReader,'write':session.HttpWriter},
MONGODB:{'read':mongo.MongoReader,'write':mongo.MongoWriter,'default':{'port':27017,'host':'localhost'}},
COUCHDB:{'read':couch.CouchReader,'write':couch.CouchWriter,'default':{'host':'localhost','port':5984}},
# ETL :{'read':etl.Transporter,'write':etl.Transporter}
ETL :{'read':etl.instance,'write':etl.instance}
}
DEFAULT = {PG:{'host':'localhost','port':5432},MYSQL:{'host':'localhost','port':3306}}
DEFAULT[MONGODB] = {'port':27017,'host':'localhost'}
DEFAULT[REDSHIFT] = DEFAULT[PG]
DEFAULT[MARIADB] = DEFAULT[MYSQL]
DEFAULT[NETEZZA] = {'port':5480}
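Review note: the removed module kept per-provider connection defaults, with some providers sharing an entry. A minimal sketch of how such defaults could be merged with caller arguments is shown below; `with_defaults` is a hypothetical helper and the port values are the ones listed in the `DEFAULT` table above.

```python
DEFAULTS = {
    'postgresql': {'host': 'localhost', 'port': 5432},
    'mysql': {'host': 'localhost', 'port': 3306},
    'mongodb': {'host': 'localhost', 'port': 27017},
    'netezza': {'port': 5480},
}
# Providers that speak the same protocol share one entry
DEFAULTS['redshift'] = DEFAULTS['postgresql']
DEFAULTS['mariadb'] = DEFAULTS['mysql']

def with_defaults(provider, **_args):
    """Return connection arguments with provider defaults filled in."""
    merged = dict(DEFAULTS.get(provider, {}))   # copy so shared defaults stay intact
    merged.update(_args)                        # explicit arguments win
    return merged

maria = with_defaults('mariadb', port=3307)
lite = with_defaults('sqlite', path='x.db')
```

Copying before the update matters because `mariadb` and `mysql` point at the same dictionary.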

@ -0,0 +1,46 @@
"""
This file is intended to aggregate all we can about the framework in terms of support
"""
BIGQUERY='bigquery'
POSTGRESQL = 'postgresql'
MONGODB = 'mongodb'
HTTP='http'
BIGQUERY ='bigquery'
FILE = 'file'
ETL = 'etl'
SQLITE = 'sqlite'
SQLITE3= 'sqlite3'
REDSHIFT = 'redshift'
NETEZZA = 'netezza'
MYSQL = 'mysql'
MARIADB= MYSQL
COUCHDB = 'couchdb'
CONSOLE = 'console'
ETL = 'etl'
TRANSPORT = ETL
NEXTCLOUD = 'nextcloud'
S3 = 's3'
CALLBACK = 'callback'
CONSOLE = 'console'
RABBITMQ = 'rabbitmq'
DATABRICKS = 'databricks'
MSSQL ='sqlserver'
SQLSERVER ='sqlserver'
#
# synonyms of the above
BQ = BIGQUERY
MONGO = MONGODB
FERRETDB= MONGODB
PG = POSTGRESQL
PSQL = POSTGRESQL
PGSQL = POSTGRESQL
AWS_S3 = 's3'
RABBIT = RABBITMQ
# QLISTENER = 'qlistener'
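Review note: the synonyms above are plain constant aliases (`PG = POSTGRESQL` and so on). If provider names arrive as user-supplied strings, one way to resolve them to a canonical value is a lookup table. The sketch below is an assumption about how that could look; the `SYNONYMS` mapping and `canonical` function are hypothetical and do not exist in the diff.

```python
POSTGRESQL = 'postgresql'
MONGODB = 'mongodb'
BIGQUERY = 'bigquery'
RABBITMQ = 'rabbitmq'
S3 = 's3'

# Hypothetical string aliases mirroring the constant aliases in the module
SYNONYMS = {
    'pg': POSTGRESQL, 'psql': POSTGRESQL, 'pgsql': POSTGRESQL,
    'bq': BIGQUERY,
    'mongo': MONGODB, 'ferretdb': MONGODB,
    'aws_s3': S3,
    'rabbit': RABBITMQ,
}

def canonical(provider):
    """Map an alias to its canonical provider name; unknown names pass through."""
    key = provider.strip().lower()
    return SYNONYMS.get(key, key)

resolved = [canonical(name) for name in ('PG', 'ferretdb', 'sqlite')]
```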

@ -1,526 +0,0 @@
"""
This file is intended to perform read/writes against an SQL database such as PostgreSQL, Redshift, Mysql, MsSQL ...
LICENSE (MIT)
Copyright 2016-2020, The Phi Technology LLC
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@TODO:
- Migrate SQLite to SQL hierarchy
- Include Write in Chunks from pandas
"""
import psycopg2 as pg
import mysql.connector as my
import sys
import sqlalchemy
if sys.version_info[0] > 2 :
from transport.common import Reader, Writer #, factory
else:
from common import Reader,Writer
import json
from google.oauth2 import service_account
from google.cloud import bigquery as bq
# import constants.bq_utils as bq_consts
from multiprocessing import Lock, RLock
import pandas as pd
import pandas_gbq as pd_gbq
import numpy as np
import nzpy as nz #--- netezza drivers
import sqlite3
import copy
import os
import time
class SQLRW :
lock = RLock()
MAX_CHUNK = 2000000
DRIVERS = {"postgresql":pg,"redshift":pg,"mysql":my,"mariadb":my,"netezza":nz}
REFERENCE = {
"netezza":{"port":5480,"handler":nz,"dtype":"VARCHAR(512)"},
"postgresql":{"port":5432,"handler":pg,"dtype":"VARCHAR"},
"redshift":{"port":5432,"handler":pg,"dtype":"VARCHAR"},
"mysql":{"port":3306,"handler":my,"dtype":"VARCHAR(256)"},
"mariadb":{"port":3306,"handler":my,"dtype":"VARCHAR(256)"},
}
def __init__(self,**_args):
_info = {}
_info['dbname'] = _args['db'] if 'db' in _args else _args['database']
self.table = _args['table'] if 'table' in _args else None
self.fields = _args['fields'] if 'fields' in _args else []
self.schema = _args['schema'] if 'schema' in _args else ''
self._chunks = 1 if 'chunks' not in _args else int(_args['chunks'])
self._provider = _args['provider'] if 'provider' in _args else None
# _info['host'] = 'localhost' if 'host' not in _args else _args['host']
# _info['port'] = SQLWriter.REFERENCE[_provider]['port'] if 'port' not in _args else _args['port']
_info['host'] = _args['host'] if 'host' in _args else ''
_info['port'] = _args['port'] if 'port' in _args else ''
# if 'host' in _args :
# _info['host'] = 'localhost' if 'host' not in _args else _args['host']
# # _info['port'] = SQLWriter.PROVIDERS[_args['provider']] if 'port' not in _args else _args['port']
# _info['port'] = SQLWriter.REFERENCE[_provider]['port'] if 'port' not in _args else _args['port']
self.lock = False if 'lock' not in _args else _args['lock']
if 'username' in _args or 'user' in _args:
key = 'username' if 'username' in _args else 'user'
_info['user'] = _args[key]
_info['password'] = _args['password'] if 'password' in _args else ''
if 'auth_file' in _args :
_auth = json.loads( open(_args['auth_file']).read() )
key = 'username' if 'username' in _auth else 'user'
_info['user'] = _auth[key]
_info['password'] = _auth['password'] if 'password' in _auth else ''
_info['host'] = _auth['host'] if 'host' in _auth else _info['host']
_info['port'] = _auth['port'] if 'port' in _auth else _info['port']
if 'database' in _auth:
_info['dbname'] = _auth['database']
self.table = _auth['table'] if 'table' in _auth else self.table
#
# We need to load the drivers here to see what we are dealing with ...
# _handler = SQLWriter.REFERENCE[_provider]['handler']
_handler = _args['driver'] #-- handler to the driver
self._dtype = _args['default']['type'] if 'default' in _args and 'type' in _args['default'] else 'VARCHAR(256)'
# self._provider = _args['provider']
# self._dtype = SQLWriter.REFERENCE[_provider]['dtype'] if 'dtype' not in _args else _args['dtype']
# self._provider = _provider
if _handler == nz :
_info['database'] = _info['dbname']
_info['securityLevel'] = 0
del _info['dbname']
if _handler == my :
_info['database'] = _info['dbname']
del _info['dbname']
if _handler == sqlite3 :
_info = {'path':_info['dbname'],'isolation_level':'IMMEDIATE'}
if _handler != sqlite3 :
self.conn = _handler.connect(**_info)
else:
self.conn = _handler.connect(_info['path'],isolation_level='IMMEDIATE')
self._engine = _args['sqlalchemy'] if 'sqlalchemy' in _args else None
def meta(self,**_args):
schema = []
try:
if self._engine :
table = _args['table'] if 'table' in _args else self.table
if sqlalchemy.__version__.startswith('1.') :
_m = sqlalchemy.MetaData(bind=self._engine)
_m.reflect()
else:
_m = sqlalchemy.MetaData()
_m.reflect(bind=self._engine)
schema = [{"name":_attr.name,"type":str(_attr.type)} for _attr in _m.tables[table].columns]
#
# Some house keeping work
_m = {'BIGINT':'INTEGER','TEXT':'STRING','DOUBLE_PRECISION':'FLOAT','NUMERIC':'FLOAT','DECIMAL':'FLOAT','REAL':'FLOAT'}
for _item in schema :
if _item['type'] in _m :
_item['type'] = _m[_item['type']]
except Exception as e:
print (e)
pass
return schema
def _tablename(self,name) :
return self.schema +'.'+name if self.schema not in [None, ''] and '.' not in name else name
def has(self,**_args):
return self.meta(**_args)
# found = False
# try:
# table = self._tablename(_args['table'])if 'table' in _args else self._tablename(self.table)
# sql = "SELECT * FROM :table LIMIT 1".replace(":table",table)
# if self._engine :
# _conn = self._engine.connect()
# else:
# _conn = self.conn
# found = pd.read_sql(sql,_conn).shape[0]
# found = True
# except Exception as e:
# print (e)
# pass
# finally:
# if not self._engine :
# _conn.close()
# return found
def isready(self):
_sql = "SELECT * FROM :table LIMIT 1".replace(":table",self.table)
try:
_conn = self._engine if self._engine else self.conn
return pd.read_sql(_sql,_conn).columns.tolist()
except Exception as e:
pass
return False
def apply(self,_sql):
"""
This function applies a command and/or a query against the current relational data-store
:param _sql insert/select statement
@TODO: Stored procedure calls
"""
#
_out = None
try:
if _sql.lower().startswith('select') :
_conn = self._engine if self._engine else self.conn
return pd.read_sql(_sql,_conn)
else:
# Executing a command i.e no expected return values ...
cursor = self.conn.cursor()
cursor.execute(_sql)
self.conn.commit()
except Exception as e :
print (e)
finally:
if not self._engine :
self.conn.commit()
# cursor.close()
def close(self):
try:
self.conn.close()
except Exception as error :
print (error)
pass
class SQLReader(SQLRW,Reader) :
def __init__(self,**_args):
super().__init__(**_args)
def read(self,**_args):
if 'sql' in _args :
_sql = (_args['sql'])
else:
if 'table' in _args :
table = _args['table']
else:
table = self.table
# table = self.table if self.table is not None else _args['table']
_sql = "SELECT :fields FROM "+self._tablename(table)
if 'filter' in _args :
_sql = _sql +" WHERE "+_args['filter']
if 'fields' in _args :
_fields = _args['fields']
else:
_fields = '*' if not self.fields else ",".join(self.fields)
_sql = _sql.replace(":fields",_fields)
#
# At this point we have a query we can execute gracefully
if 'limit' in _args :
_sql = _sql + " LIMIT "+str(_args['limit'])
#
# @TODO:
# It is here that we should inspect to see if there are any pre/post conditions
#
return self.apply(_sql)
def close(self) :
try:
self.conn.close()
except Exception as error :
print (error)
pass
class SQLWriter(SQLRW,Writer):
def __init__(self,**_args) :
super().__init__(**_args)
#
# In the event that data typing is difficult to determine we can inspect and perform a default cast
# This slows down the process but improves reliability of the data
# NOTE: Proper data type should be set on the target system if their source is unclear.
self._cast = False if 'cast' not in _args else _args['cast']
def init(self,fields=None):
# if not fields :
# try:
# table = self._tablename(self.table)
# self.fields = pd.read_sql_query("SELECT * FROM :table LIMIT 1".replace(":table",table),self.conn).columns.tolist()
# except Exception as e:
# pass
# finally:
# pass
# else:
self.fields = fields
def make(self,**_args):
table = self._tablename(self.table) if 'table' not in _args else self._tablename(_args['table'])
if 'fields' in _args :
fields = _args['fields']
# table = self._tablename(self.table)
sql = " ".join(["CREATE TABLE",table," (", ",".join([ name +' '+ self._dtype for name in fields]),")"])
else:
schema = _args['schema'] if 'schema' in _args else []
_map = _args['map'] if 'map' in _args else {}
sql = [] # ["CREATE TABLE ",_args['table'],"("]
for _item in schema :
_type = _item['type']
if _type in _map :
_type = _map[_type]
sql = sql + [" " .join([_item['name'], ' ',_type])]
sql = ",".join(sql)
# table = self._tablename(_args['table'])
sql = ["CREATE TABLE ",table,"( ",sql," )"]
sql = " ".join(sql)
cursor = self.conn.cursor()
try:
cursor.execute(sql)
except Exception as e :
print (e)
# print (sql)
pass
finally:
# cursor.close()
self.conn.commit()
pass
def write(self,info,**_args):
"""
:param info writes a list of data to a given set of fields
"""
# inspect = False if 'inspect' not in _args else _args['inspect']
# cast = False if 'cast' not in _args else _args['cast']
# if not self.fields :
# if type(info) == list :
# _fields = info[0].keys()
# elif type(info) == dict :
# _fields = info.keys()
# elif type(info) == pd.DataFrame :
# _fields = info.columns.tolist()
# # _fields = info.keys() if type(info) == dict else info[0].keys()
# # _fields = list (_fields)
# self.init(_fields)
try:
table = _args['table'] if 'table' in _args else self.table
#
# In SQL, schema can stand for namespace or the structure of a table
# In case we have a list, we are likely dealing with table structure
#
if 'schema' in _args :
if type(_args['schema']) == str :
self.schema = _args['schema'] if 'schema' in _args else self.schema
elif type(_args['schema']) == list and len(_args['schema']) > 0 and not self.has(table=table):
#
# There is a messed up case when an empty array is passed (no table should be created)
#
self.make(table=table,schema=_args['schema'])
pass
# self.schema = _args['schema'] if 'schema' in _args else self.schema
table = self._tablename(table)
_sql = "INSERT INTO :table (:fields) VALUES (:values)".replace(":table",table) #.replace(":table",self.table).replace(":fields",_fields)
if type(info) == list :
_info = pd.DataFrame(info)
elif type(info) == dict :
_info = pd.DataFrame([info])
else:
_info = pd.DataFrame(info)
if _info.shape[0] == 0 :
return
if self.lock :
SQLRW.lock.acquire()
#
# Adjust the number of chunks when none was requested and the data exceeds SQLRW.MAX_CHUNK rows
if self._chunks == 1 and _info.shape[0] > SQLRW.MAX_CHUNK :
self._chunks = 10
_indexes = np.array_split(np.arange(_info.shape[0]),self._chunks)
for i in _indexes :
#
# In case we have an invalid chunk ...
if _info.iloc[i].shape[0] == 0 :
continue
#
# We are enabling writing by chunks/batches because some persistent layers have quotas or limitations on volume of data
if self._engine is not None:
# pd.to_sql(_info,self._engine)
if self.schema in ['',None] :
rows = _info.iloc[i].to_sql(table,self._engine,if_exists='append',index=False)
else:
#
# Writing with schema information ...
rows = _info.iloc[i].to_sql(self.table,self._engine,schema=self.schema,if_exists='append',index=False)
time.sleep(1)
else:
_fields = ",".join(self.fields)
_sql = _sql.replace(":fields",_fields)
values = ", ".join("?"*len(self.fields)) if self._provider == 'netezza' else ",".join(["%s" for name in self.fields])
_sql = _sql.replace(":values",values)
cursor = self.conn.cursor()
cursor.executemany(_sql,_info.iloc[i].values.tolist())
cursor.close()
# cursor.commit()
# self.conn.commit()
except Exception as e:
print(e)
pass
finally:
if self._engine is None :
self.conn.commit()
if self.lock :
SQLRW.lock.release()
# cursor.close()
pass
def close(self):
try:
self.conn.close()
finally:
pass
class BigQuery:
def __init__(self,**_args):
path = _args['service_key'] if 'service_key' in _args else _args['private_key']
self.credentials = service_account.Credentials.from_service_account_file(path)
self.dataset = _args['dataset'] if 'dataset' in _args else None
self.path = path
self.dtypes = _args['dtypes'] if 'dtypes' in _args else None
self.table = _args['table'] if 'table' in _args else None
self.client = bq.Client.from_service_account_json(self.path)
def meta(self,**_args):
"""
This function returns meta data for a given table or query with dataset/table properly formatted
:param table name of the table WITHOUT the dataset prefix
:param sql sql query to be pulled
"""
table = _args['table'] if 'table' in _args else self.table
try:
if table :
_dataset = self.dataset if 'dataset' not in _args else _args['dataset']
sql = f"""SELECT column_name as name, data_type as type FROM {_dataset}.INFORMATION_SCHEMA.COLUMNS WHERE table_name = '{table}' """
_info = {'credentials':self.credentials,'dialect':'standard'}
return pd_gbq.read_gbq(sql,**_info).to_dict(orient='records')
# return self.read(sql=sql).to_dict(orient='records')
# ref = self.client.dataset(self.dataset).table(table)
# _schema = self.client.get_table(ref).schema
# return [{"name":_item.name,"type":_item.field_type,"description":( "" if not hasattr(_item,"description") else _item.description )} for _item in _schema]
else :
return []
except Exception as e:
return []
def has(self,**_args):
found = False
try:
_has = self.meta(**_args)
found = _has is not None and len(_has) > 0
except Exception as e:
pass
return found
class BQReader(BigQuery,Reader) :
def __init__(self,**_args):
super().__init__(**_args)
def apply(self,sql):
return self.read(sql=sql)
def read(self,**_args):
SQL = None
table = self.table if 'table' not in _args else _args['table']
if 'sql' in _args :
SQL = _args['sql']
elif table:
table = "".join(["`",table,"`"]) if '.' in table else "".join(["`:dataset.",table,"`"])
SQL = "SELECT * FROM :table ".replace(":table",table)
if not SQL :
return None
if SQL and 'limit' in _args:
SQL += " LIMIT "+str(_args['limit'])
if (':dataset' in SQL or ':DATASET' in SQL) and self.dataset:
SQL = SQL.replace(':dataset',self.dataset).replace(':DATASET',self.dataset)
_info = {'credentials':self.credentials,'dialect':'standard'}
return pd_gbq.read_gbq(SQL,**_info) if SQL else None
# return self.client.query(SQL).to_dataframe() if SQL else None
class BQWriter(BigQuery,Writer):
lock = Lock()
def __init__(self,**_args):
super().__init__(**_args)
self.parallel = False if 'lock' not in _args else _args['lock']
self.table = _args['table'] if 'table' in _args else None
self.mode = {'if_exists':'append','chunksize':900000,'destination_table':self.table,'credentials':self.credentials}
self._chunks = 1 if 'chunks' not in _args else int(_args['chunks'])
self._location = 'US' if 'location' not in _args else _args['location']
def write(self,_info,**_args) :
# Remember whether the lock is taken so that it is always released by the same condition
_locked = self.parallel or 'lock' in _args
try:
if _locked :
BQWriter.lock.acquire()
_args['table'] = self.table if 'table' not in _args else _args['table']
self._write(_info,**_args)
finally:
if _locked :
BQWriter.lock.release()
def submit(self,_sql):
"""
Write the output of a massive query to a given table; BigQuery will handle this as a job
This function will return the job identifier
"""
_config = bq.QueryJobConfig()
_config.destination = self.client.dataset(self.dataset).table(self.table)
_config.allow_large_results = True
# _config.write_disposition = bq.bq_consts.WRITE_APPEND
_config.dry_run = False
# _config.priority = 'BATCH'
_resp = self.client.query(_sql,location=self._location,job_config=_config)
return _resp.job_id
def status (self,_id):
return self.client.get_job(_id,location=self._location)
def _write(self,_info,**_args) :
_df = None
if type(_info) in [list,pd.DataFrame] :
if type(_info) == list :
_df = pd.DataFrame(_info)
elif type(_info) == pd.DataFrame :
_df = _info
if '.' not in _args['table'] :
self.mode['destination_table'] = '.'.join([self.dataset,_args['table']])
else:
self.mode['destination_table'] = _args['table'].strip()
if 'schema' in _args :
self.mode['table_schema'] = _args['schema']
#
# Let us ensure that the types are somewhat compatible ...
# _map = {'INTEGER':np.int64,'DATETIME':'datetime64[ns]','TIMESTAMP':'datetime64[ns]','FLOAT':np.float64,'DOUBLE':np.float64,'STRING':str}
# _mode = copy.deepcopy(self.mode)
_mode = self.mode
# _df.to_gbq(**self.mode) #if_exists='append',destination_table=partial,credentials=credentials,chunksize=90000)
#
# Let us adjust the chunking here
self._chunks = 10 if _df.shape[0] > SQLRW.MAX_CHUNK and self._chunks == 1 else self._chunks
_indexes = np.array_split(np.arange(_df.shape[0]),self._chunks)
for i in _indexes :
_df.iloc[i].to_gbq(**self.mode)
time.sleep(1)
pass
#
# Aliasing the BigQuery classes to remain backward compatible
#
BigQueryReader = BQReader
BigQueryWriter = BQWriter

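The chunked write path in `SQLWriter.write` above (split the row indexes, skip empty chunks, call `executemany` per chunk, commit at the end) can be illustrated with the standard library alone. This is a minimal sketch, not the project's implementation: `split_indexes` and `write_chunks` are hypothetical helper names, `sqlite3` stands in for the real driver, and the pure-Python splitter is assumed to partition rows the same way `np.array_split` does (sizes differ by at most one).

```python
import sqlite3

def split_indexes(n_rows, n_chunks):
    """Split range(n_rows) into n_chunks contiguous index lists whose sizes differ by at most one."""
    size, extra = divmod(n_rows, n_chunks)
    out, start = [], 0
    for k in range(n_chunks):
        stop = start + size + (1 if k < extra else 0)
        out.append(list(range(start, stop)))
        start = stop
    return out

def write_chunks(conn, table, fields, rows, n_chunks=10):
    """Insert rows batch by batch. Table and field names are interpolated into the
    statement (as in the code above), so they must come from a trusted source."""
    _sql = "INSERT INTO :table (:fields) VALUES (:values)"
    _sql = _sql.replace(":table", table).replace(":fields", ",".join(fields))
    _sql = _sql.replace(":values", ",".join("?" * len(fields)))  # sqlite3 uses qmark placeholders
    for idx in split_indexes(len(rows), n_chunks):
        if not idx:  # more chunks than rows yields empty chunks; skip them
            continue
        cursor = conn.cursor()
        cursor.executemany(_sql, [rows[i] for i in idx])
        cursor.close()
    conn.commit()

# Usage: 25 rows written in 10 batches to an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, name VARCHAR(256))")
rows = [(i, f"name-{i}") for i in range(25)]
write_chunks(conn, "demo", ["id", "name"], rows)
count = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
```

Only the row values go through placeholders here; the placeholder style (`?` versus `%s`) depends on the driver, which is why the writer above switches on the provider.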
@ -0,0 +1,18 @@
"""
This namespace/package wraps the SQL functionality for the following data stores:
- netezza, postgresql, mysql, sqlite and sqlserver
- mariadb, redshift (supported through aliases)
"""
from . import postgresql, mysql, netezza, sqlite, sqlserver
#
# Creating aliases to support additional data-store providers
#
mariadb = mysql
redshift = postgresql
sqlite3 = sqlite
# from transport import sql

@ -0,0 +1,129 @@
"""
This file encapsulates common operations associated with SQL databases via SQLAlchemy
"""
import sqlalchemy as sqa
import pandas as pd
class Base:
def __init__(self,**_args):
self._host = _args['host'] if 'host' in _args else 'localhost'
self._port = _args['port'] if 'port' in _args else None
self._database = _args['database'] if 'database' in _args else _args['path'] # sqlite callers may pass 'path' instead of 'database'
self._table = _args['table'] if 'table' in _args else None
self._engine= sqa.create_engine(self._get_uri(**_args),future=True)
def _set_uri(self,**_args) :
"""
:provider provider
:host host and port
:account account user/pwd
"""
_account = _args['account'] if 'account' in _args else None
_host = _args['host']
_provider = _args['provider'].replace(':','').replace('/','').strip()
def _get_uri(self,**_args):
"""
This function returns the formatted URI for the SQLAlchemy engine
"""
raise Exception ("Function Needs to be implemented ")
def meta (self,**_args):
"""
This function returns the schema (table definition) of a given table
:table optional name of the table (can be fully qualified)
"""
_table = self._table if 'table' not in _args else _args['table']
_schema = []
if _table :
if sqa.__version__.startswith('1.') :
_handler = sqa.MetaData(bind=self._engine)
_handler.reflect()
else:
#
# sqlalchemy's version 2.+
_handler = sqa.MetaData()
_handler.reflect(bind=self._engine)
#
# Let us extract the schema with the native types
_map = {'BIGINT':'INTEGER','TEXT':'STRING','DOUBLE_PRECISION':'FLOAT','NUMERIC':'FLOAT','DECIMAL':'FLOAT','REAL':'FLOAT'}
_schema = [{"name":_attr.name,"type":_map.get(str(_attr.type),str(_attr.type))} for _attr in _handler.tables[_table].columns]
return _schema
def has(self,**_args):
return self.meta(**_args)
def apply(self,sql):
"""
Executes a SQL statement that returns query results, hence the restriction to statements starting with select or with
:sql SQL query to be executed
@TODO: Execution of stored procedures
"""
_sql = sql.strip().lower() # tolerate leading whitespace/newlines in the statement
return pd.read_sql(sql,self._engine) if _sql.startswith('select') or _sql.startswith('with') else None
class SQLBase(Base):
def __init__(self,**_args):
super().__init__(**_args)
def get_provider(self):
raise Exception ("Provider Needs to be set ...")
def get_default_port(self) :
raise Exception ("default port needs to be set")
def _get_uri(self,**_args):
_host = self._host
_account = ''
if self._port :
_port = self._port
else:
_port = self.get_default_port()
_host = f'{_host}:{_port}'
if 'username' in _args :
_account = ''.join([_args['username'],':',_args['password'],'@'])
_database = self._database
_provider = self.get_provider().replace(':','').replace('/','')
# _uri = [f'{_provider}:/',_account,_host,_database]
# _uri = [_item.strip() for _item in _uri if _item.strip()]
# return '/'.join(_uri)
return f'{_provider}://{_host}/{_database}' if _account == '' else f'{_provider}://{_account}{_host}/{_database}'
class BaseReader(SQLBase):
def __init__(self,**_args):
super().__init__(**_args)
def read(self,**_args):
"""
This function will read a query or table from the specific database
"""
if 'sql' in _args :
sql = _args['sql']
else:
_table = _args['table'] if 'table' in _args else self._table
sql = f'SELECT * FROM {_table}'
return self.apply(sql)
class BaseWriter (SQLBase):
"""
This class implements SQLAlchemy support for writing to a data-store (RDBMS)
"""
def __init__(self,**_args):
super().__init__(**_args)
def write(self,_data,**_args):
if type(_data) == dict :
# a single record (dict of scalars) must be wrapped in a list to build a one-row data-frame
_df = pd.DataFrame([_data])
elif type(_data) == list :
_df = pd.DataFrame(_data)
else:
_df = _data.copy()
#
# We are assuming we have a data-frame at this point
#
_table = _args['table'] if 'table' in _args else self._table
_mode = {'chunksize':2000000,'if_exists':'append','index':False}
for key in ['if_exists','index','chunksize'] :
if key in _args :
_mode[key] = _args[key]
# if 'schema' in _args :
# _mode['schema'] = _args['schema']
# if 'if_exists' in _args :
# _mode['if_exists'] = _args['if_exists']
_df.to_sql(_table,self._engine,**_mode)

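The URI assembled by `SQLBase._get_uri` above follows the `provider://user:password@host:port/database` layout. The sketch below reproduces that layout as a standalone function so it can be checked without a database. `build_uri` is a hypothetical name, and the use of `quote_plus` on the credentials is an addition not present in the code above: it is assumed here because characters such as `@` or `/` in a password would otherwise break the URL.

```python
from urllib.parse import quote_plus

def build_uri(provider, host, port, database, username=None, password=None):
    """Format a database URL the way _get_uri does, with URL-quoted credentials."""
    # Strip stray ':' and '/' so that 'postgresql://' and 'postgresql' behave the same
    _provider = provider.replace(':', '').replace('/', '')
    _host = f'{host}:{port}'
    if username is None:
        return f'{_provider}://{_host}/{database}'
    # Assumption: credentials are quoted; the original concatenates them unchanged
    _account = f'{quote_plus(username)}:{quote_plus(password or "")}@'
    return f'{_provider}://{_account}{_host}/{database}'

anonymous = build_uri('postgresql', 'localhost', '5432', 'demo')
with_account = build_uri('mssql+pymssql://', 'db.internal', '1433', 'demo', 'scott', 'p@ss')
```

Host names and credentials in the usage lines are placeholders.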
@ -0,0 +1,18 @@
"""
This file implements support for mysql and mariadb (with the mysql+mysqlconnector driver)
"""
from transport.sql.common import BaseReader, BaseWriter
# import mysql.connector as my
class MYSQL:
def get_provider(self):
return "mysql+mysqlconnector"
def get_default_port(self):
return "3306"
class Reader(MYSQL,BaseReader) :
def __init__(self,**_args):
super().__init__(**_args)
class Writer(MYSQL,BaseWriter) :
def __init__(self,**_args):
super().__init__(**_args)

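Each provider module pairs a small class that only answers `get_provider` and `get_default_port` with the shared reader/writer base, and relies on Python's method resolution order to pick the provider's answers first. The following is a simplified sketch of that composition: `SketchBase` is a hypothetical stand-in for `transport.sql.common.SQLBase` that builds the URI string but creates no engine, and reading an optional `port` argument is an assumption of the sketch.

```python
class SketchBase:
    """Stand-in for the shared SQL base class (hypothetical, no engine is created)."""
    def __init__(self, **_args):
        self._host = _args.get('host', 'localhost')
        self._port = _args.get('port')          # assumption: callers may override the port
        self._database = _args['database']
        self.uri = self._get_uri()
    def get_provider(self):
        raise NotImplementedError("Provider needs to be set")
    def get_default_port(self):
        raise NotImplementedError("Default port needs to be set")
    def _get_uri(self):
        _port = self._port if self._port else self.get_default_port()
        return f"{self.get_provider()}://{self._host}:{_port}/{self._database}"

class MYSQL:
    """Provider mixin: it only supplies the driver name and the default port."""
    def get_provider(self):
        return "mysql+mysqlconnector"
    def get_default_port(self):
        return "3306"

class Reader(MYSQL, SketchBase):
    """MYSQL precedes SketchBase in the MRO, so its methods win over the base stubs."""
    pass

reader = Reader(database='demo')
custom = Reader(database='demo', host='db.internal', port='3307')
```

Listing the provider class first in the bases is what makes this work; with the order reversed, the base class stubs would be found first and raise.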
@ -0,0 +1,15 @@
import nzpy as nz
from transport.sql.common import BaseReader, BaseWriter
class Netezza:
def get_provider(self):
return 'netezza+nzpy'
def get_default_port(self):
return '5480'
class Reader(Netezza,BaseReader) :
def __init__(self,**_args):
super().__init__(**_args)
class Writer(Netezza,BaseWriter):
def __init__(self,**_args):
super().__init__(**_args)

@ -0,0 +1,22 @@
from transport.sql.common import BaseReader , BaseWriter
from psycopg2.extensions import register_adapter, AsIs
import numpy as np
register_adapter(np.int64, AsIs)
class PG:
def __init__(self,**_args):
super().__init__(**_args)
def get_provider(self):
return "postgresql"
def get_default_port(self):
return "5432"
class Reader(PG,BaseReader) :
def __init__(self,**_args):
super().__init__(**_args)
class Writer(PG,BaseWriter):
def __init__(self,**_args):
super().__init__(**_args)

@ -0,0 +1,25 @@
import sqlalchemy
import pandas as pd
from transport.sql.common import Base, BaseReader, BaseWriter
class SQLite (BaseReader):
def __init__(self,**_args):
super().__init__(**_args)
if 'path' in _args :
self._database = _args['path']
if 'database' in _args :
self._database = _args['database']
def _get_uri(self,**_args):
path = self._database
return f'sqlite:///{path}' # ensure this is the correct path for the sqlite file.
class Reader(SQLite,BaseReader):
def __init__(self,**_args):
super().__init__(**_args)
# def read(self,**_args):
# sql = _args['sql']
# return pd.read_sql(sql,self._engine)
class Writer (SQLite,BaseWriter):
def __init__(self,**_args):
super().__init__(**_args)

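The comment on `_get_uri` above asks to make sure the path is correct. With the `sqlite:///{path}` template, a relative path yields three slashes and an absolute POSIX path yields four, which is the form SQLAlchemy expects. A small sketch of that rule follows; `sqlite_uri` is a hypothetical helper, and falling back to an in-memory database when no path is given is an assumption, not behavior of the module above.

```python
def sqlite_uri(path=None):
    """Build an SQLAlchemy-style SQLite URL from a file path."""
    if not path:
        # Assumption: no path means an in-memory database
        return "sqlite://"
    return f"sqlite:///{path}"

relative = sqlite_uri("data.db")
absolute = sqlite_uri("/tmp/data.db")
memory = sqlite_uri()
```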
@ -0,0 +1,24 @@
"""
Handling Microsoft SQL Server via pymssql driver/connector
"""
import sqlalchemy
import pandas as pd
from transport.sql.common import Base, BaseReader, BaseWriter
class MsSQLServer:
def __init__(self,**_args) :
super().__init__(**_args)
pass
def get_provider(self):
# URI format: mssql+pymssql://scott:tiger@hostname:port/dbname
return "mssql+pymssql"
def get_default_port(self):
return "1433"
class Reader (MsSQLServer,BaseReader):
def __init__(self,**_args):
super().__init__(**_args)
class Writer (MsSQLServer,BaseWriter):
def __init__(self,**_args):
super().__init__(**_args)