Steve L. Nyemba
Introduction
This package is designed to generate synthetic data from an original dataset using deep learning techniques:
- Generative Adversarial Networks
- With "Earth mover's distance"
Installation
pip install git+https://hiplab.mc.vanderbilt.edu/git/aou/data-maker.git@release
Usage
After installing, the easiest way to get started is with pandas. The process has two steps: train, then generate.
Train the GAN on the original/raw dataset
import pandas as pd
import data.maker
df = pd.read_csv('sample.csv')
column = 'gender'
id = 'id'
context = 'demo'
data.maker.train(context=context,data=df,column=column,id=id,logs='logs')
The trainer stores its output on disk (for now), in a structured folder holding the trained models that are later used to generate the synthetic data.
Generate a candidate dataset from the learned features
import pandas as pd
import data.maker
df = pd.read_csv('sample.csv')
id = 'id'
column = 'gender'
context = 'demo'
data.maker.generate(context=context,data=df,id=id,column=column,logs='logs')
Limitations
GANs generate data on the assumption that the original data covers the entire value space needed:
- No new data will be created
  Assume a dataset has a gender attribute with values [M,F]. The synthetic data will not contain genders outside [M,F].
- Not advised on continuous values
  GANs work well on discrete values and are thus not advised for continuous ones, e.g. measurements (height, blood pressure, ...).
- For now, the package only operates on a single feature.
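The limitations above suggest a few checks worth running before training and after generating. The following is a minimal sketch using only the standard library; it is not part of data.maker. The helper names (`value_space`, `looks_continuous`, `unseen_values`) and the `max_levels` threshold are hypothetical, chosen only for illustration.

```python
from typing import Hashable, Iterable, Set


def value_space(values: Iterable[Hashable]) -> Set[Hashable]:
    """Return the distinct values observed in a column."""
    return set(values)


def looks_continuous(values: Iterable[Hashable], max_levels: int = 20) -> bool:
    """Heuristic: a column with many distinct values is probably continuous.

    max_levels is an arbitrary threshold assumed for this sketch.
    """
    return len(value_space(values)) > max_levels


def unseen_values(original: Iterable[Hashable],
                  synthetic: Iterable[Hashable]) -> Set[Hashable]:
    """Values present in the synthetic column but absent from the original."""
    return value_space(synthetic) - value_space(original)


# Toy data standing in for real columns (made up for this example).
original_gender = ["M", "F", "F", "M"]
synthetic_gender = ["F", "M", "M", "F"]
heights_cm = [150.0 + 0.5 * i for i in range(100)]

# A discrete column is a reasonable candidate; a measurement column is not.
print(looks_continuous(original_gender))
print(looks_continuous(heights_cm))

# Nothing outside the original value space is expected in the synthetic column.
print(unseen_values(original_gender, synthetic_gender))
```

Because a pandas Series is iterable, a column such as `df['gender']` can be passed to these helpers directly.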