Steve Nyemba 278d639fbf
bug fix ...
11 months ago

Introduction

This package generates synthetic data from an original dataset using deep learning techniques:

- Generative Adversarial Networks
- With "Earth mover's distance"
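
To give a feel for the "Earth mover's distance" named above, here is a minimal sketch for two one-dimensional samples of equal size, where the distance reduces to the mean absolute difference between the sorted values. It illustrates the metric only: the function name `earth_movers_distance_1d` and the sample values are hypothetical, and this is not the package's training code.

```python
from typing import Sequence


def earth_movers_distance_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """Earth mover's (Wasserstein-1) distance between two equal-size 1-D samples."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    # With equal sample sizes, the cheapest way to move one sample onto the
    # other pairs the values in sorted order.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


real = [0.0, 1.0, 3.0]
fake = [5.0, 6.0, 8.0]
print(earth_movers_distance_1d(real, fake))  # every point moves by 5 -> 5.0
print(earth_movers_distance_1d(real, real))  # identical samples -> 0.0
```

The smaller the distance, the closer the generated sample is to the real one.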

Installation

pip install git+https://hiplab.mc.vanderbilt.edu/git/aou/data-maker.git@release

Usage

After installing, the easiest way to get started is with pandas. The process is as follows:

Read about data-transport on GitHub or at healthcareio.the-phi.com/git/code/transport

Train the GAN on the original/raw dataset

  1. We define the data sources

The sources consist of a source, a target and a logger.

import pandas as pd
import data.maker
import transport
from transport import providers

For now, the trainer stores its output on disk in a structured folder. This folder holds the trained models that are later used to generate the synthetic data.

Generate a candidate dataset from the learned features

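import pandas as pd
import data.maker

# Load the original dataset
df = pd.read_csv('sample.csv')
id = 'id'           # identifier column (note: this name shadows the built-in id)
column = 'gender'   # the single feature to synthesize
context = 'demo'
# Generate a candidate dataset from the learned features
data.maker.generate(context=context, data=df, id=id, column=column, logs='logs')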
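
Once a candidate dataset exists, a quick sanity check is to compare how often each value of the synthesized column occurs in the original and in the candidate. The sketch below works on any iterable of values, so a pandas column such as `df['gender']` can be passed directly. The helpers `relative_frequencies` and `frequency_gap` and the sample values are hypothetical and not part of data.maker.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def relative_frequencies(values: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Share of each distinct value, e.g. {'M': 0.5, 'F': 0.5}."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}


def frequency_gap(original: Iterable[Hashable],
                  synthetic: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Synthetic share minus original share, for every value seen in either."""
    orig = relative_frequencies(original)
    synth = relative_frequencies(synthetic)
    return {v: synth.get(v, 0.0) - orig.get(v, 0.0) for v in set(orig) | set(synth)}


original = ["M", "F", "F", "M"]   # made-up original column
candidate = ["F", "F", "F", "M"]  # made-up synthetic column
gap = frequency_gap(original, candidate)
print(gap["F"], gap["M"])  # 0.25 -0.25
```

Gaps close to zero suggest the candidate preserves the original value distribution for that column.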

Limitations

GANs generate data under the assumption that the original data covers the whole value space needed:

  • No new data will be created

      Assume a dataset has a gender attribute with values [M,F].
    
      The synthetic data will not contain genders outside [M,F].
    
  • Not advised on continuous values

      GANs work well on discrete values and are therefore not advised for continuous ones,
      e.g. measurements (height, blood pressure, ...)
    
  • For now, the package only works on a single feature.
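
The first two limitations can be checked before and after a run. This is a minimal sketch under stated assumptions: `unseen_values`, `looks_discrete` and the `max_distinct` cut-off of 20 are hypothetical helpers and an arbitrary threshold, not part of the package.

```python
from typing import Hashable, Iterable, Set


def unseen_values(original: Iterable[Hashable],
                  synthetic: Iterable[Hashable]) -> Set[Hashable]:
    """Values present in the synthetic column but absent from the original."""
    return set(synthetic) - set(original)


def looks_discrete(values: Iterable[Hashable], max_distinct: int = 20) -> bool:
    """Rough heuristic: a column with few distinct values is treated as discrete."""
    return len(set(values)) <= max_distinct


gender = ["M", "F", "F", "M"]                      # made-up discrete column
height_cm = [150.0 + 0.5 * i for i in range(100)]  # made-up continuous column

print(unseen_values(gender, ["F", "M", "F"]))  # set()
print(looks_discrete(gender))     # True
print(looks_discrete(height_cm))  # False
```

A non-empty result from `unseen_values` would contradict the first limitation, and a `False` from `looks_discrete` flags a column that is probably a poor fit for this approach.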

Credits :