

Introduction

This package generates synthetic data from an original dataset using deep learning techniques:

- Generative Adversarial Networks
- With "Earth mover's distance"
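
For readers unfamiliar with the term, the earth mover's distance (also called the Wasserstein-1 distance) measures how much "mass" has to be moved, and how far, to turn one distribution into another. The sketch below only illustrates the metric on two equal-sized one-dimensional samples; the function name is hypothetical, and how the package applies the distance inside its GAN is not described in this README.

```python
def earth_movers_distance(xs, ys):
    """Earth mover's (Wasserstein-1) distance between two equal-sized 1-D samples.

    Illustrative helper only; it is not part of the data-maker API.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    # For equal-sized 1-D samples, the cheapest way to move the mass pairs
    # the i-th smallest value of one sample with the i-th smallest of the other.
    paired = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in paired) / len(xs)


if __name__ == "__main__":
    # Same values in a different order: nothing has to move.
    print(earth_movers_distance([1, 2, 3], [3, 1, 2]))
    # Every value shifted by 5: each unit of mass moves a distance of 5.
    print(earth_movers_distance([0, 1, 3], [5, 6, 8]))
```

A distance of zero means the two samples have identical distributions; larger values mean they are further apart.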

Installation

pip install git+https://hiplab.mc.vanderbilt.edu/git/aou/data-maker.git@release

Usage

After installing, the easiest way to get started is with pandas. The process has two steps:

  1. Train the GAN on the original/raw dataset

    import pandas as pd
    import data.maker

    # Load the original/raw dataset
    df = pd.read_csv('sample.csv')
    column = 'gender'
    id = 'id'
    context = 'demo'
    # Train the GAN on the chosen column
    data.maker.train(context=context,data=df,column=column,id=id,logs='logs')

The trainer stores its output on disk (for now) in a structured folder that holds the trained models later used to generate the synthetic data.

  2. Generate a candidate dataset from the learnt features

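    import pandas as pd
    import data.maker

    # Load the same original/raw dataset used for training
    df = pd.read_csv('sample.csv')
    id = 'id'
    column = 'gender'
    context = 'demo'
    # Generate a candidate synthetic dataset from the trained model
    data.maker.generate(data=df,id=id,column=column,logs='logs')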

Limitations

GANs generate data on the assumption that the original data already covers the entire value space needed:

  • No new data will be created

      Assume a dataset has a gender attribute with values [M,F].
      The synthetic data will not contain genders outside [M,F].
    
  • Not advised for continuous values

      GANs work well on discrete values and are therefore not advised for continuous ones,
      e.g. measurements (height, blood pressure, ...)
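
Both limitations can be checked before and after a run. The sketch below is a minimal illustration on plain Python lists, not part of the data-maker API: the helper names and the `max_ratio` threshold are assumptions made for this example. One helper lists synthetic values that never appeared in the original column, and the other is a rough heuristic that flags columns that look continuous.

```python
def unseen_values(original, synthetic):
    """Values present in the synthetic column but absent from the original.

    Hypothetical helper: given the first limitation, the result should be empty.
    """
    return set(synthetic) - set(original)


def looks_continuous(values, max_ratio=0.5):
    """Rough heuristic: a numeric column where most rows hold a distinct value.

    The max_ratio threshold is an arbitrary assumption; tune it to your data.
    """
    values = list(values)
    if not values:
        return False
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    distinct_ratio = len(set(values)) / len(values)
    return numeric and distinct_ratio > max_ratio


if __name__ == "__main__":
    gender = ["M", "F", "F", "M"]
    height = [1.62, 1.75, 1.80, 1.68]
    print(unseen_values(gender, ["F", "M", "M"]))  # empty: no value outside [M,F]
    print(looks_continuous(gender))  # False: discrete, a reasonable candidate
    print(looks_continuous(height))  # True: continuous, not advised
```

With pandas, a column can be passed as `df[column].tolist()` so that the values arrive as plain Python scalars.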
    

Credits :