	Introduction
This package generates synthetic data from an original dataset using deep learning techniques:
- Generative Adversarial Networks
- With "Earth mover's distance"
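To give a feel for the metric named above, here is a minimal sketch of the earth mover's distance (also known as the Wasserstein-1 distance) between two one-dimensional samples of equal size. It only illustrates the distance itself; how the package uses it during GAN training is not described here, and the function name `earth_movers_distance` is hypothetical, not part of the package.

```python
def earth_movers_distance(a, b):
    """1-D earth mover's (Wasserstein-1) distance between two equal-sized samples.

    Hypothetical helper for illustration only; not part of data.maker.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes samples of equal size")
    if len(a) == 0:
        raise ValueError("samples must be non-empty")
    # For equal-sized 1-D samples, the cheapest way to move one sample onto
    # the other pairs the i-th smallest value of each sample.
    pairs = zip(sorted(a), sorted(b))
    return sum(abs(x - y) for x, y in pairs) / len(a)


real = [0.0, 1.0, 2.0, 3.0]
shifted = [1.0, 2.0, 3.0, 4.0]
print(earth_movers_distance(real, real))     # 0.0, identical samples
print(earth_movers_distance(real, shifted))  # 1.0, every point moved by 1
```

Intuitively, the distance is the average amount each value has to be moved to turn one sample into the other, so it stays informative even when the two samples do not overlap.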
Installation
pip install git+https://hiplab.mc.vanderbilt.edu/git/aou/data-maker.git@release
Usage
After installing, the easiest way to get started is with pandas. The process has two steps:
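- Train the GAN on the original/raw dataset

      import pandas as pd
      import data.maker

      # Load the original dataset and name the columns to learn from
      df = pd.read_csv('myfile.csv')
      cols = ['f1','f2','f3']
      # Training artifacts are written to the 'logs' folder
      data.maker.train(data=df,cols=cols,logs='logs')

- Generate a candidate dataset from the learnt features

      import pandas as pd
      import data.maker

      # Reuse the artifacts written to 'logs' during training
      df = data.maker.generate(logs='logs')
      df.head()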
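Once a candidate dataset has been generated, one simple sanity check is to compare how often each value occurs in the original and in the synthetic data. The sketch below is not part of the package; `compare_marginals` is a hypothetical helper, and it assumes both datasets are pandas DataFrames that share the listed columns.

```python
import pandas as pd


def compare_marginals(original, synthetic, cols):
    """Return, per column, the relative frequency of each value in both frames.

    Hypothetical helper for illustration only; not part of data.maker.
    """
    report = {}
    for col in cols:
        freqs = pd.concat(
            [
                original[col].value_counts(normalize=True).rename("original"),
                synthetic[col].value_counts(normalize=True).rename("synthetic"),
            ],
            axis=1,
        ).fillna(0.0)  # a value missing from one frame has frequency 0 there
        freqs["abs_diff"] = (freqs["original"] - freqs["synthetic"]).abs()
        report[col] = freqs
    return report


# Toy frames standing in for the original and the generated data
original = pd.DataFrame({"f1": ["a", "a", "b", "b"]})
synthetic = pd.DataFrame({"f1": ["a", "a", "a", "b"]})
report = compare_marginals(original, synthetic, ["f1"])
print(report["f1"])
```

Large values in the `abs_diff` column point at attributes whose distribution the candidate dataset reproduces poorly.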
 
Limitations
GANs generate data on the assumption that the original data covers the entire value space needed:
- No new data will be created

  Suppose the dataset has a gender attribute with values [M,F]. The synthetic data will not contain genders outside [M,F].
- Not advised on continuous values

  GANs work well on discrete values and thus are not advised for synthesizing things like measurements (height, blood pressure, ...).
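Both limitations can be checked with a few lines of pandas before and after synthesis. The sketch below is an assumption-laden illustration, not package functionality: `split_columns` and `unseen_values` are hypothetical helpers, and the rule "non-float columns with at most `max_levels` distinct values are discrete" is a heuristic chosen for this example.

```python
import pandas as pd


def split_columns(df, max_levels=20):
    """Heuristically separate discrete columns from the rest.

    Hypothetical helper; the max_levels threshold is an arbitrary assumption.
    """
    discrete, skipped = [], []
    for col in df.columns:
        is_float = pd.api.types.is_float_dtype(df[col])
        few_levels = df[col].nunique(dropna=True) <= max_levels
        if few_levels and not is_float:
            discrete.append(col)   # candidate for synthesis
        else:
            skipped.append(col)    # measurements, identifiers, ...
    return discrete, skipped


def unseen_values(original, synthetic, cols):
    """Map each column to the synthetic values that never occur in the original."""
    return {
        col: set(synthetic[col].dropna()) - set(original[col].dropna())
        for col in cols
    }


df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "height": [1.62, 1.80, 1.75, 1.68],
})
discrete, skipped = split_columns(df)
print(discrete)  # ['gender']
print(skipped)   # ['height']
# Comparing a frame with itself: no value falls outside the original space
print(unseen_values(df, df, discrete))  # {'gender': set()}
```

The `discrete` list could then be passed as `cols` when training, and `unseen_values` should return only empty sets for a candidate dataset if the first limitation holds.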