From df47ed4cb2b7e1d05f86b20899244f2350bac05c Mon Sep 17 00:00:00 2001
From: Steve Nyemba
Date: Tue, 31 Dec 2019 23:34:04 -0600
Subject: [PATCH] documentation

---
 README.md | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/README.md b/README.md
index b42b1f7..f5c5e5d 100644
--- a/README.md
+++ b/README.md
@@ -12,32 +12,33 @@ This package is designed to generate synthetic data from a dataset from an origi
 ## Usage
 After installing the easiest way to get started is as follows (using pandas). The process is as follows:
 
-1. Train the GAN on the original/raw dataset
+**Train the GAN on the original/raw dataset**
 
 
-import pandas as pd
-import data.maker
-df = pd.read_csv('sample.csv')
-column = 'gender'
-id = 'id'
-context = 'demo'
-data.maker.train(context=context,data=df,column=column,id=id,logs='logs')
+    import pandas as pd
+    import data.maker
+
+    df = pd.read_csv('sample.csv')
+    column = 'gender'
+    id = 'id'
+    context = 'demo'
+    data.maker.train(context=context,data=df,column=column,id=id,logs='logs')
 
 
 The trainer will store the data on disk (for now) in a structured folder that will hold training models that will be used to generate the synthetic data.
 
 
-2. Generate a candidate dataset from the learnt features
+**Generate a candidate dataset from the learned features**
 
 
-import pandas as pd
-import data.maker
+    import pandas as pd
+    import data.maker
 
-df = pd.read_csv('sample.csv')
-id = 'id'
-column = 'gender'
-context = 'demo'
-data.maker.generate(data=df,id=id,column=column,logs='logs')
+    df = pd.read_csv('sample.csv')
+    id = 'id'
+    column = 'gender'
+    context = 'demo'
+    data.maker.generate(data=df,id=id,column=column,logs='logs')
 
 
 ## Limitations
@@ -46,11 +47,14 @@ GANS will generate data assuming the original data has all the value space neede
 
 - No new data will be created
 
     Assuming we have a dataset with an gender attribute with values [M,F].
+
     The synthetic data will not be able to generate genders outside [M,F]
+
 - Not advised on continuous values
 
     GANS work well on discrete values and thus are not advised to be used.
     e.g:measurements (height, blood pressure, ...)
+- For now will only perform on a single feature.
 
 ## Credits :