"""
(c) 2019, Health Information Privacy Lab
Brad. Malin, Weiyi Xia, Steve L. Nyemba

# Re-Identification Risk

This framework computes the re-identification risk of a dataset by extending pandas; it behaves like a pandas **add-on**, assuming the data being shared can be loaded into a pandas dataframe.

The framework computes the following risk measures: marketer, prosecutor, journalist, and Pitman risk.

References for the risk measures can be found at:

- http://www.ehealthinformation.ca/wp-content/uploads/2014/08/2009-De-identification-PA-whitepaper1.pdf
- https://www.scb.se/contentassets/ff271eeeca694f47ae99b942de61df83/applying-pitmans-sampling-formula-to-microdata-disclosure-risk-assessment.pdf

This framework integrates pandas (for now) as an extension and can be used in two modes:

**explore:**

Here the assumption is that we are not sure which attributes should be disclosed; the framework randomly generates combinations of attributes and evaluates each of them, providing all the risk measures for every combination it comes up with.

**evaluation:**

Here the assumption is that we are clear on the set of attributes to be used, and we are interested in computing the associated risk.

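Below is a rough sketch of the explore idea, written against only the documented risk.evaluate() accessor; the helper name, the number of runs, and the sampling loop are illustrative and are not the framework's actual implementation.

import random
from pandas_risk import *

def explore_combinations(df, runs=10):
    #
    # Draw random combinations of attributes and compute the risk measures
    # for each one, which is conceptually what the explore mode automates.
    #
    results = []
    columns = list(df.columns)
    for _ in range(runs):
        size = random.randint(2, len(columns))
        subset = random.sample(columns, size)
        results.append(df[subset].risk.evaluate())
    return results
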
### Four risk measures are computed:

- Marketer risk
- Prosecutor risk
- Journalist risk
- Pitman risk

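For orientation, here is a minimal sketch of the textbook marketer and prosecutor measures, computed from equivalence-class sizes over a chosen set of quasi-identifier columns when the dataset is evaluated against itself (sample treated as the population). This illustrates the standard definitions and is not the framework's internal code; journalist risk additionally relies on population class sizes (hence the pop dataframe in the example below), and Pitman risk estimates population uniqueness from the sample using Pitman's sampling formula (see the reference above).

import pandas as pd

def marketer_prosecutor(df, columns):
    #
    # f holds the size of each equivalence class, i.e. each group of rows that
    # share the same values on the chosen quasi-identifier columns.
    #
    f = df.groupby(columns).size()
    marketer = float(len(f)) / df.shape[0]  # expected fraction of records correctly re-identified
    prosecutor = 1.0 / f.min()              # re-identification probability of the most distinguishable record
    return marketer, prosecutor
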
### Usage:

Install this package using pip as follows:

Stable:

pip install git+https://hiplab.mc.vanderbilt.edu/git/steve/deid-risk.git

Latest development (not fully tested):

pip install git+https://hiplab.mc.vanderbilt.edu/git/steve/deid-risk.git@risk

The framework depends on pandas and numpy (for now). Below is a basic sample to get started quickly.

import numpy as np
import pandas as pd
from pandas_risk import *

#
# Build a small synthetic sample with four attributes
#
mydf = pd.DataFrame({"x":np.random.choice(np.random.randint(1,10),50),"y":np.random.choice(np.random.randint(1,10),50),"z":np.random.choice(np.random.randint(1,10),50),"r":np.random.choice(np.random.randint(1,10),50)})
print(mydf.risk.evaluate())

#
# Evaluate the sample against a larger (here, synthetic) population:
# - Ensure the population size is much greater than the sample size
# - Ensure the fields are identical in both sample and population
#
pop = pd.DataFrame({"x":np.random.choice(np.random.randint(1,10),150),"y":np.random.choice(np.random.randint(1,10),150),"z":np.random.choice(np.random.randint(1,10),150),"r":np.random.choice(np.random.randint(1,10),150)})
mydf.risk.evaluate(pop=pop)

@TODO:

- Evaluate how sparse attributes are (the ratio of non-null values over rows); a quick way to compute this is sketched below
- Have a smart way to drop attributes (based on the above, in the random policy search)

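A possible one-liner for the sparsity check mentioned in the TODO above, using plain pandas on the mydf sample from the example (not yet part of the framework):

sparsity = mydf.notnull().mean()  # ratio of non-null values per attribute; 1.0 means fully populated
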
Basic examples that illustrate usage of the framework are in the notebook folder. The example is derived from

"""
|
|
|
|
"""
|
|
|
|
from risk import risk
|
|
|
|
from risk import risk
|
|
|
|