privacykit/notebooks/data-analysis.ipynb


<html lang="en"> <head> </head>
In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('../src/out/risk_xoi.csv')
In [2]:
df.head()
Out[2]:
   Unnamed: 0  group_count  row_count  marketer  prosecutor  field_count
0           0       432512   79080802  0.005469           1           10
1           0     17824004   79080802  0.225390           1           28
2           0     43538084   79080802  0.550552           1           38
3           0     64042788   79080802  0.809840           1           46
4           0      6866070   79080802  0.086823           1           17
In [3]:
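# Average both risk scores over all runs that share the same number of attributes
compiled = df.groupby('field_count')[['marketer','prosecutor']].mean()
# Plot the averaged risks against field_count and keep the figure for the report
figure = compiled.plot.line().get_figure()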

Dataset Used


We joined all the tables from all-of-us, truncated the records, and randomly selected one record per join. The resulting dataset has roughly 80 million records and about 5000 distinct patients.

Experiment Design


We compute both the marketer risk and the prosecutor risk over a randomly selected subset of the 111 available attributes. Each selection draws between 2 and 111 attributes, but at most 64 attributes can be evaluated in a single computation because of a limitation of Google's BigQuery. We performed 500 runs.
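A single run of this design can be sketched in pandas as follows. This is a minimal illustration, not the BigQuery code used for the experiment. It assumes the usual definitions: marketer risk is the number of distinct groups divided by the number of rows, and prosecutor risk is 1 divided by the size of the smallest group. The names `pick_fields`, `risk`, `MAX_FIELDS` and the toy columns are hypothetical, and capping the random draw at 64 is only one possible reading of how the 2 to 111 range and the 64-attribute limit interact.

```python
import numpy as np
import pandas as pd

MAX_FIELDS = 64  # assumed per-computation cap, taken from the BigQuery limit in the text


def pick_fields(columns, rng, max_fields=MAX_FIELDS):
    """Randomly choose between 2 and min(len(columns), max_fields) column names."""
    upper = min(len(columns), max_fields)
    k = int(rng.integers(2, upper + 1))  # upper bound is exclusive, hence + 1
    return [str(c) for c in rng.choice(columns, size=k, replace=False)]


def risk(frame, fields):
    """Return group count, row count, marketer and prosecutor risk for `fields`."""
    # Size of every group of rows that share the same values on `fields`
    sizes = frame.groupby(fields, dropna=False).size()
    return {
        "field_count": len(fields),
        "group_count": len(sizes),
        "row_count": len(frame),
        "marketer": len(sizes) / len(frame),   # assumed: groups / rows
        "prosecutor": 1 / int(sizes.min()),    # assumed: 1 / smallest group
    }


# Toy data standing in for the real table (hypothetical columns and values)
toy = pd.DataFrame({
    "zip": ["37203", "37203", "37212", "37212", "37212", "37215"],
    "gender": ["F", "F", "M", "M", "F", "M"],
    "birth_year": [1980, 1980, 1975, 1975, 1990, 1962],
})

rng = np.random.default_rng(0)
runs = pd.DataFrame([risk(toy, pick_fields(list(toy.columns), rng)) for _ in range(5)])
print(runs)

fixed = risk(toy, ["zip", "gender"])
print(fixed)
```

Collecting one such dictionary per run yields a table with the same columns as risk_xoi.csv, which is what the groupby on field_count then averages.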

Results


The prosecutor risk does not change across runs, perhaps as an artifact of the number of runs (500) or of the dataset curation, that is, the joins we performed. A prosecutor risk of 1 indicates that at least one record is vulnerable.
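The five rows displayed by df.head() allow a quick sanity check of both risk columns. The sketch below rebuilds those rows by hand and compares the stored marketer value with group_count / row_count. That relationship is an inference from the displayed numbers rather than something the notebook states. The name `head` is hypothetical; on the full data the same expressions would be applied to `df`.

```python
import pandas as pd

# The five rows displayed by df.head()
head = pd.DataFrame({
    "group_count": [432512, 17824004, 43538084, 64042788, 6866070],
    "row_count": [79080802] * 5,
    "marketer": [0.005469, 0.225390, 0.550552, 0.809840, 0.086823],
    "prosecutor": [1, 1, 1, 1, 1],
    "field_count": [10, 28, 38, 46, 17],
})

# Marketer risk recomputed as groups / rows, compared with the stored column
recomputed = head["group_count"] / head["row_count"]
max_diff = (recomputed - head["marketer"]).abs().max()
print(max_diff)  # expected to be below the 6-decimal rounding of the display

# Number of distinct prosecutor values; a single value means the risk is constant
distinct_prosecutor = head["prosecutor"].nunique()
print(distinct_prosecutor)
```

Under the assumed definition of prosecutor risk as 1 divided by the smallest group size, a constant value of 1 would mean every run contained at least one group with a single record.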

As a general trend, the marketer risk increases with the number of randomly selected attributes.

{{ figure }}
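One way to put a number on this trend is a rank correlation between field_count and marketer. The sketch below is an illustration only: it uses the five rows shown by df.head() rather than the 500 runs, so its value says nothing about the full result. The name `sample` is hypothetical. Spearman's coefficient is computed here as the Pearson correlation of the ranks, which avoids any dependency beyond pandas.

```python
import pandas as pd

# Rows shown by df.head(); the full risk_xoi.csv would be used in practice
sample = pd.DataFrame({
    "field_count": [10, 28, 38, 46, 17],
    "marketer": [0.005469, 0.225390, 0.550552, 0.809840, 0.086823],
})

# Spearman correlation = Pearson correlation of the ranks
rho = sample["field_count"].rank().corr(sample["marketer"].rank())
print(rho)
```

A value close to 1 indicates that marketer risk rises monotonically with the number of attributes in the rows considered.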

In [ ]:

</html>