import pandas as pd

# Load the per-run risk measurements produced by the experiment.
df = pd.read_csv('../src/out/risk_xoi.csv')
df.head()

# Average the risk measures across runs for each attribute count,
# then plot both curves against field_count.
compiled = df.groupby('field_count')[['marketer', 'prosecutor']].mean()
figure = compiled.plot.line().get_figure()
Dataset Used
We performed joins against all the tables from all-of-us and truncated the records by randomly selecting one record per join. The resulting dataset has roughly 80 million records covering about 5,000 distinct patients.
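The join-and-truncate step itself is not shown in this notebook; the sketch below is one minimal pandas interpretation of it, in which the tables `left` and `right`, the `person_id` key, and the helper name `join_and_truncate` are all hypothetical stand-ins for the actual all-of-us tables, and the fan-out of each join is truncated by keeping a single randomly chosen match per left-hand row.

import pandas as pd

def join_and_truncate(left: pd.DataFrame, right: pd.DataFrame,
                      key: str = 'person_id', seed: int = 0) -> pd.DataFrame:
    """Join two tables, then keep one randomly chosen match per left row.

    A raw join multiplies rows wherever `key` repeats on the right side;
    sampling one match per original left-hand row truncates that fan-out.
    """
    # Materialize the original left-row identity as a column.
    left = left.reset_index(drop=True).rename_axis('_row').reset_index()
    joined = left.merge(right, on=key, how='inner')
    # Draw one uniformly random match per original left-hand row.
    sampled = joined.groupby('_row', group_keys=False).sample(n=1, random_state=seed)
    return sampled.drop(columns='_row')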
Experiment Design
We compute both marketer and prosecutor risk while randomly selecting a subset of the 111 available attributes; the subset size ranges from 2 to 111. The maximum number of attributes that can be computed at any one time is 64, a limitation of Google's BigQuery. We performed 500 runs.
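The risk computation itself runs in BigQuery and is not reproduced in this notebook; the following is a minimal pandas sketch of the experiment loop, assuming the standard definitions: records sharing the same values on the selected attributes form an equivalence class, marketer risk is the number of classes divided by the number of records (the expected fraction of correct re-identifications), and prosecutor risk is one over the smallest class size (the worst-case re-identification probability). The names `reid_risks` and `run_experiment` are hypothetical; the output columns match the risk_xoi.csv file loaded above.

import random
import pandas as pd

def reid_risks(df: pd.DataFrame, attrs: list) -> tuple:
    """Return (marketer, prosecutor) risk for one quasi-identifier set."""
    class_sizes = df.groupby(attrs).size()   # equivalence-class sizes f_k
    marketer = len(class_sizes) / len(df)    # number of classes / N
    prosecutor = 1.0 / class_sizes.min()     # 1 / smallest class size
    return marketer, prosecutor

def run_experiment(df, all_attrs, runs=500, lo=2, hi=64, seed=0):
    """Repeat the risk computation over random attribute subsets."""
    rng = random.Random(seed)
    rows = []
    for _ in range(runs):
        k = rng.randint(lo, hi)            # subset size, capped at 64 (BigQuery limit)
        attrs = rng.sample(all_attrs, k)   # random subset of the 111 attributes
        marketer, prosecutor = reid_risks(df, attrs)
        rows.append({'field_count': k, 'marketer': marketer,
                     'prosecutor': prosecutor})
    return pd.DataFrame(rows)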
Results
The results show that the prosecutor risk is unchanging, perhaps as an artifact of the number of runs (500) or of the dataset curation, i.e. the joins we performed. Because prosecutor risk reflects the worst-case record, this indicates there is at least one vulnerable record regardless of which attributes are selected.
As a general trend, the marketer risk increases with the number of randomly selected attributes, which is consistent with additional attributes splitting the records into smaller, more numerous equivalence classes.
{{ figure }}