44 KiB

Raw Blame History

None <html lang="en"> <head> </head>

In [1]:

import pandas as pd
import numpy as np
df = pd.read_csv('../src/out/risk_xoi.csv')

In [2]:

df.head()

Out[2]:

	group_count	row_count	marketer	prosecutor	field_count
0	432512	79080802	0.005469	1	10
1	17824004	79080802	0.225390	1	28
2	43538084	79080802	0.550552	1	38
3	64042788	79080802	0.809840	1	46
4	6866070	79080802	0.086823	1	17
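The `marketer` values shown above are consistent with `group_count / row_count`, i.e. the common definition of marketer risk as the number of distinct attribute combinations divided by the number of records. This reading is inferred from the displayed rows, not from the code that produced the CSV, so treat it as an assumption. A minimal check on the five rows, with `marketer_from_counts` as a hypothetical helper name:

```python
# (group_count, row_count, marketer) copied from the df.head() output above
rows = [
    (432512, 79080802, 0.005469),
    (17824004, 79080802, 0.225390),
    (43538084, 79080802, 0.550552),
    (64042788, 79080802, 0.809840),
    (6866070, 79080802, 0.086823),
]


def marketer_from_counts(group_count, row_count):
    """Marketer risk under the assumed definition: groups / records."""
    return group_count / row_count


# Largest gap between the recomputed ratio and the value stored in the file.
# The stored values are printed with six decimals, so the gap should be
# no larger than the rounding error.
max_abs_diff = max(abs(marketer_from_counts(g, n) - m) for g, n, m in rows)
print(max_abs_diff)
```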

In [3]:

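# Average both risk measures over all runs that used the same number of attributes.
compiled = df.groupby('field_count')[['marketer','prosecutor']].mean()
# Plot the averaged risks against the number of attributes (the index).
figure = compiled[['marketer','prosecutor']].plot.line().get_figure()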

Dataset Used¶

We joined all the tables from all-of-us, truncated the records, and randomly selected one record per join. The resulting dataset has roughly 80 million records and about 5000 distinct patients.

Experiment Design¶

In each run we compute both the marketer risk and the prosecutor risk on a randomly selected subset of the 111 available attributes. The subset size is drawn between 2 and 111 attributes, but at most 64 attributes can be used in a single computation because of a Google BigQuery limitation. We performed 500 runs.
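The design above can be sketched on a small in-memory table. This is an illustration only, not the BigQuery implementation behind `risk_xoi.csv`: it assumes the common definitions (marketer risk = number of equivalence classes / number of records, prosecutor risk = 1 / size of the smallest equivalence class), and the function names, the toy columns, and the `max_fields` parameter are all hypothetical.

```python
import numpy as np
import pandas as pd


def risk_for_attributes(frame, attributes):
    """Return (marketer, prosecutor) risk for one set of attributes."""
    # Size of each equivalence class: records sharing the same values
    # on the selected attributes (rows with missing keys are dropped).
    class_sizes = frame.groupby(list(attributes)).size()
    marketer = len(class_sizes) / len(frame)
    prosecutor = 1.0 / class_sizes.min()
    return marketer, prosecutor


def run_experiment(frame, runs, max_fields=64, seed=0):
    """Repeat the risk computation on random attribute subsets."""
    rng = np.random.default_rng(seed)
    columns = list(frame.columns)
    upper = min(max_fields, len(columns))  # cap on attributes per run
    results = []
    for _ in range(runs):
        k = int(rng.integers(2, upper + 1))  # subset size in [2, upper]
        picked = rng.choice(len(columns), size=k, replace=False)
        chosen = [columns[i] for i in picked]
        marketer, prosecutor = risk_for_attributes(frame, chosen)
        results.append(
            {"field_count": k, "marketer": marketer, "prosecutor": prosecutor}
        )
    return pd.DataFrame(results)


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "F", "M"],
    "zip": ["37201", "37201", "37201", "37203", "37203", "37203"],
    "age_band": ["30-39", "30-39", "40-49", "40-49", "30-39", "30-39"],
})

print(risk_for_attributes(toy, ["sex", "zip"]))
print(run_experiment(toy, runs=5))
```

Under these definitions a single record that is unique on the chosen attributes is enough to push the prosecutor risk to 1, while the marketer risk grows more gradually as more combinations become distinct.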

Results¶

The prosecutor risk does not change across runs. This may be an artifact of the number of runs (500) or of the dataset curation, i.e. the joins we performed. The prosecutor risk shows that there is at least one record that is vulnerable.

As a general trend, the marketer risk increases with the number of randomly selected attributes.
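One way to put a number on this trend is a rank correlation between `field_count` and `marketer`. The sketch below is an assumption about how one might check it, not part of the original analysis; `rank_correlation` is a hypothetical helper, and the five rows are the ones shown by `df.head()`, far too few to stand in for the 500 runs.

```python
import pandas as pd


def rank_correlation(frame, x="field_count", y="marketer"):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return frame[x].rank().corr(frame[y].rank())


# The five rows displayed by df.head(); on the real data, pass `df` instead.
head = pd.DataFrame({
    "field_count": [10, 28, 38, 46, 17],
    "marketer": [0.005469, 0.225390, 0.550552, 0.809840, 0.086823],
})

rho = rank_correlation(head)
print(rho)
```

A value close to 1 indicates that the marketer risk rises monotonically with the number of attributes. In the five rows above the two columns are ordered identically, so the ranks coincide.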


</html>
