215 lines
44 KiB
Plaintext
6 years ago
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "df = pd.read_csv('../src/out/risk_xoi.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       " .dataframe tbody tr th:only-of-type {\n",
       " vertical-align: middle;\n",
       " }\n",
       "\n",
       " .dataframe tbody tr th {\n",
       " vertical-align: top;\n",
       " }\n",
       "\n",
       " .dataframe thead th {\n",
       " text-align: right;\n",
       " }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       " <thead>\n",
       " <tr style=\"text-align: right;\">\n",
       " <th></th>\n",
       " <th>Unnamed: 0</th>\n",
       " <th>group_count</th>\n",
       " <th>row_count</th>\n",
       " <th>marketer</th>\n",
       " <th>prosecutor</th>\n",
       " <th>field_count</th>\n",
       " </tr>\n",
       " </thead>\n",
       " <tbody>\n",
       " <tr>\n",
       " <th>0</th>\n",
       " <td>0</td>\n",
       " <td>432512</td>\n",
       " <td>79080802</td>\n",
       " <td>0.005469</td>\n",
       " <td>1</td>\n",
       " <td>10</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>1</th>\n",
       " <td>0</td>\n",
       " <td>17824004</td>\n",
       " <td>79080802</td>\n",
       " <td>0.225390</td>\n",
       " <td>1</td>\n",
       " <td>28</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>2</th>\n",
       " <td>0</td>\n",
       " <td>43538084</td>\n",
       " <td>79080802</td>\n",
       " <td>0.550552</td>\n",
       " <td>1</td>\n",
       " <td>38</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>3</th>\n",
       " <td>0</td>\n",
       " <td>64042788</td>\n",
       " <td>79080802</td>\n",
       " <td>0.809840</td>\n",
       " <td>1</td>\n",
       " <td>46</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>4</th>\n",
       " <td>0</td>\n",
       " <td>6866070</td>\n",
       " <td>79080802</td>\n",
       " <td>0.086823</td>\n",
       " <td>1</td>\n",
       " <td>17</td>\n",
       " </tr>\n",
       " </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Unnamed: 0  group_count  row_count  marketer  prosecutor  field_count\n",
       "0           0       432512   79080802  0.005469           1           10\n",
       "1           0     17824004   79080802  0.225390           1           28\n",
       "2           0     43538084   79080802  0.550552           1           38\n",
       "3           0     64042788   79080802  0.809840           1           46\n",
       "4           0      6866070   79080802  0.086823           1           17"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "compiled = df.groupby('field_count')[['field_count','marketer','prosecutor']].mean()\n",
    "figure = compiled[['marketer','prosecutor']].plot.line().get_figure()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "variables": {
     " figure ": "<img src=\"
    }
   },
   "source": [
    "# Dataset Used\n",
    "---\n",
    "\n",
    "We performed joins against all the tables from all-of-us and truncated the records, randomly selecting one record for every join. The result is a dataset of roughly **80 million** records and about **5000** distinct patients.\n",
    "\n",
    "## Experiment Design\n",
    "---\n",
    "\n",
    "We compute both the marketer risk and the prosecutor risk while randomly selecting a number of attributes out of **111**. The selection ranges between **2** and **111** attributes, but the maximum number of attributes that can be computed at any one time is **64**, a limitation of Google's BigQuery. We performed **500** runs.\n",
    "\n",
    "## Results\n",
    "---\n",
    "\n",
    "The results show that the prosecutor risk is unchanging, perhaps as an artifact of the number of runs (**500**) or of the dataset curation (the joins we performed). The prosecutor risk shows that there is at least one record that is vulnerable.\n",
    "\n",
    "As a general trend, the marketer risk seems to increase as the number of randomly selected attributes increases. In the rows shown by `df.head()`, the marketer value matches `group_count / row_count` (for example, 432512 / 79080802 ≈ 0.005469).\n",
    "\n",
    "{{ figure }} \n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",