privacykit/notebooks/data-analysis.ipynb

{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"df = pd.read_csv('../src/out/risk_xoi.csv')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>group_count</th>\n",
" <th>row_count</th>\n",
" <th>marketer</th>\n",
" <th>prosecutor</th>\n",
" <th>field_count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>432512</td>\n",
" <td>79080802</td>\n",
" <td>0.005469</td>\n",
" <td>1</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>17824004</td>\n",
" <td>79080802</td>\n",
" <td>0.225390</td>\n",
" <td>1</td>\n",
" <td>28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>43538084</td>\n",
" <td>79080802</td>\n",
" <td>0.550552</td>\n",
" <td>1</td>\n",
" <td>38</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>64042788</td>\n",
" <td>79080802</td>\n",
" <td>0.809840</td>\n",
" <td>1</td>\n",
" <td>46</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>6866070</td>\n",
" <td>79080802</td>\n",
" <td>0.086823</td>\n",
" <td>1</td>\n",
" <td>17</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 group_count row_count marketer prosecutor field_count\n",
"0 0 432512 79080802 0.005469 1 10\n",
"1 0 17824004 79080802 0.225390 1 28\n",
"2 0 43538084 79080802 0.550552 1 38\n",
"3 0 64042788 79080802 0.809840 1 46\n",
"4 0 6866070 79080802 0.086823 1 17"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"compiled = df.groupby('field_count')[['field_count','marketer','prosecutor']].mean()\n",
"figure = compiled[['marketer','prosecutor']].plot.line().get_figure()"
]
},
{
"cell_type": "markdown",
"metadata": {
"variables": {
" figure ": "<img src=\"
}
},
"source": [
"# Dataset Used\n",
"---\n",
"\n",
"We performed joins against all the tables from all-of-us and truncated records while randomly selecting on record per every join. As a result we have roughly a dataset of about **80 million** records and about **5000** distinct patients.\n",
"\n",
"## Expriment Design\n",
"---\n",
"\n",
"We compute both marketer and prosecutor risk computation while randomly selecting the number of attributes out of **111**. This selection is between ***2*** and **111** attributes. The number of maximum number of attributes that can be computed at any time is **64** : limitations of Google's Big-query. We performed **500** runs.\n",
"\n",
"## Results\n",
"---\n",
"\n",
"The results show the prosecutor risk is unchanging perhaps as an artifact of the number of runs **500** or the dataset curation: The joins we performed. The prosecutor risk shows there is at least one record that vulnerable.\n",
"\n",
"The marketer risk seems to increase as the number of randomly selected attributes increases as a general trend. \n",
"\n",
"{{ figure }} \n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}