unstructured

Форк
0
/
2-File Exploration.ipynb 
393 строки · 34.5 Кб
1
{
2
 "cells": [
3
  {
4
   "cell_type": "markdown",
5
   "id": "9cf9f373",
6
   "metadata": {},
7
   "source": [
8
    "# File Exploration\n",
9
    "\n",
10
    "In addition to core document processing capabilities, the `unstructured` library includes utilities for summarizing information about raw documents. We will cover how to use these utilities in this notebook. At the conclusion of this notebook, you should understand:\n",
11
    "\n",
12
    "- [Filetype detection in `unstructured`](#filetype)\n",
13
    "- [How to generate summary statistics about documents](#summary)"
14
   ]
15
  },
16
  {
17
   "cell_type": "code",
18
   "execution_count": 1,
19
   "id": "59392a21",
20
   "metadata": {},
21
   "outputs": [],
22
   "source": [
23
    "import os\n",
24
    "import pathlib\n",
25
    "\n",
26
    "DIRECTORY = os.path.abspath(\"\")\n",
27
    "EXAMPLE_DOCS_DIRECTORY = os.path.join(DIRECTORY, \"..\", \"..\", \"example-docs\")"
28
   ]
29
  },
30
  {
31
   "cell_type": "markdown",
32
   "id": "9a6ce38f",
33
   "metadata": {},
34
   "source": [
35
    "## Filetype Detection <a id=\"filetype\"></a>\n",
36
    "\n",
37
    "The `unstructured` library includes a `detect_filetype` function that helps detect the type of an input file. To use the filetype detection utilities, you will need to install the `libmagic` library because `unstructured` uses this library under the hood for filetype detection. In addition to the MIME type from `libmagic`, the `unstructured` library uses the file extension and in some cases inspect the document to determine the document type. The following is an example of how to call `detect_filetype`."
38
   ]
39
  },
40
  {
41
   "cell_type": "code",
42
   "execution_count": 2,
43
   "id": "c6bd2f4a",
44
   "metadata": {},
45
   "outputs": [
46
    {
47
     "data": {
48
      "text/plain": [
49
       "<FileType.HTML: 50>"
50
      ]
51
     },
52
     "execution_count": 2,
53
     "metadata": {},
54
     "output_type": "execute_result"
55
    }
56
   ],
57
   "source": [
58
    "from unstructured.file_utils.filetype import detect_filetype\n",
59
    "\n",
60
    "filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, \"example-10k.html\")\n",
61
    "detect_filetype(filename=filename)"
62
   ]
63
  },
64
  {
65
   "cell_type": "markdown",
66
   "id": "c68e6a41",
67
   "metadata": {},
68
   "source": [
69
    "The output of `detect_filetype` is a `FileType` enum, which is defined in [`filetype.py`](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/file_utils/filetype.py). Check out the source file to see the full list of files that are supported by `detect_filetype`."
70
   ]
71
  },
72
  {
73
   "cell_type": "markdown",
74
   "id": "8b7dc938",
75
   "metadata": {},
76
   "source": [
77
    "## Summary Statistics <a id=\"summary\"></a>\n",
78
    "\n",
79
    "`unstructured` also provides utilities for summarizing the filetypes in a directory. You can use this utility for tasks such as counting by filetype and checking the average size of a file by filetype. The following example shows how to find a count of files by filetype in a directory and plot the results as a histogram."
80
   ]
81
  },
82
  {
83
   "cell_type": "code",
84
   "execution_count": 3,
85
   "id": "c53f054e",
86
   "metadata": {},
87
   "outputs": [
88
    {
89
     "name": "stderr",
90
     "output_type": "stream",
91
     "text": [
92
      "MIME type was message/rfc822. This file type is not currently supported in unstructured.\n"
93
     ]
94
    },
95
    {
96
     "data": {
97
      "text/plain": [
98
       "FileType.EML     4\n",
99
       "FileType.TXT     3\n",
100
       "FileType.HTML    2\n",
101
       "FileType.XML     2\n",
102
       "FileType.PDF     2\n",
103
       "FileType.JPG     2\n",
104
       "FileType.UNK     1\n",
105
       "FileType.DOCX    1\n",
106
       "FileType.PPTX    1\n",
107
       "FileType.XLSX    1\n",
108
       "Name: filetype, dtype: int64"
109
      ]
110
     },
111
     "execution_count": 3,
112
     "metadata": {},
113
     "output_type": "execute_result"
114
    }
115
   ],
116
   "source": [
117
    "from unstructured.file_utils.exploration import get_directory_file_info\n",
118
    "\n",
119
    "file_info = get_directory_file_info(EXAMPLE_DOCS_DIRECTORY)\n",
120
    "file_info.filetype.value_counts()"
121
   ]
122
  },
123
  {
124
   "cell_type": "code",
125
   "execution_count": 4,
126
   "id": "7e1b3300",
127
   "metadata": {},
128
   "outputs": [
129
    {
130
     "data": {
131
      "text/plain": [
132
       "<AxesSubplot: >"
133
      ]
134
     },
135
     "execution_count": 4,
136
     "metadata": {},
137
     "output_type": "execute_result"
138
    },
139
    {
140
     "data": {
141
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAiMAAAHzCAYAAADy/B0DAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/P9b71AAAACXBIWXMAAA9hAAAPYQGoP6dpAABDnElEQVR4nO3dd3gVZf7//9cJ5QAhCUYgoQQE6b1YCKIUI4hZMKt4IbB0WAt8hEVRUJRVhLgqgi5IUQGBpahL8cuiGEFsQaWFImtBgaAkwQIJNSC5f3/wI0sg5ZwAuc9Mno/rmj/OFM77vhjmvLhn7ns8xhgjAAAAS4JsFwAAAIo3wggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCppuwBfZGVl6cCBAwoJCZHH47FdDgAA8IExRkeOHFHVqlUVFJR3/4cjwsiBAwcUFRVluwwAAFAI+/fvV/Xq1fPc7ogwEhISIulsY0JDQy1XAwAAfJGRkaGoqKjs3/G8OCKMnLs1ExoaShgBAMBhCnrEggdYAQCAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWHVJYeS5556Tx+PRyJEj893v7bffVoMGDVSmTBk1bdpUq1evvpSvBQAALlLoMLJx40bNmjVLzZo1y3e/xMRE9erVS4MHD9bWrVsVFxenuLg47dy5s7BfDQAAXKRQYeTo0aPq06ePXnvtNV111VX57vvyyy/r9ttv1+jRo9WwYUNNmDBBrVq10rRp0wpVMAAAcJdChZFhw4YpNjZWMTExBe67YcOGi/br0qWLNmzYkOcxmZmZysjIyLEAAAB3KunvAUuWLNGWLVu0ceNGn/ZPTU1VREREjnURERFKTU3N85j4+Hg9/fTT/paWwzVj/nNJx/ti73OxV/w7AABwO796Rvbv368RI0boX//6l8qUKXOlatLYsWOVnp6evezfv/+KfRcAALDLr56RzZs36+DBg2rVqlX2ujNnzuiTTz7RtGnTlJmZqRIlSuQ4JjIyUmlpaTnWpaWlKTIyMs/v8Xq98nq9/pQGAAAcyq+ekVtvvVU7duxQUlJS9nLdddepT58+SkpKuiiISFJ0dLTWrl2bY11CQoKio6MvrXIAAOAKfvWMhISEqEmTJjnWBQcH6+qrr85e369fP1WrVk3x8fGSpBEjRqh9+/aaPHmyYmNjtWTJEm3atEmzZ8++TE0AAABOdtlnYE1OTlZKSkr257Zt22rRokWaPXu2mjdvrnfeeUcrVqy4KNQAAIDiyWOMMbaLKEhGRobCwsKUnp6u0NBQn45hNA0AAHb5+vvNu2kAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVX6FkRkzZqhZs2YKDQ1VaGiooqOj9d577+W5/7x58+TxeHIsZcqUueSiAQCAe5T0Z+fq1avrueeeU926dWWM0Ztvvqk777xTW7duVePGjXM9JjQ0VN9++232Z4/Hc2kVAwAAV/ErjHTr1i3H54kTJ2rGjBn64osv8gwjHo9HkZGRha8QAAC4WqGfGTlz5oyWLFmiY8eOKTo6Os/9jh49qpo1ayoqKkp33nmnvv766wL/7MzMTGVkZORYAACAO/kdRnbs2KHy5cvL6/Xq/vvv1/Lly9WoUaNc961fv77mzJmjlStXauHChcrKylLbtm31008/5fsd8fHxCgsLy16ioqL8LRMAADiExxhj/Dng1KlTSk5OVnp6ut555x29/vrr+vjjj/MMJOc7ffq0GjZsqF69emnChAl57peZmanMzMzszxkZGYqKilJ6erpCQ0N9qvOaMf/xab9Lsfe52Cv+HQAAOFVGRobCwsIK/P3265kRSSpdurTq1KkjSWrdurU2btyol19+WbNmzSrw2FKlSqlly5bavXt3vvt5vV55vV5/SwMAAA50yfOMZGVl5ejFyM+ZM2e0Y8cOValS5VK/FgAAuIRfPSNjx45V165dVaNGDR05ckSLFi3S+vXrtWbNGklSv379VK1aNcXHx0uSnnnmGbVp00Z16tTR4cOH9cILL2jfvn0aMmTI5W8JAABwJL/CyMGDB9WvXz+lpKQoLCxMzZo105o1a3TbbbdJkpKTkxUU9L/OlkOHDmno0KFKTU3VVVddpdatWysxMdGn50sAAEDx4PcDrDb4+gDM+XiAFQAAu3z9/ebdNAAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqv8LIjBkz1KxZM4WGhio0NFTR0dF677338j3m7bffVoMGDVSmTBk1bdpUq1evvqSCAQCAu/gVRqpXr67nnntOmzdv1qZNm9SpUyfdeeed+vrrr3PdPzExUb169dLgwYO1detWxcXFKS4uTjt37rwsxQMAAOfzGGPMpfwB4eHheuGFFzR48OCLtvXs2VPHjh3TqlWrste1adNGLVq00MyZM33+joyMDIWFhSk9PV2hoaE+HXPNmP/4/OcX1t7nYq/4dwAA4FS+/n4X+pmRM2fOaMmSJTp27Jiio6Nz3WfDhg2KiYnJsa5Lly7asGFDvn92ZmamMjIyciwAAMCdSvp7wI4dOxQdHa2TJ0+qfPnyWr58uRo1apTrvqmpqYqIiMixLiIiQqmpqfl+R3x8vJ5++ml/S3MlengAAG7nd89I/fr1lZSUpC+//FIPPPCA+vfvr127dl3WosaOHav09PTsZf/+/Zf1zwcAAIHD756R0qVLq06dOpKk1q1ba+PGjXr55Zc1a9asi/aNjIxUWlpajnVpaWmKjIzM9zu8Xq+8Xq+/pQEAAAe65HlGsrKylJmZmeu26OhorV27Nse6hISEPJ8xAQAAxY9fPSNjx45V165dVaNGDR05ckSLFi3S+vXrtWbNGklSv379VK1aNcXHx0uSRowYofbt22vy5MmKjY3VkiVLtGnTJs2ePfvytwQAADiSX2Hk4MGD6tevn1JSUhQWFqZmzZppzZo1uu222yRJycnJCgr6X2dL27ZttWjRIo0bN06PP/646tatqxUrVqhJkyaXtxUAAMCx/Aojb7zxRr7b169ff9G6e+65R/fcc49fRQEAgOKDd9MAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAq/wKI/Hx8br++usVEhKiypUrKy4uTt9++22+x8ybN08ejyfHUqZMmUsqGgAAuIdfYeTjjz/WsGHD9MUXXyghIUGnT59W586ddezYsXyPCw0NVUpKSvayb9++SyoaAAC4R0l/dn7//fdzfJ43b54qV66szZs365ZbbsnzOI/Ho8jIyMJVCAAAXO2SnhlJT0+XJIWHh+e739GjR1WzZk1FRUXpzjvv1Ndff53v/pmZmcrIyMixAAAAdyp0GMnKytLIkSN10003qUmTJnnuV79+fc2ZM0crV67UwoULlZWVpbZt2+qnn37K85j4+HiFhYVlL1FRUYUtEwAABLhCh5Fhw4Zp586dWrJkSb77RUdHq1+/fmrRooXat2+vZcuWqVKlSpo1a1aex4wdO1bp6enZy/79+wtbJgAACHB+PTNyzvDhw7Vq1Sp98sknql69ul/HlipVSi1bttTu3bvz3Mfr9crr9RamNAAA4DB+9YwYYzR8+HAtX75c69atU61atfz+wjNnzmjHjh2qUqWK38cCAAD38atnZNiwYVq0aJFWrlypkJAQpaamSpLCwsJUtmxZSVK/fv1UrVo1xcfHS5KeeeYZtWnTRnXq1NHhw4f1wgsvaN++fRoyZMhlbgoAAHAiv8LIjBkzJEkdOnTIsX7u3LkaMGCAJCk5OVlBQf/rcDl06JCGDh2q1NRUXXXVVWrdurUSExPVqFGjS6scAAC4gl9hxBhT4D7r16/P8XnKlCmaMmWKX0UBAIDig3fTAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKv8CiPx8fG6/vrrFRISosqVKysuLk7ffvttgce9/fbbatCggcqUKaOmTZtq9erVhS4YAAC4i19h5OOPP9awYcP0xRdfKCEhQadPn1bnzp117NixPI9JTExUr169NHjwYG3dulVxcXGKi4vTzp07L7l4AADgfB5jjCnswb/88osqV66sjz/+WLfcckuu+/Ts2VPHjh3TqlWrste1adNGLVq00MyZM336noyMDIWFhSk9PV2hoaE+HXPNmP/4tN+l2Ptc7BX/Dre0AwBQ/Pj6+31Jz4ykp6dLksLDw/PcZ8OGDYqJicmxrkuXLtqwYUOex2RmZiojIyPHAgAA3KlkYQ/MysrSyJEjddNNN6lJkyZ57peamqqIiIgc6yIiIpSamprnMfHx8Xr66acLWxoCjFt6d2iHb9zQBokeQ6AoFbpnZNiwYdq5c6eWLFlyOeuRJI0dO1bp6enZy/79+y/7dwAAgMBQqJ6R4cOHa9WqVfrkk09UvXr1fPeNjIxUWlpajnVpaWmKjIzM8xiv1yuv11uY0gAAgMP41TNijNHw4cO1fPlyrVu3TrVq1SrwmOjoaK1duzbHuoSEBEVHR/tXKQAAcCW/ekaGDRumRYsWaeXKlQoJCcl+7iMsLExly5aVJPXr10/VqlVTfHy8JGnEiBFq3769Jk+erNjYWC1ZskSbNm3S7NmzL3NTAACAE/nVMzJjxgylp6erQ4cOqlKlSvaydOnS7H2Sk5OVkpKS/blt27ZatGiRZs+erebNm+udd97RihUr8n3oFQAAFB9+9Yz4MiXJ+vXrL1p3zz336J577vHnqwAAQDHBu2kAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVX6HkU8++UTdunVT1apV5fF4tGLFinz3X79+vTwez0VLampqYWsGAAAu4ncYOXbsmJo3b67p06f7ddy3336rlJSU7KVy5cr+fjUAAHChkv4e0LVrV3Xt2tXvL6pcubIqVKjg93EAAMDdiuyZkRYtWqhKlSq67bbb9Pnnn+e7b2ZmpjIyMnIsAADAna54GKlSpYpmzpypf//73/r3v/+tqKgodejQQVu2bMnzmPj4eIWFhWUvUVFRV7pMAABgid+3afxVv3591a9fP/tz27Zt9cMPP2jKlClasGBBrseMHTtWo0aNyv6ckZFBIAEAwKWueBjJzQ033KDPPvssz+1er1der7cIKwIAALZYmWckKSlJVapUsfHVAAAgwPjdM3L06FHt3r07+/OePXuUlJSk8PBw1ahRQ2PHjtXPP/+s+fPnS5KmTp2qWrVqqXHjxjp58qRef/11rVu3Th988MHlawUAAHAsv8PIpk2b1LFjx+zP557t6N+/v+bNm6eUlBQlJydnbz916pQefvhh/fzzzypXrpyaNWumDz/8MMefAQAAii+/w0iHDh1kjMlz+7x583J8fvTRR/Xoo4/6XRgAACgeeDcNAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsMrvMPLJJ5+oW7duqlq1qjwej1asWFHgMevXr1erVq3k9XpVp04dzZs3rxClAgAAN/I7jBw7dkzNmzfX9OnTfdp/z549io2NVceOHZWUlKSRI0dqyJAhWrNmjd/FAgAA9ynp7wFdu3ZV165dfd5/5syZqlWrliZPnixJatiwoT777DNNmTJFXbp08ffrAQCAy1zxZ0Y2bNigmJiYHOu6dOmiDRs25HlMZmamMjIyciwAAMCd/O4Z8VdqaqoiIiJyrIuIiFBGRoZOnDihsmXLXnRMfHy8nn766StdGgBYdc2Y/1zx79j7XOwV/fPd0AaJdvjqSrUhIEfTjB07Vunp6dnL/v37bZcEAACukCveMxIZGam0tLQc69LS0hQaGpprr4gkeb1eeb3eK10aAAAIAFe8ZyQ6Olpr167NsS4hIUHR0dFX+qsBAIAD+B1Gjh49qqSkJCUlJUk6O3Q3KSlJycnJks7eYunXr1/2/vfff79+/PFHPfroo/rmm2/06quv6q233tLf/va3y9MCAADgaH6HkU2bNqlly5Zq2bKlJGnUqFFq2bKlnnrqKUlSSkpKdjCRpFq1auk///mPEhIS1Lx5c02ePFmvv/46w3oBAICkQjwz0qFDBxlj8tye2+yqHTp00NatW/39KgAAUAwE5GgaAABQfBBGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYVKoxMnz5d11xzjcqUKaMbb7xRX331VZ77zps3Tx6PJ8dSpkyZQhcMAADcxe8wsnTpUo0aNUrjx4/Xli1b1Lx5c3Xp0kUHDx7M85jQ0FClpKRkL/v27bukogEAgHv4HUZeeuklDR06VAMHDlSjRo00c+ZMlStXTnPmzMnzGI/Ho8jIyOwlIiLikooGAADu4VcYOXXqlDZv3qyYmJj//QFBQYqJidGGDRvyPO7o0aOqWbOmoqKidOedd+rrr7/O93syMzOVkZGRYwEAAO7kVxj59ddfdebMmYt6NiIiIpSamprrMfXr19ecOXO0cuVKLVy4UFlZWWrbtq1++umnPL8nPj5eYWFh2UtUVJQ/ZQIAAAe54qNpoqOj1a9fP7Vo0ULt27fXsmXLVKlSJc2aNSvPY8aOHav09PTsZf/+/Ve6TAAAYElJf3auWLGiSpQoobS0tBzr09LSFBkZ6dOfUapUKbVs2VK7d+/Ocx+v1yuv1+tPaQAAwKH86hkpXbq0WrdurbVr12avy8rK0tq1axUdHe3Tn3HmzBnt2LFDVapU8a9SAADgSn71jEjSqFGj1L9/f1133XW64YYbNHXqVB07dkwDBw6UJPXr10/VqlVTfHy8JOmZZ55RmzZtVKdOHR0+fFgvvPCC9u3bpyFDhlzelgAAAEfyO4z07NlTv/zyi5566imlpqaqRYsWev/997Mfak1OTlZQ0P86XA4dOqShQ4cqNTVVV111lVq3bq3ExEQ1atTo8rUCAAA4lt9hRJKGDx+u4cOH57pt/fr1OT5PmTJFU6ZMKczXAACAYoB30wAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrChVGpk+frmuuuUZlypTRjTfeqK+++irf/d9++201aNBAZcqUUdOmTbV69epCFQsAANzH7zCydOlSjRo1SuPHj9eWLVvUvHlzdenSRQcPHsx1/8TERPXq1UuDBw/W1q1bFRcXp7i4OO3cufOSiwcAAM7ndxh56aWXNHToUA0cOFCNGjXSzJkzVa5cOc2ZMyfX/V9++WXdfvvtGj16tBo2bKgJEyaoVatWmjZt2iUXDwAAnK+kPzufOnVKmzdv1tixY7PXBQUFKSYmRhs2bMj1mA0bNmjUqFE51nXp0kUrVqzI83syMzOVmZmZ/Tk9PV2SlJGR4XOtWZnHfd63sPypp7Dc0A43tEGiHb5yQxsk2uErN7RBoh2+8rcN5/Y3xuS/o/HDzz//bCSZxMTEHOtHjx5tbrjhhlyPKVWqlFm0aFGOddOnTzeVK1fO83vGjx9vJLGwsLCwsLC4YNm/f3+++cKvnpGiMnbs2By9KVlZWfr999919dVXy+PxXJHvzMjIUFRUlPbv36/Q0NAr8h1XmhvaILmjHW5og0Q7Aokb2iC5ox1uaINUNO0wxujIkSOqWrVqvvv5FUYqVqyoEiVKKC0tLcf6tLQ0RUZG5npMZGSkX/tLktfrldfrzbGuQoUK/pRaaKGhoY4+uSR3tEFyRzvc0AaJdgQSN7RBckc73NAG6cq3IywsrMB9/HqAtXTp0mrdurXWrl2bvS4rK0tr165VdHR0rsdER0fn2F+SEhIS8twfAAAUL37fphk1apT69++v6667TjfccIOmTp2qY8eOaeDAgZKkfv36qVq1aoqPj5ckjRgxQu3bt9fkyZMVGxurJUuWaNOmTZo9e/blbQkAAHAkv8NIz5499csvv+ipp55SamqqWrRooffff18RERGSpOTkZAUF/a/DpW3btlq0aJHGjRunxx9/XHXr1tWKFSvUpEmTy9eKy8Dr9Wr8+PEX3R5yEje0QXJHO9zQBol2BBI3tEFyRzvc0AYpsNrhMaag8TYAAABXDu+mAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAwIFSUlJslwBcNoSRXGzfvl2lS5e2XUa+ateurd9++812GVfcmTNndODAAdtlXLIff/xRnTt3tl0GHOLCN51fKCUlRR06dCiaYuAKu3btKnCfhQsXFkEluSOM5MIYozNnztguI1979+4N+Bovh507dyoqKsp2GZfsyJEjF70WIdDUqFEjR8CdNm1akbzy/HI7ceKEVq1alf353Is3zy2jR4/WyZMnLVZYsLlz52rixIm5bjsXRCpVqlTEVfnvySef1B9//JHn9uTkZN12221FWFHhvPHGG/luP3LkiIYMGVJE1RRO69at9eKLLyq3qcXS0tLUvXt3PfDAAxYqO4swAkCS9NNPP+UIuI8//rh+/fVXixUVzptvvqlZs2Zlf542bZoSExO1detWbd26VQsXLtSMGTMsVliwd999V5MmTbqoztTUVHXs2FHh4eF6//33LVXnuzfffFPXX3+9du7cedG2WbNmqUmTJipZMiBfHp/DqFGj9Kc//UmpqakXbVuzZo0aN26sjRs3WqjMdwsXLtTzzz+vW265RT/88EOO9Y0aNdLhw4e1detWewUaXCQpKckEBQXZLiNfHo/HzJ8/36xcuTLfxemc8HfhCye0w+PxmLS0tOzP5cuXNz/88IPFigqnXbt25t13383+fGE7FixYYNq0aWOjNL+sWrXKeL1es3jxYmOMMSkpKaZBgwbmhhtuMBkZGZar8016errp27ev8Xq9ZtKkSebMmTNm37595tZbbzWhoaFm1qxZtkv0yZ49e0yHDh1MeHi4WbRokTHGmIyMDDNo0CBTqlQpM3bsWHPq1CnLVRYsLS3NxMXFmeDgYPPCCy+Y7t27m7Jly5rJkyebrKwsq7UVyzCSnp6e7/Lpp5864oejoCXQ2+ALJ/yI+8IJ7XBLGImMjDR79uzJ/lyxYsUcn7/99lsTGhpa9IUVwr/+9S9TpkwZM3fuXNOwYUNz3XXXmcOHD9suy28rVqwwERERpnnz5iY0NNTExMSYvXv32i7Lb1OmTDHBwcEmNjbW1KhRwzRq1Mh89dVXtsvyW+/evY3H4zHly5c327dvt12OMcaYwO8fuwIqVKggj8eT53ZjTL7bA0VqaqoqV65su4xLsn379ny3f/vtt0VUyaVp2bJlvufM8ePHi7Cawnv99ddVvnx5SdIff/yhefPmqWLFijn2eeihh2yU5rPDhw8rMzMz+/Mvv/ySY3tWVlaO7YGsd+/eOnz4sAYPHqxWrVrpww8/VFhYmO2y/NamTRs1bdpUa9euVXBwsMaNG6eaNWvaLstv9913nz755BOtWLFCwcHBWrVqlZo2bWq7LJ8dOnRIw4YN08qVKzVmzBgtXbpUvXr10vz589WqVSurtRXLMPLRRx/ZLuGSOSEs+aJFixbyeDy5PlR1br0T2hoXF2e7hEtWo0YNvfbaa9mfIyMjtWDBghz7eDyegA8j1atX186dO1W/fv1ct2/fvl3Vq1cv4qr8c2G4LVWqlA4fPqyOHTvm2G/Lli1FXZrfFi9erOHDh6tFixb673//qzfeeEOdO3fWgw8+qPj4eJUpU8Z2iT75/PPPNXDgQJUsWVLvv/++Xn/9dUVHR2vixIkaMWKE7fIKtGrVKg0dOlQ1atTQ5s2b1aBBAz3xxBN65JFHFB0drUcffVTjx4+39gwPb+11qKCgIFf0jOzbt8+n/Zz4vyjYMWLECH344YfavHnzRT90J06c0HXXXaeYmBi9/PLLlios2NNPP+3TfuPHj7/ClVyau+++W2vWrFF8fLz+7//+L3t9YmKiBg4cKEmaN2+eoqOjbZXok4cffljTpk3T8OHDNXHixOzzaunSpRo+fLgaN26suXPnqlatWpYrzZvX69X48eM1ZswYBQXlHLuSkJCgIUOG6KqrrlJSUpKV+ggjDjVw4EC98sorCgkJyXOf06dPq1SpUkVYFWBfWlqaWrRoodKlS2v48OGqV6+epLO3/KZNm6Y//vhDW7duVUREhOVK3e+mm27SvHnzVLdu3Yu2nThxQmPGjNGMGTN06tQpC9X5rk6dOpo7d65uvvnmi7alpaXpr3/9q9atW6cjR45YqM4327dvV7NmzfLcnpGRob/97W8FDmO+UoplGClRooRP+wXyPB59+/bV9OnTFRoamuv2TZs2acCAAbkOqQskycnJPu1Xo0aNK1zJpenUqZNP+61bt+4KV3JpsrKyNG/ePC1btkx79+6Vx+NRrVq11KNHD/Xt29cRt8wkac+ePXrggQeUkJCQfQvQ4/Hotttu06uvvqratWtbrrB4yMrKuuh/4Rf65JNPdMsttxRRRYVz/PhxlStXLt99FixYoL59+xZRRe5TLMNIUFCQatasqf79+6tly5Z57nfnnXcWYVX+ad26tdLS0vTGG2+oS5cu2etPnz6tp556SpMnT9agQYM0c+ZMi1UW7PxgeP6PxvnrPB5PQAdD6X/nVGxsbL69UVOmTCnCqvxjjFG3bt20evVqNW/eXA0aNJAxRv/973+1Y8cOde/eXStWrLBdpl9+//137d69W9LZ/92Gh4dbrsg3HTt2LDD4eTyegJ9I78yZM/r6669Vt25dlS1bNse248ePa/fu3WrSpEmBgcW22rVra+PGjbr66qttl1Jo3333nQ4fPqwbbrghe93atWv17LPP6tixY4qLi9Pjjz9urb5i+QDrV199pTfeeEMvv/yyatWqpUGDBqlPnz666qqrbJfmsy+//FLPPPOMunXrpoEDB2ry5Mn65ptv1L9/fx09elSrVq1yxPTjHo9H1atX14ABA9StWzdHTICUm3/84x+aO3eu3n77bfXp00eDBg1SkyZNbJfll3nz5umTTz7R2rVrL3pQct26dYqLi9P8+fPVr18/SxX6bu/evUpISNDp06d1yy23OO7vokWLFnluO3LkiBYtWuSIEUELFizQtGnT9OWXX160rXTp0ho0aJBGjhypv/zlLxaq850bZrx+7LHH1LRp0+wwsmfPHnXr1k0333yzmjVrpvj4eJUrV04jR460U2DRjyYOHCdOnDALFiwwnTp1MuXKlTM9e/Y0H3zwge2y/LJx40bTuHFjU6VKFVOqVCkzaNAgk56ebrssn6WkpJjnnnvO1K9f30RERJiHH37Y7Nq1y3ZZhZaYmGiGDBliQkNDzfXXX29mzJjhmL+P2267zcTHx+e5feLEiaZz585FWFHhrFu3zpQrVy57vp1SpUqZBQsW2C7rkp0+fdpMnTrVVKpUydSpUyd7MrRA1q5du3zrXLp0qbn55puLsKLCuXAOHieqXr26SUxMzP48YcIE07x58+zPr7/+eo7PRa1Yh5Hz/fjjj6Zjx44mKCjI/Pbbb7bL8dmOHTtMixYtTLly5UxwcLCjL7qffvqpGTRokAkJCTE33nijmT17tjlz5oztsgrl2LFjZt68eeb66683wcHBjggkERERZuvWrXlu37Jli4mIiCi6ggrppptuMnfeeac5cOCA+f33382DDz5oqlSpYrusS7Jw4UJTu3ZtU6VKFTN9+nRz+vRp2yX5pFKlSjkmnLvQjz/+aCpWrFh0BRWSG2a8LlOmjElOTs7+3KlTJzNu3Ljsz7t37zZhYWEWKjur2IeR/fv3mwkTJphrr73WVKlSxTz22GOO+IeelZVlJk2aZLxerxkwYIA5dOiQmT59uilfvrz585//bA4ePGi7xEJLTU11ZDA836effmoGDhxoypcvb2688UZz/Phx2yUVqFSpUubAgQN5bv/5559N6dKli7CiwgkLCzNff/119udjx46ZEiVKmF9//dViVYXz3nvvZc9a+swzz5ijR4/aLskv5cqVM9u2bctz+7Zt20y5cuWKsKLCccOM11WrVjVffvmlMcaYM2fOmNDQULNq1ars7bt27bI6M3FgPzV0hZw6dUpLly5V586dVbduXW3ZskVTp07V/v379dxzzzniuYU2bdron//8p95++23NnTtXFSpU0IMPPqht27bp119/VaNGjbR06VLbZfolMTFRQ4YMUb169XT06FFNnz5dFSpUsF2Wzw4cOKBJkyapXr166tGjh8LDw/Xll1/qiy++uOjhvUB05syZfM/9EiVK5PsG1kCRkZGRY9bYcuXKqWzZskpPT7dYlX+++uordezYUX/+85/VsWNH/fDDD3ryyScVHBxsuzS/1K1bV4mJiXlu/+yzz3Id9huIUlNTlZWVlecS6M+UdOjQQRMmTND+/fs1depUZWVlqUOHDtnbd+3apWuuucZafYH/q3sFVKlSRSEhIerfv79effXV7InDjh07lmO/vIbNBoJatWrpvffeu2h0QO3atfXxxx9r6tSpGjx4sHr27GmpQt+kpKRo/vz5mjt3rg4dOqQ+ffro888/d9wDh3fccYc++ugjde7cWS+88IJiY2MdEWrPZ4zRgAED5PV6c93uhAcmz1mzZk2OadOzsrK0du3aHEPdu3fvbqM0n7Rp00Zly5bV/fffr1q1amnRokW57hfos+H27t1b48aNU9u2bS+a42Lbtm166qmn9Oijj1qqzndOGdKen4kTJ+q2225TzZo1VaJECb3yyis5wu2CBQt8nqLgSii2Q3vPye0kMw4YTpqcnKyoqKh8/5F8//33Af+/jlKlSqlatWrq37+/unfvnuew2Pwm6wkEQUFBqlKliipXrpzv30kgT999bkbMgsydO/cKV3JpfBkmGuj/vq+55hqfhvb++OOPRVRR4Zw+fVqdO3fWZ599ppiYGDVo0ECS9M033+jDDz/UTTfdpISEhICfnNEtM17/8ccf+vrrr1WpUiVVrVo1x7Zt27YpKirK2vD3YhlGPv74Y5/2a9++/RWupPBKlCihlJQUx//jyC0YXnhKBvoPh+Se6buBy+306dOaMmWKFi1apO+//17GGNWrV0+9e/fWyJEjVbp0adslFsiXGa+d7scff9T999+vDz74wMr3F8sw4gZuSeq8myawnJuf49SpU+rQoYMaN25suyTAujNnzujFF1/Uu+++q1OnTunWW2/V+PHjHfEsmK+2bdumVq1aWfuPn7Nual8mb731luLi4rIT+U8//aSqVatm/y/9+PHjmjZtWsDfy3TDfcw333xTjzzySIFTLQe6Xbt2qVGjRvnus3DhwoCe3Omjjz7Sn/70J504cUKSVLJkSc2ZMyega87Nu+++W+A+JUuWVGRkpJo0aRKQ/zMfNWpUruvDwsJUr1493XXXXXk+2xOITpw4oYSEBH333XeSpPr16ysmJsYxP+aTJk3S3//+9+yaX375ZR08eFBz5syxXZprFMuekQtvcYSGhiopKSn7fRVpaWmqWrVqQN8aCAoK0l//+tcCf8RfeumlIqqocNxyu6ls2bKaMGGCHn744YtCYlpamoYOHaqPPvoooF+k1a5dO1WsWFEzZsxQmTJlNG7cOC1fvlwHDhywXZpf/JlaPDIyUkuXLs31BWg2XTgD7jmHDx/W7t27FRERoXXr1gX8O5uks+FwyJAh+vXXX3Osr1ixot544w1169bNUmW+q1u3rh555BHdd999kqQPP/xQsbGxOnHiRMBPZe8r2z0jxTKMXHiLIyQkRNu2bXNcGImOjs73f3UejyfgX8zmlttN//73v/XAAw+ofv36mjdvnq699lpJZ3tDRowYocaNG2vOnDmqU6eO5UrzVqFCBSUmJmb38Bw/flyhoaFKS0tz9Ds5cmOMUVpamp599lklJiYG9IPFF8rIyFCfPn0UEhKS5yibQJGYmKgOHTqoe/fuevjhh9WwYUNJZ3sSJ0+erFWrVunjjz9WmzZtLFeaP6/Xq927dysqKip7XZkyZbR7925Vr17dYmWXD2HEAreEETf8iAcFBSktLU2VKlWyXcolO3jwoO677z4lJCTo73//uz799FMlJCTo2Wef1d/+9reAv62W2zl14b8Nt9m7d68aNGigkydP2i7FL1999ZXuuecen5+5suWOO+5QVFSUZs2alev2++67T/v379fq1auLuDL/lChRQqmpqTmuUyEhIdq+fbtq1aplsTLftWzZMt9r0PHjx/X999/zzAj8E+g/bP6oV69ege35/fffi6iawqtcubKWL1+uPn366NFHH1VwcLC+/PJLNW3a1HZpPnP6/By+SElJ0enTp1WjRg1dc801SktLs12S3ypWrOiIfxNffPGF/vGPf+S5fdiwYQE9avGc3ObgOXnypO6///4cc3UsW7bMRnk+iYuLs11CvoptGDn/onvhBffw4cMWK/ONmzq0nn766Rw/gE516NAhDRs2TCtXrtSYMWO0dOlS9erVS/Pnz1erVq1sl+eT/v37X7Tu3H1yyRnDrAvSqVMnfffdd9ntcOK598UXX2TfCgxkJ06cyHfyyLCwMEf0SuX278JpD3YH+rQCxTaMXHhynX/BlQK/52Hu3LmOvIjm5t5773X87aZVq1Zp6NChqlGjhjZv3qwGDRroiSee0COPPKLo6Gg9+uijGj9+fEDPypqVlWW7hCIxf/58HT9+3HYZ+dq+fXuu69PT07V582ZNmjQp4H9cpLMPfq5bty7PCfXWrl0b8BMzSoE/0Z8vAn7EXxG/CwdF5MCBA2bfvn22yyhQUFCQ41/NbYwxpUuXNhMnTsz1LcMffPCBqVGjhtXXc8NZzr14LbcXslWqVMnEx8ebrKws22UW6KWXXjLh4eHmP//5z0XbVq1aZa6++mozefJkC5X5b8+ePWb27Nlm2rRpZufOnbbL8VuZMmXMCy+8kOt5k5qaarp162bKly9vobKzCCMu1aBBg4B/i6QxZy+6bggj+b2Z1Bhj0tPTzaBBg4qomivDKQH3fIcOHTKvvfaaGTNmTPYboDdv3mx++ukny5Xlb+/evbkuv//+u+3S/HLmzBnTo0cP4/F4TIMGDcyf//xnExcXZ+rXr2+CgoLMXXfdlWuADzTr1q0z5cqVyw6EpUqVMgsWLLBdll/eeecdU6lSJdOuXTuze/fu7PULFiww4eHh5uabbzbff/+9tfqK5Wiagpz/gJtTbdy4UcePH3fEw2FwhoYNG+Z41iLQbd++XTExMQoLC9PevXv17bffqnbt2ho3bpySk5M1f/582yUWG0uXLtXixYuzJz2rV6+e7r33Xt17772WK/ONW+bgCeQRf4SRXDjtoutkd911l0/7BfJT6r4g4Ba9mJgYtWrVSs8//3yOIcqJiYnq3bu39u7da7vEQnPD+eQkbpuDp0+fPlq8eLGCg4OVmJgYECP+AvdpOouc8IDb+Q4fPqx33nlHP/zwg0aPHq3w8HBt2bJFERERqlatmu3y8nXhQ7iLFi1St27dXPdCqgtHcDjR9ddfb7sEv2zcuDHX+S2qVaum1NRUCxVdPk47n9LT05WQkKC9e/fK4/Godu3auvXWW/MdaRNIMjIyVLFixezP5cqVU9myZZWenu6oMBLII/4II7lw0kX3wq7ooUOHKjw8XMuWLXNEV/SFT6m/8847ev755103yRYBt+h5vV5lZGRctP67775z/CR7TjqfFi5cqOHDh1/0dxEWFqaZM2eqZ8+elirzj9Pn4An4EX/WnlYJEE59wO2cW2+91YwePdoYY0z58uXNDz/8YIwx5vPPPzc1a9a0WFnhnN8G2LFt2zZTqVIlU6dOHVOyZMnsv48nnnjC9O3b13J1vhs8eLCJi4szp06dMuXLlzc//vij2bdvn2nZsqUZMWKE7fKKhc2bN5uSJUua/v37m6SkJHPy5Elz4sQJs3nzZtO3b19TqlQpk5SUZLvMAuU2qunCJdAHDAT6iL9iHUbccNENDQ3NfjL6/B/yvXv3Gq/Xa7O0QnFDGCHgBobDhw+bmJgYU6FCBVOiRAkTFRVlSpUqZW655RZz9OhR2+X5zMnn04ABA0yPHj3y3H733XebgQMHFmFFxVegj/gr1rdpRo0apQEDBmQ/4HbOHXfcod69e1uszHdu7op2IqffNpPc86xFWFiYEhIS9Nlnn2n79u06evSoWrVqpZiYGNul+czp59Pnn3+uV199Nc/t999/vx588MEirKj4atasWb7bQ0ND9cYbbxRRNRcr1mHEDRfd7t2765lnntFbb70l6ezMscnJyXrsscd09913W66uYO+++26Oz7ndh5UC+17s+Qi4gaddu3Zq166d7TIKxenn04EDB1SvXr08t9erV08///xzEVZ0ZbhhdJP1NljrkwkAlSpVMlu2bDHG5OyO/uCDD0z16tVtluYzp3dFu+Fe7PnccNvMTc9afPjhhyY2NtbUrl3b1K5d28TGxpqEhATbZfnM6edTQZMapqamOurfd16cMslkfmy3oVj3jDi9V0Fyfle0296H4oZehcmTJ6tHjx6qXLmyTpw4ofbt2ys1NVXR0dGaOHGi7fJ89uqrr2rEiBHq0aOHRowYIensC+buuOMOTZkyRcOGDbNcYcHccD5dOArlfE54KakvnDS6KS+221CsJz1LT09Xjx49tGnTJh05ckRVq1bNvuiuXr06x6uhAV8MGTJEv/32m9566y2Fh4dr+/btKlGihOLi4nTLLbdo6tSptkv0mVMD7jnVq1fXmDFjNHz48Bzrp0+frkmTJjni9oDTz6egoKAC93HDm6Bx6Yp1GDnH6RfdtWvXasqUKfrvf/8r6ewMsiNHjnRcO3Jj/T6mnwi4gaN8+fJKSkpSnTp1cqz//vvv1bJlSx09etRSZb7jfAo8bpiDJxDbQBhxuPO7oqOjoyWd7Yp+5513HNMVnR+nTs1PwLWvd+/eatmypUaPHp1j/YsvvqhNmzZpyZIllirzn9PPJ7dww/uOArUNxT6MOP2i64au6Pw47X0obuCWgPvss8/qxRdf1E033ZSjHZ9//rkefvjhHFORP/TQQ7bKLNac1vPphvcdBWobinUYccNF1w1d0W5DwA0MtWrV8mk/j8ejH3/88QpXU3hOP5/y47Sez7CwMG3ZskXXXnttjh/yffv2qX79+jp58qTtEgsUqG0o+OkiF5s0aZKmTJmixYsX66GHHtJDDz2kRYsWacqUKZo0aZLt8nzSvXt3LV++/KL1K1eu1J/+9CcLFRXe4cOH9frrr2vs2LH6/fffJUlbtmxxzI+fdDbg3n777QoJCdGIESM0YsQIhYaG6o477tD06dNtl+eTw4cP6/bbb79ofefOnZWenm6hosLZs2ePT0sgBxE3nE/5mT9/vtatW2e7DJ+5YXRTwLbB1pjiQBAcHGy+//77i9Z/9913Jjg42EJF/pswYYIJCwszd9xxh5kwYYKZMGGCiY2NNRUqVDATJkwwL7/8cvYSyNwwNb8xxlSrVs3885//vGj9tGnTTNWqVS1U5L9evXqZ559//qL1L7zwgunZs6eFigpn3bp1tku4ZG44n9zEDXPwBGobivVtGjc84OaWruhAvY/pLzfcNnPLsxZer1fVq1fXwIED1b9/f0VFRdkuyW9uOJ/OCcQRHP5yw+imQG1DsQ4jbrnoukGg3sf0FwE3cPz6669asGCB3nzzTX399dfq1KmTBg8erLi4OJUuXdp2eT5xw/kkBe4IjsJyw+imQGtDsQ4jbrjofvTRR+rYsaPtMi5Z5cqVtWbNGrVs2TJHGElISNCgQYO0f/9+2yX6hIAbmLZs2aK5c+dq8eLFks7+yA8ePFjNmze3XFn+3HI+uaXnE1dOsQ4jbuCGrmjJ+TNNnkPADVwHDhzQ7Nmz9dxzz6lkyZI6efKkoqOjNXPmTDVu3Nh2eblyw/kkuafnU3LH6KaAbIO1p1UCgBsecPvll1/MSy+9ZJo3b25KlixpOnfubJYuXWoyMzNtl+YXp7/wz01Kly5tateubSZMmGCSk5Ntl3NJTp06Zd5++23TtWtXU7JkSdOmTRvz2muvmaNHj5o9e/aYPn36mIYNG9ou0/Xc8FJSY4yZPn26KVmypLn33nuzBwb06tXLlCpVykybNs12eT4J1DYU6zDipouuMcZs3rzZDB8+3Fx99dXm6quvNv/3f/9nkpKSbJfll08//dRMnz7d/OMf/3DU21XPIeDa17FjR3Po0KHsfwvh4eFmxIgRZseOHRftm5KSYjwej4UqfeOG88mYwB3B4S83jG4K1DYU6zDi9Itubn7++Wczfvx44/V6TXBwsClRooRp166d2blzp+3SigUCrn1BQUEmLS3NdOrUySxatMicPHkyz31Pnz5t1q9fX4TV+cct55Nbej7dMB1EoLahWIeR8znxonuOW7qiP/zwQxMbG2tq165tateubWJjYx3XO0LAtc/j8Zi0tDTbZVwWbjufnN7z6YY5eAK1DTzAeh4nPeDWqVMnLVu2TE8++aQWL14sY4z69u2rIUOGqEmTJjn2TU1NVdWqVZWVlWWp2oK5YWr+Czl1BIcknT59WitXrtScOXOUkJCg6667ToMHD1avXr30yy+/aNy4cdqyZYt27dplu9SLBAUFad26dQoPD893v2bNmhVRRZeHk88nt3DD6KaAbYO1GBQgnNqr4KauaGMC9z7mpXJSr4JbnrXweDwmKCjIeDyei5Zz64OCgmyXWShOOp8u5Iaez2uuucanpVatWrZLzVOgtqFYhhE3XHTd1BVtTODexywMAq5dHo/HbNy40ezduzffxSmcej6dL1BHcCBwFMsw4oaLrsfjMR999JHZtm1bvotTBOp9TF8RcAOHG9rhhvPpfG7p+XTD6KZAbUOxfGYkKChIqampqly5su1SCi0oKEgej0e5/fWdW+/xeBzzau6AvY/poxIlSiglJUW9evXSkCFDdNddd8nr9ea67x9//KHPP/9c7du3L+Iq8+eWZy3c8O/bDefT+dzyjh03TDIZqG0otmHE6RfdoKAgffXVVwW+8rlmzZpFVNGlcfpMk274AXRLwO3YsaOWL1+uChUq2C6l0NxwPp3PLe/YccP7jgK1DcU2jDj9ouu2i5XTEXADk1PfFOuG8+l8Tu/5zI0bRjcFUhuKbRhx+kXXbWHE6e9DIeAGHie/KdYN59P5nN7zmRcnTQeRl0BpQ8ki+6YAU6NGDUdfdNu3b++YbkFf3H777QF5H9MfX375ZYEBF0Vn1KhRGjBgQPabYs+544471Lt3b4uV+cZN59OePXtsl3DZ5DYHz7Rp03LMwXPPPfcE5Bw85wRkG4r6idlA4Ian7c936NAh89prr5kxY8aY3377zRhzdkbZn376yXJlvnP6TJNuOKc6dOhgDh06ZLuMyyY0NNTs3r3bGJPz5Wx79+41Xq/XZmkFcsP5dL5AHcHhKzeMbgr0NhTLMOKmi+62bdtMpUqVTJ06dUzJkiWzL7hPPPGE6du3r+XqCseJU/O77cfDDQHXyW+Kddv55PR37LhhOohAb0OxDCPnc/pF99ZbbzWjR482xuS84H7++eemZs2aFiu7NE6baZKAG3ic/KZYN51PxtDzGQgCvQ3F8gHWc5z8gNs5YWFh2rJli6699lqFhIRo27Ztql27tvbt26f69evr5MmTtkv0mZPfh3I+p47gOCcmJkatWrXKftbi3DmVmJio3r17a+/evbZL9El6erp69OihTZs26ciRI6patapSU1MVHR2t1atXKzg42HaJPnH6+XShQBrB4Ss3jG4K+DbYTkM2uaFXwcld0cYE/n1Mf7mhV8HJz1rkxslvinXD+ZQbp/V8uuF9R4HehmI7mkaSNm7cqFmzZl20vlq1akpNTbVQkf+6d++uZ555Rm+99Zaks0PjkpOT9dhjj+nuu++2XF3BPv74Y506dUq7du3SP//5z3xnmqxYsaI++uijIq7QP04fwSGdnaExIyPjovXfffedI0d3tGvXTu3atbNdRqG44Xw6JyBHcPjBDaObArkNxTqMuOGiO3nyZPXo0UOVK1fWiRMn1L59++yu6IkTJ9our0Dm/79LuHbt2gL3LVmyZEBPeS0RcG175ZVXfN7XCZNrOf186tSpk5YtW6Ynn3xSixcvljFGffv21fPPP68mTZpk7xccHKwXX3xRVatWtVht/pw+HYQU2G0o1mHEyRfdc8LCwpSQkKDPPvtM27dv19GjR9WqVSvFxMTYLs1nu3btKvDCGsj3Ys9HwLVrypQpPu3n8XgcEUacfj65recTV06xfoDVLQ+4OZnbZpocMmSIfvvtN7311lsKDw/X9u3bVaJECcXFxemWW27R1KlTbZfoMycHXLdw+vnklll93fC+o0BvQ7EOI+c47aLrpq5oN0zNfz4CLi4np59PAT+CoxDcMLopENtAGHEgN73nwS3/c7oQAdeOUaNGacKECQoODtaoUaPy3fell14qoqoundPOp3Pc1vPphukgArUNxS6MuOWi6xZuDSNO45aA27FjR7344otq2bKlbr311jz383g8WrduXRFWVjy5refTDXPwBGobil0YcctF1y0C/T6mLwi4gaVEiRJKSUnJDrg9e/bUK6+8ooiICMuV+cZN55Pb/rPhhkkmA7UNxW40jRveHummrujzn54PxPuYvnDbCA6nu/D/V++9956OHTtmqRr/cT4FLqePbpICtw3FLoy4wdatW/XNN9+oZcuW2rp1a577eTyeIqzq0lx4H3Po0KEKDw/XsmXLAv5eLAE3sDmt89cN59M57du3V+nSpW2Xcdm4YTqIQG1DsbtN45aLrtO7oi8UqPcxiws3PWtRokQJpaamZv8vLyQkRNu3b/f5Fi2uDKf2fJ7P6aObpMBtQ7HrGXFLr4LTu6Iv5OSZJt0QcD/66KPsgHvu1plTA64xRgMGDMieXOvkyZO6//77L7rILlu2zEZ5BXLD+XQhJ/d8ns8Nk0wGahuKXRhx00X3fE7v4ArU+5i+IOAGlv79++f4/Je//MVSJYXjlvPpfG56x47k7PcdnRNobSh2YURyx0XX4/FcdDFy0sXpQoF6H9MXBNzAMnfuXNslXBI3nk9O7vl0w+gmJ7ShWIaRCznxouv0rugLOfl9KBIBF5eXG86n8zm559MNo5uc0IZiGUbccNF1elf0hQL1PmZhEXBxOTnxfDqfk3s+3TC6yQltKHajaaSzE/F07do1+6L7//7f/1OnTp246KLQ3DCCY+DAgT7t5/TbIE7ghvPpfIE6ggOBo1iGES66gcEJ9zF9RcDF5eTW88mJPZ9uGN3khDYUyzCCwOCmqfkJuLicOJ8Chxvm4HFCGwgjAIDLzk09n26YZDLQ20AYAQBcdm7q+bzwhX+hoaFKSkpS7dq1LVfmu0BvQ7EcTYPA4IT7mAAKxwkjOArLDf+HD7Q2EEZgjRtnmgTgPm6YDiLQ28BtGlgV6PcxARSOm3o+3TC6KdDbQM8IrHLbTJMAznJTz6cbJpkM9DbQMwKrLnyoKiQkRNu2bQuYh6oAFB49n/BVkO0CULwF+n1MAIVHzyd8xW0aWMX7UIDig4545IUwAqsC/T4mgMKj5xO+4pkRAMAVEegjOBA46BkBAFwR9HzCV/SMAAAAqxhNAwAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMCq/w88uqkaYYVZSwAAAABJRU5ErkJggg==\n",
142
      "text/plain": [
143
       "<Figure size 640x480 with 1 Axes>"
144
      ]
145
     },
146
     "metadata": {},
147
     "output_type": "display_data"
148
    }
149
   ],
150
   "source": [
151
    "file_info.filetype.value_counts().plot.bar()"
152
   ]
153
  },
154
  {
155
   "cell_type": "markdown",
156
   "id": "c756d12e",
157
   "metadata": {},
158
   "source": [
159
    "You can also use this utility to find the average file size of documents in a directory, grouped by filetype."
160
   ]
161
  },
162
  {
163
   "cell_type": "code",
164
   "execution_count": 5,
165
   "id": "a600fb0f",
166
   "metadata": {},
167
   "outputs": [
168
    {
169
     "data": {
170
      "text/html": [
171
       "<div>\n",
172
       "<style scoped>\n",
173
       "    .dataframe tbody tr th:only-of-type {\n",
174
       "        vertical-align: middle;\n",
175
       "    }\n",
176
       "\n",
177
       "    .dataframe tbody tr th {\n",
178
       "        vertical-align: top;\n",
179
       "    }\n",
180
       "\n",
181
       "    .dataframe thead th {\n",
182
       "        text-align: right;\n",
183
       "    }\n",
184
       "</style>\n",
185
       "<table border=\"1\" class=\"dataframe\">\n",
186
       "  <thead>\n",
187
       "    <tr style=\"text-align: right;\">\n",
188
       "      <th></th>\n",
189
       "      <th>filesize</th>\n",
190
       "    </tr>\n",
191
       "    <tr>\n",
192
       "      <th>filetype</th>\n",
193
       "      <th></th>\n",
194
       "    </tr>\n",
195
       "  </thead>\n",
196
       "  <tbody>\n",
197
       "    <tr>\n",
198
       "      <th>FileType.DOCX</th>\n",
199
       "      <td>36602.0</td>\n",
200
       "    </tr>\n",
201
       "    <tr>\n",
202
       "      <th>FileType.EML</th>\n",
203
       "      <td>149088.5</td>\n",
204
       "    </tr>\n",
205
       "    <tr>\n",
206
       "      <th>FileType.HTML</th>\n",
207
       "      <td>1228404.0</td>\n",
208
       "    </tr>\n",
209
       "    <tr>\n",
210
       "      <th>FileType.JPG</th>\n",
211
       "      <td>64002.5</td>\n",
212
       "    </tr>\n",
213
       "    <tr>\n",
214
       "      <th>FileType.PDF</th>\n",
215
       "      <td>2429245.0</td>\n",
216
       "    </tr>\n",
217
       "    <tr>\n",
218
       "      <th>FileType.PPTX</th>\n",
219
       "      <td>38412.0</td>\n",
220
       "    </tr>\n",
221
       "    <tr>\n",
222
       "      <th>FileType.TXT</th>\n",
223
       "      <td>619.0</td>\n",
224
       "    </tr>\n",
225
       "    <tr>\n",
226
       "      <th>FileType.UNK</th>\n",
227
       "      <td>1102.0</td>\n",
228
       "    </tr>\n",
229
       "    <tr>\n",
230
       "      <th>FileType.XLSX</th>\n",
231
       "      <td>4765.0</td>\n",
232
       "    </tr>\n",
233
       "    <tr>\n",
234
       "      <th>FileType.XML</th>\n",
235
       "      <td>713.5</td>\n",
236
       "    </tr>\n",
237
       "  </tbody>\n",
238
       "</table>\n",
239
       "</div>"
240
      ],
241
      "text/plain": [
242
       "                filesize\n",
243
       "filetype                \n",
244
       "FileType.DOCX    36602.0\n",
245
       "FileType.EML    149088.5\n",
246
       "FileType.HTML  1228404.0\n",
247
       "FileType.JPG     64002.5\n",
248
       "FileType.PDF   2429245.0\n",
249
       "FileType.PPTX    38412.0\n",
250
       "FileType.TXT       619.0\n",
251
       "FileType.UNK      1102.0\n",
252
       "FileType.XLSX     4765.0\n",
253
       "FileType.XML       713.5"
254
      ]
255
     },
256
     "execution_count": 5,
257
     "metadata": {},
258
     "output_type": "execute_result"
259
    }
260
   ],
261
   "source": [
262
    "file_info.groupby(\"filetype\").mean(numeric_only=True)"
263
   ]
264
  },
265
  {
266
   "cell_type": "markdown",
267
   "id": "870ede91",
268
   "metadata": {},
269
   "source": [
270
    "If you want to pass in a list of filenames instead of a directory, use `get_file_info` instead of `get_directory_file_info`, as seen in the workflow below:"
271
   ]
272
  },
273
  {
274
   "cell_type": "code",
275
   "execution_count": 6,
276
   "id": "e5e3a24d",
277
   "metadata": {},
278
   "outputs": [],
279
   "source": [
280
    "from unstructured.file_utils.exploration import get_file_info\n",
281
    "\n",
282
    "filenames = [os.path.join(EXAMPLE_DOCS_DIRECTORY, f) for f in os.listdir(EXAMPLE_DOCS_DIRECTORY)]"
283
   ]
284
  },
285
  {
286
   "cell_type": "code",
287
   "execution_count": 7,
288
   "id": "d8e59472",
289
   "metadata": {},
290
   "outputs": [
291
    {
292
     "data": {
293
      "text/plain": [
294
       "['/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-html.html',\n",
295
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/example-10k.html',\n",
296
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/factbook.xml',\n",
297
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-header.eml',\n",
298
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake.docx',\n",
299
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-image-embedded.eml',\n",
300
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-text.txt',\n",
301
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper-fast.pdf',\n",
302
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/email-with-image.eml',\n",
303
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper-fast.jpg',\n",
304
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-power-point.pptx',\n",
305
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email.txt',\n",
306
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/README.md',\n",
307
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/factbook.xsl',\n",
308
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-excel.xlsx',\n",
309
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email.eml',\n",
310
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper.pdf',\n",
311
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-attachment.eml',\n",
312
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/example.jpg']"
313
      ]
314
     },
315
     "execution_count": 7,
316
     "metadata": {},
317
     "output_type": "execute_result"
318
    }
319
   ],
320
   "source": [
321
    "filenames"
322
   ]
323
  },
324
  {
325
   "cell_type": "code",
326
   "execution_count": 8,
327
   "id": "cb0add28",
328
   "metadata": {},
329
   "outputs": [
330
    {
331
     "name": "stderr",
332
     "output_type": "stream",
333
     "text": [
334
      "MIME type was message/rfc822. This file type is not currently supported in unstructured.\n"
335
     ]
336
    },
337
    {
338
     "data": {
339
      "text/plain": [
340
       "FileType.EML     4\n",
341
       "FileType.TXT     3\n",
342
       "FileType.HTML    2\n",
343
       "FileType.XML     2\n",
344
       "FileType.PDF     2\n",
345
       "FileType.JPG     2\n",
346
       "FileType.UNK     1\n",
347
       "FileType.DOCX    1\n",
348
       "FileType.PPTX    1\n",
349
       "FileType.XLSX    1\n",
350
       "Name: filetype, dtype: int64"
351
      ]
352
     },
353
     "execution_count": 8,
354
     "metadata": {},
355
     "output_type": "execute_result"
356
    }
357
   ],
358
   "source": [
359
    "file_info = get_file_info(filenames=filenames)\n",
360
    "file_info.filetype.value_counts()"
361
   ]
362
  },
363
  {
364
   "cell_type": "code",
365
   "execution_count": null,
366
   "id": "ac4473b0",
367
   "metadata": {},
368
   "outputs": [],
369
   "source": []
370
  }
371
 ],
372
 "metadata": {
373
  "kernelspec": {
374
   "display_name": "Python 3 (ipykernel)",
375
   "language": "python",
376
   "name": "python3"
377
  },
378
  "language_info": {
379
   "codemirror_mode": {
380
    "name": "ipython",
381
    "version": 3
382
   },
383
   "file_extension": ".py",
384
   "mimetype": "text/x-python",
385
   "name": "python",
386
   "nbconvert_exporter": "python",
387
   "pygments_lexer": "ipython3",
388
   "version": "3.8.13"
389
  }
390
 },
391
 "nbformat": 4,
392
 "nbformat_minor": 5
393
}
394

Использование cookies

Мы используем файлы cookie в соответствии с Политикой конфиденциальности и Политикой использования cookies.

Нажимая кнопку «Принимаю», Вы даете АО «СберТех» согласие на обработку Ваших персональных данных в целях совершенствования нашего веб-сайта и Сервиса GitVerse, а также повышения удобства их использования.

Запретить использование cookies Вы можете самостоятельно в настройках Вашего браузера.