{
"cells": [
{
"cell_type": "markdown",
"id": "9cf9f373",
"metadata": {},
"source": [
"# File Exploration\n",
"\n",
"In addition to core document processing capabilities, the `unstructured` library includes utilities for summarizing information about raw documents. We will cover how to use these utilities in this notebook. At the conclusion of this notebook, you should understand:\n",
"\n",
"- [Filetype detection in `unstructured`](#filetype)\n",
"- [How to generate summary statistics about documents](#summary)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "59392a21",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pathlib\n",
"\n",
"DIRECTORY = os.path.abspath(\"\")\n",
"EXAMPLE_DOCS_DIRECTORY = os.path.join(DIRECTORY, \"..\", \"..\", \"example-docs\")"
]
},
{
"cell_type": "markdown",
"id": "9a6ce38f",
"metadata": {},
"source": [
"## Filetype Detection <a id=\"filetype\"></a>\n",
"\n",
"The `unstructured` library includes a `detect_filetype` function that helps detect the type of an input file. Filetype detection relies on `libmagic` under the hood, so you will need to install that library first. In addition to the MIME type reported by `libmagic`, `unstructured` uses the file extension and, in some cases, inspects the document contents to determine the document type. The following is an example of how to call `detect_filetype`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c6bd2f4a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<FileType.HTML: 50>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from unstructured.file_utils.filetype import detect_filetype\n",
"\n",
"filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, \"example-10k.html\")\n",
"detect_filetype(filename=filename)"
]
},
{
"cell_type": "markdown",
"id": "c68e6a41",
"metadata": {},
"source": [
"The output of `detect_filetype` is a `FileType` enum, which is defined in [`filetype.py`](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/file_utils/filetype.py). Check out the source file to see the full list of file types that `detect_filetype` supports."
]
},
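{
"cell_type": "markdown",
"id": "3f1a7c20",
"metadata": {},
"source": [
"Because the return value is an enum member, you can compare it directly against members of `FileType` to decide how to handle a file. The next cell is a minimal sketch that reuses `filename` from the earlier cell and assumes `FileType` can be imported from the same `filetype.py` module linked above (its location may differ in other versions of `unstructured`). `FileType.HTML` and `FileType.UNK` both appear in the outputs of this notebook; the printed messages are purely illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b2e9d41",
"metadata": {},
"outputs": [],
"source": [
"# Assumption: FileType lives next to detect_filetype in this version of unstructured.\n",
"from unstructured.file_utils.filetype import FileType, detect_filetype\n",
"\n",
"filetype = detect_filetype(filename=filename)\n",
"\n",
"# Enum members expose a readable name and an integer value.\n",
"print(filetype.name, filetype.value)\n",
"\n",
"if filetype == FileType.HTML:\n",
"    print(\"This looks like an HTML document.\")\n",
"elif filetype == FileType.UNK:\n",
"    # UNK marks files whose type was not recognized.\n",
"    print(\"The file type could not be determined.\")\n",
"else:\n",
"    print(f\"Detected {filetype.name}.\")"
]
},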
{
"cell_type": "markdown",
"id": "8b7dc938",
"metadata": {},
"source": [
"## Summary Statistics <a id=\"summary\"></a>\n",
"\n",
"`unstructured` also provides utilities for summarizing the filetypes in a directory. You can use them for tasks such as counting files by filetype and checking the average file size by filetype. The following example shows how to count files by filetype in a directory and plot the results as a bar chart."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c53f054e",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"MIME type was message/rfc822. This file type is not currently supported in unstructured.\n"
]
},
{
"data": {
"text/plain": [
"FileType.EML 4\n",
"FileType.TXT 3\n",
"FileType.HTML 2\n",
"FileType.XML 2\n",
"FileType.PDF 2\n",
"FileType.JPG 2\n",
"FileType.UNK 1\n",
"FileType.DOCX 1\n",
"FileType.PPTX 1\n",
"FileType.XLSX 1\n",
"Name: filetype, dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from unstructured.file_utils.exploration import get_directory_file_info\n",
"\n",
"file_info = get_directory_file_info(EXAMPLE_DOCS_DIRECTORY)\n",
"file_info.filetype.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "7e1b3300",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot: >"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"file_info.filetype.value_counts().plot.bar()"
]
},
{
"cell_type": "markdown",
"id": "c756d12e",
"metadata": {},
"source": [
"You can also use this utility to find the average file size of documents in a directory, grouped by filetype."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a600fb0f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>filesize</th>\n",
" </tr>\n",
" <tr>\n",
" <th>filetype</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>FileType.DOCX</th>\n",
" <td>36602.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.EML</th>\n",
" <td>149088.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.HTML</th>\n",
" <td>1228404.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.JPG</th>\n",
" <td>64002.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.PDF</th>\n",
" <td>2429245.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.PPTX</th>\n",
" <td>38412.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.TXT</th>\n",
" <td>619.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.UNK</th>\n",
" <td>1102.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.XLSX</th>\n",
" <td>4765.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.XML</th>\n",
" <td>713.5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" filesize\n",
"filetype \n",
"FileType.DOCX 36602.0\n",
"FileType.EML 149088.5\n",
"FileType.HTML 1228404.0\n",
"FileType.JPG 64002.5\n",
"FileType.PDF 2429245.0\n",
"FileType.PPTX 38412.0\n",
"FileType.TXT 619.0\n",
"FileType.UNK 1102.0\n",
"FileType.XLSX 4765.0\n",
"FileType.XML 713.5"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"file_info.groupby(\"filetype\").mean(numeric_only=True)"
]
},
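{
"cell_type": "markdown",
"id": "7d40c8e2",
"metadata": {},
"source": [
"Since the result is an ordinary pandas `DataFrame`, other aggregations work the same way. As a sketch, the next cell reports the count, mean, and maximum file size for each filetype in a single table. It relies only on the `filetype` and `filesize` columns visible in the output above and on the standard pandas `groupby(...).agg(...)` method."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a9f6b13",
"metadata": {},
"outputs": [],
"source": [
"# Number of files, mean size, and largest size per filetype.\n",
"file_info.groupby(\"filetype\")[\"filesize\"].agg([\"count\", \"mean\", \"max\"])"
]
},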
{
"cell_type": "markdown",
"id": "870ede91",
"metadata": {},
"source": [
"If you want to pass in a list of filenames instead of a directory, use `get_file_info` instead of `get_directory_file_info`, as seen in the workflow below:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e5e3a24d",
"metadata": {},
"outputs": [],
"source": [
"from unstructured.file_utils.exploration import get_file_info\n",
"\n",
"filenames = [os.path.join(EXAMPLE_DOCS_DIRECTORY, f) for f in os.listdir(EXAMPLE_DOCS_DIRECTORY)]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d8e59472",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-html.html',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/example-10k.html',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/factbook.xml',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-header.eml',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake.docx',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-image-embedded.eml',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-text.txt',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper-fast.pdf',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/email-with-image.eml',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper-fast.jpg',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-power-point.pptx',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email.txt',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/README.md',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/factbook.xsl',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-excel.xlsx',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email.eml',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper.pdf',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-attachment.eml',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/example.jpg']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"filenames"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "cb0add28",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"MIME type was message/rfc822. This file type is not currently supported in unstructured.\n"
]
},
{
"data": {
"text/plain": [
"FileType.EML 4\n",
"FileType.TXT 3\n",
"FileType.HTML 2\n",
"FileType.XML 2\n",
"FileType.PDF 2\n",
"FileType.JPG 2\n",
"FileType.UNK 1\n",
"FileType.DOCX 1\n",
"FileType.PPTX 1\n",
"FileType.XLSX 1\n",
"Name: filetype, dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"file_info = get_file_info(filenames=filenames)\n",
"file_info.filetype.value_counts()"
]
},
364"cell_type": "code",
365"execution_count": null,
366"id": "ac4473b0",
367"metadata": {},
368"outputs": [],
369"source": []
370}
371],
372"metadata": {
373"kernelspec": {
374"display_name": "Python 3 (ipykernel)",
375"language": "python",
376"name": "python3"
377},
378"language_info": {
379"codemirror_mode": {
380"name": "ipython",
381"version": 3
382},
383"file_extension": ".py",
384"mimetype": "text/x-python",
385"name": "python",
386"nbconvert_exporter": "python",
387"pygments_lexer": "ipython3",
388"version": "3.8.13"
389}
390},
391"nbformat": 4,
392"nbformat_minor": 5
393}
394