{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "99b955e5",
   "metadata": {},
   "source": [
    "# Intro to Bricks\n",
    "\n",
    "The goal of this notebook is to introduce you to the concept of bricks. Bricks are functions that live in `unstructured` and are the primary public API for the library. There are three types of bricks in `unstructured`, corresponding to the different stages of document pre-processing: partitioning, cleaning, and staging. At the conclusion of this notebook, you should be able to do the following:\n",
    "\n",
    "- [Extract content from a document using partitioning bricks](#partition)\n",
    "- [Remove unwanted content from document elements using cleaning bricks](#cleaning)\n",
    "- [Prepare data for downstream use cases using staging bricks](#staging)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "3908be82",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pathlib\n",
    "\n",
    "DIRECTORY = os.path.abspath(\"\")\n",
    "EXAMPLE_DOCS_DIRECTORY = os.path.join(DIRECTORY, \"..\", \"..\", \"example-docs\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0842e87",
   "metadata": {},
   "source": [
    "## Partitioning bricks  <a id=\"partition\"></a>\n",
    "\n",
    "Partitioning bricks in `unstructured` allow users to extract structured content from a raw unstructured document. As we covered in the [core concepts notebook](https://github.com/Unstructured-IO/unstructured/blob/main/examples/training/0-Core%20Concepts.ipynb), partitioning bricks break a document down into elements such as `Title`, `NarrativeText`, and `ListItem`, enabling users to decide what content they'd like to keep for their particular application. If you're training a summarization model, for example, you may only be interested in `NarrativeText`.\n",
    "\n",
    "The easiest way to partition documents in `unstructured` is to use the `partition` brick. When you call `partition`, `unstructured` uses `libmagic` to detect the file type automatically and invokes the appropriate document-specific partitioning function. As shown in the examples below, `partition` accepts both filenames and file-like objects as input. It also accepts some optional kwargs. For example, if you set `include_page_breaks=True`, the output will include `PageBreak` elements if the file type supports it. See the\n",
    "[`unstructured` documentation](https://unstructured-io.github.io/unstructured/bricks.html#partition) for full details on available options."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "8bbb73c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from unstructured.partition.auto import partition\n",
    "\n",
    "filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, \"layout-parser-paper-fast.pdf\")\n",
    "elements = partition(filename=filename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5319593c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "LayoutParser : A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis\n",
      "\n",
      "Zejiang Shen 1 ( (ea)\n",
      " ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5\n",
      "\n",
      "Allen Institute for AI shannons@allenai.org\n",
      "\n",
      "Brown University ruochen zhang@brown.edu\n",
      "\n",
      "Harvard University { melissadell,jacob carlson } @fas.harvard.edu\n",
      "\n",
      "University of Washington bcgl@cs.washington.edu\n",
      "\n",
      "University of Waterloo w\n",
      "\n",
      "li@uwaterloo.ca\n",
      "\n",
      "Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model conﬁgurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going eﬀorts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser , an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io\n",
      "\n",
      "Keywords: Document Image Analysis · Deep Learning · Layout Analysis · Character Recognition · Open Source library · Toolkit.\n"
     ]
    }
   ],
   "source": [
    "print(\"\\n\\n\".join([str(el) for el in elements][:10]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "8de9bee1",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(filename, \"rb\") as f:\n",
    "    elements = partition(file=f, include_page_breaks=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "75c6c73c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "University of Washington bcgl@cs.washington.edu\n",
      "\n",
      "University of Waterloo w\n",
      "\n",
      "li@uwaterloo.ca\n",
      "\n",
      "Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model conﬁgurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going eﬀorts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser , an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io\n",
      "\n",
      "Keywords: Document Image Analysis · Deep Learning · Layout Analysis · Character Recognition · Open Source library · Toolkit.\n",
      "\n",
      "Introduction\n",
      "\n",
      "Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classiﬁcation [11,\n",
      "\n",
      "<PAGE BREAK>\n",
      "\n",
      "37], layout detection [38, 22], table detection [26], and scene text detection [4]. A generalized learning-based framework dramatically reduces the need for the manual speciﬁcation of complicated rules, which is the status quo with traditional methods. DL has the potential to transform DIA pipelines and beneﬁt a broad spectrum of large-scale document digitization projects.\n",
      "\n",
      "However, there are several practical diﬃculties for taking advantages of re- cent advances in DL-based methods: 1) DL models are notoriously convoluted for reuse and extension. Existing models are developed using distinct frame- works like TensorFlow [1] or PyTorch [24], and the high-level parameters can be obfuscated by implementation details [8]. It can be a time-consuming and frustrating experience to debug, reproduce, and adapt existing models for DIA, and many researchers who would beneﬁt the most from using these methods lack the technical background to implement them from scratch. 2) Document images contain diverse and disparate patterns across domains, and customized training is often required to achieve a desirable detection accuracy. Currently there is no full-ﬂedged infrastructure for easily curating the target document image datasets and ﬁne-tuning or re-training the models. 3) DIA usually requires a sequence of models and other processing to obtain the ﬁnal outputs. Often research teams use DL models and then perform further document analyses in separate processes, and these pipelines are not documented in any central location (and often not documented at all). This makes it diﬃcult for research teams to learn about how full pipelines are implemented and leads them to invest signiﬁcant resources in reinventing the DIA wheel .\n"
     ]
    }
   ],
   "source": [
    "print(\"\\n\\n\".join([str(el) for el in elements][5:15]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3a8e7f4",
   "metadata": {},
   "source": [
    "The `unstructured` library also includes partitioning bricks targeted at specific document types. The `partition` brick uses these document-specific partitioning bricks under the hood. There are a few reasons you may want to use a document-specific partitioning brick instead of `partition`:\n",
    "\n",
    "1. If you already know the document type, filetype detection is unnecessary. Using the document-specific brick directly will make your program run faster.\n",
    "2. Fewer dependencies. You don't need to install `libmagic` for filetype detection if you're only using document-specific bricks.\n",
    "3. Additional features. The API for `partition` is the least common denominator for all document types. Certain document-specific bricks include extra features that you may want to take advantage of. For example, `partition_html` allows you to pass in a URL so you don't have to store the `.html` file locally.\n",
    "\n",
    "Currently, the partitioning bricks we support in `unstructured` are:\n",
    "\n",
    "- `partition_docx`\n",
    "    - Works on `.docx` files. Does not yet work on older `.doc` files.\n",
    "- `partition_pptx`\n",
    "    - Works on `.pptx` files. Does not yet work on older `.ppt` files.\n",
    "- `partition_html`\n",
    "    - Works on `.html` files.\n",
    "    - Can pass in the HTML document as a string using the `text` kwarg.\n",
    "    - Can pass in the URL for an HTML document using the `url` kwarg.\n",
    "- `partition_pdf`\n",
    "    - Works on `.pdf` files. Partitions the document using a document image analysis model.\n",
    "    - If `url=None`, the model will run locally. If you pass in a URL, the brick will make a network call\n",
    "      to a hosted model inference API. There is also an optional `token` kwarg for passing in an authentication token.\n",
    "- `partition_image`\n",
    "    - Has the same API as `partition_pdf`. Works on `.jpg` and `.png` files.\n",
    "- `partition_email`\n",
    "    - Works on `.eml` files. Most common email clients (e.g., Gmail, Microsoft Outlook) allow users to export emails in\n",
    "      `.eml` format.\n",
    "    - Parses the `text/html` content from the email by default. If you set `content_source=\"text/plain\"`, the brick will parse the plain text instead.\n",
    "    - If you set `include_headers=True`, the output will include information from the email header.\n",
    "    - You can pass in the email as a string using the `text` kwarg.\n",
    "- `partition_text`\n",
    "    - Works on plain text files.\n",
    "    - You can pass in the document as a string using the `text` kwarg.\n",
    "\n",
    "\n",
    "See the [`unstructured` docs](https://unstructured-io.github.io/unstructured/bricks.html#partition-docx) for a full list of options. Below we see an example of how to partition a document directly with the URL using the `partition_html` function.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "b7ce3fa8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from unstructured.partition.html import partition_html\n",
    "\n",
    "url = \"https://www.cnn.com/2023/01/30/sport/empire-state-building-green-philadelphia-eagles-spt-intl/index.html\"\n",
    "elements = partition_html(url=url)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "ab6d9307",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CNN\n",
      "         —\n",
      "\n",
      "The Empire State Building was lit in green and white to celebrate the Philadelphia Eagles' victory in the NFC Championship game on Sunday — a decision that's sparked a bit of a backlash in the Big Apple.\n",
      "\n",
      "The Eagles advanced to the Super Bowl for the first time since 2018 after defeating the San Francisco 49ers 31-7, and the Empire State Building later tweeted how it was marking the occasion.\n",
      "\n",
      "Fly @Eagles Fly! We're going Green and White in honor of the Eagles NFC Championship Victory. pic.twitter.com/RNiwbCIkt7— Empire State Building (@EmpireStateBldg)\n",
      "\n",
      "January 29, 2023\n",
      "\n",
      "But given the fierce rivalry between the Eagles and the New York Giants, who the Super Bowl-bound team had comfortably defeated in the previous round of the NFL Playoffs, many were left questioning the move.\n",
      "\n",
      "Did y'all lose a bet, ESPN contributor Mina Kimes asked in response to the tweet, while Giants running back Matt Breida also expressed his disbelief.\n",
      "\n",
      "SMHð¤¦ð¾âï¸— Matt Breida (@MattBreida)\n",
      "\n",
      "January 30, 2023\n",
      "\n",
      "As the representative for the Empire State Building, and a diehard Giants fan, let me be on the record saying that this is absolutely ridiculous, said New York City councilman Keith Powers.\n",
      "\n",
      "The Giants' Twitter account also acknowledged the divisive decision, writing: I'm just here for the comments.\n",
      "\n",
      "The Empire State Building, whose original tweet honoring the Eagles was viewed nearly 30 million at the time of writing, said the color switch hurt us more than it hurt you — but only after mocking another tweet calling the New York landmark lame.\n",
      "\n",
      "The building was later lit in red to celebrate the Kansas City Chiefs' AFC Championship win against the Cincinnati Bengals.\n",
      "\n",
      "In Philadelphia, meanwhile, Eagles fans poured onto the streets on Sunday night. Large crowds gathered in the city as people climbed up light posts, street signs, and on top of a bus stop canopy.\n",
      "\n",
      "The city announced street closures and vehicle restrictions in Philadelphia's city center due to Eagles celebratory activity between 8th to 20th streets and Race to Lombard streets, the city's Office of Emergency Management tweeted on Sunday night.\n",
      "\n",
      "Philadelphians, let's celebrate joyously, safely, and respectfully and show the same love we have for our team to our city. Go Birds! Mayor Jim Kenney tweeted.\n",
      "\n",
      "The Eagles and the Chiefs face off in Super Bowl LVII on February 12.\n"
     ]
    }
   ],
   "source": [
    "print(\"\\n\\n\".join([str(el) for el in elements]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e51c26ed",
   "metadata": {},
   "source": [
    "## Cleaning bricks <a id=\"cleaning\"></a>\n",
    "\n",
    "As part of data preparation for an NLP model, it's common to need to clean up your data prior to passing it into the model. If there's unwanted content in your output, it could impact the quality of your NLP model. To help with this, the `unstructured` library includes cleaning bricks to help users sanitize output before sending it to downstream applications. You can check out our [documentation](https://unstructured-io.github.io/unstructured/bricks.html#cleaning) for a full list of cleaning bricks.\n",
    "\n",
    "Some cleaning bricks apply automatically. In the example above, the output `Philadelphia Eaglesâ\\x80\\x99 victory` automatically gets converted to `Philadelphia Eagles' victory` in `partition_html` using the `replace_unicode_quotes` cleaning brick. You can see how that works in the code snippet below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "a1c4ba19",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"Philadelphia Eagles' victory\""
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from unstructured.cleaners.core import replace_unicode_quotes\n",
    "\n",
    "replace_unicode_quotes(\"Philadelphia Eaglesâ\\x80\\x99 victory\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db2391ce",
   "metadata": {},
   "source": [
    "Document elements in `unstructured` include an `apply` method that allows you to apply a text cleaning function to the document element without instantiating a new element. The `apply` method expects a callable that takes a string as input and produces another string as output. In the example below, we invoke the `replace_unicode_quotes` cleaning brick using the `apply` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "215c4b35",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Philadelphia Eagles' victory\n"
     ]
    }
   ],
   "source": [
    "from unstructured.documents.elements import Text\n",
    "\n",
    "element = Text(\"Philadelphia Eaglesâ\\x80\\x99 victory\")\n",
    "element.apply(replace_unicode_quotes)\n",
    "print(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "358e149b",
   "metadata": {},
   "source": [
    "Since a cleaning brick is just a `str -> str` function, users can also easily write their own cleaning bricks for custom data preparation tasks. In the example below, we partition a Russian offensive campaign assessment from the Institute for the Study of War (ISW) and remove the citations, which are not natural language text that we want to include for model training purposes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "ae048814",
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023\"\n",
    "elements = partition_html(url=url)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "4211194b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from unstructured.documents.elements import NarrativeText\n",
    "\n",
    "narrative_text = [el for el in elements if isinstance(el, NarrativeText)][2:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "3abd4280",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "remove_citations = lambda text: re.sub(r\"\\[\\d{1,3}\\]\", \"\", text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "3327feda",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'[1]\\xa0Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "narrative_text[0].text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "02eb95ae",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\xa0Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "narrative_text[0].apply(remove_citations)\n",
    "narrative_text[0].text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "b755cc86",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Russian officials continue to propose measures to prepare Russia’s military industry for a protracted war in Ukraine while also likely setting further conditions for sanctions evasion.\\xa0Russian Prime Minister Mikhail Mishustin stated on February 8 that the Russian government will subsidize investment projects for the modernization of enterprises operating in the interests of the Russian military and will allocate significant funds for manufacturing new military equipment.\\xa0Mishustin also stated that the Russian government would extend benefits to Russian entrepreneurs who support the Russian military, including extended payment periods on rented federal property.\\xa0The Kremlin likely intends these measures to augment its overarching effort to gradually prepare Russia’s military industry for a protracted war in Ukraine while avoiding a wider economic mobilization that would create further domestic economic disruptions and corresponding discontent.'"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "narrative_text[6].apply(remove_citations)\n",
    "narrative_text[6].text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "578a6d10",
   "metadata": {},
   "source": [
    "As we can see, the citations have been removed. After removing the citations, we still have extra whitespace represented by `\\xa0`. We can clean that up using the `clean_extra_whitespace` cleaning brick."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "7d65d7c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from unstructured.cleaners.core import clean_extra_whitespace\n",
    "\n",
    "narrative_text[0].apply(clean_extra_whitespace)\n",
    "narrative_text[6].apply(clean_extra_whitespace)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "a37f9bad",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.'"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "narrative_text[0].text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "25245bc1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Russian officials continue to propose measures to prepare Russia’s military industry for a protracted war in Ukraine while also likely setting further conditions for sanctions evasion. Russian Prime Minister Mikhail Mishustin stated on February 8 that the Russian government will subsidize investment projects for the modernization of enterprises operating in the interests of the Russian military and will allocate significant funds for manufacturing new military equipment. Mishustin also stated that the Russian government would extend benefits to Russian entrepreneurs who support the Russian military, including extended payment periods on rented federal property. The Kremlin likely intends these measures to augment its overarching effort to gradually prepare Russia’s military industry for a protracted war in Ukraine while avoiding a wider economic mobilization that would create further domestic economic disruptions and corresponding discontent.'"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "narrative_text[6].text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b99ec0b",
   "metadata": {},
   "source": [
    "Now the text is clean and formatted how we'd like it for our model training application. The best way to invoke a series of cleaning bricks is to loop over the elements and call `apply` with all of your bricks. For example, we can apply the cleaning bricks to all of the elements from the ISW article with the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "0218cc7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "for element in narrative_text:\n",
    "    element.apply(remove_citations)\n",
    "    element.apply(clean_extra_whitespace)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ecb360d",
   "metadata": {},
   "source": [
    "## Staging bricks <a id=\"staging\"></a>\n",
    "\n",
    "The final step in the process is to prepare your data for ingestion into downstream systems. We include staging bricks in the `unstructured` package to help with that. Staging bricks accept a list of document elements as input and return data formatted for the target system; in the example below, that is a list of dictionaries. Here we prepare our narrative text samples for ingestion into Label Studio using the `stage_for_label_studio` brick. We can take this data and upload it directly into Label Studio to quickly get started with an NLP labeling task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "21819f56",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\n",
      "    {\n",
      "        \"data\": {\n",
      "            \"text\": \"Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.\",\n",
      "            \"ref_id\": \"c311a941b80429f2ba0b6a2137f7315e\"\n",
      "        }\n",
      "    },\n",
      "    {\n",
      "        \"data\": {\n",
      "            \"text\": \"Russian military command additionally appears to have fully committed elements of several conventional divisions to decisive offensive operations along the Svatove-Kreminna line, as ISW previously reported.\",\n",
      "            \"ref_id\": \"79748ec84695bd88f41b13e98eae53be\"\n",
      "        }\n",
      "    }\n",
      "]\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "from unstructured.staging.label_studio import stage_for_label_studio\n",
    "\n",
    "output = stage_for_label_studio(narrative_text)\n",
    "print(json.dumps(output[:2], indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94e74c2c",
   "metadata": {},
   "source": [
    "Currently, `unstructured` supports the following staging bricks:\n",
    "\n",
    "- `stage_for_argilla`\n",
    "- `stage_for_transformers`\n",
    "- `stage_for_label_studio`\n",
    "- `stage_for_prodigy`\n",
    "- `stage_for_label_box`\n",
    "- `stage_for_datasaur`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54477e73",
   "metadata": {},
   "source": [
    "Also included among the staging bricks are functions for converting a list of document elements to a dictionary, CSV, or dataframe. These helper functions are useful if you just want the text and don't need the data pre-formatted for a particular downstream tool. These functions include:\n",
    "\n",
    "- `convert_to_isd`\n",
    "- `convert_to_isd_csv`\n",
    "- `convert_to_dataframe`\n",
    "\n",
    "The \"ISD\" in these functions refers to \"initial structured data\", our standard dictionary representation of text elements. Here we convert the list of elements to a dictionary and a dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "6d5cf8cf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\n",
      "    {\n",
      "        \"text\": \"Skip to main content\",\n",
      "        \"type\": \"Title\"\n",
      "    },\n",
      "    {\n",
      "        \"text\": \"(function(d){\\n  var js, id = 'facebook-jssdk'; if (d.getElementById(id)) {return;}\\n  js = d.createElement('script'); js.id = id; js.async = true;\\n  js.src = \\\"//connect.facebook.net/en_US/all.js#xfbml=1\\\";\\n  d.getElementsByTagName('head')[0].appendChild(js);\\n}(document));\",\n",
      "        \"type\": \"NarrativeText\"\n",
      "    }\n",
      "]\n"
     ]
    }
   ],
   "source": [
    "from unstructured.staging.base import convert_to_isd\n",
    "\n",
    "isd = convert_to_isd(elements)\n",
    "print(json.dumps(isd[:2], indent=4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "706cc9c7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>type</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Title</td>\n",
       "      <td>Skip to main content</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NarrativeText</td>\n",
       "      <td>(function(d){\\n  var js, id = 'facebook-jssdk'...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Title</td>\n",
       "      <td>Search form</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ListItem</td>\n",
       "      <td>Home</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ListItem</td>\n",
       "      <td>Who We Are</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            type                                               text\n",
       "0          Title                               Skip to main content\n",
       "1  NarrativeText  (function(d){\\n  var js, id = 'facebook-jssdk'...\n",
       "2          Title                                        Search form\n",
       "3       ListItem                                               Home\n",
       "4       ListItem                                         Who We Are"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from unstructured.staging.base import convert_to_dataframe\n",
    "\n",
    "df = convert_to_dataframe(elements)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e572f082",
   "metadata": {},
   "source": [
    "If you have a dictionary in ISD format, you can convert back to a list of elements using the `isd_to_elements` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "b2c1282e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<unstructured.documents.elements.Title at 0x28bf910a0>,\n",
       " <unstructured.documents.elements.NarrativeText at 0x28bf91460>]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from unstructured.staging.base import isd_to_elements\n",
    "\n",
    "isd_to_elements(isd[:2])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}