{"id":539,"date":"2015-12-20T17:39:54","date_gmt":"2015-12-20T22:39:54","guid":{"rendered":"http:\/\/csclub.uwaterloo.ca\/~k55chen\/?p=539"},"modified":"2017-04-23T13:34:21","modified_gmt":"2017-04-23T17:34:21","slug":"use-python-to-extract-intelligence-indexing-fields-in-factiva-articles","status":"publish","type":"post","link":"https:\/\/www.kaichen.work\/?p=539","title":{"rendered":"Use Python to extract Intelligence Indexing fields in Factiva articles"},"content":{"rendered":"<p>First of all, I acknowledge that I benefited a lot from Neal Caren&#8217;s blog post <a href=\"http:\/\/nealcaren.web.unc.edu\/cleaning-up-lexisnexis-files\/\" target=\"_blank\">Cleaning up LexisNexis Files<\/a>. Thanks Neal.<\/p>\n<p>Factiva (as well as LexisNexis Academic) is a comprehensive repository of newspapers, magazines, and other news articles. I first describe the data elements of a Factiva news article. Then I explain the steps to extract those data elements and write them into a more machine-readable table using Python.<\/p>\n<p><strong>Data Elements in Factiva Article<\/strong><\/p>\n<p>Each news article in Factiva, no matter what it looks like, contains a number of data elements. In Factiva&#8217;s terminology, those data elements are called Intelligence Indexing Fields. 
The following table lists the label and name for each data element (or field) along with what is contained in each:<\/p>\n\n<table id=\"tablepress-5\" class=\"tablepress tablepress-id-5\">\n<thead>\n<tr class=\"row-1\">\n\t<th class=\"column-1\">Field Label<\/th><th class=\"column-2\">Field Name<\/th><th class=\"column-3\">What It Contains<\/th>\n<\/tr>\n<\/thead>\n<tbody class=\"row-striping\">\n<tr class=\"row-2\">\n\t<td class=\"column-1\">HD<\/td><td class=\"column-2\">Headline<\/td><td class=\"column-3\">Headline<\/td>\n<\/tr>\n<tr class=\"row-3\">\n\t<td class=\"column-1\">CR<\/td><td class=\"column-2\">Credit Information<\/td><td class=\"column-3\">Credit Information (Example: Associated Press)<\/td>\n<\/tr>\n<tr class=\"row-4\">\n\t<td class=\"column-1\">WC<\/td><td class=\"column-2\">Word Count<\/td><td class=\"column-3\">Number of words in document<\/td>\n<\/tr>\n<tr class=\"row-5\">\n\t<td class=\"column-1\">PD<\/td><td class=\"column-2\">Publication Date<\/td><td class=\"column-3\">Publication Date<\/td>\n<\/tr>\n<tr class=\"row-6\">\n\t<td class=\"column-1\">ET<\/td><td class=\"column-2\">Publication Time<\/td><td class=\"column-3\">Publication Time<\/td>\n<\/tr>\n<tr class=\"row-7\">\n\t<td class=\"column-1\">SN<\/td><td class=\"column-2\">Source Name<\/td><td class=\"column-3\">Source Name<\/td>\n<\/tr>\n<tr class=\"row-8\">\n\t<td class=\"column-1\">SC<\/td><td class=\"column-2\">Source Code<\/td><td class=\"column-3\">Source Code<\/td>\n<\/tr>\n<tr class=\"row-9\">\n\t<td class=\"column-1\">ED<\/td><td class=\"column-2\">Edition<\/td><td class=\"column-3\">Edition of publication (Example: Final)<\/td>\n<\/tr>\n<tr class=\"row-10\">\n\t<td class=\"column-1\">PG<\/td><td class=\"column-2\">Page<\/td><td class=\"column-3\">Page on which article appeared (Note: Page-One Story is a Dow Jones Intelligent Indexing\u2122 term)<\/td>\n<\/tr>\n<tr class=\"row-11\">\n\t<td class=\"column-1\">LA<\/td><td class=\"column-2\">Language<\/td><td 
class=\"column-3\">Language in which the document is written<\/td>\n<\/tr>\n<tr class=\"row-12\">\n\t<td class=\"column-1\">CY<\/td><td class=\"column-2\">Copyright<\/td><td class=\"column-3\">Copyright<\/td>\n<\/tr>\n<tr class=\"row-13\">\n\t<td class=\"column-1\">LP<\/td><td class=\"column-2\">Lead Paragraph<\/td><td class=\"column-3\">First two paragraphs of an article<\/td>\n<\/tr>\n<tr class=\"row-14\">\n\t<td class=\"column-1\">TD<\/td><td class=\"column-2\">Text<\/td><td class=\"column-3\">Text following the lead paragraphs<\/td>\n<\/tr>\n<tr class=\"row-15\">\n\t<td class=\"column-1\">CT<\/td><td class=\"column-2\">Contact<\/td><td class=\"column-3\">Contact name to obtain additional information<\/td>\n<\/tr>\n<tr class=\"row-16\">\n\t<td class=\"column-1\">RF<\/td><td class=\"column-2\">Reference<\/td><td class=\"column-3\">Notes associated with a document<\/td>\n<\/tr>\n<tr class=\"row-17\">\n\t<td class=\"column-1\">CO<\/td><td class=\"column-2\">Dow Jones Ticker Symbol<\/td><td class=\"column-3\">Dow Jones Ticker Symbol<\/td>\n<\/tr>\n<tr class=\"row-18\">\n\t<td class=\"column-1\">IN<\/td><td class=\"column-2\">Industry Code<\/td><td class=\"column-3\">Dow Jones Intelligent Indexing\u2122 Industry Code<\/td>\n<\/tr>\n<tr class=\"row-19\">\n\t<td class=\"column-1\">NS<\/td><td class=\"column-2\">Subject Code<\/td><td class=\"column-3\">Dow Jones Intelligent Indexing\u2122 Subject Code<\/td>\n<\/tr>\n<tr class=\"row-20\">\n\t<td class=\"column-1\">RE<\/td><td class=\"column-2\">Region Code<\/td><td class=\"column-3\">Dow Jones Intelligent Indexing\u2122 Region Code<\/td>\n<\/tr>\n<tr class=\"row-21\">\n\t<td class=\"column-1\">IPC<\/td><td class=\"column-2\">Information Provider Code<\/td><td class=\"column-3\">Information Provider Code<\/td>\n<\/tr>\n<tr class=\"row-22\">\n\t<td class=\"column-1\">IPD<\/td><td class=\"column-2\">Information Provider Descriptors<\/td><td class=\"column-3\">Information Provider Descriptors<\/td>\n<\/tr>\n<tr 
class=\"row-23\">\n\t<td class=\"column-1\">PUB<\/td><td class=\"column-2\">Publisher Name<\/td><td class=\"column-3\">Publisher of information<\/td>\n<\/tr>\n<tr class=\"row-24\">\n\t<td class=\"column-1\">AN<\/td><td class=\"column-2\">Accession Number<\/td><td class=\"column-3\">Unique Factiva.com identification number assigned to each document<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<!-- #tablepress-5 from cache -->\n<p>Please note that not every news article contains all of those data elements, and that the table may not list all data elements used by Factiva (Factiva may make updates). Depending on which display option you select when downloading news articles from Factiva, you may not be able to see certain data elements. But they are there, and Factiva uses them to organize and structure its proprietary news article data.<\/p>\n<p><strong>How to Extract Data Elements in Factiva Article<\/strong><\/p>\n<p><a href=\"http:\/\/www.kaikaichen.com\/wp-content\/uploads\/2015\/12\/flow.png\" rel=\"attachment wp-att-562\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-562\" src=\"http:\/\/www.kaikaichen.com\/wp-content\/uploads\/2015\/12\/flow-300x74.png\" alt=\"flow\" width=\"600\" height=\"148\" srcset=\"https:\/\/www.kaichen.work\/wp-content\/uploads\/2015\/12\/flow-300x74.png 300w, https:\/\/www.kaichen.work\/wp-content\/uploads\/2015\/12\/flow-768x190.png 768w, https:\/\/www.kaichen.work\/wp-content\/uploads\/2015\/12\/flow.png 1023w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><\/a><\/p>\n<p>You can follow the three steps outlined in the above diagram to extract the data elements of news articles for further processing (e.g., calculating the tone of the full text, represented by the LP and TD elements; or grouping by news subject, i.e., by the NS element). 
I explain them one by one as follows.<\/p>\n<p><em><strong>Step 1: Download Articles from Factiva in RTF Format<\/strong><\/em><\/p>\n<p>Downloading a large number of news articles from Factiva is painful: it is technically difficult\u00a0to download articles in an automated fashion; you can only download 100 articles at a time, and those 100 articles cannot exceed the word count limit of 180,000. As a result, gathering tens of thousands of news articles requires a lot of tedious work. While I can do nothing about either issue in this post, I can say a bit more about them.<\/p>\n<p>Firstly, you may see some people discuss methods for automatic downloading (a so-called &#8220;web scraping&#8221; technique; see <a href=\"http:\/\/thiagomarzagao.com\/resources\/\" target=\"_blank\">here<\/a>). However, this requires more hacking\u00a0after\u00a0Factiva introduced CAPTCHA to determine whether or not the user is a human. You may not be familiar with the term &#8220;CAPTCHA&#8221;, but you have surely encountered the situation where you are asked to type the characters or numbers shown in an image before you can download a file or go to the next webpage. That is CAPTCHA. Both Factiva and LexisNexis Academic have introduced CAPTCHA to prohibit robotic downloading. Though CAPTCHA is not unbeatable, defeating it requires advanced techniques.<\/p>\n<p>Secondly, the Factiva licence expressly prohibits data mining. However, the licence does not clearly define what constitutes data mining. I was informed that downloading a large number of articles in a short period of time would be red-flagged as data mining. The threshold speed set by Factiva is low, and any trained and adept person can easily beat it. If you are red-flagged by Factiva, things could get ugly. So do not download too fast, even if this slows down your research.<\/p>\n<p>Let&#8217;s get back to the topic. 
When you manually download news articles from Factiva, the most important thing is to select the right display option. Please select the third one, Full Article\/Report plus Indexing, as shown in the following screenshot:<\/p>\n<p><a href=\"http:\/\/www.kaikaichen.com\/wp-content\/uploads\/2015\/12\/Factiva.png\" rel=\"attachment wp-att-573\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-573\" src=\"http:\/\/www.kaikaichen.com\/wp-content\/uploads\/2015\/12\/Factiva-300x142.png\" alt=\"Factiva\" width=\"600\" height=\"284\" srcset=\"https:\/\/www.kaichen.work\/wp-content\/uploads\/2015\/12\/Factiva-300x142.png 300w, https:\/\/www.kaichen.work\/wp-content\/uploads\/2015\/12\/Factiva-768x363.png 768w, https:\/\/www.kaichen.work\/wp-content\/uploads\/2015\/12\/Factiva-1024x484.png 1024w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><\/a><\/p>\n<p>Then you have to download the articles in RTF \u2013 Article Format, as shown in the following screenshot:<\/p>\n<p><a href=\"http:\/\/www.kaikaichen.com\/wp-content\/uploads\/2015\/12\/Factiva2.png\" rel=\"attachment wp-att-575\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-575\" src=\"http:\/\/www.kaikaichen.com\/wp-content\/uploads\/2015\/12\/Factiva2-300x145.png\" alt=\"Factiva2\" width=\"600\" height=\"290\" srcset=\"https:\/\/www.kaichen.work\/wp-content\/uploads\/2015\/12\/Factiva2-300x145.png 300w, https:\/\/www.kaichen.work\/wp-content\/uploads\/2015\/12\/Factiva2-768x371.png 768w, https:\/\/www.kaichen.work\/wp-content\/uploads\/2015\/12\/Factiva2-1024x495.png 1024w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><\/a><\/p>\n<p>After the download is completed, you will get an RTF document. 
If you open it, you will see that the news articles look like this:<\/p>\n<p><a href=\"http:\/\/www.kaikaichen.com\/wp-content\/uploads\/2015\/12\/Factiva3.png\" rel=\"attachment wp-att-607\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-607\" src=\"http:\/\/www.kaikaichen.com\/wp-content\/uploads\/2015\/12\/Factiva3-300x182.png\" alt=\"Factiva3\" width=\"600\" height=\"364\" srcset=\"https:\/\/www.kaichen.work\/wp-content\/uploads\/2015\/12\/Factiva3-300x182.png 300w, https:\/\/www.kaichen.work\/wp-content\/uploads\/2015\/12\/Factiva3-768x466.png 768w, https:\/\/www.kaichen.work\/wp-content\/uploads\/2015\/12\/Factiva3-1024x621.png 1024w, https:\/\/www.kaichen.work\/wp-content\/uploads\/2015\/12\/Factiva3.png 1738w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><\/a><\/p>\n<p>The next step is to convert RTF to plain TXT, because Python can process TXT documents more easily. After Python finishes its job, <strong>the final product will be a table: each row of the table represents a news article, and each column of the table is a data element<\/strong>.<\/p>\n<p><em><strong>Step 2: Convert RTF to TXT<\/strong><\/em><\/p>\n<p>This can surely be done in Python, but so far I have not written a Python program for it; I will fill this &#8220;hole&#8221; when I have time. For my research, I simply use TextEdit, the default text editor shipped with Mac OS: I select Format \u2013 Make Plain Text from the menu bar, and then save the document in TXT format. You can automate this step using Automator in Mac OS.<\/p>\n<p><em><strong>Step 3: Extract Data Elements and Save to a Table<\/strong><\/em><\/p>\n<p>This is where Python does the dirty work. Before you run the Python program, save it in the directory where you put all the plain TXT documents created in Step 2. 
This program will:<\/p>\n<ol>\n<li>Read in each TXT document;<\/li>\n<li>Extract data elements of each article and write them to an SQLite database;<\/li>\n<li>Export data to a CSV file for easy processing in other software such as Stata.<\/li>\n<\/ol>\n<p>I introduce an intermediate step that writes data to an SQLite database simply because this facilitates further manipulation of the news article data in Python for other purposes. Of course, you can directly write data to a CSV file.<\/p>\n<pre class=\"lang:python decode:true EnlighterJSRAW \">import glob\r\nimport re\r\nimport sqlite3\r\nimport csv\r\n\r\ndef parser(file):\r\n\r\n    # Open a TXT file. Store all articles in a list. Each article is an item\r\n    # of the list. Split articles based on the location of a string such as\r\n    # 'Document PRN0000020080617e46h00461'\r\n\r\n    articles = []\r\n    with open(file, 'r') as infile:\r\n        data = infile.read()\r\n    start = re.search(r'\\n HD\\n', data).start()\r\n    for m in re.finditer(r'Document [a-zA-Z0-9]{25}\\n', data):\r\n        end = m.end()\r\n        a = data[start:end].strip()\r\n        a = '\\n   ' + a\r\n        articles.append(a)\r\n        start = end\r\n\r\n    # In each article, find all used Intelligence Indexing field codes. 
Extract\r\n    # content of each used field code, and write to a CSV file.\r\n\r\n    # All field codes (order matters)\r\n    fields = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA', 'CY', 'LP',\r\n              'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC', 'IPD', 'PUB', 'AN']\r\n\r\n    for a in articles:\r\n        used = [f for f in fields if re.search(r'\\n   ' + f + r'\\n', a)]\r\n        unused = [[i, f] for i, f in enumerate(fields) if not re.search(r'\\n   ' + f + r'\\n', a)]\r\n        fields_pos = []\r\n        for f in used:\r\n            f_m = re.search(r'\\n   ' + f + r'\\n', a)\r\n            f_pos = [f, f_m.start(), f_m.end()]\r\n            fields_pos.append(f_pos)\r\n        obs = []\r\n        n = len(used)\r\n        for i in range(0, n):\r\n            used_f = fields_pos[i][0]\r\n            start = fields_pos[i][2]\r\n            if i &lt; n - 1:\r\n                end = fields_pos[i + 1][1]\r\n            else:\r\n                end = len(a)\r\n            content = a[start:end].strip()\r\n            obs.append(content)\r\n        for f in unused:\r\n            obs.insert(f[0], '')\r\n        obs.insert(0, file.split('\/')[-1].split('.')[0])  # insert Company ID, e.g., GVKEY\r\n        # print(obs)\r\n        cur.execute('''INSERT INTO articles\r\n                       (id, hd, cr, wc, pd, et, sn, sc, ed, pg, la, cy, lp, td, ct, rf,\r\n                       co, ina, ns, re, ipc, ipd, pub, an)\r\n                       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,\r\n                       ?, ?, ?, ?, ?, ?, ?, ?)''', obs)\r\n\r\n# Write to SQLite\r\nconn = sqlite3.connect('factiva.db')\r\nwith conn:\r\n    cur = conn.cursor()\r\n    cur.execute('DROP TABLE IF EXISTS articles')\r\n    # Mirror all field codes except changing 'IN' to 'ina' because 'IN' is an invalid column name\r\n    cur.execute('''CREATE TABLE articles\r\n                   (nid integer primary key, id text, hd text, cr text, wc text, pd text,\r\n       
            et text, sn text, sc text, ed text, pg text, la text, cy text, lp text,\r\n                   td text, ct text, rf text, co text, ina text, ns text, re text, ipc text,\r\n                   ipd text, pub text, an text)''')\r\n    for f in glob.glob('*.txt'):\r\n        print(f)\r\n        parser(f)\r\n\r\n# Write to CSV to feed Stata\r\nwith open('factiva.csv', 'w', newline='') as csvfile:\r\n    writer = csv.writer(csvfile)\r\n    with conn:\r\n        cur = conn.cursor()\r\n        cur.execute('SELECT * FROM articles WHERE hd IS NOT NULL')\r\n        colname = [desc[0] for desc in cur.description]\r\n        writer.writerow(colname)\r\n        for obs in cur.fetchall():\r\n            writer.writerow(obs)\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>First of all, I acknowledge that I benefit a lot from Neal Caren&#8217;s blog post Cleaning up LexisNexis Files. Thanks Neal. Factiva (as well as LexisNexis Academic) is a comprehensive repository of newspapers, magazines, and other news articles. 
I first &hellip; <a href=\"https:\/\/www.kaichen.work\/?p=539\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10],"tags":[],"_links":{"self":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts\/539"}],"collection":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=539"}],"version-history":[{"count":68,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts\/539\/revisions"}],"predecessor-version":[{"id":743,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts\/539\/revisions\/743"}],"wp:attachment":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=539"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=539"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=539"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}