First of all, I acknowledge that I benefited a lot from Neal Caren's blog post Cleaning up LexisNexis Files. Thanks, Neal.
Factiva (as well as LexisNexis Academic) is a comprehensive repository of newspapers, magazines, and other news articles. I first describe the data elements of a Factiva news article. Then I explain the steps to extract those data elements and write them into a more machine-readable table using Python.
Data Elements in Factiva Article
Each news article in Factiva, no matter how it looks, contains a number of data elements. In Factiva's terminology, those data elements are called Intelligence Indexing Fields. The following table lists the label and name of each data element (or field), along with what it contains:
Field Label | Field Name | What It Contains |
---|---|---|
HD | Headline | Headline |
CR | Credit Information | Credit Information (Example: Associated Press) |
WC | Word Count | Number of words in document |
PD | Publication Date | Publication Date |
ET | Publication Time | Publication Time |
SN | Source Name | Source Name |
SC | Source Code | Source Code |
ED | Edition | Edition of publication (Example: Final) |
PG | Page | Page on which article appeared (Note: Page-One Story is a Dow Jones Intelligent Indexing™ term) |
LA | Language | Language in which the document is written |
CY | Copyright | Copyright |
LP | Lead Paragraph | First two paragraphs of an article |
TD | Text | Text following the lead paragraphs |
CT | Contact | Contact name to obtain additional information |
RF | Reference | Notes associated with a document |
CO | Dow Jones Ticker Symbol | Dow Jones Ticker Symbol |
IN | Industry Code | Dow Jones Intelligent Indexing™ Industry Code |
NS | Subject Code | Dow Jones Intelligent Indexing™ Subject Code |
RE | Region Code | Dow Jones Intelligent Indexing™ Region Code |
IPC | Information Provider Code | Information Provider Code |
IPD | Information Provider Descriptors | Information Provider Descriptors |
PUB | Publisher Name | Publisher of information |
AN | Accession Number | Unique Factiva.com identification number assigned to each document |
Please note that not every news article contains all of those data elements, and that the table may not list every data element used by Factiva (Factiva may make updates). Depending on which display option you select when downloading news articles from Factiva, certain data elements may not be visible. But they are there, and Factiva uses them to organize and structure its proprietary news article data.
How to Extract Data Elements in Factiva Article
You can follow the three steps below to extract the data elements from news articles for further processing (e.g., calculating the tone of the full text, which is represented by the LP and TD elements together, or grouping articles by news subject, i.e., by the NS element). I explain the steps one by one as follows.
Step 1: Download Articles from Factiva in RTF Format
Downloading a large number of news articles from Factiva is a lot of pain: it is technically difficult to download articles in an automated fashion, and you can only download 100 articles at a time, with those 100 articles also subject to a word-count limit of 180,000. As a result, gathering tens of thousands of news articles requires a lot of tedious work. While I can do nothing about either issue in this post, I can say a bit more about them.
Firstly, you may see some people discuss methods for automated downloading (a so-called "web scraping" technique; see here). However, this needs more hacking now that Factiva has introduced CAPTCHA to determine whether the user is a human. You may not be familiar with the term "CAPTCHA", but you have surely encountered the situation where you are asked to type the characters or numbers shown in an image before you can download a file or go to the next webpage. That is CAPTCHA. Both Factiva and LexisNexis Academic have introduced CAPTCHA to deter robotic downloading. Though CAPTCHA is not unbeatable, beating it requires advanced techniques.
Secondly, the Factiva licence expressly prohibits data mining, but it does not clearly define what constitutes data mining. I was informed that downloading a large number of articles in a short period of time would be flagged as data mining. The threshold speed set by Factiva is low, and any trained and adept person can easily exceed it. If you are flagged by Factiva, things could get ugly. So do not go too fast, even if this slows down your research.
Let's get back to the topic. When you manually download news articles from Factiva, the most important thing is to select the right display option: the third one, Full Article/Report plus Indexing.
Then you have to download the articles in RTF – Article Format.
After the download is complete, you will get an RTF document. If you open it, you will see that each news article is laid out as a sequence of Intelligence Indexing field labels, each on its own line and followed by that field's content.
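To make the layout concrete, here is a made-up excerpt of what one article looks like after conversion to plain text in Step 2 (all values are invented for illustration; note the single space in front of each field label, which the parser in Step 3 relies on):

```text
 HD
 Example Headline Goes Here
 WC
 512 words
 PD
 17 June 2008
 SN
 Example News Source
 LP
 The first two paragraphs of the article ...
 TD
 The text following the lead paragraphs ...
 Document PRN0000020080617e46h00461
```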
The next step is to convert the RTF document to plain TXT, because Python can process TXT documents more easily. After Python finishes its job, the final product will be a table: each row of the table represents one news article, and each column is a data element.
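For instance, the header row of the final CSV produced in Step 3 looks like this (the column names mirror the field labels, with IN renamed to ina; the second row is an invented example, truncated for brevity):

```text
nid,id,hd,cr,wc,pd,et,sn,sc,ed,pg,la,cy,lp,td,ct,rf,co,ina,ns,re,ipc,ipd,pub,an
1,3m,Example Headline Goes Here,Associated Press,512 words,17 June 2008,...
```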
Step 2: Convert RTF to TXT
Well, this can surely be done with Python, but so far I have not written a Python program to do it; I will fill this gap when I have time. For my research, I simply take advantage of TextEdit, the default text editor shipped with macOS: select Format – Make Plain Text from the menu bar, then save the document in TXT format. You can automate this using Automator on macOS.
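If you would rather script the conversion, here is a minimal sketch using the third-party striprtf package (an assumption on my part: install it with pip install striprtf; on macOS, the built-in textutil -convert txt *.rtf command also batch-converts):

```python
# A minimal sketch of Step 2 in Python, assuming `pip install striprtf`.
import glob
from striprtf.striprtf import rtf_to_text

for rtf_file in glob.glob('*.rtf'):
    with open(rtf_file, 'r', encoding='utf-8', errors='ignore') as infile:
        rtf_content = infile.read()
    plain = rtf_to_text(rtf_content)  # strip RTF markup, keep the text
    with open(rtf_file[:-4] + '.txt', 'w', encoding='utf-8') as outfile:
        outfile.write(plain)
```

Be aware that the parser in Step 3 is picky about the whitespace in front of the field labels, so check that the converted files match the layout shown above.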
Step 3: Extract Data Elements and Save to a Table
This is where Python does the dirty work. To run the Python program correctly, save it in the same directory as the plain TXT documents created in Step 2 before you run it. The program will:
- Read in each TXT document;
- Extract data elements of each article and write them to an SQLite database;
- Export data to a CSV file for easy processing in other software such as Stata.
I introduce an intermediate step that writes the data to an SQLite database simply because this makes it easier to manipulate the news article data with Python for other purposes. Of course, you can write the data directly to a CSV file. The full program follows:
```python
import csv
import glob
import re
import sqlite3


def parser(file):
    # Open a TXT file. Store all articles in a list; each article is one
    # item of the list. Split articles based on the location of a string
    # such as 'Document PRN0000020080617e46h00461'
    articles = []
    # Specify the encoding explicitly to avoid UnicodeDecodeError on
    # non-ASCII characters
    with open(file, 'r', encoding='utf-8', errors='replace') as infile:
        data = infile.read()
    start = re.search(r'\n HD\n', data).start()
    for m in re.finditer(r'Document [a-zA-Z0-9]{25}\n', data):
        end = m.end()
        a = data[start:end].strip()
        a = '\n ' + a
        articles.append(a)
        start = end

    # In each article, find all used Intelligence Indexing field codes,
    # extract the content of each used field code, and write it to the
    # database.

    # All field codes (order matters)
    fields = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA',
              'CY', 'LP', 'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC',
              'IPD', 'PUB', 'AN']

    for a in articles:
        used = [f for f in fields if re.search(r'\n ' + f + r'\n', a)]
        unused = [[i, f] for i, f in enumerate(fields)
                  if not re.search(r'\n ' + f + r'\n', a)]
        fields_pos = []
        for f in used:
            f_m = re.search(r'\n ' + f + r'\n', a)
            fields_pos.append([f, f_m.start(), f_m.end()])
        obs = []
        n = len(used)
        for i in range(n):
            # Field content runs from the end of this label to the start
            # of the next label (or to the end of the article)
            start = fields_pos[i][2]
            if i < n - 1:
                end = fields_pos[i + 1][1]
            else:
                end = len(a)
            content = a[start:end].strip()
            obs.append(content)
        # Insert empty strings for unused fields so every row has the same
        # number of columns
        for f in unused:
            obs.insert(f[0], '')
        # Insert Company ID (taken from the file name), e.g., GVKEY
        obs.insert(0, file.split('/')[-1].split('.')[0])
        cur.execute('''INSERT INTO articles (id, hd, cr, wc, pd, et, sn, sc,
                    ed, pg, la, cy, lp, td, ct, rf, co, ina, ns, re, ipc,
                    ipd, pub, an) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                    ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', obs)


# Write to SQLite
conn = sqlite3.connect('factiva.db')
with conn:
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS articles')
    # Mirror all field codes, except renaming 'IN' to 'ina' because 'in' is
    # an SQL keyword and thus an invalid column name
    cur.execute('''CREATE TABLE articles (nid integer primary key, id text,
                hd text, cr text, wc text, pd text, et text, sn text,
                sc text, ed text, pg text, la text, cy text, lp text,
                td text, ct text, rf text, co text, ina text, ns text,
                re text, ipc text, ipd text, pub text, an text)''')
    for f in glob.glob('*.txt'):
        print(f)
        parser(f)

# Write to CSV to feed Stata
with open('factiva.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    with conn:
        cur = conn.cursor()
        cur.execute('SELECT * FROM articles WHERE hd IS NOT NULL')
        colname = [desc[0] for desc in cur.description]
        writer.writerow(colname)
        for obs in cur.fetchall():
            writer.writerow(obs)
```
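Once factiva.db exists, you can reuse it from Python for analyses like the tone calculation mentioned earlier. Here is a minimal sketch that reconstructs each article's full text from LP and TD (the word count at the end is just a placeholder for whatever tone or sentiment routine you use):

```python
import sqlite3

conn = sqlite3.connect('factiva.db')
cur = conn.cursor()

# Full text = lead paragraphs (LP) plus the remaining text (TD)
cur.execute('SELECT an, lp, td FROM articles WHERE hd IS NOT NULL')
for an, lp, td in cur.fetchall():
    full_text = (lp or '') + '\n' + (td or '')
    # Replace the line below with a real tone/sentiment measure
    print(an, len(full_text.split()))
```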
Hi there,
I am using your method to extract information from Factiva.
However, the code has some problems and I cannot run it smoothly.
I have the following error:
```text
UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' in position 165: ordinal not in range(128)
```
Could you please help me to solve it?
Are you using Python 2.7? Try upgrading to Python 3.5 to see if that solves the problem.
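Also, if you (or other readers) run into a UnicodeDecodeError instead, try opening the file with an explicit encoding in the parser, e.g.:

```python
# Read the TXT file with an explicit encoding instead of the platform default
with open(file, 'r', encoding='utf-8', errors='replace') as infile:
    data = infile.read()
```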
Hi Kai,
I also have problems extracting information from Factiva with your code.
I have the following error:
```text
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 22: ordinal not in range(128)
```
It seems there is also a problem with "parser(f)".
Could you please help me to solve it?
Thank you!
With the HTML export rather than RTF, you can get a great representation of the data with this one-liner!
```python
import pandas as pd

data = pd.concat([art for art in pd.read_html('/path/to/factiva-export.html', index_col=0)
                  if 'HD' in art.index.values], axis=1).T.set_index('AN')
```
You are welcome to then use data.to_sql() or data.to_csv()…
Cool! Good to know. I do want to explore Pandas more. Thank you Joel.
Hi! I don’t see the HTML export in Factiva…
Hi Kai,
I am trying to use your method to transform data from Factiva, but have run into an issue. Could you help me with this?
When executing the code I get the following error:

```text
File "factiva.py", line 72, in <module>
    parser(f)
File "factiva.py", line 15, in parser
    start = re.search(r'\n HD\n', data).start()
AttributeError: 'NoneType' object has no attribute 'start'
```
I am working with Python 3.6 on Windows 7. The script is located in the same directory as the txt file.
Your help would be greatly appreciated!!
Hi Joris,
This is probably because the program cannot find your text files in the glob.glob('*.txt') line. Two solutions: (1) Add the full path to the text files there, e.g., for f in glob.glob('C:/Downloads/*.txt'). I don't have a Windows machine, so please check the glob documentation if the syntax is incorrect, but basically you need to specify the full path. (2) A more future-proof solution is to use PyCharm, an advanced Python IDE that automatically looks for text files and other inputs in the folder containing the Python code.
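In other words, something like this (the path is just an example):

```python
import glob

# Point glob at the full path of the folder holding the TXT files
for f in glob.glob('C:/Downloads/*.txt'):
    print(f)
    parser(f)
```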
I hope this helps.
Hi Kai,
Thank you for your response, I really appreciate it.
I have tried both solutions, but the error persists. It now seems to find the first txt file but then gets stuck.
In the source folder it does create a DB file but no CSV.
With what OS/configuration does it work for you?
It now reads:

```text
C:\path\to\factiva-1.txt
Traceback (most recent call last):
  File "C:\path\to\factiva.py", line 72, in <module>
    parser(f)
  File "C:\path\to\factiva.py", line 15, in parser
    start = re.search(r'\n HD\n', data).start()
AttributeError: 'NoneType' object has no attribute 'start'
```
Would I maybe need to use Python 3.5 for it to work?
Once more, thank you very much for your help!
Hi Kai,
Thank you for sharing the code! I have a little trouble running it and hope you could offer some help. Both the db file and the csv file always have the first article missing. In the second row of the csv file, the nid and id show correctly, but nothing appears for any of the other elements. Here is an example: 1,3m,,,,,,,,,,,,,,,,,,,,,,,
I am using Python 3.6.5 with Spyder.
Thank you!
Never mind. I found that, for whatever reason, there are three spaces instead of one in front of the first "HD". Thank you anyway!
Dear Kai,
Thank you very much for sharing your code! However, I ran into some problems when running it. The output csv file only had part of the indexing right: many sentences starting with break words (such as BUT, ALTHOUGH, ...) and the article contents were unorganized and spread all over the spreadsheet. Does it have something to do with my original RTF file? (My RTF file looks quite different from yours, with tables, even though I followed your instructions exactly; it can be opened in Microsoft Word and there are pictures, etc. in it.) How could I fix the problem? Thank you very much!
Hi Grace, the Python program is very picky about the RTF layout, and you have to get that right. Or you can fine-tune the program to adapt it to your downloads.
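For example, if your files have extra spaces before the field labels (as another commenter found above), you could loosen the two field-label regexes in the program — a sketch, not tested against your files:

```python
import re

# `data`, `f`, and `a` refer to the variables in the main program.
# Tolerate one or more spaces before the first 'HD' label:
start = re.search(r'\n +HD\n', data).start()
# ...and likewise when locating each field code within an article:
f_m = re.search(r'\n +' + f + r'\n', a)
```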
Dear Kai,
Thank you so much for your reply! I will try it on a Mac and see if I can get the RTF layout right. I am also an accounting PhD student (about to graduate this year), and I have greatly benefited from other materials on your website as well. Thank you for your selfless help.
Is there anything I can do to parse articles that I mistakenly downloaded without the indexing fields?
Just dropping a comment to ask whether there is a workaround for downloading a large batch of news articles from Factiva. Any suggestions or shared experiences would be really helpful!