First of all, I acknowledge that I benefited a lot from Neal Caren's blog post Cleaning up LexisNexis Files. Thanks, Neal.
Factiva (as well as LexisNexis Academic) is a comprehensive repository of newspapers, magazines, and other news articles. I first describe the data elements of a Factiva news article. Then I explain the steps to extract those data elements and write them into a more machine-readable table using Python.
Data Elements in Factiva Article
Each news article in Factiva, no matter how it looks, contains a number of data elements. In Factiva's terminology, those data elements are called Intelligence Indexing Fields. The following table lists the label and name of each data element (or field), along with what it contains:
Field Label | Field Name | What It Contains |
---|---|---|
HD | Headline | Headline |
CR | Credit Information | Credit Information (Example: Associated Press) |
WC | Word Count | Number of words in document |
PD | Publication Date | Publication Date |
ET | Publication Time | Publication Time |
SN | Source Name | Source Name |
SC | Source Code | Source Code |
ED | Edition | Edition of publication (Example: Final) |
PG | Page | Page on which article appeared (Note: Page-One Story is a Dow Jones Intelligent Indexing™ term) |
LA | Language | Language in which the document is written |
CY | Copyright | Copyright |
LP | Lead Paragraph | First two paragraphs of an article |
TD | Text | Text following the lead paragraphs |
CT | Contact | Contact name to obtain additional information |
RF | Reference | Notes associated with a document |
CO | Dow Jones Ticker Symbol | Dow Jones Ticker Symbol |
IN | Industry Code | Dow Jones Intelligent Indexing™ Industry Code |
NS | Subject Code | Dow Jones Intelligent Indexing™ Subject Code |
RE | Region Code | Dow Jones Intelligent Indexing™ Region Code |
IPC | Information Provider Code | Information Provider Code |
IPD | Information Provider Descriptors | Information Provider Descriptors |
PUB | Publisher Name | Publisher of information |
AN | Accession Number | Unique Factiva.com identification number assigned to each document |
Please note that not every news article contains all of those data elements, and that the table may not list every data element used by Factiva (Factiva may make updates). Depending on which display option you select when downloading news articles from Factiva, certain data elements may not be visible. But they are there, and Factiva uses them to organize and structure its proprietary news article data.
How to Extract Data Elements in Factiva Article
You can follow the three steps below to extract the data elements from news articles for further processing (e.g., calculating the tone of the full text, which is represented by the LP and TD elements together, or grouping articles by news subject, i.e., by the NS element). I explain the steps one by one as follows.
Step 1: Download Articles from Factiva in RTF Format
Downloading a large number of news articles from Factiva is a lot of pain: it is technically difficult to download articles in an automated fashion, and you can only download 100 articles at a time, with those 100 articles also subject to a word-count limit of 180,000. As a result, gathering tens of thousands of news articles requires a lot of tedious work. While I can do nothing about either issue in this post, I can say a bit more about them.
Firstly, you may see some people discuss methods for automated downloading (a so-called "web scraping" technique; see here). However, this needs more hacking now that Factiva has introduced CAPTCHA to determine whether the user is a human. You may not be familiar with the term "CAPTCHA", but you have surely encountered the situation where you are asked to type the characters or numbers shown in an image before you can download a file or go to the next webpage. That is CAPTCHA. Both Factiva and LexisNexis Academic have introduced CAPTCHA to deter robotic downloading. Though CAPTCHA is not unbeatable, beating it requires advanced techniques.
Secondly, the Factiva licence expressly prohibits data mining, but it does not clearly define what constitutes data mining. I was informed that downloading a large number of articles in a short period of time would be flagged as data mining. The threshold speed set by Factiva is low, and any trained and adept person can easily exceed it. If you are flagged by Factiva, things could get ugly. So do not go too fast, even if this slows down your research.
Let's get back to the topic. When you manually download news articles from Factiva, the most important thing is to select the right display option: the third one, Full Article/Report plus Indexing.
Then you have to download the articles in RTF – Article Format.
After the download is complete, you will get an RTF document. If you open it, you will see that each news article is laid out as a sequence of Intelligence Indexing field labels, each on its own line and followed by that field's content.
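To make the layout concrete, here is a made-up excerpt of what one article looks like after conversion to plain text in Step 2 (all values are invented for illustration; note the single space in front of each field label, which the parser in Step 3 relies on):

```text
 HD
 Example Headline Goes Here
 WC
 512 words
 PD
 17 June 2008
 SN
 Example News Source
 LP
 The first two paragraphs of the article ...
 TD
 The text following the lead paragraphs ...
 Document PRN0000020080617e46h00461
```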
The next step is to convert the RTF document to plain TXT, because Python can process TXT documents more easily. After Python finishes its job, the final product will be a table: each row of the table represents one news article, and each column is a data element.
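For instance, the header row of the final CSV produced in Step 3 looks like this (the column names mirror the field labels, with IN renamed to ina; the second row is an invented example, truncated for brevity):

```text
nid,id,hd,cr,wc,pd,et,sn,sc,ed,pg,la,cy,lp,td,ct,rf,co,ina,ns,re,ipc,ipd,pub,an
1,3m,Example Headline Goes Here,Associated Press,512 words,17 June 2008,...
```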
Step 2: Convert RTF to TXT
Well, this can surely be done with Python, but so far I have not written a Python program to do it; I will fill this gap when I have time. For my research, I simply take advantage of TextEdit, the default text editor shipped with macOS: select Format – Make Plain Text from the menu bar, then save the document in TXT format. You can automate this using Automator on macOS.
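If you would rather script the conversion, here is a minimal sketch using the third-party striprtf package (an assumption on my part: install it with pip install striprtf; on macOS, the built-in textutil -convert txt *.rtf command also batch-converts):

```python
# A minimal sketch of Step 2 in Python, assuming `pip install striprtf`.
import glob
from striprtf.striprtf import rtf_to_text

for rtf_file in glob.glob('*.rtf'):
    with open(rtf_file, 'r', encoding='utf-8', errors='ignore') as infile:
        rtf_content = infile.read()
    plain = rtf_to_text(rtf_content)  # strip RTF markup, keep the text
    with open(rtf_file[:-4] + '.txt', 'w', encoding='utf-8') as outfile:
        outfile.write(plain)
```

Be aware that the parser in Step 3 is picky about the whitespace in front of the field labels, so check that the converted files match the layout shown above.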
Step 3: Extract Data Elements and Save to a Table
This is where Python does the dirty work. To run the Python program correctly, save it in the same directory as the plain TXT documents created in Step 2 before you run it. The program will:
- Read in each TXT document;
- Extract data elements of each article and write them to an SQLite database;
- Export data to a CSV file for easy processing in other software such as Stata.
I introduce an intermediate step that writes the data to an SQLite database simply because this makes it easier to manipulate the news article data with Python for other purposes. Of course, you can write the data directly to a CSV file. The full program follows:
```python
import csv
import glob
import re
import sqlite3


def parser(file):
    # Open a TXT file. Store all articles in a list; each article is one
    # item of the list. Split articles based on the location of a string
    # such as 'Document PRN0000020080617e46h00461'
    articles = []
    # Specify the encoding explicitly to avoid UnicodeDecodeError on
    # non-ASCII characters
    with open(file, 'r', encoding='utf-8', errors='replace') as infile:
        data = infile.read()
    start = re.search(r'\n HD\n', data).start()
    for m in re.finditer(r'Document [a-zA-Z0-9]{25}\n', data):
        end = m.end()
        a = data[start:end].strip()
        a = '\n ' + a
        articles.append(a)
        start = end

    # In each article, find all used Intelligence Indexing field codes,
    # extract the content of each used field code, and write it to the
    # database.

    # All field codes (order matters)
    fields = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA',
              'CY', 'LP', 'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC',
              'IPD', 'PUB', 'AN']

    for a in articles:
        used = [f for f in fields if re.search(r'\n ' + f + r'\n', a)]
        unused = [[i, f] for i, f in enumerate(fields)
                  if not re.search(r'\n ' + f + r'\n', a)]
        fields_pos = []
        for f in used:
            f_m = re.search(r'\n ' + f + r'\n', a)
            fields_pos.append([f, f_m.start(), f_m.end()])
        obs = []
        n = len(used)
        for i in range(n):
            # Field content runs from the end of this label to the start
            # of the next label (or to the end of the article)
            start = fields_pos[i][2]
            if i < n - 1:
                end = fields_pos[i + 1][1]
            else:
                end = len(a)
            content = a[start:end].strip()
            obs.append(content)
        # Insert empty strings for unused fields so every row has the same
        # number of columns
        for f in unused:
            obs.insert(f[0], '')
        # Insert Company ID (taken from the file name), e.g., GVKEY
        obs.insert(0, file.split('/')[-1].split('.')[0])
        cur.execute('''INSERT INTO articles (id, hd, cr, wc, pd, et, sn, sc,
                    ed, pg, la, cy, lp, td, ct, rf, co, ina, ns, re, ipc,
                    ipd, pub, an) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                    ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', obs)


# Write to SQLite
conn = sqlite3.connect('factiva.db')
with conn:
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS articles')
    # Mirror all field codes, except renaming 'IN' to 'ina' because 'in' is
    # an SQL keyword and thus an invalid column name
    cur.execute('''CREATE TABLE articles (nid integer primary key, id text,
                hd text, cr text, wc text, pd text, et text, sn text,
                sc text, ed text, pg text, la text, cy text, lp text,
                td text, ct text, rf text, co text, ina text, ns text,
                re text, ipc text, ipd text, pub text, an text)''')
    for f in glob.glob('*.txt'):
        print(f)
        parser(f)

# Write to CSV to feed Stata
with open('factiva.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    with conn:
        cur = conn.cursor()
        cur.execute('SELECT * FROM articles WHERE hd IS NOT NULL')
        colname = [desc[0] for desc in cur.description]
        writer.writerow(colname)
        for obs in cur.fetchall():
            writer.writerow(obs)
```
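Once factiva.db exists, you can reuse it from Python for analyses like the tone calculation mentioned earlier. Here is a minimal sketch that reconstructs each article's full text from LP and TD (the word count at the end is just a placeholder for whatever tone or sentiment routine you use):

```python
import sqlite3

conn = sqlite3.connect('factiva.db')
cur = conn.cursor()

# Full text = lead paragraphs (LP) plus the remaining text (TD)
cur.execute('SELECT an, lp, td FROM articles WHERE hd IS NOT NULL')
for an, lp, td in cur.fetchall():
    full_text = (lp or '') + '\n' + (td or '')
    # Replace the line below with a real tone/sentiment measure
    print(an, len(full_text.split()))
```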
Hi there,
I am using your method to extract information from Factiva.
However, the code has some problems and I cannot run it smoothly.
I have the following error:
```text
UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' in position 165: ordinal not in range(128)
```
Could you please help me to solve it?
Are you using Python 2.7? Try upgrading to Python 3.5 to see if that solves the problem.
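Also, if you (or other readers) run into a UnicodeDecodeError instead, try opening the file with an explicit encoding in the parser, e.g.:

```python
# Read the TXT file with an explicit encoding instead of the platform default
with open(file, 'r', encoding='utf-8', errors='replace') as infile:
    data = infile.read()
```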
Hi Kai,
I also have problems extracting information from Factiva with your code.
I have the following error:
```text
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 22: ordinal not in range(128)
```
It seems there is also a problem with "parser(f)".
Could you please help me to solve it?
Thank you!
With the HTML export rather than RTF, you can get a great representation of the data with this one-liner!
```python
import pandas as pd

data = pd.concat([art for art in pd.read_html('/path/to/factiva-export.html', index_col=0)
                  if 'HD' in art.index.values], axis=1).T.set_index('AN')
```
You are welcome to then use data.to_sql() or data.to_csv()…
Cool! Good to know. I do want to explore Pandas more. Thank you Joel.
Hi! I don’t see the HTML export in Factiva…
Hi Kai,
I am trying to use your method to transform data from Factiva, but have run into an issue. Could you help me with this?
When executing the code I get the following error:

```text
File "factiva.py", line 72, in <module>
    parser(f)
File "factiva.py", line 15, in parser
    start = re.search(r'\n HD\n', data).start()
AttributeError: 'NoneType' object has no attribute 'start'
```
I am working with Python 3.6 on Windows 7. The script is located in the same directory as the txt file.
Your help would be greatly appreciated!!
Hi Joris,
This is probably because the program cannot find your text files in the glob.glob('*.txt') line. Two solutions: (1) Add the full path to the text files there, e.g., for f in glob.glob('C:/Downloads/*.txt'). I don't have a Windows machine, so please check the glob documentation if the syntax is incorrect, but basically you need to specify the full path. (2) A more future-proof solution is to use PyCharm, an advanced Python IDE that automatically looks for text files and other inputs in the folder containing the Python code.
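In other words, something like this (the path is just an example):

```python
import glob

# Point glob at the full path of the folder holding the TXT files
for f in glob.glob('C:/Downloads/*.txt'):
    print(f)
    parser(f)
```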
I hope this helps.
Hi Kai,
Thank you for your response, I really appreciate it.
I have tried both solutions, but the error persists. It now seems to find the first txt file but then gets stuck.
In the source folder it does create a DB file but no CSV.
With what OS/configuration does it work for you?
It now reads:

```text
C:\path\to\factiva-1.txt
Traceback (most recent call last):
  File "C:\path\to\factiva.py", line 72, in <module>
    parser(f)
  File "C:\path\to\factiva.py", line 15, in parser
    start = re.search(r'\n HD\n', data).start()
AttributeError: 'NoneType' object has no attribute 'start'
```
Would I maybe need to use Python 3.5 for it to work?
Once more, thank you very much for your help!
Hi Kai,
Thank you for sharing the code! I have a little trouble running it and hope you could offer some help. Both the db file and the csv file always have the first article missing. In the second row of the csv file, the nid and id show correctly, but nothing appears for any of the other elements. Here is an example: 1,3m,,,,,,,,,,,,,,,,,,,,,,,
I am using Python 3.6.5 with Spyder.
Thank you!
Never mind. I found that, for whatever reason, there are three spaces instead of one in front of the first "HD". Thank you anyway!
Dear Kai,
Thank you very much for sharing your code! However, I ran into some problems when running it. The output csv file only had part of the indexing right: many sentences starting with break words (such as BUT, ALTHOUGH, ...) and the article contents were unorganized and spread all over the spreadsheet. Does it have something to do with my original RTF file? (My RTF file looks quite different from yours, with tables, even though I followed your instructions exactly; it can be opened in Microsoft Word and there are pictures, etc. in it.) How could I fix the problem? Thank you very much!
Hi Grace, the Python program is very picky about the RTF layout, and you have to get that right. Or you can fine-tune the program to adapt it to your downloads.
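For example, if your files have extra spaces before the field labels (as another commenter found above), you could loosen the two field-label regexes in the program — a sketch, not tested against your files:

```python
import re

# `data`, `f`, and `a` refer to the variables in the main program.
# Tolerate one or more spaces before the first 'HD' label:
start = re.search(r'\n +HD\n', data).start()
# ...and likewise when locating each field code within an article:
f_m = re.search(r'\n +' + f + r'\n', a)
```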
Dear Kai,
Thank you so much for your reply! I will try it on a Mac and see if I can get the RTF layout right. I am also an accounting PhD student (about to graduate this year), and I have greatly benefited from other materials on your website as well. Thank you for your selfless help.
Is there anything I can do to parse articles that I mistakenly downloaded without the indexing fields?
Just dropping a comment to ask whether there is a workaround for downloading a large batch of news articles from Factiva. Any suggestions or shared experiences would be really helpful!