Use Python to calculate the tone of financial articles

[Update on 2019-03-01] I have completely rewritten the Python program. The updates include:

  • I include two domain-specific dictionaries, Loughran and McDonald’s and Henry’s, and you can choose which one to use.
  • I add a negation check as suggested by Loughran and McDonald (2011): any occurrence of a negation word (e.g., isn’t, not, never) within three words preceding a positive word flips that positive word into a negative one (see the sketch after this list). The negation check applies only to positive words because Loughran and McDonald (2011) suggest that double negation (i.e., a negation word preceding a negative word) is not common. I expand their negation word list, though, since theirs seems incomplete. In my sample of 90,000+ press releases, the negation check finds that 5.7% of press releases contain positive word(s) preceded by a negation word.
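
For illustration, here is a minimal sketch of the negation check. The word lists below are tiny placeholders, not the actual Loughran and McDonald or Henry lists, and the tokenization is deliberately simple:

    import re

    # Placeholder word lists -- substitute the real dictionary and negation lists.
    NEGATE = {"not", "never", "no", "isn't", "wasn't", "cannot", "don't"}
    POSITIVE = {"improve", "gain", "strong"}
    NEGATIVE = {"loss", "decline", "weak"}

    def tone_counts(article, window=3):
        # Count positive/negative words, flipping a positive word to negative
        # when a negation word occurs within `window` words before it.
        words = re.findall(r"[a-z']+", article.lower())  # alphabetic words only
        pos = neg = 0
        for i, word in enumerate(words):
            if word in NEGATIVE:
                neg += 1
            elif word in POSITIVE:
                if any(w in NEGATE for w in words[max(0, i - window):i]):
                    neg += 1  # a negated positive counts as negative
                else:
                    pos += 1
        return {"positive": pos, "negative": neg, "total": len(words)}

    print(tone_counts("Margins did not improve, and the company reported a loss."))
    # {'positive': 0, 'negative': 2, 'total': 10}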

Please note:

  • The Python program first transforms an article into a bag of words in their original order. Different research questions may define “word” differently. For example, some research questions only look at alphabetic words (i.e., they remove all numbers in an article). I use this definition in the following Python program (see the sketch after this list), but you may want to change it to suit your research question. In addition, there are many nuances in splitting sentences into words; the splitting method in the following Python program is simple but, of course, imperfect.
  • To use the Python program, you have to know how to assign the full text of an article to the variable article (using a loop) and how to output the results into a database-like file (SQLite or CSV).
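
To illustrate the “alphabetic words only” definition, a one-line tokenizer (the sample sentence is made up):

    import re

    article = "In Q3 the company reported a $5.0 million loss via e-mail."
    words = re.findall(r"[a-zA-Z]+", article)
    # "$5.0" is dropped entirely, "Q3" becomes "Q", and "e-mail" splits into "e" and "mail"
    print(words)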

I acknowledge the work of C.J. Hutto (see his projects on GitHub).

[Original Post] I found two internet resources for this task (thanks to both authors):

The first solution is much more efficient than the second, but the second is more straightforward. The first also requires knowledge of PostgreSQL and R besides Python. I borrow from both resources in the Python code below.

Please note: to use the Python code, you have to know how to assign the full text of the article of interest to the variable text, and how to output the total word count and the counts of positive/negative words in text.

In the first part of the code, I read the dictionary (the word list) into a Python dictionary variable. The word list used here is supposed to be a .txt file in the following format:

For accounting and finance research, a commonly used positive/negative word list was developed by Bill McDonald. See his website.
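
As a rough sketch of the first part, assume for illustration that each line of the .txt file contains a word and its category separated by a comma (the file name and line format here are hypothetical; adjust the parsing to your file’s actual format):

    from collections import defaultdict

    # Hypothetical line format: "word,category", e.g. "outstanding,Positive"
    dict_words = defaultdict(list)
    with open('word_list.txt', 'r') as f:
        for line in f:
            word, category = line.strip().split(',')
            dict_words[category].append(word.lower())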

In the second part of the code, I create regular expressions that are used to find occurrences of positive/negative words. The last few lines of code get the counts of positive/negative words in the text.
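
A sketch of the second part, building on the dict_words variable from the sketch above (one compiled pattern per category):

    import re

    regex = {}
    for category, words in dict_words.items():
        # \b word boundaries so that, e.g., "gain" does not match inside "against"
        pattern = r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b'
        regex[category] = re.compile(pattern, re.IGNORECASE)

    text = "..."  # assign the full text of the article of interest here
    wordcount = len(text.split())
    count = {category: len(rx.findall(text)) for category, rx in regex.items()}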


14 Responses to Use Python to calculate the tone of financial articles

  1. Ian Gow says:

    I agree that my solution is more complex. But in part that’s because it’s a more complete solution. One has to download and process the data from Bill McDonald (“see his website for download” implies undocumented steps in the process). Then one has to organize and perhaps process the text so it can be fed to the Python function. Finally, one needs to handle the output.

    I think the first step on my site could be done in Python (rather than R … my decision to use R is more a reflection of my comparative advantage in R than anything inherent to Python). And the second step could be done without PostgreSQL (especially if the first step is done in Python). I think a “pure Python” approach would be more elegant than what I have, at least as a code illustration.

    • Kai Chen says:

      Hi Ian, I’m glad to hear your thoughts so promptly – I like your blog and really benefit from it.

      I like how you deal with the regular expression pattern. It is very efficient, saving the trouble of using too many loops. In my experiment, your code is about six times faster than the other. I agree that your solution is more complete, and that reading texts from and outputting tone counts to a database is a better idea than reading/writing CSV. In my code, I do bypass the feeding and outputting parts in my post.

  2. Mu Civ says:

    Hi Kai, I’m new to Python, so I really appreciate your code!

    Unfortunately, it doesn’t work for me though. A few errors occurred:

    #1 NameError: name 're' is not defined -> I added "import re", which helped, I guess.

    #2 NameError: name 'text' is not defined -> I defined text as text = "Bsp.text" (which is the document I would like to analyse). This also seemed to help; at least the error does not occur anymore.

    #3 NameError: name 'count' is not defined -> I really don’t know how to fix this one though… Can you help me, please?

    Thanks in advance!

    • Mu Civ says:

      Hi Kai,

      I’ve already solved my problem.

      Here is the last part of the code (if anyone should be interested):

      # Get tone count
      with open('Bsp.txt', 'r') as content_file:
          content = content_file.read()

      count = {}
      wordcount = len(content.split())
      for cat in dict.keys():
          count[cat] = len(regex[cat].findall(content))

      print(count)

      Thanks and have a nice day. 🙂

  3. Tom Jones says:

    Apart from the fact that your code doesn’t actually work, it’s great.

  4. Victor says:

    Good code!

    For my own code, I realize that I only tested for negating words immediately preceding the positive words, instead of within three words. I didn’t read Loughran and McDonald (2011) carefully.

    I also realize that it would be even better to first tokenize an article into sentences and do the negation test within the boundary of each sentence.
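
    For example, with nltk (assuming nltk and its sentence tokenizer data are installed), the per-sentence loop could look like:

        import nltk
        # nltk.download('punkt')  # one-time download of the sentence tokenizer model

        article = "Revenue did not improve. The outlook is strong."
        for sentence in nltk.sent_tokenize(article):
            words = nltk.word_tokenize(sentence.lower())
            print(words)  # run the negation window within this sentence only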

    For the definition of words, there is indeed no single definition. For example, Loughran and McDonald seem to define a word as [a-zA-Z]+. In their master dictionary, you can see “email” but not “e-mail”; “e-mail” becomes two words, “e” and “mail”. By the same definition, “10-K” becomes “K”. Sometimes people also remove single-letter words. If you use nltk’s word tokenizer, “couldn’t” becomes “could” and “n’t”, “company’s” becomes “company” and “’s”, “e-mail” stays “e-mail”, and “$5.0” becomes “$” and “5.0”. People often apply further screening to remove punctuation and tokens containing digits or punctuation. I find that after removing punctuation, the nltk tokens are very close to Microsoft’s definition of words.
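
    A quick check of these cases (assuming nltk and its tokenizer data are installed):

        import nltk

        print(nltk.word_tokenize("couldn't company's e-mail $5.0 10-K"))
        # ['could', "n't", 'company', "'s", 'e-mail', '$', '5.0', '10-K']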

    Papers often are not clear about their own word definitions, which makes replication difficult.

  5. Will says:

    Hi Kai,
    Thank you for this. Could you tell me what results the test article returns? I’d like to ensure the slight terminology modifications I made return the same results as intended. I find 4 positive words, 38 negative words, and 726 total words.
    Thanks!

  6. Anja says:

    Hi Kai,
    I just wanted to say thank you for providing the code! It is simple, flexible, and addresses the issue of negation. I’m relatively new to Python and could easily apply and adapt it.

  7. Julian says:

    Hi Kai!
    First of all, thanks a lot for sharing your code! As a Python newbie, I found it a great help. As I am trying to conduct a sentiment analysis of corporate CSR reports, I am looking for ways to make my analysis more robust. With that in mind, I am wondering whether it is possible to adjust your code so that the dictionary words are weighted on the basis of their inverse document frequency (IDF) instead of being weighted equally.
    Do you perhaps know a way to include bag-of-words and TF-IDF in your code above?
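
    Something like this rough sketch is what I have in mind (the corpus and the weighting are my own placeholders, not Kai’s code):

        import math

        def idf_weights(documents, dictionary_words):
            # IDF weight for each dictionary word across a corpus of documents
            n = len(documents)
            weights = {}
            for word in dictionary_words:
                df = sum(1 for doc in documents if word in doc.lower().split())
                weights[word] = math.log(n / (1 + df))  # +1 avoids division by zero
            return weights

        # A weighted tone score would then sum weights[word] over matched words
        # instead of counting each hit as 1.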

    I would be extremely grateful for any help that you could possibly provide me with!
    Thanks so much in advance and best wishes!
