For the final project, Matt, Tong, and I formed a group to do sentiment analysis on Linn’s Journal. First, we divided the journal into three parts, and each of us marked up the emotion in our assigned part as “positive”, “negative”, or “neutral”. We used the tags “state” and “note” to record the type of emotion, as in “<state type = ‘emotion’><note>positive</note></state>”. An example of our markup in context is shown in the highlighted part below.
Figure 1. Marking up the Journal.
After marking up the journal, we needed a way to extract the information and visualize the data. To do this, I first wrote a Python class that reads in the text from the file. It then looks for the first locator, which in our case is “<date type = ‘diary’ calendar = ’0000-00-00’>”, extracts the date, and adds it to a list if it differs from the previous one. Because some of us put spaces around “=” and some didn’t, and some used single quotes while others used double quotes, I decided to strip out those details first. I therefore wrote a static method called “findtag”, which takes a tag without the information you’re looking for and outputs a list of tag words that can later be used as locators in the search. After searching for dates, I had a list of indices for the dates in the text. At first I was confused about what I should record: a list of dates or a list of indices? I need the dates eventually, but I can get them easily from the indices with a list comprehension later, and the indices are more useful in the second step. Next, I search for the second locator, which here is “[‘state’, ‘type’, ‘emotion’, ‘note’]”, to get the emotions between dates. The final step of the data processing is to count how many times each emotion appears in each period.
Figure 2. Extracting Information
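The extraction steps described above can be sketched roughly as follows. This is a minimal illustration, not the actual class from the project: the names `tag_words`, `value_after`, and `count_emotions_by_date` are hypothetical, the sample text and its dates are invented, and a regular expression stands in for whatever “findtag” does internally to ignore spacing and quote styles.

```python
import re
from collections import Counter


def tag_words(markup):
    """Split markup into bare words, dropping <, >, =, spaces and any quote style."""
    return re.findall(r"[\w-]+", markup)


# Locators are the tag words that come right before the value we want.
DATE_LOCATOR = tag_words("<date type='diary' calendar=")
EMOTION_LOCATOR = tag_words("<state type='emotion'><note>")


def value_after(words, i, locator):
    """Return the word following `locator` if it starts at index i, else None."""
    end = i + len(locator)
    if words[i:end] == locator and end < len(words):
        return words[end]
    return None


def count_emotions_by_date(text):
    """Map each date to a Counter of the emotions marked up after it."""
    words = tag_words(text)
    counts = {}          # dicts keep insertion order, so dates stay in sequence
    current_date = None
    for i in range(len(words)):
        date = value_after(words, i, DATE_LOCATOR)
        emotion = value_after(words, i, EMOTION_LOCATOR)
        if date is not None:
            current_date = date
            counts.setdefault(date, Counter())
        elif emotion is not None and current_date is not None:
            counts[current_date][emotion] += 1
    return counts


# Invented sample with mixed spacing and quote styles (not real journal text).
sample = """<date type = "diary" calendar="1862-03-10">March 10</date> Fine day.
<state type='emotion'><note>positive</note></state>
<date type='diary' calendar = '1862-03-16'>March 16</date> Tired.
<state type = "emotion"><note>negative</note></state> Still tired.
<state type="emotion"><note>negative</note></state>"""

counts = count_emotions_by_date(sample)
for date, emotions in counts.items():
    print(date, dict(emotions))
```

Because every tag is reduced to plain words before matching, the differences in spacing and quotation marks between the three of us no longer matter at search time.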
I think findInfo in the PTs class could easily be turned into a general tag extraction method. My way of extracting preserves the order of the tags. Note that the locator words have to appear in sequence for the information to be located. We can also set how many words after the locator the information we want appears.
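A general extraction method along these lines might look like the sketch below. It is only an assumption about how such a method could work, not the real findInfo: the function name `find_info`, its `offset` parameter, and the sample string are all made up for illustration.

```python
import re


def find_info(text, locator, offset=1):
    """Return (index, word) pairs for each place the locator words occur in order.

    `offset` is how many words after the end of the locator the wanted word sits.
    """
    words = re.findall(r"[\w-]+", text)   # bare words, no <, >, = or quotes
    n = len(locator)
    hits = []
    for i in range(len(words) - n + 1):
        target = i + n - 1 + offset
        if words[i:i + n] == locator and target < len(words):
            hits.append((i, words[target]))
    return hits                            # hits stay in document order


# Invented sample line, not real journal text.
sample = ("<date type='diary' calendar='1862-03-10'>March 10</date> "
          "<state type = \"emotion\"><note>positive</note></state>")

print(find_info(sample, ["date", "type", "diary", "calendar"]))            # the date
print(find_info(sample, ["date", "type", "diary", "calendar"], offset=2))  # one word later
print(find_info(sample, ["type", "state", "emotion", "note"]))             # wrong order: no hits
```

The last call returns an empty list because the locator words are out of sequence, which is the restriction mentioned above.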
The final step was to visualize the information we got. To do this, I used a Python library called matplotlib. At first I didn’t know how to add labels to the x-axis or how to annotate the graph, so I searched online and, through trial and error, found that I could use xticks and rotation to do what I wanted.
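A plot of this kind can be sketched as follows. The dates and counts here are placeholder values chosen only to make the example run, not our actual results, and the arrow position, the styling choices, and the output file name are likewise assumptions.

```python
import matplotlib
matplotlib.use("Agg")            # draw to a file, so no display is needed
import matplotlib.pyplot as plt

# Placeholder data: one count per date for each emotion (not real results).
dates = ["1862-03-08", "1862-03-10", "1862-03-12", "1862-03-16", "1862-03-18"]
counts = {
    "positive": [3, 2, 2, 0, 1],
    "negative": [0, 1, 0, 3, 2],
    "neutral":  [1, 1, 2, 1, 1],
}

positions = list(range(len(dates)))
fig, ax = plt.subplots(figsize=(8, 4))

# One line per emotion.
for label, values in counts.items():
    ax.plot(positions, values, marker="o", label=label)

# Put the dates on the x-axis and rotate them so they do not overlap.
plt.xticks(positions, dates, rotation=45)

# Draw an arrow with a note pointing at one of the data points.
ax.annotate(
    "last entry before the battle",
    xy=(2, counts["positive"][2]),
    xytext=(2.3, 3.2),
    arrowprops=dict(arrowstyle="->"),
)

ax.set_ylabel("number of marked emotions")
ax.legend()
fig.tight_layout()
fig.savefig("emotion_counts.png")
```

Plotting against integer positions and then relabeling the ticks keeps the points evenly spaced even when the gaps between journal entries are uneven.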
Note that there are no journal entries for the date of the Battle of New Bern, and he didn’t write any entry directly after it, so the arrow in the graph actually points to a date before the battle. In general, though, we can see that Linn is usually in a good mood before the battle, while after the battle he usually feels exhausted or just bad, since he has a great responsibility on his shoulders and there are deaths and injuries around him all the time.
During the final project, I think I learned how to start a piece of work. To begin with, we need to know what we want our product to be, and then we can finish it step by step. Even though there are only three lines in the plot, we could not have plotted them if our markup had not been standardized. We learned how to communicate with each other and how to coordinate our schedules. I also learned how to pick up new techniques on my own, since we were not told how to mark up the emotions or how to annotate a graph in Python. If I had more time, I might even consider turning my Python file into an information extractor with a GUI. Overall, I think it was a great final project from which I learned a lot.
Figure 3. Our Final Plot