The Forensic Files API, Part 5

Over and Out

January 2, 2022


Whew! Well, it has been one hell of a 20-month span since my last post. As you've probably guessed, I got a little preoccupied with other stuff (gestures broadly at everything). I imagine after reading part 4 of this series, you've been on the edge of your seat.

I feel like I need to wrap this project up before I publish more posts, which is why I went with Over and Out for the title. I know that might sound ridiculous, but it's my website, and I call the shots. Rather than follow the same format as previous posts, I'm going to provide a quick overview and move on with my life. You can check out the code on GitHub if you're so inclined.

Before I get into it, I feel compelled to include the obligatory episode synopsis from the Forensic Files Wiki:

When a pipe bomb ripped through a rural home, killing a young man and seriously injuring his mother, police had no idea who was responsible.

A lot number on a 9-volt battery and the remnants of a mailing label found on a computer's hard drive enabled investigators to determine who sent the bomb, and why.

Project Goals

When I first started this project, I had some pretty specific goals:

  1. Improve my Go skills
  2. Sharpen my JavaScript skills
  3. Learn how to use ngrok with callback URLs
  4. Learn how to use cloud-based text analysis tools
  5. Learn how to set up and interact with a graph database
  6. Learn how to set up a GraphQL backend

I'm happy to say I've accomplished all of them! Admittedly, numbers 1 and 2 are a bit difficult to measure and quantify, but I'm ringing in the new year with positive affirmations. With that being said, let's get technical!

Using the Cloud for Entity Analysis

I leveraged cloud services to transcribe the audio and perform entity analysis. Here's a quick overview of the packages I created to interact with those services, in the order they run:

  1. hearnoevil/server.go runs a callback URL server that writes the results of speech-to-text jobs to a JSON file
  2. hearnoevil/recognition.go runs a recognition (speech-to-text) job using IBM Cloud (the callback URL is generated by ngrok)
  3. killigraphy creates a text file of the episode transcript from the recognition JSON file
  4. tagasuspect uses cloud services to generate an entity analysis from the plaintext transcripts

I used IBM Cloud for the speech-to-text service for two reasons:

  1. They had a very generous free tier (much more generous than Amazon Transcribe)
  2. They didn't require a credit card to sign up for an account

Why does #2 make a difference? IBM doles out 500 free minutes of their speech-to-text service per month. I think their account usage calculations lagged behind the queued jobs, because I was able to run recognition jobs for 200 episodes before IBM cut me off for the month. I'll save you some math: at 25 minutes per episode, that adds up to 5,000 minutes. I was able to use 10 times the allotted amount without getting charged a dime! This has almost certainly changed since then, but what a deal! Kudos to IBM Cloud for letting me get away with that. If someone from IBM Cloud is reading this and has always wondered why they started requiring a credit card to sign up in mid-2020, mystery solved!

I ended up using both Google Cloud Platform and IBM Cloud for the entity analyses of the transcripts. You can see an example of a GCP analysis in the separate forensic-files-investigations repo I set up. I think the entity analysis from IBM Cloud is more comprehensive. I wanted to try GCP out for giggles, but I wasn't really jazzed about the result.

After getting the entity analyses, I ended up a bit stuck. It turns out that speech-to-text is hard to get right. As soon as I started sending the transcripts out for analysis, I realized I was going to have issues. There were words that couldn't be recognized. There were incomplete sentences. The entity analysis tools need the transcript broken up into proper sentences; otherwise, the whole analysis is thrown off. I buckled and manually transcribed about 4 episodes before saying to hell with this.

It turns out trying to automate this was a fruitless venture: manually cleaning up every transcript would take way too much time. On the flip side, there are only ~400 episodes of the show, which doesn't even register on the low end of "Big Data". It would be quicker for me to just skim the recognition results and load the data into the database by hand. But what database? And what data? I'm glad you asked!

From the Cloud to the Database

I knew I was going to use Neo4j for the graph database. Why? Because, if you ask any developer if they've heard of graph databases, around 90% of them would respond with: You mean like Neo4j? In order to light a fire under my own ass, I signed up to do a presentation on the GRANDstack for the GraphQLPDX meetup group. In case you were wondering, GRAND stands for GraphQL, React, Apollo, and Neo4j Database. You can read more about GRANDstack at the official website.

I'm not going to cover how to install and configure Neo4j in this post. The process is pretty cut and dried. I just downloaded Neo4j Desktop and followed the instructions.

As I alluded to earlier, I was at a loss for how to get the entity analyses into a format suitable for loading into a database. I also wasn't sure how to structure the data.

Then, like a glorious beacon of light, the internet showed me the POLE data model.

The folks at Neo4j realized that crime does pay, at least in terms of solving it using their technology. The idea is that there are 4 different elements in any criminal investigation: Person, Object, Location, and Event. Criminals (Persons) can be linked in the commission of a crime (Event) by physical evidence (Objects) and proximity (Locations).

Billy shoots someone down the street from his apartment, but they can't find the murder weapon. Police find a 9mm casing at the scene. Jimbo lives two blocks away and has a 9mm registered in his name, and happens to be Billy's brother-in-law. Jimbo reported the gun stolen a week ago.

Billy and Jimbo are linked by proximity and relationship. Law enforcement loads all this data into a graph database, and they solve the homicide by connecting the dots (at least in theory).

Armed with this knowledge, I began the arduous task of extracting these 4 elements from each Forensic Files episode. Before I could do that, I needed to know how to load the data. Neo4j uses the Cypher Query Language to get data into a database instance. Cypher is a little wonky: it uses an ASCII-art style of syntax, so the statement to add an Object node looks like this:

CREATE (ballistics:Object {id: '...', name: 'Ballistics', category: 'TOOL'})

Note: the id: '...' field in the Cypher statements throughout this post is invalid. Each statement should have a valid unique ID (e.g. id: 'ck9t5vg5l0001z84egqqbhh8i'). I omitted it so you wouldn't have to do too much horizontal scrolling.

I soon discovered it was actually a lot easier to just read the plaintext transcript and yeet the data from there. Alas, all of my cloud wrangling had been for naught. I ended up manually writing the Cypher to load the data, and I only loaded the first 8 episodes. The Cypher seed file is in the GitHub repo I made for the GRANDstack demo.

Cypher Seed File

I structured the Cypher file with common elements first, followed by episode-specific elements (grouped by episode). I started by defining common Object and Event nodes:

CREATE (antigenTest:Object {id: '...', name: 'Antigen Test', category: 'TOOL'})

CREATE (beating:Event {id: '...', name: 'Beating', category: 'CAUSEOFDEATH'})

You may be asking: What about common People and Location nodes? Well, I only wrote enough Cypher statements to cover the first 8 episodes of season 1. I know for certain that world-renowned forensic experts were brought in for multiple episodes. I also know that different crimes were committed in the same city, but alas, I didn't get that far.

Next, I created Episode nodes containing the title, season, episode number, etc.:

CREATE (theDisappearanceOfHelleCrafts:Episode {id: '...', title: 'The Disappearance of Helle Crafts', season: 1, episode: 1})
CREATE (theMagicBullet:Episode {id: '...', title: 'The Magic Bullet', season: 1, episode: 2})
CREATE (theHouseThatRoared:Episode {id: '...', title: 'The House That Roared', season: 1, episode: 3})

Each individual episode section contained specific Person and Location nodes. For example, in episode 3 of season 1 ("The House That Roared"), Chris Campano kills his wife Karen with a blunt object after an argument.

First, I defined the Person nodes for the people involved:

CREATE (chrisCampano:Person {id: '...', name: 'Christopher Campano', age: 28, sex: 'M', occupation: 'Unemployed', role: 'PERPETRATOR'})
CREATE (karenCampano:Person {id: '...', name: 'Karen Campano', age: 43, sex: 'F', occupation: 'Bill Collector', role: 'VICTIM'})

Next, I defined the Object and Location nodes:

CREATE (check:Object {id: '...', name: 'Check', category: 'EVIDENCE'})
CREATE (buyForLess:Location {id: '...', name: 'Buy for Less Convenience Store', category: 'LANDMARK'})
CREATE (oklahomaCity:Location {id: '...', name: 'Oklahoma City, Oklahoma', category: 'CITY'})

Finally, I defined the relationships, such as linking a perpetrator with a conviction, and a victim with a cause of death:

  (chrisCampano)-[:CONVICTED_OF {counts: 1}]->(manslaughter),
  (karenCampano)-[:VICTIM_OF {year: 1992}]->(bluntForceTrauma)

Pretty neat, eh? I loaded this data into my Neo4j instance, and I was off to the races! Well, not exactly. Now I needed to figure out how to interact with the data. As I described earlier, I built an app with the GRANDstack. We've covered the N and D; let's move on to the G, R, and A.

The GRA in GRANDstack

Apollo (the A in GRAND) is a platform that makes GraphQL (the G) much easier to work with on the server. Using the apollo-server library, along with the neo4j-driver and neo4j-graphql-js libraries, setting up the GraphQL server to work with Neo4j was relatively simple. You define a GraphQL schema that reflects the nodes and relationships created in the Cypher file, and the server itself takes very few lines of code.
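The repo has the real schema; to give you a flavor, here's a rough sketch of what part of it might look like. The relationship names come from the Cypher above, but the exact types and fields are my assumptions, not the repo's actual schema:

```graphql
type Person {
  id: ID!
  name: String
  age: Int
  sex: String
  occupation: String
  role: String
  convictedOf: [Event] @relation(name: "CONVICTED_OF", direction: "OUT")
  victimOf: [Event] @relation(name: "VICTIM_OF", direction: "OUT")
}

type Event {
  id: ID!
  name: String
  category: String
}

type Episode {
  id: ID!
  title: String
  season: Int
  episode: Int
}
```

The @relation directive is how neo4j-graphql-js maps a GraphQL field onto a typed, directed relationship in the graph, so queries like person { convictedOf { name } } translate straight into Cypher traversals.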

As far as the UI goes, I used React (the R) with the appropriate Apollo and GraphQL libraries to interact with the database/server. You select an element like Event from a dropdown, then start typing into an autocomplete box to view all the episodes associated with that event in a grid. You can see the source code for that in the GitHub repo. Full disclosure: I created this almost 2 years ago, so there is no guarantee that it still works.

Wrap Up

That's all she wrote! I can breathe a sigh of relief, knowing I closed the book on this project after a very long delay. I learned a lot from this adventure. Above all, my heart goes out to data scientists; I imagine having to clean up data can be a real slog. The NLP tools out there are powerful, but still not perfect, and a lot of manual tweaking is needed. I also learned that graph databases are a lot of fun! I chalk it up to luck that I picked a use case for which graph databases are perfectly suited (criminal investigation).

This project could have easily been 7 or 8 more full posts, but I decided to make it a single summary, so I can move on to other things. Thanks for reading!