Thursday, November 14, 2013

Intro to the MEAN Stack - Part 1 - The Data

I recently changed jobs to join a startup. One of the many reasons why I took the job is the fact that the software is built with the MEAN stack stack – MongoDb, Express.js, AngularjS and Node.js. Prior to joining the company I had dabbled in each of the parts of the stack but I hadn’t used any of them on a ‘real’ project. So to reinforce what I’m learning during the day by using the MEAN stack to build an app to view the stats for the New York/San Francisco Giants. I thought I would share my experiences hopefully to help someone else who's learning the MEAN stack. This is the first post in a three part series that will walk you through building my app. The planned posts are:

Part 1 – The Data: Converts the MySQL data model to a suitable model for MongoDB.

Part 2 – The API: Building out a Node.js based API that will allow us to retrieve the stats.

Part 3 – The UI: Covers building out an AngularJS UI

Goal

The goal for this post is to have the data modeled and loaded into a MongoDB database so we can use it in the next post.

Setup

If you want to 'follow along' with this post you will need to download and install MongoDB and Node.js. Once you have installed mongo and node you can download the part 1 code. Keep in mind before you run the load scripts you will need to do install the node packages. To do that, simply change into the <YOUR CODE DIR>/post-1-the-data/scripts directory and run:

npm install

npm is the node package manager. The install option tells npm to read the package.json file and install any of the requirements that have not already been installed. If you plan on loading the Giants data I have already parsed out you are good to go. However if you want to start the from the beginning yourself you will need to download the data from the Baseball Databank project. The most up to date branch is the 2012update branch. After cloning the repository read the scripts/README.md and you will then be ready to generate your own Giants data or any other team’s data.

My Two Second MongoDB Introduction

Before I dive into the meat of the post I want to give you a very brief MongoDB intro. We will be storing data in collections, which are analogous with tables. Each collection will contain a document, which you can think of as a row in relational databases. As the NoSQL term implies we will not be using SQL to retrieve the data. Instead we will use javascript.

The Data

As a kid who grew up reading box scores every morning while I ate my breakfast I am very thankfully that there is an open source project out there dedicated to providing the statistics for Major League Baseball. It is distributed using 24 CSV files, each file maps to a MySQL table that was used to generate it. All 24 data files either describe a manager, player or a team. So as we build out the database we will create and load three collections: managers, players, and seasons.

Each collection will house documents that have been designed so that one document will represent one player, one manager or one season. This will make our development of the API much easier. For the most part one call should retrieve everything we need. To give you an idea of what type of data the documents will contain I’ve created a map between the collections and files. Remember each file is a table in the Baseball Databank database. In some cases you’d have to write some pretty complicated joins to get the data. In Mongo, our queries will be straight forward.

Managers Collection (4 tables) => AwardsManagers, Managers, ManagersHalf, Master

The Players Collection (11 tables) => AllStars, Appearances, AwardsPlayers, Batting, BattingPost, Fielding, FieldingPost, Master, Pitching, PitchingPost, Salaries

The Seasons Collection (3 tables) => SeriesPost, Teams, TeamsHalf

If you’d like a description of the tables checkout the Baseball Databank README

The MongoDB Side

For the rest of this post I am going to walk you through an example of a document that is stored in each collection. The description will also have examples of how to retrieve the data from the mongo console app. We will start with the simplest of the collections, the managers.


Managers

Any manager that has managed at least a game for either the New York or San Francisco Giants will have a document in this collection. A managers document has demographic information,, managerial record plus any awards they may have won. Our managerial document example is Rogers Hornsby’s. He managed the New York Giants in 1927.

The first property of the document is the _id property. By default each record would have an _id field that is a randomly generated ObjectID created by mongoDB during the insert. Here's what an ObjectID looks like

ObjectId("528398bb3b06760000000004")
In some cases that may work fine but for this collection I am using the baseball databank managerID value. This allows me to take advantage of the built in unique constraint on the _id index. The documents properties are self explanatory however I would like to discuss the record properties.

The record property is an array of JSON objects. Each entry in the array represents a full or partial season with the Giants. Since Hornsby only managed part of one season for the Giants he only has one item in the array. You can think of the entries in the record array as a row in a record table in a relational database. Using an array allows you to keep all data related to managers in one document making it easy to retrieve all of the manager's data when needed. We will make use of arrays in all of our documents. If the Giants had made it to the playoffs or if Hornsby had won any managerial awards his document would have two more array properties, playoffs and awards.

You may have not caught that last bit but documents in the same collection do not have to have the same properties. In all of our collections the documents will have about 95% of the properties in common. I will show you how we can check for the existence of a property when querying the database in a bit. Stay tuned.

You might be wondering how I retrieved the Hornsby document. Let me walk you through the queries I used.

At a command prompt fire up the mongo client by running:

  mongo giants

This will connect you to a local instance of mongodb and switch you into the giants database. You could run just mongo to connect and the use giants at the mongo prompt to do the same thing. For more options read the mongo shell documentation.

Now that I'm in the giants database I can retrieve the Hornsby document by running:

   db.managers.find( { nameLast: 'Hornsby' }).pretty()

What the query says is in the current database search the managers collection for a document that has a nameLast property equal to 'Hornsby'. The find call will return an array containing all of the documents that match our query. The pretty function formats the results of the find in a more human readable fashion. Try the find
call with and without the pretty and you will see what I mean.

If you look closely at the Hornsby document you will see in the record array his only entry has the inseason property set to 2. This means there was at least two managers for the 1927 Giants. To find out who the other managers there were we can run the following query.

db.managers.find( { 'record.yearID' : 1927} )

This will return two documents one for John McGraw and one for Hornsby. Look closely at the query and your will see that I'm using a 'dot notiatiod' to find the managers for the 1927 season. Like the previous find it will return an array of matching documents. Unlike the previous query this one looks inside of the record array. Each document within the record array will have its yearID property compared to 1927. If it has one entry with that yearID then the document will be returned. Going back to my analogy of each entry in the record array being equal to a row in a relation database table you can almost thing of the dot notation as a join. One thing to keep in mind is whenever you use dot notation you must quote the property like I have done. Failure to do so will cause an error from mongoDB.

Remember I said I would show you how to test for the existence of a property? We will get a count of all the managers who have the playoffs property in their document.

db.managers.find({playoffs: {$exists:true}}).count()

This query simply says, find all the documents in managers that have the playoffs property. I am using $exists which is one of the built in query operators. To see what other operators are available checkout the MongoDB Operators page.

The last bit of work I need to do is to setup the indexes on the managers collection. Since most of the queries I run will be on either the last name or to look for a particular season I will add an index on nameLast and record.yearID. Here’s how to create an index in mongo:

db.managers.ensureIndex({ nameLast:1 })
db.managers.ensureIndex({ ‘record.yearID’:1})

These two calls create two separate indexes on the nameLast and record.yearID properties. Notice that I can use the dot notation when declaring an index also. The 1 indicates that we want the index created using ascending ordering. To create an index that uses the descending order swap out the 1 for a -1. Now our managers collection has three indexes: one on the _id property, one on the nameLast property and one on the record.yearID property. To see what indexes are on a collection you can run:

db.managers.getIndexes();

For more information on ensureIndex and getIndexes visit: http://docs.mongodb.org/manual/reference/method/db.collection.ensureIndex/ http://docs.mongodb.org/manual/reference/method/db.collection.getIndexes/


Players

Just like the managers, every player that has stepped onto the diamond in a New York or San Francisco Giants uniform will have a document in this collection. A player’s document will contain demographics, statistics, and appearances. If the player has been an all-star, won an award or has been inducted into the hall of fame his document will have additional properties. Below is the document for Eddie ‘Hotshot’ Mayo who played for the New York Giants in 1936.

The players document is considerably larger than the managers document is. The reason for that is I chose this design was it allows me to retrieve the Giants history of a player with a single query. Even though I've chosen a player centric design it is still relatively easy to find roster related information. As an example lets say we want to see who else played third for the Giants during the 1936 season. I could run the following query:

The query returns a total of four documents, four full player documents which makes it a little hard to read the names of the players. All I really want to see is the nameLast, nameFirst and the value fiendingStats.G for the players who played third. I can convert the output to only contain the values I've indicated by using a projection. I am also changing the names of the properties to something I find a little nicer to read. Now when I run the query I should have four much easier to read results. The updated query and results are below.

That returns the following:

Now that we have the data in a readable layout I would like to sort the players so that the man who played the most games at third will be listed first. Sorting is as easy as adding the $sort operator. Here's what the query looks like with the sort call added.

Notice that I used the new name that I created in the $project call. The -1 indicates we want to sort the games in a descending fashion. The results of the updated query are below.

If you’ve been paying attention you noticed that I was using a function called aggregate instead of find. The aggregate function allows us to chain commands together. We can use the aggreation pipeline to ‘filter’ our data. It works by passing the results from one task to another as illustrated in the $project and $sort calls. I used $project to rename the fieldingStats.G property to just games. I then used the new name, games, to sort by. Let’s walk through the last query to get a better picture of whats going on.

$unwind

{ $unwind : "$fieldingStats" }, 

What $unwind does is create a new document or each member of an array. That means a copy of the demographics is put together with each entry in the fieldingStats array. So if a player has 10 entries there will be 10 documents with the same demographic information. Each document will have a single entry in the fieldingStats directory. I chose the fieldingStats property to $unwind on because I am only interested in third basemen. Notice that fieldingStats has a $ in front of it. Remember, that means that you want to use the value of the $fieldingStats property in the command. If I executed the query now with only the $unwind call I would receive the following message.

aggregation result exceeds maximum document size (16MB)

The message brings up one thing I haven't mentioned yet and that is all documents must be less than 16MB in size. Remember the unwind creates many new documents. The players collection has 1675 documents in it, if each player has 5 years worth of stats for 3 different positions, you can see how the size of the result set will increase. Thankfully, in my case I'm filtering the results of the $unwind call down so the 16MB limit is not a problem for me. In my three months of working in MongoDB I have yet to have the size limit cause any issues for me.

$match

{ $match : {"fieldingStats.POS": "3B",
             "fieldingStats.yearID" : 1936 }},

The output of the $unwind call are passed as to the $match call as input. $match searches the input documents looking for documents that match the given parameters. In this case anyone who played third base during the 1936 season will be returned. The number of documents have gone from the thousands to four. The four complete players documents are passed the $project operator.

$project

{ $project : { _id : "$_id",
          lastName : "$nameLast",
         firstName : "$nameFirst",
             games : "$fieldingStats.G"}}

I've already gone over what the $project call does so I won't go into it again.

$sort

{ $sort : { games: -1 } } 

Since I have already gone over the $sort call, I won't do it again here.

Indexes

There will be a few more player API calls so I am going to create a few more ensureIndex calls. Here are the players collection indexes.


Seasons

Each season the New York/San Francisco Giants have played in professional baseball is represented by a document in the seasons collection.
Each document in the collection will have the team’s regular and playoffseason records, team statistics, the roster, and list of managers. The document below represents the 2012 season when the Giants won their second World Series championship in 3 years.

To get the 2012 season document I used another select function, findOne. It is similar to find but it only returns a single object. In cases where there are more than one matching document findOne will return the first document found in the ‘natural order’, it will return the first one stored on disk.

The seasons document is similar to the players and managers document in that it has a ‘core’ set of data that pertains the team’s season plus arrays that store information about the players who were on the team that year as well as the managers.

For the seasons collection I will add the indexes below. The API will make use of these indexes as you’ll see in the next post.

Summary

I have taken the data from 18 database tables and stored them in three different collections in my MongoDB database. The new schema will allow us to make the fewest calls to the database when retrieving player, managerial or season related data. Throughout the post I showed you how to run select statements in the mongo client using find, findOne and the aggregation pipeline. I this post helped illustrate some ways that MongoDB can be used to store data in ways that makes using the data easier.

Resources

MongoDB Doc Links

Baseball Sites