Extracting LaunchBox’s Video Game Metadata: Getting Data of Video Games

Jonathan Hernandez
Feb 20, 2022

***Work originally done in December 2020***

Hello world!!!

Just like I promised in my last post, I have also created a data set of video games from LaunchBox’s (https://www.launchbox-app.com/) XML data.

Folks, I gotta tell you: looking at the data that LaunchBox has, there appear to be more than 108k video games from throughout history!!!

That’s a lot of game play and a lot of mashing buttons you could do many generations over lol :)

In this blog post, I will show you how I took LaunchBox’s metadata XML file and was able to get a list of games and their attributes into a CSV file.

Disclaimer: All data and credit go to LaunchBox. I do not own any rights to the data. The blog and the results of the data are for entertainment purposes only.

Getting the Data

Much like getting the video game platforms data, the same approach will be applied using Python. Instead of reading in the Platforms.xml file, I’m going to read in data from the Metadata.xml file. It’s a large file that contains all the metadata LaunchBox has in its gaming database.

The metadata XML file can be downloaded from my GitHub repository, which I will include at the end of the article.

The code below shows how the Metadata.xml file was extracted:

from datetime import datetime
import xml.etree.ElementTree as ET
import pandas as pd
import re
import seaborn as sb
import matplotlib.pyplot as plt
# read in the data
videogames_xml = ET.parse('Metadata.xml')
root_xml = videogames_xml.getroot()

Next, we make a list of the attributes we want to extract for each game. These include things like name, release date, genre, and information about the companies or parties involved in making the game.

# Video Game attributes to extract
videogame_attrs = ["Name", "ReleaseDate", "ReleaseYear", "Developer", "Platform", "Genres", "Publisher", "MaxPlayers", "ESRB", "Overview", "Cooperative"]

Making the data readable as a DataFrame

The next step was to look at the metadata XML file and for each game:

  • Retrieve the attributes listed above and store them in a list.
  • Append that game's data to another list.
  • Convert the nested list into a DataFrame after looping through all video games.

The Metadata.xml file is large and has many kinds of tags. What we are interested in are the <Game> tags and their data.
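To make the structure concrete, here is a minimal sketch using a tiny, made-up XML snippet shaped the way the code in this post assumes Metadata.xml is shaped: a root element with <Game> children whose child tags hold the attributes. The root tag name, the sample values and the helper `read_field` are assumptions for illustration only, not copied from the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real Metadata.xml is far larger and has more tags.
SAMPLE_XML = """
<LaunchBox>
  <Game>
    <Name>Example Game</Name>
    <Platform>Example Console</Platform>
    <ReleaseDate>1990-01-15T00:00:00-08:00</ReleaseDate>
  </Game>
  <Game>
    <Name>Another Game</Name>
  </Game>
  <Platform>
    <Name>Example Console</Name>
  </Platform>
</LaunchBox>
"""


def read_field(game, field):
    """Return the text of a child tag, or None when the tag is missing."""
    element = game.find(field)
    return element.text if element is not None else None


sample_root = ET.fromstring(SAMPLE_XML)
# findall("Game") only matches direct <Game> children of the root,
# so the <Platform> element is skipped.
sample_games = sample_root.findall("Game")
sample_names = [read_field(g, "Name") for g in sample_games]
sample_platforms = [read_field(g, "Platform") for g in sample_games]
```

The second sample game has no <Platform> tag, so its platform comes back as None. That is why the extraction loop has to check for missing tags.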

Here is the code snippet:

# list to hold one record per video game
rows = []
# loop through each game tag and for each game, get the data per
# videogame_attrs and store in a dictionary and then at the end,
# store in the rows list.
for game in root_xml.findall('Game'):
    data = []  # store game data
    # loop through each attribute
    for field in videogame_attrs:
        element = game.find(field)  # look the tag up once
        if element is not None:
            data.append(element.text)  # append the data
        else:
            data.append(None)  # data not found, just set to None
    # pair each attribute name with its value and keep the record
    rows.append(dict(zip(videogame_attrs, data)))
# Take all the games and their data and put in a dataset/dataframe
games = pd.DataFrame(rows, columns=videogame_attrs)

The variable ‘games’ contains the DataFrame of all video games in the LaunchBox database.
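Since Metadata.xml is a large file, parsing it in one go keeps the whole tree in memory. One possible alternative, which is not what I used for this post, is to stream the file with ElementTree's iterparse and clear each game once it has been read. The sketch below assumes the same <Game> layout as above; `stream_games` is a hypothetical helper and the inline bytes are made-up sample data.

```python
import io
import xml.etree.ElementTree as ET


def stream_games(source, fields):
    """Yield one dict per <Game> element while the file is being parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Game":
            # findtext returns None when the child tag is missing
            yield {field: elem.findtext(field) for field in fields}
            elem.clear()  # drop the children we no longer need


# Made-up stand-in for the real file; pass 'Metadata.xml' instead.
demo_file = io.BytesIO(
    b"<LaunchBox>"
    b"<Game><Name>Example Game</Name></Game>"
    b"<Game><Name>Another Game</Name><Platform>Example Console</Platform></Game>"
    b"</LaunchBox>"
)
demo_rows = list(stream_games(demo_file, ["Name", "Platform"]))
```

The resulting list of dictionaries can be handed to pd.DataFrame the same way as `rows`. One small difference: for a tag that exists but is empty, findtext gives an empty string where `.text` gives None.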

Cleaning the Data

With the data in a DataFrame, it is much easier to analyze, clean, and alter.

The ‘ReleaseDate’ column has both the date and the time and, just like in the previous post on video game consoles, I’ll be removing the time and keeping the date.

I also parsed the dates using the YYYY-MM-DD format for sorting purposes.

# There are time/timezones in the 'ReleaseDate' column starting with 'T'
# remove them and just keep the date
dates = [re.sub("T.*", "", date) if date is not None else None
         for date in games["ReleaseDate"]]
# parse the date strings and store them back in the column so that
# the column sorts chronologically
games["ReleaseDate"] = [datetime.strptime(date, "%Y-%m-%d")
                        if date is not None else None
                        for date in dates]
# save dataframe in a csv file
games.to_csv("games.csv", index=False)

I also saved the dataset as a CSV file in case you want to analyze the data for yourself.
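If you want to pick up from the CSV instead of re-parsing the XML, something like the following should work. It is only a sketch: the in-memory text below is a made-up stand-in for games.csv so the example runs on its own, and only three of the columns are shown.

```python
import io

import pandas as pd

# Made-up rows standing in for games.csv (not real LaunchBox data).
sample_csv = io.StringIO(
    "Name,ReleaseDate,Platform\n"
    "Example Game,1990-01-15,Example Console\n"
    "Undated Game,,Example Console\n"
)

# For the real file, pass "games.csv" instead of sample_csv.
loaded = pd.read_csv(sample_csv, parse_dates=["ReleaseDate"])

# Missing dates become NaT and are placed last when sorting.
oldest_first = loaded.sort_values("ReleaseDate")
```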

Running Video Game Queries

Well, if you read this far, congrats. No really :)

With our newfound data, let’s get some answers on video games.

Give me all the Mega Man video games ever made

You know Mega Man? You know, the ‘Blue Bomber’? The super fighting robot, fighting to save the world from the evil Dr. Wily?? (Cue the 1990s Mega Man animated theme song or your own personal Mega Man soundtrack)

Mega Man is a popular run-and-gun video game franchise that’s been around for decades.

Let’s see which Mega Man games exist:
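The exact query is not reproduced here, so this is a minimal sketch of how such a search could look in pandas. The helper `find_by_name` and the sample rows are made up for illustration; on the real data you would pass the `games` DataFrame instead.

```python
import pandas as pd


def find_by_name(games, pattern):
    """Return games whose Name matches the regex, ignoring case."""
    mask = games["Name"].str.contains(pattern, case=False, regex=True, na=False)
    columns = ["Name", "Platform", "ReleaseDate"]
    return games.loc[mask, columns].sort_values("ReleaseDate")


# Made-up rows standing in for the real `games` DataFrame.
sample_games = pd.DataFrame({
    "Name": ["Mega Man Example Two", "MegaMan Example One",
             "Unrelated Game", None],
    "Platform": ["Console A", "Console B", "Console A", "Console C"],
    "ReleaseDate": ["2001-01-01", "2000-01-01", "1995-05-05", None],
})

# \s? makes the space optional, so both spellings are matched.
mega_man_games = find_by_name(sample_games, r"mega\s?man")
```

Passing na=False treats games without a name as non-matches instead of raising an error when filtering.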

Well that is a lot of running and gunning and a lot of robot masters defeated :)

What were the last few games created for the Sega Genesis console?

Where are all my SEGA lovers at? Who is old enough to remember the Sega and Nintendo console war?

The query below shows the final games released for the SEGA Genesis before it was discontinued.
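One way to write that query, as a rough sketch: filter on the platform, drop games without a release date, and sort newest first. The helper `last_releases` is hypothetical, the rows are made up, and the label "Sega Genesis" is my assumption about how the platform is spelled in the ‘Platform’ column.

```python
import pandas as pd


def last_releases(games, platform, n=5):
    """Return the n most recently released games for one platform."""
    on_platform = games[games["Platform"] == platform]
    dated = on_platform.dropna(subset=["ReleaseDate"])
    return dated.sort_values("ReleaseDate", ascending=False).head(n)


# Made-up rows standing in for the real `games` DataFrame.
sample_games = pd.DataFrame({
    "Name": ["Early Game", "Late Game", "Undated Game", "Other Platform Game"],
    "Platform": ["Sega Genesis", "Sega Genesis", "Sega Genesis",
                 "Example Console"],
    "ReleaseDate": ["1990-01-01", "1998-06-01", None, "1999-01-01"],
})

final_genesis_games = last_releases(sample_games, "Sega Genesis", n=2)
```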

Several online sources will say “Frogger” was the last game for the SEGA Genesis while the LaunchBox-app.com site has “Duke Nukem 3D” as the last game released.

“It’s time to kick a** and chew bubble gum. And I’m all out of gum.”

Duke Nukem fans will get the above quote.

How many Nintendo video games are out there? How many PlayStation games are out there?

Say no more gamers. I got you covered.
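A hedged sketch of how the counting could be done: match a keyword against the ‘Platform’ column, since a brand like Nintendo spans several consoles. The helper `count_platform_family` is hypothetical, and the platform labels in the sample rows are assumptions about how they appear in the data.

```python
import pandas as pd


def count_platform_family(games, keyword):
    """Count games whose Platform contains the keyword, ignoring case."""
    mask = games["Platform"].str.contains(
        keyword, case=False, regex=False, na=False
    )
    return int(mask.sum())


# Made-up rows standing in for the real `games` DataFrame.
sample_games = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D", "Game E"],
    "Platform": ["Nintendo Entertainment System", "Nintendo 64",
                 "Sony Playstation", "Sega Genesis", None],
})

nintendo_total = count_platform_family(sample_games, "Nintendo")
playstation_total = count_platform_family(sample_games, "PlayStation")
# A per-console breakdown is one call away:
per_platform = sample_games["Platform"].value_counts()
```

Matching without regard to case is deliberate here, so the count does not depend on whether the data spells it "Playstation" or "PlayStation".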

Last one: A bar plot on number of games created by year
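The plotting code is not reproduced here, so below is a minimal sketch of one way to draw it. It uses plain matplotlib rather than seaborn, the helper `plot_games_per_year` is hypothetical, and the sample rows are made up. The ‘ReleaseYear’ values come out of the XML as text, so they are converted to numbers first.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_games_per_year(games):
    """Draw a bar plot of games per release year; return (counts, axes)."""
    # Text that is not a number (or is missing) becomes NaN and is dropped.
    years = pd.to_numeric(games["ReleaseYear"], errors="coerce").dropna()
    counts = years.astype(int).value_counts().sort_index()
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.bar(counts.index, counts.values)
    ax.set_xlabel("Release year")
    ax.set_ylabel("Number of games")
    ax.set_title("Number of games by release year")
    return counts, ax


# Made-up rows standing in for the real `games` DataFrame.
sample_games = pd.DataFrame(
    {"ReleaseYear": ["1990", "1990", "1991", None, "unknown"]}
)
year_counts, year_axes = plot_games_per_year(sample_games)
```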

It looks like, for roughly the first 25 years, the number of games created each year grew at a near-exponential rate, followed by some declines and peaks.

Also note that not many games were made in 2020. This could have been due to the economic slowdown we had and/or the COVID-19 pandemic.

Summary

I had so much fun working on this data set :) I can spend hours learning about video games new and old. I hope you enjoyed this blog post and learned how to extract video game data, answer questions with it, and gain insight. This data set is much larger than the one I used in the previous post, so you have a lot more data to play around with and many more questions to look into.

What other questions would you like me to answer given the data? Let me know your thoughts and comment what you would like.

Link to source code and the CSV (comma-separated values) file for the data can be found here:

The CSV file for video games is called ‘games.csv’. You can use software such as Excel or Notepad to open this kind of file if you’re interested.

For those who love working with databases and know SQL, I have two .db files ‘Video_Game_Platforms_DB.db’ and ‘Video_Games_DB.db’ which have database tables for video game platforms and video games respectively.

Lastly, I want to give a big thanks and shout out to LaunchBox for providing me the data to show you guys. This blog post and the previous one would not have been possible if it weren’t for them.
