Extracting LaunchBox’s Video Game Metadata: Getting Data of Video Games

Getting the Data

As with the video game platforms data, I’ll use Python here. Instead of reading the Platforms.xml file, I’m going to read the Metadata.xml file. It’s a large file that contains all the metadata LaunchBox holds in its games database.

from datetime import datetime
import xml.etree.ElementTree as ET
import pandas as pd
import re
import seaborn as sb
import matplotlib.pyplot as plt
# read in the data
videogames_xml = ET.parse('Metadata.xml')
root_xml = videogames_xml.getroot()
# Video Game attributes to extract
videogame_attrs = ["Name", "ReleaseDate", "ReleaseYear", "Developer", "Platform", "Genres", "Publisher", "MaxPlayers", "ESRB", "Overview", "Cooperative"]

Making the data readable as a DataFrame

The next step was to look at the metadata XML file and for each game:

  • Retrieve the attributes listed in videogame_attrs above and store them in a list.
  • Add the game’s values, keyed by attribute name, to a list of rows.
  • Convert the list of rows into a DataFrame after looping through all video games.
# list that holds one entry per video game
rows = []
# loop through each Game tag and, for each game, collect the fields named in
# videogame_attrs, pair them with their names in a dictionary, and store
# that dictionary in the rows list.
for game in root_xml.findall('Game'):
    data = []  # store game data
    # loop through each attribute
    for field in videogame_attrs:
        element = game.find(field)
        if element is not None:
            data.append(element.text)  # append the data
        else:
            data.append(None)  # data not found, just set to None
    # pair each attribute name with its value and add the game to rows
    rows.append(dict(zip(videogame_attrs, data)))
# Take all the games and their data and put in a dataset/dataframe
games = pd.DataFrame(rows, columns = videogame_attrs)
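
Because Metadata.xml is large, a streaming parse may be gentler on memory than loading the whole tree at once. The sketch below is an alternative to the loop above, not what I actually ran. It uses ET.iterparse on a tiny in-memory sample; the LaunchBox root tag and the "Game A"/"Game B" rows are assumptions made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for Metadata.xml (hypothetical content, assumed layout).
sample_xml = b"""<LaunchBox>
  <Game><Name>Game A</Name><Platform>Sega Genesis</Platform></Game>
  <Game><Name>Game B</Name></Game>
</LaunchBox>"""

fields = ["Name", "Platform"]
rows = []

# iterparse yields each element once its closing tag has been read,
# so a Game element is complete by the time we see it here.
for event, elem in ET.iterparse(io.BytesIO(sample_xml), events=("end",)):
    if elem.tag == "Game":
        # findtext returns None when the child tag is missing
        rows.append({field: elem.findtext(field) for field in fields})
        elem.clear()  # drop the children we no longer need

print(rows)
```

For the real file you would pass 'Metadata.xml' to ET.iterparse instead of the BytesIO object and use the full videogame_attrs list.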

Cleaning the Data

With the data in a DataFrame, it is much easier to analyze, clean, and alter.

# The 'ReleaseDate' values carry a time/timezone suffix starting with 'T';
# remove it and keep just the date
dates = [re.sub("T.*", "", date)
         if date is not None else None
         for date in games["ReleaseDate"]]
# parse the dates as well for sorting purposes and store them in the column
games["ReleaseDate"] = [datetime.strptime(date, "%Y-%m-%d")
                        if date is not None else None
                        for date in dates]
# save dataframe in a csv file
games.to_csv("games.csv", index=False)
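
One thing to keep in mind: a CSV file stores dates as plain text, so the column has to be parsed again when the file is read back. Here is a minimal sketch of that round trip, using an in-memory buffer and two made-up rows instead of the real games.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned `games` DataFrame.
games = pd.DataFrame({
    "Name": ["Game A", "Game B"],
    "ReleaseDate": pd.to_datetime(["1994-02-02", None]),
})

# Write to an in-memory buffer instead of games.csv to keep the sketch tidy.
buffer = io.StringIO()
games.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates asks pandas to turn the text column back into datetimes;
# the missing date comes back as NaT.
reloaded = pd.read_csv(buffer, parse_dates=["ReleaseDate"])
print(reloaded.dtypes)
```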

Running Video Game Queries

Well, if you read this far, congrats. No really :)

Give me all the Mega Man video games ever made

You know Mega Man? You know, the ‘Blue Bomber’? The super fighting robot, fighting to save the world from the evil Dr. Wily and his robot masters?? (Cue the 1990s Mega Man animated theme song or your own personal Mega Man soundtrack)
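
One way to answer this is a case-insensitive text match on the Name column. The sketch below runs on a tiny stand-in for the games DataFrame; the rows are placeholders I picked for illustration, and with the real data you would skip the DataFrame construction.

```python
import pandas as pd

# Small stand-in for the `games` DataFrame built earlier (illustrative rows).
games = pd.DataFrame({
    "Name": ["Mega Man 2", "Sonic the Hedgehog", "Mega Man X", None],
    "Platform": ["Nintendo Entertainment System", "Sega Genesis",
                 "Super Nintendo Entertainment System", "Sega Genesis"],
})

# Case-insensitive substring match on the title; na=False treats missing
# names as non-matches instead of propagating NaN into the mask.
is_mega_man = games["Name"].str.contains("mega man", case=False,
                                         regex=False, na=False)
mega_man_games = games.loc[is_mega_man].sort_values("Name")

print(mega_man_games)
```

A plain substring match will also pick up any title that merely contains "Mega Man", so the result may need a quick look before calling it the complete list.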

What were the last few games created for the Sega Genesis console?

Where are all my SEGA lovers at? Who is old enough to remember the Sega and Nintendo console war?
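
A query like this comes down to filtering on Platform and sorting by ReleaseDate. Below is a minimal sketch, assuming the platform is stored as "Sega Genesis" and that ReleaseDate holds real datetimes; the game names and dates are made up.

```python
import pandas as pd

# Hypothetical rows standing in for the `games` DataFrame.
games = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Platform": ["Sega Genesis", "Sega Genesis", "Nintendo 64", "Sega Genesis"],
    "ReleaseDate": ["1994-02-02", "1997-11-15", "1998-05-01", None],
})
games["ReleaseDate"] = pd.to_datetime(games["ReleaseDate"])

# Keep only Genesis games, drop the ones without a release date,
# then take the most recent few.
genesis = games[games["Platform"] == "Sega Genesis"]
last_genesis = (genesis.dropna(subset=["ReleaseDate"])
                       .sort_values("ReleaseDate", ascending=False)
                       .head(3))

print(last_genesis[["Name", "ReleaseDate"]])
```

Games with no release date are dropped before sorting, so they cannot show up as the "last" releases by accident.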

How many Nintendo video games are out there? How many PlayStation games are out there?

Say no more, gamers. I got you covered.
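
Counting games per console family can be sketched as a keyword match on the Platform column. This assumes every Nintendo or PlayStation platform name contains that word, which is worth checking against the real platform list. The helper count_platform_games and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for the `games` DataFrame.
games = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D", "Game E"],
    "Platform": ["Nintendo 64", "Sony Playstation", "Nintendo Game Boy",
                 "Sega Genesis", None],
})

def count_platform_games(df, keyword):
    """Count rows whose Platform contains keyword, ignoring case."""
    mask = df["Platform"].str.contains(keyword, case=False,
                                       regex=False, na=False)
    return int(mask.sum())

nintendo_count = count_platform_games(games, "nintendo")
playstation_count = count_platform_games(games, "playstation")

print("Nintendo:", nintendo_count)
print("PlayStation:", playstation_count)
```

Matching without regard to case means "Playstation" and "PlayStation" are counted together.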

Last one: A bar plot of the number of games released per year
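
Here is a rough sketch of how such a plot can be built: count the games per ReleaseYear, then draw the bars. It uses matplotlib directly rather than seaborn to keep the example small; the sample rows and the plot_games_per_year helper are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical rows standing in for the `games` DataFrame.
games = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D", "Game E", "Game F"],
    "ReleaseYear": ["1991", "1991", "1993", None, "1993", "1993"],
})

# ReleaseYear comes out of the XML as text, so convert it to numbers first;
# errors="coerce" turns anything unparseable into NaN, which dropna removes.
years = pd.to_numeric(games["ReleaseYear"], errors="coerce").dropna().astype(int)
games_per_year = years.value_counts().sort_index()

def plot_games_per_year(counts):
    """Draw one bar per year and return the Axes for further tweaking."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(counts.index, counts.values)
    ax.set_xlabel("Release year")
    ax.set_ylabel("Number of games")
    ax.set_title("Games released per year")
    return ax

ax = plot_games_per_year(games_per_year)
print(games_per_year)
```

Call plt.show() afterwards to display the figure, or plt.savefig with a file name to write it to disk.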


I had so much fun working on this data set :) I can spend hours learning about video games new and old. I hope you enjoyed this blog post and learned how to extract video game data, answer questions with it, and gain insight. This data set is much larger than the one I used in the previous post, so you have a lot more data to play around with and many more questions to look into.



Jonathan Hernandez

Data science grad who loves blogging about data science topics. Open to work.