Extracting LaunchBox’s Video Game Metadata: Getting Data of Video Game Consoles

Jonathan Hernandez
4 min readFeb 19, 2022

***Work done originally on November 2020***

Hello world!!!

Where are all my gamers new and old at?!

Hopefully you got a PlayStation 5 despite the chip shortage situation going on :)

Love video games and coding?

Well, if you said yes to any one of those, I am here to let you know that I was able to get metadata from LaunchBox (https://www.launchbox-app.com) of their video games database.

I was always interested in having my own internal database of video games for my own analysis. I always wanted to know how many video games existed, how many platforms are out there and so on.

This blog post is about me extracting the list of video game platforms from LaunchBox’s XML metadata.

Looking at the Data

The zip file that was given to me by Launchbox has several XML files. One of them is a file containing all the platforms that Launchbox has in their records.

Here is some of that data as an example:

The platforms that are listed on this file also have other tags in them that define some specs on the platform such as name, CPU, description, release date and so on.

My goal was to take this data and convert it to a DataFrame (a table of rows and columns) where each row represents a console and the columns are the attributes of that console. Putting it in this format helps for better readability and analysis.

The steps were as follows:

  • Read the XML file of video game platforms
  • For each platform, extract certain attributes and store them in a DataFrame.
  • Take the DataFrame and write it to a CSV file.

Getting the Data

I used python to simply read in the XML file as follows:

import xml.etree.ElementTree as ET
import pandas as pd
import re
# read in the data
platforms_xml = ET.parse('Platforms.xml')
root_xml = platforms_xml.getroot()

Once we have the root or starting point of the XML tree, next step is to for each platform tag, read in each of the attributes:

  • Name
  • Emulated
  • Release Date
  • Developer
  • Manufacturer
  • CPU
  • Memory
  • Graphics
  • Sound
  • Display
  • Media
  • Maximum Controllers
  • Notes
  • Category
# Create Empty list to store attributes
platform_attrs = ["Name", "Emulated", "ReleaseDate", "Developer",
"Manufacturer", "Cpu", "Memory", "Graphics", "Sound",
"Display", "Media", "MaxControllers", "Notes", "Category"]
# list to store all the data to be converted to CSV
rows = []
for platform in root_xml:
data = []
# Check if any of them are null and if so, set the value to 'None'
for field in platform_attrs:
if platform is not None and platform.find(field) is not None:
data.append(platform.find(field).text) # add the data
else: # it is a empty value set to None
data.append(None)
# append the data/observation to the rows list
rows.append({platform_attrs[i]: data[i]
for i in range(0,len(platform_attrs))})

Taking the Data Extracted from the XML File and Putting in a DataFrame

The code below shows how to convert the extracted data to a data frame:

# Create a dataframe out of the extracted data from the xml file platforms = pd.DataFrame(rows, columns = platform_attrs)

Only 2 things I wanted to clean in the dataset was to remove the time in the “ReleaseDate” field and just keep the date. I didn’t feel it was necessary to know what actual time the console was released.

# looking around the csv file, there are time/timezones in the date as well.
# let's remove the time and keep the date
dates = [re.sub("T.*", "", date)
if date is not None else None
for date in platforms["ReleaseDate"]]
platforms["ReleaseDate"] = dates

Querying the Data

After all that work, I was curious on a few questions like

  1. How many Nintendo consoles have been made and when?
  2. What are the specs of each Sony console?
  3. Which consoles were using floppy disks and when did they start fading out?

First question can be answered by searching for “Nintendo” in the “Name” column and outputting the name and release date and sort it by “ReleaseDate”.

The second question can be answered similar to the previous question except, search for “Sony” instead. Then list the name, CPU, memory, graphics and, display.

The answer to the last one is just search for “floppy” in the “Media” column and display the Name, release date and, media used.

Anyone got a case of nostalgia looking at the floppy disk results? :)

Summary

I have shown you how I took XML data courtesy of LaunchBox and using programming and logic, convert it to a DataFrame and reveal insights about the data. I love to let the data speak for itself. :)

What over queries can you come up with this data? I would love to hear your feedback.

Hope you enjoyed reading this blog post and links to the data and code can be found here:

On that link, you can also download the .zip file of the Metadata that I used for this blog.

I also made a .csv (Comma Separated Value) file which you can open with any text editor or spreadsheet software of your choice.

--

--

Jonathan Hernandez

Data science grad who loves blogging about data science topics. Open to work.