data-mining games

Strangelove

AI Researcher
AKA
hitoshura
with the pc release of rebirth on the horizon, i would be interested in using the tools mentioned here to extract the text from the pc version and have an easily readable copy of the script(s) and just poke around and see if there's anything interesting there

in the distant past i messed around using other tools to see what i could find (using, iirc, an ffx tool i managed to find some unused voice lines in 'digital devil saga' that are just text in the actual game). for a few nds games like ffxii revenant wings and xenosaga 1&2 i managed to extract the text, although that was because they were just easily accessible text files and i didn't need to mess around too much. at one point i managed to extract some of the localisation files from uncharted 1 but i don't remember the process i used now. it might have been by trying that same ffx tool whose name i don't remember, but i might have burnt what i got onto a dvd somewhere (that's how long ago this was, people were still burning dvds)

from the above examples it should be clear that i'm mainly interested in getting the language-related data out of the games rather than extracting models or romhacking (which i'd be interested in primarily to learn how to access the text). in my limited reading so far one hurdle for a lot of games, particularly older ones, is the way different developers would use different file types to store things so the method that works for one game wouldn't necessarily work for another. i don't know if this is easier with the rise of more devs using 3rd party engines like unreal and unity rather than proprietary ones. that's my wishful thinking, because trying to work out some of this stuff is way over my simple little head lol. you can't tell me i need to use a hex editor, i am a stupid baby

stuff i was going to be experimenting with first:

  • death stranding: i have a pc copy i need to reinstall and then look around and hopefully, it's just sitting in the open somewhere
  • front mission 1st: was going to be looking at the nds release which i was under the impression included additional story content that isn't in the recent remake, but i might be wrong on that. if the nds version is too much for me to work out i might look at the remake in the hopes it will be simpler

and a kind of wish list of varying difficulties:

  • final fantasy xvi: quick googling tells me this might be a unique engine rather than unreal like other recent square enix titles so i don't know if other tools will work there
  • xenosaga 1/2/3: i am apprehensive about older disc-based games like these because i get the impression they need more skill than just poking around a pc game's files until you find the localisation stuff, but it would be super cool to me personally to get all the text out of these
 

Strangelove

AI Researcher
AKA
hitoshura
unrelated to any of the above games, i installed alien isolation on a whim and all the localisation files are just. right there in easily identifiable and accessible .txt files

thank you lord, why can't everyone just do this :sadpanda:
 

Strangelove

AI Researcher
AKA
hitoshura
i don't know why i thought newer games would be easier to work with when you don't know anything about what you're doing, this is still too difficult for me lol. i have tried various tools (umodel, watto's game extractor, another one but i'm forgetting the name now) but am still not having much luck. with games that have multiple localisations like most modern ones i can at least find files/folders labelled with language codes but i haven't managed to extract anything useful from them yet.
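for the "find anything labelled with a language code" step, here is a rough python sketch of how that search could be automated. everything in it is an assumption for illustration: the function name `find_language_paths`, the set of codes, and the idea that the codes show up as separate chunks of a file or folder name (like `text_en.dat` or a folder called `fr`). it only lists candidates, it doesn't open or decode anything.

```python
import re
from pathlib import Path

# assumed list of codes; extend it with whatever the game actually ships
LANG_CODES = {"en", "jp", "ja", "fr", "de", "es", "it", "ko"}

def find_language_paths(root, codes=LANG_CODES):
    """Return (path, matched codes) for files/folders whose name contains a language code."""
    hits = []
    for path in Path(root).rglob("*"):
        # split the name on anything that isn't a letter, so "text_en.dat"
        # becomes {"text", "en", "dat"} and "engine.pak" does not match "en"
        tokens = set(re.split(r"[^a-z]+", path.name.lower()))
        matched = tokens & set(codes)
        if matched:
            hits.append((path, sorted(matched)))
    return sorted(hits)
```

pointing `find_language_paths` at an install folder would give a list of paths to start poking at. it will miss games that don't name things this way, and a match only says the name looks promising, not that the text is readable.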

in no particular order, a list of games i have tried messing around with (and version used) to little avail:

  • death stranding (original version, epic games store): i haven't used any sort of tools on this yet, i just looked through the installed files but they weren't named in a way that's plainly legible to me (a huge dumbass). i downloaded a mod to add ukrainian language support that seems to replace the french one(???) which involves replacing a file, which should tell me which file at least contains the french localisation. but i downloaded the mod on a different computer, and all the files are large container files named like "bh12hfmsokfofhaor" so i need to sort out these files so they're on the same computer first
  • tales of arise (base game, pc game pass): i did at least manage to find directories that suggest they're used for localisation (names including "en/jp/es/ko/de/fr/etc.") but when i tried to open what i found there they all seemed to have the same contents and none of it was the text from the game. i tried using umodel for the first time on this but it kept crashing so i might have messed up somewhere
  • persona 3 reload (base game, pc game pass): again, found files/folders with names suggesting localisation files but haven't managed to access them
  • crisis core reunion (pc version, downloaded from somewhere lol): got it just to test the ff7remake text tools but haven't gotten around to it yet
  • caligula overdose (pc version, let's not ask where it's from): same as previous attempts, no success
  • front mission 1st remake (pc version): using the program the name of which i can't remember now (a file viewer/extractor that opens in your browser), i could look around the files and see some things that might contain text but i couldn't do anything with them at the time. just by browsing the files in windows explorer i did find some video files which have subtitles in various languages, which is at least something.
  • front mission 1st (nds japanese version): files extracted, but i couldn't locate the text
  • final fantasy xii revenant wings (nds japanese & euro version): one of the first ones i tried again, since i did it before somehow. while i was mainly going for the japanese text, i tried the euro version as well to use the localised files (which in a multi-lang release will be labelled "en/fr/es/de/etc.") to determine the names and locations of files that are likely to have text in the japanese version without solely relying on names. (also it's easier to search the language codes first to find them than rummaging through everything.) iirc i did find some files, although for languages like spanish and french accented letters weren't displaying properly when i opened them but that's probably on me
  • hayarigami 1 ds (nds japanese version): failed to find the text so i consoled myself by listening to the low res version of the ending song from the game files. i just saw that there's a collection of 1, 2 & 3 from 2023 for the switch, ps4 and ps5 that i didn't even know about which might yield better results (but probably not with me doing it lol)
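on the accented letters not displaying properly in the revenant wings files: one common cause of that (not necessarily the cause here) is the file being opened with a different text encoding than the one it was saved in. a minimal python sketch of checking this, where the candidate encodings and the sample sentence are assumptions for illustration and not anything taken from the game:

```python
# assumed candidates; the real files may use something else entirely
CANDIDATE_ENCODINGS = ["utf-8", "cp1252", "shift_jis"]

def try_decodings(raw, encodings=CANDIDATE_ENCODINGS):
    """Decode the same bytes with each encoding; None means the bytes were invalid for it."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results

# made-up sample line, stored as utf-8 bytes
sample = "¿Dónde está el cristal?".encode("utf-8")
decoded = try_decodings(sample)

for enc, text in decoded.items():
    print(enc, "->", text)
```

the idea is just to eyeball which line of output has the accents intact. with a real file the bytes would come from something like `open(path, "rb").read()`, and in notepad++ the encoding menu does roughly the same job by hand.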

idk if there's a step by step tutorial out there to walk you through doing this stuff, i feel like i need a lot of help. tales of arise is leaving game pass in under 2 weeks so i would like to try and crack that before then. i have found a text dump from arise but it doesn't seem to be complete (searching the file for a line of dialogue from the game didn't get any results)
 

X-SOLDIER

Harbinger O Great Justice
AKA
X
While I've never attempted data mining, as it's rather significantly far outside of my own skill set, I do have a bit of exposure to some subsets of folks who do data mining via the Souls community with things like Bloodborne & Elden Ring. I've dropped very briefly into the DMs or Discords of those groups to see if there are particular bits of game data that anyone knows of which I can't find documentation for, when I'm poking around trying to get a sense of various details of game dev cycles from what I can see about the way that design happens to interconnect. There's a section of this video on one of the obscure Dark Souls 2 puzzles that gets into the weeds a little bit about data mining, and it notes that even for all of the reverse engineering & whatnot that's been done with it, none of the community actually knows what the game engine is even called, and Dantelion is just the community nickname for it.


In general this is one of those times where the Internet is most useful not as a repository of knowledge, but rather as an index of knowledge. While it's unlikely there's a guide that'll be broadly applicable across a wide range of titles that aren't using simple .txt files, there are almost certainly other people who're looking to do this as well, and who oftentimes have skill sets complementary to your own, which makes figuring those things out vastly more viable as a group effort. Given the different ways that various dev teams approach things, I'd expect that some of those will end up being more concentrated around various communities specific to those games.

The only person I can think of here who'd likely have a solid perspective on some of that would be @Shademp, with all of the DoC stuff he's dug through over the years that's REALLY deep into the weeds of everything with the game and how locked away various things can be. I know he's also worked pretty closely with the Speedrunning community, and there tend to be a lot of TAS & other development-minded folks there who have a particular familiarity with tools like that and who are always integral in assisting with routing & optimization for runs of various games, which can mean that they have particularly specialized knowledge that can point you in the right direction, even if sometimes it's just knowing which approaches NOT to take. If you're not finding anything that's getting what you're looking for, I'd expect that a Discord / Reddit / Forum that does speedrunning for some of those is likely to have at least a couple of folks with some experience in dialogue/text extraction from various games who might be able to give you a hand.


X :neo:
 

Strangelove

AI Researcher
AKA
hitoshura
i thought modern games using the same engine (even if it is modified) would lead to skills being applicable for many titles, as opposed to older titles where they were more likely to be proprietary engines that would need reverse engineering at worst. there’s plenty of ps2 games i wouldn’t mind the text for but i figure they would fall into the proprietary engine category

also while looking stuff up i came across a reddit post that sounded a bit snippy about how this kind of stuff isn’t technically data mining but something else. and idk man, that was just the first way of phrasing it i thought up lol (the post that got the snippy reply wasn’t me or anything but i am still feeling attacked). although my searches did lead me to think about looking into the making of fan translation mods (since you obviously need to access the text to change it for the translation)


but besides that, i have had my first success that wasn’t just stumbling on txt files in the alien isolation data.

tales of arise (base game, pc game pass): using a new (to me) program called fmodel to open the pak files i have managed to extract the files from the original japanese version and a few localisations (i could do more, it just didn’t personally interest me, although i might do it anyway since it’s easy enough). all the files have been exported as json files which i can open and read with notepad++ without issue.

initially i tried to use unrealpak, which comes with unreal engine, but the guide i followed involved using the command prompt on windows and i couldn’t get it to work. from one of the error messages it seems that the newer version of unreal engine might no longer support working with older projects (i had ue 5 installed and iirc tales of arise is 4.2 or something). tried quickbms or whatever it’s called but i guess i don’t have/don’t know the proper configuration files to work with arise’s files because i couldn’t get it to work either

i have a spreadsheet containing some text from tales of arise but couldn’t find some story dialogue in it. however, although i haven’t checked it thoroughly because i am just over half way through the game myself, the files i have found so far seem to include the dialogue from cutscenes as well as item descriptions, field dialogue, etc. the files aren’t the prettiest or most readable (they’re full of written code, and they might not be in the proper order the dialogue plays in in-game judging from a quick peruse of the opening), but it might be if not complete then more complete than what i had found. i don’t know how to go about cleaning it up (did people make these spreadsheets themselves or was it automated in some way?) but i will worry about that later
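on the cleaning-up question, one way it could be automated is a small script that walks every exported json file and keeps only the text values. this is a sketch under assumptions: the layout of the exported json isn't known here, so it just collects every string it finds anywhere in each file (which will also pick up ids and asset names along with dialogue), and the names `iter_strings` and `dump_strings` are made up for this example.

```python
import json
from pathlib import Path

def iter_strings(node):
    """Yield every string value found anywhere inside parsed json."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)

def dump_strings(json_dir, out_path):
    """Write every non-empty string from every .json under json_dir into one text file."""
    json_dir = Path(json_dir)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(json_dir.rglob("*.json")):
            # utf-8-sig also copes with files that start with a byte order mark
            data = json.loads(path.read_text(encoding="utf-8-sig"))
            # header line so each string can be traced back to its source file
            out.write(f"## {path.relative_to(json_dir)}\n")
            for text in iter_strings(data):
                if text.strip():
                    out.write(text + "\n")
                    count += 1
    return count
```

the output keeps the file order and the order of values inside each file, so it won't fix dialogue that is stored out of sequence. filtering by key name (for example only keeping values under whatever key holds the dialogue) would be the next step once the real structure is known.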

i am just happy i managed this little bit of success :sadpanda:
 

Strangelove

AI Researcher
AKA
hitoshura
fmodel is only for unreal engine games, so i tried it on two of the other unreal games i've been tinkering with:

crisis core reunion: although i opened it and decrypted the files, i only managed to find the localisation files for the engine itself rather than the game. i would assume some of this might have to do with the way i believe they made this game (taking the original psp data and sort of layering unreal on top of it? i think that's what happened with the recent ninja gaiden black rerelease and i'm sure i read ccr was a similar case), but i'm not sure because i had more success with my other attempt. while i could open files i only saw what seemed like partial directories rather than all the files. i don't know if this is error on my part or if it's something to do with the copy of the game i have. this is a bit of a lesser priority since i think a majority of the text has already been transcribed elsewhere. the exceptions probably being stuff like tutorial text, field npc dialogue, and any unused text

caligula overdose: overdose is an unreal engine remake of the original caligula from the ps vita, and i saw some slides from a talk that sounds like it was built using data from the vita version. i don't know if it was the same method, and there is a video of the corresponding talk but i haven't watched it. but nevermind, because i had a lot more luck with this and seem to have found and extracted the majority (if not the entirety) of the text including localisations. i have some screencaps from the ps4 version somewhere that i will use to check if i can find all the text, but i am optimistic about it. although that might mean converting all the json files i exported to plain text so i can easily search them with windows explorer
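as an alternative to converting everything first, the exported json files could be searched directly with a short script. a minimal sketch, assuming the exports are utf-8 text and that each line of dialogue sits on a single line of the file; `find_line` is a made-up name for this example.

```python
from pathlib import Path

def find_line(root, phrase, suffixes=(".json", ".txt")):
    """Return (path, line number, line) for every line containing phrase, ignoring case."""
    needle = phrase.casefold()
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # errors="replace" keeps the search going if a file isn't clean utf-8
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.casefold():
                hits.append((path, number, line.strip()))
    return hits
```

typing a line from a screencap as `phrase` and getting zero hits would suggest that line is missing from the export, with one caveat: some json writers store non-ascii characters as `\uXXXX` escapes, in which case japanese or accented text would need to be searched after parsing the json rather than as raw text.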


on the back of this, i'm curious to try fmodel on ever crisis and maybe ff type-0 (was that unreal or did it just also have terrible motion blur originally lol), although if type-0 is anything like crisis core i have concerns
 