
To be honest I’ve never tried this on a PC.
Also as a heads up sorry if you know already what I’m about to say, I don’t know what your background is so I’m kinda assuming it’s “tech literate but hasn’t done a project like this before”. If it is too complex or too patronizing please correct me by asking questions or telling me what to skip next time, respectively.
I think most companies handle training by renting an instance on a cloud service provider and then running the training there. That’s the expensive part of development but can be mitigated by being smart with how often & what sort of training you choose. At minimum, a lot of debugging can be done on a PC.
Inference tends to be cheaper, so perhaps that could be done on a PC. I’ve never tried personally but could help look into it.
An application like this is usually referred to in industry as a Retrieval-Augmented Generation (RAG) LLM. The premise is you have an external database (such as a wiki, training materials, or a set of news articles) and you teach the LLM how to reference that database when generating its replies. The database does not need to be large, and you can also put measures in place so that it says “I don’t know” when it can’t come up with a satisfactory grounded answer.
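If it helps to see the shape of it, here’s a toy sketch of the retrieval half plus the “I don’t know” fallback. Everything in it is made up for illustration: the two `DOCS` entries, the function names, and the `min_score` cutoff are my assumptions, and the word-overlap scoring is just a stand-in (real setups normally use embeddings and a vector store). The step where the prompt actually gets sent to an LLM is left out on purpose.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for the external database (e.g. a couple of wiki pages).
DOCS = {
    "recovery": "Recovery time after surgery varies and the wiki lists typical timelines.",
    "insurance": "Insurance coverage depends on the provider and the plan.",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, docs=DOCS, min_score=0.2):
    """Return the id of the best-matching document, or None if nothing is close enough."""
    query = Counter(tokenize(question))
    best_id, best_score = None, 0.0
    for doc_id, text in docs.items():
        score = cosine(query, Counter(tokenize(text)))
        if score > best_score:
            best_id, best_score = doc_id, score
    # The cutoff is an arbitrary illustrative value, not a recommended setting.
    return best_id if best_score >= min_score else None


def answer(question, docs=DOCS):
    """Build a grounded prompt, or fall back to "I don't know."."""
    doc_id = retrieve(question, docs)
    if doc_id is None:
        return "I don't know."
    # In a real system this prompt would be passed to the LLM; here we just return it.
    return f"Answer using only this source:\n{docs[doc_id]}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(answer("How long is recovery after surgery?"))
    print(answer("Which pizza topping is best?"))
```

The point is just that the “I don’t know” behaviour can live in the retrieval step: if nothing in the database scores above the cutoff, the model never gets asked to improvise.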
Something to consider would be what sort of database you’d like to reference. For example, I know someone who has a complete backup of the Transgender_Surgeries wiki, including the posts it links to. I think that would help that community, as there is an outstanding issue of most users finding the wiki hard to read or simply not reading it. If you wanted to create an MVP I would be down to advise on that and provide guidance/support if you get stuck, or possibly if you’re in need of funds.
I’d do it myself but my free time is mostly taken up by doing some online community organizing & having a boyfriend who exists in meat space. Would rather support/advise instead. I might also be able to pull some IRL contacts from my workplace (with connections to the trans community) as needed, can’t speak for them though.
It’s a really great idea tbh. It’s well defined & frankly I wish more clients came to us with projects this well defined and with this good of a dataset. Usually it’s not so straightforward. If you decide it’s something you wanna keep working on please keep me in the loop. Otherwise I might circle back to it myself & will definitely ping you.
EDIT: some more notes re: the dataset. Once a working MVP is up using something like the wiki, we could figure out how to increase the scope of the database by e.g. ingesting subreddit data (look up Arctic Shift project on GitHub for ways around the API issue) or, better yet, using Lemmy’s APIs since those are all still open (as far as I understand it—still new to Lemmy). Compiling such a dataset and self-hosting a chat bot like that could be a way to ensure community knowledge lives on even if, say, Lemmy instances go dark as the platform evolves.
can’t say for sure but I know mine are doneskies ✂️🍆