

After reading some of the other comments, I’m definitely going to separate the systems. I’ll use something like json or yaml as the output for the raw scraped data, and some sort of database for the final program.
That’s an interesting read. I’ll definitely give json a try too.
Glad I could brighten up your day!
That’s good to know.
Gonna be honest, I’ll need to research a bit more what validating against a schema is, but I get the general idea, and I like it.
For initial testing and prototypes, I probably won’t worry about validation, but once I get to the point of refining the system, validation like that would be a good idea.
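A rough sketch of what that kind of check could look like, using only plain Python rather than a real schema library. The field names (`title`, `text`, `tags`) and the `validation_errors` helper are assumptions made up for illustration, not anything the scraper actually defines yet.

```python
# Hypothetical record shape: field name -> expected Python type.
REQUIRED_FIELDS = {"title": str, "text": str, "tags": list}

def validation_errors(record):
    """Return a list of problems found in one scraped record (empty if none)."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field} should be {expected.__name__}")
    return errors

# A record with a wrong type and a missing field.
problems = validation_errors({"title": "The Tower", "text": 5})
```

A dedicated schema tool would cover more cases (nested fields, allowed values), but even a check like this would catch a scraper that silently starts dropping fields.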
One concern I’m seeing from other comments is that I may have more data than SQLite is ideal for. I have thousands of stories (My estimate is between 10 and 40 thousand), and many of the stories can be several pages long.
Gotcha. I think I’m aiming for something that runs off a single program. I want to be able to start it up whenever or even transfer it to a drive and use it on something like my laptop. Your idea sounds like it may work, but I’ll have to give it a deeper look.
I’m not entirely sure yet, but probably yes to both. The story text will likely stay unchanged, but I’ll likely experiment with various ways to analyze the stories.
The main idea I want to try is assigning stories “likely tags” based on the frequency of keywords. So castle and sword could indicate fantasy while robot and ship could indicate sci-fi. There are a lot of stories missing tags, so something like this would be helpful.
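Something like this is what I have in mind, as a minimal sketch. The keyword lists, the `likely_tags` name, and the `min_hits` threshold are all placeholders for illustration, not a tested vocabulary.

```python
import re
from collections import Counter

# Hypothetical keyword lists; a real version would need far more words per tag.
TAG_KEYWORDS = {
    "fantasy": {"castle", "sword"},
    "sci-fi": {"robot", "ship"},
}

def likely_tags(text, min_hits=2):
    """Return tags whose keywords occur at least min_hits times, most hits first."""
    # Count whole lowercase words, so "ship" does not match inside "spaceship".
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    scores = {
        tag: sum(words[w] for w in keywords)
        for tag, keywords in TAG_KEYWORDS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [tag for tag, hits in ranked if hits >= min_hits]

tags = likely_tags("The knight drew his sword at the castle gate.")
```

Because it matches exact words, plurals like "swords" would be missed unless they are added to the lists or the words are stemmed first.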
What’s your reasoning for that?
At this point, I think I’ll only use yaml as the scraper output and then create a database tool to convert that into whatever data format I end up using.
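The converter step could look roughly like the sketch below. It uses JSON and SQLite only because both ship with Python; for yaml, PyYAML's `yaml.safe_load` would take the place of `json.loads`. The table layout, the `title`/`text` field names, and the `load_stories` helper are assumptions for illustration.

```python
import json
import sqlite3

def load_stories(raw_json, conn):
    """Insert scraped story records (a JSON list of objects) into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stories ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, body TEXT NOT NULL)"
    )
    records = json.loads(raw_json)
    # Named placeholders pull the matching keys out of each record dict.
    conn.executemany(
        "INSERT INTO stories (title, body) VALUES (:title, :text)",
        records,
    )
    conn.commit()
    return len(records)

# In-memory database for the example; pass a file path to keep the data on disk.
conn = sqlite3.connect(":memory:")
raw = json.dumps([{"title": "The Tower", "text": "Once upon a time..."}])
inserted = load_stories(raw, conn)
```

Keeping the scraper output and the database separate like this means the database can be rebuilt from the raw files whenever the format changes.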
A few keywords in there I’ll have to look up, but I get the majority of it.
Yeah, I’m not too sure yet how complex the tags will be in the end. They are basically genres at the start, but I may make them more complex as I go.
After reading some of the other comments, I doubt I’ll use yaml as the main storage method. I do like the idea of using yaml for the scraper output though. Would give me a nice way to organize the data elements for each story in a way that can be easily read when needed.
Is this something that can be run locally without a server? I’m aiming for something as simple as opening the notes app on your phone and selecting a story.
That’s a good idea! Would yaml be alright for this too? I like the readability and Python-style syntax compared to json.
Did not know that. I’ll keep that in mind.
Don’t know the limits of yaml, especially for large chunks of data, but I do like its easy readability and similarity to Python. I’ll probably try out a bit of yaml as well as some of the other recommendations others have given me.
I do like the sound of that.
I’m not too worried about performance, since, once everything is running, most of the operations will only be run every few weeks or so. Don’t want it slowing to a crawl for sure though.
The text search looks promising. I’ve had the idea of automating “likely tags” that look for keywords (sword = fantasy while spaceship = sci-fi). It’s not perfect, but it could be useful to roughly categorize all the stories that are missing tags.
Couldn’t even bother to include a link? Lol
From Wikipedia, the scandal is described as “a political scandal in the United States that occurred during the second term of the Reagan administration. Between 1981 and 1986, senior administration officials secretly facilitated the illegal sale of arms to Iran, which was subject to an arms embargo at the time. The administration hoped to use the proceeds of the arms sale to fund the Contras, an anti-Sandinista rebel group in Nicaragua. Under the Boland Amendment, further funding of the Contras by legislative appropriations was prohibited by Congress, but the Reagan administration figured out a loophole by secretively using non-appropriated funds instead.”
I’ll give it a look. I’m still in the early stages of the project, so it’ll be a bit before I get to the point where I work on the database side of things.