
I want to write a Node application that checks domains against a set of blacklists stored in plain text files. The files sit under directories representing different categories, and each directory contains a file called domains that lists domain names, one per line.

My application will read these files. When I make an API call with a particular domain, it should look under the relevant category directory, find the domains file, and determine whether the domain is listed in it.

I want to do this efficiently, in terms of both memory and computation. I have considered several approaches.

  1. First I tried loading each whole file into a dictionary, with the domain as the key and a category ID as the value. This was really bad for memory: some of the domains files are as large as 15 MB, and the resulting dictionary takes something like 250 MB, presumably because of the per-entry overhead of a hash table.
  2. I considered keeping only a map of category ID to domains file, then opening the file and searching it on each lookup, with a local cache in case the same domain is requested again. However, this is obviously inefficient from a disk perspective, since every uncached lookup loads the entire file and scans through it.
  3. I tried using Redis to accomplish approach 1, but it was even slower.
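
To make the "local cache" in approach 2 concrete, here is a minimal sketch of a fixed-size cache of lookup results, so that memory stays bounded however many distinct domains are queried. The class name `LruCache` and the `maxEntries` parameter are hypothetical names chosen for illustration. It relies only on the fact that a JavaScript `Map` iterates its keys in insertion order.

```javascript
// Minimal least-recently-used cache built on Map's insertion ordering.
// Hypothetical helper for illustration, not taken from any library.
class LruCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.map = new Map();
  }

  // Returns the cached value, or undefined on a miss.
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert so this key becomes the most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      this.map.delete(this.map.keys().next().value);
    }
  }
}
```

A lookup would check the cache first and fall back to the file search only on a miss, storing the result (listed or not listed) afterwards.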

I am wondering if there is a library that uses indexing: most of the list stays on disk, an index makes it easy to jump to the right spot in the file when a disk lookup is necessary, and only sections of the file are cached in memory rather than the entire thing. It seems like there must be a library or application suited to this. Any ideas?
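
To illustrate the kind of lookup I have in mind, here is a rough sketch that binary-searches a domains file directly on disk, reading only a small window per step. It is a sketch under several assumptions that are not guaranteed by my actual files: the file has been pre-sorted in byte order (for example with `LC_ALL=C sort`), entries are lower-case ASCII, and no line exceeds `MAX_LINE` bytes including its newline. The function name `fileHasDomain` is hypothetical.

```javascript
const fs = require('fs');

// Assumed upper bound on the length of one line in bytes, newline included.
const MAX_LINE = 256;

// Returns true if `domain` appears as a whole line in the sorted file at `path`.
function fileHasDomain(path, domain) {
  const fd = fs.openSync(path, 'r');
  try {
    const size = fs.fstatSync(fd).size;
    const buf = Buffer.alloc(2 * MAX_LINE);
    let lo = 0;    // always the byte offset of a line start
    let hi = size; // exclusive upper bound of the search range
    while (lo < hi) {
      const mid = Math.floor((lo + hi) / 2);
      // Read a window that is wide enough to hold the line containing `mid`.
      const winStart = Math.max(lo, mid - MAX_LINE);
      const n = fs.readSync(fd, buf, 0, buf.length, winStart);
      const rel = mid - winStart;
      // The line starts just after the previous newline (or at the window start).
      const prevNl = rel > 0 ? buf.lastIndexOf(0x0a, rel - 1) : -1;
      const start = prevNl + 1;
      // The line ends at the next newline, or at the end of the data read.
      let end = buf.indexOf(0x0a, rel);
      if (end === -1 || end >= n) end = n;
      const line = buf.toString('utf8', start, end).replace(/\r$/, '');
      if (line === domain) return true;
      if (line < domain) {
        lo = winStart + end + 1; // continue after this line
      } else {
        hi = winStart + start;   // continue before this line
      }
    }
    return false;
  } finally {
    fs.closeSync(fd);
  }
}
```

Each lookup does roughly log2(file size) small reads, so a 15 MB file needs about two dozen reads, and the operating system's page cache would likely keep the frequently touched blocks in memory.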

Thanks

jusschwa
  • This would probably be a lot easier using a simple database, which would manage the memory usage, caching, etc. for you. – jfriend00 Apr 04 '21 at 01:48
  • Most of the databases I have looked at are entirely in-memory. Is there one that stores data on disk but has a fixed-size in-memory cache? I figure most users are going to be hitting a lot of the same domains, so we would likely only need a small subset of the ACLs in memory 90 percent of the time. – jusschwa Apr 05 '21 at 03:04
  • It's a database's job to manage how much is cached in memory and how much is left on disk - you don't have to do that yourself. A popular (for use with node.js) disk-based database that is fairly lightweight is MongoDB. There are several dozen other choices. – jfriend00 Apr 05 '21 at 03:24
  • Thanks for the tip. Looking into it, it looks like SQLite does exactly what I want. – jusschwa Apr 05 '21 at 11:31

0 Answers