shrimple 🇵🇱 🏳️‍⚧️

shrimple mind. shrimple problems. complex solutions. she/her

Implementing proper natural language grep — approach

Posted on Wednesday, March 11, 2026. By Shrimple. 3 Comments

GrepAI disappointed me: how sad that, amid the AI hype, I once again did not expect a project with “grep” in its name to be just a source-code indexer, one lacking all of:

    1. a straight-to-the-point command-line interface
    2. single-line matching as the primary behavior, with context only secondarily
    3. abstraction, rather than specialization for source code
    4. one-off actions: if there has to be indexing, there should be a way to build the index once without starting a watcher
    5. any composability without deserialization; it is designed with imperative programs, not LLMs, in mind

But their documentation did tell me some useful things, like which good (and quite ethical and lightweight) embedding models to use and how they compare.

ollama serve

ollama pull nomic-embed-text-v2-moe

grepai init

grepai watch --background

grepai search "blah blah"

Found 1 results for: "blah blah" 
 
─── Result 1 (score: 0.2583) ─── 
File: blah.txt:1-2 

   1 │ 
   2 │ 
   3 │ 

(2 lines; 0.7–0.9 s per query, up to ~6 s with warmup, on a decent ultrabook)

And from its codebase, I learnt the steps it takes:

  1. It chunks files into huge blob chunks with 10% overlap, whereas we would do this line by line
  2. For each chunk, it sends one REST request with the chunk as “prompt” in the JSON body, and receives back a vector of float32 as “embedding” in the JSON response
  3. It does a cosine-similarity search, comparing the embedding vector of the query phrase against the vectors of all the indexed chunks
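The steps above can be sketched in plain Python against a local Ollama server. The endpoint and field names (`/api/embeddings`, `"prompt"`, `"embedding"`) are Ollama's embeddings API as the post describes it; the `search` helper over a line-to-embedding mapping is my own sketch, not GrepAI's actual code:

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"
MODEL = "nomic-embed-text-v2-moe"

def embed(text: str) -> list[float]:
    """Ask the local Ollama server for the embedding of one chunk."""
    payload = json.dumps({"model": MODEL, "prompt": text}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query: str, indexed: dict[str, list[float]], k: int = 5):
    """Rank indexed lines (text -> embedding) against the query phrase."""
    q = embed(query)
    ranked = sorted(indexed.items(), key=lambda kv: cosine(q, kv[1]),
                    reverse=True)
    return ranked[:k]
```

Here `indexed` would be built once, by calling `embed` on every line of every file.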

And so I set out to search for a good embedded database solution to keep the embedding–chunk key–value pairs in (I suppose it’s not worth even trying to keep file positions:

  • they may keep shifting,
  • files can be searched conventionally really cheaply,
  • and even if something were to get altered, a cheap Levenshtein fuzzy search, mayhaps fzf, could recover the find).

First I found out there’s sqlite-vec. That would be good, but I dug more.

Then I found a part of Daniel T. Perry’s series on creating an embedding-based Steam search over game descriptions and reviews: Part 3, Querying the Embeddings.

Daniel lays out the options: the Facebook AI Similarity Search library (FAISS), and hnswlib, an implementation of the Hierarchical Navigable Small World algorithm.

  • FAISS failed Daniel because the only way to install it is via Conda.

Idk about FAISS, but HNSW gives you an approximate result. hnswlib is a header-only C++ library with Python bindings, and it seems to produce an index file. I guess I would make a command-line tool to generate an index and to query it…

But Daniel also remarks that searching 10 000 embeddings takes 3 seconds on his machine, and it seems to me his vectors might be larger than mine, judging by his chosen model, Instructor. He only considered a performant index worthwhile when expecting to reach millions of embeddings to search.

There’s a reasonable chance the delays of loopback-interface networking, serialization/deserialization, and model memory allocation are worse than that on my data, and so I’m better off

  • just focusing on memoizing the embeddings of my queries
    • perhaps I could be searching through them too
      • I could try adding a select step to preserve the successful find
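Memoizing query embeddings is just a dictionary keyed by the query string; a minimal sketch, where `embed` stands in for whatever callable does the Ollama round-trip:

```python
_query_cache: dict[str, list[float]] = {}  # query text -> embedding

def cached_embedding(query: str, embed) -> list[float]:
    """Return a memoized embedding, calling `embed` only on a cache miss."""
    if query not in _query_cache:
        _query_cache[query] = embed(query)
    return _query_cache[query]
```

Persisting that dict (pickle, JSON, sqlite) would also give the searchable query history mentioned above.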

But 3 seconds can still be something, my netbook can turn out to be not quite something, and I might turn out too lazy to implement any simpler persisted Approximate Nearest Neighbors index. It might be coolest to just use hnswlib. We’ll see where I get with that. Maybe there will be a next post with some scripting to do all of that.

Also, it seems a warm-up strategy for Ollama will be necessary.
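One warm-up approach: at startup, fire a throwaway embedding request with Ollama's `keep_alive` option, so the model is loaded (and stays resident) before the first real query. A hypothetical sketch; the duration is an assumption:

```python
import json
import urllib.request

def warmup_request(model: str = "nomic-embed-text-v2-moe",
                   keep_alive: str = "30m") -> urllib.request.Request:
    """Build a throwaway embedding request; keep_alive asks Ollama to keep
    the model in memory for that long after answering."""
    payload = json.dumps({"model": model, "prompt": "warmup",
                          "keep_alive": keep_alive}).encode()
    return urllib.request.Request(
        "http://localhost:11434/api/embeddings", data=payload,
        headers={"Content-Type": "application/json"})

# At startup: urllib.request.urlopen(warmup_request())  # fire and forget
```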

Programming Technologies Tags:programming-tips


Comments (3) on “Implementing proper natural language grep — approach”

  1. Pingback: To do and accomplished, of Week 11 of 2026 – shrimple 🇵🇱 🏳️‍⚧️
  2. kamaś says:
    Saturday, April 4, 2026 at 8pm

    Interesting concept, but I’m not convinced by chunking by lines. It may make sense for source code, where individual lines often carry distinct syntactic or semantic meaning, but in natural language text line-splitting is typically just an artifact of local text formatting, so you just get semantically thin chunks and leak formatting quirks into the semantic information layer.

    I’m not an expert in the topic (just starting to work with this), but paragraph-based chunking seems to me like a much more reasonable default, since paragraphs usually correspond to coherent logical units while still remaining compact enough for most downstream tasks. If you really want finer granularity, consider sentence-based chunking instead.

    Reply
    1. Shrimple says:
      Monday, April 6, 2026 at 3pm

      where individual lines often carry distinct syntactic or semantic meaning, but in natural language text line-splitting is typically just an artifact

      The matter is that I actually meant to have a tool to apply to files where individual lines carry distinct semantic meaning — they are to be records, in a Unix way.

      It happens also to be the case with Gemtext, the text markup of the Gemini protocol — and as such is used by the bookmarking–note-taking system of the Offpunk browser. A line is a paragraph there, and links and headings are marked-up lines.

      I actually forgot to include what my initial motivation was, and later I just thought that a line-oriented behavior would simply be neat and very wieldy because of its predictability and composability with existing line-oriented tooling.

      After all, we can even coerce the semantically chunked information into line-oriented format (possibly even annotating it with file seek information, if not relying on the content itself for the purpose, as searching for already recognized text is cheap; compare the ways of Text fragments in URLs).

      However, semantic chunking would be the way to go with a lot, both natural language and things that tree-sitter could be attached to, with sentence meanings akin to how I recall sentences were understood in modes of Emacs and Vim.

      For example, a tool for semantic chunking search could make one just a shell script away from a particular class of web searching scenarios of needing to refine web search results by shallow-crawling each of a pageful of results for presence of a synonymous sentence.

      https://www.kambr.pl/blog/2026-04-05-sgrep-idea/

      I’m definitely gonna wish to be looking through any code that you happen to hit any kind of pre-functional milestone with <3

      Reply
