shrimple 🇵🇱 🏳️‍⚧️

shrimple mind. shrimple problems. complex solutions. she/her

Implementing proper natural language grep — approach

Posted on Wednesday, March 11, 2026. By Shrimple. 3 Comments

GrepAI disappointed me: how sad that, amid the AI hype, I once again did not expect a project with “grep” in its name to be just a source-code indexer, one lacking all of:

    1. a straight-to-the-point command-line interface
    2. single-line matching as the primary behavior, with context only secondarily
    3. abstraction, rather than specialization for source code
    4. one-off actions: if there has to be indexing, there should be a way to build the index once without starting a watcher
    5. any composability without deserialization; it is designed with imperative programs, not LLMs, in mind

But their documentation did tell me some useful things, like which good (and quite ethical and lightweight) embedding models to use and how they compare.

ollama serve

ollama pull nomic-embed-text-v2-moe

grepai init

grepai watch --background

grepai search "blah blah"

Found 1 results for: "blah blah" 
 
─── Result 1 (score: 0.2583) ─── 
File: blah.txt:1-2 

   1 │ 
   2 │ 
   3 │ 

(2 lines; 0.7–0.9 s per query, up to ~6 s with warmup, on a decent ultrabook)

And from its codebase, I learnt the steps it takes:

  1. It chunks files into huge blob chunks with 10% overlap, whereas we would do this line by line
  2. For each chunk, it sends one REST request with the chunk as “prompt” in the JSON body, and receives back a vector of float32 as “embedding” in the JSON response
  3. It does a cosine-similarity search, comparing the embedding vector of the query phrase against the vectors of all the indexed chunks
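The steps above can be sketched in plain Python against a local Ollama server. The endpoint and field names (`/api/embeddings`, `"prompt"`, `"embedding"`) are Ollama's embeddings API as the post describes it; the `search` helper over a line-to-embedding mapping is my own sketch, not GrepAI's actual code:

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"
MODEL = "nomic-embed-text-v2-moe"

def embed(text: str) -> list[float]:
    """Ask the local Ollama server for the embedding of one chunk."""
    payload = json.dumps({"model": MODEL, "prompt": text}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query: str, indexed: dict[str, list[float]], k: int = 5):
    """Rank indexed lines (text -> embedding) against the query phrase."""
    q = embed(query)
    ranked = sorted(indexed.items(), key=lambda kv: cosine(q, kv[1]),
                    reverse=True)
    return ranked[:k]
```

Here `indexed` would be built once, by calling `embed` on every line of every file.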

And so I set out to search for a good embedded database solution to keep the embedding–chunk key–value pairs in (I suppose it’s not worth even trying to keep file positions:

  • they may keep shifting,
  • files can be searched conventionally really cheaply,
  • and even if something were to get altered, a cheap Levenshtein fuzzy search, mayhaps fzf, could recover the find).

First I found out there’s sqlite-vec. That would be good, but I dug more.

Then I found a part of Daniel T. Perry’s series on creating an embedding-based Steam search over game descriptions and reviews: Part 3, Querying the Embeddings.

Daniel lays out the options: the Facebook AI Similarity Search library (FAISS), and hnswlib, an implementation of the Hierarchical Navigable Small World algorithm.

  • FAISS failed Daniel because the only way to install it is via Conda.

Idk about FAISS, but HNSW gives you an approximate result. hnswlib is a header-only C++ library with Python bindings, and it seems to produce an index file. I guess I would make a command-line tool to generate an index and to query it…

But Daniel also remarks that searching 10 000 embeddings takes 3 seconds on his machine, and it seems to me his vectors might be larger than mine, judging by his chosen model, Instructor. He only considered a performant index worthwhile when expecting to reach millions of embeddings to search.

There’s a reasonable chance the delays of loopback-interface networking, serialization/deserialization, and model memory allocation are worse than that on my data, and so I’m better off

  • just focusing on memoizing the embeddings of my queries
    • perhaps I could be searching through them too
      • I could try adding a select step to preserve the successful find
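Memoizing query embeddings is just a dictionary keyed by the query string; a minimal sketch, where `embed` stands in for whatever callable does the Ollama round-trip:

```python
_query_cache: dict[str, list[float]] = {}  # query text -> embedding

def cached_embedding(query: str, embed) -> list[float]:
    """Return a memoized embedding, calling `embed` only on a cache miss."""
    if query not in _query_cache:
        _query_cache[query] = embed(query)
    return _query_cache[query]
```

Persisting that dict (pickle, JSON, sqlite) would also give the searchable query history mentioned above.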

But 3 seconds can still be something, my netbook can turn out to be not quite something, and I might turn out too lazy to implement any simpler persisted Approximate Nearest Neighbors index. It might be coolest to just use hnswlib. We’ll see where I get with that. Maybe there will be a next post with some scripting to do all of that.

Also, it seems a warm-up strategy for Ollama will be necessary.
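One warm-up approach: at startup, fire a throwaway embedding request with Ollama's `keep_alive` option, so the model is loaded (and stays resident) before the first real query. A hypothetical sketch; the duration is an assumption:

```python
import json
import urllib.request

def warmup_request(model: str = "nomic-embed-text-v2-moe",
                   keep_alive: str = "30m") -> urllib.request.Request:
    """Build a throwaway embedding request; keep_alive asks Ollama to keep
    the model in memory for that long after answering."""
    payload = json.dumps({"model": model, "prompt": "warmup",
                          "keep_alive": keep_alive}).encode()
    return urllib.request.Request(
        "http://localhost:11434/api/embeddings", data=payload,
        headers={"Content-Type": "application/json"})

# At startup: urllib.request.urlopen(warmup_request())  # fire and forget
```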

Programming Technologies Tags:programming-tips


Comments (3) on “Implementing proper natural language grep — approach”

  1. Pingback: To do and accomplished, of Week 11 of 2026 – shrimple 🇵🇱 🏳️‍⚧️
  2. kamaś says:
    Saturday, April 4, 2026 at 8pm

    Interesting concept, but I’m not convinced by chunking by lines. It may make sense for source code, where individual lines often carry distinct syntactic or semantic meaning, but in natural language text line-splitting is typically just an artifact of local text formatting, so you just get semantically thin chunks and leak formatting quirks into the semantic information layer.

    I’m not an expert in the topic (just starting to work with this), but paragraph-based chunking seems to me like a much more reasonable default, since paragraphs usually correspond to coherent logical units while still remaining compact enough for most downstream tasks. If you really want finer granularity, consider sentence-based chunking instead.

    Reply
    1. Shrimple says:
      Monday, April 6, 2026 at 3pm

      where individual lines often carry distinct syntactic or semantic meaning, but in natural language text line-splitting is typically just an artifact

      The matter is that I actually meant to have a tool to apply to files where individual lines carry distinct semantic meaning — they are to be records, in a Unix way.

      It happens also to be the case with Gemtext, the text markup of the Gemini protocol — and as such is used by the bookmarking–note-taking system of the Offpunk browser. A line is a paragraph there, and links and headings are marked-up lines.

      I actually forgot to include what my initial motivation was, and later I just thought that a line-oriented behavior would simply be neat and very wieldy because of its predictability and composability with existing line-oriented tooling.

      After all, we can even coerce the semantically chunked information into line-oriented format (possibly even annotating it with file seek information, if not relying on the content itself for the purpose, as searching for already recognized text is cheap; compare the ways of Text fragments in URLs).

      However, semantic chunking would be the way to go with a lot, both natural language and things that tree-sitter could be attached to, with sentence meanings akin to how I recall sentences were understood in modes of Emacs and Vim.

      For example, a tool for semantic chunking search could make one just a shell script away from a particular class of web searching scenarios of needing to refine web search results by shallow-crawling each of a pageful of results for presence of a synonymous sentence.

      https://www.kambr.pl/blog/2026-04-05-sgrep-idea/

      I’m definitely gonna wish to be looking through any code that you happen to hit any kind of pre-functional milestone with <3

      Reply
