Lunr.js - Simple full-text search in your browser (github.com/olivernn)
81 points by jchapron on March 4, 2013 | 29 comments


I see you are using my stemmer implementation ... snowball and porter2 are better - git clone that instead.

In fact, if you had poked around, you probably could have snagged about 90% of this code from various projects ... too bad I didn't put it together like you did.

Ah well ... internet fame points to you I guess.


I'm not the author, I just stumbled on the project and found it interesting. I guess you could suggest it to him on Twitter: @olivernn


There's a detailed write-up from the author about how it all works at: http://blog.new-bamboo.co.uk/2013/02/26/full-text-search-in-...


There's also a website with docs and an example by the author here: http://lunrjs.com/


As for the server side, reds is a simple full-text search module for Node.js. https://github.com/visionmedia/reds


How does this compare to http://reyesr.github.com/fullproof/?


I created a small Jekyll plugin to add full-text search using lunr.js for the generated, static sites.

https://github.com/slashdotdash/jekyll-lunr-js-search


I think I'll put this to use in an internal admin/support system in need of search. All the data is on the client already (AngularJS), and it's fewer than a thousand docs for now.


I am sure there are some applications that might need this due to obfuscation of data from the users... But doesn't the browser already have full text search? (Control-F)?


Full text search usually presumes an index, for a lot of functional differences compared to the browser's naive substring-matching Ctrl-F. And any proper search index is going to be a better user experience than naive string matches.

I haven't read through all of Lunr's docs and source, but based on my Solr/Elasticsearch experience, I'd expect to see (in time)…

Tokenization and (presumably) term normalization/analysis; a faster and smarter query language, for term order independence and boolean combinations of clauses; relevance scores and maybe even score boosting per field.

Better queryability really shouldn't be understated here. Just having term order independence focused on a specific set of JSON is going to be way better than naively matching any substring on the entire rendered page.


That is almost exactly what lunr is doing. It tokenises the input text, stems the tokens and filters out any stop words. The index can then be searched, and the order of the search terms is not relevant. A prefix search is currently used so that you can find documents containing a term without having to type the whole term exactly. The matching documents are also scored by how relevant they are to the search terms.

In the future I want to add even more powerful querying, restricting search to specific fields, taking into account the distance between terms, and adding faceted search to reduce the total documents being searched over.

One of the original goals of the project was specifically to provide a better alternative to just using the browser's built-in find-in-page functionality.
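
For anyone who wants to see the shape of that pipeline, here is a minimal sketch in TypeScript. It is not lunr's code or API: the names (`tokenise`, `buildToyIndex`, `searchToyIndex`), the tiny stop-word list, and the scoring rule (summing term occurrences) are all assumptions made for illustration, and stemming is left out entirely.

```typescript
// Toy inverted index. NOT lunr's implementation, only the general technique.

// Hypothetical, deliberately tiny stop-word list.
const STOP_WORDS: string[] = ["a", "an", "and", "in", "is", "of", "the", "to"];

interface ToyDoc {
  id: string;
  body: string;
}

// term -> (document id -> number of occurrences of the term in that document)
type ToyIndex = Record<string, Record<string, number>>;

// Lowercase, split on non-alphanumerics, drop empty strings and stop words.
function tokenise(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0 && STOP_WORDS.indexOf(t) === -1);
}

function buildToyIndex(docs: ToyDoc[]): ToyIndex {
  const index: ToyIndex = Object.create(null);
  for (const doc of docs) {
    for (const term of tokenise(doc.body)) {
      if (!index[term]) index[term] = Object.create(null);
      index[term][doc.id] = (index[term][doc.id] || 0) + 1;
    }
  }
  return index;
}

// Every query token must prefix-match at least one term in a document.
// Score (an assumption): total occurrences of all matching terms.
function searchToyIndex(index: ToyIndex, query: string): { id: string; score: number }[] {
  const queryTokens = tokenise(query);
  if (queryTokens.length === 0) return [];
  const totals: Record<string, number> = Object.create(null);
  const matched: Record<string, number> = Object.create(null); // query tokens matched per doc

  for (const token of queryTokens) {
    const perDoc: Record<string, number> = Object.create(null);
    for (const term of Object.keys(index)) {
      if (term.indexOf(token) !== 0) continue; // keep only terms starting with the token
      for (const id of Object.keys(index[term])) {
        perDoc[id] = (perDoc[id] || 0) + index[term][id];
      }
    }
    for (const id of Object.keys(perDoc)) {
      totals[id] = (totals[id] || 0) + perDoc[id];
      matched[id] = (matched[id] || 0) + 1;
    }
  }

  return Object.keys(totals)
    .filter((id) => matched[id] === queryTokens.length)
    .map((id) => ({ id, score: totals[id] }))
    .sort((x, y) => y.score - x.score || (x.id < y.id ? -1 : 1));
}

const demoIndex = buildToyIndex([
  { id: "a", body: "Full-text search in the browser" },
  { id: "b", body: "Searching and indexing: search engines index documents" },
  { id: "c", body: "Stemming reduces words to a root" },
]);

// "sear" prefix-matches both "search" and "searching"; term order in a
// multi-word query does not matter.
console.log(searchToyIndex(demoIndex, "sear"));
console.log(searchToyIndex(demoIndex, "browser full"));
```

A real index would add stemming and a better relevance measure, but the lookup structure (term to documents) is the part that find-in-page lacks.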


What browsers do cannot really be called full-text search.

For example, the ability to search for a paragraph that contains two non-contiguous words would be very useful, but no browser (that I know of) is able to return elements that contain a set of tokens.

All browsers do is return exact matches from a string, with no concept of words.
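
To make that difference concrete, here is a small TypeScript sketch contrasting substring matching with token-set matching. The function names and sample paragraphs are made up for illustration, and the word splitting is a crude assumption (lowercase, split on anything that is not a letter or digit).

```typescript
// Split text into lowercase word tokens.
function wordsOf(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0);
}

// Substring matching: roughly what find-in-page offers.
function substringMatch(paragraph: string, query: string): boolean {
  return paragraph.toLowerCase().indexOf(query.toLowerCase()) !== -1;
}

// Token-set matching: every query word must appear somewhere, in any order.
function containsAllWords(paragraph: string, query: string): boolean {
  const present = wordsOf(paragraph);
  const wanted = wordsOf(query);
  return wanted.length > 0 && wanted.every((w) => present.indexOf(w) !== -1);
}

const sampleParagraphs: string[] = [
  "The index is rebuilt on every page load.",
  "Load the page, then query the index.",
  "Stemming happens before indexing.",
];

// The first two paragraphs contain both words, though never adjacent,
// so a substring search for "index load" finds neither of them.
const tokenHits = sampleParagraphs.filter((p) => containsAllWords(p, "index load"));
const substringHits = sampleParagraphs.filter((p) => substringMatch(p, "index load"));
console.log(tokenHits.length, substringHits.length);
```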

It would be interesting to know if in this solution the index can be persisted to file or if it has to be rebuilt every time?


Supposedly you can persist it in the browser with HTML5 localStorage or IndexedDB.


Relevance ranking. Not sure how valuable it is at this scale, but it seems worth a try.


> A browser is required for running the tests.

Why? This is a red flag.


Why is it a red flag for a browser-based javascript library to require a browser for testing?


Continuous integration usually runs JavaScript tests in browserless environments.


While I agree with downstream comments that you can run tests headless or browserless, arguably, you're not really testing the user experience until you execute it the way that the user will execute it. Perhaps this is a case of perfect vs. good enough.


I run mine in a PhantomJS environment, which works just fine for headless browser testing.


Uh, platforms like Node can easily simulate the browser, despite being 'browserless'.


Any stats on the limitations?

I'm wondering how efficient this would be, given that indexing a lot of data via JavaScript might not be a good idea.


The example (http://lunrjs.com/example/) indexes 100 stackoverflow questions, some of which are relatively long.

If indexing performance starts to become an issue the whole search index can be moved into a web-worker, which prevents indexing from blocking the rest of the page.


Maybe you can create the index on the server and then just load it on the client?

edit: looking at the docs, it's unclear if that's possible. I guess the index should be a JS object, so it should be pretty simple to save it to disk and then fetch it from the client.


Since the library can be run outside of the browser (using node.js for example) the index could be generated server side, and then just passed to the client. I hadn't considered this before but it might be worth looking at.


Depending on the performance of this, it might be awesome to have some serialization format (i.e. inverted, normalized, tokenized JSON).
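
As a rough illustration of the idea in this subthread (build once, ship the result), here is a TypeScript sketch that serialises a toy inverted index to JSON and restores it. It does not use lunr and says nothing about lunr's own index format; the `Postings` shape, the function names and the sample documents are assumptions. The same JSON string could just as well be written to a file on the server or kept in localStorage.

```typescript
// term -> ids of the documents that contain it (a plain object, so it is JSON-safe)
type Postings = Record<string, string[]>;

interface Page {
  id: string;
  text: string;
}

const hasOwn = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

function buildPostings(pages: Page[]): Postings {
  const postings: Postings = {};
  for (const page of pages) {
    for (const term of page.text.toLowerCase().split(/[^a-z0-9]+/)) {
      if (term.length === 0) continue;
      if (!hasOwn(postings, term)) postings[term] = [];
      if (postings[term].indexOf(page.id) === -1) postings[term].push(page.id);
    }
  }
  return postings;
}

function lookup(postings: Postings, term: string): string[] {
  const key = term.toLowerCase();
  return hasOwn(postings, key) ? postings[key] : [];
}

// "Server" side: build the index once and serialise it.
const serverPostings = buildPostings([
  { id: "q1", text: "How do I persist a search index?" },
  { id: "q2", text: "Search index size in localStorage" },
]);
const payload: string = JSON.stringify(serverPostings);

// "Client" side: restore the index without re-tokenising any document.
const clientPostings: Postings = JSON.parse(payload);
console.log(lookup(clientPostings, "index"));
console.log(lookup(clientPostings, "persist"));
```

The trade-off is payload size: the serialised index can be larger than the raw documents, so whether this beats indexing on the client depends on the data.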


How do you index your pages? Is that a manual process of creating a JSON file to be read?


This is amazing, and perfect timing for me. Thanks!!


[deleted]


Wrong thread?


Whoops! Very sorry about this.



