Ruby on Rails Neighbor Gem for AI Embeddings
Over the past 12 months, AI has taken over budgets and initiatives. Postgres is a popular store for AI embedding data because it can store, calculate, optimize, and scale using the pgvector extension. A recently introduced gem to the Ruby on Rails ecosystem, the neighbor gem, makes working with pgvector and Rails even better.
Background on AI in Postgres
An “embedding” is a set of floating point values that represent the
characteristics of a thing (nothing new, we’ve had these since the 70s). Using
the OpenAI API or any of their competitors, you can send over blocks of text,
images, and PDFs, and OpenAI will return an embedding with 1536 values
representing those characteristics. With the pgvector extension, you can store
that embedding in a vector column type on Postgres. Then, using nearest neighbor
calculations, you can find the most similar objects. For a deeper review of
AI with Postgres, see my previous posts in this series.
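To make the nearest neighbor idea concrete, here is a minimal pure-Ruby sketch of the calculation. The three-value vectors and recipe names are made up for illustration (real embeddings have far more values), and pgvector performs this work inside Postgres rather than in application code.

```ruby
# Hypothetical, tiny "embeddings" for illustration only.
EMBEDDINGS = {
  "pancakes" => [0.9, 0.1, 0.0],
  "waffles"  => [0.8, 0.2, 0.1],
  "chili"    => [0.1, 0.9, 0.7]
}.freeze

# Euclidean (L2) distance between two equal-length vectors.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Name of the stored item closest to the named item, excluding the item itself.
def nearest_neighbor(name, embeddings = EMBEDDINGS)
  query = embeddings.fetch(name)
  embeddings
    .reject { |other, _vector| other == name }
    .min_by { |_other, vector| euclidean_distance(query, vector) }
    .first
end

puts nearest_neighbor("pancakes") # prints "waffles"
```

A linear scan like this compares the query against every stored vector, which is why indexes matter once a table grows.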
The neighbor gem
By default, Ruby on Rails does not know about the "vector" data type. If you've used Ruby on Rails + Postgres + pgvector, you've probably written SQL queries in your migrations, and implemented some other janky-code. The neighbor gem will remove the janky-code, and take you back to a native ActiveRecord experience.
At a minimum, all you have to do is add the following to your Gemfile:
gem 'neighbor'
Side note: I can't overstate the impact Andrew Kane has had on embedding data in Postgres. He's also making it easy for developers to use those vector data types with Ruby on Rails and Node.
Fixed schema dump
The biggest risk of not using Neighbor is that ActiveRecord will create an
incomplete db/schema.rb file. Because ActiveRecord does not understand the
vector data type, running rails db:schema:dump does not fail. Instead, it omits
any table with that data type and leaves this error in your db/schema.rb:
# Could not dump table "recipe_embeddings" because of following StandardError
# Unknown type 'vector(1536)' for column 'embedding'
With Neighbor, you'll get a fully-functional schema like the following:
create_table "recipe_embeddings", primary_key: "recipe_id", id: :bigint, default: nil, force: :cascade do |t|
  t.vector "embedding", limit: 1536, null: false
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
  t.index ["embedding"], name: "recipe_embeddings_embedding", opclass: :vector_l2_ops, using: :hnsw
  t.index ["recipe_id"], name: "index_recipe_embeddings_on_recipe_id"
end
Notice that Neighbor also understands the
[hnsw index type](https://www.crunchydata.com/blog/hnsw-indexes-with-postgres-and-pgvector)
released with pgvector 0.5.
Side note: for projects that go all-in on Postgres, I opt to use the
following to dump to a db/structure.sql:
SCHEMA_FORMAT=sql rails db:schema:dump
Easier migrations + data type handling
Without Neighbor, ActiveRecord is not informed of the vector type, so
migrations have to fall back to raw SQL. Your migrations matter just as much
as your db/schema.rb file. With Neighbor, a typical migration would look
something like the following:
create_table :recipe_embeddings, primary_key: [:recipe_id] do |t|
  t.references :recipe, null: false, foreign_key: true
  t.vector :embedding, limit: 1536, null: false
  t.timestamps
end
Additionally, you get improved handling of the vector data type. Without
Neighbor, working with embedding data required to_s to manipulate the values
when inserting into Postgres. But, with Neighbor, it simplifies to a native
process:
RecipeEmbedding.create!(recipe_id: Recipe.last.id, embedding: [-0.078427136, 0.0014401458, ...])
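For contrast, the manual approach amounted to converting between Ruby arrays and pgvector's bracketed text form by hand. Here is a rough sketch of that conversion. The helper names `to_vector_literal` and `from_vector_literal` are hypothetical and are not part of Neighbor or pgvector.

```ruby
# Hypothetical helpers showing the kind of manual conversion Neighbor replaces.

# Ruby array -> text literal such as "[0.1,0.2,0.3]"
def to_vector_literal(values)
  "[#{values.map { |v| Float(v) }.join(',')}]"
end

# Text literal -> Ruby array of floats
def from_vector_literal(text)
  text.delete("[]").split(",").map { |v| Float(v.strip) }
end

literal = to_vector_literal([-0.078427136, 0.0014401458, 0.5])
puts literal
puts from_vector_literal(literal).inspect
```

Code like this tends to get copied into every model and query that touches an embedding, which is the janky part the gem takes off your hands.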
But, wait! There's more …
The nearest_neighbors method
After you add the embedding column to a table, you can use has_neighbors to
define your nearest neighbor queries:
class RecipeEmbedding < ApplicationRecord
  has_neighbors :embedding
end
Then, you can find the nearest neighbors like so:
recipe_embedding.nearest_neighbors(:embedding, distance: "euclidean").first
The distance calculations include euclidean and cosine.
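The two distances can rank results differently, so it helps to see what each one measures. This is a small pure-Ruby sketch with made-up three-value vectors. It illustrates the math only and is not how Neighbor or pgvector implement it.

```ruby
# Euclidean (L2) distance: straight-line distance between the two points.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Cosine distance: 1 minus the cosine of the angle between the vectors.
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  1.0 - dot / (norm.call(a) * norm.call(b))
end

short = [1.0, 2.0, 3.0]
long  = [2.0, 4.0, 6.0] # same direction as short, twice the length

puts euclidean_distance(short, long) # sqrt(14), roughly 3.74
puts cosine_distance(short, long)    # approximately 0
```

Euclidean distance treats the longer vector as far away, while cosine distance ignores length and only compares direction.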
Conclusion
Launching a project to use embeddings with Ruby on Rails?
Step 1: use the neighbor gem
Step 2: provision your database on Crunchy Bridge with pgvector
Step 3: profit