ELIAS NEMA - Data Feed #4: Vectors

Weekly Focus: Vectors

Using an open-source vector similarity extension for Postgres¹:

¹ What’s Postgres Got To Do With AI?
Using postgres_scanner² to read vectors from PG in DuckDB.

A detailed paper from SIGMOD’20³ describes how Alibaba designed and built an approximate nearest neighbor (ANN) search extension for vector similarity. It is a great deep dive into page structures for ANN indexes, and the code is even available (though not maintained).

³ PASE: PostgreSQL Ultra-High-Dimensional Approximate Nearest Neighbor Search Extension

Using vector functions in SingleStore’s SQL, though it is not clear how well the system scales, since the example uses only 7,000 vectors.⁴

⁴ Image Matching in SQL with SingleStoreDB
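
To make the scaling concern concrete, here is a minimal sketch of the exact (brute-force) search that a plain similarity function amounts to when no index is involved. It is not SingleStore's implementation; the function names, the 16-dimensional random data, and the choice of cosine similarity are assumptions made for illustration. Every query touches every stored vector, so cost grows linearly with the table size, which is why 7,000 vectors says little about millions.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=3):
    """Score the query against every vector (a full scan) and keep the best k."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical data: 7,000 random 16-dimensional vectors.
random.seed(0)
DIM = 16
vectors = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(7000)]

# Query with a vector that is already stored; it should rank first.
query = vectors[42]
results = top_k(query, vectors, k=3)
print(results)
```

An approximate index trades a little recall for skipping most of that scan.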

An amazing use-case (as far as I’m concerned, data swamps are real): representing individual columns in the embedding space by utilising pre-trained transformer models, then using those vectors to find semantically similar data within your data.⁵

⁵ Build a semantic search engine for tabular columns with Transformers and Amazon OpenSearch Service
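
The overall shape of that idea (one vector per column, then nearest-neighbour lookup between columns) can be sketched without any model at all. The example below is a toy under stated assumptions: it swaps the pre-trained transformer for character trigram counts, so it only captures surface overlap rather than meaning, and the column names, sample values, and helper functions are all hypothetical.

```python
import math
from collections import Counter


def embed_column(name, sample_values, n=3):
    """Toy stand-in for a transformer embedding: character n-gram counts
    over the column name plus a few sample values."""
    text = " ".join([name] + [str(v) for v in sample_values]).lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical columns scattered across a data lake.
columns = {
    "customer_email": ["ann@example.com", "bob@example.org"],
    "contact_mail": ["carol@example.com", "dave@example.net"],
    "order_total": [19.99, 240.0],
}
embeddings = {name: embed_column(name, vals) for name, vals in columns.items()}


def most_similar(target):
    """Return (column name, score) of the closest other column."""
    scores = [
        (other, cosine(embeddings[target], vec))
        for other, vec in embeddings.items()
        if other != target
    ]
    return max(scores, key=lambda pair: pair[1])


print(most_similar("customer_email"))
```

With a real embedding model in place of `embed_column`, the same lookup could also match columns that share meaning but no characters.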

Learning

Use Apache Iceberg in a data lake to support incremental data processing.

Access Amazon Athena in your applications using the WebSocket API.

Guide to bitwise operators in CrateDB.

Grafana Labs webinars: Reduce MTTR, build beautiful Grafana dashboards, and more.

Anomaly detection on Prometheus metrics.

So you want Change Data Capture?

Patterns for enterprise data sharing at scale.

Deep Dive

A new lecture is out from the CMU Advanced Databases course on Parallel Hash Join Algorithms. If you are into data, you should have a very good reason for not watching this playlist.

Business

An amazing (as always) write-up about using TimescaleDB in the wild, and why compression is crucial for time-series databases.⁶

⁶ How Ndustrial Is Providing Fast Real-Time Queries and Safely Storing Client Data With 97 % Compression

How Wiz used Amazon ElastiCache to improve performance and reduce costs.

How Delivery Hero uses Kubecost and Datadog to manage Kubernetes costs in the cloud.

An emerging buzzword for data platforms capable of handling both transactional and analytical workloads: “translytical.” A webinar from SingleStore describes precisely such platforms.⁷

⁷ Webinar Recap: Real-Time Data and the State of Translytical Platforms