Claude-code-plugins-plus-skills langchain-data-handling
install

source · Clone the upstream repo

```bash
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
```

Claude Code · Install into ~/.claude/skills/

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/langchain-pack/skills/langchain-data-handling" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-langchain-data-handling && rm -rf "$T"
```
manifest: plugins/saas-packs/langchain-pack/skills/langchain-data-handling/SKILL.md
LangChain Data Handling: RAG & Document Processing
Overview
Build Retrieval-Augmented Generation (RAG) pipelines: load documents, split into chunks, embed with OpenAI/Cohere, store in vector databases (FAISS, Chroma, Pinecone), and query with retrieval chains.
Prerequisites
- `@langchain/core` and `@langchain/openai` installed
- For vector stores: `npm install @langchain/community` (FAISS) or `npm install @langchain/pinecone @pinecone-database/pinecone`
Step 1: Document Loaders
```typescript
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";

// Load a single file
const textDocs = await new TextLoader("./data/readme.md").load();
const pdfDocs = await new PDFLoader("./data/manual.pdf").load();

// Load entire directory with type-based routing
const dirLoader = new DirectoryLoader("./data/", {
  ".txt": (path) => new TextLoader(path),
  ".pdf": (path) => new PDFLoader(path),
  ".csv": (path) => new CSVLoader(path),
});
const allDocs = await dirLoader.load();
console.log(`Loaded ${allDocs.length} documents`);
```
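If you want every chunk to carry pipeline-level metadata (for filtering at query time), you can tag the loaded documents before splitting. A minimal sketch; the `collection` and `loadedAt` keys are arbitrary example fields, not part of any loader API:

```typescript
// Tag each loaded document with extra metadata before splitting.
// "collection" and "loadedAt" are example fields chosen here, not loader output.
for (const doc of allDocs) {
  doc.metadata.collection = "product-docs";
  doc.metadata.loadedAt = new Date().toISOString();
}
```

The splitter copies metadata onto every chunk it produces, so these tags survive all the way into the vector store.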
Step 2: Text Splitting
```typescript
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000, // max chars per chunk
  chunkOverlap: 200, // overlap between chunks for context continuity
  separators: ["\n\n", "\n", ". ", " ", ""], // split priority
});

const chunks = await splitter.splitDocuments(allDocs);
console.log(`Split into ${chunks.length} chunks`);

// Each chunk preserves metadata from the source document
// chunk.pageContent = "text content"
// chunk.metadata = { source: "./data/readme.md", loc: { lines: { from: 1, to: 20 } } }
```
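The splitter also works on raw strings when no loader is involved. A small sketch assuming the same `splitter` as above; `createDocuments` wraps each chunk in a Document and merges in the metadata you pass:

```typescript
// Split a raw string (no loader); createDocuments attaches the given metadata
// to every resulting chunk. The "inline" source value is just an example.
const rawText = "Long article text that you already have in memory ...";
const rawChunks = await splitter.createDocuments([rawText], [{ source: "inline" }]);
console.log(rawChunks[0].metadata); // { source: "inline", loc: { ... } }
```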
Step 3: Embeddings
```typescript
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-small", // $0.02/1M tokens, 1536 dims
  // model: "text-embedding-3-large", // $0.13/1M tokens, 3072 dims
});

// Embed a single query
const queryVector = await embeddings.embedQuery("What is LCEL?");
console.log(`Vector dimensions: ${queryVector.length}`); // 1536

// Embed multiple documents
const docVectors = await embeddings.embedDocuments([
  "LCEL is LangChain Expression Language",
  "Runnables are composable components",
]);
```
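To sanity-check embeddings without spinning up a vector store, you can compare vectors directly. This is plain arithmetic over the arrays returned above, not a LangChain API:

```typescript
// Cosine similarity between the query vector and each document vector.
// Higher is more similar; useful for eyeballing whether embeddings behave as expected.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

docVectors.forEach((vec, i) => {
  console.log(`doc ${i}: similarity ${cosineSimilarity(queryVector, vec).toFixed(3)}`);
});
```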
Step 4: Vector Store (FAISS - Local)
```typescript
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });

// Create from documents
const vectorStore = await FaissStore.fromDocuments(chunks, embeddings);

// Save to disk for reuse
await vectorStore.save("./faiss-index");

// Load from disk
const loaded = await FaissStore.load("./faiss-index", embeddings);

// Similarity search
const results = await loaded.similaritySearch("How do agents work?", 3);
results.forEach((doc) => {
  console.log(`[${doc.metadata.source}] ${doc.pageContent.slice(0, 100)}...`);
});
```
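When you care about match quality and not just the documents, `similaritySearchWithScore` returns document/score pairs so weak matches can be filtered out before prompting. The score semantics depend on the store; for FAISS it is typically a distance, so lower means closer. A short sketch against the `loaded` store above:

```typescript
// Retrieve documents together with their raw scores.
const scored = await loaded.similaritySearchWithScore("How do agents work?", 3);
for (const [doc, score] of scored) {
  console.log(`${score.toFixed(3)}  ${doc.metadata.source}`);
}
```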
Step 5: Vector Store (Pinecone - Cloud)
```typescript
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";
import { OpenAIEmbeddings } from "@langchain/openai";

const pinecone = new Pinecone(); // reads PINECONE_API_KEY from env
const index = pinecone.Index("my-index");
const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });

// Upsert documents
const vectorStore = await PineconeStore.fromDocuments(chunks, embeddings, {
  pineconeIndex: index,
  namespace: "docs-v1",
});

// Query with metadata filtering
const results = await vectorStore.similaritySearch("deployment guide", 5, {
  source: { $eq: "manual.pdf" },
});
```
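To add new documents to an index that already exists (rather than re-upserting everything with `fromDocuments`), connect with `fromExistingIndex` and call `addDocuments`. A sketch; `newChunks` is a placeholder for whatever freshly split chunks you have produced:

```typescript
// Connect to the existing index/namespace and append new chunks.
// "newChunks" is a hypothetical Document[] from a later ingestion run.
const pineconeStore = await PineconeStore.fromExistingIndex(embeddings, {
  pineconeIndex: index,
  namespace: "docs-v1",
});
await pineconeStore.addDocuments(newChunks);
```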
Step 6: RAG Chain (Full Pipeline)
```typescript
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnablePassthrough, RunnableSequence } from "@langchain/core/runnables";

const model = new ChatOpenAI({ model: "gpt-4o-mini" });
const retriever = vectorStore.asRetriever({ k: 4 });

const ragPrompt = ChatPromptTemplate.fromTemplate(`
Answer the question based only on the following context.
If the answer is not in the context, say "I don't have that information."

Context:
{context}

Question: {question}

Answer:`);

// Format retrieved docs into a single string
function formatDocs(docs: any[]) {
  return docs.map((d) => d.pageContent).join("\n\n");
}

const ragChain = RunnableSequence.from([
  {
    context: retriever.pipe(formatDocs),
    question: new RunnablePassthrough(),
  },
  ragPrompt,
  model,
  new StringOutputParser(),
]);

const answer = await ragChain.invoke("How do I deploy to production?");
console.log(answer);
```
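Because the chain ends in `StringOutputParser`, you can also stream the answer token by token instead of waiting for the full string; any runnable sequence exposes `.stream()`:

```typescript
// Stream the RAG answer; each chunk is a plain string fragment.
const stream = await ragChain.stream("How do I deploy to production?");
for await (const chunk of stream) {
  process.stdout.write(chunk);
}
```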
Python RAG Equivalent
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load + split
docs = TextLoader("./data/readme.md").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)

# Embed + store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
store = FAISS.from_documents(chunks, embeddings)
retriever = store.as_retriever(search_kwargs={"k": 4})

# RAG chain
prompt = ChatPromptTemplate.from_template("Context:\n{context}\n\nQuestion: {question}")
chain = (
    {"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
     "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)
answer = chain.invoke("How do I deploy?")
```
Error Handling
| Error | Cause | Fix |
|---|---|---|
| FAISS index fails to load | Index not saved / wrong path | Call `vectorStore.save()` first |
| Vector dimension mismatch | Embedding model changed | Rebuild the index with the same model |
| Pinecone index not found | Index doesn't exist | Create the index in the Pinecone console |
| Empty retrieval results | Chunks too large or query too vague | Reduce `chunkSize`, improve the query (see the check below) |
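A lightweight way to catch the empty-retrieval case before it produces an unhelpful answer is to inspect what the retriever returns first. A sketch assuming the `retriever` and `ragChain` from Step 6 (in recent LangChain versions retrievers are Runnables, so `.invoke()` returns the matched documents):

```typescript
// Check retrieval results before running the full RAG chain.
const question = "How do I deploy to production?";
const hits = await retriever.invoke(question);
if (hits.length === 0) {
  console.warn("No relevant chunks found; try a smaller chunkSize or a more specific query.");
} else {
  console.log(await ragChain.invoke(question));
}
```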
Next Steps
Use `langchain-security-basics` for securing your RAG pipeline.