Dense Passage Retrieval of 10-k Reports with Haystack
An Account Executives Approach to Cold Calling
Ben Riall works for Adobe as an executive account manager. He has started and successfully excited a company, has written a book, and currently creates an online course.
With thirty big accounts a year, he is very very careful on who to engage with. And he makes sure he know everything. Therefore, he reads business and annual reports (10-k), and looks for certain aspects such as: ’Does the target company grow in digital experiences (or is this a strategic initiative)?’, because then he has a reason to contact someone in that enterprise. In such a target company there are 10 people that are relevant to contact.
Yet, Ben believes in cold calling. While mail and LinkedIn messages don’t really work since they are too impersonal. He would only benefit from a tool that could find key insights for him in business plans or on the LinkedIn profiles or answer certain questions. So let us tackling his situation!
Questions he would ask are:
- Are there digital experiences are they doing that ?
- What are the strategic priorities?
- Is the company growing ?
- Do they have cash or not?
- Is there things I have not thought about yet?
So how can we parse and get answers to these questions?
Better Retrieval via “Dense Passage Retrieval”
The Retriever has a huge impact on the performance of our overall search pipeline. It can be sparse or dense.
Sparse
Family of algorithms based on counting the occurrences of words (bag-of-words) resulting in very sparse vectors with length = vocab size.
Examples: BM25, TF-IDF
Pros: Simple, fast, well explainable
Cons: Relies on exact keyword matches between query and text
Dense
These retrievers use neural network models to create “dense” embedding vectors. Within this family there are two different approaches:
a) Single encoder: Use a single model to embed both query and passage.
b) Dual-encoder: Use two models, one to embed the query and one to embed the passage
Recent work suggests that dual encoders work better, likely because they can deal better with the different nature of query and passage (length, style, syntax …).
Examples: REALM, DPR, Sentence-Transformers
Pros: Captures semantinc similarity instead of “word matches” (e.g. synonyms, related topics …)
Cons: Computationally more heavy, initial training of model
Results:
Answers to: What are efforts regarding digital experiences?
Rank | Answer |
---|---|
1 | eCommerce efforts and innovation |
2 | investments in eCommerce, technology, acquisitions, joint ventures, store remodels and other customer initiatives |
3 | Same Day Pickup and Same Day Delivery |
4 | social media, online advertising, and email |
5 | security of our digital platforms and keep them operating within acceptable parameters |
Answers to: What are the strategic priorities?
Rank | Answer |
---|---|
1 | improving our customer-facing initiatives in stores and clubs and creating a seamless omni-channel experience for our customers |
2 | Price transparency, assortment of products, customer experience, convenience, ease and the speed and cost of shipping |
3 | to make every day easier for busy families, operate with discipline, sharpen our culture and become digital, and make trust a competitive advantage |
4 | improving our customer-facing initiatives in stores and clubs and creating a seamless omni-channel experience for our customers |
5 | strategic capital allocation |
Answers to: What is the company’s growth?
Rank | Answer |
---|---|
1 | ticket and transaction growth |
2 | net sales |
3 | fiscal 2019 |
4 | $2.8 billion or 2.3% |
5 | 23% of our consolidated net sales |