RAG with the OpenAI API!! (2) - Multiple files!! + HTML (Assistants File Search, one of the tool_call features)

2024.11.05 - [Data&AI/LLM] - RAG with only the OpenAI API!! (1) (Assistants File Search, one of the tool_call features)

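In the previous post, I tried out the file search feature of the OpenAI API on Tencent's business report PDF!

This time, let's take a closer look at what file_search can do!

0. Mission

This time, in addition to the Tencent business report from last time, we'll vectorize the Samsung Electronics and Tesla business reports together,

and then extract the information we need from among them!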

tencent_2024040801822.pdf (5.58MB)

tsla-20231231.html (2.60MB)

[삼성전자]사업보고서(2024.03.12).pdf (2.13MB)

※ Question!! PDF, and HTML too? Which file types can file search handle?

According to the official documentation, it supports a variety of file types, including pptx and docx, as listed below!

Supported files: For text/ MIME types, the encoding must be one of utf-8, utf-16, or ascii.

.c	text/x-c
.cpp	text/x-c++
.cs	text/x-csharp
.css	text/css
.doc	application/msword
.docx	application/vnd.openxmlformats-officedocument.wordprocessingml.document
.go	text/x-golang
.html	text/html
.java	text/x-java
.js	text/javascript
.json	application/json
.md	text/markdown
.pdf	application/pdf
.php	text/x-php
.pptx	application/vnd.openxmlformats-officedocument.presentationml.presentation
.py	text/x-python
.py	text/x-script.python
.rb	text/x-ruby
.sh	application/x-sh
.tex	text/x-tex
.ts	application/typescript
.txt	text/plain
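
Before uploading, it can help to check that every file has one of the extensions in the table above. The following is only a minimal sketch: the helper names `is_supported` and `split_by_support` are made up for this post, and the extension set is copied from the table, so it may not match the current official list.

```python
from pathlib import Path

# Extensions copied from the table above (assumption: the official list may change).
SUPPORTED_EXTENSIONS = {
    ".c", ".cpp", ".cs", ".css", ".doc", ".docx", ".go", ".html", ".java",
    ".js", ".json", ".md", ".pdf", ".php", ".pptx", ".py", ".rb", ".sh",
    ".tex", ".ts", ".txt",
}


def is_supported(path: str) -> bool:
    """Return True if the file's extension appears in the table above."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def split_by_support(paths):
    """Split paths into (supported, unsupported) lists, keeping their order."""
    supported = [p for p in paths if is_supported(p)]
    unsupported = [p for p in paths if not is_supported(p)]
    return supported, unsupported


if __name__ == "__main__":
    files = ["tencent_20240408.pdf", "tsla-20231231.html", "notes.xlsx"]
    ok, rejected = split_by_support(files)
    print("upload:", ok)
    print("skip:", rejected)
```

This only looks at the file name. For text files the encoding requirement (utf-8, utf-16, or ascii) still applies and is not checked here.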

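1. Defining the assistant and the user! (same as last time)

□ assistant

I defined the assistant as a financial analysis helper based on the gpt-4o model!

The important part here is declaring file_search in tools!

File search is a beta feature provided by OpenAI!

□ user1

- Think of user1 as a single user (thread) that will ask the assistant questions!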

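from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assistant with the file_search tool enabled
assistant = client.beta.assistants.create(
  name="Financial Analysis Assistant",
  instructions="You are an expert financial analyst. Answer questions about the financial statements based on the provided content.",
  model="gpt-4o",
  tools=[{"type": "file_search"}],
)

# One thread per user conversation
user1 = client.beta.threads.create()
user1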

2. Storing the documents as vectors on OpenAI!

Here we upload several files to OpenAI's storage and save them as a vector store!

# Create a vector store called "tencent business report Financial Statements"
vector_store = client.beta.vector_stores.create(name="tencent business report Financial Statements")

# Ready the files for upload to OpenAI
file_paths = ["tencent_20240408.pdf", "[삼성전자]사업보고서(2024.03.12).pdf", "tsla-20231231.html"]
file_streams = [open(path, "rb") for path in file_paths]

# Use the upload and poll SDK helper to upload the files, add them to the vector store,
# and poll the status of the file batch for completion.
file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
  vector_store_id=vector_store.id, files=file_streams
)

# You can print the status and the file counts of the batch to see the result of this operation.
print(file_batch.status)
print(file_batch.file_counts)

file_batch

You can then confirm that the data was stored under a unique ID.

3. Updating the assistant!!

Now we add the vector_store.id we just created to the assistant declared in step 1, as a tool resource for file_search!!

assistant = client.beta.assistants.update(
  assistant_id=assistant.id,
  tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

The result shows which options are applied to the search (defaults for now, but they can be tuned later)

and which vector store is attached!

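4. Entering the question prompt

Now!!

a. Ask a question as user1 in a message!

b. Run the user1 thread that holds the question together with the assistant that holds the vector store!!

c. Print the result!

For part a, let's try the following questions!!

What are the main products Samsung Electronics sells??
What are the main products Tesla sells?
What games does Tencent operate??
Which factory produces Tesla's Cybertruck???
Tell me Samsung Electronics' operating profit and EPS
Give me Tesla's revenue by year

The code for a, b, and c is as follows!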

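# a. Ask a question as user1
message = client.beta.threads.messages.create(
  thread_id=user1.id,
  role="user",
  content="{your question here}",
)
message

# b. Run the assistant (with its vector store) on the user1 thread and wait for completion
run = client.beta.threads.runs.create_and_poll(
  thread_id=user1.id,
  assistant_id=assistant.id,
  instructions="You are an expert financial analyst. Answer questions about the financial statements based on the provided content.",
)

# c. Print the latest message in the thread (the list is newest-first by default)
messages = client.beta.threads.messages.list(
  thread_id=user1.id
)
print(messages.data[0].content[0].text.value)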
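
The create_and_poll helper above waits until the run has finished before returning. To get a feel for what "poll" means here, the following is a rough sketch of such a loop, not the SDK's actual implementation. `poll_until_done` and `fetch_status` are hypothetical names, and the set of terminal statuses is my assumption based on the run lifecycle in the Assistants documentation.

```python
import time

# Statuses after which there is nothing left to wait for
# (assumption: taken from the Assistants run lifecycle).
TERMINAL_STATUSES = {
    "completed", "failed", "cancelled", "expired", "incomplete", "requires_action",
}


def poll_until_done(fetch_status, interval_s=1.0, timeout_s=60.0, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status, then return it.

    fetch_status is a hypothetical zero-argument callable that returns the
    current status string, e.g. by retrieving the run from the API.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"still '{status}' after {timeout_s} seconds")
        sleep(interval_s)
        waited += interval_s


if __name__ == "__main__":
    # Simulated run that needs two checks before it completes.
    fake_statuses = iter(["queued", "in_progress", "completed"])
    print(poll_until_done(lambda: next(fake_statuses), interval_s=0.01))
```

Since a run can also end as failed or expired, it is probably worth checking that run.status is "completed" before reading the messages.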

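1. Tell me the main products Samsung Electronics sells

It summarizes things very neatly!!

Samsung Electronics' main products are as follows:

1. **Finished products**: 
   - TVs, refrigerators, washing machines, air conditioners, smartphones, etc.【4:0†source】【4:7†source】.
   - Monitors, network systems, computers, etc.【4:8†source】.

2. **Semiconductor components**:
   - DRAM, NAND Flash, mobile AP【4:0†source】【4:8†source】.

3. **Smartphone components**:
   - OLED panels for smartphones【4:0†source】【4:9†source】.

4. **Harman-related products**:
   - Digital cockpits, car audio, portable speakers【4:0†source】【4:10†source】.

These products span Samsung Electronics' various business areas and hold an important position in the global market.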

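2. Tell me the main products Tesla sells

Tesla's main products are as follows:

1. **Automotive**:
   - Tesla currently manufactures five consumer vehicles: Model 3, Y, S, X, and Cybertruck【8:2†source】.
   - Early production and deliveries of the Tesla Semi, a commercial electric vehicle, have also begun【8:2†source】.

2. **Energy generation and storage**:
   - Lithium-ion battery energy storage products such as Powerwall and Megapack【8:2†source】.
   - Powerwall is used to store energy at homes or small commercial facilities, while Megapack is an energy storage solution for commercial, industrial, utility, and energy generation customers【8:3†source】.

3. **Solar energy offerings**:
   - Tesla sells retrofit solar energy systems to customers and channel partners, or provides them through power purchase agreements (PPAs)【8:3†source】.

Through these products, Tesla is active not only in the electric vehicle market but also in the energy storage and solar energy sectors.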

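3. What games does Tencent operate??

Some of the games Tencent operates are as follows:

1. **PUBG Mobile**: A popular mobile game that continues its great success through technology and design【48:1†source】.

2. **VALORANT**: A startup game and one of Tencent's popular titles【48:5†source】.

3. **Arena of Valor (Honor of Kings)**: One of Tencent's established, strong mobile games【48:5†source】.

4. **Clash of Clans** and **Clash Royale**: These two games developed by Supercell are also part of Tencent's game lineup and contribute to its revenue【48:5†source】.

Tencent has enormous influence in the global game industry and has a huge number of users around the world through its various games.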

4. Which factory produces Tesla's Cybertruck???

Tesla's Cybertruck is mainly produced at **Gigafactory Texas**【44:8†tsla-20231231.html】.
This factory is located in Texas, USA, and is one of Tesla's key manufacturing facilities, producing a variety of electric vehicle models.

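5. Tell me Samsung Electronics' operating profit and EPS

- **Samsung Electronics' operating profit**: Consolidated operating profit for 2023 was approximately KRW 6.567 trillion【36:0†source】.
  
- **Samsung Electronics' earnings per share (EPS)**: EPS for 2023 was KRW 2,131【36:15†source】.

This information is based on figures provided in Samsung Electronics' business report.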

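6. Give me Tesla's revenue by year

Tesla's recent revenue by year is as follows:

- **2023**: USD 96.77 billion【32:0†tsla-20231231.html】.

Revenue for each year can be provided based on this data. If you need exact figures for additional years, it is best to refer to Tesla's past financial reports.

Isn't the ability to pull out results like this amazing!!?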
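
The answers above are sprinkled with citation markers such as 【4:0†source】 and 【44:8†tsla-20231231.html】. If you want clean text, or a quick list of which sources were cited, a small regex helper is enough. This is just a sketch based on the marker shape seen in the outputs above; `strip_citations` and `cited_sources` are names I made up, and the API may format markers differently in other cases.

```python
import re

# Matches markers shaped like 【4:0†source】 or 【44:8†tsla-20231231.html】
# (assumption: pattern inferred from the outputs shown in this post).
CITATION_PATTERN = re.compile(r"【(\d+):(\d+)†([^】]*)】")


def strip_citations(text: str) -> str:
    """Return the text with all citation markers removed."""
    return CITATION_PATTERN.sub("", text)


def cited_sources(text: str):
    """Return the distinct source labels in order of first appearance."""
    seen = []
    for _, _, label in CITATION_PATTERN.findall(text):
        if label not in seen:
            seen.append(label)
    return seen


if __name__ == "__main__":
    answer = "The Cybertruck is produced at Gigafactory Texas【44:8†tsla-20231231.html】."
    print(strip_citations(answer))
    print(cited_sources(answer))
```

In this post some markers carry the file name and others only say "source", so the label alone does not always tell you which document was used.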

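+ Exploring additional options

Now!! Let's look at some additional options for these functions!!

1. How many files can be uploaded?

- You can upload up to 500 files, each no larger than 512MB and no more than 5 million tokens.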
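
The size and count limits can be checked locally before uploading. Below is a minimal sketch that uses the numbers quoted above; `check_upload_limits` is a hypothetical helper, 512MB is interpreted as 512 * 1024 * 1024 bytes (an assumption), and the 5 million token limit is not checked because that would need a tokenizer.

```python
import os

MAX_FILE_BYTES = 512 * 1024 * 1024  # 512MB per file, as quoted above (read as MiB here)
MAX_FILES = 500                     # number of files, as quoted above


def check_upload_limits(paths):
    """Return a list of problems found; an empty list means no limit was hit."""
    problems = []
    if len(paths) > MAX_FILES:
        problems.append(f"{len(paths)} files exceeds the limit of {MAX_FILES}")
    for path in paths:
        size = os.path.getsize(path)
        if size > MAX_FILE_BYTES:
            problems.append(f"{path} is {size} bytes, over the {MAX_FILE_BYTES} byte limit")
    return problems


if __name__ == "__main__":
    # Checks this script itself, just to have a file that certainly exists.
    print(check_upload_limits([__file__]))
```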

2. Storing data in a vector_store costs money!!

> The price is $0.10/GB/day of vector storage. That is not very expensive, but it adds up to a large cost over time!!

> You can set an expiration policy as follows!

vector_store = client.beta.vector_stores.create_and_poll(
  name="Product Documentation",
  file_ids=['file_1', 'file_2', 'file_3', 'file_4', 'file_5'],
  expires_after={
    "anchor": "last_active_at",
    "days": 7
  }
)

You can also configure this in the OpenAI dashboard!!
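
To see how the $0.10/GB/day price quoted above adds up, here is a tiny back-of-the-envelope calculator. It is only a sketch: `estimate_storage_cost` is a made-up helper, the `free_gb` parameter is a hypothetical knob in case your plan includes a free allowance, and the billed size is the size of the vector store, which is not necessarily the same as the raw file size.

```python
def estimate_storage_cost(size_gb: float, days: int,
                          price_per_gb_day: float = 0.10,
                          free_gb: float = 0.0) -> float:
    """Estimate vector storage cost in USD for a store kept for a number of days.

    price_per_gb_day defaults to the $0.10/GB/day figure quoted above.
    free_gb is a hypothetical free allowance; leave it at 0.0 if unsure.
    """
    billable_gb = max(size_gb - free_gb, 0.0)
    return billable_gb * price_per_gb_day * days


if __name__ == "__main__":
    # A 2 GB store left alone for 30 days versus expiring after 7 days.
    print(f"30 days: ${estimate_storage_cost(2.0, 30):.2f}")
    print(f" 7 days: ${estimate_storage_cost(2.0, 7):.2f}")
```

Under these assumptions a forgotten 2 GB store costs about $6 a month, which is why setting expires_after is a sensible habit.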

3. Listing the files stored in a vector_store!!

Later on, to see which files are stored in the vector_store you created, you can use code like this!

all_files = list(client.beta.vector_stores.files.list(vector_store_id = vector_store.id))
all_files

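This shows when each file was created, how large it is, and so on!

4. Setting the chunk size and overlap when storing to a vector_store!!

> Chunking is done automatically, but if you want to set the chunking strategy yourself, you can do it like this: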

# Ready the files for upload to OpenAI
file_paths = ["tencent_20240408.pdf", "[삼성전자]사업보고서(2024.03.12).pdf", "tsla-20231231.html"]
file_streams = [open(path, "rb") for path in file_paths]

# Chunking settings to apply when the files are uploaded
chunking_strategy = {
    "type": "static",
    "static": {
        "max_chunk_size_tokens": 800,  # maximum chunk size in tokens (example: 800)
        "chunk_overlap_tokens": 400,   # tokens shared between adjacent chunks (example: 400)
    },
}

# Use the upload and poll SDK helper to upload the files, add them to the vector store,
# and poll the status of the file batch for completion.
file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
    vector_store_id=vector_store.id,
    files=file_streams,
    chunking_strategy=chunking_strategy,  # apply the chunking settings
)

# You can print the status and the file counts of the batch to see the result of this operation.
print(file_batch.status)
print(file_batch.file_counts)

file_batch
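
To make the two chunking numbers more concrete, here is a small sketch of fixed-size chunking with overlap over a list of tokens. This is my own illustration of the general technique, not OpenAI's actual chunker. `chunk_tokens` and `validate_static_chunking` are hypothetical names, and the rule that the overlap must not exceed half the chunk size is how I understand the API's constraint.

```python
def validate_static_chunking(max_chunk_size_tokens: int, chunk_overlap_tokens: int) -> None:
    """Raise ValueError if the settings cannot produce a sensible sliding window."""
    if max_chunk_size_tokens <= 0:
        raise ValueError("max_chunk_size_tokens must be positive")
    if chunk_overlap_tokens < 0:
        raise ValueError("chunk_overlap_tokens must not be negative")
    # Assumption: the overlap may be at most half of the chunk size.
    if chunk_overlap_tokens * 2 > max_chunk_size_tokens:
        raise ValueError("chunk_overlap_tokens must not exceed half of max_chunk_size_tokens")


def chunk_tokens(tokens, max_chunk_size_tokens=800, chunk_overlap_tokens=400):
    """Split tokens into chunks where each chunk repeats the tail of the previous one."""
    validate_static_chunking(max_chunk_size_tokens, chunk_overlap_tokens)
    step = max_chunk_size_tokens - chunk_overlap_tokens  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_chunk_size_tokens])
        if start + max_chunk_size_tokens >= len(tokens):
            break  # this chunk already reached the end of the input
        start += step
    return chunks


if __name__ == "__main__":
    # Tiny numbers so the overlap is easy to see: size 4, overlap 2.
    for chunk in chunk_tokens(list(range(10)), max_chunk_size_tokens=4, chunk_overlap_tokens=2):
        print(chunk)
```

With 800 and 400, every token away from the edges of a document lands in two chunks, so a larger overlap means more chunks and more storage in exchange for less chance of cutting a sentence off from its context.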

□ Reference: https://platform.openai.com/docs/assistants/tools/file-search

□ Reference 2: OpenAI API details: https://platform.openai.com/docs/api-reference/files/retrieve

License: Attribution-NonCommercial-ShareAlike

일등박사의 연구소
