Supported Document Types¶
ChronoScope intelligently processes multiple document formats to extract timeline events.
Overview¶
ChronoScope uses AI-powered document classification to automatically detect document types and optimize extraction strategies. Understanding which document types work best helps you prepare files for optimal results.
Supported File Formats¶
Fully Supported Formats¶
| Format | Extension | Support Level | Best For |
|---|---|---|---|
| PDF | .pdf | ✅ Excellent | Resumes, CVs, reports |
| Text | .txt | ✅ Excellent | Plain text timelines |
| Word | .docx | ✅ Good | Microsoft Word documents |
| Markdown | .md | ✅ Good | Structured text files |
Limited Support¶
| Format | Extension | Support Level | Notes |
|---|---|---|---|
| RTF | .rtf | ⚠️ Limited | May need conversion |
| ODT | .odt | ⚠️ Limited | OpenDocument format |
| HTML | .html | ⚠️ Limited | Plain HTML only |
Not Supported¶
| Format | Extension | Why Not Supported | Workaround |
|---|---|---|---|
| Images | .jpg, .png, .gif | No text extraction | Use OCR tool first |
| Scanned PDFs | .pdf (image-based) | No selectable text | Run through OCR |
| PowerPoint | .pptx | Complex layout | Export as PDF |
| Excel | .xlsx | Structured data format | Export as CSV, then import |
PDF Processing (Dual-Library Approach)¶
How It Works¶
ChronoScope uses two PDF extraction libraries with automatic fallback:
- PyMuPDF (primary) - Fast, reliable, works for most PDFs
- pdfplumber (fallback) - Better for complex layouts and tables
Automatic selection:
User uploads PDF
↓
Try PyMuPDF extraction
↓
Success? → Return extracted text
↓
Failed? → Try pdfplumber
↓
Success? → Return extracted text
↓
Failed? → Show error (likely image-based PDF)
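The fallback flow above can be sketched in Python roughly as follows. This is a minimal illustration, not ChronoScope's actual code: the function names (`extract_pdf_text`, `extract_with_pymupdf`, `extract_with_pdfplumber`) and the decision to treat an empty result as a failure are assumptions. The library calls (`fitz.open`, `page.get_text`, `pdfplumber.open`, `page.extract_text`) are the usual PyMuPDF and pdfplumber entry points.

```python
from typing import Callable, Iterable, Optional


def extract_with_pymupdf(path: str) -> str:
    """Primary extractor: PyMuPDF (imported lazily so this module loads without it)."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pdfplumber(path: str) -> str:
    """Fallback extractor: pdfplumber, often better with tables and complex layouts."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() may return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_pdf_text(
    path: str,
    extractors: Optional[Iterable[Callable[[str], str]]] = None,
) -> str:
    """Try each extractor in order and return the first non-empty result."""
    if extractors is None:
        extractors = (extract_with_pymupdf, extract_with_pdfplumber)
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            continue  # assumption: any library error counts as "failed"
        if text.strip():
            return text
    raise ValueError(f"No selectable text found in {path!r}; it may be image-based.")
```

Passing the extractors as a parameter keeps the order explicit and makes the fallback logic easy to exercise without real PDF files.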
PDF Requirements¶
For best results, PDFs must have:
✅ Selectable text - You can highlight and copy text
✅ Structured layout - Clear sections and paragraphs
✅ Readable fonts - Standard fonts, not decorative
✅ No encryption - Unprotected files
Testing if your PDF will work:
- Open PDF in viewer (Preview, Adobe Reader, etc.)
- Try selecting text with cursor
- If you can copy-paste text → ✅ Will work!
- If text is not selectable → ❌ Need OCR first
Image-Based PDFs¶
If your PDF is a scanned image:
Option 1: Use OCR software
# Example with ocrmypdf (Mac/Linux)
brew install ocrmypdf
ocrmypdf input.pdf output.pdf
# Example with Adobe Acrobat (Windows/Mac)
File → Export To → Text (OCR if needed)
Option 2: Online OCR services
Document Type Detection¶
ChronoScope automatically classifies documents into types for optimized extraction.
Detected Types¶
1. Resume/CV¶
Characteristics:
- Section headers: "Experience", "Education", "Skills"
- Date ranges (Month Year - Month Year)
- Job titles and company names
- Bullet points with achievements
Extraction strategy:
- Focus on work experience and education sections
- Extract positions, companies, dates, locations
- Parse education degrees and institutions
- High confidence for explicit dates
Example structure:
PROFESSIONAL EXPERIENCE
Senior Software Engineer
TechCorp Inc., San Francisco, CA
June 2020 - Present
• Led development team of 5 engineers
• Built ML pipeline processing 1M+ events/day
2. Cover Letter¶
Characteristics:
- Narrative format (paragraphs)
- Mentions of specific experiences
- Date references within text
- Personal storytelling
Extraction strategy:
- Parse narrative for timeline references
- Extract mentioned companies/roles
- Infer dates from contextual clues
- Moderate confidence (dates often implicit)
Example structure:
During my three years at Google (2018-2021), I developed
a passion for machine learning. This led me to pursue
graduate studies at Stanford in 2021...
3. Personal Statement¶
Characteristics:
- Academic focus
- Research mentions
- Publication references
- Conference presentations
Extraction strategy:
- Extract research projects and publications
- Parse academic institutions and degrees
- Identify grants, awards, conferences
- High confidence for publications (explicit dates)
Example structure:
My research on neural networks, published in NeurIPS 2022,
built upon work I conducted during my PhD at MIT (2018-2022).
I presented preliminary findings at ICML 2021...
4. General Document¶
Characteristics:
- Unstructured text
- Mixed content types
- No clear timeline format
- Informal writing
Extraction strategy:
- Broad pattern matching for dates
- Extract any temporal references
- Lower confidence scores
- Manual review recommended
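To build intuition for how the characteristics above separate the four types, here is a deliberately simple keyword heuristic. It is not ChronoScope's classifier, which is described as AI-powered; the cue lists, the scoring weights, the threshold of 2, and the name `classify_document` are all assumptions made for illustration.

```python
import re

# Hypothetical cue lists, loosely based on the characteristics listed above.
CUES = {
    "resume": ["experience", "education", "skills"],
    "personal_statement": ["research", "published", "phd", "conference"],
    "cover_letter": ["dear", "sincerely", "i am writing", "passion", "led me to"],
}

# Explicit "Month Year - Month Year" (or "- Present") ranges.
DATE_RANGE = re.compile(
    r"\b[A-Z][a-z]+ \d{4}\s*-\s*(?:[A-Z][a-z]+ \d{4}|Present)\b"
)


def classify_document(text: str) -> str:
    """Return 'resume', 'cover_letter', 'personal_statement', or 'general'."""
    lowered = text.lower()
    scores = {
        label: sum(cue in lowered for cue in cues) for label, cues in CUES.items()
    }
    # Assumption: explicit date ranges are weighted as a strong resume signal.
    scores["resume"] += 2 * len(DATE_RANGE.findall(text))
    best = max(scores, key=scores.get)
    # Too little evidence for any type: fall back to the general bucket.
    return best if scores[best] >= 2 else "general"
```

A real classifier would weigh far more signals, but the sketch shows why explicit headers and date ranges make resumes the easiest type to recognize.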
Extraction Quality by Document Type¶
Resume/CV: ⭐⭐⭐⭐⭐ Excellent¶
Success rate: 90-95%
Why it works well:
- Explicit date formats
- Structured sections
- Clear temporal order
- Standardized formatting
Typical extraction:
- 10-20 events per page
- 85-95% confidence scores
- Complete date ranges
- Locations and people
Optimization tips:
- Use standard section headers
- Include month/year for dates
- Separate entries with spacing
- Add locations for each role
Cover Letter: ⭐⭐⭐⭐ Very Good¶
Success rate: 70-85%
Why it works:
- Narrative timeline structure
- Specific experience mentions
- Clear transitions between roles
Challenges:
- Dates often implicit ("three years ago")
- Locations may be missing
- Less structured format
Typical extraction:
- 5-10 events per document
- 70-85% confidence scores
- Some date ranges missing
- Fewer location details
Optimization tips:
- Mention explicit dates when possible
- Reference specific companies/institutions
- Use temporal markers ("In 2020...")
- Include locations in narrative
Personal Statement: ⭐⭐⭐⭐ Very Good¶
Success rate: 75-90%
Why it works:
- Academic rigor with dates
- Structured achievements
- Publication references
Challenges:
- Mixed chronological order
- Research spans overlap
- Complex temporal relationships
Typical extraction:
- 8-15 events per document
- 75-90% confidence scores
- Good date coverage
- Strong institutional references
Optimization tips:
- Include dates for all milestones
- Reference publications with years
- Mention conference dates
- List institutions and locations
General Document: ⭐⭐⭐ Good¶
Success rate: 50-70%
Why it's challenging:
- Unstructured content
- Implicit timelines
- Mixed topics
- Varied date formats
Typical extraction:
- 2-10 events per document
- 50-70% confidence scores
- Many date ranges missing
- Locations often absent
Optimization tips:
- Add explicit dates where possible
- Structure into sections
- Use clear headers
- Separate events with spacing
Document Preparation Best Practices¶
Date Formatting¶
Optimal formats:
✅ Excellent:
- January 2020 - March 2023
- Jan 2020 - Mar 2023
- 01/2020 - 03/2023
- 2020-01 to 2023-03
✅ Good:
- 2020 - 2023 (year only)
- Q1 2020 - Q4 2023
- Spring 2020 - Summer 2023
⚠️ Acceptable but less optimal:
- "Three years" (requires reference point)
- "Recently" (too vague)
- "2020s" (decade reference)
❌ Avoid:
- "a while ago"
- "during college"
- "around that time"
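The formats rated "Excellent" above are easy to parse mechanically, which is one plausible reason they extract well. The sketch below turns such a range into `(year, month)` pairs. It is an illustration only: `parse_range`, `parse_point`, and the `"present"` sentinel are hypothetical names, and ChronoScope's own date handling is not documented here.

```python
import re
from typing import Optional, Tuple, Union

MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]

YearMonth = Tuple[int, int]


def month_number(word: str) -> Optional[int]:
    """Map 'Jan', 'Sept', or 'January' to 1-12; return None for anything else."""
    word = word.lower()
    for number, name in enumerate(MONTH_NAMES, start=1):
        if len(word) >= 3 and name.startswith(word):
            return number
    return None


def parse_point(token: str) -> Optional[YearMonth]:
    """Parse one endpoint such as 'January 2020', '01/2020', or '2020-01'."""
    token = token.strip()
    m = re.fullmatch(r"([A-Za-z]{3,9})\.? (\d{4})", token)
    if m and month_number(m.group(1)):
        return int(m.group(2)), month_number(m.group(1))
    m = re.fullmatch(r"(\d{1,2})/(\d{4})", token)
    if m and 1 <= int(m.group(1)) <= 12:
        return int(m.group(2)), int(m.group(1))
    m = re.fullmatch(r"(\d{4})-(\d{2})", token)
    if m and 1 <= int(m.group(2)) <= 12:
        return int(m.group(1)), int(m.group(2))
    return None


def parse_range(text: str) -> Optional[Tuple[YearMonth, Union[YearMonth, str]]]:
    """Split on ' - ' or ' to ' and parse both endpoints; None if either fails."""
    parts = re.split(r"\s+(?:-|to)\s+", text.strip(), maxsplit=1)
    if len(parts) != 2:
        return None
    start = parse_point(parts[0])
    if parts[1].strip().lower() == "present":
        end: Union[YearMonth, str, None] = "present"
    else:
        end = parse_point(parts[1])
    if start is None or end is None:
        return None
    return start, end
```

Vague phrases such as "a while ago" return `None` here, which mirrors why they appear under "Avoid": there is nothing concrete to anchor an event to.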
Section Headers¶
Recommended headers for resumes:
Work Experience / Professional Experience / Employment History
Education / Academic Background
Projects / Key Projects
Publications / Research
Skills / Technical Skills
Certifications / Licenses
Awards / Honors / Achievements
Headers ChronoScope recognizes:
- Experience, Work, Employment, Career
- Education, Academic, School, University
- Projects, Portfolio, Work Samples
- Publications, Research, Papers
- Awards, Honors, Achievements, Recognition
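As a rough picture of how a header line could be matched against the recognized keywords above, consider the following sketch. The keywords come from that list; the category names, the function `categorize_header`, and the longest-keyword-first rule (so that "Work Samples" is not mistaken for "Work") are assumptions, not documented ChronoScope behavior.

```python
import re
from typing import Optional

# Keywords copied from the recognized-header list; category names are hypothetical.
HEADER_KEYWORDS = {
    "experience": ["experience", "work", "employment", "career"],
    "education": ["education", "academic", "school", "university"],
    "projects": ["projects", "portfolio", "work samples"],
    "publications": ["publications", "research", "papers"],
    "awards": ["awards", "honors", "achievements", "recognition"],
}

# Check longer keywords first so multi-word phrases win over their sub-words.
_LOOKUP = sorted(
    ((keyword, category)
     for category, keywords in HEADER_KEYWORDS.items()
     for keyword in keywords),
    key=lambda pair: len(pair[0]),
    reverse=True,
)


def categorize_header(line: str) -> Optional[str]:
    """Return the section category for a header line, or None if unrecognized."""
    header = line.strip().lower()
    for keyword, category in _LOOKUP:
        if re.search(rf"\b{re.escape(keyword)}\b", header):
            return category
    return None
```

Under this sketch a header such as "Skills" returns `None`, because it is not in the recognized list even though it is a recommended resume header.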
Content Structure¶
Example: Well-structured resume entry
Senior Software Engineer ← Job title (clear)
Google LLC, Mountain View, CA ← Company, Location
June 2020 - Present ← Dates (explicit)
• Led team of 5 engineers ← Bullet points
• Built ML pipeline handling 1M+ events ← Quantified achievements
• Deployed to production in 6 months ← Specific timeline
Example: Poorly-structured entry
Worked at Google for a while doing engineering stuff.
Was part of a team. Did some machine learning work.
File Size Limitations¶
Recommended Sizes¶
| File Type | Max Size | Optimal Size | Notes |
|---|---|---|---|
| PDF | 10 MB | < 2 MB | Large files may time out |
| TXT | 5 MB | < 500 KB | Plain text loads fastest |
| DOCX | 10 MB | < 3 MB | Complex formatting adds size |
| MD | 5 MB | < 1 MB | Markdown is lightweight |
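If you want to check files before uploading, the table above translates into a small lookup. This is a sketch under stated assumptions: 1 MB is taken as 1024 × 1024 bytes (the table does not say), and the function name `check_size` and its status strings are hypothetical.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: binary megabytes

# (max_bytes, optimal_bytes) per extension, taken from the table above.
SIZE_LIMITS = {
    ".pdf": (10 * MB, 2 * MB),
    ".txt": (5 * MB, 500 * 1024),
    ".docx": (10 * MB, 3 * MB),
    ".md": (5 * MB, 1 * MB),
}


def check_size(filename: str, size_bytes: int) -> str:
    """Classify a file as 'optimal', 'ok_but_slow', 'too_large', or 'unsupported'."""
    extension = Path(filename).suffix.lower()
    if extension not in SIZE_LIMITS:
        return "unsupported"
    max_bytes, optimal_bytes = SIZE_LIMITS[extension]
    if size_bytes > max_bytes:
        return "too_large"
    if size_bytes >= optimal_bytes:
        return "ok_but_slow"
    return "optimal"
```

For a file on disk, `check_size(path, Path(path).stat().st_size)` gives the same answer without reading the file's contents.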
Handling Large Files¶
If file > 10 MB:
- Compress the PDF
- Split into sections:
  - Upload work experience separately
  - Upload education separately
  - Merge in ChronoScope
- Remove images:
  - Use text-only export
  - Remove embedded images
  - Keep only textual content
Multi-Language Support¶
Supported Languages¶
| Language | Support Level | Notes |
|---|---|---|
| English | ✅ Full support | Primary language, best results |
| Spanish | ⚠️ Partial support | Date formats may vary |
| French | ⚠️ Partial support | Accented characters supported |
| German | ⚠️ Partial support | Works with standard dates |
| Chinese | ❌ Limited | Date formats differ significantly |
| Japanese | ❌ Limited | Complex date representations |
Best Practices for Non-English¶
For best results:
- Use ISO date format: YYYY-MM-DD
- Translate key sections: Experience, Education
- Keep location names in English: "Paris, France" rather than "Paris, Francia"
- Use English month names when possible: "Jan 2020" rather than "Janv 2020"
Example (mixed language):
Ingénieur Logiciel Senior ← French title (OK)
Google LLC, Paris, France ← English location (Good)
January 2020 - Present ← English dates (Best)
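If a document already uses French month names, one way to follow the advice above is to rewrite them before uploading. The sketch below is a pre-processing idea, not a ChronoScope feature: `anglicize_months` is a hypothetical helper, it covers French only, and it accepts abbreviations as any unambiguous prefix of at least three letters.

```python
import re

FRENCH_TO_ENGLISH = {
    "janvier": "January", "février": "February", "mars": "March",
    "avril": "April", "mai": "May", "juin": "June",
    "juillet": "July", "août": "August", "septembre": "September",
    "octobre": "October", "novembre": "November", "décembre": "December",
}

# A word (optionally followed by a period), whitespace, then a four-digit year.
_WORD_YEAR = re.compile(r"\b([^\W\d_]+)\.?\s+(\d{4})\b")


def anglicize_months(text: str) -> str:
    """Replace French month names or abbreviations before a year with English names."""

    def swap(match: "re.Match[str]") -> str:
        word = match.group(1).lower()
        hits = [
            english
            for french, english in FRENCH_TO_ENGLISH.items()
            if len(word) >= 3 and french.startswith(word)
        ]
        # Leave the text alone unless exactly one month matches.
        if len(hits) != 1:
            return match.group(0)
        return f"{hits[0]} {match.group(2)}"

    return _WORD_YEAR.sub(swap, text)
```

Ambiguous prefixes such as "Jui" (juin or juillet) are left unchanged instead of being guessed.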
Troubleshooting by Document Type¶
PDFs Not Extracting¶
PDF upload succeeds but no text extracted
Likely cause: Image-based PDF (scanned document)
Diagnosis:
1. Open PDF in viewer
2. Try selecting text
3. If can't select → Image-based
Fix:
1. Use OCR software (see PDF Processing)
2. Re-upload processed file
Low Confidence Scores¶
Events extracted but confidence < 70%
Likely causes:
- Implicit dates ("a few years ago")
- Unstructured narrative
- Missing key information
Fix:
1. Add explicit dates
2. Use structured format
3. Include locations and company names
4. Separate events clearly
Missing Events¶
Some obvious events not extracted
Likely causes:
- Unusual date format
- Non-standard section headers
- Merged entries (multiple jobs in one paragraph)
Fix:
1. Use standard date formats (Month Year - Month Year)
2. Use recognized section headers (Experience, Education)
3. Separate entries with clear spacing
4. One event per paragraph/bullet
Duplicate Events¶
Same event extracted multiple times
Likely cause: Event mentioned in multiple documents or sections
Fix:
1. Use duplicate detection: 🔍 Validation tab
2. Review similarity score
3. Merge or delete duplicates
4. Improve source document clarity
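To give a feel for what a similarity score between two events might look like, here is a small sketch using Python's standard `difflib`. How ChronoScope actually computes its score is not documented here, so the event fields (`title`, `start`), the 0.85 threshold, and the function names are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(
    events: List[Dict[str, str]], threshold: float = 0.85
) -> List[Tuple[int, int, float]]:
    """Return (i, j, score) for event pairs with the same start date and similar titles."""
    pairs = []
    for (i, first), (j, second) in combinations(enumerate(events), 2):
        if first["start"] != second["start"]:
            continue  # assumption: duplicates share a start date
        score = similarity(first["title"], second["title"])
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs
```

Pairs returned by a check like this are candidates for review, not automatic deletions: two genuinely different roles can share a start date and a similar title.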
Document Type Comparison¶
| Aspect | Resume | Cover Letter | Personal Statement | General |
|---|---|---|---|---|
| Structure | High | Medium | Medium | Low |
| Date Precision | High | Medium | Medium | Low |
| Location Info | High | Medium | High | Low |
| Event Count | High | Medium | Medium | Low |
| Confidence | 85-95% | 70-85% | 75-90% | 50-70% |
| Best Use | Career timeline | Narrative context | Academic history | Supplemental |
Recommendations by Use Case¶
Career Portfolio¶
Best document types:
- Professional resume (primary)
- LinkedIn export (supplementary)
- Cover letters (for narrative context)
Avoid:
- General documents
- Informal notes
Academic Timeline¶
Best document types:
- Academic CV (primary)
- Personal statements (secondary)
- Publication lists (supplementary)
Avoid:
- Non-academic documents
- Mixed-purpose files
Life Story¶
Best document types:
- Chronological life summary (create for this purpose)
- Personal statements
- Diary exports (if structured)
Avoid:
- Unstructured journal entries
- Stream-of-consciousness writing
Quick Reference¶
| Task | Recommendation |
|---|---|
| Upload resume | PDF or DOCX format, standard sections |
| Check if PDF works | Try selecting text in viewer |
| Convert scanned PDF | Use OCR software |
| Optimize dates | Use "Month Year - Month Year" format |
| Improve confidence | Add explicit dates and locations |
| Handle large files | Compress or split into sections |
| Non-English docs | Use ISO dates, English locations |
Pro Tip
Create a "Timeline-Optimized" version of your resume specifically for ChronoScope: explicit dates, clear sections, separated entries, complete locations. Keep your original resume as-is for job applications, but use the optimized version for timeline extraction.