Supported Document Types¶

ChronoScope extracts timeline events from several document formats.

Overview¶

ChronoScope uses AI-powered document classification to automatically detect document types and optimize extraction strategies. Understanding which document types work best helps you prepare files for optimal results.

Supported File Formats¶

Fully Supported Formats¶

Format	Extension	Support Level	Best For
PDF	`.pdf`	✅ Excellent	Resumes, CVs, reports
Text	`.txt`	✅ Excellent	Plain text timelines
Word	`.docx`	✅ Good	Microsoft Word documents
Markdown	`.md`	✅ Good	Structured text files

Limited Support¶

Format	Extension	Support Level	Notes
RTF	`.rtf`	⚠️ Limited	May need conversion
ODT	`.odt`	⚠️ Limited	OpenDocument format
HTML	`.html`	⚠️ Limited	Plain HTML only

Not Supported¶

Format	Extension	Why Not Supported	Workaround
Images	`.jpg`, `.png`, `.gif`	No text extraction	Use OCR tool first
Scanned PDFs	`.pdf` (image-based)	No selectable text	Run through OCR
PowerPoint	`.pptx`	Complex layout	Export as PDF
Excel	`.xlsx`	Structured data format	Export as CSV, then import

PDF Processing (Dual-Library Approach)¶

How It Works¶

ChronoScope uses two PDF extraction libraries with automatic fallback:

PyMuPDF (primary) - Fast, reliable, works for most PDFs
pdfplumber (fallback) - Better for complex layouts and tables

Automatic selection:

User uploads PDF
    ↓
Try PyMuPDF extraction
    ↓
Success? → Return extracted text
    ↓
Failed? → Try pdfplumber
    ↓
Success? → Return extracted text
    ↓
Failed? → Show error (likely image-based PDF)
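
The fallback flow above can be sketched in Python as follows. This is a minimal illustration, not ChronoScope's actual code: the function names are hypothetical, and treating empty output as a failure is an assumption, since the flow above does not define what "Failed?" means. The two library imports are deferred so the fallback logic can be exercised without either package installed.

```python
from typing import Callable, Sequence


def extract_with_pymupdf(path: str) -> str:
    """Extract text with PyMuPDF (imported lazily inside the function)."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pdfplumber(path: str) -> str:
    """Extract text with pdfplumber; extract_text() can return None for empty pages."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_pdf_text(
    path: str,
    extractors: Sequence[Callable[[str], str]] = (extract_with_pymupdf, extract_with_pdfplumber),
) -> str:
    """Try each extractor in order and return the first non-empty result."""
    errors = []
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # any library error triggers the next extractor
            errors.append(f"{extractor.__name__}: {exc}")
            continue
        if text.strip():
            return text
        errors.append(f"{extractor.__name__}: no text found")  # assumption: empty output counts as failure
    raise ValueError("No extractable text (likely an image-based PDF): " + "; ".join(errors))
```

Passing the extractors as a parameter keeps the order explicit and makes it easy to test the fallback with stand-in functions.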

PDF Requirements¶

For best results, PDFs must have:

✅ Selectable text - You can highlight and copy text
✅ Structured layout - Clear sections and paragraphs
✅ Readable fonts - Standard fonts, not decorative
✅ No encryption - Unprotected files

Testing if your PDF will work:

Open PDF in viewer (Preview, Adobe Reader, etc.)
Try selecting text with cursor
If you can copy-paste text → ✅ Will work!
If text is not selectable → ❌ Need OCR first

Image-Based PDFs¶

If your PDF is a scanned image:

Option 1: Use OCR software

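# Example with ocrmypdf (macOS with Homebrew; on Debian/Ubuntu use: sudo apt install ocrmypdf)
brew install ocrmypdf
ocrmypdf input.pdf output.pdf   # writes a copy with a searchable text layer

# Example with Adobe Acrobat (Windows/Mac), a menu path rather than a command:
# File → Export To → Text (OCR if needed)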

Option 2: Online OCR services

Document Type Detection¶

ChronoScope automatically classifies documents into types for optimized extraction.

Detected Types¶

1. Resume/CV¶

Characteristics:
- Section headers: "Experience", "Education", "Skills"
- Date ranges (Month Year - Month Year)
- Job titles and company names
- Bullet points with achievements

Extraction strategy:
- Focus on work experience and education sections
- Extract positions, companies, dates, locations
- Parse education degrees and institutions
- High confidence for explicit dates

Example structure:

PROFESSIONAL EXPERIENCE

Senior Software Engineer
TechCorp Inc., San Francisco, CA
June 2020 - Present
• Led development team of 5 engineers
• Built ML pipeline processing 1M+ events/day

2. Cover Letter¶

Characteristics:
- Narrative format (paragraphs)
- Mentions of specific experiences
- Date references within text
- Personal storytelling

Extraction strategy:
- Parse narrative for timeline references
- Extract mentioned companies/roles
- Infer dates from contextual clues
- Moderate confidence (dates often implicit)

Example structure:

During my three years at Google (2018-2021), I developed
a passion for machine learning. This led me to pursue
graduate studies at Stanford in 2021...

3. Personal Statement¶

Characteristics:
- Academic focus
- Research mentions
- Publication references
- Conference presentations

Extraction strategy:
- Extract research projects and publications
- Parse academic institutions and degrees
- Identify grants, awards, conferences
- High confidence for publications (explicit dates)

Example structure:

My research on neural networks, published in NeurIPS 2022,
built upon work I conducted during my PhD at MIT (2018-2022).
I presented preliminary findings at ICML 2021...

4. General Document¶

Characteristics:
- Unstructured text
- Mixed content types
- No clear timeline format
- Informal writing

Extraction strategy:
- Broad pattern matching for dates
- Extract any temporal references
- Lower confidence scores
- Manual review recommended
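
To make the four categories concrete, here is a deliberately simple keyword heuristic. It is a stand-in for illustration only: ChronoScope's classifier is described as AI-powered and its internals are not documented here. The cue lists and the `classify_document` name are hypothetical, and the cover-letter cues ("dear", "sincerely") are assumptions that do not come from the characteristics above.

```python
import re

# Hypothetical cue words per document type (illustrative, not ChronoScope's rules).
CUES = {
    "resume": ["experience", "education", "skills"],
    "personal_statement": ["research", "published", "phd", "conference", "publication"],
    "cover_letter": ["dear", "i am writing", "sincerely", "position"],
}


def classify_document(text: str, min_hits: int = 2) -> str:
    """Return the type with the most distinct cue hits, or 'general' if too few match."""
    lowered = text.lower()
    scores = {
        doc_type: sum(
            1 for cue in cues if re.search(rf"\b{re.escape(cue)}\b", lowered)
        )
        for doc_type, cues in CUES.items()
    }
    best_type, best_score = max(scores.items(), key=lambda item: item[1])
    return best_type if best_score >= min_hits else "general"
```

Falling back to "general" when few cues match mirrors the guidance above: unstructured text gets the broad strategy and a manual review.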

Extraction Quality by Document Type¶

Resume/CV: ⭐⭐⭐⭐⭐ Excellent¶

Success rate: 90-95%

Why it works well:
- Explicit date formats
- Structured sections
- Clear temporal order
- Standardized formatting

Typical extraction:
- 10-20 events per page
- 85-95% confidence scores
- Complete date ranges
- Locations and people

Optimization tips:
- Use standard section headers
- Include month/year for dates
- Separate entries with spacing
- Add locations for each role

Cover Letter: ⭐⭐⭐⭐ Very Good¶

Success rate: 70-85%

Why it works:
- Narrative timeline structure
- Specific experience mentions
- Clear transitions between roles

Challenges:
- Dates often implicit ("three years ago")
- Locations may be missing
- Less structured format

Typical extraction:
- 5-10 events per document
- 70-85% confidence scores
- Some date ranges missing
- Fewer location details

Optimization tips:
- Mention explicit dates when possible
- Reference specific companies/institutions
- Use temporal markers ("In 2020...")
- Include locations in narrative

Personal Statement: ⭐⭐⭐⭐ Very Good¶

Success rate: 75-90%

Why it works:
- Academic rigor with dates
- Structured achievements
- Publication references

Challenges:
- Mixed chronological order
- Research spans overlap
- Complex temporal relationships

Typical extraction:
- 8-15 events per document
- 75-90% confidence scores
- Good date coverage
- Strong institutional references

Optimization tips:
- Include dates for all milestones
- Reference publications with years
- Mention conference dates
- List institutions and locations

General Document: ⭐⭐⭐ Good¶

Success rate: 50-70%

Why it's challenging:
- Unstructured content
- Implicit timelines
- Mixed topics
- Varied date formats

Typical extraction:
- 2-10 events per document
- 50-70% confidence scores
- Many date ranges missing
- Locations often absent

Optimization tips:
- Add explicit dates where possible
- Structure into sections
- Use clear headers
- Separate events with spacing

Document Preparation Best Practices¶

Date Formatting¶

Optimal formats:

✅ Excellent:
- January 2020 - March 2023
- Jan 2020 - Mar 2023
- 01/2020 - 03/2023
- 2020-01 to 2023-03

✅ Good:
- 2020 - 2023 (year only)
- Q1 2020 - Q4 2023
- Spring 2020 - Summer 2023

⚠️ Acceptable but less optimal:
- "Three years" (requires reference point)
- "Recently" (too vague)
- "2020s" (decade reference)

❌ Avoid:
- "a while ago"
- "during college"
- "around that time"
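
The "Excellent" formats above are easy to parse mechanically, which is one plausible reason they extract well. The sketch below parses those four formats plus "Present" into (year, month) pairs. It is an illustration with hypothetical function names, not ChronoScope's parser, and it deliberately rejects the vague phrases in the "Avoid" list.

```python
import re
from typing import Optional, Tuple

MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]

YearMonth = Tuple[int, int]  # (year, month)


def month_number(word: str) -> Optional[int]:
    """Map 'jan', 'Jan.', or 'January' to 1; return None for anything else."""
    word = word.lower().rstrip(".")
    if len(word) < 3:
        return None
    for number, name in enumerate(MONTH_NAMES, start=1):
        if name.startswith(word):
            return number
    return None


def parse_point(token: str) -> Optional[YearMonth]:
    """Parse one endpoint such as 'January 2020', '01/2020', or '2020-01'."""
    token = token.strip()
    m = re.fullmatch(r"([A-Za-z]+\.?)\s+(\d{4})", token)
    if m:
        month = month_number(m.group(1))
        return (int(m.group(2)), month) if month else None
    m = re.fullmatch(r"(\d{1,2})/(\d{4})", token)
    if m and 1 <= int(m.group(1)) <= 12:
        return int(m.group(2)), int(m.group(1))
    m = re.fullmatch(r"(\d{4})-(\d{2})", token)
    if m and 1 <= int(m.group(2)) <= 12:
        return int(m.group(1)), int(m.group(2))
    return None


def parse_range(text: str) -> Optional[Tuple[YearMonth, Optional[YearMonth]]]:
    """Parse 'START - END' or 'START to END'; an end of None means 'Present'."""
    parts = re.split(r"\s+(?:-|–|to)\s+", text.strip(), maxsplit=1)
    if len(parts) != 2:
        return None
    start = parse_point(parts[0])
    if start is None:
        return None
    if parts[1].strip().lower() == "present":
        return start, None
    end = parse_point(parts[1])
    return (start, end) if end else None
```

Because the separator must be surrounded by whitespace, the hyphen inside `2020-01` is not mistaken for a range separator.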

Section Headers¶

Recommended headers for resumes:

Work Experience / Professional Experience / Employment History
Education / Academic Background
Projects / Key Projects
Publications / Research
Skills / Technical Skills
Certifications / Licenses
Awards / Honors / Achievements

Headers ChronoScope recognizes:
- Experience, Work, Employment, Career
- Education, Academic, School, University
- Projects, Portfolio, Work Samples
- Publications, Research, Papers
- Awards, Honors, Achievements, Recognition
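
One way to picture header recognition is a keyword lookup that maps a header line to a canonical section. The sketch below uses the keyword groups listed above. The section names, the `canonical_section` function, and the longest-keyword-first rule are assumptions made for illustration; the rule exists so that "Work Samples" maps to projects rather than to experience via "Work".

```python
import re
from typing import Optional

# Keyword groups taken from the list above; the canonical names are hypothetical.
SECTION_KEYWORDS = {
    "experience": ["experience", "work", "employment", "career"],
    "education": ["education", "academic", "school", "university"],
    "projects": ["projects", "portfolio", "work samples"],
    "publications": ["publications", "research", "papers"],
    "awards": ["awards", "honors", "achievements", "recognition"],
}

# Check longer keywords first so multi-word phrases win over their single words.
_KEYWORDS = sorted(
    ((kw, section) for section, kws in SECTION_KEYWORDS.items() for kw in kws),
    key=lambda pair: len(pair[0]),
    reverse=True,
)


def canonical_section(header: str) -> Optional[str]:
    """Return the canonical section for a header line, or None if unrecognized."""
    lowered = header.lower()
    for keyword, section in _KEYWORDS:
        if re.search(rf"\b{re.escape(keyword)}\b", lowered):
            return section
    return None
```

A header that returns `None` here, such as "Hobbies", is the kind of non-standard header that the troubleshooting section associates with missing events.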

Content Structure¶

Example: Well-structured resume entry

Senior Software Engineer                    ← Job title (clear)
Google LLC, Mountain View, CA              ← Company, Location
June 2020 - Present                        ← Dates (explicit)

• Led team of 5 engineers                  ← Bullet points
• Built ML pipeline handling 1M+ events    ← Quantified achievements
• Deployed to production in 6 months       ← Specific timeline

Example: Poorly-structured entry

Worked at Google for a while doing engineering stuff.
Was part of a team. Did some machine learning work.

File Size Limitations¶

Recommended Sizes¶

File Type	Max Size	Optimal Size	Notes
PDF	10 MB	< 2 MB	Large files may timeout
TXT	5 MB	< 500 KB	Plain text loads fastest
DOCX	10 MB	< 3 MB	Complex formatting adds size
MD	5 MB	< 1 MB	Markdown is lightweight
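
The limits in the table can be checked before upload with a few lines of Python. This is a sketch under assumptions: the status labels and the `size_status` name are hypothetical, and 1 MB is taken as 1024 × 1024 bytes because the table does not say which convention applies.

```python
from pathlib import Path

KB = 1024
MB = 1024 * KB  # assumption: binary megabytes

# extension -> (maximum bytes, optimal-size bytes), copied from the table above
LIMITS = {
    ".pdf": (10 * MB, 2 * MB),
    ".txt": (5 * MB, 500 * KB),
    ".docx": (10 * MB, 3 * MB),
    ".md": (5 * MB, 1 * MB),
}


def size_status(filename: str, size_bytes: int) -> str:
    """Classify a file as 'optimal', 'large', 'too_large', or 'unsupported'."""
    ext = Path(filename).suffix.lower()
    if ext not in LIMITS:
        return "unsupported"
    max_bytes, optimal_bytes = LIMITS[ext]
    if size_bytes > max_bytes:
        return "too_large"
    if size_bytes >= optimal_bytes:
        return "large"  # accepted, but may be slow or time out
    return "optimal"
```

For a real file, `Path(filename).stat().st_size` supplies the byte count.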

Handling Large Files¶

If a file is larger than 10 MB:

1. Compress the PDF:

# Using Ghostscript
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
   -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile=output.pdf input.pdf

2. Split into sections:
- Upload work experience separately
- Upload education separately
- Merge in ChronoScope

3. Remove images:
- Use text-only export
- Remove embedded images
- Keep only textual content

Multi-Language Support¶

Supported Languages¶

Language	Support Level	Notes
English	✅ Full support	Primary language, best results
Spanish	⚠️ Partial support	Date formats may vary
French	⚠️ Partial support	Accented characters supported
German	⚠️ Partial support	Works with standard dates
Chinese	❌ Limited	Date formats differ significantly
Japanese	❌ Limited	Complex date representations

Best Practices for Non-English¶

For best results:

Use ISO date format: YYYY-MM-DD
Translate key sections: Experience, Education
Keep location names in English: "Paris, France" vs "Paris, Francia"
Use English month names when possible: "Jan 2020" vs "Janv 2020"

Example (mixed language):

Ingénieur Logiciel Senior          ← French title (OK)
Google LLC, Paris, France           ← English location (Good)
January 2020 - Present              ← English dates (Best)
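
If rewriting dates by hand is impractical, month names can be normalized before upload. The sketch below converts French month-and-year strings such as "Janv 2020" to the ISO form recommended above. It is a hypothetical preprocessing helper, not a ChronoScope feature, and it returns `None` for ambiguous abbreviations rather than guessing.

```python
import re
import unicodedata
from typing import Optional

FRENCH_MONTHS = ["janvier", "février", "mars", "avril", "mai", "juin",
                 "juillet", "août", "septembre", "octobre", "novembre", "décembre"]


def _strip_accents(text: str) -> str:
    """Remove diacritics so 'fév' and 'fev' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def french_month_to_iso(text: str) -> Optional[str]:
    """Convert e.g. 'Janv 2020' or 'février 2021' to 'YYYY-MM'; None if unclear."""
    text = unicodedata.normalize("NFC", text)
    m = re.fullmatch(r"\s*([^\W\d_]+)\.?\s+(\d{4})\s*", text)
    if not m:
        return None
    word = _strip_accents(m.group(1).lower())
    if len(word) < 3:
        return None
    matches = [
        number
        for number, name in enumerate(FRENCH_MONTHS, start=1)
        if _strip_accents(name).startswith(word)
    ]
    if len(matches) != 1:  # e.g. 'jui' could be juin or juillet
        return None
    return f"{m.group(2)}-{matches[0]:02d}"
```

The same pattern extends to other languages by swapping the month list.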

Troubleshooting by Document Type¶

PDFs Not Extracting¶

Symptom: PDF upload succeeds but no text is extracted.

Likely cause: Image-based PDF (scanned document)

Diagnosis:
1. Open the PDF in a viewer
2. Try selecting text
3. If you can't select text, the PDF is image-based

Fix:
1. Use OCR software (see PDF Processing)
2. Re-upload the processed file

Low Confidence Scores¶

Symptom: Events are extracted but confidence is below 70%.

Likely causes:
- Implicit dates ("a few years ago")
- Unstructured narrative
- Missing key information

Fix:
1. Add explicit dates
2. Use a structured format
3. Include locations and company names
4. Separate events clearly

Missing Events¶

Symptom: Some obvious events are not extracted.

Likely causes:
- Unusual date format
- Non-standard section headers
- Merged entries (multiple jobs in one paragraph)

Fix:
1. Use standard date formats (Month Year - Month Year)
2. Use recognized section headers (Experience, Education)
3. Separate entries with clear spacing
4. Keep one event per paragraph or bullet

Duplicate Events¶

Symptom: The same event is extracted multiple times.

Likely cause: Event mentioned in multiple documents or sections

Fix:
1. Use duplicate detection in the 🔍 Validation tab
2. Review the similarity score
3. Merge or delete duplicates
4. Improve source document clarity
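
For intuition about what a similarity score measures, the sketch below flags likely duplicates by comparing event titles with Python's standard `difflib`. How ChronoScope computes its score is not documented here, so the event shape (`title` and `start` keys), the 0.85 threshold, and the same-start-date rule are all assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def find_duplicates(
    events: List[Dict[str, str]], threshold: float = 0.85
) -> List[Tuple[int, int, float]]:
    """Return (index_a, index_b, score) for same-start events with similar titles."""
    pairs = []
    for (i, first), (j, second) in combinations(enumerate(events), 2):
        if first["start"] != second["start"]:
            continue  # assumption: duplicates share a start date
        score = similarity(first["title"], second["title"])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


events = [
    {"title": "Senior Software Engineer at TechCorp", "start": "2020-06"},
    {"title": "Senior Software Engineer at TechCorp Inc.", "start": "2020-06"},
    {"title": "PhD at MIT", "start": "2018-09"},
]
duplicates = find_duplicates(events)  # flags the first two entries as a likely pair
```

A flagged pair is a candidate for review, not an automatic merge: two different roles can share a title and a start month.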

Document Type Comparison¶

Aspect	Resume	Cover Letter	Personal Statement	General
Structure	High	Medium	Medium	Low
Date Precision	High	Medium	Medium	Low
Location Info	High	Medium	High	Low
Event Count	High	Medium	Medium	Low
Confidence	85-95%	70-85%	75-90%	50-70%
Best Use	Career timeline	Narrative context	Academic history	Supplemental

Recommendations by Use Case¶

Career Portfolio¶

Best document types:
- Professional resume (primary)
- LinkedIn export (supplementary)
- Cover letters (for narrative context)

Avoid:
- General documents
- Informal notes

Academic Timeline¶

Best document types:
- Academic CV (primary)
- Personal statements (secondary)
- Publication lists (supplementary)

Avoid:
- Non-academic documents
- Mixed-purpose files

Life Story¶

Best document types:
- Chronological life summary (create for this purpose)
- Personal statements
- Diary exports (if structured)

Avoid:
- Unstructured journal entries
- Stream-of-consciousness writing

Quick Reference¶

Task	Recommendation
Upload resume	PDF or DOCX format, standard sections
Check if PDF works	Try selecting text in viewer
Convert scanned PDF	Use OCR software
Optimize dates	Use "Month Year - Month Year" format
Improve confidence	Add explicit dates and locations
Handle large files	Compress or split into sections
Non-English docs	Use ISO dates, English locations

Pro Tip

Create a "Timeline-Optimized" version of your resume specifically for ChronoScope: explicit dates, clear sections, separated entries, complete locations. Keep your original resume as-is for job applications, but use the optimized version for timeline extraction.
