Handling Big Data Challenges: A Case Study of AllFreeNovel.cc

Technical Challenges & Solutions

1. Data Ingestion Bottlenecks

Problem:

Daily ingestion of 50,000+ new chapters from multiple sources (CN/JP/KR), each arriving in a different format:

  • XML feeds from Korean publishers
  • JSON APIs from Chinese platforms
  • Raw text dumps from Japanese partners

Solution:

# Distributed ETL pipeline: normalize heterogeneous feeds, then publish to Kafka
class ChapterIngestor:
    def __init__(self, producer, schema_registry):
        self.producer = producer                # async Kafka producer (e.g. aiokafka)
        self.schema_registry = schema_registry  # maps a source format to its Avro schema
        self.kafka_topic = "raw-chapters"

    async def process(self, source):
        # Stream chunk by chunk so large dumps never sit fully in memory
        async for chunk in source.stream():
            normalized = await self._normalize(chunk)
            await self.producer.produce(
                self.kafka_topic,
                value=normalized,
                schema=self.schema_registry.get(source.format),
            )
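
The post doesn't show _normalize; as a minimal sketch, per-format adapters could collapse every feed into one common dict shape before it reaches Kafka. The function and field names below are illustrative assumptions, not the site's actual schema.

import json
import xml.etree.ElementTree as ET

# Hypothetical normalizer: every source format collapses into one dict
# shape. Field names are illustrative, not AllFreeNovel.cc's real schema.
def normalize_chunk(raw: bytes, fmt: str) -> dict:
    if fmt == "xml":    # Korean publisher feeds
        root = ET.fromstring(raw)
        return {"title": root.findtext("title"), "body": root.findtext("body")}
    if fmt == "json":   # Chinese platform APIs
        doc = json.loads(raw)
        return {"title": doc["title"], "body": doc["content"]}
    # Japanese raw text dumps: treat the first line as the title
    title, _, body = raw.decode("utf-8").partition("\n")
    return {"title": title.strip(), "body": body.strip()}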

2. Search Performance Optimization

Metrics Before Optimization:

  • 1,200 ms average query latency
  • 78% cache miss rate
  • 12-node Elasticsearch cluster at 85% load

Implemented Solutions:

  1. Hybrid Index Strategy

    • Hot data (latest chapters): In-memory RedisSearch
    • Warm data: Elasticsearch with custom tokenizer
    • Cold data: ClickHouse columnar storage
  2. Query Pipeline:

graph TD
    A[User Query] --> B{Query Type?}
    B -->|Simple| C[RedisSearch]
    B -->|Complex| D[Elasticsearch]
    B -->|Analytics| E[ClickHouse]
    C --> F[Result Blender]
    D --> F
    E --> F
    F --> G[Response]
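
As a rough sketch of that routing step (the classifier heuristic and the backend wrappers below are assumptions, not the site's actual code; each backend is a thin client over RedisSearch, Elasticsearch, or ClickHouse):

from enum import Enum, auto

class QueryType(Enum):
    SIMPLE = auto()      # exact title / latest-chapter lookups
    COMPLEX = auto()     # fuzzy full-text search
    ANALYTICS = auto()   # counts, trends, aggregations

# Hypothetical router matching the graph above: each backend object
# exposes a common search() method.
class QueryRouter:
    def __init__(self, redis_search, elasticsearch, clickhouse):
        self.backends = {
            QueryType.SIMPLE: redis_search,
            QueryType.COMPLEX: elasticsearch,
            QueryType.ANALYTICS: clickhouse,
        }

    def classify(self, query: str) -> QueryType:
        # Placeholder heuristic; a production system would inspect the
        # parsed query AST rather than sniffing the raw string.
        if query.startswith(("count:", "trend:")):
            return QueryType.ANALYTICS
        return QueryType.SIMPLE if len(query.split()) <= 2 else QueryType.COMPLEX

    def search(self, query: str):
        backend = self.backends[self.classify(query)]
        return backend.search(query)  # results then flow into the blender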

3. Real-time Recommendations

Challenge:

Generate personalized suggestions for 2M+ DAU with <100ms latency

ML Serving Architecture:

┌──────────────┐      ┌─────────────┐
│ Feature Store│◄─────│ Flink Jobs  │
└──────┬───────┘      └─────────────┘
       │
┌──────▼───────┐      ┌─────────────┐
│ Model Cache  │─────►│    ONNX     │
└──────┬───────┘      │   Runtime   │
       │              └─────────────┘
┌──────▼───────┐
│     User     │
│ Interactions │
└──────────────┘
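
On the serving side, the ONNX Runtime hot path can stay very small. A minimal sketch, assuming a pre-exported model file and a single feature-vector input (the path, input name, and shapes are illustrative):

import numpy as np
import onnxruntime as ort

# Load once at startup; InferenceSession.run() is safe to call concurrently.
session = ort.InferenceSession("recommender.onnx")  # illustrative path
input_name = session.get_inputs()[0].name

def score(user_features: np.ndarray) -> np.ndarray:
    # user_features: (batch, n_features) float32 vectors from the feature store
    outputs = session.run(None, {input_name: user_features.astype(np.float32)})
    return outputs[0]  # per-candidate scores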

Results:

  • P99 latency reduced from 2,200 ms to 89 ms
  • Recommendation CTR increased by 37%
  • Monthly infrastructure savings: $28,500

Key Takeaways

  1. Data Tiering is crucial for cost-performance balance
  2. Asynchronous Processing prevents pipeline backpressure
  3. Hybrid Indexing enables optimal query performance
  4. Model Optimization (ONNX conversion) dramatically improves ML serving; see the export sketch below
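
The conversion step itself is short in PyTorch. A minimal sketch with a placeholder model (the architecture, shapes, and file name are illustrative assumptions):

import torch

# Illustrative export: any trained torch.nn.Module can be serialized to ONNX.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)
model.eval()
dummy = torch.randn(1, 128)  # (batch, n_features); shapes are assumptions
torch.onnx.export(
    model, dummy, "recommender.onnx",
    input_names=["features"], output_names=["score"],
    dynamic_axes={"features": {0: "batch"}},  # allow variable batch size
)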